Table of Contents
cs.CL
[1] Test-time Scaling of LLMs: A Survey from A Subproblem Structure Perspective
Zhuoyi Yang,Xu Guo,Tong Zhang,Huijuan Xu,Boyang Li
Main category: cs.CL
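TL;DR: This paper surveys techniques that improve the predictive accuracy of pretrained large language models by allocating additional compute at inference time. It focuses on how problems are decomposed and on the topology of the resulting subproblems, and it unifies methods such as Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought.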
Details
Motivation: To improve the predictive accuracy of pretrained large language models at inference time and to explore how additional compute can be used effectively.
Method: Categorizes test-time scaling methods with emphasis on how a problem is decomposed into subproblems and on their topological organization (sequential, parallel, or tree-structured), and uses this view to unify a range of existing methods.
Result: Brings diverse methods such as Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought into a unified framework and summarizes the strengths and limitations of these techniques.
Conclusion: Offers a systematic perspective for understanding test-time scaling methods and points to promising directions for future research.
Abstract: With this paper, we survey techniques for improving the predictive accuracy of pretrained large language models by allocating additional compute at inference time. In categorizing test-time scaling methods, we place special emphasis on how a problem is decomposed into subproblems and on the topological organization of these subproblems, whether sequential, parallel, or tree-structured. This perspective allows us to unify diverse approaches such as Chain-of-Thought, Branch-Solve-Merge, and Tree-of-Thought under a common lens. We further synthesize existing analyses of these techniques, highlighting their respective strengths and weaknesses, and conclude by outlining promising directions for future research.
[2] Temporal Predictors of Outcome in Reasoning Language Models
Joey David
Main category: cs.CL
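TL;DR: The study shows that once a large language model has generated the first few tokens of its chain-of-thought (CoT) reasoning, a linear classifier can predict the correctness of its final answer with high accuracy, suggesting that the model internally commits to an outcome early.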
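As a rough illustration of the probing idea, the sketch below fits a logistic-regression probe to synthetic vectors that stand in for hidden states after the first t reasoning tokens. Everything here is an assumption made for illustration: the data is randomly generated, and the function names, dimensionality, and training settings are hypothetical rather than taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(states, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(states[0])
    n = len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probe_accuracy(states, labels, w, b):
    """Fraction of examples whose predicted label matches the true label."""
    hits = 0
    for x, y in zip(states, labels):
        predicted = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        hits += int(predicted == bool(y))
    return hits / len(labels)


# Synthetic stand-in for hidden states (an assumption, not real model data):
# label 1 ("eventually correct") shifts the state along a fixed direction.
random.seed(0)
DIM = 8


def fake_state(label):
    sign = 1.0 if label else -1.0
    return [random.gauss(0.0, 1.0) + sign for _ in range(DIM)]


labels = [i % 2 for i in range(400)]
states = [fake_state(y) for y in labels]
w, b = train_linear_probe(states[:300], labels[:300])
accuracy = probe_accuracy(states[300:], labels[300:], w, b)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real models, the synthetic `states` would be replaced by hidden states extracted at a chosen token position t, and the probe would be retrained for each t to trace how predictability changes over the reasoning trace.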
Details
Motivation: To investigate when, during chain-of-thought reasoning, a large language model internally commits to its final answer, in order to understand its reasoning dynamics and interpretability.
Method: Trains linear classifiers on hidden states after the first t reasoning tokens to predict final-answer correctness, and analyzes how performance varies with question difficulty.
Result: Eventual correctness can be accurately predicted after only a few reasoning tokens; for harder questions, a selection artifact in long reasoning chains leads to a drop in predictive accuracy.
Conclusion: LLMs form an internal assessment of the outcome early in reasoning, which has implications for interpretability and for inference-time control.
Abstract: The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning, gradually refining the model's latent representation of a solution. However, it remains unclear just how early a Large Language Model (LLM) internally commits to an eventual outcome. We probe this by training linear classifiers on hidden states after the first t reasoning tokens, showing that eventual correctness is highly predictable after only a few tokens, even when longer outputs are needed to reach a definite answer. We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact: hard items are disproportionately represented in long CoTs. Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens, with implications for interpretability and for inference-time control.
[3] LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs
Pei-Fu Guo,Yun-Da Tsai,Chun-Chia Hsu,Kai-Xin Chen,Ya-An Tsai,Kai-Wei Chang,Nanyun Peng,Mi-Yen Yeh,Shou-De Lin
Main category: cs.CL
TL;DR: Proposes LiveCLKTBench, an automated generation pipeline for isolating and measuring cross-lingual knowledge transfer in large language models.
Details
Motivation: Evaluating cross-lingual knowledge transfer in LLMs is challenging because correct answers in a target language may come either from genuine transfer or from prior exposure during pre-training.
Method: Identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them by temporal occurrence, and verifies them against the model's knowledge; documents for these entities are then used to generate factual questions, which are translated into multiple languages for evaluation.
Result: Evaluating several LLMs across five languages shows that cross-lingual transfer is strongly influenced by linguistic distance and is often asymmetric across language directions; larger models improve transfer, but the gains diminish with scale and vary across domains.
Conclusion: LiveCLKTBench provides a reliable benchmark for studying cross-lingual knowledge transfer and reveals how linguistic distance, model scale, and domain affect transfer.
Abstract: Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure during pre-training. We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model's knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. While larger models improve transfer, the gains diminish with scale and vary across domains. These findings provide new insights into multilingual transfer and demonstrate the value of LiveCLKTBench as a reliable benchmark for future research.
[4] COMPASS: Context-Modulated PID Attention Steering System for Hallucination Mitigation
Snigdha Pandya,Rohan Nagale,Kenji Sahay,Anna Lin,Shikhar Shiromani,Kevin Zhu,Dev Sunishchal
Main category: cs.CL
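TL;DR: This paper proposes COMPASS, a lightweight and interpretable control framework that reduces contextual hallucination in large language models by embedding a model-based feedback loop in decoding. It uses the Context Reliance Score (CRS) to dynamically modulate attention heads, improving the factual consistency of generated text.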
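For readers unfamiliar with PID control, the sketch below shows a textbook discrete PID loop driving a scalar score toward a target. It is only an illustration of the control idea: the gains, the target value, and the toy "plant" in which the score moves by exactly the control signal are all assumptions, and nothing here reflects how COMPASS actually computes CRS or applies the signal to attention heads.

```python
class PIDController:
    """Textbook discrete PID controller (not the COMPASS implementation)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        # No derivative term on the first step, when there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps=50, target=0.8, start=0.2):
    """Toy plant (an assumption): the score shifts by the control signal each step."""
    pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=target)
    score = start
    history = [score]
    for _ in range(steps):
        control = pid.update(score)
        score = min(1.0, max(0.0, score + control))  # keep the score in [0, 1]
        history.append(score)
    return history


history = simulate()
print(f"start={history[0]:.2f}, end={history[-1]:.4f}")
```

In a decoding setting, the measurement would be a per-step context-reliance signal and the control output would scale selected attention heads; how that mapping is defined is left open here.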
Details
Motivation: LLMs often generate fluent but factually incorrect content even when relevant evidence is available, a failure rooted in how attention is allocated between contextual and parametric knowledge. Understanding and steering this internal behavior is essential for reliable deployment and for scientific interpretation of model mechanisms.
Method: Proposes the COMPASS framework, which introduces the Context Reliance Score (CRS) as an interpretable signal and uses a PID controller to dynamically modulate attention heads during decoding, strengthening reliance on evidence without retraining or multi-pass decoding.
Result: On HotpotQA, XSum, HaluEval, and RAGTruth, COMPASS reduces contextual hallucination rates by 2.8 to 5.8 percentage points (absolute) and reveals how different attention heads contribute to evidence alignment.
Conclusion: Feedback-driven interpretability offers an effective path to understanding and steering LLM behavior, and COMPASS shows the potential to improve factual consistency without training.
Abstract: Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between contextual and parametric knowledge. Understanding and steering this internal behavior is key both for trustworthy deployment and for scientific interpretability of model mechanisms. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly within decoding. COMPASS quantifies context reliance via a transparent metric, the Context Reliance Score (CRS), which serves as an online probe of how attention heads ground generation in evidence. Using this interpretable signal, a PID controller dynamically modulates attention heads to maintain factual consistency without retraining or multi-pass decoding. Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates (2.8 to 5.8 percent absolute) while revealing how distinct attention heads contribute to evidence alignment. These results highlight feedback-driven interpretability as a pathway toward scientific understanding of LLM behavior.
[5] The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech
Julio Cesar Galdino,Sidney Evaldo Leal,Leticia Gabriella De Souza,Rodrigo de Freitas Lima,Antonio Nelson Fornari Mendes Moreira,Arnaldo Candido Junior,Miguel Oliveira,Edresson Casanova,Sandra M. Aluísio
Main category: cs.CL
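TL;DR: This study examines how manual and automatic prosodic segmentation annotations affect the quality of synthesized spontaneous Brazilian Portuguese speech, using FastSpeech 2. The results show that including prosodic segmentation improves the naturalness and intelligibility of the synthesized speech, and that manual annotation performs better because it introduces more variability.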
Details
Motivation: Pauses, turn-taking, and disfluencies in spontaneous speech pose challenges for speech synthesis; existing systems mostly model prosodic features implicitly, while the construction of datasets with explicit prosodic segmentation and their impact remain unclear.
Method: Uses the non-autoregressive model FastSpeech 2 to compare synthesis quality when training with manual versus automatic prosodic segmentation annotations, and analyzes nuclear accent and pre-nuclear prosodic patterns in neutral declarative utterances.
Result: Training with prosodic segmentation slightly improves intelligibility and acoustic naturalness; manual annotation provides greater variability and yields prosody closer to natural speech; both approaches reproduce the expected nuclear accent pattern, but manual annotation does better on pre-nuclear contours.
Conclusion: Explicit prosodic segmentation helps improve spontaneous speech synthesis; manual annotation is time-consuming but yields more natural prosody, and all data and models are publicly released to support further research.
Abstract: Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Although speech synthesis systems have made significant progress in generating natural and intelligible speech, primarily through architectures that implicitly model prosodic features such as pitch, intensity, and duration, the construction of datasets with explicit prosodic segmentation and their impact on spontaneous speech synthesis remains largely unexplored. This paper evaluates the effects of manual and automatic prosodic segmentation annotations in Brazilian Portuguese on the quality of speech synthesized by a non-autoregressive model, FastSpeech 2. Experimental results show that training with prosodic segmentation produced slightly more intelligible and acoustically natural speech. While automatic segmentation tends to create more regular segments, manual prosodic segmentation introduces greater variability, which contributes to more natural prosody. Analysis of neutral declarative utterances showed that both training approaches reproduced the expected nuclear accent pattern, but the prosodic model aligned more closely with natural pre-nuclear contours. To support reproducibility and future research, all datasets, source codes, and trained models are publicly available under the CC BY-NC-ND 4.0 license.
[6] Human or LLM as Standardized Patients? A Comparative Study for Medical Education
Bingquan Zhang,Xiaoxiao Liu,Yuchi Wang,Lei Zhou,Qianqian Xie,Benyou Wang
Main category: cs.CL
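TL;DR: EasyMED is a standardized-patient simulation system built on a multi-agent framework. Through the cooperation of a Patient Agent, an Auxiliary Agent, and an Evaluation Agent, it achieves learning outcomes comparable to human standardized patients in clinical skills training while offering better flexibility, psychological safety, and cost efficiency.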
Details
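Motivation: Existing LLM-based standardized patient simulators cost less but behave inconsistently and lack rigorous comparison with human standardized patients, so a more stable, assessable, and efficient solution is needed.
Method: Proposes the EasyMED multi-agent framework, comprising a Patient Agent for realistic dialogue, an Auxiliary Agent that ensures factual consistency, and an Evaluation Agent that provides actionable feedback; also builds the SPBench benchmark, covering real SP-doctor interactions across 14 specialties and eight expert-defined evaluation criteria.
Result: Experiments show that EasyMED matches human standardized patients in learning outcomes, produces greater skill gains for lower-baseline students, and outperforms traditional approaches in flexibility, psychological safety, and cost efficiency.
Conclusion: EasyMED offers a scalable, efficient, and reliable alternative to standardized patients for clinical skills training, with potential for wide use in medical education.
Abstract: Standardized Patients (SP) are indispensable for clinical skills training but remain expensive, inflexible, and difficult to scale. Existing large-language-model (LLM)-based SP simulators promise lower cost yet show inconsistent behavior and lack rigorous comparison with human SP. We present EasyMED, a multi-agent framework combining a Patient Agent for realistic dialogue, an Auxiliary Agent for factual consistency, and an Evaluation Agent that delivers actionable feedback. To support systematic assessment, we introduce SPBench, a benchmark of real SP-doctor interactions spanning 14 specialties and eight expert-defined evaluation criteria. Experiments demonstrate that EasyMED matches human SP learning outcomes while producing greater skill gains for lower-baseline students and offering improved flexibility, psychological safety, and cost efficiency.
[7] Opinion Mining and Analysis Using Hybrid Deep Neural Networks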
Adel Hidri,Suleiman Ali Alsaif,Muteeb Alahmari,Eman AlShehri,Minyar Sassi Hidri
Main category: cs.CL
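TL;DR: Proposes a hybrid deep neural network model combining BGRU and LSTM (HBGRU-LSTM) to improve sentiment analysis, with particular strength in handling contextual nuance, scalability, and class imbalance.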
Details
Motivation: Existing methods are limited in handling contextual complexity, scalability, and class imbalance in text sentiment analysis; traditional machine learning and deep learning models struggle to fully capture semantic relationships and to generalize well.
Method: Proposes the hybrid deep neural network HBGRU-LSTM, which combines a bidirectional gated recurrent unit (BGRU) with long short-term memory (LSTM), and validates it on benchmark datasets including IMDB movie reviews and Amazon product reviews.
Result: The model reaches 95% test accuracy, outperforming LSTM (93.06%), CNN+LSTM (93.31%), and GRU+LSTM (92.20%); recall for negative sentiment rises from 86% (unbalanced data) to 96% (balanced data), and misclassification loss falls from 20.24% to 13.3%.
Conclusion: HBGRU-LSTM shows stronger generalization and robustness in sentiment analysis, with notable gains in handling class imbalance and recognizing negative sentiment, and has practical application value.
Abstract: Understanding customer attitudes has become a critical component of decision-making due to the growing influence of social media and e-commerce. Text-based opinions are the most structured, hence playing an important role in sentiment analysis. Most of the existing methods, which include lexicon-based approaches and traditional machine learning techniques, are insufficient for handling contextual nuances and scalability. While the latter has limitations in model performance and generalization, deep learning (DL) has achieved improvement, especially on semantic relationship capturing with recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The aim of the study is to enhance opinion mining by introducing a hybrid deep neural network model that combines a bidirectional gated recurrent unit (BGRU) and long short-term memory (LSTM) layers to improve sentiment analysis, particularly addressing challenges such as contextual nuance, scalability, and class imbalance. To substantiate the efficacy of the proposed model, we conducted comprehensive experiments utilizing benchmark datasets, encompassing IMDB movie critiques and Amazon product evaluations. The introduced hybrid BGRULSTM (HBGRU-LSTM) architecture attained a testing accuracy of 95%, exceeding the performance of traditional DL frameworks such as LSTM (93.06%), CNN+LSTM (93.31%), and GRU+LSTM (92.20%). Moreover, our model exhibited a noteworthy enhancement in recall for negative sentiments, escalating from 86% (unbalanced dataset) to 96% (balanced dataset), thereby ensuring a more equitable and just sentiment classification. Furthermore, the model diminished misclassification loss from 20.24% for unbalanced to 13.3% for balanced dataset, signifying enhanced generalization and resilience.
[8] Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings
Xueying Ding,Xingyue Huang,Mingxuan Ju,Liam Collins,Yozen Liu,Leman Akoglu,Neil Shah,Tong Zhao
Main category: cs.CL
TL;DR: Proposes Hierarchical Token Prepending (HTP), which improves LLM embedding performance on long texts through block-level summary tokens and mean pooling.
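The two ingredients named in the summary can be pictured with a small sketch: one helper contrasts last-token pooling with mean pooling, and another lays out an input in blocks with a summary placeholder ahead of each later block. This is a hedged illustration only. The placeholder naming (`<SUM_i>`), the choice to prepend only the immediately preceding block's summary, and all function names are assumptions, not the paper's specification.

```python
def last_token_pool(token_states):
    """Embedding = hidden state of the final token only."""
    return list(token_states[-1])


def mean_pool(token_states):
    """Embedding = element-wise average over all token hidden states."""
    n = len(token_states)
    return [sum(column) / n for column in zip(*token_states)]


def split_into_blocks(tokens, block_size):
    """Partition a token list into consecutive blocks of at most block_size."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def layout_with_summaries(tokens, block_size):
    """Insert a hypothetical summary placeholder for block i-1 before block i."""
    laid_out = []
    for i, block in enumerate(split_into_blocks(tokens, block_size)):
        if i > 0:
            laid_out.append(f"<SUM_{i - 1}>")  # placeholder name is an assumption
        laid_out.extend(block)
    return laid_out


states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy per-token hidden states
print(last_token_pool(states))
print(mean_pool(states))
print(layout_with_summaries(list("abcdef"), block_size=2))
```

Mean pooling reads from every position, so information carried by early tokens is not forced through the final token's state, which is the intuition behind replacing last-token pooling.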
Details
Motivation: The causal attention mechanism of LLMs restricts information flow from later to earlier tokens, degrading representation quality; existing methods over-compress information and are therefore ill-suited to long documents.
Method: Partitions the input into blocks and prepends block-level summary tokens to subsequent blocks to promote backward information flow, and replaces last-token pooling with mean pooling to mitigate over-squashing at the readout.
Result: Achieves consistent gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings.
Conclusion: HTP is a simple, architecture-agnostic method that improves long-document embeddings for both zero-shot and finetuned models and scales well.
Abstract: Large language models produce powerful text embeddings, but their causal attention mechanism restricts the flow of information from later to earlier tokens, degrading representation quality. While recent methods attempt to solve this by prepending a single summary token, they over-compress information, hence harming performance on long documents. We propose Hierarchical Token Prepending (HTP), a method that resolves two critical bottlenecks. To mitigate attention-level compression, HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating multiple pathways for backward information flow. To address readout-level over-squashing, we replace last-token pooling with mean-pooling, a choice supported by theoretical analysis. HTP achieves consistent performance gains across 11 retrieval datasets and 30 general embedding benchmarks, especially in long-context settings. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
[9] Mathematical Analysis of Hallucination Dynamics in Large Language Models: Uncertainty Quantification, Advanced Decoding, and Principled Mitigation
Moses Kiprono
Main category: cs.CL
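TL;DR: Proposes a mathematically grounded framework for understanding, measuring, and mitigating hallucination in large language models, combining several theoretical approaches and introducing refined uncertainty metrics and mitigation strategies.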
Details
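Motivation: LLMs are powerful but prone to hallucinated outputs that are factually incorrect or unsupported, so reliable methods for identifying and mitigating them are needed.
Method: Combines probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation to analyze how errors compound autoregressively; proposes semantic and phase-aware uncertainty metrics, along with mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention.
Result: Establishes a unified theoretical framework that connects recent advances in calibration, retrieval, and alignment in support of more reliable and safer models.
Conclusion: The framework offers a mathematically rigorous and practical path to understanding and reducing LLM hallucination, helping build safer and more trustworthy language systems.
Abstract: Large Language Models (LLMs) are powerful linguistic engines but remain susceptible to hallucinations: plausible-sounding outputs that are factually incorrect or unsupported. In this work, we present a mathematically grounded framework to understand, measure, and mitigate these hallucinations. Drawing on probabilistic modeling, information theory, trigonometric signal analysis, and Bayesian uncertainty estimation, we analyze how errors compound autoregressively, propose refined uncertainty metrics, including semantic and phase-aware variants, and develop principled mitigation strategies such as contrastive decoding, retrieval-augmented grounding, factual alignment, and abstention. This unified lens connects recent advances in calibration, retrieval, and alignment to support safer and more reliable LLMs.
[10] Teaching According to Students' Aptitude: Personalized Mathematics Tutoring via Persona-, Memory-, and Forgetting-Aware LLMs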
Yang Wu,Rujing Yao,Tong Zhang,Yufei Shi,Zhuoren Jiang,Zhushan Li,Xiaozhong Liu
Main category: cs.CL
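TL;DR: TASA is a personalized mathematics tutoring framework that combines student persona, memory, and forgetting dynamics. It uses knowledge tracing and a continuous forgetting curve to dynamically update each student's mastery state and deliver more adaptive instruction.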
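To make the idea of a continuous forgetting curve concrete, the sketch below decays a mastery estimate exponentially with elapsed time and then maps the result to a question difficulty. The exponential (Ebbinghaus-style) form, the `stability_hours` parameter, and the difficulty thresholds are all assumptions chosen for illustration; the summary does not state which functional form or thresholds TASA uses.

```python
import math


def decayed_mastery(mastery, hours_since_practice, stability_hours):
    """Exponential forgetting: mastery shrinks as time since practice grows."""
    return mastery * math.exp(-hours_since_practice / stability_hours)


def pick_difficulty(mastery):
    """Map a mastery estimate in [0, 1] to a coarse difficulty band (hypothetical)."""
    if mastery < 0.4:
        return "easy"
    if mastery < 0.7:
        return "medium"
    return "hard"


# A student who scored 0.9 on a skill two days ago, with an assumed
# stability of 72 hours for that skill.
current = decayed_mastery(0.9, hours_since_practice=48, stability_hours=72)
print(f"estimated mastery now: {current:.3f} -> {pick_difficulty(current)} question")
```

A knowledge-tracing component would update `mastery` after each interaction, while the decay term accounts for time passing between sessions.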
Details
Motivation: Existing LLM-based intelligent tutoring systems fail to capture how students' knowledge evolves over time; mathematics tutoring in particular lacks fine-grained scaffolding calibrated to each student's mastery level and cognitive retention.
Method: Proposes the TASA framework, which builds a structured student persona and an event memory, and combines a continuous forgetting curve with knowledge tracing to dynamically adjust question difficulty and explanations.
Result: Experiments show that TASA outperforms representative baselines in learning outcomes and tutoring adaptivity.
Conclusion: Modeling temporal forgetting and learner profiles is important for improving LLM-based tutoring systems.
Abstract: Large Language Models (LLMs) are increasingly integrated into intelligent tutoring systems to provide human-like and adaptive instruction. However, most existing approaches fail to capture how students' knowledge evolves dynamically across their proficiencies, conceptual gaps, and forgetting patterns. This challenge is particularly acute in mathematics tutoring, where effective instruction requires fine-grained scaffolding precisely calibrated to each student's mastery level and cognitive retention. To address this issue, we propose TASA (Teaching According to Students' Aptitude), a student-aware tutoring framework that integrates persona, memory, and forgetting dynamics for personalized mathematics learning. Specifically, TASA maintains a structured student persona capturing proficiency profiles and an event memory recording prior learning interactions. By incorporating a continuous forgetting curve with knowledge tracing, TASA dynamically updates each student's mastery state and generates contextually appropriate, difficulty-calibrated questions and explanations. Empirical results demonstrate that TASA achieves superior learning outcomes and more adaptive tutoring behavior compared to representative baselines, underscoring the importance of modeling temporal forgetting and learner profiles in LLM-based tutoring systems.
[11] HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples
Rishikant Chigrupaatii,Ponnada Sai Tulasi Kanishka,Lalit Chandra Routhu,Martin Patel Sama Supratheek Reddy,Divyam Gupta,Dasari Srikar,Krishna Teja Kuchimanchi,Rajiv Misra,Rohun Tripathi
Main category: cs.CL
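TL;DR: This paper proposes a scalable framework for evaluating vision-language models (VLMs) in Indian languages and releases HinTel-AlignBench, a benchmark covering Hindi and Telugu with approximately 4,000 QA pairs per language. It finds that existing models generally perform worse in Indian languages than in English and analyzes failure modes to guide future improvements.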
Details
Motivation: Existing multilingual VLM evaluations rely on unverified automatic translations, cover a narrow range of tasks, use small samples, and lack culturally relevant, natively sourced QA data, problems that weigh most heavily on low-resource languages. A more reliable and culturally relevant evaluation framework is needed to advance multilingual VLMs for Indian languages.
Method: Proposes a semi-automated dataset creation framework combining back-translation, filtering, and human verification; builds the HinTel-AlignBench benchmark from adapted English datasets (such as VQAv2) and native Indic datasets (such as JEE and VAANI); and compares a range of SOTA open-weight and closed-source VLMs.
Result: On 4 of 5 tasks, models perform worse in Hindi and Telugu than in English, with average drops of 8.3 and 5.5 points respectively; common failure modes are identified, exposing key problems in multilingual multimodal understanding.
Conclusion: Current VLMs still fall clearly short in Indian languages, and cultural adaptation and localized data deserve more attention; the proposed framework and benchmark offer an effective route to fair evaluation and improvement of multilingual VLMs.
Abstract: With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and lack of cultural and natively sourced Question-Answering (QA). To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare it with performance in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native novel Indic datasets (JEE for STEM, VAANI for cultural grounding) with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find a regression in performance for tasks in English versus in Indian languages for 4 out of 5 tasks across all the models, with an average regression of 8.3 points in Hindi and 5.5 points for Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
[12] Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Vladislav Pedashenko,Laida Kushnareva,Yana Khassan Nibal,Eduard Tulchinskii,Kristian Kuznetsov,Vladislav Zharchinskii,Yury Maximov,Irina Piontkovskaya
Main category: cs.CL
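TL;DR: This paper presents the first systematic study of the relationship between intrinsic dimension (ID) and interpretable text properties. It finds that ID is uncorrelated with entropy-based metrics, shows clear genre stratification, and uses sparse autoencoders to identify features that causally affect ID.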
Details
Motivation: Intrinsic dimension (ID) is widely used in the analysis of large models, yet its relationship to textual properties remains unclear; this paper aims to uncover the interpretable linguistic factors behind ID.
Method: Combines cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs) to analyze the ID of different texts from several angles, and verifies causality through steering experiments.
Result: ID is uncorrelated with entropy-based metrics; scientific text has low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5); scientific style lowers ID, while personalization, emotion, and narrative raise it, and steering experiments confirm that these effects are causal.
Conclusion: ID reflects the geometric complexity of representations rather than prediction difficulty; scientific writing is comparatively "easy" for current LLMs, whereas fiction, opinion, and affective content require more degrees of freedom. The study offers practical guidance for the proper use of ID.
Abstract: Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
[13] OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition
Xinli Tao,Xin Dong,Xuezhong Zhou
Main category: cs.CL
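TL;DR: Proposes OEMA, a zero-shot clinical named entity recognition framework based on multi-agent collaboration, which achieves state-of-the-art performance on multiple datasets.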
Details
Motivation: Existing supervised models depend on costly annotated data, while zero-shot NER struggles with example selection granularity and with integrating prompts.
Method: Designs a three-component framework consisting of a self-annotator, a discriminator, and a predictor, combining the SNOMED CT ontology and entity descriptions for multi-agent collaboration.
Result: On the MTSamples and VAERS datasets, OEMA achieves the best exact-match performance; under related-match, it matches supervised BioClinicalBERT and surpasses CRF.
Conclusion: Through ontology-guided reasoning and multi-agent collaboration, OEMA addresses key challenges in zero-shot clinical NER and achieves near-supervised performance.
Abstract: Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA's three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.
[14] Context Cascade Compression: Exploring the Upper Limits of Text Compression
Fanfan Liu,Haibo Qiu
Main category: cs.CL
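TL;DR: This paper proposes Context Cascade Compression (C3), which cascades two LLMs of different sizes to compress long text contexts efficiently. It reaches 98% and 93% decoding accuracy at 20x and 40x compression ratios respectively, substantially outperforming DeepSeek-OCR.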
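The compression ratio used here is simply the number of text tokens divided by the number of latent tokens, so the arithmetic can be checked in a few lines. The helper names are hypothetical, and the specific pairings of ratio and latent-token count below are illustrative; the summary does not say which latent length was used at which ratio.

```python
def compression_ratio(num_text_tokens, num_latent_tokens):
    """Ratio of original text tokens to latent tokens."""
    return num_text_tokens / num_latent_tokens


def text_tokens_for(num_latent_tokens, ratio):
    """How many text tokens a given latent budget covers at a given ratio."""
    return num_latent_tokens * ratio


# Illustrative pairings only (not reported configurations).
print(text_tokens_for(32, 20))      # text tokens covered by 32 latents at 20x
print(text_tokens_for(64, 40))      # text tokens covered by 64 latents at 40x
print(compression_ratio(1280, 64))  # ratio when 1280 tokens map to 64 latents
```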
Details
Motivation: To address the computational and memory challenges posed by million-token inputs and to explore the limits of text context compression.
Method: A small LLM compresses a long context into a small number of latent tokens (for example, 32 or 64), and a large LLM then decodes from them, forming a two-stage cascade.
Result: Decoding accuracy reaches 98% at a 20x compression ratio and remains at about 93% at 40x, far above the roughly 60% reported for DeepSeek-OCR.
Conclusion: C3 achieves efficient, high-fidelity context compression with a pure-text pipeline, demonstrating its advantage on long-context tasks and providing a reference point for the compression upper bound in OCR and related fields.
Abstract: Million-level token inputs in long-context tasks pose significant computational and memory challenges for Large Language Models (LLMs). Recently, DeepSeek-OCR conducted research into the feasibility of Contexts Optical Compression and achieved preliminary results. Inspired by this, we introduce Context Cascade Compression (C3) to explore the upper limits of text compression. Our method cascades two LLMs of different sizes to handle the compression and decoding tasks. Specifically, a small LLM, acting as the first stage, performs text compression by condensing a long context into a set of latent tokens (e.g., 32 or 64 in length), achieving a high ratio of text tokens to latent tokens. A large LLM, as the second stage, then executes the decoding task on this compressed context. Experiments show that at a 20x compression ratio (where the number of text tokens is 20 times the number of latent tokens), our model achieves 98% decoding accuracy, compared to approximately 60% for DeepSeek-OCR. When we further increase the compression ratio to 40x, the accuracy is maintained at around 93%. This indicates that in the domain of context compression, C3 Compression demonstrates superior performance and feasibility over optical character compression. C3 uses a simpler, pure-text pipeline that ignores factors like layout, color, and information loss from a visual encoder. This also suggests a potential upper bound for compression ratios in future work on optical character compression, OCR, and related fields. Codes and model weights are publicly accessible at https://github.com/liufanfanlff/C3-Context-Cascade-Compression
[15] IndicGEC: Powerful Models, or a Measurement Mirage?
Sowmya Vajjala
Main category: cs.CL
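TL;DR: This paper reports TeamNRC's results in the BHASHA-Task 1 grammatical error correction shared task (Hindi, Telugu, Tamil, Malayalam, and Bangla). Using zero/few-shot prompting of language models of varying sizes, the team achieved good results, highlighting the potential of small language models, and discusses issues with data quality and evaluation metrics.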
Details
Motivation: To explore how effective zero/few-shot prompting is for grammatical error correction across several Indian languages under limited resources, and to assess the suitability of existing datasets and evaluation metrics.
Method: Applies zero-shot and few-shot prompting to language models ranging from 4B parameters to large proprietary models, runs experiments on 5 Indian languages, and analyzes the influence of data quality and the evaluation metric.
Result: Ranked 4th in Telugu (GLEU 83.78) and 2nd in Hindi (GLEU 84.31), with experiments extended to the other three languages (Tamil, Malayalam, Bangla); the results show that small language models have considerable potential.
Conclusion: Small language models perform well on grammatical error correction for Hindi and other Indian languages, but the field needs higher-quality datasets and evaluation metrics better suited to Indian language scripts.
Abstract: In this paper, we report the results of the TeamNRC's participation in the BHASHA-Task 1 Grammatical Error Correction shared task https://github.com/BHASHA-Workshop/IndicGEC2025/ for 5 Indian languages. Our approach, focusing on zero/few-shot prompting of language models of varying sizes (4B to large proprietary models) achieved a Rank 4 in Telugu and Rank 2 in Hindi with GLEU scores of 83.78 and 84.31 respectively. In this paper, we extend the experiments to the other three languages of the shared task - Tamil, Malayalam and Bangla, and take a closer look at the data quality and evaluation metric used. Our results primarily highlight the potential of small language models, and summarize the concerns related to creating good quality datasets and appropriate metrics for this task that are suitable for Indian language scripts.
[16] MAPROC at AHaSIS Shared Task: Few-Shot and Sentence Transformer for Sentiment Analysis of Arabic Hotel Reviews
Randa Zarnoufi
Main category: cs.CL
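TL;DR: This paper describes an approach to sentiment analysis of Arabic dialects (Moroccan and Saudi) in the hotel domain for the AHaSIS shared task. Using the SetFit framework for few-shot learning, it achieved an F1 of 73% on the official test set, ranking 12th.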
Details
Motivation: Sentiment analysis of Arabic dialects is challenged by linguistic diversity and the scarcity of annotated data, especially in specialized domains such as hotel reviews.
Method: Uses the SetFit (Sentence Transformer Fine-tuning) framework, drawing on its data efficiency and few-shot learning ability to classify sentiment in dialectal Arabic.
Result: Achieved an F1 of 73% on the official evaluation set, ranking 12th among 26 participants.
Conclusion: Few-shot learning shows potential for handling low-resource dialectal Arabic text in specialized domains.
Abstract: Sentiment analysis of Arabic dialects presents significant challenges due to linguistic diversity and the scarcity of annotated data. This paper describes our approach to the AHaSIS shared task, which focuses on sentiment analysis on Arabic dialects in the hospitality domain. The dataset comprises hotel reviews written in Moroccan and Saudi dialects, and the objective is to classify the reviewer's sentiment as positive, negative, or neutral. We employed the SetFit (Sentence Transformer Fine-tuning) framework, a data-efficient few-shot learning technique. On the official evaluation set, our system achieved an F1 of 73%, ranking 12th among 26 participants. This work highlights the potential of few-shot learning to address data scarcity in processing nuanced dialectal Arabic text within specialized domains like hotel reviews.
[17] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Piercosma Bisconti,Matteo Prandi,Federico Pierucci,Francesco Giarrusso,Marcantonio Bracale,Marcello Galisai,Vincenzo Suriani,Olga Sorokoletova,Federico Sartore,Daniele Nardi
Main category: cs.CL
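TL;DR: Adversarial poetry is an effective single-turn jailbreak technique that substantially raises attack success rates against large language models, exposing fundamental weaknesses in current safety mechanisms.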
Details
Motivation: To explore whether stylistic variation, such as poetic form, can bypass the safety alignment of large language models.
Method: Converts 1,200 harmful prompts into verse and tests their attack success rate on 25 frontier models, evaluating outputs with open-weight judge models and human validation.
Result: Poetic prompts exceed a 90% attack success rate on some models; the average jailbreak success rate is 62% for hand-crafted poems and 43% for automatic conversions, reaching up to 18 times the baseline.
Conclusion: Current safety alignment methods are systematically vulnerable to stylistic transformation, indicating that evaluation protocols and defense mechanisms need to be redesigned.
Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
[18] HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning
Alexis Correa-Guillén,Carlos Gómez-Rodríguez,David Vilares
Main category: cs.CL
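TL;DR: HEAD-QA v2 is an expanded and updated Spanish/English healthcare multiple-choice reasoning dataset with over 12,000 questions, intended to advance research on biomedical reasoning and model improvement.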
Details
Motivation: To respond to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of the healthcare domain.
Method: Extends the dataset with questions from ten years of Spanish professional exams, builds multilingual versions, and benchmarks several open-source LLMs using prompting, RAG, and probability-based answer selection.
Result: Model scale and intrinsic reasoning ability are the main drivers of performance, while complex inference strategies yield limited gains.
Conclusion: HEAD-QA v2 is a reliable resource for advancing research on biomedical reasoning and model improvement.
Abstract: We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies obtaining limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
[19] The Empowerment of Science of Science by Large Language Models: New Tools and Methods
Guoqiang Liang,Jingqian Gong,Mengxuan Li,Gege Lin,Shuo Zhang
Main category: cs.CL
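TL;DR: This paper reviews the core technologies that support large language models (LLMs), including prompt engineering, retrieval-augmented generation, fine-tuning, pretraining, and tool learning, and discusses prospective applications of LLMs in scientometrics, such as an AI-agent-based model for scientific evaluation, detection of new research fronts, and knowledge graph construction.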
Details
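Motivation: LLMs have shown strong capabilities in natural language processing and multimodal tasks and hold great potential for scientific research, but a systematic, user-oriented account of their core technologies and their applications in the Science of Science has been lacking.
Method: Through a literature review, systematically summarizes the core technologies of LLMs and, in light of the historical development of the Science of Science, proposes application frameworks and future directions for LLMs in scientific evaluation and knowledge discovery.
Result: Proposes an AI-agent-based model for scientific evaluation, shows the potential of LLMs for identifying new research fronts and building knowledge graphs, and outlines their broad application in scientometrics.
Conclusion: LLMs continue to advance technically and are expected to play a transformative role in the Science of Science, driving a shift toward more intelligent research paradigms.
Abstract: Large language models (LLMs) have exhibited exceptional capabilities in natural language understanding and generation, image recognition, and multimodal tasks, charting a course towards AGI and emerging as a central issue in the global technological race. This manuscript conducts a comprehensive review of the core technologies that support LLMs from a user standpoint, including prompt engineering, knowledge-enhanced retrieval augmented generation, fine tuning, pretraining, and tool learning. Additionally, it traces the historical development of Science of Science (SciSci) and presents a forward looking perspective on the potential applications of LLMs within the scientometric domain. Furthermore, it discusses the prospect of an AI agent based model for scientific evaluation, and presents new research fronts detection and knowledge graph building methods with LLMs.
[20] A Compliance-Preserving Retrieval System for Aircraft MRO Task Search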
Byungho Jo
Main category: cs.CL
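TL;DR: Proposes a compliance-preserving retrieval system that combines LLM reranking with semantic search to help aircraft maintenance technicians find manual content quickly, substantially reducing lookup time while remaining compatible with certified systems.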
Details
Motivation: Aircraft maintenance technicians spend a large share of their work time searching manuals; traditional methods are inefficient and every procedure must be traceable to certified sources, so retrieval efficiency needs to improve without replacing existing certified systems.
Method: The system builds revision-robust embeddings from the ATA chapter structure, uses vision-language parsing to structure certified content, and enhances semantic search with LLM reranking; results can be verified and accessed in existing certified viewers.
Result: Achieves over 90% retrieval accuracy on 49k synthetic queries; a bilingual controlled study with 10 licensed technicians shows a 90.9% top-10 success rate and cuts lookup time from 6-15 minutes to 18 seconds, a 95% reduction.
Conclusion: Semantic retrieval can be integrated safely under strict airworthiness regulation and effectively reduces workload in real-world multilingual MRO operations, offering an example of traceable, compliant intelligent assistance for high-reliability industries.
Abstract: Aircraft Maintenance Technicians (AMTs) spend up to 30% of work time searching manuals, a documented efficiency bottleneck in MRO operations where every procedure must be traceable to certified sources. We present a compliance-preserving retrieval system that adapts LLM reranking and semantic search to aviation MRO environments by operating alongside, rather than replacing, certified legacy viewers. The system constructs revision-robust embeddings from ATA chapter hierarchies and uses vision-language parsing to structure certified content, allowing technicians to preview ranked tasks and access verified procedures in existing viewers. Evaluation on 49k synthetic queries achieves >90% retrieval accuracy, while bilingual controlled studies with 10 licensed AMTs demonstrate 90.9% top-10 success rate and 95% reduction in lookup time, from 6-15 minutes to 18 seconds per task. These gains provide concrete evidence that semantic retrieval can operate within strict regulatory constraints and meaningfully reduce operational workload in real-world multilingual MRO workflows.
[21] DEPO: Dual-Efficiency Preference Optimization for LLM Agents
Sirui Chen,Mengshi Zhao,Lei Xu,Yuying Zhao,Beier Zhu,Hanwang Zhang,Shengjie Zhao,Chaochao Lu
Main category: cs.CL
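TL;DR: This paper introduces the concept of dual-efficiency, comprising step-level and trajectory-level efficiency, and proposes DEPO, a method that improves LLM agent performance while reducing token and step usage.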
Details
Motivation: Richer reasoning in existing LLM agents often leads to overly long chains of thought, hurting interaction efficiency in real-world use, and the lack of a systematic definition of efficiency limits targeted optimization.
Method: Introduces dual-efficiency, comprising step-level efficiency (minimizing tokens per step) and trajectory-level efficiency (minimizing the number of steps to complete a task), and designs DEPO, which uses joint preference optimization to reward succinct responses and fewer action steps.
Result: On WebShop and BabyAI, DEPO reduces token usage by up to 60.9% and steps by up to 26.9% while improving performance by up to 29.3%; it generalizes well to three downstream math tasks and keeps its efficiency advantage when trained on only 25% of the data.
Conclusion: Through a systematic dual-efficiency optimization framework, DEPO substantially improves the reasoning efficiency and performance of LLM agents, with good data efficiency and cross-task generalization.
Abstract: Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Moreover, a systematic definition of LLM agent efficiency is still lacking, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
[22] NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework
Shanlin Zhou,Xinpeng Wang,Jianxun Lian,Zhenghao Liu,Laks V. S. Lakshmanan,Xiaoyuan Yi,Yongtao Hao
Main category: cs.CL
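TL;DR: This paper proposes NAMeGEn, a multi-agent optimization framework that addresses the challenges of multi-objective flexibility and interpretive complexity in Chinese baby naming, a creative short-form text generation task.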
Details
Motivation: Existing LLMs fall short in meeting users' personalized, diverse requirements and in generating creative short-form text with aesthetic explanations, especially in complex cultural contexts such as Chinese naming.
Method: Proposes the NAMeGEn framework, in which multiple agents collaborate to iterate over objective extraction, name generation, and evaluation; builds a corpus of 17k+ classical Chinese poems to enhance aesthetics; and releases the CBNames benchmark with tailored metrics.
Result: Experiments show that NAMeGEn, without any training, outperforms six baseline methods built on various LLM backbones and effectively generates creative names that satisfy personalized constraints and come with meaningful explanations.
Conclusion: NAMeGEn offers an effective solution for short-form text generation tasks with multiple objectives and high interpretive demands, demonstrating the potential of multi-agent frameworks for creative language generation.
Abstract: Trained on diverse human-authored texts, Large Language Models (LLMs) have unlocked the potential for Creative Natural Language Generation (CNLG), benefiting various applications like advertising and storytelling. Nevertheless, CNLG still remains difficult due to two main challenges. (1) Multi-objective flexibility: user requirements are often personalized, fine-grained, and pluralistic, which LLMs struggle to satisfy simultaneously; (2) Interpretive complexity: beyond generation, creativity also involves understanding and interpreting implicit meaning to enhance users' perception. These challenges significantly limit current methods, especially in short-form text generation, in generating creative and insightful content. To address this, we focus on Chinese baby naming, a representative short-form CNLG task requiring adherence to explicit user constraints (e.g., length, semantics, anthroponymy) while offering meaningful aesthetic explanations. We propose NAMeGEn, a novel multi-agent optimization framework that iteratively alternates between objective extraction, name generation, and evaluation to meet diverse requirements and generate accurate explanations. To support this task, we further construct a classical Chinese poetry corpus with 17k+ poems to enhance aesthetics, and introduce CBNames, a new benchmark with tailored metrics. Extensive experiments demonstrate that NAMeGEn effectively generates creative names that meet diverse, personalized requirements while providing meaningful explanations, outperforming six baseline methods spanning various LLM backbones without any training.
[23] Building Robust and Scalable Multilingual ASR for Indian Languages
Arjun Gangwar,Kaousheik Jayakumar,S. Umesh
Main category: cs.CL
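TL;DR: This paper describes the systems developed by SPRING Lab, IIT Madras, for the ASRU MADASR 2.0 challenge, focusing on improving multilingual and multi-dialect automatic speech recognition through a multi-decoder architecture and a phonemic Common Label Set (CLS).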
Details
Motivation: To improve accurate identification of language and dialect across 8 languages and 33 dialects, and to build multilingual ASR systems from scratch without additional data.
Method: Jointly trains a multi-decoder architecture based on a phonemic Common Label Set (CLS), and explores methods for converting from the phonemic space back to the grapheme space while retaining the performance gains.
Result: In Track 2, the system beats the baseline's WER/CER in 3 languages and achieves the highest language ID and dialect ID accuracy among all participating teams.
Conclusion: The proposed multi-decoder and CLS approach effectively improves multilingual, multi-dialect ASR performance, and is especially strong at language and dialect identification.
Abstract: This paper describes the systems developed by SPRING Lab, Indian Institute of Technology Madras, for the ASRU MADASR 2.0 challenge. The systems focus on adapting ASR systems to better predict the language and dialect of an utterance among 8 languages across 33 dialects. We participated in Track 1 and Track 2, which restrict the use of additional data and require developing multilingual systems from scratch. We present a novel training approach using a Multi-Decoder architecture with a phonemic Common Label Set (CLS) as an intermediate representation. It improved the performance over the baseline (in the CLS space). We also discuss various methods used to retain the gain obtained in the phonemic space while converting the outputs back to the corresponding grapheme representations. Our systems beat the baseline in 3 languages (Track 2) in terms of WER/CER and achieved the highest language ID and dialect ID accuracy among all participating teams (Track 2).
[24] LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering
Yuanjie Zhu,Liangwei Yang,Ke Xu,Weizhi Zhang,Zihe Song,Jindong Wang,Philip S. Yu
Main category: cs.CL
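TL;DR: LLM-MemCluster is a novel text clustering framework based entirely on large language models; by introducing a dynamic memory and a dual-prompt strategy, it achieves tuning-free, end-to-end semantic clustering and significantly outperforms existing methods.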
Details
Motivation: Although LLMs have strong semantic understanding, they lack stateful memory and struggle to control cluster granularity, so existing clustering methods rely on external modules and cannot achieve truly end-to-end clustering.
Method: Proposes the LLM-MemCluster framework, which recasts clustering as a fully LLM-native task; a dynamic memory mechanism gives the model state awareness, and a dual-prompt strategy lets the model reason about the number of clusters on its own.
Result: Validated on multiple benchmark datasets, the method consistently and significantly outperforms strong baselines without fine-tuning.
Conclusion: LLM-MemCluster delivers an efficient, interpretable, and truly end-to-end paradigm for LLM-based text clustering, advancing the native use of LLMs in unsupervised learning.
Abstract: Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
[25] Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis
Yves Pauli,Jan-Bernard Marsman,Finn Rabe,Victoria Edkins,Roya Hüppi,Silvia Ciampelli,Akhil Ratan Misra,Nils Lang,Wolfram Hinzen,Iris Sommer,Philipp Homan
Main category: cs.CL
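TL;DR: This paper proposes the Language Processing Data Structure (LPDS) and the pelican nlp toolkit, aiming to improve the standardisation and reproducibility of language data processing.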
Details
Motivation: The organisation and sharing of language data currently lack standardisation, and reproducible processing methods are missing, so a unified framework is needed to improve research transparency and reproducibility.
Method: Inspired by the Brain Imaging Data Structure (BIDS), proposes LPDS as a file organisation standard for language data and develops pelican nlp, a modular Python tool that supports the full workflow from data cleaning to extraction of linguistic and acoustic features, with reproducible analyses specified through a configuration file.
Result: Delivers an end-to-end language processing pipeline for LPDS-formatted data that can produce preprocessed language data or standardised linguistic and acoustic features together with their aggregated results.
Conclusion: Together, LPDS and pelican nlp form a complete language data processing framework that promotes methodological transparency and research reproducibility.
Abstract: The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.
[26] Multimodal Evaluation of Russian-language Architectures
Artem Chervyakov,Ulyana Isaeva,Anton Emelyanov,Artem Safin,Maria Tikhonova,Alexander Kharitonov,Yulia Lyakh,Petr Surovtsev,Denis Shevelev,Vildan Saburov,Vasily Konovalov,Elisei Rykov,Ivan Sviridov,Amina Miftakhova,Ilseyar Alimova,Alexander Panchenko,Alexander Kapitanov,Alena Fenogenova
Main category: cs.CL
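TL;DR: This paper presents Mera Multi, an open-source multimodal evaluation framework for Russian comprising 18 tasks built from scratch across text, image, audio, and video modalities, designed to systematically assess the capabilities and limitations of multimodal large language models.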
Details
Motivation: Multimodal LLMs are advancing rapidly, yet their intelligence, limitations, and risks remain unclear, and multimodal benchmarks are lacking for Russian in particular, so a targeted evaluation suite is needed.
Method: Proposes the instruction-based multimodal evaluation framework Mera Multi, designs a universal taxonomy of multimodal abilities, builds 18 datasets from scratch that reflect Russian linguistic and cultural specificity, unifies prompt templates and evaluation metrics, and introduces watermarking and licensing to prevent test leakage.
Result: Releases baseline results covering general-purpose models and modality-specific architectures (image-to-text, video-to-text, audio-to-text), supports evaluation of both closed-source and open-source models, and provides a reusable methodology applicable to typologically diverse languages such as those of the Slavic family.
Conclusion: Mera Multi fills the gap in Russian multimodal evaluation, providing a systematic evaluation tool for Russian-language models and a replicable methodological framework for building multimodal benchmarks in other low-resource and typologically diverse languages.
Abstract: Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.
[27] HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning
Qihao Yang,Xuelin Wang,Jiale Chen,Xuelian Dong,Yuxin Hao,Tianyong Hao
Main category: cs.CL
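TL;DR: This paper presents HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese second language acquisition. It includes authentic textbooks, synthetic data, and a linguistically grounded evaluation system, and uses a curriculum-tuning framework to simulate human learning trajectories; experiments show it models Chinese second language acquisition effectively and supports dynamic writing assessment of LLMs.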
Details
Motivation: Because controlling the language input of human learners is ethically and practically difficult, modeling and evaluating Chinese second language acquisition faces verifiability and scalability challenges, and existing LLMs lack a systematic benchmark that supports staged modeling.
Method: Builds HSKBenchmark covering HSK levels 3 to 6, with authentic textbooks totaling 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistic evaluation system; proposes a curriculum-tuning framework that trains models by learning stage; develops HSKAgent, fine-tuned on 10K learner compositions; and designs an evaluation system that analyzes grammar coverage, errors, complexity, and holistic scores.
Result: Experiments show that HSKBenchmark effectively simulates the Chinese second language acquisition process and supports staged modeling and dynamic writing assessment of LLMs; the fine-tuned LLMs write at a level close to advanced human learners and exhibit human-like acquisition characteristics.
Conclusion: HSKBenchmark, HSKAgent, and the associated models provide foundational tools and resources for research on Chinese second language acquisition modeling and LLM interpretability, with the potential to drive future progress in the field.
Abstract: Language acquisition is vital to revealing the nature of human language intelligence and has recently emerged as a promising perspective for improving the interpretability of large language models (LLMs). However, it is ethically and practically infeasible to conduct experiments that require controlling human learners' language inputs. This poses challenges for the verifiability and scalability of language acquisition modeling, particularly in Chinese second language acquisition (SLA). While LLMs provide a controllable and reproducible alternative, a systematic benchmark to support phase-wise modeling and assessment is still lacking. In this paper, we present HSKBenchmark, the first benchmark for staged modeling and writing assessment of LLMs in Chinese SLA. It covers HSK levels 3 to 6 and includes authentic textbooks with 6.76 million tokens, 16K synthetic instruction samples, 30 test topics, and a linguistically grounded evaluation system. To simulate human learning trajectories, we introduce a curriculum-tuning framework that trains models from beginner to advanced levels. An evaluation system is created to examine level-based grammar coverage, writing errors, lexical and syntactic complexity, and holistic scoring. We also build HSKAgent, fine-tuned on 10K learner compositions. Extensive experimental results demonstrate that HSKBenchmark not only models Chinese SLA effectively, but also serves as a reliable benchmark for dynamic writing assessment in LLMs. Our fine-tuned LLMs have writing performance on par with advanced human learners and exhibit human-like acquisition characteristics. The HSKBenchmark, HSKAgent, and checkpoints serve as foundational tools and resources, with the potential to pave the way for future research on language acquisition modeling and LLM interpretability. Code and data are publicly available at: https://github.com/CharlesYang030/HSKB.
[28] Tokenisation over Bounded Alphabets is Hard
Violeta Kastreva,Philip Whittington,Dennis Komm,Tiago Pimentel
Main category: cs.CL
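TL;DR: This paper studies tokenisation over bounded alphabets and proves that, even over a binary alphabet, both the bottom-up and direct variants are NP-complete and admit no polynomial-time approximation scheme (unless P=NP), showing that the computational hardness of tokenisation is fundamental rather than caused by large alphabets.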
Details
Motivation: Prior work showed that tokenisation is NP-complete, but under the unrealistic assumption of unboundedly large alphabets; this paper fills the gap by analysing the complexity of tokenisation over the bounded alphabets used in practice.
Method: Formally defines two natural tokenisation variants over $n$-ary alphabets, bottom-up tokenisation and direct tokenisation, and uses computational complexity theory to prove NP-completeness and inapproximability over binary alphabets, and NP-completeness of direct tokenisation over unary alphabets.
Result: Shows that, even over a binary alphabet, both variants are NP-complete and cannot be efficiently approximated; further finds that direct tokenisation remains NP-complete over a unary alphabet, indicating that its computational difficulty is fundamental.
Conclusion: The computational hardness of tokenisation is intrinsic rather than a product of large alphabets or complex constructions, which explains why practical algorithms such as BPE must rely on heuristics and suggests that future research should focus on designing approximation algorithms.
Abstract: Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and point toward approximation algorithms being an important path going forward for tokenisation research.
cs.CV [Back]
[29] Gaussian See, Gaussian Do: Semantic 3D Motion Transfer from Multiview Video
Yarin Bekor,Gal Michael Harari,Or Perel,Or Litany
Main category: cs.CV
TL;DR: Proposes Gaussian See, Gaussian Do, a new method for semantic 3D motion transfer from multiview video that supports rig-free, cross-category motion transfer between objects.
Details
Motivation: To achieve rig-free, cross-category, semantically consistent 3D motion transfer and address the shortcomings of existing methods in structural consistency and motion fidelity.
Method: Building on implicit motion transfer techniques, extracts motion embeddings from source videos via condition inversion, applies them to rendered frames of static target shapes, and uses the generated videos to supervise dynamic 3D Gaussian Splatting reconstruction; introduces an anchor-based view-aware motion embedding mechanism and a robust 4D reconstruction pipeline.
Result: Achieves high-quality semantic 3D motion transfer and, on a self-built benchmark that is the first for semantic 3D motion transfer, shows better motion fidelity and structural consistency than baseline methods.
Conclusion: The method effectively achieves cross-category, rig-free semantic 3D motion transfer, with good visual quality and application potential.
Abstract: We present Gaussian See, Gaussian Do, a novel approach for semantic 3D motion transfer from multiview video. Our method enables rig-free, cross-category motion transfer between objects with semantically meaningful correspondence. Building on implicit motion transfer techniques, we extract motion embeddings from source videos via condition inversion, apply them to rendered frames of static target shapes, and use the resulting videos to supervise dynamic 3D Gaussian Splatting reconstruction. Our approach introduces an anchor-based view-aware motion embedding mechanism, ensuring cross-view consistency and accelerating convergence, along with a robust 4D reconstruction pipeline that consolidates noisy supervision videos. We establish the first benchmark for semantic 3D motion transfer and demonstrate superior motion fidelity and structural consistency compared to adapted baselines. Code and data for this paper are available at https://gsgd-motiontransfer.github.io/
[30] B-Rep Distance Functions (BR-DF): How to Represent a B-Rep Model by Volumetric Distance Functions?
Fuyang Zhang,Pradeep Kumar Jayaraman,Xiang Xu,Yasutaka Furukawa
Main category: cs.CV
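TL;DR: This paper proposes BR-DF, a new CAD boundary representation based on volumetric distance functions. It encodes surface mesh geometry as a signed distance function, reliably produces watertight B-Rep models through an extended Marching Cubes algorithm, and, combined with a multi-branch latent diffusion model, achieves CAD generation with a 100% success rate.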
Details
Motivation: Traditional CAD generation methods struggle with topological correctness and generation stability, and lack a representation that guarantees successful conversion into a valid B-Rep model.
Method: Proposes B-Rep Distance Functions (BR-DF), which encode surface geometry with a signed distance function (SDF) and encode vertices, edges, faces, and their topology with per-face unsigned distance functions (UDFs); a multi-branch latent diffusion model with a 3D U-Net backbone jointly generates the SDF and UDFs, and an extended Marching Cubes algorithm reconstructs the B-Rep model.
Result: The method achieves CAD generation performance comparable to state-of-the-art methods and an unprecedented 100% success rate in producing (faceted) B-Rep models.
Conclusion: BR-DF is a robust and effective CAD shape representation whose volumetric nature fits well with deep generative models, reliably produces topologically correct B-Rep models, and offers a new direction for CAD generation.
Abstract: This paper presents a novel geometric representation for CAD Boundary Representation (B-Rep) based on volumetric distance functions, dubbed B-Rep Distance Functions (BR-DF). BR-DF encodes the surface mesh geometry of a CAD model as a signed distance function (SDF). B-Rep vertices, edges, faces and their topology information are encoded as per-face unsigned distance functions (UDFs). An extension of the Marching Cubes algorithm converts BR-DF directly into a watertight CAD B-Rep model (strictly speaking a faceted B-Rep model). A surprising characteristic of BR-DF is that this conversion process never fails. Leveraging the volumetric nature of BR-DF, we propose a multi-branch latent diffusion model with a 3D U-Net backbone for jointly generating the SDF and per-face UDFs of a BR-DF model. Our approach achieves comparable CAD generation performance against SOTA methods while reaching an unprecedented 100% success rate in producing (faceted) B-Rep models.
[31] GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis
Antonio Ruiz,Tao Wu,Andrew Melnik,Qing Cheng,Xuqin Wang,Lu Liu,Yongliang Wang,Yanfeng Zhang,Helge Ritter
Main category: cs.CV
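TL;DR: This paper proposes GeoSceneGraph, a method for generating 3D indoor scenes from text that exploits the graph structure and geometric symmetries of scenes without relying on predefined relationship classes or ground-truth relationship annotations, making it suitable for resource-constrained devices.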
Details
Motivation: Existing methods either ignore the graph structure of indoor scenes, which hurts realism, or depend on user-provided semantic graphs or ground-truth relationship annotations, which limits flexibility and scope. A generation method is needed that requires no external graph input and suits small devices.
Method: Proposes GeoSceneGraph, built on equivariant graph neural networks (EGNNs), with a simple and effective way to condition the EGNN on text features; it generates scenes using the graph structure and geometric symmetries of 3D scenes, without predefined relationship classes or ground-truth relationship annotations.
Result: Despite not using ground-truth relationship annotations, GeoSceneGraph achieves performance comparable to methods that rely on them, and ablation studies validate the design.
Conclusion: Without relying on predefined relationships or ground-truth annotations, GeoSceneGraph generates structurally plausible, realistic 3D indoor scenes, offering a viable approach to text-to-3D generation on resource-constrained devices.
Abstract: Methods that synthesize indoor 3D scenes from text prompts have wide-ranging applications in film production, interior design, video games, virtual reality, and synthetic data generation for training embodied agents. Existing approaches typically either train generative models from scratch or leverage vision-language models (VLMs). While VLMs achieve strong performance, particularly for complex or open-ended prompts, smaller task-specific models remain necessary for deployment on resource-constrained devices such as extended reality (XR) glasses or mobile phones. However, many generative approaches that train from scratch overlook the inherent graph structure of indoor scenes, which can limit scene coherence and realism. Conversely, methods that incorporate scene graphs either demand a user-provided semantic graph, which is generally inconvenient and restrictive, or rely on ground-truth relationship annotations, limiting their capacity to capture more varied object interactions. To address these challenges, we introduce GeoSceneGraph, a method that synthesizes 3D scenes from text prompts by leveraging the graph structure and geometric symmetries of 3D scenes, without relying on predefined relationship classes. Despite not using ground-truth relationships, GeoSceneGraph achieves performance comparable to methods that do. Our model is built on equivariant graph neural networks (EGNNs), but existing EGNN approaches are typically limited to low-dimensional conditioning and are not designed to handle complex modalities such as text. We propose a simple and effective strategy for conditioning EGNNs on text features, and we validate our design through ablation studies.
[32] HULFSynth : An INR based Super-Resolution and Ultra Low-Field MRI Synthesis via Contrast factor estimation
Pranav Indrakanti,Ivor Simpson
Main category: cs.CV
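TL;DR: Proposes an unsupervised, single-image, bidirectional MRI synthesis method that converts between high-field and ultra-low-field MRI, is modeled on the underlying physics, and improves white matter-gray matter contrast.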
Details
Motivation: Existing MRI synthesis models do not account for the physical contrast mechanisms between high-field and low-field MRI and struggle to accurately simulate image changes under real field-strength conversion, so a physics-driven bidirectional synthesis method is needed.
Method: Proposes a physics-based forward model that simulates the high-field to ultra-low-field transformation by estimating tissue-type signal-to-noise ratio (SNR) values at the target contrast; uses an implicit neural representation (INR) network to simultaneously predict tissue segmentation and image intensity without real high-field data, achieving super-resolution and reverse synthesis.
Result: WM-GM contrast improves by 52% in synthetic ULF-like images and by 37% in real 64mT images; sensitivity experiments show the model is robust to target contrast, noise, and initial seeding.
Conclusion: By combining a physics-driven forward model with an INR network, the method achieves high-quality unsupervised bidirectional MRI synthesis, outperforms existing methods in contrast enhancement and robustness, and has potential for low-resource MRI.
Abstract: We present an unsupervised single image bidirectional Magnetic Resonance Image (MRI) synthesizer that synthesizes an Ultra-Low Field (ULF) like image from a High-Field (HF) magnitude image and vice-versa. Unlike existing MRI synthesis models, our approach is inspired by the physics that drives contrast changes between HF and ULF MRIs. Our forward model simulates an HF-to-ULF transformation by estimating the tissue-type Signal-to-Noise ratio (SNR) values based on target contrast values. For the Super-Resolution task, we used an Implicit Neural Representation (INR) network to synthesize an HF image by simultaneously predicting tissue-type segmentations and image intensity without observed HF data. The proposed method is evaluated using synthetic ULF-like data generated from standard 3T T$_1$-weighted images for qualitative assessments and paired 3T-64mT T$_1$-weighted images for validation experiments. WM-GM contrast improved by 52% in synthetic ULF-like images and 37% in 64mT images. Sensitivity experiments demonstrated the robustness of our forward model to variations in target contrast, noise and initial seeding.
[33] InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
Daniel Gilo,Or Litany
Main category: cs.CV
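TL;DR: Proposes the I-Mix2Mix framework, which distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model to enable multi-view image editing from sparse inputs and improve cross-view consistency.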
Details
Motivation: Existing methods for multi-view image editing from sparse input views tend to produce artifacts and inconsistent results and struggle to maintain cross-view consistency.
Method: Proposes InstructMix2Mix (I-Mix2Mix), which replaces the conventional neural field consolidator in Score Distillation Sampling with a multi-view diffusion model and introduces incremental updates, a specialized teacher noise scheduler, and an attention modification to strengthen cross-view consistency.
Result: Experiments show that I-Mix2Mix significantly improves multi-view consistency while maintaining per-frame edit quality.
Conclusion: I-Mix2Mix effectively addresses the challenge of multi-view image editing from sparse inputs, achieving high-quality, consistent edits through knowledge distillation and structural improvements.
Abstract: We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
[34] Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis
Zehao Liu,Wejieying Ren,Jipeng Zhang,Tianxiang Zhao,Jingxi Zhu,Xiaoting Li,Vasant G. Honavar
Main category: cs.CV
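TL;DR: This paper proposes SkinR1, a new dermatological vision-language model that combines textbook-based reasoning with reinforcement learning to address data heterogeneity, the lack of reliable diagnostic rationales, and limited generalization.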
Details
Motivation: Existing vision-language models for dermatological diagnosis are limited by data heterogeneity, a lack of interpretable diagnostic reasoning, and insufficient generalization to large, sparsely annotated data.
Method: Designs a textbook-based reasoning generator that builds hierarchical reasoning trajectories incorporating differential diagnosis, and uses supervised fine-tuning (SFT) to give the model grounded reasoning ability; further proposes a new reinforcement learning paradigm that incorporates the disease hierarchy to transfer these reasoning patterns to large-scale sparse data.
Result: Experiments on multiple dermatology datasets show that SkinR1 achieves superior diagnostic accuracy, and the ablation study confirms the importance of the reasoning foundation established by SFT.
Conclusion: Through a unified end-to-end framework, SkinR1 improves the accuracy, interpretability, and generalization of dermatological diagnosis models, providing a more trustworthy AI solution for clinical use.
Abstract: The emergence of vision-language models (VLMs) has opened new possibilities for clinical reasoning and has shown promising performance in dermatological diagnosis. However, their trustworthiness and clinical utility are often limited by three major factors: (1) Data heterogeneity, where diverse datasets lack consistent diagnostic labels and clinical concept annotations; (2) Absence of grounded diagnostic rationales, leading to a scarcity of reliable reasoning supervision; and (3) Limited scalability and generalization, as models trained on small, densely annotated datasets struggle to transfer nuanced reasoning to large, sparsely-annotated ones. To address these limitations, we propose SkinR1, a novel dermatological VLM that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories, providing reliable expert-level supervision. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT), empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the hierarchical structure of diseases, effectively transfers these grounded reasoning patterns to large-scale, sparse data. Extensive experiments on multiple dermatology datasets demonstrate that SkinR1 achieves superior diagnostic accuracy. The ablation study demonstrates the importance of the reasoning foundation instilled by SFT.
[35] FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding
Zhenshi Li,Weikang Yu,Dilxat Muhtar,Xueliang Zhang,Pengfeng Xiao,Pedram Ghamisi,Xiao Xiang Zhu
Main category: cs.CV
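TL;DR: This paper proposes FarSLIP, a fine-grained pretraining framework for remote sensing image-text alignment. It builds MGRS-200k, the first multi-granularity remote sensing image-text dataset, and uses patch-to-patch distillation and CLS token-based region alignment to improve spatial awareness while preserving semantic coherence.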
Details
Motivation: Existing CLIP models lack fine-grained spatial awareness in remote sensing, and current remote sensing image-text datasets and region alignment methods fail to exploit object-level supervision and degrade semantic coherence.
Method: Builds the multi-granularity remote sensing image-text dataset MGRS-200k; proposes the FarSLIP framework, which uses patch-to-patch distillation to align local and global visual features and CLS token-based region-category alignment to strengthen spatial awareness.
Result: FarSLIP achieves state-of-the-art performance on remote sensing open-vocabulary semantic segmentation, zero-shot image classification, and image-text retrieval.
Conclusion: Through an improved distillation strategy and region alignment scheme, FarSLIP effectively improves fine-grained alignment of vision-language models in remote sensing, balancing feature discriminability and semantic coherence.
Abstract: As CLIP's global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; (2) despite the success of region-text alignment methods in the general domain, their direct application to RS data often leads to performance degradation. To address these, we construct the first multi-granularity RS image-text dataset, MGRS-200k, featuring rich object-level textual supervision for RS region-category alignment. We further investigate existing fine-grained CLIP tuning strategies and find that current explicit region-text alignment methods, whether in a direct or indirect way, underperform due to severe degradation of CLIP's semantic coherence. Building on these, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework. Rather than the commonly used patch-to-CLS self-distillation, FarSLIP employs patch-to-patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region-text supervision, it employs simple CLS token-based region-category alignment rather than explicit patch-level alignment, further enhancing spatial awareness. FarSLIP features improved fine-grained vision-language alignment in the RS domain and sets a new state of the art not only on RS open-vocabulary semantic segmentation, but also on image-level tasks such as zero-shot classification and image-text retrieval. Our dataset, code, and models are available at https://github.com/NJU-LHRS/FarSLIP.
[36] nnMIL: A generalizable multiple instance learning framework for computational pathology
Xiangde Luo,Jinxi Xiang,Yuanfeng Ji,Ruijiang Li
Main category: cs.CV
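TL;DR: nnMIL is a simple and broadly applicable multiple-instance learning framework that introduces random sampling at the patch and feature levels to connect patch-level foundation models with robust slide-level clinical inference, outperforming existing methods across a range of clinical tasks.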
Details
Motivation: 现有的聚合补丁级特征到幻灯片级预测的方法受限于设计局限性,阻碍了泛化性和可靠性。 Method: 开发了nnMIL框架,采用随机采样(补丁和特征层面),结合轻量级聚合器进行滑动窗口推理,实现高效训练和不确定性估计。 Result: 在40,000张全切片图像和35项临床任务中,nnMIL在疾病诊断、组织学分型、分子生物标志物检测和泛癌预后预测方面 consistently 优于现有MIL方法,并表现出良好的跨模型泛化能力和可靠的不确定性量化。 Conclusion: nnMIL为将病理学基础模型转化为具有临床意义的预测提供了实用且可推广的解决方案,推动了可靠AI系统在现实环境中的发展与部署。 Abstract: Computational pathology holds substantial promise for improving diagnosis and guiding treatment decisions. Recent pathology foundation models enable the extraction of rich patch-level representations from large-scale whole-slide images (WSIs), but current approaches for aggregating these features into slide-level predictions remain constrained by design limitations that hinder generalizability and reliability. Here, we developed nnMIL, a simple yet broadly applicable multiple-instance learning framework that connects patch-level foundation models to robust slide-level clinical inference. nnMIL introduces random sampling at both the patch and feature levels, enabling large-batch optimization, task-aware sampling strategies, and efficient and scalable training across datasets and model architectures. A lightweight aggregator performs sliding-window inference to generate ensemble slide-level predictions and supports principled uncertainty estimation. Across 40,000 WSIs encompassing 35 clinical tasks and four pathology foundation models, nnMIL consistently outperformed existing MIL methods for disease diagnosis, histologic subtyping, molecular biomarker detection, and pan- cancer prognosis prediction. It further demonstrated strong cross-model generalization, reliable uncertainty quantification, and robust survival stratification in multiple external cohorts. 
In conclusion, nnMIL offers a practical and generalizable solution for translating pathology foundation models into clinically meaningful predictions, advancing the development and deployment of reliable AI systems in real-world settings.
[37] X-WIN: Building Chest Radiograph World Model via Predictive Sensing
Zefan Yang,Ge Wang,James Hendler,Mannudeep K. Kalra,Pingkun Yan
Main category: cs.CV
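TL;DR: Proposes X-WIN, a CXR world model that distills volumetric knowledge from chest CT by predicting its 2D projections in latent space, improving disease diagnosis and representation learning.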
Details
Motivation: Because chest X-rays are 2D projection images, structural superposition prevents them from capturing 3D anatomy, which limits representation learning and disease diagnosis.
Method: X-WIN distills volumetric knowledge from chest CT by predicting multi-view 2D projections in latent space; it adds an affinity-guided contrastive alignment loss and combines masked image modeling with a domain classifier to improve adaptability.
Result: X-WIN outperforms existing foundation models on diverse downstream tasks under linear probing and few-shot fine-tuning, and can render 2D projections for reconstructing a 3D CT volume.
Conclusion: By incorporating 3D anatomical knowledge, X-WIN improves CXR representations and performs strongly on both diagnostic tasks and 3D reconstruction.
Abstract: Chest X-ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail to capture 3D anatomies. This limitation makes representation learning and disease diagnosis challenging. To address this challenge, we propose a novel CXR world model named X-WIN, which distills volumetric knowledge from chest computed tomography (CT) by learning to predict its 2D projections in latent space. The core idea is that a world model with internalized knowledge of 3D anatomical structure can predict CXRs under various transformations in 3D space. During projection prediction, we introduce an affinity-guided contrastive alignment loss that leverages mutual similarities to capture rich, correlated information across projections from the same volume. To improve model adaptability, we incorporate real CXRs into training through masked image modeling and employ a domain classifier to encourage statistically similar representations for real and simulated CXRs. Comprehensive experiments show that X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning. X-WIN also demonstrates the ability to render 2D projections for reconstructing a 3D CT volume.
[38] CPSL: Representing Volumetric Video via Content-Promoted Scene Layers
Kaiyuan Hu,Yili Jin,Junhua Liu,Xize Duan,Hong Kang,Xue Liu
Main category: cs.CV
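TL;DR: Proposes Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation that uses depth and saliency guidance to decompose frames into geometry-consistent layers, achieving high-quality novel-view synthesis while substantially reducing storage and rendering cost.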
Details
Motivation: Existing volumetric video representations are costly to capture, compute, and render, which limits their scalability for on-demand video and real-time communication.
Method: Guided by per-frame depth and content saliency, each frame is decomposed into a small set of geometry-consistent layers with soft alpha bands and an edge-depth cache; motion-guided propagation and per-layer encoding maintain temporal coherence.
Result: Across multiple benchmarks, CPSL outperforms layer-based and neural-field methods in perceptual quality and boundary fidelity while substantially reducing storage and rendering overhead.
Conclusion: CPSL offers a practical path from 2D video to scalable 2.5D immersive media.
Abstract: Volumetric video enables immersive and interactive visual experiences by supporting free viewpoint exploration and realistic motion parallax. However, existing volumetric representations, from explicit point clouds to implicit neural fields, remain costly in capture, computation, and rendering, which limits their scalability for on-demand video and reduces their feasibility for real-time communication. To bridge this gap, we propose Content-Promoted Scene Layers (CPSL), a compact 2.5D video representation that brings the perceptual benefits of volumetric video to conventional 2D content. Guided by per-frame depth and content saliency, CPSL decomposes each frame into a small set of geometry-consistent layers equipped with soft alpha bands and an edge-depth cache that jointly preserve occlusion ordering and boundary continuity. These lightweight, 2D-encodable assets enable parallax-corrected novel-view synthesis via depth-weighted warping and front-to-back alpha compositing, bypassing expensive 3D reconstruction. Temporally, CPSL maintains inter-frame coherence using motion-guided propagation and per-layer encoding, supporting real-time playback with standard video codecs. Across multiple benchmarks, CPSL achieves superior perceptual quality and boundary fidelity compared with layer-based and neural-field baselines while reducing storage and rendering cost severalfold. Our approach offers a practical path from 2D video to scalable 2.5D immersive media.
[39] Unsupervised Discovery of Long-Term Spatiotemporal Periodic Workflows in Human Activities
Fan Yang,Quanting Xie,Atsunori Moteki,Shoichi Masui,Shan Jiang,Yonatan Bisk,Graham Neubig
Main category: cs.CV
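TL;DR: This paper introduces a new benchmark of 580 multimodal human activity sequences for studying long-term periodic workflows, with three evaluation tasks aligned with real-world applications. It also proposes a lightweight, training-free baseline that substantially outperforms existing methods on all tasks and requires no annotation or retraining in deployment.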
Details
Motivation: Long-term periodic workflows exhibit low-contrast patterns, yet existing research focuses mostly on short-term, structurally simple human activities, so new benchmarks and methods are needed to fill the gap.
Method: A benchmark of 580 multimodal sequences with long-term periodic workflows is built, supporting three tasks: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. A lightweight, training-free baseline is proposed to model diverse periodic patterns.
Result: The benchmark is challenging for both existing unsupervised methods and LLM-based zero-shot approaches; the baseline substantially outperforms competing methods on all tasks; in real-world settings it performs on par with traditional supervised methods without requiring annotation or retraining.
Conclusion: The work fills a gap in the analysis of long-term, low-contrast periodic human activities, and the benchmark and training-free baseline provide an efficient, deployable solution for practical applications.
Abstract: Periodic human activities with implicit workflows are common in manufacturing, sports, and daily life. While short-term periodic activities, characterized by simple structures and high-contrast patterns, have been widely studied, long-term periodic workflows with low-contrast patterns remain largely underexplored. To bridge this gap, we introduce the first benchmark comprising 580 multimodal human activity sequences featuring long-term periodic workflows. The benchmark supports three evaluation tasks aligned with real-world applications: unsupervised periodic workflow detection, task completion tracking, and procedural anomaly detection. We also propose a lightweight, training-free baseline for modeling diverse periodic workflow patterns. Experiments show that: (i) our benchmark presents significant challenges to both unsupervised periodic detection methods and zero-shot approaches based on powerful large language models (LLMs); (ii) our baseline outperforms competing methods by a substantial margin in all evaluation tasks; and (iii) in real-world applications, our baseline demonstrates deployment advantages on par with traditional supervised workflow detection approaches, eliminating the need for annotation and retraining. Our project page is https://sites.google.com/view/periodicworkflow.
[40] RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems
Jaro Meyer,Frédéric Giraud,Joschua Wüthrich,Marc Pollefeys,Philipp Fürnstahl,Lilian Calvet
Main category: cs.CV
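TL;DR: Proposes a low-cost, general-purpose multi-view video synchronization method based on an LED Clock, achieving millisecond-level temporal alignment across diverse camera systems in both visible and infrared modalities, validated on real surgical recordings.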
Details
Motivation: In heterogeneous camera systems (e.g., professional and consumer-grade devices, visible and infrared sensors) that lack hardware synchronization, precise spatiotemporal alignment is difficult, especially in real-world environments where capture conditions cannot be controlled.
Method: A custom-built LED Clock encodes time with red and infrared LEDs; the start and end of each exposure window are visually decoded from video frames to achieve millisecond-level synchronization.
Result: Against hardware synchronization, the residual error is 1.34 ms RMSE; the method outperforms light-, audio-, and timecode-based synchronization in several experiments, improves downstream tasks such as multi-view pose estimation and 3D reconstruction, and is validated in large-scale surgical recordings with over 25 heterogeneous cameras.
Conclusion: The method simplifies multi-camera synchronization, supports RGB and IR modalities, and broadens advanced vision-based sensing in unconstrained environments such as industrial and clinical settings.
Abstract: Accurate spatiotemporal alignment of multi-view video streams is essential for a wide range of dynamic-scene applications such as multi-view 3D reconstruction, pose estimation, and scene understanding. However, synchronizing multiple cameras remains a significant challenge, especially in heterogeneous setups combining professional and consumer-grade devices, visible and infrared sensors, or systems with and without audio, where common hardware synchronization capabilities are often unavailable. This limitation is particularly evident in real-world environments, where controlled capture conditions are not feasible. In this work, we present a low-cost, general-purpose synchronization method that achieves millisecond-level temporal alignment across diverse camera systems while supporting both visible (RGB) and infrared (IR) modalities. The proposed solution employs a custom-built LED Clock that encodes time through red and infrared LEDs, allowing visual decoding of the exposure window (start and end times) from recorded frames for millisecond-level synchronization. We benchmark our method against hardware synchronization and achieve a residual error of 1.34 ms RMSE across multiple recordings. In further experiments, our method outperforms light-, audio-, and timecode-based synchronization approaches and directly improves downstream computer vision tasks, including multi-view pose estimation and 3D reconstruction. Finally, we validate the system in large-scale surgical recordings involving over 25 heterogeneous cameras spanning both IR and RGB modalities.
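To make the reported residual concrete, the sketch below shows how an RMSE against a hardware-synchronized reference could be computed from decoded exposure windows. It is an illustration only: the helper names `mid_exposure` and `rmse`, the choice of the exposure midpoint as the frame timestamp, and all numbers are assumptions for the example, not values or code from the paper.

```python
import math


def mid_exposure(start_ms, end_ms):
    """Frame timestamp taken as the midpoint of the decoded exposure window (an assumption)."""
    return 0.5 * (start_ms + end_ms)


def rmse(estimated_ms, reference_ms):
    """Root-mean-square error between estimated and reference timestamps, in milliseconds."""
    if len(estimated_ms) != len(reference_ms) or not estimated_ms:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimated_ms, reference_ms)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical exposure windows (start, end) decoded from three frames, in ms.
decoded_windows = [(0.0, 8.0), (33.0, 41.0), (67.0, 75.0)]
estimated = [mid_exposure(start, end) for start, end in decoded_windows]

# Hypothetical hardware-synchronized reference timestamps for the same frames.
reference = [5.0, 36.0, 72.0]

residual = rmse(estimated, reference)
print(f"residual RMSE: {residual:.2f} ms")  # prints "residual RMSE: 1.00 ms"
```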
This solution simplifies and streamlines the synchronization pipeline and expands access to advanced vision-based sensing in unconstrained environments, including industrial and clinical applications.
[41] Artificial intelligence approaches for energy-efficient laser cutting machines
Mohamed Abdallah Salem,Hamdy Ahmed Ashour,Ahmed Elshenawy
Main category: cs.CV
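TL;DR: This study proposes a deep-learning-based adaptive control method that regulates the power of a CO2 laser cutter's smoke suction pump in a closed loop, substantially reducing energy consumption.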
Details
Motivation: Smoke suction systems in laser cutting typically run open-loop, which wastes energy and provides no dynamic response to different materials and smoke levels.
Method: A closed-loop control system combines material classification, using either a customized CNN with lens-less speckle sensing or a USB camera with VGG16 transfer learning, with a separate deep learning model for smoke level detection, and dynamically adjusts the suction pump power.
Result: Experiments show a 20% to 50% reduction in the suction pump's energy consumption, with automatic shutdown during inactive periods.
Conclusion: The method effectively improves the energy efficiency of laser cutting systems and contributes to sustainable manufacturing.
Abstract: This research addresses the significant challenges of energy consumption and environmental impact in laser cutting by proposing novel deep learning (DL) methodologies to achieve energy reduction. Recognizing the current lack of adaptive control and the open-loop nature of CO2 laser suction pumps, this study utilizes closed-loop configurations that dynamically adjust pump power based on both the material being cut and the smoke level generated. To implement this adaptive system, diverse material classification methods are introduced, including techniques leveraging lens-less speckle sensing with a customized Convolutional Neural Network (CNN) and an approach using a USB camera with transfer learning via the pre-trained VGG16 CNN model. Furthermore, a separate DL model for smoke level detection is employed to simultaneously refine the pump's power output. This integration prompts the exhaust suction pump to automatically halt during inactive times and dynamically adjust power during operation, leading to experimentally proven and remarkable energy savings, with results showing a 20% to 50% reduction in the smoke suction pump's energy consumption, thereby contributing substantially to sustainable development in the manufacturing sector.
[42] EGSA-PT: Edge-Guided Spatial Attention with Progressive Training for Monocular Depth Estimation and Segmentation of Transparent Objects
Gbenga Omotara,Ramy Farag,Seyed Mohamad Ali Tousi,G. N. DeSouza
Main category: cs.CV
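TL;DR: This paper proposes Edge-Guided Spatial Attention (EGSA), a fusion mechanism that uses boundary information to improve the fusion of semantic and geometric features, raising depth estimation and segmentation performance for transparent objects. Combined with a multi-modal progressive training strategy, it achieves marked gains in transparent regions without ground-truth depth labels.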
Details
Motivation: Transparent object perception is challenging because transparency confounds both depth estimation and semantic segmentation, and existing multi-task learning methods often suffer from negative cross-task interactions. A method is needed that mitigates these interactions and improves perception in transparent regions.
Method: Edge-Guided Spatial Attention (EGSA) uses edge information to guide the fusion of semantic and geometric features. A multi-modal progressive training strategy first learns from edges extracted from RGB images and then transitions to edges derived from predicted depth maps, removing the need for ground-truth depth during training.
Result: On the Syn-TODD and ClearPose benchmarks, EGSA improves depth accuracy over the current state-of-the-art method (MODEST), most noticeably in transparent regions, while keeping segmentation performance competitive.
Conclusion: Edge-guided feature fusion and progressive training effectively mitigate negative interactions in multi-task learning and provide a robust, efficient approach to transparent object perception.
Abstract: Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi-task learning frameworks to improve robustness, yet negative cross-task interactions often hinder performance. In this work, we introduce Edge-Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both Syn-TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state-of-the-art method (MODEST), while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Besides our fusion design, our second contribution is a multi-modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures contained in RGB images, and then switch to more relevant geometric content in depth maps, while it eliminates the need for ground-truth depth at training time. Together, these contributions highlight edge-guided fusion as a robust approach capable of improving transparent object perception.
[43] Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation
Nicholas Cooper,Lijun Chen,Sailesh Dwivedy,Danna Gurari
Main category: cs.CV
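TL;DR: Proposes a knowledge distillation framework that uses feature-based losses only, with no logit-based losses such as cross entropy, and selects the best teacher layers with a knowledge quality metric, substantially improving student performance.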
Details
Motivation: Existing knowledge distillation methods usually transfer knowledge through both logits and intermediate-layer features, but logit-based losses may limit feature transfer, which motivates a purely feature-based approach.
Method: A distillation framework that uses only intermediate-layer feature losses, together with a knowledge quality metric based on the geometry of latent representations for choosing the most effective teacher layers.
Result: On three image classification datasets with four student-teacher pairs (spanning CNNs and vision transformers), the method improves top-1 accuracy by up to 15% over standard approaches and reaches state-of-the-art performance.
Conclusion: Feature-only distillation can effectively replace conventional logit-based approaches, and selecting teacher layers by knowledge quality substantially improves the student, suggesting a new direction for knowledge distillation.
Abstract: Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
[44] Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Vladimir Arkhipkin,Vladimir Korviakov,Nikolai Gerasimenko,Denis Parkhomenko,Viacheslav Vasilev,Alexey Letunovskiy,Maria Kovaleva,Nikolai Vaulin,Ivan Kirillov,Lev Novitskiy,Denis Koposov,Nikita Kiselev,Alexander Varlamov,Dmitrii Mikhailov,Vladimir Polovnikov,Andrey Shutkin,Ilya Vasiliev,Julia Agafonova,Anastasiia Kargapoltseva,Anna Dmitrienko,Anastasia Maltseva,Anna Averchenkova,Olga Kim,Tatiana Nikulina,Denis Dimitrov
Main category: cs.CV
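TL;DR: This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video generation, and describes its data processing pipeline, architectural optimizations, and training methods, with the aim of advancing the development and accessibility of high-quality generative models.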
Details
Motivation: To improve the quality and efficiency of high-resolution image and video generation by developing a scalable, high-performance, publicly available generative framework.
Method: A multi-stage training pipeline comprising large-scale pre-training, self-supervised fine-tuning (SFT), and reinforcement learning (RL)-based post-training, supported by data collection, processing, filtering, and clustering, along with novel architectural, training, and inference optimizations.
Result: State-of-the-art performance on image and video generation tasks with high generation speed, as demonstrated by human evaluation; model line-ups are released at 6B, 2B, and 19B parameters.
Conclusion: Kandinsky 5.0 is an efficient, high-quality open-source generative framework that supports a range of generation tasks, and its released code and checkpoints are intended to advance research on generative models.
Abstract: This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core line-ups of models: Kandinsky 5.0 Image Lite, a line-up of 6B parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (including collection, processing, filtering and clustering) for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.
[45] FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation
Yueru He,Xueqing Peng,Yupeng Cao,Yan Wang,Lingfei Qian,Haohang Li,Yi Han,Ruoyu Xiang,Mingquan Lin,Prayag Tiwari,Jimin Huang,Guojun Xiong,Sophia Ananiadou
Main category: cs.CV
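TL;DR: FinCriticalED is a visual benchmark for evaluating OCR and vision-language models on financial documents at the fact level. Using expert-annotated numerical and temporal facts, it emphasizes domain-critical factual correctness rather than surface text similarity.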
Details
Motivation: Traditional OCR metrics such as ROUGE and edit distance measure only surface text similarity and miss small but critical errors in financial documents (e.g., sign inversion, shifted dates) that lead to materially different interpretations, so a benchmark based on factual correctness is needed.
Method: A dataset of 500 image-HTML pairs is built, each with numerical and temporal facts created and verified by financial experts, together with an LLM-as-Judge evaluation pipeline that performs structured fact extraction and contextual verification.
Result: Even the strongest proprietary models still make substantial errors in visually intricate numerical and temporal contexts, showing that current models fall short in high-precision settings.
Conclusion: FinCriticalED provides a rigorous basis for evaluating visual factual accuracy in finance and other precision-critical domains, shifting evaluation from lexical overlap to domain-critical factual correctness.
Abstract: We introduce FinCriticalED (Financial Critical Error Detection), a visual benchmark for evaluating OCR and vision language models on financial documents at the fact level. Financial documents contain visually dense and table heavy layouts where numerical and temporal information is tightly coupled with structure. In high stakes settings, small OCR mistakes such as sign inversion or shifted dates can lead to materially different interpretations, while traditional OCR metrics like ROUGE and edit distance capture only surface level text similarity. FinCriticalED provides 500 image-HTML pairs with expert annotated financial facts covering over seven hundred numerical and temporal facts. It introduces three key contributions. First, it establishes the first fact level evaluation benchmark for financial document understanding, shifting evaluation from lexical overlap to domain critical factual correctness. Second, all annotations are created and verified by financial experts with strict quality control over signs, magnitudes, and temporal expressions. Third, we develop an LLM-as-Judge evaluation pipeline that performs structured fact extraction and contextual verification for visually complex financial documents. We benchmark OCR systems, open source vision language models, and proprietary models on FinCriticalED. Results show that although the strongest proprietary models achieve the highest factual accuracy, substantial errors remain in visually intricate numerical and temporal contexts.
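The gap between surface similarity and factual correctness can be illustrated with a small sketch. It is not the benchmark's LLM-as-Judge pipeline: a regular expression stands in for fact extraction, Python's `difflib.SequenceMatcher` stands in for an edit-distance-style metric, and the two strings, the `Net income` field, and the helper names are hypothetical examples.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern for one field; accounting parentheses denote a negative amount.
AMOUNT_AFTER_LABEL = re.compile(r"Net income:\s*(\(?-?[\d,]+(?:\.\d+)?\)?)")


def parse_amount(token):
    """Convert '(1,250)' or '-1,250' to -1250.0 and '1,250' to 1250.0."""
    negative = (token.startswith("(") and token.endswith(")")) or token.startswith("-")
    value = float(token.strip("()-").replace(",", ""))
    return -value if negative else value


def extract_net_income(text):
    """Pull the signed amount that follows the 'Net income:' label."""
    match = AMOUNT_AFTER_LABEL.search(text)
    if match is None:
        raise ValueError("no 'Net income' amount found")
    return parse_amount(match.group(1))


reference = "Net income: (1,250) as of 2023-12-31"   # ground truth: a loss
ocr_output = "Net income: 1,250 as of 2023-12-31"    # OCR dropped the parentheses

# Character-level similarity stays high because only two characters differ.
surface_similarity = SequenceMatcher(None, reference, ocr_output).ratio()
# The extracted fact, however, has the wrong sign.
fact_matches = extract_net_income(reference) == extract_net_income(ocr_output)

print(f"surface similarity: {surface_similarity:.3f}, fact matches: {fact_matches}")
```

A character-level score rewards the OCR output for being nearly identical, while the fact-level check flags it as wrong, which is the kind of sign-inversion error the abstract describes.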
Through quantitative evaluation and expert case studies, FinCriticalED provides a rigorous foundation for advancing visual factual precision in financial and other precision critical domains.
[46] CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification
Zhenyu Cui,Jiahuan Zhou,Yuxin Peng
Main category: cs.CV
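TL;DR: This paper proposes Cross-modality Knowledge Disentanglement and Alignment (CKDA) for visible-infrared lifelong person re-identification (VI-LReID). By separating modality-common and modality-specific knowledge and preserving both in a balanced way, it mitigates catastrophic forgetting and knowledge interference.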
Details
Motivation: Existing VI-LReID methods use cross-modal knowledge distillation to alleviate catastrophic forgetting of old knowledge, but they ignore the mutual interference between acquiring modality-specific knowledge and retaining modality-common knowledge, which leads to collaborative forgetting.
Method: CKDA includes a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module that explicitly disentangle and purify discriminative information across modalities, and a Cross-modal Knowledge Alignment (CKA) module that aligns new and old knowledge in a balanced manner in inter- and intra-modality feature spaces built on dual-modality prototypes.
Result: Extensive experiments on four benchmark datasets verify the effectiveness of CKDA, which outperforms current state-of-the-art methods.
Conclusion: By explicitly disentangling and evenly aligning modality-common and modality-specific knowledge, CKDA mitigates collaborative forgetting in VI-LReID and improves cross-modal person matching under continual learning.
Abstract: Lifelong person Re-IDentification (LReID) aims to match the same person using continuously collected individual data from different scenarios. To achieve continuous all-day person matching across day and night, Visible-Infrared Lifelong person Re-IDentification (VI-LReID) focuses on sequential training on data from visible and infrared modalities and pursues average performance over all data. To this end, existing methods typically exploit cross-modal knowledge distillation to alleviate the catastrophic forgetting of old knowledge. However, these methods ignore the mutual interference of modality-specific knowledge acquisition and modality-common knowledge anti-forgetting, where conflicting knowledge leads to collaborative forgetting. To address the above problems, this paper proposes a Cross-modality Knowledge Disentanglement and Alignment method, called CKDA, which explicitly separates and preserves modality-specific knowledge and modality-common knowledge in a balanced way. Specifically, a Modality-Common Prompting (MCP) module and a Modality-Specific Prompting (MSP) module are proposed to explicitly disentangle and purify discriminative information that coexists and is specific to different modalities, avoiding mutual interference between the two kinds of knowledge. In addition, a Cross-modal Knowledge Alignment (CKA) module is designed to further align the disentangled new knowledge with the old one in two mutually independent inter- and intra-modality feature spaces based on dual-modality prototypes in a balanced manner.
Extensive experiments on four benchmark datasets verify the effectiveness and superiority of our CKDA against state-of-the-art methods. The source code of this paper is available at https://github.com/PKU-ICST-MIPL/CKDA-AAAI2026.
[47] Complex-Valued 2D Gaussian Representation for Computer-Generated Holography
Yicheng Zhan,Xiangjun Gao,Long Quan,Kaan Akşit
Main category: cs.CV
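TL;DR: Proposes a hologram representation based on structured complex-valued 2D Gaussian primitives that substantially reduces the parameter search space and VRAM usage while improving reconstruction quality and optimization speed.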
Details
Motivation: Conventional hologram representations have high storage overhead and large parameter spaces, which limit the scalability and efficiency of computer-generated holography and call for a more efficient representation.
Method: Holograms are represented with structured complex-valued 2D Gaussian primitives; a differentiable rasterizer is combined with a GPU-optimized free-space light propagation kernel for end-to-end training, and a conversion procedure adapts the representation to practical hologram formats.
Result: Compared with existing methods, VRAM usage is up to 2.5x lower and optimization is 50% faster, with higher-fidelity reconstructions and effective suppression of noise artifacts.
Conclusion: By shrinking the parameter search space, the representation makes hologram estimation more scalable and may support next-generation computer-generated holography systems.
Abstract: We propose a new hologram representation based on structured complex-valued 2D Gaussian primitives, which replaces per-pixel information storage and reduces the parameter search space by up to 10:1. To enable end-to-end training, we develop a differentiable rasterizer for our representation, integrated with a GPU-optimized light propagation kernel in free space. Our extensive experiments show that our method achieves up to 2.5x lower VRAM usage and 50% faster optimization while producing higher-fidelity reconstructions than existing methods. We further introduce a conversion procedure that adapts our representation to practical hologram formats, including smooth and random phase-only holograms. Our experiments show that this procedure can effectively suppress noise artifacts observed in previous methods. By reducing the hologram parameter search space, our representation enables a more scalable hologram estimation in the next-generation computer-generated holography systems.
[48] Computer Vision Modeling of the Development of Geometric and Numerical Concepts in Humans
Zekun Wang,Sashank Varma
Main category: cs.CV
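TL;DR: During training, computer vision models show a trajectory of mathematical cognition similar to that of child development, with developmental alignment for a number of geometric and numerical concepts.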
Details
Motivation: To investigate whether computer vision models show a progression during training similar to children's cognitive development, offering a new perspective on how human mathematical thinking develops.
Method: Using ResNet-50 as a case study, the development of its representations of geometric and numerical concepts is analyzed and compared with children's developmental trajectories.
Result: The model shows developmental alignment for Euclidean Geometry, Geometrical Figures, Metric Properties, and Topology, but not for Chiral Figures, Geometric Transformations, or Symmetrical Figures; for number, a human-like "mental number line" representation emerges.
Conclusion: Computer vision models show promise for modeling the development of human mathematical cognition and point to directions for future research.
Abstract: Mathematical thinking is a fundamental aspect of human cognition. Cognitive scientists have investigated the mechanisms that underlie our ability to think geometrically and numerically, to take two prominent examples, and developmental scientists have documented the trajectories of these abilities over the lifespan. Prior research has shown that computer vision (CV) models trained on the unrelated task of image classification nevertheless learn latent representations of geometric and numerical concepts similar to those of adults. Building on this demonstrated cognitive alignment, the current study investigates whether CV models also show developmental alignment: whether their performance improves across training to match the developmental progressions observed in children. In a detailed case study of the ResNet-50 model, we show that this is the case. For the case of geometry and topology, we find developmental alignment for some classes of concepts (Euclidean Geometry, Geometrical Figures, Metric Properties, Topology) but not others (Chiral Figures, Geometric Transformations, Symmetrical Figures). For the case of number, we find developmental alignment in the emergence of a human-like "mental number line" representation with experience. These findings show the promise of computer vision models for understanding the development of mathematical understanding in humans. They point the way to future research exploring additional model architectures and building larger benchmarks.
[49] UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space
Panqi Yang,Haodong Jing,Nanning Zheng,Yongqiang Ma
Main category: cs.CV
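TL;DR: UniHOI is a unified framework that jointly models human-object interaction (HOI) detection and generation through a shared token space, improving knowledge sharing and generalization.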
Details
Motivation: Conventional methods treat HOI detection and generation separately, which limits comprehensive understanding of interactions; unified modeling is needed to enable bidirectional knowledge transfer.
Method: A symmetric interaction-aware attention module and a unified semi-supervised learning paradigm enable bidirectional mapping between images and interaction semantics within a unified token space.
Result: Accuracy improves by 4.9% on long-tailed HOI detection, and interaction metrics improve by 42.0% on open-vocabulary generation tasks.
Conclusion: UniHOI effectively unifies HOI detection and generation and strengthens generalization under limited annotations, advancing comprehensive interaction understanding.
Abstract: In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby effectively promoting knowledge sharing and enhancing generalization. Specifically, we introduce a symmetric interaction-aware attention module and a unified semi-supervised learning paradigm, enabling effective bidirectional mapping between images and interaction semantics even under limited annotations. Extensive experiments demonstrate that UniHOI achieves state-of-the-art performance in both HOI detection and generation. Specifically, UniHOI improves accuracy by 4.9% on long-tailed HOI detection and boosts interaction metrics by 42.0% on open-vocabulary generation tasks.
[50] Hyperspectral Super-Resolution with Inter-Image Variability via Degradation-based Low-Rank and Residual Fusion Method
Yue Wen,Kunjing Yang,Minru Bai
Main category: cs.CV
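TL;DR: Proposes a Degradation-based Low-Rank and Residual Fusion (DLRRF) model that handles inter-image variability between hyperspectral and multispectral images by modeling spectral degradation and recovering lost spatial detail, combining implicit regularization with a PnP framework for high-quality image fusion.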
Details
Motivation: When fusing hyperspectral and multispectral images, spectral variability and spatially localized changes caused by different acquisition conditions (inter-image variability) significantly degrade fusion performance, and applying direct transformations to the images can worsen the ill-posedness of the model.
Method: Spectral variability is modeled as a change in the spectral degradation operator, and the target hyperspectral image is decomposed into low-rank and residual components to recover lost spatial details. Spectral correlation within the images is used for dimensionality reduction, and an implicit regularizer handled by an external denoiser is solved with the PAO algorithm in a PnP framework.
Result: In numerical experiments, DLRRF achieves better fusion performance than existing methods when inter-image variability is present.
Conclusion: Through degradation modeling and component decomposition, DLRRF addresses inter-image variability; combined with implicit regularization and PnP optimization, it improves fusion quality and comes with a convergence analysis.
Abstract: The fusion of hyperspectral image (HSI) with multispectral image (MSI) provides an effective way to enhance the spatial resolution of HSI. However, due to different acquisition conditions, there may exist spectral variability and spatially localized changes between HSI and MSI, referred to as inter-image variability, which can significantly affect the fusion performance. Existing methods typically handle inter-image variability by applying direct transformations to the images themselves, which can exacerbate the ill-posedness of the fusion model. To address this challenge, we propose a Degradation-based Low-Rank and Residual Fusion (DLRRF) model. First, we model the spectral variability as a change in the spectral degradation operator. Second, to recover the lost spatial details caused by spatially localized changes, we decompose the target HSI into low-rank and residual components, where the latter is used to capture the lost details. By exploiting the spectral correlation within the images, we perform dimensionality reduction on both components. Additionally, we introduce an implicit regularizer to utilize the spatial prior information from the images. The proposed DLRRF model is solved using the Proximal Alternating Optimization (PAO) algorithm within a Plug-and-Play (PnP) framework, where the subproblem involving the implicit regularizer is addressed by an external denoiser. We further provide a comprehensive convergence analysis of the algorithm.
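The idea of splitting an image into a low-rank part plus a residual that carries localized detail can be illustrated with a truncated SVD. This is only a sketch of the decomposition concept on synthetic data: the paper solves a fusion model with PAO in a PnP framework, whereas the truncated SVD, the function name `low_rank_plus_residual`, the matrix sizes, and the chosen rank are assumptions made for the example.

```python
import numpy as np


def low_rank_plus_residual(X, rank):
    """Split X (bands x pixels) into its best rank-`rank` approximation and the remainder."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return low_rank, X - low_rank


rng = np.random.default_rng(0)
bands, pixels, rank = 30, 400, 3

# Synthetic "HSI" unfolded to a matrix: a rank-3 spectral structure ...
base = rng.normal(size=(bands, rank)) @ rng.normal(size=(rank, pixels))
# ... plus a few spatially localized changes that a low-rank model cannot express.
detail = np.zeros((bands, pixels))
detail[rng.integers(0, bands, 20), rng.integers(0, pixels, 20)] = 1.0
X = base + detail

L, R = low_rank_plus_residual(X, rank)
print(np.linalg.matrix_rank(L), float(np.linalg.norm(R)))
```

Because the truncated SVD is the best rank-constrained approximation in the Frobenius norm, the residual here is never larger than the injected detail term.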
Finally, extensive numerical experiments demonstrate that DLRRF achieves superior performance in fusing HSI and MSI with inter-image variability.
[51] CellGenNet: A Knowledge-Distilled Framework for Robust Cell Segmentation in Cancer Tissues
Srijan Ray,Bikesh K. Nirala,Jason T. Yustein,Sundaresh Ram
Main category: cs.CV
TL;DR: Proposes CellGenNet, a knowledge-distillation-based cell segmentation framework for robust cross-tissue cell segmentation under limited supervision.
Details
Motivation: Variability in staining, imaging conditions, and tissue morphology makes nuclei segmentation in microscopy whole-slide images challenging, especially when annotated data are scarce.
Method: In a student-teacher architecture, the teacher is trained on sparse annotations and generates soft pseudo-labels; the student is optimized with ground-truth labels, teacher-derived probabilistic targets, and a hybrid loss combining binary cross-entropy and Tversky loss, with consistency regularization and layerwise dropout to stabilize feature representations.
Result: Experiments on whole-slide images from diverse cancer tissues show that CellGenNet improves segmentation accuracy and generalization over supervised and semi-supervised baselines.
Conclusion: CellGenNet improves cross-tissue nuclei segmentation under limited annotation and supports scalable, reproducible histopathology analysis.
Abstract: Accurate nuclei segmentation in microscopy whole slide images (WSIs) remains challenging due to variability in staining, imaging conditions, and tissue morphology. We propose CellGenNet, a knowledge distillation framework for robust cross-tissue cell segmentation under limited supervision. CellGenNet adopts a student-teacher architecture, where a capacity teacher is trained on sparse annotations and generates soft pseudo-labels for unlabeled regions. The student is optimized using a joint objective that integrates ground-truth labels, teacher-derived probabilistic targets, and a hybrid loss function combining binary cross-entropy and Tversky loss, enabling asymmetric penalties to mitigate class imbalance and better preserve minority nuclear structures. Consistency regularization and layerwise dropout further stabilize feature representations and promote reliable feature transfer. Experiments across diverse cancer tissue WSIs show that CellGenNet improves segmentation accuracy and generalization over supervised and semi-supervised baselines, supporting scalable and reproducible histopathology analysis.
[52] ProPL: Universal Semi-Supervised Ultrasound Image Segmentation via Prompt-Guided Pseudo-Labeling
Yaxiong Chen,Qicong Wang,Chunlei Li,Jingliang Hu,Yilei Shi,Shengwu Xiong,Xiao Xiang Zhu,Lichao Mou
Main category: cs.CV
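TL;DR: This paper proposes ProPL, the first framework for universal semi-supervised ultrasound image segmentation, which handles multiple organs and tasks and uses both labeled and unlabeled data to improve performance.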
Details
Motivation: Existing ultrasound image segmentation methods are typically designed for specific anatomical structures or tasks, which limits their clinical applicability and generality.
Method: ProPL pairs a shared vision encoder with prompt-guided dual decoders, achieving flexible task adaptation through a prompting-upon-decoding mechanism and reliable self-training through an uncertainty-driven pseudo-label calibration (UPLC) module.
Result: On a new dataset spanning 5 organs and 8 segmentation tasks, ProPL outperforms state-of-the-art methods across various metrics.
Conclusion: ProPL sets a new benchmark for universal ultrasound image segmentation and advances the field in clinical utility and generalization.
Abstract: Existing approaches for the problem of ultrasound image segmentation, whether supervised or semi-supervised, are typically specialized for specific anatomical structures or tasks, limiting their practical utility in clinical settings. In this paper, we pioneer the task of universal semi-supervised ultrasound image segmentation and propose ProPL, a framework that can handle multiple organs and segmentation tasks while leveraging both labeled and unlabeled data. At its core, ProPL employs a shared vision encoder coupled with prompt-guided dual decoders, enabling flexible task adaptation through a prompting-upon-decoding mechanism and reliable self-training via an uncertainty-driven pseudo-label calibration (UPLC) module. To facilitate research in this direction, we introduce a comprehensive ultrasound dataset spanning 5 organs and 8 segmentation tasks. Extensive experiments demonstrate that ProPL outperforms state-of-the-art methods across various metrics, establishing a new benchmark for universal ultrasound image segmentation.
[53] Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
Keito Sasagawa,Shuhei Kurita,Daisuke Kawahara
Main category: cs.CV
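TL;DR: This study evaluates how well multimodal large language models (MLLMs) read vertically written Japanese text and finds that existing models perform poorly on it. Fine-tuning on a synthetic OCR dataset containing both horizontal and vertical Japanese text substantially improves performance.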
Details
Motivation: Some Japanese documents are written vertically, yet existing MLLMs have not been specifically studied on such text, so their understanding of vertically written Japanese needs to be systematically evaluated and improved.
Method: Builds Japanese text data comprising synthetic and real-world images in horizontal and vertical formats, and uses it to fine-tune and evaluate existing MLLMs.
Result: Existing MLLMs perform markedly worse on vertically written Japanese text than on horizontally written text; after training on the synthetic dataset, their handling of vertical text improves significantly.
Conclusion: Support for vertical text is essential for multilingual document processing, and targeted data synthesis and training can effectively strengthen MLLMs' reading of vertically written Japanese.
Abstract: Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset improves the performance of models that previously could not handle vertical writing. The datasets and code are publicly available at https://github.com/llm-jp/eval_vertical_ja.
[54] Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Cheng Yang,Haiyuan Wan,Yiran Peng,Xin Cheng,Zhaoyang Yu,Jiayi Zhang,Junchi Yu,Xinlei Yu,Xiawu Zheng,Dongzhan Zhou,Chenglin Wu
Main category: cs.CV
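TL;DR: This paper explores whether video models can reason through video generation and introduces VR-Bench, a benchmark that evaluates spatial planning and multi-step reasoning on maze-solving tasks. Experiments show that video models trained with supervised fine-tuning (SFT) outperform existing vision-language models in spatial perception and reasoning generalization, and that diverse sampling at inference time improves reasoning reliability by 10-20%.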
Details
Motivation: Inspired by the progression of language models from text generation to text-based reasoning, the authors ask whether video models can reason through video generation. Because video has explicit spatial layouts and temporal continuity, it is better suited than discrete text as a substrate for spatial reasoning.
Method: Introduces VR-Bench, a benchmark of 7,920 procedurally generated videos covering five maze types and diverse visual styles, which evaluates video models' reasoning on maze-solving tasks; uses supervised fine-tuning (SFT) to elicit reasoning ability and studies the effect of diverse sampling at inference time.
Result: SFT effectively improves the reasoning ability of video models; video models outperform leading VLMs in spatial perception and generalize well across scenarios, tasks, and complexity levels; diverse sampling at inference time improves reasoning reliability by 10-20%.
Conclusion: Video models show potential for spatial reasoning via video generation, the paradigm scales well, and VR-Bench provides an effective evaluation tool for future research.
Abstract: Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench, a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10-20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.
[55] BokehFlow: Depth-Free Controllable Bokeh Rendering via Flow Matching
Yachuan Huang,Xianrui Luo,Qiwen Wang,Liao Shen,Jiaqi Li,Huiqiang Sun,Zihao Huang,Wei Jiang,Zhiguo Cao
Main category: cs.CV
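TL;DR: This paper proposes BokehFlow, a depth-free framework for controllable bokeh rendering that uses flow matching to generate realistic bokeh directly from all-in-focus images, with semantic control over the focus region and blur intensity via text prompts.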
Details
Motivation: Existing controllable bokeh rendering methods rely on accurate depth maps, while generative methods fall short in controllability and efficiency, so a new method is needed that requires no depth input and still offers good control.
Method: Proposes BokehFlow, which adopts a flow matching framework and uses a cross-attention mechanism to provide semantic control over focus regions and blur intensity via text prompts, generating bokeh directly from all-in-focus images.
Result: Experiments on four newly constructed datasets show that BokehFlow outperforms existing depth-dependent and generative methods in both rendering quality and efficiency, and produces visually compelling bokeh images.
Conclusion: BokehFlow achieves high-quality, efficient, controllable bokeh rendering with no depth input, offering a new solution for image aesthetic enhancement.
Abstract: Bokeh rendering simulates the shallow depth-of-field effect in photography, enhancing visual aesthetics and guiding viewer attention to regions of interest. Although recent approaches perform well, rendering controllable bokeh without additional depth inputs remains a significant challenge. Existing classical and neural controllable methods rely on accurate depth maps, while generative approaches often struggle with limited controllability and efficiency. In this paper, we propose BokehFlow, a depth-free framework for controllable bokeh rendering based on flow matching. BokehFlow directly synthesizes photorealistic bokeh effects from all-in-focus images, eliminating the need for depth inputs. It employs a cross-attention mechanism to enable semantic control over both focus regions and blur intensity via text prompts. To support training and evaluation, we collect and synthesize four datasets. Extensive experiments demonstrate that BokehFlow achieves visually compelling bokeh effects and offers precise control, outperforming existing depth-dependent and generative methods in both rendering quality and efficiency.
[56] MambaTrack3D: A State Space Model Framework for LiDAR-Based Object Tracking under High Temporal Variation
Shengjing Tian,Yinan Han,Xiantong Zhao,Xuehu Liu,Qi Lang
Main category: cs.CV
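TL;DR: This paper proposes MambaTrack3D, an efficient 3D single object tracking framework built on the Mamba state space model and designed for outdoor environments with high temporal variation (HTV). Its MIP module performs inter-frame propagation with near-linear complexity and its GFEM module suppresses temporal redundancy; it significantly outperforms existing methods on KITTI-HTV and nuScenes-HTV while remaining competitive on standard datasets.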
Details
Motivation: The high temporal variation of LiDAR point clouds in dynamic outdoor environments causes existing memory-based trackers to suffer from high computational complexity, severe temporal redundancy, and insufficient use of geometric priors.
Method: Proposes MambaTrack3D, comprising a Mamba-based Inter-frame Propagation (MIP) module for efficient cross-frame feature propagation and a Grouped Feature Enhancement Module (GFEM) that separates foreground and background semantics at the channel level to reduce redundancy in the memory bank.
Result: Significantly outperforms existing HTV-oriented and normal-scenario trackers on the KITTI-HTV and nuScenes-HTV benchmarks, with gains of up to 6.5 success and 9.5 precision over HVTrack; remains competitive on the standard KITTI dataset.
Conclusion: MambaTrack3D achieves a superior accuracy-efficiency trade-off and strong generalization, handling both HTV and conventional 3D tracking scenarios robustly.
Abstract: Dynamic outdoor environments with high temporal variation (HTV) pose significant challenges for 3D single object tracking in LiDAR point clouds. Existing memory-based trackers often suffer from quadratic computational complexity, temporal redundancy, and insufficient exploitation of geometric priors. To address these issues, we propose MambaTrack3D, a novel HTV-oriented tracking framework built upon the state space model Mamba. Specifically, we design a Mamba-based Inter-frame Propagation (MIP) module that replaces conventional single-frame feature extraction with efficient inter-frame propagation, achieving near-linear complexity while explicitly modeling spatial relations across historical frames. Furthermore, a Grouped Feature Enhancement Module (GFEM) is introduced to separate foreground and background semantics at the channel level, thereby mitigating temporal redundancy in the memory bank. Extensive experiments on KITTI-HTV and nuScenes-HTV benchmarks demonstrate that MambaTrack3D consistently outperforms both HTV-oriented and normal-scenario trackers, achieving improvements of up to 6.5 success and 9.5 precision over HVTrack under moderate temporal gaps. On the standard KITTI dataset, MambaTrack3D remains highly competitive with state-of-the-art normal-scenario trackers, confirming its strong generalization ability. Overall, MambaTrack3D achieves a superior accuracy-efficiency trade-off, delivering robust performance across both specialized HTV and conventional tracking scenarios.
[57] TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition
Wen Yin,Siyu Zhan,Cencen Liu,Xin Hu,Guiduo Duan,Xiurui Xie,Yuan-Fang Li,Tao He
Main category: cs.CV
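TL;DR: This paper proposes TiCAL, a new multimodal emotion recognition framework that addresses inter-modal emotion conflicts through typicality estimation and a consistency-aware mechanism, and enhances emotion representations in hyperbolic space, significantly improving performance on inconsistent samples.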
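TiCAL is described as embedding features in a hyperbolic space to separate fine-grained emotion categories. The listing does not say which hyperbolic model is used, so the following is only an illustrative sketch that assumes the Poincaré ball model; the function name `poincare_distance` is hypothetical and not taken from the paper.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Assumption: the Poincare ball is used purely for illustration; the
    source only states that features live in "a hyperbolic space".
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

# Distances grow quickly near the boundary, which leaves room for
# fine-grained separation between nearby categories.
near = poincare_distance([0.0, 0.0], [0.5, 0.0])
far = poincare_distance([0.0, 0.0], [0.9, 0.0])
```

In this model the distance from the origin to a point at Euclidean radius r equals 2·artanh(r), so equal Euclidean steps toward the boundary correspond to ever larger hyperbolic distances.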
Details
Motivation: Existing methods mostly rely on unified emotion labels for training and overlook a key challenge: different modalities within the same sample may express divergent emotional tendencies (inter-modal emotion conflicts).
Method: Proposes the TiCAL framework, which combines pseudo unimodal emotion labels with typicality estimation to dynamically assess the consistency of each training sample, and embeds features in hyperbolic space to capture finer-grained distinctions among emotion categories.
Result: Extensive experiments on benchmarks such as CMU-MOSEI and MER2023 show that TiCAL improves on the state-of-the-art method DMD by about 2.6%, mitigating inter-modal emotion conflicts and raising overall recognition accuracy.
Conclusion: By modeling sample-level consistency and using hyperbolic representations, TiCAL handles modality conflicts in multimodal emotion recognition and improves model robustness and accuracy.
Abstract: Multimodal Emotion Recognition (MER) aims to accurately identify human emotional states by integrating heterogeneous modalities such as visual, auditory, and textual data. Existing approaches predominantly rely on unified emotion labels to supervise model training, often overlooking a critical challenge: inter-modal emotion conflicts, wherein different modalities within the same sample may express divergent emotional tendencies. In this work, we address this overlooked issue by proposing a novel framework, Typicality-based Consistent-aware Multimodal Emotion Recognition (TiCAL), inspired by the stage-wise nature of human emotion perception. TiCAL dynamically assesses the consistency of each training sample by leveraging pseudo unimodal emotion labels alongside a typicality estimation. To further enhance emotion representation, we embed features in a hyperbolic space, enabling the capture of fine-grained distinctions among emotional categories. By incorporating consistency estimates into the learning process, our method improves model performance, particularly on samples exhibiting high modality inconsistency. Extensive experiments on benchmark datasets, e.g., CMU-MOSEI and MER2023, validate the effectiveness of TiCAL in mitigating inter-modal emotional conflicts and enhancing overall recognition accuracy, e.g., with an improvement of about 2.6% over the state-of-the-art DMD.
[58] Jointly Conditioned Diffusion Model for Multi-View Pose-Guided Person Image Synthesis
Chengyu Xie,Zhi Gong,Junchi Ren,Linkun Yu,Si Shen,Fei Shen,Xiaoyu Du
Main category: cs.CV
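TL;DR: Proposes a jointly conditioned diffusion model (JCDM) that uses multi-view priors to improve pose-guided person image generation, achieving high fidelity and cross-view consistency.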
Details
Motivation: Single-view reference images suffer from incomplete textures and lack explicit cross-view interaction, which limits pose-guided person image generation.
Method: Designs an appearance prior module (APM) that infers a holistic identity-preserving prior from incomplete references, and a joint conditional injection (JCI) mechanism that fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture.
Result: Experiments show state-of-the-art fidelity and cross-view consistency; the method supports a variable number of reference views and integrates easily with standard diffusion backbones.
Conclusion: JCDM addresses texture completeness and cross-view consistency in multi-view person image generation and offers good flexibility and extensibility.
Abstract: Pose-guided human image generation is limited by incomplete textures from single reference views and the absence of explicit cross-view interaction. We present the jointly conditioned diffusion model (JCDM), a diffusion framework that exploits multi-view priors. The appearance prior module (APM) infers a holistic identity-preserving prior from incomplete references, and the joint conditional injection (JCI) mechanism fuses multi-view cues and injects shared conditioning into the denoising backbone to align identity, color, and texture across poses. JCDM supports a variable number of reference views and integrates with standard diffusion backbones with minimal and targeted architectural modifications. Experiments demonstrate state-of-the-art fidelity and cross-view consistency.
[59] A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Duo Li,Zuhao Yang,Xiaoqin Zhang,Ling Shao,Shijian Lu
Main category: cs.CV
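TL;DR: This paper studies visual token redundancy in discrete diffusion-based multimodal large language models (dMLLMs). It finds that redundancy arises only when from-scratch dMLLMs handle long-answer tasks, and recommends architecture-specific acceleration strategies (layer skipping or progressive pruning) to improve inference efficiency while limiting information loss.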
Details
Motivation: Existing dMLLMs incur significant inference cost because every denoising step requires full-sequence attention, and most optimization methods ignore modality-specific visual token redundancy, so a systematic study of how visual redundancy evolves and how to compress it efficiently is needed.
Method: Analyzes how visual token redundancy evolves across dMLLM architectures and tasks, evaluates the effect of visual token pruning on model responses and efficiency, and compares strategies such as layer skipping and progressive pruning across types of dMLLMs.
Result: Visual redundancy emerges only when from-scratch dMLLMs handle long-answer tasks; pruning introduces non-negligible information loss, which only from-scratch models can recover progressively during late denoising steps; layer skipping suits AR-to-diffusion models, whereas progressive or late-step pruning suits from-scratch dMLLMs.
Conclusion: Different types of dMLLMs call for different efficiency strategies; the study offers a new perspective on efficient dMLLM inference and improves their applicability to multimodal understanding tasks.
Abstract: Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. In addition, we validate that visual token pruning introduces non-negligible information loss in dMLLMs and only from-scratch dMLLMs can recover the lost information progressively during late denoising steps. Furthermore, our study shows that layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs. Overall, this work offers a new perspective on efficiency optimization for dMLLMs, greatly advancing their applicability across various multimodal understanding tasks.
[60] Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting
Junseo Koo,Jinseo Jeong,Gunhee Kim
Main category: cs.CV
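TL;DR: Proposes Gaussian Blending to address the blurring and staircase artifacts that 3D Gaussian Splatting exhibits when zooming in novel view synthesis. The method treats alpha and transmittance as spatially varying distributions, improving rendering quality while maintaining real-time rendering speed.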
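For context, the conventional alpha blending that this entry sets out to replace can be sketched as standard front-to-back compositing, where alpha and transmittance are single scalars per pixel. This is a minimal illustration of that baseline only, not of the proposed Gaussian Blending; the function name `alpha_blend` and the single-channel colors are assumptions made for brevity.

```python
def alpha_blend(splats):
    """Front-to-back compositing for one pixel.

    `splats` is a list of (color, alpha) pairs sorted nearest first.
    Both alpha and transmittance are scalars here, which is the
    conventional per-pixel treatment (illustrative baseline only).
    """
    color = 0.0
    transmittance = 1.0  # fraction of light still unblocked
    for c, a in splats:
        color += transmittance * a * c
        transmittance *= (1.0 - a)
    return color, transmittance

# Two half-transparent white splats: 0.5 from the first, 0.25 from the second.
pixel, remaining = alpha_blend([(1.0, 0.5), (1.0, 0.5)])
```

Because the transmittance is one number for the whole pixel, a fully opaque front splat hides everything behind it regardless of how much of the pixel area it actually covers, which is the scalar treatment that a spatially varying formulation would relax.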
Details
Motivation: Existing 3D Gaussian Splatting methods show noticeable visual discrepancies when synthesizing views at sampling rates unseen during training, such as blurring when zooming in and staircase artifacts when zooming out, which stem from limitations of conventional alpha blending.
Method: Proposes Gaussian Blending as a replacement for conventional alpha blending, treating alpha and transmittance as spatially varying distributions over the pixel area so that background splats can contribute more appropriately to the rendering.
Result: The method outperforms existing novel view synthesis models at both unseen and seen sampling rates, better preserving detail and removing zoom artifacts, with no additional memory cost and real-time rendering performance.
Conclusion: Gaussian Blending addresses a key defect of 3DGS in multi-scale view synthesis, is general and practical, and can be integrated into existing frameworks as a drop-in module.
Abstract: The recent introduction of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis. Several studies have further improved the rendering quality of 3DGS, yet they still exhibit noticeable visual discrepancies when synthesizing views at sampling rates unseen during training. Specifically, they suffer from (i) erosion-induced blurring artifacts when zooming in and (ii) dilation-induced staircase artifacts when zooming out. We speculate that these artifacts arise from the fundamental limitation of the alpha blending adopted in 3DGS methods. Instead of the conventional alpha blending that computes alpha and transmittance as scalar quantities over a pixel, we propose to replace it with our novel Gaussian Blending that treats alpha and transmittance as spatially varying distributions. Thus, transmittances can be updated considering the spatial distribution of alpha values across the pixel area, allowing nearby background splats to contribute to the final rendering. Our Gaussian Blending maintains real-time rendering speed and requires no additional memory cost, while being easily integrated as a drop-in replacement into existing 3DGS-based or other NVS frameworks. Extensive experiments demonstrate that Gaussian Blending effectively captures fine details at various sampling rates unseen during training, consistently outperforming existing novel view synthesis models across both unseen and seen sampling rates.
[61] An Event-triggered System for Social Persuasion and Danger Alert in Elder Home Monitoring
Jun-Yi Liu,Chung-Hao Chen,Ya-Chi Tsao,Ssu-Yao Wu,Yu-Ting Tsao,Lyn Chao-ling Chen
Main category: cs.CV
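TL;DR: Proposes an event-triggered system for monitoring the physical and mental state of elders at home. It detects behavior with GMM background modeling, analyzes images with an SVM, and enables intuitive social interaction with relatives.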
Details
Motivation: Elders have both physical and mental needs and often lack technical experience, so a system is needed that provides safety monitoring and family communication without complex operation.
Method: Uses GMM (Gaussian mixture model) background modeling to detect motion in the watch dog and danger notice events, and SVM machine learning to analyze the captured images; designs an intuitive operation based on daily activity that connects elders and relatives via social media.
Result: Experiments in real-life settings with 5 families detected and recorded three types of events, demonstrating the feasibility and effectiveness of the system.
Conclusion: The system monitors elders' daily behavior, triggers the corresponding events in time, and supports emotional communication with family through simple, natural operation, giving it practical value.
Abstract: In this study, both the physical and the mental state of elders are considered, and an event-triggered system was developed to detect three events: watch dog, danger notice, and photo link. With GMM background modeling, the motion of visitors and of elders can be detected in the watch dog event and the danger notice event, respectively. Experiments were set in home scenarios, and 5 families participated, with three types of events detected and recorded from their daily activities. In addition, the captured images were analyzed using SVM machine learning. Because elders often lack technical experience, an intuitive operation modeled on normal daily activity was designed to create communication between elders and relatives via social media.
[62] Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation
Firdavs Nasriddinov,Rafal Kocielnik,Anima Anandkumar,Andrew J. Hung
Main category: cs.CV
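TL;DR: Proposes a structured pipeline based on a surgical action ontology that extracts Instrument-Action-Target (IAT) triplets from real training transcripts and uses them to guide GPT-4o in generating clinically credible, trainer-style intraoperative feedback, significantly improving generation quality and verifiability.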
Details
Motivation: High-quality real-time feedback in surgical training is crucial for trainees' skill improvement, but automated feedback systems that understand clinically relevant semantic structure are lacking.
Method: (1) Mines and normalizes IAT triplets from real feedback text covering 33 surgeries; (2) fine-tunes a video-to-IAT prediction model that incorporates the surgical procedure, task context, and fine-grained temporal instrument motion; (3) uses the IAT representation to guide GPT-4o in generating feedback text.
Result: On video-to-IAT recognition, AUC improves across the board (Instrument 0.67 to 0.74, Action 0.60 to 0.63, Tissue 0.74 to 0.79); on feedback generation, IAT conditioning raises the mean fidelity score from 2.17 to 2.44 (+12.4%) and doubles the share of admissible feedback (score >= 3) from 21% to 42%; word error rate falls by 15-31% and ROUGE rises by 9-64%.
Conclusion: Grounding generation in explicit IAT structure improves the clinical fidelity and auditability of generated feedback, providing a viable path to verifiable automated surgical training feedback.
Abstract: High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%.
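The headline figures quoted above can be checked with a few lines of arithmetic. The sketch below only recomputes the relative gain from the two reported fidelity scores and the ratio of the two reported admissible shares; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Reported mean fidelity scores: video-only GPT-4o vs. IAT-conditioned.
fidelity_gain = relative_change(2.17, 2.44)

# Reported share of admissible generations (score >= 3), in percent.
admissible_ratio = 42 / 21

print(f"fidelity gain: {fidelity_gain:.1f}%")
print(f"admissible share ratio: {admissible_ratio:.1f}x")
```

Note that the +12.4% is a relative gain on the mean score (an absolute gain of 0.27 points on the 1-5 rubric), whereas the move from 21% to 42% is a doubling of a proportion.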
Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
[63] Unbiased Semantic Decoding with Vision Foundation Models for Few-shot Segmentation
Jin Wang,Bingfeng Zhang,Jian Pang,Weifeng Liu,Baodi Liu,Honglong Chen
Main category: cs.CV
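TL;DR: This paper proposes an Unbiased Semantic Decoding (USD) strategy based on SAM and CLIP for few-shot segmentation, which extracts semantic information jointly from the support and query sets to improve generalization to unknown classes.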
Details
Motivation: Existing methods rely too heavily on prompts extracted from the support set, which biases the decoding process, limits generalization, and makes adaptation to unknown classes difficult.
Method: Proposes the Unbiased Semantic Decoding (USD) strategy, which combines SAM and CLIP with a global semantic supplement at the image level and local location guidance at the pixel level, designs a learnable visual-text target prompt generator, and uses CLIP's semantic alignment capability to enrich SAM features.
Result: Without re-training the vision foundation models, the method yields more consistent, target-focused predictions in few-shot segmentation and improves segmentation of unknown classes.
Conclusion: By introducing CLIP's semantic guidance and two-path feature enhancement, USD mitigates SAM's prompt bias in few-shot segmentation and strengthens generalization and semantic discrimination.
Abstract: Few-shot segmentation has garnered significant attention. Many recent approaches attempt to introduce the Segment Anything Model (SAM) to handle this task. With the strong generalization ability and rich object-specific extraction ability of SAM, such a solution shows great potential in few-shot segmentation. However, the decoding process of SAM relies heavily on accurate and explicit prompts, so previous approaches mainly focus on extracting prompts from the support set, which is insufficient to activate the generalization ability of SAM, and this design easily leads to a biased decoding process when adapting to unknown classes. In this work, we propose an Unbiased Semantic Decoding (USD) strategy integrated with SAM, which extracts target information from both the support and query sets simultaneously to perform consistent predictions guided by the semantics of the Contrastive Language-Image Pre-training (CLIP) model. Specifically, to enhance the unbiased semantic discrimination of SAM, we design two feature enhancement strategies that leverage the semantic alignment capability of CLIP to enrich the original SAM features: a global supplement at the image level that provides a generalized category indication from the support image, and local guidance at the pixel level that provides a useful target location from the query image. In addition, to generate target-focused prompt embeddings, a learnable visual-text target prompt generator is proposed that lets target text embeddings interact with CLIP visual features.
Without requiring re-training of the vision foundation models, the features with semantic discrimination draw attention to the target region through the guidance of prompts rich in target information.
[64] Computer-Use Agents as Judges for Generative User Interface
Kevin Qinghong Lin,Siyuan Hu,Linjie Li,Zhengyuan Yang,Lijuan Wang,Philip Torr,Mike Zheng Shou
Main category: cs.CV
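TL;DR: This paper proposes Coder-CUA, a collaborative framework in which computer-use agents (CUA) act as judges that assist coding language models (Coder) in automatic GUI design. By building the AUI-Gym benchmark and the CUA Dashboard, it optimizes interfaces for agent-native use, with task solvability and navigation success rate as the core criteria.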
Details
Motivation: Existing GUIs are designed mainly for humans, which limits agent efficiency; advances in coding models open new opportunities for automatic GUI design, so it is worth exploring how agents can take part in and optimize the interfaces they use.
Method: Builds the AUI-Gym benchmark with 52 applications and 1560 verifiable tasks, and proposes the Coder-CUA collaboration framework: the Coder acts as Designer and generates websites, the CUA acts as Judge, evaluating functionality and giving feedback, and the CUA Dashboard visualizes navigation histories as redesign guidance.
Result: Achieves automatic GUI optimization driven by task success rate and CUA navigation efficiency, validates agents as design judges, and improves how well GUIs suit agents.
Conclusion: Letting agents actively judge and optimize GUI design shifts digital interfaces from human-centered toward agent-native efficiency and moves agents from passive use to active participation in their environments.
Abstract: Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet most GUIs remain designed primarily for humans (prioritizing aesthetics and usability), forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA act as judges to assist Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability.
Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
[65] WaveFuse-AL: Cyclical and Performance-Adaptive Multi-Strategy Active Learning for Medical Images
Nishchala Thakur,Swati Kochhar,Deepti R. Bathula,Sukrit Gupta
Main category: cs.CV
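TL;DR: Proposes WaveFuse-AL, a multi-strategy active learning framework that fuses several sampling strategies (BALD, BADGE, Entropy, and CoreSet) in a cyclical, performance-adaptive manner. It significantly improves model performance in medical image analysis and outperforms single-strategy and alternating-strategy baselines on multiple benchmark tasks.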
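One way to picture the combination of a cyclical prior with performance-driven adaptation is the sketch below. The listing does not give the actual WaveFuse-AL formulas, so everything here is an assumption made for illustration: the phase-shifted sinusoids, the softmax over recent validation gains, the linear `mix` between the two, and all function and parameter names.

```python
import math

STRATEGIES = ["BALD", "BADGE", "Entropy", "CoreSet"]

def cyclical_prior(round_idx, period=8):
    """Hypothetical sinusoidal prior: each strategy peaks at a different phase."""
    n = len(STRATEGIES)
    raw = [1.0 + math.sin(2.0 * math.pi * (round_idx / period + k / n))
           for k in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def performance_weights(recent_gains, temperature=1.0):
    """Hypothetical adaptation term: softmax over each strategy's recent gain."""
    top = max(recent_gains)
    exps = [math.exp((g - top) / temperature) for g in recent_gains]
    total = sum(exps)
    return [e / total for e in exps]

def fused_weights(round_idx, recent_gains, mix=0.5, period=8):
    """Blend the cyclical prior with the performance term (assumed linear mix)."""
    prior = cyclical_prior(round_idx, period)
    perf = performance_weights(recent_gains)
    fused = [(1.0 - mix) * p + mix * q for p, q in zip(prior, perf)]
    return dict(zip(STRATEGIES, fused))

# Example: round 3, with made-up validation gains per strategy.
weights = fused_weights(3, [0.01, 0.05, 0.02, 0.00])
```

The resulting weights sum to one, so they could be used, for example, to split a fixed annotation budget across the four strategies in each active learning round.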
Details
Motivation: Existing active learning strategies behave inconsistently across learning stages and struggle to keep selecting the most informative samples efficiently, which limits annotation efficiency.
Method: Proposes the WaveFuse-AL framework, which combines cyclical (sinusoidal) temporal priors with a performance-driven adaptive mechanism to dynamically adjust the weights of multiple acquisition strategies (BALD, BADGE, Entropy, CoreSet) and fuse them.
Result: Validated on three medical imaging benchmarks (APTOS-2019, RSNA Pneumonia Detection, and ISIC-2018), WaveFuse-AL achieves statistically significant improvements on 10 of 12 metrics and outperforms single-strategy and alternating-strategy baselines.
Conclusion: WaveFuse-AL makes better use of limited annotation budgets, using dynamic multi-strategy fusion to improve the stability and performance of active learning on medical imaging tasks.
Abstract: Active learning reduces annotation costs in medical imaging by strategically selecting the most informative samples for labeling. However, individual acquisition strategies often exhibit inconsistent behavior across different stages of the active learning cycle. We propose Cyclical and Performance-Adaptive Multi-Strategy Active Learning (WaveFuse-AL), a novel framework that adaptively fuses multiple established acquisition strategies (BALD, BADGE, Entropy, and CoreSet) throughout the learning process. WaveFuse-AL integrates cyclical (sinusoidal) temporal priors with performance-driven adaptation to dynamically adjust strategy importance over time. We evaluate WaveFuse-AL on three medical imaging benchmarks: APTOS-2019 (multi-class classification), RSNA Pneumonia Detection (binary classification), and ISIC-2018 (skin lesion segmentation). Experimental results demonstrate that WaveFuse-AL consistently outperforms both single-strategy and alternating-strategy baselines, achieving statistically significant performance improvements (on ten out of twelve metric measurements) while maximizing the utility of limited annotation budgets.
[66] When to Think and When to Look: Uncertainty-Guided Lookback
Jing Bi,Filippos Bellos,Junjia Guo,Yayuan Li,Chao Huang,Yunlong Tang,Luchuan Song,Susan Liang,Zhongfei Zhang,Jason J. Corso,Chenliang Xu
Main category: cs.CV
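TL;DR: This paper presents the first systematic analysis of how test-time thinking affects visual reasoning in large vision-language models (LVLMs) and finds that more reasoning is not always better: overly long reasoning chains can drift away from the image content. The authors propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts; it significantly improves performance on multiple benchmarks, with the largest gains in categories where standard thinking is weak.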
Details
Motivation: Although test-time reasoning yields gains in large models, its specific effect on visual reasoning in vision-language models has not been systematically analyzed, in particular how image information should be used during intermediate reasoning.
Method: Runs a large-scale, controlled comparison on ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding; introduces uncertainty-guided lookback, which combines an uncertainty signal, adaptive lookback prompts, and breadth search.
Result: Long reasoning chains often produce wrong trajectories that ignore the image and underperform standard instruct mode; successful trajectories are enriched in short lookback phrases; the proposed method achieves the best performance on MMMU and generalizes to five additional benchmarks, including multimodal suites and math-focused visual reasoning datasets.
Conclusion: Longer test-time reasoning is not always better; what matters is looking back at the image at the right moments to strengthen visual grounding. The training-free decoding strategy improves the visual reasoning and robustness of LVLMs and suggests a direction for future reasoning mechanisms.
Abstract: Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets.
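The idea of triggering a lookback prompt from an uncertainty signal can be pictured with the toy sketch below. The listing does not specify which uncertainty measure, threshold, or prompt wording the authors use, so the next-token entropy, the `threshold` value, and the `LOOKBACK_PROMPT` string are all assumptions chosen for illustration.

```python
import math

# Hypothetical prompt text; the actual lookback phrases are not given in the listing.
LOOKBACK_PROMPT = "Looking back at the image, "

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_look_back(probs, threshold=1.0):
    """Assumed rule: look back when next-token entropy exceeds a threshold."""
    return token_entropy(probs) > threshold

def maybe_insert_lookback(text_so_far, probs, threshold=1.0):
    """Append the lookback prompt only when the model appears uncertain."""
    if should_look_back(probs, threshold):
        return text_so_far + " " + LOOKBACK_PROMPT
    return text_so_far

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
```

With these toy distributions, only the uniform one crosses the threshold, so the prompt would be inserted at uncertain steps and skipped at confident ones.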
We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
[67] DCL-SE: Dynamic Curriculum Learning for Spatiotemporal Encoding of Brain Imaging
Meihua Zhou,Xinyu Tong,Jiarui Zhao,Min Cheng,Li Yang,Lei Tian,Nan Wan
Main category: cs.CV
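TL;DR: Proposes DCL-SE, an end-to-end framework that uses data-driven spatiotemporal encoding and a dynamic curriculum learning strategy to achieve higher accuracy, robustness, and interpretability across a range of neuroimaging analysis tasks.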
Details
Motivation: High-dimensional neuroimaging analysis for clinical diagnosis is often limited by insufficient spatiotemporal fidelity and by the poor adaptability of large general-purpose models.
Method: Introduces the DCL-SE framework, which uses Approximate Rank Pooling (ARP) to encode three-dimensional brain data into information-rich two-dimensional dynamic representations and combines it with a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), that progressively trains the decoder from global anatomical structures to fine pathological details.
Result: Validated on six public datasets covering Alzheimer's disease classification, brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE outperforms existing methods in accuracy, robustness, and interpretability.
Conclusion: Compact, task-specific architectures remain important for neuroimaging analysis in the era of large-scale pretrained networks.
Abstract: High-dimensional neuroimaging analyses for clinical diagnosis are often constrained by compromises in spatiotemporal fidelity and by the limited adaptability of large-scale, general-purpose models. To address these challenges, we introduce Dynamic Curriculum Learning for Spatiotemporal Encoding (DCL-SE), an end-to-end framework centered on data-driven spatiotemporal encoding (DaSE). We leverage Approximate Rank Pooling (ARP) to efficiently encode three-dimensional volumetric brain data into information-rich, two-dimensional dynamic representations, and then employ a dynamic curriculum learning strategy, guided by a Dynamic Group Mechanism (DGM), to progressively train the decoder, refining feature extraction from global anatomical structures to fine pathological details. Evaluated across six publicly available datasets, including Alzheimer's disease and brain tumor classification, cerebral artery segmentation, and brain age prediction, DCL-SE consistently outperforms existing methods in accuracy, robustness, and interpretability. These findings underscore the critical importance of compact, task-specific architectures in the era of large-scale pretrained networks.
[68] VisPlay: Self-Evolving Vision-Language Models from Images
Yicheng He,Chengsong Huang,Zongxia Li,Jiaxin Huang,Yonghui Yang
Main category: cs.CV
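TL;DR: VisPlay is a self-evolving reinforcement learning framework that uses unlabeled image data to let vision-language models (VLMs) improve their reasoning autonomously, splitting the model into interacting roles trained with Group Relative Policy Optimization (GRPO) for scalable multimodal intelligence.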
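The "group relative" part of GRPO refers to scoring each sampled response against the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step follows; the reward values are made up, and the use of the population standard deviation and the `eps` constant are implementation assumptions, not details from this listing.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and `eps` are assumptions; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one question, with made-up scalar rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat their group's average receive positive advantages and the rest negative ones; when every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.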
Details
Motivation: Existing reinforcement learning methods rely on human annotation or task-specific heuristics to define rewards, which is costly and hard to scale, so a method is needed that uses large amounts of unlabeled data to improve VLM reasoning autonomously.
Method: Proposes the VisPlay framework, which splits a single base VLM into two interacting roles: an Image-Conditioned Questioner that generates challenging yet answerable questions, and a Multimodal Reasoner that generates silver responses; the roles are trained jointly with Group Relative Policy Optimization (GRPO), with diversity and difficulty rewards to balance question complexity against answer quality.
Result: Trained on the Qwen2.5-VL and MiMo-VL model families, VisPlay shows consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU.
Conclusion: VisPlay demonstrates a viable path to self-evolving VLMs from unlabeled data and offers a new direction for building scalable multimodal intelligent systems.
Abstract: Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model to two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
[69] SceneEdited: A City-Scale Benchmark for 3D HD Map Updating via Image-Guided Change Detection
Chun-Jung Lin,Tat-Jun Chin,Sourav Garg,Feras Dayoub
Main category: cs.CV
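TL;DR: This paper presents SceneEdited, the first city-scale dataset designed to support HD map maintenance through 3D point cloud updating. It contains more than 800 scenes, 73 km of driving routes, and over 23,000 synthesized object changes, and aims to bridge the gap between 2D change detection and 3D map updating.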
Details
Motivation: Existing methods struggle to update 3D HD maps effectively once environmental changes have been detected; in particular, change detection that relies on 2D images cannot directly support 3D reconstruction, and the field lacks a standard dataset to drive research.
Method: Builds SceneEdited, a large-scale real-world urban dataset containing calibrated RGB images, LiDAR scans, and detailed change masks, with object changes synthesized both manually and automatically in out-of-date scenes; provides a baseline update method built on image-based structure-from-motion (SfM) and an accompanying toolkit.
Result: The dataset covers 73 km of driving routes and about 3 km² of urban area, includes more than 2,000 out-of-date versions and over 23,000 annotated change instances, and supports 3D map updating research with scalability, trackability, and portability.
Conclusion: SceneEdited provides a standardized benchmark for 3D updating of HD maps, fills the research gap between change detection and map updating, and advances map maintenance for autonomous driving and urban perception.
Abstract: Accurate, up-to-date High-Definition (HD) maps are critical for urban planning, infrastructure monitoring, and autonomous navigation. However, these maps quickly become outdated as environments evolve, creating a need for robust methods that not only detect changes but also incorporate them into updated 3D representations. While change detection techniques have advanced significantly, there remains a clear gap between detecting changes and actually updating 3D maps, particularly when relying on 2D image-based change detection. To address this gap, we introduce SceneEdited, the first city-scale dataset explicitly designed to support research on HD map maintenance through 3D point cloud updating. SceneEdited contains over 800 up-to-date scenes covering 73 km of driving and approximately 3 $\text{km}^2$ of urban area, with more than 23,000 synthesized object changes created both manually and automatically across 2000+ out-of-date versions, simulating realistic urban modifications such as missing roadside infrastructure, buildings, overpasses, and utility poles. Each scene includes calibrated RGB images, LiDAR scans, and detailed change masks for training and evaluation. We also provide baseline methods using a foundational image-based structure-from-motion pipeline for updating outdated scenes, as well as a comprehensive toolkit supporting scalability, trackability, and portability for future dataset expansion and unification of out-of-date object annotations. Both the dataset and the toolkit are publicly available at https://github.com/ChadLin9596/ScenePoint-ETK, establishing a standardized benchmark for 3D map updating research.
[70] MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Yushi Huang,Zining Wang,Zhihang Yuan,Yifu Ding,Ruihao Gong,Jinyang Guo,Xianglong Liu,Jun Zhang
Main category: cs.CV
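TL;DR: Proposes MoDES, a training-free adaptive expert-skipping framework for efficient and accurate inference with MoE multimodal large language models, yielding substantial speedups with less performance loss.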
Details
Motivation: Existing expert-skipping methods perform poorly on multimodal large language models because they ignore the heterogeneous contributions of experts across layers and the differing behavior of tokens from different modalities.
Method: Proposes a globally-modulated local gating (GMLG) mechanism and a dual-modality thresholding (DMT) method, combined with a frontier search algorithm that quickly determines the optimal skipping thresholds.
Result: Validated on 3 model series and 13 benchmarks: when skipping 88% of experts, performance improves by up to 10.67%, prefilling is 2.16× faster, and decoding is 1.26× faster.
Conclusion: MoDES is the first training-free expert-skipping framework for MoE MLLMs; it combines high efficiency with high performance and clearly outperforms prior methods.
Abstract: Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they suffer from high computational inefficiency. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
[71] Think Visually, Reason Textually: Vision-Language Synergy in ARC
Beichen Zhang,Yuhang Zang,Xiaoyi Dong,Yuhang Cao,Haodong Duan,Dahua Lin,Jiaqi Wang
Main category: cs.CV
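TL;DR: This paper proposes Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC) to address the difficulty large models have with abstract reasoning from a few examples in ARC-AGI tasks; experiments show that the approach significantly improves performance.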
Details
Motivation: Existing large models perform poorly on abstract reasoning tasks such as ARC-AGI and struggle in particular to induce structured rules from a few examples. Although humans rely on visual abstraction, directly converting the tasks into image inputs actually lowers performance because rule execution becomes imprecise, so the complementary roles of the vision and language modalities need to be explored.
Method: Proposes two strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes the task into subtasks suited to each modality; and (2) Modality-Switch Self-Correction (MSSC), which uses vision to verify text-based reasoning results and correct them.
Result: Across several mainstream large models and ARC-AGI tasks, the method improves performance by up to 4.33% over text-only baselines.
Conclusion: Fusing visual abstraction with linguistic reasoning effectively improves a model's abstraction and reasoning abilities and is an important direction toward human-like general intelligence.
Abstract: Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
[72] Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Songze Li,Mingyu Gao,Tonghua Su,Xu-Yao Zhang,Zhongjie Wang
Main category: cs.CV
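TL;DR: This paper proposes a new multimodal continual instruction tuning method that mitigates catastrophic forgetting by approximating the gradients of old tasks, achieving state-of-the-art performance without model expansion.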
Details
Motivation: Multimodal large models tend to forget old tasks when continually learning new ones; the main challenge is catastrophic forgetting. This paper aims to address it and improve model stability across sequential tasks.
Method: Treats catastrophic forgetting as a problem of missing gradients from old tasks, approximates the missing gradients using the geometry of the parameter space (the directional vector between the current parameters and the previously optimal parameters), combines them with real gradients from a limited replay buffer, and uses a Bernoulli sampling strategy to dynamically balance stability and plasticity.
Result: Experiments on multimodal continual instruction tuning datasets show that the method substantially reduces catastrophic forgetting and achieves state-of-the-art performance without increasing model size.
Conclusion: The method effectively mitigates catastrophic forgetting in multimodal continual learning while keeping the model architecture compact, offering an efficient and practical solution for continual learning.
Abstract: Multimodal continual instruction tuning enables multimodal large language models to sequentially adapt to new tasks while building upon previously acquired knowledge. However, this continual learning paradigm faces the significant challenge of catastrophic forgetting, where learning new tasks leads to performance degradation on previous ones. In this paper, we introduce a novel insight into catastrophic forgetting by conceptualizing it as a problem of missing gradients from old tasks during new task learning. Our approach approximates these missing gradients by leveraging the geometric properties of the parameter space, specifically using the directional vector between current parameters and previously optimal parameters as gradient guidance. This approximated gradient can be further integrated with real gradients from a limited replay buffer and regulated by a Bernoulli sampling strategy that dynamically balances model stability and plasticity. Extensive experiments on multimodal continual instruction tuning datasets demonstrate that our method achieves state-of-the-art performance without model expansion, effectively mitigating catastrophic forgetting while maintaining a compact architecture.
[73] Learning Depth from Past Selves: Self-Evolution Contrast for Robust Depth Estimation
Jing Cao,Kui Jiang,Shenyi Li,Xiaocheng Feng,Yong Huang
Main category: cs.CV
TL;DR: Proposes SEC-Depth, a self-supervised contrastive learning framework that improves the robustness of depth estimation in adverse weather.
Details
Motivation: Existing self-supervised depth estimation methods degrade markedly in adverse weather such as rain and fog, because reduced visibility severely impairs depth prediction.
Method: Designs SEC-Depth, a self-evolution contrastive learning framework that uses intermediate parameters from training to build temporally evolving latency models, and introduces a dynamic update strategy and a self-evolution contrastive loss (SECL) that treats the outputs of historical models as negative samples to adapt the learning objective.
Result: Experiments show that the method integrates seamlessly into a variety of baseline models and significantly improves robustness under adverse weather in zero-shot evaluation.
Conclusion: SEC-Depth's self-evolution contrastive mechanism effectively mitigates the performance degradation caused by adverse weather, senses the severity of weather degradation without manual intervention, and improves generalization in real-world use.
Abstract: Self-supervised depth estimation has gained significant attention in autonomous driving and robotics. However, existing methods exhibit substantial performance degradation under adverse weather conditions such as rain and fog, where reduced visibility critically impairs depth prediction. To address this issue, we propose a novel self-evolution contrastive learning framework called SEC-Depth for self-supervised robust depth estimation tasks. Our approach leverages intermediate parameters generated during training to construct temporally evolving latency models. Using these, we design a self-evolution contrastive scheme to mitigate performance loss under challenging conditions. Concretely, we first design a dynamic update strategy of latency models for the depth estimation task to capture optimization states across training stages. To effectively leverage latency models, we introduce a self-evolution contrastive loss (SECL) that treats outputs from historical latency models as negative samples. This mechanism adaptively adjusts learning objectives while implicitly sensing weather degradation severity, reducing the need for manual intervention. Experiments show that our method integrates seamlessly into diverse baseline models and significantly enhances robustness in zero-shot evaluations.
[74] MMCM: Multimodality-aware Metric using Clustering-based Modes for Probabilistic Human Motion Prediction
Kyotaro Tokoro,Hiromu Taketsugu,Norimichi Ukita
Main category: cs.CV
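TL;DR: This paper proposes MMCM, a multimodality-aware evaluation metric for human motion prediction (HMP) that uses clustering-based mode partitioning to assess the coverage and validity of predicted motions.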
Details
Motivation: Existing evaluation metrics cannot effectively measure the diversity and kinematic validity of the motion distribution in multimodal prediction, so a new metric is needed that evaluates coverage and validity together.
Method: MMCM uses clustering to divide the motion space into modes in order to evaluate coverage, and identifies valid modes from the future motions found in real data in order to evaluate validity.
Result: Experiments show that the proposed clustering defines motion modes sensibly and that MMCM scores multimodal predictions more accurately.
Conclusion: MMCM is a more reasonable and reliable evaluation metric for multimodal human motion prediction, accounting for both the diversity and the kinematic plausibility of predictions.
Abstract: This paper proposes a novel metric for Human Motion Prediction (HMP). Since a single past sequence can lead to multiple possible futures, a probabilistic HMP method predicts such multiple motions. While a single motion predicted by a deterministic method is evaluated only with the difference from its ground truth motion, multiple predicted motions should also be evaluated based on their distribution. For this evaluation, this paper focuses on the following two criteria. (a) Coverage: motions should be distributed among multiple motion modes to cover diverse possibilities. (b) Validity: motions should be kinematically valid as future motions observable from a given past motion. However, existing metrics simply reward widely distributed motions even if these motions are observed in a single mode and kinematically invalid. To resolve these disadvantages, this paper proposes a Multimodality-aware Metric using Clustering-based Modes (MMCM). For (a) coverage, MMCM divides a motion space into several clusters, each of which is regarded as a mode. These modes are used to explicitly evaluate whether predicted motions are distributed among multiple modes. For (b) validity, MMCM identifies valid modes by collecting possible future motions from a motion dataset. Our experiments validate that our clustering yields sensible mode definitions and that MMCM accurately scores multimodal predictions. Code: https://github.com/placerkyo/MMCM
[75] Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
Geon Choi,Hangyul Yoon,Hyunju Shin,Hyunki Park,Sang Hoon Seo,Eunho Yang,Edward Choi
Main category: cs.CV
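TL;DR: Proposes a new instruction-guided lesion segmentation (ILS) paradigm and builds the large-scale MIMIC-ILS dataset for segmenting multiple lesion types in chest X-rays.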
Details
Motivation: Existing lesion segmentation models are limited by the small number of label types and by their reliance on complex expert-level text inputs, which makes them hard to use in practice.
Method: Proposes the instruction-guided lesion segmentation paradigm, builds an automated multimodal pipeline that produces the MIMIC-ILS dataset of 1.1 million instruction-answer pairs, and trains ROSALIA, a vision-language model capable of both segmentation and textual explanation.
Result: ROSALIA achieves high segmentation accuracy and text generation accuracy on the new task, confirming the effectiveness of the MIMIC-ILS dataset and its construction pipeline.
Conclusion: MIMIC-ILS provides an important foundational resource for pixel-level lesion grounding in chest X-ray images and advances user-friendly medical image analysis.
Abstract: The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.
[76] BrainRotViT: Transformer-ResNet Hybrid for Explainable Modeling of Brain Aging from 3D sMRI
Wasif Jalal,Md Nafiu Rahman,M. Sohel Rahman
Main category: cs.CV
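TL;DR: Proposes BrainRotViT, a hybrid model combining a vision transformer with a residual CNN, for accurate brain age estimation from multi-site MRI data, with good generalization and interpretability.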
Details
Motivation: Traditional methods for brain age estimation suffer from dependence on feature engineering, limited receptive fields, and overfitting, while pure transformer models require large amounts of data and compute, so an efficient model that generalizes is needed.
Method: First pretrains a ViT encoder on an auxiliary age and sex classification task to extract features from sagittal slices, then applies the frozen encoder to all slices to produce a matrix of embedding vectors, and finally feeds that matrix into a residual CNN regressor with sex information fused in to predict continuous brain age.
Result: Achieves an MAE of 3.34 years (R² = 0.95) across 11 MRI datasets and MAEs of 3.77 to 5.04 years on 4 independent cohorts, outperforming existing models; attention maps reveal aging-related brain regions such as the cerebellar vermis and the precentral and postcentral gyri.
Conclusion: BrainRotViT is an efficient, interpretable brain age prediction framework that generalizes well, bridges the gap between CNN and transformer approaches, and supports research on aging and neurodegenerative disease.
Abstract: Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson $r=0.98$, Spearman $\rho=0.97$, $R^2=0.95$) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer's disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.
[77] Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition
Raghu Vamsi Chittersu,Yuvraj Singh Rathore,Pranav Adlinge,Kunal Swami
Main category: cs.CV
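TL;DR: This paper proposes Insert In Style, the first zero-shot, high-fidelity generative framework requiring no finetuning, to tackle the insertion of real-world objects into stylized scenes.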
Details
Motivation: Existing reference-based object composition methods perform poorly when inserting real-world objects into stylized domains, and current methods trade off practicality against generative fidelity.
Method: Proposes a unified framework with a multi-stage training strategy that disentangles identity, style, and composition representations, and a specialized masked-attention architecture that enforces the disentanglement during generation.
Result: Trained on a newly built 100k-sample dataset and evaluated on a newly released stylized composition benchmark, the model significantly outperforms existing methods on both identity and style metrics, and user studies confirm its advantage.
Conclusion: Insert In Style achieves zero-shot, high-fidelity object insertion without text prompts or online finetuning, advancing stylized content generation.
Abstract: Reference-based object composition methods fail when inserting real-world objects into stylized domains. This under-explored problem is currently split between practical "blenders" that lack generative fidelity and "generators" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework with two key innovations: (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, and (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation. This approach prevents the concept interference common in general-purpose, unified-attention models. Our framework is trained on a new 100k sample dataset, curated from a novel data pipeline. This pipeline couples large-scale generation with a rigorous, two-stage filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.
[78] Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval
Qing Wang,Chong-Wah Ngo,Ee-Peng Lim
Main category: cs.CV
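TL;DR: This paper proposes a causal-theory-based debiasing method to address representation learning bias in cross-modal retrieval between recipes and food images, and sets a new performance ceiling on the Recipe1M dataset.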
Details
Motivation: Existing methods treat a recipe as text describing the appearance of a dish and overlook the inconsistencies in visual detail caused by factors such as the cooking process, which introduces bias into cross-modal similarity judgment.
Method: Models the bias with causal theory, identifies ingredients as one of the confounders, applies causal intervention via backdoor adjustment, adds a debiasing term to the conventional model, and designs a plug-and-play neural module for multi-label ingredient classification.
Result: On the Recipe1M dataset, achieves retrieval performance of MedR=1 at test sizes of 1K, 10K, and 50K, reaching the current state of the art.
Conclusion: Causal intervention effectively mitigates representation bias in cross-modal retrieval, and the proposed method significantly improves recipe-to-food-image retrieval performance.
Abstract: This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. As the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe as a text source describing the visual appearance of a dish for learning representation, as existing approaches do, creates a bias that misleads image-and-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, due to factors such as the cooking process, dish presentation, and image-capturing conditions. The current representation learning tends to capture dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one of the confounder sources and a simple backdoor adjustment can alleviate the bias. By causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term to remove the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show the oracle performance of retrieval on the Recipe1M dataset to be MedR=1 across the testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, which is essentially a multi-label ingredient classifier for debiasing. New state-of-the-art search performances are reported on the Recipe1M dataset.
[79] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Kishor Datta Gupta,Marufa Kamal,Md. Mahfuzur Rahman,Fahad Rahman,Mohd Ariful Haque,Sunzida Siddique
Main category: cs.CV
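TL;DR: Proposes the Physics-Constrained Multimodal Data Evaluation (PCMDE) metric, which combines large language models, knowledge mapping, and vision-language models to overcome the limitations of existing evaluation methods in semantic and structural accuracy.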
Details
Motivation: Existing evaluation metrics such as BLEU and CIDEr struggle to capture semantic and structural accuracy in domain-specific or context-dependent scenarios.
Method: Uses a three-stage architecture: (1) extracts multimodal features through object detection and vision-language models; (2) applies Confidence-Weighted Component Fusion for adaptive validation; and (3) performs physics-guided reasoning with large language models to enforce structural and relational constraints.
Result: PCMDE outperforms traditional metrics in evaluating structural and semantic consistency, particularly in scenarios that require physical constraints.
Conclusion: PCMDE improves the accuracy and reliability of evaluating multimodal generated content and is especially suited to domain-specific tasks that require physical common sense and structural logic.
Abstract: Current state-of-the-art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. To this end, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge-based mapping and vision-language models to overcome these limitations. The architecture comprises three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.
[80] SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning
Yuhao Shen,Jiahe Qian,Zhangtianyi Chen,Yuanhao He,Juexiao Zhou
Main category: cs.CV
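TL;DR: SkinGPT-R1 is a dermatology-focused vision-language model that improves diagnostic interpretability through explicit step-by-step reasoning; it outperforms existing models on the DermBench benchmark and delivers stable accuracy gains on several dermatology classification tasks.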
Details
Motivation: To improve the interpretability and clinical trustworthiness of vision-language models in dermatological diagnosis, a model with specialized chain-of-thought reasoning ability is needed.
Method: Proposes the SkinGPT-R1 model, builds the DermCoT reasoning corpus of 10,000 filtered cases and 3,000 expert-scored cases, defines the six-dimensional clinical evaluator DermEval and the benchmark DermBench, and optimizes the model with dermatology-aware visual distillation.
Result: On DermBench, SkinGPT-R1 scores 4.031 out of 5 on average, ranks first, and improves on Vision-R1 by about 41%; it is stable and competitive on three dermatology classification benchmarks; ablations show that both DermCoT supervision and visual distillation bring clear gains.
Conclusion: Through specialized chain-of-thought training and visual distillation, SkinGPT-R1 significantly outperforms existing models in the quality and accuracy of dermatological diagnostic reasoning and shows potential for clinical use.
Abstract: We present SkinGPT-R1, a dermatology-focused vision-language model that makes diagnostic chain-of-thought reasoning explicit, step by step, and verifiable. To support skin-specific reasoning, we build DermCoT, a corpus of standardized dermatologic chain-of-thought narratives that combines 10,000 DermEval-filtered training cases with 3,000 dermatologist-scored certified cases, and we define DermEval as a physician-aligned six-dimensional evaluator and DermBench as the corresponding benchmark for dermatologic chain-of-thought quality. On DermBench, across 14 general, reasoning, and medical vision-language models, SkinGPT-R1 achieves an average score of 4.031 out of 5 over the six clinician-defined dimensions, ranks 1st among all systems, and improves the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, SkinGPT-R1 delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision-language models. Ablation results further show that DermCoT-based chain-of-thought supervision provides substantial improvements over the base model and that adding dermatology-aware visual distillation yields consistent additional gains in both narrative quality and recognition.
[81] SplitFlux: Learning to Decouple Content and Style from a Single Image
Yitong Yang,Yinglin Wang,Changshuo Wang,Yongjun Zhang,Ziyang Chen,Shuting He
Main category: cs.CV
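TL;DR: This paper proposes SplitFlux, which, based on a systematic analysis of the Flux model, fine-tunes the single-stream Dream Blocks with LoRA to decouple content and style effectively. It comprises two key components, Rank-Constrained Adaptation and Visual-Gated LoRA, and achieves better content preservation and stylization quality than existing methods across diverse scenarios.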
Details
Motivation: Existing SDXL-based methods struggle to achieve high-quality content-style disentanglement, and the emerging Flux model disentangles poorly because its characteristics have not been fully explored.
Method: Based on a systematic analysis of the Flux model, finds that the single Dream Blocks are essential to generation and that early blocks control content while later blocks control style; accordingly proposes SplitFlux, which disentangles by fine-tuning the single-stream Dream Blocks with LoRA, introduces Rank-Constrained Adaptation to prevent content leakage, and designs Visual-Gated LoRA branches of different ranks to handle primary subject information and details separately.
Result: Experiments show that SplitFlux consistently outperforms current state-of-the-art methods in content preservation and stylization quality across diverse scenarios.
Conclusion: SplitFlux effectively disentangles image content and style, supports seamless re-embedding of the disentangled content into new contexts, and advances customized image generation.
Abstract: Disentangling image content and style is essential for customized image generation. Existing SDXL-based methods struggle to achieve high-quality results, while the recently proposed Flux model fails to achieve effective content-style separation due to its underexplored characteristics. To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) Single Dream Blocks are essential for image generation; and (2) Early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single dream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. It includes two key components: (1) Rank-Constrained Adaptation. To preserve content identity and structure, we compress the rank and amplify the magnitude of updates within specific blocks, preventing content leakage into style blocks. (2) Visual-Gated LoRA. We split the content LoRA into two branches with different ranks, guided by image saliency. The high-rank branch preserves primary subject information, while the low-rank branch encodes residual details, mitigating content overfitting and enabling seamless re-embedding. Extensive experiments demonstrate that SplitFlux consistently outperforms state-of-the-art methods, achieving superior content preservation and stylization quality across diverse scenarios.
[82] Graph Query Networks for Object Detection with Automotive Radar
Loveneet Saini,Hasan Tercan,Tobias Meisen
Main category: cs.CV
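TL;DR: This paper proposes Graph Query Networks (GQN), an attention-based framework for object detection with 3D radar; by constructing object-specific graphs and introducing the EdgeFocus and DeepContext Pooling modules, it substantially improves detection performance on the NuScenes dataset.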
Details
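Motivation: Conventional grid- or sequence-based convolutional and transformer detectors struggle with the sparse, irregular reflections that radar produces because of its long wavelengths, so a new method is needed that can effectively model the relations and contextual information of radar-sensed objects.
Method: Proposes Graph Query Networks (GQN), which model radar-sensed objects as graphs; uses graph queries to attend dynamically over bird's-eye-view space; and designs the EdgeFocus module for relational reasoning and the DeepContext Pooling module for aggregating contextual features.
Result: On the NuScenes dataset, GQN improves relative mAP by up to 53%, including an 8.2% gain over the previous best radar method, while reducing peak graph construction overhead by 80% at moderate computational cost.
Conclusion: Through graph-structured modeling and new attention mechanisms, GQN effectively addresses the sparsity and irregularity of 3D radar object detection and significantly improves detection accuracy and efficiency.
Abstract: Object detection with 3D radar is essential for 360-degree automotive perception, but radar's long wavelengths produce sparse and irregular reflections that challenge traditional grid and sequence-based convolutional and transformer detectors. This paper introduces Graph Query Networks (GQN), an attention-based framework that models objects sensed by radar as graphs, to extract individualized relational and contextual features. GQN employs a novel concept of graph queries to dynamically attend over the bird's-eye view (BEV) space, constructing object-specific graphs processed by two novel modules: EdgeFocus for relational reasoning and DeepContext Pooling for contextual aggregation. On the NuScenes dataset, GQN improves relative mAP by up to +53%, including a +8.2% gain over the strongest prior radar method, while reducing peak graph construction overhead by 80% with moderate FLOPs cost.
[83] Edge-Centric Relational Reasoning for 3D Scene Graph Prediction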
Yanni Ma,Hao Liu,Yulan Guo,Theo Gevers,Martin R. Oswald
Main category: cs.CV
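TL;DR: This paper proposes LEO, a link-guided, edge-centric relational reasoning framework that converts the scene graph into a line graph and reasons at the edge level, capturing high-order relational dependencies and thereby improving relation prediction for 3D scene graphs.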
Details
Motivation: Existing methods are limited to object-centric graph neural networks and struggle to capture the high-order relational dependencies that are essential for accurate relation prediction.
Method: First predicts potential links between object pairs to suppress irrelevant edges, then converts the original scene graph into a line graph, applies an edge-centric graph neural network on the line graph for relational reasoning, and fuses the enriched relation features back into the original graph to improve object-level reasoning.
Result: Experiments on the 3DSSG dataset with two strong baselines both show performance gains, confirming the effectiveness of the proposed edge-to-object reasoning paradigm.
Conclusion: LEO is a model-agnostic framework that integrates readily into existing object-centric methods and significantly improves relation prediction for 3D scene graphs by introducing edge-level relational context.
Abstract: 3D scene graph prediction aims to abstract complex 3D environments into structured graphs consisting of objects and their pairwise relationships. Existing approaches typically adopt object-centric graph neural networks, where relation edge features are iteratively updated by aggregating messages from connected object nodes. However, this design inherently restricts relation representations to pairwise object context, making it difficult to capture high-order relational dependencies that are essential for accurate relation prediction. To address this limitation, we propose a Link-guided Edge-centric relational reasoning framework with Object-aware fusion, namely LEO, which enables progressive reasoning from relation-level context to object-level understanding. Specifically, LEO first predicts potential links between object pairs to suppress irrelevant edges, and then transforms the original scene graph into a line graph where each relation is treated as a node. A line graph neural network is applied to perform edge-centric relational reasoning to capture inter-relation context. The enriched relation features are subsequently integrated into the original object-centric graph to enhance object-level reasoning and improve relation prediction. Our framework is model-agnostic and can be integrated with any existing object-centric method. Experiments on the 3DSSG dataset with two competitive baselines show consistent improvements, highlighting the effectiveness of our edge-to-object reasoning paradigm.
[84] Taming Generative Synthetic Data for X-ray Prohibited Item Detection
Jialong Sun,Hongguang Zhu,Weizhe Liu,Yunda Sun,Renshuai Tao,Yunchao Wei
Main category: cs.CV
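TL;DR: Proposes Xsyn, a high-quality X-ray security image synthesis method that incurs no extra labor cost; built on text-to-image generation with cross-attention refinement and background occlusion modeling, it outperforms prior methods in detection performance.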
Details
Motivation: Addresses the high labor cost and low efficiency of existing X-ray security image synthesis methods, which stem from their two-stage pipeline (foreground extraction followed by image compositing).
Method: Proposes Xsyn, an end-to-end one-stage X-ray image synthesis method that refines bounding box annotations using cross-attention maps from the diffusion model (CAR) and explicitly models background occlusion in the latent space to increase imaging complexity (BOM).
Result: Experiments show that the method improves mAP by 1.2% over previous methods and that the synthetic images it generates improve prohibited item detection across a range of X-ray datasets and detectors.
Conclusion: Xsyn is the first method to achieve high-quality X-ray security image synthesis without extra labor cost; it is more efficient and practical, and offers an effective solution for training models when data is scarce.
Abstract: Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline, where they implement labor-intensive foreground extraction in the first stage and then composite images in the second stage. Such a pipeline introduces inevitable extra labor cost and is not efficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, compared with previous methods, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with 1.2% mAP improvement, and the synthetic images generated by our method are beneficial to improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at https://github.com/pILLOW-1/Xsyn/.
[85] Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language
Yan Xia,Letian Shi,Yilin Di,Joao F. Henriques,Daniel Cremers
Main category: cs.CV
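TL;DR: This paper proposes Text2Loc++, a novel neural network for localizing 3D point cloud submaps from natural language descriptions. It uses coarse-to-fine cross-modal alignment, and its performance is validated on a new city-scale dataset.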
Details
Motivation: Existing methods struggle with cross-modal alignment between complex, diverse natural language descriptions and 3D point clouds, and effective, robust localization models are lacking for large-scale urban environments in particular.
Method: Proposes Text2Loc++, which combines a pretrained language model with a hierarchical Transformer (HTM) to extract sentence semantics and uses an attention-based point cloud encoder to understand spatial structure; introduces Masked Instance Training (MIT) and Modality-aware Hierarchical Contrastive Learning (MHCL) to improve multimodal representations; and, for the fine localization stage, designs a lightweight framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT) that avoids explicit text-instance matching.
Result: Outperforms existing methods by up to 15% on the KITTI360Pose dataset, and generalizes well to complex linguistic expressions and diverse urban environments on the newly built city-scale dataset.
Conclusion: Through its coarse-to-fine architecture and enhanced cross-modal learning strategy, Text2Loc++ substantially improves the accuracy and robustness of natural-language-based 3D point cloud localization, advancing research on language, vision, and spatial fusion.
Abstract: We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%.
In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
[86] Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models
Mehran Tamjidi,Hamidreza Dastmalchi,Mohammadreza Alimoradijazi,Ali Cheraghian,Aijun An,Morteza Saberi
Main category: cs.CV
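TL;DR: Proposes Uni-Adapter, a training-free online test-time adaptation method that uses dynamic prototype learning to make 3D vision-language models more robust to noisy and distribution-shifted data.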
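The entropy-weighted aggregation step mentioned for this paper can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the choice below (each branch weighted in proportion to exp(-H), where H is the Shannon entropy of its softmax output) is an assumption, and the function names are hypothetical rather than taken from the authors' code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weighted_fusion(model_logits, cache_logits):
    """Fuse two predictions, trusting the lower-entropy one more.

    Assumption: weight_i is proportional to exp(-H(p_i)). The paper may
    use a different weighting; this only illustrates the general idea.
    """
    p_model = softmax(model_logits)
    p_cache = softmax(cache_logits)
    w_model = math.exp(-entropy(p_model))
    w_cache = math.exp(-entropy(p_cache))
    alpha = w_model / (w_model + w_cache)
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_model, p_cache)]

# An uncertain base model and a confident cache: the cache dominates.
fused = entropy_weighted_fusion([0.0, 0.0, 0.0], [10.0, 0.0, 0.0])
```

Because both inputs are normalized before mixing, the fused vector is itself a probability distribution, and a branch that is close to uniform contributes little to the final prediction.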
Details
Motivation: Existing 3D vision-language foundation models perform poorly on noisy, incomplete, or distribution-shifted data, so their generalization in practical applications needs to improve.
Method: Designs a dynamic prototype learning mechanism based on a 3D cache that continuously updates class-specific cluster centers as prototypes, and refines predictions with graph-based label smoothing and entropy-weighted fusion.
Result: Improves over the source models by 10.55%, 8.26%, and 4.49% on ModelNet-40C, ScanObjectNN-C, and ShapeNet-C respectively, reaching state-of-the-art performance.
Conclusion: Uni-Adapter effectively mitigates distribution shift and substantially improves the zero-shot recognition performance of 3D VLFMs under a range of corruptions, without retraining.
Abstract: 3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
[87] A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data
Mauro Larrat,Claudomiro Sales
Main category: cs.CV
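TL;DR: Proposes a multimodal Transformer method for UAV detection and aerial object recognition that fuses radar, RGB video, infrared video, and audio data, achieving high classification accuracy at a computational cost low enough for real-time use.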
Details
Motivation: Traditional single-modality methods have limitations in complex environments and cannot meet the robustness and accuracy demands of modern surveillance and security, so multimodal data must be combined to improve detection.
Method: Designs and evaluates a novel multimodal Transformer model that uses self-attention to fuse features from radar, visual, infrared, and audio data, learning discriminative joint representations for classification.
Result: On an independent test set, reaches 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity, and is particularly strong at distinguishing drones from other aerial objects; the model requires 1.09 GFLOPs, has 1.22 million parameters, and runs at 41.11 FPS.
Conclusion: The multimodal Transformer architecture substantially improves aerial object classification, confirming its effectiveness and practicality for UAV detection and monitoring and offering a solution for real-time, high-accuracy recognition in complex airspace.
Abstract: Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer's self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly high precision and recall in distinguishing drones from other aerial objects. Furthermore, computational analysis confirmed its efficiency, with 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS, highlighting its suitability for real-time applications. This study presents a significant advancement in aerial object classification, validating the efficacy of multimodal data fusion via a Transformer architecture for achieving state-of-the-art performance, thereby offering a highly accurate and resilient solution for UAV detection and monitoring in complex airspace.
[88] What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
Zhihan Ren,Lijun He,Jiaxi Liang,Xinzhu Fu,Haixia Bi,Fan Li
Main category: cs.CV
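TL;DR: This paper proposes FIA-Flow, a black-box feature inversion attack framework that reconstructs input images with high fidelity from the intermediate features of Split DNNs, revealing a more severe privacy risk than previously recognized.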
Details
Motivation: Existing feature inversion attack (FIA) methods produce limited reconstruction quality, which makes it hard to assess the real privacy risk of leaked intermediate features in Split DNNs; a more effective attack is needed to expose the potential threat.
Method: Proposes FIA-Flow, which has two core modules: (1) a Latent Feature Space Alignment Module (LFSAM) that bridges the semantic gap between the intermediate feature space and the latent space of a generative model; and (2) Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference to correct distributional mismatch. This decoupled design allows effective training with only a few image-feature pairs.
Result: FIA-Flow reconstructs images with higher fidelity and semantic consistency across various models (such as AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers. The paper also proposes new privacy evaluation metrics based on a large vision-language model, which confirm its superior performance.
Conclusion: FIA-Flow substantially strengthens feature inversion attacks, exposes severe privacy leakage in Split DNNs, and calls for stronger privacy protection when transmitting intermediate features.
Abstract: Split DNNs let edge devices offload intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.
[89] Adaptive thresholding pattern for fingerprint forgery detection
Zahra Farzadpour,Masoumeh Azghani
Main category: cs.CV
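TL;DR: This paper proposes a wavelet-based fingerprint forgery detection algorithm built on an adaptive thresholding pattern. Combined with anisotropic diffusion and an SVM classifier, it improves robustness to distortions such as missing pixels, missing blocks, and noise, and outperforms existing methods in accuracy.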
Details
Motivation: Fingerprint liveness detection systems are vulnerable to spoofing, and traditional methods cope poorly with image distortions caused by people or the environment, so more robust automatic detection techniques are needed.
Method: Preprocesses the fingerprint image with anisotropic diffusion, applies a three-level wavelet transform, adaptively thresholds the coefficients of the different levels and concatenates them into a feature vector, and classifies that vector with an SVM.
Result: With 90% of pixels missing and with missing blocks of size 70x70, accuracy is about 8% and 5% higher than existing methods, respectively, confirming strong resistance to multiple kinds of distortion.
Conclusion: The proposed adaptive-thresholding wavelet method, combined with an SVM classifier, performs well in fingerprint forgery detection and is more robust under complex distortions.
Abstract: Fingerprint liveness detection systems have been affected by spoofing, which is a severe threat to fingerprint-based biometric systems. Therefore, it is crucial to develop techniques that distinguish fake fingerprints from real ones. Software-based techniques can detect fingerprint forgery automatically. The scheme should also be resistant to various distortions such as noise contamination, pixel missing, and block missing, so that forgers cannot deceive the detector by adding distortions to the faked fingerprint. In this paper, we propose a fingerprint forgery detection algorithm based on a suggested adaptive thresholding pattern. The anisotropic diffusion of the input image is passed through three levels of the wavelet transform. The coefficients of different layers are adaptively thresholded and concatenated to produce the feature vector, which is classified using an SVM classifier. Another contribution of the paper is to investigate the effect of various distortions such as pixel missing, block missing, and noise contamination. Our suggested approach includes a novel method that exhibits improved resistance against a range of distortions caused by environmental phenomena or manipulations by malicious users. In quantitative comparisons, our proposed method outperforms its counterparts by approximately 8% and 5% in accuracy for missing pixel scenarios of 90% and block missing scenarios of size 70x70, respectively. This highlights the novelty of the approach in addressing such challenges.
[90] Fast Post-Hoc Confidence Fusion for 3-Class Open-Set Aerial Object Detection
Spyridon Loukovitis,Vasileios Karampinis,Athanasios Voulodimos
Main category: cs.CV
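TL;DR: Proposes a lightweight, model-agnostic post-processing framework that performs real-time three-way classification among in-domain targets, out-of-distribution objects, and background, improving open-set detection for UAV navigation.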
Details
Motivation: Existing open-set detection methods mostly rely on a single uncertainty score and a threshold, which makes it hard to separate out-of-distribution objects from background clutter and limits their use in safety-critical settings such as UAV navigation.
Method: Designs a compact multilayer perceptron (MLP) that fuses multiple confidence estimates and per-detection features to separate ID, OOD, and background detections without modifying the base detector.
Result: Improves AUROC by 2.7% on average in the two-class task and closed-set mAP by up to 9 points (an 18% relative gain), and enables robust three-class classification for the first time, clearly outperforming threshold-based baselines.
Conclusion: The method effectively decouples background from unknown objects and makes open-set detection more flexible and reliable, providing a key capability for safe UAV navigation.
Abstract: Developing reliable UAV navigation systems requires robust air-to-air object detectors capable of distinguishing between objects seen during training and previously unseen objects. While many methods address closed-set detection and achieve high-confidence recognition of in-domain (ID) targets, they generally do not tackle open-set detection, which requires simultaneous handling of both ID and out-of-distribution (OOD) objects. Existing open-set approaches typically rely on a single uncertainty score with thresholding, limiting flexibility and often conflating OOD objects with background clutter. In contrast, we propose a lightweight, model-agnostic post-processing framework that explicitly separates background from unknown objects while preserving the base detector's performance. Our approach extends open-set detection beyond binary ID/OOD classification to real-time three-way classification among ID targets, OOD objects, and background. To this end, we employ a fusion scheme that aggregates multiple confidence estimates and per-detection features using a compact multilayer perceptron (MLP). Incorporating different logit variants into the MLP consistently enhances performance across both binary and three-class classification without compromising throughput. Extensive ablation and comparative experiments confirm that our method surpasses threshold-based baselines in two-class classification by an average of 2.7% AUROC, while retaining or improving open-set mAP. Furthermore, our study uniquely enables robust three-class classification, a critical capability for safe UAV navigation, where OOD objects must be actively avoided and background regions safely ignored.
Comparative analysis highlights that our method surpasses competitive techniques in AUROC across datasets, while improving closed-set mAP by up to 9 points, an 18% relative gain.
[91] IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers
Gihwan Kim,Jemin Lee,Hyungshin Kim
Main category: cs.CV
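TL;DR: This paper proposes IPTQ-ViT, a new retraining-free post-training quantization framework for fully integer-only vision transformers. Using a polynomial-based GELU approximation, a bit-shifting-based Softmax approximation, and a unified metric for per-layer selection, it outperforms existing PTQ methods on multiple tasks.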
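To make the idea of a bit-shifting-based Softmax concrete, the sketch below shows one generic integer-only approximation in the spirit of earlier shift-based softmax schemes. It is not IPTQ-ViT's approximation function, which this listing does not specify. The assumptions are: inputs are fixed-point integers with `frac_bits` fractional bits, log2(e) is approximated as 1 + 1/2 - 1/16, and 2^(-f) on [0, 1) is approximated by the line 1 - f/2. All names are hypothetical.

```python
def shift_softmax(x_fixed, frac_bits=8, out_bits=8):
    """Integer-only softmax approximation using shifts, adds, and one division.

    x_fixed: list of ints, each representing value * 2**frac_bits.
    Returns ints representing probability * 2**out_bits.
    """
    one = 1 << frac_bits
    x_max = max(x_fixed)
    exps = []
    for x in x_fixed:
        d = x - x_max                       # d <= 0, avoids overflow
        # Multiply by log2(e) ~= 1.4375 using shifts (Python's >> floors).
        p = d + (d >> 1) - (d >> 4)
        neg = -p                            # neg >= 0
        q = neg >> frac_bits                # integer part of the exponent
        r = neg - (q << frac_bits)          # fractional part, in [0, one)
        # 2**(-(q + f)) ~= (1 - f/2) * 2**(-q), evaluated in fixed point.
        exps.append((one - (r >> 1)) >> q)
    total = sum(exps)                       # >= one, since the max entry gives `one`
    return [(e << out_bits) // total for e in exps]

# Logits 1.0, 2.0, 3.0 in fixed point with 8 fractional bits.
probs = shift_softmax([256, 512, 768])
```

Dividing the outputs by 2**out_bits gives roughly 0.09, 0.25, and 0.65, against exact softmax values of about 0.090, 0.245, and 0.665, so the error of this particular approximation is on the order of one percentage point for this input.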
Details
Motivation: Existing QAT methods rely on expensive retraining to recover the accuracy lost when quantizing non-linear layers, while PTQ methods fall short of fully integer-only inference, which limits their use in resource-constrained environments.
Method: Proposes a polynomial-based GELU and a bit-shifting-based Softmax as approximation functions, and designs a unified metric that combines quantization sensitivity, perturbation, and computational cost to select the best approximation for each activation layer.
Result: IPTQ-ViT improves top-1 accuracy on image classification by up to 6.44 percentage points (1.78 on average) and object detection by 1.0 mAP, outperforms partial floating-point PTQ methods under W8A8 and W4A8, and matches integer-only QAT methods in accuracy and latency.
Conclusion: IPTQ-ViT achieves fully integer-only vision transformers without retraining, substantially improves on prior PTQ methods, and suits resource-constrained settings.
Abstract: Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code at https://github.com/gihwan-kim/IPTQ-ViT.git.
[92] Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training
Yunjiao Zhou,Xinyan Chen,Junlang Qian,Lihua Xie,Jianfei Yang
Main category: cs.CV
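TL;DR: This paper proposes ZOMG, a zero-shot, open-vocabulary framework for segmenting motion sequences that decomposes motion into semantic sub-actions without annotations or fine-tuning.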
Details
Motivation: Existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings, so a general annotation-free approach to motion understanding is needed.
Method: Combines language semantic partition (using large language models to decompose instructions into ordered sub-actions) with soft masking optimization (learning instance-specific temporal masks that focus on key frames while keeping segments internally continuous and mutually separated).
Result: Experiments on three motion-language datasets show state-of-the-art motion grounding performance, a +8.7% mAP gain over prior methods on the HumanML3D benchmark, and significant improvements in downstream retrieval.
Conclusion: ZOMG establishes a new paradigm for annotation-free motion understanding, achieving efficient and semantically aligned zero-shot motion segmentation.
Abstract: Understanding complex human activities demands the ability to decompose motion into fine-grained, semantic-aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which are infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art effectiveness and efficiency of motion grounding performance, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark. Meanwhile, significant improvements also exist in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
[93] Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models
Haidong Kang,Lihong Lin,Enneng Yang,Hongning Dai,Hao Wang
Main category: cs.CV
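TL;DR: Proposes AutoPrune, a new pruning method in which large language models design pruning algorithms for themselves. It combines Graph-driven Chain-of-Thought (GCoT) with Skew-aware Dynamic Sparsity Allocation (SDSA) to counter performance degradation at high pruning ratios, and outperforms existing methods on mainstream LLM benchmarks.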
Details
Motivation: Existing LLM pruning methods depend on manual design, which is costly and requires expert knowledge. They also degrade severely at high pruning ratios because of an outlier value issue, and they lack adaptive sparsity design.
Method: Proposes AutoPrune, which uses LLMs to design pruning algorithms for themselves; introduces Graph-driven Chain-of-Thought (GCoT) to optimize prompts and strengthen the reasoning process; and proposes Skew-aware Dynamic Sparsity Allocation (SDSA) to address the outlier value issue and achieve adaptive sparsity.
Result: Experiments on several mainstream large language models show that AutoPrune clearly outperforms existing methods at high pruning ratios, effectively mitigating performance degradation while offering good interpretability and automation.
Conclusion: AutoPrune lets large language models prune themselves without relying on expert knowledge, and its SDSA mechanism handles the outlier value issue, offering a new paradigm for efficient, automated model compression.
Abstract: Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, but their massive size hinders real-world deployment. Existing pruning methods (e.g., Wanda) tailored for LLMs rely heavily on manually designed pruning algorithms, leading to huge labor costs and requiring expert knowledge. Furthermore, we are the first to identify the serious outlier value issue behind dramatic performance degradation under high pruning ratios that are caused by uniform sparsity, raising an additional concern about how to design adaptive pruning sparsity ideal for LLMs. Can LLMs prune by themselves? In this work, we give an affirmative answer by proposing a novel pruning method called AutoPrune, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically without any expert knowledge. Specifically, to mitigate the black-box nature of LLMs, we propose a Graph-driven Chain-of-Thought (GCoT) to optimize prompts, significantly enhancing the reasoning process in learning the pruning algorithm and enabling us to generate pruning algorithms with superior performance and interpretability in the next generation. Finally, grounded in insights of the outlier value issue, we introduce Skew-aware Dynamic Sparsity Allocation (SDSA) to overcome the outlier value issue, mitigating performance degradation under high pruning ratios. We conduct extensive experiments on mainstream LLM benchmarks, demonstrating the superiority of AutoPrune, which consistently outperforms state-of-the-art competitors.
The code is available at: https://anonymous.4open.science/r/AutoPrune.
[94] ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
Simon Boeder,Fabian Gigengack,Simon Roesler,Holger Caesar,Benjamin Risse
Main category: cs.CV
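TL;DR: ShelfOcc is a vision-only method that generates metrically consistent semantic voxel labels in native 3D space, providing true 3D supervision without LiDAR and substantially improving weakly/shelf-supervised occupancy estimation.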
Details
Motivation: Existing methods based on 2D projection or rendering supervision suffer from geometric inconsistencies and severe depth bleeding, and otherwise depend on LiDAR or additional sensors, which limits the practicality and scalability of 3D scene understanding.
Method: ShelfOcc generates 3D semantic voxel labels from video as the supervision signal. A dedicated framework filters and accumulates static geometry across frames, handles dynamic content, and propagates semantic information into a stable voxel representation, yielding high-quality 3D supervision without LiDAR.
Result: On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods, with up to a 34% relative improvement.
Conclusion: High-quality 3D supervision is essential for robust occupancy learning; ShelfOcc offers a new data-driven direction and opens a path to LiDAR-free 3D scene understanding.
Abstract: Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames, handling dynamic content and propagating semantic information into a stable voxel representation. This data-centric shift in supervision for weakly/shelf-supervised occupancy estimation allows the use of essentially any SOTA occupancy model architecture without relying on LiDAR data. We argue that such high-quality supervision is essential for robust occupancy learning and constitutes an important complementary avenue to architectural innovation.
On the Occ3D-nuScenes benchmark, ShelfOcc substantially outperforms all previous weakly/shelf-supervised methods (up to a 34% relative improvement), establishing a new data-driven direction for LiDAR-free 3D scene understanding.
[95] Controlling False Positives in Image Segmentation via Conformal Prediction
Luca Mossina,Corentin Friedrich
Main category: cs.CV
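TL;DR: Proposes a model-agnostic post-processing framework that uses conformal prediction to produce confidence masks with statistical guarantees for semantic segmentation, controlling the false-positive rate.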
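The calibration step of such a framework can be sketched as follows. This is a generic split-conformal recipe under stated assumptions, not necessarily the authors' exact procedure: masks are shrunk only by raising the score threshold over a fixed grid, each calibration image is summarized by the smallest grid threshold at and above which its false-positive proportion stays within the tolerance, and the final threshold is the conformal quantile of those per-image values. All function names are hypothetical.

```python
import math

def fp_proportion(scores, labels, threshold):
    """Share of kept pixels (score >= threshold) whose true label is 0."""
    kept = [y for s, y in zip(scores, labels) if s >= threshold]
    if not kept:
        return 0.0          # an empty mask contains no false positives
    return kept.count(0) / len(kept)

def required_threshold(scores, labels, grid, tol):
    """Smallest grid threshold such that it and all larger ones meet tol."""
    needed = math.inf       # inf means no grid value works for this image
    for t in sorted(grid, reverse=True):
        if fp_proportion(scores, labels, t) <= tol:
            needed = t
        else:
            break
    return needed

def calibrate(cal_set, grid, tol, alpha):
    """Threshold that covers a new exchangeable image with prob. >= 1 - alpha."""
    reqs = sorted(required_threshold(s, y, grid, tol) for s, y in cal_set)
    n = len(reqs)
    k = math.ceil((n + 1) * (1 - alpha))    # conformal rank
    return reqs[k - 1] if k <= n else math.inf

# Four toy calibration images: flattened pixel scores and 0/1 ground truth.
cal_set = [
    ([0.9, 0.7, 0.3], [1, 1, 0]),
    ([0.9, 0.5], [1, 0]),
    ([0.95, 0.85], [1, 0]),
    ([0.9], [1]),
]
grid = [0.2, 0.4, 0.6, 0.8]
t_hat = calibrate(cal_set, grid, tol=0.0, alpha=0.5)
```

A return value of infinity means the calibration set is too small, or the model too unreliable, to certify any threshold on the grid at the requested level; a deployment would then fall back to an empty mask rather than guess.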
Details
Motivation: Deep learning models used in clinical decision making lack explicit statistical guarantees on their errors, in particular control over false-positive predictions.
Method: Starting from a pretrained segmentation model, builds a nested family of shrunken masks by raising the score threshold or applying morphological erosion, and selects the shrink parameter on a calibration set via conformal prediction.
Result: Validated on a polyp-segmentation benchmark, achieving distribution-free, image-level control of the false-positive rate with finite-sample guarantees.
Conclusion: The framework requires no retraining, provides reliable statistical guarantees, and suits settings where over-segmentation carries clinical risk.
Abstract: Reliable semantic segmentation is essential for clinical decision making, yet deep models rarely provide explicit statistical guarantees on their errors. We introduce a simple post-hoc framework that constructs confidence masks with distribution-free, image-level control of false-positive predictions. Given any pretrained segmentation model, we define a nested family of shrunken masks obtained either by increasing the score threshold or by applying morphological erosion. A labeled calibration set is used to select a single shrink parameter via conformal prediction, ensuring that, for new images that are exchangeable with the calibration data, the proportion of false positives retained in the confidence mask stays below a user-specified tolerance with high probability. The method is model-agnostic, requires no retraining, and provides finite-sample guarantees regardless of the underlying predictor. Experiments on a polyp-segmentation benchmark demonstrate target-level empirical validity. Our framework enables practical, risk-aware segmentation in settings where over-segmentation can have clinical consequences. Code at https://github.com/deel-ai-papers/conseco.
[96] D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models
Wenlun Zhang,Yunshan Zhong,Zihao Ding,Xinyu Li,Kentaro Yoshioka
Main category: cs.CV
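TL;DR: This paper proposes D4C, the first data-free quantization (DFQ) framework designed for CLIP models. Its three components (prompt-guided semantic injection, structural contrastive generation, and perturbation-aware enhancement) generate semantically rich and structurally diverse pseudo images, substantially improving CLIP quantization when no real data is available.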
Details
Motivation: Existing data-free quantization methods perform poorly on vision-language models such as CLIP, mainly because the synthesized samples have insufficient semantic content and low intra-image diversity, which limits their use in privacy-sensitive scenarios.
Method: Proposes the D4C framework with three core components: (1) Prompt-Guided Semantic Injection, which uses text prompts to align generated images with real-world semantics; (2) Structural Contrastive Generation, which reproduces the compositional structure of natural images through foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement, which applies controlled perturbations to improve sample diversity and robustness.
Result: Experiments show that D4C clearly outperforms existing methods across bit-widths and models. For example, under the W4A8 setting, zero-shot classification accuracy of CLIP ResNet-50 and ViT-B/32 improves substantially on CIFAR-10, CIFAR-100, and ImageNet-1K, by up to 19.7%.
Conclusion: D4C addresses the semantic deficiency and lack of diversity that limit existing data-free quantization methods on CLIP, substantially narrowing the post-quantization performance gap and advancing multimodal model compression for privacy-preserving settings.
Abstract: Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models.
For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
[97] WarNav: An Autonomous Driving Benchmark for Segmentation of Navigable Zones in War Scenes
Marc-Emmanuel Coupvent des Graviers,Hejer Ammar,Christophe Guettier,Yann Dumortier,Romaric Audigier
Main category: cs.CV
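TL;DR: WarNav is a new real-world semantic segmentation dataset for autonomous ground vehicle navigation in unstructured, conflict-affected environments, intended to bridge the gap between conventional urban driving data and high-risk war-zone scenarios.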
Details
Motivation: Existing autonomous driving datasets mainly target urban environments; data for extremely hazardous, unstructured scenes such as war zones is lacking, which holds back navigation capabilities for unmanned systems in high-risk areas.
Method: Builds the WarNav dataset from the open-source DATTALION repository, addressing data heterogeneity and ethical issues, and benchmarks several state-of-the-art semantic segmentation models that are trained on urban scenes and then transferred to WarNav, exploring navigability modeling without target-domain annotations.
Result: Provides baseline results on WarNav, analyzes how the training data environment affects model performance, and proposes a first approach to improving navigability in challenging environments under the constraint of unannotated target images.
Conclusion: WarNav is an important resource for research on autonomous navigation in extreme environments, supporting more robust and safer autonomous driving systems in high-risk settings such as war zones when annotated data is limited.
Abstract: We introduce WarNav, a novel real-world dataset constructed from images of the open-source DATTALION repository, specifically tailored to enable the development and benchmarking of semantic segmentation models for autonomous ground vehicle navigation in unstructured, conflict-affected environments. This dataset addresses a critical gap between conventional urban driving resources and the unique operational scenarios encountered by unmanned systems in hazardous and damaged war-zones. We detail the methodological challenges encountered, ranging from data heterogeneity to ethical considerations, providing guidance for future efforts that target extreme operational contexts. To establish performance references, we report baseline results on WarNav using several state-of-the-art semantic segmentation models trained on structured urban scenes. We further analyse the impact of training data environments and propose a first step towards effective navigability in challenging environments with the constraint of having no annotation of the targeted images. Our goal is to foster impactful research that enhances the robustness and safety of autonomous vehicles in high-risk scenarios while being frugal in annotated data.
[98] Representation Space Constrained Learning with Modality Decoupling for Multimodal Object Detection
YiKang Shao,Tao Shi
Main category: cs.CV
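TL;DR: This paper proposes RSC-MD, a new multimodal object detection method that addresses fusion degradation. A theoretical analysis traces the problem to gradient suppression and modality imbalance, and two modules are designed to improve the optimization of each modality's backbone; experiments show state-of-the-art performance on several datasets.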
Details
Motivation: Existing work neglects fusion degradation in multimodal fusion and offers no theoretical analysis of its causes, so the problem needs systematic study and an effective solution.
Method: Proposes Representation Space Constrained Learning with Modality Decoupling (RSC-MD), comprising an RSC module (which amplifies suppressed gradients) and an MD module (which eliminates inter-modality coupling interference and imbalance), so that each modality-specific backbone is fully optimized.
Result: Experiments on the FLIR, LLVIP, M3FD, and MFAD datasets show that the method effectively alleviates fusion degradation and achieves state-of-the-art performance on multiple benchmarks.
Conclusion: Through a theory-driven design, RSC-MD addresses gradient suppression and modality imbalance in multimodal detection, offering an interpretable and efficient optimization path for multimodal fusion.
Abstract: Multimodal object detection has attracted significant attention in both academia and industry for its enhanced robustness. Although numerous studies have focused on improving modality fusion strategies, most neglect fusion degradation, and none provide a theoretical analysis of its underlying causes. To fill this gap, this paper presents a systematic theoretical investigation of fusion degradation in multimodal detection and identifies two key optimization deficiencies: (1) the gradients of unimodal branch backbones are severely suppressed under multimodal architectures, resulting in under-optimization of the unimodal branches; (2) disparities in modality quality cause weaker modalities to experience stronger gradient suppression, which in turn results in imbalanced modality learning. To address these issues, this paper proposes a Representation Space Constrained Learning with Modality Decoupling (RSC-MD) method, which consists of two modules. The RSC module and the MD module are designed to respectively amplify the suppressed gradients and eliminate inter-modality coupling interference as well as modality imbalance, thereby enabling the comprehensive optimization of each modality-specific backbone. Extensive experiments conducted on the FLIR, LLVIP, M3FD, and MFAD datasets demonstrate that the proposed method effectively alleviates fusion degradation and achieves state-of-the-art performance across multiple benchmarks. The code and training procedures will be released at https://github.com/yikangshao/RSC-MD.
[99] HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation
Linyin Luo,Yujuan Ding,Yunshan Ma,Wenqi Fan,Hanjiang Lai
Main category: cs.CV
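TL;DR: This paper proposes a hierarchical visual attack on multimodal retrieval-augmented generation (MRAG) systems. By adding imperceptible perturbations to the image input, it breaks the cross-modal alignment between the retriever and the generator, degrading both retrieval and generation performance.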
Details
Motivation: Existing work focuses on knowledge poisoning attacks. This paper explores a different attack setting: attacking an MRAG system solely by adding small perturbations to the user's input image, without modifying any system component, which is stealthier and a more realistic threat.
Method: Proposes a hierarchical visual attack framework with a two-stage strategy: the perturbation is first optimized to break cross-modal alignment so that the retriever recalls irrelevant knowledge, and then to disrupt multimodal semantic alignment so that the generator receives misaligned inputs and produces wrong outputs.
Result: On two mainstream MRAG datasets, OK-VQA and InfoSeek, with a CLIP-based retriever and BLIP-2 and LLaVA as generators, the attack significantly reduces retrieval accuracy and generation quality.
Conclusion: The proposed visual attack exposes the vulnerability of MRAG systems to input-side visual perturbations and poses new challenges and directions for securing multimodal systems.
Abstract: Advanced multimodal Retrieval-Augmented Generation (MRAG) techniques have been widely applied to enhance the capabilities of Large Multimodal Models (LMMs), but they also bring along novel safety issues. Existing adversarial research has revealed the vulnerability of MRAG systems to knowledge poisoning attacks, which fool the retriever into recalling injected poisoned contents. However, our work considers a different setting: a visual attack on MRAG that solely adds imperceptible perturbations to the image inputs of users, without manipulating any other components. This is challenging due to the robustness of fine-tuned retrievers and large-scale generators, and the effect of visual perturbation may be further weakened by propagation through the RAG chain. We propose a novel Hierarchical Visual Attack that misaligns and disrupts the two inputs (the multimodal query and the augmented knowledge) of MRAG's generator to confuse its generation. We further design a hierarchical two-stage strategy to obtain misaligned augmented knowledge. We disrupt the image input of the retriever to make it recall irrelevant knowledge from the original database, by optimizing the perturbation which first breaks the cross-modal alignment and then disrupts the multimodal semantic alignment. We conduct extensive experiments on two widely-used MRAG datasets: OK-VQA and InfoSeek. We use CLIP-based retrievers and two LMMs, BLIP-2 and LLaVA, as generators.
Results demonstrate the effectiveness of our visual attack on MRAG through the significant decrease in both retrieval and generation performance.
[100] A Dataset and Baseline for Deep Learning-Based Visual Quality Inspection in Remanufacturing
Johannes C. Bauer,Paul Geng,Stephan Trattnig,Petr Dokládal,Rüdiger Daub
Main category: cs.CV
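TL;DR: Proposes a new image dataset and a contrastive regularization loss to improve the generalization of deep neural networks in visual inspection of gearbox components.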
Details
Motivation: Because parts and defect patterns vary so widely, traditional manual quality inspection scales poorly, while existing deep learning models generalize insufficiently to new samples.
Method: Builds a new image dataset of typical components from two automotive transmissions, designs different train-test splits to produce distribution shifts, and introduces a contrastive regularization loss to improve model robustness.
Result: Experiments show that the proposed contrastive regularization loss improves generalization to unseen component types.
Conclusion: The approach helps automate visual inspection in remanufacturing and improves the adaptability and reliability of deep models in practice.
Abstract: Remanufacturing describes a process where worn products are restored to like-new condition and it offers vast ecological and economic potentials. A key step is the quality inspection of disassembled components, which is mostly done manually due to the high variety of parts and defect patterns. Deep neural networks show great potential to automate such visual inspection tasks but struggle to generalize to new product variants, components, or defect patterns. To tackle this challenge, we propose a novel image dataset depicting typical gearbox components in good and defective condition from two automotive transmissions. Depending on the train-test split of the data, different distribution shifts are generated to benchmark the generalization ability of a classification model. We evaluate different models using the dataset and propose a contrastive regularization loss to enhance model robustness. The results obtained demonstrate the ability of the loss to improve generalisation to unseen types of components.
[101] Driving in Spikes: An Entropy-Guided Object Detector for Spike Cameras
Ziyan Liu,Qi Su,Lulu Tang,Zhaofei Yu,Tiejun Huang
Main category: cs.CV
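TL;DR: Proposes EASD, an end-to-end object detection method for spike cameras, and builds DSEC Spike, the first driving-oriented simulated spike detection benchmark.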
Details
Motivation: Object detection in autonomous driving suffers from motion blur and overexposure under fast motion and extreme lighting, and existing image-based detectors cannot process the sparse, discrete output of spike cameras. Method: A dual-branch design is used: a Temporal Based Texture plus Feature Fusion branch for global cross-slice semantics and an Entropy Selective Attention branch that focuses on object details; the DSEC Spike dataset is used for training and validation. Result: End-to-end detection on spike stream data is achieved, improving detection performance under extreme conditions. Conclusion: The EASD framework, together with the dedicated DSEC Spike dataset, offers an effective solution for applying spike cameras to object detection in autonomous driving. Abstract: Object detection in autonomous driving suffers from motion blur and saturation under fast motion and extreme lighting. Spike cameras offer microsecond latency and ultra high dynamic range for object detection by using per pixel asynchronous integrate and fire. However, their sparse, discrete output cannot be processed by standard image-based detectors, posing a critical challenge for end to end spike stream detection. We propose EASD, an end to end spike camera detector with a dual branch design: a Temporal Based Texture plus Feature Fusion branch for global cross slice semantics, and an Entropy Selective Attention branch for object centric details. To close the data gap, we introduce DSEC Spike, the first driving oriented simulated spike detection benchmark.[102] SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome
Dabin Jeong,Amirhossein Vahidi,Ciro Ramírez-Suástegui,Marie Moullet,Kevin Ly,Mohammad Vali Sanian,Sebastian Birk,Yinshui Chang,Adam Boxall,Daniyal Jafree,Lloyd Steele,Vijaya Baskar MS,Muzlifah Haniffa,Mohammad Lotfollahi
Main category: cs.CV
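TL;DR: Proposes Sigmma, a multimodal contrastive alignment framework that uses multi-scale alignment and graph representation learning to improve cross-modal representations of HE images and spatial transcriptomics data, with significant gains on gene-expression prediction and cross-modal retrieval.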
Details
Motivation: Existing methods usually align HE images with spatial transcriptomics data at a single scale only, overlooking fine-grained cellular structures and their spatial organization, which limits how fine the cross-modal representations can be. Method: The Sigmma framework uses a multi-scale contrastive alignment strategy to keep representations consistent across modalities at different scales; cell interactions are modeled as a graph, and inter- and intra-subgraph relationships are integrated to capture cell-cell interactions from fine to coarse. Result: Across multiple datasets, gene-expression prediction improves by 9.78% on average and cross-modal retrieval by 26.93% on average; downstream analyses show that the model learns meaningful multi-tissue organization. Conclusion: Through multi-scale alignment and graph-enhanced representation learning, Sigmma markedly improves the quality of cross-modal representations in computational pathology and helps resolve the tissue microenvironment more precisely. Abstract: Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to an average improvement of 9.78% in the gene-expression prediction task and 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.[103] Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners
Xabier Lekunberri,Ahmad Kamal,Izaro Goienetxea,Jon Ruiz,Iñaki Quincoces,Jaime Valls Miro,Ignacio Arganda-Carreras,Jose A. Fernandes-Salvador
Main category: cs.CV
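TL;DR: This study proposes a multi-stage pipeline combining YOLOv9-SAM2 with hierarchical classification to estimate the species composition of tuna catches from electronic monitoring images, substantially improving automatic identification accuracy.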
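To make the idea of estimating catch composition concrete, the following minimal sketch turns per-individual species labels into percentage shares and compares them with a known composition. The function names, the `OTHER` label, and the choice to average absolute differences over species in percentage points are assumptions made for illustration; this entry does not specify how its error figure is computed.

```python
from collections import Counter
from typing import Dict, Sequence


def composition(labels: Sequence[str]) -> Dict[str, float]:
    """Percentage share of each species among the classified individuals."""
    if not labels:
        raise ValueError("at least one classified individual is required")
    counts = Counter(labels)
    return {species: 100.0 * n / len(labels) for species, n in counts.items()}


def mean_abs_error(pred: Dict[str, float], truth: Dict[str, float]) -> float:
    """Average absolute difference (percentage points) over all species.

    Assumption: species missing from one side count as a 0% share.
    """
    species = set(pred) | set(truth)
    return sum(abs(pred.get(s, 0.0) - truth.get(s, 0.0)) for s in species) / len(species)


# Hypothetical labels produced by a classifier for one fishing operation.
predicted_labels = ["YFT"] * 6 + ["BET"] * 3 + ["OTHER"] * 1
estimated = composition(predicted_labels)
observed = {"YFT": 50.0, "BET": 40.0, "OTHER": 10.0}  # made-up ground truth
error = mean_abs_error(estimated, observed)
```

Averaging over species is only one reasonable convention; weighting by catch size or averaging over fishing operations would give different numbers.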
Details
Motivation: Electronic monitoring systems produce large volumes of video that are burdensome to analyze manually, and AI-based tuna species identification is limited by imbalanced training data; bigeye tuna (BET) and yellowfin tuna (YFT) are especially hard to tell apart, so the accuracy and generalization of automatic identification need to improve. Method: A multi-stage pipeline is built: three segmentation approaches (Mask R-CNN, DINOv2+SAM2, and YOLOv9+SAM2) are compared; ByteTrack is used to track individuals; a standard multiclass classifier and a hierarchical classifier are evaluated; all models are cross-validated and tested on real fishing operations with known species composition. Result: YOLOv9+SAM2 performs best at segmentation, with a mean average precision of 0.66±0.03 and a recall of 0.88±0.03; the hierarchical classifier generalizes better than the standard model; the final combined method segments and classifies 84.8% of individuals, with a mean absolute error of 4.5%. Conclusion: The multi-stage approach combining YOLOv9-SAM2 segmentation with hierarchical classification improves the accuracy of automatic tuna species composition estimates and has potential for wider use in fisheries electronic monitoring. Abstract: Purse seiners play a crucial role in tuna fishing, as approximately 69% of the world's tropical tuna is caught using this gear. All tuna Regional Fisheries Management Organizations have established minimum standards to use electronic monitoring (EM) in fisheries in addition to traditional observers. The EM systems produce a massive amount of video data that human analysts must process. Integrating artificial intelligence (AI) into their workflow can decrease that workload and improve the accuracy of the reports. However, species identification still poses significant challenges for AI, as achieving balanced performance across all species requires appropriate training data. Here, we quantify the difficulty experts face to distinguish bigeye tuna (BET, Thunnus obesus) from yellowfin tuna (YFT, Thunnus albacares) using images captured by EM systems. We found inter-expert agreements of 42.9% $\pm$ 35.6% for BET and 57.1% $\pm$ 35.6% for YFT. We then present a multi-stage pipeline to estimate the species composition of the catches using a reliable ground-truth dataset based on identifications made by observers on board. Three segmentation approaches are compared: Mask R-CNN, a combination of DINOv2 with SAM2, and an integration of YOLOv9 with SAM2. We found that the latter performs best, with a validation mean average precision of 0.66 $\pm$ 0.03 and a recall of 0.88 $\pm$ 0.03. Segmented individuals are tracked using ByteTrack.
For classification, we evaluate a standard multiclass classification model and a hierarchical approach, finding that the hierarchical approach generalizes better. All our models were cross-validated during training and tested on fishing operations with fully known catch composition. Combining YOLOv9-SAM2 with the hierarchical classification produced the best estimations, with 84.8% of the individuals being segmented and classified with a mean average error of 4.5%.[104] RS-CA-HSICT: A Residual and Spatial Channel Augmented CNN Transformer Framework for Monkeypox Detection
Rashid Iqbal,Saddam Hussain Khan
Main category: cs.CV
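TL;DR: Proposes RS-CA-HSICT, a channel-augmented hybrid CNN-Transformer architecture based on residual and spatial learning, to improve monkeypox (Mpox) detection by combining the local feature extraction of CNNs with the long-range dependency modeling of Transformers.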
Details
Motivation: Existing CNNs and Vision Transformers capture too little local detail, model global context only to a limited extent, and fuse multi-scale features inadequately when classifying Mpox skin-lesion images, so a more effective hybrid model is needed to raise detection accuracy. Method: The RS-CA-HSICT framework comprises an HSICT module (integrating a stem CNN with customized ICT blocks), a residual CNN module, a spatial CNN block, and a channel augmentation (CA) mechanism; inverse residual learning is introduced to mitigate vanishing gradients, stage-wise resolution reduction provides scale invariance, a channel-fusion-and-attention module refines feature selection, and a spatial attention mechanism finally sharpens pixel-level discriminative information. Result: On the Kaggle benchmark and a diverse Mpox dataset, the model reaches a classification accuracy of up to 98.30% and an F1-score of 98.13%, outperforming existing CNN and ViT models. Conclusion: By fusing the structural detail learning of CNNs with the global context modeling of Transformers, RS-CA-HSICT markedly improves automatic detection on Mpox images and shows potential as a clinical decision aid. Abstract: This work proposes a hybrid deep learning approach, namely Residual and Spatial Learning based Channel Augmented Integrated CNN-Transformer architecture, that leverages the strengths of CNN and Transformer towards enhanced MPox detection. The proposed RS-CA-HSICT framework is composed of an HSICT block, a residual CNN module, a spatial CNN block, and a CA, which enhances the diverse feature space, detailed lesion information, and long-range dependencies. The new HSICT module first integrates an abstract representation of the stem CNN and customized ICT blocks for efficient multihead attention and structured CNN layers with homogeneous (H) and structural (S) operations. The customized ICT blocks learn global contextual interactions and local texture extraction. Additionally, H and S layers learn spatial homogeneity and fine structural details by reducing noise and modeling complex morphological variations. Moreover, inverse residual learning mitigates vanishing gradients, and stage-wise resolution reduction ensures scale invariance. Furthermore, the RS-CA-HSICT framework augments the learned HSICT channels with the TL-driven Residual and Spatial CNN maps for enhanced multiscale feature space capturing global and localized structural cues, subtle texture, and contrast variations. These channels, preceding augmentation, are refined through the Channel-Fusion-and-Attention block, which preserves discriminative channels while suppressing redundant ones, thereby enabling efficient computation.
Finally, the spatial attention mechanism refines pixel selection to detect subtle patterns and intra-class contrast variations in Mpox. Experimental results on both the Kaggle benchmark and a diverse MPox dataset reported classification accuracy as high as 98.30% and an F1-score of 98.13%, which outperforms the existing CNNs and ViTs.[105] FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI
Luisa Gallée,Yiheng Xiong,Meinrad Beer,Michael Götz
Main category: cs.CV
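TL;DR: FunnyNodules is a fully parameterized synthetic dataset for systematically analyzing attribute-based reasoning in medical AI models, giving full control over the diagnostic decision rule.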
Details
Motivation: Densely annotated medical image datasets that include both diagnostic labels and the reasoning behind them are lacking, which limits the development and evaluation of explainable AI (xAI) models. Method: FunnyNodules, a synthetic dataset that generates abstract lung nodule-like shapes, is designed with controllable visual attributes (such as roundness, margin sharpness, and spiculation); the target class is determined by a predefined combination of attributes. Result: FunnyNodules supports model-agnostic evaluation: checking whether a model learns the correct attribute-target relations, interpreting performance in attribute prediction, and analyzing how well attention aligns with attribute-specific regions. Conclusion: With complete ground-truth information, FunnyNodules provides a flexible foundation for developing, benchmarking, and analyzing in depth explainable AI methods in medical image analysis. Abstract: Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.[106] Evaluating Low-Light Image Enhancement Across Multiple Intensity Levels
Maria Pilligua,David Serrano-Lozano,Pai Peng,Ramon Baldrich,Michael S. Brown,Javier Vazquez-Corral
Main category: cs.CV
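TL;DR: Introduces the Multi-Illumination Low-Light (MILL) dataset for evaluating low-light image enhancement algorithms under different illumination conditions, and proposes improvements that increase their robustness.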
Details
Motivation: Existing low-light enhancement methods rely on paired training data captured under a single low-light condition, so how they perform across different illumination intensities is not well understood. Method: The MILL dataset, containing images at multiple illumination intensities, is collected with fixed camera settings and precise illuminance measurements; several state-of-the-art methods are benchmarked on it, and improvements that increase robustness across illumination levels are proposed. Result: Experiments show that existing methods vary significantly in performance across illumination intensities; the proposed improvements yield PSNR gains of up to 10 dB (DSLR) and 2 dB (smartphone) on Full HD images. Conclusion: The MILL dataset provides a more comprehensive evaluation platform for low-light enhancement algorithms, and the proposed improvements increase model robustness under varying illumination. Abstract: Imaging in low-light environments is challenging due to reduced scene radiance, which leads to elevated sensor noise and reduced color saturation. Most learning-based low-light enhancement methods rely on paired training data captured under a single low-light condition and a well-lit reference. The lack of radiance diversity limits our understanding of how enhancement techniques perform across varying illumination intensities. We introduce the Multi-Illumination Low-Light (MILL) dataset, containing images captured at diverse light intensities under controlled conditions with fixed camera settings and precise illuminance measurements. MILL enables comprehensive evaluation of enhancement algorithms across variable lighting conditions. We benchmark several state-of-the-art methods and reveal significant performance variations across intensity levels. Leveraging the unique multi-illumination structure of our dataset, we propose improvements that enhance robustness across diverse illumination scenarios. Our modifications achieve up to 10 dB PSNR improvement for DSLR and 2 dB for the smartphone on Full HD images.[107] Learning to Expand Images for Efficient Visual Autoregressive Modeling
Ruiqing Yang,Kaixin Zhang,Zheng Zhang,Shan You,Tao Huang
Main category: cs.CV
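TL;DR: Proposes Expanding Autoregressive Representation (EAR), a new generation paradigm that unfolds image tokens in a spiral from the center outward, mimicking the perception pattern of the human visual system, to enable efficient parallel decoding and better generation quality.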
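The center-outward spiral token order named in this entry can be illustrated with a short, self-contained sketch. It is a generic square spiral over a token grid, not the authors' implementation; the function names, the starting cell for even-sized grids, and the fixed geometric step schedule in `expanding_steps` are assumptions made for illustration (the paper describes an adaptive schedule).

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def spiral_order(height: int, width: int) -> List[Cell]:
    """Visit every (row, col) cell of a grid in a square spiral from the center."""
    row, col = (height - 1) // 2, (width - 1) // 2
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    order: List[Cell] = []
    step_len, turn = 1, 0
    while len(order) < height * width:
        for _ in range(2):  # each step length is used for two consecutive directions
            d_row, d_col = directions[turn % 4]
            for _ in range(step_len):
                # The spiral may leave the grid; only in-bounds cells are recorded.
                if 0 <= row < height and 0 <= col < width:
                    order.append((row, col))
                row, col = row + d_row, col + d_col
            turn += 1
        step_len += 1
    return order


def expanding_steps(order: List[Cell], first: int = 1, growth: int = 2) -> List[List[Cell]]:
    """Split the order into decoding steps whose sizes grow geometrically.

    Hypothetical fixed schedule; each group stands for tokens decoded in parallel.
    """
    steps, start, size = [], 0, first
    while start < len(order):
        steps.append(order[start:start + size])
        start += size
        size *= growth
    return steps


order = spiral_order(3, 3)
steps = expanding_steps(order)
```

Because neighboring positions in the spiral are spatially adjacent, each parallel step extends a contiguous region around the already generated center.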
Details
Motivation: Existing autoregressive visual generation methods are inefficient because of token-by-token decoding or the complexity of multi-scale representations, so a generation scheme that is both more efficient and consistent with perception is needed. Method: Tokens are generated in a spiral order that starts at the image center and expands outward, and a length-adaptive decoding strategy dynamically adjusts the number of tokens predicted at each step, enabling parallel decoding and better computational efficiency. Result: Experiments on ImageNet show that EAR achieves a state-of-the-art trade-off between fidelity and efficiency among single-scale autoregressive models, reducing computational cost while improving generation quality. Conclusion: By mimicking the center-to-periphery mechanism of human visual perception, EAR opens a new direction for efficient, scalable, and cognitively aligned autoregressive image generation. Abstract: Autoregressive models have recently shown great promise in visual generation by leveraging discrete token sequences akin to language modeling. However, existing approaches often suffer from inefficiency, either due to token-by-token decoding or the complexity of multi-scale representations. In this work, we introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system's center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding. To further enhance flexibility and speed, we propose a length-adaptive decoding strategy that dynamically adjusts the number of tokens predicted at each step. This biologically inspired design not only reduces computational cost but also improves generation quality by aligning the generation order with perceptual relevance. Extensive experiments on ImageNet demonstrate that EAR achieves state-of-the-art trade-offs between fidelity and efficiency on single-scale autoregressive models, setting a new direction for scalable and cognitively aligned autoregressive image generation.[108] Multi-Text Guided Few-Shot Semantic Segmentation
Qiang Jiao,Bin Yan,Yi Yang,Mengrui Shi,Qiang Zhang
Main category: cs.CV
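TL;DR: Proposes MTGNet, a multi-text guided few-shot semantic segmentation network that fuses diverse text prompts to strengthen textual priors and refine cross-modal visual priors, improving segmentation of complex categories.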
Details
Motivation: Existing CLIP-based few-shot segmentation methods rely on a single text prompt, which fails to fully activate target regions, is sensitive to noise, and lacks cross-modal interaction, limiting their ability to model complex semantic categories. Method: MTGNet is a dual-branch framework with a Multi-Textual Prior Refinement (MTPR) module, a Text Anchor Feature Fusion (TAFF) module, and a Foreground Confidence-Weighted Attention (FCWA) module; it improves segmentation through multi-prompt fusion, cross-modal prototype transfer, and self-similarity enhancement. Result: The 1-shot mIoU reaches 76.8% on PASCAL-5i and 57.4% on COCO-20i, with especially clear gains over existing methods in settings with large intra-class variation. Conclusion: By combining multiple text prompts with cooperating modules, MTGNet broadens semantic coverage and makes visual priors more robust in few-shot semantic segmentation, improving the completeness and consistency of segmentation for complex categories. Abstract: Recent CLIP-based few-shot semantic segmentation methods introduce class-level textual priors to assist segmentation by typically using a single prompt (e.g., a photo of class). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross-modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi-Text Guided Few-Shot Semantic Segmentation Network (MTGNet), a dual-branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. Specifically, we design a Multi-Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi-text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra-class variations. Furthermore, a Foreground Confidence-Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self-similarity within support foreground features. It adaptively down-weights inconsistent regions and effectively suppresses interference in the query segmentation process.
Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1-shot setting, it achieves 76.8% mIoU on PASCAL-5i and 57.4% on COCO-20i, with notable improvements in folds exhibiting high intra-class variations.[109] A Hybrid CNN-ViT-GNN Framework with GAN-Based Augmentation for Intelligent Weed Detection in Precision Agriculture
Pandiyaraju V,Abishek Karthik,Sreya Mynampati,Poovarasan L,D. Saraswathi
Main category: cs.CV
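TL;DR: Proposes a hybrid deep learning framework combining CNN, ViT, and GNN for high-accuracy weed detection under complex field conditions; with GAN-based augmentation and self-supervised pre-training, it reaches 99.33% accuracy on multi-benchmark datasets.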
Details
Motivation: Accurate identification of weed species in precision agriculture enables selective herbicide application and supports sustainable crop management, but complex field environments and limited annotated data constrain existing methods. Method: A hybrid framework fusing Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) is proposed; GAN-based data augmentation balances the class distribution, and self-supervised contrastive pre-training improves generalization with little annotated data. Result: The model reaches 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets; it provides local, global, and relational feature representations, is reported to be highly interpretable and adaptable, and supports real-time, efficient deployment on edge devices. Conclusion: The hybrid framework markedly improves the accuracy and robustness of weed detection in complex environments, helps reduce herbicide overuse, and offers a scalable technical option for sustainable precision agriculture. Abstract: The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was applied to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments yield 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment to edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.[110] Scriboora: Rethinking Human Pose Forecasting
Daniel Bermuth,Alexander Poeppel,Wolfgang Reif
Main category: cs.CV
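TL;DR: This paper evaluates a range of human pose forecasting algorithms on absolute pose forecasting, uncovers many reproducibility issues, and provides a unified training and evaluation pipeline. Drawing on speech understanding models, it adapts them to pose forecasting and improves on the current state of the art. It also evaluates robustness under realistic noise (joint coordinates from a pose estimator), introduces a new dataset variant, and shows that unsupervised finetuning partially recovers the performance lost to noise.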
Details
Motivation: Existing pose forecasting methods suffer from reproducibility issues, and model robustness under realistic noise has not been evaluated, which limits their use in real-world settings. Method: A unified training and evaluation framework is built; speech understanding models are borrowed and adapted; noisy data generated by a pose estimator is introduced for robustness testing; and unsupervised finetuning is used to improve performance under noise. Result: The adapted speech models reach state-of-the-art performance on absolute pose forecasting; using estimated (noisy) poses causes a substantial performance drop, part of which is recovered by unsupervised finetuning. Conclusion: Cross-domain transfer (such as using speech models for pose forecasting) is an effective way to improve performance, and evaluating and strengthening robustness under realistic noise is essential for practical applications. Abstract: Human pose forecasting predicts future poses based on past observations, and has many significant applications in areas such as action recognition, autonomous driving or human-robot interaction. This paper evaluates a wide range of pose forecasting algorithms in the task of absolute pose forecasting, revealing many reproducibility issues, and provides a unified training and evaluation pipeline. After drawing a high-level analogy to the task of speech understanding, it is shown that recent speech models can be efficiently adapted to the task of pose forecasting, and improve current state-of-the-art performance. Finally, the robustness of the models is evaluated using noisy joint coordinates obtained from a pose estimator model, to reflect a realistic type of noise that is closer to real-world applications. For this, a new dataset variation is introduced, and it is shown that estimated poses result in a substantial performance degradation, and how much of it can be recovered by unsupervised finetuning.[111] Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector
Weiheng Zhu,Gang Cao,Jing Liu,Lifang Yu,Shaowei Weng
Main category: cs.CV
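TL;DR: Proposes the Dual-domain Feature Importance Attack (DuFIA), which generates adversarial examples by jointly modeling spatial and frequency-domain feature importance and effectively reduces the accuracy of AI-generated image detectors.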
Details
Motivation: Existing AI-generated image detectors perform well under clean conditions, but their security against adversarial attacks has not been studied sufficiently, so stronger attacks are needed to assess their robustness. Method: Forensically important features are captured with spatially interpolated gradients and frequency-aware perturbations, and feature importance from the spatial and frequency domains is fused to guide the optimization that generates adversarial examples. Result: Experiments on a variety of AIGI detectors show that DuFIA has good cross-model transferability, transparency, and robustness. Conclusion: DuFIA weakens the detection ability of different AIGI detectors, exposing the fragility of current detectors in adversarial settings and helping to improve the security of future detection methods. Abstract: Recent AI-generated image (AIGI) detectors achieve impressive accuracy under clean conditions. In view of antiforensics, it is significant to develop advanced adversarial attacks for evaluating the security of such detectors, which remains insufficiently explored. This letter proposes a Dual-domain Feature Importance Attack (DuFIA) scheme to invalidate AIGI detectors to some extent. Forensically important features are captured by the spatially interpolated gradient and frequency-aware perturbation. The adversarial transferability is enhanced by jointly modeling spatial and frequency-domain feature importances, which are fused to guide the optimization-based adversarial example generation. Extensive experiments across various AIGI detectors verify the cross-model transferability, transparency and robustness of DuFIA.[112] From Low-Rank Features to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers
Huiyuan Tian,Bonan Xu,Shijian Li,Xin Jin
Main category: cs.CV
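TL;DR: This paper analyzes why feature-map knowledge distillation works poorly for Vision Transformers. Layer-wise SVD and a token-level Spectral Energy Pattern (SEP) analysis reveal an encoding that is globally low-rank but locally high-bandwidth, which makes feature alignment between teacher and student difficult. Based on this, two simple remedies are proposed, a lightweight post-hoc projector or widening only the student's last block, and both substantially improve distillation.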
Details
Motivation: To understand why feature-map knowledge distillation performs poorly for Vision Transformers and to use representation analysis to guide the design of more effective distillation methods. Method: A layer-wise singular value decomposition (SVD) analyzes the low-rank structure of full feature matrices, and a token-level Spectral Energy Pattern (SEP) analysis examines how individual tokens use channels; based on the encoding mismatch this reveals, two strategies are designed: post-hoc feature lifting and native width alignment. Result: On ImageNet-1K, the proposed methods raise DeiT-Tiny accuracy from 74.86% to 77.53% and 78.23% (with CaiT-S24 as the teacher), and also improve students trained without a teacher. Conclusion: Feature distillation fails in ViTs because of an encoding mismatch between global low-rank structure and local high bandwidth; strategies designed around the low-rank structure restore distillation and give interpretable guidance for designing compact ViTs. Abstract: Feature-map knowledge distillation (KD) is highly effective for convolutional networks but often fails for Vision Transformers (ViTs). To understand this failure and guide method design, we conduct a two-view representation analysis of ViTs. First, a layer-wise Singular Value Decomposition (SVD) of full feature matrices shows that final-layer representations are globally low-rank: for CaiT-S24, only $121/61/34/14$ dimensions suffice to capture $99\%/95\%/90\%/80\%$ of the energy. In principle, this suggests that a compact student plus a simple linear projector should be enough for feature alignment, contradicting the weak empirical performance of standard feature KD. To resolve this paradox, we introduce a token-level Spectral Energy Pattern (SEP) analysis that measures how each token uses channel capacity. SEP reveals that, despite the global low-rank structure, individual tokens distribute energy over most channels, forming a high-bandwidth encoding pattern. This results in an encoding mismatch between wide teachers and narrow students. Motivated by this insight, we propose two minimal, mismatch-driven strategies: (1) post-hoc feature lifting with a lightweight projector retained during inference, or (2) native width alignment that widens only the student's last block to the teacher's width. On ImageNet-1K, these strategies reactivate simple feature-map distillation in ViTs, raising DeiT-Tiny accuracy from $74.86\%$ to $77.53\%$ and $78.23\%$ when distilling from CaiT-S24, while also improving standalone students trained without any teacher.
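The "dimensions needed to capture a given share of the energy" figures quoted above can be made concrete with a small helper. This is a sketch under a stated assumption: energy is taken to be the sum of squared singular values, which is the common convention but is not spelled out in the abstract. The function name and the toy spectrum are hypothetical.

```python
from typing import Sequence


def dims_for_energy(singular_values: Sequence[float], fraction: float) -> int:
    """Smallest number of leading singular directions whose squared
    singular values sum to at least `fraction` of the total."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    energy = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energy)
    running = 0.0
    for k, e in enumerate(energy, start=1):
        running += e
        if running >= fraction * total:
            return k
    return len(energy)


# Toy spectrum (made up); in practice the values would come from an SVD
# of a layer's feature matrix, for example via numpy.linalg.svd.
toy_spectrum = [3.0, 2.0, 1.0, 1.0]
ranks = [dims_for_energy(toy_spectrum, f) for f in (0.80, 0.90, 0.99)]
```

A fast-decaying spectrum yields small ranks at every threshold, which is the sense in which a feature matrix is called globally low-rank.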
Our analysis thus explains why ViT feature distillation fails and shows how exploiting low-rank structure yields effective, interpretable remedies and concrete design guidance for compact ViTs.[113] AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
Urjitkumar Patel,Fang-Chun Yeh,Chinmay Gondhalekar
Main category: cs.CV
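TL;DR: Proposes AVATAAR, a modular and interpretable framework for long-video question answering that combines global and local context and adds a Pre Retrieval Thinking Agent and a Rethink Module whose feedback loop refines the retrieval strategy; it clearly outperforms a baseline on the CinePile benchmark.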
Details
Motivation: Existing large vision language models fall short on complex video questions that require both comprehensive understanding and detailed analysis, and they struggle with temporal, thematic, and technical questions about long videos. Method: AVATAAR fuses a global summary with local context, uses a Pre Retrieval Thinking Agent for initial reasoning, and couples it with a Rethink Module in a feedback loop, so that the retrieval strategy is refined based on partial answers in a way that mimics human iterative reasoning. Result: On the CinePile benchmark, AVATAAR achieves relative gains over the baseline of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension; every module contributes positively, and the feedback loop is crucial for adaptability. Conclusion: AVATAAR improves long-video question answering while combining accuracy, interpretability, and extensibility, offering a scalable solution for long-video understanding. Abstract: With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR's effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.[114] CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking
Sifan Zhou,Yichao Cao,Jiahao Nie,Yuqian Fu,Ziyu Zhao,Xiaobo Lu,Shuo Wang
Main category: cs.CV
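TL;DR: Proposes CompTrack, an end-to-end framework for 3D single object tracking in LiDAR point clouds that uses spatial foreground prediction and an information bottleneck-guided dynamic token compression module to remove background noise and foreground redundancy, achieving efficient, accurate tracking on several datasets at a real-time 90 FPS.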
Details
Motivation: The inherent sparsity of point clouds leaves existing trackers with both spatial redundancy (background noise) and informational redundancy (within the foreground), which hurts accuracy and efficiency. Method: CompTrack includes a Spatial Foreground Predictor (SFP) module that filters background noise and an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that uses online SVD analysis to compress the foreground via low-rank approximation into a compact, highly informative set of proxy tokens. Result: Experiments on KITTI, nuScenes, and Waymo show that CompTrack reaches top-tier tracking performance while running in real time at 90 FPS on a single RTX 3090. Conclusion: By systematically removing both forms of redundancy in point clouds, CompTrack improves accuracy and efficiency in 3D single object tracking and shows good potential for practical use. Abstract: 3D single object tracking (SOT) in LiDAR point clouds is a critical task in computer vision and autonomous driving. Despite the great success achieved so far, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. First, CompTrack incorporates a Spatial Foreground Predictor (SFP) module to filter out irrelevant background noise based on information entropy, addressing spatial redundancy. Subsequently, its core is an Information Bottleneck-guided Dynamic Token Compression (IB-DTC) module that eliminates the informational redundancy within the foreground. Theoretically grounded in low-rank approximation, this module leverages an online SVD analysis to adaptively compress the redundant foreground into a compact and highly informative set of proxy tokens. Extensive experiments on KITTI, nuScenes and Waymo datasets demonstrate that CompTrack achieves top-performing tracking performance with superior efficiency, running at a real-time 90 FPS on a single RTX 3090 GPU.[115] Learning from Mistakes: Loss-Aware Memory Enhanced Continual Learning for LiDAR Place Recognition
Xufei Wang,Junqiao Zhao,Siyue Tao,Qiwen Gu,Wonbong Kim,Tiantian Feng
Main category: cs.CV
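TL;DR: Proposes KDF+, a continual learning framework for LiDAR place recognition that mitigates catastrophic forgetting with a loss-aware sampling strategy and a rehearsal enhancement mechanism.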
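One way to picture a loss-aware sampling strategy of the kind named in this entry is weighted sampling without replacement, where a sample's chance of entering the replay memory grows with its loss while easy samples keep a small nonzero chance. The sketch below is a generic illustration, not the KDF+ procedure: the weighting formula, the `floor` parameter, and the function name are assumptions, and the selection uses the Efraimidis-Spirakis key trick.

```python
import random
from typing import List, Sequence


def loss_aware_sample(losses: Sequence[float], k: int,
                      floor: float = 0.1, seed: int = 0) -> List[int]:
    """Pick up to k distinct sample indices, favoring high-loss samples.

    floor: baseline weight given to every sample, so easy samples are
    still occasionally chosen and the memory keeps some coverage.
    """
    if not losses:
        return []
    rng = random.Random(seed)
    top = max(losses) or 1.0  # avoid dividing by zero when all losses are 0
    weights = [floor + max(loss, 0.0) / top for loss in losses]
    # Efraimidis-Spirakis: draw u ~ U(0, 1) per item, keep the k largest u**(1/w).
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])


# Hypothetical per-sample losses; index 3 is the "hard" sample.
example_losses = [0.1, 0.1, 0.1, 5.0]
memory_indices = loss_aware_sample(example_losses, k=2, seed=1)
```

Raising `floor` moves the scheme toward uniform sampling; lowering it concentrates the memory on the hardest samples.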
Details
Motivation: Existing LiDAR place recognition methods tend to suffer catastrophic forgetting when adapting to new environments, which makes continual learning difficult. Method: The KDF paradigm is extended with a loss-aware sampling strategy that selects hard samples for replay according to their loss, and with a rehearsal enhancement mechanism that slightly lowers the loss of memory samples during new-task training to reinforce knowledge retention. Result: Experiments on multiple benchmarks show that KDF+ outperforms existing continual learning methods and consistently improves the performance of mainstream frameworks. Conclusion: KDF+ mitigates catastrophic forgetting in LiDAR place recognition, integrates easily with existing frameworks, and has good prospects for practical use. Abstract: LiDAR place recognition plays a crucial role in SLAM, robot navigation, and autonomous driving. However, existing LiDAR place recognition methods often struggle to adapt to new environments without forgetting previously learned knowledge, a challenge widely known as catastrophic forgetting. To address this issue, we propose KDF+, a novel continual learning framework for LiDAR place recognition that extends the KDF paradigm with a loss-aware sampling strategy and a rehearsal enhancement mechanism. The proposed sampling strategy estimates the learning difficulty of each sample via its loss value and selects samples for replay according to their estimated difficulty. Harder samples, which tend to encode more discriminative information, are sampled with higher probability while maintaining distributional coverage across the dataset. In addition, the rehearsal enhancement mechanism encourages memory samples to be further refined during new-task training by slightly reducing their loss relative to previous tasks, thereby reinforcing long-term knowledge retention. Extensive experiments across multiple benchmarks demonstrate that KDF+ consistently outperforms existing continual learning methods and can be seamlessly integrated into state-of-the-art continual learning frameworks for LiDAR place recognition to yield significant and stable performance gains. The code will be available at https://github.com/repo/KDF-plus.[116] US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery
Miruna-Alexandra Gafencu,Yordanka Velikova,Nassir Navab,Mohammad Farid Azampour
Main category: cs.CV
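TL;DR: Proposes a multimodal deep learning method that uses a single X-ray image to complete vertebral structures occluded in 3D ultrasound, substantially improving reconstruction accuracy and achieving registration-free visualization of the complete lumbar spine in phantom studies.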
Details
Motivation: Ultrasound has advantages for guidance in spinal surgery, but acoustic shadowing from bone prevents it from showing vertebral anatomy completely, which limits its use. Method: A multimodal deep learning model fuses 3D ultrasound with a single-view X-ray; paired training data (2D lateral X-ray views and 3D partial vertebra representations) is generated by simulation, and morphological information from both modalities is combined to complete the vertebrae. Result: Phantom studies show a significant improvement in vertebral reconstruction (p < 0.001) and a more accurate, complete 3D visualization of the lumbar spine, without registration to preoperative modalities such as CT. Conclusion: Adding a single X-ray overcomes the main limitation of ultrasound in spinal imaging while preserving its radiation-free, real-time advantages, advancing its use in intraoperative guidance. Abstract: Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to the state of the art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step to future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlaid on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete[117] MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation
Bin Xie,Gady Agam
Main category: cs.CV
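TL;DR: Proposes MaskMed, a medical image segmentation method that combines a decoupled segmentation head with a Full-Scale Aware Deformable Transformer module and outperforms existing methods on multiple datasets.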
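A minimal sketch of how a decoupled head's two outputs could be merged into a dense label map. The abstract does not say how MaskMed combines them, so the MaskFormer-style rule used here (weight each query's mask probability by its class probability, then take the per-pixel argmax) is an assumption, and `combine_queries` is a hypothetical name. The "no object" class that such heads usually carry is omitted for brevity.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over one query's class logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def combine_queries(class_logits: List[List[float]],
                    mask_logits: List[List[List[float]]]) -> List[List[int]]:
    """Merge per-query class and mask predictions into a label map.

    class_logits: Q x C, one class distribution per object query.
    mask_logits:  Q x H x W, one class-agnostic mask per object query.
    Returns an H x W grid of class indices.
    """
    class_probs = [softmax(row) for row in class_logits]
    num_queries = len(mask_logits)
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    num_classes = len(class_logits[0])

    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of
            # P(class = c | query) * P(pixel belongs to query's mask).
            scores = [
                sum(class_probs[q][c] * sigmoid(mask_logits[q][y][x])
                    for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels
```

With two queries that each claim one pixel of a 1x2 image and vote for different classes, the function assigns each pixel the class of the query whose mask covers it.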
Details
Motivation: Medical image segmentation conventionally uses a point-wise convolutional segmentation head in which each output channel corresponds to one class; this rigid design limits feature sharing and semantic generalization.
Method: A unified decoupled segmentation head splits multi-class prediction into class-agnostic mask prediction and class label prediction using shared object queries. A Full-Scale Aware Deformable Transformer module lets low-resolution features attend to full-resolution features via deformable attention.
Result: MaskMed improves on nnUNet by +2.0% Dice on AMOS 2022 and by +6.9% Dice on BTCV.
Conclusion: Through its decoupled design and efficient feature fusion, MaskMed achieves state-of-the-art performance in medical image segmentation.
Abstract: Medical image segmentation typically adopts a point-wise convolutional segmentation head to predict dense labels, where each output channel is heuristically tied to a specific class. This rigid design limits both feature sharing and semantic generalization. In this work, we propose a unified decoupled segmentation head that separates multi-class prediction into class-agnostic mask prediction and class label prediction using shared object queries. Furthermore, we introduce a Full-Scale Aware Deformable Transformer module that enables low-resolution encoder features to attend across full-resolution encoder features via deformable attention, achieving memory-efficient and spatially aligned full-scale fusion. Our proposed method, named MaskMed, achieves state-of-the-art performance, surpassing nnUNet by +2.0% Dice on AMOS 2022 and +6.9% Dice on BTCV.
[118] FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
Tingrui Shen,Yiheng Zhang,Chen Tang,Chuan Ping,Zixing Zhao,Le Wan,Yuwang Wang,Ronggang Wang,Shengfeng He
Main category: cs.CV
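TL;DR: Presents FlashMesh, a fast, high-quality 3D mesh generation framework built on a predict-correct-verify paradigm. It exploits structural correlations in mesh data for multi-level parallel speculative decoding, giving up to a 2x speedup over conventional autoregressive models while also improving generation fidelity.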
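For readers unfamiliar with speculative decoding, the verification step can be sketched as follows. This is the generic greedy draft-and-verify rule, not FlashMesh's hourglass-specific scheme, whose details the abstract does not give; `verify_draft` and `target_next` are hypothetical names. A real system scores all draft positions in one parallel pass of the target model, whereas this sketch calls it once per token for clarity.

```python
from typing import Callable, List


def verify_draft(prefix: List[int],
                 draft: List[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Accept the longest draft prefix that the target model agrees with.

    prefix:      tokens generated so far.
    draft:       tokens speculated cheaply for the next positions.
    target_next: greedy next-token function of the full (slow) model.
    Returns the extended sequence, which always grows by at least one token.
    """
    out = list(prefix)
    for token in draft:
        expected = target_next(out)
        if token != expected:
            # First disagreement: keep the target's token, drop the rest.
            out.append(expected)
            return out
        out.append(token)
    # Every draft token was accepted; the target supplies one bonus token.
    out.append(target_next(out))
    return out
```

Because every accepted token equals the target model's own greedy choice, the output matches plain greedy decoding; the speedup comes from accepting several tokens per target pass when the draft is usually right.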
Details
Motivation: Autoregressive models can generate high-quality 3D meshes, but token-by-token decoding makes inference slow, which is a poor fit for interactive and large-scale applications.
Method: FlashMesh adopts a predict-correct-verify paradigm. It exploits structural and geometric correlations at the face, point, and coordinate levels of a mesh and designs a speculative decoding mechanism for the hourglass transformer architecture, enabling parallel multi-token generation.
Result: Experiments show up to a 2x speedup over standard autoregressive models together with higher generation quality, confirming that structural priors can improve both the efficiency and the fidelity of mesh generation.
Conclusion: Systematically exploiting the structural priors in 3D mesh data can substantially accelerate autoregressive generation without sacrificing quality, and can even improve it, offering an efficient solution for practical applications.
Abstract: Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2x speedup over standard autoregressive models while also improving generation fidelity. Our results demonstrate that structural priors in mesh data can be systematically harnessed to accelerate and enhance autoregressive generation.
[119] The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Dante Francisco Wasmuht,Otto Brookes,Maximillian Schall,Pablo Palencia,Chris Beirne,Tilo Burghardt,Majid Mirmehdi,Hjalmar Kühl,Mimi Arandjelovic,Sam Pottie,Peter Bermant,Brandon Asheim,Yi Jin Toh,Adam Elzinga,Jason Holmberg,Andrew Whitworth,Eleanor Flatt,Laura Gustafson,Chaitanya Ryali,Yuan-Ting Hu,Baishan Guo,Andrew Westbury,Kate Saenko,Didac Suris
Main category: cs.CV
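TL;DR: SA-FARI is the first large-scale, multi-species, multi-region multi-animal tracking dataset for wildlife. It contains 11,609 camera trap videos with nearly 46 hours of densely annotated footage covering 99 species across 4 continents, provides high-quality spatio-temporal annotations and camera locations, and establishes detection and tracking benchmarks using state-of-the-art vision-language models.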
Details
Motivation: Existing wildlife video datasets are limited in scale, species diversity, and spatio-temporal coverage, so there is no benchmark suitable for training general-purpose multi-animal tracking (MAT) models.
Method: The authors collect 11,609 camera trap videos recorded between 2014 and 2024 at 741 locations across 4 continents and annotate them densely with bounding boxes, segmentation masks, species labels, and anonymized camera locations. They build multi-animal detection and tracking benchmarks using vision-language models (such as SAM 3) and vision-only methods.
Result: SA-FARI contains about 46 hours of annotated video, 16,224 masklet identities, and 942,702 annotated instances, making it the largest open-source MAT dataset for wild animals. Experiments show that vision-language models perform well with both species-specific and generic prompts and outperform vision models built specifically for wildlife.
Conclusion: SA-FARI is the first large-scale dataset for wildlife multi-animal tracking that combines high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, advancing generalizable multi-animal tracking in the wild.
Abstract: Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
[120] Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning
Tao Hu,Lan Li,Zhen-Hao Xie,Da-Wei Zhou
Main category: cs.CV
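TL;DR: Proposes HASTEN, which mitigates catastrophic forgetting in CLIP-based class-incremental learning by anchoring a hierarchical semantic tree in hyperbolic space.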
Details
Motivation: Existing CLIP-based class-incremental learning methods do not explicitly model the hierarchy of visual and linguistic concepts, which leads to feature drift for fine-grained classes and to catastrophic forgetting.
Method: With an external knowledge graph as supervision, visual and textual features are embedded in hyperbolic space, and gradients are projected onto the null space of the shared mapper to reduce interference between tasks.
Result: HASTEN outperforms existing methods across multiple experiments, preserving hierarchical relationships and reducing forgetting.
Conclusion: By explicitly modeling hierarchy and orthogonalizing gradients, HASTEN markedly improves model stability and generalization in class-incremental learning.
Abstract: Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP offer transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like "dog" subsumes fine-grained categories such as "Labrador" and "Golden Retriever," and each category entails its images. Existing CLIP-based CIL methods fail to explicitly capture this inherent hierarchy, leading to fine-grained class feature drift during incremental updates and ultimately to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring), which anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically to enable the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
[121] Multi-Stage Residual-Aware Unsupervised Deep Learning Framework for Consistent Ultrasound Strain Elastography
Shourov Joarder,Tushar Talukder Showrav,Md. Kamrul Hasan
Main category: cs.CV
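TL;DR: Proposes MUSSE-Net, an unsupervised multi-stage sequential deep learning framework for robust and consistent ultrasound strain elastography that markedly improves the accuracy and stability of strain estimation.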
Details
Motivation: Ultrasound strain elastography (USE) has important clinical diagnostic value but is limited by tissue decorrelation noise, a lack of ground-truth labels, and inconsistent strain estimates under different deformation conditions.
Method: The MUSSE-Net framework is built around USSE-Net, an end-to-end multi-stream encoder-decoder that processes pre- and post-deformation RF signals in parallel. It introduces Context-Aware Complementary Feature Fusion (CACFF), a Tri-Cross Attention (TCA) bottleneck, and a Cross-Attentive Fusion (CAF) decoder, uses a consistency loss to enforce temporal coherence, and adds a residual refinement stage to improve performance.
Result: Validation on simulation, in vivo, and BUET clinical datasets shows that MUSSE-Net outperforms existing unsupervised methods. On simulation data it reaches a target SNR of 24.54, a background SNR of 132.76, a CNR of 59.81, and an elastographic SNR of 9.73; on the BUET dataset it shows stronger lesion contrast and noise suppression.
Conclusion: MUSSE-Net effectively addresses the key challenges in USE, delivers accurate and stable strain estimates, and has good prospects for clinical use.
Abstract: Ultrasound Strain Elastography (USE) is a powerful non-invasive imaging technique for assessing tissue mechanical properties, offering crucial diagnostic value across diverse clinical applications. However, its clinical application remains limited by tissue decorrelation noise, scarcity of ground truth, and inconsistent strain estimation under different deformation conditions. To overcome these barriers, we propose MUSSE-Net, a residual-aware, multi-stage unsupervised sequential deep learning framework designed for robust and consistent strain estimation. At its backbone lies our proposed USSE-Net, an end-to-end multi-stream encoder-decoder architecture that processes pre- and post-deformation RF sequences in parallel to estimate displacement fields and axial strains. The novel architecture incorporates a Context-Aware Complementary Feature Fusion (CACFF)-based encoder, a Tri-Cross Attention (TCA) bottleneck, and a Cross-Attentive Fusion (CAF)-based sequential decoder. To ensure temporal coherence and strain stability across varying deformation levels, this architecture leverages a tailored consistency loss. Finally, within the MUSSE-Net framework, a secondary residual refinement stage further enhances accuracy and suppresses noise. Extensive validation on simulation, in vivo, and private clinical datasets from the Bangladesh University of Engineering and Technology (BUET) medical center demonstrates that MUSSE-Net outperforms existing unsupervised approaches. MUSSE-Net achieves state-of-the-art performance with a target SNR of 24.54, background SNR of 132.76, CNR of 59.81, and elastographic SNR of 9.73 on simulation data. In particular, on the BUET dataset, MUSSE-Net produces strain maps with enhanced lesion-to-background contrast and significant noise suppression, yielding clinically interpretable strain patterns.
[122] MambaIO: Global-Coordinate Inertial Odometry for Pedestrians via Multi-Scale Frequency-Decoupled Modeling
Shanshan Zhang
Main category: cs.CV
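TL;DR: Proposes MambaIO, a Mamba-based inertial odometry method that decomposes IMU signals with a Laplacian pyramid and processes the low- and high-frequency components with a Mamba architecture and a convolutional structure, respectively, achieving state-of-the-art localization accuracy on multiple public datasets.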
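The frequency split named in this summary can be illustrated on a single 1D signal channel. This is a one-level Laplacian pyramid sketch under stated assumptions: the `[1, 2, 1] / 4` blur kernel, the nearest-neighbour upsampling, and all function names are illustrative choices, not details taken from the paper.

```python
from typing import List, Tuple


def blur(x: List[float]) -> List[float]:
    """Smooth with a [1, 2, 1] / 4 kernel, replicating the edge samples."""
    n = len(x)
    return [0.25 * x[max(i - 1, 0)] + 0.5 * x[i] + 0.25 * x[min(i + 1, n - 1)]
            for i in range(n)]


def upsample(low: List[float], n: int) -> List[float]:
    """Stretch the half-rate signal back to length n by repeating samples."""
    return [low[i // 2] for i in range(n)]


def split_one_level(x: List[float]) -> Tuple[List[float], List[float]]:
    """Return (low, high): a half-rate smooth band and a full-rate residual."""
    low = blur(x)[::2]                      # blur, then keep every 2nd sample
    predicted = upsample(low, len(x))       # what the low band alone explains
    high = [a - b for a, b in zip(x, predicted)]
    return low, high


def reconstruct(low: List[float], high: List[float]) -> List[float]:
    """Invert split_one_level: upsampled low band plus residual."""
    return [a + b for a, b in zip(upsample(low, len(high)), high)]
```

The residual is defined as whatever the upsampled low band fails to explain, so `reconstruct` recovers the input up to floating-point rounding. In a MambaIO-like design the `low` band would feed the sequence model and the `high` band the convolutional branch.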
Details
Motivation: Inertial odometry has mostly processed IMU data in the global coordinate frame, but drone studies show that the body frame can improve accuracy, so the paper re-evaluates how well the global frame suits pedestrian inertial odometry.
Method: The effectiveness of the global coordinate frame is assessed systematically through theoretical analysis, qualitative inspection, and quantitative experiments. The proposed MambaIO uses a Laplacian pyramid to split IMU measurements into low- and high-frequency components, which are processed by a Mamba architecture and a convolutional structure, respectively.
Result: Experiments show that MambaIO substantially reduces localization error and reaches SOTA performance on multiple public datasets.
Conclusion: MambaIO combines the Mamba model's ability to extract implicit contextual motion cues with a convolutional network's ability to capture local detail, and is the first work to apply the Mamba architecture to inertial odometry.
Abstract: Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, IMU measurements in IO have been processed under two coordinate system paradigms: the body coordinate frame and the global coordinate frame, with the latter being widely adopted. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global coordinate frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the inertial odometry task.
[123] INQUIRE-Search: A Framework for Interactive Discovery in Large-Scale Biodiversity Databases
Edward Vendrow,Julia Chae,Rupa Kurinchi-Vendhan,Isaac Eckert,Jazlynn Hall,Marta Jarzyna,Reymond Miyajima,Ruth Oliver,Laura Pollock,Lauren Schrack,Scott Yanco,Oisin Mac Aodha,Sara Beery
Main category: cs.CV
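TL;DR: INQUIRE-Search is an open-source system for interactively searching ecological image databases for concepts using natural language, making scientific discovery markedly more efficient.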
Details
Motivation: Ecological images on large community science platforms contain rich secondary information that traditional methods cannot extract at scale.
Method: The authors develop INQUIRE-Search, a natural-language interactive search system that supports rapid retrieval, verification, and export of specific concepts found in ecological images.
Result: The system is more efficient than traditional methods, and five case studies demonstrate its scientific applications, ranging from seasonal variation in behavior to forest regrowth after wildfires.
Conclusion: The tool opens up an efficient, scalable paradigm for scientific discovery and calls on scientists to rethink the research process and to develop new methods for experiment design and data analysis.
Abstract: Large community science platforms such as iNaturalist contain hundreds of millions of biodiversity images that often capture ecological context on behaviors, interactions, phenology, and habitat. Yet most ecological workflows rely on metadata filtering or manual inspection, leaving this secondary information inaccessible at scale. We introduce INQUIRE-Search, an open-source system that enables scientists to rapidly and interactively search within an ecological image database for specific concepts using natural language, verify and export relevant observations, and utilize this discovered data for novel scientific analysis. Compared to traditional methods, INQUIRE-Search takes a fraction of the time, opening up new possibilities for scientific questions that can be explored. Through five case studies, we show the diversity of scientific applications that a tool like INQUIRE-Search can support, from seasonal variation in behavior across species to forest regrowth after wildfires. These examples demonstrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we emphasize that using such AI-enabled discovery tools for science calls for experts to reframe the priorities of the scientific process and develop novel methods for experiment design, data collection, survey effort, and uncertainty analysis.
[124] GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI
Naomi Simumba,Nils Lehmann,Paolo Fraccaro,Hamed Alemohammad,Geeth De Mel,Salman Khan,Manil Maskey,Nicolas Longepe,Xiao Xiang Zhu,Hannah Kerner,Juan Bernabe-Moreno,Alexander Lacoste
Main category: cs.CV
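TL;DR: GEO-Bench-2 proposes a standardized framework for comprehensively evaluating geospatial foundation models (GeoFMs). It covers multiple tasks and 19 open datasets, and its capability groups help users choose a model that suits their task requirements.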
Details
Motivation: GeoFMs lack a unified, comparable evaluation standard, which makes model performance hard to compare and holds the field back.
Method: The authors build a multi-task benchmark spanning classification, segmentation, regression, object detection, and instance segmentation on 19 permissively licensed datasets, and introduce "capability" groups that rank models on datasets sharing characteristics such as resolution, bands, and temporality.
Result: Experiments show that no single model is best on all tasks. Models pretrained on natural images (such as ConvNext and DINO V3) do well on high-resolution tasks, while EO-specific models (such as TerraMind, Prithvi, and Clay) are better on multispectral applications such as agriculture and disaster response.
Conclusion: The best model choice depends on task requirements, data modalities, and constraints, and a general-purpose GeoFM remains an open problem. GEO-Bench-2 provides a reproducible, customizable evaluation framework that supports fair comparison and methodological innovation for GeoFMs.
Abstract: Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
[125] MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features
Sejuti Rahman,Swakshar Deb,MD. Sameer Iqbal Chowdhury,MD. Jubair Ahmed Sourov,Mohammad Shamsuddin
Main category: cs.CV
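TL;DR: Proposes a Multi-Frequency Graph Convolutional Network (MF-GCN) for multimodal depression detection that fuses low- and high-frequency information from eye-tracking, audio, and video data and significantly improves classification performance.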
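What "low-frequency" and "high-frequency" mean for a graph signal can be shown with the simplest two-band filter pair. The abstract does not describe the internals of the Multi-Frequency Filter Bank Module, so the filters below, `(I + A_hat) / 2` and `(I - A_hat) / 2` on the symmetrically normalized adjacency `A_hat`, are a common textbook choice used here as an assumption. The function names are hypothetical, and the graph is assumed to have no isolated nodes.

```python
import math
from typing import List, Tuple


def normalized_adjacency(adj: List[List[float]]) -> List[List[float]]:
    """Compute A_hat = D^(-1/2) A D^(-1/2); assumes every node has degree > 0."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)]
            for i in range(n)]


def split_frequencies(adj: List[List[float]],
                      x: List[float]) -> Tuple[List[float], List[float]]:
    """Split a node signal x into low- and high-frequency parts.

    low  = (I + A_hat) x / 2   keeps what a node shares with its neighbours.
    high = (I - A_hat) x / 2   keeps how a node differs from its neighbours.
    The two parts always sum back to x.
    """
    a_hat = normalized_adjacency(adj)
    n = len(x)
    neighbour_avg = [sum(a_hat[i][j] * x[j] for j in range(n)) for i in range(n)]
    low = [0.5 * (x[i] + neighbour_avg[i]) for i in range(n)]
    high = [0.5 * (x[i] - neighbour_avg[i]) for i in range(n)]
    return low, high
```

A signal that is constant over a regular graph lands entirely in the low band, while a signal that flips sign across every edge lands entirely in the high band. A model that uses only the first filter discards the second kind of signal.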
Details
Motivation: Existing graph-based models focus mainly on low-frequency information and overlook the potential value of high-frequency signals for depression detection, which limits their performance.
Method: A Multi-Frequency Filter Bank Module (MFFBM) combines low- and high-frequency graph signals to build the Multi-Frequency Graph Convolutional Network (MF-GCN), which detects depression from three modalities: eye tracking, audio, and video.
Result: The model reaches a sensitivity of 0.96 and an F2 score of 0.94 on binary classification, a sensitivity of 0.79 and a specificity of 0.87 on 3-class classification, and a sensitivity of 0.95 and an F2 score of 0.96 on the CMDC dataset, outperforming the baseline models in each case.
Conclusion: The proposed trimodal, multi-frequency framework effectively captures cross-modal interactions and markedly improves the accuracy and generalization of depression detection.
Abstract: Eye tracking data quantifies the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary (depressed and non-depressed) classification, the model achieved a sensitivity of 0.96 and an F2 score of 0.94. For the 3-class (no depression, mild to moderate depression, and severe depression) classification task, the proposed method achieved a sensitivity of 0.79 and a specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and an F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.
[126] Hyperspectral Image Classification using Spectral-Spatial Mixer Network
Mohammed Q. Alkhatib
Main category: cs.CV
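TL;DR: Proposes SS-MixNet, a lightweight deep learning model for hyperspectral image classification that combines 3D convolutions with MLP mixer blocks and an attention mechanism, and performs strongly on two real-world datasets using only 1% of the labeled data.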
Details
Motivation: Efficient, accurate hyperspectral image classification with limited labeled data requires balancing model complexity against the ability to model long-range spectral-spatial dependencies.
Method: SS-MixNet extracts local spectral-spatial features with 3D convolutions, adds two parallel MLP-style mixer blocks that capture long-range dependencies along the spectral and spatial dimensions, and uses a depthwise convolution-based attention mechanism to strengthen discriminative capability.
Result: Trained with only 1% of the labeled data, SS-MixNet reaches 95.68% and 93.86% overall accuracy on the QUH-Tangdaowan and QUH-Qingyun datasets, outperforming 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN.
Conclusion: SS-MixNet delivers strong, robust classification under limited supervision and is an efficient, practical model for hyperspectral image classification.
Abstract: This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model's effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: https://github.com/mqalkhatib/SS-MixNet
[127] First Frame Is the Place to Go for Video Content Customization
Jingxi Chen,Zongxia Li,Zhichao Liu,Guangyao Shi,Xiyang Wu,Fuxiao Liu,Cornelia Fermuller,Brandon Y. Feng,Yiannis Aloimonos
Main category: cs.CV
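TL;DR: Shows that video generation models treat the first frame as a "conceptual memory buffer" that stores visual entities, and uses this finding to achieve robust, generalized video content customization with only 20-50 training examples.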
Details
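Motivation: The first frame of a video is traditionally seen as nothing more than the starting point of generation; the paper asks whether it plays a deeper role, such as storing visual information for later generation.
Method: The authors analyze how video generation models implicitly use the first frame, propose treating it as a conceptual memory buffer, and on that basis design a few-shot, reference-based customization method that needs no architectural changes or large-scale finetuning.
Result: Experiments show that the method customizes video content effectively in diverse scenarios and is robust and generalizable with only 20-50 training examples.
Conclusion: Video generation models implicitly use the first frame as conceptual memory. This finding reveals a powerful, overlooked capability and offers a new approach to reference-based video customization.
Abstract: What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
[128] GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization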
Yikun Wang,Zuyan Liu,Ziyi Wang,Pengfei Liu,Han Hu,Yongming Rao
Main category: cs.CV
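TL;DR: Introduces GeoBench and GeoVista. GeoBench is a new benchmark for evaluating the geolocalization ability of agentic models; GeoVista is an agentic model that invokes tools such as image zoom-in and web search, with performance improved substantially by reinforcement learning with a hierarchical reward.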
Details
Motivation: Existing research concentrates on image manipulation tools and lacks general-purpose agentic visual reasoning models, and existing geolocalization benchmarks do not meet the need for high-resolution imagery and complex reasoning.
Method: The authors build GeoBench, a benchmark of photos, panoramas, and satellite images from around the world, and propose GeoVista, which integrates image zoom-in and web search tools and invokes them dynamically during reasoning. Training uses cold-start supervised fine-tuning followed by reinforcement learning, with a hierarchical reward that exploits multi-level geographical information.
Result: GeoVista clearly outperforms other open-source agentic models on GeoBench and is comparable to Gemini-2.5-flash and GPT-5 on most metrics.
Conclusion: A high-quality benchmark combined with a tool-augmented training framework can effectively improve agentic models on complex visual reasoning tasks such as geolocalization, advancing general-purpose multimodal agents.
Abstract: Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista greatly surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
[129] RoMa v2: Harder Better Faster Denser Feature Matching
Johan Edstedt,David Nordström,Yushan Zhang,Georg Bökman,Jonathan Astermark,Viktor Larsson,Anders Heyden,Fredrik Kahl,Mårten Wadenbäck,Michael Felsberg
Main category: cs.CV
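TL;DR: Proposes a new dense feature matching method that, through systematic improvements to the architecture, loss function, training pipeline, and model robustness, sets a new state of the art in both accuracy and efficiency.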