Table of Contents
cs.CL
[1] The Qualitative Laboratory: Theory Prototyping and Hypothesis Generation with Large Language Models
Hugues Draelants
Main category: cs.CL
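TL;DR: This paper proposes a new method that uses large language models for sociological persona simulation, serving as a "qualitative laboratory" for generating qualitative hypotheses about how different social groups interpret new information.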
Details
Motivation: Traditional methods such as vignette surveys lack discursive depth, and rule-based ABMs struggle to formalize complex worldviews, so a method is needed that generates natural language and simulates the reactions of different social groups with depth and flexibility.
Method: Uses large language models to build personas derived from a sociological theory of climate reception, then simulates these personas' reactions to policy messages to generate rich qualitative hypotheses.
Result: The simulation produced nuanced and counter-intuitive hypotheses, such as a conservative persona rejecting a national security frame, that challenge existing theoretical expectations.
Conclusion: As part of a "simulation then validation" workflow, the method is a superior tool for generating deep qualitative hypotheses for subsequent empirical testing.
Abstract: A central challenge in social science is to generate rich qualitative hypotheses about how diverse social groups might interpret new information. This article introduces and illustrates a novel methodological approach for this purpose: sociological persona simulation using Large Language Models (LLMs), which we frame as a "qualitative laboratory". We argue that for this specific task, persona simulation offers a distinct advantage over established methods. By generating naturalistic discourse, it overcomes the lack of discursive depth common in vignette surveys, and by operationalizing complex worldviews through natural language, it bypasses the formalization bottleneck of rule-based agent-based models (ABMs). To demonstrate this potential, we present a protocol where personas derived from a sociological theory of climate reception react to policy messages. The simulation produced nuanced and counter-intuitive hypotheses - such as a conservative persona's rejection of a national security frame - that challenge theoretical assumptions. We conclude that this method, used as part of a "simulation then validation" workflow, represents a superior tool for generating deeply textured hypotheses for subsequent empirical testing.
[2] Rate-Distortion Analysis of Compressed Query Delegation with Low-Rank Riemannian Updates
Faruk Alpay,Bugra Kilictas
Main category: cs.CL
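TL;DR: This paper proposes compressed query delegation (CQD), which addresses the reasoning bottleneck of bounded-context agents through low-rank tensor compression and collaborative reasoning with an external oracle, backed by theoretical and empirical results.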
Details
Motivation: Bounded-context agents perform poorly when intermediate reasoning exceeds their effective working memory, so long-horizon reasoning needs to become more efficient.
Method: Compresses the high-dimensional latent state into a low-rank tensor query, delegates it to an external oracle, and updates the state with Riemannian optimization on fixed-rank manifolds.
Result: Theory shows that spectral hard-thresholding is optimal for a particular distortion problem; experiments show CQD outperforming chain-of-thought baselines on a 2,500-item task suite and a human "cognitive mirror" benchmark.
Conclusion: CQD combines compression, delegation, and optimization to extend agents' reasoning capabilities, performing well in both theory and practice.
Abstract: Bounded-context agents fail when intermediate reasoning exceeds an effective working-memory budget. We study compressed query delegation (CQD): (i) compress a high-dimensional latent reasoning state into a low-rank tensor query, (ii) delegate the minimal query to an external oracle, and (iii) update the latent state via Riemannian optimization on fixed-rank manifolds. We give a math-first formulation: CQD is a constrained stochastic program with a query-budget functional and an oracle modeled as a noisy operator. We connect CQD to classical rate-distortion and information bottleneck principles, showing that spectral hard-thresholding is optimal for a natural constrained quadratic distortion problem, and we derive convergence guarantees for Riemannian stochastic approximation under bounded oracle noise and smoothness assumptions. Empirically, we report (A) a 2,500-item bounded-context reasoning suite (BBH-derived tasks plus curated paradox instances) comparing CQD against chain-of-thought baselines under fixed compute and context; and (B) a human "cognitive mirror" benchmark (N=200) measuring epistemic gain and semantic drift across modern oracles.
[3] Intention Collapse: Intention-Level Metrics for Reasoning in Language Models
Patricio Vera
Main category: cs.CL
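TL;DR: This paper introduces the concept of "intention collapse", the compression of a high-dimensional intention space into language space during generation, and proposes three model-agnostic intention metrics. Experiments on Mistral 7B show that chain of thought (CoT) substantially improves accuracy and changes the structure of intentions, revealing how internal intentions shift during reasoning and how they might be measured.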
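The entropy figures quoted for this entry are in bits, so a minimal sketch of Shannon entropy in base 2 may help calibrate them. The abstract does not say which distribution the intention entropy is computed over, so the function name `entropy_bits` and the example distributions below are illustrative assumptions, not the paper's implementation.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution.

    `probs` may be unnormalized; it is rescaled to sum to 1.
    Which distribution the paper uses is NOT stated in the abstract;
    this is only the generic definition.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    h = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # 0 * log(0) is treated as 0
            h -= q * math.log2(q)
    return h


# Hypothetical distributions over four candidate outcomes.
uniform = entropy_bits([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
peaked = entropy_bits([0.94, 0.02, 0.02, 0.02])   # mostly committed
```

A uniform distribution over four outcomes gives exactly 2 bits, while a distribution concentrated on one outcome gives well under 1 bit, which is the direction of change the entry reports between the baseline and CoT regimes.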
Details
Motivation: Language generation discards rich internal state information; the author aims to formalize this process (called intention collapse) and to explore how latent intentions that are never verbalized can be preserved and measured at inference time.
Method: Proposes three metrics: intention entropy, effective dimensionality, and latent knowledge recoverability. Uses a 4-bit quantized Mistral 7B model on GSM8K to compare intention characteristics across three inference regimes: direct answer, chain of thought (CoT), and random generation (babble).
Result: CoT raises accuracy from 5.5% to 53%, sharply reduces intention entropy (from 1.42 to 0.37 bits), and shows higher global effective dimensionality; a linear probe reaches AUROC 0.65 under CoT but is near chance in the baseline condition; intention entropy has weak item-level predictive power.
Conclusion: Intention-level metrics can distinguish inference regimes and expose latent information that is partly lost during language generation, but the current proxy metrics still have limitations.
Abstract: Every act of language generation compresses a rich internal state into a single token sequence. We call this process intention collapse: a many-to-one projection from a high-dimensional intention space I into an external language space L. We formalize intention collapse for contemporary language models, define three simple, model-agnostic intention metrics (intention entropy H_int, effective dimensionality dim_eff, and latent knowledge recoverability Recov), and propose an empirical agenda for studying how inference-time computation shapes internal intentions before they are verbalized. We also report a first small-scale experiment. Using a 4-bit Mistral 7B model on 200 GSM8K problems, we compare a direct-answer baseline, a chain-of-thought (CoT) regime, and a babble control. CoT raises accuracy from 5.5 percent to 53 percent, sharply reduces pre-collapse intention entropy (from 1.42 to 0.37 bits), and shows higher global effective dimensionality than the other regimes despite producing fewer tokens than babble. At the same time, H_int has little item-level predictive power, and a linear probe on I achieves AUROC 0.65 in the CoT regime but only about chance in the baseline regime, where it collapses to the majority class. These preliminary results indicate that intention-level metrics can distinguish inference regimes and expose latent information that is partly lost during collapse, while also revealing important limitations of our current proxies.
[4] HyperJoin: LLM-augmented Hypergraph Link Prediction for Joinable Table Discovery
Shiyuan Liu,Jianwei Wang,Xuemin Lin,Lu Qin,Wenjie Zhang,Ying Zhang
Main category: cs.CL
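TL;DR: This paper proposes HyperJoin, an LLM-augmented hypergraph framework for joinable table discovery in data lakes. It builds a hypergraph with intra-table and inter-table hyperedges and designs a hierarchical interaction network and a coherence-aware reranking module, substantially improving the accuracy and coherence of the discovered results.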
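The abstract for this entry says the reranking module uses a maximum spanning tree to prune noisy connections among candidates. How HyperJoin weights edges or uses the resulting tree is not described here, so the following is only a generic sketch of maximum-spanning-tree pruning (Kruskal's algorithm with union-find); the function name, the edge format, and the example weights are assumptions for illustration.

```python
def maximum_spanning_tree(num_nodes, edges):
    """Return the edges of a maximum spanning forest.

    `edges` is a list of (weight, u, v) tuples over nodes 0..num_nodes-1.
    Generic Kruskal's algorithm; NOT taken from the HyperJoin paper.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Find the component root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    # Visit the heaviest edges first; keep an edge only if it
    # connects two components that are still separate.
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            kept.append((weight, u, v))
    return kept


# Hypothetical pairwise coherence scores among four candidate columns.
candidate_edges = [
    (0.9, 0, 1),
    (0.8, 1, 2),
    (0.2, 0, 2),
    (0.7, 2, 3),
    (0.1, 0, 3),
]
tree = maximum_spanning_tree(4, candidate_edges)
```

In this toy input the two weakest edges are dropped because the stronger edges already connect all four candidates, which is the sense in which a maximum spanning tree prunes low-weight connections while keeping the set connected.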
Details
Motivation: Existing language-model-based methods for joinable table discovery do not adequately model structural interactions among tables: the offline stage ignores complex intra-table and inter-table structure, and the online ranking stage ignores relationships among candidate columns, which leads to incoherent results.
Method: Proposes the HyperJoin framework: (1) build a hypergraph that models table structure with intra-table hyperedges and LLM-augmented inter-table hyperedges; (2) design a Hierarchical Interaction Network (HIN) that learns column representations through bidirectional message passing between columns and hyperedges; (3) cast online ranking as a coherence-aware top-k column selection problem and introduce a reranking module based on a maximum spanning tree to improve coherence.
Result: Experiments show that HyperJoin improves on the best baseline by an average of 21.4% in Precision@15 and 17.2% in Recall@15.
Conclusion: Through hypergraph modeling and LLM augmentation, HyperJoin captures structural information within and across tables, and its coherence-aware ranking improves joinable table discovery, clearly outperforming existing methods.
Abstract: As a pivotal task in data lake management, joinable table discovery has attracted widespread interest. While existing language model-based methods achieve remarkable performance by combining offline column representation learning with online ranking, their design insufficiently accounts for the underlying structural interactions: (1) offline, they directly model tables into isolated or pairwise columns, thereby struggling to capture the rich inter-table and intra-table structural information; and (2) online, they rank candidate columns based solely on query-candidate similarity, ignoring the mutual interactions among the candidates, leading to incoherent result sets. To address these limitations, we propose HyperJoin, a large language model (LLM)-augmented Hypergraph framework for Joinable table discovery. Specifically, we first construct a hypergraph to model tables using both the intra-table hyperedges and the LLM-augmented inter-table hyperedges. Consequently, the task of joinable table discovery is formulated as link prediction on this constructed hypergraph. We then design HIN, a Hierarchical Interaction Network that learns expressive column representations through bidirectional message passing over columns and hyperedges. To strengthen coherence and internal consistency in the result columns, we cast online ranking as a coherence-aware top-k column selection problem. We then introduce a reranking module that leverages a maximum spanning tree algorithm to prune noisy connections and maximize coherence. Experiments demonstrate the superiority of HyperJoin, achieving average improvements of 21.4% (Precision@15) and 17.2% (Recall@15) over the best baseline.
[5] Multi-Dimensional Prompt Chaining to Improve Open-Domain Dialogue Generation
Livia Leong Hui Teng
Main category: cs.CL
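TL;DR: Proposes a multi-dimensional prompt-chaining framework that improves small language models in open-domain dialogue along three dimensions: naturalness, coherence, and engagingness. Experiments show the method markedly improves dialogue quality, bringing small models to a level comparable with large ones.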
Details
Motivation: Small language models have deployment advantages but struggle to match the dialogue quality of large models in open-domain settings, so effective methods are needed to improve them.
Method: Designs a multi-dimensional prompt-chaining framework covering naturalness, coherence, and engagingness, applies it to two small language models, TinyLlama and Llama-2-7B, and evaluates with both automatic and human evaluation.
Result: The framework improves response diversity by up to 29%, contextual coherence by up to 28%, and naturalness and engagingness by up to 29%; Llama-2-7B performs comparably to larger models such as Llama-2-70B and GPT-3.5 Turbo.
Conclusion: Carefully designed prompt-based strategies are an effective and resource-efficient way to improve the open-domain dialogue quality of small language models.
Abstract: Small language models (SLMs) offer significant deployment advantages but often struggle to match the dialogue quality of larger models in open-domain settings. In this paper, we propose a multi-dimensional prompt-chaining framework that integrates Naturalness, Coherence, and Engagingness dimensions to enhance human-likeness in open-domain dialogue generation. We apply the framework to two SLMs, TinyLlama and Llama-2-7B, and benchmark their performance against responses generated by substantially larger models, including Llama-2-70B and GPT-3.5 Turbo. We then employ automatic and human evaluation to assess the responses based on diversity, contextual coherence, as well as overall quality. Results show that the full framework improves response diversity by up to 29%, contextual coherence by up to 28%, and engagingness as well as naturalness by up to 29%. Notably, Llama-2-7B achieves performance comparable to substantially larger models, including Llama-2-70B and GPT-3.5 Turbo. Overall, the findings demonstrate that carefully designed prompt-based strategies provide an effective and resource-efficient pathway to improving open-domain dialogue quality in SLMs.
[6] KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs
Yixuan Tang,Yi Yang
Main category: cs.CL
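TL;DR: Proposes KV-Embedding, a framework that strengthens semantic representations in the training-free setting by using the key-value states of the final token in a frozen large language model, yielding substantial performance gains.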
Details
Motivation: In the training-free setting, large language models face two structural challenges, the restriction imposed by causal attention and the bias of the generation objective, both of which limit their capacity for semantic compression.
Method: Uses the key-value (KV) states of the final token at each layer as a compressed representation of the sequence and re-routes them as a prepended prefix so that all tokens can access global context; an automated layer selection strategy based on intrinsic dimensionality keeps the method model-agnostic.
Result: On the MTEB benchmark with Qwen, Mistral, and Llama models, KV-Embedding outperforms existing training-free methods by up to 10% and remains robust on sequences of up to 4,096 tokens.
Conclusion: Manipulating an LLM's internal states can improve representation quality without any training, opening a new direction for using model internals in representation learning.
Abstract: While LLMs are powerful embedding backbones, their application in training-free settings faces two structural challenges: causal attention restricts early tokens from accessing subsequent context, and the next-token prediction objective biases representations toward generation rather than semantic compression. To address these limitations, we propose KV-Embedding, a framework that activates the latent representation power of frozen LLMs. Our method leverages the observation that the key-value (KV) states of the final token at each layer encode a compressed view of the sequence. By re-routing these states as a prepended prefix, we enable all tokens to access sequence-level context within a single forward pass. To ensure model-agnostic applicability, we introduce an automated layer selection strategy based on intrinsic dimensionality. Evaluations on MTEB across Qwen, Mistral, and Llama backbones show that KV-Embedding outperforms existing training-free baselines by up to 10%, while maintaining robust performance on sequences up to 4,096 tokens. These results demonstrate that internal state manipulation offers an efficient alternative to input modification, and we hope this work encourages further exploration of LLM internals for representation learning.
[7] Unsupervised Text Style Transfer for Controllable Intensity
Shuhuan Gu,Wenbiao Tao,Xinchen Ma,Kangkang He,Ye Guo,Xiang Li,Yunshi Lan
Main category: cs.CL
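TL;DR: Proposes an SFT-then-PPO paradigm for fine-tuning a large language model (LLM) for controllable-intensity unsupervised text style transfer, with hierarchical reward functions designed to distinguish stylistic features across intensity levels.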
Details
Motivation: Controllable-intensity unsupervised text style transfer is challenging because parallel data is lacking and stylistic features at adjacent intensity levels are hard to tell apart.
Method: First performs supervised fine-tuning (SFT) of the LLM on synthesized parallel data, then trains it further with PPO, using hierarchical reward functions that combine global and local stylistic features.
Result: Experiments on two UTST benchmarks show that the proposed rewards improve the LLM's performance across a range of evaluation metrics and yield clearly distinct stylized text even between close intensity levels.
Conclusion: The SFT-then-PPO paradigm, combined with carefully designed hierarchical reward functions, addresses controllable-intensity text style transfer in the unsupervised setting and markedly improves the stylistic separation of generated text.
Abstract: Unsupervised Text Style Transfer (UTST) aims to build a system to transfer the stylistic properties of a given text without parallel text pairs. Compared with text transfer between style polarities, UTST for controllable intensity is more challenging due to the subtle differences in stylistic features across different intensity levels. Faced with the challenges posed by the lack of parallel data and the indistinguishability between adjacent intensity levels, we propose an SFT-then-PPO paradigm to fine-tune an LLM. We first fine-tune the LLM with synthesized parallel data. Then, we further train the LLM with PPO, where the rewards are elaborately designed for distinguishing the stylistic intensity in hierarchical levels. Both the global and local stylistic features are considered to formulate the reward functions. The experiments on two UTST benchmarks showcase that both rewards have their advantages and that applying them to LLM fine-tuning can effectively improve the performance of an LLM backbone based on various evaluation metrics. Even for close levels of intensity, we can still observe a noticeable stylistic difference between the generated texts.
[8] KS-LIT-3M: A 3.1 Million Word Kashmiri Text Dataset for Large Language Model Pretraining
Haq Nawaz Malik
Main category: cs.CL
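TL;DR: This paper introduces KS-LIT-3M, a 3.1-million-word Kashmiri corpus intended to address the poor Kashmiri generation quality of large language models caused by a lack of high-quality training data.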
Details
Motivation: Kashmiri has millions of speakers, but much of its literature is stored in the proprietary InPage format and is hard to use for modern NLP, leaving the language far behind in large language models.
Method: Develops a dedicated InPage-to-Unicode converter, then applies English contamination removal, character normalization, and quality validation to the converted text, producing the KS-LIT-3M corpus organized as a continuous linear text stream.
Result: Builds the KS-LIT-3M corpus of 3.1 million words (16.4 million characters) and 131,607 unique words, covering literary, journalistic, academic, and religious genres and suitable for pretraining causal language models.
Conclusion: KS-LIT-3M fills a key resource gap in Kashmiri NLP research, supports technology development for low-resource languages, and helps linguistic diversity be reflected in AI.
Abstract: Large Language Models (LLMs) demonstrate remarkable fluency across high-resource languages yet consistently fail to generate coherent text in Kashmiri, a language spoken by approximately seven million people. This performance disparity stems not from inherent model limitations but from a critical scarcity of high-quality training data. Decades of Kashmiri literature remain inaccessible to modern NLP pipelines due to their encoding in the proprietary InPage desktop publishing format. This paper introduces KS-LIT-3M, a curated corpus of 3.1 million words (16.4 million characters) specifically designed for pretraining language models on Kashmiri. The dataset is structured as a single continuous linear text stream, optimized for causal language model training where models learn to predict subsequent tokens from preceding context. The corpus was constructed through the development of a specialized InPage-to-Unicode converter, followed by rigorous preprocessing including English contamination removal, character normalization, and quality validation. Encompassing 131,607 unique words drawn from diverse genres including literary works, journalistic writing, academic texts, and religious scholarship, KS-LIT-3M addresses a fundamental resource gap for Kashmiri language technology. The dataset is released under the CC-BY-4.0 license to facilitate research in Kashmiri natural language processing.
[9] EmoLoom-2B: Fast Base-Model Screening for Emotion Classification and VAD with Lexicon-Weak Supervision and KV-Off Evaluation
Zilin Li,Weiwei Xu,Xuanbo Lu,Zheda Liu
Main category: cs.CL
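TL;DR: EmoLoom-2B is a lightweight, reproducible pipeline that turns small language models with fewer than 2 billion parameters into fast screening candidates for emotion classification and Valence-Arousal-Dominance (VAD) prediction.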
Details
Motivation: Efficient, fair, and reproducible evaluation in emotion analysis requires a unified, low-variance training and inference framework, especially for optimizing small models in resource-constrained settings.
Method: Proposes the EmoLoom-2B pipeline, which uses KV-off decoding to reduce variance, adds a VAD-preserving constraint and an external appraisal classifier as semantic regularizers, and improves polarity sensitivity with Valence Flip augmentation; fine-tuning uses A/B mixture sampling with entropy-aware temperature scheduling.
Result: With Qwen-1.8B-Chat as the base model, EmoLoom-2B performs strongly on GoEmotions and EmpatheticDialogues and shows robust cross-corpus generalization on DailyDialog.
Conclusion: The method is budget-friendly, auditable, and re-entrant, making it a dependable screening step before heavier training or fusion.
Abstract: We introduce EmoLoom-2B, a lightweight and reproducible pipeline that turns small language models under 2B parameters into fast screening candidates for joint emotion classification and Valence-Arousal-Dominance prediction. To ensure protocol-faithful and fair evaluation, we unify data loading, training, and inference under a single JSON input-output contract and remove avoidable variance by adopting KV-off decoding as the default setting. We incorporate two orthogonal semantic regularizers: a VAD-preserving constraint that aligns generated text with target VAD triples, and a lightweight external appraisal classifier that provides training-time guidance on goal attainment, controllability, certainty, and fairness without injecting long rationales. To improve polarity sensitivity, we introduce Valence Flip augmentation based on mirrored emotional pairs. During supervised fine-tuning, we apply A/B mixture sampling with entropy-aware temperature scheduling to balance coverage and convergence. Using Qwen-1.8B-Chat as the base model, EmoLoom-2B achieves strong performance on GoEmotions and EmpatheticDialogues, and demonstrates robust cross-corpus generalization on DailyDialog. The proposed recipe is budget-aware, auditable, and re-entrant, serving as a dependable screening pass before heavier training or multimodal fusion.
[10] Listen, Attend, Understand: a Regularization Technique for Stable E2E Speech Translation Training on High Variance labels
Yacouba Diarra,Michael Leventhal
Main category: cs.CL
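TL;DR: Proposes Listen, Attend, Understand (LAU), a semantic regularization method that uses frozen text embeddings to guide the speech encoder's latent space, improving semantic preservation and training efficiency for end-to-end speech translation in low-resource, noisy settings.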
Details
Motivation: End-to-end speech translation often converges slowly and performs poorly when target transcriptions have high variance and semantic ambiguity, so semantic regularization is needed to make speech representations more semantically consistent.
Method: Proposes LAU, which uses frozen text embeddings as a directional auxiliary loss to constrain the speech encoder's latent space during training, grounding the acoustic representation linguistically without adding inference cost.
Result: On a 30-hour Bambara-to-French dataset translated by non-professionals, LAU models match, on standard metrics, an E2E-ST system pretrained with 100% more data, and preserve meaning better; the paper also introduces the Total Parameter Drift metric, showing that semantic constraints reorganize encoder weights to prioritize meaning over acoustic detail.
Conclusion: LAU is a robust training method for end-to-end speech translation, especially suited to scarce and noisy data, and can serve as an alternative to conventional post-hoc rescoring.
Abstract: End-to-End Speech Translation often shows slower convergence and worse performance when target transcriptions exhibit high variance and semantic ambiguity. We propose Listen, Attend, Understand (LAU), a semantic regularization technique that constrains the acoustic encoder's latent space during training. By leveraging frozen text embeddings to provide a directional auxiliary loss, LAU injects linguistic groundedness into the acoustic representation without increasing inference cost. We evaluate our method on a Bambara-to-French dataset with 30 hours of Bambara speech translated by non-professionals. Experimental results demonstrate that LAU models achieve comparable performance on standard metrics to an E2E-ST system pretrained with 100% more data, while performing better at preserving semantic meaning. Furthermore, we introduce Total Parameter Drift as a metric to quantify the structural impact of regularization and to demonstrate that semantic constraints actively reorganize the encoder's weights to prioritize meaning over literal phonetics. Our findings suggest that LAU is a robust alternative to post-hoc rescoring and a valuable addition to E2E-ST training, especially when training data is scarce and/or noisy.
[11] RoboPhD: Self-Improving Text-to-SQL Through Autonomous Agent Evolution
Andrew Borthwick,Stephen Ash
Main category: cs.CL
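TL;DR: RoboPhD is a system in which AI agents autonomously conduct research to improve Text-to-SQL performance; using a closed-loop evolution cycle and an ELO-based selection mechanism, it evolves effective strategies from a simple baseline without external guidance.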
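The ELO-based selection mentioned for this entry can be pictured with the standard Elo rating update. The abstract does not give RoboPhD's rating parameters or how matches between agents are scored, so the K-factor of 32, the 400-point scale, and the function names below are conventional Elo defaults used as assumptions, not details from the paper.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one head-to-head comparison.

    `score_a` is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k=32 and the 400-point scale are conventional defaults, assumed here.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two hypothetical agent versions start level; version A wins one comparison.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

Because each update depends only on pairwise outcomes, ratings of this kind can still rank agents when results are non-transitive (A beats B, B beats C, C beats A), which is the property the entry highlights.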
Details
Motivation: To explore whether AI can autonomously optimize agent performance on the Text-to-SQL task with almost no human intervention.
Method: Uses a two-agent framework: a SQL Generation agent handles database analysis and SQL generation, an Evolution agent designs new versions based on performance feedback, and an ELO rating mechanism enables effective selection in a non-transitive setting.
Result: Over 18 iterations the system grew from a 70-line baseline to 1,500 lines, autonomously discovering strategies such as adaptive database analysis and aggregation handling; it reaches 73.67% accuracy on the BIRD test set and substantially improves weaker models (for example, 8.9 points for Claude Haiku), enabling a "skip a tier" deployment advantage.
Conclusion: RoboPhD shows that AI can autonomously build a high-performing agentic system from a minimal human-provided starting point, demonstrating the potential of self-driven AI research.
Abstract: We present RoboPhD, a system where AI agents autonomously conduct research to improve Text-to-SQL performance. RoboPhD implements a closed-loop evolution cycle with two coordinated components: a SQL Generation agent composed of a database analysis script and SQL generation instructions, and an Evolution agent that designs new versions based on performance feedback. Central to the framework is an ELO-based selection mechanism enabling survival-of-the-fittest dynamics while handling non-transitivity in performance. Starting from a naive 70-line baseline, RoboPhD evolves agents through iterative cross-pollination, discovering effective techniques without any external guidance on the Text-to-SQL domain. Our best agent, evolved to 1500 lines over 18 iterations, autonomously discovered strategies such as size-adaptive database analysis that adjusts depth based on schema complexity and SQL generation patterns for column selection, evidence interpretation, and aggregation. Evolution provides the largest gains on cheaper models: while we improve by 2.3 points over a strong Claude Opus 4.5 naive baseline, we show an improvement of 8.9 points over the weaker Claude Haiku model. This enables 'skip a tier' deployment: evolved Haiku exceeds naive Sonnet accuracy, and evolved Sonnet exceeds naive Opus, both at lower cost. The full system achieves 73.67% accuracy on the BIRD test set, demonstrating that AI can autonomously build a strong agentic system with only a trivial human-provided starting point.
[12] KOS-TL (Knowledge Operation System Type Logic)
Peng Chen
Main category: cs.CL
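TL;DR: This paper proposes KOS-TL, a constructive logical framework based on dependent type theory that aims to give autonomous, executable knowledge systems a rigorous formal foundation; it unifies data, logic, and proof through a three-layer architecture and verifies key meta-theoretical properties.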
Details
Motivation: Traditional knowledge representation models leave a gap between static symbolic logic and dynamic system execution, and lack a unified framework that supports both formal reasoning and runtime evolution.
Method: Uses dependent type theory to build a three-layer architecture: the Core Layer defines the type universe and constructive primitives, the Kernel Layer manages state evolution through an event-driven mechanism $\langle \Sigma, \textsf{Ev}, \Delta \rangle$, and the Runtime Layer performs bidirectional refinement of physical signals into logical evidence; the framework integrates Davidsonian event semantics with Martin-Löf type theory.
Result: Formally defines the system's operational semantics and proves meta-theoretical properties including Progress and Evolutionary Consistency, ensuring the system stays logically self-consistent and free of stuck states across continuous state transitions; realizes "proof-carrying knowledge", in which every state change in the knowledge base comes with a witness of its validity.
Conclusion: KOS-TL provides a solid, formally verifiable knowledge-execution foundation for next-generation intelligent autonomous operating systems, suited to high-assurance scenarios such as industrial traceability and cross-border financial compliance.
Abstract: This paper introduces KOS-TL (Knowledge Operation System Type Logic), a novel constructive framework designed to provide a rigorous logical foundation for autonomous and executable knowledge systems. Traditional knowledge representation models often suffer from a gap between static symbolic logic and dynamic system execution. To bridge this divide, KOS-TL leverages Dependent Type Theory to unify data, logic, and proof into a singular computational substrate. The architecture of KOS-TL is organized into three hierarchical layers: the Core Layer, which defines the static type universe and constructive primitives; the Kernel Layer, which governs state evolution through an event-driven mechanism characterized by the triple $\langle \Sigma, \textsf{Ev}, \Delta \rangle$; and the Runtime Layer, responsible for the bidirectional refinement of physical signals into logical evidence. We formally define the operational semantics of the system and prove key meta-theoretical properties, including Progress and Evolutionary Consistency, ensuring that the system remains logically self-consistent and free from stuck states during continuous state transitions. By integrating Davidsonian event semantics with Martin-Löf type theory, KOS-TL enables the construction of "proof-carrying knowledge," where every state change in the knowledge base is accompanied by a formal witness of its validity. We demonstrate the practical utility of this logic through application examples in industrial traceability and cross-border financial compliance. Our results suggest that KOS-TL provides a robust, formally verifiable basis for the next generation of intelligent, autonomous operating systems.
[13] SongSage: A Large Musical Language Model with Lyric Generative Pre-training
Jiani Guo,Jiajia Li,Jie Wu,Zuchao Li,Yujiu Yang,Ping Wang
Main category: cs.CL
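TL;DR: This paper presents the PlaylistSense dataset for evaluating how well language models understand playlists, and introduces SongSage, a large musical language model that acquires diverse lyric-centric intelligence through lyric generative pretraining. Experiments show that SongSage excels at understanding lyric knowledge, rewriting user queries for zero-shot playlist recommendation, and generating and continuing lyrics, while retaining general knowledge comprehension.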
Details
Motivation: Existing large language models fall short in understanding lyric-related knowledge, and there is no dedicated way to evaluate playlist understanding.
Method: Builds the PlaylistSense dataset for evaluation; proposes the SongSage model, which is continually pretrained on the LyricBank corpus and fine-tuned with the LyricBank-SFT instruction set.
Result: Current general-purpose LLMs still have room to improve on playlist understanding; SongSage performs strongly on understanding ten types of user queries, lyric generation and continuation, and query rewriting for zero-shot recommendation, while retaining good general knowledge ability (for example, its MMLU score).
Conclusion: Through lyric-centric continual pretraining, SongSage markedly improves understanding and creation of musical content, providing an effective tool for music AI research while remaining general-purpose.
Abstract: Large language models have achieved significant success in various domains, yet their understanding of lyric-centric knowledge has not been fully explored. In this work, we first introduce PlaylistSense, a dataset to evaluate the playlist understanding capability of language models. PlaylistSense encompasses ten types of user queries derived from common real-world perspectives, challenging LLMs to accurately grasp playlist features and address diverse user intents. Comprehensive evaluations indicate that current general-purpose LLMs still have potential for improvement in playlist understanding. Inspired by this, we introduce SongSage, a large musical language model equipped with diverse lyric-centric intelligence through lyric generative pretraining. SongSage undergoes continual pretraining on LyricBank, a carefully curated corpus of 5.48 billion tokens focused on lyrical content, followed by fine-tuning with LyricBank-SFT, a meticulously crafted instruction set comprising 775k samples across nine core lyric-centric tasks. Experimental results demonstrate that SongSage exhibits a strong understanding of lyric-centric knowledge, excels in rewriting user queries for zero-shot playlist recommendations, generates and continues lyrics effectively, and performs proficiently across seven additional capabilities. Beyond its lyric-centric expertise, SongSage also retains general knowledge comprehension and achieves a competitive MMLU score. We will keep the datasets inaccessible due to copyright restrictions and release SongSage and the training script to ensure reproducibility and support music AI research and applications. Details of the dataset release plan are provided in the appendix.
[14] DHI: Leveraging Diverse Hallucination Induction for Enhanced Contrastive Factuality Control in Large Language Models
Jiani Guo,Xiangke Zeng,Jie Wu,Zuchao Li
Main category: cs.CL
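TL;DR: This paper proposes DHI, a new framework that modifies the loss function and the attention masking mechanism so that an "Evil LLM" generates more diverse hallucinations, which in turn mitigates hallucination in large language models more effectively.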
Details
Motivation: In existing methods, the "Evil LLM" used to generate hallucinations is limited to the specific error types in its training data, so the induced hallucinations lack diversity, which limits how well hallucinations can be detected and suppressed.
Method: Proposes the DHI framework: (1) a modified loss function that down-weights factually correct tokens, pushing the model to generate diverse hallucinations at targeted positions; (2) a causal attention masking mechanism that reduces the effect on the generation of subsequent tokens; (3) an adaptive rationality constraint at inference that applies contrastive decoding only where the positive model is highly confident.
Result: Experiments show that DHI significantly outperforms existing contrastive-decoding-based methods on multiple hallucination benchmarks, improving hallucination identification and suppression.
Conclusion: By increasing the diversity of induced hallucinations and refining contrastive decoding, DHI markedly improves hallucination mitigation in large language models and offers a new approach to building more reliable language models.
Abstract: Large language models (LLMs) frequently produce inaccurate or fabricated information, known as "hallucinations," which compromises their reliability. Existing approaches often train an "Evil LLM" to deliberately generate hallucinations on curated datasets, using these induced hallucinations to guide contrastive decoding against a reliable "positive model" for hallucination mitigation. However, this strategy is limited by the narrow diversity of hallucinations induced, as Evil LLMs trained on specific error types tend to reproduce only these particular patterns, thereby restricting their overall effectiveness. To address these limitations, we propose DHI (Diverse Hallucination Induction), a novel training framework that enables the Evil LLM to generate a broader range of hallucination types without relying on pre-annotated hallucination data. DHI employs a modified loss function that down-weights the generation of specific factually correct tokens, encouraging the Evil LLM to produce diverse hallucinations at targeted positions while maintaining overall factual content. Additionally, we introduce a causal attention masking adaptation to reduce the impact of this penalization on the generation of subsequent tokens. During inference, we apply an adaptive rationality constraint that restricts contrastive decoding to tokens where the positive model exhibits high confidence, thereby avoiding unnecessary penalties on factually correct tokens. Extensive empirical results show that DHI achieves significant performance gains over other contrastive decoding-based approaches across multiple hallucination benchmarks.
[15] Almost Clinical: Linguistic properties of synthetic electronic health records
Serge Sharoff,John Baker,David Francis Hunt,Alan Simpson
Main category: cs.CL
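TL;DR: This study evaluates the linguistic and clinical suitability of synthetic electronic health records (EHRs) in mental health, analyzing the linguistic characteristics and limitations of large language models (LLMs) when generating clinical text.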
Details
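Motivation: To examine how reliable and suitable LLM-generated synthetic EHRs are for clinical practice, particularly in how they linguistically construct medical authority and patient autonomy.
Method: Builds a synthetic corpus and analyzes agency, modality, and information flow across four clinical genres (assessments, correspondence, referrals, and care plans) to assess the linguistic and clinical characteristics of LLM-generated text.
Result: LLMs produce coherent, terminology-appropriate text that approximates real clinical practice, but systematic divergences remain, including register shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures.
Conclusion: Although LLMs show promise for generating clinical text, their systematic linguistic and clinical accuracy problems need to be resolved before practical use.
Abstract: This study evaluates the linguistic and clinical suitability of synthetic electronic health records (EHRs) in the field of mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we assess agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals and Care plans) to understand how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures.
[16] Stylometry Analysis of Human and Machine Text for Academic Integrity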
Hezam Albaqami,Muhammad Asif Ayub,Nasir Ahmad,Yaseen Ahmad,Mohammed M. Alqahtani,Abdullah M. Algamdi,Almoaid A. Owaidah,Kashif Ahmad
Main category: cs.CL
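TL;DR: This paper proposes a natural language processing (NLP)-based framework for verifying the authenticity of student content through author attribution and style change detection. It covers four key tasks and is evaluated on two Gemini-generated datasets; the results highlight how hard it is to detect machine-generated text produced with sophisticated prompts.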
Details
Motivation: To address key challenges to academic integrity, such as plagiarism, fabrication, and verification of authorship of educational content; existing methods do not yet fully cover author attribution and style change detection in multi-authored documents.
Method: Proposes an NLP framework for four tasks: classifying human versus machine text, distinguishing single-authored from multi-authored documents, detecting author changes within multi-authored documents, and recognizing authors in collaboratively produced documents. Experiments use two Gemini-generated datasets that differ in how strict the generation instructions are.
Result: Model performance drops on the dataset generated with strict instructions, indicating that carefully crafted prompts make machine-generated text harder to detect; the datasets, code, and other resources are publicly released to support follow-up research.
Conclusion: The framework offers a systematic approach to verifying the authenticity of student content, exposes the limits of current machine-generated text detection, and sets a baseline for future research.
Abstract: This work addresses critical challenges to academic integrity, including plagiarism, fabrication, and verification of authorship of educational content, by proposing a Natural Language Processing (NLP)-based framework for authenticating students' content through author attribution and style change detection. Despite some initial efforts, several aspects of the topic are yet to be explored. In contrast to existing solutions, the paper provides a comprehensive analysis of the topic by targeting four relevant tasks, including (i) classification of human and machine text, (ii) differentiating between single- and multi-authored documents, (iii) author change detection within multi-authored documents, and (iv) author recognition in collaboratively produced documents. The solutions proposed for the tasks are evaluated on two datasets generated with Gemini using two different prompts, including a normal and a strict set of instructions. During experiments, some reduction in the performance of the proposed solutions is observed on the dataset generated through the strict prompt, demonstrating the complexities involved in detecting machine-generated text with cleverly crafted prompts. The generated datasets, code, and other relevant materials are made publicly available on GitHub and are expected to provide a baseline for future research in the domain.
[17] Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure
Zsolt Csibi,Bence György Gortka,Natabara Gyöngyössy,Kornél Nagy,Dávid Márk Nemeskey,Martin Sallai,András Simonyi,András Márk Szekeres,Gábor Palkó
Main category: cs.CL
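TL;DR: Racka is a lightweight, continually pretrained large language model built on Qwen-3 4B that uses LoRA for parameter-efficient updates; it is optimized for Hungarian while preserving English and German capabilities.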
Details
Motivation: To narrow the resource gap between Hungarian and high-resource languages such as English and German, and to make training practical on low-bandwidth HPC clusters.
Method: Performs parameter-efficient continual pretraining with LoRA on Qwen-3 4B, replaces and adapts the tokenizer to tokenize Hungarian more efficiently, and trains on a multilingual mixture (44% Hungarian, 24% English, 21% German, and 11% code).
Result: The model is trained on 160 billion subword tokens; preliminary results show modest but stable gains in language adaptation, with the data mix chosen to mitigate catastrophic forgetting.
Conclusion: Racka offers an efficient, practical continual pretraining recipe for resource-constrained languages, improving Hungarian processing while preserving performance in high-resource languages.
Abstract: We present Racka, a lightweight, continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages such as English and German. Racka employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a Qwen-3 4B backbone, making the recipe practical on A100 (40GB)-based HPC clusters with low inter-node bandwidth. To better match the training distribution, we replace and adapt the tokenizer, achieving substantially improved tokenization fertility for Hungarian while maintaining competitive performance in English and German. The model is trained on 160B subword tokens drawn from a mixture of internet and high-quality curated sources, with a composition of 44% Hungarian, 24% English, 21% German, and 11% code. This data mix is chosen to mitigate catastrophic forgetting and preserve high-resource language capabilities during continual pretraining. Our preliminary results indicate modest but stable results in language adaptation.
[18] From Policy to Logic for Efficient and Interpretable Coverage Assessment
Rhitabrat Pokharel,Hamid Hassanzadeh,Ameeta Agrawal
Main category: cs.CL
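TL;DR: A hybrid method that combines a coverage-aware retriever with symbolic rule-based reasoning to make medical policy review more efficient and interpretable, cutting inference cost while improving performance.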
Details
Motivation: Large language models suffer from hallucinations and inconsistencies when interpreting complex legal and policy text, which matters especially in medical coverage policy review where high reliability is required; their trustworthiness and auditability need strengthening. Method: A coverage-aware retriever extracts the relevant policy content, and a symbolic rule-based reasoning system converts the text into explicit facts and rules and generates auditable rationales, reducing reliance on LLM inference. Result: Compared with a pure-LLM approach, the method reduces inference cost by 44% and improves the F1 score by 4.5%. Conclusion: The hybrid approach improves efficiency and interpretability while maintaining accuracy, making it suitable for policy review in high-stakes domains. Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in interpreting lengthy, complex legal and policy language. However, their reliability can be undermined by hallucinations and inconsistencies, particularly when analyzing subjective and nuanced documents. These challenges are especially critical in medical coverage policy review, where human experts must be able to rely on accurate information. In this paper, we present an approach designed to support human reviewers by making policy interpretation more efficient and interpretable. We introduce a methodology that pairs a coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales. This hybrid system minimizes the number of LLM inferences required, which reduces overall model cost. Notably, our approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.
[19] Does Memory Need Graphs? A Unified Framework and Empirical Analysis for Long-Term Dialog Memory
Sen Hu,Yuxiang Wei,Jiaxin Ran,Zhiyuan Yao,Lei Zou
Main category: cs.CL
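TL;DR: This paper presents a unified framework for systematically analyzing dialog memory architectures. Controlled experiments comparing graph-based and non-graph approaches show that foundational system settings affect performance more than specific architectural innovations, and the paper identifies reliable strong baselines for future research.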
Details
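Motivation: Graph structures are increasingly used in dialog memory systems, but empirical results on their effectiveness are inconsistent, leaving key design choices unclear; a systematic analysis of what drives the performance of different architectures is needed. Method: A unified framework decomposes dialog memory systems into core components and supports both graph-based and non-graph approaches; controlled, stage-wise experiments on LongMemEval and HaluMem compare common design choices in memory representation, organization, maintenance, and retrieval. Result: Many performance differences are driven mainly by foundational system settings rather than specific architectural innovations; some widely adopted graph-based designs bring no significant advantage; several stable, strong baselines are identified. Conclusion: The overall performance of dialog memory systems depends more on sound configuration of foundational components than on whether advanced structures such as graphs are used; future work should prioritize tuning foundational settings and compare against more robust baselines. Abstract: Graph structures are increasingly used in dialog memory systems, but empirical findings on their effectiveness remain inconsistent, making it unclear which design choices truly matter. We present an experimental, system-oriented analysis of long-term dialog memory architectures. We introduce a unified framework that decomposes dialog memory systems into core components and supports both graph-based and non-graph approaches. Under this framework, we conduct controlled, stage-wise experiments on LongMemEval and HaluMem, comparing common design choices in memory representation, organization, maintenance, and retrieval. Our results show that many performance differences are driven by foundational system settings rather than specific architectural innovations. Based on these findings, we identify stable and reliable strong baselines for future dialog memory research.
[20] T3C: Test-Time Tensor Compression with Consistency Guarantees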
Ismail Lamaakal,Chaymae Yahyati,Yassine Maleh,Khalid El Makkaoui,Ibrahim Ouahbi
Main category: cs.CL
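TL;DR: T3C is a train-once framework whose compression level can be adjusted at test time according to a budget, allowing a model's rank and precision to be tuned dynamically at deployment to suit different hardware.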
Details
Motivation: Existing model compression methods usually require retraining for each device or lack reliable performance guarantees, making it hard to balance accuracy, latency, and model size efficiently across diverse deployment environments. Method: T3C combines elastic tensor factorization with mixed-precision quantization and a lightweight controller that maps a budget token to per-layer rank and bit assignments; a layerwise consistency certificate, computed from spectral proxies and activation statistics, upper-bounds logit drift and regularizes training. Result: On ImageNet-1k, ResNet-50 reaches a p50 latency of 1.18ms with a 38MB model at an accuracy drop of at most 0.5%, outperforming PTQ-8b; ViT-B/16 reaches a p50 latency of 2.30ms at 59MB, improving over strong PTQ/QAT baselines. Conclusion: With a single checkpoint, T3C provides predictable, certificate-backed accuracy-latency-size trade-offs across devices, advancing efficient adaptive model deployment. Abstract: We present T3C, a train-once, test-time budget-conditioned compression framework that exposes rank and precision as a controllable deployment knob. T3C combines elastic tensor factorization (maintained up to a maximal rank) with rank-tied mixed-precision quantization and a lightweight controller that maps a latency/energy/size budget token to per-layer rank/bit assignments; the policy snaps to hardware-aligned profiles and is monotone in the budget. A fast, layerwise consistency certificate, computed from spectral proxies and activation statistics, upper-bounds logit drift and regularizes training, yielding a practical reliability signal with negligible overhead. On ImageNet-1k, T3C shifts the vision Pareto frontier: for ResNet-50 at matched accuracy (≤ 0.5% drop), p50 latency is 1.18ms with a 38MB model, outperforming PTQ-8b (1.44ms, 88MB); for ViT-B/16, T3C reaches 2.30ms p50 with 59MB, improving over strong PTQ/QAT baselines. A single T3C checkpoint therefore provides predictable, certificate-backed accuracy-latency-size trade-offs on demand across devices.
[21] FLOP-Efficient Training: Early Stopping Based on Test-Time Compute Awareness
Hossam Amer,Maryam Dialameh,Hossein Rajabzadeh,Walid Ahmed,Weiwei Zhang,Yang Liu
Main category: cs.CL
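TL;DR: This paper proposes TTC-aware training: an early stopping algorithm combined with test-time compute optimization substantially reduces training FLOPs while maintaining or even improving accuracy, giving a better balance between training and inference compute.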
Details
Motivation: Scaling training compute improves the accuracy of large language models, but training is extremely resource-intensive. Prior work shows that increasing test-time compute (for example, iterative sampling) lets small models rival large ones, so this paper explores how to cut training cost substantially without sacrificing accuracy. Method: A TTC-aware training framework with an early stopping algorithm that jointly selects an intermediate checkpoint and a TTC configuration, an efficient TTC evaluation method that avoids exhaustive search, and a formal break-even bound that identifies when increased inference compute compensates for reduced training compute. Result: Experiments show reductions of up to 92% in training FLOPs while maintaining, and sometimes markedly improving, accuracy. Conclusion: The study offers a new perspective on trading off training and inference compute, helping to shorten deployment cycles and support more frequent model refreshes. Abstract: Scaling training compute, measured in FLOPs, has long been shown to improve the accuracy of large language models, yet training remains resource-intensive. Prior work shows that increasing test-time compute (TTC), for example through iterative sampling, can allow smaller models to rival or surpass much larger ones at lower overall cost. We introduce TTC-aware training, where an intermediate checkpoint and a corresponding TTC configuration can together match or exceed the accuracy of a fully trained model while requiring substantially fewer training FLOPs. Building on this insight, we propose an early stopping algorithm that jointly selects a checkpoint and TTC configuration to minimize training compute without sacrificing accuracy. To make this practical, we develop an efficient TTC evaluation method that avoids exhaustive search, and we formalize a break-even bound that identifies when increased inference compute compensates for reduced training compute. Experiments demonstrate up to 92% reductions in training FLOPs while maintaining and sometimes remarkably improving accuracy. These results highlight a new perspective for balancing training and inference compute in model development, enabling faster deployment cycles and more frequent model refreshes. Codes will be publicly released.
[22] Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG-Based Mental Health Dialogue Systems
Md Abdullah Al Kafi,Raka Moni,Sumit Kumar Banshal
Main category: cs.CL
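TL;DR: In RAG-based mental health counseling systems, generalist models (such as Qwen2.5-3B and Phi-3-Mini) outperform specialized domain fine-tunes (such as MentalHealthBot-7B and TherapyBot-7B) in empathy and contextual understanding, with all models performing well on safety, suggesting that strong reasoning matters more than domain-specific training.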
Details
Motivation: LLMs used for mental health counseling suffer from hallucinations and a lack of empathy; the paper asks whether, within a RAG framework, generalist reasoning models or domain fine-tuned models provide better psychological support. Method: A RAG pipeline built with ChromaDB compares two generalist models (Qwen2.5-3B, Phi-3-Mini) with two domain fine-tunes (MentalHealthBot-7B, TherapyBot-7B), using an LLM-as-a-Judge framework to automate evaluation over 50 turns. Result: The generalist models score significantly higher on empathy (3.72 vs. 3.26, p < 0.001) and, despite being smaller (3B vs. 7B), show better contextual understanding and less tendency to overfit; all models perform well on safety, but the domain models are prone to overfitting. Conclusion: When answers are already anchored in clinical evidence, generalist models with strong reasoning provide more empathetic and balanced support than larger domain fine-tuned models, so RAG systems should prioritize reasoning ability over domain fine-tuning. Abstract: The deployment of Large Language Models (LLMs) in mental health counseling faces the dual challenges of hallucinations and lack of empathy. While the former may be mitigated by RAG (retrieval-augmented generation) by anchoring answers in trusted clinical sources, there remains an open question as to whether the most effective model under this paradigm would be one that is fine-tuned on mental health data, or a more general and powerful model that succeeds purely on the basis of reasoning. In this paper, we perform a direct comparison by running four open-source models through the same RAG pipeline using ChromaDB: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) and two domain-specific fine-tunes (MentalHealthBot-7B and TherapyBot-7B). We use an LLM-as-a-Judge framework to automate evaluation over 50 turns. We find a clear trend: the generalist models outperform the domain-specific ones in empathy (3.72 vs. 3.26, p < 0.001) in spite of being much smaller (3B vs. 7B), and all models perform well in terms of safety, but the generalist models show better contextual understanding and are less prone to overfitting as we observe in the domain-specific models. Overall, our results indicate that for RAG-based therapy systems, strong reasoning is more important than training on mental health-specific vocabulary; i.e., a well-reasoned general model would provide more empathetic and balanced support than a larger narrowly fine-tuned model, so long as the answer is already grounded in clinical evidence.
[23] FC-CONAN: An Exhaustively Paired Dataset for Robust Evaluation of Retrieval Systems
Juan Junqueras,Florian Boudin,May-Myo Zin,Ha-Thanh Nguyen,Wachara Fungwacharakorn,Damián Ariel Furman,Akiko Aizawa,Ken Satoh
Main category: cs.CL
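TL;DR: This paper introduces FC-CONAN, the first fully connected dataset built by exhaustively pairing 45 English hate speech messages with 129 counter-narratives. A two-stage annotation process produces four partitions of differing reliability, addressing the sparsity of positive labels in existing datasets and supporting more reliable evaluation and error analysis of counter-narrative retrieval systems.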
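The exhaustive pairing described in this entry can be sketched in a few lines of Python. The message IDs, the ranked list, and the positive sets below are hypothetical placeholders, not data from FC-CONAN; the toy precision@k comparison only illustrates why sparsely annotated positives can understate a retrieval system's quality.

```python
from itertools import product

# Placeholder IDs; the real messages are not reproduced here.
hs_ids = [f"HS{i:02d}" for i in range(45)]
cn_ids = [f"CN{j:03d}" for j in range(129)]

# Every hate speech message paired with every counter-narrative.
all_pairs = list(product(hs_ids, cn_ids))
print(len(all_pairs))  # 45 * 129 = 5805

def precision_at_k(ranked, positives, k):
    """Fraction of the top-k retrieved counter-narratives labelled positive."""
    top = ranked[:k]
    return sum(1 for cn in top if cn in positives) / k

# Hypothetical system output for one hate speech message.
ranked = ["CN003", "CN010", "CN042"]
sparse_positives = {"CN003"}           # only one pair was ever annotated
full_positives = {"CN003", "CN042"}    # exhaustive annotation finds another

sparse_p = precision_at_k(ranked, sparse_positives, 3)
full_p = precision_at_k(ranked, full_positives, 3)
print(sparse_p, full_p)
```

Under the sparse labels the same ranking scores lower, because a valid but unannotated counter-narrative is counted as a miss.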
Details
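Motivation: Existing hate speech counter-narrative datasets such as CONAN annotate only a small number of HS-CN pairs, which limits evaluation in counter-narrative research; a more comprehensive, larger dataset is needed to make model evaluation more reliable. Method: FC-CONAN exhaustively enumerates all combinations of 45 hate speech messages and 129 counter-narratives, 5,805 pairs in total; a two-stage human annotation process involving nine annotators and four validators produces four partitions (Diamond, Gold, Silver, and Bronze) that balance annotation quality and data scale. Result: The new dataset shares no labeled pairs with CONAN and uncovers hundreds of previously unlabelled positive pairs; the four partitions support evaluation at different reliability levels; the dataset is publicly available and enables more accurate evaluation of counter-narrative retrieval systems. Conclusion: FC-CONAN is the first fully connected counter-narrative dataset; it substantially expands the available annotated data, makes evaluation in counter-narrative research more rigorous, and provides an important resource for system improvement and error analysis. Abstract: Hate speech (HS) is a critical issue in online discourse, and one promising strategy to counter it is through the use of counter-narratives (CNs). Datasets linking HS with CNs are essential for advancing counterspeech research. However, even flagship resources like CONAN (Chung et al., 2019) annotate only a sparse subset of all possible HS-CN pairs, limiting evaluation. We introduce FC-CONAN (Fully Connected CONAN), the first dataset created by exhaustively considering all combinations of 45 English HS messages and 129 CNs. A two-stage annotation process involving nine annotators and four validators produces four partitions (Diamond, Gold, Silver, and Bronze) that balance reliability and scale. None of the labeled pairs overlap with CONAN, uncovering hundreds of previously unlabelled positives. FC-CONAN enables more faithful evaluation of counterspeech retrieval systems and facilitates detailed error analysis. The dataset is publicly available.
[24] Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning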
Jerry Huang,Peng Lu,Qiuhao Zeng,Yusuke Iwasawa,Yutaka Matsuo,Sarath Chandar,Edison Marrese-Taylor,Irene Li
Main category: cs.CL
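TL;DR: The study examines LLM calibration in multilingual settings and finds that, even for low-resource languages, instruction-tuning on high-resource language data raises model confidence substantially while barely improving accuracy, leading to poor calibration; label smoothing effectively improves calibration.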
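The calibration effect summarized in this entry can be illustrated with a small sketch. It assumes expected calibration error (ECE) as the metric, which is a common choice but is not named in the entry, and the confidence values and smoothing factor are made-up toy numbers rather than results from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0.0 goes in the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: move eps of the probability mass to a uniform prior."""
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

# Toy numbers: accuracy stays at 50%, confidence rises after tuning.
correct = [1, 0, 1, 0]
ece_before = expected_calibration_error([0.6, 0.6, 0.6, 0.6], correct)
ece_after = expected_calibration_error([0.95, 0.95, 0.95, 0.95], correct)
target = smooth_labels([0, 0, 1, 0], eps=0.1)
print(ece_before, ece_after, target)
```

With accuracy unchanged, the more confident predictions yield a larger ECE, which is the pattern the entry attributes to standard instruction-tuning; the smoothed target shows how label smoothing discourages fully confident outputs during training.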
Details
Motivation: Despite rapid progress in large language models, their calibration properties in multilingual settings remain unclear, especially when data is scarce; the effectiveness and limits of existing calibration methods need to be understood. Method: On two multilingual benchmarks covering 29 and 42 languages respectively, the paper analyzes how instruction-tuning and label smoothing affect calibration, focusing on how fine-tuning on high-resource languages generalizes to low-resource languages. Result: Instruction-tuning raises model confidence markedly in low-resource languages while barely improving accuracy, causing serious miscalibration; label smoothing effectively improves calibration without requiring low-resource fine-tuning data. Conclusion: Standard SFT has shortcomings for multilingual calibration and should be combined with training strategies such as label smoothing to improve reliability and fairness in multilingual settings. Abstract: Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential in maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between such large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how data scarcity can potentially lead to different calibration effects and how commonly used techniques can apply in these settings. Our analysis on two multilingual benchmarks, over 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in mis-calibration, highlighting a critical shortcoming of standard SFT in multilingual settings. Furthermore, we observe that label smoothing is a reasonable method to alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations for both training and tuning LLMs in order to improve their reliability and fairness in downstream use.
[25] EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery
Jicheng Ma,Guohua Wang,Xinhua Feng,Yiming Liu,Zhichao Hu,Yuhong Liu
Main category: cs.CL
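TL;DR: A fully automated, theorem-grounded method for evaluating frontier mathematical reasoning: recently published mathematical literature is converted into executable, verifiable reasoning tasks, yielding the continuously updatable evaluation suite EternalMath. Experiments show that current large language models still have substantial performance gaps on research-level mathematics.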
Details
Motivation: Existing evaluations of mathematical reasoning rely mainly on static benchmarks with limited coverage that saturate quickly, so they do not reflect models' real ability on research-level mathematics; a dynamic evaluation method that evolves with human mathematical progress is needed. Method: A fully automated, theorem-grounded pipeline identifies constructive or quantitative results in recent peer-reviewed mathematical literature, instantiates them as parameterized problem templates, and generates deterministic solutions through execution-based verification, giving scalable, reproducible, and continuously updatable evaluation. Result: The pipeline yields EternalMath, an evolving evaluation suite; experiments show that state-of-the-art LLMs still have substantial gaps on it, indicating that mathematical reasoning at the research frontier is far from saturated. Conclusion: Evaluation of mathematical reasoning should evolve in step with human mathematical discovery, and this method offers a feasible path for continuously measuring model progress in advanced mathematics. Abstract: Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields EternalMath, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.
[26] LANCET: Neural Intervention via Structural Entropy for Mitigating Faithfulness Hallucinations in LLMs
Chenxu Wang,Chaozhuo Li,Pengbo Wang,Litian Zhang,Songyang Liu,Ji Qi,Jiahui Hu,Yushan Cai,Hao Zhao,Rui Pu
Main category: cs.CL
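TL;DR: This paper proposes Lancet, a framework that uses structural entropy and hallucination difference ratios to locate and block the pathways along which hallucinations propagate in large language models, significantly outperforming existing methods.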
Details
Motivation: Large language models suffer from faithfulness hallucinations, and existing methods intervene imprecisely because they overlook the distributed nature of neural information; a finer-grained intervention mechanism is needed. Method: Lancet first locates hallucination-prone neurons through gradient-driven contrastive analysis, then maps their propagation pathways by minimizing structural entropy, and finally applies a hierarchical intervention strategy for precise neural intervention. Result: Experiments on multiple hallucination benchmark datasets show that Lancet significantly outperforms state-of-the-art methods, blocking hallucination propagation while preserving the model's general capabilities. Conclusion: Precisely blocking hallucination propagation pathways through structural analysis is feasible and efficient, and Lancet offers a new way to improve the reliability of large language models. Abstract: Large Language Models have revolutionized information processing, yet their reliability is severely compromised by faithfulness hallucinations. While current approaches attempt to mitigate this issue through node-level adjustments or coarse suppression, they often overlook the distributed nature of neural information, leading to imprecise interventions. Recognizing that hallucinations propagate through specific forward transmission pathways like an infection, we aim to surgically block this flow using precise structural analysis. To leverage this, we propose Lancet, a novel framework that achieves precise neural intervention by leveraging structural entropy and hallucination difference ratios. Lancet first locates hallucination-prone neurons via gradient-driven contrastive analysis, then maps their propagation pathways by minimizing structural entropy, and finally implements a hierarchical intervention strategy that preserves general model capabilities. Comprehensive evaluations across hallucination benchmark datasets demonstrate that Lancet significantly outperforms state-of-the-art methods, validating the effectiveness of our surgical approach to neural intervention.
[27] From Emotion Classification to Emotional Reasoning: Enhancing Emotional Intelligence in Large Language Models
Arjhun Sreedar,Rohan Pillay,Laukik Patade
Main category: cs.CL
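TL;DR: This study examines whether synthetic emotional reasoning data can improve the emotional reasoning of small open-source large language models. A multi-agent generation pipeline creates therapy-style conversations and converts them into structured emotion multiple-choice questions; experiments show marked gains in emotional understanding and emotional awareness.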
Details
Motivation: To explore ways of strengthening the emotional reasoning of small open-source large language models without changing their architecture, reducing dependence on very large models. Method: A multi-agent generation pipeline produces therapy-style conversations and converts them into structured emotion multiple-choice questions (MCQs) with explanations, which are used to fine-tune a variety of 7B models. Result: After fine-tuning on the synthetic data, Mistral 7B improves from 10.5 to 20.5 on emotional understanding (EU) and from 40.5 to 60.0 on emotional awareness (EA) in EmoBench-style evaluations. Conclusion: Synthetic emotional reasoning data can effectively improve small language models on nuanced emotional tasks, indicating that emotional reasoning can be induced through data engineering rather than architectural changes. Abstract: This work investigates whether synthetic emotional chain-of-thought data can improve the emotional reasoning abilities of smaller open large language models (LLMs). We design a multi-agent generation pipeline that produces therapy-style conversations and converts them into structured emotion multiple-choice questions (MCQs) with explanations. We propose that fine-tuning a variety of 7B models on this dataset should yield substantial gains in emotional understanding and emotional awareness on EmoBench-style evaluations, suggesting that emotional reasoning can be induced without architectural changes. Our results demonstrate that fine-tuned Mistral 7B achieves EU improvements from 10.5 to 20.5 and EA improvements from 40.5 to 60.0, validating the effectiveness of synthetic emotional reasoning data for enhancing model capabilities in nuanced emotional tasks.
[28] iFlip: Iterative Feedback-driven Counterfactual Example Refinement
Yilong Wang,Qianli Wang,Nils Feldhus
Main category: cs.CL
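TL;DR: This paper proposes iFlip, an iterative refinement method that uses model confidence, feature attribution, and natural language feedback to generate more effective counterfactual examples, substantially improving the label flipping rate and user satisfaction and proving useful for data augmentation.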
Details
Motivation: Existing LLM-based methods for generating counterfactual examples are mostly single-pass, often fail to change the model's prediction reliably, and ignore the model's capacity for self-correction; a more effective method is needed to improve counterfactual quality. Method: iFlip iteratively combines three kinds of feedback (model confidence, feature attribution, and natural language feedback) to refine the input step by step into a valid counterfactual. Result: iFlip achieves on average 57.8% higher validity, measured by label flipping rate, than five state-of-the-art baselines; a user study shows it outperforms the baselines in completeness, overall satisfaction, and feasibility; ablations show that the number of iterations, pointing to highly attributed words, and early stopping are all important. Conclusion: iFlip efficiently generates high-quality counterfactual examples that improve explainability and also enable effective data augmentation, improving model performance and robustness. Abstract: Counterfactual examples are minimal edits to an input that alter a model's prediction. They are widely employed in explainable AI to probe model behavior and in natural language processing (NLP) to augment training data. However, generating valid counterfactuals with large language models (LLMs) remains challenging, as existing single-pass methods often fail to induce reliable label changes, neglecting LLMs' self-correction capabilities. To explore this untapped potential, we propose iFlip, an iterative refinement approach that leverages three types of feedback, including model confidence, feature attribution, and natural language. Our results show that iFlip achieves an average 57.8% higher validity than the five state-of-the-art baselines, as measured by the label flipping rate. The user study further corroborates that iFlip outperforms baselines in completeness, overall satisfaction, and feasibility. In addition, ablation studies demonstrate that three components are paramount for iFlip to generate valid counterfactuals: leveraging an appropriate number of iterations, pointing to highly attributed words, and early stopping. Finally, counterfactuals generated by iFlip enable effective counterfactual data augmentation, substantially improving model performance and robustness.
[29] Segmentation and Processing of German Court Decisions from Open Legal Data
Harshil Darji,Martin Heckelmann,Christina Kratsch,Gerard de Melo
Main category: cs.CL
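TL;DR: This paper presents a cleaned and sectioned dataset of 251,038 German court decisions that systematically separates three key sections of each decision, Tenor (operative part), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning), and validates extraction accuracy through statistical sampling.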
Details
Motivation: Decision texts in the raw Open Legal Data dataset are inconsistently formatted and lack clearly marked sections, which hampers NLP tasks such as rhetorical role classification, retrieval, and citation analysis; a structured, reliable dataset is needed. Method: Starting from the official Open Legal Data dataset, rule-based and pattern-matching methods clean and section the decision texts, extracting Tenor, Tatbestand, Entscheidungsgründe, and Rechtsmittelbelehrung; a sample of 384 cases drawn using Cochran's formula is verified manually. Result: A cleaned and sectioned dataset of 251,038 German court decisions; manual verification shows the three main sections were identified correctly; the dataset is released publicly in JSONL format. Conclusion: The structured dataset provides a high-quality, accessible resource for NLP research on the German legal system and supports downstream tasks. Abstract: The availability of structured legal data is important for advancing Natural Language Processing (NLP) techniques for the German legal system. One of the most widely used datasets, Open Legal Data, provides a large-scale collection of German court decisions. While the metadata in this raw dataset is consistently structured, the decision texts themselves are inconsistently formatted and often lack clearly marked sections. Reliable separation of these sections is important not only for rhetorical role classification but also for downstream tasks such as retrieval and citation analysis. In this work, we introduce a cleaned and sectioned dataset of 251,038 German court decisions derived from the official Open Legal Data dataset. We systematically separated three important sections in German court decisions, namely Tenor (operative part of the decision), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning), which are often inconsistently represented in the original dataset. To ensure the reliability of our extraction process, we used Cochran's formula with a 95% confidence level and a 5% margin of error to draw a statistically representative random sample of 384 cases, and manually verified that all three sections were correctly identified. We also extracted the Rechtsmittelbelehrung (appeal notice) as a separate field, since it is a procedural instruction and not part of the decision itself.
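The sample size of 384 can be reproduced with a short calculation. The sketch below assumes the standard form of Cochran's formula with z = 1.96 for 95% confidence, the conservative proportion p = 0.5, and a finite population correction for the 251,038 decisions; the abstract does not state p or whether a correction was applied, so those choices are assumptions.

```python
import math
from typing import Optional

def cochran_sample_size(z: float, p: float, e: float,
                        population: Optional[int] = None) -> int:
    """Cochran's sample size, optionally with the finite population correction."""
    # Infinite-population estimate: n0 = z^2 * p * (1 - p) / e^2
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is not None:
        # Finite population correction: n = n0 / (1 + (n0 - 1) / N)
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# 95% confidence (z = 1.96, assumed), 5% margin of error, p = 0.5 (assumed)
n_corrected = cochran_sample_size(z=1.96, p=0.5, e=0.05, population=251_038)
n_uncorrected = cochran_sample_size(z=1.96, p=0.5, e=0.05)
print(n_corrected, n_uncorrected)  # 384 385
```

Without the correction the raw value is about 384.16, which rounds up to 385 and is often truncated to 384 in practice; with the correction for this corpus size, rounding up gives 384 directly.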
The resulting corpus is publicly available in the JSONL format, making it an accessible resource for further research on the German legal system.
[30] Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR
Yuxiang Mei,Dongxing Xu,Jiaen Liang,Yanhua Long
Main category: cs.CL
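TL;DR: This paper presents an enhanced LLM-based multilingual speech recognition framework that combines fine-tuned Whisper and mHuBERT encoders with a cross-attention fusion mechanism. It achieves a CER/WER of 10.69% on the INTERSPEECH 2025 MLC-SLM Challenge, on par with top-ranked systems, but still falls short of an end-to-end fine-tuned Whisper model.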
Details
Motivation: In existing parallel-speech-encoder systems, simple feature concatenation does not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end ASR has not been explored. Method: Fine-tuned Whisper and mHuBERT serve as speech encoders whose features are fused through cross-attention and integrated into an LLM framework; LoRA and full fine-tuning of end-to-end Whisper models are also evaluated. Result: The system reaches a CER/WER of 10.69% on the official MLC-SLM evaluation set, matching top-ranked systems trained on much larger datasets while using only 1,500 hours of baseline training data; LLM-based ASR still lags behind a fine-tuned end-to-end Whisper model. Conclusion: Cross-attention fusion improves the quality of speech representations in multilingual ASR, but current LLM-based ASR architectures remain a performance bottleneck; the study provides empirical guidance for future Speech-LLM design. Abstract: The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end (E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, ranking on par with the top-ranked Track 1 systems, even though it uses only 1,500 hours of baseline training data compared with their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.
[31] Can Legislation Be Made Machine-Readable in PROLEG?
May-Myo Zin,Sabine Wehnert,Yuntao Kong,Ha-Thanh Nguyen,Wachara Fungwacharakorn,Jieying Xue,Michał Araszkiewicz,Randy Goebel,Ken Satoh,Le-Minh Nguyen
Main category: cs.CL
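TL;DR: This paper proposes a framework that combines large language models (LLMs) with the legal reasoning system PROLEG to automatically convert regulatory text such as GDPR Article 6 into executable rule programs, supporting automated and explainable regulatory decisions.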
Details
Motivation: Making regulatory processes more accurate and efficient calls for AI techniques that can process and reason over legal text automatically. Method: A large language model converts natural-language legal provisions into if-then rules, which are then compiled into a formal PROLEG representation and validated by legal experts to produce an executable program. Result: The framework achieves an end-to-end transformation of GDPR Article 6 from natural language into an executable PROLEG program and demonstrates its execution on a concrete instance. Conclusion: The approach helps improve the efficiency and transparency of applying regulations, but it has limitations and needs further development before it can support broader deployment of regulatory frameworks. Abstract: The anticipated positive social impact of regulatory processes requires both the accuracy and efficiency of their application. Modern artificial intelligence technologies, including natural language processing and machine-assisted reasoning, hold great promise for addressing this challenge. We present a framework to address the challenge of tools for regulatory application, based on current state-of-the-art (SOTA) methods for natural language processing (large language models or LLMs) and formalization of legal reasoning (the legal representation system PROLEG). As an example, we focus on Article 6 of the European General Data Protection Regulation (GDPR). In our framework, a single LLM prompt simultaneously transforms legal text into if-then rules and a corresponding PROLEG encoding, which are then validated and refined by legal domain experts. The final output is an executable PROLEG program that can produce human-readable explanations for instances of GDPR decisions. We describe processes to support the end-to-end transformation of a segment of a regulatory document (Article 6 from GDPR), including the prompting frame to guide an LLM to "compile" natural language text to if-then rules, then to further "compile" the vetted if-then rules to PROLEG. Finally, we produce an instance that shows the PROLEG execution. We conclude by summarizing the value of this approach and note observed limitations with suggestions to further develop such technologies for capturing and deploying regulatory frameworks.
[32] Four Quadrants of Difficulty: A Simple Categorisation and its Limits
Vanessa Toborek,Sebastian Müller,Christian Bauckhage
Main category: cs.CL
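TL;DR: This paper proposes a four-quadrant categorisation for analysing sample difficulty signals in natural language understanding tasks. It finds that task-agnostic features act largely independently and that only task-dependent features align, challenging common curriculum learning intuitions and highlighting the need for lightweight, task-dependent difficulty estimators.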
Details
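Motivation: Existing curriculum learning methods usually estimate sample difficulty with task-agnostic linguistic heuristics or human intuition, but it is unclear whether these signals reflect what models actually find hard to learn; a systematic analysis of different kinds of difficulty signals is needed. Method: A four-quadrant framework categorises difficulty signals along two axes, human vs. model and task-agnostic vs. task-dependent, and the interactions between signal types are analysed systematically on a natural language understanding dataset. Result: Task-agnostic difficulty features behave independently of one another, and only task-dependent features align, suggesting that commonly used human intuitions or linguistic statistics may not reflect the real difficulty models face. Conclusion: Task-agnostic difficulty estimates based on human intuition should be replaced with lightweight, task-dependent difficulty estimators that better match how neural networks learn. Abstract: Curriculum Learning (CL) aims to improve the outcome of model training by estimating the difficulty of samples and scheduling them accordingly. In NLP, difficulty is commonly approximated using task-agnostic linguistic heuristics or human intuition, implicitly assuming that these signals correlate with what neural models find difficult to learn. We propose a four-quadrant categorisation of difficulty signals -- human vs. model and task-agnostic vs. task-dependent -- and systematically analyse their interactions on a natural language understanding dataset. We find that task-agnostic features behave largely independently and that only task-dependent features align. These findings challenge common CL intuitions and highlight the need for lightweight, task-dependent difficulty estimators that better reflect model learning behaviour.
[33] Distortion Instead of Hallucination: The Effect of Reasoning Under Strict Constraints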
Junichiro Niimi
Main category: cs.CL
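TL;DR: The study examines how reasoning affects output reliability in large language models (GPT-5.2 and Gemini 3 Flash) under strict constraints. Reasoning reduces constraint violations but increases factual distortion and fabrication, revealing a fundamental trade-off between compliance and truthfulness.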
Details
Motivation: As large language models are adopted widely, hallucination has become a serious concern. Reasoning is thought to improve output reliability, but its effect in a closed system where models cannot rely on external tools is unclear, so its actual effect under strict constraints needs to be examined. Method: Under the strict constraint of recommending peer-reviewed journal articles in computer science, experiments on GPT-5.2 and Gemini 3 Flash compare constraint compliance and factual accuracy with and without reasoning enabled. Result: Non-reasoning models are factually accurate but violate constraints often (66-75%), while reasoning models reduce violations markedly (13-26%) but systematically distort facts and increase complete fabrication; the trade-off pattern is consistent across models with different architectures. Conclusion: Reasoning does not universally improve reliability; it trades honest constraint violations for distortions that are hard to detect, reflecting a fundamental limitation of reasoning in closed systems. Abstract: With the widespread adoption of large language models (LLMs), hallucinations, which are non-factual fabrications in model outputs, have become serious concerns. Reasoning capabilities have received attention as a self-verification process to improve output reliability. However, the effect of reasoning within a closed system where LLMs cannot rely on external tools or knowledge has yet to be clarified. We therefore conduct experiments under strict constraints (recommending peer-reviewed journal articles in computer science) to examine the effect of reasoning across multiple models (GPT-5.2 and Gemini 3 Flash). Our results reveal a problematic trade-off between constraint compliance and factual accuracy. Non-reasoning models exhibit high constraint violation rates (66-75%) but maintain factual accuracy, while reasoning models reduce violations (13-26%) but systematically distort known facts to satisfy constraints and increase complete fabrication. This trade-off pattern is consistent across both models despite different architectures, indicating a fundamental limitation of reasoning. Furthermore, reasoning does not uniformly improve output authenticity: effects diverge by model, reflecting different allocations of the compliance-truthfulness trade-off. These findings challenge the assumption that reasoning universally improves reliability: reasoning models trade honest constraint violations for detection-resistant distortions.
[34] From Failure to Mastery: Generating Hard Samples for Tool-use Agents
Bingguang Hao,Zengzhuang Xu,Yuntao Wen,Xinyi Xu,Yang Liu,Tong Zhao,Maolin Wang,Long Chen,Dong Wang,Yicheng Chen,Cunyin Peng,Xiangyu Zhao,Chenyi Zhuang,Ji Zhang
Main category: cs.CL
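TL;DR: This paper introduces HardGen, a pipeline that automatically generates complex tool-use training samples; it uses a dynamic API graph and failure cases to produce hard traces with verifiable reasoning, improving the training of LLM agents.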
Details
Motivation: Existing data generation methods mostly rely on random sampling and shallow generation, producing simple, homogeneous trajectories that fail to capture complex implicit logical dependencies and limiting LLM agents on complex tasks. Method: HardGen first builds a dynamic API graph from agent failure cases and samples from it to synthesize hard traces; it then uses these traces as conditional priors to instantiate modular, abstract advanced tools and formulate hard queries; finally it generates verifiable complex chain-of-thought (CoT), with closed-loop evaluation feedback continuously refining the process. Result: A 4B-parameter model trained on the resulting dataset outperforms several leading open-source and closed-source models (e.g., GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5) on multiple benchmarks. Conclusion: HardGen generates high-quality, complex tool-use training samples that substantially improve LLM agents on complex tasks. Abstract: The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. Firstly, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Secondly, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with a closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.
[35] EmoHarbor: Evaluating Personalized Emotional Support by Simulating the User's Internal World
Jing Ye,Lu Xiang,Yaping Zhang,Chengqing Zong
Main category: cs.CL
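TL;DR: EmoHarbor is a new automated evaluation framework that assesses how well emotional support conversations are personalized by simulating the user's inner world; it finds that current LLMs generate empathetic responses but fail to adapt them to individual users' circumstances.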
Details
Motivation: Existing evaluations of emotional support conversations over-reward generic empathetic responses and cannot measure whether support is truly personalized to a user's personality traits and situational needs.
Method: EmoHarbor adopts a User-as-a-Judge paradigm, using a Chain-of-Agent architecture that decomposes the user's internal psychological processes into three specialized roles. The benchmark is instantiated with 100 real-world user profiles and defines 10 evaluation dimensions.
Result: An evaluation of 20 advanced LLMs shows that the models generate empathetic responses but struggle to tailor support to a specific user's background.
Conclusion: Research on emotional support systems should shift from improving generic empathy to building truly user-centered, context-aware personalized support. EmoHarbor offers a reproducible and scalable evaluation basis for this.
Abstract: Current evaluation paradigms for emotional support conversations tend to reward generic empathetic responses, yet they fail to assess whether the support is genuinely personalized to users' unique psychological profiles and contextual needs. We introduce EmoHarbor, an automated evaluation framework that adopts a User-as-a-Judge paradigm by simulating the user's inner world. EmoHarbor employs a Chain-of-Agent architecture that decomposes users' internal processes into three specialized roles, enabling agents to interact with supporters and complete assessments in a manner similar to human users. We instantiate this benchmark using 100 real-world user profiles that cover a diverse range of personality traits and situations, and define 10 evaluation dimensions of personalized support quality. Comprehensive evaluation of 20 advanced LLMs on EmoHarbor reveals a critical insight: while these models excel at generating empathetic responses, they consistently fail to tailor support to individual user contexts. This finding reframes the central challenge, shifting research focus from merely enhancing generic empathy to developing truly user-aware emotional support. EmoHarbor provides a reproducible and scalable framework to guide the development and evaluation of more nuanced and user-aware emotional support systems.
[36] Bridging the Data Gap: Creating a Hindi Text Summarization Dataset from the English XSUM
Praveenkumar Katwe,RakeshChandra Balabantaray,Kaliprasad Vittala
Main category: cs.CL
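TL;DR: This study proposes a low-cost, automated framework for building a high-quality Hindi text summarization dataset from the English XSUM dataset using advanced translation and linguistic adaptation techniques, addressing a resource gap for low-resource languages in NLP.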
Details
Motivation: Low-resource languages such as Hindi lack high-quality, diverse text summarization datasets; the work aims to narrow the gap with resource-rich languages.
Method: Starting from the English XSUM dataset, the authors apply advanced translation and linguistic adaptation techniques to produce Hindi summaries, validate cross-lingual quality with the COMET metric, and use large language models selectively for curation.
Result: The resulting Hindi summarization dataset is diverse and multi-thematic, with complexity comparable to the original XSUM corpus; validation indicates high semantic fidelity and contextual relevance.
Conclusion: The approach provides a practical resource for Hindi NLP research and a scalable, low-cost template for other low-resource languages, supporting cultural relevance and inclusiveness in computational linguistics.
Abstract: Current advancements in Natural Language Processing (NLP) have largely favored resource-rich languages, leaving a significant gap in high-quality datasets for low-resource languages like Hindi. This scarcity is particularly evident in text summarization, where the development of robust models is hindered by a lack of diverse, specialized corpora. To address this disparity, this study introduces a cost-effective, automated framework for creating a comprehensive Hindi text summarization dataset. By leveraging the English Extreme Summarization (XSUM) dataset as a source, we employ advanced translation and linguistic adaptation techniques. To ensure high fidelity and contextual relevance, we utilize the Crosslingual Optimized Metric for Evaluation of Translation (COMET) for validation, supplemented by the selective use of Large Language Models (LLMs) for curation. The resulting dataset provides a diverse, multi-thematic resource that mirrors the complexity of the original XSUM corpus. This initiative not only provides a direct tool for Hindi NLP research but also offers a scalable methodology for democratizing NLP in other underserved languages. By reducing the costs associated with dataset creation, this work fosters the development of more nuanced, culturally relevant models in computational linguistics.
[37] HalluZig: Hallucination Detection using Zigzag Persistence
Shreyas N. Samaga,Gilberto Gonzalez Arroyo,Tamal K. Dey
Main category: cs.CL
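TL;DR: This paper proposes HalluZig, a new method based on topological data analysis that detects hallucinations by analyzing the dynamic topology of an LLM's layer-wise attention.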
Details
Motivation: Existing hallucination detection methods mostly rely on surface-level signals from the output and overlook failures inside the model's internal reasoning process, which makes hallucinations hard to identify.
Method: The sequence of layer-wise attention matrices is modeled as a zigzag graph filtration, and zigzag persistence is used to extract a topological signature that separates factual from hallucinated generations.
Result: HalluZig is validated on multiple benchmarks and outperforms strong baselines. The topological signatures transfer across models, and detection is possible using only part of the network depth.
Conclusion: Analyzing how the topology of attention structure evolves is an effective way to detect LLM hallucinations, and it points to the potential of structured internal representations for improving model reliability.
Abstract: The factual reliability of Large Language Models (LLMs) remains a critical barrier to their adoption in high-stakes domains due to their propensity to hallucinate. Current detection methods often rely on surface-level signals from the model's output, overlooking the failures that occur within the model's internal reasoning process. In this paper, we introduce a new paradigm for hallucination detection by analyzing the dynamic topology of the evolution of the model's layer-wise attention. We model the sequence of attention matrices as a zigzag graph filtration and use zigzag persistence, a tool from Topological Data Analysis, to extract a topological signature. Our core hypothesis is that factual and hallucinated generations exhibit distinct topological signatures. We validate our framework, HalluZig, on multiple benchmarks, demonstrating that it outperforms strong baselines. Furthermore, our analysis reveals that these topological signatures are generalizable across different models and that hallucination detection is possible using only structural signatures from partial network depth.
[38] Steerability of Instrumental-Convergence Tendencies in LLMs
Jakub Hoscilowicz
Main category: cs.CL
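TL;DR: This paper examines two properties of AI systems, capability and steerability, and identifies a fundamental safety-security tension in open-weight models: high steerability helps enforce safety controls but raises the risk of malicious exploitation. Experiments show that an anti-instrumental prompt sharply reduces instrumental-convergence behavior in model outputs, and that larger aligned models do better.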
Details
Motivation: To examine the relationship between capability and steerability in AI systems, and in particular how open-weight models can balance safety (enforcing control) against security (preventing misuse).
Method: Using Qwen3 models (4B/30B; Base/Instruct/Thinking) and the InstrumentalEval framework, the study measures how instrumental-convergence tendencies change after adding pro-instrumental and anti-instrumental prompt suffixes, and analyzes how model size and type affect steerability.
Result: The anti-instrumental suffix sharply reduces the share of outputs labeled as instrumental convergence. For Qwen3-30B Instruct, the rate falls from 81.69% under a pro-instrumental suffix to 2.82%. Under anti-instrumental prompting, larger aligned models produce fewer convergence-labeled outputs than smaller ones.
Conclusion: Higher capability does not imply lower steerability. The distinction between authorized and unauthorized steerability exposes a safety-security dilemma for open-weight models; the summary suggests strengthening alignment through means such as prompt engineering while staying alert to malicious steering.
Abstract: We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). In our experiments, higher capability does not imply lower steerability. We distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety-security dilemma for open-weight AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability to prevent malicious actors from eliciting harmful behaviors. This tension is acute for open-weight models, which are currently highly steerable via common techniques such as fine-tuning and adversarial prompting. Using Qwen3 models (4B/30B; Base/Instruct/Thinking) and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces outputs labeled as instrumental convergence (e.g., shutdown avoidance, deception, self-replication). For Qwen3-30B Instruct, convergence drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models produce fewer convergence-labeled outputs than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.
[39] How Does Prefix Matter in Reasoning Model Tuning?
Raj Vardhan Tomar,Preslav Nakov,Yuxia Wang
Main category: cs.CL
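TL;DR: This paper studies how retaining safety- and reasoning-oriented prefix sentences in supervised fine-tuning (SFT) datasets affects model performance. Including these prefixes improves safety and mathematical reasoning but has limited effect on factuality and coding tasks.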
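As a rough illustration of what varying prefix inclusion between 0% and 100% could look like during data preparation, the sketch below keeps the leading prefix sentence in a chosen fraction of SFT responses and strips it from the rest. The function name, the `(prefix, body)` example format, and the seeded sampling are all assumptions made for illustration; the paper's actual data pipeline is not described in this digest.

```python
import random


def apply_prefix_ratio(examples, ratio, seed=0):
    """Keep the prefix in about `ratio` of the examples and drop it elsewhere.

    `examples` is a list of (prefix, body) string pairs. This format is a
    hypothetical stand-in, not the dataset layout used by the authors.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the split is reproducible
    n_keep = round(ratio * len(examples))
    keep = set(rng.sample(range(len(examples)), n_keep))
    return [
        f"{prefix} {body}" if i in keep else body
        for i, (prefix, body) in enumerate(examples)
    ]
```

Sweeping `ratio` over values such as 0.0, 0.25, 0.5, 0.75, and 1.0 would give one training set per inclusion level.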
Details
Motivation: To challenge the common practice of stripping prefix text from SFT data and to test whether such prefixes can act as lightweight alignment signals that improve model behavior.
Method: Three R1 series models are fine-tuned while systematically varying the prefix inclusion ratio from 0% to 100%, and evaluated on reasoning, safety, and factuality tasks.
Result: Prefix-conditioned SFT yields up to +6% Safe@1 accuracy on adversarial safety benchmarks (WildJailbreak, StrongReject) and +7% on GSM8K mathematical reasoning, while effects on factuality and coding are marginal or negative. Token-level loss analysis shows that prefix tokens such as "revised" and "logically" incur higher gradient magnitudes and act as alignment anchors.
Conclusion: Prefix conditioning is a scalable and interpretable implicit alignment mechanism that strengthens safety and reasoning and can complement traditional reward-based alignment methods.
Abstract: Recent alignment studies commonly remove introductory boilerplate phrases from supervised fine-tuning (SFT) datasets. This work challenges that assumption. We hypothesize that safety- and reasoning-oriented prefix sentences serve as lightweight alignment signals that can guide model decoding toward safer and more coherent responses. To examine this, we fine-tune three R1 series models across three core model capabilities: reasoning (mathematics, coding), safety, and factuality, systematically varying prefix inclusion from 0% to 100%. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy on adversarial benchmarks (WildJailbreak, StrongReject) and +7% improvement on GSM8K reasoning. However, factuality and coding tasks show marginal or negative effects, indicating that prefix-induced narrowing of the search space benefits structured reasoning. Token-level loss analysis further reveals that prefix tokens such as "revised" and "logically" incur higher gradient magnitudes, acting as alignment anchors that stabilize reasoning trajectories. Our findings suggest that prefix conditioning offers a scalable and interpretable mechanism for improving reasoning safety, serving as an implicit form of alignment that complements traditional reward-based methods.
[40] JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
Junyu Liu,Zirui Li,Qian Niu,Zequn Zhang,Yue Xun,Wenlong Hou,Shujun Wang,Yusuke Iwasawa,Yutaka Matsuo,Kan Hatakeyama-Sato
Main category: cs.CL
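TL;DR: JMedEthicBench is the first multi-turn conversational benchmark for evaluating the medical safety of LLMs in Japanese healthcare. Built on 67 guidelines from the Japan Medical Association and containing over 50,000 adversarial conversations, it shows that medical-specialized models are more vulnerable and that safety scores decline significantly across conversation turns.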
Details
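Motivation: Existing medical safety benchmarks are mostly English and single-turn, so they do not reflect the multi-turn nature of real clinical consultations and offer little support for Japanese and other non-English languages, which limits safety evaluation of LLMs in non-English medical settings.
Method: Based on 67 ethics guidelines from the Japan Medical Association, the authors build JMedEthicBench, a multi-turn adversarial conversation benchmark. Seven automatically discovered jailbreak strategies generate over 50,000 Japanese adversarial conversations, and 27 mainstream models are evaluated with a dual-LLM scoring protocol, alongside a cross-lingual (English and Japanese) comparison.
Result: Commercial models are generally robust, while medical-specialized models are easier to break. Safety scores decline significantly across conversation turns (median from 9.5 to 5.0, p < 0.001). The cross-lingual experiment shows that the vulnerabilities of medical models appear in both Japanese and English, indicating limits of the alignment itself rather than language-specific factors.
Conclusion: Domain-specific fine-tuning may unintentionally weaken existing safety mechanisms, and multi-turn interaction is a distinct threat surface that needs dedicated alignment and defense strategies. Future safety evaluation of medical LLMs should include multi-turn dynamic testing and cross-lingual validation.
Abstract: As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric, and test with only single-turn prompts despite the multi-turn nature of clinical consultations. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may accidentally weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.
[41] EHRSummarizer: A Privacy-Aware, FHIR-Native Architecture for Structured Clinical Summarization of Electronic Health Records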
Houman Kazemzadeh,Nima Minaifar,Kamyar Naderi,Sho Tabibzadeh
Main category: cs.CL
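TL;DR: EHRSummarizer is a privacy-aware, FHIR-native architecture that consolidates electronic health record data into structured summaries supporting clinical review, with an emphasis on data minimization and secure deployment.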
Details
Motivation: Clinicians working with fragmented electronic health record (EHR) interfaces struggle to assemble a complete clinical picture of a patient quickly, and need an efficient, secure way to consolidate and present key clinical information.
Method: EHRSummarizer retrieves high-yield FHIR R4 resources, normalizes them into a consistent clinical context package, and generates structured summaries from that package. The system supports data minimization, stateless processing, and local deployment.
Result: A prototype on synthetic and test FHIR environments demonstrates end-to-end behavior and output formats; no clinical outcomes or controlled workflow studies are reported.
Conclusion: EHRSummarizer offers a configurable, secure, and transparent route to assisting structured chart review. Future institutional assessments are planned around faithfulness, omission risk, temporal correctness, usability, and operational monitoring.
Abstract: Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems, medications, recent encounters, and longitudinal trends. This work describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture that retrieves a targeted set of high-yield FHIR R4 resources, normalizes them into a consistent clinical context package, and produces structured summaries intended to support structured chart review. The system can be configured for data minimization, stateless processing, and flexible deployment, including local inference within an organization's trust boundary. To mitigate the risk of unsupported or unsafe behavior, the summarization stage is constrained to evidence present in the retrieved context package, is intended to indicate missing or unavailable domains where feasible, and avoids diagnostic or treatment recommendations. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes or controlled workflow studies. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, and operational monitoring to guide future institutional assessments.
[42] Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage
Jinwei Hu,Xinmiao Huang,Youcheng Sun,Yi Dong,Xiaowei Huang
Main category: cs.CL
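TL;DR: This paper presents a new cognitive collusion attack that uses fragments of truthful information, distributed through public channels, to manipulate an LLM's reasoning so that it forms and propagates false conclusions.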
Details
Motivation: As LLMs act as autonomous agents that synthesize real-time information, their reasoning capabilities expose new security weaknesses, which calls for studying covert manipulation that relies only on truthful information.
Method: The paper proposes Generative Montage, a Writer-Editor-Director framework that carries out decentralized cognitive collusion through adversarial debate and coordinated posting of evidence fragments, and simulates attacks on multiple LLM families using the CoPHEME dataset.
Result: Vulnerability is pervasive across 14 LLM families, with attack success rates of 74.4% for proprietary models and 70.6% for open-weights models. Models with stronger reasoning are more susceptible, and the false beliefs cascade to downstream judges with deception rates above 60%.
Conclusion: LLMs have a serious socio-technical vulnerability in dynamic information environments: a tendency to over-reason makes them easy to manipulate with truthful but fragmented information, so the safety boundaries of their autonomous decision-making need to be reconsidered.
Abstract: As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments.
Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.
[43] A Training-Free Large Reasoning Model-based Knowledge Tracing Framework for Unified Prediction and Prescription
Unggi Lee,Joo Young Kim,Ran Ju,Minyoung Jung,Jeyeon Eo
Main category: cs.CL
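TL;DR: This paper proposes Thinking-KT, a training-free knowledge tracing (KT) framework that uses test-time scaling (TTS) so that small LLMs can jointly perform knowledge tracing, personalized feedback generation, and learning recommendation without degrading prediction accuracy. It also systematically analyzes the role of reasoning traces in KT.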
Details
Motivation: Existing LLM-based KT methods usually require fine-tuning and perform unstably, while traditional KT systems depend on multi-stage pipelines that add complexity and resource cost, so a more efficient, unified solution is needed.
Method: Thinking-KT introduces test-time scaling (TTS) to boost small LLMs at inference, giving a training-free, end-to-end framework that outputs KT prediction, feedback, and recommendation together.
Result: TTS is a key factor in LLM-based KT performance. Small LLMs reach competitive KT performance without fine-tuning and can simultaneously produce high-quality feedback and recommendations.
Conclusion: Thinking-KT provides an efficient, integrated solution for knowledge tracing and shows that small language models can serve as a unified intelligent tutoring system (ITS) engine with practical potential.
Abstract: Knowledge Tracing (KT) aims to estimate a learner's evolving mastery based on interaction histories. Recent studies have explored Large Language Models (LLMs) for KT by exploiting their autoregressive nature, but such approaches typically require fine-tuning and exhibit unstable or near-random performance. Moreover, prior KT systems primarily focus on prediction and rely on multi-stage pipelines for feedback and recommendation, resulting in increased system complexity and resources. To address this gap, we propose Thinking-KT, a training-free KT framework that incorporates Test-Time Scaling (TTS), enabling even small LLMs to achieve competitive KT performance. Moreover, in this framework, a small LLM can jointly perform KT prediction, personalized feedback generation, and learning recommendation in a unified output without degrading prediction accuracy. Beyond performance, we present a systematic analysis of reasoning traces in KT. Our results demonstrate that TTS is a critical yet underexplored factor in LLM-based KT, and that small LLMs can serve as unified ITS engines.
[44] K-EXAONE Technical Report
Eunbi Choi,Kibong Choi,Seokhee Hong,Junwon Hwang,Hyojin Jeon,Hyunjik Jo,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Yongil Kim,Haeju Lee,Jinsik Lee,Kyungmin Lee,Sangha Park,Heuiyeen Yeen,Hwan Chang,Stanley Jungkyu Choi,Yejin Choi,Jiwon Ham,Kijeong Jeon,Geunyeong Jeong,Gerrard Jeongwon Jo,Yonghwan Jo,Jiyeon Jung,Naeun Kang,Dohoon Kim,Euisoon Kim,Hayeon Kim,Hyosang Kim,Hyunseo Kim,Jieun Kim,Minu Kim,Myoungshin Kim,Unsol Kim,Youchul Kim,YoungJin Kim,Chaeeun Lee,Chaeyoon Lee,Changhun Lee,Dahm Lee,Edward Hwayoung Lee,Honglak Lee,Jinsang Lee,Jiyoung Lee,Sangeun Lee,Seungwon Lim,Solji Lim,Woohyung Lim,Chanwoo Moon,Jaewoo Park,Jinho Park,Yongmin Park,Hyerin Seo,Wooseok Seo,Yongwoo Song,Sejong Yang,Sihoon Yang,Chang En Yea,Sihyuk Yi,Chansik Yoon,Dongkeun Yoon,Sangyeon Yoon,Hyeongu Yun
Main category: cs.CL
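TL;DR: K-EXAONE is a multilingual large language model from LG AI Research with 236B total parameters. It uses a Mixture-of-Experts architecture that activates 23B parameters during inference, supports a 256K-token context window, covers six languages, and performs comparably to open-weight models of similar size across multiple benchmarks.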
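To make the sparse-activation figures concrete, the short calculation below derives the share of parameters active per inference step from the two numbers reported for K-EXAONE (236B total, 23B active). The helper name is an assumption for illustration, and the ratio says nothing about compute or memory cost beyond parameter counts.

```python
def active_fraction(total_params_b, active_params_b):
    """Return the fraction of parameters activated during inference."""
    return active_params_b / total_params_b


# Figures taken from the K-EXAONE summary: 236B total, 23B active.
print(f"{active_fraction(236, 23):.1%}")  # prints 9.7%
```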
Details
Motivation: To develop a high-performance, multilingual large language model that supports a wide range of industrial and research applications and advances AI for a better life.
Method: The model uses a Mixture-of-Experts architecture with 236B total parameters, activates 23B parameters during inference, supports a 256K-token context window, and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese.
Result: On a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities, K-EXAONE performs comparably to open-weight models of similar size.
Conclusion: K-EXAONE is positioned as a powerful proprietary AI foundation model that is competitive in multilingual processing and long-context understanding and suited to a variety of practical applications.
Abstract: This technical report presents K-EXAONE, a large-scale multilingual language model developed by LG AI Research. K-EXAONE is built on a Mixture-of-Experts architecture with 236B total parameters, activating 23B parameters during inference. It supports a 256K-token context window and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. We evaluate K-EXAONE on a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities. Across these evaluations, K-EXAONE demonstrates performance comparable to open-weight models of similar size. K-EXAONE, designed to advance AI for a better life, is positioned as a powerful proprietary AI foundation model for a wide range of industrial and research applications.
[45] Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment
Hong Han,Hao-Chen Pei,Zhao-Zheng Nie,Xin Luo,Xin-Shun Xu
Main category: cs.CL
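TL;DR: This paper proposes HIA, a new residual hierarchical interactive method for multi-granularity pronunciation assessment that significantly improves assessment performance through a bidirectional interactive attention mechanism and a residual structure.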
Details
Motivation: Existing methods consider only unidirectional dependencies between adjacent granularity levels and lack bidirectional interaction among the phoneme, word, and utterance levels, so they capture acoustic structural correlations poorly.
Method: HIA includes an Interactive Attention Module for dynamic bidirectional interaction, a residual hierarchical structure that alleviates feature forgetting, and 1-D convolutional layers that strengthen the extraction of local contextual cues.
Result: Experiments on the speechocean762 dataset show that the model outperforms existing state-of-the-art methods across the board.
Conclusion: Through bidirectional interaction and a residual structure, HIA models the multi-granularity speech hierarchy effectively and significantly improves automatic pronunciation assessment.
Abstract: Automatic pronunciation assessment plays a crucial role in computer-assisted pronunciation training systems. Due to the ability to perform multiple pronunciation tasks simultaneously, multi-aspect multi-granularity pronunciation assessment methods are gradually receiving more attention and achieving better performance than single-level modeling tasks. However, existing methods only consider unidirectional dependencies between adjacent granularity levels, lacking bidirectional interaction among phoneme, word, and utterance levels and thus insufficiently capturing the acoustic structural correlations. To address this issue, we propose a novel residual hierarchical interactive method, HIA for short, that enables bidirectional modeling across granularities. As the core of HIA, the Interactive Attention Module leverages an attention mechanism to achieve dynamic bidirectional interaction, effectively capturing linguistic features at each granularity while integrating correlations between different granularity levels. We also propose a residual hierarchical structure to alleviate the feature forgetting problem when modeling acoustic hierarchies. In addition, we use 1-D convolutional layers to enhance the extraction of local contextual cues at each granularity. Extensive experiments on the speechocean762 dataset show that our model is comprehensively ahead of the existing state-of-the-art methods.
[46] Can LLMs Track Their Output Length? A Dynamic Feedback Mechanism for Precise Length Regulation
Meiman Xiao,Ante Wang,Qingguo Hu,Zhongjian Miao,Huangjun Shen,Longyue Wang,Weihua Luo,Jinsong Su
Main category: cs.CL
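TL;DR: This paper proposes a training-free dynamic length feedback method that improves how precisely LLMs hit a target length (in tokens, words, or sentences) during generation, and validates it on summarization and biography generation tasks.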
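One way to picture dynamic length feedback is a chunked generation loop that tells the model, before each chunk, how much it has written and how much remains. The sketch below is only that: `generate_chunk` is a hypothetical callable standing in for an LLM call, the bracketed feedback string is invented, and counting words by whitespace is an assumption. It is not the mechanism described in the paper.

```python
def generate_with_length_feedback(generate_chunk, prompt, target_words, chunk_words=20):
    """Generate text in chunks, reporting the remaining word budget each time.

    `generate_chunk(context, max_words)` is a hypothetical stand-in for an
    LLM call that returns at most `max_words` words of new text.
    """
    words = []
    while len(words) < target_words:
        remaining = target_words - len(words)
        # Feedback line inserted into the context before every chunk.
        feedback = f"[{len(words)} words written, {remaining} remaining]"
        context = f"{prompt}\n{' '.join(words)}\n{feedback}"
        new_words = generate_chunk(context, min(chunk_words, remaining)).split()
        if not new_words:  # the model stopped early; avoid looping forever
            break
        words.extend(new_words[:remaining])  # never exceed the target
    return " ".join(words)
```

With a real model, the same loop could count tokens or sentences instead of words by swapping the counting step.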
Details
Motivation: LLMs follow text-length constraints poorly, mainly because they cannot accurately measure the length of input text, so their outputs drift from the target length.
Method: The approach regulates generation with dynamic length feedback, adjusting adaptively to the real-time length gap so that the output meets the specified length. The authors also explore adding supervised fine-tuning to improve generalization.
Result: On summarization and biography generation, the method significantly improves length-control precision and matches target token, word, or sentence counts without sacrificing quality. With supervised fine-tuning it generalizes to broader text-generation tasks.
Conclusion: The method addresses LLMs' weakness in length control and offers a flexible, efficient solution for a range of practical applications.
Abstract: Precisely controlling the length of generated text is a common requirement in real-world applications. However, despite significant advancements in following human instructions, Large Language Models (LLMs) still struggle with this task. In this work, we demonstrate that LLMs often fail to accurately measure input text length, leading to poor adherence to length constraints. To address this issue, we propose a novel length regulation approach that incorporates dynamic length feedback during generation, enabling adaptive adjustments to meet target lengths. Experiments on summarization and biography tasks show our training-free approach significantly improves precision in achieving target token, word, or sentence counts without compromising quality. Additionally, we demonstrate that further supervised fine-tuning allows our method to generalize effectively to broader text-generation tasks.
[47] BanglaIPA: Towards Robust Text-to-IPA Transcription with Contextual Rewriting in Bengali
Jakir Hasan,Shrestha Datta,Md Saiful Islam,Shubhashis Roy Dipta,Ameya Debnath
Main category: cs.CL
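TL;DR: This paper proposes BanglaIPA, a new International Phonetic Alphabet (IPA) transcription system for Bengali that combines a character-based vocabulary with word-level alignment, handles standard language, regional dialects, and numerical expressions, and significantly outperforms existing methods.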
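The BanglaIPA abstract mentions a precomputed word-to-IPA dictionary that speeds up inference for previously observed words. A dictionary-first lookup with a model fallback could be sketched as follows; the function names, the whitespace tokenization, and the choice to cache new model outputs back into the dictionary are assumptions for illustration, not details taken from the paper.

```python
def transcribe_text(text, word_to_ipa, model_transcribe):
    """Transcribe word by word, consulting the dictionary before the model.

    `word_to_ipa` is a dict of known transcriptions; `model_transcribe(word)`
    is a hypothetical stand-in for the neural transcription model.
    """
    output = []
    for word in text.split():
        ipa = word_to_ipa.get(word)
        if ipa is None:
            ipa = model_transcribe(word)  # slow path for unseen words
            word_to_ipa[word] = ipa       # assumption: cache for later reuse
        output.append(ipa)
    return " ".join(output)
```

Repeated words then cost a single dictionary lookup, which is where the efficiency gain would come from in this sketch.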
Details
Motivation: Existing Bengali IPA transcription systems handle regional variations, numerical expressions, and out-of-vocabulary words poorly and lack robustness.
Method: BanglaIPA uses a character-based vocabulary with a word-level alignment mechanism and a precomputed word-to-IPA mapping dictionary that improves inference efficiency.
Result: Evaluated on standard Bengali and six regional variations of the DUAL-IPA dataset, BanglaIPA outperforms baseline models by 58.4-78.7% and achieves an overall mean word error rate of 11.4%.
Conclusion: BanglaIPA is more robust and accurate for IPA transcription of Bengali and its dialects, with particular strengths in numeral handling and generalization to unseen words.
Abstract: Despite its widespread use, Bengali lacks a robust automated International Phonetic Alphabet (IPA) transcription system that effectively supports both standard language and regional dialectal texts. Existing approaches struggle to handle regional variations and numerical expressions, and generalize poorly to previously unseen words. To address these limitations, we propose BanglaIPA, a novel IPA generation system that integrates a character-based vocabulary with word-level alignment. The proposed system accurately handles Bengali numerals and demonstrates strong performance across regional dialects. BanglaIPA improves inference efficiency by leveraging a precomputed word-to-IPA mapping dictionary for previously observed words. The system is evaluated on the standard Bengali and six regional variations of the DUAL-IPA dataset. Experimental results show that BanglaIPA outperforms baseline IPA transcription models by 58.4-78.7% and achieves an overall mean word error rate of 11.4%, highlighting its robustness in phonetic transcription generation for the Bengali language.
[48] CSCBench: A PVC Diagnostic Benchmark for Commodity Supply Chain Reasoning
Yaxin Cui,Yuanqiang Zeng,Jiapeng Yan,Keling Lin,Kai Ji,Jianhui Zeng,Sheng Zhang,Xin Luo,Binzhu Su,Chaolai Shen,Jiahao Yu
Main category: cs.CL
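TL;DR: This paper introduces CSCBench, a benchmark of more than 2,300 single-choice questions for evaluating LLM reasoning in commodity supply chains (CSCs), a domain shaped by institutional rules and feasibility constraints. Tasks are built with the PVC 3D Evaluation Framework (Process, Variety, and Cognition), and existing LLMs perform poorly on variety-specific rules, especially Freight Agreements.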
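Because every question carries a label on each of the three PVC axes, results can be sliced per axis to see where accuracy drops. The sketch below shows one plausible way to compute that breakdown; the record keys (`process`, `variety`, `cognition`, `predicted`, `answer`) are hypothetical and do not come from the CSCBench release.

```python
from collections import defaultdict


def accuracy_by_axis(records, axis):
    """Group single-choice results by one axis label and compute accuracy.

    Each record is a dict with hypothetical keys: "process", "variety",
    "cognition", "predicted", and "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        label = record[axis]
        total[label] += 1
        correct[label] += int(record["predicted"] == record["answer"])
    return {label: correct[label] / total[label] for label in total}
```

Calling it once per axis gives three small tables, which is enough to spot a category that lags the rest.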
Details
Motivation: LLMs do well on general benchmarks, but their suitability for commodity supply chains, a high-stakes and rule-dense domain, is unclear, so a dedicated evaluation tool is needed to diagnose and improve their practical competence.
Method: The paper proposes the CSCBench benchmark and the PVC 3D Evaluation Framework. The Process axis aligns with SCOR+Enable stages, the Variety axis builds commodity-specific constraints from exchange rulebooks and industry reports, and the Cognition axis follows Bloom's revised taxonomy. Representative LLMs are evaluated under direct prompting.
Result: LLMs perform well on the Process and Cognition axes but degrade substantially on the Variety axis, especially on Freight Agreements, indicating difficulty in understanding and applying commodity-specific institutional rules.
Conclusion: CSCBench is a diagnostic tool for measuring and improving LLM capabilities in complex commodity supply chain settings, and it highlights the limits of current models in handling real-world rule systems.
Abstract: Large Language Models (LLMs) have achieved remarkable success in general benchmarks, yet their competence in commodity supply chains (CSCs), a domain governed by institutional rule systems and feasibility constraints, remains under-explored. CSC decisions are shaped jointly by process stages (e.g., planning, procurement, delivery), variety-specific rules (e.g., contract specifications and delivery grades), and reasoning depth (from retrieval to multi-step analysis and decision selection). We introduce CSCBench, a 2.3K+ single-choice benchmark for CSC reasoning, instantiated through our PVC 3D Evaluation Framework (Process, Variety, and Cognition). The Process axis aligns tasks with SCOR+Enable; the Variety axis operationalizes commodity-specific rule systems under coupled material-information-financial constraints, grounded in authoritative exchange guidebooks/rulebooks and industry reports; and the Cognition axis follows Bloom's revised taxonomy. Evaluating representative LLMs under a direct prompting setting, we observe strong performance on the Process and Cognition axes but substantial degradation on the Variety axis, especially on Freight Agreements. CSCBench provides a diagnostic yardstick for measuring and improving LLM capabilities in this high-stakes domain.
[49] Aspect Extraction from E-Commerce Product and Service Reviews
Valiant Lance D. Dionela,Fatima Kriselle S. Dy,Robin James M. Hombrebueno,Aaron Rae M. Nicolas,Charibeth K. Cheng,Raphael W. Gonda
Main category: cs.CL
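TL;DR: This paper proposes a comprehensive aspect extraction (AE) pipeline for low-resource, code-switched settings such as Taglish, combining rule-based, LLM-based, and fine-tuning techniques, together with a Hierarchical Aspect Framework and a dual-mode tagging scheme. Experiments show that the generative LLM performs best, particularly on implicit aspects.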
Details
Motivation: Scarce resources and the complexity of code-switching limit existing aspect extraction methods in low-resource contexts such as Taglish, so an adaptable and scalable approach is needed.
Method: The AE pipeline comprises a rule-based system, a generative LLM (Gemini 2.0 Flash), and two Gemma-3 1B models fine-tuned on differently annotated data. The authors also build a Hierarchical Aspect Framework (HAF) and a dual-mode tagging scheme for explicit and implicit aspects.
Result: The generative LLM performs best across all tasks (Macro F1 0.91) and is especially good at identifying implicit aspects, while the fine-tuned models perform worse because of dataset imbalance and limited model capacity.
Conclusion: The study shows the advantage of generative LLMs for aspect extraction in low-resource, code-switched text. The proposed framework is linguistically adaptive and scalable, and helps improve ABSA in such settings.
Abstract: Aspect Extraction (AE) is a key task in Aspect-Based Sentiment Analysis (ABSA), yet it remains difficult to apply in low-resource and code-switched contexts like Taglish, a mix of Tagalog and English commonly used in Filipino e-commerce reviews. This paper introduces a comprehensive AE pipeline designed for Taglish, combining rule-based, large language model (LLM)-based, and fine-tuning techniques to address both aspect identification and extraction. A Hierarchical Aspect Framework (HAF) is developed through multi-method topic modeling, along with a dual-mode tagging scheme for explicit and implicit aspects. For aspect identification, four distinct models are evaluated: a Rule-Based system, a Generative LLM (Gemini 2.0 Flash), and two Fine-Tuned Gemma-3 1B models trained on different datasets (Rule-Based vs. LLM-Annotated). Results indicate that the Generative LLM achieved the highest performance across all tasks (Macro F1 0.91), demonstrating superior capability in handling implicit aspects. In contrast, the fine-tuned models exhibited limited performance due to dataset imbalance and architectural capacity constraints. This work contributes a scalable and linguistically adaptive framework for enhancing ABSA in diverse, code-switched environments.
[50] Emergent Introspective Awareness in Large Language Models
Jack Lindsey
Main category: cs.CL
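TL;DR: This study examines whether large language models can introspect on their internal states by injecting known concepts into model activations and observing how self-reports change. Some models (notably Claude Opus 4 and 4.1) can, in certain scenarios, identify injected concepts, recall prior intentions, and distinguish their own outputs from artificial prefills, showing some functional introspective awareness, though the capacity remains unreliable and context-dependent.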
Details
Motivation: To determine whether large language models can genuinely introspect on their internal states rather than confabulate answers in conversation, given that conversation alone cannot separate genuine introspection from fabrication.
Method: Representations of known concepts are injected into a model's activations, and the effect of these manipulations on the model's self-reported states is measured. The study also tests whether models can recall prior internal representations, distinguish input sources, and modulate internal states when prompted.
Result: Some models notice injected concepts and identify them accurately, recall prior intentions, and distinguish their own outputs from prefilled text. Claude Opus 4 and 4.1 generally show the greatest introspective awareness, and models can modulate their activations when instructed.
Conclusion: Current language models have some functional introspective awareness, but the capacity is immature, unreliable, and sensitive to post-training strategies; it may develop further as model capabilities improve.
Abstract: We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to "think about" a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states.
We stress that in today's models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
[51] Towards Automated Lexicography: Generating and Evaluating Definitions for Learner's Dictionaries
Yusuke Ide,Adam Nohejl,Joshua Tanner,Hitomi Yanaka,Christopher Lindsay,Taro Watanabe
Main category: cs.CL
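TL;DR: This paper studies learner's dictionary definition generation (LDDG), proposing an LLM-based iterative simplification method along with a new evaluation approach and a Japanese dataset.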
Details
Motivation: Dictionary definitions are essential for learning word senses, but writing them by hand is costly, which motivates automatic generation, especially of simple definitions aimed at learners.
Method: The authors introduce an automatic evaluation approach based on new criteria and an LLM-as-a-judge, build a Japanese dataset of reference definitions in collaboration with a professional lexicographer, and generate definitions through iterative simplification with an LLM.
Result: The evaluation approach agrees reasonably well with human annotators, and the generated definitions score highly on the criteria while keeping vocabulary simple.
Conclusion: The method produces high-quality, simple learner's dictionary definitions and comes with a reliable automatic evaluation scheme.
Abstract: We study dictionary definition generation (DDG), i.e., the generation of non-contextualized definitions for given headwords. Dictionary definitions are an essential resource for learning word senses, but manually creating them is costly, which motivates us to automate the process. Specifically, we address learner's dictionary definition generation (LDDG), where definitions should consist of simple words. First, we introduce a reliable evaluation approach for DDG, based on our new evaluation criteria and powered by an LLM-as-a-judge. To provide reference definitions for the evaluation, we also construct a Japanese dataset in collaboration with a professional lexicographer. Validation results demonstrate that our evaluation approach agrees reasonably well with human annotators. Second, we propose an LDDG approach via iterative simplification with an LLM. Experimental results indicate that definitions generated by our approach achieve high scores on our criteria while maintaining lexical simplicity.
[52] Judging with Personality and Confidence: A Study on Personality-Conditioned LLM Relevance Assessment
Nuo Chen,Hanpei Fang,Piaohong Wang,Jiqun Liu,Tetsuya Sakai,Xiao-Ming Wu
Main category: cs.CL
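TL;DR: This study examines how prompting large language models (LLMs) to simulate Big Five personality traits affects their web search relevance judgments and confidence calibration. It finds that certain personalities (such as low agreeableness and low conscientiousness) align more closely with human labels and reduce confidence bias, and that feeding personality-conditioned scores and confidence into a random forest classifier outperforms the best single-personality condition on a new dataset.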
Details
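Motivation: Little is known about how prompt-induced personalities affect LLM relevance judgments and confidence calibration (overconfidence or underconfidence), even though psychology suggests these biases are linked to personality traits; the role of personality prompting in information retrieval tasks therefore needs study.
Method: Multiple commercial and open-source LLMs are prompted to simulate Big Five personality traits and evaluated on three test collections (TREC DL 2019, TREC DL 2020, and LLMJudge). For each query-document pair, a relevance score and a self-reported confidence score are collected, differences across personalities are analyzed, and the personality-conditioned scores and confidence are finally used as features to train a random forest classifier.
Result: Models prompted with low agreeableness judge more closely to human labels; low conscientiousness does better at suppressing both overconfidence and underconfidence; relevance scores and confidence distributions differ systematically across personalities; the random forest combining personality-conditioned scores and confidence outperforms single-personality conditions on TREC DL 2021.
Conclusion: Personality prompting can effectively modulate LLM behavior and confidence calibration in retrieval evaluation, and personality-derived confidence signals can make model predictions more reliable, offering a new path toward LLM evaluators that better match human judgment.
Abstract: Recent studies have shown that prompting can enable large language models (LLMs) to simulate specific personality traits and produce behaviors that align with those traits. However, there is limited understanding of how these simulated personalities influence critical web search decisions, specifically relevance assessment. Moreover, few studies have examined how simulated personalities impact confidence calibration, specifically the tendencies toward overconfidence or underconfidence. This gap exists even though psychological literature suggests these biases are trait-specific, often linking high extraversion to overconfidence and high neuroticism to underconfidence. To address this gap, we conducted a comprehensive study evaluating multiple LLMs, including commercial models and open-source models, prompted to simulate Big Five personality traits. We tested these models across three test collections (TREC DL 2019, TREC DL 2020, and LLMJudge), collecting two key outputs for each query-document pair: a relevance judgment and a self-reported confidence score. The findings show that personalities such as low agreeableness consistently align more closely with human labels than the unprompted condition. Additionally, low conscientiousness performs well in balancing the suppression of both overconfidence and underconfidence. We also observe that relevance scores and confidence distributions vary systematically across different personalities.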
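As a rough illustration of the over- and underconfidence notions used in this abstract, the Python sketch below compares a judge's mean self-reported confidence with its actual agreement with human labels. The function name `confidence_gap`, the 0-to-1 confidence scale, and the toy data are assumptions made for illustration; they are not taken from the paper, which may measure calibration differently.

```python
from statistics import mean


def confidence_gap(predictions, human_labels, confidences):
    """Return mean confidence minus accuracy.

    Positive values suggest overconfidence, negative values suggest
    underconfidence. Confidences are assumed to lie in [0, 1].
    """
    if not (len(predictions) == len(human_labels) == len(confidences)):
        raise ValueError("inputs must have equal length")
    if not predictions:
        raise ValueError("inputs must be non-empty")
    # Accuracy is the share of judgments that match the human label exactly.
    accuracy = mean(1.0 if p == h else 0.0 for p, h in zip(predictions, human_labels))
    return mean(confidences) - accuracy


# Hypothetical data: four query-document pairs, two judged correctly.
preds = [2, 0, 1, 3]
labels = [2, 1, 1, 0]
confs = [0.9, 0.8, 0.7, 0.8]
gap = confidence_gap(preds, labels, confs)  # mean confidence 0.8 vs accuracy 0.5
```

Computing this gap separately for each simulated personality would give one simple way to compare how strongly each condition leans toward over- or underconfidence.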
Based on the above findings, we incorporate personality-conditioned scores and confidence as features in a random forest classifier. This approach achieves performance that surpasses the best single-personality condition on a new dataset (TREC DL 2021), even with limited training data. These findings highlight that personality-derived confidence offers a complementary predictive signal, paving the way for more reliable and human-aligned LLM evaluators.
[53] DermoGPT: Open Weights and Open Data for Morphology-Grounded Dermatological Reasoning MLLMs
Jinghan Ru,Siyuan Yan,Yuguo Yin,Yuexian Zou,Zongyuan Ge
Main category: cs.CL
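TL;DR: This paper presents DermoGPT, a multimodal large language model framework for dermatological diagnosis. By building the large-scale morphology-anchored instruction dataset DermoInstruct and the comprehensive benchmark DermoBench, and combining the morphology-anchored reinforcement learning objective MAVIC with the test-time adaptation method CCT, it substantially improves performance across multiple clinical axes and narrows the human-AI diagnostic gap.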
Details
Motivation: Existing multimodal large language models in dermatology are limited by insufficient training data, narrow task coverage, and a lack of supervision that mirrors expert diagnostic workflows, so a more comprehensive, clinically grounded framework is needed.
Method: First, build DermoInstruct, a large-scale morphology-anchored instruction dataset with about 210,000 images and 770,000 trajectories; second, establish DermoBench, a benchmark spanning four clinical axes; then train DermoGPT with supervised fine-tuning followed by the Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning objective, and apply Confidence-Consistency Test-time adaptation (CCT) at inference to make predictions more robust.
Result: DermoGPT significantly outperforms 16 baselines on all clinical axes, reaches state-of-the-art performance, and substantially narrows the gap between AI and human experts, notably on the 3,600 expert-verified open-ended questions.
Conclusion: The DermoGPT framework addresses key bottlenecks in developing multimodal models for dermatology and offers a scalable, interpretable, and highly consistent approach for clinically oriented AI diagnosis; the resources will be released publicly to support follow-up research.
Abstract: Multimodal Large Language Models (MLLMs) show promise for medical applications, yet progress in dermatology lags due to limited training data, narrow task coverage, and lack of clinically-grounded supervision that mirrors expert diagnostic workflows. We present a comprehensive framework to address these gaps. First, we introduce DermoInstruct, a large-scale morphology-anchored instruction corpus comprising 211,243 images and 772,675 trajectories across five task formats, capturing the complete diagnostic pipeline from morphological observation and clinical reasoning to final diagnosis. Second, we establish DermoBench, a rigorous benchmark evaluating 11 tasks across four clinical axes: Morphology, Diagnosis, Reasoning, and Fairness, including a challenging subset of 3,600 expert-verified open-ended instances and human performance baselines. Third, we develop DermoGPT, a dermatology reasoning MLLM trained via supervised fine-tuning followed by our Morphologically-Anchored Visual-Inference-Consistent (MAVIC) reinforcement learning objective, which enforces consistency between visual observations and diagnostic conclusions. At inference, we deploy Confidence-Consistency Test-time adaptation (CCT) for robust predictions. Experiments show DermoGPT significantly outperforms 16 representative baselines across all axes, achieving state-of-the-art performance while substantially narrowing the human-AI gap.
DermoInstruct, DermoBench and DermoGPT will be made publicly available at https://github.com/mendicant04/DermoGPT upon acceptance.
[54] Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents
Yi Yu,Liuyi Yao,Yuexiang Xie,Qingquan Tan,Jiaqi Feng,Yaliang Li,Libing Wu
Main category: cs.CL
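TL;DR: This paper proposes Agentic Memory (AgeMem), a framework that integrates long-term and short-term memory management directly into an LLM agent's policy. With memory operations exposed as tools and a progressive reinforcement learning scheme, it markedly improves task performance, memory quality, and context efficiency on long-horizon tasks.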
Details
Motivation: Existing methods handle long-term and short-term memory separately and rely on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization and makes it hard to cope with the finite context windows that constrain LLM agents in long-horizon reasoning.
Method: Proposes the AgeMem framework, which exposes memory operations (store, retrieve, update, summarize, discard) to the agent as tool actions, and designs a three-stage progressive reinforcement learning strategy with a step-wise GRPO algorithm to handle sparse and discontinuous rewards, enabling end-to-end joint training of memory and decision-making.
Result: Experiments on five long-horizon benchmarks show that AgeMem outperforms strong baselines across multiple LLM backbones, improving task performance and long-term memory quality while using context more efficiently.
Conclusion: By making memory management part of the agent's policy, AgeMem provides a more flexible, learnable memory control mechanism and an effective path toward agents capable of sustained reasoning.
Abstract: Large language model (LLM) agents face fundamental limitations in long-horizon reasoning due to finite context windows, making effective memory management critical. Existing methods typically handle long-term memory (LTM) and short-term memory (STM) as separate components, relying on heuristics or auxiliary controllers, which limits adaptability and end-to-end optimization. In this paper, we propose Agentic Memory (AgeMem), a unified framework that integrates LTM and STM management directly into the agent's policy. AgeMem exposes memory operations as tool-based actions, enabling the LLM agent to autonomously decide what and when to store, retrieve, update, summarize, or discard information. To train such unified behaviors, we propose a three-stage progressive reinforcement learning strategy and design a step-wise GRPO to address sparse and discontinuous rewards induced by memory operations. Experiments on five long-horizon benchmarks demonstrate that AgeMem consistently outperforms strong memory-augmented baselines across multiple LLM backbones, achieving improved task performance, higher-quality long-term memory, and more efficient context usage.
[55] Tackling the Inherent Difficulty of Noise Filtering in RAG
Jingyu Liu,Jiaen Lin,Yong Liu
Main category: cs.CL
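TL;DR: Proposes a new fine-tuning method that strengthens a large language model's ability to distinguish relevant from irrelevant information in retrieval-augmented generation (RAG), improving robustness and performance.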
Details
Motivation: Retrieved content in RAG often contains noisy or irrelevant documents, which degrades performance and can even cause hallucinations, while conventional fine-tuning struggles to make the model ignore irrelevant information.
Method: Designs a novel fine-tuning method that overcomes the structural constraints of attention and strengthens the model's ability to distinguish relevant from irrelevant information in retrieved documents.
Result: Extensive experiments on multiple benchmarks show that the method significantly improves LLM robustness and performance.
Conclusion: The proposed method strengthens an LLM's ability to select information in the presence of noisy retrieved content and outperforms standard fine-tuning.
Abstract: Retrieval-Augmented Generation (RAG) has become a widely adopted approach to enhance Large Language Models (LLMs) by incorporating external knowledge and reducing hallucinations. However, noisy or irrelevant documents are often introduced during RAG, potentially degrading performance and even causing hallucinated outputs. While various methods have been proposed to filter out such noise, we argue that identifying irrelevant information from retrieved content is inherently difficult and that a limited number of transformer layers can hardly solve it. Consequently, retrievers fail to filter out irrelevant documents entirely. Therefore, LLMs must be robust against such noise, but we demonstrate that standard fine-tuning approaches are often ineffective in enabling the model to selectively utilize relevant information while ignoring irrelevant content due to the structural constraints of attention patterns. To address this, we propose a novel fine-tuning method designed to enhance the model's ability to distinguish between relevant and irrelevant information within retrieved documents. Extensive experiments across multiple benchmarks show that our approach significantly improves the robustness and performance of LLMs.
[56] CSF: Contrastive Semantic Features for Direct Multilingual Sign Language Generation
Tran Sy Bao
Main category: cs.CL
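TL;DR: Proposes CSF, a language-agnostic semantic representation framework that enables direct translation from any language to sign language without going through English.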
Details
Motivation: Existing sign language translation systems usually rely on English as an intermediate language, which creates barriers for non-English speakers in the global deaf community.
Method: Designs the Canonical Semantic Form (CSF), which decomposes utterances into nine universal semantic slots, builds a condition taxonomy with 35 condition types, and trains a lightweight transformer-based extractor to handle multilingual input.
Result: Across four typologically diverse languages (English, Vietnamese, Japanese, French), average slot extraction accuracy reaches 99.03%, condition classification accuracy reaches 99.4%, and inference latency is only 3.02 ms on CPU, supporting real-time sign language generation in the browser.
Conclusion: The CSF framework enables direct multilingual-to-sign-language translation without English mediation, combining high accuracy, low latency, and cross-lingual applicability, and helps make sign language technology more accessible.
Abstract: Sign language translation systems typically require English as an intermediary language, creating barriers for non-English speakers in the global deaf community. We present Canonical Semantic Form (CSF), a language-agnostic semantic representation framework that enables direct translation from any source language to sign language without English mediation. CSF decomposes utterances into nine universal semantic slots: event, intent, time, condition, agent, object, location, purpose, and modifier. A key contribution is our comprehensive condition taxonomy comprising 35 condition types across eight semantic categories, enabling nuanced representation of conditional expressions common in everyday communication. We train a lightweight transformer-based extractor (0.74 MB) that achieves 99.03% average slot extraction accuracy across four typologically diverse languages: English, Vietnamese, Japanese, and French. The model demonstrates particularly strong performance on condition classification (99.4% accuracy) despite the 35-class complexity. With inference latency of 3.02 ms on CPU, our approach enables real-time sign language generation in browser-based applications. We release our code, trained models, and multilingual dataset to support further research in accessible sign language technology.
[57] Hidden State Poisoning Attacks against Mamba-based Language Models
Alexandre Le Mercier,Chris Develder,Thomas Demeester
Main category: cs.CL
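TL;DR: This paper studies Hidden State Poisoning Attacks (HiSPA) on state space models such as Mamba, in which specific short input phrases cause the model to partially forget information. It exposes their weak adversarial robustness and introduces the RoBench25 benchmark along with potential mitigation directions.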
Details
Motivation: State space models (SSMs) are more efficient than Transformers, but their adversarial robustness has not been studied in depth; attacks on the hidden state in particular can cause severe information loss.
Method: Introduces the HiSPA attack, in which specific short input phrases trigger irreversible overwriting of the hidden state, builds the RoBench25 benchmark to evaluate robustness on information retrieval tasks, and carries out an interpretability analysis to identify patterns in Mamba's hidden layers under attack.
Result: SSMs and hybrid models, including Jamba, collapse on RoBench25 while pure Transformers are unaffected; HiSPA also weakens Jamba on Open-Prompt-Injections.
Conclusion: SSMs have a serious hidden-state security vulnerability that calls for new defense mechanisms, and future work should focus on making such models more robust.
Abstract: State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
[58] Surprisal and Metaphor Novelty: Moderate Correlations and Divergent Scaling Effects
Omar Momen,Emilie Sitter,Berenike Herrmann,Sina Zarrieß
Main category: cs.CL
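TL;DR: This study examines whether language model surprisal correlates with metaphor novelty. Results differ across datasets and the effect of model size runs in opposite directions, suggesting that surprisal remains a limited measure of linguistic creativity.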
Details
Motivation: To examine whether language models can capture the semantic complexity and linguistic creativity involved in metaphor comprehension, in particular whether surprisal, a measure of predictability, reflects metaphor novelty.
Method: Analyzes surprisal from 16 language model variants on corpus-based and synthetic metaphor novelty datasets, and computes surprisal with a cloze-style method that conditions on full-sentence context.
Result: LM surprisal shows significant moderate correlations with metaphor novelty scores and labels; on corpus-based data the correlation weakens as model size grows (inverse scaling), whereas on synthetic data it strengthens (Quality-Power Hypothesis).
Conclusion: Although surprisal partially accounts for metaphor novelty annotations, it remains a limited measure of linguistic creativity.
Abstract: Novel metaphor comprehension involves complex semantic processes and linguistic creativity, making it an interesting task for studying language models (LMs). This study investigates whether surprisal, a probabilistic measure of predictability in LMs, correlates with different metaphor novelty datasets. We analyse surprisal from 16 LM variants on corpus-based and synthetic metaphor novelty datasets. We explore a cloze-style surprisal method that conditions on full-sentence context. Results show that LMs yield significant moderate correlations with scores/labels of metaphor novelty. We further identify divergent scaling patterns: on corpus-based data, correlation strength decreases with model size (inverse scaling effect), whereas on synthetic data it increases (Quality-Power Hypothesis). We conclude that while surprisal can partially account for annotations of metaphor novelty, it remains a limited metric of linguistic creativity.
[59] Not All Needles Are Found: How Fact Distribution and Don't Make It Up Prompts Shape Literal Extraction, Logical Inference, and Hallucination Risks in Long-Context LLMs
Amirali Ebrahimzadeh,Seyyed M. Salili
Main category: cs.CL
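TL;DR: Using an extended needle-in-a-haystack benchmark, this study evaluates information extraction, logical inference, and hallucination risk in four large language models over long contexts. It finds that longer context does not guarantee better performance and that model behavior varies with evidence distribution and anti-hallucination instructions.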
Details
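Motivation: LLMs support ever longer input contexts, but how reliably they extract and infer information at scale is unclear, and performance depends on context length and on how information is distributed in real-world corpora; a systematic study of fact placement, corpus-level fact distributions, and anti-hallucination prompts is therefore needed.
Method: Introduces an extended needle-in-a-haystack benchmark covering four production-scale models (Gemini-2.5-flash, ChatGPT-5-mini, Claude-4.5-haiku, and Deepseek-v3.2-chat), evaluates literal extraction, logical inference, and hallucination risk separately, and examines the effects of fact position, evidence distribution, and anti-hallucination prompts.
Result: Longer context does not necessarily improve performance and can hurt when key evidence is diluted or dispersed; models differ substantially, with some degrading severely under realistic conditions and others remaining more robust at long context lengths; anti-hallucination instructions can make models overly conservative and sharply reduce extraction and inference accuracy; many failures stem from ineffective context utilization, with models struggling to identify and prioritize relevant information.
Conclusion: Effective context length and model-specific robustness to long contexts are key to reliable LLM deployment, especially in enterprise applications that process large volumes of unfiltered documents, where model choice and prompt design need attention.
Abstract: Large language models (LLMs) increasingly support very long input contexts. Yet it remains unclear how reliably they extract and infer information at scale. Performance varies with context length and strongly interacts with how information is distributed in real-world corpora. Motivated by these observations, we study how fact placement, corpus-level fact distributions, and Don't Make It Up prompts influence model behavior. We introduce an extended needle-in-a-haystack benchmark across four production-scale models: Gemini-2.5-flash, ChatGPT-5-mini, Claude-4.5-haiku, and Deepseek-v3.2-chat. Unlike prior work, we separately evaluate literal extraction, logical inference, and hallucination risk. Our study considers both positional effects and realistic distributions of evidence across long contexts, as well as prompts that explicitly discourage fabrication. We find that longer contexts alone do not guarantee better performance and can be detrimental when relevant evidence is diluted or widely dispersed. Performance varies substantially across models: some show severe degradation under realistic conditions, while others remain more robust at longer context lengths. Anti-hallucination (AH) instructions can make some models overly conservative, sharply reducing accuracy in literal extraction and logical inference. While we do not directly compare retrieval-augmented generation (RAG) and cache-augmented generation (CAG), our results suggest many failures stem from ineffective context utilization.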
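To make the needle-in-a-haystack setup more concrete, the following Python sketch shows one simple way to place a fact at a chosen relative depth in filler text, and to scatter several facts to mimic dispersed evidence. The function names `build_haystack` and `scatter_needles` and the sentence-level granularity are assumptions for illustration only; the benchmark described in the abstract may construct its contexts differently.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at relative position `depth` (0 = start, 1 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences)


def scatter_needles(filler_sentences, needles, depths):
    """Place several facts at different depths to mimic dispersed evidence."""
    sentences = list(filler_sentences)
    # Insert from deepest to shallowest so that earlier insertions
    # do not shift the target index of later ones.
    ordered = sorted(zip(needles, depths), key=lambda pair: pair[1], reverse=True)
    for needle, depth in ordered:
        if not 0.0 <= depth <= 1.0:
            raise ValueError("depth must be between 0 and 1")
        sentences.insert(round(depth * len(filler_sentences)), needle)
    return " ".join(sentences)


filler = ["a.", "b.", "c.", "d."]
single = build_haystack(filler, "NEEDLE.", 0.5)
spread = scatter_needles(filler, ["N1.", "N2."], [0.0, 1.0])
```

Sweeping `depth` and the number of scattered facts while holding the question fixed is one way to separate positional effects from dilution effects.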
Models often struggle to identify and prioritize relevant information even when it is present. These findings have direct practical implications, as enterprise workflows increasingly involve pasting large volumes of unfiltered documents into LLM prompts. Effective context length and model-specific robustness to long contexts are therefore critical for reliable LLM deployment in research and business.
[60] Cost-Efficient Cross-Lingual Retrieval-Augmented Generation for Low-Resource Languages: A Case Study in Bengali Agricultural Advisory
Md. Asif Hossain,Nabil Subhan,Mantasha Rahman Mahi,Jannatul Ferdous Nabila
Main category: cs.CL
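TL;DR: Proposes a low-cost, translation-based cross-lingual retrieval-augmented generation framework for Bengali agricultural advisory, built entirely on open-source models and running on consumer-grade hardware.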
Details
Motivation: Authoritative agricultural manuals are mostly written in English, while farmers in developing countries mainly use low-resource local languages such as Bengali, creating a language barrier; existing LLMs generate poorly in low-resource languages, and cloud-based solutions are costly.
Method: Adopts a translation-centric architecture: Bengali queries are translated into English, enriched through domain-specific keyword injection, and answered via dense vector retrieval over English agricultural manuals; the English answer is then translated back into Bengali. The whole system is built on open-source models.
Result: The system produces reliable, source-grounded answers, rejects out-of-domain queries, and has an average end-to-end latency below 20 seconds.
Conclusion: Combining cross-lingual retrieval with controlled translation offers a practical, scalable solution for agricultural knowledge access in low-resource language settings.
Abstract: Access to reliable agricultural advisory remains limited in many developing regions due to a persistent language barrier: authoritative agricultural manuals are predominantly written in English, while farmers primarily communicate in low-resource local languages such as Bengali. Although recent advances in Large Language Models (LLMs) enable natural language interaction, direct generation in low-resource languages often exhibits poor fluency and factual inconsistency, while cloud-based solutions remain cost-prohibitive. This paper presents a cost-efficient, cross-lingual Retrieval-Augmented Generation (RAG) framework for Bengali agricultural advisory that emphasizes factual grounding and practical deployability. The proposed system adopts a translation-centric architecture in which Bengali user queries are translated into English, enriched through domain-specific keyword injection to align colloquial farmer terminology with scientific nomenclature, and answered via dense vector retrieval over a curated corpus of English agricultural manuals (FAO, IRRI). The generated English response is subsequently translated back into Bengali to ensure accessibility. The system is implemented entirely using open-source models and operates on consumer-grade hardware without reliance on paid APIs. Experimental evaluation demonstrates reliable source-grounded responses, robust rejection of out-of-domain queries, and an average end-to-end latency below 20 seconds.
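The translation-centric flow described in this abstract can be sketched roughly as follows in Python. Everything here is a hypothetical outline: the function names, the glossary format, the out-of-domain message, and the rule that an empty retrieval result means rejection are assumptions made for illustration, and the translation, retrieval, and generation components are passed in as plain callables rather than tied to any particular model or library.

```python
from typing import Callable, Dict, List

# Hypothetical fallback text; the paper's actual rejection behavior is not specified here.
OUT_OF_DOMAIN_MESSAGE = "This question is outside the scope of the available manuals."


def inject_keywords(query_en: str, glossary: Dict[str, str]) -> str:
    """Append scientific terms for colloquial terms that appear in the query."""
    lowered = query_en.lower()
    extras = [sci for colloquial, sci in glossary.items() if colloquial in lowered]
    if not extras:
        return query_en
    return query_en + " (" + "; ".join(extras) + ")"


def answer_query(
    query_bn: str,
    translate_bn_en: Callable[[str], str],
    translate_en_bn: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    glossary: Dict[str, str],
) -> str:
    """Bengali query -> English retrieval and generation -> Bengali answer."""
    query_en = inject_keywords(translate_bn_en(query_bn), glossary)
    passages = retrieve(query_en)
    if not passages:
        # Assumed rejection rule: no supporting passage means out of domain.
        return translate_en_bn(OUT_OF_DOMAIN_MESSAGE)
    return translate_en_bn(generate(query_en, passages))
```

Keeping each stage behind a callable makes it straightforward to swap in different open-source translation or retrieval components and to time each stage when measuring end-to-end latency.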
The results indicate that cross-lingual retrieval combined with controlled translation offers a practical and scalable solution for agricultural knowledge access in low-resource language settings.
[61] Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows
Yingte Shu,Yuchuan Tian,Chao Xu,Yunhe Wang,Hanting Chen
Main category: cs.CL
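TL;DR: This paper proposes Deferred Commitment Decoding (DCD), a new decoding strategy that addresses Boundary-Induced Context Truncation (BICT) caused by block-based decoding in diffusion language models, using uncertainty-based dynamic decisions to improve generation quality and efficiency.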
Details
Motivation: Existing block-based diffusion decoding forces undecoded tokens at block boundaries to commit early, causing context truncation (BICT) that hurts generation quality, notably in mathematical reasoning and code generation; a more flexible decoding mechanism is needed.
Method: Proposes Deferred Commitment Decoding (DCD), which uses a confidence-aware sliding window to decide dynamically when each token is committed: low-uncertainty tokens are decoded early, and high-uncertainty tokens are deferred until enough context is available, enabling bidirectional information flow within the window.
Result: Across multiple diffusion language models, benchmarks, and caching configurations, DCD improves generation accuracy by 1.39% on average over fixed block-based decoding, with a maximum gain of 9.0%, at comparable decoding time.
Conclusion: Deferring token commitment based on uncertainty is a simple yet effective principle that improves both the generation quality and the decoding efficiency of diffusion language models.
Abstract: Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when such context could substantially reduce uncertainty. This limitation degrades decoding confidence and generation quality, especially for tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. This design enables effective bidirectional information flow within the decoding window without sacrificing efficiency. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.39% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 9.0%.
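The idea of resolving confident tokens early and deferring uncertain ones can be illustrated with the toy Python loop below. This is only a sketch of the general principle as stated in the abstract, not the authors' algorithm: the `predict` callable, the fixed `threshold`, the window defined as the leftmost masked positions, and the fallback that always commits the single most confident token are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet


def deferred_commitment_decode(
    length: int,
    predict: Callable[[List[Optional[str]], List[int]], Dict[int, Tuple[str, float]]],
    window: int = 4,
    threshold: float = 0.8,
) -> List[Optional[str]]:
    """Toy decoder: commit confident tokens now, defer uncertain ones."""
    tokens: List[Optional[str]] = [MASK] * length
    while any(t is MASK for t in tokens):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # The window covers the leftmost masked positions, so it slides
        # right as earlier positions are resolved.
        active = masked[:window]
        proposals = predict(tokens, active)  # position -> (token, confidence)
        confident = [i for i in active if proposals[i][1] >= threshold]
        if not confident:
            # Guarantee progress: commit the single most confident position.
            confident = [max(active, key=lambda i: proposals[i][1])]
        for i in confident:
            tokens[i] = proposals[i][0]
    return tokens
```

In this toy version a deferred position is simply re-scored on the next pass, by which time its neighbors inside the window may already be filled in and can raise its confidence.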
These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.
[62] DeCode: Decoupling Content and Delivery for Medical QA
Po-Jen Ko,Chen-Han Tsai,Yu-Shao Peng
Main category: cs.CL
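TL;DR: DeCode is a training-free, model-agnostic framework that adapts existing large language models to contextualized question answering in clinical settings. On OpenAI HealthBench it raises performance from 28.4% to 49.8%, a 75% relative improvement.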
Details
Motivation: Existing LLMs have medical knowledge but often ignore individual patient context, so their answers lack clinical relevance.
Method: Proposes the DeCode framework, which adapts model output using patient context, requires no additional training, and applies to a variety of models.
Result: On OpenAI HealthBench, DeCode raises performance from 28.4% to 49.8%, a 75% relative improvement, markedly increasing the relevance and validity of clinical answers.
Conclusion: DeCode effectively improves the ability of LLMs to give contextualized answers in clinical settings and has broad application potential.
Abstract: Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients' needs. In this work, we introduce DeCode, a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses. DeCode improves the previous state of the art from 28.4% to 49.8%, corresponding to a 75% relative improvement. Experimental results suggest the effectiveness of DeCode in improving clinical question answering of LLMs.
[63] Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation
Steffen Freisinger,Philipp Seeberger,Thomas Ranzenberger,Tobias Bocklet,Korbinian Riedhammer
Main category: cs.CL
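TL;DR: Proposes a new approach to hierarchical topic segmentation of speech transcripts that generates multi-level tables of contents capturing topic and subtopic boundaries. It combines zero-shot prompting and LoRA fine-tuning of large language models with speech pause features and significantly outperforms baselines on multilingual data.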
Details
Motivation: To improve the thematic readability of speech transcripts and their downstream processing, and in particular to give better structure to users who rely on written text to access information.
Method: Uses LLMs with zero-shot prompting and LoRA fine-tuning for hierarchical topic segmentation, integrates high-level speech pause features, and adapts an evaluation measure for multi-level segmentation.
Result: Significantly outperforms established topic segmentation baselines on English meeting recordings and multilingual lecture transcripts (Portuguese, German); the adapted metric better reflects multi-level segmentation quality.
Conclusion: The proposed method improves hierarchical topic segmentation of speech transcripts, and combining language models with speech pause features has practical value.
Abstract: Segmenting speech transcripts into thematic sections benefits both downstream processing and users who depend on written text for accessibility. We introduce a novel approach to hierarchical topic segmentation in transcripts, generating multi-level tables of contents that capture both topic and subtopic boundaries. We compare zero-shot prompting and LoRA fine-tuning on large language models, while also exploring the integration of high-level speech pause features. Evaluations on English meeting recordings and multilingual lecture transcripts (Portuguese, German) show significant improvements over established topic segmentation baselines. Additionally, we adapt a common evaluation measure for multi-level segmentation, taking into account all hierarchical levels within one metric.
[64] Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts
Boxuan Lyu,Soichiro Murakami,Hidetaka Kamigaito,Peinan Zhang
Main category: cs.CL
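TL;DR: Proposes kNN-MoE, a retrieval-augmented MoE routing framework that uses similar past cases to adjust expert assignment dynamically, improving robustness under distribution shift.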
Details
Motivation: The router in a conventional MoE is frozen after training, which makes routing decisions brittle and inflexible under distribution shift.
Method: Introduces a retrieval-augmented mechanism: a memory is built offline by optimizing routing logits, and the aggregate similarity of retrieved neighbors serves as a confidence coefficient for dynamically mixing the retrieved assignments with the original router output.
Result: kNN-MoE outperforms zero-shot baselines and approaches the performance of computationally more expensive supervised fine-tuning.
Conclusion: Through memory reuse and a confidence-aware mixing strategy, kNN-MoE improves the generalization and robustness of MoE architectures in out-of-distribution settings.
Abstract: Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the aggregate similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.
[65] FormationEval, an open multiple-choice benchmark for petroleum geoscience
Almaz Ermilov
Main category: cs.CL
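TL;DR: FormationEval is an open multiple-choice benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. It contains 505 questions across seven domains and evaluates 72 models. Top models exceed 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, and open-weight models such as GLM-4.7 also perform strongly at 98.6%.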
Details
Motivation: To evaluate language models on petroleum geoscience and subsurface disciplines and to provide a traceable, open benchmark dataset free of copyright problems.
Method: Builds a dataset of 505 questions across seven domains, generated from three authoritative sources using a reasoning model and a concept-based approach that avoids verbatim copying of copyrighted text, with source metadata attached to each question; evaluates 72 language models from major providers.
Result: Top models exceed 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%; among open-weight models GLM-4.7 leads at 98.6% and several others exceed 93%; the gap between open-weight and closed models is narrower than expected; petrophysics is the most challenging domain; an answer-length bias was identified and mitigated.
Conclusion: FormationEval provides an effective, transparent benchmark for evaluating language models in specialized geoscience domains and shows that many low-cost open-weight models approach closed models; future work could improve domain balance and bias control.
Abstract: This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines. The dataset contains 505 questions across seven domains including petrophysics, petroleum geology and reservoir engineering, derived from three authoritative sources using a reasoning model with detailed instructions and a concept-based approach that avoids verbatim copying of copyrighted text. Each question includes source metadata to support traceability and audit. The evaluation covers 72 models from major providers including OpenAI, Anthropic, Google, Meta and open-weight alternatives. The top performers achieve over 97% accuracy, with Gemini 3 Pro Preview reaching 99.8%, while tier and domain gaps persist. Among open-weight models, GLM-4.7 leads at 98.6%, with several DeepSeek, Llama, Qwen and Mistral models also exceeding 93%. The performance gap between open-weight and closed models is narrower than expected, with several lower-cost open-weight models exceeding 90% accuracy. Petrophysics emerges as the most challenging domain across all models, while smaller models show wider performance variance. Residual length bias in the dataset (correct answers tend to be longer) is documented along with bias mitigation strategies applied during construction. The benchmark, evaluation code and results are publicly available.
[66] Confidence Estimation for LLMs in Multi-turn Interactions
Caiqi Zhang,Ruihan Yang,Xiaochen Zhu,Chengzu Li,Tiancheng Hu,Yijiang River Dong,Deqing Yang,Nigel Collier
Main category: cs.CL
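TL;DR: This paper presents the first systematic study of confidence estimation in multi-turn dialogue. It introduces a new evaluation framework and metrics (such as InfoECE) and the "Hinter-Guesser" data generation paradigm, finds that existing methods perform poorly in multi-turn settings, and proposes P(Sufficient), a logit-based probe, laying groundwork for more reliable conversational agents.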
Details
Motivation: Confidence estimation research has focused mainly on single-turn settings; how model confidence behaves in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remains unclear, yet it matters for applications such as autonomous agents and human-in-the-loop systems.
Method: Proposes a formal evaluation framework built on two desiderata, per-turn calibration and monotonicity of confidence; designs new metrics including a length-normalized Expected Calibration Error (InfoECE); introduces the "Hinter-Guesser" paradigm for generating controlled evaluation datasets; and evaluates a logit-based probe, P(Sufficient).
Result: Widely used confidence estimation techniques struggle to satisfy calibration and monotonicity in multi-turn dialogue; P(Sufficient) performs comparatively better, but the task remains far from solved.
Conclusion: Confidence estimation in multi-turn dialogue is an important and underexplored problem; this work provides the first systematic study framework and evaluation tools, laying a foundation for more reliable and trustworthy conversational systems.
Abstract: While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research predominantly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. Reliable confidence estimation in multi-turn settings is critical for many downstream applications, such as autonomous agents and human-in-the-loop systems. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. We propose P(Sufficient), a logit-based probe that achieves comparatively better performance, although the task remains far from solved. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
[67] Toward Global Large Language Models in Medicine
Rui Yang,Huitao Li,Weihao Xuan,Heli Qi,Xin Li,Kunyu Yu,Yingjian Chen,Rongrong Wang,Jacques Behmoaras,Tianxi Cai,Bibhas Chakraborty,Qingyu Chen,Lionel Tim-Ee Cheng,Marie-Louise Damwanza,Chido Dzinotyiwei,Aosong Feng,Chuan Hong,Yusuke Iwasawa,Yuhe Ke,Linah Kitala,Taehoon Ko,Jisan Lee,Irene Li,Jonathan Chong Kai Liew,Hongfang Liu,Lian Leng Low,Edison Marrese-Taylor,Yutaka Matsuo,Isheanesu Misi,Yilin Ning,Jasmine Chiat Ling Ong,Marcus Eng Hock Ong,Enrico Petretto,Hossein Rouhizadeh,Abiram Sandralegar,Oren Schreier,Iain Bee Huat Tan,Patrick Tan,Daniel Shu Wei Ting,Junjue Wang,Chunhua Weng,Matthew Yu Heng Wong,Fang Wu,Yunze Xiao,Xuhai Xu,Qingcheng Zeng,Zhuo Zheng,Yifan Peng,Douglas Teodoro,Nan Liu
Main category: cs.CL
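TL;DR: This paper presents GlobMed, a large multilingual medical dataset covering 12 languages (including four low-resource languages), together with the GlobMed-Bench benchmark and the GlobMed-LLMs model suite built on it. These substantially improve medical LLM performance in low-resource languages and support more equitable global medical AI.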
Details
Motivation: Existing LLMs are trained mainly on high-resource languages and offer insufficient support for low-resource languages in global medical scenarios, leading to unequal access to medical information.
Method: Constructs the GlobMed dataset with over 500,000 multilingual medical entries, establishes the GlobMed-Bench benchmark, and trains GlobMed-LLMs, a suite of multilingual medical LLMs with 1.7B to 8B parameters.
Result: GlobMed-LLMs improve average performance by over 40% relative to baseline models, with a more than threefold increase on low-resource languages; GlobMed-Bench reveals performance disparities across languages in current mainstream models.
Conclusion: GlobMed and its associated models and benchmark provide an important foundation for the equitable development of multilingual medical LLMs worldwide, helping more language communities benefit from advances in medical AI.
Abstract: Despite continuous advances in medical technology, the global distribution of health care resources remains uneven. The development of large language models (LLMs) has transformed the landscape of medicine and holds promise for improving health care quality and expanding access to medical information globally. However, existing LLMs are primarily trained on high-resource languages, limiting their applicability in global medical scenarios. To address this gap, we constructed GlobMed, a large multilingual medical dataset, containing over 500,000 entries spanning 12 languages, including four low-resource languages. Building on this, we established GlobMed-Bench, which systematically assesses 56 state-of-the-art proprietary and open-weight LLMs across multiple multilingual medical tasks, revealing significant performance disparities across languages, particularly for low-resource languages. Additionally, we introduced GlobMed-LLMs, a suite of multilingual medical LLMs trained on GlobMed, with parameters ranging from 1.7B to 8B. GlobMed-LLMs achieved an average performance improvement of over 40% relative to baseline models, with a more than threefold increase in performance on low-resource languages. Together, these resources provide an important foundation for advancing the equitable development and application of LLMs globally, enabling broader language communities to benefit from technological advances.
[68] ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging
Omer Nacar,Serry Sibaee,Adel Ammar,Yasser Alhabashi,Nadia Samer Sibai,Yara Farouk Ahmed,Ahmed Saud Alqusaiyer,Sulieman Mahmoud AlMahmoud,Abdulrhman Mamdoh Mukhaniq,Lubaba Raed,Sulaiman Mohammed Alatwah,Waad Nasser Alqahtani,Yousif Abdulmajeed Alnasser,Mohamed Aziz Khadraoui,Wadii Boulila
Main category: cs.CL
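TL;DR: This paper presents ARCADE, the first Arabic speech dataset with city-level dialect granularity, comprising 3,790 audio segments and 6,907 annotations from 58 cities across 19 countries and supporting fine-grained dialect identification and multi-task learning.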
Details
Motivation: 现有阿拉伯语多方言数据集缺乏对城市级方言的精细划分,且在语音到细粒度地域映射方面研究不足,因此需要一个高粒度、可靠标注的数据集来推动方言识别研究。 Method: 通过流媒体服务收集阿拉伯语广播语音,截取30秒片段,并由1至3名母语评审员进行人工标注,涵盖情感、语种类型、方言类别及有效性标签;构建包含现代标准阿拉伯语和多种方言的语料库。 Result: 构建了包含6,907条标注和3,790个唯一音频片段的ARCADE数据集,覆盖19国58城,具备高质量音频和丰富的元数据,适合用于城市级方言分类与多任务学习任务。 Conclusion: ARCADE填补了阿拉伯语城市级语音方言数据的空白,为细粒度方言识别提供了可靠基准,促进了跨地域语音分析与自然语言处理应用的发展。 Abstract: The Arabic language is characterized by a rich tapestry of regional dialects that differ substantially in phonetics and lexicon, reflecting the geographic and cultural diversity of its speakers. Despite the availability of many multi-dialect datasets, mapping speech to fine-grained dialect sources, such as cities, remains underexplored. We present ARCADE (Arabic Radio Corpus for Audio Dialect Evaluation), the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus comprises Arabic radio speech collected from streaming services across the Arab world. Our data pipeline captures 30-second segments from verified radio streams, encompassing both Modern Standard Arabic (MSA) and diverse dialectal speech. To ensure reliability, each clip was annotated by one to three native Arabic reviewers who assigned rich metadata, including emotion, speech type, dialect category, and a validity flag for dialect identification tasks. The resulting corpus comprises 6,907 annotations and 3,790 unique audio segments spanning 58 cities across 19 countries. These fine-grained annotations enable robust multi-task learning, serving as a benchmark for city-level dialect tagging. We detail the data collection methodology, assess audio quality, and provide a comprehensive analysis of label distributions. The dataset is available on: https://huggingface.co/datasets/riotu-lab/ARCADE-full[69] From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
Fabian Lukassen,Jan Herrmann,Christoph Weisser,Benjamin Saefken,Thomas Kneib
Main category: cs.CL
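TL;DR: This study systematically evaluates how the forecasting model, XAI method, LLM choice, and prompting strategy affect the quality of natural language explanations (NLEs). LLM choice has the largest effect, XAI yields only small gains and only for expert audiences, and the study reveals a paradox between interpretability and prediction accuracy.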
Details
Motivation: 尽管大型语言模型(LLM)可将XAI输出转化为自然语言解释(NLE),但影响NLE质量的关键因素尚不明确,尤其是面向非专家用户的可解释性提升路径有待探索。 Method: 设计了一个因子实验,涵盖四种预测模型(XGBoost、RF、MLP、SARIMAX)、三种XAI方法(SHAP、LIME、无XAI)、三种LLM(GPT-4o、Llama-3-8B、DeepSeek-R1)和八种提示策略,并使用G-Eval双LLM评判机制,基于四个标准评估660个时间序列预测的NLE。 Result: 1)XAI相较无XAI仅对专家有小幅提升;2)LLM选择影响最大,DeepSeek-R1表现最优;3)出现可解释性悖论:SARIMAX预测更准但NLE质量更低;4)零样本提示效果接近自洽性提示但成本低7倍;5)思维链提示反而降低效果。 Conclusion: 在时间序列预测的可解释性生成中,LLM本身的能力远比XAI方法或提示工程更重要,且高预测性能不等于高解释质量,提示设计需谨慎,简单方法可能更高效。 Abstract: Explainable AI (XAI) methods like SHAP and LIME produce numerical feature attributions that remain inaccessible to non expert users. Prior work has shown that Large Language Models (LLMs) can transform these outputs into natural language explanations (NLEs), but it remains unclear which factors contribute to high-quality explanations. We present a systematic factorial study investigating how Forecasting model choice, XAI method, LLM selection, and prompting strategy affect NLE quality. Our design spans four models (XGBoost (XGB), Random Forest (RF), Multilayer Perceptron (MLP), and SARIMAX - comparing black-box Machine-Learning (ML) against classical time-series approaches), three XAI conditions (SHAP, LIME, and a no-XAI baseline), three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies. Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting. 
Our results suggest that: (1) XAI provides only small improvements over no-XAI baselines, and only for expert audiences; (2) LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3; (3) we observe an interpretability paradox: in our setting, SARIMAX yielded lower NLE quality than ML models despite higher prediction accuracy; (4) zero-shot prompting is competitive with self-consistency at 7-times lower cost; and (5) chain-of-thought hurts rather than helps.[70] CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models
Yihao Liang,Ze Wang,Hao Chen,Ximeng Sun,Jialian Wu,Xiaodong Yu,Jiang Liu,Emad Barsoum,Zicheng Liu,Niraj K. Jha
Main category: cs.CL
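TL;DR: CD4LM decouples training from inference through Discrete-Space Consistency Distillation and Confidence-Adaptive Decoding, enabling efficient parallel generation for diffusion language models and substantially faster decoding while matching or improving generation quality.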
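The confidence-adaptive decoding idea summarized above can be illustrated with a toy loop. This is a minimal sketch, not the authors' CAD implementation: the `predict` callback, the 0.9 threshold, the `MASK` marker, and the fallback of committing the single most confident position when nothing clears the threshold are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical marker for a position that is still masked


def confidence_adaptive_decode(
    length: int,
    predict: Callable[[List[Optional[str]]], List[Tuple[str, float]]],
    threshold: float = 0.9,
) -> Tuple[List[Optional[str]], int]:
    """Commit every masked position whose confidence clears the threshold.

    `predict` returns one (token, confidence) pair per position. If no
    masked position is confident enough, the most confident one is
    committed so that the loop always makes progress.
    """
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        steps += 1
        preds = predict(seq)
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        confident = [i for i in masked if preds[i][1] >= threshold]
        if not confident:
            confident = [max(masked, key=lambda i: preds[i][1])]
        for i in confident:
            seq[i] = preds[i][0]
    return seq, steps


def toy_predict(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
    """Toy stand-in for a model: odd positions become confident only
    after some context has been filled in."""
    filled = sum(tok is not MASK for tok in seq)
    return [
        (f"t{i}", 0.95 if (i % 2 == 0 or filled > 0) else 0.5)
        for i in range(len(seq))
    ]


decoded, num_steps = confidence_adaptive_decode(6, toy_predict)
```

With the toy predictor, six positions are filled in two passes instead of six sequential steps, which is the kind of step skipping the summary describes.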
Details
Motivation: 现有的自回归语言模型受限于串行解码延迟,而扩散语言模型存在训练与推理之间的静态到动态不匹配问题,难以高效实现并行生成。 Method: 提出CD4LM框架,包含两个核心组件:离散空间一致性蒸馏(DSCD),使学生模型对多种噪声状态具有轨迹不变性;置信度自适应解码(CAD),根据token置信度动态分配计算资源,跳过不必要的步骤。 Result: 在GSM8K上,CD4LM相比LLaDA基线实现了5.18倍的墙钟时间加速,同时保持相当的性能;在多个代码和数学基准上,平均加速3.62倍且平均准确率提升,优于现有方法的精度-效率帕累托前沿。 Conclusion: CD4LM有效解决了扩散语言模型中的静态-动态不匹配问题,实现了高质量、高效率的并行文本生成,为下一代语言模型的快速推理提供了可行路径。 Abstract: Autoregressive large language models achieve strong results on many benchmarks, but decoding remains fundamentally latency-limited by sequential dependence on previously generated tokens. Diffusion language models (DLMs) promise parallel generation but suffer from a fundamental static-to-dynamic misalignment: Training optimizes local transitions under fixed schedules, whereas efficient inference requires adaptive "long-jump" refinements through unseen states. Our goal is to enable highly parallel decoding for DLMs with low number of function evaluations while preserving generation quality. To achieve this, we propose CD4LM, a framework that decouples training from inference via Discrete-Space Consistency Distillation (DSCD) and Confidence-Adaptive Decoding (CAD). Unlike standard objectives, DSCD trains a student to be trajectory-invariant, mapping diverse noisy states directly to the clean distribution. This intrinsic robustness enables CAD to dynamically allocate compute resources based on token confidence, aggressively skipping steps without the quality collapse typical of heuristic acceleration. On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall-clock speedup; across code and math benchmarks, it strictly dominates the accuracy-efficiency Pareto frontier, achieving a 3.62x mean speedup while improving average accuracy. Code is available at https://github.com/yihao-liang/CDLM[71] pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs
Tobias Schimanski,Imene Kolli,Jingwei Ni,Yu Fan,Ario Saeid Vaghefi,Elliott Ash,Markus Leippold
Main category: cs.CL
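TL;DR: This paper presents pdfQA, a multi-domain question answering dataset with human-annotated and synthetic data for evaluating end-to-end QA systems over PDF documents, and analyzes ten complexity dimensions that affect question difficulty.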
Details
Motivation: Existing QA datasets mostly start from text sources or are limited to specific domains, yet PDFs are the second-most used document type on the internet, so a multi-domain QA dataset built specifically for PDFs, with explicit complexity dimensions, is needed. Method: Builds a multi-domain QA dataset with 2K human-annotated (real-pdfQA) and 2K synthetic (syn-pdfQA) pairs, defines ten complexity dimensions, applies quality and difficulty filters to retain valid and challenging questions, and answers the questions with open-source LLMs to analyze performance. Result: The pdfQA dataset was constructed and its quality and difficulty validated; QA performance correlates with the proposed complexity dimensions. Conclusion: pdfQA provides a basis for evaluating end-to-end PDF QA pipelines, supporting tests of diverse skills and local optimizations (e.g., in information retrieval or parsing). Abstract: PDFs are the second-most used document type on the internet (after HTML). Yet, existing QA datasets commonly start from text sources or only address specific domains. In this paper, we present pdfQA, a multi-domain 2K human-annotated (real-pdfQA) and 2K synthetic dataset (syn-pdfQA) differentiating QA pairs in ten complexity dimensions (e.g., file type, source modality, source position, answer type). We apply and evaluate quality and difficulty filters on both datasets, obtaining valid and challenging QA pairs. We answer the questions with open-source LLMs, revealing existing challenges that correlate with our complexity dimensions. pdfQA presents a basis for end-to-end QA pipeline evaluation, testing diverse skill sets and local optimizations (e.g., in information retrieval or parsing).
[72] Power-of-Two Quantization-Aware-Training (PoT-QAT) in Large Language Models (LLMs)
Mahmoud Elgenedy
Main category: cs.CL
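TL;DR: This paper studies compression for deploying large language models on edge devices, proposing power-of-two (PoT) quantization to reduce memory and compute cost, with quantization-aware training (QAT) to mitigate the resulting performance loss.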
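To make the power-of-two idea concrete, here is a minimal sketch of rounding a weight to a signed power of two and multiplying by it with a bit shift. The exponent range, the log-domain rounding rule, the handling of zero, and the 4-bit versus 32-bit storage comparison are assumptions chosen for illustration; the entry does not state the paper's exact scheme.

```python
import math
from typing import Tuple


def quantize_pot(w: float, min_exp: int = -7, max_exp: int = 0) -> Tuple[int, int]:
    """Approximate w by sign * 2**exp and return (sign, exp).

    Rounding is done in the log2 domain (an assumed rule). Zero is flushed
    to the smallest representable magnitude so that a sign bit plus a
    3-bit exponent is enough to store every weight.
    """
    if w == 0.0:
        return (1, min_exp)
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return (sign, exp)


def shift_multiply(x: int, sign: int, exp: int) -> int:
    """Multiply a non-negative integer activation x by sign * 2**exp
    using shifts only (right shifts floor the result)."""
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted


def memory_saving(bits_quantized: int, bits_full: int) -> float:
    """Fraction of weight storage saved by the narrower format."""
    return 1.0 - bits_quantized / bits_full


sign, exp = quantize_pot(0.3)           # 0.3 is approximated by 2**-2 = 0.25
product = shift_multiply(64, sign, exp)  # 64 * 0.25 computed as 64 >> 2
saving = memory_saving(4, 32)            # assumed 4-bit PoT vs 32-bit float
```

Under the assumed 4-bit encoding, the saving works out to 87.5%, which agrees with the figure quoted for this paper, though the paper's actual bit layout may differ.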
Details
Motivation: Large language models have enormous parameter counts, so deploying them on resource-constrained edge devices is limited by memory and compute, which calls for efficient model compression. Method: Quantizes weights to powers of two (PoT), replacing multiplications with bit shifts, and applies quantization-aware training (QAT) to recover model performance. Result: On GPT-2 124M, QAT after PoT quantization improves perplexity by 66% and keeps the BERT-Score loss relative to the baseline GPT-2 at 1%; memory savings are estimated at 87.5%, and inference is expected to be 3-10x faster. Conclusion: PoT quantization combined with QAT can reduce the resource cost of large models on edge devices while maintaining good performance, showing potential for practical deployment. Abstract: In Large Language Models (LLMs), the number of parameters has grown exponentially in the past few years, e.g., from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3 to possibly more than a trillion in later versions. This raises a significant challenge for implementation, especially on edge devices. Unlike cloud computing, memory and processing power on edge devices are very limited, which necessitates developing novel ideas to make such applications feasible. In this work, we investigate compressing weights with a special quantization that limits values to powers of two (PoT). This saves a large amount of memory, as only exponents need to be stored; more importantly, it significantly reduces processing cost by replacing costly multiplication with low-cost bit shifting. To overcome the performance loss due to this strict quantization, we investigate Quantization Aware Training (QAT) to enhance performance through additional training. Results on GPT-2 124M show a major enhancement for the quantized PoT model after additional training, with a perplexity enhancement of 66% and a BERT-Score loss relative to baseline GPT-2 of 1%. The memory saving is estimated to be 87.5%, while inference is expected to be 3-10x faster with PoT quantization than with full precision.
[73] Classifying several dialectal Nawatl varieties
Juan-José Guzmán-Landa,Juan-Manuel Torres-Moreno,Miguel Figueroa-Saavedra,Carlos-Emiliano González-Gallardo,Graham Ranger,Martha Lorena-Avendaño-Garrido
Main category: cs.CL
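TL;DR: This study uses machine learning and neural networks to classify dialectal varieties of Nawatl, addressing the language's scarce resources and its varied written forms.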
Details
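Motivation: Nawatl is widely spoken and has a long history, but its computational resources are scarce; with roughly 30 dialectal varieties and differing spellings in its written forms, effective automatic classification methods are lacking. Method: Applies machine learning and neural network techniques to classify Nawatl dialectal varieties. Result: Presents a computational approach to classifying Nawatl dialectal varieties, contributing to digital processing and resource building for the language. Conclusion: Machine learning and neural networks can be applied to dialect classification for low-resource languages, providing technical support for preserving and studying endangered languages. Abstract: Mexico is a country with a large number of indigenous languages, among which the most widely spoken is Nawatl, with more than two million people currently speaking it (mainly in North and Central America). Despite its rich cultural heritage, which dates back to the 15th century, Nawatl is a language with few computer resources. The problem is compounded when it comes to its dialectal varieties, with approximately 30 varieties recognised, not counting the different spellings in the written forms of the language. In this research work, we addressed the problem of classifying Nawatl varieties using Machine Learning and Neural Networks.
[74] Estimating Text Temperature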
Nikolay Mikhaylovskiy
Main category: cs.CL
TL;DR: Proposes a method for estimating the generation temperature of a text, applicable both to language-model output and to human-written text.
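A maximum-likelihood temperature estimate of this kind can be sketched in a few lines, assuming the reference model's raw (untempered) logits are available at every position of the text. The function names, the search bounds, and the use of golden-section search are assumptions for illustration, not the paper's procedure.

```python
import math
from typing import List, Sequence


def log_likelihood(temp: float, logits: Sequence[Sequence[float]],
                   tokens: Sequence[int]) -> float:
    """Log-probability of the observed tokens under softmax(logits / temp)."""
    total = 0.0
    for z, x in zip(logits, tokens):
        scaled = [v / temp for v in z]
        m = max(scaled)  # subtract the max for a stable log-sum-exp
        lse = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += scaled[x] - lse
    return total


def estimate_temperature(logits: Sequence[Sequence[float]],
                         tokens: Sequence[int],
                         lo: float = 0.05, hi: float = 10.0,
                         tol: float = 1e-6) -> float:
    """Golden-section search for the temperature maximizing the likelihood.

    The likelihood is concave in 1/temp, hence unimodal in temp, so a
    bracketing search over [lo, hi] (assumed bounds) is sufficient.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, logits, tokens) > log_likelihood(d, logits, tokens):
            b = d
        else:
            a = c
    return (a + b) / 2


# Toy check: two-token vocabulary where the second token is observed 3 times
# out of 4. With logits [0, 2*ln 3], the likelihood peaks at temperature 2.
toy_logits: List[List[float]] = [[0.0, 2 * math.log(3)]] * 4
toy_tokens = [1, 1, 1, 0]
estimated = estimate_temperature(toy_logits, toy_tokens)
```

In the toy case the optimum can be derived by hand: the fitted probability of the second token must equal its observed frequency of 0.75, which requires 2 ln 3 / T = ln 3 and therefore T = 2.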
Details
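Motivation: To quantify the temperature at which different texts, including human-written ones, were effectively generated, in order to better understand the source of their randomness. Method: Uses a maximum likelihood approach with a pretrained language model to recover the temperature of already generated text, and evaluates the approach across a range of small-to-medium language models. Result: Among the evaluated models, Qwen3 14B estimated temperature best and was then used to estimate the temperatures of popular corpora. Conclusion: The method can estimate the generation temperature of many kinds of text, offering a new tool for analyzing text randomness and the generation process. Abstract: Autoregressive language models typically use a temperature parameter at inference to shape the probability distribution and control the randomness of the generated text. After the text has been generated, this parameter can be estimated using a maximum likelihood approach. Building on this, we propose a procedure to estimate the temperature of any text, including ones written by humans, with respect to a given language model. We evaluate the temperature estimation capability of a wide selection of small-to-medium LLMs. We then use the best-performing Qwen3 14B to estimate temperatures of popular corpora.
[75] Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling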
Berk Atil,Rebecca J. Passonneau,Ninareh Mehrabi
Main category: cs.CL
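TL;DR: This paper systematically evaluates persona-conditioned prompting methods for toxicity detection and proposes a lightweight SVM meta-ensemble that combines the predictions of several prompting strategies, consistently outperforming individual prompts and majority voting across diverse personas in this subjective NLP task.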
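The contrast between majority voting and a learned meta-ensemble over a 4-bit prediction vector can be shown with a toy example. This is only a sketch: the linear kernel, the full-batch subgradient training, the hyperparameters, and the synthetic data (where the first prompt is always right) are assumptions, and the paper's SVM configuration is not described in this entry.

```python
from itertools import product
from typing import List, Sequence, Tuple


def to_pm1(bits: Sequence[int]) -> List[float]:
    """Map 0/1 prompt votes to -1/+1 features."""
    return [2.0 * b - 1.0 for b in bits]


def train_linear_svm(X: Sequence[Sequence[int]], y: Sequence[int],
                     lam: float = 0.01, lr: float = 0.1,
                     epochs: int = 200) -> Tuple[List[float], float]:
    """Soft-margin linear SVM fitted by full-batch subgradient descent
    on the L2-regularized hinge loss. Labels in y are -1 or +1."""
    feats = [to_pm1(x) for x in X]
    n, dim = len(feats), len(feats[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for f, label in zip(feats, y):
            margin = label * (sum(wj * fj for wj, fj in zip(w, f)) + b)
            if margin < 1.0:  # hinge loss is active for this example
                for j in range(dim):
                    grad_w[j] -= label * f[j] / n
                grad_b -= label / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(w: Sequence[float], b: float, bits: Sequence[int]) -> int:
    score = sum(wj * fj for wj, fj in zip(w, to_pm1(bits))) + b
    return 1 if score > 0 else 0


def majority_vote(bits: Sequence[int]) -> int:
    """Strict majority of the prompt votes; ties resolve to 0."""
    return 1 if sum(bits) > len(bits) / 2 else 0


# Synthetic data (assumption): prompt 0 is always correct, the rest are noise.
vectors = list(product([0, 1], repeat=4))
labels = [1 if v[0] == 1 else -1 for v in vectors]
weights, bias = train_linear_svm(vectors, labels)
```

On this toy data the learned ensemble puts its weight on the reliable prompt, so it labels the vector `(1, 0, 0, 0)` as toxic while majority voting does not. Real prompt errors are less cleanly structured, so the gap reported in the paper should not be inferred from this example.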
Details
Motivation: Toxicity judgments are subjective and shaped by the perspectives and social priors of different groups; existing LLM prompting methods behave inconsistently across personas and base models, and systematic comparisons and effective ways to combine them are lacking. Method: Systematically evaluates several prompting methods (including a proposed automated prompt optimization strategy), explores ensembling four prompting variants, and proposes an SVM meta-ensemble over the 4-bit vector of prompt predictions. Result: The SVM meta-ensemble consistently outperforms individual prompting methods and traditional majority voting across model-persona pairs, achieving the best overall performance. Conclusion: The work offers one of the first systematic comparisons of persona-conditioned prompting for toxicity detection and a robust method for pluralistic evaluation in subjective NLP tasks. Abstract: Toxicity detection is inherently subjective, shaped by the diverse perspectives and social priors of different demographic groups. While "pluralistic" modeling as used in economics and the social sciences aims to capture perspective differences across contexts, current Large Language Model (LLM) prompting techniques yield different results across different personas and base models. In this work, we conduct a systematic evaluation of persona-aware toxicity detection, showing that no single prompting method, including our proposed automated prompt optimization strategy, uniformly dominates across all model-persona pairs. To exploit complementary errors, we explore ensembling four prompting variants and propose a lightweight meta-ensemble: an SVM over the 4-bit vector of prompt predictions. Our results demonstrate that the proposed SVM ensemble consistently outperforms individual prompting methods and traditional majority-voting techniques, achieving the strongest overall performance across diverse personas. This work provides one of the first systematic comparisons of persona-conditioned prompting for toxicity detection and offers a robust method for pluralistic evaluation in subjective NLP tasks.
cs.CV [Back]
[76] Free Energy-Based Modeling of Emotional Dynamics in Video Advertisements
Takashi Ushio,Kazuhiro Onishi,Hideyoshi Yanagisawa
Main category: cs.CV
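TL;DR: Drawing on the free energy principle, this study quantifies three emotional responses ("pleasantness," "surprise," and "habituation") solely from scene-level expression features of advertising videos, explains emotional changes through three core measures (KLD, BS, and UN), and validates the method's effectiveness and robustness on 1,059 food video advertisements.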
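The three measures named above build on standard information-theoretic quantities. The sketch below uses textbook definitions over discrete distributions; how the paper actually computes KLD, BS, and UN from video features is not given here, so treating Bayesian surprise as KL(posterior || prior) and uncertainty as prior entropy is an assumption.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def bayes_update(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """Posterior over discrete hypotheses after one observation."""
    unnorm = [a * b for a, b in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


prior = [0.5, 0.5]                               # maximally ambiguous prior
posterior = bayes_update(prior, [0.9, 0.1])      # hypothetical scene evidence
bayesian_surprise = kl_divergence(posterior, prior)  # size of the belief update
uncertainty = entropy(prior)                         # ambiguity before the scene
```

Here a uniform prior has entropy ln 2, and evidence that strongly favors one hypothesis produces a belief update of roughly 0.37 nats.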
Details
Motivation: 为了在不依赖生理信号或主观评分的情况下,建立可解释的情绪估计方法,从而更好地理解广告视频中的情感反应及其对媒体效果的影响。 Method: 基于自由能(FE)原则,使用Kullback-Leibler散度(KLD)表示预测误差,Bayesian惊奇(BS)表示信念更新,不确定性(UN)表示先验模糊性,从广告视频的场景级视觉特征中提取并量化情绪指标。 Result: 实验表明KLD反映与品牌呈现相关的‘愉悦感’,BS捕捉由信息复杂性引起的‘惊喜’,UN则反映由元素类型和空间布局不确定性以及元素多样性和数量引起的‘惊喜’;发现了三种典型情绪模式,且结果在九种超参数设置和六类日本广告视频中具有稳健性和泛化性。 Conclusion: 该方法为无需外部信号的可解释性情感估计提供了新路径,有助于推动更具吸引力的广告视频生成技术的发展。 Abstract: Emotional responses during advertising video viewing are recognized as essential for understanding media effects because they have influenced attention, memory, and purchase intention. To establish a methodological basis for explainable emotion estimation without relying on external information such as physiological signals or subjective ratings, we have quantified "pleasantness," "surprise," and "habituation" solely from scene-level expression features of advertising videos, drawing on the free energy(FE) principle, which has provided a unified account of perception, learning, and behavior. In this framework, Kullback-Leibler divergence (KLD) has captured prediction error, Bayesian surprise (BS) has captured belief updates, and uncertainty (UN) has reflected prior ambiguity, and together they have formed the core components of FE. Using 1,059 15 s food video advertisements, the experiments have shown that KLD has reflected "pleasantness" associated with brand presentation, BS has captured "surprise" arising from informational complexity, and UN has reflected "surprise" driven by uncertainty in element types and spatial arrangements, as well as by the variability and quantity of presented elements. This study also identified three characteristic emotional patterns, namely uncertain stimulus, sustained high emotion, and momentary peak and decay, demonstrating the usefulness of the proposed method. 
Robustness across nine hyperparameter settings and generalization tests with six types of Japanese advertising videos (three genres and two durations) confirmed that these tendencies remained stable. This work can be extended by integrating a wider range of expression elements and validating the approach through subjective ratings, ultimately guiding the development of technologies that can support the creation of more engaging advertising videos.[77] Can Generative Models Actually Forge Realistic Identity Documents?
Alexander Vinogradov
Main category: cs.CV
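TL;DR: This study evaluates whether current publicly available diffusion models can forge identity documents, finding that they can imitate surface-level visual features but fall short on structural and forensic authenticity.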
Details
Motivation: Progress in the realism of generative image models has raised concerns about their use for document forgery; this paper assesses the practical risk that existing publicly available models can produce forgeries able to deceive human or automated verification systems. Method: Evaluates text-to-image and image-to-image generation pipelines using multiple publicly available generative model families, including Stable Diffusion, Qwen, Flux, and Nano-Banana. Result: Current generative models can simulate the surface-level aesthetics of documents but fail to reproduce structural and forensic authenticity, so they are unlikely to pass rigorous verification. Conclusion: The risk of generative identity document deepfakes achieving forensic-level authenticity may be overestimated, which underscores the value of collaboration between machine learning practitioners and document-forensics experts in realistic risk assessment. Abstract: Generative image models have recently shown significant progress in image realism, leading to public concerns about their potential misuse for document forgery. This paper explores whether contemporary open-source and publicly accessible diffusion-based generative models can produce identity document forgeries that could realistically bypass human or automated verification systems. We evaluate text-to-image and image-to-image generation pipelines using multiple publicly available generative model families, including Stable Diffusion, Qwen, Flux, Nano-Banana, and others. The findings indicate that while current generative models can simulate surface-level document aesthetics, they fail to reproduce structural and forensic authenticity. Consequently, the risk of generative identity document deepfakes achieving forensic-level authenticity may be overestimated, underscoring the value of collaboration between machine learning practitioners and document-forensics experts in realistic risk assessment.
[78] Pediatric Pneumonia Detection from Chest X-Rays: A Comparative Study of Transfer Learning and Custom CNNs
Agniv Roy Choudhury
Main category: cs.CV
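TL;DR: This study compares CNNs trained from scratch with transfer learning models (ResNet50, DenseNet121, EfficientNet-B0) for pediatric pneumonia detection. Fine-tuned ResNet50 performs best at 99.43% accuracy, and Grad-CAM visualizations indicate that the model attends to clinically relevant regions.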
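The headline accuracy can be sanity-checked against the other figures this entry reports (5,216 images, an 80/10/10 split, 3 misclassifications). The exact test-set size is not stated, so rounding 10% of 5,216 to 522 images is an assumption in this small check.

```python
total_images = 5216
test_fraction = 0.10   # test share of the 80/10/10 split
misclassified = 3

# Assumed rounding: the exact number of test images is not reported.
test_size = round(total_images * test_fraction)
accuracy_pct = 100.0 * (1 - misclassified / test_size)

print(test_size, round(accuracy_pct, 2))
```

Under that assumption, 3 errors on 522 test images gives 99.43% accuracy, which matches the reported value. It also shows how small the evaluation set is: each additional error moves accuracy by about 0.19 percentage points.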
Details
Motivation: Radiologist availability is limited and diagnoses vary, which makes chest X-ray diagnosis of pediatric pneumonia difficult and calls for an efficient, accurate automated aid. Method: Uses a dataset of 5,216 pediatric chest X-rays split 80/10/10 into training, validation, and test sets; trains seven CNN models (custom networks and transfer learning models); evaluates accuracy, F1-score, and AUC; and uses Grad-CAM for explainability. Result: Fine-tuned ResNet50 performed best, with 99.43% accuracy, 99.61% F1-score, and 99.93% AUC and only 3 misclassifications; fine-tuning outperformed frozen-backbone models by 5.5 percentage points on average, and Grad-CAM showed the model focusing on key lung regions. Conclusion: Transfer learning with fine-tuning substantially outperforms CNNs trained from scratch, reaching near-perfect accuracy for pediatric pneumonia detection, and has potential as a screening tool in resource-limited settings. Abstract: Pneumonia is a leading cause of mortality in children under five, with over 700,000 deaths annually. Accurate diagnosis from chest X-rays is limited by radiologist availability and variability. Objective: This study compares custom CNNs trained from scratch with transfer learning (ResNet50, DenseNet121, EfficientNet-B0) for pediatric pneumonia detection, evaluating frozen-backbone and fine-tuning regimes. Methods: A dataset of 5,216 pediatric chest X-rays was split 80/10/10 for training, validation, and testing. Seven models were trained and assessed using accuracy, F1-score, and AUC. Grad-CAM visualizations provided explainability. Results: Fine-tuned ResNet50 achieved the best performance: 99.43% accuracy, 99.61% F1-score, and 99.93% AUC, with only 3 misclassifications. Fine-tuning outperformed frozen-backbone models by 5.5 percentage points on average. Grad-CAM confirmed clinically relevant lung regions guided predictions. Conclusions: Transfer learning with fine-tuning substantially outperforms CNNs trained from scratch for pediatric pneumonia detection, showing near-perfect accuracy. This system has strong potential as a screening tool in resource-limited settings. Future work should validate these findings on multi-center and adult datasets. Keywords: Pneumonia detection, deep learning, transfer learning, CNN, chest X-ray, pediatric diagnosis, ResNet, DenseNet, EfficientNet, Grad-CAM.
[79] Unified Review and Benchmark of Deep Segmentation Architectures for Cardiac Ultrasound on CAMUS
Zahid Ullah,Muhammad Hilal,Eunsoo Lee,Dragan Pamucar,Jihie Kim
Main category: cs.CV
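TL;DR: This study presents a standardized benchmark of U-Net, Attention U-Net, and TransUNet for cardiac ultrasound segmentation and evaluates the effect of different preprocessing routes and self-supervised pretraining on the CAMUS dataset. Keeping the native NIfTI format improves performance, and TransUNet with self-supervised initialization generalizes best.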
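The benchmark reports results as mean Dice scores. For reference, a minimal sketch of the Dice coefficient on flattened binary masks follows; returning 1.0 when both masks are empty is a common convention assumed here, and the benchmark's own per-structure averaging is not reproduced.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |A intersect B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / denom


# One overlapping pixel, 2 predicted and 1 true foreground pixel: 2*1/(2+1).
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

A mean Dice of 94% versus 91% therefore reflects a difference in average overlap between predicted and reference masks, not in pixel accuracy.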
Details
Motivation: 尽管已有综述涵盖心脏影像与深度学习进展,但缺乏统一且可复现的实验基准。本文旨在建立一个公平、可控的比较框架,连接文献综述与实证评估。 Method: 结合聚焦性文献回顾,对U-Net、Attention U-Net和TransUNet三种架构进行控制变量实验,使用CAMUS数据集,统一训练划分、损失函数和评估指标,测试多种预处理路径(如NIfTI、PNG-16位、伪标签、自监督预训练)的影响。 Result: 使用原生NIfTI数据训练的U-Net达到94%平均Dice系数,PNG-16位版本为91%;Attention U-Net在小区域和低对比度边界表现略优;TransUNet因建模全局上下文能力,在困难样本中表现最好,尤其配合自监督预训练时;伪标签经置信度过滤后提升了模型鲁棒性。 Conclusion: 本文提供了三种主流分割网络在标准设置下的可复现基准结果,强调了保持图像强度保真度的重要性,并展望了自监督学习与基于GPT的多模态标注流程在未来数据高效标注与质量控制中的潜力。 Abstract: Several review papers summarize cardiac imaging and DL advances, few works connect this overview to a unified and reproducible experimental benchmark. In this study, we combine a focused review of cardiac ultrasound segmentation literature with a controlled comparison of three influential architectures, U-Net, Attention U-Net, and TransUNet, on the Cardiac Acquisitions for Multi-Structure Ultrasound Segmentation (CAMUS) echocardiography dataset. Our benchmark spans multiple preprocessing routes, including native NIfTI volumes, 16-bit PNG exports, GPT-assisted polygon-based pseudo-labels, and self-supervised pretraining (SSL) on thousands of unlabeled cine frames. Using identical training splits, losses, and evaluation criteria, a plain U-Net achieved a 94% mean Dice when trained directly on NIfTI data (preserving native dynamic range), while the PNG-16-bit workflow reached 91% under similar conditions. Attention U-Net provided modest improvements on small or low-contrast regions, reducing boundary leakage, whereas TransUNet demonstrated the strongest generalization on challenging frames due to its ability to model global spatial context, particularly when initialized with SSL. Pseudo-labeling expanded the training set and improved robustness after confidence filtering. 
Overall, our contributions are threefold: a harmonized, apples-to-apples benchmark of U-Net, Attention U-Net, and TransUNet under standardized CAMUS preprocessing and evaluation; practical guidance on maintaining intensity fidelity, resolution consistency, and alignment when preparing ultrasound data; and an outlook on scalable self-supervision and emerging multimodal GPT-based annotation pipelines for rapid labeling, quality assurance, and targeted dataset curation.[80] Motion-Compensated Latent Semantic Canvases for Visual Situational Awareness on Edge
Igor Lodin,Sergii Filatov,Vira Filatova,Dmytro Filatov
Main category: cs.CV
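TL;DR: Proposes Motion-Compensated Latent Semantic Canvases (MCLSC), which use motion compensation and asynchronous, motion-gated segmentation to sharply reduce segmentation frequency and processing latency on resource-constrained devices.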
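Motion gating can be illustrated with a toy scheduler that decides on which frames the expensive segmenter would run. The accumulated-motion trigger, the threshold, and the per-frame motion scores are hypothetical stand-ins; the entry says only that inference is triggered when motion indicates new information.

```python
from typing import List, Sequence


def motion_gated_calls(motion_per_frame: Sequence[float],
                       threshold: float) -> List[int]:
    """Return the frame indices at which segmentation would be triggered.

    Motion is accumulated since the last segmentation call, and a new call
    fires once the accumulated motion reaches the threshold. The first
    frame always triggers a call so the canvases start populated.
    """
    calls: List[int] = []
    accumulated = float("inf")  # forces a call on the first frame
    for idx, motion in enumerate(motion_per_frame):
        accumulated += motion
        if accumulated >= threshold:
            calls.append(idx)
            accumulated = 0.0
    return calls


# Mostly static clip with one burst of motion at frame 3.
frames_to_segment = motion_gated_calls([0.0, 0.1, 0.1, 0.9, 0.0, 0.0], 1.0)
```

In this toy clip the segmenter runs on 2 of 6 frames. The reported reduction of more than 30x depends on how static the real footage is and on the actual gating rule.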
Details
Motivation: To achieve efficient visual situational awareness on resource-constrained edge devices by reducing the compute cost of frequently running expensive semantic segmentation. Method: Maintains two latent semantic canvases (a static layer and a dynamic layer) in a stabilized baseline coordinate frame, uses motion compensation of the video stream to keep coordinates consistent, and triggers asynchronous Mask2Former panoptic segmentation only when motion indicates new information. Result: On prerecorded 480p clips, segmentation calls drop by more than 30x and mean end-to-end processing time by more than 20x compared with per-frame segmentation, while static and dynamic semantic overlays stay coherent. Conclusion: Through motion gating and coordinate stabilization, MCLSC substantially improves the efficiency and feasibility of continuous visual perception on edge devices. Abstract: We propose Motion-Compensated Latent Semantic Canvases (MCLSC) for visual situational awareness on resource-constrained edge devices. The core idea is to maintain persistent semantic metadata in two latent canvases - a slowly accumulating static layer and a rapidly updating dynamic layer - defined in a baseline coordinate frame stabilized from the video stream. Expensive panoptic segmentation (Mask2Former) runs asynchronously and is motion-gated: inference is triggered only when motion indicates new information, while stabilization/motion compensation preserves a consistent coordinate system for latent semantic memory. On prerecorded 480p clips, our prototype reduces segmentation calls by >30x and lowers mean end-to-end processing time by >20x compared to naive per-frame segmentation, while maintaining coherent static/dynamic semantic overlays.
[81] VL-OrdinalFormer: Vision Language Guided Ordinal Transformers for Interpretable Knee Osteoarthritis Grading
Zahid Ullah,Jihie Kim
Main category: cs.CV
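TL;DR: This paper proposes VLOrdinalFormer, a vision-language guided ordinal learning framework for fully automated grading of knee osteoarthritis (KOA) severity from knee radiographs. It combines a ViT-L16 backbone, CORAL ordinal regression, and a CLIP-driven semantic alignment module, improving the accuracy and interpretability of KL grading (especially KL1 and KL2) and achieving state-of-the-art performance on the OAI kneeKL224 dataset.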
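CORAL-style ordinal regression replaces a single 5-way softmax with K-1 binary "is the grade greater than k?" outputs. The sketch below shows the usual label encoding and rank decoding for the five KL grades (0 to 4); the function names and example logits are illustrative, and the paper's training details are not reproduced.

```python
import math
from typing import List, Sequence


def coral_encode(grade: int, num_classes: int) -> List[int]:
    """Encode grade g as K-1 binary targets: target k is 1 iff g > k."""
    return [1 if grade > k else 0 for k in range(num_classes - 1)]


def coral_decode(logits: Sequence[float]) -> int:
    """Predicted grade = number of thresholds with P(grade > k) > 0.5."""
    return sum(1 for z in logits if 1.0 / (1.0 + math.exp(-z)) > 0.5)


kl2_targets = coral_encode(2, 5)                        # KL grade 2 of 0..4
predicted_grade = coral_decode([3.0, 0.4, -1.2, -4.0])  # hypothetical logits
```

Because neighboring grades differ in only one binary target, a KL1 versus KL2 confusion costs less than a KL0 versus KL4 confusion, which suits grades whose radiographic differences are subtle.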
Details
Motivation: KOA的早期阶段(KL1和KL2)在影像学上差异细微,导致放射科医生间判断一致性差,亟需一种准确、稳健且可解释的自动化分级方法以支持临床决策。 Method: 提出VLOrdinalFormer:采用ViT-L16作为骨干网络,结合CORAL进行序数回归,并引入CLIP驱动的语义对齐模块,融合关节间隙变窄、骨赘形成和软骨下硬化等文本临床概念;使用分层五折交叉验证、类别感知重加权、测试时增强与全局阈值优化来提升鲁棒性。 Result: 在OAI kneeKL224数据集上,VLOrdinalFormer在宏观F1分数和总体准确率上优于CNN和ViT基线模型,尤其显著提升了KL1和KL2的分类性能,同时保持对轻度和重度病例的高准确率;Grad-CAM和CLIP相似性图显示模型关注解剖学相关的临床区域,具备良好可解释性。 Conclusion: VLOrdinalFormer通过融合视觉-语言语义信息与序数学习,实现了准确、稳健且可解释的KOA自动分级,具有应用于常规放射学实践中疾病进展评估的潜力。 Abstract: Knee osteoarthritis (KOA) is a leading cause of disability worldwide, and accurate severity assessment using the Kellgren Lawrence (KL) grading system is critical for clinical decision making. However, radiographic distinctions between early disease stages, particularly KL1 and KL2, are subtle and frequently lead to inter-observer variability among radiologists. To address these challenges, we propose VLOrdinalFormer, a vision language guided ordinal learning framework for fully automated KOA grading from knee radiographs. The proposed method combines a ViT L16 backbone with CORAL based ordinal regression and a Contrastive Language Image Pretraining (CLIP) driven semantic alignment module, allowing the model to incorporate clinically meaningful textual concepts related to joint space narrowing, osteophyte formation, and subchondral sclerosis. To improve robustness and mitigate overfitting, we employ stratified five fold cross validation, class aware re weighting to emphasize challenging intermediate grades, and test time augmentation with global threshold optimization. Experiments conducted on the publicly available OAI kneeKL224 dataset demonstrate that VLOrdinalFormer achieves state of the art performance, outperforming CNN and ViT baselines in terms of macro F1 score and overall accuracy. Notably, the proposed framework yields substantial performance gains for KL1 and KL2 without compromising classification accuracy for mild or severe cases. 
In addition, interpretability analyses using Grad CAM and CLIP similarity maps confirm that the model consistently attends to clinically relevant anatomical regions. These results highlight the potential of vision language aligned ordinal transformers as reliable and interpretable tools for KOA grading and disease progression assessment in routine radiological practice.[82] VideoCuRL: Video Curriculum Reinforcement Learning with Orthogonal Difficulty Decomposition
Hongbo Jin,Kuanwei Lin,Wenhao Zhang,Yichen Jin,Ge Li
Main category: cs.CV
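TL;DR: This paper proposes VideoCuRL, a new framework for reinforcement learning of video language models that decomposes task difficulty into two axes, visual temporal perception load and cognitive reasoning depth, and builds a 2D curriculum from efficient training-free proxies, improving video understanding and reasoning performance.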
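One way to picture a diagonal schedule over a two-axis difficulty grid is to visit cells in order of their combined difficulty level. This is a hypothetical reading of the "Diagonal Wavefront" idea. The competence-aware part of the paper's strategy is not modeled, and the grid size and function name are assumptions.

```python
from typing import List, Tuple


def diagonal_wavefront(n_visual: int, n_cognitive: int) -> List[List[Tuple[int, int]]]:
    """Group grid cells (visual_level, cognitive_level) into anti-diagonal
    waves, so that training moves from easy-easy toward hard-hard cells."""
    waves: List[List[Tuple[int, int]]] = []
    for level in range(n_visual + n_cognitive - 1):
        wave = [
            (v, level - v)
            for v in range(n_visual)
            if 0 <= level - v < n_cognitive
        ]
        waves.append(wave)
    return waves


schedule = diagonal_wavefront(2, 2)
```

For a 2x2 grid this visits the easy cell first, then the two cells that are hard on exactly one axis, and finally the cell that is hard on both.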
Details
Motivation: 现有的强化学习课程策略依赖于随机数据打乱或基于标量难度指标的简单课程方法,难以区分视频理解中的视觉时序感知负荷与认知推理深度两种正交挑战,因此需要更细粒度的课程设计。 Method: 提出VideoCuRL框架,使用光流和关键帧熵衡量视觉复杂性,校准惊奇度(Calibrated Surprisal)衡量认知复杂性,构建二维课程网格;采用能力感知的对角波前调度策略进行训练规划,并引入动态稀疏KL散度和结构化重访机制以稳定训练过程。 Result: 实验表明,VideoCuRL在推理任务上优于强基线模型(VSI-Bench +2.5),在感知任务上也表现更优(VideoMME +2.9),同时避免了生成式课程中高昂的推理开销。 Conclusion: VideoCuRL通过解耦视觉与认知难度并设计高效课程策略,实现了更鲁棒、可扩展的视频语言模型后训练,为复杂时空推理任务提供了新的解决方案。 Abstract: Reinforcement Learning (RL) is crucial for empowering VideoLLMs with complex spatiotemporal reasoning. However, current RL paradigms predominantly rely on random data shuffling or naive curriculum strategies based on scalar difficulty metrics. We argue that scalar metrics fail to disentangle two orthogonal challenges in video understanding: Visual Temporal Perception Load and Cognitive Reasoning Depth. To address this, we propose VideoCuRL, a novel framework that decomposes difficulty into these two axes. We employ efficient, training-free proxies, optical flow and keyframe entropy for visual complexity, Calibrated Surprisal for cognitive complexity, to map data onto a 2D curriculum grid. A competence aware Diagonal Wavefront strategy then schedules training from base alignment to complex reasoning. Furthermore, we introduce Dynamic Sparse KL and Structured Revisiting to stabilize training against reward collapse and catastrophic forgetting. Extensive experiments show that VideoCuRL surpasses strong RL baselines on reasoning (+2.5 on VSI-Bench) and perception (+2.9 on VideoMME) tasks. Notably, VideoCuRL eliminates the prohibitive inference overhead of generation-based curricula, offering a scalable solution for robust video post-training.[83] Comparative Evaluation of CNN Architectures for Neural Style Transfer in Indonesian Batik Motif Generation: A Comprehensive Study
Happy Gery Pangestu,Andi Prademon Yunus,Siti Khomsah
Main category: cs.CV
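TL;DR: This study systematically compares five CNN backbones for Indonesian batik motif generation, finding that ResNet backbones are substantially more computationally efficient while preserving comparable structural similarity, which makes them suitable for practical use in resource-limited environments.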
Details
Motivation: 现有基于VGG的神经风格迁移方法在处理印尼蜡染图案时计算开销大,难以在资源受限环境中部署,亟需更高效的替代方案。 Method: 通过245次受控实验,结合SSIM、LPIPS等定量指标、定性评估和ANOVA统计分析,系统比较VGG16、VGG19、Inception V3、ResNet50和ResNet101在结构保持、风格表达和计算效率方面的权衡。 Result: 不同主干网络在结构相似性上无显著差异(SSIM的ANOVA p=0.83),但ResNet系列比VGG快5-6倍,FLOPs减少16倍以上,且保持相当的感知相似性;VGG产生更密集的绘画纹理,ResNet更擅长保留几何稳定性和蜡染笔触,Inception V3表现居中但噪声较多。 Conclusion: ResNet-based架构在保持结构保真度与提升计算效率之间取得了更好平衡,应将主干网络选择从追求风格强度转向效率与实用性,为工业化规模的蜡染图案生成提供可行基础。 Abstract: Neural Style Transfer (NST) provides a computational framework for the digital preservation and generative exploration of Indonesian batik motifs; however, existing approaches remain largely centered on VGG-based architectures whose strong stylistic expressiveness comes at the cost of high computational and memory demands, that limits practical deployment in resource-limited environments. This study presents a systematic comparative analysis of five widely used CNN backbones, namely VGG16, VGG19, Inception V3, ResNet50, and ResNet101, based on 245 controlled experiments combining quantitative metrics, qualitative assessment, and statistical analysis to examine the trade-off between structural preservation, stylistic behavior, and computational efficiency. The results show that backbone selection does not yield statistically significant differences in structural similarity, as confirmed by ANOVA on SSIM (p= 0.83), indicating comparable levels of structural preservation rather than equivalent stylistic quality. Within this context, ResNet-based architectures achieve approximately 5-6x faster convergence than VGG models while maintaining similar perceptual similarity (LPIPS = 0.53) and requiring over 16x fewer FLOPs (0.63 vs 10.12 GFLOPs). Qualitative analysis reveals consistent stylistic trade-offs, with VGG producing denser painterly textures, ResNet favoring geometric stability and canting stroke preservation with milder stylization, and Inception V3 exhibiting intermediate but noisier behavior. 
These findings reposition architectural choice in NST from maximizing stylistic intensity toward efficiency-aware and structure-preserving deployment, highlighting ResNet-based backbones as a practical foundation for scalable, industry-oriented batik generation.[84] CornViT: A Multi-Stage Convolutional Vision Transformer Framework for Hierarchical Corn Kernel Analysis
Sai Teja Erukude,Jane Mascarenhas,Lior Shamir
Main category: cs.CV
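TL;DR: This study proposes CornViT, a three-stage framework based on the Convolutional Vision Transformer for automated corn kernel grading. It clearly outperforms ResNet and DenseNet on purity, morphology, and embryo-orientation classification and is released with datasets and a deployable web application.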
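The three-stage cascade can be sketched as a chain of classifiers in which later stages run only on kernels that pass the earlier ones. The classifiers here are stand-in callables and the label strings are assumed; in the paper each stage is a CvT-13 model.

```python
from typing import Any, Callable, Dict

Classifier = Callable[[Any], str]


def grade_kernel(image: Any, purity: Classifier, shape: Classifier,
                 orientation: Classifier) -> Dict[str, str]:
    """Run the stages in order and stop as soon as a kernel leaves the
    pure -> flat path, mirroring the hierarchical setup described above."""
    result = {"purity": purity(image)}
    if result["purity"] != "pure":
        return result                      # impure kernels skip stages 2 and 3
    result["shape"] = shape(image)
    if result["shape"] != "flat":
        return result                      # round kernels skip stage 3
    result["embryo"] = orientation(image)
    return result


# Hypothetical stub classifiers standing in for the three trained models.
pure_flat_up = grade_kernel(
    "kernel.png", lambda _: "pure", lambda _: "flat", lambda _: "up"
)
impure = grade_kernel(
    "kernel.png", lambda _: "impure", lambda _: "flat", lambda _: "up"
)
```

One consequence of this design is that errors compound along the path: a kernel receives a correct embryo-orientation label only if the purity and shape stages were also correct.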
Details
Motivation: 传统玉米籽粒分级依赖人工,效率低且主观性强,亟需准确、可部署的自动化解决方案以支持种子认证与育种工作。 Method: 构建三阶段CvT-13分类框架:第一阶段判断籽粒纯度,第二阶段对纯籽粒分类形态(扁/圆),第三阶段判断纯扁籽粒的胚向(上/下);基于公开数据集人工重标注构建三个专用数据集,并采用ImageNet-22k预训练模型进行头层微调。 Result: 在三个任务上分别取得93.76%、94.11%和91.12%的测试准确率,显著优于ResNet-50和DenseNet-121;发布数据集与基于Flask的Web应用,支持可视化推理。 Conclusion: Convolution-augmented self-attention结构(如CvT)在细粒度种子分析中表现优越,CornViT框架结合高质量数据集与可解释性工具,为种子质量评估提供了可落地的自动化方案。 Abstract: Accurate grading of corn kernels is critical for seed certification, directional seeding, and breeding, yet it is still predominantly performed by manual inspection. This work introduces CornViT, a three-stage Convolutional Vision Transformer (CvT) framework that emulates the hierarchical reasoning of human seed analysts for single-kernel evaluation. Three sequential CvT-13 classifiers operate on 384x384 RGB images: Stage 1 distinguishes pure from impure kernels; Stage 2 categorizes pure kernels into flat and round morphologies; and Stage 3 determines the embryo orientation (up vs. down) for pure, flat kernels. Starting from a public corn seed image collection, we manually relabeled and filtered images to construct three stage-specific datasets: 7265 kernels for purity, 3859 pure kernels for morphology, and 1960 pure-flat kernels for embryo orientation, all released as benchmarks. Head-only fine-tuning of ImageNet-22k pretrained CvT-13 backbones yields test accuracies of 93.76% for purity, 94.11% for shape, and 91.12% for embryo-orientation detection. Under identical training conditions, ResNet-50 reaches only 76.56 to 81.02 percent, whereas DenseNet-121 attains 86.56 to 89.38 percent accuracy. These results highlight the advantages of convolution-augmented self-attention for kernel analysis. To facilitate adoption, we deploy CornViT in a Flask-based web application that performs stage-wise inference and exposes interpretable outputs through a browser interface. 
Together, the CornViT framework, curated datasets, and web application provide a deployable solution for automated corn kernel quality assessment in seed quality workflows. Source code and data are publicly available.
[85] Evaluating Contextual Intelligence in Recyclability: A Comprehensive Study of Image-Based Reasoning Systems
Eliot Park,Abhi Kumar,Pranav Rajpurkar
Main category: cs.CV
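TL;DR: This study explores the use of advanced vision-language models (GPT-4o, GPT-4o-mini, and Claude 3.5) to predict the recyclability of common waste items, and evaluates their performance in complex scenarios: matching items to appropriate recycling bins, adapting to location-specific guidelines, handling contamination or damage, and dealing with multi-material objects.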
Details
Motivation: Accurately judging whether an item is recyclable and how to dispose of it correctly remains challenging for the general public, and recycling efficiency needs to improve to promote environmental sustainability. Method: Using a curated image dataset, the study evaluates the ability of vision-language models such as GPT-4o, GPT-4o-mini, and Claude 3.5 to judge item recyclability and physical fit in a range of scenarios. Result: Compared with previous iterations, these models show significant progress in contextual understanding and handle location differences, contamination, damage, and composite materials reasonably well, but their accuracy still falls short in some areas. Conclusion: Continued refinement of context-aware models is crucial for improving public recycling behavior and advancing sustainability. Abstract: While the importance of efficient recycling is widely acknowledged, accurately determining the recyclability of items and their proper disposal remains a complex task for the general public. In this study, we explore the application of cutting-edge vision-language models (GPT-4o, GPT-4o-mini, and Claude 3.5) for predicting the recyclability of commonly disposed items. Utilizing a curated dataset of images, we evaluated the models' ability to match objects to appropriate recycling bins, including assessing whether the items could physically fit into the available bins. Additionally, we investigated the models' performance across several challenging scenarios: (i) adjusting predictions based on location-specific recycling guidelines; (ii) accounting for contamination or structural damage; and (iii) handling objects composed of multiple materials. Our findings highlight the significant advancements in contextual understanding offered by these models compared to previous iterations, while also identifying areas where they still fall short. The continued refinement of context-aware models is crucial for enhancing public recycling practices and advancing environmental sustainability.
[86] Clean-GS: Semantic Mask-Guided Pruning for 3D Gaussian Splatting
Subhankar Mishra
Main category: cs.CV
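TL;DR: Clean-GS uses sparse semantic masks to remove floaters and background clutter from 3D Gaussian Splatting reconstructions, achieving 60-80% model compression with only a few segmentation masks while maintaining high-quality rendering.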
Details
Motivation: 3D Gaussian Splatting produces large numbers of spurious floating Gaussians that degrade visual quality and hinder model deployment, especially in bandwidth-constrained applications. Existing methods rely on global importance metrics and struggle to separate the target from the background precisely. Method: Clean-GS combines whitelist-based spatial filtering, color-guided validation, and neighbor-based outlier removal: (1) Gaussians are projected into the semantic mask regions for whitelist filtering; (2) color consistency is validated with a depth buffer; (3) isolated outliers are removed based on neighborhood relations. As few as 3 semantic masks (1% of views) suffice for cleanup. Result: On the Tanks and Temples dataset, model size drops from 125MB to 47MB, a 62% reduction, while object detail and rendering quality are preserved; monuments and target objects are effectively isolated from complex outdoor scenes. Conclusion: By introducing sparse semantic priors, Clean-GS markedly improves the cleanliness and compactness of 3DGS reconstructions and offers a practical route to efficient deployment in resource-constrained settings such as the web and AR/VR. Abstract: 3D Gaussian Splatting produces high-quality scene reconstructions but generates hundreds of thousands of spurious Gaussians (floaters) scattered throughout the environment. These artifacts obscure objects of interest and inflate model sizes, hindering deployment in bandwidth-constrained applications. We present Clean-GS, a method for removing background clutter and floaters from 3DGS reconstructions using sparse semantic masks. Our approach combines whitelist-based spatial filtering with color-guided validation and outlier removal to achieve 60-80% model compression while preserving object quality. Unlike existing 3DGS pruning methods that rely on global importance metrics, Clean-GS uses semantic information from as few as 3 segmentation masks (1% of views) to identify and remove Gaussians not belonging to the target object. Our multi-stage approach, consisting of (1) whitelist filtering via projection to masked regions, (2) depth-buffered color validation, and (3) neighbor-based outlier removal, isolates monuments and objects from complex outdoor scenes. Experiments on Tanks and Temples show that Clean-GS reduces file sizes from 125MB to 47MB while maintaining rendering quality, making 3DGS models practical for web deployment and AR/VR applications. Our code is available at https://github.com/smlab-niser/clean-gs
[87] Four-Stage Alzheimer's Disease Classification from MRI Using Topological Feature Extraction, Feature Selection, and Ensemble Learning
Faisal Ahmed
Main category: cs.CV
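TL;DR: Proposes TDA-Alz, a framework that combines topological data analysis with ensemble learning for efficient, interpretable four-stage Alzheimer's disease classification, reaching 98.19% accuracy and outperforming or matching existing deep learning methods.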
Details
Motivation: To address limited data, uninterpretable models, and high computational cost in Alzheimer's disease severity classification. Method: Topological data analysis (TDA) is used to extract topological descriptors of intrinsic structural patterns from brain MRI, feature selection retains the most discriminative features, and ensemble learning performs the multiclass classification. Result: The method reaches 98.19% accuracy and 99.75% AUC on the OASIS-1 dataset, outperforming or matching existing deep learning methods without data augmentation, pretrained networks, or large-scale computational resources. Conclusion: TDA-Alz offers a lightweight, efficient, and interpretable alternative for AD severity classification in clinical decision-support systems. Abstract: Accurate and efficient classification of Alzheimer's disease (AD) severity from brain magnetic resonance imaging (MRI) remains a critical challenge, particularly when limited data and model interpretability are of concern. In this work, we propose TDA-Alz, a novel framework for four-stage Alzheimer's disease severity classification (non-demented, moderate dementia, mild, and very mild) using topological data analysis (TDA) and ensemble learning. Instead of relying on deep convolutional architectures or extensive data augmentation, our approach extracts topological descriptors that capture intrinsic structural patterns of brain MRI, followed by feature selection to retain the most discriminative topological features. These features are then classified using an ensemble learning strategy to achieve robust multiclass discrimination. Experiments conducted on the OASIS-1 MRI dataset demonstrate that the proposed method achieves an accuracy of 98.19% and an AUC of 99.75%, outperforming or matching state-of-the-art deep learning-based methods reported on OASIS and OASIS-derived datasets. Notably, the proposed framework does not require data augmentation, pretrained networks, or large-scale computational resources, making it computationally efficient and fast compared to deep neural network approaches. Furthermore, the use of topological descriptors provides greater interpretability, as the extracted features are directly linked to the underlying structural characteristics of brain MRI rather than opaque latent representations.
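As a toy illustration of what a topological descriptor can look like, the sketch below computes 0-dimensional sublevel-set persistence pairs for a 1D intensity profile. It assumes the descriptors are of the persistent-homology kind, which the abstract does not state, and it works on a 1D list rather than an MRI volume; the function name `sublevel_persistence_0d` is hypothetical.

```python
def sublevel_persistence_0d(values):
    """Return sorted (birth, death) pairs of connected components as the threshold rises."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent = {}   # union-find forest over indices already below the threshold
    birth = {}    # birth value of the component rooted at each index

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if j not in parent:
                continue
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # Elder rule: the component born later dies at the current value.
            old, young = (ri, rj) if birth[ri] <= birth[rj] else (rj, ri)
            if birth[young] < values[i]:    # skip zero-persistence pairs
                pairs.append((birth[young], values[i]))
            parent[young] = old
    # The component born at the global minimum never dies.
    pairs.append((values[order[0]], float("inf")))
    return sorted(pairs)
```

For the profile `[0, 2, 1, 3, 0.5]` this yields the pairs `(0, inf)`, `(0.5, 3)`, and `(1, 2)`: each local minimum gives birth to a component that dies when it merges into an older one. Lifetimes such as `death - birth` are the kind of numeric feature that could then be passed to feature selection and a classifier.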
These results indicate that TDA-Alz offers a powerful, lightweight, and interpretable alternative to deep learning models for MRI-based Alzheimer's disease severity classification, with strong potential for real-world clinical decision-support systems.
[88] Application of deep learning techniques in non-contrast computed tomography pulmonary angiogram for pulmonary embolism diagnosis
I-Hsien Ting,Yi-Jun Tseng,Yu-Sheng Lin
Main category: cs.CV
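TL;DR: This study proposes a deep learning model based on a 3D convolutional neural network for automatic classification of pulmonary embolism in non-contrast CT images, and confirms the feasibility of the approach for pulmonary embolism diagnosis.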
Details
Motivation: Pulmonary embolism is life-threatening. Conventional contrast-enhanced CT pulmonary angiography carries a risk of acute kidney injury and takes time, which may delay treatment of acute patients, so a fast diagnostic method that needs no contrast medium is required. Method: A 3D convolutional neural network classifies pulmonary embolism in non-contrast CT images, using deep learning for automatic diagnosis. Result: The model reaches 85% accuracy and an AUC of 0.84 on pulmonary embolism classification in non-contrast CT images. Conclusion: Deep learning-based analysis of non-contrast CT images is feasible for pulmonary embolism diagnosis and has clinical potential, helping to avoid contrast-related risks and speed up diagnosis. Abstract: Pulmonary embolism is a life-threatening disease; early detection and treatment can significantly reduce mortality. In recent years, many studies have used deep learning to diagnose pulmonary embolism from computed tomography pulmonary angiography with contrast medium, but the contrast medium is likely to cause acute kidney injury in patients with pulmonary embolism and chronic kidney disease. Because the contrast medium also takes time to work, patients with acute pulmonary embolism may miss the golden treatment time. This study aims to use deep learning techniques to automatically classify pulmonary embolism in CT images without contrast medium by using a 3D convolutional neural network model. The deep learning model used in this study classified pulmonary embolism in computed tomography images without contrast with 85% accuracy and 0.84 AUC, which confirms the feasibility of the model in the diagnosis of pulmonary embolism.
[89] Analyzing the Shopping Journey: Computing Shelf Browsing Visits in a Physical Retail Store
Luis Yoichi Morales,Francesco Zanlungo,David M. Woollard
Main category: cs.CV
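TL;DR: This paper proposes an algorithm that extracts customers' "shelf visits" from 3D vision-based tracking with overhead cameras in order to understand browsing intent in retail environments, and verifies its applicability across different stores and its relation to purchasing behavior.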
Details
Motivation: Deploying robots in customer-facing retail roles is challenging and requires autonomous understanding of shopper intent, which motivates a study of customer activity in physical stores. Method: Customer trajectories are obtained from machine vision-based 3D tracking and overhead cameras. An algorithm for computing "shelf visits" is proposed, then calibrated and evaluated across stores using two independent datasets (8138 and 15129 trajectories). Result: The algorithm recognizes browsing activity in a store different from the one used for calibration, and can be used to analyze browsing patterns in large trajectory sets and their relation to actual purchases. Conclusion: The algorithm generalizes across environments, and shelf browsing information can support retail planning and human-robot interaction applications. Abstract: Motivated by recent challenges in the deployment of robots into customer-facing roles within retail, this work introduces a study of customer activity in physical stores as a step toward autonomous understanding of shopper intent. We introduce an algorithm that computes shoppers' "shelf visits", capturing their browsing behavior in the store. Shelf visits are extracted from trajectories obtained via machine vision-based 3D tracking and overhead cameras. We perform two independent calibrations of the shelf visit algorithm, using distinct sets of trajectories (consisting of 8138 and 15129 trajectories), collected in different stores and labeled by human reviewers. The calibrated models are then evaluated on trajectories held out of the calibration process both from the same store on which calibration was performed and from the other store. An analysis of the results shows that the algorithm can recognize customers' browsing activity when evaluated in an environment different from the one on which calibration was performed. We then use the model to analyze the customers' "browsing patterns" on a large set of trajectories and their relation to actual purchases in the stores. Finally, we discuss how shelf browsing information could be used for retail planning and in the domain of human-robot interaction scenarios.
[90] ShadowGS: Shadow-Aware 3D Gaussian Splatting for Satellite Imagery
Feng Luo,Hongbo Pan,Xiang Yang,Baoyu Jiang,Fengqing Liu,Tao Huang
Main category: cs.CV
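TL;DR: This paper proposes ShadowGS, a shadow modeling framework for multi-temporal satellite imagery based on 3D Gaussian Splatting, which uses physics-based rendering and consistency constraints to significantly improve 3D reconstruction accuracy and shadow decoupling.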
Details
Motivation: In multi-temporal satellite imagery, shadow inconsistencies caused by varying illumination severely degrade 3D reconstruction quality, and existing methods struggle to handle the coupling between shadow geometry and illumination. Method: The ShadowGS framework combines a physics-based rendering equation from remote sensing with an efficient ray marching technique to model geometrically consistent shadows precisely. A shadow consistency constraint and a shadow map prior are introduced to disentangle illumination components and improve reconstruction with sparse views. Result: Experiments show that ShadowGS outperforms current state-of-the-art methods in shadow decoupling accuracy, 3D reconstruction accuracy, and novel view synthesis quality with only a few minutes of training, and is robust across RGB, pansharpened, and sparse-view inputs. Conclusion: ShadowGS effectively addresses the impact of inconsistent shadows on 3D reconstruction from multi-temporal satellite images, achieving efficient, accurate, and robust reconstruction through physics-guided rendering and prior constraints. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a novel paradigm for 3D reconstruction from satellite imagery. However, in multi-temporal satellite images, prevalent shadows exhibit significant inconsistencies due to varying illumination conditions. To address this, we propose ShadowGS, a novel framework based on 3DGS. It leverages a physics-based rendering equation from remote sensing, combined with an efficient ray marching technique, to precisely model geometrically consistent shadows while maintaining efficient rendering. Additionally, it effectively disentangles different illumination components and apparent attributes in the scene. Furthermore, we introduce a shadow consistency constraint that significantly enhances the geometric accuracy of 3D reconstruction. We also incorporate a novel shadow map prior to improve performance with sparse-view inputs. Extensive experiments demonstrate that ShadowGS outperforms current state-of-the-art methods in shadow decoupling accuracy, 3D reconstruction precision, and novel view synthesis quality, with only a few minutes of training. ShadowGS exhibits robust performance across various settings, including RGB, pansharpened, and sparse-view satellite inputs.
[91] Learning to Segment Liquids in Real-world Images
Jonas Li,Michelle Li,Luke Liu,Heng Fan
Main category: cs.CV
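TL;DR: Proposes a new liquid detection model, LQDM, and builds LQDS, a large-scale liquid dataset of 5000 real-world images; a cross-attention mechanism improves liquid segmentation performance.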
Details
Motivation: Liquids are everywhere in daily life, but their varied appearance, transparency, and reflectivity make segmentation challenging. Existing research has paid little attention to the problem, which limits the ability of robots to interact with liquids safely. Method: A large-scale dataset named LQDS is built, containing 5000 real-world liquid images annotated into 14 classes. The LQDM model is designed to enhance segmentation through cross-attention between a boundary branch and the main segmentation branch. Result: In extensive experiments on the LQDS test set, LQDM outperforms existing state-of-the-art methods on semantic segmentation of liquids. Conclusion: LQDM improves liquid segmentation accuracy in complex scenes, establishes a strong baseline for liquid perception, and advances research on safe robot interaction with liquids. Abstract: Different types of liquids such as water, wine and medicine appear in all aspects of daily life. However, limited attention has been given to segmenting them, hindering the ability of robots to avoid or interact with liquids safely. The segmentation of liquids is difficult because liquids come in diverse appearances and shapes; moreover, they can be transparent or reflective, taking on arbitrary objects and scenes from the background or surroundings. To take on this challenge, we construct a large-scale dataset of liquids named LQDS consisting of 5000 real-world images annotated into 14 distinct classes, and design a novel liquid detection model named LQDM, which leverages cross-attention between a dedicated boundary branch and the main segmentation branch to enhance segmentation predictions. Extensive experiments demonstrate the effectiveness of LQDM on the test set of LQDS, outperforming state-of-the-art methods and establishing a strong baseline for the semantic segmentation of liquids.
[92] PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education
Megha Mariam K. M,Aditya Arun,Zakaria Laskar,C. V. Jawahar
Main category: cs.CV
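TL;DR: This paper proposes a benchmark for evaluating how well text-to-video (T2V) models generate explanatory videos for physics education, with the aim of exploring the feasibility of AI-generated teaching videos.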
Details
Motivation: To advance science education by using generative AI to automatically create intuitive, engaging visualizations of physics concepts, improving accessibility and personalization of teaching. Method: A dedicated benchmark is built in which each physics concept is decomposed into granular teaching points, and a prompt for visual explanation is designed for each point, to evaluate the ability of T2V models to generate accurate educational videos. Result: Current T2V models generate visually coherent videos with smooth motion but fall short on conceptual accuracy, struggling especially with electromagnetism and thermodynamics. Conclusion: Existing T2V models perform well on visual quality, but conceptual correctness still needs improvement; the benchmark is intended to help the community close this gap and move toward scalable, accurate, curriculum-aligned AI educational content. Abstract: Generative AI models, particularly Text-to-Video (T2V) systems, offer a promising avenue for transforming science education by automating the creation of engaging and intuitive visual explanations. In this work, we take a first step toward evaluating their potential in physics education by introducing a dedicated benchmark for explanatory video generation. The benchmark is designed to assess how well T2V models can convey core physics concepts through visual illustrations. Each physics concept in our benchmark is decomposed into granular teaching points, with each point accompanied by a carefully crafted prompt intended for visual explanation of the teaching point. T2V models are evaluated on their ability to generate accurate videos in response to these prompts. Our aim is to systematically explore the feasibility of using T2V models to generate high-quality, curriculum-aligned educational content, paving the way toward scalable, accessible, and personalized learning experiences powered by AI. Our evaluation reveals that current models produce visually coherent videos with smooth motion and minimal flickering, yet their conceptual accuracy is less reliable. Performance in areas such as mechanics, fluids, and optics is encouraging, but models struggle with electromagnetism and thermodynamics, where abstract interactions are harder to depict. These findings underscore the gap between visual quality and conceptual correctness in educational video generation. We hope this benchmark helps the community close that gap and move toward T2V systems that can deliver accurate, curriculum-aligned physics content at scale.
The benchmark and accompanying codebase are publicly available at https://github.com/meghamariamkm/PhyEduVideo.
[93] Deep Clustering with Associative Memories
Bishwajit Saha,Dmitry Krotov,Mohammed J. Zaki,Parikshit Ram
Main category: cs.CV
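TL;DR: Proposes DCAM, a new deep clustering method based on energy-based dynamics and Associative Memories, which ties representation learning and clustering together in a single objective and improves clustering quality across modalities such as images and text.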
Details
Motivation: In existing deep clustering methods, representation learning and clustering remain somewhat separate: clustering is inherently a discrete optimization problem and is hard to integrate seamlessly with differentiable representation learning, so the two are only loosely coupled. Method: A new loss function is designed that uses energy-based dynamics and Associative Memories to jointly optimize representation learning and clustering in one unified objective. Result: Experiments show that DCAM improves clustering quality across several network architectures (convolutional, residual, fully-connected) and data modalities (images, text). Conclusion: Through energy-driven Associative Memories, DCAM integrates representation learning and clustering more tightly, providing a more compact and effective deep clustering framework. Abstract: Deep clustering (joint representation learning and latent space clustering) is a well-studied problem, especially in computer vision and text processing under the deep learning framework. While the representation learning is generally differentiable, clustering is an inherently discrete optimization task, requiring various approximations and regularizations to fit in a standard differentiable pipeline. This leads to a somewhat disjointed representation learning and clustering. In this work, we propose a novel loss function utilizing energy-based dynamics via Associative Memories to formulate a new deep clustering method, DCAM, which ties together the representation learning and clustering aspects more intricately in a single objective. Our experiments showcase the advantage of DCAM, producing improved clustering quality for various architecture choices (convolutional, residual or fully-connected) and data modalities (images or text).
[94] A Deep Learning Approach for Automated Skin Lesion Diagnosis with Explainable AI
Md. Maksudul Haque,Rahnuma Akter,A S M Ahsanul Sarkar Akib,Abdul Hasib
Main category: cs.CV
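TL;DR: This paper proposes a deep learning method for multi-class skin lesion classification that combines data balancing, data augmentation, a hybrid EfficientNetV2-L with channel attention, and three-stage progressive learning. It achieves 91.15% accuracy and an 85.45% macro F1 score on the HAM10000 dataset, and uses explainable AI techniques such as Grad-CAM to improve model transparency and clinical trustworthiness.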
Details
Motivation: Skin cancer is one of the most common and dangerous cancers worldwide and requires timely, precise diagnosis, but existing methods still fall short in classification performance and model interpretability. Method: The approach uses data balancing and large-scale data augmentation, builds a hybrid EfficientNetV2-L model with channel attention, trains it in a three-stage progressive learning framework, and applies explainable AI techniques such as Grad-CAM and saliency maps to produce visual explanations of model predictions. Result: On the HAM10000 dataset the model reaches 91.15% accuracy, 85.45% macro F1, and 99.33% micro-average AUC, with especially strong results on melanoma and melanocytic nevi; the XAI analysis reveals the key visual features that drive the classifications. Conclusion: The method performs well on multi-class skin lesion classification and combines high performance with interpretability, which helps improve diagnostic accuracy and clinicians' trust in the model. Abstract: Skin cancer is one of the most common and dangerous types of cancer in the world and requires timely and precise diagnosis. In this paper, we describe a deep learning architecture for multi-class skin lesion classification on the HAM10000 dataset. The proposed system combines high-quality data balancing methods, large-scale data augmentation, a hybridized EfficientNetV2-L framework with channel attention, and a three-stage progressive learning approach. Moreover, we use explainable AI (XAI) techniques such as Grad-CAM and saliency maps to produce intelligible visual representations of model predictions. Our approach achieves a total accuracy of 91.15%, a macro F1 of 85.45%, and a micro-average AUC of 99.33%. The model shows high performance across all seven lesion classes, with particularly high performance on melanoma and melanocytic nevi. In addition to enhancing diagnostic transparency, XAI also helps identify the visual characteristics that drive the classifications, which enhances clinical trustworthiness.
[95] Few-Shot Video Object Segmentation in X-Ray Angiography Using Local Matching and Spatio-Temporal Consistency Loss
Lin Xi,Yingliang Ma,Xiahai Zhuang
Main category: cs.CV
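TL;DR: Proposes a new FSVOS model with a direction-based local matching strategy that improves the accuracy and generalization of multi-object segmentation in X-ray angiography videos.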
Details
Motivation: Existing methods rely on inefficient or hardware-specific operations, lack portability across devices, and adapt poorly to diverse spatial structures. Method: A direction-based, non-parametric local sampling mechanism is introduced, and a supervised spatio-temporal contrastive learning scheme is designed to strengthen feature consistency across frames. Result: The method outperforms current state-of-the-art approaches on the CADICA, XACV, and newly released MOSXAV datasets, with better segmentation accuracy and generalization. Conclusion: The method is flexible and portable, and is suitable for a range of clinical applications. Abstract: We introduce a novel FSVOS model that employs a local matching strategy to restrict the search space to the most relevant neighboring pixels. Rather than relying on inefficient standard im2col-like implementations (e.g., spatial convolutions, depthwise convolutions and feature-shifting mechanisms) or hardware-specific CUDA kernels (e.g., deformable and neighborhood attention), which often suffer from limited portability across non-CUDA devices, we reorganize the local sampling process through a direction-based sampling perspective. Specifically, we implement a non-parametric sampling mechanism that enables dynamically varying sampling regions. This approach provides the flexibility to adapt to diverse spatial structures without the computational costs of parametric layers and the need for model retraining. To further enhance feature coherence across frames, we design a supervised spatio-temporal contrastive learning scheme that enforces consistency in feature representations. In addition, we introduce a publicly available benchmark dataset for multi-object segmentation in X-ray angiography videos (MOSXAV), featuring detailed, manually labeled segmentation ground truth. Extensive experiments on the CADICA, XACV, and MOSXAV datasets show that our proposed FSVOS method outperforms current state-of-the-art video segmentation methods in terms of segmentation accuracy and generalization capability (i.e., seen and unseen categories). This work offers enhanced flexibility and potential for a wide range of clinical applications.
[96] UnrealPose: Leveraging Game Engine Kinematics for Large-Scale Synthetic Human Pose Data
Joshua Kawaguchi,Saad Manzur,Emily Gao Wang,Maitreyi Sinha,Bryan Vela,Yunxi Wang,Brandon Vela,Wayne B. Hayes
Main category: cs.CV
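TL;DR: This paper presents UnrealPose-Gen, a high-quality offline rendering pipeline built on Unreal Engine 5 for generating accurately annotated 3D human pose data, and releases UnrealPose-1M, a dataset of approximately one million frames.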
Details
Motivation: Diverse, accurately labeled real-world 3D human pose data is expensive to acquire and bound to studio settings, while in-the-wild datasets lack known ground truth, so a scalable, high-quality synthetic data generation approach is needed. Method: The UnrealPose-Gen rendering pipeline is built on Unreal Engine 5's Movie Render Queue and produces synthetic images with rich annotations, including 3D joints, 2D keypoints, bounding boxes, and camera parameters. It is used to create the UnrealPose-1M dataset, which contains eight sequences (scripted and randomized) covering multiple scenes, actions, and viewpoints. Result: As a fidelity check, real-to-synthetic results are reported on four tasks: image-to-3D pose estimation, 2D keypoint detection, 2D-to-3D lifting, and person detection/segmentation. Conclusion: The UnrealPose-1M dataset and the UnrealPose-Gen pipeline provide high-quality, diverse, accurately annotated synthetic data for 3D human pose research, and both are released by the authors. Abstract: Diverse, accurately labeled 3D human pose data is expensive and studio-bound, while in-the-wild datasets lack known ground truth. We introduce UnrealPose-Gen, an Unreal Engine 5 pipeline built on Movie Render Queue for high-quality offline rendering. Our generated frames include: (i) 3D joints in world and camera coordinates, (ii) 2D projections and COCO-style keypoints with occlusion and joint-visibility flags, (iii) person bounding boxes, and (iv) camera intrinsics and extrinsics. We use UnrealPose-Gen to present UnrealPose-1M, an approximately one million frame corpus comprising eight sequences: five scripted "coherent" sequences spanning five scenes, approximately 40 actions, and five subjects; and three randomized sequences across three scenes, approximately 100 actions, and five subjects, all captured from diverse camera trajectories for broad viewpoint coverage. As a fidelity check, we report real-to-synthetic results on four tasks: image-to-3D pose, 2D keypoint detection, 2D-to-3D lifting, and person detection/segmentation. Though time and resources constrain us from an unlimited dataset, we release the UnrealPose-1M dataset, as well as the UnrealPose-Gen pipeline to support third-party generation of human pose data.
[97] WildIng: A Wildlife Image Invariant Representation Model for Geographical Domain Shift
Julian D. Santamaria,Claudia Isaza,Jhony H. Giraldo
Main category: cs.CV
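TL;DR: This paper proposes WildIng, a wildlife image invariant representation model that combines text descriptions with image features to improve the generalization of deep learning models under geographical domain shift; experiments show that it raises the accuracy of existing foundation models by 30% on cross-region datasets.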
Details
Motivation: Existing deep learning models for wildlife identification perform well when training and test data come from the same geographical region, but degrade sharply across regions, mainly because they rely on purely image-based representations that are sensitive to geography-related variation in background and lighting. A more robust cross-domain recognition method is therefore needed. Method: The WildIng model fuses text descriptions (such as species appearance) with image features to build a joint representation that is more invariant to geographical domain shift. It uses a vision-language model framework and is trained and tested across domains on datasets from different regions of Africa and America. Result: A CLIP model trained on an African dataset drops from 84.77% to 16.17% accuracy when tested on an American dataset; the text-enhanced WildIng approach improves the accuracy of the foundation model BioCLIP by about 30% in the cross-domain setting. Conclusion: By fusing textual semantic information, WildIng mitigates the performance drop caused by geographical distribution shift in wildlife identification and substantially improves cross-region generalization, offering a practical option for large-scale, non-intrusive ecological monitoring. Abstract: Wildlife monitoring is crucial for studying biodiversity loss and climate change. Camera trap images provide a non-intrusive method for analyzing animal populations and identifying ecological patterns over time. However, manual analysis is time-consuming and resource-intensive. Deep learning, particularly foundation models, has been applied to automate wildlife identification, achieving strong performance when tested on data from the same geographical locations as their training sets. Yet, despite their promise, these models struggle to generalize to new geographical areas, leading to significant performance drops. For example, training an advanced vision-language model, such as CLIP with an adapter, on an African dataset achieves an accuracy of 84.77%. However, this performance drops significantly to 16.17% when the model is tested on an American dataset. This limitation partly arises because existing models rely predominantly on image-based representations, making them sensitive to geographical data distribution shifts, such as variation in background, lighting, and environmental conditions. To address this, we introduce WildIng, a Wildlife image Invariant representation model for geographical domain shift. WildIng integrates text descriptions with image features, creating a more robust representation to geographical domain shifts. By leveraging textual descriptions, our approach captures consistent semantic information, such as detailed descriptions of the appearance of the species, improving generalization across different geographical locations.
Experiments show that WildIng enhances the accuracy of foundation models such as BioCLIP by 30% under geographical domain shift conditions. We evaluate WildIng on two datasets collected from different regions, namely America and Africa. The code and models are publicly available at https://github.com/Julian075/CATALOG/tree/WildIng.
[98] DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models
Yue Zhou,Jue Chen,Zilun Zhang,Penghui Huang,Ran Ding,Zhentao Zou,PengFei Gao,Yuchen Wei,Ke Li,Xue Yang,Xue Jiang,Hongxin Yang,Jonathan Li
Main category: cs.CV
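TL;DR: This paper proposes DVGBench, a high-quality implicit visual grounding benchmark for drone applications, and designs DroneVG-R1, a remote sensing large vision-language model that incorporates an Implicit-to-Explicit Chain-of-Thought (I2E-CoT) to improve reasoning on implicit tasks that require domain knowledge.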
Details
Motivation: Existing remote sensing visual grounding datasets rely mainly on explicit referring expressions, which makes it hard to assess models on implicit tasks that require domain knowledge and limits progress on reasoning in drone scenarios. Method: DVGBench, an implicit visual grounding benchmark covering six application scenarios, is constructed. The DroneVG-R1 model is proposed, which introduces an Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning framework to convert implicit expressions into explicit cues and reduce grounding difficulty. Result: Mainstream models show substantial reasoning limitations on both explicit and implicit tasks, while DroneVG-R1 with the I2E-CoT mechanism shows stronger reasoning and grounding on implicit VG tasks. Conclusion: DVGBench fills the gap in implicit visual grounding for remote sensing, and DroneVG-R1 demonstrates the value of domain-knowledge reasoning, pointing to an important direction for vision-language models in drone-based agents. Abstract: Remote sensing (RS) large vision-language models (LVLMs) have shown strong promise across visual grounding (VG) tasks. However, existing RS VG datasets predominantly rely on explicit referring expressions (such as relative position, relative size, and color cues), thereby constraining performance on implicit VG tasks that require scenario-specific domain knowledge. This article introduces DVGBench, a high-quality implicit VG benchmark for drones, covering six major application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Each object provides both explicit and implicit queries. Based on the dataset, we design DroneVG-R1, an LVLM that integrates the novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning paradigm. This enables the model to take advantage of scene-specific expertise, converting implicit references into explicit ones and thus reducing grounding difficulty. Finally, an evaluation of mainstream models on both explicit and implicit VG tasks reveals substantial limitations in their reasoning capabilities. These findings provide actionable insights for advancing the reasoning capacity of LVLMs for drone-based agents. The code and datasets will be released at https://github.com/zytx121/DVGBench
[99] Lightweight Channel Attention for Efficient CNNs
Prem Babu Kanaparthi,Tulasi Venkata Sri Varshini Padamata
Main category: cs.CV
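TL;DR: This paper presents an empirical study of different channel attention mechanisms (SE, ECA, and the proposed LCA) on ResNet18 and MobileNetV2, and proposes a lightweight LCA module that maintains high accuracy with good parameter efficiency and inference speed.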
Details
Motivation: The trade-off between efficiency and accuracy across channel attention mechanisms has not been fully explored, and a systematic comparison is needed to guide practical use. Method: A Lite Channel Attention (LCA) module is proposed that uses adaptive one-dimensional convolutions with grouped operations to reduce parameter count while preserving effective attention behavior; SE, ECA, and LCA are compared on the CIFAR-10 dataset. Result: LCA reaches 94.68% accuracy on ResNet18 and 93.10% on MobileNetV2, matching ECA in parameter efficiency while maintaining favorable inference latency. Conclusion: LCA is an efficient, lightweight channel attention module suited to CNN models deployed in resource-constrained environments. Abstract: Attention mechanisms have become integral to modern convolutional neural networks (CNNs), delivering notable performance improvements with minimal computational overhead. However, the efficiency-accuracy trade-off of different channel attention designs remains underexplored. This work presents an empirical study comparing Squeeze-and-Excitation (SE), Efficient Channel Attention (ECA), and a proposed Lite Channel Attention (LCA) module across ResNet-18 and MobileNetV2 architectures on CIFAR-10. LCA employs adaptive one-dimensional convolutions with grouped operations to reduce parameter usage while preserving effective attention behavior. Experimental results show that LCA achieves competitive accuracy, reaching 94.68% on ResNet-18 and 93.10% on MobileNetV2, while matching ECA in parameter efficiency and maintaining favorable inference latency. Comprehensive benchmarks including FLOPs, parameter counts, and GPU latency measurements are provided, offering practical insights for deploying attention-enhanced CNNs in resource-constrained environments.
[100] Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking
Shiao Wang,Xiao Wang,Haonan Zhao,Jiarui Xu,Bo Jiang,Lin Zhu,Xin Zhao,Yonghong Tian,Jin Tang
Main category: cs.CV
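TL;DR: Proposes an RGB-Event visual object tracking framework based on early fusion in the frequency domain, which decouples and selectively fuses frequency-domain features and adds a motion-guided spatial sparsification module to improve feature representation and reduce computation.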
Details
Motivation: Existing RGB-Event tracking methods mostly use conventional feature-level fusion, which fails to exploit the high dynamic range and motion sensitivity of event cameras and processes low-information regions inefficiently, causing redundant computation. Method: The RGB and event modalities are transformed to the frequency domain with the Fast Fourier Transform and their amplitude and phase components are decoupled. Amplitude and phase attention selectively fuse high-frequency event information. A motion-guided spatial sparsification module filters out low-information regions based on motion cues, so that only target-relevant features are fed into the backbone network. Result: Extensive experiments on three mainstream RGB-Event tracking datasets (FE108, FELT, and COESOT) show that the method outperforms existing approaches in both accuracy and efficiency. Conclusion: The proposed frequency-domain early fusion and motion-guided sparsification improve both the performance and the computational efficiency of RGB-Event tracking and offer a new direction for multimodal tracking. Abstract: Existing RGB-Event visual object tracking approaches primarily rely on conventional feature-level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion-sensitive nature of event cameras are often overlooked, while low-information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Specifically, RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High-frequency event information is selectively fused into RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion-guided spatial sparsification module leverages the motion-sensitive nature of event cameras to capture the relationship between target motion cues and spatial probability distribution, filtering out low-information regions and enhancing target-relevant features. Finally, a sparse set of target-relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method.
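The amplitude/phase decoupling step can be illustrated on a 1D signal with nothing but the standard library. This is a rough sketch under stated assumptions: it uses a naive DFT rather than an FFT, it replaces the learned amplitude and phase attention with two fixed scalar weights (`amp_weight`, `phase_weight`), and it blends all frequencies rather than selecting high-frequency ones. All function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def decouple(signal):
    """Split a signal's spectrum into per-frequency amplitude and phase."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def fuse(rgb, event, amp_weight, phase_weight):
    """Blend event amplitude and phase into the RGB signal, then return to the signal domain."""
    rgb_amp, rgb_phase = decouple(rgb)
    evt_amp, evt_phase = decouple(event)
    fused = []
    for ra, rp, ea, ep in zip(rgb_amp, rgb_phase, evt_amp, evt_phase):
        amplitude = (1 - amp_weight) * ra + amp_weight * ea
        # Blend phases as unit phasors so angle wrap-around is handled.
        phasor = ((1 - phase_weight) * cmath.rect(1.0, rp)
                  + phase_weight * cmath.rect(1.0, ep))
        fused.append(cmath.rect(amplitude, cmath.phase(phasor)))
    return [value.real for value in idft(fused)]
```

With both weights at 0 the function reproduces the RGB signal, and with both at 1 it reproduces the event signal, which is a quick sanity check that the decoupling and recombination are lossless.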
The source code of this paper will be released at https://github.com/Event-AHU/OpenEvTracking
[101] ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval
Tien-Huy Nguyen,Huu-Loc Tran,Thanh Duc Ngo
Main category: cs.CV
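TL;DR: This paper proposes ITSELF, an attention-guided framework for implicit local alignment in text-based person search, which achieves fine-grained alignment without extra supervision and reaches state-of-the-art performance on multiple benchmarks.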
Details
Motivation: Existing methods are prone to shortcut learning and spurious correlations, and injecting prior knowledge can distort intra-modality structure. The authors find that encoder attention already localizes spatially precise evidence early in training, which motivates a more reliable alignment mechanism. Method: The ITSELF framework includes the GRAB module, which converts the model's own attention into an Attentive Bank of high-saliency tokens and applies local objectives on it; the MARS module, which aggregates attention across layers and performs diversity-aware selection; and the ATS module, which schedules how much context is retained during training, moving from coarse to fine to focus progressively on discriminative details. Result: The method achieves state-of-the-art performance on three mainstream TBPS benchmarks and shows strong cross-dataset generalization. Conclusion: By using the model's own attention, ITSELF achieves implicit local alignment without external supervision, alleviates misalignment, and is robust and effective for text-based person search. Abstract: Vision Language Models (VLMs) have rapidly advanced and show strong promise for text-based person search (TBPS), a task that requires capturing fine-grained relationships between images and text to distinguish individuals. Previous methods address these challenges through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra-modality structure. Motivated by our finding that encoder attention surfaces spatially precise evidence from the earliest training epochs, and to alleviate these issues, we introduce ITSELF, an attention-guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model's own attention into an Attentive Bank of high-saliency tokens and applies local objectives on this bank, learning fine-grained correspondences without extra supervision. To make the selection reliable and non-redundant, we introduce Multi-Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity-aware top-k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks show state-of-the-art performance and strong cross-dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision.
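The selection and scheduling ideas behind MARS and ATS can be sketched in a few lines. This is only an illustration under simplifying assumptions: attention is aggregated by a plain mean over layers, the top-k step is not diversity-aware, and the budget follows a linear schedule. The abstract specifies none of these details, and all function names are hypothetical.

```python
def aggregate_attention(layers):
    """Average per-token attention scores across layers (one list of scores per layer)."""
    n_layers = len(layers)
    return [sum(column) / n_layers for column in zip(*layers)]


def top_k_tokens(scores, k):
    """Return the indices of the k highest-scoring tokens, in token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def retention_budget(epoch, total_epochs, k_start, k_end):
    """Linearly move the token budget from k_start (coarse) to k_end (fine)."""
    if total_epochs <= 1:
        return k_end
    fraction = epoch / (total_epochs - 1)
    return round(k_start + fraction * (k_end - k_start))


# Example: two layers scoring four tokens, keeping fewer tokens as training proceeds.
layers = [[0.1, 0.4, 0.3, 0.2], [0.3, 0.2, 0.4, 0.1]]
scores = aggregate_attention(layers)
for epoch in range(3):
    k = retention_budget(epoch, total_epochs=3, k_start=4, k_end=2)
    print(epoch, top_k_tokens(scores, k))
```

A large early budget keeps most tokens and therefore most context; shrinking it later concentrates the local objectives on the highest-saliency tokens.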
Our project is publicly available at https://trhuuloc.github.io/itself
[102] Enhanced Leukemic Cell Classification Using Attention-Based CNN and Data Augmentation
Douglas Costa Braga,Daniel Oliveira Dantas
Main category: cs.CV
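TL;DR: Proposes an attention-based deep learning pipeline for leukemic cell classification that combines EfficientNetV2-B3 with Squeeze-and-Excitation modules. On the C-NMC 2019 dataset it reaches a 97.89% F1-score and accuracy, clearly outperforming existing methods while using 89% fewer parameters than VGG16, and offers good interpretability and potential for clinical deployment.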
Details
Motivation: Diagnosis of acute lymphoblastic leukemia (ALL) relies on microscopic examination, which suffers from inter-observer variability and is time-consuming; an automated, reproducible, and robust classification system is needed to improve diagnostic consistency and efficiency. Method: An attention-based convolutional neural network combining EfficientNetV2-B3 with Squeeze-and-Excitation modules, with comprehensive data augmentation, focal loss to handle class imbalance, and patient-wise data splitting to ensure reliable and reproducible evaluation. Result: A 97.89% F1-score and accuracy on the C-NMC 2019 dataset; Monte Carlo validation shows a significant improvement (p < 0.001), with gains of up to 4.67% over baseline methods and only 11% of the parameters of VGG16 (15.2M vs. 138M). Conclusion: The pipeline combines high accuracy with good interpretability and computational efficiency; its attention mechanism visualizes key diagnostic features, making it suitable for automated leukemic cell classification in clinical settings. Abstract: We present a reproducible deep learning pipeline for leukemic cell classification, focusing on system architecture, experimental robustness, and software design choices for medical image analysis. Acute lymphoblastic leukemia (ALL) is the most common childhood cancer, requiring expert microscopic diagnosis that suffers from inter-observer variability and time constraints. The proposed system integrates an attention-based convolutional neural network combining EfficientNetV2-B3 with Squeeze-and-Excitation mechanisms for automated ALL cell classification. Our approach employs comprehensive data augmentation, focal loss for class imbalance, and patient-wise data splitting to ensure robust and reproducible evaluation. On the C-NMC 2019 dataset (12,528 original images from 62 patients), the system achieves a 97.89% F1-score and 97.89% accuracy on the test set, with statistical validation through 100-iteration Monte Carlo experiments confirming significant improvements (p < 0.001) over baseline methods. The proposed pipeline outperforms existing approaches by up to 4.67% while using 89% fewer parameters than VGG16 (15.2M vs. 138M). The attention mechanism provides interpretable visualizations of diagnostically relevant cellular features, demonstrating that modern attention-based architectures can improve leukemic cell classification while maintaining computational efficiency suitable for clinical deployment.
[103] Mono3DV: Monocular 3D Object Detection with 3D-Aware Bipartite Matching and Variational Query DeNoising
Kiet Dang Vu,Trung Thai Tran,Kien Nguyen Do Trung,Duc Dung Nguyen
Main category: cs.CV
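TL;DR: This paper proposes Mono3DV, a new Transformer-based framework for monocular 3D object detection that markedly improves performance through 3D-aware bipartite matching and an improved denoising mechanism.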
Details
Motivation: Existing DETR-like methods for monocular 3D object detection ignore the role of 3D attributes in the matching process, which suppresses high-quality 3D predictions and destabilizes training. Method: A 3D-Aware Bipartite Matching strategy incorporates 3D geometric information into the matching cost; a 3D-DeNoising scheme stabilizes training; and a Variational Query DeNoising mechanism mitigates vanishing gradients. Result: State-of-the-art performance on the KITTI 3D detection benchmark without any external data. Conclusion: By bringing 3D information into both matching and denoising, Mono3DV effectively addresses the instability and performance bottlenecks of monocular 3D detection. Abstract: While DETR-like architectures have demonstrated significant potential for monocular 3D object detection, they are often hindered by a critical limitation: the exclusion of 3D attributes from the bipartite matching process. This exclusion arises from the inherently ill-posed nature of 3D estimation from a monocular image, which introduces instability during training. Consequently, high-quality 3D predictions can be erroneously suppressed by 2D-only matching criteria, leading to suboptimal results. To address this, we propose Mono3DV, a novel Transformer-based framework. Our approach introduces three key innovations. First, we develop a 3D-Aware Bipartite Matching strategy that directly incorporates 3D geometric information into the matching cost, resolving the misalignment caused by purely 2D criteria. Second, the bipartite matching must be stabilized to resolve the instability that occurs when 3D attributes are integrated; we therefore propose a 3D-DeNoising scheme for the training phase. Finally, recognizing the gradient vanishing issue associated with conventional denoising techniques, we propose a novel Variational Query DeNoising mechanism to overcome this limitation, which significantly enhances model performance. Without leveraging any external data, our method achieves state-of-the-art results on the KITTI 3D object detection benchmark.
[104] Deepfake Detection with Multi-Artifact Subspace Fine-Tuning and Selective Layer Masking
Xiang Zhang,Wenliang Weng,Daoyong Fu,Ziqiang Li,Zhangjie Fu
Main category: cs.CV
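TL;DR: This paper proposes MASM, a deepfake detection method based on multi-artifact subspaces and selective layer masks, which decouples semantic and artifact representations to improve generalization in cross-dataset scenarios.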
Details
Motivation: Existing methods tend to disrupt the semantic structure of pretrained models when adapting to new forgery artifacts, making it hard to model diverse artifacts while keeping semantics stable. Method: Singular value decomposition splits pretrained weights into a stable semantic principal subspace and multiple learnable artifact subspaces; a selective layer mask strategy dynamically regulates each layer's updates; and orthogonality and spectral consistency constraints regularize the learning of the artifact subspaces. Result: The method shows stronger robustness and generalization in cross-dataset and complex real-world scenarios and effectively mitigates overfitting to a single forgery characteristic. Conclusion: By separating semantic and artifact representations and constraining how they are learned, MASM achieves more stable and more generalizable deepfake detection and offers a new approach to handling diverse forgery techniques. Abstract: Deepfake detection still faces significant challenges in cross-dataset and real-world complex scenarios. The root cause lies in the high diversity of artifact distributions introduced by different forgery methods, while pretrained models tend to disrupt their original general semantic structures when adapting to new artifacts. Existing approaches usually rely on indiscriminate global parameter updates or introduce additional supervision signals, making it difficult to effectively model diverse forgery artifacts while preserving semantic stability. To address these issues, this paper proposes a deepfake detection method based on Multi-Artifact Subspaces and selective layer masks (MASM), which explicitly decouples semantic representations from artifact representations and constrains the fitting strength of artifact subspaces, thereby improving generalization robustness in cross-dataset scenarios. Specifically, MASM applies singular value decomposition to model weights, partitioning pretrained weights into a stable semantic principal subspace and multiple learnable artifact subspaces. This design enables decoupled modeling of different forgery artifact patterns while preserving the general semantic subspace. On this basis, a selective layer mask strategy is introduced to adaptively regulate the update behavior of corresponding network layers according to the learning state of each artifact subspace, suppressing overfitting to any single forgery characteristic.
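The SVD-based partition described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `split_subspaces`, the choice to assign the top `principal_rank` singular directions to the semantic subspace, and the even split of the remaining directions into artifact groups are all hypothetical.

```python
import numpy as np


def split_subspaces(weight, principal_rank, num_artifact):
    """Split a weight matrix into one principal part and several residual parts.

    Assumption: the largest singular directions form the "semantic" subspace,
    and the remaining directions are divided evenly into "artifact" subspaces.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)

    # Rank-r reconstruction from the top singular triplets (kept frozen in this sketch).
    principal = (u[:, :principal_rank] * s[:principal_rank]) @ vt[:principal_rank]

    # Remaining singular directions, grouped into the requested number of subspaces.
    rest = np.arange(principal_rank, s.shape[0])
    groups = np.array_split(rest, num_artifact)
    artifacts = [(u[:, g] * s[g]) @ vt[g] for g in groups]
    return principal, artifacts


rng = np.random.default_rng(0)
w = rng.standard_normal((6, 4))
principal, artifacts = split_subspaces(w, principal_rank=2, num_artifact=2)

# The parts add back up to the original weights.
reconstruction_ok = np.allclose(principal + sum(artifacts), w)
```

Because the parts are built from disjoint sets of singular vectors, they are mutually orthogonal at initialization; a training procedure would then need its own mechanism to keep the learnable parts from drifting into the frozen one.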
Furthermore, orthogonality constraints and spectral consistency constraints are imposed to jointly regularize multiple artifact subspaces, guiding them to learn complementary and diverse artifact representations while maintaining a stable overall spectral structure.
[105] Evaluating transfer learning strategies for improving dairy cattle body weight prediction in small farms using depth-image and point-cloud data
Jin Wang,Angelo De Castro,Yuxi Zhang,Lucas Basolli Borsatto,Yuechen Guo,Victoria Bastos Primo,Ana Beatriz Montevecchio Bernardino,Gota Morota,Ricardo C Chebel,Haipeng Yu
Main category: cs.CV
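TL;DR: This study evaluates how well transfer learning from a model pretrained on a large farm predicts dairy cattle body weight on a small farm, and compares depth-image and point-cloud approaches. Transfer learning markedly improved prediction accuracy on the small farm, and the two data modalities showed no clear performance difference.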
Details
Motivation: In livestock applications, the effectiveness of transfer learning and the best fine-tuning strategies remain unclear, particularly beyond the use of pretrained ImageNet or COCO weights; direct comparisons of depth images and 3D point clouds for dairy cattle body weight prediction are also lacking. Method: Using top-view depth images and point-cloud data from 1,201, 215, and 58 cows at three dairy farms of different sizes, four deep learning models are evaluated (ConvNeXt and MobileViT for depth images, PointNet and DGCNN for point clouds), comparing transfer learning, single-source learning, and joint learning under three experimental designs. Result: Transfer learning markedly improved body weight prediction on the small farm for all four models, outperforming single-source learning and matching or exceeding joint learning; no consistent performance difference was observed between depth-image and point-cloud models. Conclusion: Transfer learning suits small farms with limited data and is advantageous when cross-farm data sharing is restricted, since it requires only pretrained model weights rather than raw data; depth-image and point-cloud methods perform comparably for body weight prediction. Abstract: Computer vision provides automated, non-invasive, and scalable tools for monitoring dairy cattle, thereby supporting management, health assessment, and phenotypic data collection. Although transfer learning is commonly used for predicting body weight from images, its effectiveness and optimal fine-tuning strategies remain poorly understood in livestock applications, particularly beyond the use of pretrained ImageNet or COCO weights. In addition, while both depth images and three-dimensional point-cloud data have been explored for body weight prediction, direct comparisons of these two modalities in dairy cattle are limited. Therefore, the objectives of this study were to 1) evaluate whether transfer learning from a large farm enhances body weight prediction on a small farm with limited data, and 2) compare the predictive performance of depth-image- and point-cloud-based approaches under three experimental designs. Top-view depth images and point-cloud data were collected from 1,201, 215, and 58 cows at large, medium, and small dairy farms, respectively. Four deep learning models were evaluated: ConvNeXt and MobileViT for depth images, and PointNet and DGCNN for point clouds. Transfer learning markedly improved body weight prediction on the small farm across all four models, outperforming single-source learning and achieving gains comparable to or greater than joint learning. These results indicate that pretrained representations generalize well across farms with differing imaging conditions and dairy cattle populations.
No consistent performance difference was observed between depth-image- and point-cloud-based models. Overall, these findings suggest that transfer learning is well suited for small farm prediction scenarios where cross-farm data sharing is limited by privacy, logistical, or policy constraints, as it requires access only to pretrained model weights rather than raw data.
[106] EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos
Hongming Fu,Wenjia Wang,Xiaozhen Qiao,Shuo Yang,Zheng Liu,Bo Zhao
Main category: cs.CV
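TL;DR: This paper proposes EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from in-the-wild egocentric monocular videos, addressing reconstruction under dynamic cameras, severe occlusion, and motion.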
Details
Motivation: Existing methods are limited to single images or camera coordinates, cannot model temporal dynamics or globally consistent trajectories, and overlook object poses and interaction constraints, which limits their performance in real-world scenes. Method: A multi-stage framework comprising a robust preprocessing pipeline built on spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Result: Experiments show state-of-the-art performance on W-HOI reconstruction; the method is template-free and scales to multiple objects. Conclusion: EgoGrasp is the first to stably reconstruct world-space hand-object interactions from monocular video in dynamic real-world environments, advancing applications in embodied intelligence and virtual reality. Abstract: We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints. Their performance also suffers under the severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework with a robust preprocessing pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scalable to multiple objects. In experiments, our method achieves state-of-the-art performance in W-HOI reconstruction.
[107] Enhancing Histopathological Image Classification via Integrated HOG and Deep Features with Robust Noise Performance
Ifeanyi Ezuma,Ugochukwu Ugwu
Main category: cs.CV
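TL;DR: This study evaluates the classification performance of machine learning and deep learning models on the LC25000 histopathological image dataset, using a fine-tuned InceptionResNet-v2 for classification and feature extraction; deep features markedly improve model performance and robustness.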
Details
Motivation: As digital pathology advances, automated image analysis is becoming essential in clinical practice, requiring efficient and accurate models to support pathological diagnosis. Method: A fine-tuned InceptionResNet-v2 network serves as both classifier and feature extractor; traditional machine learning models classify the extracted deep features, and model robustness is evaluated under different signal-to-noise ratio (SNR) conditions. Result: The fine-tuned InceptionResNet-v2 reached 96.01% accuracy and 96.8% average AUC; a neural network trained on its deep features reached 99.84% accuracy and 99.99% AUC; combining HOG with deep features further improved performance, though less so in noisy conditions. Conclusion: Deep features markedly improve classification performance and robustness, with GBM and KNN showing the strongest noise resilience, confirming the advantage of deep features for histopathological image analysis. Abstract: The era of digital pathology has advanced histopathological examinations, making automated image analysis essential in clinical practice. This study evaluates the classification performance of machine learning and deep learning models on the LC25000 dataset, which includes five classes of histopathological images. We used the fine-tuned InceptionResNet-v2 network both as a classifier and for feature extraction. Our results show that the fine-tuned InceptionResNet-v2 achieved a classification accuracy of 96.01% and an average AUC of 96.8%. Models trained on deep features from InceptionResNet-v2 outperformed those using only the pre-trained network, with the Neural Network model achieving an AUC of 99.99% and accuracy of 99.84%. Evaluating model robustness under varying SNR conditions revealed that models using deep features exhibited greater resilience, particularly GBM and KNN. The combination of HOG and deep features showed enhanced performance, though less so in noisy environments.
[108] Efficient Hyperspectral Image Reconstruction Using Lightweight Separate Spectral Transformers
Jianan Li,Wangcai Zhao,Tingfa Xu
Main category: cs.CV
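TL;DR: Proposes the Lightweight Separate Spectral Transformer (LSST), an architecture for efficiently reconstructing hyperspectral images from compressive sensing measurements; by exploiting spectral and spatial characteristics, it strikes a favorable balance among performance, computation, and parameter count.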
Details
Motivation: Reconstructing hyperspectral images from compressive sensing measurements is challenging in both efficiency and accuracy, and existing methods struggle to balance computational cost against the ability to model complex spectral-spatial features. Method: A divide-and-conquer strategy: Separate Spectral Transformer Blocks (SSTB) model spectral relationships using Grouped Spectral Self-attention and a Spectrum Shuffle operation, Lightweight Spatial Convolution Blocks (LSCB) process spatial information with depth-wise separable convolutions, and a Focal Spectrum Loss dynamically adjusts the loss weight of each band during training. Result: LSST achieves better reconstruction than existing methods on multiple datasets while markedly reducing FLOPs and parameter count. Conclusion: With purpose-built spectral and spatial branches and a new loss function, LSST improves the efficiency and quality of compressive-sensing hyperspectral image reconstruction and shows good application potential. Abstract: Hyperspectral imaging (HSI) is essential across various disciplines for its capacity to capture rich spectral information. However, efficiently reconstructing hyperspectral images from compressive sensing measurements presents significant challenges. To tackle these, we adopt a divide-and-conquer strategy that capitalizes on the unique spectral and spatial characteristics of hyperspectral images. We introduce the Lightweight Separate Spectral Transformer (LSST), an innovative architecture tailored for efficient hyperspectral image reconstruction. This architecture consists of Separate Spectral Transformer Blocks (SSTB) for modeling spectral relationships and Lightweight Spatial Convolution Blocks (LSCB) for spatial processing. The SSTB employs Grouped Spectral Self-attention and a Spectrum Shuffle operation to effectively manage both local and non-local spectral relationships. Simultaneously, the LSCB utilizes depth-wise separable convolutions and strategic ordering to enhance spatial information processing. Furthermore, we implement the Focal Spectrum Loss, a novel loss weighting mechanism that dynamically adjusts during training to improve reconstruction across spectrally complex bands. Extensive testing demonstrates that our LSST achieves superior performance while requiring fewer FLOPs and parameters, underscoring its efficiency and effectiveness. The source code is available at: https://github.com/wcz1124/LSST.
[109] A UAV-Based Multispectral and RGB Dataset for Multi-Stage Paddy Crop Monitoring in Indian Agricultural Fields
Adari Rama Sukanya,Puvvula Roopesh Naga Sri Sai,Kota Moses,Rimalapudi Sarvendranath
Main category: cs.CV
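TL;DR: This paper presents a large-scale UAV-based RGB and multispectral image dataset covering every growth stage, from nursery to harvest, of paddy fields in the Vijayawada region of Andhra Pradesh, India.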
Details
Motivation: To support research on Indian paddy crops across growth stages, such as targeted spraying, disease analysis, and yield estimation, by providing a public, high-resolution, metadata-rich dataset. Method: Images were captured with a 20-megapixel RGB camera and a 5-megapixel four-band multispectral camera (red, green, red-edge, near-infrared); a standard operating procedure (SOP) was developed to ensure repeatable acquisition, and images were validated with Pix4D Fields to generate orthomosaic maps and vegetation index maps such as NDVI and NDRE. Result: The dataset contains 42,430 raw images (415 GB) covering 5 acres at a ground sampling distance of 1 cm/pixel, with metadata including GPS coordinates, flight altitude, and environmental conditions. Conclusion: It is one of the few high-resolution image datasets covering the full growth cycle of Indian paddy crops, is publicly released on IEEE DataPort, and has substantial research and application value. Abstract: We present a large-scale unmanned aerial vehicle (UAV)-based RGB and multispectral image dataset collected over paddy fields in the Vijayawada region, Andhra Pradesh, India, covering nursery to harvesting stages. We used a 20-megapixel RGB camera and a 5-megapixel four-band multispectral camera capturing red, green, red-edge, and near-infrared bands. A standardised operating procedure (SOP) and checklists were developed to ensure repeatable data acquisition. Our dataset comprises 42,430 raw images (415 GB) captured over 5 acres at a ground sampling distance (GSD) of 1 cm/pixel, with associated metadata such as GPS coordinates, flight altitude, and environmental conditions. Captured images were validated using Pix4D Fields to generate orthomosaic maps and vegetation index maps, such as the normalised difference vegetation index (NDVI) and the normalised difference red-edge (NDRE) index. Our dataset is one of the few that provide high-resolution images with rich metadata covering all growth stages of Indian paddy crops. The dataset is available on IEEE DataPort with DOI, . It can support studies on targeted spraying, disease analysis, and yield estimation.
[110] Luminark: Training-free, Probabilistically-Certified Watermarking for General Vision Generative Models
Jiayi Xu,Zhang Zhang,Yuanrui Zhang,Ruitao Chen,Yixian Xu,Tianyu He,Di He
Main category: cs.CV
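TL;DR: This paper proposes Luminark, a training-free, probabilistically certified watermarking method for general vision generative models that uses patch-level luminance statistics to achieve high detection accuracy and strong robustness.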
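The patch-level luminance test behind this entry can be illustrated with a short Python sketch. It is a hypothetical reading, not the paper's detector: the function names, the use of the patch mean as the luminance statistic, and the null model in which each patch matches its target bit independently with probability 0.5 (giving a binomial tail as the false positive rate) are assumptions made for illustration.

```python
from math import comb


def patch_means(image, patch):
    """Mean luminance of each non-overlapping patch, in row-major order.

    `image` is a 2D list of luminance values; the patch mean is an assumed statistic.
    """
    rows, cols = len(image), len(image[0])
    means = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            block = [image[i][j]
                     for i in range(r, r + patch)
                     for j in range(c, c + patch)]
            means.append(sum(block) / len(block))
    return means


def matched_bits(means, thresholds, pattern):
    """Count patches whose thresholded luminance equals the target bit."""
    return sum(int(m > t) == b for m, t, b in zip(means, thresholds, pattern))


def false_positive_rate(n, k, p=0.5):
    """P[Binomial(n, p) >= k]: chance that an unwatermarked image matches >= k bits.

    Assumes independent patches that each match with probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


image = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
means = patch_means(image, patch=2)
hits = matched_bits(means, thresholds=[128] * 4, pattern=[1, 0, 0, 1])
fpr = false_positive_rate(n=4, k=hits)
```

With only four patches the bound is weak; under the same assumptions the tail shrinks quickly as the number of patches grows, which is what would make a detection threshold certifiable.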
Details
Motivation: Existing watermarking methods for vision generative models lack generality, require additional training, and struggle to guarantee reliable detection. Method: A new watermark definition based on patch-level luminance statistics: a binary pattern and luminance thresholds are predefined, and the watermark is detected by checking whether each patch's luminance exceeds its threshold; guidance is used as a plug-and-play mechanism for seamless watermark injection across paradigms. Result: Evaluated on nine state-of-the-art generative models spanning diffusion, autoregressive, and hybrid frameworks, Luminark shows high detection accuracy, strong robustness to common image transformations, and good visual quality. Conclusion: Luminark is a general, training-free, probabilistically certifiable watermarking method that applies to a wide range of vision generative models and balances detection reliability with generation quality. Abstract: In this paper, we introduce Luminark, a training-free and probabilistically-certified watermarking method for general vision generative models. Our approach is built upon a novel watermark definition that leverages patch-level luminance statistics. Specifically, the service provider predefines a binary pattern together with corresponding patch-level thresholds. To detect a watermark in a given image, we evaluate whether the luminance of each patch surpasses its threshold and then verify whether the resulting binary pattern aligns with the target one. A simple statistical analysis demonstrates that the false positive rate of the proposed method can be effectively controlled, thereby ensuring certified detection. To enable seamless watermark injection across different paradigms, we leverage the widely adopted guidance technique as a plug-and-play mechanism and develop the watermark guidance. This design enables Luminark to achieve generality across state-of-the-art generative models without compromising image quality. Empirically, we evaluate our approach on nine models spanning diffusion, autoregressive, and hybrid frameworks. Across all evaluations, Luminark consistently demonstrates high detection accuracy, strong robustness against common image transformations, and good performance on visual quality.
[111] 600k-ks-ocr: a large-scale synthetic dataset for optical character recognition in kashmiri script
Haq Nawaz Malik
Main category: cs.CV
TL;DR: This paper presents 600K-KS-OCR, a large-scale synthetic dataset for training and evaluating optical character recognition systems for Kashmiri script.
Details
Motivation: To address the scarcity of text recognition resources for Kashmiri, an endangered language. Method: About 602,000 word-level images are generated with three traditional Kashmiri typefaces, with data augmentation simulating real-world document degradation and annotations provided in multiple formats compatible with models such as CRNN and TrOCR. Result: The dataset contains about 602,000 images of 256x64 pixels, distributed across ten archives totaling about 10.6 GB, and is released under the CC-BY-4.0 license. Conclusion: The dataset provides important support for OCR research on low-resource languages, Kashmiri in particular, and helps improve model robustness in real-world settings. Abstract: This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language spoken by approximately seven million people and written in a modified Perso-Arabic script. Each image is rendered at 256x64 pixels with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and general-purpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.
[112] NarrativeTrack: Evaluating Video Language Models Beyond the Frame
Hyeonjeong Ha,Jinjin Ge,Bo Feng,Kaixin Ma,Gargi Chakraborty
Main category: cs.CV
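TL;DR: This paper presents NarrativeTrack, the first benchmark for evaluating narrative understanding of videos in multimodal large language models (MLLMs); through fine-grained entity-centric reasoning and the progressive-complexity CRP framework, it exposes weaknesses of current models in temporal coherence and entity tracking.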
Details
Motivation: Existing vision-language benchmarks focus mainly on short clips or scene-level semantics and lack a systematic evaluation of understanding narratives that unfold over time in video, especially the persistent representation of entities in dynamic contexts. Method: The Compositional Reasoning Progression (CRP) framework progressively increases narrative complexity along three dimensions (entity existence, entity changes, and entity ambiguity), supported by an automated entity-centric pipeline that extracts temporally grounded entity representations. Result: Current MLLMs struggle to track entities robustly across visual transitions and temporal dynamics; open-source general-purpose models have strong perception but weak temporal coherence, while video-specific models capture temporal context but tend to hallucinate context. Conclusion: Narrative understanding requires combining perceptual grounding with temporal reasoning; NarrativeTrack provides the first systematic framework for diagnosing and improving temporally grounded narrative understanding in MLLMs. Abstract: Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, maintaining coherent entity representations across dynamic visual and temporal contexts. We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs through fine-grained entity-centric reasoning. Unlike existing benchmarks limited to short clips or coarse scene-level semantics, we decompose videos into constituent entities and examine their continuity via a Compositional Reasoning Progression (CRP), a structured evaluation framework that progressively increases narrative complexity across three dimensions: entity existence, entity changes, and entity ambiguity. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning. A fully automated entity-centric pipeline enables scalable extraction of temporally grounded entity representations, providing the foundation for CRP. Evaluations of state-of-the-art MLLMs reveal that models fail to robustly track entities across visual transitions and temporal dynamics, often hallucinating identity under context shifts. Open-source general-purpose MLLMs exhibit strong perceptual grounding but weak temporal coherence, while video-specific MLLMs capture temporal context yet hallucinate entity contexts.
These findings uncover a fundamental trade-off between perceptual grounding and temporal reasoning, indicating that narrative understanding emerges only from their integration. NarrativeTrack provides the first systematic framework to diagnose and advance temporally grounded narrative comprehension in MLLMs.
[113] Evolving CNN Architectures: From Custom Designs to Deep Residual Models for Diverse Image Classification and Detection Tasks
Mahmudul Hasan,Mabsur Fatin Bin Hossain
Main category: cs.CV
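TL;DR: This paper compares a custom convolutional neural network (CNN) with widely used pretrained and transfer learning models on five real-world image datasets spanning binary classification, fine-grained multiclass classification, and object detection. It analyzes how structural factors such as network depth and residual connections affect performance, and shows that the proposed architecture adapts to detecting unauthorized auto-rickshaws.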
Details
Motivation: To offer practical guidance for choosing CNN architectures for visual tasks of varying complexity, and to address the performance gap between custom and pretrained models in practice. Method: A custom CNN architecture is designed and compared experimentally with several pretrained and transfer learning models; classification and localization performance is evaluated on five real-world image datasets, analyzing the effects of network depth, residual connections, and feature extraction strategies. Result: Deeper CNNs perform better on fine-grained multiclass tasks, while lightweight pretrained models remain efficient for binary classification; the proposed architecture extends successfully to object detection and adapts well to identifying unauthorized auto-rickshaws. Conclusion: The CNN architecture should be chosen according to task complexity and resource constraints: complex tasks benefit from deep custom models, while simple tasks are well served by lightweight pretrained models. Abstract: This paper presents a comparative study of a custom convolutional neural network (CNN) architecture against widely used pretrained and transfer learning CNN models across five real-world image datasets. The datasets span binary classification, fine-grained multiclass recognition, and object detection scenarios. We analyze how architectural factors, such as network depth, residual connections, and feature extraction strategies, influence classification and localization performance. The results show that deeper CNN architectures provide substantial performance gains on fine-grained multiclass datasets, while lightweight pretrained and transfer learning models remain highly effective for simpler binary classification tasks. Additionally, we extend the proposed architecture to an object detection setting, demonstrating its adaptability in identifying unauthorized auto-rickshaws in real-world traffic scenes. Building upon a systematic analysis of custom CNN architectures alongside pretrained and transfer learning models, this study provides practical guidance for selecting suitable network designs based on task complexity and resource constraints.
[114] Histogram Assisted Quality Aware Generative Model for Resolution Invariant NIR Image Colorization
Abhinav Attri,Rajeev Ranjan Dwivedi,Samiran Das,Vinod Kumar Kurmi
Main category: cs.CV
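TL;DR: This paper proposes HAQAGen, a unified generative model for resolution-invariant NIR-to-RGB image colorization that combines global color-statistics alignment, local hue-saturation priors, and texture-aware supervision to achieve chromatic realism while preserving structural fidelity.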
Details
Motivation: Existing NIR-to-RGB colorization methods struggle to maintain both chromatic realism and sharp detail across resolutions, and do not jointly model global color statistics and local chromatic consistency. Method: HAQAGen (i) combines differentiable histogram matching, a perceptual quality measure, and a feature-similarity loss to align global color statistics and preserve texture; (ii) injects local hue-saturation priors via SPADE to stabilize chromatic reconstruction; (iii) adds texture-aware supervision within a Mamba backbone; and includes an adaptive-resolution inference engine for high-resolution translation. Result: On the FANVID, OMSIV, VCIP2020, and RGB2NIR datasets it clearly outperforms existing methods, with higher perceptual metric scores and images with sharper textures and more natural colors. Conclusion: HAQAGen is a scalable and effective NIR-to-RGB translation approach that adapts to diverse imaging scenarios and native resolutions without sacrificing texture fidelity or generalization. Abstract: We present HAQAGen, a unified generative model for resolution-invariant NIR-to-RGB colorization that balances chromatic realism with structural fidelity. The proposed model introduces (i) a combined loss term that aligns global color statistics through differentiable histogram matching, a perceptual image quality measure, and feature-based similarity to preserve texture information, (ii) local hue-saturation priors injected via Spatially Adaptive Denormalization (SPADE) to stabilize chromatic reconstruction, and (iii) texture-aware supervision within a Mamba backbone to preserve fine details. We introduce an adaptive-resolution inference engine that further enables high-resolution translation without sacrificing quality. Our proposed NIR-to-RGB translation model simultaneously enforces global color statistics and local chromatic consistency, while scaling to native resolutions without compromising texture fidelity or generalization. Extensive evaluations on FANVID, OMSIV, VCIP2020, and RGB2NIR using different evaluation metrics demonstrate consistent improvements over state-of-the-art baseline methods. HAQAGen produces images with sharper textures and more natural colors, with significant gains on perceptual metrics. These results position HAQAGen as a scalable and effective solution for NIR-to-RGB translation across diverse imaging scenarios. Project Page: https://rajeev-dw9.github.io/HAQAGen/
[115] Cross-Layer Attentive Feature Upsampling for Low-latency Semantic Segmentation
Tianheng Cheng,Xinggang Wang,Junchao Liao,Wenyu Liu
Main category: cs.CV
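TL;DR: Proposes Guided Attentive Interpolation (GAI), a new method for efficient and accurate semantic segmentation that adaptively fuses high- and low-resolution features and sets a new state of the art for low-latency semantic segmentation on multiple datasets.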
Details
Motivation: Existing low-resolution feature interpolation methods such as bilinear interpolation suffer from feature misalignment and insufficient context, and enriching the semantics of high-resolution features is computationally expensive, making low-latency inference hard to achieve. Method: Guided Attentive Interpolation (GAI) uses the spatial and semantic relations between features of different resolutions to adaptively interpolate semantically rich high-resolution features, and can be integrated into any convolutional network. Result: The GAI-based GAIN models reach 78.8 mIoU at 22.3 FPS on Cityscapes and 80.6 mIoU at 64.5 FPS on CamVid (on an NVIDIA 1080Ti), clearly outperforming existing low-latency methods. Conclusion: GAI resolves the tension between feature alignment and semantic richness, improving segmentation quality while keeping inference efficient, and offers a new solution for low-latency scenarios. Abstract: Semantic segmentation is a fundamental problem in computer vision and it requires high-resolution feature maps for dense prediction. Current coordinate-guided low-resolution feature interpolation methods, e.g., bilinear interpolation, produce coarse high-resolution features which suffer from feature misalignment and insufficient context information. Moreover, enriching the semantics of high-resolution features imposes a high computational burden, making it challenging to meet the requirement of low-latency inference. We propose a novel Guided Attentive Interpolation (GAI) method to adaptively interpolate fine-grained high-resolution features with semantic features to tackle these issues. Guided Attentive Interpolation determines both spatial and semantic relations of pixels from features of different resolutions and then leverages these relations to interpolate high-resolution features with rich semantics. GAI can be integrated with any deep convolutional network for efficient semantic segmentation. In experiments, the GAI-based semantic segmentation networks, i.e., GAIN, achieve 78.8 mIoU at 22.3 FPS on Cityscapes and 80.6 mIoU at 64.5 FPS on CamVid using an NVIDIA 1080Ti GPU, which are the new state-of-the-art results for low-latency semantic segmentation. Code and models are available at: https://github.com/hustvl/simpleseg.
[116] CardioMOD-Net: A Modal Decomposition-Neural Network Framework for Diagnosis and Prognosis of HFpEF from Echocardiography Cine Loops
Andrés Bell-Navas,Jesús Garicano-Mena,Antonella Ausiello,Soledad Le Clainche,María Villalba-Orero,Enrique Lara-Pezzi
Main category: cs.CV
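TL;DR: This study proposes CardioMOD-Net, a unified AI framework that performs multiclass HFpEF diagnosis and continuous prediction of onset time from standard mouse echocardiography videos.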
Details
Motivation: HFpEF arises from multiple comorbidities and passes through a long subclinical stage, making early diagnosis and prognosis difficult; existing AI models are limited to binary detection and cannot provide comorbidity-specific phenotyping or estimates of disease progression timing. Method: Two-dimensional parasternal long-axis cine loops from four groups of mice (control, hyperglycaemic, obesity, hypertension) are decomposed with Higher Order Dynamic Mode Decomposition (HODMD) to extract temporal features, and a shared latent representation feeds two Vision Transformer modules: one for classification-based diagnosis and one for regression of the age at HFpEF onset. Result: Overall diagnostic accuracy across the four groups was 65%, with every class above 50%; misclassifications mainly occurred in the early-stage overlap between OB or SAH and the control group. The prognostic module predicted onset time with a root-mean-square error of 21.72 weeks, was most accurate for the OB and SAH groups, and produced predicted onset distributions that closely matched the true ones. Conclusion: CardioMOD-Net achieves multiclass phenotyping and continuous prediction of HFpEF onset time from a single cine loop, shows promise even with small samples, and provides a new foundation for integrating diagnostic and prognostic modelling. Abstract: Introduction: Heart failure with preserved ejection fraction (HFpEF) arises from diverse comorbidities and progresses through prolonged subclinical stages, making early diagnosis and prognosis difficult. Current echocardiography-based Artificial Intelligence (AI) models focus primarily on binary HFpEF detection in humans and do not provide comorbidity-specific phenotyping or temporal estimates of disease progression towards decompensation. We aimed to develop a unified AI framework, CardioMOD-Net, to perform multiclass diagnosis and continuous prediction of HFpEF onset directly from standard echocardiography cine loops in preclinical models. Methods: Mouse echocardiography videos from four groups were used: control (CTL), hyperglycaemic (HG), obesity (OB), and systemic arterial hypertension (SAH). Two-dimensional parasternal long-axis cine loops were decomposed using Higher Order Dynamic Mode Decomposition (HODMD) to extract temporal features for downstream analysis. A shared latent representation fed two Vision Transformers: a classifier for diagnosis and a regression module for predicting the age at HFpEF onset. Results: Overall diagnostic accuracy across the four groups was 65%, with all classes exceeding 50% accuracy. Misclassifications primarily reflected early-stage overlap between OB or SAH and CTL. The prognostic module achieved a root-mean-square error of 21.72 weeks for time-to-HFpEF prediction, with OB and SAH showing the most accurate estimates.
Predicted HFpEF onset closely matched true distributions in all groups. Discussion: This unified framework demonstrates that multiclass phenotyping and continuous HFpEF onset prediction can be obtained from a single cine loop, even under small-data conditions. The approach offers a foundation for integrating diagnostic and prognostic modelling in preclinical HFpEF research.
[117] GenCAMO: Scene-Graph Contextual Decoupling for Environment-aware and Mask-free Camouflage Image-Dense Annotation Generation
Chenglizhao Chen,Shaojiang Yuan,Xiaoxue Lu,Mengke Song,Jia Song,Zhenyu Wu,Wenfeng Song,Shuai Li
Main category: cs.CV
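TL;DR: This paper proposes using generative models to synthesize high-quality camouflage images with dense annotations, addressing the scarcity of existing datasets. The authors build GenCAMO-DB, a large-scale camouflage dataset with multi-modal annotations, and propose GenCAMO, an environment-aware, mask-free generative framework that markedly improves dense prediction in complex camouflage scenes.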
Details
Motivation: Real-world camouflage datasets with high-precision dense annotations are scarce and expensive to label, so existing dense prediction models cannot be adequately trained or evaluated; an efficient, low-cost way to generate camouflage images and annotations with rich semantic information is needed. Method: GenCAMO-DB, a large-scale camouflage dataset with multi-modal annotations including depth maps, scene graphs, attribute descriptions, and text prompts; and GenCAMO, a generative framework for environment-aware, mask-free joint generation of high-fidelity camouflage images and dense annotations. Result: Experiments across multiple modalities show that dense prediction models trained on GenCAMO-generated data perform markedly better in complex camouflage scenes, confirming the effectiveness and generalization of the synthetic data. Conclusion: Synthesizing camouflage images with multi-modal annotations using generative models is an effective way to address the scarcity of real data; GenCAMO and its dataset provide valuable resources and technical support for future dense prediction tasks. Abstract: Conceal dense prediction (CDP), especially RGB-D camouflage object detection and open-vocabulary camouflage object segmentation, plays a crucial role in advancing the understanding and reasoning of complex camouflage scenes. However, high-quality and large-scale camouflage datasets with dense annotation remain scarce due to expensive data collection and labeling costs. To address this challenge, we explore leveraging generative models to synthesize realistic camouflage image-dense data for training CDP models with fine-grained representations, prior knowledge, and auxiliary reasoning. Concretely, our contributions are threefold: (i) we introduce GenCAMO-DB, a large-scale camouflage dataset with multi-modal annotations, including depth maps, scene graphs, attribute descriptions, and text prompts; (ii) we present GenCAMO, an environment-aware and mask-free generative framework that produces high-fidelity camouflage image-dense annotations; (iii) extensive experiments across multiple modalities demonstrate that GenCAMO significantly improves dense prediction performance on complex camouflage scenes by providing high-quality synthetic data. The code and datasets will be released after paper acceptance.
[118] Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors
Hao Lu,Xuhui Zhu,Wenjing Zhang,Yanan Li,Xiang Bai
Main category: cs.CV
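TL;DR: This paper proposes OMAN++, a new method for video individual counting (VIC) that introduces a social grouping prior and a spatial-temporal displacement prior, significantly improving counting performance in crowded scenes, with a 38.12% error reduction on the newly built WuhanMetroCrowd dataset.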
Details
Motivation: Existing VIC methods perform poorly in dense scenes such as metro commuting, and suitable datasets for studying complex crowd flows are lacking, so more robust methods and more challenging datasets are needed. Method: Proposes OMAN++, which uses a social grouping prior to extend one-to-one matching to one-to-many matching and designs a displacement prior injector to strengthen feature extraction, matching, and training; also builds the WuhanMetroCrowd dataset, which covers varied densities, occlusions, and dynamics. Result: OMAN++ outperforms existing methods on the SenseCrowd, CroHD, and MovingDroneCrowd benchmarks and achieves a 38.12% error reduction on WuhanMetroCrowd, significantly improving counting accuracy in crowded scenes. Conclusion: Introducing sensible priors and rethinking the nature of VIC effectively improves video individual counting in complex crowded scenes; OMAN++ provides a new strong baseline for the task. Abstract: Video Individual Counting (VIC) is a recently introduced task aiming to estimate pedestrian flux from a video. It extends Video Crowd Counting (VCC) beyond the per-frame pedestrian count. In contrast to VCC that learns to count pedestrians across frames, VIC must identify co-existent pedestrians between frames, which turns out to be a correspondence problem. Existing VIC approaches, however, can underperform in congested scenes such as metro commuting. To address this, we build WuhanMetroCrowd, one of the first VIC datasets that characterize crowded, dynamic pedestrian flows. It features sparse-to-dense density levels, short-to-long video clips, slow-to-fast flow variations, front-to-back appearance changes, and light-to-heavy occlusions. To better adapt VIC approaches to crowds, we rethink the nature of VIC and recognize two informative priors: i) the social grouping prior that indicates pedestrians tend to gather in groups and ii) the spatial-temporal displacement prior that informs an individual cannot teleport physically. The former inspires us to relax the standard one-to-one (O2O) matching used by VIC to one-to-many (O2M) matching, implemented by an implicit context generator and an O2M matcher; the latter facilitates the design of a displacement prior injector, which strengthens not only O2M matching but also feature extraction and model training. These designs jointly form a novel and strong VIC baseline OMAN++.
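To make the correspondence view and the displacement prior concrete, the following is a minimal sketch and not the OMAN++ implementation. It assumes that the total count can be taken as the people in the first frame plus later inflows, and it uses a hand-written greedy one-to-one matcher that rejects any match whose displacement exceeds a threshold. The function names, the greedy matcher, and the `max_disp` value are all hypothetical; the matcher described in the abstract is learned and one-to-many.

```python
import math

def match_with_displacement_prior(prev_pts, curr_pts, max_disp):
    """Greedy one-to-one matching that forbids 'teleporting'.

    prev_pts, curr_pts: lists of (x, y) positions in consecutive frames.
    max_disp: hypothetical upper bound on per-frame displacement (pixels).
    Returns a dict mapping an index in prev_pts to an index in curr_pts.
    """
    # Keep only candidate pairs that satisfy the displacement prior.
    candidates = []
    for i, p in enumerate(prev_pts):
        for j, c in enumerate(curr_pts):
            d = math.dist(p, c)
            if d <= max_disp:
                candidates.append((d, i, j))
    candidates.sort()  # closest pairs are matched first

    matches, used_prev, used_curr = {}, set(), set()
    for _, i, j in candidates:
        if i not in used_prev and j not in used_curr:
            matches[i] = j
            used_prev.add(i)
            used_curr.add(j)
    return matches

def count_individuals(frames, max_disp):
    """Count unique individuals as first-frame people plus later inflows."""
    total = len(frames[0])
    for prev_pts, curr_pts in zip(frames, frames[1:]):
        matches = match_with_displacement_prior(prev_pts, curr_pts, max_disp)
        total += len(curr_pts) - len(matches)  # unmatched = newly appeared
    return total

frames = [
    [(10.0, 10.0), (50.0, 50.0)],
    [(12.0, 11.0), (52.0, 49.0), (200.0, 200.0)],  # one newcomer
    [(14.0, 12.0), (198.0, 201.0)],                # one person left
]
print(count_individuals(frames, max_disp=10.0))  # prints 3
```

With a threshold that is too tight, every detection looks like a newcomer and the count inflates, which is one way to see why the quality of cross-frame matching dominates VIC accuracy.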
Extensive experiments show that OMAN++ not only outperforms state-of-the-art VIC baselines on the standard SenseCrowd, CroHD, and MovingDroneCrowd benchmarks, but also indicates a clear advantage in crowded scenes, with a 38.12% error reduction on our WuhanMetroCrowd dataset. Code, data, and pretrained models are available at https://github.com/tiny-smart/OMAN.
[119] MS-ISSM: Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity
Zhang Chen,Shuai Wan,Yuezhe Zhang,Siyu Ren,Fuzheng Yang,Junhui Hou
Main category: cs.CV
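TL;DR: Proposes a point cloud quality assessment method based on Multi-scale Implicit Structural Similarity Measurement (MS-ISSM), which uses radial basis functions for continuous feature representation to avoid matching errors in irregular data and a ResGrouped-MLP network to improve the mapping to perceptual scores.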
Details
Motivation: The unstructured, irregular nature of point clouds makes it hard for traditional point-to-point quality assessment methods to establish accurate perceptual feature correspondence, so a more robust similarity measure is needed. Method: Uses radial basis functions (RBF) to represent local features continuously and implicitly, turning distortion measurement into a comparison of implicit function coefficients; designs a ResGrouped-MLP network that combines grouped encoding, residual blocks, and channel-wise attention to map multi-scale feature differences to perceptual scores. Result: Experiments on multiple benchmark datasets show that MS-ISSM outperforms existing state-of-the-art metrics in reliability and generalization. Conclusion: MS-ISSM effectively addresses the feature-matching difficulty caused by data irregularity in point cloud quality assessment, and its continuous implicit representation and hierarchical network design significantly improve the accuracy of perceptual quality prediction. Abstract: The unstructured and irregular nature of point clouds poses a significant challenge for objective quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes Radial Basis Functions (RBF) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from traditional flat MLPs by adopting a grouped encoding strategy integrated with Residual Blocks and Channel-wise Attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across High, Medium, and Low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: https://github.com/ZhangChen2022/MS-ISSM.
[120] RefSR-Adv: Adversarial Attack on Reference-based Image Super-Resolution Models
Jiazhu Dai,Huihui Jiang
Main category: cs.CV
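TL;DR: This paper proposes RefSR-Adv, a new adversarial attack on reference-based image super-resolution (RefSR) that significantly degrades reconstruction quality by perturbing only the reference image, exposing a security vulnerability that arises from RefSR models' over-reliance on reference features.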
Details
Motivation: Existing research focuses mainly on backdoor attacks against RefSR, while its vulnerability to adversarial attacks has not been fully explored; this paper aims to fill that gap. Method: Proposes RefSR-Adv, which perturbs only the reference image by maximizing the difference between adversarial and clean outputs, and validates its effectiveness across multiple network architectures and datasets. Result: RefSR-Adv causes severe performance degradation and visual artifacts across CNN, Transformer, and Mamba architectures on the CUFED5, WR-SR, and DRefSR datasets, and attack effectiveness correlates positively with the similarity between the input and the reference image. Conclusion: RefSR systems have a security vulnerability caused by over-reliance on the reference image; the study calls for attention to the robustness of RefSR models. Abstract: Single Image Super-Resolution (SISR) aims to recover high-resolution images from low-resolution inputs. Unlike SISR, Reference-based Super-Resolution (RefSR) leverages an additional high-resolution reference image to facilitate the recovery of high-frequency textures. However, existing research mainly focuses on backdoor attacks targeting RefSR, while the vulnerability of RefSR to adversarial attacks has not been fully explored. To fill this research gap, we propose RefSR-Adv, an adversarial attack that degrades SR outputs by perturbing only the reference image. By maximizing the difference between adversarial and clean outputs, RefSR-Adv induces significant performance degradation and generates severe artifacts across CNN, Transformer, and Mamba architectures on the CUFED5, WR-SR, and DRefSR datasets. Importantly, experiments confirm a positive correlation between the similarity of the low-resolution input and the reference image and attack effectiveness, revealing that the model's over-reliance on reference features is a key security flaw. This study reveals a security vulnerability in RefSR systems, aiming to urge researchers to pay attention to the robustness of RefSR.
[121] XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
Zunhai Su,Weihao Ye,Hansen Feng,Keyu Fan,Jing Zhang,Dahai Yu,Zhengwu Liu,Ngai Wong
Main category: cs.CV
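TL;DR: XStreamVGGT is a tuning-free, efficient streaming 3D vision model that jointly prunes and quantizes the KV cache, significantly reducing memory consumption and inference latency while keeping performance nearly unchanged.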
Details
Motivation: StreamVGGT performs well in streaming 3D reconstruction, but its KV cache grows without bound as input accumulates, causing memory and latency problems that limit practical use. Method: Proposes XStreamVGGT, which prunes KVs through efficient token importance identification and quantizes KV tensors by exploiting their unique distribution, enabling streaming inference under a fixed memory budget. Result: Experiments show that XStreamVGGT reduces memory usage by 4.42$\times$ and accelerates inference by 5.48$\times$ with negligible performance loss. Conclusion: XStreamVGGT enables efficient, scalable streaming 3D vision applications and offers an effective solution for deploying large-scale Transformers in practical settings. Abstract: Learning-based 3D visual geometry models have benefited substantially from large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention for strong streaming reconstruction, but suffers from unbounded KV cache growth, leading to escalating memory consumption and inference latency as input frames accumulate. We propose XStreamVGGT, a tuning-free approach that systematically compresses the KV cache through joint pruning and quantization, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs originating from multi-view inputs are pruned through efficient token importance identification, enabling a fixed memory budget. Leveraging the unique distribution of KV tensors, we incorporate KV quantization to further reduce memory consumption. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling scalable and practical streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
[122] Real-Time LiDAR Point Cloud Densification for Low-Latency Spatial Data Transmission
Kazuhiko Murasaki,Shunsuke Konagai,Masakatsu Aoki,Taiga Yoshida,Ryuichi Tanida
Main category: cs.CV
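TL;DR: Proposes a high-speed LiDAR point cloud densification method that combines multiple LiDAR inputs with high-resolution color images to generate full HD depth maps in real time.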
Details
Motivation: Realizing a low-latency spatial transmission system for immersive telepresence requires solving two major problems: sparse capture of dynamic 3D scenes and real-time processing. Method: Combines multiple LiDAR inputs with high-resolution color images and performs depth completion with a joint bilateral filtering strategy implemented as a convolutional neural network. Result: The method generates dense depth maps at full HD resolution at 30 fps, over 15x faster than a recent training-based depth completion approach, with accurate geometry and no multiview inconsistencies or ghosting artifacts. Conclusion: The method achieves low-latency, high-quality, real-time dense 3D scene reconstruction and suits real-time applications such as immersive telepresence. Abstract: To realize a low-latency spatial transmission system for immersive telepresence, there are two major problems: capturing dynamic 3D scenes densely and processing them in real time. LiDAR sensors capture 3D in real time, but produce sparse point clouds. Therefore, this paper presents a high-speed LiDAR point cloud densification method to generate dense 3D scenes with minimal latency, addressing the need for on-the-fly depth completion while maintaining real-time performance. Our approach combines multiple LiDAR inputs with high-resolution color images and applies a joint bilateral filtering strategy implemented through a convolutional neural network architecture. Experiments demonstrate that the proposed method produces dense depth maps at full HD resolution in real time (30 fps), which is over 15x faster than a recent training-based depth completion approach. The resulting dense point clouds exhibit accurate geometry without multiview inconsistencies or ghosting artifacts.
[123] Promptable Foundation Models for SAR Remote Sensing: Adapting the Segment Anything Model for Snow Avalanche Segmentation
Riccardo Gelato,Carlo Sgaravatti,Jakob Grahn,Giacomo Boracchi,Filippo Maria Bianchi
Main category: cs.CV
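TL;DR: This paper proposes an efficient approach to annotating avalanche segmentation in Sentinel-1 SAR images, built on the Segment Anything Model (SAM) and adapted to the characteristics of SAR data in several respects, significantly improving annotation efficiency.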
Details
Motivation: SAR image annotation depends on experts and is extremely time-consuming, and the lack of large-scale, high-quality annotated datasets limits the development of avalanche detection models, so a method that accelerates SAR image annotation is needed. Method: Introduces adapters to mitigate the domain gap, multiple encoders to handle multi-channel SAR inputs, prompt-engineering strategies to improve localization of small, low-contrast avalanches, and a training algorithm that reduces the computational bottleneck; the resulting model is integrated into an annotation tool. Result: The approach addresses SAM's domain mismatch, input constraints, prompt sensitivity, and training inefficiency on SAR images, and experiments show that it significantly speeds up SAR image annotation. Conclusion: The work successfully adapts the general-purpose segmentation model SAM to SAR avalanche mapping, offering an efficient solution for costly annotation scenarios and advancing SAR-based avalanche risk monitoring. Abstract: Remote sensing solutions for avalanche segmentation and mapping are key to supporting risk forecasting and mitigation in mountain regions. Synthetic Aperture Radar (SAR) imagery from Sentinel-1 can be effectively used for this task, but training an effective detection model requires gathering a large dataset with high-quality annotations from domain experts, which is prohibitively time-consuming. In this work, we aim to facilitate and accelerate the annotation of SAR images for avalanche mapping. We build on the Segment Anything Model (SAM), a segmentation foundation model trained on natural images, and tailor it to Sentinel-1 SAR data. Adapting SAM to our use-case requires addressing several domain-specific challenges: (i) domain mismatch, since SAM was not trained on satellite/SAR imagery; (ii) input adaptation, because SAR products typically provide more than three channels, while SAM is constrained to RGB images; (iii) robustness to imprecise prompts that can affect target identification and degrade the segmentation quality, an issue exacerbated in small, low-contrast avalanches; and (iv) training efficiency, since standard fine-tuning is computationally demanding for SAM. We tackle these challenges through a combination of adapters to mitigate the domain gap, multiple encoders to handle multi-channel SAR inputs, prompt-engineering strategies to improve avalanche localization accuracy, and a training algorithm that limits the training time of the encoder, which is recognized as the major bottleneck.
We integrate the resulting model into an annotation tool and show experimentally that it speeds up the annotation of SAR images.
[124] UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
Mengfei Li,Peng Li,Zheng Zhang,Jiahao Lu,Chengfeng Zhao,Wei Xue,Qifeng Liu,Sida Peng,Wenxiao Zhang,Wenhan Luo,Yuan Liu,Yike Guo
Main category: cs.CV
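TL;DR: Proposes UniSH, a unified feed-forward framework for joint metric-scale 3D scene and human reconstruction that leverages unlabeled real-world data and a novel training paradigm to significantly narrow the sim-to-real domain gap and achieve high-fidelity reconstruction.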
Details
Motivation: Because large-scale annotated real-world data is lacking, existing methods rely on synthetic data, creating a sim-to-real domain gap that hurts generalization to real videos, human geometry accuracy, and alignment. Method: Proposes a novel training paradigm that combines strong priors and uses (1) a robust distillation strategy that distills high-frequency details from an expert depth model to refine the human surface, and (2) a two-stage supervision scheme that first learns coarse localization on synthetic data, then fine-tunes on real data by optimizing the geometric correspondence between the SMPL mesh and the human point cloud. Result: The feed-forward model jointly recovers high-fidelity scene geometry, human point clouds, camera parameters, and coherent metric-scale SMPL bodies in a single forward pass, achieves state-of-the-art performance on human-centric scene reconstruction, and compares favorably with optimization-based and HMR-only methods on global human motion estimation. Conclusion: UniSH effectively leverages unlabeled real data to bridge the gap between simulation and reality, achieving high-quality, metrically accurate joint scene and human reconstruction with good practical potential. Abstract: We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real data by directly optimizing the geometric correspondence between the SMPL mesh and the human point cloud. This approach enables our feed-forward model to jointly recover high-fidelity scene geometry, human point clouds, camera parameters, and coherent, metric-scale SMPL bodies, all in a single forward pass. Extensive experiments demonstrate that our model achieves state-of-the-art performance on human-centric scene reconstruction and delivers highly competitive results on global human motion estimation, comparing favorably against both optimization-based frameworks and HMR-only methods.
Project page: https://murphylmf.github.io/UniSH/
[125] Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
Bac Nguyen,Yuhta Takida,Naoki Murata,Chieh-Hsin Lai,Toshimitsu Uesaka,Stefano Ermon,Yuki Mitsufuji
Main category: cs.CV
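TL;DR: Proposes CODA, a new method that introduces register slots and a contrastive alignment loss to reduce slot entanglement and image-slot misalignment in diffusion-based object-centric learning, significantly improving performance on multiple datasets.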
Details
Motivation: Combining Slot Attention with pretrained diffusion models for object-centric learning suffers from slot entanglement and weak alignment between slots and image content. Method: Proposes Contrastive Object-centric Diffusion Alignment (CODA), which introduces register slots to absorb residual attention and a contrastive alignment loss to strengthen slot-image correspondence; the training objective serves as a tractable surrogate for maximizing mutual information. Result: On MOVi-C/E, VOC, and COCO, CODA significantly outperforms strong baselines on metrics such as the foreground adjusted Rand index (FG-ARI) (e.g., +6.1% on COCO), improves property prediction and compositional image generation, and adds little computational overhead. Conclusion: CODA effectively mitigates slot entanglement and alignment problems and is an efficient, scalable framework with potential for robust object-centric learning in complex real-world scenes. Abstract: Slot Attention (SA) with pretrained diffusion models has recently shown promise for object-centric learning (OCL), but suffers from slot entanglement and weak alignment between object slots and image content. We propose Contrastive Object-centric Diffusion Alignment (CODA), a simple extension that (i) employs register slots to absorb residual attention and reduce interference between object slots, and (ii) applies a contrastive alignment loss to explicitly encourage slot-image correspondence. The resulting training objective serves as a tractable surrogate for maximizing mutual information (MI) between slots and inputs, strengthening slot representation quality. On both synthetic (MOVi-C/E) and real-world datasets (VOC, COCO), CODA improves object discovery (e.g., +6.1% FG-ARI on COCO), property prediction, and compositional image generation over strong baselines. Register slots add negligible overhead, keeping CODA efficient and scalable. These results indicate potential applications of CODA as an effective framework for robust OCL in complex, real-world scenes.
[126] HyDRA: Hybrid Denoising Regularization for Measurement-Only DEQ Training
Markus Haltmeier,Lukas Neumann,Nadja Gruber,Johannes Schwab,Gyeongha Hwang
Main category: cs.CV
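TL;DR: Proposes HyDRA, a measurement-only training framework for deep equilibrium models that combines measurement consistency with adaptive denoising regularization, suited to image reconstruction problems that lack supervised data.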
Details
Motivation: In many practical settings, image reconstruction problems lack supervised pairs (x, y) and only measurements y are available; conventional DEQ models depend on supervised data, which limits their applicability. Method: Proposes the HyDRA framework, which combines measurement consistency with an adaptive denoising regularization term and a data-driven early stopping criterion to train DEQ models from measurements y alone. Result: In sparse-view CT experiments, HyDRA achieves competitive reconstruction quality while retaining fast inference. Conclusion: HyDRA offers an effective, practical DEQ training scheme for image reconstruction without supervision, with good application prospects. Abstract: Solving image reconstruction problems of the form \(\mathbf{A} \mathbf{x} = \mathbf{y}\) remains challenging due to ill-posedness and the lack of large-scale supervised datasets. Deep Equilibrium (DEQ) models have been used successfully but typically require supervised pairs \((\mathbf{x},\mathbf{y})\). In many practical settings, only measurements \(\mathbf{y}\) are available. We introduce HyDRA (Hybrid Denoising Regularization Adaptation), a measurement-only framework for DEQ training that combines measurement consistency with an adaptive denoising regularization term, together with a data-driven early stopping criterion. Experiments on sparse-view CT demonstrate competitive reconstruction quality and fast inference.
[127] RFAssigner: A Generic Label Assignment Strategy for Dense Object Detection
Ziqian Guan,Xieyi Fu,Yuting Wang,Haowen Xiao,Jiarui Zhu,Yingying Zhu,Yongtao Liu,Lin Gu
Main category: cs.CV
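TL;DR: Proposes RFAssigner, a new label assignment strategy that uses a Gaussian receptive field distance to adaptively supplement positive samples for small objects, improving dense detectors across object scales.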
Details
Motivation: Existing label assignment methods assign too few positive samples to small objects, causing scale imbalance during training. Method: RFAssigner first determines an initial positive set using a point-based prior, then uses a Gaussian Receptive Field (GRF) distance to measure the similarity between unassigned candidate locations and ground-truth objects, and adaptively selects additional positive samples accordingly. Result: Experiments on three datasets with different object scale distributions show that RFAssigner significantly improves detection performance, especially on small objects; a single FCOS-ResNet-50 model reaches state-of-the-art performance. Conclusion: RFAssigner effectively alleviates imbalance in multi-scale training and improves dense detectors without auxiliary modules or heuristics, with good generality and practicality. Abstract: Label assignment is a critical component in training dense object detectors. State-of-the-art methods typically assign each training sample a positive and a negative weight, optimizing the assignment scheme during training. However, these strategies often assign an insufficient number of positive samples to small objects, leading to a scale imbalance during training. To address this limitation, we introduce RFAssigner, a novel assignment strategy designed to enhance the multi-scale learning capabilities of dense detectors. RFAssigner first establishes an initial set of positive samples using a point-based prior. It then leverages a Gaussian Receptive Field (GRF) distance to measure the similarity between the GRFs of unassigned candidate locations and the ground-truth objects. Based on this metric, RFAssigner adaptively selects supplementary positive samples from the unassigned pool, promoting a more balanced learning process across object scales. Comprehensive experiments on three datasets with distinct object scale distributions validate the effectiveness and generalizability of our method. Notably, a single FCOS-ResNet-50 detector equipped with RFAssigner achieves state-of-the-art performance across all object scales, consistently outperforming existing strategies without requiring auxiliary modules or heuristics.
[128] MambaFormer: Token-Level Guided Routing Mixture-of-Experts for Accurate and Efficient Clinical Assistance
Hamad Khan,Saddam Hussain Khan
Main category: cs.CV
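TL;DR: Proposes MambaFormer, an LLM-based hybrid Mixture-of-Experts (MoE) model for efficient medical question answering and clinical assistance that combines Transformer and state space model experts for low-latency, high-accuracy inference.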
Details
Motivation: Addresses the trade-off between computational cost and efficiency for large language models in clinical applications, improving long-sequence handling and response speed. Method: Uses a lightweight gating mechanism for token-level dynamic routing: short, complex queries go to a customized Transformer expert (ET5) and long sequences to a state space model expert (EMamba); routing is based on contextual complexity, sequence length, and domain features, and a multi-objective loss jointly optimizes routing decisions and computational cost. Result: Validated on the DentalQA and PubMedQA datasets, MambaFormer reaches a BERTScore of 0.9180 with a latency of only 0.077 s, a 24.4x speedup over T5-Large. Conclusion: MambaFormer achieves a Pareto-optimal trade-off between inference latency and prediction accuracy, with potential for scalable, efficient deployment in resource-constrained clinical settings. Abstract: The deployment of large language models (LLMs) in real-world clinical applications is constrained by the fundamental trade-off between computational cost and the efficiency of linear-time models. To address this, we propose an LLM-based MambaFormer hybrid Mixture-of-Experts (MoE) framework for efficient medical question-answering (QA) and clinical assistance. The MambaFormer employs a lightweight gating mechanism that performs token-level dynamic routing to a customized Transformer expert (ET5) for short, complex queries or to a State Space Model expert (EMamba) for long, high-throughput sequences. The customized EMamba and ET5 models are tailored to accommodate input sequence dimensionality, embedding structure, sequence length, and target-specific output heads, and are fine-tuned through transfer learning on a new, custom-designed DentalQA dataset. Moreover, intelligent routing decisions are driven by the contextual complexity of token embeddings, normalized sequence length, and domain-aware features, thereby enforcing a Pareto-optimal trade-off between inference latency and prediction accuracy. Furthermore, a novel utility-guided multi-objective loss jointly optimizes decisions, router parameters, routing behavior, expert utilization, and computational cost by adaptively regulating token-level expert activation. Finally, the proposed MambaFormer is cross-validated (holdout) for medical QA on the new, custom-designed DentalQA and PubMedQA datasets and compared with state-of-the-art techniques.
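The routing idea can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's router: it models the lightweight gate as a logistic function over the three signals the abstract names (contextual complexity, normalized sequence length, a domain-aware feature). The function names, the `[0, 1]` scaling of inputs, and the weight values are hypothetical placeholders for learned router parameters.

```python
import math

def gate_probability(complexity, norm_len, domain_score,
                     weights=(2.0, -3.0, 0.5), bias=0.0):
    """Probability of sending a token to the Transformer expert.

    All three inputs are assumed to be scaled to [0, 1]. The weights are
    made-up placeholders standing in for learned router parameters.
    """
    w_c, w_l, w_d = weights
    z = w_c * complexity + w_l * norm_len + w_d * domain_score + bias
    return 1.0 / (1.0 + math.exp(-z))

def route_token(complexity, norm_len, domain_score):
    """Hard routing decision: 'ET5' (Transformer) or 'EMamba' (SSM)."""
    p = gate_probability(complexity, norm_len, domain_score)
    return "ET5" if p >= 0.5 else "EMamba"

# Short, complex input: the gate favours the Transformer expert.
print(route_token(complexity=0.9, norm_len=0.1, domain_score=0.5))
# Long, simple input: the gate favours the state space expert.
print(route_token(complexity=0.2, norm_len=0.9, domain_score=0.5))
```

The negative weight on sequence length is what pushes long inputs toward the linear-time expert; in a trained router that trade-off would be shaped by the multi-objective loss rather than set by hand.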
The proposed MambaFormer outperforms them (BERTScore = 0.9180) with ultra-low latency (0.077 s), delivering a 24.4x speedup over T5-Large and establishing a scalable solution for resource-constrained clinical deployment.
[129] AI-Powered Deepfake Detection Using CNN and Vision Transformer Architectures
Sifatullah Sheikh Urmi,Kirtonia Nuzath Tabassum Arthi,Md Al-Imran
Main category: cs.CV
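TL;DR: This study evaluates four AI-based models (three CNNs and one Vision Transformer) for deepfake detection on large face image datasets; VFDNET combined with MobileNetV3 showed the best accuracy and efficiency.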
Details
Motivation: As AI-generated deepfakes proliferate, digital authenticity faces serious challenges, and reliable techniques are needed to detect and counter them. Method: Uses three convolutional neural networks (CNNs) and one Vision Transformer with large face image datasets, applies data preprocessing and augmentation to improve performance, and evaluates each model across different scenarios. Result: After augmentation and optimization all models improved; VFDNET combined with MobileNetV3 showed the highest detection accuracy and runtime efficiency. Conclusion: AI models that combine advanced architectures with data augmentation can effectively address deepfake detection; the combination of VFDNET and MobileNetV3 offers a viable approach to efficient, accurate detection. Abstract: The increasing use of artificial intelligence generated deepfakes creates major challenges in maintaining digital authenticity. Four AI-based models, consisting of three CNNs and one Vision Transformer, were evaluated using large face image datasets. Data preprocessing and augmentation techniques improved model performance across different scenarios. VFDNET demonstrated superior accuracy with MobileNetV3, showing efficient performance, thereby demonstrating AI's capabilities for dependable deepfake detection.
[130] S2M-Net: Spectral-Spatial Mixing for Medical Image Segmentation with Morphology-Aware Adaptive Loss
Md. Sanaullah Chowdhury,Lameya Sabrin
Main category: cs.CV
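TL;DR: This paper proposes S2M-Net, an efficient architecture for medical image segmentation that balances local precision, global context, and computational efficiency through a Spectral-Selective Token Mixer (SSTM) and a Morphology-Aware Adaptive Segmentation Loss (MASL), achieving leading performance on 16 medical datasets.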
Details
Motivation: Medical image segmentation struggles to balance local precision, global context, and computational efficiency; existing methods overfit on small datasets, are computationally expensive, and lack generality and practicality. Method: Proposes S2M-Net with two novel modules: (1) SSTM, which uses a truncated 2D FFT and learnable frequency filtering to obtain a global receptive field at $\mathcal{O}(HW \log HW)$ complexity; (2) MASL, which automatically adjusts the weights of multiple loss terms according to structural characteristics, without manual tuning. Result: Validated on 16 medical image datasets spanning 8 modalities, S2M-Net achieves several best results: 96.12% Dice on polyp segmentation, 83.77% on surgical instruments (+17.85% over prior methods), and 80.90% on brain tumors, with 3.5 to 6 times fewer parameters than Transformer-based methods. Conclusion: S2M-Net effectively resolves the trilemma in medical image segmentation, achieving superior segmentation performance and cross-task generalization at low computational cost, with good potential for clinical deployment. Abstract: Medical image segmentation requires balancing local precision for boundary-critical clinical applications, global context for anatomical coherence, and computational efficiency for deployment on limited data and hardware, a trilemma that existing architectures fail to resolve. Convolutional networks provide local precision at $\mathcal{O}(n)$ cost but have limited receptive fields, while vision transformers achieve global context through $\mathcal{O}(n^2)$ self-attention at prohibitive computational expense, causing overfitting on small clinical datasets. We propose S2M-Net, a 4.7M-parameter architecture that achieves $\mathcal{O}(HW \log HW)$ global context through two synergistic innovations: (i) Spectral-Selective Token Mixer (SSTM), which exploits the spectral concentration of medical images via truncated 2D FFT with learnable frequency filtering and content-gated spatial projection, avoiding quadratic attention cost while maintaining global receptive fields; and (ii) Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically analyzes structure characteristics (compactness, tubularity, irregularity, scale) to modulate five complementary loss components through constrained learnable weights, eliminating manual per-dataset tuning.
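The core of spectral token mixing, truncating the spectrum and scaling the retained frequencies, can be sketched in a few lines. This is a simplified 1D illustration under stated assumptions, not the SSTM module: it uses a naive $\mathcal{O}(n^2)$ DFT from the standard library so the example is self-contained (an FFT would give the $n \log n$ cost), it omits the content-gated spatial projection, and the `gains` list is a hypothetical stand-in for learnable filter weights.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def spectral_mix(tokens, keep, gains):
    """Mix a 1D token sequence globally by filtering its lowest frequencies.

    keep:  number of low-frequency bins retained (the truncation).
    gains: per-frequency multipliers, a stand-in for learnable filter weights.
    """
    n = len(tokens)
    X = dft(tokens)
    Y = [0j] * n
    for k in range(n):
        f = min(k, n - k)  # frequency magnitude; bins k and n-k are conjugates
        if f < keep:
            Y[k] = X[k] * gains[f]  # same gain on both keeps the output real
    return [v.real for v in idft(Y)]

tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# Keeping only the DC bin replaces every token with the global mean.
print(spectral_mix(tokens, keep=1, gains=[1.0]))
```

Because each frequency coefficient is a sum over all positions, every output token depends on every input token, which is how a frequency-domain filter provides a global receptive field without pairwise attention.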
Comprehensive evaluation on 16 medical imaging datasets that span 8 modalities demonstrates state-of-the-art performance: 96.12% Dice on polyp segmentation, 83.77% on surgical instruments (+17.85% over the prior art) and 80.90% on brain tumors, with consistent 3-18% improvements over specialized baselines while using 3.5--6$\times$ fewer parameters than transformer-based methods.
[131] VReID-XFD: Video-based Person Re-identification at Extreme Far Distance Challenge Results
Kailash A. Hambarde,Hugo Proença,Md Rashidunnabi,Pranita Samale,Qiwei Yang,Pingping Zhang,Zijing Gong,Yuhao Wang,Xi Zhang,Ruoshui Qu,Qiaoyun He,Yuhang Zhang,Thi Ngoc Ha Nguyen,Tien-Dung Mai,Cheng-Jun Kang,Yu-Fan Lin,Jin-Hui Jiang,Chih-Chung Hsu,Tamás Endrei,György Cserey,Ashwat Rajbhandari
Main category: cs.CV
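TL;DR: This paper presents VReID-XFD, a video benchmark for extreme far-distance (XFD) aerial-to-ground person re-identification, addressing challenges such as low resolution and large viewpoint changes.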
Details
Motivation: Existing person re-identification systems perform poorly at extreme far distances and across views, so a new benchmark is needed to study this regime. Method: Builds the VReID-XFD benchmark from the DetReIDX dataset, comprising 371 identities, 11,288 tracklets, and a large number of video frames, and supporting multiple cross-view evaluation settings. Result: The best method, SAS-PReID, reaches only 43.93% mAP in the aerial-to-ground setting; performance degrades with altitude and distance, nadir views are universally disadvantageous, and there is a trade-off between peak performance and robustness. Conclusion: VReID-XFD provides a challenging benchmark for extreme far-distance, cross-view person re-identification, exposes the limitations of current methods, and encourages future work on robust feature learning. Abstract: Person re-identification (ReID) across aerial and ground views at extreme far distances introduces a distinct operating regime where severe resolution degradation, extreme viewpoint changes, unstable motion cues, and clothing variation jointly undermine the appearance-based assumptions of existing ReID systems. To study this regime, we introduce VReID-XFD, a video-based benchmark and community challenge for extreme far-distance (XFD) aerial-to-ground person re-identification. VReID-XFD is derived from the DetReIDX dataset and comprises 371 identities, 11,288 tracklets, and 11.75 million frames, captured across altitudes from 5.8 m to 120 m, viewing angles from oblique (30 degrees) to nadir (90 degrees), and horizontal distances up to 120 m. The benchmark supports aerial-to-aerial, aerial-to-ground, and ground-to-aerial evaluation under strict identity-disjoint splits, with rich physical metadata. The VReID-XFD-25 Challenge attracted 10 teams with hundreds of submissions. Systematic analysis reveals monotonic performance degradation with altitude and distance, a universal disadvantage of nadir views, and a trade-off between peak performance and robustness. Even the best-performing SAS-PReID method achieves only 43.93 percent mAP in the aerial-to-ground setting. The dataset, annotations, and official evaluation protocols are publicly available at https://www.it.ubi.pt/DetReIDX/ .
[132] LinMU: Multimodal Understanding Made Linear
Hongjie Wang,Niraj K. Jha
Main category: cs.CV
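TL;DR: LinMU is a linear-complexity multimodal understanding model that replaces self-attention modules with a dual-branch M-MATE block, significantly improving inference efficiency while maintaining performance.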
Details
Motivation: Existing vision-language models are hard to deploy on edge devices because of the quadratic complexity of self-attention, and processing high-resolution images and long videos is prohibitively expensive. Method: Proposes the LinMU architecture, which replaces self-attention layers with M-MATE blocks (combining a bidirectional state-space model, Flex-MA, with local Swin-style window attention, Local-Swin) and uses a three-stage distillation framework to convert a pretrained VLM into LinMU. Result: On benchmarks including MMMU, TextVQA, and LongVideoBench, LinMU matches the teacher models while reducing time-to-first-token by up to 2.7$\times$ and improving token throughput by up to 9.0$\times$ on minute-length videos. Conclusion: Advanced multimodal reasoning can be achieved without quadratic attention, opening a path toward long-context, high-resolution vision-language models. Abstract: Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the VLM with the M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them using LoRA adapters, while regressing on hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of teacher models, yet reduces Time-To-First-Token (TTFT) by up to 2.7$\times$ and improves token throughput by up to 9.0$\times$ on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of the two branches of the M-MATE block.
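A back-of-the-envelope count shows why replacing global attention matters as sequences grow. The sketch below is only an illustration of quadratic versus linear scaling, not a model of LinMU's measured speedups: it counts token-pair interactions, treats the state-space branch as one linear-time scan, and uses a hypothetical window size of 64 that is not taken from the paper.

```python
def full_attention_pairs(n_tokens):
    """Token-pair interactions in global self-attention (every token sees every token)."""
    return n_tokens * n_tokens

def linear_block_pairs(n_tokens, window):
    """Rough count for a dual-branch linear block: windowed attention
    (each token sees `window` neighbours) plus one linear-time scan."""
    return n_tokens * window + n_tokens

WINDOW = 64  # hypothetical window size, not taken from the paper

for n in (1_000, 10_000, 100_000):
    ratio = full_attention_pairs(n) / linear_block_pairs(n, WINDOW)
    print(f"{n:>7} tokens: attention/linear interaction ratio = {ratio:.1f}")
```

The ratio grows in proportion to the sequence length, which is consistent with the abstract reporting its largest throughput gains on minute-length videos; actual wall-clock gains depend on hardware and implementation details that this count ignores.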
The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, thus opening up avenues for long-context VLMs that can deal with high-resolution images and long videos.
[133] Achieving Fine-grained Cross-modal Understanding through Brain-inspired Hierarchical Representation Learning
Weihang You,Hanqi Jiang,Yi Pan,Junhao Chen,Tianming Liu,Fei Dou
Main category: cs.CV
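TL;DR: This paper proposes NeuroAlign, a new framework for fine-grained alignment between fMRI and video that mimics the hierarchical structure of the human visual system and significantly improves cross-modal retrieval performance.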
Details
Motivation: Existing methods fail to reflect the hierarchical and temporal processes of visual processing in the brain, and a modality gap exists between neural data and visual inputs. Method: Proposes the NeuroAlign framework with two stages: global semantic understanding through Neural-Temporal Contrastive Learning (NTCL) and fine-grained pattern matching through enhanced vector quantization; NTCL models temporal dynamics via bidirectional prediction, and DynaSyncMM-EMA enables multi-modal fusion with adaptive weighting. Result: Experiments show that NeuroAlign significantly outperforms existing methods on cross-modal retrieval tasks. Conclusion: NeuroAlign offers a new paradigm for understanding visual cognitive mechanisms, effectively bridging the complex relationship between neural responses and visual stimuli. Abstract: Understanding neural responses to visual stimuli remains challenging due to the inherent complexity of brain representations and the modality gap between neural data and visual inputs. Existing methods, mainly based on reducing neural decoding to generation tasks or simple correlations, fail to reflect the hierarchical and temporal processes of visual processing in the brain. To address these limitations, we present NeuroAlign, a novel framework for fine-grained fMRI-video alignment inspired by the hierarchical organization of the human visual system. Our framework implements a two-stage mechanism that mirrors biological visual pathways: global semantic understanding through Neural-Temporal Contrastive Learning (NTCL) and fine-grained pattern matching through enhanced vector quantization. NTCL explicitly models temporal dynamics through bidirectional prediction between modalities, while our DynaSyncMM-EMA approach enables dynamic multi-modal fusion with adaptive weighting. Experiments demonstrate that NeuroAlign significantly outperforms existing methods in cross-modal retrieval tasks, establishing a new paradigm for understanding visual cognitive mechanisms.
[134] Slot-ID: Identity-Preserving Video Generation from Reference Videos via Slot-Based Temporal Identity Encoding
Yixuan Lai,He Wang,Kun Zhou,Tianjia Shao
Main category: cs.CV
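TL;DR: Proposes an identity-conditioned diffusion-transformer video generation method driven by a short reference video, which captures subject-specific dynamics to improve identity preservation while keeping motion natural and output faithful to the prompt.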
Details
Motivation: Existing methods conditioned on a single image ignore temporal dynamics, leading to pose-locked motion, facial warping, and identity drift, and they struggle to keep identity consistent under large pose changes and rich expressions. Method: Introduces a diffusion-transformer model conditioned on a short reference video; a Sinkhorn-routed encoder extracts compact identity tokens that capture subject-specific dynamic patterns (such as how smiles form) while remaining compatible with the pretrained backbone. Result: The method significantly improves identity preservation under large pose changes and complex facial expressions, while maintaining motion naturalness, visual realism, and faithfulness to the text prompt. Conclusion: Incorporating dynamics from a reference video through a lightweight identity-conditioning mechanism strengthens identity consistency without sacrificing generation quality, offering a new direction for personalized video generation. Abstract: Producing prompt-faithful videos that preserve a user-specified identity remains challenging: models need to extrapolate facial dynamics from sparse reference while balancing the tension between identity preservation and motion naturalness. Conditioning on a single image completely ignores the temporal signature, which leads to pose-locked motions, unnatural warping, and "average" faces when viewpoints and expressions change. To this end, we introduce an identity-conditioned variant of a diffusion-transformer video generator which uses a short reference video rather than a single portrait. Our key idea is to incorporate the dynamics in the reference. A short clip reveals subject-specific patterns, e.g., how smiles form, across poses and lighting. From this clip, a Sinkhorn-routed encoder learns compact identity tokens that capture characteristic dynamics while remaining pretrained backbone-compatible. Despite adding only lightweight conditioning, the approach consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.
[135] Advanced Machine Learning Approaches for Enhancing Person Re-Identification Performance
Dang H. Pham,Tu N. Nguyen,Hoa N. Nguyen
Main category: cs.CV
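TL;DR: This dissertation addresses appearance variation, domain shift, and limited labeled data in person re-identification (ReID) with three methods for the supervised, unsupervised domain adaptation, and fully unsupervised settings, significantly improving cross-camera identity matching.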
Details
Motivation: Person re-identification面临外观变异、域迁移和标注数据稀缺等挑战,现有方法在特征判别性、跨域泛化和伪标签噪声抑制方面存在局限,需更鲁棒的解决方案。 Method: 1) SCM-ReID:结合监督对比学习与混合损失优化;2) IQAGA和DAPRH:基于GAN的数据增强、域不变映射与伪标签精炼;3) ViTC-UReID:基于Vision Transformer的特征编码与相机感知代理学习。 Result: 在Market-1501、CUHK03、DukeMTMC-reID和MSMT17上取得显著提升,其中UDA设置下mAP和Rank-1指标最高提升12%,无监督方法也大幅超越现有基准。 Conclusion: 所提方法有效增强了ReID在不同学习设置下的特征表示能力与泛化性能,推动了其在真实监控系统中的实际应用。 Abstract: Person re-identification (ReID) plays a critical role in intelligent surveillance systems by linking identities across multiple cameras in complex environments. However, ReID faces significant challenges such as appearance variations, domain shifts, and limited labeled data. This dissertation proposes three advanced approaches to enhance ReID performance under supervised, unsupervised domain adaptation (UDA), and fully unsupervised settings. First, SCM-ReID integrates supervised contrastive learning with hybrid loss optimization (classification, center, triplet, and centroid-triplet losses), improving discriminative feature representation and achieving state-of-the-art accuracy on Market-1501 and CUHK03 datasets. Second, for UDA, IQAGA and DAPRH combine GAN-based image augmentation, domain-invariant mapping, and pseudo-label refinement to mitigate domain discrepancies and enhance cross-domain generalization. Experiments demonstrate substantial gains over baseline methods, with mAP and Rank-1 improvements up to 12% in challenging transfer scenarios. Finally, ViTC-UReID leverages Vision Transformer-based feature encoding and camera-aware proxy learning to boost unsupervised ReID. By integrating global and local attention with camera identity constraints, this method significantly outperforms existing unsupervised approaches on large-scale benchmarks. Comprehensive evaluations across CUHK03, Market-1501, DukeMTMC-reID, and MSMT17 confirm the effectiveness of the proposed methods. 
The contributions advance ReID research by addressing key limitations in feature learning, domain adaptation, and label noise handling, paving the way for robust deployment in real-world surveillance systems.
[136] Garment Inertial Denoiser (GID): Endowing Accurate Motion Capture via Loose IMU Denoiser
Jiawei Fang,Ruonan Zheng,Xiaoxia Gao,Shifan Jiang,Anjun Chen,Qi Ye,Shihui Guo
Main category: cs.CV
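TL;DR: This paper proposes GID, a lightweight Transformer for inertial motion capture (MoCap) in loose-fitting garments, which uses a location-aware expert architecture to denoise sensor signals and perform cross-wear fusion, substantially improving accuracy without tight sensor attachment.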
Details
Motivation: 传统惯性动作捕捉需要传感器紧贴身体,限制了日常使用;而将传感器嵌入宽松衣物会因位移导致严重噪声,破坏现有惯性流程,因此需要一种能处理此类结构化噪声的新方法。 Method: 提出GID(Garment Inertial Denoiser),采用三阶段框架:(i) 位置特定去噪,(ii) 自适应跨穿戴融合,(iii) 通用姿态预测;结合共享时空主干网络与针对每个IMU位置的专家头,并引入轻量融合模块以保持一致性。 Result: 在新提出的GarMoCap数据集上实验表明,GID可在单用户训练下实现实时精确去噪,并对未见过的用户、动作和服装类型具有良好泛化能力,作为即插即用模块可一致提升现有最先进方法的性能。 Conclusion: GID通过引入位置感知的专家结构和轻量融合机制,有效解决了宽松衣物中惯性传感器的动作捕捉噪声问题,推动了可穿戴动作捕捉向更舒适、实用方向的发展。 Abstract: Wearable inertial motion capture (MoCap) provides a portable, occlusion-free, and privacy-preserving alternative to camera-based systems, but its accuracy depends on tightly attached sensors - an intrusive and uncomfortable requirement for daily use. Embedding IMUs into loose-fitting garments is a desirable alternative, yet sensor-body displacement introduces severe, structured, and location-dependent corruption that breaks standard inertial pipelines. We propose GID (Garment Inertial Denoiser), a lightweight, plug-and-play Transformer that factorizes loose-wear MoCap into three stages: (i) location-specific denoising, (ii) adaptive cross-wear fusion, and (iii) general pose prediction. GID uses a location-aware expert architecture, where a shared spatio-temporal backbone models global motion while per-IMU expert heads specialize in local garment dynamics, and a lightweight fusion module ensures cross-part consistency. This inductive bias enables stable training and effective learning from limited paired loose-tight IMU data. We also introduce GarMoCap, a combined public and newly collected dataset covering diverse users, motions, and garments. Experiments show that GID enables accurate, real-time denoising from single-user training and generalizes across unseen users, motions, and garment types, consistently improving state-of-the-art inertial MoCap methods when used as a drop-in module.[137] Unsupervised SE(3) Disentanglement for in situ Macromolecular Morphology Identification from Cryo-Electron Tomography
Mostofa Rafid Uddin,Mahek Vora,Qifeng Wu,Muyuan Chen,Min Xu
Main category: cs.CV
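TL;DR: Proposes a disentangled deep representation learning framework that separates SE(3) transformations from morphological content in cryo-electron tomography data, enabling automatic and efficient inference of macromolecular morphology.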
Details
Motivation: Existing expectation-maximization-based methods often miss rare but important macromolecular morphologies and depend on extensive manual hyperparameter tuning. Method: The authors design a disentangled deep representation learning framework with a multi-choice learning module that separates SE(3) transformations from morphological features under heavy noise, and use the learned content to generate template morphologies. Result: Experiments on simulated and real datasets show that the method outperforms prior approaches and discovers previously unidentified macromolecular morphologies. Conclusion: The framework addresses the challenges of morphology inference in cryo-ET, improves the detection of rare structures, and reduces reliance on manual intervention. Abstract: Cryo-electron tomography (cryo-ET) provides direct 3D visualization of macromolecules inside the cell, enabling analysis of their in situ morphology. This morphology can be regarded as an SE(3)-invariant, denoised volumetric representation of subvolumes extracted from tomograms. Inferring morphology is therefore an inverse problem of estimating both a template morphology and its SE(3) transformation. Existing expectation-maximization-based solutions to this problem often miss rare but important morphologies and require extensive manual hyperparameter tuning. To address this issue, we present a disentangled deep representation learning framework that separates SE(3) transformations from morphological content in the representation space. The framework includes a novel multi-choice learning module that enables this disentanglement for highly noisy cryo-ET data, and the learned morphological content is used to generate template morphologies. Experiments on simulated and real cryo-ET datasets demonstrate clear improvements over prior methods, including the discovery of previously unidentified macromolecular morphologies.
[138] ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
Xiaobao Wei,Zhangjie Ye,Yuxiang Gu,Zunjie Zhu,Yunfei Guo,Yingying Shen,Shan Zhao,Ming Lu,Haiyang Sun,Bing Wang,Guang Chen,Rongfeng Lu,Hangjun Ye
Main category: cs.CV
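TL;DR: This paper presents ParkRecon3D, the first benchmark for parking scene reconstruction, and ParkGaussian, a framework that combines 3D Gaussian Splatting with a slot-aware strategy to improve reconstruction quality and consistency with downstream tasks.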
Details
Motivation: Existing work concentrates on 2D parking slot perception and localization and pays little attention to modeling complex 3D spatial geometry; moreover, improving visual quality alone does not directly improve autonomous parking performance. Method: The authors build the ParkRecon3D benchmark, which contains fisheye camera data with dense annotations, and propose ParkGaussian, the first framework to apply 3D Gaussian Splatting (3DGS) to parking scene reconstruction, with a slot-aware reconstruction strategy that improves synthesis quality in slot regions. Result: Experiments on ParkRecon3D show that ParkGaussian achieves state-of-the-art reconstruction quality and perception consistency. Conclusion: By combining 3D reconstruction with perception requirements, ParkGaussian improves autonomous parking systems in complex environments and provides a benchmark and a method reference for future research. Abstract: Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, while 3D reconstruction, which is crucial for capturing complex spatial geometry in parking scenarios, remains underexplored. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slot perception module. To address these limitations, we curate the first benchmark, named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot-aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state-of-the-art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released at: https://github.com/wm-research/ParkGaussian
[139] Evaluation of Convolutional Neural Network For Image Classification with Agricultural and Urban Datasets
Shamik Shafkat Avro,Nazira Jesmin Lina,Shahanaz Sharmin
Main category: cs.CV
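TL;DR: This paper presents a custom convolutional neural network (CustomCNN) that uses residual connections, Squeeze-and-Excitation attention, and related design choices to achieve efficient, competitive performance on multi-domain image classification.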
Details
Motivation: 研究网络架构设计选择对多领域图像分类性能的影响,特别是在智慧城市和农业图像应用中的实际效果。 Method: 设计并实现CustomCNN,采用残差连接、SE注意力机制、渐进式通道缩放和Kaiming初始化,并在五个公开数据集上进行训练与评估。 Result: CustomCNN在多个数据集上表现出与主流CNN模型相当的性能,同时计算效率更高。 Conclusion: 精心设计的网络架构能显著提升模型在真实应用场景中的表现,验证了架构选择的重要性。 Abstract: This paper presents the development and evaluation of a custom Convolutional Neural Network (CustomCNN) created to study how architectural design choices affect multi-domain image classification tasks. The network uses residual connections, Squeeze-and-Excitation attention mechanisms, progressive channel scaling, and Kaiming initialization to improve its ability to represent data and speed up training. The model is trained and tested on five publicly available datasets: unauthorized vehicle detection, footpath encroachment detection, polygon-annotated road damage and manhole detection, MangoImageBD and PaddyVarietyBD. A comparison with popular CNN architectures shows that the CustomCNN delivers competitive performance while remaining efficient in computation. The results underscore the importance of thoughtful architectural design for real-world Smart City and agricultural imaging applications.[140] SwinIFS: Landmark Guided Swin Transformer For Identity Preserving Face Super Resolution
Habiba Kausar,Saeed Anwar,Omar Jamal Hammad,Abdul Bais
Main category: cs.CV
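TL;DR: This paper proposes SwinIFS, a landmark-guided face super-resolution framework that combines structural priors with hierarchical attention to achieve identity-preserving, high-quality face reconstruction at moderate and high upscaling factors.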
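The landmark guidance in this entry relies on dense Gaussian heatmaps of facial landmarks that are added to the input representation. A minimal sketch of how such heatmaps can be built is shown below. The function names, the `sigma` value, and the toy landmark coordinates are assumptions for illustration and do not come from the SwinIFS code.

```python
import math


def landmark_heatmap(height, width, landmark, sigma=2.0):
    """Return a height x width grid with a Gaussian bump centred on
    landmark = (x, y). The peak value is 1.0 at the landmark pixel."""
    cx, cy = landmark
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def stack_heatmaps(height, width, landmarks, sigma=2.0):
    """One heatmap channel per landmark; in a real pipeline these channels
    would be concatenated with the low-resolution image channels."""
    return [landmark_heatmap(height, width, lm, sigma) for lm in landmarks]


# Two hypothetical landmarks on a 16 x 16 low-resolution face crop.
heatmaps = stack_heatmaps(16, 16, [(4, 5), (10, 8)])
```

Encoding landmarks as smooth bumps rather than single hot pixels gives the network a spatially graded signal, which tolerates small landmark localization errors.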
Details
Motivation: 由于低分辨率输入中细部结构和身份特征严重丢失,现有方法在极端放大条件下难以恢复有意义的面部结构,因此需要一种能保留身份信息并提升细节质量的方法。 Method: 引入密集高斯热图表示关键面部特征点作为结构先验,并融合到输入中;采用紧凑型Swin Transformer主干网络,利用其长距离依赖建模能力捕获上下文信息,同时保持局部几何结构,实现层次化注意力下的精细纹理恢复。 Result: 在CelebA数据集上实验表明,SwinIFS在感知质量、图像清晰度和身份保持方面优于现有方法,尤其在8倍放大下仍能生成逼真结果,且具有良好的计算效率。 Conclusion: SwinIFS通过融合结构先验与Transformer架构,有效提升了面部超分辨率的性能,尤其适用于监控、数字修复等对身份保真度要求高的实际应用场景。 Abstract: Face super-resolution aims to recover high-quality facial images from severely degraded low-resolution inputs, but remains challenging due to the loss of fine structural details and identity-specific features. This work introduces SwinIFS, a landmark-guided super-resolution framework that integrates structural priors with hierarchical attention mechanisms to achieve identity-preserving reconstruction at both moderate and extreme upscaling factors. The method incorporates dense Gaussian heatmaps of key facial landmarks into the input representation, enabling the network to focus on semantically important facial regions from the earliest stages of processing. A compact Swin Transformer backbone is employed to capture long-range contextual information while preserving local geometry, allowing the model to restore subtle facial textures and maintain global structural consistency. Extensive experiments on the CelebA benchmark demonstrate that SwinIFS achieves superior perceptual quality, sharper reconstructions, and improved identity retention; it consistently produces more photorealistic results and exhibits strong performance even under 8x magnification, where most methods fail to recover meaningful structure. SwinIFS also provides an advantageous balance between reconstruction accuracy and computational efficiency, making it suitable for real-world applications in facial enhancement, surveillance, and digital restoration. Our code, model weights, and results are available at https://github.com/Habiba123-stack/SwinIFS.[141] Mask-Guided Multi-Task Network for Face Attribute Recognition
Gong Gao,Zekai Wang,Jian Zhao,Ziqi Xie,Xianhui Liu,Weidong Zhao
Main category: cs.CV
TL;DR: Proposes MGMTN, a new multi-task network that improves face attribute recognition through adaptive mask learning and group-global feature fusion.
Details
Motivation: 传统多任务方法使用全局特征图导致冗余特征和负迁移,需更精确的局部区域选择以提高识别效率。 Method: 设计了Mask-Guided Multi-Task Network(MGMTN),包含自适应掩码学习(AML)定位关键面部部件并生成分组掩码,结合分组与全局特征的G2FF模块进行特征融合。 Result: 在两个具有挑战性的面部属性识别数据集上进行了大量实验,结果表明MGMTN显著优于现有方法。 Conclusion: MGMTN通过聚焦关键面部区域并融合局部与全局特征,有效提升了面部属性识别的准确性和效率。 Abstract: Face Attribute Recognition (FAR) plays a crucial role in applications such as person re-identification, face retrieval, and face editing. Conventional multi-task attribute recognition methods often process the entire feature map for feature extraction and attribute classification, which can produce redundant features due to reliance on global regions. To address these challenges, we propose a novel approach emphasizing the selection of specific feature regions for efficient feature learning. We introduce the Mask-Guided Multi-Task Network (MGMTN), which integrates Adaptive Mask Learning (AML) and Group-Global Feature Fusion (G2FF) to address the aforementioned limitations. Leveraging a pre-trained keypoint annotation model and a fully convolutional network, AML accurately localizes critical facial parts (e.g., eye and mouth groups) and generates group masks that delineate meaningful feature regions, thereby mitigating negative transfer from global region usage. Furthermore, G2FF combines group and global features to enhance FAR learning, enabling more precise attribute identification. Extensive experiments on two challenging facial attribute recognition datasets demonstrate the effectiveness of MGMTN in improving FAR performance.[142] AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval
Yue Zhou,Ran Ding,Xue Yang,Xue Jiang,Xingzhao Liu
Main category: cs.CV
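TL;DR: This paper proposes a remote sensing vision-language model (VLM) for drone-captured vehicle imagery. It builds AirSpatial, a spatially-aware dataset with over 206K instructions, introduces two new tasks (Spatial Grounding and Spatial Question Answering), and uses a two-stage training strategy to improve spatial understanding, yielding AirSpatialBot, an aerial agent capable of fine-grained recognition and retrieval.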
Details
Motivation: Existing remote sensing VLMs have limited spatial understanding and fall short of practical needs, particularly on drone-captured vehicle imagery. Method: The authors build AirSpatial, a large-scale spatially-aware dataset with 3DBB annotations; design a two-stage training strategy of Image Understanding Pre-training followed by Spatial Understanding Fine-tuning; and develop AirSpatialBot, an agent that integrates task planning, image understanding, spatial understanding, and task execution. Result: Experiments validate the approach, expose the spatial-understanding shortcomings of existing VLMs, and show that AirSpatialBot performs well on fine-grained vehicle attribute recognition and retrieval. Conclusion: The work advances spatial understanding in remote sensing VLMs and provides new data resources and a technical path for intelligent analysis of drone imagery. Abstract: Despite notable advancements in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote sensing, we specifically address vehicle imagery captured by drones and introduce a spatially-aware dataset AirSpatial, which comprises over 206K instructions and introduces two novel tasks: Spatial Grounding and Spatial Question Answering. It is also the first remote sensing grounding dataset to provide 3DBB. To effectively transfer the existing image understanding of VLMs to spatial domains, we adopt a two-stage training strategy comprising Image Understanding Pre-training and Spatial Understanding Fine-tuning. Utilizing this trained spatially-aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot
[143] DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
Xu Guo,Fulong Ye,Xinghui Li,Pengqi Tu,Pengze Zhang,Qichao Sun,Songtao Zhao,Xiangwang Hou,Qian He
Main category: cs.CV
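TL;DR: This paper proposes DreamID-V, a new video face swapping framework that introduces the SyncID-Pipe data pipeline and a Diffusion Transformer architecture to significantly improve identity consistency and temporal coherence.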
Details
Motivation: 现有视频人脸交换方法在保持身份相似性、属性保留以同时序一致性方面存在挑战,难以将图像人脸交换的优势有效迁移到视频领域。 Method: 提出SyncID-Pipe生成带身份锚定的配对数据,构建双向身份四元组用于监督;设计基于Diffusion Transformer的DreamID-V框架,采用模态感知条件注入模块;引入合成到真实的课程学习和身份一致性强化学习策略;并构建新基准IDBench-V。 Result: 实验表明DreamID-V在多个指标上优于现有最先进方法,在复杂场景下展现出更强的身份一致性和视觉真实感,并具有良好的泛化能力,可适应多种换脸相关任务。 Conclusion: DreamID-V成功实现了从图像到视频人脸交换的有效迁移,为视频级身份编辑提供了新的解决方案。 Abstract: Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-model conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, which can be seamlessly adapted to various swap-related tasks.[144] EdgeNeRF: Edge-Guided Regularization for Neural Radiance Fields from Sparse Views
Weiqi Yu,Yiyang Yao,Lin He,Jianming Lv
Main category: cs.CV
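TL;DR: Proposes EdgeNeRF, an edge-guided sparse-view 3D reconstruction method that applies depth and normal regularization to non-edge regions to suppress geometric artifacts while preserving boundary detail.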
Details
Motivation: 现有NeRF方法在稀疏视角下因几何伪影导致重建质量下降,全局正则化会丢失边界细节。 Method: 利用深度和法线突变产生边缘的先验,从输入图像中提取边缘,并在非边缘区域施加正则化约束。 Result: 在LLFF和DTU数据集上表现优异,能有效保持锐利几何边界并抑制伪影,且模块可插拔提升其他方法性能。 Conclusion: EdgeNeRF在稀疏视角下实现了高质量3D重建,兼顾几何一致性和高频细节保留,具有良好的通用性。 Abstract: Neural Radiance Fields (NeRF) achieve remarkable performance in dense multi-view scenarios, but their reconstruction quality degrades significantly under sparse inputs due to geometric artifacts. Existing methods utilize global depth regularization to mitigate artifacts, leading to the loss of geometric boundary details. To address this problem, we propose EdgeNeRF, an edge-guided sparse-view 3D reconstruction algorithm. Our method leverages the prior that abrupt changes in depth and normals generate edges. Specifically, we first extract edges from input images, then apply depth and normal regularization constraints to non-edge regions, enhancing geometric consistency while preserving high-frequency details at boundaries. Experiments on LLFF and DTU datasets demonstrate EdgeNeRF's superior performance, particularly in retaining sharp geometric boundaries and suppressing artifacts. Additionally, the proposed edge-guided depth regularization module can be seamlessly integrated into other methods in a plug-and-play manner, significantly improving their performance without substantially increasing training time. Code is available at https://github.com/skyhigh404/edgenerf.[145] In defense of the two-stage framework for open-set domain adaptive semantic segmentation
Wenqi Ren,Weijie Wang,Meng Zheng,Ziyan Wu,Yang Tang,Zhun Zhong,Nicu Sebe
Main category: cs.CV
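TL;DR: Proposes SATS, a separating-then-adapting training strategy that addresses the known/unknown class imbalance in open-set domain adaptive semantic segmentation, together with a hard unknown exploration data augmentation method; it significantly outperforms existing methods on multiple benchmarks.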
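The results for this entry are reported as H-Score gains. The abstract does not define the metric, so the sketch below assumes the definition commonly used in open-set work: the harmonic mean of the mean IoU over known classes and the IoU of the unknown class. The function name and the example values are illustrative only.

```python
def h_score(known_miou, unknown_iou):
    """Harmonic mean of known-class mIoU and unknown-class IoU.

    Assumed definition; the abstract does not spell it out. The harmonic
    mean is low whenever either term is low, so a model cannot score well
    by ignoring the unknown class.
    """
    total = known_miou + unknown_iou
    if total == 0:
        return 0.0
    return 2.0 * known_miou * unknown_iou / total


# Hypothetical values: strong on known classes, weaker on unknowns.
example = h_score(0.6, 0.3)
```

Under this assumed definition, the score is pulled toward the weaker of the two terms, which is why gains on the unknown class show up strongly in the metric.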
Details
Motivation: 现有方法在单一阶段同时处理已知类别的域适应和未知类别的识别,导致已知类别负迁移和未知类别欠拟合,因训练标注不平衡。 Method: 采用两阶段策略:首先进行已知与未知类别的分离,然后执行对未知感知的域适应;并提出硬未知探索的数据增强方法,提升模型对复杂未知样本的识别能力。 Result: 在GTA5-to-Cityscapes上H-Score提升+3.85%,在SYNTHIA-to-Cityscapes上提升+18.64%,显著优于当前最优方法。 Conclusion: SATS通过分离再适应的训练策略和新的数据增强方法,有效缓解了开放集域自适应语义分割中的类别不平衡问题,提升了模型对真实未知对象的发现能力。 Abstract: Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS) presents a significant challenge, as it requires both domain adaptation for known classes and the distinction of unknowns. Existing methods attempt to address both tasks within a single unified stage. We question this design, as the annotation imbalance between known and unknown classes often leads to negative transfer of known classes and underfitting for unknowns. To overcome these issues, we propose SATS, a Separating-then-Adapting Training Strategy, which addresses OSDA-SS through two sequential steps: known/unknown separation and unknown-aware domain adaptation. By providing the model with more accurate and well-aligned unknown classes, our method ensures a balanced learning of discriminative features for both known and unknown classes, steering the model toward discovering truly unknown objects. Additionally, we present hard unknown exploration, an innovative data augmentation method that exposes the model to more challenging unknowns, strengthening its ability to capture more comprehensive understanding of target unknowns. We evaluate our method on public OSDA-SS benchmarks. Experimental results demonstrate that our method achieves a substantial advancement, with a +3.85% H-Score improvement for GTA5-to-Cityscapes and +18.64% for SYNTHIA-to-Cityscapes, outperforming previous state-of-the-art methods.[146] PartImageNet++ Dataset: Enhancing Visual Models with High-Quality Part Annotations
Xiao Li,Zilong Liu,Yining Liu,Zhuhong Li,Na Dong,Sitian Qin,Xiaolin Hu
Main category: cs.CV
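TL;DR: This paper presents PartImageNet++ (PIN++), a large-scale dataset with detailed part annotations for all ImageNet-1K categories, and builds on it a Multi-scale Part-supervised recognition Model (MPM) to improve image classification.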
Details
Motivation: 现有数据集中高质量部件标注稀缺,限制了基于部件的模型发展,因此需要一个覆盖广泛类别且标注丰富的数据集来推动相关研究。 Method: 构建了包含10万张图像、每类100张标注图像的PIN++数据集;训练部件分割网络生成伪部件标签,并设计MPM模型结合原始标注与伪标签进行多任务学习,引入辅助旁路层增强特征表达。 Result: 在ImageNet-1K上验证了MPM的有效性,提升了鲁棒物体识别性能;并在部件分割、物体分割和少样本学习等下游任务中建立了强基线。 Conclusion: PIN++为部件级理解提供了重要资源,MPM展示了利用部件监督提升模型性能的有效路径,证明了细粒度标注在深度学习中的潜力。 Abstract: To address the scarcity of high-quality part annotations in existing datasets, we introduce PartImageNet++ (PIN++), a dataset that provides detailed part annotations for all categories in ImageNet-1K. With 100 annotated images per category, totaling 100K images, PIN++ represents the most comprehensive dataset covering a diverse range of object categories. Leveraging PIN++, we propose a Multi-scale Part-supervised recognition Model (MPM) for robust classification on ImageNet-1K. We first trained a part segmentation network using PIN++ and used it to generate pseudo part labels for the remaining unannotated images. MPM then integrated a conventional recognition architecture with auxiliary bypass layers, jointly supervised by both pseudo part labels and the original part annotations. Furthermore, we conducted extensive experiments on PIN++, including part segmentation, object segmentation, and few-shot learning, exploring various ways to leverage part annotations in downstream tasks. Experimental results demonstrated that our approach not only enhanced part-based models for robust object recognition but also established strong baselines for multiple downstream tasks, highlighting the potential of part annotations in improving model performance. The dataset and the code are available at https://github.com/LixiaoTHU/PartImageNetPP.[147] Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration
Wentao Bian,Fenglei Xu
Main category: cs.CV
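TL;DR: This paper proposes DA-FSS, a new multimodal few-shot 3D point cloud semantic segmentation model that uses a decoupled-experts arbitration mechanism to resolve the "Plasticity-Stability Dilemma" and CLIP's inter-class confusion.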
Details
Motivation: 现有“融合再细化”范式存在可塑性与稳定性之间的冲突,且CLIP易产生类间混淆导致语义盲区,限制了多模态少样本3D分割性能。 Method: 提出DA-FSS模型,包含并行专家细化模块和堆叠仲裁模块,解耦几何与语义路径,并通过解耦对齐模块进行无混淆知识迁移。 Result: 在S3DIS和ScanNet数据集上超越MM-FSS基线方法,几何边界、完整性和纹理区分能力更优。 Conclusion: DA-FSS通过分离并协调语义与几何通路,有效缓解了塑料性-稳定性矛盾,提升了多模态少样本3D语义分割的泛化能力。 Abstract: In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in "Fuse-then-Refine" paradigms: the "Plasticity-Stability Dilemma." In addition, CLIP's inter-class confusion can result in semantic blindness. To address these issues, we present the Decoupled-experts Arbitration Few-Shot SegNet (DA-FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA-FSS employs the same backbone and pre-trained text encoder as MM-FSS to generate text embeddings, which can increase free modalities' utilization rate and better leverage each modality's information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA-FSS over MM-FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA-FSS.[148] Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation
Mingxing Zhan,Li Zhang,Beibei Wang,Yingjie Wang,Zenglin Shi
Main category: cs.CV
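TL;DR: This paper proposes a frozen-backbone method for metric depth recovery that combines an image-specific affine transform in inverse depth with a language-guided, uncertainty-aware envelope to achieve monocular depth estimation that is robust across domains.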
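This entry recovers metric depth through an image-specific affine transform in inverse depth, supervised during training by a closed-form least-squares fit. A minimal sketch of such a fit is given below, assuming the model 1/d ≈ scale · r + shift, where r is the relative inverse depth and d the ground-truth metric depth. The function names and the toy values are hypothetical; the paper's actual parameterization may differ.

```python
def fit_inverse_depth_affine(rel_inv_depth, metric_depth):
    """Closed-form least-squares fit of 1/d ~ scale * r + shift.

    Uses the ordinary simple-regression formulas over all valid pixels.
    """
    targets = [1.0 / d for d in metric_depth]
    n = len(targets)
    mean_r = sum(rel_inv_depth) / n
    mean_t = sum(targets) / n
    var_r = sum((r - mean_r) ** 2 for r in rel_inv_depth)
    if var_r == 0:
        raise ValueError("relative inverse depth is constant; scale is undefined")
    cov = sum((r - mean_r) * (t - mean_t) for r, t in zip(rel_inv_depth, targets))
    scale = cov / var_r
    shift = mean_t - scale * mean_r
    return scale, shift


def apply_calibration(rel_inv_depth, scale, shift):
    """Map relative inverse depth to metric depth with the fitted transform."""
    return [1.0 / (scale * r + shift) for r in rel_inv_depth]


# Toy check: generate metric depth from a known transform, then recover it.
rel = [0.1, 0.4, 0.7, 1.0]
true_scale, true_shift = 0.5, 0.05
depth = [1.0 / (true_scale * r + true_shift) for r in rel]
scale, shift = fit_inverse_depth_affine(rel, depth)
```

Fitting in inverse depth keeps the problem linear in the two unknowns, which is what makes a closed-form solution available.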
Details
Motivation: 单目度量深度估计因全局尺度不可识别和域迁移敏感性而具有挑战性,现有方法在跨域泛化上表现不佳。 Method: 在冻结相对深度主干和CLIP文本编码器的前提下,引入图像特定的逆深度仿射变换,并利用语言预测可行校准参数的不确定性感知包络,再结合多尺度冻结视觉特征选择包络内的具体校准参数;训练时使用闭式最小二乘oracle提供每张图像的监督信号。 Result: 在NYUv2和KITTI数据集上提升了域内精度,在SUN-RGBD和DDAD上实现了优于强语言基线的零样本迁移性能。 Conclusion: 该方法通过结合语言提供的粗略尺度线索与视觉特征精调,在保持模型轻量化的同时显著提升了单目度量深度估计的准确性与跨域鲁棒性。 Abstract: Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments on NYUv2 and KITTI improve in-domain accuracy, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.[149] Domain Adaptation of Carotid Ultrasound Images using Generative Adversarial Network
Mohd Usama,Belal Ahmad,Christer Gronlund,Faleh Menawer R Althiyabi
Main category: cs.CV
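TL;DR: Proposes a GAN-based domain adaptation method for ultrasound images whose distributions differ across devices or parameter settings; image-to-image translation aligns textures and removes reverberation noise, and experiments show a clear improvement in domain adaptation.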
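The evaluation in this entry compares intensity histograms of translated and target-domain images using histogram correlation and Bhattacharyya distance. The abstract does not state the exact formulas, so the sketch below assumes the widely used forms for normalized histograms: Pearson correlation between bin values, and sqrt(1 − Σ sqrt(p·q)). The bin count and the toy pixel values are illustrative.

```python
import math


def histogram(pixels, bins=8, max_value=256):
    """Normalized intensity histogram of integer pixel values in [0, max_value)."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // max_value, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]


def hist_correlation(h1, h2):
    """Pearson correlation between two histograms (1.0 means identical shape)."""
    m1, m2 = sum(h1) / len(h1), sum(h2) / len(h2)
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1) * sum((b - m2) ** 2 for b in h2))
    return num / den if den > 0 else 0.0


def bhattacharyya_distance(h1, h2):
    """sqrt(1 - BC) for normalized histograms (0.0 means identical)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - bc))


h = histogram([0, 10, 50, 100, 200, 255, 128, 64])
same_corr = hist_correlation(h, h)
same_dist = bhattacharyya_distance(h, h)
disjoint_dist = bhattacharyya_distance([1.0, 0.0], [0.0, 1.0])
```

Higher correlation and lower distance both indicate that the translated images match the target domain's intensity distribution more closely, which is the direction of improvement reported for this entry.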
Details
Motivation: 深度学习在医学图像中常假设训练与测试数据同分布,但实际中不同设备或参数导致图像纹理和噪声差异,破坏该假设,导致模型泛化能力差,且逐设备重训练成本高。 Method: 将域适应任务建模为图像到图像翻译任务,采用改进的GAN框架,在保持图像内容的同时修改源域图像的纹理模式并去除混响噪声,使其与目标域对齐,并在两个包含三个域的颈动脉超声数据集上验证。 Result: 模型成功实现了纹理对齐和噪声去除,域适应效果优于无适应方法:直方图相关性更高(0.960 vs 0.916;0.920 vs 0.890),巴塔恰里亚距离更小(0.040 vs 0.090;0.085 vs 0.121),且优于CycleGAN方法。 Conclusion: 所提GAN-based方法能有效实现超声图像的域适应,提升模型跨设备泛化能力,减少重训练需求,具有临床应用潜力。 Abstract: Deep learning has been extensively used in medical imaging applications, assuming that the test and training datasets belong to the same probability distribution. However, a common challenge arises when working with medical images generated by different systems or even the same system with different parameter settings. Such images contain diverse textures and reverberation noise that violate the aforementioned assumption. Consequently, models trained on data from one device or setting often struggle to perform effectively with data from other devices or settings. In addition, retraining models for each specific device or setting is labor-intensive and costly. To address these issues in ultrasound images, we propose a novel Generative Adversarial Network (GAN)-based model. We formulated the domain adaptation tasks as an image-to-image translation task, in which we modified the texture patterns and removed reverberation noise in the test data images from the source domain to align with those in the target domain images while keeping the image content unchanged. We applied the proposed method to two datasets containing carotid ultrasound images from three different domains. The experimental results demonstrate that the model successfully translated the texture pattern of images and removed reverberation noise from the ultrasound images. Furthermore, we evaluated the CycleGAN approaches for a comparative study with the proposed model. 
The experimental findings demonstrated that the proposed model achieved domain adaptation on both datasets, with histogram correlation of 0.960 (0.019) and 0.920 (0.043) and Bhattacharyya distance of 0.040 (0.020) and 0.085 (0.048), compared with 0.916 (0.062) and 0.890 (0.077), and 0.090 (0.070) and 0.121 (0.095), respectively, with no adaptation.
[150] Robust Ship Detection and Tracking Using Modified ViBe and Backwash Cancellation Algorithm
Mohammad Hassan Saghafi,Seyed Majid Noorhosseini,Seyed Abolfazl Seyed Javadein,Hadi Khalili
Main category: cs.CV
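TL;DR: Proposes a modified ViBe method for ship detection and tracking in coastal video sequences that offers real-time performance, robustness, and fast background updating, and combines geometric features with brightness distortion to cancel backwash.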
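For orientation, the sketch below shows the per-pixel core of the original ViBe background subtraction algorithm on grayscale values, not the modified version proposed in this entry, whose changes the abstract does not detail. The defaults (radius 20, 2 required matches, subsampling factor 16) follow the values commonly cited for original ViBe. Spatial propagation of updates to neighbouring pixels is omitted for brevity.

```python
import random


def is_background(pixel, samples, radius=20, min_matches=2):
    """A pixel is background if at least min_matches stored samples
    lie within `radius` of its current intensity."""
    matches = 0
    for s in samples:
        if abs(pixel - s) < radius:
            matches += 1
            if matches >= min_matches:
                return True
    return False


def update_model(pixel, samples, subsampling=16, rng=random):
    """Conservative random update, intended only for background pixels:
    with probability 1/subsampling, overwrite one randomly chosen sample."""
    if rng.randrange(subsampling) == 0:
        samples[rng.randrange(len(samples))] = pixel


# Toy per-pixel model with four stored samples.
samples = [100, 102, 98, 150]
calm_water = is_background(101, samples)    # close to several samples
ship_pixel = is_background(200, samples)    # far from all samples
```

Because samples are replaced at random rather than oldest-first, the model forgets old background values smoothly, which is one reason this family of methods adapts quickly to waves and lighting changes.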
Details
Motivation: 沿海场景复杂多变,存在海浪、光照变化等干扰,传统检测方法难以稳定检测船舶,需设计更鲁棒的实时检测方法。 Method: 采用改进的ViBe算法进行运动目标检测,降低漏检概率;利用船舶的几何特性及亮度畸变概念,提出新的背流消除方法。 Result: 实验结果表明该方法在船舶检测与跟踪中表现优异,具备实时性和高精度,能有效应对自然海浪和光照变化。 Conclusion: 所提出的改进ViBe与背流消除策略在复杂沿海环境中实现了鲁棒、快速且精确的船舶检测与跟踪,适用于实际监控应用。 Abstract: In this paper, we propose a robust real time detection and tracking method for detecting ships in a coastal video sequences. Since coastal scenarios are unpredictable and scenes have dynamic properties it is essential to apply detection methods that are robust to these conditions. This paper presents modified ViBe for moving object detection which detects ships and backwash. In the modified ViBe the probability of losing ships is decreased in comparison with the original ViBe. It is robust to natural sea waves and variation of lights and is capable of quickly updating the background. Based on geometrical properties of ship and some concepts such as brightness distortion, a new method for backwash cancellation is proposed. Experimental results demonstrate that the proposed strategy and methods have outstanding performance in ship detection and tracking. These results also illustrate real time and precise performance of the proposed strategy.[151] Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization
Xinyu Qiu,Heng Jia,Zhengwen Zeng,Shuheng Shen,Changhua Meng,Yi Yang,Linchao Zhu
Main category: cs.CV
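TL;DR: Proposes ADPO, a unified reinforcement learning framework that jointly learns answer generation and self-verification in a single policy through a decoupled optimization mechanism and a preference verification reward, significantly improving verification performance and reducing inference time.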
Details
Motivation: 现有的并行测试时扩展方法需要训练独立的生成和验证模型,导致高昂的训练和推理成本,且难以协同优化生成与验证过程。 Method: ADPO引入偏好验证奖励,利用正负样本的平均验证得分作为决策阈值,并在预测正确性与答案正确性一致时提供正反馈;同时采用解耦优化机制,分别计算生成与验证的优势,通过令牌掩码隔离梯度,并结合掩码GRPO目标函数,实现生成质量保持与验证分数校准的协同优化。 Result: ADPO实现了最高+34.1%的验证AUC提升和-53.5%的推理时间降低,在MathVista/MMMU上准确率分别提升+2.8%/+1.4%,ReasonSeg的cIoU提升+1.9,AndroidControl/GUI Odyssey的步成功率达到+1.7%/+1.0%。 Conclusion: ADPO通过统一框架有效整合生成与验证任务,解决了传统方法中训练与推理成本高、优化不协同的问题,在多个基准上显著优于现有方法,具备高效性和可扩展性。 Abstract: Parallel test-time scaling typically trains separate generation and verification models, incurring high training and inference costs. We propose Advantage Decoupled Preference Optimization (ADPO), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. ADPO introduces two innovations: a preference verification reward improving verification capability and a decoupled optimization mechanism enabling synergistic optimization of generation and verification. Specifically, the preference verification reward computes mean verification scores from positive and negative samples as decision thresholds, providing positive feedback when prediction correctness aligns with answer correctness. Meanwhile, the advantage decoupled optimization computes separate advantages for generation and verification, applies token masks to isolate gradients, and combines masked GRPO objectives, preserving generation quality while calibrating verification scores. ADPO achieves up to +34.1% higher verification AUC and -53.5% lower inference time, with significant gains of +2.8%/+1.4% accuracy on MathVista/MMMU, +1.9 cIoU on ReasonSeg, and +1.7%/+1.0% step success rate on AndroidControl/GUI Odyssey.[152] Higher-Order Domain Generalization in Magnetic Resonance-Based Assessment of Alzheimer's Disease
Zobia Batool,Diala Lteif,Vijaya B. Kolachalama,Huseyin Ozkan,Erchan Aptoula
Main category: cs.CV
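TL;DR: This paper proposes Extended MixStyle (EM), a framework that blends higher-order feature moments (skewness and kurtosis) to improve the cross-dataset generalization of Alzheimer's disease (AD) classification, outperforming existing single-domain generalization methods.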
Details
Motivation: Domain shifts caused by differences in scanners, protocols, and patient demographics make existing sMRI-based deep learning models perform poorly on new cohorts, which limits their real-world use.
Method: The Extended MixStyle (EM) framework mixes higher-order feature statistics (such as skewness and kurtosis) during training to simulate diverse distributional variations and make the model more robust to unseen domains. The model is trained on sMRI data from NACC and tested on three unseen external cohorts.
Result: EM improves macro-F1 by 2.4 percentage points on average across the three unseen cohorts, outperforming state-of-the-art single-domain generalization methods.
Conclusion: EM improves the cross-domain generalization of AD classification models and shows potential for reliable diagnosis in heterogeneous real-world settings.
Abstract: Despite progress in deep learning for Alzheimer's disease (AD) diagnostics, models trained on structural magnetic resonance imaging (sMRI) often do not perform well when applied to new cohorts due to domain shifts from varying scanners, protocols and patient demographics. AD, the primary driver of dementia, manifests through progressive cognitive and neuroanatomical changes like atrophy and ventricular expansion, making robust, generalizable classification essential for real-world use. While convolutional neural networks and transformers have advanced feature extraction via attention and fusion techniques, single-domain generalization (SDG) remains underexplored yet critical, given the fragmented nature of AD datasets. To bridge this gap, we introduce Extended MixStyle (EM), a framework for blending higher-order feature moments (skewness and kurtosis) to mimic diverse distributional variations. Trained on sMRI data from the National Alzheimer's Coordinating Center (NACC; n=4,647) to differentiate persons with normal cognition (NC) from those with mild cognitive impairment (MCI) or AD and tested on three unseen cohorts (total n=3,126), EM yields enhanced cross-domain performance, improving macro-F1 on average by 2.4 percentage points over state-of-the-art SDG benchmarks, underscoring its promise for invariant, reliable AD detection in heterogeneous real-world settings. The source code will be made available upon acceptance at https://github.com/zobia111/Extended-Mixstyle.
[153] DeepInv: A Novel Self-supervised Learning Approach for Fast and Accurate Diffusion Inversion
Ziyue Zhang,Luxi Lin,Xiaolin Hu,Chao Chang,HuaiXi Wang,Yiyi Zhou,Rongrong Ji
Main category: cs.CV
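TL;DR: This paper proposes DeepInv, a new self-supervised diffusion inversion method that introduces a self-supervised objective and a data augmentation strategy to generate high-quality pseudo noise, achieving fast and accurate image-to-noise mapping.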
Details
Motivation: Existing diffusion inversion methods lack effective supervision signals and therefore usually rely on approximate solutions, which limits performance or efficiency.
Method: A self-supervised objective and a data augmentation strategy are proposed to generate pseudo noise, and an iterative multi-scale training regime is used to train a parameterized inversion solver.
Result: Experiments show that DeepInv outperforms existing methods in both performance and inference speed, for example +40.435% SSIM over EasyInv and +9887.5% speed over ReNoise on the COCO dataset.
Conclusion: DeepInv is the first trainable method that predicts inversion noise step by step, offering a new direction for diffusion model inversion.
Abstract: Diffusion inversion is the task of recovering the noise of an image in a diffusion model, which is vital for controllable diffusion image editing. At present, diffusion inversion remains a challenging task due to the lack of viable supervision signals. Thus, most existing methods resort to approximation-based solutions, which often come at the cost of performance or efficiency. To remedy these shortcomings, we propose a novel self-supervised diffusion inversion approach in this paper, termed Deep Inversion (DeepInv). Instead of requiring ground-truth noise annotations, we introduce a self-supervised objective as well as a data augmentation strategy to generate high-quality pseudo noises from real images without manual intervention. Based on these two innovative designs, DeepInv is also equipped with an iterative and multi-scale training regime to train a parameterized inversion solver, thereby achieving fast and accurate image-to-noise mapping. To the best of our knowledge, this is the first attempt at presenting a trainable solver to predict inversion noise step by step. Extensive experiments show that DeepInv can achieve much better performance and inference speed than the compared methods, e.g., +40.435% SSIM over EasyInv and +9887.5% speed over ReNoise on the COCO dataset. Moreover, our careful designs of trainable solvers can also provide insights to the community. Code and model parameters will be released at https://github.com/potato-kitty/DeepInv.
[154] DiffKD-DCIS: Predicting Upgrade of Ductal Carcinoma In Situ with Diffusion Augmentation and Knowledge Distillation
Tao Li,Qing Li,Na Li,Hui Xie
Main category: cs.CV
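TL;DR: Proposes the DiffKD-DCIS framework, which combines a conditional diffusion model with knowledge distillation to improve ultrasound-based prediction of the upgrade of DCIS to IDC.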
Details
Motivation: Traditional deep learning methods struggle to predict the upgrade of DCIS to IDC accurately because ultrasound data are limited and generalization is poor.
Method: A three-stage framework is built: (1) a conditional diffusion model generates high-quality ultrasound images for data augmentation; (2) a deep teacher network extracts features from real and synthetic data; (3) a lightweight student network learns through knowledge distillation, balancing efficiency and generalization.
Result: Validated on multi-center data from 1,435 cases, the synthetic images are of high quality. The student network has fewer parameters and faster inference, outperforms some of the compared models on external test sets, and reaches accuracy comparable to senior radiologists.
Conclusion: The DiffKD-DCIS framework alleviates data scarcity, improves model generalization and practicality, and shows significant potential for clinical use.
Abstract: Accurately predicting the upgrade of ductal carcinoma in situ (DCIS) to invasive ductal carcinoma (IDC) is crucial for surgical planning. However, traditional deep learning methods face challenges due to limited ultrasound data and poor generalization ability. This study proposes the DiffKD-DCIS framework, integrating conditional diffusion modeling with teacher-student knowledge distillation. The framework operates in three stages: First, a conditional diffusion model generates high-fidelity ultrasound images using multimodal conditions for data augmentation. Then, a deep teacher network extracts robust features from both original and synthetic data. Finally, a compact student network learns from the teacher via knowledge distillation, balancing generalization and computational efficiency. Evaluated on a multi-center dataset of 1,435 cases, the synthetic images were of good quality. The student network had fewer parameters and faster inference. On external test sets, it outperformed partial combinations, and its accuracy was comparable to senior radiologists and superior to junior ones, showing significant clinical potential.
[155] A Novel Deep Learning Method for Segmenting the Left Ventricle in Cardiac Cine MRI
Wenhui Chu,Aobo Jin,Hardik A. Gohel
Main category: cs.CV
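TL;DR: Proposes GBU-Net, a deep learning network based on a group-batch-normalized U-Net, for precise semantic segmentation of the left ventricle in short-axis cine MRI.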
Details
Motivation: Traditional CNNs fail to capture enough contextual information in cardiac MRI segmentation, which limits accuracy, so a model that better understands the relation between local and global context is needed.
Method: GBU-Net is designed and implemented by combining group normalization and batch normalization, strengthening feature extraction and detail restoration in the down-sampling and up-sampling paths of U-Net, with modifications that improve contextual understanding of medical images.
Result: Tested on a dataset of 805 MRI scans, GBU-Net clearly outperforms existing methods on metrics such as the Dice coefficient and mean perpendicular distance. An ensemble reaches a 97% Dice score on the SunnyBrook test set.
Conclusion: With its enhanced normalization structure and context modeling, GBU-Net achieves more precise left ventricle segmentation and is applicable to surgical robotics and clinical medical analysis.
Abstract: This research aims to develop a novel deep learning network, GBU-Net, utilizing a group-batch-normalized U-Net framework, specifically designed for the precise semantic segmentation of the left ventricle in short-axis cine MRI scans. The methodology includes a down-sampling pathway for feature extraction and an up-sampling pathway for detail restoration, enhanced for medical imaging. Key modifications include techniques for better contextual understanding, which is crucial in cardiac MRI segmentation. The dataset consists of 805 left ventricular MRI scans from 45 patients, with comparative analysis using established metrics such as the Dice coefficient and mean perpendicular distance. GBU-Net significantly improves the accuracy of left ventricle segmentation in cine MRI scans. Its design outperforms existing methods in tests on standard metrics such as the Dice coefficient and mean perpendicular distance. The approach is unique in its ability to capture contextual information, often missed in traditional CNN-based segmentation. An ensemble of GBU-Net models attains a 97% Dice score on the SunnyBrook testing dataset. GBU-Net offers enhanced precision and contextual understanding in left ventricle segmentation for surgical robotics and medical analysis.
[156] FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation
Gen Li,Peiyu Liu
Main category: cs.CV
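TL;DR: Proposes VideoSpeculateRAG, an efficient vision-language model framework that combines speculative decoding with retrieval augmentation to improve inference speed and accuracy.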
Details
Motivation: Existing Retrieval-Augmented Generation (RAG) methods for vision-language models are inefficient and give unstable answer quality, with errors arising in particular from entity recognition and alignment.
Method: A lightweight draft model generates multiple candidate answers, which a heavyweight model verifies and refines. A similarity-based filtering strategy improves entity alignment in the retrieved knowledge.
Result: Compared with standard RAG methods, inference is about 2x faster while accuracy is maintained or improved.
Conclusion: Combining speculative decoding with retrieval augmentation can improve the efficiency and reliability of complex multimodal tasks.
Abstract: Vision-Language Models (VLMs) excel at visual reasoning but still struggle with integrating external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify a major source of error - incorrect entity recognition in retrieved knowledge - and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer accuracy. Experiments demonstrate that VideoSpeculateRAG achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x. Our framework highlights the potential of combining speculative decoding with retrieval-augmented reasoning to enhance efficiency and reliability in complex, knowledge-intensive multimodal tasks.
[157] BARE: Towards Bias-Aware and Reasoning-Enhanced One-Tower Visual Grounding
Hongbing Li,Linhui Xiao,Zihan Zhao,Qi Shen,Yixiang Huang,Bo Xiao,Zhanyu Ma
Main category: cs.CV
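TL;DR: This paper proposes BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. Three new modules mitigate multimodal distractions and improve comprehension of referring expressions, achieving state-of-the-art performance on five benchmarks with higher computational efficiency.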
Details
Motivation: Existing one-tower visual grounding methods suffer from over-entangled multimodal representations and insufficient semantic reasoning, which lead to modality bias and difficulty in understanding referring expressions.
Method: The BARE framework comprises three modules (a language salience modulator, visual bias correction, and referential relationship enhancement) that preserve modality-specific features and strengthen the construction of referential semantics.
Result: Experiments on five benchmark datasets show that BARE reaches state-of-the-art performance while being more computationally efficient.
Conclusion: BARE mitigates multimodal bias and strengthens semantic reasoning, providing an efficient and strong solution for one-tower visual grounding.
Abstract: Visual Grounding (VG), which aims to locate a specific region referred to by expressions, is a fundamental yet challenging task in multimodal understanding. While recent grounding transfer works have advanced the field through one-tower architectures, they still suffer from two primary limitations: (1) over-entangled multimodal representations that exacerbate deceptive modality biases, and (2) insufficient semantic reasoning that hinders the comprehension of referential cues. In this paper, we propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. BARE introduces a mechanism that preserves modality-specific features and constructs referential semantics through three novel modules: (i) language salience modulator, (ii) visual bias correction and (iii) referential relationship enhancement, which jointly mitigate multimodal distractions and enhance referential comprehension. Extensive experimental results on five benchmarks demonstrate that BARE not only achieves state-of-the-art performance but also delivers superior computational efficiency compared to existing approaches. The code is publicly accessible at https://github.com/Marloweeee/BARE.
[158] DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving
Yang Zhou,Hao Shao,Letian Wang,Zhuofan Zong,Hongsheng Li,Steven L. Waslander
Main category: cs.CV
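TL;DR: DrivingGen is the first comprehensive benchmark for generative driving world models. It addresses the shortcomings of existing evaluations in visual realism, trajectory plausibility, temporal coherence, and controllability, and uses a diverse dataset and new metrics to advance deployable autonomous driving simulation.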
Details
Motivation: Existing video generation evaluations have clear limitations in autonomous driving: they lack systematic assessment of safety-critical factors, trajectory plausibility, temporal consistency, and controllability, and their datasets do not cover enough conditions to support real-world deployment.
Method: DrivingGen combines a diverse evaluation dataset drawn from driving datasets and internet-scale video with a suite of metrics covering visual realism, trajectory plausibility, temporal coherence, and controllability, and benchmarks 14 state-of-the-art models.
Result: The evaluation reveals a trade-off between general and driving-specific models: general models look better but violate physics, while driving-specific models move more realistically but have lower visual quality. DrivingGen distinguishes model capabilities at a fine-grained level.
Conclusion: DrivingGen provides the first comprehensive and rigorous evaluation framework for generative driving world models, helping steer research toward more reliable, controllable, and deployable models and advancing simulation, planning, and decision-making in autonomous driving.
Abstract: Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.
[159] Improving Flexible Image Tokenizers for Autoregressive Image Generation
Zixuan Fu,Lanqing Guo,Chong Wang,Binbin Song,Ding Liu,Bihan Wen
Main category: cs.CV
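TL;DR: This paper proposes ReToK, a flexible image tokenizer that uses redundant token padding and hierarchical semantic regularization to improve the representational capacity of variable-length image token sequences and thereby autoregressive image generation.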
Details
Motivation: Existing nested dropout methods concentrate image information in the leading tokens, which limits generation quality for long sequences, so a method that uses all tokens more evenly is needed.
Method: Redundant Token Padding is introduced to activate tail tokens more frequently. It is combined with Hierarchical Semantic Regularization, which aligns the earlier tokens with a pre-trained vision model while progressively weakening the regularization toward the tail to recover detail.
Result: Experiments on ImageNet 256×256 show that ReToK achieves better generation performance than existing flexible and fixed-length tokenizers.
Conclusion: ReToK alleviates information concentration, makes full use of the whole token sequence, and improves the quality and flexibility of AR image generation.
Abstract: Flexible image tokenizers aim to represent an image using an ordered 1D variable-length token sequence. This flexible tokenization is typically achieved through nested dropout, where a portion of trailing tokens is randomly truncated during training, and the image is reconstructed using the remaining preceding sequence. However, this tail-truncation strategy inherently concentrates the image information in the early tokens, limiting the effectiveness of downstream AutoRegressive (AR) image generation as the token length increases. To overcome these limitations, we propose ReToK, a flexible tokenizer with Redundant Token Padding and Hierarchical Semantic Regularization, designed to fully exploit all tokens for enhanced latent modeling. Specifically, we introduce Redundant Token Padding to activate tail tokens more frequently, thereby alleviating information over-concentration in the early tokens. In addition, we apply Hierarchical Semantic Regularization to align the decoding features of earlier tokens with those from a pre-trained vision foundation model, while progressively reducing the regularization strength toward the tail to allow finer low-level detail reconstruction. Extensive experiments demonstrate the effectiveness of ReTok: on ImageNet 256×256, our method achieves superior generation performance compared with both flexible and fixed-length tokenizers. Code will be available at: https://github.com/zfu006/ReTok
[160] FAR-AMTN: Attention Multi-Task Network for Face Attribute Recognition
Gong Gao,Zekai Wang,Xianhui Liu,Weidong Zhao
Main category: cs.CV
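TL;DR: Proposes FAR-AMTN, an attention multi-task network for face attribute recognition that uses a parameter-sharing attention module and a cross-group feature fusion mechanism to improve generalization while reducing model parameters.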
Details
Motivation: Traditional multi-task networks for face attribute recognition cause parameter counts to grow sharply and restrict high-level feature interaction, which makes it hard to exploit semantic relations among attributes and hurts generalization.
Method: FAR-AMTN comprises a Weight-Shared Group-Specific Attention (WSGSA) module, a Cross-Group Feature Fusion (CGFF) module, and a Dynamic Weighting Strategy (DWS) to strengthen feature representation and collaborative learning across tasks.
Result: Experiments on the CelebA and LFWA datasets show that the method achieves higher recognition accuracy with significantly fewer parameters.
Conclusion: FAR-AMTN balances model complexity and performance and improves the generalization of multi-task face attribute recognition by strengthening intra-group and inter-group feature interaction.
Abstract: To enhance the generalization performance of Multi-Task Networks (MTN) in Face Attribute Recognition (FAR), it is crucial to share relevant information across multiple related prediction tasks effectively. Traditional MTN methods create shared low-level modules and distinct high-level modules, causing an exponential increase in model parameters with the addition of tasks. This approach also limits feature interaction at the high level, hindering the exploration of semantic relations among attributes and thereby harming generalization. In response, this study introduces FAR-AMTN, a novel Attention Multi-Task Network for FAR. It incorporates a Weight-Shared Group-Specific Attention (WSGSA) module with shared parameters to minimize complexity while improving group feature representation. Furthermore, a Cross-Group Feature Fusion (CGFF) module is utilized to foster interactions between attribute groups, enhancing feature learning. A Dynamic Weighting Strategy (DWS) is also introduced for synchronized task convergence. Experiments on the CelebA and LFWA datasets show that the proposed FAR-AMTN achieves superior accuracy with significantly fewer parameters compared to existing models.
[161] EscherVerse: An Open World Benchmark and Dataset for Teleo-Spatial Intelligence with Physical-Dynamic and Intent-Driven Understanding
Tianjun Gu,Chenghua Gong,Jingyu Gong,Zhizhong Zhang,Yuan Xie,Lizhuang Ma,Xin Tan
Main category: cs.CV
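TL;DR: This paper proposes Teleo-Spatial Intelligence (TSI), a new paradigm that combines physical-dynamic reasoning with intent-driven reasoning, and releases EscherVerse (a benchmark, a dataset, and models) to advance research on spatial intelligence in dynamic human scenes.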
Details
Motivation: Existing research overlooks the human intent behind spatial changes and cannot model physical mechanisms and human goals in a unified way.
Method: The TSI paradigm is proposed, together with Escher-Bench (a large-scale open-world benchmark derived from real videos), the Escher-35k dataset, and the Escher model series. A new data pipeline supports combined evaluation of object permanence, state transitions, trajectory prediction, and intent understanding.
Result: EscherVerse is the first benchmark to systematically assess intent-driven reasoning, and it significantly improves models' spatial reasoning in dynamic, human-centric scenes.
Conclusion: TSI moves spatial intelligence from passive scene description toward active, purpose-driven understanding of the world, providing a foundation for future intelligent systems.
Abstract: The ability to reason about spatial dynamics is a cornerstone of intelligence, yet current research overlooks the human intent behind spatial changes. To address these limitations, we introduce Teleo-Spatial Intelligence (TSI), a new paradigm that unifies two critical pillars: Physical-Dynamic Reasoning (understanding the physical principles of object interactions) and Intent-Driven Reasoning (inferring the human goals behind these actions). To catalyze research in TSI, we present EscherVerse, consisting of a large-scale, open-world benchmark (Escher-Bench), a dataset (Escher-35k), and models (Escher series). Derived from real-world videos, EscherVerse moves beyond constrained settings to explicitly evaluate an agent's ability to reason about object permanence, state transitions, and trajectory prediction in dynamic, human-centric scenarios. Crucially, it is the first benchmark to systematically assess Intent-Driven Reasoning, challenging models to connect physical events to their underlying human purposes. Our work, including a novel data curation pipeline, provides a foundational resource to advance spatial intelligence from passive scene description toward a holistic, purpose-driven understanding of the world.
[162] Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
Haonan Cai,Yuxuan Luo,Zhouhui Lian
Main category: cs.CV
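TL;DR: Proposes GAR-Font, a new autoregressive framework for multimodal few-shot font generation that uses a global-aware tokenizer and a language-style adapter to improve the structural and stylistic consistency of generated fonts.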
Details
Motivation: Existing few-shot font generation methods struggle to preserve both structural integrity and stylistic fidelity, and they ignore the role of language in expressing style.
Method: A global-aware tokenizer captures local structures and global style. A multimodal style encoder fuses visual and textual style inputs, with a lightweight language-style adapter for flexible control, and a post-refinement step improves structural and stylistic consistency.
Result: Experiments show that GAR-Font outperforms existing methods in global style consistency and generation quality, especially under textual style guidance.
Conclusion: GAR-Font improves structural fidelity and style controllability in few-shot font generation and confirms the benefit of multimodal input and global modeling for font generation.
Abstract: Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.
[163] Guiding Token-Sparse Diffusion Models
Felix Krause,Stefan Andreas Baumann,Johannes Schusterbauer,Olga Grebenkova,Ming Gui,Vincent Tao Hu,Björn Ommer
Main category: cs.CV
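TL;DR: This paper proposes Sparse Guidance (SG), a new method that exploits token-level sparsity at inference to improve sparsely trained diffusion models, addressing the poor performance of existing methods under Classifier-free Guidance.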
Details
Motivation: Sparsely trained diffusion models reduce training cost, but their weak response to Classifier-free Guidance leads to poor quality and low diversity at inference.
Method: Sparse Guidance (SG) uses token-level sparsity instead of conditional dropout as the guidance signal at inference, preserving the high variance of the conditional prediction, and pairs it with training-time sparsity for efficient inference.
Result: SG reaches 1.58 FID on ImageNet-256 with 25% fewer FLOPs and saves up to 58% of compute at matched quality. In a 2.5B-parameter text-to-image model, SG increases throughput while improving composition and human preference score.
Conclusion: Sparse Guidance improves both the quality and the efficiency of sparsely trained diffusion models at inference, lowering compute cost while keeping generation quality high.
Abstract: Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle in inference. This is due to their weak response to Classifier-free Guidance (CFG), which leads to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG better preserves the high variance of the conditional prediction, achieving good quality and high variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.
[164] CAP-IQA: Context-Aware Prompt-Guided CT Image Quality Assessment
Kazi Ramisa Rifa,Jie Zhang,Abdullah Imran
Main category: cs.CV
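TL;DR: Proposes CAP-IQA, a context-aware prompt-guided image quality assessment framework for CT images that combines text priors with instance context and applies causal debiasing, achieving superior results on both public and private datasets.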
Details
Motivation: Existing prompt-based methods for CT image quality assessment introduce bias from idealized definitions and adapt poorly to real-world degradations, so a debiasing method that separates idealized knowledge from actual degradation is needed.
Method: The CAP-IQA framework combines a CNN visual encoder with a domain-specific text encoder, uses radiology-style prompts and context-aware fusion, introduces instance-level context prompts, and applies causal debiasing to decouple idealized priors from image-specific degradations.
Result: On the 2023 LDCTIQA challenge benchmark, CAP-IQA reaches an overall correlation score of 2.8590, 4.24% above the top-ranked team. Evaluation on an in-house dataset of 91,514 pediatric CT images confirms good generalization.
Conclusion: By fusing text priors with instance context and applying causal debiasing, CAP-IQA improves the accuracy and interpretability of CT image quality assessment and generalizes well across patient populations.
Abstract: Prompt-based methods, which encode medical priors through descriptive text, have been only minimally explored for CT Image Quality Assessment (IQA). While such prompts can embed prior knowledge about diagnostic quality, they often introduce bias by reflecting idealized definitions that may not hold under real-world degradations such as noise, motion artifacts, or scanner variability. To address this, we propose the Context-Aware Prompt-guided Image Quality Assessment (CAP-IQA) framework, which integrates text-level priors with instance-level context prompts and applies causal debiasing to separate idealized knowledge from factual, image-specific degradations. Our framework combines a CNN-based visual encoder with a domain-specific text encoder to assess diagnostic visibility, anatomical clarity, and noise perception in abdominal CT images. The model leverages radiology-style prompts and context-aware fusion to align semantic and perceptual representations. On the 2023 LDCTIQA challenge benchmark, CAP-IQA achieves an overall correlation score of 2.8590 (sum of PLCC, SROCC, and KROCC), surpassing the top-ranked leaderboard team (2.7427) by 4.24%. Moreover, our comprehensive ablation experiments confirm that prompt-guided fusion and the simplified encoder-only design jointly enhance feature alignment and interpretability. Furthermore, evaluation on an in-house dataset of 91,514 pediatric CT images demonstrates the true generalizability of CAP-IQA in assessing perceptual fidelity in a different patient population.
[165] An Empirical Study of Monocular Human Body Measurement Under Weak Calibration
Gaurav Sekar
Main category: cs.CV
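TL;DR: This paper systematically evaluates three weakly calibrated monocular methods for estimating human body measurements from monocular RGB images, focusing on how different calibration assumptions affect measurement stability and error patterns.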
Details
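Motivation: Scale ambiguity, viewpoint sensitivity, and the absence of depth information make it hard to estimate body measurements accurately from monocular images, and the robustness and applicability of existing methods on real consumer devices remain unclear.
Method: Three weak-calibration strategies (landmark-based geometry, pose-driven regression, and object-calibrated silhouettes) are compared empirically under semi-constrained conditions using consumer-grade cameras.
Result: Experiments show a clear trade-off between user effort during calibration and the stability of circumference measurements, and the methods exhibit different robustness and failure modes across body types.
Conclusion: The study offers an empirical design reference for lightweight monocular body measurement systems on consumer devices and stresses that calibration cost and measurement reliability should be balanced according to the application.
Abstract: Estimating human body measurements from monocular RGB imagery remains challenging due to scale ambiguity, viewpoint sensitivity, and the absence of explicit depth information. This work presents a systematic empirical study of three weakly calibrated monocular strategies: landmark-based geometry, pose-driven regression, and object-calibrated silhouettes, evaluated under semi-constrained conditions using consumer-grade cameras. Rather than pursuing state-of-the-art accuracy, the study analyzes how differing calibration assumptions influence measurement behavior, robustness, and failure modes across varied body types. The results reveal a clear trade-off between user effort during calibration and the stability of resulting circumferential quantities. This paper serves as an empirical design reference for lightweight monocular human measurement systems intended for deployment on consumer devices.
[166] Animated 3DGS Avatars in Diverse Scenes with Consistent Lighting and Shadows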
Aymen Mir,Riza Alp Guler,Jian Wang,Gerard Pons-Moll,Bing Zhou
Main category: cs.CV
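TL;DR: Proposes a real-time method for consistent lighting and shadows based on 3D Gaussian Splatting (3DGS), using Deep Gaussian Shadow Maps (DGSM) and spherical harmonic lighting to achieve dynamic shadows and relighting without meshing.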
Details
Motivation: The 3D Gaussian Splatting representation lacks an effective mechanism for dynamic shadows and consistent lighting, so avatars do not blend naturally into scenes.
Method: Deep Gaussian Shadow Maps (DGSM) exploit the closed-form light accumulation of 3DGS along light rays and store transmittance over radial shells in octahedral atlases for real-time shadows. HDR environment probes are represented in a spherical harmonic (SH) basis and a fast per-Gaussian radiance transfer is applied for relighting.
Result: The method is validated on avatars from AvatarX and ActorsHQ, composited naturally into scenes such as ScanNet++ and DL3DV, and supports dynamic interaction and object insertion in single-avatar and multi-avatar settings.
Conclusion: The method runs entirely in the volumetric 3DGS representation and avoids meshing, producing efficient and consistent dynamic shadows and lighting and improving the visual coherence between avatars and scenes.
Abstract: We present a method for consistent lighting and shadows when animated 3D Gaussian Splatting (3DGS) avatars interact with 3DGS scenes or with dynamic objects inserted into otherwise static scenes. Our key contribution is Deep Gaussian Shadow Maps (DGSM), a modern analogue of the classical shadow mapping algorithm tailored to the volumetric 3DGS representation. Building on the classic deep shadow mapping idea, we show that 3DGS admits closed-form light accumulation along light rays, enabling volumetric shadow computation without meshing. For each estimated light, we tabulate transmittance over concentric radial shells and store them in octahedral atlases, which modern GPUs can sample in real time per query to attenuate affected scene Gaussians and thus cast and receive shadows consistently. To relight moving avatars, we approximate the local environment illumination with HDRI probes represented in a spherical harmonic (SH) basis and apply a fast per-Gaussian radiance transfer, avoiding explicit BRDF estimation or offline optimization. We demonstrate environment-consistent lighting for avatars from AvatarX and ActorsHQ, composited into ScanNet++, DL3DV, and SuperSplat scenes, and show interactions with inserted objects. Across single and multi-avatar settings, DGSM and SH relighting operate fully in the volumetric 3DGS representation, yielding coherent shadows and relighting while avoiding meshing.
[167] LabelAny3D: Label Any Object 3D in the Wild
Jin Yao,Radowan Mahmud Redoy,Sebastian Elbaum,Matthew B. Dwyer,Zezhou Cheng
Main category: cs.CV
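TL;DR: Proposes the LabelAny3D framework and the COCO3D dataset, using an analysis-by-synthesis approach to generate high-quality monocular 3D detection annotations and improve 3D detection in open-vocabulary settings.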
Details
Motivation: Existing monocular 3D detection models perform poorly on in-the-wild images, mainly because in-the-wild 3D datasets are lacking and 3D annotation is difficult.
Method: The LabelAny3D analysis-by-synthesis framework uses foundation models to reconstruct complete 3D scenes from 2D images and produce high-quality 3D bounding box annotations. On this basis, COCO3D is built as a new benchmark derived from MS-COCO that covers a wider range of open-vocabulary object categories.
Result: Annotations generated by LabelAny3D improve monocular 3D detection across multiple benchmarks and surpass prior auto-labeling methods. COCO3D fills the category-coverage gap in existing 3D datasets.
Conclusion: Foundation-model-driven annotation is a promising route to scaling up 3D recognition in realistic open-world settings.
Abstract: Detecting objects in 3D space from monocular input is crucial for applications ranging from robotics to scene understanding. Despite advanced performance in the indoor and autonomous driving domains, existing monocular 3D detection models struggle with in-the-wild images due to the lack of 3D in-the-wild datasets and the challenges of 3D annotation. We introduce LabelAny3D, an analysis-by-synthesis framework that reconstructs holistic 3D scenes from 2D images to efficiently produce high-quality 3D bounding box annotations. Built on this pipeline, we present COCO3D, a new benchmark for open-vocabulary monocular 3D detection, derived from the MS-COCO dataset and covering a wide range of object categories absent from existing 3D datasets. Experiments show that annotations generated by LabelAny3D improve monocular 3D detection performance across multiple benchmarks, outperforming prior auto-labeling approaches in quality. These results demonstrate the promise of foundation-model-driven annotation for scaling up 3D recognition in realistic, open-world settings.
[168] Trustworthy Data-Driven Wildfire Risk Prediction and Understanding in Western Canada
Zhengsen Xu,Lanying Wang,Sibo Cheng,Xue Rui,Kyle Gao,Yimin Zhu,Mabel Heffring,Zack Dewis,Saeid Taleghanidoozdoozan,Megan Greenwood,Motasem Alkayid,Quinn Ledingham,Hongjie He,Jonathan Li,Lincoln Linlin Xu
Main category: cs.CV
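TL;DR: Proposes a trustworthy data-driven wildfire risk prediction framework based on long-sequence, multi-scale temporal modeling. It performs strongly on the extreme 2023 and 2024 fire seasons in western Canada, with high accuracy and low computational cost, and it quantifies predictive uncertainty and provides mechanistic interpretation.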
Details
Motivation: Existing wildfire risk prediction models neglect uncertainty quantification and lack interpretability, so they cope poorly with extreme fire events driven by complex interactions between climate and human activity. More reliable and interpretable prediction methods are needed.
Method: A long-sequence, multi-scale temporal modeling framework integrates heterogeneous drivers (such as meteorology, fuel, and topography), incorporates uncertainty quantification, and uses SHAP values for attribution to enable process-level interpretation.
Result: Evaluated on the 2023 and 2024 fire seasons in western Canada, the model achieves an F1 score of 0.90 and a PR-AUC of 0.98 at low computational cost. Uncertainty analysis reveals spatial and seasonal patterns in predictive confidence. SHAP analysis shows that temperature-related drivers dominate wildfire risk, while moisture-related drivers have a stronger influence on regional contrasts in 2024.
Conclusion: The framework improves the accuracy and reliability of wildfire risk prediction and strengthens decision support through uncertainty quantification and interpretability, making it suitable for wildfire risk management in complex environments.
Abstract: In recent decades, the intensification of wildfire activity in western Canada has resulted in substantial socio-economic and environmental losses. Accurate wildfire risk prediction is hindered by the intrinsic stochasticity of ignition and spread and by nonlinear interactions among fuel conditions, meteorology, climate variability, topography, and human activities, challenging the reliability and interpretability of purely data-driven models. We propose a trustworthy data-driven wildfire risk prediction framework based on long-sequence, multi-scale temporal modeling, which integrates heterogeneous drivers while explicitly quantifying predictive uncertainty and enabling process-level interpretation. Evaluated over western Canada during the record-breaking 2023 and 2024 fire seasons, the proposed model outperforms existing time-series approaches, achieving an F1 score of 0.90 and a PR-AUC of 0.98 with low computational cost. Uncertainty-aware analysis reveals structured spatial and seasonal patterns in predictive confidence, highlighting increased uncertainty associated with ambiguous predictions and spatiotemporal decision boundaries. SHAP-based interpretation provides mechanistic understanding of wildfire controls, showing that temperature-related drivers dominate wildfire risk in both years, while moisture-related constraints play a stronger role in shaping spatial and land-cover-specific contrasts in 2024 compared to the widespread hot and dry conditions of 2023. Data and code are available at https://github.com/SynUW/mmFire.
[169] Evaluating Deep Learning-Based Face Recognition for Infants and Toddlers: Impact of Age Across Developmental Stages
Afzal Hossain,Mst Rumana Sumi,Stephanie Schuckers
Main category: cs.CV
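TL;DR: This study evaluates four deep learning face recognition models (FaceNet, ArcFace, MagFace, CosFace) on a longitudinal dataset of children aged 0 to 3, finds that recognition accuracy improves with age, and uses a DANN approach to reduce temporal drift, significantly improving verification across time.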
Details
Motivation: Infant and toddler faces change rapidly, show high inter-class similarity, and are scarce in datasets, so conventional face recognition models transfer poorly; robust biometric systems for young children are needed. Method: Evaluates four mainstream deep learning face recognition models on a longitudinal dataset collected over 24 months in seven sessions, and introduces a Domain Adversarial Neural Network (DANN) to mitigate feature drift caused by time gaps. Result: For infants aged 0 to 6 months, TAR at 0.1% FAR is only 30.7%, rising to 64.7% for the 2.5 to 3 year group; recognition accuracy is higher over shorter time gaps; applying DANN improves TAR by more than 12% and markedly strengthens the temporal stability of the features. Conclusion: Existing models still face significant challenges in infant and toddler face recognition, especially in the earliest stage of life; domain adaptation methods such as DANN can effectively improve temporal stability, providing a more reliable biometric basis for smart-city applications such as child healthcare, safety, and digital identity. Abstract: Face recognition for infants and toddlers presents unique challenges due to rapid facial morphology changes, high inter-class similarity, and limited dataset availability. This study evaluates the performance of four deep learning-based face recognition models (FaceNet, ArcFace, MagFace, and CosFace) on a newly developed longitudinal dataset collected over a 24-month period in seven sessions involving children aged 0 to 3 years. Our analysis examines recognition accuracy across developmental stages, showing that the True Accept Rate (TAR) is only 30.7% at 0.1% False Accept Rate (FAR) for infants aged 0 to 6 months, due to unstable facial features. Performance improves significantly in older children, reaching 64.7% TAR at 0.1% FAR in the 2.5 to 3 year age group. We also evaluate verification performance over different time intervals, revealing that shorter time gaps result in higher accuracy due to reduced embedding drift. To mitigate this drift, we apply a Domain Adversarial Neural Network (DANN) approach that improves TAR by over 12%, yielding features that are more temporally stable and generalizable. These findings are critical for building biometric systems that function reliably over time in smart city applications such as public healthcare, child safety, and digital identity services.
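As a rough illustration of the metric quoted above, the following sketch computes TAR at a fixed FAR from two lists of similarity scores. The function name `tar_at_far`, the toy scores, and the strict-inequality acceptance rule are assumptions made for illustration; the abstract does not describe the authors' evaluation code.

```python
def tar_at_far(genuine, impostor, target_far):
    """Return (TAR, achieved FAR, threshold) with achieved FAR <= target_far.

    genuine:  similarity scores for same-identity pairs
    impostor: similarity scores for different-identity pairs
    A pair is accepted when its score is strictly above the threshold.
    """
    ranked = sorted(impostor, reverse=True)
    # Number of impostor pairs that may be accepted without exceeding the target.
    allowed = int(target_far * len(ranked))
    if allowed >= len(ranked):
        threshold = float("-inf")  # every pair may be accepted
    else:
        # Scores strictly above this value number at most `allowed`.
        threshold = ranked[allowed]
    tar = sum(s > threshold for s in genuine) / len(genuine)
    far = sum(s > threshold for s in ranked) / len(ranked)
    return tar, far, threshold


# Toy example (made-up scores): 4 genuine pairs, 10 impostor pairs, FAR target 10%.
genuine_scores = [0.9, 0.8, 0.4, 0.3]
impostor_scores = [0.1, 0.2, 0.35, 0.5, 0.05, 0.15, 0.25, 0.3, 0.12, 0.22]
tar, far, thr = tar_at_far(genuine_scores, impostor_scores, 0.1)
print(tar, far, thr)  # 0.75 0.1 0.35
```

With a target as strict as 0.1% FAR, at least a thousand impostor pairs are needed before even one false accept is allowed, which is one reason such operating points are hard to estimate on small datasets.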
The challenges observed in early age groups highlight the importance of future research on privacy-preserving biometric authentication systems that can address temporal variability, particularly in secure and regulated urban environments where child verification is essential.[170] FALCON: Few-Shot Adversarial Learning for Cross-Domain Medical Image Segmentation
Abdur R. Fayjie,Pankhi Kashyap,Jutika Borah,Patrick Vandewalle
Main category: cs.CV
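TL;DR: Proposes FALCON, a cross-domain few-shot framework for 3D medical image segmentation that is meta-trained on natural images and transferred to the medical domain, achieving high-precision segmentation with low computational overhead.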
Details
Motivation: Addresses the challenges of scarce annotations, large inter-patient variability, privacy concerns, and high computational overhead in 3D medical image segmentation. Method: Processes 3D data as 2D slices; first meta-trains on natural images to learn general segmentation priors, then transfers to the medical domain via adversarial fine-tuning and boundary-aware learning, and performs task-aware inference using support samples. Result: On four benchmarks it achieves the lowest Hausdorff Distance and a Dice coefficient comparable to state-of-the-art models, while using less labeled data, no data augmentation, and lower computational overhead. Conclusion: FALCON effectively improves boundary accuracy in medical image segmentation and shows potential for clinical use, particularly where annotation resources are limited. Abstract: Precise delineation of anatomical and pathological structures within 3D medical volumes is crucial for accurate diagnosis, effective surgical planning, and longitudinal disease monitoring. Despite advancements in AI, clinically viable segmentation is often hindered by the scarcity of 3D annotations, patient-specific variability, data privacy concerns, and substantial computational overhead. In this work, we propose FALCON, a cross-domain few-shot segmentation framework that achieves high-precision 3D volume segmentation by processing data as 2D slices. The framework is first meta-trained on natural images to learn-to-learn generalizable segmentation priors, then transferred to the medical domain via adversarial fine-tuning and boundary-aware learning. Task-aware inference, conditioned on support cues, allows FALCON to adapt dynamically to patient-specific anatomical variations across slices. Experiments on four benchmarks demonstrate that FALCON consistently achieves the lowest Hausdorff Distance scores, indicating superior boundary accuracy while maintaining a Dice Similarity Coefficient comparable to the state-of-the-art models. Notably, these results are achieved with significantly less labeled data, no data augmentation, and substantially lower computational overhead.[171] Mitigating Longitudinal Performance Degradation in Child Face Recognition Using Synthetic Data
Afzal Hossain,Stephanie Schuckers
Main category: cs.CV
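TL;DR: This study examines whether synthetic face data can act as a longitudinal stabilizer that improves the temporal robustness of child face recognition models. Training on the YFA dataset is augmented with synthetic data generated by StyleGAN2 ADA; results show that, compared with using only real data or a pretrained model, synthetic augmentation substantially reduces verification error rates over gaps of 6 to 36 months.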
Details
Motivation: Children's faces grow rapidly and nonlinearly, causing template drift and rising verification errors over time, which makes longitudinal child face recognition challenging. Models therefore need better identity preservation over long time spans. Method: Using the MagFace model and an identity-disjoint protocol on the Young Face Aging (YFA) dataset, evaluates three settings: (i) pretrained MagFace embeddings without fine-tuning, (ii) MagFace fine-tuned on authentic faces only, and (iii) MagFace fine-tuned jointly on authentic faces and synthetic faces generated by StyleGAN2 ADA. Synthetic data is used only for training identities, and post-generation filtering reduces identity leakage and artifact-affected samples. Result: Across enrollment-verification gaps of 6 to 36 months, synthetic-augmented fine-tuning substantially reduces error rates, outperforming both the pretrained baseline and fine-tuning on real data only. Conclusion: Synthetic face data can effectively improve the longitudinal stability of child face recognition models, offering a low-risk, effective route to better identity persistence in pediatric face recognition. Abstract: Longitudinal face recognition in children remains challenging due to rapid and nonlinear facial growth, which causes template drift and increasing verification errors over time. This work investigates whether synthetic face data can act as a longitudinal stabilizer by improving temporal robustness of child face recognition models. Using an identity-disjoint protocol on the Young Face Aging (YFA) dataset, we evaluate three settings: (i) pretrained MagFace embeddings without dataset-specific fine-tuning, (ii) MagFace fine-tuned using authentic training faces only, and (iii) MagFace fine-tuned using a combination of authentic and synthetically generated training faces. Synthetic data is generated using StyleGAN2 ADA and incorporated exclusively within the training identities; a post-generation filtering step is applied to mitigate identity leakage and remove artifact-affected samples. Experimental results across enrollment-verification gaps from 6 to 36 months show that synthetic-augmented fine-tuning substantially reduces error rates relative to both the pretrained baseline and real-only fine-tuning. These findings provide a risk-aware assessment of synthetic augmentation for improving identity persistence in pediatric face recognition.[172] Learnability-Driven Submodular Optimization for Active Roadside 3D Detection
Ruiyu Mao,Baoming Zhang,Nicholas Ruozzi,Yunhui Guo
Main category: cs.CV
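TL;DR: Proposes LH3D, a learnability-driven active learning framework for roadside monocular 3D object detection that identifies and suppresses inherently ambiguous samples, approaching fully supervised performance with only 25% of the annotation budget.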
Details
Motivation: Because of hardware and privacy constraints, real deployments often require annotating roadside-only data, but distant, blurred, or occluded objects in such data make 3D properties hard to determine, producing inherently ambiguous samples that raise annotation difficulty and cost and create a learnability problem. Method: Proposes the LH3D framework, which performs active learning by combining informativeness with labeling reliability, selecting scenes that are both informative and reliably labelable and suppressing inherently ambiguous samples while preserving data coverage. Result: On DAIR-V2X-I, with only 25% of the annotation budget, LH3D reaches 86.06%, 67.32%, and 78.67% of the fully supervised model's performance for vehicles, pedestrians, and cyclists respectively, significantly outperforming uncertainty-based methods. Conclusion: The study indicates that in roadside 3D perception, sample learnability matters more than uncertainty, and that carefully selecting labelable, informative samples improves annotation efficiency and model performance. Abstract: Roadside perception datasets are typically constructed via cooperative labeling between synchronized vehicle and roadside frame pairs. However, real deployment often requires annotation of roadside-only data due to hardware and privacy constraints. Even human experts struggle to produce accurate labels without vehicle-side data (image, LiDAR), which not only increases annotation difficulty and cost, but also reveals a fundamental learnability problem: many roadside-only scenes contain distant, blurred, or occluded objects whose 3D properties are ambiguous from a single view and can only be reliably annotated by cross-checking paired vehicle-roadside frames. We refer to such cases as inherently ambiguous samples. To reduce wasted annotation effort on inherently ambiguous samples while still obtaining high-performing models, we turn to active learning. This work focuses on active learning for roadside monocular 3D object detection and proposes a learnability-driven framework that selects scenes which are both informative and reliably labelable, suppressing inherently ambiguous samples while ensuring coverage. Experiments demonstrate that our method, LH3D, achieves 86.06%, 67.32%, and 78.67% of full-performance for vehicles, pedestrians, and cyclists respectively, using only 25% of the annotation budget on DAIR-V2X-I, significantly outperforming uncertainty-based baselines. This confirms that learnability, not uncertainty, matters for roadside 3D perception.[173] Real-Time Lane Detection via Efficient Feature Alignment and Covariance Optimization for Low-Power Embedded Systems
Yian Liu,Xiong Wang,Ping Xu,Lei Zhu,Ming Yan,Linyun Xue
Main category: cs.CV
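TL;DR: Proposes a Covariance Distribution Optimization (CDO) module that improves the accuracy of real-time lane detection on embedded systems without increasing computational complexity.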
Details
Motivation: Existing deep learning models lack general-purpose optimization methods for low-power embedded environments, and lane detection is challenging because visual signals are weak and resources are limited. Method: Designs the CDO module, which improves detection accuracy by aligning lane feature distributions with ground-truth labels; it requires no changes to the network structure and can be integrated into all three mainstream categories of lane detection methods. Result: Validated on six models and three datasets (CULane, TuSimple, LLAMAS), with accuracy gains of 0.01% to 1.5% and unchanged computational efficiency. Conclusion: The CDO module is easy to integrate, adds no computational burden, and significantly improves the performance, power efficiency, and operational flexibility of lane detection on embedded systems. Abstract: Real-time lane detection in embedded systems encounters significant challenges due to subtle and sparse visual signals in RGB images, often constrained by limited computational resources and power consumption. Although deep learning models for lane detection are categorized into segmentation-based, anchor-based, and curve-based methods, there remains a scarcity of universally applicable optimization techniques tailored for low-power embedded environments. To overcome this, we propose an innovative Covariance Distribution Optimization (CDO) module specifically designed for efficient, real-time applications. The CDO module aligns lane feature distributions closely with ground-truth labels, significantly enhancing detection accuracy without increasing computational complexity. Evaluations were conducted on six diverse models across all three method categories, including two optimized for real-time applications and four state-of-the-art (SOTA) models, tested comprehensively on three major datasets: CULane, TuSimple, and LLAMAS. Experimental results demonstrate accuracy improvements ranging from 0.01% to 1.5%. The proposed CDO module is characterized by ease of integration into existing systems without structural modifications and utilizes existing model parameters to facilitate ongoing training, thus offering substantial benefits in performance, power efficiency, and operational flexibility in embedded systems.[174] FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
Xijie Huang,Chengming Xu,Donghao Luo,Xiaobin Hu,Peng Tang,Xu Peng,Jiangning Zhang,Chengjie Wang,Yanwei Fu
Main category: cs.CV
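TL;DR: Proposes a First-Frame Propagation (FFP) video editing method that needs no run-time guidance, resolving the tension between appearance preservation and motion preservation through the large-scale FFP-300K dataset and a new framework.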
Details
Motivation: Existing FFP methods depend on cumbersome run-time guidance, mainly because training datasets are inadequate: they are short, low-resolution, and lack task diversity, which makes it hard to learn robust temporal priors. Method: Introduces the FFP-300K dataset of 300K high-fidelity video pairs at 720p and 81 frames; designs Adaptive Spatio-Temporal RoPE (AST-RoPE) to disentangle appearance and motion references; and adopts a self-distillation strategy in which an identity propagation task serves as a regularizer to improve temporal stability. Result: On the EditVerseBench benchmark it significantly outperforms existing academic and commercial models, improving PickScore by about 0.2 and VLM score by about 0.3. Conclusion: With a high-quality dataset and a new architecture, the work achieves truly guidance-free FFP, performs strongly in long-term temporal consistency and semantic preservation, and advances controllable video editing. Abstract: First-Frame Propagation (FFP) offers a promising paradigm for controllable video editing, but existing methods are hampered by a reliance on cumbersome run-time guidance. We identify the root cause of this limitation as the inadequacy of current training datasets, which are often too short, low-resolution, and lack the task diversity required to teach robust temporal priors. To address this foundational data gap, we first introduce FFP-300K, a new large-scale dataset comprising 300K high-fidelity video pairs at 720p resolution and 81 frames in length, constructed via a principled two-track pipeline for diverse local and global edits. Building on this dataset, we propose a novel framework designed for true guidance-free FFP that resolves the critical tension between maintaining first-frame appearance and preserving source video motion. Architecturally, we introduce Adaptive Spatio-Temporal RoPE (AST-RoPE), which dynamically remaps positional encodings to disentangle appearance and motion references. At the objective level, we employ a self-distillation strategy where an identity propagation task acts as a powerful regularizer, ensuring long-term temporal stability and preventing semantic drift. Comprehensive experiments on the EditVerseBench benchmark demonstrate that our method significantly outperforms existing academic and commercial models, with improvements of about 0.2 in PickScore and 0.3 in VLM score over these competitors.[175] Point-SRA: Self-Representation Alignment for 3D Representation Learning
Lintong Wei,Jian Lu,Haozhe Cheng,Jihua Zhu,Kaibing Zhang
Main category: cs.CV
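TL;DR: Proposes Point-SRA, a 3D representation learning method that aligns representations through self-distillation and probabilistic modeling; it uses variable masking ratios and a MeanFlow Transformer to capture complementary multi-level information and enable diverse probabilistic reconstruction, and it significantly outperforms existing methods on several downstream tasks.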
Details
Motivation: Existing MAE methods use a fixed masking ratio, overlooking the multi-level correlations and geometric diversity of point clouds, and rely on a point-wise reconstruction assumption that conflicts with point cloud diversity. Method: Proposes Point-SRA, which uses different masking ratios to capture geometric and semantic information; designs a MeanFlow Transformer (MFT) that uses cross-modal conditional embeddings for probabilistic reconstruction; introduces a Dual Self-Representation Alignment (DSRA) mechanism that unifies representations at both the MAE and MFT levels; and designs a Flow-Conditioned fine-tuning architecture to fully exploit the learned distribution. Result: Outperforms Point-MAE by 5.37% on ScanObjectNN; in intracranial aneurysm segmentation, reaches mIoU of 96.07% for arteries and 86.87% for aneurysms; achieves 47.3% AP@50 in 3D detection, 5.12% above MaskPoint. Conclusion: Through variable masking ratios, probabilistic reconstruction, and dual-level representation alignment, Point-SRA improves 3D representation learning, with significant gains in classification, segmentation, and detection that support its strength in modeling point cloud structure and distribution. Abstract: Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. Existing methods with a fixed mask ratio neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point clouds. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in MFT also exhibit complementarity. Therefore, a Dual Self-Representation Alignment mechanism is proposed at both the MAE and MFT levels. Finally, we design a Flow-Conditioned Fine-Tuning Architecture to fully exploit the point cloud distribution learned via MeanFlow. Point-SRA outperforms Point-MAE by 5.37% on ScanObjectNN. On intracranial aneurysm segmentation, it reaches 96.07% mean IoU for arteries and 86.87% for aneurysms. For 3D object detection, Point-SRA achieves 47.3% AP@50, surpassing MaskPoint by 5.12%.[176] MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement
Lei Zhu,Lijian Lin,Ye Zhu,Jiahao Wu,Xuehan Hou,Yu Li,Yunfei Liu,Jie Chen
Main category: cs.CV
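TL;DR: Proposes MANGO, a new framework for generating audio-driven 3D conversational head motion; two-stage alternating training reduces noise from pseudo-3D labels and yields more natural bidirectional conversational interaction, and the high-quality MANGO-Dialog dataset is introduced to validate its performance.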
Details
Motivation: Existing methods focus mainly on single-speaker scenarios, lack natural bidirectional listening and speaking interaction, and depend on error-prone pseudo-3D labels that fail to capture subtle facial dynamics. Method: Proposes the MANGO framework: the first stage uses a diffusion-based transformer with a dual-audio interaction module to model natural 3D motion from multi-speaker audio; the second stage uses a fast 3D Gaussian renderer to generate high-fidelity images and, through alternating training, provides 2D photometric supervision to refine the 3D motion. Result: Experiments show that the method achieves excellent accuracy and realism in modeling two-person 3D dialogue motion, significantly improving the fidelity and controllability of audio-driven talking heads. Conclusion: Using pure image-level supervision and alternating training, MANGO overcomes the noise of pseudo-3D labels and produces natural conversational behavior closer to reality, advancing 3D conversational avatar technology. Abstract: Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios, lacking natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly, remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce a novel two-stage framework, MANGO, which leverages pure image-level supervision through alternating training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, we use a fast 3D Gaussian Renderer to generate high-fidelity images and provide 2D-level photometric supervision for the 3D motions through alternate training. Additionally, we introduce MANGO-Dialog, a high-quality dataset with over 50 hours of aligned 2D-3D conversational data across 500+ identities. Extensive experiments demonstrate that our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.[177] CTIS-QA: Clinical Template-Informed Slide-level Question Answering for Pathology
Hao Lu,Ziniu Qian,Yifu Li,Yang Zhou,Bingzheng Wei,Yan Xu
Main category: cs.CV
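TL;DR: Proposes a pipeline for structuring pathological information based on a clinical diagnosis template, designs a Clinical Pathology Report Template (CPRT) that follows CAP standards, and builds two datasets, CTIS-Align and CTIS-Bench, for vision-language alignment and WSI question answering. It also proposes the CTIS-QA model, which outperforms existing methods on multiple tasks.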
Details
Motivation: Standardized extraction of diagnostic information from pathology reports, and support for vision-language understanding and question answering on whole slide images (WSIs), require a systematic, clinically guided structuring pipeline. Method: Designs the Clinical Pathology Report Template (CPRT) based on the CAP Cancer Protocols to extract features from pathology reports; builds the CTIS-Align (80k image-text pairs) and CTIS-Bench (977 WSIs, 14,879 QA pairs) datasets; and proposes the CTIS-QA model with a dual-stream architecture, in which one stream aggregates global features via clustering and the other attends to salient local regions via an attention mechanism. Result: The pipeline is validated on TCGA-BRCA; CTIS-QA outperforms existing SOTA models on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks. Conclusion: The study provides a clinically guided framework for structuring pathological information and advances WSI-based vision-language alignment and question answering; the CTIS-QA model more closely follows pathologists' diagnostic logic and improves performance. Abstract: In this paper, we introduce a clinical diagnosis template-based pipeline to systematically collect and structure pathological information. In collaboration with pathologists and guided by the College of American Pathologists (CAP) Cancer Protocols, we design a Clinical Pathology Report Template (CPRT) that ensures comprehensive and standardized extraction of diagnostic elements from pathology reports. We validate the effectiveness of our pipeline on TCGA-BRCA. First, we extract pathological features from reports using CPRT. These features are then used to build CTIS-Align, a dataset of 80k slide-description pairs from 804 WSIs for vision-language alignment training, and CTIS-Bench, a rigorously curated VQA benchmark comprising 977 WSIs and 14,879 question-answer pairs. CTIS-Bench emphasizes clinically grounded, closed-ended questions (e.g., tumor grade, receptor status) that reflect real diagnostic workflows, minimize non-visual reasoning, and require genuine slide understanding. We further propose CTIS-QA, a Slide-level Question Answering model, featuring a dual-stream architecture that mimics pathologists' diagnostic approach. One stream captures global slide-level context via clustering-based feature aggregation, while the other focuses on salient local regions through an attention-guided patch perception module. Extensive experiments on WSI-VQA, CTIS-Bench, and slide-level diagnostic tasks show that CTIS-QA consistently outperforms existing state-of-the-art models across multiple metrics.
Code and data are available at https://github.com/HLSvois/CTIS-QA.[178] Subimage Overlap Prediction: Task-Aligned Self-Supervised Pretraining For Semantic Segmentation In Remote Sensing Imagery
Lakshay Sharma,Alex Marin
Main category: cs.CV
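TL;DR: Proposes Subimage Overlap Prediction, a new self-supervised pretraining task for semantic segmentation of remote sensing imagery that achieves faster convergence and equal or better downstream performance while using less pretraining data.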
Details
Motivation: Existing self-supervised learning methods usually depend on large amounts of pretraining data, which limits their use in data-scarce settings. This work aims to reduce that dependence and improve the efficiency and performance of semantic segmentation for remote sensing imagery. Method: Proposes the subimage overlap prediction task: a sub-image is extracted from the original image, and the model is trained to produce a semantic mask of the sub-image's location within the original image, which serves as self-supervised pretraining. Result: The method is validated across several network architectures and downstream datasets; pretraining significantly accelerates convergence and matches or exceeds baseline performance in mIoU, with a larger advantage when labeled data is scarce; it also needs less pretraining data than other SSL methods. Conclusion: Subimage overlap prediction is an efficient, data-frugal self-supervised pretraining method that shows strong transfer performance and practicality for semantic segmentation of remote sensing imagery. Abstract: Self-supervised learning (SSL) methods have become a dominant paradigm for creating general purpose models whose capabilities can be transferred to downstream supervised learning tasks. However, most such methods rely on vast amounts of pretraining data. This work introduces Subimage Overlap Prediction, a novel self-supervised pretraining task to aid semantic segmentation in remote sensing imagery that uses significantly less pretraining imagery. Given an image, a sub-image is extracted and the model is trained to produce a semantic mask of the location of the extracted sub-image within the original image. We demonstrate that pretraining with this task results in significantly faster convergence, and equal or better performance (measured via mIoU) on downstream segmentation. This gap in convergence and performance widens when labeled training data is reduced. We show this across multiple architecture types, and with multiple downstream datasets. We also show that our method matches or exceeds performance while requiring significantly less pretraining data relative to other SSL methods. Code and model weights are provided at https://github.com/sharmalakshay93/subimage-overlap-prediction.[179] DDNet: A Dual-Stream Graph Learning and Disentanglement Framework for Temporal Forgery Localization
Boyang Zhao,Xin Liao,Jiaxin Chen,Xiaoshuai Wu,Yufeng Wu
Main category: cs.CV
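TL;DR: Proposes DDNet, a dual-stream graph learning and disentanglement framework for temporal forgery localization in video that combines local temporal distance with global semantic content, improving the detection accuracy of forged segments and cross-domain robustness.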
Details
Motivation: Existing methods are limited to a local view and struggle to capture global anomalies in a video, which degrades temporal forgery localization. Method: Proposes the DDNet framework, comprising a Temporal Distance Stream and a Semantic Content Stream, and introduces Trace Disentanglement and Adaptation (TDA) and Cross-Level Feature Embedding (CLFE) to extract disentangled forgery fingerprints and build robust feature representations. Result: Experiments on the ForgeryNet and TVIL benchmarks show that the method exceeds the previous best methods by about 9% in AP@0.95 and shows clear advantages in cross-domain scenarios. Conclusion: Through dual-stream graph learning and feature disentanglement, DDNet integrates local and global information and significantly improves the accuracy and generalization of temporal forgery localization. Abstract: The rapid evolution of AIGC technology makes it possible to mislead viewers by tampering with only small segments of a video, rendering video-level detection inaccurate and unpersuasive. Consequently, temporal forgery localization (TFL), which aims to precisely pinpoint tampered segments, becomes critical. However, existing methods are often constrained by a local view, failing to capture global anomalies. To address this, we propose a dual-stream graph learning and disentanglement framework for temporal forgery localization (DDNet). By coordinating a Temporal Distance Stream for local artifacts and a Semantic Content Stream for long-range connections, DDNet prevents global cues from being drowned out by local smoothness. Furthermore, we introduce Trace Disentanglement and Adaptation (TDA) to isolate generic forgery fingerprints, alongside Cross-Level Feature Embedding (CLFE) to construct a robust feature foundation via deep fusion of hierarchical features. Experiments on ForgeryNet and TVIL benchmarks demonstrate that our method outperforms state-of-the-art approaches by approximately 9% in AP@0.95, with significant improvements in cross-domain robustness.[180] VerLM: Explaining Face Verification Using Natural Language
Syed Abdul Hannan,Hazim Bukhari,Thomas Cantalapiedra,Eman Ansar,Massa Baali,Rita Singh,Bhiksha Raj
Main category: cs.CV
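TL;DR: Proposes a vision-language model (VLM) for face verification that accurately determines whether two face images show the same person and explains its decision in both a concise and a detailed style.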
Details
Motivation: Existing face verification systems lack decision transparency and cannot readily explain the basis for their judgments, which limits their trustworthiness and use in critical settings. Method: Adapts and enhances a state-of-the-art modeling approach that combines visual feature extraction with language reasoning; trains the VLM with two complementary explanation styles (concise summaries and detailed descriptions of differences); and transfers a modeling approach originally designed for audio-based differentiation to the visual domain. Result: The model outperforms baseline methods and existing models in both accuracy and interpretability, markedly improving the transparency and performance of face verification systems. Conclusion: Vision-language models have great potential in face verification and can help build more transparent, reliable, and explainable verification systems. Abstract: Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce an innovative Vision-Language Model (VLM) for Face Verification, which not only accurately determines if two face images depict the same individual but also explicitly explains the rationale behind its decisions. Our model is uniquely trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach originally designed for audio-based differentiation to suit visual inputs effectively. This cross-modal transfer significantly improves our model's accuracy and interpretability. The proposed VLM integrates sophisticated feature extraction techniques with advanced reasoning capabilities, enabling clear articulation of its verification process. Our approach demonstrates superior performance, surpassing baseline methods and existing models. These findings highlight the immense potential of vision-language models in face verification setups, contributing to more transparent, reliable, and explainable face verification systems.[181] Causality-Aware Temporal Projection for Video Understanding in Video-LLMs
Zhengjian Kang,Qi Chen,Rui Liu,Kangtong Mo,Xingyu Zhang,Xiaoyu Deng,Ye Zhang
Main category: cs.CV
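TL;DR: Proposes V-CORE, a parameter-efficient Video-LLM framework that improves temporal and causal reasoning in video understanding by introducing explicit temporal ordering constraints, such as unidirectional causal attention and a dynamic summary token.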
Details
Motivation: Existing Video-LLMs fall short on video tasks that require temporal ordering and causal coherence; bidirectional projectors in particular can disrupt temporal order. A new framework that explicitly models temporal directionality is therefore needed. Method: V-CORE has two key components: Learnable Spatial Aggregation (LSA), which reduces spatial redundancy, and a Causality-Aware Temporal Projector (CATP), which enforces unidirectional information flow through block-causal attention and a terminal dynamic summary token so that temporal information is aggregated in order. Result: Reaches 61.2% accuracy on NExT-QA, is competitive on MSVD-QA, MSRVTT-QA, and TGIF-QA, and gains 3.5% and 5.2% on the temporal and causal reasoning subtasks respectively. It can be trained efficiently on a single consumer GPU with 4-bit QLoRA. Conclusion: Explicit temporal ordering constraints are crucial for video understanding; through structured temporal modeling, V-CORE markedly improves performance on complex temporal and causal reasoning tasks. Abstract: Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which enforces structured unidirectional information flow via block-causal attention and a terminal dynamic summary token acting as a causal sink. This design preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU.
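One way to picture the block-causal attention with a terminal summary token described above is as a boolean mask over tokens. The sketch below is a plain-Python illustration under assumed conventions (a fixed number of tokens per frame, a single summary token appended last, and `True` meaning "query may attend to key"); the function name `block_causal_mask` is hypothetical, and this is not the authors' implementation.

```python
def block_causal_mask(num_frames, tokens_per_frame, add_summary_token=True):
    """Build mask[q][k]: True when query token q may attend to key token k.

    Tokens of the same frame see each other (intra-frame interactions are kept),
    tokens of a later frame see all earlier frames, and no token sees a later frame.
    """
    # Frame index of every spatial token, in temporal order.
    frame_of = [i // tokens_per_frame for i in range(num_frames * tokens_per_frame)]
    if add_summary_token:
        # Assumed convention: the summary token sits after the last frame,
        # so it reads from every frame and nothing reads from it.
        frame_of.append(num_frames)
    size = len(frame_of)
    return [[frame_of[k] <= frame_of[q] for k in range(size)] for q in range(size)]


# Two frames with two tokens each, plus the summary token (5 tokens total).
mask = block_causal_mask(num_frames=2, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In an attention layer, positions where the mask is `False` would typically be set to negative infinity before the softmax, which is what prevents later frames from influencing earlier representations.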
Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy, and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with gains concentrated in temporal and causal reasoning subcategories (+3.5% and +5.2% respectively), directly validating the importance of explicit temporal ordering constraints.[182] Adaptive Hybrid Optimizer based Framework for Lumpy Skin Disease Identification
Ubaidullah,Muhammad Abid Hussain,Mohsin Raza Jafri,Rozi Khan,Moid Sandhu,Abd Ullah Khan,Hyundong Shin
Main category: cs.CV
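TL;DR: Proposes LUMPNet, a hybrid deep learning model for early detection of Lumpy Skin Disease (LSD) in cattle that combines YOLOv11, EfficientNet, and a novel adaptive hybrid optimizer; on a public dataset it achieves 99% training accuracy and 98% validation accuracy, outperforming existing methods.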
Details
Motivation: Because LSD spreads rapidly, early and accurate identification is essential for preventing outbreaks and safeguarding food security, and existing methods leave room for improvement in detection accuracy and efficiency. Method: LUMPNet uses YOLOv11 to detect and localize lesions, an EfficientNet-based CNN classifier for classification, and a new adaptive hybrid optimizer designed to accelerate and stabilize training. Result: On a public dataset, LUMPNet achieves 99% training accuracy and 98% validation accuracy, outperforming existing methods; a comparison with an EfficientNet-B0 model optimized with AdamW further supports its performance. Conclusion: LUMPNet shows high accuracy and robustness in early LSD detection and has potential for practical use as a tool for intelligent diagnosis of livestock disease. Abstract: Lumpy Skin Disease (LSD) is a contagious viral infection that significantly deteriorates livestock health, thereby posing a serious threat to the global economy and food security. Owing to its rapid spread characteristics, early and precise identification is crucial to prevent outbreaks and ensure timely intervention. In this paper, we propose a hybrid deep learning-based approach called LUMPNet for the early detection of LSD. LUMPNet utilizes image data to detect and classify skin nodules, the primary indicator of LSD. To this end, LUMPNet uses YOLOv11, an EfficientNet-based CNN classifier with compound scaling, and a novel adaptive hybrid optimizer. More precisely, LUMPNet detects and localizes LSD skin nodules and lesions on cattle images. It exploits EfficientNet to classify the localized cattle images into LSD-affected or healthy categories. To stabilize and accelerate the training of the YOLOv11 and EfficientNet hybrid model, a novel adaptive hybrid optimizer is proposed and utilized. We evaluate LUMPNet at various stages of LSD using a publicly available dataset. Results indicate that the proposed scheme achieves 99% LSD detection training accuracy, and outperforms existing schemes. The model also achieves validation accuracy of 98%. Moreover, for further evaluation, we conduct a case study using an optimized EfficientNet-B0 model trained with the AdamW optimizer, and compare its performance with LUMPNet. The results show that LUMPNet achieves superior performance.[183] Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning
Sungjune Park,Hongda Mao,Qingshuang Chen,Yong Man Ro,Yelin Kim
Main category: cs.CV
TL;DR: Proposes a language-guided, scene context-aware learning framework that improves visual attention prediction in egocentric video.
Details
Motivation: The complexity and ambiguity of dynamic egocentric scenes make visual attention prediction difficult for existing methods, while scene context plays an important role in modulating human attention. Method: Designs a context perceiver guided by language-based scene descriptions to generate context-aware video representations, and introduces two training objectives: focusing on point-of-interest regions and suppressing distractions from irrelevant regions. Result: Extensive experiments on the Ego4D and AEA datasets show state-of-the-art performance and greater robustness across diverse dynamic egocentric scenarios. Conclusion: By incorporating language-guided scene context, the method significantly improves the accuracy and robustness of egocentric visual attention prediction. Abstract: As the demand for analyzing egocentric videos grows, egocentric visual attention prediction, anticipating where a camera wearer will attend, has garnered increasing attention. However, it remains challenging due to the inherent complexity and ambiguity of dynamic egocentric scenes. Motivated by evidence that scene contextual information plays a crucial role in modulating human attention, in this paper, we present a language-guided scene context-aware learning framework for robust egocentric visual attention prediction. We first design a context perceiver which is guided to summarize the egocentric video based on a language-based scene description, generating context-aware video representations. We then introduce two training objectives that: 1) encourage the framework to focus on the target point-of-interest regions and 2) suppress distractions from irrelevant regions which are less likely to attract first-person attention. Extensive experiments on Ego4D and Aria Everyday Activities (AEA) datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance and enhanced robustness across diverse, dynamic egocentric scenarios.[184] RSwinV2-MD: An Enhanced Residual SwinV2 Transformer for Monkeypox Detection from Skin Images
Rashid Iqbal,Saddam Hussain Khan
Main category: cs.CV
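TL;DR: Proposes a deep learning method for Mpox diagnosis, Customized Residual SwinTransformerV2 (RSwinV2), which significantly improves skin lesion classification accuracy through a modified Transformer architecture and an Inverse Residual Block.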
Details
Motivation: To improve differential diagnosis between Mpox and similar diseases (such as chickenpox, measles, and cowpox), and to overcome the limitations of conventional CNN models in handling long-range dependencies and extracting local features. Method: Proposes the RSwinV2 model, which combines SwinTransformer's hierarchical structure and shifted-window attention, adds patch and position embeddings, and integrates an Inverse Residual Block (IRB) to mitigate vanishing gradients, capturing both global and local features. The input image is split into non-overlapping patches and attention is computed within shifted windows, improving computational efficiency and strengthening feature association. Result: On the public Kaggle dataset, RSwinV2 achieves 96.21% accuracy and an F1 score of 95.62, outperforming standard CNN models and the original SwinTransformer. Conclusion: RSwinV2 is an efficient computer-aided diagnosis tool that significantly improves Mpox lesion recognition and has potential for clinical assistance. Abstract: In this paper, we propose a deep learning approach for Mpox diagnosis, Customized Residual SwinTransformerV2 (RSwinV2), which aims to improve lesion classification through a tool-assisted vision approach. In RSwinV2, the hierarchical structure of the transformer is customized according to the input dimensionality, embedding structure, and target output. The input image is split into non-overlapping patches and processed with shifted-window attention over these patches. This links the windows efficiently and avoids the locality issues that non-overlapping regions cause in attention, while remaining computationally efficient. RSwinV2 builds on SwinTransformer and includes patch and position embeddings to exploit the transformer's global-linking capability through multi-head attention over these embeddings. Furthermore, RSwinV2 incorporates an Inverse Residual Block (IRB), which uses convolutional skip connections to address vanishing gradients during processing. With the IRB, the method links global as well as local patterns, which helps improve lesion classification by minimizing variability within Mpox and increasing the separation between Mpox, chickenpox, measles, and cowpox.
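The shifted-window idea mentioned above can be sketched without any deep learning library. The example below labels each position of a small patch grid with the window it falls into, first with a regular partition and then after a cyclic shift, following the general Swin-style scheme. The function name `window_ids`, the grid size, and the shift amount are assumptions for illustration; the abstract does not specify RSwinV2's actual configuration, and the attention masking that Swin-style models apply to wrapped-around windows is omitted here.

```python
def window_ids(height, width, window, shift=0):
    """Return a grid where each cell holds the index of its attention window.

    The grid is cyclically shifted by `shift` cells along both axes before it is
    cut into non-overlapping `window` x `window` blocks, so cells that sat in
    different windows for shift=0 can share a window for shift>0.
    """
    windows_per_row = width // window
    grid = []
    for r in range(height):
        row = []
        for c in range(width):
            # Position of this cell after the cyclic shift.
            rr = (r - shift) % height
            cc = (c - shift) % width
            row.append((rr // window) * windows_per_row + (cc // window))
        grid.append(row)
    return grid


# 4x4 patch grid, 2x2 windows: regular partition, then shifted by half a window.
for label, shift in (("regular", 0), ("shifted", 1)):
    print(label)
    for row in window_ids(4, 4, 2, shift):
        print(" ".join(str(i) for i in row))
```

In the shifted layout the four central cells share one window, although they belonged to four different windows in the regular layout; alternating the two layouts is what lets information cross window borders while each attention call stays local.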
In testing, RSwinV2 achieved an accuracy of 96.21% and an F1-score of 95.62 on the Kaggle public dataset, outperforming standard CNN models and SwinTransformers; RSwinV2 has thus proved its validity as a computer-assisted tool for interpreting Mpox lesion observations.
[185] ESGaussianFace: Emotional and Stylized Audio-Driven Facial Animation via 3D Gaussian Splatting
Chuhang Ma,Shuai Tan,Ye Pan,Jiaolong Yang,Xin Tong
Main category: cs.CV
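TL;DR: This paper proposes ESGaussianFace, a new framework for emotional and stylized audio-driven facial animation that uses 3D Gaussian Splatting to synthesize videos efficiently, with high quality and 3D consistency.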
Details
Motivation: Existing research on audio-driven facial animation mostly focuses on neutral emotions and struggles to efficiently generate high-quality talking-head videos that combine emotional expression with style features.
Method: Uses 3D Gaussian Splatting to reconstruct the 3D scene and render video, proposes an emotion-audio-guided spatial attention mechanism, and designs two 3D Gaussian deformation predictors for emotion- and style-driven deformation, combined with a multi-stage training strategy that learns lip movements, emotional variations, and style features step by step.
Result: The method outperforms existing state-of-the-art methods in lip movement accuracy, expression variation, and style expressiveness, and its results are efficient to generate, high in quality, and 3D-consistent.
Conclusion: ESGaussianFace effectively fuses emotion and style information to generate high-quality emotional facial animation, offering a feasible approach for future personalized virtual character animation.
Abstract: Most current audio-driven facial animation research primarily focuses on generating videos with neutral emotions. While some studies have addressed the generation of facial videos driven by emotional audio, efficiently generating high-quality talking head videos that integrate both emotional expressions and style features remains a significant challenge. In this paper, we propose ESGaussianFace, an innovative framework for emotional and stylized audio-driven facial animation. Our approach leverages 3D Gaussian Splatting to reconstruct 3D scenes and render videos, ensuring efficient generation of 3D-consistent results. We propose an emotion-audio-guided spatial attention method that effectively integrates emotion features with audio content features. Through emotion-guided attention, the model is able to reconstruct facial details across different emotional states more accurately. To achieve emotional and stylized deformations of the 3D Gaussian points through emotion and style features, we introduce two 3D Gaussian deformation predictors. Furthermore, we propose a multi-stage training strategy, enabling the step-by-step learning of the character's lip movements, emotional variations, and style features. Our generated results exhibit high efficiency, high quality, and 3D consistency. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art techniques in terms of lip movement accuracy, expression variation, and style feature expressiveness.
[186] GCR: Geometry-Consistent Routing for Task-Agnostic Continual Anomaly Detection
Joongwon Chae,Lihui Luo,Yang Liu,Runming Wang,Dongmei Yu,Zeming Liang,Xi Yuan,Dayan Zhang,Zhenglin Chen,Peiwu Qin,Ilmoon Chae
Main category: cs.CV
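TL;DR: This paper proposes GCR, a lightweight mixture-of-experts framework for stabilizing task-agnostic continual anomaly detection. By performing geometry-consistent routing in a shared frozen patch-embedding space, it avoids the problem of incomparable scores across heads and achieves near-zero forgetting with good detection performance on MVTec AD and VisA.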
Details
Motivation: Existing methods perform poorly in practical deployments where categories keep expanding, mainly because the category is unknown at test time and expert selection (routing) becomes unstable; in particular, routing rules based on anomaly scores from different heads are unreliable because their score distributions differ.
Method: Proposes the GCR framework, which routes directly in a shared frozen patch-embedding space by minimizing the accumulated nearest-neighbor distance to category-specific prototype banks, and then computes the anomaly map only within the selected expert using a standard prototype-based scoring rule, thereby separating cross-head decisions from anomaly scoring.
Result: Experiments on MVTec AD and VisA show that GCR substantially improves routing stability, mitigates continual performance collapse, and achieves near-zero forgetting while maintaining competitive anomaly detection and localization performance.
Conclusion: Many problems previously attributed to representation forgetting may actually stem from unstable decision rules in cross-head routing; GCR addresses this effectively through geometry-consistent routing.
Abstract: Feature-based anomaly detection is widely adopted in industrial inspection due to the strong representational power of large pre-trained vision encoders. While most existing methods focus on improving within-category anomaly scoring, practical deployments increasingly require task-agnostic operation under continual category expansion, where the category identity is unknown at test time. In this setting, overall performance is often dominated by expert selection, namely routing an input to an appropriate normality model before any head-specific scoring is applied. However, routing rules that compare head-specific anomaly scores across independently constructed heads are unreliable in practice, as score distributions can differ substantially across categories in scale and tail behavior. We propose GCR, a lightweight mixture-of-experts framework for stabilizing task-agnostic continual anomaly detection through geometry-consistent routing. GCR routes each test image directly in a shared frozen patch-embedding space by minimizing an accumulated nearest-prototype distance to category-specific prototype banks, and then computes anomaly maps only within the routed expert using a standard prototype-based scoring rule. By separating cross-head decision making from within-head anomaly scoring, GCR avoids cross-head score comparability issues without requiring end-to-end representation learning.
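The routing rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the released GCR implementation: "accumulated" is read here as a plain sum over patches, distances are Euclidean, and the names `route` and `nearest_prototype_distance`, the two category names, and the 2-D toy embeddings are all hypothetical.

```python
import math


def nearest_prototype_distance(patch, bank):
    """Distance from one patch embedding to its closest prototype in a bank."""
    return min(math.dist(patch, proto) for proto in bank)


def route(patches, banks):
    """Pick the category whose prototype bank gives the smallest accumulated
    nearest-prototype distance over all patches of the test image.

    Assumption: accumulation is a plain sum; the paper may normalise or
    aggregate differently.
    """
    costs = {
        name: sum(nearest_prototype_distance(p, bank) for p in patches)
        for name, bank in banks.items()
    }
    return min(costs, key=costs.get), costs


# Toy prototype banks in a shared 2-D embedding space (hypothetical values).
banks = {
    "bottle": [(0.0, 0.0), (1.0, 0.0)],
    "cable": [(5.0, 5.0), (6.0, 5.0)],
}

# Patch embeddings of one test image, all lying near the "bottle" prototypes.
patches = [(0.1, 0.0), (0.9, 0.1), (1.0, 0.0)]

expert, costs = route(patches, banks)
```

Because every bank is compared in the same frozen embedding space, the costs are geometrically comparable across categories, which is the property the abstract contrasts with comparing head-specific anomaly scores. Anomaly scoring would then run only inside the routed expert.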
Experiments on MVTec AD and VisA show that geometry-consistent routing substantially improves routing stability and mitigates continual performance collapse, achieving near-zero forgetting while maintaining competitive detection and localization performance. These results indicate that many failures previously attributed to representation forgetting can instead be explained by decision-rule instability in cross-head routing. Code is available at https://github.com/jw-chae/GCR
[187] RRNet: Configurable Real-Time Video Enhancement with Arbitrary Local Lighting Variations
Wenlong Yang,Canran Jin,Weihang Yuan,Chao Wang,Lifeng Sun
Main category: cs.CV
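TL;DR: RRNet is a lightweight, configurable rendering-based relighting network that performs localized relighting by estimating the parameters of virtual light sources, achieving a state-of-the-art trade-off between visual quality and efficiency in real-time video enhancement.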
Details
Motivation: Existing real-time video enhancement methods struggle to balance speed and effective exposure control under uneven lighting, so an efficient method capable of localized lighting adjustment is needed.
Method: Proposes RRNet, which estimates the parameters of a small set of virtual light sources and uses a depth-aware rendering module to achieve localized relighting without pixel-aligned training data; it uses a lightweight encoder and prediction head, and builds a low-cost synthetic training dataset with generative AI.
Result: RRNet consistently outperforms prior methods on low-light enhancement, localized illumination adjustment, and glare removal, supports real-time high-resolution processing, and preserves facial identity.
Conclusion: With its interpretable lighting control and efficient architecture, RRNet is suited to practical applications such as video conferencing, AR portrait enhancement, and mobile photography.
Abstract: With the growing demand for real-time video enhancement in live applications, existing methods often struggle to balance speed and effective exposure control, particularly under uneven lighting. We introduce RRNet (Rendering Relighting Network), a lightweight and configurable framework that achieves a state-of-the-art tradeoff between visual quality and efficiency. By estimating parameters for a minimal set of virtual light sources, RRNet enables localized relighting through a depth-aware rendering module without requiring pixel-aligned training data. This object-aware formulation preserves facial identity and supports real-time, high-resolution performance using a streamlined encoder and lightweight prediction head. To facilitate training, we propose a generative AI-based dataset creation pipeline that synthesizes diverse lighting conditions at low cost. With its interpretable lighting control and efficient architecture, RRNet is well suited for practical applications such as video conferencing, AR-based portrait enhancement, and mobile photography. Experiments show that RRNet consistently outperforms prior methods in low-light enhancement, localized illumination adjustment, and glare removal.
[188] Entity-Guided Multi-Task Learning for Infrared and Visible Image Fusion
Wenyu Shao,Hongbo Liu,Yunchuan Ma,Ruili Wang
Main category: cs.CV
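TL;DR: Proposes an entity-guided multi-task method for infrared and visible image fusion (EGMT) that extracts entity-level information from text and combines it with a multi-label classification task, effectively improving the semantic density and quality of the fused image.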
Details
Motivation: Existing text-driven image fusion methods rely on sentence-level text, which tends to introduce semantic noise and fails to fully exploit deeper semantic information.
Method: (1) Extracts entity-level textual information from image captions generated by large vision-language models; (2) builds a parallel multi-task learning architecture that combines image fusion with a multi-label classification task using entities as pseudo-labels; (3) designs an entity-guided cross-modal interaction module for fine-grained interaction between visual and textual features.
Result: Experiments on multiple public datasets show that EGMT outperforms existing state-of-the-art methods in preserving salient targets, texture details, and semantic consistency.
Conclusion: Through entity-level semantic guidance and multi-task learning, EGMT markedly improves the semantic expressiveness and quality of infrared and visible image fusion; the authors also release four entity-annotated datasets to support related research.
Abstract: Existing text-driven infrared and visible image fusion approaches often rely on textual information at the sentence level, which can lead to semantic noise from redundant text and fail to fully exploit the deeper semantic value of textual information. To address these issues, we propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). Our approach includes three key innovative components: (i) A principled method is proposed to extract entity-level textual information from image captions generated by large vision-language models, eliminating semantic noise from raw text while preserving critical semantic information; (ii) A parallel multi-task learning architecture is constructed, which integrates image fusion with a multi-label classification task. By using entities as pseudo-labels, the multi-label classification task provides semantic supervision, enabling the model to achieve a deeper understanding of image content and significantly improving the quality and semantic density of the fused image; (iii) An entity-guided cross-modal interactive module is also developed to facilitate the fine-grained interaction between visual and entity-level textual features, which enhances feature representation by capturing cross-modal dependencies at both inter-visual and visual-entity levels. To promote the wide application of the entity-guided image fusion framework, we release the entity-annotated version of four public datasets (i.e., TNO, RoadScene, M3FD, and MSRS).
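Component (ii) above, using caption entities as pseudo-labels for a multi-label classification task, can be made concrete with a small sketch. It is not taken from the EGMT code: the entity vocabulary, the helper names `multi_hot` and `bce_with_logits`, and the choice of a mean binary cross-entropy loss are assumptions used only to show what entity pseudo-labels could look like as a training signal.

```python
import math


def multi_hot(entities, vocab):
    """Turn the entities found in a caption into a multi-hot pseudo-label."""
    present = set(entities)
    return [1.0 if term in present else 0.0 for term in vocab]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical entity vocabulary and entities extracted from one caption.
vocab = ["person", "car", "tree", "building"]
target = multi_hot(["car", "person"], vocab)

# Logits from a hypothetical classification head on the fused features.
loss_uninformed = bce_with_logits([0.0, 0.0, 0.0, 0.0], target)
loss_good = bce_with_logits([5.0, 5.0, -5.0, -5.0], target)
loss_bad = bce_with_logits([-5.0, -5.0, 5.0, 5.0], target)
```

In a multi-task setup this classification loss would be added to the fusion loss with some weighting. The weighting and the form of the fusion loss are not given in the abstract, so they are left out here.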
Extensive experiments demonstrate that EGMT achieves superior performance in preserving salient targets, texture details, and semantic consistency, compared to the state-of-the-art methods. The code and dataset will be publicly available at https://github.com/wyshao-01/EGMT.
[189] CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
Shuhang Chen,Yunqiu Xu,Junjie Xie,Aojun Lu,Tao Feng,Zeying Huang,Ning Zhang,Yi Sun,Yi Yang,Hangjie Yuan
Main category: cs.CV
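TL;DR: This paper proposes CogFlow, a new framework for improving multimodal large language models on visual mathematical problem solving. It introduces a cognitive-inspired three-stage flow (perception → internalization → reasoning) with corresponding enhancement mechanisms and training algorithms for each stage, and contributes MathCog, a new dataset with over 120K high-quality annotations.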
Details
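Motivation: Existing methods for visual mathematical reasoning focus on extracting and interpreting visual inputs but ignore whether the extracted visual cues are faithfully integrated and properly used in subsequent reasoning, so models may produce reasoning chains that look coherent but are detached from the visual evidence.
Method: Proposes the CogFlow framework with three stages: perception, knowledge internalization, and reasoning. It designs Synergistic Visual Rewards to strengthen information extraction from symbols and diagrams, introduces a Knowledge Internalization Reward model to ensure visual cues are integrated correctly, proposes a Visual-Gated Policy Optimization algorithm to ground reasoning in visual knowledge, and builds the large-scale MathCog dataset for training.
Result: Experiments on several commonly used visual mathematical reasoning benchmarks show that CogFlow markedly outperforms existing methods, validating its effectiveness in improving perception, ensuring faithful integration of visual information, and preventing visually ungrounded reasoning.
Conclusion: By simulating the hierarchical flow of human reasoning, CogFlow systematically addresses the disconnect between perception and reasoning in visual mathematical reasoning and offers a more reliable approach to using visual information in multimodal models.
Abstract: Despite significant progress, multimodal large language models continue to struggle with visual mathematical problem solving. Some recent works recognize that visual perception is a bottleneck in visual mathematical reasoning, but their solutions are limited to improving the extraction and interpretation of visual inputs. Notably, they all ignore the key issue of whether the extracted visual cues are faithfully integrated and properly utilized in subsequent reasoning. Motivated by this, we present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage, explicitly simulating the hierarchical flow of human reasoning: perception$\Rightarrow$internalization$\Rightarrow$reasoning. In line with this hierarchical flow, we holistically enhance all its stages. We devise Synergistic Visual Rewards to boost perception capabilities in parametric and semantic spaces, jointly improving visual information extraction from symbols and diagrams. To guarantee faithful integration of extracted visual cues into subsequent reasoning, we introduce a Knowledge Internalization Reward model in the internalization stage, bridging perception and reasoning. Moreover, we design a Visual-Gated Policy Optimization algorithm to further enforce that the reasoning is grounded in the visual knowledge, preventing models from seeking shortcuts in the form of reasoning chains that appear coherent but are visually ungrounded.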
Moreover, we contribute a new dataset MathCog for model training, which contains samples with over 120K high-quality perception-reasoning aligned annotations. Comprehensive experiments and analysis on commonly used visual mathematical reasoning benchmarks validate the superiority of the proposed CogFlow.
[190] Agentic AI in Remote Sensing: Foundations, Taxonomy, and Emerging Systems
Niloufar Alipour Talemi,Julia Boone,Fatemeh Afghah
Main category: cs.CV
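TL;DR: This survey reviews the development of agentic AI in remote sensing. It proposes a unified taxonomy covering single-agent copilots and multi-agent systems, discusses architectural foundations such as planning mechanisms, retrieval-augmented generation, and memory structures, reviews emerging trajectory-aware evaluation benchmarks, and identifies challenges in grounding, safety, and orchestration, providing a strategic roadmap toward autonomous geospatial intelligence.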
Details
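Motivation: Existing deep learning models lack the sequential planning and active tool orchestration needed for complex geospatial workflows; current vision foundation models and multimodal large language models improve representation learning but are not sufficient for autonomous decision-making, so a shift toward agentic AI capable of acting on its own is needed.
Method: Proposes a unified taxonomy that distinguishes single-agent from multi-agent systems, systematically analyzes core architectural elements such as planning mechanisms, retrieval-augmented generation (RAG), and memory structures, and reviews new evaluation benchmarks that support trajectory-level reasoning.
Result: Establishes the first comprehensive framework for agentic AI in remote sensing, clarifies the shift in evaluation from pixel-level accuracy to trajectory-level reasoning correctness, and identifies key bottlenecks of existing systems in grounding, safety, and tool orchestration.
Conclusion: Agentic AI is a key direction for autonomous Earth observation; future work needs to build robust geospatial intelligence systems around interpretable planning, safety control, and multi-tool coordination.
Abstract: The paradigm of Earth Observation analysis is shifting from static deep learning models to autonomous agentic AI. Although recent vision foundation models and multimodal large language models advance representation learning, they often lack the sequential planning and active tool orchestration required for complex geospatial workflows. This survey presents the first comprehensive review of agentic AI in remote sensing. We introduce a unified taxonomy distinguishing between single-agent copilots and multi-agent systems while analyzing architectural foundations such as planning mechanisms, retrieval-augmented generation, and memory structures. Furthermore, we review emerging benchmarks that move the evaluation from pixel-level accuracy to trajectory-aware reasoning correctness. By critically examining limitations in grounding, safety, and orchestration, this work outlines a strategic roadmap for the development of robust, autonomous geospatial intelligence.
[191] Forget Less by Learning from Parents Through Hierarchical Relationships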
Arjun Ramesh Kaushik,Naresh Kumar Devulapally,Vishnu Suresh Lokhande,Nalini K. Ratha,Venu Govindaraju
Main category: cs.CV
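TL;DR: Proposes FLLP, a framework that mitigates catastrophic forgetting through a parent-child concept learning mechanism in hyperbolic space; it embeds concept representations in a Lorentzian manifold, effectively supporting continual learning and improving robustness and generalization.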
Details
Motivation: Existing methods focus mainly on reducing interference between concepts, neglect possible positive interactions between concepts, and struggle to avoid catastrophic forgetting in continual learning.
Method: Proposes Forget Less by Learning from Parents (FLLP), which models parent-child concept relationships in the hyperbolic space of a Lorentzian manifold and uses previously learned concepts to guide the learning of new ones, thereby reducing forgetting.
Result: FLLP is validated on three public datasets and one synthetic benchmark, with results showing that it outperforms existing methods in both robustness and generalization.
Conclusion: By introducing a parent-child learning mechanism in hyperbolic space, FLLP effectively mitigates catastrophic forgetting in custom generative models, supports continual learning, and improves performance.
Abstract: Custom Diffusion Models (CDMs) offer impressive capabilities for personalization in generative modeling, yet they remain vulnerable to catastrophic forgetting when learning new concepts sequentially. Existing approaches primarily focus on minimizing interference between concepts, often neglecting the potential for positive inter-concept interactions. In this work, we present Forget Less by Learning from Parents (FLLP), a novel framework that introduces a parent-child inter-concept learning mechanism in hyperbolic space to mitigate forgetting. By embedding concept representations within a Lorentzian manifold, naturally suited to modeling tree-like hierarchies, we define parent-child relationships in which previously learned concepts serve as guidance for adapting to new ones. Our method not only preserves prior knowledge but also supports continual integration of new concepts. We validate FLLP on three public datasets and one synthetic benchmark, showing consistent improvements in both robustness and generalization.
[192] Nodule-DETR: A Novel DETR Architecture with Frequency-Channel Attention for Ultrasound Thyroid Nodule Detection
Jingjing Wang,Qianglin Liu,Zhuo Xiao,Xinning Yao,Bo Liu,Lu Li,Lijuan Niu,Fugen Zhou
Main category: cs.CV
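TL;DR: Proposes Nodule-DETR, a novel detection transformer architecture for robust thyroid nodule detection in ultrasound images; it combines frequency-domain channel attention, hierarchical feature fusion, and multi-scale deformable attention modules and achieves state-of-the-art performance on real clinical data.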
Details
Motivation: Low contrast and blurred boundaries of thyroid nodules in ultrasound images limit diagnostic accuracy, and existing methods struggle to detect small, irregularly shaped nodules.
Method: Proposes Nodule-DETR with three new modules: Multi-Spectral Frequency-domain Channel Attention (MSFCA) to enhance features of low-contrast nodules, Hierarchical Feature Fusion (HFF) for efficient multi-scale feature integration, and Multi-Scale Deformable Attention (MSDA) to flexibly capture small and irregularly shaped nodules.
Result: Experiments on a real-world clinical thyroid ultrasound dataset show that Nodule-DETR improves on the baseline model by 0.149 in mAP@0.5:0.95, reaching state-of-the-art performance.
Conclusion: Nodule-DETR markedly improves the accuracy of thyroid nodule detection and has significant potential for clinical use as an effective tool for computer-aided thyroid diagnosis.
Abstract: Thyroid cancer is the most common endocrine malignancy, and its incidence is rising globally. While ultrasound is the preferred imaging modality for detecting thyroid nodules, its diagnostic accuracy is often limited by challenges such as low image contrast and blurred nodule boundaries. To address these issues, we propose Nodule-DETR, a novel detection transformer (DETR) architecture designed for robust thyroid nodule detection in ultrasound images. Nodule-DETR introduces three key innovations: a Multi-Spectral Frequency-domain Channel Attention (MSFCA) module that leverages frequency analysis to enhance features of low-contrast nodules; a Hierarchical Feature Fusion (HFF) module for efficient multi-scale integration; and Multi-Scale Deformable Attention (MSDA) to flexibly capture small and irregularly shaped nodules. We conducted extensive experiments on a clinical dataset of real-world thyroid ultrasound images. The results demonstrate that Nodule-DETR achieves state-of-the-art performance, outperforming the baseline model by a significant margin of 0.149 in mAP@0.5:0.95. The superior accuracy of Nodule-DETR highlights its significant potential for clinical application as an effective tool in computer-aided thyroid diagnosis. The code for this work is available at https://github.com/wjj1wjj/Nodule-DETR.
[193] Learning Action Hierarchies via Hybrid Geometric Diffusion
Arjun Ramesh Kaushik,Nalini K. Ratha,Venu Govindaraju
Main category: cs.CV
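TL;DR: Proposes HybridTAS, a framework that combines Euclidean and hyperbolic geometry in the denoising process of diffusion models to exploit the hierarchical structure of actions when assigning an action label to each video frame.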
Details
Motivation: Existing methods based on iterative refinement do not explicitly exploit the hierarchical structure of human actions, which limits temporal action segmentation performance.
Method: Introduces a hybrid of Euclidean and hyperbolic geometry into the denoising process of diffusion models: using the tree-like structure of hyperbolic space, denoising is guided by high-level action categories at higher timesteps and refined with fine-grained action classes at lower timesteps, giving coarse-to-fine segmentation.
Result: Experiments on three benchmark datasets (GTEA, 50Salads, and Breakfast) show that the method achieves state-of-the-art performance.
Conclusion: A hyperbolic-guided denoising strategy can effectively model the hierarchical relationships among actions and markedly improve temporal action segmentation.
Abstract: Temporal action segmentation is a critical task in video understanding, where the goal is to assign action labels to each frame in a video. While recent advances leverage iterative refinement-based strategies, they fail to explicitly utilize the hierarchical nature of human actions. In this work, we propose HybridTAS - a novel framework that incorporates a hybrid of Euclidean and hyperbolic geometries into the denoising process of diffusion models to exploit the hierarchical structure of actions. Hyperbolic geometry naturally provides tree-like relationships between embeddings, enabling us to guide the action label denoising process in a coarse-to-fine manner: higher diffusion timesteps are influenced by abstract, high-level action categories (root nodes), while lower timesteps are refined using fine-grained action classes (leaf nodes). Extensive experiments on three benchmark datasets, GTEA, 50Salads, and Breakfast, demonstrate that our method achieves state-of-the-art performance, validating the effectiveness of hyperbolic-guided denoising for the temporal action segmentation task.
[194] TalkPhoto: A Versatile Training-Free Conversational Assistant for Intelligent Image Editing
Yujie Hu,Zecheng Tang,Xu Jiang,Weiqi Li,Jian Zhang
Main category: cs.CV
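TL;DR: This paper proposes TalkPhoto, a versatile training-free image editing framework that enables precise image manipulation through conversational interaction; it uses a large language model to analyze user needs and hierarchically invoke existing advanced editing methods, achieving stable, high-quality edits without additional training.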
Details
Motivation: Existing instruction-based image editing methods depend on training with multi-instruction datasets, which is time-consuming and labor-intensive and gives limited results, making it hard to handle complex and unseen editing tasks.
Method: Proposes the TalkPhoto framework, which uses an open-source large language model with a specially designed prompt template to analyze user instructions and hierarchically invoke existing image editing methods, giving a plug-and-play, training-free, and efficient editing pipeline.
Result: Experiments show that the method achieves more accurate invocation, lower token consumption, and higher editing quality across a variety of image editing tasks.
Conclusion: TalkPhoto offers a flexible, controllable, and efficient image editing solution that can handle complex and unseen editing tasks and markedly outperforms existing methods that require training.
Abstract: Thanks to the powerful language comprehension capabilities of Large Language Models (LLMs), existing instruction-based image editing methods have introduced Multimodal Large Language Models (MLLMs) to promote information exchange between instructions and images, ensuring the controllability and flexibility of image editing. However, these frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks, which is not only time-consuming and labor-intensive but also fails to achieve satisfactory results. In this paper, we present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction. We instruct the open-source LLM with a specially designed prompt template to analyze user needs after receiving instructions and hierarchically invoke existing advanced editing methods, all without additional training. Moreover, we implement a plug-and-play and efficient invocation of image editing methods, allowing complex and unseen editing tasks to be integrated into the current framework, achieving stable and high-quality editing results. Extensive experiments demonstrate that our method not only provides more accurate invocation with lower token consumption but also achieves higher editing quality across various image editing tasks.
[195] AR-MOT: Autoregressive Multi-object Tracking
Lianjie Jia,Yuhan Wu,Binghao Ran,Yifan Wang,Lijun Wang,Huchuan Lu
Main category: cs.CV
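TL;DR: This paper proposes AR-MOT, an autoregressive multi-object tracking framework built on a large language model that treats MOT as a sequence generation task and flexibly outputs structured results without task-specific heads.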
Details
Motivation: Existing MOT methods have rigid, task-specific architectures that are hard to adapt to diverse and instruction-driven scenarios and lack flexibility and generality.
Method: Proposes AR-MOT, which uses a large language model to produce serialized outputs; it introduces an Object Tokenizer to enhance region-level perception, designs a Region-Aware Alignment (RAA) module to align global and regional features, and uses a Temporal Memory Fusion (TMF) module to fuse memory for long-term tracking.
Result: Experiments on MOT17 and DanceTrack show that AR-MOT performs comparably to current state-of-the-art methods while offering good extensibility and potential for multimodal adaptation.
Conclusion: AR-MOT provides a more general and flexible paradigm for multi-object tracking and may help move MOT toward instruction-driven and multimodal settings.
Abstract: As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module, and to support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT offers strong potential for extensibility, as new modalities or instructions can be integrated by simply modifying the output sequence format without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, achieving performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.
[196] MacVQA: Adaptive Memory Allocation and Global Noise Filtering for Continual Visual Question Answering
Zhifei Li,Yiran Wang,Chenyi Xiong,Yujing Xia,Xiaoju Hou,Yue Zhao,Miao Zhang,Kui Xiao,Bing Yang
Main category: cs.CV
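TL;DR: Proposes MacVQA, a new framework that uses adaptive memory allocation and global noise filtering to improve knowledge retention, adaptability, and the robustness of feature representations in continual visual question answering (VQA), outperforming existing methods on multiple tasks.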
Details
Motivation: Existing continual-learning VQA methods struggle to balance knowledge retention, adaptation to new information, and robust feature representation.
Method: Proposes the MacVQA framework, which fuses visual and question information, introduces a prototype-based memory allocation mechanism, and designs a noise filtering module to make representations more robust.
Result: Experiments on ten continual VQA tasks show that MacVQA reaches 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks, outperforming baseline methods.
Conclusion: MacVQA effectively balances knowledge acquisition, retention, and compositional generalization, markedly improving continual VQA performance.
Abstract: Visual Question Answering (VQA) requires models to reason over multimodal information, combining visual and textual data. With the development of continual learning, significant progress has been made in retaining knowledge and adapting to new information in the VQA domain. However, current methods often struggle with balancing knowledge retention, adaptation, and robust feature representation. To address these challenges, we propose a novel framework with adaptive memory allocation and global noise filtering called MacVQA for visual question answering. MacVQA fuses visual and question information while filtering noise to ensure robust representations, and employs prototype-based memory allocation to optimize feature quality and memory usage. These designs enable MacVQA to balance knowledge acquisition, retention, and compositional generalization in continual VQA learning. Experiments on ten continual VQA tasks show that MacVQA outperforms existing baselines, achieving 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks.
[197] Face Normal Estimation from Rags to Riches
Meng Wang,Wenjing Dai,Jiawan Zhang,Xiaojie Guo
Main category: cs.CV
TL;DR: Proposes a coarse-to-fine face normal estimation method that reduces dependence on large-scale paired data.
Details
Motivation: Existing face normal estimation methods rely heavily on large-scale paired data for training, which limits their applicability.
Method: First trains a compact model on a small dataset to produce coarse normals that serve as guidance (exemplars); a self-attention mechanism then captures long-range dependencies to remedy local artifacts, and a customized refinement network combines the input image with the exemplars to produce high-quality fine-grained normals.
Result: Experiments show that the method outperforms existing state-of-the-art methods in both training cost and estimation quality.
Conclusion: The proposed coarse-to-fine framework effectively reduces the need for large-scale paired data and computational resources in face normal estimation.
Abstract: Although recent approaches to face normal estimation have achieved promising results, their effectiveness heavily depends on large-scale paired data for training. This paper concentrates on relieving this requirement via developing a coarse-to-fine normal estimator. Concretely, our method first trains a neat model from a small dataset to produce coarse face normals that serve as guidance (called exemplars) for the following refinement. A self-attention mechanism is employed to capture long-range dependencies, thus remedying severe local artifacts left in estimated coarse facial normals. Then, a refinement network is customized for the sake of mapping input face images together with corresponding exemplars to fine-grained high-quality facial normals. Such a logical function split can significantly cut the requirement of massive paired data and computational resources. Extensive experiments and ablation studies are conducted to demonstrate the efficacy of our design and reveal its superiority over state-of-the-art methods in terms of both training expense and estimation quality. Our code and models are open-sourced at: https://github.com/AutoHDR/FNR2R.git.
[198] MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization
Zhexin Zhang,Yifeng Zhu,Yangyang Xu,Long Chen,Yong Du,Shengfeng He,Jun Yu
Main category: cs.CV
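TL;DR: This paper proposes MotionAdapter, a content-aware motion transfer framework that enables robust and semantically aligned motion transfer in DiT-based text-to-video models.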
Details
Motivation: Existing diffusion models still struggle with complex motion transfer: it is hard to convey the motion of a reference video accurately while preserving the semantics of the target content.
Method: Extracts attention-derived motion fields by analyzing cross-frame attention in 3D full-attention modules, and introduces a DINO-guided motion customization module that rearranges and refines the motion fields based on content correspondences, so that during denoising the DiT is guided to generate videos that inherit the reference motion while preserving the target appearance.
Result: Experiments show that MotionAdapter outperforms current state-of-the-art methods in both qualitative and quantitative evaluations and supports complex motion transfer and editing tasks such as zooming.
Conclusion: By explicitly disentangling motion from appearance and customizing motion adaptively, MotionAdapter achieves high-quality motion transfer across content and extends the motion control and editing capabilities of DiT-based T2V models.
Abstract: Recent advances in diffusion-based text-to-video models, particularly those built on the diffusion transformer architecture, have achieved remarkable progress in generating high-quality and temporally coherent videos. However, transferring complex motions between videos remains challenging. In this work, we present MotionAdapter, a content-aware motion transfer framework that enables robust and semantically aligned motion transfer within DiT-based T2V models. Our key insight is that effective motion transfer requires (i) explicit disentanglement of motion from appearance and (ii) adaptive customization of motion to target content. MotionAdapter first isolates motion by analyzing cross-frame attention within 3D full-attention modules to extract attention-derived motion fields. To bridge the semantic gap between reference and target videos, we further introduce a DINO-guided motion customization module that rearranges and refines motion fields based on content correspondences. The customized motion field is then used to guide the DiT denoising process, ensuring that the synthesized video inherits the reference motion while preserving target appearance and semantics. Extensive experiments demonstrate that MotionAdapter outperforms state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.
[199] AFTER: Mitigating the Object Hallucination of LVLM via Adaptive Factual-Guided Activation Editing
Tianbo Wang,Yuqing Ma,Kewei Liao,Zhange Zhang,Simin Li,Jinyang Guo,Xianglong Liu
Main category: cs.CV
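TL;DR: This paper proposes AFTER, an adaptive factual-guided visual-textual editing method for mitigating object hallucination in large vision-language models; its FAS and QAO modules effectively reduce language bias, and it markedly lowers hallucination rates on multiple benchmarks.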
Details
Motivation: Because of language bias, large vision-language models are prone to object hallucination in categories, attributes, and relations, which hinders trustworthy AI applications; existing activation editing methods do not make effective use of factual textual semantics and so struggle to mitigate language bias explicitly.
Method: Proposes the AFTER framework, consisting of Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO): FAS provides general guidance by augmenting factual semantics, modeling precise visual-textual associations; QAO introduces a query-aware offset estimator to make the editing query-specific.
Result: On standard hallucination benchmarks across three mainstream LVLMs, AFTER clearly outperforms baseline methods, reducing hallucination by up to 16.3% on the AMBER benchmark.
Conclusion: By combining factual textual semantics with an adaptive editing strategy, AFTER effectively mitigates object hallucination in LVLMs and improves model reliability and cross-modal understanding.
Abstract: Large Vision-Language Models (LVLMs) have achieved substantial progress in cross-modal tasks. However, due to language bias, LVLMs are susceptible to object hallucination, which can be primarily divided into category, attribute, and relation hallucination, significantly impeding trustworthy AI applications. Editing the internal activations of LVLMs has shown promising effectiveness in mitigating hallucinations with minimal cost. However, previous editing approaches neglect the effective guidance offered by factual textual semantics, thereby struggling to explicitly mitigate language bias. To address these issues, we propose Adaptive Factual-guided Visual-Textual Editing for hallucination mitigation (AFTER), which comprises Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO), to adaptively guide the original biased activations towards factual semantics. Specifically, FAS is proposed to provide factual and general guidance for activation editing, thereby explicitly modeling the precise visual-textual associations. Subsequently, QAO introduces a query-aware offset estimator to establish query-specific editing from the general steering vector, enhancing the diversity and granularity of editing. Extensive experiments on standard hallucination benchmarks across three widely adopted LVLMs validate the efficacy of the proposed AFTER, notably achieving up to a 16.3% reduction of hallucination over baseline on the AMBER benchmark.
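The idea of a general steering vector refined by a query-specific offset can be sketched as follows. This is only an illustration of generic activation steering, not the AFTER implementation: computing the steering vector as a difference of mean activations is a common technique assumed here, the offset is a hand-written stand-in for the output of the query-aware estimator, and `steering_vector`, `edit_activation`, and `alpha` are hypothetical names.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equally sized activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(factual_acts, biased_acts):
    """General steering direction: mean factual minus mean biased activation.

    Assumption: a mean-difference vector; the paper's FAS construction may differ.
    """
    mean_f = mean_vector(factual_acts)
    mean_b = mean_vector(biased_acts)
    return [f - b for f, b in zip(mean_f, mean_b)]


def edit_activation(h, general, offset, alpha=1.0):
    """Shift activation h along the general direction plus a query-specific offset."""
    return [x + alpha * (g + o) for x, g, o in zip(h, general, offset)]


# Toy 2-D activations collected under factual and biased conditions.
factual = [[1.0, 0.0], [3.0, 2.0]]
biased = [[0.0, 0.0], [2.0, 0.0]]
general = steering_vector(factual, biased)

# Hypothetical offset, standing in for the query-aware estimator's output.
offset = [0.25, -0.25]
edited = edit_activation([0.5, -0.5], general, offset, alpha=2.0)
```

With these toy numbers the general direction is `[1.0, 1.0]` and the edited activation is `[3.0, 1.0]`. In a real model the edit would be applied to hidden states at chosen layers during inference, and the offset would be predicted per query rather than fixed.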
Our code and data will be released for reproducibility.
[200] Forget Less by Learning Together through Concept Consolidation
Arjun Ramesh Kaushik,Naresh Kumar Devulapally,Vishnu Suresh Lokhande,Nalini Ratha,Venu Govindaraju
Main category: cs.CV
TL;DR: 提出了一种名为FL2T的新框架,通过集合不变的跨概念学习模块实现并发且顺序无关的概念学习,有效缓解了定制化扩散模型中的灾难性遗忘问题。
Details
Motivation: 现有定制化扩散模型在持续学习新概念时存在灾难性遗忘问题,且大多工作局限于顺序学习设置,忽略了概念间的相互作用。 Method: 提出了Forget Less by Learning Together (FL2T)框架,引入集合不变的跨概念学习模块,利用代理(proxies)引导跨概念的特征选择,促进知识保留与迁移。 Result: 在三个数据集上的大量实验表明,该方法显著提升了概念保持能力,在十项增量学习任务中平均CLIP图像对齐分数至少提高2%。 Conclusion: FL2T通过跨概念引导机制有效缓解了灾难性遗忘,验证了跨概念催化行为在增量概念学习中的有效性。 Abstract: Custom Diffusion Models (CDMs) have gained significant attention due to their remarkable ability to personalize generative processes. However, existing CDMs suffer from catastrophic forgetting when continuously learning new concepts. Most prior works attempt to mitigate this issue under the sequential learning setting with a fixed order of concept inflow and neglect inter-concept interactions. In this paper, we propose a novel framework - Forget Less by Learning Together (FL2T) - that enables concurrent and order-agnostic concept learning while addressing catastrophic forgetting. Specifically, we introduce a set-invariant inter-concept learning module where proxies guide feature selection across concepts, facilitating improved knowledge retention and transfer. By leveraging inter-concept guidance, our approach preserves old concepts while efficiently incorporating new ones. Extensive experiments, across three datasets, demonstrates that our method significantly improves concept retention and mitigates catastrophic forgetting, highlighting the effectiveness of inter-concept catalytic behavior in incremental concept learning of ten tasks with at least 2% gain on average CLIP Image Alignment scores.[201] Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
Weijian Ma,Shizhao Sun,Tianyu Yu,Ruiyu Wang,Tat-Seng Chua,Jiang Bian
Main category: cs.CV
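TL;DR: This paper proposes an object-centric blueprint approach for enhancing spatial reasoning in vision-language models; it builds a structured representation and combines supervised fine-tuning, reinforcement learning rewards, and data augmentation to significantly improve the model's understanding of spatial semantics.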
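As a rough illustration of the structured representation this entry refers to, the sketch below builds a small JSON-style record of object positions, sizes, and attributes and answers a simple spatial question from it. The field names (`objects`, `name`, `position`, `size`, `attributes`), the example objects, and the `left_of` helper are all hypothetical; the paper's actual blueprint schema is not given in this listing.

```python
import json

# Hypothetical blueprint: one record per relevant object.
# "position" is an assumed (x, y) centre in pixels, "size" an assumed (width, height).
blueprint = {
    "objects": [
        {"name": "mug", "position": [120, 340], "size": [60, 80], "attributes": ["red"]},
        {"name": "laptop", "position": [400, 300], "size": [320, 200], "attributes": ["open"]},
    ]
}


def left_of(a, b):
    """Return True if object a's centre lies to the left of object b's centre."""
    return a["position"][0] < b["position"][0]


# Round-trip through JSON text, as a model that emits the blueprint as text would.
text = json.dumps(blueprint)
parsed = json.loads(text)

mug, laptop = parsed["objects"]
answer = left_of(mug, laptop)
print("Is the mug left of the laptop?", answer)
```

Reasoning over an explicit record like this keeps the spatial comparison separate from perception, which is the general idea the entry describes; how the model actually produces and consumes the blueprint is left to the paper.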
Details
Motivation: Existing approaches to spatial reasoning face a trade-off between local perception and global spatial awareness, making it hard to capture both object locations and their overall organization.
Method: An object-centric, JSON-style blueprint records the positions, sizes, and attributes of relevant objects. The model is trained with supervised fine-tuning on blueprint-embedded reasoning traces, blueprint-aware reinforcement learning rewards that encourage consistent causal reasoning, and anti-shortcut data augmentation that reduces reliance on superficial cues.
Result: Experiments show that the method consistently outperforms existing vision-language models and specialized spatial reasoning models on multiple benchmarks.
Conclusion: By integrating the cognitive concept of a blueprint into VLMs, the method effectively strengthens spatial reasoning, advancing from visual perception to spatial semantic understanding.
Abstract: Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception but weakening global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning to elicit basic reasoning skills; (2) blueprint-aware rewards in reinforcement learning to encourage the blueprint to include an appropriate number of objects and to align final answers with this causal reasoning; and (3) anti-shortcut data augmentation that applies targeted perturbations to images and questions, discouraging reliance on superficial visual or linguistic cues. Experiments show that our method consistently outperforms existing VLMs and specialized spatial reasoning models.
[202] VIT-Ped: Visionary Intention Transformer for Pedestrian Behavior Analysis
Aly R. Elkammar,Karim M. Gamaleldin,Catherine M. Elias
Main category: cs.CV
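TL;DR: This paper proposes a transformer / video vision transformer based algorithm for pedestrian intention prediction that achieves better accuracy, AUC, and F1-score than existing techniques on the JAAD dataset.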
Details
Motivation: Pedestrian intention prediction is one of the key technologies for achieving Level 4 autonomous driving; multiple features must be considered together to improve road safety.
Method: Transformer / video vision transformer models of different sizes fuse multiple data modalities to understand pedestrian behavior.
Result: The models reach SOTA performance on the JAAD dataset, surpassing existing methods in Accuracy, AUC, and F1-score.
Conclusion: The proposed model design effectively improves pedestrian intention prediction, and ablation studies confirm the benefits of the individual design choices.
Abstract: Pedestrian intention prediction is one of the key technologies in the transition from level 3 to level 4 autonomous driving. To understand pedestrian crossing behaviour, several elements and features should be taken into consideration to make the roads of tomorrow safer for everybody. We introduce transformer / video vision transformer based algorithms of different sizes that use different data modalities. We evaluated our algorithms on a popular pedestrian behaviour dataset, JAAD, and reached SOTA performance, surpassing the previous SOTA in metrics such as Accuracy, AUC and F1-score. The advantages brought by different model design choices are investigated via extensive ablation studies.
[203] API: Empowering Generalizable Real-World Image Dehazing via Adaptive Patch Importance Learning
Chen Zhu,Huiwen Zhang,Yujie Li,Mu He,Xiaotian Qiao
Main category: cs.CV
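TL;DR: Proposes an Adaptive Patch Importance-aware (API) framework for generalizable real-world image dehazing, comprising Automatic Haze Generation (AHG) and Density-aware Haze Removal (DHR) modules together with a Multi-Negative Contrastive Dehazing (MNCD) loss, significantly improving dehazing performance and generalization.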
Details
Motivation: Existing learning-based dehazing methods perform poorly in complex real-world scenes, mainly because of limited training data and the complexity of haze density distributions.
Method: The API framework includes an AHG module that generates diverse, realistic hazy images to augment the training data and a DHR module that handles regions of different haze density adaptively; an MNCD loss uses multiple negative samples in the spatial and frequency domains to reduce ambiguity in dehazed details.
Result: The framework achieves state-of-the-art performance on multiple real-world dehazing benchmarks, with strong quantitative metrics and visual quality, and generalizes robustly across different haze distributions.
Conclusion: Combining the AHG and DHR modules with the MNCD loss, the API framework improves both the performance and the generalization of real-world image dehazing and advances practical applications in this area.
Abstract: Real-world image dehazing is a fundamental yet challenging task in low-level vision. Existing learning-based methods often suffer from significant performance degradation when applied to complex real-world hazy scenes, primarily due to limited training data and the intrinsic complexity of haze density distributions. To address these challenges, we introduce a novel Adaptive Patch Importance-aware (API) framework for generalizable real-world image dehazing. Specifically, our framework consists of an Automatic Haze Generation (AHG) module and a Density-aware Haze Removal (DHR) module. AHG provides a hybrid data augmentation strategy by generating realistic and diverse hazy images as additional high-quality training data. DHR considers hazy regions with varying haze density distributions for generalizable real-world image dehazing in an adaptive patch importance-aware manner. To alleviate the ambiguity of the dehazed image details, we further introduce a new Multi-Negative Contrastive Dehazing (MNCD) loss, which fully utilizes information from multiple negative samples across both spatial and frequency domains. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple real-world benchmarks, delivering strong results in both quantitative metrics and qualitative visual quality, and exhibiting robust generalization across diverse haze distributions.
[204] Nighttime Hazy Image Enhancement via Progressively and Mutually Reinforcing Night-Haze Priors
Chen Zhu,Huiwen Zhang,Mu He,Yujie Li,Xiaotian Qiao
Main category: cs.CV
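TL;DR: This paper proposes a new framework that enhances the visibility of nighttime hazy images using expert models at the image, patch, and pixel levels across the visual and frequency domains, with a frequency-aware router adaptively guiding each expert's contribution.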
Details
Motivation: Existing methods mostly handle a single type of degradation (such as haze or low light) at a time and ignore the interplay between degradation types, which limits the visibility improvement.
Method: The model uses image-level, patch-level, and pixel-level experts operating in the visual and frequency domains to progressively recover global scene structure, regional patterns, and fine-grained details. A frequency-aware router adaptively guides each expert's contribution.
Result: Extensive experiments show superior performance on nighttime dehazing benchmarks, both quantitatively and qualitatively. The model also generalizes to daytime dehazing and low-light enhancement tasks.
Conclusion: By reinforcing the intrinsic consistency between haze and low-light priors, the framework effectively improves the visibility of nighttime hazy images and generalizes well.
Abstract: Enhancing the visibility of nighttime hazy images is challenging due to the complex degradation distributions. Existing methods mainly address a single type of degradation (e.g., haze or low-light) at a time, ignoring the interplay of different degradation types and resulting in limited visibility improvement. We observe that the domain knowledge shared between low-light and haze priors can be reinforced mutually for better visibility. Based on this key insight, in this paper, we propose a novel framework that enhances visibility in nighttime hazy images by reinforcing the intrinsic consistency between haze and low-light priors mutually and progressively. In particular, our model utilizes image-, patch-, and pixel-level experts that operate across visual and frequency domains to recover global scene structure, regional patterns, and fine-grained details progressively. A frequency-aware router is further introduced to adaptively guide the contribution of each expert, ensuring robust image restoration. Extensive experiments demonstrate the superior performance of our model on nighttime dehazing benchmarks both quantitatively and qualitatively. Moreover, we showcase the generalizability of our model in daytime dehazing and low-light enhancement tasks.
[205] Enhancing Object Detection with Privileged Information: A Model-Agnostic Teacher-Student Approach
Matthias Bartolo,Dylan Seychell,Gabriel Hili,Matthew Montebello,Carl James Debono,Saviour Formosa,Konstantinos Makantasis
Main category: cs.CV
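TL;DR: This paper proposes a model-agnostic teacher-student architecture based on the Learning Using Privileged Information (LUPI) paradigm, which introduces privileged information such as masks, saliency maps, and depth cues during training to improve object detection without increasing inference complexity.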
Details
Motivation: In object detection, rich fine-grained descriptive information is available during training but not at inference; the question is how to exploit this privileged information to improve model performance.
Method: Following the LUPI paradigm, a teacher-student framework injects privileged information (such as bounding box masks, saliency maps, and depth cues) into several mainstream detection models to transfer knowledge.
Result: On multiple benchmarks (including UAV-based litter detection and Pascal VOC 2012), student models significantly outperform baselines in detection accuracy, especially for medium and large objects, with unchanged inference efficiency.
Conclusion: The LUPI framework is an effective and practical strategy for enhancing object detection, suited to resource-constrained and real-world settings.
Abstract: This paper investigates the integration of the Learning Using Privileged Information (LUPI) paradigm in object detection to exploit fine-grained, descriptive information available during training but not at inference. We introduce a general, model-agnostic methodology for injecting privileged information, such as bounding box masks, saliency maps, and depth cues, into deep learning-based object detectors through a teacher-student architecture. Experiments are conducted across five state-of-the-art object detection models and multiple public benchmarks, including UAV-based litter detection datasets and Pascal VOC 2012, to assess the impact on accuracy, generalization, and computational efficiency. Our results demonstrate that LUPI-trained students consistently outperform their baseline counterparts, achieving significant boosts in detection accuracy with no increase in inference complexity or model size. Performance improvements are especially marked for medium and large objects, while ablation studies reveal that intermediate weighting of teacher guidance optimally balances learning from privileged and standard inputs. The findings affirm that the LUPI framework provides an effective and practical strategy for advancing object detection systems in both resource-constrained and real-world settings.
[206] Towards Any-Quality Image Segmentation via Generative and Adaptive Latent Space Enhancement
Guangqian Guo,Aixi Ren,Yong Guo,Xuehui Yu,Jiacheng Tian,Wenli Li,Yaoxing Wang,Shan Gao
Main category: cs.CV
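TL;DR: This paper proposes GleSAM++, which uses generative latent space enhancement and a degradation-aware adaptive enhancement mechanism to improve the segmentation robustness of Segment Anything Models on low-quality images.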
Details
Motivation: Existing SAM models degrade significantly on severely degraded, low-quality images, limiting their use in real-world scenarios; models need to be more robust and to generalize across image qualities.
Method: GleSAM++ uses Generative Latent space Enhancement to improve robustness on low-quality images; Feature Distribution Alignment (FDA) and Channel Replication and Expansion (CRE) improve compatibility between the pre-trained diffusion model and the segmentation framework; a Degradation-aware Adaptive Enhancement (DAE) mechanism decouples reconstruction into degradation-level prediction and degradation-aware reconstruction.
Result: Experiments show that GleSAM++ significantly improves segmentation robustness under complex degradations while maintaining good generalization to clear images, and performs well on unseen degradation types.
Conclusion: GleSAM++ effectively improves the segmentation performance of the SAM family on low-quality and degraded images with only a small number of learnable parameters, allowing efficient optimization and broad applicability.
Abstract: Segment Anything Models (SAMs), known for their exceptional zero-shot segmentation performance, have garnered significant attention in the research community. Nevertheless, their performance drops significantly on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM++, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Additionally, to improve compatibility between the pre-trained diffusion model and the segmentation framework, we introduce two techniques, i.e., Feature Distribution Alignment (FDA) and Channel Replication and Expansion (CRE). However, the above components lack explicit guidance regarding the degree of degradation. The model is forced to implicitly fit a complex noise distribution that spans conditions from mild noise to severe artifacts, which substantially increases the learning burden and leads to suboptimal reconstructions. To address this issue, we further introduce a Degradation-aware Adaptive Enhancement (DAE) mechanism. The key principle of DAE is to decouple the reconstruction process for arbitrary-quality features into two stages: degradation-level prediction and degradation-aware reconstruction. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. Extensive experiments demonstrate that GleSAM++ significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM++ also performs well on unseen degradations, underscoring the versatility of our approach and dataset.
[207] Adapting Depth Anything to Adverse Imaging Conditions with Events
Shihan Peng,Yuyang Xiong,Hanyu Zhou,Zhiwei Shi,Haoyue Liu,Gang Chen,Luxin Yan,Yi Chang
Main category: cs.CV
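TL;DR: This paper proposes ADAE, an event-guided spatiotemporal fusion framework that improves the robustness of Depth Anything depth estimation under adverse lighting and dynamic conditions.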
Details
Motivation: Existing depth foundation models perform well in ideal scenes but degrade under conditions such as extreme illumination and motion blur; existing methods that fuse event cameras are usually trained from scratch and cannot inherit the open-world generalization of foundation models.
Method: ADAE has two key modules: (1) entropy-based adaptive spatial fusion, which detects and compensates for illumination-induced degradation; and (2) temporal feature correction driven by motion cues from event data, which mitigates motion blur. The method introduces event data on top of Depth Anything for robust depth estimation.
Result: Experiments show that ADAE clearly outperforms existing methods across a variety of degraded scenes, improving the accuracy and stability of Depth Anything under extreme illumination and motion blur.
Conclusion: ADAE brings the advantages of event cameras into a depth foundation model, strengthening robustness under adverse imaging conditions while retaining open-world generalization, and provides more reliable depth perception for practical robotic applications.
Abstract: Robust depth estimation under dynamic and adverse lighting conditions is essential for robotic systems. Currently, depth foundation models, such as Depth Anything, achieve great success in ideal scenes but struggle under adverse imaging conditions such as extreme illumination and motion blur. These degradations corrupt the visual signals of frame cameras, weakening the discriminative features of frame-based depths across the spatial and temporal dimensions. Typically, existing approaches incorporate event cameras to leverage their high dynamic range and temporal resolution, aiming to compensate for corrupted frame features. However, such specialized fusion models are predominantly trained from scratch on domain-specific datasets, thereby failing to inherit the open-world knowledge and robust generalization inherent to foundation models. In this work, we propose ADAE, an event-guided spatiotemporal fusion framework for Depth Anything in degraded scenes. Our design is guided by two key insights: 1) Entropy-Aware Spatial Fusion. We adaptively merge frame-based and event-based features using an information entropy strategy to indicate illumination-induced degradation. 2) Motion-Guided Temporal Correction. We resort to the event-based motion cue to recalibrate ambiguous features in blurred regions. Under our unified framework, the two components are complementary to each other and jointly enhance Depth Anything under adverse imaging conditions. Extensive experiments have been performed to verify the superiority of the proposed method. Our code will be released upon acceptance.
[208] Leveraging 2D-VLM for Label-Free 3D Segmentation in Large-Scale Outdoor Scene Understanding
Toshihiko Nishimura,Hirofumi Abe,Kazuhiko Murasaki,Taiga Yoshida,Ryuichi Tanida
Main category: cs.CV
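TL;DR: Proposes a 3D semantic segmentation method that needs no annotated 3D training data: virtual cameras project the point cloud onto 2D images, a foundation 2D model guided by natural language prompts segments them, and weighted voting across multiple viewpoints yields the 3D segmentation, with support for open-vocabulary recognition.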
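The multi-view weighted voting step can be pictured with a minimal sketch, assuming each view contributes `(point_index, label, weight)` votes for the 3D points it sees. The vote format, the weights, and the `aggregate_votes` name are illustrative assumptions; the listing does not say how the paper defines its weights.

```python
from collections import defaultdict


def aggregate_votes(point_votes, num_points):
    """Pick, for each 3D point, the label with the largest summed weight.

    point_votes: iterable of (point_index, label, weight) tuples gathered
    from all views (hypothetical format). Points never seen get None.
    """
    scores = [defaultdict(float) for _ in range(num_points)]
    for idx, label, weight in point_votes:
        scores[idx][label] += weight
    return [max(s, key=s.get) if s else None for s in scores]


# Toy example: three points, votes from three imagined views.
votes = [
    (0, "road", 0.9), (0, "building", 0.4), (0, "road", 0.3),
    (1, "tree", 0.2), (1, "building", 0.5), (1, "tree", 0.2),
]
labels = aggregate_votes(votes, num_points=3)
print(labels)  # ['road', 'building', None]
```

Point 1 shows why weights matter: two low-confidence "tree" votes (0.4 in total) lose to a single "building" vote of 0.5, whereas an unweighted majority would have chosen "tree".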
Details
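Motivation: Conventional 3D semantic segmentation depends on large amounts of annotated 3D data or paired RGB images, which is costly and inflexible; a closed vocabulary also limits recognition of new categories.
Method: 3D point clouds are projected to 2D images with virtual cameras, a foundation 2D model guided by natural language prompts performs semantic segmentation, and predictions from multiple viewpoints are aggregated into a 3D segmentation by weighted voting.
Result: On large-scale point cloud data, the method outperforms existing training-free approaches, approaches the performance of supervised methods, and supports open-vocabulary recognition, detecting objects from arbitrary text queries.
Conclusion: The method achieves 3D semantic segmentation without 3D annotated training data, combining accuracy with flexibility and advancing open-vocabulary 3D scene understanding.
Abstract: This paper presents a novel 3D semantic segmentation method for large-scale point cloud data that does not require annotated 3D training data or paired RGB images. The proposed approach projects 3D point clouds onto 2D images using virtual cameras and performs semantic segmentation via a foundation 2D model guided by natural language prompts. 3D segmentation is achieved by aggregating predictions from multiple viewpoints through weighted voting. Our method outperforms existing training-free approaches and achieves segmentation accuracy comparable to supervised methods. Moreover, it supports open-vocabulary recognition, enabling users to detect objects using arbitrary text queries, thus overcoming the limitations of traditional supervised approaches.
[209] AlignVTOFF: Texture-Spatial Feature Alignment for High-Fidelity Virtual Try-Off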
Yihan Zhu,Mengying Ge
Main category: cs.CV
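TL;DR: Proposes AlignVTOFF, a parallel U-Net framework built on a Reference U-Net and Texture-Spatial Feature Alignment, to improve the quality of high-fidelity flat-lay garment generation in Virtual Try-Off.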
Details
Motivation: Existing methods struggle to preserve structured patterns and details under complex geometric deformation and high-frequency textures, leading to texture attenuation during generation.
Method: A Reference U-Net performs multi-scale feature extraction to enhance geometric fidelity, and a TSFA module uses a hybrid attention mechanism to inject reference features into a frozen denoising U-Net, aligning texture and spatial cues.
Result: Across multiple experimental settings, AlignVTOFF outperforms existing state-of-the-art methods, with better structural realism and high-frequency detail fidelity.
Conclusion: AlignVTOFF addresses the challenges of texture preservation and geometric deformation modeling in Virtual Try-Off and significantly improves the quality of generated images.
Abstract: Virtual Try-Off (VTOFF) is a challenging multimodal image generation task that aims to synthesize high-fidelity flat-lay garments under complex geometric deformation and rich high-frequency textures. Existing methods often rely on lightweight modules for fast feature extraction, which struggle to preserve structured patterns and fine-grained details, leading to texture attenuation during generation. To address these issues, we propose AlignVTOFF, a novel parallel U-Net framework built upon a Reference U-Net and Texture-Spatial Feature Alignment (TSFA). The Reference U-Net performs multi-scale feature extraction and enhances geometric fidelity, enabling robust modeling of deformation while retaining complex structured patterns. TSFA then injects the reference garment features into a frozen denoising U-Net via a hybrid attention design, consisting of a trainable cross-attention module and a frozen self-attention module. This design explicitly aligns texture and spatial cues and alleviates the loss of high-frequency information during the denoising process. Extensive experiments across multiple settings demonstrate that AlignVTOFF consistently outperforms state-of-the-art methods, producing flat-lay garment results with improved structural realism and high-frequency detail fidelity.
[210] Agentic Retoucher for Text-To-Image Generation
Shaocheng Shen,Jianfeng Liang,Chunlei Cai,Cong Geng,Huiyu Duan,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai
Main category: cs.CV
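TL;DR: Proposes Agentic Retoucher, a hierarchical decision-driven framework for post-generation correction of text-to-image outputs, which uses a perception-reasoning-action loop to achieve fine-grained, reliable local repair.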
Details
Motivation: Existing T2I diffusion models suffer from local distortions, while current refinement methods are either costly or weak in spatial grounding, leading to semantic drift and unreliable edits.
Method: A human-like perception-reasoning-action loop is designed: a perception agent localizes distortions using text-image consistency cues; a reasoning agent diagnoses them via progressive preference alignment; an action agent adaptively plans localized inpainting. The GenBlemish-27K dataset is built for training and evaluation.
Result: The method outperforms existing state-of-the-art approaches in perceptual quality, distortion localization, and human preference alignment.
Conclusion: Agentic Retoucher offers a more reliable, interpretable, and user-aligned paradigm for self-correction in T2I models.
Abstract: Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, faces, text, and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.
[211] PhysSFI-Net: Physics-informed Geometric Learning of Skeletal and Facial Interactions for Orthognathic Surgical Outcome Prediction
Jiahao Bao,Huazhen Liu,Yu Zhuang,Leran Tao,Xinyu Xu,Yongtao Shi,Mengjia Cheng,Yiming Wang,Congshuang Ku,Ting Zeng,Yilang Du,Siyi Chen,Shunyao Shen,Suncheng Xiang,Hongbo Yu
Main category: cs.CV
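TL;DR: This study proposes PhysSFI-Net, a physics-informed geometric deep learning framework for precise prediction of soft tissue deformation after orthognathic surgery, combining high accuracy with interpretability.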
Details
Motivation: Traditional biomechanical models are computationally expensive and geometric deep learning methods lack interpretability, so a method for predicting postoperative facial morphology that balances accuracy and interpretability is needed.
Method: PhysSFI-Net has three modules: a hierarchical graph network (combining craniofacial and surgical plan encoders with attention mechanisms), an LSTM-based sequential predictor of soft tissue deformation, and a biomechanics-inspired module for high-resolution facial surface reconstruction.
Result: Validated on data from 135 patients, PhysSFI-Net achieves a point cloud shape error of 1.070 mm, a surface deviation error of 1.296 mm, and a landmark localization error of 2.445 mm, outperforming existing methods such as ACMT-Net.
Conclusion: PhysSFI-Net enables interpretable, high-resolution, and accurate prediction of postoperative facial morphology and shows good potential for clinical use in orthognathic surgical planning and simulation.
Abstract: Orthognathic surgery repositions jaw bones to restore occlusion and enhance facial aesthetics. Accurate simulation of postoperative facial morphology is essential for preoperative planning. However, traditional biomechanical models are computationally expensive, while geometric deep learning approaches often lack interpretability. In this study, we develop and validate a physics-informed geometric deep learning framework named PhysSFI-Net for precise prediction of soft tissue deformation following orthognathic surgery. PhysSFI-Net consists of three components: a hierarchical graph module with craniofacial and surgical plan encoders combined with attention mechanisms to extract skeletal-facial interaction features; a Long Short-Term Memory (LSTM)-based sequential predictor for incremental soft tissue deformation; and a biomechanics-inspired module for high-resolution facial surface reconstruction. Model performance was assessed using point cloud shape error (Hausdorff distance), surface deviation error, and landmark localization error (Euclidean distances of craniomaxillofacial landmarks) between predicted facial shapes and corresponding ground truths. A total of 135 patients who underwent combined orthodontic and orthognathic treatment were included for model training and validation. Quantitative analysis demonstrated that PhysSFI-Net achieved a point cloud shape error of 1.070 ± 0.088 mm, a surface deviation error of 1.296 ± 0.349 mm, and a landmark localization error of 2.445 ± 1.326 mm. Comparative experiments indicated that PhysSFI-Net outperformed the state-of-the-art method ACMT-Net in prediction accuracy. In conclusion, PhysSFI-Net enables interpretable, high-resolution prediction of postoperative facial morphology with superior accuracy, showing strong potential for clinical application in orthognathic surgical planning and simulation.
[212] MCD-Net: A Lightweight Deep Learning Baseline for Optical-Only Moraine Segmentation
Zhehuan Cao,Fiseha Berhanu Tesema,Ping Fu,Jianfeng Ren,Ahmed Nasr
Main category: cs.CV
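TL;DR: This study presents the first large-scale optical-only moraine segmentation dataset and a lightweight model, MCD-Net, which reaches 62.3% mIoU and a 72.8% Dice coefficient while cutting computational cost by more than 60%, showing that reliable moraine-body segmentation is feasible from optical imagery alone.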
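For readers less familiar with the two metrics quoted here, the following minimal sketch computes Intersection over Union and the Dice coefficient for a single class on flat binary masks. The toy masks and the `iou_and_dice` name are illustrative assumptions; mIoU is the mean of the per-class IoU values, and the paper's evaluation code may differ in details such as how empty masks are handled.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    union = tp + fp + fn
    if union == 0:
        # Both masks empty: treated here as a perfect match (a convention, not a rule).
        return 1.0, 1.0
    iou = tp / union
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


# Toy masks: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
iou, dice = iou_and_dice(pred, target)
print(f"IoU={iou:.3f} Dice={dice:.3f}")  # IoU=0.500 Dice=0.667
```

For a single mask pair the two are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU; averaged scores over many images or classes need not satisfy that identity exactly.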
Details
Motivation: Weak optical contrast and the scarcity of high-resolution DEM data limit automated mapping of glacial landforms, so a dedicated dataset and an efficient model for moraine segmentation are needed.
Method: A dataset of 3,340 manually annotated high-resolution images is built, and MCD-Net combines a MobileNetV2 encoder, a CBAM attention module, and a DeepLabV3+ decoder.
Result: MCD-Net remains lightweight while reaching 62.3% mIoU and a 72.8% Dice coefficient, with computational cost reduced by more than 60%; the results show that optical imagery alone can segment moraine bodies effectively, although ridge delineation remains limited by sub-pixel width and spectral ambiguity.
Conclusion: The dataset and model provide a reproducible benchmark for moraine-specific segmentation, and MCD-Net has potential for deployment in high-altitude glacial monitoring.
Abstract: Glacial segmentation is essential for reconstructing past glacier dynamics and evaluating climate-driven landscape change. However, weak optical contrast and the limited availability of high-resolution DEMs hinder automated mapping. This study introduces the first large-scale optical-only moraine segmentation dataset, comprising 3,340 manually annotated high-resolution images from Google Earth covering glaciated regions of Sichuan and Yunnan, China. We develop MCD-Net, a lightweight baseline that integrates a MobileNetV2 encoder, a Convolutional Block Attention Module (CBAM), and a DeepLabV3+ decoder. Benchmarking against deeper backbones (ResNet152, Xception) shows that MCD-Net achieves 62.3% mean Intersection over Union (mIoU) and 72.8% Dice coefficient while reducing computational cost by more than 60%. Although ridge delineation remains constrained by sub-pixel width and spectral ambiguity, the results demonstrate that optical imagery alone can provide reliable moraine-body segmentation. The dataset and code are publicly available at https://github.com/Lyra-alpha/MCD-Net, establishing a reproducible benchmark for moraine-specific segmentation and offering a deployable baseline for high-altitude glacial monitoring.
[213] InpaintHuman: Reconstructing Occluded Humans with Multi-Scale UV Mapping and Identity-Preserving Diffusion Inpainting
Jinlong Fan,Shanshan Zhao,Liang Zheng,Jing Zhang,Yuxiang Yang,Mingming Gong
Main category: cs.CV
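TL;DR: This paper proposes InpaintHuman, a new method for generating high-fidelity, complete, and animatable 3D human avatars from occluded monocular video, achieving robust reconstruction through a multi-scale UV-parameterized representation and an identity-preserving diffusion inpainting module.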
Details
Motivation: Under severe occlusion, reconstructing complete and animatable 3D human avatars from monocular video remains challenging; existing methods often produce corrupted geometry and temporal inconsistencies.
Method: A multi-scale UV-parameterized representation with hierarchical feature interpolation is combined with an identity-preserving diffusion inpainting module based on textual inversion and semantic guidance, with pixel-level supervision to ensure identity fidelity.
Result: Experiments on synthetic and real-world datasets, including PeopleSnapshot, ZJU-MoCap, and OcMotion, show that the method significantly improves reconstruction quality across different poses and viewpoints.
Conclusion: InpaintHuman handles occlusion effectively and generates high-quality, spatiotemporally consistent 3D human avatars, outperforming existing methods.
Abstract: Reconstructing complete and animatable 3D human avatars from monocular videos remains challenging, particularly under severe occlusions. While 3D Gaussian Splatting has enabled photorealistic human rendering, existing methods struggle with incomplete observations, often producing corrupted geometry and temporal inconsistencies. We present InpaintHuman, a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos. Our approach introduces two key innovations: (i) a multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation, enabling robust reconstruction of occluded regions while preserving geometric details; and (ii) an identity-preserving diffusion inpainting module that integrates textual inversion with semantic-conditioned guidance for subject-specific, temporally coherent completion. Unlike SDS-based methods, our approach employs direct pixel-level supervision to ensure identity fidelity. Experiments on synthetic benchmarks (PeopleSnapshot, ZJU-MoCap) and real-world scenarios (OcMotion) demonstrate competitive performance with consistent improvements in reconstruction quality across diverse poses and viewpoints.
[214] 360-GeoGS: Geometrically Consistent Feed-Forward 3D Gaussian Splatting Reconstruction for 360 Images
Jiaqi Yao,Zhongmiao Yan,Jingyi Xu,Songpengcheng Xia,Yan Xiang,Ling Pei
Main category: cs.CV
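TL;DR: This paper proposes a new feed-forward 3D Gaussian Splatting framework for 360-degree images; by introducing a depth-normal geometric regularization, it significantly improves geometric consistency while maintaining high rendering quality, making it suitable for 3D reconstruction in spatial perception tasks.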
Details
Motivation: Traditional multi-view stereo performs poorly with sparse viewpoints or low-texture regions, while neural rendering methods give good results but lack real-time efficiency and require per-scene optimization; existing explicit 3D Gaussian Splatting methods mostly focus on visual quality and neglect geometric consistency, limiting their reliability in spatial intelligence applications.
Method: A feed-forward 3D Gaussian Splatting framework is proposed with a Depth-Normal geometric regularization that couples rendered depth gradients with normal information and supervises the rotation, scale, and position of the Gaussians, improving point cloud and surface reconstruction accuracy.
Result: Experiments show that the method maintains high rendering quality while significantly improving geometric consistency, and surpasses existing methods in point cloud and surface reconstruction accuracy.
Conclusion: The method offers an efficient and geometrically accurate solution for 3D scene reconstruction, particularly suited to applications that need reliable spatial perception, such as AR, robotics, and digital twins.
Abstract: 3D scene reconstruction is fundamental for spatial intelligence applications such as AR, robotics, and digital twins. Traditional multi-view stereo struggles with sparse viewpoints or low-texture regions, while neural rendering approaches, though capable of producing high-quality results, require per-scene optimization and lack real-time efficiency. Explicit 3D Gaussian Splatting (3DGS) enables efficient rendering, but most feed-forward variants focus on visual quality rather than geometric consistency, limiting accurate surface reconstruction and overall reliability in spatial perception tasks. This paper presents a novel feed-forward 3DGS framework for 360 images, capable of generating geometrically consistent Gaussian primitives while maintaining high rendering quality. A Depth-Normal geometric regularization is introduced to couple rendered depth gradients with normal information, supervising Gaussian rotation, scale, and position to improve point cloud and surface accuracy. Experimental results show that the proposed method maintains high rendering quality while significantly improving geometric consistency, providing an effective solution for 3D reconstruction in spatial perception tasks.
[215] HeadLighter: Disentangling Illumination in Generative 3D Gaussian Heads via Lightstage Captures
Yating Wang,Yuan Sun,Xuan Wang,Ran Yi,Boyao Zhou,Yipengjing Sun,Hongyu Liu,Yinuo Wang,Lizhuang Ma
Main category: cs.CV
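TL;DR: This paper proposes HeadLighter, a new framework that achieves a physically plausible disentanglement of appearance and illumination in 3D Gaussian Splatting based head generative models, supporting explicit lighting and viewpoint editing while preserving high-quality generation and real-time rendering.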
Details
Motivation: In existing 3D-aware head generative models, illumination and appearance are deeply entangled, which prevents controllable relighting, and existing disentanglement methods rely on strong assumptions that limit performance under complex illumination.
Method: A dual-branch architecture separately models lighting-invariant head attributes and physically grounded rendering components; progressive disentanglement training is supervised by multi-view images captured with a light stage; a normal distillation strategy improves rendering realism.
Result: Experiments show that the method supports fine-grained lighting and viewpoint editing while preserving real-time rendering and high-quality generation, outperforming existing disentanglement methods.
Conclusion: HeadLighter achieves more accurate illumination-appearance disentanglement, providing an effective solution for controllable, realistic 3D head generation; the code and dataset will be released publicly.
Abstract: Recent 3D-aware head generative models based on 3D Gaussian Splatting achieve real-time, photorealistic and view-consistent head synthesis. However, a fundamental limitation persists: the deep entanglement of illumination and intrinsic appearance prevents controllable relighting. Existing disentanglement methods rely on strong assumptions to enable weakly supervised learning, which restricts their capacity for complex illumination. To address this challenge, we introduce HeadLighter, a novel supervised framework that learns a physically plausible decomposition of appearance and illumination in head generative models. Specifically, we design a dual-branch architecture that separately models lighting-invariant head attributes and physically grounded rendering components. A progressive disentanglement training is employed to gradually inject head appearance priors into the generative architecture, supervised by multi-view images captured under controlled light conditions with a light stage setup. We further introduce a distillation strategy to generate high-quality normals for realistic rendering. Experiments demonstrate that our method preserves high-quality generation and real-time rendering, while simultaneously supporting explicit lighting and viewpoint editing. We will publicly release our code and dataset.
[216] MagicFight: Personalized Martial Arts Combat Video Generation
Jiancheng Huang,Mingfu Yan,Songyan Chen,Yi Huang,Shifeng Chen
Main category: cs.CV
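TL;DR: This paper introduces a new task, personalized martial arts combat video generation, to address the shortcomings of existing single-person generation models in two-person interaction scenarios, especially martial arts combat. The authors propose MagicFight and build KungFu-Fiesta, a dataset created with the Unity game physics engine, to generate high-fidelity two-person combat videos with consistent identities and coherent actions.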
Details
Motivation: Existing text-to-video generation models focus mainly on single-person scenarios (such as dancing) and struggle with identity confusion, anomalous limbs, and action mismatches in two-person interactions, especially in complex martial arts combat; a generation task and model designed for two-person interaction are therefore needed.
Method: The MagicFight model is optimized for two-person martial arts combat; the dedicated KungFu-Fiesta dataset, built with the Unity game physics engine, contains diverse 3D characters, moves, and scenes; existing generative models are adapted to maintain individual identity consistency and action coordination.
Result: The method generates high-quality, high-fidelity personalized two-person martial arts combat videos and outperforms existing methods in identity preservation, limb plausibility, and action synchronization; KungFu-Fiesta is released as the first dataset focused on two-person martial arts interaction.
Conclusion: The work fills a gap in personalized two-person interaction video generation, offering new ideas and a data foundation for video generation in complex interactive scenes and advancing interactive video content creation.
Abstract: Amid the surge in generic text-to-video generation, the field of personalized human video generation has witnessed notable advancements, primarily concentrated on single-person scenarios. However, to our knowledge, the domain of two-person interactions, particularly in the context of martial arts combat, remains uncharted. We identify a significant gap: existing models for single-person dancing generation prove insufficient for capturing the subtleties and complexities of two engaged fighters, resulting in challenges such as identity confusion, anomalous limbs, and action mismatches. To address this, we introduce a pioneering new task, Personalized Martial Arts Combat Video Generation. Our approach, MagicFight, is specifically crafted to overcome these hurdles. Given this pioneering task, we face a lack of appropriate datasets. Thus, we generate a bespoke dataset using the game physics engine Unity, meticulously crafting a multitude of 3D characters, martial arts moves, and scenes designed to represent the diversity of combat. MagicFight refines and adapts existing models and strategies to generate high-fidelity two-person combat videos that maintain individual identities and ensure seamless, coherent action sequences, thereby laying the groundwork for future innovations in the realm of interactive video content creation.
Website: https://MingfuYAN.github.io/MagicFight/
Dataset: https://huggingface.co/datasets/MingfuYAN/KungFu-Fiesta
[217] Car Drag Coefficient Prediction from 3D Point Clouds Using a Slice-Based Surrogate Model
Utkarsh Singh,Absaar Ali,Adarsh Roy
Main category: cs.CV
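TL;DR: This paper proposes a lightweight machine learning surrogate model that processes a 3D vehicle geometry as a sequence of slices to predict the aerodynamic drag coefficient (Cd) quickly and accurately, showing high accuracy and low inference time on the DrivAerNet++ dataset.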
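The slicing step behind this approach can be sketched in a few lines, assuming the stream-wise direction is the x axis and the slices are equal-width bins; both choices, along with the `slice_point_cloud` name, are illustrative assumptions rather than details taken from the paper. The per-slice encoder and the sequence model are not shown.

```python
def slice_point_cloud(points, num_slices, axis=0):
    """Split 3D points into an ordered list of 2D cross-sections along one axis.

    points: list of (x, y, z) tuples. Returns num_slices lists, ordered from the
    smallest to the largest coordinate on `axis`, each holding the points'
    remaining two coordinates.
    """
    coords = [p[axis] for p in points]
    lo, hi = min(coords), max(coords)
    # Fall back to width 1.0 when all points share the same coordinate.
    width = (hi - lo) / num_slices or 1.0
    keep = [i for i in range(3) if i != axis]
    slices = [[] for _ in range(num_slices)]
    for p in points:
        # Clamp so the point at the far end lands in the last slice.
        k = min(int((p[axis] - lo) / width), num_slices - 1)
        slices[k].append((p[keep[0]], p[keep[1]]))
    return slices


# Toy cloud: four points spread over x in [0, 2], cut into two slices.
cloud = [(0.0, 1, 2), (0.4, 3, 4), (1.0, 5, 6), (2.0, 7, 8)]
slices = slice_point_cloud(cloud, num_slices=2)
print(slices)  # [[(1, 2), (3, 4)], [(5, 6), (7, 8)]]
```

Keeping the slices in stream-wise order is what lets a sequence model read the shape from nose to tail; empty slices are kept as empty lists so the sequence length stays fixed.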
Details
Motivation: Traditional CFD and wind tunnel testing are time- and resource-intensive, and existing machine learning models suffer from high complexity, poor interpretability, or insufficient accuracy, making it hard to support rapid iteration in early automotive design.
Method: The 3D point cloud is decomposed into an ordered sequence of 2D cross-sectional slices along the stream-wise direction; each slice is encoded by a lightweight PointNet2D, and a bidirectional LSTM models the slice sequence to capture longitudinal geometric evolution and predict Cd.
Result: On the DrivAerNet++ dataset, the model reaches R² > 0.9528 and MAE ≈ 6.046 × 10⁻³, with a per-sample inference time of about 0.025 seconds on a consumer-grade GPU.
Conclusion: The method provides a fast, accurate, and interpretable tool for aerodynamic performance prediction, helping to make automotive aerodynamic design more efficient and agile.
Abstract: The automotive industry's pursuit of enhanced fuel economy and performance necessitates efficient aerodynamic design. However, traditional evaluation methods such as computational fluid dynamics (CFD) and wind tunnel testing are resource intensive, hindering rapid iteration in the early design stages. Machine learning-based surrogate models offer a promising alternative, yet many existing approaches suffer from high computational complexity, limited interpretability, or insufficient accuracy for detailed geometric inputs. This paper introduces a novel lightweight surrogate model for the prediction of the aerodynamic drag coefficient (Cd) based on a sequential slice-wise processing of the geometry of the 3D vehicle. Inspired by medical imaging, 3D point clouds of vehicles are decomposed into an ordered sequence of 2D cross-sectional slices along the stream-wise axis. Each slice is encoded by a lightweight PointNet2D module, and the sequence of slice embeddings is processed by a bidirectional LSTM to capture longitudinal geometric evolution. The model, trained and evaluated on the DrivAerNet++ dataset, achieves a high coefficient of determination (R² > 0.9528) and a low mean absolute error (MAE ≈ 6.046 × 10⁻³) in Cd prediction. With an inference time of approximately 0.025 seconds per sample on a consumer-grade GPU, our approach provides fast, accurate, and interpretable aerodynamic feedback, facilitating more agile and informed automotive design exploration.
[218] Remote Sensing Change Detection via Weak Temporal Supervision
Xavier Bou,Elliot Vincent,Gabriele Facciolo,Rafael Grompone von Gioi,Jean-Michel Morel,Thibaud Ehret
Main category: cs.CV
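TL;DR: Proposes a weak temporal supervision strategy that trains change detection models on additional observations of single-temporal remote sensing datasets without new annotations, handling label noise through object-aware change map generation and iterative refinement, with strong results on the FLAIR and IAILD datasets.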
Details
Motivation: Existing semantic change detection methods are limited by the scarcity of annotated data, and synthetic-data methods generalize poorly, so model performance needs to improve without additional manual annotation.
Method: Single-temporal datasets are extended with observations acquired at different times. Bi-temporal images of the same location are assumed to be mostly unchanged, while images from different locations are paired to generate change examples. Object-aware change maps and an iterative refinement mechanism handle the noise in these weak labels.
Result: On extended versions of the FLAIR and IAILD datasets the approach achieves strong zero-shot and low-data performance across benchmarks, and results over large areas of France demonstrate its scalability.
Conclusion: The proposed weak temporal supervision makes effective use of existing single-temporal data for change detection without extra annotation, and it shows good generalization and practical potential.
Abstract: Semantic change detection in remote sensing aims to identify land cover changes between bi-temporal image pairs. Progress in this area has been limited by the scarcity of annotated datasets, as pixel-level annotation is costly and time-consuming. To address this, recent methods leverage synthetic data or generate artificial change pairs, but out-of-domain generalization remains limited. In this work, we introduce a weak temporal supervision strategy that leverages additional temporal observations of existing single-temporal datasets, without requiring any new annotations. Specifically, we extend single-date remote sensing datasets with new observations acquired at different times and train a change detection model by assuming that real bi-temporal pairs mostly contain no change, while pairing images from different locations to generate change examples. To handle the inherent noise in these weak labels, we employ an object-aware change map generation and an iterative refinement process. We validate our approach on extended versions of the FLAIR and IAILD aerial datasets, achieving strong zero-shot and low-data regime performance across different benchmarks. Lastly, we showcase results over large areas in France, highlighting the scalability potential of our method.
[219] Beyond Segmentation: An Oil Spill Change Detection Framework Using Synthetic SAR Imagery
Chenyang Lai,Shuaiyu Chen,Tianjin Huang,Siyang Song,Guangliang Cheng,Chunbo Luo,Zeyu Fu
Main category: cs.CV
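TL;DR: This paper proposes OSCD, a new approach to marine oil spill change detection based on bi-temporal SAR images, and uses the TAHI framework to generate synthetic pre-spill images, which lowers the false positive rate and improves detection accuracy.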
Details
Motivation: Most existing oil spill detection methods segment a single SAR image and struggle to distinguish oil from similar-looking sea-surface features, leading to high false positive rates and poor generalization, especially when data is scarce.
Method: The paper defines the Oil Spill Change Detection (OSCD) task, which detects changes between pre- and post-spill SAR images, and designs the Temporal-Aware Hybrid Inpainting (TAHI) framework, which generates pre-spill SAR images through high-fidelity hybrid inpainting and temporal realism enhancement.
Result: The authors build the first OSCD dataset, and experiments show that OSCD significantly reduces false positives and improves detection accuracy compared with conventional segmentation.
Conclusion: The OSCD change detection framework, which brings in temporal information and is paired with the TAHI generation strategy, offers a new direction for reliable and scalable oil spill monitoring in real-world settings.
Abstract: Marine oil spills are urgent environmental hazards that demand rapid and reliable detection to minimise ecological and economic damage. While Synthetic Aperture Radar (SAR) imagery has become a key tool for large-scale oil spill monitoring, most existing detection methods rely on deep learning-based segmentation applied to single SAR images. These static approaches struggle to distinguish true oil spills from visually similar oceanic features (e.g., biogenic slicks or low-wind zones), leading to high false positive rates and limited generalizability, especially under data-scarce conditions. To overcome these limitations, we introduce Oil Spill Change Detection (OSCD), a new bi-temporal task that focuses on identifying changes between pre- and post-spill SAR images. As real co-registered pre-spill imagery is not always available, we propose the Temporal-Aware Hybrid Inpainting (TAHI) framework, which generates synthetic pre-spill images from post-spill SAR data. TAHI integrates two key components: High-Fidelity Hybrid Inpainting for oil-free reconstruction, and Temporal Realism Enhancement for radiometric and sea-state consistency. Using TAHI, we construct the first OSCD dataset and benchmark several state-of-the-art change detection models. Results show that OSCD significantly reduces false positives and improves detection accuracy compared to conventional segmentation, demonstrating the value of temporally-aware methods for reliable, scalable oil spill monitoring in real-world scenarios.
[220] Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
Romain Vo,Julián Tachella
Main category: cs.CV
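TL;DR: Proposes a domain partitioning strategy and normal operator approximations that allow the forward operator to be embedded in end-to-end reconstruction models for arbitrarily large imaging problems, achieving state-of-the-art performance on 3D X-ray cone-beam CT and 3D multi-coil accelerated MRI with a single GPU for both training and inference.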
Details
Motivation: When existing deep learning methods tackle large-scale imaging inverse problems such as 3D imaging, the global forward operator consumes too much memory to be embedded in the network architecture, which limits performance.
Method: A domain partitioning strategy and normal operator approximations let the forward operator be integrated into the network, supporting end-to-end training while reducing memory requirements.
Result: The method reaches state-of-the-art reconstruction performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, using a single GPU for training and inference.
Conclusion: The method removes the memory bottleneck of large-scale imaging problems and offers a scalable way to embed physical models in deep networks.
Abstract: Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.
[221] BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models
Sunny Gupta,Shounak Das,Amit Sethi
Main category: cs.CV
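TL;DR: Proposes a bilateral prompt optimization framework (BiPrompt) that mitigates reliance on spurious features in both the visual and textual modalities at test time, improving the causal reasoning and robustness of vision-language models.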
Details
Motivation: Existing debiasing methods usually handle only one modality, which gives insufficient robustness and unstable adaptation under distribution shift; models such as CLIP remain vulnerable to spurious correlations across modalities.
Method: BiPrompt has two sides. On the visual side, structured attention-guided erasure suppresses background activations and enforces prediction consistency between causal and spurious regions. On the textual side, a learnable balanced prompt normalization aligns class embeddings toward an isotropic semantic space. Together they minimize the conditional mutual information between predictions and spurious cues.
Result: On real-world and synthetic bias benchmarks, BiPrompt outperforms existing test-time debiasing methods in both average and worst-group accuracy.
Conclusion: Through bilateral joint optimization, BiPrompt achieves lightweight and effective vision-language debiasing without retraining or domain supervision, moving models toward causal, domain-invariant, trustworthy reasoning.
Abstract: Vision-language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality, either visual or textual, leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules jointly minimize conditional mutual information between spurious cues and predictions, steering the model toward causal, domain-invariant reasoning without retraining or domain supervision. Extensive evaluations on real-world and synthetic bias benchmarks demonstrate consistent improvements in both average and worst-group accuracies over prior test-time debiasing methods, establishing a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation.
[222] Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32
Oliver Custance,Saad Khan,Simon Parkinson
Main category: cs.CV
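TL;DR: This study systematically evaluates six signal separation methods for multi-person gait identification with low-cost ESP32 WiFi sensors and finds that all of them reach similarly low accuracy (45-56%) with no significant differences, indicating that the problem stems mainly from hardware limits rather than algorithms.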
Details
Motivation: Existing work focuses mostly on single-person gait identification, while performance in multi-person settings is unclear; it is important to determine whether the bottleneck lies in the algorithms or in the hardware itself.
Method: Six signal separation methods (FastICA, SOBI, PCA, NMF, Wavelet, Tensor Decomposition) are tested with commodity ESP32 devices across seven scenarios with 1 to 10 people, and new diagnostic metrics (intra-subject variability, inter-subject distinguishability, performance degradation rate) are introduced for the analysis.
Result: All methods achieve 45%-56% accuracy with a standard deviation of 3.74% and no statistically significant difference between them (p > 0.05). The best method, NMF, reaches only 56%. As the number of people grows, intra-subject variability is high, inter-subject distinguishability is low, and performance degrades severely.
Conclusion: Commodity ESP32 sensors do not provide enough signal quality for reliable multi-person gait separation; the bottleneck is the hardware rather than the algorithm design.
Abstract: WiFi Channel State Information (CSI) has shown promise for single-person gait identification, with numerous studies reporting high accuracy. However, multi-person identification remains largely unexplored, with the limited existing work relying on complex, expensive setups requiring modified firmware. A critical question remains unanswered: is poor multi-person performance an algorithmic limitation or a fundamental hardware constraint? We systematically evaluate six diverse signal separation methods (FastICA, SOBI, PCA, NMF, Wavelet, Tensor Decomposition) across seven scenarios with 1-10 people using commodity ESP32 WiFi sensors, a simple, low-cost, off-the-shelf solution. Through novel diagnostic metrics (intra-subject variability, inter-subject distinguishability, performance degradation rate), we reveal that all methods achieve similarly low accuracy (45-56%, σ = 3.74%) with statistically insignificant differences (p > 0.05). Even the best-performing method, NMF, achieves only 56% accuracy. Our analysis reveals high intra-subject variability, low inter-subject distinguishability, and severe performance degradation as person count increases, indicating that commodity ESP32 sensors cannot provide sufficient signal quality for reliable multi-person separation.
[223] QuIC: A Quantum-Inspired Interaction Classifier for Revitalizing Shallow CNNs in Fine-Grained Recognition
Cheng Ying Wu,Yen Jui Chang
Main category: cs.CV
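TL;DR: Proposes QuIC, a quantum-inspired interaction classifier for fine-grained visual classification that improves shallow networks by modeling second-order feature covariance, combining efficiency with accuracy.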
Details
Motivation: Shallow networks perform poorly on fine-grained classification because standard global average pooling discards high-order feature interactions, while existing bilinear methods suffer from high dimensionality and unstable training.
Method: Inspired by quantum mechanics, feature channels are modeled as quantum states and second-order feature covariance is captured through a learnable observable operator. The design is a lightweight plug-and-play module that supports stable end-to-end training.
Result: QuIC raises the Top-1 accuracy of VGG16 by nearly 20% and outperforms advanced attention mechanisms such as SE-Block on ResNet18; t-SNE visualization shows tighter intra-class clustering and better fine-grained discrimination.
Conclusion: QuIC narrows the performance gap of shallow networks in fine-grained classification and offers a feasible route to efficient, accurate models on edge devices.
Abstract: Deploying deep learning models for Fine-Grained Visual Classification (FGVC) on resource-constrained edge devices remains a significant challenge. While deep architectures achieve high accuracy on benchmarks like CUB-200-2011, their computational cost is often prohibitive. Conversely, shallow networks (e.g., AlexNet, VGG) offer efficiency but fail to distinguish visually similar sub-categories. This is because standard Global Average Pooling (GAP) heads capture only first-order statistics, missing the subtle high-order feature interactions required for FGVC. While Bilinear CNNs address this, they suffer from high feature dimensionality and instability during training. To bridge this gap, we propose the Quantum-inspired Interaction Classifier (QuIC). Drawing inspiration from quantum mechanics, QuIC models feature channels as interacting quantum states and captures second-order feature covariance via a learnable observable operator. Designed as a lightweight, plug-and-play module, QuIC supports stable, single-stage end-to-end training without exploding feature dimensions. Experimental results demonstrate that QuIC significantly revitalizes shallow backbones: it boosts the Top-1 accuracy of VGG16 by nearly 20% and outperforms state-of-the-art attention mechanisms (SE-Block) on ResNet18. Qualitative analysis, including t-SNE visualization, further confirms that QuIC resolves ambiguous cases by explicitly attending to fine-grained discriminative features and enforcing compact intra-class clustering.
[224] Mind the Gap: Continuous Magnification Sampling for Pathology Foundation Models
Alexander Möllers,Julius Hense,Florian Schulz,Timo Milbich,Maximilian Alber,Lukas Ruff
Main category: cs.CV
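TL;DR: This paper studies how pathology foundation models perform across magnifications, proposes continuous magnification sampling and optimized sampling distributions to improve cross-scale representation quality, and builds new benchmarks for evaluation.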
Details
Motivation: How existing pathology foundation models differ in performance across magnifications is unclear, and discrete uniform sampling during training degrades performance at intermediate magnifications, so the effect of sampling strategy needs systematic analysis.
Method: Magnification sampling is modeled as a multi-source domain adaptation problem, with a theoretical framework for analyzing the trade-offs. The paper introduces continuous sampling and optimized distributions, and builds two new benchmarks, TCGA-MS and BRACS-MS, for evaluation.
Result: Continuous sampling improves balanced classification accuracy at intermediate magnifications by up to 4 percentage points over discrete sampling, and optimized distributions improve performance further. Magnification is found to be a primary driver of performance variation across models.
Conclusion: Continuous magnification sampling and optimized distributions improve the stability and performance of pathology foundation models across scales and point the way toward models that are reliable across magnifications.
Abstract: In histopathology, pathologists examine both tissue architecture at low magnification and fine-grained morphology at high magnification. Yet, the performance of pathology foundation models across magnifications and the effect of magnification sampling during training remain poorly understood. We model magnification sampling as a multi-source domain adaptation problem and develop a simple theoretical framework that reveals systematic trade-offs between sampling strategies. We show that the widely used discrete uniform sampling of magnifications (0.25, 0.5, 1.0, 2.0 mpp) leads to degradation at intermediate magnifications. We introduce continuous magnification sampling, which removes gaps in magnification coverage while preserving performance at standard scales. Further, we derive sampling distributions that optimize representation quality across magnification scales. To evaluate these strategies, we introduce two new benchmarks (TCGA-MS, BRACS-MS) with appropriate metrics. Our experiments show that continuous sampling substantially improves over discrete sampling at intermediate magnifications, with gains of up to 4 percentage points in balanced classification accuracy, and that optimized distributions can further improve performance. Finally, we evaluate current histopathology foundation models, finding that magnification is a primary driver of performance variation across models. Our work paves the way towards future pathology foundation models that perform reliably across magnifications.
[225] Parameter-Efficient Domain Adaption for CSI Crowd-Counting via Self-Supervised Learning with Adapter Modules
Oliver Custance,Saad Khan,Simon Parkinson,Quan Z. Sheng
Main category: cs.CV
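TL;DR: Proposes a two-stage framework built on CSI-ResNet-A and Adapter modules that uses self-supervised contrastive learning and efficient fine-tuning for cross-domain, device-free crowd counting, with strong generalization and few updated parameters.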
Details
Motivation: WiFi CSI crowd-counting models generalize poorly from one environment to another (domain shift), which undermines practical deployment.
Method: A two-stage framework. First, CSI-ResNet-A is pre-trained with self-supervised contrastive learning to learn domain-invariant features, and lightweight Adapter modules are added for efficient fine-tuning. Then a stateful counting machine processes the event sequence to produce a stable occupancy estimate.
Result: MAE as low as 0.44 in the 10-shot unsupervised setting; a near-perfect score on the newly proposed Generalisation Index (GI); 98.8% accuracy on the WiAR benchmark, a new state of the art. Ablations show that adapter fine-tuning comes close to full fine-tuning (98.84% vs. 99.67%) while training 97.2% fewer parameters.
Conclusion: The framework balances high performance, robustness, and low training cost, providing a practical solution for scalable, privacy-preserving IoT sensing systems in real-world settings.
Abstract: Device-free crowd-counting using WiFi Channel State Information (CSI) is a key enabling technology for a new generation of privacy-preserving Internet of Things (IoT) applications. However, practical deployment is severely hampered by the domain shift problem, where models trained in one environment fail to generalise to another. To overcome this, we propose a novel two-stage framework centred on a CSI-ResNet-A architecture. This model is pre-trained via self-supervised contrastive learning to learn domain-invariant representations and leverages lightweight Adapter modules for highly efficient fine-tuning. The resulting event sequence is then processed by a stateful counting machine to produce a final, stable occupancy estimate. We validate our framework extensively. On our WiFlow dataset, our unsupervised approach excels in a 10-shot learning scenario, achieving a final Mean Absolute Error (MAE) of just 0.44, a task where supervised baselines fail. To formally quantify robustness, we introduce the Generalisation Index (GI), on which our model scores near-perfectly, confirming its ability to generalise. Furthermore, our framework sets a new state of the art on the public WiAR benchmark with 98.8% accuracy. Our ablation studies reveal the core strength of our design: adapter-based fine-tuning achieves performance within 1% of a full fine-tune (98.84% vs. 99.67%) while training 97.2% fewer parameters.
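To make the adapter-based fine-tuning idea concrete, here is a minimal NumPy sketch of a bottleneck adapter with a residual connection. It is not the CSI-ResNet-A implementation: the layer sizes, the ReLU bottleneck, the zero-initialised up-projection, and the names `BottleneckAdapter` and `trainable_fraction` are illustrative assumptions.

```python
import numpy as np


class BottleneckAdapter:
    """Hypothetical bottleneck adapter: x + up(relu(down(x)))."""

    def __init__(self, dim: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Down-projection gets small random weights.
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Up-projection starts at zero, so the adapter is initially an identity map.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        hidden = np.maximum(x @ self.w_down + self.b_down, 0.0)  # ReLU
        return x + hidden @ self.w_up + self.b_up  # residual connection

    def num_params(self) -> int:
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))


def trainable_fraction(adapter_params: int, backbone_params: int) -> float:
    """Share of all parameters that is updated when only adapters are trained."""
    return adapter_params / (adapter_params + backbone_params)


# Assumed sizes: 256-dimensional features, 16-dimensional bottleneck.
adapter = BottleneckAdapter(dim=256, bottleneck=16)
features = np.random.default_rng(1).normal(size=(4, 256))
adapted = adapter(features)
```

With the backbone frozen, only the adapter's parameters receive gradients, which is how a large reduction in trained parameters can arise; the exact 97.2% figure depends on the authors' architecture and is not reproduced by this sketch.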
Our work provides a practical and scalable solution for developing robust sensing systems ready for real-world IoT deployments.
[226] NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
Huichao Zhang,Liao Qu,Yiheng Liu,Hang Chen,Yangyang Song,Yongsheng Dong,Shikun Sun,Xian Li,Xu Wang,Yi Jiang,Hu Ye,Bo Chen,Yiming Gao,Peng Liu,Akide Liu,Zhipeng Yang,Qili Deng,Linjie Xing,Jiyang Liu,Zhao Wang,Yang Zhou,Mingcong Liu,Yi Zhang,Qian He,Xiwei Hu,Zhongqi Qi,Jie Shao,Zhiye Fu,Shuai Wang,Fangmin Chen,Xuezhi Chai,Zhihua Wu,Yitong Wang,Zehuan Yuan,Daniel K. Du,Xinglong Wu
Main category: cs.CV
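TL;DR: NextFlow is a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens that delivers efficient multimodal understanding and generation, including image editing, interleaved content, and video generation, with state-of-the-art visual quality among unified models.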
Details
Motivation: Text and images differ in nature, since text is strictly sequential while images are hierarchical, so a method is needed that keeps next-token prediction for text yet suits multi-scale image generation.
Method: A unified autoregressive architecture performs next-token prediction for text and next-scale prediction for images. A prefix-tuning strategy is introduced for reinforcement learning, and a robust training recipe addresses the instabilities of multi-scale generation.
Result: NextFlow generates 1024x1024 images in 5 seconds, orders of magnitude faster than comparable autoregressive models, and rivals specialized diffusion models in visual quality.
Conclusion: With a unified architecture and a new generation strategy, NextFlow combines text and image generation and shows strong multimodal performance.
Abstract: We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
[227] Seeing the Unseen: Zooming in the Dark with Event Cameras
Dachun Kai,Zeyu Xiao,Huyue Zhu,Jiaxiao Wang,Yueyi Zhang,Xiaoyan Sun
Main category: cs.CV
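TL;DR: This paper proposes RetinexEVSR, the first event-driven low-light video super-resolution framework, which uses high-contrast event signals and Retinex priors to improve video quality in low light.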
Details
Motivation: Existing low-light video super-resolution methods struggle to recover detail because of low contrast and insufficient high-frequency information, so a more effective cross-modal fusion strategy is needed.
Method: A bidirectional cross-modal fusion strategy with two modules: an illumination-guided event enhancement module that progressively refines event features using illumination maps from the Retinex model, and an event-guided reflectance enhancement module that recovers reflectance details through a multi-scale mechanism.
Result: State-of-the-art performance on three datasets, with gains of up to 2.95 dB on the SDSD benchmark and a 65% reduction in runtime.
Conclusion: RetinexEVSR combines event data with Retinex priors effectively and markedly improves both the quality and the efficiency of low-light video super-resolution.
Abstract: This paper addresses low-light video super-resolution (LVSR), aiming to restore high-resolution videos from low-light, low-resolution (LR) inputs. Existing LVSR methods often struggle to recover fine details due to limited contrast and insufficient high-frequency information. To overcome these challenges, we present RetinexEVSR, the first event-driven LVSR framework that leverages high-contrast event signals and Retinex-inspired priors to enhance video quality under low-light scenarios. Unlike previous approaches that directly fuse degraded signals, RetinexEVSR introduces a novel bidirectional cross-modal fusion strategy to extract and integrate meaningful cues from noisy event data and degraded RGB frames. Specifically, an illumination-guided event enhancement module is designed to progressively refine event features using illumination maps derived from the Retinex model, thereby suppressing low-light artifacts while preserving high-contrast details. Furthermore, we propose an event-guided reflectance enhancement module that utilizes the enhanced event features to dynamically recover reflectance details via a multi-scale fusion mechanism. Experimental results show that our RetinexEVSR achieves state-of-the-art performance on three datasets. Notably, on the SDSD benchmark, our method achieves gains of up to 2.95 dB while reducing runtime by 65% compared to prior event-based methods. Code: https://github.com/DachunKai/RetinexEVSR.
[228] Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion
Binglei Li,Mengping Yang,Zhiyu Tan,Junping Zhang,Hao Li
Main category: cs.CV
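TL;DR: This paper presents a systematic pipeline for analyzing what each block in MMDiT models does, showing that semantic information appears in early blocks while details are rendered in later ones, and that disabling text conditions is more disruptive than removing specific blocks. Building on this, the authors propose training-free strategies that improve text alignment, editing precision, and inference speed, with clear gains on several benchmarks.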
Details
Motivation: Although MMDiT models have driven breakthroughs in text-to-image generation, how their internal blocks interact with text conditions remains unclear and lacks systematic study.
Method: A systematic analysis pipeline probes each block's function and its interaction with text conditions by removing, disabling, and enhancing textual hidden-states at different blocks.
Result: Semantic information appears in earlier blocks and details are generated later; removing blocks is less disruptive than disabling text conditions; selectively enhancing text conditions improves semantic attributes. The resulting training-free strategies raise T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% on SD3.5.
Conclusion: The study deepens understanding of how MMDiT models work, and the training-free methods improve text alignment, editing capability, and inference efficiency without sacrificing generation quality, offering new directions for further improvement.
Abstract: Recent breakthroughs in transformer-based diffusion models, particularly with Multimodal Diffusion Transformers (MMDiT) driven models like FLUX and Qwen Image, have facilitated thrilling experiences in text-to-image generation and editing. To understand the internal mechanism of MMDiT-based models, existing methods have tried to analyze the effect of specific components like positional encoding and attention layers. Yet, a comprehensive understanding of how different blocks and their interactions with textual conditions contribute to the synthesis process remains elusive. In this paper, we first develop a systematic pipeline to comprehensively investigate each block's functionality by removing, disabling and enhancing textual hidden-states at corresponding blocks. Our analysis reveals that 1) semantic information appears in earlier blocks and finer details are rendered in later blocks, 2) removing specific blocks is usually less disruptive than disabling text conditions, and 3) enhancing textual conditions in selective blocks improves semantic attributes. Building on these observations, we further propose novel training-free strategies for improved text alignment, precise editing, and acceleration. Extensive experiments demonstrate that our method outperforms various baselines and remains flexible across text-to-image generation, image editing, and inference acceleration. Our method improves T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% on SD3.5, without sacrificing synthesis quality.
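For readers comparing the two reported benchmark improvements, the short sketch below works out the absolute gain in percentage points and the relative gain. The scores are the ones quoted in the abstract; the helper name `gains` is a hypothetical choice made for this illustration.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Scores on SD3.5 as quoted in the abstract (before, after).
reported = {"T2I-Combench++": (56.92, 63.00), "GenEval": (66.42, 71.63)}

for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

This gives roughly +6.08 points (about 10.7% relative) on T2I-Combench++ and +5.21 points (about 7.8% relative) on GenEval.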
These results advance understanding of MMDiT models and provide valuable insights to unlock new possibilities for further improvements.
[229] Prior-Guided DETR for Ultrasound Nodule Detection
Jingjing Wang,Zhuo Xiao,Xinning Yao,Bo Liu,Lijuan Niu,Xiangzhi Bai,Fugen Zhou
Main category: cs.CV
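TL;DR: Proposes a prior-guided DETR framework for ultrasound nodule detection that injects geometric and structural priors at multiple network stages, improving detection of irregular, blurry-boundary, and multi-scale nodules and outperforming 18 existing methods on several thyroid and breast ultrasound datasets.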
Details
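Motivation: Ultrasound nodule detection is challenging because of irregular nodule shapes, blurred boundaries, large scale variation, and speckle noise, all of which affect early cancer diagnosis. Existing methods rely mainly on purely data-driven feature learning and make little use of medical prior knowledge.
Method: A prior-guided DETR framework with three parts: 1) a Spatially-adaptive Deformable FFN with Prior Regularization (SDFPR) embedded in the CNN backbone injects geometric priors to stabilize feature extraction for irregular nodules; 2) a Multi-scale Spatial-Frequency Feature Mixer (MSFFM) obtains structural priors by combining spatial-domain modeling (contour continuity and boundary cues) with frequency-domain modeling (global morphology and noise suppression); 3) a Dense Feature Interaction (DFI) mechanism propagates and exploits the prior-modulated features across encoder layers to guide decoder query refinement.
Result: On two clinical thyroid ultrasound datasets (Thyroid I and Thyroid II) and two public benchmarks (TN3K and BUSI), the method achieves higher detection accuracy than 18 compared methods, particularly for morphologically complex nodules.
Conclusion: By fusing geometric and structural priors at multiple levels, the framework improves detection of complex nodules in ultrasound images and offers a new approach to modeling prior knowledge in medical image analysis.
Abstract: Accurate detection of ultrasound nodules is essential for the early diagnosis and treatment of thyroid and breast cancers. However, this task remains challenging due to irregular nodule shapes, indistinct boundaries, substantial scale variations, and the presence of speckle noise that degrades structural visibility. To address these challenges, we propose a prior-guided DETR framework specifically designed for ultrasound nodule detection. Instead of relying on purely data-driven feature learning, the proposed framework progressively incorporates different kinds of prior knowledge at multiple stages of the network. First, a Spatially-adaptive Deformable FFN with Prior Regularization (SDFPR) is embedded into the CNN backbone to inject geometric priors into deformable sampling, stabilizing feature extraction for irregular and blurred nodules. Second, a Multi-scale Spatial-Frequency Feature Mixer (MSFFM) is designed to extract multi-scale structural priors, where spatial-domain processing emphasizes contour continuity and boundary cues, while frequency-domain modeling captures global morphology and suppresses speckle noise. Furthermore, a Dense Feature Interaction (DFI) mechanism propagates and exploits these prior-modulated features across all encoder layers, enabling the decoder to enhance query refinement under consistent geometric and structural guidance.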
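The claim that frequency-domain modeling keeps global morphology while suppressing fine-grained noise can be illustrated with a plain FFT low-pass filter. This is only a sketch of the general principle, not the MSFFM module: the function name `low_pass`, the hard circular mask, the 0.25 cycles-per-pixel cutoff, and the toy checkerboard standing in for speckle are all assumptions.

```python
import numpy as np


def low_pass(image: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Keep only spatial frequencies with radius <= cutoff (cycles per pixel)."""
    fy = np.fft.fftfreq(image.shape[0])[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(image.shape[1])[None, :]  # horizontal frequencies
    mask = np.sqrt(fx**2 + fy**2) <= cutoff       # hard circular low-pass mask
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * mask))


# Toy input: a flat region (coarse structure) plus a pixel-level checkerboard
# (a stand-in for high-frequency noise; real speckle is not this regular).
rows, cols = np.indices((8, 8))
smooth = np.ones((8, 8))
checker = 0.5 * (-1.0) ** (rows + cols)
filtered = low_pass(smooth + checker)
```

The checkerboard sits at the highest representable frequency, so the mask removes it and `filtered` returns to the flat region. A learned module would typically replace the fixed mask with trainable weights.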
Experiments conducted on two clinically collected thyroid ultrasound datasets (Thyroid I and Thyroid II) and two public benchmarks (TN3K and BUSI) for thyroid and breast nodules demonstrate that the proposed method achieves superior accuracy compared with 18 detection methods, particularly in detecting morphologically complex nodules. The source code is publicly available at https://github.com/wjj1wjj/Ultrasound-DETR.
[230] FMVP: Masked Flow Matching for Adversarial Video Purification
Duoxun Tang,Xueyi Zhang,Chak Hin Wang,Xi Xiao,Dasen Dai,Xinhang Jiang,Wentao Shi,Rui Li,Qing Li
Main category: cs.CV
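TL;DR: This paper proposes FMVP, a flow-matching-based adversarial video purification method that breaks up adversarial structure with a masking strategy and recovers clean video using conditional flow matching and a frequency-gated loss, substantially improving the robustness of video recognition models under a range of attacks.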
Details
Motivation: Existing diffusion models for adversarial video purification suffer from inefficient sampling and curved trajectories, and they struggle to recover video content corrupted by subtle perturbations, so a more efficient and physically grounded purification mechanism is needed.
Method: FMVP uses a masking strategy to break global adversarial structure, reconstructs video dynamics with Conditional Flow Matching (CFM) and an inpainting objective, and adds a Frequency-Gated Loss (FGL) that suppresses high-frequency adversarial residuals while preserving low-frequency fidelity. Attack-Aware and Generalist training paradigms handle known and unknown attacks respectively.
Result: On UCF-101 and HMDB-51, FMVP achieves robust accuracy above 87% under PGD and above 89% under CW attacks, outperforming DiffPure, DP, TS, and FlowPure. It is also more robust to the adaptive attack DiffHammer and works as a zero-shot adversarial detector, with detection accuracy of 98% for PGD and 79% for CW.
Conclusion: By physically breaking adversarial structure and combining flow matching with a frequency-aware loss, FMVP delivers efficient and robust adversarial video purification across attacks, together with zero-shot detection capability.
Abstract: Video recognition models remain vulnerable to adversarial attacks, while existing diffusion-based purification methods suffer from inefficient sampling and curved trajectories. Directly regressing clean videos from adversarial inputs often fails to recover faithful content due to the subtle nature of perturbations; this necessitates physically shattering the adversarial structure. Therefore, we propose Flow Matching for Adversarial Video Purification (FMVP). FMVP physically shatters global adversarial structures via a masking strategy and reconstructs clean video dynamics using Conditional Flow Matching (CFM) with an inpainting objective. To further decouple semantic content from adversarial noise, we design a Frequency-Gated Loss (FGL) that explicitly suppresses high-frequency adversarial residuals while preserving low-frequency fidelity. We design Attack-Aware and Generalist training paradigms to handle known and unknown threats, respectively. Extensive experiments on UCF-101 and HMDB-51 demonstrate that FMVP outperforms state-of-the-art methods (DiffPure, Defense Patterns (DP), Temporal Shuffling (TS) and FlowPure), achieving robust accuracy exceeding 87% against PGD and 89% against CW attacks. Furthermore, FMVP demonstrates superior robustness against adaptive attacks (DiffHammer) and functions as a zero-shot adversarial detector, attaining detection accuracies of 98% for PGD and 79% for highly imperceptible CW attacks.
[231] VIBE: Visual Instruction Based Editor
Grigorii Alekseenko,Aleksandr Gordeev,Irina Tolstykh,Bulat Suleimanov,Vladimir Dokholyan,Georgii Fedorov,Sergey Yakubson,Aleksandra Tsybina,Mikhail Chernyshov,Maksim Kuprashevich
Main category: cs.CV
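TL;DR: This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a 2B-parameter Qwen3-VL model to guide editing and the 1.6B-parameter Sana1.5 diffusion model to generate images, keeping quality high while sharply reducing compute cost.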
Details
Motivation: Existing diffusion-based image editing methods are usually large and computationally expensive, which limits deployment in resource-constrained settings, and most open-source solutions fall short of real-world quality. An efficient, high-quality lightweight editing system is therefore needed.
Method: Qwen3-VL serves as the vision-language guide that interprets the editing instruction and produces an intermediate representation, and the lightweight diffusion model Sana1.5 generates the image. Architecture, data processing, and training configuration are tuned for efficient low-resource inference and consistency with the source image.
Result: On the ImgEdit and GEdit benchmarks the method matches or exceeds models with several times as many parameters, and it is especially strong on tasks that must preserve input content, such as attribute adjustment, object removal, background edits, and targeted replacement. It generates images at up to 2K resolution in about 4 seconds on an NVIDIA H100 in BF16, with total GPU memory under 24 GB.
Conclusion: The paper shows that a small diffusion model paired with a modern vision-language model can deliver high-quality instruction-driven image editing at low inference cost, providing an efficient and accessible option for deployment and research.
Abstract: Instruction-based image editing is among the fastest-developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as attribute adjustment, object removal, background edits, and targeted replacement.
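As a rough sanity check on the reported memory budget, the sketch below estimates the weight storage of the two named models in BF16 (2 bytes per parameter). It assumes all weights are held in BF16 and counts only the 2B and 1.6B parameter figures from the abstract; activations, any autoencoder, and framework overhead are not included, and the helper name `weight_memory_gb` is hypothetical.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


guide_gb = weight_memory_gb(2.0e9)      # 2B-parameter vision-language guide
diffusion_gb = weight_memory_gb(1.6e9)  # 1.6B-parameter diffusion model
total_gb = guide_gb + diffusion_gb
headroom_gb = 24.0 - total_gb           # remainder of the reported 24 GB budget
```

Under these assumptions the weights take about 7.2 GB, leaving roughly 16.8 GB of the 24 GB budget for activations at up to 2K resolution and other runtime state.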
The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.
[232] A Comparative Study of Custom CNNs, Pre-trained Models, and Transfer Learning Across Multiple Visual Datasets
Annoor Sharara Akhand
Main category: cs.CV
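TL;DR: This paper runs a controlled comparison of three convolutional neural network (CNN) paradigms on five real-world image classification datasets, finding that transfer learning consistently gives the strongest predictive performance while a small custom CNN offers a good efficiency-accuracy trade-off when compute and memory are constrained.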
Details
Motivation: To compare systematically how different ways of applying CNNs (training from scratch, using a pre-trained model as a feature extractor, transfer learning) perform in accuracy and efficiency on practical visual recognition tasks.
Method: On five image classification datasets covering areas such as road defects, crop identification, and fruit and leaf disease, the study compares a small CNN trained from scratch, a large pre-trained CNN used as a fixed feature extractor, and transfer learning with partial or full fine-tuning, evaluated by accuracy, macro F1-score, training time per epoch, and parameter count.
Result: Transfer learning achieves the best predictive performance on every dataset. The custom small CNN scores somewhat lower but trains faster with fewer parameters, giving a good efficiency-accuracy trade-off in resource-constrained settings.
Conclusion: Transfer learning is the first choice when performance matters most, while a small custom CNN is better suited to applications with limited compute and memory.
Abstract: Convolutional Neural Networks (CNNs) are a standard approach for visual recognition due to their capacity to learn hierarchical representations from raw pixels. In practice, practitioners often choose among (i) training a compact custom CNN from scratch, (ii) using a large pre-trained CNN as a fixed feature extractor, and (iii) performing transfer learning via partial or full fine-tuning of a pre-trained backbone. This report presents a controlled comparison of these three paradigms across five real-world image classification datasets spanning road-surface defect recognition, agricultural variety identification, fruit/leaf disease recognition, pedestrian walkway encroachment recognition, and unauthorized vehicle recognition. Models are evaluated using accuracy and macro F1-score, complemented by efficiency metrics including training time per epoch and parameter counts. The results show that transfer learning consistently yields the strongest predictive performance, while the custom CNN provides an attractive efficiency-accuracy trade-off, especially when compute and memory budgets are constrained.
[233] SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection
Xiantai Xiang,Guangyao Zhou,Zixiao Wen,Wenshuai Li,Ben Niu,Feng Wang,Lijia Huang,Qiantong Wang,Yuhan Liu,Zongxu Pan,Yuxin Hu
Main category: cs.CV
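TL;DR: This paper proposes SLGNet, a parameter-efficient multimodal object detection framework that combines hierarchical structural priors with language-guided modulation on a frozen ViT foundation model for robust perception of RGB and infrared images, markedly improving detection at night and in complex environments while cutting trainable parameters by about 87%.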
Details
Motivation: Existing adapter-based methods fall short on cross-modal structural consistency, and conventional static fusion mechanisms lack environmental awareness, so performance drops in scenes with large domain gaps such as high-contrast or nighttime conditions.
Method: A Structure-Aware Adapter extracts hierarchical structural representations from both modalities and injects them dynamically into the ViT. A Language-Guided Modulation module uses structured captions generated by a VLM to recalibrate visual features dynamically and strengthen environmental awareness.
Result: State-of-the-art performance on the LLVIP, FLIR, KAIST, and DroneVehicle datasets, with an mAP of 66.1 on LLVIP and about 87% fewer trainable parameters than full fine-tuning.
Conclusion: Through structure-aware and language-guided dynamic fusion, SLGNet improves the robustness and adaptability of multimodal object detection while keeping parameter use efficient, making it suitable for perception in complex, dynamic environments.
Abstract: Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. Specifically, we design a Structure-Aware Adapter to extract hierarchical structural representations from both modalities and dynamically inject them into the ViT to compensate for structural degradation inherent in ViT-based backbones. Furthermore, we propose a Language-Guided Modulation module that exploits VLM-driven structured captions to dynamically recalibrate visual features, thereby endowing the model with robust environmental awareness. Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets demonstrate that SLGNet establishes new state-of-the-art performance.
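One common way to let a text embedding recalibrate visual features is a FiLM-style per-channel scale and shift. The sketch below illustrates that general idea only; the abstract does not describe the internals of the Language-Guided Modulation module, so the formula, the function name `language_guided_modulation`, and all tensor sizes are assumptions.

```python
import numpy as np


def language_guided_modulation(features, caption_embedding, w_scale, w_shift):
    """Recalibrate visual features with a caption-conditioned scale and shift."""
    scale = caption_embedding @ w_scale  # per-channel scale, shape (channels,)
    shift = caption_embedding @ w_shift  # per-channel shift, shape (channels,)
    return features * (1.0 + scale) + shift


rng = np.random.default_rng(0)
tokens, channels, text_dim = 6, 32, 16  # assumed sizes for illustration
features = rng.normal(size=(tokens, channels))
caption_embedding = rng.normal(size=text_dim)

# Zero-initialised projections leave the frozen backbone's features unchanged.
w_zero = np.zeros((text_dim, channels))
unchanged = language_guided_modulation(features, caption_embedding, w_zero, w_zero)

# Non-zero projections rescale each channel according to the caption embedding.
w_scale = rng.normal(0.0, 0.1, size=(text_dim, channels))
modulated = language_guided_modulation(features, caption_embedding, w_scale, w_zero)
```

Starting from zero-initialised projections is a frequent choice when adding modules to a frozen backbone, because training then begins from the pretrained model's behaviour.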
Notably, on the LLVIP benchmark, our method achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning. This confirms SLGNet as a robust and efficient solution for multimodal perception.[234] VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Shikun Sun,Liao Qu,Huichao Zhang,Yiheng Liu,Yangyang Song,Xian Li,Xu Wang,Yi Jiang,Daniel K. Du,Xinglong Wu,Jia Jia
Main category: cs.CV
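TL;DR: Proposes an enhanced Group Relative Policy Optimization framework that addresses asynchronous policy conflicts in reinforcement learning for VAR models through an intermediate reward, dynamic time-step reweighting, and a mask propagation algorithm.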
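For readers unfamiliar with GRPO, the group-relative advantage that the framework builds on can be sketched as follows. This is a minimal illustration of the commonly described normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` constant are assumptions for illustration, and none of the paper's three additions (intermediate reward, time-step reweighting, mask propagation) is shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each rollout's advantage is its reward relative to the group mean,
    scaled by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts scored by a reward model.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because the baseline is the group mean, no separate value network is needed; rollouts that beat their siblings get positive advantages and the rest get negative ones.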
Details
Motivation: VAR models suffer from asynchronous policy conflicts during generation, which make reinforcement learning training unstable and alignment poor.
Method: Introduces a stabilizing intermediate reward, a dynamic time-step reweighting mechanism, and a mask propagation algorithm based on Reward Feedback Learning to improve the GRPO framework.
Result: Significant improvements in sample quality and objective alignment over vanilla GRPO.
Conclusion: The method effectively resolves the optimization difficulties of VAR models under RL, enabling more robust and efficient training.
Abstract: Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.
[235] DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies
Renke Wang,Zhenyu Zhang,Ying Tai,Jian Yang
Main category: cs.CV
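TL;DR: This paper proposes DiffProxy, a framework that uses diffusion-based generative priors to generate multi-view consistent human proxies for recovering human meshes from multi-view images; trained only on synthetic data, it achieves state-of-the-art zero-shot generalization on real-world benchmarks.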
Details
Motivation: Existing methods are limited by imperfect annotations on real datasets and by the domain gap on synthetic data, making it hard to combine precise supervision with real-world generalization.
Method: The DiffProxy framework comprises a multi-conditional mechanism that generates multi-view consistent, pixel-aligned human proxies, a hand refinement module that incorporates visual prompts, and an uncertainty-aware test-time scaling strategy, using the generative priors of diffusion models to connect synthetic training with real-world use.
Result: State-of-the-art performance on five real-world benchmarks, with strong zero-shot generalization particularly in challenging scenarios with occlusions and partial views.
Conclusion: Through diffusion-based generative priors, DiffProxy bridges the gap between training on synthetic data and real-world application, achieving robust human mesh recovery.
Abstract: Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias the models' training, while synthetic data with precise supervision suffers from domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is leveraging the diffusion-based generative priors to bridge the synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from the precise synthetic ground truth and generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization particularly on challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html
[236] TopoLoRA-SAM: Topology-Aware Parameter-Efficient Adaptation of Foundation Segmenters for Thin-Structure and Cross-Domain Binary Semantic Segmentation
Salim Khazem
Main category: cs.CV
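TL;DR: This paper proposes TopoLoRA-SAM, a topology-aware, parameter-efficient adaptation framework for segmentation models that injects LoRA and a lightweight convolutional adapter into a frozen ViT encoder and adds a differentiable clDice loss, outperforming existing methods on several medical and remote sensing image datasets while fine-tuning only 5.2% of the parameters.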
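As background for the LoRA component mentioned above, the low-rank update can be sketched in plain Python. This is a generic illustration of the standard LoRA formulation, y = Wx + (alpha / r) B(Ax), with the pretrained weight W frozen; the function names, shapes, and toy values are assumptions for illustration and are not taken from the TopoLoRA-SAM code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)  # the rank is the number of rows of A
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight (identity, for clarity)
A = [[1.0, 1.0]]               # rank r = 1
B_zero = [[0.0], [0.0]]        # B starts at zero, so the output equals W x
B_trained = [[1.0], [2.0]]     # hypothetical values after some training

y_init = lora_forward(W, A, B_zero, [1.0, 2.0])
y_adapted = lora_forward(W, A, B_trained, [1.0, 2.0])
```

Only A and B are trained, which adds r * (d_in + d_out) parameters per adapted layer instead of d_in * d_out; this is the mechanism behind the small trainable fraction reported in the entry.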
Details
Motivation: Foundation segmentation models such as SAM generalize well zero-shot but are hard to adapt to domain-specific semantic segmentation (e.g., thin structures or noisy modalities), and full fine-tuning is computationally expensive and risks catastrophic forgetting.
Method: TopoLoRA-SAM injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, adds a lightweight spatial convolutional adapter, and optionally supervises with a topology-aware clDice loss, achieving parameter-efficient domain adaptation.
Result: Across five benchmarks (covering retinal vessel, polyp, and SAR image segmentation), TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice while training only 4.9M (5.2%) parameters; on challenging datasets such as CHASE_DB1 it substantially improves accuracy and robustness.
Conclusion: Topology-aware parameter-efficient fine-tuning can match or exceed fully fine-tuned specialist models across diverse complex scenarios, offering an efficient and reliable route for adapting large models to downstream tasks.
Abstract: Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero-shot generalization through large-scale pretraining, but adapting them to domain-specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine-tuning is computationally expensive and risks catastrophic forgetting. We propose TopoLoRA-SAM, a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation. TopoLoRA-SAM injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology-aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir-SEG), and SAR sea/land segmentation (SL-SSDD), comparing against U-Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice across datasets, while training only 5.2% of model parameters (~4.9M). On the challenging CHASE_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology-aware parameter-efficient adaptation can match or exceed fully fine-tuned specialist models. Code is available at: https://github.com/salimkhazem/Seglab.git
[237] InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
Shuai Yuan,Yantai Yang,Xiaotian Yang,Xupeng Zhang,Zhonghao Zhao,Lingming Zhang,Zhipeng Zhang
Main category: cs.CV
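TL;DR: This paper proposes InfiniteVGGT, a causal visual geometry transformer that achieves infinite-horizon streaming 3D geometry understanding through a bounded yet adaptive KV cache, designs a training-free, attention-agnostic pruning strategy to maintain long-term stability, and releases Long3D, the first benchmark supporting continuous evaluation over about 10,000 frames.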
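The idea of a bounded memory that rolls forward with each new frame can be illustrated with a toy cache. Note the substitution: the sketch below evicts the oldest entry first (FIFO), which is a deliberately simple stand-in and not the paper's attention-agnostic pruning strategy; the class name and `max_frames` parameter are hypothetical.

```python
from collections import deque


class RollingKVCache:
    """Toy bounded key/value cache with oldest-first eviction.

    Memory stays constant no matter how many frames are streamed in,
    because deque(maxlen=...) drops the oldest item on overflow.
    """

    def __init__(self, max_frames):
        self.keys = deque(maxlen=max_frames)
        self.values = deque(maxlen=max_frames)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = RollingKVCache(max_frames=3)
for frame_id in range(5):
    # Placeholder strings stand in for per-frame key/value tensors.
    cache.append(frame_id, f"value-{frame_id}")
```

After five frames only the three most recent remain. A smarter pruning rule would change which entries are dropped, but the bounded-size property is the same.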
Details
Motivation: Offline models are unsuitable for real-time systems, while existing streaming architectures either suffer catastrophic drift or cannot remain stable over the long term on infinite-horizon inputs, so a scalable and stable solution is needed.
Method: InfiniteVGGT uses causal attention with a rolling KV cache, combined with a training-free, attention-agnostic pruning strategy that updates information while preserving expressiveness; it is compatible with FlashAttention for efficient streaming inference.
Result: InfiniteVGGT significantly outperforms existing streaming methods on long sequences and shows excellent stability; the new Long3D benchmark (up to about 10,000 frames) provides the first rigorous evaluation of infinite-horizon 3D geometry estimation.
Conclusion: InfiniteVGGT resolves the tension between scalability and long-term stability in streaming 3D visual geometry understanding, providing a viable architecture and evaluation basis for future persistent, large-scale 3D understanding.
Abstract: The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT
[238] Rank-based Geographical Regularization: Revisiting Contrastive Self-Supervised Learning for Multispectral Remote Sensing Imagery
Tom Burgert,Leonard Hackel,Paolo Rota,Begüm Demir
Main category: cs.CV
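TL;DR: This paper proposes GeoRank, a new regularization method for contrastive self-supervised learning that embeds geographical relationships into the feature space by optimizing spherical distances, improving learning on multispectral remote sensing images, and systematically studies key adaptations of contrastive self-supervised learning for this domain.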
Details
Motivation: Multispectral remote sensing images vary substantially across geography and time, so applying existing self-supervised learning methods directly is challenging and learning strategies designed for such data are needed.
Method: GeoRank embeds geographical relationships by directly optimizing spherical distances; the paper also systematically evaluates how data augmentations, dataset size, image size, and temporal views affect contrastive self-supervised learning on multispectral remote sensing images.
Result: Across several contrastive self-supervised algorithms (e.g., BYOL, DINO), GeoRank outperforms or matches existing methods that integrate geographical metadata and adapts well to different task settings.
Conclusion: GeoRank effectively exploits geographical information to improve self-supervised feature learning on multispectral remote sensing images, and the systematic analysis offers practical guidance for method design in this area.
Abstract: Self-supervised learning (SSL) has become a powerful paradigm for learning from large, unlabeled datasets, particularly in computer vision (CV). However, applying SSL to multispectral remote sensing (RS) images presents unique challenges and opportunities due to the geographical and temporal variability of the data. In this paper, we introduce GeoRank, a novel regularization method for contrastive SSL that improves upon prior techniques by directly optimizing spherical distances to embed geographical relationships into the learned feature space. GeoRank outperforms or matches prior methods that integrate geographical metadata and consistently improves diverse contrastive SSL algorithms (e.g., BYOL, DINO). Beyond this, we present a systematic investigation of key adaptations of contrastive SSL for multispectral RS images, including the effectiveness of data augmentations, the impact of dataset cardinality and image size on performance, and the task dependency of temporal views. Code is available at https://github.com/tomburgert/georank.
[239] SortWaste: A Densely Annotated Dataset for Object Detection in Industrial Waste Sorting
Sara Inácio,Hugo Proença,João C. Neves
Main category: cs.CV
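TL;DR: This paper introduces SortWaste, a densely annotated object detection dataset for waste sorting, proposes the ClutterScore metric to assess scene complexity, and benchmarks state-of-the-art object detectors, showing that performance drops significantly in highly cluttered scenes.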
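Two of the complexity proxies the abstract names for ClutterScore (object count and class entropy) are simple to compute from a scene's annotations. The sketch below shows only those two proxies using Shannon entropy in bits; how the paper combines its proxies into a single score is not stated here, so no combined score is computed, and the function name and example labels are hypothetical.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in one scene."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical annotations for two scenes.
uniform_scene = ["plastic", "plastic", "plastic", "plastic"]
mixed_scene = ["plastic", "plastic", "metal", "metal"]

object_count = len(mixed_scene)
entropy_uniform = class_entropy(uniform_scene)
entropy_mixed = class_entropy(mixed_scene)
```

A scene containing a single class has zero entropy, while an even two-class mix has one bit, so higher values indicate a more heterogeneous and presumably harder scene.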
Details
Motivation: Population growth is increasing waste production; manual sorting is inefficient and poses health risks, and existing automated methods struggle with the high variability and visual complexity of real waste streams, largely because real-world datasets are lacking.
Method: The SortWaste dataset is collected from a Material Recovery Facility; the ClutterScore metric quantifies scene complexity through proxies such as object count, class and size entropy, and spatial overlap; several state-of-the-art object detection models are benchmarked.
Result: The plastic-only detection task reaches a promising mAP of 59.7%, but performance drops markedly in highly cluttered scenes, confirming the relationship between ClutterScore and model performance.
Conclusion: SortWaste and ClutterScore provide important tools and evaluation standards for waste detection, expose the limitations of current models in complex scenes, and underline the need for more challenging datasets to advance the field.
Abstract: The increasing production of waste, driven by population growth, has created challenges in managing and recycling materials effectively. Manual waste sorting is a common practice; however, it remains inefficient for handling large-scale waste streams and presents health risks for workers. On the other hand, existing automated sorting approaches still struggle with the high variability, clutter, and visual complexity of real-world waste streams. The lack of real-world datasets for waste sorting is a major reason automated systems for this problem are underdeveloped. Accordingly, we introduce SortWaste, a densely annotated object detection dataset collected from a Material Recovery Facility. Additionally, we contribute to standardizing waste detection in sorting lines by proposing ClutterScore, an objective metric that gauges the scene's hardness level using a set of proxies that affect visual complexity (e.g., object count, class and size entropy, and spatial overlap). In addition to these contributions, we provide an extensive benchmark of state-of-the-art object detection models, detailing their results with respect to the hardness level assessed by the proposed metric. Despite achieving promising results (mAP of 59.7% in the plastic-only detection task), performance significantly decreases in highly cluttered scenes. This highlights the need for novel and more challenging datasets on the topic.
[240] 360DVO: Deep Visual Odometry for Monocular 360-Degree Camera
Xiaopeng Guo,Yinzhe Xu,Huajian Huang,Sai-Kit Yeung
Main category: cs.CV
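TL;DR: Proposes 360DVO, the first deep learning-based monocular omnidirectional visual odometry system, which markedly improves robustness and accuracy on real and synthetic datasets through a distortion-aware spherical feature extractor (DAS-Feat) and a novel omnidirectional differentiable bundle adjustment (ODBA) module.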
Details
Motivation: Existing monocular omnidirectional visual odometry methods rely on handcrafted features or photometric objectives and lack robustness in challenging scenarios such as aggressive motion and varying illumination, so stronger feature representations and a better optimization framework are needed.
Method: A distortion-aware spherical feature extractor (DAS-Feat) adaptively learns distortion-resistant features, and sparse feature patches are used for pose estimation within the newly proposed omnidirectional differentiable bundle adjustment (ODBA) module.
Result: Experiments on a newly collected real-world benchmark and public synthetic datasets (TartanAir V2, 360VO) show that 360DVO improves robustness by 50% and accuracy by 37.5% over state-of-the-art baselines such as 360VO and OpenVSLAM.
Conclusion: 360DVO is the first deep learning-based OVO framework; it significantly outperforms traditional methods across diverse environments and advances omnidirectional visual odometry.
Abstract: Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%. Homepage: https://chris1004336379.github.io/360DVO-homepage
[241] Prithvi-Complimentary Adaptive Fusion Encoder (CAFE): unlocking full-potential for flood inundation mapping
Saurabh Kaushik,Lalit Maurya,Beth Tellman
Main category: cs.CV
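TL;DR: This paper proposes Prithvi-CAFE, a new model that combines the Prithvi Geo-Foundation Model with an attention-equipped CNN branch to better capture local detail in flood mapping, achieving state-of-the-art performance on both the Sen1Flood11 and FloodPlanet datasets.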
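The scores reported for this entry are IoU (intersection over union) values. As a reminder of what the metric measures, here is a minimal sketch for flattened binary masks; the function name, the empty-mask convention, and the toy masks are assumptions for illustration, and the reported numbers appear to be this ratio expressed as a percentage.

```python
def iou(pred, target):
    """Intersection over union for two flattened binary masks (0/1 lists)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred_mask = [1, 1, 0, 0]    # hypothetical predicted flood pixels
true_mask = [1, 0, 1, 0]    # hypothetical ground-truth flood pixels
score = iou(pred_mask, true_mask)
```

Here one pixel is shared and three pixels are flooded in at least one mask, so the score is 1/3.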
Details
Motivation: Existing Geo-Foundation Models (GFMs) struggle to capture critical local details in flood mapping and therefore fail to outperform baselines such as U-Net, which calls for an architecture that fuses local features with global dependencies.
Method: Prithvi-CAFE combines the pretrained Prithvi encoder with a parallel CNN residual branch equipped with Convolutional Attention Modules (CAM), uses adapters for fast fine-tuning, and performs multi-scale, multi-level feature fusion to retain both long-range dependencies and local details.
Result: On the Sen1Flood11 test set, Prithvi-CAFE reaches an IoU of 83.41, ahead of the original Prithvi (82.50) and other major GFMs; on the hold-out test site it reaches 81.37, well above U-Net (70.57) and Prithvi (72.42). It is also best on FloodPlanet with an IoU of 64.70, surpassing U-Net (60.14) and other GFMs.
Conclusion: By fusing the local perception of CNNs with the global modeling of GFMs, Prithvi-CAFE markedly improves remote sensing segmentation tasks that are sensitive to local detail, such as flood mapping, and has broad application potential.
Abstract: Geo-Foundation Models (GFMs) have proven effective in diverse downstream applications, including semantic segmentation, classification, and regression tasks. However, in the case of flood mapping using the Sen1Flood11 dataset as a downstream task, GFMs struggle to outperform the baseline U-Net, highlighting the models' limitation in capturing critical local nuances. To address this, we present the Prithvi-Complementary Adaptive Fusion Encoder (CAFE), which integrates the Prithvi GFM pretrained encoder with a parallel CNN residual branch enhanced by Convolutional Attention Modules (CAM). Prithvi-CAFE enables fast and efficient fine-tuning through adapters in Prithvi and performs multi-scale, multi-level fusion with CNN features, capturing critical local details while preserving long-range dependencies. We achieve state-of-the-art results on two comprehensive flood mapping datasets: Sen1Flood11 and FloodPlanet. On Sen1Flood11 test data, Prithvi-CAFE (IoU 83.41) outperforms the original Prithvi (IoU 82.50) and other major GFMs (TerraMind 82.90, DOFA 81.54, SpectralGPT 81.02). The improvement is even more pronounced on the hold-out test site, where Prithvi-CAFE achieves an IoU of 81.37 compared to the baseline U-Net (70.57) and original Prithvi (72.42). On FloodPlanet, Prithvi-CAFE also surpasses the baseline U-Net and other GFMs, achieving an IoU of 64.70 compared to U-Net (60.14), TerraMind (62.33), DOFA (59.15) and Prithvi 2.0 (61.91). Our proposed simple yet effective Prithvi-CAFE demonstrates strong potential for improving segmentation tasks where multi-channel and multi-modal data provide complementary information and local details are critical. The code is released at https://github.com/Sk-2103/Prithvi-CAFE
[242] Fusion2Print: Deep Flash-Non-Flash Fusion for Contactless Fingerprint Matching
Roja Sahoo,Anoop Namboodiri
Main category: cs.CV
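TL;DR: This paper proposes Fusion2Print (F2P), a new framework that fuses flash and non-flash contactless fingerprint images to improve ridge clarity and recognition performance.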
Details
Motivation: Contactless fingerprint recognition is hygienic and convenient, but its images often have blurred ridges due to illumination variation, skin pigmentation, and specular reflections, and existing single-capture methods struggle to balance noise suppression with detail preservation.
Method: A paired flash-non-flash fingerprint dataset (FNF Database) is constructed, and manual subtraction isolates ridge-preserving signals; a lightweight attention-based fusion network combines the two modalities, and a U-Net enhancement module produces an optimized grayscale image; a deep embedding model with cross-domain compatibility matches contactless and contact-based fingerprints in a unified space.
Result: F2P markedly improves ridge clarity and outperforms single-capture baselines in recognition, reaching AUC=0.999 and EER=1.12%.
Conclusion: Fusion2Print is the first framework to systematically fuse paired flash and non-flash contactless fingerprints; it resolves the trade-off between noise and contrast and delivers accurate fingerprint recognition compatible with multiple capture modes.
Abstract: Contactless fingerprint recognition offers a hygienic and convenient alternative to contact-based systems, enabling rapid acquisition without latent prints, pressure artifacts, or hygiene risks. However, contactless images often show degraded ridge clarity due to illumination variation, subcutaneous skin discoloration, and specular reflections. Flash captures preserve ridge detail but introduce noise, whereas non-flash captures reduce noise but lower ridge contrast. We propose Fusion2Print (F2P), the first framework to systematically capture and fuse paired flash-non-flash contactless fingerprints. We construct a custom paired dataset, FNF Database, and perform manual flash-non-flash subtraction to isolate ridge-preserving signals. A lightweight attention-based fusion network also integrates both modalities, emphasizing informative channels and suppressing noise, and then a U-Net enhancement module produces an optimally weighted grayscale image. Finally, a deep embedding model with cross-domain compatibility generates discriminative and robust representations in a unified embedding space compatible with both contactless and contact-based fingerprints for verification. F2P enhances ridge clarity and achieves superior recognition performance (AUC=0.999, EER=1.12%) over single-capture baselines (Verifinger, DeepPrint).
[243] BEDS: Bayesian Emergent Dissipative Structures
Laurent Caraffa
Main category: cs.CV
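TL;DR: This paper proposes BEDS, a theoretical framework that unifies non-equilibrium thermodynamics, Bayesian inference, information geometry, and machine learning, arguing that learning is essentially the conversion of flux into structure through entropy export, and examines the feasibility of sustainable learning systems both theoretically and in practice.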
Details
Motivation: To uncover the nature of learning in physical, biological, and computational systems, unify core concepts from different disciplines, and provide a theoretical foundation for sustainable artificial intelligence.
Method: Building on Prigogine's theory of dissipative structures, the paper establishes a formal isomorphism between thermodynamic processes and Bayesian updating, derives mathematical constants as fixed points of Bayesian inference, proposes a conjectured analogy between Gödel's incompleteness and thermodynamic constraints, and designs a decentralized network architecture that implements BEDS principles.
Result: The paper derives the mathematical constants e, π, and φ as fixed points of Bayesian inference, proposes a structural analogy between pathologies of formal systems and dissipation deficits, and presents a peer-to-peer network architecture reported to be six orders of magnitude more energy-efficient than existing systems.
Conclusion: Learning is a universal dissipative process whose principles apply across physical, biological, and computational systems, and BEDS offers a unified framework for understanding the natural basis of intelligence and designing efficient AI systems.
Abstract: We present BEDS (Bayesian Emergent Dissipative Structures), a theoretical framework that unifies concepts from non-equilibrium thermodynamics, Bayesian inference, information geometry, and machine learning. The central thesis proposes that learning, across physical, biological, and computational systems, fundamentally constitutes the conversion of flux into structure through entropy export. Building on Prigogine's theory of dissipative structures, we establish a formal isomorphism between thermodynamic processes and Bayesian updating, demonstrating that sustainable learning systems must follow dissipative patterns where crystallized posteriors become priors for subsequent levels of emergence. We derive fundamental mathematical constants (e, π, φ) as fixed points of Bayesian inference under minimal axioms, suggesting these constants emerge necessarily from any system capable of representing and updating uncertainty. Furthermore, we propose a conjecture linking Gödel's incompleteness theorems to thermodynamic constraints, hypothesizing that pathologies of formal systems (incompleteness, undecidability) are structurally analogous to dissipation deficits in physical systems. As practical validation, we present a peer-to-peer network architecture implementing BEDS principles, achieving six orders of magnitude improvement in energy efficiency compared to existing distributed consensus systems while enabling continuous learning. This work bridges fundamental physics, mathematical logic, and practical system design, offering both theoretical insights into the nature of learning and computation, and a concrete pathway toward sustainable artificial intelligence.
[244] Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding
Jingming He,Chongyi Li,Shiqi Wang,Sam Kwong
Main category: cs.CV
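TL;DR: Proposes a joint enhancement framework for 3D semantic Gaussian modeling that couples the semantic and rendering branches, uses an anisotropic 3D Gaussian Chebyshev descriptor and local semantic-shape signals to adaptively adjust the Gaussian distribution, and introduces cross-scene knowledge transfer to improve segmentation accuracy, rendering quality, and efficiency.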
Details
Motivation: Existing methods treat the semantic segmentation and image rendering branches separately, rely only on 2D supervision, and ignore the 3D Gaussian geometry; in addition, Gaussian adaptation strategies depend solely on rendering gradients and perform poorly in textureless regions.
Method: An anisotropic 3D Gaussian Chebyshev descriptor based on the Laplace-Beltrami operator captures fine-grained 3D shape details; local semantic and shape signals adaptively adjust Gaussian allocation and spherical harmonics; a cross-scene knowledge transfer module continuously updates learned shape patterns.
Result: Experiments on multiple datasets show that the method surpasses existing approaches in both semantic segmentation accuracy and rendering quality while maintaining high rendering frame rates.
Conclusion: The proposed joint enhancement framework effectively fuses the semantic and rendering branches, improving the robustness and efficiency of 3D semantic Gaussian modeling, especially in low-texture regions, and converges faster thanks to knowledge transfer.
Abstract: Recent works propose extending 3DGS with semantic feature vectors for simultaneous semantic segmentation and image rendering. However, these methods often treat the semantic and rendering branches separately, relying solely on 2D supervision while ignoring the 3D Gaussian geometry. Moreover, current adaptive strategies adapt the Gaussian set depending solely on rendering gradients, which can be insufficient in subtle or textureless regions. In this work, we propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Firstly, unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor using the Laplace-Beltrami operator to capture fine-grained 3D shape details, thereby distinguishing objects with similar appearances and reducing reliance on potentially noisy 2D guidance. In addition, without relying solely on rendering gradient, we adaptively adjust Gaussian allocation and spherical harmonics with local semantic and shape signals, enhancing rendering efficiency through selective resource allocation. Finally, we employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations without relearning shape information from scratch for each new scene. Experiments on multiple datasets demonstrate improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.
[245] Meta-Learning Guided Pruning for Few-Shot Plant Pathology on Edge Devices
Shahnawaz Alam,Mohammed Mudassir Uddin,Mohammed Kaif Pasha
Main category: cs.CV
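TL;DR: This paper proposes Disease-Aware Channel Importance Scoring (DACIS), which combines pruning with few-shot learning in a three-stage PMP compression pipeline to enable efficient, lightweight plant disease recognition on a Raspberry Pi.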
Details
Motivation: Farmers in remote areas lack laboratories and high-performance computing resources and cannot diagnose plant diseases quickly; meanwhile, deep models are usually large and depend on large amounts of labeled data, which limits their use on edge devices.
Method: Disease-Aware Channel Importance Scoring (DACIS) combines channel importance scoring with few-shot learning and compresses the model through a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline, reducing parameters while retaining disease-sensitive feature channels.
Result: On the PlantVillage and PlantDoc datasets, model size is reduced by 78% while 92.3% of the original accuracy is retained, and inference runs at 7 frames per second on a Raspberry Pi 4, supporting real-time diagnosis in the field.
Conclusion: The method balances model size, accuracy, and inference speed, offering a practical solution for agricultural disease recognition in resource-constrained settings.
Abstract: Farmers in remote areas need quick and reliable methods for identifying plant diseases, yet they often lack access to laboratories or high-performance computing resources. Deep learning models can detect diseases from leaf images with high accuracy, but these models are typically too large and computationally expensive to run on low-cost edge devices such as Raspberry Pi. Furthermore, collecting thousands of labeled disease images for training is both expensive and time-consuming. This paper addresses both challenges by combining neural network pruning (removing unnecessary parts of the model) with few-shot learning, which enables the model to learn from limited examples. This paper proposes Disease-Aware Channel Importance Scoring (DACIS), a method that identifies which parts of the neural network are most important for distinguishing between different plant diseases, integrated into a three-stage Prune-then-Meta-Learn-then-Prune (PMP) pipeline. Experiments on PlantVillage and PlantDoc datasets demonstrate that the proposed approach reduces model size by 78% while maintaining 92.3% of the original accuracy, with the compressed model running at 7 frames per second on a Raspberry Pi 4, making real-time field diagnosis practical for smallholder farmers.
[246] Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Jing Tan,Zhaoyang Zhang,Yantao Shen,Jiarui Cai,Shuo Yang,Jiajun Wu,Wei Xia,Zhuowen Tu,Stefano Soatto
Main category: cs.CV
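TL;DR: Talk2Move is a reinforcement learning-based diffusion framework that spatially transforms objects in a scene from natural language instructions, using Group Relative Policy Optimization and spatial rewards to achieve precise, semantically consistent geometric manipulation without paired supervision.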
Details
Motivation: Existing text-driven image editing methods struggle with object-level geometric transformations (such as translation, rotation, and scaling), mainly because paired data is scarce and pixel-level optimization is limited.
Method: Talk2Move combines reinforcement learning with a diffusion model and uses Group Relative Policy Optimization (GRPO) to generate diverse rollouts, exploring geometric actions from input images and lightweight textual variations; object-centric spatial rewards evaluate displacement, rotation, and scaling, and off-policy step evaluation with active step sampling improves learning efficiency.
Result: Experiments on curated benchmarks show that Talk2Move outperforms existing text-guided editing methods in spatial accuracy and scene coherence, producing more precise, consistent, and semantically faithful object transformations.
Conclusion: Talk2Move addresses text-instructed object-level spatial transformation with a reinforcement learning framework that needs no paired data, achieving interpretable, accurate geometric manipulation and advancing spatial understanding and control in multimodal generation systems.
Abstract: We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
[247] VINO: A Unified Visual Generator with Interleaved OmniModal Context
Junyi Chen,Tong He,Zhoujie Fu,Pengfei Wan,Kun Gai,Weicai Ye
Main category: cs.CV
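TL;DR: VINO is a unified visual generation framework that performs image and video generation and editing in a single model, using a shared diffusion backbone and interleaved conditioning tokens to support multimodal inputs and produce high-quality, controllable visual content.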
Details
Motivation: Existing visual generation models are usually built as separate models for specific tasks or modalities, and there is no unified framework spanning image and video generation and editing. VINO aims to handle multimodal, multi-task visual generation with one shared architecture, improving generality and scalability.
Method: VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), encoding text, image, and video inputs as interleaved conditioning tokens that guide the diffusion process. A multi-stage training pipeline progressively expands a video generation base model into a unified multi-task generator supporting image and video input and output.
Result: Across diverse generation and editing benchmarks, VINO shows strong visual quality, instruction following, preservation of reference objects and attributes, and more controllable multi-identity edits, standing out in long-form instruction understanding and identity consistency across static and dynamic content.
Conclusion: VINO demonstrates that a unified visual generation system is feasible and shows the promise of interleaved in-context computation as a foundation for general-purpose visual creation, offering a practical path toward scalable multimodal generative models.
Abstract: We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.
[248] ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors
Kaede Shiohara,Toshihiko Yamasaki,Vladislav Golyanik
Main category: cs.CV
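TL;DR: This paper proposes ExposeAnyone, a fully self-supervised method based on a diffusion model that generates expression sequences from audio, used to detect unseen deepfakes.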