cs.CL

[1] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

Pu Zhao,Xuan Shen,Zhenglun Kong,Yixin Shen,Sung-En Chang,Arash Akbari,Timothy Rupprecht,Lei Lu,Enfu Nan,Changdi Yang,Yumei He,Weiyan Shi,Xingchen Xu,Yu Huang,Wei Jiang,Wei Wang,Yue Chen,Yong He,Yanzhi Wang

Main category: cs.CL

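TL;DR: Moxin 7B is a fully open-source large language model built in accordance with the Model Openness Framework, with full transparency about training, data, and implementation details. The authors also release several variants with multimodal and Chinese capabilities and report strong results across multiple evaluations.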

Details

Motivation: To advance full openness and transparency for large language models, foster an open and collaborative research ecosystem, and support reproducible, customizable model development.

Method: Moxin 7B is developed under the Model Openness Framework. Three variants, Moxin-VLM, Moxin-VLA, and Moxin-Chinese, extend it with multimodal and Chinese-language capabilities, and training relies on open-source frameworks and open data.

Result: The Moxin models perform well across a range of task evaluations, and the models are released together with the available data and code.

Conclusion: Through full openness and a broader range of capabilities, the Moxin models support a healthy, sustainable open-source LLM ecosystem.

Abstract: Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have contributed greatly to the ever-increasing popularity of LLMs because they are easy to customize and deploy across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation details, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with various capabilities in different tasks, we develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt an open-source framework and open data for training. We release our models, along with the available data and code to derive these models.

[2] Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces

Sophie Zhao

Main category: cs.CL

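TL;DR: The study shows that the sentence embedding spaces of transformer-based language models contain a hierarchical structure aligned with human-interpretable cognitive attributes. Linear and nonlinear probes decode both continuous energy scores and discrete tier labels, indicating a geometric organization that goes beyond surface lexical statistics.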

Details

Motivation: To examine whether transformer embedding spaces contain high-level, graded structure aligned with human cognitive or psychological attributes, rather than only low-level syntactic or semantic features.

Method: The author builds a dataset of 480 sentences annotated with continuous energy scores and discrete tier labels over seven ordered cognitive categories. Using fixed sentence embeddings from several transformer models, the labels are predicted with linear and shallow nonlinear probes, compared against TF-IDF baselines, and analyzed qualitatively with UMAP visualizations and confusion matrices.

Result: Both probe types decode the cognitive labels reliably, with nonlinear probes performing better. TF-IDF baselines perform much worse, so the results are not explained by word statistics alone. Permutation tests show performance above chance, and the visualizations show smooth low-to-high gradients with confusions mostly between adjacent tiers.

Conclusion: Transformer embedding spaces exhibit a hierarchical geometric structure aligned with human-defined cognitive attributes, which suggests that models capture abstract cognitive tiers without explicit supervision. The finding makes no claim that models have awareness or subjective experience.

Abstract: Recent work has shown that transformer-based language models learn rich geometric structure in their embedding spaces, yet the presence of higher-level cognitive organization within these representations remains underexplored. In this work, we investigate whether sentence embeddings encode a graded, hierarchical structure aligned with human-interpretable cognitive or psychological attributes. We construct a dataset of 480 natural-language sentences annotated with continuous ordinal energy scores and discrete tier labels spanning seven ordered cognitive categories. Using fixed sentence embeddings from multiple transformer models, we evaluate the recoverability of these annotations via linear and shallow nonlinear probes. Across models, both continuous scores and tier labels are reliably decodable, with shallow nonlinear probes providing consistent performance gains over linear probes. Lexical TF-IDF baselines perform substantially worse, indicating that the observed structure is not attributable to surface word statistics alone. Nonparametric permutation tests further confirm that probe performance exceeds chance under label-randomization nulls. Qualitative analyses using UMAP visualizations and confusion matrices reveal smooth low-to-high gradients and predominantly adjacent-tier confusions in embedding space. Taken together, these results provide evidence that transformer embedding spaces exhibit a hierarchical geometric organization aligned with human-defined cognitive attributes, while remaining agnostic to claims of internal awareness or phenomenology.
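The label-randomization null mentioned in this abstract can be illustrated with a small permutation test. This is only a sketch under stated assumptions: the function names, the toy threshold "probe", and the eight data points are hypothetical and are not taken from the paper. A real analysis would typically retrain the probe on each shuffled labeling rather than score a fixed rule.

```python
import random

def permutation_p_value(score_fn, features, labels, n_permutations=1000, seed=0):
    """Estimate how often a shuffled-label score matches or beats the real score."""
    rng = random.Random(seed)
    observed = score_fn(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the feature/label pairing
        if score_fn(features, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)

def threshold_accuracy(features, labels):
    """Toy stand-in for a probe: predict tier 1 when the feature exceeds 0.5."""
    correct = sum((x > 0.5) == bool(y) for x, y in zip(features, labels))
    return correct / len(labels)

# Hypothetical one-dimensional "embeddings" with two tiers.
features = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed, p_value = permutation_p_value(threshold_accuracy, features, labels)
print(observed, p_value)
```

A small p-value here means that randomly relabeled data rarely reaches the observed score, which is the sense in which probe performance "exceeds chance".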

[3] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

Shaofei Cai,Yulei Qin,Haojia Lin,Zihan Xu,Gang Li,Yuchen Shi,Zongyi Li,Yong Mao,Siqi Cai,Xiaoyu Tan,Yitao Liang,Ke Li,Xing Sun

Main category: cs.CL

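TL;DR: This paper proposes SmartSnap, a paradigm in which the agent itself performs proactive, in-situ self-verification, to address the scalability bottleneck that task-completion verification creates for reinforcement learning agents on complex GUI tasks.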

Details

Motivation: Existing task verification is passive and post hoc. It analyzes the entire interaction trajectory, including long and noisy history, which makes it costly and unreliable and limits agent scalability on complex GUI tasks.

Method: The paper introduces the Self-Verifying Agent and the SmartSnap paradigm. Guided by the 3C Principles (Completeness, Conciseness, and Creativity), the agent selects a minimal, decisive set of snapshots as evidence while carrying out the task, and an LLM-as-a-Judge rules on those snapshots alone.

Result: In experiments on mobile tasks, the method brings performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively, and is competitive with larger models such as DeepSeek V3.1 and Qwen3-235B-A22B.

Conclusion: By moving verification from passive to proactive and from whole trajectories to curated snapshots, SmartSnap achieves efficient, reliable task verification and supports scalable training of LLM-driven agents.

Abstract: Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, and LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.

[4] The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach

Zubaida Mohammed Albadani,Mohammed Q. Shormani

Main category: cs.CL

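TL;DR: Working within the Minimalist Program, this paper studies the syntax of qulk-clauses in Yemeni Ibbi Arabic. It proposes that qulk-clauses are biclausal structures in which qulk is a clause-embedding predicate that selects a full CP complement, derived through the minimalist operations Merge, Move, Agree, and Spell-out.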

Details

Motivation: To examine how qulk-clauses embed clauses without a complementizer, and whether their syntactic structure fits minimalist theory.

Method: Core minimalist operations (Merge, Move, Agree, Spell-out) are applied to give a layered syntactic analysis of qulk-clauses, with post-syntactic processes such as Morphological Merger explaining how they are derived.

Result: qulk-clauses are biclausal, with qulk acting as an embedding predicate that selects a full CP. The analysis accounts for dialect-specific phenomena such as bipartite negation, cliticization, and CP embedding.

Conclusion: The study supports the applicability of minimalism to dialect syntax, raises the possibility of extending the analysis to the second-person 'you said' clause, and offers insight into the possible universality of minimalism.

Abstract: This study investigates the syntax of qulk-clauses in Yemeni Ibbi Arabic (YIA) within the Minimalist Program. The qulk construction, a morphologically fused form meaning 'I said,' introduces embedded declarative, interrogative, and imperative clauses, often without a complementizer. The central proposal of this paper is that qulk-clauses are biclausal structures in which qulk functions as a clause-embedding predicate selecting a full CP complement. By applying core minimalist operations, viz., Merge, Move, Agree, and Spell-out, the study provides a layered syntactic analysis of qulk-clauses, illustrating how their derivation proceeds through standard computational steps and post-syntactic processes such as Morphological Merger. The proposal also accounts for dialect-specific features like bipartite negation, cliticization, and CP embedding. The findings offer theoretical contributions to generative syntax, specifically minimalism. The study concludes by raising theoretical questions concerning the extension of the analysis to the addressee-clause kil-k 'you said'. It also provides insights into the possibility of the universality of minimalism.

[5] Towards Efficient Post-Training via Fourier-Driven Adapter Architectures

Donggyun Bae,Jongil Park

Main category: cs.CL

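TL;DR: Proposes FAA, a lightweight adapter framework based on random Fourier features for efficient fine-tuning of large language models. FAA uses frequency decomposition for frequency-aware modulation of semantic information and performs strongly on several benchmarks.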

Details

Motivation: To address the trade-off between parameter efficiency and performance when fine-tuning large language models, and to explore the role of frequency-domain information during adaptation.

Method: Random Fourier features are built into lightweight adapter modules, which decompose intermediate representations into low- and high-frequency components and modulate semantic information with a frequency-aware mechanism.

Result: On GLUE, E2E NLG, and instruction-tuning tasks, FAA matches or outperforms existing parameter-efficient fine-tuning methods while keeping computational and memory overhead low.

Conclusion: FAA is an efficient and robust method for post-training large models, and frequency-domain modeling helps adaptation performance.

Abstract: We propose a novel framework, termed Fourier-Activated Adapter (FAA), for parameter-efficient fine-tuning of large pre-trained language models. By incorporating random Fourier features into lightweight adapter modules, FAA decomposes intermediate representations into complementary low- and high-frequency components, enabling frequency-aware modulation of semantic information. This design allows the model to selectively emphasize informative frequency bands during adaptation while preserving the representational capacity of the frozen backbone. Extensive experiments on GLUE, E2E NLG, and instruction-tuning benchmarks demonstrate that FAA consistently achieves competitive or superior performance compared to existing parameter-efficient fine-tuning methods, while maintaining low computational and memory overhead. Ablation studies further verify the effectiveness of frequency-aware activation and adaptive weighting mechanisms, highlighting FAA as a robust and efficient approach for post-training large language models.

[6] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition

Elsen Ronando,Sozo Inoue

Main category: cs.CL

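TL;DR: Proposes an LLM-guided exemplar selection framework for few-shot human activity recognition, addressing the reliance on large labeled datasets and the difficulty that purely geometric methods have in separating similar activities.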

Details

Motivation: Existing human activity recognition methods rely on large labeled datasets and purely geometric exemplar selection. They struggle to separate semantically similar activities (such as walking, walking upstairs, and walking downstairs), and their performance is limited in few-shot settings.

Method: An LLM-generated knowledge prior captures feature importance, inter-class confusability, and exemplar budget multipliers. It is combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to guide exemplar scoring and selection.

Result: Under strict few-shot conditions on the UCI-HAR dataset, the method reaches a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and k-center.

Conclusion: Combining LLM-generated semantic priors with structural and geometric cues selects more representative sensor exemplars and improves few-shot wearable-sensor activity recognition.

Abstract: In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar wearable sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and k-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
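Of the ingredients listed in this abstract, facility-location optimization is the easiest to show in isolation. The sketch below is a plain greedy maximizer of the standard facility-location objective, written as an illustration only: the function name, the similarity matrix, and the budget are hypothetical, and the LLM prior, PageRank centrality, and hubness terms that the paper combines with it are deliberately left out.

```python
def greedy_facility_location(similarity, budget):
    """Greedily pick exemplars S to maximize sum_i max_{j in S} similarity[i][j]."""
    n = len(similarity)
    selected = []
    best_cover = [0.0] * n  # best similarity of each point to any chosen exemplar
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much candidate j improves coverage of every point.
            gain = sum(max(similarity[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], similarity[i][best_j]) for i in range(n)]
    return selected

# Hypothetical similarities: points 0-2 form one cluster, points 3-4 another.
similarity = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
exemplars = greedy_facility_location(similarity, budget=2)
print(exemplars)
```

With a budget of two, the greedy pass takes the most central point of the larger cluster first and then a point from the other cluster, which is the coverage behavior that makes this objective attractive for compact exemplar sets.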

[7] Hallucination Detection and Evaluation of Large Language Model

Chenggong Zhang,Haopeng Wang

Main category: cs.CL

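TL;DR: This paper integrates HHEM, a lightweight hallucination detection model, to make hallucination evaluation of large language models far more efficient, and adds segment-based retrieval to improve detection of localized hallucinations in summarization.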

Details

Motivation: Existing hallucination evaluation methods for large language models are computationally expensive and hard to apply widely, so a more efficient yet accurate detection method is needed.

Method: The Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that does not depend on LLM judgments, is integrated into the evaluation. A segment-based retrieval mechanism is added to improve detection of localized hallucinations in summaries. Methods are compared across several LLMs using TPR, TNR, and accuracy.

Result: HHEM cuts evaluation time from 8 hours to 10 minutes, and HHEM with non-fabrication checking reaches the highest accuracy (82.2%) and TPR (78.9%). HHEM is weaker at detecting localized hallucinations in summarization, which segment-based retrieval improves. CDF analysis indicates that larger models (7B-9B parameters) hallucinate less, while intermediate-sized models are less stable.

Conclusion: HHEM is an efficient and accurate hallucination detection method. Combined with a structured evaluation framework, it can make LLM-generated content more trustworthy, though detection of localized hallucinations needs further work.

Abstract: Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
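The three metrics reported in this abstract follow directly from confusion counts. The sketch below shows the standard definitions, assuming (as a labeling convention chosen here, not stated in the source) that label 1 marks a hallucinated output and label 0 a faithful one. The function name and the toy labels are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Return (TPR, TNR, accuracy); label 1 = hallucinated, 0 = faithful."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # share of hallucinations caught
    tnr = tn / (tn + fp) if tn + fp else 0.0  # share of faithful outputs cleared
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return tpr, tnr, accuracy

# Hypothetical gold labels and detector predictions for eight outputs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
tpr, tnr, accuracy = detection_metrics(y_true, y_pred)
print(tpr, tnr, accuracy)
```

Reporting TPR and TNR alongside accuracy matters because a detector can reach high accuracy on an imbalanced set while missing most hallucinations.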

[8] HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

Cattalyya Nuengsigkapian

Main category: cs.CL

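TL;DR: HiFi-RAG is a retrieval-augmented generation system with hierarchical filtering that improves answer quality in open-domain question answering through a multi-stage pipeline. It pairs the efficiency of Gemini 2.5 Flash with the stronger reasoning of Gemini 2.5 Pro and clearly outperforms the baseline on several metrics.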

Details

Motivation: In open-domain RAG, standard embedding-based retrieval returns documents that contain irrelevant information, and generated answers drift from user intent. A more precise, better-aligned generation method is needed.

Method: A multi-stage pipeline uses Gemini 2.5 Flash for query formulation, hierarchical content filtering, and citation attribution, taking advantage of its speed and cost, and reserves the reasoning capabilities of Gemini 2.5 Pro for final answer generation.

Result: On the MMU-RAGent validation set, ROUGE-L improves to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, a dataset that requires post-cutoff knowledge, the system outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.

Conclusion: HiFi-RAG's hierarchical filtering improves retrieval relevance and answer accuracy while balancing efficiency and performance, making it an effective approach to open-domain RAG.

Abstract: Retrieval-Augmented Generation (RAG) in open-domain settings faces significant challenges regarding irrelevant information in retrieved documents and the alignment of generated answers with user intent. We present HiFi-RAG (Hierarchical Filtering RAG), the winning closed-source system in the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Our approach moves beyond standard embedding-based retrieval via a multi-stage pipeline. We leverage the speed and cost-efficiency of Gemini 2.5 Flash (4-6x cheaper than Pro) for query formulation, hierarchical content filtering, and citation attribution, while reserving the reasoning capabilities of Gemini 2.5 Pro for final answer generation. On the MMU-RAGent validation set, our system outperformed the baseline, improving ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, our custom dataset evaluating questions that require post-cutoff knowledge (post January 2025), HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.

[9] Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models

Jie Zhou,Xin Chen,Jie Zhang,Zhe Li

Main category: cs.CL

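TL;DR: This study introduces the concept of vertical-domain accounting reasoning, establishes evaluation criteria, and evaluates several large language models on accounting reasoning tasks. Prompt engineering improves performance, but current models still fall short of enterprise-level accounting requirements.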

Details

Motivation: Promoting enterprise digital transformation and broader social development requires integrating large language models effectively into professional fields such as accounting, which makes understanding their domain-specific reasoning capabilities a key challenge.

Method: By analyzing the training data characteristics of GLM-series models, the study introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria. Within this framework it evaluates GLM-6B, GLM-130B, GLM-4, and GPT-4 on accounting reasoning tasks and compares performance under different prompt engineering strategies.

Result: Different prompt engineering strategies improve model performance to varying degrees, and GPT-4 shows the strongest accounting reasoning capability. Current LLMs nonetheless fall short of real-world application requirements, particularly for enterprise-level accounting deployment.

Conclusion: LLMs show potential for accounting reasoning, but their use in professional domains needs further optimization, especially in domain adaptation, reasoning capability, and deployment feasibility.

Abstract: Large Language Models (LLMs) are reshaping learning paradigms, cognitive processes, and research methodologies across a wide range of domains. Integrating LLMs with professional fields and redefining the relationship between LLMs and domain-specific applications has become a critical challenge for promoting enterprise digital transformation and broader social development. To effectively integrate LLMs into the accounting domain, it is essential to understand their domain-specific reasoning capabilities. This study introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria by analyzing the training data characteristics of representative GLM-series models. These criteria provide a foundation for subsequent research on reasoning paradigms and offer benchmarks for improving accounting reasoning performance. Based on this framework, we evaluate several representative models, including GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4, on a set of accounting reasoning tasks. Experimental results show that different prompt engineering strategies lead to varying degrees of performance improvement across models, with GPT-4 achieving the strongest accounting reasoning capability. However, current LLMs still fall short of real-world application requirements. In particular, further optimization is needed for deployment in enterprise-level accounting scenarios to fully realize the potential value of LLMs in this domain.

[10] Constituency Structure over Eojeol in Korean Treebanks

Jungyeul Park,Chulwoo Park

Main category: cs.CL

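TL;DR: This paper examines whether morphemes or eojeol (word units) should serve as terminal units in Korean constituency treebanks. It argues for an eojeol-based constituency representation with morphological information kept in a separate layer.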

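Details

Motivation: To resolve the conflation of syntactic and morphological structure that arises when morphemes are used as terminal units in Korean constituency treebanks, and to achieve consistency with eojeol-based dependency treebanks.

Method: A comparative analysis of the Sejong and Penn Korean treebanks argues that, under explicit normalization assumptions, the two are representationally equivalent at the eojeol level. An eojeol-based annotation scheme is then outlined.

Result: The Sejong and Penn Korean treebanks can be treated as representationally equivalent at the eojeol level. The proposed annotation scheme preserves interpretable constituency and supports cross-treebank comparison and constituency-dependency conversion.

Conclusion: An eojeol-based constituency representation with a separate morphological layer is a more principled way to represent Korean grammar and helps unify different grammatical resources.

Abstract: The design of Korean constituency treebanks raises a fundamental representational question concerning the choice of terminal units. Although Korean words are morphologically complex, treating morphemes as constituency terminals conflates word-internal morphology with phrase-level syntactic structure and creates mismatches with eojeol-based dependency resources. This paper argues for an eojeol-based constituency representation, with morphological segmentation and fine-grained part-of-speech information encoded in a separate, non-constituent layer. A comparative analysis shows that, under explicit normalization assumptions, the Sejong and Penn Korean treebanks can be treated as representationally equivalent at the eojeol-based constituency level. Building on this result, we outline an eojeol-based annotation scheme that preserves interpretable constituency and supports cross-treebank comparison and constituency-dependency conversion.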

[11] ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation

Suhua Wang,Zifan Wang,Xiaoxin Sun,D. J. Wang,Zhanbo Liu,Xin Li

Main category: cs.CL

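TL;DR: This paper proposes ManchuTTS, a speech synthesis method for Manchu that uses a multi-level text representation and cross-modal attention to handle data scarcity and phonological agglutination. It also builds the first Manchu TTS dataset, and experiments show clear gains over baselines in naturalness and pronunciation accuracy.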

Details

Motivation: As an endangered language, Manchu suffers from severe data scarcity and strong phonological agglutination, and existing speech synthesis methods handle its linguistic characteristics poorly.

Method: A three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism provide multi-granular alignment. The non-autoregressive synthesis model combines deep convolutional networks with a flow-matching Transformer, and a hierarchical contrastive loss strengthens the correspondence between acoustic and linguistic structure.

Result: Trained on a 5.2-hour subset of the 6.24-hour annotated corpus, ManchuTTS reaches a MOS of 4.52, clearly outperforming all baseline models. Ablations show that hierarchical guidance improves agglutinative word pronunciation accuracy by 31% and prosodic naturalness by 27%.

Conclusion: ManchuTTS addresses the data scarcity and phonological agglutination challenges of Manchu speech synthesis and offers a technical framework that other endangered languages could draw on.

Abstract: As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS (Manchu Text to Speech), a novel approach tailored to Manchu's linguistic characteristics. To handle agglutination, this method designs a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. This method further introduces a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. To address low-resource constraints, this method constructs the first Manchu TTS dataset and employs a data augmentation strategy. Experiments demonstrate that ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset derived from our full 6.24-hour annotated corpus, outperforming all baseline models by a notable margin. Ablations confirm hierarchical guidance improves agglutinative word pronunciation accuracy (AWPA) by 31% and prosodic naturalness by 27%.

[12] Learning When Not to Attend Globally

Xuan Luo,Kailai Zhang,Xifeng Yan

Main category: cs.CL

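TL;DR: Proposes All-or-Here Attention (AHA), which switches dynamically between full attention and local sliding-window attention. This sharply reduces reliance on global context and improves inference efficiency while preserving performance.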

Details

Motivation: Inspired by how humans read, the paper asks whether large language models can focus mainly on local content and look back at the global context only when necessary, in order to improve computational efficiency.

Method: A binary router lets each attention head choose dynamically, for each token, between full attention and local sliding-window attention. Varying the window size is used to assess how context dependency is distributed.

Result: With a 256-token window, up to 93% of full attention operations can be replaced by local attention without performance loss. Context dependency follows a long-tail distribution, and the need for global attention decays rapidly as the local window grows.

Conclusion: Full attention is largely redundant. Efficient inference needs only on-demand access to the global context, and AHA decouples local processing from global access.

Abstract: When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.
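The per-token toggle described in this abstract can be pictured as a choice between two causal attention masks. The following is a minimal sketch for a single attention head, assuming the router's binary decisions are already given: in the paper the router is learned, whereas here `use_full` is a hypothetical hand-written list, and the function name and sizes are illustrative only.

```python
def attention_mask(seq_len, window, use_full):
    """mask[t][s] is True when query position t may attend to key position s."""
    mask = []
    for t in range(seq_len):
        row = []
        for s in range(seq_len):
            causal = s <= t            # never attend to future tokens
            local = t - s < window     # key lies inside the sliding window
            row.append(causal and (use_full[t] or local))
        mask.append(row)
    return mask

# Hypothetical router output: only token 3 requests full (global) attention.
use_full = [False, False, False, True, False]
mask = attention_mask(seq_len=5, window=2, use_full=use_full)
attended = sum(sum(row) for row in mask)
print(attended)
```

For this toy input the head attends to 11 query-key pairs instead of the 15 that full causal attention would use; the saving grows with sequence length when most tokens stay local.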

[13] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis

Zhiqiang Gao,Shihao Gao,Zixing Zhang,Yihao Guo,Hongyu Chen,Jing Han

Main category: cs.CL

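TL;DR: This paper proposes an LLM-based structured prompting pipeline and an ensembling method for aspect-based sentiment analysis and sentiment-flip detection in multimodal conversations.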

Details

Motivation: To improve fine-grained sentiment understanding in multi-speaker dialogues, in particular the accuracy of sentiment tuple extraction and of detecting dynamic sentiment changes.

Method: For Subtask-I, a structured prompting pipeline guides LLMs to extract the sentiment sextuple step by step. For Subtask-II, three LLMs are ensembled to identify sentiment flips and their triggers.

Result: The system achieves a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II.

Conclusion: Step-wise refinement and model ensembling are effective for complex multimodal sentiment analysis tasks.

Abstract: Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, including holder, target, aspect, opinion, sentiment, and rationale from multi-speaker dialogues, and (2) detecting sentiment flipping, which involves identifying dynamic sentiment shifts and their underlying triggers. For Subtask-I, in the present paper, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.

[14] Chain-of-thought Reviewing and Correction for Time Series Question Answering

Chen Su,Yuanhe Tian,Yan Song

Main category: cs.CL

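TL;DR: Proposes the T3LLM framework, in which a worker, a reviewer, and a student model collaborate to perform multi-step reasoning with an explicit correction mechanism, improving accuracy on time series question answering.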

Details

Motivation: Existing LLM-based time series analysis methods are prone to errors on complex numerical sequences, yet time series data are inherently verifiable and can be used for consistency checking.

Method: The framework has three roles. The worker generates step-wise chains of thought, the reviewer inspects them and corrects errors, and the student is fine-tuned on the corrected chains of thought to internalize the reasoning.

Result: On multiple real-world time series question answering benchmarks, T3LLM outperforms strong baselines and reaches state-of-the-art performance.

Conclusion: By adding an explicit correction mechanism, T3LLM improves the reasoning accuracy and robustness of LLMs on time series question answering.

Abstract: With the advancement of large language models (LLMs), diverse time series analysis tasks are reformulated as time series question answering (TSQA) through a unified natural language interface. However, existing LLM-based approaches largely adopt general natural language processing techniques and are prone to reasoning errors when handling complex numerical sequences. Different from purely textual tasks, time series data are inherently verifiable, enabling consistency checking between reasoning steps and the original input. Motivated by this property, we propose T3LLM, which performs multi-step reasoning with an explicit correction mechanism for time series question answering. The T3LLM framework consists of three LLMs, namely, a worker, a reviewer, and a student, that are responsible for generation, review, and reasoning learning, respectively. Within this framework, the worker generates step-wise chains of thought (CoT) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments. The collaboratively generated, corrected CoT is used to fine-tune the student model, internalizing multi-step reasoning and self-correction into its parameters. Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.

[15] M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

Fanglin Xu,Wei Zhang,Jian Yang,Guo Chen,Aishan Liu,Zhoujun Li,Xianglong Liu,Bryan Dai

Main category: cs.CL

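TL;DR: This paper introduces M2G-Eval, a multi-granularity, multilingual evaluation framework for code generation. It covers four levels (Class, Function, Block, and Line) and 18 programming languages, enabling fine-grained analysis of how large models perform across code structures and languages.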

Details

Motivation: Existing benchmarks mostly assess a single structural granularity and a handful of programming languages, so they cannot fully reveal fine-grained capability differences across code scopes and multilingual settings.

Method: The M2G-Eval benchmark contains more than 17,000 training tasks and 1,286 human-annotated, contamination-controlled test instances. M2G-Eval-Coder models are trained from Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization, and 30 models are evaluated.

Result: (1) Task difficulty rises by level, with Line-level tasks easiest and Class-level hardest. (2) The performance gap between full- and partial-granularity languages widens as complexity increases. (3) Strong cross-language correlations suggest that models learn transferable programming concepts.

Conclusion: M2G-Eval supports fine-grained diagnosis of code generation capabilities and shows that current models still struggle to generate complex, long-form code.

Abstract: The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities, yet existing benchmarks predominantly assess models at a single structural granularity and focus on limited programming languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating code generation in LLMs across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances. We develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization. Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) reveals three main findings: (1) an apparent difficulty hierarchy, with Line-level tasks easiest and Class-level most challenging; (2) widening performance gaps between full- and partial-granularity languages as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.

[16] On the Role of Discreteness in Diffusion LLMs

Ziqi Jin,Bin Wang,Xiang Lin,Lidong Bing,Aixin Sun

Main category: cs.CL

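TL;DR: This paper revisits diffusion language modeling and outlines five key properties that separate diffusion mechanics from language-specific requirements. It analyzes the limits of continuous diffusion in embedding space and discrete diffusion over tokens, and identifies uniform corruption and token-wise training as two central issues in current large models.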

Details

Motivation: Diffusion models offer parallel decoding and iterative refinement for language generation, but the discrete and highly structured nature of text makes direct application of diffusion difficult. This calls for re-examining the approach from the perspectives of both the diffusion process and language modeling.

Method: Existing diffusion language models are categorized into continuous diffusion in embedding space and discrete diffusion over tokens. Five key properties are proposed for assessing them, and analyses of recent large models expose their structural trade-offs.

Result: Each class of method satisfies only part of the five properties, reflecting a structural trade-off. Two central issues emerge: uniform corruption ignores how information is distributed across positions, and token-wise marginal training cannot capture multi-token dependencies during parallel decoding.

Conclusion: Future diffusion language models should use diffusion processes that align more closely with the structure of text in order to generate more coherently.

Abstract: Diffusion models offer appealing properties for language generation, such as parallel decoding and iterative refinement, but the discrete and highly structured nature of text challenges the direct application of diffusion principles. In this paper, we revisit diffusion language modeling from the perspectives of the diffusion process and language modeling, and outline five properties that separate diffusion mechanics from language-specific requirements. We first categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens. We then show that each satisfies only part of the five essential properties and therefore reflects a structural trade-off. Through analyses of recent large diffusion language models, we identify two central issues: (i) uniform corruption does not respect how information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies during parallel decoding. These observations motivate diffusion processes that align more closely with the structure of text, and encourage future work toward more coherent diffusion language models.

[17] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

Hadi Mohammadi,Tamas Kozak,Anastasia Giachanou

Main category: cs.CL

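TL;DR: This study evaluates how well two optimization methods, GRPO and DPO, improve the faithfulness of chain-of-thought (CoT) reasoning in large language models. GRPO performs better on larger models, with the best results on Qwen2.5-14B-Instruct, suggesting that it helps produce more transparent and trustworthy reasoning.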

Details

Motivation: Chain-of-thought (CoT) improves multi-step reasoning in large models, but the explanations it produces often do not match the model's actual reasoning process. This can yield misleading or deceptive rationales and undermines safety supervision and alignment monitoring, so CoT faithfulness needs to improve.

Method: Experiments compare Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) on language models of different sizes, assessing how much each improves CoT faithfulness and how model size relates to performance.

Result: GRPO outperforms DPO on larger models, and Qwen2.5-14B-Instruct achieves the best results. Both methods show a positive correlation between model size and performance, but GRPO has more potential for improving faithfulness, although it is less stable on smaller models.

Conclusion: GRPO improves CoT faithfulness more effectively than DPO, especially for larger models, and is a promising direction for more transparent and reliable reasoning.

Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
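For readers unfamiliar with GRPO, its distinguishing step is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than by a learned value function. The sketch below shows only that normalization, as commonly described for GRPO. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are assumptions made for illustration and are not details from this paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical faithfulness rewards for four responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so the policy update depends on relative rather than absolute reward.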

[18] Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Pere Martra

Main category: cs.CL

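TL;DR: Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, shows that reducing the expansion ratio affects model capabilities unevenly: parametric knowledge and perplexity degrade, instruction following improves substantially, and multi-step reasoning remains robust. The expansion ratio is identified as a key architectural parameter that selectively modulates model capabilities, and the study reports a strong inverse correlation between knowledge and truthfulness.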

Details Motivation: To challenge the common assumption that pruning degrades all capabilities uniformly, and to examine how width pruning affects different capabilities, in particular the role of the expansion ratio beyond model compression. Method: Structured width pruning of GLU-MLP layers under the MAW criterion, evaluated over seven expansion ratio configurations on benchmarks covering factual knowledge, mathematical reasoning, language comprehension, instruction following, and truthfulness. Result: Instruction following improves substantially (+46% to +75%) and multi-step reasoning remains robust, while performance on knowledge tasks such as MMLU drops. MMLU and TruthfulQA-MC2 show a strong inverse correlation (r = -0.864). Pruned configurations reduce energy consumption by up to 23%, increase single-request latency, and benefit batch workloads. Conclusion: The expansion ratio is a key parameter that selectively modulates model capabilities. MAW pruning acts as a selective filter that reduces parametric knowledge while preserving or enhancing behavioral alignment, connecting model compression research with truthfulness research. Abstract: Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
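Illustrative note: width pruning of a GLU-MLP block removes whole intermediate neurons, which lowers the expansion ratio (intermediate width divided by hidden width). The sketch below is a toy version on nested lists. The abstract does not spell out the exact MAW scoring rule, so the score used here (largest absolute weight in a neuron's gate row plus largest absolute weight in its up row) is an assumption, as are all function names.

```python
def maw_scores(gate_proj, up_proj):
    """One score per intermediate neuron (assumed rule, see note above)."""
    return [max(abs(w) for w in g) + max(abs(w) for w in u)
            for g, u in zip(gate_proj, up_proj)]

def prune_width(gate_proj, up_proj, down_proj, keep):
    """Keep the `keep` highest-scoring neurons.
    gate_proj, up_proj: [d_ff][d_model]; down_proj: [d_model][d_ff]."""
    scores = maw_scores(gate_proj, up_proj)
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    kept = sorted(ranked[:keep])  # restore original neuron order
    return ([gate_proj[j] for j in kept],             # drop rows
            [up_proj[j] for j in kept],               # drop rows
            [[row[j] for j in kept] for row in down_proj],  # drop columns
            kept)

# Toy block: hidden width 2, intermediate width 4 (expansion ratio 2).
gate = [[0.1, -0.2], [0.9, 0.0], [0.05, 0.03], [-0.7, 0.4]]
up = [[0.3, 0.1], [0.2, -0.8], [0.01, 0.02], [0.5, 0.5]]
down = [[1, 2, 3, 4], [5, 6, 7, 8]]
gate_p, up_p, down_p, kept = prune_width(gate, up, down, keep=2)
```

After pruning, the toy block has intermediate width 2, so its expansion ratio drops from 2 to 1. The same row and column slicing applies to real weight matrices.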

[19] Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency

Yoshith Roy Kotla,Varshith Roy Kotla

Main category: cs.CL

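TL;DR: Proposes Vocabulary-Aware Conformal Prediction (VACP), a framework that uses semantic masking and temperature-adjusted scoring to sharply reduce prediction set sizes for large-vocabulary language models while maintaining valid coverage.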

Details Motivation: Standard softmax probabilities in large language models are often poorly calibrated, and naive conformal prediction produces prediction sets that are too large to be informative, so a more efficient approach to uncertainty quantification is needed. Method: Proposes Vocabulary-Aware Conformal Prediction (VACP), which uses semantic masking to shrink the effective vocabulary space and temperature-adjusted scoring to improve prediction set efficiency while maintaining marginal coverage. Result: On Gemma-2B with the SQUAD and WikiText benchmarks, VACP reaches 89.7% empirical coverage (90% target) and reduces the mean prediction set size from 847 tokens to 4.3 tokens, a 197x efficiency improvement. Conclusion: VACP balances high coverage with very compact prediction sets in large-vocabulary language models, offering a practical route to reliable uncertainty estimates in high-stakes domains. Abstract: Deploying large language models (LLMs) in high-stakes domains requires rigorous uncertainty quantification, yet standard softmax probabilities are often poorly calibrated. We present a systematic study of Adaptive Prediction Sets (APS) applied to next-token prediction in transformer-based models with large vocabularies (greater than 250,000 tokens). Our central contribution is the identification of a coverage-efficiency tradeoff: while naive conformal prediction achieves valid coverage, it produces prediction sets of hundreds of tokens, rendering them uninformative. We propose Vocabulary-Aware Conformal Prediction (VACP), a framework that leverages semantic masking and temperature-adjusted scoring to reduce the effective prediction space while provably maintaining marginal coverage. Experiments on Gemma-2B using SQUAD and WikiText benchmarks demonstrate that VACP achieves 89.7 percent empirical coverage (90 percent target) while reducing the mean prediction set size from 847 tokens to 4.3 tokens, a 197x improvement in efficiency. We provide a theoretical analysis of vocabulary reduction and release our implementation for reproducibility.
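Illustrative note: the baseline that this entry builds on, Adaptive Prediction Sets with split-conformal calibration, can be sketched as follows. This is the plain, non-randomized APS procedure over next-token probabilities. It is not VACP itself, since the semantic masking and temperature-adjusted scoring steps are not described in enough detail here to reproduce. Function names are hypothetical.

```python
import math

def aps_score(probs, label):
    """Cumulative probability mass of tokens ranked at or above the true token."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    total = 0.0
    for tok in order:
        total += probs[tok]
        if tok == label:
            return total
    raise ValueError("label is not a valid token id")

def calibrate(cal_probs, cal_labels, alpha=0.1):
    """Threshold targeting 1 - alpha marginal coverage on exchangeable data."""
    scores = sorted(aps_score(p, y) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the guarantee requires the full vocabulary.
    return scores[k - 1] if k <= n else float("inf")

def prediction_set(probs, qhat):
    """Add tokens in descending probability until the mass reaches qhat."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    chosen, total = [], 0.0
    for tok in order:
        chosen.append(tok)
        total += probs[tok]
        if total >= qhat:
            break
    return chosen
```

With a flat, poorly calibrated distribution over a vocabulary of hundreds of thousands of tokens, this loop has to add many tokens before the mass reaches the threshold, which is the set-size problem the abstract describes.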

[20] GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages

Ahmed Abdullah,Sana Fatima,Haroon Mahmood

Main category: cs.CL

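TL;DR: This paper presents a multilingual framework for hope speech detection with a focus on Urdu. Using pretrained models such as XLM-RoBERTa and mBERT, it achieves strong F1 scores on the PolyHope-M 2025 benchmark, showing that multilingual models are effective in low-resource settings.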

Details Motivation: Hope speech is understudied in natural language processing, and low-resource languages such as Urdu lack relevant resources, which limits the development of tools that promote positive online communication. Method: Pretrained transformer models (XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT) are combined with simple preprocessing and trained as classifiers for hope speech detection. Result: On the PolyHope-M 2025 benchmark, the F1 score is 95.2% for Urdu binary classification and 65.2% for Urdu multi-class classification, with competitive results in Spanish, German, and English. Conclusion: Existing multilingual models can be applied effectively to hope speech detection in low-resource settings, helping to build a more constructive digital discourse. Abstract: Hope speech has been relatively underrepresented in Natural Language Processing (NLP). Current studies are largely focused on English, which has resulted in a lack of resources for low-resource languages such as Urdu. As a result, the creation of tools that facilitate positive online communication remains limited. Although transformer-based architectures have proven to be effective in detecting hate and offensive speech, little has been done to apply them to hope speech or, more generally, to test them across a variety of linguistic settings. This paper presents a multilingual framework for hope speech detection with a focus on Urdu. Using pretrained transformer models such as XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT, we apply simple preprocessing and train classifiers for improved results. Evaluations on the PolyHope-M 2025 benchmark demonstrate strong performance, achieving F1-scores of 95.2% for Urdu binary classification and 65.2% for Urdu multi-class classification, with similarly competitive results in Spanish, German, and English. These results highlight the possibility of implementing existing multilingual models in low-resource environments, thus making it easier to identify hope speech and helping to build a more constructive digital discourse.

[21] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

Anaelia Ovalle,Candace Ross,Sebastian Ruder,Adina Williams,Karen Ullrich,Mark Ibrahim,Levent Sagun

Main category: cs.CL

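TL;DR: Although large language models reach high accuracy on multilingual tasks, their reasoning traces support their conclusions markedly less well in non-Latin scripts, which exposes a gap in current evaluation practices.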

Details Motivation: To examine whether the reasoning ability of large language models is consistent across languages, and in particular whether the quality of reasoning traces holds up across languages. Method: Introduces a human-validated evaluation framework, analyzes 65k GlobalMMLU reasoning traces from 6 languages and 6 frontier models, and builds an error taxonomy through human annotation. Result: Reasoning traces in non-Latin scripts show at least twice as much misalignment with their conclusions as those in Latin scripts. The main error types are evidential errors (unsupported claims, ambiguous facts), followed by illogical reasoning steps. Conclusion: Current multilingual evaluation practices do not fully reflect models' actual reasoning capabilities, and evaluation frameworks that account for reasoning quality are needed. Abstract: Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions as those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.

[22] Mitigating Social Desirability Bias in Random Silicon Sampling

Sashank Chapala,Maksym Mironov,Songgaojun Deng

Main category: cs.CL

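TL;DR: This study examines whether psychologically grounded prompt rewording can mitigate Social Desirability Bias (SDB) when large language models (LLMs) simulate population responses. Neutral, third-person "reformulated" prompts best improve agreement between LLM outputs and real human data (ANES).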

Details Motivation: In silicon sampling, LLMs often drift away from real human responses because of social desirability bias, especially on sensitive topics. Existing research on SDB is limited, so low-cost and effective prompt engineering methods are needed to make simulations more realistic. Method: Using ANES election survey data, the study selects three LLMs (the Llama-3.1 series and GPT-4.1-mini), replicates a baseline silicon sampling study, and confirms the presence of SDB. It then tests four prompt-based mitigation strategies: reformulated, reverse-coded, priming, and preamble. Alignment with real data is measured with Jensen-Shannon Divergence and bootstrap confidence intervals. Result: Reformulated prompts are the most effective, clearly reducing distribution concentration and bringing results closer to ANES data. Reverse-coding gives mixed results. Priming and preamble prompts push responses toward uniformity and do not systematically reduce bias. Conclusion: Prompt-based framing controls, especially neutral third-person phrasing, can effectively mitigate social desirability bias in LLMs and offer a practical path to more representative silicon samples. Abstract: Large Language Models (LLMs) are increasingly used to simulate population responses, a method known as "Silicon Sampling". However, responses to socially sensitive questions frequently exhibit Social Desirability Bias (SDB), diverging from real human data toward socially acceptable answers. Existing studies on social desirability bias in LLM-based sampling remain limited. In this work, we investigate whether minimal, psychologically grounded prompt wording can mitigate this bias and improve alignment between silicon and human samples. We conducted a study using data from the American National Election Study (ANES) on three LLMs from two model families: the open-source Llama-3.1 series and GPT-4.1-mini. We first replicate a baseline silicon sampling study, confirming the persistent Social Desirability Bias. We then test four prompt-based mitigation methods: reformulated (neutral, third-person phrasing), reverse-coded (semantic inversion), and two meta-instructions, priming and preamble, respectively encouraging analytics and sincerity. Alignment with ANES is evaluated using Jensen-Shannon Divergence with bootstrap confidence intervals. Our results demonstrate that reformulated prompts most effectively improve alignment by reducing distribution concentration on socially acceptable answers and achieving distributions closer to ANES. Reverse-coding produced mixed results across eligible items, while the Priming and Preamble encouraged response uniformity and showed no systematic benefit for bias mitigation. Our findings validate the efficacy of prompt-based framing controls in mitigating inherent Social Desirability Bias in LLMs, providing a practical path toward more representative silicon samples.
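Illustrative note: the alignment measure named in this entry, Jensen-Shannon Divergence with bootstrap confidence intervals, can be sketched with the standard library. The resampling scheme (a percentile bootstrap that resamples both the silicon and the human answers) and all names are assumptions. The abstract does not specify these details.

```python
import math
import random
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (range 0 to 1) between two
    discrete distributions over the same answer options."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distribution(answers, options):
    """Relative frequency of each answer option."""
    counts = Counter(answers)
    return [counts[o] / len(answers) for o in options]

def bootstrap_jsd(silicon, human, options, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the JSD between two answer samples."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        s = rng.choices(silicon, k=len(silicon))  # resample with replacement
        h = rng.choices(human, k=len(human))
        stats.append(js_divergence(distribution(s, options),
                                   distribution(h, options)))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```

A lower divergence means the simulated answer distribution sits closer to the survey distribution, and the interval shows how much of a difference between prompt variants could be sampling noise.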

[23] Data Augmentation for Classification of Negative Pregnancy Outcomes in Imbalanced Data

Md Badsha Biswas

Main category: cs.CL

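TL;DR: This study proposes using social media data (especially Twitter) together with natural language processing to identify pregnancy experiences and classify pregnancy outcomes, supplementing traditional epidemiological data and supporting research on negative pregnancy outcomes and the evaluation of interventions.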

Details Motivation: Infant mortality remains high and birth defects are a leading cause. The causes of adverse pregnancy outcomes are still poorly understood, so more comprehensive data and intervention strategies are needed. Method: Builds a natural language processing (NLP) pipeline that extracts pregnancy experiences from publicly available social media data such as Twitter, using preprocessing and data augmentation to handle imbalance, noise, and lack of structure. Women reporting full gestation and normal birth weight are classified as positive cases, and those reporting adverse pregnancy outcomes as negative cases. Result: The NLP pipeline automatically identifies and classifies pregnancy outcomes and can be used to assess the effect of specific interventions, treatments, or prenatal exposures on maternal and fetal health, providing a framework for studies of pregnant cohorts. Conclusion: Social media data can serve as a useful supplementary resource for epidemiological research on pregnancy outcomes, with broad potential applications. Abstract: Infant mortality remains a significant public health concern in the United States, with birth defects identified as a leading cause. Despite ongoing efforts to understand the causes of negative pregnancy outcomes like miscarriage, stillbirths, birth defects, and premature birth, there is still a need for more comprehensive research and strategies for intervention. This paper introduces a novel approach that uses publicly available social media data, especially from platforms like Twitter, to enhance current datasets for studying negative pregnancy outcomes through observational research. The inherent challenges in utilizing social media data, including imbalance, noise, and lack of structure, necessitate robust preprocessing techniques and data augmentation strategies. By constructing a natural language processing (NLP) pipeline, we aim to automatically identify women sharing their pregnancy experiences, categorizing them based on reported outcomes. Women reporting full gestation and normal birth weight will be classified as positive cases, while those reporting negative pregnancy outcomes will be identified as negative cases. Furthermore, this study offers potential applications in assessing the causal impact of specific interventions, treatments, or prenatal exposures on maternal and fetal health outcomes. Additionally, it provides a framework for future health studies involving pregnant cohorts and comparator groups. In a broader context, our research showcases the viability of social media data as an adjunctive resource in epidemiological investigations about pregnancy outcomes.

[24] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Aiwei Liu,Minghua He,Shaoxun Zeng,Sijun Zhang,Linhao Zhang,Chuhan Wu,Wei Jia,Yuan Liu,Xiao Zhou,Jie Zhou

Main category: cs.CL

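TL;DR: WeDLM is a diffusion decoding framework built on causal attention. It uses topological reordering to make parallel generation prefix-cache friendly, substantially speeding up inference while preserving generation quality.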

Details Motivation: Existing diffusion language models rely on bidirectional attention, so they cannot use prefix KV caching effectively. The repeated contextualization lowers inference efficiency and makes it hard to beat optimized autoregressive engines in real deployments. Method: Proposes WeDLM, which uses purely causal attention, introduces Topological Reordering to move observed tokens to the physical prefix while preserving their logical positions, and designs a streaming decoding procedure that continuously commits high-confidence tokens and maintains a fixed parallel workload. Result: Under matched deployment settings, WeDLM approaches a 3x speedup over vLLM-served autoregressive baselines on challenging reasoning tasks and reaches up to 10x in low-entropy generation regimes, while matching the generation quality of strong autoregressive models. Conclusion: WeDLM shows that diffusion-style decoding can outperform an optimized autoregressive engine in practical deployment, opening a new direction for efficient language generation. Abstract: Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.
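Illustrative note: the reordering idea in this entry can be pictured with a toy example. The sketch below only shows the index bookkeeping implied by the abstract: observed tokens move to the front of the physical sequence, and each token carries its original (logical) position id. The `MASK` symbol and function name are hypothetical, and the actual WeDLM implementation may differ.

```python
MASK = "<mask>"  # hypothetical placeholder for a not-yet-decoded position

def topological_reorder(tokens):
    """Return (reordered_tokens, position_ids) with observed tokens first.

    position_ids[i] is the logical position of the token stored at
    physical slot i, to be fed to the positional encoding."""
    observed = [i for i, t in enumerate(tokens) if t != MASK]
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    order = observed + masked
    return [tokens[i] for i in order], order

tokens = ["The", MASK, "sat", MASK, "mat"]
reordered, position_ids = topological_reorder(tokens)
# reordered    -> ["The", "sat", "mat", "<mask>", "<mask>"]
# position_ids -> [0, 2, 4, 1, 3]
```

Because every masked slot now sits after all observed tokens, a strict causal mask still lets each masked position attend to everything observed so far, and the observed prefix can be served from a standard KV cache.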

[25] Harnessing Large Language Models for Biomedical Named Entity Recognition

Jian Chen,Leilei Su,Cong Sun

Main category: cs.CL

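TL;DR: This paper proposes BioSelectTune, a data-centric framework that uses a Hybrid Superfiltering strategy to improve biomedical named entity recognition (BioNER), reaching SOTA performance with only 50% of the training data.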

Details Motivation: General-purpose large models are limited in the biomedical domain, mainly because they lack domain knowledge and because low-quality training data degrades performance. Method: Reformulates BioNER as a structured JSON generation task and applies a new Hybrid Superfiltering strategy, in which a weak model selects a high-quality subset of training data for fine-tuning. Result: Achieves state-of-the-art performance on multiple BioNER benchmarks. With only 50% of the curated positive data, it surpasses the fully trained baseline and domain-specialized models such as BioMedBERT. Conclusion: Data quality matters more than quantity, and BioSelectTune offers a new approach to efficiently adapting large models to specialized domains. Abstract: Background and Objective: Biomedical Named Entity Recognition (BioNER) is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. However, adapting general-domain Large Language Models (LLMs) to this task is often hampered by their lack of domain-specific knowledge and the performance degradation caused by low-quality training data. To address these challenges, we introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning LLMs that prioritizes data quality over quantity. Methods and Results: BioSelectTune reformulates BioNER as a structured JSON generation task and leverages our novel Hybrid Superfiltering strategy, a weak-to-strong data curation method that uses a homologous weak model to distill a compact, high-impact training dataset. Conclusions: Through extensive experiments, we demonstrate that BioSelectTune achieves state-of-the-art (SOTA) performance across multiple BioNER benchmarks. Notably, our model, trained on only 50% of the curated positive data, not only surpasses the fully-trained baseline but also outperforms powerful domain-specialized models like BioMedBERT.

[26] Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis

Dongning Rao,Yunbiao Zeng,Zhihua Jiang,Jujian Lv

Main category: cs.CL

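TL;DR: This paper proposes TEXT, a new model for Multi-modal Sentiment Analysis (MSA) that combines explanations generated by multi-modal large language models with a temporal alignment mechanism. Using a text-routed sparse mixture-of-experts with gate fusion, it achieves the best performance on multiple datasets.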

Details Motivation: Existing MSA methods make limited use of explanations and temporal alignment, which restricts fine-grained modeling of multi-modal sentiment. Method: TEXT uses Multi-modal Large Language Models (MLLMs) to generate explanations, aligns audio and video representations through a temporality-oriented neural network block, and introduces a text-routed sparse mixture-of-experts with gate fusion. The temporal alignment block combines Mamba with temporal cross-attention. Result: TEXT outperforms all compared models on four datasets, including three recently proposed approaches and three MLLMs, and is best on at least four of six metrics. For example, on the CH-SIMS dataset it lowers the mean absolute error to 0.353, a 13.5% reduction compared with recently proposed approaches. Conclusion: By adding explanation augmentation and temporal alignment, TEXT improves multi-modal sentiment analysis and confirms the value of explanations and fine-grained temporal modeling in MSA. Abstract: Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLM), and then aligns the representations of audio and video through a novel temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four metrics out of all six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, which signifies a 13.5% reduction compared with recently proposed approaches.

[27] Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language

Muhammad Zain Ali,Bernhard Pfahringer,Tony Smith

Main category: cs.CL

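TL;DR: This paper studies domain-adaptive pretraining as a way to improve multilingual models for Urdu fake news detection. Experiments show that the domain-adapted XLM-R model outperforms its base version.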

Details Motivation: Low-resource languages such as Urdu have received little attention in misinformation detection, and existing models handle domain-specific terms poorly, which hurts performance. Method: A two-stage training approach combines domain-adaptive pretraining with fine-tuning. XLM-RoBERTa and mBERT are adapted on a publicly available Urdu news corpus and evaluated on four Urdu fake news datasets. Result: Domain-adapted XLM-R outperforms the original model on all datasets, while mBERT shows mixed results. Conclusion: Domain-adaptive pretraining helps multilingual models detect fake news in low-resource languages, with the clearest gains for XLM-R. Abstract: Misinformation on social media is a widely acknowledged issue, and researchers worldwide are actively engaged in its detection. However, low-resource languages such as Urdu have received limited attention in this domain. An obvious approach is to utilize a multilingual pretrained language model and fine-tune it for a downstream classification task, such as misinformation detection. However, these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to optimize model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, and apply domain-adaptive pretraining using a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-R consistently outperforms its vanilla counterpart, while domain-adapted mBERT exhibits mixed results.

[28] CNSight: Evaluation of Clinical Note Segmentation Tools

Risha Surana,Adrian Law,Sunwoo Kim,Rishab Sridhar,Angxiao Han,Peiyu Hong

Main category: cs.CL

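TL;DR: This study evaluates rule-based methods, domain-specific transformer models, and large language models on clinical note segmentation, using 1,000 notes from the MIMIC-IV dataset. GPT-5-mini performs best, with an average F1 of 72.4 across sentence-level and freetext segmentation.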

Details Motivation: Clinical notes are often stored in unstructured or semi-structured form, which makes them hard to use for secondary analysis and downstream clinical applications. Reliably identifying section boundaries is a key step toward structuring them. Method: Rule-based baselines, domain-specific transformer models, and large language models are applied to the segmentation of 1,000 clinical notes from the MIMIC-IV dataset, with performance measured on both sentence-level and freetext tasks. Result: API-based large language models perform best overall, with GPT-5-mini reaching an average F1 of 72.4. Lightweight baselines remain competitive on structured sentence-level tasks but perform poorly on unstructured freetext. Conclusion: The results offer guidance for choosing a clinical note segmentation method and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization. Abstract: Clinical notes are often stored in unstructured or semi-structured formats after extraction from electronic medical record (EMR) systems, which complicates their use for secondary analysis and downstream clinical applications. Reliable identification of section boundaries is a key step toward structuring these notes, as sections such as history of present illness, medications, and discharge instructions each provide distinct clinical contexts. In this work, we evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini reaching a best average F1 of 72.4 across sentence-level and freetext segmentation. Lightweight baselines remain competitive on structured sentence-level tasks but falter on unstructured freetext. Our results provide guidance for method selection and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization.

Sameer Sitoula,Tej Bahadur Shahi,Laxmi Prasad Bhatt,Anisha Pokhrel,Arjun Neupane

Main category: cs.CL

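TL;DR: This paper presents NepEMO, a new dataset for multi-label emotion classification and sentiment classification of Nepali Reddit posts, and compares a range of machine learning, deep learning, and transformer models. Transformer models perform best.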

Details Motivation: Reddit users often discuss sensitive topics such as health and daily life anonymously, so a dedicated dataset for less widely studied languages such as Nepali is needed to better understand emotional expression on social media. Method: The authors collect and manually annotate 4,462 Reddit posts from January 2019 to June 2025, written in English, Romanised Nepali, and Devanagari script, covering five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral). They carry out linguistic analysis with LDA topic modelling, TF-IDF keyword extraction, and n-gram analysis, and compare traditional machine learning, deep learning, and transformer models on multi-label emotion and sentiment classification. Result: Transformer models outperform traditional machine learning and deep learning models on both multi-label emotion classification and sentiment classification. The analysis also reveals emotion trends, emotion co-occurrence patterns, sentiment-specific n-grams, and the main discussion topics. Conclusion: NepEMO is a valuable resource for social media emotion and sentiment analysis in low-resource languages such as Nepali, and transformer architectures perform best on these tasks, supporting future research on multilingual emotion recognition. Abstract: Social media (SM) platforms (e.g. Facebook, Twitter, and Reddit) are increasingly leveraged to share opinions and emotions, specifically during challenging events, such as natural disasters, pandemics, and political elections, and joyful occasions like festivals and celebrations. Among the SM platforms, Reddit provides a unique space for its users to anonymously express their experiences and thoughts on sensitive issues such as health and daily life. In this work, we present a novel dataset, called NepEMO, for multi-label emotion (MLE) and sentiment classification (SC) on Nepali subreddit posts. We curate and build a manually annotated dataset of 4,462 posts (January 2019 to June 2025) written in English, Romanised Nepali and Devanagari script for five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral). We perform a detailed analysis of posts to capture linguistic insights, including emotion trends, co-occurrence of emotions, sentiment-specific n-grams, and topic modelling using Latent Dirichlet Allocation and TF-IDF keyword extraction. Finally, we compare various traditional machine learning (ML), deep learning (DL), and transformer models for MLE and SC tasks. The results show that transformer models consistently outperform the ML and DL models for both tasks.

[30] AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning

Shihao Cai,Runnan Fang,Jialong Wu,Baixuan Li,Xinyu Wang,Yong Jiang,Liangcai Su,Liwen Zhang,Wenbiao Yin,Zhen Zhang,Fuli Feng,Pengjun Xie,Xiaobin Wang

Main category: cs.CL

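TL;DR: Proposes a unified automated pipeline and an environment-level reinforcement learning algorithm for synthesizing simulated environments with high-difficulty, easily verifiable tasks, improving the training efficiency and stability of language agents in those environments.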

Details Motivation: Existing reinforcement learning methods based on simulated environments are limited by semi-automated environment synthesis, tasks that are not difficult enough, and unstable simulated users, which makes efficient language agent training hard to support. Method: Designs a unified pipeline that automatically and scalably generates simulated environments with high-difficulty, easily verifiable tasks, and proposes an environment-level reinforcement learning algorithm that mitigates user instability and performs advantage estimation at the environment level to improve training efficiency and stability. Result: Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, show that the method is effective and generalizes well out of domain. Conclusion: The method provides a more stable and efficient training framework for reinforcement learning of language agents in complex simulated environments, with good scalability and generalization. Abstract: Conducting reinforcement learning (RL) in simulated environments offers a cost-effective and highly scalable way to enhance language-based agents. However, previous work has been limited to semi-automated environment synthesis or tasks lacking sufficient difficulty, offering little breadth or depth. In addition, the instability of simulated users integrated into these environments, along with the heterogeneity across simulated environments, poses further challenges for agentic RL. In this work, we propose: (1) a unified pipeline for automated and scalable synthesis of simulated environments associated with high-difficulty but easily verifiable tasks; and (2) an environment-level RL algorithm that not only effectively mitigates user instability but also performs advantage estimation at the environment level, thereby improving training efficiency and stability. Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validate the effectiveness of our proposed method. Further in-depth analyses underscore its out-of-domain generalization.

[31] Diversity or Precision? A Deep Dive into Next Token Prediction

Haoyuan Wu,Hai Wang,Jiajia Wu,Jinxiang Ou,Keyao Wang,Weile Chen,Zihao Zheng,Bei Yu

Main category: cs.CL

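TL;DR: This paper proposes a new pre-training objective that uses a reward-shaping strategy to adjust the token-output distribution of large language models, giving reinforcement learning a better exploration space. It finds that a precision-oriented prior benefits reasoning performance more than a high-entropy distribution.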

Details Motivation: To study how the token-output distribution of a pre-trained model affects reinforcement learning, and to explore how to build an exploration space that better supports later RL gains in reasoning. Method: Frames next-token prediction as a stochastic decision process and proposes a generalized pre-training objective. Reward shaping uses a positive reward scaling factor and a rank-aware mechanism that treats negative tokens asymmetrically, balancing diversity and precision. Result: The method effectively reshapes the token-output distribution at the pre-training stage. Experiments show that a precision-oriented distribution gives reinforcement learning a better exploration space than a high-entropy distribution. Conclusion: A precision-oriented prior does more than a high-entropy distribution to help reinforcement learning improve large language models on reasoning tasks, challenging the conventional view that high entropy favors exploration. Abstract: Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
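Illustrative note: one way to see how cross-entropy can be read as a single-step policy gradient is the short derivation below. It is a sketch under an assumed reward definition, not necessarily the paper's own derivation. The symbols $s_t$ (context), $y_t$ (ground-truth token), and $\bar{\pi}$ (a stop-gradient copy of the policy) are introduced here for illustration.

```latex
% Single-step episode: state s_t, action a (the next token), policy \pi_\theta.
% Assumed reward: r(a) = \mathbb{1}[a = y_t] / \bar{\pi}(y_t \mid s_t).
\begin{align*}
\nabla_\theta J(\theta)
  &= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}
     \left[ r(a)\, \nabla_\theta \log \pi_\theta(a \mid s_t) \right] \\
  &= \pi_\theta(y_t \mid s_t)\,
     \frac{1}{\bar{\pi}(y_t \mid s_t)}\,
     \nabla_\theta \log \pi_\theta(y_t \mid s_t) \\
  &= \nabla_\theta \log \pi_\theta(y_t \mid s_t)
   = -\nabla_\theta \mathcal{L}_{\mathrm{CE}}(\theta)
\end{align*}
```

Under this reading, only the ground-truth token receives a nonzero reward. A shaped objective of the kind the abstract describes would change how much reward the ground-truth token gets and assign rank-dependent rewards to the other tokens.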

[32] Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

Mengdi Chai,Ali R. Zomorrodi

Main category: cs.CL

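TL;DR: This study evaluates three state-of-the-art large language models (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) on clinical decision support across the full workflow of a typical patient encounter. Performance varies widely by task, which indicates that optimization strategies need to be tailored to the specific task and model.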

Details Motivation: To explore how useful large language models are in real clinical decision-making, beyond tests of medical knowledge. Method: Using 36 case studies, three LLMs are evaluated on five sequential clinical tasks (differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation), and prompt engineering is tested under different temperature settings and with variations of the MedPrompt framework. Result: The models are highly accurate on final diagnosis but perform poorly on relevant diagnostic testing. Prompt engineering helps on the task with the lowest baseline but can be counterproductive on others. Targeted few-shot prompting does not clearly outperform random selection. Conclusion: The effect of prompt engineering depends strongly on the model and the task, so tailored, context-aware strategies are needed to integrate LLMs into healthcare effectively. Abstract: Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed the LLMs' out-of-the-box performance across five key sequential clinical decision-making tasks under two temperature settings (default vs. zero): differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation. All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in the remaining tasks. Furthermore, ChatGPT performed better at zero temperature, whereas Llama showed stronger performance at the default temperature. Next, we assessed whether prompt engineering could enhance LLM performance by applying variations of the MedPrompt framework, incorporating targeted and random dynamic few-shot learning. The results demonstrate that prompt engineering is not a one-size-fits-all solution. While it significantly improved the performance on the task with the lowest baseline accuracy (relevant diagnostic testing), it was counterproductive for others. Another key finding was that the targeted dynamic few-shot prompting did not consistently outperform random selection, indicating that the presumed benefits of closely matched examples may be counterbalanced by loss of broader contextual diversity. These findings suggest that the impact of prompt engineering is highly model- and task-dependent, highlighting the need for tailored, context-aware strategies for integrating LLMs into healthcare.

[33] Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping

Tao Yu,Yongqi An,Kuan Zhu,Guibo Zhu,Ming Tang,Jinqiao Wang

Main category: cs.CL

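TL;DR: Proposes FANG, a framework that uses function-aware neuron grouping and adaptive sparsity allocation to improve downstream task accuracy while preserving language modeling performance, addressing the problem of calibration bias.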

Details Motivation: When few-shot calibration sets fail to reflect the pretraining data distribution, existing structured pruning methods generalize poorly to downstream tasks, so calibration bias needs to be reduced. Method: Neurons are grouped by function according to the type of semantic context they process. Within each group, importance estimation gives more weight to strongly correlated tokens, neurons that contribute across multiple context types are preserved, and sparsity is allocated adaptively according to functional complexity. Result: At 30% and 40% sparsity, combining FANG with FLAP and OBC improves average accuracy by 1.5% to 8.5% and reaches SOTA. Conclusion: FANG effectively reduces calibration bias and improves both the generalization and the performance balance of pruned models. Abstract: Large Language Models (LLMs) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured pruning offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training pruning framework that alleviates calibration bias by identifying and preserving neurons critical to specific functions. FANG groups neurons with similar function based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weighting. Additionally, FANG preserves neurons that contribute across multiple context types. To achieve a better trade-off between sparsity and performance, it allocates sparsity to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while preserving language modeling performance. It achieves state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative pruning methods. Specifically, FANG outperforms FLAP and OBC by 1.5% to 8.5% in average accuracy under 30% and 40% sparsity.

[34] LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

Wenxuan Xu,Arvind Pillai,Subigya Nepal,Amanda C Collins,Daniel M Mackin,Michael V Heinz,Tess Z Griffin,Nicholas C Jacobson,Andrew Campbell

Main category: cs.CL

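TL;DR: LENS is a framework that aligns multimodal sensing data with language models to generate clinically grounded natural-language mental health narratives. By building a large-scale sensor-text dataset and training an encoder that embeds raw sensor signals directly, it describes depression and anxiety symptoms accurately.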

Details Motivation: Current LLMs cannot directly process long-duration sensor time series, and paired sensor-text datasets are scarce, which limits the use of multimodal health sensing in mental-health assessment. Method: The authors build a large-scale dataset of more than 100,000 sensor-text QA pairs by turning Ecological Momentary Assessment (EMA) responses into natural-language descriptions, and train a patch-level encoder that maps raw sensor signals into the LLM's representation space, fusing time series with text. Result: LENS outperforms strong baselines on standard NLP metrics and on task-specific symptom-severity accuracy; a user study with 13 mental-health professionals indicates that its narratives are clinically meaningful and comprehensive. Conclusion: LENS offers a scalable path toward using LLMs as interfaces for health sensing that can reason over raw behavioral signals and support clinical decision-making. Abstract: Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.

[35] Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Kerem Zaman,Shashank Srivastava

Main category: cs.CL

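TL;DR: This paper examines the problems with hint-based evaluation of chain-of-thought (CoT) faithfulness, argues that the Biasing Features metric conflates incompleteness with unfaithfulness, and proposes a new metric together with a causal analysis to give a fuller picture of model reasoning.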

Details Motivation: Existing metrics such as Biasing Features treat a CoT as unfaithful when it fails to mention a hint that affected the prediction. The authors argue that this may instead reflect incompleteness caused by compressing the reasoning chain rather than genuine unfaithfulness, so these evaluation criteria need to be re-examined. Method: On multi-hop reasoning tasks with Llama-3 and Gemma-3, the authors combine a new faithful@k metric with Causal Mediation Analysis to test whether hints that are not verbalized still causally affect predictions. Result: Many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics; larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings); and even non-verbalized hints can causally influence predictions through the CoT. Conclusion: Judging faithfulness solely by whether a hint appears in the CoT is unreliable; evaluation should draw on a broader interpretability toolkit, including causal mediation analysis and corruption-based metrics. Abstract: Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.

[36] Accelerating Language Model Workflows with Prompt Choreography

TJ Bai,Jason Eisner

Main category: cs.CL

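TL;DR: Proposes Prompt Choreography, a framework that uses a dynamic, global KV cache to run LLM calls in multi-agent workflows more efficiently, substantially reducing latency and delivering end-to-end speedups.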

Details Motivation: In multi-agent workflows, LLMs repeatedly re-encode the same messages, which causes redundant computation and hurts efficiency. Method: A dynamic, global KV cache lets each LLM call attend to an arbitrary, reordered subset of previously encoded messages and supports parallel calls; the model is fine-tuned to work with the cache. Result: Time-to-first-token per message is 2.0–6.2× faster, and some workflows see end-to-end speedups above 2.2×. Conclusion: Prompt Choreography effectively reduces redundant computation and substantially improves the execution efficiency of LLM workflows while keeping results close to the original ones. Abstract: Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0–6.2× faster time-to-first-token) and achieves substantial end-to-end speedups (>2.2×) in some workflows dominated by redundant computation.

[37] TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish

Melikşah Türker,A. Ebrar Kızıloğlu,Onur Güngör,Susan Üsküdarlı

Main category: cs.CL

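TL;DR: This paper introduces TabiBERT, a monolingual Turkish encoder trained from scratch on the ModernBERT architecture. It supports an 8,192-token context, is markedly faster and more memory-efficient at inference, and sets a new SOTA on the multi-task benchmark TabiBench.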

Details Motivation: Turkish NLP lacks a monolingual encoder trained from scratch that incorporates modern Transformer improvements (such as RoPE and FlashAttention), which limits performance and applicability on complex tasks. Method: TabiBERT adopts the ModernBERT architecture, integrating RoPE, FlashAttention, and refined normalization; it is pre-trained from scratch on one trillion tokens drawn from a large multi-domain corpus (web text, scientific publications, source code, and mathematical content); and a standardized multi-task benchmark, TabiBench, is built from 28 datasets and scored with GLUE-style macro-averaging. Result: TabiBERT supports an 8,192-token context (16x the original BERT), achieves up to 2.65x faster inference, and lowers GPU memory consumption; it scores 77.58 on TabiBench, 1.62 points above BERTurk, and reaches SOTA on five of eight task categories, including question answering (+9.55), code retrieval (+2.41), and document retrieval (+0.60); compared with the prior best model on each task, it improves by 1.47 points on average. Conclusion: TabiBERT fills the gap for a modern-architecture monolingual encoder in Turkish NLP, shows strong cross-domain generalization and efficient inference, and supports reproducible further research on Turkish language models. Abstract: Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). The model supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories: question answering (+9.55), code retrieval (+2.41), and document retrieval (+0.60).
Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.

[38] Reservoir Computing inspired Matrix Multiplication-free Language Model

Takumi Shiratsuchi,Yuichiro Tanaka,Hakaru Tamukoh

Main category: cs.CL

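TL;DR: Proposes a matrix-multiplication-free language model architecture inspired by reservoir computing. By fixing and sharing part of the weights, inserting reservoir layers, and reducing memory accesses, it lowers parameter count, training time, and inference time while keeping performance comparable to the baseline.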

Details Motivation: Large language models perform well but are computationally expensive, which limits their wider use, so improving computational efficiency to cut training and inference cost is a pressing need. Method: Starting from a matrix-multiplication-free language model (MatMul-free LM), the authors partially fix and share weights, insert reservoir layers in selected layers to produce rich dynamic representations, and combine several operations to reduce the number of memory accesses. Result: The method reduces the number of parameters by up to 19%, training time by 9.9%, and inference time by 8.0%, with performance comparable to the baseline model. Conclusion: The proposed architecture improves computational efficiency without sacrificing model performance and offers a viable direction for building efficient language models. Abstract: Large language models (LLMs) have achieved state-of-the-art performance in natural language processing; however, their high computational cost remains a major bottleneck. In this study, we target computational efficiency by focusing on a matrix multiplication free language model (MatMul-free LM) and further reducing the training cost through an architecture inspired by reservoir computing. Specifically, we partially fix and share the weights of selected layers in the MatMul-free LM and insert reservoir layers to obtain rich dynamic representations without additional training overhead. Additionally, several operations are combined to reduce memory accesses. Experimental results show that the proposed architecture reduces the number of parameters by up to 19%, training time by 9.9%, and inference time by 8.0%, while maintaining comparable performance to the baseline model.

[39] Not too long do read: Evaluating LLM-generated extreme scientific summaries

Zhuoqi Lyu,Qing Ke

Main category: cs.CL

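TL;DR: BiomedTLDR is a large-scale dataset of researcher-written summaries of scientific papers, used to evaluate how well LLMs generate human-like extreme summaries (TLDRs). Although some models produce human-like summaries, LLMs overall lean toward extractive rather than abstractive generation and rely heavily on the source text's vocabulary and structure.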

Details Motivation: The lack of a comprehensive, high-quality scientific TLDR dataset limits the development and evaluation of LLMs for scientific summarization, so a dataset of summaries written by actual researchers is needed to analyze LLM behavior systematically. Method: The authors propose the BiomedTLDR dataset, which collects researcher-written TLDRs by exploiting the practice of attaching authors' comments to bibliography items; they then test several popular open-weight LLMs on it, comparing model and human summaries with summary-quality metrics and qualitative analysis. Result: Some open-weight LLMs generate TLDRs close to human level, but LLM summaries generally retain more of the source text's vocabulary and sentence structure, showing a stronger extractive tendency and weaker abstraction and paraphrasing than human experts. Conclusion: Current LLMs still favor extractive strategies when generating scientific TLDRs and fall short of human abstractive summarization; BiomedTLDR provides a resource for improving and evaluating LLMs' scientific summarization ability. Abstract: High-quality scientific extreme summary (TLDR) facilitates effective science communication. How do large language models (LLMs) perform in generating them? How are LLM-generated summaries different from those written by human experts? However, the lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and evaluation of LLMs' summarization ability. To address these questions, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers, which leverages the common practice of including authors' comments alongside bibliography items. We then test popular open-weight LLMs for generating TLDRs based on abstracts. Our analysis reveals that, although some of them successfully produce human-like summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures, and hence tend to be more extractive than abstractive compared to humans. Our code and datasets are available at https://github.com/netknowledge/LLM_summarization (Lyu and Ke, 2025).

[40] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen,Zeyu Ji,Qianren Mao,Junhang Cheng,Bangjie Qin,Hao Wu,Zhuoran Li,Jingzheng Li,Kai Sun,Zizhe Wang,Yikun Ban,Zhu Sun,Xiangyang Ji,Hailong Sun

Main category: cs.CL

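TL;DR: LLM-PeerReview is an unsupervised LLM ensemble method that selects the best response through multi-model scoring and score aggregation, and it clearly outperforms existing methods on several datasets.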

Details Motivation: How can the diverse strengths of multiple LLMs be exploited to improve generation quality when no human annotations or supervision signals are available? Method: The LLM-as-a-Judge technique is used to score multiple candidate responses; the scores are aggregated with either a graphical model or an averaging strategy; and the highest-scoring response is selected as the final output. Result: The method obtains strong results on four datasets, with its two variants outperforming Smoothie-Global by 6.9 and 7.3 percentage points, respectively. Conclusion: LLM-PeerReview provides a simple, interpretable, and general unsupervised ensemble framework that uses the collective wisdom of multiple models to improve generation quality. Abstract: We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; for reasoning, we can apply a principled graphical model-based truth inference algorithm or a straightforward averaging strategy to aggregate multiple scores to produce a final score for each response; finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
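
The three stages described in the abstract (score, aggregate, select) can be sketched for the simple averaging variant. This is a minimal illustration, not the authors' implementation: the function name `select_best_response`, the score-matrix layout, and the 1-10 scale are all assumptions, and the graphical-model variant is not shown.

```python
from statistics import mean


def select_best_response(scores):
    """Pick the candidate with the highest mean judge score.

    scores[j][i] is the score judge j gave to candidate i.
    (Hypothetical layout; the scoring scale is an assumption.)
    """
    n_candidates = len(scores[0])
    # Aggregate: average each candidate's scores over all judges.
    final = [mean(judge[i] for judge in scores) for i in range(n_candidates)]
    # Select: index of the highest aggregated score (ties go to the first).
    best = max(range(n_candidates), key=lambda i: final[i])
    return best, final


# Three judge models score three candidate responses (assumed 1-10 scale).
scores = [
    [7, 9, 4],
    [6, 8, 5],
    [8, 7, 3],
]
best, final = select_best_response(scores)
print(best, final)
```

In practice each row would come from prompting one of the available LLMs to grade every candidate, so the same models serve as both generators and reviewers.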

[41] Anka: A Domain-Specific Language for Reliable LLM Code Generation

Saif Khalfan Saif Al Mazrouei

Main category: cs.CL

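TL;DR: This paper introduces Anka, a domain-specific language (DSL) for data transformation pipelines whose explicit, constrained syntax reduces LLM errors on complex programming tasks. Although the LLMs were never trained on Anka, their task accuracy in Anka is notably higher than in Python, especially on multi-step tasks.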

Details Motivation: LLMs are strong at code generation but still make systematic errors on complex, multi-step programming tasks. These errors stem from the flexibility of general-purpose languages, which creates ambiguity in operation sequencing and variable management, so a more explicit and more constrained language is needed to reduce uncertainty during generation. Method: The author designs and implements Anka, a DSL for data transformation pipelines with explicit, constrained syntax, and evaluates models such as Claude 3.5 Haiku and GPT-4o-mini on 100 benchmark problems, comparing parse success and task accuracy in Anka and Python. All tests rely on in-context prompts with no additional training. Result: Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy in Anka; on multi-step tasks, Anka has a 40 percentage point accuracy advantage over Python (100% vs. 60%); GPT-4o-mini shows a +26.7 percentage point advantage. The results indicate that LLMs can learn a new DSL in context and, on specific tasks, surpass a general-purpose language on which they were extensively trained. Conclusion: A domain-specific language carefully designed for LLM generation can markedly improve code generation accuracy on complex tasks, which suggests that LLM programming ability can be improved through language design and not only through model scaling. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python's flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training.
We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.

[42] Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

Dianyun Wang,Qingsen Ma,Yuhu Shang,Zhifeng Lu,Lechen Ning,Zhenbo Xu,Huijia Wu,Zhaofeng He

Main category: cs.CL

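TL;DR: This paper proposes an interpretable adapter-initialization method for parameter-efficient fine-tuning, guided by a low-rank subspace built with Sparse Autoencoders (SAEs). Working in a disentangled feature space improves both downstream performance and transparency; on safety alignment it outperforms full fine-tuning and approaches RLHF-based methods while updating only 0.19-0.24% of parameters.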

Details Motivation: Existing low-rank adaptation methods such as LoRA learn the low-rank subspace of task-relevant weight updates implicitly from data, offering no interpretability or direct control, mainly because feature dimensions are polysemantic. The paper aims to build an explicit, interpretable low-rank subspace using mechanistic interpretability techniques to make fine-tuning more efficient and transparent. Method: Pre-trained Sparse Autoencoders (SAEs) identify task-relevant features in a disentangled feature space, and these features are used to construct an explicit, interpretable low-rank subspace that guides adapter initialization. A theoretical analysis proves that under monosemanticity assumptions SAE-based identification achieves arbitrarily small recovery error, whereas identification in polysemantic space has an irreducible error. Result: On safety alignment, the method reaches a safety rate of up to 99.6%, exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods, while updating only 0.19%-0.24% of parameters. The semantic grounding of SAE features also yields interpretable insights into the learned alignment subspace. Conclusion: Incorporating mechanistic interpretability into fine-tuning can improve both downstream performance and transparency of LLMs, supporting the value of explicitly constructed disentangled feature subspaces in parameter-efficient fine-tuning. Abstract: Parameter-efficient fine-tuning has become the dominant paradigm for adapting large language models to downstream tasks. Low-rank adaptation methods such as LoRA operate under the assumption that task-relevant weight updates reside in a low-rank subspace, yet this subspace is learned implicitly from data in a black-box manner, offering no interpretability or direct control. We hypothesize that this difficulty stems from polysemanticity, that is, individual dimensions encoding multiple entangled concepts. To address this, we leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization. We provide theoretical analysis proving that under monosemanticity assumptions, SAE-based subspace identification achieves arbitrarily small recovery error, while direct identification in polysemantic space suffers an irreducible error floor. On safety alignment, our method achieves up to 99.6% safety rate (exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods) while updating only 0.19-0.24% of parameters. Crucially, our method provides interpretable insights into the learned alignment subspace through the semantic grounding of SAE features. Our work demonstrates that incorporating mechanistic interpretability into the fine-tuning process can simultaneously improve both performance and transparency.

[43] Chinese Morph Resolution in E-commerce Live Streaming Scenarios

Jiahao Zhu,Jipeng Qiang,Ran Bai,Chenyu Liu,Xiaoye Ouyang

Main category: cs.CL

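TL;DR: This study introduces the Live Auditory Morph Resolution (LiveAMR) task, which aims to detect false advertising carried out through pronunciation variants in e-commerce live streams. It builds the first dataset for the task, with 86,790 samples, and uses LLMs to improve performance, strengthening the regulation of live-streamed content.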

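Details Motivation: Hosts often use pronunciation variants (morphs) to evade scrutiny and engage in false advertising, and existing text-level morph detection cannot handle covert violations at the level of spoken pronunciation, so pronunciation-based morphs in live-stream audio need dedicated study. Method: The LiveAMR task is cast as a text-to-text generation problem, a dedicated dataset is built, and LLMs generate additional training data to improve model performance. Result: The authors build a LiveAMR dataset of 86,790 samples; the proposed method performs well at detecting pronunciation morphs, indicating its potential to make live-stream regulation more effective. Conclusion: Introducing the LiveAMR task and combining it with LLM-generated data makes it possible to identify pronunciation morphs in e-commerce live streams, offering a practical technical route to stronger regulation of health and medical live streams. Abstract: E-commerce live streaming in China, particularly on platforms like Douyin, has become a major sales channel, but hosts often use morphs to evade scrutiny and engage in false advertising. This study introduces the Live Auditory Morph Resolution (LiveAMR) task to detect such violations. Unlike previous morph research focused on text-based evasion in social media and underground industries, LiveAMR targets pronunciation-based evasion in health and medical live streams. We constructed the first LiveAMR dataset with 86,790 samples and developed a method to transform the task into a text-to-text generation problem. By leveraging large language models (LLMs) to generate additional training data, we improved performance and demonstrated that morph resolution significantly enhances live streaming regulation.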

[44] AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration

Minjiang Huang,Jipeng Qiang,Yi Zhu,Chaowei Zhang,Xiangyu Zhao,Kui Yu

Main category: cs.CL

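TL;DR: AI4Reading is a multi-agent system built on LLMs and speech synthesis that automatically generates podcast-like audiobook interpretations, aiming for accurate content, better comprehensibility, and a logical narrative.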

Details Motivation: Producing audiobook interpretations by hand is time-consuming and labor-intensive, so automated methods are needed to cut resource use and raise production efficiency. Method: The AI4Reading system uses a collaborative framework of 11 specialized agents (for example topic analysis, case extraction, editing, narration, and proofreading) and combines LLMs with speech synthesis to generate spoken interpretations. Result: Compared with expert-produced interpretations, the scripts generated by AI4Reading are simpler and more accurate, but a gap remains in speech generation quality. Conclusion: AI4Reading can effectively support automated audiobook interpretation, performs well on content organization and comprehensibility, and leaves room to improve speech output. Abstract: Audiobook interpretations are attracting increasing attention, as they provide accessible and in-depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time-consuming and resource-intensive. To address this challenge, we propose AI4Reading, a multi-agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast-like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents, including topic analysts, case analysts, editors, a narrator, and proofreaders, that work in concert to explore themes, extract real-world cases, refine content organization, and synthesize natural spoken language. By comparing expert interpretations with our system's output, the results show that although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.

[45] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

Jiafeng Liang,Hao Li,Chang Li,Jiaqi Zhou,Shixin Jiang,Zekun Wang,Changkai Ji,Zhihao Zhu,Runxuan Liu,Tao Ren,Jinlan Fu,See-Kiong Ng,Xia Liang,Ming Liu,Bing Qin

Main category: cs.CL

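TL;DR: This survey systematically brings together interdisciplinary knowledge of memory from cognitive neuroscience and LLM-driven agents, offers a comparative framework for memory mechanisms spanning biological and artificial systems, and discusses memory benchmarks, security, and future research directions.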

Details Motivation: Constrained by interdisciplinary barriers, existing work struggles to absorb the essence of human memory mechanisms, so a bridge between cognitive neuroscience and AI agents is needed. Method: The survey reviews the definition and function of memory in cognitive neuroscience, extends them step by step to LLMs and agents, and compares memory taxonomy, storage mechanisms, and the management lifecycle from biological and artificial perspectives. Result: It provides a comprehensive comparison of biological and artificial memory systems, summarizes mainstream benchmarks for evaluating memory, and examines memory security from both attack and defense perspectives. Conclusion: It outlines future research directions such as multimodal memory systems and skill acquisition, encouraging AI agents to draw more effectively on human memory mechanisms. Abstract: Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesize interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.

[46] A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation

Xin Zhang,Yang Cao,Baoxing Wu,Xinyi Chen,Kai Song,Siying Li

Main category: cs.CL

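TL;DR: This paper proposes SGR, a stepwise reasoning enhancement framework based on external subgraph generation, to improve LLM reasoning on complex tasks.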

Details Motivation: LLMs perform poorly on tasks that require deep reasoning and logical inference and are easily affected by noisy or irrelevant information in their training data, which leads to inaccurate or factually inconsistent outputs. Method: The framework dynamically builds query-relevant subgraphs from external knowledge bases, uses their semantic structure to guide multi-step reasoning, and finally integrates multiple reasoning paths to produce the answer. Result: Experiments on multiple benchmark datasets show that SGR consistently outperforms strong baselines. Conclusion: SGR effectively reduces the influence of noisy information and improves the reasoning accuracy of LLMs. Abstract: Large Language Models (LLMs) have achieved strong performance across a wide range of natural language processing tasks in recent years, including machine translation, text generation, and question answering. As their applications extend to increasingly complex scenarios, however, LLMs continue to face challenges in tasks that require deep reasoning and logical inference. In particular, models trained on large-scale textual corpora may incorporate noisy or irrelevant information during generation, which can lead to incorrect predictions or outputs that are inconsistent with factual knowledge. To address this limitation, we propose a stepwise reasoning enhancement framework for LLMs based on external subgraph generation, termed SGR. The proposed framework dynamically constructs query-relevant subgraphs from external knowledge bases and leverages their semantic structure to guide the reasoning process. By performing reasoning in a step-by-step manner over structured subgraphs, SGR reduces the influence of noisy information and improves reasoning accuracy. Specifically, the framework first generates an external subgraph tailored to the input query, then guides the model to conduct multi-step reasoning grounded in the subgraph, and finally integrates multiple reasoning paths to produce the final answer. Experimental results on multiple benchmark datasets demonstrate that SGR consistently outperforms strong baselines, indicating its effectiveness in enhancing the reasoning capabilities of LLMs.

[47] Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data

Jiapeng Wang,Yiwen Hu,Yanzipeng Gao,Haoyu Wang,Shuo Wang,Hongyu Lu,Jiaxin Mao,Wayne Xin Zhao,Junyi Li,Xiao Zhang

Main category: cs.CL

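TL;DR: This paper proposes EntroDrop, an entropy-guided token dropout method that mitigates the performance degradation autoregressive LLMs suffer from repeated data exposure during multi-epoch training.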

Details Motivation: Because high-quality domain-specific data is scarce, multi-epoch training has become a common strategy for adapting LLMs, but it causes the model to overfit low-entropy tokens at the expense of generalization on high-entropy tokens. Method: EntroDrop selectively masks low-entropy tokens based on token entropy and uses a curriculum schedule to adjust regularization strength, balancing the learning dynamics. Result: Experiments on models from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines in multi-epoch training and keeps performance more stable. Conclusion: Aligning regularization with token-level learning dynamics is important for adapting large models under limited data, and EntroDrop offers an effective solution for this setting. Abstract: As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer from performance degradation under repeated data exposure, where overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model's ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.
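
The idea of masking low-entropy tokens under a curriculum schedule can be illustrated with a small sketch. It is not the EntroDrop implementation: the abstract does not state the entropy threshold, the shape of the schedule, or whether masking applies to the input or the loss, so the linear ramp, the `threshold` and `max_rate` values, and the loss-mask interpretation below are all assumptions, and every function name is hypothetical.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def drop_rate(step, total_steps, max_rate=0.3):
    """Assumed linear curriculum: drop probability grows with training progress."""
    return max_rate * min(1.0, step / total_steps)


def entropy_guided_mask(dists, step, total_steps,
                        threshold=0.5, max_rate=0.3, rng=None):
    """Return a keep-mask (True = token still contributes to the loss).

    Only tokens whose entropy is below `threshold` are eligible for dropping;
    high-entropy tokens are always kept.
    """
    rng = rng or random.Random(0)
    rate = drop_rate(step, total_steps, max_rate)
    mask = []
    for probs in dists:
        low_entropy = token_entropy(probs) < threshold
        dropped = low_entropy and rng.random() < rate
        mask.append(not dropped)
    return mask


# One confident (low-entropy) token and one uncertain (high-entropy) token.
dists = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3]]
print(entropy_guided_mask(dists, step=0, total_steps=100))
print(entropy_guided_mask(dists, step=100, total_steps=100, max_rate=1.0))
```

At step 0 the assumed schedule drops nothing; with `max_rate=1.0` at the final step, every low-entropy token is dropped while the high-entropy token is kept.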

[48] The Effect of Gender Diversity on Scientific Team Impact: A Team Roles Perspective

Yi Zhao,Yongjun Zhu,Donghun Kim,Yuzhuo Wang,Heng Zhang,Chao Lu,Chengzhi Zhang

Main category: cs.CL

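TL;DR: Drawing on 130,000 papers from PLOS journals, this study examines how gender diversity in leadership and support roles affects team impact (five-year citation counts). It finds an inverted U-shaped relationship between gender diversity and team impact, moderated by team size.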

Details Motivation: Prior findings on gender diversity and the success of scientific teams are inconsistent, and most studies overlook role differences within teams; this paper aims to show how gender diversity in different roles affects team impact. Method: All authors of a paper are treated as a scientific team, and author contribution statements are used to classify members into leadership and support roles. Using more than 130,000 PLOS papers, the authors apply multivariable regression and a threshold regression model to analyze the relationship between gender diversity and five-year citation counts and to test the moderating effect of team size. Result: (1) Gender diversity in both the leadership and support groups has an inverted U-shaped relationship with team impact; (2) teams with all-female leadership and all-male support achieve the highest impact; (3) leadership-group gender diversity has a significant negative effect in small teams that becomes positive but insignificant in large teams, whereas support-group gender diversity is consistently significant and positive. Conclusion: The effect of gender diversity on team impact depends on role and team size; improving team impact requires attention to the interaction between role division and size, and gender balance in support roles in particular may be beneficial. Abstract: The influence of gender diversity on the success of scientific teams is of great interest to academia. However, prior findings remain inconsistent, and most studies operationalize diversity in aggregate terms, overlooking internal role differentiation. This limitation obscures a more nuanced understanding of how gender diversity shapes team impact. In particular, the effect of gender diversity across different team roles remains poorly understood. To this end, we define a scientific team as all coauthors of a paper and measure team impact through five-year citation counts. Using author contribution statements, we classify members into leadership and support roles. Drawing on more than 130,000 papers from PLOS journals, most of which are in biomedical-related disciplines, we employ multivariable regression to examine the association between gender diversity in these roles and team impact. Furthermore, we apply a threshold regression model to investigate how team size moderates this relationship. The results show that (1) the relationship between gender diversity and team impact follows an inverted U-shape for both leadership and support groups; (2) teams with an all-female leadership group and an all-male support group achieve higher impact than other team types. Interestingly, (3) the effect of leadership-group gender diversity is significantly negative for small teams but becomes positive and statistically insignificant in large teams. In contrast, the estimates for support-group gender diversity remain significant and positive, regardless of team size.

[49] C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

Xuan Feng,Bo An,Tianlong Gu,Liang Chang,Fengrui Hao,Peipeng Yu,Shuai Zhao

Main category: cs.CL

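TL;DR: This paper proposes Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework that mitigates both stereotypical and structural biases in LLMs. It uses causal counterfactual signals to isolate bias-inducing features and dynamically suppresses shortcut features during optimization; experiments on multiple benchmarks show it is effective without harming general reasoning ability.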

Details Motivation: Existing methods usually handle stereotypical and structural biases in isolation and often worsen one while reducing the other. This paper addresses the coexistence of the two systematically and identifies latent spurious feature correlations in the input as the main cause of erroneous reasoning. Method: The C2PO framework uses causal counterfactual signals to separate bias-inducing features from valid reasoning paths and adds a fairness-sensitive preference update mechanism that evaluates contributions at the logit level and suppresses shortcut features, so that spurious correlations are discovered and suppressed directly during optimization. Result: Experiments on benchmarks covering stereotypical bias, structural bias, out-of-domain fairness, and general utility (such as BBQ, HANS, and MMLU) show that C2PO mitigates both kinds of bias while retaining strong general reasoning ability. Conclusion: C2PO offers a unified, effective way to jointly mitigate multiple biases in LLMs without sacrificing overall performance, improving trustworthiness and fairness. Abstract: Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.

[50] ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning

Yuqi Tang,Jing Yu,Zichang Su,Kehua Feng,Zhihui Zhu,Libin Wang,Lei Liang,Qiang Zhang,Keyan Ding,Huajun Chen

Main category: cs.CL

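TL;DR: This paper proposes ClinDEF, a dynamic framework based on simulated diagnostic dialogues for evaluating LLMs on clinical reasoning. It generates cases dynamically from a disease knowledge graph and adds multi-turn doctor-patient dialogue with fine-grained evaluation, addressing the limitations of existing static benchmarks.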

Details Motivation: Existing medical evaluations of LLMs rely mainly on static question answering and do not reflect the dynamic nature of real clinical reasoning, so an evaluation method closer to the actual diagnostic workflow is needed. Method: The ClinDEF framework dynamically generates patient cases from a disease knowledge graph, sets up multi-turn dialogue between an LLM doctor and an automated patient agent, and evaluates diagnostic accuracy, efficiency, and rubric-based diagnostic quality together. Result: Experiments show that ClinDEF exposes key weaknesses of state-of-the-art LLMs in clinical reasoning, particularly in information gathering, differential diagnosis, and reasoning efficiency. Conclusion: ClinDEF offers a more fine-grained and clinically meaningful paradigm for evaluating LLM clinical reasoning and advances dynamic evaluation of medical AI. Abstract: Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, determine examinations, and refine the differential diagnosis based on patients' responses. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To mitigate these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.

[51] Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Ang Lv,Jin Ma,Yiyuan Ma,Siyuan Qiao

Main category: cs.CL

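TL;DR: Proposes the expert-router coupling (ERC) loss, a lightweight auxiliary loss that strengthens the alignment between routing decisions and expert capabilities in MoE models, improving performance and enabling quantitative control of expert specialization.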

Details Motivation: Existing MoE models lack a mechanism that keeps routing decisions aligned with expert capabilities, which limits model performance. Method: Introduces the ERC loss, which uses perturbed router embeddings as proxy tokens, constrains each expert to activate more strongly on its own proxy token than on those of other experts, and ensures that each proxy token elicits its strongest activation from its corresponding expert. Result: The ERC loss is computationally efficient: it operates on only n² activations (n being the number of experts), at a cost independent of batch size. It is validated on MoE-LLMs from 3B to 15B parameters and supports quantitative tracking of expert specialization during training. Conclusion: The ERC loss effectively couples experts and routers, improves MoE performance, and offers flexible control of and insight into expert specialization. Abstract: Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
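
To make the two constraints concrete, the sketch below scores an n x n activation matrix with a hinge-style penalty. This is only an illustration under stated assumptions: the abstract does not give the loss formula, so the hinge form, the `margin` parameter, and the function name `erc_style_penalty` are hypothetical, and the activations are assumed to be precomputed scalars (for example, activation norms).

```python
def erc_style_penalty(acts, margin=0.0):
    """Hinge-style penalty over an n x n matrix of scalar activations.

    acts[i][j] is assumed to hold the activation strength of expert j on the
    proxy token (perturbed router embedding) of expert i. The hinge form and
    the margin are illustrative assumptions, not the paper's formula.
    """
    n = len(acts)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Constraint (2): proxy token i should activate its own expert i
            # more strongly than any other expert j.
            total += max(0.0, acts[i][j] - acts[i][i] + margin)
            # Constraint (1): expert j should respond more strongly to its own
            # proxy token j than to the proxy token of expert i.
            total += max(0.0, acts[i][j] - acts[j][j] + margin)
    # Only n * n activations are involved, regardless of batch size.
    return total / (n * n)


coupled = [[2.0, 1.0], [1.0, 3.0]]    # diagonal dominates rows and columns
uncoupled = [[1.0, 2.0], [0.0, 1.0]]  # expert 1 prefers expert 0's proxy token
print(erc_style_penalty(coupled), erc_style_penalty(uncoupled))
```

The penalty is zero exactly when every diagonal entry dominates its row and its column, which is one way to read the two constraints stated in the abstract.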

[52] Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

Thomas Haschka,Joseph Bakarji

Main category: cs.CL

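TL;DR: Proposes a hierarchical semantic text classification method based on nested density clustering. It uses large language model embeddings to build a tree of hierarchical semantic relationships among texts, requires no predefined categories, and suits applications such as scientometrics and topic evolution.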

Details Motivation: Although large language model embeddings capture semantic similarity, the global semantic structure of a text corpus remains opaque, and data-driven classification methods that need no predefined categories are lacking. Method: Proposes a nested density clustering approach that merges clusters from high density to low density in LLM embedding space, building a cluster tree that reflects hierarchical semantic relationships. Result: Semantic hierarchy trees were built on scientific paper abstracts, 20 Newsgroups, and IMDB, demonstrating the method's robustness and classification ability across domains. Conclusion: The method reveals hierarchical semantic structure in text datasets, supports data-driven discovery of research areas and their subfields, and has broad potential in scientometrics and topic-evolution analysis. Abstract: Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high dimensional embeddings. While LLM-embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure of semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach, to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster -- the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data, with abstracts of scientific papers as a case study. This enables the data-driven discovery of research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as the 20 Newsgroups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally we discuss possible applications in scientometrics and topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.
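
The idea of relaxing a density criterion until everything merges into one root can be illustrated with a deliberately simplified stand-in. The sketch below is not the authors' algorithm: it swaps in plain epsilon-neighbourhood connectivity (two points are linked when their Euclidean distance is at most `eps`), uses toy 2-D points in place of LLM embeddings, and the function names and `eps` schedule are hypothetical.

```python
import math
from itertools import combinations


def components(points, eps):
    """Connected components when points within distance eps are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(tuple(g) for g in groups.values())


def nested_levels(points, eps_schedule):
    """Cluster at each eps, from strict (small) to relaxed (large).

    Every cluster at a stricter level is contained in exactly one cluster at
    the next level, so consecutive levels define parent-child tree edges.
    """
    return [(eps, components(points, eps)) for eps in sorted(eps_schedule)]


toy_embeddings = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for eps, clusters in nested_levels(toy_embeddings, [1.5, 8, 30]):
    print(eps, clusters)
```

With the toy data, the two tight pairs appear first, merge into one group at the intermediate threshold, and everything collapses into a single root cluster at the loosest threshold, mirroring the dense-to-diffuse merging described in the abstract.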

[53] Automatic Detection of Complex Quotation Patterns in Aggadic Literature

Hadar Miller,Tsvi Kuflik,Moshe Lavee

Main category: cs.CL

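TL;DR: This paper proposes ACT, a three-stage algorithm for automatically detecting biblical quotations in Rabbinic literature. It combines morphology-aware alignment with context-sensitive enrichment, identifies complex quotation patterns, and surpasses existing methods in F1 score.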

Details Motivation: Existing text-reuse detection frameworks perform poorly on short, paraphrased, or structurally embedded quotations and struggle to capture complex citation styles, so an automated detection method better suited to Rabbinic literature is needed. Method: Proposes the ACT algorithm with three stages: morphology-aware text alignment, context-sensitive enrichment, and quotation classification. Different configurations (such as ACT-2 and ACT-3) are compared to assess each component's contribution, and the system is compared with leading systems and human-annotated critical editions. Result: The full ACT-QE pipeline reaches an F1 score of 0.91 (Recall 0.89, Precision 0.94), outperforming all baseline systems; ACT-2 has higher recall but lower precision, and ACT-3 trades off coverage against specificity. ACT can also classify stylistic patterns across corpora. Conclusion: ACT addresses the challenge of quotation detection in morphologically rich, citation-dense texts, narrows the methodological gap between automatic machine detection and human editorial judgment, and provides a generalizable technical basis for digital humanities and computational philology. Abstract: This paper presents ACT (Allocate Connections between Texts), a novel three-stage algorithm for the automatic detection of biblical quotations in Rabbinic literature. Unlike existing text reuse frameworks that struggle with short, paraphrased, or structurally embedded quotations, ACT combines a morphology-aware alignment algorithm with a context-sensitive enrichment stage that identifies complex citation patterns such as "Wave" and "Echo" quotations. Our approach was evaluated against leading systems, including Dicta, Passim, Text-Matcher, as well as human-annotated critical editions. We further assessed three ACT configurations to isolate the contribution of each component. Results demonstrate that the full ACT pipeline (ACT-QE) outperforms all baselines, achieving an F1 score of 0.91, with superior Recall (0.89) and Precision (0.94). Notably, ACT-2, which lacks stylistic enrichment, achieves higher Recall (0.90) but suffers in Precision, while ACT-3, using longer n-grams, offers a tradeoff between coverage and specificity. In addition to improving quotation detection, ACT's ability to classify stylistic patterns across corpora opens new avenues for genre classification and intertextual analysis. This work contributes to digital humanities and computational philology by addressing the methodological gap between exhaustive machine-based detection and human editorial judgment. ACT lays a foundation for broader applications in historical textual analysis, especially in morphologically rich and citation-dense traditions like Aggadic literature.
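
As a quick consistency check on the reported figures, the standard F1 definition (the harmonic mean of precision and recall) can be applied to the ACT-QE precision and recall quoted above. The helper name `f1_score` is just an illustrative choice; the formula itself is the usual one.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the full ACT-QE pipeline.
act_qe_f1 = f1_score(precision=0.94, recall=0.89)
print(round(act_qe_f1, 2))  # 0.91
```

The result rounds to the 0.91 stated in the abstract, so the three reported numbers are mutually consistent.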

[54] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

Fengjiao Chen,Minhao Jing,Weitao Lu,Yan Feng,Xiaoyu Li,Xuezhi Cao

Main category: cs.CL

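TL;DR: This paper studies whether generation tasks improve visual understanding under large-scale pretraining. It proposes the UniHetero model and finds that generating semantics, rather than pixels, benefits understanding, and that generation shows better data scaling and data utilization.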

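Details Motivation: To explore whether visual generation tasks can enhance visual understanding at large data scale, advancing unified vision-language models. Method: Proposes a concise unified model, UniHetero, pretrains it on more than 200 million samples, and analyzes the relationship between generation and understanding. Result: Semantic generation improves understanding performance; generation tasks show a better data-scaling trend and higher data utilization; autoregression on input embeddings helps capture visual details. Conclusion: Generation can promote understanding, but the key is to generate semantics rather than pixels, which offers useful guidance for unifying visual understanding and generation models. Abstract: Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored on large data scale. In this work, we analyze the unified model with a concise structure, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but Only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective to capture visual details.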

[55] Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias

Hazel Kim,Philip Torr

Main category: cs.CL

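TL;DR: This paper proposes Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework for mitigating input confirmation bias in large language models. By mixing experts based on different latent concepts, MoLaCE improves robustness while staying computationally efficient, and it can be combined with multi-agent debate to reduce correlated errors.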

Details Motivation: Large language models are vulnerable to input confirmation bias: when a prompt implies an answer, the model tends to reinforce that bias rather than explore alternatives. This is already harmful in base models and riskier still in multi-agent debate, where echo chambers amplify bias instead of correcting it, so an effective mitigation mechanism is needed. Method: Proposes the MoLaCE framework, which exploits the compositional nature of language: different prompts reweight latent concepts in prompt-specific ways, so multiple experts are instantiated as different activation strengths over latent concepts and mixed at inference time to reduce confirmation bias. The method can emulate the benefits of debate within a single LLM and can also be integrated into multi-agent systems. Result: Experiments show that MoLaCE consistently reduces confirmation bias and improves robustness, matching or surpassing multi-agent debate across tasks while using only a fraction of its computation. Conclusion: MoLaCE is an efficient, scalable method for mitigating confirmation bias in large language models, applicable both within a single model and in multi-agent frameworks, and offers a new way to reduce correlated errors and improve decision quality. Abstract: Large language models (LLMs) are highly vulnerable to input confirmation bias. When a prompt implies a preferred answer, models often reinforce that bias rather than explore alternatives. This phenomenon remains underexplored, yet it is already harmful in base models and poses an even greater risk in multi-agent debate, where echo chambers reinforce bias instead of correction. We introduce Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework that addresses confirmation bias by mixing experts instantiated as different activation strengths over latent concepts that shape model responses. Our key insight is that, due to the compositional nature of language, differently phrased prompts reweight latent concepts in prompt-specific ways that affect factual correctness, so no single fixed intervention can be applied universally across inputs. This design enables a single LLM to emulate the benefits of debate internally while remaining computationally efficient and scalable. It can also be integrated into multi-agent debate frameworks to diversify perspectives and reduce correlated errors. We empirically show that it consistently reduces confirmation bias, improves robustness, and matches or surpasses multi-agent debate while requiring only a fraction of the computation.

[56] Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs

Sahil Kale,Antonio Luca Alfeo

Main category: cs.CL

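TL;DR: Proposes a knowledge-graph-based method for LLM hallucination self-detection that converts model outputs into knowledge graphs of entities and relations to identify hallucinations, substantially improving detection accuracy.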

Details Motivation: Hallucinations seriously hinder the safe deployment of large language models, and existing self-detection methods leave room for improvement. Method: Converts LLM-generated responses into knowledge graphs and uses the graph structure to estimate the likelihood that a response contains hallucinations. Result: Tested on GPT-4o and Gemini-2.5-Flash, the method achieves up to 16% relative improvement in accuracy and 20% in F1-score over standard self-detection and SelfCheckGPT. Conclusion: Structured knowledge representations strengthen an LLM's ability to analyze atomic facts; the method is low-cost and model-agnostic and helps move toward safer, more trustworthy language models. Abstract: Hallucinations, the generation of apparently convincing yet false statements, remain a major barrier to the safe deployment of LLMs. Building on the strong performance of self-detection methods, we examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Specifically, we propose a simple yet powerful approach that enriches hallucination self-detection by (i) converting LLM responses into knowledge graphs of entities and relations, and (ii) using these graphs to estimate the likelihood that a response contains hallucinations. We evaluate the proposed approach using two widely used LLMs, GPT-4o and Gemini-2.5-Flash, across two hallucination detection datasets. To support more reliable future benchmarking, one of these datasets has been manually curated and enhanced and is released as a secondary outcome of this work. Compared to standard self-detection methods and SelfCheckGPT, a state-of-the-art approach, our method achieves up to 16% relative improvement in accuracy and 20% in F1-score. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs, even when initial outputs contain inaccuracies. This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.

[57] Instruction-Following Evaluation of Large Vision-Language Models

Daiki Shiono,Shumpei Miyawaki,Ryota Tanaka,Jun Suzuki

Main category: cs.CL

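TL;DR: This study finds that the instruction-following ability of large vision-language models (LVLMs) declines after fine-tuning on commonly used datasets. Using new datasets that highlight whether the output format is specified, it shows that training data containing output-format instructions improves LVLMs' instruction-following accuracy.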

Details Motivation: To address the problem that LVLMs lose the strong instruction-following ability of the underlying LLM after visual instruction tuning. Method: Constructs new training datasets that highlight whether the output format is specified and quantitatively evaluates how fine-tuning on different datasets affects LVLMs' instruction-following ability. Result: Experiments show that fine-tuning on commonly used datasets degrades instruction following, while data containing output-format instructions improves it. Conclusion: Including samples with output-format instructions during visual instruction tuning helps mitigate the decline in LVLMs' instruction-following ability. Abstract: Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.

[58] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

Yu-Xiang Lin,Cheng-Han Chiang,Hung-yi Lee

Main category: cs.CL

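TL;DR: This paper studies the difficulty spoken language models (SLMs) have in maintaining a specified speaking style across multi-turn conversations, termed "style amnesia", and finds that current models follow the required style less well when the instruction is given in the system prompt.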

Details Motivation: To investigate why SLMs cannot sustain a specified speaking style over multi-turn conversations, focusing on paralinguistic styles such as emotion, accent, volume, and speaking speed. Method: Evaluates three proprietary and two open-source SLMs under different prompting strategies, analyzing how well they remember and carry out style instructions across turns. Result: None of the tested SLMs maintains the specified style after several turns; the models can recall the style instruction but still fail to express it; placing the instruction in user messages works better than placing it in system messages. Conclusion: SLMs exhibit pronounced style amnesia, and the design of system prompts needs rethinking to improve style consistency. Abstract: In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.

[59] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

Yuwen Li,Wei Zhang,Zelong Huang,Mason Yang,Jiajun Wu,Shawn Guo,Huahao Hu,Lingyi Sun,Jian Yang,Mingjie Tang,Byran Dai

Main category: cs.CL

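TL;DR: InfTool is a fully autonomous framework that uses self-evolving multi-agent synthesis to generate diverse, verified tool-calling trajectories from API specifications alone, with no human annotation, substantially improving the ability of large models to invoke external tools.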

Details Motivation: Large language models face three challenges when invoking external tools: reliance on expensive human annotation, poor generalization to unseen tools, and the quality ceiling of single-model synthesis, which limits coverage and reliability. Method: Proposes the InfTool framework, in which three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) automatically generate tool-calling trajectories, and the model is trained with Group Relative Policy Optimization (GRPO) with gated rewards, forming a closed loop of iterative improvement. Result: On the BFCL benchmark, InfTool raises a 32B base model from 19.8% to 70.9% accuracy, surpassing models 10x larger and approaching Claude-Opus, while training entirely on synthetic data. Conclusion: InfTool delivers large gains in tool-calling ability without human annotation, removes bottlenecks of existing methods, and shows the potential of self-evolving multi-agent frameworks for strengthening LLM tool use. Abstract: Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and quality ceilings inherent in single-model synthesis that perpetuate biases and coverage gaps. We introduce InfTool, a fully autonomous framework that breaks these barriers through self-evolving multi-agent synthesis. Given only raw API specifications, InfTool orchestrates three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) to generate diverse, verified trajectories spanning single-turn calls to complex multi-step workflows. The framework establishes a closed loop: synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, the improved model generates higher-quality data targeting capability gaps, and this cycle iterates without human intervention. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) demonstrate that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus, entirely from synthetic data without human annotation.

[60] A Dataset and Benchmark for Consumer Healthcare Question Summarization

Abhishek Basu,Deepak Gupta,Dina Demner-Fushman,Shweta Yadav

Main category: cs.CL

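TL;DR: This paper introduces CHQ-Sum, a new dataset of 1507 consumer health questions with summaries annotated by domain experts, intended to advance research on summarization systems for consumer healthcare questions.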

Details Motivation: Consumers tend to describe their medical needs with lengthy, peripheral information, which challenges natural language understanding, and the lack of domain-expert annotated datasets has held back efficient summarization systems. Method: Builds a new dataset, CHQ-Sum, drawn from a community question answering forum and annotated by domain experts, containing original questions and corresponding summaries, and benchmarks it on multiple state-of-the-art summarization models. Result: CHQ-Sum provides a valuable resource for understanding consumer health questions, and experiments with several state-of-the-art summarization models demonstrate its usefulness. Conclusion: CHQ-Sum helps advance automatic summarization of consumer health information and supports natural language processing applications in healthcare. Abstract: The quest for seeking health information has swamped the web with consumers' health-related questions. Generally, consumers use overly descriptive and peripheral information to express their medical condition or other healthcare needs, contributing to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and distill the key information of the original question. Recently, large-scale datasets have significantly propelled the development of several summarization tasks, such as multi-document summarization and dialogue summarization. However, a lack of a domain-expert annotated dataset for the consumer healthcare questions summarization task inhibits the development of an efficient summarization system. To address this issue, we introduce a new dataset, CHQ-Sum, that contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark the dataset on multiple state-of-the-art summarization models to show the effectiveness of the dataset.

[61] Nested Browser-Use Learning for Agentic Information Seeking

Baixuan Li,Jialong Wu,Wenbiao Yin,Kuan Li,Zhongwang Zhang,Huifeng Yin,Zhengwei Tao,Liwen Zhang,Pengjun Xie,Jingren Zhou,Yong Jiang

Main category: cs.CL

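TL;DR: This paper proposes NestBrowse, a nested browser-action framework that decouples interaction control from page exploration to improve the efficiency and flexibility of information-seeking agents in deep web search.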

Details Motivation: Existing information-seeking agents are mostly limited to API-level snippet retrieval and URL-based page fetching, which restricts access to richer browsing information and limits their ability in deep search tasks. Method: Proposes Nested Browser-Use Learning (NestBrowse), a minimal and complete browser-action framework that separates interaction control from page exploration through a nested structure, simplifying agent reasoning. Result: On challenging deep information-seeking benchmarks, NestBrowse shows clear practical benefits, and further analyses highlight its efficiency and flexibility. Conclusion: NestBrowse addresses the complexity introduced by full browser interaction and offers a feasible, efficient way for information-seeking agents to acquire deeper web information. Abstract: Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.

[62] Less is more: Probabilistic reduction is best explained by small-scale predictability measures

Cassandra L. Jacobs,Andrés Buxó-Lugo,Anna K. Taylor,Marie Leopold-Hooke

Main category: cs.CL

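TL;DR: This paper examines how much context is needed when studying the relationship between language model probabilities and cognitive phenomena, and shows that n-gram representations suffice as cognitive units of planning.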

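Details Motivation: To determine how much context is necessary or appropriate when studying the relationship between language model probabilities and human cognition. Method: Tests the adequacy of n-gram representations by examining whether whole utterances are required to observe probabilistic reduction. Result: N-gram representations are sufficient to capture the cognitively relevant probability patterns. Conclusion: N-grams can serve as basic units of cognitive planning in language production, without relying on the context of the full utterance. Abstract: The primary research questions of this paper center on defining the amount of context that is necessary and/or appropriate when investigating the relationship between language model probabilities and cognitive phenomena. We investigate whether whole utterances are necessary to observe probabilistic reduction and demonstrate that n-gram representations suffice as cognitive units of planning.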

[63] Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

Panagiotis Theocharopoulos,Ajinkya Kulkarni,Mathew Magimai.-Doss

Main category: cs.CL

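TL;DR: This study evaluates how multilingual hidden adversarial prompts embedded in about 500 real papers accepted to ICML affect large language model (LLM) reviews. English, Japanese, and Chinese prompts substantially change review scores and accept/reject decisions, while Arabic prompts have little effect.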

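Details Motivation: Large language models are increasingly used in high-impact workflows such as academic peer review, yet they are vulnerable to document-level hidden prompt injection, so the impact of such attacks in realistic academic settings and their variation across languages needs to be assessed. Method: Builds a dataset of about 500 papers accepted to ICML, embeds hidden adversarial instructions that are semantically equivalent across four languages, then has an LLM review the papers and analyzes the changes in scores and decisions. Result: Prompt injection substantially changes review scores and accept/reject decisions for the English, Japanese, and Chinese versions, while Arabic injections have almost no effect. Conclusion: LLM-based reviewing systems are susceptible to document-level prompt injection, and vulnerability differs markedly across languages, indicating that stronger safeguards are needed before deploying LLMs in critical review workflows. Abstract: Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.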

[64] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

Deepak Babu Piskala

Main category: cs.CL

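TL;DR: This paper presents ProfASR-Bench, an automatic speech recognition (ASR) evaluation benchmark for high-stakes professional settings (finance, medicine, legal, and technology) that stresses dense domain terminology, formal register variation, and low tolerance for critical entity errors. The benchmark supports context-conditioned recognition and introduces the "context-utilization gap" (CUG): current ASR systems accept prompts but fail to use the contextual information effectively.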

Details Motivation: Existing ASR benchmarks do not adequately reflect the practical challenges of professional settings, such as dense terminology, formal register variation, and near-zero tolerance for critical entity errors, so an evaluation tool closer to real high-stakes applications is needed. Method: Builds ProfASR-Bench, a professional-speech evaluation set spanning finance, medicine, legal, and technology, where each example pairs a natural-language prompt (domain cue or speaker profile) with an entity-rich target utterance. Models such as Whisper and Qwen-Omni are tested under no-context, profile-only, domain+profile, oracle, and adversarial conditions, with conventional WER, entity-aware scores, and slice-wise reporting by accent and gender. Result: Even with oracle prompts, lightweight textual context barely changes average word error rate (WER), and adversarial prompts do not reliably degrade performance, showing that current models fail to use available context and exhibit a clear context-utilization gap (CUG). Conclusion: Mainstream ASR systems are nominally promptable in high-stakes professional settings but do not effectively use external context to improve recognition; ProfASR-Bench provides a standardized testbed for evaluating and improving context-fusion strategies. Abstract: Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. Using representative families Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models) under matched no-context, profile, domain+profile, oracle, and adversarial conditions, we find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families.
Dataset: https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench Code: https://github.com/prdeepakbabu/ProfASR-Bench
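
For readers unfamiliar with the metric, the sketch below shows the conventional word error rate: the word-level edit distance (substitutions, insertions, deletions) divided by the reference length. It is a generic illustration and not code from the ProfASR-Bench repository; the function name, the example sentences, and the two condition labels are hypothetical, and real evaluations normally apply text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts of one utterance under two prompting conditions.
reference = "administer 5 mg of warfarin daily"
outputs = {
    "no_context": "administer 5 mg of war fern daily",
    "oracle": "administer 5 mg of warfarin daily",
}
for condition, hypothesis in outputs.items():
    print(condition, round(word_error_rate(reference, hypothesis), 3))
```

Comparing this score across the context conditions is the kind of measurement behind the reported finding; in the benchmark itself the no-context and oracle scores turn out to be nearly identical on average, unlike in this contrived example.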

[65] Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans

Sky CH-Wang,Justin Svegliato,Helen Appel,Jason Eisner

Main category: cs.CL

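TL;DR: Proposes an incremental language model fine-tuning method based on preference supervision, using improvement chains driven by fine-grained feedback to achieve more efficient alignment.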

Details Motivation: Conventional A/B preference ranking and full rewrites give no fine-grained guidance for alignment, which makes learning inefficient. Method: Annotators mark "liked" and "disliked" spans in a response and explain why; the base model rewrites the disliked spans from left to right, forming an improvement chain, and preference pairs for direct alignment are built from adjacent steps. Result: The method outperforms direct alignment based on standard A/B preference ranking or full contrastive rewrites and is more efficient and effective for preference tuning. Conclusion: Structured, revision-based supervision significantly improves preference alignment of language models. Abstract: We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking "liked" and "disliked" spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
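
The step of building preference pairs from adjacent revisions can be sketched as follows. This is a minimal illustration assuming the improvement chain is already available as an ordered list of response strings; the `PreferencePair` type, the field names, and the example texts are hypothetical and not taken from the paper's dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the later, revised response
    rejected: str  # the earlier response it was revised from


def pairs_from_chain(prompt, chain):
    """Turn an improvement chain [r0, r1, ..., rk] into k adjacent pairs.

    Each revision is treated as preferred over its immediate predecessor, so
    every pair differs only by one localized edit.
    """
    return [
        PreferencePair(prompt=prompt, chosen=later, rejected=earlier)
        for earlier, later in zip(chain, chain[1:])
    ]


# Hypothetical chain: the original response followed by two span rewrites.
chain = [
    "Paris is big. It is the capital of Germany.",
    "Paris has about two million residents. It is the capital of Germany.",
    "Paris has about two million residents. It is the capital of France.",
]
pairs = pairs_from_chain("Tell me about Paris.", chain)
for pair in pairs:
    print(repr(pair.rejected), "->", repr(pair.chosen))
```

Pairs in this shape match the (prompt, chosen, rejected) triples that direct alignment objectives typically consume, although the abstract does not specify which objective the authors use.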

[66] Eliciting Behaviors in Multi-Turn Conversations

Jing Huang,Shujian Zhang,Lun Wang,Andrew Hard,Rajiv Mathews,John Lambert

Main category: cs.CL

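TL;DR: This paper studies how to elicit specific behaviors from large language models in multi-turn conversations, proposes a framework for categorizing methods, and compares the efficiency of offline and online interaction approaches, finding online methods far more effective under a limited query budget.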

Details Motivation: Prior work focuses on eliciting model behaviors in single-turn settings and lacks effective evaluation methods for multi-turn conversations, so behavior elicitation in the multi-turn setting needs systematic study. Method: Proposes an analytical framework that groups existing methods into three families (prior knowledge only, offline interactions, and online interactions) and introduces a unified multi-turn online elicitation method for generating multi-turn test cases. Result: Online methods reach average success rates of 45/19/77% on three tasks with only a few thousand queries, whereas static methods find few or no failure cases. Conclusion: Online interaction methods are more efficient for behavior elicitation in multi-turn conversations, and the community should move toward dynamic benchmarks to evaluate large language models better. Abstract: Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.

cs.CV [Back]

[67] Characterizing Motion Encoding in Video Diffusion Timesteps

Vatsal Baherwani,Yixuan Ren,Abhinav Shrivastava

Main category: cs.CV

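TL;DR: Through a large-scale quantitative study, this paper characterizes how motion and appearance separate during denoising in text-to-video diffusion models, proposes the principle of operating within a timestep-constrained, motion-dominant regime, and simplifies existing motion customization methods.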

Details Motivation: Practitioners assume that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. The paper aims to clarify how motion is encoded across the timesteps of video diffusion models. Method: Injects new conditions over specified timestep ranges and uses the resulting trade-off between appearance editing and motion preservation as a proxy for motion encoding, running large-scale experiments across architectures to analyze how motion and appearance separate. Result: Finds a consistent early motion-dominant regime and a later appearance-dominant regime, which yields a motion-appearance boundary in timestep space; restricting training and inference to the motion-dominant regime achieves strong motion transfer without auxiliary modules or specialized objectives. Conclusion: The work turns an empirical heuristic into a spatiotemporal disentanglement principle, and the timestep-constrained recipe can be integrated directly into existing video generation and editing methods to improve motion control. Abstract: Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space. Building on this characterization, we simplify the current one-shot motion customization paradigm by restricting training and inference to the motion-dominant regime, achieving strong motion transfer without auxiliary debiasing modules or specialized objectives. Our analysis turns a widely used heuristic into a spatiotemporal disentanglement principle, and our timestep-constrained recipe can serve as ready integration into existing motion transfer and editing methods.

[68] Real-Time American Sign Language Recognition Using 3D Convolutional Neural Networks and LSTM: Architecture, Training, and Deployment

Dawnena Key

Main category: cs.CV

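TL;DR: Proposes a real-time American Sign Language recognition system that combines a 3D CNN with an LSTM, recognizes word-level ASL signs from video streams with high accuracy, and is deployed on AWS and edge devices.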

Details Motivation: To address communication barriers for more than 70 million deaf and hard-of-hearing people worldwide, efficient and accurate sign language recognition is needed. Method: Uses a 3D CNN to extract spatial-temporal features from video and an LSTM to model the sequential dependencies of gestures, forming a hybrid deep learning model that is trained and evaluated on several datasets. Result: After training on WLASL, ASL-LEX, and related data, the system achieves F1-scores from 0.71 to 0.99 across sign classes, including a curated set of 100 expert-annotated signs, and supports real-time inference. Conclusion: The hybrid architecture captures the spatial-temporal characteristics of sign language, shows good practical potential, and can be deployed in the cloud and on edge devices to make accessible communication more widely available. Abstract: This paper presents a real-time American Sign Language (ASL) recognition system utilizing a hybrid deep learning architecture combining 3D Convolutional Neural Networks (3D CNN) with Long Short-Term Memory (LSTM) networks. The system processes webcam video streams to recognize word-level ASL signs, addressing communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide. Our architecture leverages 3D convolutions to capture spatial-temporal features from video frames, followed by LSTM layers that model sequential dependencies inherent in sign language gestures. Trained on the WLASL dataset (2,000 common words), ASL-LEX lexical database (~2,700 signs), and a curated set of 100 expert-annotated ASL signs, the system achieves F1-scores ranging from 0.71 to 0.99 across sign classes. The model is deployed on AWS infrastructure with edge deployment capability on OAK-D cameras for real-time inference. We discuss the architecture design, training methodology, evaluation metrics, and deployment considerations for practical accessibility applications.

[69] Enhancing Medical Data Analysis through AI-Enhanced Locally Linear Embedding: Applications in Medical Point Location and Imagery

Hassan Khalid,Muhammad Mahad Khaliq,Muhammad Jawad Bashir

Main category: cs.CV

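TL;DR: Proposes a model that combines AI with Locally Linear Embedding (LLE) to improve the accuracy and efficiency of medical billing and transcription services.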

Details Motivation: To address the complexity and human error involved in processing high-dimensional medical data, and to raise the level of automation in medical information systems. Method: An AI-enhanced LLE mathematical model is formulated; its dimensionality-reduction capability is used to streamline medical data processing, and it is validated experimentally in real-world scenarios. Result: Experiments indicate that the model significantly improves data processing accuracy and operational efficiency. Conclusion: AI-enhanced LLE shows considerable potential for medical data processing and lays a foundation for future research into broader healthcare applications. Abstract: The rapid evolution of artificial intelligence in healthcare has opened avenues for enhancing various processes, including medical billing and transcription. This paper introduces an innovative approach by integrating AI with Locally Linear Embedding (LLE) to revolutionize the handling of high-dimensional medical data. This AI-enhanced LLE model is specifically tailored to improve the accuracy and efficiency of medical billing systems and transcription services. By automating these processes, the model aims to reduce human error and streamline operations, thereby facilitating faster and more accurate patient care documentation and financial transactions. This paper provides a comprehensive mathematical model of AI-enhanced LLE, demonstrating its application in real-world healthcare scenarios through a series of experiments. The results indicate a significant improvement in data processing accuracy and operational efficiency. This study not only underscores the potential of AI-enhanced LLE in medical data analysis but also sets a foundation for future research into broader healthcare applications.

[70] Unbiased Visual Reasoning with Controlled Visual Inputs

Zhaonan Li,Shijie Lu,Fei Wang,Jacob Dineen,Xiao Ye,Zhikun Xu,Siyi Liu,Young Min Cho,Bangzheng Li,Daniel Chang,Kenny Nguyen,Qizheng Yang,Muhao Chen,Ben Zhou

Main category: cs.CV

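TL;DR: VISTA is a modular framework that decouples perception from reasoning to make vision-language models more robust in visual question answering and less reliant on spurious correlations.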

Details Motivation: End-to-end vision-language models often answer from spurious correlations rather than genuine visual evidence, become more shortcut-prone after fine-tuning, and lack reliable reasoning. Method: In the VISTA framework, a frozen VLM acts as a sensor that answers only short, objective perception questions, while a text-only LLM acts as the reasoner that decomposes the question, plans queries, and aggregates visual facts in natural language; the reasoner is trained with reinforcement learning in a reward-aligned environment, and an explicit information bottleneck controls the flow of information. Result: With Qwen2.5-VL and Llama3.2-Vision sensors and training on only 641 curated multi-step questions, VISTA improves SpuriVerse results by 16.29% and 6.77% respectively, remains competitive on MMVP and a SeedBench subset, transfers to unseen sensors, and can recognize and recover from perception failures. Conclusion: Through its modular design and restricted perception interface, VISTA improves the robustness and interpretability of visual reasoning, reduces reliance on spurious features, and keeps the reasoning process closer to the visual evidence. Abstract: End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA's reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.

[71] SAMM2D: Scale-Aware Multi-Modal 2D Dual-Encoder for High-Sensitivity Intracrania Aneurysm Screening

Antara Titikhsha,Divyanshu Tak

Main category: cs.CV

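TL;DR: Proposes SAMM2D, a dual-encoder framework for intracranial aneurysm detection that reaches an AUC of 0.686 on the RSNA dataset, a 32% improvement over the clinical baseline. With a strong pretrained backbone, every form of data augmentation tested lowered performance, challenging the conventional view that more augmentation is always better. After decision-threshold calibration the model reaches 95% sensitivity, above average radiologist performance, and its attention is consistent with clinically relevant regions. The results suggest that strong pretraining may be more effective than elaborate augmentation strategies in medical imaging tasks.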

Details Motivation: Aneurysm detection is essential for preventing life-threatening hemorrhage but remains challenging because of subtle morphology, class imbalance, and scarce annotated data. Existing methods rely on data augmentation to compensate for limited data, yet its effectiveness on medical images has not been established. Method: SAMM2D is a dual-encoder framework with a strong ImageNet-pretrained backbone; six data augmentation regimes are evaluated systematically, and Grad-CAM visualizations are used to analyze where the model attends. Result: SAMM2D reaches an AUC of 0.686 on the RSNA dataset, 32% above the clinical baseline; the unaugmented model performs best, beating every augmented variant by 1.75 to 2.23 percentage points (p < 0.01); sensitivity reaches 95%, above radiologist level; Grad-CAM shows that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations). Conclusion: With strong pretraining, data augmentation harms rather than helps performance, suggesting that in low-data medical settings it may be redundant and may disrupt the feature manifold. The study suggests that future medical imaging work should prioritize strong pretraining over complex augmentation strategies; SAMM2D combines high performance with clinical interpretability and has potential for practical screening. Abstract: Effective aneurysm detection is essential to avert life-threatening hemorrhages, but it remains challenging due to the subtle morphology of the aneurysm, pronounced class imbalance, and the scarcity of annotated data. We introduce SAMM2D, a dual-encoder framework that achieves an AUC of 0.686 on the RSNA intracranial aneurysm dataset, an improvement of 32% over the clinical baseline. In a comprehensive ablation across six augmentation regimes, we made a striking discovery: any form of data augmentation degraded performance when coupled with a strong pretrained backbone. Our unaugmented baseline model outperformed all augmented variants by 1.75 to 2.23 percentage points (p < 0.01), overturning the assumption that "more augmentation is always better" in low-data medical settings. We hypothesize that ImageNet-pretrained features already capture robust invariances, rendering additional augmentations both redundant and disruptive to the learned feature manifold. By calibrating the decision threshold, SAMM2D reaches 95% sensitivity, surpassing average radiologist performance, and translates to a projected $13.9M in savings per 1,000 patients in screening applications. Grad-CAM visualizations confirm that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations), demonstrating the model's clinically meaningful focus. Our results suggest that future medical imaging workflows could benefit more from strong pretraining than from increasingly complex augmentation pipelines.
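
The threshold calibration step mentioned in this abstract can be sketched in a few lines. This is one generic way to choose a decision threshold that meets a target sensitivity on a validation set, not necessarily the authors' procedure; the function names and the toy scores are assumptions.

```python
import math


def threshold_for_sensitivity(scores, labels, target=0.95):
    """Highest threshold at which at least `target` of the positives score >= threshold."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive case")
    # Number of positives that must be flagged; the small epsilon guards
    # against floating-point overshoot in target * len(positives).
    needed = max(1, math.ceil(target * len(positives) - 1e-9))
    return positives[needed - 1]


def sensitivity(scores, labels, threshold):
    """Fraction of positive cases whose score reaches the threshold."""
    flagged = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    total = sum(1 for y in labels if y == 1)
    return flagged / total


# Toy validation scores (hypothetical): four positives, two negatives.
scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.1]
labels = [1, 1, 1, 1, 0, 0]
thr = threshold_for_sensitivity(scores, labels, target=0.75)
```

Lowering the threshold this way raises sensitivity at the cost of more false positives, which is the usual trade-off in screening settings.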

[72] HookMIL: Revisiting Context Modeling in Multiple Instance Learning for Computational Pathology

Xitong Ling,Minxi Ouyang,Xiaoxiao Li,Jiawen Li,Ying Chen,Yuxuan Sun,Xinrui Chen,Tian Guan,Xiaoping Liu,Yonghong He

Main category: cs.CV

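TL;DR: Proposes HookMIL, a context-aware and computationally efficient multiple instance learning framework that uses learnable hook tokens for structured contextual aggregation and achieves state-of-the-art performance in pathology image analysis.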

Details Motivation: Traditional MIL methods lose crucial contextual information, while transformer-based methods suffer from high computational complexity and redundant computation. Method: Learnable hook tokens are introduced with multimodal initialization (visual, textual, and spatial features) and interact with instances through bidirectional attention of linear complexity; a Hook Diversity Loss and a hook-to-hook communication mechanism promote specialization and reduce redundancy. Result: Experiments on four public pathology datasets show that HookMIL outperforms existing methods in performance, computational efficiency, and interpretability. Conclusion: HookMIL effectively addresses the context loss and computational inefficiency of MIL in pathology image analysis and has good application prospects. Abstract: Multiple Instance Learning (MIL) has enabled weakly supervised analysis of whole-slide images (WSIs) in computational pathology. However, traditional MIL approaches often lose crucial contextual information, while transformer-based variants, though more expressive, suffer from quadratic complexity and redundant computations. To address these limitations, we propose HookMIL, a context-aware and computationally efficient MIL framework that leverages compact, learnable hook tokens for structured contextual aggregation. These tokens can be initialized from (i) key-patch visual features, (ii) text embeddings from vision-language pathology models, and (iii) spatially grounded features from spatial transcriptomics-vision models. This multimodal initialization enables hook tokens to incorporate rich textual and spatial priors, accelerating convergence and enhancing representation quality. During training, hook tokens interact with instances through bidirectional attention with linear complexity. To further promote specialization, we introduce a Hook Diversity Loss that encourages each token to focus on distinct histopathological patterns. Additionally, a hook-to-hook communication mechanism refines contextual interactions while minimizing redundancy. Extensive experiments on four public pathology datasets demonstrate that HookMIL achieves state-of-the-art performance, with improved computational efficiency and interpretability. Code is available at https://github.com/lingxitong/HookMIL.

[73] Tiny-YOLOSAM: Fast Hybrid Image Segmentation

Kenneth Xu,Songhan Wu

Main category: cs.CV

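TL;DR: Proposes Tiny-YOLOSAM, a hybrid segmentation pipeline that combines a YOLO detector with TinySAM, substantially improving speed and coverage while keeping mask quality high.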

Details Motivation: TinySAM is lightweight but still depends on dense prompting, which makes it slow and low in coverage in practice and hard to use where real-time performance is required. Method: YOLOv12 generates bounding-box prompts for foreground objects as input to TinySAM, and sparse point prompts are added in uncovered regions, giving efficient full-scene segmentation. Result: On COCO val2017, AR rises from 16.4% to 77.1% and mIoU from 19.2% to 67.8%, while end-to-end runtime drops from 49.20 s/image to 10.39 s/image (4.7x faster). Conclusion: Detector-guided prompting combined with sparse sampling is an effective alternative to dense "segment-everything" prompting and suits practical full-scene segmentation. Abstract: The Segment Anything Model (SAM) enables promptable, high-quality segmentation but is often too computationally expensive for latency-critical settings. TinySAM is a lightweight, distilled SAM variant that preserves strong zero-shot mask quality, yet its "segment-everything" mode still requires hundreds of prompts and remains slow in practice. We first replicate TinySAM on COCO val2017 using official checkpoints, matching the reported AP within 0.03%, establishing a reliable experimental baseline. Building on this, we propose Tiny-YOLOSAM, a fast hybrid pipeline that uses a recent YOLO detector (YOLOv12) to generate box prompts for TinySAM on salient foreground objects, and supplements uncovered regions with sparse point prompts sampled only where YOLO-guided masks provide no coverage. On COCO val2017, the hybrid system substantially improves class-agnostic coverage (AR from 16.4% to 77.1%, mIoU from 19.2% to 67.8%) while reducing end-to-end runtime from 49.20s/image to 10.39s/image (4.7x) on an Apple M1 Pro CPU. These results suggest detector-guided prompting combined with targeted sparse sampling as an effective alternative to dense "segment-everything" prompting for practical full-scene segmentation.
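
The second stage of this pipeline, sampling point prompts only where the detector-guided masks leave gaps, can be sketched without any model code. The function below assumes the union of those masks is already available as a 2D boolean grid; the grid-with-stride sampling scheme and all names are illustrative assumptions rather than the paper's exact procedure.

```python
def sparse_points_outside_coverage(covered, stride):
    """Return (x, y) point prompts on a regular grid, skipping covered pixels.

    covered: 2D list of bools, True where a detector-guided mask already exists.
    stride:  spacing of the sampling grid in pixels.
    """
    points = []
    for y in range(stride // 2, len(covered), stride):
        for x in range(stride // 2, len(covered[0]), stride):
            if not covered[y][x]:
                points.append((x, y))
    return points


# Toy 4x4 coverage map: the left half is already covered.
covered = [[x < 2 for x in range(4)] for _ in range(4)]
prompts = sparse_points_outside_coverage(covered, stride=2)
```

Each returned point would then be passed to the segmenter as a single point prompt, so the number of prompts scales with the uncovered area instead of the full image.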

[74] Quadrant Segmentation VLM with Few-Shot Adaptation and OCT Learning-based Explainability Methods for Diabetic Retinopathy

Shivum Telang

Main category: cs.CV

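TL;DR: Proposes a multimodal explainability model based on a vision-language model (VLM) and few-shot learning that analyzes lesion distributions in fundus and OCT images, generates natural-language descriptions and Grad-CAM heatmaps, and mimics an ophthalmologist's reasoning to improve the classification and explanation of diabetic retinopathy (DR).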

Details Motivation: Existing DR diagnostic models rely on a single imaging modality and only localize lesions, without clinically interpretable explanations of the classification; manual lesion annotation is time-consuming and laborious, and clinicians need models that explain the reasoning behind a classification. Method: A VLM with few-shot learning analyzes lesion distributions across the four retinal quadrants and generates natural-language explanations, while Grad-CAM produces heatmaps of neuron weights for both OCT and fundus images, giving multimodal visual explanations. Result: The method is validated on a dataset of 3,000 fundus images and 1,000 OCT images; the model identifies DR lesions accurately and provides visual evidence for its classifications, improving interpretability and practicality. Conclusion: The multimodal explainable model overcomes the limitations of traditional approaches in interpretability, single-modality input, and annotation cost, providing a more practical and comprehensive tool for DR screening, treatment, and research. Abstract: Diabetic Retinopathy (DR) is a leading cause of vision loss worldwide, requiring early detection to preserve sight. Limited access to physicians often leaves DR undiagnosed. To address this, AI models utilize lesion segmentation for interpretability; however, manually annotating lesions is impractical for clinicians. Physicians require a model that explains the reasoning for classifications rather than just highlighting lesion locations. Furthermore, current models are one-dimensional, relying on a single imaging modality for explainability and achieving limited effectiveness. In contrast, a quantitative-detection system that identifies individual DR lesions in natural language would overcome these limitations, enabling diverse applications in screening, treatment, and research settings. To address this issue, this paper presents a novel multimodal explainability model utilizing a VLM with few-shot learning, which mimics an ophthalmologist's reasoning by analyzing lesion distributions within retinal quadrants for fundus images. The model generates paired Grad-CAM heatmaps, showcasing individual neuron weights across both OCT and fundus images, which visually highlight the regions contributing to DR severity classification. Using a dataset of 3,000 fundus images and 1,000 OCT images, this innovative methodology addresses key limitations in current DR diagnostics, offering a practical and comprehensive tool for improving patient outcomes.

[75] TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting

Qiang Guo,Rubo Zhang,Bingbing Zhang,Junjie Liu,Jianqing Liu

Main category: cs.CV

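TL;DR: Proposes TCFormer, an ultra-lightweight weakly-supervised transformer with only 5 million parameters that combines global context-aware feature extraction, a learnable density-weighted aggregation module, and a density-level classification loss to count crowds efficiently and accurately from image-level labels alone, making it suitable for edge devices.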

Details Motivation: Existing crowd counting methods depend on dense annotations and computationally expensive models, which limits their use in resource-constrained settings. A lighter, weakly-supervised method is needed to improve scalability. Method: An efficient vision transformer serves as the backbone; a Learnable Density-Weighted Averaging module dynamically weights local tokens by predicted density; and a density-level classification loss discretizes density into several grades to regularize training. Result: The method is validated on four benchmarks (ShanghaiTech A/B, UCF-QNRF, and NWPU); TCFormer reaches competitive accuracy with a very small parameter count and significantly outperforms other lightweight models. Conclusion: TCFormer balances parameter efficiency and counting accuracy well and is a weakly-supervised crowd counting solution suited to deployment on edge devices. Abstract: Crowd counting typically relies on labor-intensive point-level annotations and computationally intensive backbones, restricting its scalability and deployment in resource-constrained environments. To address these challenges, this paper proposes TCFormer, a tiny, ultra-lightweight, weakly-supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance. Firstly, a powerful yet efficient vision transformer is adopted as the feature extractor, whose global context-aware capabilities provide semantically meaningful crowd features with a minimal memory footprint. Secondly, to compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module. This module dynamically re-weights local tokens according to predicted density scores, enabling the network to adaptively modulate regional features based on their specific density characteristics without the need for additional annotations. Furthermore, this paper introduces a density-level classification loss, which discretizes crowd density into distinct grades, thereby regularizing the training process and enhancing the model's classification power across varying levels of crowd density. Therefore, although TCFormer is trained under a weakly-supervised paradigm utilizing only image-level global counts, the joint optimization of count and density-level losses enables the framework to achieve high estimation accuracy.
Extensive experiments on four benchmarks including ShanghaiTech A/B, UCF-QNRF, and NWPU datasets demonstrate that our approach strikes a superior trade-off between parameter efficiency and counting accuracy and can be a good solution for crowd counting tasks on edge devices.
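
The density-weighted aggregation described in this abstract can be sketched as a weighted mean of token features. The abstract does not say how density scores are normalized, so the softmax used here is an assumption, as are the function name and the toy inputs.

```python
import math


def density_weighted_average(tokens, density_scores):
    """Pool token features into one vector, weighting each token by its density score.

    tokens:         list of equal-length feature vectors, one per local token.
    density_scores: one predicted density score per token.
    Softmax normalization is an assumption; the paper's module may differ.
    """
    peak = max(density_scores)
    exps = [math.exp(s - peak) for s in density_scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


# With equal scores the result is the plain mean of the tokens.
pooled_equal = density_weighted_average([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
# A much larger score for the second token pulls the result toward it.
pooled_dense = density_weighted_average([[1.0, 2.0], [3.0, 4.0]], [0.0, 50.0])
```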

[76] A CNN-Based Malaria Diagnosis from Blood Cell Images with SHAP and LIME Explainability

Md. Ismiel Hossen Abir,Awolad Hossain

Main category: cs.CV

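TL;DR: Proposes a deep learning approach based on a custom convolutional neural network (CNN) that automatically classifies malaria-infected blood cell images with 96% accuracy, and applies SHAP, LIME, and saliency maps for explainability, showing potential for fast, accurate diagnosis in resource-limited regions.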

Details Motivation: Traditional malaria diagnostics such as microscopic blood smear analysis have low sensitivity, depend on expert judgment, and are resource-intensive, making them especially hard to apply in remote areas; a more efficient, automated diagnostic approach is needed. Method: A custom CNN is designed and trained to classify blood cell images and is compared with pretrained models including ResNet50, VGG16, MobileNetV2, and DenseNet121; explainable AI techniques (SHAP, LIME, and saliency maps) are used to improve interpretability. Result: The custom CNN reaches 96% accuracy on malaria image classification, with precision and recall both above 0.95; it performs well against other mainstream models, and the explainability techniques reveal the key image regions the model attends to. Conclusion: Deep learning, in particular a custom CNN combined with explainable AI, can deliver fast, accurate, and interpretable malaria diagnosis suitable for deployment in resource-poor settings. Abstract: Malaria remains a prevalent health concern in regions with tropical and subtropical climates. The cause of malaria is the Plasmodium parasite, which is transmitted through the bites of infected female Anopheles mosquitoes. Traditional diagnostic methods, such as microscopic blood smear analysis, are low in sensitivity, depend on expert judgment, and require resources that may not be available in remote settings. To overcome these limitations, this study proposes a deep learning-based approach utilizing a custom Convolutional Neural Network (CNN) to automatically classify blood cell images as parasitized or uninfected. The model achieves an accuracy of 96%, with precision and recall scores exceeding 0.95 for both classes. This study also compares the custom CNN with established deep learning architectures, including ResNet50, VGG16, MobileNetV2, and DenseNet121. To enhance model interpretability, Explainable AI techniques such as SHAP, LIME, and Saliency Maps are applied. The proposed system shows how deep learning can provide quick, accurate and understandable malaria diagnosis, especially in areas with limited resources.

[77] Signal-SGN++: Topology-Enhanced Time-Frequency Spiking Graph Network for Skeleton-Based Action Recognition

Naichuan Zheng,Xiahai Lun,Weiyi Li,Yuchen Du

Main category: cs.CV

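TL;DR: Proposes Signal-SGN++, a topology-aware spiking graph framework that combines the strengths of graph convolution and spiking neural networks; modules such as 1D-SGC, FSC, and TSSA extract spatiotemporal and spectral features efficiently, significantly improving action recognition while keeping energy consumption low.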

Details Motivation: GCNs for skeleton-based action recognition are computationally dense and energy-hungry; SNNs are energy-efficient but struggle to capture the coupled time-frequency and topological dependencies of human motion, so a method that offers both efficiency and modeling capacity is needed. Method: Signal-SGN++ uses a backbone of 1D-SGC and FSC for spatiotemporal and spectral feature extraction, embeds a TSSA mechanism that allocates attention adaptively over learned topologies, and adds an MWTF branch with a TATF unit for multi-scale time-frequency fusion that preserves structural priors. Result: On large-scale benchmarks, Signal-SGN++ substantially reduces energy consumption while outperforming existing SNN methods and matching state-of-the-art GCNs in recognition accuracy, giving a better accuracy-efficiency trade-off. Conclusion: Signal-SGN++ combines the topological modeling capability of GCNs with the energy efficiency of SNNs and, through structural adaptivity and time-frequency dynamics, offers a new paradigm for efficient skeleton-based action recognition. Abstract: Graph Convolutional Networks (GCNs) demonstrate strong capability in modeling skeletal topology for action recognition, yet their dense floating-point computations incur high energy costs. Spiking Neural Networks (SNNs), characterized by event-driven and sparse activation, offer energy efficiency but remain limited in capturing coupled temporal-frequency and topological dependencies of human motion. To bridge this gap, this article proposes Signal-SGN++, a topology-aware spiking graph framework that integrates structural adaptivity with time-frequency spiking dynamics. The network employs a backbone composed of 1D Spiking Graph Convolution (1D-SGC) and Frequency Spiking Convolution (FSC) for joint spatiotemporal and spectral feature extraction. Within this backbone, a Topology-Shift Self-Attention (TSSA) mechanism is embedded to adaptively route attention across learned skeletal topologies, enhancing graph-level sensitivity without increasing computational complexity. Moreover, an auxiliary Multi-Scale Wavelet Transform Fusion (MWTF) branch decomposes spiking features into multi-resolution temporal-frequency representations, wherein a Topology-Aware Time-Frequency Fusion (TATF) unit incorporates structural priors to preserve topology-consistent spectral fusion. Comprehensive experiments on large-scale benchmarks validate that Signal-SGN++ achieves superior accuracy-efficiency trade-offs, outperforming existing SNN-based methods and achieving competitive results against state-of-the-art GCNs under substantially reduced energy consumption.

[78] VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition

Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Fadi Dornaika,Cosimo Distante,Abdenour Hadid

Main category: cs.CV

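TL;DR: Proposes VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders that refines the alignment of image and prompt embeddings through cross-attention fusion and achieves strong results on several pedestrian attribute recognition benchmarks.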

Details Motivation: To address class imbalance, complex dependencies among attributes, and domain shift in pedestrian attribute recognition. Method: Frozen SigLIP 2 multilingual encoders are used, and a compact cross-attention mechanism fuses visual features with prompt embeddings and refines their alignment. Result: Sets a new state of the art on PA100K and significantly improves mean accuracy on PETA and Market-1501. Conclusion: Large-scale vision-language pretraining combined with targeted cross-modal refinement effectively addresses imbalance and generalization challenges in PAR. Abstract: Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. By refining visual features through a compact cross-attention fusion that aligns image and prompt embeddings, VLM-PAR achieves a significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state-of-the-art performance, while also delivering significant gains in mean accuracy across PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in PAR.

[79] Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark

Hieu Minh Nguyen,Tam Le-Thanh Dang,Kiet Van Nguyen

Main category: cs.CV

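TL;DR: Introduces ViSignVQA, the first large-scale visual question answering (VQA) dataset for Vietnamese signboard text, with 10,762 images and 25,573 question-answer pairs, and combines OCR with a multi-agent framework to improve performance, underscoring the importance of domain-specific resources for VQA in low-resource languages.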

Details Motivation: Signboard text understanding in low-resource languages such as Vietnamese is underexplored in VQA research, which lacks datasets and methods that reflect the linguistic, cultural, and visual diversity of real scenes. Method: The ViSignVQA dataset is built and benchmarked by adapting state-of-the-art VQA models such as BLIP-2 and LaTr with a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5); a multi-agent VQA framework combining perception and reasoning agents with GPT-4 is also proposed. Result: Appending OCR text improves F1-score by up to 209%; the proposed multi-agent framework reaches 75.98% accuracy via majority voting. Conclusion: ViSignVQA fills a data gap in Vietnamese scene-text understanding, confirms the value of OCR-enhanced context and multi-agent reasoning for VQA in low-resource languages, and provides an important benchmark for related research. Abstract: Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework combining perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark capturing real-world scene text characteristics and supporting the development and evaluation of OCR-integrated VQA models in Vietnamese.
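
The majority-voting step in the multi-agent framework can be illustrated with a short sketch. It assumes each agent returns a free-text answer and that answers are compared after trimming whitespace and lowercasing; that normalization, the function name, and the example answers are assumptions, not details from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the agents' outputs."""
    if not answers:
        raise ValueError("need at least one answer")
    # Light normalization so trivially different strings count as one answer.
    normalized = [a.strip().lower() for a in answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


# Hypothetical outputs from three agents for one signboard question.
final_answer = majority_vote(["Phở Hòa", "phở hòa ", "Phở Hà"])
```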

[80] On Extending Semantic Abstraction for Efficient Search of Hidden Objects

Tasha Pais,Nikhilesh Belulkar

Main category: cs.CV

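TL;DR: Proposes a method based on Semantic Abstraction that uses the relevancy activations of 2D vision-language models to localize and complete occluded hidden objects, and uses historical placement data to make 3D search more efficient, significantly outperforming random search.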

Details Motivation: For household robots to find lost or occluded objects more efficiently, a 3D localization method is needed that can handle hidden objects that cannot be identified directly. Method: Relevancy maps from 2D VLMs are treated as "abstract object" representations and combined with historical data on where objects are frequently placed to learn 3D localization and shape completion for hidden objects. Result: The model accurately identifies the complete 3D location of a hidden object on the first try and searches markedly faster than a naive random search. Conclusion: The extended Semantic Abstraction framework gives household robots the ability to find hidden objects quickly, saving time and effort. Abstract: Semantic Abstraction's key observation is that 2D VLMs' relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as "abstract object" representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try, significantly faster than a naive random search. These extensions to Semantic Abstraction aim to provide household robots with the skills necessary to save time and effort when looking for lost objects.

[81] VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs

Naishan Zheng,Jie Huang,Qingpei Guo,Feng Zhao

Main category: cs.CV

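TL;DR: VideoScaffold is a dynamic representation framework for streaming video understanding that adaptively adjusts event granularity while preserving fine-grained visual semantics, enabling efficient understanding of long videos.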

Details Motivation: Existing static methods such as sparse sampling and frame compression tend to produce fragmented or over-compressed outputs on continuous video streams and struggle to meet the temporal coherence and semantic completeness that streaming scenarios require. Method: VideoScaffold comprises Elastic-Scale Event Segmentation (EES) and Hierarchical Event Consolidation (HEC): EES uses prediction-guided segmentation to refine event boundaries dynamically, and HEC progressively aggregates semantically related segments into multi-level abstractions. Result: It achieves state-of-the-art performance on both offline and streaming video understanding benchmarks, and it is modular and plug-and-play, extending image-based MLLMs to continuous video understanding. Conclusion: VideoScaffold bridges fine-grained frame understanding and high-level event reasoning, transitions smoothly across videos of different lengths, and significantly improves streaming long-video understanding. Abstract: Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at https://github.com/zheng980629/VideoScaffold.

[82] KAN-FPN-Stem: A KAN-Enhanced Feature Pyramid Stem for Boosting ViT-based Pose Estimation

HaoNan Tang

Main category: cs.CV

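TL;DR: Proposes a KAN-enhanced FPN-Stem architecture that improves Vision Transformers on dense prediction tasks such as pose estimation. The core change replaces the conventional 3x3 convolution in the FPN with a KAN-based convolutional layer for non-linear smoothing after multi-scale feature fusion, significantly improving fusion quality.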

Details Motivation: The front ends of existing ViT models such as ViTPose are overly simple; in particular, the patchification mechanism loses multi-scale information and limits performance. The author aims to identify and address the real bottleneck, which is insufficient non-linear smoothing during feature fusion rather than the absence of attention modules. Method: The classic "upsample-and-add" fusion structure of the FPN is retained, and the terminal standard 3x3 convolution is replaced with a KAN-based convolutional layer with stronger non-linear modeling capacity, which adaptively corrects artifacts produced during multi-scale fusion. Result: On COCO, the method improves on the lightweight ViTPose-S baseline by up to +2.0 AP. Ablations indicate that the bottleneck lies in the quality of feature fusion rather than in adding attention mechanisms. Conclusion: The performance bottleneck of the ViT front end lies mainly in the quality of feature fusion (Fusion) rather than in feature refinement (Attention). The KAN operator offers an effective way to address it, and the proposed module is plug-and-play and high-performing. Abstract: Vision Transformers (ViT) have demonstrated significant promise in dense prediction tasks such as pose estimation. However, their performance is frequently constrained by the overly simplistic front-end designs employed in models like ViTPose. This naive patchification mechanism struggles to effectively handle multi-scale variations and results in irreversible information loss during the initial feature extraction phase. To overcome this limitation, we introduce a novel KAN-enhanced FPN-Stem architecture. Through rigorous ablation studies, we first identified that the true bottleneck for performance improvement lies not in plug-and-play attention modules (e.g., CBAM), but in the post-fusion non-linear smoothing step within the FPN. Guided by this insight, our core innovation is to retain the classic "upsample-and-add" fusion stream of the FPN, but replace its terminal, standard linear 3x3 smoothing convolution with a powerful KAN-based convolutional layer. Leveraging its superior non-linear modeling capabilities, this KAN-based layer adaptively learns and rectifies the "artifacts" generated during the multi-scale fusion process. Extensive experiments on the COCO dataset demonstrate that our KAN-FPN-Stem achieves a significant performance boost of up to +2.0 AP over the lightweight ViTPose-S baseline. This work not only delivers a plug-and-play, high-performance module but, more importantly, reveals that the performance bottleneck in the ViT front end often lies not in 'feature refinement' (Attention), but in the quality of 'feature fusion' (Fusion).
Furthermore, it provides an effective path to address this bottleneck through the introduction of the KAN operator.
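
The "upsample-and-add" fusion that this abstract keeps from the FPN can be sketched on plain nested lists. The sketch covers only the fusion step on single-channel maps with nearest-neighbour 2x upsampling (an assumption); the smoothing layer that follows, whether a 3x3 convolution or the paper's KAN-based layer, is deliberately left out, and the function names are illustrative.

```python
def upsample2x(feature_map):
    """Nearest-neighbour 2x upsampling of a 2D single-channel map."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                             # repeat each row
    return out


def upsample_and_add(coarse, fine):
    """Fuse a coarse pyramid level into the next finer level by upsampling and adding."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine level must be exactly twice the size of the coarse level")
    return [[u + f for u, f in zip(up_row, fine_row)] for up_row, fine_row in zip(up, fine)]


fused = upsample_and_add([[1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

The blocky pattern that nearest-neighbour upsampling leaves in `fused` is the kind of fusion artifact a subsequent smoothing layer is meant to clean up.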

[83] Meta-information Guided Cross-domain Synergistic Diffusion Model for Low-dose PET Reconstruction

Mengxiao Geng,Ran Hong,Xiaoling Xu,Bingxuan Li,Qiegen Liu

Main category: cs.CV

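TL;DR: Proposes a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that fuses projection-domain and image-domain information with clinical meta-information to significantly improve low-dose PET image quality.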

Details Motivation: Existing low-dose PET reconstruction methods often ignore projection-domain physical priors and patient-specific meta-information, resulting in heavy noise, low contrast, and lost physiological detail, and making it hard to relate function and semantics effectively. Method: A meta-information encoding module converts clinical parameters into semantic prompts, and a cross-domain architecture is introduced: a sinogram adapter captures global physical structure in the projection domain, detail is recovered in the image domain, and a diffusion model ties the two domains together for joint optimization. Result: On the UDPET public dataset and several clinical datasets, MiG-DM outperforms state-of-the-art methods at different dose levels, significantly improving image quality and better preserving physiological detail. Conclusion: By combining multimodal priors with dual-domain processing, MiG-DM improves low-dose PET reconstruction and shows strong potential for clinical use. Abstract: Low-dose PET imaging is crucial for reducing patient radiation exposure but faces challenges like noise interference, reduced contrast, and difficulty in preserving physiological details. Existing methods often neglect both projection-domain physics knowledge and patient-specific meta-information, which are critical for functional-semantic correlation mining. In this study, we introduce a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that integrates comprehensive cross-modal priors to generate high-quality PET images. Specifically, a meta-information encoding module transforms clinical parameters into semantic prompts by considering patient characteristics, dose-related information, and semi-quantitative parameters, enabling cross-modal alignment between textual meta-information and image reconstruction. Additionally, the cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations equivalent to global image-domain filtering. Experiments on the UDPET public dataset and clinical datasets with varying dose levels demonstrate that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.

[84] Multi-objective hybrid knowledge distillation for efficient deep learning in smart agriculture

Phi-Hung Hoang,Nam-Thuan Trinh,Van-Manh Tran,Thi-Thu-Hong Phan

Main category: cs.CV

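TL;DR: Proposes a hybrid knowledge distillation framework for efficient, accurate recognition of plant varieties and diseases on resource-constrained edge devices.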

Details Motivation: To address the trade-off between computational efficiency and recognition accuracy when deploying deep learning models on edge devices in smart agriculture. Method: A lightweight student model combining inverted residual blocks with dense connectivity is trained under a ResNet18 teacher using a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Result: On rice seed variety classification, the student reaches 98.56% accuracy, only 0.09% below the teacher, with about 2.7 times less computation and more than 10 times fewer parameters; it has more than 6 times fewer parameters than DenseNet121 and over 80 times fewer than ViT while keeping comparable or better accuracy; it also performs well on several plant leaf disease datasets. Conclusion: The framework substantially reduces model complexity while maintaining high accuracy, generalizes well, and has good potential for hardware deployment in resource-constrained smart agriculture systems. Abstract: Deploying deep learning models on resource-constrained edge devices remains a major challenge in smart agriculture due to the trade-off between computational efficiency and recognition accuracy. To address this challenge, this study proposes a hybrid knowledge distillation framework for developing a lightweight yet high-performance convolutional neural network. The proposed approach designs a customized student model that combines inverted residual blocks with dense connectivity and trains it under the guidance of a ResNet18 teacher network using a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Experiments are conducted on a rice seed variety identification dataset containing nine varieties and further extended to four plant leaf disease datasets, including rice, potato, coffee, and corn, to evaluate generalization capability. On the rice seed variety classification task, the distilled student model achieves an accuracy of 98.56%, which is only 0.09% lower than the teacher model (98.65%), while requiring only 0.68 GFLOPs and approximately 1.07 million parameters. This corresponds to a reduction of about 2.7 times in computational cost and more than 10 times in model size compared with the ResNet18 teacher model. In addition, compared with representative pretrained models, the proposed student reduces the number of parameters by more than 6 times relative to DenseNet121 and by over 80 times compared with the Vision Transformer (ViT) architecture, while maintaining comparable or superior classification accuracy.
Consistent performance gains across multiple plant leaf disease datasets further demonstrate the robustness, efficiency, and strong deployment potential of the proposed framework for hardware-limited smart agriculture systems.
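
Two of the four objectives named in this abstract, hard-label supervision and response-level distillation, can be written down compactly. The sketch below uses the widely known temperature-scaled formulation; the weighting `alpha`, the temperature, and the function names are assumptions, and the feature-level and self-distillation terms are omitted because the abstract does not specify them.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=4.0):
    """alpha * cross-entropy on the hard label + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: standard cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Response-level term: match the teacher's softened output distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# When student and teacher agree, only the hard-label term contributes.
loss_agree = hybrid_kd_loss([0.0, 0.0], [0.0, 0.0], label=0)
loss_differ = hybrid_kd_loss([0.0, 0.0], [4.0, 0.0], label=0)
```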

[85] Evaluating an Adaptive Multispectral Turret System for Autonomous Tracking Across Variable Illumination Conditions

Aahan Sachdeva,Dhanvinkumar Ganeshkumar,James E. Gallagher,Tyler Treat,Edward J. Oughton

Main category: cs.CV

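TL;DR: Proposes an adaptive framework that fuses RGB and long-wave infrared (LWIR) video streams and dynamically selects the best detection model, improving autonomous robot vision across illumination conditions.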

Details Motivation: Traditional RGB detection performs poorly in low light, while thermal imaging systems lack color and texture information, which limits the use of robotic platforms in emergency services. Method: Fuse aligned RGB and LWIR images at 11 ratios and train 33 YOLO models covering three illumination conditions (no light, dim light, full light). Result: Under full light and dim light, the best fusion ratios (80/20 and 90/10) reach 92.8% and 92.0% mean confidence respectively, significantly outperforming the YOLOv5n and YOLOv11n baselines; under no light, the 40/60 fusion reaches 71.0%, slightly above the baselines but not significantly so. Conclusion: Adaptive RGB-LWIR fusion improves detection confidence and reliability across illumination conditions, strengthening autonomous robot vision in complex environments. Abstract: Autonomous robotic platforms are playing a growing role across the emergency services sector, supporting missions such as search and rescue operations in disaster zones and reconnaissance. However, traditional red-green-blue (RGB) detection pipelines struggle in low-light environments, and thermal-based systems lack color and texture information. To overcome these limitations, we present an adaptive framework that fuses RGB and long-wave infrared (LWIR) video streams at multiple fusion ratios and dynamically selects the optimal detection model for each illumination condition. We trained 33 You Only Look Once (YOLO) models on over 22,000 annotated images spanning three light levels: no-light (<10 lux), dim-light (10-1000 lux), and full-light (>1000 lux). To integrate both modalities, fusion was performed by blending aligned RGB and LWIR frames at eleven ratios, from full RGB (100/0) to full LWIR (0/100) in 10% increments. Evaluation showed that the best full-light model (80/20 RGB-LWIR) and dim-light model (90/10 fusion) achieved 92.8% and 92.0% mean confidence; both significantly outperformed the YOLOv5 nano (YOLOv5n) and YOLOv11 nano (YOLOv11n) baselines. Under no-light conditions, the top 40/60 fusion reached 71.0%, exceeding the baselines, though not by a statistically significant margin. Adaptive RGB-LWIR fusion improved detection confidence and reliability across all illumination conditions, enhancing autonomous robotic vision performance.
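The blending and model-selection steps in this abstract can be sketched in a few lines of plain Python. This is only an illustration: it assumes the blend is a pixel-wise weighted average of aligned single-channel frames, and the names `fusion_ratios`, `blend_frames`, `light_level`, and `select_ratio` are hypothetical. The lux thresholds and the best ratio per light level are taken from the abstract.

```python
def fusion_ratios(step=10):
    """(rgb_pct, lwir_pct) pairs from full RGB (100/0) to full LWIR (0/100)."""
    return [(rgb, 100 - rgb) for rgb in range(100, -1, -step)]

def blend_frames(rgb_frame, lwir_frame, rgb_pct):
    """Pixel-wise weighted average of two aligned single-channel frames.

    Frames are equally sized lists of rows of intensities (0-255).
    """
    lwir_pct = 100 - rgb_pct
    return [
        [(rgb_pct * a + lwir_pct * b) / 100 for a, b in zip(row_rgb, row_lwir)]
        for row_rgb, row_lwir in zip(rgb_frame, lwir_frame)
    ]

# Best-performing RGB/LWIR ratio per light level, as reported in the abstract.
BEST_RATIO = {
    "no-light": (40, 60),
    "dim-light": (90, 10),
    "full-light": (80, 20),
}

def light_level(lux):
    """Map an illuminance reading to the three levels used in the abstract."""
    if lux < 10:
        return "no-light"
    if lux <= 1000:
        return "dim-light"
    return "full-light"

def select_ratio(lux):
    """Pick the fusion ratio (and hence the model) for the current light level."""
    return BEST_RATIO[light_level(lux)]
```

Eleven ratios times three light levels gives the 33 trained models mentioned above; at run time only the ratio returned by `select_ratio` would be applied to each incoming frame pair.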

[86] Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models

Antara Titikhsha,Om Kulkarni,Dharun Muthaiah

Main category: cs.CV

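TL;DR: This paper strengthens geometric control in text-to-image generation models by introducing a lightweight external discriminator, the Human Perception Embedding (HPE). It separates geometric structure from visual style without specialized training and markedly improves the semantic alignment of generated images.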

Details Motivation: Existing text-to-image diffusion models generate highly detailed textures but often ignore strict geometric constraints, producing shapes that do not match human perception. The paper aims to bridge the semantic gap between human perception and generative models and to improve the models' understanding of object shape. Method: Train a lightweight HPE teacher on the THINGS triplet dataset to capture human sensitivity to object shape, and use it as an external guidance signal whose gradients are injected into the latent generation process of diffusion models such as Stable Diffusion, SiT-XL/2, and PixArt-Σ, enabling controllable separation of geometry and style. Result: Experiments show that flow models tend to drift off the target trajectory without continuous guidance; the method achieves zero-shot transfer of complex 3D shapes (such as an Eames chair) onto conflicting materials (such as pink metal), and improves semantic alignment by about 80% over unguided baselines. Conclusion: A small teacher model can effectively guide large generative systems, strengthening geometric control without specialized training and broadening the creative range of text-to-image synthesis. Abstract: Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-Σ. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.

[87] GeCo: A Differentiable Geometric Consistency Metric for Video Generation

Leslie Gu,Junhwa Hur,Charles Herrmann,Fangneng Zhan,Todd Zickler,Deqing Sun,Hanspeter Pfister

Main category: cs.CV

TL;DR: Proposes GeCo, a geometry-grounded metric for detecting geometric deformation and occlusion-inconsistency artifacts in static scenes.

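Details Motivation: Existing video generation models are prone to geometric deformation and occlusion inconsistency when rendering static scenes, and effective tools for evaluating and correcting these problems are lacking. Method: Fuse residual motion and depth priors to build interpretable, dense consistency maps that reveal the artifacts, and use GeCo as a training-free guidance loss. Result: GeCo detects and reduces deformation artifacts in video generation; a systematic benchmark of multiple models uncovers their common failure modes. Conclusion: GeCo is an effective, training-free tool for improving geometric consistency and visual quality in video generation. Abstract: We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.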

[88] The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency

Dingyu Wang,Zimu Yuan,Jiajun Liu,Shanggui Liu,Nan Zhou,Tianxing Xu,Di Huang,Dong Jiang

Main category: cs.CV

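TL;DR: This paper introduces the Bones and Joints (B&J) benchmark for evaluating the multimodal clinical reasoning of AI models on real orthopedics and sports medicine cases. It finds that current models perform poorly on open-ended multimodal tasks, with serious deficiencies in image understanding and in avoiding text-driven hallucinations.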

Details Motivation: Existing medical AI benchmarks are mostly based on exam questions or simplified vignettes and do not reflect the complex reasoning of real clinical practice, which integrates text, imaging, and other modalities; an evaluation framework closer to the actual diagnostic workflow is needed. Method: Build the B&J benchmark of 1,245 questions from real cases, covering 7 tasks that mirror the clinical reasoning pathway (such as knowledge recall, text and image interpretation, diagnosis, treatment planning, and rationale generation), and evaluate 11 vision-language models and 6 large language models against expert-derived ground truth. Result: State-of-the-art models exceed 90% accuracy on structured multiple-choice questions but reach only about 60% on open-ended tasks that require multimodal integration; vision-language models interpret medical images poorly and often produce text-driven hallucinations that ignore contradictory visual evidence; models fine-tuned for medicine show no consistent advantage. Conclusion: Current AI models are not yet capable of complex multimodal clinical reasoning, and clinical use should be limited to supportive text-based tasks; safe deployment will require fundamental breakthroughs in multimodal integration and visual understanding. Abstract: Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts.
Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.

[89] FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound

Hussain Alasmawi,Numan Saeed,Mohammad Yaqub

Main category: cs.CV

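TL;DR: This paper presents Fetal-Gauge, the first visual question answering benchmark for fetal ultrasound imaging, used to evaluate vision-language models (VLMs) across multiple clinical tasks. It shows that existing models fall far short of clinical requirements and that domain-adapted architectures and training methods are urgently needed.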

Details Motivation: Growing demand for prenatal ultrasound has led to a global shortage of sonographers, while the lack of a standardized benchmark for evaluating vision-language models on fetal ultrasound limits the application and development of deep learning in this area. Method: Build Fetal-Gauge, a large-scale dataset of over 42,000 images and 93,000 question-answer pairs covering anatomical plane identification, structure localization, fetal orientation assessment, view conformity, and clinical diagnosis, and systematically evaluate several state-of-the-art general-purpose and medical-specific vision-language models. Result: The best model reaches only 55% accuracy, far below clinical requirements, exposing serious shortcomings of current VLMs in fetal ultrasound understanding. Conclusion: Fetal-Gauge provides a solid foundation for advancing multimodal deep learning in prenatal care, highlights the need for domain-specific models, and helps address global healthcare accessibility challenges. Abstract: The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches.
Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper gets accepted.

[90] A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation

Philip Xu,David Elizondo,Raouf Hamzaoui

Main category: cs.CV

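TL;DR: Uni4D is a unified framework for large-scale open-vocabulary 3D retrieval and controllable 4D generation that achieves cross-modal understanding through three-level alignment across text, 3D models, and images.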

Details Motivation: Improve open-vocabulary 3D content retrieval and dynamic 4D generation, and address the shortcomings of existing methods in semantic alignment and cross-modal consistency. Method: Built on the Align3D 130 dataset, the framework uses a three-level alignment strategy (text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment) and introduces a 3D-text multi-head attention mechanism to improve semantic matching. Result: Experiments show that Uni4D performs well in both 3D retrieval quality and controllable 4D generation, producing high-quality, temporally consistent 4D assets. Conclusion: Uni4D effectively strengthens cross-modal semantic alignment and advances dynamic multimodal understanding and its practical applications. Abstract: We introduce Uni4D, a unified framework for large-scale open-vocabulary 3D retrieval and controlled 4D generation based on structured three-level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D-text multi-head attention and search model to optimize text-to-3D retrieval through improved semantic alignment. The framework further strengthens cross-modal alignment through three components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high-quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.

[91] Learning Dynamic Scene Reconstruction with Sinusoidal Geometric Priors

Tian Guo,Hui Yuan,Philip Xu,David Elizondo

Main category: cs.CV

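TL;DR: SirenPose is a new loss function that combines periodic activation properties with keypoint-based geometric priors to improve the spatiotemporal consistency and motion modeling accuracy of dynamic 3D scene reconstruction.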

Details Motivation: Existing methods struggle to maintain motion modeling accuracy and spatiotemporal consistency in fast-moving and multi-target scenes; SirenPose aims to solve this by introducing physics-inspired constraints. Method: Combine the periodic activation properties of sinusoidal representation networks with geometric priors from keypoint structures to design the SirenPose loss, and train on a dataset expanded to 600,000 annotated instances. Result: Models trained with SirenPose significantly outperform prior methods on spatiotemporal consistency metrics, especially under rapid motion and complex scene changes. Conclusion: By fusing periodic activations with geometric priors, SirenPose improves the accuracy and robustness of dynamic 3D scene reconstruction and suits highly dynamic, complex scenes. Abstract: We propose SirenPose, a novel loss function that combines the periodic activation properties of sinusoidal representation networks with geometric priors derived from keypoint structures to improve the accuracy of dynamic 3D scene reconstruction. Existing approaches often struggle to maintain motion modeling accuracy and spatiotemporal consistency in fast-moving and multi-target scenes. By introducing physics-inspired constraint mechanisms, SirenPose enforces coherent keypoint predictions across both spatial and temporal dimensions. We further expand the training dataset to 600,000 annotated instances to support robust learning. Experimental results demonstrate that models trained with SirenPose achieve significant improvements in spatiotemporal consistency metrics compared to prior methods, showing superior performance in handling rapid motion and complex scene changes.

[92] Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware

Vesal Ahsani,Babak Hossein Khalaj

Main category: cs.CV

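TL;DR: Proposes a single-camera driver behavior recognition system for low-cost edge devices that monitors 17 distraction- and drowsiness-related behavior classes in real time.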

Details Motivation: Achieve low-latency, accurate in-cabin driver state monitoring under constraints on compute, power, and cost. Method: Use a compact per-frame vision model, a confounder-aware label design, and a temporal decision head, deployed on Raspberry Pi 5 and Google Coral Edge TPU. Result: The system reaches about 16 FPS on Raspberry Pi 5 (INT8) and about 25 FPS on Coral Edge TPU, achieving real-time performance; runtime behavior is validated in real in-vehicle tests. Conclusion: The system delivers stable, low-latency driver behavior recognition on inexpensive hardware and provides a reliable state-perception input for human-centered vehicle intelligence. Abstract: In-cabin Driver Monitoring Systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and Google Coral Edge TPU. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label design to reduce visually similar false positives, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system covers 17 behavior classes, including multiple phone-use modes, eating/drinking, smoking, reaching behind, gaze/attention shifts, passenger interaction, grooming, control-panel interaction, yawning, and eyes-closed sleep. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions (details in Section 6), and we further validate runtime behavior in real in-vehicle tests. The optimized deployments achieve about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation on inexpensive hardware. Finally, we discuss how reliable in-cabin human-state perception can serve as an upstream input for human-centered vehicle intelligence, including emerging agentic vehicle concepts.
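A temporal decision head of the kind described in this abstract, one that alerts only when predictions are both confident and sustained, could look roughly like the sketch below. The class name `TemporalAlertHead`, the confidence threshold, and the window length are assumptions for illustration; the abstract does not give the actual rule or parameters.

```python
from collections import deque

class TemporalAlertHead:
    """Raise an alert only when one class stays confident for several frames."""

    def __init__(self, min_confidence=0.8, min_frames=8):
        # Both thresholds are hypothetical defaults, not values from the paper.
        self.min_confidence = min_confidence
        self.min_frames = min_frames
        self.history = deque(maxlen=min_frames)  # sliding window of predictions

    def update(self, label, confidence):
        """Feed one per-frame prediction; return the alerted label or None."""
        self.history.append((label, confidence))
        if len(self.history) < self.min_frames:
            return None  # not enough evidence yet
        sustained = all(past_label == label for past_label, _ in self.history)
        confident = all(c >= self.min_confidence for _, c in self.history)
        return label if sustained and confident else None
```

At the roughly 16 FPS reported for Raspberry Pi 5, a window of 8 frames would correspond to about half a second of sustained behavior before an alert fires; a single low-confidence or differently labeled frame resets the condition.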

[93] Attack-Aware Deepfake Detection under Counter-Forensic Manipulations

Noor Fatima,Hasan Faraz Khan,Muzammil Behzad

Main category: cs.CV

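TL;DR: This paper proposes an attack-aware deepfake and image-forensics detector that uses red-team training and a randomized test-time defense to deliver robustness, calibrated probabilities, and interpretable evidence under realistic deployment conditions.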

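Details Motivation: Existing detectors degrade under adversarial attacks and real-world perturbations, and lack reliable uncertainty estimates and interpretability, which makes practical deployment difficult. Method: Use a two-stream architecture in which a semantic stream encodes content with a pretrained backbone and a forensic stream extracts residual features, fused by a lightweight adapter for classification; a shallow feature pyramid network head produces tamper heatmaps under weak supervision. Training applies worst-of-K counter-forensic augmentation (such as JPEG recompression, resampling, and denoise-then-regrain), and test time adds low-cost jitters (such as resize, crop, and JPEG phase changes) with aggregated predictions. Face-box masks guide the heatmaps toward face regions. Result: Evaluation on standard deepfake datasets and a simulated surveillance setting (low light, heavy compression) reports AUC, worst-case accuracy, reliability, abstention quality, and weak-localization scores on clean and attacked samples. The results show near-perfect ranking across attacks, low calibration error, minimal abstention risk, and controlled degradation under regrain attacks. Conclusion: The method offers a modular, data-efficient, and practically deployable baseline that supports attack-aware detection, calibrated probability outputs, and actionable heatmap explanations. Abstract: This work presents an attack-aware deepfake and image-forensics detector designed for robustness, well-calibrated probabilities, and transparent evidence under realistic deployment conditions. The method combines red-team training with randomized test-time defense in a two-stream architecture, where one stream encodes semantic content using a pretrained backbone and the other extracts forensic residuals, fused via a lightweight residual adapter for classification, while a shallow Feature Pyramid Network style head produces tamper heatmaps under weak supervision. Red-team training applies worst-of-K counter-forensics per batch, including JPEG realign and recompress, resampling warps, denoise-to-regrain operations, seam smoothing, small color and gamma shifts, and social-app transcodes, while test-time defense injects low-cost jitters such as resize and crop phase changes, mild gamma variation, and JPEG phase shifts with aggregated predictions. Heatmaps are guided to concentrate within face regions using face-box masks without strict pixel-level annotations. Evaluation on existing benchmarks, including standard deepfake datasets and a surveillance-style split with low light and heavy compression, reports clean and attacked performance, AUC, worst-case accuracy, reliability, abstention quality, and weak-localization scores.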
Results demonstrate near-perfect ranking across attacks, low calibration error, minimal abstention risk, and controlled degradation under regrain, establishing a modular, data-efficient, and practically deployable baseline for attack-aware detection with calibrated probabilities and actionable heatmaps.
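The two mechanisms named in this abstract, worst-of-K counter-forensic training and test-time jitter with aggregated predictions, can be outlined generically as below. This is a sketch under assumptions: it works per sample rather than per batch, it aggregates by simple averaging, and the function names and the callables passed in (`attacks`, `loss_fn`, `predict_fn`, `jitters`) are hypothetical placeholders rather than the authors' code.

```python
import random

def worst_of_k(sample, label, attacks, loss_fn, k, rng=random):
    """Apply k randomly drawn attacks and keep the candidate with the highest loss.

    `attacks` is a list of functions sample -> attacked sample;
    `loss_fn(sample, label)` returns the current model's loss.
    """
    candidates = [rng.choice(attacks)(sample) for _ in range(k)]
    return max(candidates, key=lambda attacked: loss_fn(attacked, label))

def jittered_predict(sample, predict_fn, jitters):
    """Average the predicted fake-probability over lightly jittered copies.

    `jitters` is a list of functions sample -> perturbed sample, for example
    small resize, crop, or gamma changes; averaging is an assumed aggregation.
    """
    scores = [predict_fn(jitter(sample)) for jitter in jitters]
    return sum(scores) / len(scores)
```

Training on the hardest of K attacked variants targets the worst-case accuracy that the evaluation reports, while averaging over jittered inputs makes the final score less sensitive to any single resampling or compression phase.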

[94] PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation

Darrin Bright,Rakshith Raj,Kanchan Keisham

Main category: cs.CV

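TL;DR: Proposes PortionNet, a cross-modal knowledge distillation framework that learns point cloud features so that food nutrition can be estimated accurately from a single RGB image, with no depth sensor required.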

Details Motivation: Estimating food nutrition from a single image is difficult because 3D information is missing, and existing depth-based methods are hard to use on ordinary smartphones because they depend on depth sensors. Method: PortionNet uses a dual-mode training strategy in which a lightweight adapter network mimics point cloud representations, learning point cloud geometric features during training and needing only RGB images at inference. Result: It achieves state-of-the-art performance on the MetaFood3D dataset, outperforming previous methods, and shows strong generalization on SimpleFood45, especially for energy estimation. Conclusion: PortionNet addresses food nutrition estimation without a depth sensor, combining high accuracy with good generalization, and is suitable for mobile devices. Abstract: Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.

[95] MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Run Ling,Ke Cao,Jian Lu,Ao Ma,Haowei Liu,Runze He,Changwei Wang,Rongtao Xu,Yihua Shao,Zhanjie Zhang,Peng Wu,Guibing Guo,Wei Feng,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Xingwei Wang

Main category: cs.CV

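TL;DR: This paper proposes MoFu, a framework that tackles scale inconsistency and permutation sensitivity in multi-subject video generation. By introducing scale-aware modulation, a Fourier fusion strategy, and a stability loss, it outperforms existing methods on a newly established benchmark.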

Details Motivation: Current multi-subject video generation methods suffer from inconsistent subject scale and sensitivity to input order, which harms the naturalness and fidelity of generated videos. Method: The MoFu framework comprises 1) Scale-Aware Modulation (SMO), which uses an LLM to extract implicit scale information from the text and modulate features; 2) a Fourier fusion strategy that integrates the frequency-domain features of reference images via the FFT; and 3) a scale-permutation stability loss that improves consistency. A new benchmark with scale and permutation variations is built for evaluation. Result: Experiments show that MoFu significantly outperforms existing methods in preserving natural subject scale, visual fidelity, and overall quality, effectively mitigating scale inconsistency and permutation sensitivity. Conclusion: Through scale-aware modulation, frequency-domain fusion, and a jointly optimized loss, MoFu addresses the key challenges of multi-subject video generation in a unified way and offers a more robust solution for the field. Abstract: Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.

[96] VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

Yang Ding,Yizhen Zhang,Xin Lai,Ruihang Chu,Yujiu Yang

Main category: cs.CV

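TL;DR: This paper proposes VideoZoomer, a new agentic framework that strengthens long video understanding in multimodal large language models. With dynamically adjusted visual attention and a staged training strategy, it outperforms existing open-source models on multiple benchmarks and even rivals closed-source systems.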

Details Motivation: Existing multimodal large language models are limited by their context window when processing long videos and rely on uniform sampling or static frame selection, which can miss key information and cannot correct initial errors during reasoning. A method that dynamically focuses on important time spans is therefore needed. Method: VideoZoomer starts from a low-frame-rate overview and uses a temporal zoom tool to autonomously select key moments and obtain high-frame-rate clips, reasoning in a fine-grained, multi-turn interactive manner. Training has two stages: supervised fine-tuning on a dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to refine the agentic policy. Result: The 7B model performs strongly on multiple long video understanding and reasoning benchmarks, shows diverse and complex reasoning patterns, stays efficient under reduced frame budgets, surpasses existing open-source models, and rivals proprietary systems. Conclusion: Through dynamic visual focusing and a reinforcement-learning-driven agentic mechanism, VideoZoomer markedly improves multimodal models on long video understanding tasks, with higher reasoning efficiency and stronger evidence integration. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which may overlook critical evidence and cannot correct initial selection errors during reasoning. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.

[97] SpotEdit: Selective Region Editing in Diffusion Transformers

Zhibin Qin,Zhenxiong Tan,Zeqing Wang,Songhua Liu,Xinchao Wang

Main category: cs.CV

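TL;DR: Proposes SpotEdit, a training-free diffusion editing framework that achieves efficient and precise image editing by selectively updating only the modified regions.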

Details Motivation: Existing methods process and denoise all regions uniformly at every step, which causes redundant computation and may degrade unmodified regions, raising the question of whether every region needs to be regenerated. Method: SpotSelector identifies stable regions and reuses conditional image features to skip their computation; SpotFusion adaptively blends these features with the edited tokens through a dynamic fusion mechanism. Result: Unnecessary computation is reduced and unmodified regions keep high fidelity, while contextual coherence and editing quality are preserved. Conclusion: SpotEdit is an efficient and precise image editing method that substantially lowers computational cost without sacrificing quality. Abstract: Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.

[98] DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models

Jianrong Zhang,Hehe Fan,Yi Yang

Main category: cs.CV

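TL;DR: This paper proposes DeMoGen, a training paradigm for decompositional learning that uses an energy-based diffusion model to decompose complex human motions into semantically clear sub-components and to recombine them flexibly into novel motions.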

Details Motivation: Existing methods focus mainly on holistic forward modeling from text to motion and lack an understanding of the internal compositional structure of motion; this paper takes the inverse perspective to learn motion decomposition. Method: The DeMoGen framework, based on an energy-based diffusion model, includes three training variants: DeMoGen-Exp (explicit training on decomposed text prompts), DeMoGen-OSS (orthogonal self-supervised decomposition), and DeMoGen-SC (semantic consistency between original and decomposed text embeddings). Result: The model discovers reusable motion primitives from unlabeled composite motions, decomposes and recombines motions effectively, and performs well in generation diversity and out-of-distribution generalization. A dataset supporting text decomposition is also constructed. Conclusion: DeMoGen enables decompositional learning of human motion, supporting semantically clear decomposition and creative recombination, and advances compositionality and interpretability in text-to-motion generation. Abstract: Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition.

[99] The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma

Mariya Miteva,Maria Nisheva-Pavlova

Main category: cs.CV

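TL;DR: Proposes a multi-view latent representation learning framework based on variational autoencoders for non-invasive prediction of MGMT promoter methylation in glioblastoma.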

Details Motivation: Conventional unimodal and early-fusion approaches in radiomics suffer from feature redundancy and incomplete modeling of modality-specific information. Method: Encode T1Gd and FLAIR MRI with independent probabilistic encoders, fuse the modalities in a compact latent space, and use the latent representation to classify MGMT status. Result: The method integrates complementary radiomic features effectively, preserves modality-specific structure, and improves classification performance. Conclusion: The proposed multi-view VAE framework shows strong potential for MGMT promoter methylation prediction, outperforming conventional unimodal or early-fusion approaches. Abstract: Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and an incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) to integrate complementary radiomic features derived from post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) magnetic resonance imaging (MRI). By encoding each modality through an independent probabilistic encoder and performing fusion in a compact latent space, the proposed approach preserves modality-specific structure while enabling effective multimodal integration. The resulting latent embeddings are subsequently used for MGMT promoter methylation classification.

[100] Feature Learning with Multi-Stage Vision Transformers on Inter-Modality HER2 Status Scoring and Tumor Classification on Whole Slides

Olaide N. Oyelade,Oliver Hoxey,Yulia Humrye

Main category: cs.CV

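TL;DR: This study proposes an end-to-end vision transformer pipeline that jointly analyzes H&E and IHC whole slide images (WSIs) to produce pixel-level HER2 status scores (0, 1+, 2+, 3+), achieving high classification accuracy on a private dataset.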

Details Motivation: Existing deep learning methods for HER2 scoring struggle to provide pixel-level localization, and jointly analyzing H&E and IHC images is challenging, so an end-to-end method that localizes precisely and scores automatically is needed. Method: Use a system of vision transformers end to end; localize tumors by patch-wise processing of H&E WSIs; design a new mapping function that links malignant H&E regions to the corresponding IHC regions; embed a clinically inspired HER2 scoring mechanism for automatic pixel-level four-class scoring. Result: Tumor localization shows good classification accuracy; HER2 status prediction reaches a classification accuracy of 0.94 and a specificity of 0.933; performance on WSI patches is comparable to that of pathologists. Conclusion: The proposed end-to-end ViT-based pipeline can jointly analyze H&E and IHC images for accurate pixel-level HER2 scoring and has good potential for clinical use. Abstract: The popular use of histopathology images, such as hematoxylin and eosin (H&E), has proven to be useful in detecting tumors. However, moving such cancer cases forward for treatment requires an accurate assessment of the amount of human epidermal growth factor receptor 2 (HER2) protein expression. Predicting both the lower and higher levels of HER2 can be challenging. Moreover, jointly analyzing H&E and immunohistochemistry (IHC) stained images for HER2 scoring is difficult. Although several deep learning methods have been investigated to address the challenge of HER2 scoring, they struggle to provide pixel-level localization of HER2 status. In this study, we propose a single end-to-end pipeline using a system of vision transformers for HER2 status scoring on whole slide images (WSIs). The method includes patch-wise processing of H&E WSIs for tumor localization. A novel mapping function is proposed to identify the IHC WSI regions that correspond to malignant regions on H&E. A clinically inspired HER2 scoring mechanism is embedded in the pipeline and allows for automatic pixel-level annotation of 4-way HER2 scoring (0, 1+, 2+, and 3+). The proposed method also accurately returns HER2-negative and HER2-positive status. Privately curated datasets were collaboratively extracted from 13 different cases of WSIs of H&E and IHC. A thorough experiment is conducted on the proposed method. The results showed good classification accuracy during tumor localization. In addition, a classification accuracy of 0.94 and a specificity of 0.933 were obtained for HER2 status prediction under the 4-way scoring scheme.
The applicability of the proposed pipeline was investigated on WSI patches, with performance comparable to that of human pathologists. Findings from the study showed the usability of jointly evaluating H&E and IHC images with end-to-end ViT-based models for HER2 scoring.

[101] Human-like visual computing advances explainability and few-shot learning in deep neural networks for complex physiological data

Alaa Alahmadi,Mohamed Hasan

Main category: cs.CV

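TL;DR: This paper proposes a perception-inspired pseudo-colour encoding that converts clinically salient temporal features (such as the QT interval) into structured colour representations, improving deep neural networks' ability to recognise drug-induced long QT syndrome from very few samples and making the models more explainable.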

Details Motivation: Deep neural networks for physiological signal analysis need large amounts of data and are hard to interpret, which limits their clinical reliability. The paper aims to improve generalisation under extreme data scarcity and alignment with human reasoning. Method: Apply a perception-inspired pseudo-colour encoding and, using prototypical networks with a ResNet-18 architecture, evaluate one-shot and few-shot learning on ECG images of single cardiac cycles and 10-second rhythms, with visual analysis of the models' attention. Result: Pseudo-colour encoding lets models learn discriminative clinical features from only 1 to 5 training examples; attention analysis shows that the models focus on meaningful ECG components; aggregating multiple cardiac cycles improves performance further. Conclusion: Encoding inspired by human perception can improve data efficiency, explainability, and causal reasoning in medical AI, offering a new approach for small-sample medical settings. Abstract: Machine vision models, particularly deep neural networks, are increasingly applied to physiological signal interpretation, including electrocardiography (ECG), yet they typically require large training datasets and offer limited insight into the causal features underlying their predictions. This lack of data efficiency and interpretability constrains their clinical reliability and alignment with human reasoning. Here, we show that a perception-informed pseudo-colouring technique, previously demonstrated to enhance human ECG interpretation, can improve both explainability and few-shot learning in deep neural networks analysing complex physiological data. We focus on acquired, drug-induced long QT syndrome (LQTS) as a challenging case study characterised by heterogeneous signal morphology, variable heart rate, and scarce positive cases associated with life-threatening arrhythmias such as torsades de pointes. This setting provides a stringent test of model generalisation under extreme data scarcity. By encoding clinically salient temporal features, such as QT-interval duration, into structured colour representations, models learn discriminative and interpretable features from as few as one or five training examples. Using prototypical networks and a ResNet-18 architecture, we evaluate one-shot and few-shot learning on ECG images derived from single cardiac cycles and full 10-second rhythms. Explainability analyses show that pseudo-colouring guides attention toward clinically meaningful ECG features while suppressing irrelevant signal components.
Aggregating multiple cardiac cycles further improves performance, mirroring human perceptual averaging across heartbeats. Together, these findings demonstrate that human-like perceptual encoding can bridge data efficiency, explainability, and causal reasoning in medical machine intelligence.
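
The few-shot classification step that prototypical networks perform can be illustrated with a minimal sketch. It assumes embeddings have already been produced by some encoder (ResNet-18 in the paper); the 2-D vectors, the class labels `normal_qt` and `long_qt`, and all function names below are hypothetical and are not taken from the authors' code.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length embedding vectors (the class prototype)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, support):
    """Classify a query embedding against per-class support embeddings.

    support maps a class label to a list of support embeddings
    (one embedding per class for one-shot, five for five-shot).
    """
    prototypes = {label: mean_vector(embs) for label, embs in support.items()}
    # Softmax over negative squared distances to each prototype.
    logits = {label: -squared_distance(query, p) for label, p in prototypes.items()}
    largest = max(logits.values())
    exps = {label: math.exp(v - largest) for label, v in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs


# Toy two-shot episode with made-up 2-D embeddings (hypothetical values).
support = {
    "normal_qt": [[0.0, 0.0], [0.2, 0.0]],
    "long_qt": [[1.0, 1.0], [1.0, 0.8]],
}
label, probs = classify([0.9, 0.9], support)
print(label, probs)
```

With more support examples per class the prototype is simply the mean of more embeddings, which is why the same code covers the one-shot and five-shot settings.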

[102] VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

Zhengfei Kuang,Rui Lin,Long Zhao,Gordon Wetzstein,Saining Xie,Sanghyun Woo

Main category: cs.CV

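TL;DR: This paper proposes a 3D object arrangement method built on multimodal large language models (MLLMs). By introducing an MCP-based API, specialized visual tools, and a collaborative multi-agent framework, it addresses MLLMs' weak visual grounding, limited perception, and iterative errors in 3D scene manipulation, and significantly outperforms existing baselines on 25 complex tasks.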

Details Motivation: Although MLLMs have advanced on 2D vision-language tasks, their application to complex 3D scene manipulation remains limited, with particular challenges in visual grounding, 3D perception, and iterative error correction. Method: (1) An MCP-based API shifts the interaction from raw code manipulation to function-level updates; (2) specialized visual tools strengthen the MLLM's understanding of 3D scene state and spatial information; (3) a collaborative multi-agent framework with planning, execution, and verification roles handles multi-step instructions and error recovery. Result: The method is validated on 25 complex 3D object arrangement tasks, where it significantly outperforms existing baselines. Conclusion: The method compensates for key weaknesses of MLLMs in 3D scene manipulation, achieving more robust 3D object arrangement through API abstraction, enhanced perception, and multi-agent collaboration. Abstract: Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: vulcan-3d.github.io

[103] Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Xin Yu,Xiaojuan Qi,Zhengqi Li,Kai Zhang,Richard Zhang,Zhe Lin,Eli Shechtman,Tianyu Wang,Yotam Nitzan

Main category: cs.CV

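TL;DR: This paper proposes the Self-Evaluating Model (Self-E), a text-to-image generation model trained from scratch that supports any-step inference. It combines Flow Matching-style learning with a self-evaluation mechanism, requires no pretrained teacher, and delivers efficient, scalable generation.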

Details Motivation: Traditional diffusion and flow models rely on local supervision and need many inference steps, while distillation methods depend on a pretrained teacher. This paper aims for a high-quality generative model that has neither requirement, can be trained from scratch, and supports any-step inference. Method: Self-E learns from data in a way similar to Flow Matching and adds a self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, acting as a dynamic self-teacher. Instantaneous local learning is combined with self-driven global matching for end-to-end training. Result: Experiments on large-scale text-to-image benchmarks show that Self-E performs strongly at very low step counts, is competitive with state-of-the-art Flow Matching models at 50 steps, and keeps improving as the number of steps increases. Conclusion: Self-E is the first text-to-image model trained from scratch that supports any-step inference, providing a unified, efficient, and scalable generation framework. Abstract: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.

[104] iOSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI

Himanshu Naidu,Yuxiang Zhang,Sachin Mehta,Anat Caspi

Main category: cs.CV

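TL;DR: This paper introduces iOSPointMapper, a mobile application that uses recent iPhones and iPads for real-time, privacy-preserving mapping of sidewalk features, based on on-device semantic segmentation, LiDAR depth estimation, and fused localization. User validation improves data quality, and data is uploaded anonymously to the TDEI platform, giving a scalable and user-centered approach.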

Details Motivation: Accurate and timely sidewalk data is essential for accessible and inclusive pedestrian infrastructure, but existing data collection methods are costly, fragmented, and hard to scale, so a more efficient and scalable solution is needed. Method: The iOSPointMapper application combines on-device semantic segmentation, LiDAR depth estimation, and fused GPS/IMU localization to detect and localize sidewalk features such as traffic signs, traffic lights, and poles in real time on mobile devices; a user-guided annotation interface validates system outputs; data is anonymized and submitted to the TDEI platform. Result: The system performs well in feature detection and spatial mapping, supports the collection of high-quality pedestrian infrastructure data, and integrates seamlessly with multimodal transportation datasets. Conclusion: iOSPointMapper offers a scalable, privacy-conscious, and user-centered way to help close current gaps in pedestrian data and to promote safer, more inclusive urban walking environments. Abstract: Accurate, up-to-date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real-time, privacy-conscious sidewalk mapping on the ground, using recent-generation iPhones and iPads. The system leverages on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights, and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user-guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system's feature detection and spatial mapping performance reveal the application's potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user-centered approach to closing critical data gaps in pedestrian

[105] DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization

Hansang Lee,Chaelin Lee,Nieun Seo,Joon Seok Lim,Helen Hong

Main category: cs.CV

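TL;DR: DeFloMat is a generative object detection framework based on Conditional Flow Matching. Its deterministic flow field enables fast inference, reaching 43.32% $AP_{10:50}$ in only 3 steps and clearly outperforming DiffusionDet.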

Details Motivation: Diffusion models are accurate but slow at inference, which makes them hard to use in time-critical settings such as clinical applications; the trade-off between generative accuracy and efficiency needs to be resolved. Method: Conditional Flow Matching and Rectified Flow theory are used to replace the stochastic denoising process of diffusion models with a deterministic ordinary differential equation, enabling fast inference. Result: On an MRE medical imaging dataset, 3-step inference reaches 43.32% $AP_{10:50}$, a 1.4× improvement over DiffusionDet, with higher recall and stability. Conclusion: DeFloMat resolves the conflict between accuracy and speed in generative detection and sets a new standard for rapid clinical object detection. Abstract: We propose DeFloMat (Detection with Flow Matching), a novel generative object detection framework that addresses the critical latency bottleneck of diffusion-based detectors, such as DiffusionDet, by integrating Conditional Flow Matching (CFM). Diffusion models achieve high accuracy by formulating detection as a multi-step stochastic denoising process, but their reliance on numerous sampling steps ($T \gg 60$) makes them impractical for time-sensitive clinical applications like Crohn's Disease detection in Magnetic Resonance Enterography (MRE). DeFloMat replaces this slow stochastic path with a highly direct, deterministic flow field derived from Conditional Optimal Transport (OT) theory, specifically approximating the Rectified Flow. This shift enables fast inference via a simple Ordinary Differential Equation (ODE) solver. We demonstrate the superiority of DeFloMat on a challenging MRE clinical dataset. Crucially, DeFloMat achieves state-of-the-art accuracy ($43.32\% \text{ } AP_{10:50}$) in only $3$ inference steps, which represents a $1.4\times$ performance improvement over DiffusionDet's maximum converged performance ($31.03\% \text{ } AP_{10:50}$ at $4$ steps). Furthermore, our deterministic flow significantly enhances localization characteristics, yielding superior Recall and stability in the few-step regime. DeFloMat resolves the trade-off between generative accuracy and clinical efficiency, setting a new standard for stable and rapid object localization.
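
Why a straight, deterministic flow needs so few solver steps can be seen in a small sketch. It is not the DeFloMat implementation: the learned network is replaced by a hypothetical closed-form velocity field that follows the straight path from a noisy box to a known target box, and the box values and function names are made up for illustration. The reported 1.4× figure is consistent with the stated scores, since 43.32 / 31.03 ≈ 1.40.

```python
def euler_integrate(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for a learned flow: the velocity of the straight path to `target`.

    On a straight path x_t = (1 - t) * x_0 + t * x_1 the velocity x_1 - x_0
    equals (x_1 - x_t) / (1 - t), which is what this function returns.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Hypothetical normalized boxes in (cx, cy, w, h) format.
target_box = [0.30, 0.40, 0.20, 0.10]
noise_box = [0.90, 0.10, 0.50, 0.50]

box = euler_integrate(make_toy_velocity(target_box), noise_box, num_steps=3)
print(box)
```

Because the toy path is exactly straight, Euler integration lands on the target box for any step count; a learned flow is only approximately straight, so a few steps are still used in practice.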

[106] Bright 4B: Scaling Hyperspherical Learning for Segmentation in 3D Brightfield Microscopy

Amil Khan,Matheus Palhares Viana,Suraj Mishra,B. S. Manjunath

Main category: cs.CV

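TL;DR: Bright-4B is a 4-billion-parameter foundation model for segmenting 3D brightfield microscopy images, achieving high-accuracy morphological segmentation of organelles without fluorescence labelling.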

Details Motivation: Label-free 3D brightfield microscopy is fast and noninvasive, but it lacks robust volumetric segmentation methods and usually depends on fluorescence labelling or complex post-processing, which limits its wider use. Method: The Bright-4B model uses learning on the unit hypersphere, Native Sparse Attention, depth-width residual HyperConnections, a soft Mixture-of-Experts, and an anisotropic patch embedding to segment subcellular structures directly from 3D brightfield images. Result: Across multiple confocal datasets, Bright-4B accurately segments nuclei, mitochondria, and other organelles, preserves fine structural detail across depths and cell types, and outperforms current mainstream CNN and Transformer baselines. Conclusion: Bright-4B achieves high-accuracy, label-free 3D segmentation of cellular structures from brightfield images alone, advancing large-scale label-free cell mapping. Abstract: Label-free 3D brightfield microscopy offers a fast and noninvasive way to visualize cellular morphology, yet robust volumetric segmentation still typically depends on fluorescence or heavy post-processing. We address this gap by introducing Bright-4B, a 4-billion-parameter foundation model that learns on the unit hypersphere to segment subcellular structures directly from 3D brightfield volumes. Bright-4B combines a hardware-aligned Native Sparse Attention mechanism (capturing local, coarse, and selected global context), depth-width residual HyperConnections that stabilize representation flow, and a soft Mixture-of-Experts for adaptive capacity. A plug-and-play anisotropic patch embed further respects confocal point-spread and axial thinning, enabling geometry-faithful 3D tokenization. The resulting model produces morphology-accurate segmentations of nuclei, mitochondria, and other organelles from brightfield stacks alone, without fluorescence, auxiliary channels, or handcrafted post-processing. Across multiple confocal datasets, Bright-4B preserves fine structural detail across depth and cell types, outperforming contemporary CNN and Transformer baselines. All code, pretrained weights, and models for downstream finetuning will be released to advance large-scale, label-free 3D cell mapping.

[107] FluenceFormer: Transformer-Driven Multi-Beam Fluence Map Regression for Radiotherapy Planning

Ujunwa Mgboh,Rafi Ibn Sultan,Joshua Kim,Kundan Thind,Dongxiao Zhu

Main category: cs.CV

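TL;DR: This paper proposes FluenceFormer, a Transformer framework for geometry-aware prediction of radiotherapy fluence maps, which improves prediction accuracy and structural consistency through a two-stage design and the physics-informed FAR loss.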

Details Motivation: Because the relationship between anatomy and beam modulation is complex, fluence map prediction is an inverse problem in which long-range dependencies are hard to model, and traditional convolutional methods struggle to guarantee structural consistency and physical realizability. Method: The FluenceFormer framework has two stages: Stage 1 predicts a global dose prior from anatomical inputs, and Stage 2 conditions on beam geometry to regress physically calibrated fluence maps. The Fluence-Aware Regression (FAR) loss combines voxel-level fidelity, gradient smoothness, structural consistency, and beam-wise energy conservation. Result: Several Transformer backbones are evaluated on a prostate IMRT dataset; the Swin UNETR variant performs best, reducing Energy Error to 4.5% and achieving structural fidelity significantly better than existing CNN and single-stage methods (p < 0.05). Conclusion: With a unified two-stage Transformer architecture and a physics-guided loss, FluenceFormer predicts fluence maps that are more accurate, physically plausible, and structurally consistent, and it shows good generality and potential for clinical use. Abstract: Fluence map prediction is central to automated radiotherapy planning but remains an ill-posed inverse problem due to the complex relationship between volumetric anatomy and beam-intensity modulation. Convolutional methods in prior work often struggle to capture long-range dependencies, which can lead to structurally inconsistent or physically unrealizable plans. We introduce FluenceFormer, a backbone-agnostic transformer framework for direct, geometry-aware fluence regression. The model uses a unified two-stage design: Stage 1 predicts a global dose prior from anatomical inputs, and Stage 2 conditions this prior on explicit beam geometry to regress physically calibrated fluence maps. Central to the approach is the Fluence-Aware Regression (FAR) loss, a physics-informed objective that integrates voxel-level fidelity, gradient smoothness, structural consistency, and beam-wise energy conservation. We evaluate the generality of the framework across multiple transformer backbones, including Swin UNETR, UNETR, nnFormer, and MedFormer, using a prostate IMRT dataset. FluenceFormer with Swin UNETR achieves the strongest performance among the evaluated models and improves over existing benchmark CNN and single-stage methods, reducing Energy Error to 4.5% and yielding statistically significant gains in structural fidelity ($p < 0.05$).

[108] EmoCtrl: Controllable Emotional Image Content Generation

Jingyuan Yang,Weibin Luo,Hui Huang

Main category: cs.CV

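TL;DR: This paper proposes EmoCtrl, a model for controllable emotional image generation that expresses a target emotion while staying faithful to the content, addressing the trade-off between emotion and content consistency in existing methods.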

Details Motivation: Existing text-to-image models lack emotional awareness, while emotion-driven models tend to distort content, so a generation method is needed that keeps content accurate and expresses a specified emotion. Method: The EmoCtrl model includes textual and visual emotion enhancement modules, supported by a dataset annotated with content, emotion, and affective prompts that bridges abstract emotions and visual cues; learned emotion tokens jointly optimize content and emotional expression. Result: Experiments show that EmoCtrl outperforms existing methods in both quantitative and qualitative evaluations and balances content fidelity with emotional expression; user studies confirm alignment with human preference, and the model generalizes well to creative applications. Conclusion: EmoCtrl achieves content-faithful, emotion-controllable image generation, and the learned emotion tokens are complementary and general, providing an effective solution for emotional image generation. Abstract: An image conveys meaning through both its visual content and emotional tone, jointly shaping human perception. We introduce Controllable Emotional Image Content Generation (C-EICG), which aims to generate images that remain faithful to a given content description while expressing a target emotion. Existing text-to-image models ensure content consistency but lack emotional awareness, whereas emotion-driven models generate affective results at the cost of content distortion. To address this gap, we propose EmoCtrl, supported by a dataset annotated with content, emotion, and affective prompts, bridging abstract emotions to visual cues. EmoCtrl incorporates textual and visual emotion enhancement modules that enrich affective expression via descriptive semantics and perceptual cues. The learned emotion tokens exhibit complementary effects, as demonstrated through ablations and visualizations. Quantitative and qualitative experiments demonstrate that EmoCtrl achieves faithful content and expressive emotion control, outperforming existing methods across multiple aspects. User studies confirm EmoCtrl's strong alignment with human preference. Moreover, EmoCtrl generalizes well to creative applications, further demonstrating the robustness and adaptability of the learned emotion tokens.

[109] SuperiorGAT: Graph Attention Networks for Sparse LiDAR Point Cloud Reconstruction in Autonomous Systems

Khalfalla Awedat,Mohamed Abidalrekab,Gurcan Comert,Mustafa Ayad

Main category: cs.CV

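TL;DR: This paper proposes SuperiorGAT, a graph attention network framework for reconstructing elevation information that is missing from sparse LiDAR point clouds because of environmental occlusion.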

Details Motivation: LiDAR perception is limited by fixed vertical beam resolution and degrades further when environmental occlusions cause beam dropout. Method: LiDAR scans are modeled as beam-aware graphs, and gated residual fusion with feed-forward refinement enables accurate reconstruction without increasing network depth. Result: Experiments across diverse KITTI scenes show that SuperiorGAT achieves lower reconstruction error and better geometric consistency than PointNet-based models and deeper GAT baselines. Conclusion: Architectural refinement can improve LiDAR resolution in a computationally efficient way without additional sensor hardware. Abstract: LiDAR-based perception in autonomous systems is constrained by fixed vertical beam resolution and further compromised by beam dropout resulting from environmental occlusions. This paper introduces SuperiorGAT, a graph attention-based framework designed to reconstruct missing elevation information in sparse LiDAR point clouds. By modeling LiDAR scans as beam-aware graphs and incorporating gated residual fusion with feed-forward refinement, SuperiorGAT enables accurate reconstruction without increasing network depth. To evaluate performance, structured beam dropout is simulated by removing every fourth vertical scanning beam. Extensive experiments across diverse KITTI environments, including Person, Road, Campus, and City sequences, demonstrate that SuperiorGAT consistently achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines. Qualitative X-Z projections further confirm the model's ability to preserve structural integrity with minimal vertical distortion. These results suggest that architectural refinement offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware.
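
The structured beam dropout used for evaluation (removing every fourth vertical beam) can be sketched as follows. The point layout `(x, y, z, beam_index)` and the choice of which beam in each group of four is removed are assumptions made for illustration; the abstract does not specify them.

```python
def drop_every_fourth_beam(points):
    """Split a scan into kept points and points on dropped beams.

    points: list of (x, y, z, beam_index) tuples, beam_index counted from 0.
    Assumption: the last beam of every group of four (indices 3, 7, 11, ...)
    is the one removed.
    """
    kept = [p for p in points if p[3] % 4 != 3]
    missing = [p for p in points if p[3] % 4 == 3]
    return kept, missing


# Toy scan: one point per beam for an 8-beam sensor (hypothetical values).
scan = [(float(b), 0.0, 0.1 * b, b) for b in range(8)]
kept, missing = drop_every_fourth_beam(scan)
print(len(kept), [p[3] for p in missing])
```

The `missing` points serve as reconstruction targets, while the `kept` points are the sparse input a reconstruction model would see.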

[110] LECalib: Line-Based Event Camera Calibration

Zibin Liu,Banglei Guan,Yang Shang,Zhenbao Yu,Yifei Bian,Qifeng Yu

Main category: cs.CV

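TL;DR: A line-based event camera calibration framework is proposed that detects lines directly from event streams and refines camera parameters with non-linear optimization. It works with both planar and non-planar lines and is efficient and accurate.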

Details Motivation: Existing event camera calibration methods are time-consuming and rely on manually placed calibration objects, which makes them unsuitable for rapidly changing scenes, so a more efficient and automated calibration method is needed. Method: Geometric lines common in man-made environments (doors, windows, boxes, and so on) serve as the calibration basis; lines are detected directly from event streams, an event-line calibration model produces an initial estimate of the camera parameters, and non-linear optimization refines them. Result: Simulation and real-world experiments on monocular and stereo event cameras verify the feasibility and accuracy of the method. Conclusion: The proposed method needs neither flashing patterns nor reconstructed intensity images, calibrates event cameras efficiently and accurately, applies to a variety of scenes, and has open-source code. Abstract: Camera calibration is an essential prerequisite for event-based vision applications. Current event camera calibration methods typically involve using flashing patterns, reconstructing intensity images, and utilizing the features extracted from events. Existing methods are generally time-consuming and require manually placed calibration objects, which cannot meet the needs of rapidly changing scenarios. In this paper, we propose a line-based event camera calibration framework exploiting the geometric lines of commonly encountered objects in man-made environments, e.g., doors, windows, boxes, etc. Different from previous methods, our method detects lines directly from event streams and leverages an event-line calibration model to generate the initial guess of camera parameters, which is suitable for both planar and non-planar lines. Then, a non-linear optimization is adopted to refine camera parameters. Both simulation and real-world experiments have demonstrated the feasibility and accuracy of our method, with validation performed on monocular and stereo event cameras. The source code is released at https://github.com/Zibin6/line_based_event_camera_calib.

[111] Towards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion Framework

Zhicheng Zhao,Yuancheng Xu,Andong Lu,Chenglong Li,Jin Tang

Main category: cs.CV

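TL;DR: This paper proposes the Quality-Aware Dynamic Fusion Network (QDFNet) for robust object detection from fused optical and Synthetic Aperture Radar (SAR) images. It dynamically assesses feature reliability and fuses adaptively, significantly improving detection when a modality is missing or degraded.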

Details Motivation: Because of differing imaging mechanisms, temporal asynchrony, and registration difficulties, well-aligned optical-SAR image pairs are extremely hard to obtain, which often leaves a modality missing or degraded; existing methods remain insufficiently robust and effective at fusion under random missing modalities. Method: QDFNet includes a Dynamic Modality Quality Assessment (DMQA) module, which uses learnable reference tokens to iteratively refine the assessment of feature reliability, and an Orthogonal Constraint Normalization Fusion (OCNF) module, which uses orthogonal constraints to preserve modality independence and adjusts fusion weights dynamically according to reliability. Result: Experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets show that QDFNet outperforms state-of-the-art methods when a modality is partially corrupted or missing, with stronger robustness and detection accuracy. Conclusion: Through reliability-guided dynamic fusion, QDFNet copes with the missing-modality problem common in multimodal remote sensing imagery and significantly improves the stability and practicality of optical-SAR fusion detection. Abstract: Optical and Synthetic Aperture Radar (SAR) fusion-based object detection has attracted significant research interest in remote sensing, as these modalities provide complementary information for all-weather monitoring. However, practical deployment is severely limited by inherent challenges. Due to distinct imaging mechanisms, temporal asynchrony, and registration difficulties, obtaining well-aligned optical-SAR image pairs remains extremely difficult, frequently resulting in missing or degraded modality data. Although recent approaches have attempted to address this issue, they still suffer from limited robustness to random missing modalities and lack effective mechanisms to ensure consistent performance improvement in fusion-based detection. To address these limitations, we propose a novel Quality-Aware Dynamic Fusion Network (QDFNet) for robust optical-SAR object detection. Our proposed method leverages learnable reference tokens to dynamically assess feature reliability and guide adaptive fusion in the presence of missing modalities. In particular, we design a Dynamic Modality Quality Assessment (DMQA) module that employs learnable reference tokens to iteratively refine feature reliability assessment, enabling precise identification of degraded regions and providing quality guidance for subsequent fusion. Moreover, we develop an Orthogonal Constraint Normalization Fusion (OCNF) module that employs orthogonal constraints to preserve modality independence while dynamically adjusting fusion weights based on reliability scores, effectively suppressing unreliable feature propagation. Extensive experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets demonstrate the superiority and effectiveness of QDFNet compared to state-of-the-art methods, particularly under partial modality corruption or missing data scenarios.

[112] SonoVision: A Computer Vision Approach for Helping Visually Challenged Individuals Locate Objects with the Help of Sound Cues

Md Abu Obaida Zishan,Annajiat Alim Rasel

Main category: cs.CV

TL;DR: SonoVision is a smartphone application that uses sound cues to help visually impaired people locate everyday objects, increasing their independence.

Details Motivation: To help visually impaired people overcome the difficulty of locating objects in daily life, reduce their dependence on others, and avoid potential hazards. Method: The app is built with the Flutter development platform and uses the EfficientDet-D2 model for object detection in the backend; sound cues in the left and right earphone channels (one ear, or both ears at once) indicate the object's direction. Result: The app helps users judge object positions from sound cues and runs completely offline, with good safety and usability. Conclusion: SonoVision gives visually impaired people a safe, convenient, and independent way to locate objects and has practical value. Abstract: Locating objects is a significant challenge for visually impaired people, and one that does not become easier with time. It hinders their independence and can push them towards risky and dangerous scenarios. Hence, in the spirit of making the visually challenged more self-sufficient, we present SonoVision, a smartphone application that helps them find everyday objects using sound cues through earphones or headphones. If an object is on the right or left side of the user, the app plays a sinusoidal sound in the corresponding ear. To indicate objects located directly in front, both the left and right earphones sound simultaneously. These sound cues could easily help a visually impaired individual locate objects with the help of their smartphone and reduce their reliance on people in their surroundings, consequently making them more independent. The application is built with the Flutter development platform and uses the EfficientDet-D2 model for object detection in the backend. We believe the app will significantly assist the visually impaired in a safe and user-friendly manner with its capacity to work completely offline. Our application can be accessed here https://github.com/MohammedZ666/SonoVision.git.
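
The left/right/both cue rule described above can be sketched as a mapping from a detected bounding box to stereo channel gains. This is an illustration, not the app's code (the app itself is built with Flutter): the central band width, the tone frequency, the sample rate, and all function names are hypothetical, and it assumes the camera faces the same way as the user so that image-left corresponds to the user's left.

```python
import math


def cue_channels(box_center_x, image_width, center_fraction=1 / 3):
    """Return (left_gain, right_gain) for a detection's horizontal position.

    Assumption: objects whose centre falls in the middle `center_fraction`
    of the frame count as "directly in front" and sound in both ears.
    """
    rel = box_center_x / image_width
    lower = 0.5 - center_fraction / 2
    upper = 0.5 + center_fraction / 2
    if rel < lower:
        return (1.0, 0.0)  # object on the left: left ear only
    if rel > upper:
        return (0.0, 1.0)  # object on the right: right ear only
    return (1.0, 1.0)      # object in front: both ears


def sine_cue(gains, freq_hz=440.0, num_samples=80, sample_rate=8000):
    """Generate stereo frames (left, right) of a sinusoidal tone."""
    frames = []
    for i in range(num_samples):
        s = math.sin(2 * math.pi * freq_hz * i / sample_rate)
        frames.append((gains[0] * s, gains[1] * s))
    return frames


# A detection centred at x = 50 px in a 640 px wide frame sounds in the left ear.
left_cue = sine_cue(cue_channels(50, 640))
print(cue_channels(50, 640), cue_channels(320, 640), cue_channels(600, 640))
```

The frames could then be written to an audio buffer by whatever playback layer the platform provides.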

[113] SAM 3D for 3D Object Reconstruction from Remote Sensing Images

Junsheng Yao,Lichao Mou,Qingyu Li

Main category: cs.CV

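TL;DR: This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular building reconstruction from remote sensing imagery. Compared with TRELLIS, it produces more coherent roof geometry and sharper boundaries; a segment-reconstruct-compose pipeline extends it to urban scene modeling, revealing both its potential and its practical limitations.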

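Details Motivation: Existing monocular 3D building reconstruction methods usually depend on task-specific architectures and strong supervision and lack generality, so the suitability and potential of foundation models for this task need to be explored. Method: SAM 3D, a general-purpose image-to-3D foundation model, is compared with TRELLIS on the NYC Urban Dataset, using Fréchet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics; a segment-reconstruct-compose pipeline extends it to urban scene reconstruction. Result: SAM 3D outperforms TRELLIS in roof geometry coherence and boundary sharpness, generates high-quality 3D structures for individual buildings, and can be extended to complete urban scene modeling through the proposed pipeline. Conclusion: As a general-purpose foundation model, SAM 3D performs well in monocular remote sensing 3D reconstruction and has the potential to replace specialized models; future work should integrate scene-level structural priors to improve performance further. Abstract: Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, yet existing methods often require task-specific architectures and intensive supervision. This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular remote sensing building reconstruction. We benchmark SAM 3D against TRELLIS on samples from the NYC Urban Dataset, employing Fréchet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. We further extend SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline, demonstrating its potential for urban scene modeling. We also analyze practical limitations and discuss future research directions. These findings provide practical guidance for deploying foundation models in urban 3D reconstruction and motivate future integration of scene-level structural priors.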

[114] Comparing Object Detection Models for Electrical Substation Component Mapping

Haley Mody,Namish Bansal,Dennies Kiprono Bor,Edward J. Oughton

Main category: cs.CV

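TL;DR: This study compares three computer vision models, YOLOv8, YOLOv11, and RF-DETR, on the automatic identification of components in US electrical substations, with the goal of making infrastructure vulnerability assessment more efficient and accurate.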

Details Motivation: Manual mapping of substation infrastructure is time-consuming and labor-intensive, and because power systems are critical national infrastructure, key components must be identified efficiently to reduce the risk of failure. Method: YOLOv8, YOLOv11, and RF-DETR are trained on a manually labeled dataset of US substation images and compared on detection accuracy, precision, and efficiency. Result: All three models identify substation components effectively, and one of them offers the best balance between accuracy and efficiency, making it suitable for large-scale automated mapping; detailed performance metrics show each model's strengths and weaknesses. Conclusion: Automated methods based on computer vision can significantly improve the efficiency and scalability of substation component mapping and provide a practical machine learning solution for grid vulnerability analysis. Abstract: Electrical substations are a significant component of an electrical grid. Indeed, the assets at these substations (e.g., transformers) are prone to disruption from many hazards, including hurricanes, flooding, earthquakes, and geomagnetically induced currents (GICs). As electrical grids are considered critical national infrastructure, any failure can have significant economic and public safety implications. To help prevent and mitigate these failures, it is thus essential that we identify key substation components to quantify vulnerability. Unfortunately, traditional manual mapping of substation infrastructure is time-consuming and labor-intensive. Therefore, an autonomous solution utilizing computer vision models is preferable, as it allows for greater convenience and efficiency. In this research paper, we train and compare the outputs of three models (YOLOv8, YOLOv11, RF-DETR) on a manually labeled dataset of US substation images. Each model is evaluated for detection accuracy, precision, and efficiency. We present the key strengths and limitations of each model, identifying which provides reliable and large-scale substation component mapping. Additionally, we utilize these models to effectively map the various substation components in the United States, showcasing a use case for machine learning in substation mapping.

[115] Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

Sukhyun Jeong,Yong-Hoon Choi

Main category: cs.CV

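TL;DR: This paper proposes PGR$^2$M, a hybrid representation that combines interpretable pose codes with residual codes to improve the quality and controllability of text-driven 3D motion generation and editing.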

Details Motivation: Existing pose-code-based methods capture fine temporal dynamics and high-frequency motion features poorly, which degrades reconstruction quality and local controllability; this paper aims to solve that problem. Method: Pose-guided residual refinement for motion (PGR$^2$M) uses residual vector quantization (RVQ) to decompose motion into pose latents (global structure) and residual latents (fine-grained variation), with residual dropout preserving semantic alignment; a base Transformer predicts pose codes, and a refine Transformer predicts residual codes conditionally. Result: On the HumanML3D and KIT-ML datasets, PGR$^2$M outperforms CoMo and recent diffusion- and tokenization-based methods on Fréchet inception distance and reconstruction metrics, and user studies confirm that it enables intuitive edits that preserve structure. Conclusion: Through its hybrid representation, PGR$^2$M balances fidelity and interpretability in motion generation, achieving higher quality and stronger controllability in text-to-motion generation and editing. Abstract: Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
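
The coarse-then-residual idea behind residual vector quantization can be shown with a generic sketch. It is plain RVQ, not the pose-guided tokenizer from the paper: the two tiny codebooks, the 2-D vectors, and the function names are hypothetical, and in PGR$^2$M the first stage would be interpretable pose codes rather than a learned nearest-neighbour codebook.

```python
def nearest_index(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes what the previous ones missed."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Stage 1 is coarse, stage 2 holds small corrections (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
vector = [1.3, 0.9]
codes, leftover = rvq_encode(vector, codebooks)
coarse_only = rvq_decode(codes[:1], codebooks[:1])
refined = rvq_decode(codes, codebooks)
print(codes, coarse_only, refined)
```

Dropping the later stages at decode time, as `coarse_only` does, still yields a valid but rougher reconstruction, which is the property residual dropout relies on.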

[116] Event-based high temporal resolution measurement of shock wave motion field

Taihang Lei,Banglei Guan,Minzu Liang,Pengju Sun,Jing Tao,Yang Shang,Qifeng Yu

Main category: cs.CV

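TL;DR: A novel framework based on multiple event cameras is proposed for precise measurement of shock wave motion parameters at high spatiotemporal resolution. It achieves multi-angle measurement, motion field reconstruction, and explosive equivalence inversion, and experiments show high accuracy.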

Details Motivation: Accurate measurement of shock wave motion parameters at high spatiotemporal resolution is essential for applications such as power field testing and damage assessment, but the fast, uneven propagation of shock waves and unstable test conditions pose major challenges. Method: Multiple event cameras are used; a polar coordinate system is established, shock wave front events are extracted through adaptive region-of-interest extraction and iterative slope analysis, and a geometric model and a 3D reconstruction model are derived from the event-based optical imaging model. Result: Compared with pressure sensors and the empirical formula, the speed measurements have a maximum error of 5.20% and a minimum error of 0.06%, giving high-precision measurement of the shock wave motion field. Conclusion: The proposed method measures the shock wave motion field with high precision at high spatial and temporal resolution, representing significant progress in the field. Abstract: Accurate measurement of shock wave motion parameters with high spatiotemporal resolution is essential for applications such as power field testing and damage assessment. However, significant challenges are posed by the fast, uneven propagation of shock waves and unstable testing conditions. To address these challenges, a novel framework is proposed that utilizes multiple event cameras to estimate the asymmetry of shock waves, leveraging their high-speed and high-dynamic-range capabilities. Initially, a polar coordinate system is established, which encodes events to reveal shock wave propagation patterns, with adaptive region-of-interest (ROI) extraction through event offset calculations. Subsequently, shock wave front events are extracted using iterative slope analysis, exploiting the continuity of velocity changes. Finally, the geometric model of events and shock wave motion parameters is derived according to the event-based optical imaging model, along with the 3D reconstruction model. Through the above process, multi-angle shock wave measurement, motion field reconstruction, and explosive equivalence inversion are achieved. The results of the speed measurement are compared with those of the pressure sensors and the empirical formula, revealing a maximum error of 5.20% and a minimum error of 0.06%. The experimental results demonstrate that our method achieves high-precision measurement of the shock wave motion field with both high spatial and temporal resolution, representing significant progress.

[117] Scalpel-SAM: A Semi-Supervised Paradigm for Adapting SAM to Infrared Small Object Detection

Zihan Liu,Xiangning Ren,Dezhang Kong,Yipeng Zhang,Meng Han

Main category: cs.CV

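TL;DR: A two-stage semi-supervised knowledge distillation paradigm based on a Hierarchical MoE Adapter is proposed to address the data scarcity caused by high annotation costs in infrared small object detection. Using only 10% of the labeled data, it matches or even surpasses fully supervised models.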

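Details Motivation: Existing semi-supervised methods for infrared small object detection face large domain gaps, an inability to encode physical priors, and complex architectures, and annotation is expensive, so a new paradigm that uses a small amount of labeled data efficiently is urgently needed. Method: A Hierarchical MoE Adapter consisting of four white-box neural operators is designed, and a two-stage knowledge distillation and transfer framework is built on it: the first stage uses a small amount of fully supervised data and prior-guided knowledge distillation to distill SAM into an expert teacher, Scalpel-SAM; the second stage uses this teacher to generate pseudo labels for training lightweight downstream models. Result: Experiments show that downstream models trained with only 10% of the labeled data match or even exceed fully supervised models, confirming that the paradigm mitigates data scarcity. Conclusion: The proposed semi-supervised paradigm is the first to systematically use SAM as a teacher model to address the annotation bottleneck in infrared small object detection, and it has potential for practical deployment and wider adoption. Abstract: Infrared small object detection urgently requires semi-supervised paradigms due to the high cost of annotation. However, existing methods like SAM face significant challenges: domain gaps, an inability to encode physical priors, and inherent architectural complexity. To address this, we designed a Hierarchical MoE Adapter consisting of four white-box neural operators. Building upon this core component, we propose a two-stage paradigm for knowledge distillation and transfer: (1) Prior-Guided Knowledge Distillation, where we use our MoE adapter and 10% of available fully supervised data to distill SAM into an expert teacher (Scalpel-SAM); and (2) Deployment-Oriented Knowledge Transfer, where we use Scalpel-SAM to generate pseudo labels for training lightweight and efficient downstream models. Experiments demonstrate that with minimal annotations, our paradigm enables downstream models to achieve performance comparable to, or even surpassing, their fully supervised counterparts. To our knowledge, this is the first semi-supervised paradigm that systematically addresses the data scarcity issue in IR-SOT using SAM as the teacher model.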

[118] Tracking by Predicting 3-D Gaussians Over Time

Tanish Baranwal,Himanshu Gaurav Singh,Jathushan Rajasegaran,Jitendra Malik

Main category: cs.CV

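TL;DR: Video-GMAE is proposed, a self-supervised video representation learning method based on Gaussian splats in which tracking ability emerges naturally and which surpasses existing self-supervised methods on several datasets.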

Details Motivation: The aim is to improve video representation learning by introducing a reasonable inductive bias about the dynamics of 3D scenes, using Gaussian splats to represent the motion of objects in video. Method: The Video-GMAE architecture encodes a video as a sequence of Gaussian splats that change over time and models the 2D projections of a dynamic 3D scene through self-supervised learning. Result: After pretraining, the model shows zero-shot tracking ability comparable to the state of the art; with small-scale finetuning it improves by 34.6% on Kinetics and 13.1% on Kubric. Conclusion: Gaussian splats are a powerful video representation: high-level capabilities such as tracking emerge naturally, and downstream performance is significantly better than that of existing self-supervised methods. Abstract: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to the state of the art. With small-scale finetuning, our models achieve a 34.6% improvement on Kinetics and a 13.1% improvement on Kubric, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.

[119] SCAFusion: A Multimodal 3D Detection Framework for Small Object Detection in Lunar Surface Exploration

Xin Chen,Kang Luo,Yangyi Xiao,Hesheng Wang

Main category: cs.CV

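TL;DR: Presents SCAFusion, a multimodal 3D object detection model for lunar robotic missions. Built on the BEVFusion framework, it adds a Cognitive Adapter, a Contrastive Alignment Module, a Camera Auxiliary Training Branch, and a Section-aware Coordinate Attention mechanism, significantly improving detection of small and irregular objects.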

Details Motivation: Multimodal 3D perception methods designed for terrestrial autonomous driving perform poorly in off-world environments such as the Moon, mainly because of poor feature alignment, limited multimodal synergy, and weak small-object detection. A more reliable detection model tailored to lunar exploration is therefore needed. Method: Builds on the BEVFusion framework and introduces four key components: a Cognitive Adapter for efficient tuning of the camera backbone; a Contrastive Alignment Module that improves consistency between camera and LiDAR features; a Camera Auxiliary Training Branch that strengthens visual representation; and a Section-aware Coordinate Attention mechanism aimed specifically at small, complex targets. Result: Reaches 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving on the baseline by 5.0% and 2.7% respectively; in a simulated lunar environment built on Isaac Sim, mAP reaches 90.93%, 11.5% above the baseline, with notable gains on small meteor-like obstacles. Conclusion: SCAFusion improves detection accuracy for small, complex targets on the lunar surface with almost no increase in parameters or computation, and shows potential for lunar autonomous navigation. Abstract: Reliable and precise detection of small and irregular objects, such as meteor fragments and rocks, is critical for autonomous navigation and operation in lunar surface exploration. Existing multimodal 3D perception methods designed for terrestrial autonomous driving often underperform in off-world environments due to poor feature alignment, limited multimodal synergy, and weak small object detection. This paper presents SCAFusion, a multimodal 3D object detection model tailored for lunar robotic missions. Built upon the BEVFusion framework, SCAFusion integrates a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module to enhance camera-LiDAR feature consistency, a Camera Auxiliary Training Branch to strengthen visual representation, and most importantly, a Section-aware Coordinate Attention mechanism explicitly designed to boost the detection performance of small, irregular targets. With negligible increase in parameters and computation, our model achieves 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion achieves 90.93% mAP, outperforming the baseline by 11.5%, with notable gains in detecting small meteor-like obstacles.

[120] DreamOmni3: Scribble-based Editing and Generation

Bin Xia,Bohao Peng,Jiyang Liu,Sitong Wu,Jingyao Li,Junjia Huang,Xu Zhao,Yitong Wang,Ruihang Chu,Bei Yu,Jiaya Jia

Main category: cs.CV

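TL;DR: Proposes scribble-based editing and generation tasks that combine text, images, and freehand sketches for more flexible creation in graphical user interfaces, and introduces the DreamOmni3 model to address data construction and framework design.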

Details Motivation: Language prompts struggle to capture where a user intends to edit and the fine-grained visual details involved, so visual cues such as scribbles are needed to make generation and editing more precise. Method: Defines scribble-based editing and generation tasks, builds a data synthesis pipeline covering multiple scribble types, and designs a joint input scheme that feeds the original image and the scribbled image into the model together, using colors to distinguish regions and shared encodings to localize them precisely. Result: Establishes comprehensive benchmarks; experiments show DreamOmni3 performs strongly across tasks and supports precise editing and generation under complex multimodal inputs. Conclusion: By jointly feeding the original and scribbled images, DreamOmni3 improves the flexibility and accuracy of image generation and editing and advances multimodal creation in GUI settings. Abstract: Recently, unified generation and editing models have achieved remarkable success with their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users' intended edit locations and fine-grained visual details. To this end, we propose two tasks, scribble-based editing and generation, which enable more flexible creation on a graphical user interface (GUI) by combining text, images, and freehand sketches. We introduce DreamOmni3, tackling two challenges: data creation and framework design. Our data synthesis pipeline includes two parts: scribble-based editing and generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on the DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles or cropped images to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance, and models and code will be publicly released.

[121] CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation

Qinglin Zeng,Kaitong Cai,Ruiqi Chen,Qinhan Lv,Keze Wang

Main category: cs.CV

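TL;DR: Proposes CoAgent, a framework that uses a closed-loop plan-synthesize-verify pipeline to improve narrative coherence and visual consistency in open-domain video generation.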

Details Motivation: Existing text-to-video models often handle each shot independently, causing identity drift, scene inconsistency, and unstable temporal structure; they lack global consistency control across shots. Method: Builds a collaborative closed-loop system made up of a Storyboard Planner, a Global Context Manager, a Synthesis Module, a Visual Consistency Controller, and a Verifier Agent. The Storyboard Planner decomposes the input into shot-level plans with entities, spatial relations, and temporal cues; the Global Context Manager maintains entity memory; the Verifier Agent evaluates results with vision-language reasoning and triggers selective regeneration. Result: Experiments show that CoAgent significantly improves narrative coherence, visual consistency, and overall narrative quality in long-form video generation. Conclusion: Through explicit structured planning and closed-loop feedback, CoAgent addresses cross-shot consistency in open-domain video generation and offers a scalable framework for long-sequence content generation. Abstract: Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
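The plan-synthesize-verify control flow described in the abstract can be sketched as a small loop. This is only an illustration under stated assumptions: the class names, the `max_retries` limit, and the toy string-based "shots" are hypothetical stand-ins, and the real components are learned models rather than the stub functions used here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ShotPlan:
    index: int
    entities: List[str]
    description: str


@dataclass
class ContextMemory:
    """Entity-level memory shared across shots (stand-in for the Global Context Manager)."""
    appearance: Dict[str, str] = field(default_factory=dict)


def generate_video(plans: List[ShotPlan],
                   synthesize: Callable[[ShotPlan, ContextMemory], str],
                   verify: Callable[[ShotPlan, str, ContextMemory], bool],
                   memory: ContextMemory,
                   max_retries: int = 2) -> List[str]:
    """Generate each planned shot, regenerating only the shots the verifier rejects."""
    shots = []
    for plan in plans:
        shot = synthesize(plan, memory)
        retries = 0
        while not verify(plan, shot, memory) and retries < max_retries:
            shot = synthesize(plan, memory)  # selective regeneration of this shot only
            retries += 1
        shots.append(shot)
    return shots


# Toy stand-ins so the loop can be exercised without any model.
attempts = {"count": 0}


def toy_synthesize(plan: ShotPlan, memory: ContextMemory) -> str:
    attempts["count"] += 1
    if attempts["count"] == 1:
        # The very first attempt omits the entity on purpose, so it gets rejected.
        return f"shot {plan.index}: (empty)"
    look = ", ".join(memory.appearance.get(e, e) for e in plan.entities)
    return f"shot {plan.index}: {look}"


def toy_verify(plan: ShotPlan, shot: str, memory: ContextMemory) -> bool:
    # Accept a shot only if every planned entity appears with its remembered look.
    return all(memory.appearance.get(e, e) in shot for e in plan.entities)


memory = ContextMemory(appearance={"fox": "red fox with white tail"})
plans = [ShotPlan(0, ["fox"], "fox enters"), ShotPlan(1, ["fox"], "fox sleeps")]
shots = generate_video(plans, toy_synthesize, toy_verify, memory)
```

The point of the design is that a failed check costs one extra generation of a single shot rather than a rerun of the whole video, while the shared memory keeps entity appearance fixed across shots.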

[122] Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

Jesen Zhang,Ningyuan Liu,Kaitong Cai,Sidi Liu,Jing Yang,Ziliang Chen,Xiaofei Sun,Keze Wang

Main category: cs.CV

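TL;DR: Proposes SR-MCR, a lightweight, label-free framework that aligns the reasoning of multimodal large language models using intrinsic process signals in the model's own outputs, improving both answer accuracy and reasoning coherence.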

Details Motivation: Multimodal large models often produce fluent but unreliable reasoning that lacks step-to-step consistency and visual grounding, mainly because existing alignment methods supervise only the final answer and ignore the reliability of the intermediate reasoning. Method: Introduces the SR-MCR framework, which combines five self-referential cues (semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency) into a normalized, reliability-weighted reward, and stabilizes training with a critic-free GRPO objective and a confidence-aware cooling mechanism. Result: Built on Qwen2.5-VL, SR-MCR significantly improves answer accuracy and reasoning coherence on multiple visual benchmarks; SR-MCR-7B reaches an average accuracy of 81.4%, leading open-source models of comparable size. Ablations confirm the independent contribution of each reward term and of the cooling module. Conclusion: Through fine-grained process-level alignment, SR-MCR improves the reasoning quality and visual grounding of multimodal large models and offers a new route to reasoning optimization without additional annotation. Abstract: Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
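To make the reward construction concrete, here is a minimal sketch of a weighted combination of the five cues followed by the group-relative advantage normalization that GRPO-style objectives use in place of a learned critic. The weight values, the cue scores, and the function names are illustrative assumptions, and the paper's confidence-aware cooling mechanism is not modeled.

```python
import math
from typing import Dict, List

CUES = ("semantic_alignment", "lexical_fidelity", "non_redundancy",
        "visual_grounding", "step_consistency")


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the five cue scores (weights are normalized to sum to 1)."""
    total_w = sum(weights[c] for c in CUES)
    return sum(weights[c] * scores[c] for c in CUES) / total_w


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Critic-free advantage: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical cue scores for three responses sampled for the same prompt.
weights = {c: 1.0 for c in CUES}
group = [
    {c: 0.5 for c in CUES},
    {c: 0.9 for c in CUES},
    {c: 0.2 for c in CUES},
]
rewards = [combined_reward(s, weights) for s in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the other samples in the group, a response is only reinforced when its reasoning scores better than its siblings for the same prompt, which is what removes the need for a separate value model.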

[123] ReFRM3D: A Radiomics-enhanced Fused Residual Multiparametric 3D Network with Multi-Scale Feature Fusion for Glioma Characterization

Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Arefin Ittesafun Abian,Yan Zhang,Mirjam Jonkman,Sami Azam

Main category: cs.CV

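TL;DR: Proposes ReFRM3D, a radiomics-enhanced fused residual 3D network that works on multi-parametric MRI, together with a multi-feature tumor-marker classifier, to make glioma segmentation and classification more efficient; it achieves strong segmentation results on several BraTS datasets.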

Details Motivation: Existing glioma diagnosis methods suffer from high variability in imaging data, poor use of computational resources, and inefficient segmentation and classification, so a more efficient automated solution is needed. Method: Proposes the ReFRM3D network, based on a 3D U-Net architecture, which combines multi-scale feature fusion, hybrid upsampling, and an extended residual skip mechanism, and uses radiomic features to build a multi-feature tumor-marker classifier. Result: Achieves significant segmentation improvements on BraTS2019, BraTS2020, and BraTS2021; on BraTS2019, for example, the Dice scores for whole tumor, enhancing tumor, and tumor core reach 94.04%, 92.68%, and 93.64% respectively. Conclusion: The method improves glioma segmentation accuracy and classification efficiency, providing an advanced and reliable solution for automatic brain tumor characterization. Abstract: Gliomas are among the most aggressive cancers, characterized by high mortality rates and complex diagnostic processes. Existing studies on glioma diagnosis and classification often describe issues such as high variability in imaging data, inadequate optimization of computational resources, and inefficient segmentation and classification of gliomas. To address these challenges, we propose novel techniques utilizing multi-parametric MRI data to enhance tumor segmentation and classification efficiency. Our work introduces the first-ever radiomics-enhanced fused residual multiparametric 3D network (ReFRM3D) for brain tumor characterization, which is based on a 3D U-Net architecture and features multi-scale feature fusion, hybrid upsampling, and an extended residual skip mechanism. Additionally, we propose a multi-feature tumor marker-based classifier that leverages radiomic features extracted from the segmented regions. Experimental results demonstrate significant improvements in segmentation performance across the BraTS2019, BraTS2020, and BraTS2021 datasets, achieving high Dice Similarity Coefficients (DSC) of 94.04%, 92.68%, and 93.64% for whole tumor (WT), enhancing tumor (ET), and tumor core (TC) respectively in BraTS2019; 94.09%, 92.91%, and 93.84% in BraTS2020; and 93.70%, 90.36%, and 92.13% in BraTS2021.
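For readers unfamiliar with the reported metric, the Dice Similarity Coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened toy masks; the masks are made up for illustration, and returning 1.0 for two empty masks is a convention assumed here, not something stated in the abstract.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice Similarity Coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


# Toy masks: 3 predicted voxels, 3 ground-truth voxels, 2 shared.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [0, 1, 1, 1, 0, 0]
dsc = dice_coefficient(pred_mask, true_mask)
```

In BraTS-style evaluation the score is computed separately for each region (WT, ET, TC), which is why the abstract reports three numbers per dataset.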

[124] KV-Tracker: Real-Time Pose Tracking with Transformers

Marwan Taher,Ignacio Alzugaray,Kirill Mazur,Xin Kong,Andrew J. Davison

Main category: cs.CV

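TL;DR: Proposes a KV-cache-based way to run multi-view 3D geometry networks in real time, enabling efficient 6-DoF pose tracking and online reconstruction.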

Details Motivation: Multi-view 3D geometry networks provide a strong prior but are too slow for real-time applications. Method: Selects and manages keyframes, processes them with the $π^3$ network under full bidirectional attention, caches the key-value (KV) pairs of the global self-attention block, and uses them as the sole scene representation for online tracking. Result: Achieves up to a 15x inference speedup and sustains frame rates of about 27 FPS without drift or catastrophic forgetting, with strong results on several datasets. Conclusion: The KV caching strategy is model-agnostic and can be applied to other multi-view networks, supporting real-time, online pose tracking and reconstruction of objects and scenes. Abstract: Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $π^3$ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\sim}27$ FPS.
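The caching idea can be illustrated with a toy single-head attention: keys and values computed once for the keyframes are stored, and each new frame's query attends to the stored pairs without writing to them. This is a sketch under strong simplifying assumptions (plain Python lists, one head, no learned projections); the class and method names are hypothetical and do not come from the KV-Tracker code.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class KeyframeKVCache:
    """Stores keyframe keys/values once; tracking queries only read them."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def add_keyframe(self, keys: List[Vector], values: List[Vector]) -> None:
        # Mapping step: run once per keyframe, then reused for every later frame.
        self.keys.extend(keys)
        self.values.extend(values)

    def attend(self, query: Vector) -> Vector:
        # Tracking step: scaled dot-product attention over the cached pairs only.
        if not self.keys:
            raise ValueError("cache is empty; add a keyframe first")
        scale = 1.0 / math.sqrt(len(query))
        weights = softmax([dot(query, k) * scale for k in self.keys])
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values)) for i in range(dim)]


cache = KeyframeKVCache()
cache.add_keyframe(keys=[[1.0, 0.0], [0.0, 1.0]], values=[[10.0, 0.0], [0.0, 10.0]])
out = cache.attend([5.0, 0.0])  # a new frame's token queries the frozen scene memory
```

Since the tracked frame never modifies the cache, per-frame cost depends on the number of cached tokens rather than on re-running attention across all keyframes, and the stored scene representation cannot drift between keyframe updates.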

[125] PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment

Bin Wang,Yang Xu,Huan Zhao,Hao Zhang,Zixing Zhang

Main category: cs.CV

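TL;DR: Proposes PTalker, a framework for personalized 3D talking head animation that disentangles style from audio and facial motion sequences and applies multi-level alignment to generate high-fidelity, personalized speech-driven facial animation.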

Details Motivation: Existing speech-driven 3D talking head methods overlook the subtle differences in individual speaking styles, which limits personalization and realism. Method: Designs disentanglement constraints that map audio and motion sequences into separate style and content spaces, and adopts a three-level modality alignment mechanism: spatial alignment (Graph Attention Networks), temporal alignment (cross-attention), and feature alignment (top-k bidirectional contrastive losses and KL divergence). Result: Experiments on public datasets show that PTalker outperforms state-of-the-art methods in lip-synchronization accuracy and personalized style expression, producing more realistic and more personalized 3D talking head animation. Conclusion: Through style disentanglement and multi-level alignment, PTalker improves the realism and synchronization of personalized 3D talking head generation and advances speech-driven facial animation. Abstract: Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely "PTalker". This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.

[126] Enhancing Noise Resilience in Face Clustering via Sparse Differential Transformer

Dafeng Zhang,Yongqi Song,Shizhuo Liu

Main category: cs.CV

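TL;DR: Proposes a Top-K Jaccard similarity method built on a Sparse Differential Transformer to make the measurement of relationships between face embeddings more accurate and robust for face clustering.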

Details Motivation: Existing methods that use the Jaccard coefficient include too many irrelevant nodes, which weakens the discriminative power of the similarity measure and hurts clustering performance. Method: Proposes a prediction-driven Top-K Jaccard similarity coefficient and designs a Transformer-based model to refine neighbor selection; further introduces a Sparse Differential Transformer (SDT) to suppress noise and strengthen the modeling of feature relationships. Result: Achieves state-of-the-art clustering performance on multiple datasets, including MS-Celeb-1M, clearly outperforming existing methods. Conclusion: The proposed SDT and Top-K Jaccard strategy improve the accuracy and robustness of face embedding similarity measurement and provide an efficient solution for large-scale face clustering. Abstract: The method used to measure relationships between face embeddings plays a crucial role in determining the performance of face clustering. Existing methods employ the Jaccard similarity coefficient instead of the cosine distance to enhance the measurement accuracy. However, these methods introduce too many irrelevant nodes, producing Jaccard coefficients with limited discriminative power and adversely affecting clustering performance. To address this issue, we propose a prediction-driven Top-K Jaccard similarity coefficient that enhances the purity of neighboring nodes, thereby improving the reliability of similarity measurements. Nevertheless, accurately predicting the optimal number of neighbors (Top-K) remains challenging, leading to suboptimal clustering results. To overcome this limitation, we develop a Transformer-based prediction model that examines the relationships between the central node and its neighboring nodes near the Top-K to further enhance the reliability of similarity estimation. However, vanilla Transformers, when applied to predict relationships between nodes, often introduce noise due to their overemphasis on irrelevant feature relationships. To address these challenges, we propose a Sparse Differential Transformer (SDT), instead of the vanilla Transformer, to eliminate noise and enhance the model's anti-noise capabilities. Extensive experiments on multiple datasets, such as MS-Celeb-1M, demonstrate that our approach achieves state-of-the-art (SOTA) performance, outperforming existing methods and providing a more robust solution for face clustering.
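The effect of restricting the Jaccard coefficient to the top-K neighbors can be shown with a small sketch. It assumes each node already has a neighbor list ranked by embedding similarity, and it uses a fixed K supplied by the caller; in the paper K is predicted per node by a Transformer, which is not modeled here. The node ids are made up.

```python
from typing import List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_jaccard(ranked_a: List[int], ranked_b: List[int], k: int) -> float:
    """Jaccard similarity computed on only the k nearest neighbors of each node."""
    if k <= 0:
        raise ValueError("k must be positive")
    return jaccard(set(ranked_a[:k]), set(ranked_b[:k]))


# Hypothetical ranked neighbor ids (nearest first). The tail entries are noise.
neighbors_a = [1, 2, 3, 9, 8]
neighbors_b = [2, 3, 1, 7, 6]
sim_k3 = top_k_jaccard(neighbors_a, neighbors_b, k=3)
sim_k5 = top_k_jaccard(neighbors_a, neighbors_b, k=5)
```

With K = 3 the two faces share all of their close neighbors, while K = 5 pulls in unrelated nodes and dilutes the score, which is the loss of discriminative power the abstract attributes to oversized neighborhoods.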

[127] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Jiacheng Ye,Shansan Gong,Jiahui Gao,Junming Fan,Shuang Wu,Wei Bi,Haoli Bai,Lifeng Shang,Lingpeng Kong

Main category: cs.CV

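TL;DR: Introduces Dream-VL and Dream-VLA, vision-language and vision-language-action models built on a diffusion language model backbone, which overcome limitations of autoregressive models in visual planning and robotic control and reach leading performance on several benchmarks.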

Details Motivation: Autoregressive vision-language models are limited by sequential generation in complex visual planning and dynamic robotic control; this work explores diffusion-based language models to improve generation efficiency and modeling capacity. Method: Builds Dream-VL, an open vision-language model on a diffusion language model, and develops the vision-language-action model Dream-VLA through continued pre-training on open robotic datasets, using the backbone's bidirectional generation to support action chunking and parallel generation. Result: Dream-VL is comparable to top autoregressive models on many benchmarks and shows more potential on visual planning tasks; Dream-VLA reaches average success rates of 97.2%, 71.4%, and 60.5% on LIBERO, SimplerEnv-Bridge, and SimplerEnv-Fractal respectively, surpassing leading models such as $π_0$ and GR00T-N1. Conclusion: Diffusion-based vision-language models provide a stronger foundation for vision-language-action tasks, with faster convergence and better task performance, and have broad application prospects. Abstract: While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $π_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.

[128] Rethinking Memory Design in SAM-Based Visual Object Tracking

Mohamad Alansari,Muzammal Naseer,Hasan Al Marzouqi,Naoufel Werghi,Sajid Javed

Main category: cs.CV

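TL;DR: Presents a systematic study of memory mechanisms in SAM-based visual object tracking, analyzes how existing methods differ in selecting short-term memory frames, and proposes a unified hybrid memory framework that splits memory into short-term appearance memory and long-term distractor-resolving memory, improving robustness under occlusion, complex motion, and distractors across multiple benchmarks.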

Details Motivation: Existing SAM2-based trackers each handle memory limitations in their own way, so there is no unified understanding of memory design principles, and it is unclear whether these mechanisms transfer to stronger foundation models such as SAM3. A systematic study of memory design and its cross-model transferability is therefore needed. Method: First analyzes the memory strategies of representative SAM2 trackers and finds that they differ mainly in how short-term memory frames are selected; then faithfully reimplements these mechanisms within the SAM3 framework and runs large-scale evaluations on ten benchmarks; finally proposes a unified hybrid memory framework that decomposes memory into short-term appearance memory and long-term distractor-resolving memory. Result: Experiments show that the hybrid memory framework consistently improves performance on both SAM2 and SAM3 backbones, with stronger robustness under long-term occlusion, complex motion, and heavy distractors. Conclusion: Through systematic analysis and modular design, the work clarifies the central role of memory in SAM-based tracking, and the unified framework offers an extensible, interpretable memory architecture for future foundation-model trackers. Abstract: Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: https://github.com/HamadYA/SAM3_Tracking_Zoo. This is a preprint. Some results are being finalized and may be updated in a future revision.
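One way to picture the split into short-term and long-term memory is the data-structure sketch below. The selection policies are assumptions made purely for illustration: the short-term bank is a fixed-size queue of the most recent frames, and the long-term bank keeps the most confident frames in which a distractor was flagged. The abstract does not specify the actual policies, and the class and parameter names are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


class HybridMemory:
    """Two memory banks: recent-appearance frames and distractor-resolving frames."""

    def __init__(self, short_capacity: int, long_capacity: int) -> None:
        self.short_term: Deque[int] = deque(maxlen=short_capacity)
        self.long_term: List[Tuple[float, int]] = []  # (confidence, frame_id)
        self.long_capacity = long_capacity

    def update(self, frame_id: int, confidence: float, distractor_present: bool) -> None:
        # Short-term bank: always keep the latest frames (the oldest is evicted).
        self.short_term.append(frame_id)
        # Long-term bank (assumed policy): keep the most confident frames
        # observed while a distractor was present.
        if distractor_present:
            self.long_term.append((confidence, frame_id))
            self.long_term.sort(reverse=True)
            del self.long_term[self.long_capacity:]

    def frames(self) -> List[int]:
        """Frame ids to condition on: long-term first, then recent frames, no repeats."""
        long_ids = [fid for _, fid in self.long_term]
        return long_ids + [f for f in self.short_term if f not in long_ids]


mem = HybridMemory(short_capacity=2, long_capacity=2)
for fid, conf, distractor in [(0, 0.9, False), (1, 0.8, True), (2, 0.95, True),
                              (3, 0.6, True), (4, 0.7, False)]:
    mem.update(fid, conf, distractor)
selected = mem.frames()
```

Keeping the banks separate means a short-term selection rule from any existing tracker could be swapped in without touching the long-term bank, which matches the modular integration the abstract describes.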

[129] Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion

Yuming Gu,Yizhi Wang,Yining Hong,Yipeng Gao,Hao Jiang,Angtian Wang,Bo Liu,Nathaniel S. Dennler,Zhengfei Kuang,Hao Li,Gordon Wetzstein,Chongyang Ma

Main category: cs.CV

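TL;DR: Proposes Envision, a diffusion-based visual planning framework that explicitly conditions on a goal image to generate physically plausible, goal-consistent video trajectories, improving spatial consistency and goal alignment for embodied agents in manipulation tasks.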

Details Motivation: Existing visual planning methods are mostly forward-predictive and lack explicit modeling of the goal state, so generated trajectories are prone to spatial drift and goal misalignment. Method: Envision works in two stages: first, a Goal Imagery Model generates a coherent, task-relevant goal image from the scene and the instruction; then a first-and-last-frame-conditioned video diffusion model (FL2V) interpolates between the initial observation and the goal image to produce smooth, physically plausible video trajectories. Result: On object manipulation and image editing benchmarks, Envision outperforms baselines in goal alignment, spatial consistency, and object preservation. Conclusion: Diffusion models with explicit goal constraints improve the accuracy and reliability of embodied visual planning, and the resulting visual plans provide useful guidance for robot control. Abstract: Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.

[130] FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

Yidi Liu,Zihao Fan,Jie Huang,Jie Xiao,Dong Li,Wenlong Zhang,Lei Bai,Xueyang Fu,Zheng-Jun Zha

Main category: cs.CV

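TL;DR: Proposes a Fine-grained Perceptual Reward Model (FinPercep-RM) and a Co-evolutionary Curriculum Learning (CCL) mechanism for image super-resolution (ISR), addressing the reward hacking that arises when reinforcement learning relies on image quality models that output a single global score.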

Details Motivation: Traditional Image Quality Assessment (IQA) models output only a global score and miss subtle local distortions, which leads to reward hacking when ISR models are optimized with Reinforcement Learning with Human Feedback (RLHF). A more sensitive, fine-grained reward model is needed to reflect perceptual quality accurately. Method: Proposes FinPercep-RM, an encoder-decoder model that outputs a global quality score and also generates a Perceptual Degradation Map that localizes and quantifies local defects; builds the FGR-30k dataset to train it; and designs a Co-evolutionary Curriculum Learning (CCL) mechanism in which the reward model and the ISR generator follow synchronized easy-to-hard curricula, improving stability and suppressing reward hacking. Result: Experiments show the method outperforms existing RLHF methods across several ISR models, clearly improving global quality and local realism while easing training instability and reward hacking. Conclusion: A fine-grained reward model combined with a co-evolutionary curriculum aligns optimization with perceptual quality more effectively in image super-resolution and offers a more reliable and stable framework for preference-based reinforcement learning. Abstract: Reinforcement Learning with Human Feedback (RLHF) has proven effective in the image generation field, where reward models guide alignment with human preferences. Motivated by this, adapting RLHF for Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality with Image Quality Assessment (IQA) models as reward models. However, traditional IQA models usually output a single global score, which is exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and resulting in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validate the effectiveness of our method across ISR models, improving both global quality and local realism over existing RLHF methods.
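The easy-to-hard transition from a global reward to the fine-grained reward can be sketched as a scheduled blend. Everything specific here is an assumption for illustration: the linear schedule, the way the degradation map is reduced to a mean penalty, and the subtraction used to form the fine-grained reward are not given in the abstract.

```python
from typing import List


def curriculum_weight(step: int, total_steps: int) -> float:
    """Linear schedule in [0, 1]: 0 = global reward only, 1 = fine-grained reward only."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def blended_reward(global_score: float,
                   degradation_map: List[List[float]],
                   step: int, total_steps: int) -> float:
    """Blend the global score with a map-penalized score as training progresses."""
    alpha = curriculum_weight(step, total_steps)
    flat = [v for row in degradation_map for v in row]
    local_penalty = sum(flat) / len(flat)          # assumed: mean defect severity
    fine_grained = global_score - local_penalty    # assumed: penalize local defects
    return (1.0 - alpha) * global_score + alpha * fine_grained


# Hypothetical output: high global score but one visibly degraded patch.
deg_map = [[0.0, 0.0], [0.0, 0.4]]
r_start = blended_reward(0.8, deg_map, step=0, total_steps=100)
r_mid = blended_reward(0.8, deg_map, step=50, total_steps=100)
r_end = blended_reward(0.8, deg_map, step=100, total_steps=100)
```

Early in training the generator sees only the smooth global signal, and the local-defect penalty is phased in once it has converged enough for that stricter signal to be informative, so an artifact that inflates the global score stops paying off later in training.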

[131] Visual Autoregressive Modelling for Monocular Depth Estimation

Amir El-Ghoussani,André Kaup,Nassir Navab,Gustavo Carneiro,Vasileios Belagiannis

Main category: cs.CV

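TL;DR: Proposes a monocular depth estimation method based on visual autoregressive (VAR) priors as an alternative to diffusion models, with strong results on indoor and outdoor datasets.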

Details Motivation: To find depth estimation approaches beyond diffusion models and to explore the potential of autoregressive models as geometry-aware generative models. Method: Adapts a large-scale text-to-image VAR model, introduces a scale-wise conditional upsampling mechanism with classifier-free guidance, performs inference in ten fixed autoregressive stages, and fine-tunes on only 74K synthetic samples. Result: Achieves state-of-the-art performance on indoor benchmarks under constrained training conditions and performs strongly on outdoor datasets. Conclusion: Autoregressive priors are a promising family of geometry-aware generative models, with good data scalability and adaptability to 3D vision tasks. Abstract: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability, and adaptability to 3D vision tasks. Code available at "https://github.com/AmirMaEl/VAR-Depth".
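Classifier-free guidance, mentioned in the abstract, is commonly written as uncond + s * (cond - uncond), where the two terms are the model's predictions with and without the conditioning input. The sketch below applies that standard formula to plain lists standing in for per-token predictions; how the paper wires guidance into its scale-wise upsampling is not described in the abstract, and the guidance scale values are arbitrary.

```python
from typing import List


def classifier_free_guidance(cond: List[float],
                             uncond: List[float],
                             scale: float) -> List[float]:
    """Standard CFG: move from the unconditional prediction toward the conditional one."""
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions with and without conditioning on the input image.
cond_pred = [2.0, 0.0]
uncond_pred = [1.0, 1.0]
no_guidance = classifier_free_guidance(cond_pred, uncond_pred, scale=0.0)
plain_cond = classifier_free_guidance(cond_pred, uncond_pred, scale=1.0)
amplified = classifier_free_guidance(cond_pred, uncond_pred, scale=3.0)
```

A scale of 1 reproduces the conditional prediction, and larger scales extrapolate past it, strengthening the influence of the conditioning signal at the cost of diversity.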

[132] Investigating Deep Learning Models for Ejection Fraction Estimation from Echocardiography Videos

Shravan Saranyan,Pramit Saha

Main category: cs.CV

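TL;DR: Evaluates several deep learning architectures for estimating left ventricular ejection fraction (LVEF) from echocardiography videos, finds that a modified 3D Inception model performs best with an RMSE of 6.79%, and notes that the model design and training lessons apply broadly to medical video analysis.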

Details Motivation: Manual assessment of cardiac function from echocardiograms is time-consuming and shows considerable inter-observer variability, so an efficient, accurate automated method is needed to improve clinical efficiency and consistency. Method: Systematically evaluates deep learning architectures including 3D Inception, two-stream networks, and CNN-RNN models, compares architectural modifications and feature fusion strategies, and trains and validates on the EchoNet-Dynamic dataset (10,030 videos). Result: The modified 3D Inception architecture performs best, with an RMSE of 6.79%; smaller, simpler models generalize better, and performance is highly sensitive to hyperparameters such as convolutional kernel size and normalization strategy. Conclusion: Deep learning can be used effectively for automated LVEF estimation, with 3D Inception performing best, and the design and training lessons may carry over to other medical video analysis tasks. Abstract: Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and plays a central role in the diagnosis and management of cardiovascular disease. Echocardiography, as a readily accessible and non-invasive imaging modality, is widely used in clinical practice to estimate LVEF. However, manual assessment of cardiac function from echocardiograms is time-consuming and subject to considerable inter-observer variability. Deep learning approaches offer a promising alternative, with the potential to achieve performance comparable to that of experienced human experts. In this study, we investigate the effectiveness of several deep learning architectures for LVEF estimation from echocardiography videos, including 3D Inception, two-stream, and CNN-RNN models. We systematically evaluate architectural modifications and fusion strategies to identify configurations that maximize prediction accuracy. Models were trained and evaluated on the EchoNet-Dynamic dataset, comprising 10,030 echocardiogram videos. Our results demonstrate that modified 3D Inception architectures achieve the best overall performance, with a root mean squared error (RMSE) of 6.79%. Across architectures, we observe a tendency toward overfitting, with smaller and simpler models generally exhibiting improved generalization. Model performance was also found to be highly sensitive to hyperparameter choices, particularly convolutional kernel sizes and normalization strategies. While this study focuses on echocardiography-based LVEF estimation, the insights gained regarding architectural design and training strategies may be applicable to a broader range of medical and non-medical video analysis tasks.
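As a quick reference for the quantities involved, ejection fraction is conventionally defined as (EDV - ESV) / EDV x 100, where EDV and ESV are the end-diastolic and end-systolic volumes, and the reported RMSE is measured in the same percentage-point units. The sketch below computes both on made-up numbers; the volumes and predictions are illustrative and do not come from EchoNet-Dynamic.

```python
import math
from typing import Sequence


def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    if edv_ml <= 0 or esv_ml < 0 or esv_ml > edv_ml:
        raise ValueError("volumes must satisfy 0 <= ESV <= EDV and EDV > 0")
    return (edv_ml - esv_ml) / edv_ml * 100.0


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error between predicted and reference values."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


ef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)
# Hypothetical model outputs versus reference LVEF values, in percent.
error = rmse([55.0, 60.0, 40.0], [58.0, 54.0, 43.0])
```

Because RMSE squares each error before averaging, a few badly mispredicted videos can dominate the score, which is worth keeping in mind when comparing architectures that differ by fractions of a percentage point.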

[133] Unleashing Foundation Vision Models: Adaptive Transfer for Diverse Data-Limited Scientific Domains

Qiankun Li,Feng He,Huabao Chen,Xin Ning,Kun Wang,Zengfu Wang

Main category: cs.CV

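TL;DR: Proposes the Cluster Attention Adapter (CLAdapter), a method for transferring knowledge from large-scale pre-trained models to data-limited downstream scientific tasks. CLAdapter introduces attention mechanisms and cluster centers, uses distribution correlation and transformation matrices to enhance feature representations, and integrates with CNN and Transformer architectures in both 2D and 3D settings. Experiments on 10 cross-domain datasets show state-of-the-art performance across data-limited scenarios.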

Details Motivation: Despite major progress in large-scale datasets and pre-trained models, transferring models to specialized, data-scarce scientific domains remains difficult. A method is needed that adapts to different downstream tasks while making full use of pre-trained knowledge. Method: Proposes CLAdapter, which combines attention mechanisms with cluster centers and, by modeling correlations in the feature distribution and applying transformation matrices, enhances pre-trained features in a task-specific way; its unified interface supports multiple architectures (such as ViT and ConvNeXt) and dimensionalities (2D and 3D). Result: Validated on 10 datasets covering biological, medical, industrial, agricultural, environmental, geographical, materials science, OOD, and 3D analysis domains; CLAdapter achieves SOTA performance across these data-limited tasks. Conclusion: CLAdapter is an efficient, general adapter that unlocks the potential of foundation vision models for diverse, data-limited scientific tasks and advances adaptive transfer learning. Abstract: In the big data era, the computer vision field benefits from large-scale datasets such as LAION-2B, LAION-400M, ImageNet-21K, and Kinetics, on which popular models like the ViT and ConvNeXt series have been pre-trained, acquiring substantial knowledge. However, numerous downstream tasks in specialized and data-limited scientific domains continue to pose significant challenges. In this paper, we propose a novel Cluster Attention Adapter (CLAdapter), which refines and adapts the rich representations learned from large-scale data to various data-limited downstream tasks. Specifically, CLAdapter introduces attention mechanisms and cluster centers to personalize the enhancement of transformed features through distribution correlation and transformation matrices. This enables models fine-tuned with CLAdapter to learn distinct representations tailored to different feature sets, facilitating the models' adaptation from rich pre-trained features to various downstream scenarios effectively. In addition, CLAdapter's unified interface design allows for seamless integration with multiple model architectures, including CNNs and Transformers, in both 2D and 3D contexts. Through extensive experiments on 10 datasets spanning domains such as generic, multimedia, biological, medical, industrial, agricultural, environmental, geographical, materials science, out-of-distribution (OOD), and 3D analysis, CLAdapter achieves state-of-the-art performance across diverse data-limited scientific domains, demonstrating its effectiveness in unleashing the potential of foundation vision models via adaptive transfer. Code is available at https://github.com/qklee-lz/CLAdapter.

[134] INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

Mert Ikinci,Luna Toma,Karin U. Loeffler,Leticia Ussem,Daniela Süsskind,Julia M. Weller,Yousef Yeganeh,Martina C. Herwig-Carl,Shadi Albarqouni

Main category: cs.CV

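TL;DR: Proposes INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological criteria for conjunctival melanocytic intraepithelial lesions (CMIL), improving grading accuracy.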

Details Motivation: Accurate CMIL grading is essential for treatment and melanoma prediction, but subtle morphological cues and interrelated diagnostic criteria make stable, consistent assessment difficult for existing methods. Method: INTERACT-CMIL, a multi-head deep learning framework, jointly predicts five pathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through shared feature learning, combinatorial partial supervision, and a cross-task dependency loss (Inter-Dependence Loss). Result: On a multi-center dataset of 486 expert-annotated biopsy patches, INTERACT-CMIL clearly outperforms CNN and foundation-model baselines, with relative macro F1 gains of up to 55.1% (WHO4) and 25.0% (vertical spread). Conclusion: The framework provides consistent, interpretable multi-criteria predictions that agree with expert grading, establishes a reproducible computational benchmark for CMIL diagnosis, and advances the standardization of digital ocular pathology. Abstract: Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.

[135] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

ZhenQi Chen,TsaiChing Ni,YuanFu Yang

Main category: cs.CV

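TL;DR: This paper proposes CritiFusion, a new method that combines a multimodal semantic critique mechanism with frequency-domain refinement at inference time to improve the semantic consistency and detail of text-to-image generation.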

Details Motivation: Existing text-to-image diffusion models produce high visual quality but still struggle with semantic alignment to complex prompts. Method: A CritiCore module uses a vision-language model and multiple large language models to generate high-level semantic feedback; SpecFusion merges intermediate generation states in the spectral domain to preserve high-frequency details while injecting structural information. Result: On standard benchmarks, the method notably improves human-aligned metrics of text-image correspondence, preference scores, and aesthetic evaluations, with results on par with state-of-the-art reward optimization approaches. Conclusion: As a training-free, plug-and-play module, CritiFusion improves the prompt fidelity, detail, and realism of existing diffusion models. Abstract: Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
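
The idea of merging two states in the spectral domain, taking coarse structure from one and fine detail from the other, can be illustrated with a toy one-dimensional sketch. This is not the authors' SpecFusion implementation: it assumes a hard frequency cutoff, works on short lists rather than image latents, and the names `dft`, `idft`, and `spectral_merge` are hypothetical.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft; returns the real part of each reconstructed sample."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def spectral_merge(structure, detail, cutoff):
    """Keep frequencies <= cutoff from `structure` and the rest from `detail`."""
    n = len(structure)
    if len(detail) != n:
        raise ValueError("both signals must have the same length")
    S, D = dft(structure), dft(detail)
    merged = []
    for k in range(n):
        freq = min(k, n - k)  # distance from DC, accounting for spectrum symmetry
        merged.append(S[k] if freq <= cutoff else D[k])
    return idft(merged)

coarse = [1.0] * 8        # pure low-frequency (constant) content
fine = [1.0, -1.0] * 4    # pure high-frequency (alternating) content
merged = spectral_merge(coarse, fine, cutoff=1)
```

In this toy case the merged signal keeps the constant level of `coarse` and the alternating pattern of `fine`. A real pipeline would presumably use a 2D FFT and a smoother frequency mask; the abstract does not specify either.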

[136] Autoregressive Flow Matching for Motion Prediction

Johnathan Xie,Stefan Stojanov,Cristobal Eyzaguirre,Daniel L. K. Yamins,Jiajun Wu

Main category: cs.CV

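TL;DR: Proposes autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data, trained on diverse video datasets to generate long-horizon point track predictions.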

Details Motivation: Existing motion prediction models are usually trained on narrow distributions and generalize poorly, while large-scale video prediction achieves strong visual quality but models complex motion inadequately. A method that scales and predicts complex motion accurately is therefore needed. Method: The authors develop autoregressive flow matching (ARFM), which models future point tracks probabilistically, is trained on diverse video data, and supports long-horizon motion prediction. They also build new benchmarks for evaluating human and robot motion prediction. Result: ARFM predicts complex motions effectively, and conditioning robot action prediction and human motion prediction on predicted future tracks significantly improves downstream task performance. Conclusion: ARFM is a scalable, effective motion prediction framework that, by training on diverse data and incorporating track prediction, offers a new solution for human and robot motion prediction. Abstract: Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data, and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models are publicly available at: https://github.com/Johnathan-Xie/arfm-motion-prediction.

[137] Multimodal Diffeomorphic Registration with Neural ODEs and Structural Descriptors

Salvador Rodriguez-Sanz,Monica Hernandez

Main category: cs.CV

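TL;DR: Proposes a Neural ODE-based multimodal diffeomorphic registration method that combines structural descriptors with local mutual information and outperforms existing methods in accuracy and robustness.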

Details Motivation: Existing nonrigid registration methods trade off accuracy, computational complexity, and regularization, and most are restricted to a monomodal assumption, which makes them hard to extend to multimodal settings. Method: The method builds a continuous-depth network model with Neural ODEs, introduces modality-agnostic structural descriptors (image-based or feature-based), combines them with local mutual information as a similarity measure, and proposes three variants for multimodal registration. Result: Experiments on several dataset combinations show that the method outperforms existing methods for large and small deformations and for mono- and multimodal registration, with good robustness, multi-scale adaptability, and computational efficiency. Conclusion: The method effectively handles modality differences in multimodal registration without requiring large amounts of training data, generalizes well and is stable, and is suitable for medical image registration under complex deformations. Abstract: This work proposes a multimodal diffeomorphic registration method using Neural Ordinary Differential Equations (Neural ODEs). Nonrigid registration algorithms exhibit tradeoffs between their accuracy, the computational complexity of their deformation model, and its proper regularization. In addition, they also assume intensity correlation in anatomically homologous regions of interest among image pairs, limiting their applicability to the monomodal setting. Unlike learning-based models, we propose an instance-specific framework that is not subject to high scan requirements for training and does not suffer performance degradation at inference time on modalities unseen during training. Our method exploits the potential of continuous-depth networks in the Neural ODE paradigm with structural descriptors, widely adopted as modality-agnostic metric models which exploit self-similarities on parameterized neighborhood geometries. We propose three different variants that integrate image-based or feature-based structural descriptors and nonstructural image similarities computed by local mutual information. We conduct extensive evaluations on different experiments formed by scan dataset combinations and show superior qualitative and quantitative results compared to state-of-the-art baselines suited to large or small deformations and specific to multimodal registration. Lastly, we also demonstrate the underlying robustness of the proposed framework to varying levels of explicit regularization while maintaining low error, its suitability for registration at varying scales, and its efficiency with respect to other methods targeted to large-deformation registration.

[138] SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis

Paul Dobre,Jackson Cooper,Xin Wang,Hongzhou Yang

Main category: cs.CV

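TL;DR: Proposes SCPainter, a framework that combines 3D Gaussian Splat assets with a diffusion model to unify realistic 3D asset insertion and novel view synthesis.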

Details Motivation: Existing methods handle 3D asset insertion and novel view synthesis in isolation, which makes scene interaction and diverse training-scenario generation difficult; a unified framework is needed to improve the realism and diversity of autonomous driving simulation. Method: 3D Gaussian Splat (GS) vehicle assets and 3D scene point clouds are projected together into novel views, and the projections condition a diffusion model that generates high-quality images. Result: Experiments on the Waymo Open Dataset validate the framework, which achieves high-quality 3D asset insertion and novel view synthesis at the same time. Conclusion: SCPainter models asset insertion and NVS jointly, improving the realism and diversity of autonomous driving simulation data. Abstract: 3D asset insertion and novel view synthesis (NVS) are key components for autonomous driving simulation, enhancing the diversity of training data. With better training data that is diverse and covers a wide range of situations, including long-tailed driving scenarios, autonomous driving models can become more robust and safer. This motivates a unified simulation framework that can jointly handle realistic integration of inserted 3D assets and NVS. Recent 3D asset reconstruction methods enable reconstruction of dynamic actors from video, supporting their re-insertion into simulated driving scenes. While the overall structure and appearance can be accurate, these methods still struggle to capture the realism of 3D assets through lighting or shadows, particularly when inserted into scenes. In parallel, recent advances in NVS methods have demonstrated promising results in synthesizing viewpoints beyond the originally recorded trajectories. However, existing approaches largely treat asset insertion and NVS capabilities in isolation. To allow for interaction with the rest of the scene and to enable more diverse creation of new scenarios for training, realistic 3D asset insertion should be combined with NVS. To address this, we present SCPainter (Street Car Painter), a unified framework which integrates 3D Gaussian Splat (GS) car asset representations and 3D scene point clouds with diffusion-based generation to jointly enable realistic 3D asset insertion and NVS. The 3D GS assets and 3D scene point clouds are projected together into novel views, and these projections are used to condition a diffusion model to generate high quality images. Evaluation on the Waymo Open Dataset demonstrates the capability of our framework to enable 3D asset insertion and NVS, facilitating the creation of diverse and realistic driving data.

[139] Improved cystic hygroma detection from prenatal imaging using ultrasound-specific self-supervised representation learning

Youssef Megahed,Robin Ducharme,Inok Lee,Inbal Willner,Olivier X. Miguel,Kevin Dick,Adrian D. C. Chan,Mark Walker,Steven Hawken

Main category: cs.CV

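TL;DR: This study evaluates the accuracy and robustness of a deep learning model with ultrasound-specific self-supervised pretraining (USF-MAE) for detecting cystic hygroma in first-trimester ultrasound images; the results show that it significantly outperforms the conventional DenseNet-169 model.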

Details Motivation: Cystic hygroma is a high-risk finding on prenatal ultrasound, but existing supervised deep learning methods are limited by small labelled datasets and struggle to detect it with high accuracy. Method: The USF-MAE model, pretrained on over 370,000 unlabelled ultrasound images, is fine-tuned for binary classification of cystic hygroma versus normal controls on the same dataset and 4-fold cross-validation protocol as the baseline; performance is evaluated with accuracy, sensitivity, specificity, and ROC-AUC, and Score-CAM is used for visual analysis. Result: USF-MAE reaches a mean accuracy of 0.96, sensitivity of 0.94, specificity of 0.98, and ROC-AUC of 0.98, all above the DenseNet-169 baseline (0.93, 0.92, 0.94, and 0.94, respectively), and the difference is statistically significant (p = 0.0057). Score-CAM shows that the model attends to the fetal neck region, which is clinically plausible. Conclusion: Ultrasound-specific self-supervised pretraining can significantly improve cystic hygroma detection when labelled data are scarce, supporting its potential use in early screening programs. Abstract: Cystic hygroma is a high-risk prenatal ultrasound finding that portends high rates of chromosomal abnormalities, structural malformations, and adverse pregnancy outcomes. Automated detection can increase reproducibility and support scalable early screening programs, but supervised deep learning methods are limited by small labelled datasets. This study assesses whether ultrasound-specific self-supervised pretraining can facilitate accurate, robust deep learning detection of cystic hygroma in first-trimester ultrasound images. We fine-tuned the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), pretrained on over 370,000 unlabelled ultrasound images, for binary classification of normal controls and cystic hygroma cases used in this study. Performance was evaluated on the same curated ultrasound dataset, preprocessing pipeline, and 4-fold cross-validation protocol as for the DenseNet-169 baseline, using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). Model interpretability was analyzed qualitatively using Score-CAM visualizations. USF-MAE outperformed the DenseNet-169 baseline on all evaluation metrics. The proposed model yielded a mean accuracy of 0.96, sensitivity of 0.94, specificity of 0.98, and ROC-AUC of 0.98 compared to 0.93, 0.92, 0.94, and 0.94 for the DenseNet-169 baseline, respectively. Qualitative Score-CAM visualizations of model predictions demonstrated clinical relevance by highlighting expected regions in the fetal neck for both positive and negative cases. Paired statistical analysis using a Wilcoxon signed-rank test confirmed that performance improvements achieved by USF-MAE were statistically significant (p = 0.0057).

[140] Split4D: Decomposed 4D Scene Reconstruction Without Video Segmentation

Yongzhen Hu,Yihui Yang,Haotong Lin,Yifan Wang,Junting Dong,Yifu Deng,Xinyu Zhu,Fan Jia,Hujun Bao,Xiaowei Zhou,Sida Peng

Main category: cs.CV

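TL;DR: This paper proposes Freetime FeatureGS, a new approach to decomposed 4D scene reconstruction from multi-view videos that uses a streaming feature learning strategy and movable Gaussian primitives to achieve high-quality 4D reconstruction without relying on video segmentation.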

Details Motivation: Existing methods depend on unstable video segmentation results, which makes reconstruction quality unreliable; this paper aims to remove the dependence on video segmentation and improve the stability and accuracy of 4D scene reconstruction. Method: Dynamic scenes are represented with Freetime FeatureGS, which models the scene as Gaussian primitives with learnable features and linear motion; a contrastive loss optimizes feature distances according to whether projections belong to the same instance in the 2D segmentation map, and temporally ordered sampling propagates features in a streaming manner. Result: Experiments on several datasets show that the method clearly outperforms recent methods and substantially improves the quality of decomposed 4D scene reconstruction. Conclusion: Freetime FeatureGS combined with the streaming feature learning strategy overcomes the instability of video segmentation and yields more accurate, robust 4D scene reconstruction. Abstract: This paper addresses the problem of decomposed 4D scene reconstruction from multi-view videos. Recent methods achieve this by lifting video segmentation results to a 4D representation through differentiable rendering techniques. Therefore, they heavily rely on the quality of video segmentation maps, which are often unstable, leading to unreliable reconstruction results. To overcome this challenge, our key idea is to represent the decomposed 4D scene with the Freetime FeatureGS and design a streaming feature learning strategy to accurately recover it from per-image segmentation maps, eliminating the need for video segmentation. Freetime FeatureGS models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion ability, allowing them to move to neighboring regions over time. We apply a contrastive loss to Freetime FeatureGS, forcing primitive features to be close or far apart based on whether their projections belong to the same instance in the 2D segmentation map. As our Gaussian primitives can move across time, this naturally extends the feature learning to the temporal dimension, achieving 4D segmentation. Furthermore, we sample observations for training in a temporally ordered manner, enabling the streaming propagation of features over time and effectively avoiding local minima during the optimization process. Experimental results on several datasets show that the reconstruction quality of our method outperforms recent methods by a large margin.

[141] TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts

Hao Zhang,Mengsi Lyu,Bo Huang,Yulong Ao,Yonghua Lin

Main category: cs.CV

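TL;DR: Proposes an adaptive visual token pruning method for long-context, multi-image settings that decomposes redundancy and allocates budgets dynamically, significantly reducing the number of visual tokens while maintaining performance.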

Details Motivation: Existing visual token pruning methods often overlook long-context, multi-image inputs, which leads to high inference cost and low efficiency. Method: Redundancy is decomposed into intra-image and inter-image components, quantified through intra-image diversity and inter-image variation, which guide dynamic budget allocation. The method has two stages: the intra-image stage allocates a content-aware budget and selects representative tokens, and the inter-image stage balances diversity with text alignment through global diversity filtering and Pareto selection. Result: In long-context, multi-image settings, the method significantly reduces the number of visual tokens while maintaining model performance. Conclusion: The method addresses the challenges of visual token pruning in long-context, multi-image scenarios and improves the inference efficiency of LMMs. Abstract: Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach maintains strong performance in long context settings while significantly cutting down the number of visual tokens.

[142] Neighbor-Aware Token Reduction via Hilbert Curve for Vision Transformers

Yunge Li,Lanyu Xu

Main category: cs.CV

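TL;DR: This paper proposes neighbor-aware token reduction based on Hilbert curve reordering, which preserves 2D spatial neighborhood structure to improve the computational efficiency of Vision Transformers.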

Details Motivation: Existing token merging and pruning strategies often ignore spatial continuity and neighbor relationships, which loses local context and hurts ViT efficiency and performance. Method: Tokens are reordered along a Hilbert curve so that 2D spatial adjacency is preserved in the 1D sequence, and two strategies are proposed: Neighbor-Aware Pruning (NAP) and Merging by Adjacent Token similarity (MAT). Result: Experiments show accuracy-efficiency trade-offs better than existing methods on multiple benchmarks. Conclusion: Preserving spatial continuity and neighbor structure is important for token reduction in ViTs and offers new ideas for ViT architecture optimization. Abstract: Vision Transformers (ViTs) have achieved remarkable success in visual recognition tasks, but redundant token representations limit their computational efficiency. Existing token merging and pruning strategies often overlook spatial continuity and neighbor relationships, resulting in the loss of local context. This paper proposes novel neighbor-aware token reduction methods based on Hilbert curve reordering, which explicitly preserves the neighbor structure in a 2D space using 1D sequential representations. Our method introduces two key strategies: Neighbor-Aware Pruning (NAP) for selective token retention and Merging by Adjacent Token similarity (MAT) for local token aggregation. Experiments demonstrate that our approach achieves state-of-the-art accuracy-efficiency trade-offs compared to existing methods. This work highlights the importance of spatial continuity and neighbor structure, offering new insights for the architectural optimization of ViTs.
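
To make the reordering step concrete, the sketch below maps patch coordinates on a square grid to Hilbert-curve indices using the standard textbook conversion, then compares the result with ordinary raster order. It is only an illustration of why Hilbert ordering keeps neighbors together; it does not reproduce the paper's NAP or MAT strategies, and the helper names and the 8 x 8 grid size are assumptions.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its Hilbert-curve index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(n):
    """Return all (x, y) patch coordinates of an n x n grid in Hilbert order."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))

def max_step(cells):
    """Largest Manhattan distance between consecutive cells in a sequence."""
    return max(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(cells, cells[1:]))

n = 8  # hypothetical 8 x 8 patch grid
raster = [(x, y) for y in range(n) for x in range(n)]
hilbert = hilbert_order(n)
print(max_step(raster), max_step(hilbert))
```

In raster order the token at the end of one row is followed by the token at the start of the next, a jump across the whole image, whereas consecutive tokens in Hilbert order are always direct grid neighbors. That property is what makes "adjacent in the sequence" a usable proxy for "adjacent in the image".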

[143] Next Best View Selections for Semantic and Dynamic 3D Gaussian Splatting

Yiqian Li,Wen Jiang,Kostas Daniilidis

Main category: cs.CV

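TL;DR: Proposes a Fisher information-based active learning algorithm that selects the most informative views for semantic and dynamic scene modeling and outperforms random and uncertainty-based heuristics.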

Details Motivation: Semantic and dynamic scene understanding involve substantial data redundancy, so the most informative frames must be selected effectively to improve training efficiency. Method: View selection is formulated as an active learning problem, and Fisher information quantifies the information gain of candidate views with respect to semantic Gaussian parameters and deformation networks. Result: The method is validated on large-scale static image and dynamic video datasets, with clear improvements in rendering quality and semantic segmentation performance. Conclusion: The method offers a principled solution for jointly handling semantic reasoning and dynamic scene modeling and outperforms existing heuristic or random strategies. Abstract: Understanding semantics and dynamics has been crucial for embodied agents in various tasks. Both tasks have much more data redundancy than the static scene understanding task. We formulate the view selection problem as an active learning problem, where the goal is to prioritize frames that provide the greatest information gain for model training. To this end, we propose an active learning algorithm with Fisher Information that quantifies the informativeness of candidate views with respect to both semantic Gaussian parameters and deformation networks. This formulation allows our method to jointly handle semantic reasoning and dynamic scene modeling, providing a principled alternative to heuristic or random strategies. We evaluate our method on large-scale static images and dynamic video datasets by selecting informative frames from multi-camera setups. Experimental results demonstrate that our approach consistently improves rendering quality and semantic segmentation performance, outperforming baseline methods based on random selection and uncertainty-based heuristics.

[144] Plug In, Grade Right: Psychology-Inspired AGIQA

Zhicheng Liao,Baoliang Chen,Hanwei Zhu,Lingyu Zhu,Shiqi Wang,Weisi Lin

Main category: cs.CV

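TL;DR: Proposes an image quality assessment method based on an arithmetic Graded Response Model (AGQG) that models the monotonic difficulty of quality grades to mitigate semantic drift and improve existing AGIQA models.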

Details Motivation: In existing AGIQA models, semantic inconsistency between text and image embeddings produces multimodal similarity distributions, a "semantic drift" problem that undermines the reliability of quality assessment. Method: Inspired by psychometrics, the classical Graded Response Model (GRM) is introduced and a two-branch quality grading module is designed: one branch estimates image ability, and the other constructs arithmetically increasing difficulty levels, which ensures monotonic difficulty and a unimodal distribution. Result: The proposed AGQG module is plug-and-play, consistently improves performance when integrated into several SOTA AGIQA frameworks, and generalizes well to natural and screen content image quality assessment. Conclusion: By modeling monotonic difficulty in an interpretable way, AGQG mitigates semantic drift and provides a new building block for future image quality assessment models. Abstract: Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both "excellent" and "poor" grade descriptions while deviating from the "good" one. We refer to this phenomenon as "semantic drift", where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject's ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image's ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.
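
The grading idea can be sketched with the textbook Graded Response Model, in which the probability of reaching at least grade k is a sigmoid of ability minus that grade's difficulty. The code below is a minimal illustration, not the AGQG module: the ability, discrimination, and threshold values are made up, and generating the difficulties as an arithmetic sequence with a positive step is an assumption about what "arithmetic manner" means.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def arithmetic_thresholds(first, step, count):
    """Difficulty levels b_k = first + k * step; step > 0 makes them strictly increasing."""
    if step <= 0:
        raise ValueError("step must be positive to keep difficulties monotone")
    return [first + k * step for k in range(count)]

def grade_probabilities(ability, thresholds, discrimination=1.0):
    """Graded Response Model: P(grade = k) = P(grade >= k) - P(grade >= k + 1)."""
    # Cumulative probabilities, padded with P(grade >= 0) = 1 and P(grade >= K + 1) = 0.
    cumulative = [1.0] + [sigmoid(discrimination * (ability - b)) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

def is_unimodal(probs):
    """True if the sequence rises to a single peak and then falls."""
    peak = probs.index(max(probs))
    rising = all(probs[i] <= probs[i + 1] for i in range(peak))
    falling = all(probs[i] >= probs[i + 1] for i in range(peak, len(probs) - 1))
    return rising and falling

thresholds = arithmetic_thresholds(first=-2.0, step=1.0, count=4)  # 4 thresholds, 5 grades
probs = grade_probabilities(ability=0.3, thresholds=thresholds, discrimination=1.5)
print([round(p, 3) for p in probs], is_unimodal(probs))
```

Because the thresholds are strictly increasing, every grade probability is non-negative and the five values sum to one. With these particular numbers the distribution has a single peak; whether that holds in general depends on the spacing and discrimination, which is presumably part of what the learned module controls.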

[145] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

Ruoyu Wang,Ziyu Li,Beier Zhu,Liangyu Yuan,Hanwang Zhang,Xun Yang,Xiaojun Chang,Chi Zhang

Main category: cs.CV

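TL;DR: This paper proposes EPD-Solver, a new ordinary differential equation solver that uses multiple parallel gradient evaluations to reduce truncation error in diffusion model sampling, significantly improving generation quality under low latency.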

Details Motivation: Under low latency, existing acceleration methods cannot capture high-curvature trajectory segments, so truncation errors accumulate and image quality degrades severely. Method: Based on the Mean Value Theorem for vector-valued functions, EPD-Solver uses multiple parallel gradient evaluations to approximate the integral solution more accurately, and adopts a two-stage optimization framework: a small set of learnable parameters is first optimized through distillation, and a parameter-efficient reinforcement learning scheme then fine-tunes the policy. Result: EPD-Solver significantly reduces truncation error and improves generated image quality while keeping latency low, and it can serve as a plugin to improve existing ODE samplers in complex text-to-image tasks. Conclusion: Through parallel direction estimation and optimization in a low-dimensional solver space, EPD-Solver addresses the accuracy-efficiency trade-off of diffusion models under low-step sampling and is general and practical. Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving the low-latency nature of sampling. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.

[146] VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM

Jingchao Wang,Kaiwen Zhou,Zhijian Wu,Kunhua Ji,Dingjiang Huang,Yefeng Zheng

Main category: cs.CV

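TL;DR: This paper proposes VPTracker, the first global vision-language tracking framework based on multimodal large language models. A location-aware visual prompting mechanism incorporates spatial priors while the target is localized over the whole image, improving tracking stability and resistance to distractors under viewpoint changes, occlusion, and fast motion.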

Details Motivation: Existing vision-language tracking methods are usually limited to local search and easily lose the target under viewpoint changes, occlusion, or fast target motion; they lack global semantic reasoning, which causes unstable tracking and drift. Method: VPTracker uses the strong semantic reasoning of multimodal large language models for global search, and a location-aware visual prompting mechanism builds a region-level prompt from the target's previous location so that the model prioritizes region-level recognition and falls back to global inference only when necessary. Result: Experiments show that the method significantly improves tracking stability and target disambiguation in a range of challenging scenarios and effectively suppresses interference from visually or semantically similar objects. Conclusion: This work pioneers the use of multimodal large language models for global vision-language tracking; its prompting mechanism with spatial priors keeps the advantages of global search while improving robustness to distractors, opening a new direction for MLLMs in visual tracking. Abstract: Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce VPTracker, the first global tracking framework based on Multimodal Large Language Models (MLLMs), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.

[147] Medical Scene Reconstruction and Segmentation based on 3D Gaussian Representation

Bin Liu,Wenyan Tian,Huangxin Fu,Zizheng Li,Zhifen He,Bo Li

Main category: cs.CV

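TL;DR: Proposes an efficient 3D medical image reconstruction method based on 3D Gaussian and tri-plane representations that produces high-quality, anatomically coherent, and semantically stable images from sparse slices while significantly improving reconstruction efficiency.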

Details Motivation: Traditional 3D reconstruction of medical images is computationally expensive and, with sparse slices, prone to structural discontinuities and loss of detail, so it struggles to meet clinical accuracy requirements. Method: The method builds on 3D Gaussian and tri-plane representations, combining the strengths of Gaussian representation in efficient rendering and geometric expression to improve structural continuity and semantic consistency under sparse conditions. Result: Experiments on multimodal medical datasets such as ultrasound (US) and magnetic resonance imaging (MRI) show that the method still produces high-quality, anatomically coherent, and semantically stable reconstructions under sparse data while significantly improving reconstruction efficiency. Conclusion: The method offers an efficient and reliable new approach to 3D visualization and clinical analysis of medical images. Abstract: 3D reconstruction of medical images is a key technology in medical image analysis and clinical diagnosis, providing structural visualization support for disease assessment and surgical planning. Traditional methods are computationally expensive and prone to structural discontinuities and loss of detail in sparse slices, making it difficult to meet clinical accuracy requirements. To address these challenges, we propose an efficient 3D reconstruction method based on 3D Gaussian and tri-plane representations. This method not only maintains the advantages of Gaussian representation in efficient rendering and geometric representation but also significantly enhances structural continuity and semantic consistency under sparse slicing conditions. Experimental results on multimodal medical datasets such as US and MRI show that our proposed method can generate high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency. This provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images.

[148] Evaluating the Performance of Open-Vocabulary Object Detection in Low-quality Image

Po-Chih Wu

Main category: cs.CV

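TL;DR: This study evaluates existing open-vocabulary object detection models under low-quality image conditions and introduces a new dataset that simulates real-world low-quality images. Experiments show that performance declines little under mild degradation but drops sharply for all models under severe degradation, with OWLv2 performing best.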

Details Motivation: Open-vocabulary object detection aims to move beyond fixed category sets toward human-like recognition, but real applications often involve low-quality images, so the robustness of existing models under such conditions needs to be evaluated. Method: A new dataset simulating real-world low-quality images is built, and several mainstream open-vocabulary detection models (OWLv2, OWL-ViT, GroundingDINO, and Detic) are evaluated systematically at different degradation levels. Result: mAP changes little under mild degradation but falls sharply under severe degradation; OWLv2 is the most stable across degradation types and outperforms the other models. Conclusion: Current open-vocabulary detection models still face serious challenges under severe image degradation and need better robustness; OWLv2 adapts best and provides a baseline and direction for future work. Abstract: Open-vocabulary object detection enables models to localize and recognize objects beyond a predefined set of categories and is expected to achieve recognition capabilities comparable to human performance. In this study, we aim to evaluate the performance of existing models on open-vocabulary object detection tasks under low-quality image conditions. For this purpose, we introduce a new dataset that simulates low-quality images in the real world. In our evaluation experiment, we find that although open-vocabulary object detection models exhibited no significant decrease in mAP scores under low-level image degradation, the performance of all models dropped sharply under high-level image degradation. OWLv2 models consistently performed better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines. We will release our dataset and code to facilitate future studies.

[149] EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation

Libo Zhang,Zekun Li,Tianyu Li,Zeyu Cao,Rui Xu,Xiaoxiao Long,Wenjia Wang,Jingbo Wang,Yuan Liu,Wenping Wang,Daquan Zhou,Taku Komura,Zhiyang Dou

Main category: cs.CV

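TL;DR: This paper presents EgoReAct, the first autoregressive framework for real-time, causal generation of 3D spatially aligned human reaction motions from egocentric video, and builds the more precisely aligned Human Reaction Dataset (HRD) to address spatial inconsistency in existing datasets.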

Details Motivation: Existing egocentric reaction datasets (such as ViMo) suffer from severe spatial misalignment and do not match dynamic viewpoints with reaction motions, which makes it hard to model realistic, context-sensitive human reactions. A spatially aligned dataset and a generative model with strict causality and 3D spatial alignment are therefore needed. Method: The Human Reaction Dataset (HRD) is built to spatially align egocentric video with reaction motion; a Vector Quantised-Variational AutoEncoder compresses reaction motion into a compact latent space; a Generative Pre-trained Transformer generates motion autoregressively in real time, with 3D dynamic features (such as metric depth and head dynamics) added to strengthen spatial grounding. Result: Experiments show that EgoReAct clearly outperforms prior methods in realism, spatial consistency, and generation efficiency while maintaining strictly causal generation. Conclusion: With a high-quality dataset and an architecture based on VQ-VAE and GPT, EgoReAct generates 3D-aligned human reaction motions from egocentric video efficiently, realistically, and causally, offering new ideas for embodied intelligence and human-computer interaction. Abstract: Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth and head dynamics, during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.

[150] Depth Anything in $360^\circ$: Towards Scale Invariance in the Wild

Hualie Jiang,Ziyang Song,Zhiqiang Lou,Rui Xu,Minglang Tan

Main category: cs.CV

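TL;DR: This paper presents DA360, a version of Depth Anything V2 adapted for panoramic depth estimation. By introducing a learnable shift parameter and circular padding, it advances zero-shot panoramic depth estimation on indoor and outdoor datasets.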

Details Motivation: The zero-shot generalization of panoramic depth estimation in open-world scenes lags far behind that of perspective images, and training data are scarce, so a method that transfers knowledge from the perspective domain is needed. Method: Building on Depth Anything V2, the method learns a shift parameter from the ViT backbone, converting the scale- and shift-invariant output into a scale-invariant one that yields well-formed 3D point clouds; circular padding in the DPT decoder removes seam artifacts and preserves spherical continuity. Result: On standard indoor benchmarks and the newly built outdoor dataset Metropolis, DA360 reduces relative depth error by over 50% indoors and 10% outdoors compared with its base model, and improves relative error by about 30% over strong baselines such as PanDA. Conclusion: DA360 significantly improves zero-shot panoramic depth estimation and sets a new state of the art, with particular strengths in cross-domain generalization and spatial consistency. Abstract: Panoramic depth estimation provides a comprehensive solution for capturing complete $360^\circ$ environmental structural information, offering significant benefits for robotics and AR/VR applications. However, while extensively studied in indoor settings, its zero-shot generalization to open-world domains lags far behind perspective images, which benefit from abundant training data. This disparity makes transferring capabilities from the perspective domain an attractive solution. To bridge this gap, we present Depth Anything in $360^\circ$ (DA360), a panoramic-adapted version of Depth Anything V2. Our key innovation involves learning a shift parameter from the ViT backbone, transforming the model's scale- and shift-invariant output into a scale-invariant estimate that directly yields well-formed 3D point clouds. This is complemented by integrating circular padding into the DPT decoder to eliminate seam artifacts, ensuring spatially coherent depth maps that respect spherical continuity. Evaluated on standard indoor benchmarks and our newly curated outdoor dataset, Metropolis, DA360 shows substantial gains over its base model, achieving over 50\% and 10\% relative depth error reduction on indoor and outdoor benchmarks, respectively. Furthermore, DA360 significantly outperforms robust panoramic depth estimation methods, achieving about 30\% relative error improvement compared to PanDA across all three test datasets and establishing new state-of-the-art performance for zero-shot panoramic depth estimation.
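
Why circular padding removes seam artifacts can be seen in a small sketch. In an equirectangular panorama the left and right image edges are physically adjacent, so a filter near the border should see pixels from the opposite edge rather than zeros. The code below is a toy illustration on plain Python lists, not the DA360 decoder; the function names and the 3-tap box filter are assumptions chosen for brevity.

```python
def circular_pad_width(image, pad):
    """Pad each row of a 2D list by wrapping columns around the left/right seam."""
    if pad < 0 or (image and pad > len(image[0])):
        raise ValueError("pad must be between 0 and the image width")
    if pad == 0:
        return [list(row) for row in image]
    return [list(row[-pad:]) + list(row) + list(row[:pad]) for row in image]

def zero_pad_width(image, pad):
    """Conventional zero padding along the width, shown for comparison."""
    return [[0] * pad + list(row) + [0] * pad for row in image]

def box_filter_row(row, pad_fn):
    """3-tap mean filter over one row, using the given padding function."""
    padded = pad_fn([row], 1)[0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(row))]

row = [9, 0, 0, 9]  # a bright object that straddles the panorama seam
circular = box_filter_row(row, circular_pad_width)
zeroed = box_filter_row(row, zero_pad_width)
print(circular, zeroed)
```

With circular padding the two seam columns receive the same filtered value, as they should for neighboring pixels on the sphere. With zero padding both are pulled toward zero, which is the kind of discontinuity that shows up as a visible seam in a depth map.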

[151] KANO: Kolmogorov-Arnold Neural Operator for Image Super-Resolution

Chenyu Li,Danfeng Hong,Bing Zhang,Zhaojie Pan,Jocelyn Chanussot

Main category: cs.CV

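TL;DR: This paper proposes KANO, an interpretable neural operator based on the Kolmogorov-Arnold theorem for single-image super-resolution; it approximates spectral curves with B-spline functions to model complex degradation processes transparently.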

Details Motivation: Existing super-resolution methods rely on black-box networks, which makes the degradation process hard to interpret or control and leaves the results without physical interpretability. Method: The KANO operator is designed around the Kolmogorov-Arnold theorem; an additive structure of a finite number of B-spline functions approximates continuous spectral curves piecewise, and the spline shape parameters are optimized to capture key spectral characteristics. Result: KANO is validated on natural images, aerial photographs, and satellite remote sensing data, and the study clarifies the respective strengths and weaknesses of MLPs and KANs in sequence fitting. Conclusion: KANO offers a physically interpretable paradigm for degradation modeling in image super-resolution and advances interpretable SR techniques. Abstract: The highly nonlinear degradation process, complex physical interactions, and various sources of uncertainty render single-image super-resolution (SR) a particularly challenging task. Existing interpretable SR approaches, whether based on prior learning or deep unfolding optimization frameworks, typically rely on black-box deep networks to model latent variables, which leaves the degradation process largely unknown and uncontrollable. Inspired by the Kolmogorov-Arnold theorem (KAT), we propose, for the first time, an interpretable operator termed the Kolmogorov-Arnold Neural Operator (KANO) and apply it to image SR. KANO provides a transparent and structured representation of the latent degradation fitting process. Specifically, we employ an additive structure composed of a finite number of B-spline functions to approximate continuous spectral curves in a piecewise fashion. By learning and optimizing the shape parameters of these spline functions within defined intervals, our KANO accurately captures key spectral characteristics, such as local linear trends and the peak-valley structures at nonlinear inflection points, thereby endowing SR results with physical interpretability. Furthermore, through theoretical modeling and experimental evaluations across natural images, aerial photographs, and satellite remote sensing data, we systematically compare multilayer perceptrons (MLPs) and Kolmogorov-Arnold networks (KANs) in handling complex sequence fitting tasks. This comparative study elucidates the respective advantages and limitations of these models in characterizing intricate degradation mechanisms, offering valuable insights for the development of interpretable SR techniques.

[152] 3D Scene Change Modeling With Consistent Multi-View Aggregation

Zirui Zhou,Junfeng Ni,Shujie Zhang,Yixin Chen,Siyuan Huang

Main category: cs.CV

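TL;DR: Proposes SCaR-3D, a framework for object-level change detection and continual reconstruction in 3D scenes that achieves spatially consistent change identification through 2D differencing and multi-view aggregation, and releases a new dataset, CCS3D.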

Details Motivation: Existing 3D change detection methods produce spatially inconsistent results, do not explicitly separate pre- and post-change states, and lack fine-grained modeling of object-level changes. Method: A signed-distance-based 2D differencing module is followed by multi-view aggregation with voting and pruning; the consistency of 3DGS is used to separate pre- and post-change states, and a continual reconstruction strategy selectively updates dynamic regions. Result: The method is validated on the authors' synthetic dataset CCS3D, where experiments show that it outperforms existing methods in both accuracy and efficiency. Conclusion: SCaR-3D delivers accurate, efficient, and spatially consistent object-level 3D change detection and continual scene reconstruction, providing a reliable basis for downstream applications. Abstract: Change detection plays a vital role in scene monitoring, exploration, and continual reconstruction. Existing 3D change detection methods often exhibit spatial inconsistency in the detected changes and fail to explicitly separate pre- and post-change states. To address these limitations, we propose SCaR-3D, a novel 3D scene change detection framework that identifies object-level changes from a dense-view pre-change image sequence and sparse-view post-change images. Our approach consists of a signed-distance-based 2D differencing module followed by multi-view aggregation with voting and pruning, leveraging the consistent nature of 3DGS to robustly separate pre- and post-change states. We further develop a continual scene reconstruction strategy that selectively updates dynamic regions while preserving the unchanged areas. We also contribute CCS3D, a challenging synthetic dataset that allows flexible combinations of 3D change types to support controlled evaluations. Extensive experiments demonstrate that our method achieves both high accuracy and efficiency, outperforming existing methods.

[153] A Minimal Solver for Relative Pose Estimation with Unknown Focal Length from Two Affine Correspondences

Zhenbao Yu,Shirong Ye,Ronghe Jin,Shunkun Liang,Zibin Liu,Huiyun Zhang,Banglei Guan

Main category: cs.CV

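TL;DR: This paper proposes a new method that estimates the two-view relative pose and focal length from two affine correspondences and a known vertical direction; it derives constraint equations, solves them with the polynomial eigenvalue method, and outperforms existing state-of-the-art solvers on synthetic and real data.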

Details Motivation: An IMU can supply the camera's vertical direction, which reduces the relative pose from 5DOF to 3DOF and improves estimation efficiency and accuracy. Method: Constraint equations are built from two affine correspondences; using the properties of an equation system with nontrivial solutions, four equations involving only the focal length and the relative rotation angle are derived and solved with the polynomial eigenvalue method. Result: The method is validated on synthetic and real-world datasets and outperforms existing state-of-the-art solvers. Conclusion: The proposed solver estimates the 3DOF relative pose and focal length efficiently and accurately, and suits a range of IMU-equipped vision systems. Abstract: In this paper, we aim to estimate the relative pose and the focal length between two views from two affine correspondences (ACs), where all intrinsic parameters are known except the focal length. Cameras are commonly used in combination with inertial measurement units (IMUs) in applications such as self-driving cars, smartphones, and unmanned aerial vehicles. The vertical direction of camera views can be obtained from IMU measurements, which reduces the relative pose between two cameras from 5DOF to 3DOF. We propose a new solver to estimate the 3DOF relative pose and focal length. First, we establish constraint equations from two affine correspondences when the vertical direction is known. Then, based on the properties of the equation system with nontrivial solutions, four equations can be derived. These four equations only involve two parameters: the focal length and the relative rotation angle. Finally, the polynomial eigenvalue method is used to solve for the focal length and relative rotation angle. The proposed solver is evaluated using synthetic and real-world datasets. The results show that our solver performs better than the existing state-of-the-art solvers.

[154] ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

Bangya Liu,Xinyu Gong,Zelin Zhao,Ziyang Song,Yulei Lu,Suhui Wu,Jun Zhang,Suman Banerjee,Hao Zhang

Main category: cs.CV

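TL;DR: This paper proposes ByteLoom, a Diffusion Transformer-based framework for generating geometry-consistent, multi-view human-object interaction videos; an RCM-cache mechanism and a progressive training strategy address the weak multi-view consistency and the dependence on hand annotations of existing methods.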

Details Motivation: Existing HOI video generation methods lack an effective mechanism for injecting multi-view object information and rely heavily on fine-grained hand mesh annotations, which leads to poor cross-view consistency and weak generalization. Method: ByteLoom uses an RCM-cache mechanism that takes Relative Coordinate Maps (RCM) as a universal representation to keep object geometry consistent and to control 6-DoF object transformations; a progressive training curriculum reduces the dependence on hand mesh annotations, and training combines simplified human conditioning with 3D object inputs. Result: Experiments show that the method preserves human identity and the object's multi-view geometry while maintaining smooth motion and object manipulation, clearly outperforming existing methods. Conclusion: ByteLoom improves the multi-view consistency and geometric accuracy of HOI video generation while reducing the need for fine-grained annotations, which broadens its potential in digital humans, e-commerce, advertising, and robot imitation learning. Abstract: Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object illustration, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain the object's geometric consistency while precisely controlling 6-DoF object transformations. To compensate for the scarcity of HOI datasets and leverage existing datasets, we further design a training curriculum that enhances model capabilities progressively and relaxes the need for hand mesh annotations. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry, while maintaining smooth motion and object manipulation.

[155] MUSON: A Reasoning-oriented Multimodal Dataset for Socially Compliant Navigation in Urban Environments

Zhuonan Liu,Xinyu Zhang,Zishuo Wang,Tomohito Kawabata,Xuesu Xiao,Ling Xiao

Main category: cs.CV

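TL;DR: MUSON is a multimodal dataset for short-horizon social navigation with five-step chain-of-thought annotations (perception, prediction, reasoning, action, and explanation); it models physical constraints explicitly and balances the action space, which improves the learning of safety-critical behaviors.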

Details Motivation: Existing social navigation datasets lack explicit reasoning supervision and have heavily long-tailed action distributions, which limits models' ability to learn safety-critical behaviors. Method: The MUSON dataset uses a structured five-step Chain-of-Thought annotation (perception, prediction, reasoning, action, explanation), models static physical constraints explicitly, and defines a balanced discrete action space. Result: Among several state-of-the-art small vision language models benchmarked on MUSON, Qwen2.5-VL-3B reaches the highest decision accuracy of 0.8625. Compared with SNEI, MUSON is more consistent in reasoning, action, and explanation. Conclusion: MUSON effectively supports research on socially compliant navigation and serves as a reusable benchmark dataset. Abstract: Socially compliant navigation requires structured reasoning over dynamic pedestrians and physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models' ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a rationally balanced discrete action space. Compared to SNEI, MUSON provides consistent reasoning, action, and explanation. Benchmarking multiple state-of-the-art Small Vision Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at https://huggingface.co/datasets/MARSLab/MUSON

[156] Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs

Ziyu Zhou,Haozhe Luo,Mohammad Reza Hosseinzadeh Taher,Jiaxuan Pang,Xiaowei Ding,Michael B. Gotway,Jianming Liang

Main category: cs.CV

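TL;DR: Lamps is a self-supervised foundation model for medical imaging, pre-trained on large-scale chest radiographs by exploiting the consistency, coherence, and hierarchy of human anatomy, which markedly improves robustness, transferability, and clinical potential.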

Details Motivation: Existing self-supervised learning methods for medical imaging overlook the consistency, coherence, and hierarchy of human anatomy, which limits how well they learn anatomical features; a foundation model better matched to the nature of medical images is needed. Method: Lamps is pre-trained on large-scale chest radiographs with multi-perspective self-supervised learning that uses the consistency, coherence, and hierarchy of anatomy as the supervision signal. Result: Fine-tuning and emergent property analysis across 10 datasets show that Lamps outperforms 10 baseline models in robustness, transferability, and clinical potential. Conclusion: By learning human anatomy from multiple perspectives, Lamps offers a way for medical imaging foundation models to learn meaningful, robust representations aligned with anatomical structure. Abstract: Foundation models have been successful in natural language processing and computer vision because they are capable of capturing the underlying structures (foundation) of natural languages. However, in medical imaging, the key foundation lies in human anatomy, as these images directly represent the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet, existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to effectively learn anatomical features. To overcome the limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision) pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as the supervision signal. Extensive experiments across 10 datasets evaluated through fine-tuning and emergent property analysis demonstrate Lamps' superior robustness, transferability, and clinical potential when compared to 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations that are aligned with the structure of human anatomy.

[157] Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples

Weiwei Li,Junzhuo Liu,Yuanyuan Ren,Yuchen Zheng,Yahao Liu,Wen Li

Main category: cs.CV

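TL;DR: This paper proposes a data-oriented approach to mitigating spurious correlations in deep learning models; its identify, neutralize, eliminate, and update pipeline substantially raises worst-group accuracy on image and NLP debiasing benchmarks.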

Details Motivation: Existing methods usually depend on annotating potential spurious attributes or on filtering spurious features under simple assumptions, and they often underperform on real-world data because spurious correlations there are intricate and hard to detect. Method: Samples influenced by spurious features tend to be dispersed in the learned feature space, which is used to identify the presence of spurious features; a simple grouping strategy yields a bias-invariant representation, a feature transformation is learned to eliminate spurious features by aligning with that representation, and the classifier is then updated with the learned transformation. Result: Experiments on image and NLP debiasing benchmarks show an improvement of more than 20% in worst-group accuracy over standard empirical risk minimization (ERM). Conclusion: The proposed pipeline effectively mitigates spurious correlations in deep learning models and improves generalization across groups. Abstract: Deep learning models are known to often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or filtering spurious features based on some empirical assumptions (e.g., simplicity of bias). However, these methods may yield unsatisfactory performance due to the intricate and elusive nature of spurious correlations in real-world data. In this paper, we propose a data-oriented approach to mitigate the spurious correlation in deep learning models. We observe that samples that are influenced by spurious features tend to exhibit a dispersed distribution in the learned feature space. This allows us to identify the presence of spurious features. Subsequently, we obtain a bias-invariant representation by neutralizing the spurious features based on a simple grouping strategy. Then, we learn a feature transformation to eliminate the spurious features by aligning with this bias-invariant representation. Finally, we update the classifier by incorporating the learned feature transformation and obtain an unbiased model. By integrating the aforementioned identifying, neutralizing, eliminating and updating procedures, we build an effective pipeline for mitigating spurious correlation. Experiments on image and NLP debiasing benchmarks show an improvement in worst group accuracy of more than 20% compared to standard empirical risk minimization (ERM). Codes and checkpoints are available at https://github.com/davelee-uestc/nsf_debiasing .

[158] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Ju-Hsuan Weng,Jia-Wei Liao,Cheng-Fu Chou,Jun-Cheng Chen

Main category: cs.CV

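TL;DR: This paper introduces M-ErasureBench, the first comprehensive benchmark for evaluating concept erasure under multimodal inputs, and finds that existing methods perform poorly on non-text modalities such as learned embeddings and inverted latents. The authors therefore propose IRECE, a plug-and-play module that localizes the target concept during denoising and perturbs the associated latents to improve inference-time robustness, substantially lowering the concept reproduction rate while preserving image quality.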

Details Motivation: Existing concept erasure research focuses on text prompts and overlooks other input modalities that are increasingly important in practical applications such as image editing and personalized generation; these modalities can become attack surfaces through which erased concepts re-emerge. Method: M-ErasureBench covers three input modalities (text prompts, learned embeddings, and inverted latents) and distinguishes white-box from black-box access, giving five evaluation scenarios; IRECE localizes the target concept via cross-attention and perturbs the corresponding latents during denoising to improve robustness. Result: Existing methods perform well on text prompts, but under learned embeddings and inverted latents their CRR exceeds 90% in the white-box setting; IRECE reduces CRR by up to 40% in the most challenging white-box latent inversion scenario while preserving visual quality. Conclusion: M-ErasureBench is the first comprehensive concept erasure benchmark beyond text prompts and, together with IRECE, offers practical safeguards for building more reliable generative models. Abstract: Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.

[159] SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation

Hasan Faraz Khan,Noor Fatima,Muzammil Behzad

Main category: cs.CV

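TL;DR: SwinTF3D is a lightweight multimodal fusion model that combines visual and linguistic representations for text-guided 3D medical image segmentation, generalizing well while remaining computationally efficient.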

Details Motivation: Existing 3D segmentation models depend on large annotated datasets and lack semantic understanding, so they adapt poorly to flexible clinical requirements and new tasks, which limits their use in real medical practice. Method: SwinTF3D extracts volumetric features with a transformer-based visual encoder and combines them with a lightweight text encoder through an efficient fusion mechanism, so that natural-language prompts are understood and aligned with spatial structures. Result: On the BTCV dataset the model achieves competitive Dice and IoU scores, generalizes well, and is considerably more computationally efficient. Conclusion: By fusing the visual and language modalities, SwinTF3D establishes an interpretable, interactive paradigm for 3D medical image segmentation and offers a more flexible and efficient solution for clinical imaging. Abstract: The recent integration of artificial intelligence into medical imaging has driven remarkable advances in automated organ segmentation. However, most existing 3D segmentation frameworks rely exclusively on visual learning from large annotated datasets, restricting their adaptability to new domains and clinical tasks. The lack of semantic understanding in these models makes them ineffective in addressing flexible, user-defined segmentation objectives. To overcome these limitations, we propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. The model employs a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder via an efficient fusion mechanism. This design allows the system to understand natural-language prompts and correctly align semantic cues with their corresponding spatial structures in medical volumes, while producing accurate, context-aware segmentation results with low computational overhead. Extensive experiments on the BTCV dataset demonstrate that SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks. Bridging visual perception with linguistic understanding, SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening perspectives for more adaptive and resource-efficient solutions in clinical imaging.

[160] Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance

Haosen Li,Wenshuo Chen,Shaofeng Liang,Lei Wang,Haozhe Jia,Yutao Yue

Main category: cs.CV

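TL;DR: This paper proposes Guided Path Sampling (GPS), a method that addresses the path instability that arises when Classifier-Free Guidance (CFG) is used for iterative refinement in diffusion models. By replacing unstable extrapolation with interpolation on the data manifold and scheduling the guidance strength dynamically, GPS keeps the sampling path stable and substantially improves image quality and adherence to complex prompts.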

Details Motivation: Standard Classifier-Free Guidance (CFG) has a fundamental limitation when combined with iterative refinement: its extrapolative nature pushes the sampling path off the data manifold, so the error diverges and refinement suffers. A more stable method that keeps the path on the data manifold is needed. Method: Guided Path Sampling (GPS) replaces CFG's extrapolation with a manifold-constrained interpolation and adds an optimal scheduling strategy that adjusts guidance strength dynamically to match the model's coarse-to-fine generation process. The authors prove that this turns the error from unbounded amplification into a bounded series, which guarantees stability. Result: Experiments on modern backbones such as SDXL and Hunyuan-DiT show that GPS outperforms existing methods in perceptual quality and complex prompt adherence. On SDXL, for example, it reaches an ImageReward of 0.79 and an HPS v2 score of 0.2995, and it raises semantic alignment accuracy on GenEval to 57.45%. Conclusion: Path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework for achieving it, opening a new direction for high-quality diffusion generation. Abstract: Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG's extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model's natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it.
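To make the "extrapolative nature" of CFG concrete, the sketch below implements only the standard CFG combination rule, `eps_uncond + w * (eps_cond - eps_uncond)`. It does not implement GPS: the abstract does not specify the manifold-constrained interpolation or the schedule, so the helper names `cfg_combine` and `is_extrapolation` are illustrative assumptions. The point is simply that a weight `w` between 0 and 1 stays on the segment between the two predictions, while the weights above 1 that are typical in practice step beyond the conditional prediction.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def is_extrapolation(w):
    """True when the guided prediction lies outside the segment between the two inputs."""
    return w < 0.0 or w > 1.0


eps_uncond = [0.0, 0.0]   # toy unconditional noise prediction
eps_cond = [1.0, 2.0]     # toy conditional noise prediction

for w in (0.5, 1.0, 3.0):
    kind = "extrapolation" if is_extrapolation(w) else "interpolation"
    print(w, cfg_combine(eps_uncond, eps_cond, w), kind)
```

With `w = 3.0` the result is three times the gap away from the unconditional prediction, which is the behavior the abstract identifies as pushing the sampling path off the data manifold.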

[161] Hash Grid Feature Pruning

Yangzhi Ma,Bojun Liu,Jie Li,Li Li,Dong Liu

Main category: cs.CV

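TL;DR: Proposes a hash grid feature pruning method that identifies and prunes invalid features based on the coordinates of the Gaussian splats, reducing storage and transmission overhead and improving rate-distortion performance without sacrificing model quality.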

Details Motivation: Gaussian splats are distributed irregularly and sparsely in 3D space, so the hash grid contains many invalid regions, which causes redundant storage and transmission. Method: Invalid features in the hash grid are identified from the coordinates of the input Gaussian splats and pruned, so that only valid features are encoded. Result: Under the Common Test Conditions, the method reduces the average bitrate by 8% compared with the baseline. Conclusion: The method reduces the storage required by the hash grid while preserving model performance, which improves compression efficiency. Abstract: Hash grids are widely used to learn an implicit neural field for Gaussian splatting, serving either as part of the entropy model or for inter-frame prediction. However, due to the irregular and non-uniform distribution of Gaussian splats in 3D space, numerous sparse regions exist, rendering many features in the hash grid invalid. This leads to redundant storage and transmission overhead. In this work, we propose a hash grid feature pruning method that identifies and prunes invalid features based on the coordinates of the input Gaussian splats, so that only the valid features are encoded. This approach reduces the storage size of the hash grid without compromising model performance, leading to improved rate-distortion performance. Following the Common Test Conditions (CTC) defined by the standardization committee, our method achieves an average bitrate reduction of 8% compared to the baseline approach.
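The pruning idea can be sketched as follows, under loudly stated assumptions: a single-resolution grid, splat coordinates normalized to [0, 1), and a spatial hash that XORs integer corner coordinates multiplied by the primes commonly used in Instant-NGP-style hash grids. None of these details come from the abstract, and the function names are hypothetical. A feature slot counts as valid only if some splat's enclosing voxel corner hashes to it; every other slot can be skipped when encoding.

```python
from itertools import product

# Assumption: Instant-NGP-style hashing primes; the paper's grid layout may differ.
PRIMES = (1, 2654435761, 805459861)


def hash_index(corner, table_size):
    """Map an integer 3D grid corner to a slot of the hash table."""
    h = 0
    for coord, prime in zip(corner, PRIMES):
        h ^= coord * prime
    return h % table_size


def valid_feature_indices(points, resolution, table_size):
    """Collect the slots touched by the 8 voxel corners around each point.

    `points` holds (x, y, z) tuples with coordinates assumed to lie in [0, 1).
    """
    valid = set()
    for point in points:
        base = tuple(int(c * resolution) for c in point)
        for offset in product((0, 1), repeat=3):
            corner = tuple(b + o for b, o in zip(base, offset))
            valid.add(hash_index(corner, table_size))
    return valid


def prune(table, valid):
    """Keep only the valid slots; the remaining features need not be encoded."""
    return {index: table[index] for index in sorted(valid)}


table = [[float(i)] for i in range(64)]          # toy table: 64 slots, 1 feature each
splats = [(0.10, 0.10, 0.10), (0.12, 0.11, 0.10)]  # two splats in the same voxel
valid = valid_feature_indices(splats, resolution=4, table_size=len(table))
pruned = prune(table, valid)
print(len(table), "->", len(pruned), "features kept")
```

Because the decoder also knows the splat coordinates, it can in principle recompute the same valid set and restore the pruned table layout; whether the paper signals validity this way is not stated in the abstract.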

[162] JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

Kai Liu,Jungang Li,Yuchong Sun,Shengqiong Wu,Jianzhang Gao,Daoan Zhang,Wei Zhang,Sheng Jin,Sicheng Yu,Geng Zhan,Jiayi Ji,Fan Zhou,Liang Zheng,Shuicheng Yan,Hao Fei,Tat-Seng Chua

Main category: cs.CV

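TL;DR: This paper presents JavisGPT, the first unified multimodal large language model for joint audio-video (JAV) comprehension and generation; it uses a concise encoder-LLM-decoder architecture and achieves strong performance through a three-stage training strategy and the high-quality instruction dataset JavisInst-Omni.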

Details Motivation: Existing multimodal large language models lack a unified framework for joint audio-video tasks and struggle to capture spatio-temporal synchrony, which limits comprehension and generation in complex scenarios. Method: JavisGPT adopts an encoder-LLM-decoder architecture, adds a SyncFusion module for spatio-temporal audio-video fusion, and uses synchrony-aware learnable queries to connect a pretrained JAV-DiT generator; capabilities are built up through a three-stage pipeline of multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning. Result: On multiple JAV comprehension and generation benchmarks, JavisGPT outperforms existing MLLMs, particularly in complex scenarios that demand tight temporal synchronization. Conclusion: JavisGPT provides an effective unified framework for joint audio-video comprehension and generation and confirms the value of synchrony-aware modeling and staged training for multimodal tasks. Abstract: This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

[163] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Qihang Peng,Xuesong Chen,Chenye Yang,Shaoshuai Shi,Hongsheng Li

Main category: cs.CV

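TL;DR: This paper proposes ColaVLA, a unified vision-language-action framework that moves reasoning from text into a unified latent space and pairs it with a hierarchical parallel trajectory decoder to generate efficient, accurate, and safe trajectories for autonomous driving.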

Details Motivation: Existing planners based on vision-language models (VLMs) suffer from a mismatch between discrete text reasoning and continuous control, high latency from autoregressive reasoning, and planners that are non-causal or inefficient, so they fall short of real-time deployment requirements. Method: ColaVLA consists of a Cognitive Latent Reasoner and a Hierarchical Parallel Planner. The former compresses scene understanding into decision-oriented meta-action embeddings using only two VLM forward passes; the latter generates multi-scale, causality-consistent trajectories in a single forward pass. Result: Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings, with favorable efficiency and robustness. Conclusion: ColaVLA combines the generalization ability of VLMs with the demands of real-time planning, retaining interpretability while generating trajectories efficiently and safely, and advances end-to-end autonomous driving systems. Abstract: Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

[164] Learning Where to Focus: Density-Driven Guidance for Detecting Dense Tiny Objects

Zhicheng Zhao,Xuanang Fan,Lingma Sun,Chenglong Li,Jin Tang

Main category: cs.CV

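TL;DR: This paper proposes DRMNet, a dense region mining network that uses density maps as spatial priors to guide the detection of dense tiny objects in high-resolution remote sensing imagery, substantially improving detection in high-density and heavily occluded scenes.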

Details Motivation: Severe mutual occlusion and limited pixel footprints make dense tiny objects in high-resolution remote sensing imagery hard for existing detectors to identify, and those detectors allocate computation uniformly rather than focusing on dense regions, which limits feature learning. Method: A Density Generation Branch (DGB) models object distribution patterns; a Dense Area Focusing Module (DAFM) enables efficient local-global feature interaction; and a Dual Filter Fusion Module (DFFM) separates multi-scale features into frequency components with a discrete cosine transform and enhances them with density-guided cross-attention. Result: Experiments on the AI-TOD and DTOD datasets show that DRMNet outperforms existing state-of-the-art methods in high-density and heavily occluded scenes. Conclusion: Through density-map-guided adaptive feature learning, DRMNet improves the accuracy and robustness of dense tiny object detection, especially in complex, high-density remote sensing scenes. Abstract: High-resolution remote sensing imagery increasingly contains dense clusters of tiny objects, the detection of which is extremely challenging due to severe mutual occlusion and limited pixel footprints. Existing detection methods typically allocate computational resources uniformly, failing to adaptively focus on these density-concentrated regions, which hinders feature learning effectiveness. To address these limitations, we propose the Dense Region Mining Network (DRMNet), which leverages density maps as explicit spatial priors to guide adaptive feature learning. First, we design a Density Generation Branch (DGB) to model object distribution patterns, providing quantifiable priors that guide the network toward dense regions. Second, to address the computational bottleneck of global attention, our Dense Area Focusing Module (DAFM) uses these density maps to identify and focus on dense areas, enabling efficient local-global feature interaction. Finally, to mitigate feature degradation during hierarchical extraction, we introduce a Dual Filter Fusion Module (DFFM). It disentangles multi-scale features into high- and low-frequency components using a discrete cosine transform and then performs density-guided cross-attention to enhance complementarity while suppressing background interference. Extensive experiments on the AI-TOD and DTOD datasets demonstrate that DRMNet surpasses state-of-the-art methods, particularly in complex scenarios with high object density and severe occlusion.

[165] CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision

Behnam Raoufi,Hossein Sharify,Mohamad Mahdee Ramezanee,Khosrow Hajsadeghi,Saeed Bagheri Shouraki

Main category: cs.CV

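TL;DR: Proposes CLIP-Joint-Detect, a detector-agnostic framework that adds CLIP-style contrastive vision-language supervision through joint training to improve object detection performance.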

Details Motivation: Conventional object detectors rely on cross-entropy classification, which is vulnerable to class imbalance and label noise. Method: A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings through an InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized at the same time. Result: Validated on Pascal VOC and MS COCO with Faster R-CNN and YOLOv11, the framework yields consistent and substantial improvements while preserving real-time inference speed. Conclusion: Joint optimization with learnable text embeddings markedly improves closed-set detection performance across architectures and datasets. Abstract: Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
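The InfoNCE alignment between a projected region feature and the class text embeddings can be sketched in a few lines. This is a generic InfoNCE computation, not the paper's loss: the temperature of 0.07, the cosine-similarity logits, and the names `info_nce` and `normalize` are assumptions, and the auxiliary cross-entropy term and detection losses are omitted.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(feature, text_embeddings, target, temperature=0.07):
    """InfoNCE loss for one projected feature against all class text embeddings.

    `target` is the index of the ground-truth class; the other classes act as negatives.
    """
    f = normalize(feature)
    logits = [
        sum(a * b for a, b in zip(f, normalize(t))) / temperature
        for t in text_embeddings
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_partition - logits[target]


class_text_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # toy embeddings for two classes
region_feature = [0.9, 0.1]                       # toy feature close to class 0

print(info_nce(region_feature, class_text_embeddings, target=0))
print(info_nce(region_feature, class_text_embeddings, target=1))
```

The loss is small when the feature is closest to the text embedding of its own class and large otherwise, which is what pulls region features toward the matching class embedding during joint training.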

[166] Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object Detection

Runwei Guan,Jianan Liu,Shaofeng Liang,Fangqiang Ding,Shanliang Yao,Xiaokai Bai,Daizong Liu,Tao Huang,Guoqiang Mao,Hui Xiong

Main category: cs.CV

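TL;DR: This paper proposes WRCFormer, a 3D object detection framework that fuses raw 4D mmWave radar cubes with camera inputs using a Wavelet Attention Module and a geometry-guided progressive fusion mechanism, achieving state-of-the-art performance on the K-Radar benchmark.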

Details Motivation: 4D mmWave radar data is sparse and semantically limited; converting it to point clouds loses information, while using the raw data directly is computationally expensive, so an efficient way of fusing radar and camera data is needed to improve perception. Method: A Wavelet Attention Module inside a wavelet-based Feature Pyramid Network strengthens the representation of sparse signals, and a two-stage, modality-agnostic Geometry-guided Progressive Fusion mechanism fuses multi-view representations of the radar cube with image features. Result: On the K-Radar benchmark, WRCFormer surpasses the best prior model by about 2.4% across all scenarios and by 1.6% in the sleet scenario, showing strong robustness in adverse weather. Conclusion: By fusing raw 4D radar cubes with camera data, WRCFormer resolves the trade-off between information loss and computational cost and markedly improves 3D object detection, especially in difficult weather. Abstract: 4D millimeter-wave (mmWave) radar has been widely adopted in autonomous driving and robot perception due to its low cost and all-weather robustness. However, its inherent sparsity and limited semantic richness significantly constrain perception capability. Recently, fusing camera data with 4D radar has emerged as a promising cost-effective solution that exploits the complementary strengths of the two modalities. Nevertheless, point-cloud-based radar representations often suffer from information loss introduced by multi-stage signal processing, while directly utilizing raw 4D radar data incurs prohibitive computational costs. To address these challenges, we propose WRCFormer, a novel 3D object detection framework that fuses raw radar cubes with camera inputs via multi-view representations of the decoupled radar cube. Specifically, we design a Wavelet Attention Module as the basic module of a wavelet-based Feature Pyramid Network (FPN) to enhance the representation of sparse radar signals and image data. We further introduce a two-stage query-based, modality-agnostic fusion mechanism termed Geometry-guided Progressive Fusion to efficiently integrate multi-view features from both modalities. Extensive experiments demonstrate that WRCFormer achieves state-of-the-art performance on the K-Radar benchmarks, surpassing the best model by approximately 2.4% in all scenarios and 1.6% in the sleet scenario, highlighting its robustness under adverse weather conditions.

[167] YOLO-IOD: Towards Real Time Incremental Object Detection

Shizhou Zhang,Xueqiang Lv,Yinghui Xing,Qirui Wu,Di Xu,Chen Zhao,Yanning Zhang

Main category: cs.CV

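TL;DR: This paper proposes YOLO-IOD, a real-time incremental object detection framework built on YOLO-World that mitigates catastrophic forgetting by resolving three kinds of knowledge conflict (foreground-background confusion, parameter interference, and misaligned knowledge distillation), and introduces LoCo COCO, a more realistic benchmark, for evaluation.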

Details Motivation: Existing incremental object detection methods are mostly built on Faster R-CNN or DETR-series detectors and do not fit real-time YOLO frameworks. In YOLO, several knowledge conflicts cause severe forgetting, so an incremental learning scheme designed specifically for YOLO is needed. Method: Propose the YOLO-IOD framework with three core components: 1) Conflict-Aware Pseudo-Label Refinement (CPR) to mitigate foreground-background confusion; 2) Importance-based Kernel Selection (IKS) to reduce parameter interference; 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD) to resolve misaligned distillation. A stage-wise parameter-efficient fine-tuning strategy enables incremental learning on top of the pretrained YOLO-World model. Result: Experiments on the conventional COCO benchmark and the newly proposed LoCo COCO benchmark show that YOLO-IOD substantially reduces forgetting while maintaining high performance, outperforming existing methods. Conclusion: YOLO-IOD resolves the key knowledge conflicts that YOLO-series models face in incremental object detection, achieves efficient real-time incremental detection with little forgetting, and advances the use of YOLO architectures in incremental learning settings. Abstract: Current methods for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient fine-tuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importance-based Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. 
We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.

[168] RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance

Chunyuan Chen,Yunuo Cai,Shujuan Li,Weiyun Liang,Bin Wang,Jing Xu

Main category: cs.CV

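TL;DR: This paper proposes RealCamo, a unified framework for generating camouflaged images with high realism and semantic consistency. It improves generation quality through layout controls and multi-modal textual-visual conditioning, and introduces a new metric for assessing camouflage quality.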

Details Motivation: Existing camouflaged image generation methods fall short in visual similarity and semantic consistency and do not come close to real camouflage scenes, so a more realistic and structurally sound generation method is needed. Method: Propose the RealCamo framework, which takes an out-painting based approach, introduces layout controls to regulate global structure, and builds a multi-modal condition from a fine-grained textual description and texture-oriented background retrieval to guide generation. A background-foreground distribution divergence metric is designed to quantify camouflage quality. Result: Experiments and visualizations show that RealCamo outperforms existing methods in realism, semantic coherence, and camouflage effectiveness, and that the proposed metric reflects camouflage quality well. Conclusion: Through structured control and multi-modal guidance, RealCamo improves the realism and semantic coherence of generated camouflaged images and offers a better data generation option for camouflaged object detection. Abstract: Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose RealCamo, a unified out-painting based framework for realistic camouflaged image generation. RealCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multi-modal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.

[169] PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects

Huiming Yang,Linglin Liao,Fei Ding,Sibo Wang,Zijian Zeng

Main category: cs.CV

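TL;DR: This paper proposes PoseStreamer, a multi-modal 6DoF pose estimation framework designed for high-speed motion. It exploits the strengths of event cameras, achieves robust pose estimation through three core components, and introduces MoCapCube6D, a new benchmark dataset for fast motion.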

Details Motivation: Existing 6DoF pose estimation methods perform poorly in high-speed and low-light scenes. Standard RGB cameras are prone to motion blur, and current event-camera methods have limited performance under fast motion, so a more robust solution is needed. Method: Propose the PoseStreamer framework, consisting of an Adaptive Pose Memory Queue, an Object-centric 2D Tracker, and a Ray Pose Filter that refines poses along camera rays. It combines historical orientation cues, 2D priors, and geometric refinement to improve 3D pose accuracy at high speed. Result: Experiments show that PoseStreamer clearly outperforms existing methods in high-speed scenes, with high accuracy and strong generalization. It performs template-free pose estimation on unseen objects, and the new MoCapCube6D dataset supports evaluation under fast motion. Conclusion: Through multi-modal fusion and structural innovations, PoseStreamer achieves robust and general 6DoF pose estimation in high-speed and low-light conditions and advances the use of event cameras in dynamic vision tasks. Abstract: Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in scenarios with high-speed object motion. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.

[170] Spatial-aware Symmetric Alignment for Text-guided Medical Image Segmentation

Linglin Liao,Qichuan Geng,Yu Liu

Main category: cs.CV

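TL;DR: This paper proposes the Spatial-aware Symmetric Alignment (SSA) framework to improve medical image segmentation guided by hybrid clinical text (locational, descriptive, and diagnostic information). Bi-directional fine-grained alignment and a spatial guidance strategy improve segmentation accuracy for lesions subject to spatial relational constraints.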

Details Motivation: Existing text-guided medical image segmentation methods struggle to handle diagnostic and descriptive text at the same time and do not model the positional constraints in the text effectively, which leads to critical deviations in the segmentation results. Method: Propose the SSA framework, which includes a symmetric optimal transport alignment mechanism that establishes bi-directional, fine-grained cross-modal correspondences between image regions and multiple types of textual expressions, and a composite directional guidance strategy that explicitly introduces the spatial constraints in the text by constructing region-level guidance masks. Result: Extensive experiments on public benchmarks show that SSA achieves state-of-the-art performance, especially when segmenting lesions with spatial relational constraints. Conclusion: SSA addresses the weak association across text types and the missing spatial constraints in current text-guided medical image segmentation and improves segmentation accuracy, particularly for complex spatial descriptions. Abstract: Text-guided Medical Image Segmentation has shown considerable promise for medical image segmentation, with rich clinical text serving as an effective supplement for scarce data. However, current methods have two key bottlenecks. On one hand, they struggle to process diagnostic and descriptive texts simultaneously, making it difficult to identify lesions and establish associations with image regions. On the other hand, existing approaches focus on lesion descriptions and fail to capture positional constraints, leading to critical deviations. For example, with the text "in the left lower lung", the segmentation results may incorrectly cover both sides of the lung. To address these limitations, we propose the Spatial-aware Symmetric Alignment (SSA) framework to enhance the capacity of referring hybrid medical texts consisting of locational, descriptive, and diagnostic information. Specifically, we propose a symmetric optimal transport alignment mechanism to strengthen the associations between image regions and multiple relevant expressions, which establishes bi-directional fine-grained multimodal correspondences. In addition, we devise a composite directional guidance strategy that explicitly introduces spatial constraints in the text by constructing region-level guidance masks. Extensive experiments on public benchmarks demonstrate that SSA achieves state-of-the-art (SOTA) performance, particularly in accurately segmenting lesions characterized by spatial relational constraints.

[171] Reverse Personalization

Han-Wei Kung,Tuomas Varanka,Nicu Sebe

Main category: cs.CV

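TL;DR: Proposes a reverse personalization framework based on conditional diffusion inversion for controllable face anonymization, without relying on text prompts or model fine-tuning.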

Details Motivation: Existing prompt-based face anonymization methods either depend on the subject being represented in the pre-trained model or require fine-tuning for specific identities. They generalize poorly and lack control over attributes. Method: Analyze the identity generation process and introduce conditional diffusion inversion, combined with an identity-guided conditioning branch, to operate directly on images and achieve attribute-controllable anonymization without text prompts. Result: The method reaches a state-of-the-art balance between identity removal, attribute preservation, and image quality, and generalizes to subjects outside the training data. Conclusion: The proposed reverse personalization framework achieves controllable face anonymization without fine-tuning or text prompts and generalizes well. Abstract: Recent text-to-image diffusion models have demonstrated remarkable generation of realistic facial images conditioned on textual prompts and human identities, enabling the creation of personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features either rely on the subject being well-represented in the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process and introduce a reverse personalization framework for face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without using text prompts. To generalize beyond subjects in the model's training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack control over facial attributes, our framework supports attribute-controllable anonymization. We demonstrate that our method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality. Source code and data are available at https://github.com/hanweikung/reverse-personalization .

[172] A Low-Cost UAV Deep Learning Pipeline for Integrated Apple Disease Diagnosis, Freshness Assessment, and Fruit Detection

Soham Dutta,Soham Banerjee,Sneha Mahata,Anindya Sen,Sayantani Datta

Main category: cs.CV

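TL;DR: This paper proposes a unified intelligent orchard management pipeline based on a low-cost RGB UAV. It integrates ResNet50, VGG16, and YOLOv8 for leaf disease detection, apple freshness assessment, and real-time apple localization, respectively, and runs offline local inference on an ESP32-CAM and Raspberry Pi. Experiments show high accuracy, offering an economical and scalable solution for precision agriculture.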

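Details Motivation: Existing UAV systems mostly handle orchard management tasks in isolation and depend on expensive multispectral sensors. A low-cost, integrated solution is missing. Method: Build a unified RGB-image-based UAV orchard intelligence pipeline that uses ResNet50 for leaf disease detection, VGG16 for apple freshness classification, and YOLOv8 for real-time apple detection and localization. The system is deployed on an ESP32-CAM and Raspberry Pi and supports fully offline local inference. Result: Experiments report 98.9% accuracy for leaf disease classification, 97.4% accuracy for apple freshness classification, and an F1 score of 0.857 for apple detection. Conclusion: The framework provides a low-cost, scalable, multi-task orchard monitoring solution that needs no cloud support, serves as an alternative to approaches that rely on multispectral sensors, and suits precision agriculture in resource-constrained settings. Abstract: Apple orchards require timely disease detection, fruit quality assessment, and yield estimation, yet existing UAV-based systems address such tasks in isolation and often rely on costly multispectral sensors. This paper presents a unified, low-cost RGB-only UAV-based intelligent orchard pipeline integrating ResNet50 for leaf disease detection, VGG16 for apple freshness determination, and YOLOv8 for real-time apple detection and localization. The system runs on an ESP32-CAM and Raspberry Pi, providing fully offline on-site inference without cloud support. Experiments demonstrate 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection. The framework provides an accessible and scalable alternative to multispectral UAV solutions, supporting practical precision agriculture on affordable hardware.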
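
As a reminder of what the reported detection metric measures, the following minimal sketch computes an F1 score from detection counts. The abstract reports only the final F1 value, so the counts used here are hypothetical and chosen purely for illustration; the function name `f1_score` is likewise an assumption, not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)  # fraction of detections that are correct
    recall = tp / (tp + fn)     # fraction of ground-truth objects found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (not from the paper): 6 correct detections,
# 1 false alarm, 1 missed apple.
example = f1_score(tp=6, fp=1, fn=1)
print(round(example, 3))
```

Because F1 combines precision and recall, the same F1 value can come from very different mixes of false alarms and misses, which is worth keeping in mind when comparing detectors on this number alone.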

[173] OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding

Wenyuan Huang,Zhao Wang,Zhou Wei,Ting Huang,Fang Zhao,Jian Yang,Zhenyu Zhang

Main category: cs.CV

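TL;DR: This paper proposes OpenGround, a zero-shot framework for open-world 3D visual grounding. Its Active Cognition-based Reasoning (ACR) module overcomes the limitations of the conventional pre-defined Object Lookup Table (OLT), and a new dataset, OpenTarget, is introduced to validate the approach in open-world scenarios.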

Details Motivation: Existing 3D visual grounding methods depend on a pre-defined Object Lookup Table (OLT), which limits their use in scenes with unseen or open-category targets and makes it hard to handle the diverse and unknown targets found in the real world. Method: Propose the OpenGround framework, whose core is the Active Cognition-based Reasoning (ACR) module. Through a cognitive task chain that mimics human perception, ACR progressively extends the cognitive scope of the vision-language model (VLM) and builds a dynamically updated OLT to support open-world reasoning. Result: OpenGround is competitive on Nr3D, reaches state-of-the-art on ScanRefer, and delivers a 17.6% improvement on the newly proposed OpenTarget dataset. Conclusion: OpenGround supports open-world 3D visual grounding, removes the restriction of a fixed OLT, and generalizes well to both pre-defined and open-world categories. Abstract: 3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at [this https URL](https://why-102.github.io/openground.io/).

[174] With Great Context Comes Great Prediction Power: Classifying Objects via Geo-Semantic Scene Graphs

Ciprian Constantinescu,Marius Leordeanu

Main category: cs.CV

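TL;DR: This paper proposes a Geo-Semantic Contextual Graph (GSCG) framework for context-aware object classification from a single monocular image. It combines depth estimation with panoptic and material segmentation to build a graph with geometric, chromatic, and material attributes, and uses a graph classifier to fuse local and global context, which substantially improves classification accuracy.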

Details Motivation: Most existing object recognition systems ignore scene context, whereas humans rely on rich context to recognize objects. A method that models context explicitly and improves both recognition performance and interpretability is therefore needed. Method: First build a Geo-Semantic Contextual Graph (GSCG) from a monocular image by combining depth estimation with a unified panoptic and material segmentation model. Objects become nodes with geometric, chromatic, and material attributes, and spatial relationships become edges. Then design a graph-based classifier that aggregates features from the target object, its neighbors, and the global scene. Result: On COCO 2017, the method reaches 73.4% classification accuracy, far above context-agnostic versions (as low as 38.4%), fine-tuned ResNet models (at most 53.5%), and the multimodal large model Llama 4 Scout (at most 42.3%). Conclusion: Explicitly building a structured context representation such as the GSCG is essential for object recognition. It greatly improves performance and makes the model's reasoning more interpretable. Abstract: Humans effortlessly identify objects by leveraging a rich understanding of the surrounding scene, including spatial relationships, material properties, and the co-occurrence of other objects. In contrast, most computational object recognition systems operate on isolated image regions, devoid of meaning in isolation, thus ignoring this vital contextual information. This paper argues for the critical role of context and introduces a novel framework for contextual object classification. We first construct a Geo-Semantic Contextual Graph (GSCG) from a single monocular image. This rich, structured representation is built by integrating a metric depth estimator with a unified panoptic and material segmentation model. The GSCG encodes objects as nodes with detailed geometric, chromatic, and material attributes, and their spatial relationships as edges. This explicit graph structure makes the model's reasoning process inherently interpretable. We then propose a specialized graph-based classifier that aggregates features from a target object, its immediate neighbors, and the global scene context to predict its class. Through extensive ablation studies, we demonstrate that our context-aware model achieves a classification accuracy of 73.4%, dramatically outperforming context-agnostic versions (as low as 38.4%). 
Furthermore, our GSCG-based approach significantly surpasses strong baselines, including fine-tuned ResNet models (max 53.5%) and a state-of-the-art multimodal Large Language Model (LLM), Llama 4 Scout, which, even when given the full image alongside a detailed description of objects, maxes out at 42.3%. These results on COCO 2017 train/val splits highlight the superiority of explicitly structured and interpretable context for object recognition tasks.

[175] An Architecture-Led Hybrid Report on Body Language Detection Project

Thomson Tong,Diba Darooneh

Main category: cs.CV

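TL;DR: This report analyzes the architectures of two modern vision-language models (Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct) and applies them in a video-to-artifact pipeline, emphasizing how architectural properties affect practical engineering design and output reliability.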

Details Motivation: Understand how the architectural properties of modern vision-language models support or limit functionality in real applications, especially video analysis and structured output generation, in order to guide the construction of reliable systems. Method: Through architecture-level analysis, summarize the multimodal foundation shared by the two VLMs (visual tokenization, Transformer attention, instruction following) and describe how they are used in the BodyLanguageDetection project: video frame sampling, prompting the VLM to produce bounding boxes with attributes, schema validation, and optional rendering of an annotated video. Result: Clarifies key distinctions between model behavior and system constraints: outputs can be structurally valid but semantically wrong, schema validation covers structure rather than geometric correctness, person identifiers are frame-local, and single-frame analysis returns free-form text rather than schema-enforced JSON. Conclusion: These architectural and behavioral distinctions matter for writing defensible technical claims, designing robust interfaces, and planning evaluation, and should be taken into account in engineering practice. Abstract: This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
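
The distinction between structural validation and geometric correctness can be illustrated with a small sketch. The field names (`person_id`, `emotion`, `box`) and the `[x1, y1, x2, y2]` box convention below are hypothetical and are not taken from the BodyLanguageDetection repository; the point is only that a detection can pass a structural check while describing an impossible box.

```python
from typing import Any


def is_structurally_valid(det: Any) -> bool:
    """Structural check: required keys exist and have the expected types."""
    if not isinstance(det, dict):
        return False
    box = det.get("box")
    return (
        isinstance(det.get("person_id"), int)
        and isinstance(det.get("emotion"), str)
        and isinstance(box, list)
        and len(box) == 4
        and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in box
        )
    )


def is_geometrically_plausible(det: dict, width: int, height: int) -> bool:
    """Geometric check: box [x1, y1, x2, y2] is non-empty and inside the frame."""
    x1, y1, x2, y2 = det["box"]
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


# Hypothetical detections for a 640x480 frame.
good = {"person_id": 0, "emotion": "neutral", "box": [10, 20, 110, 220]}
inverted = {"person_id": 1, "emotion": "happy", "box": [300, 50, 100, 10]}

for det in (good, inverted):
    print(is_structurally_valid(det), is_geometrically_plausible(det, 640, 480))
```

Under these assumptions the second detection passes the structural check but fails the geometric one, which is the kind of gap the report describes. Neither check says anything about whether the emotion label is right.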

[176] Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

Yi Zhou,Xuechao Zou,Shun Zhang,Kai Li,Shiying Wang,Jingming Chen,Congyan Lang,Tengfei Cao,Pin Tao,Yuanchun Shi

Main category: cs.CV

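TL;DR: This paper proposes Co2S, a semi-supervised semantic segmentation framework for remote sensing images. By fusing priors from vision-language models and self-supervised models, it mitigates pseudo-label drift and error accumulation and achieves strong performance on multiple datasets.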

Details Motivation: Address the confirmation bias and error accumulation caused by pseudo-label drift in semi-supervised remote sensing image segmentation, and improve segmentation performance when labeled data is limited. Method: Propose the Co2S framework, which uses a heterogeneous dual-student architecture initialized with pretrained CLIP and DINOv3, together with an explicit-implicit semantic co-guidance mechanism and a global-local feature collaborative fusion strategy, to combine different priors and strengthen semantic consistency. Result: Extensive experiments on six mainstream remote sensing datasets show that Co2S achieves leading segmentation performance across multiple partition protocols and diverse scenarios. Conclusion: By fusing multi-modal priors and applying collaborative feature fusion, Co2S improves the stability and accuracy of semi-supervised remote sensing image segmentation. Abstract: Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.

[177] 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

Ryousuke Yamada,Kohsuke Ide,Yoshihiro Fukuhara,Hirokatsu Kataoka,Gilles Puy,Andrei Bursuc,Yuki M. Asano

Main category: cs.CV

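TL;DR: Proposes LAM3C, a self-supervised framework that learns 3D representations from point clouds generated from unlabeled videos. Without any real 3D scans, it outperforms previous methods on indoor semantic and instance segmentation.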

Details Motivation: Collecting large-scale real 3D scene scans is expensive and time-consuming, so the paper explores whether 3D representations can be learned from video data alone, without 3D sensors. Method: Propose the LAM3C framework, which uses room-walkthrough videos collected from the web (such as real-estate tours) to generate the RoomTours point cloud dataset, and applies Sinkhorn-Knopp based, Laplacian-aware multi-level clustering together with a noise-regularized loss to stabilize learning. Result: Without using any real 3D scans, LAM3C outperforms previous self-supervised methods on indoor semantic and instance segmentation. Conclusion: Unlabeled video is an abundant and viable data source for 3D self-supervised learning and may reduce dependence on costly 3D scans. Abstract: Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from point clouds generated from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.
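
For readers unfamiliar with the Sinkhorn-Knopp step named in the method, here is a minimal sketch of the generic algorithm: alternately rescale the rows and columns of a strictly positive matrix until both sum to one. This is not the LAM3C implementation; how the paper applies it inside multi-level clustering is not described in the abstract, and the function name and example matrix are assumptions.

```python
def sinkhorn_knopp(matrix, n_iters=50):
    """Alternately normalize rows and columns of a strictly positive square matrix.

    For such inputs the iterates approach a doubly stochastic matrix
    (every row and every column sums to 1).
    """
    m = [row[:] for row in matrix]  # work on a copy
    for _ in range(n_iters):
        # Row step: scale each row so that it sums to 1.
        m = [[v / row_sum for v in row] for row, row_sum in
             ((row, sum(row)) for row in m)]
        # Column step: scale each column so that it sums to 1.
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / s for v, s in zip(row, col_sums)] for row in m]
    return m


# Hypothetical positive score matrix (e.g. sample-to-cluster affinities).
scores = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
balanced = sinkhorn_knopp(scores, n_iters=200)
print([round(sum(row), 6) for row in balanced])
```

In clustering-based self-supervised learning, this kind of balancing is commonly used so that assignments spread across clusters rather than collapsing onto a few of them.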

[178] Video-BrowseComp: Benchmarking Agentic Video Research on Open Web

Zhengyang Liang,Yan Shu,Xiangrui Liu,Minghao Qin,Kaixin Liang,Paolo Rota,Nicu Sebe,Zheng Liu,Lizi Liao

Main category: cs.CV

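TL;DR: This paper presents Video-BrowseComp, the first benchmark for agentic video reasoning on the open web. It evaluates an agent's ability to conduct active research over the dynamic video modality and exposes serious weaknesses of current models on tasks that depend on temporal visual evidence.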

Details Motivation: Existing video benchmarks focus mainly on passive perception and cannot evaluate an agent's ability to carry out active video research on the open web, with clear gaps in scenarios that require verification across video timelines and integration of evidence from multiple sources. Method: Build Video-BrowseComp, a challenging benchmark of 210 questions that enforces dependence on temporal visual evidence, and evaluate advanced models, including GPT-5.1 (w/ Search), on it. Result: Experiments show that even state-of-the-art models such as GPT-5.1 (w/ Search) reach only 15.24% accuracy. They rely mainly on textual metadata, and their performance drops sharply in metadata-sparse, dynamic domains such as sports and gameplay. Conclusion: Video-BrowseComp fills the gap in benchmarks for active video reasoning, pushes the field from passive perception toward proactive, temporally grounded video understanding, and highlights the importance of visual grounding. Abstract: The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.

[179] MedSAM-based lung masking for multi-label chest X-ray classification

Brayden Miao,Zain Rehman,Xin Miao,Siming Liu,Jianjie Wang

Main category: cs.CV

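TL;DR: Proposes a MedSAM segmentation-guided chest X-ray classification pipeline that introduces anatomical priors to improve abnormality detection. Masking strategies affect tasks and architectures differently, so the masking approach should be chosen to match the clinical objective and the backbone network.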

Details Motivation: Automated chest X-ray interpretation faces weak disease signals, dataset bias, and limited spatial supervision. Existing methods make little use of anatomical priors, which limits robustness and interpretability. Method: Use MedSAM as the lung region extraction module. MedSAM is first fine-tuned on a dataset from Airlangga University Hospital to produce lung masks, and is then applied to a subset of the NIH CXR dataset for multi-label abnormality classification. Three input variants (original images, tight masks, and loose masks) are compared on CNN models such as ResNet50. Result: MedSAM produces anatomically plausible lung masks across diverse imaging conditions. Loose masking keeps macro AUROC comparable while significantly improving recognition of normal cases. Tight masking lowers abnormality detection performance but improves training efficiency. Loose masking partly mitigates that degradation by preserving perihilar and peripheral context. Conclusion: Lung masking should be treated as a controllable spatial prior whose design matches the backbone architecture and the specific clinical objective, rather than being applied uniformly, in order to balance abnormality detection against normal-case screening. Abstract: Chest X-ray (CXR) imaging is widely used for screening and diagnosing pulmonary abnormalities, yet automated interpretation remains challenging due to weak disease signals, dataset bias, and limited spatial supervision. Foundation models for medical image segmentation (MedSAM) provide an opportunity to introduce anatomically grounded priors that may improve robustness and interpretability in CXR analysis. We propose a segmentation-guided CXR classification pipeline that integrates MedSAM as a lung region extraction module prior to multi-label abnormality classification. MedSAM is fine-tuned using a public image-mask dataset from Airlangga University Hospital. We then apply it to a curated subset of the public NIH CXR dataset to train and evaluate deep convolutional neural networks for multi-label prediction of five abnormalities (Mass, Nodule, Pneumonia, Edema, and Fibrosis), with the normal case (No Finding) evaluated via a derived score. Experiments show that MedSAM produces anatomically plausible lung masks across diverse imaging conditions. We find that masking effects are both task-dependent and architecture-dependent. ResNet50 trained on original images achieves the strongest overall abnormality discrimination, while loose lung masking yields comparable macro AUROC but significantly improves No Finding discrimination, indicating a trade-off between abnormality-specific classification and normal case screening. Tight masking consistently reduces abnormality-level performance but improves training efficiency. 
Loose masking partially mitigates this degradation by preserving perihilar and peripheral context. These results suggest that lung masking should be treated as a controllable spatial prior selected to match the backbone and clinical objective, rather than applied uniformly.
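
The difference between tight and loose masking can be sketched as follows. The abstract does not say how the two mask variants are constructed, so this example simply assumes that a loose mask is a dilated version of the tight lung mask; the helper names `dilate` and `apply_mask`, the square window, and the toy 5x5 arrays are all hypothetical.

```python
def dilate(mask, radius=1):
    """Binary dilation with a square (2*radius+1) window on a list-of-lists mask."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Switch on every pixel within `radius` of a foreground pixel.
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def apply_mask(image, mask):
    """Zero out pixels outside the mask, keep the rest unchanged."""
    return [[p if m else 0 for p, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]


# Toy 5x5 example: a single "lung" pixel in the center of a uniform image.
tight = [[0] * 5 for _ in range(5)]
tight[2][2] = 1
loose = dilate(tight, radius=1)
image = [[9] * 5 for _ in range(5)]

print(sum(map(sum, tight)), sum(map(sum, loose)))  # masked-in pixel counts
print(apply_mask(image, loose)[1])                 # one row of the loosely masked image
```

Under this assumption, the loose mask keeps a margin of surrounding pixels that the tight mask discards, which is one plausible way the perihilar and peripheral context mentioned in the abstract could be preserved.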

[180] PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion

Jian Wang,Sixing Rong,Jiarui Xing,Yuling Xu,Weide Liu

Main category: cs.CV

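TL;DR: PathoSyn proposes an MRI image synthesis framework based on disentangled deviation modeling. By modeling pathological residuals on a stable anatomical manifold, it generates high-fidelity, structurally consistent lesions.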

Details Motivation: Existing generative models that synthesize MRI in the global pixel domain or rely on binary masks often produce feature entanglement, corrupted anatomy, or boundary artifacts, and lack fine control over local pathological changes. Method: Decompose the synthesis task into deterministic anatomical reconstruction and stochastic deviation modeling. A Deviation-Space Diffusion Model learns the conditional distribution of pathological residuals, combined with a seam-aware fusion strategy and an inference-time stabilization module to maintain spatial coherence. Result: On tumor imaging benchmarks, PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in perceptual realism and anatomical fidelity. Conclusion: PathoSyn offers a mathematically principled framework that supports high-quality patient-specific data generation and interpretable counterfactual modeling of disease progression, which helps diagnostic algorithm development in low-data regimes and the evaluation of clinical decision-support systems. Abstract: We present PathoSyn, a unified generative framework for Magnetic Resonance Imaging (MRI) image synthesis that reformulates imaging-pathology as a disentangled additive deviation on a stable anatomical manifold. Current generative models typically operate in the global pixel domain or rely on binary masks. These paradigms often suffer from feature entanglement, leading to corrupted anatomical substrates or structural discontinuities. PathoSyn addresses these limitations by decomposing the synthesis task into deterministic anatomical reconstruction and stochastic deviation modeling. Central to our framework is a Deviation-Space Diffusion Model designed to learn the conditional distribution of pathological residuals, thereby capturing localized intensity variations while preserving global structural integrity by construction. To ensure spatial coherence, the diffusion process is coupled with a seam-aware fusion strategy and an inference-time stabilization module, which collectively suppress boundary artifacts and produce high-fidelity internal lesion heterogeneity. PathoSyn provides a mathematically principled pipeline for generating high-fidelity patient-specific synthetic datasets, facilitating the development of robust diagnostic algorithms in low-data regimes. By allowing interpretable counterfactual disease progression modeling, the framework supports precision intervention planning and provides a controlled environment for benchmarking clinical decision-support systems. 
Quantitative and qualitative evaluations on tumor imaging benchmarks demonstrate that PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in both perceptual realism and anatomical fidelity. The source code of this work will be made publicly available.

[181] Domain-Shift Immunity in Deep Deformable Registration via Local Feature Representations

Mingzhen Shao,Sarang Joshi

Main category: cs.CV

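TL;DR: This paper proposes UniReg, a universal image registration framework, and shows that deep-learning-based deformable registration models are inherently robust to domain shift because they rely on local feature representations rather than global appearance.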

Details Motivation: Learning-based image registration models are widely regarded as sensitive to domain shift, yet the underlying mechanism of their robustness is unclear. This paper investigates and validates the source of that robustness. Method: Propose the UniReg framework, which decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network, and train it on a single dataset to test cross-domain and multi-modal performance. Result: Trained on only one dataset, UniReg shows cross-domain and multi-modal registration performance comparable to optimization-based methods. The analysis finds that failures of conventional CNN models under modality shift stem from dataset biases in early convolutional layers. Conclusion: Local feature consistency is the key driver of robustness in learning-based deformable registration, and backbones should be designed to preserve domain-invariant local features. Abstract: Deep learning has advanced deformable image registration, surpassing traditional optimization-based methods in both accuracy and efficiency. However, learning-based models are widely believed to be sensitive to domain shift, with robustness typically pursued through large and diverse training datasets, without explaining the underlying mechanisms. In this work, we show that domain-shift immunity is an inherent property of deep deformable registration models, arising from their reliance on local feature representations rather than global appearance for deformation estimation. To isolate and validate this mechanism, we introduce UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network. Despite training on a single dataset, UniReg exhibits robust cross-domain and multi-modal performance comparable to optimization-based methods. Our analysis further reveals that failures of conventional CNN-based models under modality shift originate from dataset-induced biases in early convolutional layers. These findings identify local feature consistency as the key driver of robustness in learning-based deformable registration and motivate backbone designs that preserve domain-invariant local features.

[182] GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection

Jingyu Li,Xiaolong Zhao,Zhe Liu,Wenxiao Wu,Li Zhang

Main category: cs.CV

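TL;DR: This paper proposes GeoTeacher, a semi-supervised 3D object detection method that improves the student model's understanding of geometric structure through keypoint-based geometric relation supervision and voxel-wise data augmentation, achieving state-of-the-art performance on the ONCE and Waymo datasets.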

Details Motivation: 现有半监督3D检测方法忽视了模型在标注数据有限时对物体几何形状敏感性不足的问题,难以有效捕捉对检测至关重要的几何信息。 Method: 设计了一个基于关键点的几何关系监督模块,将教师模型的几何知识迁移到学生模型,并提出一种带有距离衰减机制的体素级数据增强策略,以增加物体几何多样性并保持远距离物体完整性。 Result: 在ONCE和Waymo数据集上进行了大量实验,验证了GeoTeacher的有效性和泛化能力,取得了新的SOTA结果。 Conclusion: GeoTeacher显著提升了学生模型在有限标注数据下对物体几何关系的理解能力,且可兼容多种半监督3D检测框架,具有广泛适用性。 Abstract: Semi-supervised 3D object detection, aiming to explore unlabeled data for boosting 3D object detectors, has emerged as an active research area in recent years. Some previous methods have shown substantial improvements by either employing heterogeneous teacher models to provide high-quality pseudo labels or enforcing feature-perspective consistency between the teacher and student networks. However, these methods overlook the fact that the model usually tends to exhibit low sensitivity to object geometries with limited labeled data, making it difficult to capture geometric information, which is crucial for enhancing the student model's ability in object perception and localization. In this paper, we propose GeoTeacher to enhance the student model's ability to capture geometric relations of objects with limited training data, especially unlabeled data. We design a keypoint-based geometric relation supervision module that transfers the teacher model's knowledge of object geometry to the student, thereby improving the student's capability in understanding geometric relations. Furthermore, we introduce a voxel-wise data augmentation strategy that increases the diversity of object geometries, thereby further improving the student model's ability to comprehend geometric structures. To preserve the integrity of distant objects during augmentation, we incorporate a distance-decay mechanism into this strategy. Moreover, GeoTeacher can be combined with different SS3D methods to further improve their performance. 
Extensive experiments on the ONCE and Waymo datasets demonstrate the effectiveness and generalization of our method, which achieves new state-of-the-art results. Code will be available at https://github.com/SII-Whaleice/GeoTeacher

[183] REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

Fulin Shi,Wenyi Xiao,Bin Chen,Liang Din,Leilei Gan

Main category: cs.CV

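TL;DR: Proposes REVEALER, a framework that uses reinforcement-guided visual reasoning for fine-grained, element-level alignment evaluation in text-to-image generation.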

Details Motivation: Existing text-to-image evaluation methods mostly rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. Method: A structured "grounding-reasoning-conclusion" paradigm lets a multimodal large language model explicitly localize semantic elements; the model is trained with Group Relative Policy Optimization (GRPO) using structural, grounding, and alignment rewards. Result: REVEALER achieves state-of-the-art performance on four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench), outperforming strong proprietary models and supervised baselines with higher inference efficiency. Conclusion: REVEALER enables interpretable element-level alignment evaluation and markedly improves the accuracy and efficiency of text-to-image model evaluation. Abstract: Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
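
The composite reward and group-relative optimization described above can be illustrated with a minimal sketch. It assumes the three reward terms are combined as a weighted sum and that advantages are computed by normalizing rewards within a group of responses sampled for the same prompt, as in the standard GRPO formulation. The weights, the example scores, and the function names are hypothetical and are not taken from the REVEALER paper.

```python
from statistics import mean, pstdev


def composite_reward(fmt, grounding, alignment, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of format, grounding, and alignment rewards.

    The equal weights are a placeholder assumption, not the paper's values.
    """
    w_f, w_g, w_a = weights
    return w_f * fmt + w_g * grounding + w_a * alignment


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one prompt, scored as (format, grounding, alignment).
scores = [(1.0, 0.8, 1.0), (1.0, 0.2, 0.0), (0.0, 0.5, 1.0), (1.0, 0.6, 1.0)]
rewards = [composite_reward(*s) for s in scores]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced; no separate value network is needed because the group itself serves as the baseline.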

[184] GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection

Yi Zhang,Yi Wang,Lei Yao,Lap-Pui Chau

Main category: cs.CV

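TL;DR: This paper proposes GVSynergy-Det, an image-based 3D object detection framework built on synergistic Gaussian-voxel representation learning that achieves state-of-the-art performance without dense 3D supervision.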

Details Motivation: Existing image-based 3D detection methods struggle to recover accurate geometry without dense 3D annotations, while high-accuracy methods depend on expensive depth sensors or dense supervision, which limits practical use. Method: A dual-representation architecture: 1) generalizable Gaussian Splatting is adapted to extract fine-grained surface geometry features; 2) a cross-representation enhancement mechanism injects geometric details from Gaussian fields into voxel features through learnable feature fusion. Result: The method achieves state-of-the-art performance on both ScanNetV2 and ARKitScenes, significantly outperforming existing methods without any depth input or dense 3D geometry supervision (e.g., point clouds or TSDF). Conclusion: Through synergistic learning of Gaussian and voxel representations, GVSynergy-Det combines the strengths of continuous surface modeling and discrete spatial structure, offering a new approach to 3D detection without dense 3D supervision. Abstract: Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details while voxels provide structured spatial context. We introduce a dual-representation architecture that: 1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or utilize Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization.
Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).

[185] GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

Tianchen Deng,Xuefeng Chen,Yi Chen,Qu Chen,Yuyao Xu,Lijin Yang,Le Xu,Yu Zhang,Bo Zhang,Wuxiong Huang,Hesheng Wang

Main category: cs.CV

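TL;DR: Proposes a unified driving world model framework based on 3D Gaussian scene representation that supports both 3D scene understanding and multi-modal generation. It embeds language features into Gaussian primitives for early modality alignment, adds a task-aware language-guided sampling strategy and a dual-condition generation model, and achieves SOTA performance on nuScenes and NuInteract.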

Details Motivation: Existing driving world models lack 3D scene understanding, cannot accurately align textual information with the 3D scene, and generate content without semantic reasoning. Method: The framework uses a 3D Gaussian scene representation and embeds language features into each Gaussian primitive for early modality alignment; a task-aware language-guided sampling strategy extracts compact 3D tokens as input to the large language model; a dual-condition multi-modal generation model combines a high-level language condition with a low-level image condition to jointly guide generation. Result: Experiments on the nuScenes and NuInteract datasets validate the method, which achieves state-of-the-art performance. Conclusion: The framework improves the 3D understanding and multi-modal generation capabilities of driving world models and offers a new approach to semantic reasoning and content generation in autonomous driving. Abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub at https://github.com/dtc111111/GaussianDWM.

[186] ForCM: Forest Cover Mapping from Multispectral Sentinel-2 Image by Integrating Deep Learning with Object-Based Image Analysis

Maisha Haque,Israt Jahan Ayshi,Sadaf M. Anis,Nahian Tasnim,Mithila Moontaha,Md. Sabbir Ahmed,Muhammad Iqbal Hossain,Mohammad Zavid Parvez,Subrata Chakraborty,Biswajeet Pradhan,Biswajit Banik

Main category: cs.CV

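TL;DR: This study proposes ForCM, a new method that combines Object-Based Image Analysis (OBIA) with Deep Learning (DL) for forest cover mapping from multispectral Sentinel-2 imagery, improving classification accuracy in the Amazon Rainforest region.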

Details Motivation: Traditional OBIA methods have limited accuracy in forest cover mapping, and the effect of fusing different deep learning models with OBIA has not been fully evaluated, so more effective combinations are needed to improve mapping accuracy. Method: Several deep learning models (UNet, UNet++, ResUNet, AttentionUNet, and ResNet50-Segnet) are applied to high-resolution Sentinel-2 L2A imagery, and the best-performing models are combined with OBIA; three multispectral image collections (three-band and four-band) are used for training and evaluation. Result: ResUNet and AttentionUNet combined with OBIA reach overall accuracies of 94.54% and 95.64%, respectively, compared with 92.91% for traditional OBIA, confirming the effectiveness of fusing deep learning with OBIA. Conclusion: By fusing deep learning with OBIA, ForCM improves forest cover mapping accuracy and shows the practical potential of free tools such as QGIS for environmental monitoring, supporting global conservation efforts. Abstract: This research proposes "ForCM", a novel approach to forest cover mapping that combines Object-Based Image Analysis (OBIA) with Deep Learning (DL) using multispectral Sentinel-2 imagery. The study explores several DL models, including UNet, UNet++, ResUNet, AttentionUNet, and ResNet50-Segnet, applied to high-resolution Sentinel-2 Level 2A satellite images of the Amazon Rainforest. The datasets comprise three collections: two sets of three-band imagery and one set of four-band imagery. After evaluation, the most effective DL models are individually integrated with the OBIA technique to enhance mapping accuracy. The originality of this work lies in evaluating different deep learning models combined with OBIA and comparing them with traditional OBIA methods. The results show that the proposed ForCM method improves forest cover mapping, achieving overall accuracies of 94.54 percent with ResUNet-OBIA and 95.64 percent with AttentionUNet-OBIA, compared to 92.91 percent using traditional OBIA. This research also demonstrates the potential of free and user-friendly tools such as QGIS for accurate mapping within their limitations, supporting global environmental monitoring and conservation efforts.

[187] Exploring Syn-to-Real Domain Adaptation for Military Target Detection

Jongoh Jeong,Youngjin Oh,Gyeongrae Nam,Jeongeun Lee,Kuk-Jin Yoon

Main category: cs.CV

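TL;DR: This paper proposes generating photorealistic RGB synthetic data with Unreal Engine for cross-domain military target detection, and evaluates existing domain adaptation methods through synthetic-to-real transfer experiments.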

Details Motivation: Military domains often span many kinds of environments and lack real military target datasets, so existing domain adaptation methods are hard to apply directly; SAR data is also costly, which calls for a low-cost, efficient RGB solution. Method: A high-fidelity synthetic RGB dataset is built with Unreal Engine, synthetic-to-real cross-domain transfer experiments are conducted, and domain adaptation methods with different degrees of supervision are benchmarked on the authors' own train-val dataset pair. Result: Weakly supervised methods that need only minimal image hints (e.g., object class) substantially outperform unsupervised and semi-supervised domain adaptation methods. Conclusion: Current weakly supervised domain adaptation methods perform better for military target detection, but challenges remain that require further study. Abstract: Object detection is one of the key target tasks of interest in the context of civil and military applications. In particular, the real-world deployment of target detection methods is pivotal in the decision-making process during military command and reconnaissance. However, current domain adaptive object detection algorithms consider adapting one domain to another similar one only within the scope of natural or autonomous driving scenes. Since military domains often deal with a mixed variety of environments, detecting objects from multiple varying target domains poses a greater challenge. Several studies for armored military target detection have made use of synthetic aperture radar (SAR) data due to its robustness to all weather, long range, and high-resolution characteristics. Nevertheless, the costs of SAR data acquisition and processing are still much higher than those of the conventional RGB camera, which is a more affordable alternative with significantly lower data processing time. Furthermore, the lack of military target detection datasets limits the use of such a low-cost approach. To mitigate these issues, we propose to generate RGB-based synthetic data using a photorealistic visual tool, Unreal Engine, for military target detection in a cross-domain setting. To this end, we conducted synthetic-to-real transfer experiments by training on our synthetic dataset and validating on our web-collected real military target datasets.
We benchmark the state-of-the-art domain adaptation methods distinguished by the degree of supervision on our proposed train-val dataset pair, and find that current methods using minimal hints on the image (e.g., object class) achieve a substantial improvement over unsupervised or semi-supervised DA methods. From these observations, we recognize the current challenges that remain to be overcome.

[188] Task-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks

Changgyoon Oh,Jongoh Jeong,Jegyeong Cho,Kuk-Jin Yoon

Main category: cs.CV

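TL;DR: This paper proposes a diffusion-model-based framework for adaptive timestep selection and feature consolidation that improves few-shot dense prediction performance.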

Details Motivation: Existing diffusion models for few-shot dense prediction rely on empirically chosen timestep features, which leads to sub-optimal and task-biased results. Method: A Task-aware Timestep Selection (TTS) module and a Timestep Feature Consolidation (TFC) module, combined with a parameter-efficient fine-tuning adapter, adaptively select and consolidate the key timestep features. Result: Experiments on the Taskonomy dataset validate the method, which improves dense prediction performance in universal and few-shot learning scenarios. Conclusion: Learnable timestep feature consolidation effectively improves the generalization and performance of diffusion models in few-shot dense prediction. Abstract: Denoising diffusion probabilistic models have brought tremendous advances in generative tasks, achieving state-of-the-art performance thus far. Current diffusion model-based applications exploit the power of learned visual representations from multistep forward-backward Markovian processes for single-task prediction tasks by attaching a task-specific decoder. However, the heuristic selection of diffusion timestep features still heavily relies on empirical intuition, often leading to sub-optimal performance biased towards certain tasks. To alleviate this constraint, we investigate the significance of versatile diffusion timestep features by adaptively selecting timesteps best suited for the few-shot dense prediction task, evaluated on an arbitrary unseen task. To this end, we propose two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate the selected timestep features to improve the dense predictive performance in a few-shot setting. Accompanied by our parameter-efficient fine-tuning adapter, our framework effectively achieves superiority in dense prediction performance given only a few support queries. We empirically validate our learnable timestep consolidation method on the large-scale challenging Taskonomy dataset for dense prediction, particularly for practical universal and few-shot learning scenarios.

[189] AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding

Jongoh Jeong,Taek-Jin Song,Jong-Hwan Kim,Kuk-Jin Yoon

Main category: cs.CV

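TL;DR: This paper introduces AVOID, a new dataset for real-time obstacle detection under adverse visual conditions that supports multiple visual perception tasks, and reports benchmarks on high-performing real-time networks plus ablation studies with a multi-task network.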

Details Motivation: Existing road driving datasets usually contain images from either normal or adverse conditions only, and lack road obstacles captured in the same visual domain as the other classes, which makes it hard to detect unexpected small road hazards reliably. A road obstacle dataset covering varied weather and lighting conditions is therefore needed. Method: The authors build AVOID, a new large-scale simulated dataset of images containing unexpected road obstacles captured under different weather and time conditions, with semantic maps, depth maps, raw and semantic LiDAR data, and waypoints; they benchmark real-time obstacle detection networks and run ablation studies with a multi-task network for semantic segmentation, depth estimation, and waypoint prediction. Result: The dataset supports a range of visual perception tasks; experiments confirm the feasibility of accurate real-time detection and show performance gains from multi-task learning. Conclusion: AVOID provides an important resource for road obstacle detection and multi-modal perception research in complex, adverse environments, and advances the robustness of autonomous driving systems in the real world. Abstract: Understanding road scenes for visual perception remains crucial for intelligent self-driving cars. In particular, it is desirable to detect unexpected small road hazards reliably in real-time, especially under varying adverse conditions (e.g., weather and daylight). However, existing road driving datasets provide large-scale images acquired in either normal or adverse scenarios only, and often do not contain the road obstacles captured in the same visual domain as for the other classes. To address this, we introduce a new dataset called AVOID, the Adverse Visual Conditions Dataset, for real-time obstacle detection collected in a simulated environment. AVOID consists of a large set of unexpected road obstacles located along each path captured under various weather and time conditions. Each image is coupled with the corresponding semantic and depth maps, raw and semantic LiDAR data, and waypoints, thereby supporting most visual perception tasks. We benchmark the results on high-performing real-time networks for the obstacle detection task, and also propose and conduct ablation studies using a comprehensive multi-task network for semantic segmentation, depth and waypoint prediction tasks.

[190] MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?

Shiqi Dai,Zizhi Ma,Zhicong Luo,Xuesong Yang,Yibin Huang,Wanyue Zhang,Chi Chen,Zonghao Guo,Wang Xu,Yufei Sun,Maosong Sun

Main category: cs.CV

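TL;DR: This paper presents MM-UAVBench, the first comprehensive benchmark for multimodal large language models (MLLMs) in low-altitude UAV scenarios. It covers three capability dimensions (perception, cognition, and planning) with 19 sub-tasks and over 5.7K questions annotated on real-world data; experiments reveal bottlenecks in existing MLLMs such as spatial bias and multi-view understanding.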

Details Motivation: Existing MLLM benchmarks rarely cover the unique challenges of low-altitude UAV scenarios, while UAV-related evaluations are limited to specific tasks and lack a unified assessment of MLLMs' general intelligence, so a comprehensive benchmark dedicated to low-altitude UAV scenarios is needed. Method: The authors build MM-UAVBench, a comprehensive benchmark covering three core capability dimensions (perception, cognition, and planning) with 19 sub-tasks and over 5.7K manually annotated questions, all drawn from public real-world UAV datasets, and evaluate 16 mainstream open-source and proprietary MLLMs. Result: Current MLLMs adapt poorly to complex low-altitude scenarios, exposing key bottlenecks such as spatial bias and difficulty with multi-view understanding; existing models still fall well short of the visual and cognitive demands of low-altitude environments. Conclusion: MM-UAVBench fills the gap in MLLM evaluation for low-altitude UAV scenarios and provides a tool and direction for research on robust, reliable UAV intelligence for practical applications. Abstract: While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs' general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks such as spatial bias and multi-view understanding that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.

[191] Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information

Youngchae Kwon,Jinyoung Choi,Injung Kim

Main category: cs.CV

TL;DR: Proposes a new Holistic Detection Transformer (Holi-DETR) that uses three types of contextual information to improve the accuracy of fashion item detection.

Details Motivation: Fashion item detection is challenging because item appearances are highly diverse and subcategories resemble one another. Method: Holi-DETR integrates three heterogeneous types of contextual information into the Detection Transformer: co-occurrence relationships between fashion items, relative position and size based on inter-item spatial arrangements, and spatial relationships between items and human body key-points. Result: The proposed method improves average precision (AP) over the vanilla DETR and the more recent Co-DETR by 3.6 and 1.1 percentage points, respectively. Conclusion: Holi-DETR reduces ambiguity in detection and improves fashion item detection performance by jointly using multiple kinds of contextual information. Abstract: Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically, by leveraging contextual information. Fashion items often have meaningful relationships as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed methods improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percentage points (pp) and 1.1 pp, respectively, in terms of average precision (AP).

[192] Bridging Your Imagination with Audio-Video Generation via a Unified Director

Jiaxu Zhang,Tianshu Hu,Yuan Zhang,Zenan Li,Linjie Luo,Guosheng Lin,Xin Chen

Main category: cs.CV

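TL;DR: UniMAGE is a unified AI video creation model that merges text and image generation to integrate script writing with keyframe design, improving narrative logic and visual consistency.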

Details Motivation: Existing video generation systems separate script generation from keyframe design and lack a director's overall logic and creative integration, which limits the experience of non-expert users and the quality of the output. Method: The model uses a Mixture-of-Transformers architecture with a "first interleaving, then disentangling" training paradigm: Interleaved Concept Learning first uses interleaved text-image data to strengthen understanding, and Disentangled Expert Learning then optimizes script writing and keyframe generation separately. Result: Experiments show that UniMAGE achieves the best performance among open-source models and generates logically coherent video scripts and visually consistent keyframe images. Conclusion: UniMAGE unifies the modeling of script and visual generation, gives non-expert users an efficient, high-quality tool for long-form video creation, and moves AI toward a more creative director role. Abstract: Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.

[193] Anomaly Detection by Effectively Leveraging Synthetic Images

Sungho Kang,Hyunkyu Park,Yeonho Lee,Hanbyul Lee,Mijoo Jeong,YeongHyeon Park,Injae Lee,Juneho Yi

Main category: cs.CV

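TL;DR: Proposes a framework that combines text-guided image translation with image retrieval to generate high-quality synthetic defect images efficiently, and uses a two-stage training strategy to improve industrial anomaly detection performance.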

Details Motivation: Because real defect images are scarce, existing methods for generating synthetic defect images face a trade-off between cost and quality: rule-based methods are cheap but unrealistic, while generative-model-based methods produce high quality at high cost. Method: A pre-trained text-guided image-to-image translation model generates defect images, and an image retrieval model keeps only the outputs that are similar to real normal images to improve relevance and quality; a two-stage training strategy first pre-trains on a large volume of rule-based synthetic images and then fine-tunes on a small set of high-quality synthetic images. Result: Experiments on the MVTec AD dataset show that the method reduces data collection cost while improving anomaly detection performance. Conclusion: The framework balances the quality and generation cost of synthetic images, and the retrieval-based filtering and two-stage training yield better anomaly detection results. Abstract: Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two-stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost for data collection while improving the anomaly detection performance.
Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.
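
The retrieval-based filtering step in this abstract can be sketched as follows. This is a minimal illustration that assumes each image has already been mapped to an embedding vector by some retrieval model, and that a generated image is kept when its best cosine similarity to any real normal image reaches a threshold. The embeddings, the threshold of 0.8, and the function names are hypothetical; the abstract does not specify the similarity measure or the filtering rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_generated(generated, normal, threshold=0.8):
    """Return indices of generated embeddings close to at least one normal embedding."""
    kept = []
    for i, g in enumerate(generated):
        best = max(cosine(g, n) for n in normal)
        if best >= threshold:
            kept.append(i)
    return kept


# Toy 3-D embeddings standing in for retrieval-model features (assumed values).
normal = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
generated = [[0.95, 0.05, 0.1], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kept = filter_generated(generated, normal)
```

In this toy case only the first generated embedding stays close to the normal images, so the other two would be discarded as irrelevant outputs.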

[194] SURE Guided Posterior Sampling: Trajectory Correction for Diffusion-Based Inverse Problems

Minwoo Kim,Hongki Lim

Main category: cs.CV

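TL;DR: Proposes SGPS, a new method that uses SURE gradients and PCA-based noise estimation to correct deviations in the diffusion sampling trajectory, reducing error accumulation and achieving high-quality inverse problem reconstruction with fewer than 100 network evaluations.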

Details Motivation: Existing diffusion-based inverse problem solvers alternate between sampling and data consistency steps, which accumulates errors and requires many iterations to reach high-quality results. Method: Stein's Unbiased Risk Estimate (SURE) gradient updates and PCA-based noise estimation correct trajectory deviations in the early and middle sampling stages, improving the accuracy of posterior sampling. Result: Evaluations on a range of inverse problems show that SGPS consistently outperforms existing methods at low NFE counts (<100). Conclusion: By correcting sampling trajectory deviations, SGPS reduces error accumulation and solves inverse problems efficiently with high quality, making diffusion models more practical to deploy. Abstract: Diffusion models have emerged as powerful learned priors for solving inverse problems. However, current iterative solving approaches that alternate between diffusion sampling and data consistency steps typically require hundreds or thousands of steps to achieve high-quality reconstruction due to accumulated errors. We address this challenge with SURE Guided Posterior Sampling (SGPS), a method that corrects sampling trajectory deviations using Stein's Unbiased Risk Estimate (SURE) gradient updates and PCA-based noise estimation. By mitigating noise-induced errors during the critical early and middle sampling stages, SGPS enables more accurate posterior sampling and reduces error accumulation. This allows our method to maintain high reconstruction quality with fewer than 100 Neural Function Evaluations (NFEs). Our extensive evaluation across diverse inverse problems demonstrates that SGPS consistently outperforms existing methods at low NFE counts.
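
For readers unfamiliar with SURE, the textbook form of the estimator is given below for the additive Gaussian noise setting. The notation (denoiser $f$, clean signal $x$, observation $y$, dimension $N$, noise level $\sigma$) is introduced here for illustration; how SGPS applies the estimator inside the sampling loop is not stated in the abstract and is not reproduced here.

```latex
% Setting: y = x + n, with n ~ N(0, sigma^2 I_N) and f a weakly differentiable denoiser.
\[
  \mathrm{SURE}(f, y)
    = \lVert f(y) - y \rVert_2^2 \;-\; N\sigma^2 \;+\; 2\sigma^2\, \nabla_y \cdot f(y),
  \qquad
  \mathbb{E}_{n}\bigl[\mathrm{SURE}(f, y)\bigr]
    = \mathbb{E}_{n}\bigl[\lVert f(y) - x \rVert_2^2\bigr].
\]
```

The key property is that the right-hand identity holds without access to the clean signal $x$, so the estimate can be differentiated and used as a self-supervised correction signal as long as the noise level $\sigma$ is known or estimated.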

[195] Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network

Dongsheng Li,Chaobo Chen,Siling Wang,Song Gao

Main category: cs.CV

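TL;DR: This paper proposes PEG-DRNet, a physics-edge hybrid network for infrared gas leak detection that combines gas transport modeling with edge-aware mechanisms and improves detection performance in weak-contrast and small-object scenarios.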

Details Motivation: Infrared gas leaks are hard to detect because plumes are faint and their boundaries are blurred; existing methods fail to balance detail preservation with context awareness and make little use of physical properties. Method: PEG-DRNet comprises: 1) the Gas Block, which models diffusion and convection and combines a local branch with a large-kernel branch to capture multi-scale gas propagation; 2) the AGPEO operator, which fuses multi-directional gradients and phase consistency to produce reliable edge priors; 3) CASR-PAN, which adaptively aggregates cross-scale features based on edge and content cues, improving discriminability and reducing redundancy. Result: On the IIG dataset, PEG-DRNet reaches 29.8% AP, 84.3% AP$_{50}$, and 25.3% small-object AP, surpassing the RT-DETR-R18 baseline by 3.0%, 6.5%, and 5.3%, respectively, with only 43.7 GFLOPs and 14.9M parameters, and it outperforms existing CNN and Transformer detectors on the IIG and LangGas datasets. Conclusion: By fusing physical modeling with edge-aware mechanisms, PEG-DRNet balances accuracy and efficiency well, improves infrared gas leak detection, and shows strong potential for industrial use. Abstract: Detecting infrared gas leaks is critical for environmental monitoring and industrial safety, yet remains difficult because plumes are faint, small, semitransparent, and have weak, diffuse boundaries. We present physics-edge hybrid gas dynamic routing network (PEG-DRNet). First, we introduce the Gas Block, a diffusion-convection unit modeling gas transport: a local branch captures short-range variations, while a large-kernel branch captures long-range propagation. An edge-gated learnable fusion module balances local detail and global context, strengthening weak-contrast plume and contour cues. Second, we propose the adaptive gradient and phase edge operator (AGPEO), computing reliable edge priors from multi-directional gradients and phase-consistent responses. These are transformed by a multi-scale edge perception module (MSEPM) into hierarchical edge features that reinforce boundaries. Finally, the content-adaptive sparse routing path aggregation network (CASR-PAN), with adaptive information modulation modules for fusion and self, selectively propagates informative features across scales based on edge and content cues, improving cross-scale discriminability while reducing redundancy. Experiments on the IIG dataset show that PEG-DRNet achieves an overall AP of 29.8%, an AP$_{50}$ of 84.3%, and a small-object AP of 25.3%, surpassing the RT-DETR-R18 baseline by 3.0%, 6.5%, and 5.3%, respectively, while requiring only 43.7 GFLOPs and 14.9 M parameters.
The proposed PEG-DRNet achieves superior overall performance with the best balance of accuracy and computational efficiency, outperforming existing CNN and Transformer detectors in AP and AP$_{50}$ on the IIG and LangGas datasets.

[196] RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models

Fan Wei,Runmin Dong,Yushan Lai,Yixiang Yang,Zhaoyang Luo,Jinxiao Zhang,Miao Yang,Shuai Yuan,Jiyao Zhao,Bin Luo,Haohuan Fu

Main category: cs.CV

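TL;DR: Proposes a training-free, two-stage data pruning method that improves the training efficiency and generation quality of remote sensing diffusion foundation models. It preserves data diversity and representativeness even at a pruning ratio of 85% and reaches SOTA performance on downstream tasks.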

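Details Motivation: Existing remote sensing diffusion models depend on large amounts of redundant, noisy, and class-imbalanced data, which makes training inefficient and convergence difficult; existing data processing methods also ignore the distributional requirements of generative modeling and the heterogeneity of remote sensing imagery. Method: A training-free, two-stage pruning strategy: the first stage removes low-information samples with an entropy-based criterion; the second stage uses remote sensing scene classification datasets as reference benchmarks for scene-aware clustering with stratified sampling, which keeps clustering effective while lowering computational cost, and balances cluster-level uniformity against sample representativeness to enable fine-grained selection at high pruning ratios. Result: With 85% of the training data pruned, the model converges faster and generates higher-quality images; diffusion models trained on the pruned data consistently reach state-of-the-art results on downstream tasks such as super-resolution and semantic image synthesis. Conclusion: This data pruning paradigm improves the training efficiency and performance of remote sensing generative foundation models while preserving diversity and representativeness, and offers practical guidance for building efficient remote sensing generative models. Abstract: Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generative modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness.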
Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
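
The first pruning stage, an entropy-based criterion that removes low-information samples, can be illustrated with a minimal sketch. It assumes the criterion is the Shannon entropy of each image's pixel-intensity histogram and that images below a fixed threshold are dropped. The abstract does not say how entropy is computed or thresholded, so the histogram choice, the threshold of 1.0 bit, the toy images, and the function names are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy in bits of the intensity histogram of a flat pixel list."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def prune_low_information(images, min_entropy):
    """Keep the names of images whose histogram entropy reaches min_entropy."""
    return [name for name, px in images.items() if shannon_entropy(px) >= min_entropy]


# Toy 16-pixel grayscale "images" (assumed values for illustration only).
images = {
    "flat_water": [12] * 16,                  # one intensity: 0 bits
    "two_tone_field": [40] * 8 + [200] * 8,   # two equally likely intensities: 1 bit
    "urban_block": list(range(16)),           # sixteen distinct intensities: 4 bits
}
kept = prune_low_information(images, min_entropy=1.0)
```

A uniform tile such as open water carries almost no information and is removed first, while textured scenes survive to the second, scene-aware sampling stage.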

[197] Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism

Siyu Zhang,Ying Chen,Lianlei Shan,Runhe Qiu

Main category: cs.CV

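TL;DR: This paper proposes a vision-language model framework that combines a Dynamic Resolution Input Strategy (DRIS) with a Multi-scale Vision-Language Alignment Mechanism (MS-VLAM) to improve the semantic understanding accuracy and computational efficiency of multimodal fusion for remote sensing images.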

Details Motivation: Existing methods cannot balance efficiency and detail at a fixed resolution, and single-scale alignment lacks semantic hierarchy, which causes semantic misalignment and granularity imbalance in cross-modal fusion of remote sensing images. Method: DRIS uses a coarse-to-fine strategy to allocate computational resources adaptively; MS-VLAM builds a three-tier alignment mechanism at the object, local-region, and global levels to strengthen cross-modal semantic consistency. Result: Experiments on the RS-GPT4V dataset show that the method outperforms conventional methods on image captioning (BLEU-4, CIDEr) and cross-modal retrieval (R@10), improving both semantic understanding accuracy and computational efficiency. Conclusion: The framework offers a new approach to building efficient, robust multimodal remote sensing systems and provides theoretical guidance and technical support for the engineering application of intelligent remote sensing interpretation. Abstract: Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM). Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity imbalance. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval.
This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.

[198] ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

Xingwei Ma,Shiyang Feng,Bo Zhang,Bin Wang

Main category: cs.CV

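TL;DR: Proposes ViLaCD-R1, a two-stage remote sensing change detection framework that combines a vision-language model with a mask-guided decoder to improve semantic understanding, localization accuracy, and robustness to interference.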

Details Motivation: Traditional methods lack high-level semantic understanding in remote sensing change detection, and existing VLM-based methods suffer from inaccurate localization, unclear boundaries, and poor interpretability. Method: A two-stage framework. In the first stage, a vision-language model (VLM) trained with supervised fine-tuning and reinforcement learning performs block-level dual-temporal reasoning and produces a coarse change mask; in the second stage, a Mask-Guided Decoder (MGD) fuses dual-temporal features with the coarse mask to output a precise binary change map. Result: The method performs strongly on multiple remote sensing change detection benchmarks, substantially improving semantic change recognition and localization accuracy, effectively suppressing non-semantic changes, and reaching state-of-the-art accuracy. Conclusion: By combining the semantic reasoning of a VLM with mask-guided fine decoding, ViLaCD-R1 achieves more accurate, robust, and interpretable remote sensing change detection in complex scenes. Abstract: Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.

[199] ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation

Shin seong Kim,Minjung Shin,Hyunin Cho,Youngjung Uh

Main category: cs.CV

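TL;DR: This paper proposes ASemConsist, a framework that selectively modifies text embeddings to give explicit semantic control over character identity while preserving image-prompt alignment. It also introduces an adaptive feature-sharing strategy and a unified evaluation metric, CQS, and strikes a strong balance between character consistency and prompt alignment.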

Details Motivation: When generating image sequences, existing methods struggle to maintain character identity consistency and per-image text alignment at the same time, and must trade one off against the other. Method: ASemConsist applies semantic control through selective text embedding modification, repurposes the padding embeddings in FLUX as semantic containers, and uses an adaptive feature-sharing strategy that constrains only ambiguous identity prompts. Result: The method substantially improves character identity consistency while keeping text alignment high, reaching state-of-the-art performance. It also proposes a unified metric, CQS, that measures consistency and alignment jointly. Conclusion: ASemConsist effectively resolves the trade-off between identity consistency and prompt alignment and, together with the new CQS evaluation protocol, gives future work a more comprehensive evaluation standard and technical path. Abstract: Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: https://minjung-s.github.io/asemconsist

[200] Contour Information Aware 2D Gaussian Splatting for Image Representation

Masaya Takabe,Hiroshi Watanabe,Sujun Hong,Tomohiro Ikai,Zheming Fan,Ryo Ishimoto,Kakeru Sugimoto,Ruri Imichi

Main category: cs.CV

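TL;DR: Proposes a Contour Information-Aware 2D Gaussian Splatting (2DGS) framework that uses object segmentation priors to constrain the region each Gaussian covers, which prevents cross-boundary blending and improves edge reconstruction quality, especially at high compression ratios.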

Details Motivation: Existing 2D Gaussian Splatting methods tend to blur boundaries under high compression and cannot model contour structure, so a representation that keeps edges sharp is needed. Method: Object segmentation priors are integrated into 2D Gaussian Splatting by restricting each Gaussian to a specific segmentation region during rasterization, and a warm-up training scheme is introduced to stabilize convergence. Result: Experiments on synthetic color charts and the DAVIS dataset show better reconstruction quality around object edges than existing 2DGS methods, most noticeably when very few Gaussians are used, while keeping fast rendering and low memory usage. Conclusion: By introducing contour awareness, the method effectively improves edge preservation in 2D Gaussian Splatting and yields a high-quality, lightweight image representation. Abstract: Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real-time decoding, they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross-boundary blending and preserves edge structures under high compression. We also introduce a warm-up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.
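The region constraint described in the abstract can be illustrated with a toy renderer. This is a minimal sketch, not the authors' rasterizer: it assumes isotropic Gaussians, normalized weighted blending, and a hypothetical `region` field on each Gaussian that names the segmentation label it is allowed to paint. None of these details are given in the abstract.

```python
import math


def render(gaussians, seg, constrain=True):
    """Render a grayscale image from isotropic 2D Gaussians.

    gaussians: list of dicts with keys cx, cy, sigma, value, region
               (the `region` key is a hypothetical name for illustration).
    seg:       2D list of integer segmentation labels, one per pixel.
    constrain: if True, a Gaussian contributes only to pixels whose
               label equals its own region (no cross-boundary blending).
    """
    height, width = len(seg), len(seg[0])
    image = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weighted_sum = 0.0
            weight_total = 0.0
            for g in gaussians:
                if constrain and g["region"] != seg[y][x]:
                    continue  # skip Gaussians that belong to another region
                d2 = (x - g["cx"]) ** 2 + (y - g["cy"]) ** 2
                w = math.exp(-d2 / (2.0 * g["sigma"] ** 2))
                weighted_sum += w * g["value"]
                weight_total += w
            # Normalized blending is an assumption of this sketch.
            image[y][x] = weighted_sum / weight_total if weight_total > 0 else 0.0
    return image


# One dark Gaussian on the left region, one bright Gaussian on the right.
seg = [[0, 0, 1, 1]]
gaussians = [
    {"cx": 0.5, "cy": 0.0, "sigma": 1.0, "value": 0.0, "region": 0},
    {"cx": 2.5, "cy": 0.0, "sigma": 1.0, "value": 1.0, "region": 1},
]
sharp = render(gaussians, seg, constrain=True)     # hard edge between regions
blurred = render(gaussians, seg, constrain=False)  # values bleed across the edge
```

With only two Gaussians, the unconstrained render smears intensity across the boundary, while the constrained render keeps a hard step at the segmentation edge. That is the few-Gaussian regime in which the abstract reports the largest gains.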

[201] Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

Tong Shao,Yusen Fu,Guoying Sun,Jingde Kong,Zhuotao Tian,Jingyong Su

Main category: cs.CV

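TL;DR: This paper proposes CEM, a fidelity-optimization plugin that optimizes the caching strategy through cumulative error minimization, improving both the inference speed and the generation quality of Diffusion Transformers in image and video generation.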

Details Motivation: Diffusion Transformers (DiT) are slow at inference because of their iterative denoising process, and existing caching-based acceleration methods incur large computational error while their fixed strategies cannot adapt to the complex error variation during denoising. Method: CEM predefines an error that characterizes the model's sensitivity to timesteps and cache intervals, then uses a dynamic programming algorithm with cumulative error approximation to optimize the caching strategy. It adds no computational overhead and integrates seamlessly into existing error correction frameworks and quantized models. Result: Experiments on nine generation models across three tasks show that CEM significantly improves the generation fidelity of existing acceleration models and even exceeds the original generation performance on FLUX.1-dev, PixArt-α, StableDiffusion1.5, and Hunyuan. Conclusion: CEM is a model-agnostic, training-free acceleration plugin with strong generalization that optimizes the caching strategy to minimize cumulative error, substantially improving generation quality and applicability. Abstract: Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of the model to acceleration, jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead.
Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves the generation fidelity of existing acceleration models and outperforms the original generation performance on FLUX.1-dev, PixArt-α, StableDiffusion1.5, and Hunyuan. The code will be made publicly available.
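The idea of choosing a caching schedule by dynamic programming over a predefined error table can be sketched as follows. This is an illustrative formulation under stated assumptions, not the CEM algorithm itself: it assumes the cumulative error is approximated by a sum of per-segment errors, that `err(t, k)` (a hypothetical callback) gives the error of running a full computation at step `t` and reusing its cache for an interval of `k` steps, and that the acceleration budget is a fixed number of full computations.

```python
import math


def plan_cache_schedule(err, num_steps, num_full):
    """Pick which denoising steps get a full computation.

    err(t, k): hypothetical predefined error of computing at step t and
               reusing the cached result for an interval of k steps (k >= 1).
    Returns (total_error, sorted list of full-computation steps).
    """
    if not 1 <= num_full <= num_steps:
        raise ValueError("need 1 <= num_full <= num_steps")
    # best[j][end]: minimum summed error covering steps 0..end-1 with j segments
    best = [[math.inf] * (num_steps + 1) for _ in range(num_full + 1)]
    prev = [[None] * (num_steps + 1) for _ in range(num_full + 1)]
    best[0][0] = 0.0
    for j in range(1, num_full + 1):
        for end in range(1, num_steps + 1):
            for start in range(end):
                if best[j - 1][start] == math.inf:
                    continue
                cost = best[j - 1][start] + err(start, end - start)
                if cost < best[j][end]:
                    best[j][end] = cost
                    prev[j][end] = start
    # Walk back through the stored split points to recover the schedule.
    steps, end = [], num_steps
    for j in range(num_full, 0, -1):
        start = prev[j][end]
        steps.append(start)
        end = start
    return best[num_full][num_steps], sorted(steps)


# Toy error table: error grows with the reuse interval and with a
# per-step sensitivity (made-up numbers, for illustration only).
sensitivity = [4.0, 1.0, 1.0, 1.0]
toy_err = lambda t, k: (k - 1) ** 2 * sensitivity[t]
total, schedule = plan_cache_schedule(toy_err, num_steps=4, num_full=2)
```

In the toy table the first step is the most sensitive, so the plan recomputes early rather than splitting the steps evenly. This is the kind of timestep-dependent adaptation that a fixed caching interval cannot express.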

[202] YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection

Xu Lin,Jinlong Peng,Zhenye Gan,Jiawen Zhu,Jun Liu

Main category: cs.CV

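TL;DR: Proposes YOLO-Master, a real-time object detection framework with instance-conditional adaptive computation. An Efficient Sparse Mixture-of-Experts (ES-MoE) block allocates computation dynamically, improving detection in complex scenes while remaining real-time.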

Details Motivation: Existing YOLO-like models use static dense computation, which wastes computation on simple scenes and under-serves complex ones, hurting the balance between efficiency and performance. Method: YOLO-Master introduces an ES-MoE block and a lightweight dynamic routing network; a diversity-enhancing objective encourages expert specialization, and the router adaptively activates the most relevant experts according to the complexity of the input scene. Result: On MS COCO the model reaches 42.4% AP at 1.62 ms latency, which is +0.8% mAP over YOLOv13-N with 17.8% faster inference. Gains are largest on dense scenes, and real-time inference is preserved. Conclusion: Through instance-conditional adaptive computation, YOLO-Master effectively addresses the uneven allocation of computation in traditional RTOD models and achieves a better accuracy-speed trade-off on multiple benchmarks. Abstract: Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources such as over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62 ms latency, outperforming YOLOv13-N by +0.8% mAP with 17.8% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.
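Activating only the most relevant experts is commonly implemented as top-k gating. The sketch below shows that generic pattern with scalar inputs. It assumes a softmax gate whose logits are supplied by the caller and renormalized top-k weights; the actual ES-MoE router, its diversity objective, and its tensor shapes are not specified in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}  # weights renormalized to sum to 1


def sparse_moe(x, experts, gate_logits, k):
    """Evaluate only the selected experts and mix their outputs."""
    weights = route_top_k(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Three toy experts; in a real model the gate logits would come from a
# routing network conditioned on the input (here they are hard-coded).
experts = [lambda x: x, lambda x: 2 * x, lambda x: 100 * x]
output = sparse_moe(2.0, experts, gate_logits=[0.0, 0.0, -50.0], k=2)
```

Only two of the three experts are evaluated here, so the cost of the third is never paid. Under this reading, that is how a sparse block trades a small routing cost for skipped expert computation.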

[203] Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition

Arman Martirosyan,Shahane Tigranyan,Maria Razzhivina,Artak Aslanyan,Nazgul Salikhova,Ilya Makarov,Andrey Savchenko,Aram Avetisyan

Main category: cs.CV

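TL;DR: This paper presents two multimodal frameworks for micro-gesture recognition and behavior-based emotion prediction on the iMiGUE dataset. Micro-gestures are classified by fusing RGB and 3D pose information, and emotion is recognized by combining facial and contextual features; the method took 2nd place in the MiGA 2025 Challenge.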

Details Motivation: Micro-gesture recognition and behavior-based emotion prediction both require modeling subtle human behavior, and existing methods do not fuse multimodal information well enough to capture fine-grained spatio-temporal patterns, so more effective multimodal frameworks are needed. Method: MViTv2-S and 2s-AGCN extract video and skeletal embeddings, respectively, which are integrated by a Cross-Modal Token Fusion module; for emotion recognition, SwinFace and MViTv2-S extract facial and contextual features, which are fused by an InterFusion module. Result: Experiments on the iMiGUE dataset show strong performance on behavior-based emotion prediction, where the method placed 2nd in the MiGA 2025 Challenge. Conclusion: The proposed multimodal fusion frameworks integrate information from different modalities effectively and perform robustly on both micro-gesture recognition and behavior-based emotion prediction, demonstrating the potential of cross-modal learning for fine-grained behavior understanding. Abstract: Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To comprehensively represent gestures, video and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. Then, they are integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and contextual embeddings extracted using SwinFace and MViTv2-S models and fuse them through an InterFusion module designed to capture emotional expressions and body gestures. Experiments conducted on the iMiGUE dataset, within the scope of the MiGA 2025 Challenge, demonstrate the robust performance and accuracy of our method in the behavior-based emotion prediction task, where our approach secured 2nd place.

[204] MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

Md. Sazzadul Islam Prottasha,Nabil Walid Rafi

Main category: cs.CV

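TL;DR: This study compares MedGemma and GPT-4 on diagnosing six diseases from medical images. The LoRA-fine-tuned MedGemma-4b-it model outperforms the untuned GPT-4 in both accuracy and sensitivity, underscoring the importance of domain-specific fine-tuning for reducing hallucinations in clinical use.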

Details Motivation: To explore the potential of multimodal large language models for medical image diagnosis and to assess how domain-specific fine-tuning affects the performance of clinical decision support systems. Method: The MedGemma-4b-it model is fine-tuned with Low-Rank Adaptation (LoRA) and compared against the untuned GPT-4, with quantitative analysis based on confusion matrices and classification reports. Result: MedGemma reaches a mean test accuracy of 80.37%, well above GPT-4's 69.58%, and shows higher sensitivity on high-stakes tasks such as cancer and pneumonia detection. Conclusion: Domain-specific fine-tuning is essential for the reliability and effectiveness of multimodal large language models in clinical settings, and MedGemma shows strong potential as a tool for complex, evidence-based medical reasoning. Abstract: Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
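For readers less familiar with the metrics named above, accuracy and sensitivity (recall) can both be read off a confusion matrix. The sketch below uses a made-up 2x2 matrix for illustration; the numbers are not taken from the paper, and the row/column convention (rows are true classes, columns are predictions) is an assumption of this example.

```python
def accuracy(cm):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def sensitivity(cm, cls):
    """Recall for one class: true positives over all true members of it."""
    row_total = sum(cm[cls])
    return cm[cls][cls] / row_total if row_total else 0.0


# Illustrative matrix only (rows: true class, columns: predicted class).
# Class 0 = "disease", class 1 = "normal".
toy_cm = [
    [8, 2],  # 8 diseased cases caught, 2 missed
    [1, 9],  # 1 false alarm, 9 correct normals
]
toy_accuracy = accuracy(toy_cm)
toy_sensitivity = sensitivity(toy_cm, 0)
```

Sensitivity matters most in the high-stakes tasks the abstract mentions, because a missed cancer or pneumonia case (a false negative) lowers sensitivity even when overall accuracy stays high.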

[205] CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

Ke Niu,Haiyang Yu,Zhuofan Chen,Zhengtao Yao,Weitao Jia,Xiaodong Ge,Jingqun Tang,Benlei Cui,Bin Li,Xiangyang Xue

Main category: cs.CV

TL;DR: Proposes Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD), a new paradigm for generating high-precision, editable CAD models, and builds CADExpert, an open-source benchmark with 17,299 instances.

Details Motivation: Traditional CAD modeling is complex; existing methods produce 3D models that are non-editable and insufficiently precise, and text- or image-based inputs depend on heavy manual annotation, which limits automation and scalability in industrial design. Method: CME-CAD combines collaborative multi-expert learning with two-stage training, Multi-Expert Fine-Tuning (MEFT) followed by Multi-Expert Reinforcement Learning (MERL), to generate constraint-compatible, accurate, and fully editable CAD models. Result: The authors build CADExpert, an open-source benchmark of 17,299 instances that includes orthographic projections, precise dimension annotations, expert-generated Chain-of-Thought processes, executable CADQuery code, and rendered 3D models. Conclusion: The CME-CAD paradigm improves the precision and editability of CAD code generation and offers a scalable solution for industrial-grade CAD automation. Abstract: Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods that reconstruct 3D models from sketches often produce non-editable and approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model's ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.

[206] Visual Language Hypothesis

Xiu Li

Main category: cs.CV

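TL;DR: This paper studies visual representation learning from a structural and topological perspective. It proposes that visual understanding requires a semantic language for vision, and derives that the visual space has a fiber-bundle structure in which semantics correspond to a quotient space, and that semantic invariance requires a non-homeomorphic discriminative target together with a model mechanism that supports topology change.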

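Details Motivation: To explore the structural and topological principles behind visual representation learning and to explain why existing models need explicit semantic supervision or alignment to achieve semantic abstraction. Method: Starting from the hypothesis that visual understanding presupposes a semantic language, and combining it with common premises on transferability and abstraction, the paper uses fiber bundles and quotient spaces from topology to analyze how the visual observation space is organized. Result: 1) The semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained by smooth deformation alone; semantic invariance must rely on an external semantic target (such as labels, cross-instance identification, or multimodal alignment). 2) The model architecture must support topology change, namely an "expand-and-snap" process that separates structure and then forms discrete semantic regions. Conclusion: The framework gives a topological explanation for empirical regularities in large-scale discriminative and multimodal models and is consistent with classical statistical learning theory; semantic abstraction needs both an external semantic signal and a representation mechanism capable of topological transformation. Abstract: We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber-bundle-like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation alone; semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross-instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions.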
We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.
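The quotient language in this abstract can be made concrete with standard notation. The block below is a reading aid rather than the paper's own derivation: it assumes that $G$ is a group of nuisance transformations acting on the observation space $X$, and the symbols $\pi$, $f$, $\bar{f}$, and $Z$ are introduced here for illustration only.

```latex
% Quotient map: each observation is sent to its orbit (its semantic class).
\[
  \pi : X \to X/G, \qquad \pi(x) = [x] = \{\, g \cdot x : g \in G \,\}
\]
% The fiber over a semantic state s collects all observations that share it.
\[
  \pi^{-1}(s) = \{\, x \in X : [x] = s \,\}
\]
% A representation f : X -> Z is invariant to nuisance variation exactly
% when it factors through the quotient.
\[
  f(g \cdot x) = f(x) \;\; \forall g \in G
  \quad\Longleftrightarrow\quad
  f = \bar{f} \circ \pi \ \text{ for some } \bar{f} : X/G \to Z
\]
```

Read this way, "nuisance variation populates fibers" means that moving within $\pi^{-1}(s)$ leaves the semantic state unchanged, and learning a semantically invariant representation amounts to approximating a map defined on $X/G$.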

[207] CountGD++: Generalized Prompting for Open-World Counting

Niki Amini-Naieni,Andrew Zisserman

Main category: cs.CV

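TL;DR: This paper proposes CountGD++, a new multi-modal open-world counting method that introduces "pseudo-exemplars" and lets users specify what not to count, improving the flexibility, accuracy, and generalization of counting.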

Details Motivation: Existing automatic counting methods are limited in how the target object can be specified: visual examples must be annotated manually, and there is no way to indicate what not to count. Method: The prompt is extended so that objects not to be counted can be described with text and/or visual examples; "pseudo-exemplars" automate the annotation of visual examples; visual examples may come from natural or synthetic external images; and CountGD++ is used as a vision expert agent for an LLM. Result: The method achieves higher accuracy, efficiency, and generalization across multiple datasets. Conclusion: The approach substantially increases prompt flexibility and model performance in open-world counting. Abstract: The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of "pseudo-exemplars" that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.

[208] SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Kanghee Lee,Injae Lee,Minseok Kwak,Kwonyoung Ryu,Jungi Hong,Jaesik Park

Main category: cs.CV

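TL;DR: This paper proposes a scalable multi-view data generation and annotation pipeline and uses it to build the SpatialMosaic dataset (2M QA pairs) and the SpatialMosaic-Bench benchmark (1M QA pairs), which improve the 3D spatial reasoning of vision-language models in realistic, complex scenes. It also proposes SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models.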

Details Motivation: Existing 3D scene understanding methods rely on pre-constructed 3D representations or reconstruction pipelines, which limits scalability and practical use, and spatial reasoning under challenging real-world conditions such as partial visibility, occlusion, and low overlap remains under-explored. Method: A scalable multi-view image data generation and annotation pipeline is used to build the large-scale instruction-tuning dataset SpatialMosaic (2M QA pairs) and the evaluation benchmark SpatialMosaic-Bench (1M QA pairs, 6 tasks); SpatialMosaicVLM is a vision-language model framework that integrates 3D reconstruction models as geometry encoders. Result: Experiments show that the dataset and VQA tasks improve spatial reasoning under challenging multi-view conditions, validating the pipeline's ability to construct realistic and diverse QA pairs. Conclusion: High-quality multi-view QA data generation combined with hybrid modeling that incorporates 3D geometry can markedly strengthen the spatial understanding and reasoning of vision-language models in complex real-world scenes, offering a viable path to robust spatial reasoning without explicit 3D reconstruction. Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning.
Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.

[209] MGCA-Net: Multi-Graph Contextual Attention Network for Two-View Correspondence Learning

Shuyuan Lin,Mengtin Lo,Haosheng Chen,Yanjie Liang,Qiangqiang Wu

Main category: cs.CV

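TL;DR: This paper proposes the Multi-Graph Contextual Attention Network (MGCA-Net) for two-view correspondence learning. Its Contextual Geometric Attention (CGA) and Cross-Stage Multi-Graph Consensus (CSMGC) modules improve local geometric modeling and cross-stage information optimization, and the network significantly outperforms existing methods on the YFCC100M and SUN3D datasets.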

Details Motivation: Existing methods fall short in local geometric modeling and cross-stage information optimization, so they cannot accurately capture the geometric constraints of matched pairs, which reduces model robustness. Method: MGCA-Net consists of a CGA module, which adaptively fuses spatial position and feature information to better model local and global geometric relationships, and a CSMGC module, which establishes geometric consensus through a cross-stage sparse graph network. Result: Experiments on YFCC100M and SUN3D show that MGCA-Net significantly outperforms existing SOTA methods on outlier rejection and camera pose estimation. Conclusion: MGCA-Net improves geometric modeling and information consistency in two-view correspondence learning, strengthens matching robustness, and has good application prospects. Abstract: Two-view correspondence learning is a key task in computer vision, which aims to establish reliable matching relationships for applications such as camera pose estimation and 3D reconstruction. However, existing methods have limitations in local geometric modeling and cross-stage information optimization, which make it difficult to accurately capture the geometric constraints of matched pairs and thus reduce the robustness of the model. To address these challenges, we propose a Multi-Graph Contextual Attention Network (MGCA-Net), which consists of a Contextual Geometric Attention (CGA) module and a Cross-Stage Multi-Graph Consensus (CSMGC) module. Specifically, CGA dynamically integrates spatial position and feature information via an adaptive attention mechanism and enhances the capability to capture both local and global geometric relationships. Meanwhile, CSMGC establishes geometric consensus via a cross-stage sparse graph network, ensuring the consistency of geometric information across different stages. Experimental results on two representative datasets, YFCC100M and SUN3D, show that MGCA-Net significantly outperforms existing SOTA methods in the outlier rejection and camera pose estimation tasks. Source code is available at http://www.linshuyuan.com.

[210] NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization

Yifei Li,Haoyuan He,Yu Zheng,Bingyao Yu,Wenzhao Zheng,Lei Chen,Jie Zhou,Jiwen Lu

Main category: cs.CV

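TL;DR: This paper proposes NeXT-IMDL, a large-scale diagnostic benchmark for Image Manipulation Detection and Localization (IMDL) whose multi-dimensional cross-evaluation exposes the weak generalization of existing models in real-world scenarios.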

Details Motivation: Existing IMDL methods look strong under cross-dataset evaluation, but this simplified evaluation hides their fragility on diverse AI-generated content and gives a misleading picture of progress, so a more systematic and challenging evaluation framework is needed to reveal true generalization ability. Method: NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes (editing models, manipulation types, content semantics, and forgery granularity), defines five rigorous cross-dimension evaluation protocols, and tests 11 representative models. Result: Although existing models perform well in their original settings, all of them show significant performance degradation and systemic failures under the NeXT-IMDL cross-dimension protocols, exposing serious gaps in generalization. Conclusion: NeXT-IMDL reveals the limitations of current IMDL methods and provides a diagnostic tool and research direction for building truly robust next-generation image manipulation detection models. Abstract: The accessibility surge and abuse risks of user-friendly image editing models have created an urgent need for generalizable, up-to-date methods for Image Manipulation Detection and Localization (IMDL). Current IMDL research typically uses cross-dataset evaluation, where models trained on one benchmark are tested on others. However, this simplified evaluation approach conceals the fragility of existing methods when handling diverse AI-generated content, leading to misleading impressions of progress. This paper challenges this illusion by proposing NeXT-IMDL, a large-scale diagnostic benchmark designed not just to collect data, but to probe the generalization boundaries of current detectors systematically. Specifically, NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes: editing models, manipulation types, content semantics, and forgery granularity. Built upon this, NeXT-IMDL implements five rigorous cross-dimension evaluation protocols. Our extensive experiments on 11 representative models reveal a critical insight: while these models perform well in their original settings, they exhibit systemic failures and significant performance degradation when evaluated under our designed protocols that simulate various real-world generalization scenarios. By providing this diagnostic toolkit and the new findings, we aim to advance the development towards building truly robust, next-generation IMDL models.

[211] SoulX-LiveTalk Technical Report

Le Shen,Qiao Qian,Tan Yu,Ke Zhou,Tianhang Yu,Yu Zhan,Zhenjie Wang,Ming Tao,Shunshun Yin,Siyuan Liu

Main category: cs.CV

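TL;DR: SoulX-LiveTalk is a 14B-parameter framework for real-time, streaming, audio-driven avatar generation. It uses Self-correcting Bidirectional Distillation and a Multi-step Retrospective Self-Correction Mechanism to markedly improve visual quality and motion coherence while keeping latency low and frame rate high.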

Details Motivation: Existing methods for real-time, long-duration avatar generation sacrifice visual quality through unidirectional attention or reduced model capacity, and cannot meet the demands of high-fidelity interaction. Method: Self-correcting Bidirectional Distillation retains bidirectional attention within video chunks, and a Multi-step Retrospective Self-Correction Mechanism lets the model recover from errors autonomously; a full-stack inference acceleration suite combines hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Result: The system reaches a start-up latency of 0.87 s and a real-time throughput of 32 FPS, making it the first 14B-scale system with sub-second latency. Conclusion: SoulX-LiveTalk sets a new standard for large-scale, high-quality, real-time avatar generation and addresses the stability and fidelity problems of long-duration generation. Abstract: Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-LiveTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.

[212] SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance Segmentation

Xiaolan Li,Wanquan Liu,Pengcheng Li,Pengyu Jie,Chenqiang Gao

Main category: cs.CV

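TL;DR: This paper proposes SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that addresses boundary leakage, center drift, and inconsistent labels in 3D tooth instance segmentation. It injects frozen 2D semantics (from SAM) to improve 3D segmentation and achieves state-of-the-art results without 2D fine-tuning.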

Details Motivation: 3D tooth instance segmentation is difficult because of crowded arches, ambiguous gingival boundaries, missing teeth, and the rarity of clinically important third molars; existing 3D methods suffer from boundary leakage and inconsistent identities, while 2D foundation models are hard to apply directly in 3D clinical workflows. Method: SOFTooth has three parts: 1) a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine boundaries; 2) a center-guided mask refinement module strengthens consistency between instance masks and geometric centroids; 3) an order-aware Hungarian matching strategy combines anatomical order and center distance for instance assignment. Result: On 3DTeethSeg'22 the method reaches state-of-the-art overall accuracy and mean IoU, with clear gains on hard cases such as third molars. Conclusion: Rich 2D semantics can be transferred to 3D tooth segmentation without 2D fine-tuning, and SOFTooth stays robust and accurate under complex anatomy and missing teeth. Abstract: Three-dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary-aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine tooth-gingiva and inter-tooth boundaries. Second, a center-guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order-aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity-based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg'22, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning.
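The order-aware matching idea can be sketched as an assignment problem whose cost adds an order term and a center-distance term to a similarity term. Everything below is an assumption made for illustration: the cost weights, the field names (`scores`, `order`, `center`), and the additive form are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm (it finds the same optimum but is only practical for a handful of instances).

```python
import math
from itertools import permutations


def match_cost(pred, gt, w_order=1.0, w_center=1.0):
    """Hypothetical cost of assigning one predicted instance to one GT tooth."""
    sim_cost = 1.0 - pred["scores"][gt["label"]]          # similarity term
    order_cost = abs(pred["order"] - gt["order"])         # position along the arch
    center_cost = math.dist(pred["center"], gt["center"]) # centroid distance
    return sim_cost + w_order * order_cost + w_center * center_cost


def best_assignment(preds, gts, w_order=1.0, w_center=1.0):
    """Return a tuple a where a[i] is the GT index matched to prediction i.

    Brute force over permutations; a stand-in for Hungarian matching.
    """
    if len(preds) != len(gts):
        raise ValueError("this sketch assumes equally many preds and gts")
    best_total, best_perm = math.inf, None
    for perm in permutations(range(len(gts))):
        total = sum(
            match_cost(preds[i], gts[j], w_order, w_center)
            for i, j in enumerate(perm)
        )
        if total < best_total:
            best_total, best_perm = total, perm
    return best_perm


# Two predictions whose class scores are ambiguous; order and center
# information decide the assignment.
gts = [
    {"label": "A", "order": 0, "center": (0.0, 0.0)},
    {"label": "B", "order": 1, "center": (1.0, 0.0)},
]
preds = [
    {"scores": {"A": 0.5, "B": 0.5}, "order": 1, "center": (1.0, 0.0)},
    {"scores": {"A": 0.5, "B": 0.5}, "order": 0, "center": (0.0, 0.0)},
]
assignment = best_assignment(preds, gts)
```

With the order and center weights set to zero, both assignments cost the same and the similarity scores alone cannot separate the two teeth. The extra terms break that tie, which mirrors the abstract's claim of coherent labeling under crowded or missing dentitions.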

[213] Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment

Henglin Liu,Nisha Huang,Chang Liu,Jiangpeng Yan,Huijuan Huang,Jixuan Ying,Tong-Yee Lee,Pengfei Wan,Xiangyang Ji

Main category: cs.CV

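TL;DR: This paper proposes ArtQuant, a new aesthetic quality assessment framework, and RAD, a large-scale multi-dimensional dataset. By combining LLM decoders with joint description generation, it addresses data scarcity and model fragmentation, achieves state-of-the-art performance on several datasets, and substantially reduces the number of training epochs.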

Details Motivation: Aesthetic quality assessment is hard because it spans visual perception, cognition, and emotion. Existing datasets focus mainly on visual perception and are expensive to annotate, and existing models handle long textual descriptions poorly, which leaves a cognitive gap in aesthetic judgment of artistic images. Method: To address data scarcity, the authors build RAD, a large-scale (70k samples), multi-dimensional structured dataset generated automatically by an iterative pipeline. To address model fragmentation, they design ArtQuant, which uses LLM decoders for joint description generation to couple isolated aesthetic dimensions and better model long-text semantics; a theoretical analysis argues that RAD's semantic adequacy and the generation paradigm together minimize prediction entropy. Result: ArtQuant reaches state-of-the-art performance on several aesthetic datasets with only 33% of the training epochs of conventional methods. Conclusion: By optimizing data and model together, the RAD dataset and the ArtQuant framework improve aesthetic assessment of AIGC content, narrow the cognitive gap between artistic images and human aesthetic judgment, and are scalable and useful for further research. Abstract: The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing datasets focus excessively on visual perception and neglect deeper dimensions because manual annotation is expensive; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoders, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset that is generated via an iterative pipeline without heavy annotation costs and is easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework.
Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.

[214] DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

Tianze Xia,Yongkang Li,Lijun Zhou,Jingfeng Yao,Kaixin Xiong,Haiyang Sun,Bing Wang,Kun Ma,Hangjun Ye,Wenyu Liu,Xinggang Wang

Main category: cs.CV

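TL;DR: DriveLaW proposes a new paradigm that unifies video generation and motion planning. By sharing a latent representation, it keeps high-fidelity future prediction consistent with reliable trajectory planning and achieves state-of-the-art results among autonomous-driving world models.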

Details Motivation: In existing autonomous-driving world models, video prediction and motion planning are usually decoupled, which makes the two inconsistent and limits their ability to handle real-world long-tail cases. A unified framework is needed to guarantee inherent consistency between prediction and planning. Method: DriveLaW consists of DriveLaW-Video, which generates high-quality future video, and DriveLaW-Act, a diffusion planner that operates on the video generator's latent representation. Both components are optimized with a three-stage progressive training strategy. Result: On video generation, DriveLaW surpasses the best prior work by 33.3% in FID and 1.8% in FVD; on planning, it sets a new record on the NAVSIM benchmark. Conclusion: DriveLaW unifies video generation and motion planning, keeps prediction and action consistent, and improves the performance of autonomous-driving world models. Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasts with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing the best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.

[215] Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

Dohyun Kim,Seungwoo Lyu,Seung Wook Kim,Paul Hongsuck Seo

Main category: cs.CV

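TL;DR: This paper proposes Direct Diffusion Score Preference Optimization (DDSPO), which uses a pretrained reference model to generate per-step preference signals automatically. It optimizes diffusion model generation without human annotation and improves text-image alignment and visual quality.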

Details Motivation: Diffusion models perform well on text-to-image generation but struggle to align precisely with user intent and to maintain consistent aesthetic quality. Existing preference-based training methods depend on costly and potentially noisy human-labeled data, which limits their applicability. Method: DDSPO derives dense, transition-level supervision from winning and losing policies at every step of the denoising process. Preference signals are produced automatically by contrasting a pretrained reference model's outputs under the original prompt and under a semantically degraded prompt, which enables score-space preference optimization without explicit reward modeling or manual annotation. Result: Experiments show that DDSPO matches or outperforms existing preference-based methods in text-image alignment and visual quality while relying far less on external labeled data. Conclusion: DDSPO is an efficient preference optimization framework for diffusion models with low supervision requirements; it improves generation quality and consistency with user intent. Abstract: Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: https://dohyun-as.github.io/DDSPO

[216] Towards Integrating Uncertainty for Domain-Agnostic Segmentation

Jesse Brouwers,Xiaoyan Xing,Alexander Timans

Main category: cs.CV

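TL;DR: This paper studies uncertainty quantification for segmentation models such as SAM, with the aim of improving generalization in challenging scenarios. The authors build the UncertSAM benchmark, evaluate several lightweight uncertainty estimation methods, and explore an uncertainty-guided prediction refinement strategy.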

Details Motivation: Foundation segmentation models such as SAM have strong zero-shot performance but remain fragile under distribution shift or in limited-knowledge domains. This work explores whether uncertainty quantification can mitigate these problems and support domain-agnostic generalization. Method: (1) Build UncertSAM, a benchmark of eight datasets that tests SAM under challenging conditions such as shadows, transparency, and camouflage; (2) evaluate a set of lightweight, post-hoc uncertainty estimation methods; (3) explore a preliminary uncertainty-guided prediction refinement step. Result: Uncertainty estimates from a last-layer Laplace approximation correlate well with segmentation errors, indicating a meaningful signal. The refinement results are preliminary but show potential for improvement. Conclusion: Incorporating uncertainty into segmentation models can support more robust, domain-agnostic performance. Abstract: Foundation models for segmentation such as the Segment Anything Model (SAM) family exhibit strong zero-shot performance, but remain vulnerable in shifted or limited-knowledge domains. This work investigates whether uncertainty quantification can mitigate such challenges and enhance model generalisability in a domain-agnostic manner. To this end, we (1) curate UncertSAM, a benchmark comprising eight datasets designed to stress-test SAM under challenging segmentation conditions including shadows, transparency, and camouflage; (2) evaluate a suite of lightweight, post-hoc uncertainty estimation methods; and (3) assess a preliminary uncertainty-guided prediction refinement step. Among evaluated approaches, a last-layer Laplace approximation yields uncertainty estimates that correlate well with segmentation errors, indicating a meaningful signal. While refinement benefits are preliminary, our findings underscore the potential of incorporating uncertainty into segmentation models to support robust, domain-agnostic performance. Our benchmark and code are made publicly available.

[217] Fuzzy-Logic and Deep Learning for Environmental Condition-Aware Road Surface Classification

Mustafa Demetgul,Sanja Lazarova Molnar

Main category: cs.CV

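TL;DR: This paper proposes a real-time road monitoring system based on weather condition and road surface condition data. It collects images with a mobile phone camera, combines them with acceleration data to train deep learning models, and achieves high-accuracy road surface classification.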

Details Motivation: Classical road monitoring methods are expensive and unsystematic, so a more efficient, low-cost, real-time approach is needed. Method: The authors collect mobile phone camera images and vehicle acceleration data on campus roads, convert the acceleration data into images, and train several deep learning models (AlexNet, LeNet, VGG, and ResNet). Fuzzy logic then selects either the acceleration data or the camera image for classification, depending on the weather and the time of day. Result: Accuracy above 95% is achieved across five road surface classes (asphalt, damaged asphalt, gravel road, damaged gravel road, pavement road), and the acceleration-based and image-based approaches are compared. Conclusion: The proposed system supports accurate real-time road condition monitoring; combining multi-source data with deep learning models and using fuzzy logic to choose the input source improves adaptability in practice. Abstract: Monitoring the state of road surfaces provides valuable information for planning and controlling vehicles and active vehicle control systems. Classical road monitoring methods are expensive and unsystematic because they require time for measurements. This article proposes a real-time system based on weather condition data and road surface condition data. For this purpose, we collected data with a mobile phone camera on the roads around the campus of the Karlsruhe Institute of Technology. We tested a large number of different image-based deep learning algorithms for road classification. In addition, we used road acceleration data along with road image data for training by converting them into images. We compared the performances of acceleration-based and camera image-based approaches. The performances of the AlexNet, LeNet, VGG, and ResNet architectures were compared. For road condition classification, 5 classes were considered: asphalt, damaged asphalt, gravel road, damaged gravel road, and pavement road; over 95% accuracy was achieved. It is also proposed to use fuzzy logic to choose between the acceleration data and the camera image for classifying the road surface, according to the weather and the time of day.
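The fuzzy-logic source selection described above could be sketched as follows. This is a minimal illustration, not the authors' rule base: the input scales, the membership breakpoints, and the function names `ramp_up` and `choose_sensor` are all assumptions made for the example.

```python
def ramp_up(x, lo, hi):
    """Membership that rises linearly from 0 at lo to 1 at hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def choose_sensor(daylight, precipitation):
    """Pick an input source from two fuzzy inputs in [0, 1].

    daylight: 0 = night, 1 = full daylight (hypothetical scale)
    precipitation: 0 = dry, 1 = heavy rain or snow (hypothetical scale)
    """
    # Fuzzify the crisp inputs (breakpoints are illustrative assumptions).
    bright = ramp_up(daylight, 0.2, 0.6)
    dark = 1.0 - bright
    wet = ramp_up(precipitation, 0.3, 0.7)
    clear = 1.0 - wet

    # Rule 1: IF bright AND clear THEN use the camera (AND = min).
    camera = min(bright, clear)
    # Rule 2: IF dark OR wet THEN use the accelerometer (OR = max).
    accel = max(dark, wet)

    source = "camera" if camera > accel else "accelerometer"
    return source, camera, accel
```

With these assumed rules, a bright and dry scene selects the camera, while darkness or heavy precipitation shifts the decision to the acceleration signal, which does not depend on visibility.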

[218] RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction

Shuhong Liu,Chenyu Bao,Ziteng Cui,Yun Liu,Xuangeng Chu,Lin Gu,Marcos V. Conde,Ryo Umagami,Tomohiro Hashimoto,Zijian Hu,Tianhan Xu,Yuan Gan,Yusuke Kurose,Tatsuya Harada

Main category: cs.CV

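TL;DR: RealX3D is a real-capture benchmark for multi-view visual restoration and 3D reconstruction. It covers several types of physical degradation and exposes the limitations of existing methods in complex real-world environments.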

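Details Motivation: Existing multi-view 3D reconstruction methods perform well under ideal conditions, but their performance drops markedly in real scenes affected by physical degradations such as illumination, scattering, occlusion, and blur, and there is no benchmark that evaluates these effects systematically. Method: RealX3D groups physical degradations into four families (illumination, scattering, occlusion, blurring) and uses a unified acquisition protocol to capture pixel-aligned low-quality and ground-truth views. Each scene also includes high-resolution captures, RAW images, and dense laser scans, from which world-scale meshes and metric depth maps are derived. Result: Evaluation of a range of optimization-based and feed-forward methods shows a substantial drop in reconstruction quality under physical degradation, confirming the fragility of current methods. Conclusion: RealX3D provides an evaluation platform for multi-view reconstruction and visual restoration that is closer to real-world conditions, and it highlights the need for more robust algorithms. Abstract: We introduce RealX3D, a real-capture benchmark for multi-view visual restoration and 3D reconstruction under diverse physical degradations. RealX3D groups corruptions into four families, including illumination, scattering, occlusion, and blurring, and captures each at multiple severity levels using a unified acquisition protocol that yields pixel-aligned LQ/GT views. Each scene includes high-resolution capture, RAW images, and dense laser scans, from which we derive world-scale meshes and metric depth. Benchmarking a broad range of optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, underscoring the fragility of current multi-view pipelines in real-world challenging environments.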

[219] CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models

Zongsheng Cao,Yangfan He,Anran Liu,Jun Xie,Feng Chen,Zepeng Wang

Main category: cs.CV

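TL;DR: This paper proposes CoFi-Dec, a training-free decoding framework that reduces hallucinations in large vision-language models through generative self-feedback and coarse-to-fine visual conditioning.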

Details Motivation: Large vision-language models have made notable progress in multi-modal understanding and generation, but they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. Method: CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the image, then uses a text-to-image model to turn these responses into synthetic images that serve as multi-level visual hypotheses. A Wasserstein-based fusion mechanism aligns the predictive distributions from these visual conditions into a geometrically consistent decoding trajectory. Result: On six hallucination-focused benchmarks, CoFi-Dec substantially reduces entity-level and semantic-level hallucinations and outperforms existing decoding strategies. The framework is model-agnostic, needs no additional training, and applies to a wide range of large vision-language models. Conclusion: By combining generative self-feedback with coarse-to-fine visual conditioning, CoFi-Dec mitigates hallucinations in large vision-language models and improves the reliability and faithfulness of their outputs. Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at https://github.com/AI-Researcher-Team/CoFi-Dec.

[220] Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin

Kayathri Vigneswaran,Hugo Retief,Jai Clifford Holmes,Mariangel Garcia Andarcia,Hansaka Tennakoon

Main category: cs.CV

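TL;DR: The paper proposes an automated river water-level monitoring framework that combines vision-based waterline detection, YOLOv8 pose estimation, and large multimodal language models (GPT-4o and Gemini) to extract gauge readings accurately and at scale.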

Details Motivation: Traditional hydrological observation is limited by manual measurement errors and environmental conditions, which makes continuous and accurate water-level monitoring difficult; an automated solution is needed. Method: A multi-stage pipeline of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction combines YOLOv8 pose estimation with multimodal large language models, and supplies geometric metadata to improve LLM predictions. Result: Waterline detection reaches a precision of 94.24% and an F1 score of 83.64%. Under optimal image conditions, Gemini Stage 2 achieves a mean absolute error of 5.43 cm, a root mean square error of 8.58 cm, and an R² of 0.84. Conclusion: Combining geometric metadata with multimodal AI improves the robustness and accuracy of water-level estimation, and the framework offers an efficient approach to real-time river gauge digitization and water resource management. Abstract: Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision-based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT-4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved a high precision of 94.24 percent and an F1 score of 83.64 percent, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real-time river gauge digitization and improved water resource management.
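For readers unfamiliar with the three error metrics quoted above, the following minimal sketch shows how mean absolute error, root mean square error, and R squared are conventionally computed from paired readings. The function names and the sample values in the usage note are illustrative assumptions and are not data from the study.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted readings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error; penalizes large misses more than MAE does."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For example, with made-up gauge readings in centimetres, `mae([100, 120, 140, 160], [105, 115, 150, 160])` gives 5.0, and the corresponding R squared is 0.925.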

[221] Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators

Bohan Xiao,Peiyong Wang,Qisheng He,Ming Dong

Main category: cs.CV

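TL;DR: The paper proposes a denoising Brownian bridge model with dual approximators (Dual-approx Bridge) for deterministic image-to-image translation, improving output fidelity and image quality.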

Details Motivation: In deterministic image translation, existing methods struggle to deliver both high fidelity and low variance in their outputs, so a better generative model is needed to approximate the true data distribution. Method: The generative process is built on Brownian bridge dynamics, with two neural networks that approximate the forward and reverse processes respectively, enabling denoising and accurate reconstruction. Result: On image generation and super-resolution tasks, the method outperforms both stochastic and deterministic baselines, with higher image quality and greater faithfulness to the ground truth. Conclusion: Through its dual-approximator structure and Brownian bridge mechanism, Dual-approx Bridge improves deterministic image-to-image translation. Abstract: Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: https://github.com/bohan95/dual-app-bridge

[222] HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

Yuxin Wen,Qing Shuai,Di Kang,Jing Li,Cheng Wen,Yue Qian,Ningxin Jiao,Changhai Chen,Weijie Chen,Yiran Wang,Jinkun Guo,Dongyue An,Han Liu,Yanyu Tong,Chao Zhang,Qing Guo,Juan Chen,Qiao Zhang,Youyi Zhang,Zihao Yao,Cheng Zhang,Hong Duan,Xiaoping Wu,Qi Chen,Fei Cheng,Liang Dong,Peng He,Hao Zhang,Jiaxin Lin,Chao Zhang,Zhongyi Fan,Yifan Li,Zhichao Hu,Yuhong Liu,Linus,Jie Jiang,Xiaolong Li,Linchao Bao

Main category: cs.CV

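TL;DR: HY-Motion 1.0 is presented as the first billion-parameter 3D human motion generation model based on a Diffusion Transformer. It generates high-quality motion from text and uses a full-stage training pipeline to achieve strong instruction alignment and commercial potential.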

Details Motivation: To push 3D human motion generation models toward larger scale, higher precision, and commercial use, and to close the gap in scale and performance among current open-source models. Method: A Diffusion Transformer-based flow matching architecture is trained with a full-stage paradigm: large-scale pretraining (over 3,000 hours of data), high-quality fine-tuning (400 hours of curated data), and reinforcement learning from human feedback and reward models, supported by a rigorous data cleaning and captioning pipeline. Result: The model covers more than 200 motion categories across 6 major classes, and its motion quality and instruction following clearly exceed existing open-source baselines. Conclusion: HY-Motion 1.0 successfully scales up motion generation models, offers leading generation capability and commercial prospects, and has been released to the open-source community to support further research. Abstract: We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm -- including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models -- to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.

[223] MCI-Net: A Robust Multi-Domain Context Integration Network for Point Cloud Registration

Shuyuan Lin,Wenwu Peng,Junjie Huang,Qiang Qi,Miaohui Wang,Jian Weng

Main category: cs.CV

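TL;DR: The paper proposes MCI-Net, a multi-domain context integration network that improves feature representation and performance in point cloud registration. Using graph neighborhood aggregation, progressive context interaction, and dynamic inlier selection, it significantly outperforms existing methods on several datasets.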

Details Motivation: Existing deep learning methods based on Euclidean neighborhoods struggle to capture the implicit semantics and structural consistency of point clouds, which limits their feature representation. Method: MCI-Net comprises a graph neighborhood aggregation module that builds a global graph, a progressive context interaction module that performs intra-domain feature decoupling and inter-domain context interaction, and a dynamic inlier selection method that optimizes inlier weights using residuals from multiple iterations of pose estimation. Result: Across indoor and outdoor point cloud datasets, MCI-Net significantly outperforms existing state-of-the-art methods, reaching the highest registration recall of 96.4% on 3DMatch. Conclusion: Through multi-domain context integration, MCI-Net improves the accuracy and robustness of point cloud registration and yields more discriminative features. Abstract: Robust and discriminative feature learning is critical for high-quality point cloud registration. However, existing deep learning-based methods typically rely on Euclidean neighborhood-based strategies for feature extraction, which struggle to effectively capture the implicit semantics and structural consistency in point clouds. To address these issues, we propose a multi-domain context integration network (MCI-Net) that improves feature representation and registration performance by aggregating contextual cues from diverse domains. Specifically, we propose a graph neighborhood aggregation module, which constructs a global graph to capture the overall structural relationships within point clouds. We then propose a progressive context interaction module to enhance feature discriminability by performing intra-domain feature decoupling and inter-domain context interaction. Finally, we design a dynamic inlier selection method that optimizes inlier weights using residual information from multiple iterations of pose estimation, thereby improving the accuracy and robustness of registration. Extensive experiments on indoor RGB-D and outdoor LiDAR datasets show that the proposed MCI-Net significantly outperforms existing state-of-the-art methods, achieving the highest registration recall of 96.4% on 3DMatch. Source code is available at http://www.linshuyuan.com.

[224] SC-Net: Robust Correspondence Learning via Spatial and Cross-Channel Context

Shuyuan Lin,Hailiang Liao,Qiang Qi,Junjie Huang,Taotao Lai,Jian Weng

Main category: cs.CV

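TL;DR: This paper proposes SC-Net, a network that improves motion field estimation in two-view correspondence learning. It integrates context from both spatial and channel perspectives and outperforms existing methods on pose estimation and outlier removal.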

Details Motivation: Existing CNN backbones are not tailored to the task, have difficulty aggregating global context, and tend to oversmooth dense motion fields in scenes with large disparity. Method: SC-Net contains three modules: an adaptive focused regularization module (AFR) that strengthens position awareness and robustness; a bilateral field adjustment module (BFA) that models long-range dependencies and promotes interaction across spatial and channel dimensions; and a position-aware recovery module (PAR) that recovers motion vectors precisely. Result: Experiments on the YFCC100M and SUN3D datasets show that SC-Net outperforms current state-of-the-art methods on relative pose estimation and outlier removal. Conclusion: By fusing bilateral context, SC-Net improves two-view correspondence learning, with stronger position awareness and more accurate motion field recovery. Abstract: Recent research has focused on using convolutional neural networks (CNNs) as the backbones in two-view correspondence learning, demonstrating significant superiority over methods based on multilayer perceptrons. However, CNN backbones that are not tailored to specific tasks may fail to effectively aggregate global context and oversmooth dense motion fields in scenes with large disparity. To address these problems, we propose a novel network named SC-Net, which effectively integrates bilateral context from both spatial and channel perspectives. Specifically, we design an adaptive focused regularization module (AFR) to enhance the model's position-awareness and robustness against spurious motion samples, thereby facilitating the generation of a more accurate motion field. We then propose a bilateral field adjustment module (BFA) to refine the motion field by simultaneously modeling long-range relationships and facilitating interaction across spatial and channel dimensions. Finally, we recover the motion vectors from the refined field using a position-aware recovery module (PAR) that ensures consistency and precision. Extensive experiments demonstrate that SC-Net outperforms state-of-the-art methods in relative pose estimation and outlier removal tasks on YFCC100M and SUN3D datasets. Source code is available at http://www.linshuyuan.com.

[225] TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding

Zongsheng Cao,Yangfan He,Anran Liu,Feng Chen,Zepeng Wang,Jun Xie

Main category: cs.CV

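TL;DR: TV-RAG is a training-free framework that improves long-video understanding through temporal alignment and entropy-guided semantics, strengthening the reasoning of large video language models without fine-tuning.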

Details Motivation: Existing large video language models are limited by short temporal windows when processing long videos and miss fine-grained semantic shifts, while mainstream text-based retrieval ignores the temporal dependencies among modalities. Method: TV-RAG contains a time-decay retrieval module, which introduces explicit temporal offsets, and an entropy-weighted key-frame sampler, which selects information-dense frames. Combining temporal and semantic signals yields a dual-level reasoning routine. Result: TV-RAG outperforms most leading baselines on long-video benchmarks including Video-MME, MLVU, and LongVideoBench. Conclusion: TV-RAG offers a lightweight, low-cost upgrade for large video language models and improves complex reasoning over long videos. Abstract: Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and (ii) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at https://github.com/AI-Researcher-Team/TV-RAG.
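The two mechanisms named in the abstract can be illustrated with a small sketch. It is not taken from the TV-RAG code: the exponential decay form, the `decay` constant, the pixel-histogram entropy, and all function names are assumptions chosen to make the idea concrete.

```python
import math
from collections import Counter


def time_decayed_score(similarity, segment_time, anchor_time, decay=0.01):
    """Down-weight a similarity score by temporal distance (assumed exponential form)."""
    return similarity * math.exp(-decay * abs(segment_time - anchor_time))


def shannon_entropy(pixels):
    """Shannon entropy in bits of a frame given as a flat list of intensity values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sample_key_frames(frames, num_keys):
    """Split frames into num_keys contiguous windows and keep the
    highest-entropy frame index from each window."""
    n = len(frames)
    if not 0 < num_keys <= n:
        raise ValueError("num_keys must be between 1 and len(frames)")
    picked = []
    for k in range(num_keys):
        # Integer arithmetic keeps windows contiguous and non-empty.
        start, end = k * n // num_keys, (k + 1) * n // num_keys
        best = max(range(start, end), key=lambda i: shannon_entropy(frames[i]))
        picked.append(best)
    return picked
```

Windowing before picking the most informative frame is one simple way to keep the selected frames spread across the video while still favoring information-dense ones.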

[226] Multi-label Classification with Panoptic Context Aggregation Networks

Mingyuan Jiu,Hailong Zhu,Wenchuan Wei,Hichem Sahbi,Rongrong Ji,Mingliang Xu

Main category: cs.CV

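TL;DR: This paper proposes the Deep Panoptic Context Aggregation Network (PanCAN), which aggregates features across scales in a high-dimensional Hilbert space to integrate multi-order geometric context hierarchically, improving multi-label classification performance.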

Details Motivation: Existing methods focus mainly on basic geometric relationships or local features and neglect cross-scale contextual interactions between objects, which limits their understanding of complex scenes. Method: PanCAN combines random walks with an attention mechanism to learn multi-order neighborhood relationships at each scale. Modules from different scales are cascaded: salient anchors are selected at a finer scale, and their neighborhood features are fused dynamically via attention, enabling cross-scale modeling. Result: Extensive multi-label classification experiments on NUS-WIDE, PASCAL VOC2007, and MS-COCO show that PanCAN outperforms existing techniques in both quantitative and qualitative evaluations. Conclusion: By fusing multi-order and cross-scale context-aware features, PanCAN strengthens complex scene understanding and improves multi-label image classification. Abstract: Context modeling is crucial for visual recognition, enabling highly discriminative image representations by integrating both intrinsic and extrinsic relationships between objects and labels in images. A limitation in current approaches is their focus on basic geometric relationships or localized features, often neglecting cross-scale contextual interactions between objects. This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts through cross-scale feature aggregation in a high-dimensional Hilbert space. Specifically, PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism. Modules from different scales are cascaded, where salient anchors at a finer scale are selected and their neighborhood features are dynamically fused via attention. This enables effective cross-scale modeling that significantly enhances complex scene understanding by combining multi-order and cross-scale context-aware features. Extensive multi-label classification experiments on NUS-WIDE, PASCAL VOC2007, and MS-COCO benchmarks demonstrate that PanCAN consistently achieves competitive results, outperforming state-of-the-art techniques in both quantitative and qualitative evaluations, thereby substantially improving multi-label classification performance.

[227] IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation

Donghao Zhou,Jingyu Lin,Guibao Shen,Quande Liu,Jialin Gao,Lihao Liu,Lan Du,Cunjian Chen,Chi-Wing Fu,Xiaowei Hu,Pheng-Ann Heng

Main category: cs.CV

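TL;DR: This paper proposes IdentityStory, a framework for generating human-centric story image sequences. Using Iterative Identity Discovery and Re-denoising Identity Injection, it keeps character identity consistent across multiple images, and it performs particularly well on face consistency and multi-character combinations.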

Details Motivation: Existing visual generative models face challenges when generating stories with consistent characters from text, especially in keeping facial details consistent and coordinating multiple characters. Method: IdentityStory has two core components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which injects identity information into images while preserving context. Result: Experiments on the ConsiStory-Human benchmark show that the method outperforms existing approaches in face consistency and supports multi-character combinations; it also shows potential for infinite-length story generation and dynamic character composition. Conclusion: IdentityStory addresses the problem of keeping human character identity consistent across sequential images and advances human-centric visual story generation. Abstract: Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.

[228] Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

Hexin Zhang,Dong Li,Jie Huang,Bingzhou Wang,Xueyang Fu,Zhengjun Zha

Main category: cs.CV

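TL;DR: The paper proposes IAFS, a training-free inference-time scaling framework that uses iterative refinement and adaptive frequency steering to balance perceptual quality against structural fidelity in image super-resolution.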

Details Motivation: Existing diffusion models for image super-resolution struggle to guarantee both high-frequency perceptual quality and low-frequency structural fidelity, and current inference-time optimization strategies suffer from over-smoothing or structural inconsistency. Method: IAFS combines iterative correction with frequency-aware particle fusion: it progressively corrects structural deviations during generation and adaptively fuses high- and low-frequency information for a more balanced reconstruction. Result: Experiments on multiple diffusion-based SR models show that IAFS outperforms existing inference-time methods in both perceptual detail and structural accuracy, easing the perception-fidelity conflict. Conclusion: IAFS provides an efficient, general inference-time optimization scheme for diffusion-based super-resolution that improves image quality without training. Abstract: Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.

[229] AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

Binhe Yu,Zhen Wang,Kexin Li,Yuqian Yuan,Wenqiao Zhang,Long Chen,Juncheng Li,Jun Xiao,Yueting Zhuang

Main category: cs.CV

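TL;DR: This paper proposes AnyMS, a training-free framework for layout-guided multi-subject customization that uses a dual-level attention decoupling mechanism to achieve text alignment, subject identity preservation, and layout control.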

Details Motivation: Existing methods struggle to balance text alignment, subject identity preservation, and layout control, and their reliance on additional training limits scalability and efficiency. Method: AnyMS introduces a bottom-up dual-level attention decoupling mechanism. Global decoupling separates cross-attention between textual and visual conditions to ensure text alignment; local decoupling confines each subject's attention to its designated region, which avoids conflicts and secures identity preservation and layout control. Pre-trained image adapters extract subject features aligned with the diffusion model, so no fine-tuning is required. Result: Experiments show that AnyMS achieves state-of-the-art performance on complex compositions and larger numbers of subjects, supporting high-quality multi-subject image generation. Conclusion: AnyMS is an efficient, scalable, training-free framework that balances text alignment, identity preservation, and layout control in multi-subject customization. Abstract: Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as missing subjects or subject conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
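The local decoupling step, which confines each subject's attention to its layout region, can be pictured with a toy mask. This is a generic region-masking sketch under stated assumptions, not the AnyMS implementation: the box format `(x0, y0, x1, y1)` with exclusive upper bounds, the grid size, and the helper names `layout_masks` and `mask_attention` are all hypothetical.

```python
def layout_masks(height, width, boxes):
    """Build one height x width 0/1 mask per subject box (x0, y0, x1, y1),
    where x1 and y1 are exclusive."""
    masks = []
    for x0, y0, x1, y1 in boxes:
        mask = [
            [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)
        ]
        masks.append(mask)
    return masks


def mask_attention(scores, mask, neg=float("-inf")):
    """Suppress attention scores at spatial positions outside the subject's region,
    so a later softmax assigns them zero weight."""
    return [
        [s if m else neg for s, m in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]
```

For two subjects placed side by side on a 2 x 4 grid, `layout_masks(2, 4, [(0, 0, 2, 2), (2, 0, 4, 2)])` yields masks that cover the left and right halves, so neither subject's features can attend into the other's region.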

[230] PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis

Shengyi Hua,Jianfeng Wu,Tianle Shen,Kangzhe Hu,Zhongzhen Huang,Shujuan Ni,Zhihong Zhang,Yuan Li,Zhe Wang,Xiaofan Zhang

Main category: cs.CV

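TL;DR: PathFound is an agentic multimodal model for pathological diagnosis that actively acquires evidence and refines its diagnosis through reasoning, improving accuracy.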

Details Motivation: Existing pathological foundation models use a static inference paradigm and cannot actively gather more evidence when a diagnosis is ambiguous, whereas clinical diagnosis usually involves repeated observation and additional examinations. Method: PathFound combines pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning in an evidence-seeking inference framework with three stages: initial diagnosis, evidence seeking, and final decision. Result: Across several large multimodal models, the strategy consistently improves diagnostic accuracy. PathFound achieves state-of-the-art performance across diverse clinical scenarios and can discover subtle details such as nuclear features and local invasions. Conclusion: The evidence-seeking inference paradigm is closer to real clinical workflows and improves diagnostic accuracy and reliability in computational pathology. Abstract: Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.

[231] PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation

Zongsheng Cao,Yangfan He,Anran Liu,Jun Xie,Feng Chen,Zepeng Wang

Main category: cs.CV

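TL;DR: This paper proposes PurifyGen, a training-free safety method for text-to-image generation that uses a dual-stage prompt purification strategy to reduce harmful content while preserving the model's original weights and generation quality.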

Details Motivation: Existing safety methods such as text blacklists or content classifiers are easy to bypass and depend on large datasets and retraining, so they struggle to prevent diffusion models from generating unsafe content. Method: PurifyGen applies dual-stage prompt purification. First, it performs fine-grained risk assessment by computing the complementary semantic distance between prompt tokens and the embeddings of predefined toxic and clean concepts. Second, for high-risk prompts it applies a dual-space transformation that projects them into the null space of the toxic concepts and aligns them to the range space of the clean concepts, removing harmful semantics while reinforcing safe ones; token-level replacement minimizes the effect on safe content. Result: Extensive tests on five datasets show that PurifyGen outperforms existing methods at reducing unsafe content and is competitive with training-dependent methods, with good generalization and plug-and-play use. Conclusion: PurifyGen is an effective, training-free, theoretically grounded method for safe T2I generation that substantially improves safety while preserving the original intent. Abstract: Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content.
PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code is available at https://github.com/AI-Researcher-Team/PurifyGen.
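The two projections named in the abstract can be sketched with basic linear algebra. If the rows of a matrix are the toxic concept embeddings, its null space is the orthogonal complement of their span, so projecting into it means subtracting the component that lies in that span. The sketch below is an illustration under that reading, not the PurifyGen code: all function names are hypothetical, and the abstract does not specify how the two projections are combined, so they are shown as separate steps.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_span(x, vectors):
    """Projection of x onto the range space (span) of `vectors`."""
    projection = [0.0] * len(x)
    for b in orthonormal_basis(vectors):
        c = dot(x, b)
        projection = [p + c * bi for p, bi in zip(projection, b)]
    return projection


def project_onto_null_space(x, vectors):
    """Projection of x onto the null space of the matrix whose rows are `vectors`."""
    in_span = project_onto_span(x, vectors)
    return [xi - pi for xi, pi in zip(x, in_span)]


if __name__ == "__main__":
    toxic = [[1.0, 0.0, 0.0]]   # toy toxic concept embedding
    clean = [[0.0, 1.0, 0.0]]   # toy clean concept embedding
    token = [3.0, 4.0, 5.0]
    purified = project_onto_null_space(token, toxic)
    print(purified, project_onto_span(purified, clean))
```

After the null-space step, the result has zero inner product with every toxic vector, which is the sense in which the harmful component is removed.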

[232] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Hanzheng Li,Xi Fang,Yixuan Li,Chaozheng Huang,Junjie Wang,Xi Wang,Hongzhe Bai,Bojun Hao,Shenyu Lin,Huiqi Liang,Linfeng Zhang,Guolin Ke

Main category: cs.CV

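TL;DR: This paper presents RxnBench, a benchmark for evaluating how well multimodal large language models understand chemical reactions in scientific literature, and exposes current models' weaknesses in chemical logic and structure recognition.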

Details Motivation: To probe how well multimodal large language models understand complex reaction schemes in real chemistry literature, and to advance the use of AI in chemical discovery. Method: Builds the RxnBench benchmark with two tasks, Single-Figure QA (SF-QA) and Full-Document QA (FD-QA), and evaluates MLLMs on data drawn from 305 reaction schemes and 108 articles. Result: Current MLLMs extract explicit text well but show clear deficits in deep chemical logic and precise structure recognition; even models with inference-time reasoning stay below 50% accuracy on FD-QA. Conclusion: Domain-specific visual encoders and stronger reasoning engines are needed to improve autonomous AI understanding in chemistry. Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.

[233] ThinkGen: Generalized Thinking for Visual Generation

Siyu Jiao,Yiheng Lin,Yujie Zhong,Qi She,Wei Zhou,Xiaohan Lan,Zilong Huang,Fei Yu,Yingchen Yu,Yunqing Zhao,Yao Zhao,Yunchao Wei

Main category: cs.CV

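TL;DR: This paper proposes ThinkGen, described as the first general visual generation framework driven by Chain-of-Thought (CoT) reasoning. It pairs a multimodal large language model with a Diffusion Transformer (DiT) and uses a decoupled architecture and a separable reinforcement learning paradigm (SepGRPO) to generate high-quality images across scenarios.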

Details Motivation: Existing applications of CoT reasoning to generation tasks are tied to scenario-specific mechanisms, so they lack generality and scalability and adapt poorly to diverse generation needs. Method: Proposes the ThinkGen framework with a decoupled design: a pretrained MLLM generates instructions from user intent, and a DiT generates images from those instructions. Also proposes the SepGRPO training method, which alternates reinforcement learning between the MLLM and the DiT and supports joint training across multiple datasets. Result: Experiments show that ThinkGen achieves robust, leading performance on multiple generation benchmarks, supporting the effectiveness and generalization of CoT reasoning in visual generation. Conclusion: ThinkGen brings CoT reasoning into general visual generation, demonstrates the potential of think-driven generation, and offers a new training paradigm for multimodal generative models. Abstract: Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available at https://github.com/jiaosiyuu/ThinkGen
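For readers unfamiliar with GRPO, the core step is a group-relative advantage: several outputs are sampled for the same prompt, and each reward is normalized by the group's mean and standard deviation. The sketch below shows that step plus a toy alternation schedule. The schedule, the module labels, and the choice of population standard deviation are assumptions for illustration; the abstract does not describe SepGRPO at this level of detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def alternating_schedule(num_rounds):
    """Hypothetical schedule: which module is updated in each round."""
    return ["mllm" if i % 2 == 0 else "dit" for i in range(num_rounds)]


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 2.0, 3.0]))
    print(alternating_schedule(4))
```

Outputs that score above their group's mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.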

[234] Image Denoising Using Global and Local Circulant Representation

Zhaoming Kong,Xiaowei Yang,Jiahuan Zhang

Main category: cs.CV

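TL;DR: This paper proposes Haar-tSVD, an efficient image denoising method based on the Haar transform and tensor singular value decomposition. By establishing a theoretical link between PCA and the Haar transform under circulant representation, it models global and local patch correlations. It is fast, needs no learned bases, and parallelizes well, and it is further strengthened by adaptive noise estimation and integration with deep networks.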

Details Motivation: To meet the growing demand for efficient denoising created by ever-increasing volumes of image data, and to overcome the speed versus performance trade-off of traditional methods. Method: Proposes the Haar-tSVD algorithm, which uses the theoretical connection between PCA and the Haar transform under circulant representation and combines a unified tensor SVD projection with the Haar transform to capture global and local patch correlations. It adds an adaptive noise estimation scheme and integrates deep neural networks to improve results under strong noise. Result: Experiments on several denoising datasets confirm the method's effectiveness and efficiency; it strikes a better balance between speed and denoising performance than existing methods, and the deep network integration gives further gains under severe noise. Conclusion: Haar-tSVD is a computationally simple, training-free, parallelizable plug-and-play denoiser that achieves efficient and robust image denoising through its theoretical and structural design, and it has practical value. Abstract: The proliferation of imaging devices and the countless images generated every day impose an increasingly high demand on efficient and effective image denoising. In this paper, we establish a theoretical connection between principal component analysis (PCA) and the Haar transform under circulant representation, and present a computationally simple denoising algorithm. The proposed method, termed Haar-tSVD, exploits a unified tensor singular value decomposition (t-SVD) projection combined with the Haar transform to efficiently capture global and local patch correlations. Haar-tSVD operates as a one-step, parallelizable plug-and-play denoiser that eliminates the need for learning local bases, thereby striking a balance between denoising speed and performance. In addition, an adaptive noise estimation scheme is introduced to improve robustness according to eigenvalue analysis of the circulant structure. To further enhance the performance under severe noise conditions, we integrate deep neural networks with Haar-tSVD based on the established Haar-PCA relationship. Experimental results on various denoising datasets demonstrate the efficiency and effectiveness of the proposed method for noise removal. Our code is publicly available at https://github.com/ZhaomingKong/Haar-tSVD.
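As background for the Haar component, the sketch below applies a single-level orthonormal Haar transform to a 1D signal, zeroes small detail coefficients, and inverts the transform. This is plain Haar hard-thresholding, a much simpler technique than Haar-tSVD (there is no patch grouping, t-SVD projection, or noise estimation), and the function names and threshold value are illustrative assumptions.

```python
import math


def haar_forward(signal):
    """Single-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def haar_denoise(signal, threshold):
    """Zero detail coefficients whose magnitude is at most `threshold`."""
    approx, detail = haar_forward(signal)
    detail = [d if abs(d) > threshold else 0.0 for d in detail]
    return haar_inverse(approx, detail)


if __name__ == "__main__":
    noisy = [4.0, 4.2, 1.0, 0.9]
    print(haar_denoise(noisy, threshold=0.5))
```

Because the transform uses fixed basis vectors, nothing has to be learned from data, which is the property the abstract highlights when it says the method avoids learning local bases.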

[235] ProGuard: Towards Proactive Multimodal Safeguard

Shaohan Yu,Lijun Li,Chenyang Si,Lu Sheng,Jing Shao

Main category: cs.CV

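TL;DR: This paper proposes ProGuard, a proactive vision-language guard model trained with reinforcement learning that identifies and describes out-of-distribution (OOD) safety risks without model adjustments, markedly improving safety moderation of multimodal content.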

Details Motivation: Existing defenses cannot keep up with the new multimodal safety risks created by fast-evolving generative models; they also suffer from modality bias and cannot proactively identify unknown risks. Method: Constructs a modality-balanced dataset of 87K samples annotated under a hierarchical multimodal safety taxonomy and trains a vision-language base model on it with reinforcement learning. Introduces an OOD safety category inference task and adds a synonym-bank-based similarity reward to strengthen the model's ability to describe unseen unsafe categories. Result: ProGuard matches closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization; it improves OOD risk detection by 52.6% and OOD risk description by 64.8%. Conclusion: ProGuard has strong proactive moderation ability, identifying and concisely describing unknown safety risks, and offers an efficient, scalable solution for multimodal content safety. Abstract: The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.

[236] LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Ethan Chern,Zhulin Hu,Bohao Tang,Jiadi Su,Steffi Chern,Zhijie Deng,Pengfei Liu

Main category: cs.CV

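TL;DR: This paper presents a real-time interactive video diffusion model for multimodal conditioning. An improved distillation recipe cuts inference latency by 20x while preserving visual quality, and the resulting real-time interactive system, LiveTalk, markedly improves video coherence and response speed in multi-turn conversations.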

Details Motivation: Existing video diffusion models rely on bidirectional attention and iterative denoising, which prevents real-time interaction; existing distillation methods focus mainly on text-to-video generation and do not support natural human-AI interaction under multimodal (for example audio and image) conditioning. Method: Proposes an improved distillation strategy that emphasizes the quality of condition inputs, the initialization of the policy, and the optimization schedule, making the model autoregressive and reducing sampling steps while supporting text, image, and audio conditions. Result: On multimodal avatar video generation benchmarks including HDTF, AVSpeech, and CelebV-HQ, the distilled model matches or exceeds the visual quality of full-step bidirectional models at only 1/20 of the inference cost. Integrated into the LiveTalk system, it outperforms Sora2 and Veo3 on multi-turn interaction tasks and moves from minute-level latency to real-time response. Conclusion: The method addresses real-time generation for video diffusion models under multimodal conditioning and offers a practical path toward natural, efficient multimodal human-AI interaction. Abstract: Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system.
System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.

[237] Same or Not? Enhancing Visual Perception in Vision-Language Models

Damiano Marsili,Aditya Mehta,Ryan Y. Lin,Georgia Gkioxari

Main category: cs.CV

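TL;DR: TWIN is a large-scale dataset of image-pair queries for improving the fine-grained perception of vision-language models (VLMs). Training on it teaches models to decide whether two visually similar images show the same object and markedly improves their performance on unseen domains.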

Details Motivation: Existing VLMs lean toward coarse-grained recognition and pay little attention to subtle visual differences, and their training data concentrates on general recognition tasks that do not support fine-grained perception. Method: Builds the TWIN dataset of 561,000 image-pair queries that ask the model whether two similar images show the same object, fine-tunes VLMs on it, and introduces FGVQA as a new benchmark for fine-grained performance. Result: On FGVQA, models fine-tuned on TWIN improve by up to 19.3% while keeping their performance on general VQA tasks; they also show stronger fine-grained recognition on unseen domains such as art, animals, plants, and landmarks. Conclusion: TWIN effectively improves the perceptual precision of VLMs, scales well and generalizes, and can serve as a drop-in component of open-source VLM training to advance fine-grained visual understanding in future models. Abstract: Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/

[238] Detection Fire in Camera RGB-NIR

Nguyen Truong Khai,Luong Duc Vinh

Main category: cs.CV

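TL;DR: This paper proposes three improvements for night-time fire detection: a new near-infrared dataset, a two-stage detection model, and Patched-YOLO, which together raise detection accuracy and reduce false detections caused by artificial light sources.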

Details Motivation: Existing fire detection models used with infrared cameras at night tend to mistake artificial lights for fire, and limitations in dataset construction hold performance back. Method: 1) applies data augmentation strategies to expand the NIR dataset; 2) proposes a two-stage detection pipeline combining YOLOv11 and EfficientNetV2-B0; 3) designs Patched-YOLO to improve detection of small objects in RGB images. Result: The proposed methods achieve higher accuracy than previous models for night-time fire detection and markedly reduce false alarms caused by artificial lights; Patched-YOLO improves detection of small and distant flames. Conclusion: Through data expansion, a two-stage detection architecture, and patch-based processing, the study improves the robustness and accuracy of fire detection in complex environments. Abstract: Improving the accuracy of fire detection using infrared night vision cameras remains a challenging task. Previous studies have reported strong performance with popular detection models. For example, YOLOv7 achieved an mAP50-95 of 0.51 using an input image size of 640 x 1280, RT-DETR reached an mAP50-95 of 0.65 with an image size of 640 x 640, and YOLOv9 obtained an mAP50-95 of 0.598 at the same resolution. Despite these results, limitations in dataset construction continue to cause issues, particularly the frequent misclassification of bright artificial lights as fire. This report presents three main contributions: an additional NIR dataset, a two-stage detection model, and Patched-YOLO. First, to address data scarcity, we explore and apply various data augmentation strategies for both the NIR dataset and the classification dataset. Second, to improve night-time fire detection accuracy while reducing false positives caused by artificial lights, we propose a two-stage pipeline combining YOLOv11 and EfficientNetV2-B0. The proposed approach achieves higher detection accuracy compared to previous methods, particularly for night-time fire detection. Third, to improve fire detection in RGB images, especially for small and distant objects, we introduce Patched-YOLO, which enhances the model's detection capability through patch-based processing. Further details of these contributions are discussed in the following sections.
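The abstract describes Patched-YOLO only as patch-based processing. One common way to realize that idea is to tile the image into overlapping patches, run a detector on each patch, and shift the resulting boxes back into full-image coordinates. The sketch below covers the tiling and coordinate mapping only. The patch size, stride, and function names are assumptions, and no detector API is called.

```python
def tile_origins(length, patch, stride):
    """Start offsets so that windows of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # final window flush with the border
    return origins


def patch_grid(width, height, patch, stride):
    """Top-left corners (x, y) of all patches for a width x height image."""
    return [(x, y)
            for y in tile_origins(height, patch, stride)
            for x in tile_origins(width, patch, stride)]


def to_global(box, origin):
    """Shift a patch-local box (x0, y0, x1, y1) into image coordinates."""
    ox, oy = origin
    x0, y0, x1, y1 = box
    return (x0 + ox, y0 + oy, x1 + ox, y1 + oy)


if __name__ == "__main__":
    print(patch_grid(1280, 640, patch=640, stride=480))
    print(to_global((10, 20, 30, 40), (640, 0)))
```

Overlapping windows mean an object on a patch border appears whole in at least one neighbouring patch, at the cost of duplicate detections that a full pipeline would merge, for example with non-maximum suppression.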

[239] Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging

Janani Annur Thiruvengadam,Kiran Mayee Nabigaru,Anusha Kovi

Main category: cs.CV

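TL;DR: Proposes a Scalable Residual Feature Aggregation (SRFA) framework for early detection of pancreatic tumors in CT images. It combines MAGRes-UNet segmentation, DenseNet-121 feature extraction, HHO-BA feature selection, and a hybrid ViT and EfficientNet-B3 classifier, tuned with SSA-GWO dual optimization, and outperforms traditional methods in accuracy, F1-score, and specificity.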

Details Motivation: Pancreatic tumors show low contrast and large anatomical variation in CT images, which makes early detection difficult. Existing methods struggle to capture subtle visual cues and to generalize across multimodal data, so an efficient, scalable automated detection system is needed. Method: Proposes the SRFA framework: preprocessing enhances image quality and MAGRes-UNet segments pancreatic structures; DenseNet-121 extracts deep residual features; the hybrid HHO-BA metaheuristic selects features; a hybrid classifier combining the Vision Transformer and EfficientNet-B3 makes the prediction, with hyperparameters tuned by an SSA-GWO dual optimization strategy. Result: The model reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity in experiments, clearly outperforming traditional CNNs and existing transformer models and showing stronger robustness and generalization. Conclusion: The SRFA framework improves early detection of pancreatic tumors, has potential as a clinical decision aid, and offers a high-accuracy, scalable approach to complex medical image analysis. Abstract: Early detection of pancreatic neoplasms is a major clinical challenge, mainly because tumors on CT scans tend to show minimal contrast at their margins and anatomy varies widely among patients. Addressing these difficulties requires an effective and scalable system that enhances the salience of subtle visual cues and generalizes well across multimodal imaging data. This study proposes a Scalable Residual Feature Aggregation (SRFA) framework to meet these requirements. The framework combines a preprocessing pipeline with segmentation by MAGRes-UNet, which makes pancreatic structures more visible and isolates regions of interest. DenseNet-121 with residual feature storage extracts features, so that deep hierarchical features can be aggregated without loss of information. A hybrid HHO-BA metaheuristic feature selection strategy then refines the feature subset. For classification, the system uses a new hybrid model that combines the global attention of the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO fine-tunes the hyperparameters to improve robustness and reduce overfitting.
Experimental results show a significant performance improvement: the proposed model reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity, significantly outperforming traditional CNNs and contemporary transformer-based models. These results highlight the potential of the SRFA framework as a useful instrument for the early detection of pancreatic tumors.
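The three reported metrics follow from the entries of a binary confusion matrix. The sketch below states the standard definitions. The counts in the example are invented for illustration and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, F1-score, and specificity from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall
    specificity = tn / (tn + fp)       # true negative rate
    return accuracy, f1, specificity


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    acc, f1, spec = classification_metrics(tp=40, fp=10, tn=45, fn=5)
    print(f"accuracy={acc:.4f} f1={f1:.4f} specificity={spec:.4f}")
```

Specificity measures how often tumor-free cases are correctly cleared, so it complements the F1-score, which does not use the true negative count at all.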

[240] Memorization in 3D Shape Generation: An Empirical Study

Shu Pu,Boya Zeng,Kaichen Zhou,Mengyu Wang,Zhuang Liu

Main category: cs.CV

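TL;DR: This paper proposes a framework for evaluating memorization in 3D generative models and analyzes experimentally how data and modeling choices affect it, finding that suitable data augmentation and model settings reduce memorization without hurting generation quality.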

Details Motivation: Understanding whether 3D generative models rely on memorizing training data helps prevent data leakage and improve the diversity of generated results. Method: Designs an evaluation framework that quantifies memorization and runs controlled experiments with a latent vector-set (Vecset) diffusion model to study factors such as data modality, conditioning granularity, and guidance scale. Result: Memorization depends on data modality and diversity, rises with finer-grained conditioning, peaks at a moderate guidance scale, and can be mitigated by longer Vecsets and simple rotation augmentation. Conclusion: The study gives an empirical understanding of memorization in 3D generative models and proposes simple, effective strategies to reduce it while preserving generation quality. Abstract: Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at https://github.com/zlab-princeton/3d_mem.

[241] Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception

Xiaoyu Li,Peidong Li,Xian Wu,Long Shi,Dedong Liu,Yitao Wu,Jiajia Fu,Dixiao Cui,Lijun Zhao,Lining Sun

Main category: cs.CV

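TL;DR: This paper proposes HAT, a spatio-temporal alignment module that adaptively decodes the optimal alignment proposal from multiple motion-model hypotheses, improving the robustness and performance of end-to-end perception in autonomous driving.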

Details Motivation: Existing methods align objects across frames with a single unified explicit physical motion model plus semantic features, which copes poorly with motion states and object features that vary across categories and frames and therefore yields suboptimal alignment. Method: HAT uses multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances, then performs multi-hypothesis decoding with the semantic and motion cues in cached object queries to obtain the optimal alignment proposal for the target frame. Result: On nuScenes, HAT improves a range of 3D temporal detectors and trackers and reaches a state-of-the-art 46.0% AMOTA when paired with DETR3D; in an end-to-end autonomous driving method it raises mAP by 1.3% and AMOTA by 3.1% and lowers the collision rate by 32%; it also remains more robust when semantics are corrupted (nuScenes-C). Conclusion: By introducing multi-hypothesis motion modeling and adaptive decoding, HAT improves the accuracy and robustness of spatio-temporal alignment and supports the value of explicit motion modeling in end-to-end autonomous driving perception. Abstract: Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on the attention mechanism to align objects across frames, simplifying the motion model with a unified explicit physical model (constant velocity, etc.). These approaches prefer semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%.
When semantics are corrupted (nuScenes-C), the enhancement of motion modeling by HAT enables more robust perception and planning in the E2E AD.
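To make the idea of multiple explicit motion hypotheses concrete, the sketch below propagates one historical object position to the target frame under three standard kinematic models. Only constant velocity is named in the abstract. The static and constant-acceleration models, the 2D state, and the function name are assumptions, and the learned decoding step that picks among hypotheses is not shown.

```python
def motion_hypotheses(position, velocity, acceleration, dt):
    """Candidate positions for the target frame under three kinematic models."""
    px, py = position
    vx, vy = velocity
    ax, ay = acceleration
    return {
        # Object assumed not to move between frames.
        "static": (px, py),
        # Constant velocity: p + v * dt.
        "constant_velocity": (px + vx * dt, py + vy * dt),
        # Constant acceleration: p + v * dt + 0.5 * a * dt^2.
        "constant_acceleration": (px + vx * dt + 0.5 * ax * dt ** 2,
                                  py + vy * dt + 0.5 * ay * dt ** 2),
    }


if __name__ == "__main__":
    anchors = motion_hypotheses((10.0, 5.0), (2.0, -1.0), (1.0, 0.0), dt=0.5)
    for name, anchor in anchors.items():
        print(name, anchor)
```

Each hypothesis yields a different spatial anchor for the same object, which is what gives a downstream selection step something to choose between when, say, a parked car and an accelerating one need different models.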

[242] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Keda Tao,Wenjie Du,Bohan Yu,Weiqiang Wang,Jian Liu,Huan Wang

Main category: cs.CV

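TL;DR: OmniAgent is a fully audio-guided active perception agent that uses dynamic planning and a coarse-to-fine audio-guided perception paradigm for fine-grained audio-visual reasoning, substantially outperforming existing models on several benchmarks.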

Details Motivation: Existing multimodal large models fall short in cross-modal understanding and multimodal alignment and lack fine-grained joint audio-visual analysis. Method: Proposes OmniAgent, which uses dynamic planning to schedule specialized tools autonomously and uses audio cues to localize temporal events and guide subsequent reasoning, shifting from passive response to active multimodal inquiry. Result: Experiments on three audio-video understanding benchmarks show that OmniAgent reaches state-of-the-art performance, exceeding leading open-source and proprietary models by 10% to 20% in accuracy. Conclusion: Through audio-guided active perception, OmniAgent markedly improves fine-grained understanding and cross-modal alignment in multimodal models. Abstract: Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% to 20% in accuracy.

[243] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition

Kang Du,Yirui Guan,Zeyu Wang

Main category: cs.CV

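TL;DR: Proposes IDT, a transformer-based framework for multi-view intrinsic image decomposition that uses a physically grounded decomposition model to separate diffuse and specular components consistently across views.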

Details Motivation: Existing single-view diffusion methods are hard to extend to multi-view settings and produce severe view inconsistency. Method: Proposes the Intrinsic Decomposition Transformer (IDT), which uses transformer attention to reason jointly over multiple input views and, based on a physical image formation model, decomposes images into diffuse reflectance, diffuse shading, and specular shading. Result: Experiments on synthetic and real datasets show that IDT produces cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, and substantially improves multi-view consistency. Conclusion: IDT achieves efficient, controllable, view-consistent multi-view intrinsic image decomposition and outperforms existing methods. Abstract: Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose Intrinsic Decomposition Transformer (IDT), a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
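The abstract names the three factors but not the formula that ties them together. A common formation model in the intrinsic decomposition literature multiplies reflectance by diffuse shading and adds a specular term per channel. The sketch below assumes that model purely for illustration, and the function name is hypothetical.

```python
def recompose_pixel(reflectance, diffuse_shading, specular_shading):
    """Recombine intrinsic factors into an RGB value.

    Assumed model (not stated in the abstract): I = R * S_d + S_s per channel.
    """
    return tuple(r * d + s
                 for r, d, s in zip(reflectance, diffuse_shading, specular_shading))


if __name__ == "__main__":
    pixel = recompose_pixel(reflectance=(0.5, 0.4, 0.2),
                            diffuse_shading=(1.0, 1.0, 1.0),
                            specular_shading=(0.1, 0.1, 0.1))
    print(pixel)
```

Under this model the reflectance term is a view-independent material property, while the specular term changes with viewpoint, which is why separating the two matters for consistency across views.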

[244] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Shaocong Xu,Songlin Wei,Qizhe Wei,Zheng Geng,Hong Li,Licheng Shen,Qianpu Sun,Shu Han,Bin Ma,Bohan Li,Chongjie Ye,Yuhang Zheng,Nan Wang,Saining Zhang,Hao Zhao

Main category: cs.CV

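TL;DR: This paper proposes DKT, a new method that repurposes a video diffusion model for depth estimation of transparent objects. By building the large-scale synthetic dataset TransPhy3D and fine-tuning with LoRA, it achieves zero-shot SOTA performance on transparent and reflective scenes and markedly improves the accuracy and temporal consistency of depth estimates.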

Details Motivation: Refraction, reflection, and transmission cause traditional depth sensing methods (stereo, ToF, monocular depth) to fail on transparent objects, producing holes and temporally unstable results, so a perception method that handles transparent objects robustly is needed. Method: Builds TransPhy3D, a synthetic video dataset of 11k sequences rendered physically with Blender/Cycles that provides RGB, depth, and normals. Starting from a large video diffusion model, lightweight LoRA adapters learn to map RGB to depth and normals, with joint training on TransPhy3D and existing frame-wise synthetic data, giving temporally consistent predictions for videos of arbitrary length. Result: DKT achieves zero-shot SOTA on real and synthetic video benchmarks including ClearPose, DREDS, and TransPhy3D-Test, beating strong image and video baselines; its normal-estimation variant also gives the best results on ClearPose. A compact 1.3B model runs at about 0.17 s/frame and markedly raises grasp success rates on translucent, reflective, and diffuse surfaces. Conclusion: Diffusion models have implicitly learned the optical behavior of transparent objects and can be turned, through transfer learning, into robust and temporally coherent perception systems. This supports the core claim that "Diffusion knows transparency" and opens a label-free path to perception in challenging scenes. Abstract: Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose.
A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.

[245] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Hau-Shiang Shiu,Chin-Yang Lin,Zhixiang Wang,Chi-Wei Hsiao,Po-Fan Yu,Yu-Chih Chen,Yu-Lun Liu

Main category: cs.CV

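TL;DR: This paper proposes Stream-DiffVSR, a causally conditioned diffusion framework for efficient online video super-resolution (VSR). It greatly reduces latency while maintaining high perceptual quality, and is presented as the first diffusion-based VSR method suitable for low-latency online deployment.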

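Details Motivation: Existing diffusion-based video super-resolution methods depend on future frames and multi-step denoising, which makes their latency too high for real-time use, so a causal online VSR method that infers quickly from past frames alone is needed. Method: Proposes Stream-DiffVSR, which uses a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned information during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that improves detail and temporal coherence, all while relying only on past frames for causal inference. Result: It processes a 720p frame in 0.328 seconds on an RTX4090 GPU and cuts initial delay from over 4600 seconds to 0.328 seconds. Compared with the online SOTA method TMP, it reduces latency by more than 130x while improving perceptual quality (LPIPS +0.095), and it significantly outperforms earlier diffusion-based methods. Conclusion: Stream-DiffVSR has the lowest latency reported for diffusion-based VSR and is the first diffusion-based video super-resolution method suited to low-latency online applications, bringing diffusion models closer to use in real-time vision systems. Abstract: Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/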