cs.CL

[1] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

Pu Zhao,Xuan Shen,Zhenglun Kong,Yixin Shen,Sung-En Chang,Arash Akbari,Timothy Rupprecht,Lei Lu,Enfu Nan,Changdi Yang,Yumei He,Weiyan Shi,Xingchen Xu,Yu Huang,Wei Jiang,Wei Wang,Yue Chen,Yong He,Yanzhi Wang

Main category: cs.CL

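TL;DR: Moxin 7B is a fully open-source large language model that follows the Model Openness Framework and emphasizes transparency in training, data, and implementation details. Several variants with multimodal and Chinese capabilities are released, and the models perform well across multiple evaluations.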

Details

Motivation: To push large language models toward genuine openness, going beyond weight-only releases, and to foster a reproducible, customizable, and collaborative open-source ecosystem. Method: Moxin 7B and its variants (Moxin-VLM, Moxin-VLA, Moxin-Chinese) are trained with open-source frameworks and open data, covering vision-language, vision-language-action, and Chinese tasks. Result: The models perform strongly across a range of evaluations, and all models, data, and code are publicly released. Conclusion: Through comprehensive openness, the Moxin model family supports a healthy and sustainable open-source LLM ecosystem. Abstract: Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have contributed greatly to the ever-increasing popularity of LLMs because they are easy to customize and deploy across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation details, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with capabilities for different tasks, we develop three variants based on Moxin, namely Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt an open-source framework and open data for training. We release our models, along with the available data and code to derive these models.

[2] Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces

Sophie Zhao

Main category: cs.CL

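TL;DR: The study shows that the sentence-embedding spaces of transformer-based language models contain hierarchical geometric structure aligned with human-interpretable cognitive attributes. Linear and nonlinear probes decode cognitive tiers and energy scores effectively and outperform a lexical baseline, revealing a smooth gradient organization in embedding space.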

Details

Motivation: To explore whether the embedding spaces of transformer language models contain high-level, graded structure aligned with human cognitive or psychological attributes, rather than only low-level linguistic features. Method: A dataset of 480 sentences is annotated with continuous energy scores and discrete tier labels over seven ordered cognitive categories. Fixed sentence embeddings from multiple transformer models are used to predict these labels with linear and shallow nonlinear probes, compared against a TF-IDF baseline, with UMAP visualizations and confusion matrices for qualitative analysis. Result: Both linear and nonlinear probes reliably decode the tier labels and energy scores, with nonlinear probes performing better; the TF-IDF baseline performs substantially worse; permutation tests show results significantly above chance; visualizations show smooth low-to-high gradients in embedding space and confusions mostly between adjacent tiers. Conclusion: Transformer embedding spaces exhibit a hierarchical geometric organization aligned with human-defined cognitive attributes, indicating that they encode high-level cognitive information; the finding makes no claim about whether models have awareness or phenomenal experience. Abstract: Recent work has shown that transformer-based language models learn rich geometric structure in their embedding spaces, yet the presence of higher-level cognitive organization within these representations remains underexplored. In this work, we investigate whether sentence embeddings encode a graded, hierarchical structure aligned with human-interpretable cognitive or psychological attributes. We construct a dataset of 480 natural-language sentences annotated with continuous ordinal energy scores and discrete tier labels spanning seven ordered cognitive categories. Using fixed sentence embeddings from multiple transformer models, we evaluate the recoverability of these annotations via linear and shallow nonlinear probes. Across models, both continuous scores and tier labels are reliably decodable, with shallow nonlinear probes providing consistent performance gains over linear probes. Lexical TF-IDF baselines perform substantially worse, indicating that the observed structure is not attributable to surface word statistics alone. Nonparametric permutation tests further confirm that probe performance exceeds chance under label-randomization nulls. Qualitative analyses using UMAP visualizations and confusion matrices reveal smooth low-to-high gradients and predominantly adjacent-tier confusions in embedding space. Taken together, these results provide evidence that transformer embedding spaces exhibit a hierarchical geometric organization aligned with human-defined cognitive attributes, while remaining agnostic to claims of internal awareness or phenomenology.

[3] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

Shaofei Cai,Yulei Qin,Haojia Lin,Zihan Xu,Gang Li,Yuchen Shi,Zongyi Li,Yong Mao,Siqi Cai,Xiaoyu Tan,Yitao Liang,Ke Li,Xing Sun

Main category: cs.CL

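TL;DR: Proposes the SmartSnap paradigm, in which agents perform proactive, in-situ self-verification on complex GUI tasks, using a concise set of snapshot evidence to make task verification more scalable and reliable.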

Details

Motivation: Existing task verification relies on post-hoc analysis of the agent's full interaction trajectory; processing long, noisy histories is costly and unreliable, which limits the scalability of RL-based autonomous agents. Method: A self-verifying agent with a dual mission is designed. Guided by the 3C Principles (Completeness, Conciseness, Creativity), it proactively selects a minimal, decisive set of snapshots as evidence during task execution, and a general LLM-as-a-Judge assesses validity based on these snapshots alone. Result: Experiments on mobile tasks show performance gains of up to 26.08% and 16.66% for LLM-driven agents at the 8B and 30B scales, respectively, with results competitive with large models such as DeepSeek V3.1 and Qwen3-235B-A22B. Conclusion: By shifting verification from passive post-hoc analysis to proactive self-proof, SmartSnap substantially improves the scalability of agent training and the reliability of verification on GUI tasks, offering a new path toward efficient autonomous agents. Abstract: Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, and LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.

[4] The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach

Zubaida Mohammed Albadani,Mohammed Q. Shormani

Main category: cs.CL

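TL;DR: Working within the Minimalist Program, this paper studies the syntax of qulk-clauses in Yemeni Ibbi Arabic. It proposes that qulk-clauses are biclausal structures in which qulk acts as a clause-embedding predicate selecting a full CP complement, derived through minimalist operations such as Merge, Move, Agree, and Spell-out.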

Details

Motivation: To examine the syntactic nature of qulk-clauses and how they are generated within the Minimalist Program, filling a gap in theoretical work on this construction in Arabic dialects. Method: The core minimalist operations (Merge, Move, Agree, Spell-out) and Morphological Merger are used to analyze the syntactic derivation of qulk-clauses and to account for their biclausal structure and dialect-specific phenomena. Result: qulk-clauses are shown to be biclausal, with qulk acting as a clause-embedding predicate that selects a full CP; the analysis accounts for the construction's syntactic behavior and for dialectal features such as bipartite negation, cliticization, and CP embedding. Conclusion: The study supports the applicability of the Minimalist Program to Arabic dialects, offers new evidence for generative syntactic theory, and suggests extending the analysis to the second-person kil-k 'you said' construction to further test the universality of minimalism. Abstract: This study investigates the syntax of qulk-clauses in Yemeni Ibbi Arabic (YIA) within the Minimalist Program. The construction qulk-clause, a morphologically fused form meaning 'I said,' introduces embedded declarative, interrogative, and imperative clauses, often without a complementizer. The central proposal of this paper is that qulk-clauses are biclausal structures in which qulk functions as a clause-embedding predicate selecting a full CP complement. By applying core minimalist operations, viz., Merge, Move, Agree, and Spell-out, the study provides a layered syntactic analysis of qulk-clauses, illustrating how their derivation proceeds through standard computational steps and post-syntactic processes such as Morphological Merger. The proposal also accounts for dialect-specific features like bipartite negation, cliticization, and CP embedding. The findings offer theoretical contributions to generative syntax, specifically minimalism. The study concludes by raising theoretical questions about extending the analysis to the addressee-clause kil-k 'you said'. It also provides insights into the possibility of the universality of minimalism.

[5] Towards Efficient Post-Training via Fourier-Driven Adapter Architectures

Donggyun Bae,Jongil Park

Main category: cs.CL

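TL;DR: Proposes FAA, a lightweight adapter framework based on random Fourier features for efficient fine-tuning of large models. It uses frequency-domain decomposition to enable frequency-aware modulation of semantic information and performs well on multiple benchmarks.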

Details

Motivation: More efficient parameter-efficient fine-tuning methods are needed to reduce the number of tuned parameters while preserving the performance of pre-trained language models. Method: Random Fourier features are incorporated into lightweight adapter modules, which decompose intermediate representations into low- and high-frequency components for frequency-aware semantic modulation. Result: On GLUE, E2E NLG, and instruction-tuning tasks, FAA matches or outperforms existing PEFT methods with low computational and memory overhead. Conclusion: FAA is an efficient and robust post-training fine-tuning method for large models, and frequency-domain modeling offers a new direction for parameter-efficient fine-tuning. Abstract: We propose a novel framework, termed Fourier-Activated Adapter (FAA), for parameter-efficient fine-tuning of large pre-trained language models. By incorporating random Fourier features into lightweight adapter modules, FAA decomposes intermediate representations into complementary low- and high-frequency components, enabling frequency-aware modulation of semantic information. This design allows the model to selectively emphasize informative frequency bands during adaptation while preserving the representational capacity of the frozen backbone. Extensive experiments on GLUE, E2E NLG, and instruction-tuning benchmarks demonstrate that FAA consistently achieves competitive or superior performance compared to existing parameter-efficient fine-tuning methods, while maintaining low computational and memory overhead. Ablation studies further verify the effectiveness of frequency-aware activation and adaptive weighting mechanisms, highlighting FAA as a robust and efficient approach for post-training large language models.

[6] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition

Elsen Ronando,Sozo Inoue

Main category: cs.CL

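TL;DR: Proposes an LLM-guided exemplar selection framework that addresses the reliance of human activity recognition on large labeled datasets and geometry-only exemplar selection, substantially improving performance under few-shot conditions.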

Details

Motivation: Existing HAR methods rely on large labeled datasets and purely geometric exemplar selection, struggle to separate similar sensor activities (such as walking, walking upstairs, and walking downstairs), and perform poorly in few-shot settings. Method: An LLM generates a semantic knowledge prior covering feature importance, inter-class confusability, and exemplar budget multipliers, which guides exemplar scoring and selection; this is combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to select a compact, informative exemplar set. Result: Under strict few-shot conditions on the UCI-HAR dataset, the framework reaches a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and k-center. Conclusion: Combining LLM-generated semantic priors with structural and geometric cues supports more effective selection of representative exemplars for few-shot wearable-sensor activity recognition. Abstract: In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar wearable sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and $k$-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
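
Of the components listed in the abstract, facility-location optimization is the easiest to illustrate in isolation. The following is a minimal sketch of the standard greedy algorithm, not the authors' implementation: the function name, the plain similarity matrix, and the absence of any LLM prior, PageRank, or hubness terms are all assumptions made for illustration. It assumes non-negative similarities.

```python
def greedy_facility_location(similarity, budget):
    """Greedily pick `budget` exemplars that maximize
    sum_i max_{j in selected} similarity[i][j].

    `similarity` is a square list-of-lists with non-negative entries
    (hypothetical input; the paper's actual scoring is richer).
    """
    n = len(similarity)
    selected = []
    best_cover = [0.0] * n  # best similarity of each point to any chosen exemplar
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding candidate j to the current set.
            gain = sum(max(similarity[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], similarity[i][best_j]) for i in range(n)]
    return selected


# Two tight clusters: points {0, 1} and points {2, 3}.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
chosen = greedy_facility_location(sim, budget=2)
```

With a budget of two, the greedy pass picks one exemplar from each cluster, which is the coverage behavior that makes facility location attractive for compact exemplar sets.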

[7] Hallucination Detection and Evaluation of Large Language Model

Chenggong Zhang,Haopeng Wang

Main category: cs.CL

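TL;DR: This paper integrates HHEM, a lightweight hallucination detection framework, which markedly improves the efficiency of hallucination evaluation for large language models, and uses segment-based retrieval and cumulative distribution analysis to characterize hallucination behavior across model sizes.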

Details

Motivation: Existing hallucination evaluation methods for large language models are computationally expensive, so a more efficient yet accurate detection approach is needed. Method: The Hughes Hallucination Evaluation Model (HHEM) is adopted as a classification framework that operates independently of LLM-based judgments; it is combined with non-fabrication checking and segment-based retrieval to improve detection, and LLMs of different sizes are compared systematically. Result: HHEM reduces evaluation time from 8 hours to 10 minutes and reaches 82.2% accuracy and a 78.9% true positive rate on QA tasks; it is weaker at detecting localized hallucinations in summaries, which segment-based retrieval improves; CDF analysis shows that 7B-9B parameter models hallucinate less, while intermediate-sized models are less stable. Conclusion: HHEM greatly improves efficiency while maintaining high detection accuracy and is suitable for large-scale use; structured evaluation frameworks are needed to balance efficiency with factual validation and improve the reliability of LLM output. Abstract: Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
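
For readers less familiar with the three reported metrics, here is a small sketch of how TPR, TNR, and accuracy follow from binary detection outcomes. The function name and the convention that `True` means "hallucinated" are assumptions for illustration; the abstract does not state which class the authors treat as positive.

```python
def detection_metrics(labels, predictions):
    """Compute TPR, TNR, and accuracy for a binary detector.

    labels / predictions: equal-length sequences of booleans.
    Assumed convention: True = hallucinated (the positive class).
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    return {
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,  # recall on the positive class
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,  # recall on the negative class
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


metrics = detection_metrics(
    labels=[True, True, False, False, True],
    predictions=[True, False, False, True, True],
)
```

Reporting TPR and TNR alongside accuracy matters when the classes are imbalanced, since a detector can reach high accuracy while missing most examples of the rarer class.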

[8] HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

Cattalyya Nuengsigkapian

Main category: cs.CL

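TL;DR: HiFi-RAG is a hierarchical-filtering retrieval-augmented generation framework that improves answer relevance and alignment with user intent in open-domain question answering through a multi-stage pipeline, using a lightweight model for filtering and re-ranking and a heavyweight model for answer generation, and clearly outperforming the baseline on several metrics.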

Details

Motivation: In open-domain settings, standard RAG retrieves irrelevant information and produces answers that drift from user intent, so a finer-grained filtering mechanism is needed to improve answer quality. Method: HiFi-RAG uses a multi-stage pipeline: Gemini 2.5 Flash handles query formulation, hierarchical content filtering, and citation attribution, and Gemini 2.5 Pro generates the final answer, balancing efficiency and performance. Result: On the MMU-RAGent validation set, ROUGE-L improves to 0.274 (+19.6%) and DeBERTaScore reaches 0.677 (+6.2%); on the Test2025 dataset, which requires post-cutoff knowledge, ROUGE-L and DeBERTaScore exceed the baseline by 57.4% and 14.9%, respectively. Conclusion: HiFi-RAG's hierarchical filtering improves the relevance of retrieved information, and its division of labor between models offers a favorable trade-off among cost, speed, and generation quality for open-domain RAG. Abstract: Retrieval-Augmented Generation (RAG) in open-domain settings faces significant challenges regarding irrelevant information in retrieved documents and the alignment of generated answers with user intent. We present HiFi-RAG (Hierarchical Filtering RAG), the winning closed-source system in the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Our approach moves beyond standard embedding-based retrieval via a multi-stage pipeline. We leverage the speed and cost-efficiency of Gemini 2.5 Flash (4-6x cheaper than Pro) for query formulation, hierarchical content filtering, and citation attribution, while reserving the reasoning capabilities of Gemini 2.5 Pro for final answer generation. On the MMU-RAGent validation set, our system outperformed the baseline, improving ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, our custom dataset evaluating questions that require post-cutoff knowledge (post January 2025), HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.
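
Since ROUGE-L is one of the two headline metrics, a short sketch of how it is computed may help. This is a simplified version under stated assumptions: whitespace tokenization, lowercasing, and a balanced F1 over the longest common subsequence (LCS). The competition's exact tokenization and F-measure weighting are not given in the abstract, so scores from this sketch would not necessarily match the reported 0.274.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced ROUGE-L F1 with whitespace tokenization (an assumed variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

In this example the LCS is "the cat on the mat" (five tokens out of six on each side), so precision and recall are both 5/6.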

[9] Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models

Jie Zhou,Xin Chen,Jie Zhang,Zhe Li

Main category: cs.CL

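TL;DR: This study introduces the concept of vertical-domain accounting reasoning, establishes evaluation criteria, and evaluates large models including the GLM series and GPT-4 on accounting reasoning tasks. Prompt engineering improves performance, but current models still fall short of enterprise-level requirements.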

Details

Motivation: To advance the effective integration of large language models into professional fields, and in particular to achieve reliable reasoning in accounting in support of enterprise digital transformation and broader social development. Method: By analyzing the training-data characteristics of GLM-series models, the study defines vertical-domain accounting reasoning and establishes evaluation criteria, then evaluates GLM-6B, GLM-130B, GLM-4, and GPT-4 on accounting reasoning tasks and compares performance under different prompt engineering strategies. Result: Different prompt engineering strategies improve each model's performance to varying degrees, and GPT-4 shows the strongest accounting reasoning capability; overall, however, current large models do not yet meet real-world requirements and need further optimization, especially for enterprise-level accounting scenarios. Conclusion: Existing large language models show some reasoning potential in accounting but are not yet adequate for enterprise-level use; future work should optimize specifically for the vertical domain to realize their value. Abstract: Large Language Models (LLMs) are reshaping learning paradigms, cognitive processes, and research methodologies across a wide range of domains. Integrating LLMs with professional fields and redefining the relationship between LLMs and domain-specific applications has become a critical challenge for promoting enterprise digital transformation and broader social development. To effectively integrate LLMs into the accounting domain, it is essential to understand their domain-specific reasoning capabilities. This study introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria by analyzing the training data characteristics of representative GLM-series models. These criteria provide a foundation for subsequent research on reasoning paradigms and offer benchmarks for improving accounting reasoning performance. Based on this framework, we evaluate several representative models, including GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4, on a set of accounting reasoning tasks. Experimental results show that different prompt engineering strategies lead to varying degrees of performance improvement across models, with GPT-4 achieving the strongest accounting reasoning capability. However, current LLMs still fall short of real-world application requirements. In particular, further optimization is needed for deployment in enterprise-level accounting scenarios to fully realize the potential value of LLMs in this domain.

[10] Constituency Structure over Eojeol in Korean Treebanks

Jungyeul Park,Chulwoo Park

Main category: cs.CL

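TL;DR: This paper examines the basic representational question in Korean constituency treebanks and argues for an eojeol-based constituency representation, with morphological information separated into its own layer, to achieve consistency with dependency resources and enable cross-treebank comparison.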

Details

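Motivation: Korean words are morphologically complex; using morphemes as constituency terminals conflates word-internal structure with phrase-level syntax and creates inconsistencies with eojeol-based dependency treebanks. Method: The paper proposes a constituency representation with the eojeol as its unit, encodes morphological segmentation and fine-grained part-of-speech tags in a separate non-constituent layer, and compares the Sejong and Penn Korean treebanks under explicit normalization assumptions. Result: Under the normalization assumptions, the Sejong and Penn Korean treebanks can be treated as representationally equivalent at the eojeol-level constituency structure, and on this basis a new annotation scheme that accommodates both syntactic structure and morphological information is outlined. Conclusion: An eojeol-based constituency representation better preserves the interpretability of syntactic structure and supports cross-treebank comparison and conversion between constituency and dependency structures. Abstract: The design of Korean constituency treebanks raises a fundamental representational question concerning the choice of terminal units. Although Korean words are morphologically complex, treating morphemes as constituency terminals conflates word-internal morphology with phrase-level syntactic structure and creates mismatches with eojeol-based dependency resources. This paper argues for an eojeol-based constituency representation, with morphological segmentation and fine-grained part-of-speech information encoded in a separate, non-constituent layer. A comparative analysis shows that, under explicit normalization assumptions, the Sejong and Penn Korean treebanks can be treated as representationally equivalent at the eojeol-based constituency level. Building on this result, we outline an eojeol-based annotation scheme that preserves interpretable constituency and supports cross-treebank comparison and constituency-dependency conversion.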

[11] ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation

Suhua Wang,Zifan Wang,Xiaoxin Sun,D. J. Wang,Zhanbo Liu,Xin Li

Main category: cs.CL

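TL;DR: This paper proposes ManchuTTS, a new approach to Manchu speech synthesis that uses multi-level text representations and a hierarchical attention mechanism to address data scarcity and phonological agglutination.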

Details

Motivation: Manchu is an endangered language with severe data scarcity and strong phonological agglutination, which makes conventional speech synthesis methods hard to apply. Method: A three-tier text representation (phoneme, syllable, prosodic) is combined with a cross-modal hierarchical attention mechanism for multi-granular alignment; a non-autoregressive generative model joins deep convolutional networks with a flow-matching Transformer, and a hierarchical contrastive loss strengthens acoustic-linguistic structural correspondence. Result: On a self-built 6.24-hour Manchu speech dataset, a model trained on a 5.2-hour subset reaches a MOS of 4.52, clearly outperforming all baselines; ablations show that the hierarchical mechanism improves agglutinative word pronunciation accuracy by 31% and prosodic naturalness by 27%. Conclusion: ManchuTTS addresses the low-resource and strongly agglutinative nature of Manchu and offers a technical framework that other endangered-language speech synthesis efforts can draw on. Abstract: As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS (Manchu Text to Speech), a novel approach tailored to Manchu's linguistic characteristics. To handle agglutination, the method designs a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. The method further introduces a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. To address low-resource constraints, we construct the first Manchu TTS dataset and employ a data augmentation strategy. Experiments demonstrate that ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset derived from our full 6.24-hour annotated corpus, outperforming all baseline models by a notable margin. Ablations confirm hierarchical guidance improves agglutinative word pronunciation accuracy (AWPA) by 31% and prosodic naturalness by 27%.

[12] Learning When Not to Attend Globally

Xuan Luo,Kailai Zhang,Xifeng Yan

Main category: cs.CL

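TL;DR: Proposes All-or-Here Attention (AHA), which dynamically switches between full attention and local sliding window attention, substantially reducing attention computation without loss of performance.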

Details

Motivation: Inspired by how humans read, the work asks whether large models can access global context only when necessary, reducing redundant computation in the attention mechanism. Method: A binary router per attention head dynamically chooses, for each token, between full attention and local sliding window attention. Result: With a 256-token window, up to 93% of full attention operations can be replaced without affecting performance; context dependency follows a long-tail distribution, and local context is sufficient in most cases. Conclusion: Full attention is redundant in most cases, efficient inference needs only on-demand access to global context, and AHA decouples local processing from global access. Abstract: When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.
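
The per-token toggle can be pictured as choosing, for each query position, between two causal attention masks. The sketch below builds such a mask for a single head. It is illustrative only: in the paper the router is learned, whereas here its binary decisions are passed in as a hypothetical `gates` list, and the convention that the window includes the current token is an assumption.

```python
def aha_attention_mask(gates, window):
    """Build a causal attention mask for one head.

    gates[t] is the (assumed, precomputed) router decision for token t:
      True  -> token t attends to all previous tokens (full attention)
      False -> token t attends only to the last `window` tokens, itself included
    Returns mask[t][s], True where query t may attend to key s.
    """
    n = len(gates)
    mask = []
    for t in range(n):
        start = 0 if gates[t] else max(0, t - window + 1)
        mask.append([start <= s <= t for s in range(n)])
    return mask


def local_fraction(gates):
    """Share of tokens routed to the local sliding window."""
    return sum(1 for g in gates if not g) / len(gates)


gates = [False, False, False, True]
mask = aha_attention_mask(gates, window=2)
frac = local_fraction(gates)
```

In this toy case three of four tokens use the local window, so `local_fraction` is 0.75; the abstract's 93% figure is the analogous quantity measured with a 256-token window.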

[13] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis

Zhiqiang Gao,Shihao Gao,Zixing Zhang,Yihao Guo,Hongyu Chen,Jing Han

Main category: cs.CL

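TL;DR: This paper presents an LLM-based approach using structured prompting and ensembling for aspect-based sentiment analysis and sentiment-flip detection in multimodal conversations, scoring 47.38% and 74.12% on the two subtasks, respectively.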

Details

Motivation: To improve fine-grained sentiment understanding in multi-speaker conversations, especially identifying sentiment components and their dynamic changes in complex multimodal settings. Method: For Subtask-I, a structured prompting pipeline guides LLMs to extract the sentiment sextuple step by step; for Subtask-II, an ensemble of three LLMs identifies sentiment flips and their triggers. Result: The system achieves a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II. Conclusion: Step-wise refinement and model ensembling are effective for complex multimodal sentiment analysis tasks. Abstract: Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, including holder, target, aspect, opinion, sentiment, and rationale from multi-speaker dialogues, and (2) detecting sentiment flipping, that is, dynamic sentiment shifts and their underlying triggers. For Subtask-I, in the present paper, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.

[14] Chain-of-thought Reviewing and Correction for Time Series Question Answering

Chen Su,Yuanhe Tian,Yan Song

Main category: cs.CL

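TL;DR: Proposes the T3LLM framework, in which three large models (a worker, a reviewer, and a student) collaborate on time series question answering, using an explicit correction mechanism to improve reasoning accuracy.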

Details

Motivation: Existing LLM-based time series analysis methods are error-prone on complex numerical sequences, yet time series data are inherently verifiable and can be used to check the consistency of reasoning steps. Method: The framework has three modules: a worker generates step-wise chains of thought, a reviewer checks them and corrects errors, and a student learns from the corrected chains of thought, internalizing multi-step reasoning and self-correction. Result: On multiple real-world time series question answering benchmarks, T3LLM outperforms strong baselines and achieves state-of-the-art performance. Conclusion: By introducing verifiability and an explicit correction mechanism, T3LLM improves the reasoning ability and robustness of large models on time series question answering. Abstract: With the advancement of large language models (LLMs), diverse time series analysis tasks are reformulated as time series question answering (TSQA) through a unified natural language interface. However, existing LLM-based approaches largely adopt general natural language processing techniques and are prone to reasoning errors when handling complex numerical sequences. Different from purely textual tasks, time series data are inherently verifiable, enabling consistency checking between reasoning steps and the original input. Motivated by this property, we propose T3LLM, which performs multi-step reasoning with an explicit correction mechanism for time series question answering. The T3LLM framework consists of three LLMs, namely, a worker, a reviewer, and a student, that are responsible for generation, review, and reasoning learning, respectively. Within this framework, the worker generates step-wise chains of thought (CoT) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments. The collaboratively generated, corrected CoTs are used to fine-tune the student model, internalizing multi-step reasoning and self-correction into its parameters. Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.

[15] M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

Fanglin Xu,Wei Zhang,Jian Yang,Guo Chen,Aishan Liu,Zhoujun Li,Xianglong Liu,Bryan Dai

Main category: cs.CL

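TL;DR: This paper presents M2G-Eval, a multi-granularity, multilingual framework for evaluating code generation at four levels (class, function, block, and line) across 18 programming languages, and releases the M2G-Eval-Coder model series trained from Qwen3-8B. An evaluation of 30 mainstream models reveals a difficulty hierarchy across granularities, a widening performance gap between full- and partial-granularity languages, and strong cross-language correlations.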

Details

Motivation: Most existing benchmarks for code LLMs are limited to a single structural granularity and a few programming languages, so they cannot fully reflect fine-grained capability differences across code scopes and multilingual settings; a finer-grained, broader evaluation framework is needed. Method: M2G-Eval covers four granularity levels (class, function, block, line), with more than 17,000 training tasks and 1,286 human-annotated, contamination-controlled test instances across 18 programming languages; M2G-Eval-Coder models are obtained by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization; 30 models are evaluated systematically. Result: (1) A clear difficulty hierarchy emerges, with line-level tasks easiest and class-level tasks hardest; (2) the performance gap between full-granularity and partial-granularity languages widens as task complexity increases; (3) cross-language performance is strongly correlated, suggesting that models learn transferable programming concepts. Conclusion: M2G-Eval enables fine-grained diagnosis of code generation capabilities and shows that generating complex, long-form code remains challenging for current models, pointing to directions for improvement. Abstract: The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities, yet existing benchmarks predominantly assess models at a single structural granularity and focus on limited programming languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating code generation in LLMs across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances. We develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization. Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) reveals three main findings: (1) an apparent difficulty hierarchy, with Line-level tasks easiest and Class-level most challenging; (2) widening performance gaps between full- and partial-granularity languages as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.

[16] On the Role of Discreteness in Diffusion LLMs

Ziqi Jin,Bin Wang,Xiang Lin,Lidong Bing,Aixin Sun

Main category: cs.CL

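TL;DR: This paper revisits diffusion language modeling, proposes five properties that separate diffusion mechanics from language-specific requirements, analyzes the limitations of existing approaches to text generation, and identifies two key issues: uniform corruption does not respect how information is distributed, and token-wise training cannot capture multi-token dependencies.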

Details

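Motivation: Diffusion models offer advantages for language generation such as parallel decoding and iterative refinement, but the discrete, structured nature of text makes them hard to apply directly, so the foundations of diffusion language modeling need to be re-examined. Method: Existing approaches are divided into continuous diffusion in embedding space and discrete diffusion over tokens; five key properties for evaluating diffusion language models are proposed; and recent large diffusion language models are analyzed. Result: Existing methods satisfy only some of the key properties, reflecting a structural trade-off; two core issues are identified: uniform corruption ignores how information is distributed across positions, and token-wise marginal training cannot model multi-token dependencies. Conclusion: Current diffusion language models are limited by a mismatch between their design and the structure of text; future work should design diffusion processes that better fit that structure to achieve more coherent generation. Abstract: Diffusion models offer appealing properties for language generation, such as parallel decoding and iterative refinement, but the discrete and highly structured nature of text challenges the direct application of diffusion principles. In this paper, we revisit diffusion language modeling from the perspectives of the diffusion process and language modeling, and outline five properties that separate diffusion mechanics from language-specific requirements. We first categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens. We then show that each satisfies only part of the five essential properties and therefore reflects a structural trade-off. Through analyses of recent large diffusion language models, we identify two central issues: (i) uniform corruption does not respect how information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies during parallel decoding. These observations motivate diffusion processes that align more closely with the structure of text, and encourage future work toward more coherent diffusion language models.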

[17] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

Hadi Mohammadi,Tamas Kozak,Anastasia Giachanou

Main category: cs.CL

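TL;DR: This study examines the faithfulness of chain-of-thought (CoT) reasoning in large language models and evaluates how well two optimization methods, GRPO and DPO, improve it. GRPO performs better in larger models, with the best results on Qwen2.5-14B-Instruct.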

Details

Motivation: CoT explanations often diverge from the model's actual reasoning process and can yield plausible but misleading reasoning chains, which undermines safety supervision and alignment monitoring; CoT faithfulness therefore needs to be improved. Method: Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) are evaluated systematically for their effect on CoT faithfulness across LLMs of different sizes, and the relationship between model size and performance is analyzed. Result: GRPO outperforms DPO in larger models, and Qwen2.5-14B-Instruct achieves the best results on all metrics; both methods show a positive correlation between model size and performance, but GRPO shows greater potential for improving faithfulness, with less stable behavior only at smaller scales. Conclusion: GRPO is a promising optimization method for more transparent and trustworthy LLM reasoning and for building trustworthy AI systems. Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
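
To make the contrast between the two methods concrete, the sketch below shows the core quantity each one optimizes, in their commonly published forms: GRPO's group-normalized advantage over several sampled responses to one prompt, and DPO's pairwise logistic loss on chosen versus rejected responses. The function names, the reward values, and `beta=0.1` are illustrative assumptions; the abstract does not describe the faithfulness reward or hyperparameters the authors used.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical rewards for four sampled responses to the same prompt.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# A policy identical to the reference gives the neutral loss log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# A policy that raises the chosen response's log-probability lowers the loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

The structural difference is visible here: GRPO needs a scalar reward for each sampled response and compares responses within a group, while DPO needs only preference pairs and a frozen reference model.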

[18] Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

Pere Martra

Main category: cs.CL

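TL;DR: Using structured width pruning of GLU-MLP layers guided by the Maximum Absolute Weight (MAW) criterion, the study finds that reducing the expansion ratio affects model capabilities systematically but unevenly: parametric knowledge and perplexity degrade, instruction-following improves substantially, and multi-step reasoning remains robust. The expansion ratio is identified as a key architectural parameter that selectively modulates cognitive capabilities, and an inverse correlation between knowledge and truthfulness is revealed.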

Details

Motivation: To challenge the common assumption that pruning causes uniform degradation and to examine how pruning affects different model capabilities differently, in particular the role of the expansion ratio in modulating cognitive capabilities. Method: Structured width pruning of GLU-MLP layers under the MAW criterion, with seven expansion ratio configurations evaluated on comprehensive benchmarks covering factual knowledge, mathematical reasoning, language comprehension, instruction following, and truthfulness. Result: Instruction following improves substantially (+46% to +75% on IFEval), multi-step reasoning stays robust (MUSR), and factual knowledge (MMLU) correlates strongly and negatively with truthfulness (TruthfulQA-MC2) (r = -0.864); pruning reduces energy consumption by up to 23% (J/token) and improves batch-processing efficiency, but increases single-request latency. Conclusion: The expansion ratio is a key parameter for selectively modulating cognitive capabilities, and MAW-guided pruning acts as a selective filter that reduces parametric knowledge while preserving or enhancing behavioral alignment, linking model compression with truthfulness research. Abstract: Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
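
As an illustration of what width pruning a GLU-MLP block involves, here is a minimal sketch in plain Python. It rests on an assumption that should be read loudly: the abstract does not define the MAW score, so the sketch takes a neuron's importance to be the largest absolute weight in its gate and up projection rows. The names `maw_scores` and `prune_width` are hypothetical and do not come from the paper.

```python
# Hypothetical sketch: the importance score below is an ASSUMED reading of
# "Maximum Absolute Weight", not the paper's confirmed definition.

def maw_scores(gate, up):
    """Score each hidden neuron by the largest |weight| in its gate/up rows."""
    return [
        max(max(abs(w) for w in g_row), max(abs(w) for w in u_row))
        for g_row, u_row in zip(gate, up)
    ]


def prune_width(gate, up, down, keep):
    """Keep the `keep` highest-scoring hidden neurons of a GLU-MLP block.

    gate, up: d_ff rows of length d_model; down: d_model rows of length d_ff.
    """
    scores = maw_scores(gate, up)
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    kept = sorted(ranked[:keep])  # restore the original neuron order
    new_gate = [gate[j] for j in kept]
    new_up = [up[j] for j in kept]
    new_down = [[row[j] for j in kept] for row in down]  # drop matching columns
    return new_gate, new_up, new_down, kept


# Toy block with d_model = 2 and d_ff = 4 (expansion ratio 2.0 before pruning).
gate = [[0.1, -0.2], [0.9, 0.0], [0.05, 0.05], [-0.3, 0.4]]
up = [[0.0, 0.1], [0.1, 0.1], [-1.5, 0.2], [0.2, 0.2]]
down = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

g, u, d, kept = prune_width(gate, up, down, keep=2)
expansion_ratio = len(g) / len(g[0])  # d_ff / d_model after pruning
```

Removing a hidden neuron means dropping one row from each of the gate and up projections and the matching column from the down projection, which is why the expansion ratio (hidden width over model width) falls as neurons are pruned.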

[19] Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency

Yoshith Roy Kotla,Varshith Roy Kotla

Main category: cs.CL

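TL;DR: Proposes Vocabulary-Aware Conformal Prediction (VACP), a framework that uses semantic masking and temperature-adjusted scoring to sharply shrink prediction sets for large-vocabulary language models while preserving coverage guarantees, a 197x efficiency gain.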

Details

Motivation: Using LLMs in high-stakes domains requires reliable uncertainty quantification, but standard softmax probabilities are usually poorly calibrated, and conventional conformal methods produce prediction sets too large to be useful. Method: Vocabulary-Aware Conformal Prediction (VACP) combines semantic masking with temperature-adjusted scoring to shrink the effective prediction space while theoretically guaranteeing marginal coverage. Result: Experiments with Gemma-2B on SQUAD and WikiText show that VACP reaches 89.7% empirical coverage (target 90%) and cuts the mean prediction set size from 847 tokens to 4.3 tokens, a 197x efficiency gain. Conclusion: VACP greatly improves prediction set efficiency while keeping good coverage, making conformal methods more practical for large-vocabulary language models. Abstract: Deploying large language models (LLMs) in high-stakes domains requires rigorous uncertainty quantification, yet standard softmax probabilities are often poorly calibrated. We present a systematic study of Adaptive Prediction Sets (APS) applied to next-token prediction in transformer-based models with large vocabularies (greater than 250,000 tokens). Our central contribution is the identification of a coverage-efficiency tradeoff: while naive conformal prediction achieves valid coverage, it produces prediction sets of hundreds of tokens, rendering them uninformative. We propose Vocabulary-Aware Conformal Prediction (VACP), a framework that leverages semantic masking and temperature-adjusted scoring to reduce the effective prediction space while provably maintaining marginal coverage. Experiments on Gemma-2B using SQUAD and WikiText benchmarks demonstrate that VACP achieves 89.7 percent empirical coverage (90 percent target) while reducing the mean prediction set size from 847 tokens to 4.3 tokens -- a 197x improvement in efficiency. We provide a theoretical analysis of vocabulary reduction and release our implementation for reproducibility.
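
For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of split conformal prediction with the non-randomized Adaptive Prediction Sets score, applied to toy next-token distributions. It does not implement VACP's semantic masking or temperature adjustment, which the abstract does not specify; the function names and the four-token vocabulary are illustrative assumptions.

```python
import math


def aps_score(probs, true_idx):
    """Probability mass of every token ranked at least as high as the true one."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    total = 0.0
    for i in order:
        total += probs[i]
        if i == true_idx:
            return total
    raise ValueError("true_idx is outside the vocabulary")


def calibrate(cal_probs, cal_labels, alpha):
    """Return the conformal threshold for target coverage 1 - alpha."""
    scores = sorted(aps_score(p, y) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return 1.0 if k > n else scores[k - 1]


def prediction_set(probs, qhat):
    """Add tokens in descending probability until their mass reaches qhat."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= qhat:
            break
    return chosen


# Toy calibration data over a 4-token vocabulary (illustrative values only).
cal_probs = [
    [0.7, 0.2, 0.05, 0.05],
    [0.5, 0.3, 0.1, 0.1],
    [0.6, 0.3, 0.05, 0.05],
    [0.4, 0.4, 0.1, 0.1],
]
cal_labels = [0, 1, 0, 1]
qhat = calibrate(cal_probs, cal_labels, alpha=0.5)

confident = prediction_set([0.9, 0.05, 0.03, 0.02], qhat)
uncertain = prediction_set([0.4, 0.3, 0.2, 0.1], qhat)
```

The set grows when the distribution is flat, and with a vocabulary of hundreds of thousands of tokens a flat tail is what inflates sets to hundreds of entries, the inefficiency the abstract describes.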

[20] GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages

Ahmed Abdullah,Sana Fatima,Haroon Mahmood

Main category: cs.CL

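TL;DR: This paper proposes a multilingual framework for hope speech detection focused on Urdu, a low-resource language. Using pretrained models such as XLM-RoBERTa and mBERT, it achieves strong F1 scores on the PolyHope-M 2025 benchmark, indicating that multilingual models are effective in low-resource settings.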

Details

Motivation: Hope speech is understudied in NLP, and low-resource languages such as Urdu in particular lack resources, which limits the development of tools that promote positive online communication. Method: Pretrained transformer models (XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT) are combined with simple preprocessing and trained as classifiers for multilingual hope speech detection. Result: On the PolyHope-M 2025 benchmark, Urdu F1 reaches 95.2% for binary classification and 65.2% for multi-class classification, with good results in Spanish, German, and English as well. Conclusion: Existing multilingual models can be applied effectively to hope speech detection in low-resource languages, helping build a more positive online discourse. Abstract: Hope speech has been relatively underrepresented in Natural Language Processing (NLP). Current studies are largely focused on English, which has resulted in a lack of resources for low-resource languages such as Urdu. As a result, the creation of tools that facilitate positive online communication remains limited. Although transformer-based architectures have proven to be effective in detecting hate and offensive speech, little has been done to apply them to hope speech or, more generally, to test them across a variety of linguistic settings. This paper presents a multilingual framework for hope speech detection with a focus on Urdu. Using pretrained transformer models such as XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT, we apply simple preprocessing and train classifiers for improved results. Evaluations on the PolyHope-M 2025 benchmark demonstrate strong performance, achieving F1-scores of 95.2% for Urdu binary classification and 65.2% for Urdu multi-class classification, with similarly competitive results in Spanish, German, and English. These results highlight the possibility of implementing existing multilingual models in low-resource environments, thus making it easier to identify hope speech and helping to build a more constructive digital discourse.

[21] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

Anaelia Ovalle,Candace Ross,Sebastian Ruder,Adina Williams,Karen Ullrich,Mark Ibrahim,Levent Sagun

Main category: cs.CL

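TL;DR: The study finds that although LLMs perform well on multilingual tasks, their reasoning aligns markedly worse with their conclusions in non-Latin scripts, exposing a gap in current multilingual evaluation methods.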

Details

Motivation: To examine whether LLM reasoning quality is consistent across languages, and in particular whether reasoning chains actually support their conclusions. Method: A human-validated framework is used to analyze 65k GlobalMMLU reasoning traces from 6 languages and 6 frontier models, and an error taxonomy is built through human annotation. Result: Misalignment between reasoning and conclusions in non-Latin scripts is at least twice that in Latin scripts; the main error types are evidential errors (such as unsupported claims and ambiguous facts) and illogical reasoning steps. Conclusion: Multilingual evaluation that looks only at task accuracy misses reasoning quality problems, so evaluation frameworks that attend to the soundness of reasoning are needed. Abstract: Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions as those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.

Sashank Chapala,Maksym Mironov,Songgaojun Deng

Main category: cs.CL

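TL;DR: This study examines whether minimal, psychologically inspired prompt changes can reduce social desirability bias (SDB) when LLMs simulate population responses, and finds that neutral, third-person reformulated prompts most effectively improve agreement between LLM output and real human data.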

Details

Motivation: In silicon sampling, LLMs often show social desirability bias on sensitive questions and drift away from real human responses, and mitigation methods for this bias are underexplored; the study therefore tests whether simple prompt engineering can make silicon samples more representative. Method: Using American National Election Study (ANES) data and three LLMs, including Llama-3.1 models and GPT-4.1-mini, four prompt-based mitigation strategies are compared (reformulation, reverse-coding, priming, and preamble), and agreement with ANES is measured with Jensen-Shannon divergence and bootstrap confidence intervals. Result: Reformulated prompts most effectively reduce distribution concentration and bring results closer to ANES; reverse-coding gives mixed results; priming and preamble do not noticeably reduce the bias and instead make responses more uniform. Conclusion: Prompt-based framing controls can effectively mitigate social desirability bias in LLMs, and reformulated prompts are a practical and effective way to improve silicon sampling. Abstract: Large Language Models (LLMs) are increasingly used to simulate population responses, a method known as "Silicon Sampling". However, responses to socially sensitive questions frequently exhibit Social Desirability Bias (SDB), diverging from real human data toward socially acceptable answers. Existing studies on social desirability bias in LLM-based sampling remain limited. In this work, we investigate whether minimal, psychologically grounded prompt wording can mitigate this bias and improve alignment between silicon and human samples. We conducted a study using data from the American National Election Study (ANES) on three LLMs from two model families: the open-source Llama-3.1 series and GPT-4.1-mini. We first replicate a baseline silicon sampling study, confirming the persistent Social Desirability Bias. We then test four prompt-based mitigation methods: reformulated (neutral, third-person phrasing), reverse-coded (semantic inversion), and two meta-instructions, priming and preamble, respectively encouraging analytics and sincerity. Alignment with ANES is evaluated using Jensen-Shannon Divergence with bootstrap confidence intervals. Our results demonstrate that reformulated prompts most effectively improve alignment by reducing distribution concentration on socially acceptable answers and achieving distributions closer to ANES. Reverse-coding produced mixed results across eligible items, while Priming and Preamble encouraged response uniformity and showed no systematic benefit for bias mitigation. Our findings validate the efficacy of prompt-based framing controls in mitigating inherent Social Desirability Bias in LLMs, providing a practical path toward more representative silicon samples.
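
The evaluation measure named in the abstract can be sketched as follows. This is a minimal, assumption-laden illustration: it computes the Jensen-Shannon divergence (base 2) between two categorical response distributions and a percentile bootstrap interval by resampling respondents. The response categories, sample data, and function names are hypothetical, and the paper's exact bootstrap procedure is not stated in the abstract.

```python
import math
import random
from collections import Counter


def distribution(responses, categories):
    """Relative frequency of each category in a list of responses."""
    counts = Counter(responses)
    n = len(responses)
    return [counts[c] / n for c in categories]


def jsd(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i = 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def bootstrap_jsd_ci(human, silicon, categories, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the JSD between two samples."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        h = rng.choices(human, k=len(human))      # resample with replacement
        s = rng.choices(silicon, k=len(silicon))
        stats.append(jsd(distribution(h, categories), distribution(s, categories)))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi


# Hypothetical survey item: the simulated sample piles up on "agree".
categories = ["agree", "neutral", "disagree"]
human = ["agree"] * 40 + ["neutral"] * 30 + ["disagree"] * 30
silicon = ["agree"] * 80 + ["neutral"] * 15 + ["disagree"] * 5

point = jsd(distribution(human, categories), distribution(silicon, categories))
low, high = bootstrap_jsd_ci(human, silicon, categories, n_boot=200)
```

A lower divergence means the simulated distribution sits closer to the human one, so a mitigation prompt helps when its interval lies below the baseline's.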

[23] Data Augmentation for Classification of Negative Pregnancy Outcomes in Imbalanced Data

Md Badsha Biswas

Main category: cs.CL

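TL;DR: This study proposes combining social media data (especially from Twitter) with natural language processing to identify pregnancy experiences and classify pregnancy outcomes, supplementing existing epidemiological data and supporting research on negative pregnancy outcomes.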

Details

Motivation: Birth defects are a leading cause of infant mortality, and research on negative pregnancy outcomes such as miscarriage, stillbirth, birth defects, and premature birth still needs more comprehensive data and intervention strategies; traditional data sources have limitations. Method: A natural language processing (NLP) pipeline preprocesses and augments publicly available social media data, automatically identifies women sharing their pregnancy experiences, and classifies them by reported outcome into positive cases (full gestation, normal birth weight) and negative cases (negative pregnancy outcomes). Result: The approach makes effective use of unstructured, noisy, and imbalanced social media data to identify pregnancy-related text and offers a new data source for observational research; it also has potential for assessing the causal effect of specific interventions, treatments, or prenatal exposures. Conclusion: Social media data can serve as an effective supplementary resource for analyzing pregnancy outcomes in epidemiological research, and the study provides a workable framework for future studies involving pregnant cohorts. Abstract: Infant mortality remains a significant public health concern in the United States, with birth defects identified as a leading cause. Despite ongoing efforts to understand the causes of negative pregnancy outcomes like miscarriage, stillbirths, birth defects, and premature birth, there is still a need for more comprehensive research and strategies for intervention. This paper introduces a novel approach that uses publicly available social media data, especially from platforms like Twitter, to enhance current datasets for studying negative pregnancy outcomes through observational research. The inherent challenges in utilizing social media data, including imbalance, noise, and lack of structure, necessitate robust preprocessing techniques and data augmentation strategies. By constructing a natural language processing (NLP) pipeline, we aim to automatically identify women sharing their pregnancy experiences, categorizing them based on reported outcomes. Women reporting full gestation and normal birth weight will be classified as positive cases, while those reporting negative pregnancy outcomes will be identified as negative cases. Furthermore, this study offers potential applications in assessing the causal impact of specific interventions, treatments, or prenatal exposures on maternal and fetal health outcomes. Additionally, it provides a framework for future health studies involving pregnant cohorts and comparator groups. In a broader context, our research showcases the viability of social media data as an adjunctive resource in epidemiological investigations about pregnancy outcomes.

[24] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

Aiwei Liu,Minghua He,Shaoxun Zeng,Sijun Zhang,Linhao Zhang,Chuhan Wu,Wei Jia,Yuan Liu,Xiao Zhou,Jie Zhou

Main category: cs.CL

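TL;DR: WeDLM is a diffusion decoding framework built on causal attention; it uses topological reordering to make parallel generation prefix-cache friendly, substantially speeding up inference while preserving language model quality.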

Details

Motivation: Existing diffusion language models use bidirectional attention and so cannot exploit prefix KV caching, and the repeated contextualization lowers inference efficiency; autoregressive models are efficient but lack parallelism, so a decoding method that offers both parallelism and deployment efficiency is needed. Method: WeDLM uses purely causal attention, introduces topological reordering that moves observed tokens to the physical prefix while keeping their logical positions, and adds a streaming decoding strategy that continuously commits high-confidence tokens and maintains a fixed parallel workload, avoiding stop-and-wait behavior. Result: Experiments show that WeDLM preserves the quality of strong autoregressive baselines while approaching a 3x speedup on challenging reasoning tasks and reaching up to 10x in low-entropy generation, outperforming autoregressive baselines served by vLLM. Conclusion: WeDLM is a diffusion language model framework that is compatible with prefix caching and supports efficient parallel decoding, showing that diffusion-style decoding can outperform an optimized autoregressive engine in real deployment. Abstract: Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.
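
The reordering idea described in the abstract can be illustrated with a small sketch. It covers only the bookkeeping, under the assumption that "physical" order is the order in which tokens are fed to the model and "logical" position is the index used for positional encoding; the `MASK` placeholder and both function names are hypothetical, and nothing here reflects WeDLM's actual implementation.

```python
MASK = None  # placeholder for a still-masked position (illustrative choice)


def topological_reorder(tokens):
    """Move observed tokens to the front; return tokens and logical positions.

    The returned `position_ids[slot]` is the original (logical) index of the
    token that now sits at physical index `slot`.
    """
    observed = [i for i, t in enumerate(tokens) if t is not MASK]
    masked = [i for i, t in enumerate(tokens) if t is MASK]
    position_ids = observed + masked
    reordered = [tokens[i] for i in position_ids]
    return reordered, position_ids


def visible_logical_positions(position_ids, slot):
    """Logical positions a physical slot may attend to under a causal mask."""
    return position_ids[: slot + 1]


tokens = ["The", MASK, "sat", MASK, "mat"]
reordered, position_ids = topological_reorder(tokens)
first_masked_slot = reordered.index(MASK)
seen = visible_logical_positions(position_ids, first_masked_slot)
```

Because every observed token precedes every masked slot in physical order, a strictly causal mask already lets each masked slot attend to all observed tokens, and the observed prefix is the part whose KV entries can be cached.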

[25] Harnessing Large Language Models for Biomedical Named Entity Recognition

Jian Chen,Leilei Su,Cong Sun

Main category: cs.CL

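TL;DR: BioSelectTune is an efficient data-centric framework that improves biomedical named entity recognition through a Hybrid Superfiltering strategy, reaching state-of-the-art performance with only 50% of the curated data.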

Details

Motivation: General-domain LLMs perform poorly on biomedical named entity recognition because they lack domain knowledge and are trained on low-quality data. Method: BioNER is reformulated as a structured JSON generation task, and a novel Hybrid Superfiltering strategy uses a homologous weak model to distill a high-impact training dataset. Result: The approach achieves state-of-the-art performance on multiple BioNER benchmarks; using only 50% of the curated positive data, it surpasses both the fully trained baseline and specialized models such as BioMedBERT. Conclusion: BioSelectTune shows that data quality matters more than quantity and is an effective way to improve LLM performance in the biomedical domain. Abstract: Background and Objective: Biomedical Named Entity Recognition (BioNER) is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. However, adapting general-domain Large Language Models (LLMs) to this task is often hampered by their lack of domain-specific knowledge and the performance degradation caused by low-quality training data. To address these challenges, we introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning LLMs that prioritizes data quality over quantity. Methods and Results: BioSelectTune reformulates BioNER as a structured JSON generation task and leverages our novel Hybrid Superfiltering strategy, a weak-to-strong data curation method that uses a homologous weak model to distill a compact, high-impact training dataset. Conclusions: Through extensive experiments, we demonstrate that BioSelectTune achieves state-of-the-art (SOTA) performance across multiple BioNER benchmarks. Notably, our model, trained on only 50% of the curated positive data, not only surpasses the fully-trained baseline but also outperforms powerful domain-specialized models like BioMedBERT.

Dongning Rao,Yunbiao Zeng,Zhihua Jiang,Jujian Lv

Main category: cs.CL

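TL;DR: This paper proposes TEXT, a new model for multi-modal sentiment analysis (MSA) that uses multi-modal large language models to generate explanations and a temporal alignment mechanism to fuse the text, audio, and video modalities, achieving the best performance on several datasets.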

Details

Motivation: Existing MSA methods still make insufficient use of explanation information and temporal alignment, and multi-modal sentiment cues need to be fused more effectively. Method: The TEXT model (1) uses multi-modal large language models (MLLMs) to generate sentiment explanations; (2) aligns audio and video representations with a temporality-oriented neural module; (3) introduces a text-routed sparse mixture-of-experts with gate fusion; and (4) combines Mamba with temporal cross-attention for temporal modeling. Result: TEXT achieves the best performance on all four datasets, ahead of three recent methods and three MLLMs, and leads on at least four of six metrics. For example, it lowers the mean absolute error on CH-SIMS to 0.353, 13.5% below recent methods. Conclusion: By adding explanation augmentation and temporal alignment, TEXT effectively improves multi-modal sentiment analysis and confirms the importance of explanations and temporal modeling for MSA. Abstract: Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLM), and then aligns the representations of audio and video through a novel temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four of the six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, a 13.5% reduction compared with recently proposed approaches.

[27] Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language

Muhammad Zain Ali,Bernhard Pfahringer,Tony Smith

Main category: cs.CL

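TL;DR: This paper studies the challenges of fake news detection in low-resource languages such as Urdu, proposes domain-adaptive pretraining to improve multilingual models, and validates the approach on four Urdu datasets.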

Details

Motivation: Low-resource languages such as Urdu have received little attention in misinformation detection, and existing multilingual models handle domain-specific terms poorly, so models need to generalize better within a specific language and domain. Method: A two-stage training approach combines domain-adaptive pretraining with fine-tuning, using a publicly available Urdu news corpus to adapt two multilingual models, XLM-RoBERTa and mBERT, which are then evaluated on four public Urdu fake news datasets. Result: Domain-adapted XLM-R outperforms the original model on all datasets, while mBERT is unstable and gives mixed results. Conclusion: Domain-adaptive pretraining effectively improves multilingual models on fake news detection in low-resource languages, most clearly for XLM-RoBERTa, demonstrating the value of domain adaptation for low-resource language processing. Abstract: Misinformation on social media is a widely acknowledged issue, and researchers worldwide are actively engaged in its detection. However, low-resource languages such as Urdu have received limited attention in this domain. An obvious approach is to utilize a multilingual pretrained language model and fine-tune it for a downstream classification task, such as misinformation detection. However, these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to optimize model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, and apply domain-adaptive pretraining using a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-R consistently outperforms its vanilla counterpart, while domain-adapted mBERT exhibits mixed results.

[28] CNSight: Evaluation of Clinical Note Segmentation Tools

Risha Surana,Adrian Law,Sunwoo Kim,Rishab Sridhar,Angxiao Han,Peiyu Hong

Main category: cs.CL

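TL;DR: This study evaluates rule-based methods, domain-specific transformer models, and large language models on clinical note segmentation using a dataset of 1,000 notes from MIMIC-IV; API-based LLMs such as GPT-5-mini perform best on sentence-level and freetext segmentation, with an average F1 of 72.4.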

Details

Motivation: Clinical notes are often stored in unstructured or semi-structured form, which makes them hard to use for secondary analysis and downstream applications; reliably identifying section boundaries is a key step toward structuring them. Method: Rule-based baselines, domain-specific transformer models, and large language models are evaluated on clinical note segmentation using a dataset of 1,000 clinical notes from MIMIC-IV. Result: API-based LLMs perform best overall, with GPT-5-mini reaching an average F1 of 72.4 across sentence-level and freetext segmentation; lightweight baselines do reasonably well on structured sentence-level tasks but poorly on unstructured freetext. Conclusion: LLMs outperform traditional methods on clinical note segmentation and provide a solid basis for downstream tasks such as information extraction, cohort identification, and automated summarization. Abstract: Clinical notes are often stored in unstructured or semi-structured formats after extraction from electronic medical record (EMR) systems, which complicates their use for secondary analysis and downstream clinical applications. Reliable identification of section boundaries is a key step toward structuring these notes, as sections such as history of present illness, medications, and discharge instructions each provide distinct clinical contexts. In this work, we evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini reaching a best average F1 of 72.4 across sentence-level and freetext segmentation. Lightweight baselines remain competitive on structured sentence-level tasks but falter on unstructured freetext. Our results provide guidance for method selection and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization.

[29] NepEMO: A Multi-Label Emotion and Sentiment Analysis on Nepali Reddit with Linguistic Insights and Temporal Trends

Sameer Sitoula,Tej Bahadur Shahi,Laxmi Prasad Bhatt,Anisha Pokhrel,Arjun Neupane

Main category: cs.CL

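TL;DR: This paper presents NepEMO, a new dataset for multi-label emotion and sentiment classification of Nepali Reddit posts, and a comparison of several model families shows that transformer models perform best on both tasks.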

Details

Motivation: To study how users express emotion on sensitive topics on social media platforms such as Reddit, and given the lack of such resources for the Nepali community, the authors build a dataset for multi-label emotion and sentiment classification. Method: 4,462 Reddit posts from 2019 to 2025, written in English, Romanised Nepali, and Devanagari script, are collected and manually annotated for five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral), and traditional machine learning, deep learning, and transformer models are compared. Result: Transformer models outperform traditional machine learning and deep learning models on both multi-label emotion classification and sentiment classification; linguistic analysis also reveals emotion trends, emotion co-occurrence patterns, sentiment-specific n-grams, and topic modelling results. Conclusion: NepEMO is an important resource for emotion and sentiment analysis in low-resource languages such as Nepali, and transformer models perform best on these tasks. Abstract: Social media (SM) platforms (e.g. Facebook, Twitter, and Reddit) are increasingly leveraged to share opinions and emotions, specifically during challenging events, such as natural disasters, pandemics, and political elections, and joyful occasions like festivals and celebrations. Among the SM platforms, Reddit provides a unique space for its users to anonymously express their experiences and thoughts on sensitive issues such as health and daily life. In this work, we present a novel dataset, called NepEMO, for multi-label emotion (MLE) and sentiment classification (SC) on Nepali subreddit posts. We curate and build a manually annotated dataset of 4,462 posts (January 2019 to June 2025) written in English, Romanised Nepali and Devanagari script for five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral). We perform a detailed analysis of posts to capture linguistic insights, including emotion trends, co-occurrence of emotions, sentiment-specific n-grams, and topic modelling using Latent Dirichlet Allocation and TF-IDF keyword extraction. Finally, we compare various traditional machine learning (ML), deep learning (DL), and transformer models for MLE and SC tasks. The results show that transformer models consistently outperform the ML and DL models for both tasks.

[30] AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning

Shihao Cai,Runnan Fang,Jialong Wu,Baixuan Li,Xinyu Wang,Yong Jiang,Liangcai Su,Liwen Zhang,Wenbiao Yin,Zhen Zhang,Fuli Feng,Pengjun Xie,Xiaobin Wang

Main category: cs.CL

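TL;DR: Proposes a unified automated pipeline and an environment-level reinforcement learning algorithm that generate high-difficulty, verifiable simulated task environments and mitigate simulated-user instability, improving the efficiency and stability of training language agents in simulation.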

Details

Motivation: Existing reinforcement learning research in simulated environments is limited by semi-automated environment construction, insufficient task difficulty, and unstable simulated users; it lacks breadth and depth and struggles to train language agents effectively. Method: A unified automated pipeline generates high-difficulty, easily verifiable simulated environments, and an environment-level reinforcement learning algorithm performs advantage estimation at the environment level to handle unstable user behavior and improve training efficiency. Result: Experiments on several agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, show that the method improves training stability and performance and generalizes well out of domain. Conclusion: The proposed automated environment synthesis pipeline and environment-level RL algorithm effectively support the training of complex language agents and offer a scalable, stable approach for future agentic systems. Abstract: Conducting reinforcement learning (RL) in simulated environments offers a cost-effective and highly scalable way to enhance language-based agents. However, previous work has been limited to semi-automated environment synthesis or tasks lacking sufficient difficulty, offering little breadth or depth. In addition, the instability of simulated users integrated into these environments, along with the heterogeneity across simulated environments, poses further challenges for agentic RL. In this work, we propose: (1) a unified pipeline for automated and scalable synthesis of simulated environments associated with high-difficulty but easily verifiable tasks; and (2) an environment-level RL algorithm that not only effectively mitigates user instability but also performs advantage estimation at the environment level, thereby improving training efficiency and stability. Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validate the effectiveness of our proposed method. Further in-depth analyses underscore its out-of-domain generalization.

[31] Diversity or Precision? A Deep Dive into Next Token Prediction

Haoyuan Wu,Hai Wang,Jiajia Wu,Jinxiang Ou,Keyao Wang,Weile Chen,Zihao Zheng,Bei Yu

Main category: cs.CL

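TL;DR: Proposes a generalized pre-training objective that uses a reward-shaping strategy to balance diversity and precision, with the aim of improving the exploration space available to LLMs during reinforcement learning.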

Details

Motivation: The standard cross-entropy loss limits the exploration potential of pre-trained models in reinforcement learning, and a better output distribution is needed to improve reasoning. Method: Next-token prediction is treated as a stochastic decision process, with a positive reward scaling factor and a rank-aware treatment of negative tokens, bringing on-policy reinforcement learning principles into supervised learning. Result: The reshaped pre-training distribution gives reinforcement learning a better exploration space than a high-entropy distribution, and precision-oriented priors in particular perform better. Conclusion: A precision-oriented pre-training distribution benefits the reasoning gains of subsequent reinforcement learning more than a high-entropy distribution does. Abstract: Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.

[32] Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

Mengdi Chai,Ali R. Zomorrodi

Main category: cs.CL

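TL;DR: This study evaluates three state-of-the-art LLMs (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) on clinical decision support and examines how prompt engineering affects their performance, finding that its effect varies by model and task.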

Details

Motivation: To examine the practical value of LLMs in real clinical decision-making, beyond medical knowledge assessments. Method: Using 36 case studies, three LLMs are evaluated on five clinical reasoning tasks (differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation), comparing temperature settings and prompt engineering approaches such as the MedPrompt framework with dynamic few-shot learning. Result: The models are highly accurate on final diagnosis but weak on diagnostic testing; prompt engineering improves some tasks but is not universally effective, and targeted examples do not necessarily beat random ones. Conclusion: LLM performance in clinical decision-making varies by task and model, and the effect of prompt engineering is context-dependent, so tailored strategies are needed for effective integration into healthcare. Abstract: Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs (ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B) in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed the LLMs' out-of-the-box performance across five key sequential clinical decision-making tasks under two temperature settings (default vs. zero): differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation. All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in remaining tasks. Furthermore, ChatGPT performed better under zero temperature, whereas Llama showed stronger performance under the default temperature. Next, we assessed whether prompt engineering could enhance LLM performance by applying variations of the MedPrompt framework, incorporating targeted and random dynamic few-shot learning. The results demonstrate that prompt engineering is not a one-size-fits-all solution. While it significantly improved the performance on the task with lowest baseline accuracy (relevant diagnostic testing), it was counterproductive for others. Another key finding was that the targeted dynamic few-shot prompting did not consistently outperform random selection, indicating that the presumed benefits of closely matched examples may be counterbalanced by loss of broader contextual diversity. These findings suggest that the impact of prompt engineering is highly model- and task-dependent, highlighting the need for tailored, context-aware strategies for integrating LLMs into healthcare.

[33] Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping

Tao Yu,Yongqi An,Kuan Zhu,Guibo Zhu,Ming Tang,Jinqiao Wang

Main category: cs.CL

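TL;DR: This paper proposes Function-Aware Neuron Grouping (FANG), a post-training pruning framework that mitigates calibration bias by identifying and preserving neurons critical to specific functions, improving downstream task accuracy while preserving language modeling performance.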

Details

Motivation: Existing structured pruning methods generalize poorly to downstream tasks when the few-shot calibration set does not adequately reflect the pretraining data distribution. Method: FANG groups neurons by the type of semantic context they process and prunes each group independently; during importance estimation it gives higher weight to tokens that correlate strongly with the functional role of the neuron group and preserves neurons that contribute across multiple context types; it also allocates sparsity to each block adaptively according to its functional complexity. Result: Experiments show that FANG improves downstream accuracy while preserving language modeling performance; combined with FLAP and OBC, it raises average accuracy by 1.5% to 8.5% at 30% and 40% sparsity, reaching SOTA results. Conclusion: FANG effectively mitigates calibration bias and achieves better pruning performance through function-aware neuron grouping and adaptive sparsity allocation. Abstract: Large Language Models (LLMs) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured pruning offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training pruning framework that alleviates calibration bias by identifying and preserving neurons critical to specific functions. FANG groups neurons with similar function based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weighting. FANG also preserves neurons that contribute across multiple context types. To achieve a better trade-off between sparsity and performance, it allocates sparsity to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while preserving language modeling performance. It achieves state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative pruning methods. Specifically, FANG outperforms FLAP and OBC by 1.5% to 8.5% in average accuracy under 30% and 40% sparsity.

[34] LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

Wenxuan Xu,Arvind Pillai,Subigya Nepal,Amanda C Collins,Daniel M Mackin,Michael V Heinz,Tess Z Griffin,Nicholas C Jacobson,Andrew Campbell

Main category: cs.CL

TL;DR: LENS is a framework that aligns multimodal health sensor data with language models to generate clinically grounded mental-health narratives.

Details

Motivation: Current LLMs cannot directly process long-duration sensor time series, and paired sensor-text datasets are scarce, which limits the translation of behavioral signals into natural language for mental-health assessment. Method: A large-scale sensor-text QA dataset is built by converting Ecological Momentary Assessment (EMA) responses into natural-language descriptions, and a patch-level encoder is trained to project raw sensor signals into the LLM's representation space. Result: LENS outperforms strong baselines on standard NLP metrics and on symptom-severity accuracy, and a user study with 13 mental-health professionals indicates that its narratives are clinically meaningful and comprehensive. Conclusion: LENS advances LLMs as interfaces for health sensing and offers a scalable path toward reasoning directly over raw behavioral signals in support of clinical decision-making. Abstract: Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.

[35] Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Kerem Zaman,Shashank Srivastava

Main category: cs.CL

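TL;DR: The paper argues that the Biasing Features metric, which labels a chain of thought (CoT) as unfaithful when it omits a hint that affected the prediction, conflates unfaithfulness with incompleteness. On multi-hop reasoning tasks, many CoTs flagged as unfaithful by this metric are judged faithful by other metrics. Using a new faithful@k metric and Causal Mediation Analysis, the authors show that larger inference-time token budgets greatly increase hint verbalization, and that non-verbalized hints can still causally influence predictions through the CoT. They therefore recommend against relying solely on hint-based evaluation and advocate a broader interpretability toolkit that includes causal mediation and corruption-based metrics.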

Details

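Motivation: The Biasing Features metric currently used to assess CoT faithfulness treats an omitted hint as unfaithfulness; the authors argue that this wrongly equates incompleteness caused by lossy compression with unfaithfulness, so the validity of the metric needs to be re-examined. Method: Run multi-hop reasoning experiments on Llama-3 and Gemma-3 and compare several faithfulness metrics; propose a new faithful@k metric that measures how often hints are verbalized, and apply Causal Mediation Analysis to study the causal effect of non-verbalized hints on predictions. Result: Many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models; larger inference-time token budgets raise hint verbalization to as much as 90% in some settings; Causal Mediation Analysis shows that hints can mediate prediction changes through the CoT even when they are not verbalized. Conclusion: CoT faithfulness should not be judged solely by whether a hint is verbalized, because incompleteness and unfaithfulness are different things; the authors recommend combining causal mediation and corruption-based metrics into a broader interpretability toolkit. Abstract: Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.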

[36] Accelerating Language Model Workflows with Prompt Choreography

TJ Bai,Jason Eisner

Main category: cs.CL

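TL;DR: The paper proposes Prompt Choreography, a framework that speeds up multi-agent LLM workflows by maintaining a dynamic, global KV cache, reducing per-message latency and improving end-to-end efficiency in some workflows.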

Details

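Motivation: In multi-agent workflows, LLMs repeatedly re-encode the same messages, and this redundant computation hurts efficiency; a mechanism is needed to avoid the repeated work and speed up execution. Method: Maintain a dynamic, global KV cache so that each LLM call can attend to an arbitrary, reordered subset of previously encoded messages, with support for parallel calls; fine-tune the LLM to work with the cache so that it mimics the results of re-encoding. Result: Time-to-first-token per message is 2.0–6.2 times faster, and end-to-end speedups exceed 2.2 times in some workflows dominated by redundant computation. Conclusion: Prompt Choreography executes LLM workflows efficiently and substantially reduces latency; because cached encodings can differ from re-encoded ones, fine-tuning is used to keep outputs close to the originals. It is best suited to multi-agent settings dominated by redundant computation. Abstract: Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0--6.2$\times$ faster time-to-first-token) and achieves substantial end-to-end speedups ($>$2.2$\times$) in some workflows dominated by redundant computation.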

[37] TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish

Melikşah Türker,A. Ebrar Kızıloğlu,Onur Güngör,Susan Üsküdarlı

Main category: cs.CL

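TL;DR: The paper introduces TabiBERT, a monolingual Turkish encoder based on the ModernBERT architecture and trained from scratch on a large multi-domain corpus. It combines RoPE, FlashAttention, and refined normalization, supports long contexts, and improves inference efficiency over existing models.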

Details

Motivation: Turkish NLP lacks a monolingual encoder that is trained from scratch and incorporates recent advances in Transformer architecture, which limits performance across downstream tasks; a modern, efficient Turkish-specific model is needed. Method: Train TabiBERT from scratch on the ModernBERT architecture, which integrates Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization; pre-train on one trillion tokens sampled from an 84.88B-token multi-domain corpus, with an 8,192-token context length; build the standardized benchmark TabiBench, covering 28 datasets in eight task categories and scored with GLUE-style macro-averaging. Result: TabiBERT scores 77.58 on TabiBench, 1.62 points above BERTurk, and sets the state of the art in five of the eight task categories, including question answering (+9.55), code retrieval (+2.41), and document retrieval (+0.60); compared with the best prior task-specific results, including specialized models such as TurkishBERTweet, it improves by +1.47 on average, indicating robust cross-domain generalization; it also achieves up to 2.65x inference speedup and lower GPU memory consumption. Conclusion: TabiBERT is a monolingual Turkish encoder trained from scratch with modern Transformer optimizations, and it delivers strong accuracy and efficiency; the released model weights, training configurations, and evaluation code provide a reproducible basis for further research. Abstract: Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). The model supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories, including question answering (+9.55), code retrieval (+2.41), and document retrieval (+0.60). Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.
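
The macro-averaging mentioned for TabiBench can be illustrated with a minimal sketch. The abstract does not spell out the exact procedure, so the two-level scheme here (average datasets within a category, then average the categories) is an assumption, and the category names and scores are made-up placeholders rather than TabiBench results.

```python
from statistics import mean


def macro_average(scores_by_category):
    """Average within each category first, then across categories.

    Assumed scheme: every category counts equally in the final score,
    no matter how many datasets it contains.
    """
    per_category = {
        name: mean(scores) for name, scores in scores_by_category.items()
    }
    overall = mean(list(per_category.values()))
    return overall, per_category


# Hypothetical scores, for illustration only.
example_scores = {
    "question_answering": [60.0, 70.0],
    "retrieval": [80.0],
    "classification": [90.0, 90.0, 90.0],
}
overall, per_category = macro_average(example_scores)
```

With these placeholder numbers the macro average differs from a plain average over all six datasets, because the three-dataset category no longer dominates the total.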

[38] Reservoir Computing inspired Matrix Multiplication-free Language Model

Takumi Shiratsuchi,Yuichiro Tanaka,Hakaru Tamukoh

Main category: cs.CL

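TL;DR: The paper proposes an efficient model that combines a matrix multiplication-free language model with a reservoir computing-inspired architecture, reducing parameter count, training time, and inference time while keeping performance comparable to the baseline.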

Details

Motivation: Reduce the high computational cost of LLMs and improve computational efficiency. Method: Partially fix and share the weights of selected layers in a MatMul-free LM, insert reservoir layers to obtain rich dynamic representations without additional training overhead, and combine several operations to reduce memory accesses. Result: Parameters are reduced by up to 19%, training time by 9.9%, and inference time by 8.0%, with performance comparable to the baseline. Conclusion: The proposed architecture improves compute and memory efficiency while maintaining comparable performance, which makes it a candidate for resource-constrained settings. Abstract: Large language models (LLMs) have achieved state-of-the-art performance in natural language processing; however, their high computational cost remains a major bottleneck. In this study, we target computational efficiency by focusing on a matrix multiplication free language model (MatMul-free LM) and further reducing the training cost through an architecture inspired by reservoir computing. Specifically, we partially fix and share the weights of selected layers in the MatMul-free LM and insert reservoir layers to obtain rich dynamic representations without additional training overhead. Additionally, several operations are combined to reduce memory accesses. Experimental results show that the proposed architecture reduces the number of parameters by up to 19%, training time by 9.9%, and inference time by 8.0%, while maintaining comparable performance to the baseline model.

[39] Not too long do read: Evaluating LLM-generated extreme scientific summaries

Zhuoqi Lyu,Qing Ke

Main category: cs.CL

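TL;DR: The paper introduces BiomedTLDR, a new dataset containing a large sample of researcher-authored summaries of scientific papers, and uses it to evaluate how well LLMs generate extreme scientific summaries (TLDRs). Although some LLMs produce human-like summaries, they generally stay closer to the source text's wording and structure and are more extractive than abstractive.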

Details

Motivation: The lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and the evaluation of LLM summarization, so a dataset that reflects how experts actually write such summaries is needed to analyze LLM behavior systematically. Method: Build BiomedTLDR from the comments that researchers attach to bibliography items, then test popular open-weight LLMs on generating TLDRs from paper abstracts. Result: Current LLMs can produce human-like summaries to some extent, but they rely more heavily on the source text's lexical choices and rhetorical structures and are more extractive than human writers. Conclusion: BiomedTLDR is a useful resource for scientific TLDR research, and the results expose a weakness of current LLMs in abstractive summarization, suggesting that future work should strengthen semantic understanding and content restructuring. Abstract: High-quality scientific extreme summary (TLDR) facilitates effective science communication. How do large language models (LLMs) perform in generating them? How are LLM-generated summaries different from those written by human experts? However, the lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and evaluation of LLMs' summarization ability. To address these, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers, which leverages the common practice of including authors' comments alongside bibliography items. We then test popular open-weight LLMs for generating TLDRs based on abstracts. Our analysis reveals that, although some of them successfully produce humanoid summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures, hence tend to be more extractive rather than abstractive in general, compared to humans. Our code and datasets are available at https://github.com/netknowledge/LLM_summarization (Lyu and Ke, 2025).

[40] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen,Zeyu Ji,Qianren Mao,Junhang Cheng,Bangjie Qin,Hao Wu,Zhuoran Li,Jingzheng Li,Kai Sun,Zizhe Wang,Yikun Ban,Zhu Sun,Xiangyang Ji,Hailong Sun

Main category: cs.CL

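TL;DR: The paper proposes LLM-PeerReview, an unsupervised LLM ensemble method in which multiple models score the candidate responses and an aggregation step selects the best one; it outperforms existing methods across four datasets.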

Details

Motivation: Improve the reliability and consistency of outputs generated by multiple LLMs by using their collective judgment to select a response without supervision. Method: Score each candidate response with the LLM-as-a-Judge technique, aggregate the scores with either a graphical model-based truth inference algorithm or a simple averaging strategy, and select the highest-scoring response as the final output. Result: Strong results on four datasets; the two variants outperform the recent Smoothie-Global method by 6.9 and 7.3 percentage points, respectively. Conclusion: LLM-PeerReview is a conceptually simple and empirically strong unsupervised ensemble framework with good interpretability and generalization. Abstract: We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; For reasoning, we can apply a principled graphical model-based truth inference algorithm or a straightforward averaging strategy to aggregate multiple scores to produce a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced model Smoothie-Global by 6.9 and 7.3 percentage points, respectively.
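
The three stages described above (score, aggregate, select) can be sketched as follows, using the averaging variant only. This is an illustrative sketch, not the authors' implementation: the judge functions are hypothetical stand-ins for LLM-as-a-Judge calls, and the graphical model-based variant is not shown.

```python
from statistics import mean


def peer_review_select(candidates, judges):
    """Return (index of best candidate, aggregated score per candidate)."""
    if not candidates or not judges:
        raise ValueError("need at least one candidate and one judge")
    final_scores = []
    for response in candidates:
        # Scoring stage: every judge rates every candidate response.
        scores = [judge(response) for judge in judges]
        # Reasoning stage (averaging variant): one final score per response.
        final_scores.append(mean(scores))
    # Selection stage: highest aggregated score wins (first one on ties).
    best = max(range(len(candidates)), key=final_scores.__getitem__)
    return best, final_scores


# Hypothetical judges; a real system would prompt LLMs for these scores.
def length_judge(text):
    return min(len(text.split()), 10) / 10


def keyword_judge(text):
    return 1.0 if "because" in text else 0.0


candidates = [
    "Yes.",
    "Yes, because the premise implies it.",
    "No idea at all, sorry.",
]
best, scores = peer_review_select(candidates, [length_judge, keyword_judge])
```

Averaging treats all judges as equally reliable; the truth-inference variant named in the abstract would presumably weight judges differently, which this sketch does not attempt.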

[41] Anka: A Domain-Specific Language for Reliable LLM Code Generation

Saif Khalfan Saif Al Mazrouei

Main category: cs.CL

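TL;DR: The paper introduces Anka, a domain-specific language for data transformation pipelines whose explicit, constrained syntax reduces LLM errors on complex programming tasks. Despite having no prior training exposure to Anka, Claude 3.5 Haiku reaches 95.8% overall task accuracy and clearly outperforms Python on multi-step tasks (100% vs. 60%).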

Details

Motivation: LLMs make systematic errors when generating complex, multi-step programs; the authors hypothesize that these stem from the flexibility of general-purpose languages and their implicit state management, and they explore whether a DSL designed for LLM generation can make the task easier. Method: Design Anka, a DSL with explicit, constrained syntax, and build a benchmark of 100 problems; supply the language specification in context and evaluate several LLMs on both Anka and Python. Result: Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy on Anka; on multi-step tasks Anka beats Python by 40 percentage points (100% vs. 60%); GPT-4o-mini shows a similar advantage (+26.7 percentage points). Conclusion: LLMs can learn a novel DSL entirely from in-context prompts; constrained syntax significantly reduces errors on complex tasks; and a DSL purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has been extensively trained. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python's flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training. We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.

[42] Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

Dianyun Wang,Qingsen Ma,Yuhu Shang,Zhifeng Lu,Lechen Ning,Zhenbo Xu,Huijia Wu,Zhaofeng He

Main category: cs.CL

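TL;DR: The paper proposes a parameter-efficient fine-tuning method that uses Sparse Autoencoders (SAEs) to construct an interpretable low-rank subspace for adapter initialization. Working in a disentangled feature space addresses the lack of interpretability and control that concept entanglement causes in conventional low-rank adaptation. On safety alignment, the method reaches a 99.6% safety rate while updating only 0.19-0.24% of parameters, exceeding full fine-tuning and approaching RLHF-based methods, and it yields a semantically interpretable alignment subspace.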

Details

Motivation: Low-rank adaptation methods such as LoRA learn the low-rank subspace of task-relevant weight updates implicitly, offering no interpretability or direct control; the authors attribute this to polysemanticity, where individual dimensions encode multiple entangled concepts. Method: Use pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization; provide a theoretical analysis showing that, under monosemanticity assumptions, the subspace recovery error can be made arbitrarily small. Result: On safety alignment, the method achieves up to a 99.6% safety rate, 7.4 percentage points above full fine-tuning and close to RLHF-based methods, while updating only 0.19-0.24% of parameters; it also gives a semantic interpretation of the alignment subspace. Conclusion: Incorporating mechanistic interpretability (such as SAEs) into fine-tuning can improve both performance and transparency, supporting parameter-efficient fine-tuning based on a disentangled feature space. Abstract: Parameter-efficient fine-tuning has become the dominant paradigm for adapting large language models to downstream tasks. Low-rank adaptation methods such as LoRA operate under the assumption that task-relevant weight updates reside in a low-rank subspace, yet this subspace is learned implicitly from data in a black-box manner, offering no interpretability or direct control. We hypothesize that this difficulty stems from polysemanticity--individual dimensions encoding multiple entangled concepts. To address this, we leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization. We provide theoretical analysis proving that under monosemanticity assumptions, SAE-based subspace identification achieves arbitrarily small recovery error, while direct identification in polysemantic space suffers an irreducible error floor. On safety alignment, our method achieves up to 99.6% safety rate--exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods--while updating only 0.19-0.24% of parameters. Crucially, our method provides interpretable insights into the learned alignment subspace through the semantic grounding of SAE features. Our work demonstrates that incorporating mechanistic interpretability into the fine-tuning process can simultaneously improve both performance and transparency.

[43] Chinese Morph Resolution in E-commerce Live Streaming Scenarios

Jiahao Zhu,Jipeng Qiang,Ran Bai,Chenyu Liu,Xiaoye Ouyang

Main category: cs.CL

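TL;DR: The study introduces the Live Auditory Morph Resolution (LiveAMR) task for detecting pronunciation-based evasion in e-commerce live streams, builds the first dataset for it with 86,790 samples, and uses LLM-generated data to improve performance, supporting live-streaming regulation.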

Details

Motivation: Live-stream hosts often use pronunciation-based morphs to evade scrutiny and engage in false advertising, especially in health and medical streams, so effective techniques for detecting this behavior are needed. Method: Recast the LiveAMR task as a text-to-text generation problem and use LLMs to generate additional training data. Result: The authors construct the first LiveAMR dataset and show that the proposed method is effective at resolving pronunciation-based morphs. Conclusion: The method improves the detection of morphs in live streams and helps strengthen compliance oversight of e-commerce live streaming. Abstract: E-commerce live streaming in China, particularly on platforms like Douyin, has become a major sales channel, but hosts often use morphs to evade scrutiny and engage in false advertising. This study introduces the Live Auditory Morph Resolution (LiveAMR) task to detect such violations. Unlike previous morph research focused on text-based evasion in social media and underground industries, LiveAMR targets pronunciation-based evasion in health and medical live streams. We constructed the first LiveAMR dataset with 86,790 samples and developed a method to transform the task into a text-to-text generation problem. By leveraging large language models (LLMs) to generate additional training data, we improved performance and demonstrated that morph resolution significantly enhances live streaming regulation.

[44] AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration

Minjiang Huang,Jipeng Qiang,Yi Zhu,Chaowei Zhang,Xiangyu Zhao,Kui Yu

Main category: cs.CL

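TL;DR: The paper proposes AI4Reading, a multi-agent system built on LLMs and speech synthesis that automatically generates podcast-like audiobook interpretations. Compared with expert-made interpretations, its scripts are simpler and more accurate, but its speech generation quality still lags behind.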

Details

Motivation: Creating audiobook interpretations by hand is time-consuming and resource-intensive, so an automated solution is needed to improve efficiency and accessibility. Method: Build a collaborative framework of 11 specialized agents (topic analysts, case analysts, editors, a narrator, and proofreaders) that uses LLMs and speech synthesis to generate audiobook interpretations. Result: Compared with expert-produced interpretations, the scripts generated by AI4Reading are simpler and more accurate, but a gap remains in speech generation quality. Conclusion: AI4Reading shows promise in preserving content accuracy, improving comprehensibility, and building a logical narrative structure, and it is a step toward automated audiobook interpretation with multi-agent collaboration. Abstract: Audiobook interpretations are attracting increasing attention, as they provide accessible and in-depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time-consuming and resource-intensive. To address this challenge, we propose AI4Reading, a multi-agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast-like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents, including topic analysts, case analysts, editors, a narrator, and proofreaders that work in concert to explore themes, extract real-world cases, refine content organization, and synthesize natural spoken language. By comparing expert interpretations with our system's output, the results show that although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.

[45] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

Jiafeng Liang,Hao Li,Chang Li,Jiaqi Zhou,Shixin Jiang,Zekun Wang,Changkai Ji,Zhihao Zhu,Runxuan Liu,Tao Ren,Jinlan Fu,See-Kiong Ng,Xia Liang,Ming Liu,Bing Qin

Main category: cs.CL

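TL;DR: This survey systematically synthesizes interdisciplinary knowledge of memory, connecting cognitive neuroscience with LLM-driven agents, and discusses applications in AI and future research directions.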

Details

Motivation: Constrained by interdisciplinary barriers, existing work struggles to assimilate the core ideas of human memory mechanisms, so a bridge between cognitive neuroscience and AI agents is needed. Method: Clarify the definition and function of memory along a progression from cognitive neuroscience through LLMs to agents; compare memory taxonomy, storage mechanisms, and the management lifecycle in biological and artificial systems; and review mainstream benchmarks and memory security. Result: The survey provides a comparative analysis of biological and artificial memory systems, summarizes current benchmarks for evaluating agent memory, and examines memory security from both attack and defense perspectives. Conclusion: The authors outline future research directions, including multimodal memory systems and skill acquisition, toward more efficient and secure agent memory design. Abstract: Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesize interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.

[46] A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation

Xin Zhang,Yang Cao,Baoxing Wu,Xinyi Chen,Kai Song,Siying Li

Main category: cs.CL

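TL;DR: The paper proposes SGR, a stepwise reasoning enhancement framework based on external subgraph generation, to improve LLM reasoning on complex tasks.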

Details

Motivation: LLMs struggle on tasks that require deep reasoning and logical inference, and they are prone to incorporating noisy or irrelevant information from their training data, which leads to incorrect or factually inconsistent outputs. Method: Dynamically construct query-relevant subgraphs from external knowledge bases, use their semantic structure to guide multi-step reasoning, and integrate multiple reasoning paths to produce the final answer. Result: Experiments on multiple benchmark datasets show that SGR consistently outperforms strong baselines and improves reasoning accuracy. Conclusion: By combining external knowledge with a structured reasoning process, SGR strengthens LLM reasoning and reduces the influence of noisy information. Abstract: Large Language Models (LLMs) have achieved strong performance across a wide range of natural language processing tasks in recent years, including machine translation, text generation, and question answering. As their applications extend to increasingly complex scenarios, however, LLMs continue to face challenges in tasks that require deep reasoning and logical inference. In particular, models trained on large scale textual corpora may incorporate noisy or irrelevant information during generation, which can lead to incorrect predictions or outputs that are inconsistent with factual knowledge. To address this limitation, we propose a stepwise reasoning enhancement framework for LLMs based on external subgraph generation, termed SGR. The proposed framework dynamically constructs query relevant subgraphs from external knowledge bases and leverages their semantic structure to guide the reasoning process. By performing reasoning in a step by step manner over structured subgraphs, SGR reduces the influence of noisy information and improves reasoning accuracy. Specifically, the framework first generates an external subgraph tailored to the input query, then guides the model to conduct multi step reasoning grounded in the subgraph, and finally integrates multiple reasoning paths to produce the final answer. Experimental results on multiple benchmark datasets demonstrate that SGR consistently outperforms strong baselines, indicating its effectiveness in enhancing the reasoning capabilities of LLMs.

[47] Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data

Jiapeng Wang,Yiwen Hu,Yanzipeng Gao,Haoyu Wang,Shuo Wang,Hongyu Lu,Jiaxin Mao,Wayne Xin Zhao,Junyi Li,Xiao Zhang

Main category: cs.CL

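TL;DR: The paper proposes EntroDrop, an entropy-guided token dropout method that mitigates the performance degradation LLMs suffer from repeated data exposure during multi-epoch training.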

Details

Motivation: Because high-quality domain-specific data is scarce, multi-epoch training has become a common strategy for adapting LLMs, but autoregressive models often degrade under repeated data exposure, with overfitting reducing generalization. Method: Empirical analysis shows that predictable, low-entropy tokens are learned quickly and come to dominate optimization, while generalization on high-entropy tokens deteriorates with continued training. EntroDrop addresses this by selectively masking low-entropy tokens during training and using a curriculum schedule to adjust regularization strength. Result: Experiments on models from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and stays robust throughout extended multi-epoch training. Conclusion: When training LLMs on limited data, aligning regularization with token-level learning dynamics is important, and EntroDrop offers an effective approach for data-constrained domains. Abstract: As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer from performance degradation under repeated data exposure, where overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model's ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.
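
A minimal sketch of the idea of entropy-guided masking with a curriculum may help make the description concrete. Everything specific here is an assumption rather than the EntroDrop recipe: the fixed entropy threshold, the linear ramp of the drop rate, and the choice to return a keep/drop mask are all hypothetical, and the abstract does not say whether the entropy comes from the model being trained or from another source.

```python
import math
import random


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def drop_rate(step, total_steps, max_rate=0.5):
    """Assumed curriculum: drop rate grows linearly from 0 to max_rate."""
    return max_rate * min(1.0, step / total_steps)


def entropy_guided_mask(token_probs, step, total_steps,
                        threshold=0.5, max_rate=0.5, rng=None):
    """Return one flag per token: 1 = keep, 0 = masked out.

    Only tokens whose entropy is below `threshold` are eligible for
    masking; each eligible token is dropped with the scheduled rate.
    """
    rng = rng or random.Random(0)
    rate = drop_rate(step, total_steps, max_rate)
    mask = []
    for probs in token_probs:
        low_entropy = token_entropy(probs) < threshold
        dropped = low_entropy and rng.random() < rate
        mask.append(0 if dropped else 1)
    return mask


# Hypothetical per-token distributions over a 3-word vocabulary.
token_probs = [
    [0.98, 0.01, 0.01],  # predictable, low entropy
    [0.40, 0.30, 0.30],  # uncertain, high entropy
    [1.00, 0.00, 0.00],  # fully predictable, zero entropy
]
early_mask = entropy_guided_mask(token_probs, step=0, total_steps=100)
late_mask = entropy_guided_mask(token_probs, step=100, total_steps=100,
                                max_rate=1.0)
```

In this sketch nothing is masked at step 0, and with `max_rate=1.0` every low-entropy token is masked by the final step, while the high-entropy token is always kept.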

[48] The Effect of Gender Diversity on Scientific Team Impact: A Team Roles Perspective

Yi Zhao,Yongjun Zhu,Donghun Kim,Yuzhuo Wang,Heng Zhang,Chao Lu,Chengzhi Zhang

Main category: cs.CL

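TL;DR: Drawing on more than 130,000 PLOS papers, the study examines how gender diversity in leadership and support roles affects the impact of scientific teams (five-year citation counts). It finds an inverted U-shaped relationship that is moderated by team size.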

Details

Motivation: Prior findings on gender diversity and scientific team success are inconsistent, and most studies overlook role differentiation within teams; this work examines how gender diversity in different roles relates to team impact. Method: Treat all coauthors of a paper as a scientific team, use author contribution statements to classify members into leadership and support roles, and measure team impact by five-year citation counts; apply multivariable regression and a threshold regression model to more than 130,000 PLOS papers to analyze the relationship between gender diversity and team impact and the moderating role of team size. Result: (1) Gender diversity in both the leadership and the support group has an inverted U-shaped relationship with team impact; (2) teams with an all-female leadership group and an all-male support group achieve the highest impact; (3) the effect of leadership-group gender diversity is significantly negative in small teams but positive and statistically insignificant in large teams, whereas the effect of support-group gender diversity is significant and positive regardless of team size. Conclusion: The effect of gender diversity on team impact depends on role and team size; diversity in support roles has the more stable positive association, and the internal division of roles is key to understanding diversity effects. Abstract: The influence of gender diversity on the success of scientific teams is of great interest to academia. However, prior findings remain inconsistent, and most studies operationalize diversity in aggregate terms, overlooking internal role differentiation. This limitation obscures a more nuanced understanding of how gender diversity shapes team impact. In particular, the effect of gender diversity across different team roles remains poorly understood. To this end, we define a scientific team as all coauthors of a paper and measure team impact through five-year citation counts. Using author contribution statements, we classified members into leadership and support roles. Drawing on more than 130,000 papers from PLOS journals, most of which are in biomedical-related disciplines, we employed multivariable regression to examine the association between gender diversity in these roles and team impact. Furthermore, we apply a threshold regression model to investigate how team size moderates this relationship. The results show that (1) the relationship between gender diversity and team impact follows an inverted U-shape for both leadership and support groups; (2) teams with an all-female leadership group and an all-male support group achieve higher impact than other team types. Interestingly, (3) the effect of leadership-group gender diversity is significantly negative for small teams but becomes positive and statistically insignificant in large teams. In contrast, the estimates for support-group gender diversity remain significant and positive, regardless of team size.

[49] C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

Xuan Feng,Bo An,Tianlong Gu,Liang Chang,Fengrui Hao,Peipeng Yu,Shuai Zhao

Main category: cs.CL

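TL;DR: The paper proposes Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework that mitigates stereotypical and structural biases in LLMs at the same time. It uses causal counterfactual signals to isolate bias-inducing features and dynamically suppresses shortcut features during optimization; experiments on multiple benchmarks show that it is effective without harming general reasoning ability.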

Details

Motivation: Existing methods usually address stereotypical and structural biases in isolation, so mitigating one can exacerbate the other; the paper tackles both systematically and identifies latent spurious feature correlations in the input as a primary cause of these reasoning shortcuts. Method: C2PO uses causal counterfactual signals to separate bias-inducing features from valid reasoning paths, and a fairness-sensitive preference update mechanism that dynamically evaluates logit-level contributions to suppress shortcut features. Result: Experiments on BBQ, Unqover, MNLI, HANS, Chatbot, MT-Bench, StereoSet, and WinoBias show that C2PO mitigates both kinds of bias while preserving general performance on MMLU and GSM8K. Conclusion: C2PO offers a unified and effective way to jointly mitigate stereotypical and structural biases in LLMs without sacrificing overall reasoning ability. Abstract: Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.

[50] ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning

Yuqi Tang,Jing Yu,Zichang Su,Kehua Feng,Zhihui Zhu,Libin Wang,Lei Liang,Qiang Zhang,Keyan Ding,Huajun Chen

Main category: cs.CL

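TL;DR: The paper proposes ClinDEF, a dynamic framework grounded in a disease knowledge graph that evaluates the clinical reasoning of LLMs through simulated diagnostic dialogues, addressing the limits of static benchmarks.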

Details

Motivation: Existing medical benchmarks for LLMs focus on static question answering, which does not capture the dynamic clinical reasoning of physicians in multi-turn doctor-patient interaction, and existing dynamic frameworks lack granular, multi-level evaluation. Method: Build ClinDEF, a dynamic framework grounded in a disease knowledge graph that generates patient cases and runs multi-turn interactions between an LLM-based doctor and an automated patient agent; add fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Result: Experiments show that ClinDEF exposes critical clinical reasoning gaps in state-of-the-art LLMs and yields more nuanced, clinically meaningful evaluation than conventional approaches. Conclusion: ClinDEF offers a more comprehensive, dynamic evaluation paradigm for LLM clinical reasoning that is closer to the real diagnostic process. Abstract: Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, determine examination and refine differential diagnosis through patients' response. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To mitigate these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.

[51] Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Ang Lv,Jin Ma,Yiyuan Ma,Siyuan Qiao

Main category: cs.CL

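TL;DR: The paper proposes the expert-router coupling (ERC) loss, a lightweight auxiliary loss that constrains the activations between each expert and its router embedding, improving MoE performance and enabling quantitative control of expert specialization.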

Details

Motivation: Existing MoE models lack explicit alignment between router decisions and expert capabilities, which limits model performance. Method: Treat each expert's router embedding as a proxy token for the tokens assigned to that expert, perturb the router embeddings and feed them through the experts to obtain internal activations, and design the ERC loss to enforce two constraints: each expert activates more strongly for its own proxy token than for any other expert's, and each proxy token elicits stronger activation from its own expert than from any other expert. Result: The ERC loss is computationally efficient, operating on only n^2 activations (n is the number of experts), a fixed cost independent of batch size; its effectiveness is shown by pre-training MoE-LLMs from 3B to 15B parameters with analysis on trillions of tokens, and it supports quantitative tracking of expert specialization. Conclusion: The ERC loss strengthens the alignment between router decisions and expert capabilities, improves MoE performance, and provides control over and insight into expert specialization during training. Abstract: Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
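
The two constraints can be written down as a small hinge-style penalty over an n-by-n activation matrix. This is a sketch under stated assumptions, not the paper's loss: the hinge form, the `margin` parameter, and the normalization are hypothetical, and the step that produces the activations (feeding perturbed router embeddings through the experts) is replaced by a hand-written matrix where `act[i][j]` is the activation of expert `i` on the proxy token of expert `j`.

```python
def erc_style_loss(act, margin=0.0):
    """Hinge penalty for violations of the two coupling constraints.

    act[i][j]: activation of expert i on proxy token j (n x n, so the
    cost depends only on the number of experts, not on batch size).
    """
    n = len(act)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # (1) Expert i should respond more to its own proxy token i
            #     than to the proxy token of expert j.
            total += max(0.0, act[i][j] - act[i][i] + margin)
            # (2) Proxy token i should activate its own expert i more
            #     than it activates expert j.
            total += max(0.0, act[j][i] - act[i][i] + margin)
    return total / (n * (n - 1))


# Hypothetical activations for two experts.
well_coupled = [[2.0, 0.5],
                [0.1, 1.5]]
poorly_coupled = [[0.5, 2.0],
                  [0.1, 1.5]]
loss_good = erc_style_loss(well_coupled)
loss_bad = erc_style_loss(poorly_coupled)
```

The penalty is zero when the diagonal dominates every row and column, and it grows as off-diagonal activations overtake the diagonal.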

[52] Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

Thomas Haschka,Joseph Bakarji

Main category: cs.CL

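TL;DR: Proposes a hierarchical semantic text classification method based on nested density clustering that builds a tree of semantic relationships from LLM embeddings, discovers research areas and their subfields without predefined categories, and is tested for cross-domain robustness on several benchmark datasets.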

Details

Motivation: LLM embeddings capture semantic similarity, but the global semantic structure of a text corpus remains opaque, and unsupervised methods that expose hierarchical semantic relationships are lacking. Method: A nested density clustering approach in LLM embedding space: as the density threshold is gradually relaxed, dense clusters merge into more diffuse ones, forming a hierarchical clustering tree that covers the whole dataset and captures the semantic hierarchy among texts. Result: The method is applied to scientific abstracts, 20 Newsgroups, and IMDB movie reviews; it discovers research areas and subfields automatically, shows robustness across domains, and can support scientometrics and topic-evolution analysis. Conclusion: Nested density clustering reveals hierarchical semantic structure in text data and supports unsupervised, data-driven text classification and analysis of semantic evolution. Abstract: Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high dimensional embeddings. While LLM-embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure of semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster -- the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data, using scientific abstracts as a case study. This enables the data-driven discovery of research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as the 20 Newsgroups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally, we discuss possible applications in scientometrics and topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.
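
The merging behaviour described above can be sketched on toy data. This is a minimal illustration under assumptions, not the paper's algorithm: the abstract does not define the density criterion, so it is approximated here by connectivity under a distance threshold `eps` (a single-linkage-like stand-in), and 1-D toy points replace LLM embeddings. The names `components` and `nested_levels` are hypothetical.

```python
import math


def components(points, eps):
    """Group point indices into connected components of the eps-neighbourhood graph."""
    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def nested_levels(points, eps_schedule):
    """One clustering per threshold, from the strictest (small eps) to the loosest."""
    return [components(points, eps) for eps in sorted(eps_schedule)]


points = [(0.0,), (0.1,), (1.0,), (1.1,), (5.0,)]
levels = nested_levels(points, [0.2, 1.5, 6.0])
```

Because every cluster at a strict threshold is contained in exactly one cluster at the next looser threshold, the list of levels can be read as a tree whose root is the single cluster at the loosest threshold.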

[53] Automatic Detection of Complex Quotation Patterns in Aggadic Literature

Hadar Miller,Tsvi Kuflik,Moshe Lavee

Main category: cs.CL

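TL;DR: Presents ACT, a three-stage algorithm for automatically detecting biblical quotations in Rabbinic literature; it reaches an F1 score of 0.91 and outperforms existing systems.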

Details

Motivation: Existing text reuse frameworks struggle with short, paraphrased, or structurally embedded quotations, especially in morphologically rich and citation-dense textual traditions, so a more precise detection method is needed. Method: ACT comprises three stages: morphology-aware alignment, context-sensitive enrichment, and classification. Several configurations (ACT-QE, ACT-2, ACT-3) are compared in an ablation study and evaluated against leading systems and human-annotated critical editions. Result: ACT-QE achieves an F1 score of 0.91 with Recall 0.89 and Precision 0.94, outperforming all baselines; ACT-2 has higher Recall but lower Precision, and ACT-3 trades off coverage against specificity. Conclusion: ACT improves the detection of complex quotation patterns in religious texts, addresses the methodological gap between automated detection and human editorial judgment, and offers a new tool for genre classification and intertextual analysis in the digital humanities. Abstract: This paper presents ACT (Allocate Connections between Texts), a novel three-stage algorithm for the automatic detection of biblical quotations in Rabbinic literature. Unlike existing text reuse frameworks that struggle with short, paraphrased, or structurally embedded quotations, ACT combines a morphology-aware alignment algorithm with a context-sensitive enrichment stage that identifies complex citation patterns such as "Wave" and "Echo" quotations. Our approach was evaluated against leading systems, including Dicta, Passim, Text-Matcher, as well as human-annotated critical editions. We further assessed three ACT configurations to isolate the contribution of each component. Results demonstrate that the full ACT pipeline (ACT-QE) outperforms all baselines, achieving an F1 score of 0.91, with superior Recall (0.89) and Precision (0.94). Notably, ACT-2, which lacks stylistic enrichment, achieves higher Recall (0.90) but suffers in Precision, while ACT-3, using longer n-grams, offers a tradeoff between coverage and specificity. In addition to improving quotation detection, ACT's ability to classify stylistic patterns across corpora opens new avenues for genre classification and intertextual analysis. This work contributes to digital humanities and computational philology by addressing the methodological gap between exhaustive machine-based detection and human editorial judgment. ACT lays a foundation for broader applications in historical textual analysis, especially in morphologically rich and citation-dense traditions like Aggadic literature.
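
As a quick consistency check on the reported ACT-QE figures, F1 is the harmonic mean of Precision and Recall. The short sketch below uses only the numbers quoted in the abstract; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision 0.94 and Recall 0.89 as reported for ACT-QE.
print(round(f1_score(0.94, 0.89), 2))  # 0.91
```

The result agrees with the F1 score of 0.91 stated for the full pipeline.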

[54] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

Fengjiao Chen,Minhao Jing,Weitao Lu,Yan Feng,Xiaoyu Li,Xuezhi Cao

Main category: cs.CL

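TL;DR: Studies whether generation tasks improve visual understanding under large-scale pretraining, using the unified model UniHetero; finds that generating semantics rather than pixels is what improves understanding, that generation shows better data scaling and data utilization, and that autoregression on input embeddings helps capture visual detail.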

Details

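Motivation: To explore whether visual generation tasks can enhance visual understanding, particularly for unified modeling at large data scale. Method: Proposes UniHetero, a unified model with a concise structure, pretrains it on more than 200M samples, and analyzes the relationship between generation and understanding. Result: Semantic generation improves understanding; generation tasks show a better data-scaling trend and higher data utilization; autoregression on input embeddings captures visual details effectively. Conclusion: Generation can promote understanding, but the key is to generate semantics rather than pixels, and generation tasks show greater potential and efficiency at large data scale. Abstract: Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze a unified model with a concise structure, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but Only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective to capture visual details.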

[55] Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias

Hazel Kim,Philip Torr

Main category: cs.CL

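TL;DR: Proposes Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework for mitigating input confirmation bias in large language models. By mixing experts over different latent concepts (instantiated as different activation strengths), MoLaCE emulates the benefits of debate within a single model, improving factual accuracy and robustness while remaining computationally efficient and scalable.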

Details

Motivation: Large language models are vulnerable to input confirmation bias: when a prompt implies a preferred answer, the model tends to reinforce it rather than explore alternatives. This is already harmful in base models and riskier still in multi-agent debate, where echo chambers reinforce bias instead of correcting it. Method: MoLaCE mixes, at inference time, experts defined by different activation strengths over latent concepts. It builds on the observation that, because language is compositional, differently phrased prompts reweight the latent concepts that affect factual correctness in prompt-specific ways. Result: Experiments show that MoLaCE consistently reduces confirmation bias and improves robustness, matching or surpassing multi-agent debate at a fraction of the computation. It can also be integrated into multi-agent debate frameworks to diversify perspectives and reduce correlated errors. Conclusion: MoLaCE is an effective, efficient, and scalable way to reduce confirmation bias in large language models without sacrificing performance, applicable to both single models and multi-agent systems. Abstract: Large language models (LLMs) are highly vulnerable to input confirmation bias. When a prompt implies a preferred answer, models often reinforce that bias rather than explore alternatives. This phenomenon remains underexplored, yet it is already harmful in base models and poses an even greater risk in multi-agent debate, where echo chambers reinforce bias instead of correcting it. We introduce Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework that addresses confirmation bias by mixing experts instantiated as different activation strengths over latent concepts that shape model responses. Our key insight is that, due to the compositional nature of language, differently phrased prompts reweight latent concepts in prompt-specific ways that affect factual correctness, so no single fixed intervention can be applied universally across inputs. This design enables a single LLM to emulate the benefits of debate internally while remaining computationally efficient and scalable. It can also be integrated into multi-agent debate frameworks to diversify perspectives and reduce correlated errors. We empirically show that it consistently reduces confirmation bias, improves robustness, and matches or surpasses multi-agent debate while requiring only a fraction of the computation.

[56] Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs

Sahil Kale,Antonio Luca Alfeo

Main category: cs.CL

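TL;DR: Proposes a knowledge-graph-based method for LLM hallucination self-detection that converts model outputs into knowledge graphs of entities and relations, achieving up to 16% relative improvement in accuracy and 20% in F1-score over existing methods.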

Details

Motivation: Hallucinations are a major barrier to the safe deployment of large language models, and existing self-detection methods leave room for improvement in identifying false statements in generated content. Method: LLM responses are converted into knowledge graphs of entities and relations, and the graph structure is used to estimate the likelihood that a response contains hallucinations. Result: The method is validated on two widely used models, GPT-4o and Gemini-2.5-Flash, across two datasets; compared with standard self-detection methods and SelfCheckGPT, it improves accuracy by up to 16% and F1-score by up to 20% (relative). One of the datasets was manually curated and is released to support future research. Conclusion: Structured knowledge representations such as knowledge graphs help LLMs analyze atomic facts and improve hallucination detection even when the initial output contains errors; the approach is low-cost and model-agnostic, and supports safer and more trustworthy language models. Abstract: Hallucinations, the generation of apparently convincing yet false statements, remain a major barrier to the safe deployment of LLMs. Building on the strong performance of self-detection methods, we examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Specifically, we propose a simple yet powerful approach that enriches hallucination self-detection by (i) converting LLM responses into knowledge graphs of entities and relations, and (ii) using these graphs to estimate the likelihood that a response contains hallucinations. We evaluate the proposed approach using two widely used LLMs, GPT-4o and Gemini-2.5-Flash, across two hallucination detection datasets. To support more reliable future benchmarking, one of these datasets has been manually curated and enhanced and is released as a secondary outcome of this work. Compared to standard self-detection methods and SelfCheckGPT, a state-of-the-art approach, our method achieves up to 16% relative improvement in accuracy and 20% in F1-score. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs, even when initial outputs contain inaccuracies. This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.

[57] Instruction-Following Evaluation of Large Vision-Language Models

Daiki Shiono,Shumpei Miyawaki,Ryota Tanaka,Jun Suzuki

Main category: cs.CL

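TL;DR: Finds that large vision-language models (LVLMs) lose instruction-following ability after fine-tuning on commonly used datasets; using new training datasets that vary whether the output format is specified, it shows that training data with output-format instructions improves LVLMs' instruction-following accuracy.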

Details

Motivation: LVLMs have been observed to lose the instruction-following ability of the underlying LLM after visual instruction tuning, so the causes of this degradation need to be quantified and mitigations found. Method: Constructs new training datasets that differ in whether they include output-format instructions, and quantitatively evaluates LVLMs' instruction-following under each setting. Result: Fine-tuning on commonly used datasets reduces LVLMs' instruction-following ability, while data containing output-format instructions leads models to follow instructions more accurately. Conclusion: Including explicit output-format instructions during visual instruction tuning helps mitigate the decline in LVLMs' instruction-following ability. Abstract: Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets that include instructions on output format tend to follow instructions more accurately than models trained without them. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.

[58] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

Yu-Xiang Lin,Cheng-Han Chiang,Hung-yi Lee

Main category: cs.CL

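TL;DR: Studies why spoken language models (SLMs) fail to maintain an instructed speaking style across multi-turn conversations, a problem termed "style amnesia", and finds it in current proprietary and open-source SLMs alike.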

Details

Motivation: To examine whether SLMs can maintain paralinguistic styles such as emotion, accent, volume, and speaking speed over long conversations, and to expose weaknesses in style consistency. Method: Evaluates three proprietary and two open-source SLMs on maintaining an instructed speaking style across multi-turn conversations, and tests how different prompting strategies (for example, user messages versus system messages) affect style maintenance. Result: None of the tested SLMs keeps the instructed style after several turns; the models can recall the style instruction but fail to express it consistently; placing the style instruction in user messages works better than placing it in system messages; explicitly asking the model to recall the instruction partially mitigates style amnesia. Conclusion: Current SLMs suffer from pronounced style amnesia, and system prompts do not effectively support style consistency; better prompting mechanisms and training methods are needed. Abstract: In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.

[59] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

Yuwen Li,Wei Zhang,Zelong Huang,Mason Yang,Jiajun Wu,Shawn Guo,Huahao Hu,Lingyi Sun,Jian Yang,Mingjie Tang,Byran Dai

Main category: cs.CL

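TL;DR: InfTool is a fully autonomous framework that uses self-evolving multi-agent synthesis to generate diverse, verified tool-calling trajectories from raw API specifications alone, substantially improving LLM tool-calling performance without human annotation.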

Details

Motivation: LLM tool calling faces three challenges: reliance on expensive human annotation of high-quality trajectories, poor generalization to unseen tools, and the quality ceiling of single-model synthesis, which perpetuates biases and coverage gaps. Method: InfTool uses three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) to generate tool-calling trajectories automatically, and trains the model with Group Relative Policy Optimization (GRPO) with gated rewards, forming a closed-loop iterative system that needs no human intervention. Result: On the Berkeley Function-Calling Leaderboard (BFCL), InfTool raises a 32B base model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and approaching Claude-Opus, using only synthetic training data. Conclusion: InfTool delivers large-scale gains in tool-calling ability without human annotation, removing the cost, generalization, and quality bottlenecks of existing methods and offering an efficient, scalable route to autonomous agents. Abstract: Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and quality ceilings inherent in single-model synthesis that perpetuate biases and coverage gaps. We introduce InfTool, a fully autonomous framework that breaks these barriers through self-evolving multi-agent synthesis. Given only raw API specifications, InfTool orchestrates three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) to generate diverse, verified trajectories spanning single-turn calls to complex multi-step workflows. The framework establishes a closed loop: synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, the improved model generates higher-quality data targeting capability gaps, and this cycle iterates without human intervention. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) demonstrate that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus, entirely from synthetic data without human annotation.
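
Two pieces of arithmetic behind this entry can be sketched briefly: the relative gain quoted for BFCL, and the group-relative advantage that gives GRPO its name. This is a sketch under assumptions: the function names are illustrative, the use of the population standard deviation and the `eps` term are choices made here, and the paper's gated reward is not modelled because the abstract does not define it.

```python
from statistics import mean, pstdev


def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# 19.8% -> 70.9% accuracy, as reported on BFCL.
print(round(relative_gain(19.8, 70.9) * 100))  # 258
```

The computed gain matches the +258% figure in the abstract, which is a relative improvement rather than a difference in percentage points (the absolute gain is 51.1 points).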

[60] A Dataset and Benchmark for Consumer Healthcare Question Summarization

Abhishek Basu,Deepak Gupta,Dina Demner-Fushman,Shweta Yadav

Main category: cs.CL

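TL;DR: Introduces CHQ-Sum, a new dataset of 1507 consumer health questions with summaries annotated by domain experts, intended to advance research on summarizing consumer healthcare questions.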

Details

Motivation: Consumers tend to describe their medical needs with lengthy, peripheral information, which makes natural language understanding difficult, and the lack of an expert-annotated dataset has held back the development of efficient summarization systems. Method: Consumer health questions are collected from a community question answering forum, annotated and summarized by domain experts to build the CHQ-Sum dataset, and benchmarked on multiple state-of-the-art summarization models. Result: The resulting CHQ-Sum dataset contains 1507 samples, and experiments on several state-of-the-art summarization models show its usefulness. Conclusion: CHQ-Sum is a valuable resource for automatic summarization of consumer health questions and supports further research in this area. Abstract: The quest for seeking health information has swamped the web with consumers' health-related questions. Generally, consumers use overly descriptive and peripheral information to express their medical condition or other healthcare needs, contributing to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and distill the key information of the original question. Recently, large-scale datasets have significantly propelled the development of several summarization tasks, such as multi-document summarization and dialogue summarization. However, a lack of a domain-expert annotated dataset for the consumer healthcare questions summarization task inhibits the development of an efficient summarization system. To address this issue, we introduce a new dataset, CHQ-Sum, that contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark the dataset on multiple state-of-the-art summarization models to show the effectiveness of the dataset.

[61] Nested Browser-Use Learning for Agentic Information Seeking

Baixuan Li,Jialong Wu,Wenbiao Yin,Kuan Li,Zhongwang Zhang,Huifeng Yin,Zhengwei Tao,Liwen Zhang,Pengjun Xie,Jingren Zhou,Yong Jiang

Main category: cs.CL

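TL;DR: Proposes NestBrowse, a nested browser-use learning framework for information-seeking agents that decouples interaction control from page exploration, simplifying agentic reasoning and enabling effective deep-web information acquisition.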

Details

Motivation: The tool use of existing information-seeking agents is limited to API-level snippet retrieval and URL-based page fetching, so they cannot exploit the richer information available through real browsing. Method: Proposes NestBrowse, a minimal and complete browser-action framework that uses a nested structure to separate interaction control from page exploration. Result: On challenging deep information-seeking benchmarks, NestBrowse shows clear practical benefits, and in-depth analyses underscore its efficiency and flexibility. Conclusion: NestBrowse improves the performance of information-seeking agents in complex web environments and offers a workable approach to deeper web information acquisition. Abstract: Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.

[62] Less is more: Probabilistic reduction is best explained by small-scale predictability measures

Cassandra L. Jacobs,Andrés Buxó-Lugo,Anna K. Taylor,Marie Leopold-Hooke

Main category: cs.CL

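TL;DR: Studies how much context is appropriate when relating language model probabilities to cognitive phenomena, and finds that n-gram representations suffice as cognitive units of planning.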

Details

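Motivation: To examine how context length affects a language model's ability to explain cognitive phenomena, and to identify the smallest effective unit of context. Method: Analyzes probabilistic reduction, comparing whole-utterance and n-gram representations for modeling cognitive processes. Result: Whole utterances are not needed; n-gram representations are sufficient to capture cognitive planning in language production. Conclusion: N-gram-level context is adequate for studying the link between language models and cognitive phenomena, which reduces modeling complexity. Abstract: The primary research questions of this paper center on defining the amount of context that is necessary and/or appropriate when investigating the relationship between language model probabilities and cognitive phenomena. We investigate whether whole utterances are necessary to observe probabilistic reduction and demonstrate that n-gram representations suffice as cognitive units of planning.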

[63] Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

Panagiotis Theocharopoulos,Ajinkya Kulkarni,Mathew Magimai.-Doss

Main category: cs.CL

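TL;DR: Builds a dataset of roughly 500 papers accepted to ICML and evaluates how hidden adversarial prompts embedded in different languages affect LLM reviews; injections in English, Japanese, and Chinese substantially change review scores and accept/reject decisions, while Arabic injections have little to no effect.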

Details

Motivation: As large language models (LLMs) are increasingly considered for high-impact workflows such as academic peer review, their security risks, in particular document-level hidden prompt injection, need to be assessed. Method: The authors build a dataset of roughly 500 real papers accepted to ICML, embed semantically equivalent hidden adversarial instructions in four languages into each document, have an LLM review the papers, and analyze the effect of the injection language on the reviews. Result: Hidden prompt injections in English, Japanese, and Chinese substantially change LLM-generated review scores and accept/reject decisions, while Arabic injections have almost no effect. Conclusion: LLM-based reviewing systems are susceptible to document-level prompt injection, and vulnerability differs notably across languages, underlining the importance of multilingual defenses. Abstract: Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.

[64] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

Deepak Babu Piskala

Main category: cs.CL

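TL;DR: Presents ProfASR-Bench, an automatic speech recognition (ASR) evaluation suite for high-stakes professional settings across finance, medicine, legal, and technology, targeting dense domain terminology, formal register variation, and low tolerance for critical entity errors. The suite supports context-conditioned evaluation, with entity-aware scoring and slice-wise reporting by accent and gender. Experiments show that textual context prompts barely improve word error rate for current mainstream ASR models, a "context-utilization gap" (CUG).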

Details

Motivation: Existing ASR benchmarks underplay the challenges of professional settings, such as dense domain terminology, formal register variation, and near-zero tolerance for errors on critical entities, so a benchmark closer to the needs of high-stakes applications is required. Method: ProfASR-Bench pairs natural-language prompts with entity-rich target utterances across finance, medicine, legal, and technology, and supports testing under no-context, profile, domain+profile, oracle, and adversarial conditions. Two representative model families, Whisper and Qwen-Omni, are evaluated with conventional WER, entity-aware scores, and slice-wise analysis by accent and gender. Result: Across prompt conditions, including oracle prompts, average word error rate (WER) barely improves, and adversarial prompts do not reliably degrade performance. Current models accept prompts but do not make effective use of external context, revealing a clear context-utilization gap (CUG). Conclusion: ProfASR-Bench provides a standardized context ladder and fine-grained, entity-aware evaluation for professional ASR, exposes the weakness of current models in using external context, and calls for research on more effective context fusion strategies. Abstract: Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. Using representative families Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models) under matched no-context, profile, domain+profile, oracle, and adversarial conditions, we find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families.
Dataset: https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench Code: https://github.com/prdeepakbabu/ProfASR-Bench
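
For readers unfamiliar with the headline metric, word error rate is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a generic textbook implementation, not code from the ProfASR-Bench repository; the function name `word_error_rate` and the example sentences are illustrative, and a non-empty reference is assumed.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the patient takes metformin daily",
                      "the patient takes met forming daily")
print(wer)  # 0.4
```

The example also hints at why entity-aware scores are reported alongside WER: a single misrecognized drug name costs only two word errors out of five here, yet it is the error that matters most.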

[65] Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans

Sky CH-Wang,Justin Svegliato,Helen Appel,Jason Eisner

Main category: cs.CL

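TL;DR: Proposes an incremental, preference-supervised fine-tuning method for language models that uses fine-grained feedback and chains of stepwise rewrites for direct alignment, outperforming standard A/B preference ranking and full rewrites.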

Details

Motivation: Conventional preference learning compares whole responses (for example, A/B ranking) and makes poor use of localized improvement signals, which lowers training efficiency and makes targeted optimization hard. Method: Annotators mark "liked" and "disliked" spans in a response and explain why; the model rewrites the disliked spans from left to right, forming an improvement chain; adjacent steps in the chain are used as preference pairs for direct alignment training. Result: The method outperforms standard A/B preference ranking and full contrastive rewrites in preference tuning, capturing localized improvement signals more efficiently. Conclusion: Structured, revision-based supervision makes better use of human feedback and yields finer-grained, more efficient language model alignment. Abstract: We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking "liked" and "disliked" spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
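
The pairing step is simple enough to sketch. The snippet below assumes an improvement chain is available as a list of strings, with the original response first and each later entry rewriting one more disliked span; the function name, the `chosen`/`rejected` keys, and the example texts are hypothetical and not taken from the paper's dataset.

```python
def preference_pairs(chain):
    """Turn an improvement chain into preference pairs from adjacent steps.

    chain[0] is the original response; chain[k + 1] is preferred over chain[k].
    """
    return [
        {"rejected": earlier, "chosen": later}
        for earlier, later in zip(chain, chain[1:])
    ]


chain = [
    "The drug is totally safe. Ask someone else about dosage.",
    "The drug is generally well tolerated. Ask someone else about dosage.",
    "The drug is generally well tolerated. A pharmacist can advise on dosage.",
]
pairs = preference_pairs(chain)
```

Each pair differs by a single localized edit, which is what lets the training signal target the specific span that changed rather than the response as a whole.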

[66] Eliciting Behaviors in Multi-Turn Conversations

Jing Huang,Shujian Zhang,Lun Wang,Andrew Hard,Rajiv Mathews,John Lambert

Main category: cs.CL

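TL;DR: Studies how to elicit specific behaviors from large language models in multi-turn conversations, proposes a framework that categorizes existing methods, and compares their trade-off between query budget and success rate; online methods substantially raise the elicitation success rate with only a few thousand queries.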

Details

Motivation: Prior work on behavior elicitation focuses on single-turn settings and leaves multi-turn conversations underexplored, so elicitation methods for the multi-turn setting need systematic analysis and improvement. Method: Proposes an analytical framework that groups existing methods into three families (prior knowledge only, offline interactions, and online learning) and introduces a generalized multi-turn formulation of the online method that unifies single-turn and multi-turn elicitation. Result: Online methods reach average success rates of 45%, 19%, and 77% on three tasks with only a few thousand queries, whereas static methods find few or no failure cases on the same tasks. Conclusion: Online interaction methods are more effective for generating multi-turn test cases, which underlines the need to move toward dynamic evaluation benchmarks. Abstract: Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.

cs.CV [Back]

[67] Characterizing Motion Encoding in Video Diffusion Timesteps

Vatsal Baherwani,Yixuan Ren,Abhinav Shrivastava

Main category: cs.CV

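TL;DR: Through a large-scale quantitative study, characterizes how video diffusion models separate motion from appearance, proposes a motion-appearance disentanglement principle based on timestep ranges, and uses it to simplify existing motion customization methods.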

Details

Motivation: Practitioners commonly rely on the heuristic that early timesteps control motion and later timesteps refine appearance, but this behavior has not been systematically characterized or understood. Method: New conditions are injected over specified timestep ranges, and the resulting trade-off between appearance editing and motion preservation is measured as a proxy for motion encoding, validated through large-scale experiments across architectures. Result: Early timesteps consistently dominate motion and later timesteps dominate appearance, yielding an operational motion-appearance boundary in timestep space; restricting training and inference to the motion-dominant regime is enough for effective motion transfer. Conclusion: The work turns an empirical heuristic into a usable spatiotemporal disentanglement principle; the method needs no auxiliary modules or specialized objectives and integrates easily into existing video editing and motion transfer frameworks. Abstract: Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space. Building on this characterization, we simplify the current one-shot motion customization paradigm by restricting training and inference to the motion-dominant regime, achieving strong motion transfer without auxiliary debiasing modules or specialized objectives. Our analysis turns a widely used heuristic into a spatiotemporal disentanglement principle, and our timestep-constrained recipe can be readily integrated into existing motion transfer and editing methods.
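
The idea of injecting a condition only over a chosen range of denoising steps can be sketched as a boolean schedule. This is a toy illustration under assumptions: the function name `condition_schedule`, the fractional parameterisation, and the 0.4 boundary in the example are hypothetical, since the abstract does not state where the motion-appearance boundary lies for any model.

```python
def condition_schedule(num_steps, start_frac, end_frac):
    """Per denoising step, say whether the new condition is injected.

    Step 0 is the earliest (noisiest) step; the fractions are measured
    along the full denoising trajectory.
    """
    start = round(start_frac * num_steps)
    end = round(end_frac * num_steps)
    return [start <= step < end for step in range(num_steps)]


# Hypothetical example: inject only during the first 40% of a 10-step trajectory,
# standing in for an "early, motion-dominant" window.
early_only = condition_schedule(10, 0.0, 0.4)
late_only = condition_schedule(10, 0.4, 1.0)
```

Sweeping the window and measuring how much appearance changes versus how much motion is preserved is the kind of trade-off the study uses as its proxy for motion encoding.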

[68] Real-Time American Sign Language Recognition Using 3D Convolutional Neural Networks and LSTM: Architecture, Training, and Deployment

Dawnena Key

Main category: cs.CV

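TL;DR: Proposes a hybrid deep learning architecture combining 3D CNN and LSTM for real-time American Sign Language (ASL) recognition, trained on several datasets and deployed on AWS and edge devices, with F1-scores between 0.71 and 0.99 across sign classes.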

Details

Motivation: To reduce communication barriers for more than 70 million deaf and hard-of-hearing people worldwide, an efficient and accurate real-time sign language recognition system is needed. Method: 3D convolutional neural networks extract spatial-temporal features from video, and LSTM layers model the sequential dependencies in gestures, forming an end-to-end deep learning model trained and evaluated on WLASL, ASL-LEX, and related data. Result: The system achieves F1-scores from 0.71 to 0.99 across sign classes, supports real-time inference, and is deployed on AWS with edge deployment on OAK-D cameras. Conclusion: The hybrid architecture performs well on real-time ASL recognition and is practical to deploy, helping to improve communication accessibility for deaf and hard-of-hearing people. Abstract: This paper presents a real-time American Sign Language (ASL) recognition system utilizing a hybrid deep learning architecture combining 3D Convolutional Neural Networks (3D CNN) with Long Short-Term Memory (LSTM) networks. The system processes webcam video streams to recognize word-level ASL signs, addressing communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide. Our architecture leverages 3D convolutions to capture spatial-temporal features from video frames, followed by LSTM layers that model sequential dependencies inherent in sign language gestures. Trained on the WLASL dataset (2,000 common words), ASL-LEX lexical database (~2,700 signs), and a curated set of 100 expert-annotated ASL signs, the system achieves F1-scores ranging from 0.71 to 0.99 across sign classes. The model is deployed on AWS infrastructure with edge deployment capability on OAK-D cameras for real-time inference. We discuss the architecture design, training methodology, evaluation metrics, and deployment considerations for practical accessibility applications.

[69] Enhancing Medical Data Analysis through AI-Enhanced Locally Linear Embedding: Applications in Medical Point Location and Imagery

Hassan Khalid,Muhammad Mahad Khaliq,Muhammad Jawad Bashir

Main category: cs.CV

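TL;DR: Proposes combining artificial intelligence with Locally Linear Embedding (LLE) to improve the accuracy and efficiency of medical billing and transcription systems; the experiments reported indicate improvements in data processing accuracy and operational efficiency.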

Details

Motivation: To reduce human error in medical data processing, improve the accuracy and efficiency of medical billing and transcription, and advance the use of AI on high-dimensional medical data. Method: Proposes an AI-enhanced Locally Linear Embedding (LLE) model that applies AI techniques to dimensionality reduction and feature extraction on high-dimensional medical data, and uses it to automate medical billing and transcription. Result: The experiments indicate that the model improves data processing accuracy and operational efficiency, supporting automation of medical documentation and financial workflows. Conclusion: The AI-enhanced LLE model shows potential for medical data processing and lays a foundation for application in broader healthcare settings. Abstract: The rapid evolution of artificial intelligence in healthcare has opened avenues for enhancing various processes, including medical billing and transcription. This paper introduces an innovative approach by integrating AI with Locally Linear Embedding (LLE) to revolutionize the handling of high-dimensional medical data. This AI-enhanced LLE model is specifically tailored to improve the accuracy and efficiency of medical billing systems and transcription services. By automating these processes, the model aims to reduce human error and streamline operations, thereby facilitating faster and more accurate patient care documentation and financial transactions. This paper provides a comprehensive mathematical model of AI-enhanced LLE, demonstrating its application in real-world healthcare scenarios through a series of experiments. The results indicate a significant improvement in data processing accuracy and operational efficiency. This study not only underscores the potential of AI-enhanced LLE in medical data analysis but also sets a foundation for future research into broader healthcare applications.

[70] Unbiased Visual Reasoning with Controlled Visual Inputs

Zhaonan Li,Shijie Lu,Fei Wang,Jacob Dineen,Xiao Ye,Zhikun Xu,Siyi Liu,Young Min Cho,Bangzheng Li,Daniel Chang,Kenny Nguyen,Qizheng Yang,Muhao Chen,Ben Zhou

Main category: cs.CV

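TL;DR: This paper presents VISTA, a modular framework that separates visual perception from language reasoning through an explicit information bottleneck, so that visual question answering relies more on causal visual evidence and less on spurious correlations.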

Details

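Motivation: Existing end-to-end vision-language models (VLMs) tend to answer visual questions by exploiting spurious correlations rather than genuine visual evidence, and become even more reliant on shortcuts after fine-tuning, which leads to poor generalization. Method: VISTA decouples the system into two parts: a frozen VLM sensor that handles only short, objective perception queries and extracts visual facts, and a text-only LLM reasoner that decomposes the question, plans queries, and aggregates visual information in natural language. The framework is trained with reinforcement learning in a reward-aligned environment to avoid propagating bias. Result: With Qwen2.5-VL and Llama3.2-Vision sensors and training on only 641 curated multi-step questions, VISTA improves SpuriVerse results by 16.29% and 6.77% respectively, remains competitive on MMVP and a SeedBench subset, generalizes well across different VLM sensors, and can recognize and recover from perception errors. Human analysis shows that its reasoning is more neutral, relies less on spurious features, and is more closely grounded in visual evidence. Conclusion: Through its modular design and explicit control interface, VISTA effectively mitigates spurious correlations in VLMs and delivers more robust, interpretable, and generalizable visual question answering, offering a new path toward trustworthy multimodal systems. Abstract: End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA's reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.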
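
Illustrative sketch (not from the paper): the abstract states that the reasoner is trained with GRPO. The core idea of GRPO, as commonly described, is to score each sampled response relative to the other responses drawn for the same question. The snippet below shows only that normalization step; the function name, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration, not details confirmed by the authors.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same question.

    rewards: list of scalar rewards, one per sampled response (hypothetical input).
    eps: small constant that avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled reasoning traces for one question: two answered correctly.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline.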

Antara Titikhsha,Divyanshu Tak

Main category: cs.CV

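TL;DR: Proposes SAMM2D, a dual-encoder framework that markedly improves intracranial aneurysm detection, and finds that data augmentation hurts performance under strong pretraining, indicating that strong pretraining matters more than complex augmentation strategies.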

Details

Motivation: Aneurysm detection is challenging because of subtle morphology, class imbalance, and scarce annotated data; detection accuracy needs to improve and the risk of missed clinical diagnoses needs to fall. Method: Proposes the SAMM2D dual-encoder framework built on an ImageNet-pretrained backbone, systematically evaluates six data augmentation strategies, and uses Grad-CAM visualizations to analyze the regions the model attends to. Result: Achieves an AUC of 0.686 on the RSNA dataset, a 32% improvement over the clinical baseline; the unaugmented model outperforms all augmented variants (p < 0.01); after threshold tuning, sensitivity reaches 95%, above average radiologist performance; Grad-CAM shows that 85% of true positives focus on relevant vascular regions (62% IoU with expert annotations). Conclusion: In low-data medical imaging tasks, strong pretraining already encodes sufficiently invariant features, and excessive augmentation disrupts the feature manifold; strong pretraining should be preferred over complex augmentation strategies for efficient, interpretable clinical deployment. Abstract: Effective aneurysm detection is essential to avert life-threatening hemorrhages, but it remains challenging due to the subtle morphology of the aneurysm, pronounced class imbalance, and the scarcity of annotated data. We introduce SAMM2D, a dual-encoder framework that achieves an AUC of 0.686 on the RSNA intracranial aneurysm dataset, an improvement of 32% over the clinical baseline. In a comprehensive ablation across six augmentation regimes, we made a striking discovery: any form of data augmentation degraded performance when coupled with a strong pretrained backbone. Our unaugmented baseline model outperformed all augmented variants by 1.75 to 2.23 percentage points (p < 0.01), overturning the assumption that "more augmentation is always better" in low-data medical settings. We hypothesize that ImageNet-pretrained features already capture robust invariances, rendering additional augmentations both redundant and disruptive to the learned feature manifold. By calibrating the decision threshold, SAMM2D reaches 95% sensitivity, surpassing average radiologist performance, and translates to a projected $13.9M in savings per 1,000 patients in screening applications. Grad-CAM visualizations confirm that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations), demonstrating the model's clinically meaningful focus. Our results suggest that future medical imaging workflows could benefit more from strong pretraining than from increasingly complex augmentation pipelines.
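
Illustrative sketch (not from the paper): the abstract reports that calibrating the decision threshold yields 95% sensitivity. One common way to do this is to pick the highest threshold that still captures the required fraction of positive cases on a validation set. The function names and the toy scores below are hypothetical, and the authors' actual calibration procedure is not described in the abstract.

```python
import math


def threshold_for_sensitivity(scores, labels, target):
    """Return the highest score threshold whose sensitivity is at least `target`.

    scores: predicted probabilities; labels: 1 for aneurysm, 0 otherwise.
    A case is called positive when its score is >= the returned threshold.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    needed = max(1, math.ceil(target * len(positives)))  # positives that must be caught
    return positives[needed - 1]


def sensitivity(scores, labels, threshold):
    """Fraction of true positives with score >= threshold."""
    caught = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return caught / sum(labels)


# Toy validation set: four positives and two negatives.
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
labels = [1, 1, 1, 1, 0, 0]
t = threshold_for_sensitivity(scores, labels, 0.75)
```

Lowering the threshold raises sensitivity at the cost of more false positives, which is why a figure such as 95% sensitivity is best read alongside the corresponding specificity.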

[72] HookMIL: Revisiting Context Modeling in Multiple Instance Learning for Computational Pathology

Xitong Ling,Minxi Ouyang,Xiaoxiao Li,Jiawen Li,Ying Chen,Yuxuan Sun,Xinrui Chen,Tian Guan,Xiaoping Liu,Yonghong He

Main category: cs.CV

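TL;DR: Proposes HookMIL, a context-aware and computationally efficient multiple instance learning framework that uses learnable hook tokens for structured contextual aggregation and achieves state-of-the-art performance in pathology image analysis.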

Details

Motivation: Traditional MIL methods tend to lose key contextual information, while Transformer-based methods are more expressive but suffer from high computational complexity and redundant computation. Method: Introduces compact, learnable hook tokens that can be initialized from multiple modalities (visual, textual, and spatial features) and that interact with instances through bidirectional attention with linear complexity; a Hook Diversity Loss and a hook-to-hook communication mechanism promote specialization and reduce redundancy. Result: Experiments on four public pathology datasets show that HookMIL reaches state-of-the-art levels in performance, computational efficiency, and interpretability. Conclusion: HookMIL balances context modeling capacity against computational efficiency and, through multimodal initialization and structured aggregation, offers a new paradigm for weakly supervised pathology image analysis. Abstract: Multiple Instance Learning (MIL) has enabled weakly supervised analysis of whole-slide images (WSIs) in computational pathology. However, traditional MIL approaches often lose crucial contextual information, while transformer-based variants, though more expressive, suffer from quadratic complexity and redundant computations. To address these limitations, we propose HookMIL, a context-aware and computationally efficient MIL framework that leverages compact, learnable hook tokens for structured contextual aggregation. These tokens can be initialized from (i) key-patch visual features, (ii) text embeddings from vision-language pathology models, and (iii) spatially grounded features from spatial transcriptomics-vision models. This multimodal initialization enables hook tokens to incorporate rich textual and spatial priors, accelerating convergence and enhancing representation quality. During training, hook tokens interact with instances through bidirectional attention with linear complexity. To further promote specialization, we introduce a Hook Diversity Loss that encourages each token to focus on distinct histopathological patterns. Additionally, a hook-to-hook communication mechanism refines contextual interactions while minimizing redundancy. Extensive experiments on four public pathology datasets demonstrate that HookMIL achieves state-of-the-art performance, with improved computational efficiency and interpretability. Codes are available at https://github.com/lingxitong/HookMIL.

[73] Tiny-YOLOSAM: Fast Hybrid Image Segmentation

Kenneth Xu,Songhan Wu

Main category: cs.CV

TL;DR: Proposes Tiny-YOLOSAM, a hybrid method that combines a YOLO detector with TinySAM, substantially improving segmentation coverage while sharply reducing runtime.

Details

Motivation: TinySAM is lightweight but still depends on dense prompting, which makes it slow and unsuitable for real-time use; a more efficient full-scene segmentation approach is needed. Method: Uses YOLOv12 to generate foreground object boxes as prompts for TinySAM and adds sparse point prompts in uncovered regions, yielding a detector-guided hybrid segmentation pipeline. Result: On COCO val2017, AR rises from 16.4% to 77.1% and mIoU from 19.2% to 67.8%, while per-image inference time drops from 49.20 s to 10.39 s (a 4.7x speedup). Conclusion: Detector-guided prompting plus sparse sampling is an effective alternative to dense prompting and is suited to fast full-scene segmentation in practical applications. Abstract: The Segment Anything Model (SAM) enables promptable, high-quality segmentation but is often too computationally expensive for latency-critical settings. TinySAM is a lightweight, distilled SAM variant that preserves strong zero-shot mask quality, yet its "segment-everything" mode still requires hundreds of prompts and remains slow in practice. We first replicate TinySAM on COCO val2017 using official checkpoints, matching the reported AP within 0.03%, establishing a reliable experimental baseline. Building on this, we propose Tiny-YOLOSAM, a fast hybrid pipeline that uses a recent YOLO detector (YOLOv12) to generate box prompts for TinySAM on salient foreground objects, and supplements uncovered regions with sparse point prompts sampled only where YOLO-guided masks provide no coverage. On COCO val2017, the hybrid system substantially improves class-agnostic coverage (AR from 16.4% to 77.1%, mIoU from 19.2% to 67.8%) while reducing end-to-end runtime from 49.20s/image to 10.39s/image (4.7x) on an Apple M1 Pro CPU. These results suggest detector-guided prompting combined with targeted sparse sampling as an effective alternative to dense "segment-everything" prompting for practical full-scene segmentation.
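
Illustrative sketch (not from the paper): the second stage described above samples point prompts only where the detector-guided masks leave the image uncovered. The snippet below shows one plausible reading of that step on a toy coverage grid and checks the reported speedup. The function names, the regular sampling grid, and the `stride` parameter are assumptions; the paper's actual sampling scheme may differ.

```python
def uncovered_point_prompts(mask_union, stride):
    """Return (row, col) grid points that fall outside every detector-guided mask.

    mask_union: 2D list of 0/1 values, 1 where a box-prompted mask already covers the pixel.
    stride: spacing of the sparse sampling grid (hypothetical parameter).
    """
    points = []
    for r in range(stride // 2, len(mask_union), stride):
        for c in range(stride // 2, len(mask_union[0]), stride):
            if mask_union[r][c] == 0:
                points.append((r, c))
    return points


def speedup(before_seconds, after_seconds):
    """Ratio of the old runtime to the new runtime."""
    return before_seconds / after_seconds


# Toy 4x4 image whose left half is already covered by box-prompted masks.
coverage = [[1, 1, 0, 0] for _ in range(4)]
prompts = uncovered_point_prompts(coverage, stride=2)
ratio = speedup(49.20, 10.39)  # runtimes reported in the abstract
```

Only the grid points in the uncovered right half become prompts, which is where the savings over dense "segment-everything" prompting come from.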

[74] Quadrant Segmentation VLM with Few-Shot Adaptation and OCT Learning-based Explainability Methods for Diabetic Retinopathy

Shivum Telang

Main category: cs.CV

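TL;DR: This paper proposes a new multimodal explainable AI model for diabetic retinopathy (DR) diagnosis that combines a vision-language model (VLM) with few-shot learning; by analyzing lesion distributions in fundus and OCT images, it generates natural-language explanations and Grad-CAM heatmaps, improving both the interpretability and the accuracy of diagnosis.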

Details

Motivation: Existing DR diagnosis models rely on a single imaging modality and can only mark lesion locations without explaining the classification decision; manual lesion annotation is time-consuming and clinically impractical. A system is needed that mimics an ophthalmologist's reasoning and provides quantitative detection together with natural-language explanations. Method: Proposes a multimodal explainable model based on a vision-language model (VLM) with few-shot learning that analyzes lesion distributions within the four retinal quadrants; it combines fundus and OCT images, generates paired Grad-CAM heatmaps that visualize the regions driving DR severity classification, and outputs natural-language explanations. Result: The approach is validated on a dataset of 3,000 fundus images and 1,000 OCT images; the model classifies DR severity accurately while providing heatmaps and natural-language explanations, improving clinical interpretability and practicality. Conclusion: The multimodal explainable model overcomes the limitations of traditional single-modality models: it localizes lesions precisely and also mimics a physician's reasoning, providing a more comprehensive and practical tool for DR screening, treatment, and research. Abstract: Diabetic Retinopathy (DR) is a leading cause of vision loss worldwide, requiring early detection to preserve sight. Limited access to physicians often leaves DR undiagnosed. To address this, AI models utilize lesion segmentation for interpretability; however, manually annotating lesions is impractical for clinicians. Physicians require a model that explains the reasoning for classifications rather than just highlighting lesion locations. Furthermore, current models are one-dimensional, relying on a single imaging modality for explainability and achieving limited effectiveness. In contrast, a quantitative-detection system that identifies individual DR lesions in natural language would overcome these limitations, enabling diverse applications in screening, treatment, and research settings. To address this issue, this paper presents a novel multimodal explainability model utilizing a VLM with few-shot learning, which mimics an ophthalmologist's reasoning by analyzing lesion distributions within retinal quadrants for fundus images. The model generates paired Grad-CAM heatmaps, showcasing individual neuron weights across both OCT and fundus images, which visually highlight the regions contributing to DR severity classification. Using a dataset of 3,000 fundus images and 1,000 OCT images, this innovative methodology addresses key limitations in current DR diagnostics, offering a practical and comprehensive tool for improving patient outcomes.

[75] TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting

Qiang Guo,Rubo Zhang,Bingbing Zhang,Junjie Liu,Jianqing Liu

Main category: cs.CV

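TL;DR: This paper proposes TCFormer, an ultra-lightweight, weakly supervised Transformer architecture for crowd counting that delivers efficient and accurate estimates on resource-constrained devices using only image-level labels.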

Details

Motivation: Existing methods depend on dense annotations and computationally expensive models, which limits their use on edge devices; this paper aims to design a weakly supervised framework with few parameters, low training cost, and good performance. Method: Adopts an efficient vision transformer as the backbone, introduces a Learnable Density-Weighted Averaging module that dynamically aggregates local features, and designs a density-level classification loss to stabilize training. Result: Delivers strong results on four benchmarks (ShanghaiTech A/B, UCF-QNRF, and NWPU), striking a good balance between accuracy and efficiency with only 5M parameters under weak supervision. Conclusion: TCFormer offers an efficient and practical solution for real-world crowd counting in resource-constrained environments. Abstract: Crowd counting typically relies on labor-intensive point-level annotations and computationally intensive backbones, restricting its scalability and deployment in resource-constrained environments. To address these challenges, this paper proposes TCFormer, a tiny, ultra-lightweight, weakly-supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance. First, a powerful yet efficient vision transformer is adopted as the feature extractor, whose global context-aware capabilities provide semantically meaningful crowd features with a minimal memory footprint. Second, to compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module. This module dynamically re-weights local tokens according to predicted density scores, enabling the network to adaptively modulate regional features based on their specific density characteristics without the need for additional annotations. Furthermore, this paper introduces a density-level classification loss, which discretizes crowd density into distinct grades, thereby regularizing the training process and enhancing the model's classification power across varying levels of crowd density. Therefore, although TCFormer is trained under a weakly-supervised paradigm utilizing only image-level global counts, the joint optimization of count and density-level losses enables the framework to achieve high estimation accuracy. Extensive experiments on four benchmarks including ShanghaiTech A/B, UCF-QNRF, and NWPU datasets demonstrate that our approach strikes a superior trade-off between parameter efficiency and counting accuracy and can be a good solution for crowd counting tasks on edge devices.
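
Illustrative sketch (not from the paper): the density-weighted averaging described above re-weights local tokens by predicted density scores before pooling them. The snippet below shows one simple way such pooling could work, using a softmax over the scores. The function name and the softmax normalization are assumptions; the module's exact form is not given in the abstract.

```python
import math


def density_weighted_average(tokens, density_scores):
    """Pool local token features into one vector, weighting by predicted density.

    tokens: list of feature vectors (lists of floats), one per image patch.
    density_scores: one scalar per token; higher means denser crowd (hypothetical input).
    Weights are a softmax over the scores, which is an assumption of this sketch.
    """
    top = max(density_scores)
    exps = [math.exp(s - top) for s in density_scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


# Equal scores reduce to a plain mean of the two tokens.
pooled = density_weighted_average([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

With unequal scores, tokens from denser regions dominate the pooled feature, which lets an image-level count signal still emphasize the crowded parts of the scene.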

[76] A CNN-Based Malaria Diagnosis from Blood Cell Images with SHAP and LIME Explainability

Md. Ismiel Hossen Abir,Awolad Hossain

Main category: cs.CV

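TL;DR: This study proposes a deep learning approach based on a custom convolutional neural network (CNN) for automatically classifying malaria-infected blood cell images; it performs well on accuracy, precision, and recall, uses explainable AI techniques to increase model transparency, and is suited to rapid malaria diagnosis in resource-limited regions.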

Details

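Motivation: Traditional malaria diagnostic methods such as microscopic blood smear examination have low sensitivity, depend on expert judgment, and are resource-intensive, which makes them hard to apply widely in remote areas; a more efficient, automated, and interpretable diagnostic approach is needed. Method: Designs and trains a custom convolutional neural network (CNN) for binary classification of blood cell images (parasitized vs. uninfected) and compares it with pretrained models including ResNet50, VGG16, MobileNetV2, and DenseNet121; explainable AI techniques such as SHAP, LIME, and saliency maps are applied to improve interpretability. Result: The custom CNN reaches 96% accuracy, with precision and recall above 0.95 for both classes, performing better than or on par with mainstream deep learning models; the explainability methods reveal the image regions the model attends to, increasing clinical trust. Conclusion: Deep learning combined with explainable AI enables fast, accurate, and transparent malaria diagnosis, has potential for deployment in resource-poor regions, and can improve the accessibility and efficiency of malaria screening. Abstract: Malaria remains a prevalent health concern in regions with tropical and subtropical climates. The cause of malaria is the Plasmodium parasite, which is transmitted through the bites of infected female Anopheles mosquitoes. Traditional diagnostic methods, such as microscopic blood smear analysis, are low in sensitivity, depend on expert judgment, and require resources that may not be available in remote settings. To overcome these limitations, this study proposes a deep learning-based approach utilizing a custom Convolutional Neural Network (CNN) to automatically classify blood cell images as parasitized or uninfected. The model achieves an accuracy of 96%, with precision and recall scores exceeding 0.95 for both classes. This study also compares the custom CNN with established deep learning architectures, including ResNet50, VGG16, MobileNetV2, and DenseNet121. To enhance model interpretability, Explainable AI techniques such as SHAP, LIME, and Saliency Maps are applied. The proposed system shows how deep learning can provide quick, accurate and understandable malaria diagnosis, especially in areas with limited resources.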

[77] Signal-SGN++: Topology-Enhanced Time-Frequency Spiking Graph Network for Skeleton-Based Action Recognition

Naichuan Zheng,Xiahai Lun,Weiyi Li,Yuchen Du

Main category: cs.CV

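TL;DR: This paper proposes Signal-SGN++, an efficient skeleton-based action recognition framework that combines graph-structure awareness with spiking time-frequency dynamics, outperforming existing SNN methods and rivaling GCNs while keeping energy consumption low.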

Details

Motivation: Existing GCNs for skeleton-based action recognition are computationally intensive and energy-hungry, while SNNs are energy-efficient but struggle to capture the time-frequency and topological dependencies of human motion; a new method that combines efficiency with modeling capacity is needed. Method: Proposes Signal-SGN++, whose backbone of 1D-SGC and FSC extracts spatiotemporal and spectral features; a TSSA mechanism adaptively attends to learned skeletal topologies; an MWTF branch with a TATF unit performs multi-scale wavelet transforms and fuses structural priors to keep the fusion topology-consistent. Result: Experiments on large-scale benchmarks show that Signal-SGN++ substantially reduces energy consumption while achieving accuracy above existing SNN methods and comparable to state-of-the-art GCNs, giving a better accuracy-efficiency trade-off. Conclusion: By fusing structure awareness with spiking time-frequency modeling, Signal-SGN++ balances performance and energy efficiency and offers a new solution for low-power action recognition. Abstract: Graph Convolutional Networks (GCNs) demonstrate strong capability in modeling skeletal topology for action recognition, yet their dense floating-point computations incur high energy costs. Spiking Neural Networks (SNNs), characterized by event-driven and sparse activation, offer energy efficiency but remain limited in capturing coupled temporal-frequency and topological dependencies of human motion. To bridge this gap, this article proposes Signal-SGN++, a topology-aware spiking graph framework that integrates structural adaptivity with time-frequency spiking dynamics. The network employs a backbone composed of 1D Spiking Graph Convolution (1D-SGC) and Frequency Spiking Convolution (FSC) for joint spatiotemporal and spectral feature extraction. Within this backbone, a Topology-Shift Self-Attention (TSSA) mechanism is embedded to adaptively route attention across learned skeletal topologies, enhancing graph-level sensitivity without increasing computational complexity. Moreover, an auxiliary Multi-Scale Wavelet Transform Fusion (MWTF) branch decomposes spiking features into multi-resolution temporal-frequency representations, wherein a Topology-Aware Time-Frequency Fusion (TATF) unit incorporates structural priors to preserve topology-consistent spectral fusion. Comprehensive experiments on large-scale benchmarks validate that Signal-SGN++ achieves superior accuracy-efficiency trade-offs, outperforming existing SNN-based methods and achieving competitive results against state-of-the-art GCNs under substantially reduced energy consumption.

[78] VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition

Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Fadi Dornaika,Cosimo Distante,Abdenour Hadid

Main category: cs.CV

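TL;DR: This paper proposes VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders; it refines visual features through a compact cross-attention fusion and achieves a performance breakthrough on pedestrian attribute recognition under severe class imbalance, with notable gains on the PA100K, PETA, and Market-1501 benchmarks.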

Details

Motivation: Pedestrian attribute recognition (PAR) faces severe class imbalance, complex dependencies between attributes, and domain shift, which existing methods struggle to handle effectively. Method: Proposes the VLM-PAR framework, which uses frozen SigLIP 2 multilingual encoders, aligns image and prompt-text embeddings through a compact cross-attention mechanism, and introduces a cross-modal refinement fusion strategy to strengthen feature representation. Result: Sets a new state of the art on the PA100K dataset and significantly improves mean accuracy on PETA and Market-1501, validating the method's effectiveness on class imbalance and generalization. Conclusion: Combining large-scale vision-language pretraining with a targeted cross-modal refinement strategy effectively improves PAR performance and offers a new approach to imbalance and domain adaptation. Abstract: Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. By first aligning image and prompt embeddings via refining visual features through a compact cross-attention fusion, VLM-PAR achieves significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state-of-the-art performance, while also delivering significant gains in mean accuracy across PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in PAR.

[79] Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark

Hieu Minh Nguyen,Tam Le-Thanh Dang,Kiet Van Nguyen

Main category: cs.CV

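TL;DR: This paper introduces ViSignVQA, the first large-scale visual question answering (VQA) dataset for Vietnamese signboard text, containing 10,762 images and 25,573 question-answer pairs; it combines OCR with Vietnamese pretrained models to improve performance and proposes a solution based on a multi-agent framework.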

Details

Motivation: Existing VQA research falls short on signboard text understanding in low-resource languages such as Vietnamese, and lacks datasets that reflect the linguistic, cultural, and visual diversity of real scenes, so a dedicated resource is needed to advance the field. Method: Builds a large-scale dataset named ViSignVQA, integrates the Vietnamese OCR model SwinTextSpotter with the language model ViT5, and adapts advanced VQA models such as BLIP-2 and LaTr; further proposes a multi-agent VQA framework that combines perception and reasoning agents with GPT-4 and uses majority voting to improve accuracy. Result: Experiments show that adding OCR text raises the F1-score by up to 209%; the proposed multi-agent framework reaches 75.98% accuracy through majority voting; the results confirm the key role of OCR-enhanced context in text-based VQA. Conclusion: ViSignVQA fills the data gap for signboard understanding in low-resource languages, highlights the importance of domain-specific resources, and provides an effective benchmark and evaluation platform for OCR-integrated VQA models in Vietnamese. Abstract: Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework combining perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark capturing real-world scene text characteristics and supporting the development and evaluation of OCR-integrated VQA models in Vietnamese.
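
Illustrative sketch (not from the paper): the multi-agent framework above selects its final answer by majority voting. The snippet below shows the voting step alone on hypothetical agent outputs. The function name and the whitespace and case normalization are assumptions; how the authors normalize answers or break ties is not stated in the abstract.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer after light normalization.

    answers: list of answer strings, one per agent (hypothetical input).
    Ties go to the answer that appeared first, which is how Counter orders equal counts.
    """
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


# Three agents read the same signboard; two agree once formatting is ignored.
final = majority_vote(["Cafe 24", "cafe 24 ", "Cafe 42"])
```

Voting only helps when the agents make partly independent errors, so the gain depends on how differently the perception and reasoning agents fail.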

[80] On Extending Semantic Abstraction for Efficient Search of Hidden Objects

Tasha Pais,Nikhilesh Belulkar

Main category: cs.CV

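TL;DR: This paper proposes a framework based on Semantic Abstraction that uses the relevancy activations of 2D vision-language models (VLMs) to represent "abstract objects", enabling 3D localization and completion of occluded objects.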

Details

Motivation: Occluded objects cannot be directly identified by a VLM and are hard to localize with conventional methods, so an efficient search mechanism that exploits prior knowledge of object locations is needed. Method: Treats the VLM's relevancy activations as abstract object representations and combines them with historical data on where objects are frequently placed to learn 3D localization and completion of hidden objects. Result: The model accurately identifies the complete 3D location of a hidden object on the first try, with search efficiency significantly higher than naive random search. Conclusion: The method extends the use of Semantic Abstraction and offers household robots a way to save time and effort when looking for lost objects. Abstract: Semantic Abstraction's key observation is that 2D VLMs' relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as "abstract object" representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try significantly faster than a naive random search. These extensions to semantic abstraction hope to provide household robots with the skills necessary to save time and effort when looking for lost objects.

[81] VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs

Naishan Zheng,Jie Huang,Qingpei Guo,Feng Zhao

Main category: cs.CV

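TL;DR: VideoScaffold is a dynamic representation framework for streaming video understanding that adaptively adjusts event granularity while preserving fine-grained visual semantics, enabling a smooth transition from frame-level understanding to event reasoning.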

Details

Motivation: Existing static methods tend to produce fragmented or over-compressed outputs on continuous video streams and struggle to give multimodal large language models the temporally coherent representations that long videos require. Method: Proposes the VideoScaffold framework, which comprises Elastic-Scale Event Segmentation (EES) and Hierarchical Event Consolidation (HEC): EES performs prediction-guided segmentation to dynamically refine event boundaries, and HEC progressively aggregates semantically related segments into multi-level abstractions. Result: Achieves state-of-the-art performance on both offline and streaming video understanding benchmarks; the framework is modular and plug-and-play and extends existing image-based multimodal large language models seamlessly. Conclusion: VideoScaffold effectively addresses redundancy and temporal coherence in long-video understanding and provides an efficient, flexible solution for applying multimodal large language models to streaming video. Abstract: Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at https://github.com/zheng980629/VideoScaffold.

[82] KAN-FPN-Stem: A KAN-Enhanced Feature Pyramid Stem for Boosting ViT-based Pose Estimation

HaoNan Tang

Main category: cs.CV

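TL;DR: This paper proposes a KAN-enhanced FPN-Stem architecture that improves the feature fusion stage of Vision Transformers for pose estimation and yields a significant performance gain.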

Details

Motivation: Existing ViT models such as ViTPose use an overly simple front-end design, which loses information during multi-scale feature extraction and limits performance. Method: Keeps the classic upsample-and-add fusion of the FPN but replaces the terminal standard 3x3 convolution with a KAN-based convolutional layer, strengthening non-linear modeling and correcting the artifacts produced by multi-scale fusion. Result: On the COCO dataset, performance improves by up to +2.0 AP over the lightweight ViTPose-S baseline. Conclusion: The main performance bottleneck lies in the quality of feature fusion rather than in feature refinement modules such as attention; introducing the KAN operator is an effective way to address it. Abstract: Vision Transformers (ViT) have demonstrated significant promise in dense prediction tasks such as pose estimation. However, their performance is frequently constrained by the overly simplistic front-end designs employed in models like ViTPose. This naive patchification mechanism struggles to effectively handle multi-scale variations and results in irreversible information loss during the initial feature extraction phase. To overcome this limitation, we introduce a novel KAN-enhanced FPN-Stem architecture. Through rigorous ablation studies, we first identified that the true bottleneck for performance improvement lies not in plug-and-play attention modules (e.g., CBAM), but in the post-fusion non-linear smoothing step within the FPN. Guided by this insight, our core innovation is to retain the classic "upsample-and-add" fusion stream of the FPN, but replace its terminal, standard linear 3x3 smoothing convolution with a powerful KAN-based convolutional layer. Leveraging its superior non-linear modeling capabilities, this KAN-based layer adaptively learns and rectifies the "artifacts" generated during the multi-scale fusion process. Extensive experiments on the COCO dataset demonstrate that our KAN-FPN-Stem achieves a significant performance boost of up to +2.0 AP over the lightweight ViTPose-S baseline. This work not only delivers a plug-and-play, high-performance module but, more importantly, reveals that the performance bottleneck in the ViT front-end often lies not in 'feature refinement' (Attention), but in the quality of 'feature fusion' (Fusion). Furthermore, it provides an effective path to address this bottleneck through the introduction of the KAN operator.

[83] Meta-information Guided Cross-domain Synergistic Diffusion Model for Low-dose PET Reconstruction

Mengxiao Geng,Ran Hong,Xiaoling Xu,Bingxuan Li,Qiegen Liu

Main category: cs.CV

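TL;DR: Proposes a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that fuses projection-domain and image-domain information with clinical meta-information to significantly improve low-dose PET image quality.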

Details

Motivation: Existing low-dose PET imaging methods often ignore projection-domain physical priors and patient-specific meta-information, which leads to high noise, low contrast, and loss of physiological detail, and makes it hard to link function and semantics effectively. Method: Designs a meta-information encoding module that converts clinical parameters into semantic prompts and combines it with a cross-domain architecture: a sinogram adapter captures global physical structure in the projection domain, reconstruction takes place in the image domain, and multimodal priors are fused. Result: Validated on the public UDPET dataset and several clinical datasets, MiG-DM outperforms existing state-of-the-art methods in image quality, noise suppression, and preservation of physiological detail. Conclusion: By introducing meta-information guidance and cross-domain synergy, MiG-DM effectively improves low-dose PET image reconstruction and shows good potential for clinical use. Abstract: Low-dose PET imaging is crucial for reducing patient radiation exposure but faces challenges like noise interference, reduced contrast, and difficulty in preserving physiological details. Existing methods often neglect both projection-domain physics knowledge and patient-specific meta-information, which are critical for functional-semantic correlation mining. In this study, we introduce a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that integrates comprehensive cross-modal priors to generate high-quality PET images. Specifically, a meta-information encoding module transforms clinical parameters into semantic prompts by considering patient characteristics, dose-related information, and semi-quantitative parameters, enabling cross-modal alignment between textual meta-information and image reconstruction. Additionally, the cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations equivalent to global image-domain filtering. Experiments on the UDPET public dataset and clinical datasets with varying dose levels demonstrate that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.

[84] Multi-objective hybrid knowledge distillation for efficient deep learning in smart agriculture

Phi-Hung Hoang,Nam-Thuan Trinh,Van-Manh Tran,Thi-Thu-Hong Phan

Main category: cs.CV

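TL;DR: This paper proposes a knowledge distillation framework for lightweight, efficient convolutional neural networks in smart agriculture; combining inverted residual blocks with dense connectivity, it approaches the teacher model's accuracy on several plant recognition tasks at a much lower computational cost.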

Details

Motivation: Deploying deep learning models on resource-constrained edge devices involves a trade-off between computational efficiency and recognition accuracy, and smart agriculture applications in particular need efficient lightweight models. Method: Designs a customized student network that combines inverted residual blocks with dense connectivity, uses ResNet18 as the teacher network, and trains with a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Result: On rice seed classification, the student model reaches 98.56% accuracy (only 0.09% below the teacher) with just 0.68 GFLOPs and about 1.07 million parameters, roughly 2.7 times less computation and more than 10 times smaller than ResNet18; compared with DenseNet121 and ViT it has 6 to 80 times fewer parameters while maintaining comparable or higher accuracy. It also generalizes well across several plant disease datasets. Conclusion: The proposed hybrid knowledge distillation framework balances accuracy and efficiency, is robust and readily deployable, and suits resource-constrained smart agriculture systems. Abstract: Deploying deep learning models on resource-constrained edge devices remains a major challenge in smart agriculture due to the trade-off between computational efficiency and recognition accuracy. To address this challenge, this study proposes a hybrid knowledge distillation framework for developing a lightweight yet high-performance convolutional neural network. The proposed approach designs a customized student model that combines inverted residual blocks with dense connectivity and trains it under the guidance of a ResNet18 teacher network using a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Experiments are conducted on a rice seed variety identification dataset containing nine varieties and further extended to four plant leaf disease datasets, including rice, potato, coffee, and corn, to evaluate generalization capability. On the rice seed variety classification task, the distilled student model achieves an accuracy of 98.56%, which is only 0.09% lower than the teacher model (98.65%), while requiring only 0.68 GFLOPs and approximately 1.07 million parameters. This corresponds to a reduction of about 2.7 times in computational cost and more than 10 times in model size compared with the ResNet18 teacher model. In addition, compared with representative pretrained models, the proposed student reduces the number of parameters by more than 6 times relative to DenseNet121 and by over 80 times compared with the Vision Transformer (ViT) architecture, while maintaining comparable or superior classification accuracy. Consistent performance gains across multiple plant leaf disease datasets further demonstrate the robustness, efficiency, and strong deployment potential of the proposed framework for hardware-limited smart agriculture systems.
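
Illustrative sketch (not from the paper): a multi-objective distillation loss of the kind described above is typically a weighted sum of a hard-label cross-entropy, a temperature-softened KL term between teacher and student outputs, and a feature-matching term. The snippet below implements those three terms for a single sample. The weights, the temperature, the mean-squared feature loss, and all function names are assumptions, and the self-distillation term is left out because the abstract does not define it.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Hard-label supervision: negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def response_distillation(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by temperature squared."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return temperature ** 2 * kl


def feature_distillation(student_feat, teacher_feat):
    """Mean squared error between same-sized feature vectors (an assumed choice)."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)


def total_loss(student_logits, teacher_logits, label, student_feat, teacher_feat,
               weights=(1.0, 1.0, 1.0), temperature=4.0):
    """Weighted sum of the three terms; weights and temperature are hypothetical."""
    w_hard, w_resp, w_feat = weights
    return (w_hard * cross_entropy(student_logits, label)
            + w_resp * response_distillation(student_logits, teacher_logits, temperature)
            + w_feat * feature_distillation(student_feat, teacher_feat))
```

When the student reproduces the teacher's logits and features exactly, both distillation terms vanish and only the hard-label term remains, which is a quick sanity check for any implementation of this pattern.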

[85] Evaluating an Adaptive Multispectral Turret System for Autonomous Tracking Across Variable Illumination Conditions

Aahan Sachdeva,Dhanvinkumar Ganeshkumar,James E. Gallagher,Tyler Treat,Edward J. Oughton

Main category: cs.CV

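TL;DR: Proposes a framework that adaptively fuses RGB and long-wave infrared (LWIR) video streams and dynamically selects the best detection model, improving object detection across illumination conditions and clearly outperforming conventional methods in low light.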

Details

Motivation: Conventional RGB detection performs poorly in low light, and thermal imaging systems lack color and texture information; multimodal data must be fused to improve detection in complex environments. Method: Fuses aligned RGB and LWIR images at 11 ratios, trains 33 YOLO models covering three illumination conditions (no light, dim light, full light), and dynamically selects the best fusion ratio and model for the environment. Result: In full light and dim light, the best fusion ratios (80/20 and 90/10) reach mean confidences of 92.8% and 92.0% respectively, significantly outperforming the YOLOv5n and YOLOv11n baselines; under no-light conditions, the 40/60 fusion reaches 71.0%, above the baselines but not significantly so. Conclusion: Adaptive RGB-LWIR fusion significantly improves detection confidence and robustness across illumination conditions and strengthens the vision capabilities of autonomous robots on emergency missions. Abstract: Autonomous robotic platforms are playing a growing role across the emergency services sector, supporting missions such as search and rescue operations in disaster zones and reconnaissance. However, traditional red-green-blue (RGB) detection pipelines struggle in low-light environments, and thermal-based systems lack color and texture information. To overcome these limitations, we present an adaptive framework that fuses RGB and long-wave infrared (LWIR) video streams at multiple fusion ratios and dynamically selects the optimal detection model for each illumination condition. We trained 33 You Only Look Once (YOLO) models on over 22,000 annotated images spanning three light levels: no-light (<10 lux), dim-light (10-1000 lux), and full-light (>1000 lux). To integrate both modalities, fusion was performed by blending aligned RGB and LWIR frames at eleven ratios, from full RGB (100/0) to full LWIR (0/100) in 10% increments. Evaluation showed that the best full-light model (80/20 RGB-LWIR) and dim-light model (90/10 fusion) achieved 92.8% and 92.0% mean confidence; both significantly outperformed the YOLOv5 nano (YOLOv5n) and YOLOv11 nano (YOLOv11n) baselines. Under no-light conditions, the top 40/60 fusion reached 71.0%, exceeding the baselines, though the difference was not statistically significant. Adaptive RGB-LWIR fusion improved detection confidence and reliability across all illumination conditions, enhancing autonomous robotic vision performance.
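
Illustrative sketch (not from the paper): the eleven fusion ratios and three light levels described above account for the 33 trained models (11 x 3), assuming one model per ratio and light level. The snippet below enumerates the ratios, blends a single pixel pair, and maps a lux reading to the best ratio reported in the abstract. The linear weighted blend, the function names, and the lookup table layout are assumptions; the lux thresholds and the winning ratios are taken from the abstract.

```python
def fusion_ratios(step=10):
    """List (RGB %, LWIR %) pairs from full RGB (100/0) to full LWIR (0/100)."""
    return [(rgb, 100 - rgb) for rgb in range(100, -1, -step)]


def blend(rgb_value, lwir_value, rgb_percent):
    """Linear blend of two aligned pixel intensities (an assumed fusion rule)."""
    w = rgb_percent / 100.0
    return w * rgb_value + (1.0 - w) * lwir_value


def light_level(lux):
    """Bucket an illuminance reading using the thresholds given in the abstract."""
    if lux < 10:
        return "no-light"
    if lux <= 1000:
        return "dim-light"
    return "full-light"


# Best-performing RGB/LWIR ratios per light level, as reported in the abstract.
BEST_RATIO = {"full-light": (80, 20), "dim-light": (90, 10), "no-light": (40, 60)}

ratios = fusion_ratios()
model_count = len(ratios) * len(BEST_RATIO)  # 11 ratios x 3 light levels
chosen = BEST_RATIO[light_level(5)]          # a dark scene selects the 40/60 model
```

At run time the adaptive step reduces to a table lookup on the measured illuminance, so the cost of adaptivity is negligible compared with detection itself.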

[86] Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models

Antara Titikhsha,Om Kulkarni,Dharun Muthaiah

Main category: cs.CV

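TL;DR: Proposes a geometry-control method that needs no specialized training: a lightweight external discriminator (the HPE teacher) separates geometry from style in text-to-image generation, improving geometric consistency and semantic alignment.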

Details

Motivation: Existing text-to-image diffusion models generate detailed textures but ignore strict geometric constraints, so they fall short of how humans perceive shape and leave a semantic gap. Method: A Human Perception Embedding (HPE) teacher is trained on the THINGS triplet dataset, and its gradients are injected into the diffusion process of several architectures (Stable Diffusion v1.5, SiT-XL/2, and PixArt-Σ) as an external guidance signal for geometric control. Result: The method achieves zero-shot transfer of complex 3D shapes (such as an Eames chair) onto conflicting materials (such as pink metal); semantic alignment improves by about 80% over unguided baselines; flow models are found to need continuous guidance to avoid drifting back to their default trajectories. Conclusion: Small teacher models can effectively guide large generative systems, strengthening geometric control without retraining and broadening the creative range and geometric fidelity of text-to-image synthesis. Abstract: Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-Σ. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.

[87] GeCo: A Differentiable Geometric Consistency Metric for Video Generation

Leslie Gu,Junhwa Hur,Charles Herrmann,Fangneng Zhan,Todd Zickler,Deqing Sun,Hanspeter Pfister

Main category: cs.CV

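TL;DR: Introduces GeCo, a geometry-grounded metric that detects geometric deformation and occlusion-inconsistency artifacts in static scenes and can be used both to evaluate and to improve video generation models.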

Details

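Motivation: Existing video generation models tend to produce artifacts such as geometric deformation and occlusion inconsistency in static scenes, and effective ways to measure and correct them are lacking. Method: GeCo fuses residual motion and depth priors to produce interpretable, dense consistency maps that expose these artifacts, and it is also applied as a training-free guidance loss during video generation. Result: GeCo is used to systematically benchmark recent video generation models, uncovering common failure modes, and is shown to reduce deformation artifacts when used as a guidance loss. Conclusion: GeCo is an effective, training-free tool for both detecting geometric artifacts in generated video and improving generation quality. Abstract: We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.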

[88] The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency

Dingyu Wang,Zimu Yuan,Jiajun Liu,Shanggui Liu,Nan Zhou,Tianxing Xu,Di Huang,Dong Jiang

Main category: cs.CV

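TL;DR: Introduces the Bones and Joints (B&J) benchmark for evaluating the multimodal clinical reasoning of AI models on real orthopedics and sports medicine cases. Current models do well on multiple-choice questions but poorly on open-ended tasks that require integrating text and images, with marked weaknesses in medical image understanding and in avoiding text-driven hallucinations, indicating that they are not yet capable of complex clinical reasoning.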

Details

Motivation: Existing medical AI benchmarks are mostly built on exam questions or simplified vignettes and cannot capture the integrated, multimodal reasoning required in real clinical settings, so an evaluation framework closer to actual clinical workflow is needed to measure the clinical usefulness of AI models. Method: The B&J benchmark comprises 1,245 questions from real patient cases and covers 7 tasks that mirror the clinical reasoning pathway (including knowledge recall, text and image interpretation, diagnosis, treatment planning, and rationale provision); 11 vision-language models and 6 large language models are evaluated against expert-derived ground truth. Result: State-of-the-art models exceed 90% accuracy on structured multiple-choice questions but reach only about 60% on open-ended tasks that require multimodal integration; vision-language models interpret medical images poorly and often show severe text-driven hallucinations; models fine-tuned for medicine show no consistent advantage. Conclusion: Current AI models are not yet capable of complex multimodal clinical reasoning, and their clinical deployment should be limited to supportive, text-based tasks; fundamental advances in multimodal integration and visual understanding are still needed. Abstract: Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (B&J) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts. Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.

[89] FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound

Hussain Alasmawi,Numan Saeed,Mohammad Yaqub

Main category: cs.CV

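TL;DR: Presents Fetal-Gauge, the first large-scale visual question answering benchmark for evaluating vision-language models (VLMs) on fetal ultrasound, with over 42,000 images and 93,000 question-answer pairs; existing VLMs perform poorly, underscoring the need for models tailored to this domain.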

Details

Motivation: Without a standardized benchmark or public datasets, existing vision-language models cannot be properly evaluated or applied to fetal ultrasound imaging, so a unified benchmark is needed to advance the field. Method: Fetal-Gauge is a large-scale visual question answering benchmark covering anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis; several state-of-the-art VLMs are systematically evaluated on it. Result: Fetal-Gauge contains over 42,000 images and 93,000 question-answer pairs; the best-performing model reaches only 55% accuracy, far below clinical requirements; the analysis exposes critical limitations of current VLMs in fetal ultrasound interpretation. Conclusion: Fetal-Gauge provides a solid foundation for multimodal AI research in fetal ultrasound, highlights the urgency of developing domain-adapted models, and may help improve global access to maternal healthcare. Abstract: The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper gets accepted.

[90] A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation

Philip Xu,David Elizondo,Raouf Hamzaoui

Main category: cs.CV

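TL;DR: Uni4D is a unified framework for large-scale open-vocabulary 3D retrieval and controlled 4D generation that achieves cross-modal understanding through three-level alignment across text, 3D models, and images.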

Details

Motivation: To improve open-vocabulary 3D content retrieval and 4D dynamic asset generation, addressing the weaknesses of existing methods in semantic alignment and cross-modal consistency. Method: Built on the Align3D 130 dataset, the framework uses a 3D text multi-head attention and search model to optimize text-to-3D retrieval, and strengthens cross-modal association through three alignments: text to 3D, multi-view 3D to image, and image to text. Result: Experiments show that Uni4D performs well in both 3D retrieval quality and controllable 4D generation, producing temporally consistent 4D assets. Conclusion: Uni4D advances dynamic multimodal understanding and practical applications, providing strong support for open-vocabulary 3D/4D content creation. Abstract: We introduce Uni4D, a unified framework for large scale open vocabulary 3D retrieval and controlled 4D generation based on structured three level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D text multi head attention and search model to optimize text to 3D retrieval through improved semantic alignment. The framework further strengthens cross modal alignment through three components: precise text to 3D retrieval, multi view 3D to image alignment, and image to text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.

[91] Learning Dynamic Scene Reconstruction with Sinusoidal Geometric Priors

Tian Guo,Hui Yuan,Philip Xu,David Elizondo

Main category: cs.CV

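TL;DR: SirenPose is a new loss function that combines the periodic activation properties of sinusoidal representation networks with geometric priors from keypoint structures to improve the accuracy of dynamic 3D scene reconstruction.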

Details

Motivation: Existing methods struggle to maintain motion modeling accuracy and spatiotemporal consistency in fast-moving and multi-target scenes, so more effective constraints are needed. Method: The SirenPose loss combines periodic activations with geometric priors and adds physics-inspired constraints to enforce coherent keypoint predictions across space and time; the training dataset is expanded to 600,000 annotated instances. Result: Models trained with SirenPose significantly outperform prior methods on spatiotemporal consistency metrics, particularly under rapid motion and complex scene changes. Conclusion: By combining periodic activations with geometric priors, SirenPose improves the accuracy and consistency of dynamic 3D scene reconstruction and shows good potential for application. Abstract: We propose SirenPose, a novel loss function that combines the periodic activation properties of sinusoidal representation networks with geometric priors derived from keypoint structures to improve the accuracy of dynamic 3D scene reconstruction. Existing approaches often struggle to maintain motion modeling accuracy and spatiotemporal consistency in fast moving and multi target scenes. By introducing physics inspired constraint mechanisms, SirenPose enforces coherent keypoint predictions across both spatial and temporal dimensions. We further expand the training dataset to 600,000 annotated instances to support robust learning. Experimental results demonstrate that models trained with SirenPose achieve significant improvements in spatiotemporal consistency metrics compared to prior methods, showing superior performance in handling rapid motion and complex scene changes.

[92] Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware

Vesal Ahsani,Babak Hossein Khalaj

Main category: cs.CV

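TL;DR: Presents a low-cost, low-latency single-camera driver behavior recognition system that runs in real time on edge devices such as Raspberry Pi 5 and Google Coral Edge TPU, recognizes 17 behavior classes, and improves accuracy and stability through a compact model, confounder-aware labels, and a temporal decision mechanism.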

Details

Motivation: To recognize distraction- and drowsiness-related driver behaviors with low latency and high accuracy under tight compute, power, and cost constraints, so that in-cabin driver monitoring can be widely deployed in real vehicles. Method: A single-camera input is combined with a lightweight per-frame vision model, a confounder-aware label design, and a temporal decision head, giving efficient inference and stable alert triggering on low-compute edge devices. Result: The system reaches about 16 FPS on Raspberry Pi 5 (INT8, per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, covers 17 driving-related behavior classes, and its runtime behavior is validated in real in-vehicle tests. Conclusion: The system delivers real-time, reliable driver behavior recognition on low-cost hardware and offers a practical upstream perception component for human-centered vehicle intelligence, such as agentic vehicles. Abstract: In-cabin Driver Monitoring Systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and Google Coral Edge TPU. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label design to reduce visually similar false positives, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system covers 17 behavior classes, including multiple phone-use modes, eating/drinking, smoking, reaching behind, gaze/attention shifts, passenger interaction, grooming, control-panel interaction, yawning, and eyes-closed sleep. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions (details in Section 6), and we further validate runtime behavior in real in-vehicle tests. The optimized deployments achieve about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation on inexpensive hardware. Finally, we discuss how reliable in-cabin human-state perception can serve as an upstream input for human-centered vehicle intelligence, including emerging agentic vehicle concepts.
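A temporal decision head that alerts only on confident and sustained predictions could look roughly like the sketch below. The abstract gives no implementation details, so the sliding-window vote, the class name `TemporalAlertHead`, and the default thresholds are all illustrative assumptions rather than the authors' design.

```python
from collections import deque


class TemporalAlertHead:
    """Raise an alert only when one label is confident in most recent frames.

    Assumption: a fixed-size sliding window with a hit-count threshold; the
    paper's actual decision rule is not described in the abstract.
    """

    def __init__(self, window=8, min_confidence=0.7, min_hits=6):
        self.min_confidence = min_confidence
        self.min_hits = min_hits
        self.history = deque(maxlen=window)  # oldest entries drop automatically

    def update(self, label, confidence):
        """Feed one per-frame prediction; return the label to alert on, or None."""
        # Low-confidence frames are recorded as None so they cannot count as hits.
        self.history.append(label if confidence >= self.min_confidence else None)
        if len(self.history) < self.history.maxlen:
            return None  # not enough frames observed yet
        counts = {}
        for seen in self.history:
            if seen is not None:
                counts[seen] = counts.get(seen, 0) + 1
        if not counts:
            return None
        best = max(counts, key=counts.get)
        return best if counts[best] >= self.min_hits else None
```

With a window of 4 and `min_hits=3`, four confident "yawning" frames trigger an alert on the fourth frame, and the alert clears once two low-confidence frames have entered the window.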

[93] Attack-Aware Deepfake Detection under Counter-Forensic Manipulations

Noor Fatima,Hasan Faraz Khan,Muzammil Behzad

Main category: cs.CV

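TL;DR: Proposes an attack-aware deepfake and image-forensics detector that combines red-team training with randomized test-time defense to deliver robustness, well-calibrated probabilities, and interpretable evidence under realistic deployment conditions.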

Details

Motivation: Existing deepfake detectors degrade sharply under practical attacks (compression, resampling, color perturbation, and so on), and they lack calibrated confidence and the ability to localize forged regions. A detector is needed that stays robust under realistic attacks, outputs reliable probabilities, and provides interpretable heatmaps. Method: A two-stream architecture is used: one stream extracts semantic content with a pretrained backbone and the other extracts forensic residuals, fused through a lightweight residual adapter for classification, while a shallow Feature Pyramid Network style head produces tamper heatmaps under weak supervision. Training applies worst-of-K counter-forensic attacks (such as JPEG realign and recompress, and denoise-to-regrain), and test time uses low-cost jitters (resize and crop changes, gamma variation, JPEG phase shifts) with aggregated predictions. Face-box masks guide the heatmaps toward face regions without precise pixel-level annotations. Result: Evaluated on standard deepfake datasets and a surveillance-style split with low light and heavy compression, the work reports clean and attacked AUC, worst-case accuracy, reliability, abstention quality, and weak-localization scores. The method shows near-perfect ranking across attacks, low calibration error, minimal abstention risk, and controlled degradation under the regrain attack. Conclusion: The method is a modular, data-efficient, and practically deployable baseline that supports attack-aware detection, calibrated probabilities, and actionable heatmaps for deepfake detection under complex real-world conditions. Abstract: This work presents an attack-aware deepfake and image-forensics detector designed for robustness, well-calibrated probabilities, and transparent evidence under realistic deployment conditions. The method combines red-team training with randomized test-time defense in a two-stream architecture, where one stream encodes semantic content using a pretrained backbone and the other extracts forensic residuals, fused via a lightweight residual adapter for classification, while a shallow Feature Pyramid Network style head produces tamper heatmaps under weak supervision. Red-team training applies worst-of-K counter-forensics per batch, including JPEG realign and recompress, resampling warps, denoise-to-regrain operations, seam smoothing, small color and gamma shifts, and social-app transcodes, while test-time defense injects low-cost jitters such as resize and crop phase changes, mild gamma variation, and JPEG phase shifts with aggregated predictions. Heatmaps are guided to concentrate within face regions using face-box masks without strict pixel-level annotations. Evaluation on existing benchmarks, including standard deepfake datasets and a surveillance-style split with low light and heavy compression, reports clean and attacked performance, AUC, worst-case accuracy, reliability, abstention quality, and weak-localization scores. Results demonstrate near-perfect ranking across attacks, low calibration error, minimal abstention risk, and controlled degradation under regrain, establishing a modular, data-efficient, and practically deployable baseline for attack-aware detection with calibrated probabilities and actionable heatmaps.
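The two defensive ideas named in this abstract, worst-of-K attack selection during training and aggregated predictions over test-time jitters, can be sketched generically. The abstract applies worst-of-K per batch, while this sketch selects per sample for brevity; the helper names are hypothetical, the attacks, jitters, loss, and predictor are passed in as plain callables, and mean aggregation is an assumption since the abstract does not say how predictions are combined.

```python
def worst_of_k(sample, label, attacks, loss_fn):
    """Apply K candidate attacks and keep the variant with the highest loss.

    `attacks` is a list of callables sample -> attacked sample, and
    `loss_fn(sample, label)` returns the current model's loss on that sample.
    """
    candidates = [attack(sample) for attack in attacks]
    return max(candidates, key=lambda attacked: loss_fn(attacked, label))


def jittered_predict(sample, jitters, predict_fn):
    """Average the model's score over several cheap test-time jitters.

    Assumption: simple mean aggregation of scalar scores.
    """
    scores = [predict_fn(jitter(sample)) for jitter in jitters]
    return sum(scores) / len(scores)
```

In training, the sample returned by `worst_of_k` would replace the clean sample in the loss; at inference, `jittered_predict` stands in for a single forward pass.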

[94] PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation

Darrin Bright,Rakshith Raj,Kanchan Keisham

Main category: cs.CV

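TL;DR: Proposes PortionNet, a cross-modal knowledge distillation framework that uses point cloud features to train a lightweight adapter network, enabling accurate food nutrition estimation from RGB images alone.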

Details

Motivation: Accurate food nutrition estimation from a single image is difficult because 3D information is lost, and depth-based methods are unavailable on most smartphones because they require depth sensors. Method: PortionNet uses cross-modal knowledge distillation to learn geometric features from point clouds during training while requiring only RGB images at inference; a dual-mode training strategy lets a lightweight adapter network mimic point cloud representations, enabling pseudo-3D reasoning. Result: PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation; cross-dataset evaluation on SimpleFood45 shows strong generalization. Conclusion: PortionNet delivers accurate food nutrition estimation without specialized hardware, making it practical and broadly applicable. Abstract: Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.

[95] MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Run Ling,Ke Cao,Jian Lu,Ao Ma,Haowei Liu,Runze He,Changwei Wang,Rongtao Xu,Yihua Shao,Zhanjie Zhang,Peng Wu,Guibing Guo,Wei Feng,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Xingwei Wang

Main category: cs.CV

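TL;DR: Proposes MoFu, a framework that addresses scale inconsistency and permutation sensitivity in multi-subject video generation through scale-aware modulation, a Fourier fusion strategy, and a stability loss, significantly outperforming existing methods on a newly built benchmark.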

Details

Motivation: Existing multi-subject video generation methods suffer from inconsistent subject scale and sensitivity to the order of reference inputs, which hurts the naturalness and fidelity of the results. Method: MoFu combines (1) LLM-guided Scale-Aware Modulation (SMO), which extracts implicit scale cues from the prompt; (2) a Fourier Fusion strategy that uses the FFT to integrate the frequency information of reference features; and (3) a Scale-Permutation Stability Loss that encourages consistency. Result: On a benchmark with controlled variations in scale and permutation, MoFu outperforms existing methods in subject scale consistency, visual fidelity, and overall quality. Conclusion: MoFu effectively addresses scale inconsistency and permutation sensitivity in multi-subject video generation, improving the naturalness and stability of generated videos. Abstract: Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.

[96] VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

Yang Ding,Yizhen Zhang,Xin Lai,Ruihang Chu,Yujiu Yang

Main category: cs.CV

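TL;DR: Proposes VideoZoomer, an agentic framework that lets multimodal large language models dynamically control their visual focus, using a temporal zoom tool to support fine-grained reasoning in long video understanding.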

Details

Motivation: Existing MLLMs are constrained by the context window in long video understanding and typically rely on uniform sampling or static frame pre-selection, which can miss key information and cannot correct an initial selection error. Method: VideoZoomer starts from a low-frame-rate overview and invokes a temporal zoom tool to fetch high-frame-rate clips at autonomously chosen moments, gathering fine-grained evidence over multiple interactive turns; training has two stages, cold-start supervised fine-tuning followed by reinforcement learning to refine the agentic policy. Result: The 7B model performs strongly across multiple long video understanding and reasoning benchmarks, shows diverse and complex reasoning patterns, remains efficient under reduced frame budgets, and surpasses existing open-source models while rivaling proprietary systems. Conclusion: Through dynamic focusing, VideoZoomer improves both the reasoning ability and the efficiency of MLLMs on long video understanding, showing strong potential for fine-grained analysis. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which might overlook critical evidence and cannot correct initial selection errors during the reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
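The coarse-overview-then-zoom loop described in this abstract can be sketched at the level of frame timestamps. Everything below is illustrative: `choose_zoom` is a hypothetical stand-in for the learned policy (in the paper this decision is made by the MLLM itself), and the frame counts and turn limit are arbitrary assumptions.

```python
def uniform_timestamps(start_s, end_s, num_frames):
    """Return num_frames evenly spaced timestamps in [start_s, end_s)."""
    span = end_s - start_s
    return [start_s + span * i / num_frames for i in range(num_frames)]


def gather_evidence(duration_s, choose_zoom, overview_frames=6,
                    zoom_frames=5, max_turns=3):
    """Start from a sparse overview, then densely sample chosen windows.

    `choose_zoom(seen)` is a hypothetical policy: given the timestamps seen so
    far, it returns a (start_s, end_s) window to zoom into, or None to stop.
    """
    seen = uniform_timestamps(0, duration_s, overview_frames)
    for _ in range(max_turns):
        window = choose_zoom(seen)
        if window is None:
            break  # the policy decided it has enough evidence
        seen += uniform_timestamps(window[0], window[1], zoom_frames)
    return sorted(set(seen))
```

For a 600-second video, a policy that zooms once into the 200-210 second window adds five closely spaced frames to the six overview frames, which keeps the total frame budget small compared with dense uniform sampling.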

[97] SpotEdit: Selective Region Editing in Diffusion Transformers

Zhibin Qin,Zhenxiong Tan,Zeqing Wang,Songhua Liu,Xinchao Wang

Main category: cs.CV

TL;DR: Proposes SpotEdit, a training-free diffusion editing framework that updates only the modified regions, giving efficient and precise image editing.

Details

Motivation: Existing methods process and denoise all tokens uniformly at every timestep, which causes redundant computation and can degrade unchanged areas, so a more efficient editing method is needed. Method: SpotEdit has two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with the edited tokens through a dynamic fusion mechanism. Result: SpotEdit reduces unnecessary computation while maintaining high fidelity in unmodified areas, giving efficient and precise image editing. Conclusion: Through selective updating, SpotEdit improves efficiency without sacrificing editing quality, showing that not every region needs to be regenerated during editing. Abstract: Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.
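The selection step attributed to SpotSelector, separating stable tokens from tokens that must be recomputed, can be illustrated with a toy sketch. The abstract says only that stability is judged by perceptual similarity; cosine similarity between per-token feature vectors, the 0.95 threshold, and the function names are assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_tokens(cond_tokens, current_tokens, threshold=0.95):
    """Partition token indices into (stable, edited).

    Stable tokens closely match the conditional image, so their features could
    be reused; edited tokens would still be denoised at this timestep.
    """
    stable, edited = [], []
    for index, (cond, cur) in enumerate(zip(cond_tokens, current_tokens)):
        (stable if cosine(cond, cur) >= threshold else edited).append(index)
    return stable, edited
```

Only the indices in `edited` would be passed through the transformer at a given step, which is where the computational saving described in the abstract would come from.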

[98] DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models

Jianrong Zhang,Hehe Fan,Yi Yang

Main category: cs.CV

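TL;DR: Proposes DeMoGen, a decompositional training paradigm based on energy-based diffusion models that breaks complex human motions into semantically meaningful sub-components and supports recombining these motion concepts to generate novel motions.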

Details

Motivation: Existing methods focus on forward modeling from text to motion or on directly composing motion concepts, and cannot semantically decompose a complex motion. This work takes the inverse view, learning decompositions that uncover reusable motion primitives. Method: DeMoGen uses an energy-based diffusion model to capture the composed distribution of multiple motion concepts, with three training variants: DeMoGen-Exp trains explicitly on decomposed text prompts, DeMoGen-OSS performs orthogonal self-supervised decomposition, and DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. Result: The model decomposes complex motions into semantic sub-components without ground-truth motions for individual concepts, and the resulting primitives can be flexibly recombined to generate diverse, novel motions beyond the training distribution; a new text-decomposed dataset is also constructed. Conclusion: DeMoGen enables effective decomposition and recombination of complex human motion, advancing compositionality and interpretability in motion generation and providing a new training paradigm and data resource for text-to-motion generation. Abstract: Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition.

[99] The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma

Mariya Miteva,Maria Nisheva-Pavlova

Main category: cs.CV

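TL;DR: Proposes a multi-view latent representation learning framework based on variational autoencoders that integrates radiomic features from T1Gd and FLAIR MRI for non-invasive prediction of MGMT promoter methylation in glioblastoma.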

Details

Motivation: Conventional unimodal and early-fusion approaches model modality-specific information incompletely and suffer from high feature redundancy, which limits MGMT status prediction in radiogenomics. Method: Independent probabilistic encoders encode the T1Gd and FLAIR MRI modalities separately, and fusion is performed in a compact latent space; the variational autoencoder framework learns multi-view latent representations that are then used for MGMT promoter methylation classification. Result: According to the summary, the approach preserves modality-specific structure while integrating multimodal information and improves MGMT status prediction; the abstract itself reports no quantitative results. Conclusion: The multi-view latent learning framework offers a non-invasive route to predicting molecular characteristics in radiogenomics, with potential for clinical use. Abstract: Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and an incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) to integrate complementary radiomic features derived from post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) magnetic resonance imaging (MRI). By encoding each modality through an independent probabilistic encoder and performing fusion in a compact latent space, the proposed approach preserves modality-specific structure while enabling effective multimodal integration. The resulting latent embeddings are subsequently used for MGMT promoter methylation classification.

[100] Feature Learning with Multi-Stage Vision Transformers on Inter-Modality HER2 Status Scoring and Tumor Classification on Whole Slides

Olaide N. Oyelade,Oliver Hoxey,Yulia Humrye

Main category: cs.CV

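TL;DR: Proposes an end-to-end vision transformer pipeline that jointly analyzes H&E and IHC stained whole slide images (WSIs) to produce pixel-level HER2 status scores (0, 1+, 2+, 3+), with high accuracy in both tumor localization and HER2 status prediction.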

Details

Motivation: Accurate assessment of HER2 protein expression is essential for cancer treatment, but existing deep learning methods struggle to jointly analyze H&E and IHC images and to localize HER2 status at the pixel level. Method: An end-to-end vision transformer framework processes H&E WSIs patch by patch for tumor localization, uses a new mapping function to link malignant H&E regions with the corresponding IHC regions, and embeds a clinically inspired HER2 scoring mechanism for automatic pixel-level 4-way scoring. Result: On a privately curated dataset, tumor localization shows good classification accuracy; 4-way HER2 status prediction reaches an accuracy of 0.94 and a specificity of 0.933, and the method accurately distinguishes HER2-negative from HER2-positive cases. Conclusion: The work shows that jointly using H&E and IHC images in an end-to-end ViT-based model for HER2 scoring is feasible and effective, with potential for clinical use. Abstract: The popular use of histopathology images, such as hematoxylin and eosin (H&E), has proven to be useful in detecting tumors. However, moving such cancer cases forward for treatment requires an accurate assessment of the amount of human epidermal growth factor receptor 2 (HER2) protein expression. Predicting both the lower and higher levels of HER2 can be challenging. Moreover, jointly analyzing H&E and immunohistochemistry (IHC) stained images for HER2 scoring is difficult. Although several deep learning methods have been investigated to address the challenge of HER2 scoring, they struggle to provide pixel-level localization of HER2 status. In this study, we propose a single end-to-end pipeline using a system of vision transformers for HER2 status scoring on whole slide images (WSIs). The method includes patch-wise processing of H&E WSIs for tumor localization. A novel mapping function is proposed to identify the IHC WSI regions that correspond to malignant regions on H&E. A clinically inspired HER2 scoring mechanism is embedded in the pipeline and allows for automatic pixel-level annotation of 4-way HER2 scoring (0, 1+, 2+, and 3+). The proposed method also accurately distinguishes HER2-negative from HER2-positive cases. Privately curated datasets were collaboratively extracted from 13 different cases of WSIs of H&E and IHC. A thorough experiment is conducted on the proposed method. Results showed good classification accuracy during tumor localization. In addition, a classification accuracy of 0.94 and a specificity of 0.933 were obtained for the prediction of HER2 status in the 4-way scoring setting. The applicability of the proposed pipeline was investigated using WSI patches for comparison with human pathologists. Findings from the study showed the usability of jointly evaluated H&E and IHC images on end-to-end ViT-based models for HER2 scoring.

[101] Human-like visual computing advances explainability and few-shot learning in deep neural networks for complex physiological data

Alaa Alahmadi,Mohamed Hasan

Main category: cs.CV

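TL;DR: Proposes a perception-informed pseudo-colouring technique that converts clinically salient temporal ECG features (such as the QT interval) into structured colour representations, markedly improving the few-shot learning ability and explainability of deep neural networks, especially in data-scarce clinical settings such as drug-induced long QT syndrome.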

Details

Motivation: Existing deep learning models for physiological signal analysis depend on large labelled datasets and lack explainability, which limits their trustworthiness and use in clinical settings. This work aims to improve data efficiency and explainability and to bring model reasoning closer to human clinical reasoning. Method: A perception-informed pseudo-colouring method maps key temporal ECG features (such as the QT interval) into colour space; prototypical networks with a ResNet-18 architecture are evaluated in one-shot and few-shot settings on single-cycle and 10-second rhythm images, and attention visualizations are used to analyse explainability. Result: The models classify effectively from as few as one to five training examples; pseudo-colouring guides attention toward clinically relevant features and suppresses irrelevant signal components; aggregating multiple cardiac cycles further improves performance, resembling human perceptual averaging. Conclusion: Human-perception-inspired encoding can improve data efficiency, explainability, and causal reasoning in medical AI at the same time, offering a practical route for small-sample, high-stakes clinical diagnosis. Abstract: Machine vision models, particularly deep neural networks, are increasingly applied to physiological signal interpretation, including electrocardiography (ECG), yet they typically require large training datasets and offer limited insight into the causal features underlying their predictions. This lack of data efficiency and interpretability constrains their clinical reliability and alignment with human reasoning. Here, we show that a perception-informed pseudo-colouring technique, previously demonstrated to enhance human ECG interpretation, can improve both explainability and few-shot learning in deep neural networks analysing complex physiological data. We focus on acquired, drug-induced long QT syndrome (LQTS) as a challenging case study characterised by heterogeneous signal morphology, variable heart rate, and scarce positive cases associated with life-threatening arrhythmias such as torsades de pointes. This setting provides a stringent test of model generalisation under extreme data scarcity. By encoding clinically salient temporal features, such as QT-interval duration, into structured colour representations, models learn discriminative and interpretable features from as few as one or five training examples. Using prototypical networks and a ResNet-18 architecture, we evaluate one-shot and few-shot learning on ECG images derived from single cardiac cycles and full 10-second rhythms. Explainability analyses show that pseudo-colouring guides attention toward clinically meaningful ECG features while suppressing irrelevant signal components. Aggregating multiple cardiac cycles further improves performance, mirroring human perceptual averaging across heartbeats. Together, these findings demonstrate that human-like perceptual encoding can bridge data efficiency, explainability, and causal reasoning in medical machine intelligence.
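The one-shot and few-shot classification step can be illustrated with the standard prototypical-network rule: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype by squared Euclidean distance. The embeddings below are made-up 2-D vectors; in the paper they would come from a ResNet-18 encoder applied to pseudo-coloured ECG images, and the class names are placeholders.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def build_prototypes(support):
    """Map each class name to the mean of its support embeddings."""
    return {name: mean_vector(embeddings) for name, embeddings in support.items()}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, prototypes):
    """Return the class whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda name: squared_distance(query, prototypes[name]))
```

With a single support example per class the prototype is simply that example's embedding, which is why the same rule covers both the one-shot and the five-shot settings mentioned in the abstract.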

[102] VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

Zhengfei Kuang,Rui Lin,Long Zhao,Gordon Wetzstein,Saining Xie,Sanghyun Woo

Main category: cs.CV

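TL;DR: Proposes a 3D object arrangement method built on multimodal large language models (MLLMs). An MCP-based API, specialized visual tools, and a collaborative multi-agent framework address the weak visual grounding, limited perception, and iterative errors of MLLMs in 3D scenes, significantly outperforming existing baselines on 25 complex tasks.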

Details

Motivation: Although MLLMs have made remarkable progress on 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored, with clear shortcomings in precise 3D object manipulation and in aligning programmatic edits with 3D outcomes. Method: (1) An MCP-based API shifts interaction from raw code manipulation to function-level updates; (2) a suite of specialized visual tools strengthens the MLLM's understanding of 3D scene state and spatial information and validates action outcomes; (3) a collaborative multi-agent framework with separate planning, execution, and verification roles handles multi-step instructions and error recovery. Result: Experiments on 25 complex 3D object arrangement tasks show that the method significantly outperforms existing baselines, improving the precision and robustness of MLLM-driven manipulation in 3D scenes. Conclusion: API abstraction, perceptual augmentation, and multi-agent collaboration can compensate for MLLMs' weaknesses in 3D scene manipulation and offer a viable path to applying MLLMs in complex 3D environments. Abstract: Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: vulcan-3d.github.io

[103] Self-Evaluation Unlocks Any-Step Text-to-Image Generation

Xin Yu,Xiaojuan Qi,Zhengqi Li,Kai Zhang,Richard Zhang,Zhe Lin,Eli Shechtman,Tianyu Wang,Yotam Nitzan

Main category: cs.CV

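TL;DR: This paper proposes the Self-Evaluating Model (Self-E), a new from-scratch training approach for text-to-image generation that supports any-step inference. It combines Flow Matching style learning from data with a novel self-evaluation mechanism, requires no pretrained teacher, and enables efficient, scalable generation.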

Details

Motivation: Traditional diffusion or flow models rely on local supervision and need many inference steps, while distillation methods depend on a pretrained teacher. This work aims for an efficient text-to-image model that needs no teacher, supports any-step inference, and can be trained from scratch. Method: Self-E learns from data in a manner similar to Flow Matching and adds a self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, acting as a dynamic self-teacher. Instantaneous local learning is combined with self-driven global matching in end-to-end training. Result: Experiments on large-scale text-to-image benchmarks show that Self-E performs well at very low step counts, is competitive with state-of-the-art Flow Matching models at 50 steps, and keeps improving as the number of steps increases. Conclusion: To the authors' knowledge, Self-E is the first from-scratch, any-step text-to-image model, providing a unified, efficient, and scalable generation framework. Abstract: We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.

[104] iOSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI

Himanshu Naidu,Yuxiang Zhang,Sachin Mehta,Anat Caspi

Main category: cs.CV

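TL;DR: iOSPointMapper is a mobile application for iPhone and iPad that uses on-device semantic segmentation, LiDAR-based depth estimation, and fused localization to map sidewalk features in real time while preserving privacy. User validation improves data quality, supporting scalable collection of pedestrian infrastructure data.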

Details

Motivation: Current sidewalk data collection methods are costly, fragmented, and hard to scale, and the lack of accurate, up-to-date data hinders the development of accessible pedestrian environments. Method: The iOSPointMapper mobile application combines on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU localization to detect and localize sidewalk-relevant features such as traffic signs, traffic lights, and poles in real time on the device; a user-guided annotation interface validates system outputs; data is anonymized and submitted to the TDEI platform, where it is integrated with multimodal transportation data. Result: The system performs well in feature detection and spatial mapping, identifying sidewalk-related elements with good localization accuracy; user participation improves data reliability, and the data integrates seamlessly with existing transportation datasets. Conclusion: iOSPointMapper offers a scalable, user-centered approach that closes critical data gaps in pedestrian infrastructure and supports more inclusive, accessible urban walking environments. Abstract: Accurate, up-to-date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real-time, privacy-conscious sidewalk mapping on the ground, using recent-generation iPhones and iPads. The system leverages on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user-guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system's feature detection and spatial mapping performance reveal the application's potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user-centered approach to closing critical data gaps in pedestrian
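
To make the localization step concrete, here is a minimal sketch of how a detected feature could be placed on the map from a device position, a compass heading, and a depth reading. The function `locate_feature`, its parameters, and the flat-Earth approximation are assumptions made for illustration; the abstract does not describe the app's actual fusion or projection method.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def locate_feature(lat_deg, lon_deg, heading_deg, depth_m, bearing_offset_deg=0.0):
    """Estimate a feature's latitude/longitude from device pose and depth.

    heading_deg is the device heading clockwise from north; bearing_offset_deg
    is the (hypothetical) horizontal angle of the feature relative to the
    camera axis. Uses a local flat-Earth approximation, which is reasonable
    for ranges of a few tens of metres.
    """
    bearing = math.radians(heading_deg + bearing_offset_deg)
    north_m = depth_m * math.cos(bearing)
    east_m = depth_m * math.sin(bearing)
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    dlon = math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(lat_deg))))
    return lat_deg + dlat, lon_deg + dlon


# A pole detected 10 m straight ahead of a device facing north.
pole_lat, pole_lon = locate_feature(47.0, -122.0, heading_deg=0.0, depth_m=10.0)
```

In practice the heading and position would come from the fused GPS/IMU estimate and the depth from the LiDAR sensor, and their errors would dominate the small error of the flat-Earth simplification.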

[105] DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization

Hansang Lee,Chaelin Lee,Nieun Seo,Joon Seok Lim,Helen Hong

Main category: cs.CV

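TL;DR: DeFloMat is a new generative object detection framework that uses Conditional Flow Matching (CFM) to address the high inference latency of diffusion models in clinical applications, outperforming DiffusionDet in only 3 steps.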

Details

Motivation: Diffusion models are accurate but slow at inference, which makes them hard to use in time-sensitive clinical settings (such as Crohn's disease detection), so the trade-off between accuracy and efficiency needs to be resolved. Method: A deterministic flow field derived from Conditional Optimal Transport theory (approximating Rectified Flow) formulates detection as the fast solution of an ordinary differential equation, replacing the multi-step denoising of conventional diffusion models. Result: On a clinical MRE dataset, the model reaches 43.32% AP_{10:50} with only 3 inference steps, exceeding DiffusionDet's 31.03% at 4 steps, with better recall and stability. Conclusion: DeFloMat maintains high accuracy while greatly shortening inference time, setting a new standard for fast, stable object detection that is particularly suited to clinical scenarios. Abstract: We propose DeFloMat (Detection with Flow Matching), a novel generative object detection framework that addresses the critical latency bottleneck of diffusion-based detectors, such as DiffusionDet, by integrating Conditional Flow Matching (CFM). Diffusion models achieve high accuracy by formulating detection as a multi-step stochastic denoising process, but their reliance on numerous sampling steps ($T \gg 60$) makes them impractical for time-sensitive clinical applications like Crohn's Disease detection in Magnetic Resonance Enterography (MRE). DeFloMat replaces this slow stochastic path with a highly direct, deterministic flow field derived from Conditional Optimal Transport (OT) theory, specifically approximating the Rectified Flow. This shift enables fast inference via a simple Ordinary Differential Equation (ODE) solver. We demonstrate the superiority of DeFloMat on a challenging MRE clinical dataset. Crucially, DeFloMat achieves state-of-the-art accuracy ($43.32\% \text{ } AP_{10:50}$) in only $3$ inference steps, which represents a $1.4\times$ performance improvement over DiffusionDet's maximum converged performance ($31.03\% \text{ } AP_{10:50}$ at $4$ steps). Furthermore, our deterministic flow significantly enhances localization characteristics, yielding superior Recall and stability in the few-step regime. DeFloMat resolves the trade-off between generative accuracy and clinical efficiency, setting a new standard for stable and rapid object localization.
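
The few-step ODE inference described above can be sketched with a fixed-step Euler solver. This is a toy illustration, not DeFloMat's implementation: `euler_flow`, `toy_velocity`, and the box values are hypothetical, and the velocity is an ideal straight-line field rather than a learned network. It also checks the reported $1.4\times$ ratio from the two AP values in the abstract.

```python
def euler_flow(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target box (cx, cy, w, h) that an ideal flow should reach.
TARGET_BOX = [0.5, 0.4, 0.2, 0.3]


def toy_velocity(x, t):
    """Ideal straight-line field: on the path x_t = (1 - t) x_0 + t x_1,
    the velocity x_1 - x_0 equals (x_1 - x_t) / (1 - t)."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET_BOX, x)]


noisy_box = [0.9, 0.1, 0.6, 0.7]
box_after_3_steps = euler_flow(noisy_box, toy_velocity, steps=3)

# Ratio of the two AP_{10:50} values quoted in the abstract.
ap_ratio = 43.32 / 31.03
```

Because the toy field is perfectly straight, three Euler steps land on the target up to floating-point error; a learned field is only approximately straight, which is why the number of steps still affects accuracy in practice.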

[106] Bright 4B: Scaling Hyperspherical Learning for Segmentation in 3D Brightfield Microscopy

Amil Khan,Matheus Palhares Viana,Suraj Mishra,B. S. Manjunath

Main category: cs.CV

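TL;DR: Bright-4B is a 4-billion-parameter foundation model for segmenting 3D brightfield microscopy images, achieving morphology-accurate organelle segmentation without fluorescent labels or complex post-processing.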

Details

Motivation: Label-free 3D brightfield microscopy is fast and noninvasive, but it lacks robust volumetric segmentation methods and usually depends on fluorescent labelling or heavy post-processing, which limits its wider use. Method: Bright-4B learns on the unit hypersphere and combines a hardware-aligned Native Sparse Attention mechanism, depth-width residual HyperConnections, and a soft Mixture-of-Experts to increase modelling capacity, together with an anisotropic patch embedding that respects the confocal point-spread function and axial thinning. Result: Across multiple confocal datasets, Bright-4B accurately segments nuclei, mitochondria, and other subcellular structures, preserves fine structural detail across depth and cell types, and outperforms contemporary CNN and Transformer baselines. Conclusion: Bright-4B achieves high-fidelity 3D segmentation of cellular structures from brightfield images alone, advancing large-scale label-free cell mapping; code and pretrained weights will be fully released. Abstract: Label-free 3D brightfield microscopy offers a fast and noninvasive way to visualize cellular morphology, yet robust volumetric segmentation still typically depends on fluorescence or heavy post-processing. We address this gap by introducing Bright-4B, a 4 billion parameter foundation model that learns on the unit hypersphere to segment subcellular structures directly from 3D brightfield volumes. Bright-4B combines a hardware-aligned Native Sparse Attention mechanism (capturing local, coarse, and selected global context), depth-width residual HyperConnections that stabilize representation flow, and a soft Mixture-of-Experts for adaptive capacity. A plug-and-play anisotropic patch embed further respects confocal point-spread and axial thinning, enabling geometry-faithful 3D tokenization. The resulting model produces morphology-accurate segmentations of nuclei, mitochondria, and other organelles from brightfield stacks alone--without fluorescence, auxiliary channels, or handcrafted post-processing. Across multiple confocal datasets, Bright-4B preserves fine structural detail across depth and cell types, outperforming contemporary CNN and Transformer baselines. All code, pretrained weights, and models for downstream finetuning will be released to advance large-scale, label-free 3D cell mapping.

[107] FluenceFormer: Transformer-Driven Multi-Beam Fluence Map Regression for Radiotherapy Planning

Ujunwa Mgboh,Rafi Ibn Sultan,Joshua Kim,Kundan Thind,Dongxiao Zhu

Main category: cs.CV

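TL;DR: This paper proposes FluenceFormer, a transformer-based framework for direct, geometry-aware fluence map prediction, aimed at improving dose distributions in radiotherapy planning.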

Details

Motivation: Because the relationship between anatomy and beam modulation is complex, fluence map prediction is an ill-posed inverse problem; conventional convolutional methods struggle to capture long-range dependencies, leading to structurally inconsistent or physically unrealizable plans. Method: FluenceFormer uses a two-stage design: Stage 1 predicts a global dose prior from anatomical inputs, and Stage 2 conditions on beam geometry to regress physically calibrated fluence maps, trained with the physics-informed Fluence-Aware Regression (FAR) loss. Result: On a prostate IMRT dataset, FluenceFormer with Swin UNETR performs best, reducing Energy Error to 4.5% and significantly outperforming existing CNN and single-stage methods in structural fidelity (p < 0.05). Conclusion: By fusing anatomical and geometric information with physical constraints, FluenceFormer shows strong performance and generality in fluence map prediction and offers a more reliable solution for automated radiotherapy planning. Abstract: Fluence map prediction is central to automated radiotherapy planning but remains an ill-posed inverse problem due to the complex relationship between volumetric anatomy and beam-intensity modulation. Convolutional methods in prior work often struggle to capture long-range dependencies, which can lead to structurally inconsistent or physically unrealizable plans. We introduce FluenceFormer, a backbone-agnostic transformer framework for direct, geometry-aware fluence regression. The model uses a unified two-stage design: Stage 1 predicts a global dose prior from anatomical inputs, and Stage 2 conditions this prior on explicit beam geometry to regress physically calibrated fluence maps. Central to the approach is the Fluence-Aware Regression (FAR) loss, a physics-informed objective that integrates voxel-level fidelity, gradient smoothness, structural consistency, and beam-wise energy conservation. We evaluate the generality of the framework across multiple transformer backbones, including Swin UNETR, UNETR, nnFormer, and MedFormer, using a prostate IMRT dataset. FluenceFormer with Swin UNETR achieves the strongest performance among the evaluated models and improves over existing benchmark CNN and single-stage methods, reducing Energy Error to $4.5\%$ and yielding statistically significant gains in structural fidelity ($p < 0.05$).
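
The idea of a composite, physics-informed objective can be sketched as a weighted sum of terms. The version below is an assumption-laden illustration, not the paper's FAR loss: the function name `far_style_loss`, the term definitions, and the weights are hypothetical, and the structural-consistency term is omitted for brevity.

```python
def far_style_loss(pred, target, w_fidelity=1.0, w_grad=0.1, w_energy=0.1):
    """Toy composite loss over one 2D fluence map (lists of rows).

    Terms (all hypothetical definitions):
      fidelity: mean absolute per-pixel error
      grad:     mean absolute error between horizontal finite differences
      energy:   relative error between total predicted and target fluence
    """
    rows, cols = len(target), len(target[0])
    fidelity = sum(
        abs(pred[i][j] - target[i][j]) for i in range(rows) for j in range(cols)
    ) / (rows * cols)

    grad = 0.0
    for i in range(rows):
        for j in range(cols - 1):
            d_pred = pred[i][j + 1] - pred[i][j]
            d_tgt = target[i][j + 1] - target[i][j]
            grad += abs(d_pred - d_tgt)
    grad /= max(rows * (cols - 1), 1)

    e_pred = sum(sum(row) for row in pred)
    e_tgt = sum(sum(row) for row in target)
    energy = abs(e_pred - e_tgt) / max(abs(e_tgt), 1e-12)

    return w_fidelity * fidelity + w_grad * grad + w_energy * energy


target_map = [[0.0, 1.0], [1.0, 2.0]]
doubled_map = [[0.0, 2.0], [2.0, 4.0]]
loss_value = far_style_loss(doubled_map, target_map)
```

An energy term of this kind penalizes a prediction whose total fluence differs from the reference even when its shape is right, which is the intuition behind a beam-wise energy-conservation constraint.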

[108] EmoCtrl: Controllable Emotional Image Content Generation

Jingyuan Yang,Weibin Luo,Hui Huang

Main category: cs.CV

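TL;DR: This paper proposes EmoCtrl, a model for image generation that is both faithful to content and controllable in emotion. Textual and visual emotion enhancement modules combine the content description with the target emotion, improving emotional expression while keeping image content consistent.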

Details

Motivation: Existing text-to-image models preserve content consistency well but cannot control emotion, whereas emotion-driven models produce affective images at the cost of content distortion. A generation method is needed that stays faithful to content while accurately expressing a target emotion. Method: EmoCtrl includes textual and visual emotion enhancement modules, learns emotion tokens from a dataset annotated with content, emotion, and affective prompts, and strengthens emotional expression through descriptive semantics and perceptual cues. Result: Experiments show that EmoCtrl outperforms existing methods in both quantitative and qualitative evaluation, achieving faithful content generation and precise emotion control together; user studies show that its results better match human preference. Conclusion: EmoCtrl effectively resolves the trade-off between content fidelity and emotional expression, generalizes well, and demonstrates the effectiveness and adaptability of the learned emotion tokens. Abstract: An image conveys meaning through both its visual content and emotional tone, jointly shaping human perception. We introduce Controllable Emotional Image Content Generation (C-EICG), which aims to generate images that remain faithful to a given content description while expressing a target emotion. Existing text-to-image models ensure content consistency but lack emotional awareness, whereas emotion-driven models generate affective results at the cost of content distortion. To address this gap, we propose EmoCtrl, supported by a dataset annotated with content, emotion, and affective prompts, bridging abstract emotions to visual cues. EmoCtrl incorporates textual and visual emotion enhancement modules that enrich affective expression via descriptive semantics and perceptual cues. The learned emotion tokens exhibit complementary effects, as demonstrated through ablations and visualizations. Quantitative and qualitative experiments demonstrate that EmoCtrl achieves faithful content and expressive emotion control, outperforming existing methods across multiple aspects. User studies confirm EmoCtrl's strong alignment with human preference. Moreover, EmoCtrl generalizes well to creative applications, further demonstrating the robustness and adaptability of the learned emotion tokens.

[109] SuperiorGAT: Graph Attention Networks for Sparse LiDAR Point Cloud Reconstruction in Autonomous Systems

Khalfalla Awedat,Mohamed Abidalrekab,Gurcan Comert,Mustafa Ayad

Main category: cs.CV

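TL;DR: This paper proposes SuperiorGAT, a graph attention network framework for reconstructing missing elevation information in sparse LiDAR point clouds. By modelling scans as beam-aware graphs and combining gated residual fusion with feed-forward refinement, it achieves accurate reconstruction without increasing network depth.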

Details

Motivation: LiDAR perception is limited by fixed vertical resolution and by beam dropout caused by environmental occlusion, which degrades 3D scene understanding in autonomous driving systems. Method: LiDAR scans are modelled as beam-aware graphs, features are learned with graph attention, and gated residual fusion with a feed-forward refinement module improves reconstruction accuracy while avoiding the computational cost of a deeper network. Result: Across several KITTI scenes (Person, Road, Campus, City), evaluated by simulating the loss of every fourth vertical beam, SuperiorGAT achieves lower reconstruction error and better geometric consistency than PointNet and deeper GAT baselines; X-Z projections show that it preserves structural integrity with little vertical distortion. Conclusion: Architectural improvements can efficiently raise the resolution and perception quality of LiDAR point clouds without extra hardware, providing a lightweight and effective solution for practical autonomous driving. Abstract: LiDAR-based perception in autonomous systems is constrained by fixed vertical beam resolution and further compromised by beam dropout resulting from environmental occlusions. This paper introduces SuperiorGAT, a graph attention-based framework designed to reconstruct missing elevation information in sparse LiDAR point clouds. By modeling LiDAR scans as beam-aware graphs and incorporating gated residual fusion with feed-forward refinement, SuperiorGAT enables accurate reconstruction without increasing network depth. To evaluate performance, structured beam dropout is simulated by removing every fourth vertical scanning beam. Extensive experiments across diverse KITTI environments, including Person, Road, Campus, and City sequences, demonstrate that SuperiorGAT consistently achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines. Qualitative X-Z projections further confirm the model's ability to preserve structural integrity with minimal vertical distortion. These results suggest that architectural refinement offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware.
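
The evaluation protocol of removing every fourth vertical beam can be sketched in a few lines. The point layout `(x, y, z, beam_index)`, the function name, and the choice of which beam in each group of four is dropped (`phase=3`) are assumptions; the abstract only states that every fourth beam is removed.

```python
def simulate_beam_dropout(points, period=4, phase=3):
    """Split a scan into kept points and held-out points from dropped beams.

    points: iterable of (x, y, z, beam_index) tuples.
    A beam is dropped when beam_index % period == phase (phase is a
    hypothetical choice; any fixed phase removes every fourth beam).
    """
    kept = [p for p in points if p[3] % period != phase]
    dropped = [p for p in points if p[3] % period == phase]
    return kept, dropped


# Toy scan: one point per beam for an assumed 64-beam sensor.
scan = [(1.0, 0.0, 0.1 * b, b) for b in range(64)]
sparse_scan, held_out = simulate_beam_dropout(scan)
```

The held-out points serve as ground truth: a reconstruction model receives `sparse_scan` and is scored on how closely its predictions match `held_out`.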

[110] LECalib: Line-Based Event Camera Calibration

Zibin Liu,Banglei Guan,Yang Shang,Zhenbao Yu,Yifei Bian,Qifeng Yu

Main category: cs.CV

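TL;DR: This paper proposes a line-based event camera calibration framework that detects lines directly from event streams and uses their geometry to estimate camera parameters, exploiting structures common in man-made environments.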

Details

Motivation: Existing event camera calibration methods are time-consuming and depend on manually placed calibration objects, making them ill-suited to rapidly changing scenes, so a more efficient, automated calibration method is needed. Method: A line-based event camera calibration framework detects lines directly from event streams, builds an event-line calibration model over planar and non-planar lines to produce initial camera parameter estimates, and refines the parameters with non-linear optimization. Result: Simulation and real-world experiments on monocular and stereo event cameras verify the feasibility and accuracy of the method. Conclusion: The method needs neither flashing patterns nor reconstructed intensity images and efficiently uses line structures common in the environment to achieve accurate calibration, with good practicality and extensibility. Abstract: Camera calibration is an essential prerequisite for event-based vision applications. Current event camera calibration methods typically involve using flashing patterns, reconstructing intensity images, and utilizing the features extracted from events. Existing methods are generally time-consuming and require manually placed calibration objects, which cannot meet the needs of rapidly changing scenarios. In this paper, we propose a line-based event camera calibration framework exploiting the geometric lines of commonly-encountered objects in man-made environments, e.g., doors, windows, boxes, etc. Different from previous methods, our method detects lines directly from event streams and leverages an event-line calibration model to generate the initial guess of camera parameters, which is suitable for both planar and non-planar lines. Then, a non-linear optimization is adopted to refine camera parameters. Both simulation and real-world experiments have demonstrated the feasibility and accuracy of our method, with validation performed on monocular and stereo event cameras. The source code is released at https://github.com/Zibin6/line_based_event_camera_calib.

[111] Towards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion Framework

Zhicheng Zhao,Yuancheng Xu,Andong Lu,Chenglong Li,Jin Tang

Main category: cs.CV

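TL;DR: This paper proposes the Quality-Aware Dynamic Fusion Network (QDFNet), a robust object detection method for fusing optical and Synthetic Aperture Radar (SAR) imagery. It dynamically assesses feature reliability and fuses adaptively, maintaining strong performance when a modality is missing or degraded.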

Details

Motivation: Because of differing imaging mechanisms, temporal asynchrony, and registration difficulties, well-aligned optical-SAR image pairs are very hard to obtain, which often leaves a modality missing or degraded; existing methods are not robust enough to randomly missing modalities and do not deliver stable fusion performance. Method: QDFNet includes a Dynamic Modality Quality Assessment (DMQA) module, which uses learnable reference tokens to iteratively refine feature reliability assessment, and an Orthogonal Constraint Normalization Fusion (OCNF) module, which uses orthogonal constraints to preserve modality independence and dynamically adjusts fusion weights according to reliability. Result: Experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets show that QDFNet significantly outperforms state-of-the-art methods when modalities are partially corrupted or missing. Conclusion: Through quality-aware dynamic fusion, QDFNet improves the robustness and detection performance of optical-SAR fusion under incomplete or variable inputs. Abstract: Optical and Synthetic Aperture Radar (SAR) fusion-based object detection has attracted significant research interest in remote sensing, as these modalities provide complementary information for all-weather monitoring. However, practical deployment is severely limited by inherent challenges. Due to distinct imaging mechanisms, temporal asynchrony, and registration difficulties, obtaining well-aligned optical-SAR image pairs remains extremely difficult, frequently resulting in missing or degraded modality data. Although recent approaches have attempted to address this issue, they still suffer from limited robustness to random missing modalities and lack effective mechanisms to ensure consistent performance improvement in fusion-based detection. To address these limitations, we propose a novel Quality-Aware Dynamic Fusion Network (QDFNet) for robust optical-SAR object detection. Our proposed method leverages learnable reference tokens to dynamically assess feature reliability and guide adaptive fusion in the presence of missing modalities. In particular, we design a Dynamic Modality Quality Assessment (DMQA) module that employs learnable reference tokens to iteratively refine feature reliability assessment, enabling precise identification of degraded regions and providing quality guidance for subsequent fusion. Moreover, we develop an Orthogonal Constraint Normalization Fusion (OCNF) module that employs orthogonal constraints to preserve modality independence while dynamically adjusting fusion weights based on reliability scores, effectively suppressing unreliable feature propagation.
Extensive experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets demonstrate the superiority and effectiveness of QDFNet compared to state-of-the-art methods, particularly under partial modality corruption or missing data scenarios.
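
Reliability-weighted fusion that tolerates a missing modality can be illustrated with a softmax over per-modality scores. This is a generic sketch, not the paper's DMQA or OCNF modules: the function `fuse_by_reliability`, the dictionary layout, and the use of a plain softmax are assumptions made for illustration.

```python
import math


def fuse_by_reliability(features, reliability):
    """Fuse per-modality feature vectors with softmax reliability weights.

    features:    dict mapping modality name to a list of floats, or None
                 when that modality is missing.
    reliability: dict mapping modality name to a scalar score (hypothetical
                 stand-in for a learned quality estimate).
    Missing modalities are excluded, so their weight is effectively zero.
    """
    present = [m for m, f in features.items() if f is not None]
    if not present:
        raise ValueError("at least one modality must be present")
    top = max(reliability[m] for m in present)  # subtract max for stability
    exps = {m: math.exp(reliability[m] - top) for m in present}
    total = sum(exps.values())
    weights = {m: exps[m] / total for m in present}
    dim = len(features[present[0]])
    fused = [sum(weights[m] * features[m][k] for m in present) for k in range(dim)]
    return fused, weights


both, w_both = fuse_by_reliability(
    {"optical": [1.0, 2.0], "sar": [3.0, 4.0]}, {"optical": 0.0, "sar": 0.0}
)
optical_only, w_single = fuse_by_reliability(
    {"optical": [1.0, 2.0], "sar": None}, {"optical": 0.3, "sar": 0.9}
)
```

With equal scores the result is a plain average, and when SAR is missing the output falls back to the optical features alone, which is the behaviour a missing-modality detector needs at minimum.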

[112] SonoVision: A Computer Vision Approach for Helping Visually Challenged Individuals Locate Objects with the Help of Sound Cues

Md Abu Obaida Zishan,Annajiat Alim Rasel

Main category: cs.CV

TL;DR: SonoVision is a smartphone application that uses sound cues to help visually impaired people locate everyday objects, increasing their independence.

Details

Motivation: To help visually impaired people overcome the difficulty of locating objects in daily life, reduce their dependence on others, and improve safety and autonomy. Method: The app is built with the Flutter development platform, uses the EfficientDet-D2 model for object detection in the backend, and indicates object direction through sound cues in the left and right earphone channels. Result: The app conveys object position to the user through audio feedback and can run completely offline, with good safety and usability. Conclusion: SonoVision offers visually impaired people a safe, convenient, and independent way to locate objects and has practical application potential. Abstract: Locating objects for the visually impaired is a significant challenge and is something no one can get used to over time. However, this hinders their independence and could push them towards risky and dangerous scenarios. Hence, in the spirit of making the visually challenged more self-sufficient, we present SonoVision, a smart-phone application that helps them find everyday objects using sound cues through earphones/headphones. This simply means, if an object is on the right or left side of a user, the app makes a sinusoidal sound in a user's respective ear through ear/headphones. However, to indicate objects located directly in front, both the left and right earphones are rung simultaneously. These sound cues could easily help a visually impaired individual locate objects with the help of their smartphones and reduce the reliance on people in their surroundings, consequently making them more independent. This application is made with the Flutter development platform and uses the EfficientDet-D2 model for object detection in the backend. We believe the app will significantly assist the visually impaired in a safe and user-friendly manner with its capacity to work completely offline. Our application can be accessed here https://github.com/MohammedZ666/SonoVision.git.
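
The left/right/both cue logic described in the abstract can be sketched as a mapping from a detection's horizontal position to the earphone channels that should sound. The function name `sound_cue` and the width of the "directly in front" band (`centre_band=0.2`, i.e. the middle 20% of the frame) are assumptions; the abstract does not give the thresholds the app uses.

```python
def sound_cue(box_centre_x, frame_width, centre_band=0.2):
    """Return (play_left, play_right) for a detected object's position.

    box_centre_x: horizontal centre of the bounding box in pixels.
    centre_band:  hypothetical fraction of the frame width, centred on the
                  middle, that counts as "directly in front".
    Assumes the camera image is not mirrored.
    """
    rel = box_centre_x / frame_width
    half_band = centre_band / 2.0
    if rel < 0.5 - half_band:
        return (True, False)   # object on the left: left ear only
    if rel > 0.5 + half_band:
        return (False, True)   # object on the right: right ear only
    return (True, True)        # object ahead: both ears


left_cue = sound_cue(50, 640)
front_cue = sound_cue(320, 640)
right_cue = sound_cue(600, 640)
```

The returned pair would then gate a sinusoidal tone on each audio channel; keeping the decision separate from audio playback makes the thresholds easy to tune.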

[113] SAM 3D for 3D Object Reconstruction from Remote Sensing Images

Junsheng Yao,Lichao Mou,Qingyu Li

Main category: cs.CV

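TL;DR: This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular building reconstruction from remote sensing imagery. Compared with TRELLIS, it produces more coherent roof geometry and sharper boundaries, and a segment-reconstruct-compose pipeline extends it to urban scene modelling, showing its potential and pointing to future research directions.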

Details

Motivation: Existing monocular 3D building reconstruction methods usually rely on task-specific architectures and intensive supervision and lack generality, so a generalizable, low-supervision foundation model solution is needed. Method: SAM 3D, a general-purpose image-to-3D foundation model, is compared with TRELLIS on the NYC Urban Dataset using FID and CLIP-based MMD (CMMD) as evaluation metrics, and a segment-reconstruct-compose pipeline extends it to urban scene reconstruction. Result: SAM 3D outperforms TRELLIS in roof geometry and boundary sharpness, generates more coherent 3D building shapes, and is successfully applied to urban scene reconstruction. Conclusion: SAM 3D shows strong potential for monocular 3D reconstruction from remote sensing images and offers a viable route to foundation-model-based urban modelling; future work can integrate scene-level structural priors to improve performance further. Abstract: Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, yet existing methods often require task-specific architectures and intensive supervision. This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular remote sensing building reconstruction. We benchmark SAM 3D against TRELLIS on samples from the NYC Urban Dataset, employing Frechet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. We further extend SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline, demonstrating its potential for urban scene modeling. We also analyze practical limitations and discuss future research directions. These findings provide practical guidance for deploying foundation models in urban 3D reconstruction and motivate future integration of scene-level structural priors.

[114] Comparing Object Detection Models for Electrical Substation Component Mapping

Haley Mody,Namish Bansal,Dennies Kiprono Bor,Edward J. Oughton

Main category: cs.CV

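TL;DR: This study compares three computer vision models (YOLOv8, YOLOv11, and RF-DETR) for detecting components of US electrical substations, with the goal of efficient, automated infrastructure mapping to improve grid vulnerability assessment.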

Details

Motivation: Substations are a critical part of the electrical grid and are vulnerable to many natural hazards; traditional manual mapping is slow and labor-intensive, so automated methods are needed to improve efficiency and coverage. Method: Using a manually labeled dataset of US substation images, YOLOv8, YOLOv11, and RF-DETR are trained and compared on detection accuracy, precision, and efficiency. Result: All three models show some ability to detect substation components, with one model offering the best balance of accuracy and efficiency for large-scale automated mapping; the models are used to automatically identify and map components at multiple US substations. Conclusion: Computer-vision-based automation can effectively support rapid identification of substation infrastructure and vulnerability analysis, has the potential to be deployed nationwide, and is markedly more efficient than traditional manual mapping. Abstract: Electrical substations are a significant component of an electrical grid. Indeed, the assets at these substations (e.g., transformers) are prone to disruption from many hazards, including hurricanes, flooding, earthquakes, and geomagnetically induced currents (GICs). As electrical grids are considered critical national infrastructure, any failure can have significant economic and public safety implications. To help prevent and mitigate these failures, it is thus essential that we identify key substation components to quantify vulnerability. Unfortunately, traditional manual mapping of substation infrastructure is time-consuming and labor-intensive. Therefore, an autonomous solution utilizing computer vision models is preferable, as it allows for greater convenience and efficiency. In this research paper, we train and compare the outputs of 3 models (YOLOv8, YOLOv11, RF-DETR) on a manually labeled dataset of US substation images. Each model is evaluated for detection accuracy, precision, and efficiency. We present the key strengths and limitations of each model, identifying which provides reliable and large-scale substation component mapping. Additionally, we utilize these models to effectively map the various substation components in the United States, showcasing a use case for machine learning in substation mapping.

[115] Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

Sukhyun Jeong,Yong-Hoon Choi

Main category: cs.CV

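TL;DR: This paper proposes PGR²M, a hybrid representation that augments interpretable pose codes with residual codes to improve text-driven 3D motion generation and editing, yielding better detail reconstruction and temporal control of motion sequences.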

Details

Motivation: Existing pose-code-based methods (such as CoMo) use a frame-wise representation that struggles to capture subtle temporal dynamics and high-frequency details, degrading reconstruction quality and local controllability, so a new representation is needed that keeps semantic interpretability while achieving high-fidelity reconstruction. Method: PGR²M uses pose-guided residual vector quantization (RVQ) to decompose motion into pose latents (coarse global structure) and residual latents (fine-grained temporal variation), with residual dropout to avoid over-reliance on the residual codes; two Transformers are built, a base Transformer that autoregressively predicts pose codes and a refine Transformer that predicts residual codes conditioned on text, pose codes, and quantization stage. Result: Experiments on HumanML3D and KIT-ML show that PGR²M improves Fréchet inception distance and reconstruction metrics over CoMo and recent diffusion- and tokenization-based methods, and user studies confirm better structure preservation and more intuitive editing in generation and editing tasks. Conclusion: Through its hybrid representation, PGR²M balances semantic interpretability with high-fidelity detail reconstruction, improving both performance and controllability in text-to-3D motion generation and editing. Abstract: Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage.
Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
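
Residual vector quantization, the building block named in the abstract, can be sketched with two tiny codebooks: a coarse stage followed by a stage that quantizes what the first one missed. This is plain RVQ on toy 2D vectors, not the paper's pose-guided tokenizer; the codebooks, function names, and input vector are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for book in codebooks:
        idx = nearest(book, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


# Stage 1 captures coarse structure; stage 2 refines the leftover residual.
coarse_book = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fine_book = [(0.0, 0.0), (0.25, 0.0), (0.0, 0.25), (-0.25, 0.0), (0.0, -0.25)]
codes = rvq_encode((0.9, 0.2), [coarse_book, fine_book])
approx = rvq_decode(codes, [coarse_book, fine_book])
```

In the hybrid scheme described above, the first-stage code plays the role of an interpretable coarse token, while later stages add detail; dropping the later codes at decode time still yields a usable coarse reconstruction.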

[116] Event-based high temporal resolution measurement of shock wave motion field

Taihang Lei,Banglei Guan,Minzu Liang,Pengju Sun,Jing Tao,Yang Shang,Qifeng Yu

Main category: cs.CV

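TL;DR: A novel framework based on multiple event cameras is proposed for accurate measurement of shock wave motion parameters at high spatiotemporal resolution, achieving multi-angle measurement, motion field reconstruction, and explosive equivalence inversion.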

Details

Motivation: Because shock waves propagate quickly and unevenly and testing conditions are unstable, traditional methods struggle to measure them accurately, so a new approach that copes with these challenges is needed. Method: Multiple event cameras capture shock wave events; a polar coordinate system encodes the events to reveal propagation patterns; adaptive region-of-interest extraction and iterative slope analysis isolate shock wave front events; and a geometric model and a 3D reconstruction model are derived from the event-based optical imaging model. Result: Compared with pressure sensors and an empirical formula, the speed measurements have a maximum error of 5.20% and a minimum error of 0.06%, achieving high-precision measurement of the shock wave motion field. Conclusion: The method measures shock waves accurately at high spatiotemporal resolution in highly dynamic, complex environments, substantially improving measurement performance, and has important application prospects. Abstract: Accurate measurement of shock wave motion parameters with high spatiotemporal resolution is essential for applications such as power field testing and damage assessment. However, significant challenges are posed by the fast, uneven propagation of shock waves and unstable testing conditions. To address these challenges, a novel framework is proposed that utilizes multiple event cameras to estimate the asymmetry of shock waves, leveraging its high-speed and high-dynamic range capabilities. Initially, a polar coordinate system is established, which encodes events to reveal shock wave propagation patterns, with adaptive region-of-interest (ROI) extraction through event offset calculations. Subsequently, shock wave front events are extracted using iterative slope analysis, exploiting the continuity of velocity changes. Finally, the geometric model of events and shock wave motion parameters is derived according to event-based optical imaging model, along with the 3D reconstruction model. Through the above process, multi-angle shock wave measurement, motion field reconstruction, and explosive equivalence inversion are achieved. The results of the speed measurement are compared with those of the pressure sensors and the empirical formula, revealing a maximum error of 5.20% and a minimum error of 0.06%. The experimental results demonstrate that our method achieves high-precision measurement of the shock wave motion field with both high spatial and temporal resolution, representing significant progress.
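
The polar encoding and slope-based speed estimate can be illustrated on synthetic data. The sketch below converts events to polar coordinates about an assumed blast centre and fits a single least-squares line of radius against time; it is a simplification for illustration, since the paper's iterative slope analysis and imaging model are not detailed in the abstract. All names and values are hypothetical.

```python
import math


def to_polar(events, centre):
    """Convert (x, y, t) events to (radius, angle, t) about a centre point."""
    cx, cy = centre
    return [
        (math.hypot(x - cx, y - cy), math.atan2(y - cy, x - cx), t)
        for x, y, t in events
    ]


def radial_speed(polar_events):
    """Least-squares slope of radius versus time (image units per time unit)."""
    n = len(polar_events)
    mean_t = sum(t for _, _, t in polar_events) / n
    mean_r = sum(r for r, _, _ in polar_events) / n
    num = sum((t - mean_t) * (r - mean_r) for r, _, t in polar_events)
    den = sum((t - mean_t) ** 2 for _, _, t in polar_events)
    return num / den


# Synthetic front events along one ray: radius grows as 3 + 2 t pixels.
centre = (10.0, 20.0)
events = [(10.0 + 3.0 + 2.0 * t, 20.0, t) for t in (0.0, 1.0, 2.0, 3.0)]
speed = radial_speed(to_polar(events, centre))
```

Fitting the slope separately within angular sectors would expose direction-dependent speeds, which is one simple way to look at asymmetric propagation; converting pixel speed to physical speed additionally requires the camera's geometric calibration.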

[117] Scalpel-SAM: A Semi-Supervised Paradigm for Adapting SAM to Infrared Small Object Detection

Zihan Liu,Xiangning Ren,Dezhang Kong,Yipeng Zhang,Meng Han

Main category: cs.CV

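TL;DR: A two-stage knowledge distillation and transfer paradigm built on a Hierarchical MoE Adapter is proposed to address the data scarcity caused by high annotation cost in infrared small object detection, matching or even surpassing fully supervised models with only 10% of the labelled data.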

Details

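Motivation: Existing approaches to infrared small object detection face large domain gaps, difficulty encoding physical priors, and architectural complexity, and annotation is expensive, so an effective semi-supervised paradigm is urgently needed. Method: A Hierarchical MoE Adapter consisting of four white-box neural operators is designed, and a two-stage distillation and transfer framework is built on it: in the first stage, a small amount of fully supervised data and prior-guided knowledge distillation produce an expert teacher model, Scalpel-SAM; in the second stage, the teacher generates pseudo labels to train lightweight downstream models. Result: Experiments show that downstream models trained with only 10% of the labelled data can match or exceed fully supervised models, confirming the method's effectiveness in easing data scarcity. Conclusion: To the authors' knowledge, this is the first semi-supervised paradigm to systematically use SAM as a teacher model to address annotation scarcity in infrared small object detection, with good prospects for application and wider adoption. Abstract: Infrared small object detection urgently requires semi-supervised paradigms due to the high cost of annotation. However, existing methods like SAM face significant challenges of domain gaps, inability of encoding physical priors, and inherent architectural complexity. To address this, we designed a Hierarchical MoE Adapter consisting of four white-box neural operators. Building upon this core component, we propose a two-stage paradigm for knowledge distillation and transfer: (1) Prior-Guided Knowledge Distillation, where we use our MoE adapter and 10% of available fully supervised data to distill SAM into an expert teacher (Scalpel-SAM); and (2) Deployment-Oriented Knowledge Transfer, where we use Scalpel-SAM to generate pseudo labels for training lightweight and efficient downstream models. Experiments demonstrate that with minimal annotations, our paradigm enables downstream models to achieve performance comparable to, or even surpassing, their fully supervised counterparts. To our knowledge, this is the first semi-supervised paradigm that systematically addresses the data scarcity issue in IR-SOT using SAM as the teacher model.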

[118] Tracking by Predicting 3-D Gaussians Over Time

Tanish Baranwal,Himanshu Gaurav Singh,Jathushan Rajasegaran,Jitendra Malik

Main category: cs.CV

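TL;DR: Video-GMAE is a self-supervised video representation learning method based on Gaussian splats in which tracking ability emerges naturally, and it surpasses existing self-supervised methods on multiple datasets.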

Details

Motivation: Existing self-supervised video representation learning methods lack a suitable inductive bias for the structure of dynamic 3D scenes and struggle to model object motion trajectories in video. Method: A video is represented as a set of Gaussian splats moving over time, and the model is pretrained in a self-supervised manner by reconstructing masked regions, so that it learns physically meaningful dynamic representations. Result: Zero-shot tracking performance is comparable to the state of the art; after small-scale finetuning, the model improves by 34.6% on Kinetics and 13.1% on Kubric, surpassing existing self-supervised video methods. Conclusion: Representing video with Gaussians and introducing a 3D-aware inductive bias is effective: object tracking emerges naturally and video understanding performance improves markedly. Abstract: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at https://videogmae.org/ and https://github.com/tekotan/video-gmae.
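
Mapping a Gaussian's trajectory onto the image plane, as the abstract describes for zero-shot tracking, amounts to projecting its 3D centre at each time step. The sketch below assumes a simple pinhole camera and a hypothetical function `project_track`; the paper's actual camera model and Gaussian parameterization are not given in the abstract.

```python
def project_track(centres_over_time, focal, principal_point):
    """Project a Gaussian's 3D centres to 2D pixel coordinates per frame.

    centres_over_time: list of (x, y, z) camera-space centres, one per frame.
    focal:             (fx, fy) focal lengths in pixels (assumed pinhole model).
    principal_point:   (cx, cy) in pixels.
    Frames where the centre is behind the camera (z <= 0) yield None.
    """
    fx, fy = focal
    cx, cy = principal_point
    track = []
    for x, y, z in centres_over_time:
        if z <= 0:
            track.append(None)
            continue
        track.append((fx * x / z + cx, fy * y / z + cy))
    return track


# One Gaussian moving right, then down and away from the camera.
centres = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (1.0, 1.0, 4.0)]
track_2d = project_track(centres, focal=(100.0, 100.0), principal_point=(50.0, 50.0))
```

Each Gaussian then yields one 2D point track, so a query point could be followed by selecting the Gaussian whose projection is nearest to it in the first frame.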

[119] SCAFusion: A Multimodal 3D Detection Framework for Small Object Detection in Lunar Surface Exploration

Xin Chen,Kang Luo,Yangyi Xiao,Hesheng Wang

Main category: cs.CV

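TL;DR: This paper proposes SCAFusion, a multimodal 3D object detection model designed for lunar robotic missions. By introducing a Cognitive Adapter, a Contrastive Alignment Module, a Camera Auxiliary Training Branch, and a Section-aware Coordinate Attention mechanism, it significantly outperforms existing methods at detecting small, irregular objects with a negligible increase in parameters and computation.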

Details

Motivation: Existing multimodal 3D perception methods perform poorly in extraterrestrial environments, with shortcomings in feature alignment, multimodal synergy, and small object detection, so they cannot meet the needs of autonomous navigation on the lunar surface. Method: Built on the BEVFusion framework, SCAFusion adds four key components: a Cognitive Adapter for efficient tuning of the camera backbone, a Contrastive Alignment Module that improves camera-LiDAR feature consistency, a Camera Auxiliary Training Branch that strengthens visual representation, and a Section-aware Coordinate Attention mechanism aimed specifically at small, irregular targets. Result: On the nuScenes validation set the model reaches 69.7% mAP and 72.1% NDS, improving on the baseline by 5.0% and 2.7%, respectively; in a simulated lunar environment built on Isaac Sim, mAP reaches 90.93%, 11.5% above the baseline, with especially strong gains on small meteor-like obstacles. Conclusion: SCAFusion markedly improves detection of small, irregular objects in lunar environments with almost no added parameters or computation, and has potential for deep-space exploration missions. Abstract: Reliable and precise detection of small and irregular objects, such as meteor fragments and rocks, is critical for autonomous navigation and operation in lunar surface exploration. Existing multimodal 3D perception methods designed for terrestrial autonomous driving often underperform in off-world environments due to poor feature alignment, limited multimodal synergy, and weak small object detection. This paper presents SCAFusion, a multimodal 3D object detection model tailored for lunar robotic missions. Built upon the BEVFusion framework, SCAFusion integrates a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module to enhance camera-LiDAR feature consistency, a Camera Auxiliary Training Branch to strengthen visual representation, and most importantly, a Section-aware Coordinate Attention mechanism explicitly designed to boost the detection performance of small, irregular targets. With a negligible increase in parameters and computation, our model achieves 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion achieves 90.93% mAP, outperforming the baseline by 11.5%, with notable gains in detecting small meteor-like obstacles.

[120] DreamOmni3: Scribble-based Editing and Generation

Bin Xia,Bohao Peng,Jiyang Liu,Sitong Wu,Jingyao Li,Junjia Huang,Xu Zhao,Yitong Wang,Ruihang Chu,Bei Yu,Jiaya Jia

Main category: cs.CV

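TL;DR: This paper proposes scribble-based editing and generation tasks, which combine text, images, and freehand sketches for more flexible content creation, and introduces the DreamOmni3 model with a joint input scheme that precisely localizes scribbled regions and handles complex edits.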

Details

Motivation: Existing generation and editing models rely mainly on text prompts, which struggle to capture the intended edit location and fine-grained visual details, so intuitive interactions such as scribbles are needed to improve editing precision and flexibility. Method: The paper defines scribble-based editing and generation tasks, builds a synthetic dataset covering several edit types, and designs a joint input scheme that feeds the original image and the scribbled image into the model together, uses colors to distinguish regions, and shares encoding information for precise localization and editing. Result: Comprehensive benchmarks are established on the newly built dataset, and experiments show that DreamOmni3 performs strongly on scribble-guided editing and generation. Conclusion: By introducing scribble input and a joint encoding framework, DreamOmni3 improves the precision and interactivity of image generation and editing and advances multimodal content creation. Abstract: Recently, unified generation and editing models have achieved remarkable success with their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users' intended edit locations and fine-grained visual details. To this end, we propose two tasks, scribble-based editing and generation, that enable more flexible creation on a graphical user interface (GUI) by combining user text, images, and freehand sketches. We introduce DreamOmni3, tackling two challenges: data creation and framework design. Our data synthesis pipeline includes two parts: scribble-based editing and generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on the DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles or cropped images to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance, and models and code will be publicly released.

[121] CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation

Qinglin Zeng,Kaitong Cai,Ruiqi Chen,Qinhan Lv,Keze Wang

Main category: cs.CV

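TL;DR: Proposes CoAgent, a framework that uses a closed-loop plan-synthesize-verify pipeline to improve narrative coherence and visual consistency in open-domain video generation.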

Details

Motivation: Existing text-to-video models often produce identity drift, scene inconsistency, and unstable temporal structure, and lack global consistency control across shots. Method: A collaborative closed-loop framework is used: a Storyboard Planner decomposes the input into shot-level plans, a Global Context Manager maintains entity memory, a Synthesis Module generates each shot under the guidance of a Visual Consistency Controller, a Verifier Agent detects inconsistencies and triggers regeneration, and a pacing-aware editor finally refines the temporal flow. Result: Experiments show that CoAgent significantly improves narrative coherence, visual consistency, and overall narrative quality in long-form video generation. Conclusion: By introducing structured planning and closed-loop verification, CoAgent effectively addresses long-range consistency in open-domain video generation. Abstract: Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
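
The plan-synthesize-verify loop described above can be sketched as a bounded retry loop around a shared entity memory. This is only an illustration of the control flow: `synthesize`, `verify`, the `max_retries` budget, and the dictionary-based plan format are hypothetical stand-ins, not CoAgent's actual components or interfaces.

```python
def generate_video(shot_plans, synthesize, verify, max_retries=2):
    """Generate shots one by one, regenerating a shot when verification fails.

    shot_plans: list of dicts, each with an optional "entities" mapping.
    synthesize(plan, memory, attempt) -> shot   (hypothetical generator)
    verify(shot, memory) -> bool                (hypothetical verifier)
    """
    memory = {}  # entity-level memory shared across shots
    shots = []
    for plan in shot_plans:
        shot = None
        for attempt in range(max_retries + 1):
            shot = synthesize(plan, memory, attempt)
            if verify(shot, memory):
                break  # accept the first shot that passes verification
        # If every attempt fails, the last attempt is kept (a design choice here).
        shots.append(shot)
        memory.update(plan.get("entities", {}))
    return shots
```

Capping the number of regeneration attempts keeps the loop from stalling on a shot the verifier never accepts; how the real system handles that case is not stated in the abstract.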

[122] Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

Jesen Zhang,Ningyuan Liu,Kaitong Cai,Sidi Liu,Jing Yang,Ziliang Chen,Xiaofei Sun,Keze Wang

Main category: cs.CV

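TL;DR: SR-MCR is a lightweight, label-free framework for aligning the reasoning of multimodal LLMs. It uses self-referential signals for process-level alignment and reaches state-of-the-art performance among open-source models on multiple visual benchmarks.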

Details

Motivation: Existing alignment methods supervise only the final answer and ignore the reliability of the intermediate reasoning, which leaves multimodal LLMs with incoherent reasoning and insufficient visual grounding. Method: The SR-MCR framework uses five self-referential cues (semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency) to build a normalized, reliability-weighted reward, and trains stably with a critic-free GRPO objective and a confidence-aware cooling mechanism. Result: Built on Qwen2.5-VL, SR-MCR improves answer accuracy and reasoning coherence on multiple visual benchmarks; SR-MCR-7B reaches an average accuracy of 81.4%, the state of the art among open-source models of comparable size. Ablations confirm the independent contribution of each reward term and of the cooling module. Conclusion: By exploiting intrinsic process signals in the model's own outputs, SR-MCR achieves effective reasoning alignment, markedly improving reasoning reliability and visual grounding in multimodal LLMs, with good scalability and practicality. Abstract: Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
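
To make the reward construction concrete, here is a minimal sketch of a reliability-weighted reward over the five cues, followed by the group-relative advantage that GRPO-style objectives use in place of a learned critic. The cue scores in [0, 1], the `reliability` weights, and the use of the population standard deviation are assumptions for illustration; the paper's exact normalization and its cooling mechanism are not reproduced here.

```python
import statistics

CUES = (
    "semantic_alignment",
    "lexical_fidelity",
    "non_redundancy",
    "visual_grounding",
    "step_consistency",
)


def combined_reward(cue_scores, reliability):
    """Reliability-weighted average of the five cue scores (assumed in [0, 1])."""
    total_weight = sum(reliability[c] for c in CUES)
    if total_weight <= 0:
        raise ValueError("reliability weights must sum to a positive value")
    return sum(reliability[c] * cue_scores[c] for c in CUES) / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    This is the critic-free baseline used by GRPO-style objectives:
    each response is scored relative to its own group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Dividing by the total weight keeps the combined reward in the same range as the individual cues, so the advantage scale does not depend on how many cues are active.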

[123] ReFRM3D: A Radiomics-enhanced Fused Residual Multiparametric 3D Network with Multi-Scale Feature Fusion for Glioma Characterization

Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Arefin Ittesafun Abian,Yan Zhang,Mirjam Jonkman,Sami Azam

Main category: cs.CV

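TL;DR: Proposes a new radiomics-enhanced fused residual 3D network (ReFRM3D) based on multiparametric MRI, together with a multi-feature tumor marker-based classifier, to improve the efficiency of glioma segmentation and classification; it achieves strong segmentation performance on several BraTS datasets.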

Details

Motivation: Existing glioma diagnosis methods suffer from high variability in imaging data, poor use of computational resources, and inefficient segmentation and classification, so automated, accurate diagnosis needs to improve. Method: Based on a 3D U-Net architecture, the ReFRM3D network combines multi-scale feature fusion, hybrid upsampling, and an extended residual skip mechanism, and a multi-feature tumor marker-based classifier is built from radiomic features. Result: High Dice Similarity Coefficients are achieved on BraTS2019, BraTS2020, and BraTS2021; for example, on BraTS2019 the scores for whole tumor (WT), enhancing tumor (ET), and tumor core (TC) are 94.04%, 92.68%, and 93.64%. Conclusion: The proposed method markedly improves glioma segmentation accuracy and classification efficiency and shows good potential for clinical use. Abstract: Gliomas are among the most aggressive cancers, characterized by high mortality rates and complex diagnostic processes. Existing studies on glioma diagnosis and classification often describe issues such as high variability in imaging data, inadequate optimization of computational resources, and inefficient segmentation and classification of gliomas. To address these challenges, we propose novel techniques utilizing multi-parametric MRI data to enhance tumor segmentation and classification efficiency. Our work introduces the first-ever radiomics-enhanced fused residual multiparametric 3D network (ReFRM3D) for brain tumor characterization, which is based on a 3D U-Net architecture and features multi-scale feature fusion, hybrid upsampling, and an extended residual skip mechanism. Additionally, we propose a multi-feature tumor marker-based classifier that leverages radiomic features extracted from the segmented regions. Experimental results demonstrate significant improvements in segmentation performance across the BraTS2019, BraTS2020, and BraTS2021 datasets, achieving high Dice Similarity Coefficients (DSC) of 94.04%, 92.68%, and 93.64% for whole tumor (WT), enhancing tumor (ET), and tumor core (TC) respectively in BraTS2019; 94.09%, 92.91%, and 93.84% in BraTS2020; and 93.70%, 90.36%, and 92.13% in BraTS2021.
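
For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient of a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flattened binary masks; treating two empty masks as perfect agreement is a common convention assumed here, and the evaluation code used in the paper may differ.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # both masks empty: counted as perfect agreement (convention)
    return 2.0 * intersection / size_sum
```

A reported DSC of 94.04% for whole tumor therefore corresponds to a value of about 0.9404 from a function like this one, averaged over the evaluation cases.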

[124] KV-Tracker: Real-Time Pose Tracking with Transformers

Marwan Taher,Ignacio Alzugaray,Kirill Mazur,Xin Kong,Andrew J. Davison

Main category: cs.CV

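TL;DR: Proposes KV-Tracker, which caches the key-value (KV) pairs of global self-attention to enable real-time 6-DoF pose tracking and online reconstruction with multi-view 3D geometry networks, giving up to a 15x speedup on monocular RGB video without depth or object priors.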

Details

Motivation: Multi-view 3D geometry networks are powerful but slow at inference, which makes them hard to use in real-time settings; efficient online use is needed without sacrificing accuracy. Method: Through keyframe selection and management combined with the full bidirectional attention of the π³ network, the key-value (KV) pairs of the self-attention blocks are cached and used as the sole scene representation for real-time pose tracking; the strategy transfers to other multi-view networks without retraining. Result: The method is validated on the TUM RGB-D, 7-Scenes, Arctic, and OnePose datasets, reaching frame rates of up to about 27 FPS and up to a 15x inference speedup, without drift or catastrophic forgetting. Conclusion: With its KV caching strategy, KV-Tracker delivers efficient, stable online 6-DoF tracking and reconstruction, offering a general route for bringing multi-view geometry networks to real-time applications. Abstract: Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $π^3$ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\sim}27$ FPS.
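
The speedup comes from computing keys and values for the keyframes once and reusing them for every new frame. The toy sketch below shows that pattern for a single attention head and a single layer; the class name `KeyframeKVCache` and its methods are illustrative assumptions, and the real system caches the KV pairs of the global self-attention blocks inside a multi-layer network.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector over cached K/V lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(dim)]


class KeyframeKVCache:
    """Holds keys/values computed once from keyframes; read-only while tracking."""

    def __init__(self):
        self.keys = []
        self.values = []

    def add_keyframe(self, keys, values):
        """Append the K/V tokens of a newly selected keyframe (mapping step)."""
        self.keys.extend(keys)
        self.values.extend(values)

    def query(self, query_vec):
        """Attend from a new frame's token to the cached scene (tracking step)."""
        return attend(query_vec, self.keys, self.values)
```

In this sketch a tracked frame only reads from the cache and never writes to it, which mirrors the abstract's point that the cached KV pairs act as the sole scene representation during tracking.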

[125] PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment

Bin Wang,Yang Xu,Huan Zhao,Hao Zhang,Zixing Zhang

Main category: cs.CV

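TL;DR: This paper proposes PTalker, a new framework for personalized 3D talking head animation that uses style disentanglement of audio and facial motion sequences plus a multi-level alignment mechanism to generate high-fidelity, personalized facial animation with accurate lip synchronization.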

Details

Motivation: Existing speech-driven 3D talking head methods overlook the subtle differences in individual speaking styles, which limits personalization and realism. This paper aims to better preserve speaking style and improve lip-synchronization accuracy. Method: The PTalker framework (1) designs disentanglement constraints that map audio and motion sequences into separate style and content spaces, and (2) adopts a three-level modality alignment mechanism: spatial alignment with Graph Attention Networks, temporal alignment with cross-attention, and feature alignment that combines top-k bidirectional contrastive losses with KL divergence, strengthening audio-mesh consistency. Result: Extensive qualitative and quantitative experiments on public datasets show that PTalker outperforms state-of-the-art methods at generating realistic, clearly stylized 3D talking heads, especially in personalized expression and lip-synchronization accuracy. Conclusion: Through style disentanglement and multi-level alignment, PTalker achieves personalized 3D talking head generation, preserving individual speaking style while markedly improving lip-synchronization accuracy and overall realism. Abstract: Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely "PTalker". This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.

[126] Enhancing Noise Resilience in Face Clustering via Sparse Differential Transformer

Dafeng Zhang,Yongqi Song,Shizhuo Liu

Main category: cs.CV

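TL;DR: Proposes a Top-K Jaccard similarity method based on a Sparse Differential Transformer to make similarity measurement in face clustering more accurate and robust, reaching SOTA performance on multiple datasets.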

Details

Motivation: Existing methods use the Jaccard coefficient, which is better than cosine distance, but they bring in too many irrelevant nodes, weakening the discriminative power of the similarity and hurting clustering. Method: A prediction-driven Top-K Jaccard similarity is proposed together with a Transformer-based prediction model; a Sparse Differential Transformer (SDT) is then introduced to suppress noise and make node-relationship modeling more reliable. Result: Experiments on multiple large-scale datasets, including MS-Celeb-1M, show that the method clearly outperforms existing approaches and achieves the best clustering performance. Conclusion: By improving the purity of neighboring nodes and refining similarity prediction, the method makes face clustering more accurate and robust and offers a new direction for graph-based relationship modeling. Abstract: The method used to measure relationships between face embeddings plays a crucial role in determining the performance of face clustering. Existing methods employ the Jaccard similarity coefficient instead of the cosine distance to enhance the measurement accuracy. However, these methods introduce too many irrelevant nodes, producing Jaccard coefficients with limited discriminative power and adversely affecting clustering performance. To address this issue, we propose a prediction-driven Top-K Jaccard similarity coefficient that enhances the purity of neighboring nodes, thereby improving the reliability of similarity measurements. Nevertheless, accurately predicting the optimal number of neighbors (Top-K) remains challenging, leading to suboptimal clustering results. To overcome this limitation, we develop a Transformer-based prediction model that examines the relationships between the central node and its neighboring nodes near the Top-K to further enhance the reliability of similarity estimation. However, the vanilla Transformer, when applied to predict relationships between nodes, often introduces noise due to its overemphasis on irrelevant feature relationships. To address these challenges, we propose a Sparse Differential Transformer (SDT), instead of the vanilla Transformer, to eliminate noise and enhance the model's anti-noise capabilities. Extensive experiments on multiple datasets, such as MS-Celeb-1M, demonstrate that our approach achieves state-of-the-art (SOTA) performance, outperforming existing methods and providing a more robust solution for face clustering.
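
The effect of truncating neighbor lists before computing Jaccard similarity can be seen in a small sketch. Here the per-node cutoffs `k_i` and `k_j` are passed in by hand, whereas the paper predicts them with a Transformer-based model; the `knn` dictionary of similarity-sorted neighbor lists is likewise an assumed data layout.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_jaccard(knn, i, j, k_i, k_j):
    """Jaccard similarity of nodes i and j using only their top-K neighbors.

    knn[n] is node n's neighbor list sorted from most to least similar.
    k_i and k_j are the cutoffs for each node (hand-set in this sketch).
    """
    return jaccard(set(knn[i][:k_i]), set(knn[j][:k_j]))
```

With `knn = {0: [1, 2, 3, 9, 8], 1: [0, 2, 3, 7, 6]}`, using all five neighbors gives 2/8 = 0.25, while cutting both lists to the top three gives 2/4 = 0.5: dropping the unrelated tail neighbors makes the shared ones count for more.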

[127] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

Jiacheng Ye,Shansan Gong,Jiahui Gao,Junming Fan,Shuang Wu,Wei Bi,Haoli Bai,Lifeng Shang,Lingpeng Kong

Main category: cs.CV

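TL;DR: This paper proposes Dream-VL, a vision-language model, and Dream-VLA, a vision-language-action model, both built on a diffusion language model backbone. Their bidirectional generation gives strong results on visual planning and robot control tasks, clearly ahead of existing autoregressive models.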

Details

Motivation: Autoregressive vision-language models are limited by sequential generation in complex visual planning and dynamic robot control; this paper explores vision-language models built on diffusion language models to overcome that limitation. Method: Dream-VL, an open vision-language model based on a diffusion large language model (dLLM), is built first; continuous pre-training on open robotic datasets then yields the vision-language-action model Dream-VLA, whose bidirectional structure supports action chunking and parallel generation. Result: Dream-VL is comparable to top-tier autoregressive VLMs on multiple benchmarks and does better on visual planning tasks; Dream-VLA reaches average success rates of 97.2% on LIBERO, 71.4% on SimplerEnv-Bridge, and 60.5% on SimplerEnv-Fractal, surpassing leading models such as π₀ and GR00T-N1, with faster convergence in downstream fine-tuning. Conclusion: Diffusion-based vision-language models offer stronger modeling capacity and higher training efficiency, are especially well suited to complex visual planning and robot action generation, and have broad application prospects. Abstract: While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $π_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.

[128] Rethinking Memory Design in SAM-Based Visual Object Tracking

Mohamad Alansari,Muzammal Naseer,Hasan Al Marzouqi,Naoufel Werghi,Sajid Javed

Main category: cs.CV

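TL;DR: This paper presents a systematic study of memory mechanisms in SAM-based visual object tracking. It analyzes how existing methods differ in selecting short-term memory frames and proposes a unified hybrid memory framework that splits memory into short-term appearance memory and long-term distractor-resolving memory, markedly improving robustness under long-term occlusion, complex motion, and heavy distractors.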

Details

Motivation: Existing SAM2-based trackers each improve memory in their own method-specific way, so the design principles of memory are not understood in a unified manner, and it is unclear whether their memory strategies carry over to next-generation foundation models such as SAM3. A systematic, memory-centric study is therefore needed to expose the key design factors and improve generalization. Method: The paper first analyzes what the memory designs of representative SAM2-based trackers share and where they differ, then faithfully reimplements these mechanisms within the SAM3 framework and runs large-scale ablations on ten benchmarks; based on the findings, it proposes a modular hybrid memory framework that explicitly separates short-term appearance memory from long-term distractor-resolving memory. Result: Experiments show that the hybrid memory framework consistently improves performance on both SAM2 and SAM3 backbones, with notably stronger robustness under long-term occlusion, complex motion, and heavy distractors; decoupling the memory functions makes existing memory policies composable. Conclusion: Memory design is critical for SAM-based tracking, and the proposed unified hybrid memory framework gives future foundation-model-based trackers a clear, extensible design template. Abstract: Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: https://github.com/HamadYA/SAM3_Tracking_Zoo. This is a preprint. Some results are being finalized and may be updated in a future revision.

[129] Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion

Yuming Gu,Yizhi Wang,Yining Hong,Yipeng Gao,Hao Jiang,Angtian Wang,Bo Liu,Nathaniel S. Dennler,Zhengfei Kuang,Hao Li,Gordon Wetzstein,Chongyang Ma

Main category: cs.CV

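TL;DR: This paper proposes Envision, a diffusion-based framework for embodied visual planning that explicitly conditions on a goal image to generate physically plausible, goal-consistent visual trajectories, improving goal alignment and spatial consistency during task execution.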

Details

Motivation: Existing visual planning methods are mostly forward predictive and lack explicit goal modeling, so generated trajectories are prone to spatial drift and goal misalignment. Method: Envision works in two stages: a Goal Imagery Model first generates a coherent, task-relevant goal image from the scene and the instruction; then, given the initial observation and that goal image, a first-and-last-frame-conditioned video diffusion model (FL2V) interpolates a smooth, physically plausible video trajectory. Result: On object manipulation and image editing benchmarks, Envision outperforms baselines in goal alignment, spatial consistency, and object preservation. Conclusion: With explicitly goal-constrained diffusion generation, Envision effectively supports visual planning and control for embodied agents and provides reliable action guidance for downstream tasks. Abstract: Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.

[130] FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

Yidi Liu,Zihao Fan,Jie Huang,Jie Xiao,Dong Li,Wenlong Zhang,Lei Bai,Xueyang Fu,Zheng-Jun Zha

Main category: cs.CV

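TL;DR: Proposes a Fine-grained Perceptual Reward Model (FinPercep-RM) and a Co-evolutionary Curriculum Learning (CCL) mechanism to address reward hacking in RLHF-based image super-resolution.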

Details

Motivation: Traditional image quality assessment models output only a global score and are insensitive to subtle local distortions, which lets super-resolution models produce perceptual artifacts and leads to reward hacking. Method: An encoder-decoder Fine-grained Perceptual Reward Model (FinPercep-RM) is designed to output both a global score and a Perceptual Degradation Map; the FGR-30k dataset is built to train it; and a Co-evolutionary Curriculum Learning (CCL) mechanism lets the reward model and the super-resolution model evolve together from easy to hard, improving training stability. Result: Experiments show that the method improves both global quality and local realism across several super-resolution models, suppresses reward hacking, and makes training more stable. Conclusion: Fine-grained reward signals combined with a co-evolving curriculum align better with human perceptual preferences and improve reinforcement-learning-based image super-resolution. Abstract: Reinforcement Learning with Human Feedback (RLHF) has proven effective in the image generation field, guided by reward models to align with human preferences. Motivated by this, adapting RLHF for Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality with Image Quality Assessment (IQA) models as reward models. However, traditional IQA models usually output a single global score, which is exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and resulting in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validate the effectiveness of our method across ISR models in both global quality and local realism on RLHF methods.
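
One simple way to picture the easy-to-hard transition is a reward that starts as the global score and gradually shifts weight onto a term derived from the degradation map. The linear schedule, the mean-pooled map, and every name below are assumptions made for illustration; the abstract does not specify how CCL actually schedules or combines the two signals.

```python
def curriculum_weight(step, total_steps):
    """Linear ramp from 0.0 at the start of training to 1.0 at the end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def blended_reward(global_score, degradation_map, step, total_steps):
    """Blend a global quality score with a fine-grained local term.

    degradation_map: flat list of per-pixel degradation values in [0, 1],
    where higher means worse (a hypothetical encoding of the map).
    """
    alpha = curriculum_weight(step, total_steps)
    local_quality = 1.0 - sum(degradation_map) / len(degradation_map)
    return (1.0 - alpha) * global_score + alpha * local_quality
```

Early in training the policy sees only the easy global signal; by the end, localized defects dominate the reward, which is the behavior the easy-to-hard description points at.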

[131] Visual Autoregressive Modelling for Monocular Depth Estimation

Amir El-Ghoussani,André Kaup,Nassir Navab,Gustavo Carneiro,Vasileios Belagiannis

Main category: cs.CV

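TL;DR: Proposes a monocular depth estimation method based on visual autoregressive (VAR) priors. By adapting a large-scale text-to-image VAR model and adding a scale-wise conditional upsampling mechanism, it reaches state-of-the-art indoor performance after fine-tuning on a small amount of synthetic data and performs well on outdoor datasets.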

Details

Motivation: To find an alternative to diffusion models and explore the potential of autoregressive priors for monocular depth estimation, improving the model's grasp of geometric structure and its adaptability to 3D vision tasks. Method: A large-scale text-to-image VAR model is adapted with a scale-wise conditional upsampling mechanism and classifier-free guidance; inference runs in ten fixed autoregressive stages, and fine-tuning uses only 74K synthetic samples. Result: Under constrained training conditions, the method achieves state-of-the-art performance on indoor benchmarks and is competitive on outdoor datasets. Conclusion: Autoregressive priors, as a family of geometry-aware generative models, offer data scalability and good adaptability to 3D vision tasks in depth estimation, and are a strong complement to diffusion models. Abstract: We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability, and adaptability to 3D vision tasks. Code available at "https://github.com/AmirMaEl/VAR-Depth".

[132] Investigating Deep Learning Models for Ejection Fraction Estimation from Echocardiography Videos

Shravan Saranyan,Pramit Saha

Main category: cs.CV

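TL;DR: This study evaluates several deep learning architectures for estimating left ventricular ejection fraction (LVEF) from echocardiography videos. A modified 3D Inception model performs best, with an RMSE of 6.79%, and model design and hyperparameter choices strongly affect generalization.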

Details

Motivation: Manual assessment of cardiac function from echocardiograms is time-consuming and shows considerable inter-observer variability, so an efficient, accurate automated method is needed to improve clinical efficiency and consistency. Method: The study compares several deep learning architectures, including 3D Inception, two-stream, and CNN-RNN models, systematically evaluates architectural modifications and feature fusion strategies, and trains and validates on the EchoNet-Dynamic dataset of 10,030 videos. Result: The modified 3D Inception architecture performs best (RMSE 6.79%); smaller, simpler models generalize better; and performance is highly sensitive to hyperparameters such as convolutional kernel size and normalization strategy. Conclusion: Deep learning can estimate LVEF automatically and accurately, with the 3D Inception architecture the most promising; the findings may also inform video analysis tasks in medicine and other fields. Abstract: Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and plays a central role in the diagnosis and management of cardiovascular disease. Echocardiography, as a readily accessible and non-invasive imaging modality, is widely used in clinical practice to estimate LVEF. However, manual assessment of cardiac function from echocardiograms is time-consuming and subject to considerable inter-observer variability. Deep learning approaches offer a promising alternative, with the potential to achieve performance comparable to that of experienced human experts. In this study, we investigate the effectiveness of several deep learning architectures for LVEF estimation from echocardiography videos, including 3D Inception, two-stream, and CNN-RNN models. We systematically evaluate architectural modifications and fusion strategies to identify configurations that maximize prediction accuracy. Models were trained and evaluated on the EchoNet-Dynamic dataset, comprising 10,030 echocardiogram videos. Our results demonstrate that modified 3D Inception architectures achieve the best overall performance, with a root mean squared error (RMSE) of 6.79%. Across architectures, we observe a tendency toward overfitting, with smaller and simpler models generally exhibiting improved generalization. Model performance was also found to be highly sensitive to hyperparameter choices, particularly convolutional kernel sizes and normalization strategies. While this study focuses on echocardiography-based LVEF estimation, the insights gained regarding architectural design and training strategies may be applicable to a broader range of medical and non-medical video analysis tasks.
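
As a reference for the quantities involved, the sketch below computes ejection fraction from end-diastolic and end-systolic volumes using the standard definition EF = (EDV - ESV) / EDV x 100, and the RMSE between predicted and reference EF values. The function names and example numbers are illustrative; this is not the study's evaluation code.

```python
import math


def ejection_fraction(edv, esv):
    """Ejection fraction in percent from end-diastolic and end-systolic volume."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (edv - esv) / edv * 100.0


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

Since EF is itself a percentage, an RMSE of 6.79% means the predictions are off by roughly 6.79 EF percentage points in the root-mean-square sense.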

[133] Unleashing Foundation Vision Models: Adaptive Transfer for Diverse Data-Limited Scientific Domains

Qiankun Li,Feng He,Huabao Chen,Xin Ning,Kun Wang,Zengfu Wang

Main category: cs.CV

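TL;DR: This paper proposes the Cluster Attention Adapter (CLAdapter), a new method for effectively transferring the knowledge of large-scale pre-trained models to data-limited downstream scientific tasks.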

Details

Motivation: In many specialized, data-limited scientific domains, downstream tasks remain challenging despite the availability of large-scale datasets and pre-trained models, so more effective adaptive transfer methods are needed. Method: CLAdapter introduces attention mechanisms and cluster centers to enhance feature representations in a personalized way through distribution correlation and transformation matrices, and provides a unified interface compatible with multiple architectures such as CNNs and Transformers. Result: Experiments on 10 datasets spanning multiple scientific domains show that CLAdapter achieves state-of-the-art performance across a range of data-limited scenarios. Conclusion: CLAdapter effectively unlocks the potential of foundation vision models on diverse downstream tasks and is both general and adaptable. Abstract: In the big data era, the computer vision field benefits from large-scale datasets such as LAION-2B, LAION-400M, ImageNet-21K, and Kinetics, on which popular models like the ViT and ConvNeXt series have been pre-trained, acquiring substantial knowledge. However, numerous downstream tasks in specialized and data-limited scientific domains continue to pose significant challenges. In this paper, we propose a novel Cluster Attention Adapter (CLAdapter), which refines and adapts the rich representations learned from large-scale data to various data-limited downstream tasks. Specifically, CLAdapter introduces attention mechanisms and cluster centers to personalize the enhancement of transformed features through distribution correlation and transformation matrices. This enables models fine-tuned with CLAdapter to learn distinct representations tailored to different feature sets, facilitating the models' adaptation from rich pre-trained features to various downstream scenarios effectively. In addition, CLAdapter's unified interface design allows for seamless integration with multiple model architectures, including CNNs and Transformers, in both 2D and 3D contexts. Through extensive experiments on 10 datasets spanning domains such as generic, multimedia, biological, medical, industrial, agricultural, environmental, geographical, materials science, out-of-distribution (OOD), and 3D analysis, CLAdapter achieves state-of-the-art performance across diverse data-limited scientific domains, demonstrating its effectiveness in unleashing the potential of foundation vision models via adaptive transfer. Code is available at https://github.com/qklee-lz/CLAdapter.

[134] INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

Mert Ikinci,Luna Toma,Karin U. Loeffler,Leticia Ussem,Daniela Süsskind,Julia M. Weller,Yousef Yeganeh,Martina C. Herwig-Carl,Shadi Albarqouni

Main category: cs.CV

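TL;DR: Proposes INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes of conjunctival melanocytic intraepithelial lesions (CMIL), markedly improving grading performance.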

Details

Motivation: Accurate CMIL grading is essential for treatment and melanoma prediction, but subtle morphological cues and interrelated diagnostic criteria make stable, accurate grading difficult for conventional approaches. Method: Develops the INTERACT-CMIL framework, which uses shared feature learning with combinatorial partial supervision and adds an inter-task dependence loss to strengthen cross-task consistency, jointly predicting five criteria: WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia. Result: On a multi-center dataset of 486 expert-annotated biopsy patches, the model consistently improves over CNN and foundation-model baselines, with relative macro F1 gains of up to 55.1% (WHO4) and 25.0% (vertical spread). Conclusion: INTERACT-CMIL provides coherent, interpretable multi-criteria predictions that agree with expert grading, establishes a reproducible computational benchmark for CMIL diagnosis, and advances the standardization of digital ocular pathology. Abstract: Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.

[135] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

ZhenQi Chen,TsaiChing Ni,YuanFu Yang

Main category: cs.CV

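TL;DR: This paper proposes CritiFusion, a framework that combines a multimodal semantic critique mechanism with frequency-domain refinement to improve the semantic consistency and detail quality of text-to-image generation at inference time; it requires no additional training and is compatible with existing diffusion models.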

Details

Motivation: Existing text-to-image diffusion models achieve excellent visual fidelity but struggle to align accurately with the semantics of complex prompts, so the consistency between generated content and the intent of the text needs to improve. Method: Proposes the CritiFusion framework, consisting of the CritiCore module (which uses a vision-language model and large language models to produce high-level semantic feedback that enriches the prompt context) and the SpecFusion module (which merges intermediate generation states in the spectral domain, injecting coarse structure while preserving high-frequency detail); it acts as a plug-and-play, training-free refinement stage. Result: Experiments on standard benchmarks show notable improvements in human-aligned metrics of text-image correspondence, preference scores, and aesthetic evaluations, on par with state-of-the-art reward optimization methods; qualitative results also show better detail, realism, and prompt fidelity. Conclusion: Through semantic critique and spectral alignment, CritiFusion effectively improves the semantic consistency and visual quality of text-to-image generation and serves as a general, training-free enhancement. Abstract: Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
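The spectral merging idea attributed to SpecFusion (coarse structure from one state, high-frequency detail from another) can be illustrated on a 1D signal. This is only a minimal sketch under stated assumptions, not the paper's implementation: the function names, the hard frequency cutoff, and the use of a naive discrete Fourier transform are all illustrative choices.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real-valued 1D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_merge(coarse, detailed, cutoff):
    """Hypothetical merge: frequencies up to `cutoff` come from `coarse`,
    all higher frequencies come from `detailed`."""
    n = len(coarse)
    spec_coarse, spec_detailed = dft(coarse), dft(detailed)
    merged = []
    for k in range(n):
        frequency = min(k, n - k)  # distance from the DC component
        merged.append(spec_coarse[k] if frequency <= cutoff else spec_detailed[k])
    return idft(merged)
```

With `cutoff=0`, only the mean of the output comes from `coarse`; raising the cutoff moves progressively finer structure from `coarse` into the result.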

[136] Autoregressive Flow Matching for Motion Prediction

Johnathan Xie,Stefan Stojanov,Cristobal Eyzaguirre,Daniel L. K. Yamins,Jiajun Wu

Main category: cs.CV

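TL;DR: This paper proposes ARFM, a new method for probabilistic modeling of sequential continuous data; trained on diverse video datasets, it predicts point tracks over long horizons and is shown to be effective in downstream human and robot motion prediction tasks.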

Details

Motivation: Existing motion prediction models are typically trained on narrow distributions and generalize poorly, while large-scale video prediction models are visually realistic but struggle to model complex motion accurately; a method is needed that scales and precisely predicts complex long-horizon motion. Method: Proposes autoregressive flow matching (ARFM), a probabilistic framework for modeling continuous sequential data, trained on diverse video datasets to predict future point track locations. Result: ARFM predicts complex motion patterns and performs well on human and robot motion prediction benchmarks; conditioning on the predicted future tracks significantly improves downstream task performance. Conclusion: ARFM is an effective method for long-horizon motion prediction; through training on large, diverse data and modeling track prediction, it applies broadly to settings such as human and robot motion prediction. Abstract: Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models are publicly available at: https://github.com/Johnathan-Xie/arfm-motion-prediction.
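For readers unfamiliar with flow matching, the training target can be sketched in a few lines. The sketch below assumes the common linear-interpolation path between a noise sample and a data sample; the abstract does not state which path ARFM uses or how the autoregressive conditioning on past tracks enters, so the function names and this choice of path are assumptions.

```python
def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and the target velocity for a
    linear path from noise sample x0 to data sample x1 (assumed path)."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a model's predicted velocity and the target.
    `predict_velocity(x_t, t)` is a hypothetical stand-in for the network."""
    x_t, target = flow_matching_pair(x0, x1, t)
    predicted = predict_velocity(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)
```

In an autoregressive variant, `predict_velocity` would presumably also receive the previously generated track positions as context; that conditioning is omitted here.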

[137] Multimodal Diffeomorphic Registration with Neural ODEs and Structural Descriptors

Salvador Rodriguez-Sanz,Monica Hernandez

Main category: cs.CV

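TL;DR: Proposes a Neural ODE-based multimodal diffeomorphic registration method that uses structural descriptors and local mutual information to achieve accurate, robust, and efficient nonrigid image registration.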

Details

Motivation: Existing nonrigid registration methods trade off accuracy, computational complexity, and regularization, and most are limited by a monomodal assumption that makes them hard to extend to multimodal settings. Method: Proposes an instance-specific framework that combines Neural ODEs with modality-agnostic structural descriptors (image-based or feature-based) and local mutual information, building a continuous-depth network model that supports multimodal registration with large or small deformations. Result: The method is validated on multiple dataset combinations, with qualitative and quantitative results surpassing state-of-the-art methods, along with good multi-scale adaptability, low sensitivity to regularization, and high efficiency. Conclusion: Without requiring training, the method achieves robust, flexible, and efficient multimodal nonrigid image registration across different deformation magnitudes and scales. Abstract: This work proposes a multimodal diffeomorphic registration method using Neural Ordinary Differential Equations (Neural ODEs). Nonrigid registration algorithms exhibit tradeoffs between their accuracy, the computational complexity of their deformation model, and its proper regularization. In addition, they also assume intensity correlation in anatomically homologous regions of interest among image pairs, limiting their applicability to the monomodal setting. Unlike learning-based models, we propose an instance-specific framework that is not subject to high scan requirements for training and does not suffer performance degradation at inference time on modalities unseen during training. Our method exploits the potential of continuous-depth networks in the Neural ODE paradigm with structural descriptors, widely adopted as modality-agnostic metric models which exploit self-similarities on parameterized neighborhood geometries. We propose three different variants that integrate image-based or feature-based structural descriptors and nonstructural image similarities computed by local mutual information. We conduct extensive evaluations on different experiments formed by scan dataset combinations and show qualitative and quantitative results that surpass state-of-the-art baselines suited to large or small deformations and baselines specific to multimodal registration. Lastly, we also demonstrate the underlying robustness of the proposed framework to varying levels of explicit regularization while maintaining low error, its suitability for registration at varying scales, and its efficiency with respect to other methods targeted to large-deformation registration.

[138] SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis

Paul Dobre,Jackson Cooper,Xin Wang,Hongzhou Yang

Main category: cs.CV

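TL;DR: This paper proposes SCPainter, a unified framework that combines 3D Gaussian Splat asset representations with diffusion-based generation to achieve realistic 3D asset insertion and novel view synthesis (NVS), increasing the diversity of training data for autonomous driving simulation.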

Details

Motivation: Existing methods treat 3D asset insertion and novel view synthesis in isolation, which makes realistic interaction with the scene difficult; inserted 3D assets also lack realism in details such as lighting and shadows, limiting the usefulness of the training data. A unified framework that handles both tasks jointly is therefore needed. Method: SCPainter projects vehicle assets represented as 3D Gaussian Splats (GS) together with scene point clouds into novel views and uses these projections as conditioning inputs to a diffusion model that generates high-quality images, jointly achieving 3D asset insertion and NVS. Result: Experiments on the Waymo Open Dataset show that the framework supports 3D asset insertion and novel view image generation, improving the realism and diversity of synthesized scenes. Conclusion: SCPainter offers a unified and effective solution for autonomous driving simulation that generates more realistic and diverse training data, helping to improve the robustness and safety of autonomous driving models. Abstract: 3D asset insertion and novel view synthesis (NVS) are key components for autonomous driving simulation, enhancing the diversity of training data. With better training data that is diverse and covers a wide range of situations, including long-tailed driving scenarios, autonomous driving models can become more robust and safer. This motivates a unified simulation framework that can jointly handle realistic integration of inserted 3D assets and NVS. Recent 3D asset reconstruction methods enable reconstruction of dynamic actors from video, supporting their re-insertion into simulated driving scenes. While the overall structure and appearance can be accurate, these methods still struggle to capture the realism of 3D assets through lighting or shadows, particularly when inserted into scenes. In parallel, recent advances in NVS methods have demonstrated promising results in synthesizing viewpoints beyond the originally recorded trajectories. However, existing approaches largely treat asset insertion and NVS capabilities in isolation. To allow for interaction with the rest of the scene and to enable more diverse creation of new scenarios for training, realistic 3D asset insertion should be combined with NVS. To address this, we present SCPainter (Street Car Painter), a unified framework which integrates 3D Gaussian Splat (GS) car asset representations and 3D scene point clouds with diffusion-based generation to jointly enable realistic 3D asset insertion and NVS. The 3D GS assets and 3D scene point clouds are projected together into novel views, and these projections are used to condition a diffusion model to generate high quality images. Evaluation on the Waymo Open Dataset demonstrates the capability of our framework to enable 3D asset insertion and NVS, facilitating the creation of diverse and realistic driving data.
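The projection step described above, in which 3D points are mapped into a novel camera view, can be sketched with a standard pinhole camera model. This is a generic illustration rather than SCPainter's code: it assumes points are already expressed in the novel camera's coordinate frame, and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical parameters.

```python
def project_points(points_cam, fx, fy, cx, cy):
    """Project camera-frame 3D points (x, y, z) to pixel coordinates.

    Returns (u, v, depth) tuples; points at or behind the camera plane
    (z <= 0) are dropped because they are not visible.
    """
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        pixels.append((u, v, z))
    return pixels
```

Keeping the depth alongside each pixel makes it possible to resolve occlusions between asset points and scene points before the projections are used as conditioning.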

[139] Improved cystic hygroma detection from prenatal imaging using ultrasound-specific self-supervised representation learning

Youssef Megahed,Robin Ducharme,Inok Lee,Inbal Willner,Olivier X. Miguel,Kevin Dick,Adrian D. C. Chan,Mark Walker,Steven Hawken

Main category: cs.CV

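TL;DR: This study evaluates whether ultrasound-specific self-supervised pretraining (USF-MAE) improves deep learning detection of cystic hygroma in first-trimester imaging; it outperforms a DenseNet-169 baseline in accuracy, sensitivity, and specificity.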

Details

Motivation: Because labelled datasets are small, conventional supervised deep learning is limited for automated cystic hygroma detection, so a more efficient learning paradigm is needed to improve performance and generalization. Method: Fine-tunes the USF-MAE model, pretrained on over 370,000 unlabelled ultrasound images, for binary classification of normal versus cystic hygroma cases; evaluation uses the same dataset, preprocessing pipeline, and 4-fold cross-validation protocol as the DenseNet-169 baseline, with Score-CAM visualizations for interpretability. Result: USF-MAE outperforms the baseline on all metrics: mean accuracy 0.96 (vs. 0.93), sensitivity 0.94 (vs. 0.92), specificity 0.98 (vs. 0.94), and ROC-AUC 0.98 (vs. 0.94); a Wilcoxon signed-rank test indicates the difference is statistically significant (p = 0.0057). Conclusion: Ultrasound-specific self-supervised pretraining effectively improves cystic hygroma detection with small labelled datasets, shows clinical relevance and potential for wider use, and supports application in early screening programs. Abstract: Cystic hygroma is a high-risk prenatal ultrasound finding that portends high rates of chromosomal abnormalities, structural malformations, and adverse pregnancy outcomes. Automated detection can increase reproducibility and support scalable early screening programs, but supervised deep learning methods are limited by small labelled datasets. This study assesses whether ultrasound-specific self-supervised pretraining can facilitate accurate, robust deep learning detection of cystic hygroma in first-trimester ultrasound images. We fine-tuned the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), pretrained on over 370,000 unlabelled ultrasound images, for binary classification of normal controls and cystic hygroma cases used in this study. Performance was evaluated on the same curated ultrasound dataset, preprocessing pipeline, and 4-fold cross-validation protocol as for the DenseNet-169 baseline, using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). Model interpretability was analyzed qualitatively using Score-CAM visualizations. USF-MAE outperformed the DenseNet-169 baseline on all evaluation metrics. The proposed model yielded a mean accuracy of 0.96, sensitivity of 0.94, specificity of 0.98, and ROC-AUC of 0.98 compared to 0.93, 0.92, 0.94, and 0.94 for the DenseNet-169 baseline, respectively. Qualitative Score-CAM visualizations of model predictions demonstrated clinical relevance by highlighting expected regions in the fetal neck for both positive and negative cases. Paired statistical analysis using a Wilcoxon signed-rank test confirmed that performance improvements achieved by USF-MAE were statistically significant (p = 0.0057).
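As a reminder of how the reported metrics are defined, the sketch below computes accuracy, sensitivity, and specificity from binary labels, with 1 standing for a cystic hygroma case and 0 for a normal control. The function name is illustrative and any inputs passed to it here would be made-up examples, not data from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on positives), and specificity
    (recall on negatives) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Under k-fold cross-validation, these values would be computed per fold and then averaged, which is how a "mean accuracy" such as the one reported above is usually obtained.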

[140] Split4D: Decomposed 4D Scene Reconstruction Without Video Segmentation

Yongzhen Hu,Yihui Yang,Haotong Lin,Yifan Wang,Junting Dong,Yifu Deng,Xinyu Zhu,Fan Jia,Hujun Bao,Xiaowei Zhou,Sida Peng

Main category: cs.CV

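TL;DR: This paper proposes Freetime FeatureGS, a representation for decomposed 4D scene reconstruction from multi-view videos; by combining Gaussian primitives that carry learnable features with a streaming feature learning strategy, it removes the dependence on video segmentation quality and yields more robust, higher-quality 4D reconstruction.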

Details

Motivation: Existing methods rely on unstable video segmentation maps, which leads to unreliable reconstructions; this work aims to remove the dependence on high-quality video segmentation and improve the stability and accuracy of 4D scene reconstruction. Method: Represents the dynamic scene with Freetime FeatureGS, a set of Gaussian primitives with learnable features and linear motion; a contrastive loss pulls together the projected features of the same instance and pushes apart those of different instances, and training observations are sampled in temporal order to propagate features across time in a streaming manner. Result: Experiments on several datasets show that the method outperforms recent methods by a large margin in reconstruction quality. Conclusion: By combining movable Gaussian primitives with streaming contrastive feature learning, Freetime FeatureGS achieves high-quality decomposed 4D scene reconstruction without video segmentation, with better robustness and optimization stability. Abstract: This paper addresses the problem of decomposed 4D scene reconstruction from multi-view videos. Recent methods achieve this by lifting video segmentation results to a 4D representation through differentiable rendering techniques. Therefore, they heavily rely on the quality of video segmentation maps, which are often unstable, leading to unreliable reconstruction results. To overcome this challenge, our key idea is to represent the decomposed 4D scene with the Freetime FeatureGS and design a streaming feature learning strategy to accurately recover it from per-image segmentation maps, eliminating the need for video segmentation. Freetime FeatureGS models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion ability, allowing them to move to neighboring regions over time. We apply a contrastive loss to Freetime FeatureGS, forcing primitive features to be close or far apart based on whether their projections belong to the same instance in the 2D segmentation map. As our Gaussian primitives can move across time, it naturally extends the feature learning to the temporal dimension, achieving 4D segmentation. Furthermore, we sample observations for training in a temporally ordered manner, enabling the streaming propagation of features over time and effectively avoiding local minima during the optimization process. Experimental results on several datasets show that the reconstruction quality of our method outperforms recent methods by a large margin.
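The contrastive objective described above (features close when their projections share an instance, far apart otherwise) can be sketched with a classic margin-based pairwise loss. The abstract does not give the exact formulation, so the margin form, the function name, and the `margin` parameter below are assumptions chosen for illustration.

```python
import math


def pairwise_contrastive_loss(features, instance_ids, margin=1.0):
    """Average margin-based contrastive loss over all feature pairs.

    Pairs with the same instance id are penalized by squared distance;
    pairs with different ids are penalized only if closer than `margin`.
    """
    total, count = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            distance = math.dist(features[i], features[j])
            if instance_ids[i] == instance_ids[j]:
                total += distance ** 2
            else:
                total += max(0.0, margin - distance) ** 2
            count += 1
    return total / count if count else 0.0
```

In the setting of the abstract, `features` would be rendered per-pixel features and `instance_ids` would come from a per-image 2D segmentation map.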

[141] TrimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts

Hao Zhang,Mengsi Lyu,Bo Huang,Yulong Ao,Yonghua Lin

Main category: cs.CV

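TL;DR: This paper proposes an adaptive visual token pruning method for long-context, multi-image settings; by decomposing redundancy and allocating budgets dynamically, it substantially reduces the number of visual tokens while maintaining performance.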

Details

Motivation: Existing visual token pruning methods often overlook long-context, multi-image inputs, leading to high inference cost and low efficiency. Method: Decomposes redundancy into intra-image and inter-image components, quantified by intra-image diversity and inter-image variation, and designs a two-stage method: the intra-image stage performs content-aware token budget allocation and representative token selection, and the inter-image stage balances diversity with text alignment through global diversity filtering and Pareto selection. Result: Experiments show that the method significantly reduces the number of visual tokens in long-context settings while maintaining strong model performance. Conclusion: The proposed adaptive pruning method effectively addresses visual token redundancy in long-context, multi-image scenarios and improves LMM inference efficiency. Abstract: Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution, yet existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach maintains strong performance in long context settings while significantly cutting down the number of visual tokens.
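Greedy token selection under a per-image budget can be illustrated as follows. The abstract does not specify the selection criterion, so this sketch substitutes a simple farthest-point rule (repeatedly pick the token farthest from everything chosen so far) as one plausible way to cover an image's token set; the function name and the choice to start from token 0 are assumptions.

```python
import math


def greedy_select(tokens, budget):
    """Return indices of up to `budget` tokens chosen by farthest-point
    selection, starting from token 0 (an arbitrary, assumed starting point)."""
    if budget <= 0 or not tokens:
        return []
    selected = [0]
    chosen = {0}
    # Distance from each token to its nearest already-selected token.
    nearest = [math.dist(token, tokens[0]) for token in tokens]
    while len(selected) < min(budget, len(tokens)):
        candidates = (i for i in range(len(tokens)) if i not in chosen)
        next_index = max(candidates, key=lambda i: nearest[i])
        selected.append(next_index)
        chosen.add(next_index)
        for i, token in enumerate(tokens):
            nearest[i] = min(nearest[i], math.dist(token, tokens[next_index]))
    return selected
```

A content-aware budget could then be set per image, for example larger for images whose tokens are more spread out, before calling `greedy_select` on each image's tokens.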

[142] Neighbor-Aware Token Reduction via Hilbert Curve for Vision Transformers

Yunge Li,Lanyu Xu

Main category: cs.CV

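TL;DR: This paper proposes neighbor-aware token reduction based on Hilbert curve reordering, which preserves the neighbor structure of 2D space to improve the computational efficiency and accuracy of Vision Transformers.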

Details

Motivation: Existing token merging and pruning strategies often ignore spatial continuity and neighbor relationships, losing local context and limiting the efficiency and performance of Vision Transformers. Method: Proposes two strategies, Neighbor-Aware Pruning (NAP) for selective token retention and Merging by Adjacent Token similarity (MAT) for local aggregation, and uses a Hilbert curve to map the 2D spatial structure to a 1D sequence that preserves neighbor relationships. Result: Experiments show that the method achieves state-of-the-art accuracy-efficiency trade-offs compared with existing token reduction methods. Conclusion: Preserving spatial continuity and neighbor structure is important for optimizing Vision Transformers, and the proposed method offers new insight for architectural design. Abstract: Vision Transformers (ViTs) have achieved remarkable success in visual recognition tasks, but redundant token representations limit their computational efficiency. Existing token merging and pruning strategies often overlook spatial continuity and neighbor relationships, resulting in the loss of local context. This paper proposes novel neighbor-aware token reduction methods based on Hilbert curve reordering, which explicitly preserves the neighbor structure in a 2D space using 1D sequential representations. Our method introduces two key strategies: Neighbor-Aware Pruning (NAP) for selective token retention and Merging by Adjacent Token similarity (MAT) for local token aggregation. Experiments demonstrate that our approach achieves state-of-the-art accuracy-efficiency trade-offs compared to existing methods. This work highlights the importance of spatial continuity and neighbor structure, offering new insights for the architectural optimization of ViTs.
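The Hilbert curve reordering that underlies this approach can be shown with the standard coordinate-to-index conversion. The sketch below is the textbook algorithm for a square grid whose side is a power of two; how the paper handles other grid sizes, and how NAP and MAT then use the ordering, is not covered here, and the helper names are illustrative.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid,
    where n is a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """All grid cells (x, y) sorted by their position on the curve."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda cell: hilbert_index(n, cell[0], cell[1]))
```

Consecutive cells in `hilbert_order(n)` are always direct spatial neighbors, which is the property that lets a 1D token sequence retain 2D locality; a plain row-major scan breaks this at the end of every row.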

[143] Next Best View Selections for Semantic and Dynamic 3D Gaussian Splatting

Yiqian Li,Wen Jiang,Kostas Daniilidis

Main category: cs.CV

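TL;DR: Proposes a Fisher information-based active learning algorithm that selects the most informative views for semantic and dynamic scene modeling, improving rendering quality and semantic segmentation performance.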

Details

Motivation: Semantic and dynamic modeling tasks involve substantial data redundancy, so the most informative frames must be selected effectively to improve training efficiency. Method: Formulates view selection as an active learning problem and uses Fisher information to quantify the information gain of candidate views with respect to semantic Gaussian parameters and deformation networks. Result: Validated on large-scale static image and dynamic video datasets, the method improves rendering quality and semantic segmentation performance over random selection and uncertainty-based heuristics. Conclusion: The method offers a principled way to handle semantic reasoning and dynamic scene modeling jointly and outperforms existing heuristic strategies. Abstract: Understanding semantics and dynamics has been crucial for embodied agents in various tasks. Both tasks have much more data redundancy than the static scene understanding task. We formulate the view selection problem as an active learning problem, where the goal is to prioritize frames that provide the greatest information gain for model training. To this end, we propose an active learning algorithm with Fisher Information that quantifies the informativeness of candidate views with respect to both semantic Gaussian parameters and deformation networks. This formulation allows our method to jointly handle semantic reasoning and dynamic scene modeling, providing a principled alternative to heuristic or random strategies. We evaluate our method on large-scale static images and dynamic video datasets by selecting informative frames from multi-camera setups. Experimental results demonstrate that our approach consistently improves rendering quality and semantic segmentation performance, outperforming baseline methods based on random selection and uncertainty-based heuristics.

[144] Plug In, Grade Right: Psychology-Inspired AGIQA

Zhicheng Liao,Baoliang Chen,Hanwei Zhu,Lingyu Zhu,Shiqi Wang,Weisi Lin

Main category: cs.CV

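TL;DR: Proposes a quality grading module based on an Arithmetic Graded Response Model (AGQG) that models the monotonic relationship between image ability and grade difficulty, mitigating semantic drift in text-image embeddings and improving AGIQA performance.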

Details

Motivation: Existing AGIQA models rely on text-image embedding similarity, but the similarity distributions across grades are multimodal and suffer from semantic drift, which undermines reliability. Method: Inspired by psychometrics, introduces the Graded Response Model (GRM) and designs a two-branch quality grading module: one branch estimates image ability, and the other models grade difficulties as an arithmetic progression, ensuring monotonic difficulty and a unimodal quality distribution. Result: The AGQG module is plug-and-play and consistently improves multiple state-of-the-art AGIQA frameworks; it also generalizes to natural and screen content image quality assessment. Conclusion: By structurally modeling the relationship between image ability and grade difficulty, AGQG effectively mitigates semantic drift, generalizes well, is interpretable, and could become a key component of future IQA models. Abstract: Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both "excellent" and "poor" grade descriptions while deviating from the "good" one. We refer to this phenomenon as "semantic drift", where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject's ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image's ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.
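To make the graded response idea concrete, the sketch below computes grade probabilities from a scalar ability and difficulty thresholds spaced as an arithmetic progression. It follows the classical GRM form, in which the probability of reaching at least grade k is a logistic function of ability minus difficulty; the parameter names and the shared `discrimination` factor are assumptions, and the paper's module may be parameterized differently.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grade_probabilities(ability, first_difficulty, step, num_grades, discrimination=1.0):
    """Probability of each of `num_grades` ordered grades under a graded
    response model whose thresholds are first_difficulty + k * step."""
    if step <= 0:
        raise ValueError("step must be positive so difficulties increase monotonically")
    thresholds = [first_difficulty + k * step for k in range(num_grades - 1)]
    # cumulative[k] = P(grade >= k); the lowest grade is always reached.
    cumulative = [1.0]
    cumulative += [sigmoid(discrimination * (ability - b)) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(num_grades)]
```

For instance, `grade_probabilities(0.3, -3.0, 2.0, 5)` puts most of its mass on the middle grade and tapers off on both sides. Because the thresholds increase with a positive step, every grade probability is non-negative and the probabilities sum to one.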

[145] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

Ruoyu Wang,Ziyu Li,Beier Zhu,Liangyu Yuan,Hanwang Zhang,Xun Yang,Xiaojun Chang,Chi Zhang

Main category: cs.CV

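TL;DR: This paper proposes EPD-Solver, a novel ODE solver that uses multiple parallel gradient evaluations to reduce the accumulated truncation error of diffusion models under low-latency sampling, accelerating generation while preserving image quality.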

Details

Motivation: Existing solver-based acceleration methods often suffer significant image quality degradation under low-latency budgets because they fail to capture high-curvature trajectory segments; this paper aims to address that problem. Method: Proposes EPD-Solver, which draws on the Mean Value Theorem for vector-valued functions and combines multiple parallel gradient directions to approximate the integral, together with a two-stage optimization framework: a small set of learnable parameters is first optimized by distillation, and a parameter-efficient reinforcement learning fine-tuning scheme then optimizes performance within the low-dimensional solver space. Result: EPD-Solver improves generated image quality while keeping latency low, performs especially well on complex text-to-image tasks, and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers. Conclusion: The method effectively mitigates truncation error in diffusion sampling; by parallelizing gradient computation and optimizing in a low-dimensional space, it achieves efficient, high-quality generation with good generality and practicality. Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving the low-latency nature of sampling. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
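A single solver step that combines several independent gradient evaluations might look like the sketch below, written for a scalar ODE dx/dt = f(x, t). It is an assumption-laden illustration rather than the EPD-Solver update itself: intermediate states are obtained with plain Euler predictions, and `fractions` (where within the step each gradient is evaluated) and `weights` (how they are combined) are hypothetical stand-ins for the small set of learnable parameters mentioned in the abstract.

```python
def parallel_direction_step(f, x, t, h, fractions, weights):
    """Advance x by step size h using a weighted sum of gradients evaluated
    at several intermediate points inside the step.

    Each intermediate evaluation depends only on (x, t) and the base
    gradient, so the evaluations are independent and could run in parallel.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    base = f(x, t)
    directions = [f(x + r * h * base, t + r * h) for r in fractions]
    return x + h * sum(w * d for w, d in zip(weights, directions))
```

With `fractions=[0.5]` and `weights=[1.0]` this reduces to the explicit midpoint method, so the multi-direction form can be read as a generalization whose evaluation points and weights are tuned instead of fixed.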

[146] VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM

Jingchao Wang,Kaiwen Zhou,Zhijian Wu,Kunhua Ji,Dingjiang Huang,Yefeng Zheng

Main category: cs.CV

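TL;DR: This paper proposes VPTracker, the first global vision-language tracking framework based on multimodal large language models; a location-aware visual prompting mechanism injects spatial priors so that targets are localized across the whole image, handling viewpoint changes, occlusions, and rapid motion while improving tracking stability and target disambiguation.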

Details

Motivation: Existing vision-language tracking methods are usually restricted to local search, so they tend to fail under viewpoint changes, occlusions, or rapid target movement, and they lack the global semantic reasoning needed for stable tracking. Method: Proposes VPTracker, a global tracking framework based on multimodal large language models, with a location-aware visual prompting mechanism that builds a region-level prompt from the target's previous location so the model prioritizes region-level recognition and resorts to global inference only when necessary. Result: Experiments show that the method significantly improves tracking stability and resistance to distractors in challenging scenarios, particularly under occlusion, viewpoint change, and interference from similar objects. Conclusion: VPTracker is the first to use multimodal large language models for global vision-language tracking; by combining spatial priors with semantic reasoning it achieves more robust tracking and opens a new avenue for MLLMs in visual tracking. Abstract: Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.

[147] Medical Scene Reconstruction and Segmentation based on 3D Gaussian Representation

Bin Liu,Wenyan Tian,Huangxin Fu,Zizheng Li,Zhifen He,Bo Li

Main category: cs.CV

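TL;DR: Proposes an efficient 3D medical image reconstruction method based on 3D Gaussian and tri-plane representations that improves structural continuity and semantic consistency under sparse slicing, significantly raising reconstruction efficiency and image quality.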

Details

Motivation: Traditional 3D reconstruction methods for medical images are computationally expensive and, with sparse slices, prone to structural discontinuities and loss of detail, making it hard to meet clinical accuracy requirements. Method: Combines 3D Gaussian and tri-plane representations, retaining the Gaussian representation's strengths in efficient rendering and geometric representation while using the tri-plane to strengthen structural continuity and semantic consistency under sparse conditions. Result: Validated on multimodal medical datasets including ultrasound (US) and magnetic resonance imaging (MRI), the method generates high-quality, anatomically coherent, and semantically stable images from sparse data while significantly improving reconstruction efficiency. Conclusion: The method provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images. Abstract: 3D reconstruction of medical images is a key technology in medical image analysis and clinical diagnosis, providing structural visualization support for disease assessment and surgical planning. Traditional methods are computationally expensive and prone to structural discontinuities and loss of detail in sparse slices, making it difficult to meet clinical accuracy requirements. To address these challenges, we propose an efficient 3D reconstruction method based on 3D Gaussian and tri-plane representations. This method not only maintains the advantages of Gaussian representation in efficient rendering and geometric representation but also significantly enhances structural continuity and semantic consistency under sparse slicing conditions. Experimental results on multimodal medical datasets such as US and MRI show that our proposed method can generate high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency. This provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images.

[148] Evaluating the Performance of Open-Vocabulary Object Detection in Low-quality Image

Po-Chih Wu

Main category: cs.CV

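TL;DR: This paper studies how existing open-vocabulary object detection models perform on low-quality images and introduces a new dataset that simulates real-world low-quality images. Experiments show that performance barely drops under mild degradation but falls sharply for all models under severe degradation, with OWLv2 performing best.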

Details

Motivation: To evaluate the robustness of open-vocabulary object detection models on real-world low-quality images and advance recognition capabilities closer to human level. Method: Builds a new low-quality image dataset that simulates real-world degradation and evaluates existing models (such as OWLv2, OWL-ViT, GroundingDINO, and Detic) under multiple degradation types. Result: mAP does not decrease significantly under low-level degradation but drops sharply under high-level degradation; OWLv2 performs best across degradation types, while OWL-ViT, GroundingDINO, and Detic decline markedly. Conclusion: Current open-vocabulary object detection models still struggle under heavy image degradation and need better robustness; OWLv2 shows stronger adaptability. Abstract: Open-vocabulary object detection enables models to localize and recognize objects beyond a predefined set of categories and is expected to achieve recognition capabilities comparable to human performance. In this study, we aim to evaluate the performance of existing models on open-vocabulary object detection tasks under low-quality image conditions. For this purpose, we introduce a new dataset that simulates low-quality images in the real world. In our evaluation experiment, we find that although open-vocabulary object detection models exhibited no significant decrease in mAP scores under low-level image degradation, the performance of all models dropped sharply under high-level image degradation. OWLv2 models consistently performed better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines. We will release our dataset and code to facilitate future studies.

[149] EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation

Libo Zhang,Zekun Li,Tianyu Li,Zeyu Cao,Rui Xu,Xiaoxiao Long,Wenjia Wang,Jingbo Wang,Yuan Liu,Wenping Wang,Daquan Zhou,Taku Komura,Zhiyang Dou

Main category: cs.CV

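TL;DR: This paper proposes EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video in real time, and builds the Human Reaction Dataset (HRD), a more precisely aligned dataset that addresses the spatial inconsistency of existing data.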

Details

Motivation: Existing pairings of egocentric video and human reaction motion show significant spatial inconsistency (for example, dynamic motions paired with fixed-camera videos), and high-quality, spatiotemporally aligned datasets are lacking, which makes realistic, causal human reactions hard to model. Method: Builds HRD, a spatially aligned egocentric video-reaction dataset, and proposes the EgoReAct framework: a VQ-VAE compresses reaction motion into a compact latent space, a GPT-style model generates it autoregressively, and 3D dynamic features such as metric depth and head dynamics strengthen spatial grounding. Result: Experiments show that EgoReAct clearly outperforms prior methods in realism, spatial consistency, and generation efficiency while maintaining strict causality. Conclusion: EgoReAct enables efficient, realistic, and causal generation of 3D-aligned human reaction motions from egocentric video, opening new possibilities for applications such as virtual reality and human-computer interaction. Abstract: Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth, and head dynamics during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.
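The quantization step at the core of a VQ-VAE, which turns a continuous motion latent into a discrete token that a GPT-style model can predict, is a nearest-neighbor lookup in a codebook. The sketch below shows only that lookup with a tiny hand-written codebook; the function name and the codebook values are illustrative and are not taken from EgoReAct.

```python
import math


def quantize(latent, codebook):
    """Return (index, code vector) of the codebook entry nearest to `latent`
    in Euclidean distance."""
    index = min(range(len(codebook)), key=lambda i: math.dist(latent, codebook[i]))
    return index, codebook[index]


# Hypothetical 3-entry codebook of 2D codes, for illustration only.
example_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
token, code = quantize([0.9, 1.2], example_codebook)
```

A sequence of such indices is what an autoregressive transformer would model token by token, with the decoder mapping predicted tokens back to motion.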

[150] Depth Anything in $360^\circ$: Towards Scale Invariance in the Wild

Hualie Jiang,Ziyang Song,Zhiqiang Lou,Rui Xu,Minglang Tan

Main category: cs.CV

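TL;DR: This paper presents DA360, a depth estimation model for panoramic images that adapts the ViT backbone and adds circular padding, setting a new state of the art for zero-shot panoramic depth estimation on indoor and outdoor datasets.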

Details

Motivation: Zero-shot generalization of panoramic depth estimation in open-world scenes lags far behind that of perspective images, and effective cross-domain transfer methods are lacking. Method: Building on Depth Anything V2, learns a shift parameter from the ViT backbone to produce scale-invariant depth estimates, and adds circular padding to the DPT decoder to eliminate seam artifacts and preserve spherical continuity. Result: On standard indoor benchmarks and the newly curated outdoor dataset Metropolis, DA360 reduces relative depth error by over 50% indoors and 10% outdoors compared with its base model, and improves relative error by about 30% over PanDA. Conclusion: DA360 substantially improves zero-shot panoramic depth estimation, is the current state of the art, and shows good potential for practical use. Abstract: Panoramic depth estimation provides a comprehensive solution for capturing complete $360^\circ$ environmental structural information, offering significant benefits for robotics and AR/VR applications. However, while extensively studied in indoor settings, its zero-shot generalization to open-world domains lags far behind perspective images, which benefit from abundant training data. This disparity makes transferring capabilities from the perspective domain an attractive solution. To bridge this gap, we present Depth Anything in $360^\circ$ (DA360), a panoramic-adapted version of Depth Anything V2. Our key innovation involves learning a shift parameter from the ViT backbone, transforming the model's scale- and shift-invariant output into a scale-invariant estimate that directly yields well-formed 3D point clouds. This is complemented by integrating circular padding into the DPT decoder to eliminate seam artifacts, ensuring spatially coherent depth maps that respect spherical continuity. Evaluated on standard indoor benchmarks and our newly curated outdoor dataset, Metropolis, DA360 shows substantial gains over its base model, achieving over 50% and 10% relative depth error reduction on indoor and outdoor benchmarks, respectively. Furthermore, DA360 significantly outperforms robust panoramic depth estimation methods, achieving about 30% relative error improvement compared to PanDA across all three test datasets and establishing new state-of-the-art performance for zero-shot panoramic depth estimation.
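
Circular padding, as named in the abstract, can be sketched in a few lines. The sketch assumes an equirectangular image stored as a list of rows and pads only along the width (longitude) axis, so that the left and right borders see each other as neighbours; the function name is hypothetical and this is not the DA360 implementation.

```python
def circular_pad_width(image, pad):
    """Pad every row by wrapping columns around the horizontal seam."""
    if pad < 0:
        raise ValueError("pad must be non-negative")
    padded = []
    for row in image:
        width = len(row)
        if pad > width:
            raise ValueError("pad must not exceed the row width")
        left = row[width - pad:]   # last `pad` columns wrap to the left side
        right = row[:pad]          # first `pad` columns wrap to the right side
        padded.append(left + row + right)
    return padded


# One row of a toy panorama: column 4 is adjacent to column 1 on the sphere.
panorama = [[1, 2, 3, 4]]
wrapped = circular_pad_width(panorama, 1)
```

A convolution applied to `wrapped` and cropped back to the original width treats the seam like any interior column, which is why this kind of padding removes the discontinuity that zero padding would introduce.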

[151] KANO: Kolmogorov-Arnold Neural Operator for Image Super-Resolution

Chenyu Li,Danfeng Hong,Bing Zhang,Zhaojie Pan,Jocelyn Chanussot

Main category: cs.CV

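TL;DR: This paper proposes KANO, an interpretable neural operator based on the Kolmogorov-Arnold theorem for single-image super-resolution; it models the degradation process with B-spline functions to produce physically interpretable SR results.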

Details

Motivation: Existing super-resolution methods rely on black-box networks, so the degradation process is hard to interpret or control and lacks physical interpretability. Method: Designs the KANO operator from the Kolmogorov-Arnold theorem, using an additive structure of a finite number of B-spline functions to approximate continuous spectral curves piecewise, and optimizes the spline parameters to capture local linear trends and the peak-valley structures at nonlinear inflection points. Result: KANO is validated on natural images, aerial photographs, and satellite remote sensing data and fits complex degradation processes accurately; a comparison with MLPs indicates advantages for KANs in sequence fitting and interpretability. Conclusion: KANO offers an interpretable modeling framework for super-resolution, sheds light on how deep models can characterize degradation, and advances physically interpretable SR techniques. Abstract: The highly nonlinear degradation process, complex physical interactions, and various sources of uncertainty render single-image Super-resolution (SR) a particularly challenging task. Existing interpretable SR approaches, whether based on prior learning or deep unfolding optimization frameworks, typically rely on black-box deep networks to model latent variables, which leaves the degradation process largely unknown and uncontrollable. Inspired by the Kolmogorov-Arnold theorem (KAT), we propose, for the first time, an interpretable operator termed Kolmogorov-Arnold Neural Operator (KANO), and apply it to image SR. KANO provides a transparent and structured representation of the latent degradation fitting process. Specifically, we employ an additive structure composed of a finite number of B-spline functions to approximate continuous spectral curves in a piecewise fashion. By learning and optimizing the shape parameters of these spline functions within defined intervals, our KANO accurately captures key spectral characteristics, such as local linear trends and the peak-valley structures at nonlinear inflection points, thereby endowing SR results with physical interpretability. Furthermore, through theoretical modeling and experimental evaluations across natural images, aerial photographs, and satellite remote sensing data, we systematically compare multilayer perceptrons (MLPs) and Kolmogorov-Arnold networks (KANs) in handling complex sequence fitting tasks. This comparative study elucidates the respective advantages and limitations of these models in characterizing intricate degradation mechanisms, offering valuable insights for the development of interpretable SR techniques.
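
The additive B-spline structure described above can be sketched with the textbook Cox-de Boor recursion. The uniform knot vector, the cubic degree, and the coefficient values are assumptions chosen for illustration; the paper's actual parameterisation is not given in this summary.

```python
def bspline_basis(i, degree, t, knots):
    """Cox-de Boor recursion for the i-th B-spline basis function at t."""
    if degree == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + degree] - knots[i]
    right_den = knots[i + degree + 1] - knots[i + 1]
    left = 0.0
    if left_den != 0:
        left = ((t - knots[i]) / left_den
                * bspline_basis(i, degree - 1, t, knots))
    right = 0.0
    if right_den != 0:
        right = ((knots[i + degree + 1] - t) / right_den
                 * bspline_basis(i + 1, degree - 1, t, knots))
    return left + right


def spline_curve(t, coeffs, degree, knots):
    """Additive model: a weighted sum of a finite number of basis functions."""
    return sum(c * bspline_basis(i, degree, t, knots)
               for i, c in enumerate(coeffs))


# Hypothetical setup: uniform knots, cubic splines, four learnable coefficients.
knots = [float(k) for k in range(8)]
degree = 3
basis_at_mid = [bspline_basis(i, degree, 3.5, knots) for i in range(4)]
flat_value = spline_curve(3.5, [2.0, 2.0, 2.0, 2.0], degree, knots)
```

Each coefficient only influences the curve on the few knot intervals its basis function covers, which is the locality that makes a fitted spline readable interval by interval.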

[152] 3D Scene Change Modeling With Consistent Multi-View Aggregation

Zirui Zhou,Junfeng Ni,Shujie Zhang,Yixin Chen,Siyuan Huang

Main category: cs.CV

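TL;DR: Proposes SCaR-3D, a 3D Gaussian-based object-level change detection framework that achieves spatially consistent separation of pre- and post-change states and continual scene reconstruction.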

Details

Motivation: Existing 3D change detection methods suffer from spatial inconsistency and cannot explicitly separate pre- and post-change states. Method: Uses a signed-distance-based 2D differencing module followed by multi-view aggregation with voting and pruning, exploiting the consistency of 3DGS for robust change detection; also proposes a continual reconstruction strategy that selectively updates dynamic regions. Result: Experiments show high accuracy and efficiency, outperforming existing methods; the paper also contributes CCS3D, a new synthetic dataset for controlled evaluation. Conclusion: SCaR-3D resolves the spatial inconsistency problem and delivers accurate object-level change detection with continual scene reconstruction. Abstract: Change detection plays a vital role in scene monitoring, exploration, and continual reconstruction. Existing 3D change detection methods often exhibit spatial inconsistency in the detected changes and fail to explicitly separate pre- and post-change states. To address these limitations, we propose SCaR-3D, a novel 3D scene change detection framework that identifies object-level changes from a dense-view pre-change image sequence and sparse-view post-change images. Our approach consists of a signed-distance-based 2D differencing module followed by multi-view aggregation with voting and pruning, leveraging the consistent nature of 3DGS to robustly separate pre- and post-change states. We further develop a continual scene reconstruction strategy that selectively updates dynamic regions while preserving the unchanged areas. We also contribute CCS3D, a challenging synthetic dataset that allows flexible combinations of 3D change types to support controlled evaluations. Extensive experiments demonstrate that our method achieves both high accuracy and efficiency, outperforming existing methods.

[153] A Minimal Solver for Relative Pose Estimation with Unknown Focal Length from Two Affine Correspondences

Zhenbao Yu,Shirong Ye,Ronghe Jin,Shunkun Liang,Zibin Liu,Huiyun Zhang,Banglei Guan

Main category: cs.CV

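TL;DR: This paper proposes a new method that estimates two-view relative pose and focal length from two affine correspondences and a known vertical direction; IMU measurements simplify the problem, and the new solver outperforms existing state-of-the-art methods on synthetic and real-world datasets.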

Details

Motivation: Cameras are often paired with IMUs, which provide the vertical direction; this information reduces the degrees of freedom of the relative pose and so can improve estimation efficiency and accuracy. Method: Builds constraint equations from two affine correspondences, uses the property that the equation system has nontrivial solutions to derive four equations involving only the focal length and the relative rotation angle, and solves them with the polynomial eigenvalue method. Result: Experiments on synthetic and real-world datasets show that the proposed solver outperforms existing state-of-the-art methods at estimating the focal length and the 3DOF relative pose. Conclusion: The method makes effective use of the vertical direction supplied by the IMU, reduces the complexity of the problem, and yields more accurate relative pose and focal length estimates, with good prospects for practical use. Abstract: In this paper, we aim to estimate the relative pose and focal length between two views with known intrinsic parameters except for an unknown focal length from two affine correspondences (ACs). Cameras are commonly used in combination with inertial measurement units (IMUs) in applications such as self-driving cars, smartphones, and unmanned aerial vehicles. The vertical direction of camera views can be obtained by IMU measurements. The relative pose between two cameras is thereby reduced from 5DOF to 3DOF. We propose a new solver to estimate the 3DOF relative pose and focal length. First, we establish constraint equations from two affine correspondences when the vertical direction is known. Then, based on the properties of the equation system with nontrivial solutions, four equations can be derived. These four equations only involve two parameters: the focal length and the relative rotation angle. Finally, the polynomial eigenvalue method is utilized to solve the problem of focal length and relative rotation angle. The proposed solver is evaluated using synthetic and real-world datasets. The results show that our solver performs better than the existing state-of-the-art solvers.

[154] ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

Bangya Liu,Xinyu Gong,Zelin Zhao,Ziyang Song,Yulei Lu,Suhui Wu,Jun Zhang,Suman Banerjee,Hao Zhang

Main category: cs.CV

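TL;DR: This paper proposes ByteLoom, a Diffusion Transformer-based framework for generating geometrically consistent multi-view human-object interaction videos; its RCM-cache mechanism and progressive training strategy address the limitations of existing methods in cross-view consistency and reliance on hand annotations.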

Details

Motivation: Existing HOI video generation methods handle multi-view consistency poorly and rely heavily on fine-grained hand mesh annotations, which limits their use in practice. The paper aims to overcome these problems with more efficient conditioning inputs and 3D object modeling. Method: Proposes ByteLoom, built on a Diffusion Transformer; designs an RCM-cache mechanism that uses Relative Coordinate Maps (RCM) to keep object geometry consistent and control 6-DoF transformations; uses simplified pose conditioning and 3D object inputs, with a progressive training curriculum that reduces dependence on hand mesh annotations. Result: Experiments show that the method significantly outperforms existing methods in preserving human identity, multi-view object geometry, motion smoothness, and natural manipulation, with higher visual quality and cross-view consistency. Conclusion: By introducing the RCM-cache and weakening the dependence on hand annotations, ByteLoom improves the geometric consistency and practicality of HOI video generation and offers a more scalable solution for applications such as digital humans and e-commerce. Abstract: Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object illustration, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain the object's geometric consistency while precisely controlling 6-DoF object transformations. To compensate for HOI dataset scarcity and leverage existing datasets, we further design a training curriculum that enhances model capabilities progressively and relaxes the need for hand mesh annotations. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry, while maintaining smooth motion and object manipulation.

Zhuonan Liu,Xinyu Zhang,Zishuo Wang,Tomohito Kawabata,Xuesu Xiao,Ling Xiao

Main category: cs.CV

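TL;DR: MUSON is a multimodal dataset for short-horizon social navigation with five-step chain-of-thought annotations (perception, prediction, reasoning, action, explanation); it explicitly models physical constraints, balances the action space, and supports effective benchmarking of socially compliant navigation.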

Details

Motivation: Existing social navigation datasets lack explicit reasoning supervision and have severely long-tailed action distributions, making safety-critical behaviors hard to learn. Method: Introduces the MUSON dataset, with a five-step Chain-of-Thought annotation process, structured physical constraints, and a balanced discrete action space. Result: In benchmarks across several vision-language models, Qwen2.5-VL-3B achieves the highest decision accuracy, 0.8625. Conclusion: MUSON provides a reusable, structured, high-quality benchmark dataset for socially compliant navigation that helps improve model safety and interpretability. Abstract: Socially compliant navigation requires structured reasoning over dynamic pedestrians and physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models' ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a rationally balanced discrete action space. Compared to SNEI, MUSON provides consistent reasoning, action, and explanation. Benchmarking multiple state-of-the-art Small Vision Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at https://huggingface.co/datasets/MARSLab/MUSON

[156] Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs

Ziyu Zhou,Haozhe Luo,Mohammad Reza Hosseinzadeh Taher,Jiaxuan Pang,Xiaowei Ding,Michael B. Gotway,Jianming Liang

Main category: cs.CV

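TL;DR: Lamps is a self-supervised foundation model for medical imaging that is pre-trained on large-scale chest radiographs by exploiting the consistency, coherence, and hierarchy of human anatomy, yielding substantial gains in robustness, transferability, and clinical potential.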

Details

Motivation: Existing self-supervised methods for medical imaging overlook the consistency, coherence, and hierarchy of human anatomy, which limits how well they learn anatomical features. Lamps aims to improve representation learning by introducing these structural priors. Method: Proposes Lamps, which jointly uses anatomical consistency (e.g., bilateral symmetry), coherence (spatial relations between organs), and hierarchy (tissue-organ-system levels) from multiple perspectives as self-supervision signals during pre-training on large-scale chest radiographs. Result: Validated on 10 datasets through fine-tuning and emergent property analysis, Lamps outperforms 10 baseline models across a range of downstream tasks, showing stronger robustness, cross-domain transferability, and potential clinical value. Conclusion: By modeling human anatomy from multiple perspectives, Lamps gives medical imaging foundation models a learning paradigm that better matches biological structure and advances meaningful, robust medical representation learning. Abstract: Foundation models have been successful in natural language processing and computer vision because they are capable of capturing the underlying structures (foundation) of natural languages. However, in medical imaging, the key foundation lies in human anatomy, as these images directly represent the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet, existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to effectively learn anatomical features. To overcome the limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision) pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as the supervision signal. Extensive experiments across 10 datasets evaluated through fine-tuning and emergent property analysis demonstrate Lamps' superior robustness, transferability, and clinical potential when compared to 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations that are aligned with the structure of human anatomy.

[157] Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples

Weiwei Li,Junzhuo Liu,Yuanyuan Ren,Yuchen Zheng,Yahao Liu,Wen Li

Main category: cs.CV

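TL;DR: This paper proposes a data-driven method for mitigating spurious correlations in deep learning models; its identify, neutralize, eliminate, and update pipeline substantially improves worst-group accuracy on image and NLP debiasing benchmarks.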

Details

Motivation: Existing methods usually depend on annotating potential spurious attributes or on filtering spurious features under empirical assumptions, and can perform poorly on real-world data because spurious correlations are intricate and elusive. Method: Observes that samples influenced by spurious features are dispersed in the learned feature space and uses this to identify the presence of spurious features; obtains a bias-invariant representation through a simple grouping strategy, learns a feature transformation that eliminates spurious features by aligning with this representation, and finally updates the classifier with the learned transformation. Result: On image and natural language processing debiasing benchmarks, worst-group accuracy improves by more than 20% over standard empirical risk minimization (ERM). Conclusion: The proposed pipeline effectively mitigates spurious correlation in deep learning models and improves robustness and fairness across tasks. Abstract: Deep learning models are known to often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or filtering spurious features based on some empirical assumptions (e.g., simplicity of bias). However, these methods may yield unsatisfactory performance due to the intricate and elusive nature of spurious correlations in real-world data. In this paper, we propose a data-oriented approach to mitigate the spurious correlation in deep learning models. We observe that samples that are influenced by spurious features tend to exhibit a dispersed distribution in the learned feature space. This allows us to identify the presence of spurious features. Subsequently, we obtain a bias-invariant representation by neutralizing the spurious features based on a simple grouping strategy. Then, we learn a feature transformation to eliminate the spurious features by aligning with this bias-invariant representation. Finally, we update the classifier by incorporating the learned feature transformation and obtain an unbiased model. By integrating the aforementioned identifying, neutralizing, eliminating and updating procedures, we build an effective pipeline for mitigating spurious correlation. Experiments on image and NLP debiasing benchmarks show an improvement in worst group accuracy of more than 20% compared to standard empirical risk minimization (ERM). Codes and checkpoints are available at https://github.com/davelee-uestc/nsf_debiasing .
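
Worst-group accuracy, the metric reported above, is straightforward to compute once each sample carries a class label and a spurious-attribute label. The sketch below assumes groups are defined as (label, attribute) pairs, which is the usual convention in debiasing benchmarks; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def worst_group_accuracy(labels, spurious_attrs, predictions):
    """Return the lowest per-group accuracy and the full per-group table."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, a, p in zip(labels, spurious_attrs, predictions):
        group = (y, a)                 # group = class label x spurious attribute
        total[group] += 1
        correct[group] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Toy data: a classifier that predicts the spurious attribute, not the label.
labels         = [0, 0, 0, 0, 1, 1, 1, 1]
spurious_attrs = [0, 0, 0, 1, 1, 1, 1, 0]
predictions    = list(spurious_attrs)

worst, per_group = worst_group_accuracy(labels, spurious_attrs, predictions)
average = sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)
```

In this toy case the average accuracy looks acceptable while the two minority groups are always misclassified, which is why debiasing work reports the worst group rather than the mean.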

[158] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

Ju-Hsuan Weng,Jia-Wei Liao,Cheng-Fu Chou,Jun-Cheng Chen

Main category: cs.CV

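TL;DR: This paper introduces M-ErasureBench, the first comprehensive benchmark for evaluating multimodal concept erasure in text-to-image diffusion models, and proposes IRECE to make concept erasure more robust at inference time.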

Details

Motivation: Existing concept erasure research focuses mainly on text prompts and overlooks other input modalities that matter more and more in practical applications such as image editing and personalized generation; these modalities become attack surfaces through which erased concepts can re-emerge. Method: Introduces M-ErasureBench, which covers three input modalities (text prompts, learned embeddings, and inverted latents) and distinguishes white-box from black-box access; proposes IRECE, which localizes the target concept via cross-attention during denoising and perturbs the associated latents. Result: Experiments show that existing methods work well against text prompts but fail badly under learned embeddings and inverted latents (CRR above 90%); IRECE reduces CRR by up to 40% while preserving generation quality. Conclusion: M-ErasureBench fills the gap in evaluating multimodal concept erasure, and IRECE improves erasure robustness in difficult scenarios, offering practical safeguards for building more reliable generative models. Abstract: Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.

[159] SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation

Hasan Faraz Khan,Noor Fatima,Muzammil Behzad

Main category: cs.CV

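TL;DR: SwinTF3D is a lightweight multimodal fusion model that combines visual and linguistic representations for text-guided 3D medical image segmentation, performing well on the BTCV dataset at low computational cost.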

Details

Motivation: Existing 3D segmentation frameworks depend on large annotated datasets and lack semantic understanding, so they adapt poorly to new domains and flexible, user-defined tasks. Method: Proposes SwinTF3D, which extracts volumetric features with a transformer-based visual encoder and combines them with a lightweight text encoder through an efficient fusion mechanism, aligning text prompts semantically with spatial structures. Result: Achieves competitive Dice and IoU scores on the BTCV dataset, with good generalization and a clear advantage in computational efficiency. Conclusion: By bridging visual and linguistic understanding, SwinTF3D offers a practical, interpretable paradigm for interactive, text-driven 3D medical image segmentation. Abstract: The recent integration of artificial intelligence into medical imaging has driven remarkable advances in automated organ segmentation. However, most existing 3D segmentation frameworks rely exclusively on visual learning from large annotated datasets, restricting their adaptability to new domains and clinical tasks. The lack of semantic understanding in these models makes them ineffective in addressing flexible, user-defined segmentation objectives. To overcome these limitations, we propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. The model employs a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder via an efficient fusion mechanism. This design allows the system to understand natural-language prompts and correctly align semantic cues with their corresponding spatial structures in medical volumes, while producing accurate, context-aware segmentation results with low computational overhead. Extensive experiments on the BTCV dataset demonstrate that SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks. Bridging visual perception with linguistic understanding, SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening perspectives for more adaptive and resource-efficient solutions in clinical imaging.

[160] Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance

Haosen Li,Wenshuo Chen,Shaofeng Liang,Lei Wang,Haozhe Jia,Yutao Yue

Main category: cs.CV

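TL;DR: This paper proposes Guided Path Sampling (GPS), a method that addresses the unstable sampling path that arises when Classifier-Free Guidance (CFG) is used for iterative refinement in diffusion models. By replacing unstable extrapolation with manifold-constrained interpolation and adding a dynamic scheduling strategy, GPS keeps the sampling path stable and substantially improves generation quality and semantic consistency.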

Details

Motivation: During iterative refinement, the extrapolative nature of standard Classifier-Free Guidance (CFG) pushes the sampling path off the data manifold, so the error diverges and refinement is limited. A more stable method is needed to keep the path on the data manifold. Method: Proposes Guided Path Sampling (GPS), which replaces CFG's extrapolation with manifold-constrained interpolation and adds an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model's coarse-to-fine generation process. Result: Theory shows that GPS turns unbounded error amplification into a bounded error, guaranteeing stability; experiments show it outperforms existing methods on modern architectures such as SDXL and Hunyuan-DiT, with an ImageReward of 0.79, an HPS v2 of 0.2995, and 57.45% semantic alignment accuracy on GenEval. Conclusion: Path stability is a prerequisite for effective iterative refinement, and GPS provides a stable, efficient framework that fixes the fundamental flaw of CFG during refinement. Abstract: Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG's extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model's natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it.
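
The extrapolation the abstract refers to is visible in the standard CFG update, which combines the unconditional and conditional noise predictions with a guidance weight. The sketch below uses that widely used formula; the interpolation case is shown only to contrast the two regimes and is not the manifold-constrained rule that GPS derives. The function names and toy values are hypothetical.

```python
def guided_noise(eps_uncond, eps_cond, weight):
    """Standard CFG combination: eps_uncond + weight * (eps_cond - eps_uncond)."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def is_interpolation(weight):
    """A weight in [0, 1] yields a convex combination of the two predictions."""
    return 0.0 <= weight <= 1.0


# Toy one-dimensional predictions (hypothetical values).
eps_uncond, eps_cond = [0.0], [1.0]

extrapolated = guided_noise(eps_uncond, eps_cond, 7.5)   # typical CFG scale
interpolated = guided_noise(eps_uncond, eps_cond, 0.5)
```

With a weight above 1 the result leaves the segment between the two model predictions, so any error in their difference is scaled up by the same factor at every refinement step; with a weight in [0, 1] the result stays between them.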

[161] Hash Grid Feature Pruning

Yangzhi Ma,Bojun Liu,Jie Li,Li Li,Dong Liu

Main category: cs.CV

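TL;DR: Proposes a hash grid feature pruning method that identifies and prunes invalid features from the coordinates of the Gaussian splats, cutting storage and transmission overhead and achieving an average bitrate reduction of 8% without sacrificing model performance.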

Details

Motivation: Gaussian splats are irregularly and sparsely distributed in 3D space, so the hash grid contains many invalid feature regions, which creates redundant storage and transmission. Method: Identifies and prunes the invalid features in the hash grid from the coordinates of the input Gaussian splats and encodes only the valid features, reducing the storage size of the hash grid. Result: Under the Common Test Conditions, the average bitrate is 8% lower than the baseline while model performance is maintained. Conclusion: The method improves hash grid coding efficiency and rate-distortion performance, and applies to hash-grid-based implicit neural field learning and Gaussian splat compression. Abstract: Hash grids are widely used to learn an implicit neural field for Gaussian splatting, serving either as part of the entropy model or for inter-frame prediction. However, due to the irregular and non-uniform distribution of Gaussian splats in 3D space, numerous sparse regions exist, rendering many features in the hash grid invalid. This leads to redundant storage and transmission overhead. In this work, we propose a hash grid feature pruning method that identifies and prunes invalid features based on the coordinates of the input Gaussian splats, so that only the valid features are encoded. This approach reduces the storage size of the hash grid without compromising model performance, leading to improved rate-distortion performance. Following the Common Test Conditions (CTC) defined by the standardization committee, our method achieves an average bitrate reduction of 8% compared to the baseline approach.
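
One way to picture coordinate-driven pruning is to mark every hash-table entry that some splat can actually query and keep only those. The sketch below assumes a single-resolution grid, splat coordinates in the unit cube, and an XOR-of-primes spatial hash in the style of Instant-NGP; all of these choices, along with the function names, are assumptions for illustration rather than the paper's design.

```python
PRIMES = (1, 2654435761, 805459861)  # assumed Instant-NGP-style hash primes


def hash_index(ix, iy, iz, table_size):
    """Spatial hash of an integer grid vertex into the feature table."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size


def valid_entries(points, resolution, table_size):
    """Collect table indices touched by the 8 corners of each splat's voxel."""
    valid = set()
    for x, y, z in points:           # coordinates assumed to lie in [0, 1)
        bx, by, bz = int(x * resolution), int(y * resolution), int(z * resolution)
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    valid.add(hash_index(bx + dx, by + dy, bz + dz, table_size))
    return valid


def prune(table, valid):
    """Keep only the features that at least one splat can query."""
    return {i: table[i] for i in sorted(valid)}


# Toy setup: two nearby splats, a 4^3 grid, and a 64-entry feature table.
splats = [(0.10, 0.10, 0.10), (0.12, 0.11, 0.13)]
table = [[float(i)] for i in range(64)]
valid = valid_entries(splats, resolution=4, table_size=64)
kept = prune(table, valid)
```

Because the decoder also knows the splat coordinates, it can recompute the same valid set and does not need the pruned indices to be transmitted; that property is an assumption of this sketch.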

Kai Liu,Jungang Li,Yuchong Sun,Shengqiong Wu,Jianzhang Gao,Daoan Zhang,Wei Zhang,Sheng Jin,Sicheng Yu,Geng Zhan,Jiayi Ji,Fan Zhou,Liang Zheng,Shuicheng Yan,Hao Fei,Tat-Seng Chua

Main category: cs.CV

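TL;DR: This paper presents JavisGPT, the first unified multimodal large language model for joint audio-video comprehension and generation. It uses a concise encoder-LLM-decoder architecture with a SyncFusion module for spatio-temporal audio-video fusion, and, with a three-stage training pipeline and the self-built high-quality instruction dataset JavisInst-Omni, significantly outperforms existing MLLMs on complex, temporally synchronized tasks.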

Details

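Motivation: Existing multimodal large language models lack a unified framework for joint audio-video (JAV) comprehension and generation and struggle in particular with temporal synchronization and multimodal interaction in complex scenes, so a unified model designed for JAV is needed. Method: Proposes JavisGPT, with an encoder-LLM-decoder architecture, a SyncFusion module for spatio-temporal audio-video fusion, and synchrony-aware learnable queries that connect to a pretrained JAV-DiT generator; designs a three-stage training pipeline of multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning; and builds the JavisInst-Omni dataset of over 200K GPT-4o-curated dialogues to support training and evaluation. Result: Experiments on several JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs in complex and temporally synchronized settings, with stronger multimodal comprehension and generation. Conclusion: JavisGPT is the first multimodal large language model to unify joint audio-video comprehension and generation; its architecture, training strategy, and high-quality instruction dataset improve performance on complex temporal multimodal tasks and offer a new paradigm for future JAV systems. Abstract: This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.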

[163] ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Qihang Peng,Xuesong Chen,Chenye Yang,Shaoshuai Shi,Hongsheng Li

Main category: cs.CV

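TL;DR: This paper proposes ColaVLA, a unified vision-language-action framework that moves reasoning from text into a unified latent space and pairs it with a hierarchical parallel trajectory decoder to generate autonomous driving trajectories efficiently, accurately, and safely.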

Details

Motivation: Existing planners based on vision-language models (VLMs) suffer from a modality mismatch between text reasoning and continuous control, high latency from autoregressive reasoning, and planners that are inefficient or non-causal, all of which limit their use in real-time systems. Method: Proposes the ColaVLA framework, which comprises a Cognitive Latent Reasoner that compresses scene understanding into decision-oriented meta-action embeddings with two VLM forward passes, and a Hierarchical Parallel Planner that generates multi-scale, causality-consistent trajectories in a single forward pass. Result: Experiments on the nuScenes benchmark show that ColaVLA reaches state-of-the-art performance in both open-loop and closed-loop settings, with better inference efficiency and robustness. Conclusion: ColaVLA combines the generalization ability of VLMs with the demands of real-time control, achieving efficient, safe end-to-end driving planning while remaining interpretable. Abstract: Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

[164] Learning Where to Focus: Density-Driven Guidance for Detecting Dense Tiny Objects

Zhicheng Zhao,Xuanang Fan,Lingma Sun,Chenglong Li,Jin Tang

Main category: cs.CV

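TL;DR: This paper proposes DRMNet, a network for detecting dense tiny objects in high-resolution remote sensing imagery that improves detection through density-map-guided adaptive feature learning.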

Details

Motivation: When handling dense, heavily occluded tiny objects, existing detectors are held back by uniform allocation of computation and inefficient feature learning, and they lack a mechanism for adaptively attending to dense regions. Method: Proposes DRMNet, with three core modules: (1) a Density Generation Branch (DGB) that builds density maps as spatial priors; (2) a Dense Area Focusing Module (DAFM) that uses the density maps for efficient local-global feature interaction; and (3) a Dual Filter Fusion Module (DFFM) that combines a discrete cosine transform with density-guided cross-attention to enhance the complementarity of multi-scale features and suppress background interference. Result: In extensive experiments on the AI-TOD and DTOD datasets, DRMNet significantly outperforms existing state-of-the-art methods in scenes with high density and severe occlusion. Conclusion: By introducing density maps as explicit spatial priors, DRMNet improves the detection of dense tiny objects and is especially suited to complex, high-density remote sensing scenes. Abstract: High-resolution remote sensing imagery increasingly contains dense clusters of tiny objects, the detection of which is extremely challenging due to severe mutual occlusion and limited pixel footprints. Existing detection methods typically allocate computational resources uniformly, failing to adaptively focus on these density-concentrated regions, which hinders feature learning effectiveness. To address these limitations, we propose the Dense Region Mining Network (DRMNet), which leverages density maps as explicit spatial priors to guide adaptive feature learning. First, we design a Density Generation Branch (DGB) to model object distribution patterns, providing quantifiable priors that guide the network toward dense regions. Second, to address the computational bottleneck of global attention, our Dense Area Focusing Module (DAFM) uses these density maps to identify and focus on dense areas, enabling efficient local-global feature interaction. Finally, to mitigate feature degradation during hierarchical extraction, we introduce a Dual Filter Fusion Module (DFFM). It disentangles multi-scale features into high- and low-frequency components using a discrete cosine transform and then performs density-guided cross-attention to enhance complementarity while suppressing background interference. Extensive experiments on the AI-TOD and DTOD datasets demonstrate that DRMNet surpasses state-of-the-art methods, particularly in complex scenarios with high object density and severe occlusion.
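
The frequency disentangling step attributed to the DFFM can be sketched in one dimension: transform a signal with a discrete cosine transform, zero out one band of coefficients, and invert each band separately. The orthonormal DCT-II, the hard cutoff, and the function names are assumptions chosen for clarity; the paper's two-dimensional filters are not described in this summary.

```python
import math


def _scale(k, n):
    """Orthonormal DCT scaling factor for coefficient k of an n-point signal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(signal):
    """Orthonormal DCT-II."""
    n = len(signal)
    return [_scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(signal))
            for k in range(n)]


def idct(coeffs):
    """Inverse of the orthonormal DCT-II (a DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]


def split_frequencies(signal, cutoff):
    """Split a signal into low- and high-frequency parts at a coefficient index."""
    coeffs = dct(signal)
    low = [c if k < cutoff else 0.0 for k, c in enumerate(coeffs)]
    high = [c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)]
    return idct(low), idct(high)


feature_row = [1.0, 2.0, 3.0, 4.0]          # hypothetical feature values
low_part, high_part = split_frequencies(feature_row, cutoff=1)
```

Because the transform is linear, the two parts always sum back to the original signal; with `cutoff=1` the low-frequency part is just the mean, and the high-frequency part carries all of the variation.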

[165] CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision

Behnam Raoufi,Hossein Sharify,Mohamad Mahdee Ramezanee,Khosrow Hajsadeghi,Saeed Bagheri Shouraki

Main category: cs.CV

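TL;DR: Proposes CLIP-Joint-Detect, a detection framework that adds CLIP-style contrastive vision-language supervision, combining an InfoNCE loss with a cross-entropy loss, as a general and efficient enhancement for one-stage and two-stage detectors, with significant gains on Pascal VOC and MS COCO.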

Details

Motivation: Conventional object detectors rely on cross-entropy classification, which is sensitive to class imbalance and label noise and lacks a semantically rich supervision signal, so a more robust and general form of classification supervision is needed. Method: Designs a lightweight parallel head that projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings, optimized with an InfoNCE contrastive loss and an auxiliary cross-entropy loss while all detection losses are trained jointly end to end. Result: Achieves consistent and significant improvements on Pascal VOC 2007+2012 (Faster R-CNN) and MS COCO 2017 (YOLOv11) while keeping real-time inference speed; ablations confirm the value of learnable text embeddings and joint optimization. Conclusion: CLIP-Joint-Detect is a simple, detector-agnostic framework that uses contrastive vision-language learning to strengthen classification in object detection, improving closed-set detection performance with good generality and practicality. Abstract: Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
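
The alignment between a projected region feature and class text embeddings can be written as an InfoNCE-style loss: a softmax over temperature-scaled cosine similarities with the true class as the positive. The sketch below treats the other classes' text embeddings as the negatives and uses a temperature of 0.07; both choices, the function names, and the toy vectors are assumptions for illustration and do not reproduce the paper's exact loss.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrastive_class_loss(feature, text_embeddings, target, temperature=0.07):
    """Negative log-softmax of the target class over scaled cosine similarities."""
    f = normalize(feature)
    logits = [sum(a * b for a, b in zip(f, normalize(t))) / temperature
              for t in text_embeddings]
    peak = max(logits)                      # subtract the max for stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


# Hypothetical two-class example in a two-dimensional embedding space.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
region_feature = [0.9, 0.1]

aligned_loss = contrastive_class_loss(region_feature, text_embeddings, target=0)
misaligned_loss = contrastive_class_loss(region_feature, text_embeddings, target=1)
```

In a joint training setup this term would be added to the detector's usual classification and box regression losses, so the contrastive head shapes the features without replacing the original objective.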

[166] Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object Detection

Runwei Guan,Jianan Liu,Shaofeng Liang,Fangqiang Ding,Shanliang Yao,Xiaokai Bai,Daizong Liu,Tao Huang,Guoqiang Mao,Hui Xiong

Main category: cs.CV

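TL;DR: Proposes WRCFormer, a new 3D object detection framework that fuses raw 4D mmWave radar cubes with camera data using a Wavelet Attention Module and a geometry-guided progressive fusion mechanism, achieving state-of-the-art performance on the K-Radar benchmark with notably stronger robustness in adverse weather.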

Details

Motivation: 4D mmWave radar is low-cost and robust in all weather, but its sparse point clouds and limited semantic information constrain perception. Existing methods either lose information when processing radar data or incur excessive computational cost, so an efficient way to fuse radar and visual information is needed. Method: Proposes the WRCFormer framework, which uses a Wavelet Attention Module to build a wavelet-based Feature Pyramid Network (FPN) that strengthens sparse radar and image feature representations, and designs a two-stage, modality-agnostic Geometry-guided Progressive Fusion mechanism that achieves efficient cross-modal fusion through multi-view radar cube representations. Result: On the K-Radar benchmark, WRCFormer surpasses the best model by about 2.4% across all scenarios and by 1.6% in the sleet scenario, confirming its effectiveness and robustness in difficult weather. Conclusion: By fusing raw 4D radar data with camera input, WRCFormer overcomes the information loss and computational overhead of earlier methods and markedly improves 3D object detection, making it particularly suitable for autonomous driving and robot perception in adverse weather. Abstract: 4D millimeter-wave (mmWave) radar has been widely adopted in autonomous driving and robot perception due to its low cost and all-weather robustness. However, its inherent sparsity and limited semantic richness significantly constrain perception capability. Recently, fusing camera data with 4D radar has emerged as a promising cost-effective solution, by exploiting the complementary strengths of the two modalities. Nevertheless, point-cloud-based radar often suffers from information loss introduced by multi-stage signal processing, while directly utilizing raw 4D radar data incurs prohibitive computational costs. To address these challenges, we propose WRCFormer, a novel 3D object detection framework that fuses raw radar cubes with camera inputs via multi-view representations of the decoupled radar cube. Specifically, we design a Wavelet Attention Module as the basic module of a wavelet-based Feature Pyramid Network (FPN) to enhance the representation of sparse radar signals and image data. We further introduce a two-stage query-based, modality-agnostic fusion mechanism termed Geometry-guided Progressive Fusion to efficiently integrate multi-view features from both modalities. Extensive experiments demonstrate that WRCFormer achieves state-of-the-art performance on the K-Radar benchmarks, surpassing the best model by approximately 2.4% in all scenarios and 1.6% in the sleet scenario, highlighting its robustness under adverse weather conditions.

[167] YOLO-IOD: Towards Real Time Incremental Object Detection

Shizhou Zhang,Xueqiang Lv,Yinghui Xing,Qirui Wu,Di Xu,Chen Zhao,Yanning Zhang

Main category: cs.CV

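TL;DR: Proposes YOLO-IOD, a real-time incremental object detection framework built on YOLO-World that mitigates catastrophic forgetting by resolving three kinds of knowledge conflict (foreground-background confusion, parameter interference, and misaligned knowledge distillation), and introduces the new LoCo COCO benchmark to validate its performance.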

Details

Motivation: Existing incremental object detection methods are mainly built on Faster R-CNN or DETR-series detectors and do not support real-time YOLO frameworks. In YOLO, several knowledge conflicts cause catastrophic forgetting and need targeted solutions. Method: Proposes the YOLO-IOD framework with three core components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates foreground-background confusion; 2) Importance-based Kernel Selection (IKS), which reduces parameter interference; 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which resolves misaligned distillation. It also adopts a stage-wise parameter-efficient fine-tuning strategy. Result: Experiments on conventional COCO and the newly proposed LoCo COCO benchmark show that YOLO-IOD markedly reduces forgetting while maintaining high performance, outperforming existing methods. Conclusion: YOLO-IOD resolves the key conflicts that YOLO-series models face in incremental learning and achieves high-performance, low-forgetting real-time incremental object detection, advancing practical use in this direction. Abstract: Current methods for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient fine-tuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importance-based Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.

[168] RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance

Chunyuan Chen,Yunuo Cai,Shujuan Li,Weiyun Liang,Bin Wang,Jing Xu

Main category: cs.CV

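TL;DR: Proposes RealCamo, a unified out-painting-based framework that generates more realistic camouflaged images through layout controls and multi-modal conditioning.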

Details

Motivation: Existing camouflaged image generation methods fall short in visual similarity and semantic consistency and struggle to approach real camouflage scenes. Method: Adopts an out-painting-based framework, introduces layout controls to regulate global structure, and builds a multi-modal condition that combines fine-grained textual descriptions with texture-oriented background retrieval to guide generation. Result: The generated images outperform existing methods in semantic consistency and visual realism, and a new metric is proposed to quantify camouflage quality. Conclusion: RealCamo generates structurally plausible, visually realistic camouflaged images and substantially narrows the gap between synthetic and real camouflaged imagery. Abstract: Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose RealCamo, a unified out-painting-based framework for realistic camouflaged image generation. RealCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multi-modal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.

Huiming Yang,Linglin Liao,Fei Ding,Sibo Wang,Zijian Zeng

Main category: cs.CV

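TL;DR: Proposes PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed for high-speed motion, and introduces MoCapCube6D, a new multi-modal dataset for evaluating performance under rapid motion.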

Details

Motivation: Existing 6DoF pose estimation methods perform poorly in high-speed and low-light scenarios, especially where standard RGB cameras are limited by motion blur, so a more robust solution is needed. Method: Proposes the PoseStreamer framework with three core components: an Adaptive Pose Memory Queue, an Object-centric 2D Tracker, and a Ray Pose Filter for geometric refinement along camera rays, combined with the high temporal resolution of event cameras. Result: Experiments show that PoseStreamer achieves higher accuracy in high-speed motion scenarios and generalizes well to unseen objects, while the newly proposed MoCapCube6D dataset supports performance testing under rapid motion. Conclusion: PoseStreamer is an efficient, general multi-modal 6DoF pose estimation method that performs well in high-speed and complex environments and advances the use of event cameras in pose estimation. Abstract: Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in high-speed object moving scenarios. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.

[170] Spatial-aware Symmetric Alignment for Text-guided Medical Image Segmentation

Linglin Liao,Qichuan Geng,Yu Liu

Main category: cs.CV

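TL;DR: Proposes the Spatial-aware Symmetric Alignment (SSA) framework, which improves medical image segmentation guided by hybrid clinical text (locational, descriptive, and diagnostic information) through a bi-directional optimal transport alignment mechanism and a composite directional guidance strategy, achieving state-of-the-art performance on public benchmarks, particularly for lesions characterized by spatial relational constraints.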

Details

Motivation: Existing methods struggle to handle diagnostic and descriptive text at the same time and fail to capture the positional constraints in the text, which leads to deviations in the segmentation, for example covering both lungs when the text says "left lower lung". Method: Proposes a symmetric optimal transport alignment mechanism to strengthen fine-grained cross-modal associations between image regions and multiple textual expressions, and designs a composite directional guidance strategy that explicitly introduces the spatial constraints in the text by constructing region-level guidance masks. Result: In extensive experiments on several public datasets, SSA clearly outperforms existing methods on lesion segmentation with spatial relational constraints and reaches state-of-the-art performance. Conclusion: The SSA framework addresses the problems of multi-type text understanding and inaccurate spatial localization in current text-guided medical image segmentation and improves segmentation accuracy, with application potential especially under complex clinical text guidance. Abstract: Text-guided Medical Image Segmentation has shown considerable promise for medical image segmentation, with rich clinical text serving as an effective supplement for scarce data. However, current methods have two key bottlenecks. On one hand, they struggle to process diagnostic and descriptive texts simultaneously, making it difficult to identify lesions and establish associations with image regions. On the other hand, existing approaches focus on lesion descriptions and fail to capture positional constraints, leading to critical deviations. Specifically, with the text "in the left lower lung", the segmentation results may incorrectly cover both sides of the lung. To address the limitations, we propose the Spatial-aware Symmetric Alignment (SSA) framework to enhance the capacity of referring hybrid medical texts consisting of locational, descriptive, and diagnostic information. Specifically, we propose a symmetric optimal transport alignment mechanism to strengthen the associations between image regions and multiple relevant expressions, which establishes bi-directional fine-grained multimodal correspondences. In addition, we devise a composite directional guidance strategy that explicitly introduces spatial constraints in the text by constructing region-level guidance masks. Extensive experiments on public benchmarks demonstrate that SSA achieves state-of-the-art (SOTA) performance, particularly in accurately segmenting lesions characterized by spatial relational constraints.

[171] Reverse Personalization

Han-Wei Kung,Tuomas Varanka,Nicu Sebe

Main category: cs.CV

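TL;DR: Proposes a reverse personalization framework based on conditional diffusion inversion that anonymizes faces without text prompts and supports attribute-controllable anonymization.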

Details

Motivation: Existing prompt-based face anonymization methods either depend on the subject being well represented in the pre-trained model or require fine-tuning for specific identities, so they generalize poorly to individuals outside the training data and offer no control over facial attributes. Method: By analyzing the identity generation process, the work introduces conditional diffusion inversion together with an identity-guided conditioning branch, operating directly on images to anonymize faces without text prompts. Result: The method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality, generalizes to subjects outside the training data, and supports attribute-controllable anonymization. Conclusion: The proposed reverse personalization framework addresses the limits of existing methods in generalization and attribute control and offers a new approach to controllable face anonymization. Abstract: Recent text-to-image diffusion models have demonstrated remarkable generation of realistic facial images conditioned on textual prompts and human identities, enabling the creation of personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features either rely on the subject being well-represented in the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process and introduce a reverse personalization framework for face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without using text prompts. To generalize beyond subjects in the model's training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack control over facial attributes, our framework supports attribute-controllable anonymization. We demonstrate that our method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality. Source code and data are available at https://github.com/hanweikung/reverse-personalization .

[172] A Low-Cost UAV Deep Learning Pipeline for Integrated Apple Disease Diagnosis, Freshness Assessment, and Fruit Detection

Soham Dutta,Soham Banerjee,Sneha Mahata,Anindya Sen,Sayantani Datta

Main category: cs.CV

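TL;DR: Presents a unified orchard management pipeline built on a low-cost RGB UAV that integrates several deep learning models for leaf disease detection, apple freshness assessment, and real-time apple detection and localization, runs offline inference on embedded devices, and reports high accuracy in experiments.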

Details

Motivation: Existing UAV systems usually handle orchard management tasks in isolation and depend on expensive multispectral sensors, which makes them costly and hard to adopt widely, so a low-cost, integrated solution is needed. Method: Uses ResNet50 for leaf disease detection, VGG16 for apple freshness classification, and YOLOv8 for apple detection and localization. The whole system is deployed on an ESP32-CAM and a Raspberry Pi, uses only RGB images, and runs fully offline. Result: Experiments show 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and an F1 score of 0.857 for apple detection. Conclusion: The framework offers an accessible and scalable alternative to multispectral UAV solutions and supports practical precision agriculture on low-cost hardware. Abstract: Apple orchards require timely disease detection, fruit quality assessment, and yield estimation, yet existing UAV-based systems address such tasks in isolation and often rely on costly multispectral sensors. This paper presents a unified, low-cost RGB-only UAV-based orchard intelligence pipeline integrating ResNet50 for leaf disease detection, VGG16 for apple freshness determination, and YOLOv8 for real-time apple detection and localization. The system runs on an ESP32-CAM and Raspberry Pi, providing fully offline on-site inference without cloud support. Experiments demonstrate 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection. The framework provides an accessible and scalable alternative to multispectral UAV solutions, supporting practical precision agriculture on affordable hardware.

[173] OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding

Wenyuan Huang,Zhao Wang,Zhou Wei,Ting Huang,Fang Zhao,Jian Yang,Zhenyu Zhang

Main category: cs.CV

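TL;DR: Proposes OpenGround, a zero-shot framework for open-world 3D visual grounding whose Active Cognition-based Reasoning (ACR) module overcomes the limits of the conventional pre-defined Object Lookup Table (OLT), and introduces a new dataset, OpenTarget, for evaluation.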

Details

Motivation: Existing methods rely on a pre-defined Object Lookup Table (OLT), which limits their use in scenarios with undefined or unforeseen targets, so a 3D visual grounding method that works in the open world is needed. Method: Proposes the OpenGround framework, centered on the Active Cognition-based Reasoning (ACR) module, which mimics human perception through a cognitive task chain and dynamically updates the OLT to extend the cognitive scope of the VLM, supporting both pre-defined and open-world categories. Result: Competitive performance on Nr3D, state-of-the-art results on ScanRefer, and a substantial 17.6% improvement on the newly proposed OpenTarget dataset. Conclusion: OpenGround removes the dependence of 3D visual grounding on a pre-defined OLT and is suited to open-world scenarios, with good performance and application potential. Abstract: 3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at [this https URL](https://why-102.github.io/openground.io/).

[174] With Great Context Comes Great Prediction Power: Classifying Objects via Geo-Semantic Scene Graphs

Ciprian Constantinescu,Marius Leordeanu

Main category: cs.CV

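TL;DR: Proposes a new framework that builds a Geo-Semantic Contextual Graph (GSCG) from a single monocular image for context-aware object classification, integrating geometric, semantic, and material information to raise classification accuracy substantially.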

Details

Motivation: Most existing object recognition systems ignore scene context, whereas humans rely on rich context for recognition, so an interpretable model that exploits context is needed. Method: Builds the GSCG by combining monocular depth estimation with panoptic and material segmentation models, representing objects as nodes with geometric, color, and material attributes and their relationships as edges, and designs a graph classifier that fuses local neighborhood and global scene features for classification. Result: On the COCO dataset the method reaches 73.4% classification accuracy, far above context-agnostic models (as low as 38.4%), fine-tuned ResNet models (at most 53.5%), and the multimodal large model Llama 4 Scout (42.3%). Conclusion: Explicitly modeling structured context substantially improves object classification, and the proposed GSCG framework is interpretable, confirming the key role of context in visual recognition. Abstract: Humans effortlessly identify objects by leveraging a rich understanding of the surrounding scene, including spatial relationships, material properties, and the co-occurrence of other objects. In contrast, most computational object recognition systems operate on isolated image regions, devoid of meaning in isolation, thus ignoring this vital contextual information. This paper argues for the critical role of context and introduces a novel framework for contextual object classification. We first construct a Geo-Semantic Contextual Graph (GSCG) from a single monocular image. This rich, structured representation is built by integrating a metric depth estimator with a unified panoptic and material segmentation model. The GSCG encodes objects as nodes with detailed geometric, chromatic, and material attributes, and their spatial relationships as edges. This explicit graph structure makes the model's reasoning process inherently interpretable. We then propose a specialized graph-based classifier that aggregates features from a target object, its immediate neighbors, and the global scene context to predict its class. Through extensive ablation studies, we demonstrate that our context-aware model achieves a classification accuracy of 73.4%, dramatically outperforming context-agnostic versions (as low as 38.4%). Furthermore, our GSCG-based approach significantly surpasses strong baselines, including fine-tuned ResNet models (max 53.5%) and a state-of-the-art multimodal Large Language Model (LLM), Llama 4 Scout, which, even when given the full image alongside a detailed description of objects, maxes out at 42.3%. These results on COCO 2017 train/val splits highlight the superiority of explicitly structured and interpretable context for object recognition tasks.

[175] An Architecture-Led Hybrid Report on Body Language Detection Project

Thomson Tong,Diba Darooneh

Main category: cs.CV

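TL;DR: Analyzes the architectures of two modern vision-language models (Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct) and maps their properties to a video-to-artifact pipeline, emphasizing the relationship between structured outputs and system constraints.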

Details

Motivation: To understand how VLM architecture shapes behavior in practical applications, in particular the gap between the semantic correctness of generated structured outputs and the system design. Method: Through architecture-level analysis, the report summarizes the multimodal foundation shared by the two models (visual tokenization, Transformer attention, and instruction following) and describes each architecture in enough detail to support engineering decisions. Result: It clarifies key distinctions between model behavior and system constraints: schema validation checks only structure, not geometric correctness; person identifiers are frame-local; and single-frame analysis returns free-form text. Conclusion: These distinctions are essential for writing defensible technical claims, designing robust interfaces, and planning evaluation. Abstract: This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
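The distinction this report draws between structural and geometric validity can be made concrete with a small sketch. The field names (`person_id`, `emotion`, `bbox`) and the `[x1, y1, x2, y2]` box layout are hypothetical and are not taken from the BodyLanguageDetection repository; the point is only that a detection can pass a schema-style check and still describe an impossible box.

```python
def is_structurally_valid(det):
    """Schema-style check: required keys exist and have the expected types.
    It says nothing about whether the numbers make sense as a box."""
    if not isinstance(det, dict):
        return False
    box = det.get("bbox")
    return (
        isinstance(det.get("person_id"), int)
        and isinstance(det.get("emotion"), str)
        and isinstance(box, list)
        and len(box) == 4
        and all(isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in box)
    )

def is_geometrically_plausible(det, width, height):
    """Geometric check: corners are ordered and the box lies inside the frame.
    Assumes det already passed is_structurally_valid."""
    x1, y1, x2, y2 = det["bbox"]
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height

# A detection that a schema validator would accept but that has
# inverted corners (x1 > x2, y1 > y2) on a 640x480 frame.
suspicious = {"person_id": 1, "emotion": "neutral", "bbox": [500, 400, 100, 50]}
well_formed = {"person_id": 2, "emotion": "happy", "bbox": [10, 20, 110, 220]}
```

Running both checks in sequence is one plausible way to narrow the gap: the first rejects malformed output, the second flags output that is well-formed but cannot be drawn on the frame. Neither check establishes that the box actually contains a person.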

[176] Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

Yi Zhou,Xuechao Zou,Shun Zhang,Kai Li,Shiying Wang,Jingming Chen,Congyan Lang,Tengfei Cao,Pin Tao,Yuanchun Shi

Main category: cs.CV

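TL;DR: Proposes Co2S, a semi-supervised semantic segmentation framework for remote sensing images that fuses priors from vision-language models (such as CLIP) and self-supervised models (such as DINOv3) to mitigate pseudo-label drift and error accumulation, with strong results on multiple datasets.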

Details

Motivation: Semi-supervised learning for remote sensing image segmentation suffers from pseudo-label drift and from error accumulation caused by confirmation bias. Existing methods struggle to train stably and keep semantics consistent, so a more robust framework is needed. Method: Proposes the Co2S framework with a heterogeneous dual-student ViT-based architecture whose two students are initialized from CLIP and DINOv3 respectively. It introduces an explicit-implicit semantic co-guidance mechanism that uses text embeddings and learnable queries for class-level guidance, and a global-local feature collaborative fusion strategy that combines the global context of CLIP with the local detail of DINOv3. Result: In extensive experiments on six mainstream remote sensing datasets, Co2S achieves leading performance across partition protocols and scenarios, clearly outperforming existing semi-supervised methods. Conclusion: By fusing multi-source priors with a co-guidance mechanism, Co2S suppresses pseudo-label drift and improves the stability and accuracy of semi-supervised remote sensing segmentation, with strong generalization and application potential. Abstract: Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.

[177] 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

Ryousuke Yamada,Kohsuke Ide,Yoshihiro Fukuhara,Hirokatsu Kataoka,Gilles Puy,Andrei Bursuc,Yuki M. Asano

Main category: cs.CV

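TL;DR: Proposes LAM3C, a self-supervised framework that learns 3D representations from point clouds generated from unlabeled videos and, without any real 3D scans, outperforms previous self-supervised methods on indoor semantic and instance segmentation.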

Details

Motivation: Collecting large-scale real 3D scene scans is expensive and time-consuming, so the work explores whether 3D representations can be learned from unlabeled videos alone. Method: Proposes the LAM3C framework, builds the RoomTours dataset from room-walkthrough videos collected from the web, and generates point clouds with a feed-forward reconstruction model. It also introduces a noise-regularized loss that promotes local geometric smoothness and feature stability. Result: Without using any real 3D scans, LAM3C outperforms previous self-supervised methods on indoor semantic and instance segmentation. Conclusion: Unlabeled videos are an abundant and viable data source for 3D self-supervised learning. Abstract: Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from point clouds generated from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.
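The method name mentions Sinkhorn-Knopp, but the abstract does not say how it is applied. As a generic illustration only, the sketch below shows the balanced-assignment use of Sinkhorn-Knopp that is common in clustering-based self-supervised learning: point-to-cluster scores are turned into soft assignments in which every cluster receives roughly equal total mass. The function name, `epsilon`, and the iteration count are assumptions, and the sketch assumes scores of moderate range so that no exponent underflows to zero.

```python
import math

def sinkhorn_knopp(scores, epsilon=0.05, n_iters=50):
    """Turn an N x K score matrix into soft assignments with balanced clusters.

    Alternates column and row normalization of exp(scores / epsilon).
    Returns rows that each sum to 1 (a distribution over clusters)."""
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster gets total mass 1/K.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every point carries total mass 1/N.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n) for v in row] for i, row in enumerate(q)]
    # Rescale so each row sums to 1 instead of 1/N.
    return [[v * n for v in row] for row in q]

# Four points that all prefer cluster 0 on raw scores; balancing forces
# about half of the total assignment mass onto cluster 1.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
assignments = sinkhorn_knopp(scores, epsilon=0.5)
```

The balancing constraint is what prevents the degenerate solution in which every point collapses into a single cluster; relative preferences are preserved, so the first point still leans more toward cluster 0 than the last.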

[178] Video-BrowseComp: Benchmarking Agentic Video Research on Open Web

Zhengyang Liang,Yan Shu,Xiangrui Liu,Minghao Qin,Kaixin Liang,Paolo Rota,Nicu Sebe,Zheng Liu,Lizi Liao

Main category: cs.CV

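TL;DR: Presents Video-BrowseComp, the first benchmark for agentic video reasoning on the open web, which evaluates an agent's ability to actively retrieve and reason over the dynamic video modality and shows that existing models rely heavily on textual metadata and degrade sharply in visually dense scenarios.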

Details

Motivation: Existing video benchmarks focus mostly on passive perception and cannot evaluate an agent's ability to actively explore video timelines, cross-check sources, and reason from visual evidence on the open web, leaving a modality gap. Method: Builds the Video-BrowseComp benchmark of 210 questions that must be answered from temporal visual evidence and cannot be solved by text search alone, and evaluates advanced models on open-web video research tasks. Result: The most advanced search-augmented models (such as GPT-5.1 w/ Search) reach only 15.24% accuracy. Models do better in metadata-rich domains (such as TV shows) but degrade sharply in metadata-sparse, dynamic scenarios (such as sports and gameplay). Conclusion: Video-BrowseComp fills the gap in benchmarks for active video reasoning, pushes agents from passive perception toward active research grounded in visual evidence, and exposes a fundamental bottleneck of current models in dynamic visual understanding. Abstract: The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present Video-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.

[179] MedSAM-based lung masking for multi-label chest X-ray classification

Brayden Miao,Zain Rehman,Xin Miao,Siming Liu,Jianjie Wang

Main category: cs.CV

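TL;DR: Proposes a MedSAM segmentation-guided chest X-ray classification pipeline that introduces anatomical priors to improve robustness and interpretability, and finds that different masking strategies involve a task-dependent and architecture-dependent trade-off between abnormality classification and normal-case screening.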

Details

Motivation: Automated chest X-ray interpretation is difficult because of weak disease signals, dataset bias, and limited spatial supervision. Existing methods lack sufficient anatomical guidance, which hurts classification performance and clinical applicability. Method: Uses MedSAM as a lung region extraction module: images are first segmented to obtain lung masks, and CNN models such as ResNet are then trained on the masked images for multi-label abnormality classification. Loose and tight masking strategies are compared, and classification of five abnormalities plus the normal case is evaluated on the public NIH CXR dataset. Result: MedSAM produces anatomically plausible lung masks. Loose masking keeps overall AUROC while significantly improving recognition of normal cases; tight masking lowers abnormality classification performance but improves training efficiency. The effect of each masking strategy depends on the network architecture and the task objective. Conclusion: Lung masking should be treated as an adjustable spatial prior, chosen to suit the backbone network and the clinical objective rather than applied uniformly, to balance abnormality detection against normal-case screening. Abstract: Chest X-ray (CXR) imaging is widely used for screening and diagnosing pulmonary abnormalities, yet automated interpretation remains challenging due to weak disease signals, dataset bias, and limited spatial supervision. Foundation models for medical image segmentation (MedSAM) provide an opportunity to introduce anatomically grounded priors that may improve robustness and interpretability in CXR analysis. We propose a segmentation-guided CXR classification pipeline that integrates MedSAM as a lung region extraction module prior to multi-label abnormality classification. MedSAM is fine-tuned using a public image-mask dataset from Airlangga University Hospital. We then apply it to a curated subset of the public NIH CXR dataset to train and evaluate deep convolutional neural networks for multi-label prediction of five abnormalities (Mass, Nodule, Pneumonia, Edema, and Fibrosis), with the normal case (No Finding) evaluated via a derived score. Experiments show that MedSAM produces anatomically plausible lung masks across diverse imaging conditions. We find that masking effects are both task-dependent and architecture-dependent. ResNet50 trained on original images achieves the strongest overall abnormality discrimination, while loose lung masking yields comparable macro AUROC but significantly improves No Finding discrimination, indicating a trade-off between abnormality-specific classification and normal case screening. Tight masking consistently reduces abnormality-level performance but improves training efficiency. Loose masking partially mitigates this degradation by preserving perihilar and peripheral context. These results suggest that lung masking should be treated as a controllable spatial prior selected to match the backbone and clinical objective, rather than applied uniformly.
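The abstract does not state how the loose and tight masks are built. One plausible construction, shown here purely as an assumption, is to use the segmentation output directly as the tight mask and to dilate it by a margin to obtain the loose mask, so that some surrounding context survives. The function names and the `margin` parameter are hypothetical.

```python
def dilate(mask, radius):
    """Grow a binary mask by `radius` pixels using a square neighborhood."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out

def mask_image(image, lung_mask, margin=0, fill=0):
    """Keep pixels inside the (optionally dilated) lung mask, fill the rest.

    margin=0 corresponds to a tight mask; margin>0 gives a loose mask that
    retains a band of context around the segmented region."""
    region = dilate(lung_mask, margin)
    return [
        [p if m else fill for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, region)
    ]

# Toy 5x5 image with constant intensity 9 and a one-pixel "lung" in the center.
image = [[9] * 5 for _ in range(5)]
lung = [[1 if (y, x) == (2, 2) else 0 for x in range(5)] for y in range(5)]
tight = mask_image(image, lung, margin=0)
loose = mask_image(image, lung, margin=1)
```

In the toy example the tight mask keeps a single pixel while the loose mask keeps the surrounding 3x3 block, which mirrors the idea that a looser prior preserves context near the lung boundary at the cost of admitting more non-lung pixels.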

[180] PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion

Jian Wang,Sixing Rong,Jiarui Xing,Yuling Xu,Weide Liu

Main category: cs.CV

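TL;DR: Presents PathoSyn, a unified generative framework for MRI image synthesis that disentangles pathological deviation on a stable anatomical manifold, addressing the feature entanglement and structural discontinuities of existing models.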

Details

Motivation: Existing MRI generative models usually operate in the global pixel domain or depend on binary masks, which tends to corrupt anatomical structure and entangle features, so a new method is needed that preserves structural integrity while rendering local pathological changes realistically. Method: PathoSyn decomposes synthesis into deterministic anatomical reconstruction and stochastic deviation modeling. Its core is a Deviation-Space Diffusion Model that learns the conditional distribution of pathological residuals, combined with a seam-aware fusion strategy and an inference-time stabilization module to ensure spatial coherence. Result: Quantitative and qualitative evaluations on tumor imaging benchmarks show that PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in perceptual realism and anatomical fidelity. Conclusion: PathoSyn provides a mathematically principled framework for generating high-fidelity, patient-specific synthetic MRI data, supporting diagnostic algorithm development in low-data settings, interpretable disease progression modeling, and benchmarking of clinical decision-support systems. Abstract: We present PathoSyn, a unified generative framework for Magnetic Resonance Imaging (MRI) image synthesis that reformulates imaging-pathology as a disentangled additive deviation on a stable anatomical manifold. Current generative models typically operate in the global pixel domain or rely on binary masks; these paradigms often suffer from feature entanglement, leading to corrupted anatomical substrates or structural discontinuities. PathoSyn addresses these limitations by decomposing the synthesis task into deterministic anatomical reconstruction and stochastic deviation modeling. Central to our framework is a Deviation-Space Diffusion Model designed to learn the conditional distribution of pathological residuals, thereby capturing localized intensity variations while preserving global structural integrity by construction. To ensure spatial coherence, the diffusion process is coupled with a seam-aware fusion strategy and an inference-time stabilization module, which collectively suppress boundary artifacts and produce high-fidelity internal lesion heterogeneity. PathoSyn provides a mathematically principled pipeline for generating high-fidelity patient-specific synthetic datasets, facilitating the development of robust diagnostic algorithms in low-data regimes. By allowing interpretable counterfactual disease progression modeling, the framework supports precision intervention planning and provides a controlled environment for benchmarking clinical decision-support systems. Quantitative and qualitative evaluations on tumor imaging benchmarks demonstrate that PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in both perceptual realism and anatomical fidelity. The source code of this work will be made publicly available.

[181] Domain-Shift Immunity in Deep Deformable Registration via Local Feature Representations

Mingzhen Shao,Sarang Joshi

Main category: cs.CV

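TL;DR: Introduces UniReg, a universal image registration framework, and shows that deep-learning-based deformable registration models are inherently robust to domain shift because they rely on local feature representations rather than global appearance.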

Details

Motivation: Learning-based models are widely considered sensitive to domain shift. This work aims to uncover the mechanism behind their robustness and to improve registration in cross-domain and multi-modal settings. Method: Proposes the UniReg framework, which decouples feature extraction from deformation estimation by using fixed, pre-trained feature extractors and a UNet deformation network, and generalizes across domains after training on a single dataset. Result: UniReg performs robustly on cross-domain and multi-modal tasks, on par with traditional optimization-based methods. The analysis shows that failures of CNN models under modality shift stem from dataset-induced biases in early convolutional layers. Conclusion: Local feature consistency is the core of robustness in learning-based registration models, and backbones should be designed to preserve domain-invariant local features. Abstract: Deep learning has advanced deformable image registration, surpassing traditional optimization-based methods in both accuracy and efficiency. However, learning-based models are widely believed to be sensitive to domain shift, with robustness typically pursued through large and diverse training datasets, without explaining the underlying mechanisms. In this work, we show that domain-shift immunity is an inherent property of deep deformable registration models, arising from their reliance on local feature representations rather than global appearance for deformation estimation. To isolate and validate this mechanism, we introduce UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network. Despite training on a single dataset, UniReg exhibits robust cross-domain and multi-modal performance comparable to optimization-based methods. Our analysis further reveals that failures of conventional CNN-based models under modality shift originate from dataset-induced biases in early convolutional layers. These findings identify local feature consistency as the key driver of robustness in learning-based deformable registration and motivate backbone designs that preserve domain-invariant local features.

[182] GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection

Jingyu Li,Xiaolong Zhao,Zhe Liu,Wenxiao Wu,Li Zhang

Main category: cs.CV

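TL;DR: This paper proposes GeoTeacher, a semi-supervised 3D object detection method that strengthens the student model's understanding of geometric structure through keypoint-based geometric relation supervision and voxel-wise data augmentation, achieving state-of-the-art performance on the ONCE and Waymo datasets.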

Details

Motivation: Existing semi-supervised 3D detection methods overlook that models have low sensitivity to object geometry when labeled data is limited, which makes geometric information hard to capture and limits the student model's perception and localization ability. Method: Proposes GeoTeacher, which includes a keypoint-based geometric relation supervision module that transfers the teacher model's geometric knowledge to the student, and a voxel-wise data augmentation strategy with a distance-decay mechanism that increases the diversity of object geometries while preserving the integrity of distant objects. Result: Achieves state-of-the-art performance on the ONCE and Waymo datasets and can be integrated into various semi-supervised 3D detection frameworks, significantly improving their detection results. Conclusion: GeoTeacher effectively strengthens the student model's modeling of geometric relations with few labels, improves the performance and generalization of 3D object detection, and offers a new direction for semi-supervised 3D detection. Abstract: Semi-supervised 3D object detection, aiming to explore unlabeled data for boosting 3D object detectors, has emerged as an active research area in recent years. Some previous methods have shown substantial improvements by either employing heterogeneous teacher models to provide high-quality pseudo labels or enforcing feature-perspective consistency between the teacher and student networks. However, these methods overlook the fact that the model usually tends to exhibit low sensitivity to object geometries with limited labeled data, making it difficult to capture geometric information, which is crucial for enhancing the student model's ability in object perception and localization. In this paper, we propose GeoTeacher to enhance the student model's ability to capture geometric relations of objects with limited training data, especially unlabeled data. We design a keypoint-based geometric relation supervision module that transfers the teacher model's knowledge of object geometry to the student, thereby improving the student's capability in understanding geometric relations. Furthermore, we introduce a voxel-wise data augmentation strategy that increases the diversity of object geometries, thereby further improving the student model's ability to comprehend geometric structures. To preserve the integrity of distant objects during augmentation, we incorporate a distance-decay mechanism into this strategy. Moreover, GeoTeacher can be combined with different SS3D methods to further improve their performance. 
Extensive experiments on the ONCE and Waymo datasets demonstrate the effectiveness and generalization of our method, which achieves new state-of-the-art results. Code will be available at https://github.com/SII-Whaleice/GeoTeacher

[183] REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

Fulin Shi,Wenyi Xiao,Bin Chen,Liang Din,Leilei Gan

Main category: cs.CV

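TL;DR: Proposes REVEALER, a fine-grained text-to-image alignment evaluation framework based on reinforcement-guided visual reasoning. It follows a "grounding-reasoning-conclusion" paradigm, uses multimodal large language models for interpretable element-level alignment evaluation, and reaches SOTA performance on multiple benchmarks.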

Details

Motivation: Existing text-to-image evaluation methods mostly rely on coarse-grained metrics or static QA pipelines, lack fine-grained interpretability, and struggle to reflect human preferences, so a finer and more interpretable evaluation mechanism is needed. Method: Proposes the REVEALER framework, which follows a "grounding-reasoning-conclusion" paradigm and uses Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and make interpretable alignment judgments; the model is optimized with Group Relative Policy Optimization (GRPO) and a composite reward covering structural format, grounding accuracy, and alignment accuracy. Result: Experiments on four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) show that REVEALER outperforms strong proprietary models and supervised baselines, with higher inference efficiency than existing iterative visual reasoning methods. Conclusion: REVEALER delivers efficient, interpretable element-level alignment evaluation, reaches state-of-the-art results on several benchmarks, and shows strong generalization and practicality. Abstract: Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks (EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench) demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
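
As a rough illustration of the optimization signal described in the REVEALER abstract, the sketch below combines three reward terms into one scalar and then normalizes rewards within a group of sampled responses, which is the group-relative advantage used by GRPO. The function names, the weights `(0.2, 0.4, 0.4)`, and the score ranges are assumptions made for this example; the abstract does not specify how the composite reward is weighted.

```python
from statistics import mean, pstdev


def composite_reward(format_ok, grounding_score, alignment_score,
                     weights=(0.2, 0.4, 0.4)):
    """Weighted sum of the three reward terms (weights are hypothetical)."""
    w_format, w_ground, w_align = weights
    return (w_format * float(format_ok)
            + w_ground * grounding_score
            + w_align * alignment_score)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for the same prompt-image pair.
group = [
    composite_reward(True, 0.9, 1.0),
    composite_reward(True, 0.5, 0.0),
    composite_reward(False, 0.2, 0.0),
    composite_reward(True, 0.7, 1.0),
]
advantages = group_relative_advantages(group)
```

Responses that score above their group's mean receive a positive advantage and are reinforced; no separate value network is needed for the baseline.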

[184] GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection

Yi Zhang,Yi Wang,Lei Yao,Lap-Pui Chau

Main category: cs.CV

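TL;DR: This paper proposes GVSynergy-Det, a new framework that improves image-only 3D object detection through synergistic learning of dual Gaussian-voxel representations, without dense 3D supervision or depth sensor input.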

Details

Motivation: Existing image-based 3D detection methods struggle to combine high accuracy with independence from dense 3D supervision; this paper aims to overcome that limitation by fusing complementary geometric representations. Method: Proposes a dual-representation architecture: 1) generalizable Gaussian Splatting is adapted to extract fine-grained surface geometry features; 2) a cross-representation enhancement mechanism injects geometric details from the Gaussian field into voxel features, and learnable fusion enables precise localization. Result: Achieves state-of-the-art performance on the ScanNetV2 and ARKitScenes datasets, significantly outperforming existing methods that use no depth input, and requires no dense 3D supervision such as point clouds or TSDF. Conclusion: Through synergistic learning of Gaussian and voxel representations, GVSynergy-Det effectively improves the accuracy of image-only 3D detection and offers a new approach to detection without dense 3D supervision. Abstract: Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details while voxels provide structured spatial context. We introduce a dual-representation architecture that: 1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or utilize Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. 
Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).

Tianchen Deng,Xuefeng Chen,Yi Chen,Qu Chen,Yuyao Xu,Lijin Yang,Le Xu,Yu Zhang,Bo Zhang,Wuxiong Huang,Hesheng Wang

Main category: cs.CV

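TL;DR: Proposes a unified driving world model framework based on 3D Gaussian scene representation that supports both 3D scene understanding and multi-modal generation. Language features are embedded into Gaussian primitives for early cross-modal alignment, and a task-aware language-guided sampling strategy and a dual-condition generation model are introduced; the method reaches SOTA on nuScenes and NuInteract.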

Details

Motivation: Existing driving world models lack 3D scene understanding, cannot accurately align textual information with the 3D scene, and generate content without reasoning or contextual understanding. Method: Uses a 3D Gaussian scene representation and embeds language features into each Gaussian primitive for early modality fusion; designs a task-aware language-guided sampling strategy that compresses 3D information into compact 3D tokens fed to the LLM; builds a dual-condition generation model that combines a high-level language condition with a low-level image condition for multi-modal generation. Result: The method is validated on the nuScenes and NuInteract datasets, achieves state-of-the-art performance, and supports 3D scene understanding and high-quality multi-modal content generation. Conclusion: The framework improves the 3D understanding and cross-modal alignment of driving world models and offers a new approach to environment understanding and interactive generation in autonomous driving. Abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub: https://github.com/dtc111111/GaussianDWM.

[186] ForCM: Forest Cover Mapping from Multispectral Sentinel-2 Image by Integrating Deep Learning with Object-Based Image Analysis

Maisha Haque,Israt Jahan Ayshi,Sadaf M. Anis,Nahian Tasnim,Mithila Moontaha,Md. Sabbir Ahmed,Muhammad Iqbal Hossain,Mohammad Zavid Parvez,Subrata Chakraborty,Biswajeet Pradhan,Biswajit Banik

Main category: cs.CV

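TL;DR: This study proposes ForCM, a new forest cover mapping method that combines Object-Based Image Analysis (OBIA) with deep learning (DL) to map the Amazon Rainforest with high accuracy from multispectral Sentinel-2 imagery. After evaluating several DL models and fusing them with OBIA, the results show that ResUNet-OBIA and AttentionUNet-OBIA outperform traditional OBIA.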

Details

Motivation: To improve the accuracy of forest cover mapping, overcome the limited accuracy of traditional methods, and explore the potential of free, easy-to-use tools for environmental monitoring. Method: Combines Object-Based Image Analysis (OBIA) with several deep learning models (UNet, ResUNet, AttentionUNet, and others), runs experiments on Sentinel-2 Level 2A satellite imagery, and compares how different models perform when fused with OBIA. Result: ForCM improves mapping accuracy: ResUNet-OBIA reaches 94.54% overall accuracy and AttentionUNet-OBIA reaches 95.64%, compared with 92.91% for traditional OBIA. Conclusion: Combining deep learning with OBIA effectively improves forest cover mapping accuracy, and open-source tools such as QGIS show practical potential, supporting global environmental monitoring and conservation. Abstract: This research proposes "ForCM", a novel approach to forest cover mapping that combines Object-Based Image Analysis (OBIA) with Deep Learning (DL) using multispectral Sentinel-2 imagery. The study explores several DL models, including UNet, UNet++, ResUNet, AttentionUNet, and ResNet50-Segnet, applied to high-resolution Sentinel-2 Level 2A satellite images of the Amazon Rainforest. The datasets comprise three collections: two sets of three-band imagery and one set of four-band imagery. After evaluation, the most effective DL models are individually integrated with the OBIA technique to enhance mapping accuracy. The originality of this work lies in evaluating different deep learning models combined with OBIA and comparing them with traditional OBIA methods. The results show that the proposed ForCM method improves forest cover mapping, achieving overall accuracies of 94.54 percent with ResUNet-OBIA and 95.64 percent with AttentionUNet-OBIA, compared to 92.91 percent using traditional OBIA. This research also demonstrates the potential of free and user-friendly tools such as QGIS for accurate mapping within their limitations, supporting global environmental monitoring and conservation efforts.

[187] Exploring Syn-to-Real Domain Adaptation for Military Target Detection

Jongoh Jeong,Youngjin Oh,Gyeongrae Nam,Jeongeun Lee,Kuk-Jin Yoon

Main category: cs.CV

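TL;DR: This paper proposes using Unreal Engine to generate RGB-based synthetic data for cross-domain adaptation in military target detection, and evaluates existing domain adaptation methods through synthetic-to-real transfer experiments.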

Details

Motivation: Military settings span many environments and public military target detection datasets are scarce, so existing methods cope poorly with varied target domains. SAR data is also costly, whereas RGB cameras are a cheaper alternative. A low-cost, efficient cross-domain object detection solution is therefore needed. Method: Uses Unreal Engine to generate photorealistic RGB synthetic data, builds a synthetic-real cross-domain dataset pair, runs synthetic-to-real transfer experiments, and benchmarks recent domain adaptation methods under different degrees of supervision. Result: Experiments show that current methods given only minimal hints about the image (e.g., object class) improve substantially over unsupervised or semi-supervised domain adaptation methods. Conclusion: Although existing domain adaptation methods perform well under weak supervision, they still face challenges in complex military scenes, and further research is needed for real-world deployment. Abstract: Object detection is one of the key target tasks of interest in the context of civil and military applications. In particular, the real-world deployment of target detection methods is pivotal in the decision-making process during military command and reconnaissance. However, current domain adaptive object detection algorithms consider adapting one domain to another similar one only within the scope of natural or autonomous driving scenes. Since military domains often deal with a mixed variety of environments, detecting objects from multiple varying target domains poses a greater challenge. Several studies for armored military target detection have made use of synthetic aperture radar (SAR) data due to its robustness to all weather, long range, and high-resolution characteristics. Nevertheless, the costs of SAR data acquisition and processing are still much higher than those of the conventional RGB camera, which is a more affordable alternative with significantly lower data processing time. Furthermore, the lack of military target detection datasets limits the use of such a low-cost approach. To mitigate these issues, we propose to generate RGB-based synthetic data using a photorealistic visual tool, Unreal Engine, for military target detection in a cross-domain setting. To this end, we conducted synthetic-to-real transfer experiments by training on our synthetic dataset and validating on our web-collected real military target datasets. 
We benchmark the state-of-the-art domain adaptation methods distinguished by the degree of supervision on our proposed train-val dataset pair, and find that current methods using minimal hints on the image (e.g., object class) achieve a substantial improvement over unsupervised or semi-supervised DA methods. From these observations, we recognize the current challenges that remain to be overcome.

[188] Task-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks

Changgyoon Oh,Jongoh Jeong,Jegyeong Cho,Kuk-Jin Yoon

Main category: cs.CV

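TL;DR: Proposes a method that adaptively selects diffusion model timesteps for few-shot dense prediction, improving performance through task-aware timestep selection and a timestep feature consolidation module.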

Details

Motivation: Existing diffusion-based approaches choose timestep features from the multistep Markovian process by empirical intuition, which leads to sub-optimal performance on particular tasks. Method: Proposes Task-aware Timestep Selection (TTS) and Timestep Feature Consolidation (TFC) modules, combined with a parameter-efficient fine-tuning adapter, to adaptively select and consolidate the most suitable timestep features. Result: The method is validated on the Taskonomy dataset and significantly improves dense prediction in few-shot and universal learning scenarios. Conclusion: The proposed method adaptively selects and consolidates timestep features from diffusion models and significantly improves few-shot dense prediction. Abstract: Denoising diffusion probabilistic models have brought tremendous advances in generative tasks, achieving state-of-the-art performance thus far. Current diffusion model-based applications exploit the power of learned visual representations from multistep forward-backward Markovian processes for single-task prediction tasks by attaching a task-specific decoder. However, the heuristic selection of diffusion timestep features still heavily relies on empirical intuition, often leading to sub-optimal performance biased towards certain tasks. To alleviate this constraint, we investigate the significance of versatile diffusion timestep features by adaptively selecting timesteps best suited for the few-shot dense prediction task, evaluated on an arbitrary unseen task. To this end, we propose two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate the selected timestep features to improve the dense predictive performance in a few-shot setting. Accompanied by our parameter-efficient fine-tuning adapter, our framework effectively achieves superiority in dense prediction performance given only a few support queries. We empirically validate our learnable timestep consolidation method on the large-scale challenging Taskonomy dataset for dense prediction, particularly for practical universal and few-shot learning scenarios.

[189] AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding

Jongoh Jeong,Taek-Jin Song,Jong-Hwan Kim,Kuk-Jin Yoon

Main category: cs.CV

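TL;DR: This paper introduces AVOID, a new dataset for real-time obstacle detection under adverse visual conditions, addressing the need in autonomous driving for reliable detection of small road hazards.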

Details

Motivation: Existing driving datasets usually contain images from either normal or adverse conditions only, and lack road obstacles captured in the same visual domain as the other classes, so a more comprehensive dataset is needed to improve perception in complex environments. Method: Builds the AVOID dataset by collecting large-scale images in a simulated environment that contain unexpected road obstacles under various weather and time conditions, with multi-modal data including semantic maps, depth maps, LiDAR data, and waypoints; also benchmarks high-performing real-time networks and designs a multi-task network for ablation studies on semantic segmentation, depth estimation, and waypoint prediction. Result: The dataset supports a range of visual perception tasks, and the experiments report the performance of the proposed multi-task network on each task and the contribution of each module. Conclusion: AVOID fills the gap in road obstacle detection data under adverse visual conditions and supports better perception for autonomous driving systems in complex environments. Abstract: Understanding road scenes for visual perception remains crucial for intelligent self-driving cars. In particular, it is desirable to detect unexpected small road hazards reliably in real-time, especially under varying adverse conditions (e.g., weather and daylight). However, existing road driving datasets provide large-scale images acquired in either normal or adverse scenarios only, and often do not contain the road obstacles captured in the same visual domain as for the other classes. To address this, we introduce a new dataset called AVOID, the Adverse Visual Conditions Dataset, for real-time obstacle detection collected in a simulated environment. AVOID consists of a large set of unexpected road obstacles located along each path captured under various weather and time conditions. Each image is coupled with the corresponding semantic and depth maps, raw and semantic LiDAR data, and waypoints, thereby supporting most visual perception tasks. We benchmark the results on high-performing real-time networks for the obstacle detection task, and also propose and conduct ablation studies using a comprehensive multi-task network for semantic segmentation, depth and waypoint prediction tasks.

[190] MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?

Shiqi Dai,Zizhi Ma,Zhicong Luo,Xuesong Yang,Yibin Huang,Wanyue Zhang,Chi Chen,Zonghao Guo,Wang Xu,Yufei Sun,Maosong Sun

Main category: cs.CV

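TL;DR: This paper presents MM-UAVBench, the first comprehensive benchmark for multimodal large language models (MLLMs) in low-altitude UAV scenarios. It covers three capability dimensions (perception, cognition, and planning), comprises 19 sub-tasks and more than 5.7K manually annotated questions, and is built from real UAV data. Experiments show that existing MLLMs perform poorly in complex low-altitude environments, with key bottlenecks such as spatial bias and multi-view understanding.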

Details

Motivation: Existing MLLM benchmarks do not cover the specific demands of low-altitude UAV applications, while UAV-related evaluations are not designed from a general-intelligence perspective, so a unified and comprehensive benchmark is needed to assess what MLLMs can actually do in this domain. Method: Builds a new benchmark, MM-UAVBench, covering three core capability dimensions (perception, cognition, and planning) with 19 sub-tasks and more than 5.7K manually annotated questions derived from real UAV data in public datasets, and systematically evaluates 16 mainstream open-source and proprietary MLLMs. Result: Current MLLMs perform poorly in low-altitude scenarios, exposing problems such as spatial bias and difficulty with multi-view understanding, and fall short of the complex visual and cognitive demands. Conclusion: MM-UAVBench fills the gap in MLLM evaluation for low-altitude UAV scenarios, exposes key weaknesses of existing models, and is expected to encourage research on more robust and reliable UAV intelligence. Abstract: While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs' general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions (perception, cognition, and planning) in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks such as spatial bias and multi-view understanding that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.

[191] Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information

Youngchae Kwon,Jinyoung Choi,Injung Kim

Main category: cs.CV

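TL;DR: Proposes a new Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically by integrating three types of contextual information, improving detection performance.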

Details

Motivation: Fashion items vary widely in appearance and their subcategories are highly similar, so conventional detectors struggle to identify them accurately; contextual information is needed to reduce ambiguity. Method: Building on the DETR framework, introduces three types of contextual information (co-occurrence relationships between fashion items, relative position and size based on spatial arrangement, and spatial relationships with human body keypoints) and designs a new architecture to integrate them. Result: Experiments show that the method improves average precision (AP) by 3.6 and 1.1 percentage points over vanilla DETR and Co-DETR, respectively. Conclusion: By effectively exploiting several types of contextual information, Holi-DETR detects fashion items more accurately, confirming the benefit of holistic modeling for fashion detection. Abstract: Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically, by leveraging contextual information. Fashion items often have meaningful relationships as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed methods improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percentage points (pp) and 1.1 pp, respectively, in terms of average precision (AP).
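
To make the first context type in the Holi-DETR abstract concrete, the sketch below estimates how often one fashion category appears alongside another in annotated outfits. This is only one plausible way to summarize co-occurrence, assumed here as a conditional frequency; the abstract does not say how Holi-DETR encodes the relationship, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probs(outfits):
    """Return {(a, b): P(b appears | a appears)} estimated from outfit labels."""
    item_counts = Counter()
    pair_counts = Counter()
    for items in outfits:
        unique = sorted(set(items))  # count each category once per outfit
        item_counts.update(unique)
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return {(a, b): n / item_counts[a] for (a, b), n in pair_counts.items()}


# Toy annotations: each inner list is the set of categories in one image.
outfits = [
    ["shirt", "jeans"],
    ["shirt", "skirt"],
    ["shirt", "jeans", "sneakers"],
]
probs = cooccurrence_probs(outfits)
```

A detector could use such a table as a prior, for example to down-weight a candidate label that rarely appears together with the other items already detected in the image.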

[192] Bridging Your Imagination with Audio-Video Generation via a Unified Director

Jiaxu Zhang,Tianshu Hu,Yuan Zhang,Zenan Li,Linjie Luo,Guosheng Lin,Xin Chen

Main category: cs.CV

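TL;DR: This paper proposes UniMAGE, a unified director model that uses a Mixture-of-Transformers architecture to merge script generation with keyframe design, improving the narrative logic and visual consistency of AI video creation.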

Details

Motivation: Existing AI video generation systems separate script writing from shot design and lack a director's unified thinking, which leads to incoherent narratives and inconsistent visuals. Method: Adopts the Mixture-of-Transformers architecture and proposes a "first interleaving, then disentangling" training paradigm: interleaved text-image concept learning first, followed by decoupling script writing from keyframe generation. Result: Experiments show that UniMAGE achieves state-of-the-art performance among open-source models and generates logically coherent video scripts and visually consistent keyframe images. Conclusion: A unified framework for script and visual generation improves the quality of AI video creation and gives non-expert users an end-to-end way to produce multi-shot films. Abstract: Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.

[193] Anomaly Detection by Effectively Leveraging Synthetic Images

Sungho Kang,Hyunkyu Park,Yeonho Lee,Hanbyul Lee,Mijoo Jeong,YeongHyeon Park,Injae Lee,Juneho Yi

Main category: cs.CV

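TL;DR: This paper proposes an efficient framework for synthesizing defect images that combines a text-guided image translation model with image retrieval, and uses a two-stage training strategy to improve industrial anomaly detection while lowering data cost.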

Details

Motivation: Real defect images are scarce, and existing generation methods trade off cost against image quality; this paper aims to make efficient use of synthetic images to improve anomaly detection. Method: Generates defect images with a pre-trained text-guided image-to-image translation model and uses an image retrieval model to keep high-quality synthetic results that resemble real normal images; applies a two-stage training strategy that pre-trains on rule-based synthetic images and then fine-tunes on high-quality synthetic images. Result: Experiments on the MVTec AD dataset show that the method significantly reduces data collection cost while improving anomaly detection performance. Conclusion: The proposed framework balances synthesis cost against image quality and offers an efficient, practical solution for unsupervised anomaly detection. Abstract: Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two-stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost for data collection while improving the anomaly detection performance. Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.

[194] SURE Guided Posterior Sampling: Trajectory Correction for Diffusion-Based Inverse Problems

Minwoo Kim,Hongki Lim

Main category: cs.CV

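TL;DR: Proposes SGPS, a new method that uses SURE gradients and PCA-based noise estimation to correct deviations in the diffusion sampling trajectory, substantially reducing error accumulation and achieving high-quality inverse problem reconstruction with fewer than 100 network evaluations.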

Details

Motivation: When solving inverse problems, existing diffusion approaches accumulate errors by alternating between sampling and data consistency steps, and need many iterations to reach high-quality results, which is inefficient. Method: Combines Stein's Unbiased Risk Estimate (SURE) gradient updates with PCA-based noise estimation to correct trajectory deviations in the early and middle sampling stages, improving the accuracy of posterior sampling. Result: SGPS outperforms existing methods on a variety of inverse problems and maintains high reconstruction quality at low NFE counts (<100). Conclusion: By reducing error accumulation during sampling, SGPS solves inverse problems efficiently and with high quality at significantly lower computational cost. Abstract: Diffusion models have emerged as powerful learned priors for solving inverse problems. However, current iterative solving approaches which alternate between diffusion sampling and data consistency steps typically require hundreds or thousands of steps to achieve high quality reconstruction due to accumulated errors. We address this challenge with SURE Guided Posterior Sampling (SGPS), a method that corrects sampling trajectory deviations using Stein's Unbiased Risk Estimate (SURE) gradient updates and PCA-based noise estimation. By mitigating noise-induced errors during the critical early and middle sampling stages, SGPS enables more accurate posterior sampling and reduces error accumulation. This allows our method to maintain high reconstruction quality with fewer than 100 Neural Function Evaluations (NFEs). Our extensive evaluation across diverse inverse problems demonstrates that SGPS consistently outperforms existing methods at low NFE counts.
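
For readers unfamiliar with the estimator named in the SGPS abstract, the sketch below computes the standard SURE value for a denoiser f applied to y = x + n with i.i.d. Gaussian noise of standard deviation sigma: SURE = ||f(y) - y||^2 - N*sigma^2 + 2*sigma^2 * div f(y), with the divergence approximated by a single random probe. It only illustrates the estimator, not the SGPS sampling procedure; the function name, the finite-difference step `eps`, and the toy linear denoiser are assumptions for this example.

```python
import random


def sure(denoiser, y, sigma, eps=1e-3, seed=0):
    """Stein's Unbiased Risk Estimate of E||denoiser(y) - x||^2 (unnormalized)."""
    rng = random.Random(seed)
    n = len(y)
    fy = denoiser(y)
    # Monte-Carlo divergence: b^T (f(y + eps*b) - f(y)) / eps, Rademacher probe b.
    b = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    f_pert = denoiser([yi + eps * bi for yi, bi in zip(y, b)])
    div = sum(bi * (fp - f0) for bi, fp, f0 in zip(b, f_pert, fy)) / eps
    residual = sum((f0 - yi) ** 2 for f0, yi in zip(fy, y))
    return residual - n * sigma ** 2 + 2 * sigma ** 2 * div


def shrink(v):
    """Toy linear denoiser that scales every entry by 0.5 (divergence = 0.5 * N)."""
    return [0.5 * x for x in v]


estimate = sure(shrink, [1.0, 2.0, 3.0, 4.0], sigma=0.1)
```

Because SURE depends only on the noisy observation and the noise level, it can be evaluated, and differentiated in a framework with automatic differentiation, without access to the clean signal, which is what makes it usable as a correction signal during sampling.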

[195] Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network

Dongsheng Li,Chaobo Chen,Siling Wang,Song Gao

Main category: cs.CV

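TL;DR: Proposes PEG-DRNet, a physics-edge hybrid gas dynamic routing network that improves infrared gas leak detection and outperforms existing methods on several metrics.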

Details

Motivation: Infrared gas leak detection is challenging because plumes are faint and their boundaries are blurred, so extraction of weak-contrast and contour features needs to be strengthened. Method: Designs a Gas Block that models gas transport with a local branch and a large-kernel branch; proposes the AGPEO operator to extract edge priors from multi-directional gradients and phase-consistent responses, which MSEPM turns into hierarchical edge features; uses CASR-PAN to adaptively aggregate multi-scale features based on edge and content cues. Result: On the IIG dataset, PEG-DRNet reaches 29.8% AP, 84.3% AP50, and 25.3% small-object AP, exceeding the baseline by 3.0%, 6.5%, and 5.3%, respectively, with only 43.7 GFLOPs and 14.9 M parameters. It outperforms CNN and Transformer detectors on both the IIG and LangGas datasets. Conclusion: By combining physics-inspired modeling with edge-aware mechanisms, PEG-DRNet balances accuracy and efficiency and significantly improves infrared gas leak detection. Abstract: Detecting infrared gas leaks is critical for environmental monitoring and industrial safety, yet remains difficult because plumes are faint, small, semitransparent, and have weak, diffuse boundaries. We present the physics-edge hybrid gas dynamic routing network (PEG-DRNet). First, we introduce the Gas Block, a diffusion-convection unit modeling gas transport: a local branch captures short-range variations, while a large-kernel branch captures long-range propagation. An edge-gated learnable fusion module balances local detail and global context, strengthening weak-contrast plume and contour cues. Second, we propose the adaptive gradient and phase edge operator (AGPEO), computing reliable edge priors from multi-directional gradients and phase-consistent responses. These are transformed by a multi-scale edge perception module (MSEPM) into hierarchical edge features that reinforce boundaries. Finally, the content-adaptive sparse routing path aggregation network (CASR-PAN), with adaptive information modulation modules for fusion and self, selectively propagates informative features across scales based on edge and content cues, improving cross-scale discriminability while reducing redundancy. Experiments on the IIG dataset show that PEG-DRNet achieves an overall AP of 29.8%, an AP50 of 84.3%, and a small-object AP of 25.3%, surpassing the RT-DETR-R18 baseline by 3.0%, 6.5%, and 5.3%, respectively, while requiring only 43.7 GFLOPs and 14.9 M parameters. 
The proposed PEG-DRNet achieves superior overall performance with the best balance of accuracy and computational efficiency, outperforming existing CNN and Transformer detectors in AP and AP50 on the IIG and LangGas datasets.

[196] RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models

Fan Wei,Runmin Dong,Yushan Lai,Yixiang Yang,Zhaoyang Luo,Jinxiao Zhang,Miao Yang,Shuai Yuan,Jiyao Zhao,Bin Luo,Haohuan Fu

Main category: cs.CV

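TL;DR: Proposes a training-free, two-stage data pruning method that improves the training efficiency and generation quality of remote sensing diffusion foundation models. Even at a pruning ratio of 85%, it preserves data diversity and representativeness and reaches SOTA performance on downstream tasks.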

Details

Motivation: Existing remote sensing diffusion models depend on large-scale data, but redundancy, noise, and class imbalance hurt training efficiency and convergence, and current practice ignores both the distributional requirements of generative modeling and the heterogeneity of remote sensing imagery. Method: Uses a two-stage pruning strategy: an entropy-based criterion first removes low-information samples; scene-aware clustering with stratified sampling, guided by scene classification datasets, then improves clustering quality while reducing computational cost; finally, fine-grained selection balances cluster-level uniformity against sample representativeness. Result: With 85% of the training data pruned, convergence speed and generation quality improve significantly, and performance on downstream tasks (such as super-resolution and semantic image synthesis) exceeds existing methods. Conclusion: This training-free data pruning paradigm improves the training efficiency and generalization of remote sensing generative foundation models and offers practical guidance for building high-quality models. Abstract: Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. 
Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
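
The first pruning stage in the RS-Prune abstract, an entropy-based criterion that removes low-information samples, can be sketched as follows. The abstract does not define the entropy measure, so this example assumes the Shannon entropy of each image's pixel-intensity histogram and a simple keep-the-top-fraction rule; the function names, the toy images, and `keep_ratio` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (bits) of the intensity histogram of a flat pixel list."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def prune_low_entropy(images, keep_ratio):
    """Keep the names of the highest-entropy images; drop the rest."""
    ranked = sorted(images.items(),
                    key=lambda kv: shannon_entropy(kv[1]),
                    reverse=True)
    n_keep = max(1, round(keep_ratio * len(ranked)))
    return [name for name, _ in ranked[:n_keep]]


# Toy 16-pixel "images": uniform, two-level, and fully varied intensities.
images = {
    "flat": [0] * 16,
    "checker": [0, 255] * 8,
    "ramp": list(range(16)),
}
kept = prune_low_entropy(images, keep_ratio=0.67)
```

In this toy case the uniform image carries no information and is dropped first. The later stages described in the abstract (scene-aware clustering and stratified sampling) would then operate on the surviving samples.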

[197] Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism

Siyu Zhang,Ying Chen,Lianlei Shan,Runhe Qiu

Main category: cs.CV

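TL;DR: This paper proposes a vision-language model framework that combines a Dynamic Resolution Input Strategy (DRIS) with a Multi-scale Vision-language Alignment Mechanism (MS-VLAM) to improve the semantic understanding accuracy and computational efficiency of multimodal remote sensing image fusion.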

Details

Motivation: Existing methods struggle to balance efficiency and detail at a fixed resolution, and single-scale alignment lacks semantic hierarchy, which limits the accuracy of information extraction from remote sensing images. Method: DRIS adaptively allocates computational resources with a coarse-to-fine strategy; MS-VLAM performs cross-modal alignment at the object, local-region, and global levels to strengthen semantic consistency. Result: Experiments on the RS-GPT4V dataset show that the method outperforms conventional methods on image captioning (BLEU-4, CIDEr) and cross-modal retrieval (R@10), significantly improving semantic understanding and computational efficiency. Conclusion: The framework offers a new approach to building efficient, robust multimodal remote sensing systems and provides theoretical and technical guidance for engineering applications of intelligent remote sensing interpretation. Abstract: Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM). Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity imbalance. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.

[198] ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

Xingwei Ma,Shiyang Feng,Bo Zhang,Bin Wang

Main category: cs.CV

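TL;DR: This paper proposes ViLaCD-R1, a two-stage remote sensing change detection framework that combines a vision-language model with a mask-guided decoder to improve semantic understanding, spatial localization, and boundary precision.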

Details

Motivation: Traditional remote sensing change detection methods struggle to capture high-level semantics and are vulnerable to non-semantic perturbations, while existing VLM-based methods suffer from inaccurate localization and poor interpretability. Method: A two-stage framework: in the first stage, a VLM performs block-level dual-temporal reasoning and outputs a coarse change mask; in the second stage, a mask-guided decoder fuses dual-temporal features to produce a precise binary change map. The model is trained with supervised fine-tuning and reinforcement learning. Result: The method performs strongly on multiple remote sensing change detection benchmarks, significantly improving semantic change recognition and localization, effectively suppressing non-semantic changes, and reaching state-of-the-art accuracy. Conclusion: By combining vision-language reasoning with mask-guided decoding, ViLaCD-R1 achieves more accurate, robust, and interpretable remote sensing change detection and is suited to complex real-world scenarios. Abstract: Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.

[199] ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation

Shin seong Kim,Minjung Shin,Hyunin Cho,Youngjung Uh

Main category: cs.CV

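TL;DR: This paper proposes ASemconsist, a new framework that selectively modifies text embeddings to give explicit semantic control over character identity while keeping images aligned with their prompts. It also introduces an adaptive feature-sharing strategy and a unified evaluation metric, CQS, balancing character consistency against prompt alignment.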

Details

Motivation: When generating a series of images, existing methods struggle to maintain both character identity consistency and per-image text alignment, and must trade one off against the other. Method: The ASemconsist framework uses selective text embedding modification and a strategy that treats padding embeddings as semantic containers; an adaptive feature-sharing mechanism handles textual ambiguity; and a unified evaluation metric, CQS, is constructed. Result: The method achieves state-of-the-art performance on multiple benchmarks, effectively overcoming the trade-off between identity consistency and prompt alignment and improving overall generation quality. Conclusion: Through semantic control and adaptive constraints, ASemconsist significantly improves character identity consistency across diverse scenes while retaining good text alignment, offering a better solution for text-to-image generation. Abstract: Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: https://minjung-s.github.io/asemconsist

[200] Contour Information Aware 2D Gaussian Splatting for Image Representation

Masaya Takabe,Hiroshi Watanabe,Sujun Hong,Tomohiro Ikai,Zheming Fan,Ryo Ishimoto,Kakeru Sugimoto,Ruri Imichi

Main category: cs.CV

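TL;DR: Proposes a contour-aware 2D Gaussian Splatting framework that introduces segmentation priors to improve edge reconstruction quality for image representation under compression.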

Details

Motivation: Existing 2D Gaussian Splatting methods lack contour awareness under high compression, which leads to blurry boundaries. Method: Object segmentation priors are incorporated into the 2D Gaussian representation, each Gaussian is constrained to a specific segmentation region during rasterization, and a warm-up training scheme is introduced to stabilize convergence. Result: The method is validated on synthetic color charts and the DAVIS dataset, and it significantly improves reconstruction quality in edge regions, especially under high compression (few Gaussians). Conclusion: The proposed contour-aware framework improves boundary sharpness in 2D Gaussian image representation while retaining fast rendering and low memory overhead. Abstract: Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real-time decoding, they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross-boundary blending and preserves edge structures under high compression. We also introduce a warm-up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.

[201] Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

Tong Shao,Yusen Fu,Guoying Sun,Jingde Kong,Zhuotao Tian,Jingyong Su

Main category: cs.CV

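TL;DR: This paper proposes CEM, a fidelity-optimization plugin that optimizes caching strategies through cumulative error minimization, improving both inference speed and generation quality of Diffusion Transformers for image and video generation.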

Details

Motivation: Diffusion Transformers (DiT) are slow at inference because of their iterative denoising process, and existing caching-based acceleration methods incur large computational error and lack adaptivity. Method: CEM predefines an error that characterizes the model's sensitivity to timesteps and cache intervals, then optimizes the strategy with a dynamic programming algorithm, obtaining the best caching strategy under a cumulative error approximation. Result: Experiments on nine generation models and quantized methods show that CEM significantly improves the generation fidelity of existing accelerated models and exceeds the original generation performance on several mainstream models. Conclusion: CEM is a model-agnostic, efficient inference acceleration scheme that adds no extra computational overhead and applies broadly across acceleration budgets and error correction frameworks. Abstract: Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of the model to acceleration jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves the caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-α, StableDiffusion1.5 and Hunyuan. The code will be made publicly available.
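
To make the idea of choosing a caching schedule by dynamic programming concrete, here is a toy sketch. It is not the CEM algorithm itself: the error table `err`, the quadratic error model, the sensitivity weights, and the function name `plan_cache_schedule` are assumptions for illustration. It assumes that the error of running a full forward pass at step `t` and reusing its cache for the next `k - 1` steps is known in advance, and that errors simply add up.

```python
def plan_cache_schedule(err, num_steps, num_full):
    """Choose which steps get a full forward pass so the summed error is minimal.

    err[t][k] is an assumed, precomputed error for running a full pass at step t
    and reusing its cache for the following k - 1 steps (interval length k >= 1).
    Returns (total_error, sorted list of full-pass steps).
    """
    inf = float("inf")
    # best[i][j]: minimal cumulative error covering steps [0, i) with j full passes
    best = [[inf] * (num_full + 1) for _ in range(num_steps + 1)]
    prev = [[None] * (num_full + 1) for _ in range(num_steps + 1)]
    best[0][0] = 0.0
    for i in range(1, num_steps + 1):
        for j in range(1, min(i, num_full) + 1):
            for t in range(j - 1, i):  # last full pass at step t, covering [t, i)
                cand = best[t][j - 1] + err[t][i - t]
                if cand < best[i][j]:
                    best[i][j] = cand
                    prev[i][j] = t
    # Backtrack to recover the chosen full-pass steps.
    steps, i, j = [], num_steps, num_full
    while j > 0:
        t = prev[i][j]
        steps.append(t)
        i, j = t, j - 1
    return best[num_steps][num_full], steps[::-1]

# Hypothetical error model: step 0 is five times more sensitive than the others,
# and error grows quadratically with the number of reused steps.
num_steps, num_full = 7, 3
sensitivity = [5, 1, 1, 1, 1, 1, 1]
err = [[sensitivity[t] * (k - 1) ** 2 for k in range(num_steps + 1)]
       for t in range(num_steps)]
total, full_steps = plan_cache_schedule(err, num_steps, num_full)
```

With these toy numbers the schedule recomputes immediately after the sensitive first step and spreads the remaining budget evenly, which is the kind of adaptivity a fixed caching interval cannot express.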

[202] YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection

Xu Lin,Jinlong Peng,Zhenye Gan,Jiawen Zhu,Jun Liu

Main category: cs.CV

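TL;DR: This paper proposes YOLO-Master, a real-time object detection framework built on instance-conditional adaptive computation. An Efficient Sparse Mixture-of-Experts (ES-MoE) block allocates computational resources dynamically, improving detection in complex scenes while reducing redundant computation.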

Details

Motivation: Existing YOLO-like models use static dense computation that processes all inputs uniformly, causing redundant computation on simple scenes and insufficient resources on complex ones, which hurts the balance between efficiency and accuracy. Method: An ES-MoE block and a lightweight dynamic routing network are introduced; the routing policy is trained with a diversity-enhancing objective so that the most relevant expert subnetworks are activated according to scene complexity, yielding adaptive computation. Result: On MS COCO the model reaches 42.4% AP with an inference latency of only 1.62 ms, which is +0.8% mAP and 17.8% faster than YOLOv13-N, with notable gains on dense scenes while remaining real-time. Conclusion: Through instance-conditional adaptive computation, YOLO-Master improves the accuracy-efficiency trade-off of real-time object detection, performing especially well in complex scenes while combining high performance with low latency. Abstract: Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources such as over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by +0.8% mAP with 17.8% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.
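
The sparse expert activation described above can be sketched with generic top-k gating. This is a standard mixture-of-experts routing pattern rather than the ES-MoE block itself: the scalar "experts", the router logits, and the function names are hypothetical, and the diversity-enhancing training objective is not modeled.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(router_logits, k):
    """Return {expert_index: gate_weight} for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return {i: probs[i] / norm for i in top}

def sparse_moe(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())

# Hypothetical scalar experts; a real block would use convolutional sub-networks.
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x + 10]
logits = [2.0, 2.0, -1.0, -3.0]  # assumed router output for one input
gates = route_top_k(logits, k=2)
out = sparse_moe(4.0, experts, logits, k=2)
```

Only two of the four experts run for this input, which is where the compute saving comes from; in a trained detector the router logits would depend on the input features.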

[203] Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition

Arman Martirosyan,Shahane Tigranyan,Maria Razzhivina,Artak Aslanyan,Nazgul Salikhova,Ilya Makarov,Andrey Savchenko,Aram Avetisyan

Main category: cs.CV

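TL;DR: This paper proposes two multimodal frameworks for micro-gesture recognition and behavior-based emotion prediction on the iMiGUE dataset. By fusing RGB, skeletal pose, facial, and contextual information through cross-modal fusion modules, the approach took second place in the emotion prediction task.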

Details

Motivation: Micro-gesture recognition and behavior-based emotion prediction require modeling subtle human behaviors, and existing methods struggle to fuse multimodal information well enough to capture fine-grained spatio-temporal patterns. Method: MViTv2-S and 2s-AGCN extract video and 3D skeletal pose features, which are integrated by a Cross-Modal Token Fusion module; for emotion prediction, SwinFace and MViTv2-S extract facial and contextual features, which are fused by an InterFusion module. Result: Experiments on the iMiGUE dataset show strong performance on behavior-based emotion prediction, with second place in the MiGA 2025 Challenge. Conclusion: The proposed multimodal fusion frameworks effectively integrate visual and pose information, improve micro-gesture recognition and emotion prediction, and confirm the value of cross-modal fusion for fine-grained behavior analysis. Abstract: Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To comprehensively represent gestures, video and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. Then, they are integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and contextual embeddings extracted using SwinFace and MViTv2-S models and fuse them through an InterFusion module designed to capture emotional expressions and body gestures. Experiments conducted on the iMiGUE dataset, within the scope of the MiGA 2025 Challenge, demonstrate the robust performance and accuracy of our method in the behavior-based emotion prediction task, where our approach secured 2nd place.

[204] MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

Md. Sazzadul Islam Prottasha,Nabil Walid Rafi

Main category: cs.CV

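TL;DR: This study compares the open-source MedGemma with the proprietary GPT-4 on medical image diagnosis. LoRA-fine-tuned MedGemma outperforms GPT-4 in both accuracy and sensitivity, indicating that domain-specific fine-tuning is essential for clinical use.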

Details

Motivation: To identify the best AI architecture for medical image diagnosis and address the tendency of general-purpose large models to hallucinate in clinical practice. Method: The MedGemma-4b-it model is fine-tuned with Low-Rank Adaptation (LoRA) and compared with untuned GPT-4 on classification of six diseases, with quantitative analysis via confusion matrices and classification reports. Result: MedGemma reaches a mean test accuracy of 80.37%, well above GPT-4's 69.58%, and shows higher sensitivity on high-stakes tasks such as cancer and pneumonia detection. Conclusion: Domain-specific fine-tuning improves the accuracy and reliability of models in medical diagnosis, and MedGemma shows more promise as an advanced tool for evidence-based medical reasoning. Abstract: Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.

[205] CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

Ke Niu,Haiyang Yu,Zhuofan Chen,Zhengtao Yao,Weitao Jia,Xiaodong Ge,Jingqun Tang,Benlei Cui,Bin Li,Xiangyang Xue

Main category: cs.CV

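TL;DR: Proposes a new Heterogeneous Collaborative Multi-Expert Reinforcement Learning paradigm (CME-CAD) for generating high-precision, editable CAD models, and releases CADExpert, an open-source benchmark with 17,299 instances.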

Details

Motivation: Traditional CAD modeling is complex; 3D models produced by existing methods are non-editable and insufficiently precise, and text- or image-based inputs depend on extensive manual annotation, which limits automation and scalability in industrial applications. Method: The CME-CAD paradigm combines the strengths of multiple expert models through a two-stage training process, Multi-Expert Fine-Tuning (MEFT) followed by Multi-Expert Reinforcement Learning (MERL), enabling collaborative learning that improves CAD code generation. Result: The approach generates accurate, constraint-compatible, and fully editable CAD models and significantly outperforms existing methods. Conclusion: CME-CAD offers an effective solution for industrial-grade CAD automation, and the CADExpert benchmark supports further research in the field. Abstract: Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods that reconstruct 3D models from sketches often produce non-editable and approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model's ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.

[206] Visual Language Hypothesis

Xiu Li

Main category: cs.CV

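TL;DR: This paper studies visual representation learning from a structural and topological perspective. It hypothesizes that visual understanding depends on a semantic language for vision and derives that the observation space has a fiber bundle structure in which semantics correspond to a quotient base space.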

Details

Motivation: To examine the mechanism of semantic abstraction in visual representation learning and explain why semantic invariance cannot be achieved through smooth transformations alone. Method: Starting from the hypothesis that visual understanding requires a semantic language, and combining it with premises on transferability and abstraction, a topological analysis derives the fiber bundle structure of the observation space and the properties of the semantic quotient space. Result: 1) The semantic quotient X/G is not a submanifold of X and cannot be obtained through smooth deformation alone; it requires a non-homeomorphic, discriminative target. 2) The model architecture must support topology change, namely an "expand-and-snap" process that first expands and then collapses. Conclusion: Semantic abstraction requires not only an external semantic target but also a representation mechanism that supports topology change; the framework offers a topological perspective for understanding large-scale discriminative and multimodal models. Abstract: We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber-bundle-like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation alone; semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross-instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.

[207] CountGD++: Generalized Prompting for Open-World Counting

Niki Amini-Naieni,Andrew Zisserman

Main category: cs.CV

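TL;DR: This paper proposes CountGD++, a new object counting method that extends prompting so that objects not to count can be specified with text and visual examples, introduces pseudo-exemplars for automatic annotation, and uses visual examples from natural and synthetic external images, improving the flexibility, accuracy, and generalization of multimodal open-world counting.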

Details

Motivation: Existing automatic counting methods are limited in how the target can be specified: visual examples must be annotated manually, and there is no way to express what not to count, which restricts flexibility and accuracy. Method: The prompt mechanism is extended so that objects not to count can be described with text and/or visual examples; "pseudo-exemplars" automate the annotation of visual examples at inference; the counting model is extended to accept visual examples from natural and synthetic external images; and CountGD++ is integrated into a large language model as a vision expert module. Result: The new method significantly improves counting accuracy, efficiency, and generalization across multiple datasets. Conclusion: By making prompts more expressive and automating visual example handling, CountGD++ advances open-world multimodal counting with greater practicality and extensibility. Abstract: The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.

[208] SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Kanghee Lee,Injae Lee,Minseok Kwak,Kwonyoung Ryu,Jungi Hong,Jaesik Park

Main category: cs.CV

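TL;DR: This paper proposes a scalable multi-view data generation and annotation pipeline, builds SpatialMosaic, a large-scale dataset with 2M QA pairs, and SpatialMosaic-Bench, a benchmark with 6 tasks and 1M QA pairs, and presents SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders to improve the spatial reasoning of vision-language models in complex real-world scenes.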

Details

Motivation: Existing 3D scene understanding methods rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which limits scalability and real-world applicability; and the spatial reasoning challenges posed by partial visibility, occlusion, and low overlap, all common in real environments, remain under-explored. Method: A scalable multi-view data generation and annotation pipeline is used to build the large-scale instruction-tuning dataset SpatialMosaic (2M QA pairs) and the evaluation benchmark SpatialMosaic-Bench (1M QA pairs, 6 tasks); the SpatialMosaicVLM framework integrates 3D reconstruction models as geometry encoders within vision-language models, enabling robust spatial reasoning without explicit 3D reconstruction. Result: Experiments show that the dataset and VQA tasks improve spatial reasoning under challenging multi-view conditions, validating the pipeline's ability to create realistic, diverse QA pairs; SpatialMosaicVLM performs strongly on SpatialMosaic-Bench. Conclusion: By building a high-quality multi-view spatial reasoning dataset and benchmark, the work advances the spatial understanding of vision-language models that do not depend on explicit 3D reconstruction and offers a workable approach to complex spatial reasoning in real scenes. Abstract: The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.

[209] MGCA-Net: Multi-Graph Contextual Attention Network for Two-View Correspondence Learning

Shuyuan Lin,Mengtin Lo,Haosheng Chen,Yanjie Liang,Qiangqiang Wu

Main category: cs.CV

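TL;DR: This paper proposes a Multi-Graph Contextual Attention Network (MGCA-Net) for two-view correspondence learning, using Contextual Geometric Attention (CGA) and Cross-Stage Multi-Graph Consensus (CSMGC) modules to improve geometric modeling and information optimization, and it significantly outperforms existing methods.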

Details

Motivation: Existing methods fall short in local geometric modeling and cross-stage information optimization, making it hard to capture the geometric constraints of matched pairs accurately and reducing model robustness. Method: MGCA-Net comprises a CGA module, which adaptively fuses spatial position and feature information to strengthen modeling of local and global geometric relationships, and a CSMGC module, which establishes geometric consensus through a cross-stage sparse graph network. Result: Experiments on the YFCC100M and SUN3D datasets show that MGCA-Net significantly outperforms existing SOTA methods on outlier rejection and camera pose estimation. Conclusion: MGCA-Net improves geometric modeling and cross-stage consistency in two-view correspondence learning, making matching more robust and accurate. Abstract: Two-view correspondence learning is a key task in computer vision, which aims to establish reliable matching relationships for applications such as camera pose estimation and 3D reconstruction. However, existing methods have limitations in local geometric modeling and cross-stage information optimization, which make it difficult to accurately capture the geometric constraints of matched pairs and thus reduce the robustness of the model. To address these challenges, we propose a Multi-Graph Contextual Attention Network (MGCA-Net), which consists of a Contextual Geometric Attention (CGA) module and a Cross-Stage Multi-Graph Consensus (CSMGC) module. Specifically, CGA dynamically integrates spatial position and feature information via an adaptive attention mechanism and enhances the capability to capture both local and global geometric relationships. Meanwhile, CSMGC establishes geometric consensus via a cross-stage sparse graph network, ensuring the consistency of geometric information across different stages. Experimental results on two representative datasets, YFCC100M and SUN3D, show that MGCA-Net significantly outperforms existing SOTA methods in the outlier rejection and camera pose estimation tasks. Source code is available at http://www.linshuyuan.com.

[210] NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization

Yifei Li,Haoyuan He,Yu Zheng,Bingyao Yu,Wenzhao Zheng,Lei Chen,Jie Zhou,Jiwen Lu

Main category: cs.CV

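TL;DR: This paper proposes NeXT-IMDL, a large-scale diagnostic benchmark for Image Manipulation Detection and Localization (IMDL) that systematically evaluates how robust existing detectors are under real-world generalization scenarios.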

Details

Motivation: Existing IMDL methods perform well in cross-dataset evaluation, but this simplified evaluation hides their fragility on diverse AI-generated content and gives a misleading picture of progress. A stricter evaluation framework is needed to reveal true generalization ability. Method: The NeXT-IMDL benchmark categorizes AIGC-based manipulations along four dimensions (editing models, manipulation types, content semantics, and forgery granularity) and defines five cross-dimension evaluation protocols, under which 11 representative models are systematically tested. Result: Although existing models perform well in their original settings, they show systemic failures and significant performance degradation under the new NeXT-IMDL protocols, exposing a serious lack of generalization. Conclusion: NeXT-IMDL provides a diagnostic toolkit that exposes the limitations of current IMDL methods and pushes future research toward truly robust next-generation detection models. Abstract: The accessibility surge and abuse risks of user-friendly image editing models have created an urgent need for generalizable, up-to-date methods for Image Manipulation Detection and Localization (IMDL). Current IMDL research typically uses cross-dataset evaluation, where models trained on one benchmark are tested on others. However, this simplified evaluation approach conceals the fragility of existing methods when handling diverse AI-generated content, leading to misleading impressions of progress. This paper challenges this illusion by proposing NeXT-IMDL, a large-scale diagnostic benchmark designed not just to collect data, but to probe the generalization boundaries of current detectors systematically. Specifically, NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes: editing models, manipulation types, content semantics, and forgery granularity. Built upon this, NeXT-IMDL implements five rigorous cross-dimension evaluation protocols. Our extensive experiments on 11 representative models reveal a critical insight: while these models perform well in their original settings, they exhibit systemic failures and significant performance degradation when evaluated under our designed protocols that simulate various real-world generalization scenarios. By providing this diagnostic toolkit and the new findings, we aim to advance the development towards building truly robust, next-generation IMDL models.

[211] SoulX-LiveTalk Technical Report

Le Shen,Qiao Qian,Tan Yu,Ke Zhou,Tianhang Yu,Yu Zhan,Zhenjie Wang,Ming Tao,Shunshun Yin,Siyuan Liu

Main category: cs.CV

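TL;DR: SoulX-LiveTalk is a 14B-parameter framework for real-time streaming audio-driven avatar generation. It uses self-correcting bidirectional distillation and a multi-step retrospective self-correction mechanism to significantly improve visual quality and motion coherence while maintaining low latency and a high frame rate.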

Details

Motivation: In real-time, infinite-duration audio-driven avatar generation, existing methods often sacrifice visual fidelity because of the conflict between computational load and low-latency requirements; a large-model solution is needed that keeps output quality high while meeting strict latency limits. Method: SoulX-LiveTalk adopts a self-correcting bidirectional distillation strategy that retains bidirectional attention within video chunks to preserve spatiotemporal correlations; a multi-step retrospective self-correction mechanism prevents collapse from accumulated errors during generation; and a full-stack inference acceleration suite covers hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Result: SoulX-LiveTalk achieves a sub-second start-up latency of 0.87 s and a real-time throughput of 32 FPS, making it the first system to reach this performance at the 14B-parameter scale, and it surpasses existing methods in motion coherence and visual detail. Conclusion: Through bidirectional distillation and self-correction combined with system-level optimization, SoulX-LiveTalk resolves the latency-quality trade-off of large diffusion models in real-time, infinite-duration avatar generation and sets a new standard for high-fidelity interactive digital human synthesis. Abstract: Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-LiveTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.

[212] SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance Segmentation

Xiaolan Li,Wanquan Liu,Pengcheng Li,Pengyu Jie,Chenqiang Gao

Main category: cs.CV

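TL;DR: Proposes SOFTooth, a 2D-3D tooth instance segmentation framework that fuses frozen 2D semantics and achieves state-of-the-art 3D tooth segmentation, with particularly strong results on difficult cases such as third molars.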

Details

Motivation: 3D tooth instance segmentation is challenging because of crowded arches, ambiguous gingival boundaries, missing teeth, and third molars that are rare but clinically important; existing 3D methods suffer from boundary leakage and center drift, while 2D foundation models are hard to apply directly in 3D clinical workflows. Method: The SOFTooth framework has three parts: 1) a point-wise residual gating module injects SAM's 2D semantic embeddings into 3D point features to refine boundaries; 2) a center-guided mask refinement module strengthens consistency between instances and their geometric centers; 3) an order-aware Hungarian matching strategy combines anatomical order and center distance for instance assignment. Result: On the 3DTeethSeg'22 dataset the method reaches state-of-the-art overall accuracy and mean IoU, with clear gains over existing methods on third-molar cases. Conclusion: The rich semantics of 2D foundation models can be transferred to 3D tooth segmentation without 2D fine-tuning, and SOFTooth delivers robust, consistent instance segmentation in complex clinical scenarios. Abstract: Three-dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary-aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine tooth-gingiva and inter-tooth boundaries. Second, a center-guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order-aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity-based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg'22, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning.
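
The order-aware matching idea, combining center distance with anatomical order in the assignment cost, can be sketched on a toy problem. This is an illustration under stated assumptions, not the SOFTooth matching code: the additive cost, the `rank` field on predictions, the `order_weight` parameter, and all names are hypothetical. The assignment is solved by brute force over permutations for clarity; a real system would use the Hungarian algorithm for the same minimum-cost assignment.

```python
import math
from itertools import permutations

def assignment_cost(pred, target, order_weight):
    """Center distance plus a penalty for disagreeing with the anatomical order."""
    dist = math.dist(pred["center"], target["center"])
    order_gap = abs(pred["rank"] - target["rank"])
    return dist + order_weight * order_gap

def match_instances(preds, targets, order_weight=1.0):
    """Minimum-cost one-to-one assignment of predictions to target teeth.

    Brute force is used for this toy problem only; it handles missing teeth
    because there may be fewer predictions than targets.
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(targets)), len(preds)):
        cost = sum(assignment_cost(p, targets[j], order_weight)
                   for p, j in zip(preds, perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {preds[i]["id"]: targets[j]["label"] for i, j in enumerate(best_perm)}

# Hypothetical targets laid out along one axis of the arch.
targets = [
    {"label": "T1", "center": (0.0, 0.0), "rank": 0},
    {"label": "T2", "center": (1.0, 0.0), "rank": 1},
    {"label": "T3", "center": (2.0, 0.0), "rank": 2},
]
# Two crowded predictions near T2; `rank` is their assumed position along the arch.
preds = [
    {"id": "A", "center": (0.9, 0.0), "rank": 0},
    {"id": "B", "center": (1.1, 0.0), "rank": 1},
]
labels = match_instances(preds, targets, order_weight=1.0)
```

By distance alone, assigning A and B to (T1, T2) or to (T2, T3) costs about the same; the order term separates the two options, which is the intuition behind adding anatomical order to the matching cost.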

[213] Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment

Henglin Liu,Nisha Huang,Chang Liu,Jiangpeng Yan,Huijuan Huang,Jixuan Ying,Tong-Yee Lee,Pengfei Wan,Xiangyang Ji

Main category: cs.CV

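TL;DR: This paper proposes ArtQuant, a new framework for assessing the aesthetic quality of artistic images, and builds RAD, a large-scale multi-dimensional dataset. By combining an LLM decoder with joint description generation, it addresses data scarcity and model fragmentation and achieves state-of-the-art performance on several datasets.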

Details

Motivation: Aesthetic quality assessment is challenging because it involves complex factors spanning visual perception, cognition, and emotion. Existing datasets are costly to annotate and cover few dimensions, and current models struggle with long-form aesthetic descriptions, which limits assessment quality. Method: Presents the Refined Aesthetic Description (RAD) dataset, which uses an iterative pipeline to automatically generate large-scale, multi-dimensional structured data, and designs the ArtQuant framework, which uses an LLM decoder for joint description generation to model multiple aesthetic dimensions together and capture long-text semantics. Result: Achieves state-of-the-art performance on several aesthetics assessment datasets while needing only 33% of the training epochs of conventional methods, improving both efficiency and accuracy. Theoretical analysis indicates that RAD's data adequacy and the generation paradigm together minimize prediction entropy. Conclusion: Combined with the RAD dataset, ArtQuant integrates multi-dimensional aesthetic information through a generative modeling paradigm, reduces reliance on expensive annotation, makes aesthetics assessment more comprehensive and efficient, and advances human-aligned quantitative evaluation for AIGC. Abstract: The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing datasets focus too heavily on visual perception and neglect deeper dimensions because manual annotation is expensive; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoders, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.

[214] DriveLaW:Unifying Planning and Video Generation in a Latent Driving World

Tianze Xia,Yongkang Li,Lijun Zhou,Jingfeng Yao,Kaixin Xiong,Haiyang Sun,Bing Wang,Kun Ma,Hangjun Ye,Wenyu Liu,Xinggang Wang

Main category: cs.CV

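TL;DR: DriveLaW proposes a new paradigm that unifies video generation and motion planning. By sharing a latent representation, it tightly integrates the world model with the planner and substantially improves prediction and planning performance in autonomous driving.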

Details

Motivation: In existing autonomous driving systems, the world model and motion planning are usually separate, which leads to inconsistency between prediction and decision-making and makes complex real-world scenarios hard to handle. Method: Proposes DriveLaW, which consists of DriveLaW-Video (generating high-fidelity future video) and DriveLaW-Act (diffusion planning on the video latent representation), with a three-stage progressive training strategy that optimizes both modules. Result: For video generation, FID improves by 33.3% and FVD by 1.8% over the best prior work; on the NAVSIM planning benchmark the method sets a new record, giving state-of-the-art results in both prediction and planning. Conclusion: Through unified modeling, DriveLaW makes world prediction and motion planning consistent, offers a more tightly coupled solution for autonomous driving, and pushes the performance limits of end-to-end systems. Abstract: World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.

[215] Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

Dohyun Kim,Seungwoo Lyu,Seung Wook Kim,Paul Hongsuck Seo

Main category: cs.CV

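TL;DR: This paper proposes Direct Diffusion Score Preference Optimization (DDSPO), a preference optimization method that needs no human annotation. It derives a supervision signal at every denoising step from winning and losing policies, improving text-image alignment and visual quality.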

Details

Motivation: Diffusion models perform well in text-to-image generation but struggle to align precisely with user intent while keeping aesthetic quality stable; existing preference-based training methods depend on human-labeled data that is costly and potentially noisy. Method: Proposes DDSPO, which extracts dense per-step supervision from winning and losing policies. A pretrained reference model is conditioned on the original prompt and on a semantically degraded prompt, and the contrast between its outputs yields preference signals automatically, enabling score-space preference optimization without explicit reward modeling or manual annotation. Result: Experiments show that DDSPO matches or outperforms existing methods in text-image alignment and visual quality while substantially reducing reliance on external labeled data. Conclusion: DDSPO offers an efficient, low-supervision-cost preference optimization framework for diffusion models that is suited to high-quality generation tasks. Abstract: Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: https://dohyun-as.github.io/DDSPO

[216] Towards Integrating Uncertainty for Domain-Agnostic Segmentation

Jesse Brouwers,Xiaoyan Xing,Alexander Timans

Main category: cs.CV

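TL;DR: This paper studies the role of uncertainty quantification in segmentation foundation models such as SAM, introduces the UncertSAM benchmark, and evaluates several lightweight uncertainty estimation methods. It finds that a last-layer Laplace approximation reflects segmentation errors well and gives preliminary evidence for uncertainty-guided prediction refinement.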

Details

Motivation: Segmentation foundation models such as SAM have strong zero-shot performance but remain fragile under distribution shift and in limited-knowledge domains; this work explores whether uncertainty quantification can improve their generalization across domains. Method: Builds the UncertSAM benchmark from eight challenging datasets, evaluates a range of lightweight, post-hoc uncertainty estimation methods, and tries an uncertainty-based prediction refinement strategy. Result: The uncertainty produced by a last-layer Laplace approximation correlates well with segmentation errors, indicating a meaningful signal; the benefit of uncertainty-guided refinement is still preliminary but promising. Conclusion: Incorporating uncertainty into segmentation models helps achieve more robust, domain-agnostic performance and merits further study. Abstract: Foundation models for segmentation such as the Segment Anything Model (SAM) family exhibit strong zero-shot performance, but remain vulnerable in shifted or limited-knowledge domains. This work investigates whether uncertainty quantification can mitigate such challenges and enhance model generalisability in a domain-agnostic manner. To this end, we (1) curate UncertSAM, a benchmark comprising eight datasets designed to stress-test SAM under challenging segmentation conditions including shadows, transparency, and camouflage; (2) evaluate a suite of lightweight, post-hoc uncertainty estimation methods; and (3) assess a preliminary uncertainty-guided prediction refinement step. Among evaluated approaches, a last-layer Laplace approximation yields uncertainty estimates that correlate well with segmentation errors, indicating a meaningful signal. While refinement benefits are preliminary, our findings underscore the potential of incorporating uncertainty into segmentation models to support robust, domain-agnostic performance. Our benchmark and code are made publicly available.

[217] Fuzzy-Logic and Deep Learning for Environmental Condition-Aware Road Surface Classification

Mustafa Demetgul,Sanja Lazarova Molnar

Main category: cs.CV

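TL;DR: This paper proposes a real-time road monitoring system based on weather condition data and road surface condition data. It collects images with a mobile phone camera, combines them with acceleration data, and classifies road surfaces with several deep learning algorithms, reaching over 95% accuracy.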

Details

Motivation: Classical road monitoring methods are expensive and unsystematic and require a lot of measurement time, so a low-cost, real-time way of monitoring road surface condition is needed. Method: Collects mobile phone camera images and vehicle acceleration data on campus roads, converts the acceleration data into image form, trains and compares several image-based deep learning models (AlexNet, LeNet, VGG, and ResNet), and uses fuzzy logic to choose between acceleration and image data for classification according to weather and time of day. Result: On a five-class road surface task (asphalt, damaged asphalt, gravel road, damaged gravel road, pavement road), the system achieves over 95% classification accuracy, supporting the use of both image and acceleration data together with the model selection strategy. Conclusion: A real-time road surface classification system based on deep learning and multi-source data is accurate and practical; with fuzzy logic it can pick the most suitable sensor input for the current environment, offering a workable approach for intelligent transportation and vehicle control systems. Abstract: Monitoring the state of road surfaces provides valuable information for planning and controlling vehicles and for active vehicle control systems. Classical road monitoring methods are expensive and unsystematic because they require time for measurements. This article proposes a real-time system based on weather condition data and road surface condition data. For this purpose, we collected data with a mobile phone camera on the roads around the campus of the Karlsruhe Institute of Technology. We tested a large number of different image-based deep learning algorithms for road classification. In addition, we used road acceleration data along with road image data for training by using them as images. We compared the performances of acceleration-based and camera image-based approaches. The performances of the simple AlexNet, LeNet, VGG, and ResNet algorithms were compared as deep learning algorithms. For road condition classification, 5 classes were considered: asphalt, damaged asphalt, gravel road, damaged gravel road, and pavement road; over 95% accuracy was achieved. It is also proposed to use the acceleration or the camera image to classify the road surface according to the weather and the time of day using fuzzy logic.
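The fuzzy-logic choice between camera and acceleration input can be illustrated with a minimal sketch. The abstract does not give the membership functions or rules, so everything here is an assumption made for illustration: the names `ramp_up` and `choose_sensor`, the rain and time-of-day thresholds, and the two rules themselves.

```python
def ramp_up(x, lo, hi):
    """Membership rising linearly from 0 at `lo` to 1 at `hi`."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def choose_sensor(rain_mm_per_h, hour):
    """Return 'camera' or 'acceleration' from two hypothetical fuzzy rules."""
    # Degree of "heavy rain": 0 below 1 mm/h, 1 above 8 mm/h (assumed thresholds).
    heavy_rain = ramp_up(rain_mm_per_h, 1.0, 8.0)
    # Degree of "dark": 0 between 08:00 and 16:00, 1 before 04:00 or
    # after 20:00, linear in between (assumed thresholds).
    dark = ramp_up(abs(hour - 12.0), 4.0, 8.0)
    # Rule 1: IF heavy rain OR dark THEN use acceleration (fuzzy OR = max).
    use_accel = max(heavy_rain, dark)
    # Rule 2: IF NOT heavy rain AND NOT dark THEN use camera (fuzzy AND = min).
    use_camera = min(1.0 - heavy_rain, 1.0 - dark)
    return "acceleration" if use_accel > use_camera else "camera"
```

With these assumed thresholds, `choose_sensor(0.0, 12)` selects the camera at noon in dry weather, while `choose_sensor(10.0, 12)` falls back to acceleration data in heavy rain.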

[218] RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction

Shuhong Liu,Chenyu Bao,Ziteng Cui,Yun Liu,Xuangeng Chu,Lin Gu,Marcos V. Conde,Ryo Umagami,Tomohiro Hashimoto,Zijian Hu,Tianhan Xu,Yuan Gan,Yusuke Kurose,Tatsuya Harada

Main category: cs.CV

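TL;DR: RealX3D is a real-capture benchmark for multi-view visual restoration and 3D reconstruction that covers several types of physical degradation and exposes how fragile existing methods are in challenging real-world environments.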

Details

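Motivation: Existing multi-view 3D reconstruction methods work well under ideal conditions, but their performance drops markedly under the physical degradations found in real scenes, and there is no unified benchmark for evaluating these cases. Method: Builds a real-capture dataset, RealX3D, covering four degradation families (illumination, scattering, occlusion, and blurring). A unified acquisition protocol yields pixel-aligned low-quality and ground-truth images, and the dataset provides high-resolution RAW images together with ground-truth 3D models derived from laser scans. Result: Evaluation across a range of optimization-based and feed-forward methods shows a substantial drop in reconstruction quality under physical degradation. Conclusion: Current multi-view 3D reconstruction methods remain fragile in challenging real-world environments, and RealX3D provides an important benchmark for developing more robust algorithms. Abstract: We introduce RealX3D, a real-capture benchmark for multi-view visual restoration and 3D reconstruction under diverse physical degradations. RealX3D groups corruptions into four families, including illumination, scattering, occlusion, and blurring, and captures each at multiple severity levels using a unified acquisition protocol that yields pixel-aligned LQ/GT views. Each scene includes high-resolution capture, RAW images, and dense laser scans, from which we derive world-scale meshes and metric depth. Benchmarking a broad range of optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, underscoring the fragility of current multi-view pipelines in real-world challenging environments.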

[219] CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models

Zongsheng Cao,Yangfan He,Anran Liu,Jun Xie,Feng Chen,Zepeng Wang

Main category: cs.CV

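TL;DR: This paper proposes CoFi-Dec, a training-free decoding framework that reduces hallucination in large vision-language models through generative self-feedback and coarse-to-fine visual conditioning.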

Details

Motivation: Large Vision-Language Models (LVLMs) have made notable progress in multi-modal understanding and generation, but they still hallucinate content that is inconsistent with the visual input, which limits their reliability in real-world applications. Method: CoFi-Dec first generates two intermediate textual responses from coarse-grained and fine-grained views of the image and uses a text-to-image model to turn them into synthetic images, forming multi-level visual hypotheses; it then applies a Wasserstein-based fusion mechanism that aligns the predictive distributions under the different visual conditions into a geometrically consistent decoding trajectory. Result: Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces entity-level and semantic-level hallucinations and outperforms existing decoding strategies. Conclusion: CoFi-Dec is a model-agnostic decoding framework that needs no additional training, applies to a wide range of LVLMs, and improves the robustness and faithfulness of their outputs. Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose CoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at https://github.com/AI-Researcher-Team/CoFi-Dec.

[220] Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin

Kayathri Vigneswaran,Hugo Retief,Jai Clifford Holmes,Mariangel Garcia Andarcia,Hansaka Tennakoon

Main category: cs.CV

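TL;DR: Proposes a hybrid framework that combines vision-based waterline detection, YOLOv8 pose estimation, and large multimodal language models (GPT-4o and Gemini 2.0 Flash) to read river gauge plates automatically, enabling accurate and scalable water level monitoring.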

Details

Motivation: Traditional hydrological observation is limited by manual measurement errors and environmental conditions and cannot meet the need for real-time, continuous water level monitoring; automated solutions are especially needed for flood warning and water resource management. Method: Uses a multi-stage pipeline: image preprocessing, annotation, vision-based waterline detection, scale gap estimation with YOLOv8, and numeric reading extraction with multimodal large language models, with geometric metadata added to improve prediction accuracy. Result: Waterline detection reaches a precision of 94.24% and an F1 score of 83.64%; under optimal image conditions, Gemini Stage 2 has a mean absolute error of 5.43 cm, a root mean square error of 8.58 cm, and an R² of 0.84; the results show that image quality and geometric calibration strongly affect LLM performance. Conclusion: By combining geometric information with multimodal AI, the method provides efficient and reliable automated water level monitoring, with potential for real-time river gauge digitization and better water resource management. Abstract: Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision-based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT-4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved a high precision of 94.24 percent and an F1 score of 83.64 percent, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real-time river gauge digitization and improved water resource management.
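The three error measures reported above (mean absolute error, root mean square error, and R squared) follow standard definitions, sketched below for a list of reference readings and predicted readings. The function name `regression_metrics` and the sample readings in the usage note are illustrative assumptions, not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired reference and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    # Mean absolute error: average magnitude of the errors.
    mae = sum(abs(e) for e in errors) / n
    # Root mean square error: penalizes large errors more strongly.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R^2: 1 minus residual sum of squares over total sum of squares.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2
```

For hypothetical gauge readings in centimetres, `regression_metrics([100, 120, 140, 160], [105, 115, 150, 160])` gives an MAE of 5.0 cm and an R² of 0.925. Because RMSE squares the errors before averaging, it is never smaller than MAE, which matches the pattern in the reported 5.43 cm and 8.58 cm.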

[221] Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators

Bohan Xiao,Peiyong Wang,Qisheng He,Ming Dong

Main category: cs.CV

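TL;DR: This paper proposes Dual-approx Bridge, a new generative model for image-to-image translation built on a denoising Brownian bridge with dual approximators; it produces high-fidelity, high-quality output, particularly on deterministic tasks such as super-resolution.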

Details

Motivation: Existing image-to-image translation methods fall short in keeping outputs consistent and high-fidelity, especially in tasks that require deterministic output, where it is hard to secure both image quality and faithfulness to the ground truth. Method: Proposes the Dual-approx Bridge model, which uses Brownian bridge dynamics and two neural network approximators, one for the forward process and one for the reverse process, to generate deterministic images with low variance and high quality. Result: Experiments on several benchmark datasets show that the method outperforms existing stochastic and deterministic baselines in both image quality and faithfulness to the ground truth. Conclusion: With its dual-approximator architecture and Brownian bridge mechanism, Dual-approx Bridge performs strongly on deterministic image-to-image translation and has broad application potential. Abstract: Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: https://github.com/bohan95/dual-app-bridge
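As background for the Brownian bridge dynamics mentioned above, the sketch below samples from a textbook Brownian bridge pinned at two endpoints, whose marginal at time t has mean (1 - t/T)·x0 + (t/T)·xT and variance t(T - t)/T. This is only the standard process, not the paper's model: its noise schedule and its two learned approximators are not specified in the abstract, and the name `bridge_sample` is an assumption for illustration.

```python
import math
import random


def bridge_sample(x0, xT, t, T=1.0, rng=random):
    """Sample x_t from a standard Brownian bridge from x0 (t=0) to xT (t=T).

    x0 and xT are equal-length lists of floats (e.g. flattened pixels).
    """
    w = t / T
    # The variance t(T - t)/T vanishes at both endpoints and peaks at t = T/2.
    std = math.sqrt(t * (T - t) / T)
    return [(1.0 - w) * a + w * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, xT)]
```

Since the noise term is zero at t = 0 and t = T, the sample coincides exactly with the source and target endpoints there, which is the property that makes bridge processes a natural fit for translating between paired images.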

[222] HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

Yuxin Wen,Qing Shuai,Di Kang,Jing Li,Cheng Wen,Yue Qian,Ningxin Jiao,Changhai Chen,Weijie Chen,Yiran Wang,Jinkun Guo,Dongyue An,Han Liu,Yanyu Tong,Chao Zhang,Qing Guo,Juan Chen,Qiao Zhang,Youyi Zhang,Zihao Yao,Cheng Zhang,Hong Duan,Xiaoping Wu,Qi Chen,Fei Cheng,Liang Dong,Peng He,Hao Zhang,Jiaxin Lin,Chao Zhang,Zhongyi Fan,Yifan Li,Zhichao Hu,Yuhong Liu,Linus,Jie Jiang,Xiaolong Li,Linchao Bao

Main category: cs.CV

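TL;DR: HY-Motion 1.0 is presented as the first billion-parameter-scale 3D human motion generation model based on a Diffusion Transformer; it generates high-quality motion from text and covers more than 200 motion categories.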

Details

Motivation: To move 3D human motion generation models toward commercial maturity and address the weaknesses of existing open-source models in instruction following and motion quality. Method: Adopts a DiT architecture and a full-stage training paradigm comprising large-scale pretraining, high-quality fine-tuning, and reinforcement learning from human feedback and reward models, supported by a rigorous data cleaning and captioning pipeline. Result: The model clearly outperforms current open-source baselines in motion diversity, text alignment, and generation quality, covering more than 200 motion types across 6 major classes. Conclusion: HY-Motion 1.0 successfully scales up motion generation models, offers strong instruction following and high-quality generation, and has been released to the community to support further research. Abstract: We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm (including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models) to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.

[223] MCI-Net: A Robust Multi-Domain Context Integration Network for Point Cloud Registration

Shuyuan Lin,Wenwu Peng,Junjie Huang,Qiang Qi,Miaohui Wang,Jian Weng

Main category: cs.CV

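TL;DR: Proposes MCI-Net, which improves feature representation and performance in point cloud registration through multi-domain context integration. It combines global graph structure modeling, intra-domain decoupling with inter-domain interaction, and dynamic inlier selection, reaching the highest registration recall of 96.4% on 3DMatch.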

Details

Motivation: Existing feature extraction methods based on Euclidean neighborhoods struggle to capture the implicit semantics and structural consistency of point clouds, which limits registration performance. Method: Proposes the multi-domain context integration network MCI-Net, comprising a graph neighborhood aggregation module that builds a global graph, a progressive context interaction module that performs intra-domain decoupling and inter-domain interaction, and a dynamic inlier selection method that optimizes inlier weights using residuals from multiple pose estimation iterations. Result: Experiments on indoor RGB-D and outdoor LiDAR datasets show that MCI-Net significantly outperforms existing state-of-the-art methods, achieving a registration recall of 96.4% on 3DMatch. Conclusion: Through multi-domain context integration, MCI-Net improves the discriminability and robustness of features for point cloud registration and delivers strong performance. Abstract: Robust and discriminative feature learning is critical for high-quality point cloud registration. However, existing deep learning-based methods typically rely on Euclidean neighborhood-based strategies for feature extraction, which struggle to effectively capture the implicit semantics and structural consistency in point clouds. To address these issues, we propose a multi-domain context integration network (MCI-Net) that improves feature representation and registration performance by aggregating contextual cues from diverse domains. Specifically, we propose a graph neighborhood aggregation module, which constructs a global graph to capture the overall structural relationships within point clouds. We then propose a progressive context interaction module to enhance feature discriminability by performing intra-domain feature decoupling and inter-domain context interaction. Finally, we design a dynamic inlier selection method that optimizes inlier weights using residual information from multiple iterations of pose estimation, thereby improving the accuracy and robustness of registration. Extensive experiments on indoor RGB-D and outdoor LiDAR datasets show that the proposed MCI-Net significantly outperforms existing state-of-the-art methods, achieving the highest registration recall of 96.4% on 3DMatch. Source code is available at http://www.linshuyuan.com.

[224] SC-Net: Robust Correspondence Learning via Spatial and Cross-Channel Context

Shuyuan Lin,Hailiang Liao,Qiang Qi,Junjie Huang,Taotao Lai,Jian Weng

Main category: cs.CV

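TL;DR: This paper proposes SC-Net, a new network that improves motion field estimation in two-view correspondence learning by integrating context from both spatial and channel perspectives, and it outperforms existing methods on several datasets.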

Details

Motivation: Existing CNN backbones may fail to aggregate global context effectively in scenes with large disparity and tend to oversmooth dense motion fields, so a network design better matched to the task is needed. Method: Proposes SC-Net, which has three modules: an adaptive focused regularization module (AFR) that strengthens position-awareness and robustness; a bilateral field adjustment module (BFA) that models long-range dependencies and promotes interaction across spatial and channel dimensions; and a position-aware recovery module (PAR) that recovers motion vectors precisely. Result: Experiments on the YFCC100M and SUN3D datasets show that SC-Net outperforms current state-of-the-art methods in relative pose estimation and outlier removal. Conclusion: By fusing bilateral context, SC-Net improves two-view correspondence learning and is more robust and accurate in complex scenes. Abstract: Recent research has focused on using convolutional neural networks (CNNs) as the backbones in two-view correspondence learning, demonstrating significant superiority over methods based on multilayer perceptrons. However, CNN backbones that are not tailored to specific tasks may fail to effectively aggregate global context and oversmooth dense motion fields in scenes with large disparity. To address these problems, we propose a novel network named SC-Net, which effectively integrates bilateral context from both spatial and channel perspectives. Specifically, we design an adaptive focused regularization module (AFR) to enhance the model's position-awareness and robustness against spurious motion samples, thereby facilitating the generation of a more accurate motion field. We then propose a bilateral field adjustment module (BFA) to refine the motion field by simultaneously modeling long-range relationships and facilitating interaction across spatial and channel dimensions. Finally, we recover the motion vectors from the refined field using a position-aware recovery module (PAR) that ensures consistency and precision. Extensive experiments demonstrate that SC-Net outperforms state-of-the-art methods in relative pose estimation and outlier removal tasks on YFCC100M and SUN3D datasets. Source code is available at http://www.linshuyuan.com.

[225] TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding

Zongsheng Cao,Yangfan He,Anran Liu,Feng Chen,Zepeng Wang,Jun Xie

Main category: cs.CV

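TL;DR: This paper proposes TV-RAG, a training-free framework that improves how large video language models handle long videos through temporal alignment and entropy-guided semantics. It combines time-decay retrieval with information-dense key-frame sampling and surpasses mainstream methods on several benchmarks.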

Details

Motivation: Existing large video language models are constrained by narrow temporal windows when processing long videos and struggle to capture fine-grained semantic changes over long durations; text-based retrieval methods ignore the temporal dependencies between modalities. Method: Proposes the TV-RAG framework: (i) a time-decay retrieval module that introduces explicit temporal offsets into the similarity computation to match text queries with their multimedia context more accurately; (ii) an entropy-weighted key-frame sampler that selects information-dense, evenly distributed key frames, reducing redundancy while preserving representativeness. Result: TV-RAG clearly outperforms existing mainstream baselines on long-video benchmarks including Video-MME, MLVU, and LongVideoBench, and it can be integrated into any large video language model without re-training or fine-tuning. Conclusion: By fusing temporal and semantic signals, TV-RAG enables efficient long-video reasoning and provides a lightweight, low-cost upgrade path, with results supporting its effectiveness for long-video understanding. Abstract: Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and (ii) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at https://github.com/AI-Researcher-Team/TV-RAG.
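The two mechanisms named above can be illustrated with a small sketch. The abstract gives no formulas, so this is one plausible reading and not the TV-RAG implementation: the exponential decay form, the `decay` rate, the use of histogram Shannon entropy as the information measure, the per-window selection, and all function names are assumptions.

```python
import math


def time_decay_score(similarity, query_time, segment_time, decay=0.01):
    """Down-weight a similarity score by the temporal offset in seconds."""
    return similarity * math.exp(-decay * abs(segment_time - query_time))


def histogram_entropy(counts):
    """Shannon entropy in bits of one frame's intensity histogram."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def pick_keyframes(frame_histograms, num_windows):
    """Split frames into equal windows; keep the highest-entropy frame of each."""
    n = len(frame_histograms)
    picks = []
    for w in range(num_windows):
        start = w * n // num_windows
        end = (w + 1) * n // num_windows
        if start == end:
            continue  # more windows than frames: skip the empty window
        best = max(range(start, end),
                   key=lambda i: histogram_entropy(frame_histograms[i]))
        picks.append(best)
    return picks
```

Choosing one frame per equal-length window keeps the selection roughly evenly spread over the video, while the entropy criterion favours frames with more varied content inside each window.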

[226] Multi-label Classification with Panoptic Context Aggregation Networks

Mingyuan Jiu,Hailong Zhu,Wenchuan Wei,Hichem Sahbi,Rongrong Ji,Mingliang Xu

Main category: cs.CV

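TL;DR: This paper proposes the Deep Panoptic Context Aggregation Network (PanCAN), which aggregates features across scales in a high-dimensional Hilbert space to integrate multi-order geometric context hierarchically, improving complex scene understanding and multi-label classification performance.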

Details

Motivation: Existing methods focus mainly on basic geometric relationships or local features and neglect cross-scale contextual interactions between objects, which limits recognition performance in complex scenes. Method: PanCAN combines random walks with an attention mechanism to learn multi-order neighborhood relationships at each scale; modules from different scales are cascaded, salient anchors at a finer scale are selected, and their neighborhood features are fused dynamically via attention to model cross-scale context. Result: Multi-label classification experiments on NUS-WIDE, PASCAL VOC2007, and MS-COCO show that PanCAN outperforms current state-of-the-art methods in both quantitative and qualitative evaluations. Conclusion: By integrating multi-order and cross-scale context features, PanCAN improves complex scene understanding in visual recognition and offers a new approach to multi-label image classification. Abstract: Context modeling is crucial for visual recognition, enabling highly discriminative image representations by integrating both intrinsic and extrinsic relationships between objects and labels in images. A limitation in current approaches is their focus on basic geometric relationships or localized features, often neglecting cross-scale contextual interactions between objects. This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts through cross-scale feature aggregation in a high-dimensional Hilbert space. Specifically, PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism. Modules from different scales are cascaded, where salient anchors at a finer scale are selected and their neighborhood features are dynamically fused via attention. This enables effective cross-scale modeling that significantly enhances complex scene understanding by combining multi-order and cross-scale context-aware features. Extensive multi-label classification experiments on NUS-WIDE, PASCAL VOC2007, and MS-COCO benchmarks demonstrate that PanCAN consistently achieves competitive results, outperforming state-of-the-art techniques in both quantitative and qualitative evaluations, thereby substantially improving multi-label classification performance.

[227] IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation

Donghao Zhou,Jingyu Lin,Guibao Shen,Quande Liu,Jialin Gao,Lihao Liu,Lan Du,Cunjian Chen,Chi-Wing Fu,Xiaowei Hu,Pheng-Ann Heng

Main category: cs.CV

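TL;DR: This paper proposes the IdentityStory framework to keep human character identity consistent across a sequence of generated images, with particularly strong results in face consistency and multi-character coordination.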

Details

Motivation: When generating human-centric stories, existing visual generative models struggle to keep facial details consistent and to coordinate multiple characters. Method: Proposes Iterative Identity Discovery to extract character identity features and Re-denoising Identity Injection to inject identity information during denoising while preserving the contextual content. Result: Experiments on the ConsiStory-Human benchmark show that the method outperforms existing approaches in face consistency and multi-character combinations. Conclusion: IdentityStory improves identity consistency for human characters in long image sequences and shows potential for applications such as infinite-length story generation and dynamic character composition. Abstract: Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.

[228] Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

Hexin Zhang,Dong Li,Jie Huang,Bingzhou Wang,Xueyang Fu,Zhengjun Zha

Main category: cs.CV

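TL;DR: Proposes IAFS, a training-free inference-time scaling framework that uses iterative refinement and adaptive frequency steering to balance perceptual quality and structural fidelity in diffusion-based image super-resolution.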

Details

Motivation: Existing diffusion models for image super-resolution have difficulty achieving both high-frequency perceptual quality and low-frequency structural fidelity, and current inference-time optimization strategies suffer from over-smoothing or structural inconsistency. Method: Designs the IAFS framework, which corrects structural deviations through iterative refinement and performs frequency-aware particle fusion by adaptively combining high-frequency perceptual cues with low-frequency structural information. Result: Experiments on several diffusion models show that IAFS outperforms existing methods in both perceptual detail and structural accuracy and eases the perception-fidelity conflict. Conclusion: IAFS provides an efficient, training-free inference-time optimization scheme for diffusion-based image super-resolution and improves the overall quality of generated images. Abstract: Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.

[229] AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

Binhe Yu,Zhen Wang,Kexin Li,Yuqian Yuan,Wenqiao Zhang,Long Chen,Juncheng Li,Jun Xiao,Yueting Zhuang

Main category: cs.CV

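TL;DR: This paper proposes AnyMS, a training-free framework for layout-guided multi-subject customization that uses a dual-level attention decoupling mechanism to balance text alignment, subject identity preservation, and layout control.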

Details

Motivation: Existing methods struggle to balance text alignment, subject identity preservation, and layout control, and their reliance on additional training limits scalability and efficiency. Method: Introduces a dual-level attention decoupling mechanism: global decoupling separates the cross-attention of textual and visual conditions to ensure text alignment; local decoupling confines each subject's attention to its designated area to prevent conflicts; pre-trained image adapters are used to extract subject features. Result: Experiments show that AnyMS achieves state-of-the-art performance on complex compositions and with larger numbers of subjects, supporting high-quality multi-subject image synthesis. Conclusion: AnyMS is an efficient, scalable, training-free method that delivers strong text alignment, identity preservation, and layout control in multi-subject customization. Abstract: Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as missing subjects or subject conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.

[230] PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis

Shengyi Hua,Jianfeng Wu,Tianle Shen,Kangzhe Hu,Zhongzhen Huang,Shujuan Ni,Zhihong Zhang,Yuan Li,Zhe Wang,Xiaofan Zhang

Main category: cs.CV

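TL;DR: Proposes PathFound, an agentic multimodal pathological diagnosis model that supports evidence-seeking inference and achieves state-of-the-art diagnostic performance across diverse clinical scenarios through proactive information acquisition and diagnosis refinement.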

Details

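Motivation: Most existing pathological foundation models use a static inference paradigm and cannot reassess or proactively acquire more evidence when a diagnosis is ambiguous, whereas clinical diagnosis typically refines its judgment through repeated observation and further examinations. A dynamic inference model that mimics this process is therefore needed.

Method: PathFound combines pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning. It defines a dynamic inference workflow with three stages (initial diagnosis, evidence seeking, and final decision) that supports proactive information acquisition and iterative diagnosis refinement.

Result: Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy. PathFound reaches state-of-the-art performance across diverse clinical scenarios and shows potential to discover subtle details such as nuclear features and local invasions.

Conclusion: The evidence-seeking inference paradigm is closer to real clinical workflows and consistently improves diagnostic accuracy; PathFound offers computational pathology a more interactive diagnostic framework.

Abstract: Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.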

[231] PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation

Zongsheng Cao,Yangfan He,Anran Liu,Jun Xie,Feng Chen,Zepeng Wang

Main category: cs.CV

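TL;DR: Proposes PurifyGen, a training-free, dual-stage method for safe text-to-image generation that purifies prompts through semantic-distance assessment and a dual-space transformation.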

Details

Motivation: Existing safety methods are easily circumvented or require extensive data and training, making it hard to prevent diffusion models from generating unsafe content.

Method: First, compute the complementary semantic distance between prompt tokens and predefined toxic and clean concepts to identify risky tokens. Then apply a dual-space transformation to the risky tokens: project them into the null space of the toxic concept matrix and align them with the range space of clean concepts, which removes toxic semantics while retaining the original intent.

Result: PurifyGen is validated on five datasets, where it outperforms existing training-free methods and is competitive with training-dependent ones, with good generalization and plug-and-play use.

Conclusion: PurifyGen offers an efficient, training-free, theoretically grounded safety solution for text-to-image generation that suppresses unsafe content.

Abstract: Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code is available at https://github.com/AI-Researcher-Team/PurifyGen.
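
The two projections named in the PurifyGen abstract are standard linear algebra and can be sketched in a few lines. The code below is an illustration only, assuming concept embeddings are the rows of the concept matrix; the helper names are hypothetical, and how PurifyGen actually combines the two projections is not specified in the abstract.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_to_range_space(e, concepts):
    """Orthogonal projection of embedding e onto span(concepts)."""
    out = [0.0] * len(e)
    for b in orthonormal_basis(concepts):
        c = dot(e, b)
        out = [oi + c * bi for oi, bi in zip(out, b)]
    return out

def project_to_null_space(e, concepts):
    """Remove from e every component lying in span(concepts).

    With concepts as matrix rows, the result x satisfies concepts @ x = 0.
    """
    in_span = project_to_range_space(e, concepts)
    return [ei - si for ei, si in zip(e, in_span)]

toxic = [[1.0, 0.0, 0.0]]   # toy toxic concept embedding (assumption)
clean = [[0.0, 1.0, 0.0]]   # toy clean concept embedding (assumption)
token = [2.0, 3.0, 4.0]
detoxified = project_to_null_space(token, toxic)
aligned = project_to_range_space(detoxified, clean)
```

Applying the two steps in sequence is a simplification chosen for the example; when the toxic and clean subspaces are not orthogonal, the order and weighting of the steps matter.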

[232] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Hanzheng Li,Xi Fang,Yixuan Li,Chaozheng Huang,Junjie Wang,Xi Wang,Hongzhe Bai,Bojun Hao,Shenyu Lin,Huiqi Liang,Linfeng Zhang,Guolin Ke

Main category: cs.CV

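TL;DR: RxnBench is a benchmark for evaluating multimodal large language models on chemical reaction understanding. It comprises single-figure QA and full-document QA tasks and reveals shortcomings of existing models in deep chemical reasoning and structural recognition.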

Details

Motivation: Explore how well multimodal large language models understand complex reaction schemes in authentic chemistry literature, to advance AI applications in chemistry.

Method: Build RxnBench, a multi-tiered benchmark with two tasks: SF-QA (fine-grained visual perception and mechanistic reasoning on reaction schemes) and FD-QA (full-document understanding that integrates text, schemes, and tables across modalities).

Result: Current MLLMs extract explicit text well but show marked deficiencies in deep chemical logic and precise structural recognition. Models with inference-time reasoning do better, yet none reaches 50% accuracy on FD-QA.

Conclusion: Domain-specific visual encoders and stronger reasoning engines are needed to improve the autonomy of AI in chemical research.

Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.

[233] ThinkGen: Generalized Thinking for Visual Generation

Siyu Jiao,Yiheng Lin,Yujie Zhong,Qi She,Wei Zhou,Xiaohan Lan,Zilong Huang,Fei Yu,Yingchen Yu,Yunqing Zhao,Yao Zhao,Yunchao Wei

Main category: cs.CV

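TL;DR: This paper proposes ThinkGen, the first visual generation framework driven by the Chain-of-Thought reasoning of multimodal large language models. It combines a decoupled architecture with a separable reinforcement learning training paradigm and achieves state-of-the-art performance across multiple generation tasks.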

Details

Motivation: Existing visual generation methods that use Chain-of-Thought reasoning are limited to scenario-specific mechanisms, lack generality and adaptability, and are hard to extend to diverse generation tasks.

Method: ThinkGen uses a decoupled architecture with a pretrained MLLM and a DiT: the MLLM generates instructions from user intent, and the DiT generates images guided by them. A separable GRPO-based training paradigm (SepGRPO) alternately optimizes the two modules.

Result: ThinkGen achieves robust, state-of-the-art performance on multiple generation benchmarks, supporting the effectiveness and generalization of CoT reasoning across diverse visual generation scenarios.

Conclusion: By explicitly leveraging the MLLM's CoT reasoning, ThinkGen generates visual content across scenarios and offers a more interpretable and controllable paradigm for generative AI.

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen
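
For readers unfamiliar with GRPO, the core of the method is a group-relative advantage: several outputs are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The sketch below shows that normalization plus a toy alternating schedule. It is an assumption-laden illustration: the use of the population standard deviation, the `eps` term, and the round-robin `alternating_schedule` helper are choices made here, not details taken from the SepGRPO paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group.

    rewards: rewards of several samples generated for the SAME prompt.
    Returns (r - mean) / (std + eps) for each sample.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

def alternating_schedule(num_rounds, modules=("mllm", "dit")):
    """Hypothetical round-robin plan: which module is updated in each round."""
    return [modules[i % len(modules)] for i in range(num_rounds)]

advantages = group_relative_advantages([1.0, 2.0, 3.0])
plan = alternating_schedule(4)
```

Samples that beat their group's average get a positive advantage and are reinforced; when all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.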

[234] Image Denoising Using Global and Local Circulant Representation

Zhaoming Kong,Xiaowei Yang,Jiahuan Zhang

Main category: cs.CV

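TL;DR: This paper proposes Haar-tSVD, a new image denoising method that combines the Haar transform with tensor singular value decomposition for fast, efficient denoising without learning local bases, and adds adaptive noise estimation and deep networks to improve performance.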

Details

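Motivation: Growing volumes of image data increase the demand for denoising, and existing methods struggle to balance speed and performance. This paper aims to establish a theoretical connection between PCA and the Haar transform and to propose an efficient and effective denoising algorithm.

Method: Building on the theoretical connection between PCA and the Haar transform under circulant representation, the paper proposes Haar-tSVD, which uses a unified tensor SVD (t-SVD) projection combined with the Haar transform to capture global and local patch correlations, and introduces adaptive noise estimation. Deep neural networks are further integrated to improve performance under severe noise.

Result: Experiments on several denoising datasets show that Haar-tSVD performs well in both denoising quality and computational efficiency, with particular advantages in parallelization and in not needing to learn local bases. Integrating deep networks further improves performance under severe noise.

Conclusion: Haar-tSVD is an efficient, simple, and extensible image denoising method that improves speed while maintaining strong performance; theory and experiments support its effectiveness and robustness.

Abstract: The proliferation of imaging devices and countless image data generated every day impose an increasingly high demand on efficient and effective image denoising. In this paper, we establish a theoretical connection between principal component analysis (PCA) and the Haar transform under circulant representation, and present a computationally simple denoising algorithm. The proposed method, termed Haar-tSVD, exploits a unified tensor singular value decomposition (t-SVD) projection combined with Haar transform to efficiently capture global and local patch correlations. Haar-tSVD operates as a one-step, parallelizable plug-and-play denoiser that eliminates the need for learning local bases, thereby striking a balance between denoising speed and performance. Besides, an adaptive noise estimation scheme is introduced to improve robustness according to eigenvalue analysis of the circulant structure. To further enhance the performance under severe noise conditions, we integrate deep neural networks with Haar-tSVD based on the established Haar-PCA relationship. Experimental results on various denoising datasets demonstrate the efficiency and effectiveness of proposed method for noise removal. Our code is publicly available at https://github.com/ZhaomingKong/Haar-tSVD.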

[235] ProGuard: Towards Proactive Multimodal Safeguard

Shaohan Yu,Lijun Li,Chenyang Si,Lu Sheng,Jing Shao

Main category: cs.CV

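TL;DR: ProGuard is a proactive multimodal safeguard built on a vision-language model and trained with reinforcement learning to detect and describe out-of-distribution (OOD) safety risks; it substantially outperforms existing open-source guard models.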

Details

Motivation: Existing defenses are limited in the face of new multimodal safety risks brought by generative models, and traditional reactive approaches require frequent model adjustments, making it hard to keep up with rapidly evolving threats.

Method: Build a modality-balanced dataset of 87K samples annotated under a hierarchical multimodal safety taxonomy; train a vision-language base model on it purely with reinforcement learning; and introduce an OOD safety category inference task with a synonym-bank-based similarity reward to strengthen the description of unseen risks.

Result: ProGuard is comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. It improves OOD risk detection by 52.6% and OOD risk description by 64.8%.

Conclusion: ProGuard provides proactive multimodal safeguarding without model adjustments, with clear advantages in detecting and describing unknown safety risks, offering a more efficient and scalable response to the safety challenges of generative models.

Abstract: The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.

[236] LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Ethan Chern,Zhulin Hu,Bohao Tang,Jiadi Su,Steffi Chern,Zhijie Deng,Pengfei Liu

Main category: cs.CV

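TL;DR: This paper proposes an improved distillation method for real-time interactive video diffusion conditioned on multimodal inputs (text, image, audio), substantially reducing inference cost and latency. It also builds LiveTalk, a real-time multimodal interactive system that outperforms state-of-the-art models in multi-turn coherence and content quality.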

Details

Motivation: Existing video diffusion models cannot support real-time interaction because they denoise all frames with bidirectional attention in an iterative process, and existing distillation methods focus mainly on text-to-video generation, which does not support natural, efficient human-AI interaction. This paper addresses real-time interactive video generation under multimodal conditioning.

Method: An improved distillation recipe that emphasizes the quality of condition inputs and the initialization and schedule of on-policy optimization, making the model autoregressive and reducing sampling steps to support real-time generation. The model is combined with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build the LiveTalk system.

Result: On multimodal-conditioned avatar video generation benchmarks including HDTF, AVSpeech, and CelebV-HQ, the distilled model matches the visual quality of full-step bidirectional baselines of similar or larger size with 20x less inference cost and latency. LiveTalk shows better video coherence and content quality in multi-turn interaction and reduces response latency from 1 to 2 minutes to real time.

Conclusion: The method addresses the real-time problem of video diffusion models under multimodal conditioning, enabling high-quality, low-latency interactive video generation and advancing general-purpose multimodal human-AI interaction systems.

Abstract: Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.

[237] Same or Not? Enhancing Visual Perception in Vision-Language Models

Damiano Marsili,Aditya Mehta,Ryan Y. Lin,Georgia Gkioxari

Main category: cs.CV

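TL;DR: This paper introduces TWIN, a large-scale dataset of 561,000 image-pair queries designed to improve the fine-grained perception of vision-language models (VLMs). Fine-tuning on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks, without hurting performance on general VQA benchmarks. The paper also introduces the FGVQA benchmark suite to measure these gains.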

Details

Motivation: Existing VLMs are strong at broad visual understanding but pay little attention to subtle visual details; training corpora emphasize general recognition and neglect fine-grained perception. New datasets and tasks are therefore needed to strengthen fine-grained discrimination.

Method: Build TWIN, a large-scale image-pair dataset with 561,000 queries that asks models to decide whether two visually similar images depict the same object, pushing them to attend to subtle visual differences. Fine-tune existing VLMs on it, and introduce FGVQA as a new benchmark for fine-grained recognition.

Result: VLMs fine-tuned on TWIN improve by up to 19.3% on FGVQA, generalize to unseen domains such as art, animals, plants, and landmarks, and keep their performance on general VQA tasks. The analysis also shows that data scale is key to performance.

Conclusion: TWIN improves the fine-grained visual perception of VLMs and can serve as a drop-in addition to open-source training corpora, advancing the perceptual precision of future models.

Abstract: Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/

[238] Detection Fire in Camera RGB-NIR

Nguyen Truong Khai,Luong Duc Vinh

Main category: cs.CV

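TL;DR: This paper presents three contributions for improving fire detection, particularly at night: a new near-infrared (NIR) dataset, a two-stage detection model (YOLOv11 + EfficientNetV2-B0) that reduces false positives from artificial lights, and Patched-YOLO for better small-object detection in RGB images.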

Details

Motivation: Existing fire detection models tend to misclassify artificial lights as fire when using infrared images at night, and limited datasets constrain further gains.

Method: (1) Build an NIR dataset using data augmentation strategies; (2) design a two-stage detection pipeline combining YOLOv11 and EfficientNetV2-B0; (3) propose Patched-YOLO, which improves small-object detection through patch-based processing.

Result: The two-stage model achieves higher accuracy than previous methods for night-time fire detection and reduces false positives caused by artificial lights. Patched-YOLO improves detection of small and distant fires in RGB images.

Conclusion: Data augmentation, a two-stage detection architecture, and patch-based processing together improve fire detection in infrared and RGB images, especially in reducing false positives and detecting small objects.

Abstract: Improving the accuracy of fire detection using infrared night vision cameras remains a challenging task. Previous studies have reported strong performance with popular detection models. For example, YOLOv7 achieved an mAP50-95 of 0.51 using an input image size of 640 x 1280, RT-DETR reached an mAP50-95 of 0.65 with an image size of 640 x 640, and YOLOv9 obtained an mAP50-95 of 0.598 at the same resolution. Despite these results, limitations in dataset construction continue to cause issues, particularly the frequent misclassification of bright artificial lights as fire. This report presents three main contributions: an additional NIR dataset, a two-stage detection model, and Patched-YOLO. First, to address data scarcity, we explore and apply various data augmentation strategies for both the NIR dataset and the classification dataset. Second, to improve night-time fire detection accuracy while reducing false positives caused by artificial lights, we propose a two-stage pipeline combining YOLOv11 and EfficientNetV2-B0. The proposed approach achieves higher detection accuracy compared to previous methods, particularly for night-time fire detection. Third, to improve fire detection in RGB images, especially for small and distant objects, we introduce Patched-YOLO, which enhances the model's detection capability through patch-based processing. Further details of these contributions are discussed in the following sections.
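
Patch-based processing for small-object detection usually means tiling the image, running the detector on each tile, and mapping the boxes back to full-image coordinates. The sketch below shows only that tiling and coordinate bookkeeping. The patch size, the overlap ratio, and all function names are assumptions made for illustration; the abstract does not describe how Patched-YOLO actually tiles or merges detections.

```python
def tile_origins(length, patch, stride):
    """Start offsets so that patches of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # last patch sits flush with the border
    return origins

def make_patches(width, height, patch=640, overlap=0.5):
    """Return patch boxes (x1, y1, x2, y2) in row-major order."""
    stride = max(1, int(patch * (1 - overlap)))
    return [(x, y, min(x + patch, width), min(y + patch, height))
            for y in tile_origins(height, patch, stride)
            for x in tile_origins(width, patch, stride)]

def to_image_coords(box, patch_box):
    """Shift a box detected inside a patch back to full-image coordinates."""
    x1, y1, x2, y2 = box
    px, py = patch_box[0], patch_box[1]
    return (x1 + px, y1 + py, x2 + px, y2 + py)

patches = make_patches(1280, 640, patch=640, overlap=0.5)
# A detector (not shown) would run on each patch; its boxes are then shifted:
shifted = to_image_coords((10, 20, 30, 40), patches[-1])
```

Overlapping tiles reduce the chance that a small flame is cut in half at a patch border, at the cost of duplicate detections that a merge step such as non-maximum suppression would then have to remove.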

[239] Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging

Janani Annur Thiruvengadam,Kiran Mayee Nabigaru,Anusha Kovi

Main category: cs.CV

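TL;DR: Proposes a Scalable Residual Feature Aggregation (SRFA) framework for early detection of pancreatic tumors in CT images. It combines MAGRes-UNet segmentation, DenseNet-121 feature extraction, HHO-BA feature selection, and a hybrid ViT and EfficientNet-B3 classifier, tuned with a dual optimization mechanism, and outperforms traditional methods in accuracy, F1-score, and specificity.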

Details

Motivation: Pancreatic tumors show low contrast and large anatomical variation on CT images, making early detection difficult; existing methods struggle to highlight subtle visual cues and to generalize across multimodal data.

Method: The SRFA framework first segments pancreatic structures with preprocessing and MAGRes-UNet, then extracts deep residual features with DenseNet-121, selects features with the hybrid HHO-BA metaheuristic, and finally classifies with a hybrid of Vision Transformer and EfficientNet-B3, with hyperparameters tuned by a dual SSA and GWO optimization mechanism.

Result: The model reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity, significantly outperforming traditional CNNs and contemporary transformer-based models.

Conclusion: The SRFA framework improves the accuracy and robustness of early pancreatic tumor detection and shows potential for clinical use.

Abstract: The early detection of pancreatic neoplasm is a major clinical challenge, predominantly because tumors tend to present with minimal contrast margins and large anatomical variation among patients on CT scans. Addressing these complexities requires an effective and scalable system that can enhance the salience of subtle visual cues and generalize well across multimodal imaging data. This study proposes a Scalable Residual Feature Aggregation (SRFA) framework to meet these requirements. The framework integrates a preprocessing pipeline followed by segmentation using MAGRes-UNet, which makes pancreatic structures more visible and isolates regions of interest. DenseNet-121 with residual feature storage is used for feature extraction, allowing deep hierarchical features to be aggregated without loss of properties. Next, a hybrid HHO-BA metaheuristic feature selection strategy refines the feature subset. For classification, the system uses a new hybrid model that combines the global attention of the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO fine-tunes hyperparameters to improve robustness and reduce overfitting. Experimental results show a significant performance improvement: the proposed model reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity, significantly outperforming traditional CNNs and contemporary transformer-based models. These results highlight the potential of the SRFA framework as a useful tool for the early detection of pancreatic tumors.

[240] Memorization in 3D Shape Generation: An Empirical Study

Shu Pu,Boya Zeng,Kaichen Zhou,Mengyu Wang,Zhuang Liu

Main category: cs.CV

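TL;DR: This paper proposes a framework for evaluating memorization in 3D generative models, analyzes experimentally how data and modeling design affect memorization, and suggests effective strategies that reduce memorization without degrading generation quality.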

Details

Motivation: Understand whether 3D generative models rely on memorizing training data, in order to prevent data leakage and improve the diversity of generated results.

Method: Design an evaluation framework to quantify memorization in 3D generative models, and run controlled experiments with a latent vector-set (Vecset) diffusion model to study how data modality, data diversity, conditioning granularity, guidance scale, Vecset length, and rotation augmentation affect memorization.

Result: Memorization depends on data modality, increases with data diversity and finer-grained conditioning, peaks at a moderate guidance scale, and is mitigated by longer Vecsets and simple rotation augmentation.

Conclusion: The study provides an empirical understanding of memorization in 3D generative models and suggests simple, effective mitigation strategies.

Abstract: Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at https://github.com/zlab-princeton/3d_mem.

[241] Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception

Xiaoyu Li,Peidong Li,Xian Wu,Long Shi,Dedong Liu,Yitao Wu,Jiajia Fu,Dixiao Cui,Lijun Zhao,Lining Sun

Main category: cs.CV

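TL;DR: This paper proposes HAT, a spatio-temporal alignment module that adaptively selects the optimal alignment proposal for each object through multi-hypothesis decoding, improving the robustness and performance of end-to-end perception in autonomous driving.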

Details

Motivation: Existing methods rely on a unified explicit motion model and semantic features for cross-frame alignment, which copes poorly with variations in motion states and object features across categories and frames and leads to suboptimal alignment.

Method: HAT uses multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances, then performs multi-hypothesis decoding using semantic and motion cues embedded in cached object queries to obtain the optimal alignment for the target frame.

Result: On nuScenes, HAT improves 3D temporal detectors and trackers across diverse baselines, reaching state-of-the-art tracking with 46.0% AMOTA when paired with DETR3D. In an end-to-end autonomous driving method it improves mAP by 1.3% and AMOTA by 3.1% and reduces the collision rate by 32%. Under corrupted semantics it maintains more robust perception and planning.

Conclusion: By combining multi-hypothesis motion modeling with joint semantic-motion decoding, HAT achieves better spatio-temporal alignment and improves perception accuracy and safety in complex dynamic environments.

Abstract: Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on the attention mechanism to align objects across frames, simplifying the motion model with a unified explicit physical model (constant velocity, etc.). These approaches prefer semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), the enhancement of motion modeling by HAT enables more robust perception and planning in the E2E AD.
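
The first step described for HAT, generating one spatial anchor per explicit motion model, can be sketched as follows. This is a toy illustration under stated assumptions: the two motion models, the 2D state, and the score-based `select_hypothesis` helper are hypothetical stand-ins. In the paper the choice among hypotheses is made by a learned decoder using cached object queries, which a hand-written score cannot reproduce.

```python
def static_model(pos, vel, dt):
    """Hypothesis 1: the object did not move between frames."""
    return tuple(pos)

def constant_velocity_model(pos, vel, dt):
    """Hypothesis 2: the object kept its last estimated velocity."""
    return tuple(p + v * dt for p, v in zip(pos, vel))

MOTION_MODELS = {
    "static": static_model,
    "constant_velocity": constant_velocity_model,
}

def anchor_hypotheses(pos, vel, dt):
    """One candidate anchor in the target frame per motion model."""
    return {name: model(pos, vel, dt) for name, model in MOTION_MODELS.items()}

def select_hypothesis(hypotheses, scores):
    """Pick the highest-scoring hypothesis (placeholder for learned decoding)."""
    best = max(scores, key=scores.get)
    return best, hypotheses[best]

candidates = anchor_hypotheses(pos=(0.0, 0.0), vel=(2.0, 1.0), dt=0.5)
name, anchor = select_hypothesis(
    candidates, {"static": 0.1, "constant_velocity": 0.9}
)
```

Keeping several hypotheses per object is what lets a parked car and a turning cyclist be aligned with different motion assumptions within the same frame.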

[242] OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Keda Tao,Wenjie Du,Bohan Yu,Weiqiang Wang,Jian Liu,Huan Wang

Main category: cs.CV

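TL;DR: OmniAgent is a fully audio-guided active perception agent that achieves fine-grained audio-visual reasoning through dynamic planning and a coarse-to-fine audio-guided paradigm, surpassing existing models by 10% to 20% accuracy on multiple benchmarks.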

Details

Motivation: Existing omnimodal large models fall short in cross-modal understanding and alignment, and in particular lack fine-grained joint audio-visual analysis.

Method: OmniAgent uses dynamic planning to autonomously invoke specialized tools and leverages audio cues to localize temporal events and guide subsequent reasoning, shifting the paradigm from passive response to active inquiry.

Result: Experiments on three audio-video understanding benchmarks show that OmniAgent surpasses leading open-source and proprietary models by 10% to 20% accuracy.

Conclusion: Through its audio-guided active perception mechanism, OmniAgent achieves finer-grained cross-modal understanding and moves multimodal systems toward active reasoning.

Abstract: Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack the fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.

[243] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition

Kang Du,Yirui Guan,Zeyu Wang

Main category: cs.CV

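TL;DR: Proposes the Intrinsic Decomposition Transformer (IDT), a transformer-based feed-forward framework for multi-view intrinsic image decomposition. It uses a physically grounded model to decompose images into diffuse reflectance, diffuse shading, and specular shading, substantially improving multi-view consistency.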

Details

Motivation: Single-view intrinsic decomposition has progressed well, but multi-view settings suffer from severe view inconsistency, and existing diffusion-based methods are hard to extend directly to them.

Method: IDT uses transformer attention to reason jointly over multiple input images and a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading, producing view-consistent intrinsic components in a single forward pass.

Result: Experiments on synthetic and real-world datasets show that IDT yields cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components than prior methods, and substantially improves multi-view consistency.

Conclusion: By combining a physical model with a transformer architecture, IDT achieves multi-view intrinsic image decomposition with good interpretability, controllability, and view consistency.

Abstract: Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose Intrinsic Decomposition Transformer (IDT), a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
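
The three factors named in the IDT abstract are commonly related by an additive image formation model. The equation below is that common formulation, written out as an assumption: the abstract names the factors but does not state the exact equation, and the symbols $A$, $S_d$, $S_s$ are introduced here for illustration.

```latex
\[
  I(\mathbf{p}) \;=\; A(\mathbf{p}) \odot S_d(\mathbf{p}) \;+\; S_s(\mathbf{p})
\]
% I   : observed RGB image at pixel p
% A   : diffuse reflectance (albedo), a view-independent material property
% S_d : diffuse shading (Lambertian light transport)
% S_s : specular shading (non-Lambertian, view-dependent)
% \odot denotes the per-pixel, per-channel product
```

Under this form the product term carries the Lambertian part of the image and the additive term carries the view-dependent highlights, which is why separating them helps consistency across views.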

[244] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Shaocong Xu,Songlin Wei,Qizhe Wei,Zheng Geng,Hong Li,Licheng Shen,Qianpu Sun,Shu Han,Bin Ma,Bohan Li,Chongjie Ye,Yuhang Zheng,Nan Wang,Saining Zhang,Hao Zhao

Main category: cs.CV

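TL;DR: This paper proposes a new approach that uses video diffusion models to tackle transparent-object perception. By building the large-scale synthetic dataset TransPhy3D and training lightweight adapters, it achieves high-quality, temporally consistent depth and normal estimation in transparent and reflective scenes and improves robotic grasping performance.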

Details

Motivation: Refraction, reflection, and transmission cause conventional depth perception to fail on transparent objects, leaving no stable, accurate estimates. Modern video diffusion models appear to have implicitly learned the optical rules and can be used to address this.

Method: Build TransPhy3D, a synthetic video dataset of 11k sequences rendered with Blender/Cycles using physically based rendering to obtain RGB, depth, and normals. Starting from a large video diffusion model, train LoRA adapters to map RGB to depth and normals, concatenating RGB and noisy depth latents in the DiT backbone and co-training with existing datasets, to perform video-to-video translation.

Result: The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks (ClearPose, DREDS, TransPhy3D-Test), improving accuracy and temporal consistency. The 1.3B version runs at about 0.17 s/frame, and the normal variant achieves the best results on ClearPose. Integrated into a grasping stack, it raises success rates on translucent, reflective, and diffuse surfaces.

Conclusion: Generative video priors can be repurposed, efficiently and label-free, into robust and temporally coherent perception, supporting the claim that "diffusion knows transparency" and pointing to a new direction for challenging real-world manipulation.

Abstract: Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.

[245] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Hau-Shiang Shiu,Chin-Yang Lin,Zhixiang Wang,Chi-Wei Hsiao,Po-Fan Yu,Yu-Chih Chen,Yu-Lun Liu

Main category: cs.CV

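TL;DR: This paper proposes Stream-DiffVSR, a causally conditioned diffusion framework for efficient online video super-resolution (VSR). Using only past frames, it cuts latency substantially while improving perceptual quality, and is the first diffusion-based VSR method suitable for low-latency online deployment.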

Details

Motivation: Existing diffusion-based VSR methods rely on future frames and on costly multi-step denoising, which makes them impractical for latency-sensitive applications. An efficient online VSR method is therefore needed that runs under strictly causal conditions with low latency. Method: Proposes Stream-DiffVSR, which uses a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Result: Processes a 720p frame in 0.328 seconds on an RTX4090, far below the latency of prior diffusion-based methods; initial delay drops from over 4600 seconds to 0.328 seconds, latency is more than 130x lower than that of the online SOTA method TMP, and LPIPS improves by 0.095. Conclusion: Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR and is the first diffusion VSR method suitable for low-latency online deployment, combining high quality with near-real-time operation. Abstract: Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/
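
The causal, auto-regressive control flow described in the abstract can be sketched as a streaming loop: each output depends only on the current low-resolution frame and the previous output, and each frame receives a fixed four denoising steps. This is a minimal illustration, not the authors' code. `stream_vsr`, `toy_warp`, and `toy_denoise_step` are hypothetical stand-ins (the real ARTG module, denoiser, and decoder are learned networks), and frames are represented as short lists of floats.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Frame = List[float]  # toy stand-in for an image or latent

def stream_vsr(
    frames: Iterable[Frame],
    denoise_step: Callable[..., Frame],
    warp: Callable[[Frame, Frame], Frame],
    num_steps: int = 4,
) -> Iterator[Frame]:
    """Yield one output per input frame, using only the past output as guidance."""
    prev_out: Optional[Frame] = None
    for lr in frames:
        # Guidance comes from the previous output only; no future frames are read.
        guide = warp(prev_out, lr) if prev_out is not None else None
        latent = list(lr)  # toy initialisation
        for step in range(num_steps):
            latent = denoise_step(latent, lr, guide, step)
        prev_out = latent
        yield latent

def toy_warp(prev_out: Frame, lr: Frame) -> Frame:
    """Hypothetical motion alignment; identity here."""
    return prev_out

calls = []

def toy_denoise_step(latent, lr, guide, step):
    """Hypothetical denoiser: pulls the latent toward the input and the guide."""
    calls.append(step)
    if guide is None:
        return [0.5 * (z + x) for z, x in zip(latent, lr)]
    return [0.5 * z + 0.25 * x + 0.25 * g for z, x, g in zip(latent, lr, guide)]

consumed = []

def camera():
    """Simulated live source that records when each frame is actually read."""
    for i in range(3):
        consumed.append(i)
        yield [float(i)] * 2

stream = stream_vsr(camera(), toy_denoise_step, toy_warp)
first = next(stream)
frames_read_before_first_output = len(consumed)
rest = list(stream)

# Latency arithmetic using the figures quoted in the abstract.
seconds_per_frame = 0.328
frames_per_second = 1.0 / seconds_per_frame
initial_delay_ratio = 4600.0 / seconds_per_frame  # lower bound: "over 4600 seconds"
```

The first output is produced after reading a single input frame, which is why initial delay equals per-frame latency in a causal design, whereas a method that needs future frames must wait for them before emitting anything. At 0.328 s per frame the quoted throughput is roughly 3 frames per second.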

Table of Contents

cs.CL

[1] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

[2] Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces

[3] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

[4] The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach

[5] Towards Efficient Post-Training via Fourier-Driven Adapter Architectures

[6] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition

[7] Hallucination Detection and Evaluation of Large Language Model

[8] HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG

[9] Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models

[10] Constituency Structure over Eojeol in Korean Treebanks

[11] ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation

[12] Learning When Not to Attend Globally

[13] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis

[14] Chain-of-thought Reviewing and Correction for Time Series Question Answering

[15] M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

[16] On the Role of Discreteness in Diffusion LLMs

[17] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

[18] Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

[19] Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency

[20] GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages

[21] Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages

[22] Mitigating Social Desirability Bias in Random Silicon Sampling

[23] Data Augmentation for Classification of Negative Pregnancy Outcomes in Imbalanced Data

[24] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

[25] Harnessing Large Language Models for Biomedical Named Entity Recognition

[26] Text-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis

[27] Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language

[28] CNSight: Evaluation of Clinical Note Segmentation Tools

[29] NepEMO: A Multi-Label Emotion and Sentiment Analysis on Nepali Reddit with Linguistic Insights and Temporal Trends

[30] AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning

[31] Diversity or Precision? A Deep Dive into Next Token Prediction

[32] Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

[33] Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping

[34] LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

[35] Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

[36] Accelerating Language Model Workflows with Prompt Choreography

[37] TabiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish

[38] Reservoir Computing inspired Matrix Multiplication-free Language Model

[39] Not too long do read: Evaluating LLM-generated extreme scientific summaries

[40] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

[41] Anka: A Domain-Specific Language for Reliable LLM Code Generation

[42] Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

[43] Chinese Morph Resolution in E-commerce Live Streaming Scenarios

[44] AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration

[45] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

[46] A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation

[47] Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data

[48] The Effect of Gender Diversity on Scientific Team Impact: A Team Roles Perspective

[49] C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs

[50] ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning

[51] Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

[52] Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

[53] Automatic Detection of Complex Quotation Patterns in Aggadic Literature

[54] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

[55] Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias

[56] Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs

[57] Instruction-Following Evaluation of Large Vision-Language Models

[58] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

[59] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

[60] A Dataset and Benchmark for Consumer Healthcare Question Summarization

[61] Nested Browser-Use Learning for Agentic Information Seeking

[62] Less is more: Probabilistic reduction is best explained by small-scale predictability measures

[63] Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

[64] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

[65] Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans

[66] Eliciting Behaviors in Multi-Turn Conversations

cs.CV

[67] Characterizing Motion Encoding in Video Diffusion Timesteps

[68] Real-Time American Sign Language Recognition Using 3D Convolutional Neural Networks and LSTM: Architecture, Training, and Deployment

[69] Enhancing Medical Data Analysis through AI-Enhanced Locally Linear Embedding: Applications in Medical Point Location and Imagery

[70] Unbiased Visual Reasoning with Controlled Visual Inputs

[71] SAMM2D: Scale-Aware Multi-Modal 2D Dual-Encoder for High-Sensitivity Intracrania Aneurysm Screening

[72] HookMIL: Revisiting Context Modeling in Multiple Instance Learning for Computational Pathology

[73] Tiny-YOLOSAM: Fast Hybrid Image Segmentation

[74] Quadrant Segmentation VLM with Few-Shot Adaptation and OCT Learning-based Explainability Methods for Diabetic Retinopathy

[75] TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting

[76] A CNN-Based Malaria Diagnosis from Blood Cell Images with SHAP and LIME Explainability

[77] Signal-SGN++: Topology-Enhanced Time-Frequency Spiking Graph Network for Skeleton-Based Action Recognition