cs.CL

[1] Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis

Angelina Parfenova,Andreas Marfurt,Alexander Denzler,Juergen Pfeffer

Main category: cs.CL

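TL;DR: This study examines inductive coding with large language models (LLMs) in qualitative data analysis, comparing six open-source LLMs with human experts on a coding task. Humans do better on complex sentences, LLMs do better on simple ones, and both show systematic deviations in the labels they generate.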

Details Motivation: To explore whether LLMs can perform inductive coding in qualitative research, addressing the limited flexibility and adaptability of traditional automated approaches that depend on predefined labels. Method: Six open-source LLMs are compared with human experts on inductive coding of text quotes. Experts also rate the difficulty of each quote and the quality of the annotations, and all labels are compared against the golden standard from the test set. Result: Human coders perform well on complex sentences but are less accurate on simple ones, while LLMs show the opposite pattern. Some LLMs come closer to the golden standard than humans do, yet receive lower scores in human evaluation. Conclusion: LLMs show promise for qualitative data analysis, especially on simple, unambiguous text, but their output still trails human output in acceptability and interpretability and needs further refinement to be practical. Abstract: This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research investigates the inductive process where labels emerge from the data. The study evaluates the performance of six open-source LLMs compared to human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human and LLM-generated labels by comparing them to the golden standard from the test set. While human annotations may sometimes differ from the golden standard, they are often rated more favorably by other humans. In contrast, some LLMs demonstrate closer alignment with the true labels but receive lower evaluations from experts.

[2] Emergent Convergence in Multi-Agent LLM Annotation

Angelina Parfenova,Alexander Denzler,Juergen Pfeffer

Main category: cs.CL

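TL;DR: By simulating 7500 multi-agent, multi-round discussions, this study examines how black-box LLMs coordinate on a collaborative annotation task. It introduces process-level metrics and analyzes the evolving geometry of output embeddings, finding that LLM groups converge lexically and semantically without explicit role prompting and show asymmetric influence and negotiation-like behavior.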

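Details Motivation: LLMs are increasingly used in collaborative settings, but how they coordinate as black-box agents is poorly understood, so the coordination mechanisms that emerge during interaction need to be studied. Method: 7500 multi-agent, multi-round dialogues are simulated on an inductive coding task, producing over 125000 utterances. Process-level metrics (code stability, semantic self-consistency, lexical confidence) are introduced, and geometric changes in the output embeddings, such as declining intrinsic dimensionality, are analyzed. Result: LLM groups converge lexically and semantically over rounds, develop asymmetric influence patterns, and exhibit negotiation-like behavior. The intrinsic dimensionality of the embedding space decreases over rounds, suggesting semantic compression. Conclusion: Even without explicit role assignment, black-box LLMs spontaneously develop coordination strategies when collaborating. Interaction-level analysis offers a scalable complement to internal probe-based methods for studying alignment signals and collaborative behavior. Abstract: Large language models (LLMs) are increasingly deployed in collaborative settings, yet little is known about how they coordinate when treated as black-box agents. We simulate 7500 multi-agent, multi-round discussions in an inductive coding task, generating over 125000 utterances that capture both final annotations and their interactional histories. We introduce process-level metrics (code stability, semantic self-consistency, and lexical confidence) alongside sentiment and convergence measures to track coordination dynamics. To probe deeper alignment signals, we analyze the evolving geometry of output embeddings, showing that intrinsic dimensionality declines over rounds, suggesting semantic compression. The results reveal that LLM groups converge lexically and semantically, develop asymmetric influence patterns, and exhibit negotiation-like behaviors despite the absence of explicit role prompting. This work demonstrates how black-box interaction analysis can surface emergent coordination strategies, offering a scalable complement to internal probe-based interpretability methods.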

[3] Tree Matching Networks for Natural Language Inference: Parameter-Efficient Semantic Understanding via Dependency Parse Trees

Jason Lunder

Main category: cs.CL

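TL;DR: This paper proposes Tree Matching Networks (TMN), a model built on dependency parse trees for natural language inference. On SNLI, TMN significantly outperforms a BERT-based model with a smaller memory footprint and shorter training time, while both models struggle on SemEval. Explicit structural representations outperform sequence models at comparable scales, but current aggregation methods limit scalability, so multi-headed attention aggregation is proposed to address this.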

Details Motivation: Models that use explicit linguistic structure, such as dependency parse trees, may be more efficient than transformers that learn word relationships from scratch, improving the learning efficiency and quality of sentence embeddings for NLI. Method: Graph Matching Networks (GMN) are adapted to dependency parse trees to build Tree Matching Networks (TMN), which are compared with a BERT-based model on the SNLI entailment task and the SemEval similarity task. Result: TMN achieves better results on SNLI with a smaller memory footprint and less training time, while both models perform poorly on SemEval. Explicit structural representations outperform sequence models at comparable scales, but existing aggregation methods limit scalability. Conclusion: Explicit syntactic structure helps model efficiency and performance, but aggregation must be improved to scale. Multi-headed attention aggregation is proposed as a solution. Abstract: In creating sentence embeddings for Natural Language Inference (NLI) tasks, using transformer-based models like BERT leads to high accuracy, but requires hundreds of millions of parameters. These models take in sentences as a sequence of tokens, and learn to encode the meaning of the sequence into embeddings such that those embeddings can be used reliably for NLI tasks. Essentially, every word is considered against every other word in the sequence, and the transformer model is able to determine the relationships between them, entirely from scratch. However, a model that accepts explicit linguistic structures like dependency parse trees may be able to leverage prior encoded information about these relationships, without having to learn them from scratch, thus improving learning efficiency. To investigate this, we adapt Graph Matching Networks (GMN) to operate on dependency parse trees, creating Tree Matching Networks (TMN). We compare TMN to a BERT-based model on the SNLI entailment task and on the SemEval similarity task. TMN is able to achieve significantly better results with a significantly reduced memory footprint and much less training time than the BERT-based model on the SNLI task, while both models struggled to perform well on SemEval. Explicit structural representations significantly outperform sequence-based models at comparable scales, but current aggregation methods limit scalability. We propose multi-headed attention aggregation to address this limitation.

[4] Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis

Matej Klemen,Tjaša Arčon,Luka Terčon,Marko Robnik-Šikonja,Kaja Dobrovoljc

Main category: cs.CL

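TL;DR: This paper proposes an agentic framework based on large language models (LLMs) that automates the analysis of grammatical phenomena in dependency treebanks, supporting interpretable study of word-order features across many languages.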

Details Motivation: Traditional corpus-based grammar research relies on manual analysis, which is time-consuming and hard to scale. The paper aims to use LLMs to lower the methodological and technical barriers and enable automated, data-driven grammatical inquiry. Method: An agentic LLM framework is built that interprets natural-language tasks, generates code, and reasons over data. It is applied to Universal Dependencies corpora and tested on multilingual grammatical tasks inspired by WALS. Result: The framework is evaluated on over 170 languages and 13 word-order features along three dimensions: dominant-order accuracy, order-coverage completeness, and distributional fidelity. The results indicate that LLMs can effectively combine reasoning with structured linguistic data for grammatical analysis. Conclusion: Combining LLMs with annotated corpora is feasible and offers a new route toward interpretable, scalable automation of corpus-based grammar research. Abstract: Empirical grammar research has become increasingly data-driven, but the systematic analysis of annotated corpora still requires substantial methodological and technical effort. We explore how agentic large language models (LLMs) can streamline this process by reasoning over annotated corpora and producing interpretable, data-grounded answers to linguistic questions. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation, code generation, and data-driven reasoning. As a proof of concept, we apply it to Universal Dependencies (UD) corpora, testing it on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS). The evaluation spans 13 word-order features and over 170 languages, assessing system performance across three complementary dimensions (dominant-order accuracy, order-coverage completeness, and distributional fidelity), which reflect how well the system generalizes, identifies, and quantifies word-order variations. The results demonstrate the feasibility of combining LLM reasoning with structured linguistic data, offering a first step toward interpretable, scalable automation of corpus-based grammatical inquiry.

[5] Minimal-Edit Instruction Tuning for Low-Resource Indic GEC

Akhil Rajeev P

Main category: cs.CL

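TL;DR: Proposes an augmentation-free approach to grammatical error correction for Indic languages that uses an instruction-tuned large language model with deterministic decoding, achieving strong results on Malayalam and Hindi.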

Details Motivation: Grammatical error correction for Indic languages faces limited supervised data, diverse scripts, and rich morphology. Method: A 12B GEMMA 3 model in 4-bit precision is instruction-tuned with parameter-efficient fine-tuning (PEFT) and Alpaca-style formatting, and decoded with a deterministic strategy that combines classifier-informed prompt design with a lightweight normaliser. Result: Under the official GLEU evaluation, the system scores 92.41 on Malayalam (sixth overall) and 81.44 on Hindi (third overall). Conclusion: Classifier-informed prompt design, adapter-based instruction tuning, and deterministic decoding provide a reproducible and computationally efficient alternative for Indic GEC. Abstract: Grammatical error correction for Indic languages faces limited supervision, diverse scripts, and rich morphology. We propose an augmentation-free setup that uses instruction-tuned large language models and conservative decoding. A 12B GEMMA 3 model is instruction-tuned in bnb 4-bit precision with parameter-efficient fine-tuning (PEFT) and Alpaca-style formatting. Decoding follows a deterministic, constraint-aware procedure with a lightweight normaliser that encourages minimal, meaning-preserving edits. We operationalise inference, subsequent to instruction fine-tuning (IFT), via a fixed, language-specific prompt directly synthesised from a deterministic error classifier's taxonomy, label distributions, and precedence ordering computed on the training corpus. Under the official untuned GLEU evaluation, the system scores 92.41 on Malayalam, sixth overall, and 81.44 on Hindi, third overall. These results indicate that classifier-informed prompt design, adapter-based instruction tuning, and deterministic decoding provide a reproducible and computationally efficient alternative to augmentation-centred pipelines for Indic GEC. The approach also motivates future work on stronger morphosyntactic constraints and human-centred evaluation of conservative edits.
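
The Alpaca-style formatting mentioned in the abstract above can be illustrated with a minimal sketch. The template is the widely used Alpaca prompt with an input field; the instruction wording and the `format_example` helper are hypothetical stand-ins, since the paper's actual prompt is synthesised from its error classifier and is not given in the abstract.

```python
from typing import Optional

# Standard Alpaca prompt template (the variant that includes an input field).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

# Hypothetical instruction text, chosen only for illustration. The paper
# derives its prompt from an error classifier's taxonomy instead.
GEC_INSTRUCTION = (
    "Correct the grammatical errors in the sentence. Make the smallest "
    "possible edits and preserve the meaning."
)


def format_example(source: str, target: Optional[str] = None) -> str:
    """Build a training prompt (with target) or an inference prompt (without)."""
    prompt = ALPACA_TEMPLATE.format(instruction=GEC_INSTRUCTION, input=source)
    return prompt if target is None else prompt + target
```

At training time the corrected sentence is appended after `### Response:`; at inference time the prompt ends there and the model generates the correction.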

[6] OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

Sai Koneru,Matthias Huck,Jan Niehues

Main category: cs.CL

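TL;DR: This paper proposes OmniFusion, an end-to-end multimodal translation system that fuses a multimodal foundation model (MMFM) with a translation large language model (LLM) to translate jointly from speech, images, and text. It reduces latency in simultaneous speech translation (SimulST) by about 1 second and improves translation quality.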

Details Motivation: Existing open-source text translation LLMs can only be used in cascaded pipelines for speech translation, which adds latency and prevents the use of multimodal context such as images for disambiguation. Multimodal foundation models have cross-modal capabilities but lack dedicated translation performance and multilingual coverage. An effective multimodal translation system should combine the strengths of both. Method: A novel fusion strategy connects hidden states from multiple layers of a pretrained multimodal foundation model (Omni 2.5-7B) to a translation LLM (SeedX PPO-7B), enabling joint end-to-end training. The resulting OmniFusion model supports speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Result: Experiments show that OmniFusion makes effective use of audio and visual inputs, reduces latency by about 1 second compared with cascaded systems in SimulST, and improves overall translation quality. Conclusion: By fusing a multimodal foundation model with a translation LLM, OmniFusion achieves efficient, low-latency, high-quality multimodal translation and offers a new end-to-end approach for speech translation systems. Abstract: There has been significant progress in open-source text-only translation large language models (LLMs) with better language coverage and quality. However, these models can only be used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first followed by translation. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and also improves the overall translation quality. Code is available at https://github.com/saikoneru/OmniFusion.

[7] Lost without translation -- Can transformer (language models) understand mood states?

Prakrithi Shivaprakash,Diptadhi Mukherjee,Lekhansh Shukla,Animesh Mukherjee,Prabhat Chand,Pratima Murthy

Main category: cs.CL

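TL;DR: Current language models cannot effectively represent mood states expressed in Indian languages. Directly embedding native scripts with multilingual or Indic-specific models performs extremely poorly (composite score 0.002). Translating into English or Chinese before embedding with large models substantially improves clustering, with the best result obtained from human-translated English further translated into Chinese and embedded with a Chinese model (composite score 0.67), but relying on proprietary models or complex translation pipelines is unsustainable. The conclusion is that models must first be built to understand local languages before they can be useful in global mental health.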

Details Motivation: To assess how well large language models recognise mood states in non-English settings, specifically Indian languages. Different languages have their own ways of expressing distress (for example depression or euphoric mania), and English-centric models may fail to capture them. Method: 247 unique phrases describing four mood states (depression, euthymia, euphoric mania, dysphoric mania) were collected across 11 Indic languages. k-means clustering performance was compared under seven experimental conditions, including direct embeddings of native and Romanised scripts (with multilingual and Indic-specific models) and embeddings of phrases translated into English and Chinese. Performance was measured with a composite score built from Adjusted Rand Index, Normalised Mutual Information, Homogeneity, and Completeness. Result: Direct embedding of Indic languages performed extremely poorly (composite score 0.002). All translation-based approaches improved markedly: Gemini-translated English scored 0.60 and human-translated English scored 0.61 (both embedded with gemini-001), while human-translated English further translated into Chinese and embedded with a Chinese model performed best (composite score 0.67). Specialised Indic models (IndicBERT and Sarvam-M) performed poorly. Conclusion: Existing language models cannot meaningfully represent mood states directly from Indic languages, which is a fundamental barrier to diagnostic or therapeutic psychiatric applications in India. High-quality translation can bridge the gap, but reliance on proprietary models or complex translation pipelines is unsustainable. Applications in global mental health first require models that understand diverse local languages. Abstract: Background: Large Language Models show promise in psychiatry but are English-centric. Their ability to understand mood states in other languages is unclear, as different languages have their own idioms of distress. Aim: To quantify the ability of language models to faithfully represent phrases (idioms of distress) of four distinct mood states (depression, euthymia, euphoric mania, dysphoric mania) expressed in Indian languages. Methods: We collected 247 unique phrases for the four mood states across 11 Indic languages. We tested seven experimental conditions, comparing k-means clustering performance on: (a) direct embeddings of native and Romanised scripts (using multilingual and Indic-specific models) and (b) embeddings of phrases translated to English and Chinese. Performance was measured using a composite score based on Adjusted Rand Index, Normalised Mutual Information, Homogeneity and Completeness. Results: Direct embedding of Indic languages failed to cluster mood states (Composite Score = 0.002). All translation-based approaches showed significant improvement. High performance was achieved using Gemini-translated English (Composite = 0.60) and human-translated English (Composite = 0.61) embedded with gemini-001. Surprisingly, human-translated English, further translated into Chinese and embedded with a Chinese model, performed best (Composite = 0.67). Specialised Indic models (IndicBERT and Sarvam-M) performed poorly. Conclusion: Current models cannot meaningfully represent mood states directly from Indic languages, posing a fundamental barrier to their psychiatric application for diagnostic or therapeutic purposes in India. While high-quality translation bridges this gap, reliance on proprietary models or complex translation pipelines is unsustainable. Models must first be built to understand diverse local languages to be effective in global mental health.
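
The abstract above says only that the composite score is "based on" four clustering metrics. The sketch below assumes an unweighted mean of the four and an arithmetic-mean normalisation for NMI; both are assumptions, not the paper's stated formula. It takes true mood labels and cluster assignments (for example from k-means, which is not run here) and uses only the standard library. It expects at least two items.

```python
from collections import Counter
from math import comb, log


def _entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _mutual_info(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (nij / n) * log(n * nij / (ca[i] * cb[j]))
        for (i, j), nij in joint.items()
    )


def adjusted_rand_index(true, pred):
    """Adjusted Rand Index computed from pair counts."""
    n = len(true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def composite_score(true, pred):
    """Assumed composite: unweighted mean of ARI, NMI, homogeneity, completeness."""
    h_true, h_pred = _entropy(true), _entropy(pred)
    mi = _mutual_info(true, pred)
    homogeneity = mi / h_true if h_true > 0 else 1.0
    completeness = mi / h_pred if h_pred > 0 else 1.0
    nmi = 2 * mi / (h_true + h_pred) if (h_true + h_pred) > 0 else 1.0
    ari = adjusted_rand_index(true, pred)
    return (ari + nmi + homogeneity + completeness) / 4
```

Under this definition a clustering that matches the mood labels up to renaming scores 1, and one that carries no information about them scores near or below 0, which is consistent with the near-zero score reported for direct Indic embeddings.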

[8] EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

Guoqing Ma,Jia Zhu,Hanghui Guo,Weijie Shi,Yue Cui,Jiawei Shen,Zilong Li,Yidan Liang

Main category: cs.CL

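TL;DR: This paper presents EduEval, a benchmark for evaluating large language models in Chinese K-12 education. Its three contributions are a cognitive framework, authenticity, and scale, and an evaluation of 14 leading LLMs reveals how their performance differs across cognitive tasks.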

Details Motivation: LLMs have great potential in education, but deploying them without systematic evaluation may harm educational standards, so a comprehensive, hierarchical benchmark for Chinese K-12 settings is needed. Method: The EduAbility Taxonomy is built by combining Bloom's Taxonomy with Webb's Depth of Knowledge and covers six cognitive dimensions. Real exam questions, classroom conversations, student essays, and expert-designed prompts are collected into the EduEval benchmark, which has 24 task types and over 11,000 questions. Fourteen leading LLMs are evaluated in zero-shot and few-shot settings. Result: Models perform well on memorization and understanding tasks but struggle with classroom dialogue classification and are inconsistent in creative content generation. Several open-source models outperform proprietary systems on complex educational reasoning. The effect of few-shot prompting varies by cognitive dimension. Conclusion: EduEval provides a systematic evaluation tool for LLMs in Chinese educational settings, exposes the strengths and weaknesses of different models across cognitive tasks, and shows that evaluation and optimization strategies should be tailored to educational objectives. Abstract: Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom's Taxonomy and Webb's Depth of Knowledge to organize tasks across six cognitive dimensions: Memorization, Understanding, Application, Reasoning, Creativity, and Ethics. (2) Authenticity: Our benchmark integrates real exam questions, classroom conversations, student essays, and expert-designed prompts to reflect genuine educational challenges. (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open-source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks.

[9] Comparative Analysis of 47 Context-Based Question Answer Models Across 8 Diverse Datasets

Muhammad Muneeb,David B. Ascher,Ahsan Baidar Bakht

Main category: cs.CL

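TL;DR: This study benchmarks 47 context-based question answering (CBQA) models from Hugging Face on eight datasets to find the model that performs best across settings without additional fine-tuning. Models trained on SQuAD perform best, with ahotrod/electra_large_discriminator_squad2_512 achieving the highest overall accuracy (43%) and leading on several individual datasets. Performance depends on context length, answer length, and context complexity, and a genetic algorithm is used to combine multiple models for higher accuracy.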

Details Motivation: To identify CBQA models that perform well across diverse datasets without fine-tuning, reducing the cost of retraining in practical applications and making deployment easier. Method: 47 Hugging Face CBQA models are benchmarked on eight datasets. Accuracy and computation time are analyzed against model size, context length, answer length, and context complexity, and a genetic algorithm is used to combine the outputs of multiple models. Result: The best model, ahotrod/electra_large_discriminator_squad2_512, reaches 43% accuracy across all datasets and gives the best results on bioasq10b-factoid (65.92%), biomedical_cpgQA (96.45%), QuAC (11.13%), and Question Answer Dataset (41.6%). bert-large-uncased-whole-word-masking-finetuned-squad reaches 82% accuracy on the IELTS dataset. Performance drops as context length and complexity increase, and computation time depends on model size and context length. Combining models with a genetic algorithm improves overall accuracy. Conclusion: CBQA models trained on SQuAD perform best across datasets, and ahotrod/electra_large_discriminator_squad2_512 in particular generalizes well and is usable in a range of applications without fine-tuning. Context characteristics and model architecture strongly affect performance, and model ensembling may improve results further. Abstract: Context-based question answering (CBQA) models provide more accurate and relevant answers by considering the contextual information. They effectively extract specific information given a context, making them functional in various applications involving user support, information retrieval, and educational platforms. In this manuscript, we benchmarked the performance of 47 CBQA models from Hugging Face on eight different datasets. This study aims to identify the best-performing model across diverse datasets without additional fine-tuning. It is valuable for practical applications where the need to retrain models for specific datasets is minimized, streamlining the implementation of these models in various contexts. The best-performing models were trained on the SQuAD v2 or SQuAD v1 datasets. The best-performing model was ahotrod/electra_large_discriminator_squad2_512, which yielded 43% accuracy across all datasets. We observed that the computation time of all models depends on the context length and the model size. The model's performance usually decreases with an increase in the answer length. Moreover, the model's performance depends on the context complexity. We also used a genetic algorithm to improve the overall accuracy by integrating responses from other models. ahotrod/electra_large_discriminator_squad2_512 generated the best results for bioasq10b-factoid (65.92%), biomedical_cpgQA (96.45%), QuAC (11.13%), and Question Answer Dataset (41.6%). The model bert-large-uncased-whole-word-masking-finetuned-squad achieved an accuracy of 82% on the IELTS dataset.

[10] Evidence-Guided Schema Normalization for Temporal Tabular Reasoning

Ashish Thanga,Vibhu Dixit,Abhilash Shankarampeta,Vivek Gupta

Main category: cs.CL

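TL;DR: Proposes a SQL-based approach that generates a third normal form (3NF) schema, generates SQL queries, and executes them to improve temporal question answering over evolving semi-structured tables. Schema design quality affects precision more than model capacity, and three principles are proposed: normalization that preserves context, semantic naming that reduces ambiguity, and consistent temporal anchoring. The best configuration improves on the baseline by 16.8%.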

Details Motivation: Existing QA systems struggle with temporal reasoning over semi-structured tables that evolve over time, such as Wikipedia infoboxes. Method: (1) Generate a database schema in third normal form (3NF) from Wikipedia infoboxes; (2) use an LLM to generate the corresponding SQL queries; (3) execute the SQL queries to obtain the answer. Result: The best configuration (Gemini 2.5 Flash schema + Gemini-2.0-Flash queries) reaches 80.39 EM, a 16.8% improvement over the baseline at 68.89 EM. The experiments show that schema design quality affects QA precision more than model capacity. Conclusion: In table-based QA systems, good database schema design (normalized, semantically clear, temporally consistent) improves performance more than simply scaling up the model. Abstract: Temporal reasoning over evolving semi-structured tables poses a challenge to current QA systems. We propose a SQL-based approach that involves (1) generating a 3NF schema from Wikipedia infoboxes, (2) generating SQL queries, and (3) query execution. Our central finding challenges model scaling assumptions: the quality of schema design has a greater impact on QA precision than model capacity. We establish three evidence-based principles: normalization that preserves context, semantic naming that reduces ambiguity, and consistent temporal anchoring. Our best configuration (Gemini 2.5 Flash schema + Gemini-2.0-Flash queries) achieves 80.39 EM, a 16.8% improvement over the baseline (68.89 EM).
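
The schema-then-query-then-execute pipeline described above can be sketched with Python's built-in `sqlite3`. The tables, column names, sample rows, and the `club_in_year` helper are all hypothetical, made up to illustrate the three principles (normalized tables, descriptive names, and the same `start_year`/`end_year` anchors on every time-dependent fact). In the paper the schema and the SQL are generated by LLMs, not written by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
CREATE TABLE club (
    club_id   INTEGER PRIMARY KEY,
    club_name TEXT NOT NULL
);
-- One row per spell at a club; every fact carries the same temporal anchors.
CREATE TABLE club_membership (
    person_id  INTEGER NOT NULL REFERENCES person(person_id),
    club_id    INTEGER NOT NULL REFERENCES club(club_id),
    start_year INTEGER NOT NULL,
    end_year   INTEGER,              -- NULL means the spell is ongoing
    PRIMARY KEY (person_id, club_id, start_year)
);
-- Hypothetical infobox content, for illustration only.
INSERT INTO person VALUES (1, 'Example Player');
INSERT INTO club VALUES (1, 'Club A'), (2, 'Club B');
INSERT INTO club_membership VALUES (1, 1, 2010, 2015), (1, 2, 2015, NULL);
""")


def club_in_year(full_name, year):
    """Answer 'which club did X play for in year Y?' with a temporal filter."""
    rows = conn.execute(
        """
        SELECT c.club_name
        FROM club_membership AS m
        JOIN person AS p ON p.person_id = m.person_id
        JOIN club   AS c ON c.club_id   = m.club_id
        WHERE p.full_name = ?
          AND m.start_year <= ?
          AND (m.end_year IS NULL OR m.end_year > ?)
        ORDER BY m.start_year
        """,
        (full_name, year, year),
    ).fetchall()
    return [row[0] for row in rows]
```

Treating `end_year` as exclusive is a design choice made here so that a transfer year maps to exactly one club; a different convention would need a different comparison.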

[11] Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents

Daud Waqas,Aaryamaan Golthi,Erika Hayashida,Huanzhi Mao

Main category: cs.CL

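TL;DR: This paper introduces Assertion-Conditioned Compliance (A-CC), a new paradigm for evaluating LLM behavior in multi-turn tool-calling dialogues, and shows that models are vulnerable to misleading assertions that originate from either the user or the system.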

Details Motivation: Existing benchmarks do not assess robustness at the level of multi-turn conversations, especially in safety-critical domains where misleading information can lead a model to act incorrectly. Method: The A-CC evaluation framework quantifies a model's compliance and consistency under misleading information in multi-turn dialogues through two vectors: user-sourced assertions (USAs) and function-sourced assertions (FSAs). Result: Current models are highly vulnerable in both the USA and FSA settings, readily yielding to mistaken user beliefs or stale system policies. Conclusion: A-CC exposes a latent vulnerability in deployed agents and highlights the need for stronger evaluation and improvement of logical consistency and resistance to misleading input in multi-turn interaction. Abstract: Multi-turn tool-calling LLMs (models capable of invoking external APIs or tools across several user turns) have emerged as a key feature in modern AI assistants, enabling extended dialogues from benign tasks to critical business, medical, and financial operations. Yet implementing multi-turn pipelines remains difficult for many safety-critical industries due to ongoing concerns regarding model resilience. While standardized benchmarks such as the Berkeley Function-Calling Leaderboard (BFCL) have underpinned confidence concerning advanced function-calling models (like Salesforce's xLAM V2), there is still a lack of visibility into multi-turn conversation-level robustness, especially given their exposure to real-world systems. In this paper, we introduce Assertion-Conditioned Compliance (A-CC), a novel evaluation paradigm for multi-turn function-calling dialogues. A-CC provides holistic metrics that evaluate a model's behavior when confronted with misleading assertions originating from two distinct vectors: (1) user-sourced assertions (USAs), which measure sycophancy toward plausible but misinformed user beliefs, and (2) function-sourced assertions (FSAs), which measure compliance with plausible but contradictory system policies (e.g., stale hints from unmaintained tools). Our results show that models are highly vulnerable to both USA sycophancy and FSA policy conflicts, confirming A-CC as a critical, latent vulnerability in deployed agents.

[12] IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages

Ayush Maheshwari,Kaushal Sharma,Vivek Patel,Aditya Maheshwari

Main category: cs.CL

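TL;DR: IndicParam is a human-curated benchmark for low-resource and extremely low-resource Indic languages with over 13,000 multiple-choice questions. Evaluation shows that current LLMs perform poorly on these languages.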

Details Motivation: Low-resource and extremely low-resource Indic languages are severely under-evaluated in existing research, so a high-quality, fine-grained benchmark is needed to measure the cross-lingual ability of LLMs. Method: IndicParam, a human-annotated multiple-choice dataset, covers 11 low- and extremely low-resource Indic languages plus a Sanskrit-English code-mixed set. Nineteen LLMs are evaluated, questions are labeled as knowledge-oriented or linguistic, and several question formats are tested. Result: Even the best model, GPT-5, reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%), showing that current models remain clearly limited on low-resource Indic languages. Conclusion: IndicParam exposes the limits of cross-lingual transfer in current LLMs for low-resource Indic languages and provides a challenging new benchmark that encourages future work on broader language equity. Abstract: While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, Konkani as extremely low-resource) plus a Sanskrit-English code-mixed set. Our evaluation of 19 LLMs, both proprietary and open-weights, reveals that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%). We additionally label each question as knowledge-oriented or purely linguistic to discriminate factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. IndicParam provides insights into limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at https://huggingface.co/datasets/bharatgenai/IndicParam. Scripts to run the benchmark are available at https://github.com/ayushbits/IndicParam.

[13] CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA

Vsevolod Kovalev,Parteek Kumar

Main category: cs.CL

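TL;DR: This paper presents the CourseTimeQA dataset and CrossFusion-RAG, a lightweight cross-modal retriever, for timestamped question answering over educational videos under a single-GPU latency and memory budget.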

Details Motivation: To retrieve timestamped segments and generate answers for natural-language queries over educational videos efficiently and accurately under resource constraints. Method: The retriever combines frozen encoders, a learned vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. Result: On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a BLIP-2 retriever, with a median end-to-end latency of about 1.55 s on a single A100, and remains robust under ASR noise. Conclusion: CrossFusion-RAG substantially improves retrieval quality while keeping latency low, and is practical and reproducible. Abstract: We study timestamped question answering over educational lecture videos under a single-GPU latency/memory budget. Given a natural-language query, the system retrieves relevant timestamped segments and synthesizes a grounded answer. We present CourseTimeQA (52.3 h, 902 queries across six courses) and a lightweight, latency-constrained cross-modal retriever (CrossFusion-RAG) that combines frozen encoders, a learned 512->768 vision projection, shallow query-agnostic cross-attention over ASR and frames with a temporal-consistency regularizer, and a small cross-attentive reranker. On CourseTimeQA, CrossFusion-RAG improves nDCG@10 by 0.10 and MRR by 0.08 over a strong BLIP-2 retriever while achieving approximately 1.55 s median end-to-end latency on a single A100. Closest comparators (zero-shot CLIP multi-frame pooling; CLIP + cross-encoder reranker + MMR; learned late-fusion gating; text-only hybrid with cross-encoder reranking and its MMR variant; caption-augmented text retrieval; non-learned temporal smoothing) are evaluated under matched hardware and indexing. We report robustness across ASR noise (WER quartiles), diagnostics for temporal localization, and full training/tuning details to support reproducible comparison.
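
The two retrieval metrics reported above, nDCG@10 and MRR, can be computed as in the following minimal sketch. It assumes linear gain and takes the ideal ranking from the retrieved list itself, which is a simplification that holds only when every relevant segment appears somewhere in that list. The abstract does not specify the paper's exact evaluation protocol.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query; `relevances` are graded labels in ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(ranked_relevances):
    """MRR over queries; each entry is one query's relevance list in ranked order."""
    total = 0.0
    for relevances in ranked_relevances:
        first_hit = next(
            (rank for rank, rel in enumerate(relevances, start=1) if rel > 0),
            None,
        )
        total += 1.0 / first_hit if first_hit else 0.0
    return total / len(ranked_relevances)
```

Both metrics lie in [0, 1], so the reported gains of 0.10 and 0.08 are absolute differences on that scale.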

[14] Mitigating the Threshold Priming Effect in Large Language Model-Based Relevance Judgments via Personality Infusing

Nuo Chen,Hanpei Fang,Jiqun Liu,Wilson Wei,Tetsuya Sakai,Xiao-Ming Wu

Main category: cs.CL

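TL;DR: This study examines how simulated Big Five personality traits in large language models (LLMs) affect priming in relevance labeling. Traits such as High Openness and Low Neuroticism reduce priming bias, and "personality prompting" is proposed as a new way to mitigate threshold priming.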

Details Motivation: Prior work shows that LLMs doing relevance labeling are influenced by earlier judgments (the priming effect), but it is unclear whether simulated personality traits affect this bias. The study explores the relationship between personality profiles and priming in order to make LLMs more reliable for information retrieval evaluation. Method: Different Big Five personality profiles are applied to multiple LLMs on the TREC 2021 and 2022 Deep Learning Track datasets, and the resulting changes in priming during relevance judgment are analyzed systematically. Result: Certain profiles, such as High Openness and Low Neuroticism, consistently reduce priming, but the most effective profile varies by model and task type. Conclusion: Personality prompting is an effective strategy for mitigating priming bias in LLM relevance labeling, connecting psychological theory with LLM evaluation practice. Abstract: Recent research has explored LLMs as scalable tools for relevance labeling, but studies indicate they are susceptible to priming effects, where prior relevance judgments influence later ones. Although psychological theories link personality traits to such biases, it is unclear whether simulated personalities in LLMs exhibit similar effects. We investigate how Big Five personality profiles in LLMs influence priming in relevance labeling, using multiple LLMs on TREC 2021 and 2022 Deep Learning Track datasets. Our results show that certain profiles, such as High Openness and Low Neuroticism, consistently reduce priming susceptibility. Additionally, the most effective personality in mitigating priming may vary across models and task types. Based on these findings, we propose personality prompting as a method to mitigate threshold priming, connecting psychological evidence with LLM-based evaluation practices.

[15] A Taxonomy of Errors in English as she is spoke: Toward an AI-Based Method of Error Analysis for EFL Writing Instruction

Damian Heywood,Joseph Andrew Carrier,Kyu-Hong Hwang

Main category: cs.CL

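TL;DR: This study develops an AI-assisted error analysis system for English writing, built on large language models, that identifies and classifies errors at several levels (spelling, grammar, and punctuation) and provides fine-grained feedback.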

Details Motivation: Traditional writing assessment relies on rubrics and rarely gives detailed error feedback. Drawing on the linguistic theories of Corder, Richards, and James, the study aims to build a more precise, automated error analysis tool to improve English as a foreign language (EFL) instruction. Method: The system uses LLMs (such as Claude 3.5 Sonnet and DeepSeek R1) with an error taxonomy derived from Corder (1967), Richards (1971), and James (1998). API calls implemented in Python classify and correct errors at the word and sentence levels. The taxonomy is first refined through tests on isolated errors, then evaluated on authentic error-rich text from "English as she is spoke". Result: The system identifies many error types but is limited in contextual understanding and may generate new error categories when it meets uncoded errors. Testing revealed overlapping categories in the taxonomy that required adjustment. The historical text confirmed the system's ability to handle complex linguistic errors. Conclusion: AI has potential for automating error analysis of English writing and can give more detailed feedback than traditional scoring. Contextual understanding needs to improve, and the taxonomy should be extended to stylistic and discourse-level errors for broader use. Abstract: This study describes the development of an AI-assisted error analysis system designed to identify, categorize, and correct writing errors in English. Utilizing Large Language Models (LLMs) like Claude 3.5 Sonnet and DeepSeek R1, the system employs a detailed taxonomy grounded in linguistic theories from Corder (1967), Richards (1971), and James (1998). Errors are classified at both word and sentence levels, covering spelling, grammar, and punctuation. Implemented through Python-coded API calls, the system provides granular feedback beyond traditional rubric-based assessments. Initial testing on isolated errors refined the taxonomy, addressing challenges like overlapping categories. Final testing used "English as she is spoke" by Jose da Fonseca (1855), a text rich with authentic linguistic errors, to evaluate the system's capacity for handling complex, multi-layered analysis. The AI successfully identified diverse error types but showed limitations in contextual understanding and occasionally generated new error categories when encountering uncoded errors. This research demonstrates AI's potential to transform EFL instruction by automating detailed error analysis and feedback. While promising, further development is needed to improve contextual accuracy and expand the taxonomy to stylistic and discourse-level errors.

[16] CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency

Jiacheng Guo,Suozhi Huang,Zixin Yao,Yifan Zhang,Yifu Lu,Jiashuo Liu,Zihao Li,Yanyan Deng,Qixin Xiao,Jia Tian,Kanghong Zhan,Tianyi Li,Xiaochen Liu,Jason Ge,Chaoyang He,Kaixuan Huang,Lin Yang,Wenhao Huang,Mengdi Wang

Main category: cs.CL

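TL;DR: This paper introduces CryptoBench, the first expert-curated dynamic benchmark for evaluating the real-world capabilities of LLM agents in the cryptocurrency domain, and shows that existing models are weak at predictive analysis.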

Details Motivation: General-purpose agent benchmarks cannot adequately assess LLM performance in cryptocurrency, a specialized domain that is highly time-sensitive, adversarial in its information environment, and dependent on diverse data sources, so a dedicated and more rigorous benchmark is needed. Method: A dynamic benchmark designed by crypto-native experts provides 50 questions per month, categorized into four task types (simple and complex retrieval, simple and complex prediction). Ten LLMs are evaluated systematically, both directly and within an agentic framework. Result: The evaluation uncovers a "retrieval-prediction imbalance": many leading models are good at data retrieval but weak at predictive analysis, appearing factually accurate on the surface while lacking deeper analysis. Conclusion: CryptoBench sets a more rigorous standard for evaluating LLM agents in complex real-world scenarios, exposes key weaknesses of current models in advanced analysis and prediction, and underlines the importance of improving integrated reasoning. Abstract: This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.

[17] SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

Yang Xiao,Chunpu Xu,Ruifeng Yuan,Jiashuo Wang,Wenjie Li,Pengfei Liu

Main category: cs.CL

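TL;DR: This paper proposes SCALE, a framework that improves the mathematical reasoning performance of large language models through selective resource allocation based on sub-problem difficulty; compared with uniform allocation, it raises accuracy while substantially reducing computational cost.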

Details Motivation: Existing test-time compute scaling methods allocate resources uniformly across all reasoning sub-problems, so hard sub-problems are under-resourced while simple operations consume too much, causing performance bottlenecks and wasted compute. Method: Inspired by dual-process theory, SCALE decomposes a problem into sequential reasoning sub-problems, assesses the difficulty of each, assigns System 1 (simple) or System 2 (complex) processing accordingly, and executes them selectively while propagating context. Result: Experiments show that SCALE raises accuracy on AIME25 from 57.50% to 71.25% (+13.75 percentage points) while cutting computational cost by 33%-53%, significantly outperforming uniform scaling baselines. Conclusion: Through differentiated resource allocation, SCALE addresses the limitations of current test-time scaling methods and achieves more efficient and more accurate mathematical reasoning, advancing how reasoning systems use resources. Abstract: Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform resource distribution across all reasoning sub-problems, creating fundamental bottlenecks where challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates performance bottlenecks where additional computational resources yield diminishing returns. Inspired by dual-process theory, we propose SCALE (Selective Resource Allocation), a framework that selectively allocates computational resources based on sub-problem difficulty. SCALE operates through four stages: (1) problem decomposition into sequential reasoning sub-problems, (2) difficulty assessment of each sub-problem to distinguish between routine operations and computationally challenging sub-problems, (3) selective processing mode assignment between System 1 for simple sub-problems and System 2 for complex ones, and (4) sequential execution with context propagation. By concentrating resources on challenging sub-problems while processing routine operations efficiently, SCALE achieves substantial performance improvements with superior resource utilization. 
Extensive experiments demonstrate that SCALE significantly outperforms uniform scaling baselines, achieving accuracy improvements of up to 13.75 percentage points (57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, representing a major advance in test-time scaling that addresses fundamental limitations of current approaches.
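
The four stages listed in the SCALE abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SubProblem` type, the numeric `difficulty` score, the `threshold` value, and the two solver callables are all hypothetical stand-ins, and decomposition and difficulty assessment are assumed to have happened already.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class SubProblem:
    text: str
    difficulty: float  # hypothetical score in [0, 1]; higher means harder


# A solver receives the sub-problem text plus the answers produced so far.
Solver = Callable[[str, Sequence[str]], str]


def solve_selectively(
    subproblems: List[SubProblem],
    fast_solver: Solver,   # stands in for the cheap "System 1" mode
    slow_solver: Solver,   # stands in for the expensive "System 2" mode
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Run sub-problems in order, routing each one by its difficulty."""
    answers: List[str] = []
    modes: List[str] = []
    for sp in subproblems:
        use_slow = sp.difficulty >= threshold
        solver = slow_solver if use_slow else fast_solver
        # Context propagation: earlier answers are visible to later steps.
        answers.append(solver(sp.text, tuple(answers)))
        modes.append("system2" if use_slow else "system1")
    return answers, modes
```

Only sub-problems at or above the threshold pay for the expensive solver, which is the intuition behind concentrating compute where it is needed; how SCALE actually scores difficulty and implements each mode is not specified in the abstract.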

[18] CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning

Diego A. B. Moreira,Alef I. Ferreira,Jhessica Silva,Gabriel O. dos Santos,Gustavo Bonil,João Gondim,Marina dos Santos,Helena Maia,Simone Hashiguti,Nádia da Silva,Carolina Scarton,Helio Pedrini,Sandra Avila

Main category: cs.CL

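TL;DR: This paper proposes CACARA, a multimodal and multilingual architecture that uses emergent alignment learning to integrate new modalities without full retraining; fine-tuned only on English-aligned data, it supports more than 100 languages and substantially improves audio-to-text retrieval while keeping training cost low.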

Details Motivation: Existing multimodal models usually depend on resource-intensive training across multiple modalities, and extending them to new languages requires a similarly costly strategy. This paper explores a more efficient mechanism that integrates new modalities and acquires multilingual capability without retraining the full model. Method: The proposed CACARA architecture uses emergent alignment learning: starting from an existing bimodal or multimodal model, only the newly added modality is fine-tuned, on data aligned with English, which yields multilingual support without explicit multilingual pretraining or tuning of the text encoder. Result: R@1 on audio-to-text retrieval improves by up to 14.24 percentage points, surpassing state-of-the-art multimodal models, at a training cost comparable to that of a monolingual model. Conclusion: The emergent alignment paradigm can unlock multimodal and multilingual capabilities and offers an efficient, scalable way to build models that balances performance against compute. Abstract: As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions between multimodal data. These connections bridge information gaps: an image can visually materialize a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Similarly, extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, enabling the seamless integration of new modalities into an existing bimodal/multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this emergent alignment paradigm can unlock multilingual capabilities from monolingual training. By fine-tuning the newly incorporated modality only on data aligned with the English language, our model develops support for over 100 languages without explicit multilingual pretraining or tuning of the text encoder. Such emergent multimodal and multilingual properties are gained efficiently, preserving previously learned knowledge at a training cost comparable to that of a monolingual model. Our strategy achieves up to a 14.24 percentage points improvement in R@1 audio-to-text retrieval, outperforming state-of-the-art multimodal models -- all without the heavy computational cost of retraining across every modality and language.

[19] G-KV: Decoding-Time KV Cache Eviction with Global Attention

Mengqi Liao,Lu Wang,Chaoyun Zhang,Zekai Shen,Xiaowei Mao,Si Qin,Qingwei Lin,Saravan Rajmohan,Dongmei Zhang,Huaiyu Wan

Main category: cs.CL

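TL;DR: Proposes G-KV, a KV cache eviction method that scores tokens globally by combining local and historical attention scores, and introduces post-training techniques to optimize model performance under a compressed KV cache.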

Details Motivation: Existing KV cache compression methods mostly focus on prompt compression or on token eviction based on local attention scores, overlooking the long-term importance of tokens. Method: G-KV assesses token importance with a global scoring mechanism that combines local and historical attention scores, and uses post-training techniques such as reinforcement learning and distillation to optimize the model. Result: G-KV identifies important tokens more accurately and improves inference efficiency while maintaining model performance. Conclusion: Through global importance assessment and post-training, G-KV preserves inference quality while compressing the KV cache and significantly improves inference efficiency. Abstract: Recent reasoning large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. However, existing methods often focus on prompt compression or token eviction with local attention score, overlooking the long-term importance of tokens. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance. Additionally, we introduce post-training techniques, including reinforcement learning and distillation, to optimize models for compressed KV cache settings. The code of this paper is available on: https://github.com/microsoft/G-KV.
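
To make the idea of combining local and historical attention scores concrete, here is a small Python sketch. The abstract does not state how G-KV combines the two, so the rule used below (the maximum of the current local score and a decayed historical score) and the `decay` parameter are assumptions chosen only for illustration; the function names are hypothetical as well.

```python
from typing import List, Sequence


def update_global_scores(
    history: Sequence[float],
    local: Sequence[float],
    decay: float = 0.9,  # assumed decay factor, not taken from the paper
) -> List[float]:
    """Combine per-token local attention scores with decayed past scores.

    `history` may be shorter than `local` when new tokens have been
    appended since the last update; new tokens start with no history.
    """
    padded = list(history) + [0.0] * (len(local) - len(history))
    return [max(l, decay * h) for h, l in zip(padded, local)]


def select_kept_positions(scores: Sequence[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens, in order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

With `history = [0.8, 0.1]` and `local = [0.05, 0.2, 0.3]`, a purely local rule with a budget of two would evict position 0, whereas the combined score keeps it because it was important earlier. That is the kind of long-term importance the abstract says local-only eviction overlooks.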

[20] Developing a Comprehensive Framework for Sentiment Analysis in Turkish

Cem Rifki Aydin

Main category: cs.CL

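TL;DR: This thesis presents a comprehensive sentiment analysis framework aimed mainly at Turkish, together with several new approaches for English, and reports significant results in feature extraction, word embeddings, and neural network architectures.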

Details Motivation: Sentiment analysis is challenging in morphologically rich languages such as Turkish, and existing methods fall short in multilingual and multi-aspect analysis, so a more comprehensive solution is needed. Method: The thesis builds a new feature set combining unsupervised, semi-supervised, and supervised metrics; applies classical machine learning methods; creates a domain-specific polarity lexicon; performs fine-grained morphological analysis; proposes a new architecture that combines recurrent and recursive neural networks; builds new word embeddings that exploit sentiment, syntactic, semantic, and lexical characteristics; and redefines context windows as subclauses. Result: The approach outperforms neural network models on Turkish and English datasets of different genres, achieves state-of-the-art results, applies a semi-supervised method to Turkish text for the first time, and can be extended to other morphologically rich or agglutinative languages and to other NLP tasks. Conclusion: This is the most detailed study of Turkish sentiment analysis as of July 2020; it advances sentiment analysis in Turkish and also contributes to the opinion classification problem in English. Abstract: In this thesis, we developed a comprehensive framework for sentiment analysis that takes its many aspects into account mainly for Turkish. We have also proposed several approaches specific to sentiment analysis in English only. We have accordingly made five major and three minor contributions. We generated a novel and effective feature set by combining unsupervised, semi-supervised, and supervised metrics. We then fed them as input into classical machine learning methods, and outperformed neural network models for datasets of different genres in both Turkish and English. We created a polarity lexicon with a semi-supervised domain-specific method, which has been the first approach applied for corpora in Turkish. We performed a fine morphological analysis for the sentiment classification task in Turkish by determining the polarities of morphemes. This can be adapted to other morphologically-rich or agglutinative languages as well. We have built a novel neural network architecture, which combines recurrent and recursive neural network models for English. We built novel word embeddings that exploit sentiment, syntactic, semantic, and lexical characteristics for both Turkish and English. We also redefined context windows as subclauses in modelling word representations in English. This can also be applied to other linguistic fields and natural language processing tasks. We have achieved state-of-the-art and significant results for all these original approaches. 
Our minor contributions include methods related to aspect-based sentiment in Turkish, parameter redefinition in the semi-supervised approach, and aspect term extraction techniques for English. This thesis can be considered the most detailed and comprehensive study made on sentiment analysis in Turkish as of July, 2020. Our work has also contributed to the opinion classification problem in English.

[21] Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity

Subramanyam Sahoo,Vinija Jain,Saanidhya Vats,Siddharth Mohapatra,Rui Min,Aman Chadha,Divya Chaudhary

Main category: cs.CL

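TL;DR: Proposes a diagnostic framework that evaluates the mathematical reasoning of language models along four axes, revealing cases where surface accuracy is high but actual reasoning ability is poor.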

Details Motivation: Current evaluation of mathematical reasoning relies mainly on answer accuracy, which can mask fundamental failures in logical computation, so a deeper diagnostic method is needed. Method: The authors introduce a four-axis diagnostic framework (forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness) and apply it in a case study to the Qwen3-0.6B model on the MenatQA dataset. Result: Although the model reaches answer accuracy above 70%, its backward consistency is only 15% and its transitivity coverage only 32.2%, and it is sensitive to perturbations, which indicates fragile reasoning. Conclusion: Traditional accuracy metrics do not adequately reflect real reasoning ability; the diagnostic framework helps expose the gap between pattern matching and genuine logical reasoning and supports more reliable evaluation of mathematical reasoning. Abstract: Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.

[22] Slovak Conceptual Dictionary

Miroslav Blšták

Main category: cs.CL

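TL;DR: This paper introduces a new conceptual dictionary for Slovak, the first linguistic tool of its kind for the language, intended to address the shortage of dictionary data in low-resource languages.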

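Details Motivation: Slovak is a language with limited linguistic resources and currently lacks machine-readable linguistic data sources of sufficient size, so many tasks that require automated processing of Slovak text perform poorly or are almost impossible to solve. Method: The author proposes and builds a new conceptual dictionary for Slovak as the first linguistic tool of this kind to support natural language processing tasks. Result: The work provides the first large-scale machine-readable conceptual dictionary for Slovak, which is expected to improve performance on natural language processing tasks in the language. Conclusion: The conceptual dictionary fills a gap in Slovak language resources and lays a foundation for future research and applications. Abstract: When solving tasks in the field of natural language processing, we sometimes need dictionary tools, such as lexicons, word form dictionaries or knowledge bases. However, the availability of dictionary data is insufficient in many languages, especially in the case of low resourced languages. In this article, we introduce a new conceptual dictionary for the Slovak language as the first linguistic tool of this kind. Since Slovak language is a language with limited linguistic resources and there are currently not available any machine-readable linguistic data sources with a sufficiently large volume of data, many tasks which require automated processing of Slovak text achieve weaker results compared to other languages and are almost impossible to solve.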

[23] Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models

Alla Chepurova,Aydar Bulatov,Yuri Kuratov,Mikhail Burtsev

Main category: cs.CL

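TL;DR: Wikontic is a multi-stage pipeline for building knowledge graphs (KGs) from open-domain text; it extracts candidate triplets with qualifiers, enforces Wikidata-based type and relation constraints, and normalizes entities to reduce duplication, producing compact, ontology-consistent, well-connected KGs.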

Details Motivation: Current LLM-based systems usually treat knowledge graphs as auxiliary structures for text retrieval and leave their intrinsic quality underexplored. This paper aims to improve the quality and usability of knowledge graphs generated from text. Method: The proposed Wikontic pipeline has multiple stages: (1) extract candidate triplets with qualifiers from open-domain text; (2) apply Wikidata-based type and relation constraints; (3) normalize entities to reduce redundancy. Question answering is evaluated using only the triplets as input. Result: On MuSiQue, the correct answer entity appears in 96% of generated triplets; the method reaches 76.0 F1 on HotpotQA and 59.8 F1 on MuSiQue, matching or surpassing retrieval-augmented generation methods that rely on textual context; it attains 86% information retention on the MINE-1 benchmark, the current best; and KG construction uses fewer than 1,000 output tokens, far fewer than AriGraph and GraphRAG. Conclusion: Wikontic efficiently produces high-quality, compact, ontology-consistent knowledge graphs, shows that structured triplets alone can support complex reasoning tasks, and offers a scalable way to use structured knowledge in LLMs. Abstract: Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses fewer than 1,000 output tokens, about 3× fewer than AriGraph and <1/20 of GraphRAG. The proposed pipeline enhances the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs.

[24] Prism: A Minimal Compositional Metalanguage for Specifying Agent Behavior

Franck Binard,Vanja Kljajevic

Main category: cs.CL

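TL;DR: Prism is a small, compositional metalanguage for specifying the behavior of tool-using software agents; a fixed core context plus extensible domain contexts yields inspectable, executable policies and supports mapping natural-language decision rules onto them.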

Details Motivation: To overcome the limitations of approaches that introduce ad hoc control constructs, the authors seek a clearer, compositional way to express agent policies that also makes those policies analyzable and safe. Method: They design a core context called Core1 that contains basic categories and abstract combinators; domains extend Core1 by defining their own contexts; policies are written as expressions using a single abstraction operator, with selection between alternatives replacing conventional conditional statements. Result: Worked examples from thermostat control, home security, e-commerce recommendation, and medical monitoring show that natural-language decision rules can be mapped to inspectable, executable policies. Conclusion: By separating a reusable grammar-like core from domain-specific vocabulary and treating tools as bridges between the internal and external worlds, Prism provides a compact interface language for agent control that is easy to analyze. Abstract: Prism is a small, compositional metalanguage for specifying the behaviour of tool-using software agents. Rather than introducing ad hoc control constructs, Prism is built around a fixed core context, Core1, which provides a minimal background grammar of categories (numbers, strings, user prompts, tools) together with abstract combinators for booleans, predicates, pairs, and lists. Agent policies are written as ordinary expressions using a single abstraction operator so that conditionals appear as selections between alternatives instead of imperative if-else blocks. Domains extend the core by defining their own context-mini-grammars that introduce new categories, predicates, and external tools while reusing the same compositional machinery. We illustrate this with worked examples from thermostat control, home security, e-commerce recommendation, and medical monitoring, showing how natural language decision rules can be mapped to inspectable, executable policies. From a linguistic perspective, Prism enforces a clear separation between a reusable grammar-like core and domain-specific lexicons and treats tools as bridges between internal policy representations and the external world. From an engineering perspective, it offers a compact interface language for agent control, making the space of possible actions explicit and amenable to analysis, verification, and safety constraints.

[25] ART: Adaptive Response Tuning Framework -- A Multi-Agent Tournament-Based Approach to LLM Response Optimization

Omer Jauhar Khan

Main category: cs.CL

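TL;DR: This paper proposes ART (Adaptive Response Tuning), a framework that optimizes large language model outputs through a tournament mechanism with ELO ranking and multi-agent reasoning, significantly improving the accuracy, coherence, and reliability of responses.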

Details Motivation: Single-model responses suffer from inconsistency, hallucinations, and uneven quality across domains, so a more reliable way to optimize outputs is needed. Method: A tournament-style workflow in which multiple agents compete, critique, and collaborate, combined with ELO ranking, dynamic agent selection, and several consensus fusion strategies, produces an optimized consensus response. Result: Experiments show that, compared with baseline models, ART improves overall response quality by 8.4% and reaches R² values above 0.96 for ELO rating convergence, significantly improving accuracy and consistency. Conclusion: ART offers a scalable, production-ready solution for applications that require high-quality, vetted LLM responses. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, single-model responses often exhibit inconsistencies, hallucinations, and varying quality across different query domains. This paper presents ART (Adaptive Response Tuning), a novel framework that employs tournament-style ELO ranking and multi-agent reasoning to systematically optimize LLM outputs. By enabling multiple LLM agents to compete, critique, and collaborate through structured tournament workflows, ART produces consensus responses that outperform individual model outputs. Our framework introduces configurable tournament parameters, dynamic agent selection, and multiple consensus fusion strategies. Experimental evaluations demonstrate significant improvements in response accuracy, coherence, and reliability compared to baseline single-model approaches. The ART framework provides a scalable, production-ready solution for applications requiring high-quality, vetted LLM responses, achieving an 8.4% improvement in overall quality metrics and R² values exceeding 0.96 in ELO rating convergence.
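
For readers unfamiliar with the rating scheme that ART's tournament builds on, the following Python sketch shows the standard Elo update after one pairwise comparison. It is the textbook formula, not ART's own code; the K-factor of 32 and the starting rating of 1500 are conventional defaults assumed here, and the abstract does not say which values ART uses.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float,
    rating_b: float,
    score_a: float,   # 1.0 if A wins, 0.5 for a draw, 0.0 if A loses
    k: float = 32.0,  # assumed K-factor; ART's setting is not given
) -> Tuple[float, float]:
    """Return the new ratings of A and B after one comparison."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's update mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

In a multi-agent setting, each judged pair of responses would trigger one such update for the two agents involved, and repeated rounds let the ratings settle; how ART schedules matches and fuses the resulting rankings into a consensus response is described only at a high level in the abstract.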

[26] Sycophancy Claims about Language Models: The Missing Human-in-the-Loop

Jan Batzner,Volker Stocker,Stefan Schmid,Gjergji Kasneci

Main category: cs.CL

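TL;DR: This paper reviews the methodological challenges of measuring sycophantic response patterns in large language models (LLMs), identifies five core operationalizations, points out that current research neither evaluates human perception nor cleanly separates sycophancy from related AI alignment concepts, and offers actionable recommendations for future work.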

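Details Motivation: Although sycophancy is inherently human-centric, existing research does not evaluate how humans perceive it and struggles to distinguish it from other AI alignment concepts. Method: Through a literature review, the authors identify and analyze the methodological challenges in measuring LLM sycophancy and distill five core operationalizations. Result: Current research is found to be lacking in the evaluation of human perception and has difficulty clearly separating sycophantic responses from related AI alignment concepts. Conclusion: More rigorous methods are needed to measure LLM sycophancy, with human perception included in the evaluation; the paper gives concrete recommendations for future research. Abstract: Sycophantic response patterns in Large Language Models (LLMs) have been increasingly claimed in the literature. We review methodological challenges in measuring LLM sycophancy and identify five core operationalizations. Despite sycophancy being inherently human-centric, current research does not evaluate human perception. Our analysis highlights the difficulties in distinguishing sycophantic responses from related concepts in AI alignment and offers actionable recommendations for future research.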

[27] Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs

Tanmay Agrawal

Main category: cs.CL

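TL;DR: This paper proposes a framework that uses interactive visual knowledge graphs to detect and reduce hallucinated content generated by large language models in enterprise settings, improving model reliability and response quality.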

Details Motivation: Because of limited context windows and inconsistencies between pre-training data and supplied knowledge, large language models in enterprise applications tend to produce hallucinations that are hard to spot, and existing mitigation strategies offer no deterministic assurance. Method: Proprietary knowledge and model-generated content are organized into interactive visual knowledge graphs that link model assertions to the underlying sources of truth and display confidence levels. Result: Through the visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback, forming a structured human-in-the-loop feedback loop. Conclusion: By giving an intuitive view of potential hallucination zones, the framework improves model interpretability and reliability and supports continuous quality improvement. Abstract: Large Language Models have rapidly advanced in their ability to interpret and generate natural language. In enterprise settings, they are frequently augmented with closed-source domain knowledge to deliver more contextually informed responses. However, operational constraints such as limited context windows and inconsistencies between pre-training data and supplied knowledge often lead to hallucinations, some of which appear highly credible and escape routine human review. Current mitigation strategies either depend on costly, large-scale gold-standard Q&A curation or rely on secondary model verification, neither of which offers deterministic assurance. This paper introduces a framework that organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs. The objective is to provide end users with a clear, intuitive view of potential hallucination zones by linking model assertions to underlying sources of truth and indicating confidence levels. Through this visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback. The resulting human-in-the-loop workflow creates a structured feedback loop that can enhance model reliability and continuously improve response quality.

[28] A Comparison of Human and ChatGPT Classification Performance on Complex Social Media Data

Breanna E. Green,Ashley L. Shea,Pengfei Zhao,Drew B. Margolin

Main category: cs.CL

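TL;DR: This study evaluates GPT-4 on a classification task involving nuanced language and compares it with human annotators; adding label definitions to the prompt helps, but GPT-4 still has difficulty handling complex language accurately.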

Details Motivation: To understand how generative AI tools such as ChatGPT actually perform on complex, nuanced language tasks, in particular their suitability for data annotation in computational social science. Method: The authors test the classification performance of GPT-3.5, GPT-4, and GPT-4o under four prompt styles, quantify it with precision, recall, and F1 scores, and complement this with qualitative analysis. Result: GPT-4 performs poorly overall on nuanced-language classification; adding label definitions improves performance but limitations remain; the qualitative analysis reveals four specific findings. Conclusion: ChatGPT should be used with caution in classification tasks involving nuanced language and should not fully replace human annotators. Abstract: Generative artificial intelligence tools, like ChatGPT, are an increasingly utilized resource among computational social scientists. Nevertheless, there remains space for improved understanding of the performance of ChatGPT in complex tasks such as classifying and annotating datasets containing nuanced language. In this paper, we measure the performance of GPT-4 on one such task and compare results to human annotators. We investigate ChatGPT versions 3.5, 4, and 4o to examine performance given rapid changes in technological advancement of large language models. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Both quantitative and qualitative evaluations of results demonstrate that while including label definitions in prompts may help performance, overall GPT-4 has difficulty classifying nuanced language. Qualitative analysis reveals four specific findings. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.

[29] FastPOS: Language-Agnostic Scalable POS Tagging Framework Low-Resource Use Case

Md Abdullah Al Kafi,Sumit Kumar Banshal

Main category: cs.CL

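TL;DR: Proposes a language-agnostic, transformer-based POS tagging framework for low-resource languages, demonstrated on Bangla and Hindi, with high accuracy and strong portability.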

Details Motivation: Low-resource languages lack efficient, portable POS tagging frameworks; the work aims to reduce model design and tuning overhead and to advance NLP for underrepresented languages. Method: A general transformer-based architecture with a modular, open-source design is adapted from Bangla to Hindi with only three lines of code, emphasizing language-agnosticism and ease of transfer. Result: The framework reaches 96.85% and 97% token-level accuracy on Bangla and Hindi respectively, with robust F1 scores, showing that it copes with dataset imbalance and linguistic overlap. Conclusion: The framework is portable and performs well, which helps shift research effort toward linguistic preprocessing and dataset refinement and supports NLP for low-resource languages. Abstract: This study proposes a language-agnostic transformer-based POS tagging framework designed for low-resource languages, using Bangla and Hindi as case studies. With only three lines of framework-specific code, the model was adapted from Bangla to Hindi, demonstrating effective portability with minimal modification. The framework achieves 96.85 percent and 97 percent token-level accuracy across POS categories in Bangla and Hindi while sustaining strong F1 scores despite dataset imbalance and linguistic overlap. A performance discrepancy in a specific POS category underscores ongoing challenges in dataset curation. The strong results stem from the underlying transformer architecture, which can be replaced with limited code adjustments. Its modular and open-source design enables rapid cross-lingual adaptation while reducing model design and tuning overhead, allowing researchers to focus on linguistic preprocessing and dataset refinement, which are essential for advancing NLP in underrepresented languages.

[30] Auxiliary-Hyperparameter-Free Sampling: Entropy Equilibrium for Text Generation

Xiaodong Cai,Hai Lin,Shaoxiong Zhan,Weiqi Luo,Hong-Gee Kim,Hongyan Hao,Yu Yang,Hai-Tao Zheng

Main category: cs.CL

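TL;DR: Proposes Entropy Equilibrium Sampling (EES), a sampling method with no auxiliary hyperparameters that uses information-theoretic principles to adjust the candidate set dynamically, simplifying deployment and improving generation quality.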

Details Motivation: Existing token sampling strategies for text generation depend on hyperparameter tuning, which complicates deployment and limits adaptivity. Method: Drawing on information theory, Entropy Equilibrium Sampling (EES) adaptively adjusts the candidate set by balancing normalized entropy against probability mass, without additional hyperparameters. Result: EES is validated on a range of model architectures and tasks (reasoning and generation) and maintains good accuracy, coherence, and diversity across temperature settings. Conclusion: EES is a tuning-free, easy-to-deploy sampling method that reliably improves the generation quality of large language models. Abstract: Token sampling strategies critically influence text generation quality in large language models (LLMs). However, existing methods introduce additional hyperparameters, requiring extensive tuning and complicating deployment. We present Entropy Equilibrium Sampling (EES), an auxiliary hyperparameter-free approach inspired by information theory that can dynamically adjust candidate sets by balancing normalized entropy with probability mass. We evaluate EES on both reasoning and generation tasks across a range of model architectures. Our results show that EES consistently performs well across temperature settings, delivering competitive accuracy and coherence while maintaining diversity. By eliminating the need for hyperparameter tuning, EES greatly simplifies deployment while improving performance. Code is available at https://github.com/shuanncai/EES

[31] Accelerating Bangla NLP Tasks with Automatic Mixed Precision: Resource-Efficient Training Preserving Model Efficacy

Md Mehrab Hossain Opi,Sumaiya Khan,Moshammad Farzana Rahman

Main category: cs.CL

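TL;DR: This study explores automatic mixed precision (AMP) training for Bangla natural language processing (NLP) tasks as a way to improve computational efficiency and reduce resource consumption.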

Details Motivation: Training NLP models requires substantial computational resources, and access to high-end hardware is often limited in Bangla NLP development, so an efficient, low-resource training method is needed. Method: Automatic mixed precision (AMP), which mixes 16-bit and 32-bit floating-point computation, is evaluated with four transformer-based models (BanglaBERT, BanglishBERT, XLM-R, mBERT) on four Bangla NLP tasks (sentiment analysis, named entity recognition, error classification, question answering). Result: AMP speeds up training by 44.5% and reduces memory consumption by 17.6% while keeping the F1 score within 99.7% of the full-precision baselines. Conclusion: AMP lowers hardware requirements and accelerates training, which helps bring advanced NLP techniques to resource-constrained settings and supports the development of Bangla NLP. Abstract: Training models for Natural Language Processing (NLP) requires substantial computational resources and time, posing significant challenges, especially for NLP development in Bangla, where access to high-end hardware is often limited. In this work, we explore automatic mixed precision (AMP) training as a means to improve computational efficiency without sacrificing model performance. By leveraging a dynamic mix of 16-bit and 32-bit floating-point computations, AMP lowers GPU memory requirements and speeds up training without degrading model performance. We evaluate AMP across four standard Bangla NLP tasks, namely sentiment analysis, named entity recognition, error classification, and question answering, using four transformer-based models: BanglaBERT, BanglishBERT, XLM-R, and mBERT. Our results demonstrate that AMP accelerates training by 44.5% and reduces memory consumption by 17.6%, while maintaining F-1 score within 99.7% of the full-precision baselines. This empirical study highlights AMP's potential to democratize access to state-of-the-art NLP capabilities in hardware-constrained settings by lowering computational barriers.

[32] WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models

Yukang Lin,Jiahao Shao,Shuoran Jiang,Wentao Zhu,Bingjie Lu,Xiangping Wu,Joanna Siebert,Qingcai Chen

Main category: cs.CL

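TL;DR: This paper proposes WaterSearch, a framework that controls seed pools to enable diverse parallel generation of watermarked text; it maintains high detectability while significantly improving text quality and is strongly robust to attacks.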

Details Motivation: Existing watermarking methods for LLM-generated text face a trade-off between detectability and text quality, making it hard to achieve both security and practicality. Method: The authors design an embedding scheme based on seed-pool control and propose the WaterSearch framework, which jointly optimizes distribution fidelity and watermark signal characteristics through sentence-level search, together with a matching sentence-level detection method. Result: Experiments on three popular LLMs and ten tasks show an average performance improvement of 51.01% over the best existing methods at a watermark detectability strength of 95%; in the challenging short-text and low-entropy settings the gains are 47.78% and 36.47% respectively, and detection remains high under insertion, synonym substitution, and paraphrase attacks. Conclusion: WaterSearch resolves the conflict between watermark strength and text quality, shows superior performance and robustness across generation scenarios and attack conditions, and advances practical LLM watermarking. Abstract: Watermarking acts as a critical safeguard in text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Despite their effectiveness, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness required for robust watermarking tend to degrade the performance of downstream tasks. In this paper, we design a novel embedding scheme that controls seed pools to facilitate diverse parallel generation of watermarked text. Based on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework adaptable to a wide range of existing methods. WaterSearch enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. Furthermore, WaterSearch is complemented by a sentence-level detection method with strong attack robustness. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments demonstrate that our method achieves an average performance improvement of 51.01% over state-of-the-art baselines at a watermark detectability strength of 95%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields performance gains of 47.78% and 36.47%, respectively. 
Moreover, under different attack scenarios including insertion, synonym substitution and paraphrase attacks, WaterSearch maintains high detectability, further validating its robust anti-attack capabilities. Our code is available at https://github.com/Yukang-Lin/WaterSearch.

[33] Less is More: Resource-Efficient Low-Rank Adaptation

Chunlin Tian,Xuyang Wei,Huanrong Liu,Zhijiang Guo,Li Li

Main category: cs.CL

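TL;DR: Proposes EffiLoRA, a resource-efficient low-rank adaptation method that shares a unified A matrix and selectively updates B matrices at runtime, improving the efficiency and performance of LoRA on multimodal tasks.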

Details Motivation: Although LoRA is widely used, it still suffers from parameter interference and notable overhead on complex datasets, and existing decoupling methods have not effectively reduced training cost. Method: EffiLoRA introduces a unified A matrix shared across all transformer layers and a runtime mechanism that selectively updates B matrices, dynamically balancing resource consumption against model performance. Result: EffiLoRA consistently outperforms standard LoRA on commonsense reasoning, visual instruction tuning, and image generation, showing higher efficiency and stronger robustness. Conclusion: EffiLoRA is a lightweight, general PEFT method that reduces parameter redundancy and training overhead and applies to language, multimodal, and diffusion models. Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), but it still incurs notable overhead and suffers from parameter interference in complex datasets. While recent works decouple LoRA update matrices to exploit matrix-wise asymmetry, training costs remain high. We revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and propose Resource-Efficient Low-Rank Adaptation, EffiLoRA, a lightweight and generalizable approach for language, multimodal, and diffusion models. EffiLoRA employs a unified A matrix across all transformer layers and introduces a runtime selective B matrices update to dynamically trade off the system resource budget and model performance. EffiLoRA consistently outperforms LoRA across diverse modalities, including commonsense reasoning, visual instruction tuning, and image generation, demonstrating improved efficiency and robustness.
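
The effect of sharing one A matrix across layers can be seen with simple parameter arithmetic. The Python sketch below assumes the usual LoRA factorization, in which each adapted weight of shape d_out × d_in receives an update B·A with A of shape r × d_in and B of shape d_out × r, and it assumes one adapted weight per layer. The layer count, dimensions, and rank in the usage note are hypothetical, and the sketch does not model EffiLoRA's runtime selective B update.

```python
def lora_params(layers: int, d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters when every layer has its own A and B."""
    return layers * (rank * d_in + d_out * rank)


def shared_a_params(layers: int, d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters when one A is shared and each layer keeps its B."""
    return rank * d_in + layers * d_out * rank
```

For a hypothetical configuration of 32 layers, d_in = d_out = 4096, and rank 8, per-layer LoRA has 2,097,152 trainable parameters while the shared-A variant has 1,081,344, roughly half. The saving comes entirely from no longer replicating A; the abstract's further savings from updating only some B matrices would come on top of this.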

[34] Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

Jianxiang Zang,Yongda Wei,Ruxue Bai,Shiyu Jiang,Nijia Mo,Binhong Li,Qiang Sun,Hui Liu

Main category: cs.CL

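TL;DR: This paper proposes Reward Auditor, a framework for assessing the conditional reliability ("suitability") of reward models under real-world perturbations; using hypothesis testing and analysis of how confidence distributions degrade, it exposes systematic vulnerabilities in reward models.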

Details Motivation: Existing evaluation methods look only at preference-judgment accuracy in specific given scenarios and ignore how reliable reward models are under real-world perturbations, so systematic vulnerabilities go undetected. Method: Reward Auditor applies a scientific auditing approach: under real-world perturbed scenarios, it analyzes the statistical significance and effect size of the degradation in the reward model's preference confidence distribution in order to infer suitability. Result: The framework quantifies both the certainty and the severity of reward model vulnerabilities across diverse real-world scenarios and identifies systematic flaws that conventional accuracy metrics miss. Conclusion: Reward Auditor lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy. Abstract: Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current evaluation methods focus solely on preference perception accuracies in given specific scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We identify that the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it employs scientific auditing to answer: "Can we infer RMs exhibit systematic vulnerabilities in specific real-world scenarios?". Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios. This lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.

[35] Mitigating Hallucinations in Zero-Shot Scientific Summarisation: A Pilot Study

Imane Jaaouine,Ross D. King

Main category: cs.CL

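TL;DR: This study examines whether prompt engineering (PE) methods reduce context-inconsistency hallucinations in zero-shot LLM summarisation of scientific text, and finds that context repetition and random addition significantly improve the lexical alignment of generated summaries with the source.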

Details Motivation: LLMs can produce context-inconsistency hallucinations, outputs that do not match the user prompt, particularly in zero-shot scientific summarisation, so effective mitigation methods are needed. Method: On eight yeast biotechnology paper abstracts, six instruction-tuned LLMs were compared under seven prompting methods (a baseline, two levels of increased instruction complexity, two levels of context repetition CR-K1/K2, and two levels of random addition RA-K1/K2); the 336 generated summaries were evaluated with ROUGE, BERTScore, METEOR, and cosine similarity, and analysed with BCa bootstrap confidence intervals and Wilcoxon signed-rank tests. Result: Context repetition (CR) and random addition (RA) significantly improved lexical alignment between LLM-generated summaries and the source abstracts, indicating that these prompting techniques can help mitigate hallucinations in zero-shot scientific summarisation. Conclusion: Prompt engineering, especially context repetition and random addition, helps mitigate context-inconsistency hallucinations in zero-shot scientific summarisation and has practical potential. Abstract: Large language models (LLMs) produce context inconsistency hallucinations, which are LLM-generated outputs that are misaligned with the user prompt. This research project investigates whether prompt engineering (PE) methods can mitigate context inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot indicates that the LLM relies purely on its pre-training data. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a baseline prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition involved the identification and repetition of K key sentences from the abstract, whereas random addition involved the repetition of K randomly selected sentences from the abstract, where K is 1 or 2. A total of 336 LLM-generated summaries were evaluated using six metrics: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity, which were used to compute the lexical and semantic alignment between the summaries and the abstracts. Four hypotheses on the effects of prompt methods on summary alignment with the reference text were tested. Statistical analysis on 3744 collected datapoints was performed using bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction.
The results demonstrated that CR and RA significantly improve the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to impact hallucinations in zero-shot scientific summarisation tasks.
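
The Bonferroni-Holm correction named in the abstract is a standard step-down procedure and can be sketched in a few lines. The function name and the raw p-values below are hypothetical; the study's actual p-values are not given in this digest. As a side check, the 336 summaries follow from 8 abstracts times 6 models times 7 prompting methods.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-down procedure: sort p-values ascending, compare the k-th smallest
    (k = 0, 1, ...) with alpha / (m - k), and stop at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - k):
            reject[idx] = True
        else:
            break  # every larger p-value is also retained
    return reject


# Hypothetical raw p-values from four pairwise comparisons.
raw = [0.001, 0.04, 0.03, 0.005]
decisions = holm_bonferroni(raw)
print(decisions)  # [True, False, False, True]
```

Compared with a plain Bonferroni threshold of alpha / m for every test, the step-down thresholds relax as hypotheses are rejected, which keeps the family-wise error rate controlled while losing less power.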

[36] DeformAr: Rethinking NER Evaluation through Component Analysis and Visual Analytics

Ahmed Mustafa Younes

Main category: cs.CL

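TL;DR: This paper presents DeformAr, a debugging and evaluation framework for Transformer-based Arabic named entity recognition (NER) systems, designed to analyse the performance gap between Arabic and English NER systems.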

Details Motivation: Although Transformer models perform well on English NLP tasks, a marked performance gap remains for Arabic NER, and existing studies do not fully analyse how data and model components interact. Method: Proposes the DeformAr framework, comprising a data extraction library and an interactive dashboard that support two modes, cross-component analysis and behavioural analysis; each language is decomposed into data and model components, and diagnosis combines interpretability techniques, token-level metrics, visualisation, and representation-space analysis. Result: DeformAr enables systematic diagnosis of data-model interaction effects in Arabic NER systems, detecting model behaviours and explaining how they relate to data factors and representational patterns. Conclusion: DeformAr is the first Arabic-specific, component-based interpretability tool and provides important support for model analysis in under-resourced languages. Abstract: Transformer models have significantly advanced Natural Language Processing (NLP), demonstrating strong performance in English. However, their effectiveness in Arabic, particularly for Named Entity Recognition (NER), remains limited, even with larger pre-trained models. This performance gap stems from multiple factors, including tokenisation, dataset quality, and annotation inconsistencies. Existing studies often analyze these issues in isolation, failing to capture their joint effect on system behaviour and performance. We introduce DeformAr (Debugging and Evaluation Framework for Transformer-based NER Systems), a novel framework designed to investigate and explain the performance discrepancy between Arabic and English NER systems. DeformAr integrates a data extraction library and an interactive dashboard, supporting two modes of evaluation: cross-component analysis and behavioural analysis. The framework divides each language into dataset and model components to examine their interactions. The analysis proceeds in two stages. First, cross-component analysis provides systematic diagnostic measures across data and model subcomponents, addressing the "what," "how," and "why" behind observed discrepancies. The second stage applies behavioural analysis by combining interpretability techniques with token-level metrics, interactive visualisations, and representation space analysis. These stages enable a component-aware diagnostic process that detects model behaviours and explains them by linking them to underlying representational patterns and data factors.
DeformAr is the first Arabic-specific, component-based interpretability tool, offering a crucial resource for advancing model analysis in under-resourced languages.

[37] Fine-tuning of lightweight large language models for sentiment classification on heterogeneous financial textual data

Alvaro Paredes Amorin,Andre Python,Christoph Weisser

Main category: cs.CL

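TL;DR: This study examines lightweight open-source LLMs for sentiment analysis of financial text and finds that models such as Qwen3 8B and Llama3 8B outperform conventional models such as FinBERT on public datasets in several languages and formats, even when using only 5% of the training data, which suggests an efficient, low-cost option for resource-constrained settings.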

Details Motivation: Because LLMs for financial text analysis depend on expensive compute and proprietary data, many researchers cannot use them. The paper evaluates how well lightweight open-source LLMs generalise sentiment understanding across financial data of varying size, source, format, and language, reflecting realistic resource-constrained conditions. Method: Compares the finance NLP benchmark model FinBERT with three open-source lightweight LLMs (DeepSeek-LLM 7B, Llama3 8B Instruct, Qwen3 8B) on five public financial datasets (FinancialPhraseBank, Financial Question Answering, Gold News Sentiment, Twitter Sentiment, Chinese Finance Sentiment), covering zero-shot and few-shot settings and testing different training-data fractions (down to 5%). Result: Qwen3 8B and Llama3 8B perform best in most cases; even with only 5% of the training data they match or exceed FinBERT, and results remain stable in zero-shot and few-shot settings, indicating that lightweight open-source LLMs generalise well to heterogeneous financial text. Conclusion: Lightweight open-source LLMs are a cost-effective alternative for financial sentiment analysis, achieving performance comparable to or better than mainstream models with limited training data, which lowers the barrier to research and application. Abstract: Large language models (LLMs) play an increasingly important role in financial markets analysis by capturing signals from complex and heterogeneous textual data sources, such as tweets, news articles, reports, and microblogs. However, their performance is dependent on large computational resources and proprietary datasets, which are costly, restricted, and therefore inaccessible to many researchers and practitioners. To reflect realistic situations we investigate the ability of lightweight open-source LLMs - smaller and publicly available models designed to operate with limited computational resources - to generalize sentiment understanding from financial datasets of varying sizes, sources, formats, and languages. We compare the benchmark finance natural language processing (NLP) model, FinBERT, and three open-source lightweight LLMs, DeepSeek-LLM 7B, Llama3 8B Instruct, and Qwen3 8B on five publicly available datasets: FinancialPhraseBank, Financial Question Answering, Gold News Sentiment, Twitter Sentiment and Chinese Finance Sentiment. We find that LLMs, especially Qwen3 8B and Llama3 8B, perform best in most scenarios, even when using only 5% of the available training data. These results hold in zero-shot and few-shot learning scenarios.
Our findings indicate that lightweight, open-source large language models (LLMs) constitute a cost-effective option, as they can achieve competitive performance on heterogeneous textual data even when trained on only a limited subset of the extensive annotated corpora that are typically deemed necessary.

[38] Table as a Modality for Large Language Models

Liyao Li,Chao Ye,Wentao Ye,Yifei Sun,Zhe Jiang,Haobo Wang,Jiaming Tian,Yiming Zhang,Ningtao Wang,Xing Fu,Gang Chen,Junbo Zhao

Main category: cs.CL

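TL;DR: This paper proposes TAMO, a framework that treats tables as an independent modality fused with text, pairing a hypergraph neural network as a global table encoder with an LLM so that table structure is preserved; it reports an average relative improvement of 42.65% across several benchmark datasets.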

Details Motivation: LLMs typically handle tabular data through serialised input, which loses structural information and hampers table understanding, so a method that better preserves and exploits table structure is needed. Method: Proposes the TAMO framework, which uses a hypergraph neural network as the table encoder and treats the table as an independent modality integrated with text tokens, yielding a multimodal architecture that integrates with mainstream LLMs. Result: Validated on HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, with significant gains in generalisation and an average relative gain of 42.65%. Conclusion: Tables should be treated as an independent modality rather than as simple serialised input; by preserving structural information, TAMO substantially improves LLM performance on table reasoning and offers a new direction for multimodal modelling. Abstract: To migrate the remarkable successes of Large Language Models (LLMs), the community has made numerous efforts to generalize them to the table reasoning tasks for the widely deployed tabular data. Despite that, in this work, by showing a probing experiment on our proposed StructQA benchmark, we postulate that even the most advanced LLMs (such as GPTs) may still fall short of coping with tabular data. More specifically, the current scheme often simply relies on serializing the tabular data, together with the meta information, then inputting them through the LLMs. We argue that the loss of structural information is the root of this shortcoming. In this work, we further propose TAMO, which bears an ideology to treat the tables as an independent modality integrated with the text tokens. The resulting model in TAMO is a multimodal framework consisting of a hypergraph neural network as the global table encoder seamlessly integrated with the mainstream LLM. Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvements on generalization with an average relative gain of 42.65%.
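
One way to see what serialisation discards is to write a table as a hypergraph, where each cell is a node and each row and each column is a hyperedge joining its cells. The sketch below uses that construction as an assumption for illustration; the abstract does not specify how TAMO builds its hypergraph, and the function name and toy table are hypothetical.

```python
def table_to_hypergraph(header, rows):
    """Map a table to (nodes, hyperedges).

    nodes: cell values in row-major order.
    hyperedges: dict from an edge name to the list of node indices it joins.
    Each row and each column becomes one hyperedge. Header names are assumed
    to be unique.
    """
    n_cols = len(header)
    nodes = []
    hyperedges = {f"col:{name}": [] for name in header}
    for r, row in enumerate(rows):
        if len(row) != n_cols:
            raise ValueError(f"row {r} has {len(row)} cells, expected {n_cols}")
        row_edge = []
        for c, value in enumerate(row):
            node_id = len(nodes)
            nodes.append(value)
            row_edge.append(node_id)
            hyperedges[f"col:{header[c]}"].append(node_id)
        hyperedges[f"row:{r}"] = row_edge
    return nodes, hyperedges


# Toy table with placeholder values.
nodes, edges = table_to_hypergraph(["name", "score"], [["a", 1], ["b", 2]])
print(nodes)
print(edges)
```

In this representation a cell belongs to exactly one row edge and one column edge, so permuting rows or columns leaves the structure unchanged, whereas a serialised string changes with every reordering.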

[39] Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent

Zhihan Guo,Feiyang Xu,Yifan Li,Muzhi Li,Shuai Zou,Jiele Wu,Han Shi,Haoli Bai,Ho-fung Leung,Irwin King

Main category: cs.CL

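TL;DR: Dr.Mi-Bench is a modular-integrated benchmark for scientific deep research agents that evaluates planning, retrieval, and reasoning, and exposes key weaknesses of current research agents in multi-source retrieval and cross-domain consistency.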

Details Motivation: Existing benchmarks focus too narrowly on retrieval and favour general domains, leaving high-level planning and reasoning in scientific domains without systematic evaluation. Method: Proposes Dr.Mi-Bench and Dr.Mi-Eval, built on 200 human-annotated instances spanning 10 scientific domains, and evaluates research agents and foundational LLMs in two modes, end-to-end and isolated. Result: Experiments show fragmented performance: current research agents are strong on specific tasks but clearly weak at multi-source retrieval and cross-disciplinary consistency, and high-level planning ability is key to unlocking the reasoning potential of foundational models. Conclusion: As a diagnostic tool, Dr.Mi-Bench identifies actionable failure modes of research agents and gives direction for building more reliable academic research assistants. Abstract: The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. We also propose a Modular-integrated Evaluation Paradigm for DR Agents (Dr.Mi-Eval), which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.

[40] Advancing Academic Chatbots: Evaluation of Non Traditional Outputs

Nicole Favero,Francesca Salute,Daniel Hardt

Main category: cs.CL

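TL;DR: This study compares two retrieval strategies, Graph RAG and Advanced RAG, on question answering and evaluates whether LLMs can generate non-traditional academic outputs such as slide decks and podcast scripts; GPT-4o mini with Advanced RAG performed best, but human reviewers remain indispensable for assessing emerging academic outputs.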

Details Motivation: To broaden LLM evaluation beyond traditional tasks by examining QA performance under more complex retrieval strategies and the ability to generate non-traditional academic content. Method: Compares Graph RAG (knowledge-graph based) and Advanced RAG (hybrid keyword-semantic search) on QA, slide, and podcast-script generation using LLaMA 3 70B and GPT-4o mini, with multi-dimensional evaluation by human raters and LLM judges. Result: GPT-4o mini with Advanced RAG gave the most accurate QA responses; Graph RAG offered limited improvement and produced more hallucinations; GPT-4o mini also performed best on slide and podcast generation, while LLaMA 3 showed promise in narrative coherence; human review was essential for catching layout and stylistic problems. Conclusion: Advanced RAG outperforms Graph RAG and GPT-4o mini performs better across tasks, but assessing emerging academic outputs well requires combining human and LLM judgment. Abstract: Most evaluations of large language models focus on standard tasks such as factual question answering or short summarization. This research expands that scope in two directions: first, by comparing two retrieval strategies for QA, Graph RAG (structured, knowledge-graph based) and Advanced RAG (hybrid keyword-semantic search); and second, by evaluating whether LLMs can generate high-quality non-traditional academic outputs, specifically slide decks and podcast scripts. We implemented a prototype combining Meta's open-weight LLaMA 3 70B and OpenAI's API-based GPT 4o mini. QA performance was evaluated using both human ratings across eleven quality dimensions and large language model judges for scalable cross validation. GPT 4o mini with Advanced RAG produced the most accurate responses. Graph RAG offered limited improvements and led to more hallucinations, partly due to its structural complexity and manual setup. Slide and podcast generation was tested with document-grounded retrieval. GPT 4o mini again performed best, though LLaMA 3 showed promise in narrative coherence. Human reviewers were crucial for detecting layout and stylistic flaws, highlighting the need for combined human-LLM evaluation in assessing emerging academic outputs.

[41] When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals

Riad Ahmed Anonto,Md Labid Al Nahiyan,Md Tanvir Hassan,Ch. Md. Rakin Haider

Main category: cs.CL

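TL;DR: This paper introduces "semantic confusion," a failure mode that captures local inconsistency in how safety-aligned language models handle different phrasings of the same intent, and builds ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters, together with three model-agnostic metrics that reveal critical structure hidden by global refusal rates.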

Details Motivation: Evaluations of safety-aligned language models rely mainly on global metrics such as false-rejection rate and overlook locally inconsistent responses to near-identical phrasings, which limits diagnosis and tuning; a framework that captures local inconsistency is needed. Method: Introduces the notion of semantic confusion and builds ParaGuard, a 10k-prompt corpus of paraphrase clusters that hold intent fixed while varying surface form; designs three model-agnostic token-level metrics (Confusion Index, Confusion Rate, and Confusion Depth) that use token embeddings, next-token probabilities, and perplexity signals to compare each refusal with its nearest accepted neighbours. Result: Experiments show that semantic confusion appears across model families and deployment guards; the metrics reveal globally unstable decision boundaries in some systems and localised pockets of inconsistency in others, and show that stricter refusal does not necessarily increase inconsistency; confusion-aware auditing separates how often a system refuses from how sensibly it refuses. Conclusion: The semantic confusion framework and metrics give developers a practical tool for diagnosing and tuning safety-aligned models, reducing false refusals while preserving safety and improving user experience. Abstract: Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.

[42] ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages

Neha Joshi,Pamir Gogoi,Aasim Mirza,Aayush Jansari,Aditya Yadavalli,Ayushi Pandey,Arunima Shukla,Deepthi Sudharsan,Kalika Bali,Vivek Seshadri

Main category: cs.CL

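TL;DR: This paper introduces ELR-1000, a multimodal dataset of 1,060 traditional recipes in 10 endangered languages from remote regions of Eastern India, intended to advance language technology for endangered languages.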

Details Motivation: Endangered languages and cultures are disappearing rapidly, existing language technology performs poorly on low-resource, culturally specific language, and benchmark datasets to support this research are lacking. Method: Crowdsources traditional recipes in endangered languages through a mobile interface designed for contributors with low digital literacy and builds the ELR-1000 dataset; evaluates several LLMs on translating the recipes and tests how added context such as cultural background affects translation quality. Result: Current LLMs translate the recipes poorly when used directly, but targeted context (language background, translation examples, and cultural-preservation guidelines) significantly improves translation quality. Conclusion: Benchmarks for underrepresented languages and culturally specific domains are needed to advance equitable, culturally aware language technology; the authors release ELR-1000 to support further research. Abstract: We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find the following: despite the models' capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context, including background information about the languages, translation examples, and guidelines for cultural preservation, leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.

[43] How do we measure privacy in text? A survey of text anonymization metrics

Yaxuan Ren,Krithika Ramesh,Yaxing Yao,Anjalie Field

Main category: cs.CL

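TL;DR: This paper systematically surveys metrics for evaluating privacy protection in text, identifies and compares six distinct privacy notions, analyses how the associated metrics capture different aspects of privacy risk, and assesses their alignment with legal standards (HIPAA and GDPR) and with user expectations grounded in HCI studies, with the aim of enabling more robust, comparable, and legally aware privacy evaluation for text anonymization.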

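Details Motivation: Text anonymization is essential for NLP research and model development in domains with sensitive data, yet evaluating whether anonymization methods protect privacy sufficiently remains an open challenge, so existing privacy metrics need to be clarified and reconciled. Method: Manually reviews 47 papers that report privacy metrics, identifies and compares six distinct privacy notions, and analyses how the associated metrics capture different aspects of privacy risk; also assesses how well these notions align with legal standards (HIPAA, GDPR) and user-centered expectations. Result: Identifies six distinct privacy notions and their associated metrics, reveals gaps between current evaluation practice and both legal compliance and user perception, and offers practical guidance on selecting and applying privacy evaluation methods. Conclusion: More robust, comparable, and legally aware privacy evaluation of text anonymization requires integrating diverse privacy metrics and weighing legal standards against users' actual expectations. Abstract: In this work, we aim to clarify and reconcile metrics for evaluating privacy protection in text through a systematic survey. Although text anonymization is essential for enabling NLP research and model development in domains with sensitive data, evaluating whether anonymization methods sufficiently protect privacy remains an open challenge. In manually reviewing 47 papers that report privacy metrics, we identify and compare six distinct privacy notions, and analyze how the associated metrics capture different aspects of privacy risk. We then assess how well these notions align with legal privacy standards (HIPAA and GDPR), as well as user-centered expectations grounded in HCI studies. Our analysis offers practical guidance on navigating the landscape of privacy evaluation approaches and further highlights gaps in current practices. Ultimately, we aim to facilitate more robust, comparable, and legally aware privacy evaluations in text anonymization.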

[44] DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks

Hyunjun Kim,Sooyoung Ryu

Main category: cs.CL

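TL;DR: DrawingBench is a verifiable evaluation framework for agentic LLMs that uses spatial reasoning and GUI interaction tasks to provide transparent, rule-based assessment, and highlights the role of external oversight in establishing trust in AI.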

Details Motivation: Existing agent benchmarks lack transparency and auditability, making it hard to verify that agents behave reliably; a verifiable, transparent evaluation framework is needed to establish trust in autonomous AI systems. Method: Proposes DrawingBench, with 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic scoring against 8 objective criteria, and a multi-turn feedback mechanism that supports external human oversight; mainstream LLMs are evaluated on spatial reasoning tasks that require generating sequences of low-level GUI actions. Result: Four mainstream LLMs were evaluated across 1,000 tests, reaching 92.8% perfect performance overall; structured external feedback gave an average improvement of +3.2% and up to +32.8%; systematic errors appeared in tool state management and long-horizon planning, and specification clarity mattered more than task complexity, with performance reaching 100% when criteria were explicit. Conclusion: Transparent, auditable evaluation frameworks such as DrawingBench can establish trust in agentic AI systems, and external oversight is more reliable than self-correction; the open-source framework offers a reusable template for trustworthy agent assessment. Abstract: As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 tests, we establish both capabilities and limitations: models achieved 92.8% perfect performance with structured external feedback driving significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity: models achieved 100% perfect performance when given explicit, verifiable criteria.
These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: https://github.com/hyunjun1121/DrawingBench

[45] TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness

Yongxin Zhou,Philippe Mulhem,Didier Schwab

Main category: cs.CL

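TL;DR: This paper proposes a RAG perturbation-temperature analysis framework that systematically studies the interaction between retrieval noise and generation temperature, shows how sensitivity to perturbations varies with the temperature setting, and offers practical guidelines for improving RAG robustness.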

Details Motivation: RAG evaluations usually analyse retrieval quality and generation parameters such as temperature in isolation and ignore their interaction, leaving system robustness poorly understood. Method: Builds a comprehensive RAG Perturbation-Temperature Analysis Framework and runs experiments on HotpotQA with open-source and proprietary LLMs, simulating noisy retrieval with three types of text perturbation and measuring their effect at different temperature settings. Result: High-temperature settings markedly amplify vulnerability to perturbations, and some perturbation types show non-linear sensitivity as temperature changes; the paper also provides an analytical framework for quantifying perturbation-temperature interactions. Conclusion: The study provides a diagnostic benchmark and analysis tools for assessing RAG robustness, along with practical advice on model selection and parameter tuning under noisy retrieval. Abstract: The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
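
For readers less familiar with the "internal temperature" in the title, the sketch below shows the usual temperature-scaled softmax over next-token logits and how raising the temperature flattens the distribution. This illustrates the general sampling mechanism only, not the paper's framework; the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

A flatter distribution gives lower-ranked tokens more probability mass, which is one plausible reason a noisy retrieved context could have a larger effect at high temperature.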

[46] Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks

Krithik Vishwanath,Mrigayu Ghosh,Anton Alyakin,Daniel Alexander Alber,Yindalon Aphinyanaphongs,Eric Karl Oermann

Main category: cs.CL

TL;DR: Generalist large language models outperform specialized clinical AI tools in medical knowledge and clinician-alignment tasks, highlighting the need for independent evaluation of clinical AI systems.

Details Motivation: There is a lack of transparent, independent evaluation for specialized clinical AI assistants despite their increasing use in medical decision-making. Method: Compared two clinical AI systems (OpenEvidence and UpToDate Expert AI) with three generalist LLMs (GPT-5, Gemini 3 Pro, Claude Sonnet 4.5) on a 1,000-item benchmark combining MedQA and HealthBench tasks. Result: Generalist models, especially GPT-5, outperformed clinical AI tools in medical knowledge, completeness, communication quality, context awareness, and safety reasoning. Conclusion: Specialized clinical AI may lag behind frontier generalist models, emphasizing the need for rigorous, independent evaluation before deployment in clinical settings. Abstract: Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.

[47] Conveying Imagistic Thinking in Traditional Chinese Medicine Translation: A Prompt Engineering and LLM-Based Evaluation Framework

Jiatong Han

Main category: cs.CL

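TL;DR: This study proposes a prompt-adjustment method within a human-in-the-loop (HITL) framework that uses cognitive scaffolding to improve LLM translation of metaphor and metonymy in the Huangdi Neijing; evaluation by multi-model simulated readers and qualitative analysis supports the method's effectiveness and replicability for translating TCM classics.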

Details Motivation: Traditional Chinese Medicine theory relies on imagistic thinking, but existing English translations are mostly literal, which makes it hard for target-language readers to reconstruct the underlying conceptual networks and limits understanding and clinical application; a cognitive translation strategy that conveys metaphor and metonymy is needed. Method: Adopts a human-in-the-loop (HITL) framework and selects four core passages from the Huangdi Neijing, using prompt engineering to guide DeepSeek V3.1 to identify metaphor and metonymy in the source text; ChatGPT 5 Pro and Gemini 2.5 Pro simulate three types of real-world readers, who score human, baseline-model, and prompt-adjusted translations on five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis (IPA). Result: The prompt-adjusted LLM translations performed best on all five cognitive dimensions, consistently across models and reader roles; the interview themes reveal differences between human and machine translation, effective strategies for conveying metaphor and metonymy, and readers' cognitive preferences. Conclusion: The study establishes a cognitively oriented, efficient, and replicable HITL pathway and offers a new paradigm for translating ancient, concept-dense texts such as those of TCM. Abstract: Traditional Chinese Medicine (TCM) theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop (HITL) framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis (IPA). Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers' cognitive preferences. This study provides a cognitive, efficient, and replicable HITL methodological pathway for the translation of ancient, concept-dense texts such as TCM.

[48] Sentiment Analysis and Emotion Classification using Machine Learning Techniques for Nagamese Language - A Low-resource Language

Ekha Morang,Surhoni A. Ngullie,Sashienla Longkumer,Teisovi Angami

Main category: cs.CL

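TL;DR: This is the first study of sentiment analysis and emotion classification for the Nagamese language; it builds a sentiment polarity lexicon of 1,195 words and combines it with machine learning methods for sentiment detection.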

Details Motivation: Nagamese is a low-resource language that has received little attention in NLP, and sentiment analysis for it is unexplored, so this work aims to advance technology for low-resource languages. Method: Builds a sentiment polarity lexicon of 1,195 Nagamese words, derives lexicon-based features plus additional features, and applies supervised learning methods, Naïve Bayes and Support Vector Machines, to classify sentiment polarity and basic emotions. Result: Classifies Nagamese text by sentiment polarity (positive, negative, neutral) and basic emotion, laying groundwork for sentiment analysis in the language. Conclusion: The study fills a gap in sentiment analysis for Nagamese and demonstrates the feasibility and potential of machine learning methods for low-resource languages. Abstract: The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified creole language developed primarily as a means of communication in trade between the people from Nagaland and people from Assam in north-east India. A substantial amount of work in sentiment analysis has been done for resource-rich languages like English, Hindi, etc. However, no work has been done on the Nagamese language. To the best of our knowledge, this is the first attempt at sentiment analysis and emotion classification for the Nagamese language. The aim of this work is to detect sentiments in terms of polarity (positive, negative and neutral) and basic emotions contained in textual content of the Nagamese language. We build a sentiment polarity lexicon of 1,195 Nagamese words and use these to build features along with additional features for supervised machine learning techniques using Naïve Bayes and Support Vector Machines. Keywords: Nagamese, NLP, sentiment analysis, machine learning
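
Lexicon-based features of the kind mentioned above can be as simple as counting how many tokens of each polarity appear in a sentence. The sketch below is a minimal illustration under that assumption; the abstract does not list the actual features, and the lexicon entries here are placeholder strings, not real Nagamese words.

```python
def lexicon_features(tokens, lexicon):
    """Count positive, negative, and neutral lexicon hits in a token list.

    lexicon maps a lowercase word to "positive", "negative", or "neutral".
    Tokens missing from the lexicon are counted as "unknown".
    """
    counts = {"positive": 0, "negative": 0, "neutral": 0, "unknown": 0}
    for token in tokens:
        counts[lexicon.get(token.lower(), "unknown")] += 1
    # A simple summary feature: net polarity of the sentence.
    counts["polarity_score"] = counts["positive"] - counts["negative"]
    return counts


# Placeholder entries standing in for a real polarity lexicon.
toy_lexicon = {"w_good": "positive", "w_bad": "negative", "w_ok": "neutral"}
features = lexicon_features(["W_good", "w_bad", "w_good", "xyz"], toy_lexicon)
print(features)
```

A feature dictionary like this could then be turned into a numeric vector and passed to a supervised classifier such as Naïve Bayes or an SVM.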

[49] SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

Zehua Zhao,Zhixian Huang,Junren Li,Siyu Lin,Junting Zhou,Fengqi Cao,Kun Zhou,Rui Ge,Tingting Long,Yuexiang Zhu,Yan Liu,Jie Zheng,Junnian Wei,Rong Zhu,Peng Zou,Wenyu Li,Zekai Cheng,Tian Ding,Yaxuan Wang,Yizhao Yan,Tingru Wei,Haowei Ming,Weijie Mao,Chen Sun,Yiming Liu,Zichen Wang,Zuo Zhang,Tong Yang,Hao Ma,Zhen Gao,Jian Pei

Main category: cs.CL

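TL;DR: SUPERChem is a new benchmark of 500 expert-curated chemistry reasoning problems for evaluating the chemical intelligence of LLMs under multimodal, multi-step reasoning; it introduces Reasoning Path Fidelity (RPF) scoring to go beyond final-answer accuracy.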

Details Motivation: Existing chemistry reasoning benchmarks suffer from oversimplified tasks, no process-level evaluation, and a mismatch with expert-level chemistry skills, so a more challenging evaluation tool closer to real research settings is needed. Method: Builds a high-quality multimodal chemistry reasoning dataset covering multiple subfields, using expert-authored solution paths and iterative curation to ensure item quality, and proposes Reasoning Path Fidelity (RPF) scoring to evaluate the accuracy of a model's reasoning process. Result: Against a human baseline of 40.3% accuracy, the best model, GPT-5 (High), reaches only 38.5%, followed by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%), so current models are close to but have not surpassed the human baseline; visual information affects models differently, and the benchmark distinguishes high-fidelity reasoners from heuristic ones. Conclusion: SUPERChem offers a more challenging and reliable benchmark and evaluation framework for chemical reasoning, helping advance LLMs toward expert-level chemical intelligence. Abstract: Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at https://huggingface.co/datasets/ZehuaZhao/SUPERChem.

[50] Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning

Jiahao Yuan,Zhiqing Cui,Hanqing Wang,Yuansheng Gao,Yucheng Zhou,Usman Naseem

Main category: cs.CL

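TL;DR: This paper presents KardiaBench, a large-scale, user-grounded benchmark for emotional understanding in dialogue, and Kardia-R1, a framework that uses reinforcement learning with interpretable rubric rewards to achieve identity-aware emotional reasoning, improving conversational systems on empathy accuracy, consistency, and safety.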

Details Motivation: Existing dialogue systems rely on situation-centric datasets without persistent user identity and use opaque, coarse-grained reward signals, which makes verifiable, personalized empathetic reasoning hard to achieve. An identity-aware, psychologically plausible, and evaluable empathetic dialogue system is therefore needed. Method: Proposes the KardiaBench dataset, containing 178,080 QA pairs in multi-turn conversations built on 671 real user profiles, with a model-in-the-loop, rubric-guided iterative refinement process to ensure psychological plausibility and persona consistency; further proposes the Kardia-R1 framework, which uses rubric-based GRPO reinforcement learning (Rubric-ERL) for interpretable, stepwise training of empathetic cognition. Result: Experiments on four mainstream LLMs show that Kardia-R1 outperforms existing methods in emotion accuracy, empathy, relevance, persona consistency, and safety. KardiaBench and Kardia-R1 will be open-sourced. Conclusion: An identity-aware, high-quality dataset combined with a reinforcement learning framework based on interpretable rubrics can effectively improve a dialogue system's personalized empathy and the verifiability of its reasoning in complex emotional interactions. Abstract: As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. 
Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms other methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at https://github.com/JhCircle/Kardia-R1.
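The abstract describes Rubric-ERL only as a GRPO-based method with rubric rewards. As a rough illustration of the GRPO ingredient alone, the sketch below computes group-relative advantages: each sampled response's reward is standardized against the other responses drawn for the same prompt. The rubric scores, the function name, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    This is the generic GRPO-style normalization; the choice of population
    standard deviation and the eps value are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores for four sampled responses to one user turn.
rubric_scores = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rubric_scores)
print(advantages)
```

Responses scored above the group mean receive a positive advantage and those below a negative one, so no separate value network is needed to provide a baseline.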

[51] Agreement-Constrained Probabilistic Minimum Bayes Risk Decoding

Koki Natsumi,Hiroyuki Deguchi,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe

Main category: cs.CL

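TL;DR: Proposes AC-PMBR decoding, a new method that uses a knowledge-distilled model to guide completion of the score matrix, significantly improving translation quality while reducing computational cost.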

Details Motivation: To improve the trade-off between quality and computational cost in minimum Bayes risk (MBR) decoding, and in particular to reduce the loss in translation quality that PMBR decoding incurs when it cuts the number of utility function calls. Method: Proposes agreement-constrained PMBR (AC-PMBR) decoding, which uses a knowledge-distilled model to guide completion of the score matrix and so reduces the need to evaluate every candidate pair. Result: On the WMT'23 English-German translation tasks in both directions, AC-PMBR decoding reduces the approximation error of matrix completion by up to 3 times compared with PMBR decoding and achieves higher translation quality at a comparable computational cost. Conclusion: AC-PMBR decoding improves the balance between translation quality and computational efficiency and offers a promising approach to high-quality translation in practical applications. Abstract: Minimum Bayes risk (MBR) decoding generates high-quality translations by maximizing the expected utility of output candidates, but it evaluates all pairwise scores over the candidate set; hence, it takes quadratic time with respect to the number of candidates. To reduce the number of utility function calls, probabilistic MBR (PMBR) decoding partially evaluates quality scores using sampled pairs of candidates and completes the missing scores with a matrix completion algorithm. Nevertheless, it degrades the translation quality as the number of utility function calls is reduced. Therefore, to improve the trade-off between quality and cost, we propose agreement-constrained PMBR (AC-PMBR) decoding, which leverages a knowledge distilled model to guide the completion of the score matrix. Our AC-PMBR decoding improved approximation errors of matrix completion by up to 3 times and achieved higher translation quality compared with PMBR decoding at a comparable computational cost on the WMT'23 En↔De translation tasks.
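To make the quadratic cost mentioned in the abstract concrete, here is a minimal sketch of plain MBR decoding (not PMBR or AC-PMBR): every candidate is scored against every other candidate used as a pseudo-reference, and the candidate with the highest average utility wins. The unigram-F1 utility and the toy candidate list are stand-ins chosen for illustration; real systems use learned or string-based MT metrics.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Toy utility: F1 over whitespace-token overlap (an illustrative stand-in)."""
    h, r = hyp.split(), ref.split()
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(h)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates, utility):
    """Return the candidate with the highest mean utility against all others.

    This makes n * (n - 1) utility calls, which is the quadratic cost
    that PMBR-style methods try to reduce.
    """
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat sat on a mat",
    "the dog ran away",
]
print(mbr_decode(candidates, unigram_f1))
```

With four candidates the loop makes 12 utility calls; PMBR, as described above, evaluates only a sample of these pairs and fills in the rest by matrix completion.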

[52] MARSAD: A Multi-Functional Tool for Real-Time Social Media Analysis

Md. Rafiul Biswas,Firoj Alam,Wajdi Zaghouani

Main category: cs.CL

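TL;DR: MARSAD is a multifunctional NLP platform for Arabic-language social media that supports multi-dimensional analysis and visualization of both real-time and historical data.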

Details Motivation: To give researchers and non-technical users an efficient, easy-to-use tool for analyzing Arabic social media content. Method: Combines flexible document storage with structured data management, provides secure data scraping through API keys, and offers a user-friendly frontend. Result: Delivers multi-dimensional analysis, including sentiment, emotion, propaganda, fact-checking, and hate speech detection, and generates detailed visual reports. Conclusion: MARSAD is an efficient, secure, and easy-to-use platform for Arabic social media analysis, suited to processing large multimodal datasets. Abstract: MARSAD is a multifunctional natural language processing (NLP) platform designed for real-time social media monitoring and analysis, with a particular focus on the Arabic-speaking world. It enables researchers and non-technical users alike to examine both live and archived social media content, producing detailed visualizations and reports across various dimensions, including sentiment analysis, emotion analysis, propaganda detection, fact-checking, and hate speech detection. The platform also provides secure data-scraping capabilities through API keys for accessing public social media data. MARSAD's backend architecture integrates flexible document storage with structured data management, ensuring efficient processing of large and multimodal datasets. Its user-friendly frontend supports seamless data upload and interaction.

[53] DyFuLM: An Advanced Multimodal Framework for Sentiment Analysis

Ruohan Zhou,Jiachen Yuan,Churui Yang,Wenzheng Huang,Guoyan Zhang,Shiyao Wei,Jiazhen Hu,Ning Xin,Md Maruf Hasan

Main category: cs.CL

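TL;DR: Proposes DyFuLM, a dynamic fusion learning model for multimodal sentiment analysis that improves coarse-grained and fine-grained sentiment classification through hierarchical dynamic fusion and gated feature aggregation modules.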

Details Motivation: To better capture hierarchical semantic representations and fine-grained emotional nuances in complex text, and to address the limitations of existing affective computing methods in multimodal feature fusion. Method: Designs the DyFuLM model, which contains a Hierarchical Dynamic Fusion module and a Gated Feature Aggregation module, to adaptively fuse multi-level features and regulate cross-layer information flow. Result: On multi-task sentiment datasets, DyFuLM achieves 82.64% coarse-grained and 68.48% fine-grained accuracy, the lowest regression errors (MAE = 0.0674, MSE = 0.0082), and the highest R² (0.6903); ablation experiments confirm the effectiveness of each module. Conclusion: DyFuLM's hierarchical feature fusion mechanism significantly improves sentiment representation and overall performance, and each module contributes substantially to feature interaction and task balance. Abstract: Understanding sentiment in complex textual expressions remains a fundamental challenge in affective computing. To address this, we propose a Dynamic Fusion Learning Model (DyFuLM), a multimodal framework designed to capture both hierarchical semantic representations and fine-grained emotional nuances. DyFuLM introduces two key modules: a Hierarchical Dynamic Fusion module that adaptively integrates multi-level features, and a Gated Feature Aggregation module that regulates cross-layer information flow to achieve balanced representation learning. Comprehensive experiments on multi-task sentiment datasets demonstrate that DyFuLM achieves 82.64% coarse-grained and 68.48% fine-grained accuracy, yielding the lowest regression errors (MAE = 0.0674, MSE = 0.0082) and the highest R^2 coefficient of determination (R^2 = 0.6903). Furthermore, the ablation study validates the effectiveness of each module in DyFuLM. When all modules are removed, the accuracy drops by 0.91% for coarse-grained and 0.68% for fine-grained tasks. Keeping only the gated fusion module causes decreases of 0.75% and 0.55%, while removing the dynamic loss mechanism results in drops of 0.78% and 0.26% for coarse-grained and fine-grained sentiment classification, respectively. These results demonstrate that each module contributes significantly to feature interaction and task balance. Overall, the experimental findings further validate that DyFuLM enhances sentiment representation and overall performance through effective hierarchical feature fusion.

[54] PromptBridge: Cross-Model Prompt Transfer for Large Language Models

Yaxuan Wang,Quan Liu,Zhenting Wang,Zichao Li,Wei Wei,Yang Liu,Yujia Bao

Main category: cs.CL

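TL;DR: This paper proposes PromptBridge, a framework that addresses "model drifting", the performance drop caused by model differences when prompts are transferred between large language models. It achieves efficient prompt transfer through a training-free cross-model prompt mapping.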

Details Motivation: Because large language models evolve rapidly, systems often have to switch models, yet prompts are model-sensitive and reusing them directly causes a significant performance drop. A method that effectively handles model drifting is therefore needed. Method: Proposes the PromptBridge framework: it first uses Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to generate optimal prompts on the source and target models; it then learns a cross-model prompt mapping from a small set of calibration tasks; for a new task, it uses this mapping to convert a source-model prompt directly into an optimized prompt for the target model. Result: Experiments show that PromptBridge significantly improves downstream task accuracy after cross-model prompt transfer in both single-agent and multi-agent settings, and reduces migration cost. Conclusion: PromptBridge offers an efficient, training-free solution for preserving prompt effectiveness as large language models evolve, supporting low-cost cross-model prompt transfer. Abstract: Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent-based workflows. In practice, systems access LLMs via commercial APIs or open-source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model-sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training-free framework that preserves prompt effectiveness under model switches, enabling cross-model prompt transfer without costly per-task or per-model re-optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain task- and model-specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross-model prompt mapping. At test time, i.e., for an unseen task, given a source-model prompt, this mapping directly produces an optimized prompt for the target model. 
Experiments in single-agent and multi-agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.

[55] Multilingual Conversational AI for Financial Assistance: Bridging Language Barriers in Indian FinTech

Bharatdeep Hazarika,Arya Suneesh,Prasanna Devadiga,Pawan Kumar Rajpoot,Anshuman B Suresh,Ahmed Ifthaquar Hussain

Main category: cs.CL

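TL;DR: Proposes a multilingual conversational AI system that supports code-mixed languages (such as Hinglish) for financial assistance in India; its multi-agent architecture improves user engagement while keeping latency low, advancing linguistic inclusion in digital financial services for emerging markets.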

Details Motivation: India's linguistic diversity means few people use English (only 10%), which hinders financial inclusion on fintech platforms; a solution that supports local and code-mixed languages is needed. Method: Uses a multi-agent architecture with language classification, function management, and multilingual response generation modules, compares multiple language models, and tunes system performance through real-world deployment. Result: The system significantly improves user engagement while keeping latency overhead low at 4-8%, and handles multilingual and code-mixed input well. Conclusion: The multilingual conversational AI system narrows the language gap in digital financial services and offers a workable answer to the linguistic diversity challenges of emerging markets. Abstract: India's linguistic diversity presents both opportunities and challenges for fintech platforms. While the country has 31 major languages and over 100 minor ones, only 10% of the population understands English, creating barriers to financial inclusion. We present a multilingual conversational AI system for a financial assistance use case that supports code-mixed languages like Hinglish, enabling natural interactions for India's diverse user base. Our system employs a multi-agent architecture with language classification, function management, and multilingual response generation. Through comparative analysis of multiple language models and real-world deployment, we demonstrate significant improvements in user engagement while maintaining low latency overhead (4-8%). This work contributes to bridging the language gap in digital financial services for emerging markets.

[56] MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification

Xabier de Zuazo,Ibon Saratxaga,Eva Navas

Main category: cs.CL

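TL;DR: This paper proposes Conformer-based decoders for the MEG Speech Detection and Phoneme Classification tasks of the LibriBrain 2025 PNPL competition; by adapting the model to raw MEG signals and adding task-specific strategies, it achieves strong results on both tasks and ranks in the leaderboard top 10.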

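Details Motivation: For speech detection and phoneme classification from magnetoencephalography (MEG) signals, the work explores efficient deep learning models suited to high-dimensional, low signal-to-noise MEG data and addresses the lack of research on MEG-specific data augmentation and training strategies. Method: Uses a compact Conformer architecture with a lightweight convolutional projection layer to process raw 306-channel MEG signals; applies a MEG-oriented SpecAugment for Speech Detection; for Phoneme Classification, uses inverse-square-root class weighting and a dynamic grouping loader to handle averaged examples, plus instance-level normalization to mitigate distribution shift on the test set. Result: On the official Standard track splits, the best systems reach 88.9% F1-macro on Speech Detection and 65.8% on Phoneme Classification, both surpassing the competition baselines and ranking in the leaderboard top 10. Conclusion: The Conformer architecture adapts well to raw MEG signals, and task-specific training strategies improve performance significantly, demonstrating its potential for decoding complex neural signals and providing a workable technical path and open-source resources for future MEG research. Abstract: We present Conformer-based decoders for the LibriBrain 2025 PNPL competition, targeting two foundational MEG tasks: Speech Detection and Phoneme Classification. Our approach adapts a compact Conformer to raw 306-channel MEG signals, with a lightweight convolutional projection layer and task-specific heads. For Speech Detection, a MEG-oriented SpecAugment provided a first exploration of MEG-specific augmentation. For Phoneme Classification, we used inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples. In addition, a simple instance-level normalization proved critical to mitigate distribution shifts on the holdout split. Using the official Standard track splits and F1-macro for model selection, our best systems achieved 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, surpassing the competition baselines and ranking within the top-10 in both tasks. For further implementation details, the technical documentation, source code, and checkpoints are available at https://github.com/neural2speech/libribrain-experiments.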
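The inverse-square-root class weighting mentioned above can be sketched in a few lines: each class weight is proportional to 1/sqrt(class count), which boosts rare classes less aggressively than plain inverse-frequency weighting. The rescaling to a mean weight of 1 and the toy phoneme labels are assumptions of this sketch; the authors' exact normalization is not given in the abstract.

```python
import math
from collections import Counter


def inverse_sqrt_class_weights(labels):
    """Weight each class by 1 / sqrt(count), rescaled so the mean weight is 1.

    The rescaling step is an assumption of this sketch, chosen so the
    overall loss scale stays comparable to the unweighted case.
    """
    counts = Counter(labels)
    raw = {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {cls: w / mean_raw for cls, w in raw.items()}


# Hypothetical, heavily imbalanced phoneme labels.
labels = ["AH"] * 100 + ["ZH"] * 4
weights = inverse_sqrt_class_weights(labels)
print(weights)
```

Here the rare class is 25 times less frequent but receives only a 5 times larger weight, which tends to keep training more stable than a 25 times weight would.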

[57] Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages

Jozef Kubík,Marek Šuppa,Martin Takáč

Main category: cs.CL

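TL;DR: Proposes an integrated fine-tuning method for low-resource languages that combines active learning (AL), data clustering, and dynamic data selection schedulers, improving model performance while reducing annotation cost.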

Details Motivation: Limited data for low-resource languages leads to weaker language models, and re-running pre-training is computationally expensive, so efficiency and effectiveness need to be improved at the fine-tuning stage. Method: Combines active learning with structured data selection strategies ("AL schedulers"), adds data clustering, and builds an integrated fine-tuning pipeline that dynamically selects the most valuable samples for annotation and training. Result: Experiments on Slovak, Maltese, Icelandic, and Turkish show that the method can cut annotation by up to 30%, raise F1 by up to 4 points, and make fine-tuning more stable. Conclusion: A fine-tuning strategy that combines active learning, clustering, and dynamic scheduling effectively improves language models for low-resource languages while significantly reducing annotation cost, and is of practical value. Abstract: Limited data for low-resource languages typically yield weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies, which we term 'Active Learning schedulers', to boost the fine-tuning process with a limited amount of training data. We connect the AL to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, clustering, and dynamic data selection schedulers to enhance the model's performance. Experiments in the Slovak, Maltese, Icelandic and Turkish languages show that the use of clustering during the fine-tuning phase together with AL scheduling can simultaneously produce annotation savings up to 30% and performance improvements up to four F1 score points, while also providing better fine-tuning stability.

[58] MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

Yexing Du,Kaiyuan Liu,Youcheng Pan,Bo Yang,Keqi Deng,Xie Chen,Yang Xiang,Ming Liu,Bin Qin,YaoWei Wang

Main category: cs.CL

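TL;DR: Proposes the MCAT framework, which uses a language scaling method and an optimized speech adapter module to support many-to-many speech translation across 70 languages and significantly improve inference efficiency.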

Details Motivation: Existing S2TT research is limited by narrow, English-centric language coverage and by low inference efficiency (long sequences slow inference down), which makes it hard to scale the many-to-many translation capabilities of multilingual large models. Method: 1) Introduces a language scaling method based on curriculum learning and a data balancing strategy that extends MLLM language support to 70 languages with mutual translation among them; 2) designs an optimized speech adapter module that compresses the speech sequence to only 30 tokens. Result: Surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 translation directions, using only about 100M trainable parameters and 10 hours of S2TT data per language, and significantly improves batch inference efficiency. Conclusion: The MCAT framework improves both the language coverage and the inference efficiency of MLLMs on multilingual S2TT, advancing low-cost, high-performance multilingual speech translation, and has been open-sourced to support community research. Abstract: Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs' many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The experimental results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions but also enhances batch inference efficiency. This is achieved with only ~100M trainable parameters and by using only 10 hours of S2TT data per language. Furthermore, we have released MCAT as open-source to promote the development of MLLMs for robust S2TT capabilities. The code and models are released at https://github.com/yxduir/m2m-70.

[59] Language Diversity: Evaluating Language Usage and AI Performance on African Languages in Digital Spaces

Edward Ajayi,Eudoxie Umwari,Mawuli Deku,Prosper Singadi,Jules Udahemuka,Bekalu Tadele,Chukuemeka Edeh

Main category: cs.CL

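TL;DR: This study examines how African languages are represented in digital spaces and finds that current language detection tools, when applied to Yoruba, Kinyarwanda, and Amharic, perform poorly on social media data mixed with English but are highly accurate on clean-language data from local news. It shows that professional news content is a more reliable data source for training AI models for African languages and calls for new models that can handle both clean and code-switched text.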

Details Motivation: Online conversational data in African languages is scarce and mostly code-mixed, so existing language detection tools struggle to identify it accurately, and the lack of datasets representing authentic monolingual communication holds back AI models for African languages. Method: Collects text in three African languages from Reddit and local news media, compares their language-use characteristics, and evaluates language detection tools, including AfroLID and a general-purpose LLM, on both kinds of data. Result: Language detection models achieve near-perfect identification on clean news data but perform poorly on code-mixed Reddit data; news content is not only linguistically clean but also encourages users to interact in the local language, making it a higher-quality data source. Conclusion: Local news media are a more reliable and efficient source of monolingual data for building AI models for African languages than conversational data from social platforms; future work should develop language detection models that handle both monolingual and code-mixed text to improve practical usefulness. Abstract: This study examines the digital representation of African languages and the challenges this presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and not representative of the authentic, monolingual conversations prevalent among native speakers. This lack of readily available authentic data online creates a challenge of scarcity of conversational data for training language models. To investigate this, data was collected from subreddits and local news sources for each language. The analysis showed a stark contrast between the two sources. Reddit data was minimal and characterized by heavy code-switching. Conversely, local news media offered a robust source of clean, monolingual language data, which also prompted more user engagement in the local language on the news publishers' social media pages. Language detection models, including the specialized AfroLID and a general LLM, performed with near-perfect accuracy on the clean news data but struggled with the code-switched Reddit posts. The study concludes that professionally curated news content is a more reliable and effective source for training context-rich AI models for African languages than data from conversational platforms. It also highlights the need for future models that can process clean and code-switched text to improve the detection accuracy for African languages.

[60] MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark

Yuezhang Peng,Chonghao Cai,Ziang Liu,Shuai Fan,Sheng Jiang,Hua Xu,Yuxin Liu,Qiguang Chen,Kele Xu,Yao Li,Sheng Wang,Libo Qin,Xie Chen

Main category: cs.CL

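TL;DR: This paper proposes MAC-SLU, a new multi-intent automotive cabin spoken language understanding dataset that raises the complexity of the SLU task, and uses it to benchmark mainstream large language models and large audio language models comprehensively, finding that supervised fine-tuning still outperforms in-context learning and that end-to-end LALMs perform on par with pipeline approaches.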

Details Motivation: Existing SLU datasets lack diversity and complexity, and there is no unified benchmark for large language models and large audio language models. Method: Builds MAC-SLU, a multi-intent automotive cabin spoken language understanding dataset, and systematically evaluates mainstream open-source LLMs and LALMs under in-context learning, supervised fine-tuning, end-to-end, and pipeline paradigms. Result: Experiments show that although LLMs and LALMs can complete SLU tasks through in-context learning, their performance still lags significantly behind supervised fine-tuning; end-to-end LALMs perform on par with pipeline approaches and avoid error propagation from speech recognition. Conclusion: MAC-SLU makes the SLU task more challenging and provides a new benchmark for evaluating advanced models; supervised fine-tuning is currently still the better strategy, and end-to-end LALMs show potential for application. Abstract: Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering methods like in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition. Code (https://github.com/Gatsby-web/MAC_SLU) and datasets (huggingface.co/datasets/Gatsby1984/MAC_SLU) are released publicly.

[61] Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems

Dengyun Peng,Qiguang Chen,Bofei Liu,Jiannan Guan,Libo Qin,Zheng Yan,Jinhao Liu,Jianshu Zhang,Wanxiang Che

Main category: cs.CL

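TL;DR: This paper proposes the UnsolvableQA dataset and the UnsolvableRL framework to improve large models' judgment on solvable versus unsolvable problems, so that they detect inherent contradictions in a problem and reasonably decline tasks beyond their capability.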

Details Motivation: Current large models struggle to distinguish the inherent unsolvability of a problem from the limits of their own capability, which leads to overconfidence and hallucination and undermines reliability. Method: Builds the UnsolvableQA dataset with a dual-track approach: programmatic generation for logic puzzles, and a "Reverse Construction" method that injects contradictions into mathematical reasoning chains; on this basis designs UnsolvableRL, a reinforcement learning framework with three reward components covering accuracy, unsolvability, and difficulty. Result: The approach achieves near-perfect detection of unsolvable problems while also improving accuracy on solvable tasks, and identifies a "Capability Collapse" phenomenon, indicating that exposure to unsolvable data is essential for preventing systematic overconfidence. Conclusion: Explicitly training models to recognize unsolvable problems is key to improving their reliability and prudence in decision-making, and UnsolvableQA and UnsolvableRL provide an effective way to do so. Abstract: Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model's competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel "Reverse Construction" method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at https://github.com/sfasfaffa/unsolvableQA.

[62] MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications

Stefano Zeppieri

Main category: cs.CL

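TL;DR: This paper proposes the Mixed Memory-Augmented Generation (MMAG) framework, which uses a five-layer memory structure to improve the coherence, personalization, and situational adaptability of large language models in long-term interactions.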

Details Motivation: Large language models perform well in single-turn generation but struggle to maintain relevance, personalization, and continuity over long conversations, so their memory capabilities need to be strengthened by drawing on human memory mechanisms. Method: Designs a five-layer memory framework comprising conversational memory, long-term user memory, episodic and event-linked memory, sensory and context-aware memory, and short-term working memory, and applies it in the Heero conversational agent. Result: The implementation in the Heero agent shows that encrypted long-term user bios and conversational history improve user engagement and retention, and that the framework supports memory coordination, prioritization, and conflict resolution. Conclusion: MMAG provides a workable foundation for building LLM agents that are more coherent, proactive, and human-centered. Abstract: Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.

[63] Self-Supervised Borrowing Detection on Multilingual Wordlists

Tim Wientzek

Main category: cs.CL

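TL;DR: Proposes a fully self-supervised method for borrowing detection in multilingual wordlists that combines PMI similarity with contrastive learning and matches or exceeds supervised methods without any labeled data.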

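Details Motivation: Existing borrowing detection methods based on string similarity (such as NED and SCA) perform poorly in multilingual settings and depend on manually labeled data, which limits their scalability and generality. Method: Combines PMI similarity based on a global correspondence model with a lightweight contrastive learning component trained on phonetic feature vectors, and designs an unsupervised procedure that selects decision thresholds automatically. Result: On benchmark datasets, PMI alone already outperforms methods such as NED and SCA, and the combined similarity measure matches or exceeds supervised baselines; an ablation study confirms the importance of character encoding, temperature settings, and augmentation strategies. Conclusion: The method requires no manual supervision, scales to datasets of different sizes, and comes with a command-line tool for practical research use, offering an efficient, practical self-supervised solution for borrowing detection. Abstract: This paper presents a fully self-supervised approach to borrowing detection in multilingual wordlists. The method combines two sources of information: PMI similarities based on a global correspondence model and a lightweight contrastive component trained on phonetic feature vectors. It further includes an automatic procedure for selecting decision thresholds without requiring labeled data. Experiments on benchmark datasets show that PMI alone already improves over existing string similarity measures such as NED and SCA, and that the combined similarity performs on par with or better than supervised baselines. An ablation study highlights the importance of character encoding, temperature settings and augmentation strategies. The approach scales to datasets of different sizes, works without manual supervision and is provided with a command-line tool that allows researchers to conduct their own studies.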
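For readers unfamiliar with the NED baseline named above, the sketch below assumes NED stands for normalized edit distance, as is common in computational historical linguistics: the Levenshtein distance between two word forms divided by the length of the longer one. This illustrates the baseline only, not the paper's PMI or contrastive components, and the example strings are arbitrary.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length; 0.0 means identical."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


print(normalized_edit_distance("kitten", "sitting"))
```

A borrowing detector built on such a measure would flag word pairs from different languages whose distance falls below a threshold, which is why threshold selection matters for every method compared in the abstract.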

[64] Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks

Jiannan Guan,Qiguang Chen,Libo Qin,Dengyun Peng,Jinhao Liu,Liangyu Huo,Jian Xie,Wanxiang Che

Main category: cs.CL

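TL;DR: This paper introduces the concept of "reasoning overconfidence", shows that large language models perform poorly on multi-solution tasks, and presents the MuSoBench benchmark for evaluation; it finds that long chain-of-thought (Long-CoT) mitigates the problem through iterative exploration and proposes the "cognitive-rigidity hypothesis" to explain its cause.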

Details Motivation: Large language models perform well on single-answer reasoning tasks but poorly on multi-solution tasks that require diverse answers; this paper investigates the root cause and offers methods for evaluation and improvement. Method: Builds the multi-solution benchmark MuSoBench, compares short chain-of-thought (Short-CoT) with long chain-of-thought (Long-CoT), and uses attention-entropy analysis to test the cognitive-rigidity hypothesis. Result: Experiments show that Short-CoT exhibits pronounced reasoning overconfidence, whereas Long-CoT significantly mitigates it through iterative exploration and self-reflection; the attention-entropy analysis supports the view that cognitive rigidity causes overconfidence. Conclusion: The limitations of large language models on multi-solution tasks stem from reasoning that converges too early; evaluation should move from single-answer correctness toward measuring the comprehensiveness of reasoning, and Long-CoT helps improve diversity and completeness. Abstract: Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to reasoning overconfidence: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce MuSoBench, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the cognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
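The abstract does not say how the attention-entropy analysis is carried out. As a generic illustration of the quantity involved, the sketch below computes the Shannon entropy of a single attention distribution: low entropy means attention is concentrated on a few positions, high entropy means it is spread out. The function name and the toy weight vectors are assumptions made for this example.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    The weights are renormalized first, so unnormalized non-negative
    scores are accepted; zero entries contribute nothing to the sum.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical attention rows over four positions.
spread_out = [0.25, 0.25, 0.25, 0.25]
concentrated = [0.97, 0.01, 0.01, 0.01]
print(attention_entropy(spread_out), attention_entropy(concentrated))
```

Under the cognitive-rigidity hypothesis, one would expect a model that converges early on a narrow set of thought paths to show the more concentrated, lower-entropy pattern, though the paper's exact measurement may differ.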

[65] Reasoning About the Unsaid: Misinformation Detection with Omission-Aware Graph Inference

Zhengjia Wang,Danding Wang,Qiang Sheng,Jiaying Wu,Juan Cao

Main category: cs.CL

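TL;DR: This paper proposes OmiGraph, the first misinformation detection framework that focuses on omitted information; by building an omission-aware graph and modeling contextual dependencies, it significantly improves detection performance.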

Details Motivation: Existing research mostly targets misinformation with explicitly fabricated content and overlooks deception that misleads by implicitly omitting key information; this paper aims to fill that gap. Method: Proposes the OmiGraph framework: 1) builds an omission-aware graph from the contextual environment of an event; 2) designs omission-oriented relation modeling to capture contextual dependencies and dynamic omission intents; 3) introduces omission-aware message passing and aggregation to extract omission patterns. Result: On two large-scale benchmark datasets, OmiGraph improves F1 by 5.4% and accuracy by 5.3% on average over baseline methods. Conclusion: Taking the omission perspective into account effectively strengthens misinformation detection, and OmiGraph offers a new line of thinking and an effective solution in this direction. Abstract: This paper investigates the detection of misinformation, which deceives readers by explicitly fabricating misleading content or implicitly omitting important information necessary for informed judgment. While the former has been extensively studied, omission-based deception remains largely overlooked, even though it can subtly guide readers toward false conclusions under the illusion of completeness. To pioneer in this direction, this paper presents OmiGraph, the first omission-aware framework for misinformation detection. Specifically, OmiGraph constructs an omission-aware graph for the target news by utilizing a contextual environment that captures complementary perspectives of the same event, thereby surfacing potentially omitted contents. Based on this graph, omission-oriented relation modeling is then proposed to identify the internal contextual dependencies, as well as the dynamic omission intents, formulating a comprehensive omission relation representation. Finally, to extract omission patterns for detection, OmiGraph introduces omission-aware message-passing and aggregation that establishes holistic deception perception by integrating the omission contents and relations. Experiments show that, by considering the omission perspective, our approach attains remarkable performance, achieving average improvements of +5.4% F1 and +5.3% ACC on two large-scale benchmarks.

[66] InnoGym: Benchmarking the Innovation Potential of AI Agents

Jintian Zhang,Kewei Xu,Jingsheng Zheng,Zhuoyun Yu,Yuqi Zhu,Yujie Luo,Lanning Wei,Shuofei Qiao,Lun Du,Da Zheng,Shumin Deng,Huajun Chen,Ningyu Zhang

Main category: cs.CL

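TL;DR: InnoGym is the first benchmark and framework for systematically evaluating the innovation potential of AI agents; it proposes two complementary metrics, performance gain and novelty, covers 18 real-world engineering and scientific tasks, and provides the unified execution environment iGym. Experiments show a gap between the methodological novelty of current agents and their actual performance gains.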

Details Motivation: Existing benchmarks mainly measure the correctness of solutions and overlook the diversity of methods behind them, whereas true innovation requires assessing the originality of the approach and not only whether the result is correct. Method: Proposes the InnoGym benchmark and framework, with two metrics, performance gain (improvement over the best-known solutions) and novelty (methodological difference from prior approaches); builds a dataset of 18 real-world tasks; and develops iGym, a unified execution environment that supports reproducible, long-horizon evaluation. Result: Experiments find that some agents produce novel approaches, but their lack of robustness limits performance gains, revealing a key gap between creativity and effectiveness. Conclusion: Evaluating innovation requires considering both correctness and methodological originality; InnoGym offers a new standard for measuring the real innovation capability of AI systems and highlights the shortcomings of current AI agents in robustness and practical improvement. Abstract: LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.

[67] Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability

Jinghan Jia,Nathalie Baracaldo,Sijia Liu

Main category: cs.CL

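TL;DR: This paper studies safety alignment in large reasoning models (LRMs) and proposes using reinforcement learning (RL) to replace or complement supervised fine-tuning (SFT), suppressing unsafe behavior in the reasoning process more stably while preserving the model's reasoning ability.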

Details Motivation: Large reasoning models improve logical and mathematical problem solving by generating explicit chain-of-thought (CoT), but intermediate reasoning steps may contain unsafe behavior. Existing safety alignment methods based on supervised fine-tuning are inconsistent, generalize poorly, and harm reasoning ability, so a more robust alignment method is needed. Method: Proposes a reinforcement learning (RL) framework for safety training that optimizes the model policy directly with reward signals, validates it experimentally across multiple model families and benchmarks, and analyzes reflection dynamics and token-level entropy to understand how RL works. Result: Experiments show that, compared with SFT, RL delivers stronger and more consistent safety gains while maintaining reasoning performance; the analysis shows that RL suppresses unsafe exploratory reasoning while preserving the necessary reflective depth. Conclusion: Reinforcement learning is a more effective, stable, and generalizable safety alignment method for large reasoning models and offers a new direction for building safe and reliable reasoning systems. Abstract: Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.

[68] BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages

Hrishikesh Terdalkar,Kirtan Bhojani,Aryan Dongare,Omm Aditya Behera

Main category: cs.CL

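TL;DR: This paper presents BHRAM-IL, a benchmark for hallucination detection in multiple Indian languages, covering Hindi, Gujarati, Marathi, Odia, and English. It contains 36,047 questions and is used to evaluate 14 multilingual LLMs on cross-lingual and factual hallucinations.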

Details Motivation: Although LLMs are widely used in multilingual applications, their hallucinations in under-resourced Indian languages remain understudied, and a suitable evaluation benchmark is lacking. Method: Builds BHRAM-IL, a multilingual hallucination recognition benchmark with 36,047 curated questions across nine categories (spanning factual, numerical, reasoning, and linguistic tasks); evaluates 14 mainstream multilingual LLMs on a 10,265-question subset; and uses category-specific metrics normalized to the 0-1 range to analyze hallucination across languages, models, scales, categories, and domains. Result: Aggregated over all categories and models, the primary score is 0.23 and the language-corrected fuzzy score is 0.385, indicating that current multilingual models still hallucinate heavily in Indian languages and that BHRAM-IL can support hallucination detection research. Conclusion: BHRAM-IL fills a gap in hallucination detection for Indian languages, provides a standardized evaluation tool for multilingual hallucination research, and supports future mitigation work in low-resource languages. Abstract: Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, and Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to the (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset and the code for generation and evaluation are available on GitHub (https://github.com/sambhashana/BHRAM-IL/) and HuggingFace (https://huggingface.co/datasets/sambhashana/BHRAM-IL/) to support future research in multilingual hallucination detection and mitigation.

[69] Cross-Lingual Interleaving for Speech Language Models

Adel Moumen,Guangzhi Sun,Philip C. Woodland

Main category: cs.CL

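TL;DR: This paper proposes a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision to improve multilingual spoken language models (SLMs), and releases synthetic EN-FR datasets for training and evaluation. The method improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens hidden-state alignment.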

Details Motivation: Because spoken evaluation benchmarks and training data are scarce across languages, SLM research has focused mainly on English, which limits multilingual learning. This paper aims to advance multilingual SLMs with a cross-lingual training method and public datasets. Method: Proposes a cross-lingual interleaving method that mixes speech tokens from different languages during training without textual supervision, and uses GPT-4 to generate an EN-FR TinyStories training dataset together with StoryCloze and TopicCloze evaluation benchmarks. Result: On 360M and 1B SLMs under matched training-token budgets, the method improves monolingual semantic accuracy, achieves robust cross-lingual speech continuation, and strengthens cross-lingual hidden-state alignment. Conclusion: Cross-lingual interleaving is a simple and scalable way to build multilingual SLMs that understand and converse across languages; all resources will be open-sourced to support reproducibility. Abstract: Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.
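The abstract does not specify at what granularity speech tokens are mixed. One possible reading, sketched below purely as an assumption, is that parallel EN and FR stories are split into aligned segments of discrete speech units, and the training sequence switches language at segment boundaries with some probability. The function name, `p_switch`, and the segment alignment are all hypothetical.

```python
import random

def interleave(en_segments, fr_segments, p_switch=0.5, seed=0):
    """Build one training sequence from aligned EN/FR segments of speech tokens.

    At each segment boundary the active language flips with probability p_switch
    (hypothetical policy; the paper's actual mixing rule is not given here).
    """
    rng = random.Random(seed)
    lang = "en"
    sequence = []
    for en_seg, fr_seg in zip(en_segments, fr_segments):
        if rng.random() < p_switch:
            lang = "fr" if lang == "en" else "en"
        sequence.extend(en_seg if lang == "en" else fr_seg)
    return sequence

# Toy discrete units: small integers stand in for EN units, 100+ for FR units.
en = [[1, 2], [3], [4, 5]]
fr = [[101], [102, 103], [104]]
mixed = interleave(en, fr, p_switch=0.5, seed=0)
```

With `p_switch=0.0` the output is purely English, and with `p_switch=1.0` the language alternates at every segment, which makes the two extremes easy to check.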

[70] Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models

Heloisa Candello,Muneeza Azmat,Uma Sushmitha Gunturi,Raya Horesh,Rogerio Abreu de Paula,Heloisa Pimentel,Marcelo Carpinette Grave,Aminat Adebiyi,Tiago Machado,Maysa Malfiza Garcia de Macedo

Main category: cs.CL

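TL;DR: Through a mixed-methods experiment, this study evaluates a mitigation strategy for generative AI responses along faithfulness, fairness, harm-removal capacity, and relevance. It finds that participants' native language, AI work experience, and annotation familiarity significantly affect their ratings, and that people are highly sensitive to linguistic and contextual details.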

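Details Motivation: As generative AI spreads rapidly, its tendency to hallucinate and produce harmful content has raised concern; many mitigation measures exist, but little is known about how humans perceive them. Method: A mixed-methods experiment with a within-subject design in which 57 participants rated AI responses along multiple dimensions under two conditions: a harmful response shown with its mitigation, and the mitigated response shown alone. Result: Participants' native language, AI work experience, and annotation experience significantly affected ratings; participants were sensitive to grammar errors but valued semantic coherence; unlike conventional quantitative LLM evaluation, linguistic and contextual factors played a key role in human judgments. The study also proposes new metrics for training and evaluating mitigation strategies. Conclusion: Human perception of mitigation strategies for AI-generated content is strongly shaped by individual background and linguistic detail; future evaluation should incorporate human feedback and balance language quality against context preservation. Abstract: With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is their 'aptitude' for hallucinating and generating harmful content. Despite major efforts to implement guardrails, human perceptions of these mitigation strategies are largely unknown. We conducted a mixed-method experiment for evaluating the responses of a mitigation strategy across multiple dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed the responses under two conditions: harmful response plus its mitigation and solely mitigated response. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammar errors while rewarding preserved semantic contexts. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduced new metrics for training and evaluating mitigation strategies and insights for human-AI evaluation studies.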

[71] OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation

Jinzheng Yu,Yang Xu,Haozhen Li,Junqi Li,Yifan Feng,Ligu Zhu,Hao Shen,Lei Shi

Main category: cs.CL

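TL;DR: This paper defines the Automated Online Public Opinion Report Generation (OPOR-GEN) task, builds the event-centric dataset OPOR-BENCH, and designs the agent-based evaluation framework OPOR-EVAL, providing a systematic foundation for research in this area.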

Details Motivation: Although LLMs make automated public opinion report generation feasible, the area still lacks a systematic task definition and benchmark datasets, so a standardized research framework is needed. Method: Defines the OPOR-GEN task, builds the OPOR-BENCH dataset covering 463 crisis events, and proposes the agent-based OPOR-EVAL framework to simulate human expert evaluation. Result: Experiments show that OPOR-EVAL correlates highly with human judgments, which validates the framework. Conclusion: Through its task definition, dataset, and evaluation framework, the paper lays a solid foundation for research on automated public opinion report generation. Abstract: Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises. While large language models have made automated report generation technically feasible, systematic research in this specific area remains notably absent, particularly lacking formal task definitions and corresponding benchmarks. To bridge this gap, we define the Automated Online Public Opinion Report Generation (OPOR-GEN) task and construct OPOR-BENCH, an event-centric dataset covering 463 crisis events with their corresponding news articles, social media posts, and a reference summary. To evaluate report quality, we propose OPOR-EVAL, a novel agent-based framework that simulates human expert evaluation by analyzing generated reports in context. Experiments with frontier models demonstrate that our framework achieves high correlation with human judgments. Our comprehensive task definition, benchmark dataset, and evaluation framework provide a solid foundation for future research in this critical domain.

[72] Latent Debate: A Surrogate Framework for Interpreting LLM Thinking

Lihu Chen,Xiang Yin,Francesca Toni

Main category: cs.CL

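TL;DR: Proposes the "latent debate" framework for interpreting the implicit supporting and attacking signals that arise inside an LLM during a single inference, and reveals how these signals relate to hallucination.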

Details Motivation: Understanding the internal thinking process of LLMs and the causes of hallucination is a key open challenge; existing methods rely on explicit debate among multiple models or multiple answers and cannot capture the implicit reasoning dynamics inside a single model. Method: Proposes a model- and task-agnostic conceptual framework for latent debate and instantiates it symbolically on True/False tasks to approximate the LLM's thinking process. Result: Experiments show that latent debate reproduces the original LLM's predictions with high consistency and is a faithful structured surrogate model; it can also be used for hallucination detection, and a high degree of latent debate in the middle layers is strongly associated with hallucination risk. Conclusion: Latent debate offers a new framework for understanding the internal mechanisms of LLMs, especially where internal disagreement appears during inference, and provides a strong baseline for hallucination detection. Abstract: Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike the current work of self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model that has highly consistent predictions with the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns; for example, a high degree of latent debate in the middle layers is linked to a higher risk of hallucinations. These findings position latent debate as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during the inference steps.

[73] Rectifying LLM Thought from Lens of Optimization

Junnan Liu,Hongwei Liu,Songyang Zhang,Kai Chen

Main category: cs.CL

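TL;DR: This paper proposes RePro, which takes an optimization view of chain-of-thought (CoT) reasoning by modeling it as a gradient descent process, and introduces a process-level reward to improve LLM reasoning.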

Details Motivation: Long chain-of-thought (long-CoT) reasoning improves capability but often leads to suboptimal behaviors such as overthinking and overly long reasoning chains, which hurt performance. Method: Treats CoT as a gradient descent process, designs a dual scoring mechanism (intensity and stability) to build a process-level reward, and integrates that reward into reinforcement learning with verifiable rewards (RLVR) pipelines for post-training. Result: Experiments across multiple RL algorithms and different LLMs show that RePro consistently improves reasoning performance on mathematics, science, and coding benchmarks and mitigates suboptimal reasoning behaviors. Conclusion: RePro provides an effective and general framework for optimizing LLM reasoning, improving the quality of long-CoT reasoning through process-level rewards. Abstract: Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.

[74] How Far Are We from Genuinely Useful Deep Research Agents?

Dingling Zhang,He Zhu,Jincheng Ren,Kangqi Song,Xinran Zhou,Boyu Feng,Shudong Liu,Jiabin Luo,Weihao Xie,Zhaohui Wang,Tianrui Qin,King Zhu,Yuqing Wang,Qianben Chen,Yuchen Eleanor Jiang,Wei Wang,Jiaheng Liu,Wangchunshu Zhou

Main category: cs.CL

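TL;DR: This paper presents the FINDER benchmark and the DEFT failure taxonomy for evaluating how well Deep Research Agents (DRAs) generate comprehensive reports, revealing key weaknesses of current DRAs in evidence integration, verification, and reasoning-resilient planning.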

Details Motivation: Existing DRAs are mostly validated on question-answering tasks, and their ability to generate complete reports has not been systematically evaluated; current report-generation benchmarks also suffer from task complexity and subjective metrics, so they fail to reflect real user needs. Method: Builds the FINDER benchmark with 100 human-curated research tasks and 419 structured checklist items, and, based on roughly 1,000 DRA-generated reports, constructs the DEFT failure taxonomy using grounded theory with human-LLM co-annotation. Result: Current DRAs do not fail at task comprehension; they struggle with evidence integration, verification, and reasoning-resilient planning. DEFT identifies 14 fine-grained failure modes across reasoning, retrieval, and generation. Conclusion: FINDER and DEFT provide a finer-grained, quantifiable evaluation framework for deep research agents and push them toward more reliable and practical automated report generation. Abstract: Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics -- this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.

[75] The Art of Scaling Test-Time Compute for Large Language Models

Aradhye Agarwal,Ayan Sengupta,Tanmoy Chakraborty

Main category: cs.CL

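TL;DR: This study presents the first large-scale systematic evaluation of test-time scaling (TTS) strategies for LLM reasoning. It finds that no single strategy is always best, that results vary with model and problem difficulty, and it offers a practical guide for choosing a TTS strategy based on problem difficulty, model type, and compute budget.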

Details Motivation: A systematic comparison of well-known TTS strategies under identical conditions is missing, and the effect of model type and problem difficulty on performance is unclear. Method: Runs large-scale TTS experiments on four reasoning datasets using eight open-source LLMs (7B to 235B parameters), generating over thirty billion tokens. Result: Three trends emerge: (1) no single TTS strategy universally dominates; (2) reasoning models show distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Conclusion: The paper offers a practical recipe for selecting the best TTS strategy given problem difficulty, model type, and compute budget, as guidance for effective inference-time scaling. Abstract: Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy, considering problem difficulty, model type, and compute budget, providing a practical guide to effective inference-time scaling.
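The abstract does not list which TTS strategies were compared. As a point of reference, majority voting over sampled answers (self-consistency) is one widely used strategy, sketched below; whether the paper includes it is an assumption, and the sampled answers are invented for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among sampled reasoning traces."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

def vote_under_budget(answers, budget):
    """Use only the first `budget` samples, mimicking a fixed compute budget."""
    return majority_vote(answers[:budget])

# Hypothetical final answers extracted from five sampled traces.
samples = ["42", "41", "42", "40", "42"]
small_budget = vote_under_budget(samples, 2)
full_budget = vote_under_budget(samples, 5)
```

Raising `budget` corresponds to spending more inference compute; with only two samples the vote here is tied and falls back to the first answer seen, while five samples produce a clear majority.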

[76] Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Jack Cook,Junxian Guo,Guangxuan Xiao,Yujun Lin,Song Han

Main category: cs.CL

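TL;DR: This paper proposes Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two candidate scale factors for each block of values to reduce quantization error on near-maximal values, which prevents training divergence and improves inference performance. The method can be implemented efficiently on NVIDIA Blackwell GPUs and outperforms existing NVFP4 recipes in both pre-training and post-training quantization.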

Details Motivation: As LLMs grow, low-precision formats such as NVFP4 are widely used to speed up computation and save memory, but quantization error often causes divergence in training or degraded performance in inference, especially because near-maximal values are represented poorly. A better quantization strategy is therefore needed. Method: Proposes 4/6, which modifies the NVFP4 quantization algorithm to evaluate two candidate scale factors per block and pick the one that better represents near-maximal values; it exploits properties of floating-point formats to make the distribution of representable values more uniform and is designed for efficient implementation on the NVIDIA Blackwell architecture. Result: In pre-training of transformer and hybrid architectures, 4/6 prevents divergence in several cases and brings training loss closer to BF16; it can also be integrated into many post-training quantization methods and generally improves downstream accuracy. Conclusion: 4/6 is an efficient and practical improvement to NVFP4 that increases the stability and performance of training and deploying large models in low-precision floating point, and may encourage wider use of NVFP4. Abstract: As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands--weights and activations in the forward pass, and weights, activations, and gradients in the backward pass--must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4.
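To make the "two scale factors per block" idea concrete, here is a minimal sketch. It assumes, as the name suggests but the abstract does not state outright, that the two candidates map the block maximum to the FP4 values 6 and 4, and that the candidate with lower mean squared error is kept. The block scale is left in full precision, whereas real NVFP4 also quantizes the scale, so this is an illustration and not the authors' implementation.

```python
import math

# Non-negative magnitudes representable in FP4 (E2M1); the sign is handled separately.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, target_max):
    """Fake-quantize a block so that its largest magnitude maps to target_max."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0 for _ in block]
    scale = amax / target_max  # assumption: scale kept in full precision
    out = []
    for x in block:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def four_over_six(block):
    """Try both candidate scalings and keep the one with lower error (assumed criterion)."""
    candidates = [quantize_block(block, 6.0), quantize_block(block, 4.0)]
    return min(candidates, key=lambda q: mse(block, q))

block = [6.0, 5.0, 1.0, -0.5]  # toy block with a near-maximal value (5.0)
err_six = mse(block, quantize_block(block, 6.0))
err_best = mse(block, four_over_six(block))
```

In this toy block the value 5.0 falls in the wide gap between 4 and 6 when the maximum is mapped to 6, so mapping the maximum to 4 instead gives a lower error, which is the effect the abstract describes for near-maximal values.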

cs.CV [Back]

[77] MOTION: ML-Assisted On-Device Low-Latency Motion Recognition

Veeramani Pugazhenthi,Wei-Hsiang Chu,Junwei Lu,Jadyn N. Miyahira,Soheil Salehi

Main category: cs.CV

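TL;DR: This paper presents an efficient motion model based on a triaxial accelerometer, using an AutoML pipeline to extract key features, and achieves low-latency, high-accuracy real-time gesture recognition on the WeBe Band wearable for medical monitoring scenarios.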

Details Motivation: Medical monitoring and related fields need fast, reliable human motion tracking with few false alarms, which calls for efficient real-time gesture recognition on tiny embedded devices. Method: Uses triaxial accelerometer data with an AutoML pipeline that automatically extracts important features from data segments, and trains several lightweight machine learning models (including a neural network) for on-device real-time recognition. Result: The neural network offers the best balance of accuracy, latency, and memory use, and the WeBe Band achieves reliable real-time gesture recognition. Conclusion: The approach gives wearable medical monitoring devices an efficient, secure, and responsive gesture recognition solution with broad clinical potential. Abstract: The use of tiny devices capable of low-latency gesture recognition is gaining momentum in everyday human-computer interaction and especially in medical monitoring fields. Embedded solutions such as fall detection, rehabilitation tracking, and patient supervision require fast and efficient tracking of movements while avoiding unwanted false alarms. This study presents an approach to building efficient motion-based models using only triaxial accelerometer sensors. We explore the capability of AutoML pipelines to extract the most important features from the data segments. This approach also involves training multiple lightweight machine learning algorithms using the extracted features. We use WeBe Band, a multi-sensor wearable device equipped with an MCU powerful enough to perform gesture recognition entirely on the device. Of the models explored, we found that the neural network provided the best balance between accuracy, latency, and memory use. Our results also demonstrate that reliable real-time gesture recognition can be achieved on WeBe Band, with great potential for real-time medical monitoring solutions that require a secure and fast response time.

[78] Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions

Egemen Sert,Şeyda Ertekin

Main category: cs.CV

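TL;DR: This paper takes a data-centric approach, fine-tuning Qwen-2.5VL-32B with supervised learning on a high-quality multimodal dataset and an optimized reasoning syntax (QMSA). On the standardized exam benchmark YKSUniform it comes close to Gemini 2.0 Flash, showing that data composition and representational syntax play a key role in multimodal reasoning.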

Details Motivation: Multimodal reasoning is central to AI research, but existing work focuses on algorithmic improvements and overlooks the role of the data itself. This paper explores what high-quality, curriculum-aligned data can do for vision-language reasoning. Method: Builds a 161.4 million token multimodal dataset combining textbook question-solution pairs, curriculum-aligned diagrams, and contextual materials, and fine-tunes Qwen-2.5VL-32B with the optimized reasoning syntax QMSA. Result: Reaches 78.6% accuracy on the newly released YKSUniform benchmark (1,854 multimodal exam questions across 309 curriculum topics), only 1.0% below Gemini 2.0 Flash. Conclusion: Carefully curated, curriculum-grounded multimodal data substantially improves supervised fine-tuning, showing that data quality and representation are key to advancing open vision-language models. Abstract: Multimodal reasoning has become a cornerstone of modern AI research. Standardized exam questions offer a uniquely rigorous testbed for such reasoning, providing structured visual contexts and verifiable answers. While recent progress has largely focused on algorithmic advances such as reinforcement learning (e.g., GRPO, DPO), the data-centric foundations of vision-language reasoning remain less explored. We show that supervised fine-tuning (SFT) with high-quality data can rival proprietary approaches. To this end, we compile a 161.4 million token multimodal dataset combining textbook question-solution pairs, curriculum-aligned diagrams, and contextual materials, and fine-tune Qwen-2.5VL-32B using an optimized reasoning syntax (QMSA). The resulting model achieves 78.6% accuracy, only 1.0% below Gemini 2.0 Flash, on our newly released benchmark YKSUniform, which standardizes 1,854 multimodal exam questions across 309 curriculum topics. Our results reveal that data composition and representational syntax play a decisive role in multimodal reasoning. This work establishes a data-centric framework for advancing open-weight vision-language models, demonstrating that carefully curated and curriculum-grounded multimodal data can elevate supervised fine-tuning to near state-of-the-art performance.

[79] PEFT-DML: Parameter-Efficient Fine-Tuning Deep Metric Learning for Robust Multi-Modal 3D Object Detection in Autonomous Driving

Abdolazim Rezaei,Mehdi Sookhak

Main category: cs.CV

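TL;DR: PEFT-DML is a parameter-efficient multi-modal 3D object detection framework that maps multiple sensor modalities into a shared latent space to improve robustness under sensor failure and in dynamic environments.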

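Details Motivation: Conventional models assume sensors are always available and cope poorly with sensor failure or changing modality combinations in real autonomous driving, so a more robust multi-modal fusion method is needed. Method: Proposes the PEFT-DML framework, which uses LoRA and adapter layers to map LiDAR, radar, camera, and other modalities into a shared latent space for parameter-efficient deep metric learning. Result: Validated on the nuScenes benchmark, where it shows superior detection accuracy and strong robustness to sensor dropout, fast motion, and weather changes. Conclusion: PEFT-DML achieves robust multi-modal 3D detection through parameter-efficient fine-tuning and offers a viable solution for perception in complex autonomous driving scenarios. Abstract: This study introduces PEFT-DML, a parameter-efficient deep metric learning framework for robust multi-modal 3D object detection in autonomous driving. Unlike conventional models that assume fixed sensor availability, PEFT-DML maps diverse modalities (LiDAR, radar, camera, IMU, GNSS) into a shared latent space, enabling reliable detection even under sensor dropout or unseen modality class combinations. By integrating Low-Rank Adaptation (LoRA) and adapter layers, PEFT-DML achieves significant training efficiency while enhancing robustness to fast motion, weather variability, and domain shifts. Experiments on the nuScenes benchmark demonstrate superior accuracy.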
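For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the generic LoRA forward pass: a frozen weight matrix plus a trainable low-rank update scaled by `alpha / r`. It illustrates LoRA in general, not PEFT-DML's architecture; the matrix sizes and names (`down`, `up`) are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, down, up, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ down) @ up.

    w_frozen: (d_in x d_out) pretrained weights, not updated during fine-tuning.
    down:     (d_in x r) trainable projection to rank r.
    up:       (r x d_out) trainable projection back, initialized to zeros.
    """
    base = matmul(x, w_frozen)
    delta = matmul(matmul(x, down), up)
    s = alpha / r
    return [[u + s * v for u, v in zip(b_row, d_row)] for b_row, d_row in zip(base, delta)]

x = [[1.0, 2.0]]                     # one input row, d_in = 2
w = [[0.5, -1.0], [0.25, 0.0]]       # d_in x d_out = 2 x 2
down = [[0.1], [0.2]]                # rank r = 1
up_zero = [[0.0, 0.0]]               # standard zero init: no change at the start
up_trained = [[1.0, -1.0]]           # hypothetical values after some training
y_start = lora_forward(x, w, down, up_zero, alpha=2.0, r=1)
y_tuned = lora_forward(x, w, down, up_trained, alpha=2.0, r=1)
```

Only `down` and `up` are trained, so the number of trainable parameters is `r * (d_in + d_out)` rather than `d_in * d_out`, which is the source of the training efficiency the abstract refers to.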

[80] DL-CapsNet: A Deep and Light Capsule Network

Pouya Shiri,Amirali Baniasadi

Main category: cs.CV

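TL;DR: Proposes a deep capsule network (DL-CapsNet) with several capsule layers and a Capsule Summarization layer that cuts the parameter count and complexity while keeping accuracy high, making it suitable for complex datasets with many categories.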

Details Motivation: To address the limitations of conventional CNNs on images with overlapping categories and affine transformations, and to improve the performance and efficiency of CapsNet. Method: Designs a deep CapsNet with several capsule layers and introduces a Capsule Summarization layer to reduce the number of parameters and model complexity. Result: DL-CapsNet keeps accuracy high with fewer parameters and faster training and inference, and handles complex datasets with many categories. Conclusion: DL-CapsNet is an efficient and accurate deep capsule network suited to complex image classification tasks, with practical potential. Abstract: Capsule Network (CapsNet) is among the promising classifiers and a possible successor of the classifiers based on Convolutional Neural Networks (CNNs). CapsNet is more accurate than CNNs in detecting images with overlapping categories and those with applied affine transformations. In this work, we propose a deep variant of CapsNet consisting of several capsule layers. In addition, we design the Capsule Summarization layer to reduce the complexity by reducing the number of parameters. DL-CapsNet, while being highly accurate, employs a small number of parameters and delivers faster training and inference. DL-CapsNet can process complex datasets with a high number of categories.

[81] Satellite to Street: Disaster Impact Estimator

Sreesritha Sai,Sai Venkata Suma Sreeja,Deepthi,Nikhil

Main category: cs.CV

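TL;DR: Proposes Satellite-to-Street, a deep learning framework that uses a modified dual-input U-Net and a weighted loss function to produce fine-grained damage segmentation from post-disaster satellite imagery.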

Details Motivation: Fast, accurate post-disaster damage assessment is essential for emergency response, but manual interpretation of satellite imagery is slow, subjective, and hard to scale; existing deep learning models handle subtle structural changes and class imbalance poorly. Method: Uses a modified dual-input U-Net with enhanced feature fusion to process pre- and post-disaster satellite images jointly, and adds a class-aware weighted loss to counter the dominance of undamaged pixels. Result: On public disaster datasets, the method localizes and classifies structural damage better than traditional segmentation and change-detection baselines. Conclusion: The method produces rapid, consistent damage maps that support expert decision-making and make disaster management more efficient and data-driven. Abstract: Accurate post-disaster damage assessment is of high importance for prioritizing emergency response; however, manual interpretation of satellite imagery is slow, subjective, and hard to scale. While deep-learning models for image segmentation, such as U-Net-based baselines and change-detection models, are useful baselines, they often struggle with subtle structural variations and severe class imbalance, yielding poor detection of highly damaged regions. The present work proposes a deep-learning framework that jointly processes pre- and post-disaster satellite images to obtain fine-grained pixel-level damage maps: Satellite-to-Street: Disaster Impact Estimator. The model uses a modified dual-input U-Net architecture with enhanced feature fusion to capture both local structural changes and broader contextual cues. Class-aware weighted loss functions are integrated to handle the dominance of undamaged pixels in real disaster datasets, thus enhancing sensitivity toward major and destroyed categories. Experimentation on publicly available disaster datasets shows improved localization and classification of structural damage when compared to traditional segmentation and baseline change-detection models. The resulting damage maps provide a rapid and consistent assessment mechanism to support and not replace expert decision-making, thus allowing more efficient, data-driven disaster management.
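The abstract mentions class-aware weighted loss functions without giving their form. A common choice, shown here only as an assumed example, is cross-entropy with inverse-frequency class weights, so that rare "destroyed" pixels contribute more than abundant "undamaged" ones. The class indices and toy labels are hypothetical.

```python
import math

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n / (num_classes * count); unseen classes get 0."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c else 0.0 for c in counts]

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class) over pixels, normalized by total weight."""
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * math.log(p[y])
        norm += w
    return total / norm

# Toy pixels: class 0 = undamaged (common), class 1 = destroyed (rare).
labels = [0, 0, 0, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1], [0.6, 0.4]]
weights = inverse_frequency_weights(labels, num_classes=2)
loss = weighted_cross_entropy(probs, labels, weights)
```

With these toy labels the rare class receives three times the weight of the common class, so a poorly predicted damaged pixel dominates the loss instead of being averaged away.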

[82] ProvRain: Rain-Adaptive Denoising and Vehicle Detection via MobileNet-UNet and Faster R-CNN

Aswinkumar Varathakumaran,Nirmala Paramanandham

Main category: cs.CV

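TL;DR: This paper presents ProvRain, a lightweight vehicle detection pipeline that combines MobileNet-U-Net with curriculum learning to denoise rainy night-time frames and improve detection, with notable gains in accuracy, recall, and image quality metrics.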

Details Motivation: Night-time vehicle detection is easily disrupted by noise in bad weather such as rain and snow, and existing methods struggle to balance denoising with detection accuracy, so a robust detection framework is needed. Method: Proposes the ProvRain pipeline, which uses a lightweight MobileNet-U-Net architecture for image denoising and trains it with a curriculum learning strategy on synthetic and real (PVDN) data to improve generalization in difficult weather. Result: Compared with a Faster R-CNN baseline, ProvRain improves detection accuracy by 8.94% and recall by 10.25% in rainy night-time scenes; the denoising model improves PSNR by 10-15% and SSIM by 5-6% and reduces LPIPS by up to 67%. Conclusion: With a lightweight denoising architecture and curriculum learning, ProvRain improves vehicle detection and image quality in rainy night-time conditions and is practical and robust. Abstract: Provident vehicle detection has considerable scope for detecting vehicles at night. Extracting features other than the headlamps of vehicles allows oncoming vehicles to be detected before they appear directly in the camera frame. However, it faces multiple issues, especially in night vision, where much of the noise is caused by weather conditions such as rain or snow as well as by camera conditions. This paper focuses on creating a pipeline aimed at dealing with such noise while maintaining the accuracy of provident vehicle detection. The pipeline in this paper, ProvRain, uses a lightweight MobileNet-U-Net architecture tuned to generalize robustly across weather conditions by using curriculum training. A mix of synthetic data and available data from the PVDN dataset is used for this. This pipeline is compared to the base Faster R-CNN architecture trained on the PVDN dataset to see how much the addition of a denoising architecture improves the detection model's performance in rainy conditions. The system shows an 8.94% increase in accuracy and a 10.25% increase in recall in the detection of vehicles in rainy night-time frames. Similarly, the custom MobileNet-U-Net architecture also shows a 10-15% improvement in PSNR, a 5-6% increase in SSIM, and up to a 67% reduction in perceptual error (LPIPS) compared to other transformer approaches.
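Since the denoising results are reported in PSNR, a short reminder of how that metric is computed may help. The sketch below uses the standard definition, PSNR = 10 log10(MAX^2 / MSE), on flattened pixel lists; the pixel values are invented and the 8-bit peak of 255 is an assumption about the image format.

```python
import math

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_val ** 2 / mse)

# Toy 1-D "images" (hypothetical values): clean, noisy, and denoised.
clean = [100, 100, 100, 100]
noisy = [110, 90, 110, 90]
denoised = [102, 98, 102, 98]
psnr_noisy = psnr(clean, noisy)
psnr_denoised = psnr(clean, denoised)
```

Because PSNR is logarithmic, a relative improvement of 10-15% in dB corresponds to a much larger reduction in mean squared error.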

[83] Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation

Jun Jia,Hongyi Miao,Yingjie Zhou,Wangqiu Zhou,Jianbo Zhang,Linhan Cao,Dandan Zhu,Hua Yang,Xiongkuo Min,Wei Sun,Guangtao Zhai

Main category: cs.CV

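TL;DR: This paper presents Adapter Shield, a universal solution with built-in authentication for defending personal images against misuse in zero-shot image generation.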

Details Motivation: As diffusion models advance, zero-shot image-to-image generation creates intellectual property risks such as unauthorized identity cloning and style imitation, so effective protection is needed. Method: After studying how existing zero-shot methods use image encoders to extract embeddings that are fed through cross-attention, the authors design a reversible encryption system and a multi-target adversarial perturbation method that map original embeddings to encrypted representations. Result: Experiments show that the method outperforms existing state-of-the-art defenses in blocking unauthorized zero-shot image synthesis while supporting secure access control for authorized users. Conclusion: Adapter Shield is the first universal, authentication-integrated defense for zero-shot generation scenarios and balances image protection against the flexibility of legitimate use. Abstract: With the rapid progress in diffusion models, image synthesis has advanced to the stage of zero-shot image-to-image generation, where high-fidelity replication of facial identities or artistic styles can be achieved using just one portrait or artwork, without modifying any model weights. Although these techniques significantly enhance creative possibilities, they also pose substantial risks related to intellectual property violations, including unauthorized identity cloning and stylistic imitation. To counter such threats, this work presents Adapter Shield, the first universal and authentication-integrated solution aimed at defending personal images from misuse in zero-shot generation scenarios. We first investigate how current zero-shot methods employ image encoders to extract embeddings from input images, which are subsequently fed into the UNet of diffusion models through cross-attention layers. Inspired by this mechanism, we construct a reversible encryption system that maps original embeddings into distinct encrypted representations according to different secret keys. The authorized users can restore the authentic embeddings via a decryption module and the correct key, enabling normal usage for authorized generation tasks. For protection purposes, we design a multi-target adversarial perturbation method that actively shifts the original embeddings toward designated encrypted patterns. Consequently, protected images are embedded with a defensive layer that ensures unauthorized users can only produce distorted or encrypted outputs. Extensive evaluations demonstrate that our method surpasses existing state-of-the-art defenses in blocking unauthorized zero-shot image synthesis, while supporting flexible and secure access control for verified users.

[84] Diffusion-Based Synthetic Brightfield Microscopy Images for Enhanced Single Cell Detection

Mario de Jesus da Graca,Jörg Dahlkemper,Peer Stelldinger

Main category: cs.CV

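TL;DR: This study uses a U-Net based diffusion model to generate synthetic brightfield microscopy images that augment training data for cell detection. Adding synthetic data improves detectors such as YOLOv8, YOLOv9, and RT-DETR, and the generated images are realistic enough that experts cannot tell them from real ones (identification accuracy of only 50%).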

Details Motivation: Deep learning for single cell detection is limited by scarce real data and high annotation cost, so effective data augmentation is needed to reduce reliance on manual labeling. Method: Trains an unconditional U-Net based diffusion model to generate synthetic brightfield microscopy images, builds datasets with varying ratios of synthetic and real images, and uses them to train object detectors including YOLOv8, YOLOv9, and RT-DETR to measure the effect on detection performance. Result: Training with synthetic data improves detection accuracy, and the generated images are highly realistic: in an expert survey they were identified with only 50% accuracy, which is chance level. Conclusion: Diffusion-based synthetic data generation is a promising augmentation technique for microscopy image analysis that can lower annotation cost and improve the robustness of cell detection models. Abstract: Accurate single cell detection in brightfield microscopy is crucial for biological research, yet data scarcity and annotation bottlenecks limit the progress of deep learning methods. We investigate the use of unconditional models to generate synthetic brightfield microscopy images and evaluate their impact on object detection performance. A U-Net based diffusion model was trained and used to create datasets with varying ratios of synthetic and real images. Experiments with YOLOv8, YOLOv9 and RT-DETR reveal that training with synthetic data can achieve improved detection accuracies (at minimal costs). A human expert survey demonstrates the high realism of generated images, with experts unable to distinguish them from real microscopy images (accuracy 50%). Our findings suggest that diffusion-based synthetic data generation is a promising avenue for augmenting real datasets in microscopy image analysis, reducing the reliance on extensive manual annotation and potentially improving the robustness of cell detection models.

[85] Conceptual Evaluation of Deep Visual Stereo Odometry for the MARWIN Radiation Monitoring Robot in Accelerator Tunnels

André Dehne,Juri Zach,Peer Stelldinger

Main category: cs.CV

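TL;DR: This paper examines deep visual stereo odometry (DVSO) as an alternative for autonomous navigation of the MARWIN robot in the European XFEL accelerator tunnels, with the aim of improving its flexibility and autonomy in unknown environments.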

Details Motivation: The existing navigation approach is robust in predefined sections but inflexible when facing unknown geometries and obstacles, so a more flexible and adaptive navigation technique is needed. Method: Uses vision-based deep visual stereo odometry (DVSO), which combines stereo disparity, optical flow, and self-supervised learning to jointly estimate depth and ego-motion, and can be fused with absolute references or other sensors for global consistency. Result: Expected benefits include reduced scale drift via stereo, low-cost sensing, and scalable data collection; challenges remain in low-texture surfaces, lighting variability, computational load, and robustness under radiation. Conclusion: The paper sets out a research agenda toward enabling MARWIN to navigate more autonomously in constrained, safety-critical infrastructures. Abstract: The MARWIN robot operates at the European XFEL to perform autonomous radiation monitoring in long, monotonous accelerator tunnels where conventional localization approaches struggle. Its current navigation concept combines lidar-based edge detection, wheel/lidar odometry with periodic QR-code referencing, and fuzzy control of wall distance, rotation, and longitudinal position. While robust in predefined sections, this design lacks flexibility for unknown geometries and obstacles. This paper explores deep visual stereo odometry (DVSO) with 3D-geometric constraints as a focused alternative. DVSO is purely vision-based, leveraging stereo disparity, optical flow, and self-supervised learning to jointly estimate depth and ego-motion without labeled data. For global consistency, DVSO can subsequently be fused with absolute references (e.g., landmarks) or other sensors. We provide a conceptual evaluation for accelerator tunnel environments, using the European XFEL as a case study. Expected benefits include reduced scale drift via stereo, low-cost sensing, and scalable data collection, while challenges remain in low-texture surfaces, lighting variability, computational load, and robustness under radiation. The paper defines a research agenda toward enabling MARWIN to navigate more autonomously in constrained, safety-critical infrastructures.
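The claim that stereo reduces scale drift rests on the standard rectified-stereo relation Z = f * B / d, where a known baseline B fixes metric scale. The sketch below applies that textbook formula; the focal length, baseline, and disparity values are invented and are not MARWIN's camera parameters.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Metric depth (m) of a point from its disparity in a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 700 px focal length, 12 cm baseline.
near = depth_from_disparity(disparity_px=40.0, focal_px=700.0, baseline_m=0.12)
far = depth_from_disparity(disparity_px=10.0, focal_px=700.0, baseline_m=0.12)
```

Depth is inversely proportional to disparity, so distant points in a long tunnel have small disparities and their depth estimates are more sensitive to matching errors, which is consistent with the low-texture challenge the abstract notes.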

[86] Exploring Diagnostic Prompting Approach for Multimodal LLM-based Visual Complexity Assessment: A Case Study of Amazon Search Result Pages

Divendar Murtadak,Yoon Kim,Trilokya Akula

Main category: cs.CV

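TL;DR: This study examines whether diagnostic prompting makes multimodal large language models (MLLMs) more reliable at assessing the visual complexity of Amazon search result pages (SRPs). Diagnostic prompting clearly improves prediction performance, but absolute performance remains limited.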

Details Motivation: The goal is to bring MLLM visual complexity assessments closer to human judgments, since existing prompting methods fall short. Method: Diagnostic prompting is compared with standard prompting based on gestalt principles on 200 Amazon SRP pages, using human expert annotations as the reference, together with a decision tree analysis and a study of failure cases. Result: Diagnostic prompting raises the F1-score from 0.031 to 0.297 (a relative improvement of 858%), but Cohen's κ is only 0.071. The models prioritize visual design elements such as badge clutter, whereas humans emphasize content similarity, so the reasoning patterns align only partially. Conclusion: Diagnostic prompting is a promising first step toward human-aligned MLLM-based evaluation, but perception tasks such as product similarity and color intensity remain difficult. Further refinement of prompting approaches and larger ground truth datasets are needed for practical deployment. Abstract: This study investigates whether diagnostic prompting can improve Multimodal Large Language Model (MLLM) reliability for visual complexity assessment of Amazon Search Results Pages (SRP). We compare diagnostic prompting with standard gestalt principles-based prompting using 200 Amazon SRP pages and human expert annotations. Diagnostic prompting showed notable improvements in predicting human complexity judgments, with F1-score increasing from 0.031 to 0.297 (+858% relative improvement), though absolute performance remains modest (Cohen's κ = 0.071). The decision tree revealed that models prioritize visual design elements (badge clutter: 38.6% importance) while humans emphasize content similarity, suggesting partial alignment in reasoning patterns. Failure case analysis reveals persistent challenges in MLLM visual perception, particularly for product similarity and color intensity assessment. Our findings indicate that diagnostic prompting represents a promising initial step toward human-aligned MLLM-based evaluation, though failure cases with consistent human-MLLM disagreement require continued research and refinement in prompting approaches with larger ground truth datasets for reliable practical deployment.
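The relative improvement quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `relative_improvement` is hypothetical, and the two F1 values are the ones reported in the abstract.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative change from `baseline` to `improved`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# F1-scores reported in the abstract for the two prompting strategies.
f1_gestalt = 0.031
f1_diagnostic = 0.297

gain = relative_improvement(f1_gestalt, f1_diagnostic)
print(f"{gain:.0f}%")  # rounds to 858%
```

A large relative gain over a very small baseline is compatible with a modest absolute score, which is why the abstract also reports Cohen's κ.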

[87] A Fast and Efficient Modern BERT based Text-Conditioned Diffusion Model for Medical Image Segmentation

Venkata Siddharth Dhara,Pawan Kumar

Main category: cs.CV

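TL;DR: FastTextDiff is a label-efficient, diffusion-based medical image segmentation method that integrates clinical text annotations through ModernBERT to strengthen semantic representations, improving segmentation accuracy and training efficiency without requiring dense pixel-level labels.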

Details Motivation: Existing medical image segmentation models depend on dense pixel-level labels, which are costly and time-consuming to produce and limit practical use. Method: FastTextDiff combines a denoising diffusion probabilistic model with ModernBERT and uses clinical text annotations to enhance semantic representations. ModernBERT brings FlashAttention 2 and large-corpus pretraining, and visual and textual features are fused through cross-modal attention. Result: With ModernBERT trained on MIMIC-III and MIMIC-IV, FastTextDiff improves both segmentation accuracy and training efficiency over traditional approaches that use Clinical BioBERT. Conclusion: ModernBERT is a fast, scalable alternative to Clinical BioBERT, and multi-modal fusion is a promising direction for medical image analysis with reduced label requirements. Abstract: In recent times, denoising diffusion probabilistic models (DPMs) have proven effective for medical image generation and denoising, and as representation learners for downstream segmentation. However, segmentation performance is limited by the need for dense pixel-wise labels, which are expensive, time-consuming, and require expert knowledge. We propose FastTextDiff, a label-efficient diffusion-based segmentation model that integrates medical text annotations to enhance semantic representations. Our approach uses ModernBERT, a transformer capable of processing long clinical notes, to tightly link textual annotations with semantic content in medical images. Trained on MIMIC-III and MIMIC-IV, ModernBERT encodes clinical knowledge that guides cross-modal attention between visual and textual features. This study validates ModernBERT as a fast, scalable alternative to Clinical BioBERT in diffusion-based segmentation pipelines and highlights the promise of multi-modal techniques for medical image analysis. By replacing Clinical BioBERT with ModernBERT, FastTextDiff benefits from FlashAttention 2, an alternating attention mechanism, and a 2-trillion-token corpus, improving both segmentation accuracy and training efficiency over traditional diffusion-based models.

[88] Multi-modal On-Device Learning for Monocular Depth Estimation on Ultra-low-power MCUs

Davide Nadalini,Manuele Rusci,Elia Cereda,Luca Benini,Francesco Conti,Daniele Palossi

Main category: cs.CV

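TL;DR: This paper presents a multi-modal on-device learning technique for IoT devices that addresses the accuracy loss of monocular depth estimation under domain shift, enabling efficient and accurate online fine-tuning on ultra-low-power hardware.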

Details Motivation: Because of the domain shift between training data and data observed in the field, the lightweight deep neural networks used for monocular depth estimation on IoT devices lose considerable accuracy, so a method that adapts on the device itself is needed. Method: A multi-modal on-device learning approach running on an MCU uses an 8 x 8 pixel depth sensor to collect pseudo-labels and fine-tunes the μPyD-Net model on the device, with a memory-driven sparse update scheme to reduce memory use. Result: The sparse update scheme brings fine-tuning memory down to 1.2 MB (2.2x less than a full update) with accuracy drops of only 2% and 1.5% on KITTI and NYUv2. In field tests with only 3 k self-labeled samples, fine-tuning took 17.8 minutes and reduced the RMSE from 4.9 m to 0.6 m. Conclusion: The work demonstrates, for the first time, on-device learning for monocular depth estimation on a ULP IoT node, combining energy efficiency with adaptability and offering a practical way for edge devices to cope with changing environments. Abstract: Monocular depth estimation (MDE) plays a crucial role in enabling spatially-aware applications in Ultra-low-power (ULP) Internet-of-Things (IoT) platforms. However, the limited number of parameters of Deep Neural Networks for the MDE task, designed for IoT nodes, results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device integrating a Greenwaves GAP9 MicroController Unit (MCU), an 80 mW monocular camera and an 8 x 8 pixel depth sensor, consuming ≈300 mW. In its normal operation, this setup feeds a tiny 107 k-parameter μPyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is only activated alongside the camera to collect pseudo-labels when the system is placed in a new environment. Then, the fine-tuning task is performed entirely on the MCU, using the new data. To optimize our backpropagation-based on-device training, we introduce a novel memory-driven sparse update scheme, which minimizes the fine-tuning memory to 1.2 MB, 2.2x less than a full update, while preserving accuracy (i.e., only 2% and 1.5% drops on the KITTI and NYUv2 datasets). Our in-field tests demonstrate, for the first time, that ODL for MDE can be performed in 17.8 minutes on the IoT node, reducing the root mean squared error from 4.9 to 0.6 m with only 3 k self-labeled samples, collected in a real-life deployment scenario.

[89] Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data

Ivo Bueno,Ruikun Hou,Babette Bühler,Tim Fütterer,James Drimalla,Jonathan Kyle Foster,Peter Youngs,Peter Gerjets,Ulrich Trautwein,Enkelejda Kasneci

Main category: cs.CV

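TL;DR: This study explores AI-based multimodal analysis of classroom recordings to automatically recognize instructional activities and discourse in support of actionable teaching feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, the authors build separate model pipelines for video and text. Fine-tuned models outperform prompting-based LLM approaches, reaching macro-F1 scores of 0.577 on video and 0.460 on transcripts, which indicates that automated classroom analysis is feasible.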

Details Motivation: Traditional classroom observation relies on manual annotation, which is costly and hard to scale, so an automated and scalable way of providing teacher feedback is needed. Method: Using 164 hours of annotated video and 68 lesson transcripts, the authors design parallel, modality-specific pipelines. For video they evaluate zero-shot multimodal LLMs, fine-tuned vision-language models and self-supervised video transformers. For text they compare a fine-tuned, context-aware transformer classifier with prompting-based LLMs. Per-label thresholding, context windows and imbalance-aware loss functions address the multi-label setting and class imbalance. Result: Fine-tuned models outperform prompting-based approaches on both video (macro-F1 0.577) and transcripts (macro-F1 0.460), covering 24 instructional activity labels and 19 discourse labels. Conclusion: Fine-tuned multimodal AI models can support automated classroom analysis and provide a technical foundation for scalable teacher feedback systems. Abstract: Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. The results show that fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.
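To make the per-label thresholding and macro-F1 metric mentioned above concrete, here is a minimal sketch in plain Python. The function names, the toy scores and the threshold values are assumptions chosen for illustration. The abstract does not describe how the authors select thresholds, so none of this should be read as their procedure.

```python
from typing import List, Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[List[int]], scores: List[List[float]],
             thresholds: Sequence[float]) -> float:
    """Binarize each label with its own threshold, then average per-label F1."""
    per_label = []
    for j, thr in enumerate(thresholds):
        truth = [row[j] for row in y_true]
        pred = [1 if row[j] >= thr else 0 for row in scores]
        per_label.append(f1_binary(truth, pred))
    return sum(per_label) / len(per_label)


# Toy multi-label data (hypothetical): 4 samples, 2 labels.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.4, 0.35], [0.6, 0.3], [0.1, 0.1]]

uniform = macro_f1(y_true, scores, [0.5, 0.5])     # second label never fires
per_label = macro_f1(y_true, scores, [0.5, 0.25])  # lower threshold for label 2
print(uniform, per_label)
```

In this toy case the second label is systematically under-scored, so a single global threshold of 0.5 misses every positive for it, while a label-specific threshold recovers them. This is the kind of effect per-label thresholding targets under class imbalance.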

[90] SemImage: Semantic Image Representation for Text, a Novel Framework for Embedding Disentangled Linguistic Features

Mohammad Zare

Main category: cs.CV

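TL;DR: SemImage represents text as a two-dimensional semantic image for CNN processing, encoding linguistic features in an HSV color space and using dynamic boundary rows to highlight semantic transitions. It performs competitively on document classification while remaining highly interpretable.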

Details Motivation: To make text representations more interpretable and to capture topic and sentiment changes within a document, a method is needed that visualizes linguistic features in a form suited to CNN processing. Method: Each word becomes a pixel in a 2D image, with rows corresponding to sentences and dynamic boundary rows inserted between them. A disentangled HSV color space encodes the features: Hue (H_cos/H_sin) for topic, Saturation for sentiment and Value for intensity. A multi-task learning framework (a ColorMapper network plus auxiliary supervision) enforces the disentanglement, and a 2D CNN such as ResNet performs document classification. Result: On multi-label and single-label datasets, SemImage matches or exceeds the accuracy of strong baselines including BERT and hierarchical attention networks. Ablations confirm the importance of the HSV representation and the dynamic boundary rows, and visualizations show clear patterns for topic shifts and sentiment changes. Conclusion: SemImage converts text into semantic images rich in linguistic information, performs well on document classification, and offers visual interpretability readable by both machines and humans. Abstract: We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: the Hue with two components H_cos and H_sin to account for circularity encodes the topic, Saturation encodes the sentiment, and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability.
An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.
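The abstract splits Hue into H_cos and H_sin "to account for circularity". The sketch below illustrates why such a two-component encoding helps, using only the standard library. The function names are hypothetical, and mapping a hue angle directly to (cos, sin) is an assumption about the general technique, not the paper's ColorMapper network.

```python
import math
from typing import Tuple


def encode_hue(angle_deg: float) -> Tuple[float, float]:
    """Map a hue angle in degrees to a point (H_cos, H_sin) on the unit circle."""
    rad = math.radians(angle_deg % 360.0)
    return math.cos(rad), math.sin(rad)


def decode_hue(h_cos: float, h_sin: float) -> float:
    """Recover the hue angle in [0, 360) from its two components."""
    return math.degrees(math.atan2(h_sin, h_cos)) % 360.0


# Hues of 359 and 1 degree are neighbours on the colour wheel.
raw_gap = abs(359.0 - 1.0)                                    # 358: looks far apart
encoded_gap = math.dist(encode_hue(359.0), encode_hue(1.0))   # small chord length

print(raw_gap, round(encoded_gap, 4))
print(round(decode_hue(*encode_hue(350.0)), 6))
```

With a raw angle, the wrap-around at 0/360 makes adjacent hues look maximally different to a regression loss. In (cos, sin) space the same two hues are close together, so a network predicting the two components does not see a discontinuity.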

[91] TeleViT1.0: Teleconnection-aware Vision Transformers for Subseasonal to Seasonal Wildfire Pattern Forecasts

Ioannis Prapas,Nikolaos Papadopoulos,Nikolaos-Ioannis Bountos,Dimitrios Michail,Gustau Camps-Valls,Ioannis Papoutsis

Main category: cs.CV

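TL;DR: This paper proposes TeleViT, a wildfire forecasting model that combines local fire drivers, global fields and teleconnection indices in a multi-scale Vision Transformer architecture, improving wildfire prediction at lead times of up to four months.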

Details Motivation: Long-range wildfire forecasting is crucial for resource planning but very difficult. Traditional methods rely on local weather and struggle to capture large-scale connections in the Earth system, so a model that integrates information across scales is needed. Method: TeleViT uses an asymmetric tokenization strategy to fuse fine-scale local drivers, coarse-resolution global fields and teleconnection indices. A transformer encoder processes the heterogeneous tokens jointly, and a decoder that preserves spatial structure produces the predictions. Result: On the SeasFire dataset, TeleViT outperforms U-Net++, ViT and a climatology baseline at all lead times (up to four months). At a lead of 16x8 days, for example, its AUPRC is 0.601-0.603, above the compared models. Attention and attribution analyses indicate that predictions rely mainly on local tokens, with global inputs contributing coarse context. Conclusion: Architectures that explicitly encode large-scale Earth-system context can extend wildfire predictability on subseasonal-to-seasonal timescales. Abstract: Forecasting wildfires weeks to months in advance is difficult, yet crucial for planning fuel treatments and allocating resources. While short-term predictions typically rely on local weather conditions, long-term forecasting requires accounting for the Earth's interconnectedness, including global patterns and teleconnections. We introduce TeleViT, a Teleconnection-aware Vision Transformer that integrates (i) fine-scale local fire drivers, (ii) coarsened global fields, and (iii) teleconnection indices. This multi-scale fusion is achieved through an asymmetric tokenization strategy that produces heterogeneous tokens processed jointly by a transformer encoder, followed by a decoder that preserves spatial structure by mapping local tokens to their corresponding prediction patches. Using the global SeasFire dataset (2001-2021, 8-day resolution), TeleViT improves AUPRC performance over U-Net++, ViT, and climatology across all lead times, including horizons up to four months. At zero lead, TeleViT with indices and global inputs reaches AUPRC 0.630 (ViT 0.617, U-Net 0.620), at 16x8day lead (around 4 months), TeleViT variants using global input maintain 0.601-0.603 (ViT 0.582, U-Net 0.578), while surpassing the climatology (0.572) at all lead times. Regional results show the highest skill in seasonally consistent fire regimes, such as African savannas, and lower skill in boreal and arid regions. Attention and attribution analyses indicate that predictions rely mainly on local tokens, with global fields and indices contributing coarse contextual information.
These findings suggest that architectures explicitly encoding large-scale Earth-system context can extend wildfire predictability on subseasonal-to-seasonal timescales.

[92] Deep Filament Extraction for 3D Concrete Printing

Karam Mawas,Mehdi Maboudi,Pedro Achanccaray,Markus Gerke

Main category: cs.CV

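TL;DR: This paper presents an automated quality control method for filaments in extrusion-based and shotcrete 3D concrete printing. The method is independent of the sensor type, works with material in the fresh or cured state, and supports both online and post-printing inspection.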

Details Motivation: To meet the construction industry's demand for sustainable and efficient building, and because filaments are the essential structure defining a 3D-printed concrete object, a general and automated method for controlling filament geometry is needed. Method: The paper proposes an automated workflow that does not depend on a specific sensor (such as a camera, a structured light system or a terrestrial laser scanner) for checking the geometric quality of filaments in extrusion-based and shotcrete 3D concrete printing, applicable to material in the fresh and cured states. Result: The outcome is a general QC procedure that applies across printing techniques and sensors and enables both online and post-printing quality monitoring of printed filaments. Conclusion: The method makes quality control in 3D concrete printing more flexible and widely applicable, offering automated inspection across processes and material states. Abstract: The architecture, engineering and construction (AEC) industry is constantly evolving to meet the demand for sustainable and effective design and construction of the built environment. In the literature, two primary deposition techniques for large-scale 3D concrete printing (3DCP) have been described, namely extrusion-based (Contour Crafting-CC) and shotcrete 3D printing (SC3DP) methods. The deposition methods use a digitally controlled nozzle to print material layer by layer. The continuous flow of concrete material used to create the printed structure is called a filament or layer. As these filaments are the essential structure defining the printed object, the filaments' geometry quality control is crucial. This paper presents an automated procedure for quality control (QC) of filaments in extrusion-based and SC3DP printing methods. The paper also describes a workflow that is independent of the sensor used for data acquisition, such as a camera, a structured light system (SLS) or a terrestrial laser scanner (TLS). This method can be used with materials in either the fresh or cured state. Thus, it can be used for online and post-printing QC.

[93] Comparative Analysis of Vision Transformer, Convolutional, and Hybrid Architectures for Mental Health Classification Using Actigraphy-Derived Images

Ifeanyi Okala

Main category: cs.CV

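TL;DR: This study compares three image-based models, VGG16, ViT-B/16 and CoAtNet-Tiny, at identifying depression, schizophrenia and healthy controls from daily actigraphy records. CoAtNet-Tiny is the most accurate and stable, with higher precision, recall and F1-scores on the minority classes in particular.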

Details Motivation: The work looks for deep learning models suited to mental health recognition from actigraphy-derived images, in particular for automatic detection of depression and schizophrenia. Method: Activity signals recorded by wrist-worn devices are converted into 30 by 48 pixel images, and VGG16, ViT-B/16 and CoAtNet-Tiny are evaluated with a three-fold subject-wise split. Result: CoAtNet-Tiny outperforms the other two models in average accuracy, stability of the training curves, and precision, recall and F1-score for the depression and schizophrenia classes. VGG16 improves steadily but settles at lower accuracy, and ViT-B/16 is unstable across folds. Conclusion: CoAtNet-Tiny is the most consistent and reliable on actigraphy-derived images, which suggests that hybrid architectures may be better suited to mental health analysis based on such images. Abstract: This work examines how three different image-based methods, VGG16, ViT-B/16, and CoAtNet-Tiny, perform in identifying depression, schizophrenia, and healthy controls using daily actigraphy records. Wrist-worn activity signals from the Psykose and Depresjon datasets were converted into 30 by 48 images and evaluated through a three-fold subject-wise split. Although all methods fitted the training data well, their behaviour on unseen data differed. VGG16 improved steadily but often settled at lower accuracy. ViT-B/16 reached strong results in some runs, but its performance shifted noticeably from fold to fold. CoAtNet-Tiny stood out as the most reliable, recording the highest average accuracy and the most stable curves across folds. It also produced the strongest precision, recall, and F1-scores, particularly for the underrepresented depression and schizophrenia classes. Overall, the findings indicate that CoAtNet-Tiny performed most consistently on the actigraphy images, while VGG16 and ViT-B/16 yielded mixed results. These observations suggest that certain hybrid designs may be especially suited for mental-health work that relies on actigraphy-derived images.

[94] TinyViT: Field Deployable Transformer Pipeline for Solar Panel Surface Fault and Severity Screening

Ishwaryah Pandiarajan,Mohamed Mansoor Roomi Sindha,Uma Maheswari Pandyan,Sharafia N

Main category: cs.CV

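TL;DR: TinyViT is a lightweight deep learning pipeline that classifies surface faults on solar photovoltaic panels and estimates their severity from visible-light images alone, without infrared or electroluminescence imaging, making inspection more accessible and affordable.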

Details Motivation: Existing PV panel fault detection mostly depends on multi-modal imaging (such as infrared and electroluminescence), which is expensive and hard to deploy at scale, limiting its use in resource-limited settings. Method: TinyViT combines Transformer-based segmentation, spectral-spatial feature engineering and ensemble regression, and uses only visible-light images from consumer-grade color cameras to classify seven surface fault types and estimate severity. Result: Experiments on real-world public datasets validate the classification and regression modules, with accuracy and interpretability competitive with specialized approaches and no additional sensors. Conclusion: The method lowers the cost and technical barrier of PV maintenance and moves solar health monitoring toward broad, large-scale accessibility. Abstract: Sustained operation of solar photovoltaic assets hinges on accurate detection and prioritization of surface faults across vast, geographically distributed modules. While multi-modal imaging strategies are popular, they introduce logistical and economic barriers for routine farm-level deployment. This work demonstrates that deep learning and classical machine learning may be judiciously combined to achieve robust surface anomaly categorization and severity estimation from planar visible band imagery alone. We introduce TinyViT, a compact pipeline integrating Transformer-based segmentation, spectral-spatial feature engineering, and ensemble regression. The system ingests consumer-grade color camera mosaics of PV panels, classifies seven nuanced surface faults, and generates actionable severity grades for maintenance triage. By eliminating reliance on electroluminescence or IR sensors, our method enables affordable, scalable upkeep for resource-limited installations, and advances the state of solar health monitoring toward universal field accessibility. Experiments on real-world public datasets validate both classification and regression sub-modules, achieving accuracy and interpretability competitive with specialized approaches.

[95] Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance

Ruo-Syuan Mei,Sixian Jia,Guangze Li,Soo Yeon Lee,Brian Musser,William Keller,Sreten Zakula,Jorge Arinez,Chenhui Shao

Main category: cs.CV

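TL;DR: This paper proposes a hybrid synthetic data generation (SDG) framework for zero-shot industrial part inspection without manual annotation. It produces large amounts of labeled data through simulation-based rendering, domain randomization and real background compositing, and achieves high detection and classification performance when trained on synthetic data only.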

Details Motivation: Defective samples are rare in manufacturing and annotation is expensive, which limits the use of machine learning in industrial inspection. The paper targets this data scarcity and class imbalance. Method: A hybrid SDG framework combines simulation-based rendering, domain randomization and real background compositing to generate 12,960 labeled images. YOLOv8n is used for object detection and MobileNetV3-small for quality classification, both trained entirely on synthetic data. Result: On 300 real industrial parts, the approach reaches an mAP@0.5 of 0.995 for detection, 96% classification accuracy and 90.1% balanced accuracy. Under severe class imbalance it achieves 90-91% balanced accuracy, whereas few-shot real-data baselines reach only 50% accuracy. Conclusion: The method enables annotation-free, scalable and robust quality inspection, improves model performance when data are scarce, and supports the use of deep learning in real production environments. Abstract: Machine learning, particularly deep learning, is transforming industrial quality inspection. Yet, training robust machine learning models typically requires large volumes of high-quality labeled data, which are expensive, time-consuming, and labor-intensive to obtain in manufacturing. Moreover, defective samples are intrinsically rare, leading to severe class imbalance that degrades model performance. These data constraints hinder the widespread adoption of machine learning-based quality inspection methods in real production environments. Synthetic data generation (SDG) offers a promising solution by enabling the creation of large, balanced, and fully annotated datasets in an efficient, cost-effective, and scalable manner. This paper presents a hybrid SDG framework that integrates simulation-based rendering, domain randomization, and real background compositing to enable zero-shot learning for computer vision-based industrial part inspection without manual annotation. The SDG pipeline generates 12,960 labeled images in one hour by varying part geometry, lighting, and surface properties, and then compositing synthetic parts onto real image backgrounds. A two-stage architecture utilizing a YOLOv8n backbone for object detection and MobileNetV3-small for quality classification is trained exclusively on synthetic data and evaluated on 300 real industrial parts. The proposed approach achieves an mAP@0.5 of 0.995 for detection, 96% classification accuracy, and 90.1% balanced accuracy. Comparative evaluation against few-shot real-data baseline approaches demonstrates significant improvement.
The proposed SDG-based approach achieves 90-91% balanced accuracy under severe class imbalance, while the baselines reach only 50% accuracy. These results demonstrate that the proposed method enables annotation-free, scalable, and robust quality inspection for real-world manufacturing applications.
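The abstract reports both plain accuracy and balanced accuracy, and the difference matters under class imbalance. The following sketch shows the two metrics side by side. The 95/5 class split and the "always predict good" classifier are made-up numbers for illustration only and do not come from the paper.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced test set: 95 good parts, 5 defective parts.
y_true = ["good"] * 95 + ["defective"] * 5
# A degenerate classifier that labels every part as good.
y_pred = ["good"] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc, bal)
```

A classifier that never flags a defect still scores high on plain accuracy here, while its balanced accuracy falls to chance level for two classes. That is why balanced accuracy is the more informative figure when defective samples are rare.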

[96] Analysis of Incursive Breast Cancer in Mammograms Using YOLO, Explainability, and Domain Adaptation

Jayan Adhikari,Prativa Joshi,Susish Baral

Main category: cs.CV

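TL;DR: The paper proposes a joint framework combining ResNet50 with YOLO-family models to make breast cancer detection robust to out-of-distribution (OOD) inputs. A cosine-similarity screening step rejects non-mammographic images, improving system reliability and detection accuracy.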

Details Motivation: Deep learning models become unreliable when given out-of-distribution inputs (such as other imaging modalities or equipment variations), which can lead to misdiagnosis, so robustness in real clinical settings must be improved. Method: A ResNet50-based OOD filter builds an in-domain gallery and uses cosine similarity to strictly reject non-mammographic images. YOLOv8, YOLOv11 and YOLOv12 then perform cancer detection, with Grad-CAM added for interpretability. Result: The OOD detection component reaches 99.77% overall accuracy and 100% accuracy on the OOD test sets. Detection performance reaches an mAP@0.5 of 0.947, with fewer false alarms and more stable detection on mammographic images. Conclusion: The joint framework addresses OOD interference in AI-based breast cancer detection and provides a basis for deploying reliable AI systems in clinical environments with heterogeneous data. Abstract: Deep learning models for breast cancer detection from mammographic images have significant reliability problems when presented with Out-of-Distribution (OOD) inputs such as other imaging modalities (CT, MRI, X-ray) or equipment variations, leading to unreliable detection and misdiagnosis. The current research mitigates the fundamental OOD issue through a comprehensive approach integrating ResNet50-based OOD filtering with YOLO architectures (YOLOv8, YOLOv11, YOLOv12) for accurate detection of breast cancer. Our strategy establishes an in-domain gallery via cosine similarity to rigidly reject non-mammographic inputs prior to processing, ensuring that only domain-associated images supply the detection pipeline. The OOD detection component achieves 99.77% general accuracy with a perfect 100% accuracy on OOD test sets, effectively eliminating irrelevant imaging modalities. ResNet50 was selected as the optimum backbone after 12 CNN architecture searches. The joint framework unites OOD robustness with high detection performance (mAP@0.5: 0.947) and enhanced interpretability through Grad-CAM visualizations. Experimental validation establishes that OOD filtering significantly improves system reliability by preventing false alarms on out-of-distribution inputs while maintaining higher detection accuracy on mammographic data. The present study offers a fundamental foundation for the deployment of reliable AI-based breast cancer detection systems in diverse clinical environments with inherent data heterogeneity.
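The gallery-based rejection step can be pictured with a small cosine-similarity filter. This is a rough sketch under stated assumptions: the 3-dimensional vectors stand in for ResNet50 embeddings, the names `is_in_domain` and `gallery` are hypothetical, and the "best match above a threshold of 0.8" rule is one plausible decision rule. The abstract does not specify the authors' actual rule or threshold.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_in_domain(embedding: Sequence[float],
                 gallery: List[Sequence[float]],
                 threshold: float = 0.8) -> bool:
    """Accept an input if its best match in the gallery reaches the threshold."""
    best = max(cosine_similarity(embedding, ref) for ref in gallery)
    return best >= threshold


# Toy embeddings (hypothetical stand-ins for backbone features).
gallery = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3]]   # known mammogram embeddings
mammogram_like = [0.95, 0.05, 0.25]
other_modality = [0.0, 1.0, 0.0]

print(is_in_domain(mammogram_like, gallery))   # accepted, passed to the detector
print(is_in_domain(other_modality, gallery))   # rejected before detection
```

Only inputs accepted by such a filter would be forwarded to the detector, which is how a screening stage can prevent false alarms on unrelated imaging modalities.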

[97] Local and Global Context-and-Object-part-Aware Superpixel-based Data Augmentation for Deep Visual Recognition

Fadi Dornaika,Danyang Sun

Main category: cs.CV

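TL;DR: LGCOAMix is a superpixel-based grid blending method for data augmentation and the first to use superpixel attention for label mixing. It improves classification and weakly supervised localization performance.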

Details Motivation: Existing CutMix methods focus on global semantics and neglect discriminative local context, their rectangular cuts lose object part information, and their label generation is inefficient. Method: LGCOAMix uses superpixel segmentation for object-part-aware grid blending, mixes labels through a superpixel attention mechanism, and learns local features from superpixel regions together with cross-image superpixel contrasts. Result: The method outperforms existing CutMix-based approaches on multiple benchmark datasets for classification, achieves better weakly supervised object localization on CUB200-2011, and works with both CNN and Transformer models. Conclusion: LGCOAMix improves how data augmentation models local context and object parts, with better efficiency and generalization. Abstract: Cutmix-based data augmentation, which uses a cut-and-paste strategy, has shown remarkable generalization capabilities in deep learning. However, existing methods primarily consider global semantics with image-level constraints, which excessively reduces attention to the discriminative local context of the class and leads to a performance improvement bottleneck. Moreover, existing methods for generating augmented samples usually involve cutting and pasting rectangular or square regions, resulting in a loss of object part information. To mitigate the problem of inconsistency between the augmented image and the generated mixed label, existing methods usually require double forward propagation or rely on an external pre-trained network for object centering, which is inefficient. To overcome the above limitations, we propose LGCOAMix, an efficient context-aware and object-part-aware superpixel-based grid blending method for data augmentation. To the best of our knowledge, this is the first time that a label mixing strategy using a superpixel attention approach has been proposed for cutmix-based data augmentation. It is the first instance of learning local features from discriminative superpixel-wise regions and cross-image superpixel contrasts. Extensive experiments on various benchmark datasets show that LGCOAMix outperforms state-of-the-art cutmix-based data augmentation methods on classification tasks, and on weakly supervised object localization on CUB200-2011. We have demonstrated the effectiveness of LGCOAMix not only for CNN networks, but also for Transformer networks. Source codes are available at https://github.com/DanielaPlusPlus/LGCOAMix.
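As background for the label mixing discussed above, the sketch below shows region-weighted label mixing in its simplest form: the mixed label is a convex combination of the two source labels, with the share of image B given by the total weight of the regions pasted from B. Using region areas as weights corresponds to ordinary CutMix-style mixing. Passing attention scores instead is only an assumption about how a superpixel-attention variant could plug in, not a description of LGCOAMix itself, and the function name `mix_labels` is hypothetical.

```python
from typing import List, Optional, Sequence


def mix_labels(label_a: Sequence[float], label_b: Sequence[float],
               from_b: Sequence[bool],
               weights: Optional[Sequence[float]] = None) -> List[float]:
    """Blend two one-hot labels by the weighted share of regions taken from B.

    from_b[i] is True when region i of the mixed image comes from image B.
    weights[i] is the importance of region i (pixel area by default-like use,
    or any non-negative score); uniform weights are used when omitted.
    """
    if weights is None:
        weights = [1.0] * len(from_b)
    total = sum(weights)
    lam_b = sum(w for w, fb in zip(weights, from_b) if fb) / total
    return [(1.0 - lam_b) * a + lam_b * b for a, b in zip(label_a, label_b)]


# Four regions with hypothetical pixel areas; only the second comes from B.
areas = [10.0, 30.0, 20.0, 40.0]
from_b = [False, True, False, False]

mixed = mix_labels([1.0, 0.0], [0.0, 1.0], from_b, areas)
print(mixed)  # class B receives the 30% share of the pasted region
```

The mismatch the abstract mentions arises when the pasted region's weight does not reflect how much object evidence it carries, which is the motivation for replacing plain area with a learned, content-aware weight.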

[98] Efficient Edge-Compatible CNN for Speckle-Based Material Recognition in Laser Cutting Systems

Mohamed Abdallah Salem,Nourhan Zein Diab

Main category: cs.CV

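TL;DR: The paper proposes a lightweight CNN for laser speckle material classification that reaches 95.05% accuracy on 59 material classes with only 341k parameters. It can run on edge devices and supports automatic parameter selection in laser cutters.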

Details Motivation: Existing material recognition methods either rely on computationally expensive networks or cover only a limited set of materials, which makes them hard to deploy in resource-constrained laser cutting systems. Method: A lightweight convolutional neural network (CNN) is designed specifically for speckle patterns, reducing the parameter count while keeping high discriminative power, and is trained and evaluated on the complete SensiCut dataset (59 material classes). Result: The model reaches 95.05% test accuracy with macro and weighted F1-scores of 0.951. It has only 341k parameters (about 1.3 MB), over 70x fewer than ResNet-50, and runs at 295 images per second, enabling deployment on Raspberry Pi and Jetson-class devices. When materials are regrouped into nine and five practical families, recall exceeds 98% and approaches 100%. Conclusion: Compact, domain-specific CNNs can outperform large backbones for speckle-based material classification, advancing the feasibility of material-aware laser cutting systems deployable at the edge. Abstract: Accurate material recognition is critical for safe and effective laser cutting, as misidentification can lead to poor cut quality, machine damage, or the release of hazardous fumes. Laser speckle sensing has recently emerged as a low-cost and non-destructive modality for material classification; however, prior work has either relied on computationally expensive backbone networks or addressed only limited subsets of materials. In this study, a lightweight convolutional neural network (CNN) tailored for speckle patterns is proposed, designed to minimize parameters while maintaining high discriminative power. Using the complete SensiCut dataset of 59 material classes spanning woods, acrylics, composites, textiles, metals, and paper-based products, the proposed model achieves 95.05% test accuracy, with macro and weighted F1-scores of 0.951. The network contains only 341k trainable parameters (~1.3 MB) -- over 70x fewer than ResNet-50 -- and achieves an inference speed of 295 images per second, enabling deployment on Raspberry Pi and Jetson-class devices. Furthermore, when materials are regrouped into nine and five practical families, recall exceeds 98% and approaches 100%, directly supporting power and speed preset selection in laser cutters. These results demonstrate that compact, domain-specific CNNs can outperform large backbones for speckle-based material classification, advancing the feasibility of material-aware, edge-deployable laser cutting systems.
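The size and ratio figures in the abstract can be sanity-checked with simple arithmetic. The sketch below assumes 32-bit floating-point weights (4 bytes per parameter) and takes 25,557,032 as the ResNet-50 parameter count, which is the figure for the standard 1000-class ImageNet variant. Both are assumptions, since the abstract does not state the storage format or which ResNet-50 configuration was compared.

```python
def model_size_mib(n_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight storage in MiB, ignoring file-format overhead."""
    return n_params * bytes_per_param / (1024 ** 2)


proposed_params = 341_000      # "341k trainable parameters" from the abstract
resnet50_params = 25_557_032   # assumed: standard ImageNet ResNet-50

size = model_size_mib(proposed_params)
ratio = resnet50_params / proposed_params

print(f"{size:.2f} MiB, {ratio:.0f}x fewer parameters than ResNet-50")
```

Under these assumptions the numbers line up with the abstract's "~1.3 MB" and "over 70x fewer".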

[99] AutocleanEEG ICVision: Automated ICA Artifact Classification Using Vision-Language AI

Zag ElSayed,Grace Westerkamp,Gavin Gammoh,Yanchen Liu,Peyton Siekierski,Craig Erickson,Ernest Pedapati

Main category: cs.CV

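TL;DR: ICVision is presented as the first system to emulate expert-level EEG ICA component classification through AI vision and natural language reasoning. A multimodal large model interprets ICA visualizations directly, producing interpretable and actionable automatic classifications and moving EEG analysis toward a scalable, reproducible workflow.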

Details Motivation: Conventional EEG ICA component classifiers such as ICLabel rely on handcrafted features, lack interpretability and struggle with ambiguous cases, so a system is needed that can "see" and "explain" components the way a neuroscientist does. Method: The GPT-4 Vision multimodal large language model takes ICA dashboard images (topography, time series, power spectra and ERP plots) as input, performs visual interpretation and natural language reasoning, and outputs one of six component labels with a confidence score and a human-readable explanation. Result: Evaluated on 3,168 ICA components from 124 EEG datasets, ICVision reaches a kappa of 0.677 against expert consensus, surpassing MNE ICLabel. Over 97% of its outputs were rated interpretable and actionable by experts, and it better preserves brain signals in ambiguous cases. Conclusion: ICVision is described as the first scientific implementation of AI-agent visual cognition in neurophysiology, marking a move in scientific AI from classification alone toward seeing, reasoning and communicating, and offering interpretable, reproducible and globally scalable EEG workflows. Abstract: We introduce EEG Autoclean Vision Language AI (ICVision), a first-of-its-kind system that emulates expert-level EEG ICA component classification through AI-agent vision and natural language reasoning. Unlike conventional classifiers such as ICLabel, which rely on handcrafted features, ICVision directly interprets ICA dashboard visualizations (topography, time series, power spectra, and ERP plots) using a multimodal large language model (GPT-4 Vision). This allows the AI to see and explain EEG components the way trained neurologists do, making it the first scientific implementation of AI-agent visual cognition in neurophysiology. ICVision classifies each component into one of six canonical categories (brain, eye, heart, muscle, channel noise, and other noise), returning both a confidence score and a human-like explanation. Evaluated on 3,168 ICA components from 124 EEG datasets, ICVision achieved κ = 0.677 agreement with expert consensus, surpassing MNE ICLabel, while also preserving clinically relevant brain signals in ambiguous cases. Over 97% of its outputs were rated as interpretable and actionable by expert reviewers. As a core module of the open-source EEG Autoclean platform, ICVision signals a paradigm shift in scientific AI, where models do not just classify, but see, reason, and communicate. It opens the door to globally scalable, explainable, and reproducible EEG workflows, marking the emergence of AI agents capable of expert-level visual decision-making in brain science and beyond.
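Agreement with expert consensus is reported as a kappa statistic. For readers unfamiliar with it, here is a minimal sketch of Cohen's kappa for two raters. The six toy labels are invented for illustration, and the abstract does not say which kappa variant the authors used, so treating it as two-rater Cohen's kappa is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical component labels from an expert and from a model.
expert = ["brain", "brain", "eye", "muscle", "brain", "eye"]
model = ["brain", "brain", "eye", "brain", "brain", "muscle"]

kappa = cohens_kappa(expert, model)
print(round(kappa, 3))
```

In this toy case the raw agreement is 4 out of 6, but kappa is lower because part of that agreement would be expected by chance given how often both raters choose "brain".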

[100] Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting

Shantanu Ghosh,Vedant Parthesh Joshi,Rayan Syed,Aya Kassem,Abhishek Varshney,Payel Basak,Weicheng Dai,Judy Wawira Gichoya,Hari M. Trivedi,Imon Banerjee,Shyam Visweswaran,Clare B. Poynton,Kayhan Batmanghelich

Main category: cs.CV

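TL;DR: Mammo-FM is the first foundation model designed specifically for mammography. Trained on a large and diverse dataset, it performs strongly across a range of clinical tasks and offers good interpretability and clinical applicability.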

Details Motivation: Early diagnosis and integrated management of breast cancer call for a unified model built for the multiple tasks of breast imaging, where existing generalist foundation models perform poorly. Method: Mammo-FM is pretrained on 140,677 patients (821,326 mammograms) from four U.S. institutions and supports cancer diagnosis, pathology localization, structured report generation and risk prognosis, with image-text alignment for interpretability. Result: Across multiple public and private benchmarks, Mammo-FM consistently outperforms state-of-the-art generalist models on in- and out-of-distribution data, despite using only one-third of their parameters and operating on native-resolution images. Conclusion: Domain-specific foundation models are more efficient and practical for medical imaging, and the work underlines the importance of designing models around the full range of clinical tasks and of domain-aligned evaluation. Abstract: Breast cancer is one of the leading causes of death among women worldwide. We introduce Mammo-FM, the first foundation model specifically for mammography, pretrained on the largest and most diverse dataset to date - 140,677 patients (821,326 mammograms) across four U.S. institutions. Mammo-FM provides a unified foundation for core clinical tasks in breast imaging, including cancer diagnosis, pathology localization, structured report generation, and cancer risk prognosis within a single framework. Its alignment between images and text enables both visual and textual interpretability, improving transparency and clinical auditability, which are essential for real-world adoption. We rigorously evaluate Mammo-FM across diagnosis, prognosis, and report-generation tasks in in- and out-of-distribution datasets. Despite operating on native-resolution mammograms and using only one-third of the parameters of state-of-the-art generalist FMs, Mammo-FM consistently outperforms them across multiple public and private benchmarks. These results highlight the efficiency and value of domain-specific foundation models designed around the full spectrum of tasks within a clinical domain and emphasize the importance of rigorous, domain-aligned evaluation.

[101] ReactionMamba: Generating Short & Long Human Reaction Sequences

Hajra Anwar Beg,Baptiste Chopin,Hao Tang,Mohamed Daoudi

Main category: cs.CV

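TL;DR: ReactionMamba is a new framework for generating long sequences of 3D human reaction motion. It combines a motion VAE with Mamba models to generate motion efficiently and coherently.

Details Motivation: Existing methods are inefficient and temporally inconsistent when generating long sequences of complex motion, so a more efficient model is needed to produce high-quality 3D reaction motions. Method: The ReactionMamba framework pairs a motion VAE for efficient encoding with Mamba-based state-space models that decode temporally consistent reaction motions. Result: Evaluated on three datasets (NTU120-AS, Lindy Hop, and InterX), the model performs well in realism, diversity, and long-sequence generation, with substantially faster inference. Conclusion: ReactionMamba efficiently generates complex 3D reaction motions of varying length, outperforms existing methods, and shows good application prospects. Abstract: We present ReactionMamba, a novel framework for generating long 3D human reaction motions. ReactionMamba integrates a motion VAE for efficient motion encoding with Mamba-based state-space models to decode temporally consistent reactions. This design enables ReactionMamba to generate both short sequences of simple motions and long sequences of complex motions, such as dance and martial arts. We evaluate ReactionMamba on three datasets--NTU120-AS, Lindy Hop, and InterX--and demonstrate competitive performance in terms of realism, diversity, and long-sequence generation compared to previous methods, including InterFormer, ReMoS, and Ready-to-React, while achieving substantial improvements in inference speed.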

[102] DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation

Zirui Wang,Tao Zhang

Main category: cs.CV

TL;DR: DenseScan is a new 3D scene understanding dataset whose dense, multi-level semantic annotations are generated automatically from multi-view 2D images and multimodal large language models. It supports fine-grained object-level descriptions and scenario-based question answering, advancing research in robotics, augmented reality, and related fields.

Details Motivation: Existing 3D scene datasets lack rich semantic annotations and cannot readily support complex vision-language tasks, so a dataset with fine-grained, context-aware semantic information is needed to improve 3D understanding. Method: DenseScan builds an automated annotation pipeline from multi-view 2D images and multimodal large language models (MLLMs) that densely captions 3D scene elements and generates scene-level questions, combining geometric information with semantic content. Result: Experiments show that DenseScan significantly improves object-level understanding and question-answering performance in 3D environments, with richer semantics and better task performance than traditional annotation methods. Conclusion: By combining geometric detail with rich semantics, DenseScan broadens the range of downstream applications and is expected to advance 3D scene understanding in real-world environments. Abstract: 3D understanding is a key capability for real-world AI assistance. High-quality data plays an important role in driving the development of the 3D understanding community. Current 3D scene understanding datasets often provide geometric and instance-level information, yet they lack the rich semantic annotations necessary for nuanced visual-language tasks. In this work, we introduce DenseScan, a novel dataset with detailed multi-level descriptions generated by an automated pipeline leveraging multi-view 2D images and multimodal large language models (MLLMs). Our approach enables dense captioning of scene elements, ensuring comprehensive object-level descriptions that capture context-sensitive details. Furthermore, we extend these annotations through scenario-based question generation, producing high-level queries that integrate object properties, spatial relationships, and scene context. By coupling geometric detail with semantic richness, DenseScan broadens the range of downstream tasks, from detailed visual-language navigation to interactive question answering. Experimental results demonstrate that our method significantly enhances object-level understanding and question-answering performance in 3D environments compared to traditional annotation pipelines. We release both the annotated dataset and our annotation pipeline to facilitate future research and applications in robotics, augmented reality, and beyond. Through DenseScan, we aim to catalyze new avenues in 3D scene understanding, allowing researchers and practitioners to tackle the complexities of real-world environments with richer, more contextually aware annotations.

[103] Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views

Kunwar Maheep Singh,Jianchun Chen,Vladislav Golyanik,Stephan J. Garbin,Thabo Beeler,Rishabh Dabral,Marc Habermann,Christian Theobalt

Main category: cs.CV

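TL;DR: This paper presents Relightable Holoported Characters (RHC), a new method for free-view rendering and relighting of full-body, dynamic humans from sparse-view RGB videos. Its transformer-based RelightNet performs relighting in a single forward pass, avoiding the high cost of traditional OLAT capture, and combines physics-informed features with 3D Gaussian splats for high-quality rendering.

Details Motivation: Traditional human relighting relies on one-light-at-a-time (OLAT) capture, which is costly and struggles to cover diverse lighting and dynamics. This work aims for efficient, high-quality free-view rendering and relighting from ordinary multi-view RGB video input alone. Method: RelightNet, a transformer-based network, takes physics-informed features (geometry, albedo, shading, and view) extracted from a coarse human mesh and the input views, cross-attends them with a novel lighting condition, and regresses texel-aligned 3D Gaussian splats attached to the mesh, relighting in a single forward pass. Training data come from a multi-view light stage that alternates frames lit by random environment maps with uniformly lit frames, ensuring both accurate motion tracking and diverse illumination. Result: Experiments show that the method surpasses state-of-the-art approaches in visual fidelity and lighting reproduction and efficiently produces high-quality relit human video. Conclusion: RHC achieves efficient, high-quality human relighting and free-view rendering from sparse RGB video input alone. By combining physics-informed features with a 3D Gaussian representation, it implicitly solves the rendering equation in a single network pass, markedly improving practicality and visual quality. Abstract: We present Relightable Holoported Characters (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method's superior visual fidelity and lighting reproduction compared to state-of-the-art approaches. Project page: https://vcai.mpi-inf.mpg.de/projects/RHC/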

[104] UniDiff: Parameter-Efficient Adaptation of Diffusion Models for Land Cover Classification with Multi-Modal Remotely Sensed Imagery and Sparse Annotations

Yuzhen Hu,Saurabh Prasad

Main category: cs.CV

TL;DR: This paper proposes UniDiff, a parameter-efficient framework that adapts a pretrained diffusion model to multimodal remote sensing data (such as HSI and SAR) using only target-domain data, enabling effective feature extraction and multimodal fusion under sparse annotations.

Details Motivation: Existing supervised methods are limited by scarce labeled data, and applying ImageNet-pretrained models directly to heterogeneous remote sensing modalities (such as HSI and SAR) is challenging, especially without large labeled datasets. Method: UniDiff combines FiLM-based timestep-modality conditioning, parameter-efficient fine-tuning of about 5% of parameters, and a pseudo-RGB anchoring strategy to preserve pretrained representations and prevent catastrophic forgetting. Result: Experiments on two multimodal remote sensing benchmarks validate the method, showing that unsupervised adaptation of a pretrained diffusion model mitigates annotation scarcity and fuses multimodal data well. Conclusion: UniDiff transfers an ImageNet-pretrained diffusion model to multimodal remote sensing tasks with few or no labels, offering a new approach to remote sensing image analysis under sparse annotations. Abstract: Sparse annotations fundamentally constrain multimodal remote sensing: even recent state-of-the-art supervised methods such as MSFMamba are limited by the availability of labeled data, restricting their practical deployment despite architectural advances. ImageNet-pretrained models provide rich visual representations, but adapting them to heterogeneous modalities such as hyperspectral imaging (HSI) and synthetic aperture radar (SAR) without large labeled datasets remains challenging. We propose UniDiff, a parameter-efficient framework that adapts a single ImageNet-pretrained diffusion model to multiple sensing modalities using only target-domain data. UniDiff combines FiLM-based timestep-modality conditioning, parameter-efficient adaptation of approximately 5% of parameters, and pseudo-RGB anchoring to preserve pre-trained representations and prevent catastrophic forgetting. This design enables effective feature extraction from remote sensing data under sparse annotations. Our results with two established multi-modal benchmarking datasets demonstrate that unsupervised adaptation of a pre-trained diffusion model effectively mitigates annotation constraints and achieves effective fusion of multi-modal remotely sensed data.
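
The FiLM-based conditioning named in the UniDiff abstract can be illustrated generically. The sketch below is not the authors' implementation: it only shows the standard feature-wise linear modulation idea, where a conditioning vector yields one scale and one shift per channel. Summing a timestep embedding and a modality embedding, and the names `conditioning`, `film`, `w_gamma`, and `w_beta`, are assumptions made for illustration.

```python
def conditioning(timestep_emb, modality_emb, w_gamma, w_beta):
    """Map a combined condition vector to per-channel (gamma, beta).

    Summing the two embeddings is an illustrative assumption; the paper
    may combine timestep and modality information differently.
    """
    cond = [t + m for t, m in zip(timestep_emb, modality_emb)]
    # gamma is parameterised around 1 so a zero projection leaves features unchanged
    gamma = [1.0 + sum(w * c for w, c in zip(row, cond)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, cond)) for row in w_beta]
    return gamma, beta


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift every value in a channel."""
    return [[gamma[c] * v + beta[c] for v in channel]
            for c, channel in enumerate(features)]


# Two channels with two values each, and a 2-dimensional condition vector.
features = [[1.0, 2.0], [3.0, 4.0]]
gamma, beta = conditioning(
    timestep_emb=[1.0, 0.0],
    modality_emb=[0.0, 1.0],
    w_gamma=[[0.5, 0.0], [0.0, -0.5]],
    w_beta=[[1.0, 0.0], [0.0, 1.0]],
)
modulated = film(features, gamma, beta)
```

Because only the small projections that produce `gamma` and `beta` need to be learned, this style of conditioning is compatible with the parameter-efficient adaptation the abstract describes.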

[105] HeartFormer: Semantic-Aware Dual-Structure Transformers for 3D Four-Chamber Cardiac Point Cloud Reconstruction

Zhengda Ma,Abhirup Banerjee

Main category: cs.CV

TL;DR: HeartFormer is the first geometric deep learning framework based on point cloud representation for 3D four-chamber cardiac reconstruction from cine MRI data. It performs multi-class point cloud completion with SA-DSTNet and SA-GFRTNet, and is accompanied by the large-scale dataset HeartCompv1.

Details Motivation: Conventional cine MRI typically provides only 2D slice images of the heart, which limits a comprehensive understanding of cardiac morphology and physiological mechanisms in both healthy and pathological conditions. Method: HeartFormer comprises a Semantic-Aware Dual-Structure Transformer Network (SA-DSTNet), which generates an initial coarse point cloud, and a Semantic-Aware Geometry Feature Refinement Transformer Network (SA-GFRTNet), which progressively refines the output. Result: Cross-domain experiments on HeartCompv1 and UK Biobank show that HeartFormer surpasses state-of-the-art methods in robustness, accuracy, and generalization. Conclusion: HeartFormer delivers high-fidelity, geometrically consistent cardiac reconstruction, advances point-cloud-based 3D cardiac modeling, and establishes a new benchmark for the field. Abstract: We present the first geometric deep learning framework based on point cloud representation for 3D four-chamber cardiac reconstruction from cine MRI data. This work addresses a long-standing limitation in conventional cine MRI, which typically provides only 2D slice images of the heart, thereby restricting a comprehensive understanding of cardiac morphology and physiological mechanisms in both healthy and pathological conditions. To overcome this, we propose HeartFormer, a novel point cloud completion network that extends traditional single-class point cloud completion to the multi-class setting. HeartFormer consists of two key components: a Semantic-Aware Dual-Structure Transformer Network (SA-DSTNet) and a Semantic-Aware Geometry Feature Refinement Transformer Network (SA-GFRTNet). SA-DSTNet generates an initial coarse point cloud with both global geometry features and substructure geometry features. Guided by these semantic-geometry representations, SA-GFRTNet progressively refines the coarse output, effectively leveraging both global and substructure geometric priors to produce high-fidelity and geometrically consistent reconstructions. We further construct HeartCompv1, the first publicly available large-scale dataset with 17,000 high-resolution 3D multi-class cardiac meshes and point clouds, to establish a general benchmark for this emerging research direction. Extensive cross-domain experiments on HeartCompv1 and UK Biobank demonstrate that HeartFormer achieves robust, accurate, and generalizable performance, consistently surpassing state-of-the-art (SOTA) methods.
Code and dataset will be released upon acceptance at: https://github.com/10Darren/HeartFormer.

[106] USB: Unified Synthetic Brain Framework for Bidirectional Pathology-Healthy Generation and Editing

Jun Wang,Peirong Liu

Main category: cs.CV

TL;DR: USB is an end-to-end framework that unifies bidirectional generation and editing of pathological and healthy brain images, using a paired diffusion mechanism and a consistency guidance algorithm to preserve anatomical consistency and lesion correspondence.

Details Motivation: Paired pathological-healthy brain imaging data are hard to obtain, and existing methods mostly model the two independently, lacking a unified capability for bidirectional generation and editing. Method: USB models the joint distribution of lesions and brain anatomy with a paired diffusion mechanism, and a consistency guidance algorithm preserves anatomical consistency and lesion correspondence during bidirectional editing. Result: Experiments on six public brain MRI datasets demonstrate USB's strong performance in generating diverse, realistic images, supporting a range of neuroimaging applications. Conclusion: USB is the first framework to unify generation and editing of pathological and healthy brain images, providing a new benchmark for scalable dataset creation and robust neuroimaging analysis. Abstract: Understanding the relationship between pathological and healthy brain structures is fundamental to neuroimaging, connecting disease diagnosis and detection with modeling, prediction, and treatment planning. However, paired pathological-healthy data are extremely difficult to obtain, as they rely on pre- and post-treatment imaging, constrained by clinical outcomes and longitudinal data availability. Consequently, most existing brain image generation and editing methods focus on visual quality yet remain domain-specific, treating pathological and healthy image modeling independently. We introduce USB (Unified Synthetic Brain), the first end-to-end framework that unifies bidirectional generation and editing of pathological and healthy brain images. USB models the joint distribution of lesions and brain anatomy through a paired diffusion mechanism and achieves both pathological and healthy image generation. A consistency guidance algorithm further preserves anatomical consistency and lesion correspondence during bidirectional pathology-healthy editing. Extensive experiments on six public brain MRI datasets including healthy controls, stroke, and Alzheimer's patients, demonstrate USB's ability to produce diverse and realistic results. By establishing the first unified benchmark for brain image generation and editing, USB opens opportunities for scalable dataset creation and robust neuroimaging analysis. Code is available at https://github.com/jhuldr/USB.

[107] HIMOSA: Efficient Remote Sensing Image Super-Resolution with Hierarchical Mixture of Sparse Attention

Yi Liu,Yi Wan,Xinyi Liu,Qiong Wu,Panwang Xia,Xuejun Huang,Yongjun Zhang

Main category: cs.CV

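TL;DR: This paper proposes HIMOSA, a lightweight framework for remote sensing image super-resolution. A content-aware sparse attention mechanism and a hierarchical window expansion strategy give it efficient computation and fast inference while maintaining strong reconstruction performance.

Details Motivation: Existing remote sensing super-resolution methods struggle to balance model performance against computational efficiency, while applications such as disaster detection demand real-time, lightweight models, so a solution that delivers both is needed. Method: HIMOSA exploits the inherent redundancy of remote sensing imagery with a content-aware sparse attention mechanism, and adds a hierarchical window expansion strategy to capture multi-scale repetitive patterns while reducing the computational complexity of attention. Result: Experiments on multiple remote sensing datasets show that HIMOSA achieves state-of-the-art super-resolution performance at low computational cost. Conclusion: Through sparse attention and hierarchical window design, HIMOSA balances performance and efficiency in remote sensing super-resolution and suits applications with strict real-time requirements. Abstract: In remote sensing applications, such as disaster detection and response, real-time efficiency and model lightweighting are of critical importance. Consequently, existing remote sensing image super-resolution methods often face a trade-off between model performance and computational efficiency. In this paper, we propose a lightweight super-resolution framework for remote sensing imagery, named HIMOSA. Specifically, HIMOSA leverages the inherent redundancy in remote sensing imagery and introduces a content-aware sparse attention mechanism, enabling the model to achieve fast inference while maintaining strong reconstruction performance. Furthermore, to effectively leverage the multi-scale repetitive patterns found in remote sensing imagery, we introduce a hierarchical window expansion and reduce the computational complexity by adjusting the sparsity of the attention. Extensive experiments on multiple remote sensing datasets demonstrate that our method achieves state-of-the-art performance while maintaining computational efficiency.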

[108] Rethinking Lung Cancer Screening: AI Nodule Detection and Diagnosis Outperforms Radiologists, Leading Models, and Standards Beyond Size and Growth

Sylvain Bodard,Pierre Baudot,Benjamin Renoust,Charles Voyton,Gwendoline De Bie,Ezequiel Geremia,Van-Khoa Le,Danny Francis,Pierre-Henri Siot,Yousra Haddou,Vincent Bobin,Jean-Christophe Brisset,Carey C. Thomson,Valerie Bourdes,Benoit Huet

Main category: cs.CV

TL;DR: This paper presents an AI system for low-dose CT scans that performs lung nodule detection and malignancy diagnosis together at the nodule level. It outperforms radiologists and existing AI models, with an advantage of up to one year in diagnosing early-stage cancers and slow-growing nodules.

Details Motivation: Traditional lung cancer screening depends on nodule size and growth rate, which delays the diagnosis of malignant nodules; earlier, more accurate diagnostic methods are needed. Method: The system is an ensemble of shallow deep learning models and feature-based specialized models, trained and evaluated on a dataset of 25,709 scans with 69,449 annotated nodules. Result: AUC reaches 0.98 internally and 0.945 on an independent cohort; sensitivity is 99.3% at 0.5 false positives per scan; the system outperforms radiologists across all nodule sizes, stages, and growth-based metrics (including volume-doubling time). Conclusion: The AI system can substantially improve early lung cancer detection, overcome the limitations of growth-based screening, and advance the practical use of AI in clinical screening. Abstract: Early detection of malignant lung nodules is critical, but its dependence on size and growth in screening inherently delays diagnosis. We present an AI system that redefines lung cancer screening by performing both detection and malignancy diagnosis directly at the nodule level on low-dose CT scans. To address limitations in dataset scale and explainability, we designed an ensemble of shallow deep learning and feature-based specialized models. Trained and evaluated on 25,709 scans with 69,449 annotated nodules, the system outperforms radiologists, Lung-RADS, and leading AI models (Sybil, Brock, Google, Kaggle). It achieves an area under the receiver operating characteristic curve (AUC) of 0.98 internally and 0.945 on an independent cohort. With 0.5 false positives per scan at 99.3% sensitivity, it addresses key barriers to AI adoption. Critically, it outperforms radiologists across all nodule sizes and stages, excelling in stage 1 cancers, and all growth-based metrics, including the least accurate: Volume-Doubling Time. It also surpasses radiologists by up to one year in diagnosing indeterminate and slow-growing nodules.
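
For readers less familiar with the two headline metrics in this entry, the sketch below shows how AUC and sensitivity are computed from per-nodule scores. The scores and labels are made-up toy values, not data from the paper, and the function names are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via its pairwise-ranking definition.

    AUC equals the probability that a randomly chosen malignant case
    receives a higher score than a randomly chosen benign case
    (ties count as half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity(scores, labels, threshold):
    """Fraction of malignant cases scored at or above the decision threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos)


# Toy example: 1 = malignant, 0 = benign (hypothetical values).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
toy_sensitivity = sensitivity(toy_scores, toy_labels, threshold=0.5)
```

AUC is threshold-free, whereas sensitivity depends on the chosen operating point, which is why the abstract reports sensitivity together with a false-positive rate per scan.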

[109] Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR

Lixing Guo,Tobias Höllerer

Main category: cs.CV

TL;DR: This paper presents a modular augmented reality (AR) agent system that combines multimodal large language models (MLLMs) with grounded vision models to understand complex spatial relations and localize objects in 3D from natural language.

Details Motivation: Traditional AR systems rely on fixed class detectors or fiducial markers and cannot readily interpret open-vocabulary natural language queries, limiting their use in complex scenes. Method: An adaptive task agent coordinates MLLMs with coordinate-aware perception tools, builds dynamic AR scene graphs encoding nine typed relations, and supports human-in-the-loop refinement through task-adaptive region-of-interest highlighting and contextual spatial retrieval. Result: The system handles queries ranging from simple object identification to multi-object relational reasoning, returns meter-accurate 3D anchors, and achieves language-driven spatial localization and relation grounding in real environments. Conclusion: The modular architecture integrates various vision-language models without retraining, making AR agents intermediaries that augment MLLMs with spatial intelligence and advancing interactive scene understanding. Abstract: Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through task-adaptive region-of-interest highlighting and contextual spatial retrieval, the system guides human attention to information-dense areas while supporting human-in-the-loop refinement. The agent dynamically invokes coordinate-aware tools for complex queries (selection, measurement, comparison, and actuation), grounding language understanding in physical operations. The modular architecture supports plug-and-use vision-language models without retraining, establishing AR agents as intermediaries that augment MLLMs with real-world spatial intelligence for interactive scene understanding. We also introduce GroundedAR-Bench, an evaluation framework for language-driven real-world localization and relation grounding across diverse environments.

[110] TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion

Rui Qian,Haozhi Cao,Tianchen Deng,Tianxin Hu,Weixiang Guo,Shenghai Yuan,Lihua Xie

Main category: cs.CV

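TL;DR: This paper proposes TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied 3D semantic scene completion (SSC). With a persistent Gaussian memory and confidence-aware temporal fusion, it achieves state-of-the-art accuracy, scalability, and long-term scene consistency while using fewer primitives.

Details Motivation: Existing Gaussian-based methods randomly initialize many primitives within predefined spatial bounds, which is redundant and scales poorly to unbounded scenes; a recent depth-guided approach improves on this but remains local and suffers from latency and memory overhead. Method: TGSFormer maintains a persistent Gaussian memory for temporal prediction without relying on image coherence or frame caches. A Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention, and a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations to regulate density and maintain compactness. Result: TGSFormer achieves state-of-the-art performance on multiple local and embodied SSC benchmarks with significantly fewer primitives, together with better accuracy, scalability, and long-term scene integrity. Conclusion: Through an efficient temporal Gaussian fusion mechanism, TGSFormer addresses the redundancy, scalability, and memory limitations of earlier methods and offers an efficient, scalable paradigm for embodied 3D semantic scene completion. Abstract: Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. A recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases. To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches. For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention. Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primitives into voxel-aligned representations, regulating density and maintaining compactness. Extensive experiments demonstrate that TGSFormer achieves state-of-the-art results on both local and embodied SSC benchmarks, offering superior accuracy and scalability with significantly fewer primitives while maintaining consistent long-term scene integrity. The code will be released upon acceptance.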
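
The abstract does not spell out how the Confidence-aware Voxel Fusion module merges primitives, so the following is only a minimal sketch of one plausible reading: primitives whose centres fall in the same voxel are replaced by a single confidence-weighted centre. The tuple layout, the choice of keeping the maximum confidence, and the function name `fuse_by_voxel` are all assumptions, and a real Gaussian primitive would also carry scale, rotation, opacity, and semantics.

```python
import math
from collections import defaultdict


def fuse_by_voxel(primitives, voxel_size):
    """Merge primitives that share a voxel into one confidence-weighted primitive.

    primitives: list of (x, y, z, confidence) tuples with positive confidence.
    Returns a dict mapping integer voxel index -> fused (x, y, z, confidence).
    """
    buckets = defaultdict(list)
    for p in primitives:
        key = tuple(math.floor(coord / voxel_size) for coord in p[:3])
        buckets[key].append(p)

    fused = {}
    for key, group in buckets.items():
        total = sum(p[3] for p in group)
        # Confidence-weighted mean position of the overlapping primitives.
        centre = tuple(sum(p[i] * p[3] for p in group) / total for i in range(3))
        # Keeping the highest confidence is an illustrative choice.
        fused[key] = centre + (max(p[3] for p in group),)
    return fused


toy_primitives = [
    (0.1, 0.1, 0.1, 0.75),  # shares voxel (0, 0, 0) with the next primitive
    (0.3, 0.3, 0.3, 0.25),
    (1.2, 0.0, 0.0, 0.5),   # alone in voxel (2, 0, 0)
]
fused = fuse_by_voxel(toy_primitives, voxel_size=0.5)
```

Whatever the exact merge rule, the effect is the one the abstract states: at most one primitive survives per voxel, so the representation's size is bounded by the occupied volume, not by the number of frames observed.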

[111] Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation

Xiao Cui,Yulei Qin,Wengang Zhou,Hongsheng Li,Houqiang Li

Main category: cs.CV

TL;DR: This paper proposes an optimal transport (OT) based method for image dataset distillation. Minimizing OT distance yields fine-grained alignment at both global and instance levels, preserving the geometric structure of the distribution and intra-class variation, and substantially outperforming existing methods on large-scale datasets.

Details Motivation: Existing large-scale dataset distillation methods focus on matching global statistics and overlook instance-level characteristics and intra-class variation, which limits generalization. Method: Dataset distillation is reformulated as an OT distance minimization problem with three components that preserve distributional geometry: OT-guided diffusion sampling, label-image-aligned soft relabeling, and OT-based logit matching. Result: Experiments across diverse architectures and large-scale datasets (such as ImageNet-1K) show at least a 4% accuracy improvement over existing methods under the IPC=10 setting. Conclusion: By introducing optimal transport, the method achieves finer distribution matching and markedly improves the performance and generalization of dataset distillation, particularly on large, complex datasets. Abstract: Dataset distillation seeks to synthesize a compact distilled dataset, enabling models trained on it to achieve performance comparable to models trained on the full dataset. Recent methods for large-scale datasets focus on matching global distributional statistics (e.g., mean and variance), but overlook critical instance-level characteristics and intraclass variations, leading to suboptimal generalization. We address this limitation by reformulating dataset distillation as an Optimal Transport (OT) distance minimization problem, enabling fine-grained alignment at both global and instance levels throughout the pipeline. OT offers a geometrically faithful framework for distribution matching. It effectively preserves local modes, intra-class patterns, and fine-grained variations that characterize the geometry of complex, high-dimensional distributions. Our method comprises three components tailored for preserving distributional geometry: (1) OT-guided diffusion sampling, which aligns latent distributions of real and distilled images; (2) label-image-aligned soft relabeling, which adapts label distributions based on the complexity of distilled image distributions; and (3) OT-based logit matching, which aligns the output of student models with soft-label distributions. Extensive experiments across diverse architectures and large-scale datasets demonstrate that our method consistently outperforms state-of-the-art approaches in an efficient manner, achieving at least 4% accuracy improvement under IPC=10 settings for each architecture on ImageNet-1K.
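
To make the idea of an OT distance between two sample sets concrete, here is a small sketch using Sinkhorn iterations for entropy-regularised OT. The abstract does not say which OT solver the authors use, so Sinkhorn is an assumption chosen because it is a common, simple approximation; uniform sample weights, squared Euclidean cost, and the name `sinkhorn_cost` are likewise illustrative.

```python
import math


def sinkhorn_cost(xs, ys, reg=0.1, iters=200):
    """Entropy-regularised OT cost between two uniformly weighted point sets.

    xs, ys: lists of equal-dimension tuples. Returns (transport cost, plan).
    """
    n, m = len(xs), len(ys)
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan


# Identical sets: almost no mass has to move.
same_cost, _ = sinkhorn_cost([(0.0,), (1.0,)], [(0.0,), (1.0,)])
# Two points at distance 1 from a single target: every unit of mass moves cost 1.
merge_cost, merge_plan = sinkhorn_cost([(0.0,), (2.0,)], [(1.0,)])
```

Unlike matching only a mean and variance, the transport plan pairs individual samples, which is the instance-level alignment the abstract emphasises. For very small `reg` or large costs, a log-domain variant would be numerically safer than this direct form.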

[112] ART-ASyn: Anatomy-aware Realistic Texture-based Anomaly Synthesis Framework for Chest X-Rays

Qinyi Cao,Jianan Fan,Weidong Cai

Main category: cs.CV

TL;DR: This paper proposes ART-ASyn, an anatomy-aware texture-based anomaly synthesis framework for chest X-rays. Guided by lung segmentation from the PBTSeg method, it generates realistic anomalies and supports unsupervised anomaly detection and zero-shot anomaly segmentation.

Details Motivation: Existing synthetic anomaly methods produce anomalies that differ markedly from real pathological patterns and ignore anatomical structure, limiting generalization to real-world settings. Method: ART-ASyn combines texture-based augmentation with the newly designed PBTSeg lung segmentation method to generate anatomically consistent, visually realistic lung anomalies on normal samples, along with precise pixel-level anomaly masks for supervised training. Result: The method performs well on unsupervised anomaly classification and generalizes to cross-dataset zero-shot anomaly segmentation without target-domain annotations. Conclusion: Through anatomy awareness and texture-based augmentation, ART-ASyn generates high-quality synthetic anomalies that improve the performance and practicality of unsupervised anomaly detection and segmentation. Abstract: Unsupervised anomaly detection aims to identify anomalies without pixel-level annotations. Synthetic anomaly-based methods exhibit a unique capacity to introduce controllable irregularities with known masks, enabling explicit supervision during training. However, existing methods often produce synthetic anomalies that are visually distinct from real pathological patterns and ignore anatomical structure. This paper presents a novel Anatomy-aware Realistic Texture-based Anomaly Synthesis framework (ART-ASyn) for chest X-rays that generates realistic and anatomically consistent lung-opacity-related anomalies using texture-based augmentation guided by our proposed Progressive Binary Thresholding Segmentation method (PBTSeg) for lung segmentation. The generated paired samples of synthetic anomalies and their corresponding precise pixel-level anomaly masks for each normal sample enable explicit segmentation supervision. In contrast to prior work limited to one-class classification, ART-ASyn is further evaluated for zero-shot anomaly segmentation, demonstrating generalizability on an unseen dataset without target-domain annotations. Code is available at https://github.com/angelacao-hub/ART-ASyn.

[113] Odometry Without Correspondence from Inertially Constrained Ruled Surfaces

Chenqi Zhu,Levi Burner,Yiannis Aloimonos

Main category: cs.CV

TL;DR: This paper proposes a new algorithm for 3D scene reconstruction and visual odometry based on the ruled surfaces swept by straight lines in image space-time, using IMU data to reduce the complexity of the solution.

Details Motivation: Traditional visual odometry relies on feature point matching, which is computationally expensive and of varying accuracy, with the correspondence problem a particular difficulty. Method: The approach exploits the ruled surface that a straight line's image sweeps in image space-time as the camera moves, estimates motion from differential updates based on point-to-line associations, and constrains the surface model with IMU data. Result: The outcome is a visual odometry algorithm that does not depend on traditional point-to-point correspondence, reducing computational cost and improving robustness. Conclusion: The method bypasses the problems of traditional optical-flow matching and achieves more efficient 3D reconstruction and pose estimation through ruled-surface analysis aided by the IMU. Abstract: Visual odometry techniques typically rely on feature extraction from a sequence of images and subsequent computation of optical flow. This point-to-point correspondence between two consecutive frames can be costly to compute and suffers from varying accuracy, which affects the odometry estimate's quality. Attempts have been made to bypass the difficulties originating from the correspondence problem by adopting line features and fusing other sensors (event camera, IMU) to improve performance, many of which still heavily rely on correspondence. If the camera observes a straight line as it moves, the image of the line sweeps a smooth surface in image-space time. It is a ruled surface and analyzing its shape gives information about odometry. Further, its estimation requires only differentially computed updates from point-to-line associations. Inspired by event cameras' propensity for edge detection, this research presents a novel algorithm to reconstruct 3D scenes and visual odometry from these ruled surfaces. By constraining the surfaces with the inertia measurements from an onboard IMU sensor, the dimensionality of the solution space is greatly reduced.
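
The claim that a moving line's image sweeps a ruled surface can be written down in one line. The notation below is an assumption introduced for illustration (the abstract gives no symbols): $c(t)$ is any point on the image line at time $t$ and $d(t)$ is the line's unit direction in the image plane.

```latex
% Assumed notation: c(t) and d(t) are illustrative, not the paper's symbols.
S(s, t) = \bigl( c(t) + s\, d(t),\; t \bigr),
\qquad c(t) \in \mathbb{R}^2,\quad d(t) \in \mathbb{R}^2,\quad \lVert d(t) \rVert = 1 .
```

For each fixed $t$, the map $s \mapsto S(s, t)$ traces a straight line in the three-dimensional image space-time volume, so the surface is a union of straight lines, which is the definition of a ruled surface. How $c(t)$ and $d(t)$ vary with $t$ depends on the camera motion, which is why the shape of the surface carries odometry information.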

[114] MVAD: A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection

Mengxue Hu,Yunfeng Diao,Changtao Miao,Jianshu Li,Zhe Li,Joey Tianyi Zhou

Main category: cs.CV

TL;DR: This paper introduces MVAD, the first dataset for detecting AI-generated multimodal video-audio content, filling the gap left by existing datasets in genuinely multimodal forged content.

Details Motivation: Existing synthetic video datasets focus mainly on the visual modality alone, and those that include audio are largely confined to facial deepfakes. They cannot cope with increasingly complex multimodal AI-generated content, which impedes the development of trustworthy detection systems. Method: The authors build MVAD, a multimodal video-audio dataset covering three realistic video-audio forgery patterns, generated with a variety of state-of-the-art generative models, and spanning multiple visual styles, content categories, and multimodal data types. Result: MVAD offers genuine multimodality, high perceptual quality, and broad diversity, making it suitable for research on detecting multimodal AI-generated content. Conclusion: MVAD is the first comprehensive dataset designed specifically for detecting AI-generated multimodal video-audio content and is expected to advance multimodal forgery detection. Abstract: The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes--a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.

[115] Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models

Zhongqi Wang,Jie Zhang,Shiguang Shan,Xilin Chen

Main category: cs.CV

TL;DR: This paper proposes AMDET, a backdoor detection framework for vision-language pretrained models that requires no prior knowledge. Building on the feature assimilation phenomenon observed in backdoored text encoders, it recovers implicit trigger features via gradient-based inversion and uses loss landscape analysis to separate natural from injected backdoors, achieving accurate and robust detection.

Details Motivation: Existing backdoor detection methods depend on prior knowledge of training data, triggers, or downstream classifiers, which limits practical use, so a method that detects backdoors in vision-language models without such information is needed. Method: AMDET builds on feature assimilation (all token representations within a backdoor sample are highly similar), which is attributed to attention weights concentrating on the trigger token. It recovers implicit features that activate backdoor behavior via gradient-based inversion, and filters out the backdoor-like features that occur naturally in CLIP by analyzing loss landscapes. Result: On 3,600 backdoored and benign-finetuned models, AMDET attains an F1 score of 89.90%, takes about 5 minutes per detection (RTX 4090), is robust against adaptive attacks, and identifies natural backdoor features in OpenAI's official CLIP. Conclusion: AMDET is an efficient, practical backdoor detection method for VLP models that needs no prior knowledge; it exposes the mechanism behind feature assimilation and offers a new approach to model security assessment. Abstract: Vision-language pretrained models (VLPs) such as CLIP have achieved remarkable success, but are also highly vulnerable to backdoor attacks. Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem. Existing detection methods usually rely on prior knowledge of training dataset, backdoor triggers and targets, or downstream classifiers, which may be impractical for real-world applications. To address this challenge, we introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge. Specifically, we first reveal the feature assimilation property in backdoored text encoders: the representations of all tokens within a backdoor sample exhibit a high similarity. Further analysis attributes this effect to the concentration of attention weights on the trigger token. Leveraging this insight, AMDET scans a model by performing gradient-based inversion on token embeddings to recover implicit features capable of activating backdoor behaviors. Furthermore, we identify natural backdoor features in OpenAI's official CLIP model, which are not intentionally injected but still exhibit backdoor-like behaviors. We then filter them out from real injected backdoors by analyzing their loss landscapes.
Extensive experiments on 3,600 backdoored and benign-finetuned models with two attack paradigms and three VLP model structures show that AMDET detects backdoors with an F1 score of 89.90%. It also completes one detection in approximately 5 minutes on an RTX 4090 GPU and exhibits strong robustness against adaptive attacks. Code is available at: https://github.com/Robin-WZQ/AMDET
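
The feature assimilation property described above (token representations within a backdoor sample become highly similar) suggests a simple diagnostic statistic. The sketch below computes the mean pairwise cosine similarity across token features; it is not AMDET's detection procedure, which relies on gradient-based inversion, and the toy vectors and the name `assimilation_score` are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assimilation_score(token_features):
    """Mean pairwise cosine similarity over all token representations."""
    pairs = list(combinations(token_features, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical 2-D token features: a clean sample keeps tokens distinct,
# while an assimilated sample has all tokens pointing the same way.
clean_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
assimilated_tokens = [[1.0, 0.10], [1.0, 0.12], [0.9, 0.10]]
clean_score = assimilation_score(clean_tokens)
assimilated_score = assimilation_score(assimilated_tokens)
```

A score close to 1 means the tokens have collapsed toward a common direction, which matches the abstract's explanation that attention concentrates on the trigger token.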

[116] mmPred: Radar-based Human Motion Prediction in the Dark

Junqiao Fan,Haocong Rao,Jiarui Zhang,Jianfei Yang,Lihua Xie

Main category: cs.CV

TL;DR: This paper is the first to use millimeter-wave radar for human motion prediction (HMP). Its diffusion-based mmPred framework, with a dual-domain historical motion representation and a Global Skeleton-relational Transformer, achieves leading performance on noisy, temporally inconsistent radar signals.

Details Motivation: Existing HMP methods based on RGB-D cameras are sensitive to lighting and raise privacy concerns, whereas millimeter-wave radar is robust and privacy-preserving, so exploring radar as a new sensing modality is worthwhile. Method: mmPred is the first diffusion-based framework for radar HMP. It uses a dual-domain historical motion representation, consisting of a Time-domain Pose Refinement (TPR) branch and a Frequency-domain Dominant Motion (FDM) branch, and a Global Skeleton-relational Transformer (GST) as the diffusion backbone. Result: On the mmBody and mm-Fi datasets, mmPred improves on existing methods by 8.6% and 22% respectively, reaching state-of-the-art performance. Conclusion: Millimeter-wave radar is a viable and promising sensing modality for HMP, and mmPred markedly improves prediction accuracy and stability by modeling the characteristics of radar signals. Abstract: Existing Human Motion Prediction (HMP) methods based on RGB-D cameras are sensitive to lighting conditions and raise privacy concerns, limiting their real-world applications such as firefighting and healthcare. Motivated by the robustness and privacy-preserving nature of millimeter-wave (mmWave) radar, this work is the first to introduce radar as a novel sensing modality for HMP. Nevertheless, radar signals often suffer from specular reflections and multipath effects, resulting in noisy and temporally inconsistent measurements, such as missed detections of body parts. To address these radar-specific artifacts, we propose mmPred, the first diffusion-based framework tailored for radar-based HMP. mmPred introduces a dual-domain historical motion representation to guide the generation process, combining a Time-domain Pose Refinement (TPR) branch for learning fine-grained details and a Frequency-domain Dominant Motion (FDM) branch for capturing global motion trends and suppressing frame-level inconsistency. Furthermore, we design a Global Skeleton-relational Transformer (GST) as the diffusion backbone to model global inter-joint cooperation, enabling corrupted joints to dynamically aggregate information from others. Extensive experiments show that mmPred achieves state-of-the-art performance, outperforming existing methods by 8.6% on mmBody and 22% on mm-Fi.

[117] SMamDiff: Spatial Mamba for Stochastic Human Motion Prediction

Junqiao Fan,Pengfei Liu,Haocong Rao

Main category: cs.CV

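TL;DR: SMamDiff is a single-stage diffusion model built on Spatial Mamba. Its residual-DCT motion encoding and stickman-drawing spatial-Mamba module improve the spatial-temporal coherence of human motion prediction, striking a better balance among accuracy, diversity, and efficiency.

Details Motivation: Existing HMP methods either make deterministic predictions that ignore uncertainty or use probabilistic models that sacrifice kinematic plausibility. Multi-stage diffusion models improve diversity but are computationally costly and hard to deploy on edge devices, so a single-stage diffusion model that is accurate, diverse, and efficient is needed. Method: SMamDiff has two designs: (1) a residual-DCT motion encoding that subtracts the last observed pose before the temporal DCT, reducing the dominance of the DC component and strengthening the learning of higher-frequency motion cues; and (2) a stickman-drawing spatial-Mamba module that models joints in order, so that later joints condition on earlier ones, establishing long-range cross-joint dependencies. Result: On Human3.6M and HumanEva, SMamDiff achieves state-of-the-art performance among single-stage probabilistic HMP methods, with lower latency and memory use than multi-stage diffusion models. Conclusion: By introducing spatial-temporal coherence mechanisms, SMamDiff markedly improves the accuracy and plausibility of human motion prediction while keeping the efficiency of a single-stage architecture, making it suitable for edge deployment. Abstract: With intelligent room-side sensing and service robots widely deployed, human motion prediction (HMP) is essential for safe, proactive assistance. However, many existing HMP methods either produce a single, deterministic forecast that ignores uncertainty or rely on probabilistic models that sacrifice kinematic plausibility. Diffusion models improve the accuracy-diversity trade-off but often depend on multi-stage pipelines that are costly for edge deployment. This work focuses on how to ensure spatial-temporal coherence within a single-stage diffusion model for HMP. We introduce SMamDiff, a Spatial Mamba-based Diffusion model with two novel designs: (i) a residual-DCT motion encoding that subtracts the last observed pose before a temporal DCT, reducing the first DC component ($f=0$) dominance and highlighting informative higher-frequency cues so the model learns how joints move rather than where they are; and (ii) a stickman-drawing spatial-mamba module that processes joints in an ordered, joint-by-joint manner, making later joints condition on earlier ones to induce long-range, cross-joint dependencies. On Human3.6M and HumanEva, these coherence mechanisms deliver state-of-the-art results among single-stage probabilistic HMP methods with lower latency and memory use than multi-stage diffusion baselines.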
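
The effect of the residual-DCT encoding can be checked numerically on a single joint coordinate. The sketch below uses an unnormalised DCT-II written out by hand; the normalisation, the toy trajectory, and the names `dct2` and `residual_dct` are assumptions for illustration, not the paper's implementation.

```python
import math


def dct2(signal):
    """Unnormalised DCT-II of a 1-D sequence along time."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                for t, x in enumerate(signal))
            for k in range(n)]


def residual_dct(trajectory, last_observed):
    """Subtract the last observed value, then apply the temporal DCT."""
    return dct2([x - last_observed for x in trajectory])


# One coordinate of one joint: a large static offset plus a small drift.
trajectory = [10.0, 10.1, 10.2, 10.3]
plain = dct2(trajectory)
residual = residual_dct(trajectory, last_observed=10.0)
# plain[0] is about 40.6 (dominated by the offset); residual[0] is about 0.6.
# Coefficients for k >= 1 are identical, because a constant only affects k = 0.
```

Subtracting a constant pose therefore changes only the DC term, shrinking it from the scale of the absolute position to the scale of the motion itself, which is consistent with the abstract's point that the model should learn how joints move and not where they are.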

[118] MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters

Jianhong Han,Yupei Wang,Yuan Zhang,Liang Chen

Main category: cs.CV

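TL;DR: This paper proposes MM-DETR, a lightweight and efficient framework for multimodal remote sensing object detection that achieves efficient cross-modal modeling and feature fusion through a Mamba-based dual-granularity fusion encoder and frequency-aware adapters.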

Details Motivation: Existing methods struggle to balance performance and lightweight design. Shared backbones provide insufficient modality-specific modeling, while dual-stream architectures have too many parameters, which limits practical deployment. Method: Proposes a Mamba-based dual-granularity fusion encoder that reformulates global interaction as channel-wise dynamic gating and uses a 1D selective scan for cross-modal modeling with linear complexity. A region-aware 2D selective-scan completion branch performs fine-grained fusion along a bidirectional pyramid pathway. A lightweight frequency-aware modality adapter combines a spatial-frequency co-expert structure with a pixel-wise routing mechanism. Result: Extensive experiments on four multimodal remote sensing benchmark datasets verify the effectiveness and generalization capability of the method. Conclusion: MM-DETR improves multimodal remote sensing object detection while remaining lightweight, addressing the trade-off between efficiency and representational power in modality feature fusion. Abstract: Multimodal remote sensing object detection aims to achieve more accurate and robust perception under challenging conditions by fusing complementary information from different modalities. However, existing approaches that rely on attention-based or deformable convolution fusion blocks still struggle to balance performance and lightweight design. Beyond fusion complexity, extracting modality features with shared backbones yields suboptimal representations due to insufficient modality-specific modeling, whereas dual-stream architectures nearly double the parameter count, ultimately limiting practical deployment. To this end, we propose MM-DETR, a lightweight and efficient framework for multimodal object detection. Specifically, we propose a Mamba-based dual-granularity fusion encoder that reformulates global interaction as channel-wise dynamic gating and leverages a 1D selective scan for efficient cross-modal modeling with linear complexity. Following this design, we further reinterpret multimodal fusion as a modality completion problem. A region-aware 2D selective scanning completion branch is introduced to recover modality-specific cues, supporting fine-grained fusion along a bidirectional pyramid pathway with minimal overhead. To further reduce parameter redundancy while retaining strong feature extraction capability, a lightweight frequency-aware modality adapter is inserted into the shared backbone. This adapter employs a spatial-frequency co-expert structure to capture modality-specific cues, while a pixel-wise router dynamically balances expert contributions for efficient spatial-frequency fusion. Extensive experiments conducted on four multimodal benchmark datasets demonstrate the effectiveness and generalization capability of the proposed method.

[119] Towards aligned body representations in vision models

Andrey Gizdov,Andrea Procopio,Yichen Li,Daniel Harari,Tomer Ullman

Main category: cs.CV

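TL;DR: The study finds that smaller vision segmentation models, under resource constraints, naturally form human-like coarse "body" representations, whereas larger models tend toward overly fine-grained encodings. This suggests that limited computational resources may promote the emergence of human-like representations for physical reasoning.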

Details Motivation: To explore whether the coarse "body" representations used in human physical reasoning emerge naturally in vision segmentation models, and to compare models of different sizes. Method: Adapts a psychophysical experiment conducted with 50 human participants to a semantic segmentation task, tests seven segmentation networks of different sizes, and analyzes their representations. Result: Smaller models show human-like coarse body representations, whereas larger models tend toward finer, non-human-like detail encoding. Conclusion: Limited computational resources may lead to coarse representations that are closer to those of humans, and machine models can offer a scalable path toward understanding the structure of physical reasoning in the brain. Abstract: Human physical reasoning relies on internal "body" representations: coarse, volumetric approximations that capture an object's extent and support intuitive predictions about motion and physics. While psychophysical evidence suggests humans use such coarse representations, their internal structure remains largely unknown. Here we test whether vision models trained for segmentation develop comparable representations. We adapt a psychophysical experiment conducted with 50 human participants to a semantic segmentation task and test a family of seven segmentation networks, varying in size. We find that smaller models naturally form human-like coarse body representations, whereas larger models tend toward overly detailed, fine-grained encodings. Our results demonstrate that coarse representations can emerge under limited computational resources, and that machine representations can provide a scalable path toward understanding the structure of physical reasoning in the brain.

[120] THCRL: Trusted Hierarchical Contrastive Representation Learning for Multi-View Clustering

Jian Zhu

Main category: cs.CV

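TL;DR: This paper proposes THCRL, a new multi-view clustering method. It uses deep symmetric hierarchical fusion and an average K-nearest-neighbors contrastive learning module to address the untrustworthy fusion that arises in conventional methods from noise and from ignoring local structure, and it improves clustering performance.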

Details Motivation: Existing MVC methods overlook the noise within individual views and, in contrastive learning, do not exploit structural information among samples of the same class. This makes the multi-view fusion untrustworthy and hurts clustering. Method: Proposes the THCRL framework with two modules: (1) the DSHF module, which uses a UNet architecture with multiple denoising mechanisms to achieve trustworthy multi-view fusion; (2) the AKCL module, which performs contrastive learning using the average K-nearest-neighbor relations among samples in the same cluster, strengthening the consistency and confidence of the fused representation. Result: Extensive experiments on multiple datasets show that THCRL reaches state-of-the-art performance in deep multi-view clustering. Conclusion: THCRL addresses untrustworthy fusion in multi-view clustering by combining denoising fusion with contrastive learning based on local structure, improving clustering quality and model robustness. Abstract: Multi-View Clustering (MVC) has garnered increasing attention in recent years. It is capable of partitioning data samples into distinct groups by learning a consensus representation. However, a significant challenge remains: the problem of untrustworthy fusion. This problem primarily arises from two key factors: 1) Existing methods often ignore the presence of inherent noise within individual views; 2) In traditional MVC methods using Contrastive Learning (CL), similarity computations typically rely on different views of the same instance, while neglecting the structural information from nearest neighbors within the same cluster. Consequently, multi-view fusion is steered in the wrong direction. To address this problem, we present a novel Trusted Hierarchical Contrastive Representation Learning (THCRL). It consists of two key modules. Specifically, we propose the Deep Symmetry Hierarchical Fusion (DSHF) module, which leverages the UNet architecture integrated with multiple denoising mechanisms to achieve trustworthy fusion of multi-view data. Furthermore, we present the Average K-Nearest Neighbors Contrastive Learning (AKCL) module to align the fused representation with the view-specific representation. Unlike conventional strategies, AKCL enhances representation similarity among samples belonging to the same cluster, rather than merely focusing on the same sample across views, thereby reinforcing the confidence of the fused representation. Extensive experiments demonstrate that THCRL achieves state-of-the-art performance in deep MVC tasks.

[121] POLARIS: Projection-Orthogonal Least Squares for Robust and Adaptive Inversion in Diffusion Models

Wenshuo Chen,Haosen Li,Shaofeng Liang,Lei Wang,Haozhe Jia,Kaishen Yuan,Jieming Wu,Bowen Tian,Yutao Yue

Main category: cs.CV

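TL;DR: Proposes POLARIS, which models the guidance scale as a step-wise variable to reduce the noise approximation error in diffusion model inversion at its source, improving image editing and reconstruction quality.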

Details Motivation: The existing diffusion inversion paradigm suffers from error accumulation caused by noise approximation, which degrades reconstruction quality. Method: Proposes POLARIS, which reformulates inversion as an error-origin problem and uses orthogonal projection and least squares to derive the optimal guidance scale at each step, minimizing the per-step inversion error. Result: POLARIS mitigates the noise approximation error, improves the quality of the inverted latents, and performs better on a range of downstream tasks, while requiring only a one-line code change and very little computational overhead. Conclusion: POLARIS offers an efficient, theoretically grounded framework for optimizing inversion that improves diffusion models on image editing and restoration tasks without complex tuning. Abstract: The Inversion-Denoising Paradigm, which is based on diffusion models, excels in diverse image editing and restoration tasks. We revisit its mechanism and reveal a critical, overlooked factor in reconstruction degradation: the approximate noise error. This error stems from approximating the noise at step t with the prediction at step t-1, resulting in severe error accumulation throughout the inversion process. We introduce Projection-Orthogonal Least Squares for Robust and Adaptive Inversion (POLARIS), which reformulates inversion from an error-compensation problem into an error-origin problem. Rather than optimizing embeddings or latent codes to offset accumulated drift, POLARIS treats the guidance scale ω as a step-wise variable and derives a mathematically grounded formula to minimize inversion error at each step. Remarkably, POLARIS improves inversion latent quality with just one line of code. With negligible performance overhead, it substantially mitigates noise approximation errors and consistently improves the accuracy of downstream tasks.
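The abstract does not state the POLARIS formula, so the following is only a generic sketch of how a per-step guidance scale can be obtained by least squares. It assumes the usual classifier-free guidance form, in which the guided noise is the unconditional prediction plus ω times the difference between the conditional and unconditional predictions, and it assumes some reference noise vector `eps_target` is available. `eps_target` and the function names are hypothetical, and what the paper actually uses as the reference is not given here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def least_squares_guidance_scale(eps_uncond, eps_cond, eps_target):
    """Return the omega minimizing
    ||eps_uncond + omega * (eps_cond - eps_uncond) - eps_target||^2."""
    direction = [c - u for c, u in zip(eps_cond, eps_uncond)]
    residual = [t - u for t, u in zip(eps_target, eps_uncond)]
    denom = dot(direction, direction)
    if denom == 0.0:
        return 0.0  # both predictions agree, so omega has no effect
    return dot(residual, direction) / denom

# Toy 2D example with made-up vectors.
eps_uncond = [0.0, 0.0]
eps_cond = [1.0, 0.0]
eps_target = [3.0, 4.0]
omega = least_squares_guidance_scale(eps_uncond, eps_cond, eps_target)

# The remaining error is orthogonal to the guidance direction, which is the
# defining property of an orthogonal projection.
guided = [u + omega * (c - u) for u, c in zip(eps_uncond, eps_cond)]
error = [t - g for t, g in zip(eps_target, guided)]
print(omega, error)
```

Under these assumptions the scale is a single ratio of inner products per step, which is consistent with the abstract's claim that the change is small, though the exact expression in the paper may differ.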

[122] Pore-scale Image Patch Dataset and A Comparative Evaluation of Pore-scale Facial Features

Dong Li,HuaLiang Lin,JiaYu Li

Main category: cs.CV

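TL;DR: This paper presents PorePatch, a high-quality pore-scale image patch dataset, and a Data-Model Co-Evolution (DMCE) framework for improving local descriptor matching in weakly textured facial regions. Experiments show that SOTA deep learning models excel at the matching task but have no clear advantage in 3D reconstruction, indicating that the field remains challenging.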

Details Motivation: The weak texture of facial skin regions poses significant challenges for local descriptor matching in applications such as facial motion analysis and 3D face reconstruction. The lack of pore-scale image patch datasets has limited further development of deep learning-based descriptors in this domain. Method: Proposes the PorePatch dataset and a Data-Model Co-Evolution (DMCE) framework that generates a progressively refined, high-quality dataset from high-resolution facial images, then trains existing SOTA models on the dataset and runs extensive experiments. Result: The SOTA model reaches an FPR95 of 1.91% on the matching task, far better than the 22.41% of PSIFT. In the 3D reconstruction task, however, its performance is not significantly better than that of traditional descriptors. Conclusion: Deep learning descriptors still have limitations in weakly textured facial regions, and much work remains to be done in this field. Abstract: The weak-texture nature of facial skin regions presents significant challenges for local descriptor matching in applications such as facial motion analysis and 3D face reconstruction. Although deep learning-based descriptors have demonstrated superior performance to traditional hand-crafted descriptors in many applications, the scarcity of pore-scale image patch datasets has hindered their further development in the facial domain. In this paper, we propose the PorePatch dataset, a high-quality pore-scale image patch dataset, and establish a rational evaluation benchmark. We introduce a Data-Model Co-Evolution (DMCE) framework to generate a progressively refined, high-quality dataset from high-resolution facial images. We then train existing SOTA models on our dataset and conduct extensive experiments. Our results show that the SOTA model achieves an FPR95 value of 1.91% on the matching task, outperforming PSIFT (22.41%) by 20.5 percentage points. However, its advantage is diminished in the 3D reconstruction task, where its overall performance is not significantly better than that of traditional descriptors. This indicates that deep learning descriptors still have limitations in addressing the challenges of facial weak-texture regions, and much work remains to be done in this field.
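FPR95 is commonly defined as the false positive rate at the distance threshold that accepts 95% of true matching pairs. The sketch below assumes that definition and assumes descriptor distances where smaller means more similar. The function name and the toy distances are hypothetical and are not taken from the PorePatch benchmark.

```python
def fpr_at_95_recall(match_dists, nonmatch_dists):
    """False positive rate at the threshold that accepts 95% of true matches."""
    ordered = sorted(match_dists)
    n = len(ordered)
    # Smallest index such that at least 95% of matches fall at or below it
    # (integer ceiling of 0.95 * n, minus one for zero-based indexing).
    idx = (95 * n + 99) // 100 - 1
    threshold = ordered[idx]
    false_positives = sum(1 for d in nonmatch_dists if d <= threshold)
    return false_positives / len(nonmatch_dists)

# Toy distances: 20 matching pairs and 10 non-matching pairs.
match_dists = list(range(1, 21))
nonmatch_dists = [5, 18, 19, 25, 30, 40, 50, 60, 70, 80]
fpr95 = fpr_at_95_recall(match_dists, nonmatch_dists)
print(fpr95)
```

Since FPR95 is itself a percentage, the gap between 22.41% and 1.91% is a difference of 20.5 percentage points, not a relative improvement of 20.5%.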

[123] EZ-SP: Fast and Lightweight Superpoint-Based 3D Segmentation

Louis Geist,Loic Landrieu,Damien Robert

Main category: cs.CV

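TL;DR: EZ-SP is a learned, fully GPU-based superpoint partitioning algorithm that is 13 times faster than prior methods. It is efficient, lightweight, and needs no handcrafted features. Combined with a lightweight classifier, it runs in under 2 MB of VRAM and is suited to real-time semantic segmentation of large-scale scenes.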

Details Motivation: Conventional superpoint segmentation pipelines are bottlenecked by a CPU-bound partition step, which makes them too slow for real-time use and large-scale scenes. Method: Proposes a learnable, fully GPU-based superpoint partitioning algorithm trained with a differentiable surrogate loss. It does not rely on handcrafted features and is combined with a lightweight classifier to form the full pipeline. Result: On S3DIS, KITTI-360, and DALES, the method matches the accuracy of point-based SOTA models with 72 times faster inference and 120 times fewer parameters, and the whole pipeline uses less than 2 MB of VRAM. Conclusion: EZ-SP delivers efficient, fast, and lightweight 3D semantic segmentation, removes the performance bottleneck of conventional superpoint methods, and supports real-time inference on multi-million-point scenes. Abstract: Superpoint-based pipelines provide an efficient alternative to point- or voxel-based 3D semantic segmentation, but are often bottlenecked by their CPU-bound partition step. We propose a learnable, fully GPU-based partitioning algorithm that generates geometrically and semantically coherent superpoints 13× faster than prior methods. Our module is compact (under 60k parameters), trains in under 20 minutes with a differentiable surrogate loss, and requires no handcrafted features. Combined with a lightweight superpoint classifier, the full pipeline fits in <2 MB of VRAM, scales to multi-million-point scenes, and supports real-time inference. With 72× faster inference and 120× fewer parameters, EZ-SP matches the accuracy of point-based SOTA models across three domains: indoor scans (S3DIS), autonomous driving (KITTI-360), and aerial LiDAR (DALES). Code and pretrained models are accessible at github.com/drprojects/superpoint_transformer.

[124] WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing

Kaihang Pan,Weile Chen,Haiyi Qiu,Qifan Yu,Wendong Bu,Zehan Wang,Yun Zhu,Juncheng Li,Siliang Tang

Main category: cs.CV

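TL;DR: This paper presents WiseEdit, a knowledge-intensive benchmark for cognition- and creativity-informed image editing. By decomposing the editing process into awareness, interpretation, and imagination stages, it systematically evaluates the limitations of current state-of-the-art models in knowledge-driven reasoning and creation.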

Details Motivation: Existing image editing benchmarks cannot comprehensively evaluate the cognitive and creative abilities of models and lack coverage of deep tasks and multi-dimensional knowledge, so a more comprehensive evaluation suite is needed. Method: Proposes the WiseEdit benchmark, which draws an analogy between image editing and human cognitive creation and decomposes editing into three cascaded steps: Awareness, Interpretation, and Imagination. It incorporates three types of knowledge (Declarative, Procedural, and Metacognitive) and comprises 1,220 test cases. Result: WiseEdit reveals clear shortcomings of current state-of-the-art image editing models in knowledge-based cognitive reasoning and creative composition, especially on complex tasks. Conclusion: WiseEdit provides a more comprehensive and in-depth framework for evaluating intelligent image editing models and highlights directions for improving their cognitive and creative abilities. Abstract: Recent image editing models boast next-level intelligent capabilities, facilitating cognition- and creativity-informed image editing. Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring deep task depth and broad knowledge breadth. Drawing an analogy to human cognitive creation, WiseEdit decomposes image editing into three cascaded steps, i.e., Awareness, Interpretation, and Imagination, each corresponding to a task that poses a challenge for models to complete at the specific step. It also encompasses complex tasks, where none of the three steps can be finished easily. Furthermore, WiseEdit incorporates three fundamental types of knowledge: Declarative, Procedural, and Metacognitive knowledge. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models in knowledge-based cognitive reasoning and creative composition capabilities. The benchmark, evaluation code, and the generated images of each model will be made publicly available soon. Project Page: https://qnancy.github.io/wiseedit_project_page/.

[125] Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction

Jiazhen Liu,Mingkuan Feng,Long Chen

Main category: cs.CV

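TL;DR: This paper proposes STAMP, a new paradigm that uses all-mask prediction to resolve the trilemma among dialogue ability, segmentation performance, and inference speed in multimodal large language models.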

Details Motivation: Existing methods cannot preserve dialogue ability while also achieving high segmentation performance and fast inference, so they face a fundamental trade-off. Method: Adopts an all-mask prediction paradigm that treats segmentation as a parallel "fill-in-the-blank" task over image patches and predicts the complete segmentation mask in a single forward pass after the text has been generated. Result: STAMP significantly outperforms existing state-of-the-art methods on multiple segmentation benchmarks while keeping strong dialogue ability and very fast inference. Conclusion: STAMP unifies dialogue ability, high segmentation performance, and fast inference, resolving the core trilemma in multimodal models. Abstract: Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs and prohibitively slow inference with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel "fill-in-the-blank" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.

[126] Low-Bitrate Video Compression through Semantic-Conditioned Diffusion

Lingdong Wang,Guan-Ming Su,Divya Kothandaraman,Tsung-Wei Huang,Mohammad Hajiesmaili,Ramesh K. Sitaraman

Main category: cs.CV

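TL;DR: Proposes DiSCo, a semantic video compression framework that, at ultra-low bitrates, transmits semantic information and reconstructs the video with generative priors, clearly outperforming traditional methods.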

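Details Motivation: Traditional video codecs pursue pixel fidelity and therefore produce severe artifacts at ultra-low bitrates, which is misaligned with human perception. A compression method that better matches perceptual needs is required. Method: Decomposes the source video into three compact modalities (a textual description, a spatiotemporally degraded video, and optional sketches or poses) and uses a conditional video diffusion model to reconstruct high-quality, temporally coherent video from them. Temporal forward filling, token interleaving, and modality-specific codecs improve generation and compression efficiency. Result: At low bitrates, the method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics. Conclusion: By combining semantic compression with generative priors, DiSCo addresses video compression at ultra-low bitrates and achieves a better balance between visual quality and compression efficiency. Abstract: Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.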

[127] SplatFont3D: Structure-Aware Text-to-3D Artistic Font Generation with Part-Level Style Control

Ji Gan,Lingxu Chen,Jiaxu Leng,Xinbo Gao

Main category: cs.CV

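TL;DR: This paper proposes SplatFont3D, a structure-aware text-to-3D artistic font generation framework that uses 3D Gaussian splatting to create 3D artistic fonts from text prompts with fine-grained part-level style control.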

Details Motivation: Existing research on artistic font generation focuses mainly on 2D flat design, leaving personalized 3D artistic font generation underexplored, even though 3D-AFG has broad applications in immersive 3D environments and can in turn benefit 2D-AFG. Method: Proposes the SplatFont3D framework. Its Glyph2Cloud module progressively enhances 2D glyphs and produces the corresponding 3D point clouds for Gaussian initialization, and the 3D Gaussians are then optimized through interaction with a pretrained 2D diffusion model. A dynamic component assignment strategy provides part-level control. Result: Experiments show that SplatFont3D outperforms existing 3D models for artistic font generation in style-text consistency, visual quality, and rendering efficiency, and that it offers more explicit and effective part-level style control than NeRF with faster rendering. Conclusion: SplatFont3D provides an efficient and controllable new approach to 3D artistic font generation, achieving high-quality personalized generation while respecting precise semantic and structural constraints. Abstract: Artistic font generation (AFG) can assist human designers in creating innovative artistic fonts. However, most previous studies primarily focus on 2D artistic fonts in flat design, leaving personalized 3D-AFG largely underexplored. 3D-AFG not only enables applications in immersive 3D environments such as video games and animations, but also may enhance 2D-AFG by rendering 2D fonts of novel views. Moreover, unlike general 3D objects, 3D fonts exhibit precise semantics with strong structural constraints and also demand fine-grained part-level style control. To address these challenges, we propose SplatFont3D, a novel structure-aware text-to-3D AFG framework with 3D Gaussian splatting, which enables the creation of 3D artistic fonts from diverse style text prompts with precise part-level style control. Specifically, we first introduce a Glyph2Cloud module, which progressively enhances both the shapes and styles of 2D glyphs (or components) and produces their corresponding 3D point clouds for Gaussian initialization. The initialized 3D Gaussians are further optimized through interaction with a pretrained 2D diffusion model using score distillation sampling. To enable part-level control, we present a dynamic component assignment strategy that exploits the geometric priors of 3D Gaussians to partition components, while alleviating drift-induced entanglement during 3D Gaussian optimization. Our SplatFont3D provides more explicit and effective part-level style control than NeRF, while attaining higher rendering efficiency. Experiments show that our SplatFont3D outperforms existing 3D models for 3D-AFG in style-text consistency, visual quality, and rendering efficiency.

[128] PhysGen: Physically Grounded 3D Shape Generation for Industrial Design

Yingxuan You,Chen Zhao,Hantao Zhang,Mingda Xu,Pascal Fua

Main category: cs.CV

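TL;DR: Proposes a physics-based 3D shape generation framework that combines a flow matching model with a physical guidance mechanism to improve the realism of 3D shapes in industrial design.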

Details Motivation: Existing 3D generative models lack an understanding of physical properties such as aerodynamics, so they struggle to generate shapes that are realistic from an engineering standpoint. This work introduces physical knowledge to improve generation quality. Method: Proposes a flow matching model with explicit physical guidance that alternates between a velocity-based update and a physics-based refinement, with a physics-aware regularization term added to the velocity-based update. It also builds SP-VAE, a model that jointly encodes shape and physics information. Result: Experiments on three benchmarks show that the method generates 3D shapes that are more physically plausible and visually realistic than those of existing methods. Conclusion: The proposed physics-guided unified generation framework improves the quality of 3D shapes for industrial design, balancing visual plausibility and physical realism. Abstract: Existing generative models for 3D shapes can synthesize high-fidelity and visually plausible shapes. For certain classes of shapes that have undergone an engineering design process, the realism of the shape is tightly coupled with the underlying physical properties, e.g., aerodynamic efficiency for automobiles. Since existing methods lack knowledge of such physics, they are unable to use this knowledge to enhance the realism of shape generation. Motivated by this, we propose a unified physics-based 3D shape generation pipeline, with a focus on industrial design applications. Specifically, we introduce a new flow matching model with explicit physical guidance, consisting of an alternating update process. We iteratively perform a velocity-based update and a physics-based refinement, progressively adjusting the latent code to align with the desired 3D shapes and physical properties. We further strengthen physical validity by incorporating a physics-aware regularization term into the velocity-based update step. To support such physics-guided updates, we build a shape-and-physics variational autoencoder (SP-VAE) that jointly encodes shape and physics information into a unified latent space. The experiments on three benchmarks show that this synergistic formulation improves shape realism beyond mere visual plausibility.

[129] Recovering Origin Destination Flows from Bus CCTV: Early Results from Nairobi and Kigali

Nthenya Kyatha,Jay Taneja

Main category: cs.CV

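TL;DR: Presents a baseline method for measuring bus passenger flows from existing CCTV that combines object detection, tracking, re-identification, and timestamp OCR. It performs well under ideal conditions but still faces challenges in complex real-world scenes.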

Details Motivation: Public transport in sub-Saharan Africa is often overcrowded and lacks reliable automated passenger counting, and existing technology copes poorly with the complex operating environment. Method: Combines YOLOv12 detection, BotSORT tracking, OSNet re-identification embeddings, OCR-based timestamping, and telematics-based stop classification to recover bus origin-destination passenger flows from onboard CCTV video. Result: On data from Nairobi and Kigali, recall is about 95%, precision about 91%, and F1 about 93% under ideal conditions, and the OD matrices are close to manual tallies. Under real-world stressors such as crowding, monochrome footage, and posture variation, performance drops sharply (for example, about 40% undercounting at peak hours). Conclusion: The method works under ideal conditions but degrades markedly in real deployments, revealing deployment-specific failure modes and the need for more robust, deployment-focused re-identification methods. Abstract: Public transport in sub-Saharan Africa (SSA) often operates in overcrowded conditions where existing automated systems fail to capture reliable passenger flow data. Leveraging onboard CCTV already deployed for security, we present a baseline pipeline that combines YOLOv12 detection, BotSORT tracking, OSNet embeddings, OCR-based timestamping, and telematics-based stop classification to recover bus origin-destination (OD) flows. On annotated CCTV segments from Nairobi and Kigali buses, the system attains high counting accuracy under low-density, well-lit conditions (recall ≈95%, precision ≈91%, F1 ≈93%). It produces OD matrices that closely match manual tallies. Under realistic stressors such as overcrowding, color-to-monochrome shifts, posture variation, and non-standard door use, performance degrades sharply (e.g., ~40% undercount in peak-hour boarding and a ~17 percentage-point drop in recall for monochrome segments), revealing deployment-specific failure modes and motivating more robust, deployment-focused Re-ID methods for SSA transit.
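Two pieces of arithmetic sit behind the numbers in this entry: an OD matrix is a count of passengers per (boarding stop, alighting stop) pair, and F1 is the harmonic mean of precision and recall. The sketch below shows both, assuming that re-identification has already linked each boarding to an alighting. The trip list, the stop names, and the function names are hypothetical.

```python
from collections import Counter

def build_od_matrix(trips):
    """Count passengers for each (origin_stop, destination_stop) pair."""
    return Counter((origin, destination) for origin, destination in trips)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical matched trips: one tuple per re-identified passenger.
trips = [("A", "C"), ("A", "C"), ("B", "C"), ("A", "B")]
od_matrix = build_od_matrix(trips)

# Precision and recall as reported for low-density, well-lit segments.
f1 = f1_score(0.91, 0.95)
print(od_matrix[("A", "C")], round(f1, 2))
```

The computed F1 of roughly 0.93 agrees with the value quoted in the abstract for the reported precision and recall.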

[130] What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards

Minh-Quan Le,Yuanzhi Zhu,Vicky Kalogeiton,Dimitris Samaras

Main category: cs.CV

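TL;DR: This paper proposes NewtonRewards, a physics-grounded post-training framework for video generation based on verifiable rewards. It uses optical flow and appearance features as proxies for velocity and mass to enforce Newtonian dynamics explicitly, improving the physical plausibility and temporal coherence of generated videos.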

Details Motivation: Existing video diffusion models are visually convincing but violate basic physical laws. A scalable method that needs no feedback from humans or vision-language models is required to improve the physical plausibility of generated videos. Method: Proposes the NewtonRewards framework, which uses frozen utility models to extract optical flow (a proxy for velocity) and high-level appearance features (a proxy for mass), and designs two complementary rewards: a Newtonian kinematic constraint reward (constant acceleration) and a mass conservation reward. Result: On five Newtonian motion primitives (free fall, horizontal and parabolic throw, and sliding down and up a ramp) from the newly built large-scale benchmark NewtonBench-60K, NewtonRewards outperforms prior post-training methods on both visual and physics metrics, improving physical plausibility, motion smoothness, and temporal coherence. It also keeps strong performance under out-of-distribution shifts in height, speed, and friction. Conclusion: Physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation and can improve physical realism without relying on human feedback. Abstract: Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws (objects float, accelerations drift, and collisions behave inconsistently), revealing a persistent gap between visual realism and physical realism. We propose NewtonRewards, the first physics-grounded post-training framework for video generation based on verifiable rewards. Instead of relying on human or VLM feedback, NewtonRewards extracts measurable proxies from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate NewtonRewards on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, NewtonBench-60K. Across all primitives in visual and physics metrics, NewtonRewards consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.
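A constant-acceleration constraint can be checked from a velocity sequence alone: if acceleration is constant, the second finite difference of velocity is zero. The sketch below uses that observation as a simple penalty. It is an illustration under assumptions, not the reward defined in the paper: the function name and the choice of a mean squared penalty are hypothetical, and the velocities stand in for per-frame optical-flow magnitudes of a tracked object.

```python
def constant_acceleration_penalty(velocities):
    """Mean squared second difference of a velocity sequence.

    Returns 0.0 when velocity changes by the same amount every frame,
    i.e. when the motion has constant acceleration.
    """
    accelerations = [b - a for a, b in zip(velocities, velocities[1:])]
    jerks = [b - a for a, b in zip(accelerations, accelerations[1:])]
    return sum(j * j for j in jerks) / len(jerks)

# Free-fall-like motion: velocity grows by the same amount each frame.
consistent = [0.0, 2.0, 4.0, 6.0, 8.0]
# Drifting acceleration: each frame's velocity gain keeps increasing.
drifting = [0.0, 2.0, 5.0, 9.0, 14.0]

print(constant_acceleration_penalty(consistent))
print(constant_acceleration_penalty(drifting))
```

A reward could then be taken as the negative of this penalty. The abstract also notes that a second term (mass conservation) is needed to rule out trivial solutions; a static video, for instance, would score perfectly on this penalty alone.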

[131] Recognizing Pneumonia in Real-World Chest X-rays with a Classifier Trained with Images Synthetically Generated by Nano Banana

Jiachuan Peng,Kyle Lam,Jianing Qiu

Main category: cs.CV

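TL;DR: The study trains a classifier on synthetic chest X-ray images generated by Nano Banana, Google's latest AI image model, and achieves strong pneumonia recognition performance on real-world data, supporting the potential of synthetic data in medical AI development.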

Details Motivation: To explore whether synthetic data can be used to train medical imaging AI models and reduce reliance on real patient data. Method: Generates synthetic chest X-ray images with the Nano Banana model, trains a classifier on them, and performs external validation on two public real-world datasets (RSNA 2018 and Chest X-Ray). Result: The classifier reaches an AUROC of 0.923 and an AUPR of 0.900 on the RSNA dataset, and an AUROC of 0.824 and an AUPR of 0.913 on the Chest X-Ray dataset, showing good generalization to real data. Conclusion: Synthetic data is promising for medical AI development, but issues such as prompt design and data alignment remain, and validation, regulation, and ethical review must be strengthened before clinical translation. Abstract: We trained a classifier with synthetic chest X-ray (CXR) images generated by Nano Banana, the latest AI model for image generation and editing, released by Google. When directly applied to real-world CXRs having only been trained with synthetic data, the classifier achieved an AUROC of 0.923 (95% CI: 0.919 - 0.927), and an AUPR of 0.900 (95% CI: 0.894 - 0.907) in recognizing pneumonia in the 2018 RSNA Pneumonia Detection dataset (14,863 CXRs), and an AUROC of 0.824 (95% CI: 0.810 - 0.836), and an AUPR of 0.913 (95% CI: 0.904 - 0.922) in the Chest X-Ray dataset (5,856 CXRs). These external validation results on real-world data demonstrate the feasibility of this approach and suggest potential for synthetic data in medical AI development. Nonetheless, several limitations remain at present, including challenges in prompt design for controlling the diversity of synthetic CXR data and the requirement for post-processing to ensure alignment with real-world data. However, the growing sophistication and accessibility of medical intelligence will necessitate substantial validation, regulatory approval, and ethical oversight prior to clinical translation.
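AUROC has a direct interpretation that helps when reading the figures above: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it by pairwise comparison. The function name and the four toy scores are hypothetical, and a production evaluation would normally use a library routine instead of this quadratic loop.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Toy example: two negatives and two positives with classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auroc(labels, scores)
print(value)
```

Here three of the four positive-negative pairs are ordered correctly, so the AUROC is 0.75.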

[132] FR-TTS: Test-Time Scaling for NTP-based Image Generation with Effective Filling-based Reward Signal

Hang Xu,Linjiang Huang,Feng Zhao

Main category: cs.CV

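TL;DR: This paper proposes the Filling-Based Reward (FR) to address a problem that arises when test-time scaling (TTS) is applied to next-token prediction (NTP) image generation: rewards of intermediate samples correlate poorly with the final result. Building on FR, the authors propose FR-TTS, which searches efficiently for reasonable ways to fill the sequence and adds a dynamically weighted diversity reward, making the quality assessment of intermediate samples more accurate. Experiments show that the method outperforms existing approaches across multiple benchmarks and reward models.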

Details Motivation: The reward of an image decoded from an intermediate token sequence correlates poorly with the reward of the final complete image, so conventional TTS methods are hard to apply directly to the NTP paradigm. A more reliable way to evaluate intermediate states is needed to guide generation. Method: Proposes the Filling-Based Reward (FR), which finds a reasonable "filling" for an intermediate token sequence to estimate its future trajectory and compute an approximate final reward. On this basis, the FR-TTS strategy combines an efficient search for filling schemes with a dynamically weighted diversity reward to evaluate and filter intermediate samples. Result: FR markedly increases the reward correlation between intermediate and final samples, and several intrinsic signals (such as token confidence) also confirm its reliability. FR-TTS outperforms existing methods across multiple benchmarks and reward models. Conclusion: By introducing filling-based reward estimation, FR-TTS addresses the unreliability of intermediate representations that TTS faces in the NTP setting, offering a new direction for test-time scaling in NTP-based image generation. Abstract: Test-time scaling (TTS) has become a prevalent technique in image generation, significantly boosting output quality by expanding the number of parallel samples and filtering them using pre-trained reward models. However, applying this powerful methodology to the next-token prediction (NTP) paradigm remains challenging. The primary obstacle is the low correlation between the reward of an image decoded from an intermediate token sequence and the reward of the fully generated image. Consequently, these incomplete intermediate representations prove to be poor indicators for guiding the pruning direction, a limitation that stems from their inherent incompleteness in scale or semantic content. To effectively address this critical issue, we introduce the Filling-Based Reward (FR). This novel design estimates the approximate future trajectory of an intermediate sample by finding and applying a reasonable filling scheme to complete the sequence. Both the correlation coefficient between rewards of intermediate samples and final samples, as well as multiple intrinsic signals like token confidence, indicate that the FR provides an excellent and reliable metric for accurately evaluating the quality of intermediate samples. Building upon this foundation, we propose FR-TTS, a sophisticated scaling strategy. FR-TTS efficiently searches for good filling schemes and incorporates a diversity reward with a dynamic weighting schedule to achieve a balanced and comprehensive evaluation of intermediate samples. We experimentally validate the superiority of FR-TTS over multiple established benchmarks and various reward models. Code is available at https://github.com/xuhang07/FR-TTS.
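The general shape of filling-based pruning can be sketched in a few lines: complete each partial token sequence with some filling scheme, score the completed sequence, and keep the best-scoring prefixes. Everything in the sketch below is a toy stand-in and an assumption: tokens are integers, the filling scheme repeats the last token, and the reward is a plain sum. The paper's actual filling search, reward models, and diversity term are not reproduced here.

```python
def filling_based_reward(prefix, total_len, fill_fn, reward_fn):
    """Score a partial sequence by completing it first, then scoring the result."""
    filled = list(prefix) + fill_fn(prefix, total_len - len(prefix))
    return reward_fn(filled)

def prune_prefixes(prefixes, keep, total_len, fill_fn, reward_fn):
    """Keep the `keep` prefixes whose filled completions score highest."""
    ranked = sorted(
        prefixes,
        key=lambda p: filling_based_reward(p, total_len, fill_fn, reward_fn),
        reverse=True,
    )
    return ranked[:keep]

def repeat_last_token(prefix, missing):
    """Toy filling scheme (assumption): repeat the last generated token."""
    return [prefix[-1]] * missing

def toy_reward(tokens):
    """Toy reward (assumption): the sum of all token values."""
    return sum(tokens)

candidates = [[1, 2], [3, 0], [2, 2]]
survivors = prune_prefixes(candidates, keep=2, total_len=4,
                           fill_fn=repeat_last_token, reward_fn=toy_reward)
print(survivors)
```

Scoring the raw prefixes with the same toy reward would rank `[3, 0]` above `[1, 2]`, while the filled completions rank it last. This mirrors the abstract's point that rewards computed on incomplete sequences can mislead pruning.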

[133] RecruitView: A Multimodal Dataset for Predicting Personality and Interview Performance for Human Resources Applications

Amit Kumar Gupta,Farhan Sheth,Hammad Shaikh,Dheeraj Kumar,Angkul Puniya,Deepak Panwar,Sandeep Chaurasia,Priya Mathur

Main category: cs.CV

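TL;DR: This paper presents the RecruitView dataset and CRMF, a cross-modal regression framework based on geometric deep learning, for automatically assessing personality and soft skills from multimodal behavioral data.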

Details Motivation: Existing datasets are limited and existing methods fail to capture the geometric structure of human traits, which makes automated assessment of personality and soft skills difficult. Method: Proposes the CRMF framework, which applies geometric deep learning on hyperbolic, spherical, and Euclidean manifolds and fuses multimodal behavioral representations through geometry-specific expert networks and an adaptive routing mechanism. Result: CRMF improves Spearman correlation by up to 11.4% and concordance index by up to 6.0%, while training 40-50% fewer parameters. Conclusion: CRMF models the geometric structure of personality and behavior more efficiently and performs well in soft skill assessment, and the RecruitView dataset is publicly available. Abstract: Automated personality and soft skill assessment from multimodal behavioral data remains challenging due to limited datasets and methods that fail to capture geometric structure inherent in human traits. We introduce RecruitView, a dataset of 2,011 naturalistic video interview clips from 300+ participants with 27,000 pairwise comparative judgments across 12 dimensions: Big Five personality traits, overall personality score, and six interview performance metrics. To leverage this data, we propose Cross-Modal Regression with Manifold Fusion (CRMF), a geometric deep learning framework that explicitly models behavioral representations across hyperbolic, spherical, and Euclidean manifolds. CRMF employs geometry-specific expert networks to capture hierarchical trait structures, directional behavioral patterns, and continuous performance variations simultaneously. An adaptive routing mechanism dynamically weights expert contributions based on input characteristics. Through principled tangent space fusion, CRMF achieves superior performance while training 40-50% fewer parameters than large multimodal models. Extensive experiments demonstrate that CRMF substantially outperforms the selected baselines, achieving up to 11.4% improvement in Spearman correlation and 6.0% in concordance index. Our RecruitView dataset is publicly available at https://huggingface.co/datasets/AI4A-lab/RecruitView
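The concordance index reported here measures how often a model orders a pair of items the same way as the ground truth, which fits a dataset built from pairwise comparative judgments. The sketch below uses one common definition (ties in the prediction count as half, pairs tied in the ground truth are skipped). The function name and the toy scores are hypothetical, and the paper may compute the metric differently.

```python
from itertools import combinations

def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(y_true)), 2):
        if y_true[i] == y_true[j]:
            continue  # pairs tied in the ground truth carry no ordering
        comparable += 1
        true_diff = y_true[i] - y_true[j]
        pred_diff = y_pred[i] - y_pred[j]
        if pred_diff == 0:
            concordant += 0.5  # a predicted tie counts as half
        elif (true_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    return concordant / comparable

# Toy trait scores for four clips: the model swaps the middle two.
y_true = [1, 2, 3, 4]
y_pred = [1, 3, 2, 4]
c_index = concordance_index(y_true, y_pred)
print(c_index)
```

With one of the six pairs ordered incorrectly, the index is 5/6; a value of 0.5 corresponds to random ordering and 1.0 to perfect agreement.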

[134] CausalAffect: Causal Discovery for Facial Affective Understanding

Guanyu Hu,Tangzheng Lian,Dimitrios Kollias,Oya Celiktutan,Xinyu Yang

Main category: cs.CV

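TL;DR: This paper proposes CausalAffect, the first framework for causal graph discovery in facial affect analysis. Using a two-level causal hierarchy and a counterfactual intervention mechanism, it infers causal relations between action units (AUs) and expressions from data, without jointly annotated datasets or handcrafted causal priors, and reaches state-of-the-art performance on multiple benchmarks.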

Details Motivation: Existing methods rarely infer psychologically plausible causal relations between AUs and expressions directly from data, and they do not model the latent dependency structure that drives facial muscle activations and their expressive outcomes. Method: Proposes the CausalAffect framework, which uses a two-level, polarity- and direction-aware causal hierarchy that combines population-level regularities with sample-adaptive structures, and adds a feature-level counterfactual intervention mechanism to suppress spurious correlations and identify causal effects accurately. Result: Experiments on six benchmarks show state-of-the-art performance in both AU detection and expression recognition. The discovered causal structures agree with psychological theory and reveal new inhibitory and previously uncharacterized dependencies. Conclusion: CausalAffect establishes a principled connection between causal discovery and interpretable facial behavior, offering a more interpretable modeling route for affective computing. Abstract: Understanding human affect from facial behavior requires not only accurate recognition but also structured reasoning over the latent dependencies that drive muscle activations and their expressive outcomes. Although Action Units (AUs) have long served as the foundation of affective computing, existing approaches rarely address how to infer psychologically plausible causal relations between AUs and expressions directly from data. We propose CausalAffect, the first framework for causal graph discovery in facial affect analysis. CausalAffect models AU-AU and AU-Expression dependencies through a two-level polarity and direction aware causal hierarchy that integrates population-level regularities with sample-adaptive structures. A feature-level counterfactual intervention mechanism further enforces true causal effects while suppressing spurious correlations. Crucially, our approach requires neither jointly annotated datasets nor handcrafted causal priors, yet it recovers causal structures consistent with established psychological theories while revealing novel inhibitory and previously uncharacterized dependencies. Extensive experiments across six benchmarks demonstrate that CausalAffect advances the state of the art in both AU detection and expression recognition, establishing a principled connection between causal discovery and interpretable facial behavior. All trained models and source code will be released upon acceptance.

[135] RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards

Junyan Ye,Leiqi Zhu,Yuncheng Guo,Dongzhi Jiang,Zilong Huang,Yifan Zhang,Zhiyuan Yan,Haohuan Fu,Conghui He,Weijia Li

Main category: cs.CV

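TL;DR: This paper proposes RealGen, a photorealistic text-to-image framework that optimizes the generation pipeline with a "Detector Reward" mechanism and the GRPO algorithm, and proposes RealBench as an automated evaluation benchmark. Experiments show that it outperforms existing models in realism, detail, and aesthetics.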

Details Motivation: Existing advanced image generation models (such as GPT-Image-1 and Qwen-Image) still fall short in photorealistic generation; they often produce "fake" images with AI artifacts and miss the original goal of being "indistinguishable from reality". Method: RealGen combines an LLM component for prompt optimization with a diffusion model for realistic image generation, introduces a "Detector Reward" mechanism based on semantic-level and feature-level synthetic image detectors, and uses the GRPO algorithm to optimize the whole generation pipeline. It also proposes RealBench, which uses Detector-Scoring and Arena-Scoring for automated evaluation. Result: Experiments show that RealGen significantly outperforms general models such as GPT-Image-1 and Qwen-Image, as well as specialized photorealistic models such as FLUX-Krea, in realism, detail, and aesthetic quality, and that RealBench provides human-free evaluation that agrees more closely with human perception. Conclusion: Through the Detector Reward mechanism and end-to-end optimization, RealGen effectively improves the realism of generated images, and RealBench offers a reliable, automated way to assess photorealism, moving text-to-image generation toward the goal of being "indistinguishable from reality". Abstract: With the continuous advancement of image generation technology, advanced models such as GPT-Image-1 and Qwen-Image have achieved remarkable text-to-image consistency and world knowledge. However, these models still fall short in photorealistic image generation. Even on simple T2I tasks, they tend to produce "fake" images with distinct AI artifacts, often characterized by "overly smooth skin" and "oily facial sheens". To recapture the original goal of "indistinguishable-from-reality" generation, we propose RealGen, a photorealistic text-to-image framework. RealGen integrates an LLM component for prompt optimization and a diffusion model for realistic image generation. Inspired by adversarial generation, RealGen introduces a "Detector Reward" mechanism, which quantifies artifacts and assesses realism using both semantic-level and feature-level synthetic image detectors. We leverage this reward signal with the GRPO algorithm to optimize the entire generation pipeline, significantly enhancing image realism and detail. Furthermore, we propose RealBench, an automated evaluation benchmark employing Detector-Scoring and Arena-Scoring. It enables human-free photorealism assessment, yielding results that are more accurate and aligned with real user experience. Experiments demonstrate that RealGen significantly outperforms general models like GPT-Image-1 and Qwen-Image, as well as specialized photorealistic models like FLUX-Krea, in terms of realism, detail, and aesthetics.
The code is available at https://github.com/yejy53/RealGen.
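
The abstract pairs a detector-based reward with GRPO. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO is generally known for: rewards for several images sampled from the same prompt are standardized within the group. The `detector_reward` function, its equal weighting, and the example probabilities are hypothetical and are not taken from the RealGen paper or its code.

```python
import statistics

def detector_reward(p_fake_semantic, p_fake_feature, w=0.5):
    """Hypothetical reward: higher when both detectors judge the image as real.

    p_fake_* are detector probabilities that the image is synthetic; the
    weighting w is an assumption made for this sketch.
    """
    return w * (1.0 - p_fake_semantic) + (1.0 - w) * (1.0 - p_fake_feature)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical samples for one prompt: (semantic, feature) fake probabilities.
group = [(0.9, 0.8), (0.2, 0.1), (0.5, 0.5), (0.7, 0.3)]
rewards = [detector_reward(s, f) for s, f in group]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Samples that the detectors find more realistic than the group average receive a positive advantage, and the rest receive a negative one, so no separate value network is needed for the baseline.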

[136] Structured Context Learning for Generic Event Boundary Detection

Xin Gu,Congcong Li,Xinyao Wang,Dexiang Hong,Libo Zhang,Tiejian Luo,Longyin Wen,Heng Fan

Main category: cs.CV

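TL;DR: Proposes a new method, Structured Context Learning, which uses the Structured Partition of Sequence (SPoS) to provide structured context for detecting event boundaries in videos, with linear computational complexity and a strong speed-accuracy trade-off.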

Details Motivation: Generic Event Boundary Detection (GEBD) requires effective modeling of temporal information to identify the event boundaries that humans perceive in videos; existing methods are tied to specific temporal models and struggle to balance speed and accuracy. Method: Introduces the Structured Partition of Sequence (SPoS) to partition the frame sequence and build structured context, computes group similarities to capture differences between frames, and uses a lightweight fully convolutional network to predict event boundaries from the similarity maps; a Gaussian kernel is used to preprocess the ground-truth annotations to reduce annotation ambiguity. Result: Significantly outperforms existing state-of-the-art methods on the Kinetics-GEBD, TAPOS, and shot transition detection datasets, demonstrating strong performance and good generalization. Conclusion: The proposed method is flexible, efficient, and end-to-end trainable, does not depend on a specific temporal model, and achieves leading event boundary detection performance on multiple datasets. Abstract: Generic Event Boundary Detection (GEBD) aims to identify moments in videos that humans perceive as event boundaries. This paper proposes a novel method for addressing this task, called Structured Context Learning, which introduces the Structured Partition of Sequence (SPoS) to provide a structured context for learning temporal information. Our approach is end-to-end trainable and flexible, not restricted to specific temporal models like GRU, LSTM, and Transformers. This flexibility enables our method to achieve a better speed-accuracy trade-off. Specifically, we apply SPoS to partition the input frame sequence and provide a structured context for the subsequent temporal model. Notably, SPoS's overall computational complexity is linear with respect to the video length. We next calculate group similarities to capture differences between frames, and a lightweight fully convolutional network is utilized to determine the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, we adapt the Gaussian kernel to preprocess the ground-truth event boundaries. Our proposed method has been extensively evaluated on the challenging Kinetics-GEBD, TAPOS, and shot transition detection datasets, demonstrating its superiority over existing state-of-the-art methods.
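
The abstract mentions preprocessing ground-truth event boundaries with a Gaussian kernel to soften annotation ambiguity. The following is a minimal sketch of one common way to do this, not the authors' implementation: each annotated boundary frame becomes a Gaussian bump, and overlapping bumps are combined with a maximum. The function name, the `sigma` value, and the max-combination rule are assumptions.

```python
import math

def soften_boundaries(num_frames, boundary_frames, sigma=1.0):
    """Turn hard boundary indices into soft per-frame targets in [0, 1]."""
    labels = [0.0] * num_frames
    for b in boundary_frames:
        for t in range(num_frames):
            # Gaussian centered on the annotated boundary frame b.
            g = math.exp(-((t - b) ** 2) / (2.0 * sigma ** 2))
            # Keep the strongest response when bumps overlap (an assumption).
            labels[t] = max(labels[t], g)
    return labels

# Hypothetical 10-frame clip with one annotated boundary at frame 3.
print([round(v, 3) for v in soften_boundaries(10, [3], sigma=1.0)])
```

Frames adjacent to an annotated boundary then receive partial credit instead of a hard zero, which tolerates annotators who disagree by a frame or two.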

[137] Learning What Helps: Task-Aligned Context Selection for Vision Tasks

Jingyu Guo,Emir Konuk,Fredrik Strand,Christos Matsoukas,Kevin Smith

Main category: cs.CV

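TL;DR: Proposes the Task-Aligned Context Selection (TACS) framework, which jointly trains a selector network and the task model to pick context examples that actually improve task performance rather than merely similar ones; it outperforms similarity-based retrieval on 18 datasets.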

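Details Motivation: ViTs lack the ability to select examples that would help their predictions when facing visual uncertainty, and existing similarity-based retrieval does not necessarily improve task performance. Method: Proposes the TACS framework, which uses a hybrid optimization scheme combining gradient-based supervision and reinforcement learning to jointly train the selector network with the task model, aligning example selection with the task objective. Result: Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, especially in challenging or data-limited settings. Conclusion: By aligning the selection of context examples with task rewards, TACS lets the model discover on its own which examples genuinely improve performance, strengthening the ability of discriminative models to use context. Abstract: Humans often resolve visual uncertainty by comparing an image with relevant examples, but ViTs lack the ability to identify which examples would improve their predictions. We present Task-Aligned Context Selection (TACS), a framework that learns to select paired examples which truly improve task performance rather than those that merely appear similar. TACS jointly trains a selector network with the task model through a hybrid optimization scheme combining gradient-based supervision and reinforcement learning, making retrieval part of the learning objective. By aligning selection with task rewards, TACS enables discriminative models to discover which contextual examples genuinely help. Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, particularly in challenging or data-limited settings.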

[138] CC-FMO: Camera-Conditioned Zero-Shot Single Image to 3D Scene Generation with Foundation Model Orchestration

Boshi Tang,Henry Zheng,Rui Huang,Gao Huang

Main category: cs.CV

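TL;DR: This paper proposes CC-FMO, a zero-shot, camera-conditioned method for single-image to 3D scene generation. It combines a semantics-aware vector-set representation with a detail-rich structured latent representation and uses a camera-conditioned scale-solving algorithm to improve scene consistency, outperforming existing methods at generating high-quality, spatially consistent 3D scenes.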

Details Motivation: Existing single-image 3D scene generation methods fall short in generalization and scene consistency, in particular because of inaccurate object pose estimation and inconsistent spatial layout, which limits their use in AR/VR and embodied AI. Method: Proposes CC-FMO, which uses a hybrid instance generator combining a semantics-aware vector-set representation with a structured latent representation, and designs a camera-conditioned scale-solving algorithm so that foundational pose estimation models can be used to improve the overall alignment and coherence of the scene. Result: Experiments show that CC-FMO significantly outperforms current state-of-the-art methods at generating high-fidelity, camera-aligned compositional 3D scenes, with better semantic plausibility and geometric quality. Conclusion: CC-FMO achieves training-free single-image 3D scene generation that preserves instance detail while ensuring scene-level spatial consistency, providing an effective solution for high-quality 3D scene generation. Abstract: High-quality 3D scene generation from a single image is crucial for AR/VR and embodied AI applications. Early approaches struggle to generalize due to reliance on specialized models trained on curated small datasets. While recent advancements in large-scale 3D foundation models have significantly enhanced instance-level generation, coherent scene generation remains a challenge, where performance is limited by inaccurate per-object pose estimations and spatial inconsistency. To this end, this paper introduces CC-FMO, a zero-shot, camera-conditioned pipeline for single-image to 3D scene generation that jointly conforms to the object layout in the input image and preserves instance fidelity. CC-FMO employs a hybrid instance generator that combines semantics-aware vector-set representation with detail-rich structured latent representation, yielding object geometries that are both semantically plausible and high-quality. Furthermore, CC-FMO enables the application of foundational pose estimation models in the scene generation task via a simple yet effective camera-conditioned scale-solving algorithm, to enforce scene-level coherence. Extensive experiments demonstrate that CC-FMO consistently generates high-fidelity camera-aligned compositional scenes, outperforming all state-of-the-art methods.

[139] Terrain Sensing with Smartphone Structured Light: 2D Dynamic Time Warping for Grid Pattern Matching

Tanaka Nobuaki

Main category: cs.CV

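TL;DR: Proposes a smartphone-based structured-light system that uses a two-dimensional dynamic time warping (2D-DTW) algorithm to reconstruct ground unevenness, intended for stable navigation of low-cost mobile robots on uneven terrain.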

Details Motivation: Low-speed mobile robots lose stability on terrain with small undulations that are hard to see, so a low-cost, portable way to sense ground unevenness is needed. Method: A smartphone projects a structured-light grid pattern and a single device captures the deformed grid; a topology-constrained two-dimensional dynamic time warping (2D-DTW) algorithm is proposed to achieve robust grid matching and terrain reconstruction under perspective distortion and partial occlusion. Result: Delivers an efficient grid matching method that runs on resource-constrained devices, successfully reconstructs local terrain unevenness, and verifies the effectiveness and generality of 2D-DTW for structured-light grid matching. Conclusion: The proposed 2D-DTW algorithm is suitable not only for smartphone-based terrain sensing but also as a general tool for structured grid matching in image processing, with potential for practical use and extension. Abstract: Low-cost mobile rovers often operate on uneven terrain where small bumps or tilts are difficult to perceive visually but can significantly affect locomotion stability. To address this problem, we explore a smartphone-based structured-light system that projects a grid pattern onto the ground and reconstructs local terrain unevenness from a single handheld device. The system is inspired by face-recognition projectors, but adapted for ground sensing. A key technical challenge is robustly matching the projected grid with its deformed observation under perspective distortion and partial occlusion. Conventional one-dimensional dynamic time warping (1D-DTW) is not directly applicable to such two-dimensional grid patterns. We therefore propose a topology-constrained two-dimensional dynamic time warping (2D-DTW) algorithm that performs column-wise alignment under a global grid consistency constraint. The proposed method is designed to be simple enough to run on resource-limited platforms while preserving the grid structure required for accurate triangulation. We demonstrate that our 2D-DTW formulation can be used not only for terrain sensing but also as a general tool for matching structured grid patterns in image processing scenarios. This paper describes the overall system design as well as the 2D-DTW extension that emerged from this application.
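
For context, the conventional 1D-DTW that the abstract contrasts against can be sketched in a few lines. This is the textbook dynamic-programming recurrence for aligning two 1D sequences, shown only as background; it is not the paper's topology-constrained 2D-DTW, and the example sequences are made up.

```python
def dtw_distance(a, b):
    """Classic 1D dynamic time warping cost between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# A stretched copy of the same profile aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the recurrence only warps along one axis, it has no notion of neighboring grid columns staying consistent with each other, which is the gap the abstract's column-wise alignment under a global grid constraint is meant to fill.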

[140] Image Generation as a Visual Planner for Robotic Manipulation

Ye Pang

Main category: cs.CV

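TL;DR: This paper explores the potential of pretrained image generation models, after light finetuning, to act as visual planners for robots. It proposes a two-part video generation framework conditioned on text and on trajectories, and verifies on several robotic manipulation datasets that it can generate coherent, condition-aligned manipulation videos.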

Details Motivation: Existing video diffusion models depend on large amounts of domain-specific data and generalize poorly, whereas models pretrained at scale on language-image data show strong compositionality and a latent capacity for temporally consistent generation, which makes their transfer to robotic visual planning worth exploring. Method: Finetunes a pretrained image generation model with LoRA and designs two generation modes: (1) text-conditioned generation, which combines a language instruction with the first frame; and (2) trajectory-conditioned generation, which combines a 2D trajectory overlay with the initial frame, to generate temporally coherent manipulation videos. Result: Experiments on the Jaco Play, Bridge V2, and RT1 datasets show that both modes generate smooth, coherent robot manipulation videos aligned with their conditions. Conclusion: Pretrained image generation models contain transferable temporal priors and can serve as video-like robotic visual planners under very little supervision, suggesting a new approach to unifying perception, planning, and action in robots. Abstract: Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted using LoRA finetuning. We propose a two-part framework that includes: (1) text-conditioned generation, which uses a language instruction and the first frame, and (2) trajectory-conditioned generation, which uses a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play dataset, Bridge V2, and the RT1 dataset show that both modes produce smooth, coherent robot videos aligned with their respective conditions. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision. Code is released at https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation.

[141] Cross-Temporal 3D Gaussian Splatting for Sparse-View Guided Scene Update

Zeyuan An,Yanghang Xiao,Zhiying Leng,Frederick W. B. Li,Xiaohui Liang

Main category: cs.CV

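TL;DR: This paper proposes Cross-Temporal 3D Gaussian Splatting (Cross-Temporal 3DGS), a new framework that uses sparse images and historical scene priors to efficiently reconstruct and update 3D scenes across different time periods.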

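Details Motivation: In real-world applications such as urban planning, disaster assessment, and historical site preservation, dense scans are often hard to obtain, so 3D scenes must be updated from sparse-view observations. Existing methods struggle to exploit priors across time periods for high-quality reconstruction and dynamic updating, so an efficient approach that handles non-continuous capture and supports bidirectional spatio-temporal reasoning is needed. Method: The method has three stages: 1) cross-temporal camera alignment, which estimates and aligns camera poses across timestamps; 2) interference-based confidence initialization, which identifies regions unchanged over time to guide updates; and 3) progressive cross-temporal optimization, which iteratively integrates historical priors to improve reconstruction quality. The method also supports non-continuous capture, so past scenes can be recovered from current data and existing scenes can be updated with new views. Result: Experiments show that the method significantly outperforms baselines in reconstruction quality and data efficiency, achieves high-quality cross-temporal 3D reconstruction from sparse images alone, and applies to tasks such as scene versioning and long-term spatial documentation. Conclusion: Cross-Temporal 3DGS provides an efficient and flexible framework for reconstructing and updating 3D scenes across time from sparse images and historical priors, offering a practical solution for applications such as digital twins and long-term environment modeling. Abstract: Maintaining consistent 3D scene representations over time is a significant challenge in computer vision. Updating 3D scenes from sparse-view observations is crucial for various real-world applications, including urban planning, disaster assessment, and historical site preservation, where dense scans are often unavailable or impractical. In this paper, we propose Cross-Temporal 3D Gaussian Splatting (Cross-Temporal 3DGS), a novel framework for efficiently reconstructing and updating 3D scenes across different time periods, using sparse images and previously captured scene priors. Our approach comprises three stages: 1) Cross-temporal camera alignment for estimating and aligning camera poses across different timestamps; 2) Interference-based confidence initialization to identify unchanged regions between timestamps, thereby guiding updates; and 3) Progressive cross-temporal optimization, which iteratively integrates historical prior information into the 3D scene to enhance reconstruction quality. Our method supports non-continuous capture, enabling not only updates using new sparse views to refine existing scenes, but also recovering past scenes from limited data with the help of current captures. Furthermore, we demonstrate the potential of this approach to achieve temporal changes using only sparse images, which can later be reconstructed into detailed 3D representations as needed.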
Experimental results show significant improvements over baseline methods in reconstruction quality and data efficiency, making this approach a promising solution for scene versioning, cross-temporal digital twins, and long-term spatial documentation.

[142] SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning

Yongkang Hu,Yu Cheng,Yushuo Zhang,Yuan Xie,Zhaoxia Yin

Main category: cs.CV

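TL;DR: This paper proposes SAIDO, a scene-aware and importance-guided dynamic optimization detection framework that improves the generalization of AI-generated image detection under continual learning. By introducing a Scene-Awareness-Based Expert Module (SAEM) and an Importance-Guided Dynamic Optimization Mechanism (IDOM), it balances model plasticity and stability and outperforms existing state-of-the-art methods in detection error, forgetting rate, and cross-scene generalization.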

Details Motivation: Existing AI-generated image detection methods generalize poorly to new generative models and diverse content, and they are prone to catastrophic forgetting in continual learning, making it hard to adapt to dynamic changes in real open environments. Method: Proposes the SAIDO framework with two core components: 1) the Scene-Awareness-Based Expert Module (SAEM), which uses VLLMs to dynamically identify new scenes and allocates independent expert modules to capture scene-specific forgery features; and 2) the Importance-Guided Dynamic Optimization Mechanism (IDOM), which optimizes neuron updates through an importance-guided gradient projection strategy to balance model stability and plasticity. Result: On continual learning tasks, compared with the current SOTA method, the average detection error rate is reduced by 44.22% and the forgetting rate by 40.57% (both relative reductions); on open-world datasets, the average detection accuracy improves by 9.47%. Conclusion: SAIDO significantly improves the generalization and continual learning performance of AI-generated image detection in dynamic environments, providing an effective solution for keeping up with generation techniques that keep evolving in the real world. Abstract: The widespread misuse of image generation technologies has raised security concerns, driving the development of AI-generated image detection methods. However, generalization has become a key challenge and open problem: existing approaches struggle to adapt to emerging generative methods and content types in real-world scenarios. To address this issue, we propose a Scene-Aware and Importance-Guided Dynamic Optimization detection framework with continual learning (SAIDO). Specifically, we design a Scene-Awareness-Based Expert Module (SAEM) that dynamically identifies and incorporates new scenes using VLLMs. For each scene, independent expert modules are dynamically allocated, enabling the framework to capture scene-specific forgery features better and enhance cross-scene generalization. To mitigate catastrophic forgetting when learning from multiple image generative methods, we introduce an Importance-Guided Dynamic Optimization Mechanism (IDOM), which optimizes each neuron through an importance-guided gradient projection strategy, thereby achieving an effective balance between model plasticity and stability. Extensive experiments on continual learning tasks demonstrate that our method outperforms the current SOTA method in both stability and plasticity, achieving 44.22% and 40.57% relative reductions in average detection error rate and forgetting rate, respectively. On open-world datasets, it improves the average detection accuracy by 9.47% compared to the current SOTA method.

[143] Asset-Driven Sematic Reconstruction of Dynamic Scene with Multi-Human-Object Interactions

Sandika Biswas,Qianyi Wu,Biplab Banerjee,Hamid Rezatofighi

Main category: cs.CV

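TL;DR: Proposes a hybrid method that combines 3D generative models, semantic-aware deformation, and 3D Gaussian Splatting optimization for high-quality 3D geometry modeling of dynamic scenes with multiple humans and objects, maintaining structural consistency even under severe occlusion.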

Details Motivation: Existing 3D Gaussian Splatting methods struggle with the complex motion and frequent occlusions in dynamic multi-human, multi-object scenes, and lack structural consistency especially in monocular setups. Method: Combines 1) 3D generative models that produce high-fidelity meshes; 2) semantic-aware deformation (rigid transformation and LBS-based deformation) that maps the meshes into the dynamic scene; and 3) GS-based optimization of the alignment of each element. Result: Evaluated on the HOI-M3 dataset, the method achieves better surface reconstruction than the SOTA method, with multiview and temporal consistency. Conclusion: The hybrid strategy effectively improves the quality of geometry modeling in complex dynamic scenes and keeps structures stable, particularly under severe occlusion. Abstract: Real-world human-built environments are highly dynamic, involving multiple humans and their complex interactions with surrounding objects. While 3D geometry modeling of such scenes is crucial for applications like AR/VR, gaming, and embodied AI, it remains underexplored due to challenges like diverse motion patterns and frequent occlusions. Beyond novel view rendering, 3D Gaussian Splatting (GS) has demonstrated remarkable progress in producing detailed, high-quality surface geometry with fast optimization of the underlying structure. However, very few GS-based methods address multihuman, multiobject scenarios, primarily due to the above-mentioned inherent challenges. In a monocular setup, these challenges are further amplified, as maintaining structural consistency under severe occlusion becomes difficult when the scene is optimized solely based on GS-based rendering loss. To tackle the challenges of such a multihuman, multiobject dynamic scene, we propose a hybrid approach that effectively combines the advantages of 1) 3D generative models for generating high-fidelity meshes of the scene elements, 2) Semantic-aware deformation, i.e., rigid transformation of the rigid objects and LBS-based deformation of the humans, and mapping of the deformed high-fidelity meshes in the dynamic scene, and 3) GS-based optimization of the individual elements for further refining their alignments in the scene. Such a hybrid approach helps maintain the object structures even under severe occlusion and can produce multiview and temporally consistent geometry. We choose HOI-M3 for evaluation, as, to the best of our knowledge, this is the only dataset featuring multihuman, multiobject interactions in a dynamic scene.
Our method outperforms the state-of-the-art method in producing better surface reconstruction of such scenes.

[144] NeuroVolve: Evolving Visual Stimuli toward Programmable Neural Objectives

Haomiao Chen,Keith W Jamison,Mert R. Sabuncu,Amy Kuceyeski

Main category: cs.CV

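TL;DR: This paper proposes NeuroVolve, a generative framework that performs brain-activity-guided image generation by optimizing a neural objective function in the embedding space of a pretrained vision-language model. The method not only reproduces the known selective responses of brain regions but also explores cooperative and antagonistic relationships among multiple regions, supporting personalized, interpretable analysis of visual representations.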

Details Motivation: To study what visual information different brain regions encode during complex natural vision and how their distributed patterns jointly build neural representations; existing generative models reveal little about how regions interact. Method: Proposes the NeuroVolve framework, which defines a programmable neural objective function in the embedding space of a pretrained vision-language model, generates images that satisfy activation constraints on one or several brain regions by optimizing that objective, and tracks the optimization path to reveal semantic trajectories. Result: Successfully reproduces known properties of single regions (e.g., face selectivity in FFA); generates coherent scenes that satisfy complex multi-region constraints; reveals cooperative and antagonistic relationships between regions; and supports subject-specific brain-driven synthesis. Conclusion: NeuroVolve provides a unified framework that combines brain-guided image editing with preferred stimulus generation, can analyze visual representations at multiple levels, and offers a new interpretable tool for understanding how visual information is organized in the brain. Abstract: What visual information is encoded in individual brain regions, and how do distributed patterns combine to create their neural representations? Prior work has used generative models to replicate known category selectivity in isolated regions (e.g., faces in FFA), but these approaches offer limited insight into how regions interact during complex, naturalistic vision. We introduce NeuroVolve, a generative framework that provides brain-guided image synthesis via optimization of a neural objective function in the embedding space of a pretrained vision-language model. Images are generated under the guidance of a programmable neural objective, i.e., activating or deactivating single regions or multiple regions together. NeuroVolve is validated by recovering known selectivity for individual brain regions, while expanding to synthesize coherent scenes that satisfy complex, multi-region constraints. By tracking optimization steps, it reveals semantic trajectories through embedding space, unifying brain-guided image editing and preferred stimulus generation in a single process. We show that NeuroVolve can generate both low-level and semantic feature-specific stimuli for single ROIs, as well as stimuli aligned to curated neural objectives. These include co-activation and decorrelation between regions, exposing cooperative and antagonistic tuning relationships. Notably, the framework captures subject-specific preferences, supporting personalized brain-driven synthesis and offering interpretable constraints for mapping, analyzing, and probing neural representations of visual information.

[145] Describe Anything Anywhere At Any Moment

Nicolas Gorlo,Lukas Schmid,Luca Carlone

Main category: cs.CV

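TL;DR: Proposes the DAAAM framework for large-scale, real-time 4D scene understanding, combining geometric structure with semantic detail and substantially improving spatio-temporal question answering and task grounding.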

Details Motivation: Existing methods struggle to deliver rich open-vocabulary descriptions with 3D semantic grounding while staying real-time; a unified framework is needed to resolve this trade-off. Method: Designs an optimization-based frontend that uses batch processing to speed up inference of localized captioning models, and builds a hierarchical 4D scene graph as a spatially and temporally consistent memory representation. Result: Achieves SOTA on the NaVQA and SG3D benchmarks, improving OC-NaVQA question accuracy by 53.6%, position errors by 21.9%, and temporal errors by 21.6%, and SG3D task grounding accuracy by 27.8%. Conclusion: DAAAM efficiently builds a semantically rich, geometrically grounded 4D scene memory that supports real-time inference and interfaces well with a tool-calling agent. Abstract: Computer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding as well as semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing. It leverages such semantic understanding to build a hierarchical 4D scene graph (SG), which acts as an effective globally spatially and temporally consistent memory representation. DAAAM constructs 4D SGs with detailed, geometrically grounded descriptions while maintaining real-time performance. We show that DAAAM's 4D SG interfaces well with a tool-calling agent for inference and reasoning. We thoroughly evaluate DAAAM in the complex task of spatio-temporal question answering on the NaVQA benchmark and show its generalization capabilities for sequential task grounding on the SG3D benchmark. We further curate an extended OC-NaVQA benchmark for large-scale and long-time evaluations.
DAAAM achieves state-of-the-art results in both tasks, improving OC-NaVQA question accuracy by 53.6%, position errors by 21.9%, temporal errors by 21.6%, and SG3D task grounding accuracy by 27.8% over the most competitive baselines, respectively. We release our data and code open-source.

[146] Integrating Skeleton Based Representations for Robust Yoga Pose Classification Using Deep Learning Models

Mohammed Mohiuddin,Syed Mohammod Minhaz Hossain,Sumaiya Khanam,Prionkar Barua,Aparup Barua,MD Tamim Hossain

Main category: cs.CV

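TL;DR: This study presents a systematic evaluation of yoga pose classification, introducing a new dataset, Yoga-16, and comparing three deep learning models across different input modalities. Skeleton-based representations outperform raw images, and VGG16 with MediaPipe Pose skeletons achieves the highest accuracy of 96.09%.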

Details Motivation: Existing work on automatic yoga pose classification mostly relies on raw images or a single pose extraction model and lacks systematic benchmarking; incorrect yoga postures can also cause injury, so more reliable and interpretable automated recognition is needed to reduce reliance on experts. Method: Builds a new dataset named Yoga-16 and systematically evaluates three deep learning architectures (VGG16, ResNet50, and Xception) on three input modalities (raw images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images), with Grad-CAM interpretability analysis and cross-validation. Result: Skeleton-based representations outperform raw image inputs; VGG16 with MediaPipe Pose skeletons achieves the highest classification accuracy of 96.09%, and the model shows good interpretability. Conclusion: Skeleton information has an advantage for yoga pose recognition; combined with a suitable deep model it gives accurate and interpretable automated classification, a useful reference for future intelligent fitness applications. Abstract: Yoga is a popular form of exercise worldwide due to its spiritual and physical health benefits, but incorrect postures can lead to injuries. Automated yoga pose classification has therefore gained importance to reduce reliance on expert practitioners. While human pose keypoint extraction models have shown high potential in action recognition, systematic benchmarking for yoga pose recognition remains limited, as prior works often focus solely on raw images or a single pose extraction model. In this study, we introduce a curated dataset, 'Yoga-16', which addresses limitations of existing datasets, and systematically evaluate three deep learning architectures (VGG16, ResNet50, and Xception) using three input modalities (direct images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). Our experiments demonstrate that skeleton-based representations outperform raw image inputs, with the highest accuracy of 96.09% achieved by VGG16 with MediaPipe Pose skeleton input. Additionally, we provide interpretability analysis using Grad-CAM, offering insights into model decision-making for yoga pose classification with cross validation analysis.

[147] SatireDecoder: Visual Cascaded Decoupling for Enhancing Satirical Image Comprehension

Yue Jiang,Haiwei Xue,Minghao Han,Mingcheng Li,Xiaolu Hou,Dingkang Yang,Lihua Zhang,Xu Zheng

Main category: cs.CV

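TL;DR: This paper proposes SatireDecoder, a training-free framework for improving the comprehension of visual satire. Using a multi-agent system and an uncertainty-guided chain-of-thought reasoning strategy, it decomposes local and global semantics, significantly improving comprehension accuracy and reducing hallucinations.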

Details Motivation: Current vision-language models have difficulty understanding purely visual satire and fail to integrate local entity relationships with global context, leading to misinterpretation, bias, and hallucination. Method: Proposes the SatireDecoder framework, which uses a multi-agent system to perform visual cascaded decoupling, decomposing the image into fine-grained local and global semantic representations, and combines this with a chain-of-thought reasoning strategy guided by uncertainty analysis to complete satire comprehension step by step. Result: Experiments show that SatireDecoder outperforms existing baselines on multiple metrics, significantly improving the accuracy and reasoning consistency of satirical image comprehension while reducing model hallucinations. Conclusion: SatireDecoder offers an effective solution for vision-language reasoning over complex semantics and advances high-level cognitive tasks such as satire comprehension. Abstract: Satire, a form of artistic expression combining humor with implicit critique, holds significant social value by illuminating societal issues. Despite its cultural and societal significance, satire comprehension, particularly in purely visual forms, remains a challenging task for current vision-language models. This task requires not only detecting satire but also deciphering its nuanced meaning and identifying the implicated entities. Existing models often fail to effectively integrate local entity relationships with global context, leading to misinterpretation, comprehension biases, and hallucinations. To address these limitations, we propose SatireDecoder, a training-free framework designed to enhance satirical image comprehension. Our approach proposes a multi-agent system performing visual cascaded decoupling to decompose images into fine-grained local and global semantic representations. In addition, we introduce a chain-of-thought reasoning strategy guided by uncertainty analysis, which breaks down the complex satire comprehension process into sequential subtasks with minimized uncertainty. Our method significantly improves interpretive accuracy while reducing hallucinations. Experimental results validate that SatireDecoder outperforms existing baselines in comprehending visual satire, offering a promising direction for vision-language reasoning in nuanced, high-level semantic tasks.

[148] Scaling Down to Scale Up: Towards Operationally-Efficient and Deployable Clinical Models via Cross-Modal Low-Rank Adaptation for Medical Vision-Language Models

Thuraya Alzubaidi,Farhad R. Nezami,Muzammil Behzad

Main category: cs.CV

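TL;DR: MedCT-VLM is a parameter-efficient vision-language model based on Low-Rank Adaptation (LoRA) that adapts a large-scale CT foundation model to multi-label thoracic pathology classification, substantially improving performance in the zero-shot setting.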

Details Motivation: Although vision-language pretrained models show strong zero-shot capabilities on natural images, their application to volumetric medical imaging (especially chest CT) remains limited, mainly because labeled data is scarce and full-parameter fine-tuning is expensive. Method: Proposes the MedCT-VLM framework, built on the CT-CLIP model trained on 25,692 chest CT volumes. LoRA modules are inserted into the attention layers of the vision and text encoders for parameter-efficient fine-tuning, training only 1.67M parameters (0.38% of the total), to perform zero-shot multi-label classification of 18 thoracic pathologies. Result: On zero-shot classification, LoRA fine-tuning raises mean AUROC from 61.3% to 68.9% (+7.6 percentage points), accuracy from 67.2% to 73.6% (+6.4 percentage points), and macro-F1 from 32.1% to 36.9% (+4.8 percentage points). Conclusion: Parameter-efficient transfer methods such as LoRA can effectively adapt vision-language models pretrained on large-scale medical imaging to downstream clinical tasks, particularly in zero-shot medical image analysis where labeled data is scarce. Abstract: Foundation models trained via vision-language pretraining have demonstrated strong zero-shot capabilities across diverse image domains, yet their application to volumetric medical imaging remains limited. We introduce MedCT-VLM: Medical CT Vision-Language Model, a parameter-efficient vision-language framework designed to adapt large-scale CT foundation models for downstream clinical tasks. MedCT-VLM uses a parameter-efficient approach to adapt CT-CLIP, a contrastive vision-language model trained on 25,692 chest CT volumes, for multi-label pathology classification using Low-Rank Adaptation (LoRA). Rather than fine-tuning the model's 440M parameters directly, we insert low-rank decomposition matrices into attention layers of both vision and text encoders, training only 1.67M parameters (0.38% of total). We evaluate on zero-shot classification across 18 thoracic pathologies, where the model must align CT embeddings with unseen text prompts at inference without task-specific training. LoRA fine-tuning improves mean AUROC from 61.3% to 68.9% (+7.6 pp), accuracy from 67.2% to 73.6% (+6.4 pp), and macro-F1 from 32.1% to 36.9% (+4.8 pp). These results demonstrate that parameter-efficient methods can effectively transfer large-scale pretraining to downstream medical imaging tasks, particularly for zero-shot scenarios where labeled data is scarce.
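
To make the parameter-efficiency claim concrete, here is a small sketch of the general LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank product B·A. The layer size (768 x 768), the rank (8), and the helper names are hypothetical; the abstract does not state the rank or the layer shapes used in MedCT-VLM. Only the last line uses figures from the abstract.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """Compute y = W x + scale * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical 768 x 768 attention projection adapted with rank 8.
full = 768 * 768
added = lora_param_count(768, 768, rank=8)
print(added, full, f"{100 * added / full:.2f}%")

# Fraction reported in the abstract: 1.67M trainable out of 440M parameters.
print(f"{100 * 1.67e6 / 440e6:.2f}%")  # 0.38%
```

With B initialized to zeros, as is conventional for LoRA, the adapted layer starts out identical to the pretrained one, so fine-tuning begins from the original model's behavior.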

[149] Automatic Pith Detection in Tree Cross-Section Images Using Deep Learning

Tzu-I Liao,Mahmoud Fakhry,Jibin Yesudas Varghese

Main category: cs.CV

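TL;DR: This study evaluates several deep learning models (YOLOv9, U-Net, Swin Transformer, DeepLabV3, Mask R-CNN) on automatic pith detection in tree cross-sections, using a dynamically augmented dataset to improve generalization. Swin Transformer achieves the highest accuracy (0.94), while adding NMS to Mask R-CNN substantially reduces its overlapping detections; model choice should be weighed against dataset characteristics and application needs.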

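Details Motivation: Pith detection is essential for forestry and wood quality analysis, but traditional methods are manual and error-prone, so an efficient, accurate automated solution is needed. Method: Compares five deep learning models: YOLOv9 for bounding-box detection, and U-Net, Swin Transformer, DeepLabV3, and Mask R-CNN for segmentation. A dataset of 582 labeled images is used with dynamic data augmentation; performance is evaluated with metrics such as IoU and accuracy, and NMS is introduced to fix Mask R-CNN's overlapping predictions; generalization is further tested on an external oak dataset. Result: Swin Transformer achieves the highest accuracy (0.94) and performs best at fine segmentation; YOLOv9 is suited to fast localization but lacks boundary precision; U-Net is effective for structured patterns; DeepLabV3 captures multi-scale features with slight boundary imprecision; the original Mask R-CNN reaches an IoU of only 0.45 because of overlapping detections, which rises to 0.80 after applying NMS; tests on the external oak dataset indicate some generalization ability, though small sample sizes remain a challenge. Conclusion: Deep learning models show strong potential for pith detection in tree cross-sections, with Swin Transformer performing best overall; model choice should reflect the application and the data. Post-processing such as Non-Maximum Suppression can substantially improve results, and future work should address cross-species generalization and adaptation to small samples. Abstract: Pith detection in tree cross-sections is essential for forestry and wood quality analysis but remains a manual, error-prone task. This study evaluates deep learning models (YOLOv9, U-Net, Swin Transformer, DeepLabV3, and Mask R-CNN) to automate the process efficiently. A dataset of 582 labeled images was dynamically augmented to improve generalization. Swin Transformer achieved the highest accuracy (0.94), excelling in fine segmentation. YOLOv9 performed well for bounding box detection but struggled with boundary precision. U-Net was effective for structured patterns, while DeepLabV3 captured multi-scale features with slight boundary imprecision. Mask R-CNN initially underperformed due to overlapping detections, but applying Non-Maximum Suppression (NMS) improved its IoU from 0.45 to 0.80. Generalizability was next tested using an oak dataset of 11 images from Oregon State University's Tree Ring Lab. Additionally, for exploratory analysis purposes, an additional dataset of 64 labeled tree cross-sections was used to train the worst-performing model to see if this would improve its performance generalizing to the unseen oak dataset. Key challenges included tensor mismatches and boundary inconsistencies, addressed through hyperparameter tuning and augmentation.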
Our results highlight deep learning's potential for tree cross-section pith detection, with model choice depending on dataset characteristics and application needs.
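
The abstract credits Non-Maximum Suppression with removing Mask R-CNN's overlapping detections. A minimal sketch of greedy box-level NMS with an IoU test follows; the boxes, scores, and the 0.5 threshold are made-up values, and the study's actual post-processing settings are not given in the abstract.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of the same pith plus one distinct detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of 81/119, about 0.68, so it is suppressed, while the distant third box survives.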

[150] XAI-Driven Skin Disease Classification: Leveraging GANs to Augment ResNet-50 Performance

Kim Gerard A. Villanueva,Priyanka Kumar

Main category: cs.CV

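TL;DR: Proposes a trustworthy skin lesion diagnosis system that combines DCGAN data augmentation with a ResNet-50 classifier and uses LIME and SHAP for explainability, balancing high accuracy with clinical interpretability.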

Details Motivation: Addresses data imbalance, subjectivity, and the lack of interpretability of deep learning models in multi-class skin lesion diagnosis. Method: DCGAN is used to augment each skin lesion class to mitigate class imbalance, a fine-tuned ResNet-50 performs seven-class skin disease classification, and explainable AI techniques such as LIME and SHAP improve model transparency. Result: The system reaches 92.50% overall accuracy and 98.82% Macro-AUC, outperforming several benchmark models, with an F1 score of 0.8602 on the critical Melanoma NOS class. Conclusion: The study validates a high-performing, verifiable computer-aided diagnosis framework that combines accuracy with clinical interpretability and is suited to safe diagnostic deployment; future work should further improve discrimination for critical classes. Abstract: Accurate and timely diagnosis of multi-class skin lesions is hampered by subjective methods, inherent data imbalance in datasets like HAM10000, and the "black box" nature of Deep Learning (DL) models. This study proposes a trustworthy and highly accurate Computer-Aided Diagnosis (CAD) system to overcome these limitations. The approach utilizes Deep Convolutional Generative Adversarial Networks (DCGANs) for per-class data augmentation to resolve the critical class imbalance problem. A fine-tuned ResNet-50 classifier is then trained on the augmented dataset to classify seven skin disease categories. Crucially, LIME and SHAP Explainable AI (XAI) techniques are integrated to provide transparency by confirming that predictions are based on clinically relevant features like irregular morphology. The system achieved a high overall Accuracy of 92.50% and a Macro-AUC of 98.82%, outperforming various prior benchmarked architectures. This work validates a verifiable framework that combines high performance with the essential clinical interpretability required for safe diagnostic deployment. Future research should prioritize enhancing discrimination for critical categories, such as Melanoma NOS (F1-Score is 0.8602).

[151] Doppler-Enhanced Deep Learning: Improving Thyroid Nodule Segmentation with YOLOv5 Instance Segmentation

Mahmoud El Hussieni

Main category: cs.CV

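TL;DR: This study applies YOLOv5 to instance segmentation of thyroid nodules in ultrasound images and finds that including Doppler images significantly improves segmentation performance; YOLOv5-Large performs best (Dice score 91%, mAP 0.87), indicating potential for real-time automated diagnostic systems.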

Details Motivation: Accurate thyroid nodule segmentation is a critical first step toward AI-assisted clinical decision systems; existing methods underuse Doppler images, whose value for instance segmentation needs to be explored. Method: Several YOLOv5 variants (Nano, Small, Medium, Large, XLarge) are applied to thyroid nodule instance segmentation on two dataset versions, with and without Doppler images, and their performance is compared. Result: YOLOv5-Large performs best on the dataset including Doppler images, with a Dice score of 91% and mAP of 0.87; all models improve when Doppler images are included, and YOLOv5-Small reaches a Dice score of 79% without them. Conclusion: YOLOv5 enables efficient thyroid nodule instance segmentation, and Doppler images, although often excluded by physicians, significantly improve segmentation accuracy, showing promise for automated clinical diagnosis. Abstract: The increasing prevalence of thyroid cancer globally has led to the development of various computer-aided detection methods. Accurate segmentation of thyroid nodules is a critical first step in the development of AI-assisted clinical decision support systems. This study focuses on instance segmentation of thyroid nodules using YOLOv5 algorithms on ultrasound images. We evaluated multiple YOLOv5 variants (Nano, Small, Medium, Large, and XLarge) across two dataset versions, with and without Doppler images. The YOLOv5-Large algorithm achieved the highest performance with a Dice score of 91% and mAP of 0.87 on the dataset including Doppler images. Notably, our results demonstrate that Doppler images, typically excluded by physicians, can significantly improve segmentation performance. The YOLOv5-Small model achieved a 79% Dice score when Doppler images were excluded, while including them improved performance across all model variants. These findings suggest that instance segmentation with YOLOv5 provides an effective real-time approach for thyroid nodule detection, with potential clinical applications in automated diagnostic systems.
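
For readers unfamiliar with the Dice score quoted above, the standard definition is 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a ground-truth mask T. The following is a minimal sketch assuming flat binary masks; the helper name `dice_score` and the convention of returning 1.0 for two empty masks are illustrative choices, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# One of the two predicted pixels overlaps the single ground-truth pixel.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Because the denominator sums both mask sizes, Dice penalizes over-segmentation and under-segmentation symmetrically.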

[152] Graph-Attention Network with Adversarial Domain Alignment for Robust Cross-Domain Facial Expression Recognition

Razieh Ghaedi,AmirReza BabaAhmadi,Reyer Zwiggelaar,Xinqi Fan,Nashid Alam

Main category: cs.CV

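TL;DR: Proposes GAT-ADA, a cross-domain facial expression recognition method that combines ResNet-50 with a Graph Attention Network (GAT) and significantly improves performance through adversarial domain alignment and statistical alignment.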

Details Motivation: Cross-domain facial expression recognition (CD-FER) remains challenging because of severe domain shift between training and deployment data. Method: Each mini-batch is constructed as a sparse ring graph, a graph attention network models inter-sample relations, and adversarial learning (GRL) is combined with CORAL and MMD for distribution alignment. Result: Under a standard unsupervised domain adaptation protocol, GAT-ADA attains 74.39% mean cross-domain accuracy and reaches 98.0% accuracy on RAF-DB to FER2013, about 36 percentage points above the baseline. Conclusion: GAT-ADA effectively mitigates domain shift and delivers strong cross-domain expression recognition performance across multiple target datasets. Abstract: Cross-domain facial expression recognition (CD-FER) remains difficult due to severe domain shift between training and deployment data. We propose Graph-Attention Network with Adversarial Domain Alignment (GAT-ADA), a hybrid framework that couples a ResNet-50 backbone with a batch-level Graph Attention Network (GAT) to model inter-sample relations under shift. Each mini-batch is cast as a sparse ring graph so that attention aggregates cross-sample cues that are informative for adaptation. To align distributions, GAT-ADA combines adversarial learning via a Gradient Reversal Layer (GRL) with statistical alignment using CORAL and MMD. GAT-ADA is evaluated under a standard unsupervised domain adaptation protocol: training on one labeled source (RAF-DB) and adapting to multiple unlabeled targets (CK+, JAFFE, SFEW 2.0, FER2013, and ExpW). GAT-ADA attains 74.39% mean cross-domain accuracy. On RAF-DB to FER2013, it reaches 98.0% accuracy, corresponding to approximately a 36-point improvement over the best baseline we re-implemented with the same backbone and preprocessing.
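
One plausible reading of "each mini-batch is cast as a sparse ring graph" is that sample i is connected to its neighbours i-1 and i+1 in batch order, wrapping around at the ends. The sketch below builds such an edge list; the function name `ring_edges`, the `hops` parameter, and the bidirectional edges are assumptions for illustration, since the abstract does not spell out the exact construction.

```python
def ring_edges(batch_size, hops=1):
    """Directed edge list (src, dst) for a ring over batch indices.
    Each node is linked to its `hops` nearest neighbours on either side."""
    edges = set()
    for i in range(batch_size):
        for h in range(1, hops + 1):
            j = (i + h) % batch_size
            if i != j:  # skip self-loops in tiny batches
                edges.add((i, j))
                edges.add((j, i))
    return sorted(edges)


# A batch of four samples: every node has exactly two neighbours.
edges = ring_edges(4)
```

With one hop the graph has 2 x batch_size directed edges, so attention cost grows linearly with batch size rather than quadratically as in a fully connected batch graph.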

[153] MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba

Shanhui Liu,Rui Xu,Yunke Wang

Main category: cs.CV

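TL;DR: Proposes Coarse-to-Fine Vision Mamba (CF-ViM), an efficient Vision Mamba framework that allocates computation adaptively according to image complexity, using a coarse-to-fine dynamic inference mechanism to reduce the token count while preserving key visual information.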

Details Motivation: Existing token reduction methods generally lose information and process all images uniformly, ignoring differences in visual complexity; meanwhile, Vision Mamba's efficiency remains limited by the number of input tokens. Method: The image is first divided into large patches for coarse-grained inference, reducing token length and computation; when prediction confidence is low, selected regions are re-processed at fine granularity to recover key details, yielding dynamic resolution assignment. Result: Experiments on ImageNet show that CF-ViM outperforms the baseline Vision Mamba and existing state-of-the-art token reduction techniques in both accuracy and efficiency. Conclusion: Through adaptive computation allocation, CF-ViM balances efficiency and performance and offers a new approach to efficient visual modeling. Abstract: Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss, as they discard or compress token representations. This problem is exacerbated when applied uniformly to fine-grained token representations across all images, regardless of visual complexity. We observe that not all inputs require fine-grained processing. Simple images can be effectively handled at coarse resolution, while only complex ones may warrant refinement. Based on this insight, we propose Coarse-to-Fine Vision Mamba (CF-ViM), an adaptive framework for efficient inference. CF-ViM first performs coarse-grained inference by dividing the input image into large patches, significantly reducing the token length and computation. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover critical visual details with minimal additional cost. This dynamic resolution assignment strategy allows CF-ViM to allocate computation adaptively according to image complexity, ensuring efficient processing without compromising essential visual information. Experiments on ImageNet demonstrate that CF-ViM outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.

[154] Realistic Handwritten Multi-Digit Writer (MDW) Number Recognition Challenges

Kiri L. Wagstaff

Main category: cs.CV

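TL;DR: Using writer information for NIST digit images, this paper builds more realistic multi-digit writer (MDW) data sets for evaluating handwritten number recognition in real-world settings.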

Details Motivation: Real-world numbers usually have multiple digits written by the same person, so traditional isolated digit classification does not reflect reality and more realistic benchmarks are needed to drive research. Method: Multi-digit writer (MDW) data sets are built from writer information, and task-specific performance metrics are introduced to better measure multi-digit recognition. Result: Experiments show that classifiers perform well on isolated digits but poorly on multi-digit recognition, indicating that existing methods are insufficient for real number recognition problems. Conclusion: Solving practical handwritten number recognition requires further research and the use of task-specific knowledge to improve multi-digit performance; the MDW data sets provide a new benchmark and opportunity for this. Abstract: Isolated digit classification has served as a motivating problem for decades of machine learning research. In real settings, numbers often occur as multiple digits, all written by the same person. Examples include ZIP Codes, handwritten check amounts, and appointment times. In this work, we leverage knowledge about the writers of NIST digit images to create more realistic benchmark multi-digit writer (MDW) data sets. As expected, we find that classifiers may perform well on isolated digits yet do poorly on multi-digit number recognition. If we want to solve real number recognition problems, additional advances are needed. The MDW benchmarks come with task-specific performance metrics that go beyond typical error calculations to more closely align with real-world impact. They also create opportunities to develop methods that can leverage task-specific knowledge to improve performance well beyond that of individual digit classification methods.
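
A back-of-the-envelope calculation suggests why strong isolated-digit accuracy need not carry over to whole numbers. The sketch below assumes digit errors are independent, which is a simplification introduced here and not a claim from the paper; since MDW numbers come from a single writer, real errors may well be correlated. The function name `number_accuracy` is illustrative.

```python
def number_accuracy(digit_accuracy, num_digits):
    """Probability that every digit of a number is classified correctly,
    under the (assumed) independence of per-digit errors."""
    return digit_accuracy ** num_digits


# A 99%-accurate digit classifier applied to a five-digit ZIP Code.
zip_accuracy = number_accuracy(0.99, 5)  # about 0.951
```

Even under this optimistic assumption, roughly one in twenty five-digit numbers would contain an error, which is the kind of gap that number-level metrics are meant to expose.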

[155] Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer

Dong In Lee,Hyungjun Doh,Seunggeun Chi,Runlin Duan,Sangpil Kim,Karthik Ramani

Main category: cs.CV

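TL;DR: This paper proposes Dynamic-eDiTor, a training-free text-driven 4D scene editing framework that combines MM-DiT with 4D Gaussian Splatting and achieves multi-view and temporally consistent editing through Spatio-Temporal Sub-Grid Attention and Context Token Propagation.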

Details Motivation: Existing text-driven 4D editing methods based on 2D diffusion models struggle to maintain multi-view and temporal continuity, often causing motion distortion, geometric drift, and incomplete edits. Method: The Dynamic-eDiTor framework uses MM-DiT and 4DGS, introducing Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion and Context Token Propagation (CTP) for global information propagation, supporting training-free direct optimization. Result: Experiments on the DyNeRF dataset show that the method outperforms prior approaches in editing fidelity and in multi-view and temporal consistency. Conclusion: Dynamic-eDiTor effectively addresses multi-view and temporal consistency in text-driven 4D scene editing, enabling high-quality, training-free editing of dynamic scenes. Abstract: Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing. Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. This mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor to perform seamless, globally consistent multi-view video editing without additional training and to directly optimize the pre-trained source 4DGS. Extensive experiments on the multi-view video dataset DyNeRF demonstrate that our method achieves superior editing fidelity and both multi-view and temporal consistency compared with prior approaches. Project page for results and code: https://di-lee.github.io/dynamic-eDiTor/

[156] Silhouette-based Gait Foundation Model

Dingqiang Ye,Chao Fan,Kartik Narayan,Bingzhe Wu,Chengwen Luo,Jianqiang Li,Vishal M. Patel

Main category: cs.CV

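TL;DR: This paper proposes FoundationGait, the first scalable self-supervised pretraining framework for gait understanding. By pretraining a large model (nearly 130 million parameters) on 12 public datasets, it generalizes well across tasks, modalities, and scenarios and sets new milestone results on tasks such as zero-shot gait recognition.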

Details Motivation: Existing gait models are small and narrowly designed, so they are hard to scale and generalize, and a unified foundation model is missing; gait research has long faced two questions: why have gait models failed to follow scaling laws, and can one general model support diverse gait tasks? Method: FoundationGait is a scalable self-supervised pretraining framework whose largest version has about 130 million parameters and is pretrained on 12 public gait datasets with over 2 million walking sequences, providing unified modeling for multiple gait tasks. Result: Experiments show that FoundationGait, with or without fine-tuning, is robust across datasets, conditions, tasks (such as identification, scoliosis screening, depression prediction, and attribute estimation), and input modalities; it achieves 48.0% zero-shot rank-1 accuracy on the challenging in-the-wild Gait3D dataset and 64.5% on OU-MVLP, the largest in-the-lab dataset, setting new records. Conclusion: FoundationGait demonstrates that a unified, scalable gait foundation model is feasible, overcoming the scalability and generalization limits of traditional gait models and offering an effective basis for multi-task, multi-scenario gait analysis. Abstract: Gait patterns play a critical role in human identification and healthcare analytics, yet current progress remains constrained by small, narrowly designed models that fail to scale or generalize. Building a unified gait foundation model requires addressing two longstanding barriers: (a) Scalability. Why have gait models historically failed to follow scaling laws? (b) Generalization. Can one model serve the diverse gait tasks that have traditionally been studied in isolation? We introduce FoundationGait, the first scalable, self-supervised pretraining framework for gait understanding. Its largest version has nearly 0.13 billion parameters and is pretrained on 12 public gait datasets comprising over 2 million walking sequences. Extensive experiments demonstrate that FoundationGait, with or without fine-tuning, performs robustly across a wide spectrum of gait datasets, conditions, tasks (e.g., human identification, scoliosis screening, depression prediction, and attribute estimation), and even input modalities. Notably, it achieves 48.0% zero-shot rank-1 accuracy on the challenging in-the-wild Gait3D dataset (1,000 test subjects) and 64.5% on the largest in-the-lab OU-MVLP dataset (5,000+ test subjects), setting a new milestone in robust gait recognition. Code and model (forthcoming): https://github.com/ShiqiYu/OpenGait.

[157] Affordance-First Decomposition for Continual Learning in Video-Language Understanding

Mengzhu Xu,Hanzhi Liu,Ningkang Peng,Qianyu Chen,Canran Xiao

Main category: cs.CV

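TL;DR: This paper proposes Affordance-First Decomposition (AFD) for continual learning in video-language understanding; a stable affordance substrate and a dynamic scheduler explicitly separate stability from adaptability, achieving state-of-the-art performance on multiple benchmarks.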

Details Motivation: Existing continual learning methods struggle to balance stability and plasticity on non-stationary data, and often rely on replay or fixed structures without accounting for privacy and memory. Method: The AFD framework maps videos to slowly varying affordance tokens that serve as a shared substrate, and a lightweight, query-routed, conflict-aware scheduler adjusts capacity dynamically; the substrate is stabilized through weak alignment and teacher consistency, and only question text is replayed during training. Result: AFD performs well across tasks: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA; on ViLCo, 29.6% for MQ, 20.7% for NLQ, and 18.4% stAP@0.25 for VQ; and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Conclusion: AFD cleanly separates a stable substrate from local adaptation, is interpretable and practical, and suits video-language continual learning under realistic memory and privacy constraints. Abstract: Continual learning for video-language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art results across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA, ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ), and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable interaction-centered substrate and targeted adaptation.

[158] CAR-Net: A Cascade Refinement Network for Rotational Motion Deblurring under Angle Information Uncertainty

Ka Chung Lai,Ahmet Cetinkaya

Main category: cs.CV

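TL;DR: Proposes CAR-net, a cascade refinement network for deblurring images with rotational motion blur, suited in particular to semi-blind scenarios where only noisy information about the blur angle is available.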

Details Motivation: In semi-blind conditions, existing methods struggle with rotational motion blur, especially when the blur angle information is inaccurate, leading to poor deblurring. Method: A cascade refinement structure first obtains an initial deblurred image via frequency-domain inversion; multiple refinement stages then progressively predict and apply residual corrections to suppress artifacts and restore detail; an optional angle detection module is integrated and trained end-to-end. Result: Experiments on synthetic and real images show that CAR-net improves deblurring quality and outperforms existing methods, especially when the blur angle is uncertain. Conclusion: Through progressive refinement and the angle detection module, CAR-net substantially improves semi-blind deblurring of images with rotational motion blur. Abstract: We propose a new neural network architecture called CAR-net (CAscade Refinement Network) to deblur images that are subject to rotational motion blur. Our architecture is specifically designed for semi-blind scenarios where only noisy information about the rotational motion blur angle is available. The core of our approach is a progressive refinement process that starts with an initial deblurred estimate obtained from frequency-domain inversion. A series of refinement stages then take the current deblurred image to predict and apply a residual correction to the current estimate, progressively suppressing artifacts and restoring fine details. To handle parameter uncertainty, our architecture accommodates an optional angle detection module which can be trained end-to-end with the refinement modules. We provide a detailed description of our architecture and illustrate its efficiency through experiments using both synthetic and real-life images. Our code and model as well as the links to the datasets are available at https://github.com/tony123105/CAR-Net

[159] Optimizing LVLMs with On-Policy Data for Effective Hallucination Mitigation

Chengzhi Yu,Yifan Xu,Yifan Chen,Wenyi Zhang

Main category: cs.CV

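TL;DR: This paper proposes an efficient method based on a hallucination classifier and iterative DPO that significantly reduces hallucination rates in large vision-language models and enables an open-source model to surpass GPT-4V.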

Details Motivation: Existing LVLM hallucination mitigation methods can introduce additional hallucinations when annotating training data, harming alignment; a reliable and efficient annotation method for on-policy data is needed. Method: A binary hallucination classifier provides clean preference annotations, and an iterative direct preference optimization (DPO) algorithm with dynamic sample reweighting makes full use of on-policy data. Result: The method outperforms 8 state-of-the-art baselines on three benchmarks: for LLaVA-1.5-7B, the hallucination rate on MMHalBench drops by 50.8% and the average hallucination rate on Object HalBench drops by 79.5%; LLaVA-1.5-13B surpasses GPT-4V. Conclusion: The method effectively mitigates hallucination in LVLMs and, through clean annotation and a robust training strategy, unlocks the potential of open-source models. Abstract: Recently, large vision-language models (LVLMs) have risen to be a promising approach for multimodal tasks. However, principled hallucination mitigation remains a critical challenge. In this work, we first analyze the data generation process in LVLM hallucination mitigation and affirm that on-policy data significantly outperforms off-policy data, which calls for efficient and reliable preference annotation of on-policy data. We then point out that existing annotation methods introduce additional hallucination in training samples, which may reinforce the model's hallucination patterns. To address this problem, we propose training a hallucination classifier that gives binary annotations, which guarantees clean chosen samples for the subsequent alignment. To further harness the power of on-policy data, we design a robust iterative direct preference optimization (DPO) algorithm adopting a dynamic sample reweighting scheme. We conduct comprehensive experiments on three benchmarks with comparison to 8 state-of-the-art baselines. In particular, our approach reduces the hallucination rate of LLaVA-1.5-7B on MMHalBench by 50.8% and the average hallucination rate on Object HalBench by 79.5%; more significantly, our method fully taps into the potential of open-source models, enabling LLaVA-1.5-13B to even surpass the performance of GPT-4V.

[160] Deep Learning-Based Computer Vision Models for Early Cancer Detection Using Multimodal Medical Imaging and Radiogenomic Integration Frameworks

Emmanuella Avwerosuoghene Oghenekaro

Main category: cs.CV

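TL;DR: This paper reviews deep learning for early cancer detection, emphasizing its potential to improve diagnostic accuracy through multimodal medical image analysis.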

Details Motivation: Early cancer detection is critical for improving survival, but traditional methods have limitations, and more accurate, non-invasive techniques are needed. Method: Deep learning-based computer vision models (such as CNNs, Transformers, and hybrid attention architectures) analyze multimodal imaging data, combined with radiomics and genomics in an integrated study. Result: The models can identify tissue abnormalities and tumor microenvironment changes that are hard to see with the naked eye, and can predict tumor genotype, molecular subtypes, and treatment resistance. Conclusion: Deep learning combined with multimodal data fusion provides a powerful tool for personalized oncology and may enable non-invasive, accurate early cancer detection. Abstract: Early cancer detection remains one of the most critical challenges in modern healthcare, where delayed diagnosis significantly reduces survival outcomes. Recent advancements in artificial intelligence, particularly deep learning, have enabled transformative progress in medical imaging analysis. Deep learning-based computer vision models, such as convolutional neural networks (CNNs), transformers, and hybrid attention architectures, can automatically extract complex spatial, morphological, and temporal patterns from multimodal imaging data including MRI, CT, PET, mammography, histopathology, and ultrasound. These models surpass traditional radiological assessment by identifying subtle tissue abnormalities and tumor microenvironment variations invisible to the human eye. At a broader scale, the integration of multimodal imaging with radiogenomics, which links quantitative imaging features with genomics, transcriptomics, and epigenetic biomarkers, has introduced a new paradigm for personalized oncology. This radiogenomic fusion allows the prediction of tumor genotype, immune response, molecular subtypes, and treatment resistance without invasive biopsies.

[161] RS-ISRefiner: Towards Better Adapting Vision Foundation Models for Interactive Segmentation of Remote Sensing Images

Deliang Wang,Peng Liu

Main category: cs.CV

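TL;DR: This paper proposes RS-ISRefiner, a new click-based interactive image segmentation framework designed for remote sensing imagery; with an adapter-based tuning strategy and a hybrid attention mechanism, it outperforms existing methods on several datasets in segmentation accuracy, efficiency, and interaction cost.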

Details Motivation: Existing interactive image segmentation methods mainly target natural images and adapt poorly to the scale variation, irregular boundaries, and complex backgrounds of remote sensing imagery; they are also limited by scarce annotated data and high computational overhead. Method: The RS-ISRefiner framework uses adapter-based fine-tuning to learn remote sensing-specific spatial and boundary features while preserving the general representations of vision foundation models; it introduces a hybrid attention mechanism that combines convolutional local modeling with Transformer-based global reasoning, and improves the probability map modulation scheme to better incorporate historical user interactions. Result: Experiments on six remote sensing datasets (iSAID, ISPRS Potsdam, SandBar, NWPU, LoveDA Urban, and WHUBuilding) show that RS-ISRefiner outperforms current state-of-the-art methods in segmentation accuracy, efficiency, and interaction cost. Conclusion: RS-ISRefiner improves interactive segmentation of remote sensing images, generalizes well, and is practical for high-quality remote sensing instance segmentation. Abstract: Interactive image segmentation (IIS) plays a critical role in generating precise annotations for remote sensing imagery, where objects often exhibit scale variations, irregular boundaries and complex backgrounds. However, existing IIS methods, primarily designed for natural images, struggle to generalize to remote sensing domains due to limited annotated data and computational overhead. To address these challenges, we propose RS-ISRefiner, a novel click-based IIS framework tailored for remote sensing images. The framework employs an adapter-based tuning strategy that preserves the general representations of Vision Foundation Models while enabling efficient learning of remote sensing-specific spatial and boundary characteristics. A hybrid attention mechanism integrating convolutional local modeling with Transformer-based global reasoning enhances robustness against scale diversity and scene complexity. Furthermore, an improved probability map modulation scheme effectively incorporates historical user interactions, yielding more stable iterative refinement and higher boundary fidelity. Comprehensive experiments on six remote sensing datasets, including iSAID, ISPRS Potsdam, SandBar, NWPU, LoveDA Urban and WHUBuilding, demonstrate that RS-ISRefiner consistently outperforms state-of-the-art IIS methods in terms of segmentation accuracy, efficiency and interaction cost. These results confirm the effectiveness and generalizability of our framework, making it highly suitable for high-quality instance segmentation in practical remote sensing scenarios.

[162] TrajDiff: End-to-end Autonomous Driving without Perception Annotation

Xingtai Gui,Jianbo Zhao,Wencheng Han,Jikai Wang,Jiahao Gong,Feiyang Tan,Cheng-zhong Xu,Jianbing Shen

Main category: cs.CV

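TL;DR: This paper proposes TrajDiff, an end-to-end autonomous driving framework that needs no perception annotation; a trajectory-oriented BEV diffusion model generates diverse, plausible driving trajectories directly from raw sensor inputs and achieves state-of-the-art performance on the NAVSIM benchmark.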

Details Motivation: Because manual perception annotation is costly, building end-to-end autonomous driving systems without perception annotation has become a key challenge; existing methods depend on auxiliary perception tasks, which limits truly end-to-end learning. Method: The TrajDiff framework builds Gaussian BEV heatmap targets from raw sensor inputs and future trajectories, uses a trajectory-oriented BEV encoder to extract TrajBEV features without perceptual supervision, and introduces the TB-DiT model, which combines ego-state information with TrajBEV features to generate trajectories, avoiding handcrafted motion priors entirely. Result: On the NAVSIM benchmark, TrajDiff reaches 87.5 PDMS, the best among annotation-free methods; with data scaling it improves to 88.5 PDMS, close to advanced perception-based methods. Conclusion: TrajDiff provides a fully perception-annotation-free generative approach to end-to-end autonomous driving, demonstrates the feasibility of purely data-driven trajectory generation, and shows the benefit of data scale for annotation-free systems. Abstract: End-to-end autonomous driving systems directly generate driving policies from raw sensor inputs. While these systems can extract effective environmental features for planning by relying on auxiliary perception tasks, developing perception-annotation-free planning paradigms has become increasingly critical due to the high cost of manual perception annotation. In this work, we propose TrajDiff, a Trajectory-oriented BEV Conditioned Diffusion framework that establishes a fully perception-annotation-free generative method for end-to-end autonomous driving. TrajDiff requires only raw sensor inputs and future trajectory, constructing Gaussian BEV heatmap targets that inherently capture driving modalities. We design a simple yet effective trajectory-oriented BEV encoder to extract the TrajBEV feature without perceptual supervision. Furthermore, we introduce Trajectory-oriented BEV Diffusion Transformer (TB-DiT), which leverages ego-state information and the predicted TrajBEV features to directly generate diverse yet plausible trajectories, eliminating the need for handcrafted motion priors. Beyond architectural innovations, TrajDiff enables exploration of data scaling benefits in the annotation-free setting. Evaluated on the NAVSIM benchmark, TrajDiff achieves 87.5 PDMS, establishing state-of-the-art performance among all annotation-free methods. With data scaling, it further improves to 88.5 PDMS, which is comparable to advanced perception-based approaches. Our code and model will be made publicly available.

[163] Multi-GRPO: Multi-Group Advantage Estimation for Text-to-Image Generation with Tree-Based Trajectories and Multiple Rewards

Qiang Lyu,Zicong Chen,Chongxiao Wang,Haolin Shi,Shibo Gao,Ran Piao,Youwei Zeng,Jianlou Si,Fei Ding,Jing Li,Chun Pong Lau,Weiqiang Wang

Main category: cs.CV

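TL;DR: This paper proposes Multi-GRPO, a multi-group advantage estimation framework for aligning text-to-image models; temporal grouping and reward grouping address the limits of existing GRPO methods in credit assignment and multi-objective reward mixing, and its advantages are validated on several benchmarks.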

Details Motivation: Existing GRPO-based methods suffer from inaccurate shared credit assignment and from unstable gradients caused by mixing multi-objective rewards, which makes it hard to align text-to-image generation models effectively. Method: Tree-based trajectory grouping (temporal grouping) improves advantage estimation for early denoising steps; a reward grouping mechanism computes the advantage for each reward function independently before aggregation, disentangling conflicting signals; the OCR-Color-10 dataset is built to evaluate multi-objective alignment. Result: On the PickScore-25k and OCR-Color-10 benchmarks, Multi-GRPO shows better stability and alignment performance and balances conflicting objectives such as text accuracy, visual quality, and color. Conclusion: With its two orthogonal grouping mechanisms, Multi-GRPO markedly improves GRPO on complex multi-objective T2I alignment and offers a viable direction for fine-grained controllable generation. Abstract: Recently, Group Relative Policy Optimization (GRPO) has shown promising potential for aligning text-to-image (T2I) models, yet existing GRPO-based methods suffer from two critical limitations. (1) Shared credit assignment: trajectory-level advantages derived from group-normalized sparse terminal rewards are uniformly applied across timesteps, failing to accurately estimate the potential of early denoising steps with vast exploration spaces. (2) Reward-mixing: predefined weights for combining multi-objective rewards (e.g., text accuracy, visual quality, text color), which have mismatched scales and variances, lead to unstable gradients and conflicting updates. To address these issues, we propose Multi-GRPO, a multi-group advantage estimation framework with two orthogonal grouping mechanisms. For better credit assignment, we introduce tree-based trajectories inspired by Monte Carlo Tree Search: branching trajectories at selected early denoising steps naturally forms temporal groups, enabling accurate advantage estimation for early steps via descendant leaves while amortizing computation through shared prefixes. For multi-objective optimization, we introduce reward-based grouping to compute advantages for each reward function independently before aggregation, disentangling conflicting signals. To facilitate evaluation of multiple objective alignment, we curate OCR-Color-10, a visual text rendering dataset with explicit color constraints.
Across the single-reward PickScore-25k and multi-objective OCR-Color-10 benchmarks, Multi-GRPO achieves superior stability and alignment performance, effectively balancing conflicting objectives. Code will be publicly available at https://github.com/fikry102/Multi-GRPO.
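
The reward-based grouping idea, normalizing each reward within the sample group before combining, can be sketched as follows. This is an illustrative reading of the abstract, not the released code: the function names, the use of a population standard deviation, the `eps` term, and aggregation by weighted sum are all assumptions.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Group-relative advantage: subtract the group mean and divide by
    the group standard deviation (eps avoids division by zero)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def multi_reward_advantages(rewards_by_objective, weights=None):
    """Normalize each objective's rewards independently, then aggregate
    the per-objective advantages per sample (weighted sum assumed)."""
    names = list(rewards_by_objective)
    weights = weights or {name: 1.0 for name in names}
    per_objective = {n: group_normalize(rewards_by_objective[n]) for n in names}
    n_samples = len(per_objective[names[0]])
    return [
        sum(weights[n] * per_objective[n][i] for n in names)
        for i in range(n_samples)
    ]


# Two samples scored by rewards on very different scales.
advantages = multi_reward_advantages(
    {"ocr": [0.0, 1.0], "quality": [10.0, 30.0]}
)
```

Because each objective is normalized on its own, the large-scale `quality` reward no longer dominates the small-scale `ocr` reward, which is the scale mismatch the abstract describes for predefined mixing weights.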

[164] Joint Multi-scale Gated Transformer and Prior-guided Convolutional Network for Learned Image Compression

Zhengxin Chen,Xiaohai He,Tingrong Zhang,Shuhua Xiong,Chao Ren

Main category: cs.CV

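TL;DR: This paper proposes a new joint Multi-scale Gated Transformer and Prior-guided Convolutional Network (MGTPCN) for learned image compression; improved local and non-local feature extraction yields a better trade-off between performance and complexity.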

Details Motivation: To overcome the limits of traditional image codecs in nonlinear transform coding and improve learned image compression, the paper aims to strengthen the ability of convolutions to extract local features and of Transformer blocks to extract non-local multi-scale features. Method: Prior-guided convolution (PGConv) introduces asymmetric convolutions (AConvs) and difference convolutions (DConvs) to strengthen skeleton elements and extract high-frequency information, with a re-parameterization strategy to reduce computational complexity; the multi-scale gated transformer (MGT) combines dilated window-based multi-head self-attention at different dilation rates with depth-wise convolutions of different kernel sizes, and adds a gate mechanism to enhance non-linearity, capturing multi-scale non-local features. These components form the MGTPCN network. Result: Experiments show that MGTPCN outperforms current state-of-the-art image compression algorithms on several standard datasets, with a better balance between rate-distortion performance and computational complexity. Conclusion: By combining improved convolution with a multi-scale gated Transformer, MGTPCN improves learned image compression and offers a new solution for efficient image compression. Abstract: Recently, learned image compression methods have made remarkable achievements, some of which have outperformed the traditional image codec VVC. The advantages of learned image compression methods over traditional image codecs can be largely attributed to their powerful nonlinear transform coding. Convolutional layers and shifted window transformer (Swin-T) blocks are the basic units of neural networks, and their representation capabilities play an important role in nonlinear transform coding. In this paper, to improve the ability of the vanilla convolution to extract local features, we propose a novel prior-guided convolution (PGConv), where asymmetric convolutions (AConvs) and difference convolutions (DConvs) are introduced to strengthen skeleton elements and extract high-frequency information, respectively. A re-parameterization strategy is also used to reduce the computational complexity of PGConv. Moreover, to improve the ability of the Swin-T block to extract non-local features, we propose a novel multi-scale gated transformer (MGT), where dilated window-based multi-head self-attention blocks with different dilation rates and depth-wise convolution layers with different kernel sizes are used to extract multi-scale features, and a gate mechanism is introduced to enhance non-linearity. Finally, we propose a novel joint Multi-scale Gated Transformer and Prior-guided Convolutional Network (MGTPCN) for learned image compression.
Experimental results show that our MGTPCN surpasses state-of-the-art algorithms with a better trade-off between performance and complexity.

[165] Probabilistic Modeling of Multi-rater Medical Image Segmentation for Diversity and Personalization

Ke Liu,Shangde Gao,Yichao Fu,Shangqi Gao,Chunhua Shen

Main category: cs.CV

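TL;DR: Proposes ProSeg, which introduces two latent variables to model expert annotation preferences and boundary ambiguity, unifying diversification and personalization in medical image segmentation.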

Details Motivation: Existing models struggle to provide both segmentation diversity and expert specificity under multi-expert annotation, and cannot handle data uncertainty effectively. Method: ProSeg uses two latent variables to model expert preferences and boundary ambiguity, obtains their conditional probability distributions through variational inference, and samples from them to generate diverse and personalized segmentations. Result: ProSeg achieves state-of-the-art performance on the NPC and LIDC-IDRI datasets and produces segmentations that are both diverse and expert-personalized. Conclusion: ProSeg effectively models the uncertainty of multi-expert annotation and achieves a better balance of diversification and personalization in medical image segmentation. Abstract: Medical image segmentation is inherently influenced by data uncertainty, arising from ambiguous boundaries in medical scans and inter-observer variability in diagnosis. To address this challenge, previous works formulated the multi-rater medical image segmentation task, where multiple experts provide separate annotations for each image. However, existing models are typically constrained to either generate diverse segmentation that lacks expert specificity or to produce personalized outputs that merely replicate individual annotators. We propose Probabilistic modeling of multi-rater medical image Segmentation (ProSeg) that simultaneously enables both diversification and personalization. Specifically, we introduce two latent variables to model expert annotation preferences and image boundary ambiguity. Their conditional probability distributions are then obtained through variational inference, allowing segmentation outputs to be generated by sampling from these distributions. Extensive experiments on both the nasopharyngeal carcinoma dataset (NPC) and the lung nodule dataset (LIDC-IDRI) demonstrate that our ProSeg achieves a new state-of-the-art performance, providing segmentation results that are both diverse and expert-personalized. Code can be found at https://github.com/AI4MOL/ProSeg.

[166] Charts Are Not Images: On the Challenges of Scientific Chart Editing

Shawn Li,Ryan Rossi,Sungchul Kim,Sunav Choudhary,Franck Dernoncourt,Puneet Mathur,Zhengzhong Tu,Yue Zhao

Main category: cs.CV

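TL;DR: This paper introduces FigEdit, a large-scale benchmark for scientific chart editing, to address the failure of existing generative models to account for the structured nature of charts.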

Details Motivation: Existing generative models (such as diffusion and autoregressive models) are designed mainly for natural images and edit charts as collections of pixels, ignoring that a chart is a visual representation of structured data. Under this assumption, edits often violate the graphical grammar and cannot perform valid structural transformations, so a benchmark dedicated to scientific chart editing is needed to drive structure-aware models. Method: The authors build FigEdit, a large benchmark of over 30,000 samples covering 10 chart types with diverse and complex editing instructions. It is organized into five progressively harder tasks: single edits, multi edits, conversational edits, visual-guidance-based edits, and style transfer. Evaluating a range of advanced models on these tasks exposes their weaknesses in structured transformation. Result: Current state-of-the-art models perform poorly on FigEdit and struggle with edits that require understanding the underlying structure. Traditional metrics (such as SSIM and PSNR) also fail to reflect the semantic correctness of chart edits. Conclusion: FigEdit exposes the fundamental limits of pixel-level generative models for scientific chart editing and underlines the importance of structure-aware models. The benchmark provides a unified evaluation platform for future work on chart editing models with both visual and semantic understanding. Abstract: Generative models, such as diffusion and autoregressive approaches, have demonstrated impressive capabilities in editing natural images. However, applying these tools to scientific charts rests on a flawed assumption: a chart is not merely an arrangement of pixels but a visual representation of structured data governed by a graphical grammar. Consequently, chart editing is not a pixel-manipulation task but a structured transformation problem. To address this fundamental mismatch, we introduce FigEdit, a large-scale benchmark for scientific figure editing comprising over 30,000 samples. Grounded in real-world data, our benchmark is distinguished by its diversity, covering 10 distinct chart types and a rich vocabulary of complex editing instructions. The benchmark is organized into five distinct and progressively challenging tasks: single edits, multi edits, conversational edits, visual-guidance-based edits, and style transfer. Our evaluation of a range of state-of-the-art models on this benchmark reveals their poor performance on scientific figures, as they consistently fail to handle the underlying structured transformations required for valid edits. Furthermore, our analysis indicates that traditional evaluation metrics (e.g., SSIM, PSNR) have limitations in capturing the semantic correctness of chart edits.
Our benchmark demonstrates the profound limitations of pixel-level manipulation and provides a robust foundation for developing and evaluating future structure-aware models. By releasing \textit{FigEdit} (https://github.com/adobe-research/figure-editing), we aim to enable systematic progress in structure-aware figure editing, provide a common ground for fair comparison, and encourage future research on models that understand both the visual and semantic layers of scientific charts.
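To make the remark about pixel metrics concrete, here is a minimal sketch of PSNR on toy grayscale images stored as nested lists. The toy images and the function name `psnr` are illustrative assumptions, not part of FigEdit; the point is only that PSNR depends on mean squared error and nothing else, so two edits that change different chart elements can receive the same score.

```python
import math


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images.

    Images are nested lists of pixel intensities (a toy stand-in for real arrays).
    """
    squared_errors = [
        (r - e) ** 2
        for row_r, row_e in zip(reference, edited)
        for r, e in zip(row_r, row_e)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 "chart": the two edits alter different pixels by the same amount.
reference = [[10, 10], [10, 10]]
edit_a = [[20, 10], [10, 10]]  # e.g. the requested bar was changed
edit_b = [[10, 10], [10, 20]]  # e.g. an unrelated bar was changed

# Both edits have the same MSE, so PSNR cannot tell them apart.
same_score = psnr(reference, edit_a) == psnr(reference, edit_b)
```

Under these assumptions `same_score` is true, which is one simple way a pixel-level metric can miss whether the requested structural change was actually made.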

[167] Seeing the Wind from a Falling Leaf

Zhiyuan Gao,Jiageng Mao,Hong-Xing Yu,Haozhe Lou,Emily Yue-Ting Jia,Jernej Barbic,Jiajun Wu,Yue Wang

Main category: cs.CV

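TL;DR: This paper proposes an end-to-end differentiable inverse graphics framework that recovers invisible physical forces (such as wind fields) from video. It jointly models object geometry, physical properties, and interactions, and infers force representations from object motion via backpropagation.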

Details Motivation: The physical interactions hidden behind visual motion have long been overlooked. This paper aims to recover these invisible forces from visual observations in order to better understand the physical processes behind pixels. Method: The authors propose an end-to-end differentiable inverse graphics framework that jointly models object geometry, physical properties, and interactions, and recovers force representations from object motion in video through backpropagation. Result: The method is validated on synthetic and real-world scenes, infers plausible force fields from videos, and shows potential for physics-based video generation and editing. Conclusion: The approach offers a new way to connect vision and physics and helps in understanding and modeling the physical processes behind visual data. Abstract: A longstanding goal in computer vision is to model motions from videos, while the representations behind motions, i.e. the invisible physical interactions that cause objects to deform and move, remain largely unexplored. In this paper, we study how to recover the invisible forces from visual observations, e.g., estimating the wind field by observing a leaf falling to the ground. Our key innovation is an end-to-end differentiable inverse graphics framework, which jointly models object geometry, physical properties, and interactions directly from videos. Through backpropagation, our approach enables the recovery of force representations from object motions. We validate our method on both synthetic and real-world scenarios, and the results demonstrate its ability to infer plausible force fields from videos. Furthermore, we show the potential applications of our approach, including physics-based video generation and editing. We hope our approach sheds light on understanding and modeling the physical process behind pixels, bridging the gap between vision and physics. More video results are available on our project page (https://chaoren2357.github.io/seeingthewind/).

[168] The Outline of Deception: Physical Adversarial Attacks on Traffic Signs Using Edge Patches

Haojie Jia,Te Hu,Haowen Li,Long Jin,Chongshi Xin,Yuchi Yao,Jiarui Xiao

Main category: cs.CV

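TL;DR: This paper proposes TESP-Attack, a new stealthy physical adversarial patch method that uses edge-aligned masks and visual blending optimization to achieve high attack success rates on traffic sign classifiers while remaining hard to notice.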

Details Motivation: Existing physical adversarial attacks usually add conspicuous perturbations to the central region of a traffic sign, which human observers spot easily, limiting their practical use. More stealthy attacks are needed to assess the security of intelligent driving systems. Method: Instance segmentation is used to generate edge-aligned masks that follow the shape of the traffic sign. A U-Net generator produces adversarial patches, which are optimized with color and texture constraints and frequency domain analysis so that they blend seamlessly with the background. Result: The method achieves attack success rates above 90% on traffic sign classification models with varied architectures, shows strong cross-model transferability, and performs stably in the real world under varying angles and distances. Conclusion: TESP-Attack is an effective, stealthy, and practical physical adversarial attack that exposes security vulnerabilities of intelligent driving systems in real scenes and can inform the design of defenses. Abstract: Intelligent driving systems are vulnerable to physical adversarial attacks on traffic signs. These attacks can cause misclassification, leading to erroneous driving decisions that compromise road safety. Moreover, within V2X networks, such misinterpretations can propagate, inducing cascading failures that disrupt overall traffic flow and system stability. However, a key limitation of current physical attacks is their lack of stealth. Most methods apply perturbations to central regions of the sign, resulting in visually salient patterns that are easily detectable by human observers, thereby limiting their real-world practicality. This study proposes TESP-Attack, a novel stealth-aware adversarial patch method for traffic sign classification. Based on the observation that human visual attention primarily focuses on the central regions of traffic signs, we employ instance segmentation to generate edge-aligned masks that conform to the shape characteristics of the signs. A U-Net generator is utilized to craft adversarial patches, which are then optimized through color and texture constraints along with frequency domain analysis to achieve seamless integration with the background environment, resulting in highly effective visual concealment. The proposed method demonstrates outstanding attack success rates across traffic sign classification models with varied architectures, achieving over 90% under limited query budgets. It also exhibits strong cross-model transferability and maintains robust real-world performance that remains stable under varying angles and distances.

[169] Generalized Medical Phrase Grounding

Wenjun Zhang,Shekhar S. Chandra,Aaron Nicolson

Main category: cs.CV

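TL;DR: This paper introduces the generalized medical phrase grounding (GMPG) task and MedGrounder, the first GMPG model. It handles multi-region and non-groundable phrases and outperforms conventional approaches while using fewer annotations.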

Details Motivation: Existing medical phrase grounding systems assume exactly one bounding box per phrase, so they cannot handle multi-region findings, non-diagnostic text, and non-groundable phrases (such as negations or descriptions of normal anatomy) that are common in real reports. The authors therefore propose the more realistic GMPG task definition. Method: MedGrounder is trained in two stages: pre-training on report sentence-anatomy box alignment datasets, then fine-tuning on human-annotated sentence-box datasets. The original task is extended so that each sentence maps to zero, one, or multiple scored regions. Result: Experiments on PadChest-GR and MS-CXR show that MedGrounder outperforms REC-style methods and grounded report generation baselines on multi-region and non-groundable phrases, transfers well zero-shot, and needs fewer human box annotations. It can also be combined with existing report generators to produce grounded reports without retraining them. Conclusion: By adopting the more flexible GMPG formulation, MedGrounder addresses the limitations of conventional medical phrase grounding, adapts better to the complexity of real clinical reports, and is practical and compatible with existing systems. Abstract: Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopted a two-stage training regime: pre-training on report sentence-anatomy box alignment datasets and fine-tuning on report sentence-human annotated box datasets. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.
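The difference between the REC output contract (exactly one box) and the GMPG contract (zero, one, or multiple scored regions) can be sketched in a few lines. The function name `ground_sentence`, the score threshold of 0.5, and the box format are hypothetical choices made for illustration; the abstract does not say how MedGrounder selects its regions.

```python
def ground_sentence(scored_boxes, threshold=0.5):
    """Return every candidate region whose score clears the threshold.

    scored_boxes: list of (box, score) pairs, box = (x_min, y_min, x_max, y_max).
    The result may be empty (non-groundable sentence), a single region,
    or several regions (multi-region finding), sorted by descending score.
    The threshold is an assumed hyperparameter, not a value from the paper.
    """
    kept = [(box, score) for box, score in scored_boxes if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for a sentence describing a bilateral finding.
bilateral = [((10, 20, 60, 90), 0.81), ((120, 25, 170, 95), 0.77), ((0, 0, 5, 5), 0.05)]
# Hypothetical candidates for a negated sentence: nothing should be grounded.
negated = [((30, 30, 80, 80), 0.12)]

multi_region = ground_sentence(bilateral)  # two regions survive
no_region = ground_sentence(negated)       # empty list
```

A REC-style system would be forced to return one box in both cases, which is exactly the assumption the abstract argues real reports violate.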

[170] EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes

Xiaoshan Wu,Yifei Yu,Xiaoyang Lyu,Yihua Huang,Bo Wang,Baoheng Zhang,Zhongrui Wang,Xiaojuan Qi

Main category: cs.CV

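TL;DR: This paper proposes EAG3R, a 3D geometry estimation framework that incorporates asynchronous event streams. It reconstructs video geometry robustly in low-light and dynamic scenes and performs well on night-time scenes without retraining.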

Details Motivation: Existing RGB-video-based 3D geometry estimation methods degrade with dynamic objects and extreme illumination, mainly because of the inherent limitations of conventional cameras, and struggle to keep reconstructions accurate. Method: Built on the MonST3R framework, EAG3R adds two components: (1) a Retinex-inspired image enhancement module and a lightweight event adapter with SNR-aware fusion that adaptively combines RGB and event features, and (2) an event-based photometric consistency loss that strengthens spatiotemporal coherence during global optimization. Result: Experiments show that EAG3R significantly outperforms existing RGB-only methods on monocular depth estimation, camera pose tracking, and dynamic reconstruction, especially in low-light and dynamic scenes. Conclusion: By fusing event camera data, EAG3R achieves robust and efficient 3D geometry estimation in challenging real-world scenes without retraining on night-time data, offering a new direction for applications such as SLAM and 3D reconstruction. Abstract: Robust 3D geometry estimation from videos is critical for applications such as autonomous navigation, SLAM, and 3D scene reconstruction. Recent methods like DUSt3R demonstrate that regressing dense pointmaps from image pairs enables accurate and efficient pose-free reconstruction. However, existing RGB-only approaches struggle under real-world conditions involving dynamic objects and extreme illumination, due to the inherent limitations of conventional cameras. In this paper, we propose EAG3R, a novel geometry estimation framework that augments pointmap-based reconstruction with asynchronous event streams. Built upon the MonST3R backbone, EAG3R introduces two key innovations: (1) a retinex-inspired image enhancement module and a lightweight event adapter with SNR-aware fusion mechanism that adaptively combines RGB and event features based on local reliability; and (2) a novel event-based photometric consistency loss that reinforces spatiotemporal coherence during global optimization. Our method enables robust geometry estimation in challenging dynamic low-light scenes without requiring retraining on night-time data. Extensive experiments demonstrate that EAG3R significantly outperforms state-of-the-art RGB-only baselines across monocular depth estimation, camera pose tracking, and dynamic reconstruction tasks.

[171] Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation

Zirui Zhao,Boye Niu,David Hsu,Wee Sun Lee

Main category: cs.CV

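TL;DR: The paper proposes a constraint-guided framework that combines geometric reasoning with neural semantics. Using Monte-Carlo Tree Search and a fine-tuned vision-language model, it achieves higher validity and semantic fidelity on abstract visual composition tasks.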

Details Motivation: Conventional generative models struggle with abstract visual composition, where combinatorial structure leads to a sparse solution space, geometric constraints, and limited data. Method: An AlphaGo-style search enforces geometric feasibility, while a fine-tuned vision-language model provides semantic alignment rewards. A policy network serves as the heuristic in Monte-Carlo Tree Search and is in turn fine-tuned on search-generated plans. Adversarial reward refinement is borrowed from the GAN idea. Result: On the Tangram Assembly task, the method yields higher validity (no overlap, allowable orientations) and semantic fidelity than diffusion and autoregressive baselines, with a larger advantage as constraints tighten. Conclusion: A framework that combines explicit geometric reasoning with learned neural semantics handles the sparse solution space and multiple constraints of abstract visual composition more effectively. Abstract: We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.

[172] DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering

Toshiki Katsube,Taiga Fukuhara,Kenichiro Ando,Yusuke Mukuta,Kohei Uehara,Tatsuya Harada

Main category: cs.CV

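TL;DR: This paper presents a scalable and reproducible pipeline for building large-scale Japanese vision-language datasets and releases DEJIMA-Cap and DEJIMA-VQA, each with 3.88M samples, which improve the performance of Japanese multimodal models.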

Details Motivation: Existing Japanese vision-language datasets are small, of low quality, and lack cultural representativeness, which holds back Japanese multimodal models. Method: The pipeline combines large-scale web collection, rigorous filtering and deduplication, object-detection-driven evidence extraction, and LLM-based refinement under grounding constraints to build high-quality datasets. Result: The authors build DEJIMA-Cap and DEJIMA-VQA, each containing 3.88M image-text pairs. Human evaluation shows higher Japanese naturalness and cultural fit than translated or manually annotated datasets, and image features cover a broad range of domains. Models trained on the data improve on multiple Japanese multimodal benchmarks. Conclusion: Culturally grounded, large-scale, high-quality datasets are key to better Japanese multimodal modeling. The released datasets and tools permit commercial use and are public, supporting both academic and industrial applications. Abstract: This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (V&L) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement under grounding constraints. Using this pipeline, we build two resources: an image-caption dataset (DEJIMA-Cap) and a VQA dataset (DEJIMA-VQA), each containing 3.88M image-text pairs, far exceeding the size of existing Japanese V&L datasets. Human evaluations demonstrate that DEJIMA achieves substantially higher Japaneseness and linguistic naturalness than datasets constructed via translation or manual annotation, while maintaining factual correctness at a level comparable to human-annotated corpora. Quantitative analyses of image feature distributions further confirm that DEJIMA broadly covers diverse visual domains characteristic of Japan, complementing its linguistic and cultural representativeness. Models trained on DEJIMA exhibit consistent improvements across multiple Japanese multimodal benchmarks, confirming that culturally grounded, large-scale resources play a key role in enhancing model performance. All data sources and modules in our pipeline are licensed for commercial use, and we publicly release the resulting dataset and metadata to encourage further research and industrial applications in Japanese V&L modeling.

[173] StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

Daeun Lee,Subhojyoti Mukherjee,Branislav Kveton,Ryan A. Rossi,Viet Dac Lai,Seunghyun Yoon,Trung Bui,Franck Dernoncourt,Mohit Bansal

Main category: cs.CV

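TL;DR: This paper introduces StreamGaze, the first benchmark for evaluating how well multimodal large language models use gaze signals for temporal and proactive reasoning in streaming videos.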

Details Motivation: Existing streaming video benchmarks do not measure whether MLLMs can interpret or use human gaze signals, which limits evaluation of a model's ability to understand user intent in realistic settings such as AR glasses. Method: The authors build a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction, producing spatio-temporally grounded QA pairs. They also design an evaluation suite covering past, present, and proactive tasks. Result: On all StreamGaze tasks, state-of-the-art MLLMs fall well short of human performance, exposing fundamental weaknesses in gaze-based temporal reasoning, intention modeling, and proactive prediction. Conclusion: StreamGaze reveals key shortcomings of current MLLMs in gaze-guided streaming understanding and points to clear directions for future models, including better integration of gaze signals for proactive perception and intent inference. Abstract: Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.

[174] PolarGS: Polarimetric Cues for Ambiguity-Free Gaussian Splatting with Accurate Geometry Recovery

Bo Guo,Sijia Wen,Yifan Zhao,Jia Li,Zhiming Zheng

Main category: cs.CV

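TL;DR: The paper proposes PolarGS, which uses polarization cues to improve 3D Gaussian Splatting reconstruction. Polarization-guided photometric correction and a Gaussian densification mechanism markedly improve geometric accuracy in reflective and textureless regions.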

Details Motivation: In photometrically ambiguous regions such as reflective and textureless surfaces, conventional RGB-based 3D Gaussian Splatting reconstruction lacks reliable photometric consistency, so geometry estimation degrades. Method: Two modules are introduced: (1) reflective regions are identified via the Degree of Linear Polarization (DoLP) and photometrically corrected using Color Refinement Maps; (2) a PatchMatch depth completion process that incorporates the Angle and Degree of Linear Polarization (A/DoLP) densifies Gaussians and recovers geometry in textureless areas. Result: PolarGS achieves better geometric accuracy than existing state-of-the-art methods across a variety of scenes, particularly in reflective and textureless regions, and the framework is broadly applicable. Conclusion: Polarization is an effective optical prior that substantially mitigates the impact of photometric ambiguity on 3D reconstruction, and PolarGS provides an efficient and accurate extension of 3D Gaussian Splatting. Abstract: Recent advances in surface reconstruction for 3D Gaussian Splatting (3DGS) have enabled remarkable geometric accuracy. However, their performance degrades in photometrically ambiguous regions such as reflective and textureless surfaces, where unreliable cues disrupt photometric consistency and hinder accurate geometry estimation. Reflected light is often partially polarized in a manner that reveals surface orientation, making polarization an optical complement to photometric cues in resolving such ambiguities. Therefore, we propose PolarGS, an optics-aware extension of RGB-based 3DGS that leverages polarization as an optical prior to resolve photometric ambiguities and enhance reconstruction accuracy. Specifically, we introduce two complementary modules: a polarization-guided photometric correction strategy, which ensures photometric consistency by identifying reflective regions via the Degree of Linear Polarization (DoLP) and refining reflective Gaussians with Color Refinement Maps; and a polarization-enhanced Gaussian densification mechanism for textureless area geometry recovery, which integrates both Angle and Degree of Linear Polarization (A/DoLP) into a PatchMatch-based depth completion process. This enables the back-projection and fusion of new Gaussians, leading to more complete reconstruction. PolarGS is framework-agnostic and achieves superior geometric accuracy compared to state-of-the-art methods.
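For readers unfamiliar with the polarization quantities named above, the following minimal sketch computes DoLP and the angle of linear polarization (AoLP) for one pixel from the standard linear Stokes parameters. It assumes a capture through polarizers at 0, 45, 90, and 135 degrees, which is a common setup but is not stated in the abstract; the helper names and the reflectivity threshold are hypothetical.

```python
import math


def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from intensities behind 0/45/90/135 degree polarizers."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs vertical component
    s2 = i45 - i135                     # diagonal components
    return s0, s1, s2


def dolp_aolp(s0, s1, s2):
    """Degree (0..1) and angle (radians) of linear polarization."""
    if s0 <= 0:
        return 0.0, 0.0  # no light: treat as unpolarized
    dolp = math.sqrt(s1 * s1 + s2 * s2) / s0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


def looks_reflective(dolp, threshold=0.3):
    """Flag a pixel as likely reflective; the threshold is an assumed value."""
    return dolp >= threshold


# Light fully polarized at 45 degrees.
dolp_diag, aolp_diag = dolp_aolp(*stokes_from_intensities(0.5, 1.0, 0.5, 0.0))
# Unpolarized light: equal intensity behind every polarizer.
dolp_flat, aolp_flat = dolp_aolp(*stokes_from_intensities(0.5, 0.5, 0.5, 0.5))
```

A high DoLP marks strongly polarized (often specular) light, which is the kind of cue the abstract describes using to find reflective regions; how PolarGS thresholds or weights it is not specified here.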

[175] CircleFlow: Flow-Guided Camera Blur Estimation using a Circle Grid Target

Jiajian He,Enjie Hu,Shiqi Chen,Tianchen Qiu,Huajun Feng,Zhihai Xu,Yueting Chen

Main category: cs.CV

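TL;DR: This paper proposes CircleFlow, a high-fidelity point spread function (PSF) estimation framework that characterizes blur precisely through flow-guided edge localization and achieves state-of-the-art accuracy and reliability on simulated and real data.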

Details Motivation: Accurate PSF estimation is essential for optical characterization and computational vision, but it remains difficult because intensity-based deconvolution is inherently ambiguous and ill-posed. Method: CircleFlow uses a structured capture of a circle grid target and exploits the target's binary luminance prior to decouple image and kernel estimation. The latent sharp image is reconstructed by subpixel alignment of an initialized binary structure guided by optical flow, the PSF is modeled as an energy-constrained implicit neural representation, and both are jointly optimized in a demosaicing-aware differentiable framework. Result: Extensive experiments on simulated and real-world data show that CircleFlow outperforms existing methods in the accuracy and robustness of PSF estimation. Conclusion: Through precise edge localization and a physically consistent joint optimization framework, CircleFlow delivers high-performance PSF estimation and is shown to be effective for practical PSF calibration. Abstract: The point spread function (PSF) serves as a fundamental descriptor linking the real-world scene to the captured signal, manifesting as camera blur. Accurate PSF estimation is crucial for both optical characterization and computational vision, yet remains challenging due to the inherent ambiguity and the ill-posed nature of intensity-based deconvolution. We introduce CircleFlow, a high-fidelity PSF estimation framework that employs flow-guided edge localization for precise blur characterization. CircleFlow begins with a structured capture that encodes locally anisotropic and spatially varying PSFs by imaging a circle grid target, while leveraging the target's binary luminance prior to decouple image and kernel estimation. The latent sharp image is then reconstructed through subpixel alignment of an initialized binary structure guided by optical flow, whereas the PSF is modeled as an energy-constrained implicit neural representation. Both components are jointly optimized within a demosaicing-aware differentiable framework, ensuring physically consistent and robust PSF estimation enabled by accurate edge localization. Extensive experiments on simulated and real-world data demonstrate that CircleFlow achieves state-of-the-art accuracy and reliability, validating its effectiveness for practical PSF calibration.

[176] Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

Pengfei Hu,Meng Cao,Yingyao Wang,Yi Wang,Jiahua Dong,Jun Song,Yu Cheng,Bo Zheng,Xiaodan Liang

Main category: cs.CV

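TL;DR: The paper proposes SpecTemp, a reinforcement learning-based speculative temporal reasoning framework in which a lightweight draft MLLM and a powerful target MLLM cooperate to improve the efficiency and accuracy of long video understanding.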

Details Motivation: The existing thinking-with-frames paradigm suffers from redundant multi-modal context and an efficiency bottleneck on long videos. Method: A cooperative dual-model design is used: a lightweight draft MLLM quickly selects salient frames, and a target MLLM performs temporal reasoning and verifies the proposals, with reinforcement learning enabling dynamic optimization. The authors also build the SpecTemp-80K dataset with coarse-grained and fine-grained annotations for training. Result: Experiments on multiple video understanding benchmarks show that SpecTemp keeps accuracy competitive while significantly speeding up inference. Conclusion: By decoupling temporal perception from reasoning and mimicking the collaborative mechanisms of the human brain, SpecTemp greatly improves inference efficiency for long video understanding while preserving accuracy, offering an efficient new paradigm for video MLLMs. Abstract: Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm, which alternates between global temporal reasoning and local frame examination, has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft's proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.

[177] IRPO: Boosting Image Restoration via Post-training GRPO

Haoxuan Xu,Yi Liu,Boyuan Jiang,Jinlong Peng,Donghao Luo,Xiaobin Hu,Shuicheng Yan,Haoang Li

Main category: cs.CV

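TL;DR: This paper proposes IRPO, a low-level GRPO-based post-training paradigm for image restoration that improves performance and generalization through data formulation and reward modeling.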

Details Motivation: Existing image restoration methods rely on pixel-level hard-fitting, which leads to over-smoothing and poor generalization, and the potential of post-training paradigms for low-level vision has not been fully explored. Method: IRPO builds its training data from samples on which the pre-trained model underperforms, and designs a multi-part reward system consisting of a General Reward, an Expert Reward (based on Qwen-VL), and a Restoration Reward to balance accuracy and perceptual quality. Result: IRPO achieves state-of-the-art results on six in-domain and five out-of-domain low-level benchmarks, improving over the AdaIR baseline by 0.83 dB in-domain and 3.43 dB out-of-domain. Conclusion: IRPO provides an efficient and generalizable post-training paradigm for low-level vision tasks and significantly improves image restoration across diverse degradation types. Abstract: Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains rarely explored. Existing image restoration (IR) methods rely on pixel-level hard-fitting to ground-truth images, struggling with over-smoothing and poor generalization. To address these limitations, we propose IRPO, a low-level GRPO-based post-training paradigm that systematically explores both data formulation and reward modeling. We first explore a data formulation principle for the low-level post-training paradigm, in which selecting underperforming samples from the pre-training stage yields optimal performance and improved efficiency. Furthermore, we model a reward-level criteria system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB on OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.
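As background for the GRPO-based setup, the sketch below shows the group-relative advantage that GRPO-style training commonly uses: several candidate outputs are sampled for the same input, each is scored, and every score is normalized by the mean and standard deviation of its group. The way the three rewards are combined (a weighted sum with equal weights) and all function names are assumptions for illustration; the abstract does not give IRPO's exact formula.

```python
import statistics


def combined_reward(general, expert, restoration, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward components (weights are hypothetical)."""
    w_general, w_expert, w_restoration = weights
    return w_general * general + w_expert * expert + w_restoration * restoration


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and population std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for three restored candidates of one degraded image.
group = [
    combined_reward(0.2, 0.3, 0.5),  # total 1.0
    combined_reward(0.6, 0.7, 0.7),  # total 2.0
    combined_reward(1.0, 1.0, 1.0),  # total 3.0
]
advantages = group_relative_advantages(group)
```

Candidates that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged, so no separate value network is needed.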

[178] PanFlow: Decoupled Motion Control for Panoramic Video Generation

Cheng Zhang,Hanwen Liang,Donny Y. Chen,Qianyi Wu,Konstantinos N. Plataniotis,Camilo Cruz Gambardella,Jianfei Cai

Main category: cs.CV

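TL;DR: The paper proposes PanFlow, a new method that exploits the spherical nature of panoramas to decouple camera rotation from the optical flow condition, enabling precise control over large dynamic motions, and uses a spherical noise warping strategy to improve loop consistency of motion.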

Details Motivation: Existing panoramic video generation methods lack explicit motion control and struggle with scenes containing large, complex motions. Method: Spherical geometry is used to decouple camera rotation from the input optical flow, a spherical noise warping strategy improves loop consistency of motion across panorama boundaries, and a large-scale, motion-rich panoramic video dataset with pose and flow annotations is curated for training. Result: PanFlow significantly outperforms prior methods in motion fidelity, visual quality, and temporal coherence, and performs well in applications such as motion transfer and video editing. Conclusion: Through spherical modeling and a new training strategy, PanFlow generates high-quality panoramic videos with complex dynamics, advancing virtual reality and immersive media. Abstract: Panoramic video generation has attracted growing attention due to its applications in virtual reality and immersive media. However, existing methods lack explicit motion control and struggle to generate scenes with large and complex motions. We propose PanFlow, a novel approach that exploits the spherical nature of panoramas to decouple the highly dynamic camera rotation from the input optical flow condition, enabling more precise control over large and dynamic motions. We further introduce a spherical noise warping strategy to promote loop consistency in motion across panorama boundaries. To support effective training, we curate a large-scale, motion-rich panoramic video dataset with frame-level pose and flow annotations. We also showcase the effectiveness of our method in various applications, including motion transfer and video editing. Extensive experiments demonstrate that PanFlow significantly outperforms prior methods in motion fidelity, visual quality, and temporal coherence. Our code, dataset, and models are available at https://github.com/chengzhag/PanFlow.

[179] AFRAgent: An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

Neeraj Anand,Rishabh Jain,Sohan Patnaik,Balaji Krishnamurthy,Mausoom Sarkar

Main category: cs.CV

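TL;DR: This paper proposes AFRAgent, an instruct-BLIP-based multimodal architecture that improves image embedding quality with an adaptive feature renormalization technique and outperforms existing models on smartphone UI automation while being smaller.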

Details Motivation: Existing vision-language models have trouble identifying widgets and actions accurately in GUI automation because vision encoder features carry limited spatial information, and high-performing models are usually large and slow at inference, so a more efficient and accurate solution is needed. Method: AFRAgent uses an instruct-BLIP-based multimodal architecture and introduces an adaptive feature renormalization technique (a token-level affine transformation) that enriches low-resolution image embeddings and fuses high-resolution details to improve the image representation in the LLM. Result: On the Meta-GUI and AITW benchmarks, AFRAgent sets a new state of the art with less than one-fourth the size of its nearest competitor. Conclusion: By improving how image features are fused, AFRAgent raises the accuracy and efficiency of GUI automation while substantially reducing model size, offering an efficient and practical solution for mobile UI automation. Abstract: There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
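A token-level affine transformation, as the abstract describes the renormalization step, can be sketched as a per-token scale and shift applied to each embedding. In the sketch below the scales and shifts are plain inputs; the assumption that they would be predicted from high-resolution features, and every name used, is hypothetical rather than taken from AFRAgent.

```python
def token_affine(tokens, scales, shifts):
    """Apply an element-wise affine map to every token embedding.

    tokens, scales, shifts: lists of equal-length float lists, one per token.
    In a setup like the one described, scales and shifts would presumably be
    produced by a small network reading high-resolution features; here they
    are supplied directly (an assumption made for illustration).
    """
    return [
        [s * x + b for x, s, b in zip(token, scale, shift)]
        for token, scale, shift in zip(tokens, scales, shifts)
    ]


# Two hypothetical low-resolution image tokens with 3-dimensional embeddings.
low_res_tokens = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
# Hypothetical modulation parameters, one scale/shift vector per token.
scales = [[1.0, 0.5, 2.0], [1.0, 1.0, 1.0]]
shifts = [[0.0, 1.0, -1.0], [0.5, 0.5, 0.5]]

enriched = token_affine(low_res_tokens, scales, shifts)
```

Because the modulation only rescales and shifts existing tokens, the number of image tokens passed to the language model stays the same, which is consistent with the abstract's emphasis on a compact model.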

[180] Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Haishan Wang,Mohammad Hassan Vali,Arno Solin

Main category: cs.CV

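TL;DR: Smol-GS is a method for learning compact representations for 3D Gaussian Splatting. A recursive voxel hierarchy and abstract features enable efficient compression while preserving high rendering quality and supporting downstream 3D scene understanding tasks.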

Details Motivation: The goal is to greatly reduce the storage and compute cost of 3D Gaussian Splatting (3DGS) without sacrificing flexibility, while retaining spatial and semantic information. Method: A recursive voxel hierarchy encodes splat coordinates, and splat-wise features store abstracted cues such as color, opacity, transformation, and material properties, giving a compact and efficient 3D scene representation. Result: Smol-GS achieves state-of-the-art compression on standard benchmarks, greatly reducing scene size while maintaining high rendering quality. Conclusion: Smol-GS learns compact 3DGS representations effectively, strikes a strong balance between compression rate and visual fidelity, and has the potential to support downstream tasks such as navigation and planning. Abstract: We present Smol-GS, a novel method for learning compact representations for 3D Gaussian Splatting (3DGS). Our approach learns highly efficient encodings in 3D space that integrate both spatial and semantic information. The model captures the coordinates of the splats through a recursive voxel hierarchy, while splat-wise features store abstracted cues, including color, opacity, transformation, and material properties. This design allows the model to compress 3D scenes by orders of magnitude without loss of flexibility. Smol-GS achieves state-of-the-art compression on standard benchmarks while maintaining high rendering quality. Beyond visual fidelity, the discrete representations could potentially serve as a foundation for downstream tasks such as navigation, planning, and broader 3D scene understanding.

[181] TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models

Tim Veenboer,George Yiasemis,Eric Marcus,Vivien Van Veldhuizen,Cees G. M. Snoek,Jonas Teuwen,Kevin B. W. Groot Lipman

Main category: cs.CV

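TL;DR: This paper introduces TAP-CT, a task-agnostic pretrained foundation model for 3D CT volumes. It adapts Vision Transformers and DINOv2 by modifying patch embeddings, positional encodings, and data augmentation, yielding strong multi-task representations that generalize without extensive fine-tuning.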

Details Motivation: Existing medical foundation models usually require extensive fine-tuning or rely on resource-intensive decoders, and many encoders are pretrained with task-biased objectives, so a truly task-agnostic, lightweight, and efficient general-purpose feature extractor is missing. Method: The TAP-CT framework adapts ViT and DINOv2 to self-supervised pretraining on 3D CT volumes. Key changes include depth-aware patch embeddings, 3D positional encodings, and volumetric augmentations, with large-scale pretraining performed directly on 105K CT volumes. Result: Experiments show that the resulting frozen features generalize strongly across multiple downstream tasks with stable, robust performance, and linear probing alone gives strong results without further fine-tuning. Conclusion: TAP-CT provides a strong, general, low-resource baseline for medical imaging foundation models and promotes transparency and reproducibility. All models and code will be released to support follow-up research. Abstract: Existing foundation models (FMs) in the medical domain often require extensive fine-tuning or rely on training resource-intensive decoders, while many existing encoders are pretrained with objectives biased toward specific tasks. This illustrates a need for a strong, task-agnostic foundation model that requires minimal fine-tuning beyond feature extraction. In this work, we introduce a suite of task-agnostic pretraining of CT foundation models (TAP-CT): a simple yet effective adaptation of Vision Transformers (ViTs) and DINOv2 for volumetric data, enabling scalable self-supervised pretraining directly on 3D CT volumes. Our approach incorporates targeted modifications to patch embeddings, positional encodings, and volumetric augmentations, making the architecture depth-aware while preserving the simplicity of the underlying architectures. We show that large-scale 3D pretraining on an extensive in-house CT dataset (105K volumes) yields stable, robust frozen representations that generalize strongly across downstream tasks. To promote transparency and reproducibility, and to establish a powerful, low-resource baseline for future research in medical imaging, we will release all pretrained models, experimental configurations, and downstream benchmark code at https://huggingface.co/fomofo/tap-ct-b-3d.

[182] Neural Discrete Representation Learning for Sparse-View CBCT Reconstruction: From Algorithm Design to Prospective Multicenter Clinical Evaluation

Haoshen Wang,Lei Chen,Wei-Hua Zhang,Linxia Wu,Yong Luo,Zengmao Wang,Yuan Xiong,Chengcheng Zhu,Wenjuan Tang,Xueyi Zhang,Wei Zhou,Xuhua Duan,Lefei Zhang,Gao-Jun Teng,Bo Du,Huangxuan Zhao

Main category: cs.CV

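TL;DR: The paper proposes DeepPriorCBCT, a three-stage deep learning framework that achieves diagnostic-grade cone beam CT (CBCT) reconstruction using only one-sixth of the conventional radiation dose. Multicenter retrospective data and a prospective clinical trial show image quality not significantly different from the conventional approach, with a marked reduction in intraoperative radiation risk.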

Details Motivation: CBCT-guided puncture is widely used in diagnosing and treating thoracic tumours, but its radiation exposure raises the risk of secondary malignancies. Several low-dose strategies have been proposed, yet none has been validated on large-scale multicenter retrospective data or evaluated prospectively in the clinic. Method: The authors develop and validate DeepPriorCBCT, a three-stage deep learning framework, using 8675 CBCT scans from 4102 patients at 12 centers. A prospective cross-over trial (NCT07035977, 138 patients) assesses clinical applicability, with image quality and diagnostic performance assessed by 11 physicians, 5 radiologists, and 25 interventionalists. Result: Assessors found reconstructed images indistinguishable from the original scans, and diagnostic performance and overall image quality were comparable to standard reconstruction algorithms. In the prospective trial, radiologists found no significant differences in image quality or lesion assessment (all P>0.05), and interventionalists showed no preference between image types for surgical guidance (Kappa<0.2). Radiation dose fell to about one-sixth of the conventional level. Conclusion: DeepPriorCBCT achieves high-quality CBCT reconstruction under sparse sampling, markedly lowers intraoperative radiation exposure, and is clinically applicable, providing a well-validated solution for low-dose CBCT. Abstract: Cone beam computed tomography (CBCT)-guided puncture has become an established approach for diagnosing and treating early- to mid-stage thoracic tumours, yet the associated radiation exposure substantially elevates the risk of secondary malignancies. Although multiple low-dose CBCT strategies have been introduced, none have undergone validation using large-scale multicenter retrospective datasets, and prospective clinical evaluation remains lacking. Here, we propose DeepPriorCBCT, a three-stage deep learning framework that achieves diagnostic-grade reconstruction using only one-sixth of the conventional radiation dose. 4102 patients with 8675 CBCT scans from 12 centers were included to develop and validate DeepPriorCBCT. Additionally, a prospective cross-over trial (Registry number: NCT07035977) which recruited 138 patients scheduled for percutaneous thoracic puncture was conducted to assess the model's clinical applicability. Assessment by 11 physicians confirmed that reconstructed images were indistinguishable from original scans. Moreover, diagnostic performance and overall image quality were comparable to those generated by standard reconstruction algorithms. In the prospective trial, five radiologists reported no significant differences in image quality or lesion assessment between DeepPriorCBCT and the clinical standard (all P>0.05). Likewise, 25 interventionalists expressed no preference between model-based and full-sampling images for surgical guidance (Kappa<0.2). Radiation exposure with DeepPriorCBCT was reduced to approximately one-sixth of that with the conventional approach, and collectively, the findings confirm that it enables high-quality CBCT reconstruction under sparse sampling conditions while markedly decreasing intraoperative radiation risk.

[183] Feed-Forward 3D Gaussian Splatting Compression with Long-Context Modeling

Zhening Liu,Rui Song,Yushi Huang,Yingdong Hu,Xinjie Zhang,Jiawei Shao,Zehong Lin,Jun Zhang

Main category: cs.CV

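TL;DR: Proposes a new feed-forward 3D Gaussian Splatting compression framework that models long-range correlations through a large-scale context structure and attention, achieving a 20× compression ratio and state-of-the-art performance among generalizable codecs.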

Details Motivation: Existing 3DGS compression methods struggle to model long-range spatial dependencies, limited by the receptive field of transform coding networks and the context capacity of entropy models. Method: Introduces a large-scale context structure based on Morton serialization, designs a fine-grained space-channel auto-regressive entropy model, and uses an attention-based transform coding model that aggregates features from neighboring Gaussians to extract latent priors. Result: Achieves a 20× compression ratio in feed-forward inference and state-of-the-art performance among generalizable codecs. Conclusion: The method compresses 3D Gaussian Splatting representations efficiently, improving compression efficiency and generalization and supporting wider adoption. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a revolutionary 3D representation. However, its substantial data size poses a major barrier to widespread adoption. While feed-forward 3DGS compression offers a practical alternative to costly per-scene per-train compressors, existing methods struggle to model long-range spatial dependencies, due to the limited receptive field of transform coding networks and the inadequate context capacity in entropy models. In this work, we propose a novel feed-forward 3DGS compression framework that effectively models long-range correlations to enable highly compact and generalizable 3D representations. Central to our approach is a large-scale context structure that comprises thousands of Gaussians based on Morton serialization. We then design a fine-grained space-channel auto-regressive entropy model to fully leverage this expansive context. Furthermore, we develop an attention-based transform coding model to extract informative latent priors by aggregating features from a wide range of neighboring Gaussians. Our method yields a $20\times$ compression ratio for 3DGS in a feed-forward inference and achieves state-of-the-art performance among generalizable codecs.
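To make the Morton serialization step concrete, here is a minimal sketch of ordering 3D Gaussian centers along a Z-order curve and cutting the sequence into fixed-size context windows. The grid resolution (`bits`), the window size, and the helper names `morton3d`, `morton_order`, and `windows` are illustrative assumptions; the abstract does not specify how the authors quantize positions or size their contexts.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of integer grid coordinates into one Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def quantize(points, bits=10):
    """Map float positions onto an integer grid with 2**bits cells per axis."""
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    scale = (1 << bits) - 1
    return [
        tuple(
            int((p[a] - lo[a]) / (hi[a] - lo[a]) * scale) if hi[a] > lo[a] else 0
            for a in range(3)
        )
        for p in points
    ]


def morton_order(points, bits=10):
    """Return point indices sorted by Morton code, so spatial neighbors tend to be adjacent."""
    cells = quantize(points, bits)
    return sorted(range(len(points)), key=lambda i: morton3d(*cells[i], bits=bits))


def windows(order, size):
    """Cut the serialized index sequence into consecutive context windows (assumed size)."""
    return [order[i:i + size] for i in range(0, len(order), size)]


centers = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
order = morton_order(centers)
print(order, windows(order, 2))
```

Sorting by the interleaved code keeps spatially close Gaussians close in the sequence, which is what lets a one-dimensional context window stand in for a 3D neighborhood.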

[184] Quantum-Inspired Spectral Geometry for Neural Operator Equivalence and Structured Pruning

Haijian Shao,Wei Liu,Xing Deng

Main category: cs.CV

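TL;DR: Proposes a quantum-inspired geometric framework for efficiently optimizing multimodal neural operators on resource-constrained, heterogeneous hardware. By representing singular value spectra on the Bloch hypersphere, it lays a theoretical foundation for cross-modal and cross-architecture operator substitutability, and controlled simulation shows the proposed metric outperforming magnitude and random baselines.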

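Details Motivation: Addresses the bottlenecks that multimodal intelligence faces on resource-constrained, heterogeneous hardware: feature heterogeneity, real-time requirements, and hardware-specific operator redundancy. Method: Introduces a quantum-inspired geometric framework that represents each neural operator by its normalized singular value spectrum on the Bloch hypersphere, proves a spectral-to-functional equivalence theorem, and builds Quantum Metric-Driven Functional Redundancy Graphs (QM-FRG) together with a one-shot structured pruning method. Result: Proves that operators become functionally close as the spectral distance approaches zero; controlled simulation shows the proposed metric outperforming magnitude and random baselines, while large-scale experimental validation is deferred to an extended journal version. Conclusion: The framework provides a theoretical basis and practical tools for redundancy analysis and compression of multimodal and cross-architecture neural operators, and may improve inference efficiency on heterogeneous hardware. Abstract: The rapid growth of multimodal intelligence on resource-constrained and heterogeneous domestic hardware exposes critical bottlenecks: multimodal feature heterogeneity, real-time requirements in dynamic scenarios, and hardware-specific operator redundancy. This work introduces a quantum-inspired geometric framework for neural operators that represents each operator by its normalized singular value spectrum on the Bloch hypersphere. We prove a tight spectral-to-functional equivalence theorem showing that vanishing Fubini-Study/Wasserstein-2 distance implies provable functional closeness, establishing the first rigorous foundation for cross-modal and cross-architecture operator substitutability. Based on this metric, we propose Quantum Metric-Driven Functional Redundancy Graphs (QM-FRG) and one-shot structured pruning. Controlled simulation validates the superiority of the proposed metric over magnitude and random baselines. An extensive experimental validation on large-scale multimodal transformers and domestic heterogeneous hardware (Huawei Ascend, Cambricon MLU, Kunlunxin) is deferred to an extended journal version currently in preparation.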
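As a rough illustration of comparing operators by their normalized singular value spectra, the sketch below computes the standard Fubini-Study angle, arccos of the absolute inner product, between two unit-normalized spectra. It assumes the singular values are already available as equal-length lists; how the authors pad or align spectra, and how the Wasserstein-2 term enters their metric, is not stated in the abstract, so `fubini_study` here is an assumption rather than their definition.

```python
import math


def normalize(spectrum):
    """Scale a singular value spectrum to unit Euclidean norm."""
    norm = math.sqrt(sum(s * s for s in spectrum))
    if norm == 0.0:
        raise ValueError("spectrum must contain a non-zero singular value")
    return [s / norm for s in spectrum]


def fubini_study(spec_a, spec_b):
    """Fubini-Study angle between two spectra viewed as unit state vectors."""
    if len(spec_a) != len(spec_b):
        raise ValueError("spectra must have the same length (alignment is assumed)")
    a, b = normalize(spec_a), normalize(spec_b)
    overlap = abs(sum(x * y for x, y in zip(a, b)))
    # Clamp to guard against floating-point overshoot just above 1.0.
    return math.acos(min(1.0, overlap))


# Two operators whose spectra differ only by a global scale are at distance ~0.
print(fubini_study([3.0, 2.0, 1.0], [6.0, 4.0, 2.0]))
# Spectra concentrated on different singular directions are maximally far apart.
print(fubini_study([1.0, 0.0], [0.0, 1.0]))
```

Because the spectra are normalized first, the distance ignores overall operator scale and responds only to the shape of the spectrum.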

[185] Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints

Xisheng Feng

Main category: cs.CV

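TL;DR: Proposes the "Look, Recite, Then Answer" framework, which uses self-generated knowledge hints to mitigate reasoning-driven hallucination in vision-language models for precision agriculture, narrowing the modality gap and improving performance.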

Details Motivation: Addresses two problems of VLMs in specialized domains such as precision agriculture: "Reasoning-Driven Hallucination", where linguistic priors override visual perception, and the "Modality Gap", where visual embeddings fail to activate the fine-grained expert knowledge stored in model parameters. Method: Proposes a three-stage decoupled framework: the Look stage generates objective visual descriptions and candidate sets; the Recite stage uses a lightweight router to turn visual cues into targeted queries that trigger specific knowledge; the Answer stage performs parallel evidence alignment between descriptions and knowledge to select the most consistent label. The backbone models stay frozen, which makes the enhancement parameter-efficient. Result: Achieves state-of-the-art results on AgroBench, improving Weed Identification accuracy by 23.6% over Qwen-VL and surpassing GPT-4o without external search overhead. Conclusion: By turning passive perception into active, controllable knowledge retrieval, the modular design mitigates hallucination and offers an efficient, practical route to VLM applications in specialized domains. Abstract: Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to "Reasoning-Driven Hallucination" where linguistic priors override visual perception. A key bottleneck is the "Modality Gap": visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose "Look, Recite, Then Answer," a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.6% over Qwen-VL and surpassing GPT-4o without external search overhead. This modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval.

[186] HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

Masatoshi Tateno,Gido Kato,Hirokatsu Kataoka,Yoichi Sato,Takuma Yagi

Main category: cs.CV

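TL;DR: Introduces HanDyVQA, a fine-grained video question-answering benchmark for comprehensively evaluating manipulation and effect dynamics in hand-object interaction (HOI). It contains 11.1K QA pairs and 10.3K segmentation masks, and experiments show that existing models still have substantial room for improvement.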

Details Motivation: Existing HOI datasets do not model the fine-grained spatio-temporal dynamics between manipulation and object state changes, so they cannot support evaluation of deep HOI understanding. Method: Builds the HanDyVQA benchmark covering six question types (Action, Process, Objects, Location, State Change, Object Parts), collects 11.1K multiple-choice QA pairs and 10.3K segmentation masks, and evaluates current video foundation models. Result: The best-performing model, Gemini-2.5-Pro, reaches 73% average accuracy on the benchmark, far below human performance (97%); analysis reveals challenges in spatial relationships, motion understanding, and part-level geometric reasoning. Conclusion: HanDyVQA provides a new evaluation standard for fine-grained HOI dynamics; integrating explicit HOI-related cues improves model performance, and future models need stronger modeling of spatio-temporal and structural dynamics in hand-object interaction. Abstract: Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focus either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs require recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.

[187] Multilingual Training-Free Remote Sensing Image Captioning

Carlos Rebelo,Gil Rocha,João Daniel Silva,Bruno Martins

Main category: cs.CV

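TL;DR: Proposes a training-free multilingual method for remote sensing image captioning, based on retrieval-augmented prompting and graph-based re-ranking, that performs well across multiple languages and datasets.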

Details Motivation: Existing remote sensing captioning methods rely on large annotated datasets and are limited to English, which restricts global applicability, so a training-free, multilingual method is needed. Method: Uses a domain-adapted SigLIP2 encoder for retrieval-augmented prompting, combined with a multilingual Large Language Model (LLM) or a Vision-Language Model (VLM), and re-ranks the retrieved content with PageRank over a graph of images and captions. Result: Experiments on four benchmark datasets across ten languages show the method is competitive with fully supervised English-only systems; re-ranking yields improvements of up to 35%; VLMs produce more visually grounded captions, while LLMs score higher on BLEU and CIDEr; generating captions directly in the target language consistently outperforms translation-based strategies. Conclusion: The work delivers one of the first systematic evaluations of training-free multilingual captioning for remote sensing imagery, advancing more inclusive and scalable multimodal Earth observation systems. Abstract: Remote sensing image captioning has advanced rapidly through encoder-decoder models, although the reliance on large annotated datasets and the focus on English restricts global applicability. To address these limitations, we propose the first training-free multilingual approach, based on retrieval-augmented prompting. For a given aerial image, we employ a domain-adapted SigLIP2 encoder to retrieve related captions and few-shot examples from a datastore, which are then provided to a language model. We explore two variants: an image-blind setup, where a multilingual Large Language Model (LLM) generates the caption from textual prompts alone, and an image-aware setup, where a Vision-Language Model (VLM) jointly processes the prompt and the input image. To improve the coherence of the retrieved content, we introduce a graph-based re-ranking strategy using PageRank on a graph of images and captions. Experiments on four benchmark datasets across ten languages demonstrate that our approach is competitive with fully supervised English-only systems and generalizes to other languages. Results also highlight the importance of re-ranking with PageRank, yielding up to 35% improvements in performance metrics. Additionally, it was observed that while VLMs tend to generate visually grounded but lexically diverse captions, LLMs can achieve stronger BLEU and CIDEr scores. Lastly, directly generating captions in the target language consistently outperforms translation-based strategies. Overall, our work delivers one of the first systematic evaluations of multilingual, training-free captioning for remote sensing imagery, advancing toward more inclusive and scalable multimodal Earth observation systems.
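The PageRank re-ranking idea can be sketched with a plain power iteration over a small image-caption graph. The graph construction below (captions linked to the images that retrieved them), the damping factor, and the node names are hypothetical; the abstract does not describe how edges or weights are actually defined.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on a graph given as {node: [neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = damping * rank[v] / len(out)
                for u in out:
                    new[u] += share
            else:
                # Dangling node: spread its mass uniformly over all nodes.
                share = damping * rank[v] / n
                for u in nodes:
                    new[u] += share
        rank = new
    return rank


# Hypothetical retrieval graph: an edge links a caption to an image that retrieved it.
graph = {
    "img_a": ["cap1", "cap2"],
    "img_b": ["cap1"],
    "cap1": ["img_a", "img_b"],
    "cap2": ["img_a"],
}
scores = pagerank(graph)
captions = sorted((v for v in graph if v.startswith("cap")), key=scores.get, reverse=True)
print(captions)
```

Captions supported by more of the retrieved images accumulate more rank, so they move to the front of the prompt.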

[188] Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

Yiyu Wang,Xuyang Liu,Xiyan Gui,Xinying Lin,Boxue Yang,Chenfei Liao,Tailai Chen,Linfeng Zhang

Main category: cs.CV

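TL;DR: Proposes STC, an acceleration framework for streaming video large language models that caches features of similar frames and prunes redundant visual tokens, retaining up to 99% of accuracy while significantly reducing latency in the ViT encoding and LLM pre-filling stages.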

Details Motivation: Existing streaming VideoLLMs are hard to deploy in real time because processing dense visual tokens from continuous video is computationally expensive; in particular, the ViT encoding stage recomputes temporally similar frames, and overly long token sequences during LLM pre-filling reduce efficiency. Method: Proposes the STC framework with two components: STC-Cacher caches and reuses ViT features of temporally similar frames to cut encoding overhead; STC-Pruner selects the most salient visual tokens based on spatial and temporal relevance, compressing the token sequence fed to the LLM. The method integrates seamlessly into existing streaming VideoLLMs. Result: Experiments on four baseline models and five benchmarks show that STC outperforms other compression methods; on the ReKV framework it reduces ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3% while retaining up to 99% of accuracy. Conclusion: STC is an efficient, plug-and-play acceleration scheme for streaming VideoLLMs that alleviates redundant computation in video streams, improves inference speed and resource utilization, and suits latency-sensitive applications. Abstract: Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%, respectively.
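The caching idea behind STC-Cacher can be illustrated with a deliberately simplified frame-level sketch: if an incoming frame is similar enough to the last encoded reference frame, the stored features are reused instead of calling the encoder again. The class name `FrameFeatureCache`, the cosine-similarity test on raw frame vectors, and the 0.95 threshold are all assumptions for illustration; the abstract describes STC-Cacher as a token-level accelerator and does not give its similarity criterion.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class FrameFeatureCache:
    """Reuse encoder output for frames close to the last encoded reference frame."""

    def __init__(self, encoder, threshold=0.95):
        self.encoder = encoder          # stand-in for an expensive ViT forward pass
        self.threshold = threshold      # assumed similarity cut-off
        self.ref_frame = None
        self.ref_features = None
        self.encoder_calls = 0

    def encode(self, frame):
        if self.ref_frame is not None and cosine(frame, self.ref_frame) >= self.threshold:
            return self.ref_features    # cache hit: skip the encoder
        self.ref_frame = frame
        self.ref_features = self.encoder(frame)
        self.encoder_calls += 1
        return self.ref_features


cache = FrameFeatureCache(encoder=lambda f: [2.0 * x for x in f])
stream = [[1.0, 0.0, 0.0], [0.99, 0.01, 0.0], [0.0, 1.0, 0.0]]
features = [cache.encode(f) for f in stream]
print(cache.encoder_calls)
```

In this toy stream the second frame is nearly identical to the first, so only two of the three frames reach the encoder.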

[189] SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Chaojun Ni,Cheng Chen,Xiaofeng Wang,Zheng Zhu,Wenzhao Zheng,Boyuan Wang,Tianrun Chen,Guosheng Zhao,Haoyun Li,Zhehao Dong,Qiang Zhang,Yun Ye,Yang Wang,Guan Huang,Wenjun Mei

Main category: cs.CV

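TL;DR: SwiftVLA is an efficient, lightweight Vision-Language-Action (VLA) model. It introduces a 4D visual geometry transformer and Fusion Tokens to model 2D images and 4D spatiotemporal features jointly, and uses a mask-and-reconstruct strategy so the 4D branch can be dropped at inference, significantly improving efficiency.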

Details Motivation: Existing VLA models depend on large vision-language models with many parameters and high compute cost, while lightweight models lack spatiotemporal reasoning; methods that add 3D inputs or rely on large VLMs to fuse modalities still suffer from weak temporal understanding and heavy resource use. Method: Proposes the SwiftVLA architecture: (1) a pretrained 4D visual geometry transformer with a temporal cache extracts 4D features from 2D images; (2) Fusion Tokens, a set of learnable tokens trained with a future prediction objective, fuse 2D and 4D information; (3) a mask-and-reconstruct strategy trains the model to reconstruct masked 4D inputs, so the 4D branch can be dropped at inference. Result: In real and simulated environments, SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger; on edge devices it reaches comparable performance while running 18 times faster with a 12 times smaller memory footprint. Conclusion: Through efficient architectural design, SwiftVLA gains strong 4D understanding at very low resource cost while reaching competitive performance, making it suitable for resource-constrained deployments. Abstract: Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.

[190] Hierarchical Semantic Alignment for Image Clustering

Xingyu Zhu,Beier Zhu,Yunfan Li,Junfeng Fang,Shuo Wang,Kesen Zhao,Hanwang Zhang

Main category: cs.CV

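TL;DR: Proposes CAE, a training-free hierarchical semantic alignment method that improves image clustering by combining noun-level concepts with caption-level semantics.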

Details Motivation: Existing noun-based image clustering methods overlook the inherent ambiguity of nouns, which distorts semantic representations and degrades clustering quality. Method: Builds a semantic space from relevant WordNet nouns and image caption texts, aligns image features with the nouns and captions via optimal transport, and fuses the enhanced semantics with image features for clustering. Result: Validated on 8 datasets; on ImageNet-1K, the method improves accuracy by 4.2% and ARI by 2.9% over the state-of-the-art training-free approach. Conclusion: By aligning fine-grained captions and high-level category concepts in a complementary way, CAE alleviates semantic ambiguity and significantly improves training-free image clustering. Abstract: Image clustering is a classic problem in computer vision, which categorizes images into different groups. Recent studies utilize nouns as external semantic knowledge to improve clustering performance. However, these methods often overlook the inherent ambiguity of nouns, which can distort semantic representations and degrade clustering quality. To address this issue, we propose a hierarChical semAntic alignmEnt method for image clustering, dubbed CAE, which improves clustering performance in a training-free manner. In our approach, we incorporate two complementary types of textual semantics: caption-level descriptions, which convey fine-grained attributes of image content, and noun-level concepts, which represent high-level object categories. We first select relevant nouns from WordNet and descriptions from caption datasets to construct a semantic space aligned with image features. Then, we align image features with selected nouns and captions via optimal transport to obtain a more discriminative semantic space. Finally, we combine the enhanced semantic and image features to perform clustering. Extensive experiments across 8 datasets demonstrate the effectiveness of our method, notably surpassing the state-of-the-art training-free approach with a 4.2% improvement in accuracy and a 2.9% improvement in adjusted rand index (ARI) on the ImageNet-1K dataset.

[191] TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model

Alireza Javanmardi,Pragati Jaiswal,Tewodros Amberbir Habtegebrial,Christen Millerdurai,Shaoxiang Wang,Alain Pagani,Didier Stricker

Main category: cs.CV

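TL;DR: Proposes TalkingPose, a diffusion-based framework for generating long-form, temporally consistent human upper-body animation. A feedback mechanism enables coherent animation of unlimited duration without extra computational cost, and a large-scale dataset is released as a new benchmark.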

Details Motivation: Existing diffusion-based character animation methods are constrained by compute and memory, can only handle short clips, and struggle to generate long, coherent animation. Method: Proposes TalkingPose, built on image-based diffusion models, which uses driving frames to capture facial and hand movements and a feedback-driven mechanism to strengthen temporal continuity, supporting animation of unlimited length. Result: Produces high-quality, long, temporally consistent upper-body animation without additional computational cost or secondary training, and releases a large-scale dataset for evaluation. Conclusion: TalkingPose addresses temporal incoherence in long-form animation generation and offers an efficient, scalable solution for character animation. Abstract: Recent advancements in diffusion models have significantly improved the realism and generalizability of character-driven animation, enabling the synthesis of high-quality motion from just a single RGB image and a set of driving poses. Nevertheless, generating temporally coherent long-form content remains challenging. Existing approaches are constrained by computational and memory limitations, as they are typically trained on short video segments, thus performing effectively only over limited frame lengths and hindering their potential for extended coherent generation. To address these constraints, we propose TalkingPose, a novel diffusion-based framework specifically designed for producing long-form, temporally consistent human upper-body animations. TalkingPose leverages driving frames to precisely capture expressive facial and hand movements, transferring these seamlessly to a target actor through a stable diffusion backbone. To ensure continuous motion and enhance temporal coherence, we introduce a feedback-driven mechanism built upon image-based diffusion models. Notably, this mechanism does not incur additional computational costs or require secondary training stages, enabling the generation of animations with unlimited duration. Additionally, we introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.

[192] Dual-Projection Fusion for Accurate Upright Panorama Generation in Robotic Vision

Yuhao Shan,Qianyi Yuan,Jingguo Liu,Shigang Li,Jianfeng Li,Tong Chen

Main category: cs.CV

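TL;DR: Proposes a dual-stream angle-aware generation network that jointly estimates camera inclination angles and reconstructs upright panoramic images. It combines CNN and ViT branches with several supporting modules and outperforms existing methods on the SUN360 and M3D datasets.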

Details Motivation: Non-upright panoramas caused by unstable robot postures hinder downstream vision tasks; traditional IMU-based methods suffer from drift, and vision-only methods need better accuracy and robustness. Method: Designs a dual-stream network: a CNN branch processes equirectangular projections to extract local geometric structure, a ViT branch captures global context from cubemap projections, and a dual-projection adaptive fusion module integrates the features; a high-frequency enhancement block, circular padding, and channel attention preserve 360° continuity and geometric sensitivity. Result: On the SUN360 and M3D datasets, the method outperforms existing approaches in both inclination estimation and upright panorama generation; ablation studies confirm the contribution of each module and the synergy between the two tasks. Conclusion: By combining multiple projection representations with a feature fusion strategy, the dual-stream network corrects non-upright panoramas effectively and improves the stability and performance of robotic vision systems in complex environments. Abstract: Panoramic cameras, capable of capturing a 360-degree field of view, are crucial in robotic vision, particularly in environments with sparse features. However, non-upright panoramas due to unstable robot postures hinder downstream tasks. Traditional IMU-based correction methods suffer from drift and external disturbances, while vision-based approaches offer a promising alternative. This study presents a dual-stream angle-aware generation network that jointly estimates camera inclination angles and reconstructs upright panoramic images. The network comprises a CNN branch that extracts local geometric structures from equirectangular projections and a ViT branch that captures global contextual cues from cubemap projections. These are integrated through a dual-projection adaptive fusion module that aligns spatial features across both domains. To further enhance performance, we introduce a high-frequency enhancement block, circular padding, and channel attention mechanisms to preserve 360° continuity and improve geometric sensitivity. Experiments on the SUN360 and M3D datasets demonstrate that our method outperforms existing approaches in both inclination estimation and upright panorama generation. Ablation studies further validate the contribution of each module and highlight the synergy between the two tasks. The code and related datasets can be found at: https://github.com/YuhaoShine/DualProjectionFusion.
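Circular padding is what keeps the left and right edges of an equirectangular panorama continuous under convolution. The following minimal sketch wraps columns of a small 2D grid; the function name `circular_pad_width` and the list-of-rows image layout are illustrative choices, not taken from the authors' code.

```python
def circular_pad_width(image, pad):
    """Pad each row by wrapping columns around, so column 0 follows the last column.

    `image` is a list of equal-length rows; `pad` columns are added on each side.
    """
    width = len(image[0])
    if pad < 0 or pad > width:
        raise ValueError("pad must be between 0 and the image width")
    if pad == 0:
        return [list(row) for row in image]
    return [list(row[-pad:]) + list(row) + list(row[:pad]) for row in image]


panorama = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
print(circular_pad_width(panorama, 1))
```

With this padding, a convolution kernel centered on the first column sees the last column as its left neighbor, which matches the geometry of a 360° image; zero padding would instead introduce an artificial seam.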

[193] ForamDeepSlice: A High-Accuracy Deep Learning Framework for Foraminifera Species Classification from 2D Micro-CT Slices

Abdelghafour Halimi,Ali Alibrahim,Didier Barradas-Bautista,Ronell Sicat,Abdulkader M. Afifi

Main category: cs.CV

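TL;DR: Presents a deep learning pipeline that classifies 12 foraminifera species from 2D micro-CT slices. The study builds a high-quality dataset, applies state-of-the-art CNN models and an ensemble to reach 95.64% test accuracy, and provides an interactive dashboard for real-time classification and 3D slice matching.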

Details Motivation: Traditional foraminifera classification is manual, time-consuming, dependent on expert knowledge, and prone to subjectivity; existing automated methods fall short in data quality and species coverage and lack reproducibility and practical deployment support. Method: Uses micro-CT scans of 97 specimens across 27 species, selects 12 well-represented species, splits training, validation, and test sets at the specimen level (109,617 2D slices in total), applies transfer learning with seven state-of-the-art 2D CNN architectures, builds an ensemble of ConvNeXt-Large and EfficientNetV2-Small, and develops an interactive dashboard with similarity metrics including SSIM, NCC, and the Dice coefficient. Result: The ensemble reaches 95.64% test accuracy, 99.6% top-3 accuracy, and an AUC of 0.998 across all species; the dashboard supports real-time classification and 3D slice matching, improving efficiency and interpretability. Conclusion: The study sets a new benchmark for AI-assisted micropaleontological identification and provides a fully reproducible framework that bridges deep learning and applied geosciences. Abstract: This study presents a comprehensive deep learning pipeline for the automated classification of 12 foraminifera species using 2D micro-CT slices derived from 3D scans. We curated a scientifically rigorous dataset comprising 97 micro-CT scanned specimens across 27 species, selecting 12 species with sufficient representation for robust machine learning. To ensure methodological integrity and prevent data leakage, we employed specimen-level data splitting, resulting in 109,617 high-quality 2D slices (44,103 for training, 14,046 for validation, and 51,468 for testing). We evaluated seven state-of-the-art 2D convolutional neural network (CNN) architectures using transfer learning. Our final ensemble model, combining ConvNeXt-Large and EfficientNetV2-Small, achieved a test accuracy of 95.64%, with a top-3 accuracy of 99.6% and an area under the ROC curve (AUC) of 0.998 across all species. To facilitate practical deployment, we developed an interactive advanced dashboard that supports real-time slice classification and 3D slice matching using advanced similarity metrics, including SSIM, NCC, and the Dice coefficient. This work establishes new benchmarks for AI-assisted micropaleontological identification and provides a fully reproducible framework for foraminifera classification research, bridging the gap between deep learning and applied geosciences.
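Two of the similarity metrics named for the slice-matching dashboard, the Dice coefficient and normalized cross-correlation (NCC), have compact standard definitions. The sketch below implements both on small nested lists; it uses the textbook formulas and is not drawn from the authors' dashboard code, and any thresholding used to turn slices into binary masks is left as an assumption.

```python
import math


def dice(mask_a, mask_b):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks of equal shape."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 2.0 * intersection / total


def ncc(img_a, img_b):
    """Zero-mean normalized cross-correlation in [-1, 1] for two equal-shape images."""
    a = [float(v) for row in img_a for v in row]
    b = [float(v) for row in img_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return 0.0 if den == 0.0 else num / den


print(dice([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
print(ncc([[1, 2], [3, 4]], [[2, 4], [6, 8]]))
```

Dice compares the overlap of segmented shapes, while NCC compares intensity patterns and is insensitive to brightness and contrast scaling, so the two metrics complement each other when matching a query slice against slices from a 3D scan.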

[194] LAHNet: Local Attentive Hashing Network for Point Cloud Registration

Wentao Qu,Xiaoshui Huang,Liang Xiao

Main category: cs.CV

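TL;DR: Proposes the Local Attentive Hashing Network (LAHNet) for point cloud registration. It introduces a local attention mechanism with a locality inductive bias and combines a Group Transformer with an Interaction Transformer to enlarge the receptive field and sharpen feature distinctiveness, achieving strong registration results on real-world indoor and outdoor benchmarks.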

Details Motivation: Existing learning-based point cloud descriptors focus mainly on local information and lack a reasonable, broad receptive field, which limits feature distinctiveness; a method that perceives wider context while staying locally sensitive is needed to improve registration. Method: Proposes LAHNet. A Group Transformer uses Locality-Sensitive Hashing (LSH) to partition point clouds uniformly into non-overlapping windows and adopts a cross-window strategy to expand the receptive field; an Interaction Transformer computes an overlap matrix, representing each window as a global signal, to strengthen feature interaction in the overlap regions of point cloud pairs. Result: Experiments show that LAHNet learns robust, distinctive features and achieves strong registration results on several real-world indoor and outdoor datasets. Conclusion: With a local attention mechanism and an effective windowing strategy, LAHNet enlarges the receptive field and improves feature interaction for point cloud descriptors, providing an efficient, scalable solution for point cloud registration. Abstract: Most existing learning-based point cloud descriptors for point cloud registration focus on perceiving local information of point clouds to generate distinctive features. However, a reasonable and broader receptive field is essential for enhancing feature distinctiveness. In this paper, we propose a Local Attentive Hashing Network for point cloud registration, called LAHNet, which introduces a local attention mechanism with the inductive bias of locality of convolution-like operators into point cloud descriptors. Specifically, a Group Transformer is designed to capture reasonable long-range context between points. This employs a linear neighborhood search strategy, Locality-Sensitive Hashing, enabling uniformly partitioning point clouds into non-overlapping windows. Meanwhile, an efficient cross-window strategy is adopted to further expand the reasonable feature receptive field. Furthermore, building on this effective windowing strategy, we propose an Interaction Transformer to enhance the feature interactions of the overlap regions within point cloud pairs. This computes an overlap matrix to match overlap regions between point cloud pairs by representing each window as a global signal. Extensive results demonstrate that LAHNet can learn robust and distinctive features, achieving significant registration results on real-world indoor and outdoor benchmarks.

[195] SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding

Keita Otani,Tatsuya Harada

Main category: cs.CV

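TL;DR: Proposes SceneProp, which reformulates scene-graph grounding as Maximum a Posteriori inference in a Markov Random Field and uses differentiable Belief Propagation for global reasoning, significantly improving localization for complex queries.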

Details Motivation: Existing phrase grounding methods perform poorly on complex visual queries with multiple objects and relationships because they lack the structural inductive bias to parse intricate relational descriptions. Existing scene-graph grounding methods also degrade as the query graph grows, failing to exploit the additional relational information. Method: Models scene-graph grounding as a Maximum a Posteriori (MAP) inference problem in a Markov Random Field (MRF) and uses a differentiable Belief Propagation algorithm within an end-to-end framework to perform global inference that jointly satisfies all constraints of the query graph. Result: Experiments on four benchmarks show that SceneProp significantly outperforms prior methods, and its accuracy improves consistently as the query graph grows in size and complexity. Conclusion: SceneProp shows for the first time that more relational context can, and should, lead to better grounding, resolving the performance degradation of existing methods on complex queries. Abstract: Grounding complex, compositional visual queries with multiple objects and relationships is a fundamental challenge for vision-language models. While standard phrase grounding methods excel at localizing single objects, they lack the structural inductive bias to parse intricate relational descriptions, often failing as queries become more descriptive. To address this structural deficit, we focus on scene-graph grounding, a powerful but less-explored formulation where the query is an explicit graph of objects and their relationships. However, existing methods for this task also struggle, paradoxically showing decreased performance as the query graph grows, failing to leverage the very information that should make grounding easier. We introduce SceneProp, a novel method that resolves this issue by reformulating scene-graph grounding as a Maximum a Posteriori (MAP) inference problem in a Markov Random Field (MRF). By performing global inference over the entire query graph, SceneProp finds the optimal assignment of image regions to nodes that jointly satisfies all constraints. This is achieved within an end-to-end framework via a differentiable implementation of the Belief Propagation algorithm. Experiments on four benchmarks show that our dedicated focus on the scene-graph grounding formulation allows SceneProp to significantly outperform prior work. Critically, its accuracy consistently improves with the size and complexity of the query graph, demonstrating for the first time that more relational context can, and should, lead to better grounding. Codes are available at https://github.com/keitaotani/SceneProp.
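To show what MAP inference by message passing looks like in this setting, here is a small max-product sketch in the log domain for a query graph that happens to be a chain (object 0, relation, object 1, relation, object 2). It is a non-differentiable toy under stated assumptions: the chain restriction, the score tables, and the function name `map_chain` are all hypothetical, whereas the paper describes a differentiable Belief Propagation implementation over general query graphs.

```python
def map_chain(unary, pairwise):
    """Max-product message passing on a chain-structured MRF (log-domain scores).

    unary[i][r]       : score for assigning image region r to query node i
    pairwise[i][r][s] : score for regions (r, s) satisfying the relation on edge (i, i+1)
    Returns (best assignment, its total score).
    """
    msg = list(unary[0])
    backpointers = []
    for i in range(1, len(unary)):
        new_msg, ptr = [], []
        for s in range(len(unary[i])):
            # Best predecessor region for node i-1 given that node i takes region s.
            best_r = max(range(len(msg)), key=lambda r: msg[r] + pairwise[i - 1][r][s])
            new_msg.append(msg[best_r] + pairwise[i - 1][best_r][s] + unary[i][s])
            ptr.append(best_r)
        msg = new_msg
        backpointers.append(ptr)
    last = max(range(len(msg)), key=lambda s: msg[s])
    assignment = [last]
    for ptr in reversed(backpointers):
        assignment.append(ptr[assignment[-1]])
    assignment.reverse()
    return assignment, msg[last]


# Three query nodes, two candidate regions each; relations prefer matching region indices.
unary = [[0.0, -1.0], [-0.5, -0.2], [-2.0, 0.0]]
relation = [[0.0, -3.0], [-3.0, 0.0]]
assignment, score = map_chain(unary, [relation, relation])
print(assignment, score)
```

Node 0 on its own prefers region 0, but the relation scores pull the joint optimum to region 1 for every node, which is the kind of correction that global inference over the whole query graph provides.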

[196] Binary-Gaussian: Compact and Progressive Representation for 3D Gaussian Segmentation

An Yang,Chenyu Liu,Jun Du,Jianqing Gao,Jia Pan,Jinshui Hu,Baocai Yin,Bing Yin,Cong Liu

Main category: cs.CV

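TL;DR: Proposes an efficient segmentation method for 3D Gaussian Splatting that uses binary encoding and a progressive training strategy, significantly reducing memory consumption and improving fine-grained segmentation.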

Details Motivation: Existing 3D-GS-based segmentation methods rely on high-dimensional features, which causes large memory overhead, and fine-grained segmentation is limited by label space congestion and the lack of multi-granularity control. Method: Proposes a coarse-to-fine binary encoding scheme that compresses each Gaussian's category feature into a single integer; designs a progressive training strategy that decomposes panoptic segmentation into a series of independent sub-tasks; and fine-tunes opacity during training to ease the conflict between photometric rendering and semantic segmentation. Result: Achieves state-of-the-art segmentation performance on multiple benchmarks while significantly reducing memory use and accelerating inference. Conclusion: The method keeps performance high while sharply cutting resource consumption, making 3D-GS more practical for semantic segmentation. Abstract: 3D Gaussian Splatting (3D-GS) has emerged as an efficient 3D representation and a promising foundation for semantic tasks like segmentation. However, existing 3D-GS-based segmentation methods typically rely on high-dimensional category features, which introduce substantial memory overhead. Moreover, fine-grained segmentation remains challenging due to label space congestion and the lack of stable multi-granularity control mechanisms. To address these limitations, we propose a coarse-to-fine binary encoding scheme for per-Gaussian category representation, which compresses each feature into a single integer via the binary-to-decimal mapping, drastically reducing memory usage. We further design a progressive training strategy that decomposes panoptic segmentation into a series of independent sub-tasks, reducing inter-class conflicts and thereby enhancing fine-grained segmentation capability. Additionally, we fine-tune opacity during segmentation training to address the incompatibility between photometric rendering and semantic segmentation, which often leads to foreground-background confusion. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art segmentation performance while significantly reducing memory consumption and accelerating inference.
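The binary-to-decimal mapping can be sketched as packing a coarse-to-fine sequence of binary decisions into one integer per Gaussian. The bit ordering (coarsest decision as the most significant bit) and the helper names `encode`, `decode`, and `coarse_label` are assumptions for illustration; the abstract does not specify the layout the authors use.

```python
def encode(bits):
    """Pack coarse-to-fine binary decisions into one integer (coarsest bit first)."""
    code = 0
    for b in bits:
        if b not in (0, 1):
            raise ValueError("each decision must be 0 or 1")
        code = (code << 1) | b
    return code


def decode(code, depth):
    """Recover the list of `depth` binary decisions from the packed integer."""
    return [(code >> (depth - 1 - i)) & 1 for i in range(depth)]


def coarse_label(code, depth, level):
    """Read the label at a coarser granularity by keeping only the first `level` bits."""
    return code >> (depth - level)


# A Gaussian whose category is decided as 1 (coarse), then 0, then 1 (fine).
code = encode([1, 0, 1])
print(code, decode(code, 3), coarse_label(code, 3, 2))
```

Storing one integer instead of a high-dimensional feature vector is where the memory saving comes from, and truncating the low bits gives a coarser label without re-encoding.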

[197] Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval

Haojian Huang,Kaijing Ma,Jin Chen,Haodong Chen,Zhou Wu,Xianghao Zang,Han Fang,Chao Ban,Hao Sun,Mulin Chen,Zhongjiang He

Main category: cs.CV

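TL;DR: Proposes DEMR, a debiased evidential learning framework that improves video moment retrieval from natural language queries by refining uncertainty estimation for more accurate and robust cross-modal alignment.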

Details Motivation: Traditional methods struggle with fine-grained alignment and deterministic reasoning on complex or ambiguous moments, and existing evidential regression suffers from modality imbalance and biased uncertainty estimation. Method: Builds a baseline with Deep Evidential Regression and proposes the DEMR framework, which includes a Reflective Flipped Fusion (RFF) block, a query reconstruction task, and a Geom-regularizer to improve cross-modal alignment and uncertainty estimation. Result: Achieves significant gains on standard datasets and on the debiased datasets ActivityNet-CD and Charades-CD, with stronger robustness and interpretability. Conclusion: DEMR mitigates modality imbalance and uncertainty miscalibration, improving the accuracy and temporal-semantic robustness of moment retrieval. Abstract: In the domain of moment retrieval, accurately identifying temporal segments within videos based on natural language queries remains challenging. Traditional methods often employ pre-trained models that struggle with fine-grained information and deterministic reasoning, leading to difficulties in aligning with complex or ambiguous moments. To overcome these limitations, we explore Deep Evidential Regression (DER) to construct a vanilla Evidential baseline. However, this approach encounters two major issues: the inability to effectively handle modality imbalance and the structural differences in DER's heuristic uncertainty regularizer, which adversely affect uncertainty estimation. This misalignment results in high uncertainty being incorrectly associated with accurate samples rather than challenging ones. Our observations indicate that existing methods lack the adaptability required for complex video scenarios. In response, we propose Debiased Evidential Learning for Moment Retrieval (DEMR), a novel framework that incorporates a Reflective Flipped Fusion (RFF) block for cross-modal alignment and a query reconstruction task to enhance text sensitivity, thereby reducing bias in uncertainty estimation. Additionally, we introduce a Geom-regularizer to refine uncertainty predictions, enabling adaptive alignment with difficult moments and improving retrieval accuracy. Extensive testing on standard datasets and debiased datasets ActivityNet-CD and Charades-CD demonstrates significant enhancements in effectiveness, robustness, and interpretability, positioning our approach as a promising solution for temporal-semantic robustness in moment retrieval. The code is publicly available at https://github.com/KaijingOfficial/DEMR.

[198] Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction

Boran Wen,Ye Lu,Keyan Wan,Sirui Wang,Jiahong Zhou,Junxuan Liang,Xinpeng Liu,Bang Xiao,Dingbang Huang,Ruiyang Liu,Yong-Lu Li

Main category: cs.CV

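TL;DR: Proposes 4DHOISolver, an optimization framework that uses sparse human-annotated contact points to reconstruct 4D human-object interaction (4D HOI) from monocular internet videos, and releases the large-scale Open4DHOI dataset to support robot imitation learning and 3D foundation models.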

Details Motivation: Accurately and scalably extracting 4D human-object interaction data from in-the-wild monocular videos remains unsolved, and existing methods struggle to guarantee spatio-temporal coherence and physical plausibility, so new methods are needed to exploit readily available large-scale video. Method: Proposes 4DHOISolver, an efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem with sparse, human-in-the-loop contact point annotations while keeping high spatio-temporal coherence and physical plausibility, and uses it to build the Open4DHOI dataset. Result: Builds Open4DHOI, a large-scale 4D HOI dataset with 144 object types and 103 actions; shows that the reconstructions let a reinforcement learning agent imitate the recovered motions; and finds that current 3D foundation models still fall short at predicting precise contact correspondences. Conclusion: Human-in-the-loop contact annotation is currently a key strategy for high-quality 4D HOI reconstruction; Open4DHOI is a valuable resource for robot learning, and automatic prediction of human-object contact remains an open problem. Abstract: Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. Thus, in this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at https://wenboran2002.github.io/open4dhoi/

[199] MM-ACT: Learn from Multimodal Parallel Generation to Act

Haotian Liang,Xinyi Chen,Bin Wang,Mingkang Chen,Yitian Liu,Yuhao Zhang,Zanxin Chen,Tianshuo Yang,Yilun Chen,Jiangmiao Pang,Dong Liu,Xiaokang Yang,Yao Mu,Wenqi Shao,Ping Luo

Main category: cs.CV

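TL;DR: This paper proposes MM-ACT, a unified vision-language-action (VLA) model that jointly models text, image, and action through a shared context and performs strongly on several robot tasks.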

Details Motivation: A generalist robot policy needs both semantic understanding and the ability to interact with the environment, which requires fusing multimodal information for task planning and execution. Method: MM-ACT uses a re-mask parallel decoding strategy for text and image generation and one-step parallel decoding for action generation; Context-Shared Multimodal Learning trains the outputs of all three modalities from a shared context. Result: The model reaches success rates of 96.3% on the LIBERO simulation, 72.0% on a real Franka robot, and 52.38% on RoboTwin2.0 bimanual tasks, with cross-modal learning contributing an additional 9.25% gain. Conclusion: The unified multimodal modeling paradigm improves the generalization and execution efficiency of robot policies and confirms the value of cross-modal learning for action generation. Abstract: A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation and Franka real-robot setups as well as RoboTwin2.0 to assess in-domain and out-of-domain performances respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks of real Franka, and 52.38% across eight bimanual tasks of RoboTwin2.0 with an additional gain of 9.25% from cross-modal learning. We release our code, models and data at https://github.com/HHYHRHY/MM-ACT.

[200] PhotoFramer: Multi-modal Image Composition Instruction

Zhiyuan You,Ke Wang,He Zhang,Xin Cai,Jinjin Gu,Tianfan Xue,Chao Dong,Zhoutong Zhang

Main category: cs.CV

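TL;DR: This paper proposes PhotoFramer, a multi-modal photo composition guidance framework that helps users improve their framing through natural-language instructions and generated example images.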

Details Motivation: Many casual users struggle to compose photos well, and professional photographic knowledge is hard to pass on, so an easy-to-use and concrete composition aid is needed. Method: The authors build a multi-modal model that produces both a textual description and a generated image, and design a hierarchical data pipeline: shift and zoom-in tasks are built from cropping datasets, view-change tasks come from a two-stage pipeline (multi-view pair sampling, then synthesis with a degradation model), and a model that jointly processes and generates text and images is finetuned on the result. Result: Experiments show that textual instructions effectively steer composition and that combining them with example images consistently improves over exemplar-only baselines. Conclusion: PhotoFramer gives everyday users actionable composition advice and makes expert photographic priors more accessible, a practical step toward composition assistants. Abstract: Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users. Code, model weights, and datasets have been released at https://zhiyuanyou.github.io/photoframer.

[201] S2AM3D: Scale-controllable Part Segmentation of 3D Point Cloud

Han Su,Tianyu Huang,Zichen Wan,Xiaohe Wu,Wangmeng Zuo

Main category: cs.CV

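TL;DR: This paper proposes S2AM3D, a method that addresses generalization and cross-view consistency in part-level point cloud segmentation by combining 2D segmentation priors with 3D-consistent supervision. It introduces a point-consistent part encoder and a scale-aware prompt decoder, and releases a large-scale, high-quality dataset with more than 100k samples. Experiments show leading performance across multiple evaluation settings, with strong robustness and controllability.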

Details Motivation: Existing 3D point cloud segmentation methods generalize poorly because data are scarce, and bringing in 2D pre-trained knowledge leads to segmentation results that are inconsistent across views. Method: S2AM3D uses a point-consistent part encoder that aggregates multi-view 2D features through 3D contrastive learning, and a scale-aware prompt decoder that adjusts segmentation granularity in real time from a continuous scale signal; a large-scale, high-quality part-level point cloud dataset is built for training. Result: S2AM3D achieves leading performance across multiple evaluation settings and is notably robust and controllable on complex structures and parts with large size variations. Conclusion: S2AM3D effectively fuses 2D priors with 3D-consistent supervision to address key challenges in point cloud segmentation, and the new dataset supports follow-up research. Abstract: Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S2AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providing ample supervision signals for model training. Extensive experiments demonstrate that S2AM3D achieves leading performance across multiple evaluation settings, exhibiting exceptional robustness and controllability when handling complex structures and parts with significant size variations.

[202] Provenance-Driven Reliable Semantic Medical Image Vector Reconstruction via Lightweight Blockchain-Verified Latent Fingerprints

Mohsin Rasheed,Abdullah Al-Mamun

Main category: cs.CV

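TL;DR: The paper proposes a semantic-aware medical image reconstruction framework that combines high-level semantic information with a hybrid U-Net architecture and adds a lightweight blockchain-based provenance mechanism, improving the anatomical fidelity and trustworthiness of reconstructions.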

Details Motivation: Existing medical image reconstruction methods focus on pixel-level recovery, which can sacrifice anatomical accuracy, and they lack a trustworthy way to trace the reconstruction process, which undermines the reliability of clinical diagnosis. Method: A semantic-aware reconstruction framework uses high-level latent embeddings to guide a hybrid U-Net during restoration; a lightweight blockchain provenance layer based on a scale-free graph records every reconstruction event. Result: Across multiple datasets and corruption types, the method improves structural consistency, restoration accuracy, and provenance integrity over existing approaches. Conclusion: By combining semantic-guided reconstruction with secure traceability, the method improves the reliability of medical AI systems, diagnostic confidence, and regulatory compliance. Abstract: Medical imaging is essential for clinical diagnosis, yet real-world data frequently suffers from corruption, noise, and potential tampering, challenging the reliability of AI-assisted interpretation. Conventional reconstruction techniques prioritize pixel-level recovery and may produce visually plausible outputs while compromising anatomical fidelity, an issue that can directly impact clinical outcomes. We propose a semantic-aware medical image reconstruction framework that integrates high-level latent embeddings with a hybrid U-Net architecture to preserve clinically relevant structures during restoration. To ensure trust and accountability, we incorporate a lightweight blockchain-based provenance layer using a scale-free graph design, enabling verifiable recording of each reconstruction event without imposing significant overhead. Extensive evaluation across multiple datasets and corruption types demonstrates improved structural consistency, restoration accuracy, and provenance integrity compared with existing approaches. By uniting semantic-guided reconstruction with secure traceability, our solution advances dependable AI for medical imaging, enhancing both diagnostic confidence and regulatory compliance in healthcare environments.

[203] LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency

Zhongbin Guo,Jiahe Liu,Wenyu Gao,Yushan Li,Chengzhi Li,Ping Jian

Main category: cs.CV

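TL;DR: LISA-3D is a two-stage framework that lifts language-image segmentation to 3D using geometry-aware LoRA layers and a frozen SAM-3D reconstructor. It builds a differentiable reprojection loss from RGB-D sequences and camera poses to enforce cross-view consistency, without additional 3D-text supervision.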

Details Motivation: Text-driven 3D reconstruction needs a mask generator that both understands open-vocabulary instructions and stays consistent across viewpoints. Method: LISA-3D retrofits the instruction-following model LISA with geometry-aware LoRA layers and reuses a frozen SAM-3D reconstructor; RGB-D sequences and camera poses are used to build a differentiable reprojection loss that enforces cross-view consistency. Result: On ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to 15.6 points over single-view baselines while adapting only 11.6M parameters. Conclusion: LISA-3D is modular, data-efficient, and deployable zero-shot, offering a practical recipe for language-guided 3D content creation. Abstract: Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reusing a frozen SAM-3D reconstructor. During training we exploit off-the-shelf RGB-D sequences and their camera poses to build a differentiable reprojection loss that enforces cross-view agreement without requiring any additional 3D-text supervision. The resulting masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which outputs Gaussian splats or textured meshes without retraining. Across ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting only 11.6M parameters. The system is modular, data-efficient, and supports zero-shot deployment on unseen categories, providing a practical recipe for language-guided 3D content creation. Our code will be available at https://github.com/binisalegend/LISA-3D.

[204] Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

Jing He,Haodong Li,Mingzhi Sheng,Ying-Cong Chen

Main category: cs.CV

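TL;DR: This paper proposes Lotus-2, a two-stage deterministic framework for stable and accurate prediction of fine-grained geometry from a single image. By fully exploiting the generative priors of a pre-trained diffusion model, it reaches state-of-the-art monocular depth estimation with very little training data.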

Details Motivation: Recovering pixel-wise geometric properties from a single image is inherently ill-posed. Discriminative regression models are limited by data scale and physical reasoning, while diffusion models carry strong priors but their stochastic generative formulation is ill-suited to deterministic geometric inference, so an adaptation protocol is needed to unlock their potential. Method: Lotus-2 has two stages: the first uses a single-step deterministic formulation and a lightweight local continuity module (LCM) to produce globally coherent structure; the second performs constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Result: With only 59K training samples (less than 1% of existing large-scale datasets), Lotus-2 reaches state-of-the-art monocular depth estimation and highly competitive surface normal prediction. Conclusion: Diffusion models can serve as deterministic world priors that support high-quality geometric reasoning beyond traditional discriminative and generative paradigms. Abstract: Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality and diversity of available data and limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaptation protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching.
Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.

[205] TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models

Maya Varma,Jean-Benoit Delbrouck,Sophie Ostmeier,Akshay Chaudhari,Curtis Langlotz

Main category: cs.CV

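TL;DR: This paper proposes TRoVe, an automated method for discovering error-inducing static feature biases in temporal vision-language models (VLMs). It extracts candidate features and scores them by their effect on classification errors and by how much the model relies on them. The method is validated on 101 models and applied to 7 off-the-shelf VLMs and 2 tasks, where it surfaces previously unknown biases and helps improve test-time performance.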

Details Motivation: On temporal understanding tasks, VLMs may rely on static feature biases (such as background or object features) rather than genuine dynamic changes; such shortcuts cause systematic prediction errors, so these biases need to be identified and characterized before deployment. Method: Given a trained VLM and an annotated validation set, TRoVe extracts candidate static features from the dataset and scores each by two measures: (i) the feature's effect on classification errors and (ii) the extent to which the VLM relies on the feature when predicting. Result: On an evaluation framework of 101 trained VLMs with ground-truth bias annotations, TRoVe improves on the closest baseline by 28.6%; applied to 7 off-the-shelf VLMs and 2 temporal tasks, it finds previously unknown static feature biases, and knowledge of these biases is shown to improve model performance at test time. Conclusion: TRoVe effectively identifies error-inducing static feature biases, helps improve the robustness and interpretability of VLMs on downstream tasks, and provides a practical diagnostic tool before deployment. Abstract: Vision-language models (VLMs) have made great strides in addressing temporal understanding tasks, which involve characterizing visual changes across a sequence of images. However, recent works have suggested that when making predictions, VLMs may rely on static feature biases, such as background or object features, rather than dynamic visual changes. Static feature biases are a type of shortcut and can contribute to systematic prediction errors on downstream tasks; as a result, identifying and characterizing error-inducing static feature biases is critical prior to real-world model deployment. In this work, we introduce TRoVe, an automated approach for discovering error-inducing static feature biases learned by temporal VLMs. Given a trained VLM and an annotated validation dataset associated with a downstream classification task, TRoVe extracts candidate static features from the dataset and scores each feature by (i) the effect of the feature on classification errors as well as (ii) the extent to which the VLM relies on the feature when making predictions. In order to quantitatively evaluate TRoVe, we introduce an evaluation framework consisting of 101 trained temporal VLMs paired with ground-truth annotations for learned static feature biases. We use this framework to demonstrate that TRoVe can accurately identify error-inducing static feature biases in VLMs, achieving a 28.6% improvement over the closest baseline.
Finally, we apply TRoVe to 7 off-the-shelf VLMs and 2 temporal understanding tasks, surfacing previously-unknown static feature biases and demonstrating that knowledge of learned biases can aid in improving model performance at test time. Our code is available at https://github.com/Stanford-AIMI/TRoVe.

[206] Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction

Anantha Padmanaban Krishna Kumar

Main category: cs.CV

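TL;DR: The study finds that, for ViT-B/16, reducing the parameters of the MLP blocks (through weight sharing or a smaller hidden dimension) improves training stability and inference efficiency while also raising ImageNet-1K accuracy, indicating that the model is overparameterized.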

Details Motivation: Scaling up Vision Transformers usually improves performance, but performance does not always increase monotonically with scale. The paper explores how to reduce parameters without hurting performance, or even while improving it. Method: Two parameter-reduction strategies are applied to the MLP blocks of ViT-B/16: GroupedMLP (sharing MLP weights between adjacent transformer blocks) and ShallowMLP (halving the MLP hidden dimension), both evaluated on ImageNet-1K. Result: Both remove 32.7% of the parameters. GroupedMLP reaches 81.47% accuracy, and ShallowMLP reaches 81.25% with a 38% increase in inference throughput; both beat the baseline (81.05%) and substantially improve training stability (peak-to-final accuracy degradation falls from 0.47% to 0.03%-0.06%). Conclusion: Under standard training, ViT-B/16 is overparameterized, and moderately reducing MLP capacity is not only feasible but can slightly improve performance; parameter sharing and width reduction may act as useful inductive biases, so ViT design should pay more attention to how parameters are allocated. Abstract: Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7% of the baseline parameters. Our GroupedMLP variant shares MLP weights between adjacent transformer blocks and achieves 81.47% top-1 accuracy while maintaining the baseline computational cost. Our ShallowMLP variant halves the MLP hidden dimension and reaches 81.25% top-1 accuracy with a 38% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47% to the range 0.03% to 0.06%. These results suggest that, for ViT-B/16 on ImageNet-1K with a standard training recipe, the model operates in an overparameterized regime in which MLP capacity can be reduced without harming performance and can even slightly improve it. More broadly, our findings suggest that architectural constraints such as parameter sharing and reduced width may act as useful inductive biases, and highlight the importance of how parameters are allocated when designing Vision Transformers. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/parameter-efficient-vit-mlps.
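
The 32.7% figure can be sanity-checked with a little arithmetic. The sketch below assumes the standard ViT-B/16 dimensions (12 blocks, embedding width 768, MLP hidden width 3072), which the abstract does not state, and it assumes that GroupedMLP shares one MLP per pair of adjacent blocks. The function names are illustrative and are not taken from the authors' code.

```python
# Assumed standard ViT-B/16 dimensions (not stated in the abstract).
EMBED_DIM = 768
MLP_HIDDEN = 3072
NUM_BLOCKS = 12
BASELINE_PARAMS = 86.6e6  # baseline size quoted in the abstract


def mlp_params(embed_dim: int, hidden_dim: int) -> int:
    """Weights and biases of a two-layer MLP: embed -> hidden -> embed."""
    fc1 = embed_dim * hidden_dim + hidden_dim
    fc2 = hidden_dim * embed_dim + embed_dim
    return fc1 + fc2


def grouped_mlp_savings() -> int:
    """Assumption: one shared MLP per pair of adjacent blocks, so half the MLPs go away."""
    return (NUM_BLOCKS // 2) * mlp_params(EMBED_DIM, MLP_HIDDEN)


def shallow_mlp_savings() -> int:
    """Halving the hidden width in every block."""
    full = mlp_params(EMBED_DIM, MLP_HIDDEN)
    half = mlp_params(EMBED_DIM, MLP_HIDDEN // 2)
    return NUM_BLOCKS * (full - half)


def percent_removed(saved: int) -> float:
    """Saved parameters as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * saved / BASELINE_PARAMS, 1)


if __name__ == "__main__":
    print("GroupedMLP:", grouped_mlp_savings(), percent_removed(grouped_mlp_savings()))
    print("ShallowMLP:", shallow_mlp_savings(), percent_removed(shallow_mlp_savings()))
```

Under these assumptions both variants remove roughly 28.3M parameters, which is consistent with the reported 32.7% of an 86.6M-parameter baseline. The two strategies differ in cost: sharing keeps the per-block compute unchanged, whereas halving the hidden width also cuts compute, which fits the throughput gain reported only for ShallowMLP.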

[207] Accelerating Inference of Masked Image Generators via Reinforcement Learning

Pranav Subbaraman,Shufan Li,Siyan Zhao,Aditya Grover

Main category: cs.CV

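TL;DR: The paper proposes Speed-RL, a reinforcement-learning approach to accelerating pretrained masked generative models (MGMs). By combining a quality reward with a speed reward, it reduces the number of sampling steps while maintaining generation quality, achieving a 3x speedup.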

Details Motivation: Masked generative models (MGMs) produce high-quality images but need many sampling steps, which makes inference slow and limits practical use, so an effective way to accelerate MGM generation is needed. Method: MGM acceleration is framed as a reinforcement learning problem rather than as distillation-style distribution matching. A combined reward covering image quality and generation speed is used to finetune the pretrained MGM so that it produces high-quality images in fewer steps. Result: Experiments show that Speed-RL accelerates the base model by a factor of 3 while keeping image quality comparable to the original many-step generation. Conclusion: Speed-RL offers a new paradigm for accelerating masked generative models, using reinforcement learning to improve inference speed while preserving quality. Abstract: Masked Generative Models (MGMs) demonstrate strong capabilities in generating high-fidelity images. However, they need many sampling steps to create high-quality generations, resulting in slow inference speed. In this work, we propose Speed-RL, a novel paradigm for accelerating pretrained MGMs to generate high-quality images in fewer steps. Unlike conventional distillation methods which formulate the acceleration problem as a distribution matching problem, where a few-step student model is trained to match the distribution generated by a many-step teacher model, we consider this problem as a reinforcement learning problem. Since the goal of acceleration is to generate high-quality images in fewer steps, we can combine a quality reward with a speed reward and finetune the base model using reinforcement learning with the combined reward as the optimization target. Through extensive experiments, we show that the proposed method is able to accelerate the base model by a factor of 3x while maintaining comparable image quality.

[208] CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions

Simon Kohaut,Daniel Ochs,Shun Zhang,Benedict Flade,Julian Eggert,Kristian Kersting,Devendra Singh Dhami

Main category: cs.CV

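TL;DR: CycliST is a new benchmark dataset for evaluating the textual reasoning of video language models over cyclical state transitions; it exposes the shortcomings of existing models in understanding cyclic dynamics and temporal patterns.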

Details Motivation: Existing video language models lack effective spatio-temporal understanding of the periodic processes common in the real world (such as cyclic motion and time-varying visual attributes), and a dedicated benchmark is needed to evaluate and drive progress in this direction. Method: CycliST is a dataset of synthetic but richly structured video sequences with varying numbers of cyclic objects, scene clutter, and lighting conditions, paired with a tiered evaluation system that tests spatio-temporal cognition. Result: Current state-of-the-art VLMs struggle to detect and exploit cyclic patterns, lack temporal understanding, and cannot extract quantitative information; neither model size nor architecture correlates strongly with performance, and no single model does well on all tasks. Conclusion: CycliST reveals key deficiencies of existing video language models in understanding periodic dynamics and provides a challenge and evaluation framework for developing models with stronger temporal reasoning. Abstract: We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLMs) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.

[209] Learning Eigenstructures of Unstructured Data Manifolds

Roy Velich,Arkadi Piven,David Bensaïd,Daniel Cremers,Thomas Dagès,Ron Kimmel

Main category: cs.CV

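TL;DR: The paper proposes a framework that learns a spectral basis for shape and manifold analysis directly from unstructured data, without the traditional operator selection, discretization, and eigensolvers.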

Details Motivation: Traditional methods explicitly construct an operator and eigendecompose it, which scales poorly to high-dimensional or unstructured data and requires many assumptions about the data manifold. The paper aims to provide a more flexible and scalable data-driven alternative. Method: Grounded in optimal-approximation theory, a network is trained to minimize reconstruction error over a chosen distribution of probe functions, which implicitly decomposes an approximation operator and jointly learns the spectral basis, the sampling density, and the eigenvalues. Result: On point clouds of 3D surfaces and on high-dimensional image manifolds, the method yields meaningful spectral bases resembling those of the Laplacian, without explicitly constructing an operator, and applies to data of any dimension. Conclusion: The method offers a principled, data-driven alternative for geometry processing that is particularly suited to high-dimensional unstructured data and avoids several complex steps of the traditional pipeline. Abstract: We introduce a novel framework that directly learns a spectral basis for shape and manifold analysis from unstructured data, eliminating the need for traditional operator selection, discretization, and eigensolvers. Grounded in optimal-approximation theory, we train a network to decompose an implicit approximation operator by minimizing the reconstruction error in the learned basis over a chosen distribution of probe functions. For suitable distributions, they can be seen as an approximation of the Laplacian operator and its eigendecomposition, which are fundamental in geometry processing. Furthermore, our method recovers in a unified manner not only the spectral basis, but also the implicit metric's sampling density and the eigenvalues of the underlying operator. Notably, our unsupervised method makes no assumption on the data manifold, such as meshing or manifold dimensionality, allowing it to scale to arbitrary datasets of any dimension. On point clouds lying on surfaces in 3D and high-dimensional image manifolds, our approach yields meaningful spectral bases that can resemble those of the Laplacian, without explicit construction of an operator. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines. This opens new possibilities in geometry processing for unstructured data, particularly in high-dimensional spaces.

[210] Structural Prognostic Event Modeling for Multimodal Cancer Survival Analysis

Yilan Zhang,Li Nanbo,Changchun Yang,Jürgen Schmidhuber,Xin Gao

Main category: cs.CV

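TL;DR: The paper proposes SlotSPE, a slot-based framework for modeling structural prognostic events that compresses multimodal inputs with slot attention and improves cancer survival prediction.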

Details Motivation: Existing methods struggle to model prognosis-relevant events in high-dimensional, complex multimodal data; these key events are sparse, patient-specific, and unannotated, so a method that efficiently captures high-level structural signals is needed. Method: Inspired by factorial coding, slot attention compresses each patient's multimodal inputs into compact, modality-specific sets of mutually distinctive slots; these slots encode prognostic events, and biological priors are incorporated into the modeling. Result: On ten cancer benchmarks, SlotSPE outperforms existing methods in 8 cohorts with an overall improvement of 2.9%, remains robust when genomic data are missing, and markedly improves interpretability through structured event decomposition. Conclusion: SlotSPE captures sparse, patient-specific, high-level prognostic events, improving both survival prediction accuracy and interpretability, and suggests a new direction for multimodal cancer prognosis. Abstract: The integration of histology images and gene profiles has shown great promise for improving survival prediction in cancer. However, current approaches often struggle to model intra- and inter-modal interactions efficiently and effectively due to the high dimensionality and complexity of the inputs. A major challenge is capturing critical prognostic events that, though few, underlie the complexity of the observed inputs and largely determine patient outcomes. These events, manifested as high-level structural signals such as spatial histologic patterns or pathway co-activations, are typically sparse, patient-specific, and unannotated, making them inherently difficult to uncover. To address this, we propose SlotSPE, a slot-based framework for structural prognostic event modeling. Specifically, inspired by the principle of factorial coding, we compress each patient's multimodal inputs into compact, modality-specific sets of mutually distinctive slots using slot attention. By leveraging these slot representations as encodings for prognostic events, our framework enables both efficient and effective modeling of complex intra- and inter-modal interactions, while also facilitating seamless incorporation of biological priors that enhance prognostic relevance. Extensive experiments on ten cancer benchmarks show that SlotSPE outperforms existing methods in 8 out of 10 cohorts, achieving an overall improvement of 2.9%. It remains robust under missing genomic data and delivers markedly improved interpretability through structured event decomposition.

[211] OmniFD: A Unified Model for Versatile Face Forgery Detection

Haotian Liu,Haoyu Chen,Chenhui Pan,You Hu,Guoying Zhao,Xiaobai Li

Main category: cs.CV

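TL;DR: This paper proposes OmniFD, a unified multi-task framework that jointly handles four core face forgery detection tasks: image classification, video classification, spatial localization, and temporal localization. With a shared encoder, a cross-task interaction module, and lightweight decoding heads, it improves performance while cutting parameters by 63% and training time by 50%.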

Details Motivation: Existing methods use separate models for different forgery detection tasks, which causes computational redundancy, ignores correlations between tasks, and provides no unified modeling. Method: OmniFD consists of (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations; (2) a cross-task interaction module that captures task dependencies with learnable queries and attention; and (3) lightweight decoding heads that produce the output for each task. Result: OmniFD surpasses task-specific models on multiple benchmarks, improves video classification accuracy by 4.63% when image data are incorporated, and reduces parameters by 63% and training time by 50%, giving efficient and scalable multi-task forgery detection. Conclusion: OmniFD's unified architecture delivers efficient, general face forgery detection with multi-task synergy and knowledge transfer, offering a cost-effective solution for real-world use. Abstract: Face forgery detection encompasses multiple critical tasks, including identifying forged images and videos and localizing manipulated regions and temporal segments. Current approaches typically employ task-specific models with independent architectures, leading to computational redundancy and ignoring potential correlations across related tasks. We introduce OmniFD, a unified framework that jointly addresses four core face forgery detection tasks within a single model, i.e., image and video classification, spatial localization, and temporal localization. Our architecture consists of three principal components: (1) a shared Swin Transformer encoder that extracts unified 4D spatiotemporal representations from both images and video inputs, (2) a cross-task interaction module with learnable queries that dynamically captures inter-task dependencies through attention-based reasoning, and (3) lightweight decoding heads that transform refined representations into corresponding predictions for all FFD tasks. Extensive experiments demonstrate OmniFD's advantage over task-specific models. Its unified design leverages multi-task learning to capture generalized representations across tasks, especially enabling fine-grained knowledge transfer that facilitates other tasks. For example, video classification accuracy improves by 4.63% when image data are incorporated. Furthermore, by unifying images, videos and the four tasks within one framework, OmniFD achieves superior performance across diverse benchmarks with high efficiency and scalability, e.g., reducing 63% model parameters and 50% training time.
It establishes a practical and generalizable solution for comprehensive face forgery detection in real-world applications. The source code is made available at https://github.com/haotianll/OmniFD.

[212] Weakly Supervised Continuous Micro-Expression Intensity Estimation Using Temporal Deep Neural Network

Riyadh Mohammed Almushrafy

Main category: cs.CV

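TL;DR: This paper proposes a unified framework for continuous micro-expression intensity estimation from sparse temporal labels (onset, apex, offset). A triangular prior generates dense pseudo-intensity trajectories, and a ResNet18 encoder combined with a bidirectional GRU models the temporal dynamics. No frame-level annotation is needed, and the method achieves strong Spearman and Kendall correlations on SAMM and CASME II.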

Details Motivation: Most micro-expression research focuses on classifying discrete categories and pays little attention to how intensity evolves continuously over time; the lack of frame-level intensity labels also holds back supervised regression. Method: Sparse temporal landmarks (onset, apex, offset) are turned into pseudo-intensity trajectories with a triangular prior; a ResNet18 encodes image features and a bidirectional GRU models the temporal dynamics, giving end-to-end regression from a video sequence to frame-level intensity. Result: The model reaches a Spearman correlation of 0.9014 and a Kendall correlation of 0.7999 on SAMM, and 0.9116 and 0.8168 on CASME II, outperforming the baseline; ablations confirm the value of temporal modeling and the pseudo-label design. Conclusion: According to the authors, this is the first unified approach to continuous micro-expression intensity estimation that uses only sparse temporal annotations; it is applied consistently across datasets and offers an effective solution when frame-level labels are unavailable. Abstract: Micro-facial expressions are brief and involuntary facial movements that reflect genuine emotional states. While most prior work focuses on classifying discrete micro-expression categories, far fewer studies address the continuous evolution of intensity over time. Progress in this direction is limited by the lack of frame-level intensity labels, which makes fully supervised regression impractical. We propose a unified framework for continuous micro-expression intensity estimation using only weak temporal labels (onset, apex, offset). A simple triangular prior converts sparse temporal landmarks into dense pseudo-intensity trajectories, and a lightweight temporal regression model that combines a ResNet18 encoder with a bidirectional GRU predicts frame-wise intensity directly from image sequences. The method requires no frame-level annotation effort and is applied consistently across datasets through a single preprocessing and temporal alignment pipeline. Experiments on SAMM and CASME II show strong temporal agreement with the pseudo-intensity trajectories. On SAMM, the model reaches a Spearman correlation of 0.9014 and a Kendall correlation of 0.7999, outperforming a frame-wise baseline. On CASME II, it achieves up to 0.9116 and 0.8168, respectively, when trained without the apex-ranking term. Ablation studies confirm that temporal modeling and structured pseudo labels are central to capturing the rise-apex-fall dynamics of micro-facial movements. To our knowledge, this is the first unified approach for continuous micro-expression intensity estimation using only sparse temporal annotations.
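
The triangular prior described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact shape, so the sketch assumes a piecewise-linear rise from 0 at onset to 1 at apex and a linear fall back to 0 at offset, with frames indexed from 0. The function name `triangular_pseudo_intensity` is hypothetical.

```python
from typing import List


def triangular_pseudo_intensity(
    num_frames: int, onset: int, apex: int, offset: int
) -> List[float]:
    """Dense pseudo-intensity in [0, 1] from three sparse landmarks.

    Assumed shape: 0 outside [onset, offset], linear rise to 1 at the apex,
    linear fall to 0 at the offset.
    """
    if not (0 <= onset <= apex <= offset < num_frames):
        raise ValueError("expected 0 <= onset <= apex <= offset < num_frames")

    intensities = []
    for t in range(num_frames):
        if t < onset or t > offset:
            value = 0.0
        elif t <= apex:
            # Rising edge; a clip whose onset coincides with its apex starts at the peak.
            value = 1.0 if apex == onset else (t - onset) / (apex - onset)
        else:
            # Falling edge; t > apex and t <= offset imply offset > apex here.
            value = (offset - t) / (offset - apex)
        intensities.append(value)
    return intensities


if __name__ == "__main__":
    print(triangular_pseudo_intensity(7, onset=1, apex=3, offset=5))
```

A trajectory like this would serve as the regression target for each frame, which is why the reported Spearman and Kendall scores measure agreement with the pseudo-intensity trajectories and not with human-rated intensity.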

[213] SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models

Hamza Tahboub,Weiyan Shi,Gang Hua,Huaizu Jiang

Main category: cs.CV

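TL;DR: This paper proposes SocialFusion, a unified framework that addresses negative transfer in vision-language models on multi-task social perception. It identifies a phenomenon termed "social degradation" and demonstrates positive transfer and strong performance.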

Details Motivation: Existing pre-trained vision-language models show negative transfer when handling multiple social perception tasks; the authors set out to find the cause and propose a remedy. Method: Social degradation is studied through linear representation probing and gradient conflict analysis; the proposed SocialFusion framework learns a minimal connection between a frozen visual encoder and a language model to enable multi-task synergy. Result: SocialFusion achieves positive transfer on all five social perception tasks and performs comparably to task-specific state-of-the-art models; the analysis shows that the decodability of social information in the visual encoder drops significantly during VLM pre-training. Conclusion: Current vision-language pre-training strategies may harm a model's social understanding, and more socially aware training paradigms are needed. Abstract: Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term "social degradation," whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Compared with existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance and achieves comparable performance to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.

[214] DPAC: Distribution-Preserving Adversarial Control for Diffusion Sampling

Han-Jin Lee,Han-Ju Lee,Jin-Seong Kim,Seok-Hwan Choi

Main category: cs.CV

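TL;DR: This paper proposes DPAC, a diffusion sampling guidance method grounded in stochastic optimal control. It projects adversarial gradients onto the tangent space defined by the generative score geometry, which reduces the path-space KL divergence between controlled and nominal trajectories and improves sample quality at the same attack success rate.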

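Details Motivation: Adversarially guided diffusion methods reach the target class effectively, but sample quality degrades as the deviation between the controlled and original diffusion trajectories accumulates. The paper aims to explain this theoretically and to propose an efficient guidance strategy that preserves the distribution. Method: A path-space KL divergence (path-KL) measures how far the controlled process departs from the original diffusion process, and Girsanov's theorem shows that it equals the control energy; a variational analysis gives a first-order optimality condition for the control, leading to DPAC, which projects adversarial gradients onto the tangent space of iso-density surfaces (orthogonal to the score) to minimize distributional drift. Result: Theory shows that minimizing path-KL tightens upper bounds on both the 2-Wasserstein distance and FID; in discrete solvers DPAC cancels the O(Δt) leading error term in the Wasserstein distance, achieving an O(Δt²) quality gap, and it is second-order robust to score or metric approximation; experiments on ImageNet-100 confirm lower FID and lower estimated path-KL at matched attack success rates. Conclusion: By linking adversarial control to generation quality, DPAC provides a principled guidance mechanism that preserves the fidelity of the generated distribution while maintaining classification performance, and it highlights the value of exploiting score geometry for controllable generation. Abstract: Adversarially guided diffusion sampling often achieves the target class, but sample quality degrades as deviations between the adversarially controlled and nominal trajectories accumulate. We formalize this degradation as a path-space Kullback-Leibler divergence (path-KL) between controlled and nominal (uncontrolled) diffusion processes, thereby showing via Girsanov's theorem that it exactly equals the control energy. Building on this stochastic optimal control (SOC) view, we theoretically establish that minimizing this path-KL simultaneously tightens upper bounds on both the 2-Wasserstein distance and Fréchet Inception Distance (FID), revealing a principled connection between adversarial control energy and perceptual fidelity. From a variational perspective, we derive a first-order optimality condition for the control: among all directions that yield the same classification gain, the component tangent to iso-(log-)density surfaces (i.e., orthogonal to the score) minimizes path-KL, whereas the normal component directly increases distributional drift. This leads to DPAC (Distribution-Preserving Adversarial Control), a diffusion guidance rule that projects adversarial gradients onto the tangent space defined by the generative score geometry.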
We further show that in discrete solvers, the tangent projection cancels the O(Δt) leading error term in the Wasserstein distance, achieving an O(Δt^2) quality gap; moreover, it remains second-order robust to score or metric approximation. Empirical studies on ImageNet-100 validate the theoretical predictions, confirming that DPAC achieves lower FID and estimated path-KL at matched attack success rates.
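The tangent projection described in the abstract can be illustrated with a few lines of linear algebra. The following is a minimal sketch, not the authors' implementation: it assumes a plain Euclidean inner product (the abstract also mentions a metric, which is ignored here), and the names `project_tangent`, `adv_grad`, and `score` are hypothetical.

```python
def project_tangent(grad, score, eps=1e-12):
    """Remove from `grad` the component parallel to `score`.

    Assumption: Euclidean inner product; the result is orthogonal to the
    score, i.e. tangent to the local iso-density surface.
    """
    dot_gs = sum(g * s for g, s in zip(grad, score))
    dot_ss = sum(s * s for s in score)
    if dot_ss < eps:
        # Score is (numerically) zero, so there is no normal direction to remove.
        return list(grad)
    coeff = dot_gs / dot_ss
    return [g - coeff * s for g, s in zip(grad, score)]


# Hypothetical adversarial gradient and score at one sampling step.
adv_grad = [1.0, 2.0, 3.0]
score = [0.0, 0.0, 2.0]
tangent = project_tangent(adv_grad, score)
print(tangent)
```

In this toy case the score points along the third axis, so the projection simply zeroes the third coordinate of the gradient and leaves the other two untouched.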

[215] Real-Time On-the-Go Annotation Framework Using YOLO for Automated Dataset Generation

Mohamed Abdallah Salem,Ahmed Harb Rabia

Main category: cs.CV

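TL;DR: Proposes a real-time annotation method based on YOLO models running on edge devices, substantially reducing dataset preparation time while maintaining high annotation quality.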

Details Motivation: Traditional annotation is time-consuming and labor-intensive, and efficient, accurate real-time annotation solutions are lacking, especially in agricultural applications that require rapid decisions. Method: YOLO models (YOLOv5, YOLOv8, YOLOv12) deployed on edge devices annotate images in real time during capture, and the study compares single-class versus multi-class and pretrained versus from-scratch configurations. Result: Experiments show that pretrained and single-class configurations perform better in model convergence, performance, and robustness, markedly improving annotation efficiency. Conclusion: The proposed real-time annotation framework is feasible and effective, and can be applied broadly in agriculture and other settings that need rapid deployment of object detection models. Abstract: Efficient and accurate annotation of datasets remains a significant challenge for deploying object detection models such as You Only Look Once (YOLO) in real-world applications, particularly in agriculture where rapid decision-making is critical. Traditional annotation techniques are labor-intensive, requiring extensive manual labeling after data collection. This paper presents a novel real-time annotation approach leveraging YOLO models deployed on edge devices, enabling immediate labeling during image capture. To comprehensively evaluate the efficiency and accuracy of our proposed system, we conducted an extensive comparative analysis using three prominent YOLO architectures (YOLOv5, YOLOv8, YOLOv12) under various configurations: single-class versus multi-class annotation and pretrained versus scratch-based training. Our analysis includes detailed statistical tests and learning dynamics, demonstrating significant advantages of pretrained and single-class configurations in terms of model convergence, performance, and robustness. Results strongly validate the feasibility and effectiveness of our real-time annotation framework, highlighting its capability to drastically reduce dataset preparation time while maintaining high annotation quality.

[216] VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering

Zihua Liu,Hiroki Sakuma,Masatoshi Okutomi

Main category: cs.CV

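TL;DR: This paper proposes VSRD++, a weakly supervised framework for monocular 3D object detection that requires no 3D annotations and achieves strong detection performance through neural-field-based volumetric rendering with weak 2D supervision.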

Details Motivation: Existing monocular 3D object detection methods rely heavily on large amounts of 3D annotations, which usually have to be labeled from LiDAR point clouds at considerable human effort, making them costly and hard to scale. To reduce this dependence, the paper explores a more efficient weakly supervised approach. Method: VSRD++ uses a two-stage pipeline: multi-view 3D autolabeling followed by monocular 3D detector training. In the first stage, object surfaces are represented as signed distance fields (SDFs) and rendered into instance masks via instance-aware volumetric silhouette rendering; each SDF is decomposed into a cuboid SDF and a residual distance field (RDF) to optimize the 3D bounding boxes; a velocity attribute and a confidence mechanism model dynamic objects and mitigate geometric inconsistency; and a 3D attribute initialization module is added. In the second stage, the optimized 3D bounding boxes serve as pseudo labels for training a monocular 3D detector. Result: Experiments on the KITTI-360 dataset show that VSRD++ significantly outperforms existing weakly supervised monocular 3D object detection methods in both static and dynamic scenes. Conclusion: VSRD++ achieves efficient monocular 3D object detection without 3D annotations. By combining neural fields with weak supervision, it maintains high detection quality while reducing the dependence on expensive labeled data. Abstract: Monocular 3D object detection is a fundamental yet challenging task in 3D scene understanding. Existing approaches heavily depend on supervised learning with extensive 3D annotations, which are often acquired from LiDAR point clouds through labor-intensive labeling processes. To tackle this problem, we propose VSRD++, a novel weakly supervised framework for monocular 3D object detection that eliminates the reliance on 3D annotations and leverages neural-field-based volumetric rendering with weak 2D supervision. VSRD++ consists of a two-stage pipeline: multi-view 3D autolabeling and subsequent monocular 3D detector training. In the multi-view autolabeling stage, object surfaces are represented as signed distance fields (SDFs) and rendered as instance masks via the proposed instance-aware volumetric silhouette rendering. To optimize 3D bounding boxes, we decompose each instance's SDF into a cuboid SDF and a residual distance field (RDF) that captures deviations from the cuboid. To address the geometry inconsistency commonly observed in volume rendering methods applied to dynamic objects, we model the dynamic objects by including velocity in the bounding box attributes as well as assigning a confidence to each pseudo-label. Moreover, we also employ a 3D attribute initialization module to initialize the dynamic bounding box parameters. In the monocular 3D object detection phase, the optimized 3D bounding boxes serve as pseudo labels for training monocular 3D object detectors.
Extensive experiments on the KITTI-360 dataset demonstrate that VSRD++ significantly outperforms existing weakly supervised approaches for monocular 3D object detection on both static and dynamic scenes. Code is available at https://github.com/Magicboomliu/VSRD_plus_plus
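To make the cuboid-plus-residual decomposition concrete, here is a small sketch of a signed distance function for an axis-aligned, origin-centered cuboid with a residual term added on top. It is an illustration under stated assumptions rather than the VSRD++ code: the additive combination, the function names, and the constant residual are all hypothetical, and in the paper the residual field is learned.

```python
import math


def cuboid_sdf(point, half_size):
    """Signed distance from `point` to an axis-aligned cuboid centered at the origin.

    Negative inside, zero on the surface, positive outside.
    """
    q = [abs(p) - h for p, h in zip(point, half_size)]
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))
    inside = min(max(q), 0.0)
    return outside + inside


def instance_sdf(point, half_size, residual_fn):
    """Assumed decomposition: cuboid SDF plus a residual distance field."""
    return cuboid_sdf(point, half_size) + residual_fn(point)


half = [1.0, 1.0, 1.0]
d_out = cuboid_sdf([2.0, 0.0, 0.0], half)   # one unit outside the +x face
d_in = cuboid_sdf([0.0, 0.0, 0.0], half)    # at the center, one unit inside
d_res = instance_sdf([2.0, 0.0, 0.0], half, lambda p: 0.1)  # toy constant residual
print(d_out, d_in, d_res)
```

Keeping the cuboid term explicit means the box parameters stay directly readable as a 3D bounding box, while the residual absorbs whatever shape detail the box cannot express.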

[217] TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image

Ziqian Wang,Yonghao He,Licheng Yang,Wei Zou,Hongxuan Ma,Liu Liu,Wei Sui,Yuxin Guo,Hu Su

Main category: cs.CV

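TL;DR: Proposes TabletopGen, a training-free, fully automatic framework for generating diverse, interactive 3D tabletop scenes. By decoupling rotation estimation from translation and scale estimation, it reconstructs high-fidelity 3D scenes from a 2D reference image.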

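Details Motivation: Existing 3D scene generation methods mainly target large-scale scenes and struggle with the high-density layouts and complex spatial relations of tabletop scenes, which limits their use for robotic manipulation policy learning and data synthesis. Method: Given a reference image (which can be produced by a text-to-image model), the framework performs instance segmentation and completion, reconstructs each instance as a 3D model, and aligns the models to canonical coordinates. A two-stage pose and scale alignment method is proposed: a differentiable rotation optimizer recovers rotation precisely, and a top-view spatial alignment mechanism robustly estimates translation and scale. The models are then assembled into a collision-free, simulation-ready 3D tabletop scene. Result: Experiments and user studies show that TabletopGen markedly outperforms existing methods in visual fidelity, layout accuracy, and physical plausibility, and can generate realistic tabletop scenes with rich stylistic and spatial diversity. Conclusion: TabletopGen is an efficient, automated framework for 3D tabletop scene generation that achieves high-quality, physically interactive scene synthesis without training, advancing the construction of simulation environments for embodied AI. Abstract: Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI--especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from a 2D reference.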
Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.

[218] Closing the Approximation Gap of Partial AUC Optimization: A Tale of Two Formulations

Yangbangyan Jiang,Qianqian Xu,Huiyang Shao,Zhiyong Yang,Shilong Bao,Xiaochun Cao,Qingming Huang

Main category: cs.CV

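TL;DR: This paper proposes two new instance-wise minimax reformulations for optimizing the partial AUC (PAUC). They address the approximation-error and scalability limitations of existing methods, offer low computational complexity and good convergence, and come with a tight generalization bound.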

Details Motivation: Because instance selection in the PAUC computation is NP-hard, existing methods usually rely on approximation techniques, which suffer from uncontrollable approximation error or poor scalability, so better optimization methods are needed. Method: Two instance-wise minimax reformulations are proposed: one with an asymptotically vanishing approximation gap, the other unbiased at the cost of additional variables. Sample selection is simplified through threshold learning, different smoothing techniques are applied, and an efficient solver yields a per-iteration complexity that is linear in the sample size. Result: The algorithms reach a convergence rate of $O(ε^{-1/3})$ for typical one-way and two-way PAUCs. The generalization bound makes the effect of the TPR/FPR constraints $α$/$β$ explicit, with order $\tilde{O}(α^{-1}n_+^{-1} + β^{-1}n_-^{-1})$, and experiments on several benchmark datasets confirm the advantage of the methods. Conclusion: The proposed methods effectively close the approximation gap of PAUC optimization, outperform existing methods both theoretically and empirically, and offer good scalability with theoretical guarantees. Abstract: As a variant of the Area Under the ROC Curve (AUC), the partial AUC (PAUC) focuses on a specific range of false positive rate (FPR) and/or true positive rate (TPR) in the ROC curve. It is a pivotal evaluation metric in real-world scenarios with both class imbalance and decision constraints. However, selecting instances within these constrained intervals during its calculation is NP-hard, and thus typically requires approximation techniques for practical resolution. Despite the progress made in PAUC optimization over the last few years, most existing methods still suffer from uncontrollable approximation errors or limited scalability when optimizing the approximate PAUC objectives. In this paper, we close the approximation gap of PAUC optimization by presenting two simple instance-wise minimax reformulations: one with an asymptotically vanishing gap, the other with unbiasedness at the cost of more variables. Our key idea is to first establish an equivalent instance-wise problem to lower the time complexity, simplify the complicated sample selection procedure by threshold learning, and then apply different smoothing techniques. Equipped with an efficient solver, the resulting algorithms enjoy a linear per-iteration computational complexity w.r.t. the sample size and a convergence rate of $O(ε^{-1/3})$ for typical one-way and two-way PAUCs. Moreover, we provide a tight generalization bound of our minimax reformulations.
The result explicitly demonstrates the impact of the TPR/FPR constraints $α$/$β$ on the generalization and exhibits a sharp order of $\tilde{O}(α^{-1}n_+^{-1} + β^{-1}n_-^{-1})$. Finally, extensive experiments on several benchmark datasets validate the strength of our proposed methods.
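For readers unfamiliar with the metric itself, the sketch below computes a plain empirical one-way PAUC: only the top-scored fraction β of negatives (the FPR range [0, β]) is compared against all positives. This is the standard pairwise estimator, normalized to [0, 1], and not the minimax reformulation proposed in the paper; the function name and the toy scores are hypothetical.

```python
import math


def one_way_pauc(pos_scores, neg_scores, beta):
    """Empirical one-way partial AUC over the FPR range [0, beta].

    Keeps the floor(beta * n_neg) highest-scored negatives and returns the
    fraction of (positive, hard negative) pairs ranked correctly (ties count 0.5).
    """
    k = max(1, math.floor(beta * len(neg_scores)))
    hard_negs = sorted(neg_scores, reverse=True)[:k]
    correct = 0.0
    for p in pos_scores:
        for n in hard_negs:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * k)


pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.5, 0.3, 0.1]
pauc = one_way_pauc(pos, neg, beta=0.5)      # only the two hardest negatives count
full_auc = one_way_pauc(pos, neg, beta=1.0)  # beta = 1 recovers the ordinary AUC
print(pauc, full_auc)
```

The sort-and-truncate step is the instance selection that makes the objective awkward to optimize directly, which is the part the paper replaces with threshold learning.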

[219] M4-BLIP: Advancing Multi-Modal Media Manipulation Detection through Face-Enhanced Local Analysis

Hang Wu,Ke Sun,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji

Main category: cs.CV

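TL;DR: This paper proposes M4-BLIP, a new framework for detecting multi-modal media manipulation. It uses the BLIP-2 model to extract local features, fuses them with facial prior information, and integrates a large language model to improve detection accuracy and the interpretability of results.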

Details Motivation: Existing detection methods often ignore local information, even though manipulations frequently occur in specific regions such as the face, so more attention to local features is needed to improve detection. Method: Local features are extracted with the BLIP-2 model, local facial information is introduced as a prior, a purpose-built alignment and fusion module combines local and global features, and a large language model is integrated to improve interpretability. Result: Experiments show that the framework outperforms current state-of-the-art methods in both quantitative and visualization-based evaluations. Conclusion: By fusing local detail with global context and incorporating a large language model, M4-BLIP significantly improves both the performance and the interpretability of multi-modal media manipulation detection. Abstract: In the contemporary digital landscape, multi-modal media manipulation has emerged as a significant societal threat, impacting the reliability and integrity of information dissemination. Current detection methodologies in this domain often overlook the crucial aspect of localized information, despite the fact that manipulations frequently occur in specific areas, particularly in facial regions. In response to this critical observation, we propose the M4-BLIP framework. This innovative framework utilizes the BLIP-2 model, renowned for its ability to extract local features, as the cornerstone for feature extraction. Complementing this, we incorporate local facial information as prior knowledge. A specially designed alignment and fusion module within M4-BLIP meticulously integrates these local and global features, creating a harmonious blend that enhances detection accuracy. Furthermore, our approach seamlessly integrates with Large Language Models (LLMs), significantly improving the interpretability of the detection outcomes. Extensive quantitative and visualization experiments validate the effectiveness of our framework against the state-of-the-art competitors.

[220] S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance

Beining Xu,Siting Zhu,Zhao Jin,Junxian Li,Hesheng Wang

Main category: cs.CV

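TL;DR: This paper proposes S$^2$-MLLM, an efficient framework that strengthens the spatial understanding of multi-modal large language models in 3D visual grounding (3DVG) through implicit spatial reasoning, achieving strong performance without relying on inefficient point cloud reconstruction.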

Details Motivation: Existing MLLM-based 3D visual grounding methods depend on viewpoint-dependent point cloud rendering, which is inefficient and offers limited spatial reasoning, making it hard to understand the structure of 3D scenes. Method: The S$^2$-MLLM framework introduces an implicit spatial reasoning mechanism that gains structure awareness from feed-forward 3D reconstruction. A structure-enhanced module (SE) combines intra-view and inter-view attention and integrates multi-level position encoding to associate spatial and viewpoint information. Result: The method significantly outperforms existing approaches on the ScanRefer, Nr3D, and Sr3D datasets while offering strong performance, good generalization, and high efficiency. Conclusion: Through implicit structure learning, S$^2$-MLLM improves the spatial reasoning of MLLMs in 3D visual grounding and avoids costly point cloud reconstruction, offering a new direction for efficient 3D scene understanding. Abstract: 3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding the 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding.
Extensive experiments demonstrate that S$^2$-MLLM unifies superior performance, generalization, and efficiency, achieving significant performance gains over existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.

[221] PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards

Shulei Wang,Longhui Wei,Xin He,Jianbo Ouyang,Hui Lu,Zhou Zhao,Qi Tian

Main category: cs.CV

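TL;DR: Proposes a scalable multi-subject data generation pipeline and an improved reinforcement learning strategy to raise subject consistency and text controllability in multi-subject personalized image generation.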

Details Motivation: Existing models perform poorly on multi-subject generation, struggling to keep subjects consistent and to follow text prompts, mainly because high-quality multi-subject datasets and well-tuned post-training strategies are lacking. Method: A scalable multi-subject data generation pipeline uses strong single-subject generation models to create high-quality, diverse multi-subject training data. Pairwise Subject-Consistency Rewards and general-purpose rewards are designed and applied in a refined reinforcement learning stage for post-training. Result: The model performs well on a newly proposed multi-dimensional benchmark covering seven subsets and three evaluation dimensions, and experiments confirm the method's effectiveness for multi-subject personalized image generation. Conclusion: The method significantly improves the quality of multi-subject personalized generation, outperforming existing methods in subject consistency and text alignment, and provides high-quality data and an effective training paradigm for future research. Abstract: Personalized generation models for a single subject have demonstrated remarkable effectiveness, highlighting their significant potential. However, when extended to multiple subjects, existing models often exhibit degraded performance, particularly in maintaining subject consistency and adhering to textual prompts. We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. Through this dataset, we first enable single-subject personalization models to acquire knowledge of synthesizing multi-image and multi-subject scenarios. Furthermore, to enhance both subject consistency and text controllability, we design a set of Pairwise Subject-Consistency Rewards and general-purpose rewards, which are incorporated into a refined reinforcement learning stage. To comprehensively evaluate multi-subject personalization, we introduce a new benchmark that assesses model performance using seven subsets across three dimensions. Extensive experiments demonstrate the effectiveness of our approach in advancing multi-subject personalized image generation. GitHub link: https://github.com/wang-shulei/PSR

[222] TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Junyuan Zhang,Bin Wang,Qintong Zhang,Fan Wu,Zichen Wen,Jialin Lu,Junjie Shan,Ziqi Zhao,Shuya Yang,Ziling Wang,Ziyang Miao,Huaping Zhong,Yuhang Zang,Xiaoyi Dong,Ka-Ho Chow,Conghui He

Main category: cs.CV

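TL;DR: This paper proposes TRivia, a self-supervised fine-tuning method that lets pretrained vision-language models learn table recognition (TR) directly from unlabeled real-world table images, with no human annotation. A question-answering-based reward mechanism and an attention-guided module form a closed learning loop in which the model learns to recognize, structure, and reason over tables on its own. The resulting open-source model, TRivia-3B, surpasses existing systems such as Gemini 2.5 Pro and MinerU2.5 on several benchmarks.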

Details Motivation: Existing table recognition methods depend on supervised learning with large amounts of labeled data, which is expensive to obtain, and open-source models trained with limited resources struggle to match proprietary ones. Closing this gap requires an efficient training method that does not depend on human annotation. Method: TRivia builds on the Group Relative Policy Optimization framework. An attention-guided module generates diverse questions for each table image, and the model's ability to answer them from its recognition output serves as the feedback signal for optimizing the TR model, forming a closed self-supervised loop. Result: TRivia-3B achieves leading performance on three mainstream benchmarks, surpassing existing systems such as Gemini 2.5 Pro and MinerU2.5, and is a state-of-the-art compact open-source table recognition model. Conclusion: TRivia enables efficient fine-tuning of table recognition models without human annotation, advances open-source TR models, and shows the potential of self-supervised learning for document parsing. Abstract: Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data.
Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/opendatalab/TRivia
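The group-relative part of Group Relative Policy Optimization can be shown in isolation: rewards for several sampled outputs of the same input are standardized within the group, so each output is scored relative to its siblings. The sketch below is a generic illustration and not TRivia's training code; treating the reward as the fraction of generated questions answered correctly, and using the population standard deviation, are assumptions made here for the example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (mean 0, roughly unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical QA-based rewards for four recognition outputs of the same table image,
# e.g. the fraction of generated questions answered correctly from each output.
qa_rewards = [0.25, 0.75, 0.5, 0.5]
advantages = group_relative_advantages(qa_rewards)
print(advantages)
```

Outputs that answer more questions correctly than the group average receive a positive advantage and are reinforced; a group whose outputs all score the same yields zero advantages and therefore no learning signal.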

[223] ViscNet: Vision-Based In-line Viscometry for Fluid Mixing Process

Jongwon Sohn,Juhyeon Moon,Hyunjoon Jung,Jaewook Nam

Main category: cs.CV

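TL;DR: Proposes a computer-vision-based, non-contact viscosity measurement method that infers viscosity from the refractive distortion of light passing through a dynamic free surface, with potential for automation and built-in uncertainty quantification.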

Details Motivation: Conventional viscometers are invasive and depend on controlled laboratory environments, making them ill-suited to real operating conditions, so a non-invasive, automatable method for in-process viscosity monitoring is needed. Method: The optical distortion of a fixed background pattern refracted through the liquid's free surface is analyzed with computer vision for viscosity regression and classification. A multi-pattern strategy improves robustness, and uncertainty quantification is added to improve reliability. Result: Under diverse lighting conditions, the regression task reaches a mean absolute error of 0.113 (log m^2 s^-1) and classification accuracy reaches up to 81%; the multi-pattern strategy improves discrimination between samples with similar viscosities. Conclusion: The non-contact viscometer offers a practical, automation-ready alternative to existing viscosity measurement methods, suitable for process monitoring and autonomous laboratory operation. Abstract: Viscosity measurement is essential for process monitoring and autonomous laboratory operation, yet conventional viscometers remain invasive and require controlled laboratory environments that differ substantially from real process conditions. We present a computer-vision-based viscometer that infers viscosity by exploiting how a fixed background pattern becomes optically distorted as light refracts through the mixing-driven, continuously deforming free surface. Under diverse lighting conditions, the system achieves a mean absolute error of 0.113 in log m^2 s^-1 units for regression and reaches up to 81% accuracy in viscosity-class prediction. Although performance declines for classes with closely clustered viscosity values, a multi-pattern strategy improves robustness by providing enriched visual cues. To ensure sensor reliability, we incorporate uncertainty quantification, enabling viscosity predictions with confidence estimates. This stand-off viscometer offers a practical, automation-ready alternative to existing viscometry methods.

[224] nnMobileNet++: Towards Efficient Hybrid Networks for Retinal Image Analysis

Xin Li,Wenhui Zhu,Xuanzhao Dong,Hao Wang,Yujian Xiong,Oana Dumitrascu,Yalin Wang

Main category: cs.CV

TL;DR: Proposes nnMobileNet++, a lightweight hybrid network that combines convolution and Transformer components to improve retinal image analysis.

Details Motivation: Purely convolutional architectures struggle to capture long-range dependencies and the irregular lesions and vascular patterns found in retinal images, which limits the performance of existing models in clinical diagnosis. Method: Building on nnMobileNet, the hybrid architecture nnMobileNet++ adds dynamic snake convolution, stage-specific Transformer blocks (starting from the second down-sampling stage), and retinal image pretraining. Result: The model achieves state-of-the-art or highly competitive classification accuracy on multiple public retinal datasets while keeping computational cost low. Conclusion: nnMobileNet++ is an efficient, lightweight framework for retinal image analysis that significantly improves the modeling of complex structures while preserving computational efficiency. Abstract: Retinal imaging is a critical, non-invasive modality for the early detection and monitoring of ocular and systemic diseases. Deep learning, particularly convolutional neural networks (CNNs), has driven significant progress in automated retinal analysis, supporting tasks such as fundus image classification, lesion detection, and vessel segmentation. As a representative lightweight network, nnMobileNet has demonstrated strong performance across multiple retinal benchmarks while remaining computationally efficient. However, purely convolutional architectures inherently struggle to capture long-range dependencies and model the irregular lesions and elongated vascular patterns that characterize retinal images, despite the critical importance of vascular features for reliable clinical diagnosis. To further advance this line of work and extend the original vision of nnMobileNet, we propose nnMobileNet++, a hybrid architecture that progressively bridges convolutional and transformer representations. The framework integrates three key components: (i) dynamic snake convolution for boundary-aware feature extraction, (ii) stage-specific transformer blocks introduced after the second down-sampling stage for global context modeling, and (iii) retinal image pretraining to improve generalization. Experiments on multiple public retinal datasets for classification, together with ablation studies, demonstrate that nnMobileNet++ achieves state-of-the-art or highly competitive accuracy while maintaining low computational cost, underscoring its potential as a lightweight yet effective framework for retinal image analysis.

[225] Supervised Contrastive Machine Unlearning of Background Bias in Sonar Image Classification with Fine-Grained Explainable AI

Kamal Basha S,Athira Nambiar

Main category: cs.CV

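TL;DR: Proposes a new sonar image analysis framework that uses a Targeted Contrastive Unlearning (TCU) module and an explainable unlearning framework (UESF) to reduce the model's reliance on seafloor features and improve generalization and interpretability.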

Details Motivation: Existing AI models for sonar image analysis over-rely on seafloor features, which leads to poor generalization, so the model's ability to recognize the target itself needs to improve. Method: A Targeted Contrastive Unlearning (TCU) module extends the triplet loss to reduce background bias, and the UESF framework, combined with an adapted LIME explainer, visualizes what the model has forgotten and produces more faithful local attributions. Result: Experiments on real and synthetic sonar datasets show that the method significantly improves unlearning effectiveness, model robustness, and interpretability. Conclusion: The proposed framework mitigates background dependence in sonar image analysis and improves the model's generalization and trustworthiness, making it suitable for object detection and classification in complex underwater environments. Abstract: Acoustic sonar image analysis plays a critical role in object detection and classification, with applications in both civilian and defense domains. Despite the availability of real and synthetic datasets, existing AI models that achieve high accuracy often over-rely on seafloor features, leading to poor generalization. To mitigate this issue, we propose a novel framework that integrates two key modules: (i) a Targeted Contrastive Unlearning (TCU) module, which extends the traditional triplet loss to reduce seafloor-induced background bias and improve generalization, and (ii) the Unlearn to Explain Sonar Framework (UESF), which provides visual insights into what the model has deliberately forgotten while adapting the LIME explainer to generate more faithful and localized attributions for unlearning evaluation. Extensive experiments across both real and synthetic sonar datasets validate our approach, demonstrating significant improvements in unlearning effectiveness, model robustness, and interpretability.
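Since the TCU module is described as an extension of the traditional triplet loss, the baseline loss is worth seeing on its own. The sketch below shows only that standard starting point, with Euclidean distance and a margin of 1.0 chosen for illustration; how TCU modifies it to target seafloor background bias is not specified in the abstract and is not reproduced here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2D embeddings (hypothetical).
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # positive near, negative far
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])   # positive far, negative near
print(satisfied, violated)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so only violating triplets contribute gradient.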

[226] Diffusion Model in Latent Space for Medical Image Segmentation Task

Huynh Trinh Ngoc,Toan Nguyen Hai,Ba Luong Son,Long Tran Quoc

Main category: cs.CV

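TL;DR: Proposes MedSegLatDiff, an efficient diffusion-based framework for medical image segmentation that combines a VAE with a latent diffusion model, denoises in a low-dimensional latent space, and modifies the loss to better preserve small structures, achieving state-of-the-art segmentation while producing diverse results and confidence maps.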

Details Motivation: Traditional medical image segmentation methods produce a single mask and cannot capture uncertainty. Existing generative models can produce multiple hypotheses but are computationally expensive, so a more efficient segmentation method that can express uncertainty is needed. Method: MedSegLatDiff combines a variational autoencoder (VAE) with a latent diffusion model. The VAE compresses the input into a low-dimensional latent space, which reduces noise and speeds up training, and the diffusion process runs in this compact space. In the VAE's mask reconstruction path, weighted cross-entropy replaces the MSE loss to better preserve tiny structures. Result: Evaluated on the ISIC-2018, CVC-Clinic, and LIDC-IDRI datasets, the method achieves state-of-the-art or highly competitive Dice and IoU scores while generating diverse segmentation hypotheses and confidence maps. Conclusion: MedSegLatDiff maintains high segmentation accuracy while improving efficiency and interpretability. It provides uncertainty estimates, which makes it more valuable for clinical deployment than deterministic methods. Abstract: Medical image segmentation is crucial for clinical diagnosis and treatment planning. Traditional methods typically produce a single segmentation mask, failing to capture inherent uncertainty. Recent generative models enable the creation of multiple plausible masks per image, mimicking the collaborative interpretation of several clinicians. However, these approaches remain computationally heavy. We propose MedSegLatDiff, a diffusion-based framework that combines a variational autoencoder (VAE) with a latent diffusion model for efficient medical image segmentation. The VAE compresses the input into a low-dimensional latent space, reducing noise and accelerating training, while the diffusion process operates directly in this compact representation. We further replace the conventional MSE loss with weighted cross-entropy in the VAE mask reconstruction path to better preserve tiny structures such as small nodules. MedSegLatDiff is evaluated on ISIC-2018 (skin lesions), CVC-Clinic (polyps), and LIDC-IDRI (lung nodules). It achieves state-of-the-art or highly competitive Dice and IoU scores while simultaneously generating diverse segmentation hypotheses and confidence maps. This provides enhanced interpretability and reliability compared to deterministic baselines, making the model particularly suitable for clinical deployment.
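The motivation for swapping MSE for weighted cross-entropy can be seen in a per-pixel toy example: with a larger weight on the foreground class, missing a small structure costs more than an equally confident false alarm on background. This is a generic weighted binary cross-entropy sketch; the weights 5.0 and 1.0 and the function name are illustrative assumptions, not values from the paper.

```python
import math


def weighted_bce(probs, targets, pos_weight=5.0, neg_weight=1.0, eps=1e-7):
    """Mean weighted binary cross-entropy over a flat list of pixels.

    `probs` are predicted foreground probabilities, `targets` are 0/1 labels.
    The class weights are hypothetical and would normally reflect class imbalance.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(pos_weight * y * math.log(p)
                   + neg_weight * (1 - y) * math.log(1.0 - p))
    return total / len(probs)


missed_nodule = weighted_bce([0.1], [1])  # foreground pixel predicted as background
false_alarm = weighted_bce([0.9], [0])    # background pixel predicted as foreground
print(missed_nodule, false_alarm)
```

With these weights the missed foreground pixel is penalized five times as heavily as the false alarm, which is the kind of pressure that helps keep small nodules from being averaged away.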

[227] EGG-Fusion: Efficient 3D Reconstruction with Geometry-aware Gaussian Surfel on the Fly

Xiaokun Pan,Zhenzhe Li,Zhichao Ye,Hongjia Zhai,Guofeng Zhang

Main category: cs.CV

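TL;DR: This paper proposes EGG-Fusion, a real-time 3D reconstruction system that combines sparse-to-dense camera tracking with a geometry-aware Gaussian surfel mapping module. An information-filter-based fusion method handles sensor noise and markedly improves reconstruction accuracy, giving over 20% improvement over state-of-the-art methods on the Replica and ScanNet++ datasets while running in real time at 24 FPS.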

Details Motivation: Existing differentiable-rendering-based SLAM systems face challenges in real-time computation and sensitivity to sensor noise, which degrade geometric reconstruction quality and limit practical use. Method: The EGG-Fusion system comprises robust sparse-to-dense camera tracking and a geometry-aware Gaussian surfel mapping module. It uses an information-filter-based fusion method that explicitly models sensor noise to achieve efficient, accurate surface reconstruction. Result: On the standard Replica and ScanNet++ datasets the system achieves a surface reconstruction error of 0.6 cm, over 20% better than state-of-the-art 3DGS-based methods, and runs in real time at 24 FPS. Conclusion: EGG-Fusion is among the most accurate differentiable-rendering-based real-time reconstruction systems to date, combining high-precision geometry with real-time performance and strong practicality. Abstract: Real-time 3D reconstruction is a fundamental task in computer graphics. Recently, differentiable-rendering-based SLAM systems have demonstrated significant potential, enabling photorealistic scene rendering through learnable scene representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Current differentiable rendering methods face dual challenges in real-time computation and sensor noise sensitivity, leading to degraded geometric fidelity in scene reconstruction and limited practicality. To address these challenges, we propose a novel real-time system, EGG-Fusion, featuring robust sparse-to-dense camera tracking and a geometry-aware Gaussian surfel mapping module, introducing an information filter-based fusion method that explicitly accounts for sensor noise to achieve high-precision surface reconstruction. The proposed differentiable Gaussian surfel mapping effectively models multi-view consistent surfaces while enabling efficient parameter optimization. Extensive experimental results demonstrate that the proposed system achieves a surface reconstruction error of 0.6 cm on standardized benchmark datasets including Replica and ScanNet++, representing over 20% improvement in accuracy compared to state-of-the-art (SOTA) GS-based methods. Notably, the system maintains real-time processing capabilities at 24 FPS, establishing it as one of the most accurate differentiable-rendering-based real-time reconstruction systems. Project Page: https://zju3dv.github.io/eggfusion/
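The idea behind information-filter fusion can be sketched in one dimension: each noisy measurement contributes information equal to its inverse variance, information adds across measurements, and the fused estimate is the information-weighted mean. The example below is a deliberately simplified illustration that assumes independent Gaussian noise on scalar depth values; the surfel state actually fused in EGG-Fusion is richer, and the function name and numbers are hypothetical.

```python
def fuse_measurements(measurements):
    """Fuse (value, variance) pairs in information form.

    Returns the fused value and its variance. Assumes independent Gaussian noise.
    """
    information = 0.0         # sum of inverse variances
    information_vector = 0.0  # sum of value / variance
    for value, variance in measurements:
        information += 1.0 / variance
        information_vector += value / variance
    return information_vector / information, 1.0 / information


# Two hypothetical depth readings (meters) of the same surface point:
# a noisy one (variance 0.04) and a more precise one (variance 0.01).
depth, variance = fuse_measurements([(2.0, 0.04), (2.2, 0.01)])
print(depth, variance)
```

The fused depth lands closer to the more precise reading, and the fused variance is smaller than either input variance, which is how accounting for sensor noise explicitly turns repeated observations into a sharper surface estimate.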

[228] TBT-Former: Learning Temporal Boundary Distributions for Action Localization

Thisara Rathnayaka,Uthayasanker Thayasivam

Main category: cs.CV

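TL;DR: This paper proposes the Temporal Boundary Transformer (TBT-Former). With a stronger Transformer backbone, cross-scale feature fusion, and a boundary regression head based on distribution learning, it addresses ambiguous boundaries and the difficulty of fusing multi-scale context in temporal action localization, achieving state-of-the-art performance on several benchmark datasets.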

Details Motivation: Existing single-stage, anchor-free models such as ActionFormer have difficulty precisely localizing action instances with ambiguous boundaries and lack an effective mechanism for fusing multi-scale contextual information. Method: TBT-Former introduces three core improvements: (1) a stronger Transformer backbone with more attention heads and a larger MLP dimension; (2) a top-down feature pyramid network (FPN) with lateral connections for cross-scale feature fusion; and (3) a boundary distribution regression head inspired by Generalized Focal Loss (GFL), which recasts boundary regression as probability distribution learning to model uncertainty. Result: TBT-Former sets a new state of the art on the THUMOS14 and EPIC-Kitchens 100 datasets and is competitive on ActivityNet-1.3. Conclusion: TBT-Former improves the performance of Transformer-based temporal action localization models, particularly in handling ambiguous boundaries and multi-scale context, and provides an effective framework for future research. Abstract: Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or "fuzzy" temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty.
Within the paradigm of Transformer-based architectures, TBT-Former advances the formidable benchmark set by its predecessors, establishing a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at https://github.com/aaivu/In21-S7-CS4681-AML-Research-Projects/tree/main/projects/210536K-Multi-Modal-Learning_Video-Understanding
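Recasting boundary regression as distribution learning, in the spirit of Generalized Focal Loss, means the head predicts logits over a set of discrete offset bins and the boundary offset is read out as the expectation of that distribution. The sketch below illustrates only this readout; the five bins, the logits, and the use of entropy as an uncertainty summary are assumptions for the example, not details taken from TBT-Former.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bin_values):
    """Boundary offset as the expectation of the predicted bin distribution."""
    return sum(p * v for p, v in zip(softmax(logits), bin_values))


def entropy(logits):
    """Shannon entropy (nats) of the predicted distribution, a simple uncertainty proxy."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


bins = [0.0, 1.0, 2.0, 3.0, 4.0]                # hypothetical offset bins
sharp_logits = [-20.0, -20.0, 20.0, -20.0, -20.0]  # confident, "crisp" boundary
flat_logits = [0.0, 0.0, 0.0, 0.0, 0.0]            # ambiguous, "fuzzy" boundary
print(expected_offset(sharp_logits, bins), entropy(sharp_logits))
print(expected_offset(flat_logits, bins), entropy(flat_logits))
```

Both distributions give the same expected offset, but their entropies differ sharply, which is the extra information a single regressed scalar cannot carry.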

[229] DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy

Jaewoo Song,Jooyoung Choi,Kanghyun Baek,Sangyub Lee,Daemin Park,Sungroh Yoon

Main category: cs.CV

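TL;DR: DCText is a training-free visual text generation method that follows a divide-and-conquer strategy: it decomposes the prompt and uses two attention masks (Text-Focus and Context-Expansion) to render text precisely within designated regions while keeping the overall image coherent.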

Details Motivation: Existing text-to-image models render long or multiple texts poorly because global attention becomes diluted. Method: DCText decomposes the prompt into target text segments and assigns each to a specific region, introduces two attention masks (Text-Focus and Context-Expansion), and combines them with Localized Noise Initialization to achieve precise text rendering progressively during denoising. Result: Experiments on single-sentence and multi-sentence benchmarks show that DCText achieves the highest text accuracy without sacrificing image quality and has the lowest generation latency. Conclusion: DCText effectively addresses attention dilution in long-text and multi-text rendering and significantly improves the accuracy and efficiency of text generation. Abstract: Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.

[230] Gaussian Swaying: Surface-Based Framework for Aerodynamic Simulation with 3D Gaussians

Hongru Yan,Xiang Zhang,Zeyuan Chen,Fangyin Wei,Zhuowen Tu

Main category: cs.CV

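TL;DR: This paper presents Gaussian Swaying, a surface-based framework that models surfaces continuously with 3D Gaussians for efficient, fine-grained aerodynamic simulation; it unifies simulation and rendering and achieves state-of-the-art performance and efficiency.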

Details Motivation: Realistic natural motion in vision and graphics requires effective aerodynamic simulation, but existing mesh-based methods are costly and particle-based methods rely on discrete data. Method: The Gaussian Swaying framework represents surfaces with 3D Gaussians (Gaussian patches), giving a continuous model that supports force computation and lightweight shading and unifies simulation and rendering. Result: Experiments on multiple synthetic and real-world datasets show state-of-the-art performance and efficiency across several metrics. Conclusion: Gaussian Swaying offers a scalable, efficient, and fine-grained approach to aerodynamic scene simulation for realistic visual applications. Abstract: Branches swaying in the breeze, flags rippling in the wind, and boats rocking on the water all show how aerodynamics shape natural motion -- an effect crucial for realism in vision and graphics. In this paper, we present Gaussian Swaying, a surface-based framework for aerodynamic simulation using 3D Gaussians. Unlike mesh-based methods that require costly meshing, or particle-based approaches that rely on discrete positional data, Gaussian Swaying models surfaces continuously with 3D Gaussians, enabling efficient and fine-grained aerodynamic interaction. Our framework unifies simulation and rendering on the same representation: Gaussian patches, which support force computation for dynamics while simultaneously providing normals for lightweight shading. Comprehensive experiments on both synthetic and real-world datasets across multiple metrics demonstrate that Gaussian Swaying achieves state-of-the-art performance and efficiency, offering a scalable approach for realistic aerodynamic scene simulation.

[231] Lost in Distortion: Uncovering the Domain Gap Between Computer Vision and Brain Imaging - A Study on Pretraining for Age Prediction

Yanteng Zhang,Songheng Li,Zeyu Shen,Qizhen Lan,Lipei Zhang,Yang Liu,Vince Calhoun

Main category: cs.CV

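TL;DR: This study examines how data quality affects pretraining on large-scale brain imaging data. Scans of different quality levels lead to significantly different performance on downstream tasks such as brain age prediction, which underlines the need for domain-aware data curation when building reliable, generalizable domain foundation models.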

Details Motivation: Brain imaging data are often heterogeneous in quality, and pretraining practices from the natural-image domain may not carry over to clinical neuroimaging, so the effect of data quality on pretraining needs systematic evaluation. Method: Models are pretrained on brain imaging datasets of different quality levels and then fine-tuned for brain age prediction on external cohorts to measure the effect of data quality on downstream performance. Result: Different quality levels lead to significant performance differences; low-quality or noisy scans can harm model learning, but the results also reveal some potential value in using them. Conclusion: Computer vision data practices cannot be applied directly to clinical neuroimaging; data selection and curation must draw on domain knowledge to build trusted, generalizable foundation models for brain science. Abstract: Large-scale brain imaging datasets provide unprecedented opportunities for developing domain foundation models through pretraining. However, unlike natural image datasets in computer vision, these neuroimaging data often exhibit high heterogeneity in quality, ranging from well-structured scans to severely distorted or incomplete brain volumes. This raises a fundamental question: can noise or low-quality scans contribute meaningfully to pretraining, or do they instead hinder model learning? In this study, we systematically explore the role of data quality level in pretraining and its impact on downstream tasks. Specifically, we perform pretraining on datasets with different quality levels and perform fine-tuning for brain age prediction on external cohorts. Our results show significant performance differences across quality levels, revealing both opportunities and limitations. We further discuss the gap between computer vision practices and clinical neuroimaging standards, emphasizing the necessity of domain-aware curation to ensure trusted and generalizable domain-specific foundation models.

[232] IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval

Ning Han,Yawen Zeng,Shaohua Long,Chengqing Li,Sijie Yang,Dun Tan,Jianfeng Dong,Jingjing Chen

Main category: cs.CV

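TL;DR: This paper introduces the Interactive Video Corpus Retrieval (IVCR) task and the IVCR-200K dataset, and builds an explainable interactive retrieval framework on multi-modal large language models.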

Details Motivation: Existing video retrieval systems lack meaningful interaction with users, and a one-way retrieval paradigm struggles to meet users' personalized and dynamic needs. Method: The paper proposes the IVCR task and the IVCR-200K dataset, and designs a comprehensive framework based on multi-modal large language models that supports multi-turn dialogue and several interaction modes. Result: Experiments show that the proposed dataset and framework are effective for interactive video retrieval. Conclusion: The IVCR task and framework point to a new direction for more natural, personalized video retrieval. Abstract: In recent years, significant developments have been made in both video retrieval and video moment retrieval tasks, which respectively retrieve complete videos or moments for a given text query. These advancements have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful "interaction" between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalization and dynamic needs of at least 80.8% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions. Extensive experiments demonstrate the effectiveness of our dataset and framework.

[233] TokenPure: Watermark Removal through Tokenized Appearance and Structural Guidance

Pei Yang,Yepeng Liu,Kelly Peng,Yuan Gao,Yiren Song

Main category: cs.CV

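TL;DR: This paper proposes TokenPure, a Diffusion Transformer-based framework for effective and consistent digital watermark removal. Through token-based conditional reconstruction, it removes watermarks while preserving content consistency and structural integrity.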

Details Motivation: In the digital economy era, digital watermarking is critical for proving ownership of massive amounts of replicable content, and designing robust watermarks that withstand various attacks and processing operations has become even more important. This calls for a method that removes watermarks thoroughly while keeping content consistent. Method: TokenPure reframes watermark removal as conditional generation, entirely bypassing the initial watermark-carrying noise. It decomposes the watermarked image into two complementary token sets: visual tokens for texture and structural tokens for geometry. These token sets jointly condition the diffusion process, so the framework can synthesize watermark-free images while maintaining fine-grained consistency and structural integrity. Result: Comprehensive experiments show that TokenPure achieves state-of-the-art watermark removal and reconstruction fidelity, substantially outperforming existing baselines in both perceptual quality and consistency. Conclusion: TokenPure offers an effective way to resolve the trade-off between watermark removal and content consistency, and provides a new perspective and technique for digital content protection. Abstract: In the digital economy era, digital watermarking serves as a critical basis for ownership proof of massive replicable content, including AI-generated and other virtual assets. Designing robust watermarks capable of withstanding various attacks and processing operations is even more paramount. We introduce TokenPure, a novel Diffusion Transformer-based framework designed for effective and consistent watermark removal. TokenPure solves the trade-off between thorough watermark destruction and content consistency by leveraging token-based conditional reconstruction. It reframes the task as conditional generation, entirely bypassing the initial watermark-carrying noise. We achieve this by decomposing the watermarked image into two complementary token sets: visual tokens for texture and structural tokens for geometry. These tokens jointly condition the diffusion process, enabling the framework to synthesize watermark-free images with fine-grained consistency and structural integrity. Comprehensive experiments show that TokenPure achieves state-of-the-art watermark removal and reconstruction fidelity, substantially outperforming existing baselines in both perceptual quality and consistency.

[234] FOD-S2R: A FOD Dataset for Sim2Real Transfer Learning based Object Detection

Ashish Vashist,Qiranul Saadiyean,Suresh Sundaram,Chandra Sekhar Seelamantula

Main category: cs.CV

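TL;DR: This paper presents FOD-S2R, a new dataset of real and synthetic images of foreign object debris (FOD) inside an aircraft fuel tank. It aims to improve FOD detection in enclosed environments and to verify that synthetic data helps narrow the simulation-to-reality (Sim2Real) gap.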

Details Motivation: Existing datasets lack FOD data for enclosed, complex environments such as aircraft fuel tanks, and real data is expensive to collect and hard to annotate, so a dedicated dataset is needed, together with a study of how synthetic data improves detector performance. Method: The authors build FOD-S2R, a dataset of 3,114 real high-resolution images and 3,137 synthetic images generated with Unreal Engine, covering a range of fields of view, distances, lighting conditions, colors, and object sizes. They benchmark several state-of-the-art object detection models on it and evaluate how adding synthetic data affects detection in real scenes. Result: Experiments show that adding synthetic data significantly improves detection accuracy and generalization in real environments and narrows the Sim2Real gap. Conclusion: FOD-S2R provides an important foundation for automated FOD detection systems in aviation maintenance and confirms the practical value of synthetic data for detection in enclosed industrial environments. Abstract: Foreign Object Debris (FOD) within aircraft fuel tanks presents critical safety hazards including fuel contamination, system malfunctions, and increased maintenance costs. Despite the severity of these risks, there is a notable lack of dedicated datasets for the complex, enclosed environments found inside fuel tanks. To bridge this gap, we present a novel dataset, FOD-S2R, composed of real and synthetic images of FOD within a simulated aircraft fuel tank. Unlike existing datasets that focus on external or open-air environments, our dataset is the first to systematically evaluate the effectiveness of synthetic data in enhancing real-world FOD detection performance in confined, closed structures. The real-world subset consists of 3,114 high-resolution HD images captured in a controlled fuel tank replica, while the synthetic subset includes 3,137 images generated using Unreal Engine. The dataset covers various fields of view (FOV), object distances, lighting conditions, colors, and object sizes. Prior research has demonstrated that synthetic data can reduce reliance on extensive real-world annotations and improve the generalizability of vision models. Thus, we benchmark several state-of-the-art object detection models and demonstrate that introducing synthetic data improves detection accuracy and generalization to real-world conditions. These experiments demonstrate the effectiveness of synthetic data in enhancing model performance and narrowing the Sim2Real gap, providing a valuable foundation for developing automated FOD detection systems for aviation maintenance.

[235] Rethinking Intracranial Aneurysm Vessel Segmentation: A Perspective from Computational Fluid Dynamics Applications

Feiyang Xiao,Yichi Zhang,Xigui Li,Yuanye Zhou,Chen Jiang,Xin Guo,Limei Han,Yuxin Li,Fengping Zhu,Yuan Cheng

Main category: cs.CV

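TL;DR: This paper presents IAVS, a new dataset for segmenting intracranial aneurysms and their parent vessels. It contains 641 3D MRA images with 587 annotations and is the first to include hemodynamic analysis results, making segmentation outputs more applicable to CFD.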

Details Motivation: Existing segmentation methods focus mostly on image-level evaluation metrics and overlook effectiveness in real clinical applications such as computational fluid dynamics (CFD); high-quality datasets with topological integrity and CFD applicability are missing. Method: The authors build the multi-center IAVS dataset with image-mask pairs and detailed hemodynamic analysis results; design a two-stage evaluation benchmark (global aneurysm localization and fine-grained segmentation) with a simple, effective two-stage framework; and establish a standardized CFD applicability evaluation system that automatically converts segmentation masks into CFD models. Result: IAVS is the first intracranial aneurysm vessel segmentation dataset that supports CFD applicability evaluation; the two-stage framework serves as an out-of-the-box method and a strong baseline; and the CFD evaluation system provides automated, consistent validation of segmentation results. Conclusion: The work bridges the gap between medical image segmentation and clinical hemodynamic analysis and advances application-oriented segmentation. Abstract: The precise segmentation of intracranial aneurysms and their parent vessels (IA-Vessel) is a critical step for hemodynamic analyses, which mainly depend on computational fluid dynamics (CFD). However, current segmentation methods predominantly focus on image-based evaluation metrics, often neglecting their practical effectiveness in subsequent CFD applications. To address this deficiency, we present the Intracranial Aneurysm Vessel Segmentation (IAVS) dataset, the first comprehensive, multi-center collection comprising 641 3D MRA images with 587 annotations of aneurysms and IA-Vessels. In addition to image-mask pairs, the IAVS dataset includes detailed hemodynamic analysis outcomes, addressing the limitations of existing datasets that neglect topological integrity and CFD applicability. To facilitate the development and evaluation of clinically relevant techniques, we construct two evaluation benchmarks, global localization of aneurysms (Stage I) and fine-grained segmentation of IA-Vessel (Stage II), and develop a simple and effective two-stage framework, which can be used as an out-of-the-box method and strong baseline. For comprehensive evaluation of the applicability of segmentation results, we establish a standardized CFD applicability evaluation system that enables the automated and consistent conversion of segmentation masks into CFD models, offering an applicability-focused assessment of segmentation outcomes. The dataset, code, and model will be publicly available at https://github.com/AbsoluteResonance/IAVS.

[236] Optimizing Stroke Risk Prediction: A Machine Learning Pipeline Combining ROS-Balanced Ensembles and XAI

A S M Ahsanul Sarkar Akib,Raduana Khawla,Abdul Hasib

Main category: cs.CV

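TL;DR: This study proposes a machine learning framework for stroke risk prediction that combines ensemble learning with explainable AI (XAI). Using feature engineering, data preprocessing, and Random Over-Sampling (ROS) to address class imbalance, it reaches 99.09% accuracy on the Stroke Prediction Dataset, and LIME analysis identifies three key clinical variables (age, hypertension, and glucose level), improving interpretability and clinical applicability.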

Details Motivation: Stroke is a major global health problem, and early risk assessment is essential for timely intervention and effective prevention, yet existing methods fall short on accuracy and model interpretability. Method: Ten machine learning models are evaluated with 5-fold cross-validation, combined with feature engineering and data preprocessing (Random Over-Sampling, ROS, to address class imbalance); an ensemble of Random Forest, ExtraTrees, and XGBoost is built and optimized, and LIME is used for interpretability analysis to identify key features. Result: The proposed ensemble reaches 99.09% accuracy on the Stroke Prediction Dataset (SPD), and LIME analysis identifies age, hypertension, and glucose level as the three most important clinical predictors. Conclusion: Combining ensemble learning with explainable AI can significantly improve the accuracy and transparency of stroke risk prediction and supports data-driven, personalized clinical decisions and cardiovascular risk management. Abstract: Stroke is a major cause of death and permanent impairment, making it a major worldwide health concern. For prompt intervention and successful preventative tactics, early risk assessment is essential. To address this challenge, we used ensemble modeling and explainable AI (XAI) techniques to create an interpretable machine learning framework for stroke risk prediction. A thorough evaluation of 10 different machine learning models using 5-fold cross-validation across several datasets was part of our all-inclusive strategy, which also included feature engineering and data preprocessing (using Random Over-Sampling (ROS) to address class imbalance). Our optimized ensemble model (Random Forest + ExtraTrees + XGBoost) performed exceptionally well, obtaining a strong 99.09% accuracy on the Stroke Prediction Dataset (SPD). We improved the model's transparency and clinical applicability by identifying three important clinical variables using LIME-based interpretability analysis: age, hypertension, and glucose levels. Through early prediction, this study highlights how combining ensemble learning with explainable AI (XAI) can deliver highly accurate and interpretable stroke risk assessment. By enabling data-driven prevention and personalized clinical decisions, our framework has the potential to transform stroke prediction and cardiovascular risk management.
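
The two generic ingredients named in the abstract, Random Over-Sampling and a soft-voting ensemble, can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: `random_oversample` and `soft_vote` are hypothetical helper names, the data is a toy example, and the classifiers themselves are omitted.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def soft_vote(prob_lists):
    """Average the positive-class probabilities predicted by several models (soft voting)."""
    n_models = len(prob_lists)
    return [sum(p) / n_models for p in zip(*prob_lists)]


# Toy data (assumption): 8 negative and 2 positive samples with one feature each.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

A common caution with this technique: when oversampling is combined with cross-validation, it is usually applied only to the training folds, because duplicating samples before splitting can place copies of the same record in both training and validation data and inflate accuracy.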

[237] AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

Yexin Liu,Wen-Jie Shu,Zile Huang,Haoze Zheng,Yueze Wang,Manyuan Zhang,Ser-Nam Lim,Harry Yang

Main category: cs.CV

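TL;DR: This paper proposes AlignVid, a training-free video generation framework that improves semantic fidelity in text-guided image-to-video generation, in particular by reducing semantic negligence when the prompt calls for substantial transformations of the input image.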

Details Motivation: Existing methods fall short in following fine-grained prompt semantics and show semantic negligence, especially when prompts involve substantial image transformations such as adding, deleting, or modifying objects. The authors aim to improve prompt fidelity of generated videos with a simple, effective mechanism. Method: Motivated by the observation that Gaussian-blurring the input improves semantic alignment, and by an analysis of attention maps and energy distribution, the authors propose AlignVid, which has two parts: (i) Attention Scaling Modulation (ASM), which reweights attention through lightweight Q/K scaling, and (ii) Guidance Scheduling (GS), which applies ASM selectively across Transformer blocks and denoising steps to limit visual quality degradation. Result: AlignVid significantly improves semantic fidelity without any training. The authors also build the OmitI2V evaluation set (367 human-annotated samples) to assess semantic negligence. Experiments show better adherence to addition, deletion, and modification prompts. Conclusion: With a minimal intervention in the attention mechanism, AlignVid mitigates semantic negligence in TI2V while preserving visual quality, offering a new approach to training-free semantic alignment. Abstract: Text-guided image-to-video (TI2V) generation has recently achieved remarkable progress, particularly in maintaining subject consistency and temporal coherence. However, existing methods still struggle to adhere to fine-grained prompt semantics, especially when prompts entail substantial transformations of the input image (e.g., object addition, deletion, or modification), a shortcoming we term semantic negligence. In a pilot study, we find that applying a Gaussian blur to the input image improves semantic adherence. Analyzing attention maps, we observe clearer foreground-background separation. From an energy perspective, this corresponds to a lower-entropy cross-attention distribution. Motivated by this, we introduce AlignVid, a training-free framework with two components: (i) Attention Scaling Modulation (ASM), which directly reweights attention via lightweight Q or K scaling, and (ii) Guidance Scheduling (GS), which applies ASM selectively across transformer blocks and denoising steps to reduce visual quality degradation. This minimal intervention improves prompt adherence while limiting aesthetic degradation. In addition, we introduce OmitI2V to evaluate semantic negligence in TI2V generation, comprising 367 human-annotated samples that span addition, deletion, and modification scenarios. Extensive experiments demonstrate that AlignVid can enhance semantic fidelity.
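
The link the abstract draws between Q or K scaling and a lower-entropy attention distribution can be checked on a toy example. The sketch below is an illustration under stated assumptions, not AlignVid's ASM: the query, keys, and the `scale` parameter are hypothetical, and it only shows that multiplying the query by a factor greater than 1 sharpens a softmax attention row.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def attention_row(q, keys, scale=1.0):
    """Attention weights of one query over all keys, with the query multiplied by `scale`."""
    d = len(q)
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    return softmax(logits)


# Toy query and keys (assumption): the first key matches the query best.
q = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
base = attention_row(q, keys)
sharp = attention_row(q, keys, scale=2.0)
print(entropy(base), entropy(sharp))  # the scaled row has lower entropy
```

Scaling K instead of Q by the same factor gives identical logits here, since the dot product is linear in both arguments.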

[238] EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans

Yingjie Zhou,Xilei Zhu,Siyu Ren,Ziyi Zhao,Ziwen Wang,Farong Wen,Yu Zhou,Jiezhang Cao,Xiongkuo Min,Fengjiao Chen,Xiaoyu Li,Xuezhi Cao,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

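TL;DR: This paper presents THQA-MT, the first large-scale quality assessment dataset for Multi-Talker-generated talking humans, and EvalTalker, a new quality assessment framework that evaluates Multi-Talker-generated videos effectively and correlates markedly better with subjective scores.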

Details Motivation: Current Multi-Talker generation shows noticeable quality degradation, and the lack of effective quality assessment holds back the development of high-quality Multi-Talker models. Method: The authors build THQA-MT, a large-scale dataset of 5,492 Multi-Talker-generated videos, analyze perceptual differences between methods through subjective experiments, and identify 12 common distortion types. They propose the EvalTalker framework, which assesses quality by combining global quality, human characteristics, identity consistency, and multimodal synchrony (Qwen-Sync). Result: EvalTalker correlates strongly with subjective scores and clearly outperforms existing assessment methods, confirming its effectiveness for Multi-Talker quality assessment. Conclusion: EvalTalker provides a reliable evaluation basis for research on high-quality Multi-Talker generation and advances the field. Abstract: Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in multi-subject driving capabilities. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore, we introduce EvalTalker, a novel TH quality assessment framework. This framework can perceive global quality, human characteristics, and identity consistency, while integrating Qwen-Sync to perceive multimodal synchrony. Experimental results demonstrate that EvalTalker achieves superior correlation with subjective scores, providing a robust foundation for future research on high-quality Multi-Talker generation and evaluation.

[239] InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision

Chenting Wang,Yuhan Zhu,Yicheng Xu,Jiange Yang,Ziang Yan,Yali Wang,Yi Wang,Limin Wang

Main category: cs.CV

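TL;DR: This paper proposes InternVideo-Next, a two-stage masked video modeling method built on an Encoder-Predictor-Decoder (EPD) framework. By decoupling the traditional encoder-decoder design and introducing a conditional diffusion decoder with semantic priors, it resolves the conflict between pixel-level reconstruction and semantic abstraction and achieves state-of-the-art video representation learning on unlabeled video.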

Details Motivation: Large-scale video-text pretraining depends on noisy synthetic captions and fails to model implicit world knowledge such as object motion, 3D geometry, and physical cues. Masked video modeling exploits spatiotemporal structure directly but lags on general tasks because of architectural issues (the conflict between pixel reconstruction and semantics, and shortcut learning). Method: The EPD framework separates the encoder from the predictor and uses two-stage pretraining. Stage 1 uses a conditional diffusion decoder with image-level semantic priors to build a latent space that is both pixel-faithful and semantically consistent. Stage 2 predicts frozen Stage 1 targets within this space to learn implicit world knowledge and mitigate shortcut learning. Result: The method reaches state-of-the-art performance on multiple benchmarks while training only on public, unlabeled video, supporting its effectiveness and scalability. Conclusion: Through architectural changes and two-stage pretraining, InternVideo-Next bridges the gap between low-level pixel reconstruction and high-level semantic understanding in masked video modeling and offers a scalable path toward general video representation learning. Abstract: Large-scale video-text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder-decoder design into an Encoder-Predictor-Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preserving latent space for this world model. First, the conventional linear decoder in pixel MVM forces the predictor's output latent to be linearly projected to, and thus separable in, pixel space, which conflicts with semantic abstraction. Our Stage 1 proposes a conditional diffusion decoder and injects reliable image-level semantic priors to enhance semantics and convergence, thus bridging pixel-level fidelity with high-level semantic abstraction. Stage 2 further learns world knowledge by predicting frozen Stage 1 targets within this space, mitigating shortcut learning. Trained on public, unlabeled videos, InternVideo-Next achieves state-of-the-art results across benchmarks and provides a scalable path toward general video representation learning.

[240] Handwritten Text Recognition for Low Resource Languages

Sayantan Dey,Alireza Alaei,Partha Pratim Roy

Main category: cs.CV

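TL;DR: This paper presents BharatOCR, a segmentation-free, paragraph-level recognition model for handwritten Hindi and Urdu text. It combines a ViT, a Transformer decoder, and a pre-trained language model (RoBERTa) and achieves leading character recognition rates on several datasets.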

Details Motivation: Low-resource languages such as Hindi and Urdu lack sufficient linguistic resources, and paragraph-level handwritten text recognition remains challenging, so dedicated methods are needed to improve OCR performance. Method: The proposed ViT-Transformer Decoder-LM architecture uses a ViT to extract visual features, a Transformer decoder to generate text sequences, and a RoBERTa language model to refine the output; DeiT is adopted for more efficient image modeling, and paragraph images are processed line by line through implicit line segmentation. Result: The model reaches character recognition rates of 96.24%, 92.05%, and 94.80% on the NUST-UHWR, PUCIT-OUHL, and Parimal-Urdu datasets, respectively, and 80.64% on the Parimal-Hindi dataset, outperforming several existing state-of-the-art methods. Conclusion: The model performs strongly on low-resource handwritten text recognition, confirms the value of combining pre-trained vision and language models, and advances handwritten OCR for Indic scripts. Abstract: Despite considerable progress in handwritten text recognition, paragraph-level handwritten text recognition, especially in low-resource languages such as Hindi, Urdu, and similar scripts, remains a challenging problem. These languages, often lacking comprehensive linguistic resources, require special attention to develop robust systems for accurate optical character recognition (OCR). This paper introduces BharatOCR, a novel segmentation-free, paragraph-level recognition model for handwritten Hindi and Urdu text. We propose a ViT-Transformer Decoder-LM architecture for handwritten text recognition, where a Vision Transformer (ViT) extracts visual features, a Transformer decoder generates text sequences, and a pre-trained language model (LM) refines the output to improve accuracy, fluency, and coherence. Our model utilizes a Data-efficient Image Transformer (DeiT) model proposed for masked image modeling in this research work. In addition, we adopt a RoBERTa architecture optimized for masked language modeling (MLM) to enhance the linguistic comprehension and generative capabilities of the proposed model. The transformer decoder generates text sequences from visual embeddings. This model is designed to iteratively process a paragraph image line by line, called implicit line segmentation. The proposed model was evaluated using our custom datasets ('Parimal Urdu' and 'Parimal Hindi'), introduced in this research work, as well as two public datasets. The proposed model achieved benchmark results on the NUST-UHWR, PUCIT-OUHL, and Parimal-Urdu datasets, achieving character recognition rates of 96.24%, 92.05%, and 94.80%, respectively. The model also provided benchmark results on the Hindi dataset, achieving a character recognition rate of 80.64%. The results obtained from our proposed model indicate that it outperforms several state-of-the-art Urdu text recognition methods.
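
The abstract reports character recognition rates without defining the metric. A common convention, assumed here and not confirmed by the source, is one minus the character error rate, where the error rate is the Levenshtein edit distance divided by the reference length. A minimal sketch with hypothetical function names:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_recognition_rate(ref, hyp):
    """1 - CER, assuming CER = edit distance / reference length (an assumed definition)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


print(character_recognition_rate("abcd", "abxd"))  # one substitution in four characters
```

Python strings are sequences of Unicode code points, so for Devanagari or Perso-Arabic text this counts code points, not grapheme clusters; whether the paper counts one or the other is not stated.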

[241] OpenBox: Annotate Any Bounding Boxes in 3D

In-Jae Lee,Mungyeom Kim,Kwonyoung Ryu,Pierre Musacchio,Jaesik Park

Main category: cs.CV

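TL;DR: The paper proposes OpenBox, a two-stage automatic annotation pipeline that needs no self-training. It uses a 2D vision foundation model to generate high-quality 3D bounding boxes that distinguish rigidity and motion state, improving unsupervised open-vocabulary 3D object detection.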

Details Motivation: Existing unsupervised open-vocabulary 3D detection methods annotate bounding boxes uniformly, ignore objects' physical states, and rely on multiple rounds of self-training, which leads to high computational cost and low annotation quality and falls short of the safety and scalability requirements of autonomous driving. Method: A two-stage automatic annotation pipeline: the first stage uses cross-modal instance alignment to associate instance cues extracted by a 2D vision foundation model with the 3D point cloud; the second stage categorizes instances by rigidity and motion state and generates adaptive bounding boxes using class-specific size statistics. Result: Experiments on the Waymo, Lyft, and nuScenes datasets show higher accuracy and efficiency than baselines, with high-quality 3D annotations produced without self-training. Conclusion: By adding physical-state awareness and cross-modal alignment, OpenBox significantly improves annotation quality and efficiency for unsupervised open-vocabulary 3D detection and offers a more practical solution for real-world use. Abstract: Unsupervised and open-vocabulary 3D object detection has recently gained attention, particularly in autonomous driving, where reducing annotation costs and recognizing unseen objects are critical for both safety and scalability. However, most existing approaches uniformly annotate 3D bounding boxes, ignore objects' physical states, and require multiple self-training iterations for annotation refinement, resulting in suboptimal quality and substantial computational overhead. To address these challenges, we propose OpenBox, a two-stage automatic annotation pipeline that leverages a 2D vision foundation model. In the first stage, OpenBox associates instance-level cues from 2D images processed by a vision foundation model with the corresponding 3D point clouds via cross-modal instance alignment. In the second stage, it categorizes instances by rigidity and motion state, then generates adaptive bounding boxes with class-specific size statistics. As a result, OpenBox produces high-quality 3D bounding box annotations without requiring self-training. Experiments on the Waymo Open Dataset, the Lyft Level 5 Perception dataset, and the nuScenes dataset demonstrate improved accuracy and efficiency over baselines.

[242] BlinkBud: Detecting Hazards from Behind via Sampled Monocular 3D Detection on a Single Earbud

Yunzhe Li,Jiajun Yan,Yuzhou Wei,Kechen Liu,Yize Zhao,Chong Zhang,Hongzi Zhu,Li Lu,Shan Chang,Minyi Guo

Main category: cs.CV

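TL;DR: This paper presents BlinkBud, a system that uses a single earbud and a paired phone to detect, in real time, hazardous objects approaching from behind. It tracks objects in 3D from a small number of sampled camera images, combining a Kalman filter with a reinforcement-learning-based optimal image sampling strategy to achieve high accuracy at low power while compensating for the user's head movements.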

Details Motivation: Pedestrians and cyclists cannot notice vehicles approaching quickly from behind, which is a serious road safety hazard; existing techniques struggle to balance power consumption against tracking accuracy, and there is no real-time, low-power solution suited to earbuds. Method: BlinkBud captures sparse images with an earbud-mounted camera, predicts trajectories with a Kalman filter, and uses a reinforcement-learning-based optimal image sampling strategy to reduce power consumption; estimated pitch and yaw angles are used to correct depth estimation and align coordinate systems, removing the effect of head movement. Result: Prototype experiments show a mean power consumption of only 29.8 mW on the earbud and 702.6 mW on the phone, so the system is lightweight; hazard detection has an average false positive ratio (FPR) of 4.90% and false negative ratio (FNR) of 1.47%, indicating high accuracy. Conclusion: BlinkBud detects hazards approaching from behind with low power and high accuracy, can improve road safety for pedestrians and cyclists, and has practical potential. Abstract: Failing to be aware of speeding vehicles approaching from behind poses a huge threat to the road safety of pedestrians and cyclists. In this paper, we propose BlinkBud, which utilizes a single earbud and a paired phone to detect, online, hazardous objects approaching a user from behind. The core idea is to accurately track visually identified objects utilizing a small number of sampled camera images taken from the earbud. To minimize the power consumption of the earbud and the phone while guaranteeing the best tracking accuracy, a novel 3D object tracking algorithm is devised, integrating both a Kalman filter based trajectory estimation scheme and an optimal image sampling strategy based on reinforcement learning. Moreover, the impact of constant user head movements on the tracking accuracy is significantly reduced by leveraging the estimated pitch and yaw angles to correct the object depth estimation and align the camera coordinate system to the user's body coordinate system, respectively. We implement a prototype BlinkBud system and conduct extensive real-world experiments. Results show that BlinkBud is lightweight, with ultra-low mean power consumption of 29.8 mW and 702.6 mW on the earbud and smartphone, respectively, and can accurately detect hazards with a low average false positive ratio (FPR) and false negative ratio (FNR) of 4.90% and 1.47%, respectively.
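
The Kalman-filter trajectory estimation mentioned in the abstract can be illustrated with a textbook constant-velocity filter on a single distance axis. This is a generic sketch, not BlinkBud's tracker: the state layout (distance, closing velocity), the noise values `q` and `r`, and the simulated measurements are all assumptions.

```python
def predict(state, cov, dt, q=0.01):
    """Constant-velocity prediction step; can be called alone when no image is sampled."""
    p, v = state
    (p00, p01), (p10, p11) = cov
    new_state = (p + v * dt, v)
    new_cov = (
        (p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11),
        (p10 + dt * p11, p11 + q),
    )
    return new_state, new_cov


def update(state, cov, z, r=0.1):
    """Correct the prediction with a measured distance z (only position is observed)."""
    p, v = state
    (p00, p01), (p10, p11) = cov
    s = p00 + r                # innovation variance
    k0, k1 = p00 / s, p10 / s  # Kalman gain for position and velocity
    y = z - p                  # innovation
    new_state = (p + k0 * y, v + k1 * y)
    new_cov = (
        ((1 - k0) * p00, (1 - k0) * p01),
        (p10 - k1 * p00, p11 - k1 * p01),
    )
    return new_state, new_cov


# Simulated object (assumption): starts 20 m behind and closes at 2 m/s, one sample per second.
state, cov = (20.0, 0.0), ((10.0, 0.0), (0.0, 10.0))
for step in range(1, 11):
    state, cov = predict(state, cov, dt=1.0)
    state, cov = update(state, cov, z=20.0 - 2.0 * step)
print(state)  # estimated (distance, velocity)
```

Because `predict` and `update` are separate, a sampling policy could skip the camera for several steps and run only `predict`, which is the kind of trade-off between power and accuracy that the abstract's sampling strategy addresses; how BlinkBud actually schedules samples is not described in the source.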

[243] SRAM: Shape-Realism Alignment Metric for No Reference 3D Shape Evaluation

Sheng Liu,Tianyu Luan,Phani Nuney,Xuelu Feng,Junsong Yuan

Main category: cs.CV

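TL;DR: The paper proposes an LLM-based realism metric for 3D shapes. It encodes meshes as language tokens and uses a dedicated decoder to produce realism scores aligned with human perception.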

Details Motivation: Traditional 3D shape evaluation relies on a ground truth, but in practice realism depends on human perception more than on similarity to a reference model, so an evaluation method is needed that requires no ground truth and matches human judgment. Method: A mesh encoding approach converts 3D shapes into the language token space, with a large language model as the bridge; a dedicated realism decoder aligns the model output with human perception of realism. The authors also build RealismGrading, a human-annotated dataset that needs no ground truth. Result: In k-fold cross-validation across objects, the metric correlates strongly with human perception, outperforms existing methods, and generalizes well. Conclusion: The work demonstrates that large language models can be used to evaluate 3D shape realism and offers a new evaluation paradigm that is independent of ground truth and closer to human perception. Abstract: 3D generation and reconstruction techniques have been widely used in computer games, film, and other content creation areas. As these applications grow, there is a growing demand for 3D shapes that look truly realistic. Traditional evaluation methods rely on a ground truth to measure mesh fidelity. However, in many practical cases, a shape's realism does not depend on having a ground truth reference. In this work, we propose a Shape-Realism Alignment Metric that leverages a large language model (LLM) as a bridge between mesh shape information and realism evaluation. To achieve this, we adopt a mesh encoding approach that converts 3D shapes into the language token space. A dedicated realism decoder is designed to align the language model's output with human perception of realism. Additionally, we introduce a new dataset, RealismGrading, which provides human-annotated realism scores without the need for ground truth shapes. Our dataset includes shapes generated by 16 different algorithms on over a dozen objects, making it more representative of practical 3D shape distributions. We validate our metric's performance and generalizability through k-fold cross-validation across different objects. Experimental results show that our metric correlates well with human perceptions, outperforms existing methods, and generalizes well.
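
One plausible reading of "k-fold cross-validation across different objects" is a grouped split, in which all shapes of a given object fall into the same fold so that test objects are never seen during training. The sketch below illustrates that reading only; the function name, the round-robin fold assignment, and the object labels are assumptions, and the paper's exact protocol is not given in the abstract.

```python
def object_k_fold(object_ids, k):
    """Yield (train_idx, test_idx) pairs so that every object's samples stay in one fold."""
    unique = sorted(set(object_ids))
    if not 2 <= k <= len(unique):
        raise ValueError("k must be between 2 and the number of distinct objects")
    # Round-robin assignment of objects to folds (an arbitrary but deterministic choice).
    fold_of = {obj: i % k for i, obj in enumerate(unique)}
    for fold in range(k):
        test = [i for i, obj in enumerate(object_ids) if fold_of[obj] == fold]
        train = [i for i, obj in enumerate(object_ids) if fold_of[obj] != fold]
        yield train, test


# Hypothetical sample list: each entry is the object a generated shape depicts.
ids = ["chair", "chair", "lamp", "mug", "mug", "sofa"]
for train, test in object_k_fold(ids, k=3):
    print(sorted({ids[i] for i in test}), "held out")
```

Splitting by object, as opposed to by individual shape, is what makes the held-out score a measure of generalization to unseen objects.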

[244] Textured Geometry Evaluation: Perceptual 3D Textured Shape Metric via 3D Latent-Geometry Network

Tianyu Luan,Xuelu Feng,Zixin Zhu,Phani Nuney,Sheng Liu,Xuan Gong,David Doermann,Chunming Qiao,Junsong Yuan

Main category: cs.CV

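TL;DR: The paper proposes Textured Geometry Evaluation (TGE), a method that evaluates fidelity directly on textured 3D meshes, combining geometry and color information without rendering; it outperforms existing methods on a dataset of real-world distortions.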

Details Motivation: Existing 3D shape fidelity metrics such as Chamfer Distance align poorly with human perception; learning-based methods rely on rendered images and 2D metrics, are limited by viewpoint choice and incomplete structural coverage, and are mostly trained on synthetic distortions, which creates a domain gap. Method: TGE operates directly on textured 3D meshes and jointly uses geometry and color information to compute the fidelity of an input mesh relative to a reference shape; a human-annotated dataset with real-world distortions is built for training and evaluation. Result: Experiments show that TGE outperforms rendering-based and geometry-only methods on the real-world distortion dataset and aligns better with human perception. Conclusion: TGE is a more effective, more human-aligned fidelity metric for textured 3D meshes; by processing 3D meshes directly and fusing geometry with texture, it overcomes the limitations of existing methods. Abstract: Textured high-fidelity 3D models are crucial for games, AR/VR, and film, but human-aligned evaluation methods still fall behind despite recent advances in 3D reconstruction and generation. Existing metrics, such as Chamfer Distance, often fail to align with how humans evaluate the fidelity of 3D shapes. Recent learning-based metrics attempt to improve this by relying on rendered images and 2D image quality metrics. However, these approaches face limitations due to incomplete structural coverage and sensitivity to viewpoint choices. Moreover, most methods are trained on synthetic distortions, which differ significantly from real-world distortions, resulting in a domain gap. To address these challenges, we propose a new fidelity evaluation method that is based directly on 3D meshes with texture, without relying on rendering. Our method, named Textured Geometry Evaluation (TGE), jointly uses the geometry and color information to calculate the fidelity of the input textured mesh in comparison to a reference colored shape. To train and evaluate our metric, we design a human-annotated dataset with real-world distortions. Experiments show that TGE outperforms rendering-based and geometry-only methods on a real-world distortion dataset.
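
For reference, the Chamfer Distance that the abstract cites as a baseline metric can be computed as follows. Several variants are in use; this sketch assumes the symmetric form with squared Euclidean distances averaged over each point set, and it uses a brute-force nearest-neighbor search that is only suitable for small point sets.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance: mean nearest-neighbor squared distance in both directions."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(squared_distance(a, b) for b in points_b) for a in points_a) / len(points_a)
    b_to_a = sum(min(squared_distance(b, a) for a in points_a) for b in points_b) / len(points_b)
    return a_to_b + b_to_a


# Toy point sets (assumption): the second set is the first shifted by 1 along x.
reference = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(chamfer_distance(reference, shifted))
```

The metric looks only at point positions, so two meshes with identical geometry but very different textures score as identical, which is one reason a purely geometric metric can disagree with human judgments of textured shapes.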

[245] Reversible Inversion for Training-Free Exemplar-guided Image Editing

Yuke Li,Lianli Gao,Ji Zhang,Pengpeng Zeng,Lichuan Xiang,Hongkai Wen,Heng Tao Shen,Jingkuan Song

Main category: cs.CV

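TL;DR: The paper proposes Reversible Inversion (ReInversion), a training-free method for exemplar-guided image editing. With two-stage denoising and a mask-guided selective denoising strategy, it delivers efficient, high-quality edits while keeping the background structure consistent.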

Details Motivation: Existing methods rely on large-scale pre-training and are computationally expensive, while standard inversion techniques perform poorly for image editing, with low quality and low efficiency. Method: ReInversion uses a two-stage denoising process, conditioned first on the source image and then on the reference image; a Mask-Guided Selective Denoising (MSD) strategy restricts edits to target regions and preserves background structure. Result: In qualitative and quantitative comparisons, ReInversion achieves state-of-the-art EIE performance with the lowest computational overhead. Conclusion: ReInversion is an efficient, training-free image editing method that clearly outperforms standard inversion and other existing methods and is suited to high-quality exemplar-guided image editing. Abstract: Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce Reversible Inversion (ReInversion) for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. In addition, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.
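
The idea of constraining edits to a target region can be illustrated with plain mask blending, a generic operation used in many masked editing pipelines. The sketch below is not the paper's MSD strategy, whose details are not given in the abstract; `blend_with_mask` is a hypothetical helper that works on small 2D grids of floats standing in for image or latent values.

```python
def blend_with_mask(source, edited, mask):
    """Keep `source` where the mask is 0 and take `edited` where it is 1 (per element)."""
    return [
        [m * e + (1.0 - m) * s for s, e, m in zip(src_row, edit_row, mask_row)]
        for src_row, edit_row, mask_row in zip(source, edited, mask)
    ]


# Toy 2x2 example (assumption): only the right column is marked as editable.
source = [[1.0, 1.0], [1.0, 1.0]]
edited = [[5.0, 5.0], [5.0, 5.0]]
mask = [[0.0, 1.0], [0.0, 0.5]]
print(blend_with_mask(source, edited, mask))
```

In a diffusion setting, a blend of this kind would typically be applied at each denoising step so that unmasked regions keep following the source trajectory; fractional mask values give a soft boundary between edited and preserved areas.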

[246] PointNet4D: A Lightweight 4D Point Cloud Video Backbone for Online and Offline Perception in Robotic Applications

Yunze Liu,Zifan Wang,Peiran Wu,Jiayang Ao

Main category: cs.CV

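TL;DR: Proposes PointNet4D, a lightweight 4D backbone that combines the strengths of Mamba and Transformers for real-time point cloud video processing; it performs well across 9 tasks and is applied successfully in robotic systems.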

Details Motivation: Existing 4D models are computationally expensive and struggle to meet real-time and resource-constrained requirements; in particular, they lack efficient temporal modeling for dynamic 4D environments. Method: Designs a Hybrid Mamba-Transformer temporal fusion block that combines Mamba's efficient state-space modeling with the bidirectional modeling power of Transformers; proposes 4DMAP, a frame-wise masked auto-regressive pretraining strategy, to strengthen motion awareness. Result: Achieves consistent improvements on 9 tasks across 7 datasets, clearly outperforming existing methods; the 4D Diffusion Policy and 4D Imitation Learning systems built on it perform strongly on the RoboTwin and HandoverSim benchmarks. Conclusion: PointNet4D is an efficient, flexible 4D backbone for both online and offline settings and advances dynamic-environment modeling in real-time systems such as robotics. Abstract: Understanding dynamic 4D environments (3D space evolving over time) is critical for robotic and interactive systems. These applications demand systems that can process streaming point cloud video in real time, often under resource constraints, while also benefiting from past and present observations when available. However, current 4D backbone networks rely heavily on spatiotemporal convolutions and Transformers, which are often computationally intensive and poorly suited to real-time applications. We propose PointNet4D, a lightweight 4D backbone optimized for both online and offline settings. At its core is a Hybrid Mamba-Transformer temporal fusion block, which integrates the efficient state-space modeling of Mamba and the bidirectional modeling power of Transformers. This enables PointNet4D to handle variable-length online sequences efficiently across different deployment scenarios. To enhance temporal understanding, we introduce 4DMAP, a frame-wise masked auto-regressive pretraining strategy that captures motion cues across frames. Our extensive evaluations across 9 tasks on 7 datasets demonstrate consistent improvements across diverse domains. We further demonstrate PointNet4D's utility by building two robotic application systems: 4D Diffusion Policy and 4D Imitation Learning, achieving substantial gains on the RoboTwin and HandoverSim benchmarks.

[247] FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution

Seungho Choi,Jeahun Sung,Jihyong Oh

Main category: cs.CV

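TL;DR: Proposes FRAMER, a plug-and-play training framework that exploits diffusion priors to improve the reconstruction of high-frequency details in real-world image super-resolution (Real-ISR) without changing the network architecture or the inference procedure.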

Details Motivation: Diffusion models surpass GANs in perceptual quality, but a low-frequency bias and a depth-wise "low-first, high-later" hierarchy make it hard for them to reconstruct high-frequency details. A method is needed that improves high-frequency recovery while keeping these strengths. Method: Proposes FRAMER: at each denoising step, the final-layer feature map guides all intermediate layers; feature maps are decomposed into low- and high-frequency components via FFT masks and supervised separately; an Intra Contrastive Loss (IntraCL) stabilizes global structure for low frequencies, and an Inter Contrastive Loss (InterCL) sharpens instance-specific details for high frequencies; two adaptive modulators, FAW and FAM, dynamically reweight per-layer low- and high-frequency signals and gate the distillation. Result: Validated on U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), where FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ); ablations confirm the value of the final-layer teacher and random-layer negatives. Conclusion: Through frequency-aware feature distillation and contrastive learning, FRAMER mitigates the low-frequency bias of diffusion models on Real-ISR and clearly improves high-frequency detail reconstruction, while remaining general and plug-and-play. Abstract: Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.
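
The LF/HF decomposition via frequency masks can be illustrated on a toy 1D signal. The sketch below is not FRAMER's code: it assumes a binary mask with a hard cutoff, uses a naive pure-Python DFT in place of a 2D FFT over feature maps, and the names `split_bands` and `cutoff` are introduced here for illustration. It shows the property that makes separate supervision possible: the two bands are complementary and sum back to the input.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_bands(signal, cutoff):
    """Split a real 1D signal into low- and high-frequency parts.

    Bin k counts as low-frequency when its distance to the DC bin,
    min(k, n - k), is at most `cutoff`; the mask is symmetric, so both
    parts are real-valued.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_mask = [1.0 if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    low = idft([c * m for c, m in zip(spectrum, low_mask)])
    high = idft([c * (1.0 - m) for c, m in zip(spectrum, low_mask)])
    return low, high


signal = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
low, high = split_bands(signal, cutoff=1)
# The two bands reconstruct the input up to floating-point error.
print(max(abs(l + h - s) for l, h, s in zip(low, high, signal)) < 1e-9)
```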

[248] Rice-VL: Evaluating Vision-Language Models for Cultural Understanding Across ASEAN Countries

Tushar Pranav,Eshan Pandey,Austria Lyka Diane Bala,Aman Chadha,Indriyati Atmosukarto,Donny Soh Cheng Lock

Main category: cs.CV

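TL;DR: RICE-VL is a new benchmark for evaluating the cultural understanding of vision-language models across 11 ASEAN countries; it exposes performance gaps of existing models in low-resource countries and abstract cultural domains.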

Details Motivation: To address the Western-centric bias of existing vision-language models and improve their effectiveness in culturally diverse regions such as Southeast Asia. Method: Builds the RICE-VL benchmark with over 28,000 visual question answering samples and 1,000 image-bounding box pairs, and proposes the extended SEA-LAVE metric, which covers textual accuracy, cultural alignment, and country identification. Six mainstream VLMs are evaluated on several task types (fill-in-the-blank, true/false, open-ended answers, and visual grounding). Result: Existing vision-language models perform poorly in low-resource countries and on abstract cultural concepts, and the visual grounding task shows that their ability to localize cultural elements in complex scenes is limited. Conclusion: Current VLMs have clear limitations in cross-cultural understanding, and more inclusive model development is needed to serve diverse users worldwide. Abstract: Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples (covering True or False, Fill-in-the-Blank, and open-ended formats) and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models' ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs' cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.

[249] MDiff4STR: Mask Diffusion Model for Scene Text Recognition

Yongkun Du,Miaomiao Zhao,Songlin Fan,Zhineng Chen,Caiyan Jia,Yu-Gang Jiang

Main category: cs.CV

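TL;DR: Introduces Mask Diffusion Models (MDMs) to scene text recognition (STR) for the first time and proposes MDiff4STR, which uses six noising strategies and a token-replacement mechanism to address the training-inference noising gap and overconfident predictions, surpassing existing auto-regressive models in accuracy while keeping inference fast.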

Details Motivation: Although MDMs are more efficient than auto-regressive models (ARMs), their accuracy on STR falls short; this work aims to close that gap. Method: Proposes MDiff4STR, with six noising strategies that align training with inference and a token-replacement noise mechanism that mitigates overconfident but incorrect predictions. Result: MDiff4STR is validated on standard and challenging STR benchmarks covering diverse scenarios (irregular, artistic, occluded, and Chinese text, among others) and surpasses state-of-the-art ARMs with only three denoising steps. Conclusion: MDiff4STR substantially improves the performance of MDMs on STR, reaches a better balance between accuracy and efficiency, and sets a new state of the art for the task. Abstract: Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: a noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as settings with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy, while maintaining fast inference with only three denoising steps. Code: https://github.com/Topdu/OpenOCR.
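
To make the difference between mask noise and token-replacement noise concrete, here is a small sketch of a corruption function for a character sequence. It is a generic illustration, not one of the paper's six noising strategies: the function name `corrupt`, the `[MASK]` symbol, and the two ratio parameters are assumptions made for this example.

```python
import random

MASK = "[MASK]"


def corrupt(tokens, vocab, mask_ratio, replace_ratio, rng):
    """Return a noised copy of `tokens`.

    Each position is masked with probability `mask_ratio`, replaced by a
    different vocabulary token with probability `replace_ratio`, and left
    unchanged otherwise. The two ratios must sum to at most 1.
    """
    if mask_ratio + replace_ratio > 1.0:
        raise ValueError("mask_ratio + replace_ratio must not exceed 1")
    noised = []
    for tok in tokens:
        r = rng.random()
        if r < mask_ratio:
            noised.append(MASK)  # mask noise: the model knows this slot is missing
        elif r < mask_ratio + replace_ratio:
            # Replacement noise: a plausible-looking but wrong token, which the
            # model must learn to detect and revise.
            candidates = [v for v in vocab if v != tok]
            noised.append(rng.choice(candidates) if candidates else tok)
        else:
            noised.append(tok)
    return noised


rng = random.Random(0)
print(corrupt(list("street"), list("abcdefghijklmnopqrstuvwxyz"), 0.3, 0.2, rng))
```

A model trained only on mask noise never sees a wrong visible token, which is one intuition for why a non-mask noise type could help against overconfident errors.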

[250] ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models

Xusen Hei,Jiali Chen,Jinyu Yang,Mengchen Zhao,Yi Cai

Main category: cs.CV

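TL;DR: Proposes ViRectify, a comprehensive benchmark with over 30K instances for evaluating how well multimodal large language models correct video reasoning errors, together with a trajectory evidence-driven correction framework that focuses the model on error propagation and key timestamps.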

Details Motivation: Existing benchmarks lack a systematic evaluation of MLLMs' ability to identify and correct errors in complex video reasoning; a finer-grained evaluation is needed to expose model weaknesses and drive improvement. Method: Builds a large-scale dataset through AI-assisted annotation with human verification and defines tasks for step-wise error identification and rationale generation grounded in visual evidence; proposes a trajectory evidence-driven correction framework that combines step-wise error trajectory modeling with a reward on visual evidence grounding. Result: In an extensive evaluation of 16 advanced MLLMs, GPT-5 reaches only 31.94% correction accuracy; the framework enables Qwen2.5-VL-7B to consistently outperform the 72B variants, showing its effectiveness. Conclusion: ViRectify offers a challenging new testbed for evaluating video reasoning correction in MLLMs, reveals systematic differences in correction ability across models, and provides a valuable data resource for reflection learning. Abstract: As multimodal large language models (MLLMs) frequently exhibit errors in complex video reasoning scenarios, correcting these errors is critical for uncovering their weaknesses and improving performance. However, existing benchmarks lack systematic evaluation of MLLMs' ability to identify and correct these video reasoning errors. To bridge this gap, we propose ViRectify, a comprehensive benchmark to evaluate their fine-grained correction capability. Through an AI-assisted annotation pipeline with human verification, we construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. In addition, we propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction. It encourages the model to explicitly concentrate on error propagation and key timestamps for correction. Extensive evaluation across 16 advanced MLLMs demonstrates that ViRectify serves as a challenging testbed, where GPT-5 achieves only 31.94% correction accuracy. Our framework enables a Qwen2.5-VL-7B to consistently outperform the 72B variants on ViRectify, showing the effectiveness of our approach. Further analysis uncovers systematic asymmetries in error correction across models, and our dataset is also a valuable data resource to perform reflection learning.
We believe ViRectify provides a new direction for comprehensively evaluating advanced MLLMs in video reasoning.

[251] ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers

Yiyang Ma,Feng Zhou,Xuedan Yin,Pu Cao,Yonghao Dang,Jianqin Yin

Main category: cs.CV

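TL;DR: Proposes ResDiT, a training-free method for high-resolution image generation that uses position embedding scaling and a local-enhancement mechanism to fix the layout collapse and texture degradation that DiTs show at high resolution.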

Details Motivation: Existing high-resolution generation methods built on pre-trained DiTs often suffer from spatial layout collapse and degraded texture quality, depend on complex multi-stage pipelines, and lack a deeper understanding of the generative mechanism. Method: Analyzes the generative mechanism of DiTs and finds that position embeddings (PEs) encode incorrect positions when extrapolated to high resolution, which causes layout collapse; proposes a PE scaling technique to correct the positional encoding and a local-enhancement mechanism based on base-resolution local attention, combined with a patch-level fusion module and a Gaussian-weighted splicing strategy to improve detail fidelity. Result: ResDiT produces high-quality, artifact-free high-resolution images across a range of settings, avoids layout collapse and grid artifacts, and supports downstream tasks such as spatially controlled generation. Conclusion: ResDiT is a training-free, efficient, and scalable high-resolution generation method that improves the performance of DiTs on high-resolution synthesis through better positional encoding and local information fusion. Abstract: Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.

[252] Language-Guided Open-World Anomaly Segmentation

Klara Reichard,Nikolas Brasch,Nassir Navab,Federico Tombari

Main category: cs.CV

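TL;DR: Proposes Clipomaly, the first CLIP-based open-world and anomaly segmentation method; it segments and names unknown objects zero-shot without anomaly-specific training data, extends its vocabulary dynamically at inference time, and achieves state-of-the-art performance.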

Details Motivation: Existing open-world and anomaly segmentation methods cannot assign meaningful semantic labels to unknown regions and struggle to distinguish and learn representations for unknown classes; open-vocabulary segmentation methods are tied to a fixed vocabulary and cannot handle unconstrained unknown classes. Method: Proposes Clipomaly, which uses CLIP's shared image-text embedding space for zero-shot anomaly segmentation; it extends the vocabulary dynamically at inference time and generates interpretable semantic names for unknown objects. Result: Achieves state-of-the-art performance on standard anomaly segmentation benchmarks while supporting robust detection and naming of classes beyond common definitions such as those in Cityscapes. Conclusion: Clipomaly is the first open-world segmentation method that segments unknown objects and gives them interpretable names without anomaly-specific training, and it offers the practicality and flexibility needed for deployment. Abstract: Open-world and anomaly segmentation methods seek to enable autonomous driving systems to detect and segment both known and unknown objects in real-world scenes. However, existing methods do not assign semantically meaningful labels to unknown regions, and distinguishing and learning representations for unknown classes remains difficult. While open-vocabulary segmentation methods show promise in generalizing to novel classes, they require a fixed inference vocabulary and thus cannot be directly applied to anomaly segmentation where unknown classes are unconstrained. We propose Clipomaly, the first CLIP-based open-world and anomaly segmentation method for autonomous driving. Our zero-shot approach requires no anomaly-specific training data and leverages CLIP's shared image-text embedding space to both segment unknown objects and assign human-interpretable names to them. Unlike open-vocabulary methods, our model dynamically extends its vocabulary at inference time without retraining, enabling robust detection and naming of anomalies beyond common class definitions such as those in Cityscapes. Clipomaly achieves state-of-the-art performance on established anomaly segmentation benchmarks while providing interpretability and flexibility essential for practical deployment.
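
The role of a shared image-text embedding space can be sketched with toy vectors: a region is named by the vocabulary entry whose text embedding is most similar, and adding an entry changes the answer without any retraining. The vectors below are hand-made placeholders rather than CLIP outputs, and `name_region` is a hypothetical helper; how Clipomaly actually proposes and adds new vocabulary entries is not described in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def name_region(region_embedding, vocabulary):
    """Return the label whose text embedding is closest to the region embedding."""
    return max(vocabulary, key=lambda label: cosine(region_embedding, vocabulary[label]))


# Toy stand-ins for text embeddings of known classes.
vocabulary = {"car": [1.0, 0.0, 0.0], "road": [0.0, 1.0, 0.0]}
region = [0.1, 0.2, 0.9]  # toy stand-in for an image-region embedding

before = name_region(region, vocabulary)   # forced onto a known class
vocabulary["deer"] = [0.0, 0.1, 1.0]       # vocabulary extended at inference time
after = name_region(region, vocabulary)
print(before, after)
```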

[253] FastAnimate: Towards Learnable Template Construction and Pose Deformation for Fast 3D Human Avatar Animation

Jian Shu,Nanjie Yao,Gangjian Zhang,Junlong Ren,Yu Feng,Hao Wang

Main category: cs.CV

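TL;DR: Proposes a unified learning-based framework for 3D human avatar animation that generates templates quickly with a U-Net and improves deformation quality with a data-driven refinement method.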

Details Motivation: Existing methods are inefficient, artifact-prone, and structurally distorting in the template construction and pose deformation stages, which hurts animation realism. Method: Uses a U-Net architecture that decouples texture and pose information to generate an initial template quickly, and introduces a data-driven refinement technique that strengthens structural integrity in the target pose. Result: Experiments show strong performance across diverse poses, with a favorable balance between efficiency and quality that surpasses current state-of-the-art methods. Conclusion: The proposed unified framework addresses the key problems in template construction and deformation and clearly improves the quality and robustness of 3D human animation. Abstract: 3D human avatar animation aims at transforming a human avatar from an arbitrary initial pose to a specified target pose using deformation algorithms. Existing approaches typically divide this task into two stages: canonical template construction and target pose deformation. However, current template construction methods demand extensive skeletal rigging and often produce artifacts for specific poses. Moreover, target pose deformation suffers from structural distortions caused by Linear Blend Skinning (LBS), which significantly undermines animation realism. We propose a unified learning-based framework that tackles both challenges in two phases. For the former phase, to overcome the inefficiencies and artifacts during template construction, we leverage a U-Net architecture that decouples texture and pose information in a feed-forward process, enabling fast generation of a human template. For the latter phase, we propose a data-driven refinement technique that enhances structural integrity. Extensive experiments show that our model delivers consistent performance across diverse poses with an optimal balance between efficiency and quality, surpassing state-of-the-art (SOTA) methods.
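
The structural distortion attributed to Linear Blend Skinning follows from its definition: a skinned vertex is the weighted average of the positions each bone's rigid transform would give it, v' = sum_i w_i T_i v, and averaging positions does not preserve lengths. The 2D sketch below is a textbook illustration of that formula, not the paper's pipeline; the helper names and the two-bone setup are assumptions for this example.

```python
import math


def rotate_about(point, pivot, angle):
    """Rotate a 2D point around a pivot by `angle` radians."""
    x, y = point[0] - pivot[0], point[1] - pivot[1]
    c, s = math.cos(angle), math.sin(angle)
    return (pivot[0] + c * x - s * y, pivot[1] + s * x + c * y)


def linear_blend_skinning(vertex, bones, weights):
    """Blend the per-bone transformed positions of a vertex.

    bones: list of (pivot, angle) rigid 2D transforms.
    weights: non-negative skinning weights that sum to 1.
    """
    x = y = 0.0
    for (pivot, angle), w in zip(bones, weights):
        px, py = rotate_about(vertex, pivot, angle)
        x += w * px
        y += w * py
    return (x, y)


# A vertex at distance 1 from the joint, influenced equally by a bone that
# stays put and a bone rotated by 180 degrees around the joint.
bones = [((0.0, 0.0), 0.0), ((0.0, 0.0), math.pi)]
skinned = linear_blend_skinning((1.0, 0.0), bones, [0.5, 0.5])
# The blended vertex collapses onto the joint instead of staying at distance 1.
print(math.hypot(*skinned) < 1e-9)
```

This collapse near strongly rotated joints is the kind of artifact that learned or data-driven refinement is typically meant to correct.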

[254] CourtMotion: Learning Event-Driven Motion Representations from Skeletal Data for Basketball

Omer Sela,Michael Chertok,Lior Wolf

Main category: cs.CV

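TL;DR: Proposes CourtMotion, a spatiotemporal modeling framework built on skeletal tracking data that uses graph neural networks and a Transformer to predict events and tactical plays in basketball, clearly outperforming traditional position-only methods.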

Details Motivation: Traditional methods use only player positions, so they miss key motion semantics such as body orientation and defensive stance and cannot accurately predict complex basketball events. A model that combines motion detail with tactical intent is needed. Method: A two-stage approach: graph neural networks first process skeletal tracking data to capture fine-grained motion patterns, then a Transformer with specialized attention mechanisms models player interactions; event projection heads explicitly link movements to events such as passes, shots, and steals. Result: On NBA tracking data, trajectory prediction error drops by 35% relative to state-of-the-art position-based models, with clear gains on several downstream tasks (pick detection, assist prediction, shot type recognition, and others). Conclusion: CourtMotion fuses physical motion with tactical semantics and provides a strong foundation model for basketball analytics. Abstract: This paper presents CourtMotion, a spatiotemporal modeling framework for analyzing and predicting game events and plays as they develop in professional basketball. Anticipating basketball events requires understanding both physical motion patterns and their semantic significance in the context of the game. Traditional approaches that use only player positions fail to capture crucial indicators such as body orientation, defensive stance, or shooting preparation motions. Our two-stage approach first processes skeletal tracking data through Graph Neural Networks to capture nuanced motion patterns, then employs a Transformer architecture with specialized attention mechanisms to model player interactions. We introduce event projection heads that explicitly connect player movements to basketball events like passes, shots, and steals, training the model to associate physical motion patterns with their tactical purposes. Experiments on NBA tracking data demonstrate significant improvements over position-only baselines: 35% reduction in trajectory prediction error compared to state-of-the-art position-based models and consistent performance gains across key basketball analytics tasks. The resulting pretrained model serves as a powerful foundation for multiple downstream tasks, with pick detection, shot taker identification, assist prediction, shot location classification, and shot type recognition demonstrating substantial improvements over existing methods.

[255] ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling

Qisen Wang,Yifan Zhao,Peisen Shen,Jialu Li,Jia Li

Main category: cs.CV

TL;DR: Proposes ChronosObserver, a training-free method for generating high-fidelity, 3D-consistent, time-synchronized multi-view videos.

Details Motivation: Existing video generation models face 3D-consistency and time-synchronization problems when extended directly to multi-view video generation in 4D worlds. Method: Introduces a World State Hyperspace that represents the spatiotemporal constraints of a 4D scene, and Hyperspace Guided Sampling, which synchronizes the diffusion sampling trajectories of multiple views. Result: Experiments show that the method generates high-quality, 3D-consistent, time-synchronized multi-view videos without training or fine-tuning the diffusion model. Conclusion: ChronosObserver offers an efficient, scalable, training-free paradigm for 4D scene modeling. Abstract: Although prevailing camera-controlled video generation models can produce cinematic results, lifting them directly to the generation of 3D-consistent and high-fidelity time-synchronized multi-view videos, a pivotal capability for taming 4D worlds, remains challenging. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method including World State Hyperspace to represent the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling to synchronize the diffusion sampling trajectories of multiple views using the hyperspace. Experimental results demonstrate that our method achieves high-fidelity and 3D-consistent time-synchronized multi-view video generation without training or fine-tuning of diffusion models.

[256] A variational method for curve extraction with curvature-dependent energies

Majid Arthaud,Antonin Chambolle,Vincent Duval

Main category: cs.CV

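TL;DR: Proposes a variational method, based on energy discretization and Smirnov's decomposition theorem, for automatically extracting curves and one-dimensional structures from images; the method extends to curvature-dependent energies.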

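Details Motivation: Automatic, low-supervision extraction of curves and 1D structures from images requires a mathematical framework that can handle complex geometry. Method: Builds on the discretization of an energy functional and Smirnov's decomposition theorem for vector fields, combined with a bi-level minimization strategy; curvature-dependent energies are handled by lifting curves into the space of positions and orientations. Result: Extracts curve structures from images effectively, supports multiple endpoint configurations, and can incorporate curvature priors. Conclusion: The method provides a flexible and theoretically rigorous variational framework for extracting low-dimensional structures from images, with good extensibility and practicality. Abstract: We introduce a variational approach for extracting curves between a list of possible endpoints, based on the discretization of an energy and Smirnov's decomposition theorem for vector fields. It is used to design a mostly unsupervised bi-level minimization approach that automatically extracts curves and 1D structures from an image. We then extend the method to curvature-dependent energies, using a now classical lifting of the curves in the space of positions and orientations equipped with an appropriate sub-Riemannian or Finslerian metric.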

[257] ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark

Joanne Lin,Ruirui Lin,Yini Li,David Bull,Nantheera Anantrasirichai

Main category: cs.CV

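TL;DR: Proposes ELVIS, a framework that improves video instance segmentation models in low-light conditions by modeling spatiotemporal degradations, using a calibration-free degradation profile network, and adding an enhancement decoder head, which clearly raises segmentation accuracy.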

Details Motivation: Low-light videos suffer from noise, blur, and low contrast; existing datasets and synthesis methods are inadequate, and current VIS methods are not robust to low-light degradations, which limits practical use. Method: Proposes the ELVIS framework, comprising an unsupervised synthetic low-light video pipeline, the VDP-Net degradation profile network, and an enhancement decoder head that disentangles degradations from content, enabling domain adaptation. Result: Improves performance by up to +3.7 AP on the synthetic low-light YouTube-VIS 2019 dataset. Conclusion: ELVIS improves the performance of existing VIS models in low-light scenes and offers a new approach to low-light video understanding. Abstract: Video instance segmentation (VIS) for low-light content remains highly challenging for both humans and machines alike, due to adverse imaging conditions including noise, blur and low contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce ELVIS (Enhance Low-light for Video Instance Segmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to +3.7 AP on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.

[258] Semantic-aware Random Convolution and Source Matching for Domain Generalization in Medical Image Segmentation

Franz Thaler,Martin Urschler,Mateusz Kozinski,Matthias AF Gsell,Gernot Plank,Darko Stern

Main category: cs.CV

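TL;DR: Proposes SRCSM, a method for single-source domain generalization in medical image segmentation that diversifies the source domain with semantic-aware random convolution and maps target-domain image intensities at test time, improving segmentation in cross-modality and cross-center settings and ranking among the state-of-the-art methods in the field.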

Details Motivation: Models trained on a single source domain are hard to apply directly to a different domain (e.g., CT to MR) in medical image segmentation; the goal is good generalization without any target-domain data during training. Method: Proposes SRCSM: during training, semantic-aware random convolution guided by annotation labels diversifies the source domain; at test time, target-domain image intensities are mapped toward the source-domain distribution to improve generalization. Result: Extensive experiments across cross-modality, cross-center, and cardiac-phase settings show that SRCSM outperforms existing domain generalization methods in most cases and matches the in-domain baseline in several settings. Conclusion: SRCSM markedly narrows the domain gap, performs strongly in single-source domain generalization for medical image segmentation, and can be regarded as a new state of the art. Abstract: We tackle the challenging problem of single-source domain generalization (DG) for medical image segmentation. To this end, we aim to train a network on one domain (e.g., CT) and directly apply it to a different domain (e.g., MR) without adapting the model and without requiring images or annotations from the new domain during training. We propose a novel method for promoting DG when training deep segmentation networks, which we call SRCSM. During training, our method diversifies the source domain through semantic-aware random convolution, where different regions of a source image are augmented differently, based on their annotation labels. At test time, we complement the randomization of the training domain by mapping the intensity of target domain images, making them similar to source domain data. We perform a comprehensive evaluation on a variety of cross-modality and cross-center generalization settings for abdominal, whole-heart and prostate segmentation, where we outperform previous DG techniques in a vast majority of experiments. Additionally, we investigate our method when training on whole-heart CT or MR data and testing on the diastolic and systolic phase of cine MR data captured with different scanner hardware, where we make a step towards closing the domain gap in this even more challenging setting. Overall, our evaluation shows that SRCSM can be considered a new state-of-the-art in DG for medical image segmentation and, moreover, even achieves a segmentation performance that matches the performance of the in-domain baseline in several settings.

[259] QuantumCanvas: A Multimodal Benchmark for Visual Learning of Atomic Interactions

Can Polat,Erchin Serpedin,Mustafa Kurban,Hasan Kurban

Main category: cs.CV

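TL;DR: QuantumCanvas is a large-scale multimodal benchmark that treats two-body quantum systems as the basic units of matter and pairs image representations such as orbital densities with numerical properties, enabling interpretable and transferable learning of local quantum interactions.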

Details Motivation: Most machine learning models for molecules and materials lack physical transferability and do not capture the quantum interactions between atomic pairs; QuantumCanvas addresses this by taking two-body interactions as the basic unit. Method: Builds a dataset of 2,850 element-element pairs, each annotated with 18 electronic, thermodynamic, and geometric properties and paired with ten-channel physically grounded images (orbital densities, charge-density projections, and others); benchmarks GATv2, EGNN, DimeNet, and a multimodal fusion model, and studies pretraining transfer to datasets such as QM9 and MD17. Result: GATv2 reaches 0.201 eV MAE on the energy gap, EGNN reaches 0.265 eV and 0.274 eV MAE on HOMO and LUMO, DimeNet attains 2.27 eV total-energy MAE and 0.132 eV repulsive-energy MAE, and the multimodal model reaches 2.15 eV MAE on the Mermin free energy; pretraining improves convergence and generalization on other datasets. Conclusion: By combining orbital physics with vision-based representation learning, QuantumCanvas provides a principled and interpretable basis for learning transferable quantum interactions. Abstract: Despite rapid advances in molecular and materials machine learning, most models still lack physical transferability: they fit correlations across whole molecules or crystals rather than learning the quantum interactions between atomic pairs. Yet bonding, charge redistribution, orbital hybridization, and electronic coupling all emerge from these two-body interactions that define local quantum fields in many-body systems. We introduce QuantumCanvas, a large-scale multimodal benchmark that treats two-body quantum systems as foundational units of matter. The dataset spans 2,850 element-element pairs, each annotated with 18 electronic, thermodynamic, and geometric properties and paired with ten-channel image representations derived from l- and m-resolved orbital densities, angular field transforms, co-occupancy maps, and charge-density projections. These physically grounded images encode spatial, angular, and electrostatic symmetries without explicit coordinates, providing an interpretable visual modality for quantum learning. Benchmarking eight architectures across 18 targets, we report mean absolute errors of 0.201 eV on energy gap using GATv2, 0.265 eV on HOMO and 0.274 eV on LUMO using EGNN. For energy-related quantities, DimeNet attains 2.27 eV total-energy MAE and 0.132 eV repulsive-energy MAE, while a multimodal fusion model achieves a 2.15 eV Mermin free-energy MAE.
Pretraining on QuantumCanvas further improves convergence stability and generalization when fine-tuned on larger datasets such as QM9, MD17, and CrysMTM. By unifying orbital physics with vision-based representation learning, QuantumCanvas provides a principled and interpretable basis for learning transferable quantum interactions through coupled visual and numerical modalities. Dataset and model implementations are available at https://github.com/KurbanIntelligenceLab/QuantumCanvas.

[260] Diffusion Fuzzy System: Fuzzy Rule Guided Latent Multi-Path Diffusion Modeling

Hailong Yang,Te Zhang,Kup-sze Choi,Zhaohong Deng

Main category: cs.CV

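TL;DR: Proposes the Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS assigns a dedicated diffusion path to each type of image feature, coordinates the paths through rule-chain reasoning, and adds a fuzzy membership-based latent-space compression mechanism to cut computational cost; on several public datasets it trains more stably, converges faster, and achieves higher image quality and text alignment.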

Details Motivation: Existing diffusion models struggle with image collections that have large feature differences: they fail to capture complex features and tend to produce conflicting results, while traditional multi-path methods coordinate poorly and are computationally expensive. Method: Proposes the Diffusion Fuzzy System (DFS), which builds a multi-path diffusion structure in latent space where each path learns a specific class of image features; a reasoning mechanism based on fuzzy rule chains steers the diffusion process dynamically, and a fuzzy membership-based compression mechanism reduces computation. Result: On LSUN Bedroom, LSUN Church, and MS COCO, DFS trains more stably and converges faster than single-path and multi-path diffusion models, and it surpasses the baselines in image quality, text-image alignment, and accuracy against target references. Conclusion: Through fuzzy rule-guided multi-path coordination and efficient latent-space compression, DFS improves both the performance and the computational efficiency of diffusion models on images with complex features, offering a new approach to multimodal image generation. Abstract: Diffusion models have emerged as a leading technique for generating images due to their ability to create high-resolution and realistic images. Despite their strong performance, diffusion models still struggle in managing image collections with significant feature differences. They often fail to capture complex features and produce conflicting results. Research has attempted to address this issue by learning different regions of an image through multiple diffusion paths and then combining them. However, this approach leads to inefficient coordination among multiple paths and high computational costs. To tackle these issues, this paper presents a Diffusion Fuzzy System (DFS), a latent-space multi-path diffusion model guided by fuzzy rules. DFS offers several advantages. First, unlike traditional multi-path diffusion methods, DFS uses multiple diffusion paths, each dedicated to learning a specific class of image features. By assigning each path to a different feature type, DFS overcomes the limitations of multi-path models in capturing heterogeneous image features. Second, DFS employs rule-chain-based reasoning to dynamically steer the diffusion process and enable efficient coordination among multiple paths. Finally, DFS introduces a fuzzy membership-based latent-space compression mechanism to reduce the computational costs of multi-path diffusion effectively. We tested our method on three public datasets: LSUN Bedroom, LSUN Church, and MS COCO.
The results show that DFS achieves more stable training and faster convergence than existing single-path and multi-path diffusion models. Additionally, DFS surpasses baseline models in both image quality and alignment between text and images, and also shows improved accuracy when comparing generated images to target references.

[261] Deep Unsupervised Anomaly Detection in Brain Imaging: Large-Scale Benchmarking and Bias Analysis

Alexander Frotscher,Christian F. Baumgartner,Thomas Wolfers

Main category: cs.CV

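TL;DR: Presents a large-scale, multi-center benchmark for deep unsupervised anomaly detection in brain MRI that spans multiple scanners, lesion types, and demographics. A systematic evaluation of existing algorithms shows that current methods are limited on small and low-contrast lesions and in cross-scanner generalization and carry age- and sex-related biases, indicating that algorithmic improvement is more urgent than additional training data.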

Details Motivation: Fragmented evaluations, heterogeneous datasets, and inconsistent metrics have held back unsupervised anomaly detection in brain imaging and limited its reproducibility and clinical translation, so a standardized large-scale benchmark is needed. Method: Builds a large multi-center dataset with nearly 6,000 scans from healthy individuals plus several clinical cohorts, trains and tests multiple unsupervised anomaly detection algorithms on T1w and T2w MRI, evaluates lesion segmentation with the Dice score, and systematically analyzes how scanners, lesion characteristics, and demographics affect robustness. Result: Dice scores range from 0.03 to 0.65 across algorithms; reconstruction-based methods, especially diffusion-inspired ones, perform best, while feature-based methods are more robust under distribution shift; most algorithms are clearly affected by the scanner, detect small and low-contrast lesions poorly, and show false-positive rates that vary with age and sex; adding healthy training data brings only limited gains. Conclusion: Current unsupervised anomaly detection methods are limited by their algorithms rather than by data, and future work should focus on image-native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation to advance clinical translation. Abstract: Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identify pathological deviations without requiring lesion-specific annotations. Yet, fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1 and 2,972 T2-weighted scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. Validation used 92 scans to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, the Dice-based segmentation performance varied between 0.03 and 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of different scanners, lesion types and sizes, as well as demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, systematic biases, such as scanner-related effects, were observed for the majority of algorithms, including that small and low-contrast lesions were missed more often, and that false positives varied with age and sex.
Increasing healthy training data yields only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
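
Since the benchmark reports Dice-based segmentation performance, a short sketch of the Dice coefficient may help in reading the 0.03 to 0.65 range. The function below works on flat binary masks; the name `dice_score` and the choice to return 1.0 when both masks are empty are conventions assumed here, and the benchmark's exact evaluation protocol is not given in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient 2|P and T| / (|P| + |T|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2.0 * intersection / size


# A two-voxel lesion where the prediction hits one voxel and adds one
# false positive: half of each mask overlaps.
target = [0, 1, 1, 0, 0, 0]
pred = [0, 0, 1, 1, 0, 0]
print(dice_score(pred, target))
```

Because the denominator is the total mask size, a miss of a few voxels costs far more for a small lesion than for a large one, which is consistent with small lesions being the harder case.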

[262] FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

Zipeng Wang,Dan Xu

Main category: cs.CV

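TL;DR: This paper proposes FlashVGGT, an efficient 3D reconstruction model that uses a descriptor-based attention mechanism to address the high computational cost of existing methods on long image sequences, achieving fast inference and good scalability.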

Details Motivation: The existing Visual Geometry Grounding Transformer (VGGT) scales poorly to long image sequences because of the quadratic complexity of self-attention, which limits its use in large-scale scenes. Method: A descriptor-based attention mechanism compresses each frame's spatial information into a compact set of descriptor tokens and replaces global self-attention with cross-attention between image tokens and descriptors, reducing computational overhead; a chunk-recursive mechanism caches and reuses descriptors to support online inference over long sequences. Result: Experiments show that FlashVGGT keeps reconstruction accuracy comparable to VGGT while cutting inference time for 1,000 images to 9.3% of VGGT's, and scales efficiently to sequences of more than 3,000 images. Conclusion: Through descriptor attention and the chunk-recursive mechanism, FlashVGGT markedly improves the efficiency and scalability of multi-view 3D reconstruction models and suits real-time applications on large image sequences. Abstract: 3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images. Our project page is available at https://wzpscott.github.io/flashvggt_page/.
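
To make the scalability argument concrete, the back-of-the-envelope sketch below counts attention score entries for full self-attention versus cross-attention against a descriptor set. The function name and the token and descriptor counts are illustrative assumptions, not values from the paper, and the count ignores everything except the attention score matrix.

```python
def attention_score_counts(num_frames, tokens_per_frame, descriptors_per_frame):
    """Return (full_self_attention, descriptor_cross_attention) score counts.

    Hypothetical helper: the per-frame token and descriptor counts are
    placeholders chosen for illustration, not FlashVGGT's actual settings.
    """
    n_tokens = num_frames * tokens_per_frame
    n_descriptors = num_frames * descriptors_per_frame
    full = n_tokens * n_tokens             # every token attends to every token
    compressed = n_tokens * n_descriptors  # every token attends to descriptors only
    return full, compressed


full, compressed = attention_score_counts(1000, 1024, 64)
ratio = compressed / full  # equals descriptors_per_frame / tokens_per_frame
```

Under these assumptions the score matrix shrinks by the factor `descriptors_per_frame / tokens_per_frame`. This is not a derivation of the reported 9.3% runtime figure, which depends on the whole model.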

[263] MasHeNe: A Benchmark for Head and Neck CT Mass Segmentation using Window-Enhanced Mamba with Frequency-Domain Integration

Thao Thi Phuong Dao,Tan-Cong Nguyen,Nguyen Chi Thanh,Truong Hoang Viet,Trong-Le Do,Mai-Khiem Tran,Minh-Khoi Pham,Trung-Nghia Le,Minh-Triet Tran,Thanh Dinh Le

Main category: cs.CV

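TL;DR: This paper presents MasHeNe, a new head and neck mass segmentation dataset of 3,779 contrast-enhanced CT slices with pixel-level annotations covering tumors and cysts, and proposes the Mamba-based Windowing-Enhanced Mamba with Frequency integration (WEMF) model, which achieves the best performance on this dataset.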

Details Motivation: Existing public datasets focus mainly on malignant lesions and overlook other space-occupying lesions of the head and neck, such as cysts, so there is little data for assessing mass segmentation models comprehensively; a broader, more diverse dataset and a matching model are needed. Method: The authors build MasHeNe, a new dataset of 3,779 contrast-enhanced CT slices with pixel-level annotations, and design the WEMF model, which applies tri-window enhancement to the input appearance and uses multi-frequency attention to fuse information across skip connections within a U-shaped Mamba backbone. Result: WEMF performs best on MasHeNe, with a Dice of 70.45%, IoU of 66.89%, NSD of 72.33%, and HD95 of 5.12 mm, indicating stable and strong segmentation, although the error patterns show that the task remains challenging. Conclusion: MasHeNe fills the gap left by malignancy-only head and neck mass segmentation datasets and provides a new benchmark; WEMF combines multi-frequency information with window enhancement to improve segmentation, but further research is needed for complex clinical scenarios. Abstract: Head and neck masses are space-occupying lesions that can compress the airway and esophagus and may affect nerves and blood vessels. Available public datasets primarily focus on malignant lesions and often overlook other space-occupying conditions in this region. To address this gap, we introduce MasHeNe, an initial dataset of 3,779 contrast-enhanced CT slices that includes both tumors and cysts with pixel-level annotations. We also establish a benchmark using standard segmentation baselines and report common metrics to enable fair comparison. In addition, we propose the Windowing-Enhanced Mamba with Frequency integration (WEMF) model. WEMF applies tri-window enhancement to enrich the input appearance before feature extraction. It further uses multi-frequency attention to fuse information across skip connections within a U-shaped Mamba backbone. On MasHeNe, WEMF attains the best performance among evaluated methods, with a Dice of 70.45%, IoU of 66.89%, NSD of 72.33%, and HD95 of 5.12 mm. This model indicates stable and strong results on this challenging task. MasHeNe provides a benchmark for head-and-neck mass segmentation beyond malignancy-only datasets. The observed error patterns also suggest that this task remains challenging and requires further research. Our dataset and code are available at https://github.com/drthaodao3101/MasHeNe.git.
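
For readers unfamiliar with the overlap metrics quoted above, here is a minimal sketch of Dice and IoU for a single pair of binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration; the paper's own evaluation code may aggregate differently (for example, per slice or per case).

```python
def dice_and_iou(pred, target):
    """Compute (Dice, IoU) for two flat sequences of 0/1 labels.

    Hypothetical helper for illustration. When both masks are empty the
    metrics are undefined; returning 1.0 is one common convention.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    dice = 2 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# One overlapping pixel out of two predicted and two true pixels.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU), so differences in how scores are averaged over a dataset matter when comparing reported numbers.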

[264] RoleMotion: A Large-Scale Dataset towards Robust Scene-Specific Role-Playing Motion Synthesis with Fine-grained Descriptions

Junran Peng,Yiheng Huang,Silei Shen,Zeji Wei,Jingwei Yang,Baojie Wang,Yonghao He,Chuanchen Luo,Man Zhang,Xucheng Yin,Wei Sui

Main category: cs.CV

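TL;DR: This paper introduces RoleMotion, a large-scale human motion dataset focused on role-playing and scene-specific functional motion, comprising 25 classic scenes, 110 functional roles, more than 500 behaviors, and 10,296 high-quality body and hand motion sequences, annotated with 27,831 fine-grained text descriptions.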

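Details Motivation: Existing motion datasets are fragmented, functionally incomplete, uneven in quality, and coarsely annotated, which makes them a poor basis for role-specific motion generation in complex social scenes; a high-quality, scene-oriented dataset with explicit roles and fine-grained annotation is needed. Method: The authors design and collect RoleMotion with a focus on scenes and roles, covering diverse behaviors and high-precision motion capture data annotated with fine-grained text descriptions; they also build a stronger evaluator for text-to-motion methods and explore how body and hand motion generation interact in whole-body synthesis. Result: RoleMotion contains 10,296 high-quality motion sequences and 27,831 fine-grained text descriptions and shows superior quality and functionality on text-driven whole-body motion generation; the evaluator is shown to be more reliable; experiments support the benefit of generating body and hand motion jointly. Conclusion: RoleMotion is a high-quality, functional, role-oriented motion dataset suited to diverse scenes; it strengthens the research basis for text-to-motion generation, particularly fine-grained generation and evaluation of whole-body motion including the hands. Abstract: In this paper, we introduce RoleMotion, a large-scale human motion dataset that encompasses a wealth of role-playing and functional motion data tailored to fit various specific scenes. Existing text datasets are mainly constructed decentrally as amalgamation of assorted subsets that their data are nonfunctional and isolated to work together to cover social activities in various scenes. Also, the quality of motion data is inconsistent, and textual annotation lacks fine-grained details in these datasets. In contrast, RoleMotion is meticulously designed and collected with a particular focus on scenes and roles. The dataset features 25 classic scenes, 110 functional roles, over 500 behaviors, and 10296 high-quality human motion sequences of body and hands, annotated with 27831 fine-grained text descriptions. We build an evaluator stronger than existing counterparts, prove its reliability, and evaluate various text-to-motion methods on our dataset. Finally, we explore the interplay of motion generation of body and hands. Experimental results demonstrate the high-quality and functionality of our dataset on text-driven whole-body generation.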

[265] Toward Content-based Indexing and Retrieval of Head and Neck CT with Abscess Segmentation

Thao Thi Phuong Dao,Tan-Cong Nguyen,Trong-Le Do,Truong Hoang Viet,Nguyen Chi Thanh,Huynh Nguyen Thuan,Do Vo Cong Nguyen,Minh-Khoi Pham,Mai-Khiem Tran,Viet-Tham Huynh,Trong-Thuan Nguyen,Trung-Nghia Le,Vo Thanh Toan,Tam V. Nguyen,Minh-Triet Tran,Thanh Dinh Le

Main category: cs.CV

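TL;DR: This paper introduces AbscessHeNe, an annotated head and neck abscess dataset of 4,926 contrast-enhanced CT slices that supports abscess segmentation, assessment of deep neck space involvement, and clinical decision-making, and lays the groundwork for multimedia indexing and case retrieval.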

Details Motivation: Head and neck abscesses can lead to sepsis or death if not diagnosed and treated promptly, and accurate detection and delineation on imaging are essential for diagnosis and treatment, yet no public, high-quality dataset currently supports this research. Method: The authors build AbscessHeNe, a clinically confirmed CT dataset with pixel-level annotations comprising 4,926 contrast-enhanced CT slices, and evaluate several state-of-the-art semantic segmentation models (CNN, Transformer, and Mamba architectures) on it. Result: The best model reaches a Dice coefficient of 0.39, an IoU of 0.27, and a Normalized Surface Distance of 0.67, showing that current models still struggle with this task; the dataset also supports content-based medical image retrieval and knowledge-driven clinical workflows. Conclusion: AbscessHeNe is an important resource for automatic segmentation and clinical analysis of head and neck abscesses; current segmentation performance is limited and calls for further research; the dataset is expected to advance intelligent diagnosis and treatment systems and will be released publicly to encourage academic collaboration. Abstract: Abscesses in the head and neck represent an acute infectious process that can potentially lead to sepsis or mortality if not diagnosed and managed promptly. Accurate detection and delineation of these lesions on imaging are essential for diagnosis, treatment planning, and surgical intervention. In this study, we introduce AbscessHeNe, a curated and comprehensively annotated dataset comprising 4,926 contrast-enhanced CT slices with clinically confirmed head and neck abscesses. The dataset is designed to facilitate the development of robust semantic segmentation models that can accurately delineate abscess boundaries and evaluate deep neck space involvement, thereby supporting informed clinical decision-making. To establish performance baselines, we evaluate several state-of-the-art segmentation architectures, including CNN, Transformer, and Mamba-based models. The highest-performing model achieved a Dice Similarity Coefficient of 0.39, Intersection-over-Union of 0.27, and Normalized Surface Distance of 0.67, indicating the challenges of this task and the need for further research. Beyond segmentation, AbscessHeNe is structured for future applications in content-based multimedia indexing and case-based retrieval. Each CT scan is linked with pixel-level annotations and clinical metadata, providing a foundation for building intelligent retrieval systems and supporting knowledge-driven clinical workflows. The dataset will be made publicly available at https://github.com/drthaodao3101/AbscessHeNe.git.

[266] Depth Matching Method Based on ShapeDTW for Oil-Based Mud Imager

Fengfeng Li,Zhou Feng,Hongliang Wu,Hao Zhang,Han Tian,Peng Liu,Lixin Yuan

Main category: cs.CV

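TL;DR: Proposes a depth matching method for borehole images based on the ShapeDTW algorithm that combines HOG1D with raw-signal features, effectively resolving the depth misalignment between upper and lower pad images in oil-based mud microresistivity imaging while preserving structure well and remaining extensible.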

Details Motivation: With oil-based mud microresistivity imaging tools, the images acquired by the upper and lower pad sets remain misaligned in depth even after velocity correction, which degrades image quality and the accuracy of geological interpretation. Method: The Shape Dynamic Time Warping (ShapeDTW) algorithm extracts local shape features and builds a morphologically sensitive distance matrix; a composite feature made of HOG1D and the raw signal serves as the shape descriptor, improving how well structural similarity is preserved during sequence alignment. Result: Field tests show that the method precisely aligns images with complex textures, depth shifts, or local scaling, and that it is robust and flexible. Conclusion: The proposed ShapeDTW-based depth matching method effectively resolves depth misalignment in OBM microresistivity imaging, and its framework supports feature extension to suit different geological features. Abstract: In well logging operations using the oil-based mud (OBM) microresistivity imager, which employs an interleaved design with upper and lower pad sets, depth misalignment issues persist between the pad images even after velocity correction. This paper presents a depth matching method for borehole images based on the Shape Dynamic Time Warping (ShapeDTW) algorithm. The method extracts local shape features to construct a morphologically sensitive distance matrix, better preserving structural similarity between sequences during alignment. We implement this by employing a combined feature set of the one-dimensional Histogram of Oriented Gradients (HOG1D) and the original signal as the shape descriptor. Field test examples demonstrate that our method achieves precise alignment for images with complex textures, depth shifts, or local scaling. Furthermore, it provides a flexible framework for feature extension, allowing the integration of other descriptors tailored to specific geological features.
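
The alignment idea can be sketched on two 1-D traces as follows. This is a simplified illustration, not the authors' implementation: the descriptor here is the raw local window concatenated with its first differences, used as a stand-in for the HOG1D-plus-raw-signal descriptor named in the abstract, and the function names and `half_width` parameter are hypothetical.

```python
import math


def descriptors(signal, half_width=2):
    """Per-sample shape descriptor: local window plus its first differences.

    Simplified stand-in for HOG1D + raw signal; edges are clamped.
    """
    n = len(signal)
    descs = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - half_width, i + half_width + 1)]
        grads = [window[k + 1] - window[k] for k in range(len(window) - 1)]
        descs.append(window + grads)
    return descs


def shape_dtw_path(a, b, half_width=2):
    """Align a and b with DTW run on shape descriptors instead of raw values."""
    da, db = descriptors(a, half_width), descriptors(b, half_width)
    n, m = len(da), len(db)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(da[i - 1], db[j - 1])  # descriptor distance matrix entry
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover the warping path (ties favor the diagonal).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


trace = [0.0, 1.0, 3.0, 6.0, 10.0]
path = shape_dtw_path(trace, trace)  # identical traces align along the diagonal
```

Each `(i, j)` pair in the returned path says that sample `i` of the first trace corresponds to sample `j` of the second, which is the per-depth shift a depth-matching step would apply.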

[267] SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

Yumeng He,Ying Jiang,Jiayin Lu,Yin Yang,Chenfanfu Jiang

Main category: cs.CV

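TL;DR: SPARK is a framework that reconstructs physically consistent, kinematically structured articulated 3D objects from a single RGB image, combining VLMs with generative models to produce simulation-ready URDF assets.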

Details Motivation: Creating simulation-ready articulated 3D assets is usually time-consuming and depends on expert knowledge, and automated methods are lacking. Method: A vision-language model (VLM) extracts coarse URDF parameters and generates part-level reference images; a generative diffusion transformer guided by the part images synthesizes the shapes; joint parameters are then optimized through differentiable forward kinematics and differentiable rendering. Result: The method produces high-quality, simulation-ready articulated object models across many object categories and supports downstream tasks such as robotic manipulation. Conclusion: SPARK automates the path from a single image to simulation-grade articulated object models end to end, improving the efficiency of digital asset creation. Abstract: Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive experiments show that SPARK produces high-quality, simulation-ready articulated assets across diverse categories, enabling downstream applications such as robotic manipulation and interaction modeling.

[268] Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval

Xin Wang,Haipeng Zhang,Mang Li,Zhaohui Xia,Yueguo Chen,Yu Zhang,Chunyu Wei

Main category: cs.CV

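TL;DR: Proposes Fusion-Diff, a generative editing framework for zero-shot composed image retrieval that achieves efficient cross-modal alignment through multimodal fusion and a lightweight Control-Adapter.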

Details Motivation: Existing zero-shot CIR methods struggle to bridge the vision-language modality gap, while supervised methods depend on costly triplet annotations, so a more data-efficient approach is needed. Method: The framework introduces a multimodal fusion feature editing strategy within a joint vision-language space and fine-tunes a lightweight Control-Adapter, reaching high performance with only 200K synthetic samples. Result: Fusion-Diff significantly outperforms prior zero-shot methods on the CIRR, FashionIQ, and CIRCO benchmarks, and visualizations improve the model's interpretability. Conclusion: Fusion-Diff effectively narrows the modality gap and performs well in both data efficiency and accuracy, advancing zero-shot composed image retrieval. Abstract: Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) stems from a fundamental dilemma: existing text-centric or diffusion-based approaches struggle to effectively bridge the vision-language modality gap. To address this, we propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on only a limited-scale synthetic dataset of 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.

[269] ViT$^3$: Unlocking Test-Time Training in Vision

Dongchen Han,Yining Li,Tianyu Li,Zixuan Cao,Ziming Wang,Jun Song,Yu Cheng,Bo Zheng,Gao Huang

Main category: cs.CV

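TL;DR: This paper presents a systematic empirical study of test-time training for visual sequence modeling, distills six practical design principles, and proposes the Vision Test-Time Training (ViT$^3$) model, which has linear complexity, supports parallel computation, and performs strongly across a range of vision tasks.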

Details Motivation: The design of test-time training (TTT) methods for vision lacks systematic understanding and practical guidance, and this paper aims to fill that gap. Method: Through a series of experiments and analyses, the authors systematically study design choices for TTT in visual sequence modeling, distill six design principles, and build the ViT$^3$ model on them. Result: On image classification, image generation, object detection, and semantic segmentation, ViT$^3$ matches or exceeds existing linear-complexity models (such as Mamba and linear attention variants) and narrows the gap to highly optimized vision Transformers. Conclusion: The paper provides effective design principles for visual TTT models and a strong baseline, ViT$^3$, that may help future research in this direction. Abstract: Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.
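
The phrase "constructing a compact inner model from key-value pairs at test time" can be illustrated with a minimal sketch. It assumes the simplest possible inner model, a single linear map trained with one gradient step of squared error per token; the function name, the learning rate, and the update-then-read order are assumptions for illustration and are not taken from the ViT$^3$ design.

```python
def ttt_linear(keys, values, queries, lr=1.0):
    """Online-learning view of attention with a linear inner model W.

    For each token: take one gradient step on 0.5 * ||W k - v||^2,
    then read out W q. The state W has fixed size, so the cost is
    linear in the number of tokens. Illustrative sketch only.
    """
    d_out, d_in = len(values[0]), len(keys[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        pred = [sum(w * x for w, x in zip(row, k)) for row in W]
        err = [p - t for p, t in zip(pred, v)]
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * err[i] * k[j]  # gradient of the squared error
        outputs.append([sum(w * x for w, x in zip(row, q)) for row in W])
    return outputs


# After storing the pair ([1, 0] -> [2, 3]), querying with [1, 0] recalls [2, 3].
out = ttt_linear([[1.0, 0.0]], [[2.0, 3.0]], [[1.0, 0.0]])
```

The design questions the abstract raises (what the inner module is and how it is trained) correspond here to the choice of `W` and of the per-token update rule.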

[270] DB-KAUNet: An Adaptive Dual Branch Kolmogorov-Arnold UNet for Retinal Vessel Segmentation

Hongyu Xu,Panpan Meng,Meng Wang,Dayu Hu,Liming Liang,Xiaoqi Sheng

Main category: cs.CV

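TL;DR: Proposes an Adaptive Dual Branch Kolmogorov-Arnold UNet (DB-KAUNet) for retinal vessel segmentation that combines CNN and Transformer pathways and adds new modules to improve feature representation and fusion, achieving leading segmentation performance on several datasets.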

Details Motivation: Traditional CNN methods are limited in capturing long-range dependencies and complex nonlinear relationships, which hampers precise retinal vessel segmentation. Method: The authors design a Heterogeneous Dual-Branch Encoder (HDBE) with parallel CNN and Transformer pathways, introduce KANConv and KAT blocks, and integrate Cross-Branch Channel Interaction (CCI) and Spatial Feature Enhancement (SFE/SFE-GAF) modules for geometrically adaptive fusion and feature refinement. Result: Experiments on the DRIVE, STARE, and CHASE_DB1 datasets show that DB-KAUNet outperforms existing methods in segmentation accuracy, robustness, and background noise suppression, with lower computational overhead. Conclusion: By combining the strengths of CNNs and Transformers with adaptive feature enhancement, DB-KAUNet markedly improves retinal vessel segmentation and shows good potential for clinical use. Abstract: Accurate segmentation of retinal vessels is crucial for the clinical diagnosis of numerous ophthalmic and systemic diseases. However, traditional Convolutional Neural Network (CNN) methods exhibit inherent limitations, struggling to capture long-range dependencies and complex nonlinear relationships. To address the above limitations, an Adaptive Dual Branch Kolmogorov-Arnold UNet (DB-KAUNet) is proposed for retinal vessel segmentation. In DB-KAUNet, we design a Heterogeneous Dual-Branch Encoder (HDBE) that features parallel CNN and Transformer pathways. The HDBE strategically interleaves standard CNN and Transformer blocks with novel KANConv and KAT blocks, enabling the model to form a comprehensive feature representation. To optimize feature processing, we integrate several critical components into the HDBE. First, a Cross-Branch Channel Interaction (CCI) module is embedded to facilitate efficient interaction of channel features between the parallel pathways. Second, an attention-based Spatial Feature Enhancement (SFE) module is employed to enhance spatial features and fuse the outputs from both branches. Building upon the SFE module, an advanced Spatial Feature Enhancement with Geometrically Adaptive Fusion (SFE-GAF) module is subsequently developed. In the SFE-GAF module, adaptive sampling is utilized to focus on true vessel morphology precisely. The adaptive process strengthens salient vascular features while significantly reducing background noise and computational overhead.
Extensive experiments on the DRIVE, STARE, and CHASE_DB1 datasets validate that DB-KAUNet achieves leading segmentation performance and demonstrates exceptional robustness.

[271] Bridging the Scale Gap: Balanced Tiny and General Object Detection in Remote Sensing Imagery

Zhicheng Zhao,Yin Huang,Lingma Sun,Chenglong Li,Jin Tang

Main category: cs.CV

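TL;DR: This paper proposes ScaleBridge-Det, described as the first large detection framework for tiny object detection in remote sensing imagery, which achieves balanced detection performance across scales through scale-adaptive expert routing and density-guided query allocation.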

Details Motivation: Remote sensing imagery contains extreme scale variation and densely distributed tiny objects; existing methods struggle to balance performance across object scales, and the use of large models for this task has not yet been explored. Method: The ScaleBridge-Det framework has two core modules: (1) a Routing-Enhanced Mixture Attention (REM) module that selects and fuses scale-specific expert features through adaptive routing; (2) a Density-Guided Dynamic Query (DGQ) module that adjusts query positions and numbers according to predicted density for efficient resource allocation. Result: The framework reaches state-of-the-art performance on the AI-TOD-V2 and DTOD datasets and shows strong cross-domain robustness on VisDrone. Conclusion: ScaleBridge-Det is the first to apply a large detection framework to tiny object detection in remote sensing; its REM and DGQ modules address the challenges posed by multi-scale and density differences and optimize dense tiny objects and general objects together. Abstract: Tiny object detection in remote sensing imagery has attracted significant research interest in recent years. Despite recent progress, achieving balanced detection performance across diverse object scales remains a formidable challenge, particularly in scenarios where dense tiny objects and large objects coexist. Although large foundation models have revolutionized general vision tasks, their application to tiny object detection remains unexplored due to the extreme scale variation and density distribution inherent to remote sensing imagery. To bridge this scale gap, we propose ScaleBridge-Det, to the best of our knowledge, the first large detection framework designed for tiny objects, which could achieve balanced performance across diverse scales through scale-adaptive expert routing and density-guided query allocation. Specifically, we introduce a Routing-Enhanced Mixture Attention (REM) module that dynamically selects and fuses scale-specific expert features via adaptive routing to address the tendency of standard MoE models to favor dominant scales. REM generates complementary and discriminative multi-scale representations suitable for both tiny and large objects. Furthermore, we present a Density-Guided Dynamic Query (DGQ) module that predicts object density to adaptively adjust query positions and numbers, enabling efficient resource allocation for objects of varying scales. The proposed framework allows ScaleBridge-Det to simultaneously optimize performance for both dense tiny and general objects without trade-offs.
Extensive experiments on benchmark and cross-domain datasets demonstrate that ScaleBridge-Det achieves state-of-the-art performance on AI-TOD-V2 and DTOD, while exhibiting superior cross-domain robustness on VisDrone.

[272] GRASP: Guided Residual Adapters with Sample-wise Partitioning

Felix Nützel,Mischa Dombrowski,Bernhard Kainz

Main category: cs.CV

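TL;DR: Proposes GRASP, which uses sample clustering and residual adapters to mitigate gradient conflicts in text-to-image diffusion models under long-tail distributions, improving the quality and diversity of generated rare classes.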

Details Motivation: Existing text-to-image diffusion models perform poorly on long-tail data (such as rare pathologies) and are prone to mode collapse, so rare-class outputs lack quality and diversity, which limits their use for data augmentation. Method: GRASP uses external priors to partition samples statically into clusters that reduce intra-group gradient conflicts, then fine-tunes the model by injecting cluster-specific residual adapters into the Transformer feedforward layers, avoiding a learned gating mechanism for better stability and efficiency. Result: On the long-tail MIMIC-CXR-LT dataset, GRASP outperforms baselines on FID and diversity metrics, especially for rare classes; downstream classification on NIH-CXR-LT improves for tail labels; results on ImageNet-LT confirm generality. Conclusion: GRASP effectively mitigates gradient conflicts under long-tail distributions and markedly improves diffusion models' ability to generate rare classes, while being lightweight, scalable, and easy to integrate. Abstract: Recent advances in text-to-image diffusion models enable high-fidelity generation across diverse prompts. However, these models falter in long-tail settings, such as medical imaging, where rare pathologies comprise a small fraction of data. This results in mode collapse: tail-class outputs lack quality and diversity, undermining the goal of synthetic data augmentation for underrepresented conditions. We pinpoint gradient conflicts between frequent head and rare tail classes as the primary culprit, a factor unaddressed by existing sampling or conditioning methods that mainly steer inference without altering the learned distribution. To resolve this, we propose GRASP: Guided Residual Adapters with Sample-wise Partitioning. GRASP uses external priors to statically partition samples into clusters that minimize intra-group gradient clashes. It then fine-tunes pre-trained models by injecting cluster-specific residual adapters into transformer feedforward layers, bypassing learned gating for stability and efficiency. On the long-tail MIMIC-CXR-LT dataset, GRASP yields superior FID and diversity metrics, especially for rare classes, outperforming baselines like vanilla fine-tuning and Mixture of Experts variants. Downstream classification on NIH-CXR-LT improves considerably for tail labels. Generalization to ImageNet-LT confirms broad applicability. Our method is lightweight, scalable, and readily integrates with diffusion pipelines.
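
The routing idea, where a fixed cluster assignment rather than a learned gate picks the adapter, might look roughly like the sketch below. The bottleneck (down-projection, ReLU, up-projection) adapter shape, the class name, and the toy weights are all assumptions for illustration; the abstract does not specify the adapter's internal structure.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ClusterResidualAdapters:
    """Adds a cluster-specific residual to a frozen feedforward output.

    Hypothetical sketch: `adapters` maps a static cluster id to a
    (down, up) pair of weight matrices. No gate is learned; the cluster
    id comes from a fixed, precomputed partition of the samples.
    """

    def __init__(self, adapters):
        self.adapters = adapters

    def __call__(self, x, ffn_out, cluster_id):
        down, up = self.adapters[cluster_id]
        hidden = [max(0.0, h) for h in matvec(down, x)]  # ReLU bottleneck
        delta = matvec(up, hidden)
        return [f + d for f, d in zip(ffn_out, delta)]   # residual add


layer = ClusterResidualAdapters({
    0: ([[1.0, 0.0]], [[0.0], [0.0]]),  # zero up-projection: acts as identity
    1: ([[1.0, 1.0]], [[1.0], [2.0]]),
})
x, ffn_out = [1.0, 2.0], [0.5, -0.5]
head_out = layer(x, ffn_out, cluster_id=0)
tail_out = layer(x, ffn_out, cluster_id=1)
```

Because each sample only ever updates the adapter of its own cluster, gradients from head and tail clusters do not flow into the same adapter weights, which is the conflict the abstract describes.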

[273] Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

Haodong Yan,Hang Yu,Zhide Zhong,Weilin Yuan,Xin Gong,Zehang Luo,Chengxi Heyu,Junfeng Li,Wenxuan Song,Shunbo Zhou,Haoang Li

Main category: cs.CV

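TL;DR: Proposes a structure and contact-aware representation (SCAR) for generating more realistic hand-object interaction videos without 3D annotations, improving physical realism and temporal consistency.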

Details Motivation: Existing methods that generate hand-object interaction videos from 2D or 3D representations cannot provide scalability and interaction fidelity at the same time. Method: The authors propose a structure and contact-aware representation (SCAR) that captures hand-object contact, occlusion, and holistic structural context, and generate videos with a joint-generation paradigm that uses a share-and-specialization strategy. Result: The method outperforms state-of-the-art approaches on two real-world datasets, producing videos that are more physically realistic and temporally coherent, and generalizes strongly to open-world scenarios. Conclusion: The SCAR representation and generation framework resolve the trade-off between scalability and interaction quality in hand-object interaction video generation and show good practical potential. Abstract: Generating realistic hand-object interactions (HOI) videos is a significant challenge due to the difficulty of modeling physical constraints (e.g., contact and occlusion between hands and manipulated objects). Current methods utilize HOI representation as an auxiliary generative objective to guide video synthesis. However, there is a dilemma between 2D and 3D representations that cannot simultaneously guarantee scalability and interaction fidelity. To address this limitation, we propose a structure and contact-aware representation that captures hand-object contact, hand-object occlusion, and holistic structure context without 3D annotations. This interaction-oriented and scalable supervision signal enables the model to learn fine-grained interaction physics and generalize to open-world scenarios. To fully exploit the proposed representation, we introduce a joint-generation paradigm with a share-and-specialization strategy that generates interaction-oriented representations and videos. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two real-world datasets in generating physics-realistic and temporally coherent HOI videos. Furthermore, our approach exhibits strong generalization to challenging open-world scenarios, highlighting the benefit of our scalable design. Our project page is https://hgzn258.github.io/SCAR/.

[274] Cross-Domain Validation of a Resection-Trained Self-Supervised Model on Multicentre Mesothelioma Biopsies

Farzaneh Seyedshahi,Francesca Damiola,Sylvie Lantuejoul,Ke Yuan,John Le Quesne

Main category: cs.CV

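TL;DR: Shows that a self-supervised encoder trained on resection tissue can be applied to biopsy samples for mesothelioma subtype classification and survival prediction.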

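Details Motivation: Most computational pathology models rely on images of large resection specimens, which limits their use on the small biopsies common in clinical practice; an approach that classifies and predicts prognosis accurately on small biopsies is needed. Method: A self-supervised encoder trained on resection tissue is transferred to biopsy images to extract meaningful morphological features, which are then used for tumor subtype classification and patient survival prediction. Result: The model captures morphological patterns in biopsy samples and uses them to predict patient survival and tumor subtype. Conclusion: The self-supervised framework transfers across specimen types (resection to biopsy) and has the potential to support mesothelioma diagnosis and treatment planning. Abstract: Accurate subtype classification and outcome prediction in mesothelioma are essential for guiding therapy and patient care. Most computational pathology models are trained on large tissue images from resection specimens, limiting their use in real-world settings where small biopsies are common. We show that a self-supervised encoder trained on resection tissue can be applied to biopsy material, capturing meaningful morphological patterns. Using these patterns, the model can predict patient survival and classify tumor subtypes. This approach demonstrates the potential of AI-driven tools to support diagnosis and treatment planning in mesothelioma.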

[275] DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models

Patrick Kwon,Chen Chen

Main category: cs.CV

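TL;DR: DreamingComics is a layout-aware story visualization framework that improves character and style consistency through a modified video diffusion-transformer model and a region-aware positional encoding (RegionalRoPE), and adds an LLM-based layout generator for controllable generation from natural language scripts to comic layouts.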

Details Motivation: Existing story visualization methods position subjects mainly by text and fall short on artistic style and character consistency, making visual coherence and layout control hard to maintain. Method: Building on a pretrained video diffusion-transformer (DiT) model, the authors propose RegionalRoPE, a region-aware positional encoding for layout control, and a masked condition loss that constrains each subject's visual features; an LLM-based layout generator infers comic layouts from natural language scripts. Result: Compared with previous methods, character consistency improves by 29.2% and style similarity by 36.2%, with high spatial accuracy. Conclusion: DreamingComics improves identity, style, and layout consistency in story visualization, enabling more controllable, higher-quality comic generation. Abstract: Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation of our approach, showing a 29.2% increase in character consistency and a 36.2% increase in style similarity compared to previous methods, while displaying high spatial accuracy. Our project page is available at https://yj7082126.github.io/dreamingcomics/

[276] SSR: Semantic and Spatial Rectification for CLIP-based Weakly Supervised Segmentation

Xiuli Bi,Die Xiao,Junchao Fan,Bin Xiao

Main category: cs.CV

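TL;DR: This paper proposes Semantic and Spatial Rectification (SSR) for CLIP-based weakly supervised semantic segmentation; Cross-Modal Prototype Alignment (CMPA) and Superpixel-Guided Correction (SGC) mitigate over-activation in non-target foreground and background regions, yielding strong results on PASCAL VOC and MS COCO.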

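Details Motivation: Existing CLIP-based weakly supervised semantic segmentation methods over-activate non-target foreground regions and background regions, which limits segmentation accuracy; a method that improves semantic consistency and spatial precision together is needed. Method: The Semantic and Spatial Rectification (SSR) method works at two levels: at the semantic level, Cross-Modal Prototype Alignment (CMPA) sets up a contrastive learning mechanism that aligns feature spaces across modalities, strengthening semantic correlation and reducing inter-class confusion; at the spatial level, Superpixel-Guided Correction (SGC) uses superpixel-based spatial priors to filter out interference from non-target regions during affinity propagation, suppressing background over-activation. Result: Extensive experiments on PASCAL VOC and MS COCO show that the method outperforms all single-stage approaches as well as more complex multi-stage approaches, reaching mIoU scores of 79.5% and 50.6%, respectively. Conclusion: Through its dual semantic and spatial rectification, the SSR framework markedly improves CLIP's performance on weakly supervised semantic segmentation, addresses over-activation, and offers a new direction for later CLIP-based segmentation methods. Abstract: In recent years, Contrastive Language-Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross-modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP-based weakly supervised semantic segmentation approaches: over-activation in non-target foreground regions and background areas. Specifically, at the semantic level, the Cross-Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter-class overlap while enhancing semantic correlations, to rectify over-activation in non-target foreground regions effectively; at the spatial level, the Superpixel-Guided Correction (SGC) leverages superpixel-based spatial priors to precisely filter out interference from non-target regions during affinity propagation, significantly rectifying background over-activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single-stage approaches, as well as more complex multi-stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.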

[277] FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing

Yucheng Liao,Jiajun Liang,Kaiqian Cui,Baoquan Zhao,Haoran Xie,Wei Liu,Qing Li,Xudong Mao

Main category: cs.CV

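TL;DR: This paper proposes FreqEdit, a training-free image editing framework that uses high-frequency feature injection, an adaptive injection strategy, and a path compensation mechanism to counter quality degradation in multi-turn instruction-based editing, enabling stable and precise consecutive edits.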

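Details Motivation: Existing instruction-based image editing models do well on single edits but degrade severely over consecutive multi-turn edits, mainly through loss of detail, so a method that keeps edits stable and image quality high is needed. Method: The FreqEdit framework has three core components: (1) high-frequency feature injection from reference velocity fields to preserve detail; (2) an adaptive injection strategy that spatially modulates injection strength for precise local control; (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. The method requires no training and applies to existing models. Result: Experiments show that FreqEdit maintains high quality over ten or more consecutive edits and outperforms seven state-of-the-art baselines in identity preservation and instruction following. Conclusion: By protecting high-frequency information and correcting the editing trajectory, FreqEdit markedly improves the stability and fidelity of multi-turn instruction-based image editing and offers a practical approach for real applications. Abstract: Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines.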
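
As a rough intuition for "high-frequency injection", the sketch below splits a 1-D signal into low and high frequencies with a box filter and replaces part of an edited signal's high-frequency band with that of a reference. This is a toy analogue under stated assumptions: FreqEdit operates on velocity fields inside a diffusion model, whereas the box filter, the scalar `strength`, and all function names here are hypothetical.

```python
def box_blur(signal, radius=1):
    """Low-pass filter: moving average with clamped edges."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def high_freq(signal, radius=1):
    """High-frequency residual: the signal minus its low-pass version."""
    return [s - l for s, l in zip(signal, box_blur(signal, radius))]


def inject_high_freq(edited, reference, strength=1.0, radius=1):
    """Move the edited signal's high band toward the reference's high band.

    strength=0 leaves `edited` unchanged; strength=1 keeps the edited
    low band and takes the high band entirely from `reference`.
    """
    hf_ref = high_freq(reference, radius)
    hf_edit = high_freq(edited, radius)
    return [e + strength * (r - h) for e, r, h in zip(edited, hf_ref, hf_edit)]


reference = [0.0, 4.0, 0.0, 4.0, 0.0, 4.0]   # detailed signal
edited = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]      # detail lost after editing
restored = inject_high_freq(edited, reference, strength=1.0)
```

The spatially adaptive strategy in the abstract would correspond to making `strength` a per-position value rather than a single scalar.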

[278] HiconAgent: History Context-aware Policy Optimization for GUI Agents

Xurui Zhou,Gongwei Chen,Yuquan Xie,Zaijing Li,Kaiwen Zhou,Shuai Wang,Shuo Yang,Zhuotao Tian,Rui Shao

Main category: cs.CV

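TL;DR: This paper proposes HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO) to use historical information efficiently; its two core components, dynamic context sampling and anchor-guided history compression, deliver performance that beats or matches larger models on several benchmarks while substantially improving computational efficiency.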

Details Motivation: Using the full history directly in sequential GUI navigation causes heavy computational overhead and distraction from irrelevant information, so a method that uses historical context both effectively and efficiently is needed. Method: HiconAgent is trained with History Context-aware Policy Optimization (HCPO), which includes Dynamic Context Sampling (DCS) to select relevant history adaptively during sampling, and Anchor-guided History Compression (AHC), which uses a dual-branch structure during policy updates to keep history actions as information anchors, with an alignment loss to maintain consistency. Result: On GUI-Odyssey, HiconAgent-3B improves over GUI-R1-7B by 8.46 percent in grounding accuracy and 11.32 percent in step success rate; it performs comparably on AndroidControl and AITW while achieving up to a 2.47x speedup and a 60 percent reduction in FLOPs. Conclusion: Through HCPO, HiconAgent uses historical information efficiently, surpasses larger models at a smaller scale, and markedly reduces computational cost, demonstrating its strength and practicality for GUI navigation. Abstract: Graphical User Interface (GUI) agents require effective use of historical context to perform sequential navigation tasks. While incorporating past actions and observations can improve decision making, naive use of full history leads to excessive computational overhead and distraction from irrelevant information. To address this, we introduce HiconAgent, a GUI agent trained with History Context-aware Policy Optimization (HCPO) for efficient and effective utilization of historical information. HCPO optimizes history usage in both sampling and policy updates through two complementary components: (1) Dynamic Context Sampling (DCS) presents the agent with variable length histories during sampling, enabling adaptive use of the most relevant context; (2) Anchor-guided History Compression (AHC) refines the policy update phase with a dual branch strategy where the compressed branch removes history observations while keeping history actions as information flow anchors. The compressed and uncompressed branches are coupled through a history-enhanced alignment loss to enforce consistent history usage while maintaining efficiency. Experiments on mainstream GUI navigation benchmarks demonstrate strong performance. Despite being smaller, HiconAgent-3B outperforms GUI-R1-7B by +8.46 percent grounding accuracy and +11.32 percent step success rate on GUI-Odyssey, while achieving comparable results on AndroidControl and AITW with up to 2.47x computational speedup and 60 percent FLOPs reduction.
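
Presenting "variable length histories during sampling" can be pictured with the small sketch below, which keeps a randomly sized suffix of the interaction history for each rollout. The uniform choice of length, the function name, and the step labels are assumptions for illustration; the abstract does not say how DCS draws history lengths.

```python
import random


def sample_history(history, max_len, rng):
    """Return the most recent k steps, with k drawn uniformly from 0..max_len.

    Hypothetical sketch of variable-length history sampling; the real
    DCS sampling distribution is not specified in the abstract.
    """
    k = rng.randint(0, min(max_len, len(history)))
    return history[len(history) - k:]  # avoids the history[-0:] pitfall


rng = random.Random(0)  # seeded for reproducibility
steps = ["open_app", "tap_search", "type_query", "tap_result", "scroll"]
contexts = [sample_history(steps, max_len=3, rng=rng) for _ in range(4)]
```

Training on rollouts with different context lengths is what lets the policy learn how much history it actually needs, which the compression branch then exploits.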

[279] VideoScoop: A Non-Traditional Domain-Independent Framework For Video Analysis

Hafsa Billah

Main category: cs.CV

TL;DR: Proposes a general-purpose Video Situation Analysis (VSA) framework that combines an extended relational model (R++) with graph models, supports continuous query processing and cross-domain situation detection, achieves domain independence through parameterized templates, and is evaluated for accuracy, efficiency, and robustness on several real-world scenarios.

Details Motivation: Existing video situation analysis relies on manual work or custom algorithms, which generalize poorly, are time-consuming and labor-intensive, and cannot meet the need for automatic detection of many kinds of situations across domains. Method: Video contents are extracted with state-of-the-art video content extraction techniques and represented in two ways, the R++ relational model and graph models; R++ supports stream processing through a continuous query language, while the graph models are used to detect complex situations; parameterized templates abstract primitive situations across domains to achieve domain independence. Result: Extensive experiments on a variety of situations from three domains (assisted living, civic monitoring, and general surveillance) indicate that the framework is accurate, efficient, and robust on videos of varying lengths. Conclusion: The proposed VSA framework enables general, automated situation analysis, overcomes the scalability and domain-adaptation limits of traditional approaches, and has practical value. Abstract: Automatically understanding video contents is important for several applications in Civic Monitoring (CM), general Surveillance (SL), Assisted Living (AL), etc. Decades of Image and Video Analysis (IVA) research have advanced tasks such as content extraction (e.g., object recognition and tracking). Identifying meaningful activities or situations (e.g., two objects coming closer) remains difficult and cannot be achieved by content extraction alone. Currently, Video Situation Analysis (VSA) is done manually with a human in the loop, which is error-prone and labor-intensive, or through custom algorithms designed for specific video types or situations. These algorithms are not general-purpose and require a new algorithm/software for each new situation or video from a new domain. This report proposes a general-purpose VSA framework that overcomes the above limitations. Video contents are extracted once using state-of-the-art Video Content Extraction technologies. They are represented using two alternative models -- the extended relational model (R++) and graph models. When represented using R++, the extracted contents can be used as data streams, enabling Continuous Query Processing via the proposed Continuous Query Language for Video Analysis. The graph models complement this by enabling the detection of situations that are difficult or impossible to detect using the relational model alone. Existing graph algorithms and newly developed algorithms support a wide variety of situation detection. To support domain independence, primitive situation variants across domains are identified and expressed as parameterized templates. Extensive experiments were conducted across several interesting situations from three domains -- AL, CM, and SL -- to evaluate the accuracy, efficiency, and robustness of the proposed approach using a dataset of videos of varying lengths from these domains.
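
To make the "two objects coming closer" situation concrete, here is a minimal sketch of a parameterized detector over extracted object positions. It is not the paper's Continuous Query Language or R++ model; the row layout `(frame, object_id, x, y)`, the function name, and the `min_frames` parameter are assumptions chosen for illustration, and the rows are processed as a batch rather than as a live stream.

```python
import math
from typing import Dict, Iterable, List, Tuple

# One extracted tuple per object per frame: (frame, object_id, x, y).
Row = Tuple[int, str, float, float]


def coming_closer(rows: Iterable[Row], a: str, b: str, min_frames: int = 3) -> List[int]:
    """Frames at which the a-b distance has decreased min_frames times in a row."""
    positions: Dict[int, Dict[str, Tuple[float, float]]] = {}
    for frame, obj, x, y in rows:
        positions.setdefault(frame, {})[obj] = (x, y)

    hits: List[int] = []
    streak, prev = 0, None
    for frame in sorted(positions):
        objs = positions[frame]
        if a not in objs or b not in objs:
            # One object is missing in this frame: reset the window.
            streak, prev = 0, None
            continue
        dist = math.dist(objs[a], objs[b])
        streak = streak + 1 if prev is not None and dist < prev else 0
        prev = dist
        if streak >= min_frames:
            hits.append(frame)
    return hits


rows = [(t, "person_1", float(t), 0.0) for t in range(6)]
rows += [(t, "person_2", 10.0 - t, 0.0) for t in range(6)]
frames = coming_closer(rows, "person_1", "person_2")
```

Because the object pair and the window length are parameters, the same function can be reused for any two tracked objects, which is the spirit of the parameterized templates the abstract describes.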

[280] Robust Rigid and Non-Rigid Medical Image Registration Using Learnable Edge Kernels

Ahsan Raza Siyal,Markus Haltmeier,Ruth Steiger,Malik Galijasevic,Elke Ruth Gizewski,Astrid Ellen Grams

Main category: cs.CV

TL;DR: Proposes a medical image registration method that incorporates learnable edge kernels; by improving edge feature extraction, it outperforms existing techniques across several experimental setups.

Details Motivation: Traditional medical image registration struggles with contrast differences, spatial distortions, and modality differences, and needs a more effective feature alignment strategy. Method: Predefined edge detection kernels are perturbed with noise and then learned during training, and are combined with learning-based rigid and non-rigid registration to strengthen structural feature extraction. Result: On three in-house setups and two public datasets, the method outperforms current state-of-the-art approaches on both rigid and non-rigid registration. Conclusion: Learnable edge kernels improve multi-modal medical image registration and can support anatomical structure analysis and clinical diagnosis. Abstract: Medical image registration is crucial for various clinical and research applications, such as disease diagnosis and treatment planning, which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.
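
The initialization step described in the abstract (a predefined edge kernel perturbed with random noise) might look roughly like the sketch below. The choice of a Sobel kernel, the Gaussian noise, and the value of `sigma` are assumptions, since the abstract does not name them; the training loop that would then update the kernel weights is not shown.

```python
import random
from typing import List

Matrix = List[List[float]]

# A standard horizontal-gradient Sobel kernel, used here as the predefined edge kernel.
SOBEL_X: Matrix = [[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]


def perturbed_kernel(base: Matrix, sigma: float, seed: int = 0) -> Matrix:
    """Copy the base kernel and add zero-mean Gaussian noise to every weight."""
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in base]


def correlate2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Plain 'valid' cross-correlation, as computed by a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj] for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 4x4 image with a vertical step edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
initial_weights = perturbed_kernel(SOBEL_X, sigma=0.01)
response = correlate2d_valid(image, initial_weights)
```

Starting near a known edge detector means the layer responds to edges from the first training step, while the noise lets gradient descent move each copy of the kernel in a different direction.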

[281] Evaluating SAM2 for Video Semantic Segmentation

Syed Hesham Syed Ariff,Yun Liu,Guolei Sun,Jing Yang,Henghui Ding,Xue Geng,Xudong Jiang

Main category: cs.CV

TL;DR: This paper explores two ways of extending SAM2 to dense Video Semantic Segmentation (VSS), using its precise boundary prediction to improve VSS performance.

Details Motivation: To extend SAM2, a powerful promptable segmentation model, to video semantic segmentation, a task that demands spatial accuracy, temporal consistency, and multi-object tracking. Method: The first approach uses SAM2 to extract object masks while a parallel segmentation network generates and refines initial predictions; the second uses the predicted masks to extract feature vectors, feeds them to a simple network for classification, and combines the classifications with the masks to produce the final segmentation. Result: Experiments suggest that using SAM2 improves overall VSS performance, mainly thanks to its precise prediction of object boundaries. Conclusion: SAM2 shows promise for video semantic segmentation, but challenges such as multiple scales, complex boundaries, and temporal consistency remain. Abstract: The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object-aware memories and transferring them temporally through memory blocks. While SAM2 excels in video object segmentation by providing dense segmentation masks based on prompts, extending it to dense Video Semantic Segmentation (VSS) poses challenges due to the need for spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 for VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges faced during this process. The first approach involves using SAM2 to extract unique objects as masks from a given image, with a segmentation network employed in parallel to generate and refine initial predictions. The second approach utilizes the predicted masks to extract unique feature vectors, which are then fed into a simple network for classification. The resulting classifications and masks are subsequently combined to produce the final segmentation. Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
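
The second approach (mask, then feature vector, then class, then painted segmentation) can be sketched in a few lines. Everything concrete here is an assumption for illustration: masked average pooling as the feature extractor, a nearest-prototype rule standing in for the "simple network", and the toy 2x2 feature map. No SAM2 API is called.

```python
from typing import Dict, List

Mask = List[List[bool]]              # H x W, True inside the object
Features = List[List[List[float]]]   # H x W x C feature map


def masked_average(features: Features, mask: Mask) -> List[float]:
    """Average the feature vectors of all pixels inside the mask."""
    inside = [vec for frow, mrow in zip(features, mask) for vec, m in zip(frow, mrow) if m]
    if not inside:
        raise ValueError("mask is empty")
    return [sum(channel) / len(inside) for channel in zip(*inside)]


def classify(vec: List[float], prototypes: Dict[str, List[float]]) -> str:
    """Nearest-prototype classifier (a stand-in for the simple network)."""
    return min(prototypes, key=lambda name: sum((a - b) ** 2 for a, b in zip(vec, prototypes[name])))


def paint(masks: List[Mask], labels: List[str], height: int, width: int) -> List[List[str]]:
    """Combine per-mask labels into one dense label map."""
    seg = [["unlabeled"] * width for _ in range(height)]
    for mask, label in zip(masks, labels):
        for i in range(height):
            for j in range(width):
                if mask[i][j]:
                    seg[i][j] = label
    return seg


features = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
masks = [[[True, False], [True, False]], [[False, True], [False, True]]]
prototypes = {"road": [1.0, 0.0], "car": [0.0, 1.0]}
labels = [classify(masked_average(features, m), prototypes) for m in masks]
segmentation = paint(masks, labels, height=2, width=2)
```

In this arrangement the mask alone decides where a boundary lies and the classifier only decides what the region is, which matches the abstract's point that the gain comes mainly from boundary precision.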

[282] Learned Image Compression for Earth Observation: Implications for Downstream Segmentation Tasks

Christian Mollière,Iker Cumplido,Marco Zeulner,Lukas Liesenhoff,Matthias Schubert,Julia Gottfriedsen

Main category: cs.CV

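TL;DR: This paper evaluates task-specific learned compression against a traditional codec (JPEG 2000) on Earth observation data, comparing them on segmentation tasks for fire, cloud, and building detection.

Details Motivation: The rapid growth of satellite remote sensing data strains transmission and storage, so compression must preserve the information that downstream tasks depend on. Method: A learned compression approach (Discretized Mixed Gaussian Likelihood) is compared with JPEG 2000 on three remote sensing segmentation tasks in terms of reconstruction quality (PSNR) and segmentation accuracy; joint end-to-end optimization of the compression and segmentation models is also explored. Result: Learned compression clearly outperforms JPEG 2000 on large-scale, multi-channel optical imagery in both PSNR and segmentation accuracy; on smaller, single-channel thermal infrared data, the traditional codec remains competitive because of limited data and architectural constraints; joint end-to-end optimization brings no improvement. Conclusion: Task-specific learned compression can reduce data volume while retaining task-critical information in suitable remote sensing settings, but its advantage depends on data modality and scale, and joint end-to-end optimization has not yet shown a benefit. Abstract: The rapid growth of data from satellite-based Earth observation (EO) systems poses significant challenges in data transmission and storage. We evaluate the potential of task-specific learned compression algorithms in this context to reduce data volumes while retaining crucial information. In detail, we compare traditional compression (JPEG 2000) with a learned compression approach (Discretized Mixed Gaussian Likelihood) on three EO segmentation tasks: fire, cloud, and building detection. Learned compression notably outperforms JPEG 2000 for large-scale, multi-channel optical imagery in both reconstruction quality (PSNR) and segmentation accuracy. However, traditional codecs remain competitive on smaller, single-channel thermal infrared datasets due to limited data and architectural constraints. Additionally, joint end-to-end optimization of compression and segmentation models does not improve performance over standalone optimization.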
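
For readers unfamiliar with the reconstruction metric used above, PSNR is defined as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the original and the reconstruction. A minimal sketch follows; the flat pixel lists and the default `max_value` of 255 (8-bit data) are assumptions, and satellite products with other bit depths would need a different maximum.

```python
import math
from typing import Sequence


def psnr(original: Sequence[float], reconstructed: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels; higher means a closer reconstruction."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


pixels = [10.0, 20.0, 30.0, 40.0]
decoded = [11.0, 19.0, 30.0, 42.0]
score = psnr(pixels, decoded)
```

PSNR only measures pixel-level fidelity, which is why the paper also reports segmentation accuracy: a codec can score well on PSNR and still discard the details a downstream model needs.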

[283] SAM3-UNet: Simplified Adaptation of Segment Anything Model 3

Xinyu Xiong,Zihuang Wu,Lei Lu,Yufa Xia

Main category: cs.CV

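TL;DR: This paper presents SAM3-UNet, a simplified variant of SAM3 for low-cost adaptation to downstream tasks. It consists of the SAM3 image encoder, a simple adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder, and it outperforms prior methods on several tasks while needing less than 6 GB of GPU memory for training.

Details Motivation: To adapt SAM3 to downstream tasks at low cost, addressing the heavy resource use or insufficient performance of existing methods. Method: SAM3-UNet combines the SAM3 image encoder, a parameter-efficient adapter, and a lightweight U-Net-style decoder to achieve efficient fine-tuning with a small memory footprint. Result: On several tasks, including mirror detection and salient object detection, SAM3-UNet outperforms SAM2-UNet and other state-of-the-art methods while using less than 6 GB of GPU memory during training with a batch size of 12. Conclusion: SAM3-UNet is an efficient, low-resource SAM3 variant suited to a range of downstream vision tasks, with good practicality and extensibility. Abstract: In this paper, we introduce SAM3-UNet, a simplified variant of Segment Anything Model 3 (SAM3), designed to adapt SAM3 for downstream tasks at a low cost. Our SAM3-UNet consists of three components: a SAM3 image encoder, a simple adapter for parameter-efficient fine-tuning, and a lightweight U-Net-style decoder. Preliminary experiments on multiple tasks, such as mirror detection and salient object detection, demonstrate that the proposed SAM3-UNet outperforms the prior SAM2-UNet and other state-of-the-art methods, while requiring less than 6 GB of GPU memory during training with a batch size of 12. The code is publicly available at https://github.com/WZH0120/SAM3-UNet.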

[284] Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos

Xavier Thomas,Youngsun Lim,Ananya Srinivasan,Audrey Zheng,Deepti Ghadiyaram

Main category: cs.CV

TL;DR: This paper proposes a new metric for assessing generated video that builds a latent space of real human actions by fusing skeletal geometry with appearance features, improving evaluation of the temporal plausibility of complex human motion.

Details Motivation: Existing video evaluation methods are strongly appearance-biased and lack temporal understanding, so they struggle to capture motion dynamics and anatomical implausibilities in generated videos of complex human actions. Method: A new metric built on a latent space of real human actions, fusing appearance-agnostic skeletal geometry features with appearance-based features, and quantifying action quality as the distance between a generated video's representations and the real-action distribution. Result: Improves on existing state-of-the-art methods by more than 68% on a newly built multi-faceted benchmark, performs well on external benchmarks, and correlates more strongly with human perception. Conclusion: The method substantially outperforms existing metrics, exposes key limitations of current video generative models, and sets a new standard for video generation research. Abstract: Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.

[285] Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

Juanxi Tian,Siyuan Li,Conghui He,Lijun Wu,Cheng Tan

Main category: cs.CV

TL;DR: Proposes the Envision benchmark and the Envision-Score metric for evaluating multimodal models on generating causal event sequences, revealing shortcomings of existing models in spatiotemporal consistency and in modeling dynamic world knowledge.

Details Motivation: Existing multimodal models rely on static single-image generation, which leads to overfitting to static patterns, makes it hard to model dynamic processes that unfold over time, and leaves them without a real grasp of causal ordering and world knowledge. Method: Builds Envision, a benchmark of 1,000 four-stage prompts for text-to-multi-image generation grounded in world knowledge and spatiotemporal causal structure, and proposes Envision-Score, a holistic metric covering multi-dimensional consistency, physicality, and aesthetics. Result: An evaluation of 15 models shows that specialized T2I models are good at aesthetic rendering but lack world knowledge; unified multimodal models are better at causal narrative coherence but still trail closed-source models, and spatiotemporal consistency remains a challenge across the board. Conclusion: Focusing only on isolated single-image generation impedes multi-frame reasoning and generation and pushes models toward static matching rather than dynamic modeling, limiting how well they internalize world knowledge; a shift toward causally chained tasks is needed to advance genuinely time-aware intelligence. Abstract: Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) uncovers the following: specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world knowledge internalization and generation.

[286] Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling

Meng Cao,Haokun Lin,Haoyuan Li,Haoran Tang,Rongtao Xu,Dong An,Xue Liu,Ian Reid,Xiaodan Liang

Main category: cs.CV

TL;DR: This paper proposes MILO, an implicit spatial world modeling paradigm that improves the spatial reasoning of multimodal large language models through a visual generator and a relative positional encoding (RePE), and builds GeoGen, a large-scale geometry-aware dataset, for training.

Details Motivation: Existing multimodal large language models learn spatial reasoning through tuning on verbal descriptions with no link to visual form, which leads to visual illiteracy. Method: Proposes the MILO paradigm, which uses a visual generator to provide geometry-aware feedback, designs the RePE relative positional encoding to capture camera-pose transformations, and constructs the GeoGen dataset for training. Result: Experiments show that the approach markedly improves spatial reasoning across multiple benchmarks and models. Conclusion: By implicitly grounding symbolic reasoning in perceptual experience, MILO achieves a more holistic understanding of 3D space and offers a new direction for spatial cognition in multimodal models. Abstract: Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy, i.e., they learn spatial concepts through textual symbols alone, devoid of connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.

[287] CauSight: Learning to Supersense for Visual Causal Discovery

Yize Zhang,Meiqi Chen,Sirui Chen,Bo Peng,Yanxi Zhang,Tianyu Li,Chaochao Lu

Main category: cs.CV

TL;DR: This paper introduces the task of visual causal discovery, builds the large-scale VCG-32K dataset, and develops CauSight, a model that uses causally aware reasoning to substantially outperform GPT-4.1 on the task.

Details Motivation: For AI systems to understand causality as humans do rather than merely perceive which visual elements are present, models must infer causal relations among visual entities across diverse scenes. Method: Builds VCG-32K, a dataset of more than 32,000 images annotated with entity-level causal graphs; proposes CauSight, which uses Tree-of-Causal-Thought (ToCT) to generate reasoning trajectories and reinforcement learning with a causal reward to refine the reasoning policy. Result: CauSight outperforms GPT-4.1 on visual causal discovery, with a 21% absolute gain that amounts to more than a threefold improvement. Conclusion: Through causally aware reasoning, CauSight performs visual causal discovery effectively, supports the value of combining causal structure modeling with reinforcement learning for this task, and moves AI systems toward human-like causal understanding. Abstract: Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at the project page: https://github.com/OpenCausaLab/CauSight.

[288] OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic

Songyan Zhang,Wenhui Huang,Zhan Chen,Chua Jiahao Collister,Qihang Huang,Chen Lv

Main category: cs.CV

TL;DR: This paper proposes OpenREAD, a vision-language-model-based autonomous driving framework reinforced with open-ended reasoning, which applies end-to-end reinforcement fine-tuning (RFT) across the whole pipeline from high-level reasoning to low-level trajectory planning.

Details Motivation: Supervised fine-tuning (SFT) limits how well reasoning generalizes, while current reinforcement fine-tuning (RFT) is mostly confined to downstream tasks and is hard to apply to open-ended scene understanding because rewards are difficult to design. Method: Builds large-scale Chain-of-Thought (CoT) annotations and uses the Qwen3 large language model as the critic in RFT to quantify reasoning quality on open-ended questions, enabling end-to-end RFT from perception to decision-making. Result: Experiments show that joint end-to-end RFT substantially improves both upstream and downstream tasks, and OpenREAD reaches state-of-the-art performance on reasoning and planning benchmarks. Conclusion: By introducing LLM-based reward modeling and end-to-end RFT, OpenREAD unifies the optimization of open-ended reasoning and planning and advances knowledge-driven autonomous driving. Abstract: Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.

[289] PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

Zeqing Wang,Keze Wang,Lei Zhang

Main category: cs.CV

TL;DR: This paper builds PID, a dataset for detecting physically implausible content, and proposes a lightweight fine-tuning approach that lets vision-language models detect and explain physical violations in text-to-video output; the resulting detector is then used to assess how physically plausible current mainstream T2V models are.

Details Motivation: Text-to-video models have improved in quality and length, but it remains unclear whether what they generate obeys physical laws; existing vision-language models struggle to spot physical errors in generated videos, so dedicated methods are needed to evaluate and explain them. Method: Builds the PID dataset with 500 test samples and 2,588 training samples, in which each implausible video is generated by rewriting the caption of a real-world video; a lightweight fine-tuning approach trains a vision-language model to detect physically implausible events and generate explanations, yielding PhyDetEx. Result: Experiments show that although recent T2V models have become better at generating physically plausible videos, adhering to physical laws remains a challenge, especially for open-source models; the fine-tuned VLM can detect and explain physical violations effectively. Conclusion: Understanding and obeying physical laws is still a major challenge for T2V models; PhyDetEx provides a useful tool and benchmark for evaluating and improving the physical plausibility of generated videos. Abstract: Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify the physically impossible content from generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.

[290] Register Any Point: Scaling 3D Point Cloud Registration by Flow Matching

Yue Pan,Tao Sun,Liyuan Zhu,Lucas Nunes,Iro Armeni,Jens Behley,Cyrill Stachniss

Main category: cs.CV

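TL;DR: This paper casts point cloud registration as a conditional generation task: a learned continuous, point-wise velocity field transports unaligned point clouds into a common frame and directly generates the registered scene, without relying on conventional correspondence matching.

Details Motivation: Traditional registration depends on estimating point correspondences and optimizing pairwise transformations, which copes poorly with low overlap, noise, and multiple sensor modalities, so a more robust, end-to-end framework is needed. Method: Models registration as conditional generation, using a learned continuous point-wise velocity field to "flow" the input point clouds to the target scene, combined with a lightweight local feature extractor and test-time rigidity enforcement to improve accuracy. Result: Achieves state-of-the-art results on pairwise and multi-view registration benchmarks, especially with low overlap, generalizes across scales and sensor modalities, and supports downstream tasks such as relocalization, multi-robot SLAM, and multi-session map merging. Conclusion: The method offers a novel and robust registration paradigm that is free of the limits of conventional correspondence matching and shows strong potential in a range of practical applications. Abstract: Point cloud registration aligns multiple unposed point clouds into a common frame, and is a core step for 3D reconstruction and robot localization. In this work, we cast registration as conditional generation: a learned continuous, point-wise velocity field transports noisy points to a registered scene, from which the pose of each view is recovered. Unlike previous methods that conduct correspondence matching to estimate the transformation between a pair of point clouds and then optimize the pairwise transformations to realize multi-view registration, our model directly generates the registered point cloud. With a lightweight local feature extractor and test-time rigidity enforcement, our approach achieves state-of-the-art results on pairwise and multi-view registration benchmarks, particularly with low overlap, and generalizes across scales and sensor modalities. It further supports downstream tasks including relocalization, multi-robot SLAM, and multi-session map merging. Source code available at: https://github.com/PRBonn/RAP.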
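
The abstract says the pose of each view is recovered from the generated scene and that rigidity is enforced at test time, but it does not say how. One standard way to do both, shown here purely as an assumption, is a least-squares rigid fit (Procrustes/Kabsch) between a view's original points and their generated positions. The sketch is in 2D for brevity, where the optimal rotation has a closed form; the function names are hypothetical and not from the RAP code.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def fit_rigid_2d(src: List[Point], dst: List[Point]) -> Tuple[float, Point]:
    """Least-squares rotation angle and translation that map src onto dst."""
    n = len(src)
    scx, scy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dcx, dcy = sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - scx, py - scy   # centered source point
        bx, by = qx - dcx, qy - dcy   # centered target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that carries the rotated source centroid onto the target centroid.
    return theta, (dcx - (c * scx - s * scy), dcy - (s * scx + c * scy))


def apply_rigid_2d(points: List[Point], theta: float, t: Point) -> List[Point]:
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]


view = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
generated = apply_rigid_2d(view, math.pi / 6, (3.0, -1.0))  # stand-in for the model output
theta_hat, t_hat = fit_rigid_2d(view, generated)
snapped = apply_rigid_2d(view, theta_hat, t_hat)
```

Replacing the generated points with `snapped` guarantees that each view moves as a single rigid body, even if the velocity field displaced individual points slightly differently.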

[291] COACH: Collaborative Agents for Contextual Highlighting - A Multi-Agent Framework for Sports Video Analysis

Tsz-To Wong,Ching-Chun Huang,Hong-Han Shuai

Main category: cs.CV

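TL;DR: Proposes a reconfigurable multi-agent system (MAS) framework for sports video understanding that addresses multi-level temporal modeling, poor generalization, and low interpretability, using collaborating specialized agents to analyze adaptively from micro-level actions to macro-level strategy.

Details Motivation: Existing end-to-end models handle the multi-level temporal information in sports video poorly: they generalize badly, are costly to customize for new tasks, and lack interpretability, so they cannot uniformly support analysis needs ranging from fine-grained action recognition to whole-match summarization. Method: Designs a reconfigurable multi-agent system (MAS) in which each agent is a dedicated "cognitive tool" for a specific analysis task; agents are flexibly composed and iteratively invoked to build dynamic pipelines that adapt to different time scales and task requirements, applied to rally question answering and match summary generation in badminton. Result: The framework's adaptability is demonstrated on two badminton analysis tasks; it connects fine-grained event detection with global semantic organization, enables unified modeling across tasks, and improves interpretability and flexibility. Conclusion: The work offers a new paradigm for sports video understanding with good scalability, interpretability, and cross-task generalization, and a new direction for building general sports intelligence systems. Abstract: Intelligent sports video analysis demands a comprehensive understanding of temporal context, from micro-level actions to macro-level game strategies. Existing end-to-end models often struggle with this temporal hierarchy, offering solutions that lack generalization, incur high development costs for new tasks, and suffer from poor interpretability. To overcome these limitations, we propose a reconfigurable Multi-Agent System (MAS) as a foundational framework for sports video understanding. In our system, each agent functions as a distinct "cognitive tool" specializing in a specific aspect of analysis. The system's architecture is not confined to a single temporal dimension or task. By leveraging iterative invocation and flexible composition of these agents, our framework can construct adaptive pipelines for both short-term analytic reasoning (e.g., Rally QA) and long-term generative summarization (e.g., match summaries). We demonstrate the adaptability of this framework using two representative tasks in badminton analysis, showcasing its ability to bridge fine-grained event detection and global semantic organization. This work presents a paradigm shift towards a flexible, scalable, and interpretable system for robust, cross-task sports video intelligence. The project homepage is available at https://aiden1020.github.io/COACH-project-page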

[292] TransientTrack: Advanced Multi-Object Tracking and Classification of Cancer Cells with Transient Fluorescent Signals

Florian Bürger,Martim Dias Gomes,Nica Gutu,Adrián E. Granada,Noémie Moreau,Katarzyna Bozek

Main category: cs.CV

TL;DR: TransientTrack is a deep learning-based framework for tracking cells in multi-channel microscopy videos with transient fluorescent signals; it identifies key events such as mitosis and apoptosis and builds complete cell lineage trajectories.

Details Motivation: Existing cell tracking methods mainly target videos with a single constant signal, cannot detect important events such as cell death, and handle data whose signal fluctuates over time poorly. Method: A deep learning framework based on Transformer networks that matches directly on cell detection embeddings, combined with multi-stage matching and Kalman filter interpolation of missing tracklets, without extracting tracking-specific features. Result: Performs well across diverse conditions, tracks cells effectively while capturing division and death events, and is applied to a single-cell analysis of the efficacy of a chemotherapeutic drug. Conclusion: TransientTrack is a useful tool for quantitative studies of cancer cell dynamics and can help characterize treatment response and resistance mechanisms. Abstract: Tracking cells in time-lapse videos is an essential technique for monitoring cell population dynamics at a single-cell level. Current methods for cell tracking are developed on videos with mostly single, constant signals and do not detect pivotal events such as cell death. Here, we present TransientTrack, a deep learning-based framework for cell tracking in multi-channel microscopy video data with transient fluorescent signals that fluctuate over time following processes such as the circadian rhythm of cells. By identifying key cellular events, namely mitosis (cell division) and apoptosis (cell death), our method allows us to build complete trajectories, including cell lineage information. TransientTrack is lightweight and performs matching on cell detection embeddings directly, without the need for quantification of tracking-specific cell features. Furthermore, our approach integrates Transformer Networks, multi-stage matching using all detection boxes, and the interpolation of missing tracklets with the Kalman Filter. This unified framework achieves strong performance across diverse conditions, effectively tracking cells and capturing cell division and death. We demonstrate the use of TransientTrack in an analysis of the efficacy of a chemotherapeutic drug at a single-cell level. The proposed framework could further advance quantitative studies of cancer cell dynamics, enabling detailed characterization of treatment response and resistance mechanisms. The code is available at https://github.com/bozeklab/TransientTrack.
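
The abstract mentions interpolating missing tracklets with a Kalman filter without giving the motion model. The sketch below assumes the simplest common choice, a constant-velocity model on one coordinate with a frame interval of 1; the noise parameters `q` and `r` and the function name are illustrative, not values from TransientTrack. Frames with no detection (`None`) are filled by the prediction step alone.

```python
from typing import List, Optional


def fill_gaps(measurements: List[Optional[float]], q: float = 1e-3, r: float = 1e-2) -> List[float]:
    """Filter a 1D track; frames without a detection keep the predicted position."""
    pos = next(m for m in measurements if m is not None)  # start at the first detection
    vel = 0.0
    # State covariance P = [[p00, p01], [p10, p11]] for the state (pos, vel).
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0
    filled: List[float] = []
    for frame, z in enumerate(measurements):
        if frame > 0:
            # Predict: x <- F x and P <- F P F^T + Q, with F = [[1, 1], [0, 1]].
            pos = pos + vel
            p00, p01, p10, p11 = p00 + p01 + p10 + p11 + q, p01 + p11, p10 + p11, p11 + q
        if z is not None:
            # Update with a position measurement, H = [1, 0].
            s = p00 + r
            k0, k1 = p00 / s, p10 / s
            resid = z - pos
            pos, vel = pos + k0 * resid, vel + k1 * resid
            p00, p01, p10, p11 = (1 - k0) * p00, (1 - k0) * p01, p10 - k1 * p00, p11 - k1 * p01
        filled.append(pos)
    return filled


track = [0.0, 1.0, 2.0, 3.0, None, None, 6.0, 7.0]
filled = fill_gaps(track)
```

In a 2D tracker the same filter would run on both box-centre coordinates (and usually box size), and the filled positions let a tracklet be re-linked when the detector finds the cell again.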

[293] KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM

Zaid Nasser,Mikhail Iumanov,Tianhao Li,Maxim Popov,Jaafar Mahmoud,Malik Mohrat,Ilya Obrubov,Ekaterina Derevyanka,Ivan Sosin,Sergey Kolyubin

Main category: cs.CV

TL;DR: KM-ViPE is a real-time, open-vocabulary SLAM framework for uncalibrated monocular cameras that performs online localization and semantic mapping in dynamic environments by fusing geometric and vision-language features.

Details Motivation: Existing SLAM systems mostly depend on depth sensors or offline calibration, or cannot handle dynamic scenes well, which limits their use on ego-centric and internet-scale video. Method: Proposes the KM-ViPE framework, which tightly couples DINO visual features with geometric constraints, uses an adaptive robust kernel based on high-level features to handle moving objects and movable static objects, and fuses geometric features with language-aligned deep visual features for semantic mapping. Result: Performs on par with state-of-the-art methods across several scenes, runs online without depth input or odometry estimation, and is more robust in dynamic environments. Conclusion: KM-ViPE delivers calibration-free, monocular, online SLAM that copes robustly with dynamic scenes, suits embodied AI applications such as robotics and AR/VR, and advances practical spatial intelligence. Abstract: We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems requiring depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through an adaptive robust kernel based on high-level features that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, while existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, which makes it a good fit for autonomous robotics and AR/VR applications and advances practical spatial intelligence capabilities for embodied AI.

[294] StyleYourSmile: Cross-Domain Face Retargeting Without Paired Multi-Style Data

Avirup Dey,Vinay Namboodiri

Main category: cs.CV

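TL;DR: This paper proposes StyleYourSmile, a one-shot cross-domain face retargeting method that needs no paired multi-style data; an efficient data augmentation strategy and a dual-encoder framework extract domain-invariant identity features and capture domain-specific style variations, and a diffusion model performs the cross-domain expression retargeting.

Details Motivation: Existing cross-domain face retargeting methods generalize poorly, depend on test-time optimization, or need carefully curated multi-style datasets to obtain domain-invariant identity representations. Method: Proposes an efficient data augmentation strategy and a dual-encoder framework, combined with a diffusion model for conditional generation, to control identity, expression, and style separately. Result: Experiments show that the method achieves superior identity preservation and retargeting fidelity across a wide range of visual domains. Conclusion: StyleYourSmile achieves efficient one-shot cross-domain facial expression retargeting without paired multi-style data, markedly improving generalization and practicality. Abstract: Cross-domain face retargeting requires disentangled control over identity, expressions, and domain-specific stylistic attributes. Existing methods, typically trained on real-world faces, either fail to generalize across domains, need test-time optimizations, or require fine-tuning with carefully curated multi-style datasets to achieve domain-invariant identity representations. In this work, we introduce StyleYourSmile, a novel one-shot cross-domain face retargeting method that eliminates the need for curated multi-style paired data. We propose an efficient data augmentation strategy alongside a dual-encoder framework, for extracting domain-invariant identity cues and capturing domain-specific stylistic variations. Leveraging these disentangled control signals, we condition a diffusion model to retarget facial expressions across domains. Extensive experiments demonstrate that StyleYourSmile achieves superior identity preservation and retargeting fidelity across a wide range of visual domains.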

[295] SARL: Spatially-Aware Self-Supervised Representation Learning for Visuo-Tactile Perception

Gurmeher Khurana,Lan Wei,Dandan Zhang

Main category: cs.CV

TL;DR: Proposes SARL, a spatially-aware self-supervised learning framework for robotic manipulation with fused visuo-tactile data; by preserving the spatial structure of feature maps, it clearly outperforms existing methods on multiple downstream tasks.

Details Motivation: Existing self-supervised learning methods usually compress feature maps into a global vector, losing the spatial structure that contact-rich manipulation depends on. Method: Adds three feature-map-level objectives to the BYOL architecture, Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attention, part composition, and geometric relations consistent across views. Result: SARL consistently outperforms nine SSL baselines on six downstream tasks; on edge-pose regression it reaches an MAE of 0.3955, a 30% improvement over the next-best method that approaches the supervised upper bound. Conclusion: For fused visuo-tactile data, a structured signal that preserves spatial equivariance is the most effective and markedly improves robotic perception. Abstract: Contact-rich robotic manipulation requires representations that encode local geometry. Vision provides global context but lacks direct measurements of properties such as texture and hardness, whereas touch supplies these cues. Modern visuo-tactile sensors capture both modalities in a single fused image, yielding intrinsically aligned inputs that are well suited to manipulation tasks requiring visual and tactile information. Most self-supervised learning (SSL) frameworks, however, compress feature maps into a global vector, discarding spatial structure and misaligning with the needs of manipulation. To address this, we propose SARL, a spatially-aware SSL framework that augments the Bootstrap Your Own Latent (BYOL) architecture with three map-level objectives, including Saliency Alignment (SAL), Patch-Prototype Distribution Alignment (PPDA), and Region Affinity Matching (RAM), to keep attentional focus, part composition, and geometric relations consistent across views. These losses act on intermediate feature maps, complementing the global objective. SARL consistently outperforms nine SSL baselines across six downstream tasks with fused visual-tactile data. On the geometry-sensitive edge-pose regression task, SARL achieves a Mean Absolute Error (MAE) of 0.3955, a 30% relative improvement over the next-best SSL method (0.5682 MAE) and approaching the supervised upper bound. These findings indicate that, for fused visual-tactile data, the most effective signal is structured spatial equivariance, in which features vary predictably with object geometry, which enables more capable robotic perception.
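
The 30% figure can be checked directly from the two MAE values quoted in the abstract. The helper below is just that arithmetic; the function name is illustrative.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fractional reduction in error relative to the baseline (lower error is better)."""
    return (baseline_error - new_error) / baseline_error


gain = relative_improvement(0.5682, 0.3955)
print(f"{gain:.1%}")  # prints "30.4%"
```

So the reported "30% relative improvement" corresponds to an absolute MAE reduction of 0.1727.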

[296] Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding

Zahra Mahdavi,Zahra Khodakaramimaghsoud,Hooman Khaloo,Sina Bakhshandeh Taleshani,Erfan Hashemi,Javad Mirzapour Kaleybar,Omid Nejati Manzari

Main category: cs.CV

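TL;DR: Med-VCD is a sparse visual-contrastive decoding method that reduces hallucination in medical large vision-language models, improving factual accuracy and reasoning reliability without adding inference-time overhead.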

Details Motivation: Existing decoding strategies have reduced hallucinations in the natural-image domain, but in the medical domain they suffer from inference latency and modality misalignment; an efficient, general solution is needed to make medical LVLMs more reliable. Method: Proposes Med-VCD, which combines sparse visual-contrastive decoding with a dynamic token-sparsification strategy that selects visually supported tokens, trims redundancy on the fly, strengthens the constraint that visual evidence places on generation, and avoids the latency of secondary decoding. Result: On eight medical datasets covering ophthalmology, radiology, and pathology, Med-VCD raises factual accuracy by 13% on average and hallucination accuracy by 6% relative to baseline models. Conclusion: Med-VCD effectively mitigates hallucination in medical vision-language models, markedly improving the factual consistency of generated content while preserving inference efficiency, and shows good potential for cross-domain use. Abstract: Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination: outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.

[297] Physical ID-Transfer Attacks against Multi-Object Tracking via Adversarial Trajectory

Chenyi Wang,Yanmao Man,Raymond Muller,Ming Li,Z. Berkay Celik,Ryan Gerdes,Jonathan Petit

Main category: cs.CV

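TL;DR: Presents AdvTraj, the first online, physical ID-manipulation attack against tracking-by-detection multi-object tracking (MOT) systems; it generates adversarial trajectories that mislead identity association without attacking the detection module, with a high success rate and transferability across algorithms.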

Details Motivation: Existing MOT attacks mostly perturb the detection module in the digital domain or hijack individual targets; online attacks in real physical scenes are understudied, and existing attacks lack robustness and generality. This paper aims to expose the latent vulnerability of MOT systems in the object-association phase. Method: Proposes AdvTraj, which uses adversarial trajectories to transfer the attacker's ID to a target object and disrupt the tracker's identity matching; implements the online physical attack in the CARLA simulator without modifying the detection module; and designs two universal adversarial maneuvers that a pedestrian or driver can perform. Result: In the white-box setting the attack succeeds 100% of the time against SORT and transfers to several SOTA MOT algorithms with up to a 93% success rate; the generated adversarial trajectories follow interpretable patterns that support physical execution in real scenes. Conclusion: AdvTraj exposes security weaknesses in the object-association phase of current MOT systems, shows that efficient ID manipulation is possible without attacking the detection module, and offers a new perspective for improving MOT robustness. Abstract: Multi-Object Tracking (MOT) is a critical task in computer vision, with applications ranging from surveillance systems to autonomous driving. However, threats to MOT algorithms have yet to be widely studied. In particular, incorrect association between the tracked objects and their assigned IDs can lead to severe consequences, such as wrong trajectory predictions. Previous attacks against MOT either focused on hijacking the trackers of individual objects, or manipulating the tracker IDs in MOT by attacking the integrated object detection (OD) module in the digital domain, which are model-specific, non-robust, and only able to affect specific samples in offline datasets. In this paper, we present AdvTraj, the first online and physical ID-manipulation attack against tracking-by-detection MOT, in which an attacker uses adversarial trajectories to transfer its ID to a targeted object to confuse the tracking system, without attacking OD. Our simulation results in CARLA show that AdvTraj can fool ID assignments with 100% success rate in various scenarios for white-box attacks against SORT, which also have high attack transferability (up to 93% attack success rate) against state-of-the-art (SOTA) MOT algorithms due to their common design principles. We characterize the patterns of trajectories generated by AdvTraj and propose two universal adversarial maneuvers that can be performed by a human walker/driver in daily scenarios. 
Our work reveals under-explored weaknesses in the object association phase of SOTA MOT systems, and provides insights into enhancing the robustness of such systems.

[298] Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

Zhongyu Yang,Dannong Xu,Wei Pang,Yingfang Yuan

Main category: cs.CV

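TL;DR: Script is a plug-and-play visual token pruning method that requires no retraining; using a graph-structured pruning module and a query-conditioned semantic pruning module, it substantially improves the inference efficiency of multimodal large models while retaining 96.88% of the original performance.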

Details Motivation: Existing visual token pruning methods often ignore relevance to the user query or are constrained by attention mechanisms, which limits their adaptability and effectiveness. Method: Proposes Script, which has two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant semantic information; together they improve performance on multimodal tasks. Result: Experiments on 14 image and video understanding benchmarks show that Script is more efficient and more accurate than existing methods; on LLaVA-NeXT-7B it achieves up to 6.8x prefill speedup and 10x FLOP reduction. Conclusion: Script is a general, efficient visual token pruning method that applies to a wide range of multimodal large language models, significantly reducing compute cost while maintaining high performance. Abstract: The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning is a technique used to mitigate this issue by removing redundancy, but existing methods often ignore relevance to the user query or suffer from the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy compared to existing pruning methods. On LLaVA-NeXT-7B, it achieves up to 6.8x prefill speedup and 10x FLOP reduction, while retaining 96.88% of the original performance.

[299] GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment

Haoyang He,Jay Patrikar,Dong-Ki Kim,Max Smith,Daniel McGann,Ali-akbar Agha-mohammadi,Shayegan Omidshafiei,Sebastian Scherer

Main category: cs.CV

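TL;DR: Proposes RLWG, a self-supervised post-training framework that aligns pretrained world models with physically verifiable structure through geometric and perceptual rewards, improving the spatial consistency and long-horizon stability of generative models in embodied navigation.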

Details Motivation: Existing video world models are visually realistic but lack geometric grounding, which limits their use in navigation tasks that require spatial coherence. Method: Proposes the RLWG framework, which combines several verifiable rewards (pose cycle-consistency, depth reprojection, and temporal coherence), and instantiates it with GrndCtrl, a reward-aligned adaptation method based on GRPO. Result: GrndCtrl significantly improves the spatial coherence and navigation stability of world models in outdoor environments, outperforming supervised fine-tuning. Conclusion: Self-supervised alignment with verifiable rewards can bridge generative pretraining and embodied behavior, enabling more reliable simulation of embodied environments. Abstract: Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability. We introduce Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with a physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning from verifiable feedback (RLVR) in language models, RLWG can use multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. We instantiate this framework with GrndCtrl, a reward-aligned adaptation method based on Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation. Like post-training alignment in large language models, GrndCtrl leverages verifiable rewards to bridge generative pretraining and grounded behavior, achieving superior spatial coherence and navigation stability over supervised fine-tuning in outdoor environments.
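
As a rough illustration of the GRPO ingredient mentioned above, the sketch below computes group-relative advantages: each rollout's reward is normalized by the mean and standard deviation of its group. This is the generic GRPO-style normalization, not GrndCtrl's actual code. The reward weights, the example scores, and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def combined_reward(pose: float, depth: float, temporal: float,
                    weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of three reward terms (weights are hypothetical)."""
    return weights[0] * pose + weights[1] * depth + weights[2] * temporal


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward by the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical (pose, depth, temporal) scores for four rollouts of one prompt.
group = [(0.9, 0.8, 0.7), (0.4, 0.5, 0.6), (0.7, 0.7, 0.7), (0.2, 0.3, 0.1)]
rewards = [combined_reward(*scores) for scores in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that score above the group mean receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.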

[300] SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation

Zisu Li,Hengye Lyu,Jiaxin Shi,Yufeng Zeng,Mingming Fan,Hanwang Zhang,Chen Liang

Main category: cs.CV

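TL;DR: Presents SpriteHand, an autoregressive video generation framework that synthesizes hand-object interaction videos in real time across many object types and motion patterns, including dynamic interactions with non-rigid or deformable objects.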

Details Motivation: Existing physics engines and simulation-based methods rely on rigid object models and pre-scripted gestures, so they struggle with complex interactions involving non-rigid, deformable, or living objects, which limits realism and flexibility. Method: Proposes SpriteHand, which uses a causal inference architecture for autoregressive generation, takes a static object image and a video stream of hand motion as input, and applies a hybrid post-training strategy to improve visual realism and temporal coherence. Result: The 1.3B-parameter model streams in real time at about 18 FPS and 640x368 resolution on a single NVIDIA RTX 5090 GPU, with roughly 150 ms latency and more than a minute of continuous output; experiments show better visual quality, physical plausibility, and interaction fidelity than generative and engine-based baselines. Conclusion: SpriteHand synthesizes diverse hand-object interaction videos efficiently and realistically, extends modeling to non-rigid and complex objects, and has potential applications in AR/VR and human-computer interaction. Abstract: Modeling and synthesizing complex hand-object interactions remains a significant challenge, even for state-of-the-art physics engines. Conventional simulation-based approaches rely on explicitly defined rigid object models and pre-scripted hand gestures, making them inadequate for capturing dynamic interactions with non-rigid or articulated entities such as deformable fabrics, elastic materials, hinge-based structures, furry surfaces, or even living creatures. In this paper, we present SpriteHand, an autoregressive video generation framework for real-time synthesis of versatile hand-object interaction videos across a wide range of object types and motion patterns. SpriteHand takes as input a static object image and a video stream in which the hands are imagined to interact with the virtual object embedded in a real-world scene, and generates corresponding hand-object interaction effects in real time. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Our 1.3B model supports real-time streaming generation at around 18 FPS and 640x368 resolution, with an approximate 150 ms latency on a single NVIDIA RTX 5090 GPU, and more than a minute of continuous output. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.

[301] SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning

Xu Zhang,Jin Yuan,Hanwang Zhang,Guojin Zhong,Yongsheng Zang,Jiacheng Lin,Zhiyong Li

Main category: cs.CV

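TL;DR: Introduces a new image semantic understanding task, Image Collaborative Segmentation and Captioning (SegCaptioning), which turns a simple prompt such as a bounding box into diverse (caption, mask) pairs, and designs SGDiff, a scene-graph-guided diffusion model, for semantically aligned multimodal prediction.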

Details Motivation: Conventional controllable image understanding tasks require costly input prompts and produce a single output, which does not meet users' need for flexibility. This paper aims to generate diverse, semantically consistent captions and segmentations from a minimal prompt, improving interaction efficiency and information richness. Method: Proposes the Scene Graph Guided Diffusion Model (SGDiff), comprising: 1) a Prompt-Centric Scene Graph Adaptor that maps the user's prompt to a scene graph to capture intent; 2) a Scene Graph Guided Bimodal Transformer that jointly generates semantically related captions and masks during the diffusion process; 3) a Multi-Entities Contrastive Learning loss that strengthens cross-modal alignment between visual and textual entities. Result: Experiments on two datasets show that SGDiff performs strongly on SegCaptioning and also does well on the captioning and instance segmentation subtasks, confirming that it can produce high-quality, aligned multimodal outputs from minimal prompts. Conclusion: By combining scene graph structure with a diffusion model, SGDiff converts simple prompts into diverse semantic outputs and suggests a direction for visual understanding systems with low interaction cost and high flexibility. Abstract: Controllable image semantic understanding tasks, such as captioning or segmentation, require users to input a prompt (e.g., text or bounding boxes) to predict a unique outcome, presenting challenges such as high-cost prompt input or limited information output. This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning), which aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs, allowing flexible result selection by users. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks. Technically, we propose a novel Scene Graph Guided Diffusion Model that leverages structured scene graph features for correlated mask-caption prediction. Initially, we introduce a Prompt-Centric Scene Graph Adaptor to map a user's prompt to a scene graph, effectively capturing their intention. Subsequently, we employ a diffusion process incorporating a Scene Graph Guided Bimodal Transformer to predict correlated caption-mask pairs by uncovering intricate correlations between them. 
To ensure accurate alignment, we design a Multi-Entities Contrastive Learning loss to explicitly align visual and textual entities by considering inter-modal similarity, resulting in well-aligned caption-mask pairs. Extensive experiments conducted on two datasets demonstrate that SGDiff achieves superior performance in SegCaptioning, yielding promising results for both captioning and segmentation tasks with minimal prompt input.

[302] Artemis: Structured Visual Reasoning for Perception Policy Learning

Wei Tang,Yanpeng Sun,Shan Zhang,Xiaofan Li,Piotr Koniusz,Wei Li,Na Zhao,Zechao Li

Main category: cs.CV

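TL;DR: Artemis is a perception-policy learning framework based on structured proposals; it represents intermediate reasoning steps as (label, bounding-box) pairs, aligning reasoning with spatial representations and improving performance and generalization on visual perception tasks.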

Details Motivation: Intermediate reasoning chains expressed in natural language often reduce performance on visual perception tasks, because they reason semantically in an unstructured linguistic space whereas visual perception calls for spatial, object-centric reasoning. Method: Proposes the Artemis framework, which uses (label, bounding-box) pairs as a structured representation of intermediate reasoning steps, enabling verifiable tracking of visual state, and performs perception-policy learning on top of Qwen2.5-VL-3B. Result: Artemis performs strongly on grounding and detection tasks, generalizes well to counting and geometric-perception tasks, and achieves competitive performance on general MLLM benchmarks. Conclusion: Aligning the reasoning process with spatial representations improves perception-policy learning, and structured visual reasoning offers a viable path toward scalable, general perception policies. Abstract: Recent reinforcement-learning frameworks for visual perception policy have begun to incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reasoning often reduces performance on perception tasks. We argue that the core issue lies not in reasoning per se but in the form of reasoning: while these chains perform semantic reasoning in an unstructured linguistic space, visual perception requires reasoning in a spatial and object-centric space. In response, we introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning, where each intermediate step is represented as a (label, bounding-box) pair capturing a verifiable visual state. This design enables explicit tracking of intermediate states, direct supervision for proposal quality, and avoids ambiguity introduced by language-based reasoning. Artemis is built on Qwen2.5-VL-3B, achieves strong performance on grounding and detection tasks, and exhibits substantial generalization to counting and geometric-perception tasks. The consistent improvements across these diverse settings confirm that aligning reasoning with spatial representations enhances perception-policy learning. Owing to its strengthened visual reasoning, Artemis also achieves competitive performance on general MLLM benchmarks, illustrating that spatially grounded reasoning provides a principled route toward scalable and general perception policies.

[303] PAI-Bench: A Comprehensive Benchmark For Physical AI

Fengzhe Zhou,Jiannan Huang,Jialuo Li,Deva Ramanan,Humphrey Shi

Main category: cs.CV

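TL;DR: PAI-Bench is a comprehensive benchmark that evaluates physical perception and prediction across video generation, conditional video generation, and video understanding; it comprises 2,808 real-world cases and reveals that current multi-modal large language models and video generative models fall short in modeling physically consistent dynamics.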

Details Motivation: How well current multi-modal large language models and video generative models perceive and predict real-world dynamics remains unclear, and a unified, comprehensive evaluation standard is missing. Method: Proposes PAI-Bench, a benchmark of 2,808 real-world cases covering video generation, conditional generation, and understanding tasks, with task-aligned metrics that assess physical plausibility and domain-specific reasoning. Result: Experiments show that video generative models, despite high visual fidelity, struggle to keep physical dynamics consistent, and multi-modal large language models have limited ability in forecasting and causal interpretation. Conclusion: Existing systems are still at an early stage in meeting the perceptual and predictive demands of Physical AI; PAI-Bench provides a realistic evaluation basis and identifies key directions for improvement. Abstract: Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.

[304] Learning Visual Affordance from Audio

Lidong Lu,Guo Chen,Zhu Wei,Yicheng Liu,Tong Lu

Main category: cs.CV

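TL;DR: Introduces a new task, Audio-Visual Affordance Grounding (AV-AG), which segments object interaction regions from action sounds; builds the first dataset with action sounds, object images, and pixel-level annotations; and proposes AVAGFormer, which achieves SOTA performance through cross-modal fusion and mask prediction.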

Details Motivation: Existing affordance grounding methods based on textual instructions or demonstration videos suffer from ambiguity or occlusion, whereas sound provides real-time, semantically rich, and visually independent cues; the goal is to use audio for more intuitive localization of interaction regions. Method: Builds the first AV-AG dataset, containing action sounds, object images, and pixel-level affordance annotations, and designs AVAGFormer, which uses a semantic-conditioned cross-modal mixer and a dual-head decoder to fuse audio and visual signals for mask prediction. Result: AVAGFormer achieves SOTA on AV-AG and surpasses baselines from related tasks; experiments confirm the benefit of end-to-end modeling and the contribution of each component, and clarify how AV-AG differs from AVS. Conclusion: Audio is an effective input modality for affordance grounding, AV-AG opens a new direction for multimodal perception, and the released dataset and code should help advance the field. Abstract: We introduce Audio-Visual Affordance Grounding (AV-AG), a new task that segments object interaction regions from action sounds. Unlike existing approaches that rely on textual instructions or demonstration videos, which are often limited by ambiguity or occlusion, audio provides real-time, semantically rich, and visually independent cues for affordance grounding, enabling more intuitive understanding of interaction regions. To support this task, we construct the first AV-AG dataset, comprising a large collection of action sounds, object images, and pixel-level affordance annotations. The dataset also includes an unseen subset to evaluate zero-shot generalization. Furthermore, we propose AVAGFormer, a model equipped with a semantic-conditioned cross-modal mixer and a dual-head decoder that effectively fuses audio and visual signals for mask prediction. Experiments show that AVAGFormer achieves state-of-the-art performance on AV-AG, surpassing baselines from related tasks. Comprehensive analyses highlight the distinctions between AV-AG and AVS, the benefits of end-to-end modeling, and the contribution of each component. Code and dataset have been released on https://jscslld.github.io/AVAGFormer/.

[305] MV-TAP: Tracking Any Point in Multi-View Videos

Jahyeok Koo,Inès Hyeonsu Kim,Mungyeom Kim,Junghyun Park,Seohyun Park,Jaeyeong Kim,Jung Yi,Seokju Cho,Seungryong Kim

Main category: cs.CV

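TL;DR: Proposes MV-TAP, a new point tracker for dynamic scenes in multi-view videos that exploits cross-view information; it is trained on synthetic data and performs strongly when evaluated on real-world scenes.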

Details Motivation: Multi-view camera systems provide rich observations of a scene, but understanding dynamic objects across views remains challenging and calls for more reliable trajectory estimation. Method: Proposes MV-TAP, which combines camera geometry with a cross-view attention mechanism to aggregate spatio-temporal information across views for point tracking, and builds a large-scale synthetic training dataset along with real-world evaluation sets. Result: On several challenging benchmarks, MV-TAP outperforms existing point-tracking methods and yields more complete and reliable trajectory estimates. Conclusion: MV-TAP improves point tracking in multi-view dynamic scenes and provides a strong baseline for research in this area. Abstract: Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.

[306] AirSim360: A Panoramic Simulation Platform within Drone View

Xian Ge,Yuling Pan,Yuhang Zhang,Xiang Li,Weijun Zhang,Dizhe Zhang,Zhaoliang Wan,Xin Lin,Xiangkai Zhang,Juntao Liang,Jason Li,Wenjie Jiang,Bo Du,Ming-Hsuan Yang,Lu Qi

Main category: cs.CV

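TL;DR: Presents AirSim360, a simulation platform that generates omnidirectional data from aerial viewpoints, addressing the lack of large-scale, diverse data for 360-degree omnidirectional understanding.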

Details Motivation: Research on omnidirectional understanding is limited by the lack of large-scale, diverse data and by the absence of systematic modeling of the 4D real world. Method: Proposes the AirSim360 platform with three core components: a render-aligned data and labeling paradigm for pixel-level understanding, an interactive pedestrian-aware system, and an automated trajectory generation method for navigation tasks, with drones providing wide-ranging scene sampling. Result: More than 60K panoramic samples were collected, and experiments across a variety of tasks confirm the effectiveness of the simulator. Conclusion: AirSim360 is the first work to systematically model the 4D real world in an omnidirectional setting, and the full platform, including the toolkit, plugins, and datasets, will be released publicly. Abstract: The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and entity-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available at https://insta360-research-team.github.io/AirSim360-website.

[307] Improved Mean Flows: On the Challenges of Fastforward Generative Models

Zhengyang Geng,Yiyang Lu,Zongze Wu,Eli Shechtman,J. Zico Kolter,Kaiming He

Main category: cs.CV

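TL;DR: Proposes improved MeanFlow (iMF), which redesigns the training objective and introduces a flexible guidance mechanism to obtain more stable training and better performance in one-step generative models; it reaches 1.72 FID with 1-NFE on ImageNet 256×256, substantially better than prior methods.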

Details Motivation: The original MeanFlow has weaknesses in both its training objective and its guidance mechanism: the training target depends on the network itself, which causes instability, and the guidance scale is fixed, which limits flexibility; both need to be addressed to improve performance and applicability. Method: Recasts the training objective as a loss on the instantaneous velocity v, re-parameterized by a network that predicts the average velocity u; treats guidance as explicit conditioning variables processed through in-context conditioning, which adds flexibility and reduces model size. Result: iMF achieves 1.72 FID (1-NFE) on ImageNet 256×256, outperforming comparable one-step methods and approaching multi-step methods, without distillation. Conclusion: With more stable training and a flexible guidance mechanism, iMF substantially improves one-step generative models and advances fastforward generative modeling as a stand-alone paradigm. Abstract: MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256×256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
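
The link between the two velocities can be written out explicitly. The derivation below assumes the usual definition of average velocity as the time average of the instantaneous velocity along the flow path over an interval $[r, t]$. That definition and the symbols $z_t$, $r$, $t$ are not stated in this abstract and are assumptions here. The exact iMF loss is likewise not given, so this only sketches why a network predicting $u$ also implies a prediction of $v$.

```latex
% Assumed definition: average velocity over the interval [r, t]
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Multiply by (t - r) and differentiate both sides with respect to t
\frac{d}{dt}\Big[(t - r)\, u(z_t, r, t)\Big] = v(z_t, t)

% Expand the product rule to express v through u
v(z_t, t) = u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t)
```

Under this identity, substituting a network $u_\theta$ on the right-hand side gives a velocity estimate that could be regressed against a ground-truth $v$, which is consistent with the abstract's description of a more standard regression problem.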

[308] TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Zhiheng Liu,Weiming Ren,Haozhe Liu,Zijian Zhou,Shoufa Chen,Haonan Qiu,Xiaoke Huang,Zhaochong An,Fanny Yang,Aditya Patel,Viktar Atliha,Tony Ng,Xiao Han,Chuyan Zhu,Chenyang Zhang,Ding Liu,Juan-Manuel Perez-Rua,Sen He,Jürgen Schmidhuber,Wenhu Chen,Ping Luo,Wei Liu,Tao Xiang,Jonas Schult,Yuren Cong

Main category: cs.CV

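TL;DR: TUNA is a native unified multimodal model that cascades a VAE encoder with a representation encoder to build a unified continuous visual representation space, enabling end-to-end processing of images and videos for both understanding and generation, and reaches SOTA on multiple tasks.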

Details Motivation: Existing unified multimodal models usually adopt decoupled representations, which cause format mismatches and limit joint optimization of understanding and generation; a unified visual representation is therefore needed. Method: Proposes TUNA, which cascades a VAE encoder with a representation encoder to build a unified continuous visual representation space and jointly trains understanding and generation tasks in that space. Result: TUNA achieves SOTA performance on image and video understanding, generation, and editing; stronger pretrained representation encoders bring consistent gains; and joint training lets the two task types benefit from each other rather than interfere. Conclusion: A unified visual representation space removes representation mismatches, improves both the understanding and generation abilities of multimodal models, and supports the effectiveness and scalability of the unified modeling paradigm. Abstract: Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.

[309] Generative Video Motion Editing with 3D Point Tracks

Yao-Chih Lee,Zhoutong Zhang,Jiahui Huang,Jui-Hsien Wang,Joon-Young Lee,Jia-Bin Huang,Eli Shechtman,Zhengqi Li

Main category: cs.CV

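TL;DR: Proposes a video-to-video framework conditioned on 3D point tracks that jointly edits camera and object motion; the depth information carried by 3D tracks improves the precision and spatiotemporal consistency of motion edits.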

Details Motivation: Existing video editing methods lack full-scene context when handling complex object motion and offer limited fine-grained motion control, especially for joint editing of camera motion and non-rigid object motion. Method: Proposes a video generation framework conditioned on 3D point tracks, using the source video and paired source and target 3D point tracks as conditioning inputs, and trains it in two stages on synthetic and real data. Result: The model supports a range of edits, including joint camera and object manipulation, motion transfer, and non-rigid deformation, and handles depth order and occlusion more precisely while preserving spatiotemporal coherence. Conclusion: Conditioning on 3D tracks markedly improves the controllability and realism of video motion editing and opens new possibilities for creative video editing. Abstract: Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation, but offer limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.

[310] Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don't Know Galileo's Principle...for now

Varun Varma Thozhiyoor,Shivam Tripathi,Venkatesh Babu Radhakrishnan,Anand Bhattad

Main category: cs.CV

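TL;DR: Studies how well video generation models represent gravity, finds that they systematically produce too low a gravitational acceleration, and proposes a unit-free two-object test that reveals violations of Galileo's equivalence principle; fine-tuning on a small amount of data partially corrects the problem.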

Details Motivation: To assess how well video generation models, viewed as world models, capture basic physical laws such as gravity, and to determine whether physical errors in their outputs stem from scale ambiguity. Method: Proposes a unit-free two-object test protocol ($t_1^2/t_2^2 = h_1/h_2$) that is independent of the gravitational acceleration $g$, focal length, and scale, to check physical consistency in generated videos; and fine-tunes a low-rank adaptor on a small set of single-ball drop videos to correct the gravity behavior. Result: Even temporal rescaling cannot remove the high-variance gravity errors in generated videos; the two-object test shows that the models violate Galileo's equivalence principle; fine-tuning on only 100 single-ball clips raises the effective gravitational acceleration from 1.81 m/s² to 6.43 m/s² and generalizes zero-shot to two-ball drops and inclined planes. Conclusion: Video generation models do not accurately encode real-world gravity, but a small amount of physics-specific data can correct this effectively, indicating that physical knowledge can be strengthened in a targeted way. Abstract: Video generators are increasingly evaluated as potential world models, which requires them to encode and understand physical laws. We investigate their representation of a fundamental law: gravity. Out-of-the-box video generators consistently generate objects falling at an effectively slower acceleration. However, these physical tests are often confounded by ambiguous metric scale. We first investigate if observed physical errors are artifacts of these ambiguities (e.g., incorrect frame rate assumptions). We find that even temporal rescaling cannot correct the high-variance gravity artifacts. To rigorously isolate the underlying physical representation from these confounds, we introduce a unit-free, two-object protocol that tests the timing ratio $t_1^2/t_2^2 = h_1/h_2$, a relationship independent of $g$, focal length, and scale. This relative test reveals violations of Galileo's equivalence principle. We then demonstrate that this physical gap can be partially mitigated with targeted specialization. A lightweight low-rank adaptor fine-tuned on only 100 single-ball clips raises $g_{\mathrm{eff}}$ from $1.81\,\mathrm{m/s^2}$ to $6.43\,\mathrm{m/s^2}$ (reaching $65\%$ of terrestrial gravity). This specialist adaptor also generalizes zero-shot to two-ball drops and inclined planes, offering initial evidence that specific physical laws can be corrected with minimal data.
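
The timing ratio follows from the free-fall relation $h = \tfrac{1}{2} g t^2$ for an object dropped from rest: $t^2 = 2h/g$, so $g$ cancels in $t_1^2/t_2^2$. The sketch below checks this numerically. It is not the paper's evaluation code: the drop heights are hypothetical, the helper names are introduced for illustration, and terrestrial gravity is assumed to be 9.81 m/s².

```python
import math


def fall_time(height_m: float, g: float) -> float:
    """Time to fall height_m from rest under constant acceleration g."""
    return math.sqrt(2.0 * height_m / g)


def timing_ratio(t1: float, t2: float) -> float:
    """Squared ratio of fall times; equals h1 / h2 for ideal free fall."""
    return (t1 / t2) ** 2


def effective_gravity(height_m: float, t: float) -> float:
    """Acceleration implied by an observed fall time from a known height."""
    return 2.0 * height_m / t ** 2


h1, h2 = 1.0, 2.0  # hypothetical drop heights in metres

# The ratio stays at h1 / h2 regardless of g, including the two
# effective values quoted in the abstract.
for g in (1.81, 6.43, 9.81):
    ratio = timing_ratio(fall_time(h1, g), fall_time(h2, g))
    assert math.isclose(ratio, h1 / h2)

# Fraction of terrestrial gravity reached after fine-tuning
# (9.81 m/s^2 is an assumed reference value).
fraction = 6.43 / 9.81
print(f"{fraction:.3f}")
```

Because the ratio does not depend on $g$, a generated video can fail this test only if the two objects fall under different effective accelerations, which is why the abstract frames it as a test of Galileo's equivalence principle rather than of absolute scale.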

[311] Visual Sync: Multi-Camera Synchronization via Cross-View Object Motion

Shaowei Liu,David Yifan Yao,Saurabh Gupta,Shenlong Wang

Main category: cs.CV

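TL;DR: VisualSync is an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized video streams with millisecond-level accuracy.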

Details Motivation: Synchronizing videos across cameras remains challenging in uncontrolled settings; existing methods depend on specific conditions or costly hardware, which limits wide use. Method: Uses off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences, then estimates each camera's time offset by jointly minimizing the epipolar error. Result: Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, with a median synchronization error below 50 ms. Conclusion: VisualSync achieves accurate time synchronization of consumer multi-camera video and is suitable for broad use in real-world scenes. Abstract: Today, people can easily record memorable moments, ranging from concerts, sports events, lectures, family gatherings, and birthday parties with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos at millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving a median synchronization error below 50 ms.
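
The key insight can be illustrated on synthetic data: a moving point satisfies $x_2^\top E\, x_1 = 0$ only when the two observations are taken at the same instant, so the epipolar residual is smallest at the correct time shift. The sketch below is a toy two-camera version and not VisualSync itself. It assumes a known relative pose, identity intrinsics, a single noise-free track, and a brute-force search over whole-frame shifts. All names and values are hypothetical.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def project(points):
    """Pinhole projection with identity intrinsics, as homogeneous (x, y, 1)."""
    return points / points[:, 2:3]


def epipolar_error(x1, x2, E):
    """Algebraic residual |x2^T E x1| for each matched pair of observations."""
    return np.abs(np.einsum("ni,ij,nj->n", x2, E, x1))


def estimate_offset(track1, track2, E, max_shift):
    """Find the integer shift s that best pairs track1[i] with track2[i + s]."""
    best_shift, best_err = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, -s), min(len(track1), len(track2) - s)
        if hi - lo < 2:
            continue  # too little overlap to score this shift
        err = epipolar_error(track1[lo:hi], track2[lo + s:hi + s], E).mean()
        if err < best_err:
            best_shift, best_err = s, err
    return best_shift


# Synthetic curved 3D trajectory of one moving point (world = camera-1 frame).
k = np.arange(60)
traj = np.stack([np.sin(0.2 * k), 0.5 * np.cos(0.3 * k), 4.0 + 0.1 * k], axis=1)

# Hypothetical relative pose of camera 2: rotation about y plus a translation.
a = 0.3
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-1.0, 0.0, 0.2])
E = skew(t) @ R  # essential matrix for X2 = R @ X1 + t

true_shift = 3
track1 = project(traj[5:45])                                    # camera 1
track2 = project(traj[5 - true_shift:45 - true_shift] @ R.T + t)  # camera 2

estimated = estimate_offset(track1, track2, E, max_shift=8)
print(estimated)
```

A real system would have to handle noisy tracks, many cameras, and sub-frame offsets, which is where the joint optimization described in the abstract comes in; the grid search here only demonstrates that the residual carries timing information.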

[312] Data-Centric Visual Development for Self-Driving Labs

Anbang Liu,Guanzhong Hu,Jiayi Wang,Ping Guo,Han Liu

Main category: cs.CV

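TL;DR: Proposes a hybrid pipeline that fuses real and virtual data to overcome the data scarcity that hinders training high-precision models for self-driving laboratories, focusing on bubble detection during pipetting.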

Details Motivation: Self-driving laboratories demand high model precision, but annotated data, especially negative samples, is hard to obtain, which constrains model training. Method: Builds a hybrid data-generation pipeline that combines real data (automated acquisition with selective human verification in a human-in-the-loop scheme) and virtual data (reference-conditioned, prompt-guided image generation followed by screening and validation) to form a class-balanced dataset. Result: A model trained entirely on automatically acquired real images reaches 99.6% accuracy on a held-out real test set; training on a mix of real and generated data sustains 99.4% accuracy while substantially reducing the collection and review burden. Conclusion: The approach gives self-driving laboratories a scalable, cost-effective way to supply visual feedback data and offers a practical answer to data scarcity in rare-event detection and other vision tasks. Abstract: Self-driving laboratories (SDLs) offer a promising path toward reducing the labor-intensive, time-consuming, and often irreproducible workflows in the biological sciences. Yet their stringent precision requirements demand highly robust models whose training relies on large amounts of annotated data. However, this kind of data is difficult to obtain in routine practice, especially negative samples. In this work, we focus on pipetting, the most critical and precision-sensitive action in SDLs. To overcome the scarcity of training data, we build a hybrid pipeline that fuses real and virtual data generation. The real track adopts a human-in-the-loop scheme that couples automated acquisition with selective human verification to maximize accuracy with minimal effort. The virtual track augments the real data using reference-conditioned, prompt-guided image generation, which is further screened and validated for reliability. Together, these two tracks yield a class-balanced dataset that enables robust bubble detection training. On a held-out real test set, a model trained entirely on automatically acquired real images reaches 99.6% accuracy, and mixing real and generated data during training sustains 99.4% accuracy while reducing collection and review load. Our approach offers a scalable and cost-effective strategy for supplying visual feedback data to SDL workflows and provides a practical solution to data scarcity in rare event detection and broader vision tasks.