cs.CL

[1] Signature vs. Substance: Evaluating the Balance of Adversarial Resistance and Linguistic Quality in Watermarking Large Language Models

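William Guo, Adaku Uchendu, Ana Smith

Main category: cs.CL

TL;DR: This paper examines watermarking for text generated by large language models (LLMs) and finds that current methods fall short in text quality, preservation of writing style, and robustness to adversarial attacks. Back-translation attacks in particular remove watermarks easily, which hinders wide adoption.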

Details
Motivation: Watermarking has been proposed to mitigate the potential harms of LLM-generated text, but existing methods perform poorly on text quality and attack resistance, which has hindered wide adoption. A systematic evaluation of the robustness and text fidelity of existing watermarking techniques is therefore needed.
Method: Compare several watermarking techniques under adversarial attacks, namely paraphrasing and back translation (English → another language → English), and use linguistic metrics to assess how well they preserve the semantics, quality, and writing style of the original text.
Result: Existing watermarking techniques preserve semantics but deviate from the original writing style, and the watermark signal is easily removed under adversarial attacks, especially back translation, causing detection to fail.
Conclusion: Current watermarking techniques are not robust enough against strong adversarial attacks and reduce the naturalness of text; they need further improvement before wide deployment in practice.
Abstract: To mitigate the potential harms of Large Language Model (LLM)-generated text, researchers have proposed watermarking, a process of embedding detectable signals within text. In principle, watermarking allows LLM-generated text to be detected accurately. However, recent findings suggest that these techniques often negatively affect the quality of the generated texts, and adversarial attacks can strip the watermarking signals, allowing the texts to evade detection. These findings have created resistance to the wide adoption of watermarking by LLM creators. To encourage adoption, we evaluate (i) the robustness of several watermarking techniques to adversarial attacks, comparing paraphrasing and back translation (i.e., English → another language → English) attacks, and (ii) their ability to preserve the quality and writing style of the unwatermarked texts, using linguistic metrics that capture both. Our results suggest that these watermarking techniques preserve semantics, deviate from the writing style of the unwatermarked texts, and are susceptible to adversarial attacks, especially the back translation attack.
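The back translation attack described in the abstract is a two-step round trip through a pivot language. The sketch below only illustrates that control flow; `back_translate`, `toy_translate`, and the word tables are hypothetical names introduced here, and the word-level toy translator stands in for whatever machine translation system an evaluation would actually use.

```python
from typing import Callable

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]

# Toy word tables, assumed purely for illustration.
EN_FR = {"big": "grand", "house": "maison"}
FR_EN = {"grand": "large", "maison": "house"}


def toy_translate(text: str, source: str, target: str) -> str:
    """Word-by-word stand-in for a real machine translation system."""
    table = EN_FR if (source, target) == ("en", "fr") else FR_EN
    return " ".join(table.get(word, word) for word in text.split())


def back_translate(text: str, translate: Translator,
                   pivot: str = "fr", source: str = "en") -> str:
    """Translate source -> pivot -> source, as in the back translation attack."""
    pivot_text = translate(text, source, pivot)
    return translate(pivot_text, pivot, source)
```

With the toy tables, "the big house" comes back as "the large house": the meaning survives, but a token has changed. Token substitutions of this kind are what can disturb a token-level watermark signal.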

[2] Refine Thought: A Test-Time Inference Method for Embedding Model Reasoning

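Guangzhi Wang, Kai Li, Yinghao Jiao, Zhi Liu

Main category: cs.CL

TL;DR: Proposes RT (Refine Thought), a method that strengthens the semantic reasoning ability of text embedding models through multiple forward passes. It improves performance significantly on semantic reasoning tasks while remaining stable on general-purpose semantic understanding tasks.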

Details
Motivation: Improve the performance of decoder-only text embedding models on semantic reasoning tasks by drawing out reasoning ability that was learned during pretraining but is not fully activated.
Method: Run multiple forward passes of the text embedding model at test time to progressively refine the semantic representation and obtain a deeper semantic understanding.
Result: Significant gains on semantic reasoning tasks in BRIGHT and PJBenchmark1, with stable performance on general-purpose semantic understanding tasks such as C-MTEB.
Conclusion: RT is an effective test-time inference method that further activates the semantic reasoning ability that decoder-only models acquire implicitly during pretraining.
Abstract: We propose RT (Refine Thought), a method that can enhance the semantic reasoning ability of text embedding models. The method obtains the final semantic representation by running multiple forward passes of the text embedding model. Experiments show that RT achieves significant improvements on semantic reasoning tasks in BRIGHT and the person-job matching benchmark PJBenchmark1, while maintaining consistent performance on general-purpose semantic understanding tasks such as C-MTEB. Our results indicate that RT is effective because it further activates the semantic reasoning ability learned during pretraining by decoder-only text embedding models (e.g., Qwen3-Embedding-8B). RT can be seen as a test-time inference method.

[3] Can QE-informed (Re)Translation lead to Error Correction?

Govardhan Padmanabhan

Main category: cs.CL

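TL;DR: Presents two training-free approaches to QE-informed segment-level error correction; the approach that selects the highest-quality translation from candidates generated by multiple LLMs performs best.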

Details
Motivation: Existing automatic post-editing (APE) systems overcorrect, which degrades performance, so the paper explores training-free approaches that reduce unnecessary edits.
Method: The first approach is QE-informed retranslation, which selects the highest-quality translation from candidates generated by multiple LLMs. The second resembles APE: an LLM replaces error substrings according to the QE explanations, with a conditional heuristic that minimises the number of edits to improve the Gain-to-Edit ratio.
Result: The two approaches obtain Delta COMET scores of 0.0201 and -0.0108, respectively; the first ranks first on the subtask leaderboard.
Conclusion: Training-free QE-informed retranslation improves translation quality and avoids overcorrection, clearly outperforming the correction-based APE-style approach.
Abstract: The paper presents two approaches submitted to the WMT 2025 Automated Translation Quality Evaluation Systems Task 3 - Quality Estimation (QE)-informed Segment-level Error Correction. While jointly training QE systems with Automatic Post-Editing (APE) has shown improved performance for both tasks, APE systems are still known to overcorrect the output of Machine Translation (MT), leading to a degradation in performance. We investigate a simple training-free approach, QE-informed Retranslation, and compare it with another approach within the same training-free paradigm. Our winning approach selects the highest-quality translation from multiple candidates generated by different LLMs. The second approach, more akin to APE, instructs an LLM to replace error substrings as specified in the provided QE explanation(s). A conditional heuristic was employed to minimise the number of edits, with the aim of maximising the Gain-to-Edit ratio. The two proposed approaches achieved Delta COMET scores of 0.0201 and -0.0108, respectively, placing the first approach at the top of the subtask leaderboard.
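The candidate-selection step of QE-informed retranslation can be sketched as an argmax over reference-free quality scores. This is a minimal illustration, not the submitted system: `select_best_translation` and `toy_qe_score` are hypothetical names, and the toy scorer (a length-mismatch penalty) is only a placeholder for a real QE model.

```python
from typing import Callable, Sequence


def select_best_translation(
    source: str,
    candidates: Sequence[str],
    qe_score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reference-free QE score."""
    if not candidates:
        raise ValueError("at least one candidate translation is required")
    # max() keeps the first candidate when several share the top score.
    return max(candidates, key=lambda hyp: qe_score(source, hyp))


def toy_qe_score(source: str, hypothesis: str) -> float:
    """Placeholder scorer (assumption): penalise word-count mismatch."""
    return -abs(len(source.split()) - len(hypothesis.split()))
```

Because the selector only chooses among complete translations and never rewrites one, it cannot introduce the overcorrections that the abstract attributes to APE-style editing.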

[4] What Works for 'Lost-in-the-Middle' in LLMs? A Study on GM-Extract and Mitigations

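Mihir Gupte, Eshan Dixit, Muhammad Tayyab, Arun Adiththan

Main category: cs.CL

TL;DR: Introduces GM-Extract, a benchmark dataset for evaluating how well LLMs retrieve control variables in multi-document contexts. It shows how the "lost-in-the-middle" phenomenon affects practical applications, analyzes model behavior with two metrics, and systematically evaluates black-box and white-box mitigation methods.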

Details
Motivation: LLMs use information less effectively in long contexts (the "lost-in-the-middle" phenomenon), which is a challenge for retrieval-based applications, so a benchmark grounded in a real-world setting is needed to evaluate and diagnose the problem systematically.
Method: Build the GM-Extract benchmark and define two evaluation metrics: a Document Metric (spatial retrieval capability) and a Variable Extraction Metric (semantic retrieval capability). Evaluate 7-8B parameter models on key-value extraction and question answering, then categorize existing mitigation methods and evaluate them empirically.
Result: Changing only how the data is represented in the context significantly affects retrieval performance. A U-shaped curve is not observed consistently, but clear performance patterns emerge across models and correlate with perplexity scores. Some mitigation strategies help in certain cases and hurt in others.
Conclusion: The "lost-in-the-middle" phenomenon has a significant effect on practical retrieval tasks. The effect of current mitigation methods is mixed and highly context-dependent, so strategies should be chosen carefully for the application at hand.
Abstract: The diminishing ability of large language models (LLMs) to effectively utilize long-range context (the "lost-in-the-middle" phenomenon) poses a significant challenge in retrieval-based LLM applications. To study the impact of this phenomenon in a real-world application setting, we introduce GM-Extract, a novel benchmark dataset meticulously designed to evaluate LLM performance on retrieval of control variables. To accurately diagnose failure modes, we propose a simple yet elegant evaluation system using two distinct metrics: one for spatial retrieval capability (Document Metric) and the other for semantic retrieval capability (Variable Extraction Metric). We conduct a systematic evaluation of 7-8B parameter models on two multi-document tasks (key-value extraction and question-answering), demonstrating a significant change in retrieval performance simply by altering how the data is represented in the context window. While a distinct U-shaped curve was not consistently observed, our analysis reveals a clear pattern of performance across models, which we further correlate with perplexity scores. Furthermore, we perform a literature survey of mitigation methods, which we categorize into two distinct approaches: black-box and white-box methods. We then apply these techniques to our benchmark, finding that their efficacy is highly nuanced. Our evaluation highlights scenarios where these strategies successfully improve performance, as well as surprising cases where they lead to a negative impact, providing a comprehensive understanding of their utility in a practical context.

[5] Hint-Augmented Re-ranking: Efficient Product Search using LLM-Based Query Decomposition

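Yilun Zhu, Nikhita Vedula, Shervin Malmasi

Main category: cs.CL

TL;DR: Proposes a framework that interprets superlative expressions in e-commerce search queries by extracting structured interpretations (hints). It uses LLMs to uncover latent intent and transfers this semantic understanding to lightweight models, improving search and ranking performance while addressing latency in practical deployment.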

Details
Motivation: Superlative queries (e.g., "best", "most popular") require comparison across multiple dimensions and depend on linguistic understanding and domain knowledge. Traditional methods handle such semantics poorly, so a solution is needed that captures latent intent accurately and fits practical retrieval systems.
Method: Decompose queries into attribute-value hints that an LLM generates concurrently with retrieval, and integrate them into the ranking pipeline. To avoid the high latency of direct LLM-based reranking, transfer the superlative interpretations to lightweight models.
Result: Improves over baselines by 10.9 points in MAP and 5.9 points in MRR, while substantially reducing inference latency and making practical deployment feasible.
Conclusion: Superlative semantics can be represented effectively with structured hints and transferred from large models to lightweight ones. The approach advances linguistic interpretation in retrieval systems while balancing performance and efficiency.
Abstract: Search queries with superlatives (e.g., best, most popular) require comparing candidates across multiple dimensions, demanding linguistic understanding and domain knowledge. We show that LLMs can uncover latent intent behind these expressions in e-commerce queries through a framework that extracts structured interpretations or hints. Our approach decomposes queries into attribute-value hints generated concurrently with retrieval, enabling efficient integration into the ranking pipeline. Our method improves search performance by 10.9 points in MAP and ranking by 5.9 points in MRR over baselines. Since direct LLM-based reranking faces prohibitive latency, we develop an efficient approach transferring superlative interpretations to lightweight models. Our findings provide insights into how superlative semantics can be represented and transferred between models, advancing linguistic interpretation in retrieval systems while addressing practical deployment constraints.
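For readers unfamiliar with the two reported metrics, the following sketch computes MAP and MRR from binary relevance lists (one list per query, in ranked order). It uses the standard textbook definitions and assumes every relevant item appears somewhere in the ranked list; the paper's exact evaluation setup is not given here, and the function names are illustrative.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Assumption: all relevant items are present in the list.
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant item, or 0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_metric(metric, ranked_lists: Sequence[Sequence[int]]) -> float:
    """Average a per-query metric over all queries (MAP or MRR)."""
    if not ranked_lists:
        return 0.0
    return sum(metric(r) for r in ranked_lists) / len(ranked_lists)
```

For example, `mean_metric(average_precision, lists)` gives MAP and `mean_metric(reciprocal_rank, lists)` gives MRR on a 0-1 scale, so a gain of 10.9 points corresponds to 0.109 on that scale.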

[6] Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports

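Chenchen Kuai, Zihao Li, Braden Rosen, Stephanie Paan, Navid Jafari, Jean-Louis Briaud, Yunlong Zhang, Youssef M. A. Hashash, Yang Zhou

Main category: cs.CL

TL;DR: Proposes MoRA-RAG, a knowledge-grounded LLM framework that extracts structured information from hazard reconnaissance reports to support multi-hazard reasoning. A Mixture-of-Retrieval mechanism, agentic chunking, and a verification loop substantially improve accuracy and reliability.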

Details
Motivation: Post-disaster reconnaissance reports contain key evidence of multi-hazard interactions, but their unstructured text makes systematic knowledge transfer difficult, and existing LLMs tend to hallucinate without domain grounding.
Method: MoRA-RAG combines a Mixture-of-Retrieval mechanism (dynamically routing queries to hazard-specific databases), agentic chunking (preserving contextual coherence), and a verification loop (assessing evidence sufficiency and launching additional retrieval). It is evaluated on HazardRecQA, a dataset built from GEER reports.
Result: MoRA-RAG reaches up to 94.5% accuracy on HazardRecQA, 30% higher than zero-shot LLMs and 10% higher than existing RAG systems. It substantially reduces hallucinations and brings open-weight LLMs close to proprietary models.
Conclusion: MoRA-RAG offers a new paradigm for turning post-disaster documentation into trustworthy, actionable intelligence for hazard resilience.
Abstract: Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.

[7] HiEAG: Evidence-Augmented Generation for Out-of-Context Misinformation Detection

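Junjie Wu, Yumeng Fu, Nan Yu, Guohong Fu

Main category: cs.CL

TL;DR: Proposes HiEAG, a hierarchical evidence-augmented generation framework that uses the knowledge of multimodal large language models to improve external consistency checking, yielding higher accuracy in detecting misinformation in image-text pairs.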

Details
Motivation: Existing methods focus too heavily on internal consistency and neglect the external consistency between image-text pairs and external evidence, which limits performance when detecting misleading content.
Method: Decompose external consistency checking into a comprehensive engine pipeline covering retrieval, reranking, and rewriting. Automatic Evidence Selection Prompting (AESP) handles evidence reranking and Automatic Evidence Generation Prompting (AEGP) handles evidence rewriting; instruction tuning improves task adaptation.
Result: Experiments on several benchmark datasets show that HiEAG surpasses previous state-of-the-art methods in overall accuracy and produces explanations for its judgments.
Conclusion: HiEAG strengthens external consistency verification in multimodal misinformation detection. Its hierarchical evidence-augmentation strategy improves performance and provides explanations for the model's decisions.
Abstract: Recent advancements in multimodal out-of-context (OOC) misinformation detection have made remarkable progress in checking the consistencies between different modalities for supporting or refuting image-text pairs. However, existing OOC misinformation detection methods tend to emphasize the role of internal consistency, ignoring the significance of external consistency between image-text pairs and external evidence. In this paper, we propose HiEAG, a novel Hierarchical Evidence-Augmented Generation framework to refine external consistency checking by leveraging the extensive knowledge of multimodal large language models (MLLMs). Our approach decomposes external consistency checking into a comprehensive engine pipeline, which integrates reranking and rewriting in addition to retrieval. The evidence reranking module uses Automatic Evidence Selection Prompting (AESP) to select the relevant evidence item from the results of evidence retrieval. Subsequently, the evidence rewriting module leverages Automatic Evidence Generation Prompting (AEGP) to improve task adaptation on MLLM-based OOC misinformation detectors. Furthermore, our approach provides explanations for its judgments and achieves impressive performance with instruction tuning. Experimental results on different benchmark datasets demonstrate that our proposed HiEAG surpasses previous state-of-the-art (SOTA) methods in accuracy over all samples.

[8] Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement

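Zijin Su, Huanzhu Lv, Yuren Niu, Yiming Liu

Main category: cs.CL

TL;DR: Proposes a multi-label sentiment classification method based on a balanced dataset and an improved model, which significantly improves classification performance.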

Details
Motivation: Existing datasets such as GoEmotions suffer from severe class imbalance, so models recognize minority emotions poorly, which hurts overall multi-label sentiment classification performance.
Method: Build a balanced multi-label sentiment dataset by combining the original GoEmotions data, Sentiment140 samples labeled with a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. The classifier combines pre-trained FastText embeddings, convolutional layers, a bidirectional LSTM, an attention mechanism, and a sigmoid output layer, and is trained with mixed precision for efficiency.
Result: Compared with models trained on imbalanced data, the proposed method improves significantly on accuracy, precision, recall, F1-score, and AUC.
Conclusion: A balanced dataset combined with an efficient multi-label classification model improves multi-label sentiment classification, confirming the importance of both data balancing and model architecture.
Abstract: Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Our data balancing strategy ensured an even distribution across 28 emotion categories. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach.

[9] Stealth Fine-Tuning: Efficiently Breaking Alignment in RVLMs Using Self-Generated CoT

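Le Yu, Zhengyue Zhao, Yawen Zheng, Yunhao Liu

Main category: cs.CL

TL;DR: Proposes a new attack, Stealth Fine-Tuning, which elicits harmful reasoning traces through segment-level interference and performs lightweight fine-tuning on the self-generated data. It bypasses the safety alignment of RVLMs with a high attack success rate at low resource cost.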

Details
Motivation: RVLMs rely on safety alignment to prevent harmful behavior, but their exposed chain-of-thought (CoT) traces open a new attack surface that existing defenses do not adequately cover.
Method: Stealth Fine-Tuning elicits harmful reasoning traces through segment-level interference, reuses the model's self-generated outputs as supervised fine-tuning data, and applies a turn-based weighted loss to obtain lightweight, distribution-consistent fine-tuning.
Result: With only 499 samples and under 3 hours on a single A100 (QLoRA), the method exceeds IDEATOR by 38.52% in attack success rate (ASR) while preserving the model's original reasoning ability and representation distribution.
Conclusion: Stealth Fine-Tuning is a low-cost, efficient, and stealthy attack that exposes how fragile current RVLM safety alignment is against manipulation of internal reasoning traces, and it calls for stronger defenses.
Abstract: Reasoning-augmented Vision-Language Models (RVLMs) rely on safety alignment to prevent harmful behavior, yet their exposed chain-of-thought (CoT) traces introduce new attack surfaces. In this work, we find that the safety alignment of RVLMs can be easily broken through a novel attack method termed Stealth Fine-Tuning. Our method elicits harmful reasoning traces through segment-level interference and reuses the self-generated outputs as supervised fine-tuning data. A turn-based weighted loss design yields a lightweight, distribution-consistent fine-tuning method. In our experiment, with only 499 samples and under 3 hours on a single A100 (QLoRA), Stealth Fine-Tuning outperforms IDEATOR by 38.52% ASR while preserving general reasoning ability, as the tuned model retains the original representation distribution. Experiments on AdvBench and several general benchmarks demonstrate that Stealth Fine-Tuning is a low-cost and highly effective way to bypass alignment defenses. Disclaimer: This paper contains content that may be disturbing or offensive.

[10] Synthetic Clinical Notes for Rare ICD Codes: A Data-Centric Framework for Long-Tail Medical Coding

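Truong Vo, Weiyi Wu, Kaize Ding

Main category: cs.CL

TL;DR: Proposes a data-centric framework that generates high-quality synthetic discharge summaries to mitigate the long-tail distribution in ICD coding, improving prediction of rare codes.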

Details
Motivation: Because diagnostic codes follow an extreme long-tail distribution, many rare and zero-shot ICD codes are severely underrepresented in existing datasets, leading to poor model performance and low macro-F1 in particular.
Method: Construct multi-label code sets from real-world co-occurrence patterns, ICD descriptions, synonyms, and taxonomy, and use them to generate 90,000 synthetic notes covering 7,902 ICD codes. Fine-tune the state-of-the-art models PLM-ICD and GKI-ICD on the resulting data.
Result: Macro-F1 improves modestly while micro-F1 stays strong, outperforming the previous state of the art and supporting the claim that synthetic data improves equity in long-tail code prediction.
Conclusion: Carefully designed synthetic data can mitigate the long-tail problem in ICD coding, improve prediction of rare codes, and make medical NLP systems more equitable.
Abstract: Automatic ICD coding from clinical text is a critical task in medical NLP but remains hindered by the extreme long-tail distribution of diagnostic codes. Thousands of rare and zero-shot ICD codes are severely underrepresented in datasets like MIMIC-III, leading to low macro-F1 scores. In this work, we propose a data-centric framework that generates high-quality synthetic discharge summaries to mitigate this imbalance. Our method constructs realistic multi-label code sets anchored on rare codes by leveraging real-world co-occurrence patterns, ICD descriptions, synonyms, taxonomy, and similar clinical notes. Using these structured prompts, we generate 90,000 synthetic notes covering 7,902 ICD codes, significantly expanding the training distribution. We fine-tune two state-of-the-art transformer-based models, PLM-ICD and GKI-ICD, on both the original and extended datasets. Experiments show that our approach modestly improves macro-F1 while maintaining strong micro-F1, outperforming prior SOTA. While the gain may seem marginal relative to the computational cost, our results demonstrate that carefully crafted synthetic data can enhance equity in long-tail ICD code prediction.

[11] From Graphs to Hypergraphs: Enhancing Aspect-Based Sentiment Analysis via Multi-Level Relational Modeling

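Omkar Mahesh Kashyap, Padegal Amit, Madhav Kashyap, Ashwini M Joshi, Shylaja SS

Main category: cs.CL

TL;DR: Proposes HyperABSA, a dynamic hypergraph framework for aspect-based sentiment analysis that models aspect-opinion structures through sample-specific hierarchical clustering and outperforms existing graph methods on several benchmarks.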

Details
Motivation: Existing graph methods model only pairwise dependencies and must build multiple graphs to capture different relational views. This causes redundancy, parameter overhead, and error propagation during fusion, and robustness suffers in short-text, low-resource settings.
Method: HyperABSA is a dynamic hypergraph framework that builds hyperedges through sample-specific hierarchical clustering and introduces an acceleration-fallback cutoff that adaptively determines the clustering granularity, allowing it to capture more complex semantic structure.
Result: On three benchmarks (Lap14, Rest14, and MAMS), HyperABSA consistently outperforms strong graph baselines, with especially large gains when paired with RoBERTa.
Conclusion: Dynamic hypergraph construction is an efficient and powerful alternative for ABSA and may extend to other short-text NLP tasks.
Abstract: Aspect-Based Sentiment Analysis (ABSA) predicts sentiment polarity for specific aspect terms, a task made difficult by conflicting sentiments across aspects and the sparse context of short texts. Prior graph-based approaches model only pairwise dependencies, forcing them to construct multiple graphs for different relational views. These introduce redundancy, parameter overhead, and error propagation during fusion, limiting robustness in short-text, low-resource settings. We present HyperABSA, a dynamic hypergraph framework that induces aspect-opinion structures through sample-specific hierarchical clustering. To construct these hyperedges, we introduce a novel acceleration-fallback cutoff for hierarchical clustering, which adaptively determines the level of granularity. Experiments on three benchmarks (Lap14, Rest14, MAMS) show consistent improvements over strong graph baselines, with substantial gains when paired with RoBERTa backbones. These results position dynamic hypergraph construction as an efficient, powerful alternative for ABSA, with potential extensions to other short-text NLP tasks.

[12] Applying Relation Extraction and Graph Matching to Answering Multiple Choice Questions

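Naoki Shimoda, Akihiro Yamamoto

Main category: cs.CL

TL;DR: Proposes a method that combines Transformer-based relation extraction with knowledge graph matching to answer multiple-choice questions in a traceable way.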

Details
Motivation: Traditional knowledge graphs are costly to build and static, and they cope poorly with factually incorrect input. The aim is to use dynamically generated knowledge graphs to improve the interpretability and accuracy of question answering.
Method: Use a Transformer model for relation extraction to convert each question sentence into a relational graph, then verify it against ground-truth knowledge graphs under the closed-world assumption to judge its truthfulness and determine the answer.
Result: The method answers up to around 70% of multiple-choice questions correctly while keeping the procedure traceable; the question category has a strong effect on accuracy.
Conclusion: Combining dynamic relation extraction with knowledge graph verification supports traceable question answering, particularly for fill-in-the-blank multiple-choice questions, although performance depends heavily on the question type.
Abstract: In this research, we combine Transformer-based relation extraction with matching of knowledge graphs (KGs) and apply them to answering multiple-choice questions (MCQs) while maintaining the traceability of the output process. KGs are structured representations of factual knowledge consisting of entities and relations. Due to their high construction cost, they have long been regarded as static databases with validated links. However, the recent development of Transformer-based relation extraction (RE) methods has made it possible to generate KGs dynamically from natural language texts, and has thereby opened the possibility of representing the meaning of input sentences with the created KGs. Building on this, we propose a method that answers MCQs in the "fill-in-the-blank" format, accounting for the fact that RE methods generate KGs that represent false information when given factually incorrect texts. We measure the truthfulness of each question sentence by (i) converting the sentence into a relational graph using an RE method and (ii) verifying it against factually correct KGs under the closed-world assumption. The experimental results demonstrate that our method correctly answers up to around 70% of the questions, while providing traceability of the procedure. We also highlight that the question category has a strong influence on the accuracy.
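Steps (i) and (ii) in the abstract can be illustrated with a small sketch. It assumes relation extraction has already turned each answer option's completed sentence into a set of (head, relation, tail) triples; the scoring rule (fraction of extracted triples found in the reference KG) and the names `truth_score` and `answer_mcq` are assumptions made for illustration, not the authors' matching procedure.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def truth_score(extracted: Set[Triple], reference_kg: Set[Triple]) -> float:
    """Fraction of extracted triples present in the reference KG.

    Under the closed-world assumption, a triple missing from the
    reference KG is treated as false.
    """
    if not extracted:
        return 0.0
    return len(extracted & reference_kg) / len(extracted)


def answer_mcq(option_triples: Dict[str, Set[Triple]],
               reference_kg: Set[Triple]) -> str:
    """Pick the option whose filled-in sentence is best supported by the KG."""
    return max(option_triples,
               key=lambda opt: truth_score(option_triples[opt], reference_kg))
```

The matched triples for the winning option double as the trace of why it was chosen, which is the kind of traceability the abstract emphasizes.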

[13] Selective Weak-to-Strong Generalization

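Hao Lang, Fei Huang, Yongbin Li

Main category: cs.CL

TL;DR: Proposes a selective weak-to-strong generalization (W2SG) framework that trains a classifier P(IK) to identify questions the strong model can answer and refines weak labels with a graph smoothing method, improving alignment when high-quality supervision is lacking.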

Details
Motivation: Existing weak-to-strong generalization methods always rely on weak supervision, which causes robustness problems because some weak labels are actually harmful to the model. A mechanism that uses weak supervision more selectively is needed.
Method: In the selective W2SG framework, a binary classifier P(IK) judges whether the strong model can answer a question. If it can, the model's self-generated label is used for alignment; otherwise weak supervision is used, with weak labels refined by a graph smoothing method.
Result: Experiments on three benchmarks show that the method consistently outperforms existing baselines; analysis shows that P(IK) generalizes across tasks and difficulty levels.
Conclusion: Using weak supervision selectively improves the robustness and performance of weak-to-strong generalization, and the generalization of P(IK) suggests the method can help superalignment research.
Abstract: Future superhuman models will surpass the ability of humans and humans will only be able to weakly supervise superhuman models. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) fine-tune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.
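The selection rule at the core of the framework can be sketched as a per-question switch between label sources. This is a minimal sketch under stated assumptions: `select_supervision`, the callables passed to it, and the 0.5 threshold are hypothetical, and the graph smoothing of weak labels is omitted.

```python
from typing import Callable, List, Sequence, Tuple


def select_supervision(
    questions: Sequence[str],
    p_ik: Callable[[str], float],        # estimated P("I know") per question
    strong_label: Callable[[str], str],  # strong model's self-generated label
    weak_label: Callable[[str], str],    # weak supervisor's label
    threshold: float = 0.5,              # assumed cutoff, not from the paper
) -> List[Tuple[str, str, str]]:
    """Choose a training label and record its source for each question."""
    selected = []
    for question in questions:
        if p_ik(question) >= threshold:
            # The strong model is judged able to answer: skip weak supervision.
            selected.append((question, strong_label(question), "self"))
        else:
            # Fall back to the weak label (refined further in the paper).
            selected.append((question, weak_label(question), "weak"))
    return selected
```

Recording the source alongside each label makes it easy to inspect how often weak supervision is actually used on a given benchmark.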

[14] SymLoc: Symbolic Localization of Hallucination across HaluEval and TruthfulQA

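Naveen Lamba, Sanju Tiwari, Manas Gaur

Main category: cs.CL

TL;DR: Proposes the first hallucination localization framework based on symbolic linguistic knowledge. It shows that when LLMs process symbolic triggers (such as negation and numbers), attention variance surges and semantic processing breaks down from the early layers onward, indicating that hallucination is at root a failure of symbolic linguistic processing and not a general generation problem.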

Details
Motivation: LLMs are prone to hallucination when faced with symbolic triggers, but existing methods do not systematically analyze how these symbols induce hallucination across model layers, and no localization mechanism builds on symbolic linguistic knowledge.
Method: A new symbolic localization framework uses symbolic linguistic and semantic knowledge to trace the development of hallucination across model layers. Five models are evaluated on HaluEval and TruthfulQA, and the relationship between attention variance and symbolic semantic processing is analyzed.
Result: Symbolic triggers, negation in particular, cause attention variance to explode in early layers (2-4), followed by steep attention drops that persist in deeper layers. Hallucination rates remain as high as 78.3%-83.7% even as model size grows.
Conclusion: The root cause of hallucination is a failure of symbolic semantic processing, not a problem with generation as a whole; symbolic linguistic knowledge offers a key perspective for understanding and localizing hallucination.
Abstract: LLMs still struggle with hallucination, especially when confronted with symbolic triggers like modifiers, negation, numbers, exceptions, and named entities. Yet, we lack a clear understanding of where these symbolic hallucinations originate, making it crucial to systematically handle such triggers and localize the emergence of hallucination inside the model. While prior work explored localization using statistical techniques like LSC and activation variance analysis, these methods treat all tokens equally and overlook the role symbolic linguistic knowledge plays in triggering hallucinations. So far, no approach has investigated how symbolic elements specifically drive hallucination failures across model layers, nor has symbolic linguistic knowledge been used as the foundation for a localization framework. We propose the first symbolic localization framework that leverages symbolic linguistic and semantic knowledge to meaningfully trace the development of hallucinations across all model layers. By focusing on how models process symbolic triggers, we analyze five models using HaluEval and TruthfulQA. Our symbolic knowledge approach reveals that attention variance for these linguistic elements explodes to critical instability in early layers (2-4), with negation triggering catastrophic variance levels, demonstrating that symbolic semantic processing breaks down from the very beginning. Through the lens of symbolic linguistic knowledge, despite larger model sizes, hallucination rates remain consistently high (78.3%-83.7% across Gemma variants), with steep attention drops for symbolic semantic triggers throughout deeper layers. Our findings demonstrate that hallucination is fundamentally a symbolic linguistic processing failure, not a general generation problem, revealing that symbolic semantic knowledge provides the key to understanding and localizing hallucination mechanisms in LLMs.

[15] Harnessing Deep LLM Participation for Robust Entity Linking

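Jiajun Hou, Chenyu Zhang, Rui Meng

Main category: cs.CL

TL;DR: Proposes DeepEL, a framework that integrates large language models (LLMs) into every stage of entity linking (EL) and adds a self-validation mechanism based on global context. It substantially improves EL performance, especially in out-of-domain settings.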

Details
Motivation: Existing methods apply LLMs only to isolated stages of entity linking, which leaves their potential underused, and they do not model the overall relationships among entities, which limits performance.
Method: DeepEL incorporates LLMs into the full entity linking pipeline and adds a new self-validation mechanism that uses global contextual information in the sentence to let the model correct its own predictions, improving joint disambiguation across entities.
Result: On ten benchmark datasets, DeepEL improves the average F1 score by 2.6% over existing state-of-the-art methods, with a 4% gain on out-of-domain datasets.
Conclusion: Deep integration of LLMs together with self-validation improves overall entity linking performance, particularly in out-of-domain settings.
Abstract: Entity Linking (EL), the task of mapping textual entity mentions to their corresponding entries in knowledge bases, constitutes a fundamental component of natural language understanding. Recent advancements in Large Language Models (LLMs) have demonstrated remarkable potential for enhancing EL performance. Prior research has leveraged LLMs to improve entity disambiguation and input representation, yielding significant gains in accuracy and robustness. However, these approaches typically apply LLMs to isolated stages of the EL task, failing to fully integrate their capabilities throughout the entire process. In this work, we introduce DeepEL, a comprehensive framework that incorporates LLMs into every stage of the entity linking task. Furthermore, we identify that disambiguating entities in isolation is insufficient for optimal performance. To address this limitation, we propose a novel self-validation mechanism that utilizes global contextual information, enabling LLMs to rectify their own predictions and better recognize cohesive relationships among entities within the same sentence. Extensive empirical evaluation across ten benchmark datasets demonstrates that DeepEL substantially outperforms existing state-of-the-art methods, achieving an average improvement of 2.6% in overall F1 score and a remarkable 4% gain on out-of-domain datasets. These results underscore the efficacy of deep LLM integration in advancing the state-of-the-art in entity linking.

[16] ArbESC+: Arabic Enhanced Edit Selection System Combination for Grammatical Error Correction: Resolving Conflict and Improving System Combination in Arabic GEC

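Ahlam Alrehili, Areej Alhothali

Main category: cs.CL

TL;DR: Proposes ArbESC+, a multi-system combination approach for Arabic grammatical error correction that gathers correction proposals from several models and uses a classifier to select the best edits, significantly improving correction performance.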

Details
Motivation: The complex morphology and syntax of Arabic make grammatical error correction especially challenging, yet most existing work relies on a single model and does not exploit the potential of combining systems.
Method: Several models, including AraT5, ByT5, mT5, and AraBART, generate correction proposals. The proposals are converted into numerical features, a classifier selects among them, and support techniques filter overlapping corrections and estimate decision reliability.
Result: F0.5 reaches 82.63% on the QALB-14 test set, 84.64% on QALB-15 L1, and 65.55% on QALB-15 L2, better than any single model.
Conclusion: ArbESC+ is one of the first multi-system frameworks for Arabic grammatical error correction. It confirms the value of system combination and offers a practical direction for Arabic text processing tools.
Abstract: Grammatical Error Correction (GEC) is an important aspect of natural language processing. Arabic has a complicated morphological and syntactic structure, posing a greater challenge than other languages. Even though modern neural models have improved greatly in recent years, the majority of previous attempts used individual models without taking into account the potential benefits of combining different systems. In this paper, we present one of the first multi-system approaches for correcting grammatical errors in Arabic, the Arabic Enhanced Edit Selection System Combination (ArbESC+). Several models are used to collect correction proposals, which are represented as numerical features in the framework. A classifier determines and implements the appropriate corrections based on these features. To improve output quality, the framework uses support techniques to filter overlapping corrections and estimate decision reliability. A combination of AraT5, ByT5, mT5, AraBART, AraBART+Morph+GEC, and text-editing systems gave better results than a single model alone, with F0.5 at 82.63% on QALB-14 test data, 84.64% on QALB-15 L1 data, and 65.55% on QALB-15 L2 data. A key contribution of this work is that it is one of the first attempts to combine multiple systems for Arabic grammatical error correction. Improving existing models provides a practical step towards developing advanced tools that will benefit users and researchers of Arabic text processing.
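The F0.5 scores reported above weight precision more heavily than recall, which suits GEC because a wrong edit is usually worse than a missed one. The sketch below is the standard F-beta formula, F_β = (1 + β²)·P·R / (β²·P + R), with β = 0.5; the function name is illustrative and the example values are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

For instance, precision 0.8 with recall 0.4 gives an F0.5 of about 0.667, whereas the balanced F1 for the same pair is about 0.533, showing how the metric rewards a conservative, high-precision corrector.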

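Kai Tian, Yirong Mao, Wendong Bi, Hanjie Wang, Que Wenhui

Main category: cs.CL

TL;DR: Proposes a framework for building large language models for the music domain. It constructs a high-quality 40-billion-token music corpus with a domain-first data pipeline and introduces reference-model-based soft scoring to optimize the training objective, improving performance and factual accuracy in the music-entertainment domain.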

Details
Motivation: Large language models perform well on general tasks but are limited in specialized domains such as music. The main problems are insufficient corpus scale, low data purity, and a mismatch between training objectives and domain needs, so a dedicated music-domain corpus and training method are required.
Method: 1) Build a 40B-token music-related corpus from open source and in-house data; 2) design a domain-first data pipeline in which a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy masking; 3) merge multi-source music text with metadata to strengthen the knowledge structure; 4) introduce reference-model-based token-level soft scoring during training, using a unified loss-ratio criterion for data selection and dynamic down-weighting.
Result: Enables more effective music-domain continued pretraining and alignment by reducing noisy gradients and amplifying task-aligned signals. Introduces the MusicSimpleQA benchmark for evaluating factuality with automated scoring, and reports systematic experiments on how data composition affects performance.
Conclusion: The work offers a scalable data-training framework and a reusable evaluation tool for music-domain LLMs, and confirms the importance of a high-quality domain corpus and a matching training objective.
Abstract: Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large, music-related natural language corpus (40B tokens) that combines open source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection and for dynamic down-weighting during optimization, reducing noise gradients and amplifying task-aligned signals, thereby enabling more effective music-domain continued pretraining and alignment. To assess factuality, we design the MusicSimpleQA benchmark, which adopts short, single-answer prompts with automated agreement scoring. Beyond the benchmark design, we conduct systematic comparisons along the axes of data composition. Overall, this work advances both the right corpus and the right objective, offering a scalable data-training framework and a reusable evaluation tool for building domain LLMs in the music field.

[18] Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning

Rui Liu,Yuan Zhao,Zhenqi Jia

Main category: cs.CL

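TL;DR: This paper proposes Authentic-Dubber, a retrieval-augmented director-actor interaction learning framework for more authentic automatic movie dubbing. By building a multimodal reference footage library, an emotion-similarity-based retrieval-augmentation strategy, and a progressive graph-based speech generation model, it simulates the director-actor collaboration of real dubbing and improves emotional expressiveness.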

Details Motivation: Existing automatic movie dubbing methods ignore the dynamic director-actor interaction of real dubbing, especially emotional guidance, so the resulting speech lacks emotional expressiveness. A method that simulates the real dubbing workflow is needed to improve quality. Method: (1) Build a multimodal reference footage library and use large language models (LLMs) to deeply understand emotional representations across multimodal signals; (2) propose an emotion-similarity-based retrieval-augmentation strategy that retrieves the information most relevant to the target silent video; (3) design a progressive graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge to produce the final speech. Result: Subjective and objective evaluations on the V2C Animation benchmark show that Authentic-Dubber outperforms existing methods in emotional expression and overall dubbing quality, effectively simulating the real dubbing workflow. Conclusion: By simulating the real director-actor workflow, Authentic-Dubber improves the emotional expressiveness and authenticity of automatic movie dubbing and suggests a new design direction for future dubbing systems. Abstract: The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker's timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor's final dubbing process.
The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.

[19] AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR

Gabrial Zencha Ashungafac,Mardhiyah Sanni,Busayo Awobade,Alex Gichamba,Tobi Olatunji

Main category: cs.CL

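TL;DR: AfriSpeech-MultiBench is the first speech recognition evaluation suite covering over 100 African English accents and multiple application domains. It reveals how existing models differ across accents, noise conditions, and domain-specific tasks, highlights the hallucination problem, and promotes more inclusive speech technology.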

Details Motivation: The lack of contextualized, application-specific speech model evaluation that reflects Africa's linguistic diversity limits the practical use of speech technology in the region. Method: Build the AfriSpeech-MultiBench evaluation suite, covering 100+ African English accents across 10+ countries and seven application domains, and benchmark open, closed, unimodal ASR, and multimodal LLM systems on spontaneous and non-spontaneous speech. Result: Open-source ASR does well on spontaneous speech but degrades on noisy, non-native dialogue; multimodal LLMs are more accent-robust but struggle with domain-specific named entities; proprietary models are accurate on clean speech but unstable across countries and domains; fine-tuned models are accurate with lower latency, yet hallucinations remain widespread. Conclusion: AfriSpeech-MultiBench provides an important benchmark for selecting speech technology in African settings and supports inclusive voice applications for underserved communities. Abstract: Recent advances in speech-enabled AI, including Google's NotebookLM and OpenAI's speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa's linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General dialogue, Call Center, Named Entities and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous speech conversations drawn from various open African-accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment, but hallucinations remain a big problem for most SOTA models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use-cases, fostering inclusive voice applications for underserved communities.

[20] Entropy-Guided Reasoning Compression

Hourun Zhu,Yang Gao,Wenlong Fei,Jiawei Li,Huashan Sun

Main category: cs.CL

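TL;DR: Proposes an entropy-guided training framework that resolves the entropy conflict arising when compressing large reasoning models, substantially shortening reasoning chains while maintaining or even improving accuracy.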

Details Motivation: Existing compression methods for reasoning models overlook the entropy conflict during training: the compression objective lowers entropy to shorten reasoning chains, while the accuracy objective raises it, leaving the model stuck in a local dilemma. The conflict stems from opposing gradient pressure on logical connectors. Method: Adopt an entropy-guided training framework that encourages concise reasoning steps when entropy falls and reinforces exploration when entropy rises, improving robustness under the compact reasoning mode and easing the conflict. Result: Experiments on six mathematical reasoning benchmarks show that the method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Conclusion: Entropy-guided training effectively reconciles compression and accuracy, offering a practical route to efficient deployment of large reasoning models. Abstract: Large reasoning models have demonstrated remarkable performance on complex reasoning tasks, yet the excessive length of their chain-of-thought outputs remains a major practical bottleneck due to high computation cost and poor deployability. Existing compression methods have achieved partial success but overlook a crucial phenomenon in the training process: the entropy conflict. During compression training, entropy decreases, leading to shorter reasoning but limited exploration, while accuracy-oriented objectives increase entropy, lengthening reasoning chains. This can cause the model to get stuck in a local dilemma. Our analysis further reveals the origin of the entropy conflict: many high-entropy tokens are logical connectors that receive larger gradients and are encouraged under the performance objective, while the compression objective simultaneously penalizes these potentially redundant connectors. This opposing pressure creates a direct source of entropy conflict. To address these issues, we adopt an entropy-guided training framework. As entropy descends, the model is guided toward efficient reasoning by encouraging concise thought steps; as entropy rises, exploration is reinforced under the compact reasoning mode to improve robustness. Experiments on six mathematical benchmarks show that our method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Code and models will be released publicly.
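
The quantity at the center of this abstract is the entropy of the model's next-token distribution. As a minimal sketch, assuming each generation step is available as a list of probabilities, Shannon entropy can be computed per step and used to flag high-entropy positions. The helper names and the threshold are hypothetical, and this is not the paper's training procedure.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_distributions, threshold):
    """Indices of generation steps whose entropy exceeds a chosen threshold."""
    return [
        i for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]


steps = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
print(high_entropy_positions(steps, threshold=0.5))
```

In the abstract's account, positions flagged this way would often be logical connectors, which is where the compression and accuracy objectives pull in opposite directions.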

[21] Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space

Ante Wang,Weizhi Ma,Yang Liu

Main category: cs.CL

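TL;DR: This paper proposes predicting a verbalized probability distribution to deepen the reasoning large language models perform during confidence estimation. The method shows an advantage across models and tasks, and its reasoning patterns align with human expectations.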

Details Motivation: The work studies how reasoning strategies can improve an LLM's confidence estimates for its own answers, particularly when combined with chain-of-thought, since the effect of different strategies remains unclear. Method: Have the LLM predict a full verbalized probability distribution rather than a confidence score for a single guess, prompting it to consider every candidate answer and assign confidence scores consistently. Result: The method outperforms existing approaches across multiple models and tasks, keeps its advantage when the answer space is unknown and after reinforcement learning, and shows reasoning patterns that better match human intuition. Conclusion: Predicting a verbalized probability distribution is an effective way to encourage in-depth reasoning for confidence estimation, with broad applicability and interpretability. Abstract: Knowing the reliability of a model's response is essential in application. With the strong generation capabilities of LLMs, research has focused on generating verbalized confidence. This is further enhanced by combining chain-of-thought reasoning, which provides logical and transparent estimation. However, how reasoning strategies affect the estimated confidence is still under-explored. In this work, we demonstrate that predicting a verbalized probability distribution can effectively encourage in-depth reasoning for confidence estimation. Intuitively, it requires an LLM to consider all candidates within the answer space instead of relying on a single guess, and to carefully assign confidence scores to meet the requirements of a distribution. This method shows an advantage across different models and various tasks, regardless of whether the answer space is known. Its advantage is maintained even after reinforcement learning, and further analysis shows its reasoning patterns are aligned with human expectations.
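
To make the "requirements of a distribution" concrete, here is a small sketch of the post-processing one might apply once a model has verbalized a probability for each candidate answer. It assumes the model output has already been parsed into a dictionary; the function names and the renormalization step are illustrative assumptions, not something the abstract specifies.

```python
def normalize_distribution(verbalized):
    """Rescale verbalized scores so they are non-negative and sum to 1."""
    if any(p < 0 for p in verbalized.values()):
        raise ValueError("probabilities must be non-negative")
    total = sum(verbalized.values())
    if total <= 0:
        raise ValueError("at least one candidate needs positive mass")
    return {answer: p / total for answer, p in verbalized.items()}


def answer_with_confidence(verbalized):
    """Return the top candidate and its normalized probability as confidence."""
    dist = normalize_distribution(verbalized)
    best = max(dist, key=dist.get)
    return best, dist[best]


# Hypothetical parsed output whose scores do not quite sum to 1.
print(answer_with_confidence({"Paris": 0.6, "Lyon": 0.3, "Nice": 0.3}))
```

Because confidence is read off a distribution over all candidates, a model that spreads mass across plausible alternatives reports lower confidence than one scoring a single guess in isolation.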

[22] AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Mohammad Zbib,Hasan Abed Al Kader Hammoud,Sina Mukalled,Nadine Rizk,Fatima Karnib,Issam Lakkis,Ammar Mohanna,Bernard Ghanem

Main category: cs.CL

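TL;DR: AraLingBench is a new, fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models across five categories: grammar, morphology, spelling, reading comprehension, and syntax.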

Details Motivation: Current large language models score well on knowledge-based benchmarks but fall short on deeper language understanding, and the structural linguistic competence of models in Arabic in particular has not been systematically evaluated. Method: Build AraLingBench, a fully human-annotated benchmark of 150 expert-designed multiple-choice questions covering five core linguistic categories, and evaluate 35 Arabic and bilingual LLMs on it. Result: Current models do well on surface-level language tasks but show clear weaknesses in deeper grammatical and syntactic reasoning, exposing a gap between memorization or pattern recognition and genuine language understanding. Conclusion: AraLingBench provides an effective diagnostic tool for evaluating and improving the linguistic abilities of Arabic LLMs and underscores the importance of developing genuine language understanding. Abstract: We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.

[23] ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

Xingwei He,Qianru Zhang,Pengfei Chen,Guanhua Chen,Linlin Yu,Yuan Yuan,Siu-Ming Yiu

Main category: cs.CL

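TL;DR: This paper introduces ConInstruct, a benchmark for evaluating how well large language models detect and resolve conflicting constraints in user instructions. Most proprietary models detect conflicts well but rarely notify the user or ask for clarification, exposing a key weakness in current instruction following.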

Details Motivation: Prior work mostly studies how well LLMs follow instructions and overlooks the instruction conflicts common in complex prompts, so model behavior in such scenarios is poorly understood. Method: Build the ConInstruct benchmark dataset, systematically evaluate multiple LLMs on conflict detection and conflict resolution, and analyze their behavior patterns. Result: Most proprietary LLMs detect conflicts well, with DeepSeek-R1 and Claude-4.5-Sonnet ranking first and second at F1-scores of 91.5% and 87.3%; however, after detecting a conflict, models rarely notify the user or request clarification. Conclusion: Current LLMs have a clear weakness in handling instruction conflicts: some detect conflicts effectively but lack transparent communication, so future work should strengthen explainability and interaction in conflict situations. Abstract: Instruction-following is a critical capability of Large Language Models (LLMs). While existing works primarily focus on assessing how well LLMs adhere to user instructions, they often overlook scenarios where instructions contain conflicting constraints, a common occurrence in complex prompts. The behavior of LLMs under such conditions remains under-explored. To bridge this gap, we introduce ConInstruct, a benchmark specifically designed to assess LLMs' ability to detect and resolve conflicts within user instructions. Using this dataset, we evaluate LLMs' conflict detection performance and analyze their conflict resolution behavior. Our experiments reveal two key findings: (1) Most proprietary LLMs exhibit strong conflict detection capabilities, whereas among open-source models, only DeepSeek-R1 demonstrates similarly strong performance. DeepSeek-R1 and Claude-4.5-Sonnet achieve the highest average F1-scores at 91.5% and 87.3%, respectively, ranking first and second overall. (2) Despite their strong conflict detection abilities, LLMs rarely explicitly notify users about the conflicts or request clarification when faced with conflicting constraints. These results underscore a critical shortcoming in current LLMs and highlight an important area for future improvement when designing instruction-following LLMs.

[24] The Tokenization Bottleneck: How Vocabulary Extension Improves Chemistry Representation Learning in Pretrained Language Models

Prathamesh Kalamkar,Ned Letcher,Meissane Chami,Sahger Lad,Shayan Mohanty,Prasanna Pendse

Main category: cs.CL

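TL;DR: Proposes extending a pretrained large language model's vocabulary and continuing pretraining on chemistry-domain text to resolve the tokenization bottleneck in chemical representation.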

Details Motivation: Applying large language models to chemistry is often limited because general-purpose tokenizers split chemical structures such as SMILES into meaningless sub-tokens. Method: Extend a pretrained LLM's vocabulary with chemistry-relevant tokens, then continue pretraining on chemistry-domain text to integrate the new knowledge. Result: The method performs better on multiple downstream chemistry tasks and effectively eases the tokenization bottleneck. Conclusion: Unifying the representation of natural language and molecular structures improves the applicability and performance of LLMs in chemistry. Abstract: The application of large language models (LLMs) to chemistry is frequently hampered by a "tokenization bottleneck", where tokenizers tuned on general-domain text tend to fragment chemical representations such as SMILES into semantically uninformative sub-tokens. This paper introduces a principled methodology to resolve this bottleneck by unifying the representation of natural language and molecular structures within a single model. Our approach involves targeted vocabulary extension, augmenting a pretrained LLM's vocabulary with chemically salient tokens, followed by continued pretraining on chemistry-domain text to integrate this new knowledge. We provide an empirical demonstration of the effectiveness of this strategy, showing that our methodology leads to superior performance on a range of downstream chemical tasks.

[25] ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

Hongwei Liu,Junnan Liu,Shudong Liu,Haodong Duan,Yuqiang Li,Mao Su,Xiaohong Liu,Guangtao Zhai,Xinyu Fang,Qianhong Ma,Taolin Zhang,Zihan Ma,Yufeng Zhao,Peiheng Zhou,Linchen Xiao,Wenlong Zhang,Shijie Zhou,Xingjian Ma,Siqi Sun,Jiaye Ge,Meng Li,Yuhong Liu,Jianxin Dong,Jiaying Li,Hui Wu,Hanwen Liang,Jintai Lin,Yanting Wang,Jie Dong,Tong Zhu,Tianfan Fu,Conghui He,Qi Zhang,Songyang Zhang,Lei Bai,Kai Chen

Main category: cs.CL

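TL;DR: ATLAS is an AGI-oriented scientific reasoning benchmark of approximately 800 original, high-difficulty, cross-disciplinary problems spanning seven scientific fields. It emphasizes contamination resistance, complex reasoning, and multi-domain knowledge integration, with expert review and a panel of LLM judges for reliable evaluation.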

Details Motivation: Existing LLM benchmarks fall short in distinguishing frontier models, covering multiple disciplines, and resisting data contamination, so they do not faithfully reflect scientific reasoning ability. Method: PhD-level domain experts write approximately 800 original problems across seven scientific fields; a multi-stage process of expert review and adversarial testing ensures quality, and a panel of LLM judges scores complex answers automatically. Result: ATLAS effectively differentiates the scientific reasoning ability of current leading LLMs, providing preliminary validation as a high-fidelity, contamination-resistant, cross-disciplinary evaluation platform. Conclusion: ATLAS offers a reliable, open, long-term platform for assessing advanced scientific reasoning in LLMs and supports progress toward AGI. Abstract: The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers.
Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.

[26] Mitigating Label Length Bias in Large Language Models

Mario Sanz-Guerrero,Katharina von der Wense

Main category: cs.CL

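TL;DR: Proposes normalized contextual calibration (NCC), a method that mitigates label length bias in large language models when classifying over multi-token labels, with significant gains in performance and robustness.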

Details Motivation: Large language models exhibit label bias when predicting over candidate options, in particular a length bias introduced by multi-token labels, and existing calibration methods do not address it effectively. Method: Propose normalized contextual calibration (NCC), which normalizes and calibrates predictions at the full-label level to reduce the bias caused by inconsistent label lengths. Result: NCC significantly outperforms prior approaches across multiple datasets and models, with gains of up to 10% F1, and extends to broader tasks such as multiple-choice question answering; it is also less sensitive to few-shot example selection, needs fewer examples, and gives more reliable confidence estimates. Conclusion: Mitigating bias at the full-label level is essential for the performance and robustness of LLMs, especially in real-world applications where class labels are often multi-word, and NCC offers an effective solution. Abstract: Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
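
The abstract says NCC normalizes and calibrates at the full-label level but gives no formula, so the sketch below is only one plausible reading, not the authors' method. It assumes per-token log-probabilities are available for each label, uses the geometric mean of token probabilities as the length normalization, and divides by the same quantity measured on a content-free input as the calibration step. All names here are hypothetical.

```python
import math


def label_score(token_logprobs):
    """Length-normalized label probability: geometric mean over its tokens."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def calibrated_prediction(label_logprobs, prior_logprobs):
    """Divide each label's normalized score by its content-free prior score."""
    raw = {
        label: label_score(lps) / label_score(prior_logprobs[label])
        for label, lps in label_logprobs.items()
    }
    total = sum(raw.values())
    probs = {label: value / total for label, value in raw.items()}
    return max(probs, key=probs.get), probs


# Hypothetical token log-probabilities for a one-token and a two-token label.
on_input = {"positive": [-0.2], "very negative": [-0.3, -0.1]}
on_content_free = {"positive": [-0.1], "very negative": [-1.0, -1.0]}
print(calibrated_prediction(on_input, on_content_free))
```

In this toy case both labels have the same normalized score on the input, and the calibration step breaks the tie in favor of the label the model is biased against on the content-free prompt.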

[27] Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Xin Yi,Yue Li,Dongsheng Shi,Linlin Wang,Xiaoling Wang,Liang He

Main category: cs.CL

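TL;DR: This paper presents EduHarm, a benchmark for the safety of large language models in educational scenarios, and TSSF, a three-stage defense framework that effectively resists jailbreak and fine-tuning attacks while preserving model utility.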

Details Motivation: Existing studies focus mainly on general safety evaluation and pay little attention to the specific safety requirements of education, so safety evaluation and protection tailored to educational LLM applications are needed. Method: Build EduHarm, a benchmark of safe-unsafe instruction pairs across five representative educational scenarios, and propose the three-stage shield framework TSSF, comprising safety-aware attention realignment, layer-wise safety judgment, and defense-driven dual routing. Result: Experiments across eight jailbreak attack strategies show that TSSF strengthens safety while avoiding over-refusal of benign requests; on three fine-tuning attack datasets it also defends robustly while preserving the gains from benign fine-tuning. Conclusion: TSSF systematically improves LLM safety in educational scenarios, is robust against jailbreak and fine-tuning attacks, and does not impair normal model function. Abstract: Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Second, layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Finally, defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TSSF effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.

[28] MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents

Jinru Ding,Lu Lu,Chao Ding,Mouxiao Bian,Jiayuan Chen,Renjie Lu,Wenrao Pang,Xiaoqin Wu,Zhiqiang Liu,Luyi Jiang,Bing Han,Yunqiu Wang,Jie Xu

Main category: cs.CL

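TL;DR: MedBench v4 is a broad clinical benchmarking platform that evaluates 15 frontier models on medical AI tasks. It reveals weaknesses of base models in multimodal reasoning and safety, while agent-based architectures markedly improve clinical readiness.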

Details Motivation: The development of medical large language models, multimodal models, and agents calls for evaluation frameworks that reflect real clinical workflows and safety constraints. Method: Build MedBench v4, a nationwide cloud-based benchmarking platform with over 700,000 tasks spanning 24 primary and 91 secondary specialties, reviewed in multiple rounds by clinical experts, with open-ended responses scored by a calibrated LLM judge. Result: Base LLMs average 54.1/100 overall, with low safety and ethics scores (18.4/100); multimodal models perform worse overall (mean 47.5) and are weak at cross-modal reasoning; agent architectures raise the mean to 79.8, reaching up to 85.3/100 overall and 88.9/100 on safety tasks. Conclusion: MedBench v4 exposes gaps in safety and multimodal reasoning in current medical AI, shows that governance-aware agent architectures can improve clinical usability, and gives hospitals, developers, and policymakers a practical evaluation tool aligned with Chinese clinical guidelines and regulatory needs. Abstract: Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.

[29] Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning

Trishala Jayesh Ahalpara

Main category: cs.CL

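TL;DR: This paper presents Tell Me, a mental well-being support system that combines large language models with retrieval-augmented generation, synthetic dialogue generation, and AI agent collaboration to provide personalized, context-aware conversational support and self-care plans. It aims to lower barriers to support and foster interdisciplinary collaboration between NLP and mental health.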

Details Motivation: Professional therapy resources are scarce, confidentiality of therapeutic data restricts research, and existing well-being tools are static and impersonal. The author therefore uses large language models to build an accessible, scalable, context-aware mental well-being support system. Method: The system has three core components: a personalized dialogue assistant based on retrieval-augmented generation (RAG); a synthetic client-therapist dialogue generator conditioned on client profiles; and a Well-being AI crew, driven by CrewAI, that produces weekly self-care plans and guided meditation audio. The architecture combines knowledge retrieval, synthetic data generation, and a multi-agent workflow. Result: In curated well-being scenarios, automatic LLM-based evaluation and a human-user study were used to assess the RAG assistant; the system generates profile-conditioned synthetic therapy dialogues for data augmentation and research; and the agent crew delivers dynamically adaptive, personalized self-care planning that addresses the limits of static tools. Conclusion: Tell Me shows the potential of conversational AI in mental well-being support, serving both as a reflective space for individuals and as a research aid, and highlights the importance of interdisciplinary collaboration between NLP and mental health for responsible AI innovation. Abstract: We present Tell Me, a mental well-being system that leverages advances in large language models to provide accessible, context-aware support for users and researchers. The system integrates three components: (i) a retrieval-augmented generation (RAG) assistant for personalized, knowledge-grounded dialogue; (ii) a synthetic client-therapist dialogue generator conditioned on client profiles to facilitate research on therapeutic language and data augmentation; and (iii) a Well-being AI crew, implemented with CrewAI, that produces weekly self-care plans and guided meditation audio. The system is designed as a reflective space for emotional processing rather than a substitute for professional therapy. It illustrates how conversational assistants can lower barriers to support, complement existing care, and broaden access to mental health resources. To address the shortage of confidential therapeutic data, we introduce synthetic client-therapist dialogue generation conditioned on client profiles. Finally, the planner demonstrates an innovative agentic workflow for dynamically adaptive, personalized self-care, bridging the limitations of static well-being tools. We describe the architecture, demonstrate its functionalities, and report evaluation of the RAG assistant in curated well-being scenarios using both automatic LLM-based judgments and a human-user study.
This work highlights opportunities for interdisciplinary collaboration between NLP researchers and mental health professionals to advance responsible innovation in human-AI interaction for well-being.

[30] Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

Mingyue Cheng,Jie Ouyang,Shuo Yu,Ruiran Yan,Yucong Luo,Zirui Liu,Daoyu Wang,Qi Liu,Enhong Chen

Main category: cs.CL

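TL;DR: This paper examines how to apply reinforcement learning (RL) to large language model (LLM) agents, proposes an extended Markov Decision Process framework, and introduces Agent-R1, a modular and flexible training framework, with experiments on multi-hop QA tasks providing initial validation.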

Details Motivation: Although reinforcement learning holds promise for training LLM agents, the field lacks RL methods tailored to LLM agents and flexible training frameworks, which limits progress. Method: Systematically extend the Markov Decision Process (MDP) framework to define the key components of an LLM agent, and develop Agent-R1, a modular and extensible RL training framework. Result: Experiments on multi-hop QA benchmark tasks provide initial evidence that the proposed framework and methods are effective and can support RL training across different task scenarios and interactive environments. Conclusion: The paper gives RL for LLM agents a clearer methodological foundation and, through Agent-R1, moves the field toward more flexible and extensible tooling. Abstract: Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning (RL) is considered a key technology with significant potential for training such Agents; however, the effective application of RL to LLM Agents is still in its nascent stages and faces considerable challenges. Currently, this emerging field lacks in-depth exploration into RL approaches specifically tailored for the LLM Agent context, alongside a scarcity of flexible and easily extensible training frameworks designed for this purpose. To help advance this area, this paper first revisits and clarifies Reinforcement Learning methodologies for LLM Agents by systematically extending the Markov Decision Process (MDP) framework to comprehensively define the key components of an LLM Agent. Secondly, we introduce Agent-R1, a modular, flexible, and user-friendly training framework for RL-based LLM Agents, designed for straightforward adaptation across diverse task scenarios and interactive environments. We conducted experiments on Multihop QA benchmark tasks, providing initial validation for the effectiveness of our proposed methods and framework.

[31] LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation

David Carmel,Simone Filice,Guy Horowitz,Yoelle Maarek,Alex Shtoff,Oren Somekh,Ran Tavory

Main category: cs.CL

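TL;DR: This paper introduces the LiveRAG benchmark, a public dataset of 895 synthetic questions and answers for systematically evaluating retrieval-augmented generation (RAG) Q&A systems. Derived from the SIGIR'2025 LiveRAG Challenge, it adds ground-truth answers, supporting claims, and per-question difficulty and discriminability scores, helping to assess and differentiate system performance.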

Details Motivation: As retrieval-augmented generation (RAG) grows in importance for generative AI, a systematic way to evaluate its effectiveness is needed. Existing evaluation lacks standardization and coverage, so a public, reproducible benchmark supporting fine-grained analysis is required. Method: Build the LiveRAG benchmark from the dataset used in the SIGIR'2025 LiveRAG Challenge, with synthetically generated questions and answers, and add information not provided during the Challenge, such as ground-truth answers and their supporting claims. An Item Response Theory model fitted to competitors' responses estimates each question's difficulty and discriminability. Result: The benchmark shows diverse question types, a wide range of difficulty, and good ability to differentiate systems, making it suitable for systematic evaluation. Conclusion: LiveRAG gives the community a public, systematic, analysis-rich tool for evaluating RAG-based Q&A systems and should help advance RAG research and more robust Q&A systems. Abstract: With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
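
The abstract does not say which Item Response Theory model was fitted. As an illustration of how difficulty and discriminability enter such a model, here is the standard two-parameter logistic (2PL) form, chosen as an assumption. The function name `p_correct` and the example values are hypothetical.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """2PL IRT: probability that a system with this ability answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# A system whose ability equals the question's difficulty succeeds half the time.
print(p_correct(ability=0.0, difficulty=0.0))

# A highly discriminating question separates strong and weak systems sharply.
for a in (0.5, 2.0):
    print(a, p_correct(1.0, 0.0, a), p_correct(-1.0, 0.0, a))
```

Under this form, difficulty shifts the curve along the ability axis and discriminability sets its slope, which is why high-discriminability questions are the most useful for telling systems apart.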

[32] Examining the Metrics for Document-Level Claim Extraction in Czech and Slovak

Lucia Makaiová,Martin Fajčík,Antonín Jarolím

Main category: cs.CL

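TL;DR: This paper examines evaluation methods for document-level claim extraction, proposing to measure extraction performance by aligning model-extracted claim sets with human-annotated ones. Experiments on comments under Czech and Slovak news articles reveal the limits of existing methods in capturing semantic similarity and claim properties such as atomicity, check-worthiness, and decontextualization.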

Details Motivation: Document-level claim extraction is a hard problem in fact-checking, and its evaluation has received little attention, leaving no reliable framework for measuring extraction quality or annotation agreement. Method: Align two claim sets for the same document and compute an alignment score as a similarity measure, exploring the best alignment strategy and evaluation method for comparing model output with human annotation. Result: Existing evaluation methods are limited on text that is informal, strongly tied to local context, and written in closely related languages, and they struggle to capture semantic similarity and key claim properties. Conclusion: More advanced evaluation methods are needed that capture semantic similarity and assess core properties such as atomicity, check-worthiness, and decontextualization. Abstract: Document-level claim extraction remains an open challenge in the field of fact-checking, and subsequently, methods for evaluating extracted claims have received limited attention. In this work, we explore approaches to aligning two sets of claims pertaining to the same source document and computing their similarity through an alignment score. We investigate techniques to identify the best possible alignment and evaluation method between claim sets, with the aim of providing a reliable evaluation framework. Our approach enables comparison between model-extracted and human-annotated claim sets, serving as a metric for assessing the extraction performance of models and also as a possible measure of inter-annotator agreement. We conduct experiments on a newly collected dataset of claims extracted from comments under Czech and Slovak news articles, a domain that poses additional challenges due to the informal language, strong local context, and subtleties of these closely related languages. The results draw attention to the limitations of current evaluation approaches when applied to document-level claim extraction and highlight the need for more advanced methods, ones able to correctly capture semantic similarity and evaluate essential claim properties such as atomicity, checkworthiness, and decontextualization.
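
One way to picture an alignment score between two claim sets is an optimal one-to-one matching under some pairwise similarity. The sketch below uses token-level Jaccard overlap and a brute-force search over matchings, both chosen purely for illustration. The abstract does not name the similarity metric or alignment algorithm the authors use, and the function names are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Token-overlap similarity between two claims (a crude stand-in metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def alignment_score(predicted, gold):
    """Best one-to-one matching score, normalized by the larger set's size."""
    if not predicted or not gold:
        return 0.0
    small, large = sorted([predicted, gold], key=len)
    # Brute force is fine for small sets; unmatched claims contribute zero.
    best = max(
        sum(jaccard(s, l) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)


print(alignment_score(["taxes rose in 2020"], ["taxes rose in 2020", "crime fell"]))
```

Normalizing by the larger set penalizes both missed and spurious claims. A surface metric like Jaccard is the kind of measure the abstract suggests falls short on semantic similarity, so an embedding-based similarity could be swapped in.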

[33] Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

Noam Dahan,Omer Kidron,Gabriel Stanovsky

Main category: cs.CL

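TL;DR: Proposes a method for automatically extracting naturally occurring summaries from newspaper Front-Page Teasers and builds HEBTEASESUM, the first multi-document summarization dataset in Hebrew.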

Details Motivation: High-quality summarization data is scarce for low-resource languages, while historical newspapers contain large amounts of untapped, naturally annotated summaries. Method: Treat the editorial Front-Page Teasers of newspapers as article summaries and develop an automatic data collection process suited to varying levels of linguistic resources. Result: The phenomenon is shown to be common across seven diverse languages, and the process is applied to a Hebrew newspaper to build HEBTEASESUM, the first Hebrew multi-document summarization dataset. Conclusion: Front-Page Teasers are an effective cross-lingual source of natural summaries, and the proposed method scales to low-resource languages. Abstract: High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.

[34] A Method for Characterizing Disease Progression from Acute Kidney Injury to Chronic Kidney Disease

Yilu Fang,Jordan G. Nestor,Casey N. Ta,Jerard Z. Kneifati-Hayek,Chunhua Weng

Main category: cs.CL

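TL;DR: Using electronic health record data, this study longitudinally tracks the clinical evolution of patients with acute kidney injury (AKI), identifies 15 distinct clinical states with differing risks of progression to chronic kidney disease (CKD), and proposes a data-driven approach for identifying high-risk patients.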

Details Motivation: Identifying which AKI patients are at high risk of developing CKD remains challenging and calls for more dynamic, precise risk stratification. Method: Build patient vectors from longitudinal medical codes and creatinine values in EHR data, cluster them to identify clinical states, estimate state transition probabilities and CKD progression risk with a multi-state model, and use survival analysis to identify CKD risk factors in different subpopulations. Result: Of 20,699 AKI patients, 17% developed CKD; 15 distinct clinical states were identified, and most patients (75%) stayed in a single state or made only one transition; in addition to known risk factors, novel CKD risk factors were found, with effects that vary by clinical state. Conclusion: The study provides a data-driven approach for identifying high-risk patients after AKI and supports the development of decision-support tools for early detection and intervention. Abstract: Patients with acute kidney injury (AKI) are at high risk of developing chronic kidney disease (CKD), but identifying those at greatest risk remains challenging. We used electronic health record (EHR) data to dynamically track AKI patients' clinical evolution and characterize AKI-to-CKD progression. Post-AKI clinical states were identified by clustering patient vectors derived from longitudinal medical codes and creatinine measurements. Transition probabilities between states and progression to CKD were estimated using multi-state modeling. After identifying common post-AKI trajectories, CKD risk factors in AKI subpopulations were identified through survival analysis. Of 20,699 patients with AKI at admission, 3,491 (17%) developed CKD. We identified fifteen distinct post-AKI states, each with different probabilities of CKD development. Most patients (75%, n=15,607) remained in a single state or made only one transition during the study period. We identified both established (e.g., AKI severity, diabetes, hypertension, heart failure, liver disease) and novel CKD risk factors, with their impact varying across these clinical states. This study demonstrates a data-driven approach for identifying high-risk AKI patients, supporting the development of decision-support tools for early CKD detection and intervention.

[35] Bridging Human and Model Perspectives: A Comparative Analysis of Political Bias Detection in News Media Using Large Language Models

Shreya Adrita Banik,Niaz Nafi Rahman,Tahsina Moiukh,Farig Sadeque

Main category: cs.CL

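TL;DR: This study presents a comparative framework for evaluating political bias detection in news across human annotations and several large language models (GPT, BERT, RoBERTa, and FLAN). Using a manually annotated dataset, it finds that RoBERTa aligns best with human labels among traditional transformer models, while GPT shows the strongest overall agreement in a zero-shot setting. The fine-tuned RoBERTa model performs best. The study argues for hybrid evaluation frameworks that combine human interpretability with model scalability.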

Details Motivation: How well large language models align with human judgment in political bias detection is still unclear and has not been compared systematically, so the agreement and divergence between model and human perception need to be evaluated. Method: Build a manually annotated dataset of news articles; assess annotation consistency, bias polarity, and inter-model agreement; compare GPT, BERT, RoBERTa, and FLAN on political bias classification, with particular attention to their alignment with human annotations. Result: Among traditional transformer models, RoBERTa aligns most closely with human labels; the generative model GPT shows the strongest overall agreement in a zero-shot setting; the fine-tuned RoBERTa model outperforms the other baselines in both accuracy and alignment with human annotations. Conclusion: Humans and large language models differ systematically in how they perceive political leaning, so relying on models alone may introduce bias. Future work should develop hybrid evaluation frameworks that combine human interpretability with model scalability to make automated media bias detection more reliable. Abstract: Detecting political bias in news media is a complex task that requires interpreting subtle linguistic and contextual cues. Although recent advances in Natural Language Processing (NLP) have enabled automatic bias classification, the extent to which large language models (LLMs) align with human judgment remains relatively underexplored. This study presents a comparative framework for evaluating the detection of political bias across human annotations and multiple LLMs, including GPT, BERT, RoBERTa, and FLAN. We construct a manually annotated dataset of news articles and assess annotation consistency, bias polarity, and inter-model agreement to quantify divergence between human and model perceptions of bias. Experimental results show that among traditional transformer-based models, RoBERTa achieves the highest alignment with human labels, whereas generative models such as GPT demonstrate the strongest overall agreement with human annotations in a zero-shot setting. Among all transformer-based baselines, our fine-tuned RoBERTa model achieved the highest accuracy and the strongest alignment with human-annotated labels. Our findings highlight systematic differences in how humans and LLMs perceive political slant, underscoring the need for hybrid evaluation frameworks that combine human interpretability with model scalability in automated media bias detection.

[36] Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities

Kahaan Gandhi,Boris Bolliet,Inigo Zubeldia

Main category: cs.CL

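TL;DR: Proposes a multi-agent system guided by vision-language models (VLMs) that treats plots as verifiable checkpoints to achieve end-to-end autonomous scientific discovery, substantially improving task success rates and interpretability.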

Details Motivation: Traditional automated scientific discovery methods have limited ability to correct errors and adjust reasoning in real time, and they lack dynamic adaptability and auditability on complex scientific data. Method: Use a VLM as a judge that evaluates figures against dynamically generated, domain-specific rubrics and guides the multi-agent system through self-correction and exploratory data analysis. Result: In cosmology and astrochemistry case studies, the system recovers from faulty reasoning paths and adapts to new data; on a 10-task benchmark, VLM-augmented systems reach pass@1 scores of 0.7-0.8, compared with 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines. Conclusion: VLM-guided multi-agent systems improve the accuracy, robustness, and interpretability of autonomous scientific discovery and can keep exploring without human intervention. Abstract: We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
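For readers unfamiliar with the pass@1 metric quoted above, the following is a minimal sketch of the commonly used unbiased pass@k estimator. It is an assumption that the reported scores follow this definition; the abstract does not say how they were computed. The function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n attempts of which c succeeded, is a success."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws with failures only
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_task, k: int) -> float:
    """Average pass@k over tasks; per_task holds (n_attempts, n_successes)."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)
```

With one attempt per task, pass@1 reduces to the fraction of tasks solved on the first try, so solving 7 of 10 tasks gives a score of about 0.7.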

[37] A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases

Tao Yang,Dandan Huang,Yunting Lin,Pengfei Wu,Zhikun Wu,Gangyuan Ma,Yulan Lu,Xinran Dong,Dingpeng Li,Junshuang Ge,Zhiyan Zhang,Xuanzhao Huang,Wenyan Nong,Yao Zhou,Hui Tang,Hongxi Yang,Shijie Zhang,Juan Li,Xiaojun Cao,Lin Yang,Xia Gao,Kaishou Xu,Xiaoqiong Gu,Wen Zhang,Huimin Xia,Li Liu,Wenhao Zhou,Mulin Jun Li

Main category: cs.CL

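TL;DR: RareSeek R1 is a clinical diagnosis model built on a narrative-first, knowledge-integrated reasoning paradigm. Using staged instruction tuning and graph-grounded retrieval, it achieves state-of-the-art accuracy and stability in rare disease diagnosis, performs on par with experienced physicians, and makes clinical decision support more auditable and interpretable.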

Details Motivation: Rare disease diagnosis often takes years. Existing methods handle noisy data, stale knowledge, and hallucinations poorly, and general or medical large models are limited by the scarcity of real electronic health record (EHR) data, so a more reliable and interpretable diagnostic support system is needed. Method: Build a large, domain-specialized clinical corpus and a clinician-validated reasoning dataset, and develop RareSeek R1 using staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval augmentation. Result: Across multicenter EHR narratives and public benchmarks, RareSeek R1 achieves state-of-the-art diagnostic accuracy with strong generalization and robustness to noise; retrieval augmentation helps most when narratives are paired with prioritized variants; human studies show performance comparable to experienced physicians and effective assistance in diagnosis. Conclusion: RareSeek R1 advances a reasoning paradigm centered on clinical narratives and integrated with external knowledge, shortens the diagnostic process, and offers auditable, clinically translatable decision support for rare disease diagnosis. Abstract: Rare diseases affect hundreds of millions worldwide, yet diagnosis often spans years. Conventional pipelines decouple noisy evidence extraction from downstream inferential diagnosis, and general/medical large language models (LLMs) face scarce real-world electronic health records (EHRs), stale domain knowledge, and hallucinations. We assemble a large, domain-specialized clinical corpus and a clinician-validated reasoning set, and develop RareSeek R1 via staged instruction tuning, chain-of-thought learning, and graph-grounded retrieval. Across multicenter EHR narratives and public benchmarks, RareSeek R1 attains state-of-the-art accuracy, robust generalization, and stability under noisy or overlapping phenotypes. Augmented retrieval yields the largest gains when narratives pair with prioritized variants by resolving ambiguity and aligning candidates to mechanisms. Human studies show performance on par with experienced physicians and consistent gains in assistive use. Notably, transparent reasoning highlights decisive non-phenotypic evidence (median 23.1%, such as imaging, interventions, functional tests) underpinning many correct diagnoses. This work advances a narrative-first, knowledge-integrated reasoning paradigm that shortens the diagnostic odyssey and enables auditable, clinically translatable decision support.

[38] Graded strength of comparative illusions is explained by Bayesian inference

Yuhan Zhang,Erxiao Wang,Cory Shain

Main category: cs.CL

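TL;DR: By combining statistical language models with human behavioral data, this study builds a quantitative posterior-probability model that explains the comparative illusion (CI), supports the noisy-channel theory of sentence comprehension, and correctly predicts how pronominal versus full noun phrase than-clause subjects affect illusion strength.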

Details Motivation: Comparative illusion sentences (e.g., "More students have been to Russia than I have") seem reasonable but are logically nonsensical, yet people often judge them acceptable. The study tests whether noisy-channel theory can systematically explain this kind of linguistic illusion and extends earlier experimental results that covered only a narrow set of interpretations. Method: Combine statistical language models with human behavioral data to compute the posterior probability of many plausible interpretations, build a model that predicts CI illusion strength, and use behavioral experiments to test the model's predictions about different syntactic structures (such as pronoun vs. full noun phrase subjects). Result: The model explains fine gradations in CI effects and also accounts for a previously unexplained effect of pronominal versus full noun phrase than-clause subjects, providing new empirical evidence for noisy-channel theory. Conclusion: Noisy-channel inference is an effective framework for explaining linguistic illusions and offers a unified computational-level theory of language comprehension that applies to a wide range of processing phenomena. Abstract: Like visual processing, language processing is susceptible to illusions in which people systematically misperceive stimuli. In one such case, the comparative illusion (CI; e.g., "More students have been to Russia than I have"), comprehenders tend to judge the sentence as acceptable despite its underlying nonsensical comparison. Prior research has argued that this phenomenon can be explained as Bayesian inference over a noisy channel: the posterior probability of an interpretation of a sentence is proportional to both the prior probability of that interpretation and the likelihood of corruption into the observed (CI) sentence. Initial behavioral work has supported this claim by evaluating a narrow set of alternative interpretations of CI sentences and showing that comprehenders favor interpretations that are more likely to have been corrupted into the illusory sentence. In this study, we replicate and go substantially beyond this earlier work by directly predicting the strength of illusion with a quantitative model of the posterior probability of plausible interpretations, which we derive through a novel synthesis of statistical language models with human behavioral data. Our model explains not only the fine gradations in the strength of CI effects, but also a previously unexplained effect caused by pronominal vs. full noun phrase than-clause subjects. These findings support a noisy-channel theory of sentence comprehension by demonstrating that the theory makes novel predictions about the comparative illusion that bear out empirically. This outcome joins related evidence of noisy channel processing in both illusory and non-illusory contexts to support noisy channel inference as a unified computational-level theory of diverse language processing phenomena.
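The noisy-channel rule stated in the abstract, posterior proportional to prior times corruption likelihood, can be written out as a short sketch. The candidate interpretations and all numbers below are invented purely for illustration; they are not the paper's stimuli or estimates, and `noisy_channel_posterior` is a hypothetical name.

```python
def noisy_channel_posterior(priors, likelihoods):
    """Return P(interpretation | observed sentence) for each candidate.

    priors[i]      : P(i), prior plausibility of intended sentence i
    likelihoods[i] : P(observed | i), chance that i was corrupted into
                     the observed sentence
    """
    joint = {i: priors[i] * likelihoods[i] for i in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("no candidate can produce the observed sentence")
    return {i: p / evidence for i, p in joint.items()}


# Illustrative values only: two hypothetical intended sentences.
example_priors = {"interpretation_A": 0.6, "interpretation_B": 0.4}
example_likelihoods = {"interpretation_A": 0.1, "interpretation_B": 0.3}
example_posterior = noisy_channel_posterior(example_priors, example_likelihoods)
```

In this toy case the less probable interpretation a priori ends up favored, because it is more likely to have been corrupted into the observed sentence.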

[39] Bias in, Bias out: Annotation Bias in Multilingual Large Language Models

Xia Cui,Ziyi Huang,Naeemeh Adel

Main category: cs.CL

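TL;DR: This paper proposes a comprehensive framework for understanding annotation bias in multilingual large language models, distinguishing instruction bias, annotator bias, and contextual and cultural bias. It reviews methods for detecting and mitigating these biases and presents an ensemble-based mitigation approach for multilingual settings along with an ethical analysis.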

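Details Motivation: Annotation bias arising from task framing, annotator subjectivity, and cultural differences seriously affects the development of multilingual large language models, especially in culturally diverse settings, where it can distort model outputs and exacerbate social harms. Method: Propose a typology of annotation bias; review detection methods including inter-annotator agreement, model disagreement, and metadata analysis, along with emerging techniques such as multilingual model divergence and cultural inference; and outline proactive and reactive mitigation strategies such as diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustment. Result: The contributions are (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted to multilingual settings; and (4) an ethical analysis of annotation processes. Conclusion: The work offers a theoretical basis and practical guidance for building fairer, more culturally sensitive annotation pipelines for large language models. Abstract: Annotation bias in NLP datasets remains a major challenge for developing multilingual Large Language Models (LLMs), particularly in culturally diverse settings. Bias from task framing, annotator subjectivity, and cultural mismatches can distort model outputs and exacerbate social harms. We propose a comprehensive framework for understanding annotation bias, distinguishing among instruction bias, annotator bias, and contextual and cultural bias. We review detection methods (including inter-annotator agreement, model disagreement, and metadata analysis) and highlight emerging techniques such as multilingual model divergence and cultural inference. We further outline proactive and reactive mitigation strategies, including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustments. Our contributions include: (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted for multilingual settings; and (4) an ethical analysis of annotation processes. Together, these insights aim to inform more equitable and culturally grounded annotation pipelines for LLMs.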

[40] Streamlining Industrial Contract Management with Retrieval-Augmented LLMs

Kristi Topollai,Tolga Dimlioglu,Anna Choromanska,Simon Odie,Reginald Hui

Main category: cs.CL

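TL;DR: Proposes a modular retrieval-augmented generation (RAG) framework for automating the revision of provisions in contract management. By combining synthetic data generation, semantic retrieval, acceptability classification, and reward-based alignment, it achieves over 80% accuracy in a real-world, low-resource setting.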

Details Motivation: Contract management lacks labeled data and involves large volumes of unstructured legacy contracts, so traditional methods struggle to automate the handling of problematic revisions to provisions. Method: Build a modular RAG framework that combines synthetic data generation, semantic clause retrieval, acceptability classification, and reward-based alignment to identify problematic contract revisions and generate improved alternatives. Result: Developed and evaluated with an industry partner, the system achieves over 80% accuracy in both identifying and optimizing problematic revisions, performing well under real-world, low-resource conditions. Conclusion: The framework can speed up contract revision workflows and offers a practical automation solution for real contract management. Abstract: Contract management involves reviewing and negotiating provisions, individual clauses that define rights, obligations, and terms of agreement. During this process, revisions to provisions are proposed and iteratively refined, some of which may be problematic or unacceptable. Automating this workflow is challenging due to the scarcity of labeled data and the abundance of unstructured legacy contracts. In this paper, we present a modular framework designed to streamline contract management through a retrieval-augmented generation (RAG) pipeline. Our system integrates synthetic data generation, semantic clause retrieval, acceptability classification, and reward-based alignment to flag problematic revisions and generate improved alternatives. Developed and evaluated in collaboration with an industry partner, our system achieves over 80% accuracy in both identifying and optimizing problematic revisions, demonstrating strong performance under real-world, low-resource conditions and offering a practical means of accelerating contract revision workflows.

[41] Quadratic Term Correction on Heaps' Law

Oscar Fontanelli,Wentian Li

Main category: cs.CL

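TL;DR: This paper studies the nonlinearity of Heaps' law in log-log scale, finds that a quadratic function fits the type-token relation more accurately than a power law, and explains the source of the curvature using a model of randomly drawing colored balls from a bag with replacement.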

Details Motivation: Classical Heaps' law assumes a power-law type-token relation, but the curve is still slightly concave in log-log scale, which shows that the power-law assumption does not fully hold and that a higher-order approximation is needed. Method: Using data from twenty English novels or writings (some of them translations), fit the log(type)-log(token) relation with a regression containing both a linear and a quadratic term, and use the random ball-drawing model to explain the origin of the curvature. Result: Quadratic functions in log-log scale fit the type-token data perfectly; the regressions give a linear coefficient slightly larger than 1 and a quadratic coefficient around -0.02; the pseudo-variance is negative, which accounts for the concavity. Conclusion: The type-token relation has a quantifiable curvature in log-log scale and cannot be treated simply as a power law; the quadratic model describes it more accurately, and the pseudo-variance formalism helps estimate the curvature when the number of tokens is small. Abstract: Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we show, using twenty English novels or writings (some translated from another language into English), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and a quadratic term consistently lead to a linear coefficient slightly larger than 1 and a quadratic coefficient around -0.02. Using the "random drawing of colored balls from a bag with replacement" model, we show that the curvature in log-log scale is identical to a "pseudo-variance", which is negative. Although a pseudo-variance calculation may encounter numeric instability when the number of tokens is large, due to the large values of pseudo-weights, this formalism provides a rough estimate of the curvature when the number of tokens is small.
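The regression described above, log(type) modeled as a + b·log(token) + c·log(token)², can be sketched with ordinary least squares. This is a minimal illustration under stated assumptions, not the authors' code: it uses natural logarithms (the abstract does not say which base), solves the 3x3 normal equations with Cramer's rule, and the name `fit_log_quadratic` is hypothetical.

```python
import math


def _det3(q):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (q[0][0] * (q[1][1] * q[2][2] - q[1][2] * q[2][1])
            - q[0][1] * (q[1][0] * q[2][2] - q[1][2] * q[2][0])
            + q[0][2] * (q[1][0] * q[2][1] - q[1][1] * q[2][0]))


def fit_log_quadratic(tokens, types):
    """Least-squares fit of log(types) = a + b*x + c*x**2, x = log(tokens).

    Returns (a, b, c). A pure power law corresponds to c == 0; a negative
    c means the curve is concave in log-log scale.
    """
    xs = [math.log(n) for n in tokens]
    ys = [math.log(v) for v in types]
    # Power sums that make up the normal equations M @ (a, b, c) = rhs.
    s = [sum(x ** k for x in xs) for k in range(5)]
    rhs = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] for i in range(3)]
    d = _det3(m)
    if d == 0:
        raise ValueError("need at least three distinct token counts")
    coeffs = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]  # Cramer's rule: swap in the right-hand side
        coeffs.append(_det3(mc) / d)
    return tuple(coeffs)
```

In practice the inputs would be running token counts and the number of distinct word types seen so far, sampled at several points through a text.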

[42] SMRC: Aligning Large Language Models with Student Reasoning for Mathematical Error Correction

Biaojie Zeng,Min Zhang,Juan Zhou,Fengrui Liu,Ruiyang Huang,Xin Lin

Main category: cs.CL

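TL;DR: Proposes SMRC, which uses Monte Carlo Tree Search and LLM-guided reward generation to correct students' mathematical reasoning at a fine-grained level, and builds MSEB, a high school mathematics benchmark of multi-solution errors.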

Details Motivation: Existing self-correction methods for large models lack the systematic, teacher-style guidance needed in educational settings and struggle to correct reasoning errors in students' solutions. Method: Model student mathematical reasoning as a multi-step decision problem and use Monte Carlo Tree Search (MCTS) to explore optimal correction paths; generate process-level rewards through LLM-guided breadth-first search and final-answer evaluation, and distribute them to intermediate steps with a back-propagation mechanism to obtain fine-grained supervision. Result: On ProcessBench, MR-GSM8K, and the authors' MSEB dataset, SMRC significantly outperforms existing methods in solution accuracy and correct-step retention. Conclusion: SMRC provides effective external correction of mathematical reasoning and advances the use of LLMs as intelligent tutors in education. Abstract: Large language models (LLMs) often make reasoning errors when solving mathematical problems, and how to automatically detect and correct these errors has become an important research direction. However, existing approaches mainly focus on self-correction within the model, which falls short of the "teacher-style" correction required in educational settings, i.e., systematically guiding and revising a student's problem-solving process. To address this gap, we propose SMRC (Student Mathematical Reasoning Correction), a novel method that aligns LLMs with student reasoning. Specifically, SMRC formulates student reasoning as a multi-step sequential decision problem and introduces Monte Carlo Tree Search (MCTS) to explore optimal correction paths. To reduce the cost of annotating process-level rewards, we leverage breadth-first search (BFS) guided by LLMs and final-answer evaluation to generate reward signals, which are then distributed across intermediate reasoning steps via a back-propagation mechanism, enabling fine-grained process supervision. Additionally, we construct a benchmark for high school mathematics, MSEB (Multi-Solution Error Benchmark), consisting of 158 instances that include problem statements, student solutions, and correct reasoning steps. We further propose a dual evaluation protocol centered on solution accuracy and correct-step retention, offering a comprehensive measure of educational applicability. Experiments demonstrate that SMRC significantly outperforms existing methods on two public datasets (ProcessBench and MR-GSM8K) and our MSEB in terms of effectiveness and overall performance. The code and data are available at https://github.com/Mind-Lab-ECNU/SMRC.
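The step of distributing a final-answer reward across intermediate reasoning steps can be illustrated with the standard MCTS back-propagation update. This is a generic sketch, not the SMRC implementation: the `Node` class, its fields, and the use of a running mean as the step value are assumptions made for illustration.

```python
class Node:
    """One reasoning step in a search tree over candidate corrections."""

    def __init__(self, step, parent=None):
        self.step = step
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def add_child(self, step):
        child = Node(step, parent=self)
        self.children.append(child)
        return child

    @property
    def value(self):
        """Mean reward of all rollouts that passed through this step."""
        return self.total_reward / self.visits if self.visits else 0.0


def backpropagate(leaf, reward):
    """Credit a final-answer reward to every step on the path to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

After several rollouts, a step shared by both correct and incorrect completions ends up with an intermediate value, which is what gives process-level supervision without step-by-step human labels.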

[43] Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries

Kiera McCormick,Rafael Martínez-Galarza

Main category: cs.CL

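TL;DR: Investigates whether large language models (LLMs) can encode astrophysical information from textual descriptions, and how prompting and linguistic features affect the encoding of physical quantities, using sparse autoencoders to extract interpretable features.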

Details Motivation: Explore whether large language models can capture and encode physical information from text when direct scientific measurements are unavailable, to advance their use in science. Method: Using astrophysics as a test bed, analyze LLM embeddings with sparse autoencoders and study how prompting and linguistic features affect the encoding of physical quantities. Result: Prompting is found to significantly affect how LLMs encode physical quantities, and particular linguistic features play a key role in encoding physical information. Conclusion: LLMs can encode scientifically relevant physical information from text, and prompt engineering and linguistic structure strongly influence how well they do so. Abstract: Large Language Models have demonstrated the ability to generalize well at many levels across domains and modalities, and have even shown in-context learning capabilities. This raises research questions about how they can be used to encode physical information that is usually only available from scientific measurements and only loosely encoded in textual descriptions. Using astrophysics as a test bed, we investigate whether LLM embeddings can codify physical summary statistics that are obtained from scientific measurements through two main questions: 1) Does prompting play a role in how those quantities are codified by the LLM? and 2) What aspects of language are most important in encoding the physics represented by the measurement? We investigate this using sparse autoencoders that extract interpretable features from the text.

[44] Ground Truth Generation for Multilingual Historical NLP using LLMs

Clovis Gladstone,Zhao Fang,Spencer Dean Stewart

Main category: cs.CL

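TL;DR: This paper explores using large language models (LLMs) to generate ground-truth annotations for historical French and Chinese texts to address data scarcity in low-resource NLP. Fine-tuning spaCy models on LLM-generated annotations for a subset of the corpus yields significant gains in part-of-speech tagging, lemmatization, and named entity recognition.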

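Details Motivation: Processing historical and low-resource languages is difficult because annotated data is scarce and the domain differs from modern corpora, so effective ways are needed to make NLP tools more usable in the computational humanities. Method: Use large language models to generate high-quality annotations for historical French (16th-20th centuries) and Chinese (1900-1950) texts, and fine-tune spaCy models on this synthetic data to adapt them to period-specific language. Result: On period-specific tests, the fine-tuned spaCy models improve significantly in part-of-speech tagging, lemmatization, and named entity recognition, showing that a small amount of synthetic data can improve NLP performance in low-resource settings. Conclusion: Domain-specific models combined with LLM-generated synthetic annotations are an effective way to improve processing of historical and low-resource languages, with clear value for the computational humanities. Abstract: Historical and low-resource NLP remains challenging due to limited annotated data and domain mismatches with modern, web-sourced corpora. This paper outlines our work in using large language models (LLMs) to create ground-truth annotations for historical French (16th-20th centuries) and Chinese (1900-1950) texts. By leveraging LLM-generated ground truth on a subset of our corpus, we were able to fine-tune spaCy to achieve significant gains on period-specific tests for part-of-speech (POS) annotations, lemmatization, and named entity recognition (NER). Our results underscore the importance of domain-specific models and demonstrate that even relatively limited amounts of synthetic data can improve NLP tools for under-resourced corpora in computational humanities research.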

[45] Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances

Rishu Kumar Singh,Navneet Shreya,Sarmistha Das,Apoorva Singh,Sriparna Saha

Main category: cs.CL

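TL;DR: This paper presents VALOR, a validation-aware learning framework for analyzing multimodal, multi-turn customer complaint dialogues. It combines textual and visual evidence for fine-grained classification of complaint aspects and severity, improves performance through semantic alignment and a meta-fusion strategy, and supports UN Sustainable Development Goals SDG 9 and SDG 12.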

Details Motivation: Existing complaint analysis methods rely mainly on unimodal short text (such as tweets or reviews) and struggle to capture information spread across modalities in complex complaints; this work uses multi-turn customer dialogues containing both text and visual evidence to understand complaints more finely and accurately. Method: Propose the VALOR model, which uses a multi-expert reasoning setup with Chain-of-Thought (CoT) prompting of large models for decision-making, computes a semantic alignment score between text and images, and integrates multimodal information through a meta-fusion strategy to refine the classification. Result: On a multimodal complaint dataset annotated with fine-grained aspect and severity labels, VALOR consistently outperforms baseline models, especially in complex scenarios where information is distributed across text and images. Conclusion: Multimodal interaction and expert validation improve practical complaint understanding systems, and the work offers a viable path toward intelligent, scalable service infrastructure and more responsible consumption. Abstract: Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and code are available here: https://github.com/sarmistha-D/VALOR

[46] Subword Tokenization Strategies for Kurdish Word Embeddings

Ali Salehi,Cassandra L. Jacobs

Main category: cs.CL

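TL;DR: This paper studies tokenization strategies for Kurdish word embeddings, comparing word-level, morpheme-based, and BPE approaches. It finds that conventional evaluation suffers from coverage bias; under coverage-aware evaluation, morpheme-level tokenization gives better semantic organization and better coverage across levels of morphological complexity.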

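Details Motivation: For Kurdish, a low-resource language, it is still unclear how to tokenize text so that morphological similarity is preserved and high-quality word embeddings can be built, and existing evaluation methods may be biased by uneven coverage. Method: Develop a BiLSTM-CRF morphological segmenter, bootstrapped from a small amount of manual annotation, to produce morpheme-level tokens; compare Word2Vec embeddings under the different tokenization strategies on similarity preservation, clustering quality, and semantic organization; and add a coverage analysis to expose evaluation bias. Result: BPE looks better in the initial evaluation but covers only 28.6% of test cases, whereas the morpheme model covers 68.7%; under comprehensive evaluation, morpheme-level tokenization outperforms BPE and word-level tokenization in embedding space organization, semantic neighborhood structure, and balanced coverage across morphological complexity levels. Conclusion: Evaluating tokenization strategies for low-resource languages must take coverage into account, since ignoring it leads to wrong conclusions; morpheme-level tokenization with coverage-aware evaluation gives a more faithful picture of performance. Abstract: We investigate tokenization strategies for Kurdish word embeddings by comparing word-level, morpheme-based, and BPE approaches on morphological similarity preservation tasks. We develop a BiLSTM-CRF morphological segmenter using bootstrapped training from minimal manual annotation and evaluate Word2Vec embeddings across comprehensive metrics including similarity preservation, clustering quality, and semantic organization. Our analysis reveals critical evaluation biases in tokenization comparison. While BPE initially appears superior in morphological similarity, it evaluates only 28.6% of test cases compared to 68.7% for the morpheme model, creating artificial performance inflation. When assessed comprehensively, morpheme-based tokenization demonstrates superior embedding space organization, better semantic neighborhood structure, and more balanced coverage across morphological complexity levels. These findings highlight the importance of coverage-aware evaluation in low-resource language processing and offer alternative tokenization methods for low-resource languages.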
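The coverage bias described above can be made concrete with a small sketch that reports coverage next to the score. The scoring scheme is an assumption for illustration (the abstract does not give the paper's exact metric): each test case has a score, or `None` when the model could not be evaluated on it, and uncovered cases count as zero in the overall figure. The function name is hypothetical.

```python
def coverage_aware_scores(results):
    """Summarise per-case scores where None marks an uncovered test case.

    Returns a dict with:
      coverage     : fraction of test cases the model could be scored on
      covered_mean : mean score over covered cases only (can be inflated)
      overall      : mean over all cases, counting uncovered ones as 0
    """
    total = len(results)
    covered = [r for r in results if r is not None]
    return {
        "coverage": len(covered) / total if total else 0.0,
        "covered_mean": sum(covered) / len(covered) if covered else 0.0,
        "overall": sum(covered) / total if total else 0.0,
    }
```

A model that scores well on the few cases it covers can look stronger than a model with broader coverage if only `covered_mean` is reported; comparing `coverage` and `overall` alongside it exposes the difference.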

[47] Strategic Innovation Management in the Age of Large Language Models Market Intelligence, Adaptive R&D, and Ethical Governance

Raha Aghaei,Ali A. Kiaei,Mahnaz Boush,Mahan Rofoosheh,Mohammad Zavvar

Main category: cs.CL

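TL;DR: Large language models (LLMs) substantially improve the efficiency and effectiveness of R&D processes by automating knowledge discovery, strengthening hypothesis generation, integrating cross-disciplinary insights, and enabling cooperation within innovation ecosystems.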

Details Motivation: Explore the multiple functions of LLMs in R&D in order to improve innovation efficiency and shorten time-to-market for breakthrough ideas. Method: Assess the role of LLMs in automating and optimizing R&D through extensive analysis of scientific literature, patent databases, and experimental data. Result: LLMs make R&D workflows more flexible and better informed, speed up innovation cycles, and reduce time costs. Conclusion: LLMs have become a core tool driving change in modern R&D and have broad potential in future innovation ecosystems. Abstract: This study analyzes the multiple functions of Large Language Models (LLMs) in transforming research and development (R&D) processes. By automating knowledge discovery, boosting hypothesis creation, integrating transdisciplinary insights, and enabling cooperation within innovation ecosystems, LLMs dramatically improve the efficiency and effectiveness of research processes. Through extensive analysis of scientific literature, patent databases, and experimental data, these models enable more flexible and informed R&D workflows, ultimately accelerating innovation cycles and lowering time-to-market for breakthrough ideas.

cs.CV [Back]

[48] nuCarla: A nuScenes-Style Bird's-Eye View Perception Dataset for CARLA Simulation

Zhijie Qiao,Zhong Cao,Henry X. Liu

Main category: cs.CV

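TL;DR: nuCarla is a large-scale bird's-eye-view perception dataset, compatible with the nuScenes format and built in the CARLA simulator, that supports closed-loop training and evaluation of end-to-end autonomous driving and advances research on reliable, safety-aware perception.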

Details Motivation: Most existing real-world datasets are non-interactive and mainly support open-loop learning, so they offer little for the closed-loop testing that end-to-end autonomous driving requires, and there is no standardized large-scale dataset for learning meaningful intermediate representations such as BEV features. Method: Build nuCarla, a large-scale bird's-eye-view perception dataset, in the CARLA simulator; make it fully compatible with the nuScenes format, give it balanced class distributions, and make it directly usable in closed-loop simulation; also provide high-performance BEV backbones as baseline models. Result: The provided backbones achieve state-of-the-art detection results, and the dataset supports seamless transfer of real-world perception models to simulation and direct use for training and deploying closed-loop end-to-end models. Conclusion: By releasing data and models as open benchmarks, nuCarla substantially accelerates the development of closed-loop end-to-end autonomous driving and supports safe, reliable autonomous driving research. Abstract: End-to-end (E2E) autonomous driving heavily relies on closed-loop simulation, where perception, planning, and control are jointly trained and evaluated in interactive environments. Yet, most existing datasets are collected from the real world under non-interactive conditions, primarily supporting open-loop learning while offering limited value for closed-loop testing. Due to the lack of standardized, large-scale, and thoroughly verified datasets to facilitate learning of meaningful intermediate representations, such as bird's-eye-view (BEV) features, closed-loop E2E models remain far behind even simple rule-based baselines. To address this challenge, we introduce nuCarla, a large-scale, nuScenes-style BEV perception dataset built within the CARLA simulator. nuCarla features (1) full compatibility with the nuScenes format, enabling seamless transfer of real-world perception models; (2) a dataset scale comparable to nuScenes, but with more balanced class distributions; (3) direct usability for closed-loop simulation deployment; and (4) high-performance BEV backbones that achieve state-of-the-art detection results. By providing both data and models as open benchmarks, nuCarla substantially accelerates closed-loop E2E development, paving the way toward reliable and safety-aware research in autonomous driving.

[49] Known Meets Unknown: Mitigating Overconfidence in Open Set Recognition

Dongdong Zhao,Ranxin Fang,Changtian Song,Zhihui Liu,Jianwen Xiang

Main category: cs.CV

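TL;DR: Proposes a new open set recognition framework that mitigates overconfidence caused by inter-class overlap through perturbation-based uncertainty estimation and learning-based unknown detection.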

Details Motivation: In open set recognition, unknown samples that are semantically similar to known classes cause class overlap in the feature space, which leads the model to assign them excessive confidence and misclassify them, hurting recognition performance. Method: The framework has two parts: a perturbation-based uncertainty estimation module, which applies controllable parameter perturbations to generate diverse predictions and quantify predictive uncertainty; and a two-stage, learning-based unknown detection module, which uses the estimated uncertainty to sharpen the distinction between known and unknown classes. Result: Experiments on three public datasets show that the method outperforms existing OSR methods on open set recognition. Conclusion: The proposed framework effectively mitigates overconfidence caused by inter-class overlap and improves open set recognition performance. Abstract: Open Set Recognition (OSR) requires models not only to accurately classify known classes but also to effectively reject unknown samples. However, when unknown samples are semantically similar to known classes, inter-class overlap in the feature space often causes models to assign unjustifiably high confidence to them, leading to misclassification as known classes, a phenomenon known as overconfidence. This overconfidence undermines OSR by blurring the decision boundary between known and unknown classes. To address this issue, we propose a framework that explicitly mitigates overconfidence caused by inter-class overlap. The framework consists of two components: a perturbation-based uncertainty estimation module, which applies controllable parameter perturbations to generate diverse predictions and quantify predictive uncertainty, and an unknown detection module with distinct learning-based classifiers, implemented as a two-stage procedure, which leverages the estimated uncertainty to improve discrimination between known and unknown classes, thereby enhancing OSR performance. Experimental results on three public datasets show that the proposed framework achieves superior performance over existing OSR methods.

[50] Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection

Yogesh Kumar,Anand Mishra

Main category: cs.CV

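TL;DR: This paper proposes a new few-shot video object detection method that selectively propagates high-confidence features to improve temporal consistency and detection accuracy without relying on complex object tube proposals, achieving clear gains on several datasets.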

Details Motivation: Address the poor generalization of traditional detectors in few-shot settings and the temporal inconsistency caused by occlusion and appearance changes in video. Method: Introduce an object-aware temporal modeling approach with a filtering mechanism that selectively propagates high-confidence object features across frames, and use few-shot trained detection and classification heads for efficient, robust feature propagation. Result: In the 5-shot setting, AP improves by 3.7%, 5.3%, 4.3%, and 4.5% on FSVOD-500, FSYTV-40, VidOR, and VidVRD respectively, with improvements also shown in the 1-shot, 3-shot, and 10-shot settings. Conclusion: The method achieves stronger temporal consistency and few-shot generalization without explicit object tube generation and improves few-shot video object detection. Abstract: Few-shot Video Object Detection (FSVOD) addresses the challenge of detecting novel objects in videos with limited labeled examples, overcoming the constraints of traditional detection methods that require extensive training data. This task presents key challenges, including maintaining temporal consistency across frames affected by occlusion and appearance variations, and achieving novel object generalization without relying on complex region proposals, which are often computationally expensive and require task-specific training. Our novel object-aware temporal modeling approach addresses these challenges by incorporating a filtering mechanism that selectively propagates high-confidence object features across frames. This enables efficient feature progression, reduces noise accumulation, and enhances detection accuracy in a few-shot setting. By utilizing few-shot trained detection and classification heads with focused feature propagation, we achieve robust temporal consistency without depending on explicit object tube proposals. Our approach achieves performance gains, with AP improvements of 3.7% (FSVOD-500), 5.3% (FSYTV-40), 4.3% (VidOR), and 4.5% (VidVRD) in the 5-shot setting. Further results demonstrate improvements in 1-shot, 3-shot, and 10-shot configurations. We make the code public at: https://github.com/yogesh-iitj/fs-video-vit

[51] FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching

Huayi Zhu,Xiu Shu,Youqiang Xiong,Qiao Liu,Rui Chen,Di Yuan,Xiaojun Chang,Zhenyu He

Main category: cs.CV

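TL;DR: Proposes a flow-matching-based multi-modal image fusion method that improves sampling efficiency and fusion quality through probabilistic transport and pseudo-label selection, and adds a fusion refinement module and a continual learning mechanism for efficient, lightweight, and robust fusion across multiple tasks.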

Details Motivation: Existing methods rely on task-specific models, which are costly and scale poorly; generative methods unify modeling but are slow at inference, and high-quality ground-truth supervision is lacking. Method: Model image fusion as probabilistic transport from the source modalities to the fused image distribution and use flow matching to improve sampling efficiency; take the outputs of several state-of-the-art models as priors and select reliable pseudo-labels with a task-aware selection function; design a Fusion Refiner module that decomposes and enhances degraded components in the pseudo-labels; and combine elastic weight consolidation with experience replay for multi-task continual learning. Result: The method achieves competitive performance across diverse fusion tasks, significantly improves sampling efficiency, keeps the model lightweight, and supports continual learning. Conclusion: The method delivers high-quality fusion with much better efficiency and shows good cross-task generalization and practical potential. Abstract: Current multi-modal image fusion methods typically rely on task-specific models, leading to high training costs and limited scalability. While generative methods provide a unified modeling perspective, they often suffer from slow inference due to the complex sampling trajectories from noise to image. To address this, we formulate image fusion as a direct probabilistic transport from source modalities to the fused image distribution, leveraging the flow matching paradigm to improve sampling efficiency and structural consistency. To mitigate the lack of high-quality fused images for supervision, we collect fusion results from multiple state-of-the-art models as priors, and employ a task-aware selection function to select the most reliable pseudo-labels for each task. We further introduce a Fusion Refiner module that employs a divide-and-conquer strategy to systematically identify, decompose, and enhance degraded components in selected pseudo-labels. For multi-task scenarios, we integrate elastic weight consolidation and experience replay mechanisms to preserve cross-task performance and enhance continual learning ability from both parameter stability and memory retention perspectives. Our approach achieves competitive performance across diverse fusion tasks, while significantly improving sampling efficiency and maintaining a lightweight model design. The code will be available at: https://github.com/Ist-Zhy/FusionFM.
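The flow matching idea mentioned above, learning a velocity field that transports a source sample directly to a target sample, can be sketched on plain vectors. This is a generic illustration under stated assumptions, not the FusionFM implementation: it assumes the common straight-line interpolation path with a constant target velocity, uses lists of floats in place of images, and all function names are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from source x0 to target x1 at time t,
    together with the velocity a model should predict there."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity."""
    xt, target = flow_matching_pair(x0, x1, t)
    pred = predict_velocity(xt, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(predict_velocity, x0, steps=4):
    """Transport x0 toward the target by integrating the velocity field
    with a few Euler steps from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because the path starts at the source data and not at pure noise, a well-trained velocity field may need only a handful of integration steps, which is the usual argument for the sampling efficiency of this family of methods.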

[52] A Trajectory-free Crash Detection Framework with Generative Approach and Segment Map Diffusion

Weiying Shen,Hao Yu,Yu Dong,Pan Liu,Yu Han,Xin Wen

Main category: cs.CV

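TL;DR: Proposes a trajectory-free, two-stage real-time crash detection framework that uses a diffusion model to generate road segment maps and detect anomalies.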

Details Motivation: Traditional methods are limited by trajectory acquisition and vehicle tracking, so a more efficient and robust real-time crash detection method is needed. Method: The first stage uses the diffusion-based Mapfusion model, which generates plausible future road segment maps through a noising-denoising process combined with temporal dynamics embeddings and ControlNet guidance; the second stage detects crashes by comparing the monitored map with the generated maps. Result: Trained on non-crash data, the model generates realistic road segment evolution maps, and experiments on real-world crash data show good detection accuracy and robustness. Conclusion: The proposed two-stage framework does not depend on precise trajectories, achieves real-time crash detection, works across different sampling intervals, and has practical application potential. Abstract: Real-time crash detection is essential for developing proactive safety management strategies and enhancing overall traffic efficiency. To address the limitations associated with trajectory acquisition and vehicle tracking, road segment maps recording individual-level traffic dynamics are used directly for crash detection. A novel two-stage trajectory-free crash detection framework is presented to generate rational future road segment maps and identify crashes. The first-stage diffusion-based segment map generation model, Mapfusion, conducts a noisy-to-normal process that progressively adds noise to the road segment map until the map is corrupted to pure Gaussian noise. The denoising process is guided by sequential embedding components capturing the temporal dynamics of segment map sequences. Furthermore, the generation model is designed to incorporate background context through ControlNet to enhance generation control. In the second stage, crash detection is achieved by comparing the monitored segment map with the maps generated by the diffusion model. Trained on non-crash vehicle motion data, Mapfusion successfully generates realistic road segment evolution maps based on learned motion patterns and remains robust across different sampling intervals. Experiments on real-world crashes indicate the effectiveness of the proposed two-stage method in accurately detecting crashes.

[53] Synergizing Multigrid Algorithms with Vision Transformer: A Novel Approach to Enhance the Seismic Foundation Model

Huiwen Wu,Shuo Zhang,Yi Liu,Hongbin Ye

Main category: cs.CV

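TL;DR: Proposes an adaptive two-grid foundation model training strategy (ADATG) for seismic data that combines Hilbert encoding with spectrum decomposition to capture high- and low-frequency features and improve the pretraining of seismic vision foundation models.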

Details Motivation: Existing vision transformers fail to capture both high- and low-frequency features when processing seismic data and ignore the data's intrinsic structure, which limits pretraining efficiency and performance. Method: Spectrum decomposition separates the high- and low-frequency components, hierarchical Hilbert encoding represents the seismic data, and an adaptive training strategy first focuses on coarse-level information and then progressively refines toward fine-level features. Result: Experiments show that the proposed ADATG method significantly outperforms existing methods on seismic data, improving both the model's ability to capture high- and low-frequency information and its training efficiency. Conclusion: The data encoding scheme and an adaptive training strategy that accounts for the characteristics of seismic data are crucial for pretraining visual seismic foundation models, and ADATG offers a new direction for designing such models. Abstract: Due to the emergence and homogenization of Artificial Intelligence (AI) technology development, transformer-based foundation models have revolutionized scientific applications, such as drug discovery, materials research, and astronomy. However, seismic data presents unique characteristics that require specialized processing techniques for pretraining foundation models in seismic contexts with high- and low-frequency features playing crucial roles. Existing vision transformers (ViTs) with sequential tokenization ignore the intrinsic pattern and fail to grasp both the high- and low-frequency seismic information efficiently and effectively. This work introduces a novel adaptive two-grid foundation model training strategy (ADATG) with Hilbert encoding specifically tailored for seismogram data, leveraging the hierarchical structures inherent in seismic data. Specifically, our approach employs spectrum decomposition to separate high- and low-frequency components and utilizes hierarchical Hilbert encoding to represent the data effectively. Moreover, motivated by the frequency principle observed in ViTs, we propose an adaptive training strategy that initially emphasizes coarse-level information and then progressively refines the model's focus on fine-level features. Our extensive experiments demonstrate the effectiveness and efficiency of our training methods. This research highlights the importance of data encoding and training strategies informed by the distinct characteristics of high- and low-frequency features in seismic images, ultimately contributing to the enhancement of visual seismic foundation model pretraining.
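As a rough illustration of why a Hilbert ordering differs from sequential (raster) tokenization, the sketch below maps grid cells to positions along a Hilbert curve using the standard bit-manipulation construction. It shows only the curve ordering; how the authors build their hierarchical encoding on top of it is not described in the abstract, and the function names here are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(n):
    """All grid cells sorted by Hilbert index, e.g. an ordering for image patches."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))

order = hilbert_order(8)
```

Consecutive cells in `order` are always grid neighbours, so patches that are close in the token sequence are also close in the image, which raster order does not guarantee at row boundaries.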

[54] Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video

Filippo Cenacchi,Longbing Cao,Mitchell McEwan,Deborah Richards

Main category: cs.CV

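TL;DR: This paper proposes a passive dementia screening method based on facial micro-dynamics analysis, which captures natural facial behavior from short camera-facing videos without relying on language or scripts, making it suitable for detecting early neurocognitive change at scale across devices and cultures.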

Details Motivation: Existing dementia screening mostly relies on speech or structured interviews, which limits use outside clinics and ties predictions to language ability. This paper explores whether early dementia can be detected without language, using temporal facial dynamics such as blinks, mouth movements, and gaze changes. Method: Facial signals are stabilized, small facial movements are converted into interpretable time series, and after smoothing, statistical features are extracted over short time windows; each window is encoded by its activity mix (the relative share of motion across streams), so the model attends to the distribution of motion rather than its magnitude, which improves interpretability; lightweight shallow classifiers are validated on the newly built YT DemTalk dataset. Result: On YT DemTalk (300 video clips, 150 with self-reported dementia and 150 controls), ablations show that gaze lability and mouth/jaw dynamics are the most informative cues; lightweight shallow classifiers reach an AUROC of 0.953, an Average Precision (AP) of 0.961, an F1-score of 0.851, and an accuracy of 0.857. Conclusion: Non-verbal facial micro-dynamics alone enable high-accuracy dementia prediction, and the proposed method supports passive screening that is intervention-free, large-scale, and set in real-world conditions, with good potential across devices and cultures. Abstract: We target passive dementia screening from short camera-facing talking-head video, developing a facial temporal micro-dynamics analysis for language-free detection of early neurocognitive change. This enables unscripted, in-the-wild video analysis at scale to capture natural facial behaviors, transferable across devices, topics, and cultures without active intervention by clinicians or researchers during recording. Most existing resources prioritize speech or scripted interviews, limiting use outside clinics and coupling predictions to language and transcription. In contrast, we identify and analyze whether temporal facial kinematics, including blink dynamics, small mouth-jaw motions, gaze variability, and subtle head adjustments, are sufficient for dementia screening without speech or text. By stabilizing facial signals, we convert these micro-movements into interpretable facial microdynamic time series, smooth them, and summarize short windows into compact clip-level statistics for screening. Each window is encoded by its activity mix (the relative share of motion across streams), thus the predictor analyzes the distribution of motion across streams rather than its magnitude, making per-channel effects transparent. We also introduce YT DemTalk, a new dataset curated from publicly available, in-the-wild camera-facing videos. It contains 300 clips (150 with self-reported dementia, 150 controls) to test our model and offer a first benchmarking of the corpus.
On YT DemTalk, ablations identify gaze lability and mouth/jaw dynamics as the most informative cues, and lightweight shallow classifiers attain a dementia prediction performance of 0.953 AUROC, 0.961 Average Precision (AP), 0.851 F1-score, and 0.857 accuracy.
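The activity-mix encoding described above (the relative share of motion across streams within a window) can be sketched as follows. The stream names, the use of summed absolute values as "motion", and the uniform fallback for a motionless window are all assumptions made for illustration, not details taken from the paper.

```python
def activity_mix(window):
    """Relative share of motion per stream within one time window.

    `window` maps a stream name to a list of per-frame motion values.
    """
    energy = {name: sum(abs(v) for v in values) for name, values in window.items()}
    total = sum(energy.values())
    if total == 0.0:
        # Assumed convention: a motionless window gets a uniform mix.
        return {name: 1.0 / len(energy) for name in energy}
    return {name: e / total for name, e in energy.items()}

# Made-up window with four hypothetical streams.
window = {
    "blink": [0.0, 1.0, 0.0, 1.0],
    "mouth_jaw": [0.5, 0.5, 0.5, 0.5],
    "gaze": [1.0, 1.0, 1.0, 1.0],
    "head": [0.0, 0.0, 0.0, 0.0],
}
mix = activity_mix(window)

# Scaling every stream by the same factor leaves the mix unchanged.
louder = {name: [3.0 * v for v in values] for name, values in window.items()}
mix_louder = activity_mix(louder)
```

The scaled window yields the same mix, which is the sense in which the encoding reflects the distribution of motion rather than its magnitude.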

[55] Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

Xinxin Liu,Zhaopan Xu,Kai Wang,Yong Jae Lee,Yuzhang Shang

Main category: cs.CV

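TL;DR: This paper introduces Gen-ViRe, the first visual reasoning benchmark for video generation models, which evaluates cognitive ability in Chain-of-Frames (CoF) reasoning and reveals the gap between visual quality and actual reasoning depth in current models.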

Details Motivation: Existing video generation models can simulate the dynamics of the physical world, but their Chain-of-Frames reasoning has not been evaluated systematically, particularly for multi-step planning, algorithmic logic, and abstract pattern extrapolation. A benchmark that measures genuine cognitive ability is needed. Method: Grounded in cognitive science, the Gen-ViRe benchmark decomposes CoF reasoning into six cognitive dimensions and 24 subtasks, and evaluates models quantitatively using multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted assessment. Result: Experiments show that current state-of-the-art video models excel in visual quality but fall clearly short on deeper reasoning tasks, exposing the limits of their reasoning ability. Conclusion: Gen-ViRe provides the first systematic framework for evaluating the reasoning ability of video models, reveals their cognitive shortcomings, and offers baselines and directions for building AI systems that can truly simulate and reason about the world. Abstract: While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.

[56] RSPose: Ranking Based Losses for Human Pose Estimation

Muhammed Can Keles,Bedrettin Cetinkaya,Sinan Kalkan,Emre Akbas

Main category: cs.CV

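TL;DR: This paper proposes ranking-based losses (RSPose) to address three problems with heatmaps in human pose estimation: the MSE loss does not focus on keypoint localization, heatmaps are spatially and class-wise imbalanced, and the loss is inconsistent with the mAP evaluation metric. The proposed losses increase the correlation between confidence and localization quality and significantly improve mAP across multiple datasets and models.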

Details Motivation: Existing heatmap losses such as MSE do not effectively improve keypoint localization accuracy and are inconsistent with the final evaluation metric, mAP, so training and evaluation are misaligned. Heatmaps also suffer from severe spatial and class-wise imbalance, which hurts model performance. Method: New ranking-based losses strengthen the correlation between the confidence of predicted heatmap peaks and their actual localization quality, bringing the loss closer to the mAP objective. The method applies to both one-dimensional and two-dimensional heatmaps and to different architectures (such as ViTPose-H and SimCC). Result: RSPose reaches 79.9 mAP on the COCO-val set with ViTPose-H, surpassing the previous best method, and improves SimCC ResNet-50 by 1.5 AP to 73.6 AP. Its effectiveness is verified on three datasets: COCO, CrowdPose, and MPII. Conclusion: The proposed ranking losses align better with the mAP metric, address several shortcomings of traditional heatmap losses, and significantly improve human pose estimation; according to the authors, this is the first loss design explicitly aligned with the mAP objective. Abstract: While heatmap-based human pose estimation methods have shown strong performance, they suffer from three main problems: (P1) the commonly used Mean Squared Error (MSE) loss may not always improve joint localization because it penalizes all pixel deviations equally, without focusing explicitly on sharpening and correctly localizing the peak corresponding to the joint; (P2) heatmaps are spatially and class-wise imbalanced; and, (P3) there is a discrepancy between the evaluation metric (i.e., mAP) and the loss functions. We propose ranking-based losses to address these issues. Both theoretically and empirically, we show that our proposed losses are superior to commonly used heatmap losses (MSE, KL-Divergence). Our losses considerably increase the correlation between confidence scores and localization qualities, which is desirable because higher correlation leads to more accurate instance selection during Non-Maximum Suppression (NMS) and better Average Precision (mAP) performance. We refer to the models trained with our losses as RSPose. We show the effectiveness of RSPose across two different modes: one-dimensional and two-dimensional heatmaps, on three different datasets (COCO, CrowdPose, MPII). To the best of our knowledge, we are the first to propose losses that align with the evaluation metric (mAP) for human pose estimation. RSPose outperforms the previous state of the art on the COCO-val set and achieves an mAP score of 79.9 with ViTPose-H, a vision transformer model for human pose estimation.
We also improve SimCC ResNet-50, a coordinate classification-based pose estimation method, by 1.5 AP on the COCO-val set, achieving 73.6 AP.

[57] Segmenting Collision Sound Sources in Egocentric Videos

Kranti Kumar Parida,Omar Emara,Hazel Doughty,Dima Damen

Main category: cs.CV

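TL;DR: This paper introduces the new task of Collision Sound Source Segmentation (CS3), which segments the objects in a video that produce a collision sound, conditioned on the audio, and proposes a weakly-supervised method that uses foundation models (such as CLIP and SAM2) together with egocentric cues, significantly outperforming baselines on two new benchmarks.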

Details Motivation: Inspired by human multisensory perception, the goal is to identify the visual objects that correspond to a collision sound, and in particular to accurately segment the sound-producing objects in egocentric video. Method: A weakly-supervised, audio-conditioned segmentation method uses foundation models such as CLIP and SAM2 and incorporates egocentric cues (such as objects in hands) to localize likely collision sound sources. Result: On the two newly built benchmarks, EPIC-CS3 and Ego4D-CS3, the method outperforms competitive baselines in mIoU by 3x and 4.7x, respectively. Conclusion: The proposed CS3 task and method combine audio and visual information effectively, using foundation models and contextual cues to segment sound-source objects more accurately in egocentric video. Abstract: Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by $3\times$ and $4.7\times$ in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.

[58] GRLoc: Geometric Representation Regression for Visual Localization

Changyang Li,Xuejian Ma,Lixiang Liu,Zhan Li,Qingan Yan,Yi Xu

Main category: cs.CV

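TL;DR: Proposes a new paradigm, Geometric Representation Regression (GRR), which improves absolute pose estimation by regressing 3D geometric representations directly from the image and achieves state-of-the-art performance on the 7-Scenes and Cambridge Landmarks datasets.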

Details Motivation: Existing Absolute Pose Regression (APR) models usually operate as black boxes, struggle to understand 3D scene geometry, and tend to memorize training views, which leads to poor generalization. Method: Inspired by novel view synthesis, APR is reformulated as its inverse: regressing the underlying 3D geometric representations from the image. The model explicitly predicts two disentangled geometric representations in the world coordinate system: (1) ray bundle directions for estimating camera rotation; (2) a corresponding pointmap for estimating camera translation. A differentiable deterministic solver then recovers the 6-DoF camera pose. Result: The method achieves state-of-the-art performance on the 7-Scenes and Cambridge Landmarks datasets, validating its effectiveness for generalizable absolute pose estimation. Conclusion: By explicitly decoupling rotation and translation prediction and introducing a geometric prior into the network, GRR offers a more robust and generalizable path for absolute pose estimation. Abstract: Absolute Pose Regression (APR) has emerged as a compelling paradigm for visual localization. However, APR models typically operate as black boxes, directly regressing a 6-DoF pose from a query image, which can lead to memorizing training views rather than understanding 3D scene geometry. In this work, we propose a geometrically-grounded alternative. Inspired by novel view synthesis, which renders images from intermediate geometric representations, we reformulate APR as its inverse that regresses the underlying 3D representations directly from the image, and we name this paradigm Geometric Representation Regression (GRR). Our model explicitly predicts two disentangled geometric representations in the world coordinate system: (1) a ray bundle's directions to estimate camera rotation, and (2) a corresponding pointmap to estimate camera translation. The final 6-DoF camera pose is then recovered from these geometric components using a differentiable deterministic solver. This disentangled approach, which separates the learned visual-to-geometry mapping from the final pose calculation, introduces a strong geometric prior into the network. We find that the explicit decoupling of rotation and translation predictions measurably boosts performance. We demonstrate state-of-the-art performance on 7-Scenes and Cambridge Landmarks datasets, validating that modeling the inverse rendering process is a more robust path toward generalizable absolute pose estimation.

[59] H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction

Xueyang Li,Zongren Wang,Yuliang Zhang,Zixuan Pan,Yu-Jen Chen,Nishchal Sapkota,Gelei Xu,Danny Z. Chen,Yiyu Shi

Main category: cs.CV

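TL;DR: This paper presents a multi-sequence MRI dataset for bladder cancer recurrence prediction and a new hierarchical gated attention multi-branch model, H-CNN-ViT, which combines the strengths of CNNs and ViTs, applies selective weighting during feature fusion, and significantly improves recurrence detection.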

Details Motivation: Post-surgical tissue changes (such as scarring and swelling) make MRI interpretation difficult, and the lack of dedicated multi-sequence MRI datasets has held back AI for bladder cancer recurrence prediction. A dedicated dataset and a more effective model are therefore needed. Method: The H-CNN-ViT model uses a multi-branch structure to process each MRI modality independently, combines a CNN for local features with a ViT for global context, and uses a gated attention mechanism for context-adaptive feature fusion. A multi-sequence, multi-modal MRI dataset is also built for training and evaluation. Result: On the constructed dataset, H-CNN-ViT reaches an AUC of 78.6%, outperforming existing state-of-the-art models. Conclusion: H-CNN-ViT fuses multi-sequence MRI information effectively and improves the accuracy of bladder cancer recurrence prediction, offering a promising AI-assisted tool for post-operative monitoring; the model has been released publicly to support follow-up research. Abstract: Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment study. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
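The "selective weighting of features from the global (ViT) and local (CNN) paths" can be pictured with a generic gated fusion. The sketch below uses a single scalar sigmoid gate over two equal-length feature vectors; this is a textbook form chosen for illustration, not the authors' hierarchical gated attention, and the weights, bias, and feature values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(local_feat, global_feat, weights, bias):
    """Blend two equal-length feature vectors with one learned scalar gate.

    gate = sigmoid(w . [local; global] + b)
    fused = gate * global + (1 - gate) * local
    """
    concat = local_feat + global_feat
    gate = sigmoid(sum(w * f for w, f in zip(weights, concat)) + bias)
    fused = [gate * g + (1.0 - gate) * l for l, g in zip(local_feat, global_feat)]
    return fused, gate

# Made-up CNN-like (local) and ViT-like (global) features.
local_feat = [1.0, 0.0]
global_feat = [0.0, 2.0]

# Zero weights and bias give a neutral gate of 0.5, i.e. a plain average.
fused_even, gate_even = gated_fusion(local_feat, global_feat, [0.0] * 4, 0.0)

# A large positive bias pushes the gate toward the global branch.
fused_global, gate_global = gated_fusion(local_feat, global_feat, [0.0] * 4, 50.0)
```

In a trained model the gate would depend on the input through the weights, so different scans could lean on local or global evidence to different degrees.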

[60] QwenCLIP: Boosting Medical Vision-Language Pretraining via LLM Embeddings and Prompt tuning

Xiaoyang Wei,Camille Kurtz,Florence Cloppet

Main category: cs.CV

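TL;DR: Proposes QwenCLIP, which replaces the CLIP text encoder with a large language model to better handle long medical reports and improve medical image-text alignment and downstream task performance.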

Details Motivation: CLIP's text encoder accepts only 77 tokens, which makes it hard to process long, information-rich radiology reports; existing medical encoders (such as PubMedBERT) are limited by input length and depth of semantic understanding. Method: The CLIP text encoder is replaced with an LLM-based embedding module (e.g., Qwen3-Embedding), and learnable prompts are introduced to strengthen cross-modal alignment. Result: Medical image-text alignment and downstream task performance improve significantly on radiology benchmarks, thanks to the longer context window and richer semantic representations of LLMs. Conclusion: QwenCLIP overcomes the limitations of traditional models in handling long text and deep semantics, providing a stronger framework for medical vision-language tasks. Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated strong generalization for vision-language tasks in computer vision and medical domains, yet its text encoder accepts only up to 77 tokens, which limits its ability to represent long and information-rich radiology reports. Recent adaptations using domain-specific encoders, such as PubMedBERT or ClinicalBERT, mitigate this issue by leveraging medical corpora, but remain constrained by their limited input length (typically 512 tokens) and relatively shallow semantic understanding. To address these limitations, we propose QwenCLIP, a vision-language framework that replaces CLIP's text encoder with a large language model (LLM)-based embedding module (e.g., Qwen3-Embedding) and introduces learnable prompts to enhance cross-modal alignment. By leveraging the extended context window and richer representations of LLMs, QwenCLIP captures comprehensive medical semantics from long-form clinical text, substantially improving medical image-text alignment and downstream performance on radiology benchmarks. Our code is publicly available at https://github.com/Wxy-24/QwenCLIP.

[61] Hybrid Convolution Neural Network Integrated with Pseudo-Newton Boosting for Lumbar Spine Degeneration Detection

Pandiyaraju V,Abishek Karthik,Jaspin K,Kannan A,Jaime Lloret

Main category: cs.CV

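TL;DR: Proposes an enhanced model architecture that combines EfficientNet and VGG19 and adds a Pseudo-Newton Boosting layer and a Sparsity-Induced Feature Reduction layer for classifying lumbar spine degeneration from DICOM images, with a significant performance gain.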

Details Motivation: To overcome the insufficient and redundant feature representations of traditional transfer learning on high-dimensional medical images, and to improve the accuracy and robustness of automatic lumbar spine degeneration classification. Method: A hybrid architecture fuses EfficientNet and VGG19, adds a Pseudo-Newton Boosting layer to refine feature weights, and uses a Sparsity-Induced Feature Reduction layer to remove redundant features. Result: The model reaches a precision of 0.9, recall of 0.861, F1 score of 0.88, loss of 0.18, and accuracy of 88.1%, outperforming the EfficientNet baseline. Conclusion: The architecture improves medical image classification performance and offers a better solution for automated diagnosis of lumbar spine degeneration. Abstract: This paper proposes a new enhanced model architecture to perform classification of lumbar spine degeneration with DICOM images while using a hybrid approach, integrating EfficientNet and VGG19 together with custom-designed components. The proposed model is differentiated from traditional transfer learning methods as it incorporates a Pseudo-Newton Boosting layer along with a Sparsity-Induced Feature Reduction Layer that forms a multi-tiered framework, further improving feature selection and representation. The Pseudo-Newton Boosting layer makes smart variations of feature weights, with more detailed anatomical features, which are mostly left out in a transfer learning setup. In addition, the Sparsity-Induced Layer removes redundancy for learned features, producing lean yet robust representations for pathology in the lumbar spine. This architecture is novel as it overcomes the constraints in the traditional transfer learning approach, especially in the high-dimensional context of medical images, and achieves a significant performance boost, reaching a precision of 0.9, recall of 0.861, F1 score of 0.88, loss of 0.18, and an accuracy of 88.1%, compared to the baseline model, EfficientNet. This work will present the architectures, preprocessing pipeline, and experimental results. The results contribute to the development of automated diagnostic tools for medical images.

[62] VLMs Guided Interpretable Decision Making for Autonomous Driving

Xin Hu,Taotao Jing,Renran Tian,Zhengming Ding

Main category: cs.CV

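TL;DR: Proposes a new approach that shifts vision-language models (VLMs) from direct decision generators to semantic enhancers, using a multi-modal interactive architecture and a post-hoc refinement module to make autonomous driving decisions more reliable and interpretable.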

Details Motivation: Existing VLM-based decision-making methods for autonomous driving depend on handcrafted prompts and perform inconsistently, lacking robustness and generalization in real-world scenarios. Method: VLMs are used for semantic scene understanding, generating structured natural-language descriptions that enrich the visual input; a multi-modal interactive architecture fuses visual and linguistic features, and a VLM-based post-hoc refinement module improves the predictions. Result: The approach achieves state-of-the-art performance on two autonomous driving benchmarks and significantly improves decision accuracy and interpretability. Conclusion: Using VLMs as semantic enhancement tools rather than direct decision makers integrates them into autonomous driving systems more effectively and reliably, and points to a new direction for interpretable AI-driven AD. Abstract: Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions. Building on this enriched representation, we introduce a multi-modal interactive architecture that fuses visual and linguistic features for more accurate decision-making and interpretable textual explanations. Furthermore, we design a post-hoc refinement module that utilizes VLMs to enhance prediction reliability. Extensive experiments on two autonomous driving benchmarks demonstrate that our approach achieves state-of-the-art performance, offering a promising direction for integrating VLMs into reliable and interpretable AD systems.

[63] Revisiting Data Scaling Law for Medical Segmentation

Yuetan Chu,Zhongyi Han,Gongning Luo,Xin Gao

Main category: cs.CV

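TL;DR: This study examines data scaling laws in medical anatomical segmentation, verifies across 15 semantic tasks and 4 imaging modalities that larger datasets significantly improve performance, and proposes a scalable image augmentation method that generates diffeomorphic mappings from a geodesic subspace, which markedly improves data utilization efficiency, accelerates convergence, and surpasses standard power-law scaling trends.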

Details Motivation: The relationship between data size and model performance in medical anatomical segmentation is still unclear, and the role of topological structural similarity in data augmentation has not been studied. Method: Scaling laws are analyzed across 15 tasks and 4 modalities; the effect of elastic deformation and registration-guided deformation on scaling laws is evaluated; a new augmentation method generates diffeomorphic mappings from a geodesic subspace based on image registration. Result: Deformation-guided augmentation significantly improves data utilization efficiency; the proposed generated-deformation method outperforms traditional methods in both performance and convergence speed, exceeding standard power-law scaling. Conclusion: Deformation-based augmentation that respects topological consistency improves the scaling behavior of medical image segmentation, reduces dependence on annotated data and compute, and supports more efficient model development. Abstract: The population loss of trained deep neural networks often exhibits power law scaling with the size of the training dataset, guiding significant performance advancements in deep learning applications. In this study, we focus on the scaling relationship with data size in the context of medical anatomical segmentation, a domain that remains underexplored. We analyze scaling laws for anatomical segmentation across 15 semantic tasks and 4 imaging modalities, demonstrating that larger datasets significantly improve segmentation performance, following similar scaling trends. Motivated by the topological isomorphism in images sharing anatomical structures, we evaluate the impact of deformation-guided augmentation strategies on data scaling laws, specifically random elastic deformation and registration-guided deformation. We also propose a novel, scalable image augmentation approach that generates diffeomorphic mappings from geodesic subspace based on image registration to introduce realistic deformation. Our experimental results demonstrate that both registered and generated deformation-based augmentation considerably enhance data utilization efficiency. The proposed generated deformation method notably achieves superior performance and accelerated convergence, surpassing standard power law scaling trends without requiring additional data. Overall, this work provides insights into the understanding of segmentation scalability and topological variation impact in medical imaging, thereby leading to more efficient model development with reduced annotation and computational costs.
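A power-law relation between loss and dataset size, loss = a * size^(-b), becomes a straight line in log-log space, so its parameters can be estimated with ordinary least squares. The sketch below fits synthetic, noise-free points; the numbers are invented for illustration and are not results from the paper, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by least squares on (log size, log loss)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic points generated from a = 2.0, b = 0.5 (made-up values).
sizes = [100, 200, 400, 800, 1600]
losses = [2.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, losses)
predicted_at_3200 = a * 3200 ** (-b)
```

Under this reading, "surpassing standard power law scaling trends" would mean measured losses that fall below the line extrapolated from such a fit at the same dataset size.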

[64] Uni-Hema: Unified Model for Digital Hematopathology

Abdul Rehman,Iqra Rasool,Ayesha Imran,Mohsen Ali,Waqas Sultani

Main category: cs.CV

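TL;DR: This paper presents Uni-Hema, a unified multi-task model for digital hematopathology that integrates detection, classification, segmentation, morphology prediction, and cross-disease reasoning. Built on 46 public datasets, it performs well across hematological tasks and provides interpretable morphological insights at the single-cell level.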

Details Motivation: Existing methods cannot provide unified multi-task, multi-modal reasoning in digital hematopathology, which limits combined analysis of multiple diseases (such as leukemia, malaria, and sickle cell disease). Method: The Uni-Hema model is built on the Hema-Former multimodal module, which fuses visual and textual representations and supports multi-task learning for detection, classification, segmentation, morphology prediction, and visual question answering; it is trained and validated on 46 public datasets (over 700K images and 21K question-answer pairs). Result: Experiments show that Uni-Hema matches or outperforms single-task, single-dataset models across hematological tasks while providing fine-grained, morphologically relevant interpretability. Conclusion: Uni-Hema sets a new standard for multi-task and multi-modal analysis in digital hematopathology and has broad potential for clinical use. Abstract: Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Whether single-task, vision-language, WSI-optimized, or single-cell hematology models, these approaches share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and textual representations at the hierarchy level for the different tasks (detection, classification, segmentation, morphology, masked language modeling, and visual question answering) at different granularity. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance to models trained on a single task and a single dataset, across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.

[65] Weakly Supervised Ephemeral Gully Detection In Remote Sensing Images Using Vision Language Models

Seyed Mohamad Ali Tousi,John A. Lory,G. N. DeSouza

Main category: cs.CV

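TL;DR: Proposes the first weakly supervised pipeline for detecting ephemeral gullies in agricultural fields, using vision-language models (VLMs) and a teacher-student structure with a noise-aware loss function to greatly reduce the manual labeling burden, and releases the first dataset for semi-supervised detection of ephemeral gullies from remote sensing images.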

Details Motivation: Ephemeral gullies have short life cycles, and labeled data is scarce and hard to obtain, so traditional computer vision and machine learning methods struggle to detect them, and existing zero-shot approaches are hard to implement. An automatic detection method that depends less on manual labeling is therefore needed. Method: A weakly supervised detection pipeline based on remote sensing and VLMs: the VLM first generates weak labels as input to a teacher model, the teacher learns from these noisy labels, and a noise-aware loss function then guides weakly supervised training of a student model. Result: Experiments show that the student model trained with weak supervision outperforms both direct use of the VLM and the label model itself; the dataset contains more than 18,000 high-resolution remote sensing images covering multiple locations over 13 years, with expert annotations and a large amount of unlabeled data. Conclusion: The proposed weakly supervised framework makes effective use of VLM prior knowledge and unlabeled data, significantly improves ephemeral gully detection, and offers a scalable, practical solution for monitoring agricultural soil erosion; the code and dataset are publicly available. Abstract: Among soil erosion problems, Ephemeral Gullies are one of the most concerning phenomena occurring in agricultural fields. Their short temporal cycles increase the difficulty in automatically detecting them using classical computer vision approaches and remote sensing. Also, due to scarcity of and the difficulty in producing accurate labeled data, automatic detection of ephemeral gullies using Machine Learning is limited to zero-shot approaches which are hard to implement. To overcome these challenges, we present the first weakly supervised pipeline for detection of ephemeral gullies. Our method relies on remote sensing and uses Vision Language Models (VLMs) to drastically reduce the labor-intensive task of manual labeling. In order to achieve that, the method exploits: 1) the knowledge embedded in the VLM's pretraining; 2) a teacher-student model where the teacher learns from noisy labels coming from the VLMs, and the student learns by weak supervision using teacher-generated labels and a noise-aware loss function. We also make available the first-of-its-kind dataset for semi-supervised detection of ephemeral gullies from remote-sensed images. The dataset consists of a number of locations labeled by a group of soil and plant scientists, as well as a large number of unlabeled locations. The dataset represents more than 18,000 high-resolution remote-sensing images obtained over the course of 13 years.
Our experimental results demonstrate the validity of our approach by showing superior performance compared to VLMs and the label model itself when using weak supervision to train a student model. The code and dataset for this work are made publicly available.

[66] Temporal Realism Evaluation of Generated Videos Using Compressed-Domain Motion Vectors

Mert Onur Cakiroglu,Idil Bilge Altun,Zhihe Lu,Mehmet Dalkilic,Hasan Kurban

Main category: cs.CV

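TL;DR: Proposes a scalable, model-agnostic framework based on motion vectors (MVs) from compressed video streams to evaluate the temporal realism of generated videos, and uses several statistical measures and visualizations to reveal systematic motion defects in existing generative models.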

Details Motivation: Evaluation of temporal realism in current generative video models is limited because traditional metrics are not sensitive enough to motion and lack an effective way to capture temporal dynamics. Method: Motion vectors (MVs) are extracted from coding standards such as H.264 and HEVC; KL, JS, and Wasserstein divergences compare MV statistics of real and generated videos; MV and RGB are also combined in a multi-modal fusion analysis. Result: On the GenVidBench dataset, Pika and SVD are closest to real motion, VC2 and Text2Video-Zero do better on MV-sum statistics, and CogVideo deviates the most; visualizations reveal center bias, sparse flows, and grid artifacts; adding MVs raises classification accuracy to as high as 99.0%. Conclusion: Motion vectors in the compressed domain are an effective temporal signal for diagnosing motion defects in generated videos and for strengthening temporal reasoning in discriminative models. Abstract: Temporal realism remains a central weakness of current generative video models, as most evaluation metrics prioritize spatial appearance and offer limited sensitivity to motion. We introduce a scalable, model-agnostic framework that assesses temporal behavior using motion vectors (MVs) extracted directly from compressed video streams. Codec-generated MVs from standards such as H.264 and HEVC provide lightweight, resolution-consistent descriptors of motion dynamics. We quantify realism by computing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between MV statistics of real and generated videos. Experiments on the GenVidBench dataset containing videos from eight state-of-the-art generators reveal systematic discrepancies from real motion: entropy-based divergences rank Pika and SVD as closest to real videos, MV-sum statistics favor VC2 and Text2Video-Zero, and CogVideo shows the largest deviations across both measures. Visualizations of MV fields and class-conditional motion heatmaps further reveal center bias, sparse and piecewise constant flows, and grid-like artifacts that frame-level metrics do not capture. Beyond evaluation, we investigate MV-RGB fusion through channel concatenation, cross-attention, joint embedding, and a motion-aware fusion module. Incorporating MVs improves downstream classification across ResNet, I3D, and TSN backbones, with ResNet-18 and ResNet-34 reaching up to 97.4% accuracy and I3D achieving 99.0% accuracy on real-versus-generated discrimination.
These findings demonstrate that compressed-domain MVs provide an effective temporal signal for diagnosing motion defects in generative videos and for strengthening temporal reasoning in discriminative models. The implementation is available at: https://github.com/KurbanIntelligenceLab/Motion-Vector-Learning
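The three divergences named in the abstract can be computed between two histograms of motion-vector magnitudes as in the sketch below. It assumes both histograms share the same unit-width bins and uses made-up counts; how the authors actually bin or aggregate MV statistics is not specified here.

```python
import math

def normalize(hist):
    """Turn raw bin counts into a probability distribution."""
    total = float(sum(hist))
    return [h / total for h in hist]

def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def js_divergence(p, q):
    """Symmetric, always finite; bounded above by ln 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def wasserstein_1d(p, q):
    """W1 on shared unit-width bins: sum of absolute CDF differences."""
    cdf_p = cdf_q = 0.0
    distance = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        distance += abs(cdf_p - cdf_q)
    return distance

# Made-up MV-magnitude histograms (counts per bin).
real = normalize([40, 30, 20, 10])
generated = normalize([70, 20, 10, 0])

kl = kl_divergence(real, generated)   # infinite: generated has an empty bin
js = js_divergence(real, generated)
w1 = wasserstein_1d(real, generated)
```

The example also shows one practical reason to report several measures: KL blows up on an empty bin in the generated histogram, while JS and Wasserstein stay finite and comparable across generators.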

[67] SAE-MCVT: A Real-Time and Scalable Multi-Camera Vehicle Tracking Framework Powered by Edge Computing

Yuqiang Lin,Sam Lockyer,Florian Stanek,Markus Zarbock,Adrian Evans,Wenbin Li,Nic Zhang

Main category: cs.CV

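TL;DR: Proposes SAE-MCVT, the first scalable real-time multi-camera vehicle tracking framework, which combines edge computing with centralized association to achieve both real-time operation and high accuracy.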

Details Motivation: Existing MCVT methods focus on accuracy but neglect real-time performance and scalability, which makes them hard to apply in real city-scale scenarios. Method: Edge devices process the video streams and extract lightweight metadata; a central workstation performs cross-camera association based on spatial-temporal constraints and a self-supervised camera link model. Result: On the RoundaboutHD dataset, the system processes 2K 15 FPS video streams in real time and achieves an IDF1 score of 61.2. Conclusion: SAE-MCVT is the first scalable real-time MCVT framework suitable for city-scale deployment, balancing efficiency and performance. Abstract: In modern Intelligent Transportation Systems (ITS), cameras are a key component due to their ability to provide valuable information for multiple stakeholders. A central task is Multi-Camera Vehicle Tracking (MCVT), which generates vehicle trajectories and enables applications such as anomaly detection, traffic density estimation, and suspect vehicle tracking. However, most existing studies on MCVT emphasize accuracy while overlooking real-time performance and scalability. These two aspects are essential for real-world deployment and become increasingly challenging in city-scale applications as the number of cameras grows. To address this issue, we propose SAE-MCVT, the first scalable real-time MCVT framework. The system includes several edge devices that interact with one central workstation separately. On the edge side, live RTSP video streams are serialized and processed through modules including object detection, object tracking, geo-mapping, and feature extraction. Only lightweight metadata -- vehicle locations and deep appearance features -- are transmitted to the central workstation. On the central side, cross-camera association is calculated under the constraint of spatial-temporal relations between adjacent cameras, which are learned through a self-supervised camera link model. Experiments on the RoundaboutHD dataset show that SAE-MCVT maintains real-time operation on 2K 15 FPS video streams and achieves an IDF1 score of 61.2. To the best of our knowledge, this is the first scalable real-time MCVT framework suitable for city-scale deployment.
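
The central-side association step can be pictured with a small sketch. It assumes that the camera link model has already been reduced to a travel-time window per adjacent camera pair and that appearance features are compared by cosine similarity; the dictionary keys, the `link_windows` structure, and the `min_sim` threshold are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def associate(exit_track, entry_tracks, link_windows, min_sim=0.5):
    """Return the id of the best-matching entry track, or None.

    link_windows maps (from_cam, to_cam) -> (min_seconds, max_seconds);
    a missing key means the two cameras are treated as non-adjacent.
    """
    best_id, best_sim = None, min_sim
    for cand in entry_tracks:
        window = link_windows.get((exit_track["cam"], cand["cam"]))
        if window is None:
            continue  # spatial constraint: no link between these cameras
        travel_time = cand["t_enter"] - exit_track["t_exit"]
        if not (window[0] <= travel_time <= window[1]):
            continue  # temporal constraint: implausible travel time
        sim = cosine(exit_track["feat"], cand["feat"])
        if sim > best_sim:
            best_id, best_sim = cand["id"], sim
    return best_id

exit_track = {"cam": "A", "t_exit": 10.0, "feat": [1.0, 0.0]}
entries = [
    {"id": 1, "cam": "B", "t_enter": 14.0, "feat": [0.9, 0.1]},
    {"id": 2, "cam": "B", "t_enter": 60.0, "feat": [1.0, 0.0]},
    {"id": 3, "cam": "C", "t_enter": 14.0, "feat": [1.0, 0.0]},
]
windows = {("A", "B"): (2.0, 8.0)}
print(associate(exit_track, entries, windows))
```

Filtering by the spatial-temporal window before comparing appearance keeps the number of feature comparisons small, which is one plausible reason such constraints help a central workstation scale with the number of cameras.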

[68] Mind the Gap: Evaluating LLM Understanding of Human-Taught Road Safety Principles

Chalamalasetti Kranti

Main category: cs.CV

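TL;DR: Evaluates how well multi-modal large language models understand traffic signs and road safety norms, finds that they perform poorly in a zero-shot setting, and reveals a gap between models and humans in safety reasoning.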

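Details Motivation: Ensuring that AI systems such as autonomous vehicles correctly understand and follow road safety norms is key to traffic safety, yet existing work lacks a systematic evaluation of multi-modal large language models in this respect. Method: A small dataset of traffic signs and road-safety diagrams is built from school textbooks and used to evaluate multi-modal large language models in a zero-shot setting. Result: Current multi-modal large language models perform poorly on road safety reasoning and struggle to interpret traffic signs and the safety logic behind them. Conclusion: Multi-modal large language models have notable deficiencies in road safety understanding, and further research is needed to narrow the gap between model interpretation and human cognition. Abstract: Following road safety norms is non-negotiable not only for humans but also for the AI systems that govern autonomous vehicles. In this work, we evaluate how well multi-modal large language models (LLMs) understand road safety concepts, specifically through schematic and illustrative representations. We curate a pilot dataset of images depicting traffic signs and road-safety norms sourced from school textbooks and use it to evaluate models' capabilities in a zero-shot setting. Our preliminary results show that these models struggle with safety reasoning and reveal gaps between human learning and model interpretation. We further provide an analysis of these performance gaps for future research.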

[69] Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding

Qingyang Yan,Guangyao Chen,Yixiong Zou

Main category: cs.CV

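TL;DR: Proposes Curriculum-based Relative Policy Optimization (CuRPO), which uses chain-of-thought length and gIoU rewards as complexity indicators to train visual grounding models progressively; it significantly outperforms existing methods on multiple datasets, especially in few-shot settings and with complex language descriptions.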

Details Motivation: Chain-of-thought (CoT) reasoning fine-tuned with reinforcement learning can degrade performance on visual grounding when the reasoning becomes too long or complex, and more data does not necessarily improve performance, so a method is needed that adjusts the training order according to data complexity. Method: CuRPO uses the length of the generated CoT and gIoU as complexity measures, designs a curriculum that organizes training data from easy to hard, and combines it with relative policy optimization for reinforcement learning. Result: Experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA show that CuRPO significantly outperforms existing methods, with gains of up to +12.52 mAP, and maintains strong performance in few-shot settings, particularly on grounding tasks with ambiguous and complex descriptions. Conclusion: By introducing a complexity-based curriculum, CuRPO alleviates the performance degradation caused by long reasoning chains, improves the robustness and efficiency of visual grounding models, and offers a new optimization path for combining CoT with reinforcement learning. Abstract: Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks by explicitly generating intermediate reasoning steps. However, we find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance in Visual Grounding tasks, particularly as CoT outputs become lengthy or complex. Additionally, our analysis reveals that increased dataset size does not always enhance performance due to varying data complexities. Motivated by these findings, we propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union (gIoU) rewards as complexity indicators to progressively structure training data from simpler to more challenging examples. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and LISA datasets demonstrate the effectiveness of our approach. CuRPO consistently outperforms existing methods, including Visual-RFT, with notable improvements of up to +12.52 mAP on RefCOCO. Moreover, CuRPO exhibits exceptional efficiency and robustness, delivering strong localization performance even in few-shot learning scenarios, particularly benefiting tasks characterized by ambiguous and intricate textual descriptions. The code is released at https://github.com/qyoung-yan/CuRPO.
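
The two complexity indicators can be illustrated with a short sketch. The `giou` function follows the standard definition of generalized IoU for axis-aligned boxes. The way the two indicators are combined in `curriculum_order` (normalized CoT length plus `1 - gIoU`) is an assumption made for illustration only; the abstract does not specify the actual scoring rule.

```python
def giou(box_a, box_b):
    """Generalized IoU for boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose

def curriculum_order(samples, max_cot_len):
    """Sort samples from easy to hard under an assumed complexity score:
    longer reasoning chains and lower gIoU both count as harder."""
    def complexity(sample):
        return sample["cot_len"] / max_cot_len + (1.0 - sample["giou"])
    return sorted(samples, key=complexity)

samples = [
    {"id": "a", "cot_len": 120, "giou": 0.2},
    {"id": "b", "cot_len": 30, "giou": 0.9},
    {"id": "c", "cot_len": 60, "giou": 0.6},
]
print([s["id"] for s in curriculum_order(samples, max_cot_len=120)])
```

Unlike plain IoU, gIoU stays informative for non-overlapping boxes because the enclosing-box term penalizes distance, which makes it usable as a graded reward.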

[70] Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets

Noam Glazner,Noam Tsfaty,Sharon Shalev,Avishai Weizman

Main category: cs.CV

TL;DR: Proposes a cluster-based frame selection strategy to reduce information leakage in video-derived frame datasets.

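Details Motivation: To mitigate the information leakage caused by temporal correlation in video frame datasets and to make dataset splits more reliable. Method: Visually similar frames are clustered, and the training, validation, and test sets are then split on the basis of the clusters. Result: The method produces more representative, balanced, and reliable dataset partitions. Conclusion: The cluster-based frame selection strategy effectively mitigates information leakage and improves the quality of video-frame dataset splits. Abstract: We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.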
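
A minimal sketch of the idea, assuming each frame already carries a cluster label (for instance from clustering image embeddings, which is not shown here): whole clusters are assigned to a split so that near-duplicate frames never straddle train and test. The greedy fill rule and the `split_by_cluster` name are illustrative choices, not the authors' procedure.

```python
import random
from collections import defaultdict

def split_by_cluster(cluster_of, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to train/val/test.

    cluster_of maps frame id -> cluster id. Each cluster goes to the split
    that is currently furthest below its target size.
    """
    clusters = defaultdict(list)
    for frame, cluster_id in cluster_of.items():
        clusters[cluster_id].append(frame)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    names = ("train", "val", "test")
    targets = {name: ratio * len(cluster_of) for name, ratio in zip(names, ratios)}
    splits = {name: [] for name in names}
    for cluster_id in cluster_ids:
        neediest = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[neediest].extend(clusters[cluster_id])
    return splits

# Ten hypothetical clusters of five frames each.
cluster_of = {f"c{c}_frame{i}": c for c in range(10) for i in range(5)}
splits = split_by_cluster(cluster_of)
print({name: len(frames) for name, frames in splits.items()})
```

Because assignment happens at cluster granularity, the realized split sizes only approximate the requested ratios; that is the price of keeping similar frames together.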

[71] Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

Zachary Shinnick,Liangze Jiang,Hemanth Saratchandran,Damien Teney,Anton van den Hengel

Main category: cs.CV

TL;DR: Proposes pretraining vision transformers on procedurally generated non-visual data, improving data efficiency and downstream performance.

Details Motivation: To explore how to introduce modality-agnostic inductive biases into vision transformers, strengthening their intrinsic abstract computational ability before standard training. Method: Simple algorithms such as formal grammars generate procedural data with no visual or semantic content; the ViT is warmed up on this data while bypassing the visual patch embedding mechanism, and standard image training follows. Result: On ImageNet-1k, allocating just 1% of the training budget to procedural data improves accuracy by over 1.7%, and the method significantly improves data efficiency and convergence speed. Conclusion: Pretraining on procedural data effectively injects abstract computational priors and opens a new path toward data-efficient, domain-agnostic pretraining. Abstract: Transformers show remarkable versatility across domains, suggesting the existence of inductive biases beneficial across modalities. In this work, we explore a new way to instil such generic biases in vision transformers (ViTs) by pretraining on procedurally-generated data devoid of visual or semantic content. We generate this data with simple algorithms such as formal grammars, so the results bear no relationship to either natural or synthetic images. We use this procedurally-generated data to pretrain ViTs in a warm-up phase that bypasses their visual patch embedding mechanisms, thus encouraging the models to internalise abstract computational priors. When followed by standard image-based training, this warm-up significantly improves data efficiency, convergence speed, and downstream performance. On ImageNet-1k for example, allocating just 1% of the training budget to procedural data improves final accuracy by over 1.7%. In terms of its effect on performance, 1% procedurally generated data is thus equivalent to 28% of the ImageNet-1k data. These findings suggest a promising path toward new data-efficient and domain-agnostic pretraining strategies.
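
As an illustration of what "procedural data from a formal grammar" can look like, the sketch below samples balanced-bracket (Dyck) sequences as integer token ids. The choice of a Dyck grammar, the token encoding, and the `sample_dyck` helper are assumptions for this example; the abstract does not say which grammars the authors use.

```python
import random

def sample_dyck(n_pairs, n_types=2, rng=None):
    """Sample a balanced bracket sequence with n_pairs pairs.

    Bracket type t is encoded as token 2*t (open) and 2*t + 1 (close).
    """
    if rng is None:
        rng = random.Random(0)
    sequence, stack = [], []
    opens_left = n_pairs
    while opens_left or stack:
        # Open a new bracket if any remain and either nothing is open
        # or a coin flip says so; otherwise close the innermost bracket.
        if opens_left and (not stack or rng.random() < 0.5):
            bracket_type = rng.randrange(n_types)
            sequence.append(2 * bracket_type)
            stack.append(bracket_type)
            opens_left -= 1
        else:
            sequence.append(2 * stack.pop() + 1)
    return sequence

print(sample_dyck(4, n_types=2, rng=random.Random(1)))
```

Sequences like these carry nesting and long-range dependency structure but no visual content, so they would be fed to the transformer as token embeddings directly, consistent with the abstract's description of bypassing the patch embedding.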

[72] Single Tensor Cell Segmentation using Scalar Field Representations

Kevin I. Ruiz Vargas,Gabriel G. Galdino,Tsang Ing Ren,Alexandre L. Cunha

Main category: cs.CV

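TL;DR: Proposes a cell image segmentation method based on continuous scalar fields: the fields are built as solutions to partial differential equations (the Poisson equation and the steady state of the heat equation) and segmented with the watershed method; no regularization is needed, and the method is robust and yields sharp boundaries.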

Details Motivation: Conventional cell image segmentation methods are sensitive to outliers in the training data and often need complex regularization or post-processing. This work aims for more robust and efficient cell instance segmentation through physics-inspired scalar field modeling, particularly for edge computing. Method: A neural network learns a continuous scalar field on the image domain, where the field is the solution of a Poisson or diffusion-like equation; the network is trained only by minimizing the field residuals, without regularization; the watershed algorithm segments the field to obtain cell instances. Result: Competitive results on public datasets, with sharp cell boundaries and robustness to outliers; a single tensor suffices to train the U-Net, which simplifies implementation and lowers training and inference time and memory use. Conclusion: The proposed physics-driven scalar field approach to cell segmentation is simple, geometrically intuitive, and efficient; it maintains high accuracy while significantly reducing compute requirements, making it suitable for edge computing. Abstract: We investigate image segmentation of cells under the lens of scalar fields. Our goal is to learn a continuous scalar field on image domains such that its segmentation produces robust instances for cells present in images. This field is a function parameterized by the trained network, and its segmentation is realized by the watershed method. The fields we experiment with are solutions to the Poisson partial differential equation and a diffusion mimicking the steady-state solution of the heat equation. These solutions are obtained by minimizing just the field residuals; no regularization is needed, which provides a robust regression capable of diminishing the adverse impacts of outliers in the training data and allows for sharp cell boundaries. A single tensor is all that is needed to train a U-Net, thus simplifying implementation, lowering training and inference times, hence reducing energy consumption, and requiring a small memory footprint, all attractive features in edge computing. We present competitive results on public datasets from the literature and show that our novel, simple yet geometrically insightful approach can achieve excellent cell segmentation results.
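
To show what a Poisson scalar field over a cell mask looks like, here is a small sketch that solves the discrete problem with Jacobi iterations. It assumes a unit source term and a zero Dirichlet boundary outside the mask; whether the paper uses this exact formulation for its training targets is not stated in the abstract, so treat it as an illustration of the field's shape only.

```python
def poisson_field(mask, iterations=200):
    """Jacobi iterations for the discrete Poisson problem laplace(u) = -1
    inside the mask, with u = 0 everywhere outside it."""
    height, width = len(mask), len(mask[0])
    u = [[0.0] * width for _ in range(height)]

    def value(field, y, x):
        # Pixels outside the image or outside the cell are held at zero.
        if 0 <= y < height and 0 <= x < width and mask[y][x]:
            return field[y][x]
        return 0.0

    for _ in range(iterations):
        updated = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    neighbours = (value(u, y - 1, x) + value(u, y + 1, x)
                                  + value(u, y, x - 1) + value(u, y, x + 1))
                    # 5-point stencil, unit grid spacing, source term 1.
                    updated[y][x] = (neighbours + 1.0) / 4.0
        u = updated
    return u

# A 3x3 "cell" inside a 5x5 image.
mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
        for y in range(5)]
field = poisson_field(mask)
for row in field:
    print(["%.3f" % v for v in row])
```

The field peaks near the cell centre and falls to zero at the boundary, so each cell contributes one smooth bump; a watershed on the negated field can then separate touching cells along the valleys between bumps.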

[73] EchoAgent: Guideline-Centric Reasoning Agent for Echocardiography Measurement and Interpretation

Matin Daghyani,Lyuyang Wang,Nima Hashemi,Bassant Medhat,Baraa Abdelsamad,Eros Rojas Velez,XiaoXiao Li,Michael Y. C. Tsang,Christina Luong,Teresa S. M. Tsang,Purang Abolmaesumi

Main category: cs.CV

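TL;DR: EchoAgent is a framework controlled by a large language model that coordinates specialized vision tools to perform spatiotemporal localization, measurement, and clinical interpretation of echocardiography videos, with interpretability and guideline adherence.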

Details Motivation: Existing deep learning models cannot support the video-level reasoning and guideline-based measurement analysis that echocardiography requires, and they lack interpretability and structured outputs. Method: The EchoAgent framework uses a large language model to orchestrate specialized vision tools for temporal localization, spatial measurement, and clinical interpretation, and introduces a measurement-feasibility prediction model to autonomously select frames that can be measured reliably. A benchmark of diverse, clinically validated video-query pairs is built for evaluation. Result: EchoAgent achieves accurate and interpretable results on complex spatiotemporal video analysis tasks; its outputs are grounded in visual evidence and clinical guidelines, giving good transparency and traceability. Conclusion: The study demonstrates the feasibility of an agentic framework built on task-specific tools and full video-level automation for echocardiographic analysis, pointing to a new direction for trustworthy AI in cardiac ultrasound. Abstract: Purpose: Echocardiographic interpretation requires video-level reasoning and guideline-based measurement analysis, which current deep learning models for cardiac ultrasound do not support. We present EchoAgent, a framework that enables structured, interpretable automation for this domain. Methods: EchoAgent orchestrates specialized vision tools under Large Language Model (LLM) control to perform temporal localization, spatial measurement, and clinical interpretation. A key contribution is a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame, enabling autonomous tool selection. We curated a benchmark of diverse, clinically validated video-query pairs for evaluation. Results: EchoAgent achieves accurate, interpretable results despite the added complexity of spatiotemporal video analysis. Outputs are grounded in visual evidence and clinical guidelines, supporting transparency and traceability. Conclusion: This work demonstrates the feasibility of agentic, guideline-aligned reasoning for echocardiographic video analysis, enabled by task-specific tools and full video-level automation. EchoAgent sets a new direction for trustworthy AI in cardiac ultrasound.

[74] Learning Skill-Attributes for Transferable Assessment in Video

Kumar Ashutosh,Kristen Grauman

Main category: cs.CV

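TL;DR: Proposes CrossTrainer, which learns transferable video representations for skill assessment across sports and uses a multimodal language model to generate actionable feedback and proficiency ratings, substantially outperforming existing techniques.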

Details Motivation: Existing models are limited to a single sport and depend on expensive expert annotation, which makes them hard to extend to the long tail of sports; a transferable representation for skill assessment is needed. Method: CrossTrainer discovers skill attributes shared across sports (such as balance, control, and hand positioning) and trains a multimodal language model to generate specific feedback and a proficiency level for a new video. Result: Across multiple datasets, in both cross-sport and intra-sport settings, performance improves by up to 60% relative to the state of the art. Conclusion: By abstracting the shared behaviors that indicate human skill, the video representation generalizes better and enriches multimodal large language models for skill assessment. Abstract: Skill assessment from video entails rating the quality of a person's physical performance and explaining what could be done better. Today's models specialize for an individual sport, and suffer from the high cost and scarcity of expert-level supervision across the long tail of sports. Towards closing that gap, we explore transferable video representations for skill assessment. Our CrossTrainer approach discovers skill-attributes -- such as balance, control, and hand positioning -- whose meaning transcends the boundaries of any given sport, then trains a multimodal language model to generate actionable feedback for a novel video, e.g., "lift hands more to generate more power" as well as its proficiency level, e.g., early expert. We validate the new model on multiple datasets for both cross-sport (transfer) and intra-sport (in-domain) settings, where it achieves gains up to 60% relative to the state of the art. By abstracting out the shared behaviors indicative of human skill, the proposed video representation generalizes substantially better than an array of existing techniques, enriching today's multimodal large language models.

[75] CD-DPE: Dual-Prompt Expert Network based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution

Xianming Gu,Lihui Wang,Ying Cao,Zeyu Deng,Yingfeng Ou,Guodong Hu,Yi Chen

Main category: cs.CV

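TL;DR: Proposes CD-DPE, a dual-prompt expert network based on convolutional dictionary feature decoupling for multi-contrast MRI super-resolution; it effectively separates and fuses cross-contrast and intra-contrast features, significantly improving reconstruction quality and generalization.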

Details Motivation: In multi-contrast MRI super-resolution, inherent contrast differences between modalities make it hard to exploit the texture information of the reference image, which harms feature fusion for target image reconstruction. Method: An iterative convolutional dictionary feature decoupling module (CD-FDM) separates cross-contrast and intra-contrast features, and a dual-prompt feature fusion expert module (DP-FFEM) uses a frequency prompt to select relevant features and an adaptive routing prompt to optimize how they are fused. Result: Experiments on public multi-contrast MRI datasets show that CD-DPE outperforms state-of-the-art methods and reconstructs fine details better; it also generalizes strongly to unseen datasets. Conclusion: Through feature decoupling and dual-prompt fusion, CD-DPE effectively addresses feature interference in multi-contrast MRI super-resolution and improves reconstruction accuracy and generalization. Abstract: Multi-contrast magnetic resonance imaging (MRI) super-resolution intends to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrast disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additionally, experiments on unseen datasets demonstrate that CD-DPE exhibits strong generalization capabilities.

[76] RISE: Single Static Radar-based Indoor Scene Understanding

Kaichen Zhou,Laura Dodds,Sayed Saad Afzal,Fadel Adib

Main category: cs.CV

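TL;DR: RISE is the first benchmark and system for indoor scene understanding with a single static radar; it exploits multipath reflections (usually treated as noise) as rich geometric cues for layout reconstruction and object detection, combining privacy preservation with high geometric accuracy.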

Details Motivation: Optical sensors such as RGB and LiDAR suffer from occlusion and privacy problems indoors, while millimeter-wave radar preserves privacy and penetrates obstacles but has low spatial resolution, which makes reliable geometric reasoning difficult. A solution is needed that combines privacy, robustness, and geometric understanding. Method: A Bi-Angular Multipath Enhancement explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections, and a simulation-to-reality Hierarchical Diffusion framework turns fragmented radar responses into complete layout reconstruction and object detection. Result: The RISE benchmark contains 50,000 frames from 100 real indoor trajectories; experiments show that the Chamfer Distance for layout reconstruction is reduced by 60% (down to 16 cm) relative to the best existing method, and object detection with mmWave radar is achieved for the first time, reaching 58% IoU. Conclusion: RISE lays a new foundation for geometry-aware, privacy-preserving indoor scene understanding with a single static radar and confirms the potential of multipath reflections for indoor sensing. Abstract: Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections, traditionally treated as noise, encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer Distance by 60% (down to 16 cm) compared to the state of the art in layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. 
These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar.
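
For readers unfamiliar with the layout metric, here is a minimal sketch of a Chamfer distance between two point sets. Conventions differ across papers (squared versus unsquared distances, sum versus mean of the two directions); this version averages unsquared nearest-neighbour distances in both directions, which is an assumption and may not match the exact variant used in the RISE evaluation.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance from A to B
    and from B to A, averaged. Points are tuples of equal dimension."""
    def one_way(source, target):
        return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)
    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))

# Toy 2-D layouts in metres: a wall and the same wall shifted by 0.16 m.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
prediction = [(0.0, 0.16), (1.0, 0.16), (2.0, 0.16)]
print(chamfer_distance(ground_truth, prediction))
```

If the reported 60% is a relative reduction, a final value of 16 cm would correspond to a baseline of roughly 40 cm, since 16 / (1 - 0.6) = 40.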

[77] MRI Plane Orientation Detection using a Context-Aware 2.5D Model

SangHyuk Kim,Daniel Haehn,Sumientra Rampersad

Main category: cs.CV

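TL;DR: Proposes a 2.5D context-aware model for classifying the anatomical plane of MRI slices; it reaches 99.49% accuracy, significantly better than a 2D model, and the orientation metadata it generates improves diagnostic accuracy in brain tumor detection.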

Details Motivation: Automated systems have difficulty identifying the anatomical plane of MRI slices; missing orientation metadata hampers data analysis, increases domain shift between datasets, and lowers the accuracy of diagnostic classifiers. Method: A 2.5D context-aware model is trained with multi-slice information to avoid the ambiguity of isolated slices; it is trained on both 3D slice sequences and static 2D images and compared against a 2D reference model. Result: The 2.5D model reaches 99.49% accuracy, up from 98.74% for the 2D model, a 60% reduction in errors; in brain tumor detection, a gated strategy based on uncertainty scores raises diagnostic accuracy from 97.0% to 98.0%, reducing misdiagnoses by 33.3%. Conclusion: 2.5D context is essential for anatomical plane recognition, the generated metadata improves downstream diagnostic performance, and the model has been integrated into an open-source interactive web application. Abstract: Humans can easily identify anatomical planes (axial, coronal, and sagittal) on a 2D MRI slice, but automated systems struggle with this task. Missing plane orientation metadata can complicate analysis, increase domain shift when merging heterogeneous datasets, and reduce accuracy of diagnostic classifiers. This study develops a classifier that accurately generates plane orientation metadata. We adopt a 2.5D context-aware model that leverages multi-slice information to avoid ambiguity from isolated slices and enable robust feature learning. We train the 2.5D model on both 3D slice sequences and static 2D images. While our 2D reference model achieves 98.74% accuracy, our 2.5D method raises this to 99.49%, reducing errors by 60%, highlighting the importance of 2.5D context. We validate the utility of our generated metadata in a brain tumor detection task. A gated strategy selectively uses metadata-enhanced predictions based on uncertainty scores, boosting accuracy from 97.0% with an image-only model to 98.0%, reducing misdiagnoses by 33.3%. We integrate our plane orientation model into an interactive web application and provide it open-source.
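
The "60%" and "33.3%" figures are relative reductions in error rate, not differences in accuracy. The short check below reproduces them from the accuracies quoted in the abstract; the helper name is illustrative.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction by which the error rate shrinks, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before

# Plane classification: error falls from 1.26% to 0.51%.
plane = relative_error_reduction(98.74, 99.49)
# Tumor detection: error falls from 3.0% to 2.0%.
tumor = relative_error_reduction(97.0, 98.0)
print(f"{plane:.1%} {tumor:.1%}")
```

The first value works out to about 59.5%, which the abstract rounds to 60%; the second is exactly one third.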

[78] LINGUAL: Language-INtegrated GUidance in Active Learning for Medical Image Segmentation

Md Shazid Islam,Shreyangshu Bera,Sudipta Paul,Amit K. Roy-Chowdhury

Main category: cs.CV

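TL;DR: Proposes the LINGUAL framework, which uses natural language instructions to carry out medical image segmentation tasks automatically; in active domain adaptation it matches or exceeds conventional active learning methods while reducing annotation time by about 80%.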

Details Motivation: Boundaries in medical images are blurry and annotation is slow and laborious; conventional active learning faces a trade-off between the size of the annotated region and the cognitive load, so a method that reduces the expert's annotation burden is needed. Method: The LINGUAL framework uses in-context learning to translate an expert's natural language instructions into executable programs and automatically performs the corresponding sub-tasks, so segmentation proceeds without human intervention. Result: In active domain adaptation, LINGUAL matches or exceeds active learning baselines while reducing estimated annotation time by about 80%. Conclusion: Through language guidance, LINGUAL substantially lowers the annotation cost and cognitive load of medical image segmentation, offering an efficient, low-burden alternative to active learning. Abstract: Although active learning (AL) in segmentation tasks enables experts to annotate selected regions of interest (ROIs) instead of entire images, it remains highly challenging, labor-intensive, and cognitively demanding due to the blurry and ambiguous boundaries commonly observed in medical images. Also, in conventional AL, annotation effort is a function of the ROI size: larger regions make the task cognitively easier but incur higher annotation costs, whereas smaller regions demand finer precision and more attention from the expert. In this context, language guidance provides an effective alternative, requiring minimal expert effort while bypassing the cognitively demanding task of precise boundary delineation in segmentation. Towards this goal, we introduce LINGUAL: a framework that receives natural language instructions from an expert, translates them into executable programs through in-context learning, and automatically performs the corresponding sequence of sub-tasks without any human intervention. We demonstrate the effectiveness of LINGUAL in active domain adaptation (ADA), achieving comparable or superior performance to AL baselines while reducing estimated annotation time by approximately 80%.

[79] Training-free Detection of AI-generated images via Cropping Robustness

Sungik Choi,Hankook Lee,Moontae Lee

Main category: cs.CV

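TL;DR: Proposes WaRPAD, a training-free AI-generated image detection method based on the robustness of self-supervised models to high-frequency perturbations and to crop-and-resize augmentation.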

Details Motivation: As vision-generative models advance, there is a need for an AI-generated image detector that does not depend on training with specific datasets, works across many generative models, and is robust. Method: The method exploits the cross-resolution consistency that self-supervised models learn under augmentations such as RandomResizedCrop; it extracts high-frequency perturbation directions via Haar wavelet decomposition to define a base score function, then splits the image into patches, computes the sensitivity score for each patch, and averages them into the final detection score. Result: Validated on real datasets of diverse resolutions and domains and on images produced by 23 generative models, WaRPAD shows competitive performance and strong robustness to test-time corruptions. Conclusion: WaRPAD is a general, training-free AI-generated image detection method that applies broadly to self-supervised models with RandomResizedCrop invariance. Abstract: AI-generated image detection has become crucial with the rapid advancement of vision-generative models. Instead of training detectors tailored to specific datasets, we study a training-free approach leveraging self-supervised models without requiring prior data knowledge. These models, pre-trained with augmentations like RandomResizedCrop, learn to produce consistent representations across varying resolutions. Motivated by this, we propose WaRPAD, a training-free AI-generated image detection algorithm based on self-supervised models. Since neighborhood pixel differences in images are highly sensitive to resizing operations, WaRPAD first defines a base score function that quantifies the sensitivity of image embeddings to perturbations along high-frequency directions extracted via Haar wavelet decomposition. To simulate robustness against cropping augmentation, we rescale each image to a multiple of the model's input size, divide it into smaller patches, and compute the base score for each patch. The final detection score is then obtained by averaging the scores across all patches. We validate WaRPAD on real datasets of diverse resolutions and domains, and images generated by 23 different generative models. Our method consistently achieves competitive performance and demonstrates strong robustness to test-time corruptions. Furthermore, as invariance to RandomResizedCrop is a common training scheme across self-supervised models, we show that WaRPAD is applicable across self-supervised models.

[80] FashionMAC: Deformation-Free Fashion Image Generation with Fine-Grained Model Appearance Customization

Rong Zhang,Jinxiao Li,Jingnan Wang,Zhiwen Zuo,Jianfeng Dong,Wei Li,Chi Wang,Weiwei Xu,Xun Wang

Main category: cs.CV

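TL;DR: Proposes FashionMAC, a deformation-free diffusion-based framework for high-quality, controllable fashion showcase image generation; it preserves detail by directly outpainting the segmented garment region and introduces region-adaptive decoupled attention (RADA) and a chained mask injection strategy for fine-grained appearance control.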

Details Motivation: Existing methods deform the garment during generation, which tends to distort textures, and they lack precise control over fine-grained attributes of the person's appearance; a method is needed that preserves garment detail and offers fine control at the same time. Method: FashionMAC uses a diffusion framework that avoids garment deformation and generates the image by directly outpainting the garment region; the RADA mechanism and the chained mask injection strategy adaptively match text attributes to generated regions, improving control precision. Result: Experiments show that FashionMAC outperforms state-of-the-art methods in preserving garment detail and in image quality, and markedly improves text-driven fine-grained appearance control. Conclusion: By removing unnecessary deformation and introducing region-aware attention, FashionMAC resolves the trade-off between detail preservation and controllability in fashion image generation and offers a better solution for applications such as e-commerce. Abstract: Garment-centric fashion image generation aims to synthesize realistic and controllable human models wearing a given garment, which has attracted growing interest due to its practical applications in e-commerce. The key challenges of the task lie in two aspects: (1) faithfully preserving the garment details, and (2) gaining fine-grained controllability over the model's appearance. Existing methods typically require performing garment deformation in the generation process, which often leads to garment texture distortions. Also, they fail to control the fine-grained attributes of the generated models, due to the lack of specifically designed mechanisms. To address these issues, we propose FashionMAC, a novel diffusion-based deformation-free framework that achieves high-quality and controllable fashion showcase image generation. The core idea of our framework is to eliminate the need for performing garment deformation and directly outpaint the garment segmented from a dressed person, which enables faithful preservation of the intricate garment details. Moreover, we propose a novel region-adaptive decoupled attention (RADA) mechanism along with a chained mask injection strategy to achieve fine-grained appearance controllability over the synthesized human models. Specifically, RADA adaptively predicts the generated regions for each fine-grained text attribute and enforces the text attribute to focus on the predicted regions by a chained mask injection strategy, significantly enhancing the visual fidelity and the controllability. 
Extensive experiments validate the superior performance of our framework compared to existing state-of-the-art methods.

[81] Flood-LDM: Generalizable Latent Diffusion Models for rapid and accurate zero-shot High-Resolution Flood Mapping

Sun Han Neo,Sachith Seneviratne,Herath Mudiyanselage Viraj Vidura Herath,Abhishek Saha,Sanka Rasnayaka,Lucy Amanda Marshall

Main category: cs.CV

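TL;DR: Proposes a latent-diffusion-based super-resolution method for flood maps that significantly reduces computation time while maintaining high accuracy and generalizes well across regions.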

Details Motivation: Traditional hydrodynamic models are computationally expensive and hard to use in real time, and existing deep learning methods generalize poorly to unseen areas. Method: A latent diffusion model performs super-resolution on coarse-grid flood maps, combined with physics-informed inputs and transfer learning to improve interpretability and adaptability. Result: Experiments show that the method greatly reduces the computation time needed to produce high-fidelity flood maps and generalizes well across geographic regions. Conclusion: The method improves inference speed and interpretability while preserving accuracy, making it suitable for real-time flood risk management and cross-region use. Abstract: Flood prediction is critical for emergency planning and response to mitigate human and economic losses. Traditional physics-based hydrodynamic models generate high-resolution flood maps using numerical methods requiring fine-grid discretization, which makes them computationally intensive and impractical for real-time large-scale applications. While recent studies have applied convolutional neural networks for flood map super-resolution with good accuracy and speed, they suffer from limited generalizability to unseen areas. In this paper, we propose a novel approach that leverages latent diffusion models to perform super-resolution on coarse-grid flood maps, with the objective of achieving the accuracy of fine-grid flood maps while significantly reducing inference time. Experimental results demonstrate that latent diffusion models substantially decrease the computational time required to produce high-fidelity flood maps without compromising on accuracy, enabling their use in real-time flood risk management. Moreover, diffusion models exhibit superior generalizability across different physical locations, with transfer learning further accelerating adaptation to new geographic regions. Our approach also incorporates physics-informed inputs, addressing the common limitation of black-box behavior in machine learning, thereby enhancing interpretability. Code is available at https://github.com/neosunhan/flood-diff.

[82] Saliency-Guided Deep Learning for Bridge Defect Detection in Drone Imagery

Loucif Hebbache,Dariush Amirkhani,Mohand Saïd Allili,Jean-François Lapointe

Main category: cs.CV

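TL;DR: Proposes a new method to automatically detect, localize, and classify defects in concrete bridges from drone imagery, combining saliency-based region proposals with a YOLOX detector and performing well in both accuracy and computational efficiency.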

Details Motivation: Anomaly detection in bridge structures is a challenging computer vision problem; conventional methods struggle to balance accuracy and efficiency, so an automated and efficient method is needed for practical inspection. Method: The method has two stages: the first uses saliency detection to generate candidate defect regions, and the second applies a YOLOX deep learning detector to saliency-enhanced images, obtained through bounding-box-level brightness augmentation, to detect and classify defects. Result: Experiments on standard datasets show that the framework performs well in both detection accuracy and computational efficiency and is suitable for deployment in self-powered inspection systems. Conclusion: The method raises the automation and practicality of concrete bridge defect detection and has potential for wide use in real-world settings. Abstract: Anomaly object detection and classification are among the most challenging tasks in computer vision and pattern recognition. In this paper, we propose a new method to automatically detect, localize and classify defects in concrete bridge structures using drone imagery. The framework consists of two main stages. The first stage uses saliency for defect region proposals, where defects often exhibit local discontinuities in the normal surface patterns relative to their surroundings. The second stage employs a YOLOX-based deep learning detector that operates on saliency-enhanced images obtained by applying bounding-box level brightness augmentation to salient defect regions. Experimental results on standard datasets confirm the performance of our framework and its suitability in terms of accuracy and computational efficiency, which gives it strong potential for deployment in a self-powered inspection system.
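
The "bounding-box level brightness augmentation" step can be pictured as follows. This sketch brightens grayscale pixels inside each proposed box by a fixed gain; the gain value, the box format `(x1, y1, x2, y2)` with exclusive upper bounds, and the function name are assumptions, since the abstract gives no such details.

```python
def brighten_boxes(image, boxes, gain=1.3):
    """Return a copy of a grayscale image (list of rows, values 0-255) with
    pixels inside each box scaled by `gain` and clipped to 255."""
    enhanced = [row[:] for row in image]
    for x1, y1, x2, y2 in boxes:
        for y in range(y1, y2):
            for x in range(x1, x2):
                # Read from the original so overlapping boxes are not boosted twice.
                enhanced[y][x] = min(255, int(round(image[y][x] * gain)))
    return enhanced

image = [[100, 100, 100],
         [100, 200, 100],
         [100, 100, 100]]
salient_boxes = [(0, 0, 2, 2)]  # hypothetical saliency-stage proposal
for row in brighten_boxes(image, salient_boxes, gain=1.5):
    print(row)
```

The detector would then be run on the enhanced image, so that salient regions stand out more strongly against the normal surface texture.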

[83] Semantic Context Matters: Improving Conditioning for Autoregressive Models

Dongyang Jin,Ryan Xu,Jianhao Zeng,Rui Lan,Yancheng Bai,Lei Sun,Xiangxiang Chu

Main category: cs.CV

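TL;DR: Proposes SCAR, which improves the semantic control and generation quality of autoregressive models in image editing through Compressed Semantic Prefilling and Semantic Alignment Guidance.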

Details Motivation: Existing autoregressive models suffer from weak conditioning, low efficiency, poor instruction adherence, and visual artifacts in image editing, which makes semantically precise editing difficult. Method: SCAR has two core components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact and efficient prefix, and Semantic Alignment Guidance, which aligns visual hidden states with target semantics during autoregressive decoding. The method builds on the flexibility of vector-quantized prefilling, overcomes its semantic limitations and high cost, and is compatible with both next-token and next-set AR paradigms. Result: SCAR achieves higher visual fidelity and semantic alignment on instruction editing and controllable generation tasks, outperforming prior autoregressive methods while maintaining good controllability. Conclusion: SCAR effectively improves the semantic control and generation efficiency of autoregressive models in image editing and shows good generality and application prospects. Abstract: Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal systems compared to diffusion-based methods. However, extending AR models to general image editing remains challenging due to weak and inefficient conditioning, often leading to poor instruction adherence and visual artifacts. To address this, we propose SCAR, a Semantic-Context-driven method for Autoregressive models. SCAR introduces two key components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact and efficient prefix, and Semantic Alignment Guidance, which aligns the last visual hidden states with target semantics during autoregressive decoding to enhance instruction fidelity. Unlike decoding-stage injection methods, SCAR builds upon the flexibility and generality of vector-quantized-based prefilling while overcoming its semantic limitations and high cost. It generalizes across both next-token and next-set AR paradigms with minimal architectural changes. SCAR achieves superior visual fidelity and semantic alignment on both instruction editing and controllable generation benchmarks, outperforming prior AR-based methods while maintaining controllability. All code will be released.

[84] CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs

Jingyu Lei,Gaoang Wang,Der-Horng Lee

Main category: cs.CV

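TL;DR: Proposes CORE, a new visual token compression paradigm that uses object-centric representations and a centroid-guided sorting mechanism to cut computational cost substantially while maintaining high performance.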

Details Motivation: Existing visual token compression methods lack high-level semantic understanding, which leads to poor merges, information redundancy, or context loss. Method: An efficient segmentation decoder generates object masks that serve as a semantic prior to guide the merging of visual tokens, and a centroid-guided sorting mechanism restores spatial order. Result: CORE achieves state-of-the-art fixed-rate compression on six authoritative benchmarks and markedly improves efficiency in adaptive-rate settings; it retains 97.4% of baseline performance while keeping only 2.2% of the visual tokens. Conclusion: Object-centric representations offer clear advantages for making large vision-language models both more efficient and more effective. Abstract: Large Vision-Language Models (LVLMs) usually suffer from prohibitive computational and memory costs due to the quadratic growth of visual tokens with image resolution. Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens into a compact set of object-centric representations. Furthermore, a novel centroid-guided sorting mechanism restores a coherent spatial order to the merged tokens, preserving vital positional information. Extensive experiments show that CORE not only establishes a new state-of-the-art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings. Even under extreme compression, retaining only 2.2% of all visual tokens, CORE still maintains 97.4% of baseline performance. Our work demonstrates the superiority of object-centric representations for efficient and effective LVLM processing.
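
A rough sketch of mask-guided token merging followed by centroid-based ordering is given below. It assumes mean pooling within each object mask and raster-order sorting of the mask centroids; both choices, along with the function name and data layout, are illustrative guesses and are not taken from the CORE paper.

```python
def merge_tokens_by_object(tokens, positions, object_ids):
    """Merge visual tokens that share an object id and order the results.

    tokens:     list of feature vectors (lists of floats), one per patch
    positions:  list of (row, col) grid coordinates, one per patch
    object_ids: list of mask labels, one per patch
    Returns a list of (object_id, merged_vector) sorted by mask centroid
    in raster order (top to bottom, then left to right).
    """
    groups = {}
    for token, position, object_id in zip(tokens, positions, object_ids):
        groups.setdefault(object_id, []).append((token, position))

    merged = []
    for object_id, members in groups.items():
        count = len(members)
        dim = len(members[0][0])
        mean_token = [sum(tok[d] for tok, _ in members) / count for d in range(dim)]
        centroid_row = sum(pos[0] for _, pos in members) / count
        centroid_col = sum(pos[1] for _, pos in members) / count
        merged.append((centroid_row, centroid_col, object_id, mean_token))

    merged.sort(key=lambda item: (item[0], item[1]))
    return [(object_id, token) for _, _, object_id, token in merged]

# A 2x2 patch grid whose left column belongs to object 1, right column to object 0.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
object_ids = [1, 0, 1, 0]
print(merge_tokens_by_object(tokens, positions, object_ids))
```

The output length equals the number of objects rather than the number of patches, which is where the compression comes from; the sort step gives the language model a consistent spatial reading order for the merged tokens.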

[85] Zero-Training Task-Specific Model Synthesis for Few-Shot Medical Image Classification

Yao Qin,Yangyang Yan,YuanChao Yang,Jinhua Pang,Huanyong Bi,Yuan Liu,HaiHua Wang

Main category: cs.CV

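TL;DR: Proposes Zero-Training Task-Specific Model Synthesis (ZS-TMS), a new paradigm that uses a generative engine to synthesize classifier parameters directly. With only minimal multi-modal information (such as a single example image and a clinical text description), it enables training-free medical image classification and significantly outperforms existing few-shot and zero-shot methods.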

Details Motivation: Deep learning for medical image analysis is constrained by its dependence on large-scale annotated data, while medical data, especially for rare diseases, is hard to acquire and expensive to annotate, so a method that does not rely on large amounts of annotated data is needed. Method: Proposes the Semantic-Guided Parameter Synthesizer (SGPS), which uses a large-scale pre-trained generative model to generate all parameters of a lightweight classifier (e.g., EfficientNet-V2) directly from minimal task information (such as a 1-shot image and a clinical text description), without any task-specific training or fine-tuning. Result: On few-shot classification tasks built from the ISIC 2018 skin lesion dataset and a custom rare disease dataset, SGPS significantly outperforms advanced few-shot and zero-shot learning methods in ultra-low data regimes such as 1-shot and 5-shot, reaching a new state of the art. Conclusion: SGPS achieves training-free task-specific model generation, offering a feasible new route for rapidly deploying AI diagnostic tools in medical domains where data is extremely scarce, such as rare diseases. Abstract: Deep learning models have achieved remarkable success in medical image analysis but are fundamentally constrained by the requirement for large-scale, meticulously annotated datasets. This dependency on "big data" is a critical bottleneck in the medical domain, where patient data is inherently difficult to acquire and expert annotation is expensive, particularly for rare diseases where samples are scarce by definition. To overcome this fundamental challenge, we propose a novel paradigm: Zero-Training Task-Specific Model Synthesis (ZS-TMS). Instead of adapting a pre-existing model or training a new one, our approach leverages a large-scale, pre-trained generative engine to directly synthesize the entire set of parameters for a task-specific classifier. Our framework, the Semantic-Guided Parameter Synthesizer (SGPS), takes as input minimal, multi-modal task information: as little as a single example image (1-shot) and a corresponding clinical text description. The generative engine interprets these inputs to generate the weights for a lightweight, efficient classifier (e.g., an EfficientNet-V2), which can be deployed for inference immediately without any task-specific training or fine-tuning. We conduct extensive evaluations on challenging few-shot classification benchmarks derived from the ISIC 2018 skin lesion dataset and a custom rare disease dataset.
Our results demonstrate that SGPS establishes a new state-of-the-art, significantly outperforming advanced few-shot and zero-shot learning methods, especially in the ultra-low data regimes of 1-shot and 5-shot classification. This work paves the way for the rapid development and deployment of AI-powered diagnostic tools, particularly for the long tail of rare diseases where data is critically limited.

[86] Automated glenoid bone loss measurement and segmentation in CT scans for pre-operative planning in shoulder instability

Zhonghao Liu,Hanxue Gu,Qihang Li,Michael Fox,Jay M. Levin,Maciej A. Mazurowski,Brian C. Lau

Main category: cs.CV

TL;DR: Proposes a fully automated deep learning pipeline for measuring glenoid bone loss on 3D CT scans, with high agreement and clinical reliability.

Details Motivation: Existing manual and semi-automated methods are time-consuming and subject to interreader variability, so a reliable and efficient automated method is needed to assess glenoid bone loss in shoulder instability. Method: Develops a multi-stage deep learning pipeline: a U-Net segments the glenoid and humerus, a neural network predicts anatomical landmarks, and principal component analysis (PCA), projection, and circle fitting are combined to compute the bone loss percentage. Result: On 91 patients, the method agrees strongly with expert consensus (ICC 0.84 vs 0.78), exceeds surgeon-to-surgeon agreement in both the low- and high-bone-loss subgroups, and achieves high classification recall with no severe misclassifications. Conclusion: The fully automated deep learning method is efficient and reliable, is suitable for preoperative planning in shoulder instability and for screening patients with substantial bone loss, and has potential for clinical use. Abstract: Reliable measurement of glenoid bone loss is essential for operative planning in shoulder instability, but current manual and semi-automated methods are time-consuming and often subject to interreader variability. We developed and validated a fully automated deep learning pipeline for measuring glenoid bone loss on three-dimensional computed tomography (CT) scans using a linear-based, en-face view, best-circle method. Shoulder CT images of 91 patients (average age, 40 years; range, 14-89 years; 65 men) were retrospectively collected along with manual labels including glenoid segmentation, landmarks, and bone loss measurements. The multi-stage algorithm has three main stages: (1) segmentation, where we developed a U-Net to automatically segment the glenoid and humerus; (2) anatomical landmark detection, where a second network predicts glenoid rim points; and (3) geometric fitting, where we applied principal component analysis (PCA), projection, and circle fitting to compute the percentage of bone loss. The automated measurements showed strong agreement with consensus readings and exceeded surgeon-to-surgeon consistency (intraclass correlation coefficient (ICC) 0.84 vs 0.78), including in low- and high-bone-loss subgroups (ICC 0.71 vs 0.63 and 0.83 vs 0.21, respectively; P < 0.001). For classifying patients into low, medium, and high bone-loss categories, the pipeline achieved a recall of 0.714 for low and 0.857 for high severity, with no low cases misclassified as high or vice versa.
These results suggest that our method is a time-efficient and clinically reliable tool for preoperative planning in shoulder instability and for screening patients with substantial glenoid bone loss. Code and dataset are available at https://github.com/Edenliu1/Auto-Glenoid-Measurement-DL-Pipeline.
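
The geometric fitting stage (circle fitting followed by a bone loss percentage) can be sketched as follows. This is a minimal illustration and not the code from the linked repository. It assumes the rim landmarks have already been projected onto the 2D en-face plane, uses a plain algebraic least-squares circle fit, and takes the linear definition of bone loss as defect width divided by the best-fit circle diameter; `fit_circle` and `linear_bone_loss_percent` are hypothetical names.

```python
import math


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_circle(points):
    """Algebraic least-squares fit of x^2 + y^2 = a*x + b*y + c.

    Returns (center_x, center_y, radius).
    """
    sxx = sxy = syy = sx = sy = sxz = syz = sz = 0.0
    for x, y in points:
        z = x * x + y * y
        sxx += x * x
        sxy += x * y
        syy += y * y
        sx += x
        sy += y
        sxz += x * z
        syz += y * z
        sz += z
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(len(points))]]
    rhs = [sxz, syz, sz]
    det = _det3(normal)
    if abs(det) < 1e-12:
        raise ValueError("points are collinear or too few to define a circle")
    solution = []
    for col in range(3):  # Cramer's rule on the 3x3 normal equations
        replaced = [row[:] for row in normal]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(_det3(replaced) / det)
    a, b, c = solution
    cx, cy = a / 2.0, b / 2.0
    return cx, cy, math.sqrt(c + cx * cx + cy * cy)


def linear_bone_loss_percent(defect_width, radius):
    """Assumed linear definition: defect width over best-fit circle diameter."""
    return 100.0 * defect_width / (2.0 * radius)


# Four rim points lying on a circle centered at (1, 2) with radius 5.
cx, cy, r = fit_circle([(6.0, 2.0), (1.0, 7.0), (-4.0, 2.0), (1.0, -3.0)])
loss = linear_bone_loss_percent(defect_width=2.5, radius=5.0)
```

How the defect width is measured from the landmarks is not stated in the abstract, so it is passed in directly here.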

[87] Error-Driven Scene Editing for 3D Grounding in Large Language Models

Yue Zhang,Zun Wang,Han Lin,Jialu Li,Jianing Yang,Yonatan Bitton,Idan Szpektor,Mohit Bansal

Main category: cs.CV

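TL;DR: This paper proposes DEER-3D, an error-driven framework based on 3D scene editing that generates precise visual counterfactuals through fine-grained spatial manipulation, mitigating the inherent biases of 3D-LLMs in aligning language with spatial elements.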

Details Motivation: Existing 3D-LLMs are limited in aligning language with visual and spatial elements in 3D, mainly because training data emphasizes language reasoning over spatial understanding and 3D resources are scarce, which leaves grounding biases unresolved. Method: Proposes the DEER-3D framework, which follows a "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow: it first identifies grounding failures of the 3D-LLM, then diagnoses the specific predicate-level error (e.g., attribute or spatial relation), and performs minimal, predicate-aligned 3D scene edits (such as recoloring or repositioning) to generate targeted counterfactual supervision for iterative fine-tuning. Result: The method is validated on multiple 3D grounding and scene understanding benchmarks, with consistent gains on all evaluated datasets through iterative refinement. Conclusion: DEER-3D shows that targeted, error-driven scene editing can bridge the gap between language reasoning and spatial grounding in 3D-LLMs without large-scale 3D data collection or costly scene reconstruction. Abstract: Despite recent progress in 3D-LLMs, they remain limited in accurately grounding language to visual and spatial elements in 3D environments. This limitation stems in part from training data that focuses on language reasoning rather than spatial understanding due to scarce 3D resources, leaving inherent grounding biases unresolved. To address this, we propose 3D scene editing as a key mechanism to generate precise visual counterfactuals that mitigate these biases through fine-grained spatial manipulation, without requiring costly scene reconstruction or large-scale 3D data collection. Furthermore, to make these edits targeted and directly address the specific weaknesses of the model, we introduce DEER-3D, an error-driven framework following a structured "Decompose, Diagnostic Evaluation, Edit, and Re-train" workflow, rather than broadly or randomly augmenting data as in conventional approaches. Specifically, upon identifying a grounding failure of the 3D-LLM, our framework first diagnoses the exact predicate-level error (e.g., attribute or spatial relation). It then executes minimal, predicate-aligned 3D scene edits, such as recoloring or repositioning, to produce targeted counterfactual supervision for iterative model fine-tuning, significantly enhancing grounding accuracy. We evaluate our editing pipeline across multiple benchmarks for 3D grounding and scene understanding tasks, consistently demonstrating improvements across all evaluated datasets through iterative refinement.
DEER-3D underscores the effectiveness of targeted, error-driven scene editing in bridging linguistic reasoning capabilities with spatial grounding in 3D LLMs.

[88] GCA-ResUNet: Image segmentation in medical images using grouped coordinate attention

Jun Ding,Shang Gao

Main category: cs.CV

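TL;DR: This paper proposes GCA-ResUNet, an efficient medical image segmentation network that introduces Grouped Coordinate Attention (GCA) into ResNet-50 residual blocks, strengthening global dependency modeling and improving segmentation accuracy while remaining computationally efficient.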

Details Motivation: Conventional U-Net style models struggle to capture long-range dependencies, while Transformer-based methods model global context but are computationally expensive, so a segmentation network that is both efficient and expressive is needed. Method: Proposes GCA-ResUNet, which embeds a grouped coordinate attention mechanism into ResNet-50 residual blocks and uses grouped coordinate modeling to jointly encode global dependencies across channels and spatial locations, strengthening feature representation and boundary delineation. Result: Reaches a Dice score of 86.11% on the Synapse dataset and 92.64% on the ACDC dataset, outperforming several SOTA baselines with fast inference and good computational efficiency. Conclusion: GCA gives convolutional neural networks a practical form of global modeling and enables high-accuracy medical image segmentation under resource constraints. Abstract: Medical image segmentation underpins computer-aided diagnosis and therapy by supporting clinical diagnosis, preoperative planning, and disease monitoring. While U-Net style convolutional neural networks perform well due to their encoder-decoder structures with skip connections, they struggle to capture long-range dependencies. Transformer-based variants address global context but often require heavy computation and large training datasets. This paper proposes GCA-ResUNet, an efficient segmentation network that integrates Grouped Coordinate Attention (GCA) into ResNet-50 residual blocks. GCA uses grouped coordinate modeling to jointly encode global dependencies across channels and spatial locations, strengthening feature representation and boundary delineation while adding minimal parameter and FLOP overhead compared with self-attention. On the Synapse dataset, GCA-ResUNet achieves a Dice score of 86.11%, and on the ACDC dataset, it reaches 92.64%, surpassing several state-of-the-art baselines while maintaining fast inference and favorable computational efficiency. These results indicate that GCA offers a practical way to enhance convolutional architectures with global modeling capability, enabling high-accuracy and resource-efficient medical image segmentation.

[89] SMGeo: Cross-View Object Geo-Localization with Grid-Level Mixture-of-Experts

Fan Zhang,Haoyuan Ren,Fei Ma,Qiang Yin,Yongsheng Zhou

Main category: cs.CV

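TL;DR: This paper proposes SMGeo, an end-to-end Transformer-based model for cross-view object geo-localization that supports click prompts and real-time interactive localization, and significantly outperforms existing methods on the drone-to-satellite task.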

Details Motivation: Traditional multi-stage "retrieval-matching" pipelines face large viewpoint and scale differences and complex background interference when handling cross-view images, and they are prone to cumulative errors, so a more precise end-to-end localization method is needed. Method: Proposes SMGeo, which uses a Swin-Transformer for joint encoding of both input sources, introduces a grid-level sparse Mixture-of-Experts (GMoE) to better model inter-modal and intra-view dependencies, and uses an anchor-free detection head that regresses coordinates directly under heat-map supervision; click prompts enable interactive localization. Result: On the drone-to-satellite task, SMGeo reaches 87.51%, 62.50%, and 61.45% on the accuracy at IoU=0.25 and mIoU metrics, significantly outperforming methods such as DetGeo; ablations confirm the contribution of shared encoding, query-guided fusion, and the GMoE module. Conclusion: With an end-to-end Transformer architecture and the GMoE mechanism, SMGeo achieves high-accuracy, interactive cross-view object geo-localization and copes with viewpoint differences and scale changes. Abstract: Cross-view object Geo-localization aims to precisely pinpoint the same object across large-scale satellite imagery based on drone images. Due to significant differences in viewpoint and scale, coupled with complex background interference, traditional multi-stage "retrieval-matching" pipelines are prone to cumulative errors. To address this, we present SMGeo, a promptable end-to-end transformer-based model for object Geo-localization. This model supports click prompting and can output object Geo-localization in real time when prompted to allow for interactive use. The model employs a fully transformer-based architecture, utilizing a Swin-Transformer for joint feature encoding of both drone and satellite imagery and an anchor-free transformer detection head for coordinate regression. In order to better capture both inter-modal and intra-view dependencies, we introduce a grid-level sparse Mixture-of-Experts (GMoE) into the cross-view encoder, allowing it to adaptively activate specialized experts according to the content, scale and source of each grid. The anchor-free detection head directly predicts object locations via heat-map supervision in the reference images, which avoids the scale bias and matching complexity introduced by predefined anchor boxes.
On the drone-to-satellite task, SMGeo achieves leading performance in accuracy at IoU=0.25 and mIoU metrics (e.g., 87.51%, 62.50%, and 61.45% in the test set, respectively), significantly outperforming representative methods such as DetGeo (61.97%, 57.66%, and 54.05%, respectively). Ablation studies demonstrate complementary gains from shared encoding, query-guided fusion, and grid-level sparse mixture-of-experts.
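
The grid-level sparse Mixture-of-Experts idea (activating only a few experts per grid cell) can be illustrated with a generic top-k gating sketch. The abstract does not describe the actual GMoE router, so the softmax-over-top-k rule, the value of `k`, and the names `top_k_gate` and `moe_cell_output` are assumptions for illustration only.

```python
import math


def top_k_gate(gate_logits, k):
    """Pick the k experts with the largest logits and softmax over them only."""
    ranked = sorted(range(len(gate_logits)), key=lambda e: (-gate_logits[e], e))[:k]
    peak = max(gate_logits[e] for e in ranked)
    exps = {e: math.exp(gate_logits[e] - peak) for e in ranked}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}


def moe_cell_output(cell_feature, gate_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for one grid cell.

    experts: list of callables mapping a feature vector to a vector of the
    same length (hypothetical stand-ins for small feed-forward networks).
    """
    weights = top_k_gate(gate_logits, k)
    out = [0.0] * len(cell_feature)
    for e, w in weights.items():
        y = experts[e](cell_feature)  # only the selected experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out


toy_experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [0.0 for _ in x],
]
cell_out = moe_cell_output([1.0, 2.0], [0.0, 0.0, -5.0], toy_experts, k=2)
```

In this toy call the third expert is never evaluated, which is where the sparsity saves compute; the gate logits would in practice come from a learned router over each cell's features.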

[90] BCE3S: Binary Cross-Entropy Based Tripartite Synergistic Learning for Long-tailed Recognition

Weijia Fan,Qiufu Li,Jiajun Wen,Xiaoyang Peng

Main category: cs.CV

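TL;DR: This paper proposes BCE3S, a binary cross-entropy (BCE)-based tripartite synergistic learning method for long-tailed recognition. By decoupling the metrics between features and classifier vectors, it improves feature compactness and separability, balances the separability among classifier vectors, and achieves state-of-the-art performance on several long-tailed datasets.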

Details Motivation: Existing long-tailed recognition methods based on cross-entropy (CE) loss struggle to learn features with the desired properties, and the Softmax amplifies the imbalance among classifier vectors, which hurts performance. Method: Proposes BCE-based tripartite synergistic learning (BCE3S), which has three parts: (1) BCE-based joint learning, which decouples the metrics between features and classifier vectors through multiple Sigmoids; (2) BCE-based contrastive learning, which strengthens intra-class compactness; (3) BCE-based uniform learning, which balances the separability among classifier vectors and interacts with joint learning to improve the features. Result: Experiments show that BCE3S achieves SOTA performance on long-tailed datasets including CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist2018, with clearly improved feature compactness, feature separability, and classifier balance. Conclusion: Through decoupled optimization and its three learning components, BCE3S addresses feature learning and classifier imbalance in long-tailed recognition and offers a stronger BCE-based learning framework for LTR. Abstract: For long-tailed recognition (LTR) tasks, high intra-class compactness and inter-class separability in both head and tail classes, as well as balanced separability among all the classifier vectors, are preferred. The existing LTR methods based on cross-entropy (CE) loss not only struggle to learn features with desirable properties but also couple imbalanced classifier vectors in the denominator of the Softmax, amplifying the imbalance effects in LTR. For LTR, we propose a binary cross-entropy (BCE)-based tripartite synergistic learning, termed BCE3S, which consists of three components: (1) BCE-based joint learning optimizes both the classifier and sample features, which achieves better compactness and separability among features than the CE-based joint learning, by decoupling the metrics between features and the imbalanced classifier vectors through multiple Sigmoids; (2) BCE-based contrastive learning further improves the intra-class compactness of features; (3) BCE-based uniform learning balances the separability among classifier vectors and interactively enhances the feature properties by combining with the joint learning. Extensive experiments show that the LTR model trained by BCE3S not only achieves higher compactness and separability among sample features, but also balances the classifier's separability, achieving SOTA performance on various long-tailed datasets such as CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and iNaturalist2018.
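
The coupling that the abstract attributes to the Softmax denominator, and the decoupling obtained with per-class Sigmoids, can be seen in a small numeric sketch. It covers only the textbook CE and BCE losses on raw logits (assumed here to be feature-classifier inner products), not the full BCE3S objective; `ce_loss`, `bce_loss`, and `softplus` are illustrative names.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ce_loss(logits, target):
    """Softmax cross-entropy: every logit enters the shared denominator."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_z - logits[target]


def bce_loss(logits, target):
    """One Sigmoid per class: each class term depends on its own logit only."""
    total = 0.0
    for k, z in enumerate(logits):
        y = 1.0 if k == target else 0.0
        # -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))] == softplus(z) - y*z
        total += softplus(z) - y * z
    return total


# Effect of changing class 1's logit from 2 to 0, under two settings of class 2.
bce_gap_a = bce_loss([1.0, 0.0, 5.0], 0) - bce_loss([1.0, 2.0, 5.0], 0)
bce_gap_b = bce_loss([1.0, 0.0, -3.0], 0) - bce_loss([1.0, 2.0, -3.0], 0)
ce_gap_a = ce_loss([1.0, 0.0, 5.0], 0) - ce_loss([1.0, 2.0, 5.0], 0)
ce_gap_b = ce_loss([1.0, 0.0, -3.0], 0) - ce_loss([1.0, 2.0, -3.0], 0)
```

The two BCE gaps coincide because class 2's logit never touches class 1's term, whereas the two CE gaps differ because all logits share one normalizer.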

[91] FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration

Jingren Liu,Shuning Xu,Qirui Yang,Yun Wang,Xiangyu Chen,Zhong Ji

Main category: cs.CV

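TL;DR: Proposes FAPE-IR, an all-in-one image restoration framework that couples semantic planning with frequency-domain restoration. A frozen multimodal large language model produces frequency-aware restoration plans that guide a LoRA-MoE-based diffusion executor, achieving SOTA performance and strong zero-shot generalization.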

Details Motivation: Existing all-in-one image restoration methods rely on task-specific designs or implicit routing strategies, adapt poorly to the many degradation types found in complex real-world scenes, and lack interpretability and unification. Method: A frozen multimodal large language model acts as a planner that analyzes the degraded image and generates a frequency-aware restoration plan; the executor is a diffusion model with a LoRA-MoE module that dynamically selects high- or low-frequency experts, complemented by frequency features of the input image; adversarial training and a frequency regularization loss improve restoration quality and reduce artifacts. Result: Achieves state-of-the-art performance on seven image restoration tasks and shows strong zero-shot generalization under mixed degradations. Conclusion: By coupling semantic planning with frequency-based restoration, FAPE-IR provides a unified, interpretable all-in-one restoration solution that improves restoration quality and robustness under multiple degradations. Abstract: All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.

[92] Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations

Yiqing Shen,Chenjia Li,Mathias Unberath

Main category: cs.CV

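TL;DR: This paper introduces the task of reasoning video editing, in which implicit text instructions must be understood through multi-hop reasoning before editing, and proposes RIVER, a model that combines digital twin representations with a large language model for reasoning and guides a diffusion model to make pixel-level changes. RIVER achieves the best performance on the newly proposed RVEBenchmark and on existing benchmarks.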

Details Motivation: Existing text-driven video editing methods require users to specify the spatial location and temporal extent of the editing target explicitly, and they cannot handle implicit queries based on semantic properties or object relationships, so a method that can understand and reason about implicit instructions is needed. Method: Proposes RIVER, which preserves the spatial, temporal, and semantic information of a video in a digital twin representation; a large language model performs multi-hop reasoning over the implicit query and produces structured editing instructions that guide a diffusion model to make pixel-level edits; training uses reinforcement learning with rewards for reasoning accuracy and generation quality. Result: RIVER performs best on the newly built RVEBenchmark (100 videos and 519 implicit queries) and outperforms six baselines on two existing video editing benchmarks, VegGIE and FiVE, reaching state-of-the-art performance. Conclusion: RIVER is a first model for multi-hop reasoning video editing from implicit text instructions; by decoupling reasoning from generation, it improves complex semantic editing and offers a new paradigm for future intelligent video editing systems. Abstract: Text-driven video editing enables users to modify video content using only text queries. While existing methods can modify video content if explicit descriptions of editing targets with precise spatial locations and temporal boundaries are provided, these requirements become impractical when users attempt to conceptualize edits through implicit queries referencing semantic properties or object relationships. We introduce reasoning video editing, a task where video editing models must interpret implicit queries through multi-hop reasoning to infer editing targets before executing modifications, and a first model attempting to solve this complex task, RIVER (Reasoning-based Implicit Video Editor). RIVER decouples reasoning from generation through digital twin representations of video content that preserve spatial relationships, temporal trajectories, and semantic attributes. A large language model then processes this representation jointly with the implicit query, performing multi-hop reasoning to determine modifications, then outputs structured instructions that guide a diffusion-based editor to execute pixel-level changes. RIVER training uses reinforcement learning with rewards that evaluate reasoning accuracy and generation quality. Finally, we introduce RVEBenchmark, a benchmark of 100 videos with 519 implicit queries spanning three levels and categories of reasoning complexity specifically for reasoning video editing.
RIVER demonstrates best performance on the proposed RVEBenchmark and also achieves state-of-the-art performance on two additional video editing benchmarks (VegGIE and FiVE), where it surpasses six baseline methods.

[93] RTS-Mono: A Real-Time Self-Supervised Monocular Depth Estimation Method for Real-World Deployment

Zeyu Cheng,Tongfei Liu,Tao Lei,Xiang Hua,Yi Zhang,Chengkai Tang

Main category: cs.CV

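TL;DR: Proposes RTS-Mono, a lightweight and efficient real-time self-supervised monocular depth estimation method built on Lite-Encoder and a multi-scale sparse fusion decoder. It achieves SOTA performance on the KITTI dataset with only 3M parameters and runs real-time inference at 49 FPS on an Nvidia Jetson Orin.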

Details Motivation: Existing self-supervised monocular depth estimation models consume substantial computing resources, and lightweight methods often lose performance, which limits deployment in practical settings such as autonomous driving and robot navigation. Method: Proposes RTS-Mono, which uses Lite-Encoder as the encoder and a multi-scale sparse fusion framework as the decoder to reduce redundancy, preserve performance, and speed up inference. Result: On the KITTI dataset, Abs Rel and Sq Rel improve by 5.6% and 9.8% at low resolution, and Sq Rel and RMSE improve by 6.1% and 1.9% at high resolution; the model has only 3M parameters and runs real-time inference at 49 FPS on an Nvidia Jetson Orin. Conclusion: RTS-Mono substantially reduces model complexity while maintaining or improving estimation accuracy, balancing efficiency and performance in a way that suits real-world deployment. Abstract: Depth information is crucial for autonomous driving and intelligent robot navigation. The simplicity and flexibility of self-supervised monocular depth estimation are conducive to its role in these fields. However, most existing monocular depth estimation models consume many computing resources. Although some methods have reduced the model's size and improved computing efficiency, the performance deteriorates, seriously hindering the real-world deployment of self-supervised monocular depth estimation models. To address this problem, we propose a real-time self-supervised monocular depth estimation method and deploy it in the real world. The method, called RTS-Mono, is a lightweight and efficient encoder-decoder architecture. The encoder is based on Lite-Encoder, and the decoder is designed with a multi-scale sparse fusion framework to minimize redundancy, ensure performance, and improve inference speed. RTS-Mono achieved state-of-the-art (SoTA) performance in high and low resolutions with extremely low parameter counts (3 M) in experiments based on the KITTI dataset. Compared with lightweight methods, RTS-Mono improved Abs Rel and Sq Rel by 5.6% and 9.8% at low resolution and improved Sq Rel and RMSE by 6.1% and 1.9% at high resolution. In real-world deployment experiments, RTS-Mono has extremely high accuracy and can perform real-time inference on Nvidia Jetson Orin at a speed of 49 FPS. Source code is available at https://github.com/ZYCheng777/RTS-Mono.
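
For readers unfamiliar with the error metrics quoted above, the sketch below computes Abs Rel, Sq Rel, and RMSE using the definitions commonly used in KITTI depth evaluation. It assumes flat lists of valid predicted and ground-truth depths in the same units; the paper's exact evaluation protocol (depth caps, crops, median scaling) is not given in the abstract and is not reproduced here.

```python
import math


def depth_metrics(pred, gt):
    """Return (abs_rel, sq_rel, rmse) over paired depth values.

    pred, gt: equal-length lists of positive depths; lower is better for all three.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and the same length")
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    return abs_rel, sq_rel, rmse


abs_rel, sq_rel, rmse = depth_metrics(pred=[2.0, 4.0], gt=[1.0, 4.0])
```

Because all three are error measures, an "improvement" of 5.6% in Abs Rel means the error value decreased by that relative amount.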

[94] O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

Rishi Gupta,Mukilan Karuppasamy,Shyam Marjit,Aditay Tripathi,Anirban Chakraborty

Main category: cs.CV

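TL;DR: This paper presents a new large-scale dataset of image-sketch-instruction triplets and trains O3SLM, a large vision-language model, on it. The model substantially improves understanding of and reasoning over abstract hand-drawn sketches and reaches state-of-the-art results on multiple sketch-related tasks.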

Details Motivation: Existing large vision-language models (LVLMs) are limited in understanding abstract visual inputs such as hand-drawn sketches; the main bottleneck is the absence of a large-scale dataset that jointly models sketches, real images, and natural language instructions. Method: Builds a large-scale dataset of image-sketch-instruction triplets for pretraining and instruction tuning, and trains a new LVLM, O3SLM, on it. Result: Experiments on multiple sketch tasks (object localization, counting, image retrieval, and visual question answering) show that O3SLM achieves state-of-the-art performance on QuickDraw!, Sketchy, Tu Berlin, and the newly built SketchVCL dataset, substantially outperforming existing LVLMs. Conclusion: With a large-scale multimodal dataset and targeted training, O3SLM improves LVLM understanding of and reasoning over sketches, offering a feasible basis for future interactive applications that accept abstract visual input. Abstract: While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (i.e., SBIR and fine-grained SBIR), and (d) visual question answering (VQA), using the three existing sketch datasets QuickDraw!, Sketchy, and Tu Berlin along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.

[95] $A^2$GC: $A$symmetric $A$ggregation with Geometric Constraints for Locally Aggregated Descriptors

Zhenyu Li,Tianyi Shang

Main category: cs.CV

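TL;DR: Proposes $A^2$GC-VPR, an asymmetric aggregation method for visual place recognition that incorporates geometric constraints to improve matching accuracy and robustness.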

Details Motivation: Existing optimal-transport-based methods use the Sinkhorn algorithm, which treats the source and target marginals symmetrically and copes poorly when image features and cluster centers have very different distributions, limiting performance. Method: Proposes an asymmetric aggregation method that uses row-column normalization with separate marginal calibration to achieve asymmetric matching that adapts to distributional discrepancies, and introduces learnable coordinate embeddings to build geometric constraints that fuse spatial proximity with feature similarity. Result: Experiments on the MSLS, NordLand, and Pittsburgh datasets show that the method outperforms existing methods in matching accuracy and robustness. Conclusion: Through asymmetric aggregation and geometric constraints, $A^2$GC-VPR improves visual place recognition, particularly when feature distributions are mismatched and environments are complex. Abstract: Visual Place Recognition (VPR) aims to match query images against a database using visual cues. State-of-the-art methods aggregate features from deep backbones to form global descriptors. Optimal transport-based aggregation methods reformulate feature-to-cluster assignment as a transport problem, but the standard Sinkhorn algorithm symmetrically treats source and target marginals, limiting effectiveness when image features and cluster centers exhibit substantially different distributions. We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called $A^2$GC-VPR. Our method employs row-column normalization averaging with separate marginal calibration, enabling asymmetric matching that adapts to distributional discrepancies in visual place recognition. Geometric constraints are incorporated through learnable coordinate embeddings, computing compatibility scores fused with feature similarities, thereby promoting spatially proximal features to the same cluster and enhancing spatial awareness. Experimental results on MSLS, NordLand, and Pittsburgh datasets demonstrate superior performance, validating the effectiveness of our approach in improving matching accuracy and robustness.
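
As background for the symmetric baseline that this abstract contrasts against, here is a minimal sketch of standard Sinkhorn normalization for feature-to-cluster assignment. It shows only the baseline, in which row and column marginals are enforced in the same alternating fashion; it does not attempt to reproduce the proposed row-column normalization averaging, whose details are not in the abstract. The function name, `eps`, and the toy scores are assumptions.

```python
import math


def sinkhorn(scores, row_marginal, col_marginal, n_iters=200, eps=1.0):
    """Entropic optimal transport via alternating row/column scaling.

    scores:       n x m similarity matrix (features x clusters)
    row_marginal: target mass per feature (length n)
    col_marginal: target mass per cluster (length m); same total as row_marginal
    Returns an n x m assignment matrix whose sums match both marginals.
    """
    n, m = len(scores), len(scores[0])
    kernel = [[math.exp(s / eps) for s in row] for row in scores]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Scale rows to hit the row marginal, then columns to hit the column marginal.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


toy_scores = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.5]]
plan = sinkhorn(toy_scores, [1 / 3] * 3, [1 / 2] * 2)
row_sums = [sum(r) for r in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(2)]
```

Both marginals are fixed in advance and treated alike, which is the symmetry the abstract argues is a poor fit when features and cluster centers follow different distributions.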

[96] CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer

Srivathsan Sivakumar,Faisal Z. Qureshi

Main category: cs.CV

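TL;DR: Proposes Cascaded-ViT (CViT), a lightweight vision Transformer architecture whose new CCFFN module improves compute and energy efficiency, substantially lowering FLOPs and energy consumption while maintaining accuracy, which makes it suitable for resource-constrained devices.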

Details Motivation: Vision Transformers perform well but have high compute, memory, and energy demands that make them hard to deploy on resource-constrained devices, so more efficient model designs are needed. Method: Proposes Cascaded-ViT (CViT), which uses a new Cascaded-Chunk Feed Forward Network (CCFFN) that processes the input features in chunks to improve parameter and FLOP efficiency. Result: On ImageNet-1K, CViT-XL reaches 75.5% Top-1 accuracy with 15% fewer FLOPs and 3.3% lower energy consumption; the CViT family has the lowest energy consumption and ranks best on the Accuracy-Per-FLOP (APF) metric, and CViT-L is 2.2% more accurate than EfficientViT-M2 at a comparable APF. Conclusion: CViT strikes a better balance among accuracy, compute efficiency, and energy consumption, and is well suited to low-power deployments such as mobile devices and drones. Abstract: Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose Cascaded-ViT (CViT), a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called Cascaded-Chunk Feed Forward Network (CCFFN). By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our CViT-XL model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called Accuracy-Per-FLOP (APF), which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. Particularly, CViT-L is 2.2% more accurate than EfficientViT-M2 while having comparable APF scores.
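
The claim that splitting input features improves parameter efficiency can be checked with simple arithmetic. The sketch below compares a standard two-layer FFN with a chunked variant in which each chunk gets its own smaller FFN. That per-chunk layout is an assumption: the abstract says only that CCFFN splits the input features, and the cascade connections between chunks are not modeled. `ffn_params` and `chunked_ffn_params` are hypothetical helper names.

```python
def ffn_params(dim, expansion):
    """Weights and biases of a two-layer FFN: dim -> dim*expansion -> dim."""
    hidden = dim * expansion
    return dim * hidden + hidden + hidden * dim + dim


def chunked_ffn_params(dim, expansion, n_chunks):
    """Assumed layout: split channels into n_chunks, one small FFN per chunk."""
    if dim % n_chunks != 0:
        raise ValueError("dim must be divisible by n_chunks")
    return n_chunks * ffn_params(dim // n_chunks, expansion)


full = ffn_params(dim=8, expansion=2)
chunked = chunked_ffn_params(dim=8, expansion=2, n_chunks=2)
```

Under this assumption the weight count drops roughly by a factor of `n_chunks`, because each of the `n_chunks` blocks is quadratic in `dim / n_chunks` rather than in `dim`.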

[97] Coffee: Controllable Diffusion Fine-tuning

Ziyao Zeng,Jingcheng Ni,Ruyi Liu,Alex Wong

Main category: cs.CV

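TL;DR: Proposes Coffee, a method that uses language to specify undesired concepts and regularizes the fine-tuning of text-to-image diffusion models so that the model does not learn those concepts.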

Details Motivation: To control fine-tuning so that the model does not learn undesired concepts present in the data and does not entangle them with user prompts, in support of downstream tasks such as bias mitigation and preventing malicious adaptation. Method: Regularizes adaptation by keeping the embeddings of the user prompt from aligning with undesired concepts; the undesired concepts are specified flexibly through language descriptions, with no additional training. Result: Experiments show that Coffee prevents text-to-image models from learning the specified undesired concepts during fine-tuning and outperforms existing methods. Conclusion: Coffee offers a flexible approach to controllable diffusion fine-tuning that needs no additional training and avoids learning and entangling undesired concepts. Abstract: Text-to-image diffusion models can generate diverse content with flexible prompts, which makes them well-suited for customization through fine-tuning with a small amount of user-provided data. However, controllable fine-tuning that prevents models from learning undesired concepts present in the fine-tuning data, and from entangling those concepts with user prompts, remains an open challenge. It is crucial for downstream tasks like bias mitigation, preventing malicious adaptation, attribute disentanglement, and generalizable fine-tuning of diffusion policy. We propose Coffee, which allows using language to specify undesired concepts to regularize the adaptation process. The crux of our method lies in keeping the embeddings of the user prompt from aligning with undesired concepts. Crucially, Coffee requires no additional training and enables flexible modification of undesired concepts by modifying textual descriptions. We evaluate Coffee by fine-tuning on images associated with user prompts paired with undesired concepts. Experimental results demonstrate that Coffee can prevent text-to-image models from learning specified undesired concepts during fine-tuning and outperforms existing methods. Code will be released upon acceptance.

[98] Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models

Hao Zhen,Yunxiang Yang,Jidong J. Yang

Main category: cs.CV

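TL;DR: This paper proposes MP-PVIR, a multi-view phase-aware pedestrian-vehicle incident reasoning framework that converts multi-view video streams into structured diagnostic reports in four stages. It combines vision-language models with a large language model to segment the cognitive phases of pedestrian behavior, perform multi-view analysis and causal reasoning, and generate comprehensive reports covering scene understanding, behavior interpretation, and prevention recommendations.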

Details Motivation: Existing video-based systems can detect pedestrian-vehicle incidents but offer little understanding of how pedestrian behavior evolves across cognitive phases during an incident; in addition, current vision-language models mostly process videos in isolation, without explicit temporal structure or multi-view integration. Method: Proposes the MP-PVIR framework with four stages: event-triggered multi-view video acquisition, pedestrian behavior phase segmentation, phase-specific multi-view reasoning, and hierarchical synthesis and diagnostic reasoning. TG-VLM performs behavior phase segmentation, PhaVR-VLM performs phase-aware multi-view analysis, and a large language model generates the structured diagnostic report. Result: Validated on the Woven Traffic Safety dataset, TG-VLM reaches an mIoU of 0.4881, and PhaVR-VLM scores 33.063 on captioning with up to 64.70% question-answering accuracy; the system produces actionable insights that include causal chains and prevention strategies. Conclusion: By integrating multi-view video with modeling of cognitive behavior phases, MP-PVIR improves the granularity and interpretability of incident understanding and advances AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems. Abstract: Pedestrian-vehicle incidents remain a critical urban safety challenge, with pedestrians accounting for over 20% of global traffic fatalities. Although existing video-based systems can detect when incidents occur, they provide little insight into how these events unfold across the distinct cognitive phases of pedestrian behavior. Recent vision-language models (VLMs) have shown strong potential for video understanding, but they remain limited in that they typically process videos in isolation, without explicit temporal structuring or multi-view integration. This paper introduces Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning (MP-PVIR), a unified framework that systematically processes multi-view video streams into structured diagnostic reports through four stages: (1) event-triggered multi-view video acquisition, (2) pedestrian behavior phase segmentation, (3) phase-specific multi-view reasoning, and (4) hierarchical synthesis and diagnostic reasoning. The framework operationalizes behavioral theory by automatically segmenting incidents into cognitive phases, performing synchronized multi-view analysis within each phase, and synthesizing results into causal chains with targeted prevention strategies. In particular, two specialized VLMs underpin the MP-PVIR pipeline: TG-VLM for behavioral phase segmentation (mIoU = 0.4881) and PhaVR-VLM for phase-aware multi-view analysis, achieving a captioning score of 33.063 and up to 64.70% accuracy on question answering.
Finally, a designated large language model is used to generate comprehensive reports detailing scene understanding, behavior interpretation, causal reasoning, and prevention recommendations. Evaluation on the Woven Traffic Safety dataset shows that MP-PVIR effectively translates multi-view video data into actionable insights, advancing AI-driven traffic safety analytics for vehicle-infrastructure cooperative systems.
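
The mIoU figure reported for phase segmentation can be made concrete with a sketch of temporal intersection-over-union. It assumes that each behavioral phase is a (start, end) interval in seconds and that mIoU is the plain mean of per-phase IoU values; the abstract does not define the metric, so this is an illustrative reading rather than the paper's evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) with start <= end."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Mean temporal IoU over paired predicted and ground-truth phases."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


one_phase = temporal_iou((0.0, 10.0), (5.0, 15.0))
two_phases = mean_iou([(0.0, 10.0), (0.0, 4.0)], [(0.0, 10.0), (2.0, 4.0)])
```

Under this reading, an mIoU near 0.49 would mean that predicted and annotated phase boundaries overlap by roughly half on average.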

[99] Attention Via Convolutional Nearest Neighbors

Mingi Kang,Jeová Farias Sales Rocha Neto

Main category: cs.CV

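TL;DR: This paper proposes ConvNN, a framework that unifies convolution and self-attention. It uses k-nearest-neighbor aggregation to expose what the two share in neighbor selection and aggregation, and validates the approach on image classification.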

Details Motivation: Convolutional neural networks and Transformers are viewed as fundamentally different architectures, but the authors argue that convolution and self-attention share a common core and can be modeled in a unified way. Method: Proposes the Convolutional Nearest Neighbors (ConvNN) framework, which treats convolution (based on spatial proximity) and attention (based on feature similarity) as special cases of k-nearest-neighbor aggregation and can be used as a drop-in module in existing models. Result: On CIFAR-10/100, ConvNN outperforms standard convolution, standard attention, and other variants both as a hybrid branch in VGG and inside ViT; tuning k balances local and global receptive fields and has a regularizing effect. Conclusion: Convolution and attention lie on the same continuous spectrum, and ConvNN provides a unifying view that can help in designing more principled and interpretable vision models. Abstract: The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and aggregation; convolution selects neighbors by spatial proximity, while attention selects by feature similarity, revealing they exist on a continuous spectrum. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. Crucially, ConvNN serves as a drop-in replacement for convolutional and attention layers, enabling systematic exploration of the intermediate spectrum between these two extremes. We validate the framework's coherence on CIFAR-10 and CIFAR-100 classification tasks across two complementary architectures: (1) Hybrid branching in VGG improves accuracy on both CIFAR datasets by combining spatial-proximity and feature-similarity selection; and (2) ConvNN in ViT outperforms standard attention and other attention variants on both datasets. Extensive ablations on $k$ values and architectural variants reveal that interpolating along this spectrum provides regularization benefits by balancing local and global receptive fields. Our work provides a unifying framework that dissolves the apparent distinction between convolution and attention, with implications for designing more principled and interpretable vision architectures.
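
The central observation, that convolution and attention differ mainly in how neighbors are selected, can be sketched on a 1D token sequence. This is an illustrative toy and not the ConvNN layer itself: it uses a plain mean in place of learned aggregation weights, dot-product similarity for the feature mode, and the hypothetical name `knn_aggregate`.

```python
def knn_aggregate(features, k, mode):
    """Average each token's k nearest neighbors (including itself).

    mode="spatial": neighbors are the closest positions (convolution-like).
    mode="feature": neighbors have the highest dot-product similarity
                    (attention-like).
    """
    n = len(features)
    out = []
    for i in range(n):
        if mode == "spatial":
            ranked = sorted(range(n), key=lambda j: (abs(i - j), j))
        elif mode == "feature":
            ranked = sorted(
                range(n),
                key=lambda j: (-sum(a * b for a, b in zip(features[i], features[j])), j),
            )
        else:
            raise ValueError("mode must be 'spatial' or 'feature'")
        neighbors = ranked[:k]
        dim = len(features[i])
        # Assumption: uniform averaging; a real layer would learn the weights.
        out.append([sum(features[j][d] for j in neighbors) / len(neighbors)
                    for d in range(dim)])
    return out


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
spatial_out = knn_aggregate(feats, k=2, mode="spatial")
feature_out = knn_aggregate(feats, k=2, mode="feature")
```

On this alternating sequence, spatial selection blends adjacent but dissimilar tokens, while feature selection pulls together distant tokens with matching features; varying `k` moves between a local window and a global view.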

[100] SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

An Yu,Weiheng Lu,Jian Li,Zhenfei Zhang,Yunhang Shen,Felix X. -F. Ye,Ming-Ching Chang

Main category: cs.CV

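TL;DR: Proposes SMART, a video moment retrieval framework built on a multimodal large language model. It combines audio cues with shot-level temporal structure and uses shot-aware token compression to preserve fine-grained temporal information, significantly outperforming existing methods on several benchmarks.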

Details Motivation: Existing video moment retrieval methods mostly rely on coarse temporal understanding and a single visual modality, which makes fine-grained localization in complex videos difficult and limits performance. Method: The paper proposes the SMART framework, which fuses audio and visual features, introduces a shot-aware token compression mechanism that selectively retains high-information tokens within each shot to reduce redundancy, and refines prompt design to make better use of audio-visual cues. Result: SMART achieves significant gains on Charades-STA and QVHighlights, including a 1.61% increase in R1@0.5 and a 2.59% increase in R1@0.7 on Charades-STA. Conclusion: By integrating the audio modality and shot-level temporal structure, SMART strengthens multimodal representations and improves the accuracy of video moment retrieval, particularly for fine-grained temporal localization. Abstract: Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce Shot-aware Multimodal Audio-enhanced Retrieval of Temporal Segments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and 2.59% gain in R1@0.7 on Charades-STA.
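
A rough sketch of what per-shot token retention could look like follows. The abstract does not say how token informativeness is scored or how many tokens survive per shot, so the `scores` input, the fixed `keep_per_shot` budget, and the function name `compress_by_shot` are hypothetical.

```python
from collections import defaultdict

def compress_by_shot(scores, shot_ids, keep_per_shot):
    """Return indices of the tokens kept, in temporal order.

    scores[i] is an assumed informativeness score for token i and
    shot_ids[i] is the shot that token belongs to.
    """
    by_shot = defaultdict(list)
    for idx, shot in enumerate(shot_ids):
        by_shot[shot].append(idx)
    kept = []
    for indices in by_shot.values():
        # Rank tokens inside one shot only, so every shot keeps some tokens.
        top = sorted(indices, key=lambda i: scores[i], reverse=True)[:keep_per_shot]
        kept.extend(top)
    return sorted(kept)

scores = [0.9, 0.1, 0.4, 0.2, 0.8, 0.7]
shot_ids = [0, 0, 0, 1, 1, 1]
kept = compress_by_shot(scores, shot_ids, keep_per_shot=2)
```

Ranking within each shot, rather than globally, guarantees that a short or visually quiet shot is not dropped entirely, which is one plausible reading of how shot structure helps preserve temporal detail.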

[101] iGaussian: Real-Time Camera Pose Estimation via Feed-Forward 3D Gaussian Splatting Inversion

Hao Wang,Linqing Zhao,Xiuwei Xu,Jiwen Lu,Haibin Yan

Main category: cs.CV

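TL;DR: This paper proposes iGaussian, a two-stage feed-forward framework based on a 3D Gaussian representation. It estimates camera pose from a single image in real time through direct inversion, avoiding the traditional render-compare-refine loop and significantly improving speed and accuracy.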

Details Motivation: Existing SLAM and visual navigation methods rely on an iterative render-compare-refine loop whose computational cost is high and hard to reconcile with the real-time needs of robotics. Method: iGaussian has two stages. The first regresses a coarse 6DoF pose with a pose regression network based on a Gaussian scene prior (combining spatial uniform sampling with guided attention); the second refines the pose through feature matching and multi-model fusion. The core contributions are a cross-correlation module that needs no differentiable rendering and a weighted multiview predictor that fuses multiple strategically sampled viewpoints. Result: Experiments on NeRF Synthetic, Mip-NeRF 360, and T&T+DB show median rotation error reduced to 0.2° and tracking at 2.87 FPS on mobile robots, a 10x speedup over optimization-based methods. Conclusion: Through direct 3D Gaussian inversion, iGaussian achieves efficient and accurate real-time pose estimation and offers a new approach to visual navigation based on Gaussian representations. Abstract: Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative render-compare-refine loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally, discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2° while achieving 2.87 FPS tracking on mobile robots, which is an impressive 10 times speedup compared to optimization-based approaches.
Code: https://github.com/pythongod-exe/iGaussian

[102] Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion

Laura Dodds,Maisy Lam,Waleed Akbar,Yibo Cheng,Fadel Adib

Main category: cs.CV

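TL;DR: Wave-Former is a new method for high-accuracy 3D shape reconstruction from millimeter-wave wireless signals. It reconstructs the complete geometry of diverse everyday objects under full occlusion and significantly improves recall over existing methods.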

Details Motivation: Existing millimeter-wave 3D reconstruction methods suffer from limited coverage and high noise, so they struggle to recover the complete shape of occluded objects; a new method is needed to overcome these limits. Method: The paper proposes a three-stage, physics-aware shape completion pipeline: it first generates candidate geometric surfaces, then applies a Transformer-based shape completion model designed specifically for millimeter-wave signals, and finally refines the result with entropy-guided surface selection. The whole model is trained on synthetic point clouds and maps wireless signals to vision-grade shape completion. Result: Compared with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%, and generalizes well to real-world data. Conclusion: By combining the physical properties of millimeter waves with advanced shape completion, Wave-Former achieves high-quality 3D shape reconstruction under complex occlusion, with broad potential applications in robotics, augmented reality, and logistics. Abstract: We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former's design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model designed specifically for mmWave signals, and finally performs entropy-guided surface selection. This enables Wave-Former to be trained using entirely synthetic point-clouds, while demonstrating impressive generalization to real-world data. In head-to-head comparisons with state-of-the-art baselines, Wave-Former raises recall from 54% to 72% while maintaining a high precision of 85%.

[103] Learning Representation and Synergy Invariances: A Provable Framework for Generalized Multimodal Face Anti-Spoofing

Xun Lin,Shuai Wang,Yi Yu,Zitong Yu,Jiale Zhou,Yizhong Liu,Xiaochun Cao,Alex Kot,Yefeng Zheng

Main category: cs.CV

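TL;DR: Proposes RiSe, a new multimodal face anti-spoofing framework that improves cross-domain generalization by addressing the representation invariance risk and the modal synergy invariance risk.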

Details Motivation: Existing cross-domain multimodal face anti-spoofing methods degrade under domain shift in modal representations and in inter-modal synergy, mainly because of class asymmetry and spurious modal correlations. Method: The paper proposes the RiSe framework, which comprises AsyIRM (learning an invariant spherical decision boundary in radial space) and MMSD (strengthening the disentanglement of modal synergy through cross-sample mixing and disentanglement). Result: Theoretical analysis and experiments show that RiSe achieves state-of-the-art performance in multiple cross-domain scenarios and effectively mitigates the generalization problem of multimodal FAS. Conclusion: By modeling representation and synergy invariance, RiSe significantly improves the robustness and generalization of multimodal face anti-spoofing in unseen domains. Abstract: Multimodal Face Anti-Spoofing (FAS) methods, which integrate multiple visual modalities, often suffer even more severe performance degradation than unimodal FAS when deployed in unseen domains. This is mainly due to two overlooked risks that affect cross-domain multimodal generalization. The first is the modal representation invariant risk, i.e., whether representations remain generalizable under domain shift. We theoretically show that the inherent class asymmetry in FAS (diverse spoofs vs. compact reals) enlarges the upper bound of generalization error, and this effect is further amplified in multimodal settings. The second is the modal synergy invariant risk, where models overfit to domain-specific inter-modal correlations. Such spurious synergy cannot generalize to unseen attacks in target domains, leading to performance drops. To solve these issues, we propose a provable framework, namely Multimodal Representation and Synergy Invariance Learning (RiSe). For representation risk, RiSe introduces Asymmetric Invariant Risk Minimization (AsyIRM), which learns an invariant spherical decision boundary in radial space to fit asymmetric distributions, while preserving domain cues in angular space. For synergy risk, RiSe employs Multimodal Synergy Disentanglement (MMSD), a self-supervised task enhancing intrinsic, generalizable modal features via cross-sample mixing and disentanglement. Theoretical analysis and experiments verify RiSe, which achieves state-of-the-art cross-domain performance.

[104] MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

Huiyi Chen,Jiawei Peng,Dehai Min,Changchang Sun,Kaijie Chen,Yan Yan,Xu Yang,Lu Cheng

Main category: cs.CV

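TL;DR: This paper proposes MVI-Bench, the first comprehensive benchmark designed specifically to evaluate how misleading visual inputs affect the robustness of large vision-language models (LVLMs). Grounded in fundamental visual primitives, the benchmark covers three levels of misleading visual input (visual concept, visual attribute, and visual relationship) and comprises six representative categories and 1,248 annotated VQA instances. The authors also propose a new fine-grained evaluation metric, MVI-Sensitivity. Experiments on 18 state-of-the-art LVLMs reveal pronounced vulnerability to misleading visual inputs, and the analysis offers practical guidance for improving LVLM robustness and reliability.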

Details Motivation: Existing LVLM robustness benchmarks mostly focus on misleading textual inputs or hallucination and overlook the significant challenge that misleading visual inputs pose when assessing visual understanding. Filling this gap requires a systematic benchmark that targets visual-level misleading specifically. Method: The paper proposes MVI-Bench, which uses a three-level taxonomy grounded in fundamental visual primitives (visual concept, attribute, and relationship) to build a dataset of 1,248 expert-annotated VQA samples across six categories. It also proposes the MVI-Sensitivity metric, which quantifies at a fine-grained level how sensitive an LVLM is to misleading visual inputs. Result: Experiments on 18 advanced LVLMs show that current models are generally highly sensitive to misleading visual inputs, exposing serious robustness deficiencies. MVI-Sensitivity enables fine-grained comparison between models and identifies vulnerability patterns across models and architectures. Conclusion: MVI-Bench effectively exposes the weaknesses of LVLMs in handling misleading visual inputs and provides an important evaluation tool and direction for improving the robustness of visual understanding. Abstract: Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.

[105] AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs

Xinliang Zhang,Lei Zhu,Hangzhou He,Shuang Zeng,Ourui Fu,Jiakui Hu,Zhengjian Yao,Yanye Lu

Main category: cs.CV

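TL;DR: Proposes an adaptive compression method based on object-level token merging that substantially reduces the number of image tokens in multimodal large language models while maintaining high performance.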

Details Motivation: Traditional patch-based tokenization incurs heavy computation and memory costs and is misaligned with the human visual cognition system, which tends to cause hallucination and redundancy. Method: The paper proposes an object-level token merging strategy that achieves adaptive token compression and is more consistent with human visual cognition. Result: Experiments on multiple benchmarks show that the method reaches about 96% of the original model's performance with only 10% of the tokens, outperforming existing methods. Conclusion: The method effectively balances compression ratio against performance and improves both the efficiency of MLLMs and their consistency with visual cognition. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated substantial value in unified text-image understanding and reasoning, primarily by converting images into sequences of patch-level tokens that align with their architectural paradigm. However, patch-level tokenization leads to a quadratic growth in image tokens, burdening MLLMs' understanding and reasoning with enormous computation and memory. Additionally, the traditional patch-wise scanning tokenization workflow misaligns with the human vision cognition system, further leading to hallucination and computational redundancy. To address this issue, we propose an object-level token merging strategy for Adaptive Token compression that is consistent with the human vision system. The experiments are conducted on multiple comprehensive benchmarks, which show that our approach uses, on average, only 10% of the tokens while achieving almost 96% of the vanilla model's performance. More extensive experimental results in comparison with relevant works demonstrate the superiority of our method in balancing compression ratio and performance. Our code will be available.
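
One simple way to picture object-level token merging is to average the patch embeddings that fall on the same object. The sketch below assumes that an object label per patch is already available (for example from a segmentation mask) and that merging is a plain mean; neither detail is stated in the abstract, and `merge_tokens_by_object` is a hypothetical name.

```python
def merge_tokens_by_object(patch_tokens, object_ids):
    """Collapse patch-level tokens into one mean token per object label."""
    sums, counts = {}, {}
    for token, obj in zip(patch_tokens, object_ids):
        if obj not in sums:
            sums[obj] = [0.0] * len(token)
            counts[obj] = 0
        sums[obj] = [s + t for s, t in zip(sums[obj], token)]
        counts[obj] += 1
    return {obj: [s / counts[obj] for s in sums[obj]] for obj in sums}

# Five patch tokens covering two objects collapse to two tokens.
patch_tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
object_ids = ["cat", "cat", "sofa", "sofa", "sofa"]
merged = merge_tokens_by_object(patch_tokens, object_ids)
```

The token count then scales with the number of objects in the image rather than with its resolution, which is what makes the compression adaptive.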

[106] DoGCLR: Dominance-Game Contrastive Learning Network for Skeleton-Based Action Recognition

Yanshan Li,Ke Ma,Miaomiao Wei,Linhui Dai

Main category: cs.CV

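TL;DR: This paper proposes DoGCLR, a game-theoretic self-supervised framework for skeleton-based action recognition. It constructs positive and negative samples through a dynamic dominance game, combines spatio-temporal dual weight localization with an entropy-driven dominance strategy, and surpasses existing methods on several datasets.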

Details Motivation: Existing self-supervised contrastive learning methods treat all skeleton regions uniformly and store negative samples in a FIFO queue, which causes loss of motion information and suboptimal negative sample selection. Method: The paper proposes the DoGCLR framework, which models the construction of positive and negative samples as a dynamic dominance game; designs a spatio-temporal dual weight localization mechanism to increase motion diversity while preserving semantics; and manages the memory bank with an entropy-driven dominance strategy that retains high-entropy hard negatives. Result: Experiments on NTU RGB+D and PKU-MMD show that DoGCLR surpasses existing methods in several settings, for example reaching 81.1%/89.4% accuracy on NTU RGB+D 60 X-Sub/X-View and a 1.9% improvement on PKU-MMD Part II. Conclusion: DoGCLR improves representation learning for skeleton-based action recognition, is more robust in complex scenarios, and demonstrates the potential of combining game theory with self-supervised learning. Abstract: Existing self-supervised contrastive learning methods for skeleton-based action recognition often process all skeleton regions uniformly, and adopt a first-in-first-out (FIFO) queue to store negative samples, which leads to motion information loss and non-optimal negative sample selection. To address these challenges, this paper proposes Dominance-Game Contrastive Learning network for skeleton-based action Recognition (DoGCLR), a self-supervised framework based on game theory. DoGCLR models the construction of positive and negative samples as a dynamic Dominance Game, where both sample types interact to reach an equilibrium that balances semantic preservation and discriminative strength. Specifically, a spatio-temporal dual weight localization mechanism identifies key motion regions and guides region-wise augmentations to enhance motion diversity while maintaining semantics. In parallel, an entropy-driven dominance strategy manages the memory bank by retaining high entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. Extensive experiments are conducted on NTU RGB+D and PKU-MMD datasets. On NTU RGB+D 60 X-Sub/X-View, DoGCLR achieves 81.1%/89.4% accuracy, and on NTU RGB+D 120 X-Sub/X-Set, DoGCLR achieves 71.2%/75.5% accuracy, surpassing state-of-the-art methods by 0.1%, 2.7%, 1.1%, and 2.3%, respectively. On PKU-MMD Part I/Part II, DoGCLR performs comparably to the state-of-the-art methods and achieves a 1.9% higher accuracy on Part II, highlighting its strong robustness on more challenging scenarios.
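
The contrast between a FIFO queue and an entropy-driven memory bank can be sketched as below. How DoGCLR actually scores entropy is not given in the abstract; here it is assumed to be the Shannon entropy of a probability vector attached to each negative, and `update_bank` is a hypothetical helper.

```python
from math import log

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * log(p) for p in probs if p > 0)

def update_bank(bank, candidate, capacity):
    """bank and candidate hold (entropy, sample_id) pairs.

    Unlike a FIFO queue, a full bank evicts its lowest-entropy (weakest)
    entry, and only when the candidate has higher entropy.
    """
    if len(bank) < capacity:
        return bank + [candidate]
    weakest = min(range(len(bank)), key=lambda i: bank[i][0])
    if candidate[0] > bank[weakest][0]:
        return bank[:weakest] + [candidate] + bank[weakest + 1:]
    return bank

bank = []
for sample_id, probs in [("a", [0.5, 0.5]), ("b", [0.9, 0.1]),
                         ("c", [0.6, 0.4]), ("d", [0.99, 0.01])]:
    bank = update_bank(bank, (entropy(probs), sample_id), capacity=2)
kept_ids = [sample_id for _, sample_id in bank]
```

In this toy run the bank ends up holding the two highest-entropy samples regardless of arrival order, whereas a FIFO queue of the same size would hold the two most recent ones.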

[107] UniSER: A Foundation Model for Unified Soft Effects Removal

Jingdong Zhang,Lingzhi Zhang,Qing Liu,Mang Tik Chiu,Connelly Barnes,Yizhou Wang,Haoran You,Xiaoyang Liu,Yuqian Zhou,Zhe Lin,Eli Shechtman,Sohrab Amirghodsi,Xin Li,Wenping Wang,Xiaohang Zhan

Main category: cs.CV

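TL;DR: This paper proposes UniSER, a general image restoration model for the many kinds of degradation caused by soft effects such as lens flare, haze, shadows, and reflections. By building a large-scale dataset of 3.8 million pairs and using a training framework based on a Diffusion Transformer, UniSER restores semi-transparent-occlusion degradations efficiently within a single model and outperforms both specialist and generalist models in real-world scenes.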

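Details Motivation: Existing image restoration methods mostly design specialist models for specific degradations and lack generalization, while generalist models offer broad editing capabilities but perform poorly on fine-grained restoration, struggling to remove soft effects robustly while preserving scene identity. A highly robust model that handles diverse soft-effect degradations in a unified way is therefore needed. Method: Building on the common essence of soft effects, semi-transparent occlusion, the authors construct a large-scale dataset of 3.8M image pairs that includes physically plausible synthetic degradations to fill gaps in public benchmarks. On this basis they fine-tune a Diffusion Transformer, add fine-grained mask and strength controls, and form a tailored training pipeline that learns general restoration priors. Result: UniSER significantly outperforms existing specialist models and generalist image editing models (such as GPT-4o, Flux Kontext, and Nano Banana) on multiple soft-effect restoration tasks, and achieves more robust, higher-fidelity restoration in complex real-world scenes, supporting its generalization and practicality. Conclusion: As a foundational, versatile model, UniSER brings multiple soft-effect degradation problems into a single framework and shows that general image restoration is feasible with large-scale diverse data and a targeted training strategy, pointing to a new direction for future image enhancement systems. Abstract: Digital images are often degraded by soft effects such as lens flare, haze, shadows, and reflections, which reduce aesthetics even though the underlying pixels remain partially visible. The prevailing works address these degradations in isolation, developing highly specialized models that lack scalability and fail to exploit the shared underlying essences of these restoration problems. Recent large-scale pretrained generalist models offer powerful, text-driven image editing capabilities, but such general-purpose systems (e.g., GPT-4o, Flux Kontext, Nano Banana) require detailed prompts and often fail to achieve robust removal on these fine-grained tasks or to preserve the identity of the scene. Leveraging the common essence of soft effects, i.e., semi-transparent occlusions, we introduce a foundational versatile model UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization, which includes novel, physically-plausible data to fill critical gaps in public benchmarks, and a tailored training pipeline that fine-tunes a Diffusion Transformer to learn robust restoration priors from this diverse data, integrating fine-grained mask and strength controls.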
This synergistic approach allows UniSER to significantly outperform both specialist and generalist models, achieving robust, high-fidelity restoration in the wild.

[108] GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation

Xuan Zhao,Zhongyu Zhang,Yuge Huang,Yuxi Mi,Guodong Mu,Shouhong Ding,Jun Wang,Rizen Guo,Shuigeng Zhou

Main category: cs.CV

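TL;DR: Proposes GloTok, a global-perspective tokenizer that uses global relational information to obtain a more uniform semantic distribution and improve the quality of image reconstruction and generation.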

Details Motivation: Existing image tokenization methods rely on local supervision, which produces an uneven semantic distribution and limits generation performance. Because a more uniform feature distribution (as shown by VA-VAE) improves generation, a global perspective is needed to optimize the semantic distribution. Method: The paper proposes GloTok, which uses codebook-wise histogram relation learning to transfer the semantics that pre-trained models capture over the entire dataset into a semantic codebook, and designs a residual learning module to recover the detail lost to quantization. Result: GloTok achieves state-of-the-art image reconstruction performance and generation quality on ImageNet-1k without requiring direct access to pre-trained models during training. Conclusion: By introducing global relational information and residual detail recovery, GloTok produces more uniform semantic latent representations and improves the performance of autoregressive image generation models. Abstract: Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module that recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.

[109] PAVE: An End-to-End Dataset for Production Autonomous Vehicle Evaluation

Xiangyu Li,Chen Wang,Yumao Liu,Dengbo He,Jiahao Zhang,Ke Ma

Main category: cs.CV

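TL;DR: This paper presents the first end-to-end benchmark dataset collected entirely in autonomous-driving mode in the real world. It contains over 100 hours of naturalistic driving data with high-precision sensor data and rich scenario annotations, and is intended for evaluating the behavioral safety of autonomous vehicles.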

Details Motivation: Most existing autonomous-driving datasets are collected in human-driving mode and cannot faithfully reflect the behavioral safety of autonomous driving systems, so a dataset collected entirely in autonomous-driving mode is needed to evaluate how AVs actually behave. Method: The authors collect over 100 hours of naturalistic driving data from multiple production autonomous-driving vehicle models and segment it into 32,727 key frames. Each frame includes four synchronized camera images, high-precision localization data, and 20 Hz vehicle trajectories, with detailed 2D annotations and scenario attribute labels. Result: The resulting key-frame dataset carries rich semantic and dynamic information and supports trajectory prediction 5 s into the future. An end-to-end planning model reaches an Average Displacement Error (ADE) of 1.4 m on this dataset, and the data grows by more than 10 hours per week. Conclusion: The dataset provides a sustainable, realistic, and comprehensive foundation for analyzing autonomous driving behavior and evaluating safety, helping autonomous driving technology move in a safer direction. Abstract: Most existing autonomous-driving datasets (e.g., KITTI, nuScenes, and the Waymo Perception Dataset), collected by human-driving mode or unidentified driving mode, can only serve as early training for the perception and prediction of autonomous vehicles (AVs). To evaluate the real behavioral safety of AVs controlled in the black box, we present the first end-to-end benchmark dataset collected entirely by autonomous-driving mode in the real world. This dataset contains over 100 hours of naturalistic data from multiple production autonomous-driving vehicle models in the market. We segment the original data into 32,727 key frames, each consisting of four synchronized camera images and high-precision GNSS/IMU data (0.8 cm localization accuracy). For each key frame, 20 Hz vehicle trajectories spanning the past 6 s and future 5 s are provided, along with detailed 2D annotations of surrounding vehicles, pedestrians, traffic lights, and traffic signs. These key frames have rich scenario-level attributes, including driver intent, area type (covering highways, urban roads, and residential areas), lighting (day, night, or dusk), weather (clear or rain), road surface (paved or unpaved), traffic and vulnerable road users (VRU) density, traffic lights, and traffic signs (warning, prohibition, and indication). To evaluate the safety of AVs, we employ an end-to-end motion planning model that predicts vehicle trajectories with an Average Displacement Error (ADE) of 1.4 m on autonomous-driving frames.
The dataset continues to expand by over 10 hours of new data weekly, thereby providing a sustainable foundation for research on AV driving behavior analysis and safety evaluation.
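
For readers unfamiliar with the metric, Average Displacement Error is conventionally the mean Euclidean distance between predicted and ground-truth positions over the prediction horizon. The sketch below follows that common definition; the abstract does not spell out the exact variant used for the 1.4 m figure, and the three-waypoint trajectories are toy values (a 5 s horizon at 20 Hz would have 100 waypoints).

```python
from math import dist

def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching waypoints, in input units."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)

# Toy example in metres: the prediction drifts sideways by 0, 1 and 2 m.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade = average_displacement_error(predicted, ground_truth)
```

Because ADE averages over all waypoints, a prediction that is accurate early and drifts late can score the same as one that is uniformly slightly off.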

[110] Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation

Zhaoyu Liu,Kan Jiang,Murong Ma,Zhe Hou,Yun Lin,Jin Song Dong

Main category: cs.CV

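TL;DR: Proposes the Unified Multi-Entity Graph Network (UMEG-Net) for few-shot precise event spotting. It fuses human skeletons with sport-specific object keypoints and combines spatio-temporal graph convolution with multimodal knowledge distillation, significantly improving performance when data is limited.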

Details Motivation: Existing methods depend on large labeled datasets and on pixel or pose inputs, perform poorly in few-shot conditions, and large-scale annotation is hard to obtain, so a more efficient and scalable few-shot event spotting method is needed. Method: The paper proposes UMEG-Net, which builds human skeletons and sport-related object keypoints into a unified graph, designs a spatio-temporal extraction module based on advanced graph convolution and multi-scale temporal shift, and uses multimodal distillation to transfer knowledge from keypoint graphs to visual representations. Result: The method significantly outperforms baseline models in few-shot settings and remains robust with limited labeled data, supporting its effectiveness and scalability. Conclusion: By fusing multi-entity information with multimodal knowledge distillation, UMEG-Net offers an efficient and scalable solution for few-shot precise event spotting. Abstract: Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to rapid succession, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions due to their dependence on pixel- or pose-based inputs alone. However, obtaining large labeled datasets is practically hard. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at https://github.com/LZYAndy/UMEG-Net.

[111] Hierarchical Semantic Learning for Multi-Class Aorta Segmentation

Pengcheng Shi

Main category: cs.CV

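TL;DR: Proposes a hierarchical semantic learning method based on curriculum learning and a fractal softmax to improve the accuracy and efficiency of 3D segmentation of the aorta and its branch vessels. It significantly improves Dice scores, accelerates model convergence, and is suitable for real-time clinical use.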

Details Motivation: Existing methods overlook the hierarchical anatomical relationships of vascular structures and struggle with severe class imbalance, which hinders precise segmentation of aortic pathologies. Method: The paper introduces a curriculum learning strategy combined with a newly proposed fractal softmax to learn anatomical structures progressively from simple to complex, adopts a two-stage inference strategy to improve computational efficiency, and designs a hierarchical semantic loss to strengthen feature representations. Result: On the validation set the method improves the Dice score of nnU-Net ResEnc M by 11.65%; on the test set it scores 5.6% higher than baselines; inference is accelerated by up to five times. Conclusion: The framework improves the accuracy and efficiency of segmenting the aorta and its branches, is clinically practical, and is suited to real-time medical image analysis. Abstract: The aorta, the body's largest artery, is prone to pathologies such as dissection, aneurysm, and atherosclerosis, which often require timely intervention. Minimally invasive repairs involving branch vessels necessitate detailed 3D anatomical analysis. Existing methods often overlook hierarchical anatomical relationships while struggling with severe class imbalance inherent in vascular structures. We address these challenges with a curriculum learning strategy that leverages a novel fractal softmax for hierarchical semantic learning. Inspired by human cognition, our approach progressively learns anatomical constraints by decomposing complex structures from simple to complex components. The curriculum learning framework naturally addresses class imbalance by first establishing robust feature representations for dominant classes before tackling rare but anatomically critical structures, significantly accelerating model convergence in multi-class scenarios. Our two-stage inference strategy achieves up to fivefold acceleration, enhancing clinical practicality. On the validation set at epoch 50, our hierarchical semantic loss improves the Dice score of nnU-Net ResEnc M by 11.65%. The proposed model demonstrates a 5.6% higher Dice score than baselines on the test set. Experimental results show significant improvements in segmentation accuracy and efficiency, making the framework suitable for real-time clinical applications. The implementation code for this challenge entry is publicly available at: https://github.com/PengchengShi1220/AortaSeg24. The code for fractal softmax will be available at https://github.com/PengchengShi1220/fractal-softmax.

[112] Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision

Zitang Sun,Masakazu Yoshimura,Junji Otsuka,Atsushi Irie,Takeshi Ohashi

Main category: cs.CV

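TL;DR: DetGain is an online data curation method for object detection. It selects informative samples by estimating each image's marginal perturbation to Average Precision (AP), is architecture-agnostic and robust, and effectively improves detection performance.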

Details Motivation: High-quality data drives model progress, but existing online sampling strategies are hard to apply to object detection because of its structural complexity and domain gaps, so a data curation method designed for this task is needed. Method: The paper proposes DetGain, which estimates each image's marginal effect on global AP from its prediction quality, models global score distributions, and computes teacher-student contribution gaps to select the most informative samples at each iteration, without depending on a particular detection architecture. Result: On the COCO dataset, DetGain yields consistent accuracy gains across several mainstream detectors, is robust to low-quality data, and can be combined with knowledge distillation to improve performance further. Conclusion: DetGain provides a general, complementary, and easily integrated online data curation strategy for data-efficient object detection, with broad potential for application. Abstract: High-quality data has become a primary driver of progress under scaling laws, with curated datasets often outperforming much larger unfiltered ones at lower cost. Online data curation extends this idea by dynamically selecting training samples based on the model's evolving state. While effective in classification and multimodal learning, existing online sampling strategies rarely extend to object detection because of its structural complexity and domain gaps. We introduce DetGain, an online data curation method specifically for object detection that estimates the marginal perturbation of each image to dataset-level Average Precision (AP) based on its prediction quality. By modeling global score distributions, DetGain efficiently estimates the global AP change and computes teacher-student contribution gaps to select informative samples at each iteration. The method is architecture-agnostic and minimally intrusive, enabling straightforward integration into diverse object detection architectures. Experiments on the COCO dataset with multiple representative detectors show consistent improvements in accuracy. DetGain also demonstrates strong robustness under low-quality data and can be effectively combined with knowledge distillation techniques to further enhance performance, highlighting its potential as a general and complementary strategy for data-efficient object detection.
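
The idea of an image's marginal contribution to dataset-level AP can be made concrete with a brute-force leave-one-out computation. This is only an illustration of the quantity being estimated: DetGain approximates it efficiently from global score distributions, whereas the sketch recomputes AP twice. The non-interpolated AP formula, the single-class setting, and the names `average_precision` and `marginal_ap` are assumptions.

```python
def average_precision(detections, num_gt):
    """Non-interpolated AP for one class.

    detections is a list of (score, is_true_positive); num_gt is the number
    of ground-truth boxes.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_gt if num_gt else 0.0

def marginal_ap(images, target):
    """AP over all images minus AP with `target` left out.

    images maps an image name to (detections, num_gt).
    """
    def pooled_ap(names):
        detections = [d for n in names for d in images[n][0]]
        num_gt = sum(images[n][1] for n in names)
        return average_precision(detections, num_gt)
    without_target = [n for n in images if n != target]
    return pooled_ap(list(images)) - pooled_ap(without_target)

images = {
    "a": ([(0.9, True)], 1),                 # one confident correct box
    "b": ([(0.8, False), (0.3, True)], 1),   # confident false positive
}
gain_a = marginal_ap(images, "a")
gain_b = marginal_ap(images, "b")
```

In this toy case image "a" raises pooled AP while image "b" lowers it, because its high-scoring false positive outranks its own true positive. Per-image signals of this kind are what a curation method can use to rank samples.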

[113] Multi-Scale Correlation-Aware Transformer for Maritime Vessel Re-Identification

Yunhe Liu

Main category: cs.CV

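TL;DR: Proposes MCFormer, a multi-scale correlation-aware Transformer network for maritime vessel re-identification. It models correlations among global and local features to suppress the influence of outlier samples and significantly improves performance.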

Details Motivation: Existing vessel re-identification methods mostly borrow pedestrian re-identification algorithms directly and cope poorly with the large intra-identity variation and severe missing local parts in vessel images, which produce outlier samples within the same identity. Method: The paper proposes the MCFormer network, which contains a Global Correlation Module (GCM) and a Local Correlation Module (LCM). GCM builds a global similarity affinity matrix across all input images and aggregates features based on inter-image consistency. LCM uses a dynamic memory bank to mine and align local features with contextual similarity, compensating for missing or occluded regions. Multi-scale global and local correlated features are fused to improve robustness. Result: Experiments on three benchmark datasets show that MCFormer achieves state-of-the-art performance. Conclusion: By explicitly modeling multi-scale correlations between images, MCFormer mitigates intra-identity variation and missing local parts in vessel re-identification and significantly improves identification accuracy. Abstract: Maritime vessel re-identification (Re-ID) plays a crucial role in advancing maritime monitoring and intelligent situational awareness systems. However, some existing vessel Re-ID methods are directly adapted from pedestrian-focused algorithms, making them ill-suited for mitigating the unique problems present in vessel images, particularly the greater intra-identity variations and more severe missing of local parts, which lead to the emergence of outlier samples within the same identity. To address these challenges, we propose the Multi-scale Correlation-aware Transformer Network (MCFormer), which explicitly models multi-scale correlations across the entire input set to suppress the adverse effects of outlier samples with intra-identity variations or local missing, incorporating two novel modules, the Global Correlation Module (GCM), and the Local Correlation Module (LCM). Specifically, GCM constructs a global similarity affinity matrix across all input images to model global correlations through feature aggregation based on inter-image consistency, rather than solely learning features from individual images as in most existing approaches. Simultaneously, LCM mines and aligns local features of positive samples with contextual similarity to extract local correlations by maintaining a dynamic memory bank, effectively compensating for missing or occluded regions in individual images. To further enhance feature robustness, MCFormer integrates global and local features that have been respectively correlated across multiple scales, effectively capturing latent relationships among image features.
Experiments on three benchmarks demonstrate that MCFormer achieves state-of-the-art performance.
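
As a loose illustration of aggregating features through an inter-image affinity matrix, the sketch below re-expresses each image feature as a softmax-weighted mixture of all features in the batch. The cosine affinity, the temperature value, and the name `aggregate_with_affinity` are assumptions; the abstract does not specify how GCM builds or applies its affinity matrix.

```python
from math import exp, sqrt

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def aggregate_with_affinity(features, temperature=0.1):
    """Mix every feature with the others, weighted by softmax(cosine / T)."""
    aggregated = []
    for f in features:
        logits = [cosine(f, g) / temperature for g in features]
        peak = max(logits)                       # subtract max for stability
        weights = [exp(x - peak) for x in logits]
        total = sum(weights)
        aggregated.append([
            sum(w * g[d] for w, g in zip(weights, features)) / total
            for d in range(len(f))
        ])
    return aggregated

# Two consistent views of one vessel plus one outlier view.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aggregated = aggregate_with_affinity(features)
```

Because the outlier has low affinity to the consistent views, it contributes almost nothing to their aggregated features, which conveys the intuition behind suppressing outlier samples through inter-image consistency.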

[114] InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior

Weimin Bai,Suzhe Xu,Yiwei Ren,Jinhua Hao,Ming Sun,Wenzheng Chen,He Sun

Main category: cs.CV

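TL;DR: This paper proposes InstantViR, a fast video reconstruction framework based on a pre-trained video diffusion prior. By distilling a bidirectional video diffusion model into a causal autoregressive model, it performs high-quality, low-latency video restoration in a single forward pass.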

Details Motivation: Existing diffusion-based video reconstruction methods either produce temporal artifacts or are too slow at inference to meet real-time requirements, so a method that keeps reconstruction quality high while meeting low-latency constraints is needed. Method: The paper proposes InstantViR, which uses distillation to transfer the knowledge of a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student model, with a prior-driven distillation scheme that needs no paired data; it also replaces the original VAE backbone with LeanVAE to improve efficiency. Result: On random inpainting, deblurring, and super-resolution, InstantViR matches or exceeds the quality of existing diffusion models while running at over 35 FPS on NVIDIA A100, up to 100x faster than iterative video diffusion methods. Conclusion: InstantViR combines high quality with very fast inference, showing that diffusion-based video reconstruction can serve real-time interactive streaming scenarios and become a practical component of modern vision systems. Abstract: Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers.
These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.

[115] Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

N Dinesh Reddy,Sudeep Pillai

Main category: cs.CV

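TL;DR: Orion is a visual agent framework that can take in and generate any modality. It carries out complex multi-step visual tasks by calling a range of computer vision tools, marking a shift from passive understanding to active visual intelligence.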

Details Motivation: Traditional vision-language models produce only descriptive outputs and fall short on complex visual tasks; Orion aims to deliver stronger visual reasoning and practical capability through tool calling. Method: An agent-based framework integrates specialized vision tools, including object detection, keypoint localization, panoptic segmentation, OCR, and geometric analysis, supports multimodal input and output, and performs autonomous visual reasoning through tool calls. Result: Orion achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench, demonstrating its effectiveness on complex visual tasks. Conclusion: By combining neural perception with symbolic execution, Orion moves visual intelligence from description generation toward tool-driven active reasoning and shows potential for production-grade use. Abstract: We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.

[116] Measurement-Constrained Sampling for Text-Prompted Blind Face Restoration

Wenjie Li,Yulun Zhang,Guangwei Gao,Heng Guo,Zhanyu Ma

Main category: cs.CV

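TL;DR: Proposes a measurement-constrained sampling method for blind face restoration that generates diverse, high-quality face restorations guided by text prompts.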

Details Motivation: Existing blind face restoration methods usually produce deterministic results and struggle to capture the one-to-many relationship between a low-quality input and its multiple plausible high-quality outputs. Method: Blind face restoration is formulated as a measurement-constrained generative task; forward and reverse measurement constraints enable posterior-guided sampling within a text-to-image diffusion model. Result: Experiments show that the method generates diverse restorations aligned with the text prompts and outperforms existing blind face restoration methods. Conclusion: MCS addresses the one-to-many problem in blind face restoration, enabling controllable and diverse high-quality face reconstruction. Abstract: Blind face restoration (BFR) may correspond to multiple plausible high-quality (HQ) reconstructions under extremely low-quality (LQ) inputs. However, existing methods typically produce deterministic results, struggling to capture this one-to-many nature. In this paper, we propose a Measurement-Constrained Sampling (MCS) approach that enables diverse LQ face reconstructions conditioned on different textual prompts. Specifically, we formulate BFR as a measurement-constrained generative task by constructing an inverse problem through controlled degradations of coarse restorations, which allows posterior-guided sampling within text-to-image diffusion. Measurement constraints include both Forward Measurement, which ensures results align with input structures, and Reverse Measurement, which produces projection spaces, ensuring that the solution can align with various prompts. Experiments show that our MCS can generate prompt-aligned results and outperforms existing BFR methods. Codes will be released after acceptance.

[117] StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Yifan Yang,Zhi Cen,Sida Peng,Xiangwei Chen,Yifu Deng,Xinyu Zhu,Fan Jia,Xiaowei Zhou,Hujun Bao

Main category: cs.CV

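TL;DR: Proposes StreamingTalker, an autoregressive diffusion model for low-latency, high-quality speech-driven 3D facial animation that supports audio of arbitrary length and real-time streaming.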

Details Motivation: Existing speech-driven 3D facial animation methods degrade on very long audio and incur high latency, which makes them unsuitable for real-time applications. Method: An autoregressive diffusion model processes audio in a streaming manner, using a limited number of past frames as a dynamic condition together with the current audio input to guide the diffusion process as it generates facial motion step by step. Result: The method maintains high-quality animation while achieving low latency independent of audio length, supports inputs of arbitrary length, and has been deployed in a real-time interactive demo. Conclusion: StreamingTalker overcomes the limitations of conventional diffusion models in long-sequence processing and real-time operation, offering an efficient and practical solution for speech-driven facial animation. Abstract: This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
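
The streaming loop described in the abstract can be sketched roughly as follows. This is only an illustration of the control flow (a bounded window of past frames plus the current audio chunk forms the condition for a short denoising loop); `stream_animation`, `toy_denoiser`, and all parameter values are hypothetical names and numbers introduced here, not the authors' implementation.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

Frame = List[float]
Condition = Tuple[List[Frame], Sequence[float]]


def stream_animation(
    audio_chunks: Sequence[Sequence[float]],
    denoise_step: Callable[[Frame, Condition, int], Frame],
    history_len: int = 4,   # assumed size of the motion-context window
    num_steps: int = 3,     # assumed number of denoising steps per frame
    frame_dim: int = 2,     # assumed size of one motion frame
) -> List[Frame]:
    """Generate one motion frame per audio chunk, autoregressively."""
    history: deque = deque(maxlen=history_len)  # old frames fall out automatically
    outputs: List[Frame] = []
    for chunk in audio_chunks:
        # Dynamic condition: limited past motion plus the current audio chunk.
        condition = (list(history), chunk)
        # A real sampler would start from Gaussian noise; zeros keep the toy deterministic.
        frame = [0.0] * frame_dim
        for step in reversed(range(num_steps)):
            frame = denoise_step(frame, condition, step)
        history.append(frame)
        outputs.append(frame)
    return outputs


def toy_denoiser(frame: Frame, condition: Condition, step: int) -> Frame:
    """Placeholder for a learned denoiser: pulls the frame toward audio and history."""
    history, chunk = condition
    target = sum(chunk) / len(chunk)
    prev = history[-1] if history else [target] * len(frame)
    return [0.5 * f + 0.25 * target + 0.25 * p for f, p in zip(frame, prev)]


frames = stream_animation([[1.0, 1.0]] * 10, toy_denoiser)
```

Because the per-frame work depends only on `history_len` and `num_steps`, the cost of producing each frame does not grow with the total audio length, which is the property the abstract highlights.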

[118] Breaking the Passive Learning Trap: An Active Perception Strategy for Human Motion Prediction

Juncheng Hu,Zijian Zhang,Zeyu Wang,Guoyu Wang,Yingji Li,Kedi Lyu

Main category: cs.CV

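TL;DR: Proposes an Active Perceptual Strategy (APS) for 3D human motion prediction that uses quotient space representations and auxiliary learning objectives to strengthen spatio-temporal modeling, reaching SOTA performance on multiple datasets.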

Details Motivation: Existing methods rely too heavily on implicit network modeling, which makes learning passive: the 3D coordinate information acquired is redundant and monotonous, and actively guided explicit learning mechanisms are missing. Method: A data perception module projects poses into the quotient space to decouple motion geometry from coordinate redundancy; jointly encoding tangent vectors and Grassmann projections yields geometric dimension reduction, semantic decoupling, and dynamic constraint enforcement. A network perception module builds auxiliary supervision signals by masking joints or injecting noise, and a dedicated auxiliary learning network performs restorative learning. Result: On the three mainstream datasets H3.6M, CMU Mocap, and 3DPW, the method improves performance by 16.3%, 13.9%, and 10.1% respectively, setting a new SOTA. Conclusion: APS is a model-agnostic active perception framework that strengthens explicit modeling and active learning of human motion and markedly improves 3D motion prediction accuracy. Abstract: Forecasting 3D human motion is an important embodiment of fine-grained understanding and cognition of human behavior by artificial agents. Current approaches excessively rely on implicit network modeling of spatiotemporal relationships and motion characteristics, falling into the passive learning trap that results in redundant and monotonous 3D coordinate information acquisition while lacking actively guided explicit learning mechanisms. To overcome these issues, we propose an Active Perceptual Strategy (APS) for human motion prediction, leveraging quotient space representations to explicitly encode motion properties while introducing auxiliary learning objectives to strengthen spatio-temporal modeling. Specifically, we first design a data perception module that projects poses into the quotient space, decoupling motion geometry from coordinate redundancy. By jointly encoding tangent vectors and Grassmann projections, this module simultaneously achieves geometric dimension reduction, semantic decoupling, and dynamic constraint enforcement for effective motion pose characterization. Furthermore, we introduce a network perception module that actively learns spatio-temporal dependencies through restorative learning. This module deliberately masks specific joints or injects noise to construct auxiliary supervision signals. A dedicated auxiliary learning network is designed to actively adapt and learn from perturbed information. Notably, APS is model agnostic and can be integrated with different prediction models to enhance active perception. The experimental results demonstrate that our method achieves a new state of the art, outperforming existing methods by large margins: 16.3% on H3.6M, 13.9% on CMU Mocap, and 10.1% on 3DPW.

[119] Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization

Yan Huang,Yongyi Su,Xin Lin,Le Zhang,Xun Xu

Main category: cs.CV

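TL;DR: WeSTAR is a parameter-efficient, weakly supervised self-training framework that improves the robustness and generalization of monocular depth estimation foundation models in unseen and diverse domains.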

Details Motivation: Although foundation models have advanced zero-shot monocular depth estimation, how to further improve them when some downstream data is available remains an open question. Method: WeSTAR combines a dense self-training objective, semantically-aware hierarchical normalization, weak supervision from pairwise ordinal depth annotations, and a weight regularization loss to achieve stable, robust adaptation. Result: Across a range of realistic and corrupted out-of-distribution datasets, WeSTAR consistently improves model performance and achieves state-of-the-art generalization. Conclusion: WeSTAR effectively strengthens the adaptability of monocular depth estimation foundation models, showing superior robustness and generalization across diverse scenarios. Abstract: The emergence of foundation models has substantially advanced zero-shot generalization in monocular depth estimation (MDE), as exemplified by the Depth Anything series. However, given access to some data from downstream tasks, a natural question arises: can the performance of these models be further improved? To this end, we propose WeSTAR, a parameter-efficient framework that performs Weakly supervised Self-Training Adaptation with Regularization, designed to enhance the robustness of MDE foundation models in unseen and diverse domains. We first adopt a dense self-training objective as the primary source of structural self-supervision. To further improve robustness, we introduce semantically-aware hierarchical normalization, which exploits instance-level segmentation maps to perform more stable and multi-scale structural normalization. Beyond dense supervision, we introduce a cost-efficient weak supervision in the form of pairwise ordinal depth annotations to further guide the adaptation process, which enforces informative ordinal constraints to mitigate local topological errors. Finally, a weight regularization loss is employed to anchor the LoRA updates, ensuring training stability and preserving the model's generalizable knowledge. Extensive experiments on both realistic and corrupted out-of-distribution datasets under diverse and challenging scenarios demonstrate that WeSTAR consistently improves generalization and achieves state-of-the-art performance across a wide range of benchmarks.
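
To make the idea of pairwise ordinal depth supervision concrete, here is a minimal sketch of a ranking-style loss of the kind commonly used for ordinal depth annotations. The abstract does not give WeSTAR's exact loss, so the formulation, the function name `ordinal_pair_loss`, and the `relation` encoding below are assumptions for illustration only.

```python
import math


def ordinal_pair_loss(depth_a: float, depth_b: float, relation: int) -> float:
    """Loss for one annotated pixel pair (hypothetical formulation).

    relation = +1 if point a is annotated as farther than point b,
               -1 if a is annotated as closer than b,
                0 if the two are annotated as roughly equal in depth.
    """
    diff = depth_a - depth_b
    if relation == 0:
        # Equal-depth pairs: penalize any predicted difference.
        return diff * diff
    # Ordered pairs: softplus(-relation * diff), written in a numerically stable form.
    z = -relation * diff
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# A prediction that respects the annotated order is penalized less than one that violates it.
good = ordinal_pair_loss(2.0, 1.0, +1)
bad = ordinal_pair_loss(1.0, 2.0, +1)
```

Only the order of the two predicted depths matters for ordered pairs, which is why such annotations are cheap to collect: an annotator states which of two points is farther away rather than providing metric depth.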

[120] V2VLoc: Robust GNSS-Free Collaborative Perception via LiDAR Localization

Wenkai Lin,Qiming Xia,Wen Li,Xun Huang,Chenglu Wen

Main category: cs.CV

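TL;DR: Proposes a LiDAR-based, GNSS-free collaborative perception framework that uses a lightweight Pose Generator with Confidence (PGC) and a Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT) to achieve robust multi-agent collaborative perception in GNSS-denied environments.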

Details Motivation: Where GNSS signals are unavailable, traditional localization fails, which makes feature alignment across agents difficult and degrades collaborative perception. Method: A lightweight Pose Generator with Confidence (PGC) estimates compact pose and confidence representations; the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT) uses the confidence for spatial alignment and models temporal context. A new simulation dataset, V2VLoc, supports both LiDAR localization and collaborative detection. Result: Experiments on V2VLoc show SOTA performance under GNSS-denied conditions, and experiments on the real-world V2V4Real dataset validate the effectiveness and generalizability of PASTAT. Conclusion: The framework addresses multi-agent collaborative perception in GNSS-denied environments, offering high accuracy, strong robustness, and practical potential. Abstract: Multiple agents rely on accurate poses to share and align observations, enabling a collaborative perception of the environment. However, traditional GNSS-based localization often fails in GNSS-denied environments, making consistent feature alignment difficult in collaboration. To tackle this challenge, we propose a robust GNSS-free collaborative perception framework based on LiDAR localization. Specifically, we propose a lightweight Pose Generator with Confidence (PGC) to estimate compact pose and confidence representations. To alleviate the effects of localization errors, we further develop the Pose-Aware Spatio-Temporal Alignment Transformer (PASTAT), which performs confidence-aware spatial alignment while capturing essential temporal context. Additionally, we present a new simulation dataset, V2VLoc, which can be adapted for both LiDAR localization and collaborative detection tasks. V2VLoc comprises three subsets: Town1Loc, Town4Loc, and V2VDet. Town1Loc and Town4Loc offer multi-traversal sequences for training in localization tasks, whereas V2VDet is specifically intended for the collaborative detection task. Extensive experiments conducted on the V2VLoc dataset demonstrate that our approach achieves state-of-the-art performance under GNSS-denied conditions. We further conduct extended experiments on the real-world V2V4Real dataset to validate the effectiveness and generalizability of PASTAT.

[121] ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation

Zitong Xu,Huiyu Duan,Xiaoyu Wang,Zhaolin Cai,Kaiwei Zhang,Qiang Hu,Jing Liu,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

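TL;DR: Introduces ManipBench, a large-scale benchmark for detecting and localizing AI-edited image manipulations with over 450K manipulated images produced by 25 state-of-the-art editing models, and ManipShield, a model built on a multimodal large language model that unifies manipulation detection, localization, and explanation and shows strong generalization and performance.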

Details Motivation: Existing image manipulation detection benchmarks fall short in content diversity, coverage of generative models, and interpretability, which limits the generalization and explanation capabilities of detection methods. Method: The ManipBench dataset covers 25 advanced editing models and 12 manipulation categories; ManipShield builds on a multimodal large language model and uses contrastive LoRA fine-tuning and task-specific decoders to unify detection, localization, and explanation. Result: Experiments on ManipBench and several public datasets show that ManipShield reaches SOTA detection and localization performance and generalizes strongly to unseen manipulation models. Conclusion: ManipBench offers a more comprehensive and challenging evaluation platform for detecting AI-edited image manipulation, and ManipShield demonstrates the potential of multimodal large models for unified manipulation analysis. Abstract: With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinders the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce ManipBench, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose ManipShield, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and exhibits strong generality to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.

[122] Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery

Yiming Zeng,Xi-Le Zhao,Wei-Hao Wu,Teng-Yu Ji,Chao Wang

Main category: cs.CV

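TL;DR: Proposes a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework that represents multi-dimensional images more accurately and captures local high-frequency information better than existing methods.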

Details Motivation: Existing t-SVD methods approximate the latent tensor and the transform matrix too coarsely and use fixed basis atoms, so they cannot accurately capture local high-frequency information along the mode-3 fibers. Method: 2D and 1D Gaussian splatting generate the latent tensor and the transform matrix respectively, representing multi-dimensional images in a continuous and compact form. Result: Experiments show that GSLR clearly outperforms state-of-the-art methods on multi-dimensional image recovery, especially in recovering local high-frequency details. Conclusion: By using learnable Gaussian basis functions, the GSLR framework improves tensor representation capability and offers a new, efficient solution for multi-dimensional image processing. Abstract: Tensor singular value decomposition (t-SVD) is a promising tool for multi-dimensional image representation, which decomposes a multi-dimensional image into a latent tensor and an accompanying transform matrix. However, two critical limitations of t-SVD methods persist: (1) the approximation of the latent tensor (e.g., tensor factorizations) is coarse and fails to accurately capture spatial local high-frequency information; (2) the transform matrix is composed of fixed basis atoms (e.g., complex exponential atoms in DFT and cosine atoms in DCT) and cannot precisely capture local high-frequency information along the mode-3 fibers. To address these two limitations, we propose a Gaussian Splatting-based Low-rank tensor Representation (GSLR) framework, which compactly and continuously represents multi-dimensional images. Specifically, we leverage tailored 2D Gaussian splatting and 1D Gaussian splatting to generate the latent tensor and transform matrix, respectively. The 2D and 1D Gaussian splatting are indispensable and complementary under this representation framework, which enjoys a powerful representation capability, especially for local high-frequency information. To evaluate the representation ability of the proposed GSLR, we develop an unsupervised GSLR-based multi-dimensional image recovery model. Extensive experiments on multi-dimensional image recovery demonstrate that GSLR consistently outperforms state-of-the-art methods, particularly in capturing local high-frequency information.

[123] Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation

Weimin Bai,Yubo Li,Weijian Luo,Zeqiang Lai,Yequan Wang,Wenzheng Chen,He Sun

Main category: cs.CV

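TL;DR: Proposes VLM3D, a general framework that uses large vision-language models (VLMs) as semantic and spatial critics to address semantic alignment and spatial understanding in text-to-3D generation.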

Details Motivation: Existing text-to-3D generation models fall short in fine-grained semantic alignment and 3D spatial understanding, leading to geometric inconsistencies and part-assembly errors. Method: A dual-query critic signal, based on the VLM's Yes/No log-odds, assesses semantic fidelity and geometric coherence and is applied to both optimization-based and feed-forward paradigms. Result: The method significantly outperforms existing approaches on standard benchmarks and corrects severe spatial errors made by SOTA 3D models during sampling. Conclusion: VLM3D offers a principled and generalizable path for injecting the VLM's language-grounded understanding of semantics and space into diverse 3D generation pipelines. Abstract: Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
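
As a small worked illustration of a Yes/No log-odds signal: under a softmax, the log-odds between two tokens equals the difference of their logits, because the normalizer cancels. The sketch below combines one semantic query and one spatial query into a scalar score. The function names, the weights, and the way the two queries are combined are assumptions made here; the abstract does not specify them.

```python
import math
from typing import Tuple


def yes_no_log_odds(logit_yes: float, logit_no: float) -> float:
    """log(P(Yes) / P(No)); the softmax normalizer cancels, leaving a logit difference."""
    return logit_yes - logit_no


def yes_probability(log_odds: float) -> float:
    """Probability of 'Yes' when only the Yes/No pair is considered."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def critic_score(
    semantic_logits: Tuple[float, float],  # (Yes, No) logits for a content question
    spatial_logits: Tuple[float, float],   # (Yes, No) logits for a geometry question
    w_semantic: float = 1.0,               # hypothetical weights
    w_spatial: float = 1.0,
) -> float:
    """Weighted sum of the two log-odds terms (an assumed way to combine them)."""
    return (w_semantic * yes_no_log_odds(*semantic_logits)
            + w_spatial * yes_no_log_odds(*spatial_logits))


score = critic_score((2.0, -1.0), (0.5, 0.5))
```

Since the score is a linear function of the logits, it stays differentiable with respect to whatever produced them, which is what allows such a signal to serve as a reward or guidance term.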

[124] Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning

Tong Zhang,Yifan Zhao,Liangyu Wang,Jia Li

Main category: cs.CV

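TL;DR: Proposes Intermediate Domain Proxies (IDP), which build a codebook from source-domain features to reconstruct target-domain features and enable efficient domain alignment for cross-domain few-shot learning.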

Details Motivation: Address the three challenges of cross-domain few-shot learning: disjoint semantics, large domain discrepancy, and data scarcity. Method: Source-domain feature embeddings form a codebook of intermediate domain proxies that reconstructs target-domain features; the proxies then guide fast domain alignment and feature transformation. Result: The method surpasses state-of-the-art models on 8 cross-domain few-shot learning benchmarks. Conclusion: IDP uses intermediate domain proxies to promote knowledge transfer and markedly improves cross-domain few-shot learning performance. Abstract: Cross-Domain Few-Shot Learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, which faces a triplet of learning challenges in the meantime, i.e., semantic disjoint, large domain discrepancy, and data scarcity. Different from predominant CDFSL works focused on generalized representations, we make novel attempts to construct Intermediate Domain Proxies (IDP) with source feature embeddings as the codebook and reconstruct the target domain feature with this learned codebook. We then conduct an empirical study to explore the intrinsic attributes from perspectives of visual styles and semantic contents in intermediate domain proxies. Reaping benefits from these attributes of intermediate domains, we develop a fast domain alignment method to use these proxies as learning guidance for target domain feature transformation. With the collaborative learning of intermediate domain reconstruction and target feature transformation, our proposed model is able to surpass the state-of-the-art models by a margin on 8 cross-domain few-shot learning benchmarks.
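
One simple way to picture "reconstructing a target feature with a source codebook" is a similarity-weighted combination of codebook entries. The sketch below uses softmax attention over dot-product similarities; this is a generic technique chosen for illustration, and the function name, the temperature, and the toy vectors are all assumptions rather than the paper's actual reconstruction rule.

```python
import math
from typing import List, Sequence, Tuple


def reconstruct_from_codebook(
    target: Sequence[float],
    codebook: Sequence[Sequence[float]],
    temperature: float = 0.1,  # hypothetical sharpness parameter
) -> Tuple[List[float], List[float]]:
    """Rebuild a target-domain feature as a convex combination of source codes."""
    # Dot-product similarity between the target feature and every code.
    scores = [sum(t * c for t, c in zip(target, code)) / temperature for code in codebook]
    # Numerically stable softmax over the codes.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of codes, one output dimension at a time.
    recon = [sum(w * code[i] for w, code in zip(weights, codebook))
             for i in range(len(target))]
    return recon, weights


source_codes = [[1.0, 0.0], [0.0, 1.0]]  # toy source-domain embeddings
recon, weights = reconstruct_from_codebook([0.9, 0.1], source_codes)
```

The reconstruction lies inside the span of the source codes, so it can be read as a point "between" the two domains: it carries target-specific mixing weights but is expressed entirely in source-domain features.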

[125] NeuralSSD: A Neural Solver for Signed Distance Surface Reconstruction

Zi-Chen Xi,Jiahui Huang,Hao-Xiang Chen,Francis Williams,Qun-Ce Xu,Tai-Jiang Mu,Shi-Min Hu

Main category: cs.CV

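TL;DR: NeuralSSD is a new 3D implicit surface reconstruction method based on the neural Galerkin method. With a new energy equation and a convolutional network, it reconstructs surfaces from point clouds accurately and stably, and achieves state-of-the-art performance on multiple datasets.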

Details Motivation: Existing parameterizations of implicit fields lack an explicit mechanism to ensure a tight fit between the surface and the input point cloud, which limits reconstruction accuracy. Method: NeuralSSD is a solver based on the neural Galerkin method; it introduces a new energy equation that balances the reliability of point cloud information and a new 3D convolutional network that learns point cloud features to improve implicit surface reconstruction. Result: On several challenging datasets, including ShapeNet and Matterport, NeuralSSD achieves state-of-the-art surface reconstruction accuracy and generalizability. Conclusion: NeuralSSD improves the quality and stability of implicit surface reconstruction from point clouds and shows good potential for practical use. Abstract: We propose a generalized method, NeuralSSD, for reconstructing a 3D implicit surface from widely available point cloud data. NeuralSSD is a solver based on the neural Galerkin method, aimed at reconstructing higher-quality, more accurate surfaces from input point clouds. Implicit methods are preferred due to their ability to accurately represent shapes and their robustness in handling topological changes. However, existing parameterizations of implicit fields lack explicit mechanisms to ensure a tight fit between the surface and input data. To address this, we propose a novel energy equation that balances the reliability of point cloud information. Additionally, we introduce a new convolutional network that learns three-dimensional information to achieve superior optimization results. This approach ensures that the reconstructed surface closely adheres to the raw input points and infers valuable inductive biases from point clouds, resulting in a highly accurate and stable surface reconstruction. NeuralSSD is evaluated on a variety of challenging datasets, including the ShapeNet and Matterport datasets, and achieves state-of-the-art results in terms of both surface reconstruction accuracy and generalizability.

[126] NeuralBoneReg: A Novel Self-Supervised Method for Robust and Accurate Multi-Modal Bone Surface Registration

Luohong Wu,Matthias Seibold,Nicola A. Cavalcanti,Yunke Ao,Roman Flepp,Aidana Massalimova,Lilian Calvet,Philipp Fürnstahl

Main category: cs.CV

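TL;DR: Proposes NeuralBoneReg, a self-supervised, surface-based framework for modality-agnostic bone surface registration that performs well across multiple datasets.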

Details Motivation: In computer-assisted orthopedic surgery, heterogeneity across imaging modalities makes accurate registration of preoperative and intraoperative data challenging and error-prone, so a robust, automatic, and modality-agnostic bone surface registration method is needed. Method: NeuralBoneReg uses 3D point clouds as a modality-agnostic representation and has two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. The method is self-supervised and needs no inter-subject training data. Result: On several public multi-modal datasets (including the newly introduced UltraBones-Hip), NeuralBoneReg matches or surpasses existing methods, with mean RRE/RTE of 1.68°/1.86 mm (UltraBones100k), 1.88°/1.89 mm (UltraBones-Hip), and 3.79°/2.45 mm (SpineDepth). Conclusion: NeuralBoneReg generalizes well across anatomies and imaging modalities, providing robust and accurate cross-modal alignment for CAOS. Abstract: In computer- and robot-assisted orthopedic surgery (CAOS), patient-specific surgical plans derived from preoperative imaging define target locations and implant trajectories. During surgery, these plans must be accurately transferred, relying on precise cross-registration between preoperative and intraoperative data. However, substantial modality heterogeneity across imaging modalities makes this registration challenging and error-prone. Robust, automatic, and modality-agnostic bone surface registration is therefore clinically important. We propose NeuralBoneReg, a self-supervised, surface-based framework that registers bone surfaces using 3D point clouds as a modality-agnostic representation. NeuralBoneReg includes two modules: an implicit neural unsigned distance field (UDF) that learns the preoperative bone model, and an MLP-based registration module that performs global initialization and local refinement by generating transformation hypotheses to align the intraoperative point cloud with the neural UDF. Unlike SOTA supervised methods, NeuralBoneReg operates in a self-supervised manner, without requiring inter-subject training data. We evaluated NeuralBoneReg against baseline methods on two publicly available multi-modal datasets: a CT-ultrasound dataset of the fibula and tibia (UltraBones100k) and a CT-RGB-D dataset of spinal vertebrae (SpineDepth). The evaluation also includes a newly introduced CT-ultrasound dataset of cadaveric subjects containing femur and pelvis (UltraBones-Hip), which will be made publicly available. NeuralBoneReg matches or surpasses existing methods across all datasets, achieving mean RRE/RTE of 1.68°/1.86 mm on UltraBones100k, 1.88°/1.89 mm on UltraBones-Hip, and 3.79°/2.45 mm on SpineDepth. These results demonstrate strong generalizability across anatomies and modalities, providing robust and accurate cross-modal alignment for CAOS.
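
The abstract reports RRE/RTE values without defining them. In the registration literature they are commonly computed as the angle of the residual rotation between estimated and ground-truth rotations, and the Euclidean distance between estimated and ground-truth translations. The sketch below follows that common convention; it is an assumption that the paper uses exactly these definitions, and the function names are introduced here.

```python
import math
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def rotation_error_deg(r_est: Matrix3, r_gt: Matrix3) -> float:
    """Angle (degrees) of the residual rotation R_gt^T R_est."""
    # trace(R_gt^T R_est) equals the element-wise (Frobenius) inner product.
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between translations, in the input's unit (e.g. mm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))


def rot_z(deg: float) -> list:
    """Rotation about the z-axis, used here only to build a toy example."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rre = rotation_error_deg(rot_z(10.0), identity)          # about 10 degrees
rte = translation_error([3.0, 4.0, 0.0], [0.0, 0.0, 0.0])  # 5.0
```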

[127] GEN3D: Generating Domain-Free 3D Scenes from a Single Image

Yuxin Zhang,Ziyu Lu,Hongbo Duan,Keyu Fan,Pengting Luo,Peiyu Zhuang,Mengyu Yang,Houde Liu

Main category: cs.CV

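TL;DR: Proposes Gen3d, which generates high-quality, wide-scope, generic 3D scenes from a single image by lifting an RGBD image into a point cloud and optimizing a Gaussian splatting representation; experiments show strong performance in world-model generation and novel view synthesis.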

Details Motivation: Existing neural 3D reconstruction depends on dense multi-view images, which limits its applicability; meanwhile, 3D scene generation is vital for embodied AI and world models, which need diverse, high-quality scenes. Method: An initial point cloud is created from the RGBD image, the world model is then expanded and maintained, and the 3D scene is finalized by optimizing a Gaussian splatting representation. Result: Experiments on multiple datasets show strong generalization and superior performance, producing high-fidelity and consistent novel views. Conclusion: Gen3d generates high-quality 3D scenes from a single image, outperforms existing methods in world-model construction and novel view synthesis, and is broadly applicable. Abstract: Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. Additionally, 3D scene generation is vital for advancing embodied AI and world models, which depend on diverse, high-quality scenes for learning and evaluation. In this work, we propose Gen3d, a novel method for generation of high-quality, wide-scope, and generic 3D scenes from a single image. After the initial point cloud is created by lifting the RGBD image, Gen3d maintains and expands its world model. The 3D scene is finalized through optimizing a Gaussian splatting representation. Extensive experiments on diverse datasets demonstrate the strong generalization capability and superior performance of our method in generating a world model and synthesizing high-fidelity and consistent novel views.

[128] SAM-Fed: SAM-Guided Federated Semi-Supervised Learning for Medical Image Segmentation

Sahar Nasirihaghighi,Negin Ghamsarian,Yiping Li,Marcel Breeuwer,Raphael Sznitman,Klaus Schoeffmann

Main category: cs.CV

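TL;DR: Proposes SAM-Fed, a federated semi-supervised learning framework built on a segmentation foundation model that uses dual knowledge distillation and an adaptive agreement mechanism to improve pseudo-label quality and model performance on lightweight clients.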

Details Motivation: In medical image segmentation, data privacy and the cost of expert annotation limit the availability of labeled data, and existing federated semi-supervised methods produce unreliable pseudo-labels because client models have limited capacity, which hurts performance. Method: A high-capacity segmentation foundation model (such as SAM) guides lightweight clients during training, with dual knowledge distillation and an adaptive pixel-level agreement mechanism refining pseudo-label generation and model learning. Result: On skin lesion and polyp segmentation, SAM-Fed outperforms state-of-the-art federated semi-supervised methods in both homogeneous and heterogeneous client settings. Conclusion: SAM-Fed resolves the tension between lightweight client models and pseudo-label quality in federated semi-supervised learning, improving segmentation performance and stability. Abstract: Medical image segmentation is clinically important, yet data privacy and the cost of expert annotation limit the availability of labeled data. Federated semi-supervised learning (FSSL) offers a solution but faces two challenges: pseudo-label reliability depends on the strength of local models, and client devices often require compact or heterogeneous architectures due to limited computational resources. These constraints reduce the quality and stability of pseudo-labels, while large models, though more accurate, cannot be trained or used for routine inference on client devices. We propose SAM-Fed, a federated semi-supervised framework that leverages a high-capacity segmentation foundation model to guide lightweight clients during training. SAM-Fed combines dual knowledge distillation with an adaptive agreement mechanism to refine pixel-level supervision. Experiments on skin lesion and polyp segmentation across homogeneous and heterogeneous settings show that SAM-Fed consistently outperforms state-of-the-art FSSL methods.

[129] Iterative Diffusion-Refined Neural Attenuation Fields for Multi-Source Stationary CT Reconstruction: NAF Meets Diffusion Model

Jiancheng Fang,Shaoyu Wang,Junlin Wang,Weiwen Wu,Yikun Zhang,Qiegen Liu

Main category: cs.CV

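TL;DR: Proposes Diff-NAF, an iterative reconstruction framework for ultra-sparse-view multi-source stationary CT that combines a neural attenuation field with a dual-branch conditional diffusion model and improves reconstruction quality through projection synthesis and diffusion-driven refinement.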

Details Motivation: Traditional methods reconstruct poorly under ultra-sparse-view settings, where interpolation becomes inaccurate, and cannot meet practical requirements. Method: Diff-NAF first trains an initial neural attenuation field (NAF), then generates new projections with an angle-prior guided projection synthesis strategy, refines them with a diffusion-driven projection refinement module, and feeds the refined projections back as pseudo-labels for the next training iteration. Result: Experiments on multiple simulated 3D CT volumes and real projection data show that Diff-NAF outperforms existing methods under ultra-sparse-view conditions and markedly improves reconstruction completeness and fidelity. Conclusion: Through iterative projection completion and refinement, Diff-NAF addresses multi-source stationary CT reconstruction under ultra-sparse views and delivers high-quality images. Abstract: Multi-source stationary computed tomography (CT) has recently attracted attention for its ability to achieve rapid image reconstruction, making it suitable for time-sensitive clinical and industrial applications. However, practical systems are often constrained by ultra-sparse-view sampling, which significantly degrades reconstruction quality. Traditional methods struggle under ultra-sparse-view settings, where interpolation becomes inaccurate and the resulting reconstructions are unsatisfactory. To address this challenge, this study proposes Diffusion-Refined Neural Attenuation Fields (Diff-NAF), an iterative framework tailored for multi-source stationary CT under ultra-sparse-view conditions. Diff-NAF combines a Neural Attenuation Field representation with a dual-branch conditional diffusion model. The process begins by training an initial NAF using ultra-sparse-view projections. New projections are then generated through an Angle-Prior Guided Projection Synthesis strategy that exploits inter-view priors, and are subsequently refined by a Diffusion-driven Reuse Projection Refinement Module. The refined projections are incorporated as pseudo-labels into the training set for the next iteration. Through iterative refinement, Diff-NAF progressively enhances projection completeness and reconstruction fidelity under ultra-sparse-view conditions, ultimately yielding high-quality CT reconstructions. Experimental results on multiple simulated 3D CT volumes and real projection data demonstrate that Diff-NAF achieves the best performance under ultra-sparse-view conditions.

[130] Dental3R: Geometry-Aware Pairing for Intraoral 3D Reconstruction from Sparse-View Photographs

Yiyi Miao,Taoyu Wu,Tong Chen,Ji Jiang,Zhe Tang,Zhengyong Jiang,Angelos Stefanidis,Limin Yu,Jionglong Su

Main category: cs.CV

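TL;DR: Proposes Dental3R, a pose-free, graph-guided method for high-fidelity 3D reconstruction from sparse intraoral photographs; it combines a geometry-aware pairing strategy with wavelet-regularized optimization of a 3D Gaussian splatting model, improving reconstruction quality for tele-orthodontics.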

Details Motivation: Conventional intraoral scanning is unavailable for tele-orthodontics, which relies on sparse smartphone images; existing 3D reconstruction methods struggle to estimate pose and geometry stably under large view baselines, inconsistent illumination, and specular reflections, and sparse views tend to cause loss of detail. Method: Dental3R first applies a Geometry-Aware Pairing Strategy (GAPS) that selects high-value image pairs to stabilize pose and point cloud initialization; building on the recovered poses and point cloud, it trains a 3D Gaussian splatting (3DGS) model regularized with a discrete wavelet transform, whose band-limited constraint preserves fine edges and suppresses artifacts. Result: Validated on 950 clinical cases and a video-based test set of 195 cases, Dental3R achieves better novel view synthesis quality than current methods from sparse, unposed inputs, with notably better detail preservation in occlusion visualization. Conclusion: Dental3R addresses the instability and detail loss of reconstruction from sparse, unposed intraoral photographs and offers a practical high-fidelity 3D reconstruction solution for remote digital orthodontics. Abstract: Intraoral 3D reconstruction is fundamental to digital orthodontics, yet conventional methods like intraoral scanning are inaccessible for remote tele-orthodontics, which typically relies on sparse smartphone imagery. While 3D Gaussian Splatting (3DGS) shows promise for novel view synthesis, its application to the standard clinical triad of unposed anterior and bilateral buccal photographs is challenging. The large view baselines, inconsistent illumination, and specular surfaces common in intraoral settings can destabilize simultaneous pose and geometry estimation. Furthermore, sparse-view photometric supervision often induces a frequency bias, leading to over-smoothed reconstructions that lose critical diagnostic details. To address these limitations, we propose Dental3R, a pose-free, graph-guided pipeline for robust, high-fidelity reconstruction from sparse intraoral photographs. Our method first constructs a Geometry-Aware Pairing Strategy (GAPS) to intelligently select a compact subgraph of high-value image pairs. The GAPS focuses on correspondence matching, thereby improving the stability of the geometry initialization and reducing memory usage. Building on the recovered poses and point cloud, we train the 3DGS model with a wavelet-regularized objective. By enforcing band-limited fidelity using a discrete wavelet transform, our approach preserves fine enamel boundaries and interproximal edges while suppressing high-frequency artifacts. We validate our approach on a large-scale dataset of 950 clinical cases and an additional video-based test set of 195 cases. Experimental results demonstrate that Dental3R effectively handles sparse, unposed inputs and achieves superior novel view synthesis quality for dental occlusion visualization, outperforming state-of-the-art methods.
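
To give a feel for what a wavelet-regularized objective can look like, the sketch below applies a one-level 1D Haar transform to a rendered signal and a reference signal and penalizes low- and high-frequency bands with separate weights. The choice of the Haar wavelet, the L1 penalty, the band weights, and the function names are all assumptions for illustration; the abstract does not specify the wavelet or the exact loss used by Dental3R.

```python
import math
from typing import List, Sequence, Tuple


def haar_dwt(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One-level orthonormal Haar transform of an even-length signal."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def band_weighted_l1(
    rendered: Sequence[float],
    target: Sequence[float],
    w_low: float = 1.0,    # hypothetical weight on the low-frequency band
    w_high: float = 0.5,   # hypothetical weight on the high-frequency band
) -> float:
    """L1 distance computed separately per wavelet band, then weighted."""
    r_low, r_high = haar_dwt(rendered)
    t_low, t_high = haar_dwt(target)
    low = sum(abs(a - b) for a, b in zip(r_low, t_low))
    high = sum(abs(a - b) for a, b in zip(r_high, t_high))
    return w_low * low + w_high * high


row = [0.2, 0.2, 0.9, 0.1]  # toy scanline with one sharp edge
loss = band_weighted_l1([0.2, 0.2, 0.5, 0.5], row)
```

Splitting the error by band lets the objective treat edges (detail coefficients) differently from smooth shading (approximation coefficients), which is the general mechanism by which a wavelet term can counter over-smoothing without rewarding noise.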

[131] LSP-YOLO: A Lightweight Single-Stage Network for Sitting Posture Recognition on Embedded Devices

Nanjun Li,Ziyue Hao,Quanqiang Wang,Xuanyin Wang

Main category: cs.CV

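TL;DR: Proposes LSP-YOLO, a lightweight single-stage network for sitting posture recognition on embedded edge devices that offers high accuracy, low computational cost, and real-time performance.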

Details Motivation: Most existing sitting posture recognition methods use two-stage pipelines that are intrusive, computationally heavy, and slow, which makes efficient deployment on edge devices difficult. Method: Inspired by YOLOv11-Pose, the authors design Light-C3k2, a lightweight module integrating partial convolution (PConv) and the Similarity-Aware Activation Module (SimAM), map keypoints directly to posture classes with pointwise convolution, and add intermediate supervision to fuse pose estimation and classification efficiently. Result: On a self-built dataset of 5,000 images across six sitting posture categories, the smallest model, LSP-YOLO-n, reaches 94.2% accuracy and 251 FPS on a PC with a model size of only 1.9 MB, and real-time deployment is demonstrated on the SV830C + GC030A platform. Conclusion: LSP-YOLO is efficient, lightweight, and easy to deploy, making it suitable for smart classrooms, rehabilitation, and human-computer interaction. Abstract: With the rise in sedentary behavior, health problems caused by poor sitting posture have drawn increasing attention. Most existing methods, whether using invasive sensors or computer vision, rely on two-stage pipelines, which result in high intrusiveness, intensive computation, and poor real-time performance on embedded edge devices. Inspired by YOLOv11-Pose, a lightweight single-stage network for sitting posture recognition on embedded edge devices termed LSP-YOLO was proposed. By integrating partial convolution (PConv) and Similarity-Aware Activation Module (SimAM), a lightweight module, Light-C3k2, was designed to reduce computational cost while maintaining feature extraction capability. In the recognition head, keypoints were directly mapped to posture classes through pointwise convolution, and intermediate supervision was employed to enable efficient fusion of pose estimation and classification. Furthermore, a dataset containing 5,000 images across six posture categories was constructed for model training and testing. The smallest trained model, LSP-YOLO-n, achieved 94.2% accuracy and 251 FPS on a personal computer (PC) with a model size of only 1.9 MB. Meanwhile, real-time and high-accuracy inference under constrained computational resources was demonstrated on the SV830C + GC030A platform. The proposed approach is characterized by high efficiency, lightweight design and deployability, making it suitable for smart classrooms, rehabilitation, and human-computer interaction applications.

[132] Step by Step Network

Dongchen Han,Tianzhu Ye,Zhuofan Xia,Kaiyi Chen,Yulin Wang,Hanting Chen,Gao Huang

Main category: cs.CV

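TL;DR: Proposes StepsNet, a generalized residual architecture built from blocks of progressively increasing width, which addresses shortcut degradation and limited width in deep networks and markedly improves deep-model performance across a range of tasks.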

Details Motivation: As networks keep getting deeper, existing architectures struggle to realize their theoretical capacity gains, so more advanced designs are needed to unlock the potential of deep networks. Method: Proposes Step by Step Network (StepsNet), which separates features along the channel dimension and learns progressively by stacking blocks of increasing width, mitigating shortcut degradation and the depth-width trade-off. Result: StepsNet consistently outperforms conventional residual models on image classification, object detection, semantic segmentation, and language modeling. Conclusion: StepsNet is an effective generalized residual architecture that better realizes the theoretical potential of deep networks and is a superior generalization of residual networks. Abstract: Scaling up network depth is a fundamental pursuit in neural architecture design, as theory suggests that deeper models offer exponentially greater capability. Benefiting from the residual connections, modern neural networks can scale up to more than one hundred layers and enjoy wide success. However, as networks continue to deepen, current architectures often struggle to realize their theoretical capacity improvements, calling for more advanced designs to further unleash the potential of deeper networks. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. Shortcut degradation hinders deep-layer learning, while the inherent depth-width trade-off imposes limited width. To mitigate these issues, we propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance of deep models. Specifically, we separate features along the channel dimension and let the model learn progressively via stacking blocks with increasing width. The resulting method mitigates the two identified problems and serves as a versatile macro design applicable to various models. Extensive experiments show that our method consistently outperforms residual models across diverse tasks, including image classification, object detection, semantic segmentation, and language modeling. These results position StepsNet as a superior generalization of the widely adopted residual architecture.

[133] ArchMap: Arch-Flattening and Knowledge-Guided Vision Language Model for Tooth Counting and Structured Dental Understanding

Bohan Zhang,Yiyi Miao,Taoyu Wu,Tong Chen,Ji Jiang,Zhuoxiao Li,Zhe Tang,Limin Yu,Jionglong Su

Main category: cs.CV

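TL;DR: Proposes ArchMap, a training-free, knowledge-guided framework for robust structured dental understanding of 3D intraoral scans, using geometric normalization and a dental knowledge base to achieve cross-device generalization and multi-task clinical analysis.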

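Details Motivation: Existing deep-learning methods depend on modality-specific training, large annotated datasets, and controlled scanning conditions, which limits generalization across devices and deployment in real clinical workflows; in addition, raw intraoral meshes show large pose variation, incomplete geometry, and no texture, which makes semantic interpretation difficult. Method: ArchMap first applies a geometry-aware arch-flattening module that standardizes raw 3D meshes into multi-view projections, then builds a Dental Knowledge Base (DKB) containing a hierarchical tooth ontology, dentition-stage rules, and clinical semantics to constrain the symbolic reasoning space, enabling training-free semantic parsing. Result: Validated on 1060 orthodontic cases, ArchMap performs well in tooth counting, anatomical partitioning, dentition-stage classification, and identification of clinical conditions such as crowding, missing teeth, prosthetics, and caries; compared with supervised models and prompted vision-language model baselines, it is more accurate, shows less semantic drift, and is more stable on sparse or artifact-prone data. Conclusion: By combining geometric normalization with ontology-guided multimodal reasoning, ArchMap offers a training-free, scalable approach to analyzing 3D intraoral scans in real clinical settings and advances structured understanding in digital orthodontics. Abstract: A structured understanding of intraoral 3D scans is essential for digital orthodontics. However, existing deep-learning approaches rely heavily on modality-specific training, large annotated datasets, and controlled scanning conditions, which limit generalization across devices and hinder deployment in real clinical workflows. Moreover, raw intraoral meshes exhibit substantial variation in arch pose, incomplete geometry caused by occlusion or tooth contact, and a lack of texture cues, making unified semantic interpretation highly challenging. To address these limitations, we propose ArchMap, a training-free and knowledge-guided framework for robust structured dental understanding. ArchMap first introduces a geometry-aware arch-flattening module that standardizes raw 3D meshes into spatially aligned, continuity-preserving multi-view projections. We then construct a Dental Knowledge Base (DKB) encoding hierarchical tooth ontology, dentition-stage policies, and clinical semantics to constrain the symbolic reasoning space. We validate ArchMap on 1060 pre-/post-orthodontic cases, demonstrating robust performance in tooth counting, anatomical partitioning, dentition-stage classification, and the identification of clinical conditions such as crowding, missing teeth, prosthetics, and caries. Compared with supervised pipelines and prompted VLM baselines, ArchMap achieves higher accuracy, reduced semantic drift, and superior stability under sparse or artifact-prone conditions.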
As a fully training-free system, ArchMap demonstrates that combining geometric normalization with ontology-guided multimodal reasoning offers a practical and scalable solution for the structured analysis of 3D intraoral scans in modern digital orthodontics.

[134] Silhouette-to-Contour Registration: Aligning Intraoral Scan Models with Cephalometric Radiographs

Yiyi Miao,Taoyu Wu,Ji Jiang,Tong Chen,Zhe Tang,Zhengyong Jiang,Angelos Stefanidis,Limin Yu,Jionglong Su

Main category: cs.CV

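TL;DR: Proposes DentalSCR, a framework that tackles the practical clinical challenges of 3D-2D registration between intraoral scan models and lateral cephalometric radiographs; by building a unified anatomical coordinate system and using contour-guided registration, it achieves accurate, interpretable, and stable alignment.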

Details Motivation: Conventional intensity-based registration is unstable when radiographs exhibit projective magnification, geometric distortion, and low-contrast dental crowns, which often leads to registration failure or anatomically implausible alignment. Method: DentalSCR (1) constructs a U-Midline Dental Axis (UMDA) to standardize the anatomical coordinate system; (2) uses a surface-based DRR to generate clinically realistic projection images; and (3) optimizes a 2D similarity transform with a bidirectional Chamfer distance under a hierarchical coarse-to-fine schedule. Result: Validated on 34 clinical cases, the method substantially reduces landmark error (especially in the posterior teeth region), yields a tighter error distribution on the lower jaw, and performs well on curve-level metrics (Chamfer and Hausdorff distances). Conclusion: DentalSCR handles real clinical data robustly and delivers high-fidelity, inspectable 3D-2D registration results that outperform conventional methods. Abstract: Reliable 3D-2D alignment between intraoral scan (IOS) models and lateral cephalometric radiographs is critical for orthodontic diagnosis, yet conventional intensity-driven registration methods struggle under real clinical conditions, where cephalograms exhibit projective magnification, geometric distortion, low-contrast dental crowns, and acquisition-dependent variation. These factors hinder the stability of appearance-based similarity metrics and often lead to convergence failures or anatomically implausible alignments. To address these limitations, we propose DentalSCR, a pose-stable, contour-guided framework for accurate and interpretable silhouette-to-contour registration. Our method first constructs a U-Midline Dental Axis (UMDA) to establish a unified cross-arch anatomical coordinate system, thereby stabilizing initialization and standardizing projection geometry across cases. Using this reference frame, we generate radiograph-like projections via a surface-based DRR formulation with coronal-axis perspective and Gaussian splatting, which preserves clinical source-object-detector magnification and emphasizes external silhouettes. Registration is then formulated as a 2D similarity transform optimized with a symmetric bidirectional Chamfer distance under a hierarchical coarse-to-fine schedule, enabling both large capture range and subpixel-level contour agreement. We evaluate DentalSCR on 34 expert-annotated clinical cases.
Experimental results demonstrate substantial reductions in landmark error (particularly at posterior teeth), tighter dispersion on the lower jaw, and low Chamfer and controlled Hausdorff distances at the curve level. These findings indicate that DentalSCR robustly handles real-world cephalograms and delivers high-fidelity, clinically inspectable 3D-2D alignment, outperforming conventional baselines.
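To make the registration objective in the abstract above concrete, here is a minimal sketch of a symmetric bidirectional Chamfer distance between two 2D point sets, together with a 2D similarity transform. It is an illustration under assumptions, not the authors' implementation: the function names, the averaging of the two directions, and the brute-force nearest-neighbour search are all choices made here, and the coarse-to-fine optimization schedule is not shown.

```python
import math

def nearest_distance(point, points):
    """Euclidean distance from `point` to its nearest neighbour in `points`."""
    return min(math.dist(point, q) for q in points)

def symmetric_chamfer(silhouette, contour):
    """Mean nearest-neighbour distance, averaged over both directions.

    Assumption: the two directions are weighted equally; the paper may
    normalise or weight them differently.
    """
    forward = sum(nearest_distance(p, contour) for p in silhouette) / len(silhouette)
    backward = sum(nearest_distance(q, silhouette) for q in contour) / len(contour)
    return 0.5 * (forward + backward)

def apply_similarity(points, scale, theta, tx, ty):
    """Apply a 2D similarity transform: uniform scale, rotation by theta, translation."""
    c, s = math.cos(theta), math.sin(theta)
    return [(scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)
            for x, y in points]

# Usage: a unit square compared with a copy shifted by 0.5 along x.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = apply_similarity(square, scale=1.0, theta=0.0, tx=0.5, ty=0.0)
print(symmetric_chamfer(square, shifted))
```

An optimizer would search over `scale`, `theta`, `tx`, and `ty` to minimize this value; using both directions penalizes contour points left unmatched on either side.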

[135] ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

Junfu Pu,Teng Wang,Yixiao Ge,Yuying Ge,Chen Li,Ying Shan

Main category: cs.CV

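TL;DR: ARC-Chapter is the first large-scale video chaptering model, trained on million-scale long-video chapter data; it supports bilingual, temporally grounded, and hierarchical annotations and substantially improves the structuring of long-video content.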

Details Motivation: Existing video chaptering methods are limited by small-scale, coarse-grained annotations and generalize poorly to the nuanced transitions in long videos, so a larger, more finely annotated dataset and model are needed. Method: Proposes the ARC-Chapter model and builds a structured pipeline that fuses ASR transcripts, scene text, and visual captions to produce bilingual English-Chinese, multi-level chapter annotations; it also designs a new evaluation metric, GRACE, which combines many-to-one segment overlap with semantic similarity. Result: Experiments show that ARC-Chapter improves on the previous best method by 14.0% in F1 score and 11.3% in SODA score, and transfers well to downstream tasks such as dense video captioning on YouCook2. Conclusion: With large-scale data and fine-grained annotation, ARC-Chapter achieves a significant advance in long-video chaptering, and the proposed GRACE metric better matches practical needs, advancing video content structuring. Abstract: The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typically short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model trained on million-scale long video chapters, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene texts, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. Moreover, ARC-Chapter shows excellent transferability, improving the state-of-the-art on downstream tasks like dense video captioning on YouCook2.

[136] IBGS: Image-Based Gaussian Splatting

Hoang Chuong Nguyen,Wei Mao,Jose M. Alvarez,Miaomiao Liu

Main category: cs.CV

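TL;DR: Proposes Image-Based Gaussian Splatting, which uses high-resolution source images to model fine detail and view-dependent color; by combining the base rendered color with a residual learned from neighboring training images, it improves 3D Gaussian Splatting for novel view synthesis and significantly raises rendering quality without increasing storage overhead.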

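Details Motivation: 3D Gaussian Splatting (3DGS) represents color with low-degree spherical harmonics, which limits its ability to express spatially varying color and view-dependent effects such as specular highlights; existing texture-augmentation methods either struggle with complex scenes or incur high storage overhead. Method: Each pixel color is modeled as the sum of the base 3DGS rendered color and a residual learned from neighboring training images, with high-resolution source images supplying high-frequency detail and view-dependent information for more accurate color prediction and surface alignment. Result: The method significantly outperforms prior Gaussian Splatting approaches on standard novel view synthesis benchmarks, especially on high-frequency detail and view-dependent effects, without increasing the model's storage footprint. Conclusion: While retaining the efficiency and low storage cost of 3DGS, the method addresses its limitations in color representation and view-dependent effects, offering an efficient route to high-quality novel view synthesis. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a fast, high-quality method for novel view synthesis (NVS). However, its use of low-degree spherical harmonics limits its ability to capture spatially varying color and view-dependent effects such as specular highlights. Existing works augment Gaussians with either a global texture map, which struggles with complex scenes, or per-Gaussian texture maps, which introduce high storage overhead. We propose Image-Based Gaussian Splatting, an efficient alternative that leverages high-resolution source images for fine details and view-specific color modeling. Specifically, we model each pixel color as a combination of a base color from standard 3DGS rendering and a learned residual inferred from neighboring training images. This promotes accurate surface alignment and enables rendering images of high-frequency details and accurate view-dependent effects. Experiments on standard NVS benchmarks show that our method significantly outperforms prior Gaussian Splatting approaches in rendering quality, without increasing the storage footprint.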

[137] Clinically-Validated Innovative Mobile Application for Assessing Blinking and Eyelid Movements

Gustavo Adolpho Bonesso,Carlos Marcelo Gurjão de Godoy,Tammy Hentona Osaki,Midori Hentona Osaki,Bárbara Moreira Ribeiro Trindade dos Santos,Regina Célia Coelho

Main category: cs.CV

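TL;DR: This study develops and clinically validates Bapp, a mobile application built with the Flutter framework and Google ML Kit for real-time, objective analysis of eyelid movements.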

Details Motivation: Existing methods for assessing blinking are complex, costly, and of limited clinical applicability, and convenient, objective tools are lacking. Method: Bapp is developed with the Flutter framework and integrates Google ML Kit for on-device, real-time eyelid movement analysis; it is validated on 45 videos of real patients, with manual annotations by ophthalmology specialists serving as the ground truth. Result: Bapp achieves 98.4% precision, 96.9% recall, and 98.3% overall accuracy, showing high accuracy and reliability. Conclusion: Bapp is a portable, accessible, and reliable tool for monitoring normal and abnormal blinking; it is a promising replacement for traditional manual counting and supports continuous ocular health monitoring and postoperative evaluation in clinical settings. Abstract: Blinking is a vital physiological process that protects and maintains the health of the ocular surface. Objective assessment of eyelid movements remains challenging due to the complexity, cost, and limited clinical applicability of existing tools. This study presents the clinical validation of Bapp (Blink Application), a mobile application developed using the Flutter framework and integrated with Google ML Kit for on-device, real-time analysis of eyelid movements. The validation occurred using 45 videos from real patients, whose blinks were manually annotated by ophthalmology specialists from the Paulista School of Medicine of the Federal University of Sao Paulo (EPM-UNIFESP) to serve as the ground truth. Bapp's performance was evaluated using standard metrics, including Precision, Recall, and F1-Score, with results demonstrating 98.4% precision, 96.9% recall, and an overall accuracy of 98.3%. These outcomes confirm the reliability of Bapp as a portable, accessible, and objective tool for monitoring both normal and abnormal eyelid movements. The application offers a promising alternative to traditional manual blink counting, supporting continuous ocular health monitoring and postoperative evaluation in clinical environments.
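The abstract lists F1-Score among the evaluation metrics but reports values only for precision, recall, and overall accuracy. As a small worked sketch, the standard definitions below show how the three detection metrics relate; the function names and the example counts are illustrative assumptions, not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)

# F1 implied by the precision and recall quoted in the abstract.
print(round(f1_score(0.984, 0.969), 4))
```

Plugging in the quoted 98.4% precision and 96.9% recall gives an F1 of roughly 97.6%. This is arithmetic on the reported numbers rather than a figure stated by the authors, and it is distinct from the 98.3% overall accuracy.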

[138] Blur-Robust Detection via Feature Restoration: An End-to-End Framework for Prior-Guided Infrared UAV Target Detection

Xiaolin Wang,Houzhang Fang,Qingshan Li,Lu Wang,Yi Chang,Luxin Yan

Main category: cs.CV

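TL;DR: Proposes JFD3, an end-to-end framework for joint feature-domain deblurring and detection; in a shared-weight dual-branch structure, the clear-image branch guides the blurred branch toward more discriminative features, supported by a feature restoration network, a frequency structure guidance module, and a feature consistency self-supervised loss, achieving superior detection performance and real-time speed on the newly built IRBlurUAV dataset.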

Details Motivation: Existing deblurring methods mostly focus on visual quality and neglect the discriminative features that matter for detection, so detection of infrared UAV targets degrades under motion blur. Method: The JFD3 framework (1) designs a lightweight feature restoration network in which features from the clear branch supervise the blurred branch; (2) introduces a frequency structure guidance module that integrates the restoration network's structure prior into shallow detection layers; and (3) imposes a feature consistency self-supervised loss between the dual-branch backbones so that the blurred branch approximates the clear branch's feature representation. Result: Experiments on the self-built IRBlurUAV dataset (30,000 simulated and 4,118 real infrared UAV images) show that JFD3 significantly improves detection under blur while maintaining real-time processing. Conclusion: By jointly optimizing deblurring and detection, JFD3 strengthens task-relevant features in blurred images and provides an efficient solution for detecting infrared UAV targets under motion blur. Abstract: Infrared unmanned aerial vehicle (UAV) target images often suffer from motion blur degradation caused by rapid sensor movement, significantly reducing contrast between target and background. Generally, detection performance heavily depends on the discriminative feature representation between target and background. Existing methods typically treat deblurring as a preprocessing step focused on visual quality, while neglecting the enhancement of task-relevant features crucial for detection. Improving feature representation for detection under blur conditions remains challenging. In this paper, we propose a novel Joint Feature-Domain Deblurring and Detection end-to-end framework, dubbed JFD3. We design a dual-branch architecture with shared weights, where the clear branch guides the blurred branch to enhance discriminative feature representation. Specifically, we first introduce a lightweight feature restoration network, where features from the clear branch serve as feature-level supervision to guide the blurred branch, thereby enhancing its distinctive capability for detection. We then propose a frequency structure guidance module that refines the structure prior from the restoration network and integrates it into shallow detection layers to enrich target structural information. Finally, a feature consistency self-supervised loss is imposed between the dual-branch detection backbones, driving the blurred branch to approximate the feature representations of the clear one.
We also construct a benchmark, named IRBlurUAV, containing 30,000 simulated and 4,118 real infrared UAV target images with diverse motion blur. Extensive experiments on IRBlurUAV demonstrate that JFD3 achieves superior detection performance while maintaining real-time efficiency.

[139] A Quantitative Method for Shoulder Presentation Evaluation in Biometric Identity Documents

Alfonso Pedro Ridao

Main category: cs.CV

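TL;DR: Proposes a Shoulder Presentation Evaluation (SPE) algorithm that quantifies shoulder yaw and roll in biometric identity document images, using only the 3D coordinates of two shoulder landmarks provided by common pose estimation frameworks.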

Details Motivation: International standards for biometric identity documents impose strict requirements on shoulder pose, but existing automated quality assessment methods lack quantitative means of evaluating this attribute. Method: Using the 3D coordinates of two shoulder landmarks, the method computes shoulder yaw and roll to quantify shoulder pose and produce an SPE score. Result: Validated on a dataset of 121 portrait images, the SPE scores show a strong Pearson correlation with human labels (r ≈ 0.80), and an adapted Error-versus-Discard analysis confirms that the metric effectively filters non-compliant samples. Conclusion: The algorithm is a lightweight, viable tool for automated compliance checking in enrolment systems. Abstract: International standards for biometric identity documents mandate strict compliance with pose requirements, including the square presentation of a subject's shoulders. However, the literature on automated quality assessment offers few quantitative methods for evaluating this specific attribute. This paper proposes a Shoulder Presentation Evaluation (SPE) algorithm to address this gap. The method quantifies shoulder yaw and roll using only the 3D coordinates of two shoulder landmarks provided by common pose estimation frameworks. The algorithm was evaluated on a dataset of 121 portrait images. The resulting SPE scores demonstrated a strong Pearson correlation (r ≈ 0.80) with human-assigned labels. An analysis of the metric's filtering performance, using an adapted Error-versus-Discard methodology, confirmed its utility in identifying non-compliant samples. The proposed algorithm is a viable lightweight tool for automated compliance checking in enrolment systems.
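The abstract does not give the formulas behind the SPE score, so the following is only a plausible sketch of how yaw and roll could be derived from two 3D shoulder landmarks. The axis convention (x to the right in the image, y downward, z away from the camera), the function name, and the use of `atan2` are assumptions made here; the mapping from angles to a final score is not shown because the abstract does not describe it.

```python
import math

def shoulder_angles(image_left, image_right):
    """Return (yaw, roll) in degrees from two 3D shoulder landmarks.

    Assumed axes: x to the right in the image, y downward, z away from
    the camera. `image_left` is the landmark with the smaller x value.
    """
    dx = image_right[0] - image_left[0]
    dy = image_right[1] - image_left[1]
    dz = image_right[2] - image_left[2]
    yaw = math.degrees(math.atan2(dz, dx))   # rotation about the vertical axis
    roll = math.degrees(math.atan2(dy, dx))  # tilt of the shoulder line in the image plane
    return yaw, roll

# Usage: shoulders level and square to the camera give zero yaw and zero roll.
print(shoulder_angles((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
```

Under this convention, a subject whose shoulders are square to the camera yields angles near zero, and a compliance check could threshold the absolute values.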

[140] Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving

Kangqiao Zhao,Shuo Huai,Xurui Song,Jun Luo

Main category: cs.CV

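TL;DR: Proposes a new texture-enabled physical adversarial attack on stereo matching models in autonomous driving; it uses 3D adversarial examples with a global camouflage texture to attack effectively across viewpoints, and introduces a new rendering module and a merging attack strategy that markedly improve stealth and effectiveness.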

Details Motivation: Existing adversarial attacks are mostly based on 2D patches and mainly target monocular perception; the vulnerability of stereo depth estimation models to physical adversarial examples remains unclear, so more realistic and effective physical attacks need to be explored. Method: Proposes a 3D physical adversarial example with a global camouflage texture, designs a new 3D stereo matching rendering module to handle the disparity between binocular cameras, and uses fine-grained optimization to realize a merging attack that blends the target seamlessly into the environment. Result: Experiments show that the proposed PAE effectively misleads stereo matching models into producing erroneous depth information and is stealthier and more damaging than existing hiding attacks. Conclusion: The work demonstrates the vulnerability of stereo depth estimation models to texture-enabled 3D physical adversarial attacks and offers a new perspective and method for assessing the safety of autonomous driving systems. Abstract: Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It significantly enhances stealth and lethality over existing hiding attacks, which fail to merge seamlessly into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.

[141] Enhancing LLM-based Autonomous Driving with Modular Traffic Light and Sign Recognition

Fabian Schmidt,Noushiq Mohammed Kayilan Abdul Nazar,Markus Enzweiler,Abhinav Valada

Main category: cs.CV

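TL;DR: Proposes TLS-Assist, a modular redundancy layer that augments LLM-based autonomous driving agents with explicit traffic light and sign recognition, improving driving performance and reducing traffic infractions.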

Details Motivation: Current LLM-based autonomous driving agents lack mechanisms to enforce traffic rules and struggle to reliably detect small but safety-critical objects such as traffic lights and signs. Method: The TLS-Assist framework converts traffic light and sign detections into structured natural language messages and injects them into the LLM input, explicitly steering the model toward safety-critical information; it supports single-view and multi-view camera setups and is plug-and-play and model-agnostic. Result: In closed-loop tests on the LangAuto benchmark in CARLA, driving performance improves by up to 14% over LMDrive and up to 7% over BEVDriver, with markedly fewer traffic light and sign infractions. Conclusion: TLS-Assist effectively compensates for the weakness of LLMs in safety-critical perception and improves the safety and reliability of autonomous driving systems. Abstract: Large Language Models (LLMs) are increasingly used for decision-making and planning in autonomous driving, showing promising reasoning capabilities and potential to generalize across diverse traffic situations. However, current LLM-based driving agents lack explicit mechanisms to enforce traffic rules and often struggle to reliably detect small, safety-critical objects such as traffic lights and signs. To address this limitation, we introduce TLS-Assist, a modular redundancy layer that augments LLM-based autonomous driving agents with explicit traffic light and sign recognition. TLS-Assist converts detections into structured natural language messages that are injected into the LLM input, enforcing explicit attention to safety-critical cues. The framework is plug-and-play, model-agnostic, and supports both single-view and multi-view camera setups. We evaluate TLS-Assist in a closed-loop setup on the LangAuto benchmark in CARLA. The results demonstrate relative driving performance improvements of up to 14% over LMDrive and 7% over BEVDriver, while consistently reducing traffic light and sign infractions. We publicly release the code and models on https://github.com/iis-esslingen/TLS-Assist.
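The idea of turning detections into structured natural language and prepending them to the LLM input can be sketched as follows. Everything here is hypothetical: the `Detection` fields, the label strings, and the message wording are invented for illustration and are not the format used in the TLS-Assist repository.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    kind: str          # hypothetical labels: "traffic_light" or "traffic_sign"
    state: str         # e.g. "red", "stop", "speed_limit_30"
    distance_m: float  # estimated distance to the object in metres

def to_message(detections):
    """Render detections as one sentence, nearest object first."""
    if not detections:
        return "No traffic lights or signs detected."
    parts = []
    for det in sorted(detections, key=lambda d: d.distance_m):
        label = det.kind.replace("_", " ")
        parts.append(f"{label} ({det.state}) about {det.distance_m:.0f} m ahead")
    return "Detected: " + "; ".join(parts) + "."

def build_prompt(instruction, detections):
    """Prepend the safety message to the navigation instruction."""
    return f"{to_message(detections)}\n{instruction}"

# Usage with two made-up detections.
dets = [Detection("traffic_sign", "stop", 30.0), Detection("traffic_light", "red", 12.4)]
print(build_prompt("Turn left at the next intersection.", dets))
```

Keeping the message generator separate from the driving model is what makes such a layer model-agnostic: any agent that accepts text input can consume the same string.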

[142] BEDLAM2.0: Synthetic Humans and Cameras in Motion

Joachim Tesch,Giorgio Becherini,Prerana Achar,Anastasios Yiannakidis,Muhammed Kocabas,Priyanka Patel,Michael J. Black

Main category: cs.CV

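TL;DR: BEDLAM2.0 is an improved, more realistic 3D human motion dataset; compared with the original BEDLAM it is more diverse in body shape, motion, clothing, environments, and camera motion, and it adds shoes, significantly improving training for 3D human motion estimation in world coordinates.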

Details Motivation: Methods for estimating 3D human motion from video are limited by the lack of high-quality annotated data containing realistic human and camera motion, and they perform poorly in applications that require output in world coordinates. Method: The authors build and release BEDLAM2.0, which expands the diversity and realism of body shapes, motions, clothing, hair, shoes, 3D scenes, and camera setups, and provides rendered videos, ground-truth body parameters, and camera motions. Result: Experiments show that state-of-the-art methods trained on BEDLAM2.0 are significantly more accurate than those trained on the original BEDLAM, especially for 3D human motion estimation in world coordinates. Conclusion: BEDLAM2.0 is a stronger, more realistic dataset that provides an important resource for 3D human pose and motion estimation and advances research on inferring human motion in world coordinates from video. Abstract: Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the-art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.

[143] Stage Aware Diagnosis of Diabetic Retinopathy via Ordinal Regression

Saksham Kumar,D Sridhar Aditya,T Likhil Kumar,Thulasi Bikku,Srinivasarao Thota,Chandan Kumar

Main category: cs.CV

TL;DR: Proposes an ordinal-regression-based framework for diabetic retinopathy detection that reaches a Quadratic Weighted Kappa score of 0.8992 on the APTOS-2019 dataset.

Details Motivation: Diabetic retinopathy is a leading cause of preventable blindness, and early screening can prevent irreversible damage, so efficient and accurate automated detection is needed. Method: The pipeline applies green channel extraction, noise masking, and CLAHE as preprocessing, builds a classification model based on ordinal regression, and trains and evaluates it on the APTOS-2019 fundus image dataset. Result: The model attains a Quadratic Weighted Kappa of 0.8992 on APTOS-2019, which the authors describe as a new benchmark, indicating close agreement with clinical grading. Conclusion: The proposed ordinal regression framework performs strongly on DR grading and sets a new reference point for automated diabetic retinopathy screening. Abstract: Diabetic Retinopathy (DR) has emerged as a major cause of preventable blindness in recent times. With timely screening and intervention, the condition can be prevented from causing irreversible damage. The work introduces a state-of-the-art Ordinal Regression-based DR Detection framework that uses the APTOS-2019 fundus image dataset. A widely accepted combination of preprocessing methods (Green Channel (GC) Extraction, Noise Masking, and CLAHE) was used to isolate the most relevant features for DR classification. Model performance was evaluated using the Quadratic Weighted Kappa (QWK), with a focus on agreement between results and clinical grading. Our Ordinal Regression approach attained a QWK score of 0.8992, setting a new benchmark on the APTOS dataset.
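Since the reported result is a Quadratic Weighted Kappa, a short reference sketch of that metric may help. It follows the standard definition (one minus the ratio of weighted observed disagreement to weighted chance disagreement, with weights growing quadratically in the grade difference); the function name and the pure-Python style are choices made here, and the example labels are invented.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic Weighted Kappa for integer grades in range(num_classes)."""
    n = len(y_true)
    # Confusion matrix: rows are true grades, columns are predicted grades.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n  # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Undefined (division by zero) if all labels and predictions share one grade.
    return 1.0 - weighted_observed / weighted_expected

# Usage with five DR grades (0-4) and invented predictions.
print(quadratic_weighted_kappa([0, 1, 2, 3, 4, 2], [0, 1, 2, 4, 4, 1], 5))
```

Because the penalty grows with the squared distance between grades, confusing grade 0 with grade 4 costs far more than confusing adjacent grades, which is why the metric suits ordinal labels such as DR severity.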

[144] Language as an Anchor: Preserving Relative Visual Geometry for Domain Incremental Learning

Shuyi Geng,Tao Zhou,Yi Zhou

Main category: cs.CV

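TL;DR: Proposes LAVA, a new framework for domain incremental learning that uses relative alignment against a text-based anchor to keep the geometry of visual representations consistent across domains, effectively mitigating forgetting and semantic distortion and significantly outperforming existing methods on standard benchmarks.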

Details Motivation: Existing domain incremental learning methods cause inter-domain interference and semantic distortion when they align all domains in a single visual space, while isolating domain-specific parameters fragments knowledge and hinders knowledge reuse and continual learning. Method: LAVA builds a reference anchor from the semantic similarities between textual class names and uses language-guided relative alignment so that the visual representations of each domain keep a consistent relative geometry, bridging knowledge across domains and enabling feature aggregation. Result: Experiments on several standard DIL benchmarks show that LAVA significantly outperforms existing state-of-the-art methods, with better performance and knowledge retention. Conclusion: Through language-anchored visual alignment, LAVA addresses forgetting and semantic distortion in domain incremental learning and offers a new approach to cross-domain continual learning. Abstract: A key challenge in Domain Incremental Learning (DIL) is to continually learn under shifting distributions while preserving knowledge from previous domains. Existing methods face a fundamental dilemma. On one hand, projecting all domains into a single unified visual space leads to inter-domain interference and semantic distortion, as large shifts may vary with not only visual appearance but also underlying semantics. On the other hand, isolating domain-specific parameters causes knowledge fragmentation, creating "knowledge islands" that hamper knowledge reuse and exacerbate forgetting. To address this issue, we propose LAVA (Language-Anchored Visual Alignment), a novel DIL framework that replaces direct feature alignment with relative alignment driven by a text-based reference anchor. LAVA guides the visual representations of each incoming domain to preserve a consistent relative geometry, which is defined by mirroring the pairwise semantic similarities between the class names. This anchored geometric structure acts as a bridge across domains, enabling the retrieval of class-aware prior knowledge and facilitating robust feature aggregation. Extensive experiments on standard DIL benchmarks demonstrate that LAVA achieves significant performance improvements over state-of-the-art methods. Code is available at https://github.com/ShuyiGeng/LAVA.
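One way to read "relative alignment" is that the pairwise similarities among visual class prototypes should mirror the pairwise similarities among class-name text embeddings. The sketch below expresses that reading as a simple loss. It is an assumption-laden illustration, not the LAVA objective: the mean-squared form, the use of cosine similarity, and all names are choices made here, and the toy vectors stand in for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities between all vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

def relative_geometry_loss(visual_prototypes, text_embeddings):
    """Mean squared gap between visual and text pairwise-similarity matrices."""
    sim_visual = similarity_matrix(visual_prototypes)
    sim_text = similarity_matrix(text_embeddings)
    k = len(sim_visual)
    total = sum((sim_visual[i][j] - sim_text[i][j]) ** 2
                for i in range(k) for j in range(k))
    return total / (k * k)

# Usage: the visual prototypes are a rotated copy of the text anchors,
# so their relative geometry matches even though the vectors differ.
text = [(1.0, 0.0), (0.0, 1.0)]
visual = [(0.0, 1.0), (-1.0, 0.0)]
print(relative_geometry_loss(visual, text))
```

The usage example illustrates the design point: the loss compares similarity structure rather than coordinates, so each domain can occupy a different region of feature space as long as its classes keep the same arrangement relative to one another.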

[145] Cranio-ID: Graph-Based Craniofacial Identification via Automatic Landmark Annotation in 2D Multi-View X-rays

Ravi Shankar Prasad,Nandani Sharma,Dinesh Singh

Main category: cs.CV

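TL;DR: Proposes Cranio-ID, a new framework for automatic landmark annotation and cross-modal matching in craniofacial identification, which combines YOLO-pose models with graph representations, cross-attention, and optimal transport to achieve accurate skull-to-face matching.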

Details Motivation: Traditional craniofacial landmark localization is time-consuming and depends on specialist expertise, while existing automatic annotation methods lack large-scale validation and are therefore not reliable. Method: Trained YOLO-pose models first annotate landmarks automatically on 2D skull X-ray images and their corresponding optical images; the landmarks are then converted into graph representations, and cross-attention with an optimal transport framework finds semantic correspondence across the two modalities. Result: The framework is validated on the S2F and CUHK datasets, and experiments show significant improvements in reliability and accuracy as well as effective cross-domain skull-to-face and sketch-to-face matching. Conclusion: Cranio-ID delivers higher accuracy and reliability in forensic craniofacial identification and has potential for broad use in biomedicine and forensic science. Abstract: In forensic craniofacial identification and in many biomedical applications, craniometric landmarks are important. Traditional methods for locating landmarks are time-consuming and require specialized knowledge and expertise. Current methods utilize superimposition and deep learning-based methods that employ automatic annotation of landmarks. However, these methods are not reliable due to insufficient large-scale validation studies. In this paper, we propose a novel framework, Cranio-ID. First, landmarks are automatically annotated on 2D skulls (which are X-ray scans of faces) and their respective optical images using our trained YOLO-pose models. Second, cross-modal matching is performed by formulating these landmarks into graph representations and then finding semantic correspondence between graphs of these two modalities using cross-attention and an optimal transport framework. Our proposed framework is validated on the S2F and CUHK datasets (the CUHK dataset resembles the S2F dataset). Extensive experiments have been conducted to evaluate the performance of our proposed framework, which demonstrates significant improvements in both reliability and accuracy, as well as its effectiveness in cross-domain skull-to-face and sketch-to-face matching in forensic science.

[146] Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines

Yusen Cai,Bhargava Satya Nunna,Qing Lin,Mengmi Zhang

Main category: cs.CV

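TL;DR: This study trains self-supervised learning models under constraints that mimic infant visual development (grayscale to color, blurry to sharp, and temporal continuity, together called CATDiet) and proposes CombDiet; the models show greater robustness and biological alignment across visual tasks, suggesting that the stages of infant visual development offer useful guidance for building machine visual intelligence.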

Details Motivation: To explore the ecological advantages of the staged "visual diets" of infant visual development (low acuity, degraded color, temporal continuity) and to draw biological inspiration for more robust, developmentally grounded machine vision. Method: The CATDiet framework simulates three key properties of infant vision: grayscale-to-color (C), blur-to-sharp (A), and temporal continuity (T); self-supervised learning models are trained on object-centric videos, and a comprehensive benchmark spanning ten datasets is built; CombDiet is further proposed, which initializes with CATDiet before standard training. Result: All CATDiet variants show enhanced robustness in object recognition and exhibit biologically aligned developmental patterns, such as changes mirroring synaptic density in macaque V1 and behaviors resembling infants' visual cliff responses; CombDiet outperforms standard self-supervised learning on in-domain and out-of-domain object recognition and depth perception. Conclusion: The staged character of infant visual development can serve as a reverse-engineering framework that guides the development of robust visual intelligence in machine vision systems, supporting the value of developmental vision for artificial intelligence. Abstract: Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged "visual diets", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants' visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines.
All code, data, and models will be publicly released.

[147] Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding

Hong Gao,Yiming Bao,Xuezhen Tu,Yutong Xu,Yue Jin,Yiyang Mu,Bin Zhong,Linan Yue,Min-Ling Zhang

Main category: cs.CV

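TL;DR: Proposes AVI, a flexible, training-free video understanding framework that mirrors a human-like three-phase reasoning process and uses a structured knowledge base, achieving competitive performance on multiple benchmarks.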

Details Motivation: Existing vision-language models lack support for revisiting evidence and iterative refinement in video understanding, while agent-based methods usually depend on expensive proprietary models or reinforcement learning training. Method: The framework uses a human-inspired three-phase reasoning process (Retrieve-Perceive-Review), builds a structured video knowledge base organized through entity graphs, and combines open-source models in an ensemble (reasoning LLMs, lightweight CV models, and a VLM). Result: AVI achieves competitive results on several long-video understanding benchmarks, including LVBench, VideoMME-Long, LongVideoBench, and Charades-STA, while offering good interpretability. Conclusion: AVI provides an efficient, transparent, training-free paradigm for video understanding that mirrors human cognition through system-level design and removes the dependence on proprietary models and reinforcement learning training. Abstract: Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.

[148] DIR-TIR: Dialog-Iterative Refinement for Text-to-Image Retrieval

Zongwei Zhen,Biqing Zeng

Main category: cs.CV

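TL;DR: Proposes the DIR-TIR framework, which progressively refines text-to-image retrieval through conversational interaction, combining a Dialog Refiner Module and an Image Refiner Module to significantly improve retrieval accuracy and user experience.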

Details Motivation: Conventional single-query text-to-image retrieval lacks interactivity and fault tolerance and struggles to capture user intent accurately, so a method is needed that dynamically refines retrieval over multiple dialogue turns. Method: Proposes the DIR-TIR framework with a Dialog Refiner Module and an Image Refiner Module; the former actively asks questions to gather user feedback and generate more precise text descriptions, and the latter identifies perceptual gaps between generated images and user intent to narrow the visual-semantic discrepancy. Result: Experiments on multiple image datasets show that the method significantly outperforms baselines that rely only on the initial description, improving both retrieval precision and interactive experience. Conclusion: Through the cooperation of its two modules, DIR-TIR uses multi-turn dialogue to make text-to-image retrieval more controllable and fault-tolerant, effectively raising the target-image hit rate. Abstract: This paper addresses the task of interactive, conversational text-to-image retrieval. Our DIR-TIR framework progressively refines the target image search through two specialized modules: the Dialog Refiner Module and the Image Refiner Module. The Dialog Refiner actively queries users to extract essential information and generate increasingly precise descriptions of the target image. Complementarily, the Image Refiner identifies perceptual gaps between generated images and user intentions, strategically reducing the visual-semantic discrepancy. By leveraging multi-turn dialogues, DIR-TIR provides superior controllability and fault tolerance compared to conventional single-query methods, significantly improving target image hit accuracy. Comprehensive experiments across diverse image datasets demonstrate that our dialogue-based approach substantially outperforms initial-description-only baselines, while the synergistic module integration achieves both higher retrieval precision and enhanced interactive experience.

[149] CompEvent: Complex-valued Event-RGB Fusion for Low-light Video Enhancement and Deblurring

Mingchen Zhong,Xin Lu,Dong Li,Senyan Xu,Ruixuan Jiang,Xueyang Fu,Baocai Yin

Main category: cs.CV

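TL;DR: Proposes CompEvent, an end-to-end fusion framework based on complex-valued neural networks for low-light video deblurring that fuses event-camera data with RGB frames across the full spatiotemporal pipeline and outperforms existing methods.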

Details Motivation: Existing fusion methods mostly use staged strategies, which cope poorly with degradation where low light and motion blur occur together, limiting low-light video deblurring. Method: Proposes the CompEvent framework, comprising a Complex Temporal Alignment GRU and a Complex Space-Frequency Learning module; complex-valued convolutions process event streams and RGB frames in a unified way across the spatiotemporal and frequency domains, achieving deep full-process fusion. Result: Experiments show that CompEvent outperforms state-of-the-art methods on low-light video deblurring and significantly improves restoration quality. Conclusion: Using complex-valued neural networks, CompEvent achieves efficient full-process fusion of event data and RGB frames, effectively addressing joint low-light and motion-blur degradation and advancing low-light video restoration. Abstract: Low-light video deblurring poses significant challenges in applications like nighttime surveillance and autonomous driving due to dim lighting and long exposures. While event cameras offer potential solutions with superior low-light sensitivity and high temporal resolution, existing fusion methods typically employ staged strategies, limiting their effectiveness against combined low-light and motion blur degradations. To overcome this, we propose CompEvent, a complex neural network framework enabling holistic full-process fusion of event data and RGB frames for enhanced joint restoration. CompEvent features two core components: 1) Complex Temporal Alignment GRU, which utilizes complex-valued convolutions and processes video and event streams iteratively via GRU to achieve temporal alignment and continuous fusion; and 2) Complex Space-Frequency Learning module, which performs unified complex-valued signal processing in both spatial and frequency domains, facilitating deep fusion through spatial structures and system-level characteristics. By leveraging the holistic representation capability of complex-valued neural networks, CompEvent achieves full-process spatiotemporal fusion, maximizes complementary learning between modalities, and significantly strengthens low-light video deblurring capability. Extensive experiments demonstrate that CompEvent outperforms SOTA methods in addressing this challenging task. The code is available at https://github.com/YuXie1/CompEvent.

[150] Learning Subglacial Bed Topography from Sparse Radar with Physics-Guided Residuals

Bayu Adhi Tama,Jianwu Wang,Vandana Janeja,Mostafa Cham

Main category: cs.CV

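TL;DR: Proposes a physics-guided residual learning framework for accurately predicting subglacial bed topography from sparse, unevenly distributed radar observations.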

Details Motivation: Ice sheet modeling requires accurate subglacial bed topography, but available radar observations are sparse and unevenly distributed, which limits modeling accuracy. Method: Trains a DeepLabV3+ decoder over an encoder such as ResNet-50 to predict ice-thickness residuals relative to a BedMachine prior, using multi-scale mass conservation, flow-aligned total variation, Laplacian damping, thickness non-negativity, a ramped prior-consistency term, and a masked Huber loss modulated by a confidence map. Result: On two Greenland sub-regions, the method outperforms U-Net, Attention U-Net, FPN, and a plain CNN on held-out core regions, with higher test accuracy and structural fidelity. Conclusion: Physics-guided residual learning over a prior model yields spatially coherent, physically plausible bed topography suitable for operational mapping under domain shift. Abstract: Accurate subglacial bed topography is essential for ice sheet modeling, yet radar observations are sparse and uneven. We propose a physics-guided residual learning framework that predicts bed thickness residuals over a BedMachine prior and reconstructs bed from the observed surface. A DeepLabV3+ decoder over a standard encoder (e.g., ResNet-50) is trained with lightweight physics and data terms: multi-scale mass conservation, flow-aligned total variation, Laplacian damping, non-negativity of thickness, a ramped prior-consistency term, and a masked Huber fit to radar picks modulated by a confidence map. To measure real-world generalization, we adopt leakage-safe blockwise hold-outs (vertical/horizontal) with safety buffers and report metrics only on held-out cores. Across two Greenland sub-regions, our approach achieves strong test-core accuracy and high structural fidelity, outperforming U-Net, Attention U-Net, FPN, and a plain CNN. The residual-over-prior design, combined with physics, yields spatially coherent, physically plausible beds suitable for operational mapping under domain shift.
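
To make the data term in the abstract above more concrete, here is a minimal sketch of a confidence-weighted masked Huber loss and a residual-over-prior bed reconstruction. It is an illustration under stated assumptions, not the authors' implementation: the function names, the flat-list data layout, the `delta` threshold, and the rule "bed = surface - max(0, prior thickness + residual)" are all hypothetical choices made for this example.

```python
def huber(residual, delta=1.0):
    """Huber penalty: quadratic near zero, linear in the tails."""
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def masked_huber_loss(pred, target, mask, confidence, delta=1.0):
    """Confidence-weighted mean Huber loss over cells that have a radar pick.

    mask[i] is truthy where an observation exists; confidence[i] weights it.
    (Hypothetical layout: all inputs are flat lists of equal length.)
    """
    total = weight = 0.0
    for p, t, m, c in zip(pred, target, mask, confidence):
        if m:  # cells without an observation contribute nothing
            total += c * huber(p - t, delta)
            weight += c
    return total / weight if weight > 0 else 0.0


def reconstruct_bed(surface, prior_thickness, residual):
    """Assumed reconstruction: bed = surface - thickness, thickness clamped at 0."""
    return [s - max(0.0, h + r)
            for s, h, r in zip(surface, prior_thickness, residual)]
```

Clamping the thickness at zero is one simple way to respect the non-negativity constraint at inference time; the abstract lists non-negativity as a training term, so a penalty in the loss would be an equally plausible reading.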

[151] 2D Gaussians Spatial Transport for Point-supervised Density Regression

Miao Shang,Xiaopeng Hong

Main category: cs.CV

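TL;DR: Proposes the Gaussian Spatial Transport (GST) framework, which uses Gaussian splatting to transport the probability measure in image coordinate space to the annotation map, computes the transport plan from Bayesian probability, and derives a corresponding loss function for standard network optimization.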

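Details Motivation: Conventional optimal transport methods must compute the transport plan iteratively during training, which is inefficient, and they struggle to model the correspondence between pixels and annotations. Method: Proposes a Gaussian splatting-based method for estimating pixel-annotation correspondence, derives the transport plan from Bayesian probability, and constructs a post-transport discrepancy loss that can be embedded in deep learning frameworks. Result: Validated on tasks such as crowd counting and landmark detection; compared with conventional optimal transport schemes, it significantly improves training efficiency and requires no iterative transport plan computation. Conclusion: GST provides an efficient, differentiable transport mechanism that integrates into mainstream vision tasks and substantially improves computational efficiency while maintaining performance. Abstract: This paper introduces Gaussian Spatial Transport (GST), a novel framework that leverages Gaussian splatting to facilitate transport from the probability measure in the image coordinate space to the annotation map. We propose a Gaussian splatting-based method to estimate pixel-annotation correspondence, which is then used to compute a transport plan derived from Bayesian probability. To integrate the resulting transport plan into standard network optimization in typical computer vision tasks, we derive a loss function that measures discrepancy after transport. Extensive experiments on representative computer vision tasks, including crowd counting and landmark detection, validate the effectiveness of our approach. Compared to conventional optimal transport schemes, GST eliminates iterative transport plan computation during training, significantly improving efficiency. Code is available at https://github.com/infinite0522/GST.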
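
The idea of a closed-form, non-iterative transport plan can be sketched as follows. This is only one possible reading of "a transport plan derived from Bayesian probability", not the GST method itself: it assumes one isotropic Gaussian per annotation with equal priors, a shared `sigma`, unit mass per annotation, and an L1 discrepancy, all of which are hypothetical choices for illustration.

```python
import math


def posterior(pixels, points, sigma=1.0):
    """P(annotation j | pixel i) under isotropic Gaussians with equal priors."""
    rows = []
    for px, py in pixels:
        lik = [math.exp(-((px - ax) ** 2 + (py - ay) ** 2) / (2 * sigma ** 2))
               for ax, ay in points]
        z = sum(lik)
        rows.append([v / z for v in lik])
    return rows


def transported_mass(density, post):
    """Mass reaching each annotation when pixel i sends density[i] * P(j | i)."""
    return [sum(d * row[j] for d, row in zip(density, post))
            for j in range(len(post[0]))]


def transport_loss(density, post):
    """Assumed discrepancy: each annotation should receive exactly one unit."""
    return sum(abs(m - 1.0) for m in transported_mass(density, post))
```

Because the plan is a single normalization per pixel, no inner optimization loop (for example Sinkhorn iterations) is needed during training, which is the efficiency argument the abstract makes.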

[152] Segmentation-Aware Latent Diffusion for Satellite Image Super-Resolution: Enabling Smallholder Farm Boundary Delineation

Aditi Agarwal,Anjali Jain,Nikita Saxena,Ishan Deshpande,Michal Kazmierski,Abigail Annkah,Nadav Sherman,Karthikeyan Shanmugam,Alok Talekar,Vaibhav Rajan

Main category: cs.CV

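TL;DR: Proposes SEED-SR, which performs super-resolution in a segmentation-aware latent space to segment farm boundaries at a 20× scale factor, significantly outperforming existing methods.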

Details Motivation: Existing reference-based super-resolution methods improve perceptual quality but neglect features critical to downstream tasks and cannot meet large scale-factor requirements; two-step approaches (super-resolution followed by segmentation) fail to make effective use of multi-source satellite data. Method: Combines conditional latent diffusion models with large-scale multi-spectral, multi-source geospatial foundation models to perform super-resolution in a segmentation-aware latent space, bypassing explicit super-resolution in pixel space. Result: Experiments on two large real-world datasets show relative improvements of up to 25.5% and 12.9% in instance and semantic segmentation metrics, respectively, over approaches based on state-of-the-art Ref-SR methods. Conclusion: SEED-SR makes effective use of high- and low-resolution satellite images, supports frequent seasonal monitoring of smallholder farms, and achieves high-accuracy segmentation at large scale factors. Abstract: Delineating farm boundaries through segmentation of satellite images is a fundamental step in many agricultural applications. The task is particularly challenging for smallholder farms, where accurate delineation requires the use of high resolution (HR) imagery, which is available only at low revisit frequencies (e.g., annually). To support more frequent (sub-) seasonal monitoring, HR images could be combined as references (ref) with low resolution (LR) images -- having higher revisit frequency (e.g., weekly) -- using reference-based super-resolution (Ref-SR) methods. However, current Ref-SR methods optimize perceptual quality and smooth over crucial features needed for downstream tasks, and are unable to meet the large scale-factor requirements for this task. Further, previous two-step approaches of SR followed by segmentation do not effectively utilize diverse satellite sources as inputs. We address these problems through a new approach, $\textbf{SEED-SR}$, which uses a combination of conditional latent diffusion models and large-scale multi-spectral, multi-source geo-spatial foundation models. Our key innovation is to bypass the explicit SR task in the pixel space and instead perform SR in a segmentation-aware latent space. This unique approach enables us to generate segmentation maps at an unprecedented 20$\times$ scale factor, and rigorous experiments on two large, real datasets demonstrate up to $\textbf{25.5}$ and $\textbf{12.9}$ relative improvement in instance and semantic segmentation metrics respectively over approaches based on state-of-the-art Ref-SR methods.

[153] Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM

Jack Qin,Zhitao Wang,Yinan Zheng,Keyu Chen,Yang Zhou,Yuanxin Zhong,Siyuan Cheng

Main category: cs.CV

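TL;DR: Proposes Risk Semantic Distillation (RSD), a framework that uses vision-language models (VLMs) to enhance the training of end-to-end autonomous driving systems; its RiskHead module distills causal risk estimates into bird's-eye-view (BEV) features and produces interpretable risk-attention maps, improving generalization, perception, and planning in complex dynamic environments.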

Details Motivation: Current autonomous driving systems generalize poorly to unseen scenarios or unfamiliar sensor configurations; existing VLM-based methods are promising but lead either to a fragmented system or to excessive computational cost. Method: Proposes the RSD framework and the RiskHead module, which extract risk semantics from a VLM and distill them into BEV features to strengthen the risk-attention representations of an end-to-end driving backbone. Result: RSD is validated on the Bench2Drive benchmark, where it significantly improves perception and planning and handles spatial boundaries and high-risk objects better. Conclusion: Through risk-attention distillation, RSD improves the generalization and safety of autonomous driving systems, is interpretable, and aligns more closely with human driving behavior. Abstract: The autonomous driving (AD) system has exhibited remarkable performance in complex driving scenarios. However, generalization, which refers to the ability to handle unseen scenarios or unfamiliar sensor configurations, is still a key limitation for the current system. Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions from VLM directly. However, these end-to-end solutions demonstrate prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps. This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects. By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential to navigate in complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.

[154] Parameter Aware Mamba Model for Multi-task Dense Prediction

Xinzhuo Yu,Yunzhi Zhuge,Sitong Gong,Lu Zhang,Pingping Zhang,Huchuan Lu

Main category: cs.CV

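TL;DR: Proposes PAMM, a decoder framework based on state space models for multi-task dense prediction, which improves cross-task interaction and 2D perception through task-specific parameter priors and multi-directional Hilbert scanning.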

Details Motivation: Existing methods rely mainly on convolutions and attention mechanisms to model task relationships and struggle to capture the complex dependencies among tasks, so a more efficient and scalable modeling approach is needed. Method: Designs the PAMM framework, which uses the scalable parameters of state space models (S4) to model task relationships, introduces dual state space parameter experts to set task-specific priors, and applies Multi-Directional Hilbert Scanning to build multi-angle feature sequences that strengthen perception of 2D data. Result: Experiments on NYUD-v2 and PASCAL-Context show that PAMM outperforms existing methods and significantly improves multi-task dense prediction. Conclusion: PAMM models multi-task relationships effectively through parameterized state space models, and Hilbert scanning improves the sequence model's understanding of image structure, offering a new direction for multi-task learning. Abstract: Understanding the inter-relations and interactions between tasks is crucial for multi-task dense prediction. Existing methods predominantly utilize convolutional layers and attention mechanisms to explore task-level interactions. In this work, we introduce a novel decoder-based framework, Parameter Aware Mamba Model (PAMM), specifically designed for dense prediction in the multi-task learning setting. Distinct from approaches that employ Transformers to model holistic task relationships, PAMM leverages the rich, scalable parameters of state space models to enhance task interconnectivity. It features dual state space parameter experts that integrate and set task-specific parameter priors, capturing the intrinsic properties of each task. This approach not only facilitates precise multi-task interactions but also allows for the global integration of task priors through the structured state space sequence model (S4). Furthermore, we employ the Multi-Directional Hilbert Scanning method to construct multi-angle feature sequences, thereby enhancing the sequence model's perceptual capabilities for 2D data. Extensive experiments on the NYUD-v2 and PASCAL-Context benchmarks demonstrate the effectiveness of our proposed method. Our code is available at https://github.com/CQC-gogopro/PAMM.
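
For readers unfamiliar with Hilbert scanning, the sketch below flattens a square feature map along a Hilbert curve so that neighbours in the 1D sequence stay neighbours in 2D. The index-to-coordinate routine is the standard Hilbert curve construction; the `transpose` and `reverse` variants are only an assumed way of obtaining several scan directions, since the abstract does not say which directions PAMM uses. Function names are hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the quadrant so sub-curves connect
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan(grid, transpose=False, reverse=False):
    """Flatten a square grid (side a power of two, indexed grid[y][x]) in Hilbert order."""
    side = len(grid)
    order = side.bit_length() - 1
    seq = []
    for d in range(side * side):
        x, y = hilbert_d2xy(order, d)
        if transpose:  # assumed second scan direction
            x, y = y, x
        seq.append(grid[y][x])
    return seq[::-1] if reverse else seq
```

Compared with a row-major raster scan, consecutive elements of the Hilbert sequence are always adjacent cells, which is the locality property that makes such scans attractive for sequence models applied to images.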

[155] D-PerceptCT: Deep Perceptual Enhancement for Low-Dose CT Images

Taifour Yousra Nabila,Azeddine Beghdadi,Marie Luong,Zuheng Ming,Habib Zaidi,Faouzi Alaya Cheikh

Main category: cs.CV

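TL;DR: Proposes D-PerceptCT, a network based on principles of the human visual system for improving low-dose CT image quality; it combines semantic awareness with multi-scale features and introduces a new deep perceptual relevancy loss, preserving critical diagnostic details better than existing methods.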

Details Motivation: Lower radiation dose degrades low-dose CT image quality, and existing methods often over-smooth noise and lose important anatomical and pathological details, so an enhancement method that preserves perceptually critical features is needed. Method: Proposes the D-PerceptCT model, comprising a Visual Dual-path Extractor (ViDex) that fuses DINOv2 semantic priors with local spatial features and a Global-Local State-Space block that captures long-range and multi-scale information; also designs the Deep Perceptual Relevancy Loss Function (DPRLF), based on human contrast sensitivity. Result: Experiments on the Mayo2016 dataset show that D-PerceptCT preserves structural and textural information better than state-of-the-art methods and significantly improves the perceptual quality of low-dose CT images. Conclusion: By incorporating human visual system mechanisms and a semantic-aware structure, D-PerceptCT improves low-dose CT enhancement and helps visualize critical details in clinical diagnosis. Abstract: Low Dose Computed Tomography (LDCT) is widely used as an imaging solution to aid diagnosis and other clinical tasks. However, this comes at the price of a deterioration in image quality due to the low dose of radiation used to reduce the risk of secondary cancer development. While some efficient methods have been proposed to enhance LDCT quality, many overestimate noise and perform excessive smoothing, leading to a loss of critical details. In this paper, we introduce D-PerceptCT, a novel architecture inspired by key principles of the Human Visual System (HVS) to enhance LDCT images. The objective is to guide the model to enhance or preserve perceptually relevant features, thereby providing radiologists with CT images where critical anatomical structures and fine pathological details are perceptually visible. D-PerceptCT consists of two main blocks: (1) a Visual Dual-path Extractor (ViDex), which integrates semantic priors from a pretrained DINOv2 model with local spatial features, allowing the network to incorporate semantic-awareness during enhancement; (2) a Global-Local State-Space block that captures long-range information and multiscale features to preserve the important structures and fine details for diagnosis. In addition, we propose a novel deep perceptual loss, designated as the Deep Perceptual Relevancy Loss Function (DPRLF), which is inspired by human contrast sensitivity, to further emphasize perceptually important features. Extensive experiments on the Mayo2016 dataset demonstrate the effectiveness of the D-PerceptCT method for LDCT enhancement, showing better preservation of structural and textural information within LDCT images compared to SOTA methods.

[156] A Generative Data Framework with Authentic Supervision for Underwater Image Restoration and Enhancement

Yufeng Tian,Yifan Chen,Zhe Sun,Libang Chen,Mingyu Dou,Jijun Lu,Ye Zheng,Xuelong Li

Main category: cs.CV

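TL;DR: Proposes a generative data framework based on unpaired image-to-image translation that uses in-air natural images as clear references to synthesize a high-quality paired dataset covering 6 typical underwater degradation types, improving the color restoration and generalization of underwater image restoration and enhancement models.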

Details Motivation: Existing underwater image restoration methods are limited by the lack of high-quality paired datasets, and manually selected reference images lack authenticity and global consistency, producing inaccurate supervision signals that hurt restoration quality and generalization. Method: Builds a generative data framework based on unpaired image-to-image translation that converts in-air natural images into images with six typical underwater degradations, producing a large-scale synthetic dataset with precise ground-truth labels for supervising the mapping from degraded images to clear scenes. Result: Experiments across 6 network architectures and 3 independent test sets show that models trained on the synthetic data match or exceed models trained on existing real datasets in color restoration and generalization. Conclusion: The study provides a reliable, scalable data-driven solution that alleviates the scarcity of real paired data in underwater image restoration and improves the performance of related models. Abstract: Underwater image restoration and enhancement are crucial for correcting color distortion and restoring image details, thereby establishing a fundamental basis for subsequent underwater visual tasks. However, current deep learning methodologies in this area are frequently constrained by the scarcity of high-quality paired datasets. Since it is difficult to obtain pristine reference labels in underwater scenes, existing benchmarks often rely on manually selected results from enhancement algorithms, providing debatable reference images that lack globally consistent color and authentic supervision. This limits the model's capabilities in color restoration, image enhancement, and generalization. To overcome this limitation, we propose using in-air natural images as unambiguous reference targets and translating them into underwater-degraded versions, thereby constructing synthetic datasets that provide authentic supervision signals for model learning. Specifically, we establish a generative data framework based on unpaired image-to-image translation, producing a large-scale dataset that covers 6 representative underwater degradation types. The framework constructs synthetic datasets with precise ground-truth labels, which facilitate the learning of an accurate mapping from degraded underwater images to their pristine scene appearances. Extensive quantitative and qualitative experiments across 6 representative network architectures and 3 independent test sets show that models trained on our synthetic data achieve comparable or superior color restoration and generalization performance to those trained on existing benchmarks. This research provides a reliable and scalable data-driven solution for underwater image restoration and enhancement. The generated dataset is publicly available at: https://github.com/yftian2025/SynUIEDatasets.git.

[157] DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation

Xiangchen Yin,Jiahui Yuan,Zhangchi Hu,Wenzhang Sun,Jie Chen,Xiaozhen Qiao,Hao Li,Xiaoyan Sun

Main category: cs.CV

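TL;DR: Proposes DeCo-VAE, a video variational autoencoder that explicitly decouples keyframe, motion, and residual components to obtain a compact latent representation and improve video reconstruction.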

Details Motivation: Existing video VAEs overlook the similarity between frame contents, leading to redundant latent modeling. Method: Decomposes video content into keyframe, motion, and residual components, learns a latent representation for each with a dedicated encoder, reconstructs with a shared 3D decoder, and uses a decoupled adaptation strategy to stabilize training. Result: In extensive quantitative and qualitative experiments, DeCo-VAE outperforms existing methods in video reconstruction. Conclusion: Through explicit decoupling and dedicated encoders, DeCo-VAE reduces redundant modeling and achieves a more compact latent representation and better reconstruction. Abstract: Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose decoupled VAE (DeCo-VAE) to achieve compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into distinct components via explicit decoupling: keyframe, motion and residual, and learn dedicated latent representation for each. To avoid cross-component interference, we design dedicated encoders for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further utilize a decoupled adaptation strategy that freezes partial encoders while training the others sequentially, ensuring stable training and accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.

[158] Learning Compact Latent Space for Representing Neural Signed Distance Functions with High-fidelity Geometry Details

Qiang Bai,Bojian Wu,Xi Yang,Zhizhong Han

Main category: cs.CV

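TL;DR: Proposes a method for representing multiple neural signed distance functions (SDFs) in a common space; by combining generalization-based and overfitting-based learning strategies, it recovers high-fidelity geometric detail with a compact latent representation.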

Details Motivation: Existing neural SDFs handling multiple shapes are limited by insufficient information in the latent space and loss of geometric detail, making it hard to achieve both high-fidelity reconstruction and a compact representation. Method: Combines generalization-based and overfitting-based learning strategies to represent multiple SDFs in a shared latent space, and proposes a new sampling strategy for training queries that improves efficiency and reduces artifacts caused by interference across SDFs. Result: Validated on multiple benchmarks; compared with the latest methods, it shows better representation ability and compactness, with both high-fidelity geometry recovery and efficient training. Conclusion: The method balances detail preservation and latent-space compactness when representing multiple SDFs, offering a better solution for implicit modeling of 3D shape collections. Abstract: Neural signed distance functions (SDFs) have become a vital way to represent 3D shapes or scenes with neural networks. An SDF is an implicit function that can query signed distances at specific coordinates for recovering a 3D surface. Although implicit functions work well on a single shape or scene, they pose obstacles when analyzing multiple SDFs with high-fidelity geometry details, due to the limited information encoded in the latent space for SDFs and the loss of geometry details. To overcome these obstacles, we introduce a method to represent multiple SDFs in a common space, aiming to recover more high-fidelity geometry details with more compact latent representations. Our key idea is to take full advantage of the benefits of generalization-based and overfitting-based learning strategies, which manage to preserve high-fidelity geometry details with compact latent codes. Based on this framework, we also introduce a novel sampling strategy to sample training queries. The sampling can improve the training efficiency and eliminate artifacts caused by the influence of other SDFs. We report numerical and visual evaluations on widely used benchmarks to validate our designs and show advantages over the latest methods in terms of the representative ability and compactness.

[159] Interaction-Aware 4D Gaussian Splatting for Dynamic Hand-Object Interaction Reconstruction

Hao Tian,Chenyangguang Zhang,Rui Liu,Wen Shen,Xiaolin Qin

Main category: cs.CV

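TL;DR: Proposes an interaction-aware method based on dynamic 3D Gaussian Splatting for jointly modeling the geometry and appearance of hand-object interaction scenes without object priors, achieving state-of-the-art reconstruction performance.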

Details Motivation: Jointly modeling the geometry and appearance of hand-object interaction scenes without object priors is challenging, especially given mutual occlusion, edge blur, and complex deformation. Method: Introduces interaction-aware hand-object Gaussians whose parameters are optimized under a piecewise linear hypothesis; incorporates hand information into the object deformation field to build interaction-aware dynamic fields; and designs a progressive optimization strategy with explicit regularization to stabilize reconstruction. Result: Experiments show that the method surpasses existing 3D-GS-based methods in dynamic hand-object interaction reconstruction, with clearer structural representation, more realistic physical interaction, and more coherent lighting. Conclusion: The method addresses the occlusion, deformation, and optimization difficulties of hand-object interaction modeling without object priors and significantly improves the quality and stability of dynamic reconstruction. Abstract: This paper focuses on a challenging setting of simultaneously modeling geometry and appearance of hand-object interaction scenes without any object priors. We follow the trend of dynamic 3D Gaussian Splatting based methods, and address several significant challenges. To model complex hand-object interaction with mutual occlusion and edge blur, we present interaction-aware hand-object Gaussians with newly introduced optimizable parameters aiming to adopt piecewise linear hypothesis for clearer structural representation. Moreover, considering the complementarity and tightness of hand shape and object shape during interaction dynamics, we incorporate hand information into object deformation field, constructing interaction-aware dynamic fields to model flexible motions. To further address difficulties in the optimization process, we propose a progressive strategy that handles dynamic regions and static background step by step. Correspondingly, explicit regularizations are designed to stabilize the hand-object representations for smooth motion transition, physical interaction reality, and coherent lighting. Experiments show that our approach surpasses existing dynamic 3D-GS-based methods and achieves state-of-the-art performance in reconstructing dynamic hand-object interaction.

[160] ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection

Mohammad Romani

Main category: cs.CV

TL;DR: Proposes ForensicFlow, a tri-modal deepfake video detection framework that fuses RGB, texture, and frequency information to improve detection performance.

Details Motivation: Existing single-stream CNNs struggle to capture multi-scale forgery artifacts across the spatial, texture, and frequency domains, which limits detection robustness and generalization. Method: Designs a three-branch architecture: an RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; a texture branch (Swin Transformer-tiny) detects fine-grained blending artifacts; a frequency branch (CNN + SE) identifies periodic spectral noise; attention mechanisms perform temporal pooling and adaptive fusion. Result: Trained on Celeb-DF (v2) with Focal Loss, the model reaches an AUC of 0.9752, an F1-score of 0.9408, and an accuracy of 0.9208, outperforming single-stream baselines. Ablation experiments validate the synergy among branches, and Grad-CAM visualizations show that the model attends to key forensic regions. Conclusion: The multi-modal feature fusion framework handles subtle forgeries effectively and significantly improves the detection capability and robustness for deepfake videos. Abstract: Deepfakes generated by advanced GANs and autoencoders severely threaten information integrity and societal stability. Single-stream CNNs fail to capture multi-scale forgery artifacts across spatial, texture, and frequency domains, limiting robustness and generalization. We introduce ForensicFlow, a tri-modal forensic framework that synergistically fuses RGB, texture, and frequency evidence for video Deepfake detection. The RGB branch (ConvNeXt-tiny) extracts global visual inconsistencies; the texture branch (Swin Transformer-tiny) detects fine-grained blending artifacts; the frequency branch (CNN + SE) identifies periodic spectral noise. Attention-based temporal pooling dynamically prioritizes high-evidence frames, while adaptive attention fusion balances branch contributions. Trained on Celeb-DF (v2) with Focal Loss, ForensicFlow achieves AUC 0.9752, F1-Score 0.9408, and accuracy 0.9208, outperforming single-stream baselines. Ablation validates branch synergy; Grad-CAM confirms forensic focus. This comprehensive feature fusion provides superior resilience against subtle forgeries.
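
Two of the ingredients named above, Focal Loss and attention-based temporal pooling, are standard enough to sketch in a few lines. The code below is a generic illustration, not ForensicFlow's implementation: the `gamma` and `alpha` defaults are common values rather than the paper's settings, and the per-frame scores are assumed to come from some learned scoring layer that is not shown.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    Well-classified examples (p_t near 1) are down-weighted by (1 - p_t)**gamma.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def attention_pool(frame_feats, frame_scores):
    """Softmax the per-frame scores and return the weighted mean feature."""
    m = max(frame_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in frame_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(frame_feats[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, frame_feats))
              for i in range(dim)]
    return pooled, weights
```

With equal scores the pooling reduces to a plain average over frames; as one frame's score grows, the pooled feature approaches that frame's feature, which is how high-evidence frames can dominate a video-level prediction.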

[161] Explaining Digital Pathology Models via Clustering Activations

Adam Bajger,Jan Obdržálek,Vojtěch Kůr,Rudolf Nenutil,Petr Holub,Vít Musil,Tomáš Brázdil

Main category: cs.CV

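TL;DR: Proposes a clustering-based explainability technique for analyzing convolutional neural network models in digital pathology; compared with conventional saliency-map methods, it reveals the model's global behavior and provides more fine-grained information, helping build trust and adoption in clinical use.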

Details Motivation: Existing saliency-map methods (such as occlusion and GradCAM) explain only the prediction for a single sample and give no view of the model's overall behavior, limiting their credibility and use in clinical practice. Method: Clusters model features to analyze the model's global behavior over many samples, and visualizes the clusters to provide more comprehensive, fine-grained explanations. Result: Validated on a prostate cancer detection model, the method clearly shows the model's decision patterns and improves understanding of and trust in its behavior. Conclusion: The clustering-based explainability technique helps explain a model's global behavior and increases clinicians' trust in AI models, supporting their practical use in medical image analysis. Abstract: We present a clustering-based explainability technique for digital pathology models based on convolutional neural networks. Unlike commonly used methods based on saliency maps, such as occlusion, GradCAM, or relevance propagation, which highlight regions that contribute the most to the prediction for a single slide, our method shows the global behaviour of the model under consideration, while also providing more fine-grained information. The resulting clusters can be visualised not only to understand the model, but also to increase confidence in its operation, leading to faster adoption in clinical practice. We also evaluate the performance of our technique on an existing model for detecting prostate cancer, demonstrating its usefulness.

[162] OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

Keda Tao,Kele Shao,Bohan Yu,Weiqiang Wang,Jian liu,Huan Wang

Main category: cs.CV

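TL;DR: OmniZip is a training-free, audio-guided framework for joint audio-visual token compression that preserves key information through dynamic pruning and cross-modal similarity, achieving a 3.42× inference speedup and a 1.4× memory reduction while maintaining performance.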

Details Motivation: Existing token compression methods do not handle joint audio-video compression well, and inference in omnimodal large models faces a computational bottleneck. Method: Uses the importance of audio tokens to compute audio retention scores that guide dynamic pruning of video tokens, and compresses video tokens with an interleaved spatio-temporal scheme. Result: Achieves a 3.42× inference speedup and a 1.4× reduction in memory use while maintaining model performance across multiple tasks. Conclusion: OmniZip provides an efficient, training-free joint token compression scheme for omnimodal large models that significantly improves inference efficiency. Abstract: Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding, but processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need of jointly compressing multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves 3.42X inference speedup and 1.4X memory reduction over other top-performing counterparts, while maintaining performance with no training.
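
The audio-guided pruning step can be pictured with the toy sketch below. It is a loose illustration of the idea in the abstract (denser audio in a time group means more video tokens are kept in that group), not OmniZip's algorithm: the saliency threshold, the linear mapping from retention score to keep ratio, the `min_keep`/`max_keep` bounds, and the per-token video scores are all hypothetical.

```python
def audio_retention_scores(audio_saliency_groups, threshold):
    """Per time group: fraction of audio tokens whose saliency exceeds threshold."""
    return [sum(s > threshold for s in group) / len(group)
            for group in audio_saliency_groups]


def prune_video_tokens(video_score_groups, retention, min_keep=0.2, max_keep=0.8):
    """Keep more video tokens in time groups where the audio is information-dense.

    Returns, for each group, the sorted indices of the video tokens that survive.
    """
    kept = []
    for scores, r in zip(video_score_groups, retention):
        ratio = min_keep + (max_keep - min_keep) * r  # assumed linear schedule
        k = max(1, round(ratio * len(scores)))
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        kept.append(sorted(top))  # restore temporal/spatial order
    return kept
```

For example, a group whose audio is mostly salient keeps three of four video tokens under these settings, while a group with silent or uninformative audio keeps only its single highest-scoring token.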

[163] Deep Learning-Based Regional White Matter Hyperintensity Mapping as a Robust Biomarker for Alzheimer's Disease

Julia Machnio,Mads Nielsen,Mostafa Mehdipour Ghazi

Main category: cs.CV

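TL;DR: Proposes a deep learning framework for segmenting and localizing white matter hyperintensities (WMH) and shows that regional WMH load combined with brain atrophy measures significantly improves diagnostic performance for Alzheimer's disease.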

Details Motivation: Most existing automated WMH segmentation methods focus on global lesion load and ignore its spatial distribution across white matter regions, which limits their diagnostic value in cognitive aging and Alzheimer's disease. Method: Develops a deep learning framework for WMH segmentation and localization, evaluates it on several public datasets and the ADNI cohort, quantifies WMH load in each anatomical region, and combines it with brain structure volumes for diagnostic classification. Result: Predicted WMH loads agree with reference values and are robust; regional WMH load outperforms global load for disease classification, adding brain atrophy measures raises AUC to as much as 0.97, and regions in anterior white matter tracts are significantly associated with diagnostic status. Conclusion: Regional WMH quantification has clear clinical value, and combining localized lesion and atrophy markers can improve early diagnosis and stratification of neurodegenerative diseases. Abstract: White matter hyperintensities (WMH) are key imaging markers in cognitive aging, Alzheimer's disease (AD), and related dementias. Although automated methods for WMH segmentation have advanced, most provide only global lesion load and overlook their spatial distribution across distinct white matter regions. We propose a deep learning framework for robust WMH segmentation and localization, evaluated across public datasets and an independent Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Our results show that the predicted lesion loads are in line with the reference WMH estimates, confirming the robustness to variations in lesion load, acquisition, and demographics. Beyond accurate segmentation, we quantify WMH load within anatomically defined regions and combine these measures with brain structure volumes to assess diagnostic value. Regional WMH volumes consistently outperform global lesion burden for disease classification, and integration with brain atrophy metrics further improves performance, reaching area under the curve (AUC) values up to 0.97. Several spatially distinct regions, particularly within anterior white matter tracts, are reproducibly associated with diagnostic status, indicating localized vulnerability in AD. These results highlight the added value of regional WMH quantification. Incorporating localized lesion metrics alongside atrophy markers may enhance early diagnosis and stratification in neurodegenerative disorders.
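
The difference between global and regional lesion load comes down to simple bookkeeping once a lesion mask and a region atlas are aligned. The sketch below shows that bookkeeping under stated assumptions: both inputs are flattened to equal-length lists, label 0 means "outside any region", and the voxel volume is supplied by the caller. The function names are hypothetical and nothing here reflects the paper's actual pipeline.

```python
from collections import defaultdict


def regional_wmh_volumes(lesion_mask, region_labels, voxel_volume_mm3=1.0):
    """Sum lesion volume per atlas region (label 0 is treated as background)."""
    volumes = defaultdict(float)
    for lesion, region in zip(lesion_mask, region_labels):
        if lesion and region != 0:
            volumes[region] += voxel_volume_mm3
    return dict(volumes)


def global_wmh_volume(lesion_mask, voxel_volume_mm3=1.0):
    """Total lesion volume, ignoring where the lesions are."""
    return sum(1 for lesion in lesion_mask if lesion) * voxel_volume_mm3
```

Two scans with the same global volume can have very different regional profiles, which is why the per-region values can carry diagnostic information that the single global number cannot.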

[164] CCSD: Cross-Modal Compositional Self-Distillation for Robust Brain Tumor Segmentation with Missing Modalities

Dongqing Xie,Yonghuang Wu,Zisheng Ai,Jun Min,Zhencun Jiang,Shaojin Geng,Lei Wang

Main category: cs.CV

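TL;DR: Proposes Cross-Modal Compositional Self-Distillation (CCSD), a framework for robust brain tumor segmentation when some multi-modal MRI sequences are missing.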

Details Motivation: Existing deep learning models degrade severely when modalities are missing, as is common in clinical practice, and generalize poorly. Method: Uses a shared-specific encoder-decoder architecture with hierarchical modality self-distillation and progressive modality combination distillation, simulating modality dropout during training to improve robustness. Result: Experiments on public datasets show that CCSD achieves state-of-the-art performance across a range of missing-modality scenarios, with good generalization and stability. Conclusion: CCSD handles missing modalities in multi-modal medical images effectively and significantly improves the practicality and robustness of segmentation models. Abstract: The accurate segmentation of brain tumors from multi-modal MRI is critical for clinical diagnosis and treatment planning. While integrating complementary information from various MRI sequences is a common practice, the frequent absence of one or more modalities in real-world clinical settings poses a significant challenge, severely compromising the performance and generalizability of deep learning-based segmentation models. To address this challenge, we propose a novel Cross-Modal Compositional Self-Distillation (CCSD) framework that can flexibly handle arbitrary combinations of input modalities. CCSD adopts a shared-specific encoder-decoder architecture and incorporates two self-distillation strategies: (i) a hierarchical modality self-distillation mechanism that transfers knowledge across modality hierarchies to reduce semantic discrepancies, and (ii) a progressive modality combination distillation approach that enhances robustness to missing modalities by simulating gradual modality dropout during training. Extensive experiments on public brain tumor segmentation benchmarks demonstrate that CCSD achieves state-of-the-art performance across various missing-modality scenarios, with strong generalization and stability.
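
"Simulating gradual modality dropout" can be illustrated with a small sampler that removes one modality at a time and pairs each richer input set with the next poorer one. This is a sketch of the general idea only: the modality names, the uniform random choice of which modality to drop, and the teacher/student pairing are assumptions for illustration and are not taken from the CCSD paper.

```python
import random


def progressive_dropout(modalities, rng):
    """Return a chain of modality subsets, dropping one random modality per step."""
    current = list(modalities)
    chain = [tuple(current)]
    while len(current) > 1:
        current.remove(rng.choice(current))
        chain.append(tuple(current))
    return chain


def distillation_pairs(chain):
    """Pair each subset (assumed teacher input) with the next smaller one (student)."""
    return list(zip(chain, chain[1:]))


rng = random.Random(0)  # seeded for reproducibility
chain = progressive_dropout(["T1", "T1ce", "T2", "FLAIR"], rng)
pairs = distillation_pairs(chain)
```

In a self-distillation setup of this kind, the same network would produce predictions for both members of a pair, and the prediction from the richer input would supervise the prediction from the poorer one.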

[165] MRI Embeddings Complement Clinical Predictors for Cognitive Decline Modeling in Alzheimer's Disease Cohorts

Nathaniel Putera,Daniel Vilet Rodríguez,Noah Videcrantz,Julia Machnio,Mostafa Mehdipour Ghazi

Main category: cs.CV

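TL;DR: The study compares tabular data and MRI-based Transformer embeddings for predicting cognitive decline in Alzheimer's disease. Clinical features are better at identifying high-risk individuals, whereas ViT-derived MRI embeddings are more sensitive for identifying cognitively stable individuals, suggesting that multimodal fusion could improve disease progression modeling.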

Details Motivation: Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management, but existing tabular predictors struggle to capture subtle brain changes, so more effective imaging representations are needed. Method: A trajectory-aware labeling strategy based on Dynamic Time Warping clustering is introduced; a 3D Vision Transformer is pretrained by unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings, which are compared against tabular features and convolutional network baselines. Result: Clinical and volumetric features perform best for predicting mild and severe progression, with AUCs of around 0.70; ViT MRI embeddings perform best for distinguishing cognitively stable individuals, with an AUC of 0.71; all approaches perform poorly on the moderate progression group. Conclusion: Clinical features excel at identifying high-risk extremes, while Transformer-based MRI embeddings are more sensitive to subtle markers of stability; multimodal fusion is recommended for future AD progression models. Abstract: Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.
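
The trajectory-aware labeling above relies on Dynamic Time Warping (DTW). As a rough illustration only, the sketch below computes the classic DTW distance between two score trajectories. The function name `dtw_distance` and the absolute-difference cost are assumptions for this example; the abstract does not say which DTW variant or clustering procedure the authors use.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance with an absolute-difference cost.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

A pairwise matrix of such distances could then be passed to any clustering method that accepts precomputed distances, so that trajectories with a similar shape but different timing fall into the same group.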

[166] XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation

Yilin Zhang,Leo D. Westbury,Elaine M. Dennison,Nicholas C. Harvey,Nicholas R. Fuggle,Rahman Attar

Main category: cs.CV

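TL;DR: Proposes XAttn-BMD, a multimodal deep learning framework that fuses hip X-ray images and clinical metadata through cross-attention to predict femoral neck bone mineral density (BMD), outperforming baseline models in both regression performance and clinical screening potential.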

Details Motivation: Low bone mineral density increases fracture risk, and conventional methods struggle to predict BMD precisely, so an automated, accurate model that fuses imaging and clinical data is needed. Method: The XAttn-BMD framework uses a bidirectional cross-attention mechanism to dynamically fuse X-ray images with structured clinical metadata, and a Weighted Smooth L1 loss is designed to improve predictions for clinically significant cases. Result: On data from the Hertfordshire Cohort Study, cross-attention fusion reduces MSE by 16.7% and MAE by 6.03% and increases the $R^2$ score by 16.4% relative to naive feature concatenation; ablation studies confirm that both cross-attention and the customized loss are effective; classification experiments show screening potential at clinically relevant thresholds. Conclusion: Through effective multimodal fusion, XAttn-BMD improves the accuracy of femoral neck BMD prediction and shows promise for early osteoporosis screening. Abstract: Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on the data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the $R^2$ score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model's potential in real-world scenarios.
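
To make the idea of a weighted Smooth L1 loss concrete, here is a minimal sketch. The per-sample `weights` argument and the normalization by the weight sum are assumptions for illustration; the abstract does not describe how the authors derive their weights from the BMD distribution.

```python
def smooth_l1(error, beta=1.0):
    """Standard Smooth L1 (Huber-like) penalty for a single residual."""
    abs_err = abs(error)
    if abs_err < beta:
        return 0.5 * abs_err ** 2 / beta  # quadratic near zero
    return abs_err - 0.5 * beta  # linear for large residuals


def weighted_smooth_l1(preds, targets, weights, beta=1.0):
    """Weighted mean of Smooth L1 penalties.

    `weights` is a hypothetical per-sample importance, e.g. larger for
    rare or clinically significant BMD values.
    """
    total = sum(w * smooth_l1(p - t, beta) for p, t, w in zip(preds, targets, weights))
    return total / sum(weights)
```

With weights of 1.0 for every sample this reduces to the ordinary mean Smooth L1 loss, so the weighting only changes how strongly individual cases pull on the optimum.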

[167] 3D-Guided Scalable Flow Matching for Generating Volumetric Tissue Spatial Transcriptomics from Serial Histology

Mohammad Vali Sanian,Arshia Hemmat,Amirhossein Vahidi,Jonas Maaskola,Jimmy Tsz Hang Lee,Stanislaw Makarchuk,Yeliz Demirci,Nana-Jane Chipampe,Omer Bayraktar,Lassi Paavolainen,Mohammad Lotfollahi

Main category: cs.CV

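TL;DR: HoloTea is a 3D-aware flow-matching framework that infers tissue-wide gene expression from H&E-stained images, using morphological information from adjacent sections to improve the accuracy and generalization of spatial transcriptomics predictions.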

Details Motivation: Most existing methods process tissue sections independently and ignore 3D structure, while existing 3D-aware approaches are neither generative nor scalable, which limits understanding of overall tissue organization and disease mechanisms. Method: The HoloTea framework retrieves morphologically similar spots on adjacent sections in a shared feature space and fuses this cross-section context into a lightweight ControlNet; it introduces a 3D-consistent prior for flow matching that combines a zero-inflated negative binomial prior with a spatial-empirical prior, and uses a global attention block to scale to 3D. Result: On three spatial transcriptomics datasets spanning different tissue types and resolutions, HoloTea outperforms 2D and 3D baselines in 3D expression accuracy and generalization. Conclusion: HoloTea can efficiently generate accurate 3D virtual tissues and may advance biomarker discovery and a deeper understanding of disease. Abstract: A scalable and robust 3D tissue transcriptomics profile can enable a holistic understanding of tissue organization and provide deeper insights into human biology and disease. Most predictive algorithms that infer ST directly from histology treat each section independently and ignore 3D structure, while existing 3D-aware approaches are not generative and do not scale well. We present Holographic Tissue Expression Inpainting and Analysis (HoloTea), a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E while explicitly using information from adjacent sections. Our key idea is to retrieve morphologically corresponding spots on neighboring slides in a shared feature space and fuse this cross-section context into a lightweight ControlNet, allowing conditioning to follow anatomical continuity. To better capture the count nature of the data, we introduce a 3D-consistent prior for flow matching that combines a learned zero-inflated negative binomial (ZINB) prior with a spatial-empirical prior constructed from neighboring sections. A global attention block introduces 3D H&E scaling linearly with the number of spots in the slide, enabling training and inference on large 3D ST datasets. Across three spatial transcriptomics datasets spanning different tissue types and resolutions, HoloTea consistently improves 3D expression accuracy and generalization compared to 2D and 3D baselines. We envision HoloTea advancing the creation of accurate 3D virtual tissues, ultimately accelerating biomarker discovery and deepening our understanding of disease.

[168] Fusing Biomechanical and Spatio-Temporal Features for Fall Prediction: Characterizing and Mitigating the Simulation-to-Reality Gap

Md Fokhrul Islam,Sajeda Al-Hammouri,Christopher J. Arellano,Kavan Hazeli,Heman Shakeri

Main category: cs.CV

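TL;DR: Proposes BioST-GCN, a dual-stream graph convolutional network that combines pose and biomechanical information for vision-based fall prediction. It performs well on simulated data but generalizes poorly in the zero-shot setting, exposing a substantial simulation-to-reality gap that calls for personalization strategies and real-world data.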

Details Motivation: Fall data from older adults are scarce, which hinders the development of vision-based fall prediction systems, and biases in simulated data lead to poor performance in real scenarios, so generalization to real environments must be improved. Method: The dual-stream BioST-GCN model combines pose and biomechanical information through cross-attention fusion, uses spatio-temporal attention for interpretability, and is trained and evaluated on the simulated MCF-UA and MUVIM datasets. Result: F1-score improves over the ST-GCN baseline by 5.32% on MCF-UA and 2.91% on MUVIM; with full supervision on simulated data the F1-score reaches 89.0%, but zero-shot generalization to unseen subjects drops to 35.9%. Conclusion: Simulated data carry biases such as 'intent-to-fall' cues, which make it hard for models to generalize to real older adults, especially those with diabetes or frailty; personalization strategies and privacy-preserving data pipelines are needed to enable real-world validation and narrow the simulation-to-reality gap. Abstract: Falls are a leading cause of injury and loss of independence among older adults. Vision-based fall prediction systems offer a non-invasive solution to anticipate falls seconds before impact, but their development is hindered by the scarcity of available fall data. Contributing to these efforts, this study proposes the Biomechanical Spatio-Temporal Graph Convolutional Network (BioST-GCN), a dual-stream model that combines both pose and biomechanical information using a cross-attention fusion mechanism. Our model outperforms the vanilla ST-GCN baseline by 5.32% and 2.91% F1-score on the simulated MCF-UA stunt-actor and MUVIM datasets, respectively. The spatio-temporal attention mechanisms in the ST-GCN stream also provide interpretability by identifying critical joints and temporal phases. However, a critical simulation-reality gap persists. While our model achieves an 89.0% F1-score with full supervision on simulated data, zero-shot generalization to unseen subjects drops to 35.9%. This performance decline is likely due to biases in simulated data, such as 'intent-to-fall' cues. For older adults, particularly those with diabetes or frailty, this gap is exacerbated by their unique kinematic profiles. To address this, we propose personalization strategies and advocate for privacy-preserving data pipelines to enable real-world validation. Our findings underscore the urgent need to bridge the gap between simulated and real-world data to develop effective fall prediction systems for vulnerable elderly populations.

[169] SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction

Meiying Gu,Jiawei Zhang,Jiahe Li,Xiaohan Yu,Haonan Luo,Jin Zheng,Xiao Bai

Main category: cs.CV

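TL;DR: The paper proposes SparseSurf, a new Gaussian Splatting optimization method that introduces Stereo Geometry-Texture Alignment and Pseudo-Feature Enhanced Geometry Consistency to counter overfitting under sparse views, achieving state-of-the-art surface reconstruction and novel view synthesis on several datasets.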

Details Motivation: With sparse input views, existing Gaussian Splatting optimization methods tend to overfit, and flattened Gaussian primitives make the problem worse, degrading reconstruction quality; a method that improves both geometric reconstruction and rendering quality is needed. Method: Stereo Geometry-Texture Alignment jointly optimizes rendering quality and geometry estimation, and Pseudo-Feature Enhanced Geometry Consistency uses both training and unseen views to enforce multi-view geometric consistency and mitigate overfitting under sparse supervision. Result: Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets show state-of-the-art surface reconstruction accuracy and novel view synthesis quality. Conclusion: SparseSurf balances geometric reconstruction and rendering quality, outperforms existing methods under sparse-view conditions, and advances the use of Gaussian Splatting in 3D reconstruction. Abstract: Recent advances in optimizing Gaussian Splatting for scene geometry have enabled efficient reconstruction of detailed surfaces from images. However, when input views are sparse, such optimization is prone to overfitting, leading to suboptimal reconstruction quality. Existing approaches address this challenge by employing flattened Gaussian primitives to better fit surface geometry, combined with depth regularization to alleviate geometric ambiguities under limited viewpoints. Nevertheless, the increased anisotropy inherent in flattened Gaussians exacerbates overfitting in sparse-view scenarios, hindering accurate surface fitting and degrading novel view synthesis performance. In this paper, we propose SparseSurf, a method that reconstructs more accurate and detailed surfaces while preserving high-quality novel view rendering. Our key insight is to introduce Stereo Geometry-Texture Alignment, which bridges rendering quality and geometry estimation, thereby jointly enhancing both surface reconstruction and view synthesis. In addition, we present a Pseudo-Feature Enhanced Geometry Consistency that enforces multi-view geometric consistency by incorporating both training and unseen views, effectively mitigating overfitting caused by sparse supervision. Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets demonstrate that our method achieves state-of-the-art performance.

[170] SLAM-AGS: Slide-Label Aware Multi-Task Pretraining Using Adaptive Gradient Surgery in Computational Cytology

Marco Acerbis,Swarnadip Chatterjee,Christophe Avenel,Joakim Lindblad

Main category: cs.CV

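TL;DR: Proposes SLAM-AGS, a multi-task pretraining framework for computational cytology that jointly optimizes a weakly supervised similarity objective and a self-supervised contrastive objective, stabilizes learning with Adaptive Gradient Surgery, and improves downstream performance, especially at low witness rates.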

Details Motivation: In computational cytology, instance-level labels are unreliable and costly to obtain, and witness rates are extremely low, so conventional methods struggle to train models effectively. Method: SLAM-AGS performs multi-task pretraining with a weakly supervised similarity objective on slide-negative patches and a self-supervised contrastive objective on slide-positive patches, applies Adaptive Gradient Surgery to resolve gradient conflicts, and integrates the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and retrieval of abnormal instances. Result: On a public bone-marrow cytology dataset with simulated witness rates from 10% down to 0.5%, SLAM-AGS outperforms other pretraining methods in bag-level F1-score and Top 400 positive cell retrieval, with the largest gains at low witness rates. Conclusion: By resolving gradient interference in multi-task learning, SLAM-AGS achieves stable pretraining and better downstream performance at low witness rates; the code is open source to support reproducibility. Abstract: Computational cytology faces two major challenges: i) instance-level labels are unreliable and prohibitively costly to obtain, ii) witness rates are extremely low. We propose SLAM-AGS, a Slide-Label-Aware Multitask pretraining framework that jointly optimizes (i) a weakly supervised similarity objective on slide-negative patches and (ii) a self-supervised contrastive objective on slide-positive patches, yielding stronger performance on downstream tasks. To stabilize learning, we apply Adaptive Gradient Surgery to tackle conflicting task gradients and prevent model collapse. We integrate the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a publicly available bone-marrow cytology dataset, with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-Score and Top 400 positive cell retrieval over other pretraining methods, with the largest gains at low witness rates, showing that resolving gradient interference enables stable pretraining and better performance on downstream tasks. To facilitate reproducibility, we share our complete implementation and evaluation framework as open source: https://github.com/Ace95/SLAM-AGS.
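
The general idea behind gradient surgery can be sketched as follows. Note that this is the well-known PCGrad-style projection, shown only as an illustration of resolving conflicting task gradients; the abstract does not specify how the adaptive variant in SLAM-AGS differs, and the function names here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(u, v))


def project_conflict(g_a, g_b):
    """Remove from g_a the component that opposes g_b (PCGrad-style).

    If the two task gradients do not conflict (non-negative inner product),
    g_a is returned unchanged.
    """
    inner = dot(g_a, g_b)
    if inner >= 0:
        return list(g_a)
    scale = inner / dot(g_b, g_b)
    # Subtract the projection of g_a onto g_b, leaving g_a orthogonal to g_b
    return [x - scale * y for x, y in zip(g_a, g_b)]
```

For example, with `g_a = [1.0, -1.0]` and `g_b = [0.0, 1.0]` the second coordinate of `g_a` points against `g_b`, so the projection removes it and leaves `[1.0, 0.0]`.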

[171] RepAir: A Framework for Airway Segmentation and Discontinuity Correction in CT

John M. Oyer,Ali Namvar,Benjamin A. Hoff,Wassim W. Labaki,Ella A. Kazerooni,Charles R. Hatt,Fernando J. Martinez,MeiLan K. Han,Craig J. Galbán,Sundaresh Ram

Main category: cs.CV

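TL;DR: Proposes RepAir, a three-stage framework that combines nnU-Net with anatomically informed topology correction for robust 3D airway segmentation from chest CT scans, improving segmentation completeness and topological accuracy.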

Details Motivation: Accurate airway segmentation is essential for quantitative lung analysis, but manual annotation is impractical and existing automated U-Net methods often produce disconnected structures that hinder biomarker extraction. Method: RepAir has three stages: an nnU-Net produces an initial airway mask, a skeleton-based algorithm detects discontinuities and proposes reconnections, and a 1D convolutional classifier decides whether each candidate link is a true anatomical branch. Result: On the ATM'22 and AeroPath datasets, RepAir outperforms existing methods such as Bronchinet and NaviAirway on both voxel-level and topological metrics, producing more complete and anatomically consistent airway trees while maintaining high segmentation accuracy. Conclusion: By combining deep learning with anatomy-aware topology correction, RepAir improves the robustness and clinical usability of airway segmentation in complex CT images. Abstract: Accurate airway segmentation from chest computed tomography (CT) scans is essential for quantitative lung analysis, yet manual annotation is impractical and many automated U-Net-based methods yield disconnected components that hinder reliable biomarker extraction. We present RepAir, a three-stage framework for robust 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. The segmentation network produces an initial airway mask, after which a skeleton-based algorithm identifies potential discontinuities and proposes reconnections. A 1D convolutional classifier then determines which candidate links correspond to true anatomical branches versus false or obstructed paths. We evaluate RepAir on two distinct datasets: ATM'22, comprising annotated CT scans from predominantly healthy subjects, and AeroPath, encompassing annotated scans with severe airway pathology. Across both datasets, RepAir outperforms existing 3D U-Net-based approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics, and produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.

[172] Improving segmentation of retinal arteries and veins using cardiac signal in doppler holograms

Marius Dubosc,Yann Fischer,Zacharie Auray,Nicolas Boutry,Edwin Carlinet,Michael Atlan,Thierry Geraud

Main category: cs.CV

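TL;DR: Proposes a simple, effective method based on time-resolved preprocessing that lets standard segmentation networks perform artery-vein segmentation in Doppler holograms with performance comparable to more complex models.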

Details Motivation: Traditional segmentation methods use only spatial information and ignore the rich temporal dynamics in Doppler holography data, making it hard to distinguish retinal arteries from veins accurately. Method: A dedicated pulse analysis pipeline extracts temporal features, which are fed to a standard U-Net segmentation network so that it can exploit the temporal information in dynamic holographic data. Result: The method matches the artery-vein segmentation performance of more complex attention- or iteration-based models, supporting the value of temporal preprocessing for deep learning. Conclusion: Time-resolved preprocessing can unlock the potential of deep learning for Doppler holography and opens new avenues for the quantitative study of retinal hemodynamics. Abstract: Doppler holography is an emerging retinal imaging technique that captures the dynamic behavior of blood flow with high temporal resolution, enabling quantitative assessment of retinal hemodynamics. This requires accurate segmentation of retinal arteries and veins, but traditional segmentation methods focus solely on spatial information and overlook the temporal richness of holographic data. In this work, we propose a simple yet effective approach for artery-vein segmentation in temporal Doppler holograms using standard segmentation architectures. By incorporating features derived from a dedicated pulse analysis pipeline, our method allows conventional U-Nets to exploit temporal dynamics and achieve performance comparable to more complex attention- or iteration-based models. These findings demonstrate that time-resolved preprocessing can unlock the full potential of deep learning for Doppler holography, opening new perspectives for quantitative exploration of retinal hemodynamics. The dataset is publicly available at https://huggingface.co/datasets/DigitalHolography/

[173] Impact of Image Resolution on Age Estimation with DeepFace and InsightFace

Shiyar Jamo

Main category: cs.CV

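TL;DR: The study examines how image resolution affects the age estimation accuracy of DeepFace and InsightFace. Performance is best at 224x224 pixels, both low and very high resolutions reduce accuracy, and InsightFace is faster.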

Details Motivation: Input images vary widely in resolution, which affects age estimation accuracy, so the impact of resolution on mainstream models needs to be evaluated. Method: 1000 images from the IMDB-Clean dataset are processed at seven resolutions, yielding 7000 samples, and DeepFace and InsightFace are evaluated with MAE, SD, and MedAE. Result: Both models perform best at 224x224 (DeepFace MAE = 10.83 years, InsightFace MAE = 7.46 years); low resolutions increase error substantially, very high resolutions also reduce accuracy, and InsightFace is consistently faster than DeepFace. Conclusion: Input resolution has a clear effect on age estimation accuracy and an optimal resolution exists; resolutions that are too high or too low both hurt, so input resolution should be controlled in practice. Abstract: Automatic age estimation is widely used for age verification, where input images often vary considerably in resolution. This study evaluates the effect of image resolution on age estimation accuracy using DeepFace and InsightFace. A total of 1000 images from the IMDB-Clean dataset were processed in seven resolutions, resulting in 7000 test samples. Performance was evaluated using Mean Absolute Error (MAE), Standard Deviation (SD), and Median Absolute Error (MedAE). Based on this study, we conclude that input image resolution has a clear and consistent impact on the accuracy of age estimation in both DeepFace and InsightFace. Both frameworks achieve optimal performance at 224x224 pixels, with an MAE of 10.83 years (DeepFace) and 7.46 years (InsightFace). At low resolutions, MAE increases substantially, while very high resolutions also degrade accuracy. InsightFace is consistently faster than DeepFace across all resolutions.
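
The three reported metrics are simple to compute from predicted and true ages. The sketch below is an assumption-laden illustration: the function name is hypothetical, and SD is taken here to mean the population standard deviation of the signed errors, which the abstract does not confirm.

```python
import statistics


def age_error_metrics(predicted, actual):
    """Return MAE, SD, and MedAE for paired age predictions (in years).

    Assumption: SD is the population standard deviation of signed errors.
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    abs_errors = [abs(e) for e in errors]
    return {
        "mae": statistics.fmean(abs_errors),      # mean absolute error
        "sd": statistics.pstdev(errors),          # spread of signed errors
        "medae": statistics.median(abs_errors),   # median absolute error
    }
```

Running this once per resolution on the same set of images would give a resolution-versus-error table of the kind the study describes. MedAE is less sensitive than MAE to a few badly misjudged faces, which is why reporting both is informative.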

[174] HyMAD: A Hybrid Multi-Activity Detection Approach for Border Surveillance and Monitoring

Sriram Srinivasan,Srinivasan Aruchamy,Siva Ram Krisha Vadali

Main category: cs.CV

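TL;DR: Proposes HyMAD, a deep neural architecture based on spatio-temporal feature fusion for multi-activity detection in seismic sensing, able to identify simultaneous human, animal, and vehicle activity in complex, noisy environments.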

Details Motivation: Accurately detecting and distinguishing overlapping simultaneous activities (such as human intrusions, animal movements, and vehicle rumbling) is a major challenge for seismic sensing in border surveillance; conventional methods are sensitive to noise and struggle to separate multi-source signals. Method: The HyMAD framework combines spectral features extracted with SincNet and temporal dependencies modeled by an RNN, adds self-attention to strengthen intra-modal representations, and uses a cross-modal fusion module for robust multi-label classification. Result: On a dataset of real-world field seismic recordings, the method generalizes to complex concurrent scenarios involving humans, animals, and vehicles and achieves competitive performance. Conclusion: HyMAD offers an effective, modular solution for multi-activity detection in seismic signals that can be extended to real-world security monitoring systems. Abstract: Seismic sensing has emerged as a promising solution for border surveillance and monitoring; the seismic sensors that are often buried underground are small and cannot be noticed easily, making them difficult for intruders to detect, avoid, or vandalize. This significantly enhances their effectiveness compared to highly visible cameras or fences. However, accurately detecting and distinguishing between overlapping activities that are happening simultaneously, such as human intrusions, animal movements, and vehicle rumbling, remains a major challenge due to the complex and noisy nature of seismic signals. Correctly identifying simultaneous activities is critical because failing to separate them can lead to misclassification, missed detections, and an incomplete understanding of the situation, thereby reducing the reliability of surveillance systems. To tackle this problem, we propose HyMAD (Hybrid Multi-Activity Detection), a deep neural architecture based on spatio-temporal feature fusion. The framework integrates spectral features extracted with SincNet and temporal dependencies modeled by a recurrent neural network (RNN). In addition, HyMAD employs self-attention layers to strengthen intra-modal representations and a cross-modal fusion module to achieve robust multi-label classification of seismic events. We evaluate our approach on a dataset constructed from real-world field recordings collected in the context of border surveillance and monitoring, demonstrating its ability to generalize to complex, simultaneous activity scenarios involving humans, animals, and vehicles. Our method achieves competitive performance and offers a modular framework for extending seismic-based activity recognition in real-world security applications.

[175] Seeing Beyond the Image: ECG and Anatomical Knowledge-Guided Myocardial Scar Segmentation from Late Gadolinium-Enhanced Images

Farheen Ramzan,Yusuf Kiberu,Nikesh Jathanna,Meryem Jabrane,Vicente Grau,Shahnaz Jamil-Copley,Richard H. Clayton,Chen Chen

Main category: cs.CV

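TL;DR: Proposes a multimodal framework that combines electrocardiogram (ECG) signals with LGE-MRI for myocardial scar segmentation and introduces a Temporal Aware Feature Fusion (TAFF) mechanism, yielding substantial gains in segmentation performance.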

Details Motivation: Myocardial scar segmentation from LGE-MRI is limited by variable contrast and imaging artifacts, while ECG provides physiological information that can help localize scar; because the two are not acquired simultaneously, they must be fused carefully. Method: A multimodal framework integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas, and a Temporal Aware Feature Fusion (TAFF) mechanism dynamically weights and fuses features according to the time difference between image and ECG acquisition. Result: On clinical data, compared with the image-only nnU-Net, the average Dice score for scars rises from 0.6149 to 0.8463, with precision of 0.9115 and sensitivity of 0.9043. Conclusion: Integrating physiological and anatomical priors enables more robust, physiologically grounded myocardial scar segmentation and points to a new direction for the field. Abstract: Accurate segmentation of myocardial scar from late gadolinium enhanced (LGE) cardiac MRI is essential for evaluating tissue viability, yet remains challenging due to variable contrast and imaging artifacts. Electrocardiogram (ECG) signals provide complementary physiological information, as conduction abnormalities can help localize or suggest scarred myocardial regions. In this work, we propose a novel multimodal framework that integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas for physiologically consistent LGE-based scar segmentation. As ECGs and LGE-MRIs are not acquired simultaneously, we introduce a Temporal Aware Feature Fusion (TAFF) mechanism that dynamically weights and fuses features based on their acquisition time difference. Our method was evaluated on a clinical dataset and achieved substantial gains over the state-of-the-art image-only baseline (nnU-Net), increasing the average Dice score for scars from 0.6149 to 0.8463 and achieving high performance in both precision (0.9115) and sensitivity (0.9043). These results show that integrating physiological and anatomical knowledge allows the model to "see beyond the image", setting a new direction for robust and physiologically grounded cardiac scar segmentation.
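
For readers unfamiliar with the reported metrics, the following sketch computes Dice, precision, and sensitivity from two flattened binary masks. The function name and the choice to return 0.0 when a denominator is zero are assumptions for this example; conventions for empty masks differ between toolkits.

```python
def overlap_metrics(pred, truth):
    """Dice, precision, and sensitivity for flattened binary masks.

    Returns 0.0 for any metric whose denominator is zero (a convention
    chosen for this sketch, not a universal standard).
    """
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)

    def safe_div(num, den):
        return num / den if den else 0.0

    return {
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "sensitivity": safe_div(tp, tp + fn),
    }
```

Dice is the harmonic mean of precision and sensitivity, so a Dice of 0.8463 alongside precision and sensitivity above 0.90 can still arise when scores are averaged per case rather than pooled over all voxels.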

[176] FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation

Yunfeng Wu,Jiayi Song,Zhenxiong Tan,Zihao He,Songhua Liu

Main category: cs.CV

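TL;DR: Proposes FreeSwim, a training-free method that uses pretrained video Diffusion Transformers to generate ultra-high-resolution videos, preserving visual fidelity and global coherence through an inward sliding window attention mechanism and a dual-path pipeline.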

Details Motivation: The quadratic complexity of attention in Transformers makes end-to-end training on ultra-high-resolution video prohibitively expensive, so a method that scales video resolution without additional training is needed. Method: An inward sliding window attention mechanism keeps each query token's receptive field at its training scale; a dual-path pipeline adds a branch with a full receptive field whose cross-attention override strategy guides the content produced by local attention, and cross-attention caching improves efficiency. Result: Without any training, the method generates ultra-high-resolution videos with fine detail, outperforms some training-based methods on VBench, and has comparable or better efficiency. Conclusion: FreeSwim offers an efficient, training-free approach to high-resolution video generation that balances local detail with global coherence and extends what pretrained models can do after training. Abstract: The quadratic time and memory complexity of the attention mechanism in modern Transformer-based video generators makes end-to-end training for ultra-high-resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher-resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding window attention mechanism, which originates from a key observation: maintaining each query token's training-scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention often leads to repetitive content and a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field and, therefore, ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid the frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Code is available at: https://github.com/WillWu111/FreeSwim
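
One way to picture a fixed-size attention window is the 1D sketch below, which returns the key range a query may attend to. This is only one plausible reading: the abstract does not define "inward", and the border clamping used here (shifting the window inside the sequence so its size never shrinks) is an assumption for illustration, as is the function name.

```python
def window_bounds(query_index, seq_len, window):
    """Return [start, end) of a fixed-size key window for one query.

    Assumption: near the borders the window is shifted inward rather than
    truncated, so every query sees exactly `window` keys.
    """
    if window > seq_len:
        raise ValueError("window must not exceed the sequence length")
    start = query_index - window // 2
    # Clamp so the whole window stays inside [0, seq_len)
    start = max(0, min(start, seq_len - window))
    return start, start + window
```

Setting `window` to the token extent the model saw during training keeps the receptive field per query constant even when `seq_len` grows with the output resolution, which is the property the abstract identifies as important for visual fidelity.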

[177] Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model

Xiyuan Wang,Muhan Zhang

Main category: cs.CV

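TL;DR: Proposes the Diffusion as Self-Distillation (DSD) framework, which for the first time unifies the encoder, decoder, and diffusion network into a single network trained end to end for image generation, avoiding latent collapse and achieving strong FID scores on ImageNet.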

Details Motivation: Standard latent diffusion models use a three-module architecture trained in multiple stages, which is inefficient and prevents unification with vision foundation models; the encoder, decoder, and diffusion network should be merged into a single end-to-end trainable network. Method: By drawing an analogy between diffusion and self-distillation based unsupervised learning, the paper analyzes the causes of latent collapse and proposes the DSD framework, which modifies the training objective to stabilize the latent space and enable joint training of a single network. Result: On ImageNet 256×256 conditional generation, DSD achieves FID = 13.44/6.38/4.25 with 42M/118M/205M parameters, using only 50 epochs and no classifier-free guidance. Conclusion: DSD achieves single-network end-to-end training of latent diffusion models, resolves latent collapse, performs strongly, and moves diffusion models toward more efficient, unified architectures. Abstract: Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free guidance.

[178] Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising

Yifan Wang,Liya Ji,Zhanghan Ke,Harry Yang,Ser-Nam Lim,Qifeng Chen

Main category: cs.CV

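TL;DR: Proposes a zero-shot framework built on a diffusion video foundation model that, conditioned on estimated structure-aware information (such as depth, semantic, and edge maps), re-renders simulator-generated synthetic videos into highly realistic videos without fine-tuning.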

Details Motivation: The aim is to increase the realism of synthetic videos while preserving their multi-level structural consistency in the spatial and temporal domains, overcoming the trade-off between structure preservation and realism in existing methods. Method: A pretrained diffusion video foundation model is used; an auxiliary model estimates structural information from the synthetic video (such as depth, semantic, and edge maps), which conditions the generation/denoising process, giving zero-shot video re-rendering without fine-tuning. Result: Experiments show that the method outperforms existing baselines in structural consistency with the original video while achieving state-of-the-art photorealism. Conclusion: The proposed method is a simple, general, and powerful way to enhance synthetic video realism, preserving spatio-temporal structure and generating realistic videos without fine-tuning. Abstract: We propose an approach to enhancing synthetic video realism, which can re-render synthetic videos from a simulator in photorealistic fashion. Our realism enhancement approach is a zero-shot framework that focuses on preserving the multi-level structures from synthetic videos into the enhanced one in both spatial and temporal domains, built upon a diffusion video foundational model without further fine-tuning. Specifically, we incorporate an effective modification to have the generation/denoising process conditioned on estimated structure-aware information from the synthetic video, such as depth maps, semantic maps, and edge maps, by an auxiliary model, rather than extracting the information from a simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is a simple yet general and powerful approach to enhancing synthetic video realism: we show that our approach outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality in our experiments.

[179] A Neural Field-Based Approach for View Computation & Data Exploration in 3D Urban Environments

Stefan Cobeli,Kazi Shahrukh Omar,Rodrigo Valença,Nivan Ferreira,Fabio Miranda

Main category: cs.CV

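TL;DR: Proposes a view-based approach to exploring 3D urban data that uses neural fields to build an efficient implicit representation of the environment, supporting fast direct and inverse queries, addressing occlusion, and making large-scale urban analysis more efficient.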

Details Motivation: The intricate geometry of 3D urban environments causes heavy occlusion and makes manual viewpoint adjustment inefficient, so existing methods face computational bottlenecks and interaction complexity in large-scale data exploration. Method: A view-based approach encodes views with a vector field, and a neural field-based implicit representation models the 3D environment efficiently, supporting direct queries (such as view assessment indices) and inverse queries (such as finding views that match a desired pattern). Result: Experiments show that the method is effective for visibility analysis, building facade visibility assessment, and evaluation of views from outdoor spaces, that it finds desirable viewpoints quickly, and that it was validated with feedback from domain experts. Conclusion: The method improves the efficiency and usability of 3D urban data exploration and provides a useful tool for visibility, solar exposure, and visual impact analysis in urban planning. Abstract: Despite the growing availability of 3D urban datasets, extracting insights remains challenging due to computational bottlenecks and the complexity of interacting with data. In fact, the intricate geometry of 3D urban environments results in high degrees of occlusion and requires extensive manual viewpoint adjustments that make large-scale exploration inefficient. To address this, we propose a view-based approach for 3D data exploration, where a vector field encodes views from the environment. To support this approach, we introduce a neural field-based method that constructs an efficient implicit representation of 3D environments. This representation enables both faster direct queries, which consist of the computation of view assessment indices, and inverse queries, which help avoid occlusion and facilitate the search for views that match desired data patterns. Our approach supports key urban analysis tasks such as visibility assessments, solar exposure evaluation, and assessing the visual impact of new developments. We validate our method through quantitative experiments, case studies informed by real-world urban challenges, and feedback from domain experts. Results show its effectiveness in finding desirable viewpoints, analyzing building facade visibility, and evaluating views from outdoor spaces. Code and data are publicly available at https://urbantk.org/neural-3d.

[180] Vision Large Language Models Are Good Noise Handlers in Engagement Analysis

Alexander Vedernikov,Puneet Kumar,Haoyu Chen,Tapio Seppänen,Xiaobai Li

Main category: cs.CV

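TL;DR: Proposes a framework that uses Vision Large Language Models (VLMs) to refine subjective and noisy labels in video engagement recognition, combining questionnaire-based behavioral cues with a curriculum learning strategy to improve model performance.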

Details Motivation: Engagement recognition in video datasets suffers from highly subjective and noisy labels that limit model performance, so label quality and training strategy need to improve. Method: VLMs extract behavioral cues through a questionnaire and split the data into high- and low-reliability subsets; training combines curriculum learning with soft label refinement, gradually introducing ambiguous samples while adjusting supervision to reflect uncertainty. Result: The method surpasses prior work on benchmarks such as EngageNet (maximum improvement of +1.21%) and DREAMS / PAFE (F1 gains of +0.22 / +0.06). Conclusion: Refining labels with VLMs and combining this with curriculum learning addresses label subjectivity and noise in engagement recognition and improves classical models. Abstract: Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.

[181] Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Yutian Chen,Yuheng Qiu,Ruogu Li,Ali Agha,Shayegan Omidshafiei,Jay Patrikar,Sebastian Scherer

Main category: cs.CV

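TL;DR: Proposes Co-Me, an acceleration method for visual geometric transformers that requires no retraining or finetuning; confidence-guided token merging significantly increases speed while maintaining performance.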

Details Motivation: To accelerate inference in visual geometric transformers without sacrificing performance, particularly for long input sequences. Method: Designs a lightweight confidence predictor that ranks tokens by uncertainty and selectively merges low-confidence tokens, reducing computation while maintaining spatial coverage. Result: Achieves speedups of up to 11.3x on VGGT and 7.2x on MapAnything without degrading performance, and applies to both multi-view and streaming visual geometric transformers. Conclusion: Co-Me is a general and efficient token merging strategy that significantly accelerates visual geometric transformers and advances their use in real-time 3D perception and reconstruction. Abstract: We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distills a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
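
The idea of ranking tokens by confidence and merging the low-confidence ones can be sketched as follows. This is a toy illustration, not Co-Me itself: the confidence scores are taken as given rather than predicted, and `keep_ratio`, `group_size`, and the rule of averaging runs of adjacent low-confidence tokens are assumptions made for the example.

```python
from typing import List, Tuple

Token = List[float]


def merge_low_confidence(tokens: List[Token], confidence: List[float],
                         keep_ratio: float = 0.5,
                         group_size: int = 2) -> Tuple[List[Token], List[List[int]]]:
    """Keep the most confident tokens and average adjacent low-confidence ones.

    Returns the shortened token list and, for each output token, the indices
    of the input tokens it came from (useful for un-merging later).
    """
    n = len(tokens)
    if n != len(confidence):
        raise ValueError("tokens and confidence must have the same length")
    n_keep = max(1, int(round(n * keep_ratio)))
    order = sorted(range(n), key=lambda i: confidence[i], reverse=True)
    keep = set(order[:n_keep])

    merged: List[Token] = []
    groups: List[List[int]] = []
    pending: List[int] = []  # run of low-confidence indices awaiting a merge

    def flush() -> None:
        if pending:
            dim = len(tokens[pending[0]])
            avg = [sum(tokens[i][d] for i in pending) / len(pending) for d in range(dim)]
            merged.append(avg)
            groups.append(list(pending))
            pending.clear()

    for i in range(n):
        if i in keep:
            flush()  # never merge across a kept token, preserving order
            merged.append(list(tokens[i]))
            groups.append([i])
        else:
            pending.append(i)
            if len(pending) == group_size:
                flush()
    flush()
    return merged, groups
```

Because merged tokens stay in their original positions in the sequence, every region of the input is still represented, which is one simple way to keep spatial coverage while shortening the sequence.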

[182] UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning

Rui Tian,Mingfei Gao,Haiming Gang,Jiasen Lu,Zhe Gan,Yinfei Yang,Zuxuan Wu,Afshin Dehghan

Main category: cs.CV

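TL;DR: UniGen-1.5 is an enhanced unified multimodal large language model whose improved architecture and training pipeline strengthen image understanding, generation, and editing. It introduces a unified reinforcement learning strategy and a lightweight edit instruction alignment stage, and matches or surpasses existing state-of-the-art models on several benchmarks.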

Details Motivation: To improve the overall performance of multimodal large models on image understanding, generation, and editing, and in particular to achieve strong image editing ability while keeping training and inference efficient. Method: Comprehensively optimizes the architecture and training pipeline of UniGen, proposes a unified reinforcement learning strategy that shares reward models to optimize image generation and editing jointly, and introduces a lightweight edit instruction alignment stage that improves the model's comprehension of editing instructions. Result: UniGen-1.5 achieves overall scores of 0.89 on GenEval and 4.31 on ImgEdit, outperforming open-source models such as BAGEL and matching proprietary models such as GPT-Image-1, with competitive image understanding and generation performance. Conclusion: Through architectural improvements and new training strategies, UniGen-1.5 improves image understanding, generation, and editing across the board and demonstrates the potential of unified multimodal models on complex visual tasks. Abstract: We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. In particular, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves 0.89 and 4.31 overall scores on GenEval and ImgEdit, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.

[183] ARC Is a Vision Problem!

Keya Hu,Ali Cy,Linlu Qiu,Xiaoman Delores Ding,Runqian Wang,Yeyin Eva Zhu,Jacob Andreas,Kaiming He

Main category: cs.CV

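TL;DR: Proposes Vision ARC (VARC), a vision-based approach that treats the Abstraction and Reasoning Corpus (ARC) as an image-to-image translation problem and processes the visual input directly with a Vision Transformer. It reaches 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods trained from scratch and approaching average human performance.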

Details Motivation: ARC tasks are inherently visual, yet most existing work treats them as a language problem and makes little use of visual priors, so a vision-centric approach is needed to model abstract reasoning better. Method: Formulates ARC as an image-to-image translation problem and designs a "canvas" representation so that inputs can be processed by standard vision models such as ViT; the model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Result: VARC achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming other methods trained from scratch, with results competitive with leading large language models and close to average human performance. Conclusion: Framing ARC as a vision problem allows a vision model alone to perform abstract reasoning effectively, which validates the vision-centric approach for this task and opens a new direction for future research. Abstract: The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
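
To make the "canvas" idea concrete, the sketch below places a small ARC-style grid of integer color codes onto a fixed-size canvas and crops it back out. It is only an illustration under stated assumptions: the canvas size of 32, the background code 10, and the function names are hypothetical and are not taken from the paper, which may build its canvas differently.

```python
from typing import List

Grid = List[List[int]]


def place_on_canvas(grid: Grid, canvas_size: int = 32, top: int = 0, left: int = 0,
                    background: int = 10) -> Grid:
    """Copy a variable-size grid onto a square canvas filled with `background`.

    Assumption: `background` is a code outside the grid's color range, so a
    model can tell puzzle cells apart from empty canvas.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    if top < 0 or left < 0 or top + height > canvas_size or left + width > canvas_size:
        raise ValueError("grid does not fit on the canvas at this offset")
    canvas = [[background] * canvas_size for _ in range(canvas_size)]
    for r, row in enumerate(grid):
        if len(row) != width:
            raise ValueError("grid rows must all have the same length")
        canvas[top + r][left:left + width] = row
    return canvas


def crop_from_canvas(canvas: Grid, top: int, left: int, height: int, width: int) -> Grid:
    """Recover a grid region from the canvas (inverse of place_on_canvas)."""
    return [row[left:left + width] for row in canvas[top:top + height]]
```

A fixed-size canvas gives every puzzle the same input shape, so a standard image model can consume it, and varying `top` and `left` offers a simple form of translation augmentation.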