cs.CL

[1] Fine-Tuning BERT for Domain-Specific Question Answering: Toward Educational NLP Resources at University Scale

Aurélie Montfrond

Main category: cs.CL

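TL;DR: This study fine-tunes BERT to build a question-answering system for university course information, addressing the lack of a domain-specific foundation model for university course materials.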

Details Motivation: Existing work on scientific question answering has focused mostly on general-purpose chatbots and lacks models for fine-grained reasoning in specific domains such as university courses; and although domain models such as BioBERT and SciBERT exist, no foundation model has been tailored to university course materials. Method: Using the University of Limerick's book of modules, the authors built a SQuAD-format dataset of 1,203 question-answer pairs, augmented with manually and synthetically generated entries; they fine-tuned BERT with PyTorch and evaluated it with Exact Match and F1 scores. Result: Experiments show that even modest fine-tuning significantly improves hypothesis framing and knowledge extraction, demonstrating good adaptability and feasibility in the educational domain. Conclusion: A fine-tuned BERT model can be used to efficiently build a domain-specific question-answering system for university courses, with the potential to scale toward the first general QA model for the university domain and to advance autonomous educational knowledge systems. Abstract: Prior work on scientific question answering has largely emphasized chatbot-style systems, with limited exploration of fine-tuning foundation models for domain-specific reasoning. In this study, we developed a chatbot for the University of Limerick's Department of Electronic and Computer Engineering to provide course information to students. A custom dataset of 1,203 question-answer pairs in SQuAD format was constructed using the university book of modules, supplemented with manually and synthetically generated entries. We fine-tuned BERT (Devlin et al., 2019) using PyTorch and evaluated performance with Exact Match and F1 scores. Results show that even modest fine-tuning improves hypothesis framing and knowledge extraction, demonstrating the feasibility of adapting foundation models to educational domains. While domain-specific BERT variants such as BioBERT and SciBERT exist for biomedical and scientific literature, no foundation model has yet been tailored to university course materials. Our work addresses this gap by showing that fine-tuning BERT with academic QA pairs yields effective results, highlighting the potential to scale towards the first domain-specific QA model for universities and enabling autonomous educational knowledge systems.
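The Exact Match and F1 metrics named in the abstract are commonly computed per question in the SQuAD style: normalize both strings, then compare them exactly (EM) or by token overlap (F1). The sketch below follows that common convention; the abstract does not say which normalization the authors used, and the example strings are hypothetical, not taken from their dataset.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical course-information answers, for illustration only.
em = exact_match("The module runs in Semester 1.", "module runs in semester 1")
f1 = f1_score("5 ECTS credits", "5 credits")
```

Dataset-level scores are usually the mean of these per-question values, taking the maximum over references when a question has several acceptable answers.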

Gili Goldin,Ella Rabinovich,Shuly Wintner

Main category: cs.CL

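TL;DR: Proposes a new method for quantifying polarization based on emotional style rather than ideological differences, measuring emotional discourse through valence, arousal, and dominance; an analysis of Israeli parliamentary records finds that the emotional style of government members differs from that of opposition members and that affective polarization rises significantly over time.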

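Details Motivation: Polarized discourse has increased worldwide in recent years across various platforms, calling for new methods that accurately quantify affective polarization. Method: Uses measures of Valence, Arousal, and Dominance to detect signals of emotional discourse and, on that basis, operationalizes the concept of affective polarization. Result: An analysis of the proceedings corpus of the Israeli parliament (the Knesset) finds that government and opposition members differ significantly in emotional style and that the level of affective polarization rises significantly over time. Conclusion: Quantifying polarization through emotional style effectively captures affective polarization trends in political discourse and offers a new perspective on political polarization. Abstract: Recent years have seen an increase in polarized discourse worldwide, on various platforms. We propose a novel method for quantifying polarization, based on the emotional style of the discourse rather than on differences in ideological stands. Using measures of Valence, Arousal and Dominance, we detect signals of emotional discourse and use them to operationalize the concept of affective polarization. Applying this method to a recently released corpus of proceedings of the Knesset, the Israeli parliament (in Hebrew), we find that the emotional style of members of government differs from that of opposition members; and that the level of affective polarization, as reflected by this style, is significantly increasing with time.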

[3] Decoding the Black Box: Discerning AI Rhetorics About and Through Poetic Prompting

P. D. Edgar,Alia Hall

Main category: cs.CL

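TL;DR: This paper explores Poetry Prompt Patterns in prompt engineering, proposing creative text prompting as a new tool for studying the algorithmic tendencies and biases of large language models; it uses poetic prompts to assess how three models describe and evaluate a renowned poet and how willing they are to adapt or rewrite original works for presumed audiences.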

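Details Motivation: To explore the potential of creative prompting, poetic prompting in particular, for understanding and steering the behavior of large language models, and to broaden the methodology of prompt engineering. Method: Proposes "Poetry Prompt Patterns" as a creative prompting approach and applies it to three large language models, testing their adaptability and creativity through tasks that ask for descriptions and evaluations of a renowned poet and for rewrites of original poems. Result: Poetic prompts effectively elicit creative output and expose the models' evaluative biases and rewriting tendencies on artistic tasks, showing that models may sacrifice the original work's intent when catering to a presumed audience. Conclusion: Creative prompting, and Poetry Prompt Patterns in particular, is a valuable addition to prompt engineering that helps deepen understanding of the creative abilities and latent biases of large language models. Abstract: Prompt engineering has emerged as a useful way of studying the algorithmic tendencies and biases of large language models. Meanwhile, creatives and academics have leveraged LLMs to develop creative works and explore the boundaries of their writing capabilities through text generation and code. This study suggests that creative text prompting, specifically Poetry Prompt Patterns, may be a useful addition to the toolbox of the prompt engineer, and outlines the process by which this approach may be taken. Then, the paper uses poetic prompts to assess three models' descriptions and evaluations of a renowned poet and to test the consequences of the models' willingness to adapt or rewrite original creative works for presumed audiences.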

[4] Enhancing Clinical Note Generation with ICD-10, Clinical Ontology Knowledge Graphs, and Chain-of-Thought Prompting Using GPT-4

Ivan Makohon,Mohamad Najafi,Jian Wu,Mathias Brochhausen,Yaohang Li

Main category: cs.CL

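TL;DR: This paper studies Chain-of-Thought (CoT) prompt engineering combined with semantic search and a knowledge graph to improve large language models on clinical note generation; experiments show the approach outperforms clinical notes generated with standard one-shot prompts.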

Details Motivation: Writing clinical notes by hand takes physicians considerable time and reduces clinical efficiency, so an automated method is needed to help generate high-quality notes, saving time and improving the quality of care. Method: Applies Chain-of-Thought (CoT) prompt engineering with inputs based on ICD codes and basic patient information, and adds semantic search results and a knowledge graph built from a clinical ontology to improve the accuracy and domain specificity of the generated content. The approach is tested with GPT-4 on six clinical cases from the CodiEsp dataset. Result: The proposed method produces higher-quality clinical notes than standard one-shot prompting, generating text that is more accurate, more complete, and better suited to clinical needs. Conclusion: Combining CoT prompting, semantic search, and a knowledge graph effectively improves large language models on clinical note generation and has practical potential. Abstract: The past decade has seen a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients' assessments, diagnoses, and treatments are captured in these EHRs in free-form text by physicians, who spend a considerable amount of time entering and editing them. Manually writing clinical notes takes a considerable amount of a doctor's valuable time, increasing the patient's waiting time and possibly delaying diagnoses. Large language models (LLMs) possess the ability to generate news articles that closely resemble human-written ones. We investigate the usage of Chain-of-Thought (CoT) prompt engineering to improve the LLM's response in clinical note generation. In our prompts, we use as input International Classification of Diseases (ICD) codes and basic patient information. We investigate a strategy that combines the traditional CoT with semantic search results to improve the quality of generated clinical notes. Additionally, we infuse a knowledge graph (KG) built from clinical ontology to further enrich the domain-specific knowledge of generated clinical notes. We test our prompting technique on six clinical cases from the CodiEsp test dataset using GPT-4 and our results show that it outperformed the clinical notes generated by standard one-shot prompts.

[5] To Think or Not to Think: The Hidden Cost of Meta-Training with Excessive CoT Examples

Vignesh Kothapalli,Ata Fatahibaarzi,Hamed Firooz,Maziar Sanjabi

Main category: cs.CL

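TL;DR: Proposes CoT-Recipe, which adjusts the ratio of chain-of-thought (CoT) to non-CoT examples during meta-training and substantially improves the few-shot performance of large models on novel reasoning tasks.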

Details Motivation: In-context learning with chain-of-thought examples works poorly on novel tasks when pre-training knowledge is insufficient, and too many CoT examples harm meta-training. Method: Proposes CoT-Recipe, a formal approach to modulating the mix of CoT and non-CoT examples in meta-training sequences, studied within the CoT-ICL Lab framework. Result: Accuracy improves by up to 300% when no CoT examples are available in context; applied to the Qwen2.5 series, accuracy on symbolic reasoning tasks improves by up to 130%. Conclusion: Carefully modulating the mix of CoT and non-CoT examples is critical for meta-training, and CoT-Recipe effectively strengthens generalization to novel abstract reasoning tasks. Abstract: Chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) has unlocked significant reasoning capabilities in large language models (LLMs). However, ICL with CoT examples is ineffective on novel tasks when the pre-training knowledge is insufficient. We study this problem in a controlled setting using the CoT-ICL Lab framework, and propose meta-training techniques to learn novel abstract reasoning tasks in-context. Although CoT examples facilitate reasoning, we noticed that their excessive inclusion during meta-training degrades performance when CoT supervision is limited. To mitigate such behavior, we propose CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. We demonstrate that careful modulation via CoT-Recipe can increase the accuracy of transformers on novel tasks by up to 300% even when there are no CoT examples available in-context. We confirm the broader effectiveness of these techniques by applying them to pretrained LLMs (Qwen2.5 series) for symbolic reasoning tasks and observing gains of up to 130% in accuracy.

[6] LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning

Ömer Faruk Akgül,Yusuf Hakan Kalaycı,Rajgopal Kannan,Willie Neiswanger,Viktor Prasanna

Main category: cs.CL

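TL;DR: LYNX is an online early-exit mechanism that uses the model's own hidden-state awareness to make confidence-controlled stopping decisions during reasoning, effectively reducing "overthinking" and substantially cutting compute across multiple tasks and models while maintaining or improving accuracy.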

Details Motivation: Large reasoning models often waste compute and can lose accuracy by "overthinking", while existing early-stopping methods rely on extra sampling, heuristic rules, or auxiliary verifier models and lack generality and theoretical guarantees. Method: LYNX uses naturally occurring reasoning cues (such as "hmm" and "wait") as exit decision points, trains a lightweight probe on the hidden states at those moments, and combines it with split conformal prediction to provide distribution-free control over premature exits; the probe is trained on a mathematical corpus and then reused across tasks, temperatures, and even non-mathematical tasks. Result: On GSM8K, tokens drop by 40-65% while accuracy is maintained or improved; on MATH-500, accuracy improves by up to 12 points with 35-60% fewer tokens; on AIME 2024, baseline performance is recovered with more than 50% token savings; on CommonsenseQA, zero-shot transfer yields up to 70% fewer tokens with slight gains. Conclusion: LYNX delivers efficient, general, confidence-controllable online early exit, outperforms existing methods, and shows strong accuracy-efficiency tradeoffs across tasks and models. Abstract: Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often "overthink": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., "hmm", "wait") during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy-efficiency tradeoffs.
On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40-65%; on MATH-500 it improves accuracy by up to 12 points with roughly 35-60% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.
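To make the split conformal step concrete, the sketch below shows one standard way to turn held-out probe scores into an exit threshold with a user-chosen error level. It is an illustration of generic split conformal calibration, not LYNX's actual procedure: the function names, the `delta` parameter, and the convention that the calibration set holds probe scores from cue positions where exiting would have been premature are all assumptions.

```python
import math


def conformal_exit_threshold(premature_scores, delta):
    """Pick a threshold from calibration scores of premature-exit cases.

    Under exchangeability, a new premature case scores above the returned
    threshold with probability at most delta.
    """
    n = len(premature_scores)
    rank = math.ceil((n + 1) * (1 - delta))  # conformal quantile rank
    if rank > n:
        # Too few calibration points for this delta: never exit early.
        return math.inf
    return sorted(premature_scores)[rank - 1]


def should_exit(probe_score, threshold):
    """Exit only when the probe is more confident than the threshold."""
    return probe_score > threshold


# Hypothetical probe scores on seven held-out premature-exit cases.
calibration = [0.3, 0.1, 0.7, 0.5, 0.2, 0.6, 0.4]
threshold = conformal_exit_threshold(calibration, delta=0.25)
strict = conformal_exit_threshold(calibration, delta=0.05)
```

Lowering `delta` raises the threshold, trading token savings for fewer premature exits, which is the kind of user-tunable knob the abstract describes.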

[7] Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection Against LLM-Generated Threats

Sadat Shahriar,Navid Ayoobi,Arjun Mukherjee,Mostafa Musharrat,Sai Vishnu Vamsi

Main category: cs.CL

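TL;DR: This study examines the linguistic features of Pink Slime Journalism and shows how large language models (LLMs) can be used for adversarial modification to evade detection, degrading existing systems by up to 40%. In response, it proposes a robust learning framework that improves detection by up to 27%.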

Details Motivation: To counter the threat Pink Slime Journalism poses to the local news ecosystem, especially the new risk of LLM-based adversarial rewriting that evades detection. Method: Analyzes the linguistic, stylistic, and lexical features of Pink Slime articles at a fine-grained level, studies how LLM-modified adversarial samples affect existing detection systems, and designs a robust learning framework that resists such perturbations. Result: Consumer-grade LLMs can reduce the F1-score of existing detection systems by up to 40%; the proposed framework improves performance under such attacks by up to 27%. Conclusion: The threat that LLM-driven adversarial rewriting poses to fake news detection must be taken into account, and the proposed robust framework adapts effectively to this emerging challenge. Abstract: The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism: low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, and show an improvement of up to 27%.

[8] Transformer-Enabled Diachronic Analysis of Vedic Sanskrit: Neural Methods for Quantifying Types of Language Change

Ananth Hariharan,David Mortensen

Main category: cs.CL

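TL;DR: This study proposes a hybrid neural-symbolic method that uses weak supervision and a fine-tuned multilingual BERT to analyze more than two thousand years of morphological change in Sanskrit, showing that its morphological complexity was not simplified but dynamically redistributed.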

Details Motivation: To challenge the traditional assumption that language change is simplification, to explore how complexity shifts over the long-term evolution of a low-resource, morphologically rich language, and to address data scarcity. Method: Uses 100+ high-precision regular expressions to generate pseudo-labels, fine-tunes a multilingual BERT under weak supervision, and fuses symbolic and neural outputs through a novel confidence-weighted ensemble. Result: On a 1.47-million-word diachronic corpus, the system achieves a 52.4% overall feature detection rate and finds that Sanskrit's morphological complexity did not decline but was redistributed across domains, for example through a marked increase in compounding and the emergence of new philosophical terminology; the system produces well-calibrated uncertainty estimates (r = 0.92, ECE = 0.043). Conclusion: Hybrid neural-symbolic methods can effectively reveal complex evolutionary patterns in low-resource languages and provide interpretable, reliable analysis tools, advancing research in computational and historical linguistics. Abstract: This study demonstrates how hybrid neural-symbolic methods can yield significant new insights into the evolution of a morphologically rich, low-resource language. We challenge the naive assumption that linguistic change is simplification by quantitatively analyzing over 2,000 years of Sanskrit. Our approach addresses data scarcity through weak supervision, using 100+ high-precision regex patterns to generate pseudo-labels for fine-tuning a multilingual BERT. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, creating a system that is both scalable and interpretable. Applying this framework to a 1.47-million-word diachronic corpus, our ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit's overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.
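The expected calibration error (ECE) reported above is usually computed by bucketing predictions by confidence and averaging the gap between each bucket's accuracy and its mean confidence, weighted by bucket size. The sketch below uses equal-width bins; the number of bins and the binning scheme are assumptions, since the abstract does not state them, and the sample values are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probabilities in [0, 1]
    correct:     1 if the prediction was right, else 0
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Made-up example: four predictions at 0.9 confidence, three of them right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A value near zero, such as the reported 0.043, means that stated confidence tracks observed accuracy closely.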

[9] Mitigating Self-Preference by Authorship Obfuscation

Taslim Mahbub,Shi Feng

Main category: cs.CL

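TL;DR: This paper studies the self-preference bias of language models (LMs) acting as judges and explores black-box perturbations (such as synonym replacement) that reduce a judge's ability to recognize its own outputs and thereby reduce self-preference. Simple perturbations work, but self-preference recovers when stylistic differences are neutralized further, indicating that the bias operates at several semantic levels and remains hard to eliminate completely.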

Details Motivation: LM judges show self-preference bias when evaluating generated outputs, which can undermine the fairness of evaluation, so its causes and mitigation strategies need investigation. Method: Applies black-box perturbations (such as synonym replacement) to evaluation candidates to obfuscate authorship and weaken the LM's recognition of its own outputs, then measures the change in self-preference in pairwise comparisons. Result: Simple perturbations effectively reduce self-preference, but when the perturbations are extended to a fuller neutralization of style, self-preference recovers, indicating that self-recognition happens at multiple semantic levels. Conclusion: Light perturbations can mitigate LM self-preference, but because the bias is rooted in features at several semantic levels, eliminating it entirely remains a fundamental challenge. Abstract: Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair their integrity in evaluations. One such bias is self-preference: LM judges preferring their own answers over those produced by other LMs or humans. The bias is hard to eliminate as frontier LM judges can distinguish their own outputs from those of others, even when the evaluation candidates are not labeled with their sources. In this paper, we investigate strategies to mitigate self-preference by reducing the LM judges' ability to recognize their own outputs. We apply black-box perturbations to evaluation candidates in pairwise comparison to obfuscate the authorship and reduce self-recognition. We find that perturbations as simple as synonym replacement for a few words predictably reduce self-preference. However, we also uncover fundamental challenges to eliminating the bias: when we extrapolate our perturbations to a more complete neutralization of stylistic differences between the evaluation candidates, self-preference recovers. Our findings suggest that self-recognition and self-preference can happen on many semantic levels, and complete mitigation remains challenging despite promising initial results.
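As a rough picture of what "synonym replacement for a few words" could look like as a black-box perturbation, the sketch below swaps a bounded number of words using a small lookup table before a candidate is shown to the judge. The table, the `perturb` function, and its parameters are hypothetical illustrations; the abstract does not describe how the authors choose words or synonyms.

```python
import random

# Tiny hand-written synonym table, purely illustrative.
SYNONYMS = {
    "big": "large",
    "quick": "fast",
    "help": "assist",
    "show": "demonstrate",
}


def perturb(text, max_swaps=2, seed=0):
    """Replace up to max_swaps words that have an entry in SYNONYMS.

    The judge model is never touched; only the candidate text changes.
    Replacements are lowercase, so original capitalization is not preserved.
    """
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(max_swaps, len(candidates))):
        words[i] = SYNONYMS[words[i].lower()]
    return " ".join(words)


perturbed = perturb("a quick test", max_swaps=1)
```

In a pairwise evaluation, both candidates would be perturbed the same way so that the judge cannot use the perturbation itself as an authorship cue.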

[10] Learning from Self Critique and Refinement for Faithful LLM Summarization

Ting-Yao Hu,Hema Swetha Koppula,Hadi Pouransari,Cem Koc,Oncel Tuzel,Raviteja Vemulapalli

Main category: cs.CL

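TL;DR: Proposes SCRPO, a self-supervised training framework that builds a preference dataset from a large language model's own critique and refinement abilities and then applies preference learning to improve faithful summarization; it is more efficient and more effective than existing methods.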

Details Motivation: Large language models tend to hallucinate content that is not grounded in the input context during long-form generation, and existing mitigations depend on extra compute or a stronger teacher model, making them costly and less practical. Method: Proposes Self Critique and Refinement-based Preference Optimization (SCRPO), which has the LLM critique and iteratively refine its own outputs to build a preference dataset and then applies preference learning on that dataset to improve faithfulness. Result: On the XSUM, CNNDM, and SAMSum summarization benchmarks, SCRPO outperforms existing self-supervised learning methods on faithfulness metrics while maintaining or improving overall summary quality metrics; compared with test-time refinement, it is more efficient and produces more faithful summaries. Conclusion: SCRPO is an efficient, practical self-supervised training method that effectively reduces hallucination in LLM summarization without relying on external models or adding inference cost. Abstract: Large Language Models (LLMs) often suffer from hallucinations (output content that is not grounded in the input context) when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can be reduced by iteratively critiquing and refining previously generated outputs using either the same model or a more powerful teacher model as the critic. However, these approaches either require additional test-time compute or assume access to more powerful teacher models, making them costly and less practical. In this work, we propose Self Critique and Refinement-based Preference Optimization (SCRPO), which is a self-supervised training framework that first constructs a preference dataset by leveraging the LLM's own critique and refinement capabilities, and then applies preference learning to improve the same LLM for faithful summarization. Experiments on three summarization benchmarks (XSUM, CNNDM, and SAMSum) demonstrate that our approach outperforms state-of-the-art self-supervised learning methods in terms of faithfulness metrics while either maintaining or improving other metrics that measure the overall quality of the summary. Moreover, compared to test-time refinement, our approach not only improves efficiency but also results in more faithful summaries.

[11] SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs

Ruixuan Huang,Hao Zeng,Hantao Huang,Jinyuan Shi,Minghui Yu,Ian En-Hsu Yen,Shuai Wang

Main category: cs.CL

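TL;DR: Proposes a unified sparse-quantized format (SQ-format) that achieves a Pareto improvement between performance and throughput while preserving accuracy; it suits activations with outliers and supports hardware acceleration.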

Details Motivation: Existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency, and hardware support is limited; in particular, the 2:4 semi-structured sparse format sees little use because of its accuracy loss. Method: Proposes SQ-format, which combines the benefits of high-precision sparse matrix acceleration and low-precision matrix multiplication acceleration in a single format for quantization and sparsification, and designs hardware to support it. Result: Achieves state-of-the-art PTQ performance, demonstrates a Pareto improvement between performance and throughput, and offers design insights for next-generation AI accelerators. Conclusion: SQ-format is a hardware-friendly unified data format that effectively combines the advantages of quantization and sparsification, supporting efficient deployment of large models on edge and general-purpose hardware. Abstract: Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). However, existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency due to limited hardware support. For example, W4A8 can only achieve the same peak TOPS as W8A8, whereas the GPU-supported sparse data format (2:4 semi-structured sparsity) is seldom adopted due to the loss of accuracy. To bridge this gap, in this paper, we propose the Sparse-Quantized Format (SQ-format), which is a unified data format for quantization and sparsification that could potentially be supported easily by new hardware and existing GPUs. SQ-format makes use of the fact that sparse matrices can be accelerated in high precision, and low-precision matrix multiplication can also be accelerated accordingly. As such, SQ-format is proposed to achieve Pareto improvement between performance and throughput. This format is particularly suitable for activations with outlier inequality status and makes their static compression possible. We show the state-of-the-art PTQ performance with SQ-format, propose the hardware required to support it, and further offer the design exploration and insights for the next-generation AI accelerators.
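One way to picture the idea of pairing a high-precision sparse part with a low-precision dense part is the toy decomposition below: the largest-magnitude entries of a vector are kept exactly in a sparse map, and the remainder is quantized to a few bits with a shared scale. This is only an illustration of the general principle under assumed choices (top-k outlier selection, symmetric per-vector scaling, the function names); the actual SQ-format layout is not specified in the abstract.

```python
def split_sparse_quantized(values, num_outliers=1, bits=4):
    """Split a vector into exact sparse outliers and low-bit dense codes."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outlier_idx = set(order[:num_outliers])

    # High-precision sparse part: index -> original value.
    sparse = {i: values[i] for i in outlier_idx}

    # Low-precision dense part: outlier positions are zeroed before scaling,
    # so one large value no longer inflates the quantization step.
    dense = [0.0 if i in outlier_idx else v for i, v in enumerate(values)]
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in dense)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in dense]
    return sparse, codes, scale


def reconstruct(sparse, codes, scale):
    """Dequantize the dense codes and restore the exact outliers."""
    out = [c * scale for c in codes]
    for i, v in sparse.items():
        out[i] = v
    return out


# Made-up activation vector with one obvious outlier.
activations = [0.5, -1.0, 100.0, 0.25]
sparse, codes, scale = split_sparse_quantized(activations)
restored = reconstruct(sparse, codes, scale)
```

Without the split, the value 100.0 would set the scale and the three small entries would all round to zero at 4 bits, which is the accuracy problem that outlier handling is meant to avoid.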

[12] LMSpell: Neural Spell Checking for Low-Resource Languages

Akesh Gunathilake,Nadil Karunarathne,Tharusha Bandaranayake,Nisansa de Silva,Surangika Ranathunga

Main category: cs.CL

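TL;DR: This is the first empirical study of how effective pretrained language models (PLMs) are for spell correction in low-resource languages; it finds that large language models (LLMs) perform better when the fine-tuning dataset is large, even for languages they were not pre-trained on. The authors release the LMSpell toolkit, propose an evaluation method that accounts for LLM hallucination, and present a case study on Sinhala.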

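Details Motivation: Spell correction for low-resource languages remains challenging, existing work offers little comparison across pretrained language models, and there is no systematic evaluation that covers low-resource languages. Method: Empirically compares several pretrained language models (LLMs, encoder-based, and encoder-decoder) on spell correction across languages, low-resource languages in particular, develops the LMSpell toolkit, and designs an evaluation function that compensates for LLM hallucination. Result: LLMs outperform the other architectures when the fine-tuning dataset is large, and this advantage extends to languages the model was not pre-trained on; the LMSpell toolkit is released and a case study on Sinhala is completed. Conclusion: LLMs have a clear advantage in spell correction, especially in low-resource settings, and pairing them with a dedicated evaluation method mitigates their hallucination problem, advancing multilingual spell correction. Abstract: Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction, which includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even in languages for which the LLM is not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.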

[13] ArtistMus: A Globally Diverse, Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering

Daeyong Kwon,SeungHeon Doh,Juhan Nam

Main category: cs.CL

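TL;DR: This paper introduces two resources, MusWikiDB and ArtistMus, for improving question answering in the music domain. Retrieval-augmented generation (RAG) markedly improves factual accuracy, especially for open-source models, and the resources advance research in music information retrieval and domain-specific question answering.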

Details Motivation: Because music knowledge is sparse in pretraining data, existing large language models perform poorly on music-related reasoning, and few resources support factual and contextual music question answering grounded in artist metadata or historical context. Method: Builds MusWikiDB, a music Wikipedia database of 3.2 million passages, and ArtistMus, a benchmark of 1,000 questions covering 500 diverse artists; runs experiments with retrieval-augmented generation (RAG) and applies RAG-style fine-tuning to improve performance. Result: RAG markedly improves factual accuracy, with open-source models gaining up to +56.8 percentage points; MusWikiDB yields about 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus; RAG fine-tuning further strengthens factual recall and contextual reasoning. Conclusion: MusWikiDB and ArtistMus provide an effective foundation for retrieval-augmented question answering in music, confirm the effectiveness of RAG for music QA, and promote question-answering research in culturally rich domains. Abstract: Recent advances in large language models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic. These resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy; open-source models gain up to +56.8 percentage points (for example, Qwen3 8B improves from 35.0 to 91.8), approaching proprietary model performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, improving results on both in-domain and out-of-domain benchmarks. MusWikiDB also yields approximately 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific question answering, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.
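The retrieval step of a RAG pipeline over a vector database can be sketched as a nearest-neighbor search followed by prompt assembly. The example below is a generic illustration with toy two-dimensional vectors and placeholder passage texts; the embedding model, similarity measure, and prompt template actually used with MusWikiDB are not given in the abstract, so everything named here is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec, passages, k=2):
    """Return the texts of the k passages most similar to the query."""
    ranked = sorted(
        passages,
        key=lambda p: cosine(query_vec, p["vector"]),
        reverse=True,
    )
    return [p["text"] for p in ranked[:k]]


def build_prompt(question, retrieved):
    """Prepend retrieved passages to the question as grounding context."""
    context = "\n".join(f"- {text}" for text in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy index: placeholder texts with hand-picked vectors.
index = [
    {"text": "passage A", "vector": [1.0, 0.0]},
    {"text": "passage B", "vector": [0.0, 1.0]},
    {"text": "passage C", "vector": [0.9, 0.1]},
]
top = retrieve([1.0, 0.0], index, k=2)
prompt = build_prompt("placeholder question", top)
```

The reported gain from 35.0 to 91.8 for Qwen3 8B is a difference in percentage points (91.8 - 35.0 = 56.8), not a relative improvement.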

[14] Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment

Panatchakorn Anantaprayoon,Nataliia Babina,Jad Tarifi,Nima Asgharbeygi

Main category: cs.CL

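TL;DR: This paper proposes a Dynamic Alignment framework that goes beyond conventional alignment paradigms, introducing Collective Agency (CA) as a unified, open-ended alignment objective and achieving scalable, self-improving alignment through LLM-generated data and a self-rewarding mechanism.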

Details Motivation: Conventional value alignment based on human preferences or fixed principles (such as helpfulness and honesty) may prove insufficient on the path toward artificial general intelligence, and methods that depend on human feedback are costly and hard to scale. A more holistic alignment objective and a sustainably evolving method are needed. Method: Proposes the Dynamic Alignment framework with two core components: (1) automated construction of the training dataset with LLMs; (2) a self-rewarding mechanism in which the policy model evaluates its own outputs and assigns rewards for GRPO-based learning. Collective Agency serves as the alignment objective, encouraging integrated agentic capabilities. Result: Experiments show that the method effectively aligns the model to Collective Agency while preserving its general natural language processing capabilities. Conclusion: Dynamic Alignment combined with Collective Agency offers a scalable, open-ended path for AI value alignment that could support continual self-alignment of more advanced future systems. Abstract: Large Language Models (LLMs) are typically aligned with human values using preference data or predefined principles such as helpfulness, honesty, and harmlessness. However, as AI systems progress toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), such value systems may become insufficient. In addition, human feedback-based alignment remains resource-intensive and difficult to scale. While AI-feedback-based self-improving alignment methods have been explored as a scalable alternative, they have largely remained constrained to conventional alignment values. In this work, we explore both a more holistic alignment objective and a scalable, self-improving alignment approach. Aiming to transcend conventional alignment norms, we introduce Collective Agency (CA), a unified and open-ended alignment value that encourages integrated agentic capabilities. We also propose Dynamic Alignment, an alignment framework that enables an LLM to iteratively align itself. Dynamic Alignment comprises two key components: (1) automated training dataset generation with LLMs, and (2) a self-rewarding mechanism, where the policy model evaluates its own output candidates and assigns rewards for GRPO-based learning. Experimental results demonstrate that our approach successfully aligns the model to CA while preserving general NLP capabilities.
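In GRPO, the rewards for a group of candidates sampled from the same prompt are typically converted into advantages by standardizing them within the group, so no separate value model is needed. The sketch below shows that normalization step only; the reward values are made-up stand-ins for the self-assigned scores the abstract mentions, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled candidates.

    Candidates scored above the group mean get positive advantages,
    those below get negative ones.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical self-assigned rewards for three candidate responses.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

When every candidate in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason reward diversity within a group matters.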

[15] SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures

Panuthep Tasawong,Jian Gang Ngui,Alham Fikri Aji,Trevor Cohn,Peerat Limkonchotiwat

Main category: cs.CL

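TL;DR: This paper introduces SEA-SafeguardBench, the first human-verified safety benchmark for Southeast Asian languages, which addresses the cultural and linguistic blind spots of existing English-centric evaluations.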

Details Motivation: Most existing multilingual safety benchmarks rely on machine-translated English data, which fails to capture nuances in low-resource languages, and Southeast Asian languages, which face distinctive safety challenges because of the region's cultural and linguistic diversity, are severely underrepresented. Method: Builds a natively authored, human-verified safety benchmark covering eight Southeast Asian languages and 21,640 samples, with three subsets: general, in-the-wild, and content generation. Result: Experiments show that even state-of-the-art large language models and guardrails perform poorly on harm scenarios tied to Southeast Asian cultures, well below their performance on English text. Conclusion: Localized, culturally adapted safety benchmarks are needed to evaluate large language models effectively in diverse linguistic settings, and SEA-SafeguardBench provides an important foundation for that goal. Abstract: Safeguard models help large language models (LLMs) detect and block harmful content, but most evaluations remain English-centric and overlook linguistic and cultural diversity. Existing multilingual safety benchmarks often rely on machine-translated English data, which fails to capture nuances in low-resource languages. Southeast Asian (SEA) languages are underrepresented despite the region's linguistic diversity and unique safety concerns, from culturally sensitive political speech to region-specific misinformation. Addressing these gaps requires benchmarks that are natively authored to reflect local norms and harm scenarios. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA, covering eight languages and 21,640 samples across three subsets: general, in-the-wild, and content generation. The experimental results from our benchmark demonstrate that even state-of-the-art LLMs and guardrails are challenged by SEA cultural and harm scenarios and underperform when compared to English texts.

[16] Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches

Namu Park,Farzad Ahmed,Zhaoyi Sun,Kevin Lybarger,Ethan Breinhorst,Julie Hu,Ozlem Uzuner,Martin Gunn,Meliha Yetisgen

Main category: cs.CL

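TL;DR: This study evaluates large language models (LLMs) on fine-grained, lesion-level detection of incidental findings (incidentalomas) requiring follow-up and proposes a new inference strategy that combines lesion tagging with anatomy-aware prompting; generative LLMs significantly outperform traditional supervised models and approach human expert performance.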

Details Motivation: Existing document-level classification systems are limited in identifying incidental findings that require follow-up because they cannot pinpoint specific lesions, so finer-grained, interpretable automated methods are needed to improve detection accuracy and reliability in radiology workflows. Method: Using a dataset of 400 radiology reports containing 1,623 annotated lesions, the study compares three supervised transformer encoders with four generative LLM configurations (such as Llama 3.1-8B, GPT-4o, and GPT-OSS-20b) and introduces lesion-tagged inputs and anatomy-aware prompting to strengthen model reasoning; performance is evaluated with class-specific F1-scores. Result: The anatomy-aware GPT-OSS-20b model performed best, with an incidentaloma-positive macro-F1 of 0.79, surpassing all supervised models (maximum 0.70) and approaching inter-annotator agreement (0.76); ensembling the top models raised F1 to 0.90; error analysis shows that LLMs have stronger contextual reasoning when distinguishing actionable findings from benign lesions. Conclusion: When combined with structured lesion tagging and anatomical context, generative LLMs significantly outperform traditional supervised models and approach human expert performance, offering a reliable, interpretable new path for automated surveillance of incidental findings in radiology. Abstract: Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems. Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations (Llama 3.1-8B, GPT-4o, GPT-OSS-20b). We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores. Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p < 0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions.
Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.
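A majority-vote ensemble over per-lesion labels, scored with a class-specific F1, can be sketched as follows. The three prediction lists and the gold labels are invented for illustration (1 marks an incidentaloma requiring follow-up); which systems the authors ensembled and how they broke ties is not stated in the abstract.

```python
from collections import Counter


def majority_vote(predictions_per_system):
    """Combine per-lesion labels from several systems by majority vote.

    Counter.most_common resolves ties in favor of the label seen first,
    so an odd number of systems is the safer choice for binary labels.
    """
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_system)
    ]


def f1_for_class(gold, predicted, label):
    """F1-score treating `label` as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels for four lesions from three hypothetical systems.
system_a = [1, 0, 1, 0]
system_b = [1, 1, 0, 0]
system_c = [0, 1, 1, 0]
gold = [1, 1, 0, 0]

ensemble = majority_vote([system_a, system_b, system_c])
positive_f1 = f1_for_class(gold, ensemble, label=1)
```

A macro-F1 would average `f1_for_class` over every label of interest rather than reporting a single class.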

[17] Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems

Aurprita Mahmood,Sabrin Alam,Neloy Kumer Sagor,Md. Abdul Hadi,Md. Sehab Al Islam,Minhajul Islam

Main category: cs.CL

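TL;DR: This paper studies Bengali math word problem solving on the SOMADHAN dataset, comparing standard prompting, Chain-of-Thought (CoT), and Tree-of-Thought (ToT) across several large models; ToT raises accuracy to 88%, works best on medium-to-large models, and offers a stronger structured approach to mathematical reasoning in low-resource languages.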

Details Motivation: Math word problems require both language understanding and multi-step numerical reasoning, and conventional Chain-of-Thought (CoT) prompting suffers from error propagation, so more effective reasoning methods are needed, especially for low-resource languages such as Bengali. Method: Applies Tree-of-Thought (ToT) reasoning to 100 representative problems from the SOMADHAN dataset, using large models such as GPT-OSS and LLaMA, and compares standard prompting, CoT, and ToT. Result: CoT raises accuracy from 78% to 83%, and ToT raises it further to as much as 88% (GPT-OSS-120B), showing that ToT is particularly effective for medium-to-large models but offers limited advantage for small ones. Conclusion: ToT is a robust framework for solving math word problems in low-resource languages, providing more reliable and globally consistent reasoning than CoT and advancing reasoning methods in multilingual NLP. Abstract: Mathematical Word Problems (MWPs) are among the most challenging tasks in natural language processing because they require both linguistic understanding and multi-step numerical reasoning. While Chain-of-Thought (CoT) prompting has shown promise, its linear structure often propagates errors, limiting overall effectiveness. To address this limitation, we present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset. Owing to computational and token-cost constraints, we evaluate a curated set of 100 representative problems across multiple large language models (LLMs), including GPT-OSS and LLaMA variants, under standard prompting, CoT, and ToT strategies. Our results show that CoT improves baseline accuracy from 78% (standard prompting) to 83% on average, while ToT further increases performance by up to 5 percentage points, achieving 88% accuracy with GPT-OSS-120B. These improvements highlight that ToT is particularly effective in medium-to-large-scale models but may offer less advantage for smaller ones. Overall, our findings establish ToT as a robust framework for solving mathematical problems in low-resource languages such as Bengali. More broadly, this study shows that structured reasoning methods like ToT can provide more reliable and globally consistent outcomes than CoT, paving the way for better reasoning strategies in multilingual NLP.

[18] A Greek Government Decisions Dataset for Public-Sector Analysis and Insight

Giorgos Antoniou,Giorgos Filandrianos,Aggelos Vlachos,Giorgos Stamou,Lampros Kollimenos,Konstantinos Skianis,Michalis Vazirgiannis

Main category: cs.CL

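TL;DR: This paper introduces an open, machine-readable corpus of 1 million Greek government decisions sourced from the national transparency platform Diavgeia, with high-quality text extracted from PDFs and a reproducible extraction pipeline, and explores its potential for retrieval-augmented generation (RAG) tasks and large-model training.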

Details Motivation: To improve the accessibility and transparency of public-sector information and advance structured retrieval and reasoning over government documents by building a large-scale, high-quality corpus of Greek government decisions. Method: One million government decision PDFs are collected from Diavgeia, and an automated pipeline extracts high-quality raw text into Markdown, with data and code released; a qualitative analysis identifies boilerplate patterns, and a RAG task is built by formulating representative questions, creating high-quality answers, and evaluating a baseline system. Result: A large-scale, high-quality corpus and a reproducible processing pipeline were built and released; the baseline RAG system shows that retrieving and answering questions about public decisions is feasible; the corpus is presented as suitable for LM/LLM pre-training, domain adaptation, knowledge-grounded generation, and explainable-AI research. Conclusion: With its scale, quality, and coverage, the corpus can play an important role in government transparency, information retrieval, question answering, and the development of specialized language models; future work will improve extraction accuracy and broaden applications. Abstract: We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions, featuring high-quality raw text extracted from PDFs. It is released with raw extracted text in Markdown format, alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task by formulating a set of representative questions, creating high-quality answers, and evaluating a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could simulate a chat-based assistant capable of interactively answering questions about public decisions. Due to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training or fine-tuning material for new Language Models (LMs) and Large Language Models (LLMs) respectively, including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.

[19] Grounded Multilingual Medical Reasoning for Question Answering with Large Language Models

Pietro Ferrazzi,Aitor Soroa,Rodrigo Agerri

Main category: cs.CL

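TL;DR: Proposes a retrieval-augmented-generation-based method for producing multilingual reasoning traces, generating 500k traces in English, Italian, and Spanish grounded in medical knowledge from Wikipedia, and achieving state-of-the-art results among 8B-parameter models on MedQA and MedMCQA.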

Details Motivation: Existing LLM approaches to medical question answering focus mainly on English and rely on distillation from general-purpose LLMs, which leaves the reliability of their medical knowledge in doubt and provides little multilingual support or interpretability. Method: A retrieval-augmented generation (RAG) approach generates multilingual (English, Italian, Spanish) reasoning traces grounded in medical information from Wikipedia; MedQA and MedMCQA are extended to multilingual versions and used for training and evaluation. Result: Performance improves on both in-domain and out-of-domain medical QA benchmarks, under both in-context learning and fine-tuning, reaching state-of-the-art results among 8B-parameter models. Conclusion: The proposed trace-generation method improves the performance and reliability of multilingual medical QA, and the released resources (reasoning traces, translated datasets, medical Wikipedia data, and fine-tuned models) support the development of safer, more transparent multilingual clinical decision-support tools. Abstract: Large Language Models (LLMs) with reasoning capabilities have recently demonstrated strong potential in medical Question Answering (QA). Existing approaches are largely English-focused and primarily rely on distillation from general-purpose LLMs, raising concerns about the reliability of their medical knowledge. In this work, we present a method to generate multilingual reasoning traces grounded in factual medical knowledge. We produce 500k traces in English, Italian, and Spanish, using a retrieval-augmented generation approach over medical information from Wikipedia. The traces are generated to solve medical questions drawn from MedQA and MedMCQA, which we extend to Italian and Spanish. We test our pipeline in both in-domain and out-of-domain settings across Medical QA benchmarks, and demonstrate that our reasoning traces improve performance both when utilized via in-context learning (few-shot) and supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs. We believe that these resources can support the development of safer, more transparent clinical decision-support tools in multilingual settings. We release the full suite of resources: reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.

[20] Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

Shuai Dong,Siyuan Wang,Xingyu Liu,Zhongyu Wei

Main category: cs.CL

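TL;DR: This paper proposes ILVR, a new multimodal reasoning framework that interleaves text generation with latent visual representations, resolving the trade-off in existing methods between perceptual precision and dynamic modeling.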

Details Motivation: Existing interleaved reasoning paradigms are computationally expensive because they repeatedly re-encode pixel-dense images, while latent visual reasoning methods suffer from imprecise perceptual modeling or lack dynamic structure. Method: Interleaved Latent Visual Reasoning (ILVR) interleaves text generation with latent visual representations that act as specific, evolving cues, and uses a momentum teacher model in a self-supervision strategy to selectively distill features from helper images into sparse supervision targets. Result: Experiments on multiple multimodal reasoning benchmarks show that ILVR significantly outperforms existing methods and strikes a better balance between fine-grained perception and sequential multimodal reasoning. Conclusion: ILVR unifies dynamic state evolution with precise perceptual modeling, offering a new direction for efficient multimodal reasoning. Abstract: Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.

[21] MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation

Zhitao He,Haolin Yang,Zeyu Qin,Yi R Fung

Main category: cs.CL

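TL;DR: This paper presents a new approach to medical education built on ClinEdu, a multi-agent pedagogical simulator, and ClinTeach, a Socratic teaching dialogue dataset, and trains MedTutor-R1, the first multimodal tutor model for one-to-many instruction in clinical medicine, which markedly improves teaching effectiveness and adaptability.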

Details Motivation: The gap between rising demand for clinical training and the scarcity of expert instructors is widening, and existing LLM research concentrates on one-on-one knowledge instruction while overlooking collaborative team reasoning, a key mode of learning. Method: The authors develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, and construct ClinTeach, a large-scale Socratic teaching dialogue dataset; MedTutor-R1 is trained on this dataset through instruction tuning followed by reinforcement learning that optimizes adaptive Socratic strategies against a three-axis rubric. Result: MedTutor-R1 exceeds the base model by more than 20% in average pedagogical score, performs comparably to o3, and adapts well to varying numbers of students. Conclusion: The ClinEdu simulator and the ClinTeach dataset support controlled testing of complex pedagogical processes and scalable data generation, and MedTutor-R1 demonstrates the feasibility and advantages of a multimodal Socratic tutor for one-to-many clinical instruction. Abstract: The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one-on-one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor-R1, the first multimodal Socratic tutor designed for one-to-many instruction in clinical medical education. MedTutor-R1 is first instruction-tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three-axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in-situ assessment, we use simulation-based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.

[22] Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods

Tereza Novotna,Jakub Harasta

Main category: cs.CL

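TL;DR: This paper compares two models on case retrieval over Czech Constitutional Court decisions and finds that a general-purpose OpenAI embedding model significantly outperforms a BERT model pre-trained in-domain. Label noise keeps absolute nDCG low, but the differences are statistically significant, and the paper proposes a robust evaluation framework for noisily labeled data.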

Details Motivation: Case-law retrieval is time-consuming and relies on database queries, and existing methods struggle with judicial data that carries noisy labels, so more effective retrieval and evaluation methods are needed. Method: A general-purpose OpenAI embedding model is compared with a domain-specific BERT trained from scratch on about 30,000 decisions, across three settings, using IDF-weighted keyword overlap as graded relevance, binarization at two thresholds, paired-bootstrap significance testing, and an nDCG diagnosis combined with qualitative analysis. Result: The general-purpose OpenAI embedder significantly outperforms the domain BERT at @10/@20/@100, with statistically significant differences; the low absolute nDCG values are attributed to label drift and overly strong ideal rankings rather than to ineffective models. Conclusion: Even under heavy label noise, a general-purpose embedding model can perform better at legal case retrieval, and the proposed evaluation framework suits heterogeneously labeled data from legacy judicial databases and has practical value. Abstract: Retrieving case law is a time-consuming task predominantly carried out by querying databases. We provide a comparison of two models in three different settings for Czech Constitutional Court decisions: (i) a large general-purpose embedder (OpenAI), (ii) a domain-specific BERT trained from scratch on ~30,000 decisions using sliding windows and attention pooling. We propose a noise-aware evaluation including IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance via paired bootstrap, and an nDCG diagnosis supported with qualitative analysis. Despite modest absolute nDCG (expected under noisy labels), the general OpenAI embedder decisively outperforms the domain pre-trained BERT in both settings at @10/@20/@100 across both thresholds; differences are statistically significant. Diagnostics attribute low absolutes to label drift and strong ideals rather than lack of utility. Additionally, our framework is robust enough to be used for evaluation under a noisy gold dataset, which is typical when handling data with heterogeneous labels stemming from legacy judicial databases.
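The evaluation ingredients named in the abstract above (graded relevance, threshold binarization, nDCG@k, paired bootstrap) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the ideal ranking is computed from the same candidate list rather than from the whole collection, and the bootstrap reports a simple one-sided fraction of resamples.

```python
import math
import random

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """nDCG@k: DCG of the ranking divided by DCG of the same gains sorted descending.
    Assumption: the ideal ranking is built from the retrieved list only."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def binarize(gains, threshold):
    """Turn graded relevance into 0/1 labels (the abstract uses 0.20 and 0.28)."""
    return [1.0 if g >= threshold else 0.0 for g in gains]

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of query resamples in which system A does NOT beat system B.
    Small values suggest A's advantage is unlikely to be a sampling artifact."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) <= 0:
            not_better += 1
    return not_better / n_resamples
```

In use, `ndcg_at_k` would be computed per query for each embedder, and the two per-query score lists passed to `paired_bootstrap`.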

[23] Faithfulness metric fusion: Improving the evaluation of LLM trustworthiness across domains

Ben Malin,Tatiana Kalganova,Nikolaos Boulgouris

Main category: cs.CL

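TL;DR: Proposes a method that improves the accuracy of faithfulness evaluation for large language model (LLM) outputs by fusing elementary faithfulness metrics, using a tree-based model informed by human judgements to determine each metric's importance; the fused metric correlates more strongly with human judgements.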

Details Motivation: To make faithfulness evaluation of LLM outputs more accurate and thereby increase confidence in models across diverse application scenarios. Method: Multiple elementary faithfulness metrics are fused by a tree-based model that uses human judgements of the faithfulness of LLM responses as the training signal to learn each metric's weight. Result: The fused metric correlates more strongly with human judgements in every tested domain, and a standardized dataset spanning question answering and dialogue is released to support reproducible research. Conclusion: The method improves the accuracy of faithfulness evaluation and gives a stronger evaluation basis for reliable use of LLMs. Abstract: We present a methodology for improving the accuracy of faithfulness evaluation in Large Language Models (LLMs). The proposed methodology is based on the combination of elementary faithfulness metrics into a combined (fused) metric, for the purpose of improving the faithfulness of LLM outputs. The proposed strategy for metric fusion deploys a tree-based model to identify the importance of each metric, which is driven by the integration of human judgements evaluating the faithfulness of LLM responses. This fused metric is demonstrated to correlate more strongly with human judgements across all tested domains for faithfulness. Improving the ability to evaluate the faithfulness of LLMs allows for greater confidence to be placed within models, allowing for their implementation in a greater diversity of scenarios. Additionally, we homogenise a collection of datasets across question answering and dialogue-based domains and implement human judgements and LLM responses within this dataset, allowing for the reproduction and trialling of faithfulness evaluation across domains.

[24] Efficient Text Classification with Conformal In-Context Learning

Ippokratis Pantelidis,Korbinian Randl,Aron Henriksson

Main category: cs.CL

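TL;DR: This paper systematically evaluates CICLe on a range of NLP classification benchmarks, showing that it improves classification performance while substantially reducing prompt length and computational cost, and that it is especially well suited to class-imbalanced text classification tasks.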

Details Motivation: Although large language models (LLMs) have strong in-context learning abilities, their text classification performance depends heavily on prompt design and is computationally costly; CICLe was proposed to improve efficiency, but its cross-domain applicability and efficiency benefits have not been studied systematically. Method: The CICLe framework combines a lightweight base classifier with conformal prediction to guide LLM prompting, and is evaluated comprehensively on multiple NLP classification benchmarks. Result: With sufficient training samples CICLe outperforms its base classifier and few-shot prompting baselines, and it performs comparably in low-data regimes; it reduces the number of shots by up to 34.45% and prompt length by up to 25.16%, enables smaller models to reach competitive performance, and is especially effective on class-imbalanced tasks. Conclusion: CICLe is a practical and scalable approach to efficient text classification that combines the robustness of traditional classifiers with the adaptability of LLMs and substantially improves data and computational efficiency. Abstract: Large Language Models (LLMs) demonstrate strong in-context learning abilities, yet their effectiveness in text classification depends heavily on prompt design and incurs substantial computational cost. Conformal In-Context Learning (CICLe) has been proposed as a resource-efficient framework that integrates a lightweight base classifier with Conformal Prediction to guide LLM prompting by adaptively reducing the set of candidate classes. However, its broader applicability and efficiency benefits beyond a single domain have not yet been systematically explored. In this paper, we present a comprehensive evaluation of CICLe across diverse NLP classification benchmarks. The results show that CICLe consistently improves over its base classifier and outperforms few-shot prompting baselines when the sample size is sufficient for training the base classifier, and performs comparably in low-data regimes. In terms of efficiency, CICLe reduces the number of shots and prompt length by up to 34.45% and 25.16%, respectively, and enables the use of smaller models with competitive performance. CICLe is furthermore particularly advantageous for text classification tasks with high class imbalance. These findings highlight CICLe as a practical and scalable approach for efficient text classification, combining the robustness of traditional classifiers with the adaptability of LLMs, and achieving substantial gains in data and computational efficiency.
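To make the "adaptively reducing the set of candidate classes" step concrete, here is a minimal sketch of standard split conformal prediction applied to a base classifier's class probabilities. It illustrates the general technique rather than the CICLe implementation: the function names, the nonconformity score (one minus the probability of the true class), and the fallback when the calibration set is too small are all assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold so prediction sets cover the true class
    with probability about 1 - alpha (split conformal prediction).
    Nonconformity score (assumed): 1 - probability assigned to the true class."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return 1.0  # too few calibration points: keep every class
    return scores[rank - 1]

def prediction_set(probs, qhat):
    """Candidate classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]
```

One plausible way to use the result is to show the LLM only the classes in `prediction_set(...)`, which shortens the prompt when the base classifier is confident.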

[25] Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning

Jinlong Liu,Mohammed Bahja,Venelin Kovatchev,Mark Lee

Main category: cs.CL

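TL;DR: This paper proposes a training framework for style-conditioned story generation based on Group Relative Policy Optimization (GRPO) and a multi-reward setup, in which authorship verification signals guide a sentence embedding model that supplies the style reward; on Mark Twain-style fiction generation it achieves better style consistency than large models such as GPT-4o and Claude Sonnet 4.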

Details Motivation: Existing methods have limited ability to control fine-grained writing style in text generation, usually simulating style through shallow cues and lacking robust evaluation; this work aims at more precise, measurable modeling of authorial style. Method: A GRPO reinforcement learning framework is combined with a custom multi-reward system: a sentence transformer fine-tuned on authorship verification (AV) produces the style reward, which is merged with content and completeness scores to stabilize long-form narrative generation. An 8B model is trained with The Adventures of Huckleberry Finn as the style exemplar. Result: The 8B model surpasses larger baselines (such as GPT-4o and Claude Sonnet 4) on AV-style metrics, reaching a style score of 0.628 while keeping competitive content quality, though narrative completeness remains a challenge. Conclusion: Task-specific training with a moderately sized model is enough to achieve agentic stylistic generation, confirming that AV signals are a feasible basis for style control; future work should better model global coherence and story resolution. Abstract: Recent advances in large language models (LLMs) show impressive performance in open-ended story generation, but fine-grained stylistic control remains limited. Existing methods often rely on shallow cues (e.g., names or topics) to simulate authorial style, without robust evaluation. In this work, we present a training framework for style-conditioned story generation using Group Relative Policy Optimization (GRPO) and a custom multi-reward setup. The style reward is derived from a fine-tuned sentence transformer using authorship verification (AV) signals, combined with content and completeness scores to stabilize long-form narrative generation. We conduct experiments using fiction by Mark Twain, a prominent 19th-century American author, with The Adventures of Huckleberry Finn serving as the reference style exemplar. Our 8B model outperforms larger baselines such as GPT-4o and Claude Sonnet 4 in AV-style metrics, achieving a style score of 0.628 and competitive content quality. Results demonstrate the feasibility of agentic stylistic generation with moderate model size and task-specific training. While the output is clearly style-aligned, narrative completeness remains a challenge, indicating future work is needed to better model global coherence and story resolution.
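The core of GRPO is that each sampled completion is scored relative to the other completions drawn for the same prompt. The sketch below shows that normalization together with one possible way of merging style, content, and completeness scores into a single reward. The weighted sum and its weights are hypothetical; the abstract does not say how the three scores are combined.

```python
import statistics

def combine_rewards(style, content, completeness, weights=(0.5, 0.3, 0.2)):
    """Merge the three reward signals into one scalar.
    Assumption: a fixed weighted sum; the weights here are illustrative only."""
    w_style, w_content, w_complete = weights
    return w_style * style + w_content * content + w_complete * completeness

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward against the mean and
    standard deviation of its own sampling group (one group per prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that score above the group mean receive positive advantages and are reinforced; a group whose rewards are all identical yields zero advantages and contributes no gradient signal.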

[26] Heard or Halted? Gender, Interruptions, and Emotional Tone in U.S. Supreme Court Oral Arguments

Yifei Tong

Main category: cs.CL

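TL;DR: Using the 2010-2019 U.S. Supreme Court oral argument corpus, this study analyzes 12,663 speech chunks to examine how interruptions during judicial questioning affect the semantic content and emotional tone of advocates' speech, with particular attention to gender differences. Interruptions do not significantly change argumentative content, but interruptions directed at female advocates carry more negative sentiment.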

Details Motivation: To investigate gendered communication patterns in judicial argument and their effect on legal advocates, in particular whether and how interruption behavior differs with the advocate's gender. Method: Using the ConvoKit Supreme Court Corpus (2010-2019), 12,663 speech chunks from advocate-justice interactions are analyzed; semantic similarity is measured with GloVe sentence vectors and sentiment is assessed with lexicon-based analysis. Result: Semantic similarity between advocates' speech before and after an interruption stays consistently high, indicating that interruptions do not significantly alter argumentative content; interruptions directed at female advocates, however, contain significantly stronger negative sentiment. Conclusion: Interruptions do not substantively distort advocates' arguments, but their emotional expression differs by gender, reflecting gendered discourse patterns in elite institutional settings and underscoring the value of computational linguistic methods for studying power and equity in judicial proceedings. Abstract: This study examines how interruptions during U.S. Supreme Court oral arguments shape both the semantic content and emotional tone of advocates' speech, with a focus on gendered dynamics in judicial discourse. Using the ConvoKit Supreme Court Corpus (2010-2019), we analyze 12,663 speech chunks from advocate-justice interactions to assess whether interruptions alter the meaning of an advocate's argument and whether interruptions toward female advocates exhibit more negative emotional valence. Semantic shifts are quantified using GloVe-based sentence embeddings, while sentiment is measured through lexicon-based analysis. We find that semantic similarity between pre- and post-interruption speech remains consistently high, suggesting that interruptions do not substantially alter argumentative content. However, interruptions directed at female advocates contain significantly higher levels of negative sentiment. These results deepen empirical understanding of gendered communication in elite institutional settings and demonstrate the value of computational linguistic methods for studying power, discourse, and equity in judicial proceedings.
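A common way to obtain a sentence embedding from word vectors such as GloVe is to average the vectors of the words in the sentence and then compare sentences by cosine similarity. The sketch below assumes that averaging scheme, which the abstract does not specify, and uses a tiny hand-made lookup table in place of real GloVe vectors.

```python
import math

def sentence_embedding(tokens, vectors):
    """Average the word vectors of the tokens present in the lookup table.
    Returns None when no token is known. Averaging is an assumed pooling choice."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional vectors standing in for pre-trained GloVe entries.
toy_vectors = {"court": [1.0, 0.0], "ruling": [0.0, 1.0], "statute": [1.0, 1.0]}
pre = sentence_embedding(["the", "court", "ruling"], toy_vectors)
post = sentence_embedding(["statute"], toy_vectors)
similarity = cosine(pre, post)
```

A similarity close to 1 between the pre- and post-interruption chunks would correspond to the "consistently high" semantic similarity the abstract reports.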

[27] Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy

Savir Basil,Ina Shapiro,Dan Shapiro,Ethan Mollick,Lilach Mollick,Lennart Meincke

Main category: cs.CL

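TL;DR: This study examines whether assigning personas to AI models (such as expert or low-knowledge personas) improves performance on difficult multiple-choice questions. Persona prompts generally have no significant effect on accuracy and can even lower performance.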

Details Motivation: To understand whether persona prompts, by simulating an expert or another specific identity, strengthen AI models' reasoning on complex problems, especially graduate-level questions in science, engineering, and law. Method: Six models are evaluated on the GPQA Diamond and MMLU-Pro benchmarks under three conditions (domain-matched expert personas, domain-mismatched expert personas, and low-knowledge personas such as a child) and compared with a no-persona baseline. Result: Domain-matched expert personas give no significant improvement for most models (Gemini 2.0 Flash is the only exception); domain-mismatched personas have a slight negative effect; low-knowledge personas generally reduce accuracy. Conclusion: Persona prompts generally do not improve answer accuracy on difficult objective questions, although they may serve other purposes such as adjusting the tone of outputs. Abstract: This is the fourth in a series of short reports that help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. Here, we ask whether assigning personas to models improves performance on difficult objective multiple-choice questions. We study both domain-specific expert personas and low-knowledge personas, evaluating six models on GPQA Diamond (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024), graduate-level questions spanning science, engineering, and law. We tested three approaches: - In-Domain Experts: Assigning the model an expert persona ("you are a physics expert") matched to the problem type (physics problems) had no significant impact on performance (with the exception of the Gemini 2.0 Flash model). - Off-Domain Experts (Domain-Mismatched): Assigning the model an expert persona ("you are a physics expert") not matched to the problem type (law problems) resulted in marginal differences. - Low-Knowledge Personas: We assigned the model negative capability personas (layperson, young child, toddler), which were generally harmful to benchmark accuracy. Across both benchmarks, persona prompts generally did not improve accuracy relative to a no-persona baseline. Expert personas showed no consistent benefit across models, with few exceptions. Domain-mismatched expert personas sometimes degraded performance. Low-knowledge personas often reduced accuracy. These results are about the accuracy of answers only; personas may serve other purposes (such as altering the tone of outputs), beyond improving factual performance.

[28] Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework

Tasnimul Hassan,Md Faisal Karim,Haziq Jeelani,Elham Behnam,Robert Green,Fayeq Jeelani Syed

Main category: cs.CL

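TL;DR: This paper presents a medical question-answering system based on retrieval-augmented generation (RAG) that combines domain knowledge retrieval with open-source large language models (LLaMA-2 and Falcon), fine-tuned with LoRA to improve accuracy and factual consistency on PubMedQA and MedMCQA and to markedly reduce hallucinations.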

Details Motivation: Applying large language models directly to the clinical domain leads to insufficient factual accuracy and hallucinations, so a method is needed that draws on specialized medical knowledge to make answers more reliable. Method: A retrieval-augmented generation (RAG) framework first retrieves relevant information from the medical literature and then passes it to open-source LLMs (LLaMA-2 and Falcon) fine-tuned with LoRA to generate answers, with performance evaluated on standard datasets. Result: On PubMedQA the fine-tuned LLaMA-2 model reaches 71.8% accuracy, well above the 55.4% zero-shot baseline; the system cuts unsupported content by about 60% and provides source references for transparency. Conclusion: RAG-based open-source LLMs show promise for biomedical question answering, improving the accuracy and trustworthiness of answers and suiting practical clinical informatics applications. Abstract: Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. The system retrieves relevant medical literature to ground the LLM's answers, thereby improving factual correctness and reducing hallucinations. We evaluate the approach on benchmark datasets (PubMedQA and MedMCQA) and show that retrieval augmentation yields measurable improvements in answer accuracy compared to using LLMs alone. Our fine-tuned LLaMA 2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline, while maintaining transparency by providing source references. We also detail the system design and fine-tuning methodology, demonstrating that grounding answers in retrieved evidence reduces unsupported content by approximately 60%. These results highlight the potential of RAG-augmented open-source LLMs for reliable biomedical QA, pointing toward practical clinical informatics applications.
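Low-Rank Adaptation keeps a pre-trained weight matrix frozen and learns a low-rank correction, so the adapted layer computes h = Wx + (alpha / r) * B(Ax). The sketch below shows that forward pass on plain Python lists and counts the trainable parameters. The matrix sizes, rank, and scaling value are illustrative assumptions, not the configuration used in the paper.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def lora_forward(w, a, b, x, alpha):
    """LoRA forward pass: frozen W plus a scaled low-rank update B @ A.
    Shapes: w is d x k, a is r x k, b is d x r, x has length k."""
    r = len(a)
    base = matvec(w, x)                # frozen pre-trained path
    update = matvec(b, matvec(a, x))   # trainable low-rank path
    scale = alpha / r
    return [h + scale * u for h, u in zip(base, update)]

def lora_param_count(d, k, r):
    """Trainable parameters added by LoRA for one d x k weight matrix."""
    return r * (d + k)

# Illustrative numbers: a 4096 x 4096 projection adapted with rank 8.
full = 4096 * 4096
added = lora_param_count(4096, 4096, 8)
```

With these illustrative sizes the adapter trains 65,536 parameters against 16,777,216 in the full matrix, which is why LoRA is described as efficient domain specialization.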

[29] M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

David Anugraha,Patrick Amadeus Irawan,Anshul Singh,En-Shiun Annie Lee,Genta Indra Winata

Main category: cs.CL

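TL;DR: This paper introduces M4-RAG, a massive-scale multilingual, multimodal benchmark for retrieval-augmented visual question answering covering 42 languages and 56 dialects and registers, with more than 80,000 culturally diverse image-question pairs and a controlled retrieval environment for evaluating retrieval-augmented VQA across languages and modalities. Experiments show that retrieval augmentation helps small vision-language models but is ineffective or even harmful for large ones, revealing a critical mismatch between current retrieval methods and large models.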

Details Motivation: Vision-language models (VLMs) perform well on visual question answering (VQA) but are limited by static training data, and retrieval-augmented generation (RAG) is under-studied in multilingual multimodal settings, so a comprehensive benchmark is needed to advance the field. Method: The M4-RAG benchmark covers 42 languages and 56 dialects and registers and contains more than 80,000 culturally diverse image-question pairs; a controlled retrieval environment with millions of carefully curated multilingual documents balances realism and reproducibility; VLMs of different sizes are evaluated systematically under RAG. Result: RAG consistently improves small VLMs but is ineffective for large VLMs and can even degrade them, exposing a serious mismatch between current retrieval effectiveness and large models. Conclusion: M4-RAG provides an important foundation for next-generation RAG systems that reason across languages, modalities, and cultural contexts, and it shows an urgent need to redesign retrieval mechanisms to suit large VLMs. Abstract: Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.

cs.CV [Back]

[30] AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance

Tianling Xu,Shengzhe Gan,Leslie Gu,Yuelei Li,Fangneng Zhan,Hanspeter Pfister

Main category: cs.CV

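TL;DR: Proposes AREA3D, an active 3D reconstruction agent that uses feed-forward 3D reconstruction models and vision-language guidance to achieve state-of-the-art reconstruction accuracy in the sparse-view regime.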

Details Motivation: Existing active reconstruction methods rely on hand-crafted geometric heuristics, which tend to produce redundant observations with limited gains. Method: View-uncertainty modeling is decoupled from the feed-forward reconstructor, and a vision-language model provides high-level semantic guidance for selecting informative and diverse viewpoints. Result: Extensive experiments on scene-level and object-level benchmarks confirm AREA3D's superior performance in the sparse-view regime. Conclusion: By combining precise uncertainty estimation with semantic guidance, AREA3D substantially improves the efficiency and quality of active 3D reconstruction. Abstract: Active 3D reconstruction enables an agent to autonomously select viewpoints to efficiently obtain accurate and complete scene geometry, rather than passively reconstructing scenes from pre-collected images. However, existing active reconstruction methods often rely on hand-crafted geometric heuristics, which can lead to redundant observations without substantially improving reconstruction quality. To address this limitation, we propose AREA3D, an active reconstruction agent that leverages feed-forward 3D reconstruction models and vision-language guidance. Our framework decouples view-uncertainty modeling from the underlying feed-forward reconstructor, enabling precise uncertainty estimation without expensive online optimization. In addition, an integrated vision-language model provides high-level semantic guidance, encouraging informative and diverse viewpoints beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime. Code will be made available at: https://github.com/TianlingXu/AREA3D .

[31] Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training

Wenshuo Wang,Fan Zhang

Main category: cs.CV

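TL;DR: This paper identifies a new problem, Scale Anchoring: in zero-shot super-resolution spatiotemporal forecasting, error that stays constant across resolutions is not good generalization but a consequence of the Nyquist frequency limit of the low-resolution training data. To address it, the authors propose architecture-agnostic Frequency Representation Learning (FRL), which uses resolution-aligned frequency representations and spectral consistency training so that models respond more stably to high-frequency signals at high resolution and error decreases as resolution increases.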

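Details Motivation: Existing zero-shot super-resolution spatiotemporal forecasting methods mistake stable error across resolutions for good generalization. In fact, low-resolution data cannot capture high-frequency physical behavior (it is bounded by its Nyquist frequency), so models cannot handle unseen frequency bands during high-resolution inference and produce systematic error. This phenomenon is defined as Scale Anchoring, and new modeling ideas are needed to break it. Method: Frequency Representation Learning (FRL) has two core components: 1) resolution-aligned frequency representations, which make frequency components comparable across resolutions; 2) spectral consistency training, which makes the model more stable on high-frequency signals on high-resolution grids. The method can be embedded in existing architectures and does not depend on a particular network structure. Result: FRL substantially reduces prediction error in high-resolution inference, and error keeps falling as resolution increases, breaking the error anchoring of conventional methods. It significantly outperforms baselines across the tested tasks and resolution range while adding only modest computational overhead. Conclusion: Scale Anchoring is a key bottleneck for zero-shot super-resolution spatiotemporal forecasting; by explicitly modeling frequency response, FRL alleviates it and supports trustworthy deployment of deep learning models as numerical solvers in multi-resolution settings. Abstract: Zero-Shot Super-Resolution Spatiotemporal Forecasting requires a deep learning model to be trained on low-resolution data and deployed for inference on high-resolution data. Existing studies consider maintaining similar error across different resolutions as indicative of successful multi-resolution generalization. However, deep learning models serving as alternatives to numerical solvers should reduce error as resolution increases. The fundamental limitation is that the upper bound of physical law frequencies that low-resolution data can represent is constrained by its Nyquist frequency, making it difficult for models to process signals containing unseen frequency components during high-resolution inference. This results in errors being anchored at low resolution, incorrectly interpreted as successful generalization. We define this fundamental phenomenon as a new problem distinct from existing issues: Scale Anchoring. Therefore, we propose architecture-agnostic Frequency Representation Learning. It alleviates Scale Anchoring through resolution-aligned frequency representations and spectral consistency training: on grids with higher Nyquist frequencies, the frequency response in high-frequency bands of FRL-enhanced variants is more stable. This allows errors to decrease with resolution and significantly outperform baselines within our task and resolution range, while incurring only modest computational overhead.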
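The Nyquist argument in the abstract above can be checked numerically. On a uniform grid with spacing dx, the highest representable frequency is 1 / (2 dx); any faster oscillation is aliased onto a slower one and becomes invisible in the samples. The sketch below is a generic sampling-theory illustration with hypothetical function names and grid sizes, unrelated to the FRL implementation itself.

```python
import math

def nyquist_frequency(n_points, length=1.0):
    """Highest resolvable frequency (cycles per unit length) on a uniform grid."""
    dx = length / n_points
    return 1.0 / (2.0 * dx)

def sample_sine(freq, n_points, length=1.0):
    """Sample sin(2*pi*freq*x) at n_points equally spaced positions in [0, length)."""
    return [math.sin(2.0 * math.pi * freq * (j * length / n_points))
            for j in range(n_points)]

# On 8 points the Nyquist frequency is 4, so a frequency-6 wave cannot be
# represented: its samples coincide with those of a frequency -2 wave.
coarse_fast = sample_sine(6, 8)
coarse_alias = sample_sine(-2, 8)

# On 16 points (Nyquist frequency 8) the two waves are distinguishable again.
fine_fast = sample_sine(6, 16)
fine_alias = sample_sine(-2, 16)
```

A model trained only on the 8-point grid never sees frequency-6 content as such, which is the sense in which high-frequency bands are "unseen" when it is later run on the 16-point grid.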

[32] InvarDiff: Cross-Scale Invariance Caching for Accelerated Diffusion Models

Zihao Wu

Main category: cs.CV

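TL;DR: Proposes InvarDiff, a training-free acceleration method for diffusion models that exploits feature invariance at the timestep and layer scales to reuse cached results, substantially reducing computation and achieving a 2-3x speed-up.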

Details Motivation: Diffusion models generate high-quality outputs but are slow because they depend on iterative sampling, so an efficient acceleration method that does not sacrifice fidelity is needed. Method: Feature invariance in deterministic sampling is analyzed to build a per-timestep, per-layer, per-module binary cache matrix, with quantile-based thresholds on the degree of change deciding whether to reuse; a re-sampling correction prevents error accumulation, and inference runs step-first, layer-wise caching. Result: Validated on DiT and FLUX, the method achieves 2-3x end-to-end acceleration with minimal impact on image quality and standard metrics. Conclusion: By reusing caches across timesteps and layers, InvarDiff markedly improves diffusion model inference efficiency without retraining while preserving generation quality. Abstract: Diffusion models deliver high-fidelity synthesis but remain slow due to iterative sampling. We empirically observe that feature invariance exists in deterministic sampling, and present InvarDiff, a training-free acceleration method that exploits the relative temporal invariance across timestep-scale and layer-scale. From a few deterministic runs, we compute a per-timestep, per-layer, per-module binary cache plan matrix and use a re-sampling correction to avoid drift when consecutive caches occur. Using quantile-based change metrics, this matrix specifies which module at which step is reused rather than recomputed. The same invariance criterion is applied at the step scale to enable cross-timestep caching, deciding whether an entire step can reuse cached results. During inference, InvarDiff performs step-first and layer-wise caching guided by this matrix. When applied to DiT and FLUX, our approach reduces redundant compute while preserving fidelity. Experiments show that InvarDiff achieves $2$-$3\times$ end-to-end speed-ups with minimal impact on standard quality metrics. Qualitatively, we observe almost no degradation in visual quality compared with full computations.
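One way to picture a quantile-based binary cache plan: measure how much each layer's output changes between consecutive timesteps during a few calibration runs, then mark as reusable every (timestep, layer) entry whose change falls at or below a chosen quantile. The sketch below follows that reading. The change metric, the single global threshold, the rule that the first step is always computed, and the omission of the re-sampling correction are all simplifying assumptions rather than details from the paper.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def cache_plan(change, q):
    """Build a binary plan: plan[t][l] is True when layer l at step t may
    reuse its cached output. change[t][l] is a hypothetical measure of how
    much that layer's output moved since step t - 1 in calibration runs."""
    later_steps = change[1:]
    threshold = quantile([c for row in later_steps for c in row], q)
    plan = [[False] * len(change[0])]  # step 0 has nothing cached yet
    for row in later_steps:
        plan.append([c <= threshold for c in row])
    return plan

# Hypothetical calibration measurements for 4 timesteps and 2 layers.
change = [[0.0, 0.0], [0.1, 0.9], [0.2, 0.8], [0.05, 0.7]]
plan = cache_plan(change, 0.5)
```

Raising `q` reuses more entries and saves more compute at the cost of more drift, which is the trade-off the re-sampling correction in the paper is meant to control.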

[33] Fine-tuning an ECG Foundation Model to Predict Coronary CT Angiography Outcomes

Yujie Xiao,Gongzhen Tang,Deyun Zhang,Jun Li,Guangkun Nie,Haoyu Wang,Shun Huang,Tong Liu,Qinghao Zhao,Kangyin Chen,Shenda Hong

Main category: cs.CV

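TL;DR: This study develops an interpretable AI electrocardiogram (AI-ECG) model that non-invasively predicts severe or complete stenosis of the four major coronary arteries on coronary CT angiography (CCTA), with good internal and external validation performance, and identifies ECG features associated with coronary stenosis.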

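Details Motivation: Coronary artery disease (CAD) is a major global health burden, and accurately identifying the culprit vessel and the degree of stenosis is essential for individualized treatment. Coronary CT angiography (CCTA) is the preferred non-invasive diagnostic method, but its reliance on high-end equipment, radiation exposure, and strict patient cooperation limits large-scale use, so a more convenient, low-cost screening tool is needed. Combining widely available electrocardiography (ECG) with artificial intelligence (AI) offers a new option for CAD screening. Method: An interpretable AI-ECG model is developed to predict severe or complete stenosis on CCTA of the right coronary artery (RCA), left main (LM), left anterior descending (LAD), and left circumflex (LCX) arteries. Performance is evaluated on internal and external validation sets, and stability is analyzed in a clinically normal-ECG subgroup and across population subgroups. Predictive ability is assessed through risk stratification based on vessel-specific incidence thresholds, and interpretability analysis identifies the ECG waveform regions that contribute to model decisions. Result: On the internal validation set the AUCs for RCA, LM, LAD, and LCX are 0.794, 0.818, 0.744, and 0.755; on the external validation set they are 0.749, 0.971, 0.667, and 0.727. Performance is stable in the clinically normal-ECG subgroup, indicating that it does not depend on overt ECG abnormalities, and subgroup analysis shows stability across demographic characteristics and acquisition times. Risk stratification separates groups clearly on calibration and cumulative event curves. Interpretability analysis reveals marked ECG waveform differences between high- and low-risk groups and highlights key electrophysiological regions associated with coronary stenosis. Conclusion: The interpretable AI-ECG model predicts severe stenosis of the four major coronary arteries with good generalization and robustness, retaining performance even in people with normal ECGs, and shows potential as a large-scale CAD screening tool. Its interpretability also offers new physiological insight into the link between ECG and coronary lesions, supporting clinical adoption of AI-assisted diagnosis. Abstract: Coronary artery disease (CAD) remains a major global health burden. Accurate identification of the culprit vessel and assessment of stenosis severity are essential for guiding individualized therapy. Although coronary CT angiography (CCTA) is the first-line non-invasive modality for CAD diagnosis, its dependence on high-end equipment, radiation exposure, and strict patient cooperation limits large-scale use. With advances in artificial intelligence (AI) and the widespread availability of electrocardiography (ECG), AI-ECG offers a promising alternative for CAD screening. In this study, we developed an interpretable AI-ECG model to predict severe or complete stenosis of the four major coronary arteries on CCTA. On the internal validation set, the model's AUCs for the right coronary artery (RCA), left main coronary artery (LM), left anterior descending artery (LAD), and left circumflex artery (LCX) were 0.794, 0.818, 0.744, and 0.755, respectively; on the external validation set, the AUCs reached 0.749, 0.971, 0.667, and 0.727, respectively. Performance remained stable in a clinically normal-ECG subset, indicating robustness beyond overt ECG abnormalities. Subgroup analyses across demographic and acquisition-time strata further confirmed model stability. Risk stratification based on vessel-specific incidence thresholds showed consistent separation on calibration and cumulative event curves. Interpretability analyses revealed distinct waveform differences between high- and low-risk groups, highlighting key electrophysiological regions contributing to model decisions and offering new insights into the ECG correlates of coronary stenosis.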
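For readers less familiar with the AUC figures quoted above: the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it directly from that definition. The function name and the toy labels and scores are illustrative, not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    labels are 0/1; ties between a positive and a negative count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: two patients without stenosis (0) and two with (1).
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives context for per-vessel values between roughly 0.67 and 0.97.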

[34] ChromouVQA: Benchmarking Vision-Language Models under Chromatic Camouflaged Images

Yunfei Zhang,Yizhuo He,Yuanxun Shao,Zhengtao Yao,Haoyan Xu,Junhao Dong,Zhen Yao,Zhikang Dong

Main category: cs.CV

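TL;DR: Introduces ChromouVQA, a large-scale multi-task benchmark built on Ishihara-style chromatic camouflaged images, for evaluating how well vision-language models recognize targets against complex backgrounds.

Details Motivation: Existing vision-language models struggle with figure-ground segregation in cluttered backgrounds, which calls for a more rigorous benchmark. Method: The authors build a dataset of Ishihara-style chromatic camouflaged images with multiple fill geometries and varied parameters, and design nine visual question-answering tasks; they also propose a model-agnostic contrastive learning recipe that aligns silhouettes with their camouflaged renderings to recover global shape. Result: Humans and models show large performance gaps under subtle chromatic contrast or disruptive geometric fills, and the proposed contrastive method improves shape recovery. Conclusion: ChromouVQA provides a compact, controllable, and reproducible evaluation platform for multimodal models and highlights the weakness of current VLMs in figure-ground segregation. Abstract: Vision-Language Models (VLMs) have advanced multimodal understanding, yet still struggle when targets are embedded in cluttered backgrounds requiring figure-ground segregation. To address this, we introduce ChromouVQA, a large-scale, multi-task benchmark based on Ishihara-style chromatic camouflaged images. We extend classic dot plates with multiple fill geometries and vary chromatic separation, density, size, occlusion, and rotation, recording full metadata for reproducibility. The benchmark covers nine vision-question-answering tasks, including recognition, counting, comparison, and spatial reasoning. Evaluations of humans and VLMs reveal large gaps, especially under subtle chromatic contrast or disruptive geometric fills. We also propose a model-agnostic contrastive recipe aligning silhouettes with their camouflaged renderings, improving recovery of global shapes. ChromouVQA provides a compact, controlled benchmark for reproducible evaluation and extension. Code and dataset are available at https://github.com/Chromou-VQA-Benchmark/Chromou-VQA.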
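
The abstract does not specify the contrastive objective. As a rough illustration only, the sketch below assumes an InfoNCE-style loss in which the embedding of silhouette i should be closest to the embedding of its own camouflaged rendering. The function name `info_nce`, the temperature value, and the toy embeddings are all hypothetical and not taken from the paper.

```python
import math

def info_nce(sil_embs, cam_embs, temperature=0.1):
    """Mean InfoNCE loss; sil_embs[i] and cam_embs[i] form a positive pair.

    Assumption: this is one common contrastive loss, not necessarily
    the recipe used by the ChromouVQA authors.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    sil = [normalize(v) for v in sil_embs]
    cam = [normalize(v) for v in cam_embs]
    total = 0.0
    for i, s in enumerate(sil):
        # Cosine similarity of silhouette i to every camouflaged rendering.
        logits = [sum(a * b for a, b in zip(s, c)) / temperature for c in cam]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sil)

# Toy check: correctly paired embeddings give a lower loss than swapped ones.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small only when each silhouette embedding is most similar to its own rendering, which is the alignment behavior the abstract describes.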

[35] Spatiotemporal Satellite Image Downscaling with Transfer Encoders and Autoregressive Generative Models

Yang Xiang,Jingwen Zhong,Yige Yan,Petros Koutrakis,Eric Garshick,Meredith Franklin

Main category: cs.CV

TL;DR: Proposes a transfer-learning generative downscaling framework that combines a lightweight U-Net encoder with a diffusion model to reconstruct fine-resolution images from coarse-resolution satellite data, achieving strong and physically consistent downscaling over a region of Asia.

Details Motivation: Downscaling long time series of coarse-resolution remote-sensing imagery suffers from limited training data and weak physical consistency, which calls for an efficient generative model that preserves spatiotemporal characteristics. Method: A lightweight U-Net serves as the transfer encoder: it is first pretrained on coarse-resolution MERRA-2 data to learn spatiotemporal representations, then its encoder is frozen and transferred into a diffusion-based generative model as physically meaningful latent features, reconstructing images from 50 km to 7 km. Result: R² ranges from 0.65 to 0.94 across seasons and subregions, outperforming deterministic U-Nets, variational autoencoders, and prior transfer-learning baselines; semivariograms, ACF/PACF, and lag-based RMSE/R² confirm that the generated images have physically consistent spatial variability and temporal autocorrelation. Conclusion: Transfer-enhanced diffusion models improve the downscaling quality of coarse-resolution imagery, generalize well, remain physically consistent, and suit long-term environmental monitoring and exposure assessment. Abstract: We present a transfer-learning generative downscaling framework to reconstruct fine resolution satellite images from coarse scale inputs. Our approach combines a lightweight U-Net transfer encoder with a diffusion-based generative model. The simpler U-Net is first pretrained on a long time series of coarse resolution data to learn spatiotemporal representations; its encoder is then frozen and transferred to a larger downscaling model as physically meaningful latent features. Our application uses NASA's MERRA-2 reanalysis as the low resolution source domain (50 km) and the GEOS-5 Nature Run (G5NR) as the high resolution target (7 km). Our study area included a large area in Asia, which was made computationally tractable by splitting into two subregions and four seasons. A domain similarity analysis using Wasserstein distances confirmed minimal distributional shift between MERRA-2 and G5NR, validating the safety of parameter-frozen transfer. Across seasonal regional splits, our model achieved excellent performance (R² = 0.65 to 0.94), outperforming comparison models including deterministic U-Nets, variational autoencoders, and prior transfer learning baselines. Out-of-data evaluations using semivariograms, ACF/PACF, and lag-based RMSE/R² demonstrated that the predicted downscaled images preserved physically consistent spatial variability and temporal autocorrelation, enabling stable autoregressive reconstruction beyond the G5NR record.
These results show that transfer enhanced diffusion models provide a robust and physically coherent solution for downscaling a long time series of coarse resolution images with limited training periods. This advancement has significant implications for improving environmental exposure assessment and long term environmental monitoring.
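
The domain similarity check mentioned in the abstract relies on Wasserstein distances. A minimal sketch of the one-dimensional Wasserstein-1 distance between two equal-size samples follows; the function name and the toy values are illustrative, and the paper's actual analysis (variables compared, dimensionality, sample handling) is not described in the abstract.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1D distributions.

    For equal-size samples this is the mean absolute difference
    between the sorted values (matching quantile to quantile).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

# Hypothetical pixel values from a coarse source and a fine target field.
same = wasserstein_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])     # identical distributions
shifted = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])  # shifted by one unit
```

A small distance between source and target value distributions is the kind of evidence the abstract cites for freezing the transferred encoder.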

[36] FlowEO: Generative Unsupervised Domain Adaptation for Earth Observation

Georges Le Bellier,Nicolas Audebert

Main category: cs.CV

TL;DR: FlowEO is a flow-matching-based generative framework for unsupervised domain adaptation (UDA) of Earth observation imagery; it handles distribution shifts across multi-source remote-sensing data and performs well on classification and semantic segmentation.

Details Motivation: Remote-sensing data come from diverse sources, and distribution shifts caused by sensors, spatiotemporal conditions, and environmental change limit the generalization of pretrained models, so effective UDA methods are needed. Method: FlowEO uses flow matching to learn a semantics-preserving mapping from the source to the target image distribution, enabling high-quality image translation and domain adaptation. Result: FlowEO is validated on four datasets covering SAR-to-optical translation and temporal and semantic shifts caused by natural disasters; it outperforms existing image translation methods while achieving on-par or better perceptual image quality. Conclusion: FlowEO demonstrates the potential of flow-matching-based UDA for remote sensing and offers a new approach to cross-domain use of large-scale Earth observation data. Abstract: The increasing availability of Earth observation data offers unprecedented opportunities for large-scale environmental monitoring and analysis. However, these datasets are inherently heterogeneous, stemming from diverse sensors, geographical regions, acquisition times, and atmospheric conditions. Distribution shifts between training and deployment domains severely limit the generalization of pretrained remote sensing models, making unsupervised domain adaptation (UDA) crucial for real-world applications. We introduce FlowEO, a novel framework that leverages generative models for image-space UDA in Earth observation. We leverage flow matching to learn a semantically preserving mapping that transports from the source to the target image distribution. This allows us to tackle challenging domain adaptation configurations for classification and semantic segmentation of Earth observation images. We conduct extensive experiments across four datasets covering adaptation scenarios such as SAR to optical translation and temporal and semantic shifts caused by natural disasters. Experimental results demonstrate that FlowEO outperforms existing image translation approaches for domain adaptation while achieving on-par or better perceptual image quality, highlighting the potential of flow-matching-based UDA for remote sensing.
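
As a hedged illustration of flow matching in general (not FlowEO's specific formulation, which the abstract does not detail), the sketch below builds one training example using the common straight-line interpolation path: a point between a source sample and a target sample, plus the constant velocity a network would be trained to predict there. All names and values are hypothetical.

```python
def flow_matching_pair(x_src, x_tgt, t):
    """Return (x_t, v_target) for a linear path from source to target.

    x_t = (1 - t) * x_src + t * x_tgt, and the target velocity along
    this path is x_tgt - x_src. Assumption: linear interpolation is one
    standard choice; FlowEO may use a different path or coupling.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x_src, x_tgt)]
    v_target = [b - a for a, b in zip(x_src, x_tgt)]
    return x_t, v_target

def velocity_mse(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)

# Toy "pixels" from a source-domain and a target-domain image.
x_t, v = flow_matching_pair([0.0, 2.0], [1.0, 4.0], t=0.5)
perfect = velocity_mse(v, v)
```

At inference, integrating the learned velocity field from a source image at t = 0 to t = 1 transports it toward the target distribution.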

[37] Self-Improving VLM Judges Without Human Annotations

Inna Wanyin Lin,Yushi Hu,Shuyue Stella Li,Scott Geng,Pang Wei Koh,Luke Zettlemoyer,Tim Althoff,Marjan Ghazvininejad

Main category: cs.CV

TL;DR: Proposes a self-training framework that trains vision-language model (VLM) judges without human preference annotations, iteratively improving judge performance with self-generated data; the resulting judge often outperforms much larger models, including GPT-4o and Claude 3.5 Sonnet.

Details Motivation: Existing VLM judges depend on large-scale human preference annotations, which are costly and quickly become outdated as models improve. Method: A three-stage iterative self-training framework: (1) generate multimodal instruction-response pairs at varying quality levels; (2) generate reasoning traces and judgments for each pair and filter out those that do not match the expected quality levels; (3) train on the correct judgments and their reasoning traces. Result: On VL-RewardBench, the overall accuracy of a Llama-3.2-11B multimodal judge rises from 0.38 to 0.51, often outperforming much larger models such as Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet, with particularly strong gains in the general, hallucination, and reasoning dimensions. Conclusion: Self-trained judges that need no human annotation show strong potential and could keep evolving as VLM capabilities rapidly improve. Abstract: Effective judges of Vision-Language Models (VLMs) are crucial for model development. Current methods for training VLM judges mainly rely on large-scale human preference annotations. However, such an approach is costly, and the annotations easily become obsolete as models rapidly improve. In this work, we present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data. Our method is iterative and has three stages: (1) generate diverse multimodal instruction-response pairs at varying quality levels, (2) generate reasoning traces and judgments for each pair, removing the ones that do not match our expected quality levels, and (3) train on correct judge answers and their reasoning traces. We evaluate the resulting judge on Multimodal RewardBench and VL-RewardBench across domains: correctness, preference, reasoning, safety, and visual question-answering. Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on VL-RewardBench, often outperforming much larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet, with particularly strong gains in general, hallucination, and reasoning dimensions. The overall strength of these human-annotation-free results suggests the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.
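
Stage (2) of the method is a filtering step: judgments are kept only when the verdict agrees with the quality level the pair was generated at. The sketch below shows that filter in isolation. The `Judgment` record, the field names, and the "A"/"B" labels are hypothetical; the generation and training stages are omitted.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    pair_id: str     # which instruction-response pair was judged
    reasoning: str   # the judge's reasoning trace
    preferred: str   # the judge's verdict, "A" or "B"

def filter_judgments(judgments, expected):
    """Keep judgments whose verdict matches the expected better response.

    `expected` maps pair_id to the response known to be better by
    construction (it was generated at a higher quality level).
    """
    return [j for j in judgments if expected.get(j.pair_id) == j.preferred]

judgments = [
    Judgment("p1", "A cites the visible sign correctly.", "A"),
    Judgment("p2", "B seems more fluent.", "B"),
]
expected = {"p1": "A", "p2": "A"}  # hypothetical construction-time labels
kept = filter_judgments(judgments, expected)
```

Only the surviving judgments and their reasoning traces would feed the next training round, which is what lets the loop run without human labels.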

[38] TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

Zhenglin Cheng,Peng Sun,Jianguo Li,Tao Lin

Main category: cs.CV

TL;DR: Proposes TwinFlow, an efficient one-step generation framework for multimodal generation that needs neither a pretrained teacher model nor adversarial training and substantially improves inference efficiency.

Details Motivation: Existing multi-step generative models (such as diffusion models) are slow at inference, while existing acceleration methods require iterative distillation, degrade in quality, or train unstably. Method: TwinFlow trains one-step generative models end to end while avoiding fixed pretrained teacher models and standard adversarial networks. Result: On text-to-image tasks, it achieves a GenEval score of 0.83 at 1-NFE, outperforming baselines such as SANA-Sprint and RCGM; with only 1-NFE it matches the original 100-NFE model on GenEval and DPG-Bench, cutting computational cost by 100×. Conclusion: TwinFlow is a simple and efficient one-step generation framework that scales well and suits building large, efficient generative models. Abstract: Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 Number of Function Evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B, transforming it into an efficient few-step generator.
With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by 100× with minor quality degradation. The project page is available at https://zhenglin-cheng.com/twinflow.

[39] EFDiT: Efficient Fine-grained Image Generation Using Diffusion Transformer Models

Kun Wang,Donglin Di,Tonghua Su,Lei Fan

Main category: cs.CV

TL;DR: Proposes a fine-grained image generation method based on a tiered embedder and a super-resolution mechanism; it combines superclass and subclass semantics to reduce semantic entanglement, enhances detail, and outperforms existing fine-tuning methods on public benchmarks.

Details Motivation: Existing diffusion models for fine-grained image generation tend to focus on common categories and suffer from entangled semantic information and insufficient detail. Method: A tiered embedder integrates semantic information from superclasses and subclasses; a super-resolution mechanism enhances detail during the perceptual information generation stage; and an efficient ProAttention mechanism improves diffusion model performance. Result: Experiments on multiple public benchmarks show that the method outperforms existing state-of-the-art fine-tuning methods in generation quality and detail. Conclusion: The method reduces semantic entanglement in fine-grained image generation and sharpens detail, offering a new approach to applying diffusion models to fine-grained categories. Abstract: Diffusion models are highly regarded for their controllability and the diversity of images they generate. However, class-conditional generation methods based on diffusion models often focus on more common categories. In large-scale fine-grained image generation, issues of semantic information entanglement and insufficient detail in the generated images still persist. This paper introduces a tiered embedder for fine-grained image generation, which integrates semantic information from both super and child classes, allowing the diffusion model to better incorporate semantic information and address the issue of semantic entanglement. To address the issue of insufficient detail in fine-grained images, we introduce the concept of super-resolution during the perceptual information generation stage, enhancing the detailed features of fine-grained images through enhancement and degradation models. Furthermore, we propose an efficient ProAttention mechanism that can be effectively implemented in the diffusion model. We evaluate our method through extensive experiments on public benchmarks, demonstrating that our approach outperforms other state-of-the-art fine-tuning methods in terms of performance.
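
The abstract says the tiered embedder integrates super-class and child-class semantics but does not say how. One simple possibility, shown here purely as an assumption, is to add a shared superclass embedding to a per-subclass embedding so that sibling classes share a common component. The class `TieredEmbedder`, the additive combination, and the random initialization are all hypothetical.

```python
import random

class TieredEmbedder:
    """Toy conditioning embedder: superclass vector + subclass vector.

    Assumption: additive fusion is a guess for illustration; the
    paper's embedder may combine the two levels differently.
    """

    def __init__(self, n_super, n_child, dim, child_to_super, seed=0):
        rng = random.Random(seed)
        self.super_table = [[rng.gauss(0, 1) for _ in range(dim)]
                            for _ in range(n_super)]
        self.child_table = [[rng.gauss(0, 1) for _ in range(dim)]
                            for _ in range(n_child)]
        self.child_to_super = child_to_super  # child id -> superclass id

    def __call__(self, child_id):
        sup = self.super_table[self.child_to_super[child_id]]
        child = self.child_table[child_id]
        return [s + c for s, c in zip(sup, child)]

# Two hypothetical bird species (children 0 and 1) share superclass 0.
embedder = TieredEmbedder(n_super=1, n_child=2, dim=4, child_to_super=[0, 0])
e0, e1 = embedder(0), embedder(1)
```

With this construction, siblings differ only through their child-level vectors, which is one way to keep class-specific detail separate from shared superclass semantics.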

[40] Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning

Wentao Wang,Chunyang Liu,Kehua Sheng,Bo Zhang,Yan Wang

Main category: cs.CV

TL;DR: Proposes Semore, a VLM-based visual reinforcement learning framework that extracts semantic and motion representations simultaneously through a dual-path backbone, using the VLM's common-sense knowledge and CLIP for text-image alignment to strengthen the representations.

Details Motivation: Existing LLM-based reinforcement learning methods focus mainly on policy guidance and are limited by the representational capacity of the backbone network, so they struggle to fuse semantic and motion information. Method: A dual-path backbone extracts semantic and motion representations from the RGB flows; a VLM retrieves key semantic information from observations, and a pre-trained CLIP provides text-image alignment; the semantic and motion branches are trained jointly with separate supervision while being allowed to interact spontaneously. Result: Experiments show that, with VLM guidance at the feature level, Semore is more adaptive and efficient than current state-of-the-art methods. Conclusion: By introducing a VLM and a dual-path structure, Semore achieves better semantic-motion fusion in visual reinforcement learning and improves the agent's decision-making. Abstract: The growing exploration of Large Language Models (LLMs) and Vision-Language Models (VLMs) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on the guidance of control policy and encounter the challenge of limited representations of the backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL, which can simultaneously extract semantic and motion representations through a dual-path backbone from the RGB flows. Semore utilizes a VLM with common-sense knowledge to retrieve key information from observations, while using a pre-trained CLIP model to achieve the text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our method adopts a separately supervised approach to simultaneously guide the extraction of semantics and motion, while allowing them to interact spontaneously. Extensive experiments demonstrate that, under the guidance of VLM at the feature level, our method exhibits efficient and adaptive ability compared to state-of-the-art methods. All code is released.

[41] Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models

Rowan Bradbury,Dazhi Zhong

Main category: cs.CV

TL;DR: Proposes Pixel-Equivalent Latent Compositing (PELC), a new approach to latent-space inpainting; the DecFormer model enables full-resolution mask control and true soft-edge alpha compositing, substantially reduces the edge errors of standard mask interpolation, and is compatible with existing diffusion pipelines, requiring no backbone finetuning and adding only a small amount of parameters and compute.

Details Motivation: Existing methods that linearly interpolate VAE latents for inpainting produce visible seam artifacts, global degradation, and color shifts, and cannot match pixel-space compositing, so a more accurate latent fusion mechanism is needed. Method: The PELC principle requires latent compositing to match the result of pixel-space compositing; DecFormer, a 7.7M-parameter transformer, predicts per-channel blend weights and an off-manifold residual correction; it is trained so that decoding after fusion matches pixel-space alpha compositing. Result: On the FLUX.1 family, DecFormer reduces error metrics around edges by up to 53% and restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking; used as an inpainting prior with a lightweight LoRA, it reaches fidelity comparable to the fully finetuned FLUX.1-Fill model; a complex color-correction task demonstrates PELC's generality. Conclusion: DecFormer achieves high-quality latent fusion consistent with the PELC principle, resolves the artifacts caused by conventional linear interpolation, and offers an efficient, general, plug-and-play solution for latent-space editing in diffusion models. Abstract: Latent inpainting in diffusion models still relies almost universally on linearly interpolating VAE latents under a downsampled mask. We propose a key principle for compositing image latents: Pixel-Equivalent Latent Compositing (PELC). A pixel-equivalent latent compositor should produce the same result as compositing in pixel space. This principle enables full-resolution mask control and true soft-edge alpha compositing, even though VAEs compress images 8x spatially. Modern VAEs capture global context beyond patch-aligned local structure, so linear latent blending cannot be pixel-equivalent: it produces large artifacts at mask seams and global degradation and color shifts. We introduce DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and an off-manifold residual correction to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning and adds only 0.07% of FLUX.1-Dev's parameters and 3.5% FLOP overhead. On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model.
While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing, as we demonstrate on a complex color-correction task.
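
To make the baseline concrete, the sketch below shows the two ingredients of standard latent blending that the abstract criticizes: per-element alpha compositing, and average-pooling a full-resolution mask by the VAE's 8x spatial factor so it can be applied to latents. It does not implement DecFormer; the function names and the toy mask are illustrative assumptions.

```python
def alpha_composite(fg, bg, alpha):
    """Per-element blend: alpha * fg + (1 - alpha) * bg on 2D grids."""
    return [[a * f + (1.0 - a) * b for f, b, a in zip(fr, br, ar)]
            for fr, br, ar in zip(fg, bg, alpha)]

def downsample_mask(mask, factor=8):
    """Average-pool a pixel mask to latent resolution."""
    h, w = len(mask), len(mask[0])
    out = []
    for i in range(0, h, factor):
        row = []
        for j in range(0, w, factor):
            block = [mask[y][x]
                     for y in range(i, min(i + factor, h))
                     for x in range(j, min(j + factor, w))]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# An 8x16 pixel mask with a sharp edge 4 pixels in from the left.
mask = [[1.0 if x < 4 else 0.0 for x in range(16)] for _ in range(8)]
latent_mask = downsample_mask(mask)            # one value per 8x8 block
blended = alpha_composite([[1.0]], [[0.0]], [[0.25]])
```

The sharp pixel edge becomes a single 0.5 weight at latent resolution, so applying `alpha_composite` to latents with `latent_mask` cannot reproduce the pixel-space edge. That loss of mask detail is the gap PELC is meant to close.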

[42] DEAR: Dataset for Evaluating the Aesthetics of Rendering

Vsevolod Plohotnuk,Artyom Panshin,Nikola Banić,Simone Bianco,Michael Freeman,Egor Ershov

Main category: cs.CV

TL;DR: Introduces DEAR, a new benchmark dataset for modeling human aesthetic judgments of image rendering styles, filling a gap left by traditional image quality assessment.

Details Motivation: Because datasets reflecting subjective style preferences have been lacking, aesthetic evaluation of image rendering has long been underexplored; advancing the area requires a systematic dataset grounded in subjective human preferences. Method: DEAR is built on the MIT-Adobe FiveK dataset, with pairwise human preference scores collected through large-scale crowdsourcing; each image pair is rated by 25 distinct evaluators, with 13,648 participants in total. Voting patterns are analyzed and several use cases are outlined. Result: DEAR is, to the authors' knowledge, the first dataset to systematically evaluate image rendering aesthetics based on subjective human preferences, supporting style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling. A subset of 100 annotated images has been published on HuggingFace. Conclusion: DEAR provides a reliable data foundation for evaluating rendering aesthetics and supports the shift from traditional distortion-based assessment to subjective aesthetic assessment. Abstract: Traditional Image Quality Assessment (IQA) focuses on quantifying technical degradations such as noise, blur, or compression artifacts, using both full-reference and no-reference objective metrics. However, evaluation of rendering aesthetics, a growing domain relevant to photographic editing, content creation, and AI-generated imagery, remains underexplored due to the lack of datasets that reflect the inherently subjective nature of style preference. In this work, a novel benchmark dataset designed to model human aesthetic judgments of image rendering styles is introduced: the Dataset for Evaluating the Aesthetics of Rendering (DEAR). Built upon the MIT-Adobe FiveK dataset, DEAR incorporates pairwise human preference scores collected via large-scale crowdsourcing, with each image pair evaluated by 25 distinct human evaluators and 13,648 evaluators participating in total. These annotations capture nuanced, context-sensitive aesthetic preferences, enabling the development and evaluation of models that go beyond traditional distortion-based IQA, focusing on a new task: Evaluation of Aesthetics of Rendering (EAR). The data collection pipeline is described, human voting patterns are analyzed, and multiple use cases are outlined, including style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling. To the best of the authors' knowledge, DEAR is the first dataset to systematically address the assessment of image rendering aesthetics grounded in subjective human preferences. A subset of 100 images with their annotations is published on HuggingFace (huggingface.co/datasets/vsevolodpl/DEAR).
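
The abstract describes pairwise preference scores from 25 evaluators per image pair. One plausible way to turn raw votes into a score, assumed here for illustration, is the fraction of evaluators preferring rendering A. The function name and the "A"/"B" vote labels are hypothetical and not DEAR's published schema.

```python
from collections import Counter

def preference_score(votes):
    """Fraction of votes preferring rendering 'A' over rendering 'B'.

    Assumption: a simple vote share; DEAR's actual score definition
    is not given in the abstract.
    """
    counts = Counter(votes)
    total = counts["A"] + counts["B"]
    if total == 0:
        raise ValueError("no valid votes")
    return counts["A"] / total

# Hypothetical pair rated by 25 evaluators, 20 of whom prefer A.
score = preference_score(["A"] * 20 + ["B"] * 5)
```

A score near 0.5 would flag a pair where evaluators disagree, which is where personalized aesthetic modeling becomes relevant.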

[43] IE2Video: Adapting Pretrained Diffusion Models for Event-Based Video Reconstruction

Dmitrii Torbunov,Onur Okuducu,Yi Huang,Odera Dim,Rebecca Coles,Yonggang Cui,Yihui Ren

Main category: cs.CV

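TL;DR: Proposes a hybrid video capture paradigm that combines sparse RGB keyframes with the continuous event stream of an event camera and reconstructs full RGB video offline, lowering power consumption while keeping standard video output. The authors define the IE2Video task and compare an autoregressive architecture with a diffusion-based one; experiments show the diffusion model is markedly better in perceptual quality.

Details Motivation: Conventional RGB cameras consume substantial power because of fixed-rate capture, which limits continuous video monitoring; event cameras are low-power but output non-standard asynchronous event streams. A new approach is needed that saves energy while still producing standard RGB video. Method: A hybrid capture paradigm records an initial RGB frame and the subsequent event stream, then reconstructs video offline. For the IE2Video task, two approaches are explored: adapting the autoregressive model HyperE2VID for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and LoRA. Result: The diffusion approach improves perceptual quality by 33% over the autoregressive baseline (LPIPS 0.283 vs 0.422) and generalizes well across several event camera datasets (BS-ERGB, HS-ERGB far/close) and sequence lengths (32-128 frames), including unseen capture configurations. Conclusion: The hybrid paradigm combining an event camera with sparse keyframes lowers capture power, and the diffusion model clearly outperforms the autoregressive approach on IE2Video, showing good practical potential and generalization. Abstract: Continuous video monitoring in surveillance, robotics, and wearable systems faces a fundamental power constraint: conventional RGB cameras consume substantial energy through fixed-rate capture. Event cameras offer sparse, motion-driven sensing with low power consumption, but produce asynchronous event streams rather than RGB video. We propose a hybrid capture paradigm that records sparse RGB keyframes alongside continuous event streams, then reconstructs full RGB video offline -- reducing capture power consumption while maintaining standard video output for downstream applications. We introduce the Image and Event to Video (IE2Video) task: reconstructing RGB video sequences from a single initial frame and subsequent event camera data. We investigate two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation. Our experiments demonstrate that the diffusion-based approach achieves 33% better perceptual quality than the autoregressive baseline (0.283 vs 0.422 LPIPS). We validate our approach across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32-128 frames), demonstrating robust cross-dataset generalization with strong performance on unseen capture configurations.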
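
The 33% figure follows from the two reported LPIPS values if it is read as a relative reduction (LPIPS is a distance, so lower is better). The short check below reproduces that arithmetic; the function name is illustrative, and the interpretation as a relative reduction is an assumption consistent with the numbers.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of a lower-is-better metric versus a baseline."""
    return (baseline - improved) / baseline

# Reported LPIPS: 0.422 (autoregressive baseline) vs 0.283 (diffusion).
gain = relative_reduction(0.422, 0.283)
gain_percent = round(gain * 100)
```

The result is roughly 0.329, which rounds to the 33% quoted in the abstract.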

[44] Age-Inclusive 3D Human Mesh Recovery for Action-Preserving Data Anonymization

Georgios Chatzichristodoulou,Niki Efthymiou,Panagiotis Filntisis,Georgios Pavlakos,Petros Maragos

Main category: cs.CV

TL;DR: Proposes the AionHMR framework, which uses the SMPL-A model for unified 3D pose and shape estimation of adults, children, and infants, produces privacy-preserving 3D reconstructions, and markedly improves generalization to children.

Details Motivation: Existing 3D human reconstruction methods generalize poorly to children and infants, raise privacy concerns, and lack inclusiveness. Method: An optimization-based method extends a top-performing model with the SMPL-A body model and generates pseudo-ground-truth annotations for child and infant image databases; a transformer-based deep learning model is then trained for real-time 3D reconstruction across all ages. Result: 3D pose and shape estimation improves significantly for children and infants without loss of accuracy on adults; the privacy-preserving 3D-BabyRobot dataset is introduced. Conclusion: The work closes a domain gap in 3D human modeling across all ages and lays a foundation for inclusive, privacy-aware, and age-diverse human reconstruction. Abstract: While three-dimensional (3D) shape and pose estimation is a highly researched area that has yielded significant advances, the resulting methods, despite performing well for the adult population, generally fail to generalize effectively to children and infants. This paper addresses this challenge by introducing AionHMR, a comprehensive framework designed to bridge this domain gap. We propose an optimization-based method that extends a top-performing model by incorporating the SMPL-A body model, enabling the concurrent and accurate modeling of adults, children, and infants. Leveraging this approach, we generated pseudo-ground-truth annotations for publicly available child and infant image databases. Using these new training data, we then developed and trained a specialized transformer-based deep learning model capable of real-time 3D age-inclusive human reconstruction. Extensive experiments demonstrate that our methods significantly improve shape and pose estimation for children and infants without compromising accuracy on adults. Importantly, our reconstructed meshes serve as privacy-preserving substitutes for raw images, retaining essential action, pose, and geometry information while enabling anonymized dataset release. As a demonstration, we introduce the 3D-BabyRobot dataset, a collection of action-preserving 3D reconstructions of children interacting with robots. This work bridges a crucial domain gap and establishes a foundation for inclusive, privacy-aware, and age-diverse 3D human modeling.

[45] CARD: Correlation Aware Restoration with Diffusion

Niki Nezakati,Arnab Ghosh,Amit Roy-Chowdhury,Vishwanath Saragadam

Main category: cs.CV

TL;DR: Proposes CARD, a new method for handling the spatially correlated noise of real sensors in denoising diffusion models; by whitening the noise and modifying the restoration steps, it outperforms existing methods on several image restoration tasks.

Details Motivation: Most existing denoising diffusion models assume i.i.d. Gaussian noise, whereas real sensor noise is spatially correlated, which limits practical effectiveness. Method: CARD first whitens the noisy observation so that the noise becomes i.i.d., then uses noise-whitened updates in the diffusion restoration steps, inheriting DDRM's efficient sampling while handling correlated noise. Result: Experiments on standard datasets with synthetic correlated noise and on the real correlated-noise dataset CIN-D show that CARD consistently outperforms existing methods on denoising, deblurring, and super-resolution. Conclusion: CARD is a training-free extension of DDRM that effectively handles correlated noise and improves diffusion-based image restoration in real-world settings. Abstract: Denoising diffusion models have achieved state-of-the-art performance in image restoration by modeling the process as sequential denoising steps. However, most approaches assume independent and identically distributed (i.i.d.) Gaussian noise, while real-world sensors often exhibit spatially correlated noise due to readout mechanisms, limiting their practical effectiveness. We introduce Correlation Aware Restoration with Diffusion (CARD), a training-free extension of DDRM that explicitly handles correlated Gaussian noise. CARD first whitens the noisy observation, which converts the noise into an i.i.d. form. Then, the diffusion restoration steps are replaced with noise-whitened updates, which inherit DDRM's closed-form sampling efficiency while now being able to handle correlated noise. To emphasize the importance of addressing correlated noise, we contribute CIN-D, a novel correlated noise dataset captured across diverse illumination conditions to evaluate restoration methods on real rolling-shutter sensor noise. This dataset fills a critical gap in the literature for experimental evaluation with real-world correlated noise. Experiments on standard benchmarks with synthetic correlated noise and on CIN-D demonstrate that CARD consistently outperforms existing methods across denoising, deblurring, and super-resolution tasks.
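
The whitening step can be illustrated with a standard construction: if the noise covariance factors as Σ = L Lᵀ (Cholesky), then multiplying an observation by L⁻¹ turns noise with covariance Σ into noise with identity covariance. The sketch below is a generic pure-Python version on a tiny example, assuming a known covariance; it is not CARD's implementation, and the function names are illustrative.

```python
import math

def cholesky(cov):
    """Lower-triangular L with L @ L^T == cov (cov must be SPD)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def whiten(y, L):
    """Solve L z = y by forward substitution, i.e. z = L^{-1} y."""
    z = []
    for i in range(len(y)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((y[i] - s) / L[i][i])
    return z

# Hypothetical 2-pixel noise covariance with correlation between pixels.
cov = [[4.0, 2.0], [2.0, 5.0]]
L = cholesky(cov)
z = whiten([2.0, 5.0], L)
```

For a linear observation y = Hx + n, the same transform gives L⁻¹y = (L⁻¹H)x + white noise, so a restoration method that assumes i.i.d. noise can operate on the whitened problem.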

[46] Inferring Compositional 4D Scenes without Ever Seeing One

Ahmet Berke Gokmen,Ajad Chhatkuli,Luc Van Gool,Danda Pani Paudel

Main category: cs.CV

TL;DR: Proposes COM4D, which learns spatial and temporal attention separately and jointly reconstructs multi-object 4D scenes from monocular video without any 4D compositional training data.

Details Motivation: Existing methods usually rely on category-specific parametric models and handle only one object at a time, which leads to inconsistent scene configurations, restricts them to particular object categories, and makes it hard to capture complex dynamic scenes faithfully. Method: A disentangled training strategy separately learns spatial attention from multi-object compositions and temporal attention from single-object dynamics; at inference, an attention mixing mechanism combines the two, alternating between spatial and temporal reasoning to reconstruct the 4D scene jointly. Result: Without 4D compositional supervision, the method reconstructs complete and persistent 4D scenes with multiple interacting objects and achieves state-of-the-art performance on 4D object and composed 3D reconstruction. Conclusion: COM4D is a purely data-driven method that needs no 4D compositional annotations and can reconstruct the 4D structure and spatio-temporal configuration of multiple objects in complex real scenes. Abstract: Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.

[47] From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Kevin Cannons,Saeed Ranjbar Alvar,Mohammad Asiful Hossain,Ahmad Rezaei,Mohsen Gholami,Alireza Heidarikhazaei,Zhou Weimin,Yong Zhang,Mohammad Akbari

Main category: cs.CV

TL;DR: Introduces TAD, a benchmark focused on temporal understanding in autonomous driving (AD), with nearly 6,000 question-answer pairs and 7 human-designed tasks, to evaluate how well vision-language models (VLMs) capture the dynamic relationships between actions in AD scenes. Experiments show that current state-of-the-art models perform poorly, mainly because of weak fine-grained motion understanding. The paper proposes two training-free solutions, Scene-CoT and TCogMap, which raise average accuracy on TAD by up to 17.72%.

Details Motivation: Existing video temporal reasoning datasets mostly cover sports, cooking, and movies, and no dedicated benchmark addresses the distinctive temporal understanding challenges of ego-centric autonomous driving video, so an AD-specific benchmark is needed. Method: The TAD benchmark contains about 6,000 QA samples across 7 tasks; 9 generalist and AD-specialist state-of-the-art models are evaluated; two training-free methods are proposed to strengthen VLM temporal understanding: Scene-CoT (using Chain-of-Thought) and TCogMap (introducing an ego-centric temporal cognitive map). Result: Current state-of-the-art models perform poorly on TAD, exposing deficiencies in fine-grained motion understanding; Scene-CoT and TCogMap improve average accuracy on TAD by up to 17.72%. Conclusion: TAD fills the gap in temporal understanding evaluation for autonomous driving, exposes the limitations of existing models, and advances the area through training-free enhancements, with the aim of stimulating future research on temporal reasoning in AD. Abstract: Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that consists of 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT), and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD.
The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

[48] SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling

Elisabetta Fedele,Francis Engelmann,Ian Huang,Or Litany,Marc Pollefeys,Leonidas Guibas

Main category: cs.CV

TL;DR: Presents SpaceControl, a training-free test-time method for explicit spatial control of 3D generation; it supports geometric inputs ranging from coarse primitives to detailed meshes and achieves high geometric fidelity while preserving visual quality.

Details Motivation: Existing 3D generation methods driven by text or image prompts fall short in geometric precision and intuitive user control, and lack a means of fine-grained control over object geometry. Method: SpaceControl is a training-free test-time method that accepts a range of geometric inputs (such as superquadrics and meshes) and integrates them into a pre-trained generative model, with an adjustable parameter, for explicit spatial control of 3D generation. Result: Experiments show that SpaceControl outperforms training-based and optimization-based methods in geometric faithfulness while preserving high visual quality; user studies and quantitative evaluation confirm its effectiveness. Conclusion: SpaceControl offers an efficient, intuitive, and precise way to control geometry in 3D asset generation, supports interactive editing, and can be integrated into creative workflows. Abstract: Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are cumbersome to edit. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern pre-trained generative models without requiring any additional training. A controllable parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive user interface that enables online editing of superquadrics for direct conversion into textured 3D assets, facilitating practical deployment in creative workflows. Find our project page at https://spacecontrol3d.github.io/
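
Superquadrics, the coarse primitives mentioned in the abstract, have a compact implicit form. The sketch below evaluates the standard inside-outside function for an axis-aligned superquadric centered at the origin: values below 1 are inside, 1 is on the surface, above 1 is outside. The parameter names (`a1`, `a2`, `a3` for scale, `e1`, `e2` for shape exponents) follow common textbook notation and are not taken from the SpaceControl code.

```python
def superquadric_f(x, y, z, a1=1.0, a2=1.0, a3=1.0, e1=1.0, e2=1.0):
    """Inside-outside function of an axis-aligned superquadric.

    F < 1: inside, F == 1: on the surface, F > 1: outside.
    e1 = e2 = 1 gives an ellipsoid; smaller exponents make it boxier.
    """
    xy = abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + abs(z / a3) ** (2.0 / e1)

# With unit scales and exponents the primitive is the unit sphere.
on_surface = superquadric_f(1.0, 0.0, 0.0)
center = superquadric_f(0.0, 0.0, 0.0)
outside = superquadric_f(2.0, 0.0, 0.0)
```

Because a handful of parameters describes the whole shape, such primitives are easy to edit interactively, which fits the user interface the abstract describes.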

[49] SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training

Yang Zheng,Hao Tan,Kai Zhang,Peng Wang,Leonidas Guibas,Gordon Wetzstein,Wang Yifan

Main category: cs.CV

TL;DR: Proposes a continuous editing method for 3D Gaussian splats based on a state-aware feedforward model; it supports efficient, precise local and global edits from user-provided 2D views for interactive 3D content authoring.

Details Motivation: Existing diffusion-based or optimization-based methods fall short in speed, preservation of the original asset, and fine-grained control, leaving no effective tool for interactive editing of 3D Gaussian splats. Method: A state-aware feedforward model directly predicts updates to the attributes of a compact, feature-rich Gaussian representation and uses Test-Time Training to create an iterative editing workflow. Result: The method supports high-fidelity local detail refinement, local paint-over, and consistent global recoloring, all at interactive speeds while preserving the identity of the original asset. Conclusion: With a single architecture, the method enables fast, non-destructive, fine-grained editing of 3D Gaussian assets, advancing fluid and intuitive 3D content authoring. Abstract: The rise of 3D Gaussian Splatting has revolutionized photorealistic 3D asset creation, yet a critical gap remains for their interactive refinement and editing. Existing approaches based on diffusion or optimization are ill-suited for this task, as they are often prohibitively slow, destructive to the original asset's identity, or lack the precision for fine-grained control. To address this, we introduce SplatPainter, a state-aware feedforward model that enables continuous editing of 3D Gaussian assets from user-provided 2D view(s). Our method directly predicts updates to the attributes of a compact, feature-rich Gaussian representation and leverages Test-Time Training to create a state-aware, iterative workflow. The versatility of our approach allows a single architecture to perform diverse tasks, including high-fidelity local detail refinement, local paint-over, and consistent global recoloring, all at interactive speeds, paving the way for fluid and intuitive 3D content authoring.

[50] Group Orthogonal Low-Rank Adaptation for RGB-T Tracking

Zekai Shao,Yufan Hu,Jingyuan Liu,Bin Fan,Hongmin Liu

Main category: cs.CV

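TL;DR: Proposes the Group Orthogonal Low-Rank Adaptation (GOLA) framework for RGB-T tracking, which uses structured parameter learning to exploit the rank space effectively, reducing parameter redundancy and strengthening feature representation.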

Details Motivation: Low-rank adaptation in RGB-T tracking suffers from redundancy in the rank space: many ranks contribute almost no useful information, which limits the model's ability to learn diverse knowledge. Method: A rank decomposition partitioning strategy based on singular value decomposition quantifies rank importance; crucial ranks are frozen to preserve pretrained priors, and redundant ranks are grouped and subjected to an inter-group orthogonal constraint so that the groups learn complementary features. Result: Experiments show that GOLA significantly outperforms state-of-the-art methods on four benchmark datasets, effectively reducing parameter redundancy and improving feature representation. Conclusion: Through structured orthogonal constraints, GOLA improves how efficiently low-rank adaptation uses its parameters, strengthening adaptation to multimodal tracking challenges while keeping the number of tunable parameters small. Abstract: Parameter-efficient fine-tuning has emerged as a promising paradigm in RGB-T tracking, enabling downstream task adaptation by freezing pretrained parameters and fine-tuning only a small set of parameters. This set forms a rank space made up of multiple individual ranks, whose expressiveness directly shapes the model's adaptability. However, quantitative analysis reveals low-rank adaptation exhibits significant redundancy in the rank space, with many ranks contributing almost no practical information. This hinders the model's ability to learn more diverse knowledge to address the various challenges in RGB-T tracking. To address this issue, we propose the Group Orthogonal Low-Rank Adaptation (GOLA) framework for RGB-T tracking, which effectively leverages the rank space through structured parameter learning. Specifically, we adopt a rank decomposition partitioning strategy utilizing singular value decomposition to quantify rank importance, freeze crucial ranks to preserve the pretrained priors, and cluster the redundant ranks into groups to prepare for subsequent orthogonal constraints. We further design an inter-group orthogonal constraint strategy. This constraint enforces orthogonality between rank groups, compelling them to learn complementary features that target diverse challenges, thereby alleviating information redundancy. Experimental results demonstrate that GOLA effectively reduces parameter redundancy and enhances feature representation capabilities, significantly outperforming state-of-the-art methods across four benchmark datasets and validating its effectiveness in RGB-T tracking tasks.
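
The inter-group orthogonal constraint can be illustrated with a small sketch. The abstract does not give the exact loss, so the form below (sum of squared dot products between rank vectors from different groups) is an assumption, and the function names are hypothetical rather than taken from GOLA.

```python
from itertools import combinations


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def inter_group_orthogonality_penalty(groups):
    """Assumed penalty: squared dot products between vectors of *different* groups.

    `groups` is a list of groups; each group is a list of rank vectors.
    Vectors inside the same group are not penalized against each other.
    """
    penalty = 0.0
    for g1, g2 in combinations(groups, 2):
        for u in g1:
            for v in g2:
                penalty += dot(u, v) ** 2
    return penalty


# Two groups whose vectors are mutually orthogonal across groups.
orthogonal = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]]
# Two groups that share a direction, so the penalty is positive.
overlapping = [[[1.0, 0.0, 0.0]], [[1.0, 1.0, 0.0]]]

print(inter_group_orthogonality_penalty(orthogonal))
print(inter_group_orthogonality_penalty(overlapping))
```

Minimizing a penalty of this kind pushes different groups toward non-overlapping directions, which matches the stated goal of having groups learn complementary features.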

[51] PoolNet: Deep Learning for 2D to 3D Video Process Validation

Sanchit Kaul,Joseph Luna,Shray Arora

Main category: cs.CV

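TL;DR: Proposes PoolNet, a deep learning framework for frame-level and scene-level validation of in-the-wild data that efficiently separates scenes suitable for SfM processing from unsuitable ones and significantly reduces processing time.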

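Details Motivation: Because of inadequate camera pose variation, occluding scene elements, and noise, much publicly available image data cannot be used effectively for Structure-from-Motion (SfM) processing, and existing methods are computationally expensive and slow. Method: Proposes PoolNet, a deep learning framework that assesses the validity of image data captured in uncontrolled environments at the frame level and the scene level, determining whether it is suitable for SfM processing. Result: The model successfully distinguishes scenes that are suitable for SfM processing from those that are not, and significantly shortens the time needed compared with state-of-the-art algorithms for obtaining SfM data. Conclusion: PoolNet provides an efficient, automated screening step for SfM preprocessing of large-scale image data, improving the overall efficiency and feasibility of the SfM pipeline. Abstract: Lifting Structure-from-Motion (SfM) information from sequential and non-sequential image data is a time-consuming and computationally expensive task. In addition to this, the majority of publicly available data is unfit for processing due to inadequate camera pose variation, obscuring scene elements, and noisy data. To solve this problem, we introduce PoolNet, a versatile deep learning framework for frame-level and scene-level validation of in-the-wild data. We demonstrate that our model successfully differentiates SfM-ready scenes from those unfit for processing while significantly undercutting the amount of time state-of-the-art algorithms take to obtain structure-from-motion data.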

[52] ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration

Yingjie Xia,Tao Liu,Jinglei Shi,Qingsong Xie,Heng Guo,Jian Yang,Xi Wang

Main category: cs.CV

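TL;DR: Proposes ShaRP, an improved attention-based pruning framework that introduces segment-aware causal masking, positional debiasing, and token deduplication to prune visual tokens efficiently at shallow decoder layers, significantly lowering the inference cost of video large language models while keeping performance stable at high compression rates.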

Details Motivation: Video large language models incur a heavy computational load during the pre-filling stage because they process a very large number of visual tokens; existing attention-based pruning methods tend to cause significant performance degradation at shallow decoder layers, especially at high compression rates. Method: Proposes the ShaRP framework, which combines segment-aware causal masking, positional debiasing, and token deduplication to improve attention-based token selection at shallow layers, enabling efficient pruning without retraining. Result: ShaRP is validated on multiple video understanding benchmarks; it maintains competitive performance at high compression rates and significantly accelerates VLLM inference. Conclusion: ShaRP gives video large language models an attention-based pruning scheme that can be applied safely at shallow layers, establishing a new paradigm for efficient inference. Abstract: Video Large Language Models (VLLMs) face the challenge of high computational load during the pre-filling stage due to the processing of an enormous number of visual tokens. Although attention-based pruning methods are widely used to accelerate inference, trials at early decoder layers often result in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
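
As a rough illustration of attention-based token selection with deduplication, the sketch below keeps the highest-scoring visual tokens and skips any token that is nearly identical to one already kept. It is not ShaRP's algorithm: the cosine-similarity test, the threshold, and the function names are assumptions, and segment-aware masking and positional debiasing are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def prune_tokens(scores, features, keep, dedup_threshold=0.95):
    """Keep up to `keep` tokens by descending score, skipping near-duplicates.

    scores:   one attention-derived relevance score per token (assumed given)
    features: one feature vector per token
    Returns the kept token indices in their original order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if len(kept) == keep:
            break
        # Skip tokens that duplicate an already-kept token.
        if any(cosine(features[i], features[j]) > dedup_threshold for j in kept):
            continue
        kept.append(i)
    return sorted(kept)


scores = [0.9, 0.8, 0.1, 0.5]
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]
kept_indices = prune_tokens(scores, features, keep=2)
print(kept_indices)
```

Token 1 has the second-highest score but is almost parallel to token 0, so the slot goes to token 3 instead.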

[53] LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models

Qingqiao Hu,Weimin Lyu,Meilong Xu,Kehan Qi,Xiaoling Hu,Saumya Gupta,Jiawei Zhou,Chao Chen

Main category: cs.CV

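TL;DR: This paper proposes LoC-Path, an efficient multimodal language model framework for pathology whole slide images (WSIs); by reducing feature redundancy and compressing key information, it matches existing state-of-the-art models while significantly lowering computation and memory costs.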

Details Motivation: Existing WSI multimodal large models rely on heavy encoders that process large numbers of image patches, which is computationally expensive and fails to exploit the sparsity of diagnostically relevant regions and the redundancy of features. Method: Designs a Sparse Token Merger (STM) and an MAE-pretrained resampler to compress locally and globally redundant patch tokens, and introduces a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to fuse the compressed visual representation with the language model efficiently. Result: Experiments show that LoC-Path matches current state-of-the-art models on multiple tasks while significantly reducing computation and memory usage. Conclusion: By removing redundancy and focusing on key regions, LoC-Path offers an efficient, scalable paradigm for WSI-language modeling and an effective answer to the computational cost of large-scale pathology image analysis. Abstract: Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.

[54] Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

Shizhan Liu,Xinran Deng,Zhuoyi Yang,Jiayan Teng,Xiaotao Gu,Jie Tang

Main category: cs.CV

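TL;DR: This paper proposes a Spectral-Structured video VAE (SSVAE) that introduces two lightweight regularizers to improve latent-space structure, significantly improving the training efficiency and generation quality of diffusion models for text-to-video generation.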

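Details Motivation: Existing video VAEs focus mainly on reconstruction fidelity and overlook how strongly latent-space structure affects the difficulty of diffusion training, which leads to inefficient training. Method: Proposes two lightweight, backbone-agnostic regularizers, Local Correlation Regularization and Latent Masked Reconstruction, to induce two spectral properties in the latent space that benefit diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. Result: Experiments show that SSVAE converges 3x faster in text-to-video generation and gains 10% in video reward, outperforming strong open-source VAEs. Conclusion: Designing the spectral properties of a video VAE's latent space in a structured way can significantly improve the training efficiency and generation performance of diffusion models, providing an effective route to efficient video generation. Abstract: Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a $3\times$ speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.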

[55] The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos

Zhuoyuan Wu,Xurui Yang,Jiahui Huang,Yue Wang,Jun Gao

Main category: cs.CV

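TL;DR: This paper proposes Dynamic Prior, a method that uses vision-language models (VLMs) and the fine-grained segmentation of SAM2 to identify dynamic objects without task-specific training, significantly improving the accuracy and robustness of 3D structure understanding in camera pose optimization, depth reconstruction, and 4D trajectory estimation.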

Details Motivation: Traditional SfM methods struggle to recover camera poses and 3D scene structure accurately from real videos containing dynamic objects, and existing learning-based methods are limited by the lack of large-scale motion segmentation datasets, which leads to inaccurate segmentation and degrades 3D reconstruction quality. Method: Proposes Dynamic Prior, which combines the reasoning ability of vision-language models with the fine-grained segmentation of SAM2 to identify dynamic objects without dedicated training, and integrates it into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Result: Extensive experiments on synthetic and real-world video datasets show that the method achieves state-of-the-art motion segmentation while significantly improving the accuracy and robustness of 3D structure understanding. Conclusion: By introducing a training-free prior over dynamic objects, Dynamic Prior effectively addresses 3D reconstruction in dynamic scenes, bringing greater robustness and accuracy to structure recovery, with good generality and application potential. Abstract: Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure from motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. Dynamic Prior can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that Dynamic Prior not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.

[56] Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Ziyang Wang,Honglu Zhou,Shijie Wang,Junnan Li,Caiming Xiong,Silvio Savarese,Mohit Bansal,Michael S. Ryoo,Juan Carlos Niebles

Main category: cs.CV

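TL;DR: This paper proposes the Active Video Perception (AVP) framework, whose iterative plan-observe-reflect process lets long-video-understanding agents actively and selectively extract compact, query-relevant evidence from pixels, significantly improving performance and reducing computational cost.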

Details Motivation: Existing long video understanding methods rely on query-agnostic captioners, which waste computation and lose fine-grained spatio-temporal information; inspired by active perception theory, the authors argue that agents should actively decide what, when, and where to observe and continuously assess whether the observed information is sufficient. Method: Proposes the AVP framework, which runs a multi-round plan-observe-reflect process: a planner generates targeted video interaction instructions, an observer executes them to extract time-stamped evidence, and a reflector evaluates whether the evidence is sufficient and decides whether to output an answer or keep observing. Result: On five long video understanding benchmarks, AVP achieves the best performance, improving average accuracy by 5.7% over the best existing agentic method while requiring only 18.4% of the inference time and 12.4% of the input tokens. Conclusion: Through active perception, AVP improves both the efficiency and the accuracy of long video understanding, confirming the feasibility and advantage of treating the video as an interactive environment for evidence gathering. Abstract: Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
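
The plan-observe-reflect loop described above can be sketched as a small control loop. Everything below is illustrative: in AVP the three roles are MLLM agents, whereas here they are hypothetical toy functions operating on a dictionary that stands in for a video.

```python
# Toy "video": timestamp in seconds -> what is visible (stand-in for real pixels).
VIDEO = {5: "a dog runs", 42: "a red car parks", 90: "people leave"}


def run_avp(query, planner, observer, reflector, max_rounds=5):
    """Iterate plan -> observe -> reflect until the reflector returns an answer."""
    plans, evidence = [], []
    for _ in range(max_rounds):
        plan = planner(query, plans, evidence)   # decide where to look next
        plans.append(plan)
        evidence.extend(observer(plan))          # collect time-stamped evidence
        answer = reflector(query, evidence)      # is the evidence sufficient?
        if answer is not None:
            return answer, len(plans)
    return None, len(plans)


def toy_planner(query, plans, evidence):
    # Hypothetical policy: scan consecutive 30-second windows.
    start = 30 * len(plans)
    return (start, start + 30)


def toy_observer(plan):
    start, end = plan
    return [(t, text) for t, text in sorted(VIDEO.items()) if start <= t < end]


def toy_reflector(query, evidence):
    # Hypothetical sufficiency check: stop once a car has been seen.
    for t, text in evidence:
        if "car" in text:
            return f"{text} at t={t}s"
    return None


answer, rounds = run_avp("When does the car park?", toy_planner, toy_observer, toy_reflector)
print(answer, rounds)
```

The loop stops in the second round, without ever inspecting the last window, which is the kind of saving the framework aims for by halting once the evidence suffices.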

[57] Genetic Algorithms For Parameter Optimization for Disparity Map Generation of Radiata Pine Branch Images

Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green

Main category: cs.CV

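TL;DR: Proposes a genetic algorithm (GA) based framework for optimizing SGBM and WLS parameters, improving the precision with which UAVs measure distances to tree branches in forestry scenes while keeping processing efficient.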

Details Motivation: Traditional stereo matching algorithms such as SGBM+WLS are fast but depend on manual parameter tuning and adapt poorly to complex, variable forest environments, so an automated, efficient parameter optimization method is needed. Method: Designs and implements a genetic-algorithm-based parameter optimization framework that automatically searches for the best SGBM and WLS parameter combination and evaluates performance with several image quality metrics (MSE, PSNR, SSIM). Result: Compared with baseline configurations, the method reduces mean squared error by 42.86% and raises peak signal-to-noise ratio and structural similarity by 8.47% and 28.52%, respectively, and it generalizes better across imaging conditions. Conclusion: The GA optimization framework removes the manual tuning burden of traditional stereo matching algorithms and significantly improves precision while keeping processing fast, making it suitable for resource-constrained UAV forestry applications. Abstract: Traditional stereo matching algorithms like Semi-Global Block Matching (SGBM) with Weighted Least Squares (WLS) filtering offer speed advantages over neural networks for UAV applications, generating disparity maps in approximately 0.5 seconds per frame. However, these algorithms require meticulous parameter tuning. We propose a Genetic Algorithm (GA) based parameter optimization framework that systematically searches for optimal parameter configurations for SGBM and WLS, enabling UAVs to measure distances to tree branches with enhanced precision while maintaining processing efficiency. Our contributions include: (1) a novel GA-based parameter optimization framework that eliminates manual tuning; (2) a comprehensive evaluation methodology using multiple image quality metrics; and (3) a practical solution for resource-constrained UAV systems. Experimental results demonstrate that our GA-optimized approach reduces Mean Squared Error by 42.86% while increasing Peak Signal-to-Noise Ratio and Structural Similarity by 8.47% and 28.52%, respectively, compared with baseline configurations. Furthermore, our approach demonstrates superior generalization performance across varied imaging conditions, which is critical for real-world forestry applications.
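
A minimal sketch of a genetic search over integer parameters follows. It is not the authors' framework: the parameter names, bounds, operators, and especially `toy_error` (a stand-in for computing MSE between a generated disparity map and a reference) are assumptions made so the example runs without OpenCV or image data.

```python
import random

# Hypothetical integer search space: (low, high) bounds per parameter.
BOUNDS = {"num_disparities": (1, 16), "block_size": (1, 10), "wls_lambda": (1, 100)}


def toy_error(params):
    """Stand-in for the disparity-map MSE; lower is better."""
    target = {"num_disparities": 8, "block_size": 4, "wls_lambda": 60}
    return sum((params[k] - target[k]) ** 2 for k in BOUNDS)


def random_individual(rng):
    return {k: rng.randint(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def crossover(a, b, rng):
    # Uniform crossover: each gene comes from either parent.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}


def mutate(ind, rng, rate=0.2):
    out = dict(ind)
    for k, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            out[k] = rng.randint(lo, hi)
    return out


def genetic_search(error_fn, generations=30, pop_size=20, elite=2, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best error at the start of each generation
    for _ in range(generations):
        population.sort(key=error_fn)
        history.append(error_fn(population[0]))
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = population[:elite] + children  # elitism keeps the best
    return min(population, key=error_fn), history


best, history = genetic_search(toy_error)
print(best, history[0], toy_error(best))
```

Because the elite individuals are copied unchanged into the next generation, the best error never gets worse from one generation to the next; in a real setup `error_fn` would run SGBM and WLS with the candidate parameters and score the result.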

[58] Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

Zhiyuan Jiang,Shenghao Xie,Wenyi Li,Wenqiang Zu,Peihang Li,Jiahao Qiu,Siqi Pei,Lei Ma,Tiejun Huang,Mengdi Wang,Shilong Liu

Main category: cs.CV

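TL;DR: This paper proposes ZoomClick, a training-free GUI grounding method that uses zoom as a prior; four key properties enable dynamic spatial focusing and adaptive context switching, significantly improving both general vision-language models and specialized GUI grounding models. It also releases GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom.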

Details Motivation: Existing GUI agents still face challenges in cross-platform generalization, complex layout analysis, and fine-grained element localization, and they depend on large-scale bounding box supervision, so a more efficient, training-free solution is needed. Method: Proposes ZoomClick, which uses four key properties of zoom (pre-zoom, depth, shrink size, minimal crop size) for dynamic spatial focusing and adaptive context switching, improving grounding without additional training. Result: Achieves state-of-the-art performance on several mainstream benchmarks; for example, UI-Venus-72B reaches a 73.1% success rate on ScreenSpot-Pro. The work also introduces GUIZoom-Bench for evaluating model adaptability to zoom. Conclusion: Zoom is a strong yet underexplored prior; ZoomClick offers an efficient, training-free paradigm for GUI grounding and may encourage future work on zoom-based training and test-time scaling. Abstract: Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
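
One way to picture zoom-based grounding is an iterative crop-and-repredict loop. The sketch below is an assumption-laden illustration, not ZoomClick itself: `depth`, `shrink`, and `min_crop` loosely mirror three of the named properties, pre-zoom is not modeled, and the `predict` callback (which would be a grounding model) is replaced by a hypothetical oracle.

```python
def zoom_and_click(predict, screen_w, screen_h, depth=3, shrink=0.5, min_crop=64):
    """Repeatedly crop around the predicted point, then map the final click to the screen.

    predict(crop) must return a normalized (x, y) in [0, 1] inside the crop,
    where crop = (left, top, width, height) in screen pixels.
    """
    left, top, w, h = 0.0, 0.0, float(screen_w), float(screen_h)
    for _ in range(depth):
        nx, ny = predict((left, top, w, h))
        cx, cy = left + nx * w, top + ny * h  # prediction in screen coordinates
        new_w = min(w, max(w * shrink, min_crop))
        new_h = min(h, max(h * shrink, min_crop))
        if new_w == w and new_h == h:
            break  # cannot zoom any further
        # Center the smaller crop on the prediction, clamped to the screen.
        left = min(max(cx - new_w / 2, 0.0), screen_w - new_w)
        top = min(max(cy - new_h / 2, 0.0), screen_h - new_h)
        w, h = new_w, new_h
    nx, ny = predict((left, top, w, h))
    return left + nx * w, top + ny * h


def make_oracle(target_x, target_y):
    """Hypothetical predictor that knows the target; a real one would be a model."""
    def predict(crop):
        left, top, w, h = crop
        nx = min(max((target_x - left) / w, 0.0), 1.0)
        ny = min(max((target_y - top) / h, 0.0), 1.0)
        return nx, ny
    return predict


click = zoom_and_click(make_oracle(1500, 200), 1920, 1080)
print(click)
```

The bookkeeping that matters is the coordinate mapping: each prediction is made in crop-relative coordinates and must be converted back to screen pixels before the click is issued.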

[59] YOLO and SGBM Integration for Autonomous Tree Branch Detection and Depth Estimation in Radiata Pine Pruning Applications

Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green

Main category: cs.CV

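TL;DR: Proposes an autonomous drone pruning system based on YOLO and SGBM stereo vision that achieves precise branch detection and depth estimation using only a stereo camera, without expensive LiDAR, with fast processing and improved safety.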

Details Motivation: Manual pruning of radiata pine carries safety risks from working at height and on difficult terrain, so a safe, efficient automated solution is urgently needed. Method: Combines YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision, using stereo camera input for branch detection and depth estimation, and targets autonomous pruning on a drone platform. Result: YOLO outperforms Mask R-CNN at branch segmentation, reaching 82.0% mask mAP50-95; the system accurately localizes branches within a 2 m operational range, with processing times under one second per frame. Conclusion: The framework demonstrates the feasibility of low-cost, efficient autonomous pruning systems in commercial forestry, helping improve operational safety and automation. Abstract: Manual pruning of radiata pine trees poses significant safety risks due to extreme working heights and challenging terrain. This paper presents a computer vision framework that integrates YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision for autonomous drone-based pruning operations. Our system achieves precise branch detection and depth estimation using only stereo camera input, eliminating the need for expensive LiDAR sensors. Experimental evaluation demonstrates YOLO's superior performance over Mask R-CNN, achieving 82.0% mask mAP50-95 for branch segmentation. The integrated system accurately localizes branches within a 2 m operational range, with processing times under one second per frame. These results establish the feasibility of cost-effective autonomous pruning systems that enhance worker safety and operational efficiency in commercial forestry.
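
For a rectified stereo pair, depth follows from disparity through the standard relation Z = f * B / d (focal length in pixels, baseline in meters, disparity in pixels). The sketch below applies it inside a detection box; taking the median disparity of the box is an assumption for illustration, as are the camera values, and the abstract does not say how the authors aggregate disparity per branch.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Standard rectified-stereo relation Z = f * B / d; None for invalid disparity."""
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


def branch_depth(disparity_map, box, focal_px, baseline_m):
    """Depth of a detected branch from the median valid disparity inside its box.

    disparity_map: list of rows of disparities in pixels (<= 0 means invalid)
    box:           (x0, y0, x1, y1), with x1 and y1 exclusive
    """
    x0, y0, x1, y1 = box
    values = sorted(d for row in disparity_map[y0:y1] for d in row[x0:x1] if d > 0)
    if not values:
        return None
    median = values[len(values) // 2]
    return depth_from_disparity(median, focal_px, baseline_m)


# Hypothetical camera: 700 px focal length, 0.1 m baseline.
disparity_map = [
    [35.0, 35.0, 0.0],
    [35.0, 70.0, 0.0],
    [0.0, 0.0, 0.0],
]
depth = branch_depth(disparity_map, (0, 0, 2, 2), focal_px=700.0, baseline_m=0.1)
print(depth)
```

Since depth is inversely proportional to disparity, a one-pixel disparity error matters more at long range than at short range, which is one reason a bounded operational range is a sensible design constraint.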

[60] Moving object detection from multi-depth images with an attention-enhanced CNN

Masato Shibukawa,Fumi Yoshida,Toshifumi Yanagisawa,Takashi Ito,Hirohisa Kurosaki,Makoto Yoshikawa,Kohki Kamiya,Ji-an Jiang,Wesley Fraser,JJ Kavelaars,Susan Benecchi,Anne Verbiscer,Akira Hatakeyama,Hosei O,Naoya Ozaki

Main category: cs.CV

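TL;DR: Proposes a multi-input convolutional neural network with a convolutional block attention module to improve the detection of moving objects in astronomical observations, significantly reducing the manual verification workload.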

Details Motivation: Traditional moving object detection depends on manual verification, which is costly and inefficient; an automated method is needed to improve accuracy and reduce labor. Method: Uses a multi-input convolutional neural network architecture that processes multiple stacked images simultaneously and adds a convolutional block attention module, which strengthens the model's focus on key features along the spatial and channel dimensions. Result: On a dataset of about 2,000 observational images, the model reaches nearly 99% accuracy with an AUC above 0.99; by adjusting the detection threshold, it cuts the workload by more than 99% compared with manual verification. Conclusion: The method significantly improves the automation and classification accuracy of moving object detection, reduces dependence on manual intervention, and is suited to processing large-scale sky survey data. Abstract: One of the greatest challenges for detecting moving objects in the solar system from wide-field survey data is determining whether a signal indicates a true object or is due to some other source, like noise. Object verification has relied heavily on human eyes, which usually results in significant labor costs. In order to address this limitation and reduce the reliance on manual intervention, we propose a multi-input convolutional neural network integrated with a convolutional block attention module. This method is specifically tailored to enhance the moving object detection system that we have developed and used previously. The current method introduces two innovations. The first one is a multi-input architecture that processes multiple stacked images simultaneously. The second is the incorporation of the convolutional block attention module, which enables the model to focus on essential features in both spatial and channel dimensions. These advancements facilitate efficient learning from multiple inputs, leading to more robust detection of moving objects. The performance of the model is evaluated on a dataset consisting of approximately 2,000 observational images. We achieved an accuracy of nearly 99% with an AUC (Area Under the Curve) of >0.99. These metrics indicate that the proposed model achieves excellent classification performance. By adjusting the threshold for object detection, the new model reduces the human workload by more than 99% compared to manual verification.
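
The link between the detection threshold and human workload can be made concrete with a small sketch. The procedure below is an assumption, not the authors' protocol: it picks the highest threshold that still keeps a required recall, and counts the candidates above it as the ones a person would still need to check. The scores and labels are made up.

```python
def pick_threshold(scores, labels, min_recall=1.0):
    """Highest score threshold whose recall on true objects is >= min_recall."""
    positives = sum(labels)
    for thr in sorted(set(scores), reverse=True):
        found = sum(1 for s, y in zip(scores, labels) if y and s >= thr)
        if positives == 0 or found / positives >= min_recall:
            return thr
    return min(scores)


def workload_reduction(scores, threshold):
    """Fraction of candidates that no longer need manual inspection."""
    reviewed = sum(1 for s in scores if s >= threshold)
    return 1 - reviewed / len(scores)


# Hypothetical classifier scores and ground-truth labels (1 = real moving object).
scores = [0.99, 0.95, 0.40, 0.10, 0.05, 0.02, 0.01, 0.30]
labels = [1, 1, 0, 0, 0, 0, 0, 0]

threshold = pick_threshold(scores, labels)
reduction = workload_reduction(scores, threshold)
print(threshold, reduction)
```

In survey data, where true objects are a tiny fraction of all candidates, the same calculation can yield very large reductions as long as the classifier ranks real objects above noise.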

[61] Performance Evaluation of Deep Learning for Tree Branch Segmentation in Autonomous Forestry Systems

Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green

Main category: cs.CV

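TL;DR: This paper evaluates deep learning methods for tree branch segmentation in UAV forestry operations at multiple resolutions, establishing benchmarks for the accuracy-efficiency trade-off.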

Details Motivation: For UAVs to navigate safely and prune automatically in complex forest environments, fast and precise trunk and branch segmentation is needed, particularly under varying resolutions and operating conditions. Method: Uses the Urban Street Tree Dataset to evaluate 22 deep learning model configurations at three resolutions (256x256, 512x512, and 1024x1024), analyzed with IoU, Dice, TS-IoU, CPR, and other metrics. Result: U-Net+MiT-B4 performs strongly at 256x256; at 512x512, MiT-B4 leads on several metrics; at 1024x1024, U-Net+MiT-B3 has the best validation performance and U-Net++ the best boundary quality; PSPNet is the most efficient but less accurate. Conclusion: The study provides multi-resolution branch segmentation benchmarks for embedded forestry systems and shows how different models trade accuracy against computational efficiency. Abstract: UAV-based autonomous forestry operations require rapid and precise tree branch segmentation for safe navigation and automated pruning across varying pixel resolutions and operational conditions. We evaluate different deep learning methods at three resolutions (256x256, 512x512, 1024x1024) using the Urban Street Tree Dataset, employing standard metrics (IoU, Dice) and specialized measures including Thin Structure IoU (TS-IoU) and Connectivity Preservation Rate (CPR). Among 22 configurations tested, U-Net with MiT-B4 backbone achieves strong performance at 256x256. At 512x512, MiT-B4 leads in IoU, Dice, TS-IoU, and Boundary-F1. At 1024x1024, U-Net+MiT-B3 shows the best validation performance for IoU/Dice and precision, while U-Net++ excels in boundary quality. PSPNet provides the most efficient option (2.36/9.43/37.74 GFLOPs) with 25.7/19.6/11.8 percentage point IoU reductions compared to top performers at respective resolutions. These results establish multi-resolution benchmarks for accuracy-efficiency trade-offs in embedded forestry systems. Implementation is available at https://github.com/BennyLinntu/PerformanceTreeBranchSegmentation.
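
The two standard metrics named above have simple definitions on binary masks: IoU = |P ∩ T| / |P ∪ T| and Dice = 2|P ∩ T| / (|P| + |T|). The sketch below computes both on small made-up masks; the specialized TS-IoU and CPR measures are not reproduced because the abstract does not define them.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks given as lists of rows of 0/1 values."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    pred_area = sum(sum(row) for row in pred)
    target_area = sum(sum(row) for row in target)
    union = pred_area + target_area - inter
    # Two empty masks agree perfectly, so both metrics are defined as 1.0.
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred_area + target_area) if (pred_area + target_area) else 1.0
    return iou, dice


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

iou, dice = iou_and_dice(pred, target)
print(iou, dice)
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank models identically on a single image; they can still differ once scores are averaged over a dataset.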

[62] ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction

Jiangtong Tan,Lin Liu,Jie Huanng,Xiaopeng Zhang,Qi Tian,Feng Zhao

Main category: cs.CV

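TL;DR: This paper proposes ParaUni, a unified multimodal model that extracts features from the layers of a vision-language model (VLM) in parallel and combines a Layer Integration Module (LIM) with a Layer-wise Dynamic Adjustment Mechanism (LDAM) to fuse multi-level information, improving generation quality and enabling multi-reward optimization in reinforcement learning.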

Details Motivation: Existing methods struggle to balance sufficient interaction with flexible implementation because of large representation differences, and they do not fully exploit the rich hierarchical information in the VLM. Method: Proposes ParaUni, which extracts features from different VLM layers in parallel, fuses fine-grained details with semantic abstractions through the LIM module, and introduces the LDAM mechanism, which dynamically adjusts layers during reinforcement learning according to how each layer responds to different rewards, in order to align their hierarchical properties. Result: Experiments show that ParaUni makes effective use of complementary multi-layer features, substantially improves generation quality, and shows strong potential for multi-reward optimization during the reinforcement learning stage. Conclusion: Through parallel multi-layer feature fusion and dynamic layer-wise adjustment, ParaUni strengthens the generation ability of unified multimodal models and suggests a new approach to multi-reward learning. Abstract: Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to fully balance sufficient interaction and flexible implementation due to vast representation differences. Considering the abundant and hierarchical information in a VLM's layers, from low-level details to high-level semantics, we propose ParaUni. It extracts features from various VLM layers in a Parallel way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in a Unified multimodal model. Concretely, visual features from all of the VLM's layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Crucially, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that uses RL to align the hierarchical properties of these layers and thereby facilitate improvements on multiple rewards. Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages. Code is available at https://github.com/JosephTiTan/ParaUni.

[63] TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression

Cheng-Yuan Ho,He-Bi Yang,Jui-Chiu Chiang,Yu-Lun Liu,Wen-Hsiao Peng

Main category: cs.CV

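TL;DR: This paper proposes TED-4DGS, a rate-distortion-optimized compression framework for dynamic 3D Gaussian Splatting (4DGS) that combines a sparse anchor-based representation, learnable temporal-activation parameters, and lightweight temporal embeddings for efficient, compact dynamic scene modeling, achieving state-of-the-art rate-distortion performance on several real-world datasets.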

Details Motivation: Existing dynamic 3DGS methods have shortcomings in their deformation mechanisms and compression efficiency; they lack explicit modeling of temporal control and rate-distortion-optimized compression strategies, which limits practical deployment. Method: Proposes TED-4DGS, which uses a sparse anchor-based 3DGS representation; each anchor has learnable temporal-activation parameters that control its appearance and disappearance, and a lightweight temporal embedding queries a shared deformation bank to produce anchor-specific deformation. For compression, it introduces an implicit neural representation (INR)-based hyperprior to model anchor attribute distributions and a channel-wise autoregressive model to capture intra-anchor correlations. Result: Achieves state-of-the-art rate-distortion performance on several real-world datasets, clearly ahead of prior methods, confirming the approach's advantages in compression efficiency and reconstruction quality. Conclusion: TED-4DGS unifies the strengths of space-time 4DGS and canonical deformation-based methods and is among the first rate-distortion-optimized compression frameworks for dynamic 3DGS, opening a new direction for efficient dynamic scene representation. Abstract: Building on the success of 3D Gaussian Splatting (3DGS) in static 3D scene representation, its extension to dynamic scenes, commonly referred to as 4DGS or dynamic 3DGS, has attracted increasing attention. However, designing more compact and efficient deformation schemes together with rate-distortion-optimized compression strategies for dynamic 3DGS representations remains an underexplored area. Prior methods either rely on space-time 4DGS with overspecified, short-lived Gaussian primitives or on canonical 3DGS with deformation that lacks explicit temporal control. To address this, we present TED-4DGS, a temporally activated and embedding-based deformation scheme for rate-distortion-optimized 4DGS compression that unifies the strengths of both families. TED-4DGS is built on a sparse anchor-based 3DGS representation. Each canonical anchor is assigned learnable temporal-activation parameters to specify its appearance and disappearance transitions over time, while a lightweight per-anchor temporal embedding queries a shared deformation bank to produce anchor-specific deformation. For rate-distortion compression, we incorporate an implicit neural representation (INR)-based hyperprior to model anchor attribute distributions, along with a channel-wise autoregressive model to capture intra-anchor correlations. With these novel elements, our scheme achieves state-of-the-art rate-distortion performance on several real-world datasets. To the best of our knowledge, this work represents one of the first attempts to pursue a rate-distortion-optimized compression framework for dynamic 3DGS representations.

[64] University Building Recognition Dataset in Thailand for the mission-oriented IoT sensor system

Takara Taniguchi,Yudai Ueda,Atsuya Muramatsu,Kohki Hashimoto,Ryo Yagi,Hideya Ochiai,Chaodit Aswakul

Main category: cs.CV

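TL;DR: This paper presents CUBR, a building recognition dataset for Chulalongkorn University in Thailand, and runs image recognition with a Vision Transformer under the Wireless Ad Hoc Federated Learning (WAFL) framework; experiments show that WAFL training outperforms standalone training.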

Details Motivation: Different application scenarios need their own datasets, and training on edge devices is a promising future direction, so mission-specific datasets must be built and the effectiveness of federated learning on edge devices explored. Method: Develops CUBR, a building recognition dataset dedicated to Chulalongkorn University, and runs experiments under the WAFL-ViT framework, training the model with federated learning based on direct device-to-device communication. Result: Training in the WAFL scenario achieves higher accuracy than standalone training, supporting WAFL's effectiveness in practice. Conclusion: Building mission-specific datasets is essential for edge intelligence systems; WAFL outperforms local training in practical deployment and merits wider adoption. Abstract: Many industrial sectors have been using machine learning in inference mode on edge devices. Future directions show that training on edge devices is promising due to improvements in semiconductor performance. Wireless Ad Hoc Federated Learning (WAFL) has been proposed as a promising approach for collaborative learning with device-to-device communication among edges. In particular, WAFL with Vision Transformer (WAFL-ViT) has been tested on image recognition tasks with the UTokyo Building Recognition Dataset (UTBR). Since WAFL-ViT is a mission-oriented sensor system, it is essential to construct specific datasets for each mission. In our work, we have developed the Chulalongkorn University Building Recognition Dataset (CUBR), which is specialized for Chulalongkorn University as a case study in Thailand. Additionally, our results demonstrate that training on WAFL scenarios achieves better accuracy than self-training scenarios. The dataset is available at https://github.com/jo2lxq/wafl/.

[65] EmoStyle: Emotion-Driven Image Stylization

Jingyuan Yang,Zihuan Bai,Hui Huang

Main category: cs.CV

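TL;DR: This paper introduces the Affective Image Stylization (AIS) task and designs the EmoStyle framework, which builds a new dataset and an emotion-content reasoning mechanism to perform style transfer that conveys a specific emotion while keeping content consistent.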

Details Motivation: Existing image stylization methods mostly focus on transforming visual appearance and overlook the emotional impact that artistic styles carry. This paper aims to fill that gap by exploring how style can convey specific emotions. Method: Proposes the EmoStyle framework: (1) build EmoStyleSet, a triplet dataset of content, emotion, and stylized images; (2) design an Emotion-Content Reasoner that adaptively integrates emotional cues with content to produce style queries; (3) develop a Style Quantizer that maps continuous style features to emotion-related codebook entries. Result: Experiments show that EmoStyle markedly improves emotional expressiveness while maintaining content consistency; the learned emotion-aware style dictionary transfers to other generative tasks, indicating broad application potential. Conclusion: The work establishes emotion-driven image stylization as a new direction and gives AI art creation a technical foundation with greater emotional expressiveness. Abstract: Art has long been a profound medium for expressing emotions. While existing image stylization methods effectively transform visual appearance, they often overlook the emotional impact carried by styles. To bridge this gap, we introduce Affective Image Stylization (AIS), a task that applies artistic styles to evoke specific emotions while preserving content. We present EmoStyle, a framework designed to address key challenges in AIS, including the lack of training data and the emotion-style mapping. First, we construct EmoStyleSet, a content-emotion-stylized image triplet dataset derived from ArtEmis to support AIS. We then propose an Emotion-Content Reasoner that adaptively integrates emotional cues with content to learn coherent style queries. Given the discrete nature of artistic styles, we further develop a Style Quantizer that converts continuous style features into emotion-related codebook entries. Extensive qualitative and quantitative evaluations, including user studies, demonstrate that EmoStyle enhances emotional expressiveness while maintaining content consistency. Moreover, the learned emotion-aware style dictionary is adaptable to other generative tasks, highlighting its potential for broader applications. Our work establishes a foundation for emotion-driven image stylization, expanding the creative potential of AI-generated art.

[66] UniFS: Unified Multi-Contrast MRI Reconstruction via Frequency-Spatial Fusion

Jialin Li,Yiwei Ren,Kai Pan,Dong Wei,Pujin Cheng,Xian Wu,Xiaoying Tang

Main category: cs.CV

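TL;DR: This paper proposes UniFS, a unified frequency-spatial fusion model for multi-contrast MR reconstruction (MCMR) that adapts to multiple k-space undersampling patterns without retraining, significantly improving generalization to unseen sampling patterns.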

Details Motivation: Existing MCMR methods often fail to generalize across k-space undersampling patterns and need a separate model trained for each pattern, which limits practical use; in addition, most methods ignore frequency-domain features or extract only shallow frequency information, so they do not fully exploit complementary cross-modal frequency information. Method: UniFS comprises three key modules: a Cross-Modal Frequency Fusion module, an Adaptive Mask-Based Prompt Learning module, and a Dual-Branch Complementary Refinement module. By fusing frequency and spatial information and using an adaptive prompt mechanism that adjusts dynamically to different undersampling patterns, it handles multiple sampling patterns with a single model. Result: UniFS is validated on the BraTS and HCP datasets across a range of k-space undersampling patterns and acceleration factors, including unseen patterns, and achieves state-of-the-art performance in multiple scenarios. Conclusion: UniFS performs MCMR under multiple k-space undersampling patterns without retraining, combines frequency and spatial information effectively, improves generalization and reconstruction quality, and shows promise for clinical use. Abstract: Recently, Multi-Contrast MR Reconstruction (MCMR) has emerged as an active research topic that leverages high-quality auxiliary modalities to reconstruct undersampled target modalities of interest. However, existing methods often struggle to generalize across different k-space undersampling patterns, requiring the training of a separate model for each specific pattern, which limits their practical applicability. To address this challenge, we propose UniFS, a Unified Frequency-Spatial Fusion model designed to handle multiple k-space undersampling patterns for MCMR tasks without any need for retraining. UniFS integrates three key modules: a Cross-Modal Frequency Fusion module, an Adaptive Mask-Based Prompt Learning module, and a Dual-Branch Complementary Refinement module. These modules work together to extract domain-invariant features from diverse k-space undersampling patterns while dynamically adapting to their individual variations. Another limitation of existing MCMR methods is their tendency to focus solely on spatial information while neglecting frequency characteristics, or to extract only shallow frequency features, thus failing to fully leverage complementary cross-modal frequency information. To relieve this issue, UniFS introduces an adaptive prompt-guided frequency fusion module for k-space learning, significantly enhancing the model's generalization performance. We evaluate our model on the BraTS and HCP datasets with various k-space undersampling patterns and acceleration factors, including previously unseen patterns, to comprehensively assess UniFS's generalizability. Experimental results across multiple scenarios demonstrate that UniFS achieves state-of-the-art performance. Our code is available at https://github.com/LIKP0/UniFS.
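
To make "k-space undersampling pattern" and "acceleration factor" concrete, here is a minimal sketch of one common pattern: an equispaced Cartesian mask with a fully sampled central band. The function name `cartesian_mask` and the `center_fraction` parameter are illustrative assumptions; the abstract does not say which patterns UniFS is trained or tested on.

```python
def cartesian_mask(num_lines, acceleration, center_fraction=0.08):
    """Return a list of 0/1 flags, one per phase-encoding line.

    Hypothetical helper for illustration only, not the UniFS code.
    Every `acceleration`-th line is kept, plus a contiguous band of
    low-frequency lines around the centre of k-space.
    """
    num_center = max(1, round(num_lines * center_fraction))
    start = (num_lines - num_center) // 2
    mask = [0] * num_lines
    # Fully sample the low-frequency band in the middle.
    for i in range(start, start + num_center):
        mask[i] = 1
    # Keep one line out of every `acceleration` lines elsewhere.
    for i in range(0, num_lines, acceleration):
        mask[i] = 1
    return mask


mask = cartesian_mask(num_lines=16, acceleration=4, center_fraction=0.25)
print(mask)
```

A different pattern (random, radial, a different acceleration factor) changes only the mask, which is why a model that conditions on the mask could in principle serve several patterns without retraining.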

[67] Concept-based Explainable Data Mining with VLM for 3D Detection

Mai Tsujimoto

Main category: cs.CV

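TL;DR: This paper proposes a new cross-modal framework that uses 2D vision-language models (VLMs) to mine rare objects from autonomous-driving scenes in order to improve point-cloud-based 3D object detection. By combining outlier detection with semantic filtering, the method identifies critical rare objects such as construction vehicles and motorcycles, substantially reduces annotation cost, and outperforms random sampling on the nuScenes dataset.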

Details Motivation: Detecting rare objects is difficult for autonomous-driving systems that rely on point clouds alone, and conventional approaches are costly and inefficient to annotate and struggle to focus on the most important samples. Method: A cross-modal framework combining 2D VLMs with point cloud data is proposed; it systematically mines rare but important objects through object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection (Isolation Forest and t-SNE-based methods) combined with concept-based filtering. Result: The method is validated on nuScenes: it improves 3D detection performance using only a small fraction of the training data, with particularly notable gains on hard categories such as trailers and bicycles, and it outperforms random sampling of the same amount of data. Conclusion: The framework efficiently mines semantically meaningful rare objects, greatly reduces the annotation burden, and offers an efficient, explainable approach to dataset construction for safety-critical autonomous-driving systems. Abstract: Rare-object detection remains a challenging task in autonomous driving systems, particularly when relying solely on point cloud data. Although Vision-Language Models (VLMs) exhibit strong capabilities in image understanding, their potential to enhance 3D object detection through intelligent data mining has not been fully explored. This paper proposes a novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes, thereby improving 3D object detection performance. Our approach combines complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection into a cohesive, explainable pipeline that systematically identifies rare but critical objects in driving scenes. By combining Isolation Forest and t-SNE-based outlier detection methods with concept-based filtering, the framework effectively identifies semantically meaningful rare objects. A key strength of this approach lies in its ability to extract and annotate targeted rare object concepts such as construction vehicles, motorcycles, and barriers. This substantially reduces the annotation burden and focuses only on the most valuable training samples. Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models while utilizing only a fraction of the training data, with particularly notable improvements for challenging object categories such as trailers and bicycles compared with the same amount of random data. This finding has substantial implications for the efficient curation of datasets in safety-critical autonomous systems.
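
The "outlier detection, then concept-based filtering" idea can be sketched in a few lines. Note the substitution: to stay dependency-free, this sketch scores outliers by mean distance to the k nearest neighbours instead of the Isolation Forest and t-SNE-based detectors named in the abstract. The sample layout (`id`, `feature`, `concept`) and all function names are assumptions made for illustration.

```python
import math


def knn_outlier_scores(features, k=2):
    """Mean distance to the k nearest neighbours; larger means more unusual."""
    scores = []
    for i, f in enumerate(features):
        dists = sorted(math.dist(f, g) for j, g in enumerate(features) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def mine_rare_objects(samples, target_concepts, k=2, top_fraction=0.25):
    """Keep the most outlying samples, then keep only the wanted concepts."""
    scores = knn_outlier_scores([s["feature"] for s in samples], k)
    n_keep = max(1, round(len(samples) * top_fraction))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i]["id"] for i in ranked[:n_keep]
            if samples[i]["concept"] in target_concepts]


# Toy 2D "embeddings": a tight cluster of cars plus two unusual detections.
samples = [
    {"id": "car_1", "feature": (0.0, 0.0), "concept": "car"},
    {"id": "car_2", "feature": (0.1, 0.0), "concept": "car"},
    {"id": "car_3", "feature": (0.0, 0.1), "concept": "car"},
    {"id": "car_4", "feature": (0.1, 0.1), "concept": "car"},
    {"id": "car_5", "feature": (0.05, 0.05), "concept": "car"},
    {"id": "car_6", "feature": (0.02, 0.08), "concept": "car"},
    {"id": "cv_1", "feature": (5.0, 5.0), "concept": "construction_vehicle"},
    {"id": "shadow_1", "feature": (-4.0, 6.0), "concept": "shadow"},
]
print(mine_rare_objects(samples, {"construction_vehicle", "motorcycle"}))
```

The concept filter is what keeps the statistically unusual but uninteresting detection (the "shadow") out of the annotation queue.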

[68] WaterWave: Bridging Underwater Image Enhancement into Video Streams via Wavelet-based Temporal Consistency Field

Qi Zhu,Jingyi Zhang,Naishan Zheng,Wei Yu,Jinghao Zhang,Deyi Ji,Feng Zhao

Main category: cs.CV

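TL;DR: This paper proposes WaterWave, an implicit representation built on a wavelet-based temporal consistency field that needs no paired data and addresses temporal inconsistency in underwater video enhancement.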

Details Motivation: Complex underwater imaging conditions make paired underwater video data hard to obtain, and existing methods mostly apply single-image enhancement models frame by frame, which leads to temporal inconsistency. Method: A temporal consistency prior in dynamic scenes is observed from a local temporal-frequency perspective and used to build a wavelet-based temporal consistency field (WaterWave) that progressively filters out inconsistent components; an underwater flow correction module is designed to represent temporal frequency bands more accurately. Result: Experiments show that WaterWave significantly improves the quality of videos produced by single-image enhancement methods and raises precision on the downstream underwater tracking tasks UOSTrack and MAT by 19.7% and 9.7%, respectively. Conclusion: The method achieves temporally consistent underwater video enhancement without paired training data, with good visual quality and practical potential. Abstract: Paired underwater videos are difficult to obtain because of the complexity of underwater imaging. As a result, most existing underwater video enhancement methods directly apply a single-image enhancement model frame by frame, which naturally lacks temporal consistency. To relieve the problem, we rethink the temporal manifold inherent in natural videos and observe a temporal consistency prior in dynamic scenes from the local temporal frequency perspective. Building on this prior and the absence of paired data, we propose an implicit representation for enhanced video signals, which is conducted in a wavelet-based temporal consistency field, WaterWave. Specifically, under the constraints of the prior, we progressively filter and attenuate the inconsistent components while preserving motion details and scenes, achieving a natural-flowing video. Furthermore, to represent temporal frequency bands more accurately, an underwater flow correction module is designed to rectify estimated flows by accounting for transmission in underwater scenes. Extensive experiments demonstrate that WaterWave significantly enhances the quality of videos generated using single-image underwater enhancement methods. Additionally, our method demonstrates high potential in downstream underwater tracking tasks, such as UOSTrack and MAT, outperforming the original video by a large margin, i.e., 19.7% and 9.7% in precision, respectively.

[69] Decoding with Structured Awareness: Integrating Directional, Frequency-Spatial, and Structural Attention for Medical Image Segmentation

Fan Zhang,Zhiwei Gu,Hua Wang

Main category: cs.CV

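TL;DR: This paper proposes a new Transformer decoder framework for medical image segmentation with three core modules, Adaptive Cross-Fusion Attention (ACFA), Triple Feature Fusion Attention (TFFA), and the Structural-aware Multi-scale Masking Module (SMMM), which improve the modeling of edge details, local textures, and spatial continuity.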

Details Motivation: Conventional Transformer decoders in medical image segmentation struggle to capture edge details and local textures and to model spatial continuity, which limits performance on high-precision tasks; this paper aims to address these limitations. Method: A new decoder framework composed of ACFA, TFFA, and SMMM is proposed: ACFA introduces channel and spatial attention with learnable guidance in three directions; TFFA fuses features from the spatial, Fourier, and wavelet domains into a joint frequency-spatial representation; SMMM refines the encoder-decoder skip connections through multi-scale context and structural saliency filtering. Result: Experiments show that the framework significantly improves segmentation accuracy and model generalization on tasks such as tumor segmentation and organ boundary extraction, and performs especially well with complex and blurred boundaries. Conclusion: The three modules work together to overcome the shortcomings of conventional decoders and provide an efficient, practical solution for medical image segmentation. Abstract: To address the limitations of Transformer decoders in capturing edge details, recognizing local textures, and modeling spatial continuity, this paper proposes a novel decoder framework specifically designed for medical image segmentation, comprising three core modules. First, the Adaptive Cross-Fusion Attention (ACFA) module integrates channel feature enhancement with spatial attention mechanisms and introduces learnable guidance in three directions (planar, horizontal, and vertical) to enhance responsiveness to key regions and structural orientations. Second, the Triple Feature Fusion Attention (TFFA) module fuses features from the Spatial, Fourier, and Wavelet domains, achieving a joint frequency-spatial representation that strengthens global dependency and structural modeling while preserving local information such as edges and textures, making it particularly effective in complex and blurred boundary scenarios. Finally, the Structural-aware Multi-scale Masking Module (SMMM) optimizes the skip connections between encoder and decoder by leveraging multi-scale context and structural saliency filtering, effectively reducing feature redundancy and improving semantic interaction quality. Working synergistically, these modules not only address the shortcomings of traditional decoders but also significantly enhance performance in high-precision tasks such as tumor segmentation and organ boundary extraction, improving both segmentation accuracy and model generalization. Experimental results demonstrate that this framework provides an efficient and practical solution for medical image segmentation.

[70] Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient Paradigm

Chuang Yu,Jinmiao Zhao,Yunpeng Liu,Yaokun Li,Xiujun Shu,Yuanhao Feng,Bo Wang,Yimian Dai,Xiangyu Yue

Main category: cs.CV

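TL;DR: This paper is the first to bring visual foundation models (VFMs) to single-frame infrared small target (SIRST) detection. It proposes FDEP, an efficient foundation-driven paradigm that uses a semantic alignment modulation fusion module and a collaborative-optimization-based implicit self-distillation strategy to improve performance significantly without extra inference overhead, and it introduces a unified HSE evaluation metric.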

Details Motivation: Although large-scale VFMs generalize well across many vision tasks, their potential for single-frame infrared small target detection has not been fully explored, and the existing evaluation system is fragmented and lacks an efficient, fair basis for comparison. Method: The FDEP framework has two core components: (1) a Semantic Alignment Modulation Fusion (SAMF) module that dynamically aligns and deeply fuses the global semantic priors of VFMs with task-specific features; and (2) a Collaborative Optimization-based Implicit Self-Distillation (CO-ISD) strategy that transfers semantics implicitly to a lightweight branch through parameter sharing and synchronized backpropagation, avoiding any inference burden. A Holistic SIRST Evaluation (HSE) metric based on multi-threshold integration is also constructed to unify the evaluation of pixel-level confidence and target-level robustness. Result: Experiments show that SIRST detection networks equipped with FDEP reach SOTA performance on multiple public datasets, clearly outperforming existing methods with no additional inference overhead; the HSE metric is also shown to be stable and comprehensive. Conclusion: FDEP offers an effective way to adapt visual foundation models efficiently to infrared small target detection, balancing accuracy gains with inference efficiency, and it advances standardized evaluation in the field. Abstract: While large-scale visual foundation models (VFMs) exhibit strong generalization across diverse visual domains, their potential for single-frame infrared small target (SIRST) detection remains largely unexplored. To fill this gap, we systematically introduce the frozen representations from VFMs into the SIRST task for the first time and propose a Foundation-Driven Efficient Paradigm (FDEP), which can seamlessly adapt to existing encoder-decoder-based methods and significantly improve accuracy without additional inference overhead. Specifically, a Semantic Alignment Modulation Fusion (SAMF) module is designed to achieve dynamic alignment and deep fusion of the global semantic priors from VFMs with task-specific features. Meanwhile, to avoid the inference time burden introduced by VFMs, we propose a Collaborative Optimization-based Implicit Self-Distillation (CO-ISD) strategy, which enables implicit semantic transfer between the main and lightweight branches through parameter sharing and synchronized backpropagation. In addition, to unify the fragmented evaluation system, we construct a Holistic SIRST Evaluation (HSE) metric that performs multi-threshold integral evaluation at both pixel-level confidence and target-level robustness, providing a stable and comprehensive basis for fair model comparison. Extensive experiments demonstrate that the SIRST detection networks equipped with our FDEP framework achieve state-of-the-art (SOTA) performance on multiple public datasets. Our code is available at https://github.com/YuChuang1205/FDEP-Framework

[71] Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning

Chinthani Sugandhika,Chen Li,Deepu Rajan,Basura Fernando

Main category: cs.CV

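TL;DR: Know-Show is a new benchmark for evaluating the spatio-temporal grounded reasoning of video-language models. It exposes the weaknesses of existing models in fine-grained reasoning, and the paper proposes GRAM, a training-free plug-in that strengthens fine-grained grounding.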

Details Motivation: Current video-language models do not ground their multimodal understanding effectively in space and time, which makes their reasoning inaccurate or unreliable; a unified evaluation framework is needed to measure and improve grounded reasoning. Method: The Know-Show benchmark covers five scenarios and is built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions; the proposed GRAM method selects video tokens through attention and explicitly encodes timestamps to strengthen fine-grained grounding. Result: Experiments reveal a marked disconnect between "knowing" and "showing" in existing video-language models, particularly on fine-grained tasks such as hand-object interaction; GRAM improves several mainstream models without any training. Conclusion: Know-Show provides a unified standard for evaluating grounded reasoning in video-language understanding, and GRAM shows the potential of attention-based mechanisms to improve interpretability and reliability, supporting the development of more reliable multimodal reasoning systems. Abstract: Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (e.g., Qwen, VideoLLaVA, GPT-4o, and Gemini) reveal that existing models struggle to "show what they know" and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the code at https://github.com/LUNAProject22/Know-Show.

[72] DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis

Yuhua Wen,Qifei Li,Yingying Zhou,Yingming Gao,Zhengqi Wen,Jianhua Tao,Ya Li

Main category: cs.CV

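TL;DR: This paper proposes DashFusion, a new multimodal sentiment analysis framework that addresses cross-modal temporal and semantic alignment and feature fusion through dual-stream alignment and hierarchical bottleneck fusion, achieving state-of-the-art performance on multiple datasets.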

Details Motivation: Existing methods usually handle alignment or fusion in isolation, which limits the performance and efficiency of multimodal sentiment analysis. Method: Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion) is proposed: (1) a dual-stream alignment module achieves temporal alignment through cross-modal attention and semantic alignment through contrastive learning; (2) supervised contrastive learning refines the modality features; (3) hierarchical fusion through compressed bottleneck tokens improves both efficiency and performance. Result: Experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets show that DashFusion reaches state-of-the-art results on multiple metrics, and ablation studies confirm the effectiveness of the alignment and fusion strategies. Conclusion: DashFusion addresses the alignment and fusion challenges of multimodal sentiment analysis while balancing performance and computational efficiency, and offers a new direction for multimodal fusion. Abstract: Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). First, a dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Second, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The code for our experiments is available at https://github.com/ultramarineX/DashFusion.

[73] VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation

Chinthani Sugandhika,Chen Li,Deepu Rajan,Basura Fernando

Main category: cs.CV

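TL;DR: This paper proposes VOST-SGG, a new one-stage spatio-temporal scene graph generation framework that brings in the common-sense reasoning of vision-language models (VLMs) to address two weaknesses of existing methods: semantically uninformed queries and reliance on a single visual modality.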

Details Motivation: Existing DETR-style one-stage ST-SGG models have two main problems: their learnable queries carry no semantic information and are initialized in an instance-agnostic way, and predicate classification relies only on unimodal visual features. Richer semantic priors and multimodal information are therefore needed to improve performance. Method: The VOST-SGG framework includes a dual-source query initialization strategy that disentangles "what to attend to" from "where to attend", and a multi-modal feature bank that fuses visual, textual, and spatial cues from VLMs for predicate classification. Result: Extensive experiments on the Action Genome dataset show that the method achieves state-of-the-art performance on ST-SGG. Conclusion: Integrating the semantic priors and multimodal features provided by VLMs improves the quality of spatio-temporal scene graph generation and confirms the value of semantic guidance and multimodal fusion. Abstract: Spatio-temporal scene graph generation (ST-SGG) aims to model objects and their evolving relationships across video frames, enabling interpretable representations for downstream reasoning tasks such as video captioning and visual question answering. Despite recent advancements in DETR-style single-stage ST-SGG models, they still suffer from several key limitations. First, while these models rely on attention-based learnable queries as a core component, these learnable queries are semantically uninformed and instance-agnostically initialized. Second, these models rely exclusively on unimodal visual features for predicate classification. To address these challenges, we propose VOST-SGG, a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models (VLMs) into the ST-SGG pipeline. First, we introduce a dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. Furthermore, we propose a multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification. Extensive experiments on the Action Genome dataset demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG. We will release the code at https://github.com/LUNAProject22/VOST.

[74] See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors

Kunyi Yang,Qingyu Wang,Cheng Yuan,Yutong Ban

Main category: cs.CV

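TL;DR: This paper proposes DepSeg, a training-free, depth-guided framework for surgical scene segmentation that uses monocular depth and pretrained vision foundation models for pixel-wise segmentation, reducing reliance on annotations while markedly improving on a SAM2 baseline.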

Details Motivation: Dense annotation is expensive, so pixel-wise segmentation of laparoscopic scenes is hard to scale; a more efficient approach is needed to reduce dependence on large annotated datasets. Method: DepSeg uses a pretrained monocular depth estimation network to produce a relative depth map and derives depth-guided point prompts from it; SAM2 turns these prompts into class-agnostic masks; each mask is then represented by pooled pretrained visual features and classified by template matching against a template bank built from annotated frames. Result: On the CholecSeg8k dataset, DepSeg raises mIoU from 14.7% to 35.9% relative to a direct SAM2 automatic segmentation baseline, and remains competitive with only 10-20% of the object templates. Conclusion: Combining depth-guided prompting with template-based classification offers an annotation-efficient, training-free route to surgical scene segmentation. Abstract: Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that utilizes monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and proposes depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled pretrained visual feature and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 auto segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10-20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.
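
The final step, classifying a class-agnostic mask by template matching, can be sketched as a nearest-template lookup. This is a minimal illustration that assumes cosine similarity and a best-single-template rule; the abstract does not specify the similarity measure, and the toy feature vectors and names below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_mask(mask_feature, template_bank):
    """Return (label, similarity) of the template closest to the mask feature.

    template_bank maps a class label to a list of template feature vectors,
    e.g. pooled features taken from a few annotated frames.
    """
    best_label, best_sim = None, float("-inf")
    for label, templates in template_bank.items():
        for template in templates:
            sim = cosine(mask_feature, template)
            if sim > best_sim:
                best_label, best_sim = label, sim
    return best_label, best_sim


bank = {
    "liver": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "gallbladder": [[0.0, 1.0, 0.0]],
}
print(classify_mask([0.8, 0.2, 0.1], bank))
```

Because the bank is just a dictionary of vectors, using only 10-20% of the templates amounts to shortening each list, with no retraining involved.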

[75] Ideal Observer for Segmentation of Dead Leaves Images

Swantje Mahncke,Malte Ott

Main category: cs.CV

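TL;DR: This paper presents a theoretical approach to a Bayesian ideal observer based on the "dead leaves" model for partitioning image pixels, providing an upper bound on performance in visual segmentation tasks.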

Details Motivation: To study visibility in scenes where overlapping objects occlude one another, and to provide a theoretically optimal benchmark for image segmentation. Method: Starting from the dead leaves generative model, which simulates image formation by layering objects, a Bayesian ideal observer is derived that computes the posterior probability of a pixel partition, and its practical feasibility is analyzed. Result: The paper gives a step-by-step explanation of the posterior computation and identifies the key factors that determine its feasibility; the result can be used to assess human and algorithmic performance on segmentation tasks with a limited number of pixels. Conclusion: The model provides a principled upper bound on performance for studying visual segmentation decisions and can serve as a benchmark for comparing humans and vision algorithms. Abstract: The human visual environment comprises different surfaces that are distributed in space. The parts of a scene that are visible at any one time are governed by the occlusion of overlapping objects. In this work we consider "dead leaves" models, which replicate these occlusions when generating images by layering objects on top of each other. A dead leaves model is a generative model comprising distributions for object position, shape, color, and texture. An image is generated from a dead leaves model by sampling objects ("leaves") from these distributions until a stopping criterion is reached, usually when the image is fully covered or a given number of leaves has been sampled. Here, we describe a theoretical approach, based on previous work, to derive a Bayesian ideal observer for the partition of a given set of pixels based on independent dead leaves model distributions. Extending previous work, we provide step-by-step explanations for the computation of the posterior probability and describe factors that determine the feasibility of practically applying this computation. The dead leaves image model and the associated ideal observer can be applied to study segmentation decisions in a limited number of pixels, providing a principled upper bound on performance, to which humans and vision algorithms could be compared.
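
The generative process described in the abstract (sample leaves until the image is covered or a leaf budget is spent) is short enough to sketch. The choices below are assumptions for illustration: leaves are disks, positions and radii are uniform, and each new leaf is placed beneath the ones already sampled, so it only fills pixels that are still empty. The paper's own shape, color, and texture distributions are not reproduced.

```python
import random


def dead_leaves(width, height, max_leaves=10_000, radius_range=(2.0, 6.0), seed=0):
    """Sample a dead leaves label map: labels[y][x] is the id of the visible leaf.

    Leaves are sampled front to back, so leaf 0 is on top and later leaves
    show only where no earlier leaf covers the pixel.
    """
    rng = random.Random(seed)
    labels = [[None] * width for _ in range(height)]
    uncovered = width * height
    leaf_id = 0
    while uncovered > 0 and leaf_id < max_leaves:
        cx, cy = rng.uniform(0, width), rng.uniform(0, height)
        r = rng.uniform(*radius_range)
        for y in range(height):
            for x in range(width):
                inside = (x - cx) ** 2 + (y - cy) ** 2 <= r * r
                if inside and labels[y][x] is None:
                    labels[y][x] = leaf_id
                    uncovered -= 1
        leaf_id += 1
    return labels


labels = dead_leaves(16, 12, seed=1)
num_visible = len({v for row in labels for v in row})
print(f"{num_visible} leaves are at least partly visible")
```

The label map is exactly the ground-truth partition of the pixels, which is the quantity an ideal observer would reason about given only the rendered image.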

[76] Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models

Weijue Bu,Guan Yuan,Guixian Zhang

Main category: cs.CV

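TL;DR: This paper proposes Conscious Gaze (CG-VLM), a training-free, inference-time framework that uses game-theoretic interpretability to control decoding in vision-language models and mitigates object hallucinations caused by text inertia.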

Details Motivation: Large vision-language models (VLMs) often hallucinate objects because attention drifts away from visual evidence; existing decoding strategies cannot correct internal reasoning drift, and current internal-control methods lack a principled basis. Method: In the CG-VLM framework, a Cognitive Demand Sensor built on Harsanyi interactions estimates vision-text synergy and decides when visual grounding needs to be strengthened; a Focused Consensus Induction module then uses this signal to selectively redirect mid-layer attention toward visual tokens. Result: On the POPE and CHAIR benchmarks, CG-VLM achieves state-of-the-art results across InstructBLIP, LLaVA, Qwen-VL, and mPLUG while preserving the models' general capabilities. Conclusion: Precise, context-aware intervention driven by token-level sensing is feasible and effective, and CG-VLM offers a theoretically grounded approach to text inertia in VLMs. Abstract: Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.

[77] 2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency

Xingxi Yin,Yicheng Li,Gong Yan,Chenglin Li,Jian Zhao,Cong Huang,Yue Deng,Yin Zhang

Main category: cs.CV

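TL;DR: This paper introduces 2K-Characters-10K-Stories, a large-scale multi-modal narrative dataset that is the first to provide decoupled control over character identity and transient attributes (such as pose and expression) in visual storytelling, with a human-in-the-loop pipeline to ensure high-quality generation.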

Details Motivation: Existing datasets do not separate a character's stable identity from its transient attributes, so controllable visual storytelling cannot generate sequences with reliable consistency, and structured control over pose, expression, and scene is limited. Method: A Human-in-the-Loop (HiL) pipeline combines expert-verified character templates with LLM-guided narrative planning to produce highly aligned structured data; a decoupled control scheme separates persistent identity from transient attributes, and a quality-gated loop that integrates MMLM evaluation, automatic prompt tuning, and local image editing enforces pixel-level consistency. Result: Experiments show that models fine-tuned on the dataset match closed-source models in visual narrative generation, with clearly better sequential identity consistency and controllability. Conclusion: 2K-Characters-10K-Stories provides a new benchmark for controllable visual storytelling; the proposed approach resolves the entanglement of identity and attributes and advances high-fidelity, structured, controllable content generation. Abstract: Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset of 2,000 uniquely stylized characters appearing across 10,000 illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a Human-in-the-Loop pipeline (HiL) that leverages expert-verified character templates and LLM-guided narrative planning to generate highly aligned structured data. A decoupled control scheme separates persistent identity from transient attributes (pose and expression), while a Quality-Gated loop integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.

[78] ProPhy: Progressive Physical Alignment for Dynamic World Simulation

Zijun Wang,Panwen Hu,Jing Wang,Terry Jingchen Zhang,Yuhao Cheng,Long Chen,Yiqiang Yan,Zutao Jiang,Hanhui Li,Xiaodan Liang

Main category: cs.CV

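TL;DR: This paper proposes ProPhy, a progressive physical alignment framework that uses a two-stage mixture of physics experts to provide explicit physics-aware conditioning and anisotropic video generation, improving the physical consistency of generated videos.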

Details Motivation: Existing video generation models struggle to stay physically consistent when handling large-scale or complex dynamics, and they lack fine-grained alignment with localized physical cues. Method: The ProPhy framework uses a two-stage Mixture-of-Physics-Experts (MoPE) mechanism: Semantic Experts infer semantic-level physical principles from the text description, and Refinement Experts capture token-level physical dynamics; a physical alignment strategy transfers the physical reasoning ability of vision-language models to the Refinement Experts. Result: Experiments on several physics-aware video generation benchmarks show that ProPhy produces more realistic, dynamic, and physically consistent results than existing state-of-the-art methods. Conclusion: Through fine-grained extraction of physical priors and anisotropic generation, ProPhy markedly improves the physical plausibility of generated videos and offers a new direction for building physics-aware world simulators. Abstract: Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.

[79] MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging

Xingyu Zhang,Anna Reithmeir,Fryderyk Kögl,Rickmer Braren,Julia A. Schnabel,Daniel M. Lang

Main category: cs.CV

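TL;DR: MedDIFT is a training-free framework for 3D medical image correspondence that uses multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. By fusing diffusion activations and matching them with cosine similarity, it reaches accuracy comparable to or better than existing methods without any task-specific training.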

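Details Motivation: Conventional medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and tend to produce mismatches in low-contrast or anatomically variable regions. The intermediate representations of diffusion models have been found to carry rich geometric and semantic information, and the authors aim to use this property to improve correspondence accuracy. Method: MedDIFT extracts multi-scale features from a pretrained latent medical diffusion model as voxel descriptors, fuses diffusion activations from different levels, and matches voxels by cosine similarity, with an optional local-search prior to refine the correspondences. No additional training for the registration task is required. Result: On a public lung CT dataset, MedDIFT matches the accuracy of the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline registration; ablations show that multi-level feature fusion and a modest amount of diffusion noise improve performance. Conclusion: MedDIFT exploits the multi-scale semantic and geometric information in a pretrained diffusion model to achieve accurate 3D medical image correspondence without fine-tuning, and offers a new direction for training-free medical image analysis. Abstract: Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline-based registration, without requiring any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance.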
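
Descriptor fusion followed by cosine matching with a local-search prior can be sketched as below. The fusion rule (concatenating L2-normalised per-level features) and the spherical search window are assumptions chosen for illustration; the abstract says only that activations are fused and that the local-search prior is optional.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def fuse(levels):
    """Concatenate L2-normalised per-level features into one unit descriptor."""
    fused = []
    for feature in levels:
        fused.extend(l2_normalize(feature))
    return l2_normalize(fused)


def match_voxel(query_desc, query_pos, target_descs, radius=None):
    """Find the target voxel whose descriptor is most similar to the query.

    target_descs maps an (x, y, z) position to a unit descriptor, so the dot
    product equals cosine similarity. If radius is given, only voxels within
    that distance of query_pos are considered (the local-search prior).
    """
    best_pos, best_sim = None, float("-inf")
    for pos, desc in target_descs.items():
        if radius is not None and math.dist(pos, query_pos) > radius:
            continue
        sim = sum(a * b for a, b in zip(query_desc, desc))
        if sim > best_sim:
            best_pos, best_sim = pos, sim
    return best_pos, best_sim


target = {
    (0, 0, 0): fuse([[1.0, 0.0], [0.0, 1.0]]),
    (1, 0, 0): fuse([[0.0, 1.0], [1.0, 0.0]]),
    (9, 9, 9): fuse([[1.0, 0.0], [0.0, 1.0]]),  # look-alike far from the query
}
query = fuse([[1.0, 0.0], [0.0, 1.0]])
print(match_voxel(query, (1, 1, 0), target, radius=2.0))
```

The distant look-alike at (9, 9, 9) illustrates why a local-search prior helps: repeated anatomy can yield near-identical descriptors, and the window rules out implausibly large displacements.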

[80] Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer

Rong Wang,Wei Mao,Changsheng Lu,Hongdong Li

Main category: cs.CV

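TL;DR: This paper proposes a skinning-free method for generating 3D garment deformations. It decouples low-frequency shape from high-frequency wrinkles by estimating vertex positions and normals independently, and uses image transfer with pretrained image models to improve detail quality.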

Details Motivation: Existing methods based on linear blend skinning lack explicit skinning supervision, so posed garments are often misaligned, which harms the recovery of wrinkle detail. Method: Vertex positions (low-frequency deformation) and vertex normals (high-frequency detail) are predicted independently; both are encoded as texture images so that deformation becomes a 2D image transfer problem, pretrained image models are used to enhance detail, and a multimodal fusion step recovers the deformed 3D garment. Result: The method significantly improves animation quality across a range of garment types and recovers finer wrinkles than current state-of-the-art methods. Conclusion: The method decouples the frequency components of garment deformation, handles garments of different topologies without manual UV partition, and offers good generalization and visual quality. Abstract: We present a novel method for generating 3D garment deformations from given body poses, which is key to a wide range of applications, including virtual try-on and extended reality. To simplify the cloth dynamics, existing methods mostly rely on linear blend skinning to obtain the low-frequency posed garment shape and only regress high-frequency wrinkles. However, due to the lack of explicit skinning supervision, such skinning-based approaches often produce misaligned shapes when posing the garment, which corrupts the high-frequency signals and prevents the recovery of high-fidelity wrinkles. To tackle this issue, we propose a skinning-free approach that independently estimates the posed (i) vertex position for the low-frequency posed garment shape, and (ii) vertex normal for high-frequency local wrinkle details. In this way, each frequency modality can be effectively decoupled and directly supervised by the geometry of the deformed garment. To further improve the visual quality of animation, we propose to encode both vertex attributes as rendered texture images, so that 3D garment deformation can be equivalently achieved via 2D image transfer. This enables us to leverage powerful pretrained image models to recover fine-grained visual details in wrinkles, while maintaining superior scalability for garments of diverse topologies without relying on manual UV partition. Finally, we propose a multimodal fusion to incorporate constraints from both frequency modalities and robustly recover deformed 3D garments from transferred images. Extensive experiments show that our method significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods.

[81] Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction

Ruihong Yin,Xuepeng Shi,Oleksandr Bailo,Marco Manfredi,Theo Gevers

Main category: cs.CV

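TL;DR: Fast SceneScript is a new structured language model for efficient and accurate 3D scene layout estimation; multi-token prediction and confidence-guided decoding speed up inference substantially while preserving accuracy.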

Details Motivation: Existing perception-generalist approaches based on language models rely on autoregressive next-token prediction, which makes inference slow and limits practical use. Method: Fast SceneScript uses multi-token prediction (MTP) to reduce the number of autoregressive steps, combines an adapted self-speculative decoding (SSD) scheme with a newly designed confidence-guided decoding (CGD) scheme to improve token reliability, and adds a parameter-efficient mechanism to reduce the parameter overhead of MTP. Result: Experiments on the ASE and Structured3D benchmarks show that up to 9 tokens can be generated per decoding step without loss of accuracy, at a cost of only about 7.5% additional parameters. Conclusion: Fast SceneScript substantially improves the inference efficiency of 3D scene layout estimation while maintaining high accuracy, offering an effective approach for deploying structured language models in practice. Abstract: Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on the ASE and Structured3D benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only about 7.5% additional parameters.
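
The interplay between multi-token prediction and a confidence filter can be sketched with a toy decoding loop: each step proposes several tokens, and only the leading run whose confidence clears a threshold is kept. The prefix-acceptance rule, the threshold value, and the `propose` callback are assumptions for illustration; the paper's actual CGD scoring mechanism is not described in the abstract.

```python
def accept_prefix(tokens, confidences, threshold=0.9):
    """Keep the first token, then following tokens while confidence holds."""
    if not tokens:
        raise ValueError("propose() must return at least one token")
    accepted = [tokens[0]]  # the first token is always kept so decoding advances
    for token, conf in zip(tokens[1:], confidences[1:]):
        if conf < threshold:
            break
        accepted.append(token)
    return accepted


def decode(propose, max_len, threshold=0.9):
    """Run multi-token decoding; returns the sequence and the number of steps.

    propose(seq) stands in for one decoder forward pass and returns a pair
    (tokens, confidences) predicting the next few positions after seq.
    """
    seq, steps = [], 0
    while len(seq) < max_len:
        tokens, confidences = propose(seq)
        seq.extend(accept_prefix(tokens, confidences, threshold))
        steps += 1
    return seq[:max_len], steps


# Toy proposer: predicts three tokens ahead, but the third is low-confidence.
TARGET = ["wall", "x0", "y0", "x1", "y1", "stop"]


def toy_propose(seq):
    tokens = TARGET[len(seq):len(seq) + 3]
    return tokens, [1.0, 0.95, 0.5][:len(tokens)]


sequence, num_steps = decode(toy_propose, max_len=len(TARGET))
print(sequence, num_steps)
```

With this toy proposer two tokens are accepted per step, so six tokens take three decoder passes instead of six; raising the threshold trades speed for reliability.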

[82] NormalView: sensor-agnostic tree species classification from backpack and aerial lidar data using geometric projections

Juho Korkeala,Jesse Muhojoki,Josef Taher,Klaara Salolahti,Matti Hyyppä,Antero Kukko,Juha Hyyppä

Main category: cs.CV

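TL;DR: This paper proposes NormalView, a sensor-agnostic, projection-based deep learning method that classifies tree species from point clouds using normal vector estimates and a YOLOv11 network. It also shows that multispectral intensity information improves classification, and it performs well on both MLS and ALS data.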

Details Motivation: To achieve accurate tree species classification across sensor types, removing the dependence of conventional methods on a specific sensor and making full use of the geometric and spectral information in point clouds. Method: NormalView projects local geometric information (normal vectors) from the point cloud into 2D images that serve as input to a YOLOv11 image classification network; the effect of multispectral radiometric intensity on classification is also studied, with training and testing on high-density MLS and ALS data. Result: Overall accuracy reaches 95.5% on the MLS data (macro-average 94.8%) and 91.8% on the ALS data (macro-average 79.1%); the model using intensity from all three channels of the multispectral ALS performs best. Conclusion: Projection-based methods that combine geometric information with a modern image classification network achieve excellent tree species classification and are sensor-agnostic, giving them broad applicability; the MLS dataset used in the study is released publicly. Abstract: Laser scanning has proven to be an invaluable tool in assessing the decomposition of forest environments. Mobile laser scanning (MLS) has been shown to be highly promising for extremely accurate, tree-level inventory. In this study, we present NormalView, a sensor-agnostic projection-based deep learning method for classifying tree species from point cloud data. NormalView embeds local geometric information into two-dimensional projections, in the form of normal vector estimates, and uses the projections as inputs to an image classification network, YOLOv11. In addition, we inspected the effect of multispectral radiometric intensity information on classification performance. We trained and tested our model on high-density MLS data (7 species, ~5000 pts/m^2), as well as high-density airborne laser scanning (ALS) data (9 species, >1000 pts/m^2). On the MLS data, NormalView achieves an overall accuracy (macro-average accuracy) of 95.5% (94.8%), and 91.8% (79.1%) on the ALS data. We found that having intensity information from multiple scanners provides benefits in tree species classification, and the best model on the multispectral ALS dataset was a model using intensity information from all three channels of the multispectral ALS. This study demonstrates that projection-based methods, when enhanced with geometric information and coupled with state-of-the-art image classification backbones, can achieve exceptional results. Crucially, these methods are sensor-agnostic, relying only on geometric information. Additionally, we publicly release the MLS dataset used in the study.
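
The gap between the two ALS figures (91.8% overall versus 79.1% macro-average) is easier to read with the two metrics side by side. The sketch below assumes that "macro-average accuracy" means the unweighted mean of per-class accuracies, which the abstract does not define; the species labels and counts are made up for illustration.

```python
from collections import defaultdict


def overall_and_macro_accuracy(y_true, y_pred):
    """Return (overall accuracy, mean of per-class accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    overall = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, macro


# Imbalanced toy data: the common species is easy, the rare one is not.
y_true = ["pine"] * 8 + ["birch"] * 2
y_pred = ["pine"] * 8 + ["pine", "birch"]
overall, macro = overall_and_macro_accuracy(y_true, y_pred)
print(f"overall={overall:.2f} macro={macro:.2f}")
```

Overall accuracy is dominated by the frequent class, while the macro average weights every species equally, so a large gap between the two suggests weaker performance on rarer species.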

[83] DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model

Pasquale De Marinis,Pieter M. Blok,Uzay Kaymak,Rogier Brussee,Gennaro Vessio,Giovanna Castellano

Main category: cs.CV

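TL;DR: Proposes DistillFSS, a framework for cross-domain few-shot semantic segmentation that embeds support-set knowledge into model parameters via teacher-student distillation, enabling efficient inference without support images at test time. It performs well on a newly proposed, challenging CD-FSS benchmark.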

Details Motivation: Because of substantial distribution shifts between source and target domains, disjoint label spaces, and scarce support images, existing few-shot segmentation methods are unreliable and computationally expensive in cross-domain settings, so a more efficient and scalable solution is needed. Method: Proposes the DistillFSS framework, which uses teacher-student distillation to encode support-set knowledge into a dedicated layer of the student network so that the model needs no access to support images at test time; combined with fine-tuning, it scales to large support sets and reduces computational overhead. Result: Experiments on a new CD-FSS benchmark covering medical imaging, industrial inspection, and remote sensing show that DistillFSS matches or surpasses state-of-the-art methods in multi-class and multi-shot scenarios while substantially improving inference efficiency. Conclusion: Through knowledge distillation, DistillFSS achieves efficient, lightweight cross-domain few-shot semantic segmentation, removes the dependence on test-time support images, and offers good scalability and practical potential. Abstract: Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) seeks to segment unknown classes in unseen domains using only a few annotated examples. This setting is inherently challenging: source and target domains exhibit substantial distribution shifts, label spaces are disjoint, and support images are scarce, making standard episodic methods unreliable and computationally demanding at test time. To address these constraints, we propose DistillFSS, a framework that embeds support-set knowledge directly into a model's parameters through a teacher-student distillation process. By internalizing few-shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time, enabling fast, lightweight inference, while allowing efficient extension to novel classes in unseen domains through rapid teacher-driven specialization. Combined with fine-tuning, the approach scales efficiently to large support sets and significantly reduces computational overhead. To evaluate the framework under realistic conditions, we introduce a new CD-FSS benchmark spanning medical imaging, industrial inspection, and remote sensing, with disjoint label spaces and variable support sizes. Experiments show that DistillFSS matches or surpasses state-of-the-art baselines, particularly in multi-class and multi-shot scenarios, while offering substantial efficiency gains. The code is available at https://github.com/pasqualedem/DistillFSS.

[84] Experts-Guided Unbalanced Optimal Transport for ISP Learning from Unpaired and/or Paired Data

Georgy Perevozchikov,Nancy Mehta,Egor Ershov,Radu Timofte

Main category: cs.CV

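TL;DR: Proposes an unsupervised training framework based on Unbalanced Optimal Transport (UOT) that can train arbitrary image signal processing (ISP) architectures without relying on paired raw-to-sRGB data, matching or even surpassing supervised methods.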

Details Motivation: Existing learned ISPs depend heavily on large amounts of paired raw-to-sRGB data, which are costly to acquire and limit their applicability, so a training framework that does not depend on paired data is needed to ease this bottleneck. Method: Proposes an unsupervised training framework based on Unbalanced Optimal Transport (UOT) and introduces a "committee of expert discriminators" as a hybrid adversarial regularizer that provides targeted gradients for color fidelity, structural artifacts, and frequency-domain realism to guide the optimal transport mapping. The framework can train arbitrary ISP architectures in both paired and unpaired modes. Result: In paired mode, the framework exceeds the original state-of-the-art methods on all metrics; in unpaired mode, its performance still rivals and in some cases surpasses the original paired-trained methods, confirming its effectiveness and robustness. Conclusion: The UOT-based framework enables effective unsupervised training of ISP models, substantially reducing dependence on paired data while maintaining or improving image quality, with strong quantitative and qualitative results. Abstract: Learned Image Signal Processing (ISP) pipelines offer powerful end-to-end performance but are critically dependent on large-scale paired raw-to-sRGB datasets. This reliance on costly-to-acquire paired data remains a significant bottleneck. To address this challenge, we introduce a novel, unsupervised training framework based on Optimal Transport capable of training arbitrary ISP architectures in both unpaired and paired modes. We are the first to successfully apply Unbalanced Optimal Transport (UOT) for this complex, cross-domain translation task. Our UOT-based framework provides robustness to outliers in the target sRGB data, allowing it to discount atypical samples that would be prohibitively costly to map. A key component of our framework is a novel "committee of expert discriminators," a hybrid adversarial regularizer. This committee guides the optimal transport mapping by providing specialized, targeted gradients to correct specific ISP failure modes, including color fidelity, structural artifacts, and frequency-domain realism. To demonstrate the superiority of our approach, we retrained existing state-of-the-art ISP architectures using our paired and unpaired setups. Our experiments show that while our framework, when trained in paired mode, exceeds the performance of the original paired methods across all metrics, our unpaired mode concurrently achieves quantitative and qualitative performance that rivals, and in some cases surpasses, the original paired-trained counterparts. The code and pre-trained models are available at: https://github.com/gosha20777/EGUOT-ISP.git.

[85] Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective

Nan Zhong,Mian Zou,Yiran Xu,Zhenxing Qian,Xinpeng Zhang,Baoyuan Wu,Kede Ma

Main category: cs.CV

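TL;DR: Proposes a self-supervised method based on EXIF metadata for detecting AI-generated images, showing strong generalization and robustness across generative models and in real-world settings.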

Details Motivation: Existing AI-generated image detectors depend on assumptions about the internals of specific generative models and generalize poorly across models, so a general detection method that does not rely on model priors is needed. Method: Designs a self-supervised pretext task from the EXIF tags of camera photographs: classifying categorical tags (e.g., camera model) and ranking numerical tags (e.g., focal length). The learned features are used to build a one-class Gaussian mixture model for anomaly detection, which is then extended to a binary detector operating on high-frequency residuals from scrambled image patches. Result: Substantially outperforms existing methods across a variety of generative models and real-world data, with strong generalization and robustness to common image perturbations. Conclusion: Self-supervised learning from EXIF metadata effectively captures features intrinsic to real photographs and offers a new, reliable approach to AI-generated image detection. Abstract: The proliferation of AI-generated imagery poses escalating challenges for multimedia forensics, yet many existing detectors depend on assumptions about the internals of specific generative models, limiting their cross-model applicability. We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata, specifically exchangeable image file format (EXIF) tags, to learn features intrinsic to digital photography. Our pretext task trains a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags (e.g., camera model and scene type) and pairwise-ranking ordinal and continuous EXIF tags (e.g., focal length and aperture value). Using these EXIF-induced features, we first perform one-class detection by modeling the distribution of photographic images with a Gaussian mixture model and flagging low-likelihood samples as AI-generated. We then extend to binary detection that treats the learned extractor as a strong regularizer for a classifier of the same architecture, operating on high-frequency residuals from spatially scrambled patches. Extensive experiments across various generative models demonstrate that our EXIF-induced detectors substantially advance the state of the art, delivering strong generalization to in-the-wild samples and robustness to common benign image perturbations.
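The pairwise-ranking part of the pretext task can be illustrated with a generic logistic ranking loss. This is a sketch under assumptions: the abstract does not specify the loss form, and `pairwise_ranking_loss`, its arguments, and the focal-length values are hypothetical. The idea is that for every pair of photos with different tag values, the photo with the larger tag value should receive the larger predicted score.

```python
import numpy as np

def pairwise_ranking_loss(scores, tag_values) -> float:
    """Mean logistic loss over all ordered pairs whose tag values differ.

    scores:     predicted scalar per image (hypothetical model output)
    tag_values: ordinal or continuous EXIF tag per image, e.g. focal length
    """
    scores = np.asarray(scores, dtype=float)
    tag_values = np.asarray(tag_values, dtype=float)
    score_diff = scores[:, None] - scores[None, :]
    # +1 where image i should rank above image j, -1 where below, 0 for ties.
    target = np.sign(tag_values[:, None] - tag_values[None, :])
    mask = target != 0
    if not mask.any():
        return 0.0
    # log(1 + exp(-target * score_diff)), computed stably.
    losses = np.logaddexp(0.0, -target * score_diff)
    return float(losses[mask].mean())

focal_lengths = [18.0, 35.0, 50.0]
well_ordered = pairwise_ranking_loss([1.0, 2.0, 3.0], focal_lengths)
reversed_order = pairwise_ranking_loss([3.0, 2.0, 1.0], focal_lengths)
```

Scores that agree with the tag ordering give a loss below log 2, while reversed scores give a loss above it, which is the signal a feature extractor would be trained to minimize.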

[86] LeAD-M3D: Leveraging Asymmetric Distillation for Real-time Monocular 3D Detection

Johannes Meier,Jonathan Michel,Oussema Dhaouadi,Yung-Hsu Yang,Christoph Reich,Zuria Bauer,Stefan Roth,Marc Pollefeys,Jacques Kaiser,Daniel Cremers

Main category: cs.CV

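TL;DR: Proposes LeAD-M3D, a monocular 3D detector that needs no LiDAR, stereo, or geometric priors and achieves high accuracy with real-time inference through A2D2, CM3D, and CGI3D.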

Details Motivation: To address severe depth ambiguity, viewpoint shifts, and high computational cost in monocular 3D detection without relying on LiDAR or sacrificing efficiency. Method: Introduces A2D2 to transfer depth knowledge without LiDAR supervision, CM3D to improve matching by incorporating 3D MGIoU, and CGI3D to accelerate 3D regression through confidence gating. Result: Achieves state-of-the-art accuracy on KITTI and Waymo and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Conclusion: High-fidelity, real-time monocular 3D detection is achievable without extra modalities or geometric assumptions. Abstract: Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency to achieve competitive accuracy. We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is powered by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a mixup-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR supervision. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground-truth assignment by integrating 3D MGIoU into the matching score, yielding more stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to top-confidence regions. Together, these components set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Our results demonstrate that high fidelity and real-time efficiency in monocular 3D detection are simultaneously attainable without LiDAR, stereo, or geometric assumptions.

[87] Deep Learning-Based Real-Time Sequential Facial Expression Analysis Using Geometric Features

Talha Enes Koksal,Abdurrahman Gumus

Main category: cs.CV

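TL;DR: Proposes a real-time sequential facial expression recognition method based on MediaPipe FaceMesh and ConvLSTM1D that combines geometric features with temporal dynamics, achieves efficient and accurate performance on several datasets, and releases its code.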

Details Motivation: To improve the real-time performance and accuracy of human-computer interaction and emotion-aware systems, a method that can quickly detect and interpret changes in facial expression is needed. Method: Uses MediaPipe FaceMesh for facial landmark detection, extracts geometric features such as Euclidean distances and angles, models the temporal dynamics of expressions (onset, apex, and offset phases) through feature differences between consecutive frames, and classifies with a ConvLSTM1D network followed by multilayer perceptron blocks. Result: Accuracies of 93%, 79%, 77%, and 68% on CK+, Oulu-CASIA (VIS and NIR), and MMI, respectively, at roughly 165 frames per second, showing good real-time capability and generalization. Conclusion: The method delivers fast, accurate, and generalizable facial expression recognition on consumer-grade hardware, advancing emotion-aware technology and supporting more sophisticated human-computer interaction systems. Abstract: Facial expression recognition is a crucial component in enhancing human-computer interaction and developing emotion-aware systems. Real-time detection and interpretation of facial expressions have become increasingly important for various applications, from user experience personalization to intelligent surveillance systems. This study presents a novel approach to real-time sequential facial expression recognition using deep learning and geometric features. The proposed method utilizes MediaPipe FaceMesh for rapid and accurate facial landmark detection. Geometric features, including Euclidean distances and angles, are extracted from these landmarks. Temporal dynamics are incorporated by analyzing feature differences between consecutive frames, enabling the detection of onset, apex, and offset phases of expressions. For classification, a ConvLSTM1D network followed by multilayer perceptron blocks is employed. The method's performance was evaluated on multiple publicly available datasets, including CK+, Oulu-CASIA (VIS and NIR), and MMI. Accuracies of 93%, 79%, 77%, and 68% were achieved respectively. Experiments with composite datasets were also conducted to assess the model's generalization capabilities. The approach demonstrated real-time applicability, processing approximately 165 frames per second on consumer-grade hardware. This research contributes to the field of facial expression analysis by providing a fast, accurate, and adaptable solution. The findings highlight the potential for further advancements in emotion-aware technologies and personalized user experiences, paving the way for more sophisticated human-computer interaction systems. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: https://github.com/miralab-ai/facial-expression-analysis.
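The geometric feature step described in the abstract (distances, angles, and differences between consecutive frames) can be sketched as follows. Which landmark pairs and triplets the authors use is not stated, so the indices here are placeholders, and the landmark arrays stand in for whatever a face-mesh detector would return; no MediaPipe calls are made.

```python
import numpy as np

def distances(landmarks: np.ndarray, pairs: list[tuple[int, int]]) -> np.ndarray:
    """Euclidean distance between each (i, j) landmark pair."""
    i, j = np.array(pairs).T
    return np.linalg.norm(landmarks[i] - landmarks[j], axis=1)

def angles(landmarks: np.ndarray, triplets: list[tuple[int, int, int]]) -> np.ndarray:
    """Angle in radians at the middle landmark of each (a, vertex, b) triplet."""
    a, v, b = np.array(triplets).T
    r1 = landmarks[a] - landmarks[v]
    r2 = landmarks[b] - landmarks[v]
    cos = np.sum(r1 * r2, axis=1) / (
        np.linalg.norm(r1, axis=1) * np.linalg.norm(r2, axis=1)
    )
    return np.arccos(np.clip(cos, -1.0, 1.0))

def sequence_features(frames, pairs, triplets):
    """Per-frame features plus frame-to-frame differences for a landmark sequence."""
    per_frame = np.stack(
        [np.concatenate([distances(f, pairs), angles(f, triplets)]) for f in frames]
    )
    deltas = np.diff(per_frame, axis=0)  # row t holds features[t + 1] - features[t]
    return per_frame, deltas

# Two toy frames with three 2D landmarks each (placeholder coordinates).
frames = [
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]),
]
per_frame, deltas = sequence_features(frames, pairs=[(0, 1)], triplets=[(1, 0, 2)])
```

In this toy input the distance grows by one unit between frames while the angle stays at a right angle, so the delta row isolates exactly the motion that changed.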

[88] InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem

Yeobin Hong,Suhyeon Lee,Hyungjin Chung,Jong Chul Ye

Main category: cs.CV

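TL;DR: Proposes InverseCrafter, an efficient diffusion-based 4D video generation method that recasts generation as an inpainting problem in latent space, avoiding costly fine-tuning.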

Details Motivation: Existing controllable 4D video generation methods rely on fine-tuning pre-trained video diffusion models, which is computationally expensive and prone to forgetting the original generative priors. Method: Proposes InverseCrafter, an inverse solver that models 4D generation as an inpainting problem in latent space, with a mechanism that encodes the pixel-space degradation operator into a continuous, multi-channel latent mask. Result: Achieves comparable novel view generation and superior measurement consistency in camera control tasks with near-zero computational overhead, and performs well on general-purpose video inpainting and editing. Conclusion: InverseCrafter provides an efficient, fine-tuning-free framework for 4D video generation that preserves the original model's generative ability while supporting a range of editing applications. Abstract: Recent approaches to controllable 4D video generation often rely on fine-tuning pre-trained Video Diffusion Models (VDMs). This dominant paradigm is computationally expensive, requiring large-scale datasets and architectural modifications, and frequently suffers from catastrophic forgetting of the model's original generative priors. Here, we propose InverseCrafter, an efficient inpainting inverse solver that reformulates the 4D generation task as an inpainting problem solved in the latent space. The core of our method is a principled mechanism to encode the pixel space degradation operator into a continuous, multi-channel latent mask, thereby bypassing the costly bottleneck of repeated VAE operations and backpropagation. InverseCrafter not only achieves comparable novel view generation and superior measurement consistency in camera control tasks with near-zero computational overhead, but also excels at general-purpose video inpainting with editing. Code is available at https://github.com/yeobinhong/InverseCrafter.

[89] Hyperspectral Unmixing with 3D Convolutional Sparse Coding and Projected Simplex Volume Maximization

Gargi Panda,Soumitra Kundu,Saumik Bhattacharya,Aurobinda Routray

Main category: cs.CV

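TL;DR: Proposes 3D-CSCNet, an algorithm-unrolling-based 3D convolutional sparse coding network that combines an autoencoder framework with projected simplex volume maximization (PSVM) for hyperspectral unmixing and outperforms existing methods.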

Details Motivation: Existing unrolling-based networks for hyperspectral unmixing do not fully model spectral and spatial information jointly and lack an effective endmember initialization strategy. Method: Builds an autoencoder network based on 3D convolutional sparse coding (3D-CSCNet), designs a 3D-CSCB block derived through deep algorithm unrolling, and uses the PSVM algorithm to initialize the decoder weights as endmember estimates. Result: Validated on three real datasets and one simulated dataset, outperforming current state-of-the-art methods at different signal-to-noise ratios. Conclusion: 3D-CSCNet effectively learns joint spatial-spectral features of hyperspectral data, and the PSVM initialization improves endmember and abundance estimation accuracy. Abstract: Hyperspectral unmixing (HSU) aims to separate each pixel into its constituent endmembers and estimate their corresponding abundance fractions. This work presents an algorithm-unrolling-based network for the HSU task, named the 3D Convolutional Sparse Coding Network (3D-CSCNet), built upon a 3D CSC model. Unlike existing unrolling-based networks, our 3D-CSCNet is designed within the powerful autoencoder (AE) framework. Specifically, to solve the 3D CSC problem, we propose a 3D CSC block (3D-CSCB) derived through deep algorithm unrolling. Given a hyperspectral image (HSI), 3D-CSCNet employs the 3D-CSCB to estimate the abundance matrix. The use of 3D CSC enables joint learning of spectral and spatial relationships in the 3D HSI data cube. The estimated abundance matrix is then passed to the AE decoder to reconstruct the HSI, and the decoder weights are extracted as the endmember matrix. Additionally, we propose a projected simplex volume maximization (PSVM) algorithm for endmember estimation, and the resulting endmembers are used to initialize the decoder weights of 3D-CSCNet. Extensive experiments on three real datasets and one simulated dataset with three different signal-to-noise ratio (SNR) levels demonstrate that our 3D-CSCNet outperforms state-of-the-art methods.
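Two quantities underlie this abstract: the linear mixing model a linear decoder implements (pixel = abundances times endmembers) and the volume of the simplex spanned by candidate endmembers, which volume-maximization methods try to enlarge. The sketch below computes both; it is not the PSVM algorithm itself, whose projection and search steps are not described in the abstract, and the function names are illustrative.

```python
from math import factorial

import numpy as np

def simplex_volume(vertices) -> float:
    """Volume of the simplex with the given (p + 1) vertices in d >= p dimensions.

    Uses the Gram determinant so the vertices may live in a space of higher
    dimension than the simplex itself (e.g. 3 endmembers over 200 bands).
    """
    v = np.asarray(vertices, dtype=float)
    edges = (v[1:] - v[0]).T              # shape (d, p): edge vectors from vertex 0
    gram = edges.T @ edges                # shape (p, p)
    p = edges.shape[1]
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)) / factorial(p))

def mix(endmembers: np.ndarray, abundances: np.ndarray) -> np.ndarray:
    """Linear mixing model: each pixel is an abundance-weighted sum of endmembers."""
    return abundances @ endmembers

# Three toy endmembers over three bands; a pixel that is half material 0, half material 1.
endmembers = np.eye(3)
pixel = mix(endmembers, np.array([[0.5, 0.5, 0.0]]))
area = simplex_volume(endmembers)
```

Because abundances are nonnegative and sum to one, every mixed pixel lies inside the endmember simplex, which is why a larger enclosing simplex volume is a useful criterion when picking endmember candidates.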

[90] Physics-Informed Graph Neural Network with Frequency-Aware Learning for Optical Aberration Correction

Yong En Kok,Bowen Deng,Alexander Bentley,Andrew J. Parkes,Michael G. Somekh,Amanda J. Wright,Michael P. Pound

Main category: cs.CV

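TL;DR: Proposes ZRNet, a physics-informed framework that jointly predicts Zernike coefficients and restores optical images. By explicitly modeling physical relationships between Zernike polynomials and adding a frequency-domain alignment loss, it achieves state-of-the-art performance across microscopy modalities and biological samples.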

Details Motivation: Existing methods usually handle only mild aberrations on limited sample types and modalities, and mostly treat the problem as a black-box mapping without using the optical physics of wavefront distortion, so an approach that incorporates physical priors is needed to handle complex, large-amplitude aberrations. Method: Proposes the ZRNet framework, which includes a Zernike Graph module that explicitly models physical relationships between Zernike polynomials according to their azimuthal degrees, and a Frequency-Aware Alignment (FAA) loss that aligns Zernike coefficient prediction with image features in the Fourier domain for physically consistent joint optimization. Result: Experiments on CytoImageNet show that ZRNet outperforms existing methods in both image restoration and Zernike coefficient prediction across multiple microscopy modalities and biological samples with complex aberrations. Conclusion: By combining optical physics priors with deep learning, ZRNet improves microscopy image quality under large-amplitude aberrations, with broad applicability and physical consistency. Abstract: Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees, ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrate that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. Code is available at https://github.com/janetkok/ZRNet.
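One plausible reading of "relationships between Zernike polynomials based on their azimuthal degrees" is a graph that links modes sharing the same absolute azimuthal degree |m|. The sketch below builds such an adjacency matrix using the OSA/ANSI single-index convention. Both the indexing convention and the edge rule are assumptions made for illustration; the abstract does not specify how the authors construct their graph.

```python
import numpy as np

def osa_to_nm(j: int) -> tuple[int, int]:
    """Convert an OSA/ANSI single index j >= 0 to (radial order n, azimuthal degree m)."""
    n = 0
    # Orders 0..n together contain (n + 1)(n + 2) / 2 modes.
    while (n + 1) * (n + 2) // 2 <= j:
        n += 1
    m = 2 * j - n * (n + 2)
    return n, m

def azimuthal_adjacency(num_modes: int) -> np.ndarray:
    """Adjacency matrix linking distinct modes with equal |m| (assumed edge rule)."""
    abs_m = np.array([abs(osa_to_nm(j)[1]) for j in range(num_modes)])
    adj = (abs_m[:, None] == abs_m[None, :]).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

adj = azimuthal_adjacency(6)
```

For the first six modes this links piston with defocus (|m| = 0), the two tilts (|m| = 1), and the two astigmatism terms (|m| = 2), so a graph layer operating on `adj` would share information within each azimuthal family.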

[91] OWL: Unsupervised 3D Object Detection by Occupancy Guided Warm-up and Large Model Priors Reasoning

Xusheng Guo,Wanfa Zhang,Shijia Zhao,Qiming Xia,Xiaolong Xie,Mingming Wang,Hai Wu,Chenglu Wen

Main category: cs.CV

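TL;DR: Proposes OWL, an unsupervised 3D object detection method that improves performance through occupancy-guided warm-up and reasoning with large-model priors.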

Details Motivation: Existing unsupervised 3D object detection methods rely on iterative self-training to generate pseudo-labels, but the initial pseudo-labels are often inaccurate, which hampers network convergence, and filtering and refining them effectively is difficult. Method: OWL uses an Occupancy Guided Warm-up (OGW) strategy to initialize the backbone weights and strengthen spatial perception; introduces an Instance-Cued Reasoning (ICR) module that uses large-model prior knowledge to assess pseudo-label quality; and designs a Weight-adapted Self-training (WAS) strategy that dynamically re-weights pseudo-labels. Result: Experiments on the Waymo Open Dataset and KITTI show that OWL improves mAP by more than 15.0% over existing state-of-the-art methods. Conclusion: Through its warm-up strategy, large-model prior reasoning, and dynamic re-weighting, OWL substantially improves unsupervised 3D object detection and mitigates the optimization interference caused by incorrect pseudo-labels. Abstract: Unsupervised 3D object detection leverages heuristic algorithms to discover potential objects, offering a promising route to reduce annotation costs in autonomous driving. Existing approaches mainly generate pseudo-labels and refine them through self-training iterations. However, these pseudo-labels are often incorrect at the beginning of training, which misleads the optimization process. Moreover, effectively filtering and refining them remains a critical challenge. In this paper, we propose OWL for unsupervised 3D object detection by occupancy guided warm-up and large-model priors reasoning. OWL first employs an Occupancy Guided Warm-up (OGW) strategy to initialize the backbone weights with spatial perception capabilities, mitigating the interference of incorrect pseudo-labels on network convergence. Furthermore, OWL introduces an Instance-Cued Reasoning (ICR) module that leverages the prior knowledge of large models to assess pseudo-label quality, enabling precise filtering and refinement. Finally, we design a Weight-adapted Self-training (WAS) strategy to dynamically re-weight pseudo-labels, improving the performance through self-training. Extensive experiments on Waymo Open Dataset (WOD) and KITTI demonstrate that OWL outperforms state-of-the-art unsupervised methods by over 15.0% mAP, revealing the effectiveness of our method.

[92] Manifold-Aware Point Cloud Completion via Geodesic-Attentive Hierarchical Feature Learning

Jianan Sun,Dongzhihan Wang,Mingyu Fan

Main category: cs.CV

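TL;DR: Proposes a manifold-aware point cloud completion framework that introduces a Geodesic Distance Approximator (GDA) and a Manifold-Aware Feature Extractor (MAFE), using geodesic distances and geodesic-relational attention to improve the geometric consistency and semantic coherence of reconstructed point clouds.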

Details Motivation: Existing methods mostly rely on Euclidean distance and overlook the intrinsic nonlinear geometric structure of point clouds, leading to poor geometric consistency and semantic ambiguity in the reconstruction. Method: Proposes a GDA module that estimates geodesic distances between points to capture the latent manifold topology, and a MAFE module that performs hierarchical feature extraction with geodesic k-nearest-neighbor grouping and a geodesic-relational attention mechanism. Result: Experiments on multiple benchmark datasets show that the method outperforms current state-of-the-art methods in reconstruction quality. Conclusion: By explicitly modeling the nonlinear geometric structure of point clouds, the proposed method improves the geometric consistency and semantic clarity of point cloud completion. Abstract: Point cloud completion seeks to recover geometrically consistent shapes from partial or sparse 3D observations. Although recent methods have achieved reasonable global shape reconstruction, they often rely on Euclidean proximity and overlook the intrinsic nonlinear geometric structure of point clouds, resulting in suboptimal geometric consistency and semantic ambiguity. In this paper, we present a manifold-aware point cloud completion framework that explicitly incorporates nonlinear geometry information throughout the feature learning pipeline. Our approach introduces two key modules: a Geodesic Distance Approximator (GDA), which estimates geodesic distances between points to capture the latent manifold topology, and a Manifold-Aware Feature Extractor (MAFE), which utilizes geodesic-based $k$-NN groupings and a geodesic-relational attention mechanism to guide the hierarchical feature extraction process. By integrating geodesic-aware relational attention, our method promotes semantic coherence and structural fidelity in the reconstructed point clouds. Extensive experiments on benchmark datasets demonstrate that our approach consistently outperforms state-of-the-art methods in reconstruction quality.
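To make the Euclidean-versus-geodesic distinction concrete, the sketch below uses the classical graph-based approximation (shortest paths on a k-nearest-neighbour graph, as in Isomap). This is a stand-in chosen for illustration, not the paper's GDA module, whose design the abstract does not describe; the function names are hypothetical.

```python
import heapq

import numpy as np

def knn_graph(points: np.ndarray, k: int) -> list[dict[int, float]]:
    """Symmetric k-nearest-neighbour graph with Euclidean edge lengths."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nearest = np.argsort(dist, axis=1)[:, 1 : k + 1]  # column 0 is the point itself
    adj: list[dict[int, float]] = [dict() for _ in range(len(points))]
    for i in range(len(points)):
        for j in nearest[i]:
            adj[i][int(j)] = float(dist[i, j])
            adj[int(j)][i] = float(dist[i, j])
    return adj

def geodesic_from(adj: list[dict[int, float]], source: int) -> np.ndarray:
    """Dijkstra shortest-path lengths from source; approximates geodesic distance."""
    best = [float("inf")] * len(adj)
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < best[v]:
                best[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return np.array(best)

# Points sampled along a unit semicircle: the two ends are close in a straight line
# but far apart along the curve.
theta = np.linspace(0.0, np.pi, 50)
arc = np.column_stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)])
along_curve = geodesic_from(knn_graph(arc, k=2), source=0)[-1]
straight_line = np.linalg.norm(arc[0] - arc[-1])
```

Here the straight-line distance between the endpoints is the diameter, while the graph distance approaches the half-circumference, which is the kind of gap a geodesic-aware neighbourhood grouping is meant to respect.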

[93] Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision

Lennart Maack,Julia-Kristin Graß,Lisa-Marie Toscha,Nathaniel Melling,Alexander Schlaefer

Main category: cs.CV

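TL;DR: Proposes a privacy-preserving framework that distills knowledge from a large general-purpose LLM to train an efficient, locally deployable vision large language model (VLM) for better surgical scene understanding, in particular identifying and explaining anatomical structures in colon resection.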

Details Motivation: Existing VLMs underperform in specialized domains such as surgical scene understanding, and relying on large cloud-hosted models risks leaking patient data, so locally deployable, privacy-preserving specialized models are needed. Method: A teacher LLM is prompted with only textual context and binary segmentation masks (no sensitive images) to generate an expert-supervised dataset, which is then used for supervised fine-tuning (SFT) and direct preference optimization (DPO) of the local VLM. Result: Experiments show that the local VLM fine-tuned on the generated dataset understands surgical domain knowledge substantially better than its base model. Conclusion: The method validates a data-efficient, privacy-conforming way to train a locally deployable VLM optimized for surgical scenes. Abstract: Recently, Vision Large Language Models (VLMs) have demonstrated high potential in computer-aided diagnosis and decision-support. However, current VLMs show deficits in domain-specific surgical scene understanding, such as identifying and explaining anatomical landmarks during Complete Mesocolic Excision. Additionally, there is a need for locally deployable models to avoid patient data leakage to large VLMs hosted outside the clinic. We propose a privacy-preserving framework to distill knowledge from large, general-purpose LLMs into an efficient, local VLM. We generate an expert-supervised dataset by prompting a teacher LLM without sensitive images, using only textual context and binary segmentation masks for spatial information. This dataset is used for Supervised Fine-Tuning (SFT) and subsequent Direct Preference Optimization (DPO) of the locally deployable VLM. Our evaluation confirms that fine-tuning VLMs with our generated datasets increases surgical domain knowledge by a large margin compared to the base VLMs. Overall, this work validates a data-efficient and privacy-conforming way to train a surgical-domain-optimized, locally deployable VLM for surgical scene understanding.

[94] HQ-DM: Single Hadamard Transformation-Based Quantization-Aware Training for Low-Bit Diffusion Models

Shizhuo Mao,Hongtao Zou,Qihu Xie,Song Chen,Yi Kang

Main category: cs.CV

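TL;DR: Proposes HQ-DM, a quantization-aware training framework that uses a single Hadamard transformation to reduce outliers in diffusion model activation matrices, improving generation quality under low-bit quantization.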

Details Motivation: Existing quantization methods for diffusion models struggle to mitigate activation outliers during inference under low-bit quantization, causing substantial performance degradation. Method: Proposes the HQ-DM framework, which applies a single Hadamard transformation to activation matrices, supports INT convolution operations, and avoids amplifying weight outliers. Result: On ImageNet 256x256 with the LDM-4 model, the W4A4 and W4A3 quantization schemes improve the Inception Score by 12.8% and 467.73%, respectively, over the existing state-of-the-art method. Conclusion: HQ-DM effectively reduces activation outliers while preserving model performance, clearly outperforms the traditional Double Hadamard Transformation, and is suited to efficient deployment of diffusion models. Abstract: Diffusion models have demonstrated significant applications in the field of image generation. However, their high computational and memory costs pose challenges for deployment. Model quantization has emerged as a promising solution to reduce storage overhead and accelerate inference. Nevertheless, existing quantization methods for diffusion models struggle to mitigate outliers in activation matrices during inference, leading to substantial performance degradation under low-bit quantization scenarios. To address this, we propose HQ-DM, a novel Quantization-Aware Training framework that applies Single Hadamard Transformation to activation matrices. This approach effectively reduces activation outliers while preserving model performance under quantization. Compared to traditional Double Hadamard Transformation, our proposed scheme offers distinct advantages by seamlessly supporting INT convolution operations while preventing the amplification of weight outliers. For conditional generation on the ImageNet 256x256 dataset using the LDM-4 model, our W4A4 and W4A3 quantization schemes improve the Inception Score by 12.8% and 467.73%, respectively, over the existing state-of-the-art method.
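The general reason a Hadamard rotation helps low-bit quantization can be shown in a few lines: an orthonormal Hadamard matrix spreads a single large activation across all channels, shrinking the range the quantizer must cover, and the rotation can be undone exactly on the weight side. The sketch below illustrates only this generic idea for a linear layer. How HQ-DM places its single transformation, and how it handles convolutions, is not described in the abstract, so nothing here should be read as the paper's scheme.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("n must be a positive power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def quantize_symmetric(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric fake-quantization with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0.0:
        return x.copy()
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

H = hadamard(8)
x = np.array([0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.05, 8.0])  # one outlier channel
W = np.arange(16.0).reshape(8, 2)                              # toy weight matrix

rotated = x @ H                      # outlier energy is spread over all 8 channels
exact = rotated @ (H.T @ W)          # equals x @ W because H @ H.T is the identity
approx = quantize_symmetric(rotated, bits=4) @ (H.T @ W)
```

Since `H.T @ W` can be precomputed offline, the only runtime cost in this toy setup is the rotation of the activations before they are quantized.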

[95] USV: Unified Sparsification for Accelerating Video Diffusion Models

Xinjian Wu,Hongmei Wang,Yuan Zhou,Qinglin Lu

Main category: cs.CV

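TL;DR: Proposes USV, a unified sparsification framework for video diffusion models that jointly optimizes model computation and the sampling process, substantially improving generation efficiency while preserving visual quality.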

Details Motivation: Existing acceleration methods optimize only the attention mechanism or only the number of denoising steps in isolation, which leaves bottlenecks, yields diminishing returns, and lacks coordinated optimization across dimensions. Method: Proposes the USV framework, which learns a dynamic, data- and timestep-dependent sparsification policy that jointly prunes redundant attention connections, merges similar tokens, and reduces denoising steps, giving end-to-end trainable, coordinated multi-dimensional sparsification. Result: On large-scale video generation tasks, USV achieves up to 83.3% speedup in denoising and 22.7% end-to-end acceleration while maintaining high visual fidelity. Conclusion: Unified, dynamic sparsification is an effective path to efficient, high-quality video generation. Abstract: The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators (such as sparse attention and step-distilled samplers) typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model's internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating them not as independent tricks but as coordinated actions within a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.

[96] Label-Efficient Point Cloud Segmentation with Active Learning

Johannes Meyer,Jasper Hoffmann,Felix Schulz,Dominik Merkle,Daniel Buescher,Alexander Reiterer,Joschka Boedecker,Wolfram Burgard

Main category: cs.CV

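TL;DR: Proposes a new active learning method for 3D point cloud semantic segmentation based on 2D grid partitioning and network-ensemble uncertainty. The method is simple and effective, performing on par with or better than existing complex methods on several datasets.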

Details Motivation: Annotating 3D point clouds for semantic segmentation is costly, and existing active learning methods rely on complex heuristics to partition and select regions for annotation, leaving a need for a simple, efficient approach. Method: A 2D grid divides the point cloud into vertical columns, and a network ensemble estimates prediction uncertainty to select the most informative data for annotation. Result: Experiments on S3DIS, Toronto-3D, and a Freiburg urban point cloud dataset show performance on par with or better than state-of-the-art methods, and indicate that annotated area is a more suitable evaluation measure than the number of annotated points. Conclusion: The proposed method is simple to implement and effective at reducing annotation cost, and the results support annotated area as a sound measure for evaluating active learning. Abstract: Semantic segmentation of 3D point cloud data often comes with high annotation costs. Active learning automates the process of selecting which data to annotate, reducing the total amount of annotation needed to achieve satisfactory performance. Recent approaches to active learning for 3D point clouds are often based on sophisticated heuristics for both splitting point clouds into annotatable regions and selecting the most beneficial regions for further neural network training. In this work, we propose a novel and easy-to-implement strategy to separate the point cloud into annotatable regions. In our approach, we utilize a 2D grid to subdivide the point cloud into columns. To identify the next data to be annotated, we employ a network ensemble to estimate the uncertainty in the network output. We evaluate our method on the S3DIS dataset, the Toronto-3D dataset, and a large-scale urban 3D point cloud of the city of Freiburg, part of which we labeled manually. The extensive evaluation shows that our method yields performance on par with, or even better than, complex state-of-the-art methods on all datasets. Furthermore, we provide results suggesting that in the context of point clouds the annotated area can be a more meaningful measure for active learning algorithms than the number of annotated points.
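The two steps named in the abstract, splitting the cloud into grid columns and scoring them by ensemble uncertainty, can be sketched as below. The cell size, the use of predictive entropy of the ensemble mean, and averaging per-point uncertainty over a column are assumptions made for this example; the abstract does not state which uncertainty measure or aggregation the authors use.

```python
import numpy as np

def column_ids(points_xyz: np.ndarray, cell_size: float) -> np.ndarray:
    """Assign each point to a vertical column by binning its x and y coordinates."""
    cells = np.floor(points_xyz[:, :2] / cell_size).astype(np.int64)
    _, ids = np.unique(cells, axis=0, return_inverse=True)
    return ids.reshape(-1)

def ensemble_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-point entropy of the mean prediction.

    probs has shape (ensemble_members, points, classes).
    """
    mean = probs.mean(axis=0)
    return -np.sum(mean * np.log(mean + 1e-12), axis=1)

def rank_columns(ids: np.ndarray, point_uncertainty: np.ndarray) -> np.ndarray:
    """Column ids sorted from most to least uncertain (mean over the column's points)."""
    n_cols = int(ids.max()) + 1
    sums = np.bincount(ids, weights=point_uncertainty, minlength=n_cols)
    counts = np.bincount(ids, minlength=n_cols)
    return np.argsort(-(sums / counts))

# Four points in two 1 m columns; the ensemble disagrees only in the second column.
points = np.array([[0.2, 0.3, 0.0], [0.4, 0.1, 5.0], [1.5, 0.2, 0.0], [1.7, 0.6, 3.0]])
probs = np.array([
    [[0.99, 0.01], [0.99, 0.01], [0.9, 0.1], [0.9, 0.1]],
    [[0.99, 0.01], [0.99, 0.01], [0.1, 0.9], [0.1, 0.9]],
])
ids = column_ids(points, cell_size=1.0)
next_to_annotate = rank_columns(ids, ensemble_entropy(probs))[0]
```

Because columns ignore height, an annotator receives a complete vertical slice of the scene, and the annotated ground area is simply the number of selected columns times the cell area.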

[97] FNOPT: Resolution-Agnostic, Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators

Ruochen Chen,Thuy Tran,Shaifali Parashar

Main category: cs.CV

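TL;DR: FNOpt is a self-supervised cloth simulation framework that formulates time integration as an optimization problem and trains a neural optimizer parameterized by a Fourier neural operator (FNO), achieving stable, accurate simulation across resolutions and motion patterns without large amounts of ground-truth data.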

Details Motivation: Existing neural simulators usually depend on large amounts of ground-truth data or sacrifice detail, and generalize poorly across resolutions and motion patterns; FNOpt aims to address these problems. Method: Time integration is modeled as an optimization problem and solved by an FNO-based neural optimizer trained in a self-supervised manner on a coarse grid with physics-based losses, yielding resolution-agnostic cloth simulation. Result: On a benchmark dataset, FNOpt outperforms prior methods in out-of-distribution settings, captures fine-scale wrinkles, preserves rollout stability, and generalizes to finer resolutions. Conclusion: FNOpt reduces dependence on labeled data and improves cross-resolution generalization, offering a more robust and practical approach to neural cloth simulation. Abstract: We present FNOpt, a self-supervised cloth simulation framework that formulates time integration as an optimization problem and trains a resolution-agnostic neural optimizer parameterized by a Fourier neural operator (FNO). Prior neural simulators often rely on extensive ground truth data or sacrifice fine-scale detail, and generalize poorly across resolutions and motion patterns. In contrast, FNOpt learns to simulate physically plausible cloth dynamics and achieves stable and accurate rollouts across diverse mesh resolutions and motion patterns without retraining. Trained only on a coarse grid with physics-based losses, FNOpt generalizes to finer resolutions, capturing fine-scale wrinkles and preserving rollout stability. Extensive evaluations on a benchmark cloth simulation dataset demonstrate that FNOpt outperforms prior learning-based approaches in out-of-distribution settings in both accuracy and robustness. These results position FNO-based meta-optimization as a compelling alternative to previous neural simulators for cloth, thus reducing the need for curated data and improving cross-resolution reliability.

[98] Curvature-Regularized Variational Autoencoder for 3D Scene Reconstruction from Sparse Depth

Maryam Yousefi,Soodeh Bakhshandeh

Main category: cs.CV

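TL;DR: Proposes curvature regularization based on the discrete Laplacian operator, substantially improving 3D scene reconstruction accuracy from sparse depth data. A single regularization term outperforms complex multi-constraint models, with low training overhead and no extra inference cost.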

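Details Motivation: When depth sensors provide only 5% of the needed measurements, traditional methods struggle to reconstruct 3D scenes accurately, and autonomous driving and robotics applications cannot tolerate the resulting geometric errors. Method: Adds a curvature regularization term based on the discrete Laplacian operator to a variational autoencoder, using its stable gradients and noise suppression to improve geometric recovery from sparse input. Result: Reconstruction accuracy improves by 18.1% over a standard variational autoencoder, with only 15% training overhead and no extra inference cost, outperforming combinations of multiple geometric constraints. Conclusion: A single well-designed regularization term (such as the discrete Laplacian) can outperform complex multi-constraint methods, challenging the implicit assumption in geometric deep learning that more constraints are always better. Abstract: When depth sensors provide only 5% of needed measurements, reconstructing complete 3D scenes becomes difficult. Autonomous vehicles and robots cannot tolerate the geometric errors that sparse reconstruction introduces. We propose curvature regularization through a discrete Laplacian operator, achieving 18.1% better reconstruction accuracy than standard variational autoencoders. Our contribution challenges an implicit assumption in geometric deep learning: that combining multiple geometric constraints improves performance. A single well-designed regularization term not only matches but exceeds the effectiveness of complex multi-term formulations. The discrete Laplacian offers stable gradients and noise suppression with just 15% training overhead and zero inference cost. Code and models are available at https://github.com/Maryousefi/GeoVAE-3D.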
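A discrete Laplacian penalty is easy to state on a regular grid. The sketch below applies the standard five-point stencil to a 2D depth map and averages the squared response; the choice of a 2D depth map (rather than whatever 3D representation the paper's autoencoder outputs), the stencil, and the squared-mean reduction are assumptions for illustration.

```python
import numpy as np

def discrete_laplacian(depth: np.ndarray) -> np.ndarray:
    """Five-point-stencil Laplacian evaluated on the interior pixels of a 2D map."""
    return (
        depth[:-2, 1:-1]      # pixel above
        + depth[2:, 1:-1]     # pixel below
        + depth[1:-1, :-2]    # pixel to the left
        + depth[1:-1, 2:]     # pixel to the right
        - 4.0 * depth[1:-1, 1:-1]
    )

def curvature_penalty(depth: np.ndarray) -> float:
    """Mean squared Laplacian: zero for planar surfaces, large for spikes and noise."""
    return float(np.mean(discrete_laplacian(depth) ** 2))

yy, xx = np.mgrid[0:6, 0:6].astype(float)
plane = 2.0 * xx + 3.0 * yy + 1.0   # a tilted plane has no curvature
spiky = plane.copy()
spiky[3, 3] += 1.0                  # a single-pixel outlier
```

In training, a term like `curvature_penalty(predicted_depth)` would be added to the reconstruction loss with a weight; since it only touches the loss, it adds nothing at inference time, consistent with the zero inference cost stated in the abstract.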

[99] Bring Your Dreams to Life: Continual Text-to-Video Customization

Jiahua Dong,Xudong Wang,Wenqi Liang,Zongyan Han,Meng Cao,Duzhen Zhang,Hanbin Zhao,Zhi Han,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CV

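TL;DR: Proposes the Continual Customized Video Diffusion (CCVD) model, which addresses catastrophic forgetting and concept neglect when continually learning new concepts for text-to-video generation.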

Details Motivation: Existing customized text-to-video generation methods assume personalized concepts are static and struggle to avoid forgetting and neglecting old concepts when learning new ones incrementally. Method: Proposes a concept-specific attribute retention module and a task-aware concept aggregation strategy to mitigate forgetting, and a controllable conditional synthesis using layer-specific region attention-guided noise estimation to address concept neglect. Result: The method is validated on multiple text-to-video generation tasks, with experiments showing that CCVD outperforms existing CTVG models. Conclusion: CCVD can continually learn new subject and motion concepts while retaining old knowledge, substantially improving continual learning for customized video generation. Abstract: Customized text-to-video generation (CTVG) has recently witnessed great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. Besides, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that our CCVD outperforms existing CTVG models. The code is available at https://github.com/JiahuaDong/CCVD.

[100] Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling

Saurav Jha,M. Jehanzeb Mirza,Wei Lin,Shiqi Yang,Sarath Chandar

Main category: cs.CV

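TL;DR: This paper studies test-time verification for world-model-based vision-language models on spatial reasoning tasks and proposes the ViSA framework, which grounds the reward in verifiable micro-claims. ViSA performs well on SAT-Real, but on MMSI-Bench it remains limited by the information bottleneck of imagined views.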

Details Motivation: Existing VLMs perform poorly on spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Methods such as MindJourney introduce test-time verification, but their verifiers are poorly calibrated and give unreliable reward signals, which calls for systematic analysis and improvement. Method: The ViSA (Verification through Spatial Assertions) framework grounds the test-time reward in verifiable, frame-anchored micro-claims, and uncertainty-based analyses evaluate how different verifiers behave across several benchmarks. Result: ViSA consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases; on MMSI-Bench, however, no verifier achieves consistent scaling, indicating an information bottleneck in current world models. Conclusion: Test-time verification shows promise for improving spatial reasoning, but its effect is limited by the quality of the views the world model generates; future work needs to overcome the information bottleneck that keeps imagined views from supporting fine-grained reasoning. Abstract: Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.

[101] UG-FedDA: Uncertainty-Guided Federated Domain Adaptation for Multi-Center Alzheimer's Disease Detection

Fubao Zhu,Zhanyuan Jia,Zhiguo Wang,Huan Huang,Danyang Sun,Chuang Han,Yanting Li,Jiaofen Nan,Chen Zhao,Weihua Zhou

Main category: cs.CV

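TL;DR: Proposes UG-FedDA, a framework that combines uncertainty quantification with federated domain adaptation for multi-center Alzheimer's disease classification, addressing cross-site heterogeneity and privacy constraints.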

Details Motivation: Existing AD classification methods often neglect inter-site heterogeneity in multi-center studies and lack uncertainty quantification, which limits their robustness and clinical applicability. Method: A self-attention transformer extracts multi-template region-of-interest features, uncertainty quantification guides feature alignment, and federated domain adaptation mitigates the distribution shift between source and target domains. Result: Validated on three public datasets (ADNI, AIBL, and OASIS), UG-FedDA performs well across several classification tasks; for NC vs. AD, for example, accuracy reaches 90.54%, 89.04%, and 77.78%, respectively. Conclusion: UG-FedDA handles the heterogeneity of multi-center MRI data, improves cross-site classification performance, and protects data privacy, showing potential for clinical use. Abstract: Alzheimer's disease (AD) is an irreversible neurodegenerative disorder, and early diagnosis is critical for timely intervention. However, most existing classification frameworks face challenges in multicenter studies, as they often neglect inter-site heterogeneity and lack mechanisms to quantify uncertainty, which limits their robustness and clinical applicability. To address these issues, we propose Uncertainty-Guided Federated Domain Adaptation (UG-FedDA), a novel multicenter AD classification framework that integrates uncertainty quantification (UQ) with federated domain adaptation to handle cross-site structural magnetic resonance imaging (MRI) heterogeneity under privacy constraints. Our approach extracts multi-template region-of-interest (RoI) features using a self-attention transformer, capturing both regional representations and their interactions. UQ is integrated to guide feature alignment, mitigating source-target distribution shifts by down-weighting uncertain samples. Experiments are conducted on three public datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarkers and Lifestyle study (AIBL), and the Open Access Series of Imaging Studies (OASIS). UG-FedDA achieved consistent cross-domain improvements in accuracy, sensitivity, and area under the ROC curve across three classification tasks: AD vs. normal controls (NC), mild cognitive impairment (MCI) vs. AD, and NC vs. MCI. For NC vs. AD, UG-FedDA achieves accuracies of 90.54%, 89.04%, and 77.78% on ADNI, AIBL and OASIS datasets, respectively. For MCI vs. AD, accuracies are 80.20% (ADNI), 71.91% (AIBL), and 79.73% (OASIS). For NC vs. MCI, results are 76.87% (ADNI), 73.91% (AIBL), and 83.73% (OASIS). These results demonstrate that the proposed framework not only adapts efficiently across multiple sites but also preserves strict privacy.
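
The idea of down-weighting uncertain samples during feature alignment can be sketched as follows. This is an illustration under stated assumptions, not the UG-FedDA formulation: the abstract does not say how uncertainty is measured, so normalized predictive entropy is used as a stand-in, and `sample_weight` and `weighted_mean_feature` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a class distribution, scaled to [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def sample_weight(probs: Sequence[float]) -> float:
    """Assumed weighting: confident predictions near 1, uncertain ones near 0."""
    return max(0.0, 1.0 - normalized_entropy(probs))


def weighted_mean_feature(features: List[List[float]],
                          probs: List[List[float]]) -> List[float]:
    """Domain-level feature mean in which uncertain samples count less."""
    weights = [sample_weight(p) for p in probs]
    total = sum(weights)
    dim = len(features[0])
    if total == 0.0:
        return [0.0] * dim
    return [sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)]


features = [[1.0, 0.0], [0.0, 1.0]]
probs = [[1.0, 0.0], [0.5, 0.5]]  # second sample is maximally uncertain
print(weighted_mean_feature(features, probs))
```

An alignment loss could then compare such weighted statistics between source and target sites, so that ambiguous scans contribute little to the match.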

[102] Phase-OTDR Event Detection Using Image-Based Data Transformation and Deep Learning

Muhammet Cagri Yeke,Samil Sirin,Kivilcim Yuksel,Abdurrahman Gumus

Main category: cs.CV

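TL;DR: Proposes a new method that converts 1D Phase-OTDR data into images for optical fiber event detection and combines it with transfer learning models to achieve high classification accuracy.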

Details Motivation: To improve the accuracy and analysis efficiency of optical fiber event detection in Phase-OTDR systems and to overcome the limitations of conventional 1D data analysis. Method: Gramian Angular Difference Field, Gramian Angular Summation Field, and Recurrence Plot transform the 1D signals into grayscale images, which are combined into multi-channel RGB images and classified via transfer learning with EfficientNetB0 and DenseNet121. Result: On a public dataset, classification accuracy reaches 98.84% (EfficientNetB0) and 98.24% (DenseNet121); 5-fold cross-validation yields 99.07% and 98.68%, respectively. Conclusion: Image-based transformation combined with transfer learning improves optical fiber event detection performance; the code and dataset are open source. Abstract: This study focuses on event detection in optical fibers, specifically classifying six events using the Phase-OTDR system. A novel approach is introduced to enhance Phase-OTDR data analysis by transforming 1D data into grayscale images through techniques such as Gramian Angular Difference Field, Gramian Angular Summation Field, and Recurrence Plot. These grayscale images are combined into a multi-channel RGB representation, enabling more robust and adaptable analysis using transfer learning models. The proposed methodology achieves high classification accuracies of 98.84% and 98.24% with the EfficientNetB0 and DenseNet121 models, respectively. A 5-fold cross-validation process confirms the reliability of these models, with test accuracy rates of 99.07% and 98.68%. Using a publicly available Phase-OTDR dataset, the study demonstrates an efficient approach to understanding optical fiber events while reducing dataset size and improving analysis efficiency. The results highlight the transformative potential of image-based analysis in interpreting complex fiber optic sensing data, offering significant advancements in the accuracy and reliability of fiber optic monitoring systems. The codes and the corresponding image-based dataset are made publicly available on GitHub to support further research: https://github.com/miralab-ai/Phase-OTDR-event-detection.
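
The three transforms named above have standard textbook definitions, sketched here in plain Python. The channel order, the recurrence threshold `eps`, and the min-max rescaling are assumptions made for this example; the abstract does not specify them, and the authors' repository may differ.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def rescale(series: Sequence[float]) -> List[float]:
    """Min-max scale to [-1, 1] so that arccos is defined."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [max(-1.0, min(1.0, 2.0 * (x - lo) / (hi - lo) - 1.0)) for x in series]


def gramian_fields(series: Sequence[float]) -> Tuple[Matrix, Matrix]:
    """GASF[i][j] = cos(phi_i + phi_j); GADF[i][j] = sin(phi_i - phi_j)."""
    phi = [math.acos(x) for x in rescale(series)]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


def recurrence_plot(series: Sequence[float], eps: float) -> Matrix:
    """Binary recurrence plot: 1 where two samples lie within eps of each other."""
    return [[1.0 if abs(a - b) <= eps else 0.0 for b in series] for a in series]


def stack_channels(series: Sequence[float], eps: float = 0.1):
    """Combine the three grayscale maps into one 3-channel image (assumed order)."""
    gasf, gadf = gramian_fields(series)
    rp = recurrence_plot(rescale(series), eps)
    n = len(series)
    return [[(gasf[i][j], gadf[i][j], rp[i][j]) for j in range(n)] for i in range(n)]


signal = [0.0, 0.5, 1.0]
image = stack_channels(signal)
print(len(image), len(image[0]), len(image[0][0]))
```

A signal of length n yields an n x n x 3 array, which is the kind of input an ImageNet-pretrained network such as EfficientNetB0 expects after resizing and value normalization.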

[103] VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

Shiji Zhao,Shukun Xiong,Yao Huang,Yan Jin,Zhenyu Wu,Jiyang Guan,Ranjie Duan,Jialing Tao,Hui Xue,Xingxing Wei

Main category: cs.CV

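TL;DR: This paper proposes Visual Reasoning Sequential Attack (VRSA), a new attack on multimodal large language models (MLLMs) that decomposes harmful text into a sequence of related sub-images, gradually inducing the model to externalize and aggregate the harmful intent. Combined with scene refinement, semantic coherence, and text-image consistency strategies, it raises the jailbreak success rate.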

Details Motivation: Existing MLLM jailbreak attacks focus mainly on reasoning safety risks in the text modality and overlook potential threats in the visual modality. This paper aims to evaluate safety issues in visual reasoning tasks comprehensively and to expose a new attack surface for MLLMs in multimodal settings. Method: Visual Reasoning Sequential Attack (VRSA) comprises 1) Adaptive Scene Refinement to make the image sequence more plausible; 2) Semantic Coherent Completion to ensure contextual continuity between sub-texts; and 3) Text-Image Consistency Alignment to keep text and image semantics consistent. Result: Experiments show that VRSA achieves a higher attack success rate than existing state-of-the-art jailbreak methods on several open-source and closed-source MLLMs (such as GPT-4o and Claude-4.5-Sonnet). Conclusion: VRSA reveals serious safety risks in how MLLMs handle multimodal input, in particular that they can be led step by step along a visual reasoning path to generate harmful content, and underscores the need for stronger multimodal safety defenses. Abstract: Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, more modalities bring more vulnerabilities that can be exploited for jailbreak attacks, which induce MLLMs to output harmful content. Due to the strong reasoning ability of MLLMs, previous jailbreak attacks try to explore reasoning safety risks in the text modality, while similar threats have been largely overlooked in the visual modality. To fully evaluate potential safety risks in the visual reasoning task, we propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated image, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to keep the semantic consistency. A series of experiments demonstrates that VRSA can achieve a higher attack success rate compared with the state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.

[104] Edit-aware RAW Reconstruction

Abhijith Punnappurath,Luxi Zhao,Ke Zhao,Hue Nguyen,Radek Grzeszczuk,Michael S. Brown

Main category: cs.CV

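TL;DR: Proposes a plug-and-play, edit-aware loss function that makes RAW data reconstructed from sRGB images more robust to different rendering styles and editing operations.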

Details Motivation: Existing RAW reconstruction methods focus on pixel-level fidelity but perform poorly under diverse rendering styles and edits; because users usually edit images after capture, RAW reconstruction needs to suit real editing workflows. Method: An edit-aware loss function that can be integrated into any RAW reconstruction framework, built on a modular, differentiable ISP (image signal processor); during training, ISP parameters are randomly sampled to simulate the diversity of real camera processing, and the loss is computed in sRGB space. Result: The loss improves sRGB reconstruction quality by up to 1.5-2 dB PSNR across various editing conditions and allows further fine-tuning for target edits in metadata-assisted methods. Conclusion: The proposed loss function is simple and effective, enhancing the edit fidelity and rendering flexibility of existing RAW reconstruction methods. Abstract: Users frequently edit camera images post-capture to achieve their preferred photofinishing style. While editing in the RAW domain provides greater accuracy and flexibility, most edits are performed on the camera's display-referred output (e.g., 8-bit sRGB JPEG) since RAW images are rarely stored. Existing RAW reconstruction methods can recover RAW data from sRGB images, but these approaches are typically optimized for pixel-wise RAW reconstruction fidelity and tend to degrade under diverse rendering styles and editing operations. We introduce a plug-and-play, edit-aware loss function that can be integrated into any existing RAW reconstruction framework to make the recovered RAWs more robust to different rendering styles and edits. Our loss formulation incorporates a modular, differentiable image signal processor (ISP) that simulates realistic photofinishing pipelines with tunable parameters. During training, parameters for each ISP module are randomly sampled from carefully designed distributions that model practical variations in real camera processing. The loss is then computed in sRGB space between ground-truth and reconstructed RAWs rendered through this differentiable ISP. Incorporating our loss improves sRGB reconstruction quality by up to 1.5-2 dB PSNR across various editing conditions. Moreover, when applied to metadata-assisted RAW reconstruction methods, our approach enables fine-tuning for target edits, yielding further gains. Since photographic editing is the primary motivation for RAW reconstruction in consumer imaging, our simple yet effective loss function provides a general mechanism for enhancing edit fidelity and rendering flexibility across existing methods.
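
The structure of such a loss can be sketched in a forward-only form. Everything specific here is an assumption for illustration: the toy ISP has only exposure, white balance, and gamma stages, the sampling ranges are invented, and plain Python carries no gradients, so a real implementation would express the same steps in an autodiff framework.

```python
import random
from typing import Dict, List, Tuple

Pixel = Tuple[float, float, float]


def sample_isp_params(rng: random.Random) -> Dict[str, object]:
    """Draw one random rendering style (ranges are illustrative guesses)."""
    return {
        "exposure": rng.uniform(0.5, 2.0),
        "wb": (rng.uniform(0.8, 1.2), 1.0, rng.uniform(0.8, 1.2)),
        "gamma": rng.uniform(1.8, 2.6),
    }


def render(raw: List[Pixel], params: Dict[str, object]) -> List[Pixel]:
    """Toy ISP: white balance and exposure gain, clip to [0, 1], gamma encode."""
    out = []
    for px in raw:
        out.append(tuple(
            min(1.0, max(0.0, c * g * params["exposure"])) ** (1.0 / params["gamma"])
            for c, g in zip(px, params["wb"])
        ))
    return out


def edit_aware_loss(raw_gt: List[Pixel], raw_pred: List[Pixel],
                    rng: random.Random, n_samples: int = 4) -> float:
    """Mean squared error between renderings, averaged over sampled styles."""
    total = 0.0
    for _ in range(n_samples):
        params = sample_isp_params(rng)
        a, b = render(raw_gt, params), render(raw_pred, params)
        total += sum((x - y) ** 2 for pa, pb in zip(a, b)
                     for x, y in zip(pa, pb)) / (3 * len(raw_gt))
    return total / n_samples


gt = [(0.2, 0.3, 0.1)]
pred = [(0.25, 0.3, 0.1)]
print(edit_aware_loss(gt, pred, random.Random(0)))
```

The key design point is that both RAWs pass through the same sampled parameters before comparison, so errors are weighted by how visible they become after plausible edits rather than by raw pixel distance alone.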

[105] Underwater Image Reconstruction Using a Swin Transformer-Based Generator and PatchGAN Discriminator

Md. Mahbub Hasan Akash,Aria Tasnim Mridula,Sheekar Banerjee,Ishtiak Al Mamoon

Main category: cs.CV

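TL;DR: Proposes a deep learning framework based on the Swin Transformer and a GAN for underwater image reconstruction, reporting state-of-the-art performance on the EUVP dataset.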

Details Motivation: Traditional methods and convolutional neural network-based approaches have limited receptive fields and cannot model global dependencies, so they struggle with the color distortion, low contrast, and haze that absorption and scattering in water cause in underwater images. Method: A Swin Transformer architecture is integrated into a generative adversarial network (GAN); the generator is a U-Net with Swin Transformer blocks that captures local features and long-range dependencies, and the discriminator is a PatchGAN that preserves high-frequency detail. Result: On the EUVP dataset, the model reaches a PSNR of 24.76 dB and an SSIM of 0.89, outperforming existing methods; visual results show improved color balance, contrast, and haze removal, and an ablation study supports the Swin Transformer design. Conclusion: The proposed method achieves robust underwater image reconstruction and suits applications such as marine exploration, environmental monitoring, and infrastructure inspection. Abstract: Underwater imaging is essential for marine exploration, environmental monitoring, and infrastructure inspection. However, water causes severe image degradation through wavelength-dependent absorption and scattering, resulting in color distortion, low contrast, and haze effects. Traditional reconstruction methods and convolutional neural network-based approaches often fail to adequately address these challenges due to limited receptive fields and inability to model global dependencies. This paper presents a novel deep learning framework that integrates a Swin Transformer architecture within a generative adversarial network (GAN) for underwater image reconstruction. Our generator employs a U-Net structure with Swin Transformer blocks to capture both local features and long-range dependencies crucial for color correction across entire images. A PatchGAN discriminator provides adversarial training to ensure high-frequency detail preservation. We train and evaluate our model on the EUVP dataset, which contains paired underwater images of varying quality. Quantitative results demonstrate state-of-the-art performance with PSNR of 24.76 dB and SSIM of 0.89, representing significant improvements over existing methods. Visual results show effective color balance restoration, contrast improvement, and haze reduction. An ablation study confirms the superiority of our Swin Transformer design over convolutional alternatives. The proposed method offers robust underwater image reconstruction suitable for various marine applications.
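
For readers less familiar with the PSNR figure reported above, a minimal sketch of the standard definition follows. It assumes single-channel images stored as nested lists with values in [0, 1]; the function name and the `peak` default are choices made for this example, not details from the paper.

```python
import math
from typing import List

Image = List[List[float]]


def psnr(reference: Image, test: Image, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    n = len(reference) * len(reference[0])
    mse = sum((a - b) ** 2
              for row_r, row_t in zip(reference, test)
              for a, b in zip(row_r, row_t)) / n
    return math.inf if mse == 0.0 else 10.0 * math.log10(peak ** 2 / mse)


clean = [[0.5, 0.5], [0.5, 0.5]]
noisy = [[0.6, 0.6], [0.6, 0.6]]  # uniform error of 0.1, so MSE is 0.01
print(psnr(clean, noisy))
```

Inverting the formula, a PSNR of 24.76 dB at a peak of 1.0 corresponds to a mean squared error of about 0.0033, which gives a sense of the average per-pixel deviation behind the reported number.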

[106] SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

Wenhao Yan,Sheng Ye,Zhuoyi Yang,Jiayan Teng,ZhenHui Dong,Kairui Wen,Xiaotao Gu,Yong-Jin Liu,Jie Tang

Main category: cs.CV

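TL;DR: This paper proposes SCAIL, a framework for high-quality character animation that uses a 3D pose representation and a full-context pose injection mechanism to achieve strong structural fidelity and temporal consistency under complex motion and cross-identity scenarios.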

Details Motivation: Existing character animation methods struggle to preserve structural fidelity and temporal consistency under complex motion and cross-identity animation, and fall short of studio-grade production standards. Method: A new 3D pose representation and a full-context pose injection mechanism within a diffusion-transformer architecture, trained and evaluated with a curated high-quality data pipeline and a comprehensive benchmark. Result: Experiments show that SCAIL achieves state-of-the-art performance on multiple metrics and improves the reliability and realism of character animation. Conclusion: Through these two key techniques, SCAIL advances character animation toward studio-grade quality, with good practicality and generalization. Abstract: Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework designed to address these challenges through two key innovations. First, we propose a novel 3D pose representation, providing a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.

[107] NICE: Neural Implicit Craniofacial Model for Orthognathic Surgery Prediction

Jiawen Yang,Yihui Cao,Xuanyu Tian,Yuyao Zhang,Hongjiang Wei

Main category: cs.CV

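TL;DR: Proposes NICE, a craniofacial modeling method based on implicit neural representations for predicting facial changes after orthognathic surgery, improving prediction accuracy in key regions while preserving anatomical integrity.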

Details Motivation: Existing biomechanical models, parametric models, and deep learning methods either lack computational efficiency or fail to capture the complex nonlinear relationship between skeletal movement and soft tissue, making postoperative facial appearance hard to predict accurately. Method: The Neural Implicit Craniofacial Model (NICE) comprises a shape module (region-specific implicit signed distance function decoders) and a surgery module (region-specific deformation decoders driven by a shared surgical latent code), and outputs point-wise displacement fields to model surgical outcomes precisely. Result: Experiments show that NICE outperforms current state-of-the-art methods, notably in critical facial regions such as the lips and chin, improving prediction accuracy while robustly preserving anatomical integrity. Conclusion: NICE provides a clinically viable tool for orthognathic surgery that can improve surgical planning and patient consultation. Abstract: Orthognathic surgery is a crucial intervention for correcting dentofacial skeletal deformities to enhance occlusal functionality and facial aesthetics. Accurate postoperative facial appearance prediction remains challenging due to the complex nonlinear interactions between skeletal movements and facial soft tissue. Existing biomechanical models, parametric models, and deep-learning approaches either lack computational efficiency or fail to fully capture these intricate interactions. To address these limitations, we propose the Neural Implicit Craniofacial Model (NICE), which employs implicit neural representations for accurate anatomical reconstruction and surgical outcome prediction. NICE comprises a shape module, which employs region-specific implicit Signed Distance Function (SDF) decoders to reconstruct the facial surface, maxilla, and mandible, and a surgery module, which employs region-specific deformation decoders. These deformation decoders are driven by a shared surgical latent code to effectively model the complex, nonlinear biomechanical response of the facial surface to skeletal movements, incorporating anatomical prior knowledge. The deformation decoders output point-wise displacement fields, enabling precise modeling of surgical outcomes. Extensive experiments demonstrate that NICE outperforms current state-of-the-art methods, notably improving prediction accuracy in critical facial regions such as lips and chin, while robustly preserving anatomical integrity. This work provides a clinically viable tool for enhanced surgical planning and patient consultation in orthognathic procedures.

[108] LPD: Learnable Prototypes with Diversity Regularization for Weakly Supervised Histopathology Segmentation

Khang Le,Anh Mai Vu,Thi Kim Trang Vo,Ha Thach,Ngoc Bui Lam Quang,Thanh-Huy Nguyen,Minh H. N. Le,Zhu Han,Chandra Mohan,Hien Van Nguyen

Main category: cs.CV

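TL;DR: Proposes a cluster-free, one-stage learnable-prototype framework that uses diversity regularization to improve coverage of morphological intra-class heterogeneity, achieving SOTA performance in weakly supervised histopathology semantic segmentation.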

Details Motivation: Existing methods are held back by two-stage pipelines, hyperparameter sensitivity, and the decoupling of prototype discovery from segmentation learning, and suffer from inter-class homogeneity, intra-class heterogeneity, and CAM region shrinkage. Method: An end-to-end learnable-prototype framework with a diversity regularization mechanism that avoids explicit clustering and jointly optimizes prototype learning and segmentation in a single stage. Result: The method achieves SOTA on the BCSS-WSSS dataset, with mIoU and mDice above prior methods; visualizations show sharper boundaries and fewer mislabels, and activation maps indicate that the prototypes cover more diverse, complementary regions. Conclusion: The proposed method addresses key challenges in WSSS, improves segmentation performance and efficiency, and offers a new direction for weakly supervised pathology image analysis. Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology reduces pixel-level labeling by learning from image-level labels, but it is hindered by inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage (global pooling-based class activation maps whose activations highlight only the most distinctive areas and miss nearby class regions). Recent works address these challenges by constructing a clustering prototype bank and then refining masks in a separate stage; however, such two-stage pipelines are costly, sensitive to hyperparameters, and decouple prototype discovery from segmentation learning, limiting their effectiveness and efficiency. We propose a cluster-free, one-stage learnable-prototype framework with diversity regularization to enhance morphological intra-class heterogeneity coverage. Our approach achieves state-of-the-art (SOTA) performance on BCSS-WSSS, outperforming prior methods in mIoU and mDice. Qualitative segmentation maps show sharper boundaries and fewer mislabels, and activation heatmaps further reveal that, compared with clustering-based prototypes, our learnable prototypes cover more diverse and complementary regions within each class, providing consistent qualitative evidence for their effectiveness.
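
One common way to encourage diverse prototypes is to penalize pairwise cosine similarity within a class. The sketch below shows that generic form only; the abstract does not state the paper's actual regularizer, so `diversity_penalty` and its squared-similarity choice are assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def diversity_penalty(prototypes: List[Sequence[float]]) -> float:
    """Mean squared cosine similarity over all distinct prototype pairs."""
    k = len(prototypes)
    if k < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += cosine(prototypes[i], prototypes[j]) ** 2
            pairs += 1
    return total / pairs


collapsed = [[1.0, 0.0], [2.0, 0.0]]  # same direction: prototypes are redundant
spread = [[1.0, 0.0], [0.0, 1.0]]     # orthogonal: prototypes are complementary
print(diversity_penalty(collapsed), diversity_penalty(spread))
```

Adding such a term to the segmentation loss pushes the prototypes of one class apart, which matches the stated goal of covering heterogeneous morphology instead of collapsing onto the most distinctive pattern.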

[109] World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty

Zhiting Mei,Tenny Yin,Micah Baker,Ola Shorinwa,Anirudha Majumdar

Main category: cs.CV

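TL;DR: This paper proposes C3, an uncertainty quantification method for controllable video generation models that provides dense confidence estimates at the subpatch level, detecting and visualizing hallucinated regions in generated video.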

Details Motivation: Existing controllable video generation models often produce hallucinated frames that are inconsistent with physical reality and cannot assess their own confidence, which limits their reliability in critical tasks such as robot planning. Method: C3 1) trains the model with strictly proper scoring rules to improve correctness and calibration; 2) estimates uncertainty in latent space to avoid the training instability and high cost of pixel-space approaches; and 3) maps the latent-space uncertainty back to RGB pixel space to produce high-resolution uncertainty heatmaps. Result: Experiments on large-scale robot learning datasets (Bridge and DROID) and in real-world settings show that C3 provides calibrated uncertainty estimates within the training distribution and also detects out-of-distribution samples effectively. Conclusion: C3 gives controllable video models reliable, interpretable per-frame uncertainty estimates, increasing their trustworthiness and usefulness in critical applications. Abstract: Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate, generating future video frames that are misaligned with physical reality, which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations to empower video models to estimate their uncertainty. First, our method develops a novel framework that trains video models for correctness and calibration via strictly proper scoring rules. Second, we estimate the video model's uncertainty in latent space, avoiding training instability and prohibitive training costs associated with pixel-space approaches. Third, we map the dense latent-space uncertainty to interpretable pixel-level uncertainty in the RGB space for intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out-of-distribution detection.
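
Why a strictly proper scoring rule encourages calibration can be shown with the Brier score, one well-known example of such a rule. The abstract does not say which rule C3 uses, so the Brier score here is an assumed stand-in, and the function names are hypothetical.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(outcomes)


def expected_brier(reported: float, true_prob: float) -> float:
    """Expected score when the event truly occurs with probability true_prob."""
    return true_prob * (reported - 1.0) ** 2 + (1.0 - true_prob) * reported ** 2


# Scan reported confidences 0.00, 0.01, ..., 1.00 for a region that is
# actually correct 70% of the time; the minimum lands on the honest report.
true_prob = 0.7
best = min(range(101), key=lambda i: expected_brier(i / 100.0, true_prob))
print(best / 100.0)
```

Because the expected score is uniquely minimized by reporting the true probability, a model trained against such a rule gains nothing from being over- or under-confident, which is the property that motivates using proper scoring rules for calibration.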

[110] A Comparative Study on Synthetic Facial Data Generation Techniques for Face Recognition

Pedro Vidal,Bernardo Biesseck,Luiz E. L. Coelho,Roger Granada,David Menotti

Main category: cs.CV

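TL;DR: This study compares how effective synthetic face datasets created with different generation techniques are for face recognition, evaluating accuracy, rank-1, rank-5, and the true positive rate at a very low false positive rate. It finds that, despite substantial progress in diffusion models, GANs, and 3D models, a performance gap between synthetic and real data remains.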

Details Motivation: To address challenges in face recognition concerning explainability, demographic bias, privacy, and robustness to aging, pose, lighting, occlusion, and expression changes, and to respond to the degradation of real datasets caused by privacy regulations. Method: Data are generated with several synthetic face generation techniques (such as diffusion models, GANs, and 3D models) and evaluated on eight leading datasets, comparing accuracy, rank-1, rank-5, and TPR@FPR=0.01% on face recognition tasks. Result: Synthetic data captures realistic facial variations, and some techniques come close to real data, but overall recognition performance still trails real data, especially under extreme conditions. Conclusion: Synthetic face data is a promising way to mitigate privacy and bias problems; current generation techniques have made progress, but further research is needed to close the performance gap with real data. Abstract: Facial recognition has become a widely used method for authentication and identification, with applications for secure access and locating missing persons. Its success is largely attributed to deep learning, which leverages large datasets and effective loss functions to learn discriminative features. Despite these advances, facial recognition still faces challenges in explainability, demographic bias, privacy, and robustness to aging, pose variations, lighting changes, occlusions, and facial expressions. Privacy regulations have also led to the degradation of several datasets, raising legal, ethical, and privacy concerns. Synthetic facial data generation has been proposed as a promising solution. It mitigates privacy issues, enables experimentation with controlled facial attributes, alleviates demographic bias, and provides supplementary data to improve models trained on real data. This study compares the effectiveness of synthetic facial datasets generated using different techniques in facial recognition tasks. We evaluate accuracy, rank-1, rank-5, and the true positive rate at a false positive rate of 0.01% on eight leading datasets, offering a comparative analysis not extensively explored in the literature. Results demonstrate the ability of synthetic data to capture realistic variations while emphasizing the need for further research to close the performance gap with real data. Techniques such as diffusion models, GANs, and 3D models show substantial progress; however, challenges remain.
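
The verification and identification metrics listed above can be computed from similarity scores as in the following sketch. The toy scores and the tie-handling convention (accept only scores strictly above the threshold) are assumptions for this example; evaluation protocols for the eight datasets may differ.

```python
import math
from typing import List, Sequence


def tpr_at_fpr(genuine: Sequence[float], impostor: Sequence[float], fpr: float) -> float:
    """True positive rate at the strictest threshold meeting the FPR budget."""
    ranked = sorted(impostor, reverse=True)
    allowed = math.floor(fpr * len(ranked) + 1e-9)  # impostor pairs we may accept
    threshold = ranked[allowed] if allowed < len(ranked) else -math.inf
    return sum(score > threshold for score in genuine) / len(genuine)


def rank_k_accuracy(scores: List[List[float]], labels: List[int], k: int) -> float:
    """Share of probes whose true gallery index is among the k best matches."""
    hits = 0
    for row, label in zip(scores, labels):
        top = sorted(range(len(row)), key=lambda g: row[g], reverse=True)[:k]
        hits += label in top
    return hits / len(labels)


genuine = [0.9, 0.55, 0.4, 0.95]
impostor = [0.1, 0.2, 0.3, 0.5, 0.05, 0.15, 0.25, 0.35, 0.45, 0.6]
print(tpr_at_fpr(genuine, impostor, 0.1), tpr_at_fpr(genuine, impostor, 0.0))

probe_scores = [[0.9, 0.1, 0.2], [0.3, 0.2, 0.8], [0.5, 0.6, 0.4]]
true_ids = [0, 1, 0]
print(rank_k_accuracy(probe_scores, true_ids, 1), rank_k_accuracy(probe_scores, true_ids, 2))
```

Note that at an FPR of 0.01% at least 10,000 impostor pairs are needed before even one false accept fits the budget; with fewer pairs the threshold simply sits at the highest impostor score.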

[111] Synset Signset Germany: a Synthetic Dataset for German Traffic Sign Recognition

Anne Sielemann,Lena Loercher,Max-Lion Schumacher,Stefan Wolf,Masoud Roschani,Jens Ziehn

Main category: cs.CV

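TL;DR: Proposes Synset Signset Germany, a synthetic dataset for traffic sign recognition that combines data-driven and analytical modeling, contains 105500 images, and supports explainable AI and robustness testing.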

Details Motivation: Existing synthetic datasets fall short in realism and parameter controllability and cannot meet the needs of XAI and robustness evaluation. Method: GAN-generated textures simulate wear and dirt, combined with analytical rendering for physically correct lighting and fine-grained parameter control. Result: The dataset covers 211 German traffic sign classes with 105500 images in total and provides masks, segmentation images, and detailed metadata; experiments validate its realism on GTSRB and report better performance than CATERED. Conclusion: The method generates highly realistic, controllable traffic sign images suitable for model training, XAI analysis, and robustness testing. Abstract: In this paper, we present a synthesis pipeline and dataset for training / testing data in the task of traffic sign recognition that combines the advantages of data-driven and analytical modeling: GAN-based texture generation enables data-driven dirt and wear artifacts, rendering unique and realistic traffic sign surfaces, while the analytical scene modulation achieves physically correct lighting and allows detailed parameterization. In particular, the latter opens up applications in the context of explainable AI (XAI) and robustness tests due to the possibility of evaluating the sensitivity to parameter changes, which we demonstrate with experiments. Our resulting synthetic traffic sign recognition dataset Synset Signset Germany contains a total of 105500 images of 211 different German traffic sign classes, including newly published (2020) and thus comparatively rare traffic signs. In addition to a mask and a segmentation image, we also provide extensive metadata including the stochastically selected environment and imaging effect parameters for each image. We evaluate the degree of realism of Synset Signset Germany on the real-world German Traffic Sign Recognition Benchmark (GTSRB) and in comparison to CATERED, a state-of-the-art synthetic traffic sign recognition dataset.

[112] Measuring the Effect of Background on Classification and Feature Importance in Deep Learning for AV Perception

Anne Sielemann,Valentin Barner,Stefan Wolf,Masoud Roschani,Jens Ziehn,Juergen Beyerer

Main category: cs.CV

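TL;DR: This paper systematically generates six synthetic datasets for traffic sign recognition to study how background correlation, the degree of camera variation, and traffic sign shape affect classification performance and background feature importance.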

Details Motivation: Existing explainable AI methods make it hard to verify quantitatively whether a model relies on spurious correlations in the input (such as the background); correlations in real data are hard to control, while synthetic data often lacks sufficient quantification of realism and stochastic properties. A synthetic data approach that actively controls correlations is therefore needed to assess what a model bases its decisions on. Method: Six synthetic datasets are designed and generated that differ only in their degree of camera variation and background correlation; combined with known object locations (such as binary masks), the model's dependence on background features is analyzed under different conditions, and saliency methods such as SHAP and GradCAM assess the impact of input regions on the classification result. Result: The study quantifies how changes in the training domain affect the importance of background features for classification and shows under which conditions background features gain weight, indicating that models can indeed exploit background correlations, especially when camera variation is low or the background is strongly correlated with the class. Conclusion: Controllable synthetic data makes it possible to assess whether deep learning models rely on spurious features such as the background, providing a quantifiable means of validation for explainable AI and helping to identify and mitigate overfitting to non-robust features. Abstract: Common approaches to explainable AI (XAI) for deep learning focus on analyzing the importance of input features on the classification task in a given model: saliency methods like SHAP and GradCAM are used to measure the impact of spatial regions of the input image on the classification result. Combined with ground truth information about the location of the object in the input image (e.g., a binary mask), it is determined whether object pixels had a high impact on the classification result, or whether the classification focused on background pixels. The former is considered to be a sign of a healthy classifier, whereas the latter is assumed to suggest overfitting on spurious correlations. A major challenge, however, is that these intuitive interpretations are difficult to test quantitatively, and hence the output of such explanations lacks an explanation itself. One particular reason is that correlations in real-world data are difficult to avoid, and whether they are spurious or legitimate is debatable. Synthetic data, in turn, makes it possible to actively enable or disable correlations where desired but often lacks a sufficient quantification of realism and stochastic properties. [...] Therefore, we systematically generate six synthetic datasets for the task of traffic sign recognition, which differ only in their degree of camera variation and background correlation [...] to quantify the isolated influence of background correlation, different levels of camera variation, and considered traffic sign shapes on the classification performance, as well as background feature importance. [...] Results include a quantification of when and how much background features gain importance to support the classification task based on changes in the training domain [...]. Download: synset.de/datasets/synset-signset-ger/background-effect
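
Given a saliency map and a binary object mask, one simple way to quantify background feature importance is the share of absolute saliency that falls outside the object. This is an assumed metric for illustration; the abstract does not define the paper's exact measure, and `background_importance` is a hypothetical name.

```python
from typing import List


def background_importance(saliency: List[List[float]],
                          mask: List[List[int]]) -> float:
    """Fraction of total absolute saliency on background pixels (mask == 0)."""
    total = 0.0
    background = 0.0
    for sal_row, mask_row in zip(saliency, mask):
        for value, is_object in zip(sal_row, mask_row):
            total += abs(value)
            if not is_object:
                background += abs(value)
    return background / total if total else 0.0


# Toy 2x2 attribution map (for example from SHAP or GradCAM) and sign mask.
saliency_map = [[1.0, 1.0],
                [2.0, 0.0]]
object_mask = [[1, 0],
               [1, 0]]
print(background_importance(saliency_map, object_mask))
```

Tracking this fraction across datasets that differ only in background correlation would show how strongly a classifier shifts its attention to the background when the background becomes predictive of the class.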

[113] AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement

Munsif Ali,Najmul Hassan,Lucia Ventura,Davide Di Bari,Simonepietro Canese

Main category: cs.CV

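TL;DR: This paper proposes AQUA-Net, an underwater image enhancement model that combines frequency-domain and illumination-aware branches to restore color balance, contrast, and detail at low computational complexity, making it suitable for real-time applications.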

Details Motivation: Underwater images often suffer from color distortion and low contrast caused by light absorption and scattering; existing deep learning models are computationally expensive and hard to deploy in real time. Method: AQUA-Net uses a residual encoder-decoder structure that integrates a frequency fusion encoder and a Retinex-inspired illumination-aware decoder, which handle frequency information and the illumination component respectively, jointly optimizing spatial, frequency, and illumination features. Result: On multiple benchmark datasets it performs on par with SOTA methods with fewer parameters; ablation studies confirm the complementary roles of the two branches; a real-world, high-resolution underwater video dataset from the Mediterranean Sea is also introduced. Conclusion: AQUA-Net shows strong generalization and robustness and offers an efficient, lightweight solution for practical underwater imaging. Abstract: Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering. Simultaneously, existing deep learning models exhibit high computational complexity, which limits their practical deployment for real-time underwater applications. To address these challenges, this paper presents a novel underwater image enhancement model, called Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net). It integrates a residual encoder-decoder with dual auxiliary branches, which operate in the frequency and illumination domains. The frequency fusion encoder enriches spatial representations with frequency cues from the Fourier domain and preserves fine textures and structural details. Inspired by Retinex, the illumination-aware decoder performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design enables the model to restore color balance, visual contrast, and perceptual realism under diverse underwater conditions. Additionally, we present a high-resolution, real-world underwater video-derived dataset from the Mediterranean Sea, which captures challenging deep-sea conditions with realistic visual degradations to enable robust evaluation and development of deep learning models. Extensive experiments on multiple benchmark datasets show that AQUA-Net performs on par with SOTA in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies further confirm that the frequency and illumination branches provide complementary contributions that improve visibility and color representation. Overall, the proposed model shows strong generalization capability and robustness, and it provides an effective solution for real-world underwater imaging applications.

[114] EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Hongyu Li,Manyuan Zhang,Dian Zheng,Ziyu Guo,Yimeng Jia,Kaituo Feng,Hao Yu,Yexin Liu,Yan Feng,Peng Pei,Xunliang Cai,Linjiang Huang,Hongsheng Li,Si Liu

Main category: cs.CV

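TL;DR: This paper proposes EditThinker, a deliberative image editing framework that mimics the human cognitive loop, using an iterative think-while-edit process (critiquing results and refining instructions) to improve how well image editing models follow instructions. A multimodal large language model is trained with reinforcement learning to produce more targeted instruction refinements, improving existing models on several benchmarks.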

Details Motivation: Existing instruction-based image editing methods are limited by the stochasticity of single-pass generation and the lack of a reflection mechanism, so instruction-following success rates are low. A framework that keeps refining and adjusting instructions is needed to improve editing accuracy. Method: A framework built around a think-edit-critique-refine loop, in which a single multimodal large language model, EditThinker, is trained with reinforcement learning to jointly learn the critique score, the reasoning process, and instruction refinement, aligning its thinking with its editing. Result: Extensive experiments on four benchmarks show that the method improves the instruction-following capability of a range of image editing models by a large margin. The data construction framework, datasets, and models will be released to benefit the community. Conclusion: The study supports the effectiveness of cognitive-style reflection in image editing; by simulating the human thinking loop, EditThinker strengthens instruction following and offers a new paradigm for future intelligent image editing systems. Abstract: Instruction-based image editing has emerged as a prominent research area, which, benefiting from image generation foundation models, has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that lets editing models 'think' while they edit, which simulates the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align the EditThinker's thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach significantly improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.
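
The Think-while-Edit cycle described above can be sketched as a control loop around two callables. This is an outline under stated assumptions: `edit` stands in for any image editor and `think` for the critic model, and their signatures, the score `threshold`, and the `max_edits` budget are hypothetical, not the authors' interface.

```python
from typing import Callable, Tuple

Editor = Callable[[str, str], str]
Thinker = Callable[[str, str, str, str], Tuple[float, str]]


def think_while_edit(image: str, instruction: str, edit: Editor, think: Thinker,
                     threshold: float = 0.8, max_edits: int = 4) -> Tuple[str, str, int]:
    """Edit, critique, refine the instruction, and repeat until satisfied."""
    if max_edits < 1:
        raise ValueError("max_edits must be at least 1")
    current = instruction
    edits = 0
    for _ in range(max_edits):
        used = current
        result = edit(image, used)                                 # Edit
        edits += 1
        score, current = think(image, result, instruction, used)   # Critique + Refine
        if score >= threshold:                                     # good enough: stop
            break
    return result, used, edits


# Toy stand-ins so the loop can run without any model.
def toy_edit(image: str, instruction: str) -> str:
    return f"{image} <- {instruction}"


def toy_think(source: str, result: str, goal: str, current: str) -> Tuple[float, str]:
    if "red" in result:
        return 1.0, current
    return 0.2, current + " (make it red)"


print(think_while_edit("car.png", "recolor the car", toy_edit, toy_think))
```

Because the editor is passed in as a plain callable, the same loop can wrap different editing models, which mirrors the abstract's claim that the approach applies to any image editor.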