
cs.CL

[1] Empathy by Design: Aligning Large Language Models for Healthcare Dialogue

Emre Umucu,Guillermina Solis,Leon Garza,Emilia Rivas,Beatrice Lee,Anantaa Kotal,Aritran Piplai

Main category: cs.CL

TL;DR: This paper proposes an alignment framework based on Direct Preference Optimization (DPO) that aims to improve the factual accuracy and semantic coherence of large language models in caregiving dialogue, along with human-centric qualities such as empathy, politeness, and simplicity.

Motivation: General-purpose LLMs are factually unreliable and lack empathy in healthcare and caregiving applications, which limits their use in sensitive settings.

Method: A DPO-based alignment method fine-tunes domain-adapted LLMs on pairwise preference data, where preferred responses reflect a supportive, accessible communication style and rejected responses are prescriptive or overly technical.

Result: Experiments on multiple open and proprietary LLMs show that the method outperforms baselines and commercial systems (such as Google medical dialogue systems) on semantic alignment, factual accuracy, and human-centric evaluation scores.

Conclusion: Preference-based alignment offers a scalable and transparent path toward trustworthy, empathetic, and clinically informed AI assistants.

Abstract: General-purpose large language models (LLMs) have demonstrated remarkable generative and reasoning capabilities but remain limited in healthcare and caregiving applications due to two key deficiencies: factual unreliability and a lack of empathetic communication. These shortcomings pose significant risks in sensitive contexts where users, particularly non-professionals and caregivers, seek medically relevant guidance or emotional reassurance. To address these challenges, we introduce a Direct Preference Optimization (DPO)-based alignment framework designed to improve factual correctness, semantic coherence, and human-centric qualities such as empathy, politeness, and simplicity in caregiver-patient dialogues. Our approach fine-tunes domain-adapted LLMs using pairwise preference data, where preferred responses reflect supportive and accessible communication styles while rejected ones represent prescriptive or overly technical tones. This direct optimization method aligns model outputs with human preferences more efficiently than traditional reinforcement-learning-based alignment. Empirical evaluations across multiple open and proprietary LLMs show that our DPO-tuned models achieve higher semantic alignment, improved factual accuracy, and stronger human-centric evaluation scores compared to baseline and commercial alternatives such as Google medical dialogue systems. These improvements demonstrate that preference-based alignment offers a scalable and transparent pathway toward developing trustworthy, empathetic, and clinically informed AI assistants for caregiver and healthcare communication. Our open-source code is available at: https://github.com/LeonG19/Empathy-by-Design
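To make the pairwise objective concrete, here is a minimal sketch of the standard DPO loss for a single (preferred, rejected) pair. It is not taken from the authors' repository: the function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being tuned or the frozen reference model.
    beta (assumed default) scales how strongly the policy is pushed
    away from the reference.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

When the policy equals the reference, both margins are zero and the loss is log 2; raising the log-probability of the supportive response relative to the prescriptive one lowers it.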

[2] Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchitl Mixtec ASR

Chris Crawford

Main category: cs.CL

TL;DR: This paper studies the use of morphologically informed tokenizers to aid and streamline the interlinear gloss annotation of a Yoloxóchitl Mixtec audio corpus, combining ASR with text-based sequence-to-sequence tools to improve efficiency and reduce manual annotation effort.

Motivation: Yoloxóchitl Mixtec has complex non-concatenative morphology that challenges automatic speech recognition (ASR) and annotation, which motivates tokenization methods better suited to the language's structure.

Method: Two new nonlinear tokenization schemes are proposed, a Segment and Melody tokenizer (extracts tones without segmenting) and a Sequence of Processes tokenizer (predicts word segmentation), and both are compared with conventional BPE and Unigram models on ASR.

Result: The new tokenizers are competitive on word error rate, with the Segment-and-Melody model outperforming the traditional tokenizers, although a gap remains on character error rate. An analysis with morphological and information-theoretic metrics identifies predictors that correlate with downstream performance.

Conclusion: Nonlinear tokenizers designed for a language's non-concatenative morphology are competitive with conventional BPE and Unigram models for ASR; their applicability to downstream tasks needs further study.

Abstract: This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of a human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible. One of these approaches, a Segment and Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence of Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models, and the Segment-and-Melody model outperforms traditional tokenizers in terms of word error rate but does not reach the same character error rate. In addition, we analyze tokenizers on morphological and information-theoretic metrics to find predictive correlations with downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.

[3] Do You Feel Comfortable? Detecting Hidden Conversational Escalation in AI Chatbots

Jihyung Park,Saleh Afroogh,Junfeng Jiao

Main category: cs.CL

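TL;DR: This paper proposes GAUGE, a lightweight, logit-based framework for real-time detection of hidden emotional escalation in LLM conversations, addressing the failure of traditional toxicity filters and external classifiers to capture the affective dynamics of a dialogue.

Motivation: Existing safety mechanisms struggle to capture implicit, gradual emotional harm in conversation. Even without overtly toxic content, emotional reinforcement or affective drift can worsen a user's distress.

Method: GAUGE analyzes how the LLM's output probability distribution changes the affective state of the dialogue, using logit-level information to detect escalation trends in real time.

Result: The framework identifies potential affective risk in conversations without relying on external classifiers or clinical rubrics, and it is lightweight and runs in real time.

Conclusion: GAUGE offers a new technical route to protecting users' mental well-being when they interact with LLMs, particularly in applications that involve long-term emotional interaction.

Abstract: Large Language Models (LLMs) are increasingly integrated into everyday interactions, serving not only as information assistants but also as emotional companions. Even in the absence of explicit toxicity, repeated emotional reinforcement or affective drift can gradually escalate distress in a form of implicit harm that traditional toxicity filters fail to detect. Existing guardrail mechanisms often rely on external classifiers or clinical rubrics that may lag behind the nuanced, real-time dynamics of a developing conversation. To address this gap, we propose GAUGE (Guarding Affective Utterance Generation Escalation), a lightweight, logit-based framework for the real-time detection of hidden conversational escalation. GAUGE measures how an LLM's output probabilistically shifts the affective state of a dialogue.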

[4] Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety

Junyu Mao,Anthony Hills,Talia Tseriotou,Maria Liakata,Aya Shamir,Dan Sayda,Dana Atzil-Slonim,Natalie Djohari,Arpan Mandal,Silke Roth,Pamela Ugwudike,Mahesan Niranjan,Stuart E. Middleton

Main category: cs.CL

TL;DR: This paper proposes an LLM-based Confidence-Aware Fine-Grained Debate (CFD) framework in which multiple agents simulate human annotators to enrich NLP data with real-world indicators, significantly improving downstream tasks in mental health and online safety.

Motivation: Real-world indicators (such as mental health life events and risky online behaviour) are highly dynamic, so manual annotation is costly and difficult. Efficient, accurate data enrichment methods are needed to improve NLP task performance.

Method: In the CFD framework, multiple LLM agents simulate human annotators, exchange fine-grained evidence, and reach consensus with the help of confidence assessment. The framework is validated on two newly constructed expert-annotated datasets.

Result: CFD gives the most robust data enrichment performance among a range of baselines. Enriched features incorporated via debate transcripts outperform the non-enriched baseline by 10.1% on the online safety task.

Conclusion: CFD improves annotation quality and downstream performance in dynamic, sensitive domains, offering a practical route to low-cost, high-quality NLP data enrichment.

Abstract: Real-world indicators are important for improving natural language processing (NLP) tasks such as life events for mental health analysis and risky behaviour for online safety, yet labelling such information in NLP training datasets is often costly and/or difficult given the dynamic nature of such events. This paper compares several LLM-based data enrichment methods and introduces a novel Confidence-Aware Fine-Grained Debate (CFD) framework in which multiple LLM agents simulate human annotators and exchange fine-grained evidence to reach consensus. We describe two new expert-annotated datasets, a mental health Reddit wellbeing dataset and an online safety Facebook sharenting risk dataset. Our CFD framework achieves the most robust data enrichment performance compared to a range of baselines and we show that this type of data enrichment consistently improves downstream tasks. Enriched features incorporated via debate transcripts yield the largest gains, outperforming the non-enriched baseline by 10.1% for the online safety task.

[5] Policy-based Sentence Simplification: Replacing Parallel Corpora with LLM-as-a-Judge

Xuanxin Wu,Yuki Arase,Masaaki Nagata

Main category: cs.CL

TL;DR: This paper proposes using an LLM-as-a-Judge to automatically construct training data aligned with a specific simplification policy, enabling policy-driven control of sentence simplification without human annotation or parallel corpora.

Motivation: Existing sentence simplification methods struggle to control different simplification policies (such as replacing only complex words versus rewriting the whole sentence) and depend on costly human annotation or parallel corpora.

Method: An LLM acts as a judge to automatically construct training data aligned with the target simplification policy; the data is then used to train simplification systems that adapt to multiple policies.

Result: Experiments show that even small open-source LLMs (such as Phi-3-mini-3.8B) surpass GPT-4o on lexical-oriented simplification and perform comparably on overall rewriting, with consistent gains in both automatic metrics and human evaluation.

Conclusion: The method effectively addresses policy-driven sentence simplification, is robust, and generalizes across model sizes and families.

Abstract: Sentence simplification aims to modify a sentence to make it easier to read and understand while preserving the meaning. Different applications require distinct simplification policies, such as replacing only complex words at the lexical level or rewriting the entire sentence while trading off details for simplicity. However, achieving such policy-driven control remains an open challenge. In this work, we introduce a simple yet powerful approach that leverages Large Language Model-as-a-Judge (LLM-as-a-Judge) to automatically construct policy-aligned training data, completely removing the need for costly human annotation or parallel corpora. Our method enables building simplification systems that adapt to diverse simplification policies. Remarkably, even small-scale open-source LLMs such as Phi-3-mini-3.8B surpass GPT-4o on lexical-oriented simplification, while achieving comparable performance on overall rewriting, as verified by both automatic metrics and human evaluations. The consistent improvements across model families and sizes demonstrate the robustness of our approach.

[6] LOCUS: A System and Method for Low-Cost Customization for Universal Specialization

Dhanasekar Sundararaman,Keying Li,Wayne Xiong,Aashna Garg

Main category: cs.CL

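TL;DR: LOCUS is a low-cost customization pipeline for universal specialization that uses a small amount of labeled data to build and train NLP models through targeted retrieval, synthetic data generation, and parameter-efficient fine-tuning. On NER and TC tasks it outperforms strong baselines (including GPT-4o) while substantially reducing cost and model size.

Motivation: Building high-performing, low-resource NLP models efficiently from only a few labeled examples is a key problem. LOCUS aims for low-cost, high-accuracy model customization by combining retrieval, data augmentation, and parameter-efficient fine-tuning.

Method: LOCUS uses a three-stage pipeline: targeted retrieval of relevant data, synthesis of additional training samples via in-context data generation, and model tuning with either full-parameter fine-tuning or low-rank adaptation (LoRA).

Result: On named entity recognition and text classification, LOCUS outperforms strong baselines (such as GPT-4o). Its memory-optimized models retain 99% of fully fine-tuned accuracy with only 5% of the memory footprint, and they beat GPT-4o on several benchmarks with less than 1% of its parameters.

Conclusion: LOCUS provides an efficient, low-cost approach to model customization that matches or exceeds much larger models at very small resource cost, advancing lightweight NLP models.

Abstract: We present LOCUS (LOw-cost Customization for Universal Specialization), a pipeline that consumes few-shot data to streamline the construction and training of NLP models through targeted retrieval, synthetic data generation, and parameter-efficient tuning. With only a small number of labeled examples, LOCUS discovers pertinent data in a broad repository, synthesizes additional training samples via in-context data generation, and fine-tunes models using either full or low-rank (LoRA) parameter adaptation. Our approach targets named entity recognition (NER) and text classification (TC) benchmarks, consistently outperforming strong baselines (including GPT-4o) while substantially lowering costs and model sizes. Our resultant memory-optimized models retain 99% of fully fine-tuned accuracy while using barely 5% of the memory footprint, also beating GPT-4o on several benchmarks with less than 1% of its parameters.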
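As a rough illustration of why low-rank adaptation is cheaper than full fine-tuning, the sketch below counts trainable parameters for one weight matrix and applies a LoRA update in plain Python. It shows the general LoRA idea only, not the LOCUS implementation; the function names, shapes, and the `alpha / rank` scaling convention are assumptions, and the numbers it produces are unrelated to the 5% memory figure reported above.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole d_out x d_in matrix is fine-tuned."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

For a hypothetical 4096 x 4096 projection with rank 8, the two factors hold 65,536 trainable values against 16,777,216 for the full matrix, roughly 0.4%.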

[7] Convergence of Outputs When Two Large Language Models Interact in a Multi-Agentic Setup

Aniruddha Maiti,Satya Nimmagadda,Kartha Veerya Jammuladinne,Niladri Sengupta,Ananya Jana

Main category: cs.CL

TL;DR: The paper studies what happens when two large language models respond to each other in a multi-agent setup with no external input, and finds that conversations start coherently but eventually become repetitive, a form of convergence.

Motivation: To explore how LLMs behave in closed multi-turn interaction, in particular whether patterned or repetitive output emerges without external intervention.

Method: Using Mistral Nemo Base 2407 and Llama 2 13B hf, the conversation starts from a short seed sentence and the two models alternate responses for a fixed number of steps. Lexical and embedding-based similarity metrics track how the conversation evolves.

Result: Most conversations begin coherently but then fall into repetition, often looping on a short phrase. Once repetition starts, the two models' outputs grow similar and form a self-reinforcing loop.

Conclusion: Even large, separately trained models spontaneously converge to repetitive output in multi-turn interaction without external input, which points to an inherent stability or limitation of such systems.

Abstract: In this work, we report what happens when two large language models respond to each other for many turns without any outside input in a multi-agent setup. The setup begins with a short seed sentence. After that, each model reads the other's output and generates a response. This continues for a fixed number of steps. We used Mistral Nemo Base 2407 and Llama 2 13B hf. We observed that most conversations start coherently but later fall into repetition. In many runs, a short phrase appears and repeats across turns. Once repetition begins, both models tend to produce similar output rather than introducing a new direction in the conversation. This leads to a loop where the same or similar text is produced repeatedly. We describe this behavior as a form of convergence. It occurs even though the models are large, trained separately, and not given any prompt instructions. To study this behavior, we apply lexical and embedding-based metrics to measure how far the conversation drifts from the initial seed and how similar the outputs of the two models become as the conversation progresses.
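The abstract does not name the specific lexical metric, so the following sketch assumes token-set Jaccard similarity between consecutive turns as one plausible choice. The helper names and the 0.9 repetition threshold are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity of two turns (whitespace tokens, lowercased)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty turns are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def turn_similarities(turns):
    """Similarity between each turn and the one that follows it."""
    return [jaccard(turns[i], turns[i + 1]) for i in range(len(turns) - 1)]


def first_repetition(turns, threshold: float = 0.9):
    """Index of the first turn that nearly repeats its predecessor, or None."""
    for i, score in enumerate(turn_similarities(turns)):
        if score >= threshold:
            return i + 1
    return None
```

A rising similarity curve that then stays near 1.0 would correspond to the looping behavior described above; an embedding-based metric would replace `jaccard` with a cosine similarity over sentence vectors.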

[8] Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models

Chen Yang,Guangyue Peng,Jiaying Zhu,Ran Le,Ruixiang Feng,Tao Zhang,Wei Ruan,Xiaoqi Liu,Xiaoxue Cheng,Xiyun Xu,Yang Song,Yanzipeng Gao,Yiming Jia,Yun Xing,Yuntao Wen,Zekai Wang,Zhenwei An,Zhicong Sun,Zongchao Chen

Main category: cs.CL

TL;DR: Nanbeige4-3B is a small but high-performing language model family whose training strategy extends the boundary of the scaling law for small models.

Motivation: To improve small language models so that they can rival much larger ones.

Method: Pre-training uses a Fine-Grained Warmup-Stable-Decay (FG-WSD) scheduler. SFT data quality is improved by combining deliberative generation refinement with chain-of-thought reconstruction. Distillation uses the Dual Preference Distillation (DPD) method, and multi-stage reinforcement learning combines verifiable rewards with preference modeling.

Result: Nanbeige4-3B significantly outperforms models of comparable size on multiple benchmarks and performs on par with larger models.

Conclusion: With systematic training optimization, small language models can also achieve strong performance and are well suited to efficient deployment.

Abstract: We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, we extend the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, a multi-stage reinforcement learning phase was applied, leveraging verifiable rewards and preference modeling to strengthen abilities on both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.
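For readers unfamiliar with the Warmup-Stable-Decay shape, here is a minimal sketch of a generic three-phase learning-rate schedule. It does not reproduce the fine-grained data-mixture refinement that the abstract attributes to FG-WSD; the linear ramps, the fraction parameters, and the function name are all assumptions.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Generic warmup-stable-decay schedule (illustrative, linear ramps assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: ramp linearly from near zero up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak rate for most of training.
        return peak_lr
    # Decay: ramp linearly from the peak rate down to min_lr.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress
```

With `total_steps=1000`, `warmup_frac=0.1`, and `decay_frac=0.2`, the rate climbs for 100 steps, holds until step 800, and falls to `min_lr` by step 1000.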

[9] Modeling Contextual Passage Utility for Multihop Question Answering

Akriti Jain,Aparna Garimella

Main category: cs.CL

TL;DR: This paper proposes a lightweight, context-aware model of passage utility for passage reranking in multihop question answering, improving QA performance by accounting for dependencies between passages.

Motivation: Existing utility prediction methods model each passage independently, ignoring that in multihop reasoning a passage's utility depends on context, that is, on its relation to other passages.

Method: Reasoning traces from an advanced reasoning model are used to recover the order in which passages are used and to generate synthetic training data. A small transformer-based model is then fine-tuned to predict contextual utility scores that account for inter-passage dependencies.

Result: Compared with relevance-based reranking, the proposed utility-based passage scoring achieves better passage reranking and downstream QA performance.

Conclusion: Modeling context-dependent passage utility helps reduce the noise introduced by redundant information and improves the accuracy and robustness of multihop QA systems.

Abstract: Multihop Question Answering (QA) requires systems to identify and synthesize information from multiple text passages. While most prior retrieval methods assist in identifying relevant passages for QA, further assessing the utility of the passages can help in removing redundant ones, which may otherwise add to noise and inaccuracies in the generated answers. Existing utility prediction approaches model passage utility independently, overlooking a critical aspect of multihop reasoning: the utility of a passage can be context-dependent, influenced by its relation to other passages - whether it provides complementary information or forms a crucial link in conjunction with others. In this paper, we propose a lightweight approach to model contextual passage utility, accounting for inter-passage dependencies. We fine-tune a small transformer-based model to predict passage utility scores for multihop QA. We leverage the reasoning traces from an advanced reasoning model to capture the order in which passages are used to answer a question and obtain synthetic training data. Through comprehensive experiments, we demonstrate that our utility-based scoring of retrieved passages leads to improved reranking and downstream QA performance compared to relevance-based reranking methods.

[10] Knowing What's Missing: Assessing Information Sufficiency in Question Answering

Akriti Jain,Aparna Garimella

Main category: cs.CL

TL;DR: The paper proposes an Identify-then-Verify framework that generates hypotheses about missing information, reaches a semantic consensus, and then verifies it, improving the accuracy of context sufficiency judgments in question answering.

Motivation: Existing methods perform poorly on inferential questions and struggle to judge whether a context contains the information needed to answer.

Method: A two-stage Identify-then-Verify framework first generates multiple hypotheses about missing information and establishes a semantic consensus, then re-examines the source text to verify that the information is truly absent.

Result: The method outperforms existing baselines on multi-hop and factual QA datasets, judging sufficiency more accurately and clearly identifying information gaps.

Conclusion: Guiding the model to reason about and verify missing information significantly improves the robustness and interpretability of sufficiency judgments.

Abstract: Determining whether a provided context contains sufficient information to answer a question is a critical challenge for building reliable question-answering systems. While simple prompting strategies have shown success on factual questions, they frequently fail on inferential ones that require reasoning beyond direct text extraction. We hypothesize that asking a model to first reason about what specific information is missing provides a more reliable, implicit signal for assessing overall sufficiency. To this end, we propose a structured Identify-then-Verify framework for robust sufficiency modeling. Our method first generates multiple hypotheses about missing information and establishes a semantic consensus. It then performs a critical verification step, forcing the model to re-examine the source text to confirm whether this information is truly absent. We evaluate our method against established baselines across diverse multi-hop and factual QA datasets. The results demonstrate that by guiding the model to justify its claims about missing information, our framework produces more accurate sufficiency judgments while clearly articulating any information gaps.

[11] Classifying German Language Proficiency Levels Using Large Language Models

Elias-Leander Ahlers,Witold Brunsmann,Malte Schilling

Main category: cs.CL

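TL;DR: This study examines the use of large language models (LLMs) to automatically classify German texts by level of the Common European Framework of Reference for Languages (CEFR). It builds a diverse dataset from real and synthetic data and evaluates prompt engineering, model fine-tuning, and a probing-based method, with results that outperform prior methods.

Motivation: Accurate proficiency assessment is essential for instruction tailored to learners' needs, but manual grading is costly and hard to scale, so reliable and scalable automatic CEFR classification is needed.

Method: Multiple existing CEFR-annotated corpora are combined with synthetic data into a diverse dataset. Classification uses prompt engineering, fine-tuning of a LLaMA-3-8B-Instruct model, and a probing method based on the LLM's internal neural state.

Result: The proposed methods consistently outperform prior methods on CEFR level classification, with the best performance when fine-tuning is combined with probing.

Conclusion: LLMs have great potential for automatic CEFR classification; combining real and synthetic data with multiple modeling strategies improves reliability and scalability.

Abstract: Assessing language proficiency is essential for education, as it enables instruction tailored to learners' needs. This paper investigates the use of Large Language Models (LLMs) for automatically classifying German texts according to the Common European Framework of Reference for Languages (CEFR) into different proficiency levels. To support robust training and evaluation, we construct a diverse dataset by combining multiple existing CEFR-annotated corpora with synthetic data. We then evaluate prompt-engineering strategies, fine-tuning of a LLaMA-3-8B-Instruct model and a probing-based approach that utilizes the internal neural state of the LLM for classification. Our results show a consistent performance improvement over prior methods, highlighting the potential of LLMs for reliable and scalable CEFR classification.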

[12] ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models

Somnath Banerjee,Sayan Layek,Sayantan Adak,Mykola Pechenizkiy,Animesh Mukherjee,Rima Hazra

Main category: cs.CL

TL;DR: ProSocialAlign is a parameter-efficient, inference-time framework that generates safe, empathetic, and value-aligned responses without retraining the base model.

Motivation: Current language model safety paradigms fall short in emotionally charged or high-stakes settings: refusal alone can alienate users, while naive compliance increases risk.

Method: ProSocialAlign formalizes five human-centered objectives and casts safety as lexicographic constrained generation: hard constraints first eliminate harmful continuations, then prosocial quality is optimized within the safe set. It combines directional regulation (subtracting a "harm vector" in parameter space) with a jointly trained preference-aware autoregressive reward model, enabling fine-grained, controllable decoding.

Result: Experiments on five safety benchmarks show state-of-the-art performance in reducing unsafe leakage and improving alignment with human values, with significant gains on multiple metrics.

Conclusion: ProSocialAlign offers a robust, modular way to generate context-sensitive, safe, and human-aligned responses at inference time.

Abstract: Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
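The lexicographic ordering (safety first, then prosocial quality) can be sketched at the level of whole candidate responses. This is a simplification: the paper describes constrained decoding over continuations, whereas the hypothetical `select_response` below only reranks finished candidates, and both scoring functions and the threshold are stand-ins supplied by the caller.

```python
def select_response(candidates, harm_score, prosocial_score,
                    harm_threshold: float = 0.5):
    """Pick a response lexicographically: safety constraint first, quality second.

    harm_score and prosocial_score are caller-supplied callables (assumed here);
    harm_threshold is an illustrative cutoff, not a value from the paper.
    """
    # Stage 1: the hard constraint removes every candidate judged harmful.
    safe = [c for c in candidates if harm_score(c) <= harm_threshold]
    if not safe:
        return None  # nothing passes; the caller decides how to fall back
    # Stage 2: optimize prosocial quality only within the safe set.
    return max(safe, key=prosocial_score)
```

Because the filter runs before the maximization, a highly rated but unsafe candidate can never win, which is the property that distinguishes a lexicographic objective from a weighted sum of the two scores.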

[13] Adapting AlignScore Metric for Factual Consistency Evaluation of Text in Russian: A Student Abstract

Mikhail Zimin,Milyausha Shamsutdinova,Georgii Andriushchenko

Main category: cs.CL

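TL;DR: This paper presents AlignRuScore, a comprehensive metric for evaluating the factual consistency of Russian text. A RuBERT-based alignment model is fine-tuned on Russian and translated English datasets, porting the unified alignment metric to Russian and laying the groundwork for multilingual factual consistency evaluation.

Motivation: Existing factual consistency evaluation tools mainly target English corpora, and tools for Russian text are lacking, so an evaluation method for Russian is needed.

Method: A RuBERT-based alignment model with task-specific classification and regression heads is fine-tuned on Russian and translated English datasets to adapt the AlignScore metric to Russian.

Result: Experiments show that the unified alignment metric can be ported to Russian successfully. The authors release the translated datasets, model checkpoints, and code to support further research.

Conclusion: AlignRuScore provides an effective tool for evaluating factual consistency in Russian text and advances factual consistency evaluation in multilingual settings.

Abstract: Ensuring factual consistency in generated text is crucial for reliable natural language processing applications. However, there is a lack of evaluation tools for factual consistency in Russian texts, as existing tools primarily focus on English corpora. To bridge this gap, we introduce AlignRuScore, a comprehensive adaptation of the AlignScore metric for Russian. To adapt the metric, we fine-tuned a RuBERT-based alignment model with task-specific classification and regression heads on Russian and translated English datasets. Our results demonstrate that a unified alignment metric can be successfully ported to Russian, laying the groundwork for robust multilingual factual consistency evaluation. We release the translated corpora, model checkpoints, and code to support further research.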

[14] The Online Discourse of Virtual Reality and Anxiety

Kwabena Yamoah,Cass Dykeman

Main category: cs.CL

TL;DR: This study uses corpus linguistics to analyze the vocabulary and word networks in online discussion of virtual reality (VR) and anxiety, revealing what that discussion emphasizes about VR in mental health treatment and where it is heading.

Motivation: To understand what users think about VR as a treatment for anxiety, in order to improve the technology's effectiveness and applicability.

Method: A corpus linguistic methodology, using Sketch Engine software to analyze a VR-and-anxiety subcorpus of the English Trends corpus and identify frequent words and their collocations.

Result: VR, Oculus, and headset are the most frequently discussed words. Prepositional phrases such as of virtual reality, in virtual reality, and for virtual reality relate to design, experience, and development, respectively.

Conclusion: The findings offer a new perspective on how the public discusses VR and anxiety and point to future opportunities to support counseling needs through technology development and accessibility.

Abstract: VR in the treatment of clinical concerns such as generalized anxiety disorder or social anxiety. VR has created additional pathways to support patient well-being and care. Understanding online discussion of what users think about this technology may further support its efficacy. The purpose of this study was to employ a corpus linguistic methodology to identify the words and word networks that shed light on the online discussion of virtual reality and anxiety. Using corpus linguistics, frequently used words in discussion along with collocation were identified by utilizing Sketch Engine software. The results of the study, based upon the English Trends corpus, identified VR, Oculus, and headset as the most frequently discussed within the VR and anxiety subcorpus. These results point to the development of the virtual system, along with the physical apparatus that makes viewing and engaging with the virtual environment possible. Additional results point to collocation of prepositional phrases such as of virtual reality, in virtual reality, and for virtual reality relating to the design, experience, and development, respectively. These findings offer a new perspective on how VR and anxiety together are discussed in general discourse and offer pathways for future opportunities to support counseling needs through development and accessibility. Keywords: anxiety disorders, corpus linguistics, Sketch Engine, and virtual reality (VR)

[15] CMV-Fuse: Cross Modal-View Fusion of AMR, Syntax, and Knowledge Representations for Aspect Based Sentiment Analysis

Smitha Muthya Sudheendra,Mani Deep Cherukuri,Jaideep Srivastava

Main category: cs.CL

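TL;DR: This paper proposes CMV-Fuse, a cross-modal view fusion framework that improves aspect-based sentiment analysis by integrating four linguistic perspectives (Abstract Meaning Representation, constituency parsing, dependency syntax, and semantic attention) together with external knowledge.

Motivation: Existing aspect-based sentiment analysis (ABSA) systems typically use isolated linguistic views and overlook the complex interplay between structural representations, whereas human language understanding naturally integrates many kinds of linguistic information. A multi-view fusion method that emulates human language processing is therefore needed.

Method: CMV-Fuse systematically fuses four linguistic perspectives, Abstract Meaning Representation (AMR), constituency parsing, dependency syntax, and semantic attention, and adds external knowledge. Information is integrated through hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels, and a structure-aware multi-view contrastive learning mechanism keeps the representations consistent while remaining computationally efficient.

Result: Extensive experiments on standard benchmarks show substantial improvements over strong baselines, and ablation analysis confirms each linguistic view's contribution to more robust sentiment analysis.

Conclusion: By effectively fusing multiple linguistic views and external knowledge, CMV-Fuse improves ABSA performance and shows that emulating multi-level human language understanding is effective for sentiment analysis.

Abstract: Natural language understanding inherently depends on integrating multiple complementary perspectives spanning from surface syntax to deep semantics and world knowledge. However, current Aspect-Based Sentiment Analysis (ABSA) systems typically exploit isolated linguistic views, thereby overlooking the intricate interplay between structural representations that humans naturally leverage. We propose CMV-Fuse, a Cross-Modal View fusion framework that emulates human language processing by systematically combining multiple linguistic perspectives. Our approach systematically orchestrates four linguistic perspectives: Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention, enhanced with external knowledge integration. Through hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels, CMV-Fuse captures both fine-grained structural patterns and broad contextual understanding. A novel structure-aware multi-view contrastive learning mechanism ensures consistency across complementary representations while maintaining computational efficiency. Extensive experiments demonstrate substantial improvements over strong baselines on standard benchmarks, with analysis revealing how each linguistic view contributes to more robust sentiment analysis.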

[16] Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis

Amartya Hatua

Main category: cs.CL

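TL;DR: This study applies activation patching to GPT-2 in a mechanistic interpretability analysis of how sentiment information is processed: early layers handle lexical sentiment detection, while contextual integration (negation, sarcasm, and so on) happens mainly in the late layers in a unified way, not in the middle layers as expected.

Motivation: To test whether GPT-2 follows the hypothesized two-stage sentiment architecture of early lexical detection followed by mid-layer contextual integration.

Method: Activation patching is applied systematically across all 12 transformer layers of GPT-2 to analyze causally how each layer contributes to sentiment classification.

Result: Early layers (0-3) do act as stable, position-specific lexical sentiment detectors. However, all three hypotheses about mid-layer specialization for contextual phenomena (such as negation and sarcasm) are falsified; these phenomena are integrated mainly in late layers (8-11) through a unified, non-modular mechanism.

Conclusion: GPT-2's sentiment computation is not a hierarchical division of labor but a late, concentrated integration, which calls for further empirical study of contextual integration in large language models.

Abstract: We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position-specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses (Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing) are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, and domain shifts are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These experimental findings provide causal evidence that GPT-2's sentiment computation differs from the predicted hierarchical pattern, highlighting the need for further empirical characterization of contextual integration in large language models.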
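Activation patching can be illustrated on a toy model far simpler than GPT-2. The sketch below assumes a scalar "residual stream" in which each block adds its output to the running state; it caches every block's output on a clean input, replays a corrupted input with one block's output swapped for the cached clean value, and reports how much of the clean output is recovered. The block functions and helper names are hypothetical and bear no relation to the paper's actual setup.

```python
def run(blocks, x, patch=None):
    """Run a toy residual model; patch=(index, value) overrides one block's output."""
    state = x
    outputs = []
    for i, block in enumerate(blocks):
        delta = block(state)
        if patch is not None and patch[0] == i:
            delta = patch[1]  # swap in the cached activation
        outputs.append(delta)
        state = state + delta  # residual connection
    return state, outputs


def patching_effects(blocks, clean_x, corrupt_x):
    """Fraction of the clean-vs-corrupt output gap restored by patching each block."""
    clean_out, clean_acts = run(blocks, clean_x)
    corrupt_out, _ = run(blocks, corrupt_x)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same output")
    effects = []
    for i in range(len(blocks)):
        patched_out, _ = run(blocks, corrupt_x, patch=(i, clean_acts[i]))
        effects.append((patched_out - corrupt_out) / gap)
    return effects
```

A block whose patch restores most of the gap is causally important for the behavior under study, which is the kind of per-layer evidence the study reports for GPT-2's early and late layers.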

[17] PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

Bowen Jiang,Yuan Yuan,Maohao Shen,Zhuoqun Hao,Zhangchen Xu,Zichen Chen,Ziyi Liu,Anvesh Rao Vijjini,Jiashu He,Hanchao Yu,Radha Poovendran,Gregory Wornell,Lyle Ungar,Dan Roth,Sihao Chen,Camillo Jose Taylor

Main category: cs.CL

TL;DR: This paper introduces PersonaMem-v2, a state-of-the-art dataset for LLM personalization, and proposes a framework based on reinforcement fine-tuning and an agentic memory system that significantly improves performance on implicit personalization, with higher accuracy and more efficient memory use.

Motivation: Personalization is a key next step for AI capability and alignment. Existing models handle implicit user preferences and long-context reasoning poorly and fall short of real-world needs.

Method: The PersonaMem-v2 dataset covers 1,000 user-chatbot interactions, more than 300 scenarios, and more than 20,000 user preferences. Reinforcement fine-tuning improves the model's long-context reasoning, and a human-readable agentic memory that grows over time enables efficient personalization.

Result: Frontier LLMs reach only 37-48% accuracy on implicit personalization. With reinforcement fine-tuning, Qwen3-4B reaches 53% and outperforms GPT-5. The agentic memory framework raises accuracy further to 55% while using 16x fewer input tokens (a 2k-token memory instead of a 32k history).

Conclusion: The PersonaMem-v2 dataset and the agentic memory framework advance personalized intelligence and show that combining reinforcement fine-tuning with an efficient memory mechanism is a scalable path to real-world personalized AI.

Abstract: Personalization is one of the next milestones in advancing AI capability and alignment. We introduce PersonaMem-v2, the state-of-the-art dataset for LLM personalization that simulates 1,000 realistic user-chatbot interactions on 300+ scenarios, 20,000+ user preferences, and 128k-token context windows, where most user preferences are implicitly revealed to reflect real-world interactions. Using this data, we investigate how reinforcement fine-tuning enables a model to improve its long-context reasoning capabilities for user understanding and personalization. We also develop a framework for training an agentic memory system, which maintains a single, human-readable memory that grows with each user over time. In our experiments, frontier LLMs still struggle with implicit personalization, achieving only 37-48% accuracy. While they support long context windows, reasoning remains the bottleneck for implicit personalization tasks. Using reinforcement fine-tuning, we successfully train Qwen3-4B to outperform GPT-5, reaching 53% accuracy in implicit personalization. Moreover, our agentic memory framework achieves state-of-the-art 55% accuracy while using 16x fewer input tokens, relying on a 2k-token memory instead of full 32k conversation histories. These results underscore the impact of our dataset and demonstrate agentic memory as a scalable path toward real-world personalized intelligence.

[18] Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation

Chengbing Wang,Yang Zhang,Wenjie Wang,Xiaoyan Zhao,Fuli Feng,Xiangnan He,Tat-Seng Chua

Main category: cs.CL

TL;DR: This paper proposes FlyThinker, an efficient "think-while-generating" framework for personalized long-form generation in which a parallel reasoning model dynamically guides generation while keeping both training and inference efficient.

Motivation: Existing preference alignment methods mostly optimize population-level preferences and overlook individual users. Early personalization methods struggle to capture implicit preferences, and existing "think-then-generate" methods rely on static reasoning that limits adaptability and learning in long-form generation.

Method: FlyThinker uses a separate reasoning model that generates latent token-level reasoning signals in parallel and fuses them into the generation model to guide its output dynamically. The reasoning model depends only on previously generated responses, not on its own prior outputs, which preserves parallelism during training.

Result: Experiments on real-world benchmarks show that FlyThinker achieves better personalized generation while keeping training and inference efficient.

Conclusion: With its "think-while-generating" design, FlyThinker balances dynamic reasoning, training efficiency, and generation quality in personalized long-form generation, and it outperforms conventional "think-then-generate" methods.

Abstract: Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches, such as prompt customization or fine-tuning, struggle to reason over implicit preferences, limiting real-world effectiveness. Recent "think-then-generate" methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static one-shot reasoning must capture all relevant information for the full response generation, making learning difficult and limiting adaptability to evolving content. To address this issue, we propose FlyThinker, an efficient "think-while-generating" framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previous responses rather than its own prior outputs, which preserves training parallelism across different positions, allowing all reasoning tokens for training data to be produced in a single forward pass like standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while keeping training and inference efficiency.

[19] TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction

Aoi Fujita,Taichi Yamamoto,Yuri Nakayama,Ryota Kobayashi

Main category: cs.CL

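TL;DR: Proposes TopiCLEAR, a new method that extracts topics from short texts by clustering sentence embeddings with adaptive dimensionality reduction, outperforming existing methods on social media and news data.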

Details Motivation: Traditional topic models perform poorly on short texts such as social media posts because of sparse word co-occurrence and fragmented semantics, so a topic extraction method better suited to short, informal text is needed. Method: Texts are embedded with Sentence-BERT and provisionally clustered with a Gaussian mixture model (GMM); the clusters are then refined iteratively by alternating a supervised projection based on linear discriminant analysis (LDA) with GMM clustering, yielding adaptive dimensionality reduction and topic extraction. Result: On four datasets (20News, AgNewsTitle, Reddit, TweetTopic), TopiCLEAR achieves higher similarity to human-annotated topics than seven baseline methods, performs especially well on social media text, and produces more interpretable topics. Conclusion: TopiCLEAR handles the challenges of short-text topic modeling effectively, requires no text preprocessing, and is suited to social media and web content analysis, with good potential for practical use. Abstract: Rapid expansion of social media platforms such as X (formerly Twitter), Facebook, and Reddit has enabled large-scale analysis of public perceptions on diverse topics, including social issues, politics, natural disasters, and consumer sentiment. Topic modeling is a widely used approach for uncovering latent themes in text data, typically framed as an unsupervised classification task. However, traditional models, originally designed for longer and more formal documents, struggle with short social media posts due to limited co-occurrence statistics, fragmented semantics, inconsistent spelling, and informal language. To address these challenges, we propose a new method, TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction. Specifically, each text is embedded using Sentence-BERT (SBERT) and provisionally clustered using Gaussian Mixture Models (GMM). The clusters are then refined iteratively using a supervised projection based on linear discriminant analysis, followed by GMM-based clustering until convergence. Notably, our method operates directly on raw text, eliminating the need for preprocessing steps such as stop word removal. We evaluate our approach on four diverse datasets, 20News, AgNewsTitle, Reddit, and TweetTopic, each containing human-labeled topic information. Compared with seven baseline methods, including a recent SBERT-based method and a zero-shot generative AI method, our approach achieves the highest similarity to human-annotated topics, with significant improvements for both social media posts and online news articles. 
Additionally, qualitative analysis shows that our method produces more interpretable topics, highlighting its potential for applications in social media data and web content analytics.

[20] Parameter-Efficient Fine-Tuning with Differential Privacy for Robust Instruction Adaptation in Large Language Models

Yulin Huang,Yaxuan Luan,Jinxu Guo,Xiangchen Song,Yuchen Liu

Main category: cs.CL

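TL;DR: Proposes a collaborative optimization framework that combines differential privacy with parameter-efficient fine-tuning to protect privacy and improve efficiency in instruction fine-tuning of large language models.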

Details Motivation: To address the privacy-leakage risk and low training efficiency that large language models face during instruction fine-tuning. Method: The backbone model is kept frozen and parameters are updated through a low-dimensional projection subspace, while gradient clipping and adaptive noise allocation are introduced during gradient computation, forming a unified collaborative optimization framework. Result: The method outperforms baseline models in accuracy, privacy budget consumption, and parameter efficiency, and remains stable under different data conditions. Conclusion: The method effectively balances privacy protection, model performance, and training efficiency, enriches the theoretical integration of differential privacy and parameter-efficient fine-tuning, and shows practical potential for secure training in complex instruction environments. Abstract: This study addresses the issues of privacy protection and efficiency in instruction fine-tuning of large-scale language models by proposing a parameter-efficient method that integrates differential privacy noise allocation with gradient clipping in a collaborative optimization framework. The method keeps the backbone model frozen and updates parameters through a low-dimensional projection subspace, while introducing clipping and adaptive noise allocation during gradient computation. This design reduces privacy budget consumption and ensures training stability and robustness. The unified framework combines gradient constraints, noise allocation, and parameter projection, effectively mitigating performance fluctuations and privacy risks in multi-task instruction scenarios. Experiments are conducted across hyperparameter, environment, and data sensitivity dimensions. Results show that the method outperforms baseline models in accuracy, privacy budget, and parameter efficiency, and maintains stable performance under diverse and uncertain data conditions. The findings enrich the theoretical integration of differential privacy and parameter-efficient fine-tuning and demonstrate its practical adaptability in instruction tasks, providing a feasible solution for secure training in complex instruction environments.
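
The abstract does not spell out its clipping or noise-allocation rule, so the following is only a minimal sketch of the generic DP-SGD-style step that such methods build on: clip each per-example gradient to a fixed L2 norm, sum, add Gaussian noise, and average. The function names, the fixed `noise_multiplier`, and the toy gradients are all assumptions for illustration; the paper's adaptive noise allocation and low-dimensional projection are not modeled here.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip every gradient, sum them, add Gaussian noise, then average.

    The noise standard deviation is noise_multiplier * max_norm, the usual
    DP-SGD convention (an assumption here, not the paper's allocation rule).
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Toy gradients for the small set of trainable (adapter) parameters.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=0.5, rng=rng)
```

Clipping bounds how much any single example can move the update, which is what lets the added noise translate into a privacy guarantee; applying it only to a small trainable subspace keeps the noise dimension low.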

[21] "The Dentist is an involved parent, the bartender is not": Revealing Implicit Biases in QA with Implicit BBQ

Aarushi Wagh,Saniya Srivastava

Main category: cs.CL

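TL;DR: This paper introduces ImplicitBBQ, a new benchmark for detecting implicit bias in large language models. It extends BBQ to evaluate bias across 6 categories through implicit cues and finds that GPT-4o performs notably worse on implicit prompts, revealing implicit biases that explicit benchmarks fail to capture.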

Details Motivation: Existing bias benchmarks rely mainly on explicit attribute descriptions and struggle to capture the biases conveyed implicitly in real interactions through names, cultural cues, and similar signals, leaving a blind spot in fairness evaluation. Method: Builds the ImplicitBBQ benchmark by extending the original BBQ dataset into a version with implicit cues across 6 categories, then tests GPT-4o and compares its performance on explicit and implicit prompts. Result: GPT-4o's accuracy drops by up to 7% on implicit prompts (for example, in the sexual orientation subcategory), with a consistent decline in most categories, indicating implicit biases that explicit benchmarks do not detect. Conclusion: ImplicitBBQ exposes implicit social biases in large language models at a finer grain and provides an important tool for fairness evaluation in NLP. Abstract: Existing benchmarks evaluating biases in large language models (LLMs) primarily rely on explicit cues, declaring protected attributes like religion, race, gender by name. However, real-world interactions often contain implicit biases, inferred subtly through names, cultural cues, or traits. This critical oversight creates a significant blind spot in fairness evaluation. We introduce ImplicitBBQ, a benchmark extending the Bias Benchmark for QA (BBQ) with implicitly cued protected attributes across 6 categories. Our evaluation of GPT-4o on ImplicitBBQ illustrates troubling performance disparity from explicit BBQ prompts, with accuracy declining up to 7% in the "sexual orientation" subcategory and consistent decline across most other categories. This indicates that current LLMs contain implicit biases undetected by explicit benchmarks. ImplicitBBQ offers a crucial tool for nuanced fairness evaluation in NLP.

[22] A Patient-Doctor-NLP-System to contest inequality for less privileged

Subrit Dikshit,Ritu Tiwari,Priyank Jain

Main category: cs.CL

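TL;DR: Proposes PDFTEMRA, a compact Transformer architecture that combines model distillation, frequency-domain modulation, ensemble learning, and randomized activation patterns for efficient medical NLP serving low-resource languages (such as Hindi) and visually impaired users.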

Details Motivation: Large language models are hard to deploy in resource-constrained, real-world healthcare settings, and medical NLP support for low-resource languages and visually impaired users is limited, so an efficient and inclusive solution is needed. Method: Proposes the PDFTEMRA model, which combines model distillation, frequency-domain modulation, ensemble learning, and randomized activation, and trains and evaluates it on medical question-answering and consultation datasets tailored to Hindi and accessibility scenarios. Result: PDFTEMRA achieves performance comparable to mainstream NLP models while substantially reducing computational cost. Conclusion: PDFTEMRA is suited to resource-constrained medical NLP settings that emphasize accessibility and inclusion, and is particularly relevant for low-resource languages and visually impaired users. Abstract: Transfer Learning (TL) has accelerated the rapid development and availability of large language models (LLMs) for mainstream natural language processing (NLP) use cases. However, training and deploying such gigantic LLMs in resource-constrained, real-world healthcare situations remains challenging. This study addresses the limited support available to visually impaired users and speakers of low-resource languages such as Hindi who require medical assistance in rural environments. We propose PDFTEMRA (Performant Distilled Frequency Transformer Ensemble Model with Random Activations), a compact transformer-based architecture that integrates model distillation, frequency-domain modulation, ensemble learning, and randomized activation patterns to reduce computational cost while preserving language understanding performance. The model is trained and evaluated on medical question-answering and consultation datasets tailored to Hindi and accessibility scenarios, and its performance is compared against standard NLP state-of-the-art model baselines. Results demonstrate that PDFTEMRA achieves comparable performance with substantially lower computational requirements, indicating its suitability for accessible, inclusive, low-resource medical NLP applications.

[23] One Word Is Not Enough: Simple Prompts Improve Word Embeddings

Rajeev Ranjan

Main category: cs.CL

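TL;DR: Prepending a semantic prompt to a word (such as "meaning: {word}") substantially improves text embedding models on word-level similarity tasks; without any training, the technique surpasses traditional static word vectors on standard benchmarks and sets a new best result.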

Details Motivation: Existing text embedding models are designed mainly for sentence-level tasks; their word-level behavior is weak and poorly understood, which calls for simple, effective ways to improve similarity evaluation on isolated words. Method: Tests the effect of different semantic prompt templates (such as "meaning: {word}") on word similarity across several mainstream text embedding models, using three standard benchmarks (SimLex-999, WordSim-353, MEN-3000) and Spearman correlation as the metric. Result: Prompts substantially improve word similarity performance, by up to +0.29 on SimLex-999; some models recover from near-zero correlation with a +0.73 improvement. The best results reach 0.692 on SimLex-999 (Cohere), 0.811 on WordSim-353, and 0.855 on MEN-3000 (OpenAI), all above classic static embeddings such as Word2Vec and LexVec. Conclusion: Adding a semantic prompt is a simple and effective zero-shot technique that markedly improves modern text embedding models on word-level semantic tasks and sets a new reference point for pure embedding methods. Abstract: Text embedding models are designed for sentence-level applications like retrieval and semantic similarity, and are primarily evaluated on sentence-level benchmarks. Their behavior on isolated words is less understood. We show that simply prepending semantic prompts to words before embedding substantially improves word similarity correlations. Testing 7 text embedding models, including text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), voyage-3 (Voyage AI), all-mpnet-base-v2, and Qwen3-Embedding-8B, on 3 standard benchmarks (SimLex-999, WordSim-353, MEN-3000), we find that prompts like "meaning: {word}" or "Represent the semantic concept: {word}" improve Spearman correlations by up to +0.29 on SimLex-999. Some models fail completely on bare words (correlation = 0) but recover with prompts (+0.73 improvement). Our best results achieve correlation = 0.692 on SimLex-999 with embed-english-v3.0 (Cohere), correlation = 0.811 on WordSim-353, and correlation = 0.855 on MEN-3000 with text-embedding-3-large (OpenAI). These results outperform classic static embeddings like Word2Vec (correlation = 0.40) and even the best static method LexVec (correlation = 0.48) on SimLex-999, establishing a new state-of-the-art for pure embedding methods. This zero-shot technique requires no training and works with any text embedding model.
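
The evaluation loop implied by the abstract can be sketched as follows: wrap each word in a prompt template, embed it, score word pairs by cosine similarity, and compare against human ratings with Spearman correlation. This is a minimal sketch, not the paper's code; `embed` is a hypothetical callable standing in for whichever embedding model is used, and `evaluate` is an illustrative name.

```python
import math


def apply_prompt(word, template="meaning: {word}"):
    """Wrap a bare word in a semantic prompt before embedding."""
    return template.format(word=word)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs, human_scores, embed, template="meaning: {word}"):
    """Correlate prompted-embedding similarities with human ratings.

    `embed` is a hypothetical text -> vector callable supplied by the caller.
    """
    sims = [
        cosine(embed(apply_prompt(a, template)), embed(apply_prompt(b, template)))
        for a, b in pairs
    ]
    return spearman(sims, human_scores)
```

Running `evaluate` once with the template `"{word}"` and once with `"meaning: {word}"` gives the bare-word and prompted correlations that the abstract compares.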

[24] Becoming Experienced Judges: Selective Test-Time Learning for Evaluators

Seungyeon Jwa,Daechul Ahn,Reokyoung Kim,Dongyeop Kang,Jonghyun Choi

Main category: cs.CL

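TL;DR: Proposes Learning While Evaluating (LWE), a framework that lets evaluators keep improving at inference time without training by refining a meta-prompt through self-generated feedback.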

Details Motivation: Existing LLM evaluation methods treat each sample independently, accumulate no experience, and use a fixed prompt that ignores the need for sample-specific criteria. Method: Designs the LWE framework, which maintains an evolving meta-prompt that produces sample-specific evaluation instructions and refines itself through self-generated feedback; further proposes a selective update strategy (Selective LWE) that updates the meta-prompt only on self-inconsistent cases. Result: On two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, showing that evaluators can keep improving at inference time through selective updates. Conclusion: Evaluators can learn and improve during inference without training by updating selectively, and they learn most from difficult samples. Abstract: Automatic evaluation with large language models, commonly known as LLM-as-a-judge, is now standard across reasoning and alignment tasks. Despite evaluating many samples in deployment, these evaluators typically (i) treat each case independently, missing the opportunity to accumulate experience, and (ii) rely on a single fixed prompt for all cases, neglecting the need for sample-specific evaluation criteria. We introduce Learning While Evaluating (LWE), a framework that allows evaluators to improve sequentially at inference time without requiring training or validation sets. LWE maintains an evolving meta-prompt that (i) produces sample-specific evaluation instructions and (ii) refines itself through self-generated feedback. Furthermore, we propose Selective LWE, which updates the meta-prompt only on self-inconsistent cases, focusing computation where it matters most. This selective approach retains the benefits of sequential learning while being far more cost-effective. Across two pairwise comparison benchmarks, Selective LWE outperforms strong baselines, empirically demonstrating that evaluators can improve during sequential testing with a simple selective update, learning most from the cases they struggle with.
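
The selective update loop can be sketched as below. The abstract does not define "self-inconsistent", so this sketch assumes one common check for pairwise judges: evaluate the pair in both orders and treat disagreement as inconsistency. `judge` and `refine` are hypothetical callables standing in for LLM calls, and the toy versions exist only to make the example runnable.

```python
def selective_lwe(samples, judge, refine, meta_prompt):
    """Judge pairs sequentially, updating the meta-prompt only when the
    judge contradicts itself after the two answers are swapped.

    judge(meta_prompt, first, second) returns "A" or "B" (hypothetical API).
    refine(meta_prompt, a, b) returns an updated meta-prompt (hypothetical API).
    """
    verdicts = []
    updates = 0
    for a, b in samples:
        first = judge(meta_prompt, a, b)
        swapped = judge(meta_prompt, b, a)
        # Consistent means the same underlying answer wins in both orders.
        consistent = {first, swapped} == {"A", "B"}
        if not consistent:
            meta_prompt = refine(meta_prompt, a, b)
            updates += 1
            first = judge(meta_prompt, a, b)
        verdicts.append(first)
    return verdicts, meta_prompt, updates


def toy_judge(meta_prompt, first, second):
    """Stand-in judge: position-biased until the prompt tells it otherwise."""
    if "prefer the longer answer" in meta_prompt:
        return "A" if len(first) >= len(second) else "B"
    return "A"


def toy_refine(meta_prompt, a, b):
    """Stand-in for self-generated feedback appended to the meta-prompt."""
    return meta_prompt + " Ignore position; prefer the longer answer."


verdicts, final_prompt, updates = selective_lwe(
    [("short", "a longer answer"), ("detailed reply", "ok")],
    toy_judge,
    toy_refine,
    "Pick the better answer.",
)
```

In this toy run only the first pair triggers an update; the second pair is judged consistently under the refined prompt, so no further refinement cost is paid.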

[25] From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs

Yuchuan Tian,Yuchen Liang,Jiacheng Sun,Shuo Zhang,Guangwen Yang,Yingte Shu,Sibo Fang,Tianyu Guo,Kai Han,Chao Xu,Hanting Chen,Xinghao Chen,Yunhe Wang

Main category: cs.CL

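TL;DR: This paper proposes a systematic adaptation path from autoregressive (AR) models to block-diffusion models. Using a context-causal attention mask, a parallel adaptation procedure, an auxiliary AR loss, and a gradually increasing block size, it transfers knowledge efficiently and achieves state-of-the-art performance among 7B-scale diffusion language models.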

Details Motivation: Existing diffusion language models are expensive to train and do not exploit the knowledge in mature autoregressive models; directly moving an AR model into a block-diffusion framework runs into a fundamental conflict between causality and bidirectionality, and no systematic adaptation method exists. Method: Treats AR as the special case of block diffusion with block size 1 and designs a progressive adaptation path: a context-causal attention mask (causal in the context, bidirectional within the block), an efficient parallel adaptation procedure, an auxiliary AR loss to retain pretrained knowledge, and a gradually increasing generation block size, keeping training and inference consistent. Result: NBDiff-7B (Base and Instruct), built with this method, clearly outperforms strong baselines on general-knowledge, math, and code benchmarks and is the leading diffusion language model in the 7B class. Conclusion: This structured adaptation path from AR to block diffusion is an efficient, low-compute alternative to training DLMs from scratch, and it successfully transfers the long-context modeling and reasoning abilities of the AR model. Abstract: Large language models (LLMs) excel at generation but dominant autoregressive (AR) decoding is inherently sequential, creating a throughput bottleneck. Diffusion Language Models (DLMs), especially block-wise variants, enable parallel generation and intra-block bidirectional reasoning, yet training large DLMs from scratch is costly and wastes the knowledge in mature AR checkpoints. Prior "adaptation" attempts either modify logits or randomly grow attention masks to full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, leaving a fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. We reframe adaptation as an intra-paradigm path from AR to Block-Diffusion by viewing AR as Block-Diffusion with blocksize=1. Concretely, we design the pathway of adaptation as follows: we use a context-causal attention mask (causal in context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss to maximize data utilization and retain pretrained knowledge, and gradual increment of the generation block size. The recipe integrates cleanly with masked block-diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) could inherit the long-context modeling and reasoning capabilities, and achieve state-of-the-art performance among the 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines. 
These results demonstrate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Codes: https://github.com/YuchuanTian/NBDiff.
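
The context-causal mask described in the abstract ("causal in context, bidirectional only within the active block") can be illustrated with a small boolean matrix. This is a minimal single-block sketch under that reading of the abstract; the function name and the context/active-block split are assumptions, and the mask used in the paper's parallel training procedure may be laid out differently.

```python
def context_causal_mask(context_len, block_size):
    """Return mask[i][j] = True when query position i may attend to key j.

    Positions 0..context_len-1 are context and attend causally (j <= i).
    The last block_size positions form the active block: each sees the
    whole context and every position inside the block.
    """
    n = context_len + block_size
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if i < context_len:
                allowed = j <= i  # context token: ordinary causal attention
            else:
                allowed = True    # active-block token: context plus full block
            row.append(allowed)
        mask.append(row)
    return mask


def causal_mask(n):
    """Standard lower-triangular autoregressive mask, for comparison."""
    return [[j <= i for j in range(n)] for i in range(n)]


# With block_size=1 the mask reduces to the plain causal mask, matching the
# abstract's view of AR as block diffusion with blocksize=1.
same_as_ar = context_causal_mask(3, 1) == causal_mask(4)
```

Growing `block_size` from 1 upward changes only the rows of the active block, which is one way to see why a gradual increase in block size gives a smooth path away from the AR checkpoint.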

[26] LLM4SFC: Sequential Function Chart Generation via Large Language Models

Ofek Glick,Vladimir Tchuiev,Marah Ghoummaid,Michal Moshkovitz,Dotan Di-Castro

Main category: cs.CL

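TL;DR: This paper proposes LLM4SFC, the first framework that converts natural-language descriptions of industrial workflows into executable Sequential Function Chart (SFC) programs. It combines a reduced representation, fine-tuning with retrieval-augmented generation, and structured generation to handle the difficulties caused by SFC's graphical nature and embedded code.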

Details Motivation: Large language models are mostly used to generate textual PLC code (such as Structured Text), while graphical languages in the IEC 61131-3 standard (such as SFC) are poorly supported, and generated results are often not executable because of syntax errors or format violations. Method: Proposes the LLM4SFC framework: (i) a reduced structured representation that captures the essential SFC topology and in-line ST code; (ii) fine-tuning combined with retrieval-based few-shot augmentation to align with SFC programming conventions; (iii) a structured generation approach that prunes illegal tokens in real time so that the output complies with the SFC textual format. Result: Evaluated on a dataset of SFCs from real manufacturing projects with both open-source and proprietary LLMs, the framework reaches a generation success rate of 75% to 94%, markedly improving the correctness and usability of generated SFCs. Conclusion: LLM4SFC is the first to achieve reliable conversion from natural language to executable SFC programs, bridging graphical and textual PLC languages and advancing intelligent industrial automation programming. Abstract: While Large Language Models (LLMs) are increasingly used for synthesizing textual PLC programming languages like Structured Text (ST) code, other IEC 61131-3 standard graphical languages like Sequential Function Charts (SFCs) remain underexplored. Generating SFCs is challenging due to their graphical nature and the ST actions embedded within, which are not directly compatible with standard generation techniques, often leading to non-executable code that is incompatible with industrial tool-chains. In this work, we introduce LLM4SFC, the first framework to receive natural-language descriptions of industrial workflows and provide executable SFCs. LLM4SFC is based on three components: (i) A reduced structured representation that captures essential topology and in-line ST and reduces textual verbosity; (ii) Fine-tuning and few-shot retrieval-augmented generation (RAG) for alignment with SFC programming conventions; and (iii) A structured generation approach that prunes illegal tokens in real-time to ensure compliance with the textual format of SFCs. We evaluate LLM4SFC on a dataset of real-world SFCs from automated manufacturing projects, using both open-source and proprietary LLMs. The results show that LLM4SFC reliably generates syntactically valid SFC programs, effectively bridging graphical and textual PLC languages and achieving a generation success rate of 75% - 94%, paving the way for automated industrial programming.

[27] Large Language Model-Based Generation of Discharge Summaries

Tiago Rodrigues,Carla Teixeira Lopes

Main category: cs.CL

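TL;DR: This study evaluates five large language models (open-source and proprietary) on automatic discharge summary generation and finds that proprietary models, Gemini in particular, outperform open-source models in similarity and clinical utility.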

Details Motivation: Automating discharge summaries could reduce the burden on medical professionals, cut errors, and make patient information more accessible, so it is worth examining how large language models perform on this task. Method: Using clinical notes and gold-standard discharge summaries from the MIMIC-III dataset, tests five large language models (Mistral, Llama 2, GPT-3, GPT-4, and Gemini 1.5 Pro), with automatic evaluation by exact-match, soft-overlap, and reference-free metrics and human evaluation by a clinical expert. Result: Proprietary models, especially Gemini with one-shot prompting, produce summaries most similar to the gold standard; open-source models (such as Mistral after fine-tuning) show promise but suffer from hallucinations and repetition; the clinical expert confirms that summaries from proprietary models are practically useful. Conclusion: Despite challenges such as hallucinations and missing information, large language models, especially proprietary ones, are promising for automatic discharge summary generation provided data privacy is ensured. Abstract: Discharge Summaries are documents written by medical professionals that detail a patient's visit to a care facility. They contain a wealth of information crucial for patient care, and automating their generation could significantly reduce the effort required from healthcare professionals, minimize errors, and ensure that critical patient information is easily accessible and actionable. In this work, we explore the use of five Large Language Models on this task, from open-source models (Mistral, Llama 2) to proprietary systems (GPT-3, GPT-4, Gemini 1.5 Pro), leveraging MIMIC-III summaries and notes. We evaluate them using exact-match, soft-overlap, and reference-free metrics. Our results show that proprietary models, particularly Gemini with one-shot prompting, outperformed others, producing summaries with the highest similarity to the gold-standard ones. Open-source models, while promising, especially Mistral after fine-tuning, lagged in performance, often struggling with hallucinations and repeated information. Human evaluation by a clinical expert confirmed the practical utility of the summaries generated by proprietary models. Despite the challenges, such as hallucinations and missing information, the findings suggest that LLMs, especially proprietary models, are promising candidates for automatic discharge summary generation as long as data privacy is ensured.

[28] CAuSE: Decoding Multimodal Classifiers using Faithful Natural Language Explanation

Dibyanayan Bandyopadhyay,Soham Bhattacharjee,Mohammed Hasanuzzaman,Asif Ekbal

Main category: cs.CL

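TL;DR: This paper proposes CAuSE, a framework that generates natural language explanations (NLEs) for multimodal classifiers through causal abstraction under simulated explanations, and validates its faithfulness and generalization across datasets and models.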

Details Motivation: Multimodal classifiers are usually treated as black boxes, and existing explanation methods are either unintuitive or do not faithfully reflect the model's decision process, so a more trustworthy and interpretable approach is needed to build user trust. Method: Proposes the CAuSE framework, which builds a causal abstraction through interchange intervention training, and designs a new metric for causal faithfulness in multimodal settings. Result: Experiments show that CAuSE generalizes well across datasets and models and outperforms other methods on the causal faithfulness metric; qualitative analysis further supports its advantages, while specific failure cases are also identified. Conclusion: CAuSE is an effective framework for generating faithful natural language explanations; both theory and experiments support its validity as a causal abstraction, helping make multimodal models more interpretable and trustworthy. Abstract: Multimodal classifiers function as opaque black box models. While several techniques exist to interpret their predictions, very few of them are as intuitive and accessible as natural language explanations (NLEs). To build trust, such explanations must faithfully capture the classifier's internal decision making behavior, a property known as faithfulness. In this paper, we propose CAuSE (Causal Abstraction under Simulated Explanations), a novel framework to generate faithful NLEs for any pretrained multimodal classifier. We demonstrate that CAuSE generalizes across datasets and models through extensive empirical evaluations. Theoretically, we show that CAuSE, trained via interchange intervention, forms a causal abstraction of the underlying classifier. We further validate this through a redesigned metric for measuring causal faithfulness in multimodal settings. CAuSE surpasses other methods on this metric, with qualitative analysis reinforcing its advantages. We perform detailed error analysis to pinpoint the failure cases of CAuSE. For replicability, we make the code available at https://github.com/newcodevelop/CAuSE

[29] AquaFusionNet: Lightweight Vision-Sensor Fusion Framework for Real-Time Pathogen Detection and Water Quality Anomaly Prediction on Edge Devices

Sepyan Purnama Kristanto,Lutfi Hakim,Hermansyah

Main category: cs.CL

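TL;DR: This study proposes AquaFusionNet, a lightweight cross-modal framework that fuses microscopic imaging with physicochemical sensor data through a gated cross-attention mechanism for real-time, low-power detection of microbial contamination in small drinking water systems.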

Details Motivation: Existing monitoring tools cannot fully capture the rapidly changing microbial contamination in small drinking water systems in low- and middle-income regions, and operators must interpret microscope images and sensor data separately, which makes decisions unreliable. Method: Proposes AquaFusionNet, which unifies microscope images and physicochemical sensor data through a gated cross-attention mechanism; the model is trained on AquaMicro12K, a new dataset built for drinking water contexts, and deployed on an edge device (Jetson Nano) for field testing. Result: During a six-month deployment at seven facilities in East Java, Indonesia, the system processed 1.84 million frames and reached 94.8% mAP@0.5 and 96.3% anomaly prediction accuracy at 4.8 W, outperforming representative lightweight detectors and remaining more robust under fouling, turbidity spikes, and uneven illumination. Conclusion: Cross-modal fusion makes AquaFusionNet more accurate and robust at contamination detection and suitable for decentralized water safety systems in resource-constrained settings; all models, data, and hardware designs are released openly. Abstract: Evidence from many low and middle income regions shows that microbial contamination in small scale drinking water systems often fluctuates rapidly, yet existing monitoring tools capture only fragments of this behaviour. Microscopic imaging provides organism level visibility, whereas physicochemical sensors reveal short-term changes in water chemistry; in practice, operators must interpret these streams separately, making real-time decision-making unreliable. This study introduces AquaFusionNet, a lightweight cross-modal framework that unifies both information sources inside a single edge deployable model. Unlike prior work that treats microscopic detection and water quality prediction as independent tasks, AquaFusionNet learns the statistical dependencies between microbial appearance and concurrent sensor dynamics through a gated cross-attention mechanism designed specifically for low-power hardware. The framework is trained on AquaMicro12K, a new dataset comprising 12,846 annotated 1000 micrographs curated for drinking water contexts, an area where publicly accessible microscopic datasets are scarce. Deployed for six months across seven facilities in East Java, Indonesia, the system processed 1.84 million frames and consistently detected contamination events with 94.8% mAP@0.5 and 96.3% anomaly prediction accuracy, while operating at 4.8 W on a Jetson Nano. 
Comparative experiments against representative lightweight detectors show that AquaFusionNet provides higher accuracy at comparable or lower power, and field results indicate that cross-modal coupling reduces common failure modes of unimodal detectors, particularly under fouling, turbidity spikes, and inconsistent illumination. All models, data, and hardware designs are released openly to facilitate replication and adaptation in decentralized water safety infrastructures.

[30] Rhea: Role-aware Heuristic Episodic Attention for Conversational LLMs

Wanyang Hong,Zhaoning Zhang,Yi Chen,Libo Zhang,Baihui Liu,Linbo Qiao,Zhiliang Tian,Dongsheng Li

Main category: cs.CL

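TL;DR: Rhea is a role-aware heuristic episodic attention framework that separates instructional memory from episodic memory to mitigate context decay in multi-turn dialogue, markedly improving the accuracy and instruction consistency of large models in long-horizon interactions.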

Details Motivation: Large language models perform well on single-turn tasks, but in multi-turn conversations their performance degrades because of attention pollution, dilution, and drift; this cumulative contextual decay needs to be addressed. Method: Proposes the Rhea framework, which decouples conversation history into two independent memory modules: an Instructional Memory (IM) that persistently stores global constraints through a structural priority mechanism, and an Episodic Memory (EM) that dynamically manages user-model interactions through asymmetric noise control and heuristic context retrieval; at inference time a priority attention mechanism selectively integrates relevant information while always prioritizing global instructions. Result: On several multi-turn conversation benchmarks, including MT-Eval and Long-MT-Bench+, Rhea improves the average score over strong baselines by 1.04 points on a 10-point scale, a 16% relative gain, and maintains near-perfect instruction fidelity (IAR > 8.1) in long-horizon interactions. Conclusion: Rhea offers a principled and effective way to build more precise, instruction-consistent conversational LLMs and mitigates context decay in multi-turn dialogue. Abstract: Large Language Models (LLMs) have achieved remarkable performance on single-turn tasks, yet their effectiveness deteriorates in multi-turn conversations. We define this phenomenon as cumulative contextual decay - a progressive degradation of contextual integrity caused by attention pollution, dilution, and drift. To address this challenge, we propose Rhea (Role-aware Heuristic Episodic Attention), a novel framework that decouples conversation history into two functionally independent memory modules: (1) an Instructional Memory (IM) that persistently stores high-fidelity global constraints via a structural priority mechanism, and (2) an Episodic Memory (EM) that dynamically manages user-model interactions via asymmetric noise control and heuristic context retrieval. During inference, Rhea constructs a high signal-to-noise context by applying its priority attention: selectively integrating relevant episodic information while always prioritizing global instructions. To validate this approach, experiments on multiple multi-turn conversation benchmarks - including MT-Eval and Long-MT-Bench+ - show that Rhea mitigates performance decay and improves overall accuracy by 1.04 points on a 10-point scale (a 16% relative gain over strong baselines). Moreover, Rhea maintains near-perfect instruction fidelity (IAR > 8.1) across long-horizon interactions. These results demonstrate that Rhea provides a principled and effective framework for building more precise, instruction-consistent conversational LLMs.

[31] An Analysis of Large Language Models for Simulating User Responses in Surveys

Ziyun Yu,Yiru Zhou,Chen Zhao,Hongyi Wen

Main category: cs.CL

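TL;DR: This paper studies the limitations of large language models (LLMs) in simulating the views of users from different demographic and cultural backgrounds and proposes CLAIMSIM, a claim diversification method that increases response diversity; experiments show that existing methods still struggle to simulate user responses accurately.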

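Details Motivation: Because LLMs, especially those trained with reinforcement learning from human feedback, lean toward dominant viewpoints, they may not adequately represent users from diverse backgrounds; their ability to simulate human responses to cross-domain survey questions therefore needs to be examined and their diversity and accuracy improved. Method: Tests LLM simulation of human answers to cross-domain survey questions with direct prompting and chain-of-thought prompting, and proposes CLAIMSIM, which draws diverse claims from the LLM's parametric knowledge and supplies them as contextual input to increase response diversity. Result: CLAIMSIM produces more diverse responses, but both approaches perform poorly at simulating users accurately; analysis shows that LLMs tend to hold fixed viewpoints across demographic features, generate single-perspective claims, and struggle to reason over demographic differences when given conflicting claims. Conclusion: Current LLMs have significant limitations in simulating diverse user views, mainly fixed viewpoints and a lack of demographically sensitive reasoning; further model improvements are needed for more accurate user simulation. Abstract: Using Large Language Models (LLMs) to simulate user opinions has received growing attention. Yet LLMs, especially those trained with reinforcement learning from human feedback (RLHF), are known to exhibit biases toward dominant viewpoints, raising concerns about their ability to represent users from diverse demographic and cultural backgrounds. In this work, we examine the extent to which LLMs can simulate human responses to cross-domain survey questions through direct prompting and chain-of-thought prompting. We further propose a claim diversification method CLAIMSIM, which elicits viewpoints from LLM parametric knowledge as contextual input. Experiments on the survey question answering task indicate that, while CLAIMSIM produces more diverse responses, both approaches struggle to accurately simulate users. Further analysis reveals two key limitations: (1) LLMs tend to maintain fixed viewpoints across varying demographic features, and generate single-perspective claims; and (2) when presented with conflicting claims, LLMs struggle to reason over nuanced differences among demographic features, limiting their ability to adapt responses to specific user profiles.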

[32] Automated PRO-CTCAE Symptom Selection based on Prior Adverse Event Profiles

Francois Vandenhende,Anna Georgiou,Michalis Georgiou,Theodoros Psaras,Ellie Karekla

Main category: cs.CL

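TL;DR: Proposes an automated method based on historical safety data and MedDRA semantics for selecting a minimal yet comprehensive subset of PRO-CTCAE symptoms, balancing signal coverage against patient burden.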

Details Motivation: Selecting PRO-CTCAE items by experience tends to include too many or too few, which either burdens patients or misses safety signals; an objective, reproducible optimization method is lacking. Method: Maps PRO-CTCAE symptoms to MedDRA Preferred Terms (PTs) and encodes them in Safeterm, a high-dimensional semantic space; builds a utility function from relevance and incidence, applies spectral analysis to balance diversity and relevance, and finally ranks the symptoms and determines the optimal subset. Result: The method performs well in simulations and oncology case studies, identifying key symptoms, reducing redundancy, and improving patient compliance while preserving the completeness of safety monitoring. Conclusion: The automated tool can optimize PRO-CTCAE design, using semantic information and historical data for more systematic and efficient adverse event monitoring. Abstract: The PRO-CTCAE is an NCI-developed patient-reported outcome system for capturing symptomatic adverse events in oncology trials. It comprises a large library drawn from the CTCAE vocabulary, and item selection for a given trial is typically guided by expected toxicity profiles from prior data. Selecting too many PRO-CTCAE items can burden patients and reduce compliance, while too few may miss important safety signals. We present an automated method to select a minimal yet comprehensive PRO-CTCAE subset based on historical safety data. Each candidate PRO-CTCAE symptom term is first mapped to its corresponding MedDRA Preferred Terms (PTs), which are then encoded into Safeterm, a high-dimensional semantic space capturing clinical and contextual diversity in MedDRA terminology. We score each candidate PRO item for relevance to the historical list of adverse event PTs and combine relevance and incidence into a utility function. Spectral analysis is then applied to the combined utility and diversity matrix to identify an orthogonal set of medical concepts that balances relevance and diversity. Symptoms are rank-ordered by importance, and a cut-off is suggested based on the explained information. The tool is implemented as part of the Safeterm trial-safety app. We evaluate its performance using simulations and oncology case studies in which PRO-CTCAE was employed. This automated approach can streamline PRO-CTCAE design by leveraging MedDRA semantics and historical data, providing an objective and reproducible method to balance signal coverage against patient burden.

[33] Large Language Models and Forensic Linguistics: Navigating Opportunities and Threats in the Age of Generative AI

George Mikros

Main category: cs.CL

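TL;DR: Large language models (LLMs) play a dual role in forensic linguistics: they are powerful analytical tools, yet their style mimicry and generated text challenge traditional assumptions behind authorship identification. Current AI-text detection suffers from high false positive rates and vulnerability to adversarial attacks, so the field needs methodological reconfiguration through human-AI collaboration, explainable detection methods, and validation across diverse populations.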

Details Motivation: As LLMs become better at generating text, their ability to mimic writing styles and produce synthetic content threatens the core assumptions of forensic linguistics, while existing detection techniques fall short on accuracy and fairness, so their scientific validity and legal admissibility need to be reassessed. Method: Reviews recent stylometric research and AI-text detection techniques (classifiers, stylometric analysis, and watermarking), analyzes their false positives for non-native writers and their vulnerability to adversarial strategies such as homoglyph substitution, and assesses admissibility against legal standards such as Daubert and Kumho Tire. Result: LLMs can mimic surface stylistic features but remain distinguishable from human writing; current detection methods, however, have significant limitations, especially in bias and robustness, and struggle to meet legal evidentiary standards. Conclusion: Forensic linguistics must renew its methods to remain scientifically credible and legally admissible; recommended steps are hybrid human-AI workflows, explainable detection models that go beyond binary classification, and error and bias validation across diverse populations, in order to handle authorship attribution under mixed human and machine authorship. Abstract: Large language models (LLMs) present a dual challenge for forensic linguistics. They serve as powerful analytical tools enabling scalable corpus analysis and embedding-based authorship attribution, while simultaneously destabilising foundational assumptions about idiolect through style mimicry, authorship obfuscation, and the proliferation of synthetic texts. Recent stylometric research indicates that LLMs can approximate surface stylistic features yet exhibit detectable differences from human writers, a tension with significant forensic implications. However, current AI-text detection techniques, whether classifier-based, stylometric, or watermarking approaches, face substantial limitations: high false positive rates for non-native English writers and vulnerability to adversarial strategies such as homoglyph substitution. These uncertainties raise concerns under legal admissibility standards, particularly the Daubert and Kumho Tire frameworks. The article concludes that forensic linguistics requires methodological reconfiguration to remain scientifically credible and legally admissible. Proposed adaptations include hybrid human-AI workflows, explainable detection paradigms beyond binary classification, and validation regimes measuring error and bias across diverse populations. The discipline's core insight, i.e., that language reveals information about its producer, remains valid but must accommodate increasingly complex chains of human and machine authorship.

[34] IXAM: Interactive Explainability for Authorship Attribution Models

Milad Alshomary,Anisha Bhatnagar,Peter Zeng,Smaranda Muresan,Owen Rambow,Kathleen McKeown

Main category: cs.CL

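TL;DR: Proposes IXAM, an interactive explainability framework for authorship attribution models that lets users explore the model's embedding space and explain predictions with writing style features at multiple levels of granularity.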

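Details Motivation: Existing authorship attribution models lack interpretability, and users find it hard to see which writing style features a prediction rests on. Method: Designs and implements the IXAM framework, which lets users interactively explore the embedding space of an embedding-based authorship attribution model and build explanations from writing style features at several levels of granularity. Result: A user evaluation shows that the IXAM framework compares favorably with predefined stylistic explanations in explanatory value and user satisfaction. Conclusion: IXAM makes authorship attribution models more interpretable and strengthens users' understanding of and trust in model decisions. Abstract: We present IXAM, an Interactive eXplainability framework for Authorship Attribution Models. Given an authorship attribution (AA) task and an embedding-based AA model, our tool enables users to interactively explore the model's embedding space and construct an explanation of the model's prediction as a set of writing style features at different levels of granularity. Through a user evaluation, we demonstrate the value of our framework compared to predefined stylistic explanations.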

[35] Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation

Ivanhoé Botcazou,Tassadit Amghar,Sylvain Lamprier,Frédéric Saubion

Main category: cs.CL

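TL;DR: This paper proposes Progress Ratio Embeddings (PRE), a new length control method that uses a continuous trigonometric "impatience" signal to control the generation length of neural language models stably, overcoming the out-of-distribution instability of Reverse Positional Embeddings (RPE), which rely on a discrete countdown signal.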

Details Motivation: Existing length-control methods for text generation (such as RPE) become unstable when the target length falls outside the training distribution, which makes precise length control difficult and limits practical use. Method: Proposes Progress Ratio Embeddings (PRE), which replace the discrete remaining-token count with a continuous trigonometric signal based on the progress ratio, embedded into a standard Transformer architecture to regulate generation length smoothly. Result: PRE achieves higher length fidelity on two mainstream news-summarization datasets and generalizes well to unseen target lengths without harming text quality. Conclusion: PRE provides a stable, general, and easy-to-integrate length-control mechanism that clearly outperforms the discrete-signal RPE method and suits applications that need precise control over generation length. Abstract: Modern neural language models achieve high accuracy in text generation, yet precise control over generation length remains underdeveloped. In this paper, we first investigate a recent length control method based on Reverse Positional Embeddings (RPE) and show its limits when control is requested beyond the training distribution. In particular, using a discrete countdown signal tied to the absolute remaining token count leads to instability. To provide robust length control, we introduce Progress Ratio Embeddings (PRE), as continuous embeddings tied to a trigonometric impatience signal. PRE integrates seamlessly into standard Transformer architectures, providing stable length fidelity without degrading text accuracy under standard evaluation metrics. We further show that PRE generalizes well to unseen target lengths. Experiments on two widely used news-summarization benchmarks validate these findings.
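To make the contrast between a discrete countdown and a continuous progress signal concrete, here is a minimal sketch. The abstract does not give the exact form of the PRE signal, so the function name `progress_ratio_features`, the frequency schedule, and the feature dimension are all assumptions made for illustration, not the authors' formulation.

```python
import math
from typing import List


def progress_ratio_features(step: int, target_len: int, dim: int = 8) -> List[float]:
    """Hypothetical trigonometric features of the progress ratio r = step / target_len."""
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    # Clamp the ratio to [0, 1] so overshooting the target stays in range.
    r = min(max(step / target_len, 0.0), 1.0)
    feats: List[float] = []
    for k in range(1, dim // 2 + 1):
        # Assumed frequency schedule: k half-periods over the whole generation.
        feats.append(math.sin(k * math.pi * r))
        feats.append(math.cos(k * math.pi * r))
    return feats


# The signal depends only on relative progress, not on the absolute token count.
halfway_short = progress_ratio_features(5, 10)
halfway_long = progress_ratio_features(50, 100)
```

In this sketch, step 5 of 10 and step 50 of 100 map to the same features, which is one plausible reason a ratio-based signal could transfer to target lengths not seen in training, whereas an absolute remaining-token count would not.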

[36] Prompting-in-a-Series: Psychology-Informed Contents and Embeddings for Personality Recognition With Decoder-Only Models

Jing Jie Tan,Ban-Hoe Kwan,Danny Wee-Kiat Ng,Yan-Chai Hum,Anissa Mokraoui,Shih-Yu Lo

Main category: cs.CL

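TL;DR: Proposes PICEPR, a "Prompting-in-a-Series" algorithm that improves personality recognition with a modularised decoder-only LLM, combining content generation with embeddings and clearly outperforming existing methods.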

Details Motivation: Explores how to use the generative abilities of LLMs to strengthen personality recognition, addressing the limitations of traditional methods in feature extraction and personalized content generation. Method: Designs the dual-pipeline PICEPR algorithm: (a) a Contents pipeline that generates personality-related text and (b) an Embeddings pipeline that extracts personality features; experiments are run on both open-source and closed-source LLMs. Result: PICEPR improves on existing methods by 5-15% on personality recognition, validating its effectiveness across multiple models. Conclusion: Through psychology-informed prompting, PICEPR turns the generative abilities of LLMs into an advantage for personality recognition and offers a new approach to personality analysis. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language processing tasks. This research introduces a novel "Prompting-in-a-Series" algorithm, termed PICEPR (Psychology-Informed Contents Embeddings for Personality Recognition), featuring two pipelines: (a) Contents and (b) Embeddings. The approach demonstrates how a modularised decoder-only LLM can summarize or generate content, which can aid in classifying or enhancing personality recognition functions as a personality feature extractor and a generator for personality-rich content. We conducted various experiments to provide evidence to justify the rationale behind the PICEPR algorithm. Meanwhile, we also explored closed-source models such as gpt4o from OpenAI and gemini from Google, along with open-source models like mistral from Mistral AI, to compare the quality of the generated content. The PICEPR algorithm has achieved a new state-of-the-art performance for personality recognition by 5-15% improvement. The work repository and models' weight can be found at https://research.jingjietan.com/?q=PICEPR.

[37] FVA-RAG: Falsification-Verification Alignment for Mitigating Sycophantic Hallucinations

Mayank Ravishankara

Main category: cs.CL

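TL;DR: Proposes FVA-RAG, a framework that adds a falsification-verification mechanism to counter retrieval sycophancy in RAG systems, making LLMs more robust at producing factual answers when a query rests on a false premise.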

Details Motivation: When a user's question contains a false premise, standard RAG systems tend to retrieve information that supports the error, leading the model to "hallucinate with citations"; countering this retrieval sycophancy, which stems from confirmation bias, requires a more reliable verification mechanism. Method: Proposes the FVA-RAG framework, which uses deductive falsification rather than inductive verification, designs an adversarial retrieval policy that generates "Kill Queries" to look for counter-evidence, and introduces a dual-verification mechanism that weighs the draft answer against the conflicting "Anti-Context". Result: Preliminary experiments on a dataset of common misconceptions show that FVA-RAG significantly reduces sycophantic hallucinations and improves factual accuracy compared with standard RAG. Conclusion: By actively searching for counterexamples at inference time, FVA-RAG makes RAG systems more robust and truthful, in effect adding a real-time "Red Team" review to the generation process. Abstract: Retrieval-Augmented Generation (RAG) systems have significantly reduced hallucinations in Large Language Models (LLMs) by grounding responses in external context. However, standard RAG architectures suffer from a critical vulnerability: Retrieval Sycophancy. When presented with a query based on a false premise or a common misconception, vector-based retrievers tend to fetch documents that align with the user's bias rather than objective truth, leading the model to "hallucinate with citations." In this work, we introduce Falsification-Verification Alignment RAG (FVA-RAG), a framework that shifts the retrieval paradigm from Inductive Verification (seeking support) to Deductive Falsification (seeking disproof). Unlike existing "Self-Correction" methods that rely on internal consistency, FVA-RAG deploys a distinct Adversarial Retrieval Policy that actively generates "Kill Queries": targeted search terms designed to surface contradictory evidence. We introduce a dual-verification mechanism that explicitly weighs the draft answer against this "Anti-Context." Preliminary experiments on a dataset of common misconceptions demonstrate that FVA-RAG significantly improves robustness against sycophantic hallucinations compared to standard RAG baselines, effectively acting as an inference-time "Red Team" for factual generation.

[38] Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models

Richard Young

Main category: cs.CL

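TL;DR: Uses the TEMPEST multi-turn attack framework to evaluate the safety of 10 frontier LLMs across more than 97,000 API queries. Most models prove highly vulnerable to multi-turn adversarial attacks, with attack success rates (ASR) of 96%-100%, while enabling extended reasoning reduced ASR from 97% to 42%, suggesting that reasoning mode is an effective route to better safety.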

Details Motivation: The safety of large language models under sophisticated multi-turn adversarial attacks has not been systematically evaluated, and the effect of model scale and inference mode on robustness remains unclear. Method: Applies the TEMPEST multi-turn attack framework to 10 frontier models from 8 vendors, testing their vulnerability across 1,000 harmful behaviors, generating more than 97,000 API queries, and evaluating the results automatically with independent safety classifiers. Result: Six models reached 96%-100% ASR, while four showed some resistance (ASR 42%-78%); enabling extended reasoning on the same architecture reduced ASR from 97% to 42%; model scale showed no significant relationship with robustness. Conclusion: Current alignment techniques are broadly vulnerable to adaptive multi-turn attacks, model scale does not guarantee safety, and deliberative reasoning modes can serve as an effective deployment-level safety enhancement. Abstract: Despite substantial investment in safety alignment, the vulnerability of large language models to sophisticated multi-turn adversarial attacks remains poorly characterized, and whether model scale or inference mode affects robustness is unknown. This study employed the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors, generating over 97,000 API queries across adversarial conversations with automated evaluation by independent safety classifiers. Results demonstrated a spectrum of vulnerability: six models achieved 96% to 100% attack success rate (ASR), while four showed meaningful resistance, with ASR ranging from 42% to 78%; enabling extended reasoning on identical architecture reduced ASR from 97% to 42%. These findings indicate that safety alignment quality varies substantially across vendors, that model scale does not predict adversarial robustness, and that thinking mode provides a deployable safety enhancement. Collectively, this work establishes that current alignment techniques remain fundamentally vulnerable to adaptive multi-turn attacks regardless of model scale, while identifying deliberative inference as a promising defense direction.

[39] SETUP: Sentence-level English-To-Uniform Meaning Representation Parser

Emma Markle,Javier Gutierrez Bach,Shira Wein

Main category: cs.CL

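TL;DR: Proposes two methods for English text-to-UMR parsing; the best-performing model, SETUP, achieves substantial gains in AnCast and SMATCH++ scores, advancing automatic UMR parsing.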

Details Motivation: To realize UMR's potential for language documentation, improving low-resource language technologies, and interpretability, text-to-UMR parsers are needed that can automatically produce accurate UMR graphs at scale. Method: Proposes two methods for English text-to-UMR parsing: one fine-tunes existing Abstract Meaning Representation parsers, and the other leverages a converter from Universal Dependencies, with prior work as the baseline. Result: The best model, SETUP, reaches an AnCast score of 84 and a SMATCH++ score of 91, showing substantial progress in automatic UMR parsing. Conclusion: SETUP performs strongly on automatic UMR parsing and lays the groundwork for future use in multilingual and low-resource settings. Abstract: Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world's languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one of which fine-tunes existing parsers for Abstract Meaning Representation and the other, which leverages a converter from Universal Dependencies, using prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.

[40] Do Large Language Models Truly Understand Cross-cultural Differences?

Shiwei Guo,Sihang Jiang,Qianxi He,Yanghua Xiao,Jiaqing Liang,Bi Yude,Minggui He,Shimin Tao,Li Zhang

Main category: cs.CL

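TL;DR: Proposes SAGE, a scenario-based benchmark for cross-cultural understanding that uses cross-cultural core concept alignment and generative task design to systematically evaluate LLMs' cross-cultural understanding and reasoning across nine dimensions.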

Details Motivation: Existing benchmarks for evaluating LLMs' cross-cultural understanding have three problems: a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited testing of deep cultural reasoning. Method: Builds a nine-dimension framework grounded in cultural theory, curates 210 core concepts, and designs 4,530 test items across 15 real-world scenarios grouped into four broad categories of cross-cultural situations, using generative task design and cross-validation. Result: The SAGE dataset is readily extensible and transfers to other languages; experiments reveal systematic cross-cultural reasoning deficits in current LLMs across dimensions and scenarios. Conclusion: Although LLMs perform well on multilingual tasks, they remain some distance from truly nuanced cross-cultural understanding, and their cultural sensitivity and deep reasoning need further improvement. Abstract: In recent years, large language models (LLMs) have demonstrated strong performance on multilingual tasks. Given its wide range of applications, cross-cultural understanding capability is a crucial competency. However, existing benchmarks for evaluating whether LLMs genuinely possess this capability suffer from three key limitations: a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited deep cultural reasoning capabilities. To address these gaps, we propose SAGE, a scenario-based benchmark built via cross-cultural core concept alignment and generative task design, to evaluate LLMs' cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. Using this framework, we curated 210 core concepts and constructed 4530 test items across 15 specific real-world scenarios, organized under four broader categories of cross-cultural situations, following established item design principles. The SAGE dataset supports continuous expansion, and experiments confirm its transferability to other languages. It reveals model weaknesses across both dimensions and scenarios, exposing systematic limitations in cross-cultural reasoning. While progress has been made, LLMs are still some distance away from reaching a truly nuanced cross-cultural understanding. In compliance with the anonymity policy, we include data and code in the supplement materials. In future versions, we will make them publicly available online.

[41] Leveraging KV Similarity for Online Structured Pruning in LLMs

Jungmin Lee,Gwangeun Byeon,Yulhwa Kim,Seokin Hong

Main category: cs.CL

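TL;DR: Proposes Token Filtering, a lightweight online structured pruning technique for accelerating LLM inference. It measures token redundancy with joint key-value similarity and skips redundant attention computations directly during inference, needs no offline calibration data, and adds a variance-aware fusion strategy for stability.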

Details Motivation: Existing pruning methods rely on offline calibration data that generalizes poorly, so performance is unstable across inputs; a more stable online pruning method that needs no calibration data is required. Method: Proposes Token Filtering, which judges token redundancy from the joint similarity of keys and values and dynamically skips redundant attention computations at inference time; a variance-aware fusion strategy adaptively weights similarity scores across attention heads, improving stability without extra memory overhead. Result: Experiments on LLaMA-2 (7B/13B), LLaMA-3 (8B), and Mistral (7B) show that Token Filtering keeps strong performance even at a 50% pruning ratio and outperforms prior structured pruning methods on challenging tasks such as MMLU. Conclusion: Token Filtering is an efficient, stable online structured pruning method that substantially lowers inference cost while preserving model performance, making it suitable for efficient inference with large language models. Abstract: Pruning has emerged as a promising direction for accelerating large language model (LLM) inference, yet existing approaches often suffer from instability because they rely on offline calibration data that may not generalize across inputs. In this work, we introduce Token Filtering, a lightweight online structured pruning technique that makes pruning decisions directly during inference without any calibration data. The key idea is to measure token redundancy via joint key-value similarity and skip redundant attention computations, thereby reducing inference cost while preserving critical information. To further enhance stability, we design a variance-aware fusion strategy that adaptively weights key and value similarity across heads, ensuring that informative tokens are retained even under high pruning ratios. This design introduces no additional memory overhead and provides a more reliable criterion for token importance. Extensive experiments on LLaMA-2 (7B/13B), LLaMA-3 (8B), and Mistral (7B) demonstrate that Token Filtering consistently outperforms prior structured pruning methods, preserving accuracy on commonsense reasoning benchmarks and maintaining strong performance on challenging tasks such as MMLU, even with 50% pruning.
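The idea of scoring redundancy with joint key-value similarity can be sketched roughly as follows. This is an illustrative simplification, not the paper's algorithm: the reference token a candidate is compared against, the fixed mixing weight `alpha` (used here in place of the variance-aware fusion, which the abstract does not specify), and the `threshold` are all assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def joint_kv_similarity(keys_a: List[Vector], values_a: List[Vector],
                        keys_b: List[Vector], values_b: List[Vector],
                        alpha: float = 0.5) -> float:
    """Mix the per-head mean key similarity and mean value similarity.

    Each argument holds one vector per attention head. `alpha` is a fixed,
    assumed weight standing in for the adaptive fusion.
    """
    key_sims = [cosine(ka, kb) for ka, kb in zip(keys_a, keys_b)]
    val_sims = [cosine(va, vb) for va, vb in zip(values_a, values_b)]
    key_mean = sum(key_sims) / len(key_sims)
    val_mean = sum(val_sims) / len(val_sims)
    return alpha * key_mean + (1.0 - alpha) * val_mean


def is_redundant(keys_a: List[Vector], values_a: List[Vector],
                 keys_b: List[Vector], values_b: List[Vector],
                 threshold: float = 0.9) -> bool:
    """Flag a token as skippable when it closely matches the reference token."""
    return joint_kv_similarity(keys_a, values_a, keys_b, values_b) >= threshold
```

Requiring both keys and values to be similar is the point of the joint criterion: two tokens with near-identical keys but different values would still contribute different information to the attention output.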

[42] DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning

Nithin Sivakumaran,Justin Chih-Yao Chen,David Wan,Yue Zhang,Jaehong Yoon,Elias Stengel-Eskin,Mohit Bansal

Main category: cs.CL

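TL;DR: Proposes DART, a visual tool selection method built on a multi-agent debate framework, which uses disagreement between agents to automatically identify and call suitable expert visual tools (such as object detection and OCR) to improve visual question answering.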

Details Motivation: In complex visual tasks, choosing and calling the right visual tools is challenging. Existing methods struggle to decide dynamically when to call which tool, which limits the performance of multimodal large models. Method: Proposes the DART framework: multiple visual agents debate, and their disagreements trigger calls to specific expert tools; tool outputs are introduced as new information, along with tool-aligned agreement scores that facilitate discussion; an aggregator agent combines the agent outputs and tool information to make the final decision. Result: DART is validated on four benchmarks, beating the strongest baseline (multi-agent debate with a judge model) by 3.4% on A-OKVQA and 2.4% on MMMU; on the M3D medical dataset it exceeds other strong baselines by 1.3%; a text-overlap analysis shows that DART produces richer discussion; a study of the tool call distribution shows that DART reliably uses diverse tools to resolve disagreement. Conclusion: By using inter-agent disagreement to drive tool calls, DART achieves more effective multi-agent collaboration, significantly improves performance on vision-language tasks, and adapts well to new tools. Abstract: Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge, etc.), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning, etc.) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. We utilize an aggregator agent to select the best answer by providing the agent outputs and tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.

[43] GUMBridge: a Corpus for Varieties of Bridging Anaphora

Lauren Levine,Amir Zeldes

Main category: cs.CL

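TL;DR: Introduces GUMBridge, a new resource for bridging anaphora in English that covers 16 genres and provides fine-grained subtype annotations. The work also evaluates annotation quality and runs baseline experiments with current open and closed source LLMs on three related tasks; the results show that bridging resolution and subtype classification remain challenging NLP tasks in the age of LLMs.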

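Details Motivation: Existing English resources for bridging anaphora are small, limited in coverage, and lacking in genre diversity, which makes it hard to study this complex phenomenon in depth. A more comprehensive and more finely annotated resource is needed to advance the field. Method: Builds a new corpus, GUMBridge, covering 16 genres of English text with detailed subtype annotations for bridging; also evaluates the consistency of the human annotations and runs baseline experiments with current mainstream LLMs on three tasks: bridging resolution, antecedent identification, and subtype classification. Result: GUMBridge substantially broadens coverage of bridging phenomena and refines annotation granularity; the annotation quality evaluation shows high agreement; experiments show that current LLMs still perform poorly on the three tasks, indicating that they remain challenging. Conclusion: Despite the progress of LLMs in natural language processing, bridging anaphora and its subtype classification remain difficult tasks; GUMBridge provides high-quality, diverse data for future research. Abstract: Bridging is an anaphoric phenomenon where the referent of an entity in a discourse is dependent on a previous, non-identical entity for interpretation, such as in "There is 'a house'. 'The door' is red," where the door is specifically understood to be the door of the aforementioned house. While there are several existing resources in English for bridging anaphora, most are small, provide limited coverage of the phenomenon, and/or provide limited genre coverage. In this paper, we introduce GUMBridge, a new resource for bridging, which includes 16 diverse genres of English, providing both broad coverage for the phenomenon and granular annotations for the subtype categorization of bridging varieties. We also present an evaluation of annotation quality and report on baseline performance using open and closed source contemporary LLMs on three tasks underlying our data, showing that bridging resolution and subtype classification remain difficult NLP tasks in the age of LLMs.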

[44] MASim: Multilingual Agent-Based Simulation for Social Science

Xuan Zhang,Wenxuan Zhang,Anxu Wang,See-Kiong Ng,Yang Deng

Main category: cs.CL

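TL;DR: MASim is a multilingual agent-based simulation framework that supports multi-turn interaction among generative agents with diverse sociolinguistic profiles, for studying social behavior in cross-lingual interaction.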

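Details Motivation: Existing multi-agent role-playing simulations are mostly limited to a single language and cannot model the cross-lingual interaction that matters in real societies, which limits their breadth and realism in computational social science. Method: Proposes the MASim framework, which combines generative agents with multilingual interaction mechanisms, and builds the MAPS benchmark, containing survey questions and demographic personas drawn from global population distributions, to support multi-turn, multilingual simulation of social behavior. Result: Experiments on calibration, sensitivity, consistency, and cultural case studies show that MASim reproduces sociocultural phenomena and reveal the importance of multilingual simulation for analyzing opinion evolution, media influence, and information diffusion. Conclusion: MASim is the first multilingual agent-based simulation framework that supports the study of cross-lingual social behavior; it offers a new tool for scalable, controlled computational social science and underscores the key role of multilingual modeling in social simulation. Abstract: Multi-agent role-playing has recently shown promise for studying social behavior with language agents, but existing simulations are mostly monolingual and fail to model cross-lingual interaction, an essential property of real societies. We introduce MASim, the first multilingual agent-based simulation framework that supports multi-turn interaction among generative agents with diverse sociolinguistic profiles. MASim offers two key analyses: (i) global public opinion modeling, by simulating how attitudes toward open-domain hypotheses evolve across languages and cultures, and (ii) media influence and information diffusion, via autonomous news agents that dynamically generate content and shape user behavior. To instantiate simulations, we construct the MAPS benchmark, which combines survey questions and demographic personas drawn from global population distributions. Experiments on calibration, sensitivity, consistency, and cultural case studies show that MASim reproduces sociocultural phenomena and highlights the importance of multilingual simulation for scalable, controlled computational social science.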

[45] NeSTR: A Neuro-Symbolic Abductive Framework for Temporal Reasoning in Large Language Models

Feng Liang,Weixin Zeng,Runhao Zhao,Xiang Zhao

Main category: cs.CL

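TL;DR: Proposes the Neuro-Symbolic Temporal Reasoning (NeSTR) framework, which combines symbolic representations with hybrid reflective reasoning to improve LLMs' temporal reasoning under complex temporal constraints.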

Details Motivation: LLMs perform well on natural language processing tasks, but temporal reasoning under complex temporal constraints remains challenging. Existing symbolic methods underuse the reasoning abilities of LLMs, while reflective mechanisms lack structured temporal representations, leading to inconsistent or hallucinated reasoning. Method: Proposes the NeSTR framework, which preserves explicit temporal relations through symbolic encoding, ensures logical consistency with a verification mechanism, and corrects flawed reasoning with abductive reflection, integrating neural and symbolic methods. Result: Experiments on multiple temporal question answering benchmarks show that NeSTR significantly outperforms existing methods in the zero-shot setting and consistently improves temporal reasoning without fine-tuning. Conclusion: By integrating symbolic structure with reflective reasoning, NeSTR effectively strengthens the temporal sensitivity of LLMs and demonstrates the advantage of neuro-symbolic methods for temporal understanding. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, temporal reasoning, particularly under complex temporal constraints, remains a major challenge. To this end, existing approaches have explored symbolic methods, which encode temporal structure explicitly, and reflective mechanisms, which revise reasoning errors through multi-step inference. Nonetheless, symbolic approaches often underutilize the reasoning capabilities of LLMs, while reflective methods typically lack structured temporal representations, which can result in inconsistent or hallucinated reasoning. As a result, even when the correct temporal context is available, LLMs may still misinterpret or misapply time-related information, leading to incomplete or inaccurate answers. To address these limitations, in this work, we propose Neuro-Symbolic Temporal Reasoning (NeSTR), a novel framework that integrates structured symbolic representations with hybrid reflective reasoning to enhance the temporal sensitivity of LLM inference. NeSTR preserves explicit temporal relations through symbolic encoding, enforces logical consistency via verification, and corrects flawed inferences using abductive reflection. Extensive experiments on diverse temporal question answering benchmarks demonstrate that NeSTR achieves superior zero-shot performance and consistently improves temporal reasoning without any fine-tuning, showcasing the advantage of neuro-symbolic integration in enhancing temporal understanding in large language models.

[46] Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection

Mengqi Wang,Jianwei Wang,Qing Liu,Xiwei Xu,Zhenchang Xing,Liming Zhu,Wenjie Zhang

Main category: cs.CL

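TL;DR: Proposes the TreeED and ForestED frameworks, which use large language models to induce decision trees for error detection, improving explainability and robustness.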

Details Motivation: Existing LLM-based error detection methods lack explainability and robustness because they rely on black-box judgments and are sensitive to prompts. Method: Proposes an LLM-as-an-inducer framework: TreeED has the LLM generate, from the data context, a decision tree made of rule nodes, GNN nodes, and leaf nodes; ForestED builds multiple subsets through uncertainty-based sampling, generates a decision tree for each, and uses an EM algorithm to estimate tree reliability and optimize the consensus prediction. Result: Experiments show that the method performs well in accuracy, explainability, and robustness, with an average F1-score 16.1% higher than the best baseline. Conclusion: TreeED and ForestED effectively improve the explainability and robustness of error detection in tabular data while significantly improving detection performance. Abstract: Error detection (ED), which aims to identify incorrect or inconsistent cell values in tabular data, is important for ensuring data quality. Recent state-of-the-art ED methods leverage the pre-trained knowledge and semantic capability embedded in large language models (LLMs) to directly label whether a cell is erroneous. However, this LLM-as-a-labeler pipeline (1) relies on a black-box, implicit decision process, thus failing to provide explainability for the detection results, and (2) is highly sensitive to prompts, yielding inconsistent outputs due to inherent model stochasticity, therefore lacking robustness. To address these limitations, we propose an LLM-as-an-inducer framework that uses an LLM to induce the decision tree for ED (termed TreeED) and further ensembles multiple such trees for consensus detection (termed ForestED), thereby improving explainability and robustness. Specifically, based on prompts derived from data context, decision tree specifications and output requirements, TreeED queries the LLM to induce the decision tree skeleton, whose root-to-leaf decision paths specify the stepwise procedure for evaluating a given sample. Each tree contains three types of nodes: (1) rule nodes that perform simple validation checks (e.g., format or range), (2) Graph Neural Network (GNN) nodes that capture complex patterns (e.g., functional dependencies), and (3) leaf nodes that output the final decision types (error or clean). Furthermore, ForestED employs uncertainty-based sampling to obtain multiple row subsets, constructing a decision tree for each subset using TreeED. It then leverages an Expectation-Maximization-based algorithm that jointly estimates tree reliability and optimizes the consensus ED prediction. Extensive experiments demonstrate that our methods are accurate, explainable and robust, achieving an average F1-score improvement of 16.1% over the best baseline.
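The joint estimation of tree reliability and consensus labels can be illustrated with a standard one-coin, Dawid-Skene-style EM loop. The abstract does not spell out ForestED's algorithm, so this is a swapped-in textbook technique under stated assumptions: binary labels, one reliability value per tree, and a hypothetical function name `em_consensus`.

```python
from typing import List, Tuple


def em_consensus(votes: List[List[int]], iters: int = 20) -> Tuple[List[float], List[float]]:
    """votes[t][i] is tree t's label for cell i (1 = error, 0 = clean).

    Returns (posterior probability that each cell is an error,
             estimated reliability of each tree).
    """
    n_trees, n_items = len(votes), len(votes[0])
    eps = 1e-3  # keeps probabilities away from exactly 0 and 1
    # Initialise the posterior with the fraction of trees voting "error".
    post = [sum(votes[t][i] for t in range(n_trees)) / n_trees for i in range(n_items)]
    rel = [0.5] * n_trees
    for _ in range(iters):
        # M-step: reliability = expected agreement with the current posterior.
        rel = []
        for t in range(n_trees):
            agree = sum(post[i] if votes[t][i] == 1 else 1.0 - post[i]
                        for i in range(n_items))
            rel.append(min(max(agree / n_items, eps), 1.0 - eps))
        prior = min(max(sum(post) / n_items, eps), 1.0 - eps)
        # E-step: recompute each cell's posterior from reliability-weighted votes.
        post = []
        for i in range(n_items):
            p1, p0 = prior, 1.0 - prior
            for t in range(n_trees):
                if votes[t][i] == 1:
                    p1 *= rel[t]
                    p0 *= 1.0 - rel[t]
                else:
                    p1 *= 1.0 - rel[t]
                    p0 *= rel[t]
            post.append(p1 / (p1 + p0))
    return post, rel


# Two trees that agree with each other and one that often disagrees.
votes = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
]
posterior, reliability = em_consensus(votes)
```

Compared with a plain majority vote, the reliability estimates let consistently accurate trees count for more, which is the intuition behind weighting trees rather than treating them equally.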

[47] TeluguST-46: A Benchmark Corpus and Comprehensive Evaluation for Telugu-English Speech Translation

Bhavana Akkiraju,Srihari Bandarupalli,Swathi Sambangi,Vasavi Ravuri,R Vijaya Saraswathi,Anil Kumar Vuppala

Main category: cs.CL

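TL;DR: Builds a high-quality Telugu-English speech translation benchmark and compares cascaded and end-to-end systems, finding that fine-tuned end-to-end models can rival cascaded approaches in low-resource conditions; also evaluates the reliability of several automatic metrics for Telugu-English translation.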

Details Motivation: Telugu, with more than 80 million speakers, remains severely understudied in speech translation research, particularly given its morphological complexity. The authors aim to fill this gap and advance speech translation for low-resource languages. Method: Builds a Telugu-English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split); systematically compares cascaded (IndicWhisper + IndicMT) and end-to-end (fine-tuned SeamlessM4T) architectures; and assesses the reliability of several automatic metrics (BLEU, METEOR, ChrF++, and others) against human judgments. Result: IndicWhisper + IndicMT performs best thanks to extensive Telugu-specific data, but fine-tuned SeamlessM4T models are competitive despite using less such data; the study suggests that with appropriate tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can match cascaded approaches in low-resource settings; traditional automatic metrics discriminate quality better than BERTScore for this language pair. Conclusion: The study provides a reproducible Telugu-English speech translation benchmark, shows empirically that end-to-end models can compete with cascaded approaches in low-resource settings, and offers practical guidance for automatic evaluation of morphologically complex language pairs. Abstract: Despite Telugu being spoken by over 80 million people, speech translation research for this morphologically rich language remains severely underexplored. We address this gap by developing a high-quality Telugu-English speech translation benchmark from 46 hours of manually verified CSTD corpus data (30h/8h/8h train/dev/test split). Our systematic comparison of cascaded versus end-to-end architectures shows that while IndicWhisper + IndicMT achieves the highest performance due to extensive Telugu-specific training data, finetuned SeamlessM4T models demonstrate remarkable competitiveness despite using significantly less Telugu-specific training data. This finding suggests that with careful hyperparameter tuning and sufficient parallel data (potentially less than 100 hours), end-to-end systems can achieve performance comparable to cascaded approaches in low-resource settings. Our metric reliability study evaluating BLEU, METEOR, ChrF++, ROUGE-L, TER, and BERTScore against human judgments reveals that traditional metrics provide better quality discrimination than BERTScore for Telugu-English translation. The work delivers three key contributions: a reproducible Telugu-English benchmark, empirical evidence of competitive end-to-end performance potential in low-resource scenarios, and practical guidance for automatic evaluation in morphologically complex language pairs.

[48] Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data

Srihari Bandarupalli,Bhavana Akkiraju,Charan Devarakonda,Vamsiraghusimha Narsinga,Anil Kumar Vuppala

Main category: cs.CL

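TL;DR: Studies cross-lingual continual pretraining for low-resource languages, using unlabeled speech data and morphologically-aware tokenization to reach speech recognition performance comparable to or better than large models, with fewer parameters and less labeled data.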

Details Motivation: Speech recognition for low-resource languages is limited by the scarcity of labeled data and computational resources, and existing large models are hard to apply. Method: Builds a 3,000-hour multilingual corpus and trains a 300M-parameter model with targeted continual pretraining and morphologically-aware tokenization. Result: The model outperforms Whisper Large v3 on Persian and is competitive on Arabic and Urdu, while using fewer parameters and less labeled data. Conclusion: For low-resource scenarios, data relevance and pretraining strategy matter more than model scale, which offers a viable path to inclusive speech technology without large-scale computational resources. Abstract: Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.

[49] Investigating Training and Generalization in Faithful Self-Explanations of Large Language Models

Tomoki Doi,Masaru Isonuma,Hitomi Yanaka

Main category: cs.CL

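TL;DR: Investigates how training can improve the faithfulness of large language models' self-explanations, and verifies how well the improvement generalizes across classification tasks and explanation styles.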

Details Motivation: Prior work shows that self-explanations generated by large language models often fail to reflect the models' actual behavior faithfully, but how to improve faithfulness remains underexplored, and it is unclear whether improvements carry over between explanation styles. Method: Uses a feature attribution method to construct one-word constrained explanations that are likely to be faithful, then uses these pseudo-faithful explanations for continual learning on instruction-tuned models, with experiments covering three classification tasks and three explanation styles. Result: Training significantly improves self-explanation faithfulness across all tasks and styles; the improvement generalizes to multi-word settings and unseen tasks, with consistent cross-style generalization. Conclusion: Targeted training can effectively improve the faithfulness of LLM self-explanations, and the ability can generalize across tasks and styles, suggesting that training helps develop a more general capacity for faithful explanation. Abstract: Large language models have the potential to generate explanations for their own predictions in a variety of styles based on user instructions. Recent research has examined whether these self-explanations faithfully reflect the models' actual behavior and has found that they often lack faithfulness. However, the question of how to improve faithfulness remains underexplored. Moreover, because different explanation styles have superficially distinct characteristics, it is unclear whether improvements observed in one style also arise when using other styles. This study analyzes the effects of training for faithful self-explanations and the extent to which these effects generalize, using three classification tasks and three explanation styles. We construct one-word constrained explanations that are likely to be faithful using a feature attribution method, and use these pseudo-faithful self-explanations for continual learning on instruction-tuned models. Our experiments demonstrate that training can improve self-explanation faithfulness across all classification tasks and explanation styles, and that these improvements also show signs of generalization to the multi-word settings and to unseen tasks. Furthermore, we find consistent cross-style generalization among three styles, suggesting that training may contribute to a broader improvement in faithful self-explanation ability.

[50] Multilingual corpora for the study of new concepts in the social sciences and humanities:

Revekka Kyriakoglou,Anna Pappa

Main category: cs.CL

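TL;DR: Proposes a hybrid method for building a multilingual corpus to study emerging concepts in the humanities and social sciences (using "non-technological innovation" as the example), combining company website text with annual reports and running them through an automated processing pipeline to produce an annotated dataset usable for machine learning.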

Details Motivation: Supporting research on emerging concepts in the humanities and social sciences requires high-quality, extensible multilingual corpora that can handle evolving terminology and the complexity of cross-lingual expression. Method: Combines two complementary data sources: English and French text automatically extracted from company websites and cleaned, and annual reports collected and filtered by set criteria; builds the initial corpus through language detection, content filtering, segment extraction, and metadata annotation, then extracts English context blocks from it and annotates them with thematic categories to form a machine learning dataset suitable for supervised classification. Result: Produces a reusable, extensible multilingual corpus and a structured derived English dataset in which a five-sentence context is extracted around each target term and annotated with a thematic category, suitable for natural language processing tasks. Conclusion: The method provides a reliable, extensible resource for studying emerging concepts, supporting both the analysis of lexical variation and NLP applications such as supervised learning, with good potential for wider use. Abstract: This article presents a hybrid methodology for building a multilingual corpus designed to support the study of emerging concepts in the humanities and social sciences (HSS), illustrated here through the case of "non-technological innovation". The corpus relies on two complementary sources: (1) textual content automatically extracted from company websites, cleaned for French and English, and (2) annual reports collected and automatically filtered according to documentary criteria (year, format, duplication). The processing pipeline includes automatic language detection, filtering of non-relevant content, extraction of relevant segments, and enrichment with structural metadata. From this initial corpus, a derived dataset in English is created for machine learning purposes. For each occurrence of a term from the expert lexicon, a contextual block of five sentences is extracted (two preceding and two following the sentence containing the term). Each occurrence is annotated with the thematic category associated with the term, enabling the construction of data suitable for supervised classification tasks. This approach results in a reproducible and extensible resource, suitable both for analyzing lexical variability around emerging concepts and for generating datasets dedicated to natural language processing applications.
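The five-sentence context block described in the abstract (the sentence containing the term plus two before and two after) can be sketched as below. The regex sentence splitter and the function names are illustrative assumptions; the abstract does not say which tools the authors used, and a real pipeline would likely rely on a proper sentence segmenter.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_blocks(text: str, term: str, window: int = 2) -> List[str]:
    """Return one block per sentence that contains `term` (case-insensitive).

    Each block holds the matching sentence plus up to `window` sentences on
    each side, so window=2 gives at most five sentences. Blocks near the start
    or end of the text are shorter. A sentence containing the term several
    times yields a single block in this sketch.
    """
    sentences = split_sentences(text)
    blocks = []
    for i, sentence in enumerate(sentences):
        if term.lower() in sentence.lower():
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            blocks.append(" ".join(sentences[start:end]))
    return blocks


sample = "A one. B two. C three. We study innovation here. D four. E five. F six."
blocks = context_blocks(sample, "innovation")
```

Each block would then be paired with the thematic category of its lexicon term to form one labeled example for supervised classification.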

[51] Training Language Models to Use Prolog as a Tool

Niklas Mellgren,Peter Schneider-Kamp,Lukas Galke Poech

Main category: cs.CL

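TL;DR: Studies fine-tuning language models with a reinforcement learning method (GRPO) to use Prolog as an external tool, with the aim of making tool use in AI systems more reliable and verifiable. The method significantly outperforms supervised fine-tuning on multiple benchmarks and shows good zero-shot generalization.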

Details Motivation: Language models often produce plausible but incorrect reasoning that is hard to verify, which undermines the safety and reliability of agentic AI systems. A method for verifiable computation is needed to strengthen the correctness of model reasoning. Method: Uses Group Relative Policy Optimization (GRPO) to fine-tune Qwen2.5-3B-Instruct on a cleaned GSM8K-Prolog-Prover dataset, and systematically evaluates the effect of prompt structure, reward composition (execution, syntax, semantics, structure), and inference protocol (single-shot, best-of-N, and two agentic modes). Result: Best-of-N with external Prolog verification achieves the highest accuracy on GSM8K; agentic inference with internal repair yields better zero-shot transfer on MMLU-Stem and MMLU-Pro; the 3B model's zero-shot MMLU performance is comparable to 7B few-shot results. Conclusion: Grounding model reasoning in formal verification systems such as Prolog can substantially improve the reliability and auditability of AI systems, which matters for safety-critical applications. Abstract: Ensuring reliable tool use is critical for safe agentic AI systems. Language models frequently produce unreliable reasoning with plausible but incorrect solutions that are difficult to verify. To address this, we investigate fine-tuning models to use Prolog as an external tool for verifiable computation. Using Group Relative Policy Optimization (GRPO), we fine-tune Qwen2.5-3B-Instruct on a cleaned GSM8K-Prolog-Prover dataset while varying (i) prompt structure, (ii) reward composition (execution, syntax, semantics, structure), and (iii) inference protocol: single-shot, best-of-N, and two agentic modes where Prolog is invoked internally or independently. Our reinforcement learning approach outperforms supervised fine-tuning, with our 3B model achieving zero-shot MMLU performance comparable to 7B few-shot results. Our findings reveal that: 1) joint tuning of prompt, reward, and inference shapes program syntax and logic; 2) best-of-N with external Prolog verification maximizes accuracy on GSM8K; 3) agentic inference with internal repair yields superior zero-shot generalization on MMLU-Stem and MMLU-Pro. These results demonstrate that grounding model reasoning in formal verification systems substantially improves reliability and auditability for safety-critical applications. The source code for reproducing our experiments is available under https://github.com/niklasmellgren/grpo-prolog-inference
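A best-of-N protocol with an external checker can be sketched in a few lines. How the paper selects among verified candidates is not stated in the abstract, so the majority vote over successfully executed programs is an assumption, and `run_program` is a hypothetical callable standing in for an actual Prolog interpreter call.

```python
from collections import Counter
from typing import Callable, List, Optional


def best_of_n(candidates: List[str],
              run_program: Callable[[str], Optional[str]]) -> Optional[str]:
    """Execute every candidate program and return the most common valid answer.

    `run_program` is a hypothetical executor: it returns the program's answer
    as a string, or None when the program fails to parse or run.
    """
    results = [run_program(program) for program in candidates]
    valid = [r for r in results if r is not None]
    if not valid:
        return None  # no candidate survived external verification
    return Counter(valid).most_common(1)[0][0]


# Stand-in executor for illustration: a lookup table instead of a Prolog engine.
fake_outputs = {"prog_a": "42", "prog_b": None, "prog_c": "42", "prog_d": "41"}
answer = best_of_n(list(fake_outputs), fake_outputs.get)
```

The external execution step is what distinguishes this from plain self-consistency voting: candidates whose programs do not run are discarded before any vote takes place.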

[52] Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning

Amir Mohammad Akhlaghi,Amirhossein Shabani,Mostafa Abdolmaleki,Saeed Reza Kheradpisheh

Main category: cs.CL

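TL;DR: This paper presents Persian-Phi, a 3.8B-parameter Persian language model. Using a resource-efficient curriculum learning approach, it adapts the originally monolingual Phi-3 Mini to a low-resource language (Persian) and reaches competitive performance at very low compute cost.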

Details Motivation: The huge compute cost of training large language models severely holds back AI development for low-resource languages. The paper explores how to extend state-of-the-art LLMs to underrepresented languages such as Persian with minimal resources. Method: A novel curriculum learning pipeline: first a "warm-up" stage that aligns embeddings using bilingual narratives (Tiny Stories), then continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT), transferring a monolingual English model to Persian. Result: Despite having only 3.8B parameters, Persian-Phi achieves competitive results on the Open Persian LLM Leaderboard on HuggingFace, which supports the effectiveness of the approach. Conclusion: The study provides a scalable, reproducible framework showing that high-quality multilingual capability does not require massive models or data, offering a practical path for bringing AI to low-resource languages. Abstract: The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini -- originally a monolingual English model -- can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a unique "warm-up" stage using bilingual narratives (Tiny Stories) to align embeddings prior to heavy training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on the Open Persian LLM Leaderboard on HuggingFace. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at https://huggingface.co/amirakhlaghiqqq/PersianPhi.

[53] Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Tong Wu,Yang Liu,Jun Bai,Zixia Jia,Shuyi Zhang,Ziyong Lin,Yanting Wang,Song-Chun Zhu,Zilong Zheng

Main category: cs.CL

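TL;DR: This paper proposes Native Parallel Reasoner (NPR), a teacher-free framework that lets large language models self-evolve genuine parallel reasoning capabilities, with substantial gains in performance and inference speed across multiple benchmarks.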

Details Motivation: Existing large language models mostly rely on sequential reasoning and struggle to decompose a problem into truly parallel lines of thought; a training framework for autonomously evolving parallel reasoning is missing. Method: The NPR framework has three core innovations: self-distilled progressive training, a Parallel-Aware Policy Optimization (PAPO) algorithm that operates within the execution graph, and an NPR Engine, refactored from SGLang, that supports large-scale parallel reinforcement learning. Result: Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups of up to 4.6x, with 100% genuine parallel execution. Conclusion: NPR moves large language models from sequential emulation to native parallel cognition and sets a new standard for efficient, scalable agentic reasoning. Abstract: We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.

[54] Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

Zhuoran Zhuang,Ye Chen,Jianghao Su,Chao Luo,Luhui Liu,Xia Zeng

Main category: cs.CL

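TL;DR: This paper proposes an agentic reinforcement learning framework for optimizing large language models with Tool-Integrated Reasoning (TIR). It uses Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO) to address sparse rewards and gradient degradation, achieving better performance and training stability on a range of question-answering tasks.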

Details Motivation: Existing agentic reinforcement learning faces two main challenges on tool-integrated reasoning tasks: verifiable binary rewards are too sparse to guide learning of intermediate steps, and in GRPO identical rewards within a group yield zero advantage, causing gradient degradation that hurts training efficiency and stability. Method: Two complementary techniques are proposed: 1) Progressive Reward Shaping (PRS), which follows a curriculum-learning idea to give dense, stage-wise feedback, first encouraging correctly formatted tool calls and then optimizing factual accuracy and answer quality, with a length-aware BLEU for short answers and LLM-as-a-Judge scoring for long answers to prevent reward hacking; 2) Value-based Sampling Policy Optimization (VSPO), a GRPO variant that replaces low-value samples with high-potential prompts chosen by a task-value metric and adds value-smoothing clipping to stabilize gradient updates. Result: Experiments on multiple short-form and long-form QA benchmarks show that PRS clearly improves on traditional binary rewards, and that VSPO trains more stably, converges faster, and reaches better final performance than PPO, GRPO, CISPO, and SFT-only baselines. Conclusion: Combining PRS and VSPO improves LLM tool use on complex reasoning tasks, strengthens training efficiency and generalization, and offers a practical path to efficient TIR agents. Abstract: Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback - encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates.
Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
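
The gradient-degradation issue described in this abstract can be illustrated with a minimal sketch of group-relative advantages, assuming the commonly used formulation in which each rollout's reward is normalized by the mean and standard deviation of its group. The function name, the `eps` term, and the reward values below are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by its rollout group's mean and std (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group where every rollout gets the same binary reward:
# every advantage is zero, so the group contributes no learning signal.
binary_group = [1.0, 1.0, 1.0, 1.0]
flat_advantages = group_relative_advantages(binary_group)

# Hypothetical denser, stage-wise rewards for the same rollouts:
# the rewards now differ, so the advantages are non-zero.
shaped_group = [0.4, 0.7, 1.0, 0.55]
shaped_advantages = group_relative_advantages(shaped_group)
```

With identical rewards the numerator `r - mu` is zero for every rollout, which is the zero-advantage case the abstract refers to; denser rewards are one way to keep the group informative.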

[55] SPAD: Seven-Source Token Probability Attribution with Syntactic Aggregation for Detecting Hallucinations in RAG

Pengqian Lu,Jie Lu,Anjin Liu,Guangquan Zhang

Main category: cs.CL

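TL;DR: This paper proposes SPAD, a new method for detecting hallucinations in Retrieval-Augmented Generation (RAG). By decomposing each token's probability into seven sources and aggregating the scores by part-of-speech tag, SPAD identifies anomalies caused by different generation components and achieves state-of-the-art hallucination detection performance.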

Details Motivation: Existing methods attribute hallucinations to a binary conflict between internal knowledge and retrieved context, ignoring other generation components such as the user query, previously generated tokens, the current token, and LayerNorm, so a more complete attribution analysis is needed. Method: Each token's generation probability is first decomposed mathematically into seven sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. The scores are then aggregated by part-of-speech (POS) tag to analyze how each component affects different linguistic categories, and hallucinations are detected by identifying anomalies (such as nouns relying on Final LayerNorm). Result: Experiments show that SPAD achieves state-of-the-art hallucination detection performance on multiple benchmarks and reveals, at a finer granularity, the role of each component and the anomalous patterns in generation. Conclusion: SPAD offers a more complete and interpretable view of the causes of hallucination in RAG, and combining multi-source attribution with linguistic structure significantly improves detection. Abstract: Detecting hallucinations in Retrieval-Augmented Generation (RAG) remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge (stored in FFNs) and retrieved context. However, this perspective is incomplete, failing to account for the impact of other components in the generative process, such as the user query, previously generated tokens, the current token itself, and the final LayerNorm adjustment. To address this, we introduce SPAD. First, we mathematically attribute each token's probability into seven distinct sources: Query, RAG, Past, Current Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the current token. Then, we aggregate these scores by POS tags to quantify how different components drive specific linguistic categories. By identifying anomalies, such as Nouns relying on Final LayerNorm, SPAD effectively detects hallucinations. Extensive experiments demonstrate that SPAD achieves state-of-the-art performance.

[56] LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings

Sebastian Sztwiertnia,Felix Friedrich,Kristian Kersting,Patrick Schramowski,Björn Deiseroth

Main category: cs.CL

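TL;DR: This paper proposes LIME (Linguistic Metadata Embeddings), which enriches token embeddings with metadata capturing syntactic, semantic, and contextual properties. It substantially improves the pre-training efficiency and generative ability of decoder-only language models, and a variant, LIME+1, improves reasoning and arithmetic performance.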

Details Motivation: Pre-training depends on large amounts of high-quality data, which is increasingly scarce; metadata is used to build datasets but is under-used as a direct training signal. The paper explores metadata as a way to strengthen the training signal. Method: LIME embeds metadata reflecting syntactic, semantic, and contextual properties into the token representations to enrich the input; the LIME+1 variant uses prior metadata for the next token to guide generation. Result: LIME adapts to the training data up to 56% faster while adding only 0.01% parameters at negligible compute overhead; it improves tokenization and boosts language modeling and generative task performance, and the benefits hold for models from 500M to 2B parameters. LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%. Conclusion: Incorporating linguistic metadata into token embeddings significantly improves training efficiency and generation quality, supports the value of metadata as a training signal, and suggests a new direction for model design. Abstract: Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.

[57] Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

Xiaoran Liu,Yuerong Song,Zhigeng Liu,Zengfeng Huang,Qipeng Guo,Zhaoxiang Liu,Shiguo Lian,Ziwei He,Xipeng Qiu

Main category: cs.CL

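TL;DR: This paper proposes an improved rotary position embedding (RoPE++) that reuses the imaginary part of the complex-valued dot product to strengthen the modeling of long-range dependencies.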

Details Motivation: Standard RoPE computes attention scores from only the real part of the complex-valued dot product and discards the imaginary part, which carries important phase information; this may weaken the modeling of long-context relations. Method: RoPE++ keeps the full complex-valued representation and uses it to build a dual-component attention score, preserving more positional information. Result: Theory and experiments show that the method outperforms standard RoPE on several long-context language modeling benchmarks, with larger gains as context length increases. Conclusion: By bringing back the imaginary component, RoPE++ models long-range dependencies more effectively and improves LLM performance on long text. Abstract: Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.
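
To make the real versus imaginary distinction concrete, here is a minimal sketch for a single complex feature pair, assuming the usual view of RoPE as multiplication by a unit complex number. The function names and numeric values are illustrative assumptions; how RoPE++ actually combines the two components into its dual-component score is not specified in the abstract and is not modeled here.

```python
import cmath

def rotate(z: complex, pos: int, theta: float) -> complex:
    """Apply the RoPE rotation for position `pos` to one complex feature pair."""
    return z * cmath.exp(1j * pos * theta)

def complex_attention_term(q: complex, k: complex, m: int, n: int, theta: float) -> complex:
    """Product of the rotated query (position m) and the conjugate of the rotated key (position n)."""
    return rotate(q, m, theta) * rotate(k, n, theta).conjugate()

# Illustrative values (assumptions, not from the paper).
q, k, theta = complex(0.3, -1.2), complex(0.8, 0.5), 0.01

term = complex_attention_term(q, k, m=12, n=5, theta=theta)
real_part = term.real  # equals the ordinary 2D dot product of the rotated pair
imag_part = term.imag  # the component the abstract says is normally discarded

# Shifting both positions by the same offset leaves the term unchanged,
# so both components depend only on the relative position m - n.
shifted = complex_attention_term(q, k, m=112, n=105, theta=theta)
```

Both parts are functions of the relative offset only, which is why the imaginary part can be reused as an additional relative-position signal.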

Michelle Wastl,Jannis Vamvas,Rico Sennrich

Main category: cs.CL

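TL;DR: This paper introduces SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition, with 224 multi-parallel documents and human token-level difference annotations. It evaluates a range of large models on the task and finds that current methods still fall well short on cross-lingual, document-level semantic difference recognition.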

Details Motivation: Semantic difference recognition is important for evaluating multilingual text generation and for content alignment, but it has received little attention as a standalone task, and naturalistic, document-level, cross-lingual datasets are lacking. Method: The SwissGov-RSD dataset is built from English-German, English-French, and English-Italian multi-parallel documents with token-level difference annotations by human annotators; open-source and closed-source large language models and encoder models are evaluated on it under different fine-tuning settings. Result: Current automatic methods perform far worse on the new benchmark than on monolingual, sentence-level, or synthetic data, revealing a considerable gap for both LLMs and encoder models. Conclusion: Cross-lingual, document-level semantic difference recognition remains challenging and needs further research; SwissGov-RSD provides an important benchmark for that work. Abstract: Recognizing semantic differences across documents, especially in different languages, is crucial for text generation evaluation and multilingual content alignment. However, as a standalone task it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English-German, English-French, and English-Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and datasets publicly available.

[59] Minimum Bayes Risk Decoding for Error Span Detection in Reference-Free Automatic Machine Translation Evaluation

Boxuan Lyu,Haiyue Song,Hidetaka Kamigaito,Chenchen Ding,Hideki Tanaka,Masao Utiyama,Kotaro Funakoshi,Manabu Okumura

Main category: cs.CL

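TL;DR: This paper applies Minimum Bayes Risk (MBR) decoding to generative Error Span Detection (ESD) for machine translation. Using sentence-level and span-level similarity metrics, it selects candidates closer to human annotation, significantly outperforming conventional Maximum a Posteriori (MAP) decoding, and it uses MBR distillation to reduce the computational overhead.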

Details Motivation: Existing generative ESD methods use MAP decoding, which assumes that model probability is perfectly correlated with similarity to human annotation; the authors find that outputs quite different from the human annotation can receive higher model likelihood, which biases decoding. Method: An MBR decoding framework uses sentence-level and span-level similarity metrics as utility functions to pick, from multiple candidates, the output with the lowest overall risk; MBR distillation is added so that a greedy-decoding model can approach MBR performance. Result: MBR decoding significantly outperforms the MAP baseline at the system, sentence, and span levels; MBR distillation effectively reduces inference latency while reaching comparable performance. Conclusion: MBR decoding improves the accuracy and robustness of generative ESD models, and combining it with distillation addresses its high computational cost, making it practical to deploy. Abstract: Error Span Detection (ESD) is a subtask of automatic machine translation evaluation that localizes error spans in translations and labels their severity. State-of-the-art generative ESD methods typically decode using Maximum a Posteriori (MAP), assuming that model-estimated probabilities are perfectly correlated with similarity to human annotation. However, we observed that annotations dissimilar to the human annotation could achieve a higher model likelihood than the human annotation. We address this issue by applying Minimum Bayes Risk (MBR) decoding to generative ESD models. Specifically, we employ sentence- and span-level similarity metrics as utility functions to select candidate hypotheses based on their approximate similarity to the human annotation. Extensive experimental results show that our MBR decoding outperforms the MAP baseline at the system, sentence, and span levels. Furthermore, to mitigate the computational cost of MBR decoding, we demonstrate that applying MBR distillation enables a standard greedy model to match MBR decoding performance, effectively eliminating the inference-time latency bottleneck.
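
The selection step of MBR decoding can be sketched as follows, under the assumption that each candidate annotation is represented as a set of token positions marked as erroneous and that the other sampled candidates serve as pseudo-references. The token-level F1 utility and all names below are hypothetical stand-ins; the paper's actual sentence- and span-level metrics are not reproduced here.

```python
def token_f1(hyp, ref):
    """Hypothetical utility: F1 overlap between two sets of error token positions."""
    hyp, ref = set(hyp), set(ref)
    if not hyp and not ref:
        return 1.0  # two empty annotations agree perfectly
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, utility):
    """Return the candidate with the highest average utility against the others.

    Assumes at least two candidates, sampled from the model.
    """
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]

# Illustrative candidates: three roughly agree, one is an outlier.
candidates = [{3, 4, 5}, {3, 4}, {3, 4, 5, 6}, {10}]
chosen = mbr_select(candidates, token_f1)
```

The outlier `{10}` scores zero against every other candidate, so MBR prefers the annotation closest to the consensus regardless of which candidate the model assigns the highest likelihood. The quadratic number of utility calls is the cost that MBR distillation is meant to avoid at inference time.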

[60] Most over-representation of phonological features in basic vocabulary disappears when controlling for spatial and phylogenetic effects

Frederic Blum

Main category: cs.CL

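TL;DR: This study tests the robustness of sound-symbolic patterns in basic vocabulary by expanding the sample to 2864 languages and controlling for spatial and genealogical dependencies between languages. Most earlier results turn out not to be robust, and only a few patterns remain stable.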

Details Motivation: To test the reproducibility of earlier sound-symbolism results and to address potential biases from not controlling for genealogical and areal dependencies between languages. Method: Using data on 2864 languages from Lexibank, the original model is modified by adding statistical controls for spatial and phylogenetic dependencies, and the distribution of phonological features in basic vocabulary is re-analyzed. Result: Most previously reported sound-symbolic patterns disappear once genealogical and areal controls are added; only a few patterns remain stable. Conclusion: Universal claims about language need to be tested for robustness at several levels; genuinely cross-linguistic sound symbolism may be rarer than previously thought. Abstract: The statistical over-representation of phonological features in the basic vocabulary of languages is often interpreted as reflecting potentially universal sound symbolic patterns. However, most of those results have not been tested explicitly for reproducibility and might be prone to biases in the study samples or models. Many studies on the topic do not adequately control for genealogical and areal dependencies between sampled languages, casting doubts on the robustness of the results. In this study, we test the robustness of a recent study on sound symbolism of basic vocabulary concepts which analyzed 245 languages. The new sample includes data on 2864 languages from Lexibank. We modify the original model by adding statistical controls for spatial and phylogenetic dependencies between languages. The new results show that most of the previously observed patterns are not robust, and in fact many patterns disappear completely when adding the genealogical and areal controls. A small number of patterns, however, emerges as highly stable even with the new sample. Through the new analysis, we are able to assess the distribution of sound symbolism on a larger scale than previously. The study further highlights the need for testing all universal claims on language for robustness on various levels.

[61] MoCoRP: Modeling Consistent Relations between Persona and Response for Persona-based Dialogue

Kyungro Lee,Dongha Choi,Hyunju Lee

Main category: cs.CL

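TL;DR: This paper proposes MoCoRP, a framework that improves persona-based dialogue generation by explicitly modeling the relations between persona sentences and responses.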

Details Motivation: Existing persona-based dialogue datasets lack explicit relations between persona sentences and responses, which makes it hard for models to capture persona information effectively. Method: MoCoRP uses a natural language inference (NLI) expert to extract the NLI relations between persona sentences and responses, and incorporates them into pre-trained models such as BART and, through alignment tuning, into large language models (LLMs) to improve persona consistency. Result: Experiments on ConvAI2 and MPChat show that MoCoRP outperforms existing baselines on both quantitative and qualitative measures and generates more persona-consistent, context-aware dialogue. Conclusion: Explicitly modeling persona-response relations improves the performance of persona-based dialogue systems. Abstract: As dialogue systems become increasingly important across various domains, a key challenge in persona-based dialogue is generating engaging and context-specific interactions while ensuring the model acts with a coherent personality. However, existing persona-based dialogue datasets lack explicit relations between persona sentences and responses, which makes it difficult for models to effectively capture persona information. To address these issues, we propose MoCoRP (Modeling Consistent Relations between Persona and Response), a framework that incorporates explicit relations into language models. MoCoRP leverages an NLI expert to explicitly extract the NLI relations between persona sentences and responses, enabling the model to effectively incorporate appropriate persona information from the context into its responses. We applied this framework to pre-trained models like BART and further extended it to modern large language models (LLMs) through alignment tuning. Experimental results on the public datasets ConvAI2 and MPChat demonstrate that MoCoRP outperforms existing baselines, achieving superior persona consistency and engaging, context-aware dialogue generation. Furthermore, our model not only excels in quantitative metrics but also shows significant improvements in qualitative aspects. These results highlight the effectiveness of explicitly modeling persona-response relations in persona-based dialogue. The source codes of MoCoRP are available at https://github.com/DMCB-GIST/MoCoRP.

[62] Performance of the SafeTerm AI-Based MedDRA Query System Against Standardised MedDRA Queries

Francois Vandenhende,Anna Georgiou,Michalis Georgiou,Theodoros Psaras,Ellie Karekla,Elena Hadjicosta

Main category: cs.CL

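TL;DR: SafeTerm AMQ is an AI-based system that automatically retrieves adverse event terms related to MedDRA SMQs. It shows good recall and precision for signal detection and can serve as a useful supplementary method for automated query generation in drug safety review.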

Details Motivation: In pre-market drug safety review, grouping adverse event terms into SMQs or OCMQs is critical for signal detection. Traditional approaches are manual, inefficient, and prone to omissions, so an automated, efficient solution is needed. Method: SafeTerm AMQ embeds medical query terms and MedDRA Preferred Terms (PTs) in a multidimensional vector space, applies cosine similarity and extreme-value clustering, and uses multi-criteria statistical methods to compute a relevance score (0-1), producing a ranked list of relevant PTs; it is validated on 110 SMQs. Result: Precision, recall, and F1 were evaluated at multiple similarity thresholds. Recall reaches 94% at moderate thresholds and precision reaches up to 89% at high thresholds; at the optimal threshold of 0.70, overall recall is 48% and precision is 45%; the automatic threshold of 0.66 favors recall (58%, versus 29% precision); performance improves slightly when restricted to narrow terms. Conclusion: SafeTerm AMQ performs well on SMQs and sanitized OCMQs and is a viable supplementary tool for MedDRA query generation. The authors recommend using suitable PT terminology in the query and the automated threshold method to optimize recall, and raising the similarity threshold for more precise term selection. Abstract: In pre-market drug safety review, grouping related adverse event terms into SMQs or OCMQs is critical for signal detection. We assess the performance of SafeTerm Automated Medical Query (AMQ) on MedDRA SMQs. The AMQ is a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score (0-1) using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against tier-1 SMQs (110 queries, v28.1). Precision, recall and F1 were computed at multiple similarity thresholds, defined either manually or using an automated method. High recall (94%) is achieved at moderate similarity thresholds, indicative of good retrieval sensitivity. Higher thresholds filter out more terms, resulting in improved precision (up to 89%). The optimal threshold (0.70) yielded an overall recall of 48% and precision of 45% across all 110 queries. Restricting to narrow-term PTs achieved slightly better performance at an increased (+0.05) similarity threshold, confirming increased relatedness of narrow versus broad terms. The automatic threshold (0.66) selection prioritizes recall (0.58) over precision (0.29).
SafeTerm AMQ achieves comparable, satisfactory performance on SMQs and sanitized OCMQs. It is therefore a viable supplementary method for automated MedDRA query generation, balancing recall and precision. We recommend using suitable MedDRA PT terminology in query formulation and applying the automated threshold method to optimise recall. Raising the similarity threshold allows a more refined selection of narrow terms.
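
The threshold trade-off reported above can be sketched with a toy retrieval step: score each term by cosine similarity to the query and keep those at or above a threshold. This is only an illustration under stated assumptions; the vectors, term names, and reference set are made up, and the system's extreme-value clustering and multi-criteria scoring are not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query_vec, term_vecs, threshold):
    """Return the terms whose similarity to the query meets the threshold."""
    return {t for t, v in term_vecs.items() if cosine(query_vec, v) >= threshold}

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a reference set."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical 2-D embeddings and reference list (placeholders, not real MedDRA PTs).
query = [1.0, 0.0]
terms = {
    "term_a": [1.0, 0.1],    # similarity about 0.995
    "term_b": [0.8, 0.6],    # similarity about 0.8
    "term_c": [0.5, 0.866],  # similarity about 0.5
    "term_d": [0.0, 1.0],    # similarity 0.0
}
relevant = {"term_a", "term_c"}

loose = precision_recall(retrieve(query, terms, 0.4), relevant)   # favors recall
strict = precision_recall(retrieve(query, terms, 0.9), relevant)  # favors precision
```

Lowering the threshold admits `term_b` and `term_c`, raising recall at the cost of precision; raising it keeps only `term_a`, the reverse trade. This mirrors the direction of the recall and precision figures quoted in the abstract, not their values.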

[63] A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

Nicolas Calbucura,Valentin Barriere

Main category: cs.CL

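TL;DR: Proposes a simple lasso-based feature selection method that keeps only the most important audio tokens to improve how pre-trained language models integrate speech, reaching state-of-the-art results on argumentative fallacy detection.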

Details Motivation: Existing speech-text fusion methods are hard to integrate into large language models at low cost because audio sequences are long and use large token vocabularies, and on some tasks audio has been considered counterproductive; an effective, low-cost way to fuse multimodal information is needed. Method: An ASR speech tokenizer produces audio tokens, a multimodal Bag-of-Words representation is built, lasso feature selection keeps the most important audio tokens, the language model is adapted to the selected tokens with a self-supervised language modeling objective, and the model is then fine-tuned on the downstream task. Result: On two argumentative fallacy detection and classification tasks, the method outperforms a unimodal model, a larger SpeechLM, and a learned audio representation, reaching state-of-the-art performance; even randomly selected audio tokens improve the model. Conclusion: The method integrates key speech information into large language models efficiently and cheaply, and clearly improves performance on specific classification tasks, including tasks where audio was previously considered unhelpful. Abstract: This paper presents a simple method that makes it easy to enhance textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings from audio with text is the large length of the audio sequence compared to the text one. Our method benefits from an existing speech tokenizer trained for Audio Speech Recognition that outputs long sequences of tokens from a large vocabulary, making it difficult to integrate it at low cost in a large language model. By applying a simple lasso-based feature selection on multimodal Bag-of-Words representation, we retain only the most important audio tokens for the task, and adapt the language model to them with a self-supervised language modeling objective, before fine-tuning it on the downstream task. We show this helps to improve performance compared to a unimodal model, to a bigger SpeechLM, or to integrating audio via a learned representation. We show the effectiveness of our method on two recent Argumentative Fallacy Detection and Classification tasks where the use of audio was believed counterproductive, reaching state-of-the-art results. We also provide an in-depth analysis of the method, showing that even a random audio token selection helps enhance the unimodal model. Our code is available [online](https://github.com/salocinc/EACL26SpeechTokFallacy/).

[64] Complementary Learning Approach for Text Classification using Large Language Models

Navid Asgari,Benjamin M. Cole

Main category: cs.CL

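TL;DR: This paper proposes a structured methodology for using large language models (LLMs) that combines the strengths of scholars and machines, using chain-of-thought and few-shot prompting to support human-machine collaboration in quantitative research.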

Details Motivation: To overcome the inherent weaknesses of LLMs in research applications and to make human-machine collaboration more efficient. Method: Chain-of-thought and few-shot prompting, techniques from computer science, are applied to human-machine collaborative quantitative research. Result: The method is demonstrated by analyzing human-machine rating discrepancies on 1,934 press releases announcing pharmaceutical alliances. Conclusion: The method combines the strengths of humans and machines at low cost and improves research quality and interpretability. Abstract: In this study, we propose a structured methodology that utilizes large language models (LLMs) in a cost-efficient and parsimonious manner, integrating the strengths of scholars and machines while offsetting their respective weaknesses. Our methodology, facilitated through a chain of thought and few-shot learning prompting from computer science, extends best practices for co-author teams in qualitative research to human-machine teams in quantitative research. This allows humans to utilize abductive reasoning and natural language to interrogate not just what the machine has done but also what the human has done. Our method highlights how scholars can manage inherent weaknesses of LLMs using careful, low-cost techniques. We demonstrate how to use the methodology to interrogate human-machine rating discrepancies for a sample of 1,934 press releases announcing pharmaceutical alliances (1990-2017).

[65] Metric-Fair Prompting: Treating Similar Samples Similarly

Jing Wang,Jie Shen,Xing Niu,Tong Zhang,Jeremy Weiss

Main category: cs.CL

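TL;DR: This paper proposes Metric-Fair Prompting, a fairness-aware prompting framework that imposes metric-fairness constraints on pairs of similar questions to improve both the accuracy and the individual fairness of large language models on medical multiple-choice question answering.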

Details Motivation: To improve the fairness and consistency of large language models in high-stakes clinical decision-making and avoid inconsistent answers to similar questions. Method: Each (question, option) pair is treated as a binary classification instance; question similarity is computed with NLP embeddings, and items are solved in joint pairs of similar questions. The prompt guides the model to extract decisive clinical features, score each instance, and apply a Lipschitz-style constraint so that similar inputs receive similar outputs. Result: On the MedQA (US) benchmark, the method improves performance over standard single-item prompting, indicating that fairness-guided reasoning helps accuracy. Conclusion: By introducing individual-fairness constraints, Metric-Fair Prompting improves the consistency of model decisions and also strengthens performance on high-stakes tasks such as medical question answering. Abstract: We introduce Metric-Fair Prompting, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering, each (question, option) pair is treated as a binary instance with label $+1$ (correct) or $-1$ (incorrect). To promote individual fairness (treating similar instances similarly), we compute question similarity using NLP embeddings and solve items in joint pairs of similar questions rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each (question, option) to a score $f(x)$ that acts as confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the MedQA (US) benchmark, Metric-Fair Prompting is shown to improve performance over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
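
A Lipschitz-style constraint of the kind the abstract mentions is commonly written as |f(x) - f(x')| <= L * d(x, x'). The sketch below checks that condition after the fact on a handful of scored instances. It is an illustrative assumption throughout: the paper enforces the constraint through the prompt, and the constant `L`, the distances, the instance names, and the scores here are hypothetical.

```python
def lipschitz_violations(scores, distances, lipschitz_const):
    """List pairs whose score gap exceeds lipschitz_const times their distance.

    scores: dict mapping instance id -> confidence score f(x)
    distances: dict mapping (id_i, id_j) -> embedding distance d(x_i, x_j)
    """
    violations = []
    for (i, j), d in distances.items():
        gap = abs(scores[i] - scores[j])
        if gap > lipschitz_const * d:
            violations.append((i, j, gap))
    return violations

# Hypothetical (question, option) instances with made-up scores and distances.
scores = {"q1_optA": 0.90, "q2_optA": 0.20, "q3_optA": 0.85}
distances = {
    ("q1_optA", "q2_optA"): 0.1,  # near-identical questions, very different scores
    ("q1_optA", "q3_optA"): 0.1,  # near-identical questions, similar scores
}

flagged = lipschitz_violations(scores, distances, lipschitz_const=2.0)
```

Here the bound is 2.0 * 0.1 = 0.2, so the first pair (gap 0.7) is flagged as treating similar instances differently, while the second pair (gap 0.05) satisfies the constraint.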

[66] PCMind-2.1-Kaiyuan-2B Technical Report

Kairong Luo,Zhenbo Sun,Xinyu Shi,Shengqi Chen,Bowen Yu,Yunyi Chen,Chenyi Dang,Hengtao Tao,Hui Wang,Fangming Liu,Kaifeng Lyu,Wenguang Chen

Main category: cs.CL

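TL;DR: This paper presents PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter language model that aims to narrow the knowledge gap between the open-source community and industry in large-model training.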

Details Motivation: Because industry relies on closed-source high-quality data and training recipes, the open-source community lags behind in large-model development; the paper aims to improve training efficiency and effectiveness under resource constraints through a fully open approach. Method: Three key techniques are proposed: Quantile Data Benchmarking to evaluate and mix heterogeneous open-source datasets; a multi-phase Strategic Selective Repetition scheme to make effective use of sparse high-quality data; and a Multi-Domain Curriculum Training policy that orders training samples by data quality. The data preprocessing pipeline is also optimized and the architecture is modified for FP16 stability. Result: Kaiyuan-2B performs on par with current state-of-the-art fully open-source models on multiple metrics, supporting the effectiveness and scalability of its approach to efficient pretraining under resource constraints. Conclusion: The work offers a practical, scalable open-source solution for training large models with limited resources and releases model weights, data, and code to support the community. Abstract: The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.

[67] Bridging Code Graphs and Large Language Models for Better Code Understanding

Zeqi Chen,Zhaoyang Chu,Yi Gui,Feng Guo,Yao Wan,Chuan Shi

Main category: cs.CL

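TL;DR: This paper proposes CGBridge, a plug-and-play method that injects code graph information into a frozen large language model through an external, trainable Bridge module, improving its understanding of program structural semantics. It significantly outperforms existing methods on code summarization and translation and achieves over 4x faster inference.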

Details Motivation: On code intelligence tasks, large language models are limited by linearized token sequences and struggle to capture the structural semantics of programs; existing graph-augmented methods are constrained by prompt length or require architecture-specific changes that are incompatible with large-scale instruction-following LLMs. Method: CGBridge first pre-trains a code graph encoder with self-supervised learning to capture structural semantics, then trains an external Bridge module that aligns the semantics of code, graph, and text through cross-modal attention, and finally generates structure-aware prompts that are injected into a frozen LLM, with fine-tuning for downstream tasks. Result: CGBridge yields relative gains of 16.19% and 9.12% in LLM-as-a-Judge on code summarization and 9.84% and 38.87% in Execution Accuracy on code translation, with inference over 4x faster than LoRA-tuned models. Conclusion: CGBridge effectively bridges the modality gap between code and structural information and is an efficient, general way to add structure-aware code understanding without modifying the original model architecture. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in code intelligence tasks such as code generation, summarization, and translation. However, their reliance on linearized token sequences limits their ability to understand the structural semantics of programs. While prior studies have explored graph-augmented prompting and structure-aware pretraining, they either suffer from prompt length constraints or require task-specific architectural changes that are incompatible with large-scale instruction-following LLMs. To address these limitations, this paper proposes CGBridge, a novel plug-and-play method that enhances LLMs with Code Graph information through an external, trainable Bridge module. CGBridge first pre-trains a code graph encoder via self-supervised learning on a large-scale dataset of 270K code graphs to learn structural code semantics. It then trains an external module to bridge the modality gap among code, graph, and text by aligning their semantics through cross-modal attention mechanisms. Finally, the bridge module generates structure-informed prompts, which are injected into a frozen LLM, and is fine-tuned for downstream code intelligence tasks. Experiments show that CGBridge achieves notable improvements over both the original model and the graph-augmented prompting method. Specifically, it yields a 16.19% and 9.12% relative gain in LLM-as-a-Judge on code summarization, and a 9.84% and 38.87% relative gain in Execution Accuracy on code translation.
Moreover, CGBridge achieves over 4x faster inference than LoRA-tuned models, demonstrating both effectiveness and efficiency in structure-aware code understanding.

[68] When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks

Zihan Chen,Lanyu Yu

Main category: cs.CL

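TL;DR: Proposes a Graph Neural Network (GNN) framework that combines text content with the relational structure among comments to detect uncivil behavior (toxicity, aggression, and personal attacks) in the Wikipedia community, outperforming 12 mainstream large language models at lower inference cost.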

Details Motivation: Existing methods for detecting online incivility are limited in accuracy and efficiency; text-only large language models in particular ignore the structural relations among user comments and struggle to capture complex social context. Method: A graph neural network is built in which each comment is a node and edges are defined by textual similarity, with a dynamically adjusted attention mechanism that jointly learns from text content and topological features. Result: The model outperforms 12 state-of-the-art large language models on multiple metrics at significantly lower inference cost, confirming the key role of structural context in detecting online incivility. Conclusion: A graph neural network that fuses linguistic content with relational structure identifies online incivility more efficiently and accurately, overcomes the limits of text-only models, and offers a more scalable option for social media governance. Abstract: Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automated detection, the performance of existing approaches often remains limited in both accuracy and efficiency. To address this challenge, we propose a Graph Neural Network (GNN) framework for detecting three types of uncivil behavior (i.e., toxicity, aggression, and personal attacks) within the English Wikipedia community. Our model represents each user comment as a node, with textual similarity between comments defining the edges, allowing the network to jointly learn from both linguistic content and relational structures among comments. We also introduce a dynamically adjusted attention mechanism that adaptively balances nodal and topological features during information aggregation. Empirical evaluations demonstrate that our proposed architecture outperforms 12 state-of-the-art Large Language Models (LLMs) across multiple metrics while requiring significantly lower inference cost. These findings highlight the crucial role of structural context in detecting online incivility and address the limitations of text-only LLM paradigms in behavioral prediction. All datasets and comparative outputs will be publicly available in our repository to support further research and reproducibility.

[69] HalluShift++: Bridging Language and Vision through Internal Representation Shifts for Hierarchical Hallucinations in MLLMs

Sujoy Nath,Arkaprabha Basu,Sharanya Dasgupta,Swagatam Das

Main category: cs.CL

TL;DR: This paper proposes HalluShift++, a method for detecting hallucinations in multimodal large language models (MLLMs) by analyzing measurable irregularities in the model's internal layer dynamics rather than relying solely on external LLM evaluators.

Details Motivation: MLLMs perform well on vision-language understanding tasks but often hallucinate content that is inconsistent with the visual input, and current evaluation methods that depend on external LLMs are themselves prone to hallucination and face domain-adaptation problems. Method: Hypothesizes that hallucination shows up as measurable irregularities in the internal layer dynamics of MLLMs and, through modifications based on layer-wise analysis of specific assumptions, extends a hallucination detection method originally designed for text-only LLMs to multimodal settings. Result: HalluShift++ detects hallucinations in multimodal settings and reduces reliance on external evaluator models. Conclusion: Analyzing internal model dynamics can identify hallucinations in MLLMs, and HalluShift++ offers a reliable and scalable direction for multimodal hallucination detection. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language understanding tasks. While these models often produce linguistically coherent output, they often suffer from hallucinations, generating descriptions that are factually inconsistent with the visual content, potentially leading to adverse consequences. Therefore, the assessment of hallucinations in MLLM has become increasingly crucial in the model development process. Contemporary methodologies predominantly depend on external LLM evaluators, which are themselves susceptible to hallucinations and may present challenges in terms of domain adaptation. In this study, we propose the hypothesis that hallucination manifests as measurable irregularities within the internal layer dynamics of MLLMs, not merely due to distributional shifts but also in the context of layer-wise analysis of specific assumptions. By incorporating such modifications, HalluShift++ broadens the efficacy of hallucination detection from text-based large language models (LLMs) to encompass multimodal scenarios. Our codebase is available at https://github.com/C0mRD/HalluShift_Plus.

[70] Automated Generation of Custom MedDRA Queries Using SafeTerm Medical Map

Francois Vandenhende,Anna Georgiou,Michalis Georgiou,Theodoros Psaras,Ellie Karekla,Elena Hadjicosta

Main category: cs.CL

TL;DR: Presents SafeTerm, an AI system that automatically retrieves and ranks MedDRA Preferred Terms (PTs) relevant to drug safety review, using multi-criteria statistical methods and a vector-space model to achieve high recall and precision.

Details Motivation: In pre-market drug safety review, grouping adverse event terms into standardised MedDRA queries or FDA custom medical queries (OCMQs) is critical for signal detection, but manual grouping is inefficient and error-prone, so an automated solution is needed. Method: SafeTerm embeds medical query terms and MedDRA PTs in a multidimensional vector space, applies cosine similarity and extreme-value clustering to produce a list of PTs ranked by relevance score, and is validated against FDA OCMQ v3.0. Result: Precision, recall and F1 were computed across similarity thresholds; recall exceeds 95% at moderate thresholds and precision reaches up to 86% at higher thresholds; at the optimal threshold (~0.70-0.75) recall is about 50% and precision about 33%; narrow-term subsets behave similarly but need slightly higher thresholds. Conclusion: SafeTerm is a viable supplementary method for automated MedDRA query generation; a similarity threshold of ~0.60 is recommended initially, raised step by step to refine term selection. Abstract: In pre-market drug safety review, grouping related adverse event terms into standardised MedDRA queries or the FDA Office of New Drugs Custom Medical Queries (OCMQs) is critical for signal detection. We present a novel quantitative artificial intelligence system that understands and processes medical terminology and automatically retrieves relevant MedDRA Preferred Terms (PTs) for a given input query, ranking them by a relevance score using multi-criteria statistical methods. The system (SafeTerm) embeds medical query terms and MedDRA PTs in a multidimensional vector space, then applies cosine similarity and extreme-value clustering to generate a ranked list of PTs. Validation was conducted against the FDA OCMQ v3.0 (104 queries), restricted to valid MedDRA PTs. Precision, recall and F1 were computed across similarity-thresholds. High recall (>95%) is achieved at moderate thresholds. Higher thresholds improve precision (up to 86%). The optimal threshold (~0.70 - 0.75) yielded recall ~50% and precision ~33%. Narrow-term PT subsets performed similarly but required slightly higher similarity thresholds. The SafeTerm AI-driven system provides a viable supplementary method for automated MedDRA query generation. A similarity threshold of ~0.60 is recommended initially, with increased thresholds for refined term selection.
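
The threshold trade-off reported in this abstract can be illustrated with a minimal sketch that scores a retrieved term list against a reference query at a given similarity cut-off. The similarity values and term sets below are invented toy data, not SafeTerm output, and the function name `evaluate_at_threshold` is hypothetical; the sketch also omits the extreme-value clustering step, which the abstract does not specify.

```python
def evaluate_at_threshold(scores, reference, threshold):
    """Precision, recall and F1 of the terms whose similarity is >= threshold.

    scores: dict mapping a candidate term to its cosine similarity with the query.
    reference: set of terms in the gold-standard query.
    """
    retrieved = {term for term, s in scores.items() if s >= threshold}
    tp = len(retrieved & reference)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical similarity scores for one query (toy values only).
scores = {"Nausea": 0.91, "Vomiting": 0.78, "Retching": 0.66, "Headache": 0.62, "Rash": 0.30}
reference = {"Nausea", "Vomiting", "Retching"}

# Ranked list, most relevant first.
ranked = sorted(scores, key=scores.get, reverse=True)

loose = evaluate_at_threshold(scores, reference, 0.60)
strict = evaluate_at_threshold(scores, reference, 0.75)
```

In this toy case the looser cut-off retrieves every reference term plus one false positive, while the stricter cut-off drops the false positive and also loses one reference term. That is the same direction of trade-off the abstract describes: higher thresholds raise precision and lower recall.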

[71] Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?

Karin de Langis,Püren Öncel,Ryan Peters,Andrew Elfenbein,Laura Kristen Allen,Andreas Schramm,Dongyeop Kang

Main category: cs.CL

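TL;DR: Large language models (LLMs) can identify incoherent stories in their internal representations, but their generated rating responses fail to separate coherent from incoherent narratives, revealing a limit in narrative understanding; the models are more sensitive to incoherence that violates the setting, suggesting they rely more on prototypical world knowledge than on deeper narrative logic.

Details Motivation: To investigate whether LLMs truly understand narrative coherence or judge it only from surface features or world knowledge. Method: Using a dataset of paired narratives, a probing study analyzes LLMs' internal representations, tests their ability to rate coherent and incoherent stories under several prompts, and compares the effect of different incoherence types (setting violations vs. character-trait violations). Result: Internal representations can identify incoherent stories, but output responses fail to separate them; reasoning LLMs do not close this gap; models are more sensitive to setting violations than to character-trait violations. Conclusion: LLMs lack a complete grasp of narrative coherence, their behavior is disconnected from their internal state, and they rely more on prototypical world knowledge than on building meaning-based narrative coherence. Abstract: Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs' internal representations can reliably identify incoherent narratives. However, LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives across several prompt variations, hinting at a gap in LLM's understanding of storytelling. The reasoning LLMs tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between model internal state and behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than building meaning-based narrative coherence. The consistent asymmetry found in our results suggests that LLMs do not have a complete grasp on narrative coherence.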

[72] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Charlie Zhang,Graham Neubig,Xiang Yue

Main category: cs.CL

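TL;DR: This paper develops a fully controlled experimental framework to isolate the causal contributions of pre-training, mid-training, and reinforcement learning (RL) based post-training to the reasoning ability of language models. Using synthetic reasoning tasks and systematic manipulation of training distributions, it finds that RL yields true capability gains only when pre-training leaves sufficient headroom and the RL data target the model's edge of competence; contextual generalization depends on minimal yet sufficient pre-training exposure; mid-training significantly outperforms RL alone under the same compute; and process-level rewards reduce reward hacking and improve reasoning fidelity.

Details Motivation: Although RL techniques have markedly improved the reasoning of language models, it remains unclear whether post-training truly extends reasoning ability beyond what pre-training provides. Existing training pipelines lack transparency and control, which makes the causal role of each stage hard to determine. Method: Proposes a fully controlled experimental framework built on synthetic reasoning tasks with explicit atomic operations and parseable step-by-step reasoning traces, and systematically manipulates the data distributions of pre-training, mid-training, and RL post-training. Evaluation covers extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Result: 1) RL yields true capability gains (pass@128) only when pre-training leaves sufficient headroom and the RL data target tasks at the model's edge of competence; 2) contextual generalization needs only minimal yet sufficient pre-training exposure, after which RL can transfer; 3) mid-training significantly outperforms RL alone under fixed compute; 4) process-level rewards reduce reward hacking and improve reasoning fidelity. Conclusion: Pre-training, mid-training, and RL interact in complex ways: pre-training sets the space of potential, mid-training plays a key mediating role, and the effectiveness of RL depends heavily on the conditions set by the first two. The study provides a causal basis for understanding and optimizing training strategies for reasoning language models. Abstract: Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer.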
3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
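
The abstract reports capability gains as pass@128 but does not say how the metric is computed. A common choice is the unbiased pass@k estimator, which takes n sampled solutions per task, of which c are correct, and estimates the probability that at least one of k draws is correct. The sketch below assumes that estimator; whether the authors use exactly this form is not stated in the abstract.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples drawn for a task
    c: number of those samples that are correct
    k: evaluation budget (e.g. 128 for pass@128), with k <= n
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy case: 256 samples with a single correct one, evaluated at k = 128.
single_hit = pass_at_k(256, 1, 128)
```

A large k such as 128 makes the metric sensitive to rare successes: a task counts as largely solved if the model can reach a correct answer at all within the budget, which is why it is often read as a measure of capability rather than of typical single-sample accuracy.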

[73] Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support

Raunak Jain,Mudita Khurana

Main category: cs.CL

TL;DR: This paper proposes Collaborative Causal Sensemaking (CCS), a framework that recasts LLM-based agents as decision-support partners that think alongside experts rather than as passive tools, to address the poor performance of current human-AI teams in complex, high-stakes settings.

Details Motivation: Existing AI decision-support systems do not fit into experts' collaborative cognitive processes, so human-AI teams underperform and lack real complementarity. The root cause is not model accuracy but an inadequate conception of the role of AI assistance. Method: Proposes CCS as a research agenda and organizing framework in which AI acts as a partner in cognitive work, able to maintain evolving models of how experts reason, co-construct and test causal hypotheses, help revise goals, and learn from the outcomes of joint decisions. Result: Sketches the key challenges for realizing CCS: training ecologies that foster collaborative thinking, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. Conclusion: CCS can reframe multi-agent systems (MAS) research around AI teammates that take part in collaborative sensemaking and think with their human partners. Abstract: LLM-based agents are rapidly being plugged into expert decision-support, yet in messy, high-stakes settings they rarely make the team smarter: human-AI teams often underperform the best individual, experts oscillate between verification loops and over-reliance, and the promised complementarity does not materialise. We argue this is not just a matter of accuracy, but a fundamental gap in how we conceive AI assistance: expert decisions are made through collaborative cognitive processes where mental models, goals, and constraints are continually co-constructed, tested, and revised between human and AI. We propose Collaborative Causal Sensemaking (CCS) as a research agenda and organizing framework for decision-support agents: systems designed as partners in cognitive work, maintaining evolving models of how particular experts reason, helping articulate and revise goals, co-constructing and stress-testing causal hypotheses, and learning from the outcomes of joint decisions so that both human and agent improve over time. We sketch challenges around training ecologies that make collaborative thinking instrumentally valuable, representations and interaction protocols for co-authored models, and evaluation centred on trust and complementarity. These directions can reframe MAS research around agents that participate in collaborative sensemaking and act as AI teammates that think with their human partners.

[74] Do Generalisation Results Generalise?

Matteo Boglioni,Andrea Sgobbi,Gabriel Tavernini,Francesco Rita,Marius Mosbach,Tiago Pimentel

Main category: cs.CL

TL;DR: This study examines how correlated LLM generalisation performance is across multiple out-of-distribution (OOD) testsets and finds that, once in-domain performance is controlled for, the correlations show no consistent trend across models.

Details Motivation: Prior work assessing LLM generalisation mostly relies on a single OOD dataset, which cannot reflect the diverse data shifts met in deployment, so it is necessary to ask whether OOD results themselves generalise. Method: Evaluates OLMo2 and OPT on multiple OOD testsets throughout finetuning and computes the partial correlation of performances after regressing out in-domain performance. Result: The correlation between OOD testsets varies by model, with no consistent positive or negative trend. Conclusion: OOD generalisation results are not consistent across models, so evaluation on a single OOD dataset may be insufficient to measure a model's generalisation ability. Abstract: A large language model's (LLM's) out-of-distribution (OOD) generalisation ability is crucial to its deployment. Previous work assessing LLMs' generalisation performance, however, typically focuses on a single out-of-distribution dataset. This approach may fail to precisely evaluate the capabilities of the model, as the data shifts encountered once a model is deployed are much more diverse. In this work, we investigate whether OOD generalisation results generalise. More specifically, we evaluate a model's performance across multiple OOD testsets throughout a finetuning run; we then evaluate the partial correlation of performances across these testsets, regressing out in-domain performance. This allows us to assess how correlated are generalisation performances once in-domain performance is controlled for. Analysing OLMo2 and OPT, we observe no overarching trend in generalisation results: the existence of a positive or negative correlation between any two OOD testsets depends strongly on the specific choice of model analysed.
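
The partial-correlation analysis described in this abstract can be sketched as follows: regress each OOD score series on the in-domain series, then correlate the residuals. This is a minimal illustration using simple least squares with an intercept; the paper's exact regression setup is not given in the abstract, and the checkpoint scores below are invented so that the effect is easy to see.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def residuals(y, z):
    """Residuals of y after a least-squares fit on z (slope plus intercept)."""
    mz, my = _mean(z), _mean(y)
    slope = sum((a - mz) * (b - my) for a, b in zip(z, y)) / sum((a - mz) ** 2 for a in z)
    return [b - (my + slope * (a - mz)) for a, b in zip(z, y)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def partial_correlation(x, y, z):
    """Correlation between x and y once the linear effect of z is removed from both."""
    return pearson(residuals(x, z), residuals(y, z))

# Hypothetical accuracies at four finetuning checkpoints (toy values only).
in_domain = [0.50, 0.60, 0.70, 0.80]
ood_a = [0.51, 0.59, 0.69, 0.81]
ood_b = [0.49, 0.61, 0.71, 0.79]

raw = pearson(ood_a, ood_b)
controlled = partial_correlation(ood_a, ood_b, in_domain)
```

In this toy data both OOD series rise with in-domain accuracy, so their raw correlation is strongly positive, yet their deviations from the in-domain trend move in opposite directions, so the partial correlation is close to -1. That gap between the two numbers is why the analysis controls for in-domain performance before comparing testsets.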

cs.CV [Back]

[75] Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven' Matrices

Hokin Deng

Main category: cs.CV

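TL;DR: This paper proposes an evaluation paradigm for the reasoning ability of video generation models built on a "Task Pair" design, provides the scalable code framework VMEvalKit, and shows that automated evaluation correlates strongly with human judgment, opening an opportunity to improve reasoning in video models through reinforcement learning.

Details Motivation: To explore whether video generation models can reason and to build a scalable, automated evaluation framework for systematically testing and improving their performance on complex tasks. Method: Uses a "Task Pair" experimental design with tasks including chess, maze, Sudoku, mental rotation, and Raven's Matrices to test leading video generation models such as Sora-2; develops the VMEvalKit code framework so that models and tasks can be added quickly; and checks the correlation between automated evaluation and human judgment. Result: Leading models such as Sora-2 reach 60% success rates on the tested tasks, and automated evaluation agrees closely with human judgment, indicating that the paradigm is scalable and reliable. Conclusion: Video generation models show early reasoning ability, and the proposed paradigm and open-source tooling provide a basis for improving it further through reinforcement learning. Abstract: We show that video generation models could reason now. Testing on tasks such as chess, maze, Sudoku, mental rotation, and Raven's Matrices, leading models such as Sora-2 achieve sixty percent success rates. We establish a robust experimental paradigm centered on the "Task Pair" design. We build a code framework, with 39 models available already, that supports this paradigm and allows for easy scaling - users can add models and tasks efficiently. We show our automated evaluation strongly correlates with human judgment, and therefore this paradigm is highly scalable. We see an opportunity, given the availability of our paradigm, to do reinforcement learning for improving reasoning in video models. You can check out all of our raw results (https://grow-ai-like-a-child.com/video-reason/) and our VMEvalKit codebase (https://github.com/hokindeng/VMEvalKit).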

[76] Adaptive Dataset Quantization: A New Direction for Dataset Pruning

Chenyue Yu,Jianyu Yu

Main category: cs.CV

TL;DR: This paper proposes a new dataset quantization method that reduces intra-sample redundancy to cut the storage and communication costs of large-scale datasets on edge devices, achieving significant data compression while maintaining model training performance.

Details Motivation: Resource-constrained edge devices face storage and communication overhead when handling large-scale datasets; traditional methods focus on inter-sample redundancy and overlook redundancy within samples. Method: Applies linear symmetric quantization to obtain an initial quantization range and scale, then introduces an adaptive quantization allocation algorithm that assigns quantization ratios according to each sample's precision requirements, compressing each image while keeping the total compression ratio constant. Result: Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K show that the method outperforms traditional quantization and dataset pruning baselines at the same compression ratio and maintains model training performance. Conclusion: The method is the first to represent datasets with limited bits to reduce storage, introduces a dataset-level adaptive quantization algorithm, and is validated on real datasets. Abstract: This paper addresses the challenges of storage and communication costs for large-scale datasets in resource-constrained edge devices by proposing a novel dataset quantization approach to reduce intra-sample redundancy. Unlike traditional dataset pruning and distillation methods that focus on inter-sample redundancy, the proposed method compresses each image by reducing redundant or less informative content within samples while preserving essential features. It first applies linear symmetric quantization to obtain an initial quantization range and scale for each sample. Then, an adaptive quantization allocation algorithm is introduced to distribute different quantization ratios for samples with varying precision requirements, maintaining a constant total compression ratio. The main contributions include: (1) being the first to use limited bits to represent datasets for storage reduction; (2) introducing a dataset-level quantization algorithm with adaptive ratio allocation; and (3) validating the method's effectiveness through extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K. Results show that the method maintains model training performance while achieving significant dataset compression, outperforming traditional quantization and dataset pruning baselines under the same compression ratios.
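
The first step named in this abstract, linear symmetric quantization with a per-sample range and scale, can be sketched in its textbook form. The formula below (scale set by the largest absolute value, integer grid symmetric around zero) is the standard definition and is assumed here rather than taken from the paper; the input is treated as zero-centred pixel values, which is also an assumption. The adaptive allocation of bit-widths across samples is not specified in the abstract and is not attempted.

```python
def symmetric_quantize(values, bits):
    """Quantize values to signed integers in [-qmax, qmax] with one shared scale.

    Returns the integer codes and the scale needed to reconstruct the values.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

# Toy "sample": a few zero-centred pixel intensities (hypothetical values).
sample = [-1.0, -0.4, 0.0, 0.3, 1.0]
codes, scale = symmetric_quantize(sample, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(sample, restored))
```

With 4 bits the grid has 15 levels, and the reconstruction error of any value is bounded by half a quantization step. Giving a sample more bits shrinks that step, which is the quantity an adaptive scheme would trade off across samples under a fixed total budget.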

[77] VG3T: Visual Geometry Grounded Gaussian Transformer

Junho Kim,Seongwon Lee

Main category: cs.CV

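TL;DR: Proposes VG3T, a semantic occupancy prediction network based on a multi-view 3D Gaussian representation that jointly predicts semantically attributed Gaussians for a consistent 3D scene representation, with notable gains in performance and efficiency on nuScenes.

Details Motivation: Existing methods struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance; inferring Gaussians from single views in particular causes inconsistency and density bias. Method: Proposes VG3T, a feed-forward network that directly predicts a set of semantically attributed 3D Gaussians in a joint, multi-view fashion, and introduces Grid-Based Sampling and Positional Refinement to mitigate the distance-dependent density bias of pixel-aligned Gaussian initialization. Result: On the nuScenes benchmark, uses 46% fewer primitives than the previous state of the art while improving mIoU by 1.7 percentage points. Conclusion: With a unified multi-view Gaussian representation, VG3T achieves more coherent and efficient 3D semantic occupancy prediction and offers a unified paradigm for modelling geometry and semantics jointly. Abstract: Generating a coherent 3D scene representation from multi-view images is a fundamental yet challenging task. Existing methods often struggle with multi-view fusion, leading to fragmented 3D representations and sub-optimal performance. To address this, we introduce VG3T, a novel multi-view feed-forward network that predicts a 3D semantic occupancy via a 3D Gaussian representation. Unlike prior methods that infer Gaussians from single-view images, our model directly predicts a set of semantically attributed Gaussians in a joint, multi-view fashion. This novel approach overcomes the fragmentation and inconsistency inherent in view-by-view processing, offering a unified paradigm to represent both geometry and semantics. We also introduce two key components, Grid-Based Sampling and Positional Refinement, to mitigate the distance-dependent density bias common in pixel-aligned Gaussian initialization methods. Our VG3T shows a notable 1.7%p improvement in mIoU while using 46% fewer primitives than the previous state-of-the-art on the nuScenes benchmark, highlighting its superior efficiency and performance.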

[78] EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head

Chang Liu,Tianjiao Jing,Chengcheng Ma,Xuanqi Zhou,Zhengxuan Lian,Qin Jin,Hongliang Yuan,Shi-Sheng Huang

Main category: cs.CV

TL;DR: This paper proposes EmoDiffTalk, an editable talking head model based on 3D Gaussian Splatting that uses an emotion-aware diffusion mechanism for fine-grained, multimodal emotion control, significantly improving the naturalness and controllability of expressions.

Details Motivation: Existing photo-realistic talking heads based on 3D Gaussian Splatting fall short in manipulating emotional expression, especially fine-grained and dynamic emotional editing under multimodal input. Method: Proposes Emotion-aware Gaussian Diffusion, comprising an action unit (AU) prompted diffusion process and a text-to-AU emotion controller, for fine-grained facial animation and text-driven dynamic emotional editing. Result: Experiments on the EmoTalk3D and RenderMe-360 datasets show superior emotional subtlety, lip-sync fidelity, and controllability. Conclusion: EmoDiffTalk is among the first 3D Gaussian Splatting talking-head generation frameworks to support continuous, multimodal emotional editing in the AU-based expression space, offering a path toward high-quality, diffusion-driven, editable 3D talking-head synthesis. Abstract: Recent photo-realistic 3D talking head via 3D Gaussian Splatting still has significant shortcoming in emotional expression manipulation, especially for fine-grained and expansive dynamics emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e. EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process for fine-grained facial animator, and moreover an accurate text-to-AU emotion controller to provide accurate and expansive dynamic emotional editing using text input. Experiments on public EmoTalk3D and RenderMe-360 datasets demonstrate superior emotional subtlety, lip-sync fidelity, and controllability of our EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synthesis. To our best knowledge, our EmoDiffTalk is one of the first few 3D Gaussian Splatting talking-head generation framework, especially supporting continuous, multimodal emotional editing within the AU-based expression space.

[79] Domain-Specific Foundation Model Improves AI-Based Analysis of Neuropathology

Ruchika Verma,Shrishtee Kandoi,Robina Afzal,Shengjia Chen,Jannes Jegminat,Michael W. Karlovich,Melissa Umphlett,Timothy E. Richardson,Kevin Clare,Quazi Hossain,Jorge Samanamud,Phyllis L. Faust,Elan D. Louis,Ann C. McKee,Thor D. Stein,Jonathan D. Cherry,Jesse Mez,Anya C. McGoldrick,Dalilah D. Quintana Mora,Melissa J. Nirenberg,Ruth H. Walker,Yolfrankcis Mendez,Susan Morgello,Dennis W. Dickson,Melissa E. Murray,Carlos Cordon-Cardo,Nadejda M. Tsankova,Jamie M. Walker,Diana K. Dangoor,Stephanie McQuillan,Emma L. Thorn,Claudia De Sanctis,Shuying Li,Thomas J. Fuchs,Kurt Farrell,John F. Crary,Gabriele Campanella

Main category: cs.CV

TL;DR: NeuroFM is a domain-specific foundation model trained on brain tissue whole-slide images to better capture neuropathology-specific features, outperforming general-purpose models in tasks like dementia classification and neurodegeneration segmentation.

Details Motivation: General foundation models in computational pathology are primarily trained on non-neurological surgical data, leading to a domain mismatch that limits their effectiveness in capturing critical morphological patterns in neurodegenerative diseases. Method: Developed NeuroFM, a foundation model specifically trained on whole-slide images of brain tissue covering diverse neurodegenerative pathologies, and evaluated it on neuropathology-specific downstream tasks. Result: NeuroFM outperformed general-purpose models in multiple neuropathology tasks, including mixed dementia classification, hippocampal region segmentation, and identification of cerebellar ataxias. Conclusion: Domain-specialized foundation models like NeuroFM are more effective than general models for neuropathology analysis, enabling improved AI-driven diagnosis and research in neurodegenerative diseases. Abstract: Foundation models have transformed computational pathology by providing generalizable representations from large-scale histology datasets. However, existing models are predominantly trained on surgical pathology data, which is enriched for non-nervous tissue and overrepresents neoplastic, inflammatory, metabolic, and other non-neurological diseases. Neuropathology represents a markedly different domain of histopathology, characterized by unique cell types (neurons, glia, etc.), distinct cytoarchitecture, and disease-specific pathological features including neurofibrillary tangles, amyloid plaques, Lewy bodies, and pattern-specific neurodegeneration. This domain mismatch may limit the ability of general-purpose foundation models to capture the morphological patterns critical for interpreting neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, and cerebellar ataxias. To address this gap, we developed NeuroFM, a foundation model trained specifically on whole-slide images of brain tissue spanning diverse neurodegenerative pathologies. 
NeuroFM demonstrates superior performance compared to general-purpose models across multiple neuropathology-specific downstream tasks, including mixed dementia disease classification, hippocampal region segmentation, and neurodegenerative ataxia identification encompassing cerebellar essential tremor and spinocerebellar ataxia subtypes. This work establishes that domain-specialized foundation models trained on brain tissue can better capture neuropathology-specific features than models trained on general surgical pathology datasets. By tailoring foundation models to the unique morphological landscape of neurodegenerative diseases, NeuroFM enables more accurate and reliable AI-based analysis for brain disease diagnosis and research, setting a precedent for domain-specific model development in specialized areas of digital pathology.

[80] FishDetector-R1: Unified MLLM-Based Framework with Reinforcement Fine-Tuning for Weakly Supervised Fish Detection, Segmentation, and Counting

Yi Liu,Jingyu Song,Vedanth Kallakuri,Katherine A. Skinner

Main category: cs.CV

TL;DR: FishDetector-R1 is a unified framework based on multimodal large language models (MLLMs) for underwater fish detection, segmentation, and counting under weak supervision, with substantial performance gains and cross-domain robustness.

Details Motivation: Analyzing underwater fish imagery is critical for ecological monitoring, but poor image quality and costly annotation make it difficult for existing methods. Method: Proposes the FishDetector-R1 framework, which combines a novel detect-to-count prompt with Reinforcement Learning from Verifiable Reward (RLVR) and uses sparse point labels for weakly supervised learning. Result: On the DeepFish dataset, AP improves by 20% and mIoU by 10%, while MAE falls by 30% and GAME by 35%; the framework also generalizes well to other underwater datasets. Conclusion: FishDetector-R1 achieves efficient, scalable marine visual understanding through weak supervision and provides a reliable solution for ecological monitoring. Abstract: Analyzing underwater fish imagery is critical for ecological monitoring but remains difficult due to visual degradation and costly annotations. We introduce FishDetector-R1, a unified MLLM-based framework for fish detection, segmentation, and counting under weak supervision. On the DeepFish dataset, our framework achieves substantial gains over baselines, improving AP by 20% and mIoU by 10%, while reducing MAE by 30% and GAME by 35%. These improvements stem from two key components: a novel detect-to-count prompt that enforces spatially consistent detections and counts, and Reinforcement Learning from Verifiable Reward (RLVR) with a complementary scalable paradigm leveraging sparse point labels. Ablation studies further validate the effectiveness of this reward design. Moreover, the improvement generalizes well to other underwater datasets, confirming strong cross-domain robustness. Overall, FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision. The project page for FishDetector-R1 is https://umfieldrobotics.github.io/FishDetector-R1.

[81] PrunedCaps: A Case For Primary Capsules Discrimination

Ramin Sharifi,Pouya Shiri,Amirali Baniasadi

Main category: cs.CV

TL;DR: This paper studies the feasibility of pruning Primary Capsules in Capsule Networks and shows that removing 95% of Capsules makes the model up to 9.90 times faster and saves more than 95.36% of floating-point operations in the dynamic routing stage, without a loss of accuracy.

Details Motivation: Capsule Networks are computationally expensive and slow to train and test, so this paper explores pruning Primary Capsules to improve their efficiency. Method: Prunes the Primary Capsules of CapsNet on the MNIST, Fashion-MNIST, CIFAR-10, and SVHN datasets and evaluates accuracy, speed, and computational cost. Result: The pruned CapsNet runs up to 9.90 times faster, removes 95% of Capsules, and saves more than 95.36% of floating-point operations in the dynamic routing stage without a loss of accuracy; datasets respond differently to pruning. Conclusion: Pruning Primary Capsules is an effective way to improve CapsNet efficiency, substantially reducing computational cost while preserving accuracy, with potential for practical use. Abstract: Capsule Networks (CapsNets) are a generation of image classifiers with proven advantages over Convolutional Neural Networks (CNNs). Better robustness to affine transformation and overlapping image detection are some of the benefits associated with CapsNets. However, CapsNets cannot be classified as resource-efficient deep learning architecture due to the high number of Primary Capsules (PCs). In addition, CapsNets' training and testing are slow and resource hungry. This paper investigates the possibility of Primary Capsules pruning in CapsNets on MNIST handwritten digits, Fashion-MNIST, CIFAR-10, and SVHN datasets. We show that a pruned version of CapsNet performs up to 9.90 times faster than the conventional architecture by removing 95 percent of Capsules without a loss of accuracy. Also, our pruned architecture saves on more than 95.36 percent of floating-point operations in the dynamic routing stage of the architecture. Moreover, we provide insight into why some datasets benefit significantly from pruning while others fall behind.

[82] Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization

Xuefei Wang,Kai A. Horstmann,Ethan Lin,Jonathan Chen,Alexander R. Farhang,Sophia Stiles,Atharva Sehgal,Jonathan Light,David Van Valen,Yisong Yue,Jennifer J. Sun

Main category: cs.CV

TL;DR: Proposes using AI agents to automate code adaptation, addressing the "last mile" gap between scientific datasets and existing computer vision tools; experiments show that a simple agent framework outperforms human-expert solutions.

Details Motivation: Adapting existing computer vision tools to specific scientific datasets is a bottleneck, and conventional fine-tuning and manual coding are costly and impractical. Method: Designs a systematic evaluation framework to study agentic code optimization on three production-level biomedical imaging pipelines and compares different agent architectures. Result: Adaptation code generated by a simple agent framework outperforms human-expert solutions across tasks, and complex agent architectures are not always better. Conclusion: Provides a practical roadmap for designing AI agents for scientific data analysis and validates real-world potential through an open-source framework and a production deployment. Abstract: Adapting production-level computer vision tools to bespoke scientific datasets is a critical "last mile" bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.

[83] Fast and Flexible Robustness Certificates for Semantic Segmentation

Thomas Massena,Corentin Friedrich,Franck Mamalet,Mathieu Serrurier

Main category: cs.CV

TL;DR: This paper proposes a class of certifiably robust semantic segmentation networks with built-in Lipschitz constraints that train efficiently and reach competitive pixel accuracy, enabling real-time compatible certifiably robust semantic segmentation for the first time.

Details Motivation: Research on adversarially robust deep learning focuses mainly on classification, and few efficient certification procedures exist for semantic segmentation. Method: Introduces semantic segmentation networks with built-in Lipschitz constraints, proposes a general framework for robustness certificates, and uses Lipschitz networks to make certification efficient. Result: Achieves competitive pixel accuracy on challenging datasets such as Cityscapes, certifies around 600 times faster than randomized smoothing, and can compute worst-case performance bounds under $\ell_2$ attacks. Conclusion: The method enables real-time certifiably robust semantic segmentation for the first time, with efficient, tight certificates that apply to a range of performance measures. Abstract: Deep Neural Networks are vulnerable to small perturbations that can drastically alter their predictions for perceptually unchanged inputs. The literature on adversarially robust Deep Learning attempts to either enhance the robustness of neural networks (e.g., via adversarial training) or to certify their decisions up to a given robustness level (e.g., by using randomized smoothing, formal methods or Lipschitz bounds). These studies mostly focus on classification tasks and few efficient certification procedures currently exist for semantic segmentation. In this work, we introduce a new class of certifiably robust Semantic Segmentation networks with built-in Lipschitz constraints that are efficiently trainable and achieve competitive pixel accuracy on challenging datasets such as Cityscapes. Additionally, we provide a novel framework that generalizes robustness certificates for semantic segmentation tasks, where we showcase the flexibility and computational efficiency of using Lipschitz networks. Our approach unlocks real-time compatible certifiably robust semantic segmentation for the first time. Moreover, it allows the computation of worst-case performance under $\ell_2$ attacks of radius $ε$ across a wide range of performance measures. Crucially, we benchmark the runtime of our certification process and find our approach to be around 600 times faster than randomized smoothing methods at inference with comparable certificates on an NVIDIA A100 GPU. Finally, we evaluate the tightness of our worst-case certificates against state-of-the-art adversarial attacks to further validate the performance of our method.
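
How a Lipschitz bound turns into a per-pixel certificate can be sketched with the standard margin argument: if the map from input to logits is L-Lipschitz in the $\ell_2$ norm, a prediction cannot change under any perturbation smaller than the logit margin divided by sqrt(2)·L. The abstract does not give the paper's own certificate, so this classical bound is an assumption used for illustration, as are the function names and the toy logits.

```python
import math

def certified_radius(logits, lipschitz_const=1.0):
    """L2 radius within which the top class provably stays on top.

    Uses the margin bound radius = (top - runner_up) / (sqrt(2) * L),
    assuming the logit map is L-Lipschitz in the L2 norm.
    """
    top, runner_up = sorted(logits, reverse=True)[:2]
    return (top - runner_up) / (math.sqrt(2) * lipschitz_const)

def certified_pixel_accuracy(pixel_logits, labels, eps, lipschitz_const=1.0):
    """Fraction of pixels that are correct and certified at radius eps."""
    certified = 0
    for logits, label in zip(pixel_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        if predicted == label and certified_radius(logits, lipschitz_const) >= eps:
            certified += 1
    return certified / len(labels)

# Three toy pixels with three classes each (hypothetical logits and labels).
pixel_logits = [[3.0, 1.0, 0.0], [0.2, 0.1, 0.0], [0.0, 2.0, 1.9]]
labels = [0, 0, 2]

acc_small_eps = certified_pixel_accuracy(pixel_logits, labels, eps=0.05)
acc_large_eps = certified_pixel_accuracy(pixel_logits, labels, eps=0.5)
```

The certificate needs only one forward pass and a sort per pixel, which is consistent with the large speed advantage over sampling-based randomized smoothing that the abstract reports. The worst-case value of other metrics at radius ε can be bounded in the same spirit by treating every uncertified pixel as misclassified.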

[84] High-Throughput Unsupervised Profiling of the Morphology of 316L Powder Particles for Use in Additive Manufacturing

Emmanuel Akeweje,Conall Kirk,Chi-Wai Chan,Denis Dowling,Mimi Zhang

Main category: cs.CV

TL;DR: Presents an automated framework that couples high-throughput imaging with machine learning for large-scale analysis of metal powder morphology; Fourier descriptors with k-means clustering perform best, enabling rapid automated assessment of powder morphology and supporting real-time feedstock monitoring in SLM processes.

Details Motivation: Conventional powder characterization methods are low-throughput and qualitative and fail to capture the heterogeneity of industrial-scale batches, limiting efficient assessment of feedstock morphology in Selective Laser Melting (SLM). Method: Develops and evaluates three clustering pipelines (an autoencoder pipeline, a shape-descriptor pipeline, and a functional-data pipeline) that couple high-throughput imaging with shape extraction for unsupervised clustering of about 126,000 metal powder images. Result: On the dataset of about 126,000 powder images, the Fourier-descriptor + k-means pipeline shows the best internal validity metrics (lowest Davies-Bouldin index and highest Calinski-Harabasz score) with sub-millisecond runtime per particle. Conclusion: The unsupervised learning framework enables rapid automated assessment of metal powder morphology, supports tracking of shape evolution across reuse cycles, and offers a path toward real-time feedstock monitoring in SLM workflows. Abstract: Selective Laser Melting (SLM) is a powder-bed additive manufacturing technique whose part quality depends critically on feedstock morphology. However, conventional powder characterization methods are low-throughput and qualitative, failing to capture the heterogeneity of industrial-scale batches. We present an automated, machine learning framework that couples high-throughput imaging with shape extraction and clustering to profile metallic powder morphology at scale. We develop and evaluate three clustering pipelines: an autoencoder pipeline, a shape-descriptor pipeline, and a functional-data pipeline. Across a dataset of approximately 126,000 powder images (0.5-102 micrometer diameter), internal validity metrics identify the Fourier-descriptor + k-means pipeline as the most effective, achieving the lowest Davies-Bouldin index and highest Calinski-Harabasz score while maintaining sub-millisecond runtime per particle on a standard desktop workstation. Although the present work focuses on establishing the morphological-clustering framework, the resulting shape groups form a basis for future studies examining their relationship to flowability, packing density, and SLM part quality. Overall, this unsupervised learning framework enables rapid, automated assessment of powder morphology and supports tracking of shape evolution across reuse cycles, offering a path toward real-time feedstock monitoring in SLM workflows.
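
The shape-descriptor side of the best-performing pipeline can be illustrated with a minimal Fourier descriptor. The sketch treats a particle outline as a complex signal, takes its discrete Fourier coefficients, and normalizes them so that position, size, and rotation drop out. The abstract does not specify the paper's normalization or number of harmonics, so both are assumptions here, as is the counterclockwise ordering of the contour and the use of ellipses as toy outlines.

```python
import cmath
import math

def fourier_descriptor(contour, max_harmonic=4):
    """Translation-, scale- and rotation-invariant shape vector for a closed contour.

    contour: ordered (x, y) boundary points, assumed to run counterclockwise.
    """
    n = len(contour)
    z = [complex(x, y) for x, y in contour]

    def coeff(k):
        # k-th discrete Fourier coefficient of the complex boundary signal
        return sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n

    size = abs(coeff(1))  # dominant harmonic of a counterclockwise outline
    # Skip k=0 (position) and k=1 (used for scale); magnitudes discard rotation.
    harmonics = [k for k in range(-max_harmonic, max_harmonic + 1) if k not in (0, 1)]
    return [abs(coeff(k)) / size for k in harmonics]

def ellipse(a, b, n=64, cx=0.0, cy=0.0):
    """Sample n points counterclockwise on an axis-aligned ellipse (toy outline)."""
    return [(cx + a * math.cos(2 * math.pi * t / n), cy + b * math.sin(2 * math.pi * t / n))
            for t in range(n)]

round_particle = fourier_descriptor(ellipse(1.0, 1.0))
big_shifted_round = fourier_descriptor(ellipse(3.0, 3.0, cx=5.0, cy=-2.0))
elongated = fourier_descriptor(ellipse(2.0, 1.0))
```

The two circular outlines produce the same descriptor despite different size and position, while the elongated outline differs in the k = -1 harmonic. Vectors like these, one per particle, are what a k-means step would then group into morphology clusters; the short fixed-length vector is also a plausible reason for the low per-particle runtime the abstract reports.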

[85] VAT: Vision Action Transformer by Unlocking Full Representation of ViT

Wenhao Li,Chengwei Ma,Weixin Mao

Main category: cs.CV

TL;DR: Proposes the Vision Action Transformer (VAT), which uses features from all ViT layers to fuse perception and action generation deeply, achieving state-of-the-art performance in robot imitation learning.

Details Motivation: Existing methods use only the final-layer features of ViT and discard valuable intermediate information, giving an insufficient representation. VAT aims to use ViT's full feature hierarchy to improve robot policy learning. Method: Extends ViT with specialized action tokens that propagate across layers and are fused deeply and progressively with the visual features of each layer, enabling end-to-end learning from perception to action generation. Result: Achieves a 98.15% average success rate across four LIBERO simulated manipulation benchmarks, surpassing prior methods such as OpenVLA-OFT and setting a new state of the art. Conclusion: Using the complete "representation trajectory" of vision models is critical for improving robotic policies, and VAT provides a strong architecture for efficient imitation learning. Abstract: In robot learning, Vision Transformers (ViTs) are standard for visual perception, yet most methods discard valuable information by using only the final layer's features. We argue this provides an insufficient representation and propose the Vision Action Transformer (VAT), a novel architecture that is extended from ViT and unlocks the full feature hierarchy of ViT. VAT processes specialized action tokens with visual features across all transformer layers, enabling a deep and progressive fusion of perception and action generation. On a suite of simulated manipulation tasks, VAT achieves a 98.15% average success rate across four LIBERO benchmarks, establishing a new state-of-the-art by outperforming prior methods like OpenVLA-OFT. Our work presents not only a powerful model for imitation learning but also demonstrates the critical importance of leveraging the complete "representation trajectory" of vision models to advance robotic policy. The GitHub URL for the project code is https://github.com/sellerbubble/VAT.

[86] Benchmarking CXR Foundation Models With Publicly Available MIMIC-CXR and NIH-CXR14 Datasets

Jiho Shin,Dominic Marshall,Matthieu Komorowski

Main category: cs.CV

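TL;DR: This paper compares two large-scale chest X-ray embedding models (CXR-Foundation and MedImageInsight) on public datasets, using unified preprocessing and fixed downstream classifiers to ensure reproducibility. MedImageInsight performs slightly better on most tasks, while CXR-Foundation shows stronger cross-dataset stability. Unsupervised clustering of MedImageInsight embeddings also reveals disease-specific structure consistent with the quantitative results. The study highlights the importance of standardized evaluation of medical foundation models and establishes reproducible baselines for future multimodal and clinical integration research.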

Details Motivation: Although recent foundation models perform well in medical image representation learning, their comparative behaviour across datasets has not been studied systematically. This paper aims to fill that gap with a reproducible benchmark evaluation. Method: Using a unified preprocessing pipeline and fixed downstream LightGBM classifiers, embeddings are extracted directly from two pre-trained encoders (CXR-Foundation and MedImageInsight) and used for multi-disease label classification on the MIMIC-CXR and NIH ChestX-ray14 datasets. Performance is evaluated with AUROC and F1-score (with 95% confidence intervals), and the embeddings are also analysed with unsupervised clustering. Result: MedImageInsight performs slightly better on most tasks, while CXR-Foundation shows stronger cross-dataset stability. Unsupervised clustering shows that MedImageInsight embeddings have disease-related structure consistent with the quantitative results. Conclusion: Medical foundation models need a standardized evaluation framework. This study provides reproducible baseline results that support future multimodal and clinical research. Abstract: Recent foundation models have demonstrated strong performance in medical image representation learning, yet their comparative behaviour across datasets remains underexplored. This work benchmarks two large-scale chest X-ray (CXR) embedding models (CXR-Foundation (ELIXR v2.0) and MedImageInsight) on the public MIMIC-CXR and NIH ChestX-ray14 datasets. Each model was evaluated using a unified preprocessing pipeline and fixed downstream classifiers to ensure reproducible comparison. We extracted embeddings directly from pre-trained encoders, trained lightweight LightGBM classifiers on multiple disease labels, and reported mean AUROC and F1-score with 95% confidence intervals. MedImageInsight achieved slightly higher performance across most tasks, while CXR-Foundation exhibited strong cross-dataset stability. Unsupervised clustering of MedImageInsight embeddings further revealed a coherent disease-specific structure consistent with quantitative results. The results highlight the need for standardised evaluation of medical foundation models and establish reproducible baselines for future multimodal and clinical integration studies.
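To make the "AUROC with 95% confidence intervals" reporting concrete, here is a minimal sketch in plain Python. The abstract does not say how the intervals were computed. A percentile bootstrap is assumed here as one common choice, and the function names, `n_boot`, and `seed` are illustrative rather than taken from the paper.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auroc_with_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) using a percentile bootstrap."""
    point = auroc(labels, scores)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # resample lost one class, so draw again
            continue
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


# Hypothetical toy data: two negatives and two positives.
point, lower, upper = auroc_with_ci([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On real benchmarks the same routine would be applied per disease label to the classifier's predicted probabilities on the held-out split.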

[87] PrefGen: Multimodal Preference Learning for Preference-Conditioned Image Generation

Wenyi Mo,Tianyu Zhang,Yalong Bai,Ligong Han,Ying Ba,Dimitris N. Metaxas

Main category: cs.CV

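TL;DR: This paper proposes a preference-conditioned image generation framework based on multimodal large language models (MLLMs). It extracts fine-grained user preference representations through a preference-oriented visual question answering task and inter-user/intra-user discrimination probing tasks, and designs a maximum mean discrepancy alignment loss for compatibility with diffusion text encoders, enabling more accurate personalized image generation.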

Details Motivation: Existing methods fall short in capturing users' nuanced aesthetic preferences or in encoding personalized visual signals, so they struggle to satisfy generation conditioned on both the text prompt and personal preference. Method: An MLLM is used to extract user preference representations and is trained on a preference-oriented visual question answering task. Two probing tasks, inter-user and intra-user discrimination, are introduced to isolate preference-relevant features. A maximum mean discrepancy-based alignment loss makes the MLLM output embeddings compatible with the diffusion model's text encoder so they can be used for conditional generation. Result: Experiments show that the method substantially outperforms strong baselines in both image quality and preference alignment, improving personalized generation. Conclusion: Fine-grained preference representation extraction combined with a cross-modal alignment strategy substantially improves diffusion models on personalized image generation, confirming the key role of representation learning and alignment in this task. Abstract: Preference-conditioned image generation seeks to adapt generative models to individual users, producing outputs that reflect personal aesthetic choices beyond the given textual prompt. Despite recent progress, existing approaches either fail to capture nuanced user preferences or lack effective mechanisms to encode personalized visual signals. In this work, we propose a multimodal framework that leverages multimodal large language models (MLLMs) to extract rich user representations and inject them into diffusion-based image generation. We train the MLLM with a preference-oriented visual question answering task to capture fine-grained semantic cues. To isolate preference-relevant features, we introduce two complementary probing tasks: inter-user discrimination to distinguish between different users, and intra-user discrimination to separate liked from disliked content. To ensure compatibility with diffusion text encoders, we design a maximum mean discrepancy-based alignment loss that bridges the modality gap while preserving multimodal structure. The resulting embeddings are used to condition the generator, enabling faithful adherence to both prompts and user preferences. Extensive experiments demonstrate that our method substantially outperforms strong baselines in both image quality and preference alignment, highlighting the effectiveness of representation extraction and alignment for personalized generation.
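For readers unfamiliar with maximum mean discrepancy (MMD), the sketch below shows the standard biased estimate of squared MMD with a Gaussian (RBF) kernel between two sets of embedding vectors. It illustrates the general quantity only. The kernel, bandwidth `sigma`, and function names are assumptions, and the paper's actual alignment loss may differ.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    """
    k_xx = sum(rbf_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf_kernel(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2 * k_xy


# Hypothetical 2-D "embeddings": one set near the reference, one far away.
reference = [(0.0, 0.0), (1.0, 0.0)]
near = [(0.1, 0.0), (1.1, 0.0)]
far = [(5.0, 5.0), (6.0, 5.0)]
gap_near = mmd_squared(reference, near)
gap_far = mmd_squared(reference, far)
```

Minimizing such a term pulls the distribution of one embedding set toward the other without requiring paired samples, which is why it suits aligning MLLM outputs with a text encoder's embedding space.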

[88] Neural reconstruction of 3D ocean wave hydrodynamics from camera sensing

Jiabin Liu,Zihao Zhou,Jialei Yan,Anxin Guo,Alvise Benetazzo,Hui Li

Main category: cs.CV

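TL;DR: Proposes a neural network for visual reconstruction of wave free surfaces based on an attention-augmented pyramid architecture, achieving accurate and fast reconstruction of 3D wave surfaces and nonlinear velocity fields.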

Details Motivation: To address the high computational cost of dense visual reconstruction in long-term ocean wave observation and the challenges posed by persistent visual occlusions. Method: An attention-augmented pyramid neural network architecture is designed and combined with physics-based constraints for time-resolved reconstruction of nonlinear 3D velocity fields. Result: Under real-sea conditions the method achieves millimetre-level wave elevation prediction, dominant-frequency errors below 0.01 Hz, and precise estimation of high-frequency spectral power laws, and it completes dense reconstruction of two million points in 1.35 s. Conclusion: The model outperforms conventional visual reconstruction methods and generalizes well under occlusion, owing to its global multi-scale attention and its learned encoding of wave propagation dynamics. Abstract: Precise three-dimensional (3D) reconstruction of wave free surfaces and associated velocity fields is essential for developing a comprehensive understanding of ocean physics. To address the high computational cost of dense visual reconstruction in long-term ocean wave observation tasks and the challenges introduced by persistent visual occlusions, we propose a wave free-surface visual reconstruction neural network, which is designed as an attention-augmented pyramid architecture tailored to the multi-scale and temporally continuous characteristics of wave motions. Using physics-based constraints, we perform time-resolved reconstruction of nonlinear 3D velocity fields from the evolving free-surface boundary. Experiments under real-sea conditions demonstrate millimetre-level wave elevation prediction in the central region, dominant-frequency errors below 0.01 Hz, precise estimation of high-frequency spectral power laws, and high-fidelity 3D reconstruction of nonlinear velocity fields, while enabling dense reconstruction of two million points in only 1.35 s. Built on a stereo-vision dataset, the model outperforms conventional visual reconstruction approaches and maintains strong generalization in occluded conditions, owing to its global multi-scale attention and its learned encoding of wave propagation dynamics.

[89] The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

Ranjan Sapkota,Konstantinos I. Roumeliotis,Manoj Karkee

Main category: cs.CV

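TL;DR: This paper analyses the fundamental discontinuity between the segmentation paradigms of SAM2 and SAM3. It argues that SAM2's geometric segmentation based on spatial prompts does not transfer to SAM3's multimodal concept-driven framework, systematically describes their differences along five axes (concept, architecture, data, training, and evaluation), and establishes SAM3 as a new class of segmentation foundation model.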

Details Motivation: To understand why the technical leap from SAM2 to SAM3 invalidates existing prompt-based expertise, and to clarify the essential changes and future direction of the new concept-driven segmentation paradigm. Method: A comparative analysis along five dimensions: the conceptual break, architectural divergence, dataset and annotation differences, training and hyperparameter distinctions, and the evolution of evaluation metrics and failure modes. Result: The analysis shows that SAM3 introduces new capabilities such as multimodal fusion, semantic reasoning, and exemplar-based learning, forming a unified vision-language architecture distinct from SAM2 that requires new training and evaluation regimes. Conclusion: SAM3 marks a paradigm shift from geometric segmentation to concept-driven segmentation and constitutes a new class of segmentation foundation model. Future research should focus on open vocabulary, semantic understanding, and multimodal collaboration. Abstract: This paper investigates the fundamental discontinuity between the latest two Segment Anything Models: SAM2 and SAM3. We explain why the expertise in prompt-based segmentation of SAM2 does not transfer to the multimodal concept-driven paradigm of SAM3. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) a Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting spatial prompt semantics of SAM2 with multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing pure vision-temporal design of SAM2 versus integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity-handling via Mixture-of-Experts in SAM3; (3) Dataset and Annotation Differences, contrasting SA-V video masks with multimodal concept-annotated corpora of SAM3; (4) Training and Hyperparameter Distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) Evaluation, Metrics, and Failure Modes, outlining the transition from geometric IoU metrics to semantic, open-vocabulary evaluation. Together, these analyses establish SAM3 as a new class of segmentation foundation model and chart future directions for the emerging concept-driven segmentation era.

[90] Representation Learning for Point Cloud Understanding

Siming Yan

Main category: cs.CV

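TL;DR: This dissertation discusses applications of 3D data across several fields, focusing on point cloud representation learning, self-supervised learning, and transfer learning from 2D to 3D, and proposes integrating pre-trained 2D models to improve 3D understanding.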

Details Motivation: Improving 3D data understanding, especially point cloud segmentation and representation learning, requires effective use of the knowledge in existing 2D models. Method: Proposes integrating pre-trained 2D models into 3D network training to transfer knowledge from 2D to 3D, and explores both self-supervised and supervised learning strategies. Result: Experiments show that the proposed methods significantly improve 3D point cloud representation learning and confirm the value of 2D knowledge for 3D tasks. Conclusion: Incorporating 2D prior knowledge effectively strengthens 3D understanding and offers a new approach to point cloud representation learning. Abstract: With the rapid advancement of technology, 3D data acquisition and utilization have become increasingly prevalent across various fields, including computer vision, robotics, and geospatial analysis. 3D data, captured through methods such as 3D scanners, LiDARs, and RGB-D cameras, provides rich geometric, shape, and scale information. When combined with 2D images, 3D data offers machines a comprehensive understanding of their environment, benefiting applications like autonomous driving, robotics, remote sensing, and medical treatment. This dissertation focuses on three main areas: supervised representation learning for point cloud primitive segmentation, self-supervised learning methods, and transfer learning from 2D to 3D. Our approach, which integrates pre-trained 2D models to support 3D network training, significantly improves 3D understanding without merely transforming 2D data. Extensive experiments validate the effectiveness of our methods, showcasing their potential to advance point cloud representation learning by effectively integrating 2D knowledge.

[91] EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing

Runjia Li,Moayed Haji-Ali,Ashkan Mirzaei,Chaoyang Wang,Arpit Sahni,Ivan Skorokhodov,Aliaksandr Siarohin,Tomas Jakab,Junlin Han,Sergey Tulyakov,Philip Torr,Willi Menapace

Main category: cs.CV

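TL;DR: This paper presents EgoEdit, an ecosystem for egocentric video editing that addresses the limitations of existing methods under rapid egomotion and hand-object interaction, supports real-time streaming inference, and comes with a dedicated dataset, EgoEditData, and an evaluation benchmark, EgoEditBench.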

Details Motivation: Existing video editing methods mainly target third-person footage. They perform poorly on egocentric views because of rapid egomotion and frequent hand-object interactions, and offline processing has high latency that makes real-time interactive applications difficult. Method: The authors build EgoEditData, a dataset dedicated to egocentric video editing, develop EgoEdit, an editing model that supports instruction following and real-time streaming inference on a single GPU, and propose EgoEditBench, which evaluates instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Result: EgoEdit clearly outperforms existing methods on egocentric editing tasks while matching the strongest baselines on general editing tasks, with low latency, high temporal stability, and instruction-faithful results. Conclusion: EgoEdit provides an effective solution for egocentric video editing and advances interactive applications such as AR. Its dataset and benchmark will be released to support follow-up research. Abstract: We study instruction-guided editing of egocentric videos for interactive AR applications. While recent AI video editors perform well on third-person footage, egocentric views present unique challenges (including rapid egomotion and frequent hand-object interactions) that create a significant domain gap. Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. First, we construct EgoEditData, a carefully designed and manually curated dataset for egocentric editing scenarios, featuring rich hand-object interactions while explicitly preserving hands. Second, we develop EgoEdit, an instruction-following egocentric video editor that supports real-time streaming inference on a single GPU. Finally, we introduce EgoEditBench, an evaluation suite targeting instruction faithfulness, hand and interaction preservation, and temporal stability under egomotion. Across both egocentric and general editing tasks, EgoEdit produces temporally stable, instruction-faithful results with interactive latency. It achieves clear gains on egocentric editing benchmarks, where existing methods struggle, while maintaining performance comparable to the strongest baselines on general editing tasks. EgoEditData and EgoEditBench will be made public for the research community. See our website at https://snap-research.github.io/EgoEdit

[92] Shoot-Bounce-3D: Single-Shot Occlusion-Aware 3D from Lidar by Decomposing Two-Bounce Light

Tzofi Klinghoffer,Siddharth Somasundaram,Xiaoyu Xiang,Yuchen Fan,Christian Richardt,Akshat Dave,Ramesh Raskar,Rakesh Ranjan

Main category: cs.CV

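TL;DR: This paper proposes a data-driven method based on single-photon lidar for reconstructing complex 3D scenes with occlusions and specular reflections from a single measurement. It builds the first large-scale simulated dataset of lidar transients for indoor scenes (about 100k samples), learns a prior on complex light transport, and decomposes the two-bounce light produced when multiple laser spots illuminate the scene simultaneously, thereby recovering dense depth, occluded geometry, and material properties. The method is validated experimentally.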

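Details Motivation: Reconstructing a 3D scene with occluded regions and specular materials (such as mirrors) from a single measurement is very challenging, and traditional methods struggle with complex light paths such as multiple reflections, shadows, and specular highlights. Existing single-photon lidar work is mostly limited to point-by-point scanning illumination, which does not meet practical needs. A method is therefore needed that can resolve complex light transport under simultaneous multi-point illumination. Method: A data-driven method is proposed to invert complex light transport in single-photon lidar. A large simulated dataset of about 100k lidar transients for indoor scenes is built to train a model that learns a prior on multi-path light transport. The model decomposes measured two-bounce light into the contribution of each laser spot, from which the full 3D geometry is inferred. Result: The method recovers the 3D structure of indoor scenes containing occlusions and mirrors from a single measurement under simultaneous multi-point illumination. Experiments confirm that it works in real scenes, resolving complex light paths and reconstructing geometric detail. Conclusion: In the more practical and more challenging setting of simultaneous multi-point illumination, a data-driven method combined with a large-scale simulated dataset can decompose multi-path light signals and achieve high-quality 3D scene reconstruction, particularly in complex environments with occlusions and specular reflections. Abstract: 3D scene reconstruction from a single measurement is challenging, especially in the presence of occluded regions and specular materials, such as mirrors. We address these challenges by leveraging single-photon lidars. These lidars estimate depth from light that is emitted into the scene and reflected directly back to the sensor. However, they can also measure light that bounces multiple times in the scene before reaching the sensor. This multi-bounce light contains additional information that can be used to recover dense depth, occluded geometry, and material properties. Prior work with single-photon lidar, however, has only demonstrated these use cases when a laser sequentially illuminates one scene point at a time. We instead focus on the more practical (and more challenging) scenario of illuminating multiple scene points simultaneously. The complexity of light transport due to the combined effects of multiplexed illumination, two-bounce light, shadows, and specular reflections is challenging to invert analytically. Instead, we propose a data-driven method to invert light transport in single-photon lidar. To enable this approach, we create the first large-scale simulated dataset of ~100k lidar transients for indoor scenes. We use this dataset to learn a prior on complex light transport, enabling measured two-bounce light to be decomposed into the constituent contributions from each laser spot. 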
Finally, we experimentally demonstrate how this decomposed light can be used to infer 3D geometry in scenes with occlusions and mirrors from a single measurement. Our code and dataset are released at https://shoot-bounce-3d.github.io.

[93] BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving

Karthik Mohan,Sonam Singh,Amit Arvind Kale

Main category: cs.CV

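TL;DR: BeLLA is an end-to-end architecture that combines a unified 360° bird's-eye-view (BEV) representation with a large language model to improve scene understanding and spatial reasoning for vision-language models in autonomous driving. It outperforms existing approaches on spatial-reasoning questions in the NuScenes-QA and DriveLM benchmarks.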

Details Motivation: Existing vision-language models for autonomous driving mostly rely on single-view features or aggregated multi-view features. They lack a unified spatial representation, which makes it hard to reason about ego-centric directions, object relations, and context. Method: The BeLLA architecture connects unified 360° BEV features generated from multiple cameras with a large language model, enabling end-to-end joint vision-language reasoning and answering of complex spatial and behavioral questions. Result: On the NuScenes-QA and DriveLM benchmarks, BeLLA improves by up to +9.3% on tasks that require spatial reasoning (such as relative position judgment and understanding the behavior of nearby objects) and is competitive on other question types. Conclusion: A unified BEV representation strengthens the spatial awareness of multimodal language models and improves understanding and question answering for complex driving scenes. Abstract: The rapid development of Vision-Language models (VLMs) and Multimodal Language Models (MLLMs) in autonomous driving research has significantly reshaped the landscape by enabling richer scene understanding, context-aware reasoning, and more interpretable decision-making. However, much existing work either relies on single-view encoders that fail to exploit the spatial structure of multi-camera systems or operates on aggregated multi-view features, which lack a unified spatial representation, making it more challenging to reason about ego-centric directions, object relations, and the wider context. We thus present BeLLA, an end-to-end architecture that connects unified 360° BEV representations with a large language model for question answering in autonomous driving. We primarily evaluate our work using two benchmarks, NuScenes-QA and DriveLM, where BeLLA consistently outperforms existing approaches on questions that require greater spatial reasoning, such as those involving relative object positioning and behavioral understanding of nearby objects, achieving up to +9.3% absolute improvement in certain tasks. In other categories, BeLLA performs competitively, demonstrating the capability of handling a diverse range of questions.

[94] SpectraIrisPAD: Leveraging Vision Foundation Models for Spectrally Conditioned Multispectral Iris Presentation Attack Detection

Raghavendra Ramachandra,Sushma Venkatesh

Main category: cs.CV

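TL;DR: This paper proposes SpectraIrisPAD, a new iris presentation attack detection framework based on multispectral imaging and a DINOv2 Vision Transformer, and releases MSIrPAD, a multispectral iris dataset covering multiple attack types. Experiments show strong generalization and detection performance on unseen attacks.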

Details Motivation: As iris recognition is deployed more widely in real-world settings, the security threat from presentation attacks (PAs) is growing. Conventional systems operating in the near-infrared band struggle with diverse spoofing techniques, so presentation attack detection (PAD) methods need better robustness and generalization. Method: The SpectraIrisPAD framework uses a DINOv2 Vision Transformer backbone with learnable spectral positional encoding, token fusion, and contrastive learning. It extracts discriminative band-specific features from multispectral images at five near-infrared wavelengths (800 nm, 830 nm, 850 nm, 870 nm, 980 nm) to distinguish bona fide irises from spoofing attacks. Result: On the newly collected multispectral iris PAD dataset MSIrPAD (18,848 images, eight attack categories), the method outperforms several state-of-the-art baselines under unseen-attack evaluation protocols and achieves the best results on all performance metrics. Conclusion: By combining multispectral information with a modern deep learning architecture, SpectraIrisPAD improves the robustness and generalization of iris PAD, demonstrates the potential of multispectral imaging for biometric security, and suggests a direction for future anti-spoofing work. Abstract: Iris recognition is widely recognized as one of the most accurate biometric modalities. However, its growing deployment in real-world applications raises significant concerns regarding its vulnerability to Presentation Attacks (PAs). Effective Presentation Attack Detection (PAD) is therefore critical to ensure the integrity and security of iris-based biometric systems. While conventional iris recognition systems predominantly operate in the near-infrared (NIR) spectrum, multispectral imaging across multiple NIR bands provides complementary reflectance information that can enhance the generalizability of PAD methods. In this work, we propose SpectraIrisPAD, a novel deep learning-based framework for robust multispectral iris PAD. SpectraIrisPAD leverages a DINOv2 Vision Transformer (ViT) backbone equipped with learnable spectral positional encoding, token fusion, and contrastive learning to extract discriminative, band-specific features that effectively distinguish bona fide samples from various spoofing artifacts. Furthermore, we introduce a new comprehensive dataset, Multispectral Iris PAD (MSIrPAD), with diverse Presentation Attack Instruments (PAIs), captured using a custom-designed multispectral iris sensor operating at five distinct NIR wavelengths (800 nm, 830 nm, 850 nm, 870 nm, and 980 nm). The dataset includes 18,848 iris images encompassing eight diverse PAI categories, including five textured contact lenses, print attacks, and display-based attacks. 
We conduct comprehensive experiments under unseen attack evaluation protocols to assess the generalization capability of the proposed method. SpectraIrisPAD consistently outperforms several state-of-the-art baselines across all performance metrics, demonstrating superior robustness and generalizability in detecting a wide range of presentation attacks.

[95] Explainable Melanoma Diagnosis with Contrastive Learning and LLM-based Report Generation

Junwen Zheng,Xinran Xu,Li Rong Wang,Chang Cai,Lucinda Siyun Tan,Dingyuan Wang,Hong Liang Tey,Xiuyi Fan

Main category: cs.CV

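TL;DR: Proposes CEFM, a cross-modal explainable framework based on contrastive learning that maps the clinical criteria for skin cancer diagnosis (ABC) into the Vision Transformer embedding space and generates natural-language explanations, balancing high classification accuracy with clinical trust.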

Details Motivation: Deep learning performs very well in melanoma classification, but the lack of interpretability hinders clinical use. Clinicians find it hard to trust the decision process of a black-box model, so greater transparency is needed for practical adoption. Method: With contrastive learning as the core mechanism, dual projection heads map the clinical diagnostic criteria (Asymmetry, Border, Color) into the Vision Transformer embedding space, aligning clinical semantics with visual features, and a natural language generation module outputs structured textual explanations. Result: The method reaches 92.79% accuracy and an AUC of 0.961 on public datasets and significantly improves multiple interpretability metrics. Qualitative analysis shows that the arrangement of the learned embedding space is consistent with how clinicians apply the ABC rule. Conclusion: CEFM bridges the gap between high-performance deep learning models and the need for clinical interpretability, increases clinicians' trust in model decisions, and supports the adoption of AI in clinical dermatology. Abstract: Deep learning has demonstrated expert-level performance in melanoma classification, positioning it as a powerful tool in clinical dermatology. However, model opacity and the lack of interpretability remain critical barriers to clinical adoption, as clinicians often struggle to trust the decision-making processes of black-box models. To address this gap, we present a Cross-modal Explainable Framework for Melanoma (CEFM) that leverages contrastive learning as the core mechanism for achieving interpretability. Specifically, CEFM maps clinical criteria for melanoma diagnosis, namely Asymmetry, Border, and Color (ABC), into the Vision Transformer embedding space using dual projection heads, thereby aligning clinical semantics with visual features. The aligned representations are subsequently translated into structured textual explanations via natural language generation, creating a transparent link between raw image data and clinical interpretation. Experiments on public datasets demonstrate 92.79% accuracy and an AUC of 0.961, along with significant improvements across multiple interpretability metrics. Qualitative analyses further show that the spatial arrangement of the learned embeddings aligns with clinicians' application of the ABC rule, effectively bridging the gap between high-performance classification and clinical trust.

[96] Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation

Su Sun,Cheng Zhao,Himangi Mittal,Gaurav Mittal,Rohith Kukkala,Yingjie Victor Chen,Mei Chen

Main category: cs.CV

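TL;DR: Proposes Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and uses tracking-guided motion priors to improve the temporal consistency and cross-view coherence of dynamic 4D object generation.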

Details Motivation: Existing methods rely on pixel- or latent-space video-diffusion losses and lack explicit, temporally aware, feature-level tracking guidance, which leads to cross-view inconsistency, temporal drift, and artifacts. Method: Stage One enforces dense feature-level point correspondences inside the diffusion generator, using tracker-derived motion priors to improve temporal and cross-view consistency. Stage Two reconstructs 4D Gaussian splats with a hybrid encoding that fuses diffusion features with Hex-plane features and adds 4D Spherical Harmonics for better dynamics modeling. Result: The method outperforms baselines on multi-view video generation and 4D generation benchmarks, producing results with better temporal stability, appearance consistency, and text editability. A high-quality dataset, Sketchfab28, is also released. Conclusion: Explicitly injecting tracking-guided motion priors into the intermediate features of the diffusion model improves the quality and temporal consistency of dynamic 4D generation and offers a new approach to object-centric 4D content generation. Abstract: Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance. We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that curb appearance drift and enhance cross-view coherence. In Stage Two, we reconstruct a dynamic 4D-GS using a hybrid motion encoding that concatenates co-located diffusion features (carrying Stage-One tracking priors) with Hex-plane features, and augment them with 4D Spherical Harmonics for higher-fidelity dynamics modeling. Track4DGen surpasses baselines on both multi-view video generation and 4D generation benchmarks, yielding temporally stable, text-editable 4D assets. Lastly, we curate Sketchfab28, a high-quality dataset for benchmarking object-centric 4D generation and fostering future research.

[97] Automated Annotation of Shearographic Measurements Enabling Weakly Supervised Defect Detection

Jessica Plassmann,Nicolas Schuler,Michael Schuth,Georg von Freymann

Main category: cs.CV

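TL;DR: Proposes a deep-learning-based automated workflow that generates defect annotations from shearographic measurements, reducing manual effort and supporting scalable dataset creation.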

Details Motivation: Manual labeling is time-consuming, subjective, and hard to standardize, and the resulting lack of high-quality annotated datasets limits the industrial use of shearography. Method: A deep learning approach automatically generates high-resolution defect segmentation and bounding-box annotations from shearography measurement data. Result: Compared against expert-labeled data, the method is accurate enough to be used for weakly supervised training. Conclusion: The automated workflow reduces the manual labeling burden and supports the large-scale use of shearography for defect detection. Abstract: Shearography is an interferometric technique sensitive to surface displacement gradients, providing high sensitivity for detecting subsurface defects in safety-critical components. A key limitation to industrial adoption is the lack of high-quality annotated datasets, since manual labeling remains labor-intensive, subjective, and difficult to standardize. We introduce an automated workflow that generates defect annotations from shearography measurements using deep learning, producing high-resolution segmentation and bounding-box labels. Evaluation against expert-labeled data demonstrates sufficient accuracy to enable weakly supervised training, reducing manual effort and supporting scalable dataset creation for robust defect detection.

[98] Physics-Grounded Shadow Generation from Monocular 3D Geometry Priors and Approximate Light Direction

Shilin Hu,Jingyi Xu,Akshat Dave,Dimitris Samaras,Hieu Le

Main category: cs.CV

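TL;DR: This paper proposes a new shadow generation framework that embeds explicit physical modeling (geometry and illumination) into deep learning. It combines a physics-based initial estimate with diffusion-model refinement to generate high-fidelity shadows that are both realistic and physically consistent.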

Details Motivation: Existing deep-learning-based shadow generation methods rarely use an explicit physical model of shadow formation, so the generated shadows lack physical consistency in scenes with complex geometry or ambiguous lighting. Method: Dense point maps and a dominant light direction are first recovered from a monocular RGB image, and an initial shadow location and shape are estimated from the physics of shadow formation. This physical prior is then embedded into a diffusion framework for refinement, producing high-fidelity shadows consistent with the scene. Result: Trained on DESOBAV2, the model outperforms existing methods in both visual realism and physical consistency, especially in scenes with complex geometry or ambiguous lighting. Conclusion: Combining explicit physical modeling with deep learning improves the quality and physical plausibility of generated shadows and offers a viable direction for future research. Abstract: Shadow generation aims to produce photorealistic shadows that are visually consistent with object geometry and scene illumination. In the physics of shadow formation, the occluder blocks some light rays cast from the light source that would otherwise arrive at the surface, creating a shadow that follows the silhouette of the occluder. However, such explicit physical modeling has rarely been used in deep-learning-based shadow generation. In this paper, we propose a novel framework that embeds explicit physical modeling (geometry and illumination) into deep-learning-based shadow generation. First, given a monocular RGB image, we obtain approximate 3D geometry in the form of dense point maps and predict a single dominant light direction. These signals allow us to recover fairly accurate shadow location and shape based on the physics of shadow formation. We then integrate this physics-based initial estimate into a diffusion framework that refines the shadow into a realistic, high-fidelity appearance while ensuring consistency with scene geometry and illumination. Trained on DESOBAV2, our model produces shadows that are both visually realistic and physically coherent, outperforming existing approaches, especially in scenes with complex geometry or ambiguous lighting.

[99] Physics-Grounded Attached Shadow Detection Using Approximate 3D Geometry and Light Direction

Shilin Hu,Jingyi Xu,Sagnik Das,Dimitris Samaras,Hieu Le

Main category: cs.CV

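TL;DR: This paper proposes a new framework that jointly detects cast and attached shadows. Closed-loop reasoning over the relationship between illumination and geometry substantially improves attached shadow detection, and the authors release the first dataset with separate annotations for both shadow types.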

Details Motivation: Existing shadow detection methods mainly target cast shadows. There are no dedicated datasets or models for attached shadows, which limits understanding of 3D object structure and of the scene. Method: A closed-loop framework with a shadow detection module and a light estimation module is proposed. The two shadow types are first predicted separately. Surface normals and the estimated light direction are then combined to produce a geometry-consistent partial map of likely self-occluded regions, which is fed back to refine the shadow predictions, giving iterative joint optimization. Result: The method is validated on a newly constructed dataset of 1,458 images. The BER for attached shadow detection is reduced by at least 33%, while performance on full and cast shadows remains strong. Conclusion: Closed-loop reasoning constrained by illumination and geometric consistency improves the accuracy of attached shadow detection and offers a new solution for more complete scene understanding. Abstract: Attached shadows occur on the surface of the occluder where light cannot reach because of self-occlusion. They are crucial for defining the three-dimensional structure of objects and enhancing scene understanding. Yet existing shadow detection methods mainly target cast shadows, and there are no dedicated datasets or models for detecting attached shadows. To address this gap, we introduce a framework that jointly detects cast and attached shadows by reasoning about their mutual relationship with scene illumination and geometry. Our system consists of a shadow detection module that predicts both shadow types separately, and a light estimation module that infers the light direction from the detected shadows. The estimated light direction, combined with surface normals, allows us to derive a geometry-consistent partial map that identifies regions likely to be self-occluded. This partial map is then fed back to refine shadow predictions, forming a closed-loop reasoning process that iteratively improves both shadow segmentation and light estimation. To train our method, we have constructed a dataset of 1,458 images with separate annotations for cast and attached shadows, enabling training and quantitative evaluation of both. Experimental results demonstrate that this iterative geometry-illumination reasoning substantially improves the detection of attached shadows, with at least 33% BER reduction, while maintaining strong full and cast shadow performance.
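The BER quoted above is the balanced error rate, the metric commonly used in shadow detection. The sketch below shows how it is usually computed from binary masks. It assumes flattened 0/1 masks and the common convention of reporting BER as a percentage, and the function name is illustrative, since the abstract does not spell out the exact evaluation code.

```python
def balanced_error_rate(pred, truth):
    """Balanced error rate in percent for flattened binary masks.

    BER = 100 * (1 - 0.5 * (TP / (TP + FN) + TN / (TN + FP)))
    Lower is better, and 0 means every pixel is classified correctly.
    """
    tp = fn = tn = fp = 0
    for p, t in zip(pred, truth):
        if t and p:
            tp += 1
        elif t and not p:
            fn += 1
        elif not t and not p:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("ground truth must contain both shadow and non-shadow pixels")
    return 100.0 * (1.0 - 0.5 * (tp / (tp + fn) + tn / (tn + fp)))


# Hypothetical 6-pixel example: one shadow pixel missed, one false alarm.
ber = balanced_error_rate([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

Because the shadow and non-shadow error rates are averaged with equal weight, BER is not dominated by the (usually much larger) non-shadow region, which matters for small attached shadows.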

[100] SPOOF: Simple Pixel Operations for Out-of-Distribution Fooling

Ankit Gupta,Christoph Adami,Emily Dolson

Main category: cs.CV

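TL;DR: Modern deep neural networks excel at image recognition but remain overconfident on inputs that bear no resemblance to natural images. This paper re-implements the CPPN-based and direct-encoding evolutionary fooling attacks and introduces SPOOF, a simpler, more efficient, and more consistent black-box attack that generates high-confidence fooling images with very few queries. Experiments show that even advanced models such as ViT are highly vulnerable, and that retraining with fooling images as an extra class offers only limited defense, underscoring the persistent fragility of modern classifiers.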

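Details Motivation: Although deep neural networks perform very well on image recognition, they still produce high-confidence predictions for unnatural, meaningless inputs. Inspired by the "fooling images" work of Nguyen et al., the authors test whether modern architectures (both convolutional networks and vision transformers) remain vulnerable to such attacks and look for more efficient attacks that expose this fundamental weakness. Method: The authors re-implement the CPPN-based and direct-encoding evolutionary attacks and propose SPOOF, a minimalist, consistent, and more compute-efficient black-box attack that generates unrecognizable high-confidence fooling images with minimal pixel modifications. They also evaluate robustness after adding fooling images to the training set as a new class. Result: State-of-the-art models are still easily fooled. ViT-B/16 in particular reaches near-certain misclassification with fewer queries. SPOOF generates equally effective fooling images at lower computational cost than the traditional methods. Even after retraining with fooling images, models can still be fooled by SPOOF with slightly more queries. Conclusion: Modern deep classifiers, including advanced vision transformers, remain fundamentally fragile on structured-noise inputs. SPOOF's efficiency and continued effectiveness indicate that data augmentation or training on added adversarial samples alone cannot eliminate the problem, and that model generalization and calibration need more fundamental improvement. Abstract: Deep neural networks (DNNs) excel across image recognition tasks, yet continue to exhibit overconfidence on inputs that bear no resemblance to natural images. Revisiting the "fooling images" work introduced by Nguyen et al. (2015), we re-implement both CPPN-based and direct-encoding-based evolutionary fooling attacks on modern architectures, including convolutional and transformer classifiers. Our re-implementations confirm that high-confidence fooling persists even in state-of-the-art networks, with the transformer-based ViT-B/16 emerging as the most susceptible, achieving near-certain misclassifications with substantially fewer queries than convolution-based models. We then introduce SPOOF, a minimalist, consistent, and more efficient black-box attack generating high-confidence fooling images. Despite its simplicity, SPOOF generates unrecognizable fooling images with minimal pixel modifications and drastically reduced compute. Furthermore, retraining with fooling images as an additional class provides only partial resistance, as SPOOF continues to fool consistently with slightly higher query budgets, highlighting the persistent fragility of modern deep classifiers.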

[101] Multi-Modal Zero-Shot Prediction of Color Trajectories in Food Drying

Shichen Li,Ahmadreza Eslaminia,Chenhui Shao

Main category: cs.CV

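TL;DR: Proposes a multi-modal color-trajectory prediction method that combines high-dimensional temporal color information with drying parameters, substantially improving the accuracy and generalization of color-change prediction during food drying.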

Details Motivation: Existing methods rely on low-dimensional color features, which cannot capture the complex, dynamic color changes during food drying, and they do not generalize to unseen drying conditions. Method: A multi-modal color-trajectory prediction method is developed that fuses high-dimensional temporal color information with drying process parameters to model color evolution during food drying accurately. Result: Under unseen drying conditions, the model achieves color-prediction RMSEs of 2.12 for cookie drying and 1.29 for apple drying, reducing error by more than 90% relative to baseline models. Conclusion: The method is accurate, robust, and generalizes well, making it suitable for color quality monitoring across a range of food drying processes. Abstract: Food drying is widely used to reduce moisture content, ensure safety, and extend shelf life. Color evolution of food samples is an important indicator of product quality in food drying. Although existing studies have examined color changes under different drying conditions, current approaches primarily rely on low-dimensional color features and cannot fully capture the complex, dynamic color trajectories of food samples. Moreover, existing modeling approaches lack the ability to generalize to unseen process conditions. To address these limitations, we develop a novel multi-modal color-trajectory prediction method that integrates high-dimensional temporal color information with drying process parameters to enable accurate and data-efficient color trajectory prediction. Under unseen drying conditions, the model attains RMSEs of 2.12 for cookie drying and 1.29 for apple drying, reducing errors by over 90% compared with baseline models. These experimental results demonstrate the model's superior accuracy, robustness, and broad applicability.

[102] The MICCAI Federated Tumor Segmentation (FeTS) Challenge 2024: Efficient and Robust Aggregation Methods for Federated Learning

Akis Linardos,Sarthak Pati,Ujjwal Baid,Brandon Edwards,Patrick Foley,Kevin Ta,Verena Chung,Micah Sheller,Muhammad Irfan Khan,Mojtaba Jafaritadi,Elina Kontio,Suleiman Khan,Leon Mächler,Ivan Ezhov,Suprosanna Shit,Johannes C. Paetzold,Gustav Grimberg,Manuel A. Nickel,David Naccache,Vasilis Siomos,Jonathan Passerat-Palmbach,Giacomo Tarroni,Daewoon Kim,Leonard L. Klausmann,Prashant Shah,Bjoern Menze,Dimitrios Makris,Spyridon Bakas

Main category: cs.CV

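TL;DR: The MICCAI FeTS Challenge 2024 evaluated federated learning for glioma MRI segmentation, focusing on new weight aggregation methods; a PID-controller-based method performed best in both segmentation performance and communication efficiency.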

Details Motivation: To advance federated learning for medical image segmentation, improve model robustness and communication efficiency, and address data privacy across multiple centers. Method: A standardized federated learning framework is used to evaluate the methods of six participating teams on a multi-institutional dataset derived from BraTS; segmentation performance is measured with the Dice coefficient and HD95, and communication efficiency with a convergence score. Result: The PID-controller-based method achieved the top overall score, with mean DSC of 0.733, 0.761, and 0.751 for ET, TC, and WT, HD95 of 33.922 mm, 33.623 mm, and 32.309 mm, and a convergence score of 0.764. Conclusion: PID controllers perform well for weight aggregation in federated learning, improving both segmentation quality and communication efficiency and advancing federated learning for medical imaging. Abstract: We present the design and results of the MICCAI Federated Tumor Segmentation (FeTS) Challenge 2024, which focuses on federated learning (FL) for glioma sub-region segmentation in multi-parametric MRI and evaluates new weight aggregation methods aimed at improving robustness and efficiency. Six participating teams were evaluated using a standardized FL setup and a multi-institutional dataset derived from the BraTS glioma benchmark, consisting of 1,251 training cases, 219 validation cases, and 570 hidden test cases with segmentations for enhancing tumor (ET), tumor core (TC), and whole tumor (WT). Teams were ranked using a cumulative scoring system that considered both segmentation performance, measured by Dice Similarity Coefficient (DSC) and the 95th percentile Hausdorff Distance (HD95), and communication efficiency assessed through the convergence score. A PID-controller-based method achieved the top overall ranking, obtaining mean DSC values of 0.733, 0.761, and 0.751 for ET, TC, and WT, respectively, with corresponding HD95 values of 33.922 mm, 33.623 mm, and 32.309 mm, while also demonstrating the highest communication efficiency with a convergence score of 0.764. These findings advance the state of federated learning for medical imaging, surpassing top-performing methods from previous challenge iterations and highlighting PID controllers as effective mechanisms for stabilizing and optimizing weight aggregation in FL. The challenge code is available at https://github.com/FeTS-AI/Challenge.
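
For readers unfamiliar with the Dice Similarity Coefficient reported above, the following is a minimal sketch of the standard definition on flat binary masks. It is not the challenge's evaluation code; the function name `dice_coefficient` and the convention for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the foreground sizes of the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
print(round(dice_coefficient(pred, target), 4))
```

Here two of the three predicted foreground voxels overlap the three reference voxels, so the score is 2*2/(3+3).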

[103] Revisiting SVD and Wavelet Difference Reduction for Lossy Image Compression: A Reproducibility Study

Alena Makarova

Main category: cs.CV

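TL;DR: This reproducibility study of an image compression method combining SVD and WDR finds that it does not outperform JPEG2000 or standalone WDR as originally claimed, and that key implementation details of the original method are ambiguous.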

Details Motivation: To verify the reproducibility of the SVD+WDR image compression method from the original paper and the validity of its performance claims. Method: The original method is re-implemented, missing technical details are filled in, and the experiments are replicated on the original images and on additional images, with PSNR and SSIM used for evaluation. Result: SVD+WDR does not outperform JPEG2000 or WDR in PSNR and only partially outperforms JPEG2000 in SSIM; the original method is also found to be unclear about quantization and threshold initialization, among other details. Conclusion: The performance advantage claimed in the original paper could not be reproduced, and the ambiguity of its implementation details affects reproducibility, which points to the need for more transparent method descriptions to ensure scientific reliability. Abstract: This work presents an independent reproducibility study of a lossy image compression technique that integrates singular value decomposition (SVD) and wavelet difference reduction (WDR). The original paper claims that combining SVD and WDR yields better visual quality and higher compression ratios than JPEG2000 and standalone WDR. I re-implemented the proposed method, carefully examined missing implementation details, and replicated the original experiments as closely as possible. I then conducted additional experiments on new images and evaluated performance using PSNR and SSIM. In contrast to the original claims, my results indicate that the SVD+WDR technique generally does not surpass JPEG2000 or WDR in terms of PSNR, and only partially improves SSIM relative to JPEG2000. The study highlights ambiguities in the original description (e.g., quantization and threshold initialization) and illustrates how such gaps can significantly impact reproducibility and reported performance.
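
PSNR, one of the two metrics used in this study, can be sketched as follows. This is the textbook definition for flat pixel sequences, assuming 8-bit images (`max_value=255.0`); it is not taken from the study's code, and the function name `psnr` is chosen here for illustration.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


print(psnr([52, 55, 61, 59], [50, 56, 60, 62]))
```

Because PSNR depends only on the mean squared error, small differences in quantization or threshold initialization of the kind the study highlights can shift reported values directly.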

[104] GPU-GLMB: Assessing the Scalability of GPU-Accelerated Multi-Hypothesis Tracking

Pranav Balakrishnan,Sidisha Barik,Sean M. O'Rourke,Benjamin M. Marlin

Main category: cs.CV

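TL;DR: Proposes a variant of the Generalized Labeled Multi-Bernoulli (GLMB) filter that allows multiple detections per object from the same sensor, which improves parallel scalability and enables GPU acceleration, substantially improving the computational efficiency of multi-target tracking.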

Details Motivation: The standard GLMB filter is computationally expensive when maintaining multiple hypotheses and struggles to meet real-time requirements, especially in distributed networks of machine learning-based virtual sensors where multiple detections of the same object must be supported. Method: A GLMB filter variant is introduced that allows the same object to produce multiple detections, breaking the inter-detection dependencies of the standard GLMB filter, which improves parallelism and enables efficient deployment on GPUs. Result: Preliminary experiments indicate that the proposed GPU-accelerated GLMB tracker scales well in run time as the number of objects and retained hypotheses grows. Conclusion: The method reduces the computational cost of the GLMB filter and improves its parallelism, making practical deployment of large-scale multi-target tracking more feasible. Abstract: Much recent research on multi-target tracking has focused on multi-hypothesis approaches leveraging random finite sets. Of particular interest are labeled random finite set methods that maintain temporally coherent labels for each object. While these methods enjoy important theoretical properties as closed-form solutions to the multi-target Bayes filter, the maintenance of multiple hypotheses under the standard measurement model is highly computationally expensive, even when hypothesis pruning approximations are applied. In this work, we focus on the Generalized Labeled Multi-Bernoulli (GLMB) filter as an example of this class of methods. We investigate a variant of the filter that allows multiple detections per object from the same sensor, a critical capability when deploying tracking in the context of distributed networks of machine learning-based virtual sensors. We show that this breaks the inter-detection dependencies in the filter updates of the standard GLMB filter, allowing updates with significantly improved parallel scalability and enabling efficient deployment on GPU hardware. We report the results of a preliminary analysis of a GPU-accelerated implementation of our proposed GLMB tracker, with a focus on run time scalability with respect to the number of objects and the maximum number of retained hypotheses.

[105] Opinion: Learning Intuitive Physics May Require More than Visual Data

Ellen Su,Solim Legris,Todd M. Gureckis,Mengye Ren

Main category: cs.CV

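TL;DR: This study examines whether pretraining a V-JEPA model on a developmentally realistic visual dataset (SAYCam) improves deep learning performance on intuitive physics. The results show that even with a data distribution closer to human experience, the current architecture does not improve significantly, indicating that changing data volume and distribution alone is not enough to achieve artificial intuitive physics.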

Details Motivation: Although existing deep learning models are trained on large amounts of video data, they still fall short of humans on intuitive physics tasks. This work investigates whether data distribution, rather than data volume, is the key factor in acquiring physical intuition. Method: A Video Joint Embedding Predictive Architecture (V-JEPA) model is pretrained on SAYCam (egocentric everyday visual data from three children, equal to only 0.01% of the training data of SOTA models) and evaluated on the IntPhys2 benchmark. Result: The V-JEPA model trained on SAYCam shows no significant performance improvement on IntPhys2, indicating that current architectures cannot learn effective intuitive physics representations from a more developmentally realistic data distribution alone. Conclusion: Changing only the volume and distribution of visual data is not enough for current deep learning models to acquire human-like intuitive physics; new architectures or learning mechanisms need to be explored. Abstract: Humans expertly navigate the world by building rich internal models founded on an intuitive understanding of physics. Meanwhile, despite training on vast quantities of internet video data, state-of-the-art deep learning models still fall short of human-level performance on intuitive physics benchmarks. This work investigates whether data distribution, rather than volume, is the key to learning these principles. We pretrain a Video Joint Embedding Predictive Architecture (V-JEPA) model on SAYCam, a developmentally realistic, egocentric video dataset partially capturing three children's everyday visual experiences. We find that training on this dataset, which represents 0.01% of the data volume used to train SOTA models, does not lead to significant performance improvements on the IntPhys2 benchmark. Our results suggest that merely training on a developmentally realistic dataset is insufficient for current architectures to learn representations that support intuitive physics. We conclude that varying visual data volume and distribution alone may not be sufficient for building systems with artificial intuitive physics.

[106] NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks

Fangzhou Lin,Yuping Wang,Yuliang Guo,Zixun Huang,Xinyu Huang,Haichong Zhang,Kazunori Yamada,Zhengzhong Tu,Liu Ren,Ziming Zhang

Main category: cs.CV

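TL;DR: This paper proposes NexusFlow, a lightweight, plug-and-play multi-task learning framework for partial supervision where tasks differ greatly in structure and annotations are incomplete. It aligns the latent feature distributions of different tasks through invertible coupling layers to enable cross-task knowledge transfer, and achieves leading performance on autonomous driving and NYUv2 datasets.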

Details Motivation: Existing partially supervised multi-task learning methods focus mainly on homogeneous dense prediction tasks and struggle with knowledge transfer between structurally heterogeneous tasks. This paper targets multi-task learning in realistic settings with large structural differences and incomplete annotations. Method: The NexusFlow framework introduces surrogate networks with invertible coupling layers that map the features of different tasks into a shared canonical space, aligning their feature distributions. Bijectivity preserves information and avoids representational collapse while supporting knowledge transfer across heterogeneous tasks. Result: On nuScenes, NexusFlow outperforms strong baselines on domain-partitioned autonomous driving tasks (dense mapping and sparse tracking) and reaches state-of-the-art performance; it also yields consistent gains on three homogeneous dense tasks on NYUv2 (segmentation, depth, and surface normals). Conclusion: NexusFlow handles multi-task learning with structurally diverse tasks and incomplete annotations, is general and plug-and-play, and offers a new solution for partially supervised multi-task learning. Abstract: Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expressive capacity. We first evaluate NexusFlow on the core challenge of domain-partitioned autonomous driving, where dense map reconstruction and sparse multi-object tracking are supervised in different geographic regions, creating both structural disparity and a strong domain gap. NexusFlow sets a new state-of-the-art result on nuScenes, outperforming strong partially supervised baselines. To demonstrate generality, we further test NexusFlow on NYUv2 using three homogeneous dense prediction tasks, segmentation, depth, and surface normals, as a representative N-task PS-MTL scenario. NexusFlow yields consistent gains across all tasks, confirming its broad applicability.
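
The bijectivity argument above can be illustrated with a generic affine coupling layer of the kind used in normalizing flows. This is only a sketch of the general technique, not NexusFlow's architecture: the abstract does not specify the coupling form, and `_scale_and_shift` is a hypothetical fixed function standing in for a learned conditioner network.

```python
import math


def _scale_and_shift(x1):
    # Hypothetical stand-in for a learned network that maps the first half
    # of the features to a log-scale s and a shift t for the second half.
    s = [math.tanh(0.5 * v) for v in x1]
    t = [2.0 * v + 1.0 for v in x1]
    return s, t


def coupling_forward(x):
    """Affine coupling: keep x1, transform x2 as x2 * exp(s(x1)) + t(x1)."""
    if len(x) % 2:
        raise ValueError("feature length must be even")
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = _scale_and_shift(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    return x1 + y2


def coupling_inverse(y):
    """Exact inverse: x2 = (y2 - t(y1)) * exp(-s(y1)), since y1 equals x1."""
    if len(y) % 2:
        raise ValueError("feature length must be even")
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = _scale_and_shift(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


features = [0.3, -1.2, 2.0, 0.5]
mapped = coupling_forward(features)
recovered = coupling_inverse(mapped)
print(mapped)
print(recovered)
```

Because the untouched half fully determines the scale and shift, the inverse needs no approximation, which is the sense in which such a layer preserves information.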

[107] Language-driven Fine-grained Retrieval

Shijie Wang,Xin Yu,Yadan Luo,Zijian Wang,Pengfei Zhang,Zi Huang

Main category: cs.CV

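TL;DR: Proposes the LaFG framework, which uses large language models and vision-language models to generate attribute-level supervision from category names, improving the generalization of fine-grained image retrieval to unseen categories.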

Details Motivation: Existing fine-grained image retrieval methods rely on sparse one-hot labels and ignore the rich semantics in category names, which limits the comparability of cross-category details and generalization to unseen categories. Method: LaFG treats category names as semantic anchors, generates attribute descriptions with a large language model, uses a frozen vision-language model to project them into a vision-aligned semantic space, clusters them into a dataset-wide attribute vocabulary, and harvests complementary attributes; it then builds category-specific linguistic prototypes to supervise the retrieval model. Result: The method outperforms existing approaches on multiple fine-grained image retrieval datasets and generalizes better to unseen categories in particular. Conclusion: Mining the deeper semantics of category names in a language-driven way improves both the discriminative power and the cross-category generalization of fine-grained image retrieval. Abstract: Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision-language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer

[108] Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs

Chaoyang Wang,Yangfan He,Yiyang Zhou,Yixuan Wang,Jiaqi Liu,Peng Xia,Zhengzhong Tu,Mohit Bansal,Huaxiu Yao

Main category: cs.CV

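TL;DR: This paper reveals a key problem in large vision-language models (LVLMs): even when the model knows the correct answer, a bias in reasoning-path selection often leads it to a wrong conclusion. It proposes PSO, a two-stage post-training framework that improves reasoning accuracy and stability through path-selection optimization; experiments show an average accuracy gain of 7.4%.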

Details Motivation: Although LVLMs have the required knowledge, their reasoning is unstable and often follows incorrect paths, making results unreliable; existing methods do not effectively address this path selection bias. Method: The PSO framework is proposed. The first stage uses Group Relative Policy Optimization (GRPO) with template- and answer-based rewards to cultivate structured reasoning; the second stage performs online preference optimization and stores incorrect paths in a Negative Replay Memory (NRM) to avoid repeating mistakes, enabling continual refinement. Result: PSO effectively prunes invalid reasoning paths, substantially improves reasoning accuracy across multiple tasks (+7.4% on average), and makes reasoning chains more consistent and stable. Conclusion: LVLM failures stem mainly from path selection bias rather than missing knowledge, and PSO improves reasoning performance and reliability through systematic path optimization. Abstract: We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs): even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths. The core issue is not a lack of knowledge, but a path selection bias within the vast reasoning search space. Although LVLMs are often capable of sampling correct solution trajectories, they disproportionately favor unstable or logically inconsistent ones, leading to erratic and unreliable outcomes. The substantial disparity between Pass@K (with large K) and Pass@1 across numerous models provides compelling evidence that such failures primarily stem from misreasoning rather than ignorance. To systematically investigate and address this issue, we propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs. In the first stage, we employ Group Relative Policy Optimization (GRPO) with template and answer-based rewards to cultivate structured, step-by-step reasoning. In the second stage, we conduct online preference optimization, where the model samples reasoning paths from GRPO-generated data, self-evaluates them, and aligns itself toward the preferred trajectories. Incorrect or suboptimal paths are concurrently stored in a Negative Replay Memory (NRM) as hard negatives, which are periodically revisited to prevent the model from repeating prior mistakes and to facilitate continual reasoning refinement.
Extensive experiments show that PSO effectively prunes invalid reasoning paths, substantially enhances reasoning accuracy (with 7.4% improvements on average), and yields more stable and consistent chains of thought. Our code will be available at https://github.com/aiming-lab/PSO.
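
The Pass@K versus Pass@1 gap cited as evidence can be made concrete with the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. Whether the paper computes Pass@K this way is not stated in the abstract, so this is a sketch under that assumption; the sample counts below are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn without replacement
    from n generated samples (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: a correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 16 sampled reasoning paths, only 2 reach the right answer.
print(pass_at_k(16, 2, 1))
print(round(pass_at_k(16, 2, 8), 4))
```

With these numbers the model "knows" the answer in the sense that pass@8 is about 0.77, yet a single sample succeeds only 12.5% of the time, which is the kind of disparity the abstract attributes to path selection rather than missing knowledge.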

[109] TriaGS: Differentiable Triangulation-Guided Geometric Consistency for 3D Gaussian Splatting

Quan Tran,Tuan Dang

Main category: cs.CV

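TL;DR: Proposes a new method that improves the geometric consistency of 3D Gaussian Splatting reconstruction by constraining multi-view triangulation, reaching a state-of-the-art Chamfer Distance of 0.50 mm on the DTU dataset.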

Details Motivation: 3D Gaussian Splatting relies only on photometric loss, which leads to "floater" artifacts and unstructured geometry in the reconstruction and a lack of geometric consistency. Method: Robust 3D consensus points are built through multi-view triangulation, and the deviation of rendered points from these consensus points is penalized in a self-supervised manner to optimize the positions of the 3D Gaussians. Result: The method achieves state-of-the-art performance on multiple datasets, with a mean Chamfer Distance of 0.50 mm on DTU. Conclusion: The method improves the geometric consistency of the 3D Gaussian representation, reduces floater artifacts, and supports high-quality surface reconstruction. Abstract: 3D Gaussian Splatting is crucial for real-time novel view synthesis due to its efficiency and ability to render photorealistic images. However, building a 3D Gaussian is guided solely by photometric loss, which can result in inconsistencies in reconstruction. This under-constrained process often results in "floater" artifacts and unstructured geometry, preventing the extraction of high-fidelity surfaces. To address this issue, our paper introduces a novel method that improves reconstruction by enforcing global geometry consistency through constrained multi-view triangulation. Our approach aims to achieve a consensus on 3D representation in the physical world by utilizing various estimated views. We optimize this process by penalizing the deviation of a rendered 3D point from a robust consensus point, which is re-triangulated from a bundle of neighboring views in a self-supervised fashion. We demonstrate the effectiveness of our method across multiple datasets, achieving state-of-the-art results. On the DTU dataset, our method attains a mean Chamfer Distance of 0.50 mm, outperforming comparable explicit methods. We will make our code open-source to facilitate community validation and ensure reproducibility.
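
The Chamfer Distance quoted above compares a reconstructed point set with a reference one. The sketch below uses one common convention, the average of the two mean nearest-neighbor distances; definitions vary (some use squared distances or a sum instead of an average), and the exact DTU evaluation protocol is not described in the abstract, so treat the convention here as an assumption.

```python
import math


def _nearest(p, cloud):
    # Euclidean distance from point p to its nearest neighbor in cloud.
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean of the a->b and b->a average
    nearest-neighbor distances (brute force, O(len(a) * len(b)))."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(reconstruction, reference))
```

A stray "floater" far from the true surface inflates the reconstruction-to-reference term, which is why the metric is sensitive to the artifacts the paper targets.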

[110] FacePhys: State of the Heart Learning

Kegang Wang,Jiankai Tang,Yuntao Wang,Xin Liu,Yuxuan Fan,Jiatong Ji,Yuanchun Shi,Daniel McDuff

Main category: cs.CV

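TL;DR: Proposes FacePhys, a memory-efficient rPPG algorithm built on temporal-spatial state space duality that resolves the trilemma of model scalability, cross-dataset generalization, and real-time operation, reducing error by 49% and substantially outperforming existing methods in memory footprint and latency.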

Details Motivation: Existing camera-based vital sign measurement is limited by the computing power of front-end devices and by the signal degradation caused by compressive transmission channels, which makes practical deployment difficult. Method: The FacePhys algorithm uses temporal-spatial state space duality and a transferable heart state to capture subtle periodic variations across video frames while keeping computational overhead minimal, supporting training on long sequences and low-latency inference. Result: FacePhys reduces error by 49% relative to existing methods, with a memory footprint of only 3.6 MB and a per-frame latency of 9.46 ms, an improvement of 83% to 99%, setting a new state of the art. Conclusion: FacePhys resolves the trilemma facing rPPG in practice and offers good generalization and real-time performance, making it suitable for continuous health monitoring in real-world settings. Abstract: Vital sign measurement using cameras presents opportunities for comfortable, ubiquitous health monitoring. Remote photoplethysmography (rPPG), a foundational technology, enables cardiac measurement through minute changes in light reflected from the skin. However, practical deployment is limited by the computational constraints of performing analysis on front-end devices and the accuracy degradation of transmitting data through compressive channels that reduce signal quality. We propose a memory-efficient rPPG algorithm, FacePhys, built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time operation. Leveraging a transferable heart state, FacePhys captures subtle periodic variations across video frames while maintaining a minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. FacePhys establishes a new state-of-the-art, with a substantial 49% reduction in error. Our solution enables real-time inference with a memory footprint of 3.6 MB and per-frame latency of 9.46 ms, surpassing existing methods by 83% to 99%. These results translate into reliable real-time performance in practical deployments, and a live demo is available at https://www.facephys.com/.

[111] RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension

Tianyi Gao,Hao Li,Han Fang,Xin Wei,Xiaodong Dong,Hongbo Sun,Ye Yuan,Zhongjiang He,Jinglin Xu,Jingmin Xin,Hao Sun

Main category: cs.CV

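TL;DR: This paper proposes RefBench-PRO, a new benchmark for evaluating the perception and reasoning abilities of multimodal large models in referring expression comprehension, together with an automated data-generation pipeline and Ref-R1, a new reinforcement learning method that improves localization accuracy.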

Details Motivation: Existing referring expression comprehension (REC) benchmarks mainly evaluate perception and lack an interpretable assessment of the grounding ability of multimodal large language models across different cognitive abilities, so a comprehensive benchmark that separates the perception and reasoning dimensions is needed. Method: The RefBench-PRO benchmark decomposes referring expressions into perception and reasoning dimensions and subdivides them into six progressively harder tasks; an automated data-generation pipeline is built; and Ref-R1, an RL-based training method, uses Dynamic IoU-based GRPO to improve localization accuracy under complex reasoning. Result: Experiments show that RefBench-PRO enables interpretable evaluation of MLLMs on REC, poses greater challenges in both perception and reasoning, and confirms the effectiveness of Ref-R1 in improving localization. Conclusion: RefBench-PRO provides an effective tool for evaluating the fine-grained cognitive abilities of multimodal models and moves REC from perception toward complex reasoning. Abstract: Referring Expression Comprehension (REC) is a vision-language task that localizes a specific image region based on a textual description. Existing REC benchmarks primarily evaluate perceptual capabilities and lack interpretable scoring mechanisms, and therefore cannot reveal the grounding capability of Multi-modal Large Language Models (MLLMs) across different cognitive abilities. To address this limitation, we introduce RefBench-PRO, a comprehensive REC benchmark, which decomposes referring expressions into two core dimensions, i.e., perception and reasoning, and further subdivides them into six progressively challenging tasks, such as attribute, position, interaction, commonsense, relation and reject. We also develop a fully automated data-generation pipeline that produces diverse referring expressions across these six sub-dimensions. Furthermore, we propose Ref-R1, an RL-based learning scheme, which incorporates Dynamic IoU-based GRPO to improve localization accuracy under increasingly complex reasoning conditions, establishing a stronger baseline for REC. Extensive experiments demonstrate that our RefBench-PRO enables interpretable evaluation of MLLMs on referring expression comprehension, presenting greater challenges in both perception and reasoning.
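
Localization quality in REC is typically scored with the Intersection over Union (IoU) between a predicted and a ground-truth box, which is also the quantity an IoU-based reward would build on. The sketch below shows plain IoU for axis-aligned boxes in `(x1, y1, x2, y2)` form; how Ref-R1 makes the IoU reward "dynamic" is not described in the abstract and is not modeled here.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
print(round(box_iou(predicted, ground_truth), 4))
```

The overlap here is a 1x1 square against a union of 4 + 4 - 1 = 7, giving an IoU of 1/7.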

[112] Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

Hengzhuang Li,Xinsong Zhang,Qiming Peng,Bin Luo,Han Hu,Dengyang Jiang,Han-Jia Ye,Teng Zhang,Hai Jin

Main category: cs.CV

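TL;DR: Proposes Latent Visual Reconstruction (LaVer), a new training framework that performs masked image modeling in the joint latent semantic space of the LLM to strengthen the use of visual information in multimodal large language models and mitigate modality imbalance.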

Details Motivation: Multimodal large language models (MLLMs) tend to underuse visual information in deeper layers, which degrades visual performance or causes hallucinations, mainly because training relies on text-token prediction and lacks direct visual supervision. Method: The LaVer framework introduces masked image modeling in the joint latent semantic space of the LLM, providing a direct visual activation signal and strengthening the learning of visual representations. Result: Experiments show that LaVer substantially improves the visual understanding of MLLMs on multiple benchmarks, especially on tasks that require dense visual perception. Conclusion: LaVer mitigates modality imbalance in MLLMs, improves the retention and use of visual information in deep layers, and promotes more balanced multimodal learning. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite their impressive performance, MLLMs suffer from the modality imbalance issue, where visual information is often underutilized compared to textual representations in deeper layers, leading to degraded visual performance or hallucinations. This issue stems from the predominant reliance on next-text-token-prediction during training, which fails to provide direct visual supervisory signals, resulting in progressive homogenization of visual representations throughout the layers. To this end, we propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations via masked image modeling in the joint latent semantic space of the LLM. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information. Extensive experiments across diverse benchmarks prove the superiority of our approach in various scenarios, especially those requiring dense visual capabilities. Code for LaVer is available at https://github.com/Fir-lat/LaVer.

[113] A Sleep Monitoring System Based on Audio, Video and Depth Information

Lyn Chao-ling Chen,Kuan-Wen Chen,Yi-Ping Hung

Main category: cs.CV

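TL;DR: Proposes an event-based, noninvasive sleep disturbance monitoring system that uses a depth sensor, an RGB camera, and a microphone array to detect motion, lighting changes, and noise events during sleep, identified through background modeling and an event detection algorithm.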

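Details Motivation: Quantitative evaluation of sleep disturbances requires a way to noninvasively monitor and classify different types of sleep disturbance events in the home. Method: Sleep data are collected with an infrared depth sensor, an RGB camera, and a four-microphone array; separate background models are built on the depth signal and the color images to detect motion and lighting changes; sound information is used to detect noise events; and an event detection algorithm identifies the three types of disturbance events from the multi-sensor data. Result: The system monitors sleep disturbance events effectively in low-light conditions, and the experimental results support its reliability. Conclusion: The proposed event detection method, based on multimodal sensors and background modeling, reliably identifies sleep disturbance events in the home and is noninvasive and practical. Abstract: For quantitative evaluation of sleep disturbances, a noninvasive monitoring system is developed by introducing an event-based method. We observe sleeping in a home context and classify sleep disturbances into three types of events: motion events, light-on/off events, and noise events. A device with an infrared depth sensor, an RGB camera, and a four-microphone array is used for sleep monitoring in an environment with barely any light sources. One background model is established on the depth signals to measure the magnitude of movements. Because depth signals cannot capture lighting changes, another background model is established on the color images to measure the magnitude of lighting effects. An event detection algorithm is used to detect occurrences of events from the processed data of the three types of sensors. The system was tested under sleep conditions, and the experimental results validate the system's reliability.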

[114] StrokeNet: Unveiling How to Learn Fine-Grained Interactions in Online Handwritten Stroke Classification

Yiheng Huang,Shuang She,Zewei Wei,Jianmin Lin,Ming Yang,Wenyin Liu

Main category: cs.CV

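TL;DR: This paper proposes StrokeNet, a new network architecture for modeling fine-grained semantic relationships in stroke classification. It captures local stroke interactions through reference point sequences and a Cross-Ellipse Query mechanism, achieving state-of-the-art performance on multiple handwriting datasets.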

Details Motivation: Existing deep learning methods struggle to capture fine-grained semantic relationships between strokes, and point-level representations introduce redundancy, so a more efficient and fine-grained way of modeling strokes is needed. Method: StrokeNet encodes strokes as reference pair representations (points + feature vectors), dynamically selects and sequences reference points, builds contextual features with an Inline Sequence Attention (ISA) module, captures feature interactions across spatial scales with a Cross-Ellipse Query (CEQ) mechanism, and performs stroke classification and semantic transition modeling in a joint optimization framework. Result: The method achieves state-of-the-art performance on multiple public online handwriting datasets; on CASIA-onDo in particular, accuracy improves from 93.81% to 95.54%. Conclusion: With fine-grained reference point representations and a spatial query mechanism, StrokeNet models local semantic relationships between strokes and improves the accuracy and robustness of stroke classification. Abstract: Stroke classification remains challenging due to variations in writing style, ambiguous content, and dynamic writing positions. The core challenge in stroke classification is modeling the semantic relationships between strokes. Our observations indicate that stroke interactions are typically localized, making it difficult for existing deep learning methods to capture such fine-grained relationships. Although viewing strokes from a point-level perspective can address this issue, it introduces redundancy. However, by selecting reference points and using their sequential order to represent strokes in a fine-grained manner, this problem can be effectively solved. This insight inspired StrokeNet, a novel network architecture encoding strokes as reference pair representations (points + feature vectors), where reference points enable spatial queries and features mediate interaction modeling. Specifically, we dynamically select reference points for each stroke and sequence them, employing an Inline Sequence Attention (ISA) module to construct contextual features. To capture spatial feature interactions, we devised a Cross-Ellipse Query (CEQ) mechanism that clusters reference points and extracts features across varying spatial scales. Finally, a joint optimization framework simultaneously predicts stroke categories via reference points regression and adjacent stroke semantic transition modeling through an Auxiliary Branch (Aux-Branch). Experimental results show that our method achieves state-of-the-art performance on multiple public online handwritten datasets.
Notably, on the CASIA-onDo dataset, the accuracy improves from 93.81% to 95.54%, demonstrating the effectiveness and robustness of our approach.

[115] Exploiting Spatiotemporal Properties for Efficient Event-Driven Human Pose Estimation

Haoxian Zhou,Chuanzhi Xu,Langyi Chen,Haodong Chen,Yuk Ying Chung,Qiang Qu,Xaoming Chen,Weidong Cai

Main category: cs.CV

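TL;DR: This paper proposes a point cloud-based framework that exploits the spatiotemporal properties of event streams to improve human pose estimation, avoiding the computational overhead and loss of temporal resolution that come with converting event streams into dense frames.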

Details Motivation: Existing methods usually convert event streams into dense event frames, which adds computation and sacrifices the high temporal resolution of the event signal, so the advantages of event cameras are not fully used. Method: An Event Temporal Slicing Convolution module is designed to capture short-term dependencies across event slices and is combined with an Event Slice Sequencing module for structured temporal modeling; edge enhancement is also applied in the point cloud representation to strengthen spatial edge information under sparse events. Result: Experiments on the DHP19 dataset show that the proposed method consistently improves performance across three representative point cloud backbones: PointNet, DGCNN, and Point Transformer. Conclusion: The method makes effective use of the spatiotemporal sparsity and high temporal resolution of event streams and achieves more efficient and robust human pose estimation within a point cloud framework. Abstract: Human pose estimation focuses on predicting body keypoints to analyze human motion. Event cameras provide high temporal resolution and low latency, enabling robust estimation under challenging conditions. However, most existing methods convert event streams into dense event frames, which adds extra computation and sacrifices the high temporal resolution of the event signal. In this work, we aim to exploit the spatiotemporal properties of event streams using a point cloud-based framework designed to enhance human pose estimation performance. We design an Event Temporal Slicing Convolution module to capture short-term dependencies across event slices, and combine it with an Event Slice Sequencing module for structured temporal modeling. We also apply edge enhancement in the point cloud-based event representation to strengthen spatial edge information under sparse event conditions and further improve performance. Experiments on the DHP19 dataset show our proposed method consistently improves performance across three representative point cloud backbones: PointNet, DGCNN, and Point Transformer.

[116] ReCAD: Reinforcement Learning Enhanced Parametric CAD Model Generation with Vision-Language Models

Jiahao Li,Yusheng Luo,Yunzhong Lou,Xiangdong Zhou

Main category: cs.CV

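TL;DR: ReCAD is a reinforcement learning framework that guides pretrained large models to generate precise parametric CAD models from multimodal inputs; using parameterized code and a hierarchical learning strategy, it substantially improves geometric accuracy on text-to-CAD and image-to-CAD tasks.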

Details Motivation: Existing methods rely on supervised fine-tuning, offer limited editability, and do not fully exploit the generative priors of pretrained large models, which makes it hard to generate accurate and editable CAD models. Method: Vision-language models are first fine-tuned to acquire basic CAD generation capabilities, with CAD scripts rewritten into parameterized code for supervision; a new reinforcement learning strategy then uses parameterized code as guidance for the model's reasoning; finally, a hierarchical primitive learning process progressively trains structured and compositional skills under a unified reward function. Result: ReCAD achieves state-of-the-art performance on text-to-CAD and image-to-CAD tasks and substantially improves geometric accuracy: on image-to-CAD, the mean Chamfer Distance drops from 73.47 to 29.61 (in-distribution) and from 272.06 to 80.23 (out-of-distribution). Conclusion: ReCAD makes effective use of the generative capabilities of pretrained large models and, through reinforcement learning guided by parameterized code, generates accurate and editable CAD models, advancing automated design. Abstract: We present ReCAD, a reinforcement learning (RL) framework that bootstraps pretrained large models (PLMs) to generate precise parametric computer-aided design (CAD) models from multimodal inputs by leveraging their inherent generative capabilities. With just access to simple functional interfaces (e.g., point coordinates), our approach enables the emergence of complex CAD operations (e.g., pattern replication and mirror). This stands in contrast to previous methods, which typically rely on knowledge injected through supervised fine-tuning (SFT), offer limited support for editability, and fail to exploit the strong generative priors of PLMs. Specifically, the ReCAD framework begins by fine-tuning vision-language models (VLMs) to equip them with basic CAD model generation capabilities, where we rewrite CAD scripts into parameterized code that is leveraged to generate accurate textual descriptions for supervision. Then, we propose a novel RL strategy that incorporates parameterized code as guidance to enhance the model's reasoning on challenging questions. Furthermore, we employ a hierarchical primitive learning process to progressively teach structured and compositional skills under a unified reward function that ensures both geometric accuracy and semantic fidelity. ReCAD sets a new state-of-the-art in both text-to-CAD and image-to-CAD tasks, significantly improving geometric accuracy across in-distribution and out-of-distribution settings.
In the image-to-CAD task, for instance, it reduces the mean Chamfer Distance from 73.47 to 29.61 (in-distribution) and from 272.06 to 80.23 (out-of-distribution), outperforming existing baselines by a substantial margin.

[117] S2WMamba: A Spectral-Spatial Wavelet Mamba for Pansharpening

Haoyu Zhang,Junhan Luo,Yugang Cao,Siran Peng,Jie Huang,Liangjian-Deng

Main category: cs.CV

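TL;DR: Proposes S2WMamba, which separates spatial and spectral frequency information with 2D/1D Haar wavelet transforms and uses Mamba for lightweight cross-modal interaction, achieving efficient disentangled fusion and state-of-the-art performance on multiple datasets.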

Details Motivation: Traditional pansharpening methods struggle to balance spatial detail enhancement against spectral fidelity when fusing PAN and multispectral images, which tends to entangle the two kinds of information. Method: A 2D Haar DWT extracts high-frequency spatial features (edges, textures) from the PAN image, and a channel-wise 1D Haar DWT is applied to the MS image to separate low- and high-frequency spectral components; a dual-branch structure (Spectral branch and Spatial branch) models long-range dependencies through Mamba-based cross-modulation, and a multi-scale dynamic gate adaptively fuses the outputs. Result: On the WV3, GF2, and QB datasets, the method matches or surpasses recent methods such as FusionMamba and CANNet, improving PSNR by up to 0.23 dB and reaching HQNR 0.956 on full-resolution WV3. Conclusion: Through explicit frequency disentanglement and lightweight cross-modal interaction, S2WMamba improves pansharpening quality, balancing spatial detail injection with spectral fidelity, and ablations support its design choices. Abstract: Pansharpening fuses a high-resolution PAN image with a low-resolution multispectral (LRMS) image to produce an HRMS image. A key difficulty is that jointly processing PAN and MS often entangles spatial detail with spectral fidelity. We propose S2WMamba, which explicitly disentangles frequency information and then performs lightweight cross-modal interaction. Concretely, a 2D Haar DWT is applied to PAN to localize spatial edges and textures, while a channel-wise 1D Haar DWT treats each pixel's spectrum as a 1D signal to separate low/high-frequency components and limit spectral distortion. The resulting Spectral branch injects wavelet-extracted spatial details into MS features, and the Spatial branch refines PAN features using spectra from the 1D pyramid; the two branches exchange information through Mamba-based cross-modulation that models long-range dependencies with linear complexity. A multi-scale dynamic gate (multiplicative + additive) then adaptively fuses branch outputs. On WV3, GF2, and QB, S2WMamba matches or surpasses recent strong baselines (FusionMamba, CANNet, U2Net, ARConv), improving PSNR by up to 0.23 dB and reaching HQNR 0.956 on full-resolution WV3. Ablations justify the choice of 2D/1D DWT placement, parallel dual branches, and the fusion gate. Our code is available at https://github.com/KagUYa66/S2WMamba.
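
To make the "channel-wise 1D Haar DWT" step concrete, the sketch below applies a single-level orthonormal Haar transform to one pixel's spectrum, splitting it into low-frequency (approximation) and high-frequency (detail) coefficients. It illustrates the standard transform only; the function names, the single decomposition level, and the 8-band example values are assumptions, and the released S2WMamba code may differ.

```python
import math

_SQRT2 = math.sqrt(2.0)


def haar_dwt_1d(signal):
    """Single-level orthonormal Haar DWT of an even-length 1D signal."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / _SQRT2 for a, b in pairs]  # low-frequency part
    detail = [(a - b) / _SQRT2 for a, b in pairs]  # high-frequency part
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: interleave the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / _SQRT2)
        out.append((a - d) / _SQRT2)
    return out


# One pixel's spectrum across 8 bands (made-up values for illustration).
spectrum = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0]
low, high = haar_dwt_1d(spectrum)
print(low)
print(high)
print(haar_idwt_1d(low, high))
```

The transform is invertible, so separating the spectrum into smooth and detail components loses no information; adjacent bands with equal values (the 5.0, 5.0 pair) produce a zero detail coefficient.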

[118] CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks

Jeffrey Gu,Minkyu Jeon,Ambri Ma,Serena Yeung-Levy,Ellen D. Zhong

Main category: cs.CV

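TL;DR: Proposes CryoHype, a transformer-based hypernetwork that efficiently reconstructs a large number of distinct biomolecular structures simultaneously, substantially improving the high-throughput capability of cryo-EM in handling compositional heterogeneity.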

Details Motivation: Existing methods focus mainly on conformational heterogeneity within a single or a few structures and cannot handle compositional heterogeneity arising from many distinct molecular species, which limits the high-throughput potential of cryo-EM. Method: CryoHype, a transformer-based hypernetwork, dynamically adjusts the weights of an implicit neural representation to reconstruct multiple molecular structures simultaneously. Result: The method achieves state-of-the-art performance on a challenging benchmark dataset containing 100 structures and scales to reconstructing 1,000 distinct structures from unlabeled cryo-EM images in the fixed-pose setting. Conclusion: CryoHype addresses structure reconstruction for multi-species mixtures in cryo-EM and shows strong scalability and potential for high-throughput structure determination. Abstract: Cryo-electron microscopy (cryo-EM) is an indispensable technique for determining the 3D structures of dynamic biomolecular complexes. While typically applied to image a single molecular species, cryo-EM has the potential for structure determination of many targets simultaneously in a high-throughput fashion. However, existing methods typically focus on modeling conformational heterogeneity within a single or a few structures and are not designed to resolve compositional heterogeneity arising from mixtures of many distinct molecular species. To address this challenge, we propose CryoHype, a transformer-based hypernetwork for cryo-EM reconstruction that dynamically adjusts the weights of an implicit neural representation. Using CryoHype, we achieve state-of-the-art results on a challenging benchmark dataset containing 100 structures. We further demonstrate that CryoHype scales to the reconstruction of 1,000 distinct structures from unlabeled cryo-EM images in the fixed-pose setting.

[119] Beyond Hallucinations: A Multimodal-Guided Task-Aware Generative Image Compression for Ultra-Low Bitrate

Kaile Wang,Lijun He,Haisheng Fu,Haixia Bi,Fan Li

Main category: cs.CV

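TL;DR: Proposes MTGC, a multimodal-guided task-aware generative image compression framework that uses text, a highly compressed image, and semantic pseudo-words to strengthen semantic consistency, markedly improving generation quality and semantic fidelity at ultra-low bitrates.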

Details Motivation: Existing generative image compression suffers from semantic distortion at ultra-low bitrates, which limits its reliable use in 6G semantic communication, so semantic consistency needs to be strengthened. Method: The MTGC framework fuses a text caption, a highly compressed image, and task-aware Semantic Pseudo-Words (SPWs), and injects them into the diffusion process through a Multimodal-Guided Diffusion Decoder (MGDD) with dual-path cooperative guidance to reconstruct the image. Result: On the DIV2K dataset, DISTS drops by 10.59%, with marked gains in both perceptual quality and pixel-level fidelity. Conclusion: MTGC mitigates semantic deviation at ultra-low bitrates and offers a feasible approach to reliable semantic communication in 6G scenarios. Abstract: Generative image compression has recently shown impressive perceptual quality, but often suffers from semantic deviations caused by generative hallucinations at ultra-low bitrate (bpp < 0.05), limiting its reliable deployment in bandwidth-constrained 6G semantic communication scenarios. In this work, we reassess the positioning and role of multimodal guidance, and propose a Multimodal-Guided Task-Aware Generative Image Compression (MTGC) framework. Specifically, MTGC integrates three guidance modalities to enhance semantic consistency: a concise but robust text caption for global semantics, a highly compressed image (HCI) retaining low-level visual information, and Semantic Pseudo-Words (SPWs) for fine-grained task-relevant semantics. The SPWs are generated by our designed Task-Aware Semantic Compression Module (TASCM), which operates in a task-oriented manner to drive the multi-head self-attention mechanism to focus on and extract semantics relevant to the generation task while filtering out redundancy. Subsequently, to facilitate the synergistic guidance of these modalities, we design a Multimodal-Guided Diffusion Decoder (MGDD) employing a dual-path cooperative guidance mechanism that synergizes cross-attention and ControlNet additive residuals to precisely inject these three guidance signals into the diffusion process, and leverages the diffusion model's powerful generative priors to reconstruct the image. Extensive experiments demonstrate that MTGC consistently improves semantic consistency (e.g., DISTS drops by 10.59% on the DIV2K dataset) while also achieving remarkable gains in perceptual quality and pixel-level fidelity at ultra-low bitrate.

[120] CLUENet: Cluster Attention Makes Neural Networks Have Eyes

Xiangshuai Song,Jun-Jie Huang,Tianrui Liu,Ke Liang,Chang Tang

Main category: cs.CV

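TL;DR: Proposes CLUENet, a clustering-based transparent deep network that improves the accuracy and interpretability of visual semantic understanding through soft aggregation, hard assignment, and an improved pooling strategy.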

Details Motivation: Existing convolution- and attention-based models have fixed receptive fields and complex structures, which makes it hard to model irregular spatial patterns and limits interpretability; clustering methods are interpretable but suffer from low accuracy, poor efficiency, and vanishing gradients. Method: CLUENet combines global soft aggregation with temperature-scaled cosine attention, gated residual connections, inter-block hard and shared feature dispatching, and an improved cluster pooling strategy. Result: On CIFAR-100 and Mini-ImageNet it outperforms existing clustering methods and mainstream vision models, balancing classification performance with visual interpretability. Conclusion: CLUENet achieves strong performance while remaining highly transparent, offering an effective solution for interpretable vision models. Abstract: Despite the success of convolution- and attention-based models in vision tasks, their rigid receptive fields and complex architectures limit their ability to model irregular spatial patterns and hinder interpretability, therefore posing challenges for tasks requiring high model transparency. Clustering paradigms offer promising interpretability and flexible semantic modeling, but suffer from limited accuracy, low efficiency, and gradient vanishing during training. To address these issues, we propose CLUster attEntion Network (CLUENet), a transparent deep architecture for visual semantic understanding. Its three key innovations are (i) a Global Soft Aggregation and Hard Assignment with a Temperature-Scaled Cosine Attention and gated residual connections for enhanced local modeling, (ii) inter-block Hard and Shared Feature Dispatching, and (iii) an improved cluster pooling strategy. These enhancements significantly improve both classification performance and visual interpretability. Experiments on CIFAR-100 and Mini-ImageNet demonstrate that CLUENet outperforms existing clustering methods and mainstream visual models, offering a compelling balance of accuracy, efficiency, and transparency.
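
The abstract does not spell out how temperature-scaled cosine attention is computed, so the following is only a plausible sketch of the general idea: cosine similarities between a feature and a set of cluster centers are divided by a temperature and passed through a softmax (soft aggregation weights), while the arg-max similarity gives a hard assignment. The function names and the default temperature are assumptions, not CLUENet's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def soft_assign(feature, centers, temperature=0.1):
    """Softmax over temperature-scaled cosine similarities (assumed form)."""
    logits = [cosine(feature, c) / temperature for c in centers]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_assign(feature, centers):
    """Index of the most similar center (hard assignment)."""
    sims = [cosine(feature, c) for c in centers]
    return max(range(len(centers)), key=lambda k: sims[k])
```

A lower temperature sharpens the soft weights toward the hard assignment; a higher one spreads them more evenly across clusters.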

Kaicheng Yang,Kaisen Yang,Baiting Wu,Xun Zhang,Qianrui Yang,Haotong Qin,He Zhang,Yulun Zhang

Main category: cs.CV

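TL;DR: This paper proposes TreeQ, a unified framework for mixed-precision quantization of Diffusion Transformers (DiT). Its three core components, Tree Structured Search (TSS), Environmental Noise Guidance (ENG), and General Monarch Branch (GMB), address search efficiency, configuration alignment, and information bottlenecks in quantization, achieving near-lossless 4-bit PTQ on DiT models for the first time.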

Details Motivation: Although DiT excels at image generation, its high computational and memory cost limits practical deployment; existing mixed-precision quantization methods work well on U-Net but remain underexplored for DiT, so a targeted quantization scheme is needed. Method: The TreeQ framework has three parts: 1) Tree Structured Search (TSS) exploits DiT's linear properties for efficient O(n)-time search and adds pruning to improve accuracy; 2) Environmental Noise Guidance (ENG) unifies the optimization objectives of PTQ and QAT with a single hyperparameter; 3) General Monarch Branch (GMB) introduces a structured sparse branch to mitigate information loss at ultra-low bit widths. Result: On DiT-XL/2, TreeQ reaches SOTA performance under W3A3 and W4A4 PTQ/PEFT settings and is the first to achieve near-lossless 4-bit PTQ. Conclusion: TreeQ provides an efficient and scalable quantization solution for DiT architectures, improving the prospects for deploying diffusion models in resource-constrained environments. Abstract: Diffusion Transformers (DiTs) have emerged as a highly scalable and effective backbone for image generation, outperforming U-Net architectures in both scalability and performance. However, their real-world deployment remains challenging due to high computational and memory demands. Mixed-Precision Quantization (MPQ), designed to push the limits of quantization, has demonstrated remarkable success in advancing U-Net quantization to sub-4bit settings while significantly reducing computational and memory overhead. Nevertheless, its application to DiT architectures remains limited and underexplored. In this work, we propose TreeQ, a unified framework addressing key challenges in DiT quantization. First, to tackle inefficient search and proxy misalignment, we introduce Tree Structured Search (TSS). This DiT-specific approach leverages the architecture's linear properties to traverse the solution space in O(n) time while improving objective accuracy through comparison-based pruning. Second, to unify optimization objectives, we propose Environmental Noise Guidance (ENG), which aligns Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) configurations using a single hyperparameter. Third, to mitigate information bottlenecks in ultra-low-bit regimes, we design the General Monarch Branch (GMB). This structured sparse branch prevents irreversible information loss, enabling finer detail generation.
Through extensive experiments, our TreeQ framework demonstrates state-of-the-art performance on DiT-XL/2 under W3A3 and W4A4 PTQ/PEFT settings. Notably, our work is the first to achieve near-lossless 4-bit PTQ performance on DiT models. The code and models will be available at https://github.com/racoonykc/TreeQ

[122] Rectifying Latent Space for Generative Single-Image Reflection Removal

Mingjia Li,Jin Hu,Hainuo Wang,Qiming Hu,Jiarui Wang,Xiaojie Guo

Main category: cs.CV

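TL;DR: Proposes a single-image reflection removal method based on a latent diffusion model, using a reflection-equivariant VAE, a learnable text embedding, and a depth-guided sampling strategy to accurately interpret composite images and restore them with high quality.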

Details Motivation: Existing methods struggle with highly ambiguous composite regions in single-image reflection removal because the latent space of semantic encoders lacks the structure needed to interpret layered images. Method: A reflection-equivariant VAE matches the linear physics of reflection formation, a task-specific learnable text embedding avoids linguistic ambiguity, and a depth-guided early-branching sampling strategy exploits generative stochasticity to improve results. Result: The method reaches SOTA performance on multiple benchmarks and generalizes well to complex real-world scenes. Conclusion: The method resolves the latent-space structure mismatch in single-image reflection removal, markedly improving restoration quality and practical applicability. Abstract: Single-image reflection removal is a highly ill-posed problem, where existing methods struggle to reason about the composition of corrupted regions, causing them to fail at recovery and generalization in the wild. This work reframes an editing-purpose latent diffusion model to effectively perceive and process highly ambiguous, layered image inputs, yielding high-quality outputs. We argue that the challenge of this conversion stems from a critical yet overlooked issue, i.e., the latent space of semantic encoders lacks the inherent structure to interpret a composite image as a linear superposition of its constituent layers. Our approach is built on three synergistic components, including a reflection-equivariant VAE that aligns the latent space with the linear physics of reflection formation, a learnable task-specific text embedding for precise guidance that bypasses ambiguous language, and a depth-guided early-branching sampling strategy to harness generative stochasticity for promising results. Extensive experiments reveal that our model achieves new SOTA performance on multiple benchmarks and generalizes well to challenging real-world cases.

[123] Spoofing-aware Prompt Learning for Unified Physical-Digital Facial Attack Detection

Jiabao Guo,Yadian Wang,Hui Ma,Yuhao Fu,Ju Jia,Hui Liu,Shengeng Tang,Lechao Cheng,Yunfeng Diao,Ajian Liu

Main category: cs.CV

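TL;DR: Proposes SPL-UAD, a spoofing-aware prompt learning framework for unified physical and digital attack detection. It decouples the optimization branches in the prompt space and adds adaptive context prompt generation and cues-aware augmentation, markedly improving robustness and detection performance on unseen attack types.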

Details Motivation: In unified physical and digital face spoofing detection, existing methods share a prompt space, which causes conflicting optimization directions, makes it hard to detect both attack classes well, and limits generalization to unknown attack types. Method: The SPL-UAD framework 1) builds learnable parallel prompt branches with adaptive Spoofing Context Prompt Generation, so that physical and digital attack detection are optimized independently; and 2) adds a Cues-awareness Augmentation module that uses the dual-prompt mechanism to generate challenging sample-mining tasks, improving robustness. Result: Extensive experiments on the large-scale UniAttackDataPlus dataset show that the method significantly outperforms existing methods on unified attack detection, with stronger generalization and robustness on unknown attack types in particular. Conclusion: By decoupling the optimization paths in the prompt space, SPL-UAD resolves the optimization conflict between physical and digital spoofing detection, provides more comprehensive and robust face anti-spoofing protection, and suggests a new direction for unified multimodal attack defense. Abstract: Real-world face recognition systems are vulnerable to both physical presentation attacks (PAs) and digital forgery attacks (DFs). We aim to achieve comprehensive protection of biometric data by implementing a unified physical-digital defense framework with advanced detection. Existing approaches primarily employ CLIP with regularization constraints to enhance model generalization across both tasks. However, these methods suffer from conflicting optimization directions between physical and digital attack detection under the same category prompt spaces. To overcome this limitation, we propose a Spoofing-aware Prompt Learning for Unified Attack Detection (SPL-UAD) framework, which decouples optimization branches for physical and digital attacks in the prompt space. Specifically, we construct a learnable parallel prompt branch enhanced with adaptive Spoofing Context Prompt Generation, enabling independent control of optimization for each attack type. Furthermore, we design a Cues-awareness Augmentation that leverages the dual-prompt mechanism to generate challenging sample mining tasks on data, significantly enhancing the model's robustness against unseen attack types. Extensive experiments on the large-scale UniAttackDataPlus dataset demonstrate that the proposed method achieves significant performance improvements in unified attack detection tasks.

[124] Human3R: Incorporating Human Priors for Better 3D Dynamic Reconstruction from Monocular Videos

Weitao Xiong,Zhiyuan Yuan,Jiahao Lu,Chengfeng Zhao,Peng Li,Yuan Liu

Main category: cs.CV

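TL;DR: Proposes Human3R, which combines the SMPL human body model with monocular depth estimation and uses a hierarchical pipeline and a feature fusion module to improve geometric consistency and detail accuracy in monocular video reconstruction of dynamic scenes.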

Details Motivation: Existing monocular dynamic video reconstruction methods lack an understanding of 3D human structure. In dynamic human scenes they produce geometric inconsistencies, distorted limb proportions, unnatural human-object fusion, and human boundary drift caused by downsampling. Method: Human3R uses a hierarchical pipeline that first processes full-resolution images to obtain overall scene geometry, then enhances human detail through cropping and cross-attention fusion. It combines SMPL priors with monocular depth estimation, and a Feature Fusion Module integrates the hybrid geometric priors to keep the human surface consistent and preserve fine boundaries. Result: Extensive experiments on the TUM Dynamics and GTA-IM datasets show that the method outperforms existing approaches in dynamic human reconstruction, with markedly better geometric plausibility and detail recovery. Conclusion: By fusing SMPL human priors with monocular depth estimation, Human3R addresses inconsistent human structure and lost detail in monocular dynamic video reconstruction, yielding more realistic and stable 3D dynamic human reconstruction. Abstract: Monocular dynamic video reconstruction faces significant challenges in dynamic human scenes due to geometric inconsistencies and resolution degradation issues. Existing methods lack 3D human structural understanding, producing geometrically inconsistent results with distorted limb proportions and unnatural human-object fusion, while memory-constrained downsampling causes human boundary drift toward background geometry. To address these limitations, we propose to incorporate hybrid geometric priors that combine SMPL human body models with monocular depth estimation. Our approach leverages structured human priors to maintain surface consistency while capturing fine-grained geometric details in human regions. We introduce Human3R, featuring a hierarchical pipeline with refinement components that processes full-resolution images for overall scene geometry, then applies strategic cropping and cross-attention fusion for human-specific detail enhancement. The method integrates SMPL priors through a Feature Fusion Module to ensure geometrically plausible reconstruction while preserving fine-grained human boundaries. Extensive experiments on TUM Dynamics and GTA-IM datasets demonstrate superior performance in dynamic human reconstruction.

[125] VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

Yuji Wang,Wenlong Liu,Jingxuan Niu,Haoji Zhang,Yansong Tang

Main category: cs.CV

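TL;DR: This paper proposes VG-Refiner, the first framework for tool-refined referring grounded reasoning. A two-stage think-rethink mechanism and a refinement reward handle unreliable visual tool outputs, improving the model's accuracy and correction ability on referring expression and reasoning grounding tasks.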

Details Motivation: Existing tool-integrated visual reasoning (TiVR) methods neglect mechanisms for responding to unreliable or erroneous tool outputs; in referring and grounding tasks in particular, inaccurate detection tool predictions easily lead models into hallucinated reasoning. Method: The VG-Refiner framework uses a two-stage think-rethink mechanism that lets the model explicitly analyze and respond to tool feedback, and a refinement reward that encourages effective correction of poor tool results; two new metrics and fair evaluation protocols systematically measure refinement ability. Result: Using a small amount of task-specific data to strengthen refinement, VG-Refiner significantly improves accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model. Conclusion: VG-Refiner is the first to provide a fine-grained response mechanism for tool feedback, mitigating reasoning hallucinations caused by erroneous tool outputs and suggesting a new direction for more robust multimodal reasoning systems. Abstract: Tool-integrated visual reasoning (TiVR) has demonstrated great potential in enhancing multimodal problem-solving. However, existing TiVR paradigms mainly focus on integrating various visual tools through reinforcement learning, while neglecting to design effective response mechanisms for handling unreliable or erroneous tool outputs. This limitation is particularly pronounced in referring and grounding tasks, where inaccurate detection tool predictions often mislead TiVR models into generating hallucinated reasoning. To address this issue, we propose VG-Refiner, the first framework aimed at tool-refined referring grounded reasoning. Technically, we introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback, along with a refinement reward that encourages effective correction in response to poor tool results. In addition, we propose two new metrics and establish fair evaluation protocols to systematically measure the refinement ability of current models. We adopt a small amount of task-specific data to enhance the refinement capability of VG-Refiner, achieving a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.

[126] Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework

Xinhao Xiang,Abhijeet Rastogi,Jiawei Zhang

Main category: cs.CV

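TL;DR: This paper studies how reliable AI-generated driving videos (AIGVs) are for training and evaluating autonomous driving models. It presents a diagnostic framework, the ADGV-Bench benchmark with a taxonomy of common failure modes, and ADGVE, a driving-aware quality evaluator; experiments show that filtering AIGVs with ADGVE improves perception performance and video quality.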

Details Motivation: Although AIGVs offer a low-cost, scalable data source for autonomous driving, their common defects in visuals, motion, and traffic semantics may harm downstream task performance, so their usability needs systematic evaluation. Method: The paper proposes a diagnostic framework, builds the ADGV-Bench benchmark with human annotations and dense labels, and designs the ADGVE evaluator, which fuses static semantics, temporal information, lane obedience, and VLM reasoning to quantify AIGV quality and guide data filtering. Result: Experiments show that using raw AIGVs directly degrades perception performance, whereas filtering with ADGVE markedly improves both video quality assessment metrics and downstream model performance. Conclusion: AIGVs can become a useful complement to real data but require careful filtering; the study provides practical tools and methodology for safely using large-scale generated video. Abstract: Recent text-to-video models have enabled the generation of high-resolution driving scenes from natural language prompts. These AI-generated driving videos (AIGVs) offer a low-cost, scalable alternative to real or simulator data for autonomous driving (AD). But a key question remains: can such videos reliably support training and evaluation of AD models? We present a diagnostic framework that systematically studies this question. First, we introduce a taxonomy of frequent AIGV failure modes, including visual artifacts, physically implausible motion, and violations of traffic semantics, and demonstrate their negative impact on object detection, tracking, and instance segmentation. To support this analysis, we build ADGV-Bench, a driving-focused benchmark with human quality annotations and dense labels for multiple perception tasks. We then propose ADGVE, a driving-aware evaluator that combines static semantics, temporal cues, lane obedience signals, and Vision-Language Model (VLM)-guided reasoning into a single quality score for each clip. Experiments show that blindly adding raw AIGVs can degrade perception performance, while filtering them with ADGVE consistently improves both general video quality assessment metrics and downstream AD models, and turns AIGVs into a beneficial complement to real-world data. Our study highlights both the risks and the promise of AIGVs, and provides practical tools for safely leveraging large-scale video generation in future AD pipelines.

[127] VAD-Net: Multidimensional Facial Expression Recognition in Intelligent Education System

Yi Huo,Yun Ge

Main category: cs.CV

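TL;DR: This study introduces the first complete three-dimensional VAD (Valence-Arousal-Dominance) emotion annotation on the FER2013 dataset, in particular adding the missing Dominance dimension, and proposes a network with orthogonal convolution to improve VAD prediction accuracy.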

Details Motivation: Existing facial expression recognition datasets are mostly labeled with discrete emotion categories, which are limited in expressiveness; datasets such as AffectNet add VA information but still lack the D dimension. More comprehensive and precise affective computing requires the full multidimensional VAD measure. Method: The FER2013 dataset is manually annotated with continuous three-dimensional VAD labels, including the previously missing D dimension, and a ResNet-based regression network with orthogonal convolution is designed to increase feature diversity and prediction performance. Result: Experiments show that the D dimension is harder to annotate and predict than V and A; ablations show that orthogonal convolution significantly improves VAD prediction accuracy; the first FER2013 dataset with complete VAD annotation is built and the code is open-sourced. Conclusion: The study fills the gap in D-dimension annotation for FER datasets; the orthogonal convolution network improves VAD prediction, and the released dataset and model can serve as a new benchmark for multidimensional emotion recognition. Abstract: Current FER (Facial Expression Recognition) datasets are mostly labeled with emotion categories, such as happy, angry, sad, fear, disgust, surprise, and neutral, which are limited in expressiveness. However, future affective computing requires more comprehensive and precise emotion metrics, which can be measured with multidimensional VAD (Valence-Arousal-Dominance) parameters. To address this, AffectNet added VA (Valence and Arousal) information but still lacks D (Dominance). This research therefore introduces VAD annotation on the FER2013 dataset and takes the initiative to label the D (Dominance) dimension. Then, to further improve network capacity, it enforces orthogonalized convolution, which extracts more diverse and expressive features and ultimately increases prediction accuracy. Experimental results show that the D dimension can be measured but is harder to obtain than the V and A dimensions, whether in manual annotation or in regression network prediction. In addition, the ablation test on orthogonal convolution verifies that better VAD prediction is obtained in the orthogonal convolution configuration. The research thus provides an initial labeling of the D dimension on an FER dataset and proposes a better network for VAD prediction through orthogonal convolution. The newly built VAD-annotated FER2013 dataset could act as a benchmark for measuring VAD multidimensional emotions, while the orthogonalized regression network based on ResNet could act as a facial expression recognition baseline for VAD emotion prediction.
The newly labeled dataset and implementation code are publicly available at https://github.com/YeeHoran/VAD-Net.

[128] OCFER-Net: Recognizing Facial Expression in Online Learning System

Yi Huo,Lei Zhang

Main category: cs.CV

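TL;DR: This paper proposes OCFER-Net, a new facial expression recognition network that applies orthogonal regularization to convolutional kernels to increase feature diversity and expressiveness, outperforming baseline models on the FER-2013 dataset.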

Details Motivation: Emotional interaction matters in online learning, and few existing methods exploit the orthogonality of the convolutional matrix, so the paper aims to improve facial expression recognition accuracy. Method: An orthogonal regularization constraint on the convolutional kernels yields the OCFER-Net model, which extracts more diverse and expressive features. Result: Experiments on the FER-2013 dataset show an improvement of 1.087 over baseline models. Conclusion: By exploiting orthogonality, OCFER-Net improves facial expression recognition accuracy and helps teachers better understand students' emotional state. Abstract: Online learning has recently become very popular, especially during the global COVID-19 epidemic. Besides knowledge distribution, emotional interaction is also very important, and it can be captured with Facial Expression Recognition (FER). Since FER accuracy is important in helping teachers understand students' emotional state, the project explores a series of FER methods and finds that few works exploit the orthogonality of the convolutional matrix. Therefore, it enforces orthogonality on kernels with a regularizer, which extracts features with more diversity and expressiveness, and delivers OCFER-Net. Experiments are carried out on FER-2013, a challenging dataset. Results show superior performance over baselines by 1.087. The code of the research project is publicly available at https://github.com/YeeHoran/OCFERNet.
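
The abstract does not give the exact form of the orthogonality regularizer. A common choice, assumed here purely for illustration, is the soft penalty $\lVert W W^{\top} - I \rVert_F^2$, where each row of $W$ is one flattened convolution kernel. The function name `orthogonality_penalty` is hypothetical and is not taken from the OCFERNet repository.

```python
def orthogonality_penalty(kernels):
    """Soft orthogonality penalty ||W W^T - I||_F^2 (assumed form).

    `kernels` is a list of flattened kernels (rows of W). The penalty is
    zero exactly when the rows are orthonormal.
    """
    n = len(kernels)
    penalty = 0.0
    for i in range(n):
        for j in range(n):
            gram_ij = sum(a * b for a, b in zip(kernels[i], kernels[j]))
            target = 1.0 if i == j else 0.0  # entry (i, j) of the identity
            penalty += (gram_ij - target) ** 2
    return penalty
```

In training, a term like this would typically be added to the classification loss with a small weight, so that kernels are pushed toward mutually distinct directions rather than forced to be exactly orthogonal.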

[129] Perceptual Region-Driven Infrared-Visible Co-Fusion for Extreme Scene Enhancement

Jing Tao,Yonghong Zong,Banglei Guan,Pengju Sun,Taihang Lei,Yang Shang,Qifeng Yu

Main category: cs.CV

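TL;DR: Proposes a region perception-based infrared and visible image fusion framework that combines multi-exposure and multi-modal imaging, improving fusion quality and geometric fidelity under extreme conditions.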

Details Motivation: Existing methods often degrade visible image quality when fusing infrared and visible images, which harms measurement accuracy, and they perform poorly in extreme environments. Method: Region perception-based feature fusion provides precise registration, followed by adaptive fusion with contrast enhancement and a structural similarity compensation mechanism guided by regional saliency maps; the framework supports both multi-exposure and single-exposure scenarios. Result: Experiments on synthetic and real data show better image clarity and fusion quality than state-of-the-art methods in both quantitative and visual evaluations. Conclusion: The framework addresses quality degradation in infrared-visible fusion under extreme conditions, retaining thermal radiation information while faithfully preserving the geometric features of the visible image. Abstract: In photogrammetry, accurately fusing infrared (IR) and visible (VIS) spectra while preserving the geometric fidelity of visible features and incorporating thermal radiation is a significant challenge, particularly under extreme conditions. Existing methods often compromise visible imagery quality, impacting measurement accuracy. To solve this, we propose a region perception-based fusion framework that combines multi-exposure and multi-modal imaging using a spatially varying exposure (SVE) camera. This framework co-fuses multi-modal and multi-exposure data, overcoming single-exposure method limitations in extreme environments. The framework begins with region perception-based feature fusion to ensure precise multi-modal registration, followed by adaptive fusion with contrast enhancement. A structural similarity compensation mechanism, guided by regional saliency maps, optimizes IR-VIS spectral integration. Moreover, the framework adapts to single-exposure scenarios for robust fusion across different conditions. Experiments conducted on both synthetic and real-world data demonstrate superior image clarity and improved performance compared to state-of-the-art methods, as evidenced by both quantitative and visual evaluations.

[130] Rethinking Training Dynamics in Scale-wise Autoregressive Generation

Gengze Zhou,Chongjian Ge,Hao Tan,Feng Liu,Yicong Hong

Main category: cs.CV

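TL;DR: This paper proposes Self-Autoregressive Refinement (SAR) to address exposure bias in scale-wise autoregressive (AR) generative models. A Stagger-Scale Rollout (SSR) mechanism and a Contrastive Student-Forcing Loss (CSFL) improve image generation quality, and experiments show clear gains on pretrained models at very low computational cost.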

Details Motivation: Scale-wise autoregressive models for media synthesis suffer from exposure bias, caused mainly by train-test mismatch and by imbalanced learning difficulty across scales; this degrades generation quality and calls for an effective, efficient post-training method. Method: SAR has two core components: the Stagger-Scale Rollout (SSR) mechanism, which uses lightweight autoregressive rollouts to expose the model to its own intermediate predictions during training and so aligns train and test patterns; and the Contrastive Student-Forcing Loss (CSFL), which provides adequate supervision for self-generated contexts to stabilize training. Result: Applied to the pretrained FlexVAR-d16 model on ImageNet 256, SAR yields a 5.2% FID reduction within only 10 epochs (about 5 hours on 32 A100 GPUs), showing that it improves generation quality quickly. Conclusion: SAR is an efficient, scalable, and effective post-training strategy that markedly reduces exposure bias in scale-wise autoregressive models and could become a standard optimization component for visual autoregressive generation. Abstract: Recent advances in autoregressive (AR) generative models have produced increasingly powerful systems for media synthesis. Among them, next-scale prediction has emerged as a popular paradigm, where models generate images in a coarse-to-fine manner. However, scale-wise AR models suffer from exposure bias, which undermines generation quality. We identify two primary causes of this issue: (1) train-test mismatch, where the model must rely on its own imperfect predictions during inference, and (2) imbalance in scale-wise learning difficulty, where certain scales exhibit disproportionately higher optimization complexity. Through a comprehensive analysis of training dynamics, we propose Self-Autoregressive Refinement (SAR) to address these limitations. SAR introduces a Stagger-Scale Rollout (SSR) mechanism that performs lightweight autoregressive rollouts to expose the model to its own intermediate predictions, thereby aligning train-test patterns, and a complementary Contrastive Student-Forcing Loss (CSFL) that provides adequate supervision for self-generated contexts to ensure stable training. Experimental results show that applying SAR to pretrained AR models consistently improves generation quality with minimal computational overhead. For instance, SAR yields a 5.2% FID reduction on FlexVAR-d16 trained on ImageNet 256 within 10 epochs (5 hours on 32xA100 GPUs). Given its efficiency, scalability, and effectiveness, we expect SAR to serve as a reliable post-training method for visual autoregressive generation.

[131] A Perception CNN for Facial Expression Recognition

Chunwei Tian,Jingyuan Xie,Lingjun Li,Wangmeng Zuo,Yanning Zhang,David Zhang

Main category: cs.CV

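TL;DR: This paper proposes a perception convolutional neural network (PCNN) for facial expression recognition (FER). Parallel networks capture local facial features, a multi-domain interaction mechanism fuses local and global features, and a two-phase loss function improves recognition accuracy, yielding strong performance on multiple datasets.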

Details Motivation: Existing CNNs for FER may ignore the effect of facial segmentation and fail to capture subtle expression changes, so a method is needed that sensitively extracts local organ features and fuses them with global structural information. Method: The PCNN model uses five parallel networks to learn features of local regions such as the eyes, cheeks, and mouth; a multi-domain interaction mechanism aligns and fuses local organ features with global facial structural features; and a two-phase loss function constrains the accuracy of the perceptual information and the quality of the reconstructed images. Result: PCNN performs strongly on CK+, JAFFE, FER2013, FERPlus, RAF-DB, and the Occlusion and Pose Variant Dataset, confirming its effectiveness in both lab and real-world settings. Conclusion: Through locally sensitive perception and multi-domain feature fusion, PCNN markedly improves FER performance and remains robust under complex conditions, providing an effective framework for deep-learning-based FER. Abstract: Convolutional neural networks (CNNs) can automatically learn data patterns to express face images for facial expression recognition (FER). However, they may ignore the effect of facial segmentation on FER. In this paper, we propose a perception CNN for FER, termed PCNN. Firstly, PCNN uses five parallel networks to simultaneously learn local facial features based on eyes, cheeks and mouth, enabling sensitive capture of subtle changes in FER. Secondly, we utilize a multi-domain interaction mechanism to register and fuse local sense organ features and global facial structural features to better express face images for FER. Finally, we design a two-phase loss function to constrain the accuracy of the obtained sense information and reconstructed face images, guaranteeing the performance of the obtained PCNN in FER. Experimental results show that our PCNN achieves superior results on several lab and real-world FER benchmarks: CK+, JAFFE, FER2013, FERPlus, RAF-DB and Occlusion and Pose Variant Dataset. Its code is available at https://github.com/hellloxiaotian/PCNN.

[132] DragMesh: Interactive 3D Generation Made Easy

Tianshan Zhang,Zeyu Zhang,Hao Tang

Main category: cs.CV

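TL;DR: This paper proposes DragMesh, a framework for real-time interactive 3D motion that decouples motion generation from kinematic reasoning, producing plausible, fast, and physically consistent articulated motion on novel objects.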

Details Motivation: Existing articulated motion methods struggle to combine physical consistency with generation speed and often violate basic kinematic constraints, so a solution is needed that runs in real time while remaining kinematically correct. Method: A decoupled framework first predicts latent joint parameters with KPP-Net, separating semantic intent reasoning from geometric regression; a Dual Quaternion VAE (DQ-VAE) then combines these with the user's drag input to generate a complete motion trajectory, with FiLM conditioning at every layer and a cross-product loss enforcing kinematic constraints. Result: DragMesh runs in real time and, while maintaining strict kinematic consistency, generates plausible interactive motion on unseen objects without retraining. Conclusion: By decoupling kinematic reasoning from motion generation, DragMesh offers an efficient, general, and physically plausible solution for real-time 3D interaction, advancing generative 3D intelligence. Abstract: While generative models have excelled at creating static 3D content, the pursuit of systems that understand how objects move and respond to interactions remains a fundamental challenge. Current methods for articulated motion lie at a crossroads: they are either physically consistent but too slow for real-time use, or generative but violate basic kinematic constraints. We present DragMesh, a robust framework for real-time interactive 3D articulation built around a lightweight motion generation core. Our core contribution is a novel decoupled kinematic reasoning and motion generation framework. First, we infer the latent joint parameters by decoupling semantic intent reasoning (which determines the joint type) from geometric regression (which determines the axis and origin using our Kinematics Prediction Network (KPP-Net)). Second, to leverage the compact, continuous, and singularity-free properties of dual quaternions for representing rigid body motion, we develop a novel Dual Quaternion VAE (DQ-VAE). This DQ-VAE receives these predicted priors, along with the original user drag, to generate a complete, plausible motion trajectory. To ensure strict adherence to kinematics, we inject the joint priors at every layer of the DQ-VAE's non-autoregressive Transformer decoder using FiLM (Feature-wise Linear Modulation) conditioning. This persistent, multi-scale guidance is complemented by a numerically-stable cross-product loss to guarantee axis alignment.
This decoupled design allows DragMesh to achieve real-time performance and enables plausible, generative articulation on novel objects without retraining, offering a practical step toward generative 3D intelligence. Code: https://github.com/AIGeeksGroup/DragMesh. Website: https://aigeeksgroup.github.io/DragMesh.
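
Two ingredients named in the abstract can be illustrated with a minimal sketch. FiLM conditioning is the standard feature-wise affine modulation `gamma * x + beta`. The exact form of DragMesh's cross-product loss is not given, so the version below (squared norm of the cross product of the normalized axes, which is zero when they are parallel) is an assumption, as are the function names.

```python
import math


def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each feature."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def _normalize(v, eps=1e-8):
    norm = math.sqrt(sum(c * c for c in v)) + eps  # eps avoids division by zero
    return [c / norm for c in v]


def cross_product_loss(pred_axis, true_axis):
    """Squared norm of the cross product of the unit axes (assumed form).

    Equals sin^2 of the angle between the axes: 0 when they are parallel
    (or anti-parallel), 1 when they are perpendicular.
    """
    a = _normalize(pred_axis)
    b = _normalize(true_axis)
    cross = [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]
    return sum(c * c for c in cross)
```

In a conditioned decoder, `gamma` and `beta` would be predicted from the joint priors by a small network; here they are plain lists to keep the sketch self-contained.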

[133] When Gender is Hard to See: Multi-Attribute Support for Long-Range Recognition

Nzakiese Mbongo,Kailash A. Hambarde,Hugo Proença

Main category: cs.CV

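TL;DR: This paper proposes a CLIP-based dual-path transformer framework for gender recognition under extreme long-range conditions, modeling visual and attribute cues in two paths to achieve robust recognition despite low resolution and varying viewpoints.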

Details Motivation: Gender recognition in long-range imagery faces low resolution, large pose variation, and missing facial features, which existing methods handle poorly, so a more robust and interpretable approach is needed. Method: A dual-path framework: 1) a direct visual path that selectively fine-tunes the upper layers of the CLIP image encoder; 2) an attribute-mediated path that infers gender from soft-biometric prompts (such as hairstyle and clothing) in the CLIP text-image space; spatial channel attention modules improve localization. Result: On the self-built unified long-range gender dataset U-DetAGReID, the method outperforms existing attribute recognition and re-identification baselines on multiple metrics (macro-F1, accuracy, AUC) and is robust to distance, angle, and height variations; attention visualizations show interpretable behavior and sensible abstention. Conclusion: Language-guided dual-path learning offers a principled, extensible foundation for gender recognition in unconstrained long-range scenarios, balancing performance with responsibility. Abstract: Accurate gender recognition from extreme long-range imagery remains a challenging problem due to limited spatial resolution, viewpoint variability, and loss of facial cues. To this end, we present a dual-path transformer framework that leverages CLIP to jointly model visual and attribute-driven cues for gender recognition at a distance. The framework integrates two complementary streams: (1) a direct visual path that refines a pre-trained CLIP image encoder through selective fine-tuning of its upper layers, and (2) an attribute-mediated path that infers gender from a set of soft-biometric prompts (e.g., hairstyle, clothing, accessories) aligned in the CLIP text-image space. Spatial channel attention modules further enhance discriminative localization under occlusion and low resolution. To support large-scale evaluation, we construct U-DetAGReID, a unified long-range gender dataset derived from DetReIDx and AG-ReID.v2, harmonized under a consistent ternary labeling scheme (Male, Female, Unknown). Extensive experiments suggest that the proposed solution surpasses state-of-the-art person-attribute and re-identification baselines across multiple metrics (macro-F1, accuracy, AUC), with consistent robustness to distance, angle, and height variations. Qualitative attention visualizations confirm interpretable attribute localization and responsible abstention behavior. Our results show that language-guided dual-path learning offers a principled, extensible foundation for responsible gender recognition in unconstrained long-range scenarios.

[134] Automated Deep Learning Estimation of Anthropometric Measurements for Preparticipation Cardiovascular Screening

Lucas R. Mareque,Ricardo L. Armentano,Leandro J. Cymberknop

Main category: cs.CV

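TL;DR: Proposes a fully automated deep-learning method that estimates five key anthropometric measurements from 2D synthetic human body images for athlete cardiovascular screening; the ResNet50 model reaches sub-centimeter accuracy (mean MAE 0.668 cm) and shows potential to scale.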

Details Motivation: Traditional manual anthropometry is time-consuming, operator-dependent, and hard to scale, which limits its use in athlete cardiovascular screening; an automated, accurate method is needed to make screening more efficient and accessible. Method: Using a dataset of 100,000 2D synthetic human body images generated from 3D body meshes, VGG19, ResNet50, and DenseNet121 with fully connected layers are trained for regression to estimate five key anthropometric measurements automatically. Result: All deep-learning models achieve sub-centimeter accuracy; ResNet50 performs best, with a mean MAE of 0.668 cm across the five measurements. Conclusion: Deep learning can deliver accurate anthropometric data at scale and is a promising tool to complement existing athlete cardiovascular screening protocols; the models still need to be validated on real-world images to extend applicability. Abstract: Preparticipation cardiovascular examination (PPCE) aims to prevent sudden cardiac death (SCD) by identifying athletes with structural or electrical cardiac abnormalities. Anthropometric measurements, such as waist circumference, limb lengths, and torso proportions to detect Marfan syndrome, can indicate elevated cardiovascular risk. Traditional manual methods are labor-intensive, operator-dependent, and challenging to scale. We present a fully automated deep-learning approach to estimate five key anthropometric measurements from 2D synthetic human body images. Using a dataset of 100,000 images derived from 3D body meshes, we trained and evaluated VGG19, ResNet50, and DenseNet121 with fully connected layers for regression. All models achieved sub-centimeter accuracy, with ResNet50 performing best, achieving a mean MAE of 0.668 cm across all measurements. Our results demonstrate that deep learning can deliver accurate anthropometric data at scale, offering a practical tool to complement athlete screening protocols. Future work will validate the models on real-world images to extend applicability.
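
For readers unfamiliar with the metric, a figure like "mean MAE across all measurements" is presumably obtained by computing the mean absolute error per measurement and then averaging those values. The sketch below shows that arithmetic; the helper names and the dictionary layout are illustrative assumptions, not the authors' evaluation code.

```python
def mae(predictions, targets):
    """Mean absolute error between two equal-length lists (in cm)."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)


def mean_mae(pred_by_measure, target_by_measure):
    """Per-measurement MAE plus their unweighted average.

    Both arguments map a measurement name to a list of values.
    """
    per_measure = {
        name: mae(pred_by_measure[name], target_by_measure[name])
        for name in pred_by_measure
    }
    return per_measure, sum(per_measure.values()) / len(per_measure)
```

Averaging per-measurement errors weights each measurement equally, regardless of its typical magnitude.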

[135] AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars

Ramazan Fazylov,Sergey Zagoruyko,Aleksandr Parkin,Stamatis Lefkimmiatis,Ivan Laptev

Main category: cs.CV

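TL;DR: This paper proposes AGORA, a generative adversarial framework built on 3D Gaussian Splatting (3DGS) for generating high-fidelity, animatable 3D avatars. Its core contribution is a lightweight, FLAME-conditioned deformation branch that gives fine-grained expression control and supports real-time inference, yielding what the authors describe as the first practical CPU-only animatable 3DGS avatar synthesis.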

Details Motivation: Existing methods based on implicit representations such as NeRFs render slowly and show dynamic inconsistencies, while 3DGS methods are mostly limited to static head generation and lack dynamic control. An avatar generation method is therefore needed that keeps rendering quality high while supporting efficient animation control. Method: The AGORA framework combines 3DGS with a generative adversarial network and adds a lightweight, FLAME-conditioned deformation branch that predicts a residual displacement for each Gaussian, giving identity-preserving expression control; a dual-discriminator training scheme uses synthetic renderings of the parametric mesh to enforce expression fidelity. Result: AGORA outperforms state-of-the-art NeRF-based methods in expression accuracy and visual realism, renders at more than 250 FPS on a single GPU and at about 9 FPS on CPU, and is reported as the first practical CPU-only animatable 3DGS avatar synthesis. Conclusion: AGORA makes high-performance digital humans more practical, with broad applications in VR, telepresence, and entertainment, and is a step toward efficient, controllable 3D avatar generation. Abstract: The generation of high-fidelity, animatable 3D human avatars remains a core challenge in computer graphics and vision, with applications in VR, telepresence, and entertainment. Existing approaches based on implicit representations like NeRFs suffer from slow rendering and dynamic inconsistencies, while 3D Gaussian Splatting (3DGS) methods are typically limited to static head generation, lacking dynamic control. We bridge this gap by introducing AGORA, a novel framework that extends 3DGS within a generative adversarial network to produce animatable avatars. Our key contribution is a lightweight, FLAME-conditioned deformation branch that predicts per-Gaussian residuals, enabling identity-preserving, fine-grained expression control while allowing real-time inference. Expression fidelity is enforced via a dual-discriminator training scheme leveraging synthetic renderings of the parametric mesh. AGORA generates avatars that are not only visually realistic but also precisely controllable. Quantitatively, we outperform state-of-the-art NeRF-based methods on expression accuracy while rendering at 250+ FPS on a single GPU, and, notably, at ~9 FPS under CPU-only inference - representing, to our knowledge, the first demonstration of practical CPU-only animatable 3DGS avatar synthesis. This work represents a significant step toward practical, high-performance digital humans. Project website: https://ramazan793.github.io/AGORA/

[136] Towards Stable Cross-Domain Depression Recognition under Missing Modalities

Jiuyi Chen,Mingkui Tan,Haifeng Lu,Qiuna Xu,Zhihua Wang,Runhao Zeng,Xiping Hu

Main category: cs.CV

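TL;DR: SCD-MLLM, a unified framework for stable cross-domain depression recognition based on a multimodal large language model, is proposed; a multi-source data input adapter and a modality-aware fusion module make it robust to heterogeneous data and missing modalities, and it outperforms existing methods on multiple datasets.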

Details Motivation: Existing audio- and video-based automatic depression detection methods lack a unified, generalizable framework and are unstable when modalities are missing, so they adapt poorly to the diverse and incomplete data found in real-world settings. Method: The SCD-MLLM framework is proposed, comprising a Multi-Source Data Input Adapter (MDIA) that converts heterogeneous inputs into uniform token sequences and a Modality-Aware Adaptive Fusion Module (MAFM) that adaptively fuses audio and visual features through a shared projection mechanism. Result: In multi-dataset joint training experiments on five public, heterogeneous depression datasets, SCD-MLLM outperforms state-of-the-art models and commercial LLMs (such as Gemini and GPT) under both complete and partial modality conditions, showing stronger cross-domain generalization and stability under missing modalities. Conclusion: SCD-MLLM provides an efficient and stable solution for cross-domain, multi-source depression detection with incomplete modalities, and advances the practical use of multimodal large models in mental health screening. Abstract: Depression poses serious public health risks, including suicide, underscoring the urgency of timely and scalable screening. Multimodal automatic depression detection (ADD) offers a promising solution; however, widely studied audio- and video-based ADD methods lack a unified, generalizable framework for diverse depression recognition scenarios and show limited stability to missing modalities, which are common in real-world data. In this work, we propose a unified framework for Stable Cross-Domain Depression Recognition based on Multimodal Large Language Model (SCD-MLLM). The framework supports the integration and processing of heterogeneous depression-related data collected from varied sources while maintaining stability in the presence of incomplete modality inputs. Specifically, SCD-MLLM introduces two key components: (i) Multi-Source Data Input Adapter (MDIA), which employs a masking mechanism and task-specific prompts to transform heterogeneous depression-related inputs into uniform token sequences, addressing inconsistency across diverse data sources; (ii) Modality-Aware Adaptive Fusion Module (MAFM), which adaptively integrates audio and visual features via a shared projection mechanism, enhancing resilience under missing modality conditions. We conduct comprehensive experiments under multi-dataset joint training settings on five publicly available and heterogeneous depression datasets from diverse scenarios: CMDC, AVEC2014, DAIC-WOZ, DVlog, and EATD.
Across both complete and partial modality settings, SCD-MLLM outperforms state-of-the-art (SOTA) models as well as leading commercial LLMs (Gemini and GPT), demonstrating superior cross-domain generalization, enhanced ability to capture multimodal cues of depression, and strong stability to missing modality cases in real-world applications.

[137] Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction

Kush Revankar,Shreyas Deshpande,Araham Sayeed,Ansh Tandale,Sarika Bobde

Main category: cs.CV

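TL;DR: Sanvaad is a lightweight multimodal accessibility framework that supports real-time, two-way communication between deaf users, visually impaired users, and the hearing population.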

Details Motivation: Existing tools usually support only one direction of interaction and cannot meet the needs of deaf and visually impaired users for full communication with others. Method: Indian Sign Language (ISL) recognition is built on MediaPipe landmarks; a screen-free voice interface for visually impaired users combines speech-to-text, text summarization, and text-to-speech; speech is converted to sign through GIFs or alphabet-based visualizations. The system is built on Streamlit and runs on desktop and mobile devices. Result: Real-time, two-way communication is achieved at low computational load, running smoothly on edge devices without dedicated hardware. Conclusion: By integrating lightweight computer vision and speech processing, Sanvaad offers a practical and inclusive communication solution and advances accessible interaction. Abstract: Communication between deaf users, visually impaired users, and the general hearing population often relies on tools that support only one direction of interaction. To address this limitation, this work presents Sanvaad, a lightweight multimodal accessibility framework designed to support real-time, two-way communication. For deaf users, Sanvaad includes an ISL recognition module built on MediaPipe landmarks. MediaPipe is chosen primarily for its efficiency and low computational load, enabling the system to run smoothly on edge devices without requiring dedicated hardware. Spoken input from a phone can also be translated into sign representations through a voice-to-sign component that maps detected speech to predefined phrases and produces corresponding GIFs or alphabet-based visualizations. For visually impaired users, the framework provides a screen-free voice interface that integrates multilingual speech recognition, text summarization, and text-to-speech generation. These components work together through a Streamlit-based interface, making the system usable on both desktop and mobile environments. Overall, Sanvaad aims to offer a practical and accessible pathway for inclusive communication by combining lightweight computer vision and speech processing tools within a unified framework.

[138] Method of UAV Inspection of Photovoltaic Modules Using Thermal and RGB Data Fusion

Andrii Lysyi,Anatoliy Sachenko,Pavlo Radiuk,Mykola Lysyi,Oleksandr Melnychenko,Diana Zahorodnia

Main category: cs.CV

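TL;DR: This study proposes a multimodal intelligent framework for automated inspection of photovoltaic facilities that addresses thermal palette bias, data redundancy, and high bandwidth requirements in conventional methods, improving detection accuracy and operations-and-maintenance efficiency.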

Details Motivation: Conventional PV inspection suffers from thermal palette bias, data redundancy, and high communication bandwidth requirements, which makes it hard to operate and maintain large PV plants efficiently and accurately; a fully automated, intelligent solution is needed. Method: A synergistic architecture is used: a palette-invariant thermal embedding is learned by enforcing representational consistency and fused with a contrast-normalized RGB stream through a gated mechanism; a closed-loop adaptive re-acquisition controller with Rodrigues-based updates performs targeted confirmation of suspected anomalies; and DBSCAN clustering over the haversine distance provides geospatial deduplication. Result: The system reaches an mAP@0.5 of 0.903 on the public PVF-10 benchmark, 12-15% above single-modality baselines; field validation gives 96% recall; the deduplication module reduces duplicate-induced false positives by 15-20%; and transmitting only relevant telemetry cuts airborne data transmission by 60-70%. Conclusion: The study establishes a new paradigm for proactive PV inspection, automating the full workflow from data acquisition to actionable maintenance alerts, improving plant safety and operational efficiency, and showing good prospects for practical deployment. Abstract: The subject of this research is the development of an intelligent, integrated framework for the automated inspection of photovoltaic (PV) infrastructure that addresses the critical shortcomings of conventional methods, including thermal palette bias, data redundancy, and high communication bandwidth requirements. The goal of this study is to design, develop, and validate a comprehensive, multi-modal system that fully automates the monitoring workflow, from data acquisition to the generation of actionable, geo-located maintenance alerts, thereby enhancing plant safety and operational efficiency. The methods employed involve a synergistic architecture that begins with a palette-invariant thermal embedding, learned by enforcing representational consistency, which is fused with a contrast-normalized RGB stream via a gated mechanism. This is supplemented by a closed-loop, adaptive re-acquisition controller that uses Rodrigues-based updates for targeted confirmation of ambiguous anomalies and a geospatial deduplication module that clusters redundant alerts using DBSCAN over the haversine distance. In conclusion, this study establishes a powerful new paradigm for proactive PV inspection, with the proposed system achieving a mean Average Precision (mAP@0.5) of 0.903 on the public PVF-10 benchmark, a significant 12-15% improvement over single-modality baselines.
Field validation confirmed the system's readiness, achieving 96% recall, while the de-duplication process reduced duplicate-induced false positives by 15-20%, and relevance-only telemetry cut airborne data transmission by 60-70%.
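
The geospatial deduplication step (DBSCAN over the haversine distance) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the 2 m radius, the `min_samples` value, and the alert coordinates are all hypothetical, and the quadratic neighbour search is only suitable for small alert sets.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def dbscan(points, eps_m, min_samples):
    """Minimal DBSCAN; returns one cluster label per point (-1 means noise)."""
    n = len(points)
    # Each point counts as its own neighbour (distance 0).
    neighbours = [
        [j for j in range(n) if haversine_m(points[i], points[j]) <= eps_m]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Hypothetical alerts: three detections of the same module, one elsewhere.
alerts = [(50.0, 30.0), (50.000005, 30.0), (50.0, 30.000005), (50.001, 30.0)]
labels = dbscan(alerts, eps_m=2.0, min_samples=2)
```

Alerts that share a label would be merged into one maintenance ticket, while points labelled -1 are kept as separate alerts.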

[139] ShadowWolf -- Automatic Labelling, Evaluation and Model Training Optimised for Camera Trap Wildlife Images

Jens Dede,Anna Förster

Main category: cs.CV

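TL;DR: This paper proposes ShadowWolf, a unified framework that optimizes the training and evaluation of AI models for wildlife monitoring and uses dynamic retraining to improve adaptability and accuracy under changing conditions.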

Details Motivation: As expanding human activity increases human-wildlife interactions, efficient wildlife monitoring is urgently needed, yet conventional AI models lack robustness and adaptability in complex, variable field environments. Method: ShadowWolf, an integrated AI training framework, is designed and implemented; it supports optimized image collection, labelling, and model training, and introduces dynamic model retraining to adapt to environmental changes and application requirements. Result: The framework reduces manual labelling effort, supports on-site model adaptation, and improves the accuracy and efficiency of wildlife monitoring systems. Conclusion: ShadowWolf offers an efficient, scalable solution for wildlife monitoring and helps bring AI into practical use in conservation. Abstract: The continuous growth of the global human population is leading to the expansion of human habitats, resulting in decreasing wildlife spaces and increasing human-wildlife interactions. These interactions can range from minor disturbances, such as raccoons in urban waste bins, to more severe consequences, including species extinction. As a result, the monitoring of wildlife is gaining significance in various contexts. Artificial intelligence (AI) offers a solution by automating the recognition of animals in images and videos, thereby reducing the manual effort required for wildlife monitoring. Traditional AI training involves three main stages: image collection, labelling, and model training. However, the variability, for example, in the landscape (e.g., mountains, open fields, forests), weather (e.g., rain, fog, sunshine), lighting (e.g., day, night), and camera-animal distances presents significant challenges to model robustness and adaptability in real-world scenarios. In this work, we propose a unified framework, called ShadowWolf, designed to address these challenges by integrating and optimizing the stages of AI model training and evaluation. The proposed framework enables dynamic model retraining to adjust to changes in environmental conditions and application requirements, thereby reducing labelling efforts and allowing for on-site model adaptation. This adaptive and unified approach enhances the accuracy and efficiency of wildlife monitoring systems, promoting more effective and scalable conservation efforts.

[140] On The Role of K-Space Acquisition in MRI Reconstruction Domain-Generalization

Mohammed Wattad,Tamir Shor,Alex Bronstein

Main category: cs.CV

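TL;DR: This paper studies how well learned k-space sampling patterns generalize across domains in accelerated magnetic resonance imaging (MRI), and proposes introducing acquisition uncertainty during training to improve domain robustness.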

Details Motivation: Existing learned k-space sampling methods are mostly designed for a specific dataset or modality and give little consideration to cross-domain transfer, which limits their use in clinical practice. Method: The cross-domain generalization of learned sampling patterns is evaluated systematically across datasets and acquisition paradigms; a new method is proposed that randomly perturbs k-space trajectories during training to introduce acquisition uncertainty, simulating variability across scanners and imaging conditions. Result: Experiments show that models trained with learned sampling patterns reconstruct and generalize better under cross-domain settings, and that introducing acquisition uncertainty further improves domain robustness. Conclusion: K-space trajectory design is not only an acceleration mechanism but also an active degree of freedom for improving the domain generalization of MRI reconstruction models. Abstract: Recent work has established learned k-space acquisition patterns as a promising direction for improving reconstruction quality in accelerated Magnetic Resonance Imaging (MRI). Despite encouraging results, most existing research focuses on acquisition patterns optimized for a single dataset or modality, with limited consideration of their transferability across imaging domains. In this work, we demonstrate that the benefits of learned k-space sampling can extend beyond the training domain, enabling superior reconstruction performance under domain shifts. Our study presents two main contributions. First, through systematic evaluation across datasets and acquisition paradigms, we show that models trained with learned sampling patterns exhibit improved generalization under cross-domain settings. Second, we propose a novel method that enhances domain robustness by introducing acquisition uncertainty during training, stochastically perturbing k-space trajectories to simulate variability across scanners and imaging conditions. Our results highlight the importance of treating k-space trajectory design not merely as an acceleration mechanism, but as an active degree of freedom for improving domain generalization in MRI reconstruction.

[141] Novel Deep Learning Architectures for Classification and Segmentation of Brain Tumors from MRI Images

Sayan Das,Arghadip Biswas

Main category: cs.CV

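TL;DR: This paper proposes two new deep learning architectures, SAETCN and SAS-Net, for brain tumor classification and segmentation respectively, achieving high-accuracy automatic detection.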

Details Motivation: With the incidence of brain tumors rising, manually detecting tumors in MRI images is time-consuming and difficult, and existing models generalize poorly and perform badly on validation data. Method: Two new deep learning architectures are proposed: SAETCN for brain tumor classification and SAS-Net for tumor segmentation. Training uses a dataset containing glioma, meningioma, pituitary tumor, and non-tumor cases. Result: SAETCN reaches 99.38% classification accuracy on the validation set, and SAS-Net achieves 99.23% pixel-level segmentation accuracy. Conclusion: The proposed models perform well on brain tumor classification and segmentation and show good prospects for clinical use. Abstract: Brain tumors pose a significant threat to human life, so it is necessary to detect them accurately in the early stages for better diagnosis and treatment. Brain tumors can be detected manually by a radiologist from a patient's MRI scan images. However, the incidence of brain tumors has risen amongst children and adolescents in recent years, resulting in a substantial volume of data; as a result, manual detection is time-consuming and difficult. With the emergence of artificial intelligence and its wide application in the medical field, we can approach a CAD (Computer Aided Diagnosis) system for the automatic early detection of brain tumors. Existing models for this task do not generalize fully and perform poorly on validation data. We therefore propose two novel deep learning architectures: (a) SAETCN (Self-Attention Enhancement Tumor Classification Network) for the classification of different kinds of brain tumors, and (b) SAS-Net (Self-Attentive Segmentation Network) for the accurate segmentation of brain tumors. We trained SAETCN on a dataset containing images of 3 types of tumors (glioma, meningioma, and pituitary tumors) and non-tumor cases, and achieved an accuracy of 99.38% on the validation dataset, making it one of the few novel deep-learning-based architectures capable of detecting brain tumors accurately. With SAS-Net, we achieved an overall pixel accuracy of 99.23%.

[142] Bridging spatial awareness and global context in medical image segmentation

Dalia Alzu'bi,A. Ben Hamza

Main category: cs.CV

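TL;DR: This paper proposes U-CycleMLP, a new U-shaped encoder-decoder network that improves the accuracy and efficiency of medical image segmentation by introducing several modules that capture local and global context, and validates its performance on multiple benchmark datasets.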

Details Motivation: Existing medical image segmentation models fall short in capturing local and global contextual information, which leads to boundary pixel loss and segmentation errors; a solution that balances accuracy and computational efficiency is needed. Method: The U-CycleMLP network is proposed. The encoder learns multiscale contextual features with position attention weight excitation blocks, dense atrous blocks, and downsampling; the decoder reconstructs high-resolution segmentation masks through upsampling, dense atrous blocks, and feature fusion; channel CycleMLP blocks are added along the skip connections to strengthen feature integration while keeping linear computational complexity. Result: Experiments on three benchmark datasets show that U-CycleMLP achieves better segmentation accuracy than state-of-the-art methods, captures fine-grained anatomical structures better, and is robust across modalities; ablation studies confirm the contribution of the core components. Conclusion: Through an efficient architecture, U-CycleMLP improves medical image segmentation performance while balancing accuracy and computational cost, and has broad application prospects. Abstract: Medical image segmentation is a fundamental task in computer-aided diagnosis, requiring models that balance segmentation accuracy and computational efficiency. However, existing segmentation models often struggle to effectively capture local and global contextual information, leading to boundary pixel loss and segmentation errors. In this paper, we propose U-CycleMLP, a novel U-shaped encoder-decoder network designed to enhance segmentation performance while maintaining a lightweight architecture. The encoder learns multiscale contextual features using position attention weight excitation blocks, dense atrous blocks, and downsampling operations, effectively capturing both local and global contextual information. The decoder reconstructs high-resolution segmentation masks through upsampling operations, dense atrous blocks, and feature fusion mechanisms, ensuring precise boundary delineation. To further refine segmentation predictions, channel CycleMLP blocks are incorporated into the decoder along the skip connections, enhancing feature integration while maintaining linear computational complexity relative to input size. Experimental results, both quantitative and qualitative, across three benchmark datasets demonstrate the competitive performance of U-CycleMLP in comparison with state-of-the-art methods, achieving better segmentation accuracy across all datasets, capturing fine-grained anatomical structures, and demonstrating robustness across different medical imaging modalities.
Ablation studies further highlight the importance of the model's core architectural components in enhancing segmentation accuracy.

[143] SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities

Dung Thuy Nguyen,Quang Nguyen,Preston K. Robinette,Eli Jiang,Taylor T. Johnson,Kevin Leach

Main category: cs.CV

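TL;DR: This paper proposes SUGAR, a scalable generative unlearning framework that removes many identities from 3D-aware generative models while preserving generation quality and diversity.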

Details Motivation: As 3D-aware generative models advance in face synthesis, protecting user privacy and removing specific individuals without retraining the model has become a pressing problem. Method: SUGAR learns a personalized surrogate latent for each identity to be forgotten, diverting its reconstructions to visually plausible alternatives, and adds a continual utility preservation objective that guards against degradation as more identities are forgotten. Result: SUGAR performs best when removing up to 200 identities, with up to a 700% improvement in retention utility over existing methods. Conclusion: SUGAR provides an efficient, scalable unlearning mechanism for generative models that protects privacy while maintaining generation quality and diversity. Abstract: Recent advances in 3D-aware generative models have enabled high-fidelity image synthesis of human identities. However, this progress raises urgent questions around user consent and the ability to remove specific individuals from a model's output space. We address this by introducing SUGAR, a framework for scalable generative unlearning that enables the removal of many identities (simultaneously or sequentially) without retraining the entire model. Rather than projecting unwanted identities to unrealistic outputs or relying on static template faces, SUGAR learns a personalized surrogate latent for each identity, diverting reconstructions to visually coherent alternatives while preserving the model's quality and diversity. We further introduce a continual utility preservation objective that guards against degradation as more identities are forgotten. SUGAR achieves state-of-the-art performance in removing up to 200 identities, while delivering up to a 700% improvement in retention utility compared to existing baselines. Our code is publicly available at https://github.com/judydnguyen/SUGAR-Generative-Unlearn.

[144] GNC-Pose: Geometry-Aware GNC-PnP for Accurate 6D Pose Estimation

Xiujin Liu

Main category: cs.CV

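TL;DR: GNC-Pose is a learning-free monocular 6D object pose estimation method that combines rendering-based initialization, geometry-aware correspondence weighting, and GNC optimization to reach competitive accuracy without any training data.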

Details Motivation: Existing methods either depend on large amounts of training data or are sensitive to outliers; a 6D pose estimation approach that is both learning-free and robust is needed. Method: 2D-3D correspondences are obtained through feature matching and rendering-based alignment; a cluster-based weighting mechanism grounded in geometric structural consistency is introduced; and GNC optimization followed by LM refinement improves stability and accuracy. Result: On the YCB dataset the method reaches accuracy comparable to both learning-based and learning-free methods, and it remains stable under heavy outlier contamination. Conclusion: GNC-Pose provides a simple, robust, and practical learning-free solution for 6D pose estimation of textured objects. Abstract: We present GNC-Pose, a fully learning-free monocular 6D object pose estimation pipeline for textured objects that combines rendering-based initialization, geometry-aware correspondence weighting, and robust GNC optimization. Starting from coarse 2D-3D correspondences obtained through feature matching and rendering-based alignment, our method builds upon the Graduated Non-Convexity (GNC) principle and introduces a geometry-aware, cluster-based weighting mechanism that assigns robust per-point confidence based on the 3D structural consistency of the model. This geometric prior and weighting strategy significantly stabilizes the optimization under severe outlier contamination. A final LM refinement further improves accuracy. We tested GNC-Pose on the YCB Object and Model Set. Despite requiring no learned features, training data, or category-specific priors, GNC-Pose achieves competitive accuracy compared with both learning-based and learning-free methods, and offers a simple, robust, and practical solution for learning-free 6D pose estimation.
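
To make the Graduated Non-Convexity idea concrete, here is a minimal sketch on a toy problem: estimating a 1D mean in the presence of an outlier. It uses the widely used Geman-McClure surrogate with the usual alternation between a weight update and a weighted least-squares update. This is an assumption-laden illustration, not GNC-Pose itself: the abstract does not state which robust cost, schedule, or geometry-aware weighting the method uses, and the inlier threshold `c` and step factor are hypothetical choices.

```python
def gnc_robust_mean(values, c=0.5, mu_step=1.4):
    """Robust 1D mean via GNC with a Geman-McClure surrogate (toy example).

    c is the assumed inlier noise bound; mu controls how convex the surrogate is.
    """
    estimate = sum(values) / len(values)           # non-robust initial guess
    max_r2 = max((v - estimate) ** 2 for v in values)
    mu = max(1.0, 2.0 * max_r2 / c ** 2)           # start from a nearly convex surrogate
    while True:
        # Weight update: points with large residuals are progressively down-weighted.
        weights = [
            (mu * c ** 2 / ((v - estimate) ** 2 + mu * c ** 2)) ** 2
            for v in values
        ]
        # Variable update: weighted least squares, here simply a weighted mean.
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if mu == 1.0:
            return estimate, weights
        mu = max(1.0, mu / mu_step)                # gradually restore non-convexity


# Five inliers near 1.0 and one gross outlier (hypothetical data).
samples = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]
robust_estimate, final_weights = gnc_robust_mean(samples)
```

In a pose estimation setting the weighted mean would be replaced by a weighted PnP solve over the 2D-3D correspondences, with the residual being the reprojection error.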

[145] MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Yuhao Su,Anwesa Choudhuri,Zhongpai Gao,Benjamin Planche,Van Nguyen Nguyen,Meng Zheng,Yuhan Shen,Arun Innanje,Terrence Chen,Ehsan Elhamifar,Ziyan Wu

Main category: cs.CV

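TL;DR: This paper introduces MedVidBench, a large-scale benchmark for medical video understanding, and MedGRPO, a new framework for multi-dataset reinforcement learning, to address the limitations of existing vision-language models in medical video understanding.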

Details Motivation: Large vision-language models perform poorly on medical video understanding, with particular difficulty in spatial precision, temporal reasoning, and clinical semantics, so more effective training methods and evaluation benchmarks are needed. Method: The MedVidBench benchmark of 531,850 video-instruction pairs is constructed, and the MedGRPO framework is proposed, comprising cross-dataset reward normalization and a medical-LLM-judge mechanism that scores clinical quality on five dimensions; models are trained with supervised fine-tuning combined with reinforcement learning. Result: After supervised fine-tuning on MedVidBench, Qwen2.5-VL-7B substantially outperforms GPT-4.1 and Gemini-2.5-Flash; MedGRPO further improves grounding and captioning performance and resolves the reward imbalance seen in standard reinforcement learning. Conclusion: MedVidBench provides an important benchmark for medical video understanding, and MedGRPO offers a stable and efficient multi-dataset reinforcement learning method; together they advance vision-language models in the medical domain. Abstract: Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Supervised fine-tuning of Qwen2.5-VL-7B on MedVidBench substantially outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, demonstrating MedVidBench's efficacy, while our MedGRPO framework further improves upon the SFT baseline across grounding and captioning tasks. Our work establishes a foundational benchmark and robust training methodology for advancing vision-language models in medical domains.
Our project website is available at https://yuhaosu.github.io/MedGRPO/.
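
One way to picture "mapping each dataset's median performance to a common reward value" is a piecewise-linear rescaling of raw scores. The sketch below is purely illustrative and assumes raw scores in [0, 1] and a common target of 0.5; the abstract does not specify the actual mapping used by MedGRPO, and the dataset names and scores are hypothetical.

```python
from statistics import median


def normalise_reward(score, dataset_median, target=0.5):
    """Piecewise-linear map with 0 -> 0, dataset_median -> target, 1 -> 1."""
    if not 0.0 < dataset_median < 1.0:
        return score  # degenerate median: leave the score unchanged
    if score <= dataset_median:
        return target * score / dataset_median
    return target + (1.0 - target) * (score - dataset_median) / (1.0 - dataset_median)


# Hypothetical raw scores from an easy and a hard dataset.
raw_scores = {"easy_set": [0.7, 0.8, 0.9], "hard_set": [0.1, 0.2, 0.3]}
medians = {name: median(scores) for name, scores in raw_scores.items()}
normalised = {
    name: [normalise_reward(s, medians[name]) for s in scores]
    for name, scores in raw_scores.items()
}
```

After this rescaling a median-quality answer earns the same reward on both datasets, so the harder dataset no longer contributes systematically smaller rewards to the policy update.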

[146] The Role of Entropy in Visual Grounding: Analysis and Optimization

Shuo Li,Jiajun Sun,Zhihao Zhang,Xiaoran Fan,Senjie Jin,Hui Li,Yuming Yang,Junjie Ye,Lixing Shen,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CV

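TL;DR: This paper proposes ECVGPO, an interpretable reinforcement learning algorithm for visual grounding that controls entropy to balance exploration and exploitation, improving multimodal large models across a range of benchmarks and models.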

Details Motivation: Existing reinforcement-learning-based fine-tuning methods for multimodal large models have introduced entropy control techniques, but the role of entropy and how to regulate it in perception-oriented tasks such as visual grounding remain unclear and need further study. Method: The characteristics of entropy in visual grounding are analyzed and compared with reasoning tasks; on that basis the ECVGPO algorithm is proposed, using an interpretable entropy control mechanism to improve policy learning. Result: Experiments show that ECVGPO improves performance across multiple benchmarks and models and balances exploration and exploitation effectively. Conclusion: ECVGPO provides an effective entropy control scheme for perception-oriented tasks and advances the use of multimodal large models in visual grounding. Abstract: Recent advances in fine-tuning multimodal large language models (MLLMs) using reinforcement learning have achieved remarkable progress, particularly with the introduction of various entropy control techniques. However, the role and characteristics of entropy in perception-oriented tasks like visual grounding, as well as effective strategies for controlling it, remain largely unexplored. To address this issue, we focus on the visual grounding task and analyze the role and characteristics of entropy in comparison to reasoning tasks. Building on these findings, we introduce ECVGPO (Entropy Control Visual Grounding Policy Optimization), an interpretable algorithm designed for effective entropy regulation. Through entropy control, the trade-off between exploration and exploitation is better balanced. Experiments show that ECVGPO achieves broad improvements across various benchmarks and models.

[147] From Remote Sensing to Multiple Time Horizons Forecasts: Transformers Model for CyanoHAB Intensity in Lake Champlain

Muhammad Adil,Patrick J. Clemins,Andrew W. Schroth,Panagiotis D. Oikonomou,Donna M. Rizzo,Peter D. F. Isles,Xiaohan Zhang,Kareem I. Hannoun,Scott Turnbull,Noah B. Beckage,Asim Zia,Safwan Wshah

Main category: cs.CV

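TL;DR: This study proposes a remote-sensing-based Transformer-BiLSTM model that forecasts the intensity of cyanobacterial harmful algal blooms (CyanoHABs) in Lake Champlain 1 to 14 days ahead, achieving accurate early warning from sparse satellite data.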

Details Motivation: Cyanobacterial blooms seriously threaten aquatic ecosystems and public health; Lake Champlain experiences them frequently because of nutrient enrichment and climatic variability, so efficient, continuous, large-area monitoring and forecasting is needed. Method: A remote-sensing-only forecasting framework combines Transformer and BiLSTM components, using Cyanobacterial Index and MODIS temperature data; a two-stage preprocessing pipeline (forward fill, weighted temporal imputation, and smoothing) handles the high proportion of missing data, and feature engineering uses equal-frequency binning and temperature statistics. Result: The model performs well across forecast horizons, with F1 scores of 89.5%, 86.4%, and 85.5% for 1-, 2-, and 3-day forecasts, and an F1 score of 78.9% with an AUC of 82.6% at 14 days. Conclusion: The model captures complex spatiotemporal dynamics in sparse remote sensing data and provides a reliable early-warning tool for CyanoHAB management. Abstract: Cyanobacterial Harmful Algal Blooms (CyanoHABs) pose significant threats to aquatic ecosystems and public health globally. Lake Champlain is particularly vulnerable to recurring CyanoHAB events, especially in its northern segment: Missisquoi Bay, St. Albans Bay, and Northeast Arm, due to nutrient enrichment and climatic variability. Remote sensing provides a scalable solution for monitoring and forecasting these events, offering continuous coverage where in situ observations are sparse or unavailable. In this study, we present a remote-sensing-only forecasting framework that combines Transformers and BiLSTM to predict CyanoHAB intensities up to 14 days in advance. The system utilizes Cyanobacterial Index data from the Cyanobacterial Assessment Network and temperature data from Moderate Resolution Imaging Spectroradiometer satellites to capture long-range dependencies and sequential dynamics in satellite time series. The dataset is very sparse, missing more than 30% of the Cyanobacterial Index data and 90% of the temperature data. A two-stage preprocessing pipeline addressed data gaps by applying forward fill and weighted temporal imputation at the pixel level, followed by smoothing to reduce the discontinuities of CyanoHAB events. The raw dataset is transformed into meaningful features through equal-frequency binning for the Cyanobacterial Index values and extracted temperature statistics.
The Transformer-BiLSTM model demonstrates strong forecasting performance across multiple horizons, achieving F1 scores of 89.5%, 86.4%, and 85.5% at one-, two-, and three-day forecasts, respectively, and maintaining an F1 score of 78.9% with an AUC of 82.6% at the 14-day horizon. These results confirm the model's ability to capture complex spatiotemporal dynamics from sparse satellite data and to provide reliable early warning for CyanoHABs management.
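
Two of the preprocessing steps named above, forward fill and equal-frequency binning, are simple enough to sketch without any library. The code below is an illustration under assumptions: the function names and sample values are hypothetical, missing observations are represented as `None`, and the weighted temporal imputation and smoothing stages are omitted.

```python
def forward_fill(series):
    """Replace each missing value (None) with the most recent observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None until the first observation arrives
    return filled


def equal_frequency_edges(values, n_bins):
    """Interior bin edges chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def assign_bin(value, edges):
    """Index of the bin that value falls into (0 = lowest intensity class)."""
    return sum(value >= edge for edge in edges)


# Hypothetical per-pixel index readings with gaps.
pixel_series = [None, 1.0, None, None, 4.0]
filled_series = forward_fill(pixel_series)

# Hypothetical index values turned into three intensity classes.
index_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
edges = equal_frequency_edges(index_values, n_bins=3)
classes = [assign_bin(v, edges) for v in index_values]
```

Equal-frequency bins give balanced intensity classes even when the index distribution is heavily skewed, which is common for bloom data where most days show little or no bloom.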

[148] MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning

Yueqian Wang,Songxiang Liu,Disong Wang,Nuo Xu,Guanglu Wan,Huishuai Zhang,Dongyan Zhao

Main category: cs.CV

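TL;DR: This paper proposes a new text-to-text approach to proactive interaction for video multimodal large language models (Video MLLMs). Trained with multi-turn reinforcement learning, the model decides on its own whether to respond during a streaming video, needs no precise reply-time annotations, and achieves state-of-the-art performance on the ProactiveVideoQA benchmark.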

Details Motivation: Most existing Video MLLMs interact in a turn-based manner and cannot decide when to reply during video playback, which limits them in real-time applications. This paper addresses how a model can respond autonomously, promptly, and accurately without precise timing annotations. Method: A text-to-text proactive interaction framework based on dialogue history and the current visual context is proposed, trained with multi-turn reinforcement learning (RL) to avoid manual threshold tuning and precise time annotation and to let the model decide when to respond. Result: The MMDuet2 model is trained on a dataset of 52k videos; experiments show it outperforms existing baselines in both response timing and answer quality, reaching SOTA performance on the ProactiveVideoQA benchmark. Conclusion: The method addresses the challenge of proactive interaction in video streams and improves the practicality and responsiveness of Video MLLMs in real-world settings. Abstract: Recent advances in video multimodal large language models (Video MLLMs) have significantly enhanced video understanding and multi-modal interaction capabilities. While most existing systems operate in a turn-based manner where the model can only reply after user turns, proactively deciding when to reply during video playback presents a promising yet challenging direction for real-time applications. In this work, we propose a novel text-to-text approach to proactive interaction, where the model autonomously determines whether to respond or remain silent at each turn based on dialogue history and visual context up to the current frame of a streaming video. To overcome difficulties in previous methods such as manually tuning response decision thresholds and annotating precise reply times, we introduce a multi-turn RL-based training method that encourages timely and accurate responses without requiring precise response time annotations. We train our model MMDuet2 on a dataset of 52k videos with two types of dialogues via SFT and RL. Experimental results demonstrate that MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality, achieving state-of-the-art performance on the ProactiveVideoQA benchmark.

Kazuya Nishimura,Haruka Hirose,Ryoma Bise,Kaito Shiku,Yasuhiro Kojima

Main category: cs.CV

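TL;DR: A new loss function, STRank, is proposed that learns relative gene expression patterns rather than absolute levels, reducing the impact of noise and batch effects when estimating gene expression from pathology images.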

Details Motivation: Because of the complexity of sequencing techniques and intrinsic variability across cells, observed gene expression data contain stochastic noise and batch effects, which makes accurate estimation of absolute expression values difficult. Method: The authors propose learning relative expression patterns of genes and design a new loss function, STRank, that is more robust to noise and batch effects. Result: Experiments on synthetic and real datasets show that the proposed method estimates relative expression patterns better than conventional point-wise loss functions. Conclusion: STRank mitigates the effects of noise and batch effects and provides a more robust way to estimate gene expression from pathology images. Abstract: Gene expression estimation from pathology images has the potential to reduce the RNA sequencing cost. Point-wise loss functions have been widely used to minimize the discrepancy between predicted and absolute gene expression values. However, due to the complexity of the sequencing techniques and intrinsic variability across cells, the observed gene expression contains stochastic noise and batch effects, and estimating the absolute expression values accurately remains a significant challenge. To mitigate this, we propose a novel objective of learning relative expression patterns rather than absolute levels. We assume that the relative expression levels of genes exhibit consistent patterns across independent experiments, even when absolute expression values are affected by batch effects and stochastic noise in tissue samples. Based on this assumption, we model the relation and propose a novel loss function called STRank that is robust to noise and batch effects. Experiments using synthetic datasets and real datasets demonstrate the effectiveness of the proposed method. The code is available at https://github.com/naivete5656/STRank.
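
The contrast between point-wise and relative objectives can be shown with a generic pairwise logistic ranking loss. This sketch is not the STRank loss, whose exact form is not given in the abstract; it only illustrates why an objective that depends on the ordering of genes is unaffected by a batch effect that rescales all observed counts. All values are hypothetical.

```python
import math


def pairwise_rank_loss(predicted, observed):
    """Average logistic loss over gene pairs whose observed order is i > j."""
    total, pairs = 0.0, 0
    for i in range(len(observed)):
        for j in range(len(observed)):
            if observed[i] > observed[j]:
                # Small when the prediction ranks gene i above gene j.
                total += math.log1p(math.exp(-(predicted[i] - predicted[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical expression of three genes in one spot, and a rescaled copy
# standing in for a multiplicative batch effect.
observed = [5.0, 1.0, 3.0]
observed_other_batch = [50.0, 10.0, 30.0]

well_ordered = [2.0, 0.0, 1.0]
badly_ordered = [0.0, 2.0, 1.0]

loss_good = pairwise_rank_loss(well_ordered, observed)
loss_bad = pairwise_rank_loss(badly_ordered, observed)
loss_good_other_batch = pairwise_rank_loss(well_ordered, observed_other_batch)
```

A mean squared error against `observed_other_batch` would change by orders of magnitude, whereas the ranking loss is identical for both batches because only the ordering of the observed values enters it.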

[150] Less Is More, but Where? Dynamic Token Compression via LLM-Guided Keyframe Prior

Yulin Li,Haokun Gui,Ziyang Fan,Junjie Wang,Bin Kang,Bin Chen,Zhuotao Tian

Main category: cs.CV

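TL;DR: This paper proposes DyToK, a training-free dynamic token compression method that uses an LLM-guided keyframe prior to make video large language model inference more efficient, substantially reducing computation while preserving accuracy.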

Details Motivation: Existing keyframe sampling methods make long-video processing more efficient but add computational cost and rely on a suboptimal binary frame selection strategy, while the attention mechanisms of VLLMs already contain keyframe priors that can be exploited. Method: Analysis shows that VLLM attention layers naturally encode query-conditioned keyframe priors; DyToK uses them to adjust each frame's token retention ratio dynamically, prioritizing semantically rich frames and suppressing redundancy, which gives training-free dynamic token compression. Result: Experiments show that DyToK achieves state-of-the-art efficiency-accuracy trade-offs on multiple VLLMs (such as LLaVA-OneVision and Qwen2.5-VL), and that combining it with existing compression methods (such as VisionZip and FastV) yields 4.3x faster inference while preserving accuracy. Conclusion: DyToK is an efficient, plug-and-play dynamic token compression framework that exploits the keyframe prior in a VLLM's own attention mechanism, offering an efficient and accurate solution for long-video understanding. Abstract: Recent advances in Video Large Language Models (VLLMs) have achieved remarkable video understanding capabilities, yet face critical efficiency bottlenecks due to quadratic computational growth with lengthy visual token sequences of long videos. While existing keyframe sampling methods can improve temporal modeling efficiency, additional computational cost is introduced before feature encoding, and the binary frame selection paradigm is found suboptimal. Therefore, in this work, we propose Dynamic Token compression via LLM-guided Keyframe prior (DyToK), a training-free paradigm that enables dynamic token compression by harnessing VLLMs' inherent attention mechanisms. Our analysis reveals that VLLM attention layers naturally encode query-conditioned keyframe priors, by which DyToK dynamically adjusts per-frame token retention ratios, prioritizing semantically rich frames while suppressing redundancies. Extensive experiments demonstrate that DyToK achieves state-of-the-art efficiency-accuracy tradeoffs. DyToK shows plug-and-play compatibility with existing compression methods, such as VisionZip and FastV, attaining 4.3x faster inference while preserving accuracy across multiple VLLMs, such as LLaVA-OneVision and Qwen2.5-VL. Code is available at https://github.com/yu-lin-li/DyToK.
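
The idea of adjusting per-frame token retention from a keyframe prior, instead of keeping or dropping whole frames, can be sketched as a proportional budget split. This is a simplified illustration, not DyToK's actual allocation rule: the per-frame scores stand in for query-conditioned attention mass, and the function name, budget, and cap are hypothetical.

```python
def allocate_tokens(frame_scores, total_budget, tokens_per_frame):
    """Split a visual-token budget across frames in proportion to their scores."""
    total_score = sum(frame_scores)
    if total_score <= 0:
        # No usable signal: fall back to a uniform split.
        share = total_budget // len(frame_scores)
        return [min(tokens_per_frame, share)] * len(frame_scores)
    return [
        # A frame can never keep more tokens than it has.
        min(tokens_per_frame, int(total_budget * score / total_score))
        for score in frame_scores
    ]


# Hypothetical attention mass per frame; the second frame matters most.
frame_scores = [1.0, 5.0, 2.0]
kept = allocate_tokens(frame_scores, total_budget=64, tokens_per_frame=196)
kept_capped = allocate_tokens(frame_scores, total_budget=64, tokens_per_frame=32)
```

Within each frame, the retained tokens would then be chosen by whichever token-level compressor is in use, which is how a frame-level prior like this can sit on top of existing methods. Note that this simple version does not redistribute budget freed by the per-frame cap.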

[151] Hierarchical Deep Learning for Diatom Image Classification: A Multi-Level Taxonomic Approach

Yueying Ke

Main category: cs.CV

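TL;DR: This paper proposes a diatom classification method based on a hierarchical convolutional network. Embedding the taxonomic hierarchy improves recognition accuracy and error locality, outperforming conventional flat models on the same data.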

Details Motivation: Conventional diatom classification depends on experts, and most existing deep learning methods use flat classification, which cannot exploit taxonomic hierarchy and leads to severe error propagation. This work aims to improve classification by embedding the hierarchy. Method: A hierarchical convolutional network with five cascaded heads jointly predicts class, order, family, genus, and species. Probability distributions and binary masks from higher levels restrict the prediction space of lower levels, and backbone features are shared during training and inference. Result: Species-level accuracy matches the flat model (69.4%), while all upper taxonomic levels improve; errors are more local, with 92.5% of misclassified species still correctly predicted at genus level and mean taxonomic distance reduced by 38.2%. Conclusion: Through a bidirectional mechanism (top-down constraints and bottom-up feature refinement), the hierarchical model yields more robust, interpretable, and biologically aligned multi-level classification. Abstract: Accurate taxonomic identification of diatoms is essential for aquatic ecosystem monitoring, yet conventional methods depend heavily on expert taxonomists. Recent deep learning approaches improve automation, but most treat diatom recognition as flat classification, predicting only one taxonomic rank. We investigate whether embedding taxonomic hierarchy into neural network architectures can improve both accuracy and error locality. We introduce a hierarchical convolutional network with five cascaded heads that jointly predict class, order, family, genus, and species. Each head receives shared backbone features and probability distributions from higher levels, with binary masks restricting predictions to valid descendants during training and inference. Using a filtered dataset of 1,456 diatom images covering 82 species, we compare hierarchical and flat models under identical settings. The hierarchical model matches flat baselines at species level (69.4% accuracy) while outperforming at all upper taxonomic levels. When species predictions fail, errors remain taxonomically local: 92.5% of misclassified species are correctly predicted at genus level, versus 67.2% for flat baselines. The hierarchical model reduces mean taxonomic distance by 38.2% (1.209 vs. 1.955). Progressive training reveals bidirectional mechanisms: hierarchical constraint masks operate top-down to constrain prediction space, while gradients from fine-grained levels propagate bottom-up through the shared backbone, refining features. This improves class accuracy from 96.2% to 99.5% and yields 6-8% gains at upper levels, producing more robust, interpretable, and biologically aligned predictions for multi-level taxonomic classification.
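
The binary masks that restrict predictions to valid descendants can be sketched for a single genus-to-species step. This is a simplified illustration that assumes a hard genus decision and a hypothetical `genus_of_species` lookup table; the paper's heads also consume probability distributions from higher levels, which this sketch omits.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over entries where mask is 1; masked entries get probability 0."""
    valid = [l for l, m in zip(logits, mask) if m]
    if not valid:
        raise ValueError("mask excludes every class")
    top = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def predict_species(species_logits, predicted_genus, genus_of_species):
    """Restrict species predictions to descendants of the predicted genus.

    genus_of_species[i] is the (hypothetical) genus index of species i.
    """
    mask = [1 if g == predicted_genus else 0 for g in genus_of_species]
    probs = masked_softmax(species_logits, mask)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

In this example species 2 has the largest raw logit, but it belongs to a different genus, so the mask removes it and the prediction stays inside the predicted genus.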

[152] NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification

Ziyang Song,Zelin Zang,Xiaofan Ye,Boqiang Xu,Long Bai,Jinlin Wu,Hongliang Ren,Hongbin Liu,Jiebo Luo,Zhen Lei

Main category: cs.CV

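TL;DR: This paper presents NeuroABench, the first multimodal benchmark for evaluating anatomical understanding in neurosurgery. Built on 9 hours of annotated video and 68 clinical anatomical structures, it shows that the best existing MLLM reaches only 40.87% accuracy on this task, comparable to the lowest-scoring trainee but still well below the trainees' average.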

Details Motivation: Existing research focuses mainly on understanding surgical workflows and neglects anatomical understanding, a key clinical need, so a dedicated evaluation of MLLMs' neurosurgical anatomical comprehension is required. Method: NeuroABench is built from 9 hours of annotated neurosurgical videos covering 89 procedures, using a novel multimodal annotation pipeline, and evaluates identification of 68 clinical anatomical structures. Experiments cover more than 10 state-of-the-art MLLMs, and results are compared with the performance of four neurosurgical trainees. Result: The best MLLM reaches only 40.87% accuracy on anatomical identification. In the trainee test, the highest score is 56%, the lowest 28%, and the average 46.5%; the best MLLM performs comparably to the lowest-scoring trainee but below the average. Conclusion: Although MLLMs have made progress in anatomical understanding of surgical video, they still lag well behind human performance in precisely identifying neurosurgical anatomy; NeuroABench provides an evaluation standard and a direction for future model development. Abstract: Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, the lowest-scoring student achieves 28%, and the average score is 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group's average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.

[153] Masked Autoencoder Pretraining on Strong-Lensing Images for Joint Dark-Matter Model Classification and Super-Resolution

Achmad Ardani Prasha,Clavino Ourizqi Rachmadi,Muhamad Fauzan Ibnu Syahlan,Naufal Rahfi Anugerah,Nanda Garin Raditya,Putri Amelia,Sabrina Laila Mutiara,Hilman Syachr Ramadhan

Main category: cs.CV

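TL;DR: Proposes a masked autoencoder (MAE) pretraining approach that learns generalizable representations from simulated strong gravitational lensing images for dark-matter model classification and image super-resolution, matching or exceeding a ViT trained from scratch.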

Details Motivation: 强引力透镜可揭示星系中暗物质子结构的影响,但噪声和低分辨率图像给分析带来挑战,需提升模型在多任务下的表现与泛化能力。 Method: 采用掩码图像建模目标,在DeepLense ML4SCI基准的模拟强透镜图像上预训练Vision Transformer编码器,并分别微调用于(1)暗物质模型分类(冷暗物质、类轴子、无子结构)和(2)低分辨率图像超分辨率重建。 Result: 在90%掩码比下,分类器宏AUC达0.968,准确率88.65%(优于从零训练的0.957和82.46%);超分辨率任务中PSNR约33 dB,SSIM 0.961,略优于基线;消融实验显示高掩码比有利于分类但轻微降低重建质量。 Conclusion: 在物理丰富的模拟数据上进行MAE预训练,可获得适用于多个强透镜分析任务的灵活且可复用的编码器。 Abstract: Strong gravitational lensing can reveal the influence of dark-matter substructure in galaxies, but analyzing these effects from noisy, low-resolution images poses a significant challenge. In this work, we propose a masked autoencoder (MAE) pretraining strategy on simulated strong-lensing images from the DeepLense ML4SCI benchmark to learn generalizable representations for two downstream tasks: (i) classifying the underlying dark matter model (cold dark matter, axion-like, or no substructure) and (ii) enhancing low-resolution lensed images via super-resolution. We pretrain a Vision Transformer encoder using a masked image modeling objective, then fine-tune the encoder separately for each task. Our results show that MAE pretraining, when combined with appropriate mask ratio tuning, yields a shared encoder that matches or exceeds a ViT trained from scratch. Specifically, at a 90% mask ratio, the fine-tuned classifier achieves macro AUC of 0.968 and accuracy of 88.65%, compared to the scratch baseline (AUC 0.957, accuracy 82.46%). For super-resolution (16x16 to 64x64), the MAE-pretrained model reconstructs images with PSNR ~33 dB and SSIM 0.961, modestly improving over scratch training. We ablate the MAE mask ratio, revealing a consistent trade-off: higher mask ratios improve classification but slightly degrade reconstruction fidelity. Our findings demonstrate that MAE pretraining on physics-rich simulations provides a flexible, reusable encoder for multiple strong-lensing analysis tasks.
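
The mask ratio discussed above controls how many image patches are hidden from the encoder during pretraining. The following is a minimal sketch of random patch masking; the function name `random_patch_mask`, the seed handling, and the example patch count (a 64x64 image split into 4x4 patches, giving 256 patches) are illustrative assumptions rather than details from the paper.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Randomly partition patch indices into visible and masked sets.

    The encoder would see only the visible patches; the decoder would be
    trained to reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = round(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked
```

At a 90% mask ratio with 256 patches, only 26 patches remain visible, which makes the reconstruction task hard and is consistent with the reported trade-off between classification quality and reconstruction fidelity.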

[154] Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Fenghua Weng,Chaochao Lu,Xia Hu,Wenqi Shao,Wenjie Wang

Main category: cs.CV

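TL;DR: This paper proposes the Think-Reflect-Revise (TRR) framework, a three-stage training framework built around a think-reflect-revise process that improves the safety alignment of large vision language models (LVLMs). A reflection mechanism detects and corrects harmful content in the first-pass reasoning, substantially raising the safe response rate on safety benchmarks and jailbreak attack tests.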

Details Motivation: 现有的安全推理方法采用单次‘先思考后回答’范式,容易受到上下文或视觉越狱攻击的影响,且可能忽略自身输出中的有害内容,缺乏自我修正能力。 Method: 提出TRR框架:1)构建包含5000个样本的ReSafe数据集,遵循‘思考-反思-修正’流程;2)用该数据集微调模型以初始化反思行为;3)通过强化学习进一步加强策略引导的反思能力。 Result: 在Qwen2.5-VL-7B上,整体安全响应率从42.8%提升至87.7%,在安全感知基准和越狱攻击评估中表现显著提升,同时在MMMU和MMStar等通用基准上保持稳定性能。 Conclusion: TRR通过引入反思机制有效利用初次推理中暴露的恶意信号,实现真正的自我修正,显著增强LVLMs的安全性,为安全对齐提供了新思路。 Abstract: As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. 
The project page is available at https://think-reflect-revise.github.io/.

[155] TextMamba: Scene Text Detector with Mamba

Qiyan Zhao,Yue Yan,Da-Han Wang

Main category: cs.CV

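TL;DR: This paper proposes a new Mamba-based scene text detector that combines the selection mechanism with attention layers to better extract key information from long sequences, achieving state-of-the-art or competitive results on several benchmarks.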

Details Motivation: Existing Transformer methods can forget important information or attend to irrelevant representations when modeling long-range dependencies, and traditional methods are limited in global feature extraction. Method: A Mamba-based detector that uses the selection mechanism to strengthen attention layers; a Top-k algorithm filters key information, and a dual-scale feed-forward network and an embedding pyramid enhancement module promote high-dimensional state interaction and multi-scale feature fusion. Result: F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19ArT, respectively, which is state-of-the-art or competitive. Conclusion: The method improves information selection and feature fusion for text detection over long sequences and raises detection performance. Abstract: In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network-based methods. However, most directly rely on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or focusing on irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated better long-range dependency modeling through a linear-complexity selection mechanism. Therefore, we propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt the Top-k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19ArT, respectively. Code will be made available.
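
Explicit Top-k selection of key information can be sketched generically as keeping the k highest-scoring sequence elements while preserving their original order. How the scores are computed and where the selection sits inside the detector are not specified in the abstract, so `top_k_select` and its inputs are hypothetical.

```python
def top_k_select(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving sequence order.

    Returns the kept tokens and their original indices.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(0, min(k, len(tokens)))
    # Rank indices by score, highest first, and keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(ranked)  # restore sequence order for the downstream model
    return [tokens[i] for i in keep], keep
```

Preserving sequence order matters for a state space model, because it processes elements sequentially and the position of each kept element carries information.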

[156] Generating Storytelling Images with Rich Chains-of-Reasoning

Xiujie Song,Qi Jia,Shota Watanabe,Xiaoyi Pang,Ruijie Chen,Mengyue Wu,Kenny Q. Zhu

Main category: cs.CV

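TL;DR: This paper introduces the Storytelling Image Generation task, which uses generative AI models to create storytelling images rich in semantics and chains of reasoning. It proposes StorytellingPainter, a two-stage pipeline combining large language models with text-to-image models, together with a dedicated evaluation framework and the lightweight Mini-Storytellers models for improving generation quality.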

Details Motivation: 由于叙事图像具有复杂的语义结构和丰富的视觉线索,人工创作困难且资源稀缺,因此需要借助AI自动生成此类图像以支持其在认知分析、插画创作等领域的应用。 Method: 提出两阶段生成框架StorytellingPainter:首先利用大语言模型生成包含逻辑推理链的故事内容,再通过文生图模型将其转化为视觉图像;同时构建包含语义复杂性、多样性及图文对齐性的评估框架,并针对开源与闭源大模型间的性能差距设计轻量级Mini-Storytellers模型进行优化。 Result: 实验结果表明所提出的框架能有效生成高质量的叙事图像,评估体系合理且具备区分性,Mini-Storytellers在故事生成任务上显著缩小了开源与闭源模型之间的性能差距。 Conclusion: 本研究验证了利用大语言模型与生成模型协同完成复杂语义图像生成的可行性,为叙事图像的自动化创作提供了有效方法和系统框架。 Abstract: An image can convey a compelling story by presenting rich, logically connected visual clues. These connections form Chains-of-Reasoning (CoRs) within the image, enabling viewers to infer events, causal relationships, and other information, thereby understanding the underlying story. In this paper, we focus on these semantically rich images and define them as Storytelling Images. Such images have diverse applications beyond illustration creation and cognitive screening, leveraging their ability to convey multi-layered information visually and inspire active interpretation. However, due to their complex semantic nature, Storytelling Images are inherently challenging to create, and thus remain relatively scarce. To address this challenge, we introduce the Storytelling Image Generation task, which explores how generative AI models can be leveraged to create such images. Specifically, we propose a two-stage pipeline, StorytellingPainter, which combines the creative reasoning abilities of Large Language Models (LLMs) with the visual synthesis capabilities of Text-to-Image (T2I) models to generate Storytelling Images. Alongside this pipeline, we develop a dedicated evaluation framework comprising three main evaluators: a Semantic Complexity Evaluator, a KNN-based Diversity Evaluator and a Story-Image Alignment Evaluator. 
Given the critical role of story generation in the Storytelling Image Generation task and the performance disparity between open-source and proprietary LLMs, we further explore tailored training strategies to reduce this gap, resulting in a series of lightweight yet effective models named Mini-Storytellers. Experimental results demonstrate the feasibility and effectiveness of our approaches. The code is available at https://github.com/xiujiesong/StorytellingImageGeneration.

[157] Personalized Image Descriptions from Attention Sequences

Ruoyu Xue,Hieu Le,Jingyi Xu,Sounak Mondal,Abe Leite,Gregory Zelinsky,Minh Hoai,Dimitris Samaras

Main category: cs.CV

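TL;DR: This paper proposes DEPER, a model that generates personalized image descriptions by combining an individual's viewing behavior and linguistic style. It learns subject embeddings through an auxiliary attention-prediction task and aligns them with a frozen vision-language model via a lightweight adapter, yielding substantial gains on multiple datasets.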

Details Motivation: 现有个性化图像描述模型仅关注语言风格,忽略了个体观看模式的影响。本文旨在填补这一空白,将个性化观看行为作为描述生成的核心因素进行建模。 Method: 提出DEPER(Description-Perception persona encoder)模型,通过辅助注意力预测任务学习包含语言风格和观看行为的主体嵌入,并使用轻量级适配器将其与冻结的视觉-语言模型对齐,实现少样本个性化。 Result: 在四个涵盖不同观看任务和描述长度的数据集上,DEPER平均提升了24%的性能,生成的描述更符合人类感知且质量更高。 Conclusion: 建模个性化的注意力机制有助于生成更贴近人类表达的图像描述,理解人类感知差异可提升多模态系统的性能与人类对齐性。 Abstract: People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription-PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multimodal systems.

[158] Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models

Kassoum Sanogo,Renzo Ardiccioni

Main category: cs.CV

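TL;DR: Proposes a training-free self-correction framework for vision-language models that reduces hallucinations through visual re-attention guided by multidimensional uncertainty.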

Details Motivation: Vision-language models (VLMs) often generate plausible but incorrect statements about image content (hallucinations), which undermines reliability and interpretability. Method: Multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, and claim confidence) is combined with attention-guided image cropping to iteratively refine responses on a frozen, pretrained VLM, with no gradient updates. Result: Validated with Qwen2.5-VL-7B on the POPE and MMHAL BENCH benchmarks, the method lowers the hallucination rate by 9.8 percentage points and raises object existence accuracy by 4.7 points; qualitative analysis shows that corrections are grounded in visual evidence. Conclusion: The training-free self-correction framework reduces VLM hallucinations and improves output reliability, with good generality and application potential. Abstract: Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We plan to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
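
Of the four uncertainty signals listed above, token entropy is the simplest to make concrete: it is the Shannon entropy of the model's next-token distribution at each decoding step. The sketch below assumes those per-step distributions are available as lists of probabilities; the helper names and the threshold are hypothetical, and the other three signals are not covered.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain_tokens(step_distributions, threshold):
    """Return indices of decoding steps whose entropy exceeds the threshold.

    step_distributions is a list of per-step probability lists; the
    threshold is an illustrative tuning parameter.
    """
    return [
        i
        for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]
```

A peaked distribution has entropy near zero, while a uniform distribution over n tokens has entropy ln(n), so high-entropy steps mark places where the model was unsure and a second look at the image may be most useful.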

[159] CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

Yu Qi,Yumeng Zhang,Chenting Gong,Xiao Tan,Weiming Zhang,Wei Zhang,Jingdong Wang

Main category: cs.CV

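TL;DR: This paper proposes CoT4Det, a method that decomposes perception tasks into three interpretable steps (classification, counting, and grounding), substantially improving the performance of large vision-language models on perception tasks while preserving their general vision-language capabilities.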

Details Motivation: 大型视觉语言模型在通用视觉问答和OCR等任务中表现出色,但在目标检测、语义分割等感知密集型任务上性能较差,尤其是对复杂场景和小物体的识别能力不足。 Method: 提出Chain-of-Thought for Detection (CoT4Det),将感知任务重新构想为三个逐步推理阶段:分类、计数和 grounding,更契合LVLM的推理能力。 Result: 在COCO2017 val上mAP从19.0%提升至33.0%,在RefCOCO系列上超过基线2%,在Flickr30k entities上提升19%。 Conclusion: CoT4Det是一种简单而高效的方法,显著增强了LVLM在感知任务上的性能,同时保留了其原有的多模态理解能力,具有良好的通用性和应用潜力。 Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, particularly struggling with dense scenes and small object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on RefCOCO series and 19% on Flickr30k entities.

[160] 1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning

Shida Gao,Feng Xue,Xiangfeng Wang,Anlong Ming,Teng Long,Yihua Shao,Haozhe Wang,Zhaowen Lin,Wei Wang,Nicu Sebe

Main category: cs.CV

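TL;DR: This paper proposes DEViL, a detector-empowered video large language model. By coupling an open-vocabulary detector (OVD) with a reference-semantic token (RST), it addresses the error accumulation and drift caused by autoregressive spatial decoding in existing MLLMs for spatio-temporal grounding, and it introduces TTReg for temporal consistency, substantially improving fine-grained video understanding.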

Details Motivation: Existing multimodal large language models (MLLMs) handle spatio-temporal grounding in video by generating bounding boxes autoregressively as text tokens, which produces overly long output sequences, accumulating spatial errors, and localization drift, making it hard to reason accurately about the semantic relations of events. A more efficient and precise approach to spatio-temporal grounding and reasoning is needed. Method: The DEViL framework couples a Video LLM with an open-vocabulary detector (OVD). A reference-semantic token (RST) carries a semantic representation of the user query and acts as both a replacement for the OVD's text embedding and a control signal. A tube-mined temporal regularization (TTReg) within the OVD strengthens temporally consistent association of target objects. Result: Experiments show that DEViL performs strongly on several fine-grained video understanding tasks (such as STVG and GroundedVQA), mitigating localization drift and improving grounding accuracy. Conclusion: By combining the strengths of LLMs and detectors, DEViL uses the RST and TTReg to achieve end-to-end referential understanding and spatio-temporal localization, improving spatio-temporal consistency while retaining semantic richness. Abstract: Spatio-temporal grounding and reasoning aims to locate the temporal segment and spatial region of an event in a video given a user query, while also reasoning about semantics such as causality, temporal order, and action relationships. To achieve this, current MLLMs primarily treat bounding boxes as text tokens and generate them autoregressively. However, such autoregressive spatial decoding leads to very long output sequences, causing spatial errors to accumulate over time and the localization results to progressively drift across a video. To address this, we present a Detector-Empowered Video LLM, DEViL for short, which couples a Video LLM with an open-vocabulary detector (OVD). Specifically, the MLLM and detector are connected via a reference-semantic token (RST) that distills the user query into a rich semantic representation. Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding, enabling end-to-end learning of both referential understanding and spatial localization. Furthermore, we propose a tube-mined temporal regularization (TTReg) within the OVD, which drives the OVD to generate temporally consistent queries for target objects, thereby ensuring effective temporal association. Experiments demonstrate that DEViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA. Code will be released at https://github.com/gaostar123/DeViL.

[161] RunawayEvil: Jailbreaking the Image-to-Video Generative Models

Songping Wang,Rufan Qian,Yueming Lyu,Qinglong Liu,Linzhuang Zou,Jie Qin,Songhua Liu,Caifeng Shan

Main category: cs.CV

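TL;DR: This paper proposes RunawayEvil, the first multimodal jailbreak framework with dynamic evolutionary capability, for assessing the safety of image-to-video (I2V) generative models. Built on a "Strategy-Tactic-Action" paradigm, it continuously strengthens its attacks without human intervention through self-evolving strategy customization, multimodal tactical planning, and automated execution. Experiments show state-of-the-art attack success rates on several commercial I2V models.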

Details Motivation: I2V生成模型在创意控制方面具有重要意义,但其安全性尤其是对越狱攻击的脆弱性尚未得到充分研究。现有攻击方法缺乏动态适应性和多模态协同能力,难以有效揭示系统漏洞。因此,亟需一种自动化、可进化的攻击框架来系统评估I2V模型的安全边界。 Method: 提出RunawayEvil框架,采用“策略-战术-行动”三层架构:1)策略感知单元利用强化学习和大语言模型实现攻击策略的自我演化;2)多模态战术规划单元生成协调的文本越狱指令与图像篡改指南;3)战术执行单元实施并评估攻击效果。整个框架支持闭环反馈与持续优化。 Result: 在Open-Sora 2.0和CogVideoX等商用I2V模型上验证了RunawayEvil的有效性,攻击成功率显著优于现有方法,在COCO2017数据集上高出58.5%至79%。框架展现出强大的自我进化能力和跨模型泛化性能。 Conclusion: RunawayEvil为I2V模型提供了首个可自我演化的多模态越狱攻击框架,揭示了当前视频生成系统的严重安全风险,同时为构建更鲁棒的防御机制奠定了基础。 Abstract: Image-to-Video (I2V) generation synthesizes dynamic visual content from image and text inputs, providing significant creative control. However, the security of such multimodal systems, particularly their vulnerability to jailbreak attacks, remains critically underexplored. To bridge this gap, we propose RunawayEvil, the first multimodal jailbreak framework for I2V models with dynamic evolutionary capability. Built on a "Strategy-Tactic-Action" paradigm, our framework exhibits self-amplifying attack through three core components: (1) Strategy-Aware Command Unit that enables the attack to self-evolve its strategies through reinforcement learning-driven strategy customization and LLM-based strategy exploration; (2) Multimodal Tactical Planning Unit that generates coordinated text jailbreak instructions and image tampering guidelines based on the selected strategies; (3) Tactical Action Unit that executes and evaluates the multimodal coordinated attacks. This self-evolving architecture allows the framework to continuously adapt and intensify its attack strategies without human intervention. Extensive experiments demonstrate RunawayEvil achieves state-of-the-art attack success rates on commercial I2V models, such as Open-Sora 2.0 and CogVideoX. Specifically, RunawayEvil outperforms existing methods by 58.5 to 79 percent on COCO2017. This work provides a critical tool for vulnerability analysis of I2V models, thereby laying a foundation for more robust video generation systems.

[162] EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy

Yumeng He,Zanwei Zhou,Yekun Zheng,Chen Liang,Yunbo Wang,Xiaokang Yang

Main category: cs.CV

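TL;DR: EMGauss is a general 3D reconstruction framework based on Gaussian splatting that recovers volume electron microscopy (vEM) data from planar-scanned 2D slices, overcoming the limitations of conventional isotropy assumptions.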

Details Motivation: 现有的深度学习方法依赖于各向同性的先验假设,在处理形态上各向异性的生物结构时表现不佳,且vEM数据存在轴向分辨率低和获取受限的问题。 Method: 将切片到3D的重建问题重新定义为基于高斯溅射(Gaussian splatting)的3D动态场景渲染问题,将轴向切片的 progression 视为2D高斯点云的时间演化,并引入教师-学生自举机制,利用未观测切片上的高置信度预测作为伪监督信号以提升稀疏数据下的重建保真度。 Result: 相比基于扩散和GAN的方法,EMGauss显著提升了插值质量,支持连续切片生成,且无需大规模预训练。 Conclusion: EMGauss有效解决了vEM中因采集限制导致的各向异性问题,提供了一种无需强各向同性先验的通用切片到3D重建框架,具有跨成像领域的推广潜力。 Abstract: Volume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors, yet their assumptions break down for morphologically anisotropic structures. We present EMGauss, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a Teacher-Student bootstrapping mechanism that uses high-confidence predictions on unobserved slices as pseudo-supervisory signals. Compared with diffusion- and GAN-based reconstruction methods, EMGauss substantially improves interpolation quality, enables continuous slice synthesis, and eliminates the need for large-scale pretraining. Beyond vEM, it potentially provides a generalizable slice-to-3D solution across diverse imaging domains.

[163] Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

Jisoo Park,Seonghak Lee,Guisik Kim,Taewoo Kim,Junseok Kwon

Main category: cs.CV

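TL;DR: This paper proposes UniVoiceLite, a lightweight, unsupervised audio-visual framework that unifies speech enhancement and speech separation. It uses lip motion and facial identity cues to guide speech extraction and stabilizes the latent space with Wasserstein distance regularization, achieving efficient speech separation and enhancement with strong generalization without paired data.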

Details Motivation: 传统上语音增强和语音分离被作为独立任务处理,但真实场景中常同时存在噪声和重叠说话人,现有方法多为多阶段、参数量大且依赖监督训练,限制了可扩展性和泛化能力,因此需要一个统一、高效的解决方案。 Method: 提出UniVoiceLite,一种轻量级无监督音视频模型,结合唇动和面部身份信息进行语音提取,并引入Wasserstein距离正则化来稳定潜在表示,不依赖成对的干净-噪声数据进行训练。 Result: 实验表明UniVoiceLite在含噪和多人说话场景下均表现出色,兼具高效性与强泛化能力。 Conclusion: UniVoiceLite成功实现了语音增强与分离的统一建模,具备轻量、无需配对训练数据、泛化性强等优点,为实际应用提供了可行方案。 Abstract: Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
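
For intuition about the Wasserstein distance used as a regularizer, the one-dimensional case has a simple closed form: for two equal-size samples, the Wasserstein-1 distance is the mean absolute difference between their sorted values. The sketch below illustrates only that special case; the abstract does not say how the distance is computed over the latent space, and `wasserstein_1d` is a hypothetical helper.

```python
def wasserstein_1d(samples_a, samples_b):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    With uniform weights and equal sizes, the optimal transport plan
    matches the i-th smallest value of one sample to the i-th smallest
    value of the other.
    """
    if not samples_a or len(samples_a) != len(samples_b):
        raise ValueError("samples must be non-empty and of equal size")
    a = sorted(samples_a)
    b = sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

Unlike a divergence that requires overlapping support, this distance grows smoothly as two distributions move apart, which is one common reason to prefer it as a training-time regularizer.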

[164] Graph Convolutional Long Short-Term Memory Attention Network for Post-Stroke Compensatory Movement Detection Based on Skeleton Data

Jiaxing Fan,Jiaojiao Liu,Wenkong Wang,Yang Zhang,Xin Ma,Jichen Zhang

Main category: cs.CV

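TL;DR: Proposes GCN-LSTM-ATT, a skeleton-data-based model for detecting compensatory movements in post-stroke patients; experiments show its accuracy is significantly higher than that of traditional machine learning methods.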

Details Motivation: 中风患者康复训练中普遍存在代偿性动作,影响长期恢复,因此需要精准检测以优化康复策略。 Method: 采用Kinect深度相机采集16名中风患者的骨骼数据,构建GCN-LSTM-ATT模型,并与SVM、KNN、随机森林等传统方法对比,进行消融实验验证各组件贡献。 Result: GCN-LSTM-ATT模型的检测准确率达到0.8580,优于传统机器学习算法,消融实验表明模型各组成部分均对性能提升有显著作用。 Conclusion: 该模型为中风后代偿性运动检测提供了更精确的工具,有望促进康复训练策略的优化。 Abstract: Most stroke patients experience upper limb motor dysfunction. Compensatory movements are prevalent during rehabilitation training, which is detrimental to patients' long-term recovery. Therefore, detecting compensatory movements is of great significance. In this study, a Graph Convolutional Long Short-Term Memory Attention Network (GCN-LSTM-ATT) based on skeleton data is proposed for the detection of compensatory movements after stroke. Sixteen stroke patients were selected in the research. The skeleton data of the patients performing specific rehabilitation movements were collected using the Kinect depth camera. After data processing, detection models were constructed respectively using the GCN-LSTM-ATT model, the Support Vector Machine(SVM), the K-Nearest Neighbor algorithm(KNN), and the Random Forest(RF). The results show that the detection accuracy of the GCN-LSTM-ATT model reaches 0.8580, which is significantly higher than that of traditional machine learning algorithms. Ablation experiments indicate that each component of the model contributes significantly to the performance improvement. These findings provide a more precise and powerful tool for the detection of compensatory movements after stroke, and are expected to facilitate the optimization of rehabilitation training strategies for stroke patients.

[165] FedSCAl: Leveraging Server and Client Alignment for Unsupervised Federated Source-Free Domain Adaptation

M Yashwanth,Sampath Koti,Arunabh Singh,Shyam Marjit,Anirban Chakraborty

Main category: cs.CV

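TL;DR: This paper proposes the FedSCAl framework, which uses a Server-Client Alignment (SCAl) mechanism to address client drift in Federated source-Free Domain Adaptation (FFreeDA), improving pseudo-label accuracy and classification performance.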

Details Motivation: 在FFreeDA设置下,由于客户端数据分布差异大且无法访问源域数据,现有SFDA方法在联邦学习中因极端数据异构性导致客户漂移和伪标签不可靠。 Method: 提出Server-Client Alignment (SCAl) 机制,在训练过程中对齐客户端与服务器模型的预测,以正则化客户端更新,缓解客户漂移并提升伪标签质量。 Result: 在多个基准视觉数据集上的实验表明,FedSCAl在FFreeDA设置下的分类任务中 consistently 优于现有的最先进联邦学习方法。 Conclusion: FedSCAl通过SCAl机制有效解决了FFreeDA中的客户漂移问题,显著提升了无源域自适应场景下的模型性能。 Abstract: We address the Federated source-Free Domain Adaptation (FFreeDA) problem, with clients holding unlabeled data with significant inter-client domain gaps. The FFreeDA setup constrains the FL frameworks to employ only a pre-trained server model as the setup restricts access to the source dataset during the training rounds. Often, this source domain dataset has a distinct distribution to the clients' domains. To address the challenges posed by the FFreeDA setup, adaptation of the Source-Free Domain Adaptation (SFDA) methods to FL struggles with client-drift in real-world scenarios due to extreme data heterogeneity caused by the aforementioned domain gaps, resulting in unreliable pseudo-labels. In this paper, we introduce FedSCAl, an FL framework leveraging our proposed Server-Client Alignment (SCAl) mechanism to regularize client updates by aligning the clients' and server model's predictions. We observe an improvement in the clients' pseudo-labeling accuracy post alignment, as the SCAl mechanism helps to mitigate the client-drift. Further, we present extensive experiments on benchmark vision datasets showcasing how FedSCAl consistently outperforms state-of-the-art FL methods in the FFreeDA setup for classification tasks.
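
One way to picture regularizing client updates by aligning client and server predictions is a penalty on the divergence between the two models' predictive distributions for the same input. The sketch below uses a KL term added to the client's task loss. The choice of KL divergence, its direction, the `weight` parameter, and both function names are assumptions made for illustration; the abstract does not specify the form of the SCAl objective.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def aligned_client_loss(task_loss, client_probs, server_probs, weight=1.0):
    """Client objective with an alignment penalty toward the server model.

    task_loss stands in for the client's local adaptation loss (for
    example, a pseudo-label loss); the penalty is zero when the client
    and server predictions agree.
    """
    return task_loss + weight * kl_divergence(server_probs, client_probs)
```

The penalty leaves the objective unchanged when the two models agree and grows as the client's predictions drift away from the server's, which is the qualitative behavior a drift-mitigating regularizer needs.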

[166] Task-Model Alignment: A Simple Path to Generalizable AI-Generated Image Detection

Ruoxin Chen,Jiahui Gao,Kaiqing Lin,Keyue Zhang,Yandan Zhao,Isabel Guan,Taiping Yao,Shouhong Ding

Main category: cs.CV

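TL;DR: This paper proposes AlignGemini, a two-branch detector for AI-generated image detection built on the Task-Model Alignment principle; it improves detection performance through two complementary tasks, semantic consistency checking and pixel-artifact detection.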

Details Motivation: Existing vision-language models used for AI-generated image detection suffer from a mismatch between task and model, which leaves them insensitive to fine-grained pixel artifacts and prone to hallucination. Method: Decomposes AI-generated image detection into two tasks, semantic consistency checking and pixel-artifact detection, trains a vision-language model and a pixel-artifact expert independently on them, and achieves task-model alignment through orthogonal supervision. Result: On five in-the-wild benchmarks, AlignGemini delivers a +9.5 gain in average accuracy. Conclusion: Task-model alignment is an effective path to generalizable AI-generated image detection; neglecting either task induces systematic blind spots. Abstract: Vision Language Models (VLMs) are increasingly adopted for AI-generated images (AIGI) detection, yet converting VLMs into detectors requires substantial resources, while the resulting models still exhibit severe hallucinations. To probe the core issue, we conduct an empirical analysis and observe two characteristic behaviors: (i) fine-tuning VLMs on high-level semantic supervision strengthens semantic discrimination and generalizes well to unseen data; (ii) fine-tuning VLMs on low-level pixel-artifact supervision yields poor transfer. We attribute VLMs' underperformance to task-model misalignment: semantics-oriented VLMs inherently lack sensitivity to fine-grained pixel artifacts, and semantically non-discriminative pixel artifacts thus exceed their inductive biases. In contrast, we observe that conventional pixel-artifact detectors capture low-level pixel artifacts yet exhibit limited semantic awareness relative to VLMs, highlighting that distinct models are better matched to distinct tasks. In this paper, we formalize AIGI detection as two complementary tasks--semantic consistency checking and pixel-artifact detection--and show that neglecting either induces systematic blind spots. Guided by this view, we introduce the Task-Model Alignment principle and instantiate it as a two-branch detector, AlignGemini, comprising a VLM fine-tuned exclusively with pure semantic supervision and a pixel-artifact expert trained exclusively with pure pixel-artifact supervision. By enforcing orthogonal supervision on two simplified datasets, each branch trains to its strengths, producing complementary discrimination over semantic and pixel cues. On five in-the-wild benchmarks, AlignGemini delivers a +9.5 gain in average accuracy, supporting task-model alignment as an effective path to generalizable AIGI detection.

[167] UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement

Weiqi Li,Xuanyu Zhang,Bin Chen,Jingfen Xie,Yan Wang,Kexin Zhang,Junlin Li,Li Zhang,Jian Zhang,Shijie Zhao

Main category: cs.CV

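TL;DR: This paper proposes UARE, to the authors' knowledge the first unified vision-language model for image quality assessment (IQA), restoration, and enhancement, which uses a two-stage training framework to co-optimize these tasks.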

Details Motivation: Although image quality assessment and image restoration are closely related conceptually, most existing work treats them separately; inspired by progress in multimodal models, the paper explores modeling the two jointly to improve performance. Method: Building on pretrained unified understanding-and-generation models, it designs a two-stage training scheme: the first stage progresses from single degradations to complex mixed degradations; the second stage jointly fine-tunes quality understanding and restoration on interleaved text-image data. Result: Experiments across multiple IQA, restoration, and enhancement tasks show that UARE effectively uses quality-assessment signals to improve restoration. Conclusion: UARE unifies image quality assessment and restoration in a single model, supports the view that understanding tasks can guide generation tasks, and offers a new approach to co-optimizing low-level vision tasks. Abstract: Image quality assessment (IQA) and image restoration are fundamental problems in low-level vision. Although IQA and restoration are closely connected conceptually, most existing work treats them in isolation. Recent advances in unified multimodal understanding-generation models demonstrate promising results and indicate that stronger understanding can improve generative performance. This motivates a single model that unifies IQA and restoration and explicitly studies how IQA can guide restoration, a setting that remains largely underexplored yet highly valuable. In this paper, we propose UARE, to our knowledge the first Unified vision-language model for image quality Assessment, Restoration, and Enhancement. Built on pretrained unified understanding and generation models, we introduce a two-stage training framework. First, a progressive, easy-to-hard schedule expands from single-type distortions to higher-order mixed degradations, enabling UARE to handle multiple degradations. Second, we perform unified fine-tuning of quality understanding and restoration with interleaved text-image data, aligning IQA signals with restoration objectives. Through multi-task co-training, UARE leverages IQA to boost restoration and enhancement performance. Extensive experiments across IQA, restoration, and enhancement tasks demonstrate the effectiveness of UARE. The code and models will be available at https://github.com/lwq20020127/UARE.

[168] VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors

Wenbo Lyu,Yingjun Du,Jinglin Zhao,Xianton Zhen,Ling Shao

Main category: cs.CV

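TL;DR: VisChainBench is a large-scale benchmark for evaluating the multi-step visual reasoning of large vision-language models in multi-image, multi-turn scenarios. It emphasizes context-dependent reasoning and visual-to-visual inference, comprises 1,457 tasks spanning over 20,000 images, and its data and code are publicly available.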

Details Motivation: Existing benchmarks focus mainly on static or horizontal comparisons and rely on language cues, overlooking progressive, context-dependent reasoning and the challenge of visual-to-visual inference. A new benchmark closer to real-world decision-making is therefore needed to evaluate LVLMs. Method: Proposes VisChainBench, with 1,457 tasks and over 20,000 images across multiple domains, built with a multi-agent generation pipeline to ensure high visual diversity and controlled language bias while reducing reliance on language prompts. Result: VisChainBench evaluates LVLMs on multi-step visual reasoning tasks and highlights visual rather than linguistic reasoning ability; the data and code are public to support reproduction and further research. Conclusion: VisChainBench fills the gap left by existing benchmarks in evaluating multi-step, context-dependent visual reasoning and provides a more challenging and realistic testbed for LVLM development. Abstract: Understanding multi-image, multi-turn scenarios is a critical yet underexplored capability for Large Vision-Language Models (LVLMs). Existing benchmarks predominantly focus on static or horizontal comparisons -- e.g., spotting visual differences or assessing appropriateness -- while relying heavily on language cues. Such settings overlook progressive, context-dependent reasoning and the challenge of visual-to-visual inference. To bridge this gap, we present VisChainBench, a large-scale benchmark designed to rigorously evaluate LVLMs' ability to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance. VisChainBench contains 1,457 tasks spanning over 20,000 images across three diverse domains (e.g., daily scenarios, engineering troubleshooting), structured to mimic real-world decision-making processes. Uniquely, the benchmark is constructed using a multi-agent generation pipeline, ensuring high visual diversity and controlled language bias. All the benchmark data and code for benchmark construction are available for viewing and download at the following link: https://huggingface.co/datasets/eyehole/VisChainBench

[169] JOCA: Task-Driven Joint Optimisation of Camera Hardware and Adaptive Camera Control Algorithms

Chengyang Yan,Mitch Bryson,Donald G. Dansereau

Main category: cs.CV

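TL;DR: This paper proposes a method that jointly optimizes camera hardware and adaptive control algorithms to improve downstream vision-task performance, with particular gains under challenging conditions such as low light and fast motion.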

Details Motivation: Existing methods mostly optimize fixed camera parameters and lack adaptive control of parameters that can be adjusted at runtime (such as exposure), which limits further gains in perception performance. Method: Proposes a unified optimization framework that combines gradient-based and derivative-free optimization, supporting continuous and discrete parameters, non-differentiable image formation, and neural-network-based adaptive control; it also designs the hybrid DF-Grad strategy to handle non-differentiable effects such as motion blur. Result: Experiments show that the method outperforms baselines that optimize static and dynamic parameters separately in low-light and fast-motion scenes, significantly improving perception performance. Conclusion: Jointly optimizing camera hardware and adaptive control algorithms improves downstream vision-task performance and provides a unified approach to task-driven camera system design. Abstract: The quality of captured images strongly influences the performance of downstream perception tasks. Recent works on co-designing camera systems with perception tasks have shown improved task performance. However, most prior approaches focus on optimising fixed camera parameters set at manufacturing, while many parameters, such as exposure settings, require adaptive control at runtime. This paper introduces a method that jointly optimises camera hardware and adaptive camera control algorithms with downstream vision tasks. We present a unified optimisation framework that integrates gradient-based and derivative-free methods, enabling support for both continuous and discrete parameters, non-differentiable image formation processes, and neural network-based adaptive control algorithms. To address non-differentiable effects such as motion blur, we propose DF-Grad, a hybrid optimisation strategy that trains adaptive control networks using signals from a derivative-free optimiser alongside unsupervised task-driven learning. Experiments show that our method outperforms baselines that optimise static and dynamic parameters separately, particularly under challenging conditions such as low light and fast motion. These results demonstrate that jointly optimising hardware parameters and adaptive control algorithms improves perception performance and provides a unified approach to task-driven camera system design.

[170] Stitch and Tell: A Structured Multimodal Data Augmentation Method for Spatial Understanding

Hang Yin,Xiaomin He,PeiWen Yuan,Yiwei Li,Jiayi Shi,Wenxiao Fan,Shaoxiong Feng,Kan Li

Main category: cs.CV

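TL;DR: Proposes Stitch and Tell (SiTe), a simple, annotation-free, plug-and-play method that strengthens the spatial understanding of vision-language models by injecting structured spatial supervision into the training data; it mitigates spatial hallucination and improves performance across multiple models and benchmarks.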

Details Motivation: Existing vision-language models often exhibit spatial hallucination, i.e., they describe the relative positions of objects in an image incorrectly, mainly because of the asymmetry between images and text. The models' spatial understanding therefore needs to be strengthened. Method: Proposes Stitch and Tell (SiTe), which builds stitched image-text pairs by stitching images along a spatial axis and generates spatially aware captions or question-answer pairs from the stitched layout, without relying on costly models or human annotation. Result: Validated on LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B, and HALVA-7B, SiTe improves spatial understanding tasks (e.g., MME_Position +5.50%, Spatial-MM +4.19%) while maintaining or improving performance on general benchmarks such as COCO-QA (+1.02%) and MMBench (+4.76%). Conclusion: Explicitly injecting spatial structure into the training data is an effective way to mitigate spatial hallucination and improve spatial understanding while preserving general model performance. Abstract: Existing vision-language models often suffer from spatial hallucinations, i.e., generating incorrect descriptions about the relative positions of objects in an image. We argue that this problem mainly stems from the asymmetric properties between images and text. To enrich the spatial understanding ability of vision-language models, we propose a simple, annotation-free, plug-and-play method named Stitch and Tell (abbreviated as SiTe), which injects structured spatial supervision into data. It constructs stitched image-text pairs by stitching images along a spatial axis and generating spatially-aware captions or question-answer pairs based on the layout of the stitched image, without relying on costly advanced models or human involvement. We evaluate SiTe across three architectures including LLaVA-v1.5-7B, LLaVA-Qwen2-1.5B and HALVA-7B, two training datasets, and eight benchmarks. Experiments show that SiTe improves spatial understanding tasks such as $\text{MME}_{\text{Position}}$ (+5.50%) and Spatial-MM (+4.19%), while maintaining or improving performance on general vision-language benchmarks including COCO-QA (+1.02%) and MMBench (+4.76%). Our findings suggest that explicitly injecting spatially-aware structure into training data offers an effective way to mitigate spatial hallucinations and improve spatial understanding, while preserving general vision-language capabilities.
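The stitching step described above can be illustrated with a toy sketch. Images are represented as plain lists of pixel rows, and the caption and question templates are hypothetical; the abstract does not give the templates SiTe actually uses.

```python
def stitch_horizontal(left, right):
    """Concatenate two images (lists of pixel rows) side by side."""
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]

def spatial_caption(left_caption, right_caption):
    """Build a layout-aware caption from the two source captions (assumed template)."""
    return f"On the left: {left_caption} On the right: {right_caption}"

def spatial_qa(left_caption, right_caption):
    """Derive question-answer pairs whose answers follow from the stitch layout."""
    return [
        (f"Which side shows: {left_caption}", "left"),
        (f"Which side shows: {right_caption}", "right"),
    ]

# Usage: two tiny 2-row "images" and their existing captions.
stitched = stitch_horizontal([[1, 2], [3, 4]], [[5], [6]])
caption = spatial_caption("A dog on grass.", "A red car.")
```

Because the layout is known by construction, the spatial labels come for free, which is why no annotator or captioning model is needed for this step.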

[171] RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting

Longjie Zhao,Ziming Hong,Zhenyang Ren,Runnan Chen,Mingming Gong,Tongliang Liu

Main category: cs.CV

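TL;DR: This paper proposes RDSplat, a robust digital watermarking method for 3D Gaussian Splatting (3DGS) that resists diffusion-based editing attacks. It embeds the watermark in low-frequency Gaussian components that diffusion editing preserves, combines covariance regularization with 2D filtering, and performs adversarial training with Gaussian blur as a proxy. RDSplat is highly robust to diffusion editing while keeping the watermark invisible, and achieves state-of-the-art performance on multiple datasets.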

Details Motivation: Existing 3DGS watermarking methods are easily erased by diffusion-model editing and lack robustness to emerging generative editing operations, so a watermarking mechanism that withstands such attacks is needed to protect the copyright of digital assets. Method: Proposes RDSplat, a multi-domain framework that operates directly in 3DGS space: 1) it proactively selects low-frequency Gaussian components for watermark embedding; 2) it uses covariance regularization and 2D filtering for cross-domain consistency; 3) it introduces Gaussian blur as an efficient proxy for diffusion editing and applies adversarial fine-tuning during training to strengthen robustness. Result: Experiments on three benchmark datasets show that RDSplat significantly outperforms existing methods under diffusion editing; the watermark remains highly robust and visually imperceptible, leads on metrics such as PSNR and SSIM, and withstands a range of editing operations. Conclusion: By exploiting the fact that diffusion editing preserves low-frequency components and by adversarial training with a proxy, RDSplat provides what the summary describes as the first watermarking solution for 3DGS that effectively resists diffusion editing, advancing copyright protection for 3D content. Abstract: 3D Gaussian Splatting (3DGS) has enabled the creation of digital assets and downstream applications, underscoring the need for robust copyright protection via digital watermarking. However, existing 3DGS watermarking methods remain highly vulnerable to diffusion-based editing, which can easily erase embedded provenance. This challenge highlights the urgent need for 3DGS watermarking techniques that are intrinsically resilient to diffusion-based editing. In this paper, we introduce RDSplat, a Robust watermarking paradigm against Diffusion editing for 3D Gaussian Splatting. RDSplat embeds watermarks into 3DGS components that diffusion-based editing inherently preserves, achieved through (i) proactively targeting low-frequency Gaussians and (ii) adversarial training with a diffusion proxy. Specifically, we introduce a multi-domain framework that operates natively in 3DGS space and embeds watermarks into diffusion-editing-preserved low-frequency Gaussians via coordinated covariance regularization and 2D filtering. In addition, we exploit the low-pass filtering behavior of diffusion-based editing by using Gaussian blur as an efficient training surrogate, enabling adversarial fine-tuning that further enhances watermark robustness against diffusion-based editing. Empirically, comprehensive quantitative and qualitative evaluations on three benchmark datasets demonstrate that RDSplat not only maintains superior robustness under diffusion-based editing, but also preserves watermark invisibility, achieving state-of-the-art performance.

[172] Physics Informed Human Posture Estimation Based on 3D Landmarks from Monocular RGB-Videos

Tobias Leuthold,Michele Xiloyannis,Yves Zimmermann

Main category: cs.CV

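TL;DR: Proposes a real-time post-processing algorithm that fuses BlazePose's 2D and 3D pose estimates and adds bone-length and biomechanical-model constraints, significantly improving the accuracy and anatomical consistency of pose estimation; it targets automated physiotherapy, healthcare, and sports coaching on consumer devices.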

Details Motivation: Existing monocular-video pose estimation models such as BlazePose lack anatomical constraints, which introduces errors in practice and limits their reliability in accuracy-critical settings such as physiotherapy; physiological knowledge is therefore needed to improve accuracy and robustness. Method: Proposes a real-time post-processing algorithm based on weighted optimization that fuses BlazePose's 2D and 3D estimates and penalizes deviations from expected bone lengths and biomechanical models; a Kalman filter with adaptive measurement trust refines the bone-length estimates to the individual's anatomy. Result: On the Physio2.2M dataset, 3D MPJPE is reduced by 10.2% and errors in angles between body segments by 16.6% relative to BlazePose 3D estimation, with good anatomical consistency and real-time performance. Conclusion: By introducing physiological constraints and an optimization strategy, the method significantly improves the accuracy and robustness of monocular-video pose estimation, runs on consumer laptops and mobile devices, suits automated physiotherapy, health monitoring, and sports coaching, and protects privacy because the backend processes only anonymized data. Abstract: Applications providing automated coaching for physical training are increasing in popularity, for example physical therapy. These applications rely on accurate and robust pose estimation using monocular video streams. State-of-the-art models like BlazePose excel in real-time pose tracking, but their lack of anatomical constraints indicates improvement potential by including physical knowledge. We present a real-time post-processing algorithm fusing the strengths of BlazePose 3D and 2D estimations using a weighted optimization, penalizing deviations from expected bone length and biomechanical models. Bone length estimations are refined to the individual anatomy using a Kalman filter with adapting measurement trust. Evaluation using the Physio2.2M dataset shows a 10.2 percent reduction in 3D MPJPE and a 16.6 percent decrease in errors of angles between body segments compared to BlazePose 3D estimation. Our method provides a robust, anatomically consistent pose estimation based on a computationally efficient video-to-3D pose estimation, suitable for automated physiotherapy, healthcare, and sports coaching on consumer-level laptops and mobile devices. The refinement runs on the backend with anonymized data only.
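A Kalman filter with adaptive measurement trust can be sketched for a single bone length as below. This is a generic scalar filter under stated assumptions, not the authors' algorithm: the `confidence` input (for instance a landmark visibility score), the inverse scaling rule, and the noise constants are all hypothetical.

```python
def update_bone_length(estimate, variance, measurement, confidence,
                       process_var=1e-6, base_meas_var=1e-2):
    """One scalar Kalman step; low-confidence measurements are trusted less.

    estimate, variance: current bone-length estimate (metres) and its variance.
    measurement: bone length computed from the current frame's 3D landmarks.
    confidence: assumed per-frame trust score in (0, 1].
    """
    # Predict: the bone length is modeled as constant, so only uncertainty grows.
    variance = variance + process_var
    # Adaptive trust: measurement noise scales inversely with confidence.
    meas_var = base_meas_var / max(confidence, 1e-3)
    # Update: blend prediction and measurement by the Kalman gain.
    gain = variance / (variance + meas_var)
    estimate = estimate + gain * (measurement - estimate)
    variance = (1.0 - gain) * variance
    return estimate, variance

# Usage: the same measurement moves the estimate less when confidence is low.
trusted, _ = update_bone_length(0.30, 1.0, 0.40, confidence=1.0)
doubted, _ = update_bone_length(0.30, 1.0, 0.40, confidence=0.01)
```

Over many frames the estimate settles on a per-person bone length, which can then serve as the "expected bone length" that the optimization penalizes deviations from.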

[173] Generalized Geometry Encoding Volume for Real-time Stereo Matching

Jiaxin Liu,Gangwei Xu,Xianqi Wang,Chengliang Zhang,Xin Yang

Main category: cs.CV

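TL;DR: Proposes GGEV, a real-time stereo matching network that achieves strong generalization through depth-aware features and a dynamic cost aggregation module, reaching state-of-the-art performance on multiple benchmarks.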

Details Motivation: Existing real-time stereo matching methods perform well in-domain but neglect the generalization needed in real applications, while methods built on monocular foundation models generalize better but suffer from high inference latency. Method: Proposes the GGEV network, which first extracts depth-aware features that encode domain-invariant structural priors, then uses a Depth-aware Dynamic Cost Aggregation (DDCA) module to adaptively incorporate these priors into each disparity hypothesis, building a geometry encoding volume with strong generalization. Result: GGEV surpasses all existing real-time methods in zero-shot generalization and achieves state-of-the-art performance on the KITTI 2012, KITTI 2015, and ETH3D benchmarks. Conclusion: GGEV balances real-time speed and generalization, offering an efficient and general solution for stereo matching in real-world scenes. Abstract: Real-time stereo matching methods primarily focus on enhancing in-domain performance but often overlook the critical importance of generalization in real-world applications. In contrast, recent stereo foundation models leverage monocular foundation models (MFMs) to improve generalization, but typically suffer from substantial inference latency. To address this trade-off, we propose Generalized Geometry Encoding Volume (GGEV), a novel real-time stereo matching network that achieves strong generalization. We first extract depth-aware features that encode domain-invariant structural priors as guidance for cost aggregation. Subsequently, we introduce a Depth-aware Dynamic Cost Aggregation (DDCA) module that adaptively incorporates these priors into each disparity hypothesis, effectively enhancing fragile matching relationships in unseen scenes. Both steps are lightweight and complementary, leading to the construction of a generalized geometry encoding volume with strong generalization capability. Experimental results demonstrate that our GGEV surpasses all existing real-time methods in zero-shot generalization capability, and achieves state-of-the-art performance on the KITTI 2012, KITTI 2015, and ETH3D benchmarks.

[174] VDOT: Efficient Unified Video Creation via Optimal Transport Distillation

Yutong Wang,Haiyu Zhang,Tianfan Xue,Yu Qiao,Yaohui Wang,Chang Xu,Xinyuan Chen

Main category: cs.CV

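TL;DR: Proposes VDOT, an efficient unified video creation model built on the distribution matching distillation (DMD) framework. It introduces a computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions, adds a discriminator to improve generation quality, and builds an automated data annotation pipeline and a unified benchmark, UVCBench. Experiments show that 4-step VDOT outperforms or matches baselines that need 100 denoising steps.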

Details Motivation: Existing video generation models either support only a few specific conditions or take too long to generate because of complex inference, making them impractical. An efficient unified video generation model that supports many tasks is therefore needed. Method: Proposes VDOT, which adopts the distribution matching distillation (DMD) framework and uses computational optimal transport (OT) to replace or complement KL divergence when optimizing the gap between the real and fake score distributions; it adds a discriminator to strengthen awareness of real data, and builds an automated video data annotation and filtering pipeline along with the unified benchmark UVCBench. Result: With only 4 generation steps, VDOT outperforms or matches baselines that need 100 denoising steps, with clear gains in generation efficiency and stability. Conclusion: By introducing optimal transport and a discriminator, VDOT achieves high-quality, efficient unified video generation in very few steps, offering a practical option for real applications. Abstract: The rapid development of generative models has significantly advanced image and video applications. Among these, video creation, aimed at generating videos under various conditions, has gained substantial attention. However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. Concretely, we model the training process with the distribution matching distillation (DMD) paradigm. Instead of using the Kullback-Leibler (KL) minimization, we additionally employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. The OT distance inherently imposes geometric constraints, mitigating potential zero-forcing or gradient collapse issues that may arise during KL-based distillation within the few-step generation scenario, and thus, enhances the efficiency and stability of the distillation process. Further, we integrate a discriminator to enable the model to perceive real video data, thereby enhancing the quality of generated videos. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering that accommodates multiple video creation tasks. Meanwhile, we curate a unified testing benchmark, UVCBench, to standardize evaluation. Experiments demonstrate that our 4-step VDOT outperforms or matches other baselines with 100 denoising steps.
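To make the contrast with KL concrete, the sketch below computes an entropy-regularized OT cost between two small 1-D point sets with the standard Sinkhorn iteration. It is a generic textbook illustration, not VDOT's OT technique, which the abstract does not specify; the squared-distance cost, `reg`, and the iteration count are assumptions.

```python
import math

def sinkhorn_cost(xs, ys, reg=0.1, n_iters=200):
    """Entropy-regularized OT cost between two uniformly weighted 1-D point sets."""
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n                      # uniform source weights
    b = [1.0 / m] * m                      # uniform target weights
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))

# Usage: the cost grows smoothly as one point set is shifted away from the other.
same = sinkhorn_cost([0.0, 1.0], [0.0, 1.0])
near = sinkhorn_cost([0.0, 1.0], [0.2, 1.2])
far = sinkhorn_cost([0.0, 1.0], [0.5, 1.5])
```

Unlike KL, which is uninformative when two distributions barely overlap, the OT cost still reflects how far apart they are, which is the geometric property the abstract appeals to. For larger costs or a smaller `reg`, a log-domain implementation would be needed to avoid underflow.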

[175] RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models

Xiang Lin,Weixin Li,Shu Guo,Lihong Wang,Di Huang

Main category: cs.CV

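TL;DR: This paper proposes a new Reconstruction-based Multimodal Adapter (RMAdapter) whose dual-branch structure enables efficient fine-tuning of vision-language models in few-shot settings, balancing task specificity and generalization.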

Details Motivation: Existing VLM fine-tuning methods concentrate on prompt tuning, adapter-based methods are underexplored, and balancing adaptation and generalization is difficult in few-shot settings. Method: Proposes RMAdapter with two branches: an adaptation branch that injects task-specific knowledge and a reconstruction branch that preserves general knowledge by reconstructing features back into the original feature space. The reconstruction loss is computed locally at each layer and projection modules are shared to keep the design lightweight, and a consistency constraint regulates the trade-off between discriminability and generalization. Result: On three tasks (generalization to new categories, to new datasets, and domain generalization), RMAdapter outperforms existing state-of-the-art methods without relying on data augmentation or duplicate prompt designs. Conclusion: RMAdapter achieves a dynamic balance between task-specific adaptation and generalization and is an efficient, lightweight, high-performing multimodal adapter. Abstract: Pre-trained Vision-Language Models (VLMs), e.g., CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation and generalization in the obtained model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.

[176] MeshSplatting: Differentiable Rendering with Opaque Meshes

Jan Held,Sanghyun Son,Renaud Vandeghen,Daniel Rebain,Matheus Gadelha,Yi Zhou,Anthony Cioppa,Ming C. Lin,Marc Van Droogenbroeck,Andrea Tagliasacchi

Main category: cs.CV

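TL;DR: Proposes MeshSplatting, a mesh-based reconstruction method that jointly optimizes geometry and appearance through differentiable rendering, is compatible with AR/VR and game engines, and delivers real-time rendering with high visual quality.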

Details Motivation: Addresses the incompatibility of point-based representations (such as 3D Gaussian Splatting) with the mesh-based pipelines of AR/VR and game engines. Method: Enforces connectivity via restricted Delaunay triangulation and jointly optimizes geometry and appearance through differentiable rendering, improving surface consistency. Result: On Mip-NeRF360, PSNR improves by +0.69 dB over the current state-of-the-art MiLo, while training is 2x faster and memory use is halved. Conclusion: MeshSplatting bridges neural rendering and interactive 3D graphics, offering an efficient, high-quality solution for real-time scene interaction. Abstract: Primitive-based splatting methods like 3D Gaussian Splatting have revolutionized novel view synthesis with real-time rendering. However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines. On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction. The project page is available at https://meshsplatting.github.io/.

[177] SparseCoop: Cooperative Perception with Kinematic-Grounded Queries

Jiahao Wang,Zhongwei Jiang,Wenchao Sun,Jiaru Zhong,Haibao Yu,Yuner Zhang,Chenyang Lu,Chuang Zhang,Lei He,Shaobing Xu,Jianqiang Wang

Main category: cs.CV

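TL;DR: This paper proposes SparseCoop, a fully sparse cooperative perception framework for 3D detection and tracking in autonomous driving. It discards intermediate BEV representations and uses kinematic-grounded instance queries, a coarse-to-fine aggregation module, and a cooperative instance denoising task to achieve efficient, robust performance at low communication cost.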

Details Motivation: Existing methods that share dense BEV features incur high communication overhead, lack flexibility, and struggle with cross-view alignment, while sparse query-based methods face inadequate geometric representations, suboptimal fusion strategies, and unstable training. Method: Proposes the SparseCoop framework: 1) kinematic-grounded instance queries that use a state vector with 3D geometry and velocity for precise spatio-temporal alignment; 2) a coarse-to-fine aggregation module for more robust fusion; 3) a cooperative instance denoising task that accelerates and stabilizes training. Intermediate BEV representations are removed entirely. Result: Experiments on the V2X-Seq and Griffin datasets show that SparseCoop reaches state-of-the-art performance in 3D detection and tracking, with higher computational efficiency, lower transmission cost, and strong robustness to communication latency. Conclusion: Through its fully sparse design and new alignment and fusion mechanisms, SparseCoop outperforms existing methods in performance, efficiency, and practicality, offering a more efficient and reliable solution for cooperative perception. Abstract: Cooperative perception is critical for autonomous driving, overcoming the inherent limitations of a single vehicle, such as occlusions and constrained fields-of-view. However, current approaches sharing dense Bird's-Eye-View (BEV) features are constrained by quadratically-scaling communication costs and the lack of flexibility and interpretability for precise alignment across asynchronous or disparate viewpoints. While emerging sparse query-based methods offer an alternative, they often suffer from inadequate geometric representations, suboptimal fusion strategies, and training instability. In this paper, we propose SparseCoop, a fully sparse cooperative perception framework for 3D detection and tracking that completely discards intermediate BEV representations. Our framework features a trio of innovations: a kinematic-grounded instance query that uses an explicit state vector with 3D geometry and velocity for precise spatio-temporal alignment; a coarse-to-fine aggregation module for robust fusion; and a cooperative instance denoising task to accelerate and stabilize training. Experiments on V2X-Seq and Griffin datasets show SparseCoop achieves state-of-the-art performance. Notably, it delivers this with superior computational efficiency, low transmission cost, and strong robustness to communication latency. Code is available at https://github.com/wang-jh18-SVM/SparseCoop.
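Why an explicit state vector with velocity helps spatio-temporal alignment can be shown with a small sketch: a received instance is propagated across the communication delay and then moved into the ego frame. The constant-velocity model, the 6-element state layout, and the planar (yaw-only) transform are simplifying assumptions for illustration, not details taken from the paper.

```python
import math

def propagate(state, delay_s):
    """Advance an (x, y, z, vx, vy, vz) state by delay_s under constant velocity."""
    x, y, z, vx, vy, vz = state
    return (x + vx * delay_s, y + vy * delay_s, z + vz * delay_s, vx, vy, vz)

def to_ego_frame(state, yaw, tx, ty):
    """Rotate by yaw and translate by (tx, ty): the sender's planar pose in the ego frame."""
    x, y, z, vx, vy, vz = state
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * y + tx, s * x + c * y + ty, z,
            c * vx - s * vy, s * vx + c * vy, vz)

# Usage: an object seen 200 ms ago at x = 10 m, moving at 5 m/s along x.
aligned = to_ego_frame(propagate((10.0, 0.0, 0.0, 5.0, 0.0, 0.0), 0.2),
                       yaw=0.0, tx=0.0, ty=0.0)
```

Both steps are closed-form operations on a handful of numbers per instance, which is part of why sparse queries are cheaper to align and transmit than dense BEV feature maps.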

[178] CADE: Continual Weakly-supervised Video Anomaly Detection with Ensembles

Satoshi Hashimoto,Tatsuya Konishi,Tomoya Kaichi,Kazunori Matsumoto,Mori Kurokawa

Main category: cs.CV

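TL;DR: This paper proposes CADE, presented as the first method to combine continual learning (CL) with weakly-supervised video anomaly detection (WVAD). It uses a dual generator to handle data imbalance and label uncertainty and an ensemble of multiple discriminators to recover anomalies missed because of forgetting, and it significantly outperforms existing methods on several multi-scene datasets.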

Details Motivation: Existing WVAD methods mainly target static datasets and overlook the possibility that the data domain changes; without continual learning, training only on new data degrades performance on earlier data (i.e., forgetting). Method: Proposes Continual Anomaly Detection with Ensembles (CADE), which uses a Dual-Generator (DG) to handle data imbalance and label uncertainty in WVAD and an ensemble of Multi-Discriminators (MD) to capture anomalies missed because of forgetting. Result: On multi-scene VAD datasets such as ShanghaiTech and Charlotte Anomaly, CADE significantly outperforms existing VAD methods. Conclusion: CADE combines continual learning with weakly-supervised VAD, mitigating forgetting under domain shift and improving the robustness and completeness of anomaly detection across scenes. Abstract: Video anomaly detection (VAD) has long been studied as a crucial problem in public security and crime prevention. In recent years, weakly-supervised VAD (WVAD) has attracted considerable attention due to its easy annotation process and promising research results. While existing WVAD methods mainly tackle static datasets, the possibility that the domain of data can vary has been neglected. To adapt to such domain shift, the continual learning (CL) perspective is required because otherwise additional training only on newly arriving data could easily cause performance degradation for previous data, i.e., forgetting. Therefore, we propose a brand-new approach, called Continual Anomaly Detection with Ensembles (CADE), that is the first work combining CL and WVAD viewpoints. Specifically, CADE uses the Dual-Generator (DG) to address data imbalance and label uncertainty in WVAD. We also found that forgetting exacerbates the "incompleteness" where the model becomes biased towards certain anomaly modes, leading to missed detections of various anomalies. To address this, we propose an ensemble of Multi-Discriminator (MD) models that captures anomalies missed in past scenes due to forgetting. Extensive experiments show that CADE significantly outperforms existing VAD methods on the common multi-scene VAD datasets, such as ShanghaiTech and Charlotte Anomaly datasets.

[179] Pseudo Anomalies Are All You Need: Diffusion-Based Generation for Weakly-Supervised Video Anomaly Detection

Satoshi Hashimoto,Hitoshi Nishimura,Yanan Wang,Mori Kurokawa

Main category: cs.CV

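TL;DR: This paper proposes PA-VAD, a generation-driven method that trains a video anomaly detector on synthesized pseudo-abnormal videos paired with real normal videos. It needs no real abnormal videos, drives synthesis with only a small set of normal images, and achieves leading performance on ShanghaiTech and UCF-Crime.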

Details Motivation: Real abnormal videos are scarce and costly to collect, which limits practical deployment of video anomaly detection. A training method that does not depend on real abnormal videos is therefore needed for scalable deployment. Method: Proposes PA-VAD, which selects class-relevant initial images with CLIP, refines text prompts with a vision-language model, and then generates pseudo-abnormal videos with a video diffusion model; during training, a domain-aligned regularized module mitigates the excessive spatiotemporal magnitude of synthesized anomalies, and the model is trained with weak supervision together with real normal videos. Result: Reaches 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing the strongest method that uses real anomalies by 0.6% and the UVAD state of the art by 1.9%, respectively. Conclusion: High-accuracy anomaly detection is achievable without collecting real abnormal videos, and PA-VAD offers a practical path to scalable deployment of video anomaly detection. Abstract: Deploying video anomaly detection in practice is hampered by the scarcity and collection cost of real abnormal footage. We address this by training without any real abnormal videos while evaluating under the standard weakly supervised split, and we introduce PA-VAD, a generation-driven approach that learns a detector from synthesized pseudo-abnormal videos paired with real normal videos, using only a small set of real normal images to drive synthesis. For synthesis, we select class-relevant initial images with CLIP and refine textual prompts with a vision-language model to improve fidelity and scene consistency before invoking a video diffusion model. For training, we mitigate excessive spatiotemporal magnitude in synthesized anomalies with a domain-aligned regularized module that combines domain alignment and memory usage-aware updates. Extensive experiments show that our approach reaches 98.2% on ShanghaiTech and 82.5% on UCF-Crime, surpassing the strongest real-abnormal method on ShanghaiTech by +0.6% and outperforming the UVAD state-of-the-art on UCF-Crime by +1.9%. The results demonstrate that high-accuracy anomaly detection can be obtained without collecting real anomalies, providing a practical path toward scalable deployment.

[180] Hide-and-Seek Attribution: Weakly Supervised Segmentation of Vertebral Metastases in CT

Matan Atad,Alexander W. Marka,Lisa Steinhelfer,Anna Curto-Vilalta,Yannik Leonhardt,Sarah C. Foreman,Anna-Sophia Walburga Dietrich,Robert Graf,Alexandra S. Gersing,Bjoern Menze,Daniel Rueckert,Jan S. Kirschke,Hendrik Möller

Main category: cs.CV

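TL;DR: Proposes a weakly supervised method that segments vertebral metastases in CT using only vertebra-level healthy/malignant labels, with no lesion mask annotations. A diffusion autoencoder and pixel-wise difference maps generate candidate regions, and a "Hide-and-Seek" attribution strategy selects the ones that form the final segmentation.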

Details Motivation: Accurate segmentation of vertebral metastases in CT is clinically important, but it is hard to scale because voxel-level annotations are scarce and lesions resemble benign degenerative changes. Method: Uses weakly supervised learning from vertebra-level labels only. A Diffusion Autoencoder (DAE) produces a healthy edit of each vertebra, and pixel-wise difference maps propose candidate lesion regions. The Hide-and-Seek attribution mechanism reveals each candidate in turn while hiding the rest, projects the image back to the data manifold with the DAE, and uses a latent-space classifier to quantify each region's malignant contribution; high-scoring regions form the final segmentation. Result: On a radiologist-annotated test set, the method achieves strong blastic/lytic segmentation without any mask supervision (F1: 0.91/0.85; Dice: 0.87/0.78), exceeding baselines (F1: 0.79/0.67; Dice: 0.74/0.55). Conclusion: Vertebra-level labels can be turned into reliable lesion masks, showing that generative editing combined with selective occlusion supports accurate weakly supervised segmentation in CT. Abstract: Accurate segmentation of vertebral metastasis in CT is clinically important yet difficult to scale, as voxel-level annotations are scarce and both lytic and blastic lesions often resemble benign degenerative changes. We introduce a weakly supervised method trained solely on vertebra-level healthy/malignant labels, without any lesion masks. The method combines a Diffusion Autoencoder (DAE) that produces a classifier-guided healthy edit of each vertebra with pixel-wise difference maps that propose candidate lesion regions. To determine which regions truly reflect malignancy, we introduce Hide-and-Seek Attribution: each candidate is revealed in turn while all others are hidden, the edited image is projected back to the data manifold by the DAE, and a latent-space classifier quantifies the isolated malignant contribution of that component. High-scoring regions form the final lytic or blastic segmentation. On held-out radiologist annotations, we achieve strong blastic/lytic performance despite no mask supervision (F1: 0.91/0.85; Dice: 0.87/0.78), exceeding baselines (F1: 0.79/0.67; Dice: 0.74/0.55). These results show that vertebra-level labels can be transformed into reliable lesion masks, demonstrating that generative editing combined with selective occlusion supports accurate weakly supervised segmentation in CT.
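The reveal-one, hide-the-rest loop can be sketched as follows. Images are flat lists of pixel values, and `score_fn` is a hypothetical stand-in for the paper's two-step scoring (DAE projection followed by a latent-space classifier); the threshold is likewise an assumption.

```python
def hide_and_seek(image, healthy_edit, candidates, score_fn, threshold=0.5):
    """Score each candidate region in isolation and keep the high-scoring ones.

    image, healthy_edit: flat lists of pixel values of equal length.
    candidates: list of sets of pixel indices, one set per candidate region.
    score_fn: callable mapping a probe image to a malignancy score in [0, 1].
    """
    kept = []
    for region in candidates:
        # Reveal this region from the original image; everything else
        # (including all other candidates) is taken from the healthy edit.
        probe = [image[i] if i in region else healthy_edit[i]
                 for i in range(len(image))]
        if score_fn(probe) >= threshold:
            kept.append(region)
    return kept

# Usage with a toy scorer that reacts only to bright pixels.
image = [0, 0, 9, 0, 1, 0]
healthy = [0, 0, 0, 0, 0, 0]
regions = [{2}, {4}]
selected = hide_and_seek(image, healthy, regions, lambda p: max(p) / 9)
```

Scoring each region alone separates regions that drive the malignant prediction from differences that merely appear in the edit, such as benign degenerative changes.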

[181] Omni-Referring Image Segmentation

Qiancheng Zheng,Yunhang Shen,Gen Luo,Baiyang Song,Xing Sun,Xiaoshuai Sun,Yiyi Zhou,Rongrong Ji

Main category: cs.CV

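TL;DR: This paper proposes a new task, Omni-Referring Image Segmentation (OmniRIS), which accepts text instructions and reference images with masks, boxes, or scribbles as multimodal prompts for highly generalized image segmentation; it also builds a large dataset, OmniRef, and proposes a strong baseline model, OmniSegNet.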

Details Motivation: Existing unimodally conditioned segmentation tasks (such as RIS) cannot exploit the strengths of text and visual modalities together, which limits accurate segmentation of fine-grained attributes and uncommon objects; a more general multimodal segmentation framework is therefore needed. Method: Defines the OmniRIS task and its multimodal prompt (omni-prompt) input; builds the large OmniRef dataset with 186,939 omni-prompts; proposes OmniSegNet as a baseline that addresses key challenges such as omni-prompt encoding. Result: Experiments show that OmniSegNet follows multimodal instructions effectively and performs well across segmentation settings, supporting the potential of OmniRIS for highly generalized image segmentation. Conclusion: By combining the complementary strengths of text and visual modalities, OmniRIS enables more flexible and general image segmentation and points to a new direction for multimodal perception systems. Abstract: In this paper, we propose a novel task termed Omni-Referring Image Segmentation (OmniRIS) towards highly generalized image segmentation. Compared with existing unimodally conditioned segmentation tasks, such as RIS and visual RIS, OmniRIS supports the input of text instructions and reference images with masks, boxes or scribbles as omni-prompts. This property lets it exploit the intrinsic merits of both text and visual modalities, i.e., granular attribute referring and uncommon object grounding, respectively. Besides, OmniRIS can also handle various segmentation settings, such as one vs. many and many vs. many, further facilitating its practical use. To promote the research of OmniRIS, we also rigorously design and construct a large dataset termed OmniRef, which consists of 186,939 omni-prompts for 30,956 images, and establish a comprehensive evaluation system. Moreover, a strong and general baseline termed OmniSegNet is also proposed to tackle the key challenges of OmniRIS, such as omni-prompt encoding. The extensive experiments not only validate the capability of OmniSegNet in following omni-modal instructions, but also show the superiority of OmniRIS for highly generalized image segmentation.

[182] Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

Kaixuan Lu,Mehmet Onurcan Kaya,Dim P. Papadopoulos

Main category: cs.CV

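TL;DR: This paper proposes AutoQ-VIS, a new unsupervised video instance segmentation framework that uses quality-guided self-training to bridge the synthetic-to-real domain gap and reaches state-of-the-art performance without human annotations.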

Details Motivation: Video instance segmentation (VIS) faces severe annotation challenges, and existing unsupervised methods are limited by the synthetic-to-real domain gap and transfer poorly. Method: The AutoQ-VIS framework builds a closed loop between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic data to real videos through quality-guided self-training. Result: The method reaches 52.6 $\text{AP}_{50}$ on the YouTubeVIS-2019 validation set, surpassing the previous best method, VideoCutLER, by 4.4%, with no human annotations. Conclusion: Quality-aware self-training is viable for unsupervised VIS and can narrow the domain gap while improving performance. Abstract: Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 $\text{AP}_{50}$ on YouTubeVIS-2019 $\texttt{val}$ set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.

[183] Spatial Retrieval Augmented Autonomous Driving

Xiaosong Jia,Chenhe Zhang,Yule Jiang,Songbur Wong,Zhiyuan Zhang,Chen Chen,Shaofeng Zhang,Xuanhe Zhou,Xue Yang,Junchi Yan,Yu-Gang Jiang

Main category: cs.CV

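TL;DR: Proposes a spatial retrieval paradigm based on offline geographic image retrieval to strengthen autonomous driving systems under limited perception. Geographic images from sources such as Google Maps are added as an extra input to improve performance on several tasks, and the data and code will be open-sourced.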

Details Motivation: Existing autonomous driving systems rely on onboard sensors, so perception degrades under limited field of view, occlusion, or bad weather. Human drivers can navigate from memory in poor visibility, which motivates giving models a similar ability to "recall" road structure. Method: The spatial retrieval paradigm fetches geographic images from offline caches (such as Google Maps) and uses them as an additional input modality. The nuScenes dataset is extended with geographic images aligned to ego-vehicle trajectories, and a benchmark covering five core tasks is built. Result: Experiments show that adding geographic images improves performance on some autonomous driving tasks, with particular promise in perception-limited scenes. Conclusion: The approach is a plug-and-play extension for existing autonomous driving systems and supports the feasibility of using prior geographic information to improve perception and planning; the data and tools will be open-sourced to encourage further research. Abstract: Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc.) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this "recall" ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g., Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD tasks. For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajectories. We establish baselines across five core autonomous driving tasks: object detection, online mapping, occupancy prediction, end-to-end planning, and generative world modeling. Extensive experiments show that the extended modality could enhance the performance of certain tasks. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.

[184] Towards Robust Pseudo-Label Learning in Semantic Segmentation: An Encoding Perspective

Wangkai Li,Rui Sun,Zhaoyang Li,Tianzhu Zhang

Main category: cs.CV

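TL;DR: Proposes ECOCSeg, which uses error-correcting output codes (ECOC) to encode classes at a fine-grained level, improving the stability and generalization of pseudo-label learning in semantic segmentation.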

Details Motivation: Pseudo-label learning is widely used in label-scarce settings, but it tends to produce erroneous pseudo-labels whose effect is amplified by one-hot encoding. Method: An ECOC-based classifier disentangles classes into attributes, and a bit-level label denoising mechanism generates higher-quality pseudo-labels. Result: ECOCSeg significantly outperforms existing methods on multiple UDA and SSL benchmarks and is compatible with different segmentation architectures. Conclusion: Fine-grained encoding and bit-level refinement mitigate pseudo-label noise and improve model robustness and performance. Abstract: Pseudo-label learning is widely used in semantic segmentation, particularly in label-scarce scenarios such as unsupervised domain adaptation (UDA) and semi-supervised learning (SSL). Despite its success, this paradigm can generate erroneous pseudo-labels, which are further amplified during training due to the use of one-hot encoding. To address this issue, we propose ECOCSeg, a novel perspective for segmentation models that utilizes error-correcting output codes (ECOC) to create a fine-grained encoding for each class. ECOCSeg offers several advantages. First, an ECOC-based classifier is introduced, enabling the model to disentangle classes into attributes and handle partially inaccurate bits, improving stability and generalization in pseudo-label learning. Second, a bit-level label denoising mechanism is developed to generate higher-quality pseudo-labels, providing adequate and robust supervision for unlabeled images. ECOCSeg can be easily integrated with existing methods and consistently demonstrates significant improvements on multiple UDA and SSL benchmarks across different segmentation architectures. Code is available at https://github.com/Woof6/ECOCSeg.
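
To make the ECOC idea above concrete, here is a minimal sketch of why a multi-bit class code can tolerate a wrong bit where a one-hot label cannot. The codebook, the class names, and the `decode` helper are all hypothetical illustrations; the abstract does not give ECOCSeg's actual codes or its denoising rule.

```python
# Hypothetical 6-bit codebook for three classes (not ECOCSeg's real codes).
# The minimum Hamming distance between codewords is 3, so any single
# flipped bit still decodes to the intended class.
CODEBOOK = {
    "road":     (0, 0, 0, 0, 0, 0),
    "building": (1, 1, 1, 0, 0, 0),
    "tree":     (0, 0, 1, 1, 1, 1),
}


def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(CODEBOOK, key=lambda name: hamming(CODEBOOK[name], bits))


# A "building" prediction with its second bit corrupted.
noisy = (1, 0, 1, 0, 0, 0)
print(decode(noisy))
```

With one-hot targets, a single wrong entry already names a different class; here the corrupted prediction is still one bit away from its own codeword and at least two bits away from the others.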

[185] SceneMixer: Exploring Convolutional Mixing Networks for Remote Sensing Scene Classification

Mohammed Q. Alkhatib,Ali Jamali,Swalpa Kumar Roy

Main category: cs.CV

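TL;DR: This paper proposes SceneMixer, a lightweight architecture based on the convolutional mixer paradigm for remote sensing scene classification. Multi-scale spatial mixing and channel mixing extract local and contextual information efficiently, giving a good balance of accuracy and efficiency on the AID and EuroSAT datasets.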

Details Motivation: Existing remote sensing scene classification models generalize poorly under changes in spatial resolution, viewpoint, orientation, and background, and they are often computationally expensive, so an efficient and robust lightweight model is needed. Method: A lightweight network based on the convolutional mixer paradigm alternates spatial mixing through multi-scale depthwise convolutions with channel mixing through pointwise operations, extracting local and contextual features with few parameters and computations. Result: On AID, the model reaches overall accuracy, average accuracy, and Kappa values of 74.7%, 74.57%, and 73.79; on EuroSAT it reaches 93.90%, 93.93%, and 93.22. Conclusion: The method offers a good accuracy-efficiency trade-off compared with widely used CNN and Transformer models, making it a candidate for resource-constrained settings. Abstract: Remote sensing scene classification plays a key role in Earth observation by enabling the automatic identification of land use and land cover (LULC) patterns from aerial and satellite imagery. Despite recent progress with convolutional neural networks (CNNs) and vision transformers (ViTs), the task remains challenging due to variations in spatial resolution, viewpoint, orientation, and background conditions, which often reduce the generalization ability of existing models. To address these challenges, this paper proposes a lightweight architecture based on the convolutional mixer paradigm. The model alternates between spatial mixing through depthwise convolutions at multiple scales and channel mixing through pointwise operations, enabling efficient extraction of both local and contextual information while keeping the number of parameters and computations low. Extensive experiments were conducted on the AID and EuroSAT benchmarks. The proposed model achieved overall accuracy, average accuracy, and Kappa values of 74.7%, 74.57%, and 73.79 on the AID dataset, and 93.90%, 93.93%, and 93.22 on EuroSAT, respectively. These results demonstrate that the proposed approach provides a good balance between accuracy and efficiency compared with widely used CNN- and transformer-based models. Code will be publicly available at: https://github.com/mqalkhatib/SceneMixer

[186] Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion

Yu Zhu,Naoya Chiba,Koichi Hashimoto

Main category: cs.CV

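TL;DR: Proposes a hierarchical image-guided 3D segmentation framework that refines results progressively from instance level to part level, handling occlusion and structural complexity to achieve accurate segmentation in industrial scenes.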

Details Motivation: Existing 3D point-based methods depend on costly annotations, image-guided methods suffer from semantic inconsistency across views, and both perform poorly in scenes with dense layouts and multi-scale objects. Method: A staged strategy first performs instance-level segmentation from a rendered top view using SAM-generated masks prompted by YOLO-World. It then renders multi-view images of each instance, applies 2D segmentation and back-projection, and uses Bayesian fusion to keep semantics consistent across views. Result: The method achieves high per-class mIoU on real-world factory data, and experiments on a public dataset support its generalization ability, annotation efficiency, and adaptability. Conclusion: The framework handles occlusion, multi-scale objects, and complex structure well, and it is robust and practical across diverse 3D environments. Abstract: Reliable 3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects, as commonly seen in industrial environments. In such scenarios, heavy occlusion weakens geometric boundaries between objects, and large differences in object scale cause end-to-end models to fail to capture both coarse and fine details accurately. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. To address these challenges, we propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from instance-level to part-level. Instance segmentation involves rendering a top-view image and projecting SAM-generated masks prompted by YOLO-World back onto the 3D point cloud. Part-level segmentation is subsequently performed by rendering multi-view images of each instance obtained from the previous stage and applying the same 2D segmentation and back-projection process at each view, followed by Bayesian updating fusion to ensure semantic consistency across views. Experiments on real-world factory data demonstrate that our method effectively handles occlusion and structural complexity, achieving consistently high per-class mIoU scores. Additional evaluations on a public dataset confirm the generalization ability of our framework, highlighting its robustness, annotation efficiency, and adaptability to diverse 3D environments.
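
The multi-view Bayesian updating step described above can be pictured with a small sketch for a single 3D point. This is an illustration under stated assumptions, not the authors' implementation: the uniform prior, the per-view likelihood vectors, and the function names `bayes_update` and `fuse_views` are all hypothetical.

```python
def bayes_update(prior, likelihood):
    """One Bayesian update: posterior is proportional to prior * likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        # The view carries no usable evidence for this point; keep the prior.
        return list(prior)
    return [u / total for u in unnormalized]


def fuse_views(num_classes, view_likelihoods):
    """Fuse per-view class likelihoods for one point, starting from a uniform prior."""
    belief = [1.0 / num_classes] * num_classes
    for likelihood in view_likelihoods:
        belief = bayes_update(belief, likelihood)
    return belief


# Three views vote on one point; the third view disagrees with the first two.
views = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
]
belief = fuse_views(3, views)
print(belief.index(max(belief)), [round(b, 3) for b in belief])
```

Accumulating evidence this way lets two agreeing views outweigh one conflicting view instead of letting the last projected label overwrite the earlier ones.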

[187] JoPano: Unified Panorama Generation via Joint Modeling

Wancheng Feng,Chen An,Zhenliang He,Meina Kan,Shiguang Shan,Lukun Wang

Main category: cs.CV

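TL;DR: This paper proposes JoPano, a DiT-based joint-face panorama generation method that unifies text-to-panorama and view-to-panorama generation. A Joint-Face Adapter and Poisson Blending improve generation quality and boundary consistency, new seam metrics are introduced, and the method reaches state-of-the-art results on several metrics.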

Details Motivation: Existing panorama generation methods are limited in visual quality by their U-Net architectures, and they treat text-to-panorama and view-to-panorama as separate tasks, which causes modeling redundancy and inefficiency. Method: JoPano builds on a DiT architecture with a cubemap representation. A Joint-Face Adapter transfers generative capabilities learned from natural images, Poisson Blending reduces seams between cube faces, the Seam-SSIM and Seam-Sobel metrics evaluate seam consistency, and a condition switching mechanism unifies both generation tasks in one model. Result: JoPano reaches state-of-the-art performance on FID, CLIP-FID, IS, and CLIP-Score, and generates high-quality panoramas for both tasks. Conclusion: Combining a DiT architecture with joint modeling improves the quality and efficiency of panorama generation and unifies text-to-panorama and view-to-panorama generation. Abstract: Panorama generation has recently attracted growing interest in the research community, with two core tasks, text-to-panorama and view-to-panorama generation. However, existing methods still face two major challenges: their U-Net-based architectures constrain the visual quality of the generated panoramas, and they usually treat the two core tasks independently, which leads to modeling redundancy and inefficiency. To overcome these challenges, we propose a joint-face panorama (JoPano) generation approach that unifies the two core tasks within a DiT-based model. To transfer the rich generative capabilities of existing DiT backbones learned from natural images to the panorama domain, we propose a Joint-Face Adapter built on the cubemap representation of panoramas, which enables a pretrained DiT to jointly model and generate different views of a panorama. We further apply Poisson Blending to reduce seam inconsistencies that often appear at the boundaries between cube faces. Correspondingly, we introduce Seam-SSIM and Seam-Sobel metrics to quantitatively evaluate the seam consistency. Moreover, we propose a condition switching mechanism that unifies text-to-panorama and view-to-panorama tasks within a single model. Comprehensive experiments show that JoPano can generate high-quality panoramas for both text-to-panorama and view-to-panorama generation tasks, achieving state-of-the-art performance on FID, CLIP-FID, IS, and CLIP-Score metrics.

[188] Balanced Learning for Domain Adaptive Semantic Segmentation

Wangkai Li,Rui Sun,Bohao Liao,Zhaoyang Li,Tianzhu Zhang

Main category: cs.CV

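TL;DR: This paper proposes BLDA, a balanced learning method that addresses class imbalance in unsupervised domain adaptive semantic segmentation. It analyzes the distribution of predicted logits and corrects them online, improving performance on under-predicted classes.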

Details Motivation: Because of class imbalance and distribution shift in both data and label space, existing self-training methods learn classes unevenly. Method: BLDA identifies over-predicted and under-predicted classes from the distribution of predicted logits, aligns the logits distributions of different classes using shared anchor distributions, estimates the distributions online to add logits correction terms to the loss, and uses the cumulative density as structural knowledge shared across domains. Result: On two standard UDA semantic segmentation benchmarks, BLDA consistently improves performance, especially for under-predicted classes. Conclusion: BLDA alleviates class bias in domain adaptation, achieves more balanced class learning without prior knowledge, and can be combined with various existing methods. Abstract: Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Despite the effectiveness of self-training techniques in UDA, they struggle to learn each class in a balanced manner due to inherent class imbalance and distribution shift in both data and label space between domains. To address this issue, we propose Balanced Learning for Domain Adaptation (BLDA), a novel approach to directly assess and alleviate class bias without requiring prior knowledge about the distribution shift. First, we identify over-predicted and under-predicted classes by analyzing the distribution of predicted logits. Subsequently, we introduce a post-hoc approach to align the logits distributions across different classes using shared anchor distributions. To further consider the network's need to generate unbiased pseudo-labels during self-training, we estimate logits distributions online and incorporate logits correction terms into the loss function. Moreover, we leverage the resulting cumulative density as domain-shared structural knowledge to connect the source and target domains. Extensive experiments on two standard UDA semantic segmentation benchmarks demonstrate that BLDA consistently improves performance, especially for under-predicted classes, when integrated into various existing methods. Code is available at https://github.com/Woof6/BLDA.

[189] Overcoming Small Data Limitations in Video-Based Infant Respiration Estimation

Liyang Song,Hardik Bishnoi,Sai Kumar Reddy Manne,Sarah Ostadabbas,Briana J. Taylor,Michael Wan

Main category: cs.CV

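TL;DR: This paper introduces AIR-400, an infant respiration dataset of 400 annotated videos, and develops the first reproducible vision-based pipelines for infant respiration estimation, combining infant-specific region-of-interest detection with spatiotemporal neural processing to advance contactless infant respiration monitoring.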

Details Motivation: Infant respiration monitoring lacks large public video datasets and reproducible algorithms, whereas the adult setting is well supported; infant-specific data and methods are needed to improve early detection of breathing irregularities. Method: The AIR-400 dataset contributes 275 newly collected, carefully annotated infant videos, and reproducible respiration estimation pipelines are built on infant-specific region-of-interest detection and spatiotemporal neural processing with optical flow inputs. Result: The work establishes the first reproducible benchmarks for vision-based infant respiration estimation through comprehensive experiments. Conclusion: The study fills a gap in data and algorithms for contactless infant respiration monitoring and supports follow-up research by releasing the dataset, code, and models. Abstract: The development of contactless respiration monitoring for infants could enable advances in the early detection and treatment of breathing irregularities, which are associated with neurodevelopmental impairments and conditions like sudden infant death syndrome (SIDS). But while respiration estimation for adults is supported by a robust ecosystem of computer vision algorithms and video datasets, only one small public video dataset with annotated respiration data for infant subjects exists, and there are no reproducible algorithms which are effective for infants. We introduce the annotated infant respiration dataset of 400 videos (AIR-400), contributing 275 new, carefully annotated videos from 10 recruited subjects to the public corpus. We develop the first reproducible pipelines for infant respiration estimation, based on infant-specific region-of-interest detection and spatiotemporal neural processing enhanced by optical flow inputs. We establish, through comprehensive experiments, the first reproducible benchmarks for the state-of-the-art in vision-based infant respiration estimation. We make our dataset, code repository, and trained models available for public use.

[190] Scaling Zero-Shot Reference-to-Video Generation

Zijian Zhou,Shikun Liu,Haozhe Liu,Haonan Qiu,Zhaochong An,Weiming Ren,Zhiheng Liu,Xiaoke Huang,Kam Woh Ng,Tian Xie,Xiao Han,Yuren Cong,Hang Li,Chuyan Zhu,Aditya Patel,Tao Xiang,Sen He

Main category: cs.CV

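TL;DR: Saber is a scalable zero-shot reference-to-video (R2V) generation framework that needs no explicit reference image-video-text triplets. A masked training strategy and an attention-based model design give it identity-consistent, reference-aware generation.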

Details Motivation: Existing R2V methods rely on explicit triplet data that is expensive and hard to scale, which limits scalability and calls for a framework that does not need such data. Method: Saber is trained only on video-text pairs, using a masked training strategy, an attention-based model design, and mask augmentation to learn identity-consistent, reference-aware representations and reduce copy-paste artifacts. Result: Saber outperforms methods trained with R2V data on the OpenS2V-Eval benchmark and generalizes well to varying numbers of reference images. Conclusion: Saber offers a scalable, zero-shot approach to R2V generation that removes the dependence on annotated R2V data while producing high-quality identity-preserving videos. Abstract: Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.

[191] Can We Go Beyond Visual Features? Neural Tissue Relation Modeling for Relational Graph Analysis in Non-Melanoma Skin Histology

Shravan Venkatraman,Muthu Subash Kavitha,Joe Dhanith P R,V Manikandarajan,Jia Wu

Main category: cs.CV

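TL;DR: This paper proposes Neural Tissue Relation Modeling (NTRM), a tissue segmentation framework that adds a tissue-level graph neural network on top of a CNN to model spatial and functional relationships between tissues, substantially improving segmentation of complex regions in skin cancer histopathology images.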

Details Motivation: Existing CNN methods segment mainly from visual texture and struggle to capture biological context and spatial relationships between tissues, especially where tissues overlap or look alike; a method that models inter-tissue dependencies explicitly is needed. Method: NTRM first produces an initial CNN segmentation, builds a graph over tissue regions, propagates context through graph neural network message passing, and refines the segmentation with a spatial projection module. Result: On a non-melanoma skin cancer histopathology segmentation dataset, NTRM's Dice similarity coefficient is 4.9% to 31.25% higher than the best evaluated models, with more coherent predictions in boundary-dense regions. Conclusion: Tissue-level relational modeling improves the accuracy and structural plausibility of histopathology segmentation and offers a path toward more interpretable medical image analysis. Abstract: Histopathology image segmentation is essential for delineating tissue structures in skin cancer diagnostics, but modeling spatial context and inter-tissue relationships remains a challenge, especially in regions with overlapping or morphologically similar tissues. Current convolutional neural network (CNN)-based approaches operate primarily on visual texture, often treating tissues as independent regions and failing to encode biological context. To this end, we introduce Neural Tissue Relation Modeling (NTRM), a novel segmentation framework that augments CNNs with a tissue-level graph neural network to model spatial and functional relationships across tissue types. NTRM constructs a graph over predicted regions, propagates contextual information via message passing, and refines segmentation through spatial projection. Unlike prior methods, NTRM explicitly encodes inter-tissue dependencies, enabling structurally coherent predictions in boundary-dense zones. On the benchmark Histopathology Non-Melanoma Skin Cancer Segmentation Dataset, NTRM outperforms state-of-the-art methods, achieving a robust Dice similarity coefficient that is 4.9% to 31.25% higher than the best-performing models among the evaluated approaches. Our experiments indicate that relational modeling offers a principled path toward more context-aware and interpretable histological segmentation, compared to local receptive-field architectures that lack tissue-level structural awareness. Our code is available at https://github.com/shravan-18/NTRM.

[192] Selective Masking based Self-Supervised Learning for Image Semantic Segmentation

Yuemin Wang,Ian Stavness

Main category: cs.CV

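TL;DR: Proposes a self-supervised pretraining method for semantic segmentation based on selective-masking image reconstruction. It iteratively masks the image patches with the highest reconstruction loss, outperforms random masking and supervised ImageNet pretraining on several datasets, and especially improves accuracy on the lowest-performing downstream classes.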

Details Motivation: Masked image modelling usually masks patches at random and so does not exploit what the model has already learned; a more effective self-supervised pretraining method is needed to improve semantic segmentation, particularly for low-resource settings and low-performing classes. Method: Selective masking splits reconstruction pretraining into iterative steps and, at each step, masks the patches that the current model reconstructs worst, using the model's own knowledge to choose training targets. Result: On Pascal VOC, Cityscapes, Nassar 2020, and Sugarbeets 2016, the method improves downstream segmentation accuracy over random masking and supervised ImageNet pretraining by 2.9% on the general datasets and 2.5% on the weed segmentation datasets, significantly improves the lowest-performing classes, and shows that pretraining on the downstream dataset itself works best for low-budget pretraining. Conclusion: Selective Masking Image Reconstruction is an effective and practical self-supervised pretraining scheme for end-to-end semantic segmentation, especially when model capacity must be limited to meet inference speed and compute constraints. Abstract: This paper proposes a novel self-supervised learning method for semantic segmentation using selective masking image reconstruction as the pretraining task. Our proposed method replaces the random masking augmentation used in most masked image modelling pretraining methods. The proposed selective masking method selectively masks image patches with the highest reconstruction loss by breaking the image reconstruction pretraining into iterative steps to leverage the trained model's knowledge. We show on two general datasets (Pascal VOC and Cityscapes) and two weed segmentation datasets (Nassar 2020 and Sugarbeets 2016) that our proposed selective masking method outperforms the traditional random masking method and supervised ImageNet pretraining on downstream segmentation accuracy by 2.9% for general datasets and 2.5% for weed segmentation datasets. Furthermore, we found that our selective masking method significantly improves accuracy for the lowest-performing classes. Lastly, we show that using the same pretraining and downstream dataset yields the best result for low-budget self-supervised pretraining. Our proposed Selective Masking Image Reconstruction method provides an effective and practical solution to improve end-to-end semantic segmentation workflows, especially for scenarios that require limited model capacity to meet inference speed and computational resource requirements.
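
The core selection rule above, masking the patches the current model reconstructs worst, can be sketched in a few lines. This is a simplified illustration: the `select_mask` name, the fixed mask ratio, and the example losses are assumptions, and the paper's iterative schedule is not reproduced here.

```python
def select_mask(patch_losses, mask_ratio):
    """Return sorted indices of the patches with the highest reconstruction loss.

    patch_losses: per-patch reconstruction losses from the current model.
    mask_ratio:   fraction of patches to mask (assumed hyperparameter).
    """
    num_masked = max(1, int(len(patch_losses) * mask_ratio))
    ranked = sorted(
        range(len(patch_losses)),
        key=lambda i: patch_losses[i],
        reverse=True,
    )
    return sorted(ranked[:num_masked])


# Eight patches; mask the four that are currently hardest to reconstruct.
losses = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5, 0.8, 0.4]
print(select_mask(losses, 0.5))
```

Random masking would instead draw the four indices uniformly, regardless of how well the model already reconstructs each patch.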

[193] Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues

Tuan-Anh Vu,Hai Nguyen-Truong,Ziqiang Zheng,Binh-Son Hua,Qing Guo,Ivor Tsang,Sai-Kit Yeung

Main category: cs.CV

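TL;DR: This paper proposes TransCues, a framework whose boundary and reflection feature enhancement modules improve segmentation of transparent objects such as glass, clearly outperforming existing methods on several benchmark datasets.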

Details Motivation: Existing segmentation methods struggle to tell transparent glass from opaque objects and do not fully combine the boundary and reflection cues that human perception relies on, so a more effective transparent object segmentation method is needed. Method: TransCues uses a pyramidal transformer encoder-decoder with Boundary Feature Enhancement and Reflection Feature Enhancement modules that fuse the two visual cues in a mutually beneficial way. Result: The method leads on Trans10K-v2, MSD, RGBD-Mirror, TROSD, and Stanford2D3D, with mIoU gains of 4.2%, 5.6%, 10.1%, 13.1%, and 8.3%, respectively. Conclusion: Jointly using boundary and reflection features markedly improves transparent object segmentation, and TransCues performs strongly across varied scenes. Abstract: Glass is a prevalent material among solid objects in everyday life, yet segmentation methods struggle to distinguish it from opaque materials due to its transparency and reflection. While it is known that human perception relies on boundary and reflective-object features to distinguish glass objects, the existing literature has not yet sufficiently captured both properties when handling transparent objects. Hence, we propose incorporating both of these powerful visual cues via the Boundary Feature Enhancement and Reflection Feature Enhancement modules in a mutually beneficial way. Our proposed framework, TransCues, is a pyramidal transformer encoder-decoder architecture to segment transparent objects. We empirically show that these two modules can be used together effectively, improving overall performance across various benchmark datasets, including glass object semantic segmentation, mirror object semantic segmentation, and generic segmentation datasets. Our method outperforms the state-of-the-art by a large margin, achieving +4.2% mIoU on Trans10K-v2, +5.6% mIoU on MSD, +10.1% mIoU on RGBD-Mirror, +13.1% mIoU on TROSD, and +8.3% mIoU on Stanford2D3D, showing the effectiveness of our method against glass objects.

[194] Evaluating and Preserving High-level Fidelity in Super-Resolution

Josep M. Rocafort,Shaolin Su,Javier Vazquez-Corral,Alexandra Gomez-Villa

Main category: cs.CV

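TL;DR: This paper argues for measuring high-level fidelity in super-resolution (SR) models, builds the first annotated dataset with fidelity scores, and evaluates how well SOTA models preserve high-level fidelity. Existing image quality metrics correlate weakly with fidelity, while foundation models suit the task better. Fine-tuning SR models on fidelity feedback improves both semantic fidelity and perceptual quality.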

Details Motivation: Existing low-level image quality metrics fail to capture the high-level distortions that arise when generative SR models change image content, so a new criterion is needed to measure the semantic reliability of SR results. Method: The authors build a dataset of outputs from different SR models with human-annotated fidelity scores, analyze how existing image quality metrics correlate with perceived fidelity, explore foundation models for high-level fidelity assessment, and fine-tune SR models using the fidelity feedback. Result: Current SOTA SR models still fall short on high-level fidelity; traditional image quality metrics correlate poorly with it; foundation models assess fidelity more effectively; and fidelity-based fine-tuning improves both semantic consistency and visual quality. Conclusion: High-level fidelity is an important complementary criterion for evaluating generative SR models, and the proposed framework and dataset support more reliable measurement and optimization of semantic consistency and visual quality. Abstract: Recent image Super-Resolution (SR) models are achieving impressive effects in reconstructing details and delivering visually pleasant outputs. However, the overpowering generative ability can sometimes hallucinate and thus change the image content despite gaining high visual quality. This type of high-level change can be easily identified by humans yet not well-studied in existing low-level image quality metrics. In this paper, we establish the importance of measuring high-level fidelity for SR models as a complementary criterion to reveal the reliability of generative SR models. We construct the first annotated dataset with fidelity scores from different SR models, and evaluate how state-of-the-art (SOTA) SR models actually perform in preserving high-level fidelity. Based on the dataset, we then analyze how existing image quality metrics correlate with fidelity measurement, and further show that this high-level task can be better addressed by foundation models. Finally, by fine-tuning SR models based on our fidelity feedback, we show that both semantic fidelity and perceptual quality can be improved, demonstrating the potential value of our proposed criteria, both in model evaluation and optimization. We will release the dataset, code, and models upon acceptance.

[195] DAUNet: A Lightweight UNet Variant with Deformable Convolutions and Parameter-Free Attention for Medical Image Segmentation

Adnan Munir,Shujaat Khan

Main category: cs.CV

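TL;DR: This paper proposes DAUNet, a lightweight UNet variant that combines Deformable V2 Convolutions with Parameter-Free Attention (SimAM) to improve medical image segmentation without increasing model complexity.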

Details Motivation: Medical image segmentation needs better adaptability to geometric variation and context while staying lightweight enough for resource-constrained clinical settings. Method: DAUNet uses Deformable V2 Convolutions in the bottleneck to capture geometric deformation dynamically, and adds SimAM attention modules in the decoder and skip pathways to strengthen the fusion of salient features. Result: On the FH-PS-AoP and FUMPE datasets, DAUNet outperforms state-of-the-art models in Dice score, HD95, and ASD with better parameter efficiency; ablation studies confirm the contribution of each component. Conclusion: DAUNet's robustness to missing context and low-contrast regions makes it suitable for real-time and resource-constrained clinical use. Abstract: Medical image segmentation plays a pivotal role in automated diagnostic and treatment planning systems. In this work, we present DAUNet, a novel lightweight UNet variant that integrates Deformable V2 Convolutions and Parameter-Free Attention (SimAM) to improve spatial adaptability and context-aware feature fusion without increasing model complexity. DAUNet's bottleneck employs dynamic deformable kernels to handle geometric variations, while the decoder and skip pathways are enhanced using SimAM attention modules for saliency-aware refinement. Extensive evaluations on two challenging datasets, FH-PS-AoP (fetal head and pubic symphysis ultrasound) and FUMPE (CT-based pulmonary embolism detection), demonstrate that DAUNet outperforms state-of-the-art models in Dice score, HD95, and ASD, while maintaining superior parameter efficiency. Ablation studies highlight the individual contributions of deformable convolutions and SimAM attention. DAUNet's robustness to missing context and low-contrast regions establishes its suitability for deployment in real-time and resource-constrained clinical environments.
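
To show what "parameter-free" means for SimAM, here is a minimal sketch of the attention weighting on one flattened feature channel, following the formulation commonly used in public SimAM implementations. How DAUNet wires this into its decoder and skip pathways is not described in the abstract, so the function name `simam_channel`, the default `e_lambda`, and the toy input are assumptions.

```python
import math


def simam_channel(values, e_lambda=1e-4):
    """Apply SimAM-style weighting to one flattened channel (needs >= 2 values).

    Each activation is scaled by sigmoid(d / (4 * (variance + e_lambda)) + 0.5),
    where d is its squared deviation from the channel mean. No learned weights
    are involved; e_lambda is a small stabilizing constant.
    """
    n = len(values) - 1
    mean = sum(values) / len(values)
    squared_dev = [(v - mean) ** 2 for v in values]
    variance = sum(squared_dev) / n
    output = []
    for v, d in zip(values, squared_dev):
        inverse_energy = d / (4.0 * (variance + e_lambda)) + 0.5
        gate = 1.0 / (1.0 + math.exp(-inverse_energy))
        output.append(v * gate)
    return output


# One channel with three similar activations and one that stands out.
channel = [1.0, 1.0, 1.0, 5.0]
print([round(v, 3) for v in simam_channel(channel)])
```

Activations that differ most from the channel mean receive the largest gate, which is how the module emphasizes salient positions without adding parameters.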

[196] RAVE: Rate-Adaptive Visual Encoding for 3D Gaussian Splatting

Hoang-Nhat Tran,Francesco Di Sario,Gabriele Spadaro,Giuseppe Valenzise,Enzo Tartaglione

Main category: cs.CV

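TL;DR: Proposes a flexible compression scheme for 3D Gaussian Splatting that supports interpolation at any rate within predefined bounds, needs no retraining, is computationally lightweight, and preserves rendering quality.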

Details Motivation: 3D Gaussian Splatting (3DGS) enables real-time photorealistic rendering but has large memory requirements and costly training, and existing compression methods cannot adapt to different bandwidth and device constraints. Method: A lightweight compression framework supports interpolation at any rate, adjusting the compression rate smoothly between operating points without retraining. Result: Experiments show efficient, high-quality compression across a broad range of operating points, with rendering quality preserved and dynamic rate control supported. Conclusion: The method is a practical, flexible compression solution for 3DGS, suited to deployment in immersive multimedia applications. Abstract: Recent advances in neural scene representations have transformed immersive multimedia, with 3D Gaussian Splatting (3DGS) enabling real-time photorealistic rendering. Despite its efficiency, 3DGS suffers from large memory requirements and costly training procedures, motivating efforts toward compression. Existing approaches, however, operate at fixed rates, limiting adaptability to varying bandwidth and device constraints. In this work, we propose a flexible compression scheme for 3DGS that supports interpolation at any rate between predefined bounds. Our method is computationally lightweight, requires no retraining for any rate, and preserves rendering quality across a broad range of operating points. Experiments demonstrate that the approach achieves efficient, high-quality compression while offering dynamic rate control, making it suitable for practical deployment in immersive applications. The code will be provided open-source upon acceptance of the work.

[197] $\mathrm{D}^{\mathrm{3}}$-Predictor: Noise-Free Deterministic Diffusion for Dense Prediction

Changliang Xia,Chengyou Jia,Minnan Luo,Zhuohang Dang,Xin Shen,Bowen Ping

Main category: cs.CV

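TL;DR: This paper proposes $\mathrm{D}^{\mathrm{3}}$-Predictor, a noise-free deterministic framework that reformulates a pretrained diffusion model to remove stochastic noise. It treats the model as an ensemble of timestep-dependent visual experts and aggregates their priors in a self-supervised way, giving efficient single-step dense prediction that better preserves geometric structure and reaches strong performance on multiple tasks.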

Details Motivation: Existing diffusion-based dense prediction methods rely on stochastic noise sampling, which corrupts fine-grained spatial cues and geometric structure mappings and is therefore mismatched with deterministic dense prediction tasks. Method: The pretrained diffusion model is recast as an ensemble of timestep-dependent visual experts with the stochastic noise removed; their heterogeneous priors are aggregated in a self-supervised manner into a clean, complete geometric prior, which is then adapted to dense prediction tasks with task-specific supervision. Result: $\mathrm{D}^{\mathrm{3}}$-Predictor achieves competitive or state-of-the-art performance on multiple dense prediction tasks, uses less than half the training data previously required, and supports efficient single-step inference. Conclusion: $\mathrm{D}^{\mathrm{3}}$-Predictor addresses the negative effect of stochastic noise on dense prediction with diffusion models, offering a more efficient and accurate deterministic framework for using diffusion priors in geometry-sensitive tasks. Abstract: Although diffusion models with strong visual priors have emerged as powerful dense prediction backbones, they overlook a core limitation: the stochastic noise at the core of diffusion sampling is inherently misaligned with dense prediction that requires a deterministic mapping from image to geometry. In this paper, we show that this stochastic noise corrupts fine-grained spatial cues and pushes the model toward timestep-specific noise objectives, consequently destroying meaningful geometric structure mappings. To address this, we introduce $\mathrm{D}^{\mathrm{3}}$-Predictor, a noise-free deterministic framework built by reformulating a pretrained diffusion model without stochastic noise. Instead of relying on noisy inputs to leverage diffusion priors, $\mathrm{D}^{\mathrm{3}}$-Predictor views the pretrained diffusion network as an ensemble of timestep-dependent visual experts and self-supervisedly aggregates their heterogeneous priors into a single, clean, and complete geometric prior. Meanwhile, we utilize task-specific supervision to seamlessly adapt this noise-free prior to dense prediction tasks. Extensive experiments on various dense prediction tasks demonstrate that $\mathrm{D}^{\mathrm{3}}$-Predictor achieves competitive or state-of-the-art performance in diverse scenarios. In addition, it requires less than half the training data previously used and efficiently performs inference in a single step. Our code, data, and checkpoints are publicly available at https://x-gengroup.github.io/HomePage_D3-Predictor/.

[198] Persistent Homology-Guided Frequency Filtering for Image Compression

Anil Chintapalli,Peter Tenholder,Henry Chen,Arjun Rao

Main category: cs.CV

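TL;DR: This paper proposes an image feature extraction method that combines the discrete Fourier transform with persistent homology analysis, so that images can be compressed and reconstructed under noise, with the aim of improving convolutional neural network performance on binary classification.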

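Details Motivation: In noisy image datasets, traditional feature extraction makes model reliability hard to guarantee, so a compression and reconstruction method that retains meaningful topological features is needed. Method: The discrete Fourier transform extracts image frequencies, and persistent homology analysis selects the frequency components associated with particular topological features, enabling compression and reconstruction. Result: Compression is comparable to JPEG across six metrics while keeping meaningful data distinguishable, and the method shows potential to improve binary classification when augmenting a CNN, relative to traditional feature extraction and compression. Conclusion: Persistent homology-guided frequency filtering can make image compression more reliable under noisy conditions and may strengthen feature extraction in machine learning tasks. Abstract: Feature extraction in noisy image datasets presents many challenges in model reliability. In this paper, we use the discrete Fourier transform in conjunction with persistent homology analysis to extract specific frequencies that correspond with certain topological features of an image. This method allows the image to be compressed and reformed while ensuring that meaningful data can be differentiated. Our experimental results show a level of compression comparable to that of JPEG across six different metrics. The end goal of persistent homology-guided frequency filtration is its potential to improve performance in binary classification tasks (when augmenting a Convolutional Neural Network) compared to traditional feature extraction and compression methods. These findings highlight a useful end result: enhancing the reliability of image compression under noisy conditions.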

[199] Context-measure: Contextualizing Metric for Camouflage

Chen-Yang Wang,Gepeng Ji,Song Shao,Ming-Ming Cheng,Deng-Ping Fan

Main category: cs.CV

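TL;DR: Proposes Context-measure, a new contextualized evaluation paradigm built on a probabilistic pixel-aware correlation framework. By incorporating spatial dependencies and pixel-wise camouflage quantification, it aligns better with human perception, and experiments on multiple datasets show it is more reliable than existing metrics.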

Details Motivation: Current evaluation metrics for camouflaged scenarios overlook context dependence, a critical factor. They were originally designed for general or salient objects and assume uncorrelated spatial context, so they cannot accurately reflect camouflaged objects. Method: Proposes Context-measure, built on a probabilistic pixel-aware correlation framework that combines spatial dependencies with pixel-wise camouflage quantification to evaluate camouflage. Result: Experiments on three challenging camouflaged object segmentation datasets show that Context-measure is more reliable than existing context-independent metrics. Conclusion: Context-measure evaluates camouflaged objects more faithfully and can serve as a foundational evaluation benchmark for computer vision applications involving camouflaged patterns, such as agriculture, industry, and medicine. Abstract: Camouflage is primarily context-dependent, yet current metrics for camouflaged scenarios overlook this critical factor. Instead, these metrics are originally designed for evaluating general or salient objects, with an inherent assumption of uncorrelated spatial context. In this paper, we propose a new contextualized evaluation paradigm, Context-measure, built upon a probabilistic pixel-aware correlation framework. By incorporating spatial dependencies and pixel-wise camouflage quantification, our measure better aligns with human perception. Extensive experiments across three challenging camouflaged object segmentation datasets show that Context-measure delivers more reliability than existing context-independent metrics. Our measure can provide a foundational evaluation benchmark for various computer vision applications involving camouflaged patterns, such as agricultural, industrial, and medical scenarios. Code is available at https://github.com/pursuitxi/Context-measure.

[200] DFIR-DETR: Frequency Domain Enhancement and Dynamic Feature Aggregation for Cross-Scene Small Object Detection

Bo Gao,Jingcheng Tong,Xingsheng Chen,Han Yu,Zichen Li

Main category: cs.CV

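TL;DR: This paper proposes DFIR-DETR, a detector based on dynamic feature aggregation and frequency-domain processing that targets small object detection in UAV remote sensing and industrial defect inspection. It achieves state-of-the-art performance on the NEU-DET and VisDrone datasets while remaining lightweight.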

Details Motivation: Small object detection in remote sensing and industrial inspection faces sparse features, cluttered backgrounds, and large scale variation, and existing transformer-based detectors suffer from feature degradation, limited local receptive fields, and redundant upsampling. Method: Proposes DFIR-DETR with three modules: the DCFA module uses dynamic K-sparse attention and spatial gated linear units to reduce computational complexity and strengthen nonlinear modeling; the DFPN module introduces amplitude-normalized upsampling and dual-path shuffle convolution to retain multi-scale spatial details; the FIRC3 module operates in the frequency domain to obtain a global receptive field. Result: Achieves mAP50 of 92.9% on NEU-DET and 51.6% on VisDrone with only 11.7M parameters and 41.2 GFLOPs, making the model lightweight and efficient. Conclusion: Through dynamic feature aggregation and frequency-domain processing, DFIR-DETR effectively addresses feature degradation and long-range dependency modeling in small object detection, generalizes well across scenes, and suits resource-constrained environments. Abstract: Detecting small objects in UAV remote sensing images and identifying surface defects in industrial inspection remain difficult tasks. These applications face common obstacles: features are sparse and weak, backgrounds are cluttered, and object scales vary dramatically. Current transformer-based detectors, while powerful, struggle with three critical issues. First, features degrade severely as networks downsample progressively. Second, spatial convolutions cannot capture long-range dependencies effectively. Third, standard upsampling methods inflate feature maps unnecessarily. We introduce DFIR-DETR to tackle these problems through dynamic feature aggregation combined with frequency-domain processing. Our architecture builds on three novel components. The DCFA module uses dynamic K-sparse attention, cutting complexity from $O(N^2)$ down to $O(NK)$, and employs spatial gated linear units for better nonlinear modeling. The DFPN module applies amplitude-normalized upsampling to prevent feature inflation and uses dual-path shuffle convolution to retain spatial details across scales. The FIRC3 module operates in the frequency domain, achieving global receptive fields without sacrificing efficiency. We tested our method extensively on NEU-DET and VisDrone datasets. Results show mAP50 scores of 92.9% and 51.6% respectively, both state-of-the-art. The model stays lightweight with just 11.7M parameters and 41.2 GFLOPs.
Strong performance across two very different domains confirms that DFIR-DETR generalizes well and works effectively in resource-limited settings for cross-scene small object detection.
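To make the K-sparse attention idea concrete, here is a minimal sketch in which each query attends only to its K highest-scoring keys. It is an illustration under assumptions, not the DCFA module: the abstract does not describe how the K candidates are chosen, and this version still scores every key, so only the softmax and value aggregation are reduced to K terms per query.

```python
import math

def topk_sparse_attention(queries, keys, values, k):
    """For each query, softmax-attend over only its k best-matching keys."""
    outputs = []
    for q in queries:
        scale = math.sqrt(len(q))
        # Scaled dot-product score against every key (dense scoring step).
        scores = [sum(a * b for a, b in zip(q, key)) / scale for key in keys]
        top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Numerically stable softmax restricted to the selected keys.
        peak = max(scores[i] for i in top)
        weights = [math.exp(scores[i] - peak) for i in top]
        norm = sum(weights)
        dim = len(values[0])
        outputs.append([sum(w / norm * values[i][d] for w, i in zip(weights, top))
                        for d in range(dim)])
    return outputs
```

With `k` equal to the number of keys this reduces to ordinary softmax attention, which is a convenient sanity check when experimenting with smaller `k`.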

[201] COREA: Coarse-to-Fine 3D Representation Alignment Between Relightable 3D Gaussians and SDF via Bidirectional 3D-to-3D Supervision

Jaeyoon Lee,Hojoon Jung,Sungtae Hwang,Jihyong Oh,Jongwon Choi

Main category: cs.CV

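TL;DR: This paper proposes COREA, the first unified framework that jointly learns relightable 3D Gaussians and a signed distance field (SDF). Its bidirectional 3D-to-3D alignment strategy learns geometry directly in 3D space and markedly improves novel-view synthesis, mesh reconstruction, and physically-based rendering.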

Details Motivation: Existing 3D Gaussian splatting methods learn geometry from 2D renderings, which leads to coarse surfaces and unreliable BRDF-lighting decomposition, so a more accurate framework for joint geometry reconstruction and relighting is needed. Method: Proposes a coarse-to-fine bidirectional 3D-to-3D alignment strategy that uses depth for coarse alignment and depth gradients and normals to refine fine-scale structure, and introduces a density-control mechanism that stabilizes Gaussian growth and balances geometric fidelity with memory efficiency. Result: Experiments on standard benchmarks show that COREA outperforms existing methods in novel-view synthesis, mesh reconstruction, and physically-based rendering. Conclusion: By jointly optimizing 3D Gaussians and an SDF within a unified framework, COREA achieves accurate geometry reconstruction and high-quality relighting, advancing 3D scene representation and rendering. Abstract: We present COREA, the first unified framework that jointly learns relightable 3D Gaussians and a Signed Distance Field (SDF) for accurate geometry reconstruction and faithful relighting. While recent 3D Gaussian Splatting (3DGS) methods have extended toward mesh reconstruction and physically-based rendering (PBR), their geometry is still learned from 2D renderings, leading to coarse surfaces and unreliable BRDF-lighting decomposition. To address these limitations, COREA introduces a coarse-to-fine bidirectional 3D-to-3D alignment strategy that allows geometric signals to be learned directly in 3D space. Within this strategy, depth provides coarse alignment between the two representations, while depth gradients and normals refine fine-scale structure, and the resulting geometry supports stable BRDF-lighting decomposition. A density-control mechanism further stabilizes Gaussian growth, balancing geometric fidelity with memory efficiency. Experiments on standard benchmarks demonstrate that COREA achieves superior performance in novel-view synthesis, mesh reconstruction, and PBR within a unified framework.

[202] MSN: Multi-directional Similarity Network for Hand-crafted and Deep-synthesized Copy-Move Forgery Detection

Liangwei Jiang,Jinluo Xie,Yecheng Huang,Hua Zhang,Hongyu Yang,Di Huang

Main category: cs.CV

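TL;DR: This paper proposes the Multi-directional Similarity Network (MSN), a two-stream model for accurate and efficient copy-move image forgery detection. By improving feature representation and localization, it achieves state-of-the-art performance on multiple benchmark datasets.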

Details Motivation: Existing deep detection models are limited in representation and localization and struggle with complex transformations and finely tuned tampering operations, so a more effective detection method is needed. Method: Proposes a multi-directional CNN for hierarchical encoding and a decoder based on a 2-D similarity matrix that exploits spatial information to improve detection; also builds a new forgery database generated by various deep neural networks as a new benchmark. Result: Extensive experiments on CASIA CMFD, CoMoFoD, and the newly built dataset yield state-of-the-art detection results, validating the method. Conclusion: The proposed MSN model outperforms existing methods in representation and localization, detects copy-move forgeries more accurately and efficiently, and performs well on deep-synthesized forgeries in particular. Abstract: Copy-move image forgery aims to duplicate certain objects or to hide specific contents with copy-move operations, which can be achieved by a sequence of manual manipulations as well as up-to-date deep generative network-based swapping. Its detection is becoming increasingly challenging because of the complex transformations and fine-tuned operations on the tampered regions. In this paper, we propose a novel two-stream model, namely Multi-directional Similarity Network (MSN), for accurate and efficient copy-move forgery detection. It addresses the two major limitations of existing deep detection models in representation and localization, respectively. In representation, an image is hierarchically encoded by a multi-directional CNN, and owing to the diverse augmentation in scales and rotations, the resulting features better measure the similarity between sampled patches in the two streams. In localization, we design a 2-D similarity matrix based decoder, and compared with the current 1-D similarity vector based one, it makes full use of spatial information in the entire image, leading to improved detection of tampered regions. Beyond the method, a new forgery database generated by various deep neural networks is presented as a new benchmark for detecting the growing body of deep-synthesized copy-move forgeries. Extensive experiments are conducted on two classic image forensics benchmarks, i.e., CASIA CMFD and CoMoFoD, and the newly presented one. State-of-the-art results are reported, which demonstrate the effectiveness of the proposed approach.
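A small sketch may help picture what a patch-level 2-D similarity matrix contains. It computes cosine similarity between every pair of patch feature vectors from a single image and lists highly similar off-diagonal pairs as copy-move candidates. This is an assumption-laden simplification: MSN compares patches across two streams and feeds the matrix to a learned decoder, neither of which is modeled here, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(features):
    """Full N x N matrix of pairwise patch similarities."""
    return [[cosine(f, g) for g in features] for f in features]

def matched_pairs(sim, threshold):
    """Distinct patch pairs (i < j) whose similarity reaches the threshold."""
    n = len(sim)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i][j] >= threshold]
```

Keeping the full matrix, rather than only each patch's best match, preserves where the matches sit relative to one another, which is the spatial information the abstract credits for better localization.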

[203] Training-free Clothing Region of Interest Self-correction for Virtual Try-On

Shengjie Lu,Zhibin Wan,Jiejie Liu,Quan Zhang,Mingjie Sun

Main category: cs.CV

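TL;DR: This paper proposes a virtual try-on method that constrains attention maps with an energy function, improving the consistency between generated clothing details and the target garment. It also designs a new evaluation metric, VTID, and outperforms existing SOTA methods overall.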

Details Motivation: Existing virtual try-on methods produce results that are inconsistent with the target garment in texture, pattern, and boundaries, and existing evaluation metrics ignore how well the generated result aligns with the target garment. Method: Introduces an energy function that constrains the attention maps during generation so that each step focuses more on the clothing region, and proposes a new metric, VTID, to measure alignment between the generated result and the target garment. Result: On VITON-HD and DressCode, the method surpasses existing SOTA methods by 1.4%, 2.3%, 12.3%, and 5.8% on LPIPS, FID, KID, and VTID respectively; on the downstream CC-Reid task, Rank-1 accuracy improves by 2.5%, 1.1%, and 1.6% on LTCC, PRCC, and VC-Clothes. Conclusion: The energy-constrained attention mechanism effectively improves detail consistency in virtual try-on, and the new VTID metric fills a gap in conventional evaluation, giving better overall performance. Abstract: VTON (Virtual Try-ON) aims at synthesizing the target clothing on a certain person, preserving the details of the target clothing while keeping the rest of the person unchanged. Existing methods suffer from the discrepancies between the generated clothing results and the target ones, in terms of the patterns, textures and boundaries. Therefore, we propose to use an energy function to impose constraints on the attention map extracted through the generation process. Thus, at each generation step, the attention can be more focused on the clothing region of interest, thereby influencing the generation results to be more consistent with the target clothing details. Furthermore, to address the limitation that existing evaluation metrics concentrate solely on image realism and overlook the alignment with target elements, we design a new metric, Virtual Try-on Inception Distance (VTID), to bridge this gap and ensure a more comprehensive assessment. On the VITON-HD and DressCode datasets, our approach has outperformed the previous state-of-the-art (SOTA) methods by 1.4%, 2.3%, 12.3%, and 5.8% in the traditional metrics of LPIPS, FID, KID, and the new VTID metric, respectively. Additionally, by applying the generated data to downstream Clothing-Change Re-identification (CC-Reid) methods, we have achieved performance improvements of 2.5%, 1.1%, and 1.6% on the LTCC, PRCC, and VC-Clothes datasets in terms of Rank-1. The code of our method is public at https://github.com/MrWhiteSmall/CSC-VTON.git.
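One way to picture an energy function on an attention map is sketched below. The form used, the squared fraction of attention mass falling outside a region-of-interest mask, is an assumption borrowed from common attention-guidance practice; the abstract does not give the paper's actual energy, and the function name is hypothetical.

```python
def roi_attention_energy(attention, roi_mask):
    """Energy in [0, 1]: 0 when all attention mass lies inside the ROI mask.

    `attention` is a 2D list of non-negative weights; `roi_mask` is a 2D list
    of 0/1 values with the same shape marking the clothing region.
    """
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 1.0  # no attention anywhere counts as maximally unfocused
    inside = sum(a * m
                 for att_row, mask_row in zip(attention, roi_mask)
                 for a, m in zip(att_row, mask_row))
    return (1.0 - inside / total) ** 2
```

In a training-free setting, a quantity like this would typically be differentiated with respect to the latent at each denoising step so the update nudges attention toward the masked region.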

[204] MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP

Chau Truong,Hieu Ta Quang,Dung D. Le

Main category: cs.CV

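TL;DR: This paper proposes MulCLIP, a new end-to-end multi-level alignment framework that improves the fine-grained understanding of vision-language models given long text descriptions. It needs no region-proposal information and improves performance on multiple benchmarks.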

Details Motivation: Existing vision-language models such as CLIP perform well on short captions but poorly on long, detailed descriptions, and existing remedies depend on region-proposal information that raises deployment cost. Method: MulCLIP preserves global contrastive alignment between images and both summary and long captions while extending positional embeddings to support longer text. It adds two new strategies for fine-grained semantic matching: token reconstruction alignment over locally calibrated features, and subcaption-aggregated patch alignment. Result: Experiments show that MulCLIP outperforms existing methods on multiple downstream tasks, and ablation studies confirm the effectiveness of its multi-scale alignment. Conclusion: Through multi-level alignment, MulCLIP improves fine-grained understanding of long text and image components without an extra region-proposal module, making it better suited to practical deployment. Abstract: Vision-language models like CLIP show impressive ability to align images and text, but their training on short, concise captions makes them struggle with lengthy, detailed descriptions. Recent advances mitigate this challenge by leveraging region-proposal information to map visual regions to corresponding sentences from lengthy captions, yet incur notable deployment costs. We introduce MulCLIP, a novel end-to-end multi-level alignment framework that bridges natural long-text structures with image components. MulCLIP first preserves global contrastive alignment between images and both summary and long captions, while extending positional embeddings for longer text sequences. To further enhance fine-grained understanding, we propose two novel strategies: (1) a token reconstruction alignment over locally calibrated features to strengthen semantic connections between words and image patches, and (2) a subcaption-aggregated patch alignment that automatically extracts and aggregates context-rich patches for each subcaption. Experimental results across diverse benchmarks demonstrate our method consistently improves downstream performance, while ablation studies confirm its multi-scale alignment is the key factor driving better fine-grained capability than region-proposal-assisted approaches, making it particularly suitable for diverse real-world applications.

[205] TrajMoE: Scene-Adaptive Trajectory Planning with Mixture of Experts and Reinforcement Learning

Zebin Xing,Pengxuan Yang,Linbo Wang,Yichen Zhang,Yiming Hu,Yupeng Zheng,Junli Wang,Yinfeng Gao,Guang Li,Kun Ma,Long Chen,Zhongpu Xia,Qichao Zhang,Hangjun Ye,Dongbin Zhao

Main category: cs.CV

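TL;DR: This paper proposes an improved autonomous driving planner that uses a Mixture of Experts (MoE) to tailor trajectory priors to different driving scenarios and reinforcement learning to optimize the trajectory scoring mechanism. Combined with multiple perception backbones, it takes third place on the navsim ICCV benchmark.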

Details Motivation: Existing end-to-end autonomous driving systems do not fully account for how trajectory priors should differ across driving scenarios, and their trajectory evaluation is limited by one-stage supervised training without policy-driven refinement. Method: Uses MoE (Mixture of Experts) to dynamically select suitable trajectory priors for each driving scenario, fine-tunes the trajectory scoring mechanism with reinforcement learning for policy-driven refinement, and integrates multiple perception backbones to strengthen feature representation. Result: The integrated model scores 51.08 on the navsim ICCV benchmark, ranking third. Conclusion: Scene-adaptive trajectory priors and a reinforcement-learning-driven scoring mechanism effectively improve the performance of autonomous driving planning systems. Abstract: Current autonomous driving systems often favor end-to-end frameworks, which take sensor inputs like images and learn to map them into trajectory space via neural networks. Previous work has demonstrated that models can achieve better planning performance when provided with a prior distribution of possible trajectories. However, these approaches often overlook two critical aspects: 1) The appropriate trajectory prior can vary significantly across different driving scenarios. 2) Their trajectory evaluation mechanism lacks policy-driven refinement, remaining constrained by the limitations of one-stage supervised training. To address these issues, we explore improvements in two key areas. For problem 1, we employ MoE to apply different trajectory priors tailored to different scenarios. For problem 2, we utilize Reinforcement Learning to fine-tune the trajectory scoring mechanism. Additionally, we integrate models with different perception backbones to enhance perceptual features. Our integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place.
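The scene-dependent prior selection described above can be sketched as a softmax gate that blends per-expert trajectory priors. This is a generic soft mixture-of-experts illustration under assumptions: the abstract does not say whether TrajMoE mixes priors softly or routes to a single expert, and the gate logits here are plain inputs rather than outputs of a learned scene encoder.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]

def mix_trajectory_priors(scene_logits, expert_priors):
    """Blend expert priors (each a list of (x, y) waypoints) by gate weights."""
    weights = softmax(scene_logits)
    horizon = len(expert_priors[0])
    mixed = []
    for t in range(horizon):
        x = sum(w * prior[t][0] for w, prior in zip(weights, expert_priors))
        y = sum(w * prior[t][1] for w, prior in zip(weights, expert_priors))
        mixed.append((x, y))
    return mixed
```

For example, a gate that strongly favors a hypothetical "straight road" expert would return nearly that expert's waypoints, while equal logits return the plain average of all priors.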

[206] A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning

Siyang Jiang,Mu Yuan,Xiang Ji,Bufang Yang,Zeyu Liu,Lilin Xu,Yang Li,Yuting He,Liran Dong,Wenrui Lu,Zhenyu Yan,Xiaofan Jiang,Wei Gao,Hongkai Chen,Guoliang Xing

Main category: cs.CV

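TL;DR: CUHK-X is a large-scale multimodal dataset and benchmark suite for human action recognition (HAR), understanding (HAU), and reasoning (HARn). It contains 58,445 samples, proposes a prompt-based scene creation method to improve caption consistency, and supports evaluation on multiple tasks.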

Details Motivation: Existing LLMs and multimodal datasets lack fine-grained text annotations for non-RGB modalities such as depth, IMU, and mmWave, which makes action understanding and reasoning hard to support; richer, logically consistent data-caption pairs are needed to advance HAR, HAU, and HARn. Method: Builds the CUHK-X dataset with 40 actions, 30 participants, and two indoor environments; uses an LLM-based, prompt-driven scene creation method to generate coherent captions for activity sequences, followed by human validation to ensure logical and spatiotemporal consistency; designs three benchmarks with six evaluation tasks in total. Result: CUHK-X contains 58,445 samples, and experiments report average accuracies of 76.52% for HAR, 40.76% for HAU, and 70.25% for HARn; the dataset substantially improves fine-grained description for multimodal action understanding and reasoning. Conclusion: CUHK-X provides a high-quality, large-scale data resource and unified benchmark for multimodal human activity analysis, advancing the field from action recognition toward understanding and reasoning and supporting research on data-intensive learning methods. Abstract: Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data label (discrete category) and (2) data caption (textual description). Naively generating captions from labels often lacks logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that leverages LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis.
Project page and code: https://openaiotlab.github.io/CUHK-X/ and https://github.com/openaiotlab/CUHK-X.

[207] CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics

Dahyeon Kye,Jeahun Sung,MinKyu Jeon,Jihyong Oh

Main category: cs.CV

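TL;DR: This paper proposes CHIMERA, a zero-shot diffusion-based image morphing framework that achieves smooth, semantically consistent transitions through cache injection and semantic anchor prompting, and designs a new metric, GLCS, to measure morphing quality.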

Details Motivation: When morphing between images with large semantic differences, existing methods often produce unnatural transitions and over-saturated colors because they lack effective structural and semantic alignment. Method: Formulates morphing as a cache-guided denoising process; proposes Adaptive Cache Injection (ACI), which caches features from both inputs during DDIM inversion and adaptively re-injects them during denoising, together with Semantic Anchor Prompting (SAP), which uses a vision-language model to provide semantically consistent guidance. Result: Experiments show that CHIMERA produces smoother, more semantically coherent transitions across a range of complex scenes and outperforms existing methods in user studies and quantitative metrics. Conclusion: Through depth- and time-adaptive feature fusion and semantic anchoring, CHIMERA markedly improves diffusion models on image morphing and advances zero-shot morphing. Abstract: Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in depth- and time-adaptive manners as well as natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.

[208] MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation

Muyu Xu,Fangneng Zhan,Xiaoqin Zhang,Ling Shao,Shijian Lu

Main category: cs.CV

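TL;DR: This paper proposes MuSASplat, an efficient framework for sparse-view 3D Gaussian splatting. A lightweight multi-scale adapter and a feature fusion aggregator greatly reduce trainable parameters and compute while preserving high-quality novel view synthesis.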

Details Motivation: Existing pose-free feed-forward 3D Gaussian splatting methods fully fine-tune large ViT backbones, which is computationally expensive and limits adoption, so a more efficient training approach that lowers GPU cost and parameter count is needed. Method: Proposes the MuSASplat framework, comprising a lightweight multi-scale adapter that fine-tunes only a small part of the ViT architecture and a feature fusion aggregator that efficiently integrates multi-view features, avoiding memory banks and reducing memory use and computational complexity. Result: Experiments on multiple datasets show that MuSASplat reaches state-of-the-art rendering quality while substantially reducing model parameters and training resource requirements. Conclusion: With its efficient adapter design and feature fusion strategy, MuSASplat balances high performance with low computational cost and suits fast 3D scene reconstruction and novel view synthesis from sparse inputs. Abstract: Sparse-view 3D Gaussian splatting seeks to render high-quality novel views of 3D scenes from a limited set of input images. While recent pose-free feed-forward methods leveraging pre-trained 3D priors have achieved impressive results, most of them rely on full fine-tuning of large Vision Transformer (ViT) backbones and incur substantial GPU costs. In this work, we introduce MuSASplat, a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splatting models with little compromise in rendering quality. Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of training parameters. This design avoids the prohibitive GPU overhead associated with previous full-model adaptation techniques while maintaining high fidelity in novel view synthesis, even with very sparse input views. In addition, we introduce a Feature Fusion Aggregator that integrates features across input views effectively and efficiently. Unlike widely adopted memory banks, the Feature Fusion Aggregator ensures consistent geometric integration across input views while significantly reducing memory usage, training complexity, and computational cost. Extensive experiments across diverse datasets show that MuSASplat achieves state-of-the-art rendering quality while significantly reducing parameters and training resource requirements compared with existing methods.

[209] When Privacy Meets Recovery: The Overlooked Half of Surrogate-Driven Privacy Preservation for MLLM Editing

Siyuan Xu,Yibing Liu,Peilin Chen,Yung-Hui Li,Shiqi Wang,Sam Kwong

Main category: cs.CV

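TL;DR: This paper proposes a new approach to evaluating and recovering protected user privacy in multimodal large language models (MLLMs). It introduces the SPPE dataset and a unified recovery method based on generation guided by multimodal signals, achieving a good balance between privacy protection and model usability.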

Details Motivation: Existing studies are effective at obscuring private information in MLLMs but overlook evaluating the authenticity and recovery quality of user privacy; this paper addresses that gap. Method: Builds the SPPE dataset, covering diverse privacy categories and user instructions, formulates privacy recovery as a generation task guided by multimodal signals, and proposes a unified framework that reconstructs private content while preserving editing fidelity. Result: Experiments show that the method generalizes well on both SPPE and InstructPix2Pix, achieving high-quality privacy recovery across different visual content and editing tasks. Conclusion: The proposed method fills the research gap in privacy recovery evaluation and preserves MLLM usability while protecting privacy. Abstract: Privacy leakage in Multimodal Large Language Models (MLLMs) has long been an intractable problem. Existing studies, though effective at obscuring private information in MLLMs, often overlook the evaluation of the authenticity and recovery quality of user privacy. To this end, this work uniquely focuses on the critical challenge of how to restore surrogate-driven protected data in diverse MLLM scenarios. We first bridge this research gap by contributing the SPPE (Surrogate Privacy Protected Editable) dataset, which includes a wide range of privacy categories and user instructions to simulate real MLLM applications. This dataset offers protected surrogates alongside their various MLLM-edited versions, thus enabling the direct assessment of privacy recovery quality. By formulating privacy recovery as a guided generation task conditioned on complementary multimodal signals, we further introduce a unified approach that reliably reconstructs private content while preserving the fidelity of MLLM-generated edits. The experiments on both SPPE and InstructPix2Pix further show that our approach generalizes well across diverse visual content and editing tasks, achieving a strong balance between privacy protection and MLLM usability.

[210] Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

Jiayang Li,Chengjie Jiang,Junjun Jiang,Pengwei Liang,Jiayi Ma,Liqiang Nie

Main category: cs.CV

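TL;DR: This paper proposes DiTFuse, an instruction-driven Diffusion-Transformer framework for end-to-end, semantics-aware image fusion. It jointly encodes multimodal inputs and natural-language instructions, is trained without ground-truth labels via multi-degradation masked-image modeling, and handles multiple fusion tasks in one model with fine-grained user control.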

Details Motivation: Existing image fusion methods fall short in robustness, adaptability, and controllability and cannot flexibly incorporate user intent; the lack of ground-truth annotations and large-scale datasets further limits semantic understanding and fine-grained alignment. Method: Proposes the DiTFuse framework, which jointly encodes two images and natural-language instructions in a shared latent space, trains without ground-truth supervision using a multi-degradation masked-image modeling strategy, and, together with a self-built multi-granularity instruction dataset, learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection. Result: Achieves better quantitative and qualitative results than existing methods on public IVIF, MFF, and MEF benchmarks, produces sharper textures and better semantic retention, and supports text-controlled fusion refinement, downstream tasks such as segmentation, and zero-shot generalization. Conclusion: With a unified instruction-driven architecture, DiTFuse delivers high-performance, highly controllable, and well-generalizing image fusion, offering a new paradigm for multimodal fusion. Abstract: Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and lack the ability to flexibly incorporate user intent, especially in complex scenarios involving low-light degradation, color shifts, or exposure imbalance. Moreover, the absence of ground-truth fused images and the small scale of existing datasets make it difficult to train an end-to-end model that simultaneously understands high-level semantics and performs fine-grained multimodal alignment. We therefore present DiTFuse, an instruction-driven Diffusion-Transformer (DiT) framework that performs end-to-end, semantics-aware fusion within a single model. By jointly encoding two images and natural-language instructions in a shared latent space, DiTFuse enables hierarchical and fine-grained control over fusion dynamics, overcoming the limitations of pre-fusion and post-fusion pipelines that struggle to inject high-level semantics. The training phase employs a multi-degradation masked-image modeling strategy, so the network jointly learns cross-modal alignment, modality-invariant restoration, and task-aware feature selection without relying on ground truth images. A curated, multi-granularity instruction dataset further equips the model with interactive fusion capabilities. DiTFuse unifies infrared-visible, multi-focus, and multi-exposure fusion, as well as text-controlled refinement and downstream tasks, within a single architecture.
Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention. The model also supports multi-level user control and zero-shot generalization to other multi-image fusion scenarios, including instruction-conditioned segmentation.

[211] TIDE: Two-Stage Inverse Degradation Estimation with Guided Prior Disentanglement for Underwater Image Restoration

Shravan Venkatraman,Rakesh Raj Madavan,Pavan Kumar S,Muthu Subash Kavitha

Main category: cs.CV

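TL;DR: This paper proposes TIDE, a two-stage underwater image restoration framework that explicitly models degradation characteristics and applies targeted restoration through specialized prior decomposition, performing especially well in color correction and contrast enhancement.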

Details Motivation: Existing underwater image restoration methods usually apply one uniform strategy to the whole image and cope poorly with spatially varying, co-occurring degradations, so a method that adapts to complex degradation patterns is needed. Method: TIDE decomposes degradation into four key factors (color distortion, haze, detail loss, and noise) and designs a dedicated restoration expert for each; it generates multiple specialized restoration hypotheses, fuses them adaptively according to local degradation patterns, and then corrects residual artifacts in a progressive refinement stage. Result: Experiments on standard benchmarks and turbid water conditions show that TIDE is competitive on reference-based fidelity metrics and outperforms existing methods on no-reference perceptual quality metrics, with notable gains in color correction and contrast enhancement. Conclusion: By disentangling degradation factors and using staged adaptive fusion and refinement, TIDE effectively handles complex, spatially varying underwater degradations and produces more natural, higher-quality restorations. Abstract: Underwater image restoration is essential for marine applications ranging from ecological monitoring to archaeological surveys, but effectively addressing the complex and spatially varying nature of underwater degradations remains a challenge. Existing methods typically apply uniform restoration strategies across the entire image, struggling to handle multiple co-occurring degradations that vary spatially and with water conditions. We introduce TIDE, a $\underline{t}$wo stage $\underline{i}$nverse $\underline{d}$egradation $\underline{e}$stimation framework that explicitly models degradation characteristics and applies targeted restoration through specialized prior decomposition. Our approach disentangles the restoration process into multiple specialized hypotheses that are adaptively fused based on local degradation patterns, followed by a progressive refinement stage that corrects residual artifacts. Specifically, TIDE decomposes underwater degradations into four key factors, namely color distortion, haze, detail loss, and noise, and designs restoration experts specialized for each. By generating specialized restoration hypotheses, TIDE balances competing degradation factors and produces natural results even in highly degraded regions. Extensive experiments across both standard benchmarks and challenging turbid water conditions show that TIDE achieves competitive performance on reference-based fidelity metrics while outperforming state-of-the-art methods on no-reference perceptual quality metrics, with strong improvements in color correction and contrast enhancement.
Our code is available at: https://rakesh-123-cryp.github.io/TIDE.

[212] START: Spatial and Textual Learning for Chart Understanding

Zhuoming Liu,Xiaofeng Gao,Feiyang Niu,Qiaozi Gao,Liu Liu,Robinson Piramuthu

Main category: cs.CV

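TL;DR: This paper proposes START, a new method that improves chart understanding in multimodal large language models by combining spatial and textual learning, and builds a new dataset and benchmark, CS-Bench, for evaluation.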

Details Motivation: Chart understanding requires grasping both the visual layout (spatial property) and the underlying data (textual property), and existing methods fall short on both, especially on spatial structure. Method: Proposes START, comprising chart-element grounding and chart-to-code generation; trains on the newly built START-Dataset, which uses an MLLM to translate real charts into executable code and an LLM to evolve that code while preserving visual structure; and establishes the CS-Bench benchmark to evaluate spatial understanding. Result: START outperforms the base models across model sizes and benchmarks and clearly surpasses the previous state of the art. Conclusion: Combining spatial and textual learning effectively improves fine-grained chart understanding in multimodal models, and the proposed dataset and benchmark fill a gap in the field. Abstract: Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property); grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.

[213] Integrating Multi-scale and Multi-filtration Topological Features for Medical Image Classification

Pengfei Gu,Huimin Li,Haoteng Tang,Dongkuan Xu,Erik Enriquez,DongChul Kim,Bin Fu,Danny Z. Chen

Main category: cs.CV

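TL;DR: Proposes a new medical image classification framework based on multi-scale and multi-filtration persistent topological features. By integrating cubical persistence diagrams with a cross-attention network, it improves the model's ability to recognize complex anatomical structures.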

Details Motivation: Existing deep neural networks for medical image classification either ignore the topological features of fundamental anatomical structures or capture only simple topological features via single-parameter persistence, so multi-level topological information goes unused. Method: First computes cubical persistence diagrams (PDs) at multiple resolutions, then designs a "vineyard" algorithm that consolidates these PDs into a single stable diagram, and finally processes that diagram with a cross-attention network and fuses the result with CNN or Transformer features. Result: Evaluations on three public datasets show that the method clearly outperforms strong baselines and existing state-of-the-art methods, with consistent gains in classification performance. Conclusion: Introducing multi-scale and multi-filtration topological features strengthens the model's understanding of complex anatomical structures and confirms the effectiveness and interpretability of a comprehensive topological perspective in medical image classification. Abstract: Modern deep neural networks have shown remarkable performance in medical image classification. However, such networks either emphasize pixel-intensity features instead of fundamental anatomical structures (e.g., those encoded by topological invariants), or they capture only simple topological features via single-parameter persistence. In this paper, we propose a new topology-guided classification framework that extracts multi-scale and multi-filtration persistent topological features and integrates them into vision classification backbones. For an input image, we first compute cubical persistence diagrams (PDs) across multiple image resolutions/scales. We then develop a "vineyard" algorithm that consolidates these PDs into a single, stable diagram capturing signatures at varying granularities, from global anatomy to subtle local irregularities that may indicate early-stage disease. To further exploit richer topological representations produced by multiple filtrations, we design a cross-attention-based neural network that directly processes the consolidated final PDs. The resulting topological embeddings are fused with feature maps from CNNs or Transformers. By integrating multi-scale and multi-filtration topologies into an end-to-end architecture, our approach enhances the model's capacity to recognize complex anatomical structures. Evaluations on three public datasets show consistent, considerable improvements over strong baselines and state-of-the-art methods, demonstrating the value of our comprehensive topological perspective for robust and interpretable medical image classification.

[214] RefLSM: Linearized Structural-Prior Reflectance Model for Medical Image Segmentation and Bias-Field Correction

Wenqi Zhao,Jiacheng Sang,Fenghua Cheng,Yonglu Shu,Dong Li,Xiaofeng Yang

Main category: cs.CV

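TL;DR: Proposes RefLSM, a reflectance-based variational level set model that performs medical image segmentation through Retinex-inspired reflectance decomposition and copes effectively with intensity inhomogeneity, noise, and blurred boundaries.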

Details Motivation: Traditional level set methods depend on approximate bias field estimates, perform poorly under non-uniform imaging conditions, and struggle with the intensity inhomogeneity, noise, blurred boundaries, and irregular structures found in medical images. Method: Brings Retinex theory into the level set framework by decomposing the image into reflectance and bias field components and segmenting the illumination-invariant reflectance directly; introduces a linear structural prior to guide gradients, and uses a relaxed binary level set with convex relaxation and sign projection to avoid reinitialization; solves the problem with ADMM. Result: Experiments on multiple medical image datasets show that RefLSM outperforms existing state-of-the-art level set methods in segmentation accuracy, robustness, and computational efficiency. Conclusion: By explicitly modeling the reflectance component and adding a structural prior and a stable evolution mechanism, RefLSM markedly improves segmentation of complex medical images. Abstract: Medical image segmentation remains challenging due to intensity inhomogeneity, noise, blurred boundaries, and irregular structures. Traditional level set methods, while effective in certain cases, often depend on approximate bias field estimations and therefore struggle under severe non-uniform imaging conditions. To address these limitations, we propose a novel variational Reflectance-based Level Set Model (RefLSM), which explicitly integrates Retinex-inspired reflectance decomposition into the segmentation framework. By decomposing the observed image into reflectance and bias field components, RefLSM directly segments the reflectance, which is invariant to illumination and preserves fine structural details. Building on this foundation, we introduce two key innovations for enhanced precision and robustness. First, a linear structural prior steers the smoothed reflectance gradients toward a data-driven reference, providing reliable geometric guidance in noisy or low-contrast scenes. Second, a relaxed binary level-set is embedded in RefLSM and enforced via convex relaxation and sign projection, yielding stable evolution and avoiding reinitialization-induced diffusion. The resulting variational problem is solved efficiently using an ADMM-based optimization scheme. Extensive experiments on multiple medical imaging datasets demonstrate that RefLSM achieves superior segmentation accuracy, robustness, and computational efficiency compared to state-of-the-art level set methods.
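For orientation, the decomposition mentioned in this abstract is usually written with the standard multiplicative Retinex model shown below. The symbols $I$ (observed image), $R$ (reflectance), and $B$ (bias field) are notation chosen here for illustration; the abstract does not state the paper's exact formulation or energy functional.

```latex
I(x) = R(x)\,B(x)
\quad\Longleftrightarrow\quad
\log I(x) = \log R(x) + \log B(x)
```

Under this model the slowly varying bias field $B$ absorbs the intensity inhomogeneity, which is why segmenting $R$ rather than $I$ can be less sensitive to non-uniform imaging conditions.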

[215] HVQ-CGIC: Enabling Hyperprior Entropy Modeling for VQ-Based Controllable Generative Image Compression

Niu Yi,Xu Tianyi,Ma Mingming,Wang Xinkun

Main category: cs.CV

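TL;DR: Proposes HVQ-CGIC, a controllable generative image compression framework based on a VQ hyperprior and the first to bring rate-distortion balance and control to vector-quantization-based generative compression; on the Kodak dataset it saves 61.3% of bits on average compared with existing state-of-the-art methods.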

Details Motivation: Existing VQ-based generative image compression methods estimate entropy with a static, global probability distribution that lacks content adaptivity, leaving bitrate underused and making flexible rate control difficult. Method: Proposes the HVQ-CGIC framework, which introduces a hyperprior model for the VQ indices and establishes its mathematical foundation; together with a novel loss design, this enables rate-distortion balance and control in generative compression, and a lightweight hyperprior estimation network improves performance. Result: On the Kodak dataset, it uses 61.3% fewer bits on average at the same perceptual quality as SOTA methods such as Control-GIC, CDC, and HiFiC. Conclusion: HVQ-CGIC is the first to bring rate-distortion control to VQ-based generative image compression, markedly improves rate-distortion performance, and could become a foundational component for VQGAN-style compression methods. Abstract: Generative learned image compression methods using Vector Quantization (VQ) have recently shown impressive potential in balancing distortion and perceptual quality. However, these methods typically estimate the entropy of VQ indices using a static, global probability distribution, which fails to adapt to the specific content of each image. This non-adaptive approach leads to untapped bitrate potential and challenges in achieving flexible rate control. To address this challenge, we introduce a Controllable Generative Image Compression framework based on a VQ Hyperprior, termed HVQ-CGIC. HVQ-CGIC rigorously derives the mathematical foundation for introducing a hyperprior to the VQ indices entropy model. Building on this foundation and a novel loss design, this framework is, to our knowledge, the first to introduce rate-distortion (RD) balance and control into vector quantization-based Generative Image Compression. Combined with a lightweight hyper-prior estimation network, HVQ-CGIC achieves a significant advantage in RD performance compared to current state-of-the-art (SOTA) generative compression methods. On the Kodak dataset, we achieve the same LPIPS as Control-GIC, CDC and HiFiC with an average of 61.3% fewer bits. We posit that HVQ-CGIC has the potential to become a foundational component for VQGAN-based image compression, analogous to the integral role of the HyperPrior framework in neural image compression.

[216] SUCCESS-GS: Survey of Compactness and Compression for Efficient Static and Dynamic Gaussian Splatting

Seokhyun Youn,Soohyun Lee,Geonho Kim,Weeyoung Kwon,Sung-Ho Bae,Jihyong Oh

Main category: cs.CV

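TL;DR: This paper presents the first systematic survey of efficient 3D and 4D Gaussian Splatting, proposing a taxonomy with two major categories, Parameter Compression and Restructuring Compression; it covers datasets, evaluation metrics, and benchmark comparisons, and discusses current limitations and future research directions.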

Details Motivation: 3D Gaussian Splatting enables high-quality 3D reconstruction and novel view synthesis, but its heavy storage and computation costs limit practical use, especially in dynamic 4D scenes, so efficient compression and optimization methods are urgently needed. Method: Systematically categorizes existing efficient Gaussian Splatting methods into two directions, Parameter Compression and Restructuring Compression, summarizes the core ideas and methodological trends in each, and compiles widely used datasets, evaluation metrics, and benchmark comparisons. Result: Establishes the first unified survey framework for efficient 3D and 4D Gaussian Splatting, summarizes the strengths and applicable scenarios of each technical route, and provides a comprehensive benchmark analysis and performance comparison. Conclusion: Parameter and structural compression are key paths toward scalable, compact, real-time Gaussian Splatting; future work should further explore optimizing the two jointly to promote wide adoption in static and dynamic scenes. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful explicit representation enabling real-time, high-fidelity 3D reconstruction and novel view synthesis. However, its practical use is hindered by the massive memory and computational demands required to store and render millions of Gaussians. These challenges become even more severe in 4D dynamic scenes. To address these issues, the field of Efficient Gaussian Splatting has rapidly evolved, proposing methods that reduce redundancy while preserving reconstruction quality. This survey provides the first unified overview of efficient 3D and 4D Gaussian Splatting techniques. For both 3D and 4D settings, we systematically categorize existing methods into two major directions, Parameter Compression and Restructuring Compression, and comprehensively summarize the core ideas and methodological trends within each category. We further cover widely used datasets, evaluation metrics, and representative benchmark comparisons. Finally, we discuss current limitations and outline promising research directions toward scalable, compact, and real-time Gaussian Splatting for both static and dynamic 3D scene representation.

[217] Understanding Diffusion Models via Code Execution

Cheng Yu

Main category: cs.CV

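TL;DR: This paper presents a concise implementation of about 300 lines that explains diffusion models from a code-execution perspective, helping researchers understand how they work in practice and how theory and code correspond.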

Details Motivation: Existing tutorials focus mainly on deriving equations and offer little guidance on how diffusion models actually run in code, and a gap remains between theory and open-source implementations. Method: Builds a minimal implementation that keeps the core components (forward diffusion, reverse sampling, the noise-prediction network, and the training loop) while removing unnecessary engineering details. Result: Provides a clear, implementation-first framework for understanding, and releases the code and pre-trained models. Conclusion: The implementation helps bridge the gap between diffusion model theory and practice and helps researchers better grasp how the models work. Abstract: Diffusion models have achieved remarkable performance in generative modeling, yet their theoretical foundations are often intricate, and the gap between mathematical formulations in papers and practical open-source implementations can be difficult to bridge. Existing tutorials primarily focus on deriving equations, offering limited guidance on how diffusion models actually operate in code. To address this, we present a concise implementation of approximately 300 lines that explains diffusion models from a code-execution perspective. Our minimal example preserves the essential components -- including forward diffusion, reverse sampling, the noise-prediction network, and the training loop -- while removing unnecessary engineering details. This technical report aims to provide researchers with a clear, implementation-first understanding of how diffusion models work in practice and how code and theory correspond. Our code and pre-trained models are available at: https://github.com/disanda/GM/tree/main/DDPM-DDIM-ClassifierFree.
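
The forward diffusion component named in the abstract can be illustrated with a few lines of standard-library Python. This is a minimal sketch of the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, not the authors' code: the function names, the linear schedule endpoints (1e-4 to 0.02), and the 1000-step horizon are assumptions chosen for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed endpoints, common DDPM defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 using the closed-form forward process."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]           # toy "image" with three pixels
xt, noise = forward_diffuse(x0, 500, alpha_bars, rng)
```

A noise-prediction network would be trained to recover `noise` from `xt` and the step index; at late steps `alpha_bars` approaches zero, so `xt` is almost pure Gaussian noise.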

[218] MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

Xuhui Zheng,Kang An,Ziliang Wang,Yuhang Wang,Faqiang Qian,Yichao Wu

Main category: cs.CV

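TL;DR: This paper proposes MMRPT, a framework that brings reinforcement learning into multimodal pre-training and strengthens the visual reasoning of MLLMs through visual-dependency-aware masking and a semantic-visual reward.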

Details Motivation: Pre-training on image-caption pairs suffers from descriptive bias, leading models to rely on surface linguistic cues rather than genuine visual understanding. Method: Proposes the MMRPT framework, which introduces reinforcement learning into pre-training; it estimates sentence-level visual dependency via attention, masks highly dependent segments, and uses a semantic-visual reward to guide vision-grounded reconstruction. Result: Experiments show consistent gains on diverse zero-shot benchmarks and greater robustness under supervised fine-tuning. Conclusion: Reinforcement-driven masked multimodal reasoning is a more reliable and more generalizable pre-training objective for multimodal models. Abstract: Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.

[219] AutoLugano: A Deep Learning Framework for Fully Automated Lymphoma Segmentation and Lugano Staging on FDG-PET/CT

Boyang Pan,Zeyu Zhang,Hongyu Meng,Bin Cui,Yingying Zhang,Wenli Hou,Junhao Li,Langdi Zhong,Xiaoxiao Chen,Xiaoyu Xu,Changjin Zuo,Chao Cheng,Nan-Jie Gong

Main category: cs.CV

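TL;DR: This study proposes AutoLugano, a fully automated deep learning system that performs end-to-end lymphoma classification from baseline FDG-PET/CT scans, including lesion segmentation, anatomical localization, and Lugano staging.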

Details Motivation: Existing lymphoma staging relies on manual work that is time-consuming and subjective, so a fully automated, accurate, and reproducible system is needed to support clinical decisions. Method: AutoLugano consists of three modules: (1) anatomy-informed lesion segmentation based on a 3D nnU-Net; (2) atlas-based anatomical localization, which uses TotalSegmentator to map lesions to 21 predefined regions; and (3) automated Lugano staging, which determines the stage and therapeutic group (Limited vs. Advanced Stage) from the spatial distribution of involved regions. The system is trained on the autoPET dataset and validated on an independent cohort. Result: On the external validation set, regional involvement detection reaches 88.31% accuracy and an 80.80% F1-score; therapeutic stratification (Limited vs. Advanced Stage) reaches 85.07% accuracy, with sensitivity and specificity of 82.61% and 90.48%, respectively. Conclusion: AutoLugano is the first end-to-end automated system that generates a complete Lugano stage from a single FDG-PET/CT scan, with potential to assist initial staging, treatment stratification, and clinical decision-making. Abstract: Purpose: To develop a fully automated deep learning system, AutoLugano, for end-to-end lymphoma classification by performing lesion segmentation, anatomical localization, and automated Lugano staging from baseline FDG-PET/CT scans. Methods: The AutoLugano system processes baseline FDG-PET/CT scans through three sequential modules: (1) Anatomy-Informed Lesion Segmentation, in which a 3D nnU-Net model trained on multi-channel inputs performs automated lesion detection; (2) Atlas-based Anatomical Localization, which leverages the TotalSegmentator toolkit to map segmented lesions to 21 predefined lymph node regions using deterministic anatomical rules; and (3) Automated Lugano Staging, where the spatial distribution of involved regions is translated into Lugano stages and therapeutic groups (Limited vs. Advanced Stage). The system was trained on the public autoPET dataset (n=1,007) and externally validated on an independent cohort of 67 patients. Performance was assessed using accuracy, sensitivity, specificity, and F1-score for regional involvement detection and staging agreement. Results: On the external validation set, the proposed model demonstrated robust performance, achieving an overall accuracy of 88.31%, sensitivity of 74.47%, specificity of 94.21%, and an F1-score of 80.80% for regional involvement detection, outperforming baseline models. Most notably, for the critical clinical task of therapeutic stratification (Limited vs. Advanced Stage), the system achieved a high accuracy of 85.07%, with a specificity of 90.48% and a sensitivity of 82.61%. Conclusion: AutoLugano represents the first fully automated, end-to-end pipeline that translates a single baseline FDG-PET/CT scan into a complete Lugano stage. This study demonstrates its strong potential to assist in initial staging, treatment stratification, and supporting clinical decision-making.
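
The reported metrics follow the usual confusion-matrix definitions, which can be sketched as below. The abstract does not publish the confusion matrix: the counts used here (38 true positives, 8 false negatives, 19 true negatives, 2 false positives, with Advanced Stage as the positive class) are hypothetical values, picked only because they sum to the 67 validation patients and reproduce the reported stratification percentages.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard accuracy, sensitivity, specificity, and F1 from raw counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),      # recall on the positive class
        "specificity": tn / (tn + fp),      # recall on the negative class
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts (not from the paper), chosen to match the reported
# percentages for Limited vs. Advanced Stage stratification on 67 patients.
stratification = binary_metrics(tp=38, fn=8, tn=19, fp=2)
for name, value in stratification.items():
    print(f"{name}: {value * 100:.2f}%")
```

Other count combinations could round to the same percentages, so this is a consistency check on the definitions and not a reconstruction of the study's data.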

[220] Object Pose Distribution Estimation for Determining Revolution and Reflection Uncertainty in Point Clouds

Frederik Hagelskjær,Dimitrios Arapis,Steffen Madsen,Thorbjørn Mosekjær Iversen

Main category: cs.CV

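TL;DR: Proposes a neural network method that estimates object pose uncertainty using only colorless 3D data; it is the first deep learning approach that does not rely on RGB input.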

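Details Motivation: Existing pose distribution methods depend on color information, which is often unavailable in industrial settings; a single pose estimate cannot capture the uncertainty caused by visual ambiguity, which undermines the reliability of robot behavior. Method: Designs a deep learning framework that estimates pose distributions from pure 3D geometric data, specifically handling reflection and revolution symmetries, and that can be extended to full SE(3) pose distribution estimation. Result: The method is validated on a real-world bin picking task; it applies to objects with varying degrees of geometric ambiguity and expresses pose uncertainty accurately. Conclusion: The method is the first to achieve deep-learning-based pose uncertainty estimation from colorless 3D data alone, improving the robustness and reliability of robotic perception in industrial settings. Abstract: Object pose estimation is crucial to robotic perception and typically provides a single-pose estimate. However, a single estimate cannot capture pose uncertainty deriving from visual ambiguity, which can lead to unreliable behavior. Existing pose distribution methods rely heavily on color information, often unavailable in industrial settings. We propose a novel neural network-based method for estimating object pose uncertainty using only 3D colorless data. To the best of our knowledge, this is the first approach that leverages deep learning for pose distribution estimation without relying on RGB input. We validate our method in a real-world bin picking scenario with objects of varying geometric ambiguity. Our current implementation focuses on symmetries in reflection and revolution, but the framework is extendable to full SE(3) pose distribution estimation. Source code available at opde3d.github.io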

[221] VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation

Md Selim Sarowar,Sungho Kim

Main category: cs.CV

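TL;DR: This paper compares CLIP-based and DINOv2-based vision models on 6D object pose estimation in hand-object grasping scenarios, finding that CLIP is better at semantic consistency while DINOv2 has the edge in geometric precision, so the two are complementary.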

Details Motivation: Improving 3D pose estimation for robotic manipulation and grasping requires a clear understanding of how vision foundation models such as CLIP and DINOv2 differ in extracting semantic and geometric features. Method: Runs extensive experiments with CLIP-based and DINOv2-based methods on benchmark datasets, systematically evaluating them on 6D object pose estimation and analyzing their semantic understanding and geometric feature capabilities. Result: CLIP-based methods achieve better semantic consistency, while DINOv2-based methods are competitive and stronger in geometric precision. Conclusion: CLIP and DINOv2 each have strengths in pose estimation; the model should be chosen according to application needs, and the two can be combined to improve robotic grasping systems. Abstract: Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP-based methods achieve better semantic consistency, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation, grasping, and picking applications.

[222] Towards Robust Protective Perturbation against DeepFake Face Swapping

Hengyang Yao,Lin Li,Ke Sun,Jianing Qiu,Huiping Chen

Main category: cs.CV

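TL;DR: This paper proposes EOLT, a new framework that makes protective perturbations against DeepFake face swapping more robust by learning the transformation distribution, substantially improving resistance to transformations compared with conventional uniform sampling.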

Details Motivation: Existing defences based on invisible perturbations are fragile under basic transformations such as compression and resizing, and protection robustness is highly sensitive to the transformations used in training; standard EOT is limited because it samples transformations uniformly. Method: Proposes Expectation Over Learned distribution of Transformation (EOLT), which introduces a policy network that uses reinforcement learning to automatically prioritize critical transformations and generate instance-specific adaptive perturbations, modeling the transformation distribution as a learnable component. Result: Experiments show that EOLT performs strongly across 30 transformations in six categories, with 26% higher average robustness and gains of up to 30% on challenging transformation categories. Conclusion: By learning the transformation distribution, EOLT addresses the sensitivity of conventional defences to transformations and markedly strengthens robustness while maintaining broad transferability, offering a more reliable route for DeepFake defence. Abstract: DeepFake face swapping enables highly realistic identity forgeries, posing serious privacy and security risks. A common defence embeds invisible perturbations into images, but these are fragile and often destroyed by basic transformations such as compression or resizing. In this paper, we first conduct a systematic analysis of 30 transformations across six categories and show that protection robustness is highly sensitive to the choice of training transformations, making the standard Expectation over Transformation (EOT) with uniform sampling fundamentally suboptimal. Motivated by this, we propose Expectation Over Learned distribution of Transformation (EOLT), a framework that treats the transformation distribution as a learnable component rather than a fixed design choice. Specifically, EOLT employs a policy network that learns to automatically prioritize critical transformations and adaptively generate instance-specific perturbations via reinforcement learning, enabling explicit modeling of defensive bottlenecks while maintaining broad transferability. Extensive experiments demonstrate that our method achieves substantial improvements over state-of-the-art approaches, with 26% higher average robustness and up to 30% gains on challenging transformation categories.

[223] ReLKD: Inter-Class Relation Learning with Knowledge Distillation for Generalized Category Discovery

Fang Zhou,Zhiqiang Chen,Martin Pavlovski,Yizhong Zhang

Main category: cs.CV

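TL;DR: Proposes the ReLKD framework, which improves the classification of novel classes in generalized category discovery through implicit inter-class relations and knowledge distillation.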

Details Motivation: Existing GCD methods ignore inter-class relations and therefore struggle to exploit the structure of known and novel classes in unlabeled data. Method: Designs ReLKD, an end-to-end framework with a target-grained (fine-grained) module, a coarse-grained module, and a distillation module, using coarse-grained hierarchical relations to guide fine-grained representation learning. Result: Experiments on four datasets confirm the effectiveness of ReLKD, especially when labeled data is limited. Conclusion: By mining implicit inter-class relations and combining them with knowledge distillation, ReLKD markedly improves generalized category discovery. Abstract: Generalized Category Discovery (GCD) faces the challenge of categorizing unlabeled data containing both known and novel classes, given only labels for known classes. Previous studies often treat each class independently, neglecting the inherent inter-class relations. Obtaining such inter-class relations directly presents a significant challenge in real-world scenarios. To address this issue, we propose ReLKD, an end-to-end framework that effectively exploits implicit inter-class relations and leverages this knowledge to enhance the classification of novel classes. ReLKD comprises three key modules: a target-grained module for learning discriminative representations, a coarse-grained module for capturing hierarchical class relations, and a distillation module for transferring knowledge from the coarse-grained module to refine the target-grained module's representation learning. Extensive experiments on four datasets demonstrate the effectiveness of ReLKD, particularly in scenarios with limited labeled data. The code for ReLKD is available at https://github.com/ZhouF-ECNU/ReLKD.

[224] STRinGS: Selective Text Refinement in Gaussian Splatting

Abhinav Raundhal,Gaurav Behera,P J Narayanan,Ravi Kiran Sarvadevabhatla,Makarand Tapaswi

Main category: cs.CV

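TL;DR: Proposes STRinGS, a text-aware selective refinement framework for 3D Gaussian Splatting that markedly improves the readability and reconstruction quality of text in 3D scenes.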

Details Motivation: 3D Gaussian Splatting (3DGS) struggles to preserve text details during reconstruction, and small errors can cause semantic loss that harms scene understanding. Method: Handles text and non-text regions separately, refining text regions first and then merging them with non-text regions for full-scene optimization; introduces the OCR Character Error Rate (CER) to evaluate text readability. Result: At only 7K iterations, it achieves a 63.6% relative improvement over 3DGS; the authors also release STRinGS-360, a dataset with diverse text scenarios. Conclusion: STRinGS improves the sharpness and readability of text in 3D reconstruction and advances 3D scene understanding in text-rich environments. Abstract: Text in the form of signs, labels, or instructions is a critical element of real-world scenes because it can convey important contextual information. 3D representations such as 3D Gaussian Splatting (3DGS) struggle to preserve fine-grained text details, even while achieving high visual fidelity. Small errors in textual element reconstruction can lead to significant semantic loss. We propose STRinGS, a text-aware, selective refinement framework to address this issue for 3DGS reconstruction. Our method treats text and non-text regions separately, refining text regions first and merging them with non-text regions later for full-scene optimization. STRinGS produces sharp, readable text even in challenging configurations. We introduce a text readability measure, OCR Character Error Rate (CER), to evaluate the efficacy on text regions. STRinGS results in a 63.6% relative improvement over 3DGS at just 7K iterations. We also introduce a curated dataset STRinGS-360 with diverse text scenarios to evaluate text readability in 3D reconstruction. Our method and dataset together push the boundaries of 3D scene understanding in text-rich environments, paving the way for more robust text-aware reconstruction methods.
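
Character Error Rate is conventionally the edit distance between the recognized and reference strings divided by the reference length. The sketch below shows that standard definition in plain Python; how the paper runs OCR on rendered views and aggregates scores is not described in the abstract, so the example strings and function names are illustrative assumptions.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # delete from reference
                            curr[j - 1] + 1,      # insert into reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical OCR reading of a sign rendered from a reconstructed scene.
ground_truth = "EMERGENCY EXIT"
ocr_reading = "EMERGENCV EXT"
cer = character_error_rate(ground_truth, ocr_reading)
```

Lower is better: a perfectly legible reconstruction gives a CER of 0, while blurred glyphs that the OCR misreads or drops push the value up.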

[225] Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language Models

Biao Chen,Lin Zuo,Mengmeng Jing,Kunbin He,Yuchen Wang

Main category: cs.CV

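TL;DR: Proposes Dropout Prompt Learning, which applies importance-based dropout to tokens in the textual and visual branches of vision-language models and adds residual entropy regularization, improving robustness in low-shot, long-tail, and out-of-distribution scenarios.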

Details Motivation: To improve the generalization and robustness of vision-language models in complex scenarios, the work takes inspiration from conventional dropout and explores selective token dropout in prompt learning. Method: Applies dropout at the token level in the textual and visual branches, scores token importance from intra-modal context and inter-modal alignment, and adjusts each token's dropout probability dynamically; adds residual entropy regularization to encourage diverse representations while keeping semantic alignment. Result: The method is validated on 15 benchmarks and performs well on low-shot learning, long-tail classification, and out-of-distribution generalization; on base-to-novel generalization it surpasses KgCoOp by 5.10% and PromptSRC by 2.13%. Conclusion: With importance-aware token-level dropout and residual entropy regularization, Dropout Prompt Learning improves the robustness and generalization of vision-language models across a range of challenging scenarios. Abstract: Dropout is a widely used regularization technique which improves the generalization ability of a model by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which aims to apply dropout to improve the robustness of vision-language models. Different from the vanilla dropout, we apply dropout on the tokens of the textual and visual branches, where we evaluate the token significance considering both intra-modal context and inter-modal alignment, enabling flexible dropout probabilities for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios like low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% in performance on base-to-novel generalization.
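
The idea of per-token dropout probabilities can be sketched as follows. The abstract does not give the scoring function or the mapping from significance to probability, so everything here is a hypothetical illustration: the significance scores are supplied by hand, the linear mapping and the `p_max` cap are assumptions, and the choice to drop less significant tokens more often is one plausible reading, not the paper's stated rule.

```python
import random


def dropout_probabilities(significance, p_max=0.5):
    """Map per-token significance to a drop probability in [0, p_max].

    Assumption: the most significant token is never dropped and the least
    significant one is dropped with probability p_max (linear in between).
    """
    low, high = min(significance), max(significance)
    span = high - low
    if span == 0:
        return [p_max / 2.0] * len(significance)
    return [p_max * (high - s) / span for s in significance]


def drop_tokens(tokens, probabilities, rng):
    """Keep each token independently with probability 1 - p."""
    return [tok for tok, p in zip(tokens, probabilities) if rng.random() >= p]


tokens = ["a", "photo", "of", "a", "dog"]
significance = [0.1, 0.6, 0.1, 0.1, 0.9]   # hypothetical scores
probs = dropout_probabilities(significance)
kept = drop_tokens(tokens, probs, random.Random(0))
```

Compared with vanilla dropout, which uses one fixed rate, this kind of scheme concentrates the randomness on tokens judged less informative, so the most informative content is more likely to survive each training step.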

[226] Unified Camera Positional Encoding for Controlled Video Generation

Cheng Zhang,Boying Li,Meng Wei,Yan-Pei Cao,Camilo Cruz Gambardella,Dinh Phung,Jianfei Cai

Main category: cs.CV

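TL;DR: This paper proposes Unified Camera Positional Encoding (UCPE), a geometry-consistent representation that integrates complete camera information (6-DoF pose, intrinsics, and lens distortion); integrated into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, it achieves state-of-the-art camera controllability and visual fidelity while adding less than 1% trainable parameters.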

Details Motivation: Existing camera encoding methods usually rely on simplified pinhole assumptions and generalize poorly to the diverse intrinsics and lens distortions of real-world cameras, limiting camera control in tasks such as 3D perception and video generation. Method: Proposes Relative Ray Encoding and Absolute Orientation Encoding (using pitch and roll), which together form UCPE; injects it into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, and builds a large video dataset covering diverse camera motions and lens types for training and evaluation. Result: On camera-controlled text-to-video generation, UCPE clearly outperforms existing methods, with more precise camera control and higher visual quality while adding less than 1% trainable parameters; experiments support its potential as a general camera representation for multi-view, video, and 3D tasks. Conclusion: UCPE is an efficient, general camera positional encoding that models complex camera geometry accurately, gives Transformer-based multimodal models stronger 3D perception and control, and may benefit fields such as autonomous driving and embodied AI. Abstract: Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks. Code will be available at https://github.com/chengzhag/UCPE.

[227] Squeezed-Eff-Net: Edge-Computed Boost of Tomography Based Brain Tumor Classification leveraging Hybrid Neural Network Architecture

Md. Srabon Chowdhury,Syeda Fahmida Tanzim,Sheekar Banerjee,Ishtiak Al Mamoon,AKM Muzahidul Islam

Main category: cs.CV

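TL;DR: Proposes a hybrid deep learning model based on SqueezeNet v1 and EfficientNet-B0, combined with handcrafted radiomic features, for classifying brain tumor MRI images; it reaches 98.93% accuracy (99.08% with TTA) while remaining efficient and approaching clinical-grade reliability.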

Details Motivation: Brain tumor diagnosis relies on MRI, but manual delineation is time-consuming and subject to inter-observer variability, and existing deep learning models struggle to balance computational efficiency and accuracy, so a more efficient and reliable automated method is needed. Method: Builds a hybrid deep learning model that fuses SqueezeNet v1 and EfficientNet-B0 and adds handcrafted radiomic features including HOG, LBP, Gabor filters, and wavelet transforms; trains and tests on 7,023 MRI slices from the public Nickparvar dataset and uses TTA to improve performance. Result: Test accuracy reaches 98.93%, rising to 99.08% with TTA, with fewer than 2.1 million parameters and less than 1.2 GFLOPs, indicating strong generalization and computational efficiency; the handcrafted features increase texture sensitivity, while EfficientNet-B0 captures deep hierarchical features. Conclusion: The hybrid model balances diagnostic accuracy and computational efficiency, approaches the reliability needed for clinical use, and could be integrated into clinical decision-support systems for automated brain tumor classification. Abstract: Brain tumors are among the most common and dangerous neurological diseases and require timely, correct diagnosis to guide treatment. Despite advances in magnetic resonance imaging (MRI), tumor delineation remains difficult, time-consuming, and prone to inter-observer error. To overcome these limitations, this work proposes a hybrid deep learning model that combines the lightweight SqueezeNet v1 with the high-performing EfficientNet-B0 and is enhanced with handcrafted radiomic descriptors, including Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gabor filters, and Wavelet transforms. The framework was trained and tested only on the publicly available Nickparvar Brain Tumor MRI dataset, which consists of 7,023 contrast-enhanced T1-weighted axial MRI slices categorized into four groups: glioma, meningioma, pituitary tumor, and no tumor. The testing accuracy of the model was 98.93%, rising to 99.08% with Test Time Augmentation (TTA), indicating strong generalization. Compared to current deep learning architectures, the proposed hybrid network offers a compromise between computational efficiency and diagnostic accuracy, requiring fewer than 2.1 million parameters and less than 1.2 GFLOPs. The handcrafted features increased sensitivity to texture, while the EfficientNet-B0 backbone captured intricate hierarchical features. The resulting model approaches clinical-grade reliability in automated MRI-based tumor classification, highlighting its potential for use in clinical decision-support systems.

[228] Zero-Shot Textual Explanations via Translating Decision-Critical Features

Toshinori Yamauchi,Hiroshi Kera,Kazuhiko Kawamoto

Main category: cs.CV

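TL;DR: This paper proposes TEXTER, a new method for generating textual explanations of image classifier decisions; it isolates decision-critical features and maps them into the CLIP feature space, producing more faithful and interpretable descriptions of the model's reasoning.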

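Details Motivation: Existing zero-shot explanation methods mainly align global image features with language, so they describe what is visible rather than the factors that drive the prediction; large vision-language models can generate captions but are not designed for classifier-specific reasoning. A method that captures the basis of the classification decision is therefore needed. Method: TEXTER identifies the neurons that contribute to the prediction and emphasizes the features they encode (the decision-critical features), then maps these features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning; a sparse autoencoder further improves interpretability, particularly for Transformer architectures. Result: Extensive experiments show that TEXTER produces more faithful and more interpretable explanations than existing methods. Conclusion: By focusing on decision-critical features and combining them with an advanced vision-language model, TEXTER markedly improves the quality of image classifier explanations and is an important step toward transparent AI systems. Abstract: Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons -- i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER generates more faithful and interpretable explanations than existing methods. The code will be publicly released.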

[229] AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

Ziming Hong,Tianyu Huang,Runnan Chen,Shanshan Ye,Mingming Gong,Bo Han,Tongliang Liu

Main category: cs.CV

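TL;DR: This paper proposes AdLift, the first editing safeguard for 3D Gaussian Splatting (3DGS), which lifts 2D adversarial perturbations into a 3D Gaussian representation to protect against instruction-driven editing across views and dimensions.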

Details Motivation: As diffusion-based 2D image editing is extended to 3D Gaussian Splatting (3DGS), 3D content faces the risk of unauthorized modification and malicious tampering. Existing 2D adversarial perturbation methods are hard to apply directly to 3DGS and must balance view generalization against perturbation invisibility. Method: Proposes AdLift with a tailored Lifted PGD optimization strategy: it first truncates gradients at the rendered image and applies projected gradients to bound the image-level perturbation, then propagates the perturbation back to the safeguard Gaussian parameters through an image-to-Gaussian fitting operation, alternating the two steps to obtain multi-view-consistent protection. Result: Qualitative and quantitative experiments show that AdLift defends effectively against state-of-the-art instruction-driven 2D image and 3DGS editing methods and generalizes well across views. Conclusion: AdLift is the first editing safeguard framework for 3DGS; it provides cross-view, robust, and generalizable adversarial protection while keeping the perturbation invisible, opening a new direction for 3D content security. Abstract: Recent studies have extended diffusion-based instruction-driven 2D image editing pipelines to 3D Gaussian Splatting (3DGS), enabling faithful manipulation of 3DGS assets and greatly advancing 3DGS content creation. However, it also exposes these assets to serious risks of unauthorized editing and malicious tampering. Although imperceptible adversarial perturbations against diffusion models have proven effective for protecting 2D images, applying them to 3DGS encounters two major challenges: view-generalizable protection and balancing invisibility with protection capability. In this work, we propose the first editing safeguard for 3DGS, termed AdLift, which prevents instruction-driven editing across arbitrary views and dimensions by lifting strictly bounded 2D adversarial perturbations into a 3D Gaussian-represented safeguard. To ensure both the effectiveness and the invisibility of the adversarial perturbations, these safeguard Gaussians are progressively optimized across training views using a tailored Lifted PGD, which first conducts gradient truncation during back-propagation from the editing model at the rendered image and applies projected gradients to strictly constrain the image-level perturbation. Then, the resulting perturbation is backpropagated to the safeguard Gaussian parameters via an image-to-Gaussian fitting operation. We alternate between gradient truncation and image-to-Gaussian fitting, yielding consistent adversarial protection across different viewpoints that generalizes to novel views. Empirically, qualitative and quantitative results demonstrate that AdLift effectively protects against state-of-the-art instruction-driven 2D image and 3DGS editing.

[230] See More, Change Less: Anatomy-Aware Diffusion for Contrast Enhancement

Junqi Liu,Zejun Wu,Pedro R. A. S. Bassi,Xinze Zhou,Wenxuan Li,Ibrahim E. Hamamci,Sezgin Er,Tianyu Lin,Yi Luo,Szymon Płotka,Bjoern Menze,Daguang Xu,Kai Ding,Kang Wang,Yang Yang,Yucheng Tang,Alan L. Yuille,Zongwei Zhou

Main category: cs.CV

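TL;DR: SMILE is an anatomy-aware diffusion model for medical image enhancement that precisely improves image quality in clinically relevant regions without altering other areas, markedly improving image quality and cancer detection performance.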

Details Motivation: Existing medical image enhancement models often over-edit because they lack an understanding of anatomy and contrast dynamics, causing organ distortion, artifacts, and missed small tumors that affect clinical decisions. Method: Proposes the SMILE model, which introduces structure-aware supervision, registration-free learning, and unified inference, learning from organ boundaries and contrast patterns directly on unaligned multi-phase CT scans to enhance images. Result: Across six external datasets, SMILE outperforms existing methods in image quality (14.2% higher SSIM, 20.6% higher PSNR, 50% better FID) and clinical usefulness, and raises the F1 score for cancer detection from non-contrast CT by up to 10%. Conclusion: SMILE produces anatomically accurate and diagnostically meaningful images, avoids over-editing, and shows good potential for clinical use. Abstract: Image enhancement improves visual quality and helps reveal details that are hard to see in the original image. In medical imaging, it can support clinical decision-making, but current models often over-edit. This can distort organs, create false findings, and miss small tumors because these models do not understand anatomy or contrast dynamics. We propose SMILE, an anatomy-aware diffusion model that learns how organs are shaped and how they take up contrast. It enhances only clinically relevant regions while leaving all other areas unchanged. SMILE introduces three key ideas: (1) structure-aware supervision that follows true organ boundaries and contrast patterns; (2) registration-free learning that works directly with unaligned multi-phase CT scans; (3) unified inference that provides fast and consistent enhancement across all contrast phases. Across six external datasets, SMILE outperforms existing methods in image quality (14.2% higher SSIM, 20.6% higher PSNR, 50% better FID) and in clinical usefulness by producing anatomically accurate and diagnostically meaningful images. SMILE also improves cancer detection from non-contrast CT, raising the F1 score by up to 10 percent.

[231] DGGAN: Degradation Guided Generative Adversarial Network for Real-time Endoscopic Video Enhancement

Handing Xu,Zhenguo Nie,Tairan Peng,Huimin Pan,Xin-Jun Liu

Main category: cs.CV

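TL;DR: Proposes a degradation-aware framework for real-time endoscopic video enhancement that propagates degradation representations across frames, achieving a superior balance between performance and efficiency.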

Details Motivation: Video quality in endoscopic surgery is often degraded by uneven illumination, tissue scattering, occlusions, and motion blur, which affects surgical safety and outcomes; existing deep learning methods are too computationally expensive for real-time use. Method: Uses contrastive learning to extract degradation representations from images and a fusion mechanism to modulate a single-frame enhancement model with them; training with a cycle-consistency constraint yields efficient, robust video enhancement. Result: Experiments show that the framework achieves a better balance between performance and efficiency than several state-of-the-art methods, supporting real-time processing and improving image quality. Conclusion: Implicitly learning and propagating degradation representations offers a practical path to real-time endoscopic video enhancement in the clinic. Abstract: Endoscopic surgery relies on intraoperative video, making image quality a decisive factor for surgical safety and efficacy. Yet, endoscopic videos are often degraded by uneven illumination, tissue scattering, occlusions, and motion blur, which obscure critical anatomical details and complicate surgical manipulation. Although deep learning-based methods have shown promise in image enhancement, most existing approaches remain too computationally demanding for real-time surgical use. To address this challenge, we propose a degradation-aware framework for endoscopic video enhancement, which enables real-time, high-quality enhancement by propagating degradation representations across frames. In our framework, degradation representations are first extracted from images using contrastive learning. We then introduce a fusion mechanism that modulates image features with these representations to guide a single-frame enhancement model, which is trained with a cycle-consistency constraint between degraded and restored images to improve robustness and generalization. Experiments demonstrate that our framework achieves a superior balance between performance and efficiency compared with several state-of-the-art methods. These results highlight the effectiveness of degradation-aware modeling for real-time endoscopic video enhancement. Moreover, our results suggest that implicitly learning and propagating degradation representations offers a practical pathway for clinical application.

[232] A graph generation pipeline for critical infrastructures based on heuristics, images and depth data

Mike Diessner,Yannick Tarant

Main category: cs.CV

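TL;DR: Proposes a photogrammetry-based graph generation pipeline that uses RGB images and depth data, detecting objects with deep learning and inferring their relations, to provide a low-cost, transparent approach to virtual modeling of critical infrastructures.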

Details Motivation: To lower the cost and technical barrier of obtaining 3D models of physical critical infrastructures, avoiding reliance on expensive laser scanners and specialist knowledge. Method: RGB images and depth data are captured with a stereo camera; deep learning performs object detection and instance segmentation, and user-defined heuristic rules infer the relations between objects to build the graph. Result: Experiments on two hydraulic systems show that the generated graphs are close to the ground truth, with high accuracy and application flexibility. Conclusion: The method is low-cost, transparent, and customizable, making it suitable for digital twin and simulation applications that support high-stakes decision-making for critical infrastructures. Abstract: Virtual representations of physical critical infrastructures, such as water or energy plants, are used for simulations and digital twins to ensure resilience and continuity of their services. These models usually require 3D point clouds from laser scanners that are expensive to acquire and require specialist knowledge to use. In this article, we present a graph generation pipeline based on photogrammetry. The pipeline detects relevant objects and predicts their relation using RGB images and depth data generated by a stereo camera. This more cost-effective approach uses deep learning for object detection and instance segmentation of the objects, and employs user-defined heuristics or rules to infer their relations. Results of two hydraulic systems show that this strategy can produce graphs close to the ground truth while its flexibility allows the method to be tailored to specific applications and its transparency qualifies it to be used in the high stakes decision-making that is required for critical infrastructures.

[233] RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation

Zhi Rao,Yucheng Zhou,Benjia Zhou,Yiqing Huang,Sergio Escalera,Jun Wan

Main category: cs.CV

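TL;DR: This paper proposes RVLF, a gloss-free vision-language framework that addresses inadequate sign representation and sentence-level semantic misalignment in sign language translation. The method combines DINOv2 visual features with skeleton motion information and optimizes translation quality with GRPO reinforcement learning, substantially improving BLEU scores on several datasets.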

Details Motivation: Existing gloss-free sign language translation methods suffer from inadequate sign representation and poor sentence-level semantic alignment, which limits translation quality. Method: A three-stage RVLF framework is proposed: (1) semantic representation learning that fuses skeleton motion with DINOv2 visual features; (2) instruction tuning to build the SLT-SFT baseline model; (3) GRPO-based reinforcement learning with a reward function designed from BLEU and ROUGE. Result: On the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, BLEU-4 improves by +5.1, +1.11, +1.4, and +1.61, respectively; this is the first work to bring GRPO into sign language translation. Conclusion: By strengthening visual representation and applying GRPO optimization, RVLF improves the semantic consistency and overall quality of gloss-free sign language translation without relying on large-scale pre-training. Abstract: Gloss-free sign language translation (SLT) is hindered by two key challenges: **inadequate sign representation** that fails to capture nuanced visual cues, and **sentence-level semantic misalignment** in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage **r**einforcing **v**ision-**l**anguage **f**ramework (**RVLF**). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, for a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to improve sentence-level semantic misalignment, we introduce a GRPO-based optimization strategy that fine-tunes the SLT-SFT model with a reward function combining translation fidelity (BLEU) and sentence completeness (ROUGE), yielding the optimized model termed SLT-GRPO. Our conceptually simple framework yields substantial gains under the gloss-free SLT setting without pre-training on any external large-scale sign language datasets, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, respectively. To the best of our knowledge, this is the first work to incorporate GRPO into SLT.
Extensive experiments and ablation studies validate the effectiveness of GRPO-based optimization in enhancing both translation quality and semantic consistency.
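
The reward design described in the RVLF abstract (a fidelity term plus a completeness term, fed to GRPO) can be illustrated with a small sketch. Everything here is an assumption for illustration: `unigram_precision` and `lcs_recall` are crude stand-ins for BLEU and ROUGE, the mixing weight `alpha` is hypothetical, and the advantage step only shows the group-relative normalization that GRPO is known for, not the authors' training code.

```python
from collections import Counter
from statistics import mean, pstdev


def unigram_precision(hyp, ref):
    """Clipped unigram precision: a crude stand-in for BLEU (assumption)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    clipped = sum(min(c, ref_counts[t]) for t, c in Counter(hyp_tokens).items())
    return clipped / len(hyp_tokens)


def lcs_recall(hyp, ref):
    """Longest-common-subsequence length over reference length:
    a crude stand-in for ROUGE-L recall (assumption)."""
    h, r = hyp.split(), ref.split()
    if not r:
        return 0.0
    prev = [0] * (len(r) + 1)
    for token in h:
        cur = [0]
        for j, ref_token in enumerate(r, start=1):
            if token == ref_token:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(r)


def reward(hyp, ref, alpha=0.5):
    """Weighted sum of the fidelity and completeness terms (alpha is hypothetical)."""
    return alpha * unigram_precision(hyp, ref) + (1 - alpha) * lcs_recall(hyp, ref)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


reference = "the weather is nice today"
candidates = ["the weather is nice today", "the weather", ""]
rewards = [reward(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

A short candidate such as "the weather" keeps full precision but loses on the recall term, which is the kind of incompleteness a ROUGE-style component is meant to penalize.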

[234] Effective Attention-Guided Multi-Scale Medical Network for Skin Lesion Segmentation

Siyu Wang,Hua Wang,Huiyu Li,Fan Zhang

Main category: cs.CV

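TL;DR: This paper proposes an encoder-decoder network based on multi-scale residual structures which, combined with the MRCF, CMAM, and EAB modules, improves the accuracy and robustness of skin lesion segmentation.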

Details Motivation: Existing methods extract features insufficiently and lose information when handling skin lesion images with irregular shapes and low contrast, making precise segmentation difficult. Method: A new encoder-decoder network is proposed that introduces a Multi-Resolution Multi-Channel Fusion (MRCF) module to capture cross-scale features, a Cross-Mix Attention Module (CMAM) that dynamically computes attention weights across multiple contexts, and an External Attention Bridge (EAB) that mitigates information loss in skip connections. Result: Experiments on several skin lesion segmentation datasets show that the method significantly outperforms existing Transformer and CNN models, with higher segmentation accuracy and stronger robustness. Conclusion: Through multi-scale feature fusion and optimized attention mechanisms, the proposed architecture addresses the segmentation of complex skin lesions and provides technical support for computer-aided clinical diagnosis. Abstract: In the field of healthcare, precise skin lesion segmentation is crucial for the early detection and accurate diagnosis of skin diseases. Despite significant advances in deep learning for image processing, existing methods have yet to effectively address the challenges of irregular lesion shapes and low contrast. To address these issues, this paper proposes an innovative encoder-decoder network architecture based on multi-scale residual structures, capable of extracting rich feature information from different receptive fields to effectively identify lesion areas. By introducing a Multi-Resolution Multi-Channel Fusion (MRCF) module, our method captures cross-scale features, enhancing the clarity and accuracy of the extracted information. Furthermore, we propose a Cross-Mix Attention Module (CMAM), which redefines the attention scope and dynamically calculates weights across multiple contexts, thus improving the flexibility and depth of feature capture and enabling deeper exploration of subtle features. To overcome the information loss caused by skip connections in traditional U-Net, an External Attention Bridge (EAB) is introduced, facilitating the effective utilization of information in the decoder and compensating for the loss during upsampling. Extensive experimental evaluations on several skin lesion segmentation datasets demonstrate that the proposed model significantly outperforms existing transformer and convolutional neural network-based models, showcasing exceptional segmentation accuracy and robustness.

[235] Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery

Mai Tsujimoto,Junjue Wang,Weihao Xuan,Naoto Yokoya

Main category: cs.CV

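TL;DR: Geo3DVQA is a new benchmark for 3D geospatial vision-language reasoning on RGB remote sensing imagery, aimed at height-aware 3D analysis without specialized sensors. It covers 110k question-answer pairs and 16 task categories; the evaluation shows that existing VLMs perform poorly on RGB-to-3D reasoning and that domain adaptation substantially improves performance.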

Details Motivation: Existing 3D geospatial analysis depends on expensive sensors and struggles to integrate multiple cues, handle diverse queries, and provide interpretable reasoning, which limits global accessibility and practicality; a more general, interpretable 3D spatial reasoning approach based on ordinary RGB images is therefore needed. Method: The Geo3DVQA benchmark is proposed for height-aware 3D geospatial visual question answering using RGB-only remote sensing imagery. It comprises 110k question-answer pairs in multi-level tasks (single-feature, multi-feature, and application-level analysis) across 16 categories; ten advanced VLMs are evaluated, and the effect of domain-specific fine-tuning is explored. Result: The ten state-of-the-art VLMs perform poorly on the benchmark: GPT-4o and Gemini-2.5-Flash reach only 28.6% and 33.0% accuracy, while a domain fine-tuned Qwen2.5-VL-7B reaches 49.6%, a gain of 24.8 points. This indicates that RGB-to-3D reasoning remains a significant challenge for current VLMs but that domain adaptation is effective. Conclusion: Geo3DVQA opens new challenge directions for scalable, accessible, and holistic 3D geospatial analysis, exposes the limitations of current vision-language models in 3D reasoning from RGB images, and demonstrates the importance and potential of domain-specific training. Abstract: Three-dimensional geospatial analysis is critical to applications in urban planning, climate adaptation, and environmental assessment. Current methodologies depend on costly, specialized sensors (e.g., LiDAR and multispectral), which restrict global accessibility. Existing sensor-based and rule-driven methods further struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We hereby present Geo3DVQA, a comprehensive benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning using RGB-only remote sensing imagery. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns. The benchmark encompasses 110k curated question-answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis. The evaluation of ten state-of-the-art VLMs highlights the difficulty of RGB-to-3D reasoning. GPT-4o and Gemini-2.5-Flash achieved only 28.6% and 33.0% accuracy respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% (+24.8 points). These results reveal both the limitations of current VLMs and the effectiveness of domain adaptation. Geo3DVQA introduces new challenge frontiers for scalable, accessible, and holistic 3D geospatial analysis.
The dataset and code will be released upon publication at https://github.com/mm1129/Geo3DVQA.

[236] Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts

Mingning Guo,Mengwei Wu,Shaoxian Li,Haifeng Li,Chao Tao

Main category: cs.CV

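TL;DR: Proposes AerialVP, the first agent framework for task prompt enhancement in UAV image perception, which extracts multi-dimensional auxiliary information to improve vision-language models in complex scenes, and builds the AerialSense benchmark for evaluation.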

Details Motivation: Existing image perception methods based on vision-language models (VLMs) rely on simple textual task prompts. Faced with target confusion, scale variations, and complex backgrounds in UAV imagery, they struggle to align visual and textual semantics effectively, which limits the model's ability to attend to key information. Method: The AerialVP agent framework is proposed with a three-stage prompt enhancement process: analyzing the task type and requirements, selecting suitable tools from a tool repository, and generating the enhanced task prompt. The AerialSense benchmark is also built, covering aerial visual reasoning, question answering, and grounding tasks, and supporting evaluation across resolutions, lighting conditions, and scene types. Result: Experiments show that AerialVP significantly strengthens the guidance provided by task prompts and yields stable, substantial performance gains on both open-source and proprietary VLMs, confirming its effectiveness for complex UAV image understanding. Conclusion: By proactively enhancing task prompts, AerialVP overcomes the limitations of conventional VLM methods in UAV image perception, offers a new approach to visual understanding in complex scenes, and shows good generality and application prospects. Abstract: Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks.
AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.

[237] Reevaluating Automated Wildlife Species Detection: A Reproducibility Study on a Custom Image Dataset

Tobias Abraham Haider

Main category: cs.CV

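TL;DR: This study reproduces the work of Carl et al., who used the pre-trained Google Inception-ResNet-v2 model to identify European wild mammal species, and validates it on a new dataset of 900 images spanning 90 species. Overall accuracy is 62%, close to the 71% of the original study, but the macro F1 score is only 0.28, indicating limited generalization across datasets, especially when species labels do not align with ImageNet classes. The study concludes that pre-trained CNNs are a practical baseline for wildlife species identification but that species-specific fine-tuning or transfer learning is still needed for stable, accurate results.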

Details Motivation: To assess the reproducibility and generalizability of the automated wildlife species identification approach proposed by Carl et al., particularly its performance on a different dataset. Method: The Inception-ResNet-v2 experiment is reimplemented from scratch using openly available resources and a new dataset of 900 images spanning 90 species, with minimal preprocessing. Result: Overall classification accuracy is 62%, close to the 71% of the original study; the macro F1 score is 0.28, showing large differences in per-class performance, especially when labels do not correspond to ImageNet classes. Conclusion: Pre-trained convolutional neural networks are an effective baseline for wildlife image identification, but species-specific adaptation or transfer learning is still required for high and consistent predictive performance. Abstract: This study revisits the findings of Carl et al., who evaluated the pre-trained Google Inception-ResNet-v2 model for automated detection of European wild mammal species in camera trap images. To assess the reproducibility and generalizability of their approach, we reimplemented the experiment from scratch using openly available resources and a different dataset consisting of 900 images spanning 90 species. After minimal preprocessing, we obtained an overall classification accuracy of 62%, closely aligning with the 71% reported in the original work despite differences in datasets. As in the original study, per-class performance varied substantially, as indicated by a macro F1 score of 0.28, highlighting limitations in generalization when labels do not align directly with ImageNet classes. Our results confirm that pretrained convolutional neural networks can provide a practical baseline for wildlife species identification but also reinforce the need for species-specific adaptation or transfer learning to achieve consistent, high-quality predictions.
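
The gap between overall accuracy and macro F1 reported above can be made concrete with a toy calculation. The labels below are hypothetical and are not taken from the study; the sketch also averages F1 over the classes present in the ground truth, which is one common convention and may differ from what the authors used.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: one well-recognized species and two that are always missed.
y_true = ["deer"] * 8 + ["lynx", "otter"]
y_pred = ["deer"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} macro_f1={f1:.2f}")
```

Because macro F1 weights every class equally, a handful of classes that are never predicted correctly pull the average down even when most images are classified correctly.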

[238] ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation

Ziyang Mai,Yu-Wing Tai

Main category: cs.CV

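TL;DR: Proposes ContextAnyone, a context-aware diffusion framework that uses an Emphasize-Attention module and a dual-guidance loss to generate character-consistent videos from a single reference image and text.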

Details Motivation: Existing text-to-video generation methods fall short in keeping character identity consistent, particularly in preserving contextual cues such as hairstyle, outfit, and body shape. Method: The ContextAnyone framework combines reference image reconstruction with video generation, uses an Emphasize-Attention module to reinforce reference features, introduces the Gap-RoPE positional embedding to separate reference and video tokens, and adopts a dual-guidance loss to improve appearance fidelity. Result: Experiments show that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent, context-preserving character videos across different motions and scenes. Conclusion: ContextAnyone improves character consistency in text-to-video generation, achieving higher-quality personalized video generation through context-aware mechanisms and structural innovations. Abstract: Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose ContextAnyone, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: https://github.com/ziyang1106/ContextAnyone.

[239] The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers

Kanishk Awadhiya

Main category: cs.CV

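TL;DR: Vision Transformers spontaneously form a "U-shaped" entropy profile, compressing information in the middle layers to meet the semantic abstraction demands of the task; this "Inductive Bottleneck" is a data-dependent adaptation rather than an architectural flaw.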

Details Motivation: To investigate why ViTs still compress information in their middle layers despite lacking the hierarchical inductive biases of CNNs, and whether this phenomenon is data-driven. Method: The layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs is analyzed on datasets of varying complexity (UC Merced, Tiny ImageNet, CIFAR-100) to study how information flow relates to data semantics. Result: The depth of the "Inductive Bottleneck" correlates strongly with the degree of semantic abstraction the task requires: texture-heavy datasets keep high-dimensional representations, while object-centric datasets push the network to suppress high-frequency information in the middle layers, forming a bottleneck that extracts semantic features. Conclusion: The "U-shaped" entropy profile in ViTs is a data-dependent adaptive mechanism for learning semantic separation in the middle layers, not a side effect of the architecture. Abstract: Vision Transformers (ViTs) lack the hierarchical inductive biases inherent to Convolutional Neural Networks (CNNs), theoretically allowing them to maintain high-dimensional representations throughout all layers. However, recent observations suggest ViTs often spontaneously manifest a "U-shaped" entropy profile, compressing information in middle layers before expanding it for the final classification. In this work, we demonstrate that this "Inductive Bottleneck" is not an architectural artifact, but a data-dependent adaptation. By analyzing the layer-wise Effective Encoding Dimension (EED) of DINO-trained ViTs across datasets of varying compositional complexity (UC Merced, Tiny ImageNet, and CIFAR-100), we show that the depth of the bottleneck correlates strongly with the semantic abstraction required by the task. We find that while texture-heavy datasets preserve high-rank representations throughout, object-centric datasets drive the network to dampen high-frequency information in middle layers, effectively "learning" a bottleneck to isolate semantic features.
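
The abstract does not define the Effective Encoding Dimension, so the sketch below substitutes a common proxy, the entropy-based effective rank of a covariance spectrum, purely to illustrate what a layer-wise "U-shaped" profile looks like. The function name, the proxy itself, and the per-layer spectra are all assumptions, not the paper's measurements.

```python
import math


def effective_dimension(eigenvalues):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized spectrum. Used here as an assumed proxy for EED."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("spectrum must have positive mass")
    probs = [v / total for v in eigenvalues if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical per-layer covariance spectra: flat early, peaked in the middle, flat late.
layer_spectra = [
    [1.0, 1.0, 1.0, 1.0],   # early layer: variance spread evenly
    [3.0, 0.5, 0.3, 0.2],   # middle layer: one dominant direction
    [1.0, 0.9, 0.8, 0.7],   # late layer: spread out again
]
profile = [effective_dimension(s) for s in layer_spectra]
print([round(d, 2) for d in profile])
```

With these made-up spectra the middle layer has the lowest effective dimension, which is the qualitative shape the paper describes for object-centric data.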

[240] Generalized Referring Expression Segmentation on Aerial Photos

Luís Marnoto,Alexandre Bernardino,Bruno Martins

Main category: cs.CV

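TL;DR: This paper presents Aerial-D, a large-scale referring expression segmentation dataset for aerial imagery, with 37,288 images and more than 1.5 million referring expressions covering 21 classes. Expressions are generated automatically by combining rule-based generation with large language model enhancement, and historic imaging conditions are simulated, supporting unified text-driven instance and semantic segmentation for both modern and historical aerial images.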

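Details Motivation: Aerial imagery exhibits widely varying resolution, inconsistent color, tiny and densely packed targets, and partial occlusions. Existing datasets cannot meet the needs of referring expression segmentation in such complex scenes and in particular lack support for historical imagery, so a large-scale dataset covering diverse imaging conditions is needed to advance the field. Method: The Aerial-D dataset is built with a fully automatic pipeline: systematic rules generate base referring expressions, a large language model (LLM) enriches linguistic variety and visual detail, and filters simulate historic imaging conditions (such as black-and-white, sepia, and grain). Models based on the RSRefSeg architecture are trained jointly with other aerial datasets, achieving unified instance and semantic segmentation for modern and historical images. Result: The approach achieves competitive performance on contemporary benchmarks while keeping high segmentation accuracy under simulated historical degradations (monochrome, sepia, grainy noise). The model is robust on extremely small targets (only a few pixels) and in high-density scenes. Conclusion: Aerial-D provides a new large-scale resource for referring expression segmentation in aerial imagery, supports cross-era image analysis, and advances the combination of natural language and visual localization in complex aerial scenes; the released dataset, models, and code support follow-up research. Abstract: Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Considering aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges because spatial resolution varies widely across datasets, the use of color is not consistent, targets often shrink to only a few pixels, and scenes contain very high object densities and objects with partial occlusions. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning across individual object instances, groups of instances, and semantic regions covering 21 distinct classes that range from vehicles and infrastructure to land coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual details within the referring expressions. Filters were additionally used to simulate historic imaging conditions for each scene.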
We adopted the RSRefSeg architecture, and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks, while maintaining strong accuracy under monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial-d .

[241] Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting

Shilong Jin,Haoran Duan,Litao Hua,Wentao Huang,Yuan Zhou

Main category: cs.CV

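TL;DR: This paper proposes TD-Attn, a new framework that addresses multi-view inconsistency in 3D tasks built on text-to-image diffusion models, improving cross-view consistency through 3D-aware attention guidance and a hierarchical attention modulation module.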

Details Motivation: T2I models exhibit a prior view bias that makes objects look inconsistent across views, which limits their use in 3D tasks. Method: The TD-Attn framework is proposed, comprising the 3D-AAG module, which constructs a view-consistent 3D attention Gaussian, and the HAM module, which uses a Semantic Guidance Tree to locate and modulate the cross-attention layers that respond to view conditions. Result: Experiments show that TD-Attn improves multi-view consistency across a variety of 3D tasks and can be used as a general plugin. Conclusion: By analyzing the prior view bias in depth and designing modules that target it, TD-Attn improves the quality and consistency of 3D tasks derived from T2I models without requiring extensive 3D training data. Abstract: Versatile 3D tasks (e.g., generation or editing) that distill from Text-to-Image (T2I) diffusion models have attracted significant research interest for not relying on extensive 3D training data. However, T2I models exhibit limitations resulting from prior view bias, which produces conflicting appearances between different views of an object. This bias causes subject-words to preferentially activate prior view features during cross-attention (CA) computation, regardless of the target view condition. To overcome this limitation, we conduct a comprehensive mathematical analysis to reveal the root cause of the prior view bias in T2I models. Moreover, we find different UNet layers show different effects of prior view in CA. Therefore, we propose a novel framework, TD-Attn, which addresses multi-view inconsistency via two key components: (1) the 3D-Aware Attention Guidance Module (3D-AAG) constructs a view-consistent 3D attention Gaussian for subject-words to enforce spatial consistency across attention-focused regions, thereby compensating for the limited spatial information in 2D individual view CA maps; (2) the Hierarchical Attention Modulation Module (HAM) utilizes a Semantic Guidance Tree (SGT) to direct the Semantic Response Profiler (SRP) in localizing and modulating CA layers that are highly responsive to view conditions, where the enhanced CA maps further support the construction of more consistent 3D attention Gaussians. Notably, HAM facilitates semantic-specific interventions, enabling controllable and precise 3D editing.
Extensive experiments firmly establish that TD-Attn has the potential to serve as a universal plugin, significantly enhancing multi-view consistency across 3D tasks.

[242] MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

Xinyu Wei,Kangrui Cen,Hongyang Wei,Zhen Guo,Bairui Li,Zeqing Wang,Jinrui Zhang,Lei Zhang

Main category: cs.CV

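TL;DR: This paper presents a systematic study of Multi-Image Composition (MICo): it proposes a taxonomy of 7 tasks, builds the large-scale, high-quality MICo-150K dataset and the MICo-Bench evaluation benchmark, and introduces a new metric, Weighted-Ref-VIEScore. A human-in-the-loop synthesis and filtering process yields multi-image compositions with identity consistency. The fine-tuned baseline model, Qwen-MICo, surpasses existing models in its support for multi-image inputs.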

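Details Motivation: Multi-Image Composition (MICo) is held back by the lack of high-quality training data. Existing methods struggle to keep semantic coherence and identity consistency across images, and there is no unified task taxonomy, dataset, or evaluation standard, so a systematic study is needed to advance the field. Method: (1) MICo is divided into 7 representative tasks; (2) high-quality source images are collected and diverse prompts constructed; (3) strong proprietary models generate composite images, which are filtered by humans to build the MICo-150K dataset; (4) a De&Re subset is built to enable both real and synthetic compositions; (5) the MICo-Bench evaluation set and the new Weighted-Ref-VIEScore metric are established; (6) multiple models are fine-tuned on MICo-150K for validation. Result: The MICo-150K dataset with 150K samples and the De&Re subset with 11K samples are constructed, and the MICo-Bench evaluation set with 1,000 cases is released. The fine-tuned Qwen-MICo model matches Qwen-Image-2509 on 3-image composition while supporting an arbitrary number of input images, with clear gains in coherence and consistency. Conclusion: By systematically building a dataset, a benchmark, and an evaluation method, this work advances multi-image composition; MICo-150K and MICo-Bench provide important resources for follow-up research and demonstrate that data-driven methods are effective in improving models' multi-image composition ability. Abstract: In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo, categorizing it into 7 representative tasks, curating a large-scale collection of high-quality source images, and constructing diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a rich amount of balanced composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models without MICo capability and further enhances those with existing skills.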
Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.

[243] DeepAgent: A Dual Stream Multi Agent Fusion for Robust Multimodal Deepfake Detection

Sayeem Been Zaman,Wasimul Karim,Arefin Ittesafun Abian,Reem E. Mohamed,Md Rafiqul Islam,Asif Karim,Sami Azam

Main category: cs.CV

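TL;DR: This paper proposes DeepAgent, a multi-agent collaboration framework for detecting deepfake videos that combines visual and audio modalities and fuses the two agents' decisions with a Random Forest meta-classifier, achieving high accuracy and robustness across datasets.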

Details Motivation: Most existing deepfake detection methods integrate audio and visual information within a single model, which is vulnerable to modality mismatches, noise, and manipulation and lacks robustness; a more effective multimodal fusion method is needed to improve detection performance. Method: The DeepAgent framework contains two complementary agents: Agent-1 analyzes video frames with an AlexNet-based CNN to identify traces of forgery; Agent-2 detects audio-visual inconsistencies by combining acoustic features, Whisper speech transcription, and EasyOCR reading of text in images; their outputs are fused by a Random Forest meta-classifier. Result: On the combined Celeb-DF and FakeAVCeleb datasets, Agent-1 reaches a test accuracy of 94.35%; on FakeAVCeleb, Agent-2 and the meta-classifier reach 93.69% and 81.56% accuracy, respectively; cross-dataset validation shows the meta-classifier reaching 97.49% accuracy on DeepFakeTIMIT, indicating strong generalization. Conclusion: The hierarchical multi-agent fusion strategy mitigates the weaknesses of individual modalities, confirming the effectiveness and robustness of multi-agent collaboration against diverse deepfake manipulations. Abstract: The increasing use of synthetic media, particularly deepfakes, is an emerging challenge for digital content verification. Although recent studies use both audio and visual information, most integrate these cues within a single model, which remains vulnerable to modality mismatches, noise, and manipulation. To address this gap, we propose DeepAgent, an advanced multi-agent collaboration framework that simultaneously incorporates both visual and audio modalities for the effective detection of deepfakes. DeepAgent consists of two complementary agents. Agent-1 examines each video with a streamlined AlexNet-based CNN to identify the signs of deepfake manipulation, while Agent-2 detects audio-visual inconsistencies by combining acoustic features, audio transcriptions from Whisper, and frame-reading sequences of images through EasyOCR. Their decisions are fused through a Random Forest meta-classifier that improves final performance by taking advantage of the different decision boundaries learned by each agent. This study evaluates the proposed framework using three benchmark datasets to demonstrate both component-level and fused performance. Agent-1 achieves a test accuracy of 94.35% on the combined Celeb-DF and FakeAVCeleb datasets. On the FakeAVCeleb dataset, Agent-2 and the final meta-classifier attain accuracies of 93.69% and 81.56%, respectively.
In addition, cross-dataset validation on DeepFakeTIMIT confirms the robustness of the meta-classifier, which achieves a final accuracy of 97.49%, and indicates a strong capability across diverse datasets. These findings confirm that hierarchy-based fusion enhances robustness by mitigating the weaknesses of individual modalities and demonstrate the effectiveness of a multi-agent approach in addressing diverse types of manipulations in deepfakes.

[244] Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation

Qiming Huang,Hao Ai,Jianbo Jiao

Main category: cs.CV

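TL;DR: Proposes a structure-aware feature rectification method that uses instance-specific priors from the image (via a region adjacency graph, RAG) to strengthen the local discrimination of CLIP features, improving open-vocabulary semantic segmentation.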

Details Motivation: In open-vocabulary semantic segmentation, vision-language models such as CLIP favor global semantic alignment, which makes local region predictions noisy and inconsistent; this stems mainly from a dispersed bias introduced by the contrastive training paradigm. Method: A region adjacency graph (RAG) is built from low-level features (such as colour and texture) to capture local structural relationships and is used to rectify CLIP features and enhance their local discrimination. Result: The method suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks. Conclusion: Introducing a structure-aware feature rectification mechanism that combines low-level visual priors with CLIP's high-level semantics markedly improves local prediction quality in open-vocabulary semantic segmentation. Abstract: Benefiting from the inductive biases learned from large-scale datasets, open-vocabulary semantic segmentation (OVSS) leverages the power of vision-language models, such as CLIP, to achieve remarkable progress without requiring task-specific training. However, due to CLIP's pre-training nature on image-text pairs, it tends to focus on global semantic alignment, resulting in suboptimal performance when associating fine-grained visual regions with text. This leads to noisy and inconsistent predictions, particularly in local areas. We attribute this to a dispersed bias stemming from its contrastive training paradigm, which is difficult to alleviate using CLIP features alone. To address this, we propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image. Specifically, we construct a region adjacency graph (RAG) based on low-level features (e.g., colour and texture) to capture local structural relationships and use it to refine CLIP features by enhancing local discrimination. Extensive experiments show that our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
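
To make the region adjacency graph idea concrete, here is a minimal sketch under several assumptions: the image has already been over-segmented into regions (the label map below is hypothetical), adjacency is defined by 4-connectivity, and the refinement step is a simple neighbor-averaging rule with a made-up mixing weight `lam`. It is not the paper's rectification procedure, which the abstract does not spell out.

```python
def region_adjacency(labels):
    """Build a region adjacency graph from a 2D grid of region ids.
    Two regions are adjacent if any of their pixels touch (4-connectivity)."""
    adjacency = {}
    rows, cols = len(labels), len(labels[0])
    for y in range(rows):
        for x in range(cols):
            a = labels[y][x]
            adjacency.setdefault(a, set())
            # Looking only down and right visits every neighboring pair once.
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < rows and nx < cols:
                    b = labels[ny][nx]
                    if a != b:
                        adjacency[a].add(b)
                        adjacency.setdefault(b, set()).add(a)
    return adjacency


def refine_features(features, adjacency, lam=0.5):
    """Blend each region's feature vector with the mean of its neighbors
    (hypothetical smoothing rule; lam is an assumed mixing weight)."""
    refined = {}
    for region, vec in features.items():
        neighbors = [features[n] for n in sorted(adjacency.get(region, ()))]
        if not neighbors:
            refined[region] = list(vec)
            continue
        mean_vec = [sum(col) / len(neighbors) for col in zip(*neighbors)]
        refined[region] = [(1 - lam) * v + lam * m for v, m in zip(vec, mean_vec)]
    return refined


label_map = [
    [0, 0, 1],
    [0, 2, 1],
    [2, 2, 1],
]
graph = region_adjacency(label_map)
region_features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 0.0]}
smoothed = refine_features(region_features, graph)
```

In practice the edges would likely be weighted by colour or texture similarity so that features are only shared between regions that look alike; the unweighted version above keeps the example short.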

[245] Enhancing Small Object Detection with YOLO: A Novel Framework for Improved Accuracy and Efficiency

Mahila Moghadami,Mohammad Ali Keyvanrad,Melika Sabaghian

Main category: cs.CV

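TL;DR: This paper proposes an improved small object detection model based on SW-YOLO. By optimizing the sliding-window cropping strategy and the network structure (enhanced feature extraction in the neck, a CBAM module in the backbone, and a new detection head), it raises mAP on the VisDrone2019 dataset from 35.5 for YOLOv5L to 61.2, outperforming existing methods such as SAHI and CZDet.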

Details Motivation: Aerial imagery is increasingly important in industrial and critical applications, yet existing small object detection methods lack accuracy, so a more robust framework is needed to detect small objects in large-scale aerial images. Method: Starting from the base SW-YOLO approach, the crop size and overlap of the sliding window are optimized and the network architecture is modified in several ways: advanced feature extraction modules are added to the neck to enhance feature maps, a CBAM module is integrated into the backbone to preserve spatial and channel information, and a new detection head is designed to improve small object accuracy. Result: On the VisDrone2019 dataset, the proposed method raises the mAP .5:.95 metric from 35.5 for YOLOv5L to 61.2, clearly above CZDet's 58.36, and shows stronger detection performance than advanced frameworks such as SAHI. Conclusion: By jointly optimizing the cropping strategy and the network structure, the improved SW-YOLO model markedly increases small object detection accuracy in large-scale aerial images, confirming its effectiveness in complex scenes. Abstract: This paper investigates and develops methods for detecting small objects in large-scale aerial images. Current approaches for detecting small objects in aerial images often involve image cropping and modifications to detector network architectures. Techniques such as sliding window cropping and architectural enhancements, including higher-resolution feature maps and attention mechanisms, are commonly employed. Given the growing importance of aerial imagery in various critical and industrial applications, the need for robust frameworks for small object detection becomes imperative. To address this need, we adopted the base SW-YOLO approach to enhance speed and accuracy in small object detection by refining cropping dimensions and overlap in sliding window usage and subsequently enhanced it through architectural modifications. We propose a novel model by modifying the base model architecture, including advanced feature extraction modules in the neck for feature map enhancement, integrating CBAM in the backbone to preserve spatial and channel information, and introducing a new head to boost small object detection accuracy. Finally, we compared our method with SAHI, one of the most powerful frameworks for processing large-scale images, and CZDet, which is also based on image cropping, achieving significant improvements in accuracy. The proposed model achieves significant accuracy gains on the VisDrone2019 dataset, outperforming baseline YOLOv5L detection by a substantial margin.
Specifically, the final proposed model elevates the mAP .5.5 accuracy on the VisDrone2019 dataset from the base accuracy of 35.5 achieved by the YOLOv5L detector to 61.2. Notably, the accuracy of CZDet, which is another classic method applied to this dataset, is 58.36. This research demonstrates a significant improvement, achieving an increase in accuracy from 35.5 to 61.2.
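
The sliding-window cropping that this line of work tunes (crop size and overlap) can be sketched as follows. The function name, the tile size, and the overlap ratio are illustrative assumptions; the abstract does not state the values the authors settled on.

```python
def sliding_window_boxes(width, height, tile, overlap):
    """Return (x1, y1, x2, y2) crop boxes covering an image with square tiles.

    `overlap` is the fraction of a tile shared with its neighbor. The last
    tile on each axis is shifted back so that it ends at the image border.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(tile * (1 - overlap)))

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # final tile flush with the border
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# Hypothetical settings: a 1000x600 image, 400-pixel tiles, 25% overlap.
boxes = sliding_window_boxes(1000, 600, tile=400, overlap=0.25)
print(len(boxes), boxes[0], boxes[-1])
```

Each crop would be passed to the detector separately and the resulting boxes mapped back to full-image coordinates and merged, for example with non-maximum suppression; a larger overlap reduces the chance of cutting a small object in half at the cost of more crops per image.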

[246] Tessellation GS: Neural Mesh Gaussians for Robust Monocular Reconstruction of Dynamic Objects

Shuohan Tao,Boyao Zhou,Hanzhang Tu,Yuwang Wang,Yebin Liu

Main category: cs.CV

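TL;DR: Proposes Tessellation GS, a structured 2D Gaussian splatting method anchored on mesh faces, for reconstructing dynamic scenes from a single continuously moving or static camera, markedly improving generalization under sparse views.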

Details Motivation: 3D Gaussian Splatting (GS) performs poorly at viewpoint extrapolation and overfits easily, generalizing badly in sparse-view and dynamic scenes in particular. Method: 2D Gaussians are constrained to mesh faces and their attributes are inferred from hierarchical neural features on those faces; a detail-aware loss guides adaptive face subdivision, and priors from a reconstruction foundation model initialize the Gaussian deformations. Result: On appearance and mesh reconstruction tasks, LPIPS is reduced by 29.1% and Chamfer distance by 49.2%, outperforming the previous SOTA method. Conclusion: Through structural constraints and prior guidance, Tessellation GS achieves more robust dynamic scene reconstruction, especially in the challenging setting of a single static camera. Abstract: 3D Gaussian Splatting (GS) enables highly photorealistic scene reconstruction from posed image sequences but struggles with viewpoint extrapolation due to its anisotropic nature, leading to overfitting and poor generalization, particularly in sparse-view and dynamic scene reconstruction. We propose Tessellation GS, a structured 2D GS approach anchored on mesh faces, to reconstruct dynamic scenes from a single continuously moving or static camera. Our method constrains 2D Gaussians to localized regions and infers their attributes via hierarchical neural features on mesh faces. Gaussian subdivision is guided by an adaptive face subdivision strategy driven by a detail-aware loss function. Additionally, we leverage priors from a reconstruction foundation model to initialize Gaussian deformations, enabling robust reconstruction of general dynamic objects from a single static camera, previously extremely challenging for optimization-based methods. Our method outperforms the previous SOTA method, reducing LPIPS by 29.1% and Chamfer distance by 49.2% on appearance and mesh reconstruction tasks.

[247] LogicCBMs: Logic-Enhanced Concept-Based Learning

Deepika SN Vemuri,Gautham Bellamkonda,Aditya Pola,Vineeth N Balasubramanian

Main category: cs.CV

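TL;DR: Proposes LogicCBM, which augments concept bottleneck models with differentiable logic operations to improve expressivity, accuracy, and interpretability.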

Details Motivation: Linear combinations limit the expressivity of concept bottleneck models and make it hard to capture complex relations among concepts, so logic operations are introduced to strengthen modeling capacity. Method: Designs a differentiable logic module that combines the concepts learned by a CBM through logic operations (such as AND, OR, and NOT) while remaining end-to-end trainable. Result: On benchmark and synthetic datasets, LogicCBM shows higher accuracy, more effective interventions, and stronger interpretability. Conclusion: By introducing differentiable logic operations, LogicCBM goes beyond the linear restriction of conventional CBMs, improving performance while preserving interpretability. Abstract: Concept Bottleneck Models (CBMs) provide a basis for semantic abstractions within a neural network architecture. Such models have primarily been seen through the lens of interpretability so far, wherein they offer transparency by inferring predictions as a linear combination of semantic concepts. However, a linear combination is inherently limiting. So we propose the enhancement of concept-based learning models through propositional logic. We introduce a logic module that is carefully designed to connect the learned concepts from CBMs through differentiable logic operations, such that our proposed LogicCBM can go beyond simple weighted combinations of concepts to leverage various logical operations to yield the final predictions, while maintaining end-to-end learnability. Composing concepts using a set of logic operators enables the model to capture inter-concept relations, while simultaneously improving the expressivity of the model in terms of logic operations. Our empirical studies on well-known benchmarks and synthetic datasets demonstrate that these models have better accuracy, perform effective interventions and are highly interpretable.

[248] How Far are Modern Trackers from UAV-Anti-UAV? A Million-Scale Benchmark and New Baseline

Chunhui Zhang,Li Liu,Zhipeng Zhang,Yong Wang,Hao Wen,Xi Zhou,Shiming Ge,Yanfeng Wang

Main category: cs.CV

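TL;DR: Proposes UAV-Anti-UAV, a new multi-modal visual tracking task in which a moving UAV platform tracks a target UAV, together with a large-scale dataset and MambaSTS, a Mamba-based baseline.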

Details Motivation: Existing Anti-UAV research focuses mainly on RGB or infrared videos captured by fixed ground cameras, and tracking a target UAV from a moving UAV platform has received little study. Method: Proposes the UAV-Anti-UAV task, builds a large-scale dataset of 1,810 videos, and proposes MambaSTS, which combines Mamba and Transformer models for spatial-temporal-semantic learning. Result: MambaSTS is validated on the UAV-Anti-UAV dataset, and experiments show that 50 current mainstream deep tracking algorithms still leave considerable room for improvement. Conclusion: The UAV-Anti-UAV task is more challenging and closer to real-world use, offering a new direction and data support for future Anti-UAV technology. Abstract: Unmanned Aerial Vehicles (UAVs) offer wide-ranging applications but also pose significant safety and privacy violation risks in areas like airport and infrastructure inspection, spurring the rapid development of Anti-UAV technologies in recent years. However, current Anti-UAV research primarily focuses on RGB, infrared (IR), or RGB-IR videos captured by fixed ground cameras, with little attention to tracking target UAVs from another moving UAV platform. To fill this gap, we propose a new multi-modal visual tracking task termed UAV-Anti-UAV, which involves a pursuer UAV tracking a target adversarial UAV in the video stream. Compared to existing Anti-UAV tasks, UAV-Anti-UAV is more challenging due to severe dual-dynamic disturbances caused by the rapid motion of both the capturing platform and the target. To advance research in this domain, we construct a million-scale dataset consisting of 1,810 videos, each manually annotated with bounding boxes, a language prompt, and 15 tracking attributes. Furthermore, we propose MambaSTS, a Mamba-based baseline method for UAV-Anti-UAV tracking, which enables integrated spatial-temporal-semantic learning. Specifically, we employ Mamba and Transformer models to learn global semantic and spatial features, respectively, and leverage the state space model's strength in long-sequence modeling to establish video-level long-term context via a temporal token propagation mechanism. We conduct experiments on the UAV-Anti-UAV dataset to validate the effectiveness of our method. 
A thorough experimental evaluation of 50 modern deep tracking algorithms demonstrates that there is still significant room for improvement in the UAV-Anti-UAV domain. The dataset and codes will be available at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.

[249] GlimmerNet: A Lightweight Grouped Dilated Depthwise Convolutions for UAV-Based Emergency Monitoring

Đorđe Nedeljković

Main category: cs.CV

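TL;DR: Proposes GlimmerNet, an ultra-lightweight convolutional network that uses grouped dilated depthwise convolutions and a novel aggregator module to achieve strong global perception at very low computational cost, setting a new SOTA on the UAV image dataset AIDERv2.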

Details Motivation: Self-attention-based Vision Transformers can strengthen the global contextual understanding of CNNs, but their computational overhead makes them hard to deploy on edge and mobile devices; this work aims for efficient global perception without costly components. Method: Proposes GlimmerNet, which introduces Grouped Dilated Depthwise Convolutions (GDBlocks) for multi-scale feature extraction at no additional parameter cost, and designs a novel Aggregator module based on grouped pointwise convolution to fuse cross-group representations efficiently. Result: With only 31K parameters and 29% fewer FLOPs than the most recent baseline, GlimmerNet reaches a weighted F1-score of 0.966 on AIDERv2, significantly outperforming existing methods. Conclusion: GlimmerNet establishes a new accuracy-efficiency trade-off frontier for real-time emergency monitoring on resource-constrained UAV platforms, showing that strong global perception is attainable without complex modules. Abstract: Convolutional Neural Networks (CNNs) have proven highly effective for edge and mobile vision tasks due to their computational efficiency. While many recent works seek to enhance CNNs with global contextual understanding via self-attention-based Vision Transformers, these approaches often introduce significant computational overhead. In this work, we demonstrate that it is possible to retain strong global perception without relying on computationally expensive components. We present GlimmerNet, an ultra-lightweight convolutional network built on the principle of separating receptive field diversity from feature recombination. GlimmerNet introduces Grouped Dilated Depthwise Convolutions (GDBlocks), which partition channels into groups with distinct dilation rates, enabling multi-scale feature extraction at no additional parameter cost. To fuse these features efficiently, we design a novel Aggregator module that recombines cross-group representations using grouped pointwise convolution, significantly lowering parameter overhead. With just 31K parameters and 29% fewer FLOPs than the most recent baseline, GlimmerNet achieves a new state-of-the-art weighted F1-score of 0.966 on the UAV-focused AIDERv2 dataset. These results establish a new accuracy-efficiency trade-off frontier for real-time emergency monitoring on resource-constrained UAV platforms. Our implementation is publicly available at https://github.com/djordjened92/gdd-cnn.
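
The claim that distinct dilation rates come "at no additional parameter cost" follows from simple weight counting, which the sketch below works through. The channel count, kernel size, and dilation rates are illustrative assumptions, not values from the paper, and the helper names are hypothetical.

```python
def depthwise_params(channels: int, kernel: int) -> int:
    """Weights in a 2D depthwise convolution (no bias): one k x k filter per channel."""
    return channels * kernel * kernel


def effective_kernel(kernel: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1


def grouped_pointwise_params(c_in: int, c_out: int, groups: int) -> int:
    """Weights in a 1x1 convolution split into independent groups (no bias)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * groups


# Assumed toy configuration: 32 channels, 3x3 kernels, four channel groups.
channels, kernel = 32, 3
dilations = [1, 2, 3, 4]  # hypothetical: one dilation rate per group
per_group = channels // len(dilations)

# Dilation spreads the taps apart but adds no weights, so the grouped
# total equals that of a single undilated depthwise convolution.
grouped_total = sum(depthwise_params(per_group, kernel) for _ in dilations)
plain_total = depthwise_params(channels, kernel)

# Each group nevertheless sees a different spatial extent.
extents = [effective_kernel(kernel, d) for d in dilations]

# A grouped 1x1 convolution needs 1/groups of the dense 1x1 weights.
dense_pw = grouped_pointwise_params(channels, channels, groups=1)
grouped_pw = grouped_pointwise_params(channels, channels, groups=4)
```

Under these assumptions the dilated groups cover extents of 3, 5, 7, and 9 pixels with the same 288 weights as a plain depthwise layer, and the grouped pointwise step uses a quarter of the dense 1x1 weights.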

[250] Reconstructing Objects along Hand Interaction Timelines in Egocentric Video

Zhifan Zhu,Siddhant Bansal,Shashank Tripathi,Dima Damen

Main category: cs.CV

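TL;DR: Introduces the ROHIT task and uses the Hand Interaction Timeline (HIT) with the Constrained Optimisation and Propagation (COP) framework to improve object pose reconstruction, clearly outperforming existing methods in stable-grasp scenarios in particular.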

Details Motivation: Existing methods struggle to reconstruct an object's full pose sequence during interaction without 3D ground truth, especially during stable grasps; the ROHIT task is proposed to model systematically how object pose changes along the interaction timeline. Method: Defines the Hand Interaction Timeline (HIT) from the object's perspective and proposes the COP framework, which exploits the constant hand-object contact during a stable grasp and improves reconstruction accuracy through constrained optimisation and pose propagation. Result: Validated on HOT3D and EPIC-Kitchens, with 3.6K stable-grasp clips annotated in total; measured by 2D projection error, COP improves stable grasp reconstruction by 6.2-11.3% and full HIT reconstruction by up to 24.5%. Conclusion: The COP framework makes effective use of the pose constraints within a HIT and markedly improves object reconstruction quality without 3D supervision, offering a new approach to video-based object pose estimation. Abstract: We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again w.r.t. the scene. We model these pose constraints over the HIT, and propose to propagate the object's pose along the HIT enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps - i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.

[251] InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs

Bin Li,Ruichi Zhang,Han Liang,Jingyan Zhang,Juze Zhang,Xin Chen,Lan Xu,Jingyi Yu,Jingya Wang

Main category: cs.CV

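TL;DR: Proposes InterAgent, the first end-to-end framework for text-driven, physics-based multi-agent humanoid control, which generates coordinated actions using a multi-stream diffusion transformer and an interaction graph representation.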

Details Motivation: Existing methods are largely confined to single-agent scenarios and do not model physical interaction between agents, making the complex coordination of human social behavior hard to achieve. Method: Proposes the InterAgent framework, which uses an autoregressive diffusion transformer with multi-stream blocks to decouple proprioception, exteroception, and action; designs an interaction graph exteroception representation and a sparse edge-based attention mechanism to capture fine-grained joint-to-joint dependencies and make interaction modeling more robust. Result: Experiments show that InterAgent outperforms strong baselines on multiple benchmarks, achieving state-of-the-art performance and generating coherent, physically plausible, and semantically faithful multi-agent behaviors from text prompts alone. Conclusion: InterAgent effectively addresses cross-modal interference and spatial dependency modeling in multi-agent humanoid control, offering a new approach to text-driven, physics-based generation of social behavior. Abstract: Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Additionally, within it we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance. It enables producing coherent, physically plausible, and semantically faithful multi-agent behaviors from only text prompts. Our code and data will be released to facilitate future research.

[252] Data-driven Exploration of Mobility Interaction Patterns

Gabriele Galatolo,Mirco Nanni

Main category: cs.CV

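TL;DR: Proposes a data mining approach for discovering interaction patterns between individuals from real mobility data, with the aim of improving existing models and simulations of human mobility behavior.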

Details Motivation: Understanding how individuals behave in the external world and influence one another is crucial for modelling human dynamics, yet existing methods mostly rely on preconceived behavioural models and lack analysis that starts from real data. Method: Takes a data mining approach: identifies mobility events in real data that may indicate mutual interactions between individuals, then mines complex, persistent, and time-evolving patterns among them. Result: The method is validated on two real case studies, one on cars and one on pedestrians, with an experimental evaluation covering performance, parameter sensitivity, and interpretation of results. Conclusion: The proposed method can reveal underlying mechanisms of mobility interactions between individuals, providing new insights for improving simulation models in applications such as crowd simulation and emergency management. Abstract: Understanding the movement behaviours of individuals and the way they react to the external world is a key component of any problem that involves the modelling of human dynamics at a physical level. In particular, it is crucial to capture the influence that the presence of an individual can have on the others. Important examples of applications include crowd simulation and emergency management, where the simulation of the mass of people passes through the simulation of the individuals, taking into consideration the others as part of the general context. While existing solutions basically start from some preconceived behavioural model, in this work we propose an approach that starts directly from the data, adopting a data mining perspective. Our method searches the data for mobility events that might be evidence of mutual interactions between individuals, and on top of them looks for complex, persistent patterns and time-evolving configurations of events. The study of these patterns can provide new insights into the mechanics of mobility interactions between individuals, which can potentially help in improving existing simulation models. We instantiate the general methodology on two real case studies, one on cars and one on pedestrians, and a full experimental evaluation is performed, in terms of performance, parameter sensitivity, and interpretation of sample results.

[253] When normalization hallucinates: unseen risks in AI-powered whole slide image processing

Karel Moens,Matthew B. Blaschko,Tinne Tuytelaars,Bart Diricx,Jonas De Vylder,Mustafa Yousif

Main category: cs.CV

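TL;DR: Proposes a new image comparison measure for automatically detecting hallucinated artifacts introduced by whole slide image (WSI) normalization, and reveals significant inconsistencies and failure risks of existing methods on real clinical data.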

Details Motivation: Deep-learning-driven WSI normalization methods tend to produce hallucinated content that is hard to detect visually and threatens downstream analysis, yet current evaluation practices lack an effective way to detect it. Method: Proposes a novel image comparison measure for automatically detecting hallucinations in normalized outputs; retrains several widely used normalization methods on real-world clinical data and evaluates them systematically. Result: Existing methods perform well on public datasets but frequently show hallucinations and inconsistencies on real clinical data; the new measure effectively identifies problems that conventional metrics miss. Conclusion: More robust and interpretable normalization techniques, along with stricter validation protocols, are needed to ensure safe clinical deployment. Abstract: Whole slide image (WSI) normalization remains a vital preprocessing step in computational pathology. Increasingly driven by deep learning, these models learn to approximate data distributions from training examples. This often results in outputs that gravitate toward the average, potentially masking diagnostically important features. More critically, they can introduce hallucinated content, artifacts that appear realistic but are not present in the original tissue, posing a serious threat to downstream analysis. These hallucinations are nearly impossible to detect visually, and current evaluation practices often overlook them. In this work, we demonstrate that the risk of hallucinations is real and underappreciated. While many methods perform adequately on public datasets, we observe a concerning frequency of hallucinations when these same models are retrained and evaluated on real-world clinical data. To address this, we propose a novel image comparison measure designed to automatically detect hallucinations in normalized outputs. Using this measure, we systematically evaluate several well-cited normalization methods retrained on real-world data, revealing significant inconsistencies and failures that are not captured by conventional metrics. Our findings underscore the need for more robust, interpretable normalization techniques and stricter validation protocols in clinical deployment.

[254] Unified Video Editing with Temporal Reasoner

Xiangpeng Yang,Ji Xie,Yiyuan Yang,Yan Huang,Min Xu,Qiang Wu

Main category: cs.CV

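TL;DR: Proposes VideoCoF, a video editing method inspired by Chain-of-Thought reasoning that adds a reasoning step to achieve precise, mask-free instruction-to-region alignment, combined with a RoPE alignment strategy for motion alignment and length extrapolation; it reaches SOTA performance with only 50k video pairs.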

Details Motivation: Existing video editing methods face a trade-off between expert models that depend on task-specific priors such as masks and unified models that lack explicit spatial cues, so precision and generality are hard to obtain together. Method: Proposes VideoCoF, a Chain-of-Frames framework that compels the diffusion model to first predict edit-region latents (reasoning tokens) and then generate the target video, realizing a "see, reason, then edit" procedure; a RoPE alignment strategy uses the reasoning tokens to ensure motion coherence and enable length extrapolation. Result: At a low data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench. Conclusion: Through an explicit reasoning step, VideoCoF resolves the conflict between precision and unification in video editing, achieving accurate spatial localization and temporal extension without user-provided masks and advancing general-purpose video editing. Abstract: Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.

[255] Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance

Naifu Xue,Zhaoyang Jia,Jiahao Li,Bin Li,Zihan Zheng,Yuan Zhang,Yan Lu

Main category: cs.CV

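TL;DR: Proposes S2VC, a single-step diffusion-based video codec that combines a conditional coding framework with an efficient single-step diffusion generator to deliver high-quality reconstruction at low bitrates with reduced sampling cost.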

Details Motivation: Improving perceptual quality at low bitrates remains challenging for both traditional and neural video codecs, and existing methods struggle to balance generation quality against sampling complexity. Method: Proposes S2VC, which introduces Contextual Semantic Guidance to extract frame-adaptive, fine-grained semantics and adds Temporal Consistency Guidance to the diffusion U-Net to strengthen coherence across frames. Result: Experiments show that S2VC reaches state-of-the-art perceptual quality, with an average 52.73% bitrate saving over prior methods. Conclusion: Single-step diffusion combined with semantic and temporal guidance effectively improves perceptual quality in low-bitrate video compression while remaining efficient and practical. Abstract: While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S2VC, a Single-Step diffusion based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.

[256] Towards Robust DeepFake Detection under Unstable Face Sequences: Adaptive Sparse Graph Embedding with Order-Free Representation and Explicit Laplacian Spectral Prior

Chih-Chung Hsu,Shao-Ning Chen,Chia-Ming Lee,Yi-Fang Wang,Yi-Shiuan Chou

Main category: cs.CV

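TL;DR: Proposes a DeepFake video detection method based on a Laplacian-Regularized Graph Convolutional Network (LR-GCN), which builds an Order-Free Temporal Graph Embedding (OF-TGE) and applies a dual-level sparsity mechanism to detect forgeries robustly from noisy, missing, or shuffled face sequences.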

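Details Motivation: Existing DeepFake detectors depend on temporally continuous, clean face sequences, but in real-world scenarios compression artifacts, occlusions, or adversarial attacks often destabilize face detection and degrade performance, so a robust method that handles unordered, missing, or corrupted face sequences is needed. Method: Proposes the LR-GCN framework, which builds an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities; introduces a dual-level sparsity mechanism to suppress the influence of invalid faces; and designs a Graph Laplacian Spectral Prior that acts as a high-pass filter to emphasize structural anomalies and forgery traces, which together with low-pass GCN aggregation forms a task-driven spectral band-pass mechanism. Result: Experiments on FF++, Celeb-DFv2, and DFDC show that LR-GCN maintains strong performance under severe global and local disruptions (such as missing faces, occlusions, and adversarial perturbations), significantly outperforming existing methods and reaching the state of the art. Conclusion: Through graph-structured modeling and a spectral-domain prior, LR-GCN achieves robust DeepFake detection on incomplete, unordered, and corrupted face sequences, improving applicability and stability in complex real-world scenarios. Abstract: Ensuring the authenticity of video content remains challenging as DeepFake generation becomes increasingly realistic and robust against detection. Most existing detectors implicitly assume temporally consistent and clean facial sequences, an assumption that rarely holds in real-world scenarios where compression artifacts, occlusions, and adversarial attacks destabilize face detection and often lead to invalid or misdetected faces. To address these challenges, we propose a Laplacian-Regularized Graph Convolutional Network (LR-GCN) that robustly detects DeepFakes from noisy or unordered face sequences, while being trained only on clean facial data. Our method constructs an Order-Free Temporal Graph Embedding (OF-TGE) that organizes frame-wise CNN features into an adaptive sparse graph based on semantic affinities. Unlike traditional methods constrained by strict temporal continuity, OF-TGE captures intrinsic feature consistency across frames, making it resilient to shuffled, missing, or heavily corrupted inputs. We further impose a dual-level sparsity mechanism on both graph structure and node features to suppress the influence of invalid faces. Crucially, we introduce an explicit Graph Laplacian Spectral Prior that acts as a high-pass operator in the graph spectral domain, highlighting structural anomalies and forgery artifacts, which are then consolidated by a low-pass GCN aggregation. 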
This sequential design effectively realizes a task-driven spectral band-pass mechanism that suppresses background information and random noise while preserving manipulation cues. Extensive experiments on FF++, Celeb-DFv2, and DFDC demonstrate that LR-GCN achieves state-of-the-art performance and significantly improved robustness under severe global and local disruptions, including missing faces, occlusions, and adversarially perturbed face detections.
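
To see why a graph Laplacian behaves as a high-pass operator, the sketch below applies the unnormalized Laplacian L = D - A of a small path graph to two node signals. The graph, the signals, and the choice of the unnormalized Laplacian are assumptions made for illustration; the abstract does not specify the paper's exact operator or graph construction.

```python
def laplacian(adj):
    """Unnormalized graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def apply(mat, signal):
    """Matrix-vector product using plain lists."""
    return [sum(m * s for m, s in zip(row, signal)) for row in mat]


# Assumed toy graph: four nodes connected in a path 0 - 1 - 2 - 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
L = laplacian(adj)

smooth = [1.0, 1.0, 1.0, 1.0]  # identical value on every node
spiky = [1.0, 1.0, 5.0, 1.0]   # node 2 deviates from its neighbours

smooth_response = apply(L, smooth)  # a constant signal is suppressed entirely
spiky_response = apply(L, spiky)    # the deviation at node 2 is amplified
```

Each output entry is the sum of differences between a node and its neighbours, so smooth components vanish while a node that disagrees with its neighbourhood produces a large response.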

[257] MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer

Penghui Liu,Jiangshan Wang,Yutong Shen,Shanhui Mo,Chenyang Qi,Yue Ma

Main category: cs.CV

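TL;DR: Proposes MultiMotion, a unified framework that addresses motion entanglement and the lack of object-level control when Diffusion Transformers (DiT) perform multi-object video motion transfer. Its core innovation is Maskaware Attention Motion Flow (AMF), which uses SAM2 masks to explicitly disentangle and control the motion features of multiple objects within the DiT pipeline, complemented by RectPC, an efficient high-order sampler. The authors also build the first benchmark dataset for DiT-based multi-object motion transfer. Experiments show that the method achieves precise, semantically aligned, and temporally coherent multi-object motion transfer while retaining DiT's high quality and scalability.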

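Details Motivation: Existing DiT architectures suffer from entangled motion features and cannot provide object-level control in multi-object video motion transfer, which limits their use in complex scenes; a new method that separates and independently controls the motion of multiple objects is therefore needed. Method: Proposes Maskaware Attention Motion Flow (AMF), which uses SAM2-generated masks to guide attention in DiT and explicitly disentangle the motion features of different objects; designs RectPC, a high-order predictor-corrector sampler that improves the efficiency and accuracy of multi-entity generation; and builds the first benchmark dataset for DiT-based multi-object motion transfer for evaluation. Result: MultiMotion performs well on multi-object motion transfer, producing precise, semantically consistent, and temporally coherent motion while preserving DiT's high-quality generation and scalability; RectPC improves sampling efficiency; the dataset fills an evaluation gap in this area. Conclusion: With AMF and RectPC, MultiMotion addresses the key challenges of DiT in multi-object motion transfer, offers an effective solution for controllable video generation in complex scenes, and advances standardized evaluation in the field. Abstract: Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Maskaware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT's high quality and scalability. The code is in the supp.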

[258] SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation

Yao Teng,Zhihuan Jiang,Han Shi,Xian Liu,Xuefei Ning,Guohao Dai,Yu Wang,Zhenguo Li,Xihui Liu

Main category: cs.CV

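TL;DR: Proposes SJD++, a training-free parallel decoding algorithm that uses multi-token prediction and reuses high-confidence draft tokens to accelerate autoregressive text-to-image generation substantially, achieving a 2x to 3x reduction in inference latency and 2x to 7x step compression while preserving visual quality.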

Details Motivation: Large autoregressive models generate high-quality, high-resolution images slowly because inference requires hundreds to thousands of sequential forward passes, so autoregressive text-to-image generation needs to be accelerated. Method: Combines the iterative multi-token prediction mechanism of Jacobi decoding with the probabilistic drafting-and-verification mechanism of speculative sampling, and reuses high-confidence draft tokens after each verification phase instead of resampling them all, which reduces the number of generation steps. Result: Experiments on several representative autoregressive text-to-image models show that SJD++ achieves a 2x to 3x reduction in inference latency and 2x to 7x step compression with no noticeable loss in visual quality. Conclusion: SJD++ is an efficient training-free parallel decoding method that significantly speeds up autoregressive text-to-image models while preserving generation quality, with broad potential for application. Abstract: Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding, with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves $2\times$ to $3\times$ inference latency reduction and $2\times$ to $7\times$ step compression, while preserving visual quality with no observable degradation.
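
The "drafting-and-verification" idea borrowed from speculative sampling can be illustrated with its standard acceptance test: a drafted token is kept with probability min(1, p/q), where q is the probability under the drafting distribution and p the probability under the verifying one. The sketch below shows only that test. It is not the SJD++ algorithm: the residual resampling that follows a rejection, the Jacobi iteration, and the token-reuse step are all omitted, and `accepted_prefix` is a hypothetical name.

```python
import random


def accepted_prefix(draft_probs, target_probs, rng):
    """Count how many leading draft tokens pass the acceptance test.

    draft_probs[i]  -- probability the drafting distribution assigned to token i (> 0)
    target_probs[i] -- probability the verifying distribution assigns to token i
    Verification stops at the first rejected token.
    """
    count = 0
    for q, p in zip(draft_probs, target_probs):
        if rng.random() < min(1.0, p / q):
            count += 1
        else:
            break
    return count


rng = random.Random(0)

# When both distributions agree, every draft token is accepted.
all_kept = accepted_prefix([0.5, 0.2, 0.9], [0.5, 0.2, 0.9], rng)

# A token the verifier considers impossible is always rejected,
# and nothing after it is accepted either.
cut_short = accepted_prefix([0.5, 0.5, 0.5], [0.6, 0.0, 0.5], rng)
```

The more tokens survive each verification pass, the fewer sequential forward passes are needed, which is where the step compression reported in the abstract comes from.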

[259] ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points

Ryota Okumura,Kaede Shiohara,Toshihiko Yamasaki

Main category: cs.CV

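TL;DR: Proposes ControlVP, a user-guided framework that corrects vanishing point inconsistencies in generated images and improves the realism of their geometric structure.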

Details Motivation: Existing text-to-image models produce inconsistent vanishing points in architectural scenes, yielding implausible geometry that harms spatial realism. Method: Builds on a pre-trained diffusion model, adds structural guidance derived from building contours, and introduces geometric constraints that encourage image edges to align with perspective cues. Result: Significantly improves the global geometric consistency of generated images while keeping visual quality comparable to the baselines. Conclusion: ControlVP effectively improves the structural realism of generated images and is valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. Abstract: Recent text-to-image models, such as Stable Diffusion, have achieved impressive visual quality, yet they often suffer from geometric inconsistencies that undermine the structural realism of generated scenes. One prominent issue is vanishing point inconsistency, where projections of parallel lines fail to converge correctly in 2D space. This leads to structurally implausible geometry that degrades spatial realism, especially in architectural scenes. We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. We also introduce geometric constraints that explicitly encourage alignment between image edges and perspective cues. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines. This capability is particularly valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. The dataset and source code are available at https://github.com/RyotaOkumura/ControlVP .
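
A vanishing point is where the image projections of parallel 3D lines meet. As a minimal geometric sketch (independent of ControlVP's actual pipeline, which the abstract does not detail), the code below finds that meeting point for two image lines using homogeneous coordinates; the function names and sample points are illustrative.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Homogeneous line through two 2D image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersection(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        return None  # the lines meet at infinity
    return (x / w, y / w)


# Two building edges that converge to the right of the image (assumed points).
upper_edge = line_through((0.0, 4.0), (2.0, 3.0))
lower_edge = line_through((0.0, 0.0), (2.0, 1.0))
vanishing_point = intersection(lower_edge, upper_edge)

# Edges that stay parallel in the image have no finite vanishing point.
flat_a = line_through((0.0, 0.0), (1.0, 0.0))
flat_b = line_through((0.0, 1.0), (1.0, 1.0))
no_point = intersection(flat_a, flat_b)
```

If a third edge from the same family of parallel lines passes far from the computed point, the image exhibits the kind of vanishing point inconsistency the paper targets.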

[260] MeshRipple: Structured Autoregressive Generation of Artist-Meshes

Junkai Lin,Hang Long,Huipeng Guo,Jielei Zhang,JiaYi Yang,Tianle Guo,Yang Yang,Jianwen Li,Wenxiao Zhang,Matthias Nießner,Wei Yang

Main category: cs.CV

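TL;DR: Proposes MeshRipple, a new method for autoregressive mesh generation that uses frontier-aware BFS tokenization, an expansive prediction strategy, and a sparse-attention global memory to overcome the weakness of earlier methods on long-range geometric dependencies, markedly improving surface fidelity and topological completeness.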

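Details Motivation: Existing autoregressive mesh generators rely on sliding-window inference, which breaks long-range geometric dependencies and produces holes and fragmented components; a method that preserves topological coherence and completeness is needed. Method: Proposes MeshRipple with three core innovations: a frontier-aware breadth-first search (BFS) tokenization that aligns generation order with surface topology; an expansive prediction strategy that keeps surface growth coherent and connected; and a sparse-attention global memory that provides an effectively unbounded receptive field for capturing long-range topological dependencies. Result: MeshRipple outperforms strong recent baselines on multiple metrics and generates 3D meshes with high surface fidelity and complete topology, reducing holes and fragmentation. Conclusion: By expanding the mesh outward from an active frontier in the manner of a spreading ripple, MeshRipple overcomes the mismatch between training and inference in conventional autoregressive methods and offers an effective new paradigm for high-quality 3D mesh generation. Abstract: Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface. MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies. This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.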

[261] From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

Fei Yu,Yu Liu,Luyang Tang,Mingchao Sun,Zengye Ge,Rui Bu,Yuchao Jin,Haisen Zhao,He Sun,Yangyan Li,Mu Xu,Wenzheng Chen,Baoquan Chen

Main category: cs.CV

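TL;DR: Proposes a city-scale 3D reconstruction method based on a 2.5D height map and differentiable rendering that generates photorealistic ground-level views from a small number of satellite images.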

Details Motivation: Existing methods such as NeRF and 3DGS cannot handle the severely foreshortened building facades in satellite images under extreme viewpoint extrapolation, so city-scale 3D reconstruction fails. Method: Models city geometry as a 2.5D height map represented by a Z-monotonic signed distance field (SDF) and paints appearance from satellite images via differentiable rendering; a generative texture restoration network recovers high-frequency detail from low-quality inputs. Result: Achieves state-of-the-art ground-view synthesis on large real-world urban areas (for example, a 4 km² region), producing watertight meshes with crisp roofs and vertical facades plus high-fidelity textures from only a few satellite images. Conclusion: The method is robust and scalable under extreme viewpoint extrapolation, and the resulting models can serve practical applications such as urban planning and simulation. Abstract: City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. 
The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation.

[262] Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation

Xuecheng Li,Weikuan Jia,Alisher Kurbonaliev,Qurbonaliev Alisher,Khudzhamkulov Rustam,Ismoilov Shuhratjon,Eshmatov Javhariddin,Yuanjie Zheng

Main category: cs.CV

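TL;DR: Proposes the Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), which separates modality-specific and modality-shared information to improve the generalization and interpretability of multimodal learning.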

Details Motivation: Existing cross-modal learning methods suffer from modality dominance, redundant coupling, and spurious correlations, which lead to poor generalization and predictions that are hard to interpret. Method: Designs a dual-stream architecture that uses residual decomposition to separate intra-modal and inter-modal latent factors, aligns the shared space with contrastive and regression objectives, and suppresses redundancy with decorrelation and orthogonality losses. Result: Validated on two large-scale educational benchmarks, where it outperforms single-modality and several fusion baselines on both next-step prediction and final outcome prediction. Conclusion: DSRSD-Net effectively disentangles shared and modality-specific information in multimodal representations, improving robustness, interpretability, and predictive performance. Abstract: Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naïve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. 
Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
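
The abstract names a decorrelation loss on the covariance of the shared space and an orthogonality loss between shared and private streams, but does not give their exact form. The following is a minimal sketch under our own assumptions: the decorrelation term is taken to be the sum of squared off-diagonal covariance entries, and the orthogonality term the squared Frobenius norm of the batch-averaged cross-product between the two streams. The function names are hypothetical and are not taken from the paper.

```python
def covariance(rows):
    """Sample covariance matrix of a batch of feature vectors (list of lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def decorrelation_loss(shared):
    """Assumed form: penalize off-diagonal covariance of the shared features."""
    cov = covariance(shared)
    d = len(cov)
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)


def orthogonality_loss(shared, private):
    """Assumed form: squared Frobenius norm of (shared^T @ private) / batch size."""
    n = len(shared)
    total = 0.0
    for i in range(len(shared[0])):
        for j in range(len(private[0])):
            cross = sum(shared[k][i] * private[k][j] for k in range(n)) / n
            total += cross ** 2
    return total
```

Both terms are zero when the shared dimensions are uncorrelated and the two streams carry no overlapping directions, which is the behavior the abstract describes; the actual losses in DSRSD-Net may be weighted or normalized differently.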

[263] All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs

Yahong Wang,Juncheng Wu,Zhangkai Ni,Longzhen Yang,Yihang Liu,Chengmei Yang,Ying Wen,Xianfeng Tang,Hui Liu,Yuyin Zhou,Lianghua He

Main category: cs.CV

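TL;DR: This paper introduces the notion of an "information horizon": in the deep layers of vision large language models, the information carried by visual tokens gradually vanishes, which makes existing pruning methods ineffective there. By quantifying token information, the authors find that random pruning in deep layers improves efficiency, and that combining it with existing methods achieves SOTA performance.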

Details Motivation: Existing training-free pruning methods perform poorly in the deep layers of VLLMs; the authors investigate the root cause and propose a more efficient inference acceleration scheme. Method: Proposes a measure of visual token information based on the change in output probabilities, analyzes how the information of visual tokens evolves across layers, identifies the "information horizon", and applies random pruning beyond that point. Result: Information vanishes as depth increases; an information horizon exists and its position depends on task type and model capacity; random pruning is effective in deep layers, and combining it with DivPrune retains 96.9% of performance while pruning 50% of visual tokens. Conclusion: Visual tokens in the deep layers of VLLMs become redundant because their information vanishes; random pruning is a simple and efficient method, and exploiting the information horizon can substantially improve inference efficiency. Abstract: Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper, however, identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) employ deeper visual tokens than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods.
Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
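
The abstract measures a token's information as the change in the model's output probabilities when that token is removed, and then prunes randomly in deep layers. The sketch below illustrates both ideas on a toy stand-in model. Everything in it is an assumption made for illustration: `toy_model` is a hypothetical placeholder for a VLLM, and the L1 distance between output distributions is our choice of "change"; the paper may use a different distance.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(tokens):
    """Hypothetical stand-in for a VLLM: logits are the sum of the token vectors."""
    dim = len(tokens[0])
    return softmax([sum(t[j] for t in tokens) for j in range(dim)])


def token_information(model, tokens, index):
    """L1 change in output probabilities when token `index` is removed.

    Requires at least two tokens so the ablated input is non-empty.
    """
    full = model(tokens)
    ablated = model(tokens[:index] + tokens[index + 1:])
    return sum(abs(p - q) for p, q in zip(full, ablated))


def random_prune(tokens, keep_ratio, rng):
    """Keep a random subset of tokens, preserving their original order."""
    keep = max(1, round(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept]
```

In this toy setting a token whose vector is all zeros has zero information, so removing it is free; the paper's finding is that beyond the information horizon most visual tokens behave this way, which is why an uninformed random choice does as well as a learned one.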

[264] LongCat-Image Technical Report

Meituan LongCat Team,Hanghang Ma,Haoxian Tan,Jiale Huang,Junqiang Wu,Jun-Yan He,Lishuai Gao,Songlin Xiao,Xiaoming Wei,Xiaoqi Ma,Xunliang Cai,Yayong Guan,Jie Hu

Main category: cs.CV

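TL;DR: LongCat-Image is a pioneering open-source bilingual (Chinese-English) foundation model for image generation. Through careful data strategies and a compact 6B-parameter diffusion model, it reaches SOTA levels in text rendering, photorealism, efficiency, and openness, and establishes what the authors describe as the most comprehensive open-source ecosystem to date.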

Details Motivation: Address core problems of current leading image generation models in multilingual text rendering (especially Chinese), photorealism, deployment efficiency, and developer accessibility. Method: Applies rigorous data curation across the pre-training, mid-training, and SFT stages, combined with carefully designed reward models in the reinforcement learning phase; uses a compact diffusion model of only 6B parameters instead of a large MoE architecture; and releases the full training toolchain along with multiple model versions. Result: Surpasses existing open-source and commercial models in the coverage and accuracy of text rendering (particularly for complex and rare Chinese characters); markedly improves photorealism and aesthetic quality; offers fast inference and low VRAM usage, greatly reducing deployment cost; and reaches SOTA on image editing, with better editing consistency than other open-source work. Conclusion: LongCat-Image achieves a strong balance of performance, efficiency, and openness, sets a new industry standard for Chinese character rendering, and its fully open ecosystem should support research and applications in visual content generation. Abstract: We introduce LongCat-Image, a pioneering open-source and bilingual (Chinese-English) foundation model for image generation, designed to address core challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility prevalent in current leading models. 1) We achieve this through rigorous data curation strategies across the pre-training, mid-training, and SFT stages, complemented by the coordinated use of curated reward models during the RL phase. This strategy establishes the model as a new state-of-the-art (SOTA), delivering superior text-rendering capabilities and remarkable photorealism, and significantly enhancing aesthetic quality. 2) Notably, it sets a new industry standard for Chinese character rendering. By supporting even complex and rare characters, it outperforms both major open-source and commercial solutions in coverage, while also achieving superior accuracy. 3) The model achieves remarkable efficiency through its compact design. With a core diffusion model of only 6B parameters, it is significantly smaller than the nearly 20B or larger Mixture-of-Experts (MoE) architectures common in the field. This ensures minimal VRAM usage and rapid inference, significantly reducing deployment costs. Beyond generation, LongCat-Image also excels in image editing, achieving SOTA results on standard benchmarks with superior editing consistency compared to other open-source works. 4) To fully empower the community, we have established the most comprehensive open-source ecosystem to date.
We are releasing not only multiple model versions for text-to-image and image editing, including checkpoints after mid-training and post-training stages, but also the entire toolchain of training procedure. We believe that the openness of LongCat-Image will provide robust support for developers and researchers, pushing the frontiers of visual content creation.

[265] Robust Variational Model Based Tailored UNet: Leveraging Edge Detector and Mean Curvature for Improved Image Segmentation

Kaili Qi,Zhongyi Huang,Wenli Yang

Main category: cs.CV

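TL;DR: Proposes a robust VM_TUNet model that combines variational methods with deep learning for segmenting noisy images, balancing performance and computational efficiency.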

Details Motivation: Blurred or fragmented boundaries make noisy images hard to segment, and existing pure CNN or Transformer models struggle to balance boundary detail against computational cost. Method: Incorporates physical priors, an edge detector and a mean curvature term, into a modified Cahn-Hilliard equation, and designs a two-module architecture (an F module for frequency-domain preprocessing and a T module that ensures local stability). Result: On three benchmark datasets, the method outperforms pure CNN models in quantitative metrics and visual quality and comes close to Transformer-based methods, at a reasonable computational cost. Conclusion: The method effectively combines the interpretability of PDEs with the representational power of DNNs, achieving a good balance of performance and efficiency in image segmentation. Abstract: To address the challenge of segmenting noisy images with blurred or fragmented boundaries, this paper presents a robust version of Variational Model Based Tailored UNet (VM_TUNet), a hybrid framework that integrates variational methods with deep learning. The proposed approach incorporates physical priors, an edge detector and a mean curvature term, into a modified Cahn-Hilliard equation, aiming to combine the interpretability and boundary-smoothing advantages of variational partial differential equations (PDEs) with the strong representational ability of deep neural networks. The architecture consists of two collaborative modules: an F module, which conducts efficient frequency domain preprocessing to alleviate poor local minima, and a T module, which ensures accurate and stable local computations, backed by a stability estimate. Extensive experiments on three benchmark datasets indicate that the proposed method achieves a balanced trade-off between performance and computational efficiency, which yields competitive quantitative results and improved visual quality compared to pure convolutional neural network (CNN) based models, while achieving performance close to that of transformer-based methods with reasonable computational expense.

[266] More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery

Wenzhen Dong,Jieming Yu,Yiming Huang,Hongqiu Wang,Lei Zhu,Albert C. S. Chung,Hongliang Ren,Long Bai

Main category: cs.CV

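TL;DR: An empirical evaluation of SAM 3 in robotic surgery shows progress in zero-shot segmentation, dynamic video tracking, and 3D reconstruction, with remaining limitations in language-prompted segmentation and complex dynamic scenes.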

Details Motivation: Evaluate the performance of SAM 3 in robot-assisted surgery and explore its capabilities in zero-shot segmentation, language prompting, and 3D perception. Method: Benchmarks image and video segmentation under point, bounding box, and language prompts, as well as 3D reconstruction, on the MICCAI EndoVis 2017/2018, SCARED, StereoMIS, and EndoNeRF datasets. Result: SAM 3 outperforms SAM and SAM 2 in image and video segmentation under spatial prompts and performs well in monocular depth estimation and 3D instrument reconstruction; language prompts are not yet satisfactory, and complex dynamic scenes remain challenging. Conclusion: SAM 3 shows clear progress in surgical scenes but needs further domain-specific training to improve language-based segmentation and the handling of dynamic scenes. Abstract: The recent Segment Anything Model (SAM) 3 has introduced significant advancements over its predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3's 3D reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.

[267] Online Segment Any 3D Thing as Instance Tracking

Hanshi Wang,Zijian Cai,Jin Gao,Yiwei Zhang,Weiming Hu,Ke Wang,Zhipeng Zhang

Main category: cs.CV

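TL;DR: This paper proposes AutoSeg3D, which recasts online 3D segmentation as an instance tracking problem. It uses object queries to propagate temporal information, combines long-term instance association with short-term instance update, and introduces spatial consistency learning to improve segmentation performance.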

Details Motivation: Existing query-based 3D segmentation methods overlook dynamic perception along the temporal dimension; they struggle with partial observations caused by viewpoint changes and do not model temporal coherence. Method: Treats online 3D segmentation as an instance tracking task and uses object queries to carry temporal information: long-term instance association keeps features and identities consistent, and short-term instance update enriches the current observation. Spatial consistency learning is introduced to mitigate the segmentation fragmentation of VFMs. Result: Surpasses ESAM by 2.8 AP on ScanNet200 and achieves consistent gains on ScanNet, SceneNN, and 3RScan. Conclusion: Casting 3D segmentation as instance tracking and modeling temporal information effectively improves the spatio-temporal understanding of dynamic environments by embodied agents and sets a new SOTA. Abstract: Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Given that viewpoint variations in embodied robotics often lead to partial object visibility across frames, this mechanism aids the model in developing a holistic object understanding beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFMs, yielding more comprehensive instance information for enhancing the efficacy of both long-term and short-term temporal learning.
The temporal information exchange and consistency learning facilitated by these sparse object queries not only enhance spatial comprehension but also circumvent the computational burden associated with dense temporal point cloud interactions. Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets.

[268] Decomposition Sampling for Efficient Region Annotations in Active Learning

Jingna Qiu,Frauke Wilm,Mathias Öttl,Jonas Utz,Maja Schlereth,Moritz Schillinger,Marc Aubreville,Katharina Breininger

Main category: cs.CV

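TL;DR: Proposes DECOMP, an active learning sampling strategy that improves annotation efficiency for dense prediction tasks, particularly in medical imaging, by decomposing images into class-specific components and sampling regions from each class to increase annotation diversity.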

Details Motivation: Existing methods for selecting representative annotation regions suffer from high computational and memory costs, irrelevant region choices, and heavy reliance on uncertainty sampling. Method: Uses pseudo-labels to decompose images into class-specific components and samples regions from each class, while class-wise predictive confidence guides the sampling so that difficult classes receive more annotations. Result: Across ROI classification, 2D segmentation, and 3D segmentation, DECOMP consistently surpasses baseline methods, sampling minority-class regions better and improving performance on these challenging classes. Conclusion: DECOMP is an effective active learning sampling strategy that substantially improves annotation efficiency and model performance in dense prediction tasks. Abstract: Active learning improves annotation efficiency by selecting the most informative samples for annotation and model training. While most prior work has focused on selecting informative images for classification tasks, we investigate the more challenging setting of dense prediction, where annotations are more costly and time-intensive, especially in medical imaging. Region-level annotation has been shown to be more efficient than image-level annotation for these tasks. However, existing methods for representative annotation region selection suffer from high computational and memory costs, irrelevant region choices, and heavy reliance on uncertainty sampling. We propose decomposition sampling (DECOMP), a new active learning sampling strategy that addresses these limitations. It enhances annotation diversity by decomposing images into class-specific components using pseudo-labels and sampling regions from each class. Class-wise predictive confidence further guides the sampling process, ensuring that difficult classes receive additional annotations. Across ROI classification, 2-D segmentation, and 3-D segmentation, DECOMP consistently surpasses baseline methods by better sampling minority-class regions and boosting performance on these challenging classes. Code is available at https://github.com/JingnaQiu/DECOMP.git.
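
The sampling idea described above (group candidate regions by pseudo-label, then give low-confidence classes a larger share of the annotation budget) can be sketched as follows. This is an illustration under our own assumptions, not the released DECOMP code: the region records, the `1 - mean confidence` weighting, and the largest-remainder rounding are all hypothetical choices.

```python
import random
from collections import defaultdict


def allocate_budget(mean_confidence, budget):
    """Split an integer budget across classes, favoring low-confidence classes."""
    weights = {c: 1.0 - conf for c, conf in mean_confidence.items()}
    total = sum(weights.values())
    if total == 0:  # every class fully confident: fall back to a uniform split
        weights = {c: 1.0 for c in weights}
        total = float(len(weights))
    raw = {c: budget * w / total for c, w in weights.items()}
    alloc = {c: int(r) for c, r in raw.items()}
    leftover = budget - sum(alloc.values())
    # Largest-remainder rounding so the allocations sum to the budget.
    by_remainder = sorted(raw, key=lambda c: raw[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


def sample_regions(regions, budget, rng):
    """regions: dicts with hypothetical keys 'pseudo_label' and 'confidence'."""
    by_class = defaultdict(list)
    for region in regions:
        by_class[region["pseudo_label"]].append(region)
    mean_conf = {c: sum(r["confidence"] for r in rs) / len(rs)
                 for c, rs in by_class.items()}
    alloc = allocate_budget(mean_conf, budget)
    picked = []
    for c, rs in by_class.items():
        picked.extend(rng.sample(rs, min(alloc[c], len(rs))))
    return picked
```

With five "tumor" regions at confidence 0.5, five "background" regions at confidence 0.9, and a budget of six, this sketch assigns five annotations to the harder class and one to the easier one.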

[269] MoCA: Mixture-of-Components Attention for Scalable Compositional 3D Generation

Zhiqi Li,Wenhuan Li,Tengfei Wang,Zhenwei Wang,Junta Wu,Haoyuan Wang,Yunhan Yang,Zehuan Huang,Yang Li,Peidong Liu,Chunchao Guo

Main category: cs.CV

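TL;DR: This paper proposes MoCA, a compositional 3D generative model with importance-aware component routing and compression of unimportant components. It addresses the excessive computational cost of existing methods as the number of components grows, enabling efficient, fine-grained 3D asset generation.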

Details Motivation: Existing part-based 3D generation methods scale poorly as the number of components grows, because global attention incurs quadratic cost. Method: MoCA has two key designs: importance-based component routing, which selects the top-k relevant components for sparse global attention, and compression of unimportant components, which preserves their contextual priors while reducing computational complexity. Result: Experiments show that MoCA outperforms baselines on compositional 3D object and scene generation and supports efficient generation with larger numbers of components. Conclusion: Through sparse attention and component compression, MoCA achieves scalable, efficient, fine-grained 3D object and scene generation and advances compositional 3D content creation. Abstract: Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods suffer from poor scalability due to quadratic global attention costs when increasing the number of components. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing that selects top-k relevant components for sparse global attention, and (2) unimportant-component compression that preserves contextual priors of unselected components while reducing computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation with a scalable number of components. Extensive experiments show MoCA outperforms baselines on both compositional object and scene generation tasks. Project page: https://lizhiqi49.github.io/MoCA
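
The two designs named in the abstract, top-k routing and compression of unselected components, can be illustrated with a small sketch. The scoring rule (dot product between a query vector and each component's mean token) and the compression rule (replace an unselected component by its mean token) are assumptions made for this example; the abstract does not specify how MoCA computes either.

```python
def mean_token(tokens):
    dim = len(tokens[0])
    return [sum(t[j] for t in tokens) / len(tokens) for j in range(dim)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def route_components(query, components, k):
    """Return (selected indices, attention context).

    components: list of components, each a list of token vectors.
    Selected components contribute all their tokens; every other component
    is compressed to a single summary token (an assumed compression rule).
    """
    summaries = [mean_token(tokens) for tokens in components]
    scores = [dot(query, s) for s in summaries]
    ranked = sorted(range(len(components)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:k])
    context = []
    for i, tokens in enumerate(components):
        if i in selected:
            context.extend(tokens)        # full-resolution tokens
        else:
            context.append(summaries[i])  # compressed contextual prior
    return selected, context
```

The point of the design is visible in the context length: it grows with the tokens of the k selected components plus one token per remaining component, rather than with the total token count of all components.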

[270] Liver Fibrosis Quantification and Analysis: The LiQA Dataset and Baseline Method

Yuanye Liu,Hanxiao Zhang,Nannan Shi,Yuxin Shi,Arif Mahmood,Murtaza Taj,Xiahai Zhuang

Main category: cs.CV

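TL;DR: This paper introduces the LiQA dataset for liver fibrosis quantification and analysis. It contains multi-phase, multi-center MRI scans from 440 patients and is designed to evaluate liver segmentation and liver fibrosis staging algorithms under complex real-world conditions. The paper also describes the top-performing method of the CARE 2024 challenge, which combines a semi-supervised learning framework with external data for robust segmentation and uses a multi-view consensus approach with Class Activation Map (CAM)-based regularization for staging.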

Details Motivation: Accurate liver fibrosis staging is essential for clinical management, but real-world data bring domain shifts, missing modalities, and spatial misalignment, so a high-quality dataset and robust algorithms are needed. Method: Introduces the LiQA dataset together with the CARE 2024 challenge. The top-performing method uses a semi-supervised learning framework with external data for liver segmentation, and a multi-view consensus approach with Class Activation Map (CAM) regularization for fibrosis staging. Result: Benchmark evaluation shows that leveraging multi-source data and anatomical constraints significantly improves model robustness in clinical settings. Conclusion: LiQA provides a realistically challenging benchmark for liver segmentation and fibrosis staging, and methods combining semi-supervised learning with multi-view analysis show promising performance under complex clinical conditions. Abstract: Liver fibrosis represents a significant global health burden, necessitating accurate staging for effective clinical management. This report introduces the LiQA (Liver Fibrosis Quantification and Analysis) dataset, established as part of the CARE 2024 challenge. Comprising $440$ patients with multi-phase, multi-center MRI scans, the dataset is curated to benchmark algorithms for Liver Segmentation (LiSeg) and Liver Fibrosis Staging (LiFS) under complex real-world conditions, including domain shifts, missing modalities, and spatial misalignment. We further describe the challenge's top-performing methodology, which integrates a semi-supervised learning framework with external data for robust segmentation, and utilizes a multi-view consensus approach with Class Activation Map (CAM)-based regularization for staging. Evaluation of this baseline demonstrates that leveraging multi-source data and anatomical constraints significantly enhances model robustness in clinical settings.

[271] An AI-Powered Autonomous Underwater System for Sea Exploration and Scientific Research

Hamad Almazrouei,Mariam Al Nasseri,Maha Alzaabi

Main category: cs.CV

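TL;DR: This paper presents an AI-based Autonomous Underwater Vehicle (AUV) system that combines YOLOv12 Nano, ResNet50, PCA, K-Means++, and GPT-4o Mini to perform real-time detection and clustering of underwater objects and to generate reports automatically, with the aim of improving the efficiency of ocean exploration and the depth of data analysis.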

Details Motivation: Traditional ocean exploration is limited by extreme conditions, low visibility, and high cost, leaving vast regions unexplored. The paper aims to overcome these challenges with an automated AI system that makes underwater exploration more efficient and safer. Method: The system uses YOLOv12 Nano for real-time object detection, ResNet50 for feature extraction, PCA for dimensionality reduction while preserving 98% of the variance, K-Means++ to cluster marine objects by visual characteristics, and GPT-4o Mini to generate structured reports; it is trained and evaluated on more than 55,000 images from DeepFish and OzFish. Result: The system reaches a mAP@0.5 of 0.512, a precision of 0.535, and a recall of 0.438; PCA effectively reduces dimensionality; K-Means++ groups objects by visual similarity; and the LLM produces meaningful summary reports from detections and location information. Conclusion: The integrated AI system reduces the risks of human diving, improves mission efficiency and the speed of data analysis, and offers an efficient new tool for scientific research in complex marine environments. Abstract: Traditional sea exploration faces significant challenges due to extreme conditions, limited visibility, and high costs, resulting in vast unexplored ocean regions. This paper presents an innovative AI-powered Autonomous Underwater Vehicle (AUV) system designed to overcome these limitations by automating underwater object detection, analysis, and reporting. The system integrates YOLOv12 Nano for real-time object detection, a Convolutional Neural Network (CNN) (ResNet50) for feature extraction, Principal Component Analysis (PCA) for dimensionality reduction, and K-Means++ clustering for grouping marine objects based on visual characteristics. Furthermore, a Large Language Model (LLM) (GPT-4o Mini) is employed to generate structured reports and summaries of underwater findings, enhancing data interpretation. The system was trained and evaluated on a combined dataset of over 55,000 images from the DeepFish and OzFish datasets, capturing diverse Australian marine environments. Experimental results demonstrate the system's capability to detect marine objects with a mAP@0.5 of 0.512, a precision of 0.535, and a recall of 0.438. The integration of PCA effectively reduced feature dimensionality while preserving 98% variance, facilitating K-Means clustering which successfully grouped detected objects based on visual similarities. The LLM integration proved effective in generating insightful summaries of detections and clusters, supported by location data.
This integrated approach significantly reduces the risks associated with human diving, increases mission efficiency, and enhances the speed and depth of underwater data analysis, paving the way for more effective scientific research and discovery in challenging marine environments.
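
The abstract reports that PCA preserved 98% of the variance. Choosing how many principal components that takes is a short calculation over the eigenvalues of the feature covariance matrix, sketched below. The eigenvalues in the usage note are made-up numbers, and the function name is hypothetical; the paper does not state how many components were kept.

```python
def components_for_variance(eigenvalues, target=0.98):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)
```

For example, with illustrative eigenvalues `[90, 9, 0.5, 0.5]` the first two components already explain 99% of the variance, so two components suffice, whereas a flatter spectrum such as `[50, 30, 15, 5]` needs all four.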

[272] Optimization-Guided Diffusion for Interactive Scene Generation

Shiaho Li,Naisheng Ye,Tianyu Li,Kashyap Chitta,Tuo An,Peng Su,Boyang Wang,Haiou Liu,Chen Lv,Hongyang Li

Main category: cs.CV

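TL;DR: OMEGA is a training-free, optimization-guided framework that introduces constrained optimization into the sampling process of a diffusion model to generate physically plausible and behaviorally coherent multi-agent driving scenes, and is particularly suited to generating rare, safety-critical adversarial scenarios.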

Details Motivation: Existing data-driven driving scene generation methods fall short in controllability, physical plausibility, and interaction realism, and struggle to generate rare but important safety-critical events. Method: OMEGA re-anchors the generated trajectories through constrained optimization at every reverse diffusion step; ego-attacker interaction is modeled as a game-theoretic optimization in distribution space, and Nash equilibria are approximated to generate adversarial scenarios. Result: On nuPlan and Waymo, OMEGA raises the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% (free exploration) and from 11% to 80% (controllable generation), and generates 5× more near-collision frames (TTC < 3 s) while maintaining overall scene realism. Conclusion: With its optimization-guided, training-free approach, OMEGA generates driving scenes with high realism, consistency, and controllability, is especially effective for safety-critical adversarial traffic scenarios, and offers an efficient and reliable data augmentation option for autonomous driving evaluation. Abstract: Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events which are essential for this task are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
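
The near-collision metric above counts frames whose time-to-collision (TTC) is under three seconds. The abstract does not define how TTC is computed, so the sketch below uses a common simplification as an assumption: both agents are discs moving at constant velocity, and TTC is the first time their distance equals the sum of their radii.

```python
import math


def time_to_collision(rel_pos, rel_vel, radius):
    """Constant-velocity TTC between two discs.

    rel_pos, rel_vel: position and velocity of the other agent relative to ego.
    radius: sum of the two disc radii. Returns math.inf if they never touch.
    """
    px, py = rel_pos
    vx, vy = rel_vel
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - radius * radius
    if c <= 0:
        return 0.0          # already overlapping
    if a == 0 or b >= 0:
        return math.inf     # no relative motion, or moving apart
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return math.inf     # paths pass without touching
    return (-b - math.sqrt(disc)) / (2.0 * a)


def count_near_collision_frames(ttc_per_frame, threshold=3.0):
    """Number of frames whose TTC falls below the threshold (seconds)."""
    return sum(1 for ttc in ttc_per_frame if ttc < threshold)
```

For an agent 10 m ahead closing at 4 m/s with a combined radius of 2 m, the gap of 8 m closes in 2 s, so that frame would count as near-collision under the three-second threshold.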

[273] EgoCampus: Egocentric Pedestrian Eye Gaze Model and Dataset

Ronan John,Aditya Kesari,Vincenzo DiMatteo,Kristin Dana

Main category: cs.CV

TL;DR: This paper presents the EgoCampus dataset and the EgoCampusNet model for predicting the visual attention of pedestrians during outdoor navigation.

Details Motivation: Existing datasets mostly focus on indoor tasks or lack eye gaze data, which makes it hard to study visual attention in the real world. The authors therefore build an outdoor navigation dataset with rich gaze information and develop a corresponding gaze prediction model. Method: Egocentric gaze video is collected from more than 80 pedestrians walking on a campus to build the EgoCampus dataset; Meta's Project Aria glasses capture eye tracking, RGB images, inertial sensor data, and GPS; EgoCampusNet is then developed on this data to predict where pedestrians look as they move. Result: A large-scale outdoor gaze dataset, EgoCampus, covering 6 km, 25 paths, and more than 80 pedestrians, together with EgoCampusNet, a method for predicting visual attention during outdoor navigation. Conclusion: The work provides a new resource for studying visual attention in real scenes and supports the development of gaze prediction models for navigation. Abstract: We address the challenge of predicting human visual attention during real-world navigation by measuring and modeling egocentric pedestrian eye gaze in an outdoor campus setting. We introduce the EgoCampus dataset, which spans 25 unique outdoor paths over 6 km across a university campus with recordings from more than 80 distinct human pedestrians, resulting in a diverse set of gaze-annotated videos. The system used for collection, Meta's Project Aria glasses, integrates eye tracking, front-facing RGB cameras, inertial sensors, and GPS to provide rich data from the human perspective. Unlike many prior egocentric datasets that focus on indoor tasks or exclude eye gaze information, our work emphasizes visual attention while subjects walk in outdoor campus paths. Using this data, we develop EgoCampusNet, a novel method to predict eye gaze of navigating pedestrians as they move through outdoor environments. Our contributions provide both a new resource for studying real-world attention and a resource for future work in gaze prediction models for navigation. Dataset and code are available upon request, and will be made publicly available at a later date at https://github.com/ComputerVisionRutgers/EgoCampus .

[274] DIST-CLIP: Arbitrary Metadata and Image Guided MRI Harmonization via Disentangled Anatomy-Contrast Representations

Mehmet Yigit Avci,Pedro Borges,Virginia Fernandez,Paul Wright,Mehmet Yigitsoy,Sebastien Ourselin,Jorge Cardoso

Main category: cs.CV

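TL;DR: This paper proposes DIST-CLIP, a unified framework for MRI harmonization that uses CLIP guidance for disentangled style transfer and can be guided flexibly by either target images or DICOM metadata; it significantly outperforms existing methods on real clinical data.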

Details Motivation: Deep learning has great potential in medical image analysis, but domain shifts caused by data heterogeneity (in MRI: differences in scanner hardware, acquisition protocols, and sequence parameters) severely limit its clinical generalization. Existing harmonization methods either depend on target images or use simplistic labels that cannot capture complex acquisition details. Method: DIST-CLIP extracts contrast representations with pre-trained CLIP encoders and explicitly disentangles anatomical content from image contrast; a new Adaptive Style Transfer module integrates the contrast embeddings into the anatomical content, supporting guidance from either target images or DICOM metadata. Result: Trained and evaluated on diverse real-world clinical datasets, DIST-CLIP significantly outperforms state-of-the-art methods in both style translation fidelity and anatomical preservation. Conclusion: DIST-CLIP offers a flexible and efficient solution for style transfer and standardization of MRI data that addresses the data heterogeneity common in clinical environments. Abstract: Deep learning holds immense promise for transforming medical image analysis, yet its clinical generalization remains profoundly limited. A major barrier is data heterogeneity. This is particularly true in Magnetic Resonance Imaging, where scanner hardware differences, diverse acquisition protocols, and varying sequence parameters introduce substantial domain shifts that obscure underlying biological signals. Data harmonization methods aim to reduce this instrumental and acquisition variability, but existing approaches remain insufficient. When applied to imaging data, image-based harmonization approaches are often restricted by the need for target images, while existing text-guided methods rely on simplistic labels that fail to capture complex acquisition details or are typically restricted to datasets with limited variability, failing to capture the heterogeneity of real-world clinical environments. To address these limitations, we propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization that flexibly uses either target images or DICOM metadata for guidance. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. These contrast embeddings are then integrated into the anatomical content via a novel Adaptive Style Transfer module.
We trained and evaluated DIST-CLIP on diverse real-world clinical datasets, and showed significant improvements in performance when compared against state-of-the-art methods in both style translation fidelity and anatomical preservation, offering a flexible solution for style transfer and standardizing MRI data. Our code and weights will be made publicly available upon publication.

[275] sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only

Arslan Artykov,Corentin Sautier,Vincent Lepetit

Main category: cs.CV

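TL;DR: This paper presents the first data-driven method that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera; trained only on synthetic data, it generalizes well to real scenes.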

Details Motivation: Prior work mostly relies on multi-view setups, static scans, or fixed cameras, which are hard to apply in dynamic real-world scenes. A more practical and scalable way to understand articulated objects is needed. Method: A data-driven approach that operates directly on casually recorded monocular video, is trained on synthetic data, and jointly predicts part segmentation and joint parameters. Result: The method generalizes well to real-world objects and is suited to real-time applications in dynamic environments. Conclusion: The method offers a scalable and practical solution for articulated object understanding, with relevance to robotics and digital twins. Abstract: Understanding articulated objects is a fundamental challenge in robotics and digital twin creation. To effectively model such objects, it is essential to recover both part segmentation and the underlying joint parameters. Despite the importance of this task, previous work has largely focused on setups like multi-view systems, object scanning, or static cameras. In this paper, we present the first data-driven approach that jointly predicts part segmentation and joint parameters from monocular video captured with a freely moving camera. Trained solely on synthetic data, our method demonstrates strong generalization to real-world objects, offering a scalable and practical solution for articulated object understanding. Our approach operates directly on casually recorded video, making it suitable for real-time applications in dynamic environments. Project webpage: https://aartykov.github.io/sim2art/

[276] Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment

Sangha Park,Eunji Kim,Yeongtak Oh,Jooyoung Choi,Sungroh Yoon

Main category: cs.CV

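TL;DR: This paper proposes Negative Prompting for Image Correction (NPC), an automated method that improves alignment in text-to-image generation by generating and selecting negative prompts, and performs especially well on complex or imaginative prompts.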

Details Motivation: Despite substantial progress in text-to-image generation, precise text-image alignment remains difficult for prompts with rich compositional structure or imaginative elements. Method: Analyzes cross-attention patterns to identify and apply negative prompts that suppress unintended content; candidate negative prompts are generated with a verifier-captioner-proposer framework and ranked with a salient text-space score. Result: On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, scoring 0.571 on GenEval++ (vs. 0.371) and achieving the best overall performance on Imagine-Bench. Conclusion: NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models by guiding the model on what not to generate. Abstract: Despite substantial progress in text-to-image generation, achieving precise text-image alignment remains challenging, particularly for prompts with rich compositional structure or imaginative elements. To address this, we introduce Negative Prompting for Image Correction (NPC), an automated pipeline that improves alignment by identifying and applying negative prompts that suppress unintended content. We begin by analyzing cross-attention patterns to explain why both targeted negatives (those directly tied to the prompt's alignment error) and untargeted negatives (tokens unrelated to the prompt but present in the generated image) can enhance alignment. To discover useful negatives, NPC generates candidate prompts using a verifier-captioner-proposer framework and ranks them with a salient text-space score, enabling effective selection without requiring additional image synthesis. On GenEval++ and Imagine-Bench, NPC outperforms strong baselines, achieving 0.571 vs. 0.371 on GenEval++ and the best overall performance on Imagine-Bench. By guiding what not to generate, NPC provides a principled, fully automated route to stronger text-image alignment in diffusion models. Code is released at https://github.com/wiarae/NPC.

[277] PVeRA: Probabilistic Vector-Based Random Matrix Adaptation

Leo Fillioux,Enzo Ferrante,Paul-Henry Cournède,Maria Vakalopoulou,Stergios Christodoulidis

Main category: cs.CV

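TL;DR: This paper proposes PVeRA, a probabilistic variant of the VeRA adapter and its random low-rank matrices. The probabilistic formulation improves few-shot adaptation, and PVeRA outperforms VeRA and other adapters on the VTAB-1k benchmark.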

Details Motivation: Training and fine-tuning large models requires large amounts of data and compute, so parameter-efficient adaptation methods (such as adding trainable modules) have been proposed to reduce the cost. VeRA adapts efficiently by sharing frozen random low-rank matrices, but it cannot model uncertainty in the input. Method: PVeRA replaces the fixed low-rank matrices of VeRA with a probabilistic formulation, that is, distributions rather than fixed values, so that ambiguity in the input can be handled through different sampling strategies at training and test time while only a small number of parameters are updated. Result: Seven adapters were evaluated on the VTAB-1k benchmark; PVeRA outperforms VeRA and the other methods compared. Conclusion: By making the low-rank matrices probabilistic, PVeRA improves the expressiveness and flexibility of parameter-efficient fine-tuning, performs strongly in few-shot settings, and suggests a new direction for lightweight adapter design. Abstract: Large foundation models have emerged in recent years and are pushing performance boundaries for a variety of tasks. Training or even finetuning such models demands vast datasets and computational resources, which are often scarce and costly. Adaptation methods provide a computationally efficient solution to address these limitations by allowing such models to be finetuned on small amounts of data and computing power. This is achieved by appending new trainable modules to frozen backbones with only a fraction of the trainable parameters and fitting only these modules on novel tasks. Recently, the VeRA adapter was shown to excel in parameter-efficient adaptations by utilizing a pair of frozen random low-rank matrices shared across all layers. In this paper, we propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. This modification naturally allows handling inherent ambiguities in the input and allows for different sampling configurations during training and testing. A comprehensive evaluation was performed on the VTAB-1k benchmark and seven adapters, with PVeRA outperforming VeRA and other adapters. Our code for training models with PVeRA and benchmarking all adapters is available at https://github.com/leofillioux/pvera.
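
For readers unfamiliar with VeRA, the adapter update can be sketched in a few lines. As we understand the published VeRA formulation, the frozen layer output is augmented by `b * (B @ (d * (A @ x)))`, where `A` and `B` are frozen random matrices shared across layers and only the scaling vectors `d` and `b` are trained. The optional `noise` argument below is purely a hypothetical stand-in for a probabilistic step: the abstract does not say how PVeRA defines or samples its distributions, so this is not the authors' method.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def vera_delta(x, A, B, d, b, noise=None):
    """VeRA-style additive update for one layer.

    A: frozen random matrix (rank x in_dim); B: frozen random matrix (out_dim x rank).
    d (length rank) and b (length out_dim) are the only trainable parameters.
    noise: optional per-dimension perturbation of the rank-sized latent;
           a hypothetical placeholder for sampling, not taken from the paper.
    """
    latent = [di * zi for di, zi in zip(d, matvec(A, x))]
    if noise is not None:
        latent = [zi + ni for zi, ni in zip(latent, noise)]
    return [bi * yi for bi, yi in zip(b, matvec(B, latent))]
```

With `b` initialized to zeros the update vanishes, so the adapted model starts out identical to the frozen backbone; passing `noise=None` at test time and sampled noise during training is one way to read "different sampling configurations during training and testing", though the paper may do something else.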

[278] UnCageNet: Tracking and Pose Estimation of Caged Animal

Sayak Dutta,Harish Katti,Shashikant Verma,Shanmuganathan Raman

Main category: cs.CV

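TL;DR: Proposes a three-stage preprocessing pipeline that segments the cage with a Gabor-enhanced ResNet-UNet and inpaints it with CRFill, significantly improving animal tracking and pose estimation under occlusion.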

Details Motivation: Existing animal tracking and pose estimation systems (such as STEP and ViTPose) degrade substantially in the presence of cage structures and systematic occlusions, so these occlusions need to be handled effectively. Method: A three-stage preprocessing pipeline: (1) cage segmentation with a Gabor-enhanced ResNet-UNet; (2) content-aware inpainting with CRFill; (3) evaluation of pose estimation and tracking on the uncaged frames. Result: Experiments show that the method significantly improves keypoint detection accuracy and trajectory consistency, bringing performance under occlusion close to that in occlusion-free settings. Conclusion: The proposed preprocessing pipeline effectively mitigates the occlusion caused by cage structures and makes animal pose estimation and tracking more robust in complex environments. Abstract: Animal tracking and pose estimation systems, such as STEP (Simultaneous Tracking and Pose Estimation) and ViTPose, experience substantial performance drops when processing images and videos with cage structures and systematic occlusions. We present a three-stage preprocessing pipeline that addresses this limitation through: (1) cage segmentation using a Gabor-enhanced ResNet-UNet architecture with tunable orientation filters, (2) cage inpainting using CRFill for content-aware reconstruction of occluded regions, and (3) evaluation of pose estimation and tracking on the uncaged frames. Our Gabor-enhanced segmentation model leverages orientation-aware features with 72 directional kernels to accurately identify and segment cage structures that severely impair the performance of existing methods. Experimental validation demonstrates that removing cage occlusions through our pipeline enables pose estimation and tracking performance comparable to that in environments without occlusions. We also observe significant improvements in keypoint detection accuracy and trajectory consistency.
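
The segmentation stage relies on 72 directional Gabor kernels. A bank of that kind can be built from the standard real-valued Gabor function, as in the sketch below. Only the count of 72 orientations comes from the abstract; the kernel size, `sigma`, `wavelength`, `gamma`, and the even spacing of orientations over 180 degrees are assumptions chosen for illustration, and the paper's filters are described as tunable.

```python
import math


def gabor_kernel(size, theta, sigma=2.0, wavelength=4.0, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter as a size x size list of lists (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier wave runs along angle theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(num_orientations=72, size=9):
    """Kernels at evenly spaced orientations over [0, pi) (2.5 degrees apart for 72)."""
    return [gabor_kernel(size, k * math.pi / num_orientations)
            for k in range(num_orientations)]
```

Each kernel responds most strongly to thin, straight structures at its own orientation, which is why a dense bank of orientations is a natural fit for cage bars and wire mesh at arbitrary angles.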

[279] ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

Fan Yang,Heyuan Li,Peihao Li,Weihao Yuan,Lingteng Qiu,Chaoyue Song,Cheng Chen,Yisheng He,Shifeng Zhang,Xiaoguang Han,Steven Hoi,Guosheng Lin

Main category: cs.CV

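TL;DR: This paper proposes a new method that combines a 3D reconstruction model with a real-time autoregressive video diffusion model to generate high-fidelity upper-body 3D avatars from a single image. The 3D reconstruction supplies structural and appearance priors that guide the video diffusion model toward realistic detail and fluid motion, reducing texture blur, motion stiffness, and structural inconsistency, and significantly outperforming existing methods in visual quality and stability.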

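Details Motivation: Existing 3D avatar generation methods fall short in texture sharpness and motion naturalness: reconstruction-based methods tend to produce blurry textures and stiff motion, while generative video models produce realistic, dynamic results but often suffer from structural errors and identity drift. A solution that offers both geometric stability and generation quality is therefore needed. Method: A hybrid framework uses a 3D reconstruction model to extract robust structural and appearance priors, which guide a real-time autoregressive video diffusion model for rendering. Injecting the structural priors into the generation process enables real-time synthesis of high-frequency detail and fluid motion while keeping the body structure consistent. Result: Experiments show that the method significantly outperforms current state-of-the-art methods at reducing artifacts such as texture blur, motion stiffness, and structural instability; the generated avatars have higher visual quality and temporal coherence and suit real-time applications such as gaming and virtual reality. Conclusion: The method combines the structural stability of 3D reconstruction models with the high-quality dynamic synthesis of video generation models, achieving real-time generation of high-fidelity, dynamically consistent 3D avatars from a single input image and providing an efficient and robust solution for practical applications. Abstract: Generating high-fidelity upper-body 3D avatars from a one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guides a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. 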
Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa

[280] Improving action classification with brain-inspired deep networks

Aidas Aglinskas,Stefano Anzellotti

Main category: cs.CV

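TL;DR: This study examines how deep neural networks (DNNs) use body and background information in action recognition, and proposes a brain-inspired two-stream architecture with separate processing pathways that integrates body and scene information more effectively, improving recognition performance and making the model behave more like humans.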

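Details Motivation: It is unclear how DNNs use body and background information in action recognition, and because the two are correlated within datasets, DNNs may come to rely mainly on one of them. Humans, by contrast, have dedicated brain regions for processing bodies and scenes, and so may be better at using both. The study asks whether brain-like, domain-specific network structures can yield better and more human-like action recognition. Method: DNNs are trained on the HAA500 dataset and tested on full stimuli, body-removed stimuli, and background-removed stimuli; 28 human participants perform the same task; a new architecture with separate pathways for body and background information is then designed and evaluated. Result: Standard DNNs still perform well after the body is removed but fall to chance level after the background is removed. Humans recognize actions accurately under all conditions and perform better with the body alone than with the background alone. The proposed two-stream architecture not only improves recognition accuracy, but its pattern of performance across stimulus conditions is also closer to that of humans. Conclusion: A brain-inspired two-stream architecture that processes body and background information through separate pathways fuses the two sources more effectively, improves action recognition, and brings the model's behavior closer to human behavior, indicating that borrowing organizational principles from the brain can help improve the design of AI systems. Abstract: Action recognition is also key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from the body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) make use of information about the body and information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them, without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained using the HAA500 dataset perform almost as accurately on versions of the stimuli that show both body and background and on versions of the stimuli from which the body was removed, but are at chance-level for versions of the stimuli from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. 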
Finally, we implement and test a novel architecture patterned after domain specificity in the brain with separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across different versions of the stimuli follows a pattern that matches more closely the pattern of accuracy observed in human participants.

[281] SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination

Sangha Park,Seungryong Yoo,Jisoo Mok,Sungroh Yoon

Main category: cs.CV

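TL;DR: Proposes SAVE, a framework that uses sparse autoencoders to enhance visual information and reduce object hallucination in multimodal large language models.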

Details Motivation: Multimodal large language models are affected by language priors and loss of visual information, which leads to object hallucination. Method: A sparse autoencoder (SAE) is built to extract latent features, a binary question-answering probe identifies the features most important for visual understanding, and at inference time the model is steered along them to strengthen its processing of visual information. Result: SAVE significantly outperforms existing training-free methods on several benchmarks including CHAIR_S, POPE, and MMHal-Bench, with a 10 percentage point improvement on CHAIR_S; experiments across models and layers confirm its robustness and generalizability. Conclusion: By steering along vision-related SAE features, SAVE effectively suppresses hallucination and improves the model's faithfulness to image content, while remaining simple, efficient, and scalable. Abstract: Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
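The abstract describes two steps, probing for "visual understanding" SAE features and steering along them, without giving formulas. The sketch below shows one common way such steering is done: rank features by the gap in mean activation between two probe groups, then add the chosen features' decoder directions to a hidden state. The ranking rule, the scale `alpha`, and every name here are assumptions for illustration; SAVE's actual procedure may differ.

```python
def select_features(acts_grounded, acts_ungrounded, k):
    """Rank SAE features by mean activation gap between two probe groups.

    acts_grounded / acts_ungrounded: lists of per-example SAE activation
    vectors (hypothetical probe outputs). Returns the top-k feature indices.
    """
    num_features = len(acts_grounded[0])

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    gaps = [
        mean(acts_grounded, j) - mean(acts_ungrounded, j)
        for j in range(num_features)
    ]
    return sorted(range(num_features), key=lambda j: gaps[j], reverse=True)[:k]


def steer_hidden_state(hidden, decoder_rows, feature_ids, alpha=1.0):
    """Add alpha times each selected feature's decoder direction to hidden."""
    steered = list(hidden)
    for fid in feature_ids:
        for i, component in enumerate(decoder_rows[fid]):
            steered[i] += alpha * component
    return steered


# Toy example: 3 SAE features, hidden size 2.
top = select_features([[1, 0, 3], [1, 0, 5]], [[1, 0, 0], [1, 2, 0]], k=1)
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top, steer_hidden_state([0.0, 0.0], decoder, top, alpha=0.5))
```

Because steering only adds a fixed vector at inference time, no model weights change, which is consistent with the training-free framing in the abstract.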

[282] SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery

Meng Cao,Xingyu Li,Xue Liu,Ian Reid,Xiaodan Liang

Main category: cs.CV

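TL;DR: This paper proposes SpatialDreamer, a reinforcement learning framework that improves the performance of multimodal large language models on complex spatial reasoning tasks through active exploration, visual imagination, and evidence-grounded reasoning.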

Details Motivation: Existing MLLMs perform poorly on complex spatial reasoning tasks that require mental simulation; they rely mainly on passive observation and lack an active process of generating mental imagery. Method: The SpatialDreamer framework combines active exploration, visual imagination through a world model, and evidence-grounded reasoning. It is trained with Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation under geometric consistency constraints. Result: Extensive experiments on several challenging benchmarks show that SpatialDreamer achieves highly competitive results. Conclusion: SpatialDreamer significantly improves the ability of MLLMs to perform human-like active spatial mental simulation, advancing complex scene understanding. Abstract: Despite advancements in Multi-modal Large Language Models (MLLMs) for scene understanding, their performance on complex spatial reasoning tasks requiring mental simulation remains significantly limited. Current methods often rely on passive observation of spatial data, failing to internalize an active mental imagery process. To bridge this gap, we propose SpatialDreamer, a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration, visual imagination via a world model, and evidence-grounded reasoning. To address the lack of fine-grained reward supervision in long-horizon reasoning tasks, we propose Geometric Policy Optimization (GeoPO), which introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints. Extensive experiments demonstrate that SpatialDreamer delivers highly competitive results across multiple challenging benchmarks, signifying a critical advancement in human-like active spatial mental simulation for MLLMs.

[283] HLTCOE Evaluation Team at TREC 2025: VQA Track

Dengjia Zhang,Charles Weng,Katherine Guerrerio,Yi Lu,Kenton Murray,Alexander Martin,Reno Kriz,Benjamin Van Durme

Main category: cs.CV

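TL;DR: Proposes a listwise learning framework that combines generation with discriminative ranking and, through a novel masked pointer cross-entropy loss with rank weights, improves the semantic precision and ranking consistency of answer generation in video question answering.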

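Details Motivation: To improve the semantic precision and ranking consistency of answer generation in video question answering, and in particular to perform more stably on questions that require temporal reasoning and semantic disambiguation. Method: A base multimodal model generates multiple candidate answers, which are then reranked by a listwise ranking model trained with a masked pointer cross-entropy loss with rank weights; the loss combines a pointer mechanism, rank weighting, and cross-entropy under vocabulary restriction. Result: Experiments show consistent gains in accuracy and ranking stability, especially on tasks that require temporal reasoning and semantic understanding. Conclusion: The proposed method effectively combines the strengths of generation and discrimination, producing more coherent, fine-grained answer lists and significantly improving answer generation quality in video question answering. Abstract: The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.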

[284] DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving

Jialv Zou,Shaoyu Chen,Bencheng Liao,Zhiyu Zheng,Yuehao Song,Lefei Zhang,Qian Zhang,Wenyu Liu,Xinggang Wang

Main category: cs.CV

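TL;DR: This paper proposes DiffusionDriveV2, a reinforcement-learning-based generative diffusion model for end-to-end autonomous driving that, by introducing scale-adaptive noise and a modified GRPO algorithm, significantly improves the quality of generated trajectories while preserving behavioral diversity.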

Details Motivation: Existing diffusion models for autonomous driving are prone to mode collapse, producing overly conservative behaviors that lack diversity, while existing methods that pursue diversity struggle to keep generation quality consistent. Method: DiffusionDriveV2 adopts a reinforcement learning framework and retains a Gaussian Mixture Model to preserve multimodality. It uses scale-adaptive multiplicative noise to promote broad exploration, and it designs intra-anchor GRPO and inter-anchor truncated GRPO to estimate advantages sensibly under different driving intentions and avoid improper comparisons across intentions. Result: In closed-loop evaluation the method achieves 91.2 PDMS on NAVSIM v1 and 85.5 EPDMS on NAVSIM v2, setting a new record; experiments show that it effectively resolves the trade-off between diversity and high quality. Conclusion: Through reinforcement learning and an improved advantage estimation strategy, DiffusionDriveV2 balances the diversity and quality of generated trajectories, offering a better solution for applying diffusion models to end-to-end autonomous driving. Abstract: Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse, tending to generate conservative and homogeneous behaviors. While DiffusionDrive employs predefined anchors representing different driving intentions to partition the action space and generate diverse trajectories, its reliance on imitation learning lacks sufficient constraints, resulting in a dilemma between diversity and consistent high quality. In this work, we propose DiffusionDriveV2, which leverages reinforcement learning to both constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model. First, we use scale-adaptive multiplicative noise, ideal for trajectory planning, to promote broad exploration. Second, we employ intra-anchor GRPO to manage advantage estimation among samples generated from a single anchor, and inter-anchor truncated GRPO to incorporate a global perspective across different anchors, preventing improper advantage comparisons between distinct intentions (e.g., turning vs. going straight), which can lead to further mode collapse. DiffusionDriveV2 achieves 91.2 PDMS on the NAVSIM v1 dataset and 85.5 EPDMS on the NAVSIM v2 dataset in closed-loop evaluation with an aligned ResNet-34 backbone, setting a new record. Further experiments validate that our approach resolves the dilemma between diversity and consistent high quality for truncated diffusion models, achieving the best trade-off. 
Code and model will be available at https://github.com/hustvl/DiffusionDriveV2
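The intra-anchor GRPO step described in the abstract can be illustrated with a minimal sketch, assuming it follows the usual group-relative recipe: each sample's reward is normalized by the mean and standard deviation of the rewards from the same anchor, so a turning trajectory is never scored against a straight one. The function name, the `eps` constant, and the use of the population standard deviation are assumptions; the inter-anchor truncated variant is not reproduced because the abstract does not specify it.

```python
import statistics
from collections import defaultdict


def intra_anchor_advantages(rewards, anchor_ids, eps=1e-6):
    """Group-relative advantages computed separately within each anchor.

    rewards: one scalar reward per sampled trajectory.
    anchor_ids: the anchor (driving intention) each trajectory came from.
    """
    groups = defaultdict(list)
    for index, anchor in enumerate(anchor_ids):
        groups[anchor].append(index)

    advantages = [0.0] * len(rewards)
    for indices in groups.values():
        values = [rewards[i] for i in indices]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        for i in indices:
            # Normalize against same-anchor samples only.
            advantages[i] = (rewards[i] - mean) / (std + eps)
    return advantages


# Two anchors with very different reward scales.
print(intra_anchor_advantages([1.0, 3.0, 10.0, 10.0], [0, 0, 1, 1]))
```

In the toy call, the second anchor's uniformly high rewards give zero advantage rather than dominating the first anchor's samples, which is the cross-intention comparison the abstract says should be avoided.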

[285] Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation

Shihao Zhao,Yitong Chen,Zeyinzi Jiang,Bojia Zi,Shaozhe Hao,Yu Liu,Chaojie Mao,Kwan-Yee K. Wong

Main category: cs.CV

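TL;DR: This paper proposes Unison, a low-cost, efficient model for unified multimodal understanding and generation. Using a two-stage scheme, it preserves the capabilities of the pre-trained models while covering a wide range of understanding and generation tasks, and it automatically parses user intent and extracts task-related meta-information for fully automatic processing.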

Details Motivation: Existing approaches to unified multimodal understanding and generation are either too costly or limited in task coverage and generation quality, and they cannot automatically parse input meta-information, so parameters must be configured by hand. A low-cost, high-performance, and intelligent solution is therefore needed. Method: Unison adopts a two-stage framework that connects pre-trained understanding and generation models and aligns them with low-cost fine-tuning. A dedicated mechanism automatically parses the task type and meta-information (such as image resolution and video duration) from user input, so tasks run fully automatically without human intervention. Result: Under a low-cost setting of only 500k training samples and 50 GPU hours, the model accurately identifies task types, extracts parameters, and achieves superior performance across a variety of multimodal understanding and generation tasks. Conclusion: At very low training cost, Unison achieves broad multimodal task coverage and high-quality generation together with automatic intent parsing and meta-information extraction, moving multimodal systems toward greater intelligence and automation. Abstract: Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. 
Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.

[286] UltrasODM: A Dual Stream Optical Flow Mamba Network for 3D Freehand Ultrasound Reconstruction

Mayank Anand,Ujair Alam,Surya Prakash,Priya Shukla,Gora Chand Nandi,Domenec Puig

Main category: cs.CV

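TL;DR: UltrasODM is a dual-stream framework that assists sonographers with per-frame uncertainty estimation, saliency-based diagnostics, and actionable prompts, improving the reliability and clinical usability of ultrasound reconstruction.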

Details Motivation: Clinical ultrasound acquisition is highly operator-dependent; rapid probe motion and brightness changes often cause reconstruction errors that reduce trust and clinical utility. Method: UltrasODM has three core modules: a contrastive ranking module that groups frames, 6-DoF pose estimation that fuses optical flow with Dual-Mamba temporal modules, and a Human-in-the-Loop (HITL) layer that combines Bayesian uncertainty with saliency maps and issues prompts when a threshold is exceeded. Result: On a clinical freehand ultrasound dataset, drift is reduced by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while per-frame uncertainty and saliency maps are produced. Conclusion: By emphasizing transparency and clinician feedback, UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows. Abstract: Clinical ultrasound acquisition is highly operator-dependent, where rapid probe motion and brightness fluctuations often lead to reconstruction errors that reduce trust and clinical utility. We present UltrasODM, a dual-stream framework that assists sonographers during acquisition through calibrated per-frame uncertainty, saliency-based diagnostics, and actionable prompts. UltrasODM integrates (i) a contrastive ranking module that groups frames by motion similarity, (ii) an optical-flow stream fused with Dual-Mamba temporal modules for robust 6-DoF pose estimation, and (iii) a Human-in-the-Loop (HITL) layer combining Bayesian uncertainty, clinician-calibrated thresholds, and saliency maps highlighting regions of low confidence. When uncertainty exceeds the threshold, the system issues unobtrusive alerts suggesting corrective actions such as re-scanning highlighted regions or slowing the sweep. Evaluated on a clinical freehand ultrasound dataset, UltrasODM reduces drift by 15.2%, distance error by 12.1%, and Hausdorff distance by 10.1% relative to UltrasOM, while producing per-frame uncertainty and saliency outputs. By emphasizing transparency and clinician feedback, UltrasODM improves reconstruction reliability and supports safer, more trustworthy clinical workflows. Our code is publicly available at https://github.com/AnandMayank/UltrasODM.
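The alerting behavior in the HITL layer can be sketched as a simple threshold check over per-frame uncertainty. The threshold value, the message wording, and the function names below are hypothetical; how UltrasODM computes or calibrates uncertainty is not shown here.

```python
def frames_needing_alert(uncertainties, threshold):
    """Indices of frames whose uncertainty exceeds the threshold."""
    return [i for i, value in enumerate(uncertainties) if value > threshold]


def alert_message(frame_indices):
    """Build an unobtrusive prompt, or return None when nothing is flagged."""
    if not frame_indices:
        return None
    frames = ", ".join(str(i) for i in frame_indices)
    return f"Low confidence in frames {frames}: consider re-scanning or slowing the sweep."


flagged = frames_needing_alert([0.1, 0.6, 0.3, 0.9], threshold=0.5)
print(alert_message(flagged))
```

The threshold is kept as an explicit argument because the abstract describes it as clinician-calibrated rather than fixed.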

[287] Modality-Aware Bias Mitigation and Invariance Learning for Unsupervised Visible-Infrared Person Re-Identification

Menglin Wang,Xiaojin Gong,Jiachen Li,Genlin Ji

Main category: cs.CV

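TL;DR: This paper proposes an unsupervised visible-infrared person re-identification method that mitigates modality bias with a modality-aware Jaccard distance and global clustering, and introduces a "split-and-contrast" strategy to learn modality-invariant yet identity-discriminative representations, significantly improving cross-modality matching.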

Details Motivation: When estimating reliable associations between the visible and infrared modalities, existing methods are vulnerable to local clustering errors and overlook global instance-level relations, so they cope poorly with the distance bias caused by the modality gap. Method: A modality-aware Jaccard distance mitigates the distance bias caused by modality discrepancy, and global clustering yields more reliable cross-modality associations. A "split-and-contrast" strategy builds modality-specific global prototypes, which are aligned under global association guidance to learn modality-invariant and identity-discriminative features. Result: The method achieves state-of-the-art performance on several benchmark VI-ReID datasets, significantly outperforming existing methods. Conclusion: Through global association and alignment of modality-specific prototypes, the method effectively addresses modality bias in unsupervised visible-infrared person re-identification and achieves more robust cross-modality matching. Abstract: Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match individuals across visible and infrared cameras without relying on any annotation. Given the significant gap across visible and infrared modality, estimating reliable cross-modality association becomes a major challenge in USVI-ReID. Existing methods usually adopt optimal transport to associate the intra-modality clusters, which is prone to propagating the local cluster errors, and also overlooks global instance-level relations. By mining and attending to the visible-infrared modality bias, this paper focuses on addressing cross-modality learning from two aspects: bias-mitigated global association and modality-invariant representation learning. Motivated by the camera-aware distance rectification in single-modality re-ID, we propose modality-aware Jaccard distance to mitigate the distance bias caused by modality discrepancy, so that more reliable cross-modality associations can be estimated through global clustering. To further improve cross-modality representation learning, a "split-and-contrast" strategy is designed to obtain modality-specific global prototypes. By explicitly aligning these prototypes under global association guidance, modality-invariant yet ID-discriminative representation learning can be achieved. While conceptually simple, our method obtains state-of-the-art performance on benchmark VI-ReID datasets and outperforms existing methods by a significant margin, validating its effectiveness.
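The abstract does not define the modality-aware Jaccard distance. The sketch below shows one plausible reading, stated as an assumption: each sample's neighbor set takes its k nearest samples from every modality separately, so same-modality samples cannot crowd out cross-modality neighbors, and the usual Jaccard distance is then computed between neighbor sets. The function names and the per-modality top-k rule are illustrative and may not match the paper's formulation.

```python
def modality_aware_neighbors(dist_row, modalities, k):
    """Union of the k nearest samples from each modality (assumed rule)."""
    neighbors = set()
    for modality in set(modalities):
        candidates = [j for j, m in enumerate(modalities) if m == modality]
        candidates.sort(key=lambda j: dist_row[j])
        neighbors.update(candidates[:k])
    return neighbors


def jaccard_distance(set_a, set_b):
    """1 - |A intersect B| / |A union B|; defined as 0.0 for two empty sets."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Toy example: samples 0-1 are visible, samples 2-3 are infrared.
modalities = ["vis", "vis", "ir", "ir"]
row_0 = [0.0, 0.2, 0.9, 0.8]
row_3 = [0.8, 0.7, 0.3, 0.0]
n0 = modality_aware_neighbors(row_0, modalities, k=1)
n3 = modality_aware_neighbors(row_3, modalities, k=1)
print(sorted(n0), sorted(n3), jaccard_distance(n0, n3))
```

With a plain top-k rule, a visible sample's neighbors would almost all be visible, since raw cross-modality distances are inflated by the modality gap; forcing each modality to contribute neighbors is one way to counter that bias before clustering.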

[288] GorillaWatch: An Automated System for In-the-Wild Gorilla Re-Identification and Population Monitoring

Maximilian Schall,Felix Leonard Knöfel,Noah Elias König,Jan Jonas Kubeler,Maximilian von Klinski,Joan Wilhelm Linnemann,Xiaoshi Liu,Iven Jelle Schlegelmilch,Ole Woyciniuk,Alexandra Schild,Dante Wasmuht,Magdalena Bermejo Espinet,German Illera Basas,Gerard de Melo

Main category: cs.CV

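TL;DR: This paper introduces three new datasets for wild gorilla re-identification and GorillaWatch, an end-to-end analysis pipeline that combines self-supervised pretraining with interpretability-based verification to achieve efficient, scientifically valid individual identification and population counting.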

Details Motivation: Current gorilla monitoring depends on extensive manual processing of camera trap footage, and the lack of large-scale in-the-wild video datasets suitable for training deep learning models has held back automated identification. Method: Three new datasets are built: Gorilla-SPAC-Wild, Gorilla-Berlin-Zoo, and Gorilla-SPAC-MoT. The GorillaWatch end-to-end pipeline integrates detection, tracking, and re-identification; multi-frame self-supervised pretraining exploits temporal consistency within tracklets; a differentiable AttnLRP verifies that the model attends to biometric traits rather than the background; and spatiotemporal constraints improve unsupervised clustering for population counting. Result: Experiments show that aggregating features from large-scale image backbones outperforms specialized video architectures; self-supervised pretraining improves cross-domain generalization; AttnLRP confirms that model decisions are based on biometric traits; and the improved clustering mitigates over-segmentation and raises the accuracy of unsupervised counting. Conclusion: The datasets and methods significantly advance non-invasive automated monitoring of endangered primates, provide a scalable technical framework for wildlife conservation, and are released publicly with code and data to support further research. Abstract: Monitoring critically endangered western lowland gorillas is currently hampered by the immense manual effort required to re-identify individuals from vast archives of camera trap footage. The primary obstacle to automating this process has been the lack of large-scale, "in-the-wild" video datasets suitable for training robust deep learning models. To address this gap, we introduce a comprehensive benchmark with three novel datasets: Gorilla-SPAC-Wild, the largest video dataset for wild primate re-identification to date; Gorilla-Berlin-Zoo, for assessing cross-domain re-identification generalization; and Gorilla-SPAC-MoT, for evaluating multi-object tracking in camera trap footage. Building on these datasets, we present GorillaWatch, an end-to-end pipeline integrating detection, tracking, and re-identification. To exploit temporal information, we introduce a multi-frame self-supervised pretraining strategy that leverages consistency in tracklets to learn domain-specific features without manual labels. To ensure scientific validity, a differentiable adaptation of AttnLRP verifies that our model relies on discriminative biometric traits rather than background correlations. Extensive benchmarking subsequently demonstrates that aggregating features from large-scale image backbones outperforms specialized video architectures. Finally, we address unsupervised population counting by integrating spatiotemporal constraints into standard clustering to mitigate over-segmentation. 
We publicly release all code and datasets to facilitate scalable, non-invasive monitoring of endangered species.

[289] Distribution Matching Variational AutoEncoder

Sen Ye,Jianning Pei,Mengde Xu,Shuyang Gu,Chunyu Wang,Liwei Wang,Han Hu

Main category: cs.CV

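TL;DR: This paper proposes the Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution, removing the Gaussian-prior restriction of conventional VAEs. It finds that distributions derived from self-supervised learning give the best balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs.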

Details Motivation: The latent distributions of existing visual generative models are not explicitly shaped, which makes it hard to determine the best form of distribution; a flexible mechanism is needed to study how different latent distributions affect modeling performance. Method: DMVAE introduces a distribution matching constraint that explicitly aligns the encoder's latent distribution with an arbitrary reference distribution (such as self-supervised features or diffusion noise), going beyond the Gaussian prior assumed by conventional VAEs. Result: DMVAE reaches a gFID of 3.2 on ImageNet with only 64 epochs; experiments show that latent distributions based on self-supervised learning (SSL) better balance reconstruction quality and modeling efficiency. Conclusion: Choosing a suitable latent distribution structure, achieved through distribution-level alignment, does more than a fixed prior to bridge the gap between easy-to-model latents and high-fidelity image generation. Abstract: Most visual generative models compress images into a latent space before applying diffusion or autoregressive modelling. Yet, existing approaches such as VAEs and foundation model aligned encoders implicitly constrain the latent space without explicitly shaping its distribution, making it unclear which types of distributions are optimal for modeling. We introduce Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at https://github.com/sen-ye/dmvae.

[290] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Zhaochong An,Menglin Jia,Haonan Qiu,Zijian Zhou,Xiaoke Huang,Zhiheng Liu,Weiming Ren,Kumara Kahatapitiya,Ding Liu,Sen He,Chenyang Zhang,Tao Xiang,Fanny Yang,Serge Belongie,Tian Xie

Main category: cs.CV

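TL;DR: This paper proposes OneStory, which reformulates multi-shot video generation as a next-shot generation task and achieves coherent long-form narrative generation built on pretrained image-to-video models.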

Details Motivation: Existing methods struggle to model long-range cross-shot context in complex narratives, which degrades generation quality. Method: A Frame Selection module builds a global memory, and an Adaptive Conditioner performs importance-guided patchification to produce compact context; the next-shot generation paradigm is adopted, with autoregressive synthesis built on a pretrained I2V model. Result: After finetuning on a 60K high-quality multi-shot dataset, OneStory achieves state-of-the-art narrative coherence in both text- and image-conditioned settings. Conclusion: OneStory enables controllable and immersive long-form video storytelling and significantly improves the coherence and scalability of multi-shot video. Abstract: Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.

[291] Multi-view Pyramid Transformer: Look Coarser to See Broader

Gyeongjin Kang,Seungkwon Yang,Seungtae Nam,Younggeun Lee,Jungwoo Kim,Eunbyung Park

Main category: cs.CV

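TL;DR: Proposes the Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture whose dual hierarchy reconstructs large 3D scenes from tens to hundreds of images in a single forward pass.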

Details Motivation: Reconstructing large-scale 3D scenes from many images efficiently and at high quality requires balancing computational efficiency and representational capacity. Method: MVP combines a local-to-global inter-view hierarchy with a fine-to-coarse intra-view hierarchy, progressively integrating multi-view information and compressing the spatial representation. Result: MVP is validated on diverse datasets; coupled with a 3D Gaussian Splatting representation, it reaches state-of-the-art generalizable reconstruction quality, efficiency, and scalability. Conclusion: Through its dual-hierarchy design, MVP achieves fast, high-quality 3D reconstruction of large and complex scenes, with good scalability and practical potential. Abstract: We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.

[292] Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes

Shai Krakovsky,Gal Fiebelman,Sagie Benaim,Hadar Averbuch-Elor

Main category: cs.CV

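TL;DR: Proposes a new 3D representation that uses extremely low-dimensional semantic bottleneck features and a multi-resolution hash encoder to improve the efficiency and quality of language-geometry alignment in large-scale scenes.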

Details Motivation: Existing feature distillation methods suffer from semantic feature misalignment and poor memory and runtime efficiency when applied to massive Internet data, which makes efficient 3D language embedding difficult. Method: Extremely low-dimensional semantic bottleneck features are introduced as part of the 3D Gaussian representation and, after rendering, are processed by a multi-resolution feature hash encoder. An Attenuated Downsampler module and several regularizations are proposed to mitigate the semantic misalignment of the 2D ground-truth features. Result: The method is validated on the in-the-wild HolyScenes dataset and improves on existing methods in both performance and efficiency (runtime and GPU memory). Conclusion: The method effectively addresses the challenges of semantic alignment and computational efficiency in large-scale scenes, offering a more efficient and scalable solution for 3D scene understanding with natural language interaction. Abstract: Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.

[293] WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling

Shaoheng Fang,Hanwen Jiang,Yunpeng Bai,Niloy J. Mitra,Qixing Huang

Main category: cs.CV

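TL;DR: WorldReel is a 4D video generation model that achieves spatio-temporal consistency by jointly generating RGB frames and 4D scene representations (such as pointmaps, camera trajectory, and optical flow), keeping geometry and appearance coherent under dynamic scenes and camera motion.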

Details Motivation: Existing video generation models lack 3D consistency and struggle to model the complex dynamics and viewpoint changes of the real world, so a generation method with native spatio-temporal consistency is needed. Method: WorldReel jointly generates RGB frames and 4D scene representations (pointmaps, camera trajectory, dense optical flow); synthetic data provides precise 4D supervision, and real videos add visual diversity and realism. Result: WorldReel outperforms existing methods on geometric consistency, motion coherence, and reduction of view-time artifacts, achieving state-of-the-art video generation for dynamic scenes with moving cameras. Conclusion: WorldReel moves video generation toward 4D-consistent world modeling, opening the way for future agents to render, interact with, and reason about scenes through a stable spatiotemporal representation. Abstract: Recent video generators achieve striking photorealism, yet remain fundamentally inconsistent in 3D. We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. WorldReel jointly produces RGB frames together with 4D scene representations, including pointmaps, camera trajectory, and dense flow mapping, enabling coherent geometry and appearance modeling over time. Our explicit 4D representation enforces a single underlying scene that persists across viewpoints and dynamic content, yielding videos that remain consistent even under large non-rigid motion and significant camera movement. We train WorldReel by carefully combining synthetic and real data: synthetic data provides precise 4D supervision (geometry, motion, and camera), while real videos contribute visual diversity and realism. This blend allows WorldReel to generalize to in-the-wild footage while preserving strong geometric fidelity. Extensive experiments demonstrate that WorldReel sets a new state-of-the-art for consistent video generation with dynamic scenes and moving cameras, improving geometric consistency and motion coherence and reducing view-time artifacts relative to competing methods. We believe that WorldReel brings video generation closer to 4D-consistent world modeling, where agents can render, interact, and reason about scenes through a single and stable spatiotemporal representation.

[294] OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing

Haoyang He,Jie Wang,Jiangning Zhang,Zhucun Xue,Xingyuan Bu,Qiangpeng Yang,Shilei Wen,Lei Xie

Main category: cs.CV

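TL;DR: This paper introduces OpenVE-3M, a large-scale, high-quality open-source dataset for instruction-based video editing, together with the unified OpenVE-Bench benchmark, and trains the 5B-parameter OpenVE-Edit model, which outperforms existing open-source models.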

Details Motivation: Existing instruction-based video editing datasets fall short in scale, diversity, and quality, and the field lacks a unified evaluation benchmark, which limits progress. Method: A data generation pipeline with rigorous quality filtering is designed to build the OpenVE-3M dataset covering multiple edit types, and the multi-task OpenVE-Bench evaluation set is established; a 5B-parameter video editing model, OpenVE-Edit, is trained on this data. Result: OpenVE-3M surpasses existing open-source datasets in scale, diversity of edit types, instruction length, and quality; OpenVE-Edit outperforms all prior open-source models on OpenVE-Bench, including a 14B baseline, reaching state-of-the-art performance. Conclusion: OpenVE-3M and OpenVE-Bench provide an important data and evaluation foundation for instruction-based video editing, and OpenVE-Edit confirms the dataset's effectiveness, advancing the field. Abstract: The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, containing 431 video-edit pairs that cover a diverse range of editing tasks with three key metrics highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state-of-the-art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. Project page is at https://github.com/lewandofskee/OpenVE.

[295] One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Yuan Gao,Chen Chen,Tianrong Chen,Jiatao Gu

Main category: cs.CV

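TL;DR: This paper proposes FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pretrained visual representations into low-dimensional latents suitable for generation; by coupling two deep decoders, it achieves strong performance with both diffusion models and normalizing flows.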

Details Motivation: The latent spaces of existing generative models are fundamentally mismatched with understanding-oriented, high-dimensional feature spaces, which makes it hard to use high-quality pretrained representations directly; this paper aims to resolve that representation-alignment problem. Method: FAE encodes pretrained features into low-dimensional latents with a single attention layer and couples two decoders: one reconstructs the original feature space, and the other generates images from the reconstructed features. Result: On benchmarks such as ImageNet 256x256, FAE reaches near state-of-the-art or state-of-the-art FID scores both with and without CFG (e.g., FID 1.29 and 1.48 at 800 epochs), and it learns faster. Conclusion: FAE offers a generic, simple way to bridge the gap between understanding-oriented features and generation-friendly latent spaces, and it applies to a variety of self-supervised encoders and generative model families. Abstract: Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance.
For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.

[296] UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation

Jiehui Huang,Yuechen Zhang,Xu He,Yuan Gao,Zhi Cen,Bin Xia,Yan Zhou,Xin Tao,Pengfei Wan,Jiaya Jia

Main category: cs.CV

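TL;DR: UnityVideo is a unified multimodal video generation framework that jointly learns across multiple modalities and training paradigms to improve video quality, consistency, and adherence to physical-world constraints.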

Details Motivation: Existing video generation models are limited by single-modality conditioning and lack cross-modal interaction and diverse modal inputs, leaving them with an incomplete understanding of the world. Method: The UnityVideo framework combines a dynamic noising mechanism that unifies heterogeneous training paradigms with a modality switcher equipped with an in-context learner, enabling unified processing of multimodal information. Result: Validated on a large-scale dataset of 1.3M samples, UnityVideo accelerates convergence, significantly improves zero-shot generalization to unseen data, and delivers superior video quality and consistency. Conclusion: Through joint multimodal learning, UnityVideo achieves better world-aware video generation and offers an effective path toward general-purpose visual generation models. Abstract: Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo

[297] Relational Visual Similarity

Thao Nguyen,Sicheng Mo,Krishna Kumar Singh,Yilin Wang,Jing Shi,Nicholas Kolkin,Eli Shechtman,Yong Jae Lee,Yuheng Li

Main category: cs.CV

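TL;DR: This paper proposes a new image similarity measure that focuses on the relational similarity humans perceive, rather than the attribute similarity that conventional models capture. To this end, the authors build a dataset of 114k image-caption pairs in which the text describes the relational structure among elements within the image rather than its surface content, and they fine-tune a vision-language model on it to measure relational similarity between images.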

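Details Motivation: Existing visual similarity models (e.g., LPIPS, CLIP, DINO) attend only to perceptual attribute similarity and cannot capture the relational similarity humans excel at (for example, the structural analogy between the Earth and a peach). This ability is considered distinctive of human cognition, yet it is clearly missing from current visual computing. Method: Relational image similarity is formulated as a measurable problem: two images are relationally similar when the internal relations or functions among their visual elements correspond. The authors curate a 114k anonymized image-caption dataset whose captions describe the underlying relational logic of the scene rather than its surface content, and use it to fine-tune a vision-language model to learn and measure relational similarity between images. Result: The authors build the first large-scale dataset focused on relational similarity and train a vision-language model that captures similarity in relational structure between images. Experiments show that existing mainstream image similarity models perform poorly on this task, highlighting the limits of current techniques. Conclusion: Building a dedicated dataset and fine-tuning a vision-language model is a first step toward modeling relational similarity between images; it fills a key gap left by conventional visual similarity metrics in emulating human analogical cognition and points to a new direction for future visual understanding systems. Abstract: Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance.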
Our study shows that while relational similarity has a lot of real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.

[298] Voxify3D: Pixel Art Meets Volumetric Rendering

Yi-Chuan Huang,Jiewen Chan,Hao-Jen Chien,Yu-Lun Liu

Main category: cs.CV

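TL;DR: The paper proposes Voxify3D, a differentiable two-stage framework that uses orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization to generate voxel art from 3D meshes that preserves semantics and keeps colors discrete and coherent.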

Details Motivation: Existing methods struggle to balance geometric abstraction, semantic preservation, and discrete color coherence when generating voxel art, which leads to over-simplification or poor aesthetics. Method: Voxify3D combines orthographic pixel art supervision for precise voxel-pixel alignment, patch-based CLIP alignment to preserve semantics across discretization levels, and palette-constrained Gumbel-Softmax quantization for differentiable optimization over discrete color spaces. Result: The method scores 37.12 on CLIP-IQA and reaches a 77.90% user preference rate, supports controllable abstraction with 2-8 colors and 20x-50x resolutions, and works across a variety of character models. Conclusion: Voxify3D resolves the tension among semantic preservation, pixel-art aesthetics, and discrete optimization in voxel art generation, enabling high-quality, controllable automated voxelization. Abstract: Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/
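
As an illustration of the palette-constrained Gumbel-Softmax idea named in the abstract, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the three-color palette, the logits, and the temperature are all hypothetical, and a real system would run this per voxel inside an autodiff framework so gradients can flow through the soft weights.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return relaxed one-hot weights over palette entries (Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def quantize_color(logits, palette, tau, rng):
    """Pick a palette color for one voxel (hypothetical helper).

    Returns the soft color (weighted palette mix, usable during optimization),
    the hard color (the single palette entry with the largest weight),
    and the weights themselves.
    """
    weights = gumbel_softmax(logits, tau, rng)
    soft = tuple(sum(w * c[k] for w, c in zip(weights, palette)) for k in range(3))
    hard = palette[max(range(len(palette)), key=weights.__getitem__)]
    return soft, hard, weights


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    palette = [(0, 0, 0), (255, 0, 0), (255, 255, 255)]  # toy RGB palette
    soft, hard, weights = quantize_color([0.0, 2.0, 0.0], palette, 0.5, rng)
    print("weights:", weights)
    print("soft color:", soft)
    print("hard color:", hard)
```

Lowering `tau` pushes the weights toward a one-hot vector, so the soft color approaches a single palette entry; that is what makes the relaxation useful for optimizing over a discrete color set.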

eess.IV [Back]

[299] Proof of Concept for Mammography Classification with Enhanced Compactness and Separability Modules

Fariza Dahes

Main category: eess.IV

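TL;DR: This study validates and extends a methodological framework for medical image classification by transferring it from Alzheimer MRI to mammography classification. The results show that the GAGM and SEVector modules improve feature discriminability, whereas FSL shows no clear benefit on this task; multi-metric evaluation, visualization analysis, and an interactive dashboard improve model interpretability.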

Details Motivation: To examine whether an improved ConvNeXt Tiny architecture that previously performed well on Alzheimer MRI also applies to mammography classification, testing its transferability to another imaging modality. Method: Models including ConvNeXt Tiny that integrate Global Average and Max Pooling fusion (GAGM), lightweight channel attention (SEVector), and Feature Smoothing Loss (FSL) are evaluated on a Kaggle mammography dataset consolidating INbreast, MIAS, and DDSM; Grad-CAM is used for feature visualization, and an interactive dashboard is developed to support clinical analysis. Result: GAGM and SEVector improve feature discriminability and the detection of malignant samples and reduce false negatives, whereas FSL brings no measurable gain on this task; multi-metric evaluation (macro F1, per-class recall variance, ROC/AUC) and visualization analysis support the model's effectiveness. Conclusion: GAGM and SEVector help improve mammography classification, with greater sensitivity to malignant cases in particular, but the effect of FSL is task-dependent; future work should explore new ways to strengthen intra-class compactness and inter-class separability to better distinguish benign from malignant lesions. Abstract: This study presents a validation and extension of a recent methodological framework for medical image classification. While an improved ConvNeXt Tiny architecture, integrating Global Average and Max Pooling fusion (GAGM), lightweight channel attention (SEVector), and Feature Smoothing Loss (FSL), demonstrated promising results on Alzheimer MRI under CPU-friendly conditions, our work investigates its transposability to mammography classification. Using a Kaggle dataset that consolidates INbreast, MIAS, and DDSM mammography collections, we compare a baseline CNN, ConvNeXt Tiny, and InceptionV3 backbones enriched with GAGM and SEVector modules. Results confirm the effectiveness of GAGM and SEVector in enhancing feature discriminability and reducing false negatives, particularly for malignant cases. In our experiments, however, the Feature Smoothing Loss did not yield measurable improvements under mammography classification conditions, suggesting that its effectiveness may depend on specific architectural and computational assumptions. Beyond validation, our contribution extends the original framework through multi-metric evaluation (macro F1, per-class recall variance, ROC/AUC), feature interpretability analysis (Grad-CAM), and the development of an interactive dashboard for clinical exploration.
As a perspective, we highlight the need to explore alternative approaches to improve intra-class compactness and inter-class separability, with the specific goal of enhancing the distinction between malignant and benign cases in mammography classification.
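
Two of the metrics the abstract lists, macro F1 and per-class recall variance, can be sketched in a few lines of pure Python. This is an illustrative sketch, not the paper's evaluation code: the function names are hypothetical, the labels and predictions are made-up toy values, and the variance is assumed to be the population variance of recall across classes (the abstract does not say which form is used).

```python
def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives, and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[truth]["fn"] += 1  # missed an instance of the true class
            counts[pred]["fp"] += 1   # wrongly predicted the other class
    return counts


def macro_f1_and_recall_variance(y_true, y_pred, labels):
    """Return (macro F1, population variance of recall, per-class recalls)."""
    counts = per_class_counts(y_true, y_pred, labels)
    f1_scores, recalls = [], []
    for c in labels:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    # Macro F1 is the unweighted mean of per-class F1 scores.
    macro_f1 = sum(f1_scores) / len(f1_scores)
    mean_recall = sum(recalls) / len(recalls)
    recall_var = sum((r - mean_recall) ** 2 for r in recalls) / len(recalls)
    return macro_f1, recall_var, recalls


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    y_true = ["benign", "benign", "benign", "benign", "malignant", "malignant"]
    y_pred = ["benign", "benign", "benign", "malignant", "malignant", "benign"]
    macro_f1, recall_var, recalls = macro_f1_and_recall_variance(
        y_true, y_pred, ["benign", "malignant"]
    )
    print(macro_f1, recall_var, recalls)
```

A low recall variance indicates that the classifier finds benign and malignant cases at similar rates; a high value flags a class, such as malignant, that is being missed more often, which accuracy alone can hide.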