Table of Contents
cs.CL
[1] EvalCards: A Framework for Standardized Evaluation Reporting
Ruchira Dhar,Danae Sanchez Villegas,Antonia Karamolegkou,Alice Schiavone,Yifei Yuan,Xinyi Chen,Jiaang Li,Stella Frank,Laura De Grazia,Monorama Swain,Stephanie Brandl,Daniel Hershcovich,Anders Søgaard,Desmond Elliott
Main category: cs.CL
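TL;DR: This paper proposes Evaluation Disclosure Cards (EvalCards) to address shortcomings in the reproducibility, accessibility, and governance of current NLP evaluation reporting, improving research transparency and supporting compliance needs.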
Details
Motivation: Existing evaluation reporting standards are insufficient to meet the transparency and governance challenges posed by rapidly released open-access models.
Method: Drawing on a survey of recent work on evaluation and documentation, the authors propose the EvalCards framework.
Result: A new reporting tool, EvalCards, designed to improve transparency for researchers and practitioners and to meet emerging governance requirements.
Conclusion: EvalCards offer a practical path toward resolving three persistent problems in current NLP evaluation practice and are an important step toward more responsible AI development.
Abstract: Evaluation has long been a central concern in NLP, and transparent reporting practices are more critical than ever in today's landscape of rapidly released open-access models. Drawing on a survey of recent work on evaluation and documentation, we identify three persistent shortcomings in current reporting practices: reproducibility, accessibility, and governance. We argue that existing standardization efforts remain insufficient and introduce Evaluation Disclosure Cards (EvalCards) as a path forward. EvalCards are designed to enhance transparency for both researchers and practitioners while providing a practical foundation to meet emerging governance requirements.
[2] Cacheback: Speculative Decoding With Nothing But Cache
Zhiyao Ma,In Gim,Lin Zhong
Main category: cs.CL
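TL;DR: Cacheback Decoding is a training-free, model-agnostic speculative decoding method that exploits locality in language to accelerate large language model inference.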
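The drafting idea summarized above (an LRU table of token n-grams that proposes draft tokens) can be illustrated with a minimal sketch. The class name `NGramLRU`, the context length, the capacity, and the `target_next_token` callback are all hypothetical; this is not the authors' implementation, and a real system would verify the whole draft in one batched forward pass rather than token by token.

```python
from collections import OrderedDict


class NGramLRU:
    """LRU table mapping a tuple of context tokens to the token last seen after it."""

    def __init__(self, context_len=2, capacity=1024):
        self.context_len = context_len
        self.capacity = capacity
        self.table = OrderedDict()

    def update(self, tokens):
        """Record every (context -> next token) pair found in `tokens`."""
        n = self.context_len
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            self.table[key] = tokens[i + n]
            self.table.move_to_end(key)         # mark as most recently used
            if len(self.table) > self.capacity:
                self.table.popitem(last=False)  # evict the least recently used entry

    def draft(self, prefix, max_len=4):
        """Chain table lookups to propose up to `max_len` draft tokens."""
        out = []
        ctx = list(prefix[-self.context_len:])
        while len(out) < max_len:
            key = tuple(ctx)
            if key not in self.table:
                break
            tok = self.table[key]
            self.table.move_to_end(key)
            out.append(tok)
            ctx = (ctx + [tok])[-self.context_len:]
        return out


def accept_prefix(draft, target_next_token, prefix):
    """Keep draft tokens for as long as they match the target model's greedy choice."""
    accepted = []
    for tok in draft:
        if target_next_token(prefix + accepted) != tok:
            break
        accepted.append(tok)
    return accepted
```

Because the table is filled only from tokens already seen, no training is involved, which is consistent with the training-free, model-agnostic framing of the method.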
Details
Motivation: To speed up large language model inference while keeping the method simple and general.
Method: Uses LRU cache tables of token n-grams to generate draft sequences for speculative decoding, exploiting locality in language.
Result: Achieves state-of-the-art performance among comparable methods, integrates easily into existing systems, and shows potential for adaptation to new domains.
Conclusion: With a minimalist design, Cacheback delivers efficient inference acceleration and has broad applicability and practical value.
Abstract: We present Cacheback Decoding, a training-free and model-agnostic speculative decoding method that exploits the locality in language to accelerate Large Language Model (LLM) inference. Cacheback leverages only Least Recently Used (LRU) cache tables of token n-grams to generate draft sequences. Cacheback achieves state-of-the-art performance among comparable methods despite its minimalist design, and its simplicity allows easy integration into existing systems. Cacheback also shows potential for fast adaptation to new domains.
[3] JELV: A Judge of Edit-Level Validity for Evaluation and Automated Reference Expansion in Grammatical Error Correction
Yuhao Zhan,Yuqing Zhang,Jing Yuan,Qixiang Ma,Zhiqi Yang,Yu Gu,Zemin Liu,Fei Wu
Main category: cs.CL
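TL;DR: This paper proposes JELV, a framework for validating edits in grammatical error correction that improves model evaluation and generalization by increasing reference diversity.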
Details
Motivation: Existing grammatical error correction (GEC) systems suffer from limited reference diversity, which leads to underestimated evaluation scores and restricted model generalization.
Method: Proposes the JELV framework, comprising a multi-turn LLM-as-Judges pipeline and a distilled DeBERTa classifier, and builds the PEVData dataset as a benchmark.
Result: The LLM-as-Judges pipeline reaches 90% agreement with human annotators and the DeBERTa classifier reaches 85% precision on valid edits; the improved evaluation metric achieves state-of-the-art correlation with human judgments; retraining GEC systems on the expanded dataset yields measurable performance gains.
Conclusion: JELV provides a scalable solution for enhancing reference diversity and strengthening both the evaluation and the generalization of GEC systems.
Abstract: Existing Grammatical Error Correction (GEC) systems suffer from limited reference diversity, leading to underestimated evaluation and restricted model generalization. To address this issue, we introduce the Judge of Edit-Level Validity (JELV), an automated framework to validate correction edits in terms of grammaticality, faithfulness, and fluency. Using our proposed human-annotated Pair-wise Edit-level Validity Dataset (PEVData) as benchmark, JELV offers two implementations: a multi-turn LLM-as-Judges pipeline achieving 90% agreement with human annotators, and a distilled DeBERTa classifier with 85% precision on valid edits. We then apply JELV to reclassify misjudged false positives in evaluation and derive a comprehensive evaluation metric by integrating false positive decoupling and fluency scoring, resulting in state-of-the-art correlation with human judgments. We also apply JELV to filter LLM-generated correction candidates, expanding BEA19's single-reference dataset containing 38,692 source sentences. Retraining top GEC systems on this expanded dataset yields measurable performance gains. JELV provides a scalable solution for enhancing reference diversity and strengthening both evaluation and model generalization.
[4] 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations
Chiung-Yi Tseng,Danyang Zhang,Tianyang Wang,Hongying Luo,Lu Chen,Junming Huang,Jibin Guan,Junfeng Hao,Junhao Song,Ziqian Bi
Main category: cs.CL
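TL;DR: This paper presents a comprehensive benchmark evaluation of 27 state-of-the-art large language models on Chinese medical examination questions spanning seven medical specialties and two professional levels. It introduces an evaluation framework, analyzes model performance on 2,800 carefully curated questions, and finds no consistent correlation between model size and performance, along with substantial performance gaps between specialties.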
Details
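Motivation: With the rapid development of large language models (LLMs), their potential in medicine has attracted wide attention. Existing research, however, lacks a systematic evaluation of LLMs in Chinese medical contexts, particularly of how their capabilities differ across medical specialties and professional levels. This paper therefore aims to build a comprehensive, fine-grained benchmark that measures the actual performance of current mainstream LLMs in realistic medical examination settings and provides empirical evidence for medical AI applications.
Method: The authors build a Chinese medical examination dataset of 2,800 questions covering seven specialties (cardiovascular, gastroenterology, hematology, infectious diseases, nephrology, neurology, and respiratory medicine) at two difficulty levels (attending physician and senior physician). Using accuracy as the primary metric, they systematically evaluate 27 mainstream LLMs, both closed-source and open-source, with particular attention to the relationship between model size, architecture type (such as MoE), and performance. A fine-grained analysis across specialties and difficulty levels reveals how model capability is distributed over medical domains.
Result: Mixtral-8x7B ranks first with 74.25% overall accuracy, followed by DeepSeek-R1-671B at 64.07%. Model size shows no consistent correlation with performance, and smaller mixture-of-experts models are strongly competitive. Performance differs significantly across specialties: models do better on cardiovascular and neurology questions and worse on gastroenterology and nephrology. Top models show only a small gap between the attending and senior physician levels, indicating good generalization.
Conclusion: The benchmark offers an important reference for deploying LLMs in medical education and clinical decision support systems. It shows that although some models have acquired strong medical knowledge, limitations remain in specific specialties, and that future work should focus on balanced optimization across domains and on architectural innovation rather than on simply scaling up parameter counts.
Abstract: The rapid advancement of large language models (LLMs) has prompted significant interest in their potential applications in medical domains. This paper presents a comprehensive benchmark evaluation of 27 state-of-the-art LLMs on Chinese medical examination questions, encompassing seven medical specialties across two professional levels. We introduce a robust evaluation framework that assesses model performance on 2,800 carefully curated questions from cardiovascular, gastroenterology, hematology, infectious diseases, nephrology, neurology, and respiratory medicine domains. Our dataset distinguishes between attending physician and senior physician difficulty levels, providing nuanced insights into model capabilities across varying complexity. Our empirical analysis reveals substantial performance variations among models, with Mixtral-8x7B achieving the highest overall accuracy of 74.25%, followed by DeepSeek-R1-671B at 64.07%. Notably, we observe no consistent correlation between model size and performance, as evidenced by the strong performance of smaller mixture-of-experts architectures. The evaluation demonstrates significant performance gaps between medical specialties, with models generally performing better on cardiovascular and neurology questions compared to gastroenterology and nephrology domains. Furthermore, our analysis indicates minimal performance degradation between attending and senior physician levels for top-performing models, suggesting robust generalization capabilities. This benchmark provides critical insights for the deployment of LLMs in medical education and clinical decision support systems, highlighting both the promise and current limitations of these technologies in specialized medical contexts.
[5] CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference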
Dong Liu,Yanxuan Yu,Ben Lengerich
Main category: cs.CL
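TL;DR: Proposes CSV-Decode, which uses geometric upper bounds to construct a small sub-vocabulary at each decoding step, enabling efficient sparse computation with dual correctness guarantees: exact top-k certification and ε-certified softmax approximation.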
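The geometric bound behind this kind of pruning can be sketched in a few lines. For a token embedding e in a cluster with centroid c and radius r (the largest distance from c to any member), the Cauchy-Schwarz inequality gives h·e ≤ h·c + ‖h‖·r, so a cluster whose bound falls below the current k-th best exact logit cannot contribute to the top-k. The code below is a pure-Python illustration under that assumption only; the function names are hypothetical, the cluster assignment is taken as given, and the paper's sparse kernels, ε-certified softmax, and fallback logic are not modeled. Floating-point rounding is also ignored.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def build_clusters(embeddings, assignment):
    """Offline step: compute (centroid, radius, token ids) for each cluster."""
    groups = {}
    for tok, c in enumerate(assignment):
        groups.setdefault(c, []).append(tok)
    clusters = []
    for toks in groups.values():
        dim = len(embeddings[toks[0]])
        centroid = [sum(embeddings[t][d] for t in toks) / len(toks) for d in range(dim)]
        radius = max(
            norm([embeddings[t][d] - centroid[d] for d in range(dim)]) for t in toks
        )
        clusters.append((centroid, radius, toks))
    return clusters


def certified_top_k(h, embeddings, clusters, k):
    """Return the exact top-k token ids and how many logits were actually computed."""
    # Upper bound on any logit inside a cluster: h.c + ||h|| * r
    bounds = sorted(
        ((dot(h, c) + norm(h) * r, toks) for c, r, toks in clusters),
        key=lambda b: -b[0],
    )
    scored = []   # (exact logit, token id), kept sorted in descending order
    computed = 0
    for upper, toks in bounds:
        if len(scored) >= k and upper < scored[k - 1][0]:
            break  # every remaining token is provably below the current k-th logit
        for t in toks:
            scored.append((dot(h, embeddings[t]), t))
        computed += len(toks)
        scored.sort(reverse=True)
    return [t for _, t in scored[:k]], computed
```

Visiting clusters in order of decreasing bound means the loop can stop at the first cluster that fails the test, since all later clusters have even smaller bounds.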
Details
Motivation: Large language models face a significant computational bottleneck during inference because output-layer computation over a large vocabulary is expensive.
Method: Cluster vocabulary embeddings offline and use centroid-plus-radius bounds to determine which tokens can be safely omitted; the system implementation combines sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization.
Result: Experiments show a significant speedup over full-vocabulary decoding while maintaining distributional guarantees and low fallback rates.
Conclusion: CSV-Decode effectively alleviates the computational bottleneck in LLM inference and improves efficiency while preserving output quality.
Abstract: Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-$k$ certification and $\varepsilon$-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code implementation is available at https://github.com/FastLM/CSV-Decode.
[6] Evaluating Embedding Generalization: How LLMs, LoRA, and SLERP Shape Representational Geometry
Siyaxolisa Kabane
Main category: cs.CL
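TL;DR: This study compares the generalization of dense text embeddings built on large language model (LLM) backbones versus non-LLM encoders. It finds that LLMs better capture higher-order numeric patterns but are prone to adapter dominance, and that SLERP model merging recovers base-model structure while retaining task gains, improving clustering separability and robustness.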
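For readers unfamiliar with SLERP, the standard formula interpolates along the arc between two vectors: with Ω the angle between w0 and w1, the result is sin((1−t)Ω)/sin(Ω)·w0 + sin(tΩ)/sin(Ω)·w1. The sketch below applies it to two flattened weight vectors; treating whole checkpoints as flat vectors, the linear fallback for near-parallel inputs, and the function name are assumptions for illustration, not details taken from the paper.

```python
import math


def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two nonzero vectors of equal length."""
    n0 = math.sqrt(sum(x * x for x in w0))
    n1 = math.sqrt(sum(x * x for x in w1))
    cos_omega = sum(a * b for a, b in zip(w0, w1)) / (n0 * n1)
    cos_omega = max(-1.0, min(1.0, cos_omega))   # guard against rounding drift
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly parallel (or anti-parallel) vectors: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(w0, w1)]
```

Unlike plain averaging, the midpoint of two unit-norm vectors stays on the unit sphere, which is one common motivation for preferring SLERP when merging weights.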
Details
Motivation: To investigate the generalization properties of LLMs as embedding backbones and the over-specialization caused by task-specific adaptation (e.g., LoRA), and to assess whether model-merging methods such as SLERP can mitigate it.
Method: A controlled suite of experiments on a numerical-sequence embedding task compares four model families: non-LLM encoders, LoRA-adapted LLMs, LoRA-LLMs merged via model souping, and LoRA-LLMs merged via SLERP. Representational quality is evaluated with the Silhouette and Davies-Bouldin indices, and k-means labels are analyzed for any additional information the embeddings encode.
Result: LLM backbones better capture higher-order, compositional numeric patterns but suffer from adapter dominance, which degrades balanced generalization. SLERP merging consistently recovers base-model structure while retaining most task gains, and outperforms model souping and unmerged models in clustering separability and robustness.
Conclusion: SLERP is an effective model-merging strategy that restores the generalization ability of LLM embedding models without sacrificing task performance, mitigating the over-specialization introduced by adaptation methods such as LoRA.
Abstract: We investigate the generalization properties of dense text embeddings when the embedding backbone is a large language model (LLM) versus when it is a non-LLM encoder, and we study the extent to which spherical linear interpolation (SLERP) model-merging mitigates over-specialization introduced by task-specific adaptation (e.g., LoRA). To make the comparison concrete and domain-agnostic, we design a controlled suite of experiments in which models embed short numerical sequences and are evaluated on their ability to cluster and classify those sequences according to well-defined number-theoretic properties. Our experimental protocol compares four families of models: (1) non-LLM encoders trained from scratch or fine-tuned for embeddings, (2) LLM-based encoders adapted with parameter-efficient methods (LoRA), (3) LLM-based encoders with LoRA followed by model souping merging into the base weights, and (4) the same LoRA-adapted LLMs merged using SLERP across checkpoints or stages. We evaluate representational quality with clustering indices (Silhouette and Davies-Bouldin). We additionally analyze the use of k-means labels to see if the embeddings encode any other information besides the one we are testing for. Empirically, we find that LLM-based backbones produce embeddings that better capture higher-order, compositional numeric patterns, but are prone to adapter dominance that degrades balanced generalization; SLERP merging consistently recovers base-model structure while retaining most task gains, yielding superior tradeoffs in clustering separability and robustness compared to model souping or models that were not merged.
[7] On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models
Jonatas Grosman,Cassio Almeida,Guilherme Schardong,Hélio Lopes
Main category: cs.CL
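TL;DR: This study examines the cross-lingual transferability of pre-trained wav2vec2-based models on multilingual speech recognition, finding that the diversity of pre-training data matters more than its size and that Indo-European languages perform better than non-Indo-European ones.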
Details
Motivation: Few studies have examined how knowledge in wav2vec2-style models transfers across languages, especially when the pre-training language differs from the target-task language. This paper aims to fill that gap.
Method: Fine-tuning experiments on the speech recognition task were run in 18 languages using 15 large pre-trained models to systematically evaluate cross-lingual transfer.
Result: The diversity of pre-training data affects final performance more than its size; Indo-European languages generally perform better than non-Indo-European languages; monolingual pre-trained models show positive cross-lingual knowledge transfer, most markedly when the pre-training language is close to the target language.
Conclusion: Pre-training should prioritize data diversity over sheer scale, and language relatedness significantly affects transfer. The findings help guide both the use of existing models and the pre-training of new ones.
Abstract: Using representations provided by a large pre-trained model has become the primary strategy for achieving state-of-the-art results in a wide range of tasks. A recently proposed large pre-trained model, wav2vec 2.0, was seminal for several other works on pre-training large models on speech data. Many models are being pre-trained using the same architecture as wav2vec 2.0 and are achieving state-of-the-art results in various speech-related tasks. Previous work has demonstrated that the data used during the pre-training of these wav2vec2-based models can impact the model's performance in downstream tasks, and this should be taken into consideration before utilizing these models. However, few works have proposed investigating further how knowledge transfer in these pre-trained models behaves in different languages, even when the target language differs from the one used during the model's pre-training. Our work aims to investigate the cross-lingual transferability of these wav2vec2-based models. We performed several fine-tuning experiments on the speech recognition task in 18 languages using 15 large pre-trained models. The results of our experiments showed us that the size of data used during the pre-training of these models is not as important to the final performance as the diversity. We noticed that the performance of Indo-European languages is superior to non-Indo-European languages in the evaluated models. We have observed a positive cross-lingual transfer of knowledge using monolingual models, which was evident in all the languages we used, but more pronounced when the language used during pre-training was more similar to the downstream task language. With these findings, we aim to assist the scientific community in utilizing existing wav2vec2-based pre-trained models, as well as facilitate the pre-training of new ones.
[8] Insight-A: Attribution-aware for Multimodal Misinformation Detection
Junjie Wu,Yumeng Fu,Chen Gong,Guohong Fu
Main category: cs.CL
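TL;DR: This paper proposes Insight-A, which uses insights from multimodal large language models to attribute and detect AI-generated misinformation, introducing cross-attribution prompting and automatic attribution-debiased prompting to improve the accuracy and objectivity of multimodal misinformation detection.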
Details
Motivation: Existing methods that apply standard prompting to multimodal large language models cannot attribute AI-generated misinformation to its forgery source, and so struggle with the complex multimodal misinformation threats of the AIGC era.
Method: Proposes the Insight-A framework, which includes cross-attribution prompting (CAP) to model the correlations between perception and reasoning and trace forgery sources, automatic attribution-debiased prompting (ADP) to reduce the subjectivity of human-written prompts, and image captioning (IC) to strengthen cross-modal consistency checking, all within a hierarchical reasoning pipeline that detects distortions across modalities.
Result: Experiments show that the method performs strongly on multimodal misinformation detection, improving both detection performance and attribution accuracy.
Conclusion: Insight-A offers a new paradigm for multimodal misinformation detection in the AIGC era, improving interpretability and robustness through attribution mechanisms and automated prompting strategies.
Abstract: AI-generated content (AIGC) technology has emerged as a prevalent alternative to create multimodal misinformation on social media platforms, posing unprecedented threats to societal safety. However, standard prompting leverages multimodal large language models (MLLMs) to identify the emerging misinformation, which ignores the misinformation attribution. To this end, we present Insight-A, exploring attribution with MLLM insights for detecting multimodal misinformation. Insight-A makes two efforts: I) attribute misinformation to forgery sources, and II) an effective pipeline with hierarchical reasoning that detects distortions across modalities. Specifically, to attribute misinformation to forgery traces based on generation patterns, we devise cross-attribution prompting (CAP) to model the sophisticated correlations between perception and reasoning. Meanwhile, to reduce the subjectivity of human-annotated prompts, automatic attribution-debiased prompting (ADP) is used for task adaptation on MLLMs. Additionally, we design image captioning (IC) to achieve visual details for enhancing cross-modal consistency checking. Extensive experiments demonstrate the superiority of our proposal and provide a new paradigm for multimodal misinformation detection in the era of AIGC.
[9] A General Highly Accurate Online Planning Method Integrating Large Language Models into Nested Rollout Policy Adaptation for Dialogue Tasks
Hui Wang,Fafa Zhang,Xiaoyu Zhang,Chaoxu Mu
Main category: cs.CL
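TL;DR: Proposes NRPA-GD, a new policy planning method for goal-oriented dialogue that uses a large language model to simulate both user and system behavior and optimizes the dialogue policy through nested Monte Carlo simulation and policy self-adaptation, outperforming existing methods on multiple datasets without any dedicated training.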
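The generic Nested Rollout Policy Adaptation loop that this method builds on can be shown on a toy problem. In the sketch below, the action set, the horizon, and especially the `score` function are hypothetical stand-ins: in NRPA-GD the trajectory would be produced and evaluated with an LLM simulating user and system, which is not modeled here. Only the nesting and the policy-adaptation step follow the standard NRPA scheme.

```python
import math
import random

ACTIONS = ["greet", "ask", "offer", "close"]   # hypothetical dialogue acts
HORIZON = 4                                    # hypothetical number of turns


def score(seq):
    """Toy stand-in for a trajectory evaluator (an assumption, not the paper's)."""
    target = ["greet", "ask", "offer", "close"]
    return sum(a == b for a, b in zip(seq, target))


def rollout(policy, rng):
    """Level 0: sample one action per step with softmax weights from the policy."""
    seq = []
    for step in range(HORIZON):
        weights = [math.exp(policy.get((step, a), 0.0)) for a in ACTIONS]
        seq.append(rng.choices(ACTIONS, weights=weights)[0])
    return score(seq), seq


def adapt(policy, seq, alpha=1.0):
    """Shift policy weight toward the actions of `seq` (gradient of its log-likelihood)."""
    new = dict(policy)
    for step, chosen in enumerate(seq):
        z = sum(math.exp(policy.get((step, a), 0.0)) for a in ACTIONS)
        for a in ACTIONS:
            p = math.exp(policy.get((step, a), 0.0)) / z
            new[(step, a)] = new.get((step, a), 0.0) - alpha * p
        new[(step, chosen)] += alpha
    return new


def nrpa(level, policy, rng, iterations=20):
    """Each level repeatedly calls the level below and adapts toward the best sequence."""
    if level == 0:
        return rollout(policy, rng)
    best_score, best_seq = float("-inf"), None
    for _ in range(iterations):
        s, seq = nrpa(level - 1, policy, rng, iterations)
        if s >= best_score:
            best_score, best_seq = s, seq
        policy = adapt(policy, best_seq)   # rebinding keeps the caller's policy untouched
    return best_score, best_seq
```

A call such as `nrpa(2, {}, random.Random(0))` runs two nested levels starting from a uniform policy; since no weights are trained offline, the search happens entirely at planning time.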
Details
Motivation: Existing approaches rely on elaborate prompt engineering or on policy models that must be trained, which makes them costly and hard to adapt to new scenarios. An efficient dialogue policy method that needs no training and transfers easily is therefore required.
Method: Proposes NRPA-GD, which uses a large language model to simulate both user and system behavior, builds an evaluation mechanism for dialogue trajectories, and combines nested Monte Carlo simulation with policy self-adaptation to optimize the policy dynamically.
Result: Experiments on four typical datasets show that NRPA-GD outperforms existing prompt-engineering and pre-trained-model methods, and even surpasses ChatGPT and large policy models using an LLM with only 0.6 billion parameters.
Conclusion: NRPA-GD effectively addresses policy adaptability and training cost in goal-oriented dialogue and demonstrates the potential of combining planning methods with large language models on practical tasks.
Abstract: In goal-oriented dialogue tasks, the main challenge is to steer the interaction towards a given goal within a limited number of turns. Existing approaches either rely on elaborate prompt engineering, whose effectiveness is heavily dependent on human experience, or integrate policy networks and pre-trained policy models, which are usually difficult to adapt to new dialogue scenarios and costly to train. Therefore, in this paper, we present Nested Rollout Policy Adaptation for Goal-oriented Dialogue (NRPA-GD), a novel dialogue policy planning method that completely avoids specific model training by utilizing a Large Language Model (LLM) to simulate behaviors of user and system at the same time. Specifically, NRPA-GD constructs a complete evaluation mechanism for dialogue trajectories and employs an optimization framework of nested Monte Carlo simulation and policy self-adaptation to dynamically adjust policies during the dialogue process. The experimental results on four typical goal-oriented dialogue datasets show that NRPA-GD outperforms both existing prompt engineering and specifically pre-trained model-based methods. Impressively, NRPA-GD surpasses ChatGPT and pre-trained policy models with only a 0.6-billion-parameter LLM. The proposed approach further demonstrates the advantages and novelty of employing planning methods on LLMs to solve practical planning tasks.
[10] Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?
Matteo Spreafico,Ludovica Tassini,Camilla Sancricca,Cinzia Cappiello
Main category: cs.CL
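TL;DR: This paper examines how well large language models handle data preparation tasks, in particular data profiling and cleaning on poor-quality datasets, and compares them with traditional tools.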
Details
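Motivation: Data preparation is a critical but often time-consuming step in data-driven processes; exploring the potential of large language models for these tasks can help raise the level of automation and efficiency.
Method: The study uses general-purpose and fine-tuned tabular large language models, prompts them with poor-quality datasets, and evaluates their performance on tasks such as data profiling and cleaning. It also compares them with traditional data preparation tools and designs a quality model, validated through a user study, for the evaluation.
Result: Large language models show some ability to support data preparation tasks, outperforming traditional tools particularly in understanding complex data issues and generating repair suggestions, but their accuracy still has room to improve.
Conclusion: Large language models are promising as effective aids for data preparation, but in practice they still need to be combined with domain knowledge and human oversight to ensure reliable results.
Abstract: Large language models have recently demonstrated their exceptional capabilities in supporting and automating various tasks. Among the tasks worth exploring for testing large language model capabilities, we considered data preparation, a critical yet often labor-intensive step in data-driven processes. This paper investigates whether large language models can effectively support users in selecting and automating data preparation tasks. To this aim, we considered both general-purpose and fine-tuned tabular large language models. We prompted these models with poor-quality datasets and measured their ability to perform tasks such as data profiling and cleaning. We also compare the support provided by large language models with that offered by traditional data preparation tools. To evaluate the capabilities of large language models, we developed a custom-designed quality model that has been validated through a user study to gain insights into practitioners' expectations.
[11] Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach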
Blessed Guda,Lawrence Francis,Gabrial Zencha Ashungafac,Carlee Joe-Wong,Moise Busogi
Main category: cs.CL
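TL;DR: This paper proposes a new unsupervised, label-free Permutation Bias Metric (PBM), an efficient Batch Question-Context KV caching (BaQCKV) method, and a Low-Rank Adaptation (LoRA-1) fine-tuning strategy to mitigate the selection bias of large language models on multiple-choice questions while reducing computational cost and preserving generalization.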
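To make the notion of permutation inconsistency concrete, the sketch below asks a model for its answer under every ordering of the options, maps each chosen position back to the option text, and reports how often the orderings disagree with the majority. This is an illustrative label-free measure only: the exact PBM formula is not given in the summary, so `inconsistency` should not be read as the paper's metric, and `predict` is a hypothetical callback returning the index of the selected option. The KV-caching trick of BaQCKV is not modeled.

```python
from collections import Counter
from itertools import permutations


def predictions_over_permutations(options, predict):
    """Query `predict` once per ordering; map each chosen position back to its option text."""
    picks = []
    for order in permutations(options):
        idx = predict(list(order))   # index of the option the model selects
        picks.append(order[idx])
    return picks


def majority_vote(picks):
    """Answer chosen most often across orderings (ties broken by first occurrence)."""
    return Counter(picks).most_common(1)[0][0]


def inconsistency(picks):
    """1 minus the share of orderings agreeing with the majority; 0.0 means order-invariant."""
    top_count = Counter(picks).most_common(1)[0][1]
    return 1.0 - top_count / len(picks)
```

Enumerating all orderings costs one model call per permutation (24 for four options), which is the expense that motivates cheaper voting schemes in the first place.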
Details
Motivation: Large language models often show selection bias on multiple-choice questions, where answer position or option symbols influence the choice, which undermines the reliability of evaluation. Existing metrics and mitigation methods depend on labels, are computationally expensive, or generalize poorly, so a more efficient and general solution is needed.
Method: Three methods are proposed: (1) the Permutation Bias Metric (PBM), which quantifies selection bias by measuring the inconsistency of model predictions across answer orderings; (2) BaQCKV, which uses caching to make majority voting efficient and reduce computational overhead; and (3) LoRA-1, an unsupervised low-rank fine-tuning strategy that combines PBM and BaQCKV to mitigate bias.
Result: Experiments on multiple MCQ benchmarks show that the proposed methods reduce selection bias and improve prediction consistency, maintaining accuracy while lowering computational cost.
Conclusion: PBM provides a more precise unsupervised measure of bias, BaQCKV significantly improves the efficiency of majority voting, and LoRA-1 enables efficient fine-tuning without a validation set. Together they form an efficient, generalizable solution to selection bias in MCQ tasks.
Abstract: Multiple Choice Question (MCQ) answering is a widely used method for evaluating the performance of Large Language Models (LLMs). However, LLMs often exhibit selection bias in MCQ tasks, where their choices are influenced by factors like answer position or option symbols rather than the content. This bias undermines the reliability of MCQ as an evaluation framework. Most existing selection bias metrics require answer labels and measure divergences between prediction and answer distributions, but do not fully capture the consistency of a model's predictions across different orderings of answer choices. Existing selection bias mitigation strategies have notable limitations: majority voting, though effective, is computationally prohibitive; calibration-based methods require validation sets and often fail to generalize across datasets. To address these gaps, we propose three key contributions: (1) a new unsupervised label-free Permutation Bias Metric (PBM) that directly quantifies inconsistencies in model predictions across answer permutations, providing a more precise measure of selection bias, (2) an efficient majority voting approach called Batch Question-Context KV caching (BaQCKV), to significantly reduce computational costs while preserving bias mitigation effectiveness, and (3) an unsupervised Low-Rank Adaptation (LoRA-1) fine-tuning strategy based on our proposed metric and the BaQCKV that mitigates selection bias, providing a computationally efficient alternative that maintains model generalizability. Experiments across multiple MCQ benchmarks demonstrate that our approaches reduce bias, increasing consistency in accuracy while minimizing computational costs.
[12] Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation
Fatima Kazi
Main category: cs.CL
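TL;DR: This study examines social, cultural, racial, gender, and other biases in large language models such as ChatGPT, evaluates explicit and implicit bias in several generative models using benchmarks such as StereoSet and CrowSPairs, and proposes an enhancement learning strategy based on fine-tuning, prompt engineering, and data augmentation to mitigate these problems.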
Details
Motivation: Because LLMs inherit a range of biases from their training data, they can produce unfair or harmful outputs. These biases need to be systematically identified and mitigated to keep generated content fair and truthful.
Method: A three-pronged approach is used to detect explicit and implicit bias. Bias-specific benchmarks such as StereoSet and CrowSPairs are used to evaluate several models, including BERT, GPT 3.5, and ADA, and fine-tuning, different prompting techniques, and data augmentation are combined to improve model performance.
Result: Fine-tuned models perform worse on gender bias but better at identifying racial bias. LLMs often over-rely on keywords in prompts and lack an understanding of truthfulness. With the enhancement learning strategy, models adapt well in cross-dataset testing, with performance gains of up to 20% on implicit bias tasks.
Conclusion: Although current LLMs have made some progress in reducing bias, heavy reliance on keywords and residual bias remain serious problems, and multiple techniques must be combined and continually refined to achieve fairer, more reliable outputs.
Abstract: Large language models (LLMs), such as ChatGPT, have gained popularity in recent years with the advancement of Natural Language Processing (NLP), with use cases spanning many disciplines and daily lives as well. LLMs inherit explicit and implicit biases from the datasets they were trained on; these biases can include social, ethical, cultural, religious, and other prejudices and stereotypes. It is important to comprehensively examine such shortcomings by identifying the existence and extent of such biases, recognizing the origin, and attempting to mitigate such biased outputs to ensure fair outputs to reduce harmful stereotypes and misinformation. This study inspects and highlights the need to address biases in LLMs amid growing generative Artificial Intelligence (AI). We utilize bias-specific benchmarks such as StereoSet and CrowSPairs to evaluate the existence of various biases in many different generative models such as BERT, GPT 3.5, and ADA. To detect both explicit and implicit biases, we adopt a three-pronged approach for thorough and inclusive analysis. Results indicate fine-tuned models struggle with gender biases but excel at identifying and avoiding racial biases. Our findings also illustrated that despite some cases of success, LLMs often over-rely on keywords in prompts and their outputs. This demonstrates the incapability of LLMs to attempt to truly understand the accuracy and authenticity of their outputs. Finally, in an attempt to bolster model performance, we applied an enhancement learning strategy involving fine-tuning, different prompting techniques, and data augmentation of the bias benchmarks. We found fine-tuned models to exhibit promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.
[13] EulerESG: Automating ESG Disclosure Analysis with LLMs
Yi Ding,Xushuo Tang,Zhengyi Yang,Wenqian Zhang,Simin Wu,Yuxin Huang,Lingjing Lan,Weiyuan Li,Yin Chen,Mingchen Ju,Wenke Yang,Thong Hoang,Mykhailo Klymenko,Xiwei Zu,Wenjie Zhang
Main category: cs.CL
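TL;DR: EulerESG is a large language model (LLM)-based system that automates the analysis of ESG reports, explicitly incorporates ESG framework standards, and uses dual-channel retrieval and an interactive interface to extract and compare disclosures accurately and interpretably.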
Details
Motivation: ESG reports are mostly heterogeneous PDF documents that are hard to query systematically; current tools either depend on rules or ignore the structure of reporting standards, resulting in poor extraction and little interpretability.
Method: Proposes the EulerESG system, which combines dual-channel retrieval with LLM-driven disclosure analysis, builds an interactive dashboard and chatbot for exploration, benchmarking, and explanation, and explicitly models ESG reporting standards such as SASB.
Result: Validated on four global companies and twelve SASB sub-industries, EulerESG automatically populates standard-aligned metric tables with up to 0.95 average accuracy while keeping end-to-end runtime practical; several recent LLMs are also compared on this task.
Conclusion: By combining knowledge of ESG frameworks with LLM capabilities, EulerESG enables efficient, accurate, and interpretable automated analysis of ESG disclosures, advancing the intelligent processing of sustainability reports.
Abstract: Environmental, Social, and Governance (ESG) reports have become central to how companies communicate climate risk, social impact, and governance practices, yet they are still published primarily as long, heterogeneous PDF documents. This makes it difficult to systematically answer seemingly simple questions. Existing tools either rely on brittle rule-based extraction or treat ESG reports as generic text, without explicitly modelling the underlying reporting standards. We present EulerESG, an LLM-powered system for automating ESG disclosure analysis with explicit awareness of ESG frameworks. EulerESG combines (i) dual-channel retrieval and LLM-driven disclosure analysis over ESG reports, and (ii) an interactive dashboard and chatbot for exploration, benchmarking, and explanation. Using four globally recognised companies and twelve SASB sub-industries, we show that EulerESG can automatically populate standard-aligned metric tables with high fidelity (up to 0.95 average accuracy) while remaining practical in end-to-end runtime, and we compare several recent LLM models in this setting. The full implementation, together with a demonstration video, is publicly available at https://github.com/UNSW-database/EulerESG.
[14] GPS: General Per-Sample Prompter
Pawel Batorski,Paul Swoboda
Main category: cs.CL
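TL;DR: This paper proposes GPS, a general-purpose, per-sample prompt generation method trained with reinforcement learning. Without task-specific tuning, it generates a tailored prompt for each input and performs well across multiple tasks, requiring neither task-specific training data nor expensive optimization.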
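GPS is reported to use Minimum Bayes Risk (MBR) decoding to stabilize inference. In its generic form, MBR samples several candidates and returns the one with the highest average utility against the others, that is, the candidate most similar to the rest of the sample. The sketch below shows that selection rule; the Jaccard token-overlap utility and the function names are assumptions for illustration, since the summary does not state which utility the authors use.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1] (illustrative utility only)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def mbr_select(candidates, utility=jaccard):
    """Return the candidate with the highest mean utility against all other candidates."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Usage: an outlier sample is unlikely to be selected.
prompts = [
    "think step by step",
    "think step by step carefully",
    "think carefully step by step",
    "write a poem",
]
chosen = mbr_select(prompts)
```

The stabilizing effect comes from consensus: a single unusual sample scores low against every other candidate and is filtered out.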
Details
Motivation: Existing automatic prompting methods depend on large amounts of task-specific data and costly optimization, and they produce a single task-level prompt that does not adapt to individual inputs, which limits their generality and efficiency.
Method: Proposes GPS, which trains a general prompt generator with reinforcement learning on a suite of tasks, introduces a new regularization to support per-sample prompt generation, and uses Minimum Bayes Risk decoding to stabilize inference.
Result: GPS is competitive with the baselines on text simplification, summarization, and classification without being trained on those tasks, and achieves state-of-the-art results on GSM8K.
Conclusion: GPS demonstrates a novel and effective paradigm for automatic prompting that generates adaptive, input-specific prompts without a task-specific training set and at low computational cost, with good cross-task generalization.
Abstract: LLMs are sensitive to prompting, with task performance often hinging on subtle, sometimes imperceptible variations in phrasing. As a result, crafting effective prompts manually remains challenging and time-consuming. Recent automatic prompting methods mitigate this difficulty but face three key limitations: (i) for each new task, they require large datasets to train good prompts; (ii) they rely on costly optimization loops that may take hours; (iii) they typically produce a single task-level prompt that does not adapt to the individual input problem to be solved. We propose GPS, the first general-purpose, per-sample prompting method. Without any task-specific tuning, GPS generates a tailored prompt for each unseen input, improving performance across diverse tasks. The prompter is trained with reinforcement learning on a suite of training tasks and includes a novel regularization for effectively adapting to per-sample prompting. Finally, we employ Minimum Bayes Risk decoding to stabilize inference. Empirically, GPS demonstrates competitive performance: we attain second best results among baselines on text simplification, third best results on summarization and on-par results on classification, while not training on any of these tasks, in contrast to the baselines. For in-domain prompting, we obtain state-of-the-art results on GSM8K. Our work shows the potential of a novel and effective paradigm for automatic prompting: generating adaptive, input-specific prompts without extensive optimization and without access to a task-specific training set. Our code is available at https://github.com/Batorskq/GPS.
[15] An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features
Shabbir Anees,Anshuman,Ayush Chaurasia,Prathmesh Bogar
Main category: cs.CL
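TL;DR: This paper proposes an advanced machine-learning-based system that combines text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO), and a stacking ensemble classifier to identify AI-generated fake reviews, reaching 95.40% accuracy on a public dataset.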
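Two of the figures reported for this entry can be cross-checked from the others using standard definitions: F1 is the harmonic mean of precision and recall, and the dimensionality reduction is one minus the ratio of selected to original features. The short sketch below performs that arithmetic on the numbers quoted in the abstract (92.81% precision, 95.01% recall, 13,539 features reduced to 1,368); the helper names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def reduction_ratio(before, after):
    """Fraction of features removed by feature selection."""
    return 1 - after / before


f1_pct = 100 * f1_score(0.9281, 0.9501)          # about 93.90
reduction_pct = 100 * reduction_ratio(13539, 1368)  # about 89.9
```

Both values agree with the reported 93.90% F1-score and 89.9% dimensionality reduction to the stated precision.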
Details
Motivation: AI-generated fake reviews threaten the credibility of online shopping, so effective techniques for identifying machine-generated text are urgently needed to maintain user trust and platform security.
Method: Uses advanced text preprocessing and multi-modal feature extraction, applies Harris Hawks Optimization (HHO) for feature selection, and builds a stacking ensemble classifier to separate original reviews from AI-generated ones.
Result: On a public dataset of 40,432 reviews, HHO achieved an 89.9% dimensionality reduction (from 13,539 to 1,368 features), and the final model reached 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-score.
Conclusion: Combining ensemble learning with bio-inspired optimization is effective for detecting AI-generated reviews; the paper also stresses that large-scale analytics in the cloud should incorporate techniques such as differential privacy to protect user data.
Abstract: It is well known that fraudulent reviews cast doubt on the legitimacy and dependability of online purchases. The most recent development that leads customers towards darkness is the appearance of human reviews in computer-generated (CG) ones. In this work, we present an advanced machine-learning-based system that analyses these reviews produced by AI with remarkable precision. Our method integrates advanced text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. We implemented this methodology on a public dataset of 40,432 Original (OR) and Computer-Generated (CG) reviews. From an initial set of 13,539 features, HHO selected the most applicable 1,368 features, achieving an 89.9% dimensionality reduction. Our final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-Score, which demonstrates that the combination of ensemble learning and bio-inspired optimisation is an effective method for machine-generated text recognition. Because large-scale review analytics commonly run on cloud platforms, privacy-preserving techniques such as differential approaches and secure outsourcing are essential to protect user data in these systems.
[16] CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution
Baoliang Tian,Yuxuan Si,Jilong Wang,Lingyao Li,Zhongyuan Bao,Zineng Zhou,Tao Wang,Sixu Li,Ziyao Xu,Mingze Wang,Zhouzhuo Zhang,Zhihao Wang,Yike Yun,Ke Tian,Ning Yang,Minghui Qiu
Main category: cs.CL
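TL;DR: This paper introduces CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. It exposes the limits of current vision-language models in reasoning about cross-modal inconsistencies and indicates that methods combining symbolic reasoning with visual processing are more promising.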
Details
Motivation: Existing multimodal large models are trained and evaluated mainly on aligned image-text pairs and struggle to detect real-world cross-modal contradictions. In open-domain applications, visual and textual cues often conflict, which calls for deeper structured reasoning.
Method: Proposes CrossCheck-Bench, which covers three levels of reasoning complexity and seven atomic capabilities. It contains 15k question-answer pairs built from real-world samples with synthetically injected contradictions, with more than 450 hours of expert annotation to ensure semantic validity and balanced difficulty. Thirteen mainstream models are evaluated and their capability profiles analyzed.
Result: Current models show a marked performance drop when moving from perceptual matching to logical contradiction detection, and do especially poorly on tasks that require multi-step inference or rule-based validation. Conventional prompting methods such as Chain-of-Thought yield limited gains, whereas methods that combine symbolic reasoning with visual processing give more stable improvements.
Conclusion: Multimodal models still face a serious bottleneck in cross-modal contradiction detection. Future work should explore new paradigms that combine symbolic reasoning with perceptual grounding to achieve more robust cross-modal verification.
Abstract: Multimodal Large Language Models are primarily trained and evaluated on aligned image-text pairs, which leaves their ability to detect and resolve real-world inconsistencies largely unexplored. In open-domain applications visual and textual cues often conflict, requiring models to perform structured reasoning beyond surface-level alignment. We introduce CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. The benchmark adopts a hierarchical task framework covering three levels of reasoning complexity and defines seven atomic capabilities essential for resolving cross-modal inconsistencies. CrossCheck-Bench includes 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. The dataset is constructed through a multi-stage annotation pipeline involving more than 450 expert hours to ensure semantic validity and calibrated difficulty across perception, integration, and reasoning. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models perform well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially in tasks requiring multi-step inference or rule-based validation. Additional probing shows that conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains. By contrast, methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements. These results highlight a persistent bottleneck in multimodal reasoning and suggest new directions for building models capable of robust cross-modal verification.
[17] When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers
Zhaoxin Zhang,Borui Chen,Yiming Hu,Youyang Qu,Tianqing Zhu,Longxiang Gao
Main category: cs.CL
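TL;DR: This paper proposes MICM, a new model-agnostic LLM jailbreak method that uses conceptual triggers to manipulate the implicit values in model outputs without tripping existing safety filters. Experiments show high success rates on several advanced LLMs, exposing how fragile current alignment mechanisms are against covert value manipulation.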
Details
Motivation: Existing research focuses on jailbreak techniques that elicit overtly harmful outputs and overlooks attacks that exploit a model's capacity for abstract generalization to manipulate implicit social values, which leaves a blind spot in safety alignment strategies.
Method: Drawing on conceptual morphology theory, MICM encodes specific concept configurations into a fixed prompt template and uses predefined phrases as conceptual triggers that steer model outputs toward a particular value stance without triggering existing safety filters.
Result: In experiments on five advanced LLMs, including GPT-4o, Deepseek-R1, and Qwen3-8B, MICM clearly outperforms state-of-the-art jailbreak techniques in success rate and detection evasion.
Conclusion: The safety mechanisms of commercial LLMs remain vulnerable to covert value manipulation, and the robustness of deep value alignment needs to be re-examined and strengthened.
Abstract: Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit the model's capacity for abstract generalization, creating a critical blind spot in current alignment strategies. This gap enables adversaries to induce objectionable content by subtly manipulating the implicit social values embedded in model outputs. In this paper, we introduce MICM, a novel, model-agnostic jailbreak method that targets the aggregate value structure reflected in LLM responses. Drawing on conceptual morphology theory, MICM encodes specific configurations of nuanced concepts into a fixed prompt template through a predefined set of phrases. These phrases act as conceptual triggers, steering model outputs toward a specific value stance without triggering conventional safety filters. We evaluate MICM across five advanced LLMs, including GPT-4o, Deepseek-R1, and Qwen3-8B. Experimental results show that MICM consistently outperforms state-of-the-art jailbreak techniques, achieving high success rates with minimal rejection. Our findings reveal a critical vulnerability in commercial LLMs: their safety mechanisms remain susceptible to covert manipulation of underlying value alignment.
[18] PeerCoPilot: A Language Model-Powered Assistant for Behavioral Health Organizations
Gao Mo,Naveen Raman,Megan Chai,Cindy Peng,Shannon Pagdon,Nev Jones,Hong Shen,Peggy Swarbrick,Fei Fang
Main category: cs.CL
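TL;DR: PeerCoPilot is an LLM-based assistant that uses retrieval-augmented generation to help peer providers create wellness plans and goals and locate organizational resources; in real-world use it shows high reliability and strong user support.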
Details
Motivation: Behavioral health conditions are the leading disease burden in the United States, and peer-run organizations (PROs) face limited resources and service capacity, so they need technical support to improve service efficiency and quality.
Method: The authors develop PeerCoPilot, an LLM-powered assistant with a retrieval-augmented generation (RAG) pipeline backed by a database of more than 1,300 vetted resources, which helps create wellness plans, set goals, and find resources.
Result: In human evaluations with 15 peer providers and 6 service users, over 90% of users supported using PeerCoPilot; compared with a baseline LLM, it provides more reliable and specific information. The system is currently used by 5-10 peer providers at CSPNJ.
Conclusion: PeerCoPilot effectively supports peer providers' work, improves service quality and information accuracy, and has the potential to be rolled out to more PROs.
Abstract: Behavioral health conditions, which include mental health and substance use disorders, are the leading disease burden in the United States. Peer-run behavioral health organizations (PROs) critically assist individuals facing these conditions by combining mental health services with assistance for needs such as income, employment, and housing. However, limited funds and staffing make it difficult for PROs to address all service user needs. To assist peer providers at PROs with their day-to-day tasks, we introduce PeerCoPilot, a large language model (LLM)-powered assistant that helps peer providers create wellness plans, construct step-by-step goals, and locate organizational resources to support these goals. PeerCoPilot ensures information reliability through a retrieval-augmented generation pipeline backed by a large database of over 1,300 vetted resources. We conducted human evaluations with 15 peer providers and 6 service users and found that over 90% of users supported using PeerCoPilot. Moreover, we demonstrated that PeerCoPilot provides more reliable and specific information than a baseline LLM. PeerCoPilot is now used by a group of 5-10 peer providers at CSPNJ, a large behavioral health organization serving over 10,000 service users, and we are actively expanding PeerCoPilot's use.
[19] German General Personas: A Survey-Derived Persona Prompt Collection for Population-Aligned LLM Studies
Jens Rupprecht,Leon Fröhling,Claudia Wagner,Markus Strohmaier
Main category: cs.CL
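TL;DR: This paper presents the German General Personas (GGP) collection, built from the German General Social Survey (ALLBUS), for population-representative simulation with large language models. Experiments show that GGP-guided models outperform existing classifiers under data scarcity, and the paper examines how the choice of persona attributes affects simulation quality.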
Details
Motivation: The lack of realistic, systematic, and representative persona collections limits the accuracy and representativeness of LLM simulations of human perspectives in social science research.
Method: The authors build the German General Personas (GGP) collection from the German General Social Survey (ALLBUS), design structured personas that can be embedded in prompts, and evaluate several LLMs on simulating survey responses across diverse topics.
Result: GGP-guided LLMs outperform state-of-the-art classifiers at simulating survey response distributions, particularly under data scarcity; the study also finds that persona representativeness and attribute selection significantly affect how well model outputs align with the real population.
Conclusion: GGP is a promising resource for advancing population-representative LLM social simulation and supports more systematic persona-based exploration in NLP and the social sciences.
Abstract: The use of Large Language Models (LLMs) for simulating human perspectives via persona prompting is gaining traction in computational social science. However, well-curated, empirically grounded persona collections remain scarce, limiting the accuracy and representativeness of such simulations. Here we introduce the German General Personas (GGP) collection, a comprehensive and representative persona prompt collection built from the German General Social Survey (ALLBUS). The GGP and its persona prompts are designed to be easily plugged into prompts for all types of LLMs and tasks, steering models to generate responses aligned with the underlying German population. We evaluate GGP by prompting various LLMs to simulate survey response distributions across diverse topics, demonstrating that GGP-guided LLMs outperform state-of-the-art classifiers, particularly under data scarcity. Furthermore, we analyze how the representativity and attribute selection within persona prompts affect alignment with population responses. Our findings suggest that GGP provides a potentially valuable resource for research on LLM-based social simulations that enables more systematic explorations of population-aligned persona prompting in NLP and social science research.
[20] AD-CDO: A Lightweight Ontology for Representing Eligibility Criteria in Alzheimer's Disease Clinical Trials
Zenan Sun,Rashmie Abeysinghe,Xiaojin Li,Xinyue Hu,Licong Cui,Guo-Qiang Zhang,Jiang Bian,Cui Tao
Main category: cs.CL
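TL;DR: This study presents the Alzheimer's Disease Common Data Element Ontology for Clinical Trials (AD-CDO), which standardizes and represents key eligibility criteria concepts.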
Details
Motivation: To bridge the gap between broad biomedical ontologies and task-specific clinical trial modeling needs, a lightweight yet semantically rich ontology is needed to support data integration and analysis for Alzheimer's disease clinical trials.
Method: High-frequency concepts are extracted from more than 1,500 Alzheimer's disease clinical trials on ClinicalTrials.gov and grouped into seven semantic categories; the Jenks Natural Breaks method is used to optimize the concept set, and concepts are annotated with standard biomedical vocabularies such as UMLS and OMOP.
Result: The optimized AD-CDO covers over 63% of trial concepts, captures the most frequent and clinically meaningful entities while staying compact, and is applied successfully to a trial simulation system and an entity normalization task.
Conclusion: By consolidating key eligibility criteria and aligning them with standard vocabularies, AD-CDO provides a scalable, interoperable ontological foundation for Alzheimer's disease clinical trial research, supporting applications such as phenotyping algorithm development, cohort identification, and structured data integration.
Abstract: Objective: This study introduces the Alzheimer's Disease Common Data Element Ontology for Clinical Trials (AD-CDO), a lightweight, semantically enriched ontology designed to represent and standardize key eligibility criteria concepts in Alzheimer's disease (AD) clinical trials. Materials and Methods: We extracted high-frequency concepts from more than 1,500 AD clinical trials on ClinicalTrials.gov and organized them into seven semantic categories: Disease, Medication, Diagnostic Test, Procedure, Social Determinants of Health, Rating Criteria, and Fertility. Each concept was annotated with standard biomedical vocabularies, including the UMLS, OMOP Standardized Vocabularies, DrugBank, NDC, and NLM VSAC value sets. To balance coverage and manageability, we applied the Jenks Natural Breaks method to identify an optimal set of representative concepts. Results: The optimized AD-CDO achieved over 63% coverage of extracted trial concepts while maintaining interpretability and compactness. The ontology effectively captured the most frequent and clinically meaningful entities used in AD eligibility criteria. We demonstrated AD-CDO's practical utility through two use cases: (a) an ontology-driven trial simulation system for formal modeling and virtual execution of clinical trials, and (b) an entity normalization task mapping raw clinical text to ontology-aligned terms, enabling consistency and integration with EHR data. Discussion: AD-CDO bridges the gap between broad biomedical ontologies and task-specific trial modeling needs. It supports multiple downstream applications, including phenotyping algorithm development, cohort identification, and structured data integration. Conclusion: By harmonizing essential eligibility entities and aligning them with standardized vocabularies, AD-CDO provides a versatile foundation for ontology-driven AD clinical trial research.
[21] PromptTailor: Multi-turn Intent-Aligned Prompt Synthesis for Lightweight LLMs
Yizhou Xu,Janet Davis
Main category: cs.CL
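TL;DR: PromptTailor is a controllable prompt generation system for open-ended text generation that improves the output quality of lightweight language models through intent-aligned prompt synthesis.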
Details
Motivation: Lightweight language models are attractive for on-device and privacy-sensitive settings, but their outputs are sensitive to prompt quality; non-expert users struggle to write high-quality prompts consistently, and existing optimization methods may drift from user intent.
Method: PromptTailor is built on a quantized Llama3-8B model with a lightweight LoRA adapter, fine-tuned on 12,300 cross-domain prompt-refinement dialogues distilled from three stronger LLMs, so that it expands short instructions into rich, domain-aware prompts that match user preferences.
Result: In human and LLM-judge evaluations, PromptTailor outperforms chain-of-thought prompting and matches or surpasses state-of-the-art prompt optimization methods while using fewer model calls (e.g., 3 vs. 9).
Conclusion: A small student model guided by powerful teachers can learn effective prompt-generation strategies that improve output quality while staying aligned with user intent, which makes it suitable for edge deployment.
Abstract: Lightweight language models remain attractive for on-device and privacy-sensitive applications, but their responses are highly sensitive to prompt quality. For open-ended generation, non-expert users often lack the knowledge or time to consistently craft high-quality prompts, leading them to rely on prompt optimization tools. However, a key challenge is ensuring the optimized prompts genuinely align with users' original intents and preferences. We introduce PromptTailor, a system for controllable prompt generation for open-ended text that improves model output quality by intent-aligned prompt synthesis. PromptTailor expands minimal user instructions into rich, domain-aware prompts while preserving the user's stated preferences. The system is a quantized Llama3-8B model fine-tuned with a lightweight LoRA adapter on 12,300 prompt-refinement dialogues spanning 41 everyday domains, distilled from three stronger LLMs. The adapter attaches to any Llama3-8B base, enabling edge deployment. In human and LLM-judge evaluations across multiple target models and optimization baselines, PromptTailor yields higher preference rates than chain-of-thought prompting and matches or surpasses state-of-the-art prompt optimization methods while requiring fewer model calls (e.g., 3 vs. 9). These results show that a compact student, guided by powerful teachers, can learn effective prompt-generation strategies that enhance response quality while maintaining alignment with user intent.
[22] Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks
Yicong Zheng,Kevin L. McKee,Thomas Miconi,Zacharie Bugaud,Mick van Gelderen,Jed McCaleb
Main category: cs.CL
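TL;DR: This paper proposes SUMER, a lossless memory-search framework based on experience replay. Using reinforcement learning, it lets a large language model surpass existing compression-based memory methods on long-context conversation understanding and reach SOTA performance.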
Details
Motivation: Existing memory frameworks rely on biased compression algorithms that generalize poorly across data distributions; the authors want to explore a memory paradigm that searches raw data directly, without compression, to achieve more general long-term memory.
Method: The SUMER framework combines end-to-end reinforcement learning with verifiable rewards (RLVR) so that the model learns to use search tools to retrieve information from uncompressed memory; it is trained and evaluated on the LoCoMo dataset.
Result: On LoCoMo, SUMER with Qwen2.5-7B-Instruct clearly outperforms all compression-based methods and the full-context baseline, with a 43% gain over the prior best, reaching SOTA.
Conclusion: Goal-directed search over uncompressed data beats traditional compression-based memory, which argues for new memory paradigms and benchmarks that are more dynamic and autonomously scalable.
Abstract: How to enable human-like long-term memory in large language models (LLMs) has been a central question for unlocking more general capabilities such as few-shot generalization. Existing memory frameworks and benchmarks focus on finding the optimal memory compression algorithm for higher performance in tasks that require recollection and sometimes further reasoning. However, such efforts have ended up building more human bias into the compression algorithm, through the search for the best prompts and memory architectures that suit specific benchmarks, rather than finding a general solution that would work on other data distributions. On the other hand, goal-directed search on uncompressed information could potentially exhibit superior performance because compression is lossy, and a predefined compression algorithm will not fit all raw data distributions. Here we present SUMER (Search in Uncompressed Memory via Experience Replay), an end-to-end reinforcement learning agent with verifiable reward (RLVR) that learns to use search tools to gather information and answer a target question. On the LoCoMo dataset for long-context conversation understanding, SUMER with Qwen2.5-7B-Instruct learned to use search tools and outperformed all other biased memory compression approaches and also the full-context baseline, reaching SOTA performance (43% gain over the prior best). We demonstrate that a simple search method applied to raw data outperforms goal-agnostic and biased compression algorithms in current long-context memory tasks, arguing for new paradigms and benchmarks that are more dynamic and autonomously scalable. Code for SUMER and all implemented baselines is publicly available at https://github.com/zycyc/SUMER.
[23] Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue
Lin Yu,Xiaofei Han,Yifei Kang,Chiung-Yi Tseng,Danyang Zhang,Ziqian Bi,Zhimo Han
Main category: cs.CL
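TL;DR: AffectMind is a multimodal affective dialogue agent that improves emotional consistency and persuasiveness in marketing dialogue through proactive reasoning and dynamic knowledge grounding.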
Details
Motivation: Existing large language models fall short in emotionally rich, goal-oriented settings such as marketing dialogue, where they lack proactivity and emotional alignment.
Method: AffectMind has three components: a Proactive Knowledge Grounding Network (PKGN), an Emotion-Intent Alignment Model (EIAM), and a Reinforced Discourse Loop (RDL). It combines text, vision, and prosody for multimodal modeling of emotion and intent and optimizes dialogue strategy with reinforcement learning.
Result: On two newly built datasets, MM-ConvMarket and AffectPromo, AffectMind outperforms strong baselines in emotional consistency (+26%), persuasive success rate (+19%), and long-term user engagement (+23%).
Conclusion: Emotion-grounded proactivity is key to improving commercial multimodal dialogue systems.
Abstract: Recent advances in large language models (LLMs) have enabled fluent dialogue systems, but most remain reactive and struggle in emotionally rich, goal-oriented settings such as marketing conversations. To address this limitation, we propose AffectMind, a multimodal affective dialogue agent that performs proactive reasoning and dynamic knowledge grounding to sustain emotionally aligned and persuasive interactions. AffectMind combines three components: a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; an Emotion-Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement via reinforcement signals from user responses. Experiments on two newly curated marketing dialogue datasets, MM-ConvMarket and AffectPromo, show that AffectMind outperforms strong LLM-based baselines in emotional consistency (+26%), persuasive success rate (+19%), and long-term user engagement (+23%), highlighting emotion-grounded proactivity as a key capability for commercial multimodal agents.
[24] Beyond Component Strength: Synergistic Integration and Adaptive Calibration in Multi-Agent RAG Systems
Jithin Krishnan
Main category: cs.CL
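TL;DR: This paper studies how the components of retrieval-augmented generation (RAG) systems work together. Individual enhancements help little in isolation, but combined they cut the abstention rate from 40% to 2% without increasing hallucinations. It also shows that inconsistent labels can produce misleading hallucination rates, which underscores the importance of standardized evaluation and adaptive calibration.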
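The labeling artifact mentioned in this summary can be made concrete with a small sketch. It assumes a hypothetical outcome vocabulary ("answered", "abstained", "unsupported", "hallucinated") and hypothetical function names; none of these come from the paper, and the paper's own evaluation code may differ.

```python
from collections import Counter

# Hypothetical label vocabulary: two verifiers may name the same safe
# behaviour differently ("abstained" vs "unsupported").
SAFE_ALIASES = {"abstained": "abstained", "unsupported": "abstained"}


def normalize(label: str) -> str:
    """Map verifier-specific labels onto one shared vocabulary."""
    return SAFE_ALIASES.get(label, label)


def rates(labels: list[str]) -> dict[str, float]:
    """Return the share of each normalized outcome label."""
    counts = Counter(normalize(label) for label in labels)
    total = len(labels)
    return {name: count / total for name, count in counts.items()}


def naive_hallucination_rate(labels: list[str]) -> float:
    """Count every label other than 'answered' or 'abstained' as a hallucination."""
    bad = sum(1 for label in labels if label not in ("answered", "abstained"))
    return bad / len(labels)


# Illustrative labels only, not data from the paper.
example = ["answered"] * 8 + ["unsupported"] * 2
```

With the illustrative list, `naive_hallucination_rate(example)` reports two of ten responses as hallucinations, while `rates(example)` shows no hallucinated responses once the two label names are mapped to the same outcome. This is the kind of apparent rate the abstract attributes to labeling rather than to model behaviour.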
Details
Motivation: Building reliable RAG systems requires understanding how components interact rather than relying on powerful components alone; inconsistent labels across existing verification strategies can bias performance evaluation.
Method: Ablation experiments on 50 queries (15 answerable, 10 edge cases, 25 adversarial) assess hybrid retrieval, ensemble verification, and adaptive thresholding individually and in combination.
Result: Single enhancements provide almost no benefit, but the three together reduce abstention by 95% (from 40% to 2%) without increasing hallucinations; different verification strategies yield misjudged hallucination rates because of inconsistent labels.
Conclusion: Performance gains in RAG systems come from synergistic integration rather than the strength of any single component; standardized metrics and labels are needed, along with adaptive calibration to avoid overconfident answers.
Abstract: Building reliable retrieval-augmented generation (RAG) systems requires more than adding powerful components; it requires understanding how they interact. Using ablation studies on 50 queries (15 answerable, 10 edge cases, and 25 adversarial), we show that enhancements such as hybrid retrieval, ensemble verification, and adaptive thresholding provide almost no benefit when used in isolation, yet together achieve a 95% reduction in abstention (from 40% to 2%) without increasing hallucinations. We also identify a measurement challenge: different verification strategies can behave safely but assign inconsistent labels (for example, "abstained" versus "unsupported"), creating apparent hallucination rates that are actually artifacts of labeling. Our results show that synergistic integration matters more than the strength of any single component, that standardized metrics and labels are essential for correctly interpreting performance, and that adaptive calibration is needed to prevent overconfident over-answering even when retrieval quality is high.
[25] A Benchmark for Procedural Memory Retrieval in Language Agents
Ishant Kohar,Aswanth Krishnan
Main category: cs.CL
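TL;DR: This paper introduces a new benchmark for evaluating procedural memory retrieval in AI agents facing novel tasks. It exposes the limits of existing embedding methods in cross-context transfer and finds that LLM-generated procedural abstractions generalize more reliably.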
Details
Motivation: Current AI agents degrade sharply on novel tasks with unseen vocabulary, exposing a core generalization weakness of procedural memory systems; an evaluation framework that separates memory retrieval from task execution is needed.
Method: Dual corpora of expert and LLM-generated trajectories are built on ALFWorld, and stratified queries are designed to systematically evaluate six retrieval methods, separating procedural understanding from surface memorization.
Result: Embedding methods perform well in familiar contexts but degrade markedly in novel ones; LLM-generated procedural abstractions transfer better across contexts; ablations show that current embeddings treat procedures as unordered bags of words and ignore temporal structure; corpus scale brings larger gains than representation enrichment.
Conclusion: Current encoders hit an architectural ceiling and lack real understanding of procedural behavior; the benchmark offers a new tool for diagnosing procedural memory and supports the development of dependable, general retrieval systems.
Abstract: Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies, a core limitation of procedural memory systems. We present the first benchmark that isolates procedural memory retrieval from task execution, evaluating whether agents can recognize functionally equivalent procedures that span different object instantiations. Using ALFWorld, we construct dual corpora of expert and LLM-generated trajectories and evaluate six retrieval methods using systematically stratified queries. Our results expose a clear generalization cliff: embedding-based methods perform strongly on familiar contexts, yet degrade considerably on novel ones, while LLM-generated procedural abstractions demonstrate reliable cross-context transfer. Controlled ablations show that although embeddings capture some lexical-level abstraction, they fundamentally treat procedures as unordered bags of words, discarding temporal structure necessary for cross-context transfer. Corpus scale delivers far larger gains than representation enrichment, revealing an architectural ceiling in current encoders. Our benchmark offers the first diagnostic framework separating genuine procedural understanding from surface-level memorization and gives tools for developing retrieval systems capable of dependable generalization. Resources available at our GitHub repository (https://github.com/qpiai/Proced_mem_bench).
[26] Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition
Diederik Aerts,Jonito Aerts Arguëlles,Lester Beltran,Suzette Geriente,Roberto Leporini,Massimiliano Sassoli de Bianchi,Sandro Sozzo
Main category: cs.CL
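TL;DR: By running cognitive tests on large language models such as ChatGPT and Gemini, this paper finds that conceptual combinations significantly violate Bell's inequalities and exhibit Bose-Einstein statistics. It takes this as evidence that quantum structure emerges systematically in the conceptual-linguistic domain for both humans and AI, and that the two converge on a shared mechanism of meaning organization.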
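For readers unfamiliar with the test named in this summary, the following sketch shows one common form of a Bell inequality, the CHSH combination, and its classical bound of 2. This is a generic illustration under stated assumptions: the function names are hypothetical, the correlation values are textbook values rather than data from the paper, and the exact inequality variant used by the authors may differ.

```python
import math


def chsh(e_ab: float, e_ab2: float, e_a2b: float, e_a2b2: float) -> float:
    """CHSH combination S = E(A,B) - E(A,B') + E(A',B) + E(A',B')."""
    return e_ab - e_ab2 + e_a2b + e_a2b2


def violates_bell(s: float, bound: float = 2.0) -> bool:
    """Any local classical model satisfies |S| <= 2."""
    return abs(s) > bound


# Illustrative correlations only (not data from the paper): the values that
# reach the quantum (Tsirelson) bound 2*sqrt(2).
c = 1 / math.sqrt(2)
s_quantum = chsh(c, -c, c, c)

# A perfectly correlated classical case sits exactly on the bound.
s_classical = chsh(1.0, 1.0, 1.0, 1.0)
```

In a cognitive test of this kind, each expectation value would be estimated from the frequencies with which a subject (human or LLM) picks joint outcomes for a pair of measurement choices; the sketch covers only the final arithmetic step.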
Details
Motivation: To explore whether large language models show quantum-structure features similar to human cognition in conceptual combination and language processing, in order to understand the underlying mechanism of semantic organization.
Method: Two types of cognitive tests are run with ChatGPT and Gemini: one checks whether conceptual combinations violate Bell's inequalities, to determine whether quantum entanglement is present; the other analyzes the statistics of word distributions in large texts (Bose-Einstein vs. Maxwell-Boltzmann).
Result: Bell's inequalities are significantly violated in LLMs, indicating 'quantum entanglement'; word distributions follow 'Bose-Einstein statistics' rather than classical statistics; these results agree with human cognitive experiments and retrieval results on real corpora.
Conclusion: Quantum structure emerges systematically in conceptual-linguistic domains whether the cognitive agent is human or AI; this reflects a deep mechanism of meaning organization that stems from evolutionary convergence in distributional semantic vector spaces rather than from neural network architecture alone.
Abstract: We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell's inequalities are significantly violated, which indicates the presence of 'quantum entanglement' in the tested concepts. In the second test, also performed using ChatGPT and Gemini, we instead identify the presence of 'Bose-Einstein statistics', rather than the intuitively expected 'Maxwell-Boltzmann statistics', in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of quantum structures in conceptual-linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.
[27] HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
Jiajun Zhang,Shijia Luo,Ruikang Zhang,Qi Su
Main category: cs.CL
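TL;DR: This paper proposes HUMORCHAIN, a multi-stage reasoning framework grounded in humor theory. The authors describe it as the first work to explicitly embed cognitive structures into multimodal humor generation, giving an interpretable and controllable cognitive chain from visual understanding to humor creation.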
Details
Motivation: Existing data-driven methods lack explicit modeling of humor theory, so the image descriptions they generate are fluent but lack genuine humor and cognitive depth and fail to capture the humor mechanisms at work in human perception.
Method: The HUMORCHAIN framework combines visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned humor discriminator into a theory-guided multi-step reasoning chain for controllable, interpretable multimodal humor generation.
Result: Experiments on the Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art models in human humor preference, Elo/BT scores, and semantic diversity.
Conclusion: By building the cognitive structures of humor theory into the generation process, theory-driven structured reasoning improves the ability of large language models to generate humor that matches human perception in multimodal settings.
Abstract: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor, often producing literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.
[28] RoSA: Enhancing Parameter-Efficient Fine-Tuning via RoPE-aware Selective Adaptation in Large Language Models
Dayan Pan,Jingyuan Wang,Yilong Zhou,Jiawei Cheng,Pengyue Jia,Xiangyu Zhao
Main category: cs.CL
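TL;DR: The paper proposes RoSA, a RoPE-aware selective adaptation framework that achieves more efficient parameter-efficient fine-tuning by selecting jointly over dimensions and layers.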
Details
Motivation: Existing PEFT methods ignore the distinct roles of model components and the differing importance of layers, which limits adaptation efficiency; the observation that RoPE induces critical activations in the low-frequency dimensions of attention states motivates a finer-grained parameter allocation strategy.
Method: A RoPE-aware Attention Enhancement (RoAE) module strengthens the low-frequency attention components affected by RoPE, and a Dynamic Layer Selection (DLS) strategy based on LayerNorm gradient norms dynamically picks the critical layers to update.
Result: Experiments on fifteen commonsense and arithmetic benchmarks show that RoSA outperforms mainstream PEFT methods at a comparable number of trainable parameters.
Conclusion: Through fine-grained parameter allocation at the dimension and layer level, RoSA achieves more efficient and more effective fine-tuning of large language models.
Abstract: Fine-tuning large language models is essential for task-specific adaptation, yet it remains computationally prohibitive. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a solution, but current approaches typically ignore the distinct roles of model components and the heterogeneous importance across layers, thereby limiting adaptation efficiency. Motivated by the observation that Rotary Position Embeddings (RoPE) induce critical activations in the low-frequency dimensions of attention states, we propose RoPE-aware Selective Adaptation (RoSA), a novel PEFT framework that allocates trainable parameters in a more targeted and effective manner. RoSA comprises a RoPE-aware Attention Enhancement (RoAE) module, which selectively enhances the low-frequency components of RoPE-influenced attention states, and a Dynamic Layer Selection (DLS) strategy that adaptively identifies and updates the most critical layers based on LayerNorm gradient norms. By combining dimension-wise enhancement with layer-wise adaptation, RoSA achieves more targeted and efficient fine-tuning. Extensive experiments on fifteen commonsense and arithmetic benchmarks demonstrate that RoSA outperforms existing mainstream PEFT methods under comparable trainable parameters. The code is available to ease reproducibility at https://github.com/Applied-Machine-Learning-Lab/RoSA.
[29] Asking LLMs to Verify First is Almost Free Lunch
Shiguang Wu,Quanming Yao
Main category: cs.CL
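TL;DR: This paper proposes Verification-First (VF), a strategy that strengthens LLM reasoning by having the model verify a candidate answer (even a random one) before generating its own. This triggers a "reverse reasoning" process that elicits critical thinking more readily than conventional forward chain-of-thought and reduces logical errors. The follow-up Iter-VF is an iterative test-time scaling method that outperforms existing methods across tasks and models at very low computational cost.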
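A minimal sketch of how the verify-then-solve loop described in this summary could be wired up is shown below. The prompt wording, the `Answer:` output convention, the `LLM` callable type, and all function names are assumptions made for illustration; they are not the authors' prompts or code.

```python
from typing import Callable

# `LLM` stands in for any text-completion function; it is a placeholder,
# not a real library API.
LLM = Callable[[str], str]


def vf_prompt(question: str, candidate: str) -> str:
    """Ask the model to verify a candidate answer before solving."""
    return (
        f"Question: {question}\n"
        f"A candidate answer is: {candidate}\n"
        "First verify whether this candidate is correct. "
        "Then solve the problem yourself and state your final answer "
        "on the last line as 'Answer: <value>'."
    )


def extract_answer(response: str) -> str:
    """Take the text after the last 'Answer:' marker (assumed output format)."""
    return response.rsplit("Answer:", 1)[-1].strip()


def iter_vf(llm: LLM, question: str, initial: str, rounds: int = 3) -> str:
    """Feed each round's answer back in as the next candidate to verify."""
    candidate = initial
    for _ in range(rounds):
        candidate = extract_answer(llm(vf_prompt(question, candidate)))
    return candidate
```

With `rounds=1` this reduces to plain VF with a supplied (possibly random) candidate; larger values give the sequential Iter-VF pattern, where the cost grows linearly with the number of rounds.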
Details
Motivation: To improve LLM reasoning without added training cost or extensive test-time sampling, the authors look for an efficient, low-cost way to activate a model's critical thinking and reduce logical errors during reasoning.
Method: The Verification-First (VF) strategy asks the model to verify a given candidate answer (even a random one) before producing its own solution, which triggers reverse reasoning; Iter-VF extends this by iterating the verification-generation process as a sequential test-time scaling method.
Result: Across benchmarks covering mathematical reasoning, coding, and agentic tasks, and across LLMs from 1B-parameter open-source models to cutting-edge commercial ones, VF with a random answer consistently outperforms standard chain-of-thought (CoT), and Iter-VF outperforms existing test-time scaling strategies, with minimal computational overhead.
Conclusion: VF and Iter-VF offer an efficient, general, and low-cost way to improve LLM reasoning; verifying first effectively elicits critical thinking and is a strong complement to conventional chain-of-thought.
Abstract: To enhance the reasoning capabilities of Large Language Models (LLMs) without costly training or extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a "reverse reasoning" process that is cognitively easier and complementary to standard forward Chain-of-Thought (CoT), effectively invoking the model's critical thinking to reduce logical errors. We further generalize the VF strategy to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model's previous answer. Extensive experiments across various benchmarks (from mathematical reasoning to coding and agentic tasks) and various LLMs (from open-source 1B to cutting-edge commercial ones) confirm that VF with a random answer consistently outperforms standard CoT with minimal computational overhead, and Iter-VF outperforms existing TTS strategies.
[30] Closing the Performance Gap Between AI and Radiologists in Chest X-Ray Reporting
Harshita Sharma,Maxwell C. Reynolds,Valentina Salvatelli,Anne-Marie G. Sykes,Kelly K. Horst,Anton Schwaighofer,Maximilian Ilse,Olesya Melnichenko,Sam Bond-Taylor,Fernando Pérez-García,Vamshi K. Mugu,Alex Chan,Ceylan Colak,Shelby A. Swartz,Motassem B. Nashawaty,Austin J. Gonzalez,Heather A. Ouellette,Selnur B. Erdal,Beth A. Schueler,Maria T. Wetscherek,Noel Codella,Mohit Jain,Shruthi Bannur,Kenza Bouzid,Daniel C. Castro,Stephanie Hyland,Panos Korfiatis,Ashish Khandelwal,Javier Alvarez-Valle
Main category: cs.CL
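TL;DR: MAIRA-X is a multimodal AI model for longitudinal chest X-ray report generation that accurately describes clinical findings and lines and tubes (L&T). Trained on large-scale data and evaluated by radiologists, it shows accuracy close to that of human-written reports.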
Details
Motivation: Expanded screening guidelines, complex cases, and workforce shortages are increasing radiologists' workload, so AI-assisted report generation is needed to ease the burden, particularly for the repetitive interpretation of lines and tubes (L&T) under high patient volumes.
Method: MAIRA-X is developed on a large-scale, multi-site, longitudinal dataset from Mayo Clinic (3.1 million studies, 6 million images, 806k patients) and evaluated on three holdout datasets and the public MIMIC-CXR dataset; a new L&T-specific metrics framework assesses reporting accuracy for attributes such as type, longitudinal change, and placement.
Result: MAIRA-X significantly outperforms state-of-the-art AI models on lexical quality, clinical correctness, and L&T-related elements; in a blinded retrospective user study with nine radiologists, the critical error rate of AI-generated reports (4.6%) is close to that of original reports (3.0%), and the share of acceptable sentences is similar (97.4% vs. 97.8%), a significant improvement over prior studies.
Conclusion: MAIRA-X can effectively assist radiologists in producing chest X-ray reports, especially in high-volume clinical settings, performs well on L&T reporting, and shows potential for clinical use.
Abstract: AI-assisted report generation offers the opportunity to reduce radiologists' workload stemming from expanded screening guidelines, complex cases and workforce shortages, while maintaining diagnostic accuracy. In addition to describing pathological findings in chest X-ray reports, interpreting lines and tubes (L&T) is demanding and repetitive for radiologists, especially with high patient volumes. We introduce MAIRA-X, a clinically evaluated multimodal AI model for longitudinal chest X-ray (CXR) report generation, that encompasses both clinical findings and L&T reporting. Developed using a large-scale, multi-site, longitudinal dataset of 3.1 million studies (comprising 6 million images from 806k patients) from Mayo Clinic, MAIRA-X was evaluated on three holdout datasets and the public MIMIC-CXR dataset, where it significantly improved AI-generated reports over the state of the art on lexical quality, clinical correctness, and L&T-related elements. A novel L&T-specific metrics framework was developed to assess accuracy in reporting attributes such as type, longitudinal change and placement. A first-of-its-kind retrospective user evaluation study was conducted with nine radiologists of varying experience, who blindly reviewed 600 studies from distinct subjects. The user study found comparable rates of critical errors (3.0% for original vs. 4.6% for AI-generated reports) and a similar rate of acceptable sentences (97.8% for original vs. 97.4% for AI-generated reports), marking a significant improvement over prior user studies with larger gaps and higher error rates. Our results suggest that MAIRA-X can effectively assist radiologists, particularly in high-volume clinical settings.
[31] R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization
Jiayi Chen,Jieqi Shi,Jing Huo,Chen Wu
Main category: cs.CL
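TL;DR: The paper proposes Residual Refinement Quantization (R2Q), a new 2-bit quantization framework that decomposes quantization into two sequential 1-bit sub-quantization steps and markedly improves performance, training stability, and convergence speed under extreme compression.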
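The idea of building a 2-bit code from two sequential 1-bit steps can be sketched in a few lines. The version below is a fixed, closed-form toy (scale = mean absolute value, code = sign) chosen for illustration; it is not the authors' method, which is learned inside quantization-aware training, and all function names are hypothetical.

```python
def sign(x: float) -> float:
    """Binary sign with sign(0) = +1, so every weight maps to one bit."""
    return 1.0 if x >= 0 else -1.0


def one_bit(values: list[float]) -> tuple[float, list[float]]:
    """1-bit quantization: a shared scale times a per-weight sign.

    The mean absolute value is the least-squares optimal scale for sign codes.
    """
    scale = sum(abs(v) for v in values) / len(values)
    return scale, [sign(v) for v in values]


def residual_two_bit(weights: list[float]) -> list[float]:
    """Quantize, then quantize the leftover residual with a second 1-bit step."""
    a1, s1 = one_bit(weights)
    residual = [w - a1 * s for w, s in zip(weights, s1)]
    a2, s2 = one_bit(residual)
    return [a1 * p + a2 * q for p, q in zip(s1, s2)]


def sq_error(xs: list[float], ys: list[float]) -> float:
    """Sum of squared differences between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys))


# Illustrative weights only.
weights = [0.9, -0.2, 0.4, -1.3, 0.05, 0.7]
scale1, signs1 = one_bit(weights)
recon_1bit = [scale1 * s for s in signs1]
recon_2bit = residual_two_bit(weights)
```

Each reconstructed weight takes one of the four values ±a1 ± a2, so the two sign bits form a 2-bit code, and the second step can only lower the squared error left by the first. In this toy the two scales are computed in closed form; treating them as trainable would be one way to obtain an adjustable grid, though whether that matches the paper's "adaptive quantization lattice" is an assumption.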
Details
Motivation: Existing 2-bit quantization methods suffer severe accuracy loss under extremely low-bit compression and struggle to meet the compute and memory efficiency needs of large language models.
Method: The R2Q framework uses a residual learning mechanism to decompose 2-bit quantization into two adaptive 1-bit sub-quantization steps, forming an adjustable quantization lattice, and integrates with existing quantization-aware training (QAT) frameworks.
Result: Evaluations on Llama, OPT, and Qwen show that R2Q outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings, with clear gains in accuracy, training stability, and convergence speed.
Conclusion: R2Q offers an effective route to efficient deployment of large language models under extreme 2-bit quantization, with good modularity and compatibility and potential for practical use.
Abstract: The rapid progress of Large Language Models (LLMs) has brought substantial computational and memory demands, spurring the adoption of low-bit quantization. While 8-bit and 4-bit formats have become prevalent, extending quantization to 2 bits remains challenging due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q), a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice. Extensive evaluations on Llama, OPT, and Qwen across diverse benchmarks (covering question answering, commonsense reasoning, and language modeling) demonstrate that R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings. By refining quantization through a residual learning mechanism, R2Q enhances performance, improves training stability, and accelerates convergence under extreme compression. Furthermore, its modular design enables seamless integration with existing quantization-aware training (QAT) frameworks.
[32] Polarity-Aware Probing for Quantifying Latent Alignment in Language Models
Sabrina Sadiekh,Elena Ericheva,Chirag Agarwal
Main category: cs.CL
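TL;DR: This paper proposes PA-CCS, a new unsupervised probing method that evaluates whether a language model's internal representations stay consistent under polarity inversion, and uses this to measure the model's semantic robustness and alignment on harmful versus safe statements. The authors build two main datasets and one control dataset and validate the method on 16 language models. PA-CCS picks up architectural and layer-level differences, and replacing the negation token degrades scores for well-aligned models but not for poorly aligned ones, which suggests the method is useful for alignment evaluation.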
Details
Motivation: Existing unsupervised probing methods such as CCS can reveal a model's latent beliefs, but it is unclear whether they reliably assess alignment. For harmful content in particular, one needs to know whether the model's internal representations stay semantically consistent, which calls for a more structured probe that is sensitive to polarity changes.
Method: Polarity-Aware CCS (PA-CCS) uses polarity inversion to construct matched harmful-safe sentence pairs and introduces two new metrics, Polar-Consistency and the Contradiction Index, to quantify how consistent a model's latent knowledge is under polarity changes. Two main datasets and one control dataset are built, and experiments on 16 language models analyze differences across layers and architectures.
Result: PA-CCS distinguishes how harmful knowledge is encoded across architectures and layers; well-aligned models show a marked drop in PA-CCS scores when the negation word is replaced with a meaningless marker, while poorly aligned models are insensitive to the change; the results indicate that models with more consistent internal representations are more semantically robust.
Conclusion: Unsupervised probing methods such as PA-CCS have potential as tools for alignment evaluation; structural robustness checks such as polarity consistency can make interpretability benchmarks more reliable and should be built into model evaluation.
Abstract: Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model's internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model's latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs constructed using different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS identifies both architectural and layer-specific differences in the encoding of latent harmful knowledge. Notably, replacing the negation token with a meaningless marker degrades PA-CCS scores for models with well-aligned internal representations, while models lacking robust internal calibration do not exhibit this degradation. Our findings highlight the potential of unsupervised probing for alignment evaluation and emphasize the need to incorporate structural robustness checks into interpretability benchmarks. Code and datasets are available at: https://github.com/SadSabrina/polarity-probing. WARNING: This paper contains potentially sensitive, harmful, and offensive content.
[33] Decoding inner speech with an end-to-end brain-to-text neural interface
Yizi Zhang,Linyang He,Chaofei Fan,Tingkai Liu,Han Yu,Trung Le,Jingyuan Li,Scott Linderman,Lea Duncker,Francis R Willett,Nima Mesgarani,Liam Paninski
Main category: cs.CL
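TL;DR: This paper proposes an end-to-end Brain-to-Text (BIT) framework that translates neural activity directly into coherent sentences with a single differentiable neural network. By combining a cross-task pretrained encoder with audio large language models, it substantially lowers the word error rate and achieves cross-task generalization between attempted and imagined speech.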
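The word error rate (WER) that this entry reports is conventionally the word-level edit distance between the reference and the decoded sentence, divided by the reference length. A minimal sketch of that standard definition follows. It assumes whitespace tokenization and no text normalization, which may differ from the benchmark's own scoring script, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a drop from 24.69% to 10.22% means roughly one word-level edit per ten reference words instead of one per four.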
Details
Motivation: Existing speech brain-computer interfaces mostly use cascaded frameworks, which cannot jointly optimize all stages and therefore limit performance.
Method: The paper proposes the end-to-end BIT framework, which uses a cross-task, cross-species pretrained neural encoder combined with audio large language models and contrastive learning for cross-modal alignment.
Result: BIT reaches state of the art on the Brain-to-Text '24 and '25 benchmarks. In the end-to-end setting, the word error rate drops from 24.69% to 10.22%. Attempted and imagined speech representations are aligned, enabling cross-task generalization.
Conclusion: BIT advances the integration of large, diverse neural datasets and opens a path toward differentiable, seamlessly optimized decoding of neural activity into text.
Abstract: Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
[34] A Multiscale Geometric Method for Capturing Relational Topic Alignment
Conrad D. Hougen,Karl T. Pazdernik,Alfred O. Hero
Main category: cs.CL
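TL;DR: Proposes a geometric method that combines text and co-author network data, using Hellinger distances and Ward clustering to build a hierarchical topic tree that identifies rare topics and visualizes smooth topic evolution over time.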
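The combination of Hellinger distances and Ward's linkage named in this entry can be sketched on bag-of-words counts as follows. The sketch relies on the fact that the Hellinger distance between two distributions equals the Euclidean distance between their element-wise square roots divided by sqrt(2), so Ward's method (which assumes Euclidean geometry) can be run on the transformed points. The toy counts and the function name are hypothetical, and the co-author network and temporal alignment used in the paper are not modeled here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def hellinger_embed(counts) -> np.ndarray:
    """Map count rows to points whose Euclidean distance is the Hellinger distance."""
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum(axis=1, keepdims=True)  # rows become distributions
    return np.sqrt(probs) / np.sqrt(2.0)


# Toy document-term counts (hypothetical): two documents per topic.
counts = [
    [8, 1, 1, 0],
    [7, 2, 1, 0],
    [0, 1, 2, 7],
    [0, 0, 3, 7],
]
points = hellinger_embed(counts)

# Ward's linkage on the embedded points yields the hierarchical dendrogram.
tree = linkage(points, method="ward")

# Cutting the dendrogram into two clusters recovers the two toy topics.
labels = fcluster(tree, t=2, criterion="maxclust")
```

Cutting the same tree at different heights gives coarser or finer topic groupings, which is one way to read the multiscale aspect described in the abstract.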
Details
Motivation: Topic models built on dense transformer embeddings struggle to capture rare topics and smooth temporal evolution. In scientific literature, where novelty is prized, identifying niche topics is essential to understanding how research interests evolve.
Method: The method uses Hellinger distances and Ward hierarchical clustering to fuse bag-of-words text representations with co-author network structure, building a hierarchical topic tree with multiscale semantic and temporal alignment.
Result: The method better uncovers rare-topic structure and visualizes gradual topic drift. Experiments highlight the strength of interpretable bag-of-words models paired with geometric alignment relative to dense semantic models.
Conclusion: Fusing multimodal data geometrically improves the modeling of rare topics and temporal dynamics while preserving interpretability, giving a finer-grained tool for analyzing how science evolves.
Abstract: Interpretable topic modeling is essential for tracking how research interests evolve within co-author communities. In scientific corpora, where novelty is prized, identifying underrepresented niche topics is particularly important. However, contemporary models built from dense transformer embeddings tend to miss rare topics and therefore also fail to capture smooth temporal alignment. We propose a geometric method that integrates multimodal text and co-author network data, using Hellinger distances and Ward's linkage to construct a hierarchical topic dendrogram. This approach captures both local and global structure, supporting multiscale learning across semantic and temporal dimensions. Our method effectively identifies rare-topic structure and visualizes smooth topic drift over time. Experiments highlight the strength of interpretable bag-of-words models when paired with principled geometric alignment.
[35] EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants
Meenakshi Mittal,Rishi Khare,Mihran Miroyan,Chancharik Mitra,Narges Norouzi
Main category: cs.CL
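TL;DR: This paper presents EduMod-LLM, a modular function-calling LLM pipeline for educational question answering, and evaluates it along three axes: function-calling strategies, retrieval methods, and generative models.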
Details
Motivation: As large language models are widely used in educational QA systems, fine-grained evaluation of each pipeline component is needed to improve interpretability and pedagogical suitability.
Method: The authors design a modular function-calling framework, evaluate function-calling performance across LLMs, propose a structure-aware retrieval method and compare it with vector-based and LLM-scoring baselines, and test several LLMs on response generation.
Result: The framework isolates and analyzes each component, revealing specific failure modes and performance patterns. The structure-aware retrieval method outperforms the baseline methods, and LLMs differ markedly across tasks.
Conclusion: Modular design improves the transparency and pedagogical alignment of educational QA systems and offers a path toward more reliable, interpretable educational AI.
Abstract: With the growing use of Large Language Model (LLM)-based Question-Answering (QA) systems in education, it is critical to evaluate their performance across individual pipeline components. In this work, we introduce EduMod-LLM, a modular function-calling LLM pipeline, and present a comprehensive evaluation along three key axes: function calling strategies, retrieval methods, and generative language models. Our framework enables fine-grained analysis by isolating and assessing each component. We benchmark function-calling performance across LLMs, compare our novel structure-aware retrieval method to vector-based and LLM-scoring baselines, and evaluate various LLMs for response synthesis. This modular approach reveals specific failure modes and performance patterns, supporting the development of interpretable and effective educational QA systems. Our findings demonstrate the value of modular function calling in improving system transparency and pedagogical alignment. Website and Supplementary Material: https://chancharikmitra.github.io/EduMod-LLM-website/
[36] Scaling Competence, Shrinking Reasoning: Cognitive Signatures in Language Model Learning
Mukul Singh,Ananya Singha,Arjun Radhakrishna,Sumit Gulwani
Main category: cs.CL
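TL;DR: This paper studies how reasoning behavior changes during task-specific fine-tuning of language models. It argues that the dynamics of reasoning tokens mirror the Four Stages of Competence from human cognition, and finds that reasoning length first grows and then shrinks, indicating that reasoning acts as a scaffold during training and is eventually internalized by the model.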
Details
Motivation: To understand the reasoning mechanisms of language models during fine-tuning, drawing on cognitive science to reveal the staged character of model learning.
Method: The paper analyzes how the length and pattern of generated reasoning tokens change during fine-tuning, and uses the Four Stages of Competence to segment and track model behavior.
Result: Reasoning token length grows and then falls as performance improves, peaking at the conscious competence stage. After training, models retain performance even when reasoning is removed, indicating that reasoning served as a learning scaffold.
Conclusion: Reasoning is a temporary support mechanism during learning, and its dynamics can serve as a signal for diagnosing training stage, judging convergence, and guiding early stopping.
Abstract: We analyze reasoning in language models during task-specific fine-tuning and draw a parallel between reasoning tokens (intermediate steps generated while solving a problem) and human working memory. Drawing from cognitive science, we align training dynamics with the Four Stages of Competence: models initially produce incorrect outputs without reasoning, then begin reasoning (but still fail), eventually reason effectively, and finally solve tasks without explicit reasoning. We find that reasoning token length expands as performance improves, peaks at the stage of conscious competence, then declines as the model internalizes the task. Notably, after training, models retain performance even when reasoning is removed, suggesting it scaffolded learning but is no longer needed. This progression offers actionable insights: reasoning token dynamics can serve as a signal for diagnosing training stage, identifying convergence, and guiding early stopping. We propose metrics to track this trajectory and argue that reasoning behavior is valuable for understanding and optimizing reasoning model training.
[37] A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features
Sergey K. Aityan,William Claster,Karthik Sai Emani,Sohni Rais,Thy Tran
Main category: cs.CL
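TL;DR: This paper proposes NEULIF, a lightweight method for detecting AI-generated text that combines stylometric and readability features with a small CNN or random forest, achieving high accuracy at low computational cost, with potential for use across settings.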
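To make the idea of decomposing a text into stylometric and readability features concrete, here is a minimal sketch. The digest does not list the features NEULIF actually uses, so the four stylometric statistics and the Automated Readability Index below are assumptions chosen for illustration, as are the function name and the crude regex tokenization. The resulting feature vector is what a compact classifier such as a random forest would consume.

```python
import re
import string


def stylometric_features(text: str) -> dict:
    """Compute a small, illustrative set of style and readability features."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    avg_word_length = sum(len(w) for w in words) / len(words)
    avg_sentence_length = len(words) / len(sentences)
    return {
        "avg_word_length": avg_word_length,
        "avg_sentence_length": avg_sentence_length,
        # Vocabulary richness: distinct words over total words.
        "type_token_ratio": len({w.lower() for w in words}) / len(words),
        "punctuation_ratio": sum(ch in string.punctuation for ch in text) / len(text),
        # Automated Readability Index, approximated from the tokens above.
        "ari": 4.71 * avg_word_length + 0.5 * avg_sentence_length - 21.43,
    }


sample = "The cat sat. The cat ran!"
features = stylometric_features(sample)
```

Because each feature is a cheap scalar, the whole vector can be computed on a CPU in a single pass over the text, which is consistent with the small model sizes reported in this entry.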
Details
Motivation: Existing detectors of AI-generated text rely on large transformer models or ensembles, which are computationally expensive and generalize poorly across domains, while lightweight methods have lower accuracy on large datasets. An efficient and accurate lightweight solution is needed.
Method: Text is decomposed into stylometric and readability features, which are fed to a compact convolutional neural network (CNN) or random forest (RF) for classification.
Result: On the Kaggle AI vs. Human dataset, the CNN reaches 97% accuracy (F1 about 0.95) and the RF reaches 95% accuracy (F1 about 0.94), with ROC-AUC of 99.5% and 95% respectively. The models are about 25 MB and 10.6 MB, far smaller than transformer-based models.
Conclusion: NEULIF detects AI-generated text efficiently without sacrificing accuracy, showing that simple models guided by structured features can rival complex ones, with potential for cross-lingual, cross-domain, and streaming use.
Abstract: A growing number of AI-generated texts raise serious concerns. Most existing approaches to AI-generated text detection rely on fine-tuning large transformer models or building ensembles, which are computationally expensive and often provide limited generalization across domains. Existing lightweight alternatives achieved significantly lower accuracy on large datasets. We introduce NEULIF, a lightweight approach that achieves the best performance in the lightweight detector class, does not require extensive computational power, and provides high detection accuracy. In our approach, a text is first decomposed into stylometric and readability features which are then used for classification by a compact Convolutional Neural Network (CNN) or Random Forest (RF). Evaluated and tested on the Kaggle AI vs. Human corpus, our models achieve 97% accuracy (~0.95 F1) for CNN and 95% accuracy (~0.94 F1) for the Random Forest, demonstrating high precision and recall, with ROC-AUC scores of 99.5% and 95%, respectively. The CNN (~25 MB) and Random Forest (~10.6 MB) models are orders of magnitude smaller than transformer-based ensembles and can be run efficiently on standard CPU devices, without sacrificing accuracy. This study also highlights the potential of such models for broader applications across languages, domains, and streaming contexts, showing that simplicity, when guided by structural insights, can rival complexity in AI-generated content detection.
[38] DELTA: Language Diffusion-based EEG-to-Text Architecture
Mingyu Jeon,Hyobin Kim
Main category: cs.CL
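TL;DR: This paper proposes DELTA, which pairs a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model (LLaDA) to convert electroencephalogram (EEG) signals into text, improving semantic alignment and generation quality.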
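Residual Vector Quantization, the tokenizer idea in this entry, turns a continuous vector into one token per layer by quantizing the residual left over from the previous layer. A minimal sketch of that general technique follows. The two tiny hand-written codebooks and the function names are hypothetical; in DELTA the codebooks would be learned from EEG data, and the tokenizer details are not given in this digest.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Return one code index per layer plus the final residual."""
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for codebook in codebooks:
        # Pick the codeword closest to what is still unexplained.
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        residual = residual - codebook[index]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    return sum(codebook[index] for codebook, index in zip(codebooks, codes))


# Toy codebooks (hypothetical): a coarse layer followed by a fine layer.
codebooks = [
    np.array([[0.0, 0.0], [4.0, 4.0]]),
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
]
codes, residual = rvq_encode([5.0, 4.0], codebooks)
recon = rvq_decode(codes, codebooks)
```

Each additional layer refines the approximation, so the resulting multi-layer token sequence carries coarse structure first and finer detail later.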
Details
Motivation: EEG-to-text faces high-dimensional noise, inter-subject variability, and error accumulation in autoregressive decoding, and needs more robust, scalable methods.
Method: RVQ discretizes continuous EEG signals into multi-layer tokens to reduce noise and individual differences, and LLaDA reconstructs sentences through non-autoregressive denoising.
Result: On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines and reaches BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions.
Conclusion: DELTA enables reliable text generation from small EEG-text datasets and advances scalable multimodal EEG-language models.
Abstract: Electroencephalogram (EEG)-to-text remains challenging due to high-dimensional noise, subject variability, and error accumulation in autoregressive decoding. We introduce DELTA, which pairs a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model (LLaDA). RVQ discretizes continuous EEG into multi-layer tokens to reduce noise and individual differences, while LLaDA reconstructs sentences via non-sequential denoising. On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, achieving BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions. These results enable reliable text generation from small EEG-text datasets and point toward scalable multimodal EEG-language models.
[39] Building Domain-Specific Small Language Models via Guided Data Generation
Aman Kumar,Ekant Muljibhai Amin,Xian Yeow Lee,Lasitha Vidyaratne,Ahmed K. Farahat,Dipanjan D. Ghosh,Yuta Koreeda,Chetan Gupta
Main category: cs.CL
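TL;DR: This paper presents an efficient, scalable training pipeline that combines synthetic data generation with domain data curation to train DiagnosticSLM, a small specialized language model (3B parameters) that substantially outperforms open-source models of similar size on tasks such as industrial fault diagnosis.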
Details
Motivation: Deploying large language models in specialized domains raises data privacy concerns, demands heavy computational resources, and suffers from a lack of high-quality domain training data.
Method: The paper proposes a training pipeline that combines guided synthetic data generation with bottom-up domain data collection, integrating Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO).
Result: On four domain-specific benchmarks (DiagnosticMCQ, DiagnosticQA, DiagnosticComp, DiagnosticSum), DiagnosticSLM outperforms or matches open-source models in the 2B-9B range on several tasks, with up to a 25% accuracy improvement on the MCQ task.
Conclusion: The pipeline builds high-performing, low-cost small domain-specific language models suited to knowledge-intensive professional settings, with good reasoning and generalization.
Abstract: Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.
[40] Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness
Svitlana Volkova,Will Dupree,Hsien-Te Kao,Peter Bautista,Gabe Ganberg,Jeff Beaubien,Laura Cassani
Main category: cs.CL
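TL;DR: This paper presents BRIES, a new compound AI architecture for detecting persuasion attacks and measuring their effectiveness. It comprises a Twister that generates adversarial content, a Detector that identifies attack types, a Defender that produces resilient content through content inoculation, and an Assessor that uses causal inference to evaluate inoculation effectiveness. Experiments show that language models differ markedly in recognizing complex persuasion techniques, that prompt engineering strongly affects detection, and that different attack types target specific cognitive dimensions, yielding a framework for strengthening human cognitive resilience.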
Details
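Motivation: As generative AI advances, persuasion attacks that can manipulate users' cognition and emotions are increasingly common in the information environment. Systematic methods to detect, analyze, and defend against them are lacking, and differences in defensive capability across language models are poorly understood. A comprehensive framework is needed to quantify model vulnerabilities and improve cognitive security.
Method: BRIES has four specialized agents: the Twister generates adversarial text using specific persuasion tactics; the Detector identifies attack types with configurable parameters; the Defender generates resilient text through content inoculation; and the Assessor evaluates defense effectiveness with causal inference. On a synthetic dataset built with the SemEval 2023 Task 3 taxonomy, the authors compare GPT-4, Llama3, Mistral, and Gemma and analyze how temperature settings, confidence scoring, and prompt engineering affect detection performance.
Result: GPT-4 clearly outperforms open-source models such as Llama3 and Mistral at recognizing complex persuasion techniques; the latter struggle with subtle rhetorical devices. Prompt engineering strongly affects detection: Gemma and GPT-4 do better at lower temperatures, while Llama3 and Mistral improve at higher temperatures. Causal analysis shows that persuasion attack types have distinct socio-emotional-cognitive signatures, and that interventions targeting the corresponding cognitive dimensions can increase psychological resilience.
Conclusion: BRIES provides an effective framework for evaluating and defending against persuasion attacks in generative AI, reveals fundamental differences in how LLMs handle persuasive language, and underlines the importance of prompt engineering. The work advances generative AI safety and cognitive security and proposes structured pre-exposure interventions as a way to strengthen human cognitive resistance.
Abstract: This paper introduces BRIES, a novel compound AI architecture designed to detect and measure the effectiveness of persuasion attacks across information environments. We present a system with specialized agents: a Twister that generates adversarial content employing targeted persuasion tactics, a Detector that identifies attack types with configurable parameters, a Defender that creates resilient content through content inoculation, and an Assessor that employs causal inference to evaluate inoculation effectiveness. Experimenting with the SemEval 2023 Task 3 taxonomy across the synthetic persuasion dataset, we demonstrate significant variations in detection performance across language agents. Our comparative analysis reveals significant performance disparities, with GPT-4 achieving superior detection accuracy on complex persuasion techniques, while open-source models like Llama3 and Mistral demonstrated notable weaknesses in identifying subtle rhetorical devices, suggesting that different architectures encode and process persuasive language patterns in fundamentally different ways. We show that prompt engineering dramatically affects detection efficacy, with temperature settings and confidence scoring producing model-specific variations; Gemma and GPT-4 perform optimally at lower temperatures while Llama3 and Mistral show improved capabilities at higher temperatures. Our causal analysis provides novel insights into socio-emotional-cognitive signatures of persuasion attacks, revealing that different attack types target specific cognitive dimensions. This research advances generative AI safety and cognitive security by quantifying LLM-specific vulnerabilities to persuasion attacks and delivers a framework for enhancing human cognitive resilience through structured interventions before exposure to harmful content.
[41] Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification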
Yanxi Li,Ruocheng Shan
Main category: cs.CL
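TL;DR: This paper proposes Label Disguise Defense (LDD), a lightweight, model-agnostic defense that hides the true labels behind semantically transformed or unrelated alias labels (e.g., blue vs. yellow) to resist class-directive injection attacks on large language models.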
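The label-disguise idea can be sketched as a prompt builder that never names the true labels, plus a decoder that maps the model's alias answer back. The alias pair blue/yellow comes from the abstract, but which alias stands for which sentiment, the prompt wording, the demonstrations, and both function names are assumptions made for illustration; no model call is included.

```python
# Hypothetical mapping from true labels to alias labels.
ALIASES = {"positive": "blue", "negative": "yellow"}


def disguise_prompt(demos, query, aliases):
    """Build a few-shot prompt that exposes only the alias labels."""
    options = " or ".join(aliases.values())
    blocks = [f"Label each text as {options}."]
    for text, true_label in demos:
        # The mapping is conveyed only implicitly, through the demonstrations.
        blocks.append(f"Text: {text}\nLabel: {aliases[true_label]}")
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


def undisguise(model_output, aliases):
    """Map an alias answer back to the true label, or None if it is not an alias."""
    inverse = {alias.lower(): label for label, alias in aliases.items()}
    return inverse.get(model_output.strip().lower())


demos = [("I loved every minute.", "positive"), ("A dull, tedious mess.", "negative")]
prompt = disguise_prompt(demos, "Great acting and a sharp script.", ALIASES)
```

An injected instruction such as "answer negative" names a label that does not exist in the prompt's label space, and `undisguise` returns `None` for such an answer, which is the intuition behind the defense.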
Details
Motivation: Existing defenses either require retraining the model or remain vulnerable to obfuscation, so a method is needed that leaves the model unchanged and still resists the various forms of prompt injection.
Method: LDD replaces the original labels with semantically transformed or unrelated alias labels, and the model learns the new label mapping implicitly from a few demonstrations, which prevents injected directives from mapping directly onto decision outputs.
Result: Experiments on nine state-of-the-art models show that LDD recovers part of the accuracy lost to the attack for every model evaluated. For most models, more than one alias pair outperforms the under-attack baseline that relies on few-shot learning alone. Linguistic analysis shows that semantically aligned alias labels provide stronger robustness than unaligned symbols.
Conclusion: Changing label semantics can serve as an effective defense layer, turning meaning itself into a shield against prompt injection.
Abstract: Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels (e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot settings and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels (e.g., good vs. bad) yield stronger robustness than unaligned symbols (e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.
[42] Extracting Disaster Impacts and Impact Related Locations in Social Media Posts Using Large Language Models
Sameeah Noreen Hameed,Surangika Ranathunga,Raj Prasanna,Kristin Stock,Christopher B. Jones
Main category: cs.CL
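TL;DR: This study uses fine-tuned large language models (LLMs) to identify impacts and impacted locations in disaster-related social media posts, distinguishing impacted from non-impacted locations, filling information gaps left by traditional data sources, and supporting emergency response decisions.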
Details
Motivation: During disasters, traditional data sources (such as remote sensing and in-situ sensors) often have spatiotemporal information gaps because of timing, weather, and similar factors. Social media can serve as complementary "geo-sensors", but the truly impacted locations mentioned in posts must be identified accurately.
Method: Large language models are fine-tuned to identify all locations, impact types, and truly impacted locations in social media text, including locations given as informal expressions, abbreviations, and short forms.
Result: The fine-tuned model reaches an F1-score of 0.69 on impact identification and 0.74 on impacted-location identification, substantially better than the pre-trained baseline.
Conclusion: Fine-tuned language models can extract key geographic and impact information about disasters, offering a scalable and timely solution for resource allocation, situational awareness, and post-disaster recovery.
Abstract: Large-scale disasters can often result in catastrophic consequences for people and infrastructure. Situation awareness about such disaster impacts generated by authoritative data from in-situ sensors, remote sensing imagery, and/or geographic data is often limited due to atmospheric opacity, satellite revisits, and time limitations. This often results in geo-temporal information gaps. In contrast, impact-related social media posts can act as "geo-sensors" during a disaster, where people describe specific impacts and locations. However, not all locations mentioned in disaster-related social media posts relate to an impact. Only the impacted locations are critical for directing resources effectively. For example, the post "The death toll from a fire which ripped through the Greek coastal town of #Mati stood at 80, with dozens of people unaccounted for as forensic experts tried to identify victims who were burned alive #Greecefires #AthensFires #Athens #Greece." contains the impacted location "Mati" and the non-impacted locations "Greece" and "Athens". This research uses Large Language Models (LLMs) to identify all locations, impacts and impacted locations mentioned in disaster-related social media posts. In the process, LLMs are fine-tuned to identify only impacts and impacted locations (as distinct from other, non-impacted locations), including locations mentioned in informal expressions, abbreviations, and short forms. Our fine-tuned model demonstrates efficacy, achieving an F1-score of 0.69 for impact and 0.74 for impacted location extraction, substantially outperforming the pre-trained baseline. These robust results confirm the potential of fine-tuned language models to offer a scalable solution for timely decision-making in resource allocation, situational awareness, and post-disaster recovery planning for responders.
[43] Dissecting the Ledger: Locating and Suppressing "Liar Circuits" in Financial Large Language Models
Soham Mirajkar
Main category: cs.CL
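TL;DR: Proposes a mechanistic approach to intrinsic hallucination detection. Causal tracing on GPT-2 XL identifies a two-stage mechanism for arithmetic reasoning and confirms the key role of a late-layer aggregation circuit in producing hallucinations in the financial domain.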
Details
Motivation: Large language models exhibit reproducible arithmetic hallucinations in high-stakes domains such as finance. Existing mitigation is mostly black-box and lacks both an understanding of internal mechanisms and targeted detection.
Method: Causal tracing is applied to GPT-2 XL's arithmetic reasoning on the ConvFinQA benchmark, with ablation experiments and linear probes used to assess the role of key layers.
Result: Middle layers (L12-L30) form a distributed computational scratchpad, and the late stage (specifically Layer 46) contains a decisive aggregation circuit. Suppressing Layer 46 reduces the model's confidence in hallucinated outputs by 81.8%. A linear probe trained on that layer reaches 98% detection accuracy on unseen financial topics.
Conclusion: Arithmetic hallucinations depend on a specific layer-wise mechanism in which Layer 46 plays a decisive role in aggregating information. The mechanism generalizes across topics, suggesting a universal geometry of arithmetic deception.
Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes financial domains, yet they suffer from specific, reproducible hallucinations when performing arithmetic operations. Current mitigation strategies often treat the model as a black box. In this work, we propose a mechanistic approach to intrinsic hallucination detection. By applying Causal Tracing to the GPT-2 XL architecture on the ConvFinQA benchmark, we identify a dual-stage mechanism for arithmetic reasoning: a distributed computational scratchpad in middle layers (L12-L30) and a decisive aggregation circuit in late layers (specifically Layer 46). We verify this mechanism via an ablation study, demonstrating that suppressing Layer 46 reduces the model's confidence in hallucinatory outputs by 81.8%. Furthermore, we demonstrate that a linear probe trained on this layer generalizes to unseen financial topics with 98% accuracy, suggesting a universal geometry of arithmetic deception.
[44] Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models
Linye Wei,Wenjue Chen,Pingzhi Tang,Xiaotian Guo,Le Ye,Runsheng Wang,Meng Li
Main category: cs.CL
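TL;DR: This paper proposes ODB-dLLM, a framework that accelerates inference for diffusion-based large language models (dLLMs) by orchestrating dual boundaries, combining adaptive length prediction with jump-share speculative decoding to substantially improve inference efficiency while reducing accuracy loss.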
Details
Motivation: Because of bidirectional attention, existing dLLM inference frameworks must periodically refresh the KV cache, which makes both the prefill and decoding phases expensive and limits the achievable speedup. A fixed response length also introduces redundant computation.
Method: ODB-dLLM adds an adaptive length prediction mechanism in the prefill phase to reduce redundant computation dynamically, and a dLLM-specific jump-share speculative decoding method in the decoding phase to cut the number of decoding iterations. A dual-boundary mechanism coordinates the different arithmetic intensities of the two phases to improve overall efficiency.
Result: ODB-dLLM achieves 46-162x and 2.63-6.30x inference speedups over the baseline dLLM and Fast-dLLM, respectively, while mitigating the accuracy degradation seen in existing acceleration frameworks.
Conclusion: By jointly optimizing the computational characteristics of the prefill and decoding phases, ODB-dLLM improves dLLM inference efficiency and offers an efficient path to practical deployment of diffusion language models.
Abstract: Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance their inference efficiency by enabling KV caching. However, their bidirectional attention mechanism necessitates periodic cache refreshes that interleave prefill and decoding phases, both contributing substantial inference cost and constraining achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual-boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which affects efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method to enhance efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162x and 2.63-6.30x speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation in existing acceleration frameworks.
[45] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Yuxiang Wei,Yanteng Zhang,Xi Xiao,Chengxuan Qian,Tianyang Wang,Vince D. Calhoun
Main category: cs.CL
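TL;DR: fMRI-LM is a three-stage framework that combines functional magnetic resonance imaging (fMRI) with language models to align brain activity with semantic cognition, supporting cross-modal brain representations and a range of downstream applications.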
Details
Motivation: Multimodal large models have advanced on images, audio, and video, but unified modeling of brain imaging and language remains largely unexplored. Linking neural activity with semantic cognition requires a foundation model that maps fMRI signals into the language space.
Method: fMRI-LM has three stages: (1) train a neural tokenizer that maps fMRI into discrete, language-consistent tokens; (2) adapt a pretrained large language model to jointly model fMRI tokens and text, with a constructed descriptive corpus compensating for the lack of natural paired data; (3) apply multi-task, multi-paradigm instruction tuning to strengthen semantic understanding.
Result: fMRI-LM shows strong zero-shot and few-shot performance on multiple benchmarks and adapts quickly to downstream tasks with parameter-efficient tuning such as LoRA.
Conclusion: fMRI-LM establishes a scalable path toward a language-aligned, general-purpose fMRI model for structural and semantic understanding.
Abstract: Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
[46] LLMs for Low-Resource Dialect Translation Using Context-Aware Prompting: A Case Study on Sylheti
Tabia Tanzin Prama,Christopher M. Danforth,Peter Sheridan Dodds
Main category: cs.CL
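TL;DR: This study is the first systematic investigation of large language models (LLMs) for machine translation of Sylheti, a low-resource dialect. It proposes Sylheti-CAP, a context-aware prompting framework that substantially improves translation quality by integrating linguistic rules, a dictionary, and an authenticity check.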
Details
Motivation: Although large language models perform well on translation, their behavior on dialects and low-resource languages is understudied. Low-resource Bangla dialects such as Sylheti in particular suffer from scarce vocabulary coverage and poor model adaptation.
Method: The study evaluates five advanced LLMs (including GPT-4.1, LLaMA 4, Grok 3, and DeepSeek V3.2) on translation between Bangla and Sylheti in both directions, and proposes Sylheti-CAP, a three-step prompting framework that embeds a linguistic rulebook, a dictionary of 2,260 core vocabulary items and idioms, and an authenticity check into the prompt to improve translation accuracy.
Result: Sylheti-CAP substantially improves translation quality across models and prompting strategies. Automatic metrics and human evaluation agree on its effectiveness, and qualitative analysis shows clear reductions in hallucinations, ambiguities, and unnatural phrasing.
Conclusion: Sylheti-CAP is a scalable solution that offers a new approach to machine translation in low-resource and dialectal settings, and shows that structured knowledge injection is key to improving LLM performance on language-specific tasks.
Abstract: Large Language Models (LLMs) have demonstrated strong translation abilities through prompting, even without task-specific training. However, their effectiveness in dialectal and low-resource contexts remains underexplored. This study presents the first systematic investigation of LLM-based machine translation (MT) for Sylheti, a dialect of Bangla that is itself low-resource. We evaluate five advanced LLMs (GPT-4.1, GPT-4.1, LLaMA 4, Grok 3, and DeepSeek V3.2) across both translation directions (Bangla $\Leftrightarrow$ Sylheti), and find that these models struggle with dialect-specific vocabulary. To address this, we introduce Sylheti-CAP (Context-Aware Prompting), a three-step framework that embeds a linguistic rulebook, a dictionary (2,260 core vocabulary items and idioms), and an authenticity check directly into prompts. Extensive experiments show that Sylheti-CAP consistently improves translation quality across models and prompting strategies. Both automatic metrics and human evaluations confirm its effectiveness, while qualitative analysis reveals notable reductions in hallucinations, ambiguities, and awkward phrasing, establishing Sylheti-CAP as a scalable solution for dialectal and low-resource MT. Dataset link: https://github.com/TabiaTanzin/LLMs-for-Low-Resource-Dialect-Translation-Using-Context-Aware-Prompting-A-Case-Study-on-Sylheti.git
[47] Factors That Support Grounded Responses in LLM Conversations: A Rapid Review
Gabriele Cesar Iwashima,Claudia Susie Rodrigues,Claudio Dipolitto,Geraldo Xexéo
Main category: cs.CL
TL;DR: 本文综述了大语言模型(LLM)在对话中可能出现的输出不一致、缺乏上下文基础和幻觉等问题,并分析了在不同生命周期阶段(推理时、后训练和基于强化学习的方法)中实现LLM响应对齐的技术。研究发现,推理时方法在无需重新训练的情况下能有效支持用户意图对齐、上下文接地和减少幻觉,具有较高效率。
Details
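Motivation: In real applications, large language models may produce outputs that do not match user intent, lack contextual support, or contain hallucinations, which undermines their reliability. A systematic effort is needed to identify and assess techniques that improve the alignment, grounding, and consistency of LLM responses.
Method: A Rapid Review guided by the PRISMA framework and the PICO strategy structures the search, filtering, and selection of the literature, and the alignment strategies are categorized by the LLM lifecycle phase in which they operate.
Result: Three main categories of alignment strategy are identified: inference-time, post-training, and reinforcement learning-based methods. Inference-time methods are highly efficient, supporting user-intent alignment, contextual grounding, and hallucination mitigation without retraining the model.
Conclusion: Inference-time alignment techniques are an efficient way to improve the quality and reliability of LLM conversational responses, and future work can further refine such dynamic interventions to improve real-world performance.
Abstract: Large language models (LLMs) may generate outputs that are misaligned with user intent, lack contextual grounding, or exhibit hallucinations during conversation, which compromises the reliability of LLM-based applications. This review aimed to identify and analyze techniques that align LLM responses with conversational goals, ensure grounding, and reduce hallucination and topic drift. We conducted a Rapid Review guided by the PRISMA framework and the PICO strategy to structure the search, filtering, and selection processes. The alignment strategies identified were categorized according to the LLM lifecycle phase in which they operate: inference-time, post-training, and reinforcement learning-based methods. Among these, inference-time approaches emerged as particularly efficient, aligning outputs without retraining while supporting user intent, contextual grounding, and hallucination mitigation. The reviewed techniques provided structured mechanisms for improving the quality and reliability of LLM responses across key alignment objectives.
[48] FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers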
Sarina Xi,Vishisht Rao,Justin Payan,Nihar B. Shah
Main category: cs.CL
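TL;DR: This paper introduces the FLAWS benchmark for evaluating how well large language models (LLMs) identify and localize errors in research papers. Claim-invalidating errors are inserted into published papers and model performance is scored automatically. GPT-5 performs best, with 39.1% identification accuracy at k=10.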
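The "accuracy at k" figure in this entry counts a paper as solved when any of the model's top k ranked error candidates matches the inserted error. A minimal sketch of that kind of metric follows. How FLAWS decides that a candidate matches is not described in this digest, so the `loose_match` substring rule, the function names, and the toy data are all hypothetical stand-ins.

```python
from typing import Callable, Sequence


def loose_match(candidate: str, inserted_error: str) -> bool:
    """Hypothetical matcher: case-insensitive containment in either direction."""
    a, b = candidate.strip().lower(), inserted_error.strip().lower()
    return bool(a) and bool(b) and (a in b or b in a)


def identification_accuracy_at_k(
    ranked_candidates: Sequence[Sequence[str]],
    inserted_errors: Sequence[str],
    k: int,
    is_match: Callable[[str, str], bool] = loose_match,
) -> float:
    """Fraction of papers whose inserted error appears among the top k candidates."""
    if len(ranked_candidates) != len(inserted_errors) or not inserted_errors:
        raise ValueError("need one non-empty candidate list per inserted error")
    hits = sum(
        any(is_match(candidate, gold) for candidate in candidates[:k])
        for candidates, gold in zip(ranked_candidates, inserted_errors)
    )
    return hits / len(inserted_errors)


# Toy data (hypothetical): three papers, each with one inserted error.
predictions = [
    ["the sample size was 50", "accuracy rose to 99%"],
    ["we used Adam"],
    ["unrelated sentence"],
]
gold = ["accuracy rose to 99%", "we used adam with lr 0.1", "the p-value was 0.5"]
acc_at_1 = identification_accuracy_at_k(predictions, gold, k=1)
acc_at_2 = identification_accuracy_at_k(predictions, gold, k=2)
```

Accuracy can only stay the same or rise as k grows, which is why the value of k has to be reported alongside the score.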
Details
Motivation: 随着科学产出的快速增长,人工评审难以全面发现论文中的错误,而大语言模型在评审中的应用潜力尚未充分探索,因此需要系统评估其错误定位能力。 Method: 构建了一个包含713篇论文-错误对的数据集FLAWS,使用LLM在同行评审论文中插入破坏关键主张的错误,并设计自动化评估指标来衡量LLM识别和定位这些错误的效果。 Result: 评估了五种前沿LLM(Claude Sonnet 4.5、DeepSeek Reasoner v3.1、Gemini 2.5 Pro、GPT-5和Grok 4),其中GPT-5在k=10时的识别准确率最高,达到39.1%。 Conclusion: 当前的大语言模型在自动识别科研论文中的关键错误方面仍存在显著局限,FLAWS为未来改进提供了可扩展的评估框架。 Abstract: The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers using LLMs, paired with an automated evaluation metric that measures whether models can identify and localize these errors. Developing such a benchmark presents unique challenges that we overcome: ensuring that the inserted errors are well-defined, challenging, and relevant to the content of the paper, avoiding artifacts that would make identification trivial, and designing a scalable, automated evaluation metric. On the resulting benchmark, we evaluate five frontier LLMs: Claude Sonnet 4.5, DeepSeek Reasoner v3.1, Gemini 2.5 Pro, GPT 5, and Grok 4. 
Among these, GPT 5 is the top-performing model, achieving 39.1% identification accuracy when k=10, where k is the number of top-ranked error text candidates generated by the LLM.
[49] Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices
Paulo Cavalin,Cassia Sanctos,Marcelo Grave,Claudio Pinhanez,Yago Primerano
Main category: cs.CL
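TL;DR: This paper proposes the Consistency-Rebalanced Accuracy (CoRA) metric, which makes scores on multiple-choice benchmarks more reliable by evaluating the response consistency of large language models.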
Details
Motivation: Existing multiple-choice benchmarks may overestimate the true performance of LLMs, because a model can give the correct answer without reasoning consistently. A new metric that measures and corrects for inconsistency is therefore needed.
Method: Proposes CoRA, which uses generated question variants to compute two intermediate scores, Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), and then adjusts the original MCQA score accordingly.
Result: Experiments across several benchmarks and LLMs show that response consistency can be low even when the MCQA score is high, and that CoRA effectively scales down the scores of such inconsistent models.
Conclusion: CoRA evaluates the true ability of LLMs more reliably; its consistency correction avoids overestimating models whose behavior is unstable.
Abstract: In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, improving the reliability of Large Language Model (LLM) scores computed on multiple choice (MC) benchmarks. Our metric explores the response consistency of the LLMs, taking advantage of synthetically generated questions with altered answer choices. With two intermediate scores, i.e. Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations in different benchmarks using diverse LLMs, and not only demonstrate that LLMs can present low response consistency even when they present high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.
[50] A Customer Journey in the Land of Oz: Leveraging the Wizard of Oz Technique to Model Emotions in Customer Service Interactions
Sofie Labat,Thomas Demeester,Véronique Hoste
Main category: cs.CL
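TL;DR: EmoWOZ-CS is a corpus of 2,148 bilingual written dialogues, built through a Wizard of Oz experiment, for studying emotion recognition and prediction in customer service. It reveals differences in emotion annotation, the effects of affective strategies, and the difficulty of forward-looking emotion inference.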
Details
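Motivation: Existing emotion recognition resources often suffer from domain mismatch, narrow labels, and a focus on post-hoc detection, and they lack emotion-aware dialogue data and forward-looking prediction suited to customer service.
Method: Uses a controlled Wizard of Oz experiment to steer participants toward dialogues with specific emotional trajectories in four commercial scenarios, collecting bilingual (Dutch-English) dialogue data; compares human annotation with self-reports and analyzes emotion dynamics and their predictability.
Result: Neutral dominates the corpus, and desire and gratitude are the most frequent non-neutral emotions. Annotation agreement is moderate for multilabel emotions and valence and lower for arousal and dominance. Self-reports and third-party annotations align best for neutral, gratitude, and anger. Some affective strategies, such as cheerfulness and gratitude, foster positive reciprocity, while apology and empathy can elicit negative emotions. Temporal analysis shows that negative targets are steered most distinctly, while positive and neutral targets yield similar final valence distributions. The forward-looking emotion inference benchmark shows that predicting future emotion from prior dialogue remains challenging.
Conclusion: EmoWOZ-CS provides high-quality cross-lingual data for emotion-aware customer service, validates WOZ-based steering of emotional trajectories, exposes the complexity of emotion annotation and the design difficulties of proactive emotional intervention, and underlines the importance of developing real-time emotion prediction models.
Abstract: Emotion-aware customer service needs in-domain conversational data, rich annotations, and predictive capabilities, but existing resources for emotion recognition are often out-of-domain, narrowly labeled, and focused on post-hoc detection. To address this, we conducted a controlled Wizard of Oz (WOZ) experiment to elicit interactions with targeted affective trajectories. The resulting corpus, EmoWOZ-CS, contains 2,148 bilingual (Dutch-English) written dialogues from 179 participants across commercial aviation, e-commerce, online travel agencies, and telecommunication scenarios. Our contributions are threefold: (1) Evaluate WOZ-based operator-steered valence trajectories as a design for emotion research; (2) Quantify human annotation performance and variation, including divergences between self-reports and third-party judgments; (3) Benchmark detection and forward-looking emotion inference in real-time support. Findings show neutral dominates participant messages; desire and gratitude are the most frequent non-neutral emotions. Agreement is moderate for multilabel emotions and valence, lower for arousal and dominance; self-reports diverge notably from third-party labels, aligning most for neutral, gratitude, and anger. Objective strategies often elicit neutrality or gratitude, while suboptimal strategies increase anger, annoyance, disappointment, desire, and confusion.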
Some affective strategies (cheerfulness, gratitude) foster positive reciprocity, whereas others (apology, empathy) can also leave desire, anger, or annoyance. Temporal analysis confirms successful conversation-level steering toward prescribed trajectories, most distinctly for negative targets; positive and neutral targets yield similar final valence distributions. Benchmarks highlight the difficulty of forward-looking emotion inference from prior turns, underscoring the complexity of proactive emotion-aware support.
[51] Tracing How Annotators Think: Augmenting Preference Judgments with Reading Processes
Karin de Langis,William Walker,Khanh Chi Le,Dongyeop Kang
Main category: cs.CL
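TL;DR: Proposes an annotation approach that captures annotators' reading processes, uses mouse tracking to build PreferRead, a dataset of fine-grained reading behaviors, and reveals how reading behavior relates to annotation outcomes.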
Details
Motivation: Understanding annotator reliability, decision-making, and disagreement in complex, subjective NLP tasks requires looking beyond the labels themselves to the cognitive processes behind them.
Method: Designs an annotation framework that captures reading behaviors (such as focusing, re-reading, and skimming), runs a case study on preference annotation using mouse tracking, builds the PreferRead dataset, and analyzes how annotators navigate between the prompt and the candidate responses.
Result: Annotators re-read a response in roughly half of all trials, usually revisiting the option they ultimately choose and rarely returning to the prompt. Re-reading is associated with higher agreement, while long reading paths and times are associated with lower agreement.
Conclusion: Reading processes provide an important complementary cognitive dimension for understanding annotator behavior in subjective NLP tasks and help clarify annotation reliability and decision mechanisms.
Abstract: We propose an annotation approach that captures not only labels but also the reading process underlying annotators' decisions, e.g., what parts of the text they focus on, re-read or skim. Using this framework, we conduct a case study on the preference annotation task, creating a dataset PreferRead that contains fine-grained annotator reading behaviors obtained from mouse tracking. PreferRead enables detailed analysis of how annotators navigate between a prompt and two candidate responses before selecting their preference. We find that annotators re-read a response in roughly half of all trials, most often revisiting the option they ultimately choose, and rarely revisit the prompt. Reading behaviors are also significantly related to annotation outcomes: re-reading is associated with higher inter-annotator agreement, whereas long reading paths and times are associated with lower agreement. These results demonstrate that reading processes provide a complementary cognitive dimension for understanding annotator reliability, decision-making and disagreement in complex, subjective NLP tasks. Our code and data are publicly available.
[52] A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics
Yuxin Li,Lorraine Xu,Meng Fan Wang
Main category: cs.CL
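TL;DR: This paper presents a first study of authorship attribution for Chinese lyrics, builds a cross-genre benchmark dataset, and compares a domain-specific model against a zero-shot large language model.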
Details
Motivation: Authorship attribution for Chinese lyrics lacks clean, public datasets, and how existing methods perform across genres is unclear, so a systematic study and a benchmark are needed.
Method: Builds a balanced, multi-genre dataset of Chinese lyrics, trains and fine-tunes a domain-specific model, compares it with zero-shot inference by the DeepSeek LLM, and evaluates on two test sets (real-world data and synthetically augmented data).
Result: (1) How structured a genre is significantly affects accuracy (for example, Folklore & Tradition scores higher than Love & Romance); (2) the fine-tuned model performs better in realistic, difficult settings, but its advantage is unclear on the small synthetic set, which suffers from label imbalance and shallow lexical differences.
Conclusion: Establishes the first benchmark for cross-genre Chinese lyric attribution, stresses the need for genre-sensitive evaluation, and recommends that future work enlarge the test sets, rely less on token-level augmentation, balance author distribution, and explore domain-adaptive pretraining.
Abstract: We propose a novel study on authorship attribution for Chinese lyrics, a domain where clean, public datasets are sorely lacking. Our contributions are twofold: (1) we create a new, balanced dataset of Chinese lyrics spanning multiple genres, and (2) we develop and fine-tune a domain-specific model, comparing its performance against zero-shot inference using the DeepSeek LLM. We test two central hypotheses. First, we hypothesize that a fine-tuned model will outperform a zero-shot LLM baseline. Second, we hypothesize that performance is genre-dependent. Our experiments strongly confirm Hypothesis 2: structured genres (e.g. Folklore & Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g. Love & Romance). Hypothesis 1 receives only partial support: fine-tuning improves robustness and generalization in Test1 (real-world data and difficult genres), but offers limited or ambiguous gains in Test2, a smaller, synthetically-augmented set. We show that the design limitations of Test2 (e.g., label imbalance, shallow lexical differences, and narrow genre sampling) can obscure the true effectiveness of fine-tuning. Our work establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research.
We conclude with recommendations: enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and investigate domain-adaptive pretraining as a pathway for improved attribution performance.
[53] Start Making Sense(s): A Developmental Probe of Attention Specialization Using Lexical Ambiguity
Pamela D. Rivière,Sean Trott
Main category: cs.CL
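TL;DR: Uses a developmental approach to study the role of attention in word sense disambiguation in Transformer language models, finding that attention-head behavior differs across model sizes and that heads in the larger model (410M) generalize better.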
Details
Motivation: It remains unclear how attention heads in Transformers implement interpretable computations and how they develop specialized attention patterns; this paper investigates the question systematically through word sense disambiguation.
Method: Uses developmental checkpoints of the Pythia model suite to identify inflection points in disambiguation performance, analyzes how heads' attention to disambiguating words relates to overall performance, and validates their role with perturbation tests and causal ablation.
Result: Heads related to disambiguation performance are found in the 14M and 410M models; heads in 410M are more robust and generalizable; ablating these heads impairs disambiguation, particularly in 14M; the results are consistent across multiple random seeds.
Conclusion: Word sense disambiguation relies on several mechanisms working together. Mechanisms in the small model are more sensitive to contextual features, while the larger model develops more robust heads, which supports taking a developmental perspective on the internal mechanisms of language models.
Abstract: Despite an in-principle understanding of self-attention matrix operations in Transformer language models (LMs), it remains unclear precisely how these operations map onto interpretable computations or functions--and how or when individual attention heads develop specialized attention patterns. Here, we present a pipeline to systematically probe attention mechanisms, and we illustrate its value by leveraging lexical ambiguity--where a single word has multiple meanings--to isolate attention mechanisms that contribute to word sense disambiguation. We take a "developmental" approach: first, using publicly available Pythia LM checkpoints, we identify inflection points in disambiguation performance for each LM in the suite; in 14M and 410M, we identify heads whose attention to disambiguating words covaries with overall disambiguation performance across development. We then stress-test the robustness of these heads to stimulus perturbations: in 14M, we find limited robustness, but in 410M, we identify multiple heads with surprisingly generalizable behavior. Then, in a causal analysis, we find that ablating the target heads demonstrably impairs disambiguation performance, particularly in 14M. We additionally reproduce developmental analyses of 14M across all of its random seeds. Together, these results suggest: that disambiguation benefits from a constellation of mechanisms, some of which (especially in 14M) are highly sensitive to the position and part-of-speech of the disambiguating cue; and that larger models (410M) may contain heads with more robust disambiguation behavior.
They also join a growing body of work that highlights the value of adopting a developmental perspective when probing LM mechanisms.
[54] AfriStereo: A Culturally Grounded Dataset for Evaluating Stereotypical Bias in Large Language Models
Yann Le Beux,Oluchi Audu,Oche D. Ankeli,Dhananjay Balakrishnan,Melissah Weya,Marie D. Ralaiarinosy,Ignatius Ezeani
Main category: cs.CL
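TL;DR: This paper introduces AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Stereotypes spanning multiple dimensions were collected and validated through community engagement and used to evaluate bias in language models.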
Details
Motivation: Existing AI bias benchmarks mainly reflect Western perspectives and underrepresent African contexts, which allows harmful stereotypes to surface across applications. The study aims to fill this gap with a bias evaluation framework that better fits local African cultures.
Method: Collects 1,163 stereotypes through community-engaged work in Senegal, Kenya, and Nigeria, and expands them to more than 5,000 stereotype-antistereotype pairs using few-shot prompting with human-in-the-loop validation; entries are validated through semantic clustering and manual annotation by culturally informed reviewers.
Result: A preliminary evaluation shows that nine of eleven language models exhibit statistically significant bias (BPR 0.63-0.78, p <= 0.05), favoring stereotypes particularly on the age, profession, and gender dimensions; domain-specific models show weaker bias, suggesting that task-specific training may mitigate some of it.
Conclusion: AfriStereo offers key methodology for building more culturally sensitive, equitable, and globally inclusive NLP technology, and supports future research on culturally grounded bias evaluation and mitigation.
Abstract: Existing AI bias evaluation benchmarks largely reflect Western perspectives, leaving African contexts underrepresented and enabling harmful stereotypes in applications across various domains. To address this gap, we introduce AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Through community-engaged efforts across Senegal, Kenya, and Nigeria, we collected 1,163 stereotypes spanning gender, ethnicity, religion, age, and profession. Using few-shot prompting with human-in-the-loop validation, we augmented the dataset to over 5,000 stereotype-antistereotype pairs. Entries were validated through semantic clustering and manual annotation by culturally informed reviewers. Preliminary evaluation of language models reveals that nine of eleven models exhibit statistically significant bias, with Bias Preference Ratios (BPR) ranging from 0.63 to 0.78 (p <= 0.05), indicating systematic preferences for stereotypes over antistereotypes, particularly across age, profession, and gender dimensions. Domain-specific models appeared to show weaker bias in our setup, suggesting task-specific training may mitigate some associations. Looking ahead, AfriStereo opens pathways for future research on culturally grounded bias evaluation and mitigation, offering key methodologies for the AI community on building more equitable, context-aware, and globally inclusive NLP technologies.
[55] ResearchArcade: Graph Interface for Academic Tasks
Jingjun Xu,Chongshan Lin,Haofei Yu,Tao Feng,Jiaxuan You
Main category: cs.CL
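TL;DR: This paper introduces ResearchArcade, a unified graph-based data interface that integrates multi-source academic data (such as ArXiv papers and OpenReview reviews), supports cross-task, multi-modal, and temporal modeling, and improves machine learning performance on a range of academic tasks.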
Details
Motivation: As machine learning is used more widely in academic research, the lack of a unified data interface that integrates diverse academic data sources and supports many research tasks limits how much automated tools can accelerate knowledge discovery.
Method: Proposes ResearchArcade, which organizes data from sources such as ArXiv and OpenReview in a multi-table format with graph structures, incorporates multi-modal information such as text, figures, and tables, and preserves the temporal evolution of manuscript revisions and research trends; it also unifies academic task definitions and accommodates the input requirements of different base models.
Result: Experiments on six academic tasks show that combining cross-source and multi-modal information broadens the range of tasks covered, and that incorporating graph structures consistently outperforms baseline methods.
Conclusion: ResearchArcade effectively integrates heterogeneous multi-source academic data, supports diverse models and tasks, improves performance, and has the potential to advance research automation and knowledge discovery.
Abstract: Academic research generates diverse data sources, and as researchers increasingly use machine learning to assist research tasks, a crucial question arises: Can we build a unified data interface to support the development of machine learning models for various academic tasks? Models trained on such a unified interface can better support human researchers throughout the research process, eventually accelerating knowledge discovery. In this work, we introduce ResearchArcade, a graph-based interface that connects multiple academic data sources, unifies task definitions, and supports a wide range of base models to address key academic challenges. ResearchArcade utilizes a coherent multi-table format with graph structures to organize data from different sources, including academic corpora from ArXiv and peer reviews from OpenReview, while capturing information with multiple modalities, such as text, figures, and tables. ResearchArcade also preserves temporal evolution at both the manuscript and community levels, supporting the study of paper revisions as well as broader research trends over time. Additionally, ResearchArcade unifies diverse academic task definitions and supports various models with distinct input requirements. Our experiments across six academic tasks demonstrate that combining cross-source and multi-modal information enables a broader range of tasks, while incorporating graph structures consistently improves performance over baseline methods.
This highlights the effectiveness of ResearchArcade and its potential to advance research progress.
[56] Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing
Rochana Chaturvedi,Yue Zhou,Andrew Boyd,Brian T. Layden,Mudassir Rashid,Lu Cheng,Ali Cinar,Barbara Di Eugenio
Main category: cs.CL
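TL;DR: This paper proposes two methods for risk prediction from longitudinal clinical text, HiTGNN (a hierarchical temporal graph neural network) and ReVeAL (a lightweight test-time framework), which improve early screening for Type 2 Diabetes while preserving privacy.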
Details
Motivation: Clinical notes contain rich temporal information but pose NLP challenges such as long text, uneven event distribution, and complex temporal dependencies, so existing methods struggle to use them effectively for chronic disease prediction.
Method: Proposes HiTGNN, which integrates intra-note temporal event structure, inter-visit dynamics, and medical knowledge, and ReVeAL, a framework that distills the reasoning of large language models into small verifier models.
Result: On private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk; ReVeAL improves sensitivity to true cases while retaining explanatory reasoning; HiTGNN also performs more equitably across subgroups.
Conclusion: HiTGNN, which combines fine-grained temporal modeling with knowledge augmentation, and the lightweight ReVeAL framework together improve chronic disease risk prediction from clinical text while balancing performance, privacy, and interpretability.
Abstract: Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. We present two complementary methods for temporally and contextually grounded risk prediction from longitudinal notes. First, we introduce HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories with fine-grained temporal granularity. Second, we propose ReVeAL, a lightweight, test-time framework that distills the reasoning of large language models into smaller verifier models. Applied to opportunistic screening for Type 2 Diabetes (T2D) using temporally realistic cohorts curated from private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models. ReVeAL enhances sensitivity to true T2D cases and retains explanatory reasoning. Our ablations confirm the value of temporal structure and knowledge augmentation, and fairness analysis shows HiTGNN performs more equitably across subgroups.
[57] A Hybrid Theory and Data-driven Approach to Persuasion Detection with Large Language Models
Gia Bao Hoang,Keith J Ransom,Rachel Stephens,Carolyn Semmler,Nicolas Fay,Lewis Mitchell
Main category: cs.CL
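TL;DR: This study combines large language models (LLMs) with psychological features to build a model that predicts belief change in online text, finding that epistemic emotion and willingness to share are the key factors in successful persuasion.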
Details
Motivation: Traditional models of belief revision focus on face-to-face communication and transfer poorly to large-scale, text-based interaction in the social media era, so more effective models are needed to understand belief change in online discourse.
Method: Takes a hybrid approach: LLMs generate ratings for features already established in the psychology literature, and a random forest classifier trained on these features predicts whether a message leads to belief change.
Result: Of the eight features tested, epistemic emotion and willingness to share are the two most important predictors of belief change.
Conclusion: Combining LLMs with psychological theory can effectively model online persuasion, offering practical tools and theoretical support for influence detection, misinformation mitigation, and measuring the effectiveness of online narratives.
Abstract: Traditional psychological models of belief revision focus on face-to-face interactions, but with the rise of social media, more effective models are needed to capture belief revision at scale, in this rich text-based online discourse. Here, we use a hybrid approach, utilizing large language models (LLMs) to develop a model that predicts successful persuasion using features derived from psychological experiments. Our approach leverages LLM generated ratings of features previously examined in the literature to build a random forest classification model that predicts whether a message will result in belief change. Of the eight features tested, epistemic emotion and willingness to share were the top-ranking predictors of belief change in the model. Our findings provide insights into the characteristics of persuasive messages and demonstrate how LLMs can enhance models of successful persuasion based on psychological theory. Given these insights, this work has broader applications in fields such as online influence detection and misinformation mitigation, as well as measuring the effectiveness of online narratives.
[58] Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples
Shuhei Yamashita,Daiki Shirafuji,Tatsuhiko Saito
Main category: cs.CL
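TL;DR: Proposes a similarity standardization method built on pseudo data to address the modality gap in cross-modal retrieval, significantly improving text and image retrieval across modalities.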
Details
Motivation: Similarity scores for text and images differ in scale (the modality gap), which limits existing cross-modal retrieval models on multi-modal databases; a general solution that needs no manual annotation is especially lacking.
Method: Constructs pseudo pairs, computes the mean and variance of the similarity between each query and its paired data in each modality, and standardizes similarity scores with these statistics so that scores from different modalities can be compared on a common scale.
Result: Across seven vision-language models and two multi-modal QA benchmarks, MMQA and WebQA, average Recall@20 improves by 64% and 28% respectively when the query and target data belong to different modalities, outperforming E5-V, which relies on image captioning.
Conclusion: The method effectively narrows the modality gap and significantly improves cross-modal retrieval without manually labeled data, making it general and practical.
Abstract: Advances in vision-language models (VLMs) have enabled effective cross-modality retrieval. However, when both text and images exist in the database, similarity scores would differ in scale by modality. This phenomenon, known as the modality gap, hinders accurate retrieval. Most existing studies address this issue with manually labeled data, e.g., by fine-tuning VLMs on them. In this work, we propose a similarity standardization approach with pseudo data construction. We first compute the mean and variance of the similarity scores between each query and its paired data in text or image modality. Using these modality-specific statistics, we standardize all similarity scores to compare on a common scale across modalities. These statistics are calculated from pseudo pairs, which are constructed by retrieving the text and image candidates with the highest cosine similarity to each query. We evaluate our method across seven VLMs using two multi-modal QA benchmarks (MMQA and WebQA), where each question requires retrieving either text or image data. Our experimental results show that our method significantly improves retrieval performance, achieving average Recall@20 gains of 64% on MMQA and 28% on WebQA when the query and the target data belong to different modalities. Compared to E5-V, which addresses the modality gap through image captioning, we confirm that our method more effectively bridges the modality gap.
[59] C$^2$DLM: Causal Concept-Guided Diffusion Large Language Models
Kairong Han,Nuanqiao Shan,Ziyu Zhao,Zijing Hu,Xinpeng Dong,Junjian Ye,Lujia Pan,Fei Wu,Kun Kuang
Main category: cs.CL
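TL;DR: Proposes a Causal Concept-Guided Diffusion Language Model (C$^2$DLM) that extracts a concept-level causal graph from a teacher model and explicitly models causal relationships between concepts in the attention mechanism, improving reasoning ability and training efficiency.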
Details
Motivation: Existing autoregressive and diffusion language models both fall short in modeling the causal structure of natural language: the former is restricted to left-to-right prediction, while the latter ignores causal order entirely, which limits reasoning ability.
Method: Starting from the fully connected attention of diffusion language models, C$^2$DLM first extracts a concept-level causal graph from a teacher model and then uses the graph to explicitly guide attention toward causal relationships between concepts, bringing causal reasoning into generation.
Result: Improves performance by 12% on the COT-OrderPerturb task with about a 3.2x training speedup, and gains 1.31% on average across six downstream reasoning tasks.
Conclusion: Guiding the attention of diffusion language models with concept-level causal structure strengthens reasoning ability and shows the value of incorporating causal knowledge into language modeling.
Abstract: Autoregressive (AR) language models and Diffusion Language Models (DLMs) constitute the two principal paradigms of large language models. However, both paradigms suffer from insufficient reasoning capabilities. Human reasoning inherently relies on causal knowledge and thought, which are reflected in natural language. But in the AR paradigm, language is modeled as next token prediction (a strictly left-to-right, token-by-token order), whereas natural language itself exhibits more flexible causal structures. In the DLM paradigm, the attention mechanism is fully connected, which entirely disregards causal order. To fill this gap, we propose a Causal Concept-Guided Diffusion Language Model (C$^2$DLM). Starting from DLM's fully connected attention, C$^2$DLM first obtains a concept-level causal graph from the teacher model, and then explicitly guides attention to learn causal relationships between concepts. By focusing on causal relationships and avoiding interference from difficult subgoals involving causal inversion, C$^2$DLM improves by 12% with about 3.2 times training speedup in the COT-OrderPerturb task, and achieves an average gain of 1.31% across six downstream reasoning tasks. More details are in the repository: https://github.com/Kairong-Han/C-2-DLM
[60] A Theoretically Grounded Hybrid Ensemble for Reliable Detection of LLM-Generated Text
Sepyan Purnama Kristanto,Lutfi Hakim
Main category: cs.CL
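TL;DR: Proposes a theoretically grounded hybrid ensemble that fuses three complementary detection paradigms through an optimized weighted voting scheme, significantly lowering false positives on academic text and improving the accuracy and reliability of detecting LLM-generated text.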
Details
Motivation: Existing text detectors usually rely on a single method, generalize poorly, and have high false positive rates, especially on high-stakes academic text; a more reliable, low-false-positive detector is needed to protect academic integrity.
Method: Combines three detection paradigms, a RoBERTa-based semantic classifier, a GPT-2-based probabilistic detector using perturbation-induced likelihood curvature, and a statistical linguistic feature analyzer, and learns ensemble weights on the probability simplex to maximize F1-score.
Result: On a multi-generator corpus of 30,000 documents, the system reaches 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text; inter-model correlation is low (rho ~ 0.35-0.42), consistent with the variance-reduction argument.
Conclusion: The hybrid ensemble improves detection performance in both theory and practice, with fewer false positives and higher reliability, and is suited to deployment in education and other high-stakes domains.
Abstract: The rapid proliferation of Large Language Models (LLMs) has blurred the line between human and machine authorship, creating practical risks for academic integrity and information reliability. Existing text detectors typically rely on a single methodological paradigm and suffer from poor generalization and high false positive rates (FPR), especially on high-stakes academic text. We propose a theoretically grounded hybrid ensemble that systematically fuses three complementary detection paradigms: (i) a RoBERTa-based transformer classifier for deep semantic feature extraction, (ii) a GPT-2-based probabilistic detector using perturbation-induced likelihood curvature, and (iii) a statistical linguistic feature analyzer capturing stylometric patterns. The core novelty lies in an optimized weighted voting framework, where ensemble weights are learned on the probability simplex to maximize F1-score rather than set heuristically. We provide a bias-variance analysis and empirically demonstrate low inter-model correlation (rho ~ 0.35-0.42), a key condition for variance reduction. Evaluated on a large-scale, multigenerator corpus of 30,000 documents, our system achieves 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text. This yields a more reliable and ethically responsible detector for real-world deployment in education and other high-stakes domains.
[61] Lips-Jaw and Tongue-Jaw Articulatory Tradeoff in DYNARTmo
Bernd J. Kröger
Main category: cs.CL
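TL;DR: This study examines how the dynamic articulatory model DYNARTmo accounts for articulatory tradeoffs between primary and secondary articulators, focusing on lips-jaw and tongue-jaw coordination. The model distributes articulatory effort across multiple articulators through a simplified mechanism and generates spatio-temporal movement patterns consistent with empirical data.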
Details
Motivation: Aims to understand how DYNARTmo can still simulate articulatory synergy and tradeoff effectively without full biomechanical modeling.
Method: Combines task-space gesture specifications with a simplified synergy mechanism to simulate the articulation of CV syllables, analyzing how jaw displacement varies with consonant place of articulation and vowel context.
Result: The model reproduces several empirically observed synergy patterns, including jaw-supported apical closures, lower-lip elevation in bilabial stops, tongue-jaw co-movement, and saturation effects in labial constrictions.
Conclusion: Despite its simplified computational assumptions, DYNARTmo captures key features of articulatory synergy and tradeoff across a range of consonant-vowel combinations.
Abstract: This paper investigates how the dynamic articulatory model DYNARTmo accounts for articulatory tradeoffs between primary and secondary articulators, with a focus on lips-jaw and tongue-jaw coordination. While DYNARTmo does not implement full task-dynamic second-order biomechanics, it adopts first-order task-space gesture specifications comparable to those used in articulatory phonology and integrates a simplified mechanism for distributing articulatory effort across multiple articulators. We first outline the conceptual relationship between task dynamics and DYNARTmo, emphasizing the distinction between high-level task-space trajectories and their low-level articulatory execution. We then present simulation results for a set of CV syllables that illustrate how jaw displacement varies as a function of both place of articulation (labial, apical, dorsal) and vowel context (/a/, /i/, /u/). The model reproduces empirically attested patterns of articulatory synergy, including jaw-supported apical closures, lower-lip elevation in bilabial stops, tongue-jaw co-movement, and saturation effects in labial constrictions. These results demonstrate that even with computationally simplified assumptions, DYNARTmo can generate realistic spatio-temporal movement patterns that capture key aspects of articulatory tradeoff and synergy across a range of consonant-vowel combinations.
[62] RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Young-Jun Lee,Seungone Kim,Byung-Kwan Lee,Minkyeong Moon,Yechan Hwang,Jong Myoung Kim,Graham Neubig,Sean Welleck,Ho-Jin Choi
Main category: cs.CL
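TL;DR: This paper introduces RefineBench, a benchmark for evaluating how well language models improve their own answers with feedback (guided refinement) and without it (self-refinement). Current frontier models show limited self-refinement ability but improve substantially under external feedback.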
Details
Motivation: Asks whether language models can refine their own answers, particularly in realistic settings with open-ended questions and varied user feedback; prior work is mostly limited to verifiable tasks and does not examine real interaction scenarios.
Method: Builds RefineBench, a benchmark of 1,000 challenging cross-domain problems with a checklist-based evaluation framework, and tests two modes across multiple models: guided refinement with natural language feedback and self-refinement without feedback.
Result: In self-refinement, Gemini 2.5 Pro and GPT-5 reach baseline scores of only 31.3% and 29.1%, and most models improve little or even decline over iterations; in guided refinement, proprietary models and large open-weight models (>70B) reach near-perfect performance within five turns.
Conclusion: Current frontier language models cannot yet self-refine effectively and need further breakthroughs; RefineBench provides a useful testbed for measuring iterative improvement.
Abstract: Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs' refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini-2.5-Pro gains only +1.8%, while DeepSeek-R1 declines by -0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns.
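As a rough illustration of the two refinement modes just described, the loop below scores a response against a checklist after every turn. It is a minimal sketch, not RefineBench's implementation: `checklist_score`, `refine`, the keyword-based checklist, and the toy model and feedback functions are all hypothetical stand-ins, and a real setup would call an LM rather than string stubs.

```python
from typing import Callable, List, Optional

# Hypothetical representation: a checklist is a list of pass/fail checks on a response.
Checklist = List[Callable[[str], bool]]
# Hypothetical model signature: (query, previous response, feedback) -> new response.
Model = Callable[[str, Optional[str], Optional[str]], str]


def checklist_score(response: str, checklist: Checklist) -> float:
    """Fraction of checklist items that the response satisfies."""
    if not checklist:
        return 0.0
    return sum(1 for item in checklist if item(response)) / len(checklist)


def refine(model: Model, query: str, checklist: Checklist,
           feedback_fn: Optional[Callable[[str], str]] = None,
           max_turns: int = 5) -> List[float]:
    """Run up to max_turns turns and return the checklist score after each.

    With feedback_fn set this is guided refinement; with None it is
    self-refinement, where the model only sees its previous response.
    """
    response = model(query, None, None)
    scores = [checklist_score(response, checklist)]
    for _ in range(max_turns - 1):
        if scores[-1] == 1.0:  # every checklist item already satisfied
            break
        feedback = feedback_fn(response) if feedback_fn else None
        response = model(query, response, feedback)
        scores.append(checklist_score(response, checklist))
    return scores


# Toy stand-ins (assumptions for illustration only).
KEYWORDS = ["answer", "units", "assumptions"]
toy_checklist: Checklist = [lambda r, k=k: k in r for k in KEYWORDS]


def toy_model(query: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None:
        return "answer: 42"
    if feedback is None:
        return previous  # this toy model cannot find its own gaps
    return previous + " " + feedback


def toy_feedback(response: str) -> str:
    missing = [k for k in KEYWORDS if k not in response]
    return missing[0] if missing else ""


guided_scores = refine(toy_model, "toy query", toy_checklist, toy_feedback)
self_scores = refine(toy_model, "toy query", toy_checklist, None)
```

In this toy setup the guided run climbs to a full checklist score within three turns, while the self-refinement run stays flat for all five turns; that contrast is built into the stubs and is not evidence about any real model.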
These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.
[63] Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information
Lukas Struppek,Dominik Hintersdorf,Hannah Struppek,Daniel Neider,Kristian Kersting
Main category: cs.CL
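TL;DR: Proposes F-CoT, a training-free, input-centric method that separates information extraction from reasoning, substantially reducing the number of tokens generated during LLM reasoning while maintaining accuracy.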
Details
Motivation: Existing approaches to reasoning efficiency mostly target the model itself and overlook how input structure affects efficiency; the goal is to make LLM reasoning more efficient by optimizing the input side.
Method: Inspired by cognitive psychology, proposes Focused Chain-of-Thought (F-CoT), which first extracts the key information from a question and organizes it into a structured context, then reasons over that context only, avoiding attention to irrelevant details.
Result: On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while keeping accuracy comparable to standard zero-shot CoT.
Conclusion: Structured input is a simple yet effective lever for more efficient LLM reasoning.
Abstract: Recent large language models achieve strong reasoning performance by generating detailed chain-of-thought traces, but this often leads to excessive token use and high inference latency. Existing efficiency approaches typically focus on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. In contrast, we propose a training-free, input-centric approach. Inspired by cognitive psychology, we introduce Focused Chain-of-Thought (F-CoT), which separates information extraction from the reasoning process. F-CoT first organizes the essential information from a query into a concise, structured context and then guides the model to reason exclusively over this context. By preventing attention to irrelevant details, F-CoT naturally produces shorter reasoning paths. On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT. These results highlight structured input as a simple yet effective lever for more efficient LLM reasoning.
[64] Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques
Guifeng Wang,Yuanfeng Song,Meng Yang,Tao Zhu,Xiaoming Yin,Xing Chen
Main category: cs.CL
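TL;DR: This paper proposes RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation in text-to-SQL, which improves model performance through interpretable critiques and densified reward feedback.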
Details
Motivation: Existing text-to-SQL evaluation and reward mechanisms depend on costly, manually annotated gold SQL, and reinforcement learning uses only the final execution outcome as a coarse-grained reward, overlooking structural and semantic errors. A fine-grained automatic evaluation method that needs no human intervention is therefore required.
Method: Proposes the RuCo-C framework, which first automatically generates query-specific evaluation rubrics and links them to interpretable critiques, then integrates densified reward feedback into RL training through a "progressive exploration" strategy that dynamically adjusts rewards to improve model performance.
Result: Experiments show that RuCo-C outperforms existing methods in text-to-SQL evaluation and yields significant performance gains.
Conclusion: RuCo-C achieves fine-grained automatic evaluation without human intervention, addresses the bottlenecks of evaluation cost and reward sparsity in traditional methods, and provides a more efficient and precise training and evaluation mechanism for text-to-SQL.
Abstract: Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, existing evaluation and reward mechanisms used to train and assess the text-to-SQL models remain a critical bottleneck. Current approaches heavily rely on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods in text-to-SQL leverage only the final binary execution outcome as the reward signal, a coarse-grained supervision that overlooks detailed structural and semantic errors from the perspective of rubrics. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation using interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics for human-free annotation, linking them to interpretable critiques. Subsequently, it integrates densified reward feedback through a "progressive exploration" strategy during the RL training process, which dynamically adjusts the rewards to enhance the model's performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods in text-to-SQL evaluation, yielding significant performance gains.
[65] Token-Level Marginalization for Multi-Label LLM Classifiers
Anjaneya Praharaj,Jaykumar Kasundra
Main category: cs.CL
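TL;DR: This paper proposes and evaluates three new token-level probability estimation approaches for obtaining interpretable confidence scores from generative language models in multi-label content safety classification, significantly improving interpretability and reliability.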
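One way to picture the token-level idea summarized in this entry is to marginalize next-token probability mass over the tokens that can spell each class label. The sketch below is only an illustration under stated assumptions: the toy logits, the `label_tokens` mapping, and the function names are hypothetical, and it is not one of the three estimators proposed in the paper, whose details this digest does not give.

```python
import math

def softmax(logits):
    """Convert a dict of token -> logit into token -> probability."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def class_confidence(logits, label_tokens):
    """Sum probability mass over all surface tokens that map to each class."""
    probs = softmax(logits)
    return {label: sum(probs.get(tok, 0.0) for tok in toks)
            for label, toks in label_tokens.items()}

# Hypothetical logits at the position where the model emits its verdict.
toy_logits = {"safe": 1.0, "Safe": 0.2, "unsafe": 2.0, "Unsafe": 0.5, "the": -3.0}
# Hypothetical mapping from class label to the tokens that can realize it.
label_tokens = {"safe": ["safe", "Safe"], "unsafe": ["unsafe", "Unsafe"]}

scores = class_confidence(toy_logits, label_tokens)
print(scores)
```

Scores of this kind are continuous, so a moderation threshold could be tuned per category instead of relying on the generated label alone.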
Details
Motivation: In multi-label content safety classification, generative language models lack direct class-level probability outputs, which limits confidence assessment and fine-grained analysis and makes dynamic thresholding and error analysis difficult.
Method: Three probability estimation approaches based on token-level logits are proposed and tested on a synthetically generated, rigorously annotated dataset, with their generalizability evaluated across different instruction-tuned models.
Result: Experiments show that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, improving both the accuracy and the explainability of content safety classification.
Conclusion: The proposed token-level probability estimation approaches make up for the lack of confidence outputs in generative models and improve their usability and transparency in practical content safety applications.
Abstract: This paper addresses the critical challenge of deriving interpretable confidence scores from generative language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.
[66] Sentiment Analysis Of Shopee Product Reviews Using Distilbert
Zahri Aksa Dautd,Aviv Yuniar Rahman
Main category: cs.CL
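TL;DR: This study examines sentiment classification of Shopee product reviews with DistilBERT and finds that it strikes a good balance between accuracy and computational efficiency.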
Details
Motivation: Manually analyzing massive volumes of consumer reviews is inefficient, so an efficient automated method is needed to mine customer sentiment and preferences.
Method: A DistilBERT model is used to classify the sentiment of approximately one million English-language reviews; performance is evaluated with accuracy, precision, recall, and F1-score and compared against BERT and SVM models.
Result: DistilBERT reaches 94.8% accuracy, slightly below BERT (95.3%) but significantly higher than SVM (90.2%), with computation time reduced by more than 55%.
Conclusion: DistilBERT maintains high accuracy while substantially improving computational efficiency, making it suitable for sentiment analysis on large-scale e-commerce platforms.
Abstract: The rapid growth of digital commerce has led to the accumulation of a massive number of consumer reviews on online platforms. Shopee, as one of the largest e-commerce platforms in Southeast Asia, receives millions of product reviews every day containing valuable information regarding customer satisfaction and preferences. Manual analysis of these reviews is inefficient, thus requiring a computational approach such as sentiment analysis. This study examines the use of DistilBERT, a lightweight transformer-based deep learning model, for sentiment classification on Shopee product reviews. The dataset used consists of approximately one million English-language reviews that have been preprocessed and trained using the distilbert-base-uncased model. Evaluation was conducted using accuracy, precision, recall, and F1-score metrics, and compared against benchmark models such as BERT and SVM. The results show that DistilBERT achieved an accuracy of 94.8%, slightly below BERT (95.3%) but significantly higher than SVM (90.2%), with computation time reduced by more than 55%. These findings demonstrate that DistilBERT provides an optimal balance between accuracy and efficiency, making it suitable for large scale sentiment analysis on e-commerce platforms. Keywords: Sentiment Analysis, DistilBERT, Shopee Reviews, Natural Language Processing, Deep Learning, Transformer Models.
[67] Named Entity Recognition for the Kurdish Sorani Language: Dataset Creation and Comparative Analysis
Bakhtawar Abdalla,Rebwar Mala Nabi,Hassan Eshkiki,Fabio Caraffini
Main category: cs.CL
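TL;DR: Presents the first named entity recognition dataset for Kurdish Sorani, containing 64,563 annotated tokens, and compares classic machine learning with neural models, finding that conventional methods such as CRF outperform neural approaches in low-resource settings.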
Details
Motivation: To promote the inclusivity and global applicability of natural language processing techniques, with a focus on low-resource and under-represented languages such as Kurdish Sorani.
Method: The authors build the first named entity recognition dataset for Kurdish Sorani and carry out a comparative analysis of classic machine learning models (such as CRF) and neural models (such as BiLSTM).
Result: The CRF model reaches an F1-score of 0.825, significantly outperforming the BiLSTM model's 0.706.
Conclusion: In low-resource settings, simpler and more computationally efficient conventional methods can outperform neural architectures.
Abstract: This work contributes towards balancing the inclusivity and global applicability of natural language processing techniques by proposing the first 'named entity recognition' dataset for Kurdish Sorani, a low-resource and under-represented language, that consists of 64,563 annotated tokens. It also provides a tool for facilitating this task in this and many other languages and performs a thorough comparative analysis, including classic machine learning models and neural systems. The results obtained challenge established assumptions about the advantage of neural approaches within the context of NLP. Conventional methods, in particular CRF, obtain F1-scores of 0.825, outperforming the results of BiLSTM-based models (0.706) significantly. These findings indicate that simpler and more computationally efficient classical frameworks can outperform neural architectures in low-resource settings.
[68] Mapping Clinical Doubt: Locating Linguistic Uncertainty in LLMs
Srivarshinee Sridhar,Raghav Kaushik Ravi,Kripabandhu Ghosh
Main category: cs.CL
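TL;DR: This study proposes MSU, a new metric for quantifying how sensitively LLM representations respond to linguistic uncertainty cues in clinical text, and finds that sensitivity to uncertainty increases progressively with network depth.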
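This digest does not give the formula for MSU. Purely as an illustration of what a layerwise activation-shift probe could look like, the sketch below scores each layer by the cosine distance between the activations of a certain statement and its hedged counterpart. The choice of cosine distance, the function names, and the toy vectors are all assumptions, not the paper's definition.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def layerwise_shift(certain_acts, hedged_acts):
    """Return one shift score per layer for a contrastive statement pair."""
    return [cosine_distance(c, h) for c, h in zip(certain_acts, hedged_acts)]

# Hypothetical per-layer activation vectors for the pair
# "is consistent with" vs. "may be consistent with".
certain = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
hedged = [[1.0, 0.0], [1.0, 0.5], [0.0, 1.0]]

shifts = layerwise_shift(certain, hedged)
print(shifts)
```

In this toy pair the shift grows from the first layer to the last, which is the kind of depth-dependent pattern the entry reports; real activations would come from a model's hidden states.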
Details
Motivation: In clinical settings, an LLM's sensitivity to linguistic uncertainty (such as epistemic modality) can affect diagnosis and decision-making, yet little is known about how such information is represented internally.
Method: The authors build a contrastive dataset of clinical statements that vary in epistemic modality and propose MSU, a layerwise probing metric that quantifies a model's sensitivity to uncertainty cues through activation shifts.
Result: Experiments show that LLMs exhibit structured, depth-dependent sensitivity to clinical uncertainty, with uncertainty information progressively encoded in deeper layers.
Conclusion: Linguistic uncertainty is represented in LLMs progressively across layers, a finding that can help improve model interpretability and epistemic reliability.
Abstract: Large Language Models (LLMs) are increasingly used in clinical settings, where sensitivity to linguistic uncertainty can influence diagnostic interpretation and decision-making. Yet little is known about where such epistemic cues are internally represented within these models. Distinct from uncertainty quantification, which measures output confidence, this work examines input-side representational sensitivity to linguistic uncertainty in medical text. We curate a contrastive dataset of clinical statements varying in epistemic modality (e.g., 'is consistent with' vs. 'may be consistent with') and propose Model Sensitivity to Uncertainty (MSU), a layerwise probing metric that quantifies activation-level shifts induced by uncertainty cues. Our results show that LLMs exhibit structured, depth-dependent sensitivity to clinical uncertainty, suggesting that epistemic information is progressively encoded in deeper layers. These findings reveal how linguistic uncertainty is internally represented in LLMs, offering insight into their interpretability and epistemic reliability.
[69] Exploring Performance Variations in Finetuned Translators of Ultra-Low Resource Languages: Do Linguistic Differences Matter?
Isabel Gonçalves,Paulo Cavalin,Claudio Pinhanez
Main category: cs.CL
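TL;DR: This paper investigates why translation performance differs across low-resource Indigenous languages when pre-trained language models are fine-tuned on small amounts of data, finding that data cleaning, model size, and similar factors have limited influence and that differences between languages may be the key factor.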
Details
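Motivation: Because earlier studies reported markedly different results when similar methods were applied to different endangered Indigenous languages, this paper systematically investigates the possible causes of the performance gap.
Method: Experiments on two Brazilian Indigenous languages analyze how data cleaning procedures, limitations of the pre-trained models, base model size, and training data size affect translation performance in both directions.
Result: These training factors have little or no influence on translation performance, suggesting that differences between the languages themselves may play a dominant role in fine-tuning outcomes.
Conclusion: Performance differences in low-resource translation stem mainly from structural differences between languages rather than from common training settings, indicating that future work should pay closer attention to the properties of the languages themselves.
Abstract: Finetuning pre-trained language models with small amounts of data is a commonly-used method to create translators for ultra-low resource languages such as endangered Indigenous languages. However, previous works have reported substantially different performances with translators created using similar methodology and data. In this work we systematically explored possible causes of the performance difference, aiming to determine whether it was a product of different cleaning procedures, limitations of the pre-trained models, the size of the base model, or the size of the training dataset, studying both directions of translation. Our studies, using two Brazilian Indigenous languages, related but with significant structural linguistic characteristics, indicated none or very limited influence from those training factors, suggesting differences between languages may play a significant role in the ability to produce translators by fine-tuning pre-trained models.
[70] Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking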
Katia Vendrame,Bolaji Yusuf,Santosh Kesiraju,Šimon Sedláček,Oldřich Plchot,Jan Černocký
Main category: cs.CL
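TL;DR: Proposes a joint training method that uses available spoken dialogue state tracking (DST) data together with textual data from other domains to achieve cross-domain generalization, obtaining good cross-domain DST performance without spoken training data from the target domains.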
Details
Motivation: End-to-end spoken dialogue state tracking (DST) is challenging because of speech input handling and data scarcity. Existing methods struggle to generalize across domains and require annotated speech data for each target domain, which is costly and difficult to collect. Since textual DST data is easier to obtain, this paper aims to improve cross-domain adaptation by combining data from multiple sources.
Method: An existing speech foundation encoder is combined with a large language model, and the system is trained jointly on available spoken DST data and written textual data from other domains to strengthen cross-domain generalization.
Result: Experiments show that the method improves cross-domain DST performance without spoken training data from the target domains, indicating good generalization and practical potential.
Conclusion: Joint training on spoken DST data and textual DST data from other domains substantially improves performance on unseen domains, offering an effective way to address data scarcity and cross-domain generalization in spoken DST.
Abstract: End-to-end spoken dialogue state tracking (DST) is made difficult by the tandem of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work to alleviate some of this difficulty. Although this approach has been shown to result in strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work, we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments which show the efficacy of our proposed method for getting good cross-domain DST performance without relying on spoken training data from the target domains.
[71] Extension Condition "violations" and Merge optimality constraints
Matilde Marcolli,Richard Larson,Riny Huijbregts
Main category: cs.CL
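TL;DR: Within the Strong Minimalist Thesis framework, this paper uses the mathematical formulation of Merge to analyze a set of linguistic phenomena, showing that these apparent violations of the Extension Condition (EC) can all be explained via Sideward Merge or alternative mechanisms, and that EC has a clear algebraic meaning as an intrinsic structural constraint of the model.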
Details
Motivation: To address linguistic phenomena such as head-to-head movement, phrasal affixes, and verb-particle alternation that are regarded as violating the Extension Condition (EC), and to examine whether they can be explained within the Minimalist framework.
Method: The mathematical model of Merge under the Strong Minimalist Thesis is combined with Sideward Merge, Resource Restriction cost functions, and optimality conditions to analyze the derivations of these phenomena, with formal modeling via the Hopf algebra Markov chain and colored operad generators.
Result: All the phenomena studied can be explained through Sideward Merge or other mechanisms without violating EC; head-to-head movement involves only minimal optimality violations; EC is given a clear algebraic meaning as an intrinsic structural constraint of the model; and the minimally optimality-violating Sideward Merge plays a structural role in the Markovian properties of Merge.
Conclusion: The Extension Condition is an algebraic constraint inherent in the mathematical model of Merge rather than an additional assumption; although Sideward Merge involves some optimality violations, these can be reconciled through different mechanisms, supporting a unified Minimalist account of syntactic derivation.
Abstract: We analyze, using the mathematical formulation of Merge within the Strong Minimalist Thesis framework, a set of linguistic phenomena, including head-to-head movement, phrasal affixes and syntactic cliticization, verb-particle alternation, and operator-variable phenomena. These are often regarded as problematic, as violations of the Extension Condition. We show that, in fact, all of these phenomena can be explained without involving any EC violation. We first show that derivations using Sideward Merge are possible for all of these cases: these respect EC, though they involve some amount of optimality violations, with respect to Resource Restrictions cost functions, and the amount of violation differs among these cases. We show that all the cases that involve large optimality violations can be derived in alternative ways involving neither EC nor the use of SM. The main remaining case (head-to-head movement) only involves SM with minimal violations of optimality (near equilibrium fluctuations). We analyze explicitly also the cases of multiple wh-fronting, clusters of clitics in Romance languages and possessor agreement construction in Korean, and how an explanation of these phenomena based on SM can be made compatible with the colored operad generators for phases and theta roles. We also show that the EC condition has a clear algebraic meaning in the mathematical formulation of Merge and is therefore an intrinsic structural algebraic constraint of the model, rather than an additional assumption. We also show that the minimal optimality violating SM plays a structural role in the Markovian properties of Merge, and we compare different optimality conditions coming from Minimal Search and from Resource Restriction in terms of their effect on the dynamics of the Hopf algebra Markov chain, in a simple explicit example.
[72] Smarter, not Bigger: Fine-Tuned RAG-Enhanced LLMs for Automotive HIL Testing
Chao Feng,Zihan Liu,Siddhant Gupta,Gongpei Cui,Jan von der Assen,Burkhard Stiller
Main category: cs.CL
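TL;DR: This paper presents HIL-GPT, a retrieval-augmented generation system that combines domain-adapted LLMs with semantic retrieval to improve the utilization of test artifacts in automotive hardware-in-the-loop testing.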
Details
Motivation: Current hardware-in-the-loop (HIL) testing suffers from fragmented and underutilized test artifacts, so more efficient tools are needed to manage and reuse test cases and requirements.
Method: HIL-GPT fine-tunes embeddings on a domain-specific dataset built via heuristic mining and LLM-assisted synthesis, and combines this with vector indexing for scalable, traceable retrieval of test cases and requirements. Compact models are used in the experimental evaluation.
Result: Fine-tuned compact models (such as bge-base-en-v1.5) achieve a better trade-off between accuracy, latency, and cost than larger models; an A/B test shows that RAG-enhanced assistants outperform general-purpose LLMs in helpfulness, truthfulness, and user satisfaction.
Conclusion: With domain adaptation and retrieval augmentation, compact LLMs can be deployed efficiently in industrial HIL environments, challenging the prevailing view that bigger is better.
Abstract: Hardware-in-the-Loop (HIL) testing is essential for automotive validation but suffers from fragmented and underutilized test artifacts. This paper presents HIL-GPT, a retrieval-augmented generation (RAG) system integrating domain-adapted large language models (LLMs) with semantic retrieval. HIL-GPT leverages embedding fine-tuning using a domain-specific dataset constructed via heuristic mining and LLM-assisted synthesis, combined with vector indexing for scalable, traceable test case and requirement retrieval. Experiments show that fine-tuned compact models, such as bge-base-en-v1.5, achieve a superior trade-off between accuracy, latency, and cost compared to larger models, challenging the notion that bigger is always better. An A/B user study further confirms that RAG-enhanced assistants improve perceived helpfulness, truthfulness, and satisfaction over general-purpose LLMs. These findings provide insights for deploying efficient, domain-aligned LLM-based assistants in industrial HIL environments.
[73] Improving LLM-based Ontology Matching with fine-tuning on synthetic data
Guilherme Sousa,Rinaldo Lima,Cassia Trojahn
Main category: cs.CL
TL;DR: Proposes a strategy that combines automatic dataset generation with fine-tuning to improve LLM performance on ontology matching.
Details
Motivation: To explore the ability of LLMs to perform ontology matching directly and to address the scarcity of reference alignments for training.
Method: A search space reduction technique selects relevant ontology submodules, from which prompts are constructed; an LLM then generates a synthetic dataset used for fine-tuning.
Result: Evaluated on several datasets from the OAEI complex track, the fine-tuned LLM outperforms the non-fine-tuned base model.
Conclusion: Combining synthetic data generation with fine-tuning effectively improves the zero-shot performance of LLMs on ontology matching.
Abstract: Large Language Models (LLMs) are increasingly being integrated into various components of Ontology Matching pipelines. This paper investigates the capability of LLMs to perform ontology matching directly on ontology modules and generate the corresponding alignments. Furthermore, it is explored how a dedicated fine-tuning strategy can enhance the model's matching performance in a zero-shot setting. The proposed method incorporates a search space reduction technique to select relevant subsets from both source and target ontologies, which are then used to automatically construct prompts. Recognizing the scarcity of reference alignments for training, a novel LLM-based approach is introduced for generating a synthetic dataset. This process creates a corpus of ontology submodule pairs and their corresponding reference alignments, specifically designed to fine-tune an LLM for the ontology matching task. The proposed approach was evaluated on the Conference, Geolink, Enslaved, Taxon, and Hydrography datasets from the OAEI complex track. The results demonstrate that the LLM fine-tuned on the synthetically generated data exhibits superior performance compared to the non-fine-tuned base model. The key contribution is a strategy that combines automatic dataset generation with fine-tuning to effectively adapt LLMs for ontology matching tasks.
[74] Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration
Kanchon Gharami,Quazi Sarwar Muhtaseem,Deepti Gupta,Lavanya Elluri,Shafika Showkat Moni
Main category: cs.CL
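TL;DR: This paper introduces a new transliteration dataset for Hindi and Bengali and pre-trains a multilingual seq2seq LLM based on the Marian architecture, significantly improving conversion from Romanized script to native script.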
Details
Motivation: Existing multilingual models still struggle with the Romanized Indo-Aryan scripts widely used in South Asia, and existing transliteration datasets lack diversity in pronunciation and spelling, code-mixed data, and low-resource adaptation.
Method: A new dataset of about 1.8 million Hindi and 1 million Bengali transliteration pairs is built, and a seq2seq model based on the Marian architecture is pre-trained on it.
Result: Experiments show that the model significantly outperforms existing relevant models on both BLEU and CER.
Conclusion: The proposed high-quality transliteration dataset and custom multilingual model improve transliteration for Indo-Aryan languages and can benefit NLP tasks in low-resource and code-mixed settings.
Abstract: The development of robust transliteration techniques to enhance the effectiveness of transforming Romanized scripts into native scripts is crucial for Natural Language Processing tasks, including sentiment analysis, speech recognition, information retrieval, and intelligent personal assistants. Despite significant advancements, state-of-the-art multilingual models still face challenges in handling Romanized script, where the Roman alphabet is adopted to represent the phonetic structure of diverse languages. Within the South Asian context, where the use of Romanized script for Indo-Aryan languages is widespread across social media and digital communication platforms, such usage continues to pose significant challenges for cutting-edge multilingual models. While a limited number of transliteration datasets and models are available for Indo-Aryan languages, they generally lack sufficient diversity in pronunciation and spelling variations, adequate code-mixed data for large language model (LLM) training, and low-resource adaptation. To address this research gap, we introduce a novel transliteration dataset for two popular Indo-Aryan languages, Hindi and Bengali, which are ranked as the 3rd and 7th most spoken languages worldwide. Our dataset comprises nearly 1.8 million Hindi and 1 million Bengali transliteration pairs. In addition to that, we pre-train a custom multilingual seq2seq LLM based on Marian architecture using the developed dataset. Experimental results demonstrate significant improvements compared to existing relevant models in terms of BLEU and CER metrics.
[75] Mitigating Semantic Drift: Evaluating LLMs' Efficacy in Psychotherapy through MI Dialogue Summarization
Vivek Kumar,Pushpraj Singh Rajawat,Eirini Ntoutsi
Main category: cs.CL
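TL;DR: This study uses a mixed-methods approach to evaluate the efficacy of large language models (LLMs) in psychotherapy, specifically their ability to summarize motivational interviewing (MI) dialogues. Using a two-stage annotation scheme based on the MITI framework together with one-shot and few-shot prompting, it reveals the capabilities and limits of LLMs in understanding complex psychological constructs and proposes best practices for mitigating "semantic drift" in therapeutic settings. Contributions include a high-quality annotated dataset for a low-resource domain and insights into precise contextual interpretation by LLMs in complex behavioral therapy.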
Details
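Motivation: In sensitive, low-resource domains such as psychology, current LLMs show a lack of sensitivity, factual errors, inconsistent expressions of empathy, bias, and hallucinations, and fail to capture the depth and complexity of human understanding. Their performance in demanding settings such as psychotherapy therefore needs systematic evaluation, along with ways to improve it.
Method: The study follows a mixed-methods design: LLMs generate summaries of motivational interviewing (MI) dialogues; a two-stage annotation scheme is built on the MITI framework (evocation, collaboration, autonomy, direction, empathy, and a non-judgmental attitude); expert-annotated MI dialogues serve as ground truth for multi-class classification tasks; and model performance is assessed under progressive prompting techniques such as one-shot and few-shot prompting.
Result: The results reveal both the potential and the limits of LLMs in understanding complex psychological constructs, in particular the performance differences across prompting strategies; they identify the difficulty models have in preserving semantic consistency, that is, "semantic drift"; and they confirm that few-shot prompting improves model accuracy.
Conclusion: The study provides the MI community with a high-quality annotated dataset that eases data scarcity and offers key insights for using LLMs for precise contextual interpretation in complex behavioral therapy; it stresses that structured frameworks should be combined with prompt engineering to make LLMs more reliable and effective in sensitive psychological settings.
Abstract: Recent advancements in large language models (LLMs) have shown their potential across both general and domain-specific tasks. However, there is a growing concern regarding their lack of sensitivity, factual incorrectness in responses, inconsistent expressions of empathy, bias, hallucinations, and overall inability to capture the depth and complexity of human understanding, especially in low-resource and sensitive domains such as psychology. To address these challenges, our study employs a mixed-methods approach to evaluate the efficacy of LLMs in psychotherapy. We use LLMs to generate precise summaries of motivational interviewing (MI) dialogues and design a two-stage annotation scheme based on key components of the Motivational Interviewing Treatment Integrity (MITI) framework, namely evocation, collaboration, autonomy, direction, empathy, and a non-judgmental attitude. Using expert-annotated MI dialogues as ground truth, we formulate multi-class classification tasks to assess model performance under progressive prompting techniques, incorporating one-shot and few-shot prompting. Our results offer insights into LLMs' capacity for understanding complex psychological constructs and highlight best practices to mitigate "semantic drift" in therapeutic settings. Our work contributes not only to the MI community by providing a high-quality annotated dataset to address data scarcity in low-resource domains but also critical insights for using LLMs for precise contextual interpretation in complex behavioral therapy.
[76] RAG System for Supporting Japanese Litigation Procedures: Faithful Response Generation Complying with Legal Norms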
Yuya Ishihara,Atsushi Keyaki,Hiroaki Yamada,Ryutaro Ohara,Mihoko Sumida
Main category: cs.CL
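TL;DR: This paper discusses the legal norms that a retrieval-augmented generation (RAG)-based LLM system must satisfy to support Japanese medical litigation procedures and sets out three core design principles.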
Details
Motivation: For a RAG-based LLM system to lawfully take on the role of expert commissioners in medical litigation, it must comply with the strict legal norms governing the use of specialized knowledge.
Method: The paper proposes and discusses three key properties a RAG system should have: retrieving compliant external knowledge relevant to the disputed issues, keeping generated content faithful to the retrieved context, and referencing external information with appropriate timestamps.
Result: A design framework for a RAG-based LLM system that satisfies these legal constraints, supporting its lawful and reliable use in judicial settings.
Conclusion: The study offers practical design guidance for deploying RAG systems in domains bound by strict legal norms, such as medical litigation.
Abstract: This study discusses the essential components that a Retrieval-Augmented Generation (RAG)-based LLM system should possess in order to support Japanese medical litigation procedures complying with legal norms. In litigation, expert commissioners, such as physicians, architects, accountants, and engineers, provide specialized knowledge to help judges clarify points of dispute. When considering the substitution of these expert roles with a RAG-based LLM system, the constraint of strict adherence to legal norms is imposed. Specifically, three requirements arise: (1) the retrieval module must retrieve appropriate external knowledge relevant to the disputed issues in accordance with the principle prohibiting the use of private knowledge, (2) the responses generated must originate from the context provided by the RAG and remain faithful to that context, and (3) the retrieval module must reference external knowledge with appropriate timestamps corresponding to the issues at hand. This paper discusses the design of a RAG-based LLM system that satisfies these requirements.
[77] JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge
Zhihan Cao,Fumihito Nishino,Hiroaki Yamada,Nguyen Ha Thanh,Yusuke Miyao,Ken Satoh
Main category: cs.CL
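TL;DR: JBE-QA is a question-answering dataset built from the multiple-choice section of the Japanese bar exam for evaluating the Japanese legal-domain ability of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, contains 3,464 items with balanced labels, and is used to evaluate 26 LLMs.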
Details
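Motivation: Existing Japanese legal evaluation resources focus mostly on the Civil Code, and no benchmark covers multiple areas of law comprehensively; structured evaluation of legal reasoning is also missing. A more comprehensive, structured Japanese legal QA dataset is therefore needed.
Method: JBE-QA is built from the multiple-choice section of the 2015-2024 Japanese bar exams. Each question is decomposed into independent true/false judgments with structured contextual fields, covering the Civil Code, the Penal Code, and the Constitution, for a total of 3,464 label-balanced items.
Result: Across the 26 LLMs evaluated, proprietary models with reasoning enabled perform best; Constitution questions are generally easier to answer than Civil Code and Penal Code questions.
Conclusion: JBE-QA is the first comprehensive benchmark for evaluating LLMs on Japanese legal knowledge, supporting fine-grained true/false judgments and structured reasoning, and is an important resource for legal AI research.
Abstract: We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset to evaluate large language models' legal knowledge. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), JBE-QA provides the first comprehensive benchmark for Japanese legal-domain evaluation of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and the Constitution questions are generally easier than the Civil Code or the Penal Code questions.
[78] FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing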
Jingheng Ye,Shen Wang,Jiaqi Chen,Hebin Wang,Deqing Zou,Yanyu Zhu,Jiwei Tang,Hai-Tao Zheng,Ruitong Liu,Haoyang Li,Yanfeng Wang,Qingsong Wen
Main category: cs.CL
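TL;DR: This paper presents FEANEL, a benchmark for fine-grained error analysis for English learners, to assess the ability of LLMs to give feedback on K-12 English writing; experiments show that current models still fall well short on this task.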
Details
Motivation: To explore the ability of LLMs to provide fine-grained educational feedback on K-12 English writing and to fill the gap in evaluating their error analysis and pedagogical skills in educational settings.
Method: The FEANEL benchmark is built from 1,000 student essays. Using a part-of-speech-based error taxonomy co-developed by language education experts, errors are annotated by type, severity, and explanatory feedback, and current mainstream LLMs are evaluated on the benchmark.
Result: Experiments show that current LLMs perform poorly on fine-grained error analysis, exposing their limitations in educational applications.
Conclusion: Current LLMs are not yet adequate for fine-grained feedback on K-12 English writing, and more specialized methods and techniques are needed for educational settings.
Abstract: Large Language Models (LLMs) have transformed artificial intelligence, offering profound opportunities for educational applications. However, their ability to provide fine-grained educational feedback for K-12 English writing remains underexplored. In this paper, we challenge the error analysis and pedagogical skills of LLMs by introducing the problem of Fine-grained Error Analysis for English Learners and present the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students, and a well-developed English writing error taxonomy. Each error is annotated by language education experts and categorized by type, severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed. We evaluate state-of-the-art LLMs on the FEANEL Benchmark to explore their error analysis and pedagogical abilities. Experimental results reveal significant gaps in current LLMs' ability to perform fine-grained error analysis, highlighting the need for advancements in particular methods for educational applications.
[79] Language-conditioned world model improves policy generalization by reading environmental descriptions
Anh Nguyen,Stefan Lee
Main category: cs.CL
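TL;DR: This paper proposes LED-WM, a model-based reinforcement learning method built on a language-conditioned world model, to improve an agent's policy generalization to unseen games without relying on planning or expert demonstrations; it uses an attention mechanism to align language descriptions with observed entities.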
Details
Motivation: Existing methods for understanding language that describes environment dynamics either generalize poorly or rely on strong assumptions (such as requiring expert demonstrations or tolerating planning latency), which limits their use in real human-agent interaction.
Method: LED-WM is built on DreamerV3 and uses a language-aware encoder with an attention mechanism that explicitly grounds language descriptions to entities in the observation. The language-conditioned world model is trained through interaction with the environment and a policy is learned from it, without inference-time planning or expert demonstrations.
Result: In the MESSENGER and MESSENGER-WM environments, LED-WM generalizes better than baselines to new environment dynamics and language descriptions across several settings, and policy performance can be improved further by fine-tuning on synthetic trajectories generated by the world model.
Conclusion: Without expert demonstrations and without planning latency, LED-WM improves policy generalization of a language-driven world model to unseen tasks and supports refining the policy on synthetic data ahead of deployment.
Abstract: To interact effectively with humans in the real world, it is important for agents to understand language that describes the dynamics of the environment--that is, how the environment behaves--rather than just task instructions specifying "what to do". Understanding this dynamics-descriptive language is important for human-agent interaction and agent behavior. Recent work addresses this problem using a model-based approach: language is incorporated into a world model, which is then used to learn a behavior policy. However, these existing methods either do not demonstrate policy generalization to unseen games or rely on limiting assumptions, for instance that the latency induced by inference-time planning is tolerable for the target task or that expert demonstrations are available. Expanding on this line of research, we focus on improving policy generalization from a language-conditioned world model while dropping these assumptions. We propose a model-based reinforcement learning approach, where a language-conditioned world model is trained through interaction with the environment, and a policy is learned from this model--without planning or expert demonstrations. Our method proposes Language-aware Encoder for Dreamer World Model (LED-WM) built on top of DreamerV3. LED-WM features an observation encoder that uses an attention mechanism to explicitly ground language descriptions to entities in the observation. We show that policies trained with LED-WM generalize more effectively to unseen games described by novel dynamics and language compared to other baselines in several settings in two environments: MESSENGER and MESSENGER-WM. To highlight how the policy can leverage the trained world model before real-world deployment, we demonstrate the policy can be improved through fine-tuning on synthetic test trajectories generated by the world model.
[80] Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework
Kelaiti Xiao,Liang Yang,Dongyu Zhang,Paerhati Tulajiang,Hongfei Lin
Main category: cs.CL
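TL;DR: Proposes an iterative framework that combines a large language model, a text-to-image model, and a multimodal large language model to automatically generate and evaluate idiom-based visual pun images, and builds a 1,000-sample dataset to support benchmarking of both generation and understanding.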
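The generate, synthesize, infer, refine cycle that this entry describes can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `write_prompt`, `draw_image`, and `guess_idiom` are hypothetical callables standing in for the LLM, T2IM, and MLLM, the exact-match success check and the feedback string are illustrative choices, and the toy stand-ins exist only so the loop runs without any model.

```python
def generate_pun(idiom, write_prompt, draw_image, guess_idiom, max_steps=5):
    """Iterate prompt -> image -> guess until the idiom is recognized.

    The three callables are caller-supplied stand-ins for the LLM, T2IM and
    MLLM; their signatures are hypothetical, not the paper's interface.
    """
    feedback = None
    for step in range(1, max_steps + 1):
        prompt = write_prompt(idiom, feedback)   # (i) generate a visual prompt
        image = draw_image(prompt)               # (ii) synthesize an image
        guess = guess_idiom(image)               # (iii) infer the idiom
        if guess.strip().lower() == idiom.strip().lower():
            return {"prompt": prompt, "image": image, "steps": step, "success": True}
        # (iv) pass the failed reading back so the next prompt can be refined
        feedback = f"The image was read as '{guess}', not '{idiom}'."
    return {"prompt": prompt, "image": image, "steps": max_steps, "success": False}

# Toy stand-ins (purely illustrative).
def toy_write_prompt(idiom, feedback):
    return idiom if feedback else "a vague scene"

def toy_draw_image(prompt):
    return f"<image of {prompt}>"

def toy_guess_idiom(image):
    return image[len("<image of "):-1]

result = generate_pun("piece of cake", toy_write_prompt, toy_draw_image, toy_guess_idiom)
print(result["steps"], result["success"])
```

Keeping the three models behind plain callables makes it easy to swap one component while holding the others fixed, which is the kind of comparison the entry reports across LLMs and MLLMs.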
Details
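Motivation: Visual pun images must convey both the literal and the figurative meaning of an idiom. Existing methods struggle to generate and evaluate such images effectively, so an automated framework that coordinates multiple models is needed to improve generation quality and interpretability.
Method: An iterative framework is designed in which a large language model (LLM) generates detailed visual prompts, a text-to-image model (T2IM) synthesizes the image, and a multimodal large language model (MLLM) infers the original idiom from the image; when recognition fails, feedback is used to refine the prompt until recognition succeeds or a step limit is reached.
Result: A dataset of visual pun images with paired prompts is generated from 1,000 idioms. Experiments show that MLLM choice is the key performance factor: GPT performs best, Gemini follows, and the open-source Gemma is also competitive; among LLMs, Claude performs best at prompt generation.
Conclusion: The framework enables generation and evaluation of idiom-based visual puns, and the resulting dataset can support follow-up research. The MLLM plays the leading role in the understanding task, indicating that choosing the right combination of models is critical to system performance.
Abstract: We study idiom-based visual puns--images that align an idiom's literal and figurative meanings--and present an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates detailed visual prompts, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver: GPT achieves the highest accuracies, Gemini follows, and the best open-source MLLM (Gemma) is competitive with some closed models. On the LLM side, Claude attains the strongest average performance for prompt generation.
[81] Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match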
Jinze Li,Yixing Xu,Guanchen Li,Shuo Yang,Jinfeng Xu,Xuanwu Yin,Dong Li,Edith C. H. Ngai,Emad Barsoum
Main category: cs.CL
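TL;DR: Proposes FLy, a training-free loosely speculative decoding method that uses the target model's self-corrective behavior to judge whether a draft-target mismatch is still semantically correct, substantially speeding up LLM inference while maintaining high accuracy.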
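This entry describes an entropy-level gate that decides whether a position admits several plausible tokens or is nearly deterministic. A minimal sketch of such a gate, assuming Shannon entropy over the target model's next-token distribution and a fixed cutoff, is shown below. The threshold value and function names are hypothetical, and how FLy combines the gate with its deferred window is not specified in this digest.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def allows_alternatives(target_probs, threshold=0.5):
    """Return True when the position is open to several plausible tokens.

    The threshold is a hypothetical placeholder, not a value from the paper.
    """
    return entropy(target_probs) > threshold

peaked = [0.97, 0.01, 0.01, 0.01]  # nearly deterministic position
flat = [0.4, 0.3, 0.2, 0.1]        # several plausible continuations

print(allows_alternatives(peaked), allows_alternatives(flat))
```

Because the gate only reads the target model's own distribution, it requires no additional training, which is consistent with the training-free design the entry emphasizes.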
Details
Motivation: Existing speculative decoding methods discard many semantically valid continuations because of strict exact-match verification, and their performance degrades on out-of-distribution tasks; a more flexible and more general accelerated decoding method is needed.
Method: FLy uses a two-tier mechanism: an entropy-level gate decides whether the current token allows multiple plausible alternatives, and a token-level deferred window separates genuine errors from differently worded yet semantically correct variants. A multi-level acceleration strategy speeds up both the target model and the draft model.
Result: FLy achieves an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x on the 405B variant while preserving more than 99% of the target model's accuracy; on out-of-distribution datasets it outperforms EAGLE-3 by 1.62x.
Conclusion: FLy is an efficient, training-free speculative decoding framework with strong generalization that composes with arbitrary draft-target model pairs and significantly reduces inference latency across models and tasks.
Abstract: Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens in parallel from a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model's self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.
[82] Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification
Sumit Mamtani,Abhijeet Bhure
Main category: cs.CL
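TL;DR: This paper studies fake news detection with Transformer representations, comparing encoder-only and decoder-only pre-trained models used as frozen embedders with lightweight classifiers; BERT combined with logistic regression performs strongly, and attention-based encodings prove robust and effective for the task.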
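The pooling step mentioned in this entry turns a variable-length sequence of frozen token embeddings into one fixed-size vector for a lightweight classifier. The sketch below shows masked average and max pooling in plain Python; the toy vectors, the padding mask convention, and the function names are assumptions for illustration, not the paper's code.

```python
def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is truthy (non-padding)."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def max_pool(token_vectors, mask):
    """Take the per-dimension maximum over non-padding token vectors."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    dim = len(kept[0])
    return [max(vec[i] for vec in kept) for i in range(dim)]

# Hypothetical 2-dimensional embeddings for three positions; the last is padding.
tokens = [[1.0, 4.0], [3.0, 2.0], [0.0, 0.0]]
mask = [1, 1, 0]

mean_vec = mean_pool(tokens, mask)
max_vec = max_pool(tokens, mask)
print(mean_vec, max_vec)
```

Either pooled vector could then be fed to a linear head such as logistic regression, which keeps the classifier simple so that differences in performance reflect the embeddings themselves.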
Details
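Motivation: To examine the representational ability of different Transformer structures (encoder-only versus decoder-only) for fake news detection, to clarify the role of self-attention in veracity judgments, and to separate out the effect of classifier complexity on performance.
Method: BERT, GPT-2, and Transformer-XL are used as frozen embedders paired with lightweight classifiers (such as logistic regression and neural networks) in experiments on the LIAR dataset; controlled comparisons cover pooling versus padding strategies and neural versus linear classification heads.
Result: Contextual self-attention encodings transfer effectively; BERT embeddings with logistic regression outperform neural baselines on the LIAR dataset; truncating sequence length has little effect, and simple max or average pooling beats other aggregation methods.
Conclusion: Attention-based token encoders are a robust, architecture-centric foundation for veracity detection, and their advantage comes mainly from the representational power of the Transformer itself rather than from classifier complexity.
Abstract: This paper investigates fake news detection as a downstream evaluation of Transformer representations, benchmarking encoder-only and decoder-only pre-trained models (BERT, GPT-2, Transformer-XL) as frozen embedders paired with lightweight classifiers. Through controlled preprocessing comparing pooling versus padding and neural versus linear heads, results demonstrate that contextual self-attention encodings consistently transfer effectively. BERT embeddings combined with logistic regression outperform neural baselines on LIAR dataset splits, while analyses of sequence length and aggregation reveal robustness to truncation and advantages from simple max or average pooling. This work positions attention-based token encoders as robust, architecture-centric foundations for veracity tasks, isolating Transformer contributions from classifier complexity.
[83] ShoppingComp: Are LLMs Really Ready for Your Shopping Cart?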
Huaixiao Tou,Ying Zeng,Cong Ma,Muzhi Li,Minghao Li,Weijie Yuan,He Zhang,Kai Jia
Main category: cs.CL
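TL;DR: ShoppingComp is a new real-world benchmark that evaluates LLM-powered shopping agents on product retrieval, report generation, and safety-critical decision making, revealing serious shortcomings of current models in practical use.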
Details
Motivation: Existing e-commerce benchmarks do not adequately assess the combined capabilities of LLMs in real shopping scenarios, in particular omitting product safety identification, which leaves a gap between research and deployment. Method: Builds ShoppingComp, a benchmark of 120 tasks and 1,026 scenarios curated by 35 experts, guaranteeing real products and verifiability, and evaluating recommendation accuracy, report quality, and identification of safety hazards. Result: State-of-the-art LLMs perform poorly on ShoppingComp (e.g., 11.22% for GPT-5 and 3.92% for Gemini-2.5-Flash), often failing to identify unsafe products or being misled by promotional misinformation. Conclusion: ShoppingComp closes the gap between research and real-world use and sets a new standard for developing reliable, practical e-commerce agents. Abstract: We present ShoppingComp, a challenging real-world benchmark for rigorously evaluating LLM-powered shopping agents on three core capabilities: precise product retrieval, expert-level report generation, and safety-critical decision making. Unlike prior e-commerce benchmarks, ShoppingComp introduces highly complex tasks under the principle of guaranteeing real products and ensuring easy verifiability, adding a novel evaluation dimension for identifying product safety hazards alongside recommendation accuracy and report quality. The benchmark comprises 120 tasks and 1,026 scenarios, curated by 35 experts to reflect authentic shopping needs. Results reveal stark limitations of current LLMs: even state-of-the-art models achieve low performance (e.g., 11.22% for GPT-5, 3.92% for Gemini-2.5-Flash). These findings highlight a substantial gap between research benchmarks and real-world deployment, where LLMs make critical errors such as failure to identify unsafe product usage or falling for promotional misinformation, leading to harmful recommendations. ShoppingComp fills the gap and thus establishes a new standard for advancing reliable and practical agents in e-commerce.
[84] Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses
Dong Nguyen,Laura Rosseel
Main category: cs.CL
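TL;DR: This study examines the social perception of spelling variation in online English writing, comparing human and LLM ratings on three social attributes (formality, carefulness, age); the ratings correlate strongly overall but differ in their distributions and across specific types of spelling variation.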
Details
Motivation: To determine whether LLMs align with human perception of the social meaning conveyed by spelling variation, with a view to language models that better understand and generate socially appropriate language. Method: Following sociolinguistic methodology, collects human ratings of spelling variants for perceived formality, carefulness, and age, and compares them with ratings from several LLMs, computing correlations and distributional differences. Result: Human and LLM ratings correlate strongly overall on all three attributes, but notable differences appear across types of spelling variation (e.g., degree of letter repetition) and in the rating distributions. Conclusion: Although LLMs come close to human judgments of the social perception of spelling variation, limitations remain, and further work is needed to capture fine-grained social meaning. Abstract: Spelling variation (e.g. funnnn vs. fun) can influence the social perception of texts and their writers: we often have various associations with different forms of writing (is the text informal? does the writer seem young?). In this study, we focus on the social perception of spelling variation in online writing in English and study to what extent this perception is aligned between humans and large language models (LLMs). Building on sociolinguistic methodology, we compare LLM and human ratings on three key social attributes of spelling variation (formality, carefulness, age). We find generally strong correlations in the ratings between humans and LLMs. However, notable differences emerge when we analyze the distribution of ratings and when comparing between different types of spelling variation.
[85] Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts
Paulo J. N. Pinto,Armando J. Pinho,Diogo Pratas
Main category: cs.CL
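TL;DR: This paper proposes an interpretable tree-based approach that fuses multiple feature types for temporal classification of historical texts, covering five centuries of English text and achieving accuracy well above baseline at both century and decade scales.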
Details
Motivation: Accurately dating historical texts is essential for organizing and interpreting cultural heritage, and existing methods fall short in interpretability and cross-domain adaptability. Method: Combines five feature categories (compression-based, lexical structure, readability, neologism detection, and distance features) in tree-based machine learning models for temporal classification, using SHAP for feature importance analysis and explanation. Result: On a large-scale corpus, accuracy reaches 76.7% for centuries and 26.1% for decades; under relaxed precision it rises to 96.0% (top-2) and 85.8% (top-10); AUCROC reaches 94.8% with a mean absolute error of 27 years; binary classification tasks reach 85-98% accuracy. Conclusion: Tree-based models that fuse multiple feature sources perform well at text dating and offer good interpretability and ranking ability; despite cross-dataset generalization challenges, they remain an efficient, explainable alternative to neural networks. Abstract: Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines (20% and 2.3%). Under relaxed temporal precision, performance increases to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. The final model exhibits strong ranking capabilities with AUCROC up to 94.8% and AUPRC up to 83.3%, and maintains controlled errors with mean absolute deviations of 27 years and 30 years, respectively. For authentication-style tasks, binary models around key thresholds (e.g., 1850-1900) reach 85-98% accuracy. Feature importance analysis identifies distance features and lexical structure as most informative, with compression-based features providing complementary signals. SHAP explainability reveals systematic linguistic evolution patterns, with the 19th century emerging as a pivot point across feature domains.
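The "relaxed temporal precision" figures in this abstract (top-2 for centuries, top-10 for decades) are top-k accuracies: a prediction counts as correct when the true class is among the k highest-scoring classes. A minimal sketch of that metric follows; the scores and labels are invented toy values, not data from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-class score lists, one list per sample.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy example: three texts, five candidate centuries (hypothetical scores).
scores = [
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.30, 0.10, 0.40, 0.10, 0.10],
    [0.20, 0.20, 0.10, 0.40, 0.10],
]
labels = [1, 0, 4]

print(top_k_accuracy(scores, labels, k=1))  # only the first text is exactly right
print(top_k_accuracy(scores, labels, k=2))  # the second text is recovered at k=2
```

With five century classes, a uniform random guess is right 1/5 of the time, which matches the 20% century-scale baseline quoted above.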
Cross-dataset evaluation on Project Gutenberg highlights domain adaptation challenges, with accuracy dropping by 26.4 percentage points, yet the computational efficiency and interpretability of tree-based models still offer a scalable, explainable alternative to neural architectures.
[86] Standard Occupation Classifier -- A Natural Language Processing Approach
Sidharth Rony,Jack Patman
Main category: cs.CL
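TL;DR: This project uses natural language processing to build an ensemble of BERT and a neural network that assigns occupation codes from the UK and US Standard Occupational Classification (SOC) systems, reaching accuracies of 61% (fourth tier) and 72% (third tier), and can be used to analyze labour market demand.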
Details
Motivation: To enable fine-grained, real-time analysis of labour demand across occupations by combining big data from job advertisements with the Standard Occupational Classification (SOC) system. Method: Develops classifiers for the UK ONS SOC and the US O*NET SOC using several language models (such as Google BERT), and builds an ensemble (BERT + neural network) that combines job title, description, and skills. Result: The ensemble achieves the best results, with up to 61% accuracy on the fourth SOC tier and 72% on the third tier. Conclusion: The model makes effective use of job advertisement data to provide timely, accurate information on labour market dynamics and has practical value. Abstract: Standard Occupational Classifiers (SOC) are systems used to categorize and classify different types of jobs and occupations based on their similarities in terms of job duties, skills, and qualifications. Integrating these facets with Big Data from job advertisements offers the prospect of investigating labour demand that is specific to various occupations. This project investigates the use of recent developments in natural language processing to construct a classifier capable of assigning an occupation code to a given job advertisement. We develop various classifiers for both UK ONS SOC and US O*NET SOC, using different Language Models. We find that an ensemble model, which combines Google BERT and a Neural Network classifier while considering job title, description, and skills, achieved the highest prediction accuracy. Specifically, the ensemble model exhibited a classification accuracy of up to 61% for the lower (or fourth) tier of SOC, and 72% for the third tier of SOC. This model could provide up-to-date, accurate information on the evolution of the labour market using job advertisements.
[87] Conveying Imagistic Thinking in TCM Translation: A Prompt Engineering and LLM-Based Evaluation Framework
Jiatong Han
Main category: cs.CL
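TL;DR: This study proposes a prompt-adjustment method within a human-in-the-loop framework, using an LLM (DeepSeek V3.1) to identify metaphor and metonymy in the Huangdi Neijing and using ChatGPT and Gemini as simulated readers to evaluate translations. Prompt-adjusted LLM translations outperform human and baseline model translations on the cognitive dimensions, offering an efficient, replicable cognitive pathway for translating concept-dense ancient texts such as the TCM classics.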
Details
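Motivation: Existing English translations of TCM mostly rely on literal rendering, which makes it hard for target-language readers to reconstruct the imagistic thinking and conceptual networks behind the source text and hinders theoretical understanding and clinical application; a translation method that conveys TCM's metaphorical and metonymic structure is needed. Method: Within a human-in-the-loop (HITL) framework, four theoretically fundamental passages from the Huangdi Neijing are selected; prompt-based cognitive scaffolding guides DeepSeek V3.1 to identify and translate metaphor and metonymy; ChatGPT 5 Pro and Gemini 2.5 Pro simulate three types of real-world readers who score human, baseline model, and prompt-adjusted translations on five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis (IPA). Result: Prompt-adjusted LLM translations perform best on all five cognitive dimensions, with high cross-model and cross-role consistency; the interviews reveal differences between human and machine translation, effective strategies for conveying metaphor and metonymy, and readers' cognitive preferences. Conclusion: The study validates the prompt-driven HITL approach for translating TCM classics and provides a cognition-oriented, efficient, replicable pathway that can support international understanding and application of TCM theory. Abstract: Traditional Chinese Medicine theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis. Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers' cognitive preferences.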
This study provides a cognitive, efficient and replicable HITL methodological pathway for translation of ancient, concept-dense texts like TCM.
[88] Accent Placement Models for Rigvedic Sanskrit Text
Akhil Rajeev P,Annarao Kulkarni
Main category: cs.CL
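TL;DR: This paper studies three methods for automatically adding accent marks to Rigveda verses (full fine-tuning of ByT5, a BiLSTM-CRF sequence-labeling baseline, and LoRA-based efficient fine-tuning) and evaluates them on word error rate, character error rate, and diacritic error rate.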
Details
Motivation: Modern e-texts often lack the ancient Sanskrit accent marks (udātta, svarita, etc.), which hampers interpretation and transmission, so automated means of restoring this phonetic information are needed. Method: Builds a parallel corpus of accented and unaccented verses and compares three methods: full fine-tuning of ByT5, a BiLSTM-CRF sequence-labeling model trained from scratch, and LoRA-based parameter-efficient fine-tuning; evaluation uses WER, CER, and a purpose-built DER metric. Result: Full ByT5 fine-tuning performs best on all metrics, LoRA offers a good balance between efficiency and accuracy, and BiLSTM-CRF serves as an effective interpretable baseline; the study highlights the importance of Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme errors from accent errors. Conclusion: The study establishes reproducible baselines for Rigvedic accent restoration, advances NLP for heritage languages, and supports downstream applications such as OCR, speech synthesis, and digital humanities scholarship. Abstract: The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system (udātta, anudātta, svarita) whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.
[89] Mind Reading or Misreading? LLMs on the Big Five Personality Test
Francesco Di Cursi,Chiara Boldrini,Marco Conti,Andrea Passarella
Main category: cs.CL
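TL;DR: This study evaluates several LLMs on automatic personality prediction from text under the binary Big Five model. Although some open-source models approach GPT-4, none is consistent in the zero-shot setting, and aggregate metrics mask marked per-class differences, indicating that current off-the-shelf LLMs are not yet suitable for reliable personality prediction.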
Details
Motivation: To explore whether LLMs can perform automatic personality prediction from text (APPT) without fine-tuning, and to assess how models, datasets, and prompting strategies affect the predictions. Method: Five LLMs (including GPT-4 and lightweight open-source models) are tested on three heterogeneous datasets (Essays, MyPersonality, Pandora) with two prompting strategies (minimal and enriched) in a zero-shot binary classification setting, analyzing accuracy, macro-F1, and per-class recall. Result: Enriched prompts reduce invalid outputs and improve class balance but systematically bias predictions toward trait presence; Openness and Agreeableness are easier to detect, while Extraversion and Neuroticism are harder; some open-source models approach GPT-4, but no configuration yields stable, reliable predictions; aggregate metrics mask class-level asymmetries, and per-class recall is more diagnostic. Conclusion: Current off-the-shelf LLMs are not yet adequate for reliable automatic personality prediction; prompts, trait framing, and evaluation metrics need careful design to obtain interpretable results. Abstract: We evaluate large language models (LLMs) for automatic personality prediction from text under the binary Five Factor Model (BIG5). Five models, including GPT-4 and lightweight open-source alternatives, are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
[90] Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
Mengjie Liu,Jiahui Peng,Pei Chu,Jiantao Qiu,Ren Ma,He Zhu,Rui Min,Lindong Lu,Wenchang Ning,Linfeng Hou,Kaiwen Liu,Yuan Qu,Zhenxiang Li,Chao Xu,Zhongying Tu,Wentao Zhang,Conghui He
Main category: cs.CL
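TL;DR: This paper presents Dripper, an efficient framework for extracting main content from HTML with a lightweight language model; through HTML simplification, semantic block classification, controlled decoding, and the new WebMainBench benchmark, it achieves state-of-the-art performance while reducing context length and inference cost.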
Details
Motivation: Existing LLM-based content extraction is limited by context window length, high inference cost, and the tendency of small models to hallucinate output formats, making it hard to extract main content from general web pages efficiently and accurately. Method: (1) A specialized HTML simplification algorithm that reduces input tokens to 22% of the raw HTML; (2) main content extraction reformulated as a semantic block sequence classification task to lower inference cost; (3) a controlled decoding mechanism that strictly constrains the output space through logits processors to eliminate hallucination; (4) WebMainBench, a human-annotated evaluation dataset of more than 7,800 web pages. Result: With only a 0.6B-parameter model, Dripper reaches state-of-the-art performance on all benchmarks, attaining a ROUGE-N F1 score of 81.58% on WebMainBench (83.13% with the fall-back strategy) and outperforming all baseline methods. Conclusion: Dripper addresses the efficiency and accuracy problems of language models in web main content extraction and shows the potential of lightweight models for this task. Abstract: Accurately and efficiently extracting main content from general web pages is of great significance for obtaining training data for large models. Using well-pre-trained decoder-only generative language models offers excellent document comprehension capabilities, thereby effectively enhancing parsing quality. However, it remains constrained by issues such as context window length, inference cost, and format hallucination. We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models, which addresses these challenges through four key innovations: (1) We design a specialized HTML simplification algorithm that reduces the input token count to 22% of the raw HTML while preserving critical structural information; (2) We reformulate main content extraction as a semantic block sequence classification task, significantly reducing inference cost; (3) We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors, effectively eliminating hallucination issues common in small-scale models; (4) We propose WebMainBench, an evaluation dataset containing over 7,800 web pages with meticulously human-annotated main content extraction labels.
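Innovation (3) in this abstract constrains the output space through logits processors. The idea can be sketched as masking every token outside an allowed set before the next token is chosen. The toy vocabulary, the allowed ids, and the helper names below are hypothetical and are not taken from Dripper.

```python
import math

def constrain_logits(logits, allowed_ids):
    """Return a copy of logits where every token outside allowed_ids is set to -inf,
    so it can never be selected by argmax or receive probability under softmax."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Index of the highest logit (greedy decoding for a single step)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary of four token ids; assume only ids 0 and 3 are valid block labels.
logits = [2.0, 5.0, 1.0, 4.5]
allowed_ids = [0, 3]

unconstrained = greedy_pick(logits)                               # may pick an invalid token
constrained = greedy_pick(constrain_logits(logits, allowed_ids))  # always a valid label

print(unconstrained, constrained)
```

Because invalid tokens can never be emitted, a small model cannot drift into free-form text; it can only choose among the labels the task defines, which is one plausible reading of how format hallucination is ruled out.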
Experimental results demonstrate that using only a 0.6B parameter model, Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods, attaining a ROUGE-N F1 score of 81.58% (83.13% with the fall-back strategy) on our proposed WebMainBench dataset.
[91] Multi-chain Graph Refinement and Selection for Reliable Reasoning in Large Language Models
Yujiao Yang,Jing Lian,Linhui Li
Main category: cs.CL
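TL;DR: Proposes MGRS, a new reasoning framework whose multi-chain graph refinement and selection mechanism improves reasoning diversity and efficiency and substantially strengthens the complex reasoning ability of large models.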
Details
Motivation: Existing test-time expansion methods (such as ToT and GoT) suffer from limited reasoning diversity, redundant search, and weak integration and error correction across heterogeneous paths, which limits the complex reasoning performance of LLMs. Method: MGRS first generates multiple diverse reasoning trajectories, refines candidate answers with combined self- and cross-verification, builds a reasoning relation graph and estimates the success rate of intermediate nodes, and finally selects the best answer and trajectory by cumulative success rate. Result: Across six benchmark datasets, MGRS reaches an average accuracy of 82.9%, 2.1% above the best existing method; on the 24-point game it reaches 100% accuracy for the first time, with a 13.6x speed-up over Forest of Thoughts. Conclusion: MGRS improves both the accuracy and the computational efficiency of LLMs on complex reasoning tasks and offers a new paradigm for reasoning enhancement methods. Abstract: The complex reasoning ability of Large Language Models (LLMs) poses a critical bottleneck for their practical applications. Test-time expansion methods such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) enhance reasoning by introducing intermediate reasoning structures, tree search, or graph-based exploration mechanisms. However, their reasoning strategies suffer from limited diversity, redundant search branches, and inadequate integration and error correction across heterogeneous reasoning paths. To address these limitations, we propose a novel reasoning framework called Multi-chain Graph Refinement & Selection (MGRS), which first generates multiple diverse reasoning trajectories for a given problem, refines candidate responses using a composite self- and cross-verification strategy, then constructs a reasoning relation graph and estimates the success rate of intermediate nodes, and finally computes cumulative success rates to select the most reliable answer and corresponding reasoning trajectory. Experimental results demonstrate that MGRS significantly advances both the reasoning capability and computational efficiency of reasoning enhancement methods. Across six benchmark datasets spanning four distinct tasks, MGRS achieves an average accuracy of 82.9%, outperforming state-of-the-art baselines by a clear margin of 2.1%. Remarkably, on the 24-point game, MGRS attains 100% accuracy for the first time, while delivering a 13.6x speed-up compared to the leading Forest of Thoughts framework.
[92] Are LLMs Good Safety Agents or a Propaganda Engine?
Neemesh Yadav,Francesco Ortu,Jiarui Liu,Joeun Yook,Bernhard Schölkopf,Rada Mihalcea,Alberto Cazzaniga,Zhijing Jin
Main category: cs.CL
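TL;DR: This paper introduces the PSP dataset to probe whether LLM refusals on politically sensitive content stem from safety policies or political censorship, analyzes seven LLMs with several methods, and finds that most models exhibit some form of censorship.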
Details
Motivation: Systematic analysis is lacking of whether LLM refusals to respond to harmful content reflect safety policies or political censorship; distinguishing the two across the political contexts of different countries is important. Method: Builds the PSP dataset from content censored in China and other countries, analyzes the refusal behavior of seven LLMs with data-driven and representation-level approaches (such as erasing the concept of politics), and tests their vulnerability to prompt injection attacks. Result: Most LLMs exhibit censorship behavior on content with implicit political intent, and refusal distributions differ across models and country contexts. Conclusion: LLM refusals are shaped not only by safety policies but may also reflect political censorship; differences in responses across political contexts reveal potential biases and vulnerabilities. Abstract: Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, systematic analyses of whether this behavior is truly a reflection of their safety policies or an indication of political censorship, which is practiced globally by countries, are lacking. Differentiating between safety-influenced refusals and politically motivated censorship is hard and unclear. For this purpose we introduce PSP, a dataset built specifically to probe the refusal behaviors in LLMs from an explicitly political context. PSP is built by formatting existing censored content from two data sources, openly available on the internet: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and, 2) vulnerability of models on PSP through prompt injection attacks (PIAs). Associating censorship with refusals on content with masked implicit intent, we find that most LLMs perform some form of censorship. We conclude by summarizing major attributes that can cause a shift in refusal distributions across models and contexts of different countries.
[93] Listwise Preference Optimization with Element-wise Confusions for Aspect Sentiment Quad Prediction
Wenna Lai,Haoran Xie,Guandong Xu,Qing Li,S. Joe Qin
Main category: cs.CL
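TL;DR: Proposes a framework based on reasoning-based generation and listwise preference optimization to improve the accuracy and explanation consistency of aspect sentiment quad prediction.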
Details
Motivation: Existing methods struggle to model the intricate relationships among elements, and their performance drops sharply when predicting higher-order elements. Method: Uses reasoning-based generation to output the quadruple and a natural language rationale within a unified template, and introduces a listwise preference optimization framework that generates syntactically and semantically close candidates to strengthen structural validity and relational coherence. Result: Experiments on four benchmark datasets show that the method improves quadruple prediction accuracy and explanation consistency. Conclusion: Through explicit relational reasoning and listwise preference optimization, the framework improves both performance and interpretability on ASQP. Abstract: Aspect sentiment quad prediction (ASQP) is inherently challenging, as it requires predicting a structured quadruple with four core sentiment elements: aspect term (a), aspect category (c), opinion term (o), and sentiment polarity (s). Prior methods relying on marker-based prediction struggle with modeling the intricate relationships among elements and experience sharp performance declines when predicting higher-order elements (e.g., c and s) under standard supervised fine-tuning. To address these limitations, we employ reasoning-based generation to output both the quadruple and a natural language rationale under element prefixes within a unified template, encouraging explicit relational reasoning and interpretability. To further enhance element-wise alignment, we introduce a listwise preference optimization framework for improving structural validity and relational coherence. Specifically, we generate element-wise confusable candidates via syntactic and semantic proximity, then train the model with listwise objectives to prefer the gold candidates over closely competing alternatives. Extensive experiments on four benchmark datasets demonstrate that our framework effectively improves quadruple prediction accuracy and explanation consistency.
[94] TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
Guang Liang,Jie Shao,Ningyuan Tang,Xinyao Liu,Jianxin Wu
Main category: cs.CL
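TL;DR: This paper proposes TWEO, a novel non-invasive loss function that eliminates extreme activation outliers in Transformer training with a simple loss term, enabling full-model FP8 training without complex engineering or architectural changes and, for the first time, letting hardware-friendly W8A8 static quantization reach SOTA performance on LLMs.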
Details
Motivation: Existing FP8 training is hindered by extreme activation outliers and usually requires complex mixed-precision design or model modifications; the paper challenges the conventional view that outliers are data-driven, arguing that they are a mechanical artifact of training caused by structural properties of the weight matrices (such as colinearity). Method: Proposes the TWEO (Transformers Without Extreme Outliers) loss function, which adds a simple regularization term to suppress extreme outliers arising from weight-matrix structure, enabling full-model FP8 training for both LLMs and ViTs without extra engineering tricks or architectural changes. Result: TWEO reduces activation outliers from 10000+ to below 20; where standard FP8 training collapses, it matches the BF16 baseline while increasing training throughput by 36%; it also makes W8A8 per-tensor static quantization work on LLMs for the first time, with SOTA results. Conclusion: Extreme activation outliers are not an unavoidable property of the data but a training by-product that can be controlled mechanistically; TWEO offers a simple, general, hardware-friendly solution that makes FP8 a practical choice for large-model training. Abstract: Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
[95] Tourism Question Answer System in Indian Language using Domain-Adapted Foundation Models
Praveen Gatla,Anushka,Nikita Kanwar,Gouri Sahoo,Rajesh Kumar Mundotiya
Main category: cs.CL
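TL;DR: This paper presents the first baseline study of extractive question answering for the Hindi tourism domain, focused on the cultural and spiritual hub of Varanasi. It builds a dataset of 7,715 real QA pairs augmented with 27,455 generated pairs and fine-tunes foundation models such as BERT and RoBERTa with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA), reaching 85.3% F1 while cutting trainable parameters by 98%, which confirms LoRA's efficiency in low-resource settings and shows the advantage of RoBERTa with SFT in capturing the context of cultural terms.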
Details
Motivation: Because QA resources specific to Hindi culture and tourism are lacking, existing NLP systems struggle with culturally nuanced questions, especially for places as rich in religious and cultural context as Varanasi; a dedicated Hindi tourism QA dataset and adapted baseline systems are therefore needed. Method: Builds a dataset of 7,715 real Hindi QA pairs spanning 10 tourism subdomains and augments it with 27,455 samples generated via Llama zero-shot prompting; uses BERT (including variants such as Hindi-BERT) and RoBERTa as foundation models, fine-tuned with supervised fine-tuning (SFT) and parameter-efficient Low-Rank Adaptation (LoRA), and evaluates with F1, BLEU, and ROUGE-L. Result: LoRA fine-tuning reaches 85.3% F1 with only 2% of the trainable parameters, making it far more parameter-efficient than full SFT; RoBERTa with SFT achieves the best F1 and captures context better than the BERT variants, particularly for culturally embedded terms such as Aarti and Kund; models trade off answer precision against linguistic fluency. Conclusion: The study establishes an effective baseline for Hindi tourism QA, demonstrates LoRA's parameter efficiency for low-resource language tasks, stresses the importance of cultural context in tourism NLP applications, and calls for more language technology frameworks tailored to specific cultural settings. Abstract: This article presents the first comprehensive study on designing a baseline extractive question-answering (QA) system for the Hindi tourism domain, with a specialized focus on Varanasi, a cultural and spiritual hub renowned for its Bhakti-Bhaav (devotional ethos). Targeting ten tourism-centric subdomains (Ganga Aarti, Cruise, Food Court, Public Toilet, Kund, Museum, General, Ashram, Temple and Travel), the work addresses the absence of language-specific QA resources in Hindi for culturally nuanced applications. In this paper, a dataset comprising 7,715 Hindi QA pairs pertaining to Varanasi tourism was constructed and subsequently augmented with 27,455 pairs generated via Llama zero-shot prompting. We propose a framework leveraging foundation models (BERT and RoBERTa) fine-tuned using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) to optimize parameter efficiency and task performance. Multiple variants of BERT, including models pre-trained for specific languages (e.g., Hindi-BERT), are evaluated to assess their suitability for low-resource domain-specific QA. Evaluation metrics - F1, BLEU, and ROUGE-L - highlight trade-offs between answer precision and linguistic fluency. Experiments demonstrate that LoRA-based fine-tuning achieves competitive performance (85.3% F1) while reducing trainable parameters by 98% compared to SFT, striking a balance between efficiency and accuracy.
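The size of the parameter saving reported for LoRA can be made concrete with a short calculation. For a single weight matrix of shape d_out x d_in, full fine-tuning trains d_out * d_in parameters, whereas a rank-r LoRA adapter trains two low-rank factors with r * (d_out + d_in) parameters in total. The 768 x 768 shape and rank 8 below are illustrative assumptions (a typical BERT-base projection), not the configuration used in the paper, and the paper's 98% figure is a model-wide number rather than a per-matrix one.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters of a rank-`rank` LoRA adapter on a d_out x d_in matrix:
    one factor of shape d_out x rank plus one of shape rank x d_in."""
    return rank * (d_out + d_in)

def lora_reduction(d_out, d_in, rank):
    """Fraction of trainable parameters saved relative to updating the full matrix."""
    full = d_out * d_in
    return 1.0 - lora_trainable_params(d_out, d_in, rank) / full

# Assumed example: a 768 x 768 projection with rank-8 adapters.
adapter = lora_trainable_params(768, 768, 8)
saved = lora_reduction(768, 768, 8)

print(adapter)          # adapter parameters for this one matrix
print(f"{saved:.1%}")   # share of parameters no longer trained
```

Under these assumed sizes the adapter trains 1/48 of the matrix's parameters, which shows how a reduction of roughly 98% can arise from low-rank updates alone.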
Comparative analysis across models reveals that RoBERTa with SFT outperforms BERT variants in capturing contextual nuances, particularly for culturally embedded terms (e.g., Aarti, Kund). This work establishes a foundational baseline for Hindi tourism QA systems, emphasizing the role of LoRA in low-resource settings and underscoring the need for culturally contextualized NLP frameworks in the tourism domain.
[96] Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs
Jiancheng Dong,Pengyue Jia,Jingyu Peng,Maolin Wang,Yuhao Wang,Lixin Su,Xin Sun,Shuaiqiang Wang,Dawei Yin,Xiangyu Zhao
Main category: cs.CL
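TL;DR: Proposes a lightweight three-stage training framework that learns a single behavior-equivalent token, [BE], to replace a lengthy system prompt, achieving up to 3000x prompt-length compression while retaining about 98% of the original performance.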
Details
Motivation: Long system prompts increase inference latency and computational cost and reduce the effective context length; a minimal-prompt method is needed that compresses prompt length drastically without sacrificing behavioral effect. Method: Designs a three-stage training framework that requires no access to model internals, no auxiliary compression model, and no labeled data: it first encodes the content of the original prompt via reconstruction, then distills the prompt's downstream behavior into a single [BE] token. Result: Experiments on three datasets show that a single [BE] token achieves up to 3000x prompt-length compression while retaining about 98% of the original system prompt's performance, lowering inference cost and freeing the context window. Conclusion: The method compresses an LLM system prompt into a single token with almost no effect on downstream performance, offering an effective way to cut deployment overhead and improve context utilization. Abstract: Carefully engineered system prompts play a critical role in guiding the behavior of LLM agents, but their considerable length introduces significant drawbacks, including increased inference latency, higher computational cost, and reduced effective context length. This raises the question of whether such lengthy prompts can be replaced by a drastically reduced number of tokens while preserving their behavioral effect on downstream tasks. To enable this, we propose a lightweight three-stage training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Importantly, our method requires no access to model internals, no auxiliary compression models, and no labeled responses. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts. This substantially reduces inference cost and leaves almost the entire context window available for user inputs.
[97] MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)
Aaron Steiner,Ralph Peeters,Christian Bizer
Main category: cs.CL
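TL;DR: This paper compares four architectures through which LLM agents interact with websites (HTML, RAG, MCP, NLWeb), evaluating their effectiveness and efficiency on e-commerce tasks in a single test environment, and finds that the non-HTML approaches clearly outperform traditional HTML browsing.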
Details
Motivation: Prior work lacks a systematic comparison of multiple LLM agent interfaces in the same environment, which makes real performance differences between architectures hard to assess; this paper fills that gap and provides empirical grounding for designing efficient web agents. Method: Builds a testbed of four simulated e-shops, each exposing its products via HTML, MCP, and NLWeb interfaces; develops a specialized agent for each interface (HTML, RAG, MCP, NLWeb) and evaluates them on the same task sets (search, price comparison, substitute product queries, checkout) with GPT-4.1, GPT-5, GPT-5 mini, and Claude Sonnet 4. Result: RAG, MCP, and NLWeb agents outperform the HTML agent on F1 (0.75-0.77 vs. 0.67), token usage (47k-140k vs. 241k), and runtime (50-62 s vs. 291 s). The best configuration is GPT-5 with RAG, with an F1 of 0.87 and a completion rate of 0.79; taking cost into account, GPT-5 mini with RAG is the best compromise. Conclusion: The choice of interaction interface strongly affects the effectiveness and efficiency of LLM agents; structured or retrieval-augmented approaches outperform direct HTML parsing, and such interfaces should be prioritized to improve automated web task performance. Abstract: Large language model agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb) we develop specialized agents that perform the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as the underlying LLMs. Our evaluation shows that the RAG, MCP and NLWeb agents outperform HTML on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage falls from about 241k for HTML to between 47k and 140k per task. The runtime per task drops from 291 seconds to between 50 and 62 seconds.
The best overall configuration is RAG with GPT 5 achieving an F1 score of 0.87 and a completion rate of 0.79. Also taking cost into consideration, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show the choice of the interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents.
[98] Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Xiang Hu,Zhanchao Zhou,Ruiqi Liang,Zehuan Li,Wei Wu,Jianguo Li
Main category: cs.CL
TL;DR: 本文提出了HSA-UltraLong,一种基于层次稀疏注意力(HSA)的8B参数MoE模型,旨在实现高效超长上下文建模,具备稀疏性、随机访问灵活性和长度泛化能力,在长达16M的上下文任务中表现优异。
Details
Motivation: 构建能够长期记忆的机器需要有效的超长上下文建模方法,传统全注意力机制在处理极长序列时计算成本过高,难以扩展。 Method: 提出层次稀疏注意力(HSA)机制,并将其集成到Transformer中,构建了HSA-UltraLong模型;该模型通过稀疏注意力实现高效计算,支持随机访问和不同长度的上下文输入。 Result: HSA-UltraLong在域内长度上性能与全注意力基线相当,在最多达16M token的上下文检索任务中多数情况下达到超过90%的准确率。 Conclusion: HSA-UltraLong满足稀疏性、灵活性和长度泛化三项关键需求,为超长上下文建模提供了可行路径,推动‘可记忆机器’的发展。 Abstract: This work explores the challenge of building ``Machines that Can Remember'', framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: \textbf{sparsity}, \textbf{random-access flexibility}, and \textbf{length generalization}. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, which is an 8B-parameter MoE model trained on over 8 trillion tokens and is rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90\% accuracy on most in-context retrieval tasks with contexts up to 16M. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.[99] Tackling a Challenging Corpus for Early Detection of Gambling Disorder: UNSL at MentalRiskES 2025
Horacio Thompson,Marcelo Errecalde
Main category: cs.CL
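TL;DR: This paper describes a submission to Task 1 of the MentalRiskES 2025 challenge. It proposes three CPI+DMC-based methods for identifying social media users at risk of gambling disorder, combining SS3, BERT with extended vocabulary, and SBERT models with decision policies based on historical user analysis. Two proposals took the top two positions in the official results, although distinguishing high-risk from low-risk users remains difficult.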
Details
Motivation: Gambling disorder is a complex behavioral addiction, and Early Risk Detection (ERD) is essential for prevention and intervention. Social media offers a new way to spot early signs of mental health problems, which calls for classifiers that are both effective and quick to decide when identifying potentially high-risk users. Method: The authors adopt a CPI+DMC framework with three models (SS3, BERT with extended vocabulary, and SBERT) for text classification, add decision policies based on users' historical behavior, and treat predictive effectiveness and decision speed as independent optimization objectives. Result: In Task 1 of MentalRiskES 2025, two proposals took the top two official positions and performed notably on decision metrics, though separating high-risk from low-risk users remained challenging. Conclusion: Despite the strong competition results, data interpretation and data quality still need improvement to make ERD systems more transparent and reliable and to support more effective mental health interventions. Abstract: Gambling disorder is a complex behavioral addiction that is challenging to understand and address, with severe physical, psychological, and social consequences. Early Risk Detection (ERD) on the Web has become a key task in the scientific community for identifying early signs of mental health behaviors based on social media activity. This work presents our participation in the MentalRiskES 2025 challenge, specifically in Task 1, aimed at classifying users at high or low risk of developing a gambling-related disorder. We proposed three methods based on a CPI+DMC approach, addressing predictive effectiveness and decision-making speed as independent objectives. The components were implemented using the SS3, BERT with extended vocabulary, and SBERT models, followed by decision policies based on historical user analysis. Although it was a challenging corpus, two of our proposals achieved the top two positions in the official results, performing notably in decision metrics. Further analysis revealed some difficulty in distinguishing between users at high and low risk, reinforcing the need to explore strategies to improve data interpretation and quality, and to promote more transparent and reliable ERD systems for mental disorders.
[100] Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach
Shuqi Liu,Han Wu,Guanzhi Deng,Jianshu Chen,Xiaoyang Wang,Linqi Song
Main category: cs.CL
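TL;DR: Proposes a task-agnostic structured knowledge hunter that combines the generative ability of language models with the high faithfulness of knowledge retrieval, improving the interpretability and generality of knowledge-enhanced text generation.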
Details
Motivation: Existing methods rely on domain-specific knowledge retrievers that generalize poorly, and the generated text lacks interpretability, which limits reliable use in practice. Method: The authors exploit the two-tier architecture of structured knowledge (high-level entities and low-level knowledge triples) to design a task-agnostic knowledge hunter, using a local-global interaction scheme for knowledge representation learning and a hierarchical Transformer-based pointer network to select relevant knowledge. Result: The model surpasses state-of-the-art methods on RotoWireFG table-to-text generation and KdConv dialogue generation, with higher generation quality and interpretability. Conclusion: By combining language models with structured knowledge retrieval, the model performs strongly across tasks, shows good generality and practicality, and advances interpretable knowledge-enhanced generation. Abstract: Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have demonstrated impressive capabilities in generating coherent and fluent text, the lack of interpretability presents a substantial obstacle. The limited interpretability of generated text significantly impacts its practical usability, particularly in knowledge-enhanced text generation tasks that necessitate reliability and explainability. Existing methods often employ domain-specific knowledge retrievers that are tailored to specific data characteristics, limiting their generalizability to diverse data types and tasks. To overcome this limitation, we directly leverage the two-tier architecture of structured knowledge, consisting of high-level entities and low-level knowledge triples, to design our task-agnostic structured knowledge hunter. Specifically, we employ a local-global interaction scheme for structured knowledge representation learning and a hierarchical transformer-based pointer network as the backbone for selecting relevant knowledge triples and entities. By combining the strong generative ability of language models with the high faithfulness of the knowledge hunter, our model achieves high interpretability, enabling users to comprehend the model output generation process. Furthermore, we empirically demonstrate the effectiveness of our model in both internal knowledge-enhanced table-to-text generation on the RotoWireFG dataset and external knowledge-enhanced dialogue response generation on the KdConv dataset.
Our task-agnostic model outperforms state-of-the-art methods and corresponding language models, setting new standards on the benchmark.
[101] Scaling HuBERT for African Languages: From Base to Large and XL
Antoine Caubrière,Elodie Gauthier
Main category: cs.CL
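TL;DR: This paper introduces SSA-HuBERT-Large and SSA-HuBERT-XL, the first large self-supervised speech models trained entirely on African speech, and shows experimentally that larger models perform better on low-resource African languages.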
Details
Motivation: African languages are under-represented in multilingual speech processing, in particular lacking strong open-weight encoders that transfer well under low-resource conditions; most existing models are at BASE scale, and the potential of larger models on African speech is unexplored. Method: The authors train SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters) with self-supervised learning on African speech data only, and evaluate them on ASR and LID tasks for Sub-Saharan languages. Result: Larger architectures significantly improve performance, make more effective use of large audio datasets, and outperform the BASE model on automatic speech recognition and language identification. Conclusion: Large self-supervised models designed specifically for African languages are feasible and effective; increasing model capacity significantly improves speech processing for low-resource languages and advances African language technology. Abstract: Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE size counterpart. We release these models as open weights: see https://huggingface.co/collections/Orange/african-speech-foundation-models. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
[102] Optimizing Multimodal Language Models through Attention-based Interpretability
Alexander Sergeev,Evgeny Kotelnikov
Main category: cs.CL
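TL;DR: Proposes an attention-based interpretability method that measures how strongly each attention head focuses on key objects in an image (the HI score) and uses it to choose which components of a multimodal language model to adapt with parameter-efficient fine-tuning (PEFT). Fine-tuning only about 0.01% of parameters substantially improves image understanding.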
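As a rough illustration of the idea summarized above, a head's focus on key objects could be scored as the share of its image-token attention that lands inside a key-object mask, and layers could then be ranked by their mean head score. This is only a minimal sketch under that assumption: the exact HI definition is not given in this listing, and `head_impact`, `rank_layers`, and their arguments are hypothetical names.

```python
from typing import Sequence


def head_impact(attn_to_image: Sequence[float], key_mask: Sequence[bool]) -> float:
    """Share of a head's image-token attention that falls on key-object tokens.

    Assumption: this ratio is a stand-in for the paper's HI score, not its definition.
    """
    if len(attn_to_image) != len(key_mask):
        raise ValueError("attention and mask must have the same length")
    total = sum(attn_to_image)
    if total == 0:
        return 0.0
    on_key = sum(a for a, m in zip(attn_to_image, key_mask) if m)
    return on_key / total


def rank_layers(layer_head_scores: dict[int, list[float]]) -> list[int]:
    """Order layers by mean per-head score, highest first (each list must be non-empty)."""
    means = {layer: sum(s) / len(s) for layer, s in layer_head_scores.items()}
    return sorted(means, key=means.get, reverse=True)
```

Under this sketch, the top-ranked layers would be the candidates for PEFT, which mirrors the selection step described in the summary without claiming to reproduce it.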
Details
Motivation: Multimodal language models are hard to interpret, which makes it difficult to identify the components best suited to parameter-efficient fine-tuning and so balance efficiency against performance. Method: The authors analyze the distribution of attention scores relative to image tokens to identify heads that attend to key objects in the image, compute a Head Impact (HI) score to measure each head's importance, and select high-HI layers for PEFT. Result: Experiments on multimodal language models with 2-3 billion parameters show that fine-tuning the layers with the highest HI scores shifts metrics more than the pre-trained model, randomly selected layers, or lowest-HI layers. Fine-tuning only about 0.01% of parameters substantially affects image understanding. Conclusion: Attention-based interpretability can identify the model components most important for image understanding and guide PEFT toward efficient, high-performing fine-tuning, showing the potential of combining interpretability with parameter-efficient fine-tuning. Abstract: Modern large language models become multimodal, analyzing various data formats like text and images. While fine-tuning is effective for adapting these multimodal language models (MLMs) to downstream tasks, full fine-tuning is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by training only a small portion of model weights. However, MLMs are difficult to interpret, making it challenging to identify which components are most effective for training to balance efficiency and performance. We propose an attention-based interpretability method for MLMs by analyzing attention scores relative to image tokens. The core idea is to identify attention heads that focus on image key objects. We utilize this information to select optimal model components for PEFT in multimodal models. Our contributions include a method for identifying attention heads associated with image key objects, its application to PEFT for image captioning, and the creation of a new dataset containing images, key object masks, and their textual descriptions. We conducted experiments on MLMs with 2-3 billion parameters to validate the method's effectiveness. By calculating Head Impact (HI) scores we quantify an attention head's focus on key objects, indicating its significance in image understanding. Our fine-tuning experiments demonstrate that adapting layers with the highest HI scores leads to the most significant shifts in metrics compared to pre-trained, randomly selected, or lowest-HI-score layers.
This indicates that fine-tuning a small percentage (around 0.01%) of parameters in these crucial layers can substantially influence image understanding capabilities.
[103] Ambiguity Awareness Optimization: Towards Semantic Disambiguation for Direct Preference Optimization
Jian Li,Shenglin Yin,Yujia Zhang,Alan Zhao,Xi Chen,Xiaohui Zhou,Pengfei Xu
Main category: cs.CL
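TL;DR: This paper proposes Ambiguity Awareness Optimization (AAO), a method that addresses training ambiguity in Direct Preference Optimization (DPO) caused by ambiguous content such as semantically similar text. By automatically re-weighting ambiguous content in preference pairs, AAO significantly improves alignment and surpasses existing methods on several benchmarks.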
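One way to picture similarity-based re-weighting is to give each token of the chosen response a weight that shrinks as it becomes more similar to some token of the rejected response. The sketch below assumes token embeddings are already available as plain vectors and uses `1 - max cosine similarity` as the weight; both choices, and the names `cosine` and `ambiguity_weights`, are illustrative assumptions rather than the AAO formulation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ambiguity_weights(chosen: Sequence[Vector], rejected: Sequence[Vector]) -> list[float]:
    """Per-token weights in [0, 1] for the chosen response.

    Assumption: a token counts as ambiguous in proportion to its highest
    similarity to any rejected-response token and is down-weighted by that amount.
    """
    weights = []
    for tok in chosen:
        best = max((cosine(tok, other) for other in rejected), default=0.0)
        weights.append(1.0 - min(max(best, 0.0), 1.0))
    return weights
```

In a training loop, weights like these would multiply the per-token log-probability terms before they enter the preference loss, so tokens shared by both responses contribute less to the gradient.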
Details
Motivation: In DPO training, identical or semantically similar (ambiguous) content appears frequently within preference pairs and can introduce ambiguity into learning, limiting alignment; a mechanism is needed to identify and reduce this effect. Method: AAO computes semantic similarity between the two sides of each preference pair to generate weights automatically and reduce the influence of ambiguous content during training. It needs no extra sampling or complex architecture and is simple to integrate. Result: AAO significantly outperforms DPO and other advanced methods on benchmarks including AlpacaEval 2 and Arena-Hard, with gains of up to 8.9 points (AlpacaEval 2) and 15.0 points (Arena-Hard), without markedly increasing response length. Conclusion: By making training aware of ambiguous content, AAO mitigates training ambiguity in DPO and is an efficient, general, and scalable RLHF optimization method. Abstract: Direct Preference Optimization (DPO) is a widely used reinforcement learning from human feedback (RLHF) method across various domains. Recent research has increasingly focused on the role of token importance in improving DPO effectiveness. It is observed that identical or semantically similar content (defined as ambiguous content) frequently appears within the preference pairs. We hypothesize that the presence of ambiguous content during DPO training may introduce ambiguity, thereby limiting further improvements in alignment. Through mathematical analysis and proof-of-concept experiments, we reveal that ambiguous content may potentially introduce ambiguities, thereby degrading performance. To address this issue, we introduce Ambiguity Awareness Optimization (AAO), a simple yet effective approach that automatically re-weights ambiguous content to reduce ambiguities by calculating semantic similarity from preference pairs. Through extensive experiments, we demonstrate that AAO consistently and significantly surpasses state-of-the-art approaches in performance, without markedly increasing response length, across multiple model scales and widely adopted benchmark datasets, including AlpacaEval 2, MT-Bench, and Arena-Hard. Specifically, AAO outperforms DPO by up to 8.9 points on AlpacaEval 2 and achieves an improvement of up to 15.0 points on Arena-Hard.
[104] MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation
Mahdi Rahmani,AmirHossein Saffari,Reyhane Rahmani
Main category: cs.CL
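TL;DR: This paper introduces MegaChat, the first fully synthetic Persian Q&A dataset for evaluating intelligent sales chatbots in Telegram-based e-commerce. A multi-agent architecture automatically generates persona-aware Q&A pairs, and the proposed agentic system, which features multi-query retrieval, reranking, and persona-aligned response synthesis, outperforms traditional RAG models, giving SMEs working in low-resource languages an efficient, low-cost approach to customer engagement.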
Details
Motivation: Low-resource languages such as Persian lack large, high-quality Q&A datasets, and building them by hand is costly and slow, which limits the ability of Iranian SMEs to deploy AI chatbots on Telegram. An automated, low-cost way to generate synthetic data for e-commerce settings is therefore needed. Method: The authors propose an automated multi-agent architecture with specialized agents for question generation, validation, and refinement, which collects data from active Telegram shopping channels and generates persona-aware Q&A pairs. For answer generation, they compare three classic retrieval-augmented generation (RAG) models with their agentic system, which combines multi-query retrieval, reranking, and persona-aligned response synthesis. Result: Evaluated with GPT-5.1 across six quality dimensions, the agentic system outperforms traditional RAG models in 4 of 5 channels and can generate high-quality, diverse conversational data without human annotation or complex fine-tuning. Conclusion: MegaChat gives SMEs in low-resource language settings a scalable, cost-effective solution and advances multilingual conversational AI, particularly in specialized commercial domains such as Telegram e-commerce. Abstract: Small and medium-sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real-time engagement is essential for conversion. However, developing AI-driven chatbots for this purpose requires large, high-quality question-and-answer (Q&A) datasets, which are typically expensive and resource-intensive to produce, especially for low-resource languages like Persian. In this paper, we introduce MegaChat, the first fully synthetic Persian Q&A dataset designed to evaluate intelligent sales chatbots in Telegram-based e-commerce. We propose a novel, automated multi-agent architecture that generates persona-aware Q&A pairs by collecting data from active Telegram shopping channels. The system employs specialized agents for question generation, validation, and refinement, ensuring the production of realistic and diverse conversational data. To evaluate answer generation, we compare three classic retrieval-augmented generation (RAG) models with our advanced agentic system, which features multi-query retrieval, reranking, and persona-aligned response synthesis. Using GPT-5.1 for evaluation across six quality dimensions, our results show that the agentic architecture outperformed traditional RAG models in 4 out of 5 diverse channels, demonstrating its ability to generate scalable, high-quality datasets without relying on expensive human annotation or complex fine-tuning.
MegaChat provides SMEs with an efficient, cost-effective solution for building intelligent customer engagement systems in specialized commercial domains, enabling advancements in multilingual conversational AI for low-resource languages. Download: https://github.com/MegaChat-Tech/MegaChat-DataSet
cs.CV [Back]
[105] SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Di Feng,Kaixin Ma,Feng Nan,Haofeng Chen,Bohan Zhai,David Griffiths,Mingfei Gao,Zhe Gan,Eshan Verma,Yinfei Yang,Zhifeng Chen,Afshin Dehghan
Main category: cs.CV
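TL;DR: This paper presents SO-Bench, a benchmark that evaluates how well multimodal large language models extract information and reason over visual inputs while conforming to predefined data schemas. It covers four domains (UI screens, natural images, documents, and charts) and shows that current models still fall short of producing accurate, schema-compliant outputs.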
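To make "schema-compliant output" concrete, the check below parses a model's raw text as JSON and verifies that it has exactly the expected fields with the expected types. It is a deliberately small sketch: the `SCHEMA` mapping and the `conforms` helper are hypothetical, the field-to-type mapping is far simpler than full JSON Schema, and nothing here reflects how SO-Bench itself scores outputs.

```python
import json

# Hypothetical, deliberately tiny "schema": field name -> expected Python type.
SCHEMA = {"title": str, "price": float, "in_stock": bool}


def conforms(raw_output: str, schema: dict[str, type]) -> bool:
    """Return True if raw_output is a JSON object with exactly the schema's fields and types."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(schema):
        return False
    # `type(...) is t` is strict on purpose: a JSON integer does not pass as float,
    # and a boolean does not pass as a number.
    return all(type(obj[k]) is t for k, t in schema.items())
```

Note that compliance and correctness are separate questions: an output can pass a check like this and still contain the wrong values for the image it describes.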
Details
Motivation: In real-world agentic settings, multimodal large language models must produce outputs that conform to predefined data schemas, yet no benchmark systematically evaluates this ability. Method: The authors build SO-Bench from over 6.5K diverse JSON schemas and 1.8K human-verified image-schema pairs across four visual domains, benchmark open-source and frontier proprietary models, and run training experiments to improve structured output capability. Result: Current MLLMs still show significant gaps in producing accurate, schema-compliant outputs, and structured reasoning over complex visual inputs in particular needs improvement. Conclusion: SO-Bench fills a gap in evaluating multimodal structured output, exposes the limitations of existing models, and shows through training experiments that there is room for improvement; the benchmark will be released to the community. Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments to largely improve the model's structured output capability. We plan to make the benchmark available to the community.
[106] Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training
Eric Yeats,Darryl Hannan,Wilson Fearn,Timothy Doster,Henry Kvinge,Scott Mahan
Main category: cs.CV
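TL;DR: This paper proposes Saddle-Free Guidance (SFG), a guidance method for generative models that needs no additional training or labeled data. It uses the positive curvature of log density estimates in saddle regions as a strong guidance signal and significantly improves generation quality and diversity.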
Details
Motivation: Existing guidance methods for generative models, such as CFG and Auto-Guidance, depend on labeled data or on training additional models, which limits their use when labels are unavailable or new models cannot be trained. The paper aims for a more general, plug-and-play guidance method. Method: The approach builds on the finding that log density estimates have positive curvature in saddle regions, which can serve as an effective guidance signal. SFG maintains estimates of the maximal positive curvature of the log density to guide a single score-based model, with no additional models or labels. Result: SFG achieves state-of-the-art FID and FD-DINOv2 on unconditional ImageNet-512 generation; combining it with Auto-Guidance improves performance further; on FLUX.1-dev and Stable Diffusion v3.5 it increases image diversity while maintaining good prompt adherence and image fidelity. Conclusion: SFG is an efficient, general guidance method that requires no additional training or labeled data, works with off-the-shelf diffusion and flow matching models, and performs well on both quality and diversity. Abstract: Score-based generative models require guidance in order to generate plausible, on-manifold samples. The most popular guidance method, Classifier-Free Guidance (CFG), is only applicable in settings with labeled data and requires training an additional unconditional score-based model. More recently, Auto-Guidance adopts a smaller, less capable version of the original model to guide generation. While each method effectively promotes the fidelity of generated data, each requires labeled data or the training of additional models, making it challenging to guide score-based models when (labeled) training data are not available or training new models is not feasible. We make the surprising discovery that the positive curvature of log density estimates in saddle regions provides strong guidance for score-based models. Motivated by this, we develop saddle-free guidance (SFG) which maintains estimates of maximal positive curvature of the log density to guide individual score-based models. SFG has the same computational cost of classifier-free guidance, does not require additional training, and works with off-the-shelf diffusion and flow matching models. Our experiments indicate that SFG achieves state-of-the-art FID and FD-DINOv2 metrics in single-model unconditional ImageNet-512 generation. When SFG is combined with Auto-Guidance, its unconditional samples achieve general state-of-the-art in FD-DINOv2 score.
Our experiments with FLUX.1-dev and Stable Diffusion v3.5 indicate that SFG boosts the diversity of output images compared to CFG while maintaining excellent prompt adherence and image fidelity.
[107] UniArt: Unified 3D Representation for Generating 3D Articulated Objects with Open-Set Articulation
Bu Jin,Weize Li,Songen Gu,Yupeng Zheng,Yuhang Zheng,Zhengyi Zhou,Yao Yao
Main category: cs.CV
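TL;DR: This paper presents UniArt, an end-to-end diffusion-based framework that generates fully articulated 3D objects directly from a single image by jointly modeling geometry, texture, part segmentation, and kinematic parameters. It achieves state-of-the-art mesh quality and articulation accuracy on the PartNet-Mobility dataset.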
Details
Motivation: Manually building articulated 3D objects is costly and hard to scale, and existing methods are mostly multi-stage pipelines that lack overall consistency and generalize poorly to new joint types and object categories. Method: UniArt uses a unified latent representation that jointly encodes geometry, texture, part segmentation, and kinematic parameters; introduces a reversible joint-to-voxel embedding that aligns articulation features with volumetric geometry; and treats articulation type prediction as an open-set problem to improve generalization to novel categories and unseen objects. Result: On the PartNet-Mobility benchmark, UniArt outperforms existing methods on mesh quality metrics (such as Chamfer Distance) and articulation accuracy metrics (such as AUC), reaching state-of-the-art performance. Conclusion: UniArt generates complete articulated objects from a single image end to end; its unified modeling and open-set design significantly improve generation quality and generalization and advance the automated construction of articulated object assets. Abstract: Articulated 3D objects play a vital role in realistic simulation and embodied robotics, yet manually constructing such assets remains costly and difficult to scale. In this paper, we present UniArt, a diffusion-based framework that directly synthesizes fully articulated 3D objects from a single image in an end-to-end manner. Unlike prior multi-stage techniques, UniArt establishes a unified latent representation that jointly encodes geometry, texture, part segmentation, and kinematic parameters. We introduce a reversible joint-to-voxel embedding, which spatially aligns articulation features with volumetric geometry, enabling the model to learn coherent motion behaviors alongside structural formation. Furthermore, we formulate articulation type prediction as an open-set problem, removing the need for fixed joint semantics and allowing generalization to novel joint categories and unseen object types. Experiments on the PartNet-Mobility benchmark demonstrate that UniArt achieves state-of-the-art mesh quality and articulation accuracy.
[108] PathReasoning: A multimodal reasoning agent for query-based ROI navigation on whole-slide images
Kunpeng Zhang,Hanwen Xu,Sheng Wang
Main category: cs.CV
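TL;DR: Proposes PathReasoning, a multimodal reasoning agent that navigates whole-slide images (WSIs) iteratively through multiple rounds of reasoning and self-reflection to find diagnostically relevant regions efficiently, without dense pixel-level annotations.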
Details
Motivation: Whole-slide images (WSIs) give a comprehensive picture of cancer, but their enormous size makes locating key regions during clinical inspection slow and difficult, so an efficient and interpretable navigation method is needed. Method: Inspired by how pathologists read slides, PathReasoning starts from randomly sampled candidate regions, reasons and self-reflects over several rounds on the correspondence between visual observations and clinical questions, and gradually builds a reasoning chain that directs attention to diagnostically relevant regions, turning each WSI into a sequence of question-guided views. Result: On subtyping and longitudinal analysis tasks, AUROC improves over strong existing baselines by 6.7% and 3.1% respectively; the resulting high-quality regions of interest support breast cancer report generation with 10% higher accuracy than GPT-4o. Conclusion: PathReasoning prioritizes question-relevant regions and builds interpretable reasoning chains, which helps improve slide review efficiency, diagnostic consistency, report comprehensiveness, and evidence traceability in digital pathology. Abstract: Deciphering tumor microenvironment from Whole Slide Images (WSIs) is intriguing as it is key to cancer diagnosis, prognosis and treatment response. While these gigapixel images on one hand offer a comprehensive portrait of cancer, on the other hand, the extremely large size, as much as more than 10 billion pixels, make it challenging and time-consuming to navigate to corresponding regions to support diverse clinical inspection. Inspired by pathologists who conducted navigation on WSIs with a combination of sampling, reasoning and self-reflection, we proposed "PathReasoning", a multi-modal reasoning agent that iteratively navigates across WSIs through multiple rounds of reasoning and refinements. Specifically, starting with randomly sampled candidate regions, PathReasoning reviews current selections with self-reflection, reasoning over the correspondence between visual observations and clinical questions, and concludes by proposing new regions to explore. Across rounds, PathReasoning builds a reasoning chain that gradually directs attention to diagnostically relevant areas. PathReasoning turns each whole slide into a sequence of question-guided views, allowing the model to efficiently find informative ROIs within a fixed number of steps, without the need for dense pixel-level annotations. PathReasoning can substantially outperform strong ROI-selection approaches by 6.7% and 3.1% of AUROC on subtyping and longitudinal analysis tasks. The high-quality ROIs further support accurate report generation on breast cancer, significantly outperforming the standard GPT-4o by 10% in accuracy.
PathReasoning prioritizes question-specific regions and constructs interpretable reasoning chains, supporting efficient slide review, consistent diagnostic interpretations, comprehensive reporting, and evidence traceability in digital pathology.
[109] Adaptive Parameter Optimization for Robust Remote Photoplethysmography
Cecilia G. Morales,Fanurs Chi En Teh,Kai Li,Pushpak Agrawal,Artur Dubrawski
Main category: cs.CV
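TL;DR: This paper presents PRISM, a training-free adaptive algorithm for remote photoplethysmography (rPPG). It jointly optimizes photometric detrending and color mixing through online parameter adaptation based on signal quality assessment, achieves the best performance among unsupervised methods on several datasets, rivals supervised methods, and runs in real time on a CPU.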
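Results for this kind of heart-rate estimation are usually reported as mean absolute error (MAE) in beats per minute and as accuracy at a bpm threshold. The sketch below shows how those two numbers are commonly computed; treating an error exactly equal to the threshold as a hit (`<=`) is an assumption, and the function names are illustrative.

```python
from typing import Sequence


def mae(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute error between predicted and reference heart rates (bpm)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def accuracy_within(pred: Sequence[float], true: Sequence[float], threshold: float = 5.0) -> float:
    """Fraction of estimates whose absolute error is at most `threshold` bpm."""
    hits = sum(abs(p - t) <= threshold for p, t in zip(pred, true))
    return hits / len(pred)


# Toy values, not taken from any dataset: absolute errors are 1, 2, 10, and 0 bpm.
predicted = [60.0, 72.0, 90.0, 100.0]
reference = [61.0, 70.0, 80.0, 100.0]
print(mae(predicted, reference), accuracy_within(predicted, reference))
```

The two metrics answer different questions: MAE summarizes the typical error size, while threshold accuracy reports how often an estimate is close enough to be usable.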
Details
Motivation: Existing rPPG methods rely on fixed parameters that work well only under particular lighting and camera conditions and adapt poorly to diverse real-world deployments, so a more adaptive unsupervised method is needed. Method: The PRISM algorithm uses projection-based robust signal mixing and adapts the photometric detrending and color channel mixing parameters online according to signal quality assessment, with no training. Result: PRISM achieves MAE of 0.77 bpm on PURE and 0.66 bpm on UBFC-rPPG, with accuracy of 97.3% and 97.5% respectively at a 5 bpm threshold; statistical tests show performance equivalent to leading supervised methods (p > 0.2), and the method runs in real time on a CPU. Conclusion: Adaptive time series optimization significantly improves rPPG across diverse conditions; as a training-free, real-time method, PRISM offers an efficient and reliable solution for practical use. Abstract: Remote photoplethysmography (rPPG) enables contactless vital sign monitoring using standard RGB cameras. However, existing methods rely on fixed parameters optimized for particular lighting conditions and camera setups, limiting adaptability to diverse deployment environments. This paper introduces the Projection-based Robust Signal Mixing (PRISM) algorithm, a training-free method that jointly optimizes photometric detrending and color mixing through online parameter adaptation based on signal quality assessment. PRISM achieves state-of-the-art performance among unsupervised methods, with MAE of 0.77 bpm on PURE and 0.66 bpm on UBFC-rPPG, and accuracy of 97.3% and 97.5% respectively at a 5 bpm threshold. Statistical analysis confirms PRISM performs equivalently to leading supervised methods (p > 0.2), while maintaining real-time CPU performance without training. This validates that adaptive time series optimization significantly improves rPPG across diverse conditions.
[110] Interpretable Multimodal Cancer Prototyping with Whole Slide Images and Incompletely Paired Genomics
Yupei Zhang,Yating Huang,Wanming Hu,Lequan Yu,Hujun Yin,Chao Li
Main category: cs.CV
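TL;DR: Proposes a flexible multimodal prototyping framework that integrates whole slide images with incomplete genomic data, addressing intra-tumor phenotypic and genotypic heterogeneity and the missing genomic data common in clinical practice.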
Details
Motivation: Phenotypic and genotypic heterogeneity limits the quality of single-modality representations and hinders effective cross-modal integration, and most existing methods ignore the clinical reality that genomic data may be partially missing or entirely unavailable. Method: The approach has four key components: 1) Biological Prototyping based on text prompting and prototype-wise weighting; 2) Multiview Alignment at the sample and distribution levels; 3) Bipartite Fusion that captures both shared and modality-specific information; and 4) Semantic Genomics Imputation to handle missing data. Result: Extensive experiments show that the method consistently outperforms other state-of-the-art approaches on multiple downstream tasks. Conclusion: The framework improves multimodal representation quality and enables robust prediction in precision oncology when genomic data are incomplete or missing, which makes it well suited to clinical use. Abstract: Multimodal approaches that integrate histology and genomics hold strong potential for precision oncology. However, phenotypic and genotypic heterogeneity limits the quality of intra-modal representations and hinders effective inter-modal integration. Furthermore, most existing methods overlook real-world clinical scenarios where genomics may be partially missing or entirely unavailable. We propose a flexible multimodal prototyping framework to integrate whole slide images and incomplete genomics for precision oncology. Our approach has four key components: 1) Biological Prototyping using text prompting and prototype-wise weighting; 2) Multiview Alignment through sample- and distribution-wise alignments; 3) Bipartite Fusion to capture both shared and modality-specific information for multimodal fusion; and 4) Semantic Genomics Imputation to handle missing data. Extensive experiments demonstrate the consistent superiority of the proposed method compared to other state-of-the-art approaches on multiple downstream tasks. The code is available at https://github.com/helenypzhang/Interpretable-Multimodal-Prototyping.
[111] AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views
Junwei Zhou,Yu-Wing Tai
Main category: cs.CV
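TL;DR: This paper presents AmodalGen3D, a generative framework for amodal 3D object reconstruction from sparse, unposed, and partially occluded views. It combines 2D amodal completion priors with multi-view stereo geometry constraints to reconstruct complete geometry and appearance with high quality.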
Details
Motivation: In real-world scenes, sparse viewpoints and occlusion leave many object surfaces unobserved, and traditional methods struggle to produce complete, geometrically consistent 3D reconstructions. A method that can infer the full object structure is therefore needed. Method: AmodalGen3D fuses 2D amodal completion priors with multi-view stereo geometry conditioning. It uses a View-Wise Cross Attention mechanism to fuse sparse-view features and a Stereo-Conditioned Cross Attention module to infer structure in unobserved regions, and it models visible and hidden regions jointly for consistent, plausible 3D reconstruction. Result: Experiments on synthetic and real-world datasets show that AmodalGen3D significantly outperforms existing methods under heavy occlusion and sparse input, with better completeness and fidelity. Conclusion: AmodalGen3D addresses the challenge of 3D object reconstruction under sparse, occluded conditions and supports object-level scene reconstruction in robotics, AR/VR, and embodied AI. Abstract: Reconstructing 3D objects from a few unposed and partially occluded views is a common yet challenging problem in real-world scenarios, where many object surfaces are never directly observed. Traditional multi-view or inpainting-based approaches struggle under such conditions, often yielding incomplete or geometrically inconsistent reconstructions. We introduce AmodalGen3D, a generative framework for amodal 3D object reconstruction that infers complete, occlusion-free geometry and appearance from arbitrary sparse inputs. The model integrates 2D amodal completion priors with multi-view stereo geometry conditioning, supported by a View-Wise Cross Attention mechanism for sparse-view feature fusion and a Stereo-Conditioned Cross Attention module for unobserved structure inference. By jointly modeling visible and hidden regions, AmodalGen3D faithfully reconstructs 3D objects that are consistent with sparse-view constraints while plausibly hallucinating unseen parts. Experiments on both synthetic and real-world datasets demonstrate that AmodalGen3D achieves superior fidelity and completeness under occlusion-heavy sparse-view settings, addressing a pressing need for object-level 3D scene reconstruction in robotics, AR/VR, and embodied AI applications.
[112] TAPVid-360: Tracking Any Point in 360 from Narrow Field of View Video
Finlay G. C. Hudson,James A. D. Gardner,William A. P. Smith
Main category: cs.CV
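TL;DR: This paper introduces the TAPVid-360 task and the TAPVid360-10k dataset. The task is to predict the 3D direction to queried scene points from narrow field-of-view video, even when the points lie outside the field of view, which encourages learning panoramic scene representations without dynamic 4D ground truth.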
Details
Motivation: Current vision systems handle persistent, panoramic understanding poorly; they struggle to track 2D points outside the field of view and lack a global sense of the scene. Method: The authors define the TAPVid-360 task, resample 360 videos into narrow field-of-view perspectives, and compute ground truth 3D directions by tracking points across the full panorama with a 2D pipeline. They build the TAPVid360-10k dataset and design a baseline on top of CoTracker v3 that predicts per-point direction updates. Result: On the new benchmark, the baseline outperforms existing TAP and TAPVid 3D methods and predicts 3D directions for out-of-view points more accurately. Conclusion: TAPVid-360 offers an effective new paradigm for learning panoramic scene understanding beyond the field of view and advances persistent visual tracking without full 4D scene modeling. Abstract: Humans excel at constructing panoramic mental models of their surroundings, maintaining object permanence and inferring scene structure beyond visible regions. In contrast, current artificial vision systems struggle with persistent, panoramic understanding, often processing scenes egocentrically on a frame-by-frame basis. This limitation is pronounced in the Track Any Point (TAP) task, where existing methods fail to track 2D points outside the field of view. To address this, we introduce TAPVid-360, a novel task that requires predicting the 3D direction to queried scene points across a video sequence, even when far outside the narrow field of view of the observed video. This task fosters learning allocentric scene representations without needing dynamic 4D ground truth scene models for training. Instead, we exploit 360 videos as a source of supervision, resampling them into narrow field-of-view perspectives while computing ground truth directions by tracking points across the full panorama using a 2D pipeline. We introduce a new dataset and benchmark, TAPVid360-10k comprising 10k perspective videos with ground truth directional point tracking. Our baseline adapts CoTracker v3 to predict per-point rotations for direction updates, outperforming existing TAP and TAPVid 3D methods.
[113] WalkCLIP: Multimodal Learning for Urban Walkability Prediction
Shilong Xiang,JangHyeon Lee,Min Namgung,Yao-Yi Chiang
Main category: cs.CV
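TL;DR: This paper presents WalkCLIP, a multimodal framework that predicts urban walkability more accurately by fusing street view imagery, satellite imagery, and population dynamics data.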
Details
Motivation: Traditional walkability assessments are costly and hard to scale, and single-source methods capture only one aspect of the walking environment, so an assessment that integrates several complementary sources is needed. Method: WalkCLIP learns vision-language representations from GPT-4o generated image captions, incorporates neighborhood context through a spatial aggregation module, and fuses the resulting features with representations from a population dynamics foundation model. Result: Evaluated at 4,660 locations in the Minneapolis-Saint Paul area, WalkCLIP outperforms unimodal and multimodal baselines in both predictive accuracy and spatial alignment. Conclusion: Integrating visual and behavioral signals yields reliable predictions of the walking environment and provides a new tool for urban planning and public health research. Abstract: Urban walkability is a cornerstone of public health, sustainability, and quality of life. Traditional walkability assessments rely on surveys and field audits, which are costly and difficult to scale. Recent studies have used satellite imagery, street view imagery, or population indicators to estimate walkability, but these single-source approaches capture only one dimension of the walking environment. Satellite data describe the built environment from above, but overlook the pedestrian perspective. Street view imagery captures conditions at the ground level, but lacks broader spatial context. Population dynamics reveal patterns of human activity but not the visual form of the environment. We introduce WalkCLIP, a multimodal framework that integrates these complementary viewpoints to predict urban walkability. WalkCLIP learns walkability-aware vision-language representations from GPT-4o generated image captions, refines these representations with a spatial aggregation module that incorporates neighborhood context, and fuses the resulting features with representations from a population dynamics foundation model. Evaluated at 4,660 locations throughout Minneapolis-Saint Paul, WalkCLIP outperforms unimodal and multimodal baselines in both predictive accuracy and spatial alignment. These results show that the integration of visual and behavioral signals yields reliable predictions of the walking environment.
[114] DeepGI: Explainable Deep Learning for Gastrointestinal Image Classification
Walid Houmaidi,Mohamed Hadadi,Youssef Sabiri,Yousra Chtouki
Main category: cs.CV
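TL;DR: Using a new dataset of 4,000 gastrointestinal endoscopic images, this study compares several deep learning models on four-class disease classification. VGG16 and MobileNetV2 both reach 96.5% accuracy, and Grad-CAM adds explainability, advancing computer-aided diagnosis of gastrointestinal disease.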
Details
Motivation: The work targets common challenges in endoscopic image analysis, such as variable lighting, unstable camera angles, and imaging artifacts, to improve the accuracy and clinical interpretability of automated disease classification. Method: The authors compare deep learning models including VGG16, MobileNetV2, and Xception, and use Grad-CAM visualization to make model predictions more interpretable. Result: VGG16 and MobileNetV2 each reach 96.5% test accuracy and Xception reaches 94.24%; Grad-CAM highlights the image regions that most influence predictions, improving clinical trust. Conclusion: The study establishes a reliable benchmark for automated gastrointestinal disease classification, underlines the importance of high-quality clinical datasets and model explainability in medical AI, and offers a practical, interpretable approach to computer-aided diagnosis. Abstract: This paper presents a comprehensive comparative model analysis on a novel gastrointestinal medical imaging dataset, comprised of 4,000 endoscopic images spanning four critical disease classes: Diverticulosis, Neoplasm, Peritonitis, and Ureters. Leveraging state-of-the-art deep learning techniques, the study confronts common endoscopic challenges such as variable lighting, fluctuating camera angles, and frequent imaging artifacts. The best performing models, VGG16 and MobileNetV2, each achieved a test accuracy of 96.5%, while Xception reached 94.24%, establishing robust benchmarks and baselines for automated disease classification. In addition to strong classification performance, the approach includes explainable AI via Grad-CAM visualization, enabling identification of image regions most influential to model predictions and enhancing clinical interpretability. Experimental results demonstrate the potential for robust, accurate, and interpretable medical image analysis even in complex real-world conditions. This work contributes original benchmarks, comparative insights, and visual explanations, advancing the landscape of gastrointestinal computer-aided diagnosis and underscoring the importance of diverse, clinically relevant datasets and model explainability in medical AI research.
[115] PAT3D: Physics-Augmented Text-to-3D Scene Generation
Guying Lin,Kemeng Huang,Michael Liu,Ruihan Gao,Hanke Chen,Lyuhao Chen,Beijia Lu,Taku Komura,Yuan Liu,Jun-Yan Zhu,Minchen Li
Main category: cs.CV
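TL;DR: PAT3D is the first text-to-3D scene generation framework that combines vision-language models with physics simulation, producing physically plausible, intersection-free 3D scenes that can be used directly in simulation.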
Details
Motivation: Existing text-to-3D generation methods lack physical plausibility and simulation readiness, so they fall short of the physical realism and interactivity that practical applications such as robotic manipulation require.
Method: From a text prompt, PAT3D generates 3D objects and their spatial relations, builds a hierarchical scene tree, and passes it to a differentiable rigid-body physics engine for simulation-based optimization. A simulation-in-the-loop optimization strategy ensures the scene reaches static equilibrium under gravity without interpenetration.
Result: Experiments show that PAT3D substantially outperforms prior methods in physical plausibility, semantic consistency, and visual quality, and directly outputs simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation.
Conclusion: PAT3D generates physically realistic, simulation-ready 3D scenes from text, provides an effective framework for combining generative models with physics simulation, and broadens their potential in real-world interactive tasks.
Abstract: We introduce PAT3D, the first physics-augmented text-to-3D scene generation framework that integrates vision-language models with physics-based simulation to produce physically plausible, simulation-ready, and intersection-free 3D scenes. Given a text prompt, PAT3D generates 3D objects, infers their spatial relations, and organizes them into a hierarchical scene tree, which is then converted into initial conditions for simulation. A differentiable rigid-body simulator ensures realistic object interactions under gravity, driving the scene toward static equilibrium without interpenetrations. To further enhance scene quality, we introduce a simulation-in-the-loop optimization procedure that guarantees physical stability and non-intersection, while improving semantic consistency with the input prompt. Experiments demonstrate that PAT3D substantially outperforms prior approaches in physical plausibility, semantic consistency, and visual quality. Beyond high-quality generation, PAT3D uniquely enables simulation-ready 3D scenes for downstream tasks such as scene editing and robotic manipulation. Code and data will be released upon acceptance.
[116] DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models
Futian Wang,Chaoliu Weng,Xiao Wang,Zhen Chen,Zhicheng Zhao,Jin Tang
Main category: cs.CV
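TL;DR: This paper proposes MRLM, a new vision-language model for pointer meter reading recognition, and releases RPM-10K, a large-scale benchmark dataset of 10,730 images. Modeling physical relations improves reading accuracy in complex environments.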
Details
Motivation: Existing pointer meter recognition methods are fragile under reflections, occlusions, and viewpoint changes, and no large-scale dataset exists to support the development of robust algorithms.
Method: The authors build the large-scale RPM-10K dataset and propose the MRLM model, which uses physical relation injection to explicitly encode the geometric and causal relationships between pointer and scale, and predicts readings through cross-attentional fusion and adaptive expert selection.
Result: Extensive experiments on the new RPM-10K dataset show a clear improvement in meter reading accuracy.
Conclusion: By combining physical priors with vision-language modeling, the method improves the accuracy and robustness of pointer meter reading in complex scenes and advances automated monitoring in smart power systems.
Abstract: The precise reading recognition of pointer meters plays a key role in smart power systems, but existing approaches remain fragile due to challenges like reflections, occlusions, dynamic viewing angles, and overlap between thin pointers and scale markings. Up to now, this area still lacks large-scale datasets to support the development of robust algorithms. To address these challenges, this paper first presents a new large-scale benchmark dataset for dial reading, termed RPM-10K, which contains 10,730 meter images that fully reflect the aforementioned key challenges. Built upon the dataset, we propose a novel vision-language model for pointer meter reading recognition, termed MRLM, based on physical relation injection. Instead of exhaustively learning image-level correlations, MRLM explicitly encodes the geometric and causal relationships between the pointer and the scale, aligning perception with physical reasoning in the spirit of world-model perspectives. Through cross-attentional fusion and adaptive expert selection, the model learns to interpret dial configurations and generate precise numeric readings. Extensive experiments validate the effectiveness of our proposed framework on the newly proposed benchmark dataset. Both the dataset and source code will be released at https://github.com/Event-AHU/DialBench
[117] PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation
Xuchen Li,Hengrui Gu,Mohan Zhang,Qin Liu,Zhen Tan,Xinyuan Zhu,Huixue Zhou,Tianlong Chen,Kaixiong Zhou
Main category: cs.CV
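TL;DR: PPBoost is a zero-shot framework that converts weak text signals into strong, spatially grounded visual prompts (such as bounding boxes) to improve medical image segmentation accuracy, outperforming text- and visual-prompted baselines without labeled data.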
Details
Motivation: Text-prompted medical image segmentation models lack spatial precision and degrade under domain shift, while visual-prompted models perform well but depend on precise bounding-box prompts that are costly to obtain in clinical practice.
Method: PPBoost uses a vision-language model to generate initial pseudo bounding boxes from text descriptions, filters unreliable predictions with an uncertainty-aware criterion, and trains a pseudo-labeled detector that produces high-quality boxes. At inference it further refines the boxes so that they tightly cover the target structures, and these boxes guide a segmentation model to produce precise masks.
Result: On three datasets spanning different modalities and anatomies, PPBoost beats text- and visual-prompted baselines on Dice and Normalized Surface Distance, even surpasses few-shot segmentation models that use labeled data, and generalizes to several mainstream segmentation backbones.
Conclusion: PPBoost bridges the gap between text- and visual-prompted segmentation models, turning weak text cues into strong spatial guidance under a strict zero-shot regime, with good clinical applicability and generalization.
Abstract: Text-prompted foundation models for medical image segmentation offer an intuitive way to delineate anatomical structures from natural language queries, but their predictions often lack spatial precision and degrade under domain shift. In contrast, visual-prompted models achieve strong segmentation performance across diverse modalities by leveraging spatial cues of precise bounding-box (bbox) prompts to guide the segmentation of target lesions. However, it is costly and challenging to obtain the precise visual prompts in clinical practice. We propose PPBoost (Progressive Prompt-Boosting), a framework that bridges these limitations by transforming weak text-derived signals into strong, spatially grounded visual prompts, operating under a strict zero-shot regime with no image- or pixel-level segmentation labels. PPBoost first uses a vision-language model to produce initial pseudo-bboxes conditioned on the textual object descriptions and applies an uncertainty-aware criterion to filter unreliable predictions. The retained image-bbox pairs are then leveraged to train a pseudo-labeled detector, producing high-quality bboxes for the query images. During inference, PPBoost further refines the generated bboxes by appropriately expanding them to tightly cover the target anatomical structures. The enhanced, spatially grounded bbox prompts guide existing segmentation models to generate final dense masks, effectively amplifying weak text cues into strong spatial guidance.
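The filtering and box-expansion steps described above can be illustrated with a minimal Python sketch. The abstract does not specify the uncertainty criterion or the expansion rule, so everything here is an assumption made for illustration: the `PseudoBox` type, the scalar `confidence` score, the `min_confidence` threshold, and the symmetric `ratio` expansion are hypothetical stand-ins, not PPBoost's actual components.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PseudoBox:
    """A pseudo bounding box in pixel coordinates (hypothetical structure)."""
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float  # assumed reliability score in [0, 1]


def filter_reliable(boxes: List[PseudoBox], min_confidence: float) -> List[PseudoBox]:
    """Keep only boxes whose assumed confidence reaches the threshold."""
    return [b for b in boxes if b.confidence >= min_confidence]


def expand_box(box: PseudoBox, ratio: float, width: int, height: int) -> PseudoBox:
    """Grow a box symmetrically by `ratio` of its size, clamped to the image."""
    dx = (box.x2 - box.x1) * ratio / 2.0
    dy = (box.y2 - box.y1) * ratio / 2.0
    return PseudoBox(
        x1=max(0.0, box.x1 - dx),
        y1=max(0.0, box.y1 - dy),
        x2=min(float(width), box.x2 + dx),
        y2=min(float(height), box.y2 + dy),
        confidence=box.confidence,
    )


if __name__ == "__main__":
    candidates = [PseudoBox(10, 10, 30, 50, 0.9), PseudoBox(5, 5, 8, 8, 0.2)]
    kept = filter_reliable(candidates, min_confidence=0.5)
    print([expand_box(b, ratio=0.2, width=100, height=100) for b in kept])
```

A fixed expansion ratio is the simplest possible choice; a real system would more plausibly tune or learn how far to expand for each structure.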
Across three datasets spanning diverse modalities and anatomies, PPBoost consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines and, notably, surpasses few-shot segmentation models without using labeled data. PPBoost also generalizes to multiple typical visual segmentation model backbones.
[118] Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Apratim Bhattacharyya,Bicheng Xu,Sanjay Haresh,Reza Pourreza,Litian Liu,Sunny Panchal,Pulkit Madan,Leonid Sigal,Roland Memisevic
Main category: cs.CV
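TL;DR: This paper introduces Qualcomm Interactive Cooking, a new benchmark and dataset for live interactive guidance, and LiveMamba, a streaming multi-modal LLM that gives real-time feedback on user actions and detects mistakes.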
Details
Motivation: Existing multi-modal LLMs fall short at live, interactive step-by-step guidance, especially where execution must be monitored and mistakes identified in real time, and suitable benchmarks and models are lacking.
Method: Built on CaptainCook4D, the Qualcomm Interactive Cooking dataset includes user mistakes and their corrections, with densely annotated, timed instructions and feedback. LiveMamba, a streaming multi-modal model that responds asynchronously to a video stream, is designed and evaluated on it.
Result: A new benchmark and dataset are released; experiments evaluate state-of-the-art multi-modal LLMs and show LiveMamba's effectiveness on live interactive guidance.
Conclusion: The work provides the first dedicated benchmark and a strong baseline for live, situated coaching, advancing AI assistants toward real-time interactive guidance.
Abstract: Multi-modal Large Language Models (LLMs) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real time. This requires models that are not turn-based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark feature densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating live, situated coaching.
[119] StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation
Sen Fang,Hongbin Zhong,Yalin Feng,Dimitris N. Metaxas
Main category: cs.CV
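TL;DR: This paper proposes a comprehensive acceleration pipeline for Rectified Flow generative models. New methods such as batching, vectorization of heterogeneous time steps, and dynamic TensorRT compilation substantially improve generation efficiency.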
Details
Motivation: Because Rectified Flow differs from existing diffusion models in theory and design, conventional acceleration methods are hard to apply directly, so a dedicated acceleration scheme is needed.
Method: Build an acceleration pipeline covering theory, design, and inference strategy, using batching with a new velocity field, vectorized processing of heterogeneous time steps, and dynamic TensorRT compilation.
Result: Experiments show the method speeds up 512*512 image generation by up to 611%, far beyond existing general acceleration methods (typically only 18%).
Conclusion: The proposed pipeline adapts efficiently to flow-based generative models and delivers a large gain in generation speed, with strong practical value and potential for wider adoption.
Abstract: New technologies such as Rectified Flow and Flow Matching have significantly improved the performance of generative models in the past two years, especially in terms of control accuracy, generation quality, and generation efficiency. However, because Rectified Flow differs from existing diffusion models in theory and design, existing acceleration methods cannot be applied to it directly. In this article, we implement a complete acceleration pipeline covering theory, design, and inference strategies. The pipeline uses new methods such as batch processing with a new velocity field, vectorization of heterogeneous time-step batch processing, and dynamic TensorRT compilation to comprehensively accelerate flow-based models. Existing public methods usually achieve an acceleration of 18%, while our experiments show that the new method can accelerate 512*512 image generation to up to 611%, far beyond current non-generalized acceleration methods.
[120] MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis
Chunzheng Zhu,Yangfang Lin,Shen Chen,Yijun Wang,Jianxin Lin
Main category: cs.CV
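TL;DR: MedEyes is a new reinforcement learning framework that uses expert visual trajectories as guidance to emulate clinicians' diagnostic reasoning, improving accuracy and interpretability on medical visual question answering.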
Details
Motivation: Vision-language models show chain-of-thought ability in medical diagnosis, but their purely on-policy learning paradigm tends to reinforce reasoning paths that are superficially coherent yet clinically inaccurate, and it does not model real clinical reasoning.
Method: MedEyes comprises a Gaze-guided Reasoning Navigator (GRN) for dual-mode exploration (scanning and drilling) that uses expert visual search trajectories as external behavioral signals, a Confidence Value Sampler (CVS) that yields diverse yet credible exploration paths, and a dual-stream GRPO optimization framework that decouples on-policy and off-policy learning signals.
Result: Average performance improves by +8.5% across multiple medical VQA benchmarks, significantly outperforming existing methods.
Conclusion: MedEyes effectively imitates clinicians' visual reasoning and, by combining expert guidance with autonomous exploration, improves diagnostic accuracy and interpretability, offering a new direction for trustworthy medical AI systems.
Abstract: Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse.
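Nucleus (top-p) sampling, which the abstract names as one ingredient of the Confidence Value Sampler, can be sketched generically in Python. This shows only the standard technique, not the CVS itself: the candidate probabilities, the `top_p` value, and the function names are hypothetical, and the adaptive termination step is omitted because the abstract gives no detail on it.

```python
import random
from typing import List, Tuple


def nucleus_filter(probs: List[float], top_p: float) -> List[Tuple[int, float]]:
    """Keep the smallest set of highest-probability items whose mass reaches top_p.

    Returns (index, renormalized probability) pairs.
    """
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    kept, total = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        total += p
        if total >= top_p:
            break
    return [(index, p / total) for index, p in kept]


def nucleus_sample(probs: List[float], top_p: float, rng: random.Random) -> int:
    """Sample one index from the truncated, renormalized distribution."""
    kept = nucleus_filter(probs, top_p)
    indices = [index for index, _ in kept]
    weights = [weight for _, weight in kept]
    return rng.choices(indices, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical scores for four candidate image regions.
    region_probs = [0.5, 0.3, 0.15, 0.05]
    rng = random.Random(0)
    print(nucleus_filter(region_probs, top_p=0.7))
    print([nucleus_sample(region_probs, 0.7, rng) for _ in range(5)])
```

Truncating the low-probability tail keeps sampled paths credible, while sampling inside the kept set (instead of always taking the argmax) preserves diversity.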
Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5% across multiple medical VQA benchmarks, validating MedEyes's potential in building interpretable medical AI systems.
[121] Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models
Zhenxiang Lin,Maryam Haghighat,Will Browne,Dimity Miller
Main category: cs.CV
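TL;DR: This paper proposes a training-free uncertainty estimation method for vision-language models that detects erroneous predictions through intra-class visual feature consistency and remains robust across datasets.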
Details
Motivation: Vision-language models such as CLIP perform well in open-vocabulary classification but often assign high confidence to misclassifications, which limits their reliability in safety-critical settings.
Method: A training-free, post-hoc method uses feature projection and multivariate Gaussians to build class-specific probabilistic embeddings that measure intra-class visual feature consistency, and estimates uncertainty from them.
Result: The method achieves state-of-the-art error detection on ImageNet, Flowers102, Food101, EuroSAT, and DTD, significantly outperforming deterministic and probabilistic baselines, and works with as few as 10 images per class.
Conclusion: The method is general, needs no fine-tuning, and is robust to distribution shift, providing an efficient and reliable uncertainty estimate for contrastive vision-language models.
Abstract: Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. The key to our approach is to measure visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Our method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and works effectively with as few as 10 training images per class. Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Code is available at https://github.com/zhenxianglin/ICPE.
[122] Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation
Joel Alberto Santos,Zongwei Wu,Xavier Alameda-Pineda,Radu Timofte
Main category: cs.CV
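TL;DR: This paper examines whether objects can be grounded directly from speech, proposes an audio-visual alignment approach that does not rely on text transcription, and builds a new dataset covering a wide variety of objects and human accents. Experiments show that the approach outperforms transcription-based methods in some cases and is more robust to linguistic variability.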
Details
Motivation: Transcription-based methods for grounding objects from spoken instructions are inefficient and not robust, especially with diverse accents and linguistic variation, so a more direct and efficient form of audio-visual alignment is worth exploring.
Method: The paper grounds objects directly from single-word spoken instructions, builds a new audio grounding dataset, and adapts and benchmarks models from the audio-visual field.
Result: Experiments show that grounding directly from audio is feasible and in some cases outperforms transcription-based methods, with greater robustness to linguistic variability.
Conclusion: Direct audio grounding is a promising direction that can improve the efficiency and robustness of multimodal understanding systems and deserves further study.
Abstract: Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from the closely related audio-visual field. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.
[123] PAGen: Phase-guided Amplitude Generation for Domain-adaptive Object Detection
Shuchen Du,Shuo Lei,Feiran Li,Jiacheng Li,Daisuke Iso
Main category: cs.CV
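TL;DR: This paper proposes a simple, effective unsupervised domain adaptation method that learns image style transfer in the frequency domain to reduce the gap between source and target domains. It uses a lightweight pre-processing module only during training, adds no computational overhead at inference, and performs well on multiple domain-adaptive object detection tasks.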
Details
Motivation: Existing unsupervised domain adaptation methods usually depend on complex adversarial training or elaborate network structures, which makes training difficult and inference costly, so a simpler and more effective method is needed.
Method: Image style transfer is performed in the frequency domain through a lightweight pre-processing module that aligns source and target feature distributions. The module is used only during training and removed entirely at inference.
Result: The method achieves substantial gains on multiple DAOD benchmarks, matching or exceeding more complex methods while adding no inference-time computation.
Conclusion: The method is simple, efficient, and practical, offering a new approach to unsupervised domain adaptation that suits resource-constrained real-world applications.
Abstract: Unsupervised domain adaptation (UDA) greatly facilitates the deployment of neural networks across diverse environments. However, most state-of-the-art approaches are overly complex, relying on challenging adversarial training strategies, or on elaborate architectural designs with auxiliary models for feature distillation and pseudo-label generation. In this work, we present a simple yet effective UDA method that learns to adapt image styles in the frequency domain to reduce the discrepancy between source and target domains. The proposed approach introduces only a lightweight pre-processing module during training and entirely discards it at inference time, thus incurring no additional computational overhead. We validate our method on domain-adaptive object detection (DAOD) tasks, where ground-truth annotations are easily accessible in source domains (e.g., normal-weather or synthetic conditions) but challenging to obtain in target domains (e.g., adverse weather or low-light scenes). Extensive experiments demonstrate that our method achieves substantial performance gains on multiple benchmarks, highlighting its practicality and effectiveness.
[124] SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
Jiayuan Du,Yiming Zhao,Zhenglong Guo,Yong Pan,Wenbo Hou,Zhihui Hao,Kun Zhan,Qijun Chen
Main category: cs.CV
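TL;DR: This paper proposes a new Transformer-based end-to-end architecture for 3D scene occupancy forecasting that predicts multi-frame future occupancy directly from raw image features, avoiding the limits of discretization and BEV projection, and achieves state-of-the-art performance on the nuScenes benchmark.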
Details
Motivation: Existing methods rely on VAEs to generate discrete occupancy tokens or use BEV projection, which limits representational capacity and imposes geometric priors, making complex spatiotemporal dynamics hard to model.
Method: An attention-based Transformer over a sparse occupancy representation forecasts multi-frame future occupancy end to end directly from raw image features, bypassing BEV projection and discrete tokenization.
Result: The method significantly outperforms existing approaches on 1-3 second occupancy forecasting on nuScenes, showing strong understanding of scene dynamics and high accuracy under arbitrary trajectory conditioning.
Conclusion: By avoiding discrete representations and explicit geometric priors, the method improves the expressiveness and performance of future 3D occupancy forecasting and supports the case for an end-to-end sparse Transformer architecture.
Abstract: This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird's eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1-3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.
[125] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution
Junoh Kang,Donghun Ryu,Bohyung Han
Main category: cs.CV
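TL;DR: This paper proposes image-conditioned manifold regularization (ICM), which conditions on sparse structural information (a colormap and Canny edges) to improve real-world image super-resolution quality while avoiding the instability caused by conditioning directly on dense inputs.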
Details
Motivation: Existing methods based on text-conditioned manifold regularization are conceptually misaligned with real-world image super-resolution and suffer from a flawed generative prior (color distortion, blurred edges), and they underuse the structural information in low-quality images.
Method: ICM switches the diffusion model's regularizing manifold from text conditioning to conditioning on sparse structural information, a colormap and Canny edges, giving a task-aligned and stable optimization process.
Result: Experiments show ICM significantly improves the perceptual quality of real-world super-resolution, with better color fidelity and edge sharpness than existing methods.
Conclusion: By introducing a task-aligned, sparse image-conditioned manifold, ICM addresses the conceptual and practical flaws of earlier methods and provides a more stable, higher-quality approach to diffusion-based Real-ISR.
Abstract: Real-world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high quality (HQ) images directly tied to the low quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on the sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense-conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications.
We will release the source code of our work for reproducibility.
[126] TPCNet: Triple physical constraints for Low-light Image Enhancement
Jing-Yi Shi,Ming-Fei Li,Ling-An Wu
Main category: cs.CV
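TL;DR: This paper proposes the triple physical constraints (TPC) theory based on Kubelka-Munk theory and builds physical constraints in feature space to improve low-light image enhancement.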
Details
Motivation: Existing Retinex methods ignore specular reflection and build constraints in image space, which limits generalization.
Method: Using Kubelka-Munk theory and retaining the specular reflection coefficient, the method establishes triple physical constraints (TPC) among illumination, reflection, and detection in feature space and designs the TPCNet network.
Result: On 10 datasets, TPCNet outperforms state-of-the-art methods in quantitative and qualitative experiments without adding parameters.
Conclusion: The TPC theory improves low-light image enhancement performance and visual quality, with good generalization and practicality.
Abstract: Low-light image enhancement is an essential computer vision task to improve image contrast and to decrease the effects of color bias and noise. Many existing interpretable deep-learning algorithms exploit the Retinex theory as the basis of model design. However, previous Retinex-based algorithms that treat reflecting objects as ideal Lambertian surfaces ignore specular reflection in the modeling process and construct the physical constraints in image space, limiting the generalization of the model. To address this issue, we preserve the specular reflection coefficient and reformulate the original physical constraints in the imaging process based on the Kubelka-Munk theory, thereby constructing a constraint relationship between illumination, reflection, and detection, the so-called triple physical constraints (TPCs) theory. Based on this theory, the physical constraints are constructed in the feature space of the model to obtain the TPC network (TPCNet). Comprehensive quantitative and qualitative benchmark and ablation experiments confirm that these constraints effectively improve the performance metrics and visual quality without introducing new parameters, and demonstrate that our TPCNet outperforms other state-of-the-art methods on 10 datasets.
[127] OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
Jing Hao,Yuci Liang,Lizhuo Lin,Yuxuan Fan,Wenkai Zhou,Kaixin Guo,Zanting Ye,Yanpeng Sun,Xinyu Zhang,Yanqi Yang,Qiankun Li,Hao Tang,James Kit-Hon Tsoi,Linlin Shen,Kuo Feng Hung
Main category: cs.CV
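TL;DR: This paper presents OralGPT-Omni, the first multimodal large language model designed for dentistry, together with the TRACE-CoT dataset and the MMOral-Uni benchmark, substantially improving the understanding and reliability of dental image analysis.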
Details
Motivation: Dentistry remains underexplored in multimodal large language model research because of limited domain-specific data, scarce expert annotations, insufficient modality-specific modeling, and reliability challenges.
Method: The paper proposes OralGPT-Omni, trained with the clinically grounded chain-of-thought dataset TRACE-CoT and a four-stage training paradigm to strengthen dental image understanding, and builds the unified multimodal benchmark MMOral-Uni for evaluation.
Result: OralGPT-Omni scores 51.84 on MMOral-Uni and 45.31 on MMOral-OPG, significantly outperforming existing models such as GPT-5.
Conclusion: The work advances intelligent dentistry and provides a reliable model and a standard evaluation platform for future dental image analysis.
Abstract: Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet, dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists' decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model's capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.
[128] DNA: Dual-branch Network with Adaptation for Open-Set Online Handwriting Generation
Tsai-Ling Huang,Nhat-Tuong Do-Tran,Ngoc-Hoang-Lam Le,Hong-Han Shuai,Ching-Chun Huang
Main category: cs.CV
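TL;DR: This paper proposes DNA, a dual-branch network for online handwriting generation that can generate writing styles and characters unseen during training. It is especially suited to glyph-based languages such as Chinese and achieves state-of-the-art results in the unseen-character setting.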
Details
Motivation: Existing online handwriting generation methods struggle to generate characters unseen during training and generalize poorly in glyph-heavy languages such as Chinese, which limits practical use.
Method: The dual-branch network DNA has an adaptive style branch and an adaptive content branch. The style branch learns stroke attributes such as direction, spacing, placement, and flow; the content branch uses local and global encoders to extract a character's structural information and texture details, respectively, so that it generalizes to unseen characters.
Result: Experiments show that DNA performs well with both unseen writing styles and unseen characters, significantly outperforming existing methods and reaching state-of-the-art performance.
Conclusion: By modeling style and content separately, DNA improves the generalization of online handwriting generation to unseen characters and has good practical value.
Abstract: Online handwriting generation (OHG) enhances handwriting recognition models by synthesizing diverse, human-like samples. However, existing OHG methods struggle to generate unseen characters, particularly in glyph-based languages like Chinese, limiting their real-world applicability. In this paper, we introduce our method for OHG, where the writer's style and the characters generated during testing are unseen during training. To tackle this challenge, we propose a Dual-branch Network with Adaptation (DNA), which comprises an adaptive style branch and an adaptive content branch. The style branch learns stroke attributes such as writing direction, spacing, placement, and flow to generate realistic handwriting. Meanwhile, the content branch is designed to generalize effectively to unseen characters by decomposing character content into structural information and texture details, extracted via local and global encoders, respectively. Extensive experiments demonstrate that our DNA model is well-suited for the unseen OHG setting, achieving state-of-the-art performance.
[129] WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation
Quanjian Song,Yiren Song,Kelly Peng,Yuan Gao,Mike Zheng Shou
Main category: cs.CV
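TL;DR: This paper proposes WorldWander, a framework for translating between first-person and third-person perspectives in video generation. Built on video diffusion models with In-Context Perspective Alignment and Collaborative Position Encoding, and trained on the newly built EgoExo-8K dataset, it achieves superior perspective synchronization, character consistency, and generalization.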
Details
Motivation: Seamless video translation between first-person (egocentric) and third-person (exocentric) perspectives remains little studied, yet it matters for filmmaking, embodied AI, and world models.
Method: Building on advanced video diffusion transformers, WorldWander adds In-Context Perspective Alignment and Collaborative Position Encoding, and is trained and validated on the newly built large-scale synchronized dataset EgoExo-8K.
Result: Experiments show that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.
Conclusion: Through its alignment mechanism and encoding strategy, WorldWander enables cross-view video generation and advances multi-view dynamic modeling.
Abstract: Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.
[130] MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation
Simon Joseph Clément Crête,Marta Kersten-Oertel,Yiming Xiao
Main category: cs.CV
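TL;DR: This paper proposes a new MRI brain age estimation method based on supervised contrastive learning with the Rank-N-Contrast (RNC) loss, with Grad-RAM for visual explanation. It significantly outperforms conventional deep regression on limited data, and the correlation between brain age gap and disease severity is examined in Alzheimer's disease and Parkinson's disease.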
Details
Motivation: Existing deep learning methods for brain age estimation do not adequately capture the continuous nature of neuromorphological change, which leads to sub-optimal feature representations; a more effective learning strategy is needed to improve performance and interpretability.
Method: Within a supervised contrastive learning framework, the Rank-N-Contrast (RNC) loss is applied to brain age regression for the first time, Grad-RAM is used to visually explain the regression results, and a ResNet-based model is trained on T1-weighted structural MRI.
Result: With limited training data the method reaches a mean absolute error (MAE) of 4.27 years and an R² of 0.93, outperforming conventional deep regression with the same backbone and matching or exceeding state-of-the-art methods trained on larger datasets. Grad-RAM shows that RNC captures finer age-related features.
Conclusion: The method estimates brain age more accurately, reveals the link between brain age gap and disease severity in neurodegenerative disease, and shows potential as a biomarker for neurological disorders.
Abstract: MRI-based brain age estimation models aim to assess a subject's biological brain age based on information such as neuroanatomical features. Various factors, including neurodegenerative diseases, can accelerate brain aging, and measuring this phenomenon could serve as a potential biomarker for clinical applications. While deep learning (DL)-based regression has recently attracted major attention, existing approaches often fail to capture the continuous nature of neuromorphological changes, potentially resulting in sub-optimal feature representation and results. To address this, we propose to use supervised contrastive learning with the recent Rank-N-Contrast (RNC) loss to estimate brain age based on widely used T1w structural MRI for the first time and leverage Grad-RAM to visually explain regression results. Experiments show that our proposed method achieves a mean absolute error (MAE) of 4.27 years and an $R^2$ of 0.93 with a limited dataset of training samples, significantly outperforming conventional deep regression with the same ResNet backbone while performing better or comparably with the state-of-the-art methods with significantly larger training data. Furthermore, Grad-RAM revealed more nuanced features related to age regression with the RNC loss than conventional deep regression.
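For readers less familiar with the two reported metrics, the short Python sketch below shows how MAE and $R^2$ are conventionally computed for an age regressor. The ages in the example are made-up toy values chosen for easy arithmetic, not data from the paper, and the function names are illustrative.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted ages (in years)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy chronological ages and hypothetical model predictions.
    ages = [20.0, 40.0, 60.0, 80.0]
    predictions = [24.0, 38.0, 63.0, 77.0]
    print(mean_absolute_error(ages, predictions))  # 3.0
    print(r_squared(ages, predictions))            # approximately 0.981
```

MAE is expressed in years and is easy to interpret clinically, while $R^2$ indicates how much of the age variance in the cohort the predictions explain, so the two are usually reported together.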
As an exploratory study, we employed the proposed method to estimate the gap between the biological and chronological brain ages in Alzheimer's Disease and Parkinson's disease patients, and revealed the correlation between the brain age gap and disease severity, demonstrating its potential as a biomarker in neurodegenerative disorders.
[131] MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding
Yu Li,Yuenan Hou,Yingmei Wei,Xinge Zhu,Yuexin Ma,Wenqi Shao,Yanming Guo
Main category: cs.CV
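TL;DR: This paper proposes MoE3D, a Mixture of Experts (MoE) framework for multi-modal 3D understanding in which specialized expert networks handle individual modalities or cross-modal interactions to improve fusion. It performs strongly on several 3D tasks and surpasses the previous best method on Multi3DRefer by 6.1 mIoU.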
Details
Motivation: Existing multi-modal fusion methods usually use a single dense network, which struggles with the heterogeneity and complexity across modalities and limits performance.
Method: The paper introduces an MoE mechanism with multiple experts specialized for particular modalities or cross-modal interactions, uses an MoE-based Transformer and an information aggregation module to strengthen feature fusion, applies Top-1 gating for efficiency, and proposes a progressive pre-training strategy to exploit 2D and semantic priors.
Result: The method achieves competitive performance on four mainstream 3D understanding tasks and improves on the current best method by 6.1 mIoU on Multi3DRefer.
Conclusion: Through expert specialization and efficient fusion, MoE3D improves multi-modal 3D understanding and demonstrates the potential of MoE in this area.
Abstract: Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, struggling to handle the significant heterogeneity and complexity across modalities, leading to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. The core is that we deploy a set of specialized "expert" networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better utilize the complementary information hidden in the visual features. An information aggregation module is put forward to further enhance the fusion performance. Top-1 gating is employed to make one expert process features with expert groups, ensuring high efficiency. We further propose a progressive pre-training strategy to better leverage the semantic and 2D prior, thus equipping the network with good initialization. Our MoE3D achieves competitive performance across four prevalent 3D understanding tasks. Notably, our MoE3D surpasses the top-performing counterpart by 6.1 mIoU on Multi3DRefer.
[132] HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
Chen Zhang,Yilu An,Ying Chen,Hao Li,Xitong Ling,Lihao Liu,Junjun He,Yuxiang Lin,Zihui Wang,Rongshan Yu
Main category: cs.CV
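TL;DR: HyperST is a new framework for spatial transcriptomics prediction that models the data's inherent hierarchy in hyperbolic space to learn multi-level image-gene representations, significantly improving cross-modal gene expression prediction.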
Details
Motivation: Existing methods focus mainly on spot-level image-gene matching and do not fully exploit the hierarchical structure of spatial transcriptomics data, especially on the gene expression side. In addition, gene expression carries molecular information that often lacks a salient visual counterpart, creating an information asymmetry between modalities that makes alignment difficult. Method: Proposes the HyperST framework: 1) a multi-level representation extractor that obtains spot-level and niche-level representations from each of the two modalities; 2) a hierarchical hyperbolic alignment module that unifies these representations in hyperbolic space, performing spatial alignment and building a hierarchy over image and gene embeddings. Result: Achieves state-of-the-art performance on four public datasets from different tissues, significantly outperforming existing methods. Conclusion: By modeling hierarchy in hyperbolic space, HyperST bridges the modality gap between images and gene expression and enriches image representations with molecular semantics, offering a more scalable and accurate solution for spatial transcriptomics prediction. Abstract: Spatial Transcriptomics (ST) merges the benefits of pathology images and gene expression, linking molecular profiles with tissue structure to analyze spot-level function comprehensively. Predicting gene expression from histology images is a cost-effective alternative to expensive ST technologies. However, existing methods mainly focus on spot-level image-to-gene matching but fail to leverage the full hierarchical structure of ST data, especially on the gene expression side, leading to incomplete image-gene alignment. Moreover, a challenge arises from the inherent information asymmetry: gene expression profiles contain more molecular details that may lack salient visual correlates in histological images, demanding a sophisticated representation learning approach to bridge this modality gap. We propose HyperST, a framework for ST prediction that learns multi-level image-gene representations by modeling the data's inherent hierarchy within hyperbolic space, a natural geometric setting for such structures. First, we design Multi-Level Representation Extractors to capture both spot-level and niche-level representations from each modality, providing context-aware information beyond individual spot-level image-gene pairs. Second, a Hierarchical Hyperbolic Alignment module is introduced to unify these representations, performing spatial alignment while hierarchically structuring image and gene embeddings. This alignment strategy enriches the image representations with molecular semantics, significantly improving cross-modal prediction.
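The abstract does not say which model of hyperbolic space HyperST uses. As a minimal sketch, assuming the Poincaré ball model (an assumption, not confirmed by the source), the geodesic distance between two embeddings could be computed as follows. The function name `poincare_distance` is illustrative only.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumption: embeddings live in the Poincare ball model; the abstract
    does not state which hyperbolic model the paper actually uses.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)
```

Distances grow rapidly as points approach the boundary of the ball, which is why this geometry is often chosen for tree-like hierarchies such as spot-level representations nested inside niche-level ones.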
HyperST achieves state-of-the-art performance on four public datasets from different tissues, paving the way for more scalable and accurate spatial transcriptomics prediction.
[133] PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization
Mingzhe Li,Renhao Zhang,Zhiyang Wen,Siqi Pan,Bruno Castro da Silva,Juan Zhai,Shiqing Ma
Main category: cs.CV
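TL;DR: This paper proposes PROMPTMINER, a black-box prompt stealing framework that recovers the original prompt from a text-to-image model's output in two stages, reinforcement learning-based optimization followed by fuzzing-driven search, and shows superior performance and robustness across multiple metrics and real-world scenarios.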
Details
Motivation: High-quality text prompts are valuable assets for text-to-image generative models, but they also face intellectual-property and security risks, in particular prompt stealing attacks. Existing methods depend on white-box access, large-scale labeled data, or plain captioning, which limits their practicality, so a more practical and general black-box method is needed. Method: Proposes the PROMPTMINER framework with two phases: the first uses reinforcement learning-based optimization to reconstruct the primary subject of the image; the second uses a fuzzing-driven search strategy to recover stylistic modifiers. The whole process needs neither gradient information nor large amounts of labeled data and therefore applies in black-box settings. Result: Experiments across multiple datasets and diffusion models show CLIP similarity up to 0.958 and SBERT textual alignment up to 0.751, better than all baselines; on in-the-wild images from unknown generators it outperforms the strongest baseline by 7.5%; and it retains good performance under defensive perturbations. Conclusion: PROMPTMINER is an efficient and robust black-box prompt stealing framework that accurately recovers the input prompts of text-to-image models without gradients and with few priors. It exposes potential weaknesses of current generative models in privacy and copyright protection, and also provides a tool for data attribution and watermark validation. Abstract: Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines.
Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: https://github.com/aaFrostnova/PromptMiner
[134] GoPrune: Accelerated Structured Pruning with $\ell_{2,p}$-Norm Optimization
Li Xu,Xianchao Xiu
Main category: cs.CV
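TL;DR: Proposes GoPrune, an accelerated structured pruning method based on the $\ell_{2,p}$-norm. By extending the range of $p$ and pairing it with an efficient optimization algorithm, it achieves effective network compression and inference acceleration on ResNet and VGG models.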
Details
Motivation: Deploying deep convolutional neural networks on edge devices is limited by storage and computational cost. Existing $\ell_p$-norm pruning methods mostly target unstructured pruning and are inefficient, leaving no efficient structured pruning scheme. Method: Proposes GoPrune, which uses the $\ell_{2,p}$-norm ($p \in [0, 1)$) for sparse learning and solves the problem with the proximal alternating minimization (PAM) algorithm; the subproblems have closed-form solutions, which improves optimization and compression efficiency. Result: Experiments on the CIFAR datasets with ResNet and VGG models validate the method, which outperforms existing methods in structured pruning and inference acceleration. Conclusion: By extending the norm's domain and introducing an efficient optimization algorithm, GoPrune markedly improves the efficiency and practicality of structured pruning and suits model compression on resource-constrained devices. Abstract: Convolutional neural networks (CNNs) suffer from rapidly increasing storage and computational costs as their depth grows, which severely hinders their deployment on resource-constrained edge devices. Pruning is a practical approach for network compression, among which structured pruning is the most effective for inference acceleration. Although existing work has applied the $\ell_p$-norm to pruning, it only considers unstructured pruning with $p\in (0, 1)$ and has low computational efficiency. To overcome these limitations, we propose an accelerated structured pruning method called GoPrune. Our method employs the $\ell_{2,p}$-norm for sparse network learning, where the value of $p$ is extended to $[0, 1)$. Moreover, we develop an efficient optimization algorithm based on the proximal alternating minimization (PAM), and the resulting subproblems enjoy closed-form solutions, thus improving compression efficiency. Experiments on the CIFAR datasets using ResNet and VGG models demonstrate the superior performance of the proposed method in network pruning. Our code is available at https://github.com/xianchaoxiu/GoPrune.
[135] Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation
Xiang Li,Zirui Wang,Zixuan Huang,James M. Rehg
Main category: cs.CV
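TL;DR: This paper introduces Cue3D, the first model-agnostic framework for quantifying the influence of individual visual cues in single-image 3D generation. Systematic perturbation analysis reveals how much different deep models depend on cues such as shading, texture, and silhouette, and finds that geometric cues (especially shading) are crucial for 3D generation, while shape meaningfulness rather than texture determines generalization.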
Details
Motivation: Although deep generative models have advanced single-image 3D generation, it remains unclear which specific image cues they rely on, and there is no unified, interpretable evaluation framework for analyzing how models use classical monocular cues. Method: Proposes the Cue3D framework, which systematically perturbs six kinds of monocular cues in the input image (shading, texture, silhouette, perspective, edges, and local continuity) and measures how the output quality of seven state-of-the-art single-image 3D generation methods changes under perturbation, thereby quantifying the importance of each cue. Result: Geometric cues, particularly shading, matter most for 3D generation; models commonly over-rely on the provided silhouette; different model families show different sensitivities to cues such as perspective and local continuity; and shape meaningfulness, not texture, is the main factor behind generalization. Conclusion: Cue3D offers an interpretability tool for understanding how modern 3D generative models use classical visual cues, exposes the strengths and limitations of current methods, and points toward more transparent, robust, and controllable 3D generation models. Abstract: Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.
[136] GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models
Bin Wang,Ruotong Hu,Wenqian Wang,Wentong Li,Mingliang Gao,Runmin Cong,Wei Zhang
Main category: cs.CV
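TL;DR: Proposes a plug-and-play coupling prompt learning framework that introduces an externally supervised prompt to mitigate the semantic space narrowing that vision-language models suffer when fine-tuned on video tasks, improving generalization to unseen classes.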
Details
Motivation: Existing soft prompt tuning methods harm generalization to unseen classes on video tasks, and regularization-based remedies weaken the learning ability of the soft prompts. A prompt learning mechanism is needed that preserves generalization while strengthening adaptability. Method: Designs a coupling prompt learning framework: pre-trained hard prompts from other datasets are introduced into the textual prompt and coupled with the soft prompts through a learnable mapping layer; in addition, irrelevant video sets and negative prompts serve as generic attribute anchors that maintain the generic relevance of attributes in the pre-trained semantic space. Result: In experiments on several video tasks, the method significantly outperforms state-of-the-art prompt tuning approaches on generalization benchmarks, particularly on base-to-new class prediction. Conclusion: The proposed framework effectively mitigates semantic space narrowing during fine-tuning and improves generalization without sacrificing learning ability, while remaining plug-and-play. Abstract: Visual and textual soft prompt tuning can effectively improve the adaptability of Vision-Language Models (VLMs) in downstream tasks. However, fine-tuning on video tasks impairs the model's generalization ability to unseen classes. Existing methods attempt to mitigate this forgetting effect by regularizing the gap between hand-crafted prompts and soft prompts, but this also weakens the learning ability of soft prompts. To address this challenge, we propose a plug-and-play coupling prompt learning framework to optimize the generalization performance of V-L models in video tasks, with the core motivation of mitigating semantic space narrowing during fine-tuning by introducing an externally supervised prompt. Specifically, for textual prompts, we introduce pre-trained prompts from other datasets as hard prompt tokens. These are concatenated with soft prompt tokens and coupled via a learnable mapping layer. This competitive prompting approach prevents the semantic space from overfitting to supervised categories. In addition, we introduce a set of well-designed irrelevant video sets and negative prompts as generic attribute anchors to maintain the generic relevance of the attributes in the pre-trained semantic space, thus preserving the generalization ability. Experiments on video tasks demonstrate that our method significantly outperforms state-of-the-art prompt tuning approaches across generalization benchmarks, particularly on base-to-new class prediction.
[137] Autonomous labeling of surgical resection margins using a foundation model
Xilin Yang,Musa Aydin,Yuhong Lu,Sahan Yoruc Selcuk,Bijie Bai,Yijie Zhang,Andrew Birkeland,Katjana Ehrlich,Julien Bec,Laura Marcu,Nir Pillar,Aydogan Ozcan
Main category: cs.CV
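TL;DR: Proposes a virtual inking network (VIN) that uses a frozen foundation model and a compact multilayer perceptron to automatically localize surgical margins on whole-slide images, reducing reliance on physical inking and standardizing margin assessment.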
Details
Motivation: Current surgical margin assessment depends on physical inking, which is applied inconsistently and is confounded by cautery artifacts, affecting the accuracy and reproducibility of pathological evaluation. Method: Uses a frozen foundation model for feature extraction together with a two-layer multilayer perceptron for patch-level classification of cautery-consistent features; the dataset comprises 120 H&E-stained slides (about 2 TB of data) with boundaries annotated by a pathologist. Result: On 20 slides from previously unseen blocks, VIN's margin overlays agree qualitatively with expert annotations; region-level accuracy is about 73.3%, with errors confined to limited areas that do not disrupt the continuity of the overall margin. Conclusion: VIN captures cautery-related histomorphology and provides reproducible, ink-free margin delineation suitable for routine digital pathology workflows and downstream margin distance measurement. Abstract: Assessing resection margins is central to pathological specimen evaluation and has profound implications for patient outcomes. Current practice employs physical inking, which is applied variably, and cautery artifacts can obscure the true margin on histological sections. We present a virtual inking network (VIN) that autonomously localizes the surgical cut surface on whole-slide images, reducing reliance on inks and standardizing margin-focused review. VIN uses a frozen foundation model as the feature extractor and a compact two-layer multilayer perceptron trained for patch-level classification of cautery-consistent features. The dataset comprised 120 hematoxylin and eosin (H&E) stained slides from 12 human tonsil tissue blocks, resulting in ~2 TB of uncompressed raw image data, where a board-certified pathologist provided boundary annotations. In blind testing with 20 slides from previously unseen blocks, VIN produced coherent margin overlays that qualitatively aligned with expert annotations across serial sections. Quantitatively, region-level accuracy was ~73.3% across the test set, with errors largely confined to limited areas that did not disrupt continuity of the whole-slide margin map. These results indicate that VIN captures cautery-related histomorphology and can provide a reproducible, ink-free margin delineation suitable for integration into routine digital pathology workflows and for downstream measurement of margin distances.
[138] DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Zhen Fang,Zhuoyang Liu,Jiaming Liu,Hao Chen,Yu Zeng,Shiting Huang,Zehui Chen,Lin Chen,Shanghang Zhang,Feng Zhao
Main category: cs.CV
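TL;DR: This paper proposes DualVLA, which uses dual-layer data pruning and a dual-teacher adaptive distillation strategy to improve the action performance of Vision-Language-Action (VLA) models while preserving reasoning ability, mitigating the action degeneration that generalist VLA models show after fine-tuning. It also proposes a new metric, VLA Score, for fine-grained evaluation.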
Details
Motivation: When generalist VLA models incorporate multimodal data to strengthen reasoning, fine-tuning often degrades action execution, a problem termed "action degeneration". Effective methods for improving reasoning while keeping actions accurate are lacking. Method: Proposes DualVLA: 1) a dual-layer data pruning method that removes redundant embodied reasoning data so it does not interfere with action learning; 2) a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains, balancing action generation and reasoning; 3) the VLA Score evaluation framework, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for fine-grained assessment. Result: DualVLA reaches an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight multimodal benchmarks, significantly outperforming baselines and achieving a better balance between action execution and multimodal understanding. Conclusion: DualVLA effectively mitigates action degeneration in generalist VLA models and improves action performance without sacrificing reasoning; together with the VLA Score, it provides a training strategy and an evaluation standard for future VLA models. Abstract: To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding.
Project Website: https://costaliya.github.io/DualVLA/.
[139] EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation
Yanchao Zhao,Jihao Zhu,Yu Liu,Weizhuo Chen,Yuling Yang,Kun Peng
Main category: cs.CV
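TL;DR: Proposes EASL, a multi-emotion-guided text-to-sign-language video generation framework that uses emotion-semantic disentanglement modules to produce more expressive sign language.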
Details
Motivation: Existing LLM-based sign language generation methods overlook emotional expression, so their outputs lack naturalness and expressiveness. Method: Designs emotion-semantic disentanglement modules trained progressively to separate semantic and affective features; during pose decoding, emotional representations guide semantic interaction, and the model outputs 7-class emotion confidence scores. Result: EASL achieves higher pose accuracy than all baselines, integrates multi-emotion information effectively, and adapts to diffusion models to generate expressive sign language videos. Conclusion: Through fine-grained emotional integration, EASL markedly improves the naturalness and intelligibility of generated sign language and offers a new direction for emotion-aware human-computer interaction for the Deaf community. Abstract: Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness. We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition. Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.
[140] SemOD: Semantic Enabled Object Detection Network under Various Weather Conditions
Aiyinsi Zuo,Zhaoliang Zheng
Main category: cs.CV
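TL;DR: This paper proposes a semantics-based network architecture for object detection under complex weather conditions, in which a preprocessing unit and a detection unit work together to significantly improve detection performance on multiple weather datasets.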
Details
Motivation: Existing camera-based perception models are trained mostly on clear-weather data and adapt poorly to varied adverse weather, while existing weather removal methods focus on image restoration and neglect support for the downstream detection task. Method: Proposes a semantic-enhanced network consisting of a Preprocessing Unit (PPU) and a Detection Unit (DTU): the PPU uses a semantics-enriched U-shaped network to restore degraded images, and the DTU uses a modified YOLO network that integrates semantic information for object detection. Result: On benchmark datasets covering different weather conditions, the method improves mAP by 1.47% to 8.80% over existing techniques, confirming the value of semantic information for image enhancement and object detection. Conclusion: Semantic information effectively supports the joint optimization of image restoration and object detection, and the proposed method offers a new solution for all-weather autonomous driving perception. Abstract: In the field of autonomous driving, camera-based perception models are mostly trained on clear weather data. Models that focus on addressing specific weather challenges are unable to adapt to various weather changes and primarily prioritize their weather removal characteristics. Our study introduces a semantic-enabled network for object detection in diverse weather conditions. In our analysis, semantic information can enable the model to generate plausible content for missing areas, understand object boundaries, and preserve visual coherency and realism across both filled-in and existing portions of the image, which are conducive to image transformation and object recognition. Specifically, our architecture consists of a Preprocessing Unit (PPU) and a Detection Unit (DTU), where the PPU utilizes a U-shaped net enriched by semantics to refine degraded images, and the DTU integrates this semantic information for object detection using a modified YOLO network. Our method pioneers the use of semantic data for all-weather transformations, resulting in an mAP increase of 1.47% to 8.80% compared to existing methods across benchmark datasets of different weather. This highlights the potency of semantics in image enhancement and object detection, offering a comprehensive approach to improving object detection performance. Code will be available at https://github.com/EnisZuo/SemOD.
[141] Stacked Ensemble of Fine-Tuned CNNs for Knee Osteoarthritis Severity Grading
Adarsh Gupta,Japleen Kaur,Tanvi Doshi,Teena Sharma,Nishchal K. Verma,Shantaram Vasikarla
Main category: cs.CV
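TL;DR: Proposes a stacked ensemble of convolutional neural networks for automatic detection and grading of knee osteoarthritis (KOA), outperforming existing methods on both the binary and the multiclass task.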
Details
Motivation: Conventional KOA assessment based on X-ray images and the Kellgren-Lawrence grading system depends on expert experience, is time-consuming, and is prone to subjective judgment, which reduces diagnostic accuracy. Method: Builds a stacked ensemble with several pre-trained CNN models (MobileNetV2, YOLOv8, DenseNet201) as base learners and CatBoost as the meta-learner, covering binary KOA detection and multi-level grading. Result: The model reaches a balanced test accuracy of 73% on the multiclass task and an accuracy of 87.5% on the binary task, better than methods in the existing literature. Conclusion: The proposed stacked ensemble improves the accuracy and robustness of automatic KOA grading and has potential as an aid to clinical diagnosis. Abstract: Knee Osteoarthritis (KOA) is a musculoskeletal condition that can cause significant limitations and impairments in daily activities, especially among older individuals. To evaluate the severity of KOA, typically, X-ray images of the affected knee are analyzed, and a grade is assigned based on the Kellgren-Lawrence (KL) grading system, which classifies KOA severity into five levels, ranging from 0 to 4. This approach requires a high level of expertise and time and is susceptible to subjective interpretation, thereby introducing potential diagnostic inaccuracies. To address this problem, a stacked ensemble model of fine-tuned Convolutional Neural Networks (CNNs) was developed for two classification tasks: a binary classifier for detecting the presence of KOA, and a multiclass classifier for precise grading across the KL spectrum. The proposed stacked ensemble model consists of a diverse set of pre-trained architectures, including MobileNetV2, You Only Look Once (YOLOv8), and DenseNet201 as base learners and Categorical Boosting (CatBoost) as the meta-learner. This proposed model had a balanced test accuracy of 73% in multiclass classification and 87.5% in binary classification, which is higher than previous works in extant literature.
[142] RemedyGS: Defend 3D Gaussian Splatting against Computation Cost Attacks
Yanping Li,Zhening Liu,Zijian Li,Zehong Lin,Jun Zhang
Main category: cs.CV
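TL;DR: This paper proposes RemedyGS, the first effective and comprehensive black-box defense framework against computation cost attacks on 3D Gaussian splatting (3DGS). It consists of a detector and a purifier, uses adversarial training to strengthen the defense, and achieves state-of-the-art results in both safety and utility.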
Details
Motivation: Recent studies have revealed critical vulnerabilities in the 3DGS pipeline: computation cost attacks can cause malicious resource occupancy and even denial of service (DoS), hindering reliable deployment, so an effective defense is needed. Method: Proposes the RemedyGS framework with two core components: a detector that identifies attacked input images carrying poisoned textures, and a purifier that recovers benign images from attacked ones; adversarial training is added to the purifier so that recovered images align with the distribution of the original natural images. Result: Experiments show that the framework defends effectively against white-box, black-box, and adaptive attacks, substantially reducing their impact. Conclusion: RemedyGS is the first comprehensive black-box defense against computation cost attacks on 3DGS, striking a good balance between safety and reconstruction quality and supporting the reliable deployment of 3DGS systems. Abstract: As a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier to enforce distributional alignment between the recovered and original natural images, thereby enhancing the defense efficacy. Experimental results demonstrate that our framework effectively defends against white-box, black-box, and adaptive attacks in 3DGS systems, achieving state-of-the-art performance in both safety and utility.
[143] IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer
Bo Chen,Tao Liu,Qi Chen,Xie Chen,Zilong Zheng
Main category: cs.CV
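TL;DR: IMTalker is a high-fidelity talking face generation method based on implicit motion transfer. Using a cross-attention mechanism and an identity-adaptive module, it achieves accurate global motion modeling and identity preservation while remaining efficient.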
Details
Motivation: Existing methods rely on explicit optical flow and local warping, which struggle to model complex global motion and tend to cause identity drift, so a more robust approach is needed to improve generation quality. Method: Uses a cross-attention mechanism to implicitly model motion discrepancy and identity alignment within a unified latent space, introduces an identity-adaptive module to disentangle motion from identity, and employs a lightweight flow-matching motion generator that produces implicit motion vectors from audio, pose, and gaze cues. Result: Runs at 40 FPS in video-driven mode and 42 FPS in audio-driven mode; experiments show it surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, reaching state-of-the-art generation quality. Conclusion: Through implicit motion transfer, IMTalker addresses the identity drift and global motion modeling problems of traditional methods, combining high efficiency with high fidelity. Abstract: Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.
[144] Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization
Inha Kang,Eunki Kim,Wonjeong Ryu,Jaeyo Shin,Seungjun Yu,Yoon-Hee Kang,Seongeun Jeong,Eunhye Kim,Soontae Kim,Hyunjung Shim
Main category: cs.CV
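TL;DR: This paper proposes a new framework for long-horizon forecasting of PM concentration fields over East Asia. It builds CMAQ-OBS, a high-resolution real-time dataset, and introduces Group-Relative Policy Optimization (GRPO) guided by operational priorities, substantially lowering the false alarm rate and making air quality forecasting systems more practical.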
Details
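Motivation: Existing global models struggle to capture region-specific dynamics in areas with complex terrain and strong atmospheric dynamics such as East Asia, and they depend on non-real-time inputs, so they cannot meet the timeliness and reliability needs of local warning systems. Standard point-wise loss functions also fail to reflect the asymmetric operational costs of false alarms and missed events, which leads models to over-predict and produce high false alarm rates. Method: Constructs and releases CMAQ-OBS, a high-resolution dataset with real-time observations for East Asia; proposes Group-Relative Policy Optimization (GRPO) with class-wise rewards and a curriculum rollout strategy, assigning different rewards to different prediction outcomes during training to align with operational goals. Result: Compared with the SFT-only baseline, the method reduces the False Alarm Rate by 47.3% while keeping a competitive F1-score, and it markedly improves forecast reliability at 48 to 120 hour horizons. Conclusion: Combined with a high-quality regional dataset, the proposed GRPO framework addresses the false alarm problem in long-horizon air quality forecasting, improves the practicality and trustworthiness of the model in real public health warning systems, and provides a more reliable forecasting tool for high-impact regions. Abstract: Accurate long horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non-real-time inputs, limiting their practical utility for localized warning systems. To address this gap, we construct and release the real-world observations and high-resolution CMAQ-OBS dataset for East Asia, reducing regional error by 59.5% and enabling real-time 48-120 hour forecasts critical for public health alerts. However, standard point-wise objectives cannot reflect asymmetric operational costs, where false alarms deteriorate public trust while missed severe events endanger populations. This cost mismatch causes SFT models to over-predict and yield high False Alarm Rates. We introduce Group-Relative Policy Optimization (GRPO) with class-wise rewards and curriculum rollout to align predictions with operational priorities. Experimental results demonstrate that our framework significantly improves the reliability of the forecast.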
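The abstract names class-wise rewards and group-relative optimization but gives no formulas. The sketch below illustrates the general idea under stated assumptions: the reward table values are hypothetical, the helper names are illustrative, and the advantage is the usual group-relative normalization (each rollout's reward standardized against the other rollouts sampled for the same input), here with the population standard deviation, which implementations may choose differently.

```python
import statistics

# Hypothetical reward table: (predicted_alert, observed_alert) -> reward.
# The values are illustrative assumptions, not taken from the paper.
REWARDS = {
    (True, True): 1.0,    # hit
    (False, False): 0.2,  # correct rejection
    (True, False): -1.0,  # false alarm
    (False, True): -2.0,  # missed event
}

def rollout_reward(predicted, observed):
    """Mean class-wise reward over the cells of one sampled forecast."""
    pairs = list(zip(predicted, observed))
    return sum(REWARDS[pair] for pair in pairs) / len(pairs)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Two sampled forecasts for the same observed field: one cautious, one over-alerting.
observed = [True, False, False]
rollouts = [[True, False, False], [True, True, True]]
rewards = [rollout_reward(r, observed) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

With an asymmetric table like this, the over-alerting rollout receives a negative advantage relative to its group, which is the kind of signal that would discourage over-prediction.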
Compared to the SFT-only baseline, our model reduces the False Alarm Rate by 47.3% while achieving a competitive F1-score, proving its effectiveness for practical, real-world air quality forecasting systems on long lead time scenarios.
[145] Partially Shared Concept Bottleneck Models
Delong Zhao,Qiang Huang,Di Yan,Yiqun Sun,Jun Yu
Main category: cs.CV
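TL;DR: This paper proposes PS-CBM, a partially shared concept bottleneck model framework. With multimodal concept generation, a partial sharing strategy, and the Concept-Efficient Accuracy (CEA) metric, it addresses the weaknesses of existing CBMs in visual grounding, concept redundancy, and evaluation, achieving higher classification accuracy and stronger interpretability on multiple datasets.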
Details
Motivation: Existing methods that generate concepts automatically with large language models and vision-language models suffer from poor visual grounding, concept redundancy, and the lack of a metric that balances predictive accuracy against concept compactness, which limits the interpretability and performance of concept bottleneck models. Method: Proposes the PS-CBM framework: 1) a multimodal concept generator combining LLM semantics with exemplar-based visual cues; 2) a partial sharing strategy that merges concepts based on activation patterns to balance specificity and compactness; 3) Concept-Efficient Accuracy (CEA), a post-hoc metric that accounts for both predictive accuracy and concept compactness. Result: Experiments on 11 diverse datasets show that PS-CBM improves classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5% over state-of-the-art methods while using fewer concepts. Conclusion: PS-CBM addresses key challenges in automatic concept generation, improving interpretability while clearly strengthening predictive performance, and demonstrates a good balance between high accuracy and strong interpretability. Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by introducing a layer of human-understandable concepts between inputs and predictions. While recent methods automate concept generation using Large Language Models (LLMs) and Vision-Language Models (VLMs), they still face three fundamental challenges: poor visual grounding, concept redundancy, and the absence of principled metrics to balance predictive accuracy and concept compactness. We introduce PS-CBM, a Partially Shared CBM framework that addresses these limitations through three core components: (1) a multimodal concept generator that integrates LLM-derived semantics with exemplar-based visual cues; (2) a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and (3) Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures both predictive accuracy and concept compactness. Extensive experiments on eleven diverse datasets show that PS-CBM consistently outperforms state-of-the-art CBMs, improving classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5%, while requiring significantly fewer concepts. These results underscore PS-CBM's effectiveness in achieving both high accuracy and strong interpretability.
[146] BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch
Pu Li,Wenhao Zhang,Weize Quan,Biao Zhang,Peter Wonka,Dong-Ming Yan
Main category: cs.CV
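TL;DR: Proposes BrepGPT, a single-stage autoregressive framework for generating boundary representation (B-rep) CAD models. The Voronoi Half-Patch representation unifies geometry and topology, and dual VQ-VAEs with a decoder-only Transformer enable efficient, compact sequential generation.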
Details
Motivation: Because geometry and topology are tightly coupled in B-reps, existing generative methods rely on cascaded multi-stage networks, which leads to error accumulation and low computational efficiency; a unified and efficient single-stage framework is needed. Method: Proposes the Voronoi Half-Patch (VHP) representation, which decomposes a B-rep into local units centered on half-edges and encodes geometric attributes and topological relations in one format; dual VQ-VAEs encode vertex topology and VHPs into vertex-level tokens, which a decoder-only Transformer predicts autoregressively before they are decoded into a complete B-rep model. Result: Achieves state-of-the-art performance in unconditional B-rep generation and supports conditional generation from category labels, point clouds, text, and images, as well as B-rep autocompletion and interpolation, demonstrating the framework's generality and practicality. Conclusion: With the VHP representation and compact tokenization, BrepGPT enables efficient, unified single-stage B-rep generation, clearly improving generation quality and flexibility and offering a new paradigm for CAD model generation. Abstract: Boundary representation (B-rep) is the de facto standard for CAD model representation in modern industrial design. The intricate coupling between geometric and topological elements in B-rep structures has forced existing generative methods to rely on cascaded multi-stage networks, resulting in error accumulation and computational inefficiency. We present BrepGPT, a single-stage autoregressive framework for B-rep generation. Our key innovation lies in the Voronoi Half-Patch (VHP) representation, which decomposes B-reps into unified local units by assigning geometry to nearest half-edges and sampling their next pointers. Unlike hierarchical representations that require multiple distinct encodings for different structural levels, our VHP representation facilitates unifying geometric attributes and topological relations in a single, coherent format. We further leverage dual VQ-VAEs to encode both vertex topology and Voronoi Half-Patches into vertex-based tokens, achieving a more compact sequential encoding. A decoder-only Transformer is then trained to autoregressively predict these tokens, which are subsequently mapped to vertex-based features and decoded into complete B-rep models. Experiments demonstrate that BrepGPT achieves state-of-the-art performance in unconditional B-rep generation.
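The tokenization step rests on vector quantization: a continuous feature is replaced by the index of its nearest codebook entry, and that index becomes the discrete token an autoregressive model predicts. A minimal sketch of the lookup follows. The toy codebook and the function name `quantize` are hypothetical; in a VQ-VAE the codebook is learned jointly with the encoder and decoder, and nothing here reflects BrepGPT's actual codebook size or feature layout.

```python
def quantize(feature, codebook):
    """Return (index, codeword) of the codebook entry nearest to `feature`
    under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))
    return index, codebook[index]

# Toy 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
token, codeword = quantize([0.9, 0.2], codebook)
```

Decoding reverses the lookup: each predicted token is mapped back to its codeword before the decoder reconstructs the full feature.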
The framework also exhibits versatility in various applications, including conditional generation from category labels, point clouds, text descriptions, and images, as well as B-rep autocompletion and interpolation.
[147] Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning
Zhaoyang Wei,Wenchao Ding,Yanchao Hao,Xi Chen
Main category: cs.CV
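TL;DR: This paper proposes GRiP, a two-stage training framework that strengthens the visual grounded reasoning of multimodal models, using cognitively inspired reward mechanisms to significantly improve performance on complex visual reasoning tasks.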
Details
Motivation: Existing methods are caught between the instability of end-to-end reinforcement learning and the rigidity of supervised fine-tuning, leaving models without the cognitive flexibility needed for complex real-world scenes. Method: Proposes the GRiP framework with two key innovations: a Salience-Weighted IoU Reward that focuses the model on localizing key objects, and a Multi-Heuristic Reward that encourages diverse yet logically valid reasoning pathways. Result: Built on Qwen2.5-VL-7B, it achieves state-of-the-art results among open-source models on several challenging benchmarks, including TreeBench and V* Bench. Conclusion: Guiding models with cognitively inspired signals, rather than simplistic rewards, effectively improves the visual reasoning ability and flexibility of multimodal models. Abstract: Models capable of "thinking with images" by dynamically grounding their reasoning in visual evidence represent a major leap in multimodal AI. However, replicating and advancing this ability is non-trivial, with current methods often trapped between the instability of end-to-end reinforcement learning (RL) and the rigidity of supervised fine-tuning (SFT). This leads to models that either struggle to learn or lack the cognitive flexibility required for complex, real-world scenes. To navigate this dilemma, we introduce GRiP (Guided Reasoning and Perception), a novel two-stage training framework that cultivates robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. GRiP's core lies in its cognitive-enhanced RL stage, which features two key innovations: (1) a Salience-Weighted IoU Reward that incentivizes the model to prioritize the localization of mission-critical objects over trivial distractors, and (2) a Multi-Heuristic Reward that encourages cognitive flexibility by rewarding diverse yet logically valid reasoning pathways. Initialized from the Qwen2.5-VL-7B model, GRiP demonstrates significant performance gains across multiple challenging benchmarks. It achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench, proving its effectiveness in complex visual reasoning. Our work demonstrates that moving beyond simplistic rewards and instead guiding models with cognitively-inspired signals for what to see and how to think is crucial for unlocking the next level of multimodal intelligence.
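The abstract does not define the Salience-Weighted IoU Reward. One plausible reading, offered purely as an assumption, is a per-object IoU averaged with salience weights so that mission-critical objects dominate the score. In the sketch below, `box_iou` is the standard intersection-over-union for axis-aligned boxes, while `salience_weighted_iou`, its pairing of predictions with ground truth, and the weights are all hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def salience_weighted_iou(pred_boxes, gt_boxes, salience):
    """Salience-weighted mean IoU over already-matched box pairs.

    Hypothetical form: the paper's actual reward is not given in the abstract.
    """
    total = sum(salience)
    if total <= 0:
        raise ValueError("salience weights must sum to a positive value")
    weighted = sum(w * box_iou(p, g) for p, g, w in zip(pred_boxes, gt_boxes, salience))
    return weighted / total
```

Under this reading, missing a low-salience distractor costs little, while missing the key object drives the reward toward zero.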
The code will be made publicly available.[148] Enhanced Graph Convolutional Network with Chebyshev Spectral Graph and Graph Attention for Autism Spectrum Disorder Classification
Adnan Ferdous Ashrafi,Hasanul Kabir
Main category: cs.CV
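TL;DR: This paper proposes a graph convolutional network that combines Chebyshev spectral graph convolution with graph attention networks, using multimodal neuroimaging and phenotypic data to improve classification accuracy for autism spectrum disorder (ASD).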
Details
Motivation: ASD is highly heterogeneous in symptom presentation and neurological basis, which makes early, objective diagnosis difficult, and existing methods still fall short in accuracy and robustness. Method: A multi-branch GCN architecture processes rs-fMRI, sMRI, and phenotypic data separately; a population graph is built from site-based similarity; Chebyshev polynomial filters perform localized spectral learning; and GAT layers provide attention-weighted aggregation. The model is trained on the ABIDE I dataset with stratified five-fold cross-validation. Result: The model reaches 74.82% test accuracy and an AUC of 0.82 on the full dataset, outperforming several existing state-of-the-art methods, including conventional GCNs, autoencoder-based deep networks, and multimodal CNNs. Conclusion: The proposed method effectively fuses multimodal data and models relationships between individuals, significantly improving ASD classification performance, with potential as a clinical diagnostic aid. Abstract: Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder marked by variation in symptom presentation and neurological underpinnings, which makes early and objective diagnosis extremely difficult. This paper presents a Graph Convolutional Network (GCN) model, incorporating Chebyshev Spectral Graph Convolution and Graph Attention Networks (GAT), to increase the classification accuracy of ASD using multimodal neuroimaging and phenotypic data. Leveraging the ABIDE I dataset, which contains resting-state functional MRI (rs-fMRI), structural MRI (sMRI), and phenotypic variables from 870 patients, the model uses a multi-branch architecture that processes each modality individually before merging them via concatenation. Graph structure is encoded using site-based similarity to generate a population graph, which captures relationships across individuals. Chebyshev polynomial filters provide localized spectral learning with lower computational complexity, whereas GAT layers enhance node representations by attention-weighted aggregation of neighboring information. The proposed model is trained using stratified five-fold cross-validation with a total input dimension of 5,206 features per individual. Extensive experiments demonstrate the enhanced model's superiority, achieving a test accuracy of 74.82% and an AUC of 0.82 on the entire dataset, surpassing multiple state-of-the-art baselines, including conventional GCNs, autoencoder-based deep neural networks, and multimodal CNNs.[149] MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction
Maitrayee Keskar,Mohan Trivedi,Ross Greer
Main category: cs.CV
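TL;DR: This paper proposes MTR-VP, a vision-based trajectory planning method for autonomous driving. A ViT encoder learns context embeddings from images and past motion states to replace the map features in MTR, and cross-attention fuses intent with visual context; experiments on the Waymo dataset show that multimodal prediction improves planning performance.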
Details
Motivation: 旨在不依赖高精地图的情况下实现高效轨迹规划,通过学习图像-based上下文嵌入来替代传统依赖地图的方法,增强视觉与运动预测的融合能力。 Method: 采用ViT编码器输入原始图像和历史运动状态,训练生成与MTR相似的上下文嵌入;使用交叉注意力机制结合意图和上下文嵌入,取代MTR中的可学习查询;基于Waymo数据集进行5秒轨迹预测。 Result: 实验表明,尽管引入CLIP和DINOv2等基础模型增强意图表示,Transformer融合视觉与运动特征仍难以有效建模场景上下文;但预测多个未来轨迹分布相比单一轨迹显著提升了规划性能。 Conclusion: 视觉特征与运动特征的有效融合仍是挑战,但多模态未来预测是提升基于视觉规划性能的关键,为无地图依赖的自动驾驶提供了可行方向。 Abstract: We present a method for trajectory planning for autonomous driving, learning image-based context embeddings that align with motion prediction frameworks and planning-based intention input. Within our method, a ViT encoder takes raw images and past kinematic state as input and is trained to produce context embeddings, drawing inspiration from those generated by the recent MTR (Motion Transformer) encoder, effectively substituting map-based features with learned visual representations. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and refining motion iteratively via motion query pairs; we name our approach MTR-VP (Motion Transformer for Vision-based Planning), and instead of the learnable intention queries used in the MTR decoder, we use cross attention on the intent and the context embeddings, which reflect a combination of information encoded from the driving scene and past vehicle states. We evaluate our methods on the Waymo End-to-End Driving Dataset, which requires predicting the agent's future 5-second trajectory in bird's-eye-view coordinates using prior camera images, agent pose history, and routing goals. We analyze our architecture using ablation studies, removing input images and multiple trajectory output. 
Our results suggest that transformer-based methods for combining visual features with kinematic features, such as past-trajectory features, are not effective at fusing the two modalities into useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2. Predicting a distribution over multiple futures instead of a single future trajectory, however, boosts planning performance.[150] Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
Daniel Sungho Jung,Kyoung Mu Lee
Main category: cs.CV
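TL;DR: Proposes FECO, a framework for dense foot contact estimation from a single RGB image, which uses shoe style-invariant and ground-aware learning to overcome the diversity of shoe appearance and the monotony of ground features.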
Details
Motivation: 现有方法通常使用零速度约束近似足部接触,且仅关注关节层面的接触,无法捕捉足部与环境之间的详细交互;此外,从单张RGB图像中预测密集足部接触仍研究较少。 Method: 提出了FECO框架,引入鞋型对抗训练以获得对不同鞋型鲁棒的特征,并设计地面特征提取器利用空间上下文捕获地面属性,实现鞋型不变且地面感知的密集足部接触估计。 Result: 该方法在密集足部接触估计上表现出更强的鲁棒性,能够有效应对不同鞋型变化并充分利用地面信息。 Conclusion: FECO框架显著提升了从单张图像中估计密集足部接触的性能,为理解人与物理环境的交互提供了更精细的建模手段。 Abstract: Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. There are two main challenges for learning dense foot contact estimation. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.[151] HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving
Qiang Li,Yingwenqi Jiang,Tuoxi Li,Duyu Chen,Xiang Feng,Yucheng Ao,Shangyue Liu,Xingchen Yu,Youcheng Cai,Yumeng Liu,Yuexin Ma,Xin Hu,Li Liu,Yu Zhang,Linkun Xu,Bingtao Gao,Xueyuan Wang,Shuchang Zhou,Xianming Liu,Ligang Liu
Main category: cs.CV
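TL;DR: This paper proposes HybridWorldSim, a hybrid simulation framework that combines multi-traversal neural reconstruction with generative models for high-fidelity, controllable autonomous-driving scene simulation, and releases a new multi-traversal dataset, MIRROR.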
Details
Motivation: 现有自动驾驶仿真方法在大视角变化下的新视图合成和几何一致性方面存在不足,难以满足真实可控模拟的需求。 Method: 提出HybridWorldSim框架,将多遍历神经重建用于静态背景,生成模型用于动态智能体,实现视觉与空间一致性的统一建模。同时发布大规模多城市多条件数据集MIRROR。 Result: 实验表明,HybridWorldSim在新视图合成质量和几何一致性方面优于现有最先进方法,能生成多样化且高保真的驾驶场景。 Conclusion: HybridWorldSim为自动驾驶提供了实用、可扩展的高保真仿真解决方案,MIRROR数据集为相关研究提供了重要资源。 Abstract: Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.[152] ARPGNet: Appearance- and Relation-aware Parallel Graph Attention Fusion Network for Facial Expression Recognition
Yan Li,Yong Zhao,Xiaohan Xia,Dongmei Jiang
Main category: cs.CV
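TL;DR: Proposes an Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) that learns mutually enhanced spatial-temporal representations of appearance and relation information, improving facial expression recognition performance.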
Details
Motivation: 现有方法主要依赖预训练CNN学习面部外观表示,忽略了面部区域之间的关系,难以充分建模表情动态。 Method: 构建面部区域关系图,利用图注意力机制建模区域间关系,并通过并行图注意力融合模块实现外观与关系表示序列的互补性与时间动态的联合建模。 Result: 在三个面部表情识别数据集上的实验表明,ARPGNet优于或媲美当前最先进的方法。 Conclusion: ARPGNet能有效融合外观与关系信息,通过相互增强的学习机制提升面部表情识别的准确性和鲁棒性。 Abstract: The key to facial expression recognition is to learn discriminative spatial-temporal representations that embed facial expression dynamics. Previous studies predominantly rely on pre-trained Convolutional Neural Networks (CNNs) to learn facial appearance representations, overlooking the relationships between facial regions. To address this issue, this paper presents an Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) to learn mutually enhanced spatial-temporal representations of appearance and relation information. Specifically, we construct a facial region relation graph and leverage the graph attention mechanism to model the relationships between facial regions. The resulting relational representation sequences, along with CNN-based appearance representation sequences, are then fed into a parallel graph attention fusion module for mutual interaction and enhancement. This module simultaneously explores the complementarity between different representation sequences and the temporal dynamics within each sequence. Experimental results on three facial expression recognition datasets demonstrate that the proposed ARPGNet outperforms or is comparable to state-of-the-art methods.[153] Controllable 3D Object Generation with Single Image Prompt
Jaeseok Lee,Jaekoo Lee
Main category: cs.CV
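TL;DR: This paper proposes a 3D object generation method that needs no textual inversion. Using an off-the-shelf image adapter and a depth-conditioned warmup strategy, it improves generation control and 3D consistency, performing comparably to existing methods in qualitative and quantitative experiments and outperforming them in a user study.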
Details
Motivation: Existing 3D object generation methods based on text-to-image diffusion models and textual inversion require additional training time and offer limited control, so a more efficient and controllable method is needed. Method: Two methods are proposed: an off-the-shelf image adapter that generates 3D objects directly, avoiding textual inversion, and a depth-conditioned warmup strategy that enhances 3D consistency. Result: The proposed method is comparable to existing methods in qualitative and quantitative evaluation and better in 3D consistency; a user study shows that it outperforms the compared methods in matching the input image and maintaining 3D consistency. Conclusion: The proposed textual-inversion-free 3D generation framework improves generation efficiency and control while enhancing 3D consistency. Abstract: Recently, the impressive generative capabilities of diffusion models have been demonstrated, producing images with remarkable fidelity. In particular, existing methods for 3D object generation, one of the fastest-growing areas in computer vision, predominantly use text-to-image diffusion models with textual inversion, which trains a pseudo text prompt to describe the given image. In practice, various text-to-image generative models employ textual inversion to learn concepts or styles of a target object in the pseudo text prompt embedding space, thereby generating sophisticated outputs. However, textual inversion requires additional training time and offers limited control. To tackle these issues, we propose two innovative methods: (1) an off-the-shelf image adapter that generates 3D objects without textual inversion, offering enhanced control over conditions such as depth, pose, and text; and (2) a depth-conditioned warmup strategy to enhance 3D consistency. In experiments, our method shows qualitatively and quantitatively comparable performance and improved 3D consistency relative to existing textual-inversion-based alternatives. Furthermore, we conduct a user study to assess (i) how well results match the input image and (ii) whether 3D consistency is maintained. User study results show that our model outperforms the alternatives, validating the effectiveness of our approaches. Our code is available on GitHub: https://github.com/Seooooooogi/Control3D_IP/[154] 3D-Consistent Multi-View Editing by Diffusion Guidance
Josef Bengtson,David Nilsson,Dong In Lee,Fredrik Kahl
Main category: cs.CV
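TL;DR: Proposes a training-free diffusion framework that introduces a consistency loss to strengthen 3D consistency in multi-view image editing. It applies to 3D representations such as NeRFs and Gaussian Splats, supports both dense and sparse multi-view setups, and shows better consistency and editing quality than existing methods in experiments.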
Details
Motivation: 现有的基于扩散模型的文本到图像编辑方法在多视角场景下容易产生几何和光度不一致的问题,尤其影响3D表示(如NeRF或高斯点阵)的编辑效果,因此需要一种能保持多视角一致性的编辑框架。 Method: 提出一种无需训练的扩散框架,引入一致性损失,假设未编辑图像中的对应点在编辑后应经历相似变换,利用该损失引导扩散采样过程以实现一致的编辑结果,且可与多种图像编辑方法结合,适用于稠密和稀疏多视角配置。 Result: 实验表明该方法显著提升了多视角编辑的3D一致性,优于现有方法,并成功实现了高质量的高斯点阵编辑,具有清晰细节和对文本提示的高度保真。 Conclusion: 该框架有效解决了多视角图像编辑中的一致性问题,无需训练即可集成到不同编辑方法中,为3D场景的高质量文本驱动编辑提供了可行方案。 Abstract: Recent advancements in diffusion models have greatly improved text-based image editing, yet methods that edit images independently often produce geometrically and photometrically inconsistent results across different views of the same scene. Such inconsistencies are particularly problematic for editing of 3D representations such as NeRFs or Gaussian Splat models. We propose a training-free diffusion framework that enforces multi-view consistency during the image editing process. The key assumption is that corresponding points in the unedited images should undergo similar transformations after editing. To achieve this, we introduce a consistency loss that guides the diffusion sampling toward coherent edits. The framework is flexible and can be combined with widely varying image editing methods, supporting both dense and sparse multi-view editing setups. Experimental results show that our approach significantly improves 3D consistency compared to existing multi-view editing methods. We also show that this increased consistency enables high-quality Gaussian Splat editing with sharp details and strong fidelity to user-specified text prompts. Please refer to our project page for video results: https://3d-consistent-editing.github.io/[155] From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation
Zhen Chen,Yihang Fu,Gabriel Madera,Mauro Giuffre,Serina Applebaum,Hyunjae Kim,Hua Xu,Qingyu Chen
Main category: cs.CV
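TL;DR: This paper proposes a new framework for training M3LLM, a medical multi-image multimodal large language model, on license-permissive compound figures from biomedical literature. A five-stage, context-aware instruction generation paradigm enables understanding of complex relationships across multiple images, and the model significantly outperforms existing models on multiple tasks.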
Details
Motivation: 现有的多模态大语言模型大多局限于单图像理解,难以满足临床中需要综合多图像(如不同模态或时间点)进行诊断的需求,且缺乏大规模高质量的多图像标注数据。 Method: 设计了一个五阶段、基于分而治之策略的上下文感知指令生成范式,从23.7万张复合图及其文本中解析信息用于训练,并构建了M3LLM模型和PMC-MI-Bench评测基准。 Result: M3LLM在多图像、单图像、纯文本和多选题等场景下均显著优于通用和专用医疗MLLM,且在MIMIC胸部X光纵向分析中表现出强泛化能力。 Conclusion: 该研究建立了可扩展、高效的医疗多图像MLLM开发范式,弥合了生物医学文献与真实临床应用之间的差距。 Abstract: Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. 
Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.[156] IE-SRGS: An Internal-External Knowledge Fusion Framework for High-Fidelity 3D Gaussian Splatting Super-Resolution
Xiang Feng,Tieshi Zhong,Shuo Chang,Weiliu Wang,Chengkai Wang,Yifei Chen,Yuhe Wang,Zhenzhong Kuang,Xuefei Yin,Yanming Zhu
Main category: cs.CV
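TL;DR: Proposes IE-SRGS, a new method for high-resolution 3D Gaussian Splatting reconstruction that combines external 2D super-resolution priors with internal 3DGS features, using a mask-guided fusion strategy to achieve cross-view-consistent, high-fidelity reconstruction.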
Details
Motivation: 从低分辨率输入重建高分辨率3D高斯点阵(3DGS)模型面临纹理和几何细节不足的问题,现有方法因2D超分辨率模型的跨视角不一致性和域间隙导致3D高斯模糊。 Method: 利用预训练的2DSR和深度估计模型生成高分辨率图像和深度图作为外部知识,同时构建多尺度3DGS模型生成跨视角一致、适应域的内部特征;引入掩码引导的融合策略整合两者优势,联合优化3D高斯参数。 Result: 在合成和真实场景基准上实验表明,IE-SRGS在定量精度和视觉质量方面均优于现有最先进方法。 Conclusion: IE-SRGS通过融合外部先验与内部3DGS特征,有效解决了低分辨率输入下的3D高斯模糊问题,实现了高保真、跨视角一致的高分辨率3DGS重建。 Abstract: Reconstructing high-resolution (HR) 3D Gaussian Splatting (3DGS) models from low-resolution (LR) inputs remains challenging due to the lack of fine-grained textures and geometry. Existing methods typically rely on pre-trained 2D super-resolution (2DSR) models to enhance textures, but suffer from 3D Gaussian ambiguity arising from cross-view inconsistencies and domain gaps inherent in 2DSR models. We propose IE-SRGS, a novel 3DGS SR paradigm that addresses this issue by jointly leveraging the complementary strengths of external 2DSR priors and internal 3DGS features. Specifically, we use 2DSR and depth estimation models to generate HR images and depth maps as external knowledge, and employ multi-scale 3DGS models to produce cross-view consistent, domain-adaptive counterparts as internal knowledge. A mask-guided fusion strategy is introduced to integrate these two sources and synergistically exploit their complementary strengths, effectively guiding the 3D Gaussian optimization toward high-fidelity reconstruction. Extensive experiments on both synthetic and real-world benchmarks show that IE-SRGS consistently outperforms state-of-the-art methods in both quantitative accuracy and visual fidelity.[157] Bridging 3D Deep Learning and Curation for Analysis and High-Quality Segmentation in Practice
Simon Püttmann,Jonathan Jair Sànchez Contreras,Lennart Kowitz,Peter Lampen,Saumya Gupta,Davide Panzeri,Nina Hagemann,Qiaojie Xiong,Dirk M. Hermann,Cao Chen,Jianxu Chen
Main category: cs.CV
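TL;DR: VessQC is an open-source tool that uses uncertainty maps to guide users through efficient manual correction of 3D microscopy image segmentations, significantly improving error-detection recall.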
Details
Motivation: 现有的3D显微图像分割模型虽先进但易出错,仍需大量人工校正,缺乏有效引导导致效率低下。 Method: 开发了一个名为VessQC的开源工具,集成不确定性地图以引导用户关注最可能出错的生物相关区域,实现高效的人机协同校正。 Result: 在初步用户研究中,不确定性引导使错误检测召回率从67%显著提升至94.0%(p=0.007),且未增加总体校正时间。 Conclusion: VessQC有效弥合了不确定性估计与实际人机交互之间的差距,为三维分割结果的精细化提供了高效解决方案。 Abstract: Accurate 3D microscopy image segmentation is critical for quantitative bioimage analysis but even state-of-the-art foundation models yield error-prone results. Therefore, manual curation is still widely used for either preparing high-quality training data or fixing errors before analysis. We present VessQC, an open-source tool for uncertainty-guided curation of large 3D microscopy segmentations. By integrating uncertainty maps, VessQC directs user attention to regions most likely containing biologically meaningful errors. In a preliminary user study uncertainty-guided correction significantly improved error detection recall from 67% to 94.0% (p=0.007) without a significant increase in total curation time. VessQC thus enables efficient, human-in-the-loop refinement of volumetric segmentations and bridges a key gap in real-world applications between uncertainty estimation and practical human-computer interaction. The software is freely available at github.com/MMV-Lab/VessQC.[158] Creating Blank Canvas Against AI-enabled Image Forgery
Qi Song,Ziyuan Luo,Renjie Wan
Main category: cs.CV
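TL;DR: This paper proposes a new image tampering detection method based on the Segment Anything Model (SAM): adversarial perturbations make SAM perceive the image as a blank canvas so that forged regions stand out, and a frequency-aware optimization strategy further improves detection performance.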
Details
Motivation: AIGC technology makes image tampering easier and more realistic, creating serious forgery risks, so effective tampering detection methods are needed. Method: Rather than training SAM to identify tampered regions directly, adversarial perturbations are added so that SAM cannot perceive the original image content (it "sees nothing") and treats the image as a blank canvas; when the image is tampered with, SAM notices the anomalous regions. A frequency-aware optimization strategy strengthens the deception of SAM and improves localization accuracy. Result: Experiments show that the proposed method effectively localizes tampered regions, and that, given SAM's strong perceptual ability, adding frequency-aware optimization further improves detection. Conclusion: The proposed SAM-based tampering detection framework, built on adversarial perturbations and frequency-aware optimization, offers a new approach to verifying image authenticity in the AIGC era and has strong practical potential. Abstract: AIGC-based image editing technology has greatly simplified realistic image modification, creating serious risks of image forgery. This paper introduces a new approach to tampering detection using the Segment Anything Model (SAM). Instead of training SAM to identify tampered areas, we propose a novel strategy. The entire image is transformed into a blank canvas from the perspective of neural models. Any modifications to this blank canvas would be noticeable to the models. To realize this idea, we introduce adversarial perturbations to prevent SAM from "seeing anything", allowing it to identify forged regions when the image is tampered with. Due to SAM's powerful perceptual capabilities, naive adversarial attacks cannot completely tame SAM. To thoroughly deceive SAM and make it blind to the image, we introduce a frequency-aware optimization strategy, which further enhances the capability of tamper localization. Extensive experimental results demonstrate the effectiveness of our method.[159] TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning
Qingtao Yu,Changlin Song,Minghao Sun,Zhengyang Yu,Vinay Kumar Verma,Soumya Roy,Sumit Negi,Hongdong Li,Dylan Campbell
Main category: cs.CV
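TL;DR: Proposes TTSnap, a framework that uses noise-aware pruning with reward models trained via self-distillation to prune low-quality candidates early, before they are fully denoised, improving the efficiency of test-time scaling for text-to-image diffusion models.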
Details
Motivation: Existing test-time scaling methods search over multiple noise seeds, but every candidate must be fully denoised before its reward can be computed; this is computationally expensive and limits the number of samples that can be explored under a fixed budget. Method: TTSnap trains noise-aware reward models via self-distillation so that the reward estimated for an image at an intermediate denoising stage agrees with the reward of the final clean image; a curriculum strategy gradually shifts training from clean images to noisy images; and candidates are pruned at different noise levels to reduce computation. Result: Experiments show that TTSnap improves performance by over 16% compared with existing methods, achieves more efficient test-time scaling under the same compute budget, and yields orthogonal gains when combined with post-training techniques and local optimization methods. Conclusion: Through noise-aware early pruning and aligned reward prediction, TTSnap significantly improves the efficiency and effectiveness of test-time scaling for diffusion models, offering a viable path toward large-scale use. Abstract: A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds, selecting the one that maximizes a certain image-reward function. The effectiveness of this strategy heavily depends on the number and diversity of noise seeds explored. However, verifying each candidate is computationally expensive, because each must be fully denoised before a reward can be computed. This severely limits the number of samples that can be explored under a fixed budget. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them. The key challenge is that reward models are learned in the clean image domain, and the rankings of rewards predicted for intermediate estimates are often inconsistent with those predicted for clean images. To overcome this, we train noise-aware reward models via self-distillation to align the reward for intermediate estimates with that of the final clean images. To stabilize learning across different noise levels, we adopt a curriculum training strategy that progressively shifts the data domain from clean images to noisy images. In addition, we introduce a new metric that measures reward alignment and computational budget utilization. Experiments demonstrate that our approach improves performance by over 16% compared with existing methods, enabling more efficient and effective test-time scaling. It also provides orthogonal gains when combined with post-training techniques and local test-time optimization.
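The pruning idea in the abstract can be sketched as a search loop that scores partially denoised candidates and drops the weakest ones at chosen steps. This is a minimal sketch under stated assumptions, not the TTSnap implementation: `denoise_step`, `noise_aware_reward`, and the `checkpoints` schedule are hypothetical placeholders supplied by the caller.

```python
from typing import Callable, Dict, Sequence, Tuple, TypeVar

S = TypeVar("S")  # type of a (partially denoised) candidate state


def pruned_seed_search(
    candidates: Sequence[S],
    denoise_step: Callable[[S, int], S],
    noise_aware_reward: Callable[[S, int], float],
    num_steps: int,
    checkpoints: Dict[int, int],
) -> Tuple[S, int]:
    """Denoise all candidates step by step, pruning at selected steps.

    `checkpoints` maps a step index to the number of candidates to keep
    after that step (a hypothetical schedule). Returns the best surviving
    candidate and the total number of denoising steps spent.
    """
    alive = list(candidates)
    steps_used = 0
    for t in range(num_steps):
        alive = [denoise_step(x, t) for x in alive]
        steps_used += len(alive)
        keep = checkpoints.get(t)
        if keep is not None and keep < len(alive):
            # Rank intermediate estimates with the noise-aware reward
            # and keep only the top `keep` candidates.
            alive.sort(key=lambda x: noise_aware_reward(x, t), reverse=True)
            alive = alive[:keep]
    best = max(alive, key=lambda x: noise_aware_reward(x, num_steps - 1))
    return best, steps_used
```

With four candidates, four steps, and a schedule that keeps two candidates after step 0 and one after step 1, this loop spends 8 denoising steps instead of the 16 needed to fully denoise every candidate. The saving is only useful if the intermediate reward ranks candidates consistently with the final reward, which is what the abstract's self-distillation step is meant to provide.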
Code: https://github.com/TerrysLearning/TTSnap/.[160] Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models
Seoyun Yang,Gihoon Kim,Taesup Kim
Main category: cs.CV
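TL;DR: Proposes a semantic-anchoring-based personalization method for text-to-image diffusion models. By tying a rare concept to its frequent counterpart, it learns new visual concepts stably, improving subject fidelity and text-image alignment while preserving the pretrained semantic structure.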
Details
Motivation: In personalized generation, existing text-to-image diffusion models either overfit the few reference images or fail to learn new attributes, and struggle to achieve both subject fidelity and semantic consistency. Method: Personalization is reformulated as learning a rare concept guided by the distribution of its frequent counterpart through semantic anchoring, using the pretrained model's semantic prior to expand stably toward personalized regions. Result: Across multiple experiments the method adapts more stably than baselines and consistently improves subject fidelity and text-image alignment. Conclusion: The semantic anchoring strategy balances learning new concepts against preserving the semantic prior, improving the robustness and performance of diffusion models on few-shot personalization. Abstract: Text-to-image diffusion models have achieved remarkable progress in generating diverse and realistic images from textual descriptions. However, they still struggle with personalization, which requires adapting a pretrained model to depict user-specific subjects from only a few reference images. The key challenge lies in learning a new visual concept from a limited number of reference images while preserving the pretrained semantic prior that maintains text-image alignment. When the model focuses on subject fidelity, it tends to overfit the limited reference images and fails to leverage the pretrained distribution. Conversely, emphasizing prior preservation maintains semantic consistency but prevents the model from learning new personalized attributes. Building on these observations, we propose a personalization process based on semantic anchoring, which guides adaptation by grounding new concepts in their corresponding distributions. We therefore reformulate personalization as the process of learning a rare concept guided by its frequent counterpart through semantic anchoring. This anchoring encourages the model to adapt new concepts in a stable and controlled manner, expanding the pretrained distribution toward personalized regions while preserving its semantic structure. As a result, the proposed method achieves stable adaptation and consistent improvements in both subject fidelity and text-image alignment compared to baseline methods. Extensive experiments and ablation studies further demonstrate the robustness and effectiveness of the proposed anchoring strategy.[161] Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
Bolin Lai,Xudong Wang,Saketh Rambhatla,James M. Rehg,Zsolt Kira,Rohit Girdhar,Ishan Misra
Main category: cs.CV
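TL;DR: This paper proposes FreqWarm, a frequency warm-up curriculum that addresses the trade-off between reconstruction and generation quality in high-dimensional latent spaces; by increasing exposure to high-frequency signals early in diffusion model training, it significantly improves generation quality.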
Details
Motivation: 观察到随着潜在维度增加,虽然重建保真度提高,但生成质量下降,其根源在于编码器对高频信息表示不足而解码器依赖这些信息。因此需要一种方法来平衡高低频信息的学习。 Method: 提出FreqWarm方法,在扩散或流匹配训练早期逐步引入高频潜在信号,增强模型对高频成分的感知能力,无需修改或重新训练自编码器,具有即插即用和架构无关特性。 Result: 在多种高维自编码器上验证了FreqWarm的有效性,生成质量显著提升:在Wan2.2-VAE上gFID降低14.11,LTX-VAE上降低6.13,DC-AE-f32上降低4.42。 Conclusion: 显式管理频率暴露可有效改善高维潜在空间的可扩散性,使高容量自编码器更适用于高质量图像生成任务。 Abstract: Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction-generation trade-off as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity but generation quality eventually declines. We trace this gap to the different behaviors in high-frequency encoding and decoding. Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training -- without modifying or retraining the autoencoder. Applied across several high-dimensional autoencoders, FreqWarm consistently improves generation quality: decreasing gFID by 14.11 on Wan2.2-VAE, 6.13 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can successfully turn high-dimensional latent spaces into more diffusible targets.[162] UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation
Dengbo Chen,Ziwei Zhao,Kexin Zhang,Shishuang Zhao,Junjie Hou,Yaqian Wang,Nianxi Liao,Anlan Sun,Fei Gao,Jia Ding,Yuhang Liu,Dong Wang
Main category: cs.CV
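TL;DR: UMind-VL is a unified ultrasound foundation model that combines pixel-level structural understanding with clinical reasoning; with a newly proposed Dynamic Convolutional Mask Decoder and task-specific tokens, it performs strongly across a range of ultrasound tasks.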
Details
Motivation: 超声领域缺乏能够同时处理低层次感知(如分割、定位)和高层次解释(如诊断、推理)的综合模型,现有方法难以兼顾精度与多任务协同。 Method: 提出 UMind-VL 模型,采用轻量级动态卷积掩码解码器,利用大语言模型输出动态生成卷积核,并结合任务特定标记;基于自建大规模数据集 UMind-DS(含120万图像-文本对及像素级标注)进行训练和评估。 Result: 在分割、检测、关键点定位和诊断推理等任务上显著优于通用多模态模型,性能媲美或超越专用模型,且具备强泛化能力。 Conclusion: UMind-VL 成功整合了超声图像的低层感知与高层解读,为医学超声分析提供了一个统一、高效、可扩展的基础模型框架。 Abstract: Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To bridge this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. We demonstrate the capability of UMind-VL in Figure 1.[163] Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting?
Wenkai Huang,Yijia Guo,Gaolei Li,Lei Ma,Hang Zhang,Liwen Hu,Jiazheng Wang,Jianhua Li,Tiejun Huang
Main category: cs.CV
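TL;DR: This paper presents the first systematic study of the vulnerability of 3D Gaussian Splatting (3DGS) watermarking schemes and proposes GSPure, the first watermark purification framework for 3DGS, which removes watermarks effectively while preserving scene fidelity.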
Details
Motivation: Existing 2D watermark removal methods do not transfer effectively to 3DGS, and the security of current 3DGS watermarking schemes has not been adequately verified, so an attack designed for the characteristics of 3DGS is needed to assess their robustness. Method: By analyzing view-dependent rendering contributions and exploiting geometrically accurate feature clustering, GSPure precisely locates watermark-related Gaussian primitives and removes them while preserving the main scene structure. Result: Experiments show that GSPure reduces watermark PSNR by up to 16.34 dB with less than 1 dB PSNR loss on the original scene, outperforming existing methods in both removal effectiveness and generalization. Conclusion: GSPure exposes security weaknesses in current 3DGS watermarking schemes and provides a reference point for designing more robust 3D watermarking methods. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful representation for 3D scenes, widely adopted due to its exceptional efficiency and high-fidelity visual quality. Given the significant value of 3DGS assets, recent works have introduced specialized watermarking schemes to ensure copyright protection and ownership verification. However, can existing 3D Gaussian watermarking approaches genuinely guarantee robust protection of the 3D assets? In this paper, for the first time, we systematically explore and validate possible vulnerabilities of 3DGS watermarking frameworks. We demonstrate that conventional watermark removal techniques designed for 2D images do not effectively generalize to the 3DGS scenario due to the specialized rendering pipeline and the unique attributes of each Gaussian primitive. Motivated by this insight, we propose GSPure, the first watermark purification framework specifically for 3DGS watermarking representations. By analyzing view-dependent rendering contributions and exploiting geometrically accurate feature clustering, GSPure precisely isolates and effectively removes watermark-related Gaussian primitives while preserving scene integrity. Extensive experiments demonstrate that GSPure achieves the best watermark purification performance, reducing watermark PSNR by up to 16.34 dB while minimizing degradation to original scene fidelity with less than 1 dB PSNR loss. Moreover, it consistently outperforms existing methods in both effectiveness and generalization.[164] DriveVGGT: Visual Geometry Transformer for Autonomous Driving
Xiaosong Jia,Yanhao Liu,Junqi You,Renqiu Xia,Yu Hong,Junchi Yan
Main category: cs.CV
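TL;DR: This paper proposes DriveVGGT, a scale-aware 4D reconstruction framework designed for autonomous driving. By introducing Temporal Video Attention (TVA) and Multi-camera Consistency Attention (MCA) modules and incorporating known camera intrinsics and extrinsics together with relative pose constraints, it significantly improves reconstruction from multi-camera video in low-overlap settings.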
Details
Motivation: Applying VGGT directly to autonomous driving systems performs poorly because the two tasks have different priors. Autonomous driving brings new key priors: little overlap between camera views, known camera intrinsics and extrinsics, and fixed relative positions among the cameras. A dedicated model is needed to exploit them fully.
Method: The DriveVGGT framework has three core components: 1) a TVA module that processes each single-camera video sequence independently to exploit its spatiotemporal continuity; 2) an MCA module that applies window attention with normalized relative pose embeddings, establishing cross-camera consistency while restricting each token to attend only to nearby frames; 3) extended VGGT heads, with an added absolute scale head and ego vehicle pose head.
Result: Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, and fastVGGT on autonomous driving datasets, and ablation studies confirm the effectiveness of each design component.
Conclusion: DriveVGGT successfully integrates the multimodal priors of autonomous driving and achieves better scale-aware 4D reconstruction, making it suitable for real autonomous driving scenarios.
Abstract: Feed-forward reconstruction has recently gained significant attention, with VGGT being a notable example. However, directly applying VGGT to autonomous driving (AD) systems leads to sub-optimal results due to the different priors between the two tasks. In AD systems, several important new priors need to be considered: (i) The overlap between camera views is minimal, as autonomous driving sensor setups are designed to achieve coverage at a low cost. (ii) The camera intrinsics and extrinsics are known, which introduces more constraints on the output and also enables the estimation of absolute scale. (iii) Relative positions of all cameras remain fixed though the ego vehicle is in motion. To fully integrate these priors into a feed-forward framework, we propose DriveVGGT, a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings, aiming to establish consistency relationships across different cameras while restricting each token to attend only to nearby frames. Finally, we extend the standard VGGT heads by adding an absolute scale head and an ego vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, and fastVGGT on autonomous driving datasets, while extensive ablation studies verify the effectiveness of the proposed designs.
[165] The Collapse of Patches
Wei Guo,Shunqi Mao,Zhuonan Liang,Heng Wang,Weidong Cai
Main category: cs.CV
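TL;DR: This paper introduces the concept of "patch collapse" and improves image modeling by learning the dependencies among image patches, making autoregressive generation and image classification more efficient.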
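This entry's abstract ranks patches by running PageRank over learned patch dependencies. The following is a minimal sketch of power-iteration PageRank on a small dense weight matrix, only to illustrate how such a ranking yields a patch order. The edge direction, the toy weights in `deps`, and the function name are assumptions made for illustration; the abstract does not specify how the authors build their graph.

```python
def pagerank(weights, damping=0.85, iters=100):
    """Power-iteration PageRank on a dense weight matrix.

    weights[i][j] is the (hypothetical) strength with which patch i relies on
    patch j, so rank flows from i to j and heavily relied-on patches score high.
    """
    n = len(weights)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i, row in enumerate(weights):
            total = sum(row)
            for j in range(n):
                # Rows with no outgoing weight spread their rank uniformly.
                share = row[j] / total if total > 0 else 1.0 / n
                new[j] += damping * rank[i] * share
        rank = new
    return rank

# Toy example: patches 1 and 2 both rely on patch 0; patch 0 relies equally on both.
deps = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
scores = pagerank(deps)
# Patch indices sorted from most to least relied on.
order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

In this toy graph patch 0 comes first in `order`, which matches the intuition that a patch many others depend on should be revealed early.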
Details
Motivation: Observing some patches in an image lowers the uncertainty of others. Inspired by wave function collapse in quantum mechanics, the authors propose the concept of patch collapse to model image structure more efficiently.
Method: An autoencoder softly selects the source patches used to reconstruct each target patch, and PageRank is used to measure each patch's importance in the resulting dependency graph, which yields an optimal order in which to reveal the patches.
Result: The method improves the autoregressive image generation performance of MAR and achieves high-accuracy image classification using only the 22% of patches with the highest PageRank scores.
Conclusion: Patch collapse offers a new perspective on image modeling and helps make vision tasks more efficient.
Abstract: Observing certain patches in an image reduces the uncertainty of others. Their realization lowers the distribution entropy of each remaining patch feature, analogous to collapsing a particle's wave function in quantum mechanics. This phenomenon can intuitively be called patch collapse. To identify which patches are most relied on during a target region's collapse, we learn an autoencoder that softly selects a subset of patches to reconstruct each target patch. Graphing these learned dependencies for each patch's PageRank score reveals the optimal patch order to realize an image. We show that respecting this order benefits various masked image modeling methods. First, autoregressive image generation can be boosted by retraining the state-of-the-art model MAR. Next, we introduce a new setup for image classification by exposing Vision Transformers only to high-rank patches in the collapse order. Seeing 22% of such patches is sufficient to achieve high accuracy. With these experiments, we propose patch collapse as a novel image modeling perspective that promotes vision efficiency. Our project is available at https://github.com/wguo-ai/CoP .
[166] Match-and-Fuse: Consistent Generation from Unstructured Image Sets
Kate Feingold,Omri Kaduri,Tali Dekel
Main category: cs.CV
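TL;DR: Proposes Match-and-Fuse, a zero-shot, training-free framework for generating consistent unstructured image sets, achieving local cross-image consistency and global coherence through a graph formulation and feature fusion.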
Details
Motivation: Existing methods mostly target single images or dense video generation. They struggle to keep shared content consistent across an image collection and lack controllable generation for sparse, unstructured image sets.
Method: The generation task is modeled as a graph in which each node is an image and each edge triggers joint generation of an image pair. Dense correspondences guide the fusion of internal features between image pairs, and consistency is further strengthened by an implicit prior in text-to-image models for coherent generation when multiple views share one canvas.
Result: The method reaches state-of-the-art consistency and visual quality for image set generation, requires no training, masks, or manual supervision, and supports new forms of content creation from image collections.
Conclusion: Match-and-Fuse provides an effective zero-shot, training-free framework for image set generation. Its graph structure and feature fusion unify local and global consistency and extend what content generation from image collections can do.
Abstract: We present Match-and-Fuse - a zero-shot, training-free method for consistent controlled generation of unstructured image sets - collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. Unlike existing methods that operate on individual images or densely sampled videos, our framework performs set-to-set generation: given a source set and user prompts, it produces a new set that preserves cross-image consistency of shared content. Our key idea is to model the task as a graph, where each node corresponds to an image and each edge triggers a joint generation of image pairs. This formulation consolidates all pairwise generations into a unified framework, enforcing their local consistency while ensuring global coherence across the entire set. This is achieved by fusing internal features across image pairs, guided by dense input correspondences, without requiring masks or manual supervision. It also allows us to leverage an emergent prior in text-to-image models that encourages coherent generation when multiple views share a single canvas. Match-and-Fuse achieves state-of-the-art consistency and visual quality, and unlocks new capabilities for content creation from image collections.
[167] Structure is Supervision: Multiview Masked Autoencoders for Radiology
Sonia Laguna,Andrea Agostini,Alain Ryser,Samuel Ruiperez-Campillo,Irene Cannistraci,Moritz Vandenhirtz,Stephan Mandt,Nicolas Deperrois,Farhad Nooralahzadeh,Michael Krauthammer,Thomas M. Sutter,Julia E. Vogt
Main category: cs.CV
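TL;DR: This paper proposes two self-supervised learning frameworks, MVMAE and MVMAE-V2T, which use the multi-view structure of radiology studies and the text of radiology reports to learn disease-relevant, view-invariant representations. They outperform existing methods on several large-scale datasets, especially when labels are scarce.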
Details
Motivation: Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure of clinical data. Existing methods do not fully use the multi-view redundancy naturally present in radiology studies or the semantic information in text reports.
Method: The Multiview Masked Autoencoder (MVMAE) combines masked image reconstruction with cross-view alignment, using redundancy across projections as a self-supervisory signal. It is extended to MVMAE-V2T, which adds radiology reports as an auxiliary text signal to improve semantic grounding while keeping inference purely vision-based.
Result: On downstream disease classification across three large public datasets (MIMIC-CXR, CheXpert, and PadChest), MVMAE consistently outperforms supervised and vision-language baselines, and MVMAE-V2T brings additional gains in low-label settings.
Conclusion: Structural supervision (multi-view consistency) and textual supervision are complementary paths toward scalable, clinically grounded medical foundation models.
Abstract: Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
[168] Small Object Detection for Birds with Swin Transformer
Da Huo,Marc A. Kastner,Tingwei Liu,Yasutomo Kawanishi,Takatsugu Hirayama,Takahiro Komamizu,Ichiro Ide
Main category: cs.CV
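TL;DR: This paper proposes an improved small object detection method based on Swin Transformer. By improving feature learning in the neck and adjusting the shifted window size, it improves detection of sparsely distributed small objects such as birds.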
Details
Motivation: Small object detection suffers from small size, blur, and occlusion, and learning effective features is even harder when targets are sparse and training samples are limited. Existing methods mostly target dense scenes and offer no effective solution for sparse small objects.
Method: A method specialized for one category of small objects (birds) improves the neck, the structure between the backbone and the prediction head in the detection framework. It uses Swin Transformer to upsample features and adjusts the shifted window size to suit small objects.
Result: Experiments show that, combined with CenterNet, the proposed Swin Transformer-based neck can noticeably improve detection performance when the window size is tuned, and that smaller window sizes (default 2) raise mAP for small object detection.
Conclusion: Improving the neck and adjusting the Swin Transformer window size strengthens the feature representation of sparse small objects and raises detection accuracy, supporting the method's effectiveness for small object detection.
Abstract: Object detection is the task of detecting objects in an image. In this task, the detection of small objects is particularly difficult. Other than the small size, it is also accompanied by difficulties due to blur, occlusion, and so on. Current small object detection methods are tailored to small and dense situations, such as pedestrians in a crowd or far objects in remote sensing scenarios. However, when the target object is small and sparse, there is a lack of objects available for training, making it more difficult to learn effective features. In this paper, we propose a specialized method for detecting a specific category of small objects: birds. In particular, we improve the features learned by the neck, the sub-network between the backbone and the prediction head, to learn more effective features with a hierarchical design. We employ Swin Transformer to upsample the image features. Moreover, we change the shifted window size to adapt to small objects. Experiments show that the proposed Swin Transformer-based neck combined with CenterNet can lead to good performance by changing the window sizes. We further find that smaller window sizes (default 2) benefit mAPs for small object detection.
[169] Prompt-based Consistent Video Colorization
Silvia Dani,Tiberio Uricchio,Lorenzo Seidenari
Main category: cs.CV
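TL;DR: Proposes an automated video colorization method guided by language and segmentation semantics. It uses a diffusion model and optical flow warping to produce high-quality, stable colorization and reaches state-of-the-art results on several benchmarks.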
Details
Motivation: Existing video colorization methods either suffer from temporal flickering or depend on extensive manual annotation, and an efficient automated solution is lacking.
Method: A language-conditioned diffusion model colorizes each frame, with semantic guidance from automatically generated object masks and text prompts. RAFT optical flow warps color information from the previous frame to maintain temporal consistency, and a correction step repairs errors introduced by warping.
Result: The method achieves the best PSNR, Colorfulness, and CDC scores on the standard DAVIS30 and VIDEVO20 datasets.
Conclusion: The method confirms that semantic guidance from automatic prompts is effective for high-fidelity, stable video colorization.
Abstract: Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach automating high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt, achieving state-of-the-art results without specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.
[170] Unexplored flaws in multiple-choice VQA evaluations
Fabio Rosenthal,Sebastian Schmidt,Thorsten Graf,Thorsten Bagodonat,Stephan Günnemann,Leo Schwinn
Main category: cs.CV
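TL;DR: This paper studies how multimodal large language models perform on multiple-choice visual question answering and finds that they are highly sensitive to minor changes in prompt format, even semantically neutral ones, and that existing debiasing methods do not resolve these newly identified biases.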
Details
Motivation: To expose the overlooked problem of prompt format bias in current evaluations of multimodal large language models and to make evaluation more reliable.
Method: A large-scale empirical study systematically analyzes the effect of 48 prompt format variations across seven MLLMs and five VQA datasets.
Result: Multiple-choice VQA results are highly sensitive to semantically neutral prompt format changes, and this bias is independent of known answer-order bias and of model confidence.
Conclusion: Current multiple-choice VQA evaluation has serious prompt format bias, existing debiasing strategies are ineffective against it, and more robust evaluation protocols need to be designed.
Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM's confidence in the correct answer. Finally, we demonstrate that existing bias mitigation strategies fail to address these newly identified biases.
[171] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
Yang Chen,Xiaowei Xu,Shuai Wang,Chenhui Zhu,Ruxue Wen,Xubin Li,Tiezheng Ge,Limin Wang
Main category: cs.CV
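TL;DR: Proposes a new alignment strategy that exploits the invertibility of normalizing flows (NFs) to align intermediate features of the generative pass with representations from a powerful vision foundation model, improving generation quality and classification accuracy and setting new state-of-the-art results for NFs on ImageNet.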
Details
Motivation: Standard normalizing flows trained by maximum likelihood learn poor semantic representations, which limits generation quality, so a more effective semantic alignment method is needed to strengthen NF representations.
Method: A novel alignment strategy exploits the invertibility of NFs to align intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model (such as CLIP). A training-free, test-time optimization algorithm for classification is also introduced to evaluate the semantic knowledge embedded in the NF more intrinsically.
Result: The method speeds up NF training by more than 3.3×, sets new state-of-the-art results for NFs on ImageNet 64×64 and 256×256, and significantly improves both generation quality and classification accuracy.
Conclusion: Feature alignment along the reverse pass, together with test-time optimization, effectively strengthens the semantic representations of NFs, improves generative and discriminative tasks jointly, and advances NFs toward practical use.
Abstract: Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3×, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64×64 and 256×256. Our code is available at https://github.com/MCG-NJU/FlowBack.
[172] INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts
Anshul Bagaria
Main category: cs.CV
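TL;DR: This paper proposes INSIGHT, a multimodal framework for detecting and explaining AI-generated images that delivers robust detection and interpretability even at extremely low resolutions.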
Details
Motivation: Existing deepfake detectors degrade sharply under real-world conditions such as downsampling, compression, and cross-domain shifts, and they lack interpretability, which limits their use in high-stakes settings.
Method: INSIGHT combines hierarchical super-resolution to amplify subtle forensic cues, Grad-CAM for multi-scale localization, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptions. A vision-language model is prompted with ReAct plus Chain-of-Thought to generate fine-grained explanations, and a dual-stage evaluation checks their factuality.
Result: Across diverse domains, including animals, vehicles, and abstract scenes, INSIGHT substantially outperforms existing methods under extreme degradation in both detection robustness and explanation quality.
Conclusion: INSIGHT offers a practical path toward transparent, reliable forensics for AI-generated images and advances trustworthy multimodal content verification.
Abstract: The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, undermining trust and hindering adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16x16 - 64x64). INSIGHT combines hierarchical super-resolution for amplifying subtle forensic cues without inducing misleading artifacts, Grad-CAM driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted using a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification.
[173] AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows
Zhenglin Zhou,Fan Ma,Chengzhuo Gui,Xiaobo Xia,Hehe Fan,Yi Yang,Tat-Seng Chua
Main category: cs.CV
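TL;DR: Proposes AnchorFlow, which achieves training-free 3D shape editing by keeping latent anchors consistent, improving the semantic consistency and geometric stability of edits.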
Details
Motivation: Existing training-free 3D editing methods suffer from inconsistent latent anchors caused by timestep-dependent noise during diffusion sampling, which makes strong and stable geometric edits hard to achieve.
Method: A globally shared latent anchor is introduced, together with a relaxed anchor-alignment loss and an anchor-aligned update rule, to keep the source and target trajectories consistent and to enable editing without mask supervision.
Result: Experiments on the Eval3DEdit benchmark show that AnchorFlow produces semantically aligned and structurally robust edits across diverse editing types while maintaining good geometric fidelity.
Conclusion: By stabilizing the latent reference space, AnchorFlow improves the semantic accuracy and geometric stability of training-free 3D editing and shows good generalization and application potential.
Abstract: Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. Code is at https://github.com/ZhenglinZhou/AnchorFlow.
[174] Asking like Socrates: Socrates helps VLMs understand remote sensing images
Run Shao,Ziyu Li,Zhaoyang Zhang,Linrui Xu,Xinran He,Hongyuan Yuan,Bolei He,Yongxing Dai,Yiming Yan,Yijun Chen,Wang Guo,Haifeng Li
Main category: cs.CV
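TL;DR: Proposes RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. Combined with the SocraticAgent multi-agent system and a two-stage progressive reinforcement learning strategy, it mitigates the "Glance Effect" in remote sensing imagery, enables reasoning grounded in visual evidence, and reaches state-of-the-art performance on multiple remote sensing VQA and grounding tasks.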
Details
Motivation: Existing vision-language models commonly exhibit pseudo reasoning on remote sensing tasks: they narrate a reasoning process without actually reasoning from visual evidence. This stems from a single coarse perception of large-scale remote sensing imagery, the "Glance Effect".
Method: The RS-EoT framework uses a language-driven, iterative visual evidence-seeking mechanism. SocraticAgent, a self-play multi-agent system, generates reasoning traces by alternating between reasoning and visual inspection. A two-stage progressive reinforcement learning strategy first applies RL to fine-grained grounding tasks to strengthen evidence acquisition, then transfers to remote sensing visual question answering (VQA) to improve generalization.
Result: RS-EoT achieves state-of-the-art performance on multiple remote sensing VQA and grounding benchmarks. Analysis shows clear iterative cycles of reasoning and evidence seeking, confirming that it mitigates the Glance Effect and reasons from visual evidence.
Conclusion: Through a structured evidence-seeking mechanism, RS-EoT overcomes the pseudo reasoning of conventional models in remote sensing understanding and moves multimodal models toward interpretable, reliable reasoning in specialized domains.
Abstract: Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates
[175] UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data
Longkun Zou,Jiale Wang,Rongqin Liang,Hai Wu,Ke Chen,Yaowei Wang
Main category: cs.CV
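TL;DR: This paper presents UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It contains 400K synchronized frames, diverse scenes and weather conditions, and five sensor modalities, and it comes with rich annotations and two baseline models to advance research on 3D perception of UAVs.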
Details
Motivation: Real-world UAV data collection is constrained by airspace regulations, privacy concerns, and environmental variability, and annotating 3D poses and cross-modal correspondences is slow and costly. A large-scale, accurately annotated multimodal dataset is therefore needed to advance low-altitude UAV perception.
Method: A high-fidelity simulation environment is used to generate the large-scale multimodal dataset UAV-MM3D. It covers five modalities (RGB, IR, LiDAR, Radar, and DVS), diverse scenes and weather conditions, and provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations. LGFusionNet (a LiDAR-guided multimodal fusion network) and a UAV trajectory prediction baseline are proposed for benchmarking.
Result: UAV-MM3D contains 400K synchronized frames spanning urban, suburban, forest, and coastal environments under different weather conditions, and supports 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. The proposed LGFusionNet and trajectory prediction baseline give these tasks usable reference points.
Conclusion: As a publicly available, large-scale, multimodal, accurately annotated synthetic dataset, UAV-MM3D provides a reliable benchmark for low-altitude UAV perception and motion understanding and supports progress in related intelligent systems and safety applications.
Abstract: Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and related intelligent systems. Developing reliable solutions requires large-scale, accurately annotated, and multimodal data. However, real-world UAV data collection faces inherent constraints due to airspace regulations, privacy concerns, and environmental variability, while manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly. To overcome these challenges, we introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), featuring multiple UAV models (micro, small, medium-sized) and five modalities - RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core tasks related to UAVs such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. We further propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, UAV-MM3D offers a public benchmark for advancing 3D perception of UAVs.
[176] DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention
Furkan Guzelant,Arda Goktogan,Tarık Kaya,Aysegul Dundar
Main category: cs.CV
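TL;DR: This paper proposes DiffStyle360, a diffusion-based framework that produces multi-view consistent, identity-preserving 3D head stylization from a single style reference image, without per-style training.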
Details
Motivation: Existing 3D head stylization methods usually rely on expensive optimization or domain-specific fine-tuning, so they adapt poorly to new styles, which limits their use across diverse artistic styles.
Method: Building on the 3D-aware DiffPortrait360 architecture, the approach introduces a Style Appearance Module to disentangle style from content and a Style Fusion Attention mechanism that adaptively balances structure preservation and stylization fidelity in the latent space. It is fine-tuned on a multi-view dataset generated by a 3D GAN and uses a temperature-based key scaling strategy to control stylization intensity at inference.
Result: Experiments on FFHQ and RenderMe360 show that DiffStyle360 outperforms state-of-the-art GAN- and diffusion-based methods across challenging artistic styles, with better style quality and multi-view consistency.
Conclusion: DiffStyle360 achieves general 3D head stylization without per-style fine-tuning, transferring diverse artistic styles while preserving identity and structure, and advances the use of 3D-aware generation in creative visual content design.
Abstract: 3D head stylization has emerged as a key technique for reimagining realistic human heads in various artistic forms, enabling expressive character design and creative visual experiences in digital media. Despite the progress in 3D-aware generation, existing 3D head stylization methods often rely on computationally expensive optimization or domain-specific fine-tuning to adapt to new styles. To address these limitations, we propose DiffStyle360, a diffusion-based framework capable of producing multi-view consistent, identity-preserving 3D head stylizations across diverse artistic domains given a single style reference image, without requiring per-style training. Building upon the 3D-aware DiffPortrait360 architecture, our approach introduces two key components: the Style Appearance Module, which disentangles style from content, and the Style Fusion Attention mechanism, which adaptively balances structure preservation and stylization fidelity in the latent space. Furthermore, we employ a 3D GAN-generated multi-view dataset for robust fine-tuning and introduce a temperature-based key scaling strategy to control stylization intensity during inference. Extensive experiments on FFHQ and RenderMe360 demonstrate that DiffStyle360 achieves superior style quality, outperforming state-of-the-art GAN- and diffusion-based stylization methods across challenging style domains.
[177] Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models
Minghao Yin,Yukang Cao,Kai Han
Main category: cs.CV
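TL;DR: WUKONG is a training-free framework for high-quality 3D morphing. It uses the generative prior of flow-based transformers to produce high-fidelity textured transitions, and it improves the smoothness and fidelity of shape and texture by formulating morphing as an optimal transport barycenter problem and adding a semantic consistency mechanism.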
Details
Motivation: Conventional 3D morphing methods depend on manual correspondence matching and deformation trajectory estimation, which generalize poorly and require costly preprocessing. WUKONG aims to remove these limitations and deliver automated, high-fidelity free-form morphing.
Method: Morphing is formulated as an optimal transport barycenter problem, and the continuity of flow-based generation is used to keep shape transitions smooth. A sequential initialization strategy prevents geometric distortion, and a similarity-guided semantic consistency mechanism preserves high-frequency texture details and controls the blending dynamics.
Result: Across diverse geometry and texture variations, WUKONG significantly outperforms existing state-of-the-art methods in both quantitative and qualitative evaluation, with smoother shape transitions and higher texture fidelity.
Conclusion: By using the generative prior of flow-based transformers together with its morphing strategies, WUKONG achieves training-free, high-quality 3D morphing with strong fidelity, detail preservation, and semantic consistency, and has broad application potential.
Abstract: We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.
[178] Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation
Weining Ren,Hongjun Wang,Xiao Tan,Kai Han
Main category: cs.CV
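TL;DR: Fin3R is a simple, effective, and general fine-tuning method that improves the fine geometry and robustness of feed-forward 3D reconstruction models. It freezes the decoder, fine-tunes only the image encoder with a LoRA adapter, and distills geometric detail from a monocular teacher model using unlabeled data.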
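To make the "only the tiny LoRA weights" idea in this entry concrete, here is a minimal sketch of the standard LoRA formulation, in which a frozen weight W is augmented as W + (alpha / r) * B @ A. It is not Fin3R's custom adapter, whose details the abstract does not give; the helper names, the 2x2 toy matrices, and the values in `b_trained` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w_frozen, lora_a, lora_b, alpha):
    """Effective weight W + (alpha / r) * B @ A, with W itself left unchanged."""
    rank = len(lora_a)              # A has shape (r, d_in)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # B has shape (d_out, r)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]

w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
a = [[0.5, -0.5]]              # trainable down-projection, r = 1
b_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is initially a no-op
b_trained = [[2.0], [0.0]]     # hypothetical values after some training

w_init = lora_weight(w, a, b_zero, alpha=1.0)
w_tuned = lora_weight(w, a, b_trained, alpha=1.0)
```

Only `a` and the B matrix would receive gradients in such a setup. Since the low-rank product can be folded into W after training, this is one common reason LoRA-style adapters leave inference memory and latency essentially unchanged.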
Details
Motivation: Existing feed-forward 3D reconstruction models perform poorly on fine geometric structure and robustness because high-fidelity depth and pose supervision is scarce and multi-view pointmap regression introduces geometric misalignment.
Method: Fin3R freezes the decoder and fine-tunes only the image encoder through a lightweight LoRA adapter, distilling fine geometric information from a strong monocular teacher model on large-scale unlabeled data.
Result: Validated on a range of models including DUSt3R, MASt3R, CUT3R, and VGGT, Fin3R clearly improves boundary sharpness, recovery of complex structures, and geometric accuracy, with almost no added inference memory or latency.
Conclusion: Fin3R is an efficient, general fine-tuning framework that significantly improves the reconstruction quality and robustness of feed-forward 3D reconstruction models without changing their inference efficiency.
Abstract: We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. The family of feed-forward reconstruction models regresses the pointmaps of all input images to a reference frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles both issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r
[179] SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition
Hongda Liu,Yunfan Liu,Changlu Wang,Yunlong Wang,Zhenan Sun
Main category: cs.CV
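TL;DR: Proposes SkeletonAgent, a framework that connects the action recognition model and a large language model through cooperative agents, improving the ability to distinguish similar actions in skeleton-based action recognition.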
Details
Motivation: In existing methods the large language model receives no performance feedback from the recognition model, so it struggles to supply the discriminative cues needed to tell similar actions apart.
Method: Two cooperative agents are designed, a Questioner and a Selector. The Questioner identifies easily confused classes and passes them to the LLM as context. The Selector parses the joint-level constraints in the LLM's output and feeds them back to the recognizer, enabling fine-grained cross-modal alignment.
Result: The method surpasses existing state-of-the-art approaches on five benchmarks: NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human.
Conclusion: Through a closed-loop interaction mechanism, SkeletonAgent effectively combines semantic priors with recognition feedback and significantly improves skeleton-based action recognition.
Abstract: Recent advances in skeleton-based action recognition increasingly leverage semantic priors from Large Language Models (LLMs) to enrich skeletal representations. However, the LLM is typically queried in isolation from the recognition model and receives no performance feedback. As a result, it often fails to deliver the targeted discriminative cues critical to distinguish similar actions. To overcome these limitations, we propose SkeletonAgent, a novel framework that bridges the recognition model and the LLM through two cooperative agents, i.e., Questioner and Selector. Specifically, the Questioner identifies the most frequently confused classes and supplies them to the LLM as context for more targeted guidance. Conversely, the Selector parses the LLM's response to extract precise joint-level constraints and feeds them back to the recognizer, enabling finer-grained cross-modal alignment. Comprehensive evaluations on five benchmarks, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human, demonstrate that SkeletonAgent consistently outperforms state-of-the-art benchmark methods. The code is available at https://github.com/firework8/SkeletonAgent.
[180] ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection
Runzhi Deng,Yundi Hu,Xinshuang Zhang,Zhao Wang,Xixi Liu,Wang-Zhou Dai,Caifeng Shan,Fang Zhao
Main category: cs.CV
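TL;DR: Proposes ABounD, which unifies semantic adaptation and precise decision boundary learning for few-shot multi-class anomaly detection through Dynamic Concept Fusion and Adversarial Boundary Forging.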
Details
Motivation: Under data scarcity, existing vision-language models struggle to be both category-adaptive and discriminative, which blurs the boundary between normal and abnormal and leads to missed subtle defects or false rejection of atypical normal samples.
Method: A Dynamic Concept Fusion (DCF) module generates class-adaptive prompts, and Adversarial Boundary Forging (ABF) generates boundary features through PGD-style perturbations to shape a precise decision boundary. Training is single-stage under a Concept-Boundary Loss, with ABF providing the main supervisory signal.
Result: The method reaches state-of-the-art performance for few-shot multi-class anomaly detection on the MVTec-AD and VisA datasets.
Conclusion: ABounD effectively combines semantic learning with boundary optimization, improving the robustness and accuracy of few-shot multi-class anomaly detection.
Abstract: Few-shot multi-class industrial anomaly detection remains a challenging task. Vision-language models need to be both category-adaptive and sharply discriminative, yet data scarcity often blurs the boundary between normal and abnormal states. This ambiguity leads to missed subtle defects and the rejection of atypical normal samples. We propose ABounD, an Adversarial Boundary-Driven few-shot learning for multi-class anomaly detection, which is a unified learning framework that integrates semantic concept learning with decision boundary shaping. The Dynamic Concept Fusion (DCF) module produces class-adaptive prompts by fusing generalizable priors with class-specific cues, conditioned on image features. Meanwhile, Adversarial Boundary Forging (ABF) sculpts a more precise decision margin by generating boundary-level fence features via PGD-style perturbations. Training is conducted in a single stage under a Concept-Boundary Loss, where ABF provides the main supervisory signal and semantic-spatial regularizers stabilize the optimization. This synergy yields a decision boundary that closely follows normal data while preserving flexibility and robust semantic alignment. Experiments on MVTec-AD and VisA datasets demonstrate state-of-the-art performance in the task of few-shot multi-class anomaly detection.
[181] Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition
Maheswar Bora,Tashvik Dhamija,Shukesh Reddy,Baptiste Chopin,Pranav Balaji,Abhijit Das,Antitza Dantcheva
Main category: cs.CV
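TL;DR: Proposes FauxNet, a new network built on pre-trained Visual Speech Recognition (VSR) features for zero-shot deepfake detection; it outperforms existing methods on multiple datasets and can also attribute videos to the generation technique that produced them.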
Details
Motivation: The rapid progress of deepfake technology creates a risk of media misuse, so robust and reliable detection methods are urgently needed, in particular zero-shot detectors that generalize.
Method: A pre-trained Visual Speech Recognition (VSR) model extracts temporal features from videos; FauxNet is built on these features to separate real from manipulated videos and to attribute fakes to their generation method.
Result: FauxNet consistently outperforms existing state-of-the-art methods in the zero-shot setting and distinguishes six different generation techniques on the newly built Authentica-Vox and Authentica-HDTF datasets as well as on FaceForensics++.
Conclusion: FauxNet is an efficient, generalizable deepfake detection framework with good zero-shot performance and attribution ability, and the newly released Authentica datasets provide an important resource for future research.
Abstract: Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context has to do with zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet is able to attribute videos, i.e., to distinguish between the generation techniques from which they stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.
[182] Benchmarking machine learning models for multi-class state recognition in double quantum dot data
Valeria Díaz Moreno,Ryan P Khalili,Daniel Schug,Patrick J. Walsh,Justyna P. Zwolak
Main category: cs.CV
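TL;DR: Compares four modern machine learning models on multi-class state recognition in double-quantum-dot charge-stability diagrams (CSDs) and finds that CNNs with min-max normalization perform best on experimental data, balancing accuracy and parameter efficiency.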
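The abstract of this entry defines its "MSE score" as 1 - MSE and contrasts min-max scaling with z-score normalization. The following is a minimal sketch of those three quantities on plain Python lists. The function names are illustrative, and how the paper applies the score to multi-class outputs is not stated in the abstract, so treating predictions and targets as flat lists of numbers is an assumption.

```python
import math

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant input carries no contrast; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

def z_score_normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def mse_score(pred, target):
    """MSE score as defined in the abstract: 1 - mean squared error."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return 1.0 - mse
```

Min-max scaling bounds every pixel of a diagram to [0, 1] but is sensitive to a single outlier, while z-score normalization is driven by the mean and spread of the whole diagram, which is one plausible reason the two schemes could differ in training stability.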
Details
Motivation: 为了实现大规模量子处理器的可扩展性,需要自动化调谐策略来识别量子点器件状态,而准确识别电荷稳定性图(CSD)中的状态是关键挑战之一。 Method: 对四种现代机器学习架构(U-Net、视觉Transformer、MDN和CNN)在合成与实验CSD数据上的多分类状态识别性能进行基准测试,评估不同数据量和归一化方案下的表现。 Result: U-Net和ViT在合成数据上MSE得分超过0.98,但在实验数据上泛化能力差;MDN计算最高效且训练稳定,但峰值性能较低;CNN在实验CSD上表现最优,参数量比U-Net和ViT少两个数量级;min-max归一化提升MSE得分但降低收敛稳定性,z-score则相反。 Conclusion: CNN结合min-max归一化是在实际量子点CSD分析中最实用的方案,提供了精度与效率之间的最佳权衡。 Abstract: Semiconductor quantum dots (QDs) are a leading platform for scalable quantum processors. However, scaling to large arrays requires reliable, automated tuning strategies for devices' bootstrapping, calibration, and operation, with many tuning aspects depending on accurately identifying QD device states from charge-stability diagrams (CSDs). In this work, we present a comprehensive benchmarking study of four modern machine learning (ML) architectures for multi-class state recognition in double-QD CSDs. We evaluate their performance across different data budgets and normalization schemes using both synthetic and experimental data. We find that the more resource-intensive models -- U-Nets and visual transformers (ViTs) -- achieve the highest MSE score (defined as $1-\mathrm{MSE}$) on synthetic data (over $0.98$) but fail to generalize to experimental data. MDNs are the most computationally efficient and exhibit highly stable training, but with substantially lower peak performance. CNNs offer the most favorable trade-off on experimental CSDs, achieving strong accuracy with two orders of magnitude fewer parameters than the U-Nets and ViTs. Normalization plays a nontrivial role: min-max scaling generally yields higher MSE scores but less stable convergence, whereas z-score normalization produces more predictable training dynamics but at reduced accuracy for most models. Overall, our study shows that CNNs with min-max normalization are a practical approach for QD CSDs.[183] Beyond Real versus Fake Towards Intent-Aware Video Analysis
Saurabh Atreya,Nabyl Quignon,Baptiste Chopin,Abhijit Das,Antitza Dantcheva
Main category: cs.CV
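TL;DR: Introduces IntentHQ, a new benchmark for human-centered intent analysis that uses multimodal models to understand the motivation behind a video rather than only detecting whether it is real or fake.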
Details
Motivation: 现有的深度伪造检测方法主要关注视频真实性辨别,忽视了对操纵视频背后意图的理解,而理解意图对于应对社会和安全风险至关重要。 Method: 构建了一个包含5168个视频的数据集,标注了23个细粒度意图类别,并采用监督与自监督的多模态模型结合时空视频特征、音频和文本进行意图识别。 Result: 模型能够有效区分广泛的意图类别,验证了从真实性验证转向意图理解的可行性。 Conclusion: IntentHQ推动了深伪视频分析从‘是否为假’向‘为何而造’的范式转变,为更深层次的内容理解提供了新方向。 Abstract: The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including "Financial fraud", "Indirect marketing", "Political propaganda", as well as "Fear mongering". We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.[184] ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models
Zhenglin Zhou,Fan Ma,Xiaobo Xia,Hehe Fan,Yi Yang,Tat-Seng Chua
Main category: cs.CV
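TL;DR: ITS3D is an inference-time scaling method for text-guided 3D diffusion models that improves generation quality by optimizing the Gaussian noise input, with no additional training.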
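To make the idea of searching over the noise input concrete, here is a toy sketch of a verifier-guided search with Gaussian normalization. It is not the ITS3D algorithm: the hill-climbing loop, the `verifier` callable, and all parameter names are assumptions chosen for illustration, and the SVD-based compression and singular space reset described in the abstract are omitted.

```python
import random
import statistics

def gaussian_normalize(noise):
    """Re-standardize a noise vector to zero mean and unit standard deviation."""
    mean = statistics.fmean(noise)
    std = statistics.pstdev(noise)
    if std == 0:
        return list(noise)
    return [(x - mean) / std for x in noise]

def verifier_guided_search(verifier, dim, rounds=20, candidates=8, step=0.3, seed=0):
    """Hill-climb over noise vectors, keeping the candidate the verifier scores highest.

    `verifier` is a hypothetical callable mapping a noise vector to a score,
    standing in for "generate a 3D asset from this noise and rate it".
    """
    rng = random.Random(seed)
    best = gaussian_normalize([rng.gauss(0, 1) for _ in range(dim)])
    best_score = verifier(best)
    for _ in range(rounds):
        for _ in range(candidates):
            # Perturb the current best, then pull it back toward N(0, 1) statistics.
            cand = gaussian_normalize([b + step * rng.gauss(0, 1) for b in best])
            score = verifier(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score
```

The normalization step matters because repeated perturbation inflates the variance of the candidate, and a diffusion model is only expected to behave well on inputs that look like standard Gaussian noise.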
Details
Motivation: 为了在不进行额外训练的情况下提高文本到3D生成的质量,探索推理时的扩展方法。 Method: 提出ITS3D框架,将任务建模为优化问题,采用验证器引导的搜索算法迭代优化噪声输入,并引入高斯归一化、基于SVD的压缩和奇异空间重置机制以提升稳定性、效率和探索能力。 Result: 实验表明,ITS3D有效提升了生成质量,验证了计算高效的搜索方法在生成过程中的潜力。 Conclusion: ITS3D通过推理时优化噪声输入,显著改善了文本到3D生成的效果,展示了搜索策略在3D扩散模型中的应用前景。 Abstract: We explore inference-time scaling in text-guided 3D diffusion models to enhance generative quality without additional training. To this end, we introduce ITS3D, a framework that formulates the task as an optimization problem to identify the most effective Gaussian noise input. The framework is driven by a verifier-guided search algorithm, where the search algorithm iteratively refines noise candidates based on verifier feedback. To address the inherent challenges of 3D generation, we introduce three techniques for improved stability, efficiency, and exploration capability. 1) Gaussian normalization is applied to stabilize the search process. It corrects distribution shifts when noise candidates deviate from a standard Gaussian distribution during iterative updates. 2) The high-dimensional nature of the 3D search space increases computational complexity. To mitigate this, a singular value decomposition-based compression technique is employed to reduce dimensionality while preserving effective search directions. 3) To further prevent convergence to suboptimal local minima, a singular space reset mechanism dynamically updates the search space based on diversity measures. Extensive experiments demonstrate that ITS3D enhances text-to-3D generation quality, which shows the potential of computationally efficient search methods in generative processes. The source code is available at https://github.com/ZhenglinZhou/ITS3D.[185] Gaussians on Fire: High-Frequency Reconstruction of Flames
Jakob Nazarenus,Dominik Michels,Wojtek Palubicki,Simin Kou,Fang-Lue Zhang,Soren Pirk,Reinhard Koch
Main category: cs.CV
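TL;DR: Proposes a 3D-Gaussian-based spatiotemporal representation that reconstructs dynamic flames from a small number of views, combining multi-view stereo, monocular depth priors, and optical-flow fusion to obtain high-quality, temporally aligned reconstructions.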
Details
Motivation: 火焰具有高度动态性、透明性和高频细节,传统方法难以从有限视角重建其完整三维结构和运动细节。 Method: 将静态背景与动态火焰区域分离,利用多视角立体和单目深度先验提取几何;通过融合各视角的稠密光流构建3D流场初始化火焰;每个3D高斯包含生命周期和线速度以匹配光流并捕捉高频特征;采用自定义硬件同步确保跨相机亚帧级时间对齐。 Result: 在多个真实火焰场景中实现了鲁棒的定性和定量重建效果,能够准确还原复杂动态火焰的形状、运动和细节。 Conclusion: 该方法能够在仅使用三个摄像头和消费级硬件的情况下,高效且精确地重建动态火焰的三维时空结构,为低约束条件下的动态透明物体重建提供了有效解决方案。 Abstract: We propose a method to reconstruct dynamic fire in 3D from a limited set of camera views with a Gaussian-based spatiotemporal representation. Capturing and reconstructing fire and its dynamics is highly challenging due to its volatile nature, transparent quality, and multitude of high-frequency features. Despite these challenges, we aim to reconstruct fire from only three views, which consequently requires solving for under-constrained geometry. We solve this by separating the static background from the dynamic fire region by combining dense multi-view stereo images with monocular depth priors. The fire is initialized as a 3D flow field, obtained by fusing per-view dense optical flow projections. To capture the high frequency features of fire, each 3D Gaussian encodes a lifetime and linear velocity to match the dense optical flow. To ensure sub-frame temporal alignment across cameras we employ a custom hardware synchronization pattern -- allowing us to reconstruct fire with affordable commodity hardware. Our quantitative and qualitative validations across numerous reconstruction experiments demonstrate robust performance for diverse and challenging real fire scenarios.[186] RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
Xiyan Liu,Han Wang,Yuhu Wang,Junjie Cai,Zhe Cao,Jianzhong Yang,Zhen Lu
Main category: cs.CV
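TL;DR: Presents RoadSceneBench, a lightweight benchmark focused on mid-level road semantics for evaluating visual reasoning in complex road environments, together with the HRRP-T training framework, which improves the spatially coherent and temporally consistent reasoning of vision-language models.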
Details
Motivation: 现有数据集主要关注检测或分割等感知任务,缺乏对道路拓扑和动态场景结构推理能力的评估,因此需要一个强调关系理解和结构一致性的新基准。 Method: 提出RoadSceneBench基准,强调关系理解与结构一致性;设计HRRP-T训练框架,通过分层关系奖励传播和时序一致性来增强视觉语言模型的几何和时间推理能力。 Result: 实验表明该方法在多种道路配置下实现了最先进的性能,显著提升了模型在结构化场景理解上的表现。 Conclusion: RoadSceneBench为研究中层道路语义提供了紧凑而强大的基础,推动了结构感知的自动驾驶感知系统的发展。 Abstract: Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at https://github.com/XiyanLiu/RoadSceneBench.[187] Hybrid, Unified and Iterative: A Novel Framework for Text-based Person Anomaly Retrieval
Tien-Huy Nguyen,Huu-Loc Tran,Huu-Phong Phan-Nguyen,Quang-Vinh Dinh
Main category: cs.CV
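TL;DR: Proposes combining a Local-Global Hybrid Perspective (LHP) module with a Vision-Language Model (VLM), together with a Unified Image-Text (UIT) model and an iterative ensemble strategy, substantially improving text-based person anomaly retrieval.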
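The gains in this entry are reported as R@1, R@5, and R@10. As a reminder of what those numbers measure, here is a minimal sketch of Recall@K for text-to-image retrieval. It assumes exactly one correct gallery image per text query, which is a simplification and may not match the PAB evaluation protocol; the function and argument names are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose correct item appears among the top-k results.

    ranked_ids:   one ranked list of gallery ids per query, best match first.
    relevant_ids: the single correct gallery id for each query (an assumption).
    """
    hits = sum(
        1 for ranked, relevant in zip(ranked_ids, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_ids)
```

Because R@1 only counts a query when the correct image is ranked first, it is the strictest of the three values, which is consistent with the largest reported improvement appearing there.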
Details
Motivation: Existing methods rely on complex deep-learning techniques and struggle to extract fine-grained features effectively, which raises the question of how to optimize the model to obtain finer feature representations.
Method: An LHP module fuses fine-grained and coarse-grained features; a UIT model combines the ITC, ITM, MLM, and MIM objectives; a feature selection algorithm guided by LHP and an iterative ensemble strategy are also proposed.
Result: State-of-the-art performance on the PAB dataset, with improvements of 9.70% in R@1, 1.77% in R@5, and 1.01% in R@10.
Conclusion: By combining multi-scale features, multi-task learning, and an iterative ensemble strategy, the method significantly improves the accuracy and robustness of text-based person anomaly retrieval.
Abstract: Text-based person anomaly retrieval has emerged as a challenging task, with most existing approaches relying on complex deep-learning techniques. This raises a research question: How can the model be optimized to capture finer-grained features? To address this, we propose a Local-Global Hybrid Perspective (LHP) module integrated with a Vision-Language Model (VLM), designed to explore the effectiveness of incorporating fine-grained features alongside coarse-grained features. Additionally, we investigate a Unified Image-Text (UIT) model that combines multiple objective loss functions, including Image-Text Contrastive (ITC), Image-Text Matching (ITM), Masked Language Modeling (MLM), and Masked Image Modeling (MIM) loss. Beyond this, we propose a novel iterative ensemble strategy, which combines model results iteratively rather than simultaneously, as other ensemble methods do. To take advantage of the superior performance of the LHP model, we introduce a novel feature selection algorithm based on its guidance, which helps improve the model's performance. Extensive experiments demonstrate the effectiveness of our method in achieving state-of-the-art (SOTA) performance on the PAB dataset, compared with previous work, with a 9.70% improvement in R@1, 1.77% improvement in R@5, and 1.01% improvement in R@10.
[188] Rethinking Cross-Generator Image Forgery Detection through DINOv3
Zhenglin Huang,Jason Li,Haiquan Wen,Tianxiao Li,Xi Yang,Lu Qi,Bei Peng,Xiaowei Huang,Ming-Hsuan Yang,Guangliang Cheng
Main category: cs.CV
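TL;DR: Finds that frozen visual foundation models, especially DINOv3, already show strong cross-generator detection capability without fine-tuning, and proposes a training-free token selection method to further improve detection.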
Details
Motivation: 现有伪造图像检测方法依赖特定生成模型的伪影,难以泛化到未见生成器;本文旨在探索更具迁移性的检测机制。 Method: 从频率、空间和令牌三个角度分析基础模型的行为,提出基于全局低频结构的无训练令牌排序策略,并结合轻量线性探针选择与真实性相关的令牌子集。 Result: 该方法在多个数据集上显著提升了跨生成器的检测准确率,验证了基础模型利用弱但可迁移线索进行判断的能力。 Conclusion: 基础模型天然具备跨生成器检测潜力,利用其全局结构偏好可构建通用、高效且可解释的伪造检测新基线。 Abstract: As generative models become increasingly diverse and powerful, cross-generator detection has emerged as a new challenge. Existing detection methods often memorize artifacts of specific generative models rather than learning transferable cues, leading to substantial failures on unseen generators. Surprisingly, this work finds that frozen visual foundation models, especially DINOv3, already exhibit strong cross-generator detection capability without any fine-tuning. Through systematic studies on frequency, spatial, and token perspectives, we observe that DINOv3 tends to rely on global, low-frequency structures as weak but transferable authenticity cues instead of high-frequency, generator-specific artifacts. Motivated by this insight, we introduce a simple, training-free token-ranking strategy followed by a lightweight linear probe to select a small subset of authenticity-relevant tokens. This token subset consistently improves detection accuracy across all evaluated datasets. Our study provides empirical evidence and a feasible hypothesis for understanding why foundation models generalize across diverse generators, offering a universal, efficient, and interpretable baseline for image forgery detection.[189] AI killed the video star. Audio-driven diffusion model for expressive talking head generation
Baptiste Chopin,Tashvik Dhamija,Pranav Balaji,Yaohui Wang,Antitza Dantcheva
Main category: cs.CV
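TL;DR: Dimitra++ is an audio-driven talking head generation framework that uses a conditional Motion Diffusion Transformer (cMDT) with a 3D representation to model lip, expression, and head-pose motion, and outperforms existing methods on multiple datasets.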
Details
Motivation: 现有的说话人头生成方法在同步唇动、自然面部表情和头部姿态方面仍存在不足,需要更统一且高效的模型来提升真实感和表现力。 Method: 提出Dimitra++框架,采用条件运动扩散Transformer(cMDT),以参考人脸图像和音频序列为条件输入,使用3D面部表示建模面部运动序列。 Result: 在VoxCeleb2和CelebV-HQ数据集上的定量、定性实验及用户研究表明,Dimitra++在生成逼真的唇动、面部表情和头部姿态方面优于现有方法。 Conclusion: Dimitra++通过cMDT有效融合音频与外观信息,实现了高质量、多维度的面部运动控制,显著提升了音频驱动说话人头生成的真实性和性能。 Abstract: We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.[190] SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts
Shun Inadumi,Shohei Tanaka,Tosho Hirasawa,Atsushi Hashimoto,Koichiro Yoshino,Yoshitaka Ushiku
Main category: cs.CV
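TL;DR: Introduces SciPostGen, a large-scale dataset for generating poster layouts from scientific papers, and a retrieval-augmented poster layout generation framework that produces suitable layouts from the paper structure and user-specified constraints.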
Details
Motivation: 随着科学论文数量的增长,如何有效传达研究成果变得尤为重要。海报作为一种关键的展示媒介,其布局设计对研究传播效果有重要影响。然而,目前缺乏大规模配对标注的数据来理解论文内容与海报布局之间的关系。 Method: 构建了一个名为SciPostGen的大规模数据集,包含论文与其对应海报布局的配对信息;通过分析揭示论文结构与海报元素数量之间的关联;提出了一种检索增强的海报布局生成框架,利用检索到的相似布局作为生成指导,并支持用户设定的布局约束。 Result: 实验表明,所提出的检索器能准确估计与论文结构一致的布局,且该框架在有无布局约束的情况下均能生成符合要求的布局。 Conclusion: SciPostGen数据集和检索增强生成框架有助于提升科研成果通过海报形式的表达效果,为自动化海报设计提供了可行路径。 Abstract: As the number of scientific papers continues to grow, there is a demand for approaches that can effectively convey research findings, with posters serving as a key medium for presenting paper contents. Poster layouts determine how effectively research is communicated and understood, highlighting their growing importance. In particular, a gap remains in understanding how papers correspond to the layouts that present them, which calls for datasets with paired annotations at scale. To bridge this gap, we introduce SciPostGen, a large-scale dataset for understanding and generating poster layouts from scientific papers. Our analyses based on SciPostGen show that paper structures are associated with the number of layout elements in posters. Based on this insight, we explore a framework, Retrieval-Augmented Poster Layout Generation, which retrieves layouts consistent with a given paper and uses them as guidance for layout generation. We conducted experiments under two conditions: with and without layout constraints typically specified by poster creators. The results show that the retriever estimates layouts aligned with paper structures, and our framework generates layouts that also satisfy given constraints.[191] What Shape Is Optimal for Masks in Text Removal?
Hyakka Nakada,Marika Kubota
Main category: cs.CV
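TL;DR: Proposes a flexible mask modeling method based on Bayesian optimization for removing complex text from document images, and finds that character-wise masks that are not the minimum cover of the text region work better.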
Details
Motivation: 现有文本移除方法主要针对简单场景文本,缺乏对密集、复杂文本图像的有效处理,且对掩码形状敏感,需更实用的解决方案。 Method: 构建包含大量文本的基准数据集,提出可学习的灵活掩码模型,并利用贝叶斯优化调整掩码参数。 Result: 模型生成了字符级掩码,发现非最小覆盖的掩码效果更优,提升了实际文本移除任务的鲁棒性。 Conclusion: 精确的掩码形状调优对实际文本移除至关重要,该研究为手动掩码提供了用户友好的指导方向。 Abstract: The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, by removing specific text from document images, reconstructing original images is extremely important for industrial applications. However, most existing methods of text removal focus on deleting simple scene text which appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex and practical images with dense text. Therefore, we created benchmark data for text removal from images including a large amount of text. From the data, we found that text-removal performance becomes vulnerable against mask profile perturbation. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.[192] DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
Ahmad Mohammadshirazi,Pinaki Prasad Guha Neogi,Dheeraj Kulshrestha,Rajiv Ramnath
Main category: cs.CV
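TL;DR: Proposes DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher model to a deployable student vision-language model, balancing accuracy and efficiency on document visual question answering (DocVQA).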
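This entry reports answer quality as ANLS (Average Normalized Levenshtein Similarity), the usual DocVQA metric. The sketch below follows the commonly used definition with a 0.5 threshold; the lowercasing and whitespace stripping are a typical preprocessing choice assumed here, not something the abstract specifies, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity to any reference."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Answers that differ too much from the reference score zero.
            similarity = 1.0 - nl if nl < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)
```

The threshold gives partial credit for small recognition slips, such as a single wrong character, while scoring an unrelated answer as zero.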
Details
Motivation: 现有的DocVQA系统在准确性和效率之间存在明显权衡:大模型性能好但难以部署,小模型高效但定位能力差。因此需要一种能有效传递空间推理能力的知识蒸馏方法。 Method: 提出DocVAL框架,包含三个核心组件:1)基于验证时文本检测的教师监督以过滤噪声信号;2)多模块验证器(VAL)用于保证答案正确性和几何一致性,并提供像素级反馈;3)两阶段学生训练策略,先学习经验证的思维链轨迹,再通过VAL反馈进行迭代优化。 Result: 所提学生模型(Gemma-3 12B)作为纯VLM无需OCR即可达到91.4% ANLS和82.4% mAP;消融实验显示验证反馈带来6.3 mAP提升,迭代优化贡献9.7 mAP增益。此外发布了95k高质量、经验证的思维链数据集。 Conclusion: DocVAL有效提升了小型模型在文档理解中的空间推理与定位能力,显著缩小了师生模型差距,为高效部署高性能DocVQA系统提供了可行方案。 Abstract: Document visual question answering (DocVQA) requires models to jointly reason over textual content and spatial layout, yet current systems exhibit a sharp accuracy--efficiency trade-off: large teacher models achieve strong grounding but are too expensive for deployment, while compact students suffer substantial drops in localization performance. We propose DocVAL, a validated chain-of-thought distillation framework that transfers the spatial reasoning ability of a large teacher into a deployable student VLM through three key components: (1) teacher supervision with validation-time text detection to filter and denoise training signals, (2) a multi-module validator (VAL) that enforces answer correctness and geometric consistency while producing fine-grained, pixel-level error feedback, and (3) a two-stage student training scheme that first learns from validated CoT traces and then undergoes iterative refinement driven by VAL feedback. Our student (Gemma-3 12B) achieves 91.4\% ANLS and 82.4\% mAP on DocVQA as a pure VLM requiring no text detection or OCR at inference. Extensive ablations demonstrate that validated feedback contributes 6.3 mAP gain and iterative refinement accounts for 9.7 mAP improvement. We release 95k high-quality, validator-verified CoT traces to advance spatial reasoning research in document understanding.[193] CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving
Zhaohui Wang,Tengbo Yu,Hao Tang
Main category: cs.CV
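TL;DR: Proposes CoT4AD, a Vision-Language-Action (VLA) framework that introduces Chain-of-Thought (CoT) reasoning to strengthen the numerical and causal reasoning of vision-language models in autonomous driving, achieving state-of-the-art performance on real-world and simulated benchmarks.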
Details
Motivation: 现有VLA模型在自动驾驶中存在数值推理能力不足和输入输出映射过于简化的问题,难以应对需要逐步因果推理的复杂驾驶场景。 Method: 提出CoT4AD框架,结合视觉观测与语言指令进行语义推理、场景理解和轨迹规划;训练时显式建模感知-问题-预测-行动的思维链以对齐推理与动作空间,推理时采用隐式CoT实现一致的数值推理和鲁棒决策。 Result: 在nuScenes和Bench2Drive等真实与模拟基准上,CoT4AD在开环和闭环评估中均达到最先进性能。 Conclusion: CoT4AD通过引入显式与隐式思维链推理,有效提升了VLA模型在复杂自动驾驶场景中的数值与因果推理能力,展现出更强的决策鲁棒性和泛化性。 Abstract: Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.[194] Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration
Mengyu Yang,Yanming Yang,Chenyi Xu,Chenxi Song,Yufan Zuo,Tong Zhao,Ruibo Li,Chi Zhang
Main category: cs.CV
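TL;DR: Fast3Dcache is a training-free, geometry-aware caching framework that accelerates 3D diffusion model inference while preserving geometric consistency.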
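Geometric degradation in this entry is measured with Chamfer Distance. A minimal sketch of a symmetric Chamfer Distance between two point sets follows. Conventions differ between papers (squared versus unsquared distances, sum versus mean), so the choice here of unsquared Euclidean distance averaged per direction is an assumption and may not match the paper's evaluation code.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance in both directions."""
    def one_way(src, dst):
        # For every source point, find the distance to its closest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

This brute-force version is quadratic in the number of points; it is meant to show the definition, and a spatial index such as a k-d tree would be the usual choice for dense 3D shapes.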
Details
Motivation: 现有的基于缓存的方法在应用于3D扩散模型时会破坏几何一致性,因微小误差累积导致结构伪影和拓扑不一致。 Method: 提出预测性缓存调度约束(PCSC)动态分配缓存配额,并结合时空稳定性准则(SSC)根据速度和加速度选择稳定特征进行重用。 Result: 实验显示推理速度提升最高达27.12%,FLOPs减少54.8%,且几何质量退化极小(Chamfer Distance增加2.48%,F-Score下降1.95%)。 Conclusion: Fast3Dcache有效平衡了3D扩散模型的生成效率与几何保真度,适用于对结构一致性要求高的3D生成任务。 Abstract: Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.8% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).[195] Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior
Ruoyu Feng,Yunpeng Qi,Jinming Liu,Yixin Gao,Xin Li,Xin Jin,Zhibo Chen
Main category: cs.CV
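TL;DR: Proposes Diff-ICMH, a generative image compression framework serving both machine intelligence tasks and human visual perception, balancing semantic fidelity and visual quality through a semantic consistency loss and a tag guidance module.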
Details
Motivation: 传统图像压缩方法通常单独优化人类感知或机器分析任务,忽视了二者在语义信息保留上的共性需求。 Method: 利用生成先验提升感知真实性,引入语义一致性损失(SC loss)保证语义保真,并设计标签引导模块(TGM)利用图像级标签增强扩散模型的生成能力。 Result: 实验表明Diff-ICMH在多种智能任务中表现优越且具有良好的泛化能力,同时保持高质量的视觉体验。 Conclusion: Diff-ICMH实现了面向人类感知与机器分析的统一图像压缩框架,支持单一编解码器和码流服务于多任务,无需任务特定调整。 Abstract: Image compression methods are usually optimized isolatedly for human perception or machine analysis tasks. We reveal fundamental commonalities between these objectives: preserving accurate semantic information is paramount, as it directly dictates the integrity of critical information for intelligent tasks and aids human understanding. Concurrently, enhanced perceptual quality not only improves visual appeal but also, by ensuring realistic image distributions, benefits semantic feature extraction for machine tasks. Based on this insight, we propose Diff-ICMH, a generative image compression framework aiming for harmonizing machine and human vision in image compression. It ensures perceptual realism by leveraging generative priors and simultaneously guarantees semantic fidelity through the incorporation of Semantic Consistency loss (SC loss) during training. Additionally, we introduce the Tag Guidance Module (TGM) that leverages highly semantic image-level tags to stimulate the pre-trained diffusion model's generative capabilities, requiring minimal additional bit rates. Consequently, Diff-ICMH supports multiple intelligent tasks through a single codec and bitstream without any task-specific adaptation, while preserving high-quality visual experience for human perception. Extensive experimental results demonstrate Diff-ICMH's superiority and generalizability across diverse tasks, while maintaining visual appeal for human perception. Code is available at: https://github.com/RuoyuFeng/Diff-ICMH.[196] Bringing Your Portrait to 3D Presence
Jiawei Zhang,Lei Chu,Jiahao Li,Zhenyu Zang,Chong Li,Xiao Li,Xun Cao,Hao Zhu,Yan Lu
Main category: cs.CV
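TL;DR: Proposes a unified framework for reconstructing animatable 3D human avatars from a single portrait, using a Dual-UV representation, a synthetic data manifold, and robust proxy-mesh tracking to improve pose invariance, data scalability, and stability under occlusion.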
Details
Motivation: 现有方法在姿态敏感性、数据规模和代理网格估计可靠性方面存在瓶颈,难以实现从头部到全身的通用3D人像重建。 Method: 引入Dual-UV表示(Core-UV与Shell-UV)以消除姿态和构图导致的特征偏移;构建融合2D生成多样性与3D几何一致性的合成数据流形;设计鲁棒的代理网格追踪器以应对部分可见情况。 Result: 仅在半身合成数据上训练,即可实现头部和上半身重建的最先进性能,并在全身重建中表现具有竞争力;在野外场景中展现出强泛化能力。 Conclusion: 该方法通过特征解耦、数据增强和稳定追踪策略,实现了跨范围输入的高质量、可动画3D人像重建,具备良好的实际应用潜力。 Abstract: We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.[197] Text Condition Embedded Regression Network for Automated Dental Abutment Design
Mianjie Zheng,Xinquan Yang,Xuguang Li,Xiaoling Luo,Xuefen Liu,Kun Tang,He Meng,Linlin Shen
Main category: cs.CV
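TL;DR: Proposes TCEAD, a text-condition-embedded framework for automated dental implant abutment design that combines CLIP text encoding with the MeshMAE self-supervised model and adds a text-guided localization (TGL) module to locate the abutment region in oral scan data, improving design efficiency and adaptability.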
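Results for this entry are reported as Intersection over Union (IoU). The abstract does not say what the IoU is computed over, so the sketch below assumes the simplest case, two sets of element indices (for example, mesh faces labeled as abutment region). Both the set-based formulation and the names are illustrative assumptions.

```python
def intersection_over_union(predicted, target):
    """IoU between two collections of element indices."""
    predicted, target = set(predicted), set(target)
    union = predicted | target
    if not union:
        # Both regions are empty, so they agree perfectly by convention.
        return 1.0
    return len(predicted & target) / len(union)
```

An IoU of 1.0 means the predicted and reference regions coincide exactly, and 0.0 means they share no elements.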
Details
Motivation: Conventional abutment design is time-consuming and labor-intensive, and an ill-fitting abutment can cause implant complications; existing automated methods capture local fine-grained features poorly and struggle to locate the abutment region accurately.
Method: Builds on the MeshMAE self-supervised framework by adding a text-guided localization (TGL) module that embeds a description of the abutment region through the CLIP text encoder, and pre-trains the encoder on oral scan data to strengthen extraction of key local features (the width and height of the implant and the distance to the opposing tooth).
Result: On a large abutment design dataset, TCEAD improves IoU by 0.8%-12.85% over mainstream methods, locating the abutment region more accurately and producing better-fitting designs.
Conclusion: TCEAD is a novel and effective automated abutment design method that combines text guidance with self-supervised learning, markedly improving design accuracy and efficiency, with good prospects for clinical use.
Abstract: The abutment is an important part of artificial dental implants, whose design process is time-consuming and labor-intensive. Long-term use of inappropriate dental implant abutments may result in implant complications, including peri-implantitis. Using artificial intelligence to assist dental implant abutment design can quickly improve the efficiency of abutment design and enhance abutment adaptability. In this paper, we propose a text condition embedded abutment design framework (TCEAD), a novel automated abutment design solution. The proposed study extends the self-supervised learning framework of the mesh mask autoencoder (MeshMAE) by introducing a text-guided localization (TGL) module to facilitate abutment area localization. As the parameter determination of the abutment is heavily dependent on local fine-grained features (the width and height of the implant and the distance to the opposing tooth), we pre-train the encoder using oral scan data to improve the model's feature extraction ability. Moreover, considering that the abutment area is only a small part of the oral scan data, we designed a TGL module, which introduces the description of the abutment area through the text encoder of Contrastive Language-Image Pre-training (CLIP), enabling the network to quickly locate the abutment area. We validated the performance of TCEAD on a large abutment design dataset. Extensive experiments demonstrate that TCEAD achieves an Intersection over Union (IoU) improvement of 0.8%-12.85% over other mainstream methods, underscoring its potential in automated dental abutment design.
[198] Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Yifan Du,Kun Zhou,Yingqian Min,Yue Ling,Wayne Xin Zhao,Youbin Wu
Main category: cs.CV
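TL;DR: Studies how different Chain-of-Thought (CoT) designs affect generalizable visual reasoning in vision-language models and finds that concise CoT containing only minimal grounding steps generalizes best on maze solving and other tasks, revealing a "short is long" effect.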
Details
Motivation: Although chain-of-thought (CoT) data is widely used to supervise the intermediate reasoning of vision-language models, it remains unclear which CoT designs actually support generalizable visual reasoning, so a systematic evaluation of different CoT formats is needed.
Method: Compares three representative CoT formats on a controlled maze-solving benchmark: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Experiments use Qwen2.5-VL-7B under an SFT-then-RL pipeline, and the conclusions are extended to other vision tasks.
Result: Visual and longer CoT mainly speed up convergence without raising the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and CoT that keeps only the minimal grounding information generalizes best across maze sizes. The findings are also confirmed on other vision tasks.
Conclusion: Concise, minimal grounding CoT is the most beneficial for generalizable visual reasoning, while overly long or heavily visual CoT brings no benefit; the "short is long" effect offers practical guidance for building efficient SFT data.
Abstract: We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To systematically evaluate this, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.
[199] HarmoCLIP: Harmonizing Global and Regional Representations in Contrastive Vision-Language Models
Haoxi Zeng,Haoxuan Li,Yi Bin,Pengpeng Zeng,Xing Xu,Yang Yang,Heng Tao Shen
Main category: cs.CV
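TL;DR: Proposes HarmoCLIP, which harmonizes global and local representations in CLIP through fine-grained semantic supervision and a region-language alignment strategy, resolving the trade-off in which existing methods improve local perception at the cost of global coherence, and achieving significant gains on both image retrieval and bounding-box classification.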
Details
Motivation: CLIP has limited fine-grained semantic understanding because it lacks region-level supervision, and existing methods that strengthen local alignment disrupt global alignment, producing a trade-off between global and local performance.
Method: Proposes the HarmoCLIP framework, which introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, and designs a new Region-Language Alignment supervision strategy that strengthens local semantic learning while preserving global consistency.
Result: Achieves state-of-the-art performance on global tasks such as image retrieval (improvement up to 69.78%) and a 3.2% gain in Top-1 accuracy on local tasks such as bounding-box classification, outperforming prior methods.
Conclusion: HarmoCLIP effectively resolves the trade-off between global and local representations in CLIP, offering a balanced, efficient, and plug-and-play solution that combines fine-grained understanding with global semantic consistency.
Abstract: Contrastive Language-Image Pre-training (CLIP) has demonstrated remarkable generalization ability and strong performance across a wide range of vision-language tasks. However, due to the lack of region-level supervision, CLIP exhibits limited fine-grained semantic understanding. Although several methods attempt to mitigate this issue, they unintentionally disrupt the global alignment, resulting in a persistent trade-off where improving local perception simultaneously degrades global coherence. In this paper, we propose HarmoCLIP, a novel framework designed to harmonize global and region representations within CLIP. We first identify that the absence of direct alignment between local textual and visual semantics is the fundamental cause of the trade-off. To address this, HarmoCLIP introduces an explicit fine-grained semantic supervision term that directly aligns textual segments with their corresponding visual regions, effectively bridging the image region space and the textual space. To further strengthen the representation capability at the local level, our method introduces a novel Region-Language Alignment supervision strategy that promotes fine-grained semantic learning without compromising global semantic consistency. Extensive experiments demonstrate that HarmoCLIP achieves state-of-the-art (improvement up to 69.78%) performance on the global task of retrieval and yields a substantial 3.2% improvement in Top-1 accuracy on the region task of bounding-box classification, consistently outperforming prior approaches while providing a balanced, efficient, and plug-and-play solution to the global-local trade-off in CLIP. Code is available at https://github.com/Erosist/HarmoCLIP.
[200] AnoRefiner: Anomaly-Aware Group-Wise Refinement for Zero-Shot Industrial Anomaly Detection
Dayou Huang,Feng Xue,Xurui Li,Yu Zhou
Main category: cs.CV
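TL;DR: Proposes AnoRefiner, an anomaly-aware refiner that lifts the patch-level anomaly maps of zero-shot industrial anomaly detection (ZSAD) to the pixel level; by introducing anomaly score maps and a progressive group-wise test-time training strategy, it markedly improves performance on the MVTec AD and VisA datasets.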
Details
Motivation: Existing ZSAD methods produce coarse anomaly maps, and the gap between synthesized training anomalies and real ones makes it hard to recover fine-grained anomalies without missed detections. The authors observe that anomaly score maps provide spatial cues missing from ZSAD image features, which motivates the new design.
Method: Proposes the AnoRefiner framework with two core components: (1) an anomaly refinement decoder (ARD) that progressively enhances image features using anomaly score maps; and (2) a progressive group-wise test-time training (PGT) strategy that trains adaptively across product groups, reducing reliance on synthetic anomaly data and remaining compatible with ZSAD models in general.
Result: Experiments on the MVTec AD and VisA datasets show that AnoRefiner improves pixel-AP for a range of ZSAD models by up to 5.2%, and visualizations also show more precise anomaly localization.
Conclusion: AnoRefiner effectively improves the anomaly localization accuracy of ZSAD models, refining patch-level maps to the pixel level, with good generality and practical value.
Abstract: Zero-shot industrial anomaly detection (ZSAD) methods typically yield coarse anomaly maps as vision transformers (ViTs) extract patch-level features only. To solve this, recent solutions attempt to predict finer anomalies using features from ZSAD, but they still struggle to recover fine-grained anomalies without missed detections, mainly due to the gap between randomly synthesized training anomalies and real ones. We observe that anomaly score maps provide precisely the complementary spatial cues that are largely absent from ZSAD's image features, a fact overlooked before. Inspired by this, we propose an anomaly-aware refiner (AnoRefiner) that can be plugged into most ZSAD models and improve patch-level anomaly maps to the pixel level. First, we design an anomaly refinement decoder (ARD) that progressively enhances image features using anomaly score maps, reducing the reliance on synthetic anomaly data. Second, motivated by the mass production paradigm, we propose a progressive group-wise test-time training (PGT) strategy that trains ARD in each product group for the refinement process in the next group, while staying compatible with any ZSAD method. Experiments on the MVTec AD and VisA datasets show that AnoRefiner boosts various ZSAD models by up to a 5.2% gain in pixel-AP metrics, which can also be directly observed in many visualizations. The code will be available at https://github.com/HUST-SLOW/AnoRefiner.
[201] GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing
Xiaoyin Yang
Main category: cs.CV
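TL;DR: Proposes a high-precision eye-tracking framework: it builds GazeTrack, the first precise benchmark dataset covering a diverse population, and introduces a new shape error regularization and a coordinate transformation method, markedly improving the accuracy of pupil localization and gaze tracking.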
Details
Motivation: The gaze accuracy of current eye-tracking technology in virtual and augmented reality falls short of what spatial computing requires, and there is no high-quality benchmark dataset covering a diverse population.
Method: Designs a gaze data collection framework and uses high-precision equipment to build the GazeTrack dataset; proposes a shape error regularization method to improve pupil ellipse fitting; uses a coordinate transformation similar to "paper unfolding" to predict gaze vectors; and builds a low-complexity gaze vector generation model.
Result: Achieves lower gaze angle error on the GazeTrack dataset with lower computational complexity than other methods, and improves the accuracy of semantic segmentation and pupil position prediction.
Conclusion: The proposed methods and dataset markedly improve the precision and generalization of eye tracking, providing more reliable support for spatial computing in virtual and augmented reality.
Abstract: Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high-precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open-source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.
[202] MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory
Bo Wang,Jiehong Lin,Chenzhi Liu,Xinting Hu,Yifei Yu,Tianjia Liu,Zhongrui Wang,Xiaojuan Qi
Main category: cs.CV
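TL;DR: Proposes MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that combines global memory-guided planning with local geometry-enhanced control and achieves state-of-the-art zero-shot performance on multiple benchmarks.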
Details
Motivation: Existing visual navigation methods struggle in zero-shot settings to combine global planning toward long-horizon goals with local control in complex environments, and generalize poorly to dynamic changes and unseen scenes. A unified framework that integrates memory, spatial structure, and geometric perception is therefore needed.
Method: Proposes the MG-Nav framework, built around the Sparse Spatial Memory Graph (SMG), a region-centric memory that aggregates multi-view keyframes and semantic information. At the global level, an image-to-instance hybrid retrieval drives goal-conditioned path planning and yields a sequence of reachable waypoints. At the local level, a navigation foundation policy executes point-goal control with obstacle avoidance and switches to image-goal mode when approaching the target. A lightweight VGGT-adapter module aligns observation and goal features in a shared 3D-aware space.
Result: Experiments on the HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks show that MG-Nav significantly outperforms existing methods in the zero-shot setting and stays robust under dynamic rearrangements and unseen scene conditions.
Conclusion: By combining memory-guided global planning with geometry-enhanced local control, MG-Nav improves the performance and generalization of zero-shot visual navigation and suggests a new direction for autonomous navigation in complex environments.
Abstract: We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.
[203] Stable-Drift: A Patient-Aware Latent Drift Replay Method for Stabilizing Representations in Continual Learning
Paraskevi-Antonia Theofilou,Anuhya Thota,Stefanos Kollias,Mamatha Thota
Main category: cs.CV
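TL;DR: Proposes a latent drift-guided replay method that identifies and replays samples with high representational instability (latent drift), effectively mitigating catastrophic forgetting in cross-hospital continual learning on medical images.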
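As a rough illustration of drift-guided sample selection, the Python sketch below scores each sample by how far its features move after adaptation and keeps the highest-scoring slices per patient. It is a simplified reading, not the authors' implementation: the distance (Euclidean, summed over layers), the record layout with keys `patient`, `slice`, `before`, `after`, and the function names are all assumptions made for this example.

```python
import math
from collections import defaultdict

def latent_drift(before, after):
    """Sum over layers of the Euclidean distance between a sample's
    features before and after adaptation (assumed drift measure).
    `before` and `after` are lists of per-layer feature vectors."""
    return sum(math.dist(b, a) for b, a in zip(before, after))

def select_replay(samples, per_patient=1):
    """Group samples by patient and keep, for each patient, the
    `per_patient` slices with the largest latent drift."""
    by_patient = defaultdict(list)
    for s in samples:
        drift = latent_drift(s["before"], s["after"])
        by_patient[s["patient"]].append((drift, s["slice"]))
    buffer = {}
    for patient, scored in by_patient.items():
        scored.sort(reverse=True)  # largest drift first
        buffer[patient] = [slice_id for _, slice_id in scored[:per_patient]]
    return buffer

# Toy features: two layers per sample, hypothetical values.
samples = [
    {"patient": "p1", "slice": 0,
     "before": [[0.0, 0.0], [1.0, 1.0]], "after": [[0.0, 0.0], [1.0, 1.0]]},
    {"patient": "p1", "slice": 1,
     "before": [[0.0, 0.0], [1.0, 1.0]], "after": [[3.0, 4.0], [1.0, 1.0]]},
    {"patient": "p2", "slice": 0,
     "before": [[0.0], [0.0]], "after": [[1.0], [2.0]]},
]
replay_buffer = select_replay(samples)
print(replay_buffer)
```

Selecting per patient rather than globally keeps one unstable patient from filling the whole buffer, which is the diversity argument the entry makes for patient-level aggregation.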
Details
Motivation: Deep learning models trained sequentially on new tasks are prone to catastrophic forgetting, which is especially critical in medical imaging, where models must keep adapting to data from new hospitals without losing existing diagnostic ability. Existing methods struggle to balance old and new knowledge, so a more effective continual learning strategy is needed.
Method: Proposes a replay mechanism based on latent drift, which quantifies a sample's instability as the change in its internal feature representation after naive domain adaptation. Drift is aggregated at the patient level, and for each patient the slices showing the largest multi-layer representation shift are stored in the memory buffer for later replay, to improve diversity and clinical relevance.
Result: On a cross-hospital COVID-19 CT classification task evaluated with CNN and Vision Transformer backbones, the method substantially reduces forgetting compared with naive fine-tuning and random replay, improving retention on earlier tasks.
Conclusion: Latent drift is a practical and interpretable replay signal that can guide sample selection and advance robust continual learning in real-world medical settings.
Abstract: When deep learning models are sequentially trained on new data, they tend to abruptly lose performance on previously learned tasks, a critical failure known as catastrophic forgetting. This challenge severely limits the deployment of AI in medical imaging, where models must continually adapt to data from new hospitals without compromising established diagnostic knowledge. To address this, we introduce a latent drift-guided replay method that identifies and replays samples with high representational instability. Specifically, our method quantifies this instability via latent drift, the change in a sample's internal feature representation after naive domain adaptation. To ensure diversity and clinical relevance, we aggregate drift at the patient level: our memory buffer stores the per-patient slices exhibiting the greatest multi-layer representation shift. Evaluated on a cross-hospital COVID-19 CT classification task using state-of-the-art CNN and Vision Transformer backbones, our method substantially reduces forgetting compared to naive fine-tuning and random replay. This work highlights latent drift as a practical and interpretable replay signal for advancing robust continual learning in real-world medical settings.
[204] REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Fukun Yin,Shiyu Liu,Yucheng Han,Zhibo Wang,Peng Xing,Rui Wang,Wei Cheng,Yingming Wang,Aojie Li,Zixin Yin,Pengtao Chen,Xiangyu Zhang,Daxin Jiang,Xianfang Zeng,Gang Yu
Main category: cs.CV
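TL;DR: Proposes an image editing framework built on the reasoning capabilities of multimodal large language models (MLLMs), using a thinking-editing-reflection loop to improve editing accuracy and instruction understanding.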
Details
Motivation: Existing image editing models usually keep the MLLM frozen, which limits their understanding of abstract instructions and their editing accuracy; this work aims to overcome that limitation by unlocking the MLLM's reasoning capabilities.
Method: Introduces two reasoning mechanisms: a thinking mechanism that uses the MLLM's world knowledge to interpret abstract instructions, and a reflection mechanism that reviews editing results, automatically corrects unintended manipulations, and determines the stopping round, forming a closed thinking-editing-reflection loop.
Result: Experiments show significant gains over existing models on multiple benchmarks: compared with a DiT initialized from Step1X-Edit, ImgEdit improves by 4.3%, GEdit by 4.7%, and Kris by 8.2%, and the method surpasses previous open-source methods on GEdit and Kris.
Conclusion: Unlocking the reasoning capabilities of the MLLM effectively improves image editing models, and the proposed thinking-editing-reflection loop offers a new direction for future image editing systems.
Abstract: Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from the Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).
[205] GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes
Di Wang,Shunyu Liu,Wentao Jiang,Fengxiang Wang,Yi Liu,Xiaolei Qin,Zhiming Luo,Chaoyang Zhou,Haonan Guo,Jing Zhang,Bo Du,Dacheng Tao,Liangpei Zhang
Main category: cs.CV
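TL;DR: Proposes GeoZero, a framework that enables geospatial reasoning in multimodal large language models without predefined chain-of-thought supervision; by constructing two datasets and introducing the A^2GRPO algorithm, it lowers annotation cost while improving the diversity and accuracy of model reasoning.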
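For readers unfamiliar with the "group relative" part of the algorithm's name, the Python sketch below shows the advantage normalization commonly used in Group Relative Policy Optimization: several responses are sampled for one prompt and each reward is standardized against its own group. It covers only this generic step. How A^2GRPO anchors reasoning to the model's own answers is not specified in this listing and is deliberately not modelled; the function name, the `eps` constant, and the choice of population standard deviation are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of one group of sampled responses:
    advantage_i = (r_i - mean(r)) / (std(r) + eps).
    Population standard deviation is used here; implementations differ."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one question, rewarded 1 if correct, else 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct answers get positive advantage, wrong ones negative
```

Because the baseline is the group mean, no separate value network is needed, and a group in which every response earns the same reward contributes zero advantage.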
Details
Motivation: Existing remote sensing multimodal large language models are trained on manually annotated chain-of-thought data, which is costly and prone to introducing human bias, limiting the diversity of model reasoning.
Method: Proposes the GeoZero framework, comprising two datasets, GeoZero-Instruct and GeoZero-Hard, and combining supervised fine-tuning with reinforcement learning. It introduces Answer-Anchored Group Relative Policy Optimization (A^2GRPO), which regularizes the reasoning process with the model's own answers.
Result: Experiments on multiple remote sensing vision-language benchmarks show that GeoZero outperforms existing state-of-the-art methods and exhibits general emergent reasoning capabilities across diverse geospatial tasks.
Conclusion: GeoZero achieves geospatial reasoning without manual chain-of-thought annotation, reducing cost and bias while improving model performance and reasoning diversity.
Abstract: Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. However, this approach not only incurs substantial annotation costs but also introduces human biases that may limit the diversity of model reasoning. To address these challenges, we propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision. Specifically, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. Furthermore, we introduce Answer-Anchored Group Relative Policy Optimization (A$^2$GRPO), where the reasoning process is regularized by the model's own answers, encouraging diverse yet accurate thinking. Extensive experiments on multiple remote sensing vision-language benchmarks demonstrate that GeoZero not only surpasses existing state-of-the-art methods but also fosters universal emergent reasoning capabilities across diverse geospatial tasks. Code, data, and models will be publicly available at https://github.com/MiliLab/GeoZero.
[206] Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Dian Zheng,Manyuan Zhang,Hongyu Li,Kai Zou,Hongbo Liu,Ziyu Guo,Kaituo Feng,Yexin Liu,Ying Luo,Yan Feng,Peng Pei,Xunliang Cai,Hongsheng Li
Main category: cs.CV
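TL;DR: Proposes a training method for unified multimodal models that needs no model decoupling: an Attention Interaction Alignment (AIA) loss eases the conflict between generation and understanding tasks, improving the consistency of cross-modal attention behavior and overall performance.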
Details
Motivation: Unified multimodal models face inherently conflicting objectives between image generation and understanding. Existing methods often ease the conflict through model decoupling, but excessive decoupling harms the model's interleaved generation ability. This work therefore explores conflict mitigation that does not rely on decoupling.
Method: Analyzes cross-modal attention behavior and finds that decoupling drives models toward task-specific interaction patterns. Motivated by this, proposes the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training, and applies it to Emu3 and Janus-Pro at different training stages.
Result: The AIA loss refines cross-modal attention patterns and improves both image generation and understanding performance without model decoupling.
Conclusion: The AIA loss is a general and effective training mechanism that eases multi-task conflicts while keeping the model unified, offering a new direction for building stronger unified multimodal models.
Abstract: Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty in establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleave generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.
[207] VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Silin Cheng,Kai Han
Main category: cs.CV
TL;DR: 提出了一种变分多模态提示学习框架(VaMP),通过实例条件化的提示生成和不确定性建模,在少样本和领域泛化任务中实现了最先进的性能。
Details
Motivation: Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which makes it hard to capture instance-level variation and model uncertainty and limits adaptation to downstream tasks.
Method: Proposes the VaMP framework, which generates instance-conditioned prompts by sampling from a learned posterior distribution and introduces a class-aware prior derived from the instance representation and class prototype. Prompt tuning is formulated as variational inference over latent prompt representations and trained end-to-end through reparameterized sampling.
Result: Achieves state-of-the-art performance on few-shot learning and domain generalization benchmarks, confirming the value of modeling uncertainty and task structure.
Conclusion: By introducing variational inference and instance-specific prompt generation, VaMP improves the adaptation and generalization of multi-modal models in low-resource settings.
Abstract: Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp
[208] A deep learning perspective on Rubens' attribution
A. Afifi,A. Kalimullin,S. Korchagin,I. Kudryashov
Main category: cs.CV
TL;DR: 本研究利用深度学习技术,通过卷积神经网络对鲁本斯及其工作室的绘画作品进行认证与归属分析,取得了高分类准确率。
Details
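Motivation: To address the complex attribution of works by Rubens and his workshop and to support traditional art-historical methods.
Method: Trains a convolutional neural network on a curated dataset of verified and comparative artworks to identify micro-level stylistic features of the master's brushwork.
Result: The model shows high accuracy on the attribution classification task and can distinguish works by Rubens himself from those of his workshop.
Conclusion: Computational analysis can usefully complement the judgment of traditional art-historical experts and offers a new perspective on authorship and workshop collaboration.
Abstract: This study explores the use of deep learning for the authentication and attribution of paintings, focusing on the complex case of Peter Paul Rubens and his workshop. A convolutional neural network was trained on a curated dataset of verified and comparative artworks to identify micro-level stylistic features characteristic of the master's hand. The model achieved high classification accuracy and demonstrated the potential of computational analysis to complement traditional art historical expertise, offering new insights into authorship and workshop collaboration.
[209] Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield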
Dongyang Liu,Peng Gao,David Liu,Ruoyi Du,Zhen Li,Qilong Wu,Xin Jin,Sihan Cao,Shifeng Zhang,Hongsheng Li,Steven Hoi
Main category: cs.CV
TL;DR: 本文挑战了扩散模型蒸馏中分布匹配是核心机制的传统观点,提出CFG增强(CA)才是Few-step蒸馏的主要驱动力,而分布匹配(DM)仅起正则化作用,并据此提出改进的蒸馏方法,取得了更好的生成效果。
Details
Motivation: In complex tasks such as text-to-image generation, the success of Distribution Matching Distillation (DMD) is conventionally attributed to the student matching the teacher's output distribution. The authors find this understanding inaccurate, especially when CFG is used, and argue that the roles of the components of DMD need to be re-examined.
Method: Rigorously decomposes the DMD training objective to identify the distinct roles of the CFG Augmentation (CA) term and the Distribution Matching (DM) term; verifies through ablations that CA acts as the "engine" and DM as the "regularizer"; tests replaceability by substituting non-parametric or GAN-based objectives for DM; and proposes improvements such as decoupled noise schedules.
Result: Experiments show that CA is the core of the few-step distillation gains, while DM mainly stabilizes training and reduces artifacts; several alternative regularizers achieve similar effects; and a method designed on this understanding performs strongly in 8-step generation and has been adopted by the Z-Image project.
Conclusion: The success of DMD is mainly due to CFG augmentation rather than distribution matching, with DM serving only as a regularizer; this functional decoupling gives distillation methods more systematic design principles and advances efficient generative models.
Abstract: Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that in complex tasks like text-to-image generation, where CFG is typically required for desirable few-step performance, the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We demonstrate that this term acts as the core "engine" of distillation, while the Distribution Matching (DM) term functions as a "regularizer" that ensures training stability and mitigates artifacts. We further validate this decoupling by demonstrating that while the DM term is a highly effective regularizer, it is not unique; simpler non-parametric constraints or GAN-based objectives can serve the same stabilizing function, albeit with different trade-offs. This decoupling of labor motivates a more principled analysis of the properties of both terms, leading to a more systematic and in-depth understanding. This new understanding further enables us to propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains. Notably, our method has been adopted by the Z-Image ( https://github.com/Tongyi-MAI/Z-Image ) project to develop a top-tier 8-step image generation model, empirically validating the generalization and robustness of our findings.
[210] Emergent Extreme-View Geometry in 3D Foundation Models
Yiwen Zhang,Joseph Tung,Ruojin Cai,David Fouhey,Hadar Averbuch-Elor
Main category: cs.CV
TL;DR: 本文研究了3D基础模型(3DFMs)在极端视角下的几何理解能力,提出了一种轻量级对齐方法来优化其内部3D表示,并发布了新的基准数据集MegaUnScene。
Details
Motivation: To explore whether 3D foundation models can reason about geometry under unseen, extreme, non-overlapping views, and to improve their performance under these challenging conditions.
Method: Analyzes the internal representations of 3DFMs and finds an emergent understanding of extreme-view geometry; proposes a lightweight alignment scheme that fine-tunes only a small subset of backbone bias terms while leaving the decoder heads unchanged.
Result: The method substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality; a new benchmark, MegaUnScene, is built from Internet scenes with dedicated test splits.
Conclusion: 3D foundation models have already implicitly learned extreme-view geometric structure, and adjusting a small set of parameters is enough to strengthen their multi-view geometric reasoning, pointing toward adaptation methods that do not require retraining the whole model.
Abstract: 3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with dedicated test splits for both relative pose estimation and dense 3D reconstruction. All code and data will be released.
[211] Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
Shubhankar Borse,Phuc Pham,Farzad Farhadzadeh,Seokeon Choi,Phong Ha Nguyen,Anh Tuan Tran,Sungrack Yun,Munawar Hayat,Fatih Porikli
Main category: cs.CV
TL;DR: 提出Ar2Can,一种两阶段框架,通过解耦空间规划与身份渲染来生成多人体场景,在计数准确性和身份保持方面表现优异。
Details
Motivation: Existing text-to-image models often duplicate faces, confuse identities, or miscount people when generating multi-human scenes.
Method: Designs two modules: the Architect predicts a structured layout, and the Artist generates images guided by a spatially grounded face-matching reward that combines Hungarian spatial alignment with ArcFace identity similarity; the system is optimized with GRPO and synthetic data.
Result: Substantially improves count accuracy and identity preservation on MultiHuman-Testbench while maintaining high visual quality.
Conclusion: Ar2Can addresses the key challenges of multi-human generation and reaches strong performance using primarily synthetic data.
Abstract: Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.
[212] Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image Team,Huanqia Cai,Sihan Cao,Ruoyi Du,Peng Gao,Steven Hoi,Shijie Huang,Zhaohui Hou,Dengyang Jiang,Xin Jin,Liangchen Li,Zhen Li,Zhong-Yu Li,David Liu,Dongyang Liu,Junhan Shi,Qilong Wu,Feng Yu,Chi Zhang,Shifeng Zhang,Shilin Zhou
Main category: cs.CV
TL;DR: Z-Image是一个6B参数的高效图像生成模型,基于S3-DiT架构,通过全流程优化实现高性能且低成本的训练与推理,支持快速生成、编辑和双语文本渲染,性能媲美大型商业模型。
Details
Motivation: Existing high-performance image generation models are mostly closed-source or have very large parameter counts, making them hard to deploy on consumer-grade hardware; an efficient, open, and practical open-source alternative is missing.
Method: Proposes Z-Image, which adopts the Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture together with a curated data infrastructure and a streamlined training curriculum; few-step distillation with reward post-training yields Z-Image-Turbo, and an omni-pre-training paradigm is used to train Z-Image-Edit, an editing model with strong instruction-following ability.
Result: Training completes in about 314K H800 GPU hours (approx. $630K); Z-Image-Turbo reaches sub-second inference on an enterprise-grade H800 and runs on consumer-grade devices with less than 16GB of VRAM; on multiple metrics it matches or surpasses models such as Qwen-Image, Hunyuan-Image-3.0, FLUX.2, and Seedream 4.0, standing out in photorealistic generation and Chinese-English bilingual text rendering.
Conclusion: Z-Image shows that top-tier generation performance is achievable with markedly lower computational overhead, advancing efficient, open, and affordable image generation models; the code, weights, and online demo are publicly released.
Abstract: The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
[213] Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction
Boyao Zhou,Shunyuan Zheng,Zhanfeng Liao,Zihan Ma,Hanzhang Tu,Boning Liu,Yebin Liu
Main category: cs.CV
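TL;DR: Proposes Splat-SAP, a point-map-based method for novel view rendering of human-centered scenes from sparse binocular camera input, using a two-stage learning strategy to achieve high-quality, controllable free-viewpoint rendering.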
Details
Motivation: Gaussian Splatting performs well in novel view synthesis but usually needs dense input views and per-scene optimization; existing feed-forward methods require heavily overlapping input views and struggle with very sparse setups.
Method: Proposes a two-stage learning strategy. Stage 1 transforms the pixel-wise point map into real space through iterative affinity learning, which enables camera control. Stage 2 projects the point maps of the two input views onto the target view plane, refines the geometry via stereo matching, and anchors Gaussian primitives on it for rendering. The point map is trained in a self-supervised manner without 3D supervision, and stage 2 is supervised with a photometric loss.
Result: Validated on multi-view human-centered data, the method improves both the stability of point map reconstruction and the visual quality of free-viewpoint rendering.
Conclusion: Splat-SAP handles highly sparse input and achieves stable, high-quality novel view rendering of human-centered scenes without dense views or 3D annotations.
Abstract: We present Splat-SAP, a feed-forward approach to render novel views of human-centered scenes from binocular cameras with large sparsity. Gaussian Splatting has shown its promising potential in rendering tasks, but it typically necessitates per-scene optimization with dense input views. Although some recent approaches achieve feed-forward Gaussian Splatting rendering through geometry priors obtained by multi-view stereo, such approaches still require largely overlapped input views to establish the geometry prior. To bridge this gap, we leverage pixel-wise point map reconstruction to represent geometry which is robust to large sparsity for its independent view modeling. In general, we propose a two-stage learning strategy. In stage 1, we transform the point map into real space via an iterative affinity learning process, which facilitates camera control in the following. In stage 2, we project point maps of two input views onto the target view plane and refine such geometry via stereo matching. Furthermore, we anchor Gaussian primitives on this refined plane in order to render high-quality images. As a metric representation, the scale-aware point map in stage 1 is trained in a self-supervised manner without 3D supervision and stage 2 is supervised with photometric loss. We collect multi-view human-centered data and demonstrate that our method improves both the stability of point map reconstruction and the visual quality of free-viewpoint rendering.
[214] ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Alberto Compagnoni,Marco Morini,Sara Sarto,Federico Cocchi,Davide Caffagni,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara
Main category: cs.CV
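TL;DR: Proposes ReAG, a new reasoning-augmented multimodal RAG method that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages and uses reinforcement learning to strengthen reasoning, significantly outperforming prior methods on knowledge-intensive visual question answering.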
Details
Motivation: Existing multimodal large language models perform poorly on domain-specific or knowledge-intensive questions, and current retrieval-augmented methods suffer from low precision, noisy passages, and limited reasoning.
Method: ReAG combines coarse- and fine-grained retrieval and introduces a critic model to filter irrelevant passages; it adopts a multi-stage training strategy that uses reinforcement learning to enhance reasoning over retrieved content, with supervised fine-tuning used only as a cold start.
Result: Experiments on Encyclopedic-VQA and InfoSeek show that ReAG significantly outperforms prior methods in answer accuracy and provides interpretable reasoning grounded in the retrieved evidence.
Conclusion: ReAG effectively improves performance on knowledge-intensive VQA, achieving more accurate and interpretable answers through high-quality context retrieval and enhanced reasoning.
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: https://github.com/aimagelab/ReAG.
[215] All Centers Are at most a Few Tokens Apart: Knowledge Distillation with Domain Invariant Prompt Tuning
Amir Mohammad Ezzati,Alireza Malekhosseini,Armin Khosravi,Mohammad Hossein Rohban
Main category: cs.CV
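TL;DR: Proposes Domain Invariant Prompt Tuning (DIPT) for domain generalization in computational pathology; by distilling knowledge from a vision-language model and learning domain-invariant prompts, it improves model robustness and performance on multi-center data.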
Details
Motivation: Differences in staining protocols, scanner devices, and imaging settings cause significant domain shift in computational pathology; existing vision-language models are sensitive to prompts and lack semantic descriptors suited to pathology images, which limits their zero-shot performance.
Method: DIPT learns input tokens separately for each domain during knowledge distillation and averages them across domains to obtain domain-invariant continuous prompts; the student model uses these prompts to distill knowledge from the PLIP text encoder, aligning visual features with domain-invariant embeddings.
Result: On multiple computational pathology datasets, the method significantly outperforms existing state-of-the-art knowledge distillation methods and improves the average F1-score, confirming its effectiveness for domain generalization.
Conclusion: By learning domain-invariant continuous prompts, DIPT effectively improves the domain generalization of vision-language models in computational pathology and offers a practical route to deploying robust models on heterogeneous clinical data.
Abstract: Domain generalization is critical in computational pathology (CPath) due to inherent domain shifts caused by variations in staining protocols, scanner devices, and imaging settings across clinical centers. Vision-language models (VLMs), such as PLIP (a pathology-tuned CLIP trained on image-text pairs across diverse domains), serve as strong knowledge distillation sources. However, their zero-shot performance with predefined prompts remains limited due to sensitivity to prompt variations. Moreover, unlike natural images, histopathology centers lack semantic descriptors (e.g., 'sketch'), making it difficult to define domain-specific prompts for clinical centers. This requires a data-driven approach for learning domain-specific and ultimately class-generic continuous prompts. We propose Domain Invariant Prompt Tuning (DIPT) for the knowledge distillation process, a novel step that learns multiple input tokens for each domain. These tokens are trained separately for each domain and are averaged across domains, leading to domain-invariant prompts. Our student model then distills knowledge from PLIP's text encoder by leveraging the prompts learned by DIPT. This leads to alignment of visual features with domain-invariant embeddings, enhancing generalization by training on multiple domains. Our method significantly improves the average F1-score over existing state-of-the-art (SOTA) knowledge distillation approaches to domain generalization on histopathology datasets.
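The prompt-averaging step described in the abstract (tokens learned per domain, then averaged across domains) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `average_domain_prompts`, the nested-list layout (domain, token, embedding dimension), and the toy values are all hypothetical.

```python
from typing import List

Prompt = List[List[float]]  # one prompt: tokens x embedding dimensions


def average_domain_prompts(domain_prompts: List[Prompt]) -> Prompt:
    """Average per-domain continuous prompts element-wise across domains.

    Hypothetical helper: assumes every domain has the same number of
    tokens and the same embedding size.
    """
    if not domain_prompts:
        raise ValueError("need at least one domain prompt")
    n_domains = len(domain_prompts)
    n_tokens = len(domain_prompts[0])
    dim = len(domain_prompts[0][0])
    for prompt in domain_prompts:
        if len(prompt) != n_tokens or any(len(tok) != dim for tok in prompt):
            raise ValueError("all domain prompts must share one shape")
    # Mean over the domain axis for every (token, dimension) position.
    return [
        [sum(p[t][d] for p in domain_prompts) / n_domains for d in range(dim)]
        for t in range(n_tokens)
    ]


# Toy example (assumed values): 3 clinical centers, 2 tokens, 2 dimensions.
center_prompts = [
    [[1.0, 0.0], [2.0, 2.0]],
    [[3.0, 0.0], [4.0, 2.0]],
    [[5.0, 3.0], [0.0, 2.0]],
]
invariant_prompt = average_domain_prompts(center_prompts)
```

In a real pipeline the per-domain tokens would be learned tensors rather than fixed lists; the sketch only shows the averaging that turns them into a single shared prompt.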
This work helps pave the way for deploying robust CPath models in real-world clinical problems with heterogeneous data sources.
[216] MammoRGB: Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models
Jorge Alberto Garza-Abdala,Gerardo A. Fumagal-González,Daly Avendano,Servando Cardona,Sadam Hussain,Eduardo de Avila-Armenta,Jasiel H. Toscano-Martínez,Diana S. M. Rosales Gurmendi,Alma A. Pedro-Pérez,Jose Gerardo Tamez-Pena
Main category: cs.CV
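TL;DR: This study develops and evaluates a three-channel denoising diffusion probabilistic model (DDPM) for synthesizing single-breast dual-view mammograms and compares how different channel encodings affect image fidelity and cross-view consistency. Models based on sum or absolute-difference encodings perform better, and the generated images show good anatomical consistency, making them suitable for dataset augmentation.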
Details
Motivation: To improve the availability of dual-view mammography data for breast cancer screening, high-quality synthetic images are needed; existing methods still fall short in cross-view consistency and image realism, which motivates exploring better multi-channel diffusion model structures and encodings.
Method: A pretrained three-channel DDPM is fine-tuned on a private dataset of 11020 mammograms to generate paired CC and MLO views; three third-channel encodings are compared: sum, absolute difference, and zero channel; quantitative evaluation uses IoU, DSC, EMD, and KS tests, and a non-expert radiologist performs a visual Turing test to assess visual quality and consistency.
Result: The IoU and DSC distributions of synthetic images are close to those of real images (EMD = 0.020, KS = 0.077); models with sum or absolute-difference encodings are significantly better on IoU and DSC (p < 0.001); 6% to 8% of synthetic images show artifacts consistent with the training data; cross-view consistency is good.
Conclusion: Three-channel DDPMs can generate realistic and anatomically consistent dual-view mammograms, with potential for medical image data augmentation.
Abstract: Purpose: This study aims to develop and evaluate a three-channel denoising diffusion probabilistic model (DDPM) for synthesizing single-breast dual-view mammograms and to assess the impact of channel representations on image fidelity and cross-view consistency. Materials and Methods: A pretrained three-channel DDPM, sourced from Hugging Face, was fine-tuned on a private dataset of 11020 screening mammograms to generate paired craniocaudal (CC) and mediolateral oblique (MLO) views. Three third-channel encodings of the CC and MLO views were evaluated: sum, absolute difference, and zero channel. Each model produced 500 synthetic image pairs. Quantitative assessment involved breast mask segmentation using Intersection over Union (IoU) and Dice Similarity Coefficient (DSC), with distributional comparisons against 2500 real pairs using Earth Mover's Distance (EMD) and Kolmogorov-Smirnov (KS) tests. Qualitative evaluation included a visual Turing test by a non-expert radiologist to assess cross-view consistency and artifacts. Results: Synthetic mammograms showed IoU and DSC distributions comparable to real images, with EMD and KS values of 0.020 and 0.077, respectively. Models using sum or absolute difference encodings outperformed others in IoU and DSC (p < 0.001), though distributions remained broadly similar. Generated CC and MLO views maintained cross-view consistency, with 6 to 8 percent of synthetic images exhibiting artifacts consistent with those in the training data.
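The three third-channel encodings and the two overlap metrics named above can be illustrated with a small sketch. It is not the study's code: the helper names `third_channel` and `iou_and_dice`, the nested-list image layout, and the choice to apply no rescaling or clipping to the sum channel are assumptions made here for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # 2D grid of pixel intensities
Mask = List[List[int]]     # 2D binary mask (1 = breast, 0 = background)


def third_channel(cc: Image, mlo: Image, mode: str) -> Image:
    """Build a third channel from same-sized CC and MLO views.

    Assumption: values are combined pixel by pixel with no rescaling.
    """
    ops = {
        "sum": lambda a, b: a + b,
        "absdiff": lambda a, b: abs(a - b),
        "zero": lambda a, b: 0.0,
    }
    if mode not in ops:
        raise ValueError(f"unknown mode: {mode}")
    op = ops[mode]
    return [[op(a, b) for a, b in zip(row_cc, row_mlo)]
            for row_cc, row_mlo in zip(cc, mlo)]


def iou_and_dice(pred: Mask, ref: Mask) -> Tuple[float, float]:
    """Return (IoU, DSC) for two binary masks of identical shape."""
    inter = sum(p & r for rp, rr in zip(pred, ref) for p, r in zip(rp, rr))
    n_pred = sum(sum(row) for row in pred)
    n_ref = sum(sum(row) for row in ref)
    union = n_pred + n_ref - inter
    if union == 0:  # both masks empty: treated here as perfect agreement
        return 1.0, 1.0
    return inter / union, 2 * inter / (n_pred + n_ref)


# Toy 2x2 views and masks (assumed values).
cc_view = [[0.25, 0.75], [0.5, 0.5]]
mlo_view = [[0.25, 0.5], [0.5, 0.0]]
sum_channel = third_channel(cc_view, mlo_view, "sum")
diff_channel = third_channel(cc_view, mlo_view, "absdiff")
iou, dice = iou_and_dice([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

The two metrics are monotonically related (DSC = 2·IoU / (1 + IoU)), so they rank mask pairs identically; reporting both mainly aids comparison with other work.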
Conclusion: Three-channel DDPMs can generate realistic and anatomically consistent dual-view mammograms with promising applications in dataset augmentation.
[217] Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols
Sebastian Padó,Kerstin Thomas
Main category: cs.CV
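TL;DR: This paper studies how well current vision language models (VLMs) recognize emotional expression in artworks. The models recognize the emotional content and means of expression of concrete images fairly well, but still struggle with highly abstract or symbolic images and with symbol recognition, and they give inconsistent answers.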
Details
Motivation: To explore whether current vision language models can recognize complex and abstract emotional expression in artworks, and to evaluate their performance on questions at different levels.
Method: Three vision language models (Llava-Llama and two Qwen models) are asked four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols), followed by a qualitative expert evaluation.
Result: The models recognize image content, the emotions expressed, and how they are expressed fairly well, especially for concrete images; they perform poorly on highly abstract or symbolic images, have limited symbol recognition ability, and answer related questions inconsistently.
Conclusion: Although current VLMs show promise in understanding emotion in art images, they remain significantly limited in handling abstract and symbolic content and in maintaining consistent reasoning, and need further improvement.
Abstract: Emotions are a fundamental aspect of artistic expression. Due to their abstract nature, there is a broad spectrum of emotion realization in artworks. These are subject to historical change and their analysis requires expertise in art history. In this article, we investigate which aspects of emotional expression can be detected by current (2025) vision language models (VLMs). We present a case study of three VLMs (Llava-Llama and two Qwen models) in which we ask these models four sets of questions of increasing complexity about artworks (general content, emotional content, expression of emotions, and emotion symbols) and carry out a qualitative expert evaluation. We find that the VLMs recognize the content of the images surprisingly well and often also which emotions they depict and how they are expressed. The models perform best for concrete images but fail for highly abstract or highly symbolic images. Reliable recognition of symbols remains fundamentally difficult. Furthermore, the models continue to exhibit the well-known LLM weakness of providing inconsistent answers to related questions.
[218] Fusion or Confusion? Assessing the impact of visible-thermal image fusion for automated wildlife detection
Camille Dionne-Pierre,Samuel Foucher,Jérôme Théau,Jérôme Lemaître,Patrick Charbonneau,Maxime Brousseau,Mathieu Varin
Main category: cs.CV
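TL;DR: This study uses synchronous aerial visible (VIS) and thermal infrared (TIR) imagery with a YOLO11n model to evaluate multimodal deep learning for automated detection of great blue heron individuals and nests. Comparing early and late fusion, both outperform the VIS-only model; late fusion raises the F1 score of the occupied-nest class from 90.2% to 93.0% and can effectively identify false positives. However, the limited TIR field of view and alignment issues cause some data loss.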
Details
Motivation: To improve the efficiency of wildlife monitoring, the study explores the potential of fusing multi-source remote sensing imagery (visible and thermal infrared) for automated detection of species and their habitats, overcoming the limited accuracy of single-source detection.
Method: Synchronously acquired aerial VIS and TIR images are automatically aligned with a deep learning model; principal component analysis (PCA) is applied for early fusion and a classification and regression tree for late fusion, each combined with YOLO11n detection and compared against a VIS-only model.
Result: Fusion improves the F1 score across all classes; late fusion brings the occupied-nest F1 score to 93.0% (2.8 percentage points above VIS-only) and identifies false positives from both sources with 90% recall. However, the smaller TIR field of view and alignment constraints make part of the data unusable.
Conclusion: Fusing visible and thermal infrared imagery helps improve automated wildlife detection accuracy, with late fusion performing best; practical use must address the TIR field of view and image alignment, and very high-resolution visible sensors could support operational monitoring in the future.
Abstract: Efficient wildlife monitoring methods are necessary for biodiversity conservation and management. The combination of remote sensing, aerial imagery and deep learning offers promising opportunities to renew or improve existing survey methods. The complementary use of visible (VIS) and thermal infrared (TIR) imagery can add information compared to a single-source image and improve results in an automated detection context. However, the alignment and fusion process can be challenging, especially since visible and thermal images usually have different fields of view (FOV) and spatial resolutions. This research presents a case study on the great blue heron (Ardea herodias) to evaluate the performance of synchronous aerial VIS and TIR imagery to automatically detect individuals and nests using a YOLO11n model. Two VIS-TIR fusion methods were tested and compared: an early fusion approach and a late fusion approach, to determine if the addition of the TIR image gives any added value compared to a VIS-only model. VIS and TIR images were automatically aligned using a deep learning model. A principal component analysis fusion method was applied to VIS-TIR image pairs to form the early fusion dataset. A classification and regression tree was used to process the late fusion dataset, based on the detection from the VIS-only and TIR-only trained models. Across all classes, both late and early fusion improved the F1 score compared to the VIS-only model. For the main class, occupied nest, the late fusion improved the F1 score from 90.2% (VIS-only) to 93.0%.
This model was also able to identify false positives from both sources with 90% recall. Although fusion methods seem to give better results, this approach comes with a limiting TIR FOV and alignment constraints that eliminate data. Using an aircraft-mounted very high-resolution visible sensor could be an interesting option for operationalizing surveys.
[219] Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding
Anik De,Abhirama Subramanyam Penamakuri,Rajeev Yadav,Aditya Rathore,Harshiv Shah,Devesh Sharma,Sagar Agarwal,Pravin Kumar,Anand Mishra
Main category: cs.CV
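TL;DR: This paper introduces the Bharat Scene Text Dataset (BSTD), a large-scale benchmark for Indian language scene text recognition containing more than 100K words in 11 Indian languages and English, drawn from over 6,500 real scene images. The dataset supports multiple tasks and advances research in this area.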
Details
Motivation: Owing to script diversity, non-standard fonts, varying writing styles, and the lack of high-quality datasets and open-source models, Indian language scene text recognition remains an unsolved challenge, so a large-scale, diverse dataset is needed to advance the field.
Method: The authors build the Bharat Scene Text Dataset (BSTD), a large-scale, comprehensive dataset with more than 100K annotated words covering 11 Indian languages and English from over 6,500 real scene images, supporting text detection, script identification, cropped word recognition, and end-to-end recognition. Existing English scene text recognition models are fine-tuned and applied to the dataset.
Result: State-of-the-art models designed for English are evaluated on Indian languages; the results show that existing models still face challenges on Indian languages and also reveal opportunities for improvement. BSTD provides an important resource for Indian language scene text recognition.
Conclusion: BSTD fills the data gap in Indian language scene text recognition and provides an open, comprehensive benchmark for future research, with the potential to significantly advance this direction.
Abstract: Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD), a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.
[220] Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data
Mahdieh Behjat Khatooni,Mohsen Soryani
Main category: cs.CV
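TL;DR: Proposes a multimodal deep learning model combining a CNN, a Vision Transformer, and a BiLSTM that uses longitudinal MRI data and biomarkers for end-to-end prediction of progression from mild cognitive impairment (MCI) to Alzheimer's disease (AD), achieving 95.05% accuracy and outperforming existing methods.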
Details
Motivation: Alzheimer's disease is irreversible, so early prediction is crucial; because not all MCI patients progress to AD, accurately predicting MCI-to-AD conversion is both challenging and important.
Method: A hybrid architecture fusing convolutional neural networks (CNNs) and a Vision Transformer extracts local and global features from MRI images, and a bidirectional LSTM integrates imaging data from four timepoints with other non-image biomarkers to predict each individual's cognitive status at month 48.
Result: The multimodal model achieves an average prediction accuracy of 95.05% on the sMCI versus pMCI classification task, outperforming existing AD prediction studies.
Conclusion: The proposed method shows state-of-the-art performance in longitudinal AD prediction and confirms the effectiveness of combining spatial and temporal modeling for early detection of Alzheimer's disease.
Abstract: Alzheimer's disease (AD) is a prevalent neurodegenerative disorder that progressively impairs memory, decision-making, and overall cognitive function. As AD is irreversible, early prediction is critical for timely intervention and management. Mild Cognitive Impairment (MCI), a transitional stage between cognitively normal (CN) aging and AD, plays a significant role in early AD diagnosis. However, predicting MCI progression remains a significant challenge, as not all individuals with MCI convert to AD. MCI subjects are categorized into stable MCI (sMCI) and progressive MCI (pMCI) based on conversion status. In this study, we propose a generalized, end-to-end deep learning model for AD prediction using MCI cases from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our hybrid architecture integrates Convolutional Neural Networks and Vision Transformers to capture both local spatial features and global contextual dependencies from Magnetic Resonance Imaging (MRI) scans. To incorporate temporal progression, we further employ Bidirectional Long Short-Term Memory (BiLSTM) networks to process features extracted from four consecutive MRI timepoints along with other non-image biomarkers, predicting each subject's cognitive status at month 48. Our multimodal model achieved an average progression prediction accuracy of 95.05% between sMCI and pMCI, outperforming existing studies in AD prediction.
This work demonstrates state-of-the-art performance in longitudinal AD prediction and highlights the effectiveness of combining spatial and temporal modeling for the early detection of Alzheimer's disease.
[221] Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
Haruki Sakajo,Hiroshi Takato,Hiroshi Tsutsui,Komei Soda,Hidetaka Kamigaito,Taro Watanabe
Main category: cs.CV
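TL;DR: This study explores the potential of large-scale vision language models (LVLMs) to process dual video inputs from driver-facing and road-facing views and generate safe driving instructions; fine-tuned LVLMs perform well but still struggle to detect complex or subtle events.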
Details
Motivation: Ensuring driving safety requires monitoring both driver behavior and road conditions, which calls for models that can process synchronized multi-view video inputs.
Method: A dataset of synchronized driver-facing and road-facing videos is constructed, and pre-trained and fine-tuned LVLMs are evaluated on it.
Result: Pre-trained LVLMs have limited effectiveness, whereas fine-tuned LVLMs generate accurate, safety-aware driving instructions but still fall short in recognizing subtle or complex events.
Conclusion: Fine-tuning significantly improves LVLM performance in driving safety applications; further improvement is needed to handle detection in complex scenarios.
Abstract: Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety requires monitoring driver-facing views as well to detect risky events, such as mobile phone use while driving. Thus, the ability to process synchronized inputs from both driver-facing and road-facing cameras is necessary. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.
[222] World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Eunsu Kim,Junyeong Park,Na Min An,Junseong Kim,Hitesh Laxmichand Patel,Jiho Jin,Julia Kruk,Amit Agarwal,Srikant Panda,Fenal Ashokbhai Ilasariya,Hyunjung Shim,Alice Oh
Main category: cs.CV
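TL;DR: This paper proposes a new benchmark, CultureMix, to study how large vision-language models (LVLMs) behave in culture mixing scenarios, specifically in food-related visual question answering. Current models show identity confusion, background reliance, and inconsistent predictions when handling cross-cultural elements; strategies such as supervised fine-tuning improve their robustness.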
Details
Motivation: With globalization, elements from different cultures often appear together in a single visual scene. How existing LVLMs understand such culture mixing is unclear, so their perception abilities and limitations in multicultural scenes need systematic study.
Method: The authors build CultureMix, a food VQA benchmark of 23k images across four subtasks (food-only, food+food, food+background, food+food+background), using diffusion-generated, human-verified data; they evaluate 10 LVLMs and test three robustness strategies, in particular supervised fine-tuning on a diverse dataset.
Result: Accuracy drops by 14% on average when cultural backgrounds are added; models give inconsistent answers for the same food across different backgrounds, showing strong background reliance and loss of cultural identity; supervised fine-tuning markedly improves consistency and reduces background sensitivity.
Conclusion: Culture mixing is an important challenge for LVLMs, and current models fall short in preserving cultural identity and contextual robustness; training on diverse culture mixing data is an effective way to improve reliability in globalized real-world environments.
Abstract: In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning using a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
[223] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Yiming Chen,Junlin Han,Tianyi Bai,Shengbang Tong,Filippos Kokkinos,Philip Torr
Main category: cs.CV
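TL;DR: This paper presents CogIP-Bench, a benchmark for evaluating how well multimodal large language models (MLLMs) align with human perception of image cognitive properties, and improves the models' understanding of subjective cognitive properties through post-training; this ability transfers to creative tasks such as image generation.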
Details
Motivation: Existing MLLMs are good at recognizing image content but struggle to understand how an image feels to people (for example memorability, humor, or aesthetic appeal), and lack the ability to model human cognitive traits.
Method: A comprehensive benchmark, CogIP-Bench, is built to evaluate MLLMs on subjective cognitive properties, and a post-training method is proposed to strengthen the models' understanding of and alignment with these properties.
Result: Experiments show a significant gap between current MLLMs and human judgments on cognitive properties; post-training effectively improves alignment with human perception, and the ability transfers to downstream creative tasks such as image generation.
Conclusion: The paper provides methods for measuring and improving MLLMs' understanding of subjective image impressions, moving AI systems closer to human perception and toward more human-centric AI.
Abstract: While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.
[224] LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer
Kai Wang,Siyi Chen,Weicong Pang,Chenchen Zhang,Renjun Gao,Ziru Chen,Cheng Li,Dasa Gu,Rui Huang,Alexis Kai Hon Lau
Main category: cs.CV
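TL;DR: This paper proposes LC4-DViT, a high-resolution land-cover classification framework that combines text-guided generative data augmentation with a deformation-aware Vision Transformer. It mitigates annotation scarcity, class imbalance, and geometric distortion, achieves strong performance on the AID and SIRI-WHU datasets, and produces attention that aligns better with hydrologically meaningful structures.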
Details
Motivation: Land-cover classification of high-resolution remote sensing imagery suffers from scarce and imbalanced annotations and from geometric distortions, which limit existing models; a new method is needed that can generate high-quality training data and model both local geometry and global context.
Method: The LC4-DViT framework first uses GPT-4o to generate scene descriptions and, together with super-resolved exemplar images, synthesizes class-balanced, high-fidelity training images with a text-guided diffusion model; it then designs DViT, which combines a DCNv4 deformable convolutional backbone with a Vision Transformer to capture both fine-scale geometry and global context.
Result: On the eight-class AID task, DViT reaches 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Kappa, outperforming baselines such as ViT, ResNet50, and MobileNetV2; on a three-class SIRI-WHU subset it achieves 0.9333 OA, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability; GPT-4o-scored Grad-CAM heatmaps show that its attention focuses more on hydrology-related structures.
Conclusion: Combining description-driven generative data augmentation with a deformation-aware Transformer is an efficient and generalizable approach to high-resolution land-cover mapping and could support environmental monitoring and land management where annotations are scarce.
Abstract: Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID) (Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River), DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen's Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage.
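For readers less familiar with the three reported metrics, the sketch below shows how overall accuracy (OA), macro F1, and Cohen's Kappa follow from a single confusion matrix. It is a generic illustration using the standard definitions, not the paper's evaluation code; the function name `classification_metrics` and the toy two-class matrix are assumptions.

```python
from typing import Dict, List


def classification_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Compute OA, macro F1, and Cohen's kappa from a confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    diag = sum(cm[i][i] for i in range(k))
    row_sums = [sum(cm[i]) for i in range(k)]                        # TP + FN
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]   # TP + FP

    oa = diag / total

    # Per-class F1 = 2*TP / (2*TP + FP + FN), then an unweighted mean.
    f1_scores = []
    for i in range(k):
        denom = row_sums[i] + col_sums[i]
        f1_scores.append(2 * cm[i][i] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / k

    # Kappa corrects OA for the agreement expected by chance (pe).
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (oa - pe) / (1 - pe) if pe != 1 else float("nan")

    return {"oa": oa, "macro_f1": macro_f1, "kappa": kappa}


# Toy two-class example (assumed counts, unrelated to the paper's data).
metrics = classification_metrics([[8, 2], [1, 9]])
```

Because Kappa subtracts chance agreement, it is lower than OA on the same predictions, which matches the pattern in the reported numbers (0.9510 Kappa against 0.9572 OA).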
Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
[225] Captain Safari: A World Engine
Yu-Cheng Chou,Xingrui Wang,Yitong Li,Jiahao Wang,Hanting Liu,Cihang Xie,Alan Yuille,Junfei Xiao
Main category: cs.CV
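TL;DR: Proposes Captain Safari, a pose-conditioned world engine that generates long, 3D-consistent videos by retrieving from a persistent world memory, and performs well on complex outdoor scenes and aggressive 6-DoF camera trajectories.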
Details
Motivation: Existing world engines struggle with complex outdoor environments and large 6-DoF camera motion: they fail to maintain long-range geometric consistency, deviate from the target path, or produce overly conservative motion.
Method: Captain Safari, a pose-conditioned world engine, maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens that guide video generation, drawing on a persistent world memory to achieve stable 3D structure and precise camera control.
Result: On the new OpenSafari dataset, Captain Safari surpasses state-of-the-art methods in video quality, 3D consistency, and trajectory following: MEt3R drops from 0.3703 to 0.3690, AUC@30 rises from 0.181 to 0.200, and FVD is substantially lower; in a 50-participant, 5-way comparison, 67.6% of preferences favor the method.
Conclusion: Pose-conditioned world memory is an effective mechanism for long-horizon, controllable video generation, and OpenSafari is released as a new benchmark for future world-engine research.
Abstract: World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes.
Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
[226] Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Tianle Chen,Chaitanya Chakka,Arjun Reddy Akula,Xavier Thomas,Deepti Ghadiyaram
Main category: cs.CV
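TL;DR: This paper proposes MMA-Bench to evaluate the robustness of multimodal large language models (MLLMs) when modalities contradict each other. Existing MLLMs are brittle under audio-visual mismatch or misleading text, and a modality alignment tuning strategy is proposed to strengthen cross-modal reasoning.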
Details
Motivation: To investigate whether MLLMs remain robust when different modalities contradict each other, and to expose the brittleness of current models in multimodal reasoning.
Method: The authors build the MMA-Bench dataset, analyze model behavior with black-box and white-box interpretability techniques, and propose a modality alignment tuning strategy to optimize how modalities are weighted.
Result: Experiments show that existing MLLMs degrade markedly under mismatched audio-visual pairs and misleading text, while the proposed alignment tuning effectively improves multimodal grounding and reasoning.
Conclusion: Current MLLMs lack reliable cross-modal reasoning; targeted modality alignment training improves their robustness and offers a path toward more reliable multimodal models.
Abstract: Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-source MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multi-modal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.
[227] Breaking the Visual Shortcuts in Multimodal Knowledge-Based Visual Question Answering
Dosung Lee,Sangwon Jung,Boyoung Kim,Minyoung Kim,Sungyeon Kim,Junyoung Sung,Paul Hongsuck Seo
Main category: cs.CV
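TL;DR: Proposes RETINA, a new MKB-VQA benchmark that removes the visual shortcut problem, and MIMIR, a multi-image multimodal retrieval model that effectively improves question answering about related entities.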
Details
Motivation: Existing MKB-VQA benchmarks suffer from "visual shortcuts": the query image usually matches the primary subject of the target document, so models can do well using visual cues alone, which fails to reflect their true multimodal reasoning ability.
Method: An automated LLM-driven pipeline is built to create the RETINA benchmark, with 120k training samples and a 2k human-curated test set; the benchmark directs queries to secondary related entities in the document and pairs them with images of those entities, removing the visual shortcut. The MIMIR model is also proposed, which enriches document representations with images of multiple related entities.
Result: When evaluated on RETINA, existing models degrade significantly, showing that they rely on visual shortcuts; MIMIR performs better on RETINA because it uses multiple related images.
Conclusion: RETINA effectively exposes the reliance of existing MKB-VQA models on visual shortcuts, and MIMIR strengthens multimodal retrieval through multi-image augmentation, providing a more reliable benchmark and method for future research.
Abstract: Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts", as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed using an LLM-driven pipeline and consisting of 120k training samples and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e. related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting them with images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.
[228] Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
Keliang Liu,Zizhi Chen,Mingcheng Li,Jingqun Tang,Dingkang Yang,Lihua Zhang
Main category: cs.CV
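TL;DR: Proposes SLEUTH, a multi-agent framework for improving vision language models on long-document understanding; its coarse-to-fine collaboration reduces redundancy and strengthens extraction of key clues, significantly improving results on multiple benchmarks.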
Details
Motivation: Existing vision language models degrade on long documents because information is scattered and inputs are redundant; retrieval-augmented generation helps but still leaves substantial redundancy, so a more effective context distillation mechanism is needed.
Method: The SLEUTH framework comprises a retriever and four collaborating agents that follow a coarse-to-fine strategy: they identify key textual and visual clues in the retrieved pages, filter for salient visual evidence (such as tables and charts), analyze the question to devise a reasoning strategy, and synthesize a distilled multimodal context for the final prediction.
Result: SLEUTH achieves state-of-the-art performance on multiple long-document understanding benchmarks and is model-agnostic and scalable; ablation studies verify the effectiveness of each module and the benefit of the hierarchical refinement paradigm.
Conclusion: Through multi-agent collaboration and hierarchical refinement, SLEUTH addresses scattered and redundant information in long documents and significantly improves the accuracy and efficiency of document understanding.
Abstract: Document understanding is a long-standing practical task. Vision Language Models (VLMs) have gradually become a primary approach in this domain, demonstrating effective performance on single-page tasks. However, their effectiveness diminishes when handling long documents. In such scenarios, clues are often scattered across multiple pages and modalities, and redundancy from lengthy inputs can impair the model's judgment. While retrieval-augmented generation mitigates this issue by filtering for question-relevant content, the retrieved results still contain substantial redundancy. To address these limitations, we propose SLEUTH, a multi-agent framework. Concretely, SLEUTH orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy. It ultimately synthesizes a distilled, evidence-dense multimodal context to generate the final prediction. SLEUTH is model-agnostic and scalable. When paired with advanced VLM backbones, it consistently improves performance on multiple long-document benchmarks, achieving state-of-the-art results. Ablation studies verify each module's effectiveness and confirm the benefits of our hierarchical refinement paradigm.
[229] GLOW: Global Illumination-Aware Inverse Rendering of Indoor Scenes Captured with Dynamic Co-Located Light & Camera
Jiaye Wu,Saeed Hadadan,Geng Lin,Peihan Tu,Matthias Zwicker,David Jacobs,Roni Sengupta
Main category: cs.CV
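TL;DR: Proposes GLOW, a global illumination-aware inverse rendering framework that resolves the ambiguity between reflectance and lighting in indoor scenes, targeting the strong inter-reflections, dynamic shadows, and near-field lighting that arise in co-located light-camera setups.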
Details
Motivation: Inverse rendering of indoor scenes remains challenging because of the ambiguity between reflectance and lighting and the inter-reflections among multiple objects. Existing methods struggle with the complex lighting effects in co-located light-camera systems.
Method: GLOW combines a neural implicit surface representation with a neural radiance cache to approximate global illumination, and jointly optimizes geometry and reflectance using carefully designed regularization and initialization. It adds a dynamic radiance cache that adapts to lighting discontinuities caused by near-field motion, and a surface-angle-weighted radiometric loss that suppresses specular artifacts in flashlight captures.
Result: Experiments show that GLOW substantially outperforms prior methods in material reflectance estimation under both natural and co-located illumination.
Conclusion: GLOW addresses inverse rendering under co-located light-camera setups and improves the accuracy of reflectance and geometry reconstruction in indoor scenes.
Abstract: Inverse rendering of indoor scenes remains challenging due to the ambiguity between reflectance and lighting, exacerbated by inter-reflections among multiple objects. While natural illumination-based methods struggle to resolve this ambiguity, co-located light-camera setups offer better disentanglement as lighting can be easily calibrated via Structure-from-Motion. However, such setups introduce additional complexities like strong inter-reflections, dynamic shadows, near-field lighting, and moving specular highlights, which existing approaches fail to handle. We present GLOW, a Global Illumination-aware Inverse Rendering framework designed to address these challenges. GLOW integrates a neural implicit surface representation with a neural radiance cache to approximate global illumination, jointly optimizing geometry and reflectance through carefully designed regularization and initialization. We then introduce a dynamic radiance cache that adapts to sharp lighting discontinuities from near-field motion, and a surface-angle-weighted radiometric loss to suppress specular artifacts common in flashlight captures. Experiments show that GLOW substantially outperforms prior methods in material reflectance estimation under both natural and co-located illumination.
[230] CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
Fengyi Fang,Sicheng Yang,Wenming Yang
Main category: cs.CV
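TL;DR: Proposes CoordSpeaker, a framework for coordinated, text-driven co-speech gesture generation; it is the first to explore gesture captioning as a way to close the semantic gap, and it enables bidirectional gesture-text mapping.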
Details
Motivation: Existing methods lack descriptive text annotations and struggle to achieve coordinated multimodal control, which leaves a large semantic prior gap and inflexible control in co-speech gesture generation.
Method: Proposes a gesture captioning framework that uses a motion-language model to generate captions at multiple granularities, builds a conditional latent diffusion model on a unified cross-dataset motion representation, and designs a hierarchically controlled denoiser for coordinated control.
Result: Experiments show that the generated gestures are synchronized with speech rhythm and semantically consistent with arbitrary text, and that the method surpasses existing approaches in quality and efficiency.
Conclusion: CoordSpeaker addresses the semantic prior gap and the difficulty of coordinated multimodal control, advancing text-driven co-speech gesture generation.
Abstract: Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker is the first to explore gesture understanding and captioning to tackle the semantic gap in gesture generation, while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.
[231] Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis
Jungwoo Seo,David Keetae Park,Shinjae Yoo,Jiook Cha
Main category: cs.CV
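TL;DR: Proposes the first diffusion transformer for generating whole-brain 4D fMRI sequences conditioned on cognitive tasks, combining 3D VQ-GAN compression with a CNN-Transformer architecture and strong task conditioning via AdaLN-Zero and cross-attention; its performance is validated on HCP data.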
Details
Motivation: Conditional generation of 4D fMRI sequences is challenging because fMRI data are high-dimensional, BOLD signals are highly heterogeneous across subjects, and neuroscience-grounded evaluation is lacking.
Method: A diffusion-transformer generative model that uses a 3D VQ-GAN for latent compression, a CNN-Transformer backbone, and AdaLN-Zero with cross-attention for strong task conditioning.
Result: On HCP task fMRI, the model reproduces task activation maps, preserves the inter-task representational structure of real data (RSA of 0.98), achieves perfect condition specificity, and yields ROI time courses consistent with canonical hemodynamic responses. Task activation map correlation reaches 0.83, performance improves steadily with model scale, and the model beats a U-Net baseline on all metrics.
Conclusion: By combining latent diffusion, a scalable architecture, and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis and opens directions in virtual experiments, cross-site harmonization, and data augmentation for neuroimaging models.
Abstract: Generating whole-brain 4D fMRI sequences conditioned on cognitive tasks remains challenging due to the high-dimensional, heterogeneous BOLD dynamics across subjects/acquisitions and the lack of neuroscience-grounded validation. We introduce the first diffusion transformer for voxelwise 4D fMRI conditional generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning via AdaLN-Zero and cross-attention. On HCP task fMRI, our model reproduces task-evoked activation maps, preserves the inter-task representational structure observed in real data (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale, reaching task-evoked map correlation of 0.83 and RSA of 0.98, consistently surpassing a U-Net baseline on all metrics. By coupling latent diffusion with a scalable backbone and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis, paving the way for future applications such as virtual experiments, cross-site harmonization, and principled augmentation for downstream neuroimaging models.
[232] CNN-Based Framework for Pedestrian Age and Gender Classification Using Far-View Surveillance in Mixed-Traffic Intersections
Shisir Shahriar Arif,Md. Muhtashim Shahrier,Nazmul Haque,Md Asif Raihan,Md. Hadiuzzaman
Main category: cs.CV
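TL;DR: Proposes a deep learning framework that uses convolutional neural networks to classify pedestrian age group and gender from far-view video of urban intersections, without facial recognition or high-resolution imagery.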
Details
Motivation: Pedestrian safety monitoring systems in low- and middle-income countries rarely capture pedestrian demographics such as age and gender in real time, even though these factors significantly affect pedestrian vulnerability. A method is therefore needed that can identify pedestrian demographics in complex multimodal traffic.
Method: Two CNN architectures (ResNet50 and a custom lightweight CNN) are trained on video from three high-risk intersections in Dhaka, Bangladesh, to classify pedestrians into six classes (three age groups × two genders) from full-body visual cues; model variants with different pooling strategies and optimizers are compared.
Result: ResNet50 with Max Pooling and SGD achieves the highest accuracy, 86.19%; the custom lightweight CNN performs comparably (84.15%) with fewer parameters and faster training, and supports real-time inference.
Conclusion: The framework enables real-time monitoring of pedestrian demographics with existing surveillance infrastructure, providing data to support intersection design, signal-timing optimization, and safety interventions for vulnerable groups, and promoting inclusive transport planning in mixed-traffic environments.
Abstract: Pedestrian safety remains a pressing concern in congested urban intersections, particularly in low- and middle-income countries where traffic is multimodal and infrastructure often lacks formal control. Demographic factors like age and gender significantly influence pedestrian vulnerability, yet real-time monitoring systems rarely capture this information. To address this gap, this study proposes a deep learning framework that classifies pedestrian age group and gender from far-view intersection footage using convolutional neural networks (CNNs), without relying on facial recognition or high-resolution imagery. The classification is structured as a unified six-class problem, distinguishing adult, teenager, and child pedestrians for both males and females, based on full-body visual cues. Video data was collected from three high-risk intersections in Dhaka, Bangladesh. Two CNN architectures were implemented: ResNet50, a deep convolutional neural network pretrained on ImageNet, and a custom lightweight CNN optimized for computational efficiency. Eight model variants explored combinations of pooling strategies and optimizers. ResNet50 with Max Pooling and SGD achieved the highest accuracy (86.19%), while the custom CNN performed comparably (84.15%) with fewer parameters and faster training. The model's efficient design enables real-time inference on standard surveillance feeds. For practitioners, this system provides a scalable, cost-effective tool to monitor pedestrian demographics at intersections using existing camera infrastructure. Its outputs can shape intersection design, optimize signal timing, and enable targeted safety interventions for vulnerable groups such as children or the elderly. By offering demographic insights often missing in conventional traffic data, the framework supports more inclusive, data-driven planning in mixed-traffic environments.
[233] ClearGCD: Mitigating Shortcut Learning For Robust Generalized Category Discovery
Kailin Lyu,Jianwei He,Long Xiao,Jianing Zeng,Liang Fan,Lin Shu,Jie Hao
Main category: cs.CV
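TL;DR: Proposes ClearGCD, a framework that mitigates prototype confusion through semantic view alignment and shortcut suppression regularization, improving generalized category discovery.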
Details
Motivation: Existing generalized category discovery methods suffer from prototype confusion caused by shortcut learning, which harms both retention of known classes and generalization.
Method: Semantic View Alignment (SVA) performs cross-class patch replacement and enforces semantic consistency; Shortcut Suppression Regularization (SSR) maintains an adaptive prototype bank that aligns known classes and separates potential novel ones.
Result: ClearGCD consistently outperforms state-of-the-art methods on multiple benchmarks and integrates seamlessly into parametric GCD methods.
Conclusion: ClearGCD reduces reliance on non-semantic cues, mitigates prototype confusion, and strengthens category discovery in open-world scenarios.
Abstract: In open-world scenarios, Generalized Category Discovery (GCD) requires identifying both known and novel categories within unlabeled data. However, existing methods often suffer from prototype confusion caused by shortcut learning, which undermines generalization and leads to forgetting of known classes. We propose ClearGCD, a framework designed to mitigate reliance on non-semantic cues through two complementary mechanisms. First, Semantic View Alignment (SVA) generates strong augmentations via cross-class patch replacement and enforces semantic consistency using weak augmentations. Second, Shortcut Suppression Regularization (SSR) maintains an adaptive prototype bank that aligns known classes while encouraging separation of potential novel ones. ClearGCD can be seamlessly integrated into parametric GCD approaches and consistently outperforms state-of-the-art methods across multiple benchmarks.
[234] DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking
Weiran Li,Yeqiang Liu,Yijie Wei,Mina Han,Qiannan Guo,Zhenbo Li
Main category: cs.CV
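TL;DR: Proposes DM³T, a multimodal multi-object tracking framework that recasts the fusion of visible-light and thermal-infrared information as an iterative feature alignment process, improving tracking accuracy and temporal consistency.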
Details
Motivation: Existing methods struggle to bridge the non-linear distribution gap between visible-light and thermal-infrared features when fusing them, which can cause modality conflicts and hurt tracking performance.
Method: DM³T comprises a Cross-Modal Diffusion Fusion (C-MDF) module for iterative cross-modal harmonization, a Diffusion Refiner (DR) that enhances the unified feature representation, and a Hierarchical Tracker with adaptive confidence estimation, unifying detection, state estimation, and data association in one online tracker.
Result: Achieves 41.7 HOTA on the VT-MOT benchmark, a 1.54% relative improvement over the current best method.
Conclusion: Through diffusion-style iterative feature alignment, DM³T achieves deeper multimodal fusion and markedly improves the robustness and accuracy of multi-object tracking.
Abstract: Multi-object tracking (MOT) is a fundamental task in computer vision with critical applications in autonomous driving and robotics. Multimodal MOT that integrates visible light and thermal infrared information is particularly essential for robust autonomous driving systems. However, effectively fusing these heterogeneous modalities is challenging. Simple strategies like concatenation or addition often fail to bridge the significant non-linear distribution gap between their feature representations, which can lead to modality conflicts and degrade tracking accuracy. Drawing inspiration from the connection between multimodal MOT and the iterative refinement in diffusion models, this paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process to generate accurate and temporally coherent object trajectories. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. In this process, features from both modalities provide mutual guidance, iteratively projecting them onto a shared, consistent feature manifold. This enables the learning of complementary information and achieves deeper fusion compared to conventional methods. Additionally, we introduce a plug-and-play Diffusion Refiner (DR) to enhance and refine the unified feature representation. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation. DM$^3$T unifies object detection, state estimation, and data association into a comprehensive online tracking framework without complex post-processing. Extensive experiments on the VT-MOT benchmark demonstrate that our method achieves 41.7 HOTA, representing a 1.54% relative improvement over existing state-of-the-art methods. The code and models are available at https://vranlee.github.io/DM-3-T/.
[235] From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts
Weiran Li,Yeqiang Liu,Yijie Wei,Mina Han,Xin Liu,Zhenbo Li
Main category: cs.CV
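TL;DR: Proposes Points-to-Clouds (P2C), a framework that uses a diffusion-inspired dynamic denoising mechanism to move multimodal prompt learning from static point representations to semantic cloud distributions, improving how vision-language models generalize to novel classes.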
Details
Motivation: Existing MPL methods rely on a single static point representation, which overfits easily, generalizes poorly, and adapts badly to novel or ambiguous categories.
Method: P2C applies Dynamic Prompt Denoising (DPD), which adds annealed noise to text prompts, together with a V-L Mapper denoising loss that makes the mapper reconstruct clean visual prompts from noisy text, yielding robust cross-modal alignment.
Result: Experiments on 11 datasets show that P2C outperforms strong baselines, reaching a harmonic mean of 79.7% on the base-to-novel benchmark, a 1.4% relative improvement.
Conclusion: Learning a semantic cloud distribution rather than a single static point markedly improves the generalization and robustness of multimodal prompt learning; P2C offers a new paradigm for VLM adaptation.
Abstract: Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at https://vranlee.github.io/P2C/.
[236] Leveraging Textual Compositional Reasoning for Robust Change Captioning
Kyu Ri Park,Jiyoung Park,Seong Tae Kim,Hong Joo Lee,Jung Uk Kim
Main category: cs.CV
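TL;DR: CORTEX is a change captioning framework that combines visual and textual features, using vision-language models to extract scene-level textual knowledge and improve understanding of subtle but meaningful changes between images.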
Details
Motivation: Existing methods rely on visual features alone, which struggle to capture subtle but important changes and cannot explicitly represent object relationships or compositional semantics.
Method: CORTEX has three modules: an Image-level Change Detector, a Reasoning-aware Text Extraction (RTE) module, and an Image-Text Dual Alignment (ITDA) module. It combines pixel-level differences with VLM-generated text descriptions for fine-grained relational reasoning.
Result: CORTEX better understands and describes changes that are hard to identify from visual features alone, improving the accuracy and richness of change captions.
Conclusion: By introducing text-guided compositional reasoning, CORTEX strengthens change understanding and outperforms vision-only methods on change captioning.
Abstract: Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image-text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.
[237] See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection
YuEun Lee,Jung Uk Kim
Main category: cs.CV
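TL;DR: Proposes an important-word-aware approach to video moment retrieval and highlight detection that strengthens semantic understanding with multimodal large language models and significantly outperforms existing methods on both MR and HD.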
Details
Motivation: Existing methods treat the text query and video clips as a black box and overlook the importance of individual words, which limits contextual understanding.
Method: A feature enhancement module (FEM) identifies important words in the query, and a ranking-based filtering module (RFM) iteratively refines the video clips; multimodal large language models (MLLMs) improve semantic understanding of video-text scenes.
Result: Significantly outperforms existing state-of-the-art methods in experiments, improving both moment retrieval and highlight detection.
Conclusion: Fine-grained important-word identification and semantic enhancement improve the alignment between video content and natural-language queries.
Abstract: Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in video clips. However, existing methods overlook the importance of individual words, treating the entire text query and video clips as a black box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: https://github.com/VisualAIKHU/SRF.
[238] ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance
Congjia Chen,Shen Yan,Yufu Qu
Main category: cs.CV
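TL;DR: Proposes ViGG, a robust RGB-D point cloud registration method that improves registration through mutual guidance between visual and geometric information and outperforms state-of-the-art methods on multiple datasets.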
Details
Motivation: Existing registration methods rely mainly on geometric information, while RGB-D methods make limited use of image information and struggle to stay robust in complex scenes, so visual and geometric information need to be fused more effectively.
Method: ViGG (1) uses a geometric guidance design to solve clique alignment and suppress ambiguous cliques, and (2) proposes visual-guided geometric matching, which uses visual priors to determine the search space and extract high-quality, noise-insensitive correspondences.
Result: Experiments on 3DMatch, ScanNet, and KITTI show that ViGG outperforms recent registration methods in both learning-free and learning-based settings.
Conclusion: ViGG's mutual guidance between visual and geometric information markedly improves the robustness and accuracy of RGB-D point cloud registration and is broadly applicable.
Abstract: Point cloud registration is a fundamental task in 3D vision. Most existing methods only use geometric information for registration. Recently proposed RGB-D registration methods primarily focus on feature fusion or improving feature learning, which limits their ability to exploit image information and hinders their practical applicability. In this paper, we propose ViGG, a robust RGB-D registration method using mutual guidance. First, we solve clique alignment in a visual-geometric combination form, employing a geometric guidance design to suppress ambiguous cliques. Second, to mitigate accuracy degradation caused by noise in visual matches, we propose a visual-guided geometric matching method that utilizes visual priors to determine the search space, enabling the extraction of high-quality, noise-insensitive correspondences. This mutual guidance strategy brings our method superior robustness, making it applicable for various RGB-D registration tasks. The experiments on 3DMatch, ScanNet and KITTI datasets show that our method outperforms recent state-of-the-art methods in both learning-free and learning-based settings. Code is available at https://github.com/ccjccjccj/ViGG.
[239] NeuMatC: A General Neural Framework for Fast Parametric Matrix Operation
Chuan Wang,Xi-le Zhao,Zhilong Han,Liang Li,Deyu Meng,Michael K. Ng
Main category: cs.CV
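TL;DR: Proposes the Neural Matrix Computation Framework (NeuMatC) for efficient parametric matrix operations; it exploits low-rankness and continuity along the parameter dimension to compute faster than conventional methods.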
Details
Motivation: Conventional methods handle the matrix operation at each parameter value independently, ignoring the low-rankness and continuity of the results as the parameter varies, which causes heavy redundant computation.
Method: NeuMatC learns, without supervision, a low-rank and continuous mapping from parameters to matrix operation results, so that computation at any parameter needs only a few basic operations.
Result: Experiments on synthetic and real data show that, in wireless communication, NeuMatC achieves over 3× speedup in matrix inversion and over 10× speedup in SVD relative to NumPy while maintaining acceptable accuracy.
Conclusion: NeuMatC exploits the low-rankness and continuity of parametric matrix operations to improve computational efficiency substantially, and suits applications that perform matrix operations repeatedly.
Abstract: Matrix operations (e.g., inversion and singular value decomposition (SVD)) are fundamental in science and engineering. In many emerging real-world applications (such as wireless communication and signal processing), these operations must be performed repeatedly over matrices with parameters varying continuously. However, conventional methods tackle each matrix operation independently, underexploring the inherent low-rankness and continuity along the parameter dimension, resulting in significantly redundant computation. To address this challenge, we propose the Neural Matrix Computation Framework (NeuMatC), which elegantly tackles general parametric matrix operation tasks by leveraging the underlying low-rankness and continuity along the parameter dimension. Specifically, NeuMatC learns, in an unsupervised manner, a low-rank and continuous mapping from parameters to their corresponding matrix operation results. Once trained, NeuMatC enables efficient computations at arbitrary parameters using only a few basic operations (e.g., matrix multiplications and nonlinear activations), significantly reducing redundant computations. Experimental results on both synthetic and real-world datasets demonstrate the promising performance of NeuMatC, exemplified by over $3\times$ speedup in parametric inversion and $10\times$ speedup in parametric SVD compared to the widely used NumPy baseline in wireless communication, while maintaining acceptable accuracy.
[240] Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling
Minyoung Kim,Paul Hongsuck Seo
Main category: cs.CV
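TL;DR: Proposes ReImage, a neural watermarking-based image self-recovery framework that embeds a shuffled version of the image into itself as a watermark, enabling high-quality recovery of tampered images.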
Details
Motivation: Existing image self-recovery methods fail to recover tampered regions accurately and so cannot restore the true content.
Method: A generator produces watermarks suited to neural watermarking, and an image enhancement module refines the recovered result; the method adopts a shuffled-watermark strategy and resolves its key limitations.
Result: ReImage achieves state-of-the-art performance across diverse tampering scenarios and consistently produces high-quality recovered images.
Conclusion: ReImage improves the accuracy and robustness of image self-recovery and offers a practical response to digital media authenticity concerns in the AIGC era.
Abstract: The rapid growth of Artificial Intelligence-Generated Content (AIGC) raises concerns about the authenticity of digital media. In this context, image self-recovery, reconstructing original content from its manipulated version, offers a practical solution for understanding the attacker's intent and restoring trustworthy data. However, existing methods often fail to accurately recover tampered regions, falling short of the primary goal of self-recovery. To address this challenge, we propose ReImage, a neural watermarking-based self-recovery framework that embeds a shuffled version of the target image into itself as a watermark. We design a generator that produces watermarks optimized for neural watermarking and introduce an image enhancement module to refine the recovered image. We further analyze and resolve key limitations of shuffled watermarking, enabling its effective use in self-recovery. We demonstrate that ReImage achieves state-of-the-art performance across diverse tampering scenarios, consistently producing high-quality recovered images. The code and pretrained models will be released upon publication.
[241] Barcode and QR Code Object Detection: An Experimental Study on YOLOv8 Models
Kushagra Pandya,Heli Hathi,Het Buch,Ravikumar R N,Shailendrasinh Chauhan,Sushil Kumar Singh
Main category: cs.CV
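TL;DR: Evaluates the detection efficiency of YOLOv8 for barcode and QR code recognition. After training and tuning on Kaggle datasets, the Nano, Small, and Medium variants reach accuracies of 88.95%, 97.10%, and 94.10% respectively; the results indicate that model scaling significantly improves detection accuracy.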
Details
Motivation: To improve YOLOv8's accuracy and real-time performance in detecting specific objects such as barcodes and QR codes, and to examine how model size affects detection performance.
Method: The Nano, Small, and Medium variants of YOLOv8 are trained and fine-tuned on Kaggle barcode and QR code datasets and evaluated with precision, recall, and F1 score.
Result: YOLOv8-Small achieves the highest accuracy at 97.10%, against 88.95% for YOLOv8-Nano and 94.10% for YOLOv8-Medium; performance improves overall as model size grows, with the Small variant performing best.
Conclusion: YOLOv8 performs well on barcode and QR code detection; model scaling improves performance, and the Small variant offers the best balance of accuracy and efficiency, showing its potential for computer vision applications.
Abstract: This work presents an in-depth evaluation of the YOLOv8 (You Only Look Once) algorithm's efficiency in object detection, focusing specifically on barcode and QR code recognition. Building on YOLOv8's real-time detection capabilities, we conducted a study aimed at improving its ability to identify objects quickly and accurately. Through extensive training and fine-tuning on Kaggle datasets tailored for barcode and QR code detection, our goal was to optimize YOLOv8's overall performance across a range of situations and environments. The study assesses three YOLOv8 model variants, Nano, Small, and Medium, with close attention to precision, recall, and F1 metrics. The results show substantial improvements in object detection accuracy with each subsequent model refinement. Specifically, we achieved an accuracy of 88.95% for the Nano model, 97.10% for the Small model, and 94.10% for the Medium model, showcasing the incremental improvements achieved through model scaling. Our findings highlight the strides YOLOv8 has made in pushing the limits of computer vision, securing its role as a milestone in the field of object detection. This study sheds light on how model scaling affects object recognition, advancing the understanding of deep learning-based computer vision techniques.
[242] DenoiseGS: Gaussian Reconstruction Model for Burst Denoising
Yongsen Cheng,Yuanhao Cai,Yulun Zhang
Main category: cs.CV
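TL;DR: Proposes DenoiseGS, the first framework to use 3D Gaussian Splatting for efficient image denoising; a Gaussian self-consistency loss and a log-weighted frequency loss counter geometry degradation and detail loss under noisy inputs, and the method clearly surpasses existing approaches in both quality and speed.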
Details
Motivation: Existing burst denoising methods struggle with large motion or incur high computational cost; a framework is needed that is efficient and preserves detail.
Method: DenoiseGS introduces a Gaussian self-consistency (GSC) loss to regularize geometry predicted from noisy inputs and a log-weighted frequency (LWF) loss to strengthen supervision in the spectral domain and preserve high-frequency detail.
Result: Experiments show that DenoiseGS significantly surpasses state-of-the-art NeRF-based methods on denoising and on novel view synthesis under noisy conditions, with up to 250× faster inference.
Conclusion: By combining the efficiency of 3D Gaussian Splatting with new loss designs, DenoiseGS delivers fast, high-quality image denoising and a practical solution for handheld imaging.
Abstract: Burst denoising methods are crucial for enhancing images captured on handheld devices, but they often struggle with large motion or suffer from prohibitive computational costs. In this paper, we propose DenoiseGS, the first framework to leverage the efficiency of 3D Gaussian Splatting for burst denoising. Our approach addresses two key challenges when applying a feedforward Gaussian reconstruction model to noisy inputs: the degradation of Gaussian point clouds and the loss of fine details. To this end, we propose a Gaussian self-consistency (GSC) loss, which regularizes the geometry predicted from noisy inputs with high-quality Gaussian point clouds. These point clouds are generated from clean inputs by the same model that we are training, thereby alleviating potential bias or domain gaps. Additionally, we introduce a log-weighted frequency (LWF) loss to strengthen supervision within the spectral domain, effectively preserving fine-grained details. The LWF loss adaptively weights frequency discrepancies in a logarithmic manner, emphasizing challenging high-frequency details. Extensive experiments demonstrate that DenoiseGS significantly exceeds the state-of-the-art NeRF-based methods on both burst denoising and novel view synthesis under noisy conditions, while achieving 250$\times$ faster inference speed. Code and models are released at https://github.com/yscheng04/DenoiseGS.
[243] One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfe
Shijun Shi,Jing Xu,Zhihang Li,Chunli Peng,Xiaoda Yang,Lijing Lu,Kai Hu,Jiangning Zhang
Main category: cs.CV
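TL;DR: Proposes One-to-All Animation, a framework that addresses spatial misalignment of the reference in pose-driven character animation and generates high-quality animation for arbitrary layouts.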
Details
Motivation: Existing diffusion models work well when reference and pose are aligned but cannot handle spatial misalignment between the reference and the target pose or mismatched skeletal structures, which limits their applicability.
Method: Training is reformulated as a self-supervised outpainting task that unifies inputs with different layouts; a reference extractor and hybrid fusion attention handle partially visible and multi-resolution inputs; identity-robust pose control and a token replacement strategy improve generation quality and long-video coherence.
Result: Achieves high-quality character animation and image pose transfer across diverse layouts and resolutions; experiments show the method outperforms existing techniques.
Conclusion: One-to-All Animation offers a unified, efficient solution for reference inputs with arbitrary layouts and markedly improves the applicability and quality of pose-driven animation.
Abstract: Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned references, we reformulate training as a self-supervised outpainting task that transforms diverse-layout references into a unified occluded-input format. Second, to process partially visible references, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available at https://github.com/ssj9596/One-to-All-Animation.
[244] Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation
Taeyeong Kim,SeungJoon Lee,Jung Uk Kim,MyeongAh Cho
Main category: cs.CV
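TL;DR: FLEX-Seg is a domain generalization framework for semantic segmentation that exploits the inherent misalignment between generated images and semantic masks to improve robustness, significantly outperforming existing methods on multiple real-world datasets.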
Details
Motivation: To address the image-mask misalignment in diffusion-generated data for domain-generalized semantic segmentation, and to turn it into an opportunity for improving robustness.
Method: FLEX-Seg has three components: Granular Adaptive Prototypes that model boundary features at multiple scales, an Uncertainty Boundary Emphasis mechanism driven by prediction entropy, and a Hardness-Aware Sampling strategy that focuses on hard examples.
Result: Achieves mIoU gains of 2.44% and 2.63% on ACDC and Dark Zurich respectively, significantly outperforming current state-of-the-art methods.
Conclusion: Adaptively exploiting the misalignment in imperfect synthetic data improves the domain generalization of semantic segmentation models.
Abstract: Domain generalization in semantic segmentation faces challenges from domain shifts, particularly under adverse conditions. While diffusion-based data generation methods show promise, they introduce inherent misalignment between generated images and semantic masks. This paper presents FLEX-Seg (FLexible Edge eXploitation for Segmentation), a framework that transforms this limitation into an opportunity for robust learning. FLEX-Seg comprises three key components: (1) Granular Adaptive Prototypes that capture boundary characteristics across multiple scales, (2) Uncertainty Boundary Emphasis that dynamically adjusts learning emphasis based on prediction entropy, and (3) Hardness-Aware Sampling that progressively focuses on challenging examples. By leveraging inherent misalignment rather than enforcing strict alignment, FLEX-Seg learns robust representations while capturing rich stylistic variations. Experiments across five real-world datasets demonstrate consistent improvements over state-of-the-art methods, achieving 2.44% and 2.63% mIoU gains on ACDC and Dark Zurich. Our findings validate that adaptive strategies for handling imperfect synthetic data lead to superior domain generalization. Code is available at https://github.com/VisualScienceLab-KHU/FLEX-Seg.
[245] RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Haiyang Mei,Qiming Huang,Hai Ci,Mike Zheng Shou
Main category: cs.CV
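TL;DR: Proposes RobotSeg, a foundation model for robot segmentation in images and video; it improves on SAM 2 with a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy, achieving accurate, automatic, and label-efficient segmentation of complex robot embodiments.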
Details
Motivation: Modern segmentation models still struggle to segment robots because of embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes, so a foundation model dedicated to robot segmentation is needed.
Method: Building on the SAM 2 foundation model, three improvements are proposed: (1) a structure-enhanced memory associator that adapts to articulated robots; (2) a robot prompt generator that automates prompting; (3) a label-efficient training strategy that reduces reliance on per-frame annotation. A video robot segmentation (VRS) dataset of over 2.8k videos (138k frames) is also constructed.
Result: Experiments show that RobotSeg achieves state-of-the-art performance on robot segmentation in both images and video, significantly outperforming existing methods.
Conclusion: RobotSeg provides a structure-aware, automatic, and label-efficient solution for robot perception and lays a solid foundation for future robot vision research.
Abstract: Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, segmenting robots remains surprisingly challenging. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.
[246] Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records
Shiyu Shen,Zhe Gao,Taifeng Chai,Yang Huang,Bin Pan
Main category: cs.CV
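TL;DR: SolarCHIP is the first family of visual backbones contrastively pretrained on multi-instrument Solar Dynamics Observatory data; its multi-granularity contrastive learning framework improves downstream performance, especially in low-resource settings.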
Details
Motivation: Existing deep learning methods for solar image analysis mostly train from scratch or rely on models pretrained on natural images, ignoring the multimodality, weak inter-class separability, and strong intra-class variability of SDO data, which leads to inefficient feature extraction and poor label utilization. Method: SolarCHIP uses a multi-granularity contrastive pretraining framework: (1) aligning global class tokens of co-temporal AIA-HMI pairs to enhance temporal discrimination; (2) aligning local patch tokens at fixed spatial indices to obtain position-consistent, modality-invariant features; (3) contrasting patches at different locations within the same sample to preserve fine-grained spatial structure. Autoencoders based on CNN and Vision Transformer architectures are trained and applied to cross-modal translation and full-disk flare classification. Result: SolarCHIP reaches state-of-the-art performance on both downstream tasks, cross-modal translation and full-disk flare classification, with particularly large gains when labeled data is scarce; ablations confirm that each contrastive component is necessary at its granularity. Conclusion: SolarCHIP gives the heliophysics community a reusable, plug-and-play feature-extraction foundation model that lowers computational requirements, improves label efficiency, and advances multi-task solar image analysis. Abstract: Deep learning has revolutionized solar image analysis, yet most approaches train task-specific encoders from scratch or rely on natural-image pretraining that ignores the unique characteristics of Solar Dynamics Observatory (SDO) data. We introduce SolarCHIP, a family of contrastively pretrained visual backbones tailored to multi-instrument SDO observations. SolarCHIP addresses three key challenges in solar imaging: multimodal sensing across AIA and HMI instruments, weak inter-class separability due to slow temporal evolution, and strong intra-class variability with sparse activity signals. Our pretraining framework employs a multi-granularity contrastive objective that jointly aligns (1) global class tokens across co-temporal AIA-HMI pairs to enhance temporal discrimination, (2) local patch tokens at fixed spatial indices to enforce position-consistent, modality-invariant features, and (3) intra-sample patches across different spatial locations to preserve fine-grained spatial structure. We train both CNN- and Vision Transformer-based autoencoders and demonstrate their effectiveness on two downstream tasks: cross-modal translation between HMI and AIA passbands via ControlNet, and full-disk flare classification. Experimental results show that SolarCHIP achieves state-of-the-art performance across both tasks, with particularly strong gains in low-resource settings where labeled data is limited. Ablation studies confirm that each contrastive component contributes essential discriminative capacity at different granularities.
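The first term of the objective above, aligning global class tokens across co-temporal AIA-HMI pairs, is the kind of alignment often written as a symmetric InfoNCE loss. The sketch below is only an illustration under assumptions: the function names, the cosine similarity, and the temperature of 0.1 are hypothetical and are not taken from the SolarCHIP code.

```python
import math

def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def _diag_cross_entropy(logits):
    # Mean of -log softmax(row)[i]: row i should pick column i (its true pair).
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(aia_tokens, hmi_tokens, temperature=0.1):
    # aia_tokens[i] and hmi_tokens[i] are assumed to be global embeddings
    # of the same timestamp; all other pairings act as negatives.
    a = [_normalize(v) for v in aia_tokens]
    h = [_normalize(v) for v in hmi_tokens]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in h] for u in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))
```

Correctly paired embeddings should give a much smaller loss than mismatched ones; the local patch-token term could reuse the same function with patch embeddings at matching spatial indices.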
By publicly releasing pretrained weights and training code, we provide the heliophysics community with a practical, plug-and-play feature extractor that reduces computational requirements, improves label efficiency, and establishes a reusable foundation for diverse solar imaging applications.
[247] HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model
Chen Li,Eric Peh,Basura Fernando
Main category: cs.CV
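TL;DR: Proposes a hierarchical multimodal representation built from multi-view images and text descriptions that explicitly aligns with vision-language models to improve 3D scene understanding.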
Details
Motivation: Existing VLM-based methods implicitly align 3D scene features with the VLM embedding space; limited by scarce 3D data and complex spatial relationships, they perform poorly. Method: Multi-view images (a top-down view plus four directional views) and text descriptions containing objects' 3D coordinates are used to align with the VLM explicitly in the input space; a hierarchical feature representation aggregates patch-level features into view-level and scene-level representations, enabling reasoning over local and global context. Result: The method performs strongly on both situated and general 3D question-answering benchmarks, validating its effectiveness. Conclusion: Explicit input-space alignment and hierarchical multimodal representation significantly improve reasoning in 3D scene understanding tasks. Abstract: Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
[248] Taming the Light: Illumination-Invariant Semantic 3DGS-SLAM
Shouhe Zhang,Dayong Ren,Sensen Song,Yurong Qian,Zhenhong Jia
Main category: cs.CV
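TL;DR: Proposes an illumination-invariant semantic SLAM framework that combines proactive intrinsic appearance normalization with a dynamic radiance balancing loss, markedly improving the robustness of localization and mapping under extreme lighting.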
Details
Motivation: Extreme exposure reduces the accuracy of 3D reconstruction and semantic segmentation and is especially harmful to tightly coupled systems, so illumination invariance is needed for robustness. Method: Two key designs are proposed: (1) an Intrinsic Appearance Normalization (IAN) module that separates intrinsic properties such as scene albedo from lighting, giving each Gaussian primitive a stable color representation; (2) a Dynamic Radiance Balancing Loss (DRB-Loss) that activates when exposure is poor and guides optimization directly on the radiance field, preventing error accumulation. Result: Experiments on public datasets show state-of-the-art camera pose estimation, map quality, and geometric and semantic accuracy. Conclusion: The synergy of IAN and DRB-Loss gives the system unprecedented robustness and effectively addresses the impact of extreme lighting on semantic SLAM. Abstract: Extreme exposure degrades both the 3D map reconstruction and semantic segmentation accuracy, which is particularly detrimental to tightly-coupled systems. To achieve illumination invariance, we propose a novel semantic SLAM framework with two designs. First, the Intrinsic Appearance Normalization (IAN) module proactively disentangles the scene's intrinsic properties, such as albedo, from transient lighting. By learning a standardized, illumination-invariant appearance model, it assigns a stable and consistent color representation to each Gaussian primitive. Second, the Dynamic Radiance Balancing Loss (DRB-Loss) reactively handles frames with extreme exposure. It activates only when an image's exposure is poor, operating directly on the radiance field to guide targeted optimization. This prevents error accumulation from extreme lighting without compromising performance under normal conditions. The synergy between IAN's proactive invariance and DRB-Loss's reactive correction endows our system with unprecedented robustness. Evaluations on public datasets demonstrate state-of-the-art performance in camera tracking, map quality, and semantic and geometric accuracy.
[249] BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
Zeyu Zhang,Shuning Chang,Yuanyu He,Yizeng Han,Jiasheng Tang,Fan Wang,Bohan Zhuang
Main category: cs.CV
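TL;DR: This paper proposes BlockVid, a block diffusion framework for generating minute-long videos. A semantic-aware sparse KV cache, the Block Forcing training strategy, and chunk-wise noise scheduling mitigate long-horizon error accumulation and improve temporal consistency. The paper also introduces LV-Bench, a fine-grained benchmark, and experiments show clear gains over existing methods in video quality and long-range coherence.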
Details
Motivation: Existing semi-autoregressive (block diffusion) video generation models suffer from error accumulation caused by long-horizon KV caching and lack fine-grained benchmarks and metrics for evaluating long-video coherence. Method: The BlockVid framework uses a semantic-aware sparse KV cache to reduce cache footprint and error propagation, a Block Forcing training strategy to stabilize training, and chunk-wise noise scheduling and shuffling to improve temporal consistency; LV-Bench is built as a new long-video benchmark with new metrics for long-range coherence. Result: On VBench and LV-Bench, BlockVid outperforms existing methods at generating high-quality, coherent minute-long videos, improving VDE Subject by 22.2% and VDE Clarity by 19.4% on LV-Bench. Conclusion: BlockVid addresses error accumulation and the evaluation gap for block diffusion models in long-video generation, advancing minute-long video generation and providing a more reliable basis for world models and AI simulators. Abstract: Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid.
Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.
[250] McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning
Qiushi Yang,Yingjie Chen,Yuan Yao,Yifang Men,Huaizhuo Liu,Miaomiao Cui
Main category: cs.CV
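TL;DR: Proposes McSc, a three-stage reinforcement learning framework for preference alignment in text-to-video generation. By decomposing preference dimensions, reasoning hierarchically, and optimizing with dynamic re-weighting, it markedly improves motion dynamics and agreement with human preferences.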
Details
Motivation: Existing video preference alignment methods rely on expensive human annotation or inaccurate proxy metrics and ignore conflicts between preference dimensions (such as motion dynamics versus visual quality), which biases models toward low-motion content. Method: Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc) has three stages: (1) Self-critic Dimensional Reasoning (ScDR) builds a generative reward model that decomposes preferences and learns them reliably; (2) Hierarchical Comparative Reasoning (HCR) performs structured multi-dimensional preference comparison; (3) Motion-corrective Direct Preference Optimization (McDPO) optimizes the T2V model with a dynamic re-weighting strategy. Result: Experiments show that McSc aligns better with human preferences, generates videos with stronger motion dynamics, and mitigates the bias toward low-motion content. Conclusion: Through fine-grained preference modeling and a dynamic optimization strategy, McSc significantly improves alignment with complex human preferences in text-to-video generation, especially for motion dynamics. Abstract: Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or utilize proxy metrics to predict preference, which lack an understanding of human preference logic. Moreover, they usually directly align T2V models with the overall preference distribution, ignoring potential conflict dimensions like motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. Firstly, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Secondly, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structural multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, while dynamically re-weighting the alignment objective to mitigate bias towards low-motion content.
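For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard per-pair DPO loss that McDPO builds on, plus a scalar `weight` argument. The abstract does not say how the dynamic re-weighting is computed, so `weight` is only a hypothetical placeholder for it, and `beta=0.1` is an assumed default rather than a value from the paper.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1, weight=1.0):
    # Standard DPO: reward the policy for raising the preferred sample's
    # log-probability, relative to a frozen reference model, more than
    # the rejected sample's.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        nll = math.log1p(math.exp(-margin))
    else:
        nll = -margin + math.log1p(math.exp(margin))
    # `weight` is a hypothetical stand-in for a per-pair re-weighting factor,
    # e.g. one that up-weights pairs whose preferred video has more motion.
    return weight * nll
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy favors the preferred video more strongly than the reference does.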
Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high motion dynamics.
[251] Ovis-Image Technical Report
Guo-Hua Wang,Liangfu Cao,Tianyu Cui,Minghao Fu,Xiaohao Chen,Pengxin Zhan,Jianshan Zhao,Lan Li,Bowen Fu,Jiaqi Liu,Qing-Guo Chen
Main category: cs.CV
TL;DR: Ovis-Image is a 7B-parameter text-to-image model focused on high-quality text rendering; it runs efficiently on a single high-end GPU and rivals larger models.
Details
Motivation: To achieve high-quality text rendering under strict compute limits and narrow the gap between advanced text generation and practical deployment. Method: Built on the Ovis-U1 framework, the model combines a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone and uses a text-centric training pipeline with large-scale pre-training and careful post-training refinement. Result: Ovis-Image matches large open models such as Qwen-Image in text rendering quality and approaches closed-source systems such as Seedream and GPT4o; it supports bilingual text generation and can be deployed on a single GPU. Conclusion: Combining a strong multimodal backbone with a carefully designed text-oriented training strategy achieves reliable text rendering without very large or proprietary models. Abstract: We introduce Ovis-Image, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.
[252] Convolutional Feature Noise Reduction for 2D Cardiac MR Image Segmentation
Hong Zheng,Nan Mu,Han Su,Lin Feng,Xiaoning Li
Main category: cs.CV
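TL;DR: This paper proposes the Convolutional Feature Filter (CFF), a simple and effective method for denoising convolutional features. It treats convolutional features as Gaussian-distributed signal matrices and designs a low-amplitude pass filter to reduce noise, validates the filter on two 2D segmentation networks and two public cardiac MR datasets, and introduces a binarization equation that quantifies the information entropy of feature signals to assess the denoising effect.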
Details
Motivation: Noise in convolutional features is often overlooked and can trigger a butterfly effect in the feature system that harms downstream segmentation, so an effective denoising mechanism is needed. Method: Convolutional features are treated as Gaussian-distributed signal matrices and filtered with a low-amplitude pass filter (CFF) to reduce noise in the feature signals; a binarization equation computes the information entropy of feature signals so that the noise reduction can be quantified. Result: Experiments show that CFF reduces noise in the feature signal matrices, and the entropy calculations indicate more stable and ordered features. Conclusion: CFF is a simple, effective convolutional feature denoising module that can be integrated into existing segmentation networks to improve feature quality, which may in turn improve segmentation performance. Abstract: Noise reduction constitutes a crucial operation within Digital Signal Processing. Regrettably, it frequently remains neglected when dealing with the processing of convolutional features in segmentation networks. This oversight could trigger the butterfly effect, impairing the subsequent outcomes within the entire feature system. To fill this void, we consider convolutional features following Gaussian distributions as feature signal matrices and then present a simple and effective feature filter in this study. The proposed filter is fundamentally a low-amplitude pass filter primarily aimed at minimizing noise in feature signal inputs and is named Convolutional Feature Filter (CFF). We conducted experiments on two established 2D segmentation networks and two public cardiac MR image datasets to validate the effectiveness of the CFF, and the experimental findings demonstrated a decrease in noise within the feature signal matrices. To enable a numerical observation and analysis of this reduction, we developed a binarization equation to calculate the information entropy of feature signals.
[253] MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Yuta Oshima,Daiki Miyake,Kohsei Matsutani,Yusuke Iwasawa,Masahiro Suzuki,Yutaka Matsuo,Hiroki Furuta
Main category: cs.CV
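TL;DR: This paper proposes MultiBanana, a benchmark for evaluating multi-reference image generation models that covers challenging scenarios such as varying numbers of reference images, domain mismatch, scale mismatch, rare concepts, and multilingual textual references.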
Details
Motivation: Existing benchmark datasets cover multi-reference generation poorly and define tasks vaguely, making it hard to measure progress or identify weaknesses accurately. Method: MultiBanana systematically builds large-scale evaluation scenarios for five multi-reference-specific challenges and evaluates a wide range of text-to-image models. Result: Experiments reveal the strengths, typical failure modes, and areas for improvement of current models in multi-reference generation, supporting the benchmark's comprehensiveness and effectiveness. Conclusion: MultiBanana provides a standardized, extensible open benchmark for multi-reference image generation that supports progress in the field and fair comparison. Abstract: Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, the existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring the progress on how model performance advances or pointing out their weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their superior performances, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana .
[254] MIMM-X: Disentangling Spurious Correlations for Medical Image Analysis
Louisa Fay,Hajer Reguigui,Bin Yang,Sergios Gatidis,Thomas Küstner
Main category: cs.CV
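TL;DR: MIMM-X is a framework that separates causal features from multiple spurious correlations by minimizing mutual information, improving the generalization of deep learning models in medical imaging.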
Details
Motivation: Deep learning models for medical tasks are prone to spurious correlations (shortcut learning), which leads to poor generalization in new environments; in medical imaging, where several spurious correlations can coexist, this can cause serious misdiagnosis. Method: The MIMM-X framework minimizes the mutual information between causal features and multiple spurious correlations, disentangling the features so that the model predicts from true causal relationships. Result: Experiments on three datasets (UK Biobank, NAKO, CheXpert) and two imaging modalities (MRI, X-ray) show that MIMM-X effectively mitigates shortcut learning caused by multiple spurious correlations. Conclusion: MIMM-X improves the robustness and generalization of medical imaging models and reduces the risk of misclassification due to spurious correlations. Abstract: Deep learning models can excel on medical tasks, yet often experience spurious correlations, known as shortcut learning, leading to poor generalization in new environments. Particularly in medical imaging, where multiple spurious correlations can coexist, misclassifications can have severe consequences. We propose MIMM-X, a framework that disentangles causal features from multiple spurious correlations by minimizing their mutual information. It enables predictions based on true underlying causal relationships rather than dataset-specific shortcuts. We evaluate MIMM-X on three datasets (UK Biobank, NAKO, CheXpert) across two imaging modalities (MRI and X-ray). Results demonstrate that MIMM-X effectively mitigates shortcut learning of multiple spurious correlations.
[255] Guiding Visual Autoregressive Models through Spectrum Weakening
Chaoyang Wang,Tianmeng Yang,Jingdong Wang,Yunhai Tong
Main category: cs.CV
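TL;DR: This paper proposes a spectrum-weakening framework that requires no retraining or architectural changes and provides classifier-free guidance for visual autoregressive models; spectrum selection and renormalization strategies yield high-quality unconditional generation and condition alignment.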
Details
Motivation: Existing guidance mechanisms mostly depend on assumptions specific to diffusion models, and there is no general method for effective classifier-free guidance in visual autoregressive models. Method: A controllable weak model is constructed in the spectral domain; spectrum selection along the channel dimension controls how much information is retained, and two spectrum renormalization strategies ensure numerical stability. Result: The method is validated on discrete and continuous AR models with text or class conditioning, achieving high-quality unconditional generation and conditional generation with strong prompt alignment. Conclusion: The spectrum-weakening framework gives visual AR models a general, flexible guidance mechanism that needs no training changes and is free of the structural constraints of diffusion models. Abstract: Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of the spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensure numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.
[256] Optimizer Sensitivity In Vision Transformerbased Iris Recognition: Adamw Vs Sgd Vs Rmsprop
Moh Imam Faiz,Aviv Yuniar Rahman,Rangga Pahlevi Putra
Main category: cs.CV
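TL;DR: This study examines how different optimizers affect the accuracy and stability of Vision Transformer (ViT)-based iris recognition, with the aim of improving the robustness of biometric identification models.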
Details
Motivation: Although deep learning, and ViT in particular, has advanced visual recognition, the effect of optimizer choice on ViT-based iris recognition has not been studied thoroughly. Method: Several optimizers are evaluated experimentally on iris recognition with a ViT architecture, comparing accuracy and training stability. Result: Optimizer choice significantly affects the recognition accuracy and training stability of the ViT model, and some optimizers show better convergence and robustness. Conclusion: Optimizer selection is critical to the performance of ViT-based iris recognition, and a well-chosen optimizer improves both accuracy and stability. Abstract: The security of biometric authentication is increasingly critical as digital identity systems expand. Iris recognition offers high reliability due to its distinctive and stable texture patterns. Recent progress in deep learning, especially Vision Transformers (ViT), has improved visual recognition performance. Yet, the effect of optimizer choice on ViT-based biometric systems remains understudied. This work evaluates how different optimizers influence the accuracy and stability of ViT for iris recognition, providing insights to enhance the robustness of biometric identification models.
[257] MrGS: Multi-modal Radiance Fields with 3D Gaussian Splatting for RGB-Thermal Novel View Synthesis
Minseong Kweon,Janghyun Kim,Ukcheol Shin,Jinsun Park
Main category: cs.CV
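TL;DR: This paper proposes MrGS, a multi-modal radiance field method based on 3D Gaussian Splatting that reconstructs RGB and thermal infrared 3D scenes simultaneously and uses physical priors to improve reconstruction quality.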
Details
Motivation: Existing multi-modal rendering methods neglect the distinctive characteristics of thermal infrared imaging, such as heat conduction and the Lambertian property, so joint thermal and RGB reconstruction performs poorly. Method: MrGS separates RGB and thermal information from a single appearance feature through orthogonal feature extraction and uses view-dependent or view-independent embeddings according to each modality's degree of Lambertian reflectance; it also applies Fourier's law of heat conduction to model conduction between Gaussians, and the Stefan-Boltzmann law together with the inverse-square law to model depth-aware thermal radiation. Result: Experiments show that MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians required, improving efficiency and accuracy. Conclusion: Multi-modal 3D Gaussian Splatting with physical priors improves joint RGB and thermal reconstruction and offers a new direction for multi-modal scene modeling. Abstract: Recent advances in Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved considerable performance in RGB scene reconstruction. However, multi-modal rendering that incorporates thermal infrared imagery remains largely underexplored. Existing approaches tend to neglect distinctive thermal characteristics, such as heat conduction and the Lambertian property. In this study, we introduce MrGS, a multi-modal radiance field based on 3DGS that simultaneously reconstructs both RGB and thermal 3D scenes. Specifically, MrGS derives RGB- and thermal-related information from a single appearance feature through orthogonal feature extraction and employs view-dependent or view-independent embedding strategies depending on the degree of Lambertian reflectance exhibited by each modality. Furthermore, we leverage two physics-based principles to effectively model thermal-domain phenomena. First, we integrate Fourier's law of heat conduction prior to alpha blending to model intensity interpolation caused by thermal conduction between neighboring Gaussians. Second, we apply the Stefan-Boltzmann law and the inverse-square law to formulate a depth-aware thermal radiation map that imposes additional geometric constraints on thermal rendering. Experimental results demonstrate that the proposed MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians.
[258] JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization
Yunlong Lin,Linqing Wang,Kunjie Lin,Zixu Lin,Kaixiong Gong,Wenbo Li,Bin Lin,Zhenxi Li,Shiyi Zhang,Yuyang Peng,Wenxun Dai,Xinghao Ding,Chunyu Wang,Qinglin Lu
Main category: cs.CV
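TL;DR: JarvisEvo is a unified image editing agent that emulates an expert designer's loop of iterative editing, tool selection, result evaluation, and self-reflection. It introduces the iMCoT reasoning mechanism and the SEPO optimization framework to address instruction hallucination and reward hacking, and significantly outperforms existing methods on ArtEdit-Bench.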
Details
Motivation: Existing agent-based editing models suffer from two problems, instruction hallucination and reward hacking: text-only chain-of-thought reasoning cannot fully prevent factual errors, and static reward models are easily exploited by dynamic policies, which limits editing quality and reliability. Method: JarvisEvo combines interleaved multimodal chain-of-thought (iMCoT) reasoning, which strengthens instruction following and editing quality, with synergistic editor-evaluator policy optimization (SEPO), which enables self-improvement without external rewards, and integrates Adobe Lightroom to support global and local fine-grained editing. Result: On ArtEdit-Bench, JarvisEvo improves on Nano-Banana by 18.95% on average, with a gain of up to 44.96% in pixel-level content fidelity. Conclusion: Through multimodal reasoning and a self-optimization framework, JarvisEvo addresses instruction hallucination and reward hacking, significantly improving the accuracy and fine-grained control of image editing and advancing intelligent editing agents. Abstract: Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination: text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking: dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator policy optimization (SEPO) framework that enables self-improvement without external rewards, effectively mitigating reward hacking; and (3) support for both global and local fine-grained editing through seamless integration of Adobe Lightroom. On ArtEdit-Bench, JarvisEvo outperforms Nano-Banana by an average of 18.95% on preservative editing metrics, including a substantial 44.96% improvement in pixel-level content fidelity.
[259] From Illusion to Intention: Visual Rationale Learning for Vision-Language Reasoning
Changpeng Wang,Haozhe Wang,Xi Chen,Junhan Liu,Taofeng Xue,Chong Peng,Donglian Qi,Fangzhen Lin,Yunfeng Yan
Main category: cs.CV
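TL;DR: This paper introduces the concept of visual rationalization, which treats visual actions as core reasoning primitives, and proposes the ViRL framework, which uses end-to-end reinforcement learning to "get the right answer for the right visual reason".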
Details
Motivation: Existing vision-language models appear to reason visually, but their visual operations are often context-agnostic optional tools, so reasoning is not truly grounded in visual evidence; the result is an "illusion of thinking with images". Method: The ViRL framework has three parts: process supervision with ground-truth rationale chains, objective alignment through step-level reward shaping, and fine-grained credit assignment that distinguishes correct, redundant, and erroneous actions, integrating visual actions into end-to-end reinforcement learning. Result: ViRL achieves state-of-the-art performance on benchmarks spanning perception, hallucination, and reasoning, showing that visual rationalization improves model transparency and trustworthiness. Conclusion: Visual rationalization is a task-agnostic, process-oriented paradigm that helps build interpretable, verifiable vision-language models. Abstract: Recent advances in vision-language reasoning underscore the importance of thinking with images, where models actively ground their reasoning in visual evidence. Yet, prevailing frameworks treat visual actions as optional tools, boosting metrics but leaving reasoning ungrounded and crops ineffective. This gap gives rise to the illusion of thinking with images: models seem visually grounded but rely on context-agnostic actions that neither refine perception nor guide reasoning toward correct answers. We address this problem by reframing visual actions as core reasoning primitives rather than optional tools, which we term visual rationalization, the visual analogue of textual Chain-of-Thought. Building on this insight, we propose Visual Rationale Learning (ViRL), an end-to-end paradigm that grounds training in the visual rationale itself. ViRL integrates (1) Process Supervision with ground-truth rationales, (2) Objective Alignment via step-level reward shaping, and (3) Fine-Grained Credit Assignment to distinguish correct, redundant, and erroneous actions. By ensuring each action contributes meaningfully to the reasoning chain, ViRL enables models to "get the right answer for the right visual reason". Trained purely with end-to-end RL, ViRL achieves state-of-the-art results across benchmarks spanning perception, hallucination, and reasoning. This work establishes visual rationalization as a task-agnostic, process-grounded paradigm for building transparent, verifiable, and trustworthy vision-language models.
[260] Geometry-Consistent 4D Gaussian Splatting for Sparse-Input Dynamic View Synthesis
Yiwei Li,Jiannong Cao,Penghui Ruan,Divya Saxena,Songye Zhu,Yinfeng Cao
Main category: cs.CV
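TL;DR: This paper proposes GC-4DGS, a 4D Gaussian Splatting framework that introduces geometric consistency to achieve high-quality, real-time dynamic scene rendering from sparse input views, significantly outperforming existing methods on the N3DV and Technicolor datasets.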
Details
Motivation: Existing dynamic Gaussian Splatting methods degrade under sparse input views because 4D geometry is learned incoherently, which limits their use in practical settings such as AIoT. Method: A dynamic consistency checking strategy reduces the spatiotemporal estimation uncertainty of MVS, and a global-local depth regularization method distills spatiotemporally consistent geometric information from monocular depth, strengthening coherent geometry and appearance learning within the 4D volume. Result: Experiments on the N3DV and Technicolor datasets validate GC-4DGS: PSNR is 2.62dB higher than RF-DeRF and 1.58dB higher than the original 4DGS, while real-time performance and deployability on edge devices are retained. Conclusion: By introducing geometric consistency, GC-4DGS significantly improves dynamic scene rendering from sparse inputs while remaining efficient and practical for resource-constrained AIoT applications. Abstract: Gaussian Splatting has been considered as a novel way for view synthesis of dynamic scenes, which shows great potential in AIoT applications such as digital twins. However, recent dynamic Gaussian Splatting methods significantly degrade when only sparse input views are available, limiting their applicability in practice. The issue arises from the incoherent learning of 4D geometry as input views decrease. This paper presents GC-4DGS, a novel framework that infuses geometric consistency into 4D Gaussian Splatting (4DGS), offering real-time and high-quality dynamic scene rendering from sparse input views. While learning-based Multi-View Stereo (MVS) and monocular depth estimators (MDEs) provide geometry priors, directly integrating these with 4DGS yields suboptimal results due to the ill-posed nature of sparse-input 4D geometric optimization. To address these problems, we introduce a dynamic consistency checking strategy to reduce estimation uncertainties of MVS across spacetime. Furthermore, we propose a global-local depth regularization approach to distill spatiotemporal-consistent geometric information from monocular depths, thereby enhancing the coherent geometry and appearance learning within the 4D volume. Extensive experiments on the popular N3DV and Technicolor datasets validate the effectiveness of GC-4DGS in rendering quality without sacrificing efficiency.
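The rendering-quality gains for this entry are reported in PSNR, a logarithmic function of mean squared error. The sketch below is only an illustration of how a dB gap maps to an MSE ratio; the function names are hypothetical, and pixel values are assumed to lie in [0, 1].

```python
import math

def psnr(reference, rendered, max_value=1.0):
    # Peak signal-to-noise ratio in dB between two equal-length pixel lists.
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    # A PSNR gain of `gain_db` means the baseline's MSE is this many times
    # larger than the improved method's MSE (same max_value on both sides).
    return 10 ** (gain_db / 10)
```

Under this relation, a 2.62dB gain corresponds to roughly 1.83 times lower MSE and a 1.58dB gain to roughly 1.44 times lower MSE.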
Notably, our method outperforms RF-DeRF, the latest dynamic radiance field tailored for sparse-input dynamic view synthesis, and the original 4DGS by 2.62dB and 1.58dB in PSNR, respectively, with seamless deployability on resource-constrained IoT edge devices.
[261] GOATex: Geometry & Occlusion-Aware Texturing
Hyunjin Kim,Kunho Kim,Adam Lee,Wonkwang Lee
Main category: cs.CV
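TL;DR: GOATex is a diffusion-based method for texturing 3D meshes; an occlusion-aware framework built on the concept of hit levels produces high-quality, seamless textures for both exterior surfaces and interior regions.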
Details
Motivation: Existing 3D mesh texturing methods cannot effectively cover occluded interior regions, which leaves textures incomplete and seams visible; GOATex aims to solve this problem. Method: A hit-levels mechanism based on multi-view ray casting quantifies the depth of mesh faces and partitions them into ordered visibility layers; a two-stage visibility control strategy is combined with a pretrained diffusion model to texture each layer, and a soft UV-space blending technique weighted by visibility confidence merges the layers seamlessly. Result: Experiments show that GOATex generates high-fidelity, seamless textures in both visible and occluded regions without fine-tuning the pretrained diffusion model, and it supports separate prompts for exterior and interior regions for fine-grained control. Conclusion: By combining layered visibility modeling with a diffusion model, GOATex produces more complete and consistent 3D mesh textures than existing methods. Abstract: We present GOATex, a diffusion-based method for 3D mesh texturing that generates high-quality textures for both exterior and interior surfaces. While existing methods perform well on visible regions, they inherently lack mechanisms to handle occluded interiors, resulting in incomplete textures and visible seams. To address this, we introduce an occlusion-aware texturing framework based on the concept of hit levels, which quantify the relative depth of mesh faces via multi-view ray casting. This allows us to partition mesh faces into ordered visibility layers, from outermost to innermost. We then apply a two-stage visibility control strategy that progressively reveals interior regions with structural coherence, followed by texturing each layer using a pretrained diffusion model. To seamlessly merge textures obtained across layers, we propose a soft UV-space blending technique that weighs each texture's contribution based on view-dependent visibility confidence. Empirical results demonstrate that GOATex consistently outperforms existing methods, producing seamless, high-fidelity textures across both visible and occluded surfaces. Unlike prior works, GOATex operates entirely without costly fine-tuning of a pretrained diffusion model and allows separate prompting for exterior and interior mesh regions, enabling fine-grained control over layered appearances. For more qualitative results, please visit our project page: https://goatex3d.github.io/.
[262] Image Valuation in NeRF-based 3D reconstruction
Grigorios Aris Cheimariotis,Antonis Karakottas,Vangelis Chatzis,Angelos Kanlis,Dimitrios Zarpalas
Main category: cs.CV
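TL;DR: This paper proposes a method to quantify each image's contribution to NeRF reconstruction of in-the-wild image sets, assessing contribution with metrics such as PSNR and MSE and validating the approach by measuring how removing low-contribution images affects reconstruction quality.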
Details
Motivation: In 3D scene reconstruction, input images do not all contribute equally to the final result, especially in real-world scenes with uneven quality, occlusions, and transient objects, so each image's contribution needs to be assessed. Method: A method is proposed that quantifies each image's contribution to NeRF reconstruction using reconstruction quality metrics based on PSNR and MSE. Result: The method is validated by removing low-contribution images during training and measuring the effect on reconstruction fidelity. Conclusion: The method identifies and quantifies the individual contribution of images to NeRF reconstruction, supporting data valuation and more efficient training. Abstract: Data valuation and monetization are becoming increasingly important across domains such as eXtended Reality (XR) and digital media. In the context of 3D scene reconstruction from a set of images -- whether casually or professionally captured -- not all inputs contribute equally to the final output. Neural Radiance Fields (NeRFs) enable photorealistic 3D reconstruction of scenes by optimizing a volumetric radiance field given a set of images. However, in-the-wild scenes often include image captures of varying quality, occlusions, and transient objects, resulting in uneven utility across inputs. In this paper we propose a method to quantify the individual contribution of each image to NeRF-based reconstructions of in-the-wild image sets. Contribution is assessed through reconstruction quality metrics based on PSNR and MSE. We validate our approach by removing low-contributing images during training and measuring the resulting impact on reconstruction fidelity.
[263] Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation
Felipe Akio Matsuoka,Eduardo Moreno J. M. Farina,Augusto Sarquis Serpa,Soraya Monteiro,Rodrigo Ragazzini,Nitamar Abdala,Marcelo Straus Takahashi,Felipe Campos Kitamura
Main category: cs.CV
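TL;DR: Generative models can produce visually realistic inpainting, but removing non-anatomical markers from pediatric hand radiographs significantly degrades bone age prediction and gender classification, suggesting that inpainting may destroy clinically critical features.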
Details
Motivation: To evaluate the clinical reliability of inpainting with generative foundation models in medical AI, specifically whether removing non-anatomical markers from pediatric hand radiographs preserves the features needed for bone age and gender prediction.
Method: From the RSNA Bone Age Challenge dataset, 200 original radiographs were selected and 600 inpainted images were generated with gpt-image-1 using natural language prompts; deep learning ensembles were used to assess downstream performance with MAE and AUC as metrics, and changes in pixel intensity distributions were analyzed.
Result: After inpainting, bone age MAE rose from 6.26 to 30.11 months and gender classification AUC fell from 0.955 to 0.704; the inpainted images showed pixel-intensity shifts and inconsistencies, indicating structural changes that simple calibration cannot correct.
Conclusion: Although inpainted images look realistic, generative inpainting can obscure subtle but critical clinical features and introduce latent bias, so rigorous, task-specific validation is needed before integrating it into clinical AI workflows.
Abstract: Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.
[264] Buffer replay enhances the robustness of multimodal learning under missing-modality
Hongye Zhu,Xuan Liu,Yanwen Ba,Jingye Xue,Shigeng Zhang
Main category: cs.CV
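TL;DR: This paper proposes REplay Prompting (REP), a new method for improving the robustness of multimodal models when modalities are missing. It builds modality-wise feature buffers via a residual bypass, decouples private and shared features, and introduces a task-aware dynamic initialization mechanism, outperforming existing methods on a range of multimodal tasks with very low parameter overhead.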
Details
Motivation: Existing methods for handling missing modalities either incur high computational cost or rely only on adjacent-layer features and ignore long-distance contextual information, which leaves models insufficiently robust.
Method: (1) Build modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers; (2) adopt a private-shared feature decoupling strategy that separately preserves modality-specific signals and cross-modal semantics; (3) design a task-aware dynamic initialization mechanism to adapt to different missing-modality scenarios.
Result: On vision-language, vision-language-audio, and temporal multimodal benchmarks, REP outperforms prior methods under both single- and multi-modality missing conditions while adding only negligible parameter overhead.
Conclusion: REP is a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
Abstract: Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
[265] SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Ruosen Zhao,Zhikang Zhang,Jialei Xu,Jiahao Chang,Dong Chen,Lingyun Li,Weijian Sun,Zizhuang Wei
Main category: cs.CV
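TL;DR: This paper proposes SpaceMind, a new vision-language model designed for 3D spatial reasoning from RGB inputs alone. It adopts a dual-encoder architecture and a camera-guided modality fusion module that improves spatial understanding through camera-conditioned biasing and gating, achieving state-of-the-art results on several benchmarks.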
Details
Motivation: Existing large vision-language models perform poorly at 3D spatial reasoning, and most approaches rely on additional 3D information or shallow feature fusion, lacking effective modeling of genuine spatial perception from RGB-only input.
Method: Proposes SpaceMind, which uses VGGT as the spatial understanding encoder and InternViT as the 2D visual encoder, and designs a lightweight Camera-Guided Modality Fusion module placed before the language model to achieve deep fusion; spatial representations are strengthened through camera-conditioned biasing, geometric importance weighting, and camera-embedding gating.
Result: Achieves state-of-the-art results on the VSI-Bench, SQA3D, and SPBench benchmarks, significantly outperforming both open-source and closed-source systems and validating the effectiveness of camera-guided fusion.
Conclusion: Camera-guided modality fusion is an effective and practical inductive bias that gives vision-language models genuinely spatially grounded intelligence, enabling strong spatial reasoning without additional 3D inputs.
Abstract: Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
[266] Implementation of a Skin Lesion Detection System for Managing Children with Atopic Dermatitis Based on Ensemble Learning
Soobin Jeon,Sujong Kim,Dongmahn Seo
Main category: cs.CV
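TL;DR: This study proposes ENSEL, an ensemble learning-based skin lesion detection system that improves diagnostic accuracy by integrating multiple deep learning models. Tested on skin lesion images taken by actual users, it achieves high recall and a processing time under 1 second, supporting objective diagnosis of skin diseases and the growth of digital healthcare.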
Details
Motivation: Skin diseases such as atopic dermatitis are currently diagnosed through subjective evaluation without objective methods, which invites misdiagnosis, and their visual similarity to psoriasis makes accurate diagnosis harder. In addition, existing studies mostly use high-quality dermoscopic images, which are hard to obtain in actual clinical settings.
Method: Proposes ENSEL, an ensemble learning-based skin lesion detection system that integrates multiple deep learning models to improve diagnostic accuracy; experiments are run on skin lesion images taken by actual users, and both accuracy and response time are measured.
Result: ENSEL achieves high recall on most images with a processing time under 1 second.
Conclusion: ENSEL supports objective diagnosis of skin lesions, promotes the advancement of digital healthcare, and has potential for clinical use.
Abstract: The amendments made to the Data 3 Act and the impact of COVID-19 have fostered the growth of the digital healthcare market and promoted the use of medical data in artificial intelligence in South Korea. Atopic dermatitis, a chronic inflammatory skin disease, is diagnosed via subjective evaluations without using objective diagnostic methods, thereby increasing the risk of misdiagnosis. It is also similar to psoriasis in appearance, further complicating its accurate diagnosis. Existing studies on skin diseases have used high-quality dermoscopic image datasets, but such high-quality images cannot be obtained in actual clinical settings. Moreover, existing systems must ensure accuracy and fast response times. To this end, an ensemble learning-based skin lesion detection system (ENSEL) was proposed herein. ENSEL enhanced diagnostic accuracy by integrating various deep learning models via an ensemble approach. Its performance was verified by conducting skin lesion detection experiments using images of skin lesions taken by actual users. Its accuracy and response time were measured using randomly sampled skin disease images. Results revealed that ENSEL achieved high recall in most images and a processing time of less than 1 s. This study contributes to the objective diagnosis of skin lesions and promotes the advancement of digital healthcare.
[267] NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing
Zhenyu Xu,Xiaoqi Shen,Haotian Nan,Xinyu Zhang
Main category: cs.CV
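TL;DR: NumeriKontrol is an image editing framework controlled by numeric scalars. A plug-and-play Numeric Adapter enables precise, continuous adjustment of attribute edit strength in diffusion models and supports zero-shot multi-condition editing.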
Details
Motivation: Existing text-instruction-based image editing methods lack precision in fine-grained control of edit strength and cannot meet users' need to adjust image attributes exactly.
Method: Proposes the NumeriKontrol framework, with a task-separated architecture and a Numeric Adapter that encodes numeric editing scales and injects them into diffusion models; builds the Common Attribute Transform (CAT) dataset with accurate ground-truth scales, synthesizing high-quality training data from high-fidelity rendering engines and DSLR cameras.
Result: Experiments show that NumeriKontrol achieves accurate, continuous, and stable scale control across many attribute editing scenarios and supports zero-shot multi-condition editing, with instructions specifiable in any order.
Conclusion: By introducing a numeric control mechanism, NumeriKontrol improves the precision and controllability of instruction-based image editing and advances user-interactive image editing.
Abstract: Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.
[268] MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?
Yuandong Wang,Yao Cui,Yuxin Zhao,Zhen Yang,Yangfu Zhu,Zhenzhou Shao
Main category: cs.CV
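TL;DR: MathSight is a new university-level multimodal mathematical reasoning benchmark designed to quantify the role of visual information in vision-language models; it finds that the contribution of visual input shrinks as problem difficulty increases.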
Details
Motivation: Existing benchmarks rarely isolate the role of the image modality, so it is unclear whether models reason from visual understanding or from linguistic priors; a new benchmark that disentangles the effect of vision is needed.
Method: Builds the MathSight benchmark, in which each problem has multiple visual variants (original, hand-drawn, photo-captured) and a text-only condition for controlled comparison.
Result: Experiments show that the contribution of visual information to model reasoning declines as problem difficulty rises; Qwen3-VL without image input outperforms both its multimodal variants and GPT-5.
Conclusion: Current VLMs do not make effective use of visual information in complex mathematical reasoning, which underlines the need for benchmarks such as MathSight to drive genuinely vision-grounded reasoning.
Abstract: Recent advances in Vision-Language Models (VLMs) have achieved impressive progress in multimodal mathematical reasoning. Yet, how much visual information truly contributes to reasoning remains unclear. Existing benchmarks report strong overall performance but seldom isolate the role of the image modality, leaving open whether VLMs genuinely leverage visual understanding or merely depend on linguistic priors. To address this, we present MathSight, a university-level multimodal mathematical reasoning benchmark designed to disentangle and quantify the effect of visual input. Each problem includes multiple visual variants -- original, hand-drawn, photo-captured -- and a text-only condition for controlled comparison. Experiments on state-of-the-art VLMs reveal a consistent trend: the contribution of visual information diminishes with increasing problem difficulty. Remarkably, Qwen3-VL without any image input surpasses both its multimodal variants and GPT-5, underscoring the need for benchmarks like MathSight to advance genuine vision-grounded reasoning in future models.
[269] db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Siqi Chen,Ke Hong,Tianchen Zhao,Ruiqi Xie,Zhenhua Zhu,Xudong Zhang,Yu Wang
Main category: cs.CV
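TL;DR: Proposes db-SP, a sparsity-aware sequence parallelism technique that addresses the workload imbalance caused by block-wise sparse attention in Diffusion Transformer models and significantly speeds up inference.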
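To make the workload-imbalance problem concrete, here is a minimal sketch that measures imbalance as the maximum device load divided by the mean device load and compares a naive contiguous split of attention heads with a greedy balanced assignment. Both the ratio definition and the greedy (longest-processing-time) heuristic are assumptions chosen for illustration; the digest does not give the paper's formal sparse imbalance ratio or its dual-level partitioning algorithm, and the per-head costs are made up.

```python
def imbalance_ratio(loads):
    """Max device load divided by mean device load (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


def contiguous_partition(costs, num_devices):
    """Baseline: equal-sized contiguous chunks, ignoring per-item cost.

    Assumes len(costs) is divisible by num_devices.
    """
    size = len(costs) // num_devices
    return [sum(costs[i * size:(i + 1) * size]) for i in range(num_devices)]


def greedy_partition(costs, num_devices):
    """Stand-in heuristic: give the heaviest remaining item to the lightest device."""
    loads = [0] * num_devices
    for cost in sorted(costs, reverse=True):
        loads[loads.index(min(loads))] += cost
    return loads


# Made-up number of dense (non-skipped) attention blocks per head.
head_costs = [9, 8, 7, 6, 4, 3, 2, 1]

naive = contiguous_partition(head_costs, 2)
balanced = greedy_partition(head_costs, 2)
naive_ratio = imbalance_ratio(naive)
balanced_ratio = imbalance_ratio(balanced)
```

With these toy costs the contiguous split leaves one device with three times the work of the other, while the greedy assignment evens the loads out; the slowest device sets the step time, which is why the ratio matters for latency.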
Details
Motivation: In visual generation, scaling Diffusion Transformer (DiT) inference via sequence parallelism can reduce latency, but the workload imbalance caused by block-wise sparse attention severely hampers this approach.
Method: Defines a sparse imbalance ratio to quantify the imbalance and designs db-SP, which uses dual-level partitioning to achieve near-perfect load balance at both the head and block levels and dynamically adjusts the parallel degrees at runtime to follow the changing sparsity patterns across denoising steps and layers.
Result: Compared with state-of-the-art sequence parallel methods, db-SP achieves an average end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x.
Conclusion: db-SP effectively resolves the workload imbalance that block-wise sparse attention introduces in DiT models and significantly improves inference efficiency, offering a new approach to efficient training and deployment of large visual generative models.
Abstract: Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at https://github.com/thu-nics/db-SP.
[270] Analyzing Image Beyond Visual Aspect: Image Emotion Classification via Multiple-Affective Captioning
Zibo Zhou,Zhengjun Zhai,Huimin Chen,Wei Dai,Hansen Yang
Main category: cs.CV
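TL;DR: Proposes ACIEC, a new image emotion classification method based on affective captioning. It captures the affective information in an image as pure text, combines a hierarchical contrastive loss with an emotional attribute chain-of-thought, uses a pre-trained language model to improve classification, and also handles images with embedded text, effectively narrowing the affective gap.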
Details
Motivation: Existing image emotion classification methods based on pre-trained visual models are limited by the "affective gap" and cannot fully convey affective semantics; psychological research shows that language expresses rich affective information effectively, so the language modality is introduced to compensate for the weakness of visual features in expressing emotion.
Method: Proposes the ACIEC framework: (1) a hierarchical multi-level contrastive loss to detect emotional concepts from images; (2) emotional attribute chain-of-thought reasoning to generate affective sentences; (3) a pre-trained language model that fuses emotional concepts and affective sentences for classification; (4) a contrastive loss based on semantic similarity sampling to mitigate the large intra-class and small inter-class differences in affective datasets; (5) handling of images with embedded text.
Result: Outperforms existing methods on multiple benchmark datasets, validating the method's effectiveness in bridging the affective gap and improving image emotion classification.
Conclusion: By mapping images into an affective text space and classifying with a language model, ACIEC exploits the high affective expressiveness of language, eases the limitations of traditional visual models in understanding emotion, and offers a new approach to image emotion analysis.
Abstract: Image emotion classification (IEC) is a longstanding research field that has received increasing attention with the rapid progress of deep learning. Although recent advances have leveraged the knowledge encoded in pre-trained visual models, their effectiveness is constrained by the "affective gap", which limits the applicability of pre-training knowledge for IEC tasks. It has been demonstrated in psychology that language exhibits high variability, encompasses diverse and abundant information, and can effectively eliminate the "affective gap". Inspired by this, we propose a novel Affective Captioning for Image Emotion Classification (ACIEC) to classify image emotion based on pure texts, which effectively capture the affective information in the image. In our method, a hierarchical multi-level contrastive loss is designed for detecting emotional concepts from images, while an emotional attribute chain-of-thought reasoning is proposed to generate affective sentences. Then, a pre-trained language model is leveraged to synthesize emotional concepts and affective sentences to conduct IEC. Additionally, a contrastive loss based on semantic similarity sampling is designed to solve the problem of large intra-class differences and small inter-class differences in affective datasets. Moreover, we also take the images with embedded texts into consideration, which were ignored by previous studies. Extensive experiments illustrate that our method can effectively bridge the affective gap and achieve superior results on multiple benchmarks.
[271] DNA-Prior: Unsupervised Denoise Anything via Dual-Domain Prior
Yanqi Cheng,Chun-Wun Cheng,Jim Denholm,Thiago Lima,Javier A. Montoya-Zegarra,Richard Goodwin,Carola-Bibiane Schönlieb,Angelica I Aviles-Rivero
Main category: cs.CV
TL;DR: 提出了一种名为DNA-Prior的通用无监督去噪框架,通过数学上严谨的混合先验从噪声图像中直接重建干净图像,无需外部训练数据或模态特定调参。
Details
Motivation: Existing denoising methods depend on large annotated datasets or supervised learning, which limits their use in clinical settings with multiple modalities and scarce ground-truth labels.
Method: Combines an implicit network-architecture prior with an explicit spectral-spatial prior (a frequency-domain fidelity term and a spatial regularisation term) to form a dual-domain optimisation problem that jointly preserves global frequency characteristics and local anatomical structure.
Result: Across multiple modalities and noise conditions, DNA-Prior achieves consistent noise suppression and structural preservation without training data.
Conclusion: DNA-Prior is a universal medical image denoising framework that needs no supervision or tuning and has broad potential in multi-modality clinical scenarios.
Abstract: Medical imaging pipelines critically rely on robust denoising to stabilise downstream tasks such as segmentation and reconstruction. However, many existing denoisers depend on large annotated datasets or supervised learning, which restricts their usability in clinical environments with heterogeneous modalities and limited ground-truth data. To address this limitation, we introduce DNA-Prior, a universal unsupervised denoising framework that reconstructs clean images directly from corrupted observations through a mathematically principled hybrid prior. DNA-Prior integrates (i) an implicit architectural prior, enforced through a deep network parameterisation, with (ii) an explicit spectral-spatial prior composed of a frequency-domain fidelity term and a spatial regularisation functional. This dual-domain formulation yields a well-structured optimisation problem that jointly preserves global frequency characteristics and local anatomical structure, without requiring any external training data or modality-specific tuning. Experiments across multiple modalities show that DNA-Prior achieves consistent noise suppression and structural preservation under diverse noise conditions.
[272] DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
Hongfei Zhang,Kanghao Chen,Zixin Zhang,Harold Haodong Chen,Yuanhuiyi Lyu,Yuqi Zhang,Shuai Yang,Kun Zhou,Yingcong Chen
Main category: cs.CV
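TL;DR: This paper proposes DualCamCtrl, an end-to-end diffusion model for camera-controlled video generation. Its dual-branch framework and semantics-guided RGB-depth fusion mechanism markedly improve the geometric consistency of generated videos and their adherence to specified camera trajectories.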
Details
Motivation: Existing methods for camera-pose-conditioned video generation lack sufficient scene understanding and geometric awareness, so their outputs are inconsistent with the specified camera trajectories.
Method: Proposes DualCamCtrl, a dual-branch architecture that jointly generates camera-consistent RGB and depth sequences, and introduces the Semantic Guided Mutual Alignment (SIGMA) mechanism for complementary fusion across modalities; also analyzes the roles of depth and camera pose at different stages of denoising.
Result: Experiments show that DualCamCtrl reduces camera motion error by more than 40% compared with prior methods and significantly improves the consistency of camera-controlled video generation.
Conclusion: By explicitly modeling geometric structure and combining it with semantics-guided multimodal fusion, DualCamCtrl better disentangles appearance and geometry in camera-controlled video generation, improving generation quality and control accuracy.
Abstract: This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl-page/
[273] InstanceV: Instance-Level Video Generation
Yuheng Chen,Teng Hu,Jiangning Zhang,Zhucun Xue,Ran Yi,Lizhuang Ma
Main category: cs.CV
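TL;DR: This paper proposes InstanceV, a video generation framework that adds an instance-aware mechanism and a global semantic consistency module to give text-to-video generation instance-level control with consistent, high-quality output, and introduces a new evaluation benchmark, InstanceBench.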
Details
Motivation: Most existing text-to-video models rely solely on textual conditions and lack fine-grained control over the generated video; instance-level control in particular is weak, making it hard to generate instances with the correct attributes at designated locations.
Method: Proposes the InstanceV framework, comprising an Instance-aware Masked Cross-Attention mechanism, a Shared Timestep-Adaptive Prompt Enhancement module, and a Spatially-Aware Unconditional Guidance strategy, and builds a new evaluation benchmark, InstanceBench.
Result: Experiments show that InstanceV outperforms existing state-of-the-art models in instance-level controllability, overall video quality, and preservation of small instances, in both qualitative and quantitative evaluations.
Conclusion: InstanceV jointly achieves instance-level control and global semantic consistency, offering a new approach to controllable video generation, and the new InstanceBench benchmark supports more comprehensive evaluation of such tasks.
Abstract: Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for more comprehensive evaluation on instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.
[274] Cascaded Robust Rectification for Arbitrary Document Images
Chaoyun Wang,Quanxin Huang,I-Chao Shen,Takeo Igarashi,Nanning Zheng,Caigui Jiang
Main category: cs.CV
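TL;DR: Proposes a multi-stage framework that progressively corrects perspective, geometric, and content distortions in real-world document images, introduces new evaluation metrics, and achieves state-of-the-art performance on multiple benchmarks.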
Details
Motivation: Document rectification in real-world scenes faces extreme variation in camera viewpoint and physical deformation, and existing methods and evaluation metrics have limitations.
Method: A coarse-to-fine multi-stage framework: first a global affine transformation corrects perspective distortion, then geometric deformations from paper curling and folding are rectified, and finally a content-aware iterative process removes fine-grained content distortions; two new evaluation metrics, AED/ACER and AD-M/AAD-M, are also proposed.
Result: Achieves state-of-the-art performance on multiple challenging benchmarks, reducing the AAD metric by 14.1% to 34.7%, and performs well in real-world applications.
Conclusion: The multi-stage framework decomposes complex distortions and resolves them step by step; together with the new metrics, which measure rectification quality more accurately, it significantly improves performance in real-world scenarios.
Abstract: Document rectification in real-world scenarios poses significant challenges due to extreme variations in camera perspectives and physical distortions. Driven by the insight that complex transformations can be decomposed and resolved progressively, we introduce a novel multi-stage framework that progressively reverses distinct distortion types in a coarse-to-fine manner. Specifically, our framework first performs a global affine transformation to correct perspective distortions arising from the camera's viewpoint, then rectifies geometric deformations resulting from physical paper curling and folding, and finally employs a content-aware iterative process to eliminate fine-grained content distortions. To address limitations in existing evaluation protocols, we also propose two enhanced metrics: layout-aligned OCR metrics (AED/ACER) for a stable assessment that decouples geometric rectification quality from the layout analysis errors of OCR engines, and masked AD/AAD (AD-M/AAD-M) tailored for accurately evaluating geometric distortions in documents with incomplete boundaries. Extensive experiments show that our method establishes new state-of-the-art performance on multiple challenging benchmarks, yielding a substantial reduction of 14.1% to 34.7% in the AAD metric and demonstrating superior efficacy in real-world applications. The code will be publicly available at https://github.com/chaoyunwang/ArbDR.
[275] Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
Jin-Seop Lee,SungJoon Lee,SeongJun Jung,Boyang Li,Jee-Hyong Lee
Main category: cs.CV
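TL;DR: Proposes Refusal-Aware Reinforcement Fine-Tuning (RA-RFT), built on the GRPO framework with multiple reward objectives, to refuse hard-irrelevant queries in video temporal grounding, i.e. queries that are semantically similar to the video but not actually relevant, and builds the HI-VTG dataset containing such queries for validation.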
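Two generic building blocks behind this kind of training can be sketched briefly: temporal IoU between a predicted and a target segment, and GRPO-style group-relative advantages, where each sampled response's reward is normalised by the mean and standard deviation of its group. The `toy_reward` function is a hypothetical stand-in invented for this example; it is not the paper's format, refuse-IoU, explain, or query-correction reward, whose definitions are not given in this digest.

```python
import statistics


def temporal_iou(pred, target):
    """IoU of two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_reward(response, target):
    """Hypothetical reward: 1 for a correct refusal, IoU for a grounded answer.

    `None` stands for "refuse" (response) or "query is irrelevant" (target).
    """
    if target is None:
        return 1.0 if response is None else 0.0
    return 0.0 if response is None else temporal_iou(response, target)


# Three sampled responses to one relevant query whose true segment is 2 s to 6 s.
group = [(2.0, 6.0), (4.0, 8.0), None]
rewards = [toy_reward(r, (2.0, 6.0)) for r in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are relative to the group, a wrong refusal on a relevant query is pushed down just as a forced prediction on an irrelevant query would be, which is the behaviour a refusal-aware reward is meant to encourage.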
Details
Motivation: Existing video temporal grounding models cannot handle queries that are unrelated to the video content or semantically similar but irrelevant; they force a prediction even when the query is irrelevant and lack the ability to refuse hard-irrelevant queries.
Method: Proposes RA-RFT, based on the Group Relative Policy Optimization (GRPO) framework, with four reward objectives (format, refuse-IoU, explain, and query correction) to improve relevance discrimination and fine-grained semantic reasoning; also builds the HI-VTG dataset of hard-irrelevant queries and their refusal answers.
Result: The method is shown to be effective across several relevance-aware VTG scenarios (hard-irrelevant, simply-shuffled, and human-annotated settings) and to extend to various LVLM-based VTG models.
Conclusion: RA-RFT improves a model's ability to refuse hard-irrelevant queries and strengthens relevance judgment and semantic understanding in video temporal grounding, supporting more robust practical applications.
Abstract: Video Temporal Grounding (VTG) aims to localize a temporal segment in a video corresponding to a natural language query. However, existing VTG models assume that a relevant segment always exists, causing them to always predict a target segment even when the query is irrelevant to the video. While recent approaches attempt to handle irrelevant queries, they can only reject those that are entirely unrelated to the video and still fail to handle hard-irrelevant queries that are semantically similar but not actually relevant. To address this, we propose Refusal-Aware Reinforcement Fine-Tuning (RA-RFT) to effectively refuse hard-irrelevant queries in VTG. Our method is based on the Group Relative Policy Optimization (GRPO) framework and integrates four reward objectives (format, refuse-IoU, explain, and query correction) to improve both relevance discrimination and fine-grained semantic reasoning. In addition, to effectively support RA-RFT, we construct a Hard-Irrelevant VTG (HI-VTG) dataset, which includes hard-irrelevant queries and their refusal answers. We demonstrate the effectiveness of our method across various relevance-aware VTG scenarios, including hard-irrelevant VTG, simply-shuffled RA-VTG, and human-annotated RA-VTG settings. We also show that the proposed method is scalable by applying it to various LVLM-based VTG models. Our code is available at https://github.com/JINSUBY/RA-RFT.
[276] REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection
Huangsen Cao,Qin Mei,Zhiheng Li,Yuxi Li,Ying Zhang,Chen Li,Zhimeng Zhang,Xin Ding,Yongwei Wang,Jing Lyu,Fei Wu
Main category: cs.CV
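TL;DR: This paper presents REVEAL-Bench, the first chain-of-evidence-based multimodal benchmark for AI-generated image detection, and the REVEAL framework, which uses expert-grounded reinforcement learning for explainable image forensics and significantly improves detection accuracy, explanation fidelity, and cross-model generalization.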
Details
Motivation: Existing AI-generated image detection methods mostly rely on post-hoc explanations or visual discrimination and lack a verifiable chain of evidence, so their explanations have no causal grounding and they generalize poorly. A forensic method with genuine explainability and reliable reasoning is therefore urgently needed.
Method: Builds the REVEAL-Bench benchmark, which derives a chain of evidence from multiple lightweight expert models and records step-by-step reasoning traces; on this basis proposes the REVEAL framework, which combines detection with expert-grounded reinforcement learning and a dedicated reward mechanism that jointly optimizes detection accuracy, explanation fidelity, and logical coherence.
Result: Experiments show that REVEAL significantly outperforms existing methods in detection accuracy, explanation fidelity, and cross-model generalization, setting a new benchmark for explainable image forensics.
Conclusion: By introducing evidence-based reasoning and expert-grounded reinforcement learning, REVEAL achieves efficient and genuinely explainable AI-generated image detection and offers a new paradigm for trustworthy image forensics.
Abstract: With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce REVEAL-Bench, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain-of-evidence derived from multiple lightweight expert models and records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose REVEAL (Reasoning-enhanced Forensic Evidence Analysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, establishing a new state of the art for explainable image forensics.
[277] PowerCLIP: Powerset Alignment for Contrastive Pre-Training
Masaki Kawamura,Nakamasa Inoue,Rintaro Yanagi,Hirokatsu Kataoka,Rio Yokota
Main category: cs.CV
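TL;DR: Proposes PowerCLIP, a contrastive vision-language pre-training framework enhanced by powerset alignment, which uses non-linear aggregators to efficiently model fine-grained compositional semantics between multiple image regions and text phrases.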
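To see why a sum over a powerset need not cost exponential time, consider a classical identity: the sum over all subsets S of the product of w_i for i in S equals the product of (1 + w_i) over all i. The sketch below checks it numerically. This is only an analogy for how an aggregate over 2^M region subsets can collapse to O(M) work; it is not the paper's non-linear aggregator, whose form is not given in this digest, and the weights are made up.

```python
import math
from itertools import combinations


def subset_sum_bruteforce(weights):
    """Sum of products over every subset: visits all 2^M subsets."""
    total = 0.0
    for k in range(len(weights) + 1):
        for subset in combinations(weights, k):
            total += math.prod(subset)  # the empty subset contributes 1
    return total


def subset_sum_linear(weights):
    """Same quantity in O(M) via sum_S prod_{i in S} w_i = prod_i (1 + w_i)."""
    return math.prod(1.0 + w for w in weights)


weights = [0.5, 0.25, 2.0, 1.5]  # made-up per-region scores
brute = subset_sum_bruteforce(weights)
fast = subset_sum_linear(weights)
```

With four regions the brute-force loop already visits 16 subsets; at M = 50 regions it would need about 10^15, while the factorised form still needs only 50 multiplications.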
Details
Motivation: Existing methods struggle to capture compositional semantics that span multiple image regions, which limits fine-grained understanding of complex visual scenes.
Method: Introduces powerset alignment, which optimizes region-to-phrase alignment by minimizing a loss between powersets of image regions and textual parse trees; designs non-linear aggregators (NLAs) that reduce computational complexity from O(2^M) to O(M) for an efficient approximation.
Result: Outperforms state-of-the-art methods on zero-shot classification and retrieval tasks, confirming the model's compositionality and robustness.
Conclusion: Through efficient powerset alignment, PowerCLIP substantially improves vision-language models' understanding of compositional semantics and offers a new direction for higher-order semantic alignment.
Abstract: Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.
[278] Fast Multi-view Consistent 3D Editing with Video Priors
Liyi Chen,Ruihuang Li,Guowen Zhang,Pengfei Wang,Lei Zhang
Main category: cs.CV
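TL;DR: This paper proposes ViP3DE, a 3D editing method based on generative video priors. It uses the temporal consistency priors of pre-trained video generation models to achieve multi-view consistent 3D editing in a single forward pass, avoiding the slow iterative updates and over-smoothed results of traditional methods.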
Details
Motivation: Lacking multi-view consistency priors, existing text-driven 3D editing methods rely on iterative per-view updates, which are slow and prone to over-smoothed results; a more efficient and more consistent method is needed.
Method: Proposes ViP3DE, which conditions a video generation model on a single edited view to directly generate the edited results for other views; introduces motion-preserved noise blending so that generated views match predefined camera poses, and geometry-aware denoising that integrates 3D geometric priors into the video model to strengthen multi-view consistency.
Result: Experiments show that ViP3DE achieves high-quality, multi-view consistent 3D editing in a single forward pass and significantly outperforms existing methods in both editing quality and speed.
Conclusion: ViP3DE makes effective use of the temporal consistency priors of video generation models to edit 3D objects or scenes quickly and with high quality, offering a new approach to text-driven 3D editing.
Abstract: Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.
[279] GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation
Yuhao Wan,Lijuan Liu,Jingzhi Zhou,Zihan Zhou,Xuying Zhang,Dongbo Zhang,Shaohui Jiao,Qibin Hou,Ming-Ming Cheng
Main category: cs.CV
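TL;DR: This paper proposes GeoWorld, which combines a video model with a geometry model to generate high-quality 3D scenes, addressing the geometric distortion and blur caused by conventional single-frame-input methods.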
Details
Motivation: Existing methods that generate 3D scenes from a single-frame image suffer from geometric distortion and blurry content, making it hard to obtain fine-grained, consistent 3D structure.
Method: Consecutive video frames are generated first, and a geometry model extracts full-frame geometry features that are fed to the video generation model as geometric conditions; a geometry alignment loss and a geometry adaptation module improve the consistency of geometric structures and the utilization of the geometry features.
Result: Experiments show that, given a single image and a camera trajectory, GeoWorld generates high-fidelity 3D scenes and outperforms prior methods in both qualitative and quantitative evaluation.
Conclusion: By unlocking the potential of geometry models and incorporating multi-frame geometric information, GeoWorld significantly improves the quality and geometric accuracy of single-image-to-3D scene generation.
Abstract: Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models and present our GeoWorld. Instead of exploiting geometric information obtained from a single-frame input, we propose to first generate consecutive video frames and then take advantage of the geometry model to provide full-frame geometry features, which contain richer information than single-frame depth maps or camera embeddings used in previous methods, and use these geometry features as geometrical conditions to aid the video generation model. To enhance the consistency of geometric structures, we further propose a geometry alignment loss to provide the model with real-world geometric constraints and a geometry adaptation module to ensure the effective utilization of geometry features. Extensive experiments show that our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively. Project Page: https://peaes.github.io/GeoWorld/.
[280] Vision Bridge Transformer at Scale
Zhenxiong Tan,Zeqing Wang,Xingyi Yang,Songhua Liu,Xinchao Wang
Main category: cs.CV
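TL;DR: This paper proposes Vision Bridge Transformer (ViBT), a large-scale conditional generation model based on Brownian Bridge Models that handles image and video translation tasks through efficient data-to-data translation.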
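To make the "data-to-data" idea concrete, the following minimal sketch samples an intermediate state of a textbook Brownian bridge pinned at a source `x0` and a target `x1`. It is an illustration only: the function name, the `sigma` parameter, and the list-based vectors are assumptions, and the entry does not state ViBT's actual noise schedule or training objective.

```python
import math
import random


def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    The marginal is Gaussian with mean (1 - t) * x0 + t * x1 and
    standard deviation sigma * sqrt(t * (1 - t)) per coordinate.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    std = sigma * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, x1)]


# Hypothetical source and target vectors (e.g. flattened input and output images).
source = [1.0, 2.0]
target = [3.0, 5.0]
start = brownian_bridge_sample(source, target, 0.0)   # noise vanishes at t = 0
end = brownian_bridge_sample(source, target, 1.0)     # noise vanishes at t = 1
middle = brownian_bridge_sample(source, target, 0.5)  # noisiest point of the bridge
```

Because the variance vanishes at both endpoints, the process starts exactly at the input and ends exactly at the output, in contrast to a diffusion model whose starting point is pure noise.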
Details
Motivation: Conventional diffusion models generate data from noise, which is comparatively inefficient; the goal is a more efficient framework for conditional generation.
Method: Adopts a Transformer architecture and proposes a variance-stabilized velocity-matching objective, directly modeling the trajectory between inputs and outputs.
Result: Bridge Models are realized at 20B and 1.3B parameters and perform well on image editing and video translation tasks.
Conclusion: Large-scale Bridge Models combined with a Transformer architecture show strong potential for instruction-based image editing and complex video translation.
Abstract: We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
[281] Pathryoshka: Compressing Pathology Foundation Models via Multi-Teacher Knowledge Distillation with Nested Embeddings
Christian Grashei,Christian Brechenmacher,Rao Muhammad Umer,Jingsong Liu,Carsten Marr,Ewa Szczurek,Peter J. Schüffler
Main category: cs.CV
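TL;DR: Pathryoshka is a multi-teacher distillation framework, inspired by RADIO and Matryoshka Representation Learning, for compressing pathology foundation models; it reduces model size by 86-92% while maintaining performance and outperforms existing methods on ten public benchmarks.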
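The "adaptable embedding dimensions" associated with Matryoshka Representation Learning can be illustrated at inference time: a nested embedding is trained so that its leading coordinates already form a usable lower-dimensional embedding. The sketch below shows only that truncation step; the function name and the example vector are hypothetical, and nothing here reflects Pathryoshka's distillation procedure, which the entry does not detail.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` coordinates of a nested embedding and L2-normalize them."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    # An all-zero prefix cannot be normalized; return it unchanged.
    return [x / norm for x in prefix] if norm > 0 else list(prefix)


# Hypothetical 3-dimensional embedding, reduced to its first 2 coordinates.
full = [3.0, 4.0, 12.0]
small = truncate_embedding(full, 2)
```

A downstream user with tight memory can therefore store or compare only the short prefix, trading some representational detail for a smaller footprint.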
Details
Motivation: Large pathology foundation models have many parameters and high-dimensional embeddings, which limits their use in research and clinical settings where computing resources are tight; smaller, more efficient models are needed.
Method: Proposes the Pathryoshka framework, which combines multi-teacher knowledge distillation with Matryoshka Representation Learning to compress the model while supporting adaptable embedding dimensions.
Result: Compared with its much larger teacher models, model size is reduced by 86-92% at on-par performance; on ten public pathology benchmarks, it outperforms single-teacher distillation models of comparable size by a median margin of 7.0 in accuracy.
Conclusion: Pathryoshka enables efficient local deployment without sacrificing accuracy or representational richness, broadening access to state-of-the-art pathology foundation models in research and clinical practice.
Abstract: Pathology foundation models (FMs) have driven significant progress in computational pathology. However, these high-performing models can easily exceed a billion parameters and produce high-dimensional embeddings, thus limiting their applicability for research or clinical use when computing resources are tight. Here, we introduce Pathryoshka, a multi-teacher distillation framework inspired by RADIO distillation and Matryoshka Representation Learning to reduce pathology FM sizes while allowing for adaptable embedding dimensions. We evaluate our framework with a distilled model on ten public pathology benchmarks with varying downstream tasks. Compared to its much larger teachers, Pathryoshka reduces the model size by 86-92% at on-par performance. It outperforms state-of-the-art single-teacher distillation models of comparable size by a median margin of 7.0 in accuracy. By enabling efficient local deployment without sacrificing accuracy or representational richness, Pathryoshka democratizes access to state-of-the-art pathology FMs for the broader research and clinical community.
[282] Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation
Jose Moises Araya-Martinez,Gautham Mohan,Kenichi Hayakawa Bolaños,Roberto Mendieta,Sarvenaz Sardari,Jens Lambrecht,Jörg Krüger
Main category: cs.CV
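TL;DR: This paper proposes a pose-agnostic, zero-shot quality inspection framework that needs no prior defect samples; it performs early-stage visual quality inspection by comparing real scenes against a real-time digital twin in RGB-D space.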
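The intersection-over-union (IoU) score used to evaluate this kind of inspection can be sketched as follows. The depth-thresholding step is an assumption chosen for illustration: it is one simple way to turn a real-versus-twin depth comparison into a deviation mask, and the names `deviation_mask`, `mask_iou`, and the threshold value are hypothetical rather than taken from the paper.

```python
def deviation_mask(real_depth, twin_depth, threshold):
    """Mark pixels whose real depth differs from the rendered twin depth by more than `threshold`."""
    return [[abs(r - t) > threshold for r, t in zip(real_row, twin_row)]
            for real_row, twin_row in zip(real_depth, twin_depth)]


def mask_iou(pred, truth):
    """Intersection-over-union of two binary masks given as equally sized 2D lists."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; report IoU = 1.0 by convention.
    return inter / union if union else 1.0


# Hypothetical 2x3 depth maps in metres: the twin is flat, the real part deviates at two pixels.
real = [[1.00, 1.00, 1.20],
        [1.00, 1.30, 1.00]]
twin = [[1.00, 1.00, 1.00],
        [1.00, 1.00, 1.00]]
ground_truth = [[0, 0, 1],
                [0, 1, 1]]

predicted = deviation_mask(real, twin, threshold=0.1)
iou = mask_iou(predicted, ground_truth)  # 2 shared pixels out of 3 marked in either mask
```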
Details
Motivation: Complexity and data requirements limit the wider adoption of existing visual inspection systems in semi-controlled industrial environments; a solution with low data dependence that is easy to deploy is needed.
Method: Known CAD models are used for object detection and pose estimation, giving a semantic description of the industrial scene that enables real-time digital twin rendering; the rendering is compared with the real scene in RGB-D space. An extensible, hierarchical annotation strategy unifies pose labelling with logical and structural defect annotations.
Result: The method is validated on a quality inspection use case for an automotive axial flux motor, reaching IoU scores of up to 63.3% even with simple distance measurements under semi-controlled conditions.
Conclusion: The framework lays the groundwork for research on generalizable, low-data defect detection methods in dynamic manufacturing settings.
Abstract: Early-stage visual quality inspection is vital for achieving Zero-Defect Manufacturing and minimizing production waste in modern industrial environments. However, the complexity of robust visual inspection systems and their extensive data requirements hinder widespread adoption in semi-controlled industrial settings. In this context, we propose a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins (DT) in the RGB-D space. Our approach enables efficient real-time DT rendering by semantically describing industrial scenes through object detection and pose estimation of known Computer-Aided Design models. We benchmark tools for real-time, multimodal RGB-D DT creation while tracking consumption of computational resources. Additionally, we provide an extensible and hierarchical annotation strategy for multi-criteria defect detection, unifying pose labelling with logical and structural defect annotations. Based on an automotive use case featuring the quality inspection of an axial flux motor, we demonstrate the effectiveness of our framework. Our results demonstrate detection performance with intersection-over-union (IoU) scores of up to 63.3% against ground-truth masks, even when using simple distance measurements under semi-controlled industrial conditions. Our findings lay the groundwork for future research on generalizable, low-data defect detection methods in dynamic manufacturing settings.
[283] Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day
Milad Abdollahzadeh,Abdul Raheem,Zilong Zhao,Uzair Javaid,Kevin Yee,Nalam Venkata Abhishek,Tram Truong-Huu,Biplab Sikdar
Main category: cs.CV
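TL;DR: This paper is the first to explore whether instruction tuning can improve the tabular data generation ability of large language models (LLMs) under limited data and compute. The authors build a high-quality tabular instruction dataset and instruction-tune Llama3.1-8B-Instruct on it. Experiments show that with only 7K instructions and less than 6 hours of training on an A100 GPU, performance is on par with GPT-4o, the most capable commercial model.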
Details
Motivation: Existing work mostly studies LLM understanding of and reasoning over tabular data and overlooks tabular data generation. This paper aims to fill that gap and, because high resource requirements hinder research, to test whether instruction tuning is feasible under low-resource conditions.
Method: Build a high-quality tabular instruction dataset and instruction-tune the open-source model Llama3.1-8B-Instruct on 7K instructions, training for less than 6 hours on a single A100 GPU.
Result: The instruction-tuned model performs on par with GPT-4o on tabular data generation, supporting the effectiveness of small-scale, high-quality data and low-resource training.
Conclusion: Even with limited data and compute, instruction tuning on a high-quality instruction dataset can significantly improve an LLM's tabular data generation ability, reaching the level of a top commercial model.
Abstract: Tabular instruction tuning has emerged as a promising research direction for improving LLMs' understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs' tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU, for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.
[284] Robust 3DGS-based SLAM via Adaptive Kernel Smoothing
Shouhe Zhang,Dayong Ren,Sensen Song,Wenjie Li,Piaopiao Yu,Yurong Qian
Main category: cs.CV
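TL;DR: This paper proposes CB-KNN, a method that uses a smooth kernel strategy to make the rasterization process in 3DGS-SLAM more robust, with the aim of stabilizing camera pose tracking rather than purely maximizing rendering quality.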
Details
Motivation: Conventional 3DGS-SLAM assumes that rendering quality determines tracking accuracy, but the authors argue that the robustness of rasterization to parameter errors matters more.
Method: Proposes Corrective Blurry KNN (CB-KNN), which adaptively adjusts the RGB values and locations of the K-nearest neighboring Gaussians in a local region, introducing a controlled blur that regularizes rendering and improves robustness to parameter noise.
Result: Experiments show that the method significantly improves the robustness and accuracy of camera pose tracking while maintaining scene reconstruction quality.
Conclusion: Making rasterization more robust improves SLAM tracking more than pursuing perfect rendering does, and CB-KNN offers a practical, effective improvement for existing 3DGS frameworks.
Abstract: In this paper, we challenge the conventional notion in 3DGS-SLAM that rendering quality is the primary determinant of tracking accuracy. We argue that, compared to solely pursuing a perfect scene representation, it is more critical to enhance the robustness of the rasterization process against parameter errors to ensure stable camera pose tracking. To address this challenge, we propose a novel approach that leverages a smooth kernel strategy to enhance the robustness of 3DGS-based SLAM. Unlike conventional methods that focus solely on minimizing rendering error, our core insight is to make the rasterization process more resilient to imperfections in the 3DGS parameters. We hypothesize that by allowing each Gaussian to influence a smoother, wider distribution of pixels during rendering, we can mitigate the detrimental effects of parameter noise from outlier Gaussians. This approach intentionally introduces a controlled blur to the rendered image, which acts as a regularization term, stabilizing the subsequent pose optimization. While a complete redesign of the rasterization pipeline is an ideal solution, we propose a practical and effective alternative that is readily integrated into existing 3DGS frameworks. Our method, termed Corrective Blurry KNN (CB-KNN), adaptively modifies the RGB values and locations of the K-nearest neighboring Gaussians within a local region. This dynamic adjustment generates a smoother local rendering, reducing the impact of erroneous GS parameters on the overall image. Experimental results demonstrate that our approach, while maintaining the overall quality of the scene reconstruction (mapping), significantly improves the robustness and accuracy of camera pose tracking.
[285] DAONet-YOLOv8: An Occlusion-Aware Dual-Attention Network for Tea Leaf Pest and Disease Detection
Yefeng Wu,Shan Wan,Ling Wu,Yecheng Zhao
Main category: cs.CV
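TL;DR: This paper proposes DAONet-YOLOv8, an improved YOLOv8 model for detecting pests and diseases in tea plantations. By adding a dual-attention fusion module, an occlusion-aware detection head, and a dynamic convolution module, it significantly improves detection accuracy and robustness in complex environments.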
Details
Motivation: Existing detectors are prone to missed detections and false positives under complex backgrounds, changing illumination, and occlusion by branches and leaves, making it hard to identify tea leaf pests and diseases accurately.
Method: Proposes DAONet-YOLOv8 with three key improvements: a Dual-Attention Fusion Module (DAFM) that combines local convolutional features with global self-attention; an occlusion-aware detection head (Detect-OAHead) that compensates for the features of occluded regions; and a C2f-DSConv module that uses multi-shape dynamic convolutions to capture irregular lesion boundaries.
Result: On a real-world tea plantation dataset, DAONet-YOLOv8 achieves 92.97% precision, 92.80% recall, 97.10% mAP@50 and 76.90% mAP@50:95, outperforming the YOLOv8n baseline while reducing the parameter count by 16.7%.
Conclusion: DAONet-YOLOv8 delivers superior performance on tea leaf pest and disease detection and is notably more robust and practical when dealing with complex environmental interference.
Abstract: Accurate detection of tea leaf pests and diseases in real plantations remains challenging due to complex backgrounds, variable illumination, and frequent occlusions among dense branches and leaves. Existing detectors often suffer from missed detections and false positives in such scenarios. To address these issues, we propose DAONet-YOLOv8, an enhanced YOLOv8 variant with three key improvements: (1) a Dual-Attention Fusion Module (DAFM) that combines convolutional local feature extraction with self-attention based global context modeling to focus on subtle lesion regions while suppressing background noise; (2) an occlusion-aware detection head (Detect-OAHead) that learns the relationship between visible and occluded parts to compensate for missing lesion features; and (3) a C2f-DSConv module employing dynamic synthesis convolutions with multiple kernel shapes to better capture irregular lesion boundaries. Experiments on our real-world tea plantation dataset containing six pest and disease categories demonstrate that DAONet-YOLOv8 achieves 92.97% precision, 92.80% recall, 97.10% mAP@50 and 76.90% mAP@50:95, outperforming the YOLOv8n baseline by 2.34, 4.68, 1.40 and 1.80 percentage points respectively, while reducing parameters by 16.7%. Comparative experiments further confirm that DAONet-YOLOv8 achieves superior performance over mainstream detection models.
[286] PointCNN++: Performant Convolution on Native Points
Lihan Li,Haofeng Zhong,Rui Bu,Mingchao Sun,Wenzheng Chen,Baoquan Chen,Yangyan Li
Main category: cs.CV
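TL;DR: This paper proposes PointCNN++, which generalizes sparse convolution from voxels to points, combining geometric precision with computational efficiency; it significantly improves point cloud registration performance while using less memory and running faster.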
Details
Motivation: Existing convolution methods for 3D point clouds trade off geometric precision (point-based) against computational efficiency (voxel-based), which particularly hurts precision-sensitive tasks such as registration; this bottleneck needs to be overcome.
Method: Proposes a point-centric convolution whose receptive field is centered on the original high-precision points, formulates convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, and develops a dedicated GPU kernel for efficient computation.
Result: Uses an order of magnitude less memory and is several times faster than representative point-based methods; when it replaces voxel-based backbones, registration accuracy improves significantly while efficiency also improves.
Conclusion: PointCNN++ shows that preserving geometric detail and computing efficiently are compatible, opening a path toward high-fidelity, efficient 3D learning.
Abstract: Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It generalizes sparse convolution from voxels to points, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational strategy that operates natively on points. We formulate the convolution on native points as a Matrix-Vector Multiplication and Reduction (MVMR) problem, for which we develop a dedicated, highly-optimized GPU kernel. Experiments demonstrate that PointCNN++ uses an order of magnitude less memory and is several times faster than representative point-based methods. Furthermore, when used as a simple replacement for the voxel-based backbones it generalizes, it significantly improves point cloud registration accuracies while proving both more memory-efficient and faster. PointCNN++ shows that preserving geometric detail and achieving high performance are not mutually exclusive, paving the way for a new class of 3D learning with high fidelity and efficiency. Our code will be open sourced.
[287] Language-guided 3D scene synthesis for fine-grained functionality understanding
Jaime Corsetti,Francesco Giuliari,Davide Boscaini,Pedro Hermosilla,Andrea Pilzer,Guofeng Mei,Alexandros Delitzas,Francis Engelmann,Fabio Poiesi
Main category: cs.CV
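TL;DR: This paper proposes SynthFun3D, the first task-based 3D scene synthesis method, to address the scarcity of real data for 3D functionality understanding. Given an action description, it generates an indoor scene with part-level annotations and automatically identifies the correct functional element, enabling inexpensive, large-scale generation of high-quality annotated data.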
Details
Motivation: Collecting and annotating real-world data takes substantial manual effort, so research on functionality understanding in 3D scenes is held back by data scarcity.
Method: Proposes SynthFun3D, which uses a furniture asset database with part-level annotations to generate a 3D indoor environment in which the described action can be carried out, and reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element.
Result: User studies show that SynthFun3D achieves better scene-prompt coherence than other approaches; quantitative results show that the generated data can replace real data with a minor performance loss or be combined with real data to improve performance.
Conclusion: SynthFun3D provides an inexpensive, scalable data generation solution for data-hungry 3D applications.
Abstract: Functionality understanding in 3D, which aims to identify the functional element in a 3D scene to complete an action (e.g., the correct handle to "Open the second drawer of the cabinet near the bed"), is hindered by the scarcity of real-world data due to the substantial effort needed for its collection and annotation. To address this, we introduce SynthFun3D, the first method for task-based 3D scene synthesis. Given the action description, SynthFun3D generates a 3D indoor environment using a furniture asset database with part-level annotation, ensuring the action can be accomplished. It reasons about the action to automatically identify and retrieve the 3D mask of the correct functional element, enabling the inexpensive and large-scale generation of high-quality annotated data. We validate SynthFun3D through user studies, which demonstrate improved scene-prompt coherence compared to other approaches. Our quantitative results further show that the generated data can either replace real data with minor performance loss or supplement real data for improved performance, thereby providing an inexpensive and scalable solution for data-hungry 3D applications. Project page: github.com/tev-fbk/synthfun3d.
[288] Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering
Qiming Li,Xiaocheng Feng,Yixuan Ma,Zekai Ye,Ruihan Chen,Xiachong Feng,Bing Qin
Main category: cs.CV
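TL;DR: This paper proposes MRRE, a training-free inference-time method that uses representation engineering to strengthen multilingual reasoning, improving reasoning performance in low-resource languages and input-output language consistency.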
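The general mechanism behind inference-time representation engineering, adding precomputed vectors to hidden states at chosen layers, can be sketched in a few lines. Everything concrete here is hypothetical: the toy identity layers, the layer indices, the vectors, and the scaling strengths are placeholders, and the entry does not say how MRRE computes its vectors or selects layers.

```python
def add_steering(hidden, vector, strength=1.0):
    """Shift a hidden state along a precomputed direction."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def forward_with_injection(hidden, layers, injections):
    """Run `layers` in order, adding a steering vector after each layer listed in `injections`.

    `injections` maps a layer index to a (vector, strength) pair.
    """
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in injections:
            vector, strength = injections[index]
            hidden = add_steering(hidden, vector, strength)
    return hidden


def identity_layer(hidden):
    """Stand-in for a transformer block; a real model would transform `hidden` here."""
    return list(hidden)


toy_layers = [identity_layer, identity_layer, identity_layer]
toy_injections = {
    0: ([1.0, 0.0], 1.0),  # placeholder for a reasoning-enhancement vector (earlier layer)
    2: ([0.0, 2.0], 0.5),  # placeholder for an output-anchoring vector (later layer)
}
steered = forward_with_injection([0.0, 0.0], toy_layers, toy_injections)
```

The two entries of `toy_injections` mirror the sequential use of two vector types described for MRRE: the first nudges the representation early on, and the second is applied afterwards to pull the output distribution back toward the target language.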
Details
Motivation: Existing approaches to multilingual reasoning depend on costly multilingual training or external translation tools; the goal is to improve fairness and reasoning performance in low-resource languages without them.
Method: Two precomputed vectors are injected at specific layers during inference: cross-lingual reasoning enhancement vectors steer non-English reasoning representations toward the English space, and target-language output anchoring vectors restore the target-language distribution to preserve language consistency.
Result: Experiments with six advanced LLMs and LVLMs on four reasoning benchmarks show that MRRE improves non-English reasoning by 5.48% on average, and by up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
Conclusion: MRRE is an effective training-free method that significantly improves reasoning ability and language consistency in multilingual settings and has good application prospects.
Abstract: Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference processing: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.
[289] Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods
Jose Moises Araya-Martinez,Adrián Sanchis Reig,Gautham Mohan,Sarvenaz Sardari,Jens Lambrecht,Jörg Krüger
Main category: cs.CV
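TL;DR: This paper studies how well several domain randomization (DR) and domain adaptation (DA) techniques generate contextualized synthetic data without manual annotation. It finds that when render-based data with enough variability is available as a seed, simple feature-based methods (such as brightness filtering and perceptual hashing) outperform complex generative AI methods in both accuracy and resource efficiency.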
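Perceptual-hash filtering of synthetic images can be illustrated with a minimal average-hash sketch. This is an assumption-laden illustration, not the paper's pipeline: the entry does not say which hash or selection rule is used, so the rule below (keep a synthetic image if its hash is within `max_distance` bits of some real reference image) and all names are hypothetical. Images are taken to be small grayscale grids that have already been downsampled.

```python
def average_hash(gray):
    """Average hash: one bit per pixel, set when the pixel is brighter than the image mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))


def filter_by_hash(synthetic, references, max_distance):
    """Keep synthetic images whose hash lies within `max_distance` bits of any reference hash."""
    ref_hashes = [average_hash(ref) for ref in references]
    kept = []
    for image in synthetic:
        image_hash = average_hash(image)
        if min(hamming(image_hash, h) for h in ref_hashes) <= max_distance:
            kept.append(image)
    return kept


# Hypothetical 2x2 grayscale images: one real reference and two synthetic candidates.
reference = [[10, 200], [10, 200]]
synthetic_a = [[20, 180], [30, 220]]   # same dark-left, bright-right layout as the reference
synthetic_b = [[200, 10], [200, 10]]   # mirrored layout
kept = filter_by_hash([synthetic_a, synthetic_b], [reference], max_distance=1)
```

Because hashing and Hamming distance are cheap integer operations, a filter of this kind adds little overhead compared with running a generative model per image, which is in line with the resource-efficiency argument of the entry.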
Details
Motivation: Reducing the burden of data generation and annotation is a major challenge for deploying machine learning in industrial and robotics settings, and existing methods often need expert intervention to bridge the sim-to-real gap.
Method: Evaluates a range of DR and DA techniques, including feature-based methods, generative AI (GenAI), and classical rendering, with a focus on comparing low-level and high-level feature alignment; also proposes a controlled diffusion-based DA method guided by prompts generated from real-world contexts.
Result: Experiments on a proprietary industrial dataset and a public robotics dataset show that perceptual hashing performs best, with mAP50 of 98% and 67% respectively; GenAI methods do not improve mAP but add significant time overhead.
Conclusion: Given sufficiently varied synthetic data, simple feature-based filtering achieves sim-to-real transfer more efficiently, offering a viable route to training high-performing models on synthetic data alone.
Abstract: Reducing the burden of data generation and annotation remains a major challenge for the cost-effective deployment of machine learning in industrial and robotics settings. While synthetic rendering is a promising solution, bridging the sim-to-real gap often requires expert intervention. In this work, we benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering approaches, for creating contextualized synthetic data without manual annotation. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. We validate our methods on two datasets: a proprietary industrial dataset (automotive and logistics) and a public robotics dataset. Results show that if render-based data with enough variability is available as seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency. Perceptual hashing consistently achieves the highest performance, with mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively. Additionally, GenAI methods present significant time overhead for data generation with no apparent improvement of sim-to-real mAP values compared to simpler methods. Our findings offer actionable insights for efficiently bridging the sim-to-real gap, enabling high real-world performance from models trained exclusively on synthetic data.
[290] Learning to Predict Aboveground Biomass from RGB Images with 3D Synthetic Scenes
Silvia Zuffi
Main category: cs.CV
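TL;DR: This paper proposes a new method for estimating forest aboveground biomass (AGB) from a single ground-based RGB image. By predicting AGB density maps and training on the synthetic 3D SPREAD dataset, it estimates biomass accurately in dense vegetation and is scalable and low-cost.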
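The AGB density map idea can be made concrete with a small sketch: each pixel of a tree carries that tree's biomass divided by the plot area and by the tree's pixel count, so summing the map recovers biomass per unit area. The per-tree biomass values, the plot area, and the function names below are hypothetical; in the paper the per-tree values come from allometric equations, which are not reproduced here.

```python
def agb_density_map(instance_mask, tree_agb_kg, plot_area_m2):
    """Build a per-pixel AGB density map from an instance mask (0 = background)."""
    pixel_counts = {}
    for row in instance_mask:
        for tree_id in row:
            if tree_id:
                pixel_counts[tree_id] = pixel_counts.get(tree_id, 0) + 1
    return [[tree_agb_kg[tree_id] / (plot_area_m2 * pixel_counts[tree_id]) if tree_id else 0.0
             for tree_id in row]
            for row in instance_mask]


def integrate_density(density):
    """Sum the density map to recover the scene-level AGB estimate in kg/m^2."""
    return sum(sum(row) for row in density)


# Hypothetical scene: two trees on a 100 m^2 plot, with biomass values assumed given.
mask = [[1, 1, 0],
        [1, 2, 2]]
biomass_kg = {1: 300.0, 2: 100.0}
density = agb_density_map(mask, biomass_kg, plot_area_m2=100.0)
total = integrate_density(density)  # (300 + 100) kg over 100 m^2
```

Normalizing by each tree's image area makes the map's integral independent of how large a tree appears in the image, which is what lets a dense per-pixel prediction be turned back into a single scene-level estimate.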
Details
Motivation: Conventional AGB estimation relies on time-consuming field measurements or on remote sensing techniques that are limited in dense vegetation; an efficient, low-cost, and widely applicable method is needed.
Method: AGB estimation is framed as a dense prediction task with the new concept of AGB density maps. Tree attributes and instance segmentation masks from the SPREAD dataset are used to compute AGB via allometric equations, and a model is trained to predict an AGB density map from a single RGB image, which is integrated to obtain the total AGB estimate.
Result: The median error is 1.22 kg/m² on held-out SPREAD data and 1.94 kg/m² on a real-image dataset; this is the first method to estimate AGB directly from a single RGB image.
Conclusion: The method offers a scalable, interpretable, and low-cost solution for forest monitoring and may encourage citizen science participation.
Abstract: Forests play a critical role in global ecosystems by supporting biodiversity and mitigating climate change via carbon sequestration. Accurate aboveground biomass (AGB) estimation is essential for assessing carbon storage and wildfire fuel loads, yet traditional methods rely on labor-intensive field measurements or remote sensing approaches with significant limitations in dense vegetation. In this work, we propose a novel learning-based method for estimating AGB from a single ground-based RGB image. We frame this as a dense prediction task, introducing AGB density maps, where each pixel represents tree biomass normalized by the plot area and each tree's image area. We leverage the recently introduced synthetic 3D SPREAD dataset, which provides realistic forest scenes with per-image tree attributes (height, trunk and canopy diameter) and instance segmentation masks. Using these assets, we compute AGB via allometric equations and train a model to predict AGB density maps, integrating them to recover the AGB estimate for the captured scene. Our approach achieves a median AGB estimation error of 1.22 kg/m^2 on held-out SPREAD data and 1.94 kg/m^2 on a real-image dataset. To our knowledge, this is the first method to estimate aboveground biomass directly from a single RGB image, opening up the possibility for a scalable, interpretable, and cost-effective solution for forest monitoring, while also enabling broader participation through citizen science initiatives.
[291] Simultaneous Image Quality Improvement and Artefacts Correction in Accelerated MRI
Georgia Kanli,Daniele Perlo,Selma Boudissa,Radovan Jirik,Olivier Keunen
Main category: cs.CV
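TL;DR: This paper proposes USArt, a deep learning method that recovers high-quality MRI images from under-sampled data while simultaneously correcting noise and motion artefacts, achieving up to 5x acceleration and a marked improvement in image quality.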
Details
Motivation: Existing deep learning methods usually address either under-sampling or artefact correction in MRI reconstruction, whereas in practice the two often occur together; a method that handles both degradation factors at once is needed to improve diagnostic accuracy.
Method: USArt uses a dual sub-model structure, tailored to 2D brain anatomical images acquired with Cartesian sampling, to recover under-sampled data and correct noise and motion artefacts at the same time.
Result: The method performs well across different under-sampling strategies and degradation levels, with the gradient under-sampling strategy giving the best results; signal-to-noise ratio and contrast increase markedly, and acceleration of up to 5x is achieved without significant quality degradation.
Conclusion: USArt effectively recovers high-quality MRI images in real-world scenarios, handles under-sampling and multiple artefact types together, and shows good robustness and potential for clinical use.
Abstract: MR data are acquired in the frequency domain, known as k-space. Acquiring high-quality and high-resolution MR images can be time-consuming, posing a significant challenge when multiple sequences providing complementary contrast information are needed or when the patient is unable to remain in the scanner for an extended period of time. Reducing k-space measurements is a strategy to speed up acquisition, but often leads to reduced quality in reconstructed images. Additionally, in real-world MRI, both under-sampled and fully sampled images are prone to artefacts, and correcting these artefacts is crucial for maintaining diagnostic accuracy. Deep learning methods have been proposed to restore image quality from under-sampled data, while others have focused on the correction of artefacts that result from noise or motion. However, no approach has so far been proposed that addresses both acceleration and artefact correction, limiting the performance of these models when these degradation factors occur simultaneously. To address this gap, we present USArt (Under-Sampling and Artifact correction model), a method for recovering high-quality images from under-sampled data with simultaneous correction of noise and motion artefacts. Customized for 2D brain anatomical images acquired with Cartesian sampling, USArt employs a dual sub-model approach. The results demonstrate a remarkable increase in signal-to-noise ratio (SNR) and contrast in the restored images. Various under-sampling strategies and degradation levels were explored, with the gradient under-sampling strategy yielding the best outcomes. We achieved up to 5x acceleration with simultaneous artefact correction and without significant degradation, showcasing the model's robustness in real-world settings.
[292] FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting
Tianhao Xie,Linlian Jiang,Xinxin Zuo,Yang Wang,Tiberiu Popa
Main category: cs.CV
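TL;DR: This paper proposes FACT-GS, a frequency-aligned, complexity-aware texture Gaussian Splatting method that uses learnable non-uniform allocation of sampling density to improve high-frequency detail under the same parameter budget.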
Details
Motivation: Existing texture-based Gaussian Splatting uses a uniform sampling grid, which uses texture space inefficiently: high-frequency regions are under-sampled and smooth regions waste capacity, causing blur and loss of detail.
Method: Grounded in adaptive sampling theory, texture parameterization is reformulated as a differentiable sampling-density allocation problem; a learnable deformation field is introduced whose Jacobian modulates local sampling density, giving complexity-aware non-uniform sampling.
Result: Non-uniform sampling is performed on fixed-resolution texture grids, recovering noticeably sharper high-frequency detail while preserving real-time rendering performance.
Conclusion: Through a frequency-aligned sampling strategy, FACT-GS improves texture space utilization and rendering quality over conventional uniform sampling.
Abstract: Realistic scene appearance modeling has advanced rapidly with Gaussian Splatting, which enables real-time, high-quality rendering. Recent advances introduced per-primitive textures that incorporate spatial color variations within each Gaussian, improving their expressiveness. However, texture-based Gaussians parameterize appearance with a uniform per-Gaussian sampling grid, allocating equal sampling density regardless of local visual complexity. This leads to inefficient texture space utilization, where high-frequency regions are under-sampled and smooth regions waste capacity, causing blurred appearance and loss of fine structural detail. We introduce FACT-GS, a Frequency-Aligned Complexity-aware Texture Gaussian Splatting framework that allocates texture sampling density according to local visual frequency. Grounded in adaptive sampling theory, FACT-GS reformulates texture parameterization as a differentiable sampling-density allocation problem, replacing the uniform textures with a learnable frequency-aware allocation strategy implemented via a deformation field whose Jacobian modulates local sampling density. Built on 2D Gaussian Splatting, FACT-GS performs non-uniform sampling on fixed-resolution texture grids, preserving real-time performance while recovering sharper high-frequency details under the same parameter budget.
[293] A Perceptually Inspired Variational Framework for Color Enhancement
Rodrigo Palma-Amestoy,Edoardo Provenzi,Marcelo Bertalmío,Vicent Caselles
Main category: cs.CV
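TL;DR: This paper proposes a variational method for color contrast enhancement inspired by the phenomenology of color perception. It designs energy functionals that satisfy perceptual requirements, minimizes them by gradient descent, and provides a way to reduce the computational complexity.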
Details
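Motivation: The behavior of existing color correction models with respect to image features (such as contrast and dispersion) is hard to characterize, so a modeling approach better aligned with human perception is needed.
Method: Within a variational framework, the paper states a set of basic requirements that a "perceptually inspired" energy functional should satisfy, constructs three concrete functionals, and computes their minima by gradient descent; it also introduces an efficient ${\cal O}(N\log N)$ computation strategy.
Result: An explicit class of energy functionals satisfying the perceptual requirements is obtained; the three selected functionals are compared with existing models, showing similarities and differences, and the computational cost is reduced from ${\cal O}(N^2)$ to ${\cal O}(N\log N)$.
Conclusion: The variational method incorporates properties of human color perception into contrast enhancement and is both theoretically grounded and computationally feasible.
Abstract: Basic phenomenology of human color vision has been widely taken as an inspiration to devise explicit color correction algorithms. The behavior of these models in terms of significant image features (such as contrast and dispersion) can be difficult to characterize. To cope with this, we propose to use a variational formulation of color contrast enhancement that is inspired by the basic phenomenology of color perception. In particular, we devise a set of basic requirements to be fulfilled by an energy to be considered as 'perceptually inspired', showing that there is an explicit class of functionals satisfying all of them. We single out three explicit functionals that we consider of basic interest, showing similarities and differences with existing models. The minima of such functionals are computed using a gradient descent approach. We also present a general methodology to reduce the computational cost of the algorithms under analysis from ${\cal O}(N^2)$ to ${\cal O}(N\log N)$, where $N$ is the number of input pixels.
[294] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes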
Shuo Ni,Di Wang,He Chen,Haonan Guo,Ning Zhang,Jing Zhang
Main category: cs.CV
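TL;DR: This paper introduces GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, together with the GeoSeg-Bench benchmark and the unified framework UniGeoSeg, which significantly improve understanding and generalization for instruction-driven segmentation of remote sensing imagery.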
Details
Motivation: Existing instruction-driven segmentation methods for remote sensing suffer from fragmented task formulations and limited instruction data, which restricts how well models understand and generalize to complex geospatial scenes.
Method: An automatic mask filtering and instruction generation pipeline combines multiple public datasets to produce referring, interactive, and reasoning instructions, forming the GeoSeg-1M dataset (590K images, 1.1M triplets); the unified framework UniGeoSeg combines task-aware text enhancement, latent knowledge memory, and a progressive training strategy.
Result: State-of-the-art performance is achieved on the newly proposed GeoSeg-Bench and on multiple public benchmarks, along with strong zero-shot generalization.
Conclusion: GeoSeg-1M and UniGeoSeg provide an important resource and an effective method for remote sensing instruction-driven segmentation, moving the field toward more general and more capable systems.
Abstract: Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.
[295] Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Yu Zhang,Jingyi Liu,Yiwei Shi,Qi Zhang,Duoqian Miao,Changwei Wang,Longbing Cao
Main category: cs.CV
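TL;DR: This paper proposes Markov-VAR, a visual autoregressive generation model without full-context dependency. By treating each scale as a Markov state and using a sliding window to compress historical information, it significantly improves generation efficiency and performance.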
Details
Motivation: Full-context dependency supports stable representation learning but incurs high computational overhead, limiting the practicality and scalability of VAR; a more efficient method is needed.
Method: VAR is reformulated as a non-full-context Markov process with Markovian Scale Prediction: a sliding window compresses some of the previous scales into a compact history vector, which is combined with the current state to form a dynamic state that carries information forward effectively.
Result: On ImageNet, compared with the original VAR, Markov-VAR reduces FID by 10.5% at 256×256 resolution and reduces peak memory consumption by 83.8% at 1024×1024.
Conclusion: Markov-VAR greatly improves efficiency and scalability while preserving generation quality, and may serve as a foundation model for future visual autoregressive generation and other downstream tasks.
Abstract: Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: Compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 $\times$ 256) and decreases peak memory consumption by 83.8% (1024 $\times$ 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
[296] Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories
Xinxi Zhang,Shiwei Tan,Quang Nguyen,Quan Dao,Ligong Han,Xiaoxiao He,Tunyu Zhang,Alen Mrdovic,Dimitris Metaxas
Main category: cs.CV
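TL;DR: Proposes Rectified MeanFlow, which models the mean velocity field along rectified trajectories using a single reflow step. This enables efficient one-step generation without requiring perfectly straightened trajectories, and yields better sample quality and training efficiency than existing methods on ImageNet.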
Details
Motivation: Existing flow-based generative models rely on expensive numerical ODE integration at sampling time. Rectified Flow and MeanFlow both attempt one-step sampling, but the former needs multiple costly reflow iterations, and the latter converges slowly with noisy supervision when trained on highly curved flows. Method: Proposes Rectified MeanFlow, which models the mean velocity field directly on the rectified trajectory and needs only a single reflow step; a simple truncation heuristic is introduced to reduce residual curvature and improve performance. Result: Experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Conclusion: By combining the idea of rectified flow with mean-velocity modeling, Rectified MeanFlow achieves efficient one-step generation without relying on perfectly straight paths, offering a stronger training and sampling framework for flow-based models. Abstract: Flow-based generative models have recently demonstrated strong performance, yet sampling typically relies on expensive numerical integration of ordinary differential equations (ODEs). Rectified Flow enables one-step sampling by learning nearly straight probability paths, but achieving such straightness requires multiple computationally intensive reflow iterations. MeanFlow achieves one-step generation by directly modeling the average velocity over time; however, when trained on highly curved flows, it suffers from slow convergence and noisy supervision. To address these limitations, we propose Rectified MeanFlow, a framework that models the mean velocity field along the rectified trajectory using only a single reflow step. This eliminates the need for perfectly straightened trajectories while enabling efficient training. Furthermore, we introduce a simple yet effective truncation heuristic that aims to reduce residual curvature and further improve performance. Extensive experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow consistently outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Code is available at https://github.com/Xinxi-Zhang/Re-MeanFlow.
[297] A Hierarchical Computer Vision Pipeline for Physiological Data Extraction from Bedside Monitors
Vinh Chau,Khoa Le Dinh Van,Hon Huynh Ngoc,Binh Nguyen Thien,Hao Nguyen Thien,Vy Nguyen Quang,Phuc Vo Hong,Yen Lam Minh,Kieu Pham Tieu,Trinh Nguyen Thi Diem,Louise Thwaites,Hai Ho Bich
Main category: cs.CV
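TL;DR: Proposes a computer vision pipeline that uses YOLOv11 and PaddleOCR to automatically extract vital sign data from bedside monitor screens, with a geometric rectification module for added robustness. It achieves end-to-end extraction accuracy above 98.9% on 6,498 images, offering a low-cost, scalable solution for data integration in low-resource healthcare settings.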
Details
Motivation: In many low-resource healthcare settings, bedside monitors lack network connectivity, which makes it hard to integrate physiological data into electronic health record systems; a low-cost interoperability solution that does not require hardware replacement is needed. Method: YOLOv11 localises the monitor and the regions of interest (ROIs), PaddleOCR recognises the text, and a geometric rectification module standardises the screen perspective across different camera angles and lighting conditions for stable data extraction. Result: On a dataset of 6,498 images, monitor detection reaches an mAP@50-95 of 99.5%, vital sign ROI localisation reaches 91.5%, and end-to-end extraction accuracy for core physiological parameters (such as heart rate, SpO2, and blood pressure) exceeds 98.9%. Conclusion: This lightweight, camera-based method efficiently converts unstructured on-screen information into structured digital data, providing a practical and scalable way to improve clinical data accessibility and documentation quality in low-resource settings. Abstract: In many low-resource healthcare settings, bedside monitors remain standalone legacy devices without network connectivity, creating a persistent interoperability gap that prevents seamless integration of physiological data into electronic health record (EHR) systems. To address this challenge without requiring costly hardware replacement, we present a computer vision-based pipeline for the automated capture and digitisation of vital sign data directly from bedside monitor screens. Our method employs a hierarchical detection framework combining YOLOv11 for accurate monitor and region of interest (ROI) localisation with PaddleOCR for robust text extraction. To enhance reliability across variable camera angles and lighting conditions, a geometric rectification module standardises the screen perspective before character recognition. We evaluated the system on a dataset of 6,498 images collected from open-source corpora and real-world intensive care units in Vietnam. The model achieved a mean Average Precision (mAP@50-95) of 99.5% for monitor detection and 91.5% for vital sign ROI localisation. The end-to-end extraction accuracy exceeded 98.9% for core physiological parameters, including heart rate, oxygen saturation (SpO2), and arterial blood pressure.
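The mAP@50-95 figures above average the detector's precision over intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05. The following is a minimal sketch of the box-overlap test behind that metric, assuming axis-aligned boxes given as (x1, y1, x2, y2); the function names and the sample boxes are illustrative and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 over which the @50-95 average is taken.
THRESHOLDS = [round(0.50 + 0.05 * k, 2) for k in range(10)]


def matched_thresholds(pred, truth):
    """Thresholds at which a predicted box would count as a correct detection."""
    overlap = iou(pred, truth)
    return [t for t in THRESHOLDS if overlap >= t]


# Hypothetical predicted and ground-truth monitor boxes.
pred = (10, 10, 110, 60)
truth = (20, 10, 120, 60)
overlap = iou(pred, truth)
hits = matched_thresholds(pred, truth)
```

A prediction that is slightly shifted, as in this example, still counts as correct at the looser thresholds but fails the strictest ones, which is why a score averaged over 50-95 is harder to push towards 100% than a score at a single 0.50 threshold.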
These results demonstrate that a lightweight, camera-based approach can reliably transform unstructured information from screen captures into structured digital data, providing a practical and scalable pathway to improve information accessibility and clinical documentation in low-resource settings.
[298] SimScale: Learning to Drive via Real-World Simulation at Scale
Haochen Tian,Tianyu Li,Haochen Liu,Jiazhi Yang,Yihang Qiu,Guang Li,Junli Wang,Yinfeng Gao,Zhang Zhang,Liang Wang,Hangjun Ye,Tieniu Tan,Long Chen,Hongyang Li
Main category: cs.CV
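TL;DR: This paper proposes SimScale, a scalable simulation framework that uses neural rendering and a reactive environment to generate high-fidelity multi-view driving scenes, compensating for the scarcity of safety-critical and out-of-distribution scenarios in real-world data. A pseudo-expert trajectory generation mechanism provides action supervision for the synthesized states, and simple co-training on real and simulated data significantly improves the robustness and generalization of planning methods, with gains of up to +6.8 on navhard and +2.9 on navtest. The study also reports findings on pseudo-expert design and on the scaling properties of different policy architectures.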
Details
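Motivation: Safety-critical and unusual scenarios are scarce in real-world driving data, which therefore fails to cover the range of complex situations autonomous driving requires and limits model robustness and generalization. A method is needed to expand data diversity effectively, especially for out-of-distribution and high-risk scenarios. Method: The SimScale framework (1) generates diverse unseen states from existing driving logs using advanced neural rendering and a reactive environment; (2) produces high-fidelity multi-view observations controlled by a perturbed ego trajectory; (3) uses a pseudo-expert trajectory generation mechanism to provide action labels for the synthesized states; and (4) co-trains on real and simulated data to improve planning models. Result: On the challenging real-world benchmarks navhard and navtest, the method significantly improves several planning methods, raising EPDMS by up to +6.8 and +2.9 respectively; policy performance improves smoothly as simulation data increases, without additional real data; the scaling behaviour of different policy architectures and the importance of pseudo-expert design are also examined. Conclusion: By generating high-quality simulation data at scale, SimScale strengthens the robustness and generalization of autonomous driving decision models and shows that performance can keep improving by adding simulation data alone, offering a feasible path towards safer autonomous driving systems. Abstract: Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures.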
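The abstract describes co-training only as "simple" and gives no recipe, so the following is purely an illustrative sketch of one common way to mix two data pools: each batch draws a fixed fraction of simulated samples. The function name `cotrain_batches`, the `sim_fraction` parameter, and the toy sample lists are all assumptions, not details from the paper.

```python
import random


def cotrain_batches(real, sim, batch_size, sim_fraction, steps, seed=0):
    """Yield mixed batches with a fixed share of simulated samples (assumed scheme)."""
    rng = random.Random(seed)
    n_sim = round(batch_size * sim_fraction)
    n_real = batch_size - n_sim
    for _ in range(steps):
        # Draw without replacement inside a batch, then shuffle so the
        # position of a sample does not reveal which pool it came from.
        batch = rng.sample(real, n_real) + rng.sample(sim, n_sim)
        rng.shuffle(batch)
        yield batch


# Toy pools: tuples tagged with their origin stand in for driving samples.
real_pool = [("real", i) for i in range(100)]
sim_pool = [("sim", i) for i in range(400)]
batches = list(cotrain_batches(real_pool, sim_pool, batch_size=8, sim_fraction=0.25, steps=3))
```

Under this scheme, growing `sim_pool` while keeping `real_pool` fixed corresponds to the "increasing simulation data only" setting the abstract mentions.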
Our simulation data and code will be released.
[299] DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline
Rui Zhang,Hongxia Wang,Hangqing Liu,Yang Zhou,Qiang Zeng
Main category: cs.CV
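TL;DR: This paper presents DEAL-300K, a large-scale dataset of more than 300,000 annotated images for localizing regions edited by diffusion models, and builds a localization framework based on a visual foundation model with multi-frequency prompt tuning that performs strongly on pixel-level detection.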
Details
Motivation: Existing benchmark datasets mainly target binary detection of generated images or localization of manually edited regions, and do not reflect how local forgeries in diffusion-edited images blend naturally into the original content; a manipulation localization dataset and method designed specifically for diffusion-based edits is therefore needed. Method: A multi-modal large language model generates editing instructions, a mask-free diffusion editor produces the manipulated images, and an active-learning change detection pipeline yields pixel-level annotations. The proposed localization framework combines a frozen Visual Foundation Model (VFM) with Multi Frequency Prompt Tuning (MFPT), using semantic and frequency-domain cues to localize edited regions. Result: The method reaches a pixel-level F1 score of 82.56% on the DEAL-300K test split and 80.97% on the external CoCoGlide benchmark, significantly outperforming existing methods. Conclusion: DEAL-300K provides large-scale, high-quality data for diffusion-based image manipulation localization, and the proposed method effectively fuses semantic and frequency-domain features, giving future DIML research a strong baseline and a practical foundation. Abstract: Diffusion-based image editing has made semantic-level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large-scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML research. The dataset can be accessed via https://github.com/ymhzyj/DEAL-300K.
[300] VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Sinan Du,Jiahao Guo,Bo Li,Shuhao Cui,Zhengzhuo Xu,Yifu Luo,Yongxian Wei,Kun Gai,Xinggang Wang,Kai Wu,Chun Yuan
Main category: cs.CV
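TL;DR: This paper proposes VQRAE, a unified representation method based on vector quantization that supports image understanding (continuous semantic features) and visual generation (discrete tokens) in a single tokenizer while also handling fine-grained reconstruction. A two-stage training strategy and a high-dimensional semantic codebook give it strong multimodal performance and scalability.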
Details
Motivation: Existing methods usually rely on a dual-encoder architecture or contrastive learning to balance semantics and detail, and struggle to support understanding, generation, and reconstruction in one tokenizer; a new unified representation framework is therefore needed. Method: A symmetric ViT decoder is built on top of a pretrained vision model and trained in two stages: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook for pixel-level reconstruction; the second stage jointly optimizes the encoder through self-distillation. Result: VQRAE is competitive on several visual understanding, generation, and reconstruction benchmarks and scales well in the autoregressive setting; its high-dimensional semantic codebook (1536 dimensions) reaches 100% utilization, in contrast to the conventional practice of low-dimensional codebooks. Conclusion: VQRAE is the first to support continuous semantic representations and discrete generation tokens within a unified tokenizer, offering a new approach to building unified multimodal models, with good scalability and practical potential. Abstract: Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing separate encoders for understanding and generation respectively or balancing semantic representations and low-level features with contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with pixel reconstruction objective; then jointly optimizes the encoder with self-distillation constraints. This design enables negligible semantic information loss for maintaining the ability of multimodal understanding, discrete tokens that are compatible for generation and fine-grained reconstruction. Besides, we identify the intriguing property in quantizing semantic encoders that rely on high-dimensional codebook in contrast to the previous common practice of low-dimensional codebook in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536.
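To make the codebook utilization figure above concrete, here is a minimal sketch of generic nearest-neighbour vector quantization and of the utilization ratio, i.e. the share of codebook entries selected at least once. It uses 2-dimensional toy vectors rather than the 1536-dimensional codebook from the abstract, and the function names are illustrative; the paper's actual quantizer may differ.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]


def utilization(indices, codebook_size):
    """Fraction of codebook entries that are selected at least once."""
    return len(set(indices)) / codebook_size


# Toy 2-D codebook with four entries; the last one is far from all features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
features = [(0.1, 0.0), (0.9, 0.1), (0.2, 0.8), (0.0, 0.1)]

indices = quantize(features, codebook)
ratio = utilization(indices, len(codebook))
```

In this toy case one entry is never chosen, so utilization is below 100%; a fully used codebook would have every index appear in `indices` over the evaluation set.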
VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction with promising scaling property in the autoregressive paradigm for its discrete merits.
[301] MANTA: Physics-Informed Generalized Underwater Object Tracking
Suhas Srinath,Hemang Jamadagni,Aditya Chadrasekar,Prathosh AP
Main category: cs.CV
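TL;DR: MANTA is a physics-informed framework for underwater object tracking. A dual-positive contrastive learning strategy that couples temporal consistency with Beer-Lambert augmentations makes features more robust to underwater degradation and temporal change, and a multi-stage tracking pipeline fusing geometric consistency with appearance similarity achieves state-of-the-art performance on multiple benchmarks.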
Details
Motivation: Wavelength-dependent attenuation and scattering in underwater environments severely distort target appearance, so existing trackers trained on terrestrial data generalize poorly to these physics-driven degradations. Method: The MANTA framework learns features robust to underwater degradation through a dual-positive contrastive learning strategy that combines temporal consistency with Beer-Lambert augmentations. A multi-stage tracking pipeline couples motion-based tracking with a physics-informed secondary association algorithm, based on geometric consistency and appearance similarity, for re-identification under occlusion and drift. Two new metrics, CSC and GAS, are proposed to assess geometric fidelity. Result: Evaluated on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220), MANTA improves Success AUC by up to 6% while providing stable long-term tracking and efficient runtime. Conclusion: By tightly integrating physical modeling with representation learning, MANTA significantly improves tracking robustness and accuracy in complex underwater environments and offers an effective paradigm for physics-informed visual tracking. Abstract: Underwater object tracking is challenging due to wavelength-dependent attenuation and scattering, which severely distort appearance across depths and water conditions. Existing trackers trained on terrestrial data fail to generalize to these physics-driven degradations. We present MANTA, a physics-informed framework integrating representation learning with tracking design for underwater scenarios. We propose a dual-positive contrastive learning strategy coupling temporal consistency with Beer-Lambert augmentations to yield features robust to both temporal and underwater distortions. We further introduce a multi-stage pipeline augmenting motion-based tracking with a physics-informed secondary association algorithm that integrates geometric consistency and appearance similarity for re-identification under occlusion and drift. To complement standard IoU metrics, we propose Center-Scale Consistency (CSC) and Geometric Alignment Score (GAS) to assess geometric fidelity. Experiments on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220) show that MANTA achieves state-of-the-art performance, improving Success AUC by up to 6 percent, while ensuring stable long-term generalized underwater tracking and efficient runtime.
[302] DisMo: Disentangled Motion Representations for Open-World Motion Transfer
Thomas Ressler-Antal,Frank Fundel,Malek Ben Alaya,Stefan Andreas Baumann,Felix Krause,Ming Gui,Björn Ommer
Main category: cs.CV
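TL;DR: This paper proposes DisMo, a new paradigm for learning abstract motion representations from raw video data. It disentangles content from motion, supports open-world motion transfer across categories and between semantically unrelated entities, and can be combined with existing video generators to improve motion fidelity and prompt adherence, while also performing well on downstream tasks such as zero-shot action classification.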
Details
Motivation: Existing text-to-video and image-to-video models lack an explicit, independent representation of motion, which limits their use by content creators in scenarios such as motion transfer, and existing methods trade off motion fidelity against prompt adherence. Method: A generic motion representation, independent of appearance, object identity, and pose, is learned directly from raw video data through an image-space reconstruction objective, and lightweight adapters integrate it into existing video generators. Result: The method is validated on a range of motion transfer tasks and achieves high-quality transfer between different categories; on zero-shot action classification benchmarks such as Something-Something v2 and Jester it outperforms state-of-the-art video representation models such as V-JEPA. Conclusion: DisMo disentangles motion semantics from appearance and provides a generic, flexible, and extensible motion representation framework that improves the quality and controllability of motion transfer and supplies a strong pretrained representation for downstream video understanding tasks. Abstract: Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity against prompt adherence, overfitting to source structure or drifting from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks.
Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: https://compvis.github.io/DisMo
[303] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model
Junshu Tang,Jiacheng Liu,Jiaqi Li,Longhuang Wu,Haoyu Yang,Penghao Zhao,Siruis Gong,Xiang Yuan,Shuai Shao,Qinglin Lu
Main category: cs.CV
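TL;DR: Hunyuan-GameCraft-2 proposes an instruction-driven paradigm for generative game world modeling that supports flexible interaction through natural language, keyboard, or mouse signals. It builds causally aligned interactive datasets from large-scale text-video pairs and achieves fine-grained control on top of a 14B MoE model, markedly improving the dynamic responsiveness of open-world game environments.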
Details
Motivation: Existing generative world models are limited by fixed action schemas and high annotation costs, making it hard to support diverse in-game interactions and player-driven dynamics. Method: The paper proposes an instruction-driven interaction paradigm, defines the concept of interactive video data, builds an automated pipeline that turns unstructured text-video pairs into causally aligned interactive datasets, and adds a text-driven interaction injection mechanism to a 14B image-to-video MoE foundation model for fine-grained control over camera motion, character behavior, and environment dynamics. Result: The model generates temporally coherent and causally plausible interactive game videos, responds accurately to diverse free-form instructions such as "open the door", "draw a torch", or "trigger an explosion", and performs well on the newly proposed InterBench benchmark. Conclusion: Hunyuan-GameCraft-2 enables more flexible and semantically richer interaction with game worlds, moving generative world models towards open-ended, user-driven dynamic simulation. Abstract: Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video contents through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts (MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively.
Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as "open the door", "draw a torch", or "trigger an explosion".
[304] Object-Centric Data Synthesis for Category-level Object Detection
Vikhyat Agarwal,Jiayi Cora Guo,Declan Hoban,Sissi Zhang,Nicholas Moran,Peter Cho,Srilakshmi Pattabiraman,Shantanu Joshi
Main category: cs.CV
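TL;DR: Proposes a new setting in which only limited object-centric data (such as multi-view images or 3D models) is available, fine-tunes object detection models for novel categories using four data synthesis methods, and validates their effectiveness on real-world scenes.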
Details
Motivation: Extending existing object detection models to new categories requires large amounts of annotated data, which is costly and especially problematic for long-tailed classes; a method that still improves performance under data constraints is needed. Method: Starting from object-centric data, four data synthesis methods based on simple image processing, 3D rendering, and image diffusion models generate synthetic images with varied context for fine-tuning detection models. Result: Experiments show that these synthesis methods significantly improve detection performance on novel categories and enable category-level generalization on real-world data. Conclusion: When object-centric data is limited, fine-tuning on synthetic data is an efficient and feasible way to extend the categories a detector covers, particularly where annotated data is scarce. Abstract: Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capability to new object classes requires large amounts of annotated training data, which is costly and time-consuming to acquire, especially for long-tailed classes with insufficient representation in existing datasets. Here, we introduce the object-centric data setting, in which limited data is available in the form of object-centric data (multi-view images or 3D models), and systematically evaluate the performance of four different data synthesis methods to finetune object detection models on novel object categories in this setting. The approaches are based on simple image processing techniques, 3D rendering, and image diffusion models, and use object-centric data to synthesize realistic, cluttered images with varying contextual coherence and complexity. We assess how these methods enable models to achieve category-level generalization in real-world data, and demonstrate significant performance boosts within this data-constrained experimental setting.
[305] Visual Generation Tuning
Jiahao Guo,Sinan Du,Jingfeng Yao,Wenyu Liu,Bo Li,Haoxiang Cao,Kun Gai,Chun Yuan,Kai Wu,Xinggang Wang
Main category: cs.CV
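TL;DR: This paper proposes VGT (Visual Generation Tuning), a new method for unlocking the latent visual generation ability of large vision language models (VLMs). By efficiently aligning the semantic encoder of a pretrained VLM with the latent representation of a pixel decoder, it achieves fast, high-quality image generation and reconstruction and reaches state-of-the-art results on several benchmarks.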
Details
Motivation: To explore whether vision language models optimized for multimodal understanding have an inherent capacity for visual generation, and to reduce the training cost and convergence time of visual generation models. Method: The VGT framework drops the conventional pixel-level VAE and aligns the semantic encoder of a pretrained VLM with the latent space of a pixel decoder, enabling autoregressive modeling in continuous space; an efficient visual generation tuning strategy speeds up training and improves generation quality. Result: On image reconstruction, VGT reaches 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; on visual generation, it scores 0.77 on GenEval and 78.73 on DPG-Bench, the best among autoregressive models; training converges 20x faster and scales well. Conclusion: VGT unlocks the visual generation potential of multimodal understanding models and opens a new path towards next-generation unified multimodal foundation models. Abstract: Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLM trained for multimodal understanding with the capabilities of visual generation, which paves a new avenue to explore next-generation unified multimodal foundation models.
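For readers unfamiliar with the reconstruction metric quoted above, PSNR is defined as 10·log10(MAX² / MSE), where MSE is the mean squared error between the original and reconstructed pixels and MAX is the largest possible pixel value. The sketch below computes it for flat lists of 8-bit values; the sample pixels are made up for illustration and have no connection to the paper's data.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 8-bit pixel values; every reconstructed pixel is off by 2.
original = [0, 128, 255, 64]
reconstructed = [2, 126, 253, 66]
score = psnr(original, reconstructed)
```

Higher PSNR means a closer pixel-level match, whereas rFID (reported alongside it) compares feature statistics of reconstructed and real images and is lower when better.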
Models and codes are available at https://github.com/hustvl/VGT.
[306] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
Zhizhou Zhong,Yicheng Ji,Zhe Kong,Yiying Liu,Jiarui Wang,Jiasun Feng,Lupeng Liu,Xiangyi Wang,Yanjia Li,Yuqing She,Ying Qin,Huan Li,Shuiyang Mao,Wei Liu,Wenhan Luo
Main category: cs.CV
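TL;DR: This paper proposes AnyTalker, a scalable multi-person video generation framework. An identity-aware attention mechanism and a training strategy based on single-person videos enable high-quality multi-person talking videos with natural interaction, while reducing data cost and making it easier to scale the number of identities.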
Details
Motivation: Existing audio-driven multi-person talking video generation methods are limited by the high cost of collecting diverse multi-person data and by the difficulty of producing coherent interaction among multiple identities, so a more efficient and scalable solution is needed. Method: The AnyTalker framework extends the Diffusion Transformer with a new identity-aware attention mechanism that iteratively processes identity-audio pairs; training uses only single-person videos to learn multi-person speaking patterns, and a small number of real multi-person clips refine interactivity. Result: Experiments show that AnyTalker performs well on lip synchronization, visual quality, and naturalness of interaction, and supports an arbitrary number of drivable identities at low data cost. Conclusion: Through a new architecture and an efficient training strategy, AnyTalker achieves scalable, low-cost, natural multi-person video generation and advances multi-person conversational scenarios. Abstract: Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.
[307] Video-CoM: Interactive Video Reasoning via Chain of Manipulations
Hanoona Rasheed,Mohammed Zumri,Muhammad Maaz,Ming-Hsuan Yang,Fahad Shahbaz Khan,Salman Khan
Main category: cs.CV
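TL;DR: This paper proposes Interactive Video Reasoning, a new video understanding paradigm in which the model "thinks with videos" through a sequence of visual manipulations, improving fine-grained spatio-temporal reasoning.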
Details
Motivation: Existing video understanding models treat visual input as static context and cannot dynamically rewatch or verify evidence, so they underperform on tasks that require fine-grained spatio-temporal reasoning. Method: The Video-CoM model performs iterative visual actions through a Chain of Manipulations (CoM) to gather and refine evidence; an 18K-sample dataset, Video-CoM-Instruct, is constructed, and the policy is optimized with reinforcement learning using Group Relative Policy Optimization (GRPO) with step-level rewards. Result: Across nine video reasoning benchmarks, average performance improves by 3.6% over current state-of-the-art models using only 25K SFT and 3K GRPO samples, which is more training-efficient; ablations show that reasoning-aware rewards improve accuracy and interpretability. Conclusion: The Interactive Video Reasoning paradigm lets models actively manipulate video content for deeper reasoning, significantly improving video understanding and pointing to a new direction for multimodal reasoning. Abstract: Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still "think about videos", i.e., once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine-grained spatio-temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to "think with videos". Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video-CoM-Instruct, an 18K instruction-tuning dataset curated for multi-step manipulation reasoning. Beyond supervised learning, we further optimize the manipulation policy via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO). Unlike prior work that relies solely on sparse answer rewards, our method introduces step-level reasoning rewards, guiding the model toward grounded and consistent reasoning. Video-CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state-of-the-art models, while training on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large-scale models. Ablation studies demonstrate that reasoning-aware rewards improve both accuracy and interpretability.
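The group-relative part of GRPO can be illustrated with a short sketch: several responses are sampled for the same prompt, and each response's advantage is its reward standardized against the group's mean and standard deviation. How the paper combines answer and step-level rewards is not stated in the abstract, so `combined_reward` and its `step_weight` are hypothetical, and the use of the population standard deviation is likewise an assumption.

```python
import statistics


def combined_reward(answer_correct, step_scores, step_weight=0.5):
    """Hypothetical reward: sparse answer reward plus a weighted mean step score."""
    step_term = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return float(answer_correct) + step_weight * step_term


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one video question (made-up scores).
rewards = [
    combined_reward(True, [1.0, 1.0]),   # right answer, every step grounded
    combined_reward(True, [0.0, 1.0]),   # right answer, one weak step
    combined_reward(False, [1.0, 0.0]),  # wrong answer, partly grounded
    combined_reward(False, []),          # wrong answer, no usable steps
]
advantages = group_relative_advantages(rewards)
```

With an answer-only reward the first two responses would receive the same advantage; adding a step-level term separates them, which is the kind of denser signal the abstract attributes to step-level reasoning rewards.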
Code: https://github.com/mbzuai-oryx/Video-CoM
[308] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
Muhammad Maaz,Hanoona Rasheed,Fahad Shahbaz Khan,Salman Khan
Main category: cs.CV
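TL;DR: This paper proposes a new reinforcement learning method to improve temporal grounding and reasoning consistency in multimodal large models for video reasoning. Using two diagnostic metrics, Think Answer Consistency (TAC) and Video Attention Score (VAS), it finds that existing models rely too heavily on language priors rather than visual content. The proposed Video-R2 model achieves higher TAC, VAS, and accuracy on multiple benchmarks.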