cs.CL

[1] Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective

Zhiqiang Kou,Junyang Chen,Xin-Qiang Cai,Ming-Kun Xie,Biao Liu,Changwei Wang,Lei Feng,Yuheng Jia,Gang Niu,Masashi Sugiyama,Xin Geng

Main category: cs.CL

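TL;DR: This paper proposes a new multi-label approach to toxicity detection. It builds three multi-label benchmark datasets (Q-A-MLL, R-A-MLL, H-X-MLL) and uses pseudo-labeling to address the limitations of single-label annotation in existing toxicity detection, significantly improving the accuracy and reliability of toxicity identification in LLM-generated content.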

Details

Motivation: Existing toxicity detection relies mainly on single-label benchmarks, which cannot adequately capture the ambiguity and multi-dimensional nature of real-world toxic prompts. This biases evaluation, and the high cost of fine-grained multi-label annotation limits effective evaluation and model development.

Method: Introduces three multi-label benchmarks (Q-A-MLL, R-A-MLL, H-X-MLL), derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy, and develops a pseudo-label-based toxicity detection method. A theoretical proof shows that, on the released datasets, training with pseudo-labels outperforms learning directly from single-label supervision.

Result: Experiments show that the method significantly outperforms advanced baselines, including GPT-4o and DeepSeek, on multi-label toxicity detection, enabling more accurate and reliable evaluation of toxicity in LLM-generated content.

Conclusion: The proposed multi-label benchmarks and pseudo-label method effectively address the evaluation bias and annotation cost problems of existing toxicity detection, offering a more reliable technical path for LLM safety evaluation.

Abstract: Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
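The evaluation bias described in this abstract can be illustrated with a small sketch. It scores the same predictions against a multi-label gold standard and against a single-label reduction of it. The category names and example data are hypothetical (the paper's 15-category taxonomy is not listed in this digest), and micro-averaged F1 is an assumed metric, not necessarily the one the authors report.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-example label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical categories, for illustration only.
multi_gold = [{"hate", "violence"}, {"self_harm"}, {"hate", "harassment"}]
# A single-label benchmark keeps only one category per prompt.
single_gold = [{"hate"}, {"self_harm"}, {"hate"}]
# Predictions from some detector (made-up values).
pred = [{"hate", "violence"}, {"self_harm"}, {"harassment"}]

score_multi = micro_f1(multi_gold, pred)    # correct "violence" counts as a hit
score_single = micro_f1(single_gold, pred)  # correct "violence" counts as a false positive
print(f"multi-label gold:  {score_multi:.3f}")
print(f"single-label gold: {score_single:.3f}")
```

Under the single-label gold standard, the detector is penalized for labels that are in fact correct, which is one way the false positives mentioned in the abstract can arise.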

[2] Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek

Enis Oğuz

Main category: cs.CL

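TL;DR: This study evaluates generative AI models (ChatGPT, Gemini, and Deepseek) on automatic scoring of student essays with and without idioms. All models score consistently; Gemini performs best in agreement with human raters and in handling figurative language, showing potential for scoring essays on its own.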

Details

Motivation: Generative AI is increasingly viewed as a competitor to traditional AES systems in automated essay scoring, yet AI may be limited in processing non-literal language such as idioms. This study investigates how different generative AI models differ when scoring student essays with and without idioms.

Method: Two equal-sized essay lists were built from a corpus of 348 student essays: one in which every essay contains multiple idioms and one with no idioms. Using the same rubric as the human raters, ChatGPT, Gemini, and Deepseek each scored both lists three times. Corpus-linguistic and computational-linguistic methods were combined to analyze scoring consistency, correlation with human raters, and possible bias against particular groups.

Result: All AI models showed excellent scoring consistency. Gemini outperformed the others in interrater reliability with human raters, and no bias against any demographic group was detected. For essays with multiple idioms, Gemini's scoring pattern was closest to that of human raters.

Conclusion: All three models show potential for a hybrid approach to automated essay scoring, but Gemini is the best candidate for the task because of its strength with figurative language, and it shows promise for handling essay scoring alone in the future.

Abstract: The developments in Generative AI technologies have paved the way for numerous innovations in different fields. Recently, Generative AI has been proposed as a competitor to AES systems in evaluating student essays automatically. Considering the potential limitations of AI in processing idioms, this study assessed the scoring performances of Generative AI models for essays with and without idioms by incorporating insights from Corpus Linguistics and Computational Linguistics. Two equal essay lists were created from 348 student essays taken from a corpus: one with multiple idioms present in each essay and another with no idioms in essays. Three Generative AI models (ChatGPT, Gemini, and Deepseek) were asked to score all essays in both lists three times, using the same rubric used by human raters in assigning essay scores. The results revealed excellent consistency for all models, but Gemini outperformed its competitors in interrater reliability with human raters. There was also no detectable bias for any demographic group in AI assessment. For essays with multiple idioms, Gemini followed the most similar pattern to human raters. While the models in the study demonstrated potential for a hybrid approach, Gemini was the best candidate for the task due to its ability to handle figurative language and showed promise for handling essay-scoring tasks alone in the future.

[3] A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling

Shiyu Ji,Farnoosh Hashemi,Joice Chen,Juanwen Pan,Weicheng Ma,Hefan Zhang,Sophia Pan,Ming Cheng,Shubham Mohole,Saeed Hassanpour,Soroush Vosoughi,Michael Macy

Main category: cs.CL

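TL;DR: Proposes a framework that uses large language models to automatically generate and label synthetic debate data for analyzing rhetorical strategies, and uses fine-tuned classifiers to validate its high performance and strong generalization across domains.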

Details

Motivation: Existing analysis of rhetorical strategies relies on human annotation, which is costly and hard to scale, and datasets are limited to specific topics and strategies, which hinders the development of robust models.

Method: Based on a four-part rhetorical typology (causal, empirical, emotional, moral), LLMs generate and label synthetic debate data. Transformer-based classifiers are fine-tuned on this data and validated against human-labeled data and multiple external corpora.

Result: The model achieves high performance and strong cross-domain generalization on multiple datasets. Two applications are shown: improved persuasiveness prediction, and an analysis of temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), which finds increased use of affective argument.

Conclusion: LLM-generated synthetic data can effectively support the training of rhetorical strategy recognition models, offering a feasible and efficient route to large-scale, cross-domain rhetorical analysis.

Abstract: Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, and difficult to scale. Their associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We propose a novel framework that leverages large language models (LLMs) to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral). We fine-tune transformer-based classifiers on this LLM-labeled dataset and validate their performance against human-labeled data on this dataset and on multiple external corpora. Our model achieves high performance and strong generalization across topical domains. We illustrate two applications with the fine-tuned model: (1) the improvement in persuasiveness prediction from incorporating rhetorical strategy labels, and (2) an analysis of temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), revealing increased use of affective over cognitive argument.

[4] Continual Learning via Sparse Memory Finetuning

Jessy Lin,Luke Zettlemoyer,Gargi Ghosh,Wen-Tau Yih,Aram Markosyan,Vincent-Pierre Berges,Barlas Oğuz

Main category: cs.CL

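TL;DR: This paper proposes sparse memory finetuning, which updates only memory slots to reduce interference between new knowledge and existing capabilities, significantly reducing catastrophic forgetting and enabling continual learning.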

Details

Motivation: Because trainable parameters are shared across all tasks, a model updated on new data tends to forget previously acquired knowledge. Sparse parameter updates are explored as a way to mitigate this.

Method: Introduces sparse memory finetuning, which leverages memory layer models and updates only the memory slots that are highly activated by new knowledge relative to their usage on pretraining data.

Result: On two question answering tasks, sparse memory finetuning acquires new knowledge with less forgetting than full finetuning and LoRA: NaturalQuestions F1 drops by 89% after full finetuning and by 71% with LoRA, but by only 11% with this method.

Conclusion: Sparsity in memory layers offers a promising direction for continual learning in large language models.

Abstract: Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities. Motivated by the intuition that mitigating forgetting is challenging because trainable parameters are shared across all tasks, we investigate whether sparse parameter updates can enable learning without catastrophic forgetting. We introduce sparse memory finetuning, leveraging memory layer models (Berges et al., 2024), which are sparsely updated by design. By updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data, we reduce interference between new knowledge and the model's existing capabilities. We evaluate learning and forgetting compared to full finetuning and parameter-efficient finetuning with LoRA on two question answering tasks. We find that sparse memory finetuning learns new knowledge while exhibiting substantially less forgetting: while NaturalQuestions F1 drops by 89% after full finetuning on new facts and 71% with LoRA, sparse memory finetuning yields only an 11% drop with the same level of new knowledge acquisition. Our results suggest sparsity in memory layers offers a promising path toward continual learning in large language models.
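The slot-selection idea in this abstract ("highly activated by a new piece of knowledge relative to usage on pretraining data") might be sketched as a ranking over access counts. This is a minimal illustration under assumptions: the ratio-style score, the add-one smoothing, the function name `select_slots`, and all counts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def select_slots(new_hits, pretrain_hits, top_t):
    """Rank memory slots by how much more often they fire on new data
    than on a background (pretraining) sample, and keep the top_t."""
    new_total = sum(new_hits.values())
    pre_total = sum(pretrain_hits.values())

    def score(slot):
        new_rate = new_hits[slot] / new_total
        # Add-one smoothing keeps the ratio finite for slots unseen in the background sample.
        pre_rate = (pretrain_hits.get(slot, 0) + 1) / (pre_total + 1)
        return new_rate / pre_rate

    # Sort by descending score; break ties by slot id for determinism.
    ranked = sorted(new_hits, key=lambda s: (-score(s), s))
    return ranked[:top_t]


# Made-up access counts: slot 3 is busy everywhere, slot 7 is specific to the new fact.
new_hits = Counter({7: 40, 3: 35, 12: 5})
pretrain_hits = Counter({3: 900, 7: 10, 12: 90})

trainable = select_slots(new_hits, pretrain_hits, top_t=1)
print(trainable)
```

In a training loop, gradients for all slots outside `trainable` would be masked, so a slot that is heavily used for general knowledge (slot 3 here) stays untouched even though the new data also activates it.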

[5] Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks

Kirill Semenov,Rico Sennrich

Main category: cs.CL

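TL;DR: This paper studies the problem of template translation in multilingual factual knowledge evaluation, proposes whole-sentence translation to improve grammaticality and evaluation accuracy, and validates the approach on multiple languages.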

Details

Motivation: Template translations used in existing benchmarks such as MLAMA ignore the grammatical and semantic information of the inserted named entities, producing many ungrammatical or badly worded prompts. This complicates the interpretation of scores, especially for morphologically rich languages.

Method: Four Slavic languages are sampled from MLAMA, and knowledge retrieval scores are compared between the original templated translations and sentence-level translations produced by Google Translate and ChatGPT. The analysis is then extended to five more languages from different families.

Result: Sentence-level translation significantly increases knowledge retrieval scores, and similar patterns are observed across languages.

Conclusion: The community is encouraged to control grammaticality when building highly multilingual datasets; whole-sentence translation with neural MT or LLM systems is a good approximation.

Abstract: For multilingual factual knowledge assessment of LLMs, benchmarks such as MLAMA use template translations that do not take into account the grammatical and semantic information of the named entities inserted in the sentence. This leads to numerous instances of ungrammaticality or wrong wording of the final prompts, which complicates the interpretation of scores, especially for languages that have a rich morphological inventory. In this work, we sample 4 Slavic languages from the MLAMA dataset and compare the knowledge retrieval scores between the initial (templated) MLAMA dataset and its sentence-level translations made by Google Translate and ChatGPT. We observe a significant increase in knowledge retrieval scores, and provide a qualitative analysis for possible reasons behind it. We also make an additional analysis of 5 more languages from different families and see similar patterns. Therefore, we encourage the community to control the grammaticality of highly multilingual datasets for higher and more interpretable results, which is well approximated by whole sentence translation with neural MT or LLM systems. The dataset and all related code are published in the GitHub repository: https://github.com/ZurichNLP/Fluent-mLAMA.

[6] Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis

Alexander Brady,Tunazzina Islam

Main category: cs.CL

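TL;DR: Proposes an end-to-end LLM-based framework that automatically generates an interpretable topic taxonomy from an unlabeled corpus of political ads on social media, revealing the structure, moral framing, and targeting of political discourse ahead of the 2024 U.S. election.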

Details

Motivation: Political discourse on social media is vast and evolves quickly, which traditional analysis methods struggle to handle. Existing topic modeling methods lack interpretability and cannot capture moral framing, so an automated solution that needs no domain expertise is required.

Method: Combines unsupervised clustering with prompt-based LLM labeling to iteratively build an interpretable topic taxonomy. The framework is applied to a corpus of Meta political ads from the month before the 2024 U.S. Presidential election, and topics are annotated with moral foundation dimensions.

Result: Voting and immigration ads account for the most spending and impressions, while abortion and election-integrity issues achieve disproportionate reach. Funding patterns are highly polarized, with different issues driven by different political groups. Abortion ads emphasize liberty/oppression narratives, while economic ads blend several moral narratives. Topic salience correlates strongly with moral foundations, and clear demographic targeting patterns are identified.

Conclusion: The framework supports scalable, interpretable analysis of political messaging on social media, helping researchers, policymakers, and the public understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.

Abstract: Social media platforms play a pivotal role in shaping political discourse, but analyzing their vast and rapidly evolving content remains a major challenge. We introduce an end-to-end framework for automatically generating an interpretable topic taxonomy from an unlabeled corpus. By combining unsupervised clustering with prompt-based labeling, our method leverages large language models (LLMs) to iteratively construct a taxonomy without requiring seed sets or domain expertise. We apply this framework to a large corpus of Meta (previously known as Facebook) political ads from the month ahead of the 2024 U.S. Presidential election. Our approach uncovers latent discourse structures, synthesizes semantically rich topic labels, and annotates topics with moral framing dimensions. We show quantitative and qualitative analyses to demonstrate the effectiveness of our framework. Our findings reveal that voting and immigration ads dominate overall spending and impressions, while abortion and election-integrity achieve disproportionate reach. Funding patterns are equally polarized: economic appeals are driven mainly by conservative PACs, abortion messaging splits between pro- and anti-rights coalitions, and crime-and-justice campaigns are fragmented across local committees. The framing of these appeals also diverges: abortion ads emphasize liberty/oppression rhetoric, while economic messaging blends care/harm, fairness/cheating, and liberty/oppression narratives. Topic salience further reveals strong correlations between moral foundations and issues. Demographic targeting also emerges. This work supports scalable, interpretable analysis of political messaging on social media, enabling researchers, policymakers, and the public to better understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.

[7] FarsiMCQGen: a Persian Multiple-choice Question Generation Framework

Mohammad Heydari Rad,Rezvan Afari,Saeedeh Momtazi

Main category: cs.CL

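TL;DR: This paper proposes FarsiMCQGen, a new method for generating Persian multiple-choice questions (MCQs). It combines candidate generation, filtering, and ranking, and uses Transformer models, knowledge graphs, and rule-based methods to produce high-quality distractors. A new Persian MCQ dataset of 10,289 questions is built from Wikipedia data and its quality is checked with several state-of-the-art LLMs; the results indicate that the method is effective and has research potential.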

Details

Motivation: Automatically generating high-quality MCQs in low-resource languages such as Persian remains challenging, and existing methods struggle to produce realistic, challenging distractors, so a more effective MCQ generation framework is needed.

Method: Combines candidate generation, filtering, and ranking strategies, integrating Transformer models, knowledge graphs, and rule-based methods to generate and select distractors, and builds a Persian MCQ dataset from Wikipedia content.

Result: A dataset of 10,289 Persian MCQs was built. Experiments show that the model generates answer choices resembling those in real exam questions, and evaluation with several state-of-the-art LLMs supports the dataset's quality and usability.

Conclusion: FarsiMCQGen can generate high-quality Persian MCQs, and the dataset is a new resource for automatic question generation in low-resource languages that may stimulate further research.

Abstract: Multiple-choice questions (MCQs) are commonly used in educational testing, as they offer an efficient means of evaluating learners' knowledge. However, generating high-quality MCQs, particularly in low-resource languages such as Persian, remains a significant challenge. This paper introduces FarsiMCQGen, an innovative approach for generating Persian-language MCQs. Our methodology combines candidate generation, filtering, and ranking techniques to build a model that generates answer choices resembling those in real MCQs. We leverage advanced methods, including Transformers and knowledge graphs, integrated with rule-based approaches to craft credible distractors that challenge test-takers. Our work is based on data from Wikipedia, which includes general knowledge questions. Furthermore, this study introduces a novel Persian MCQ dataset comprising 10,289 questions. This dataset is evaluated by different state-of-the-art large language models (LLMs). Our results demonstrate the effectiveness of our model and the quality of the generated dataset, which has the potential to inspire further research on MCQs.

[8] Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

Junlin Wu,Xianrui Zhong,Jiashuo Sun,Bolian Li,Bowen Jin,Jiawei Han,Qingkai Zeng

Main category: cs.CL

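TL;DR: Proposes Structure-R1, a framework that uses reinforcement learning to transform retrieved content into dynamically generated structured representations and verifies structure quality with a self-reward mechanism, significantly improving LLM reasoning on knowledge-intensive tasks.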

Details

Motivation: Existing RAG systems rely on unstructured text with low information density, which hurts reasoning. A method is needed that dynamically generates high-quality structured knowledge representations according to the demands of multi-step reasoning.

Method: Structure-R1 trains a content representation policy with reinforcement learning to dynamically generate task-specific structural formats, and introduces a self-reward structural verification mechanism to ensure that the structures are correct and complete.

Result: Experiments on seven knowledge-intensive benchmarks show that, with a 7B-scale model, performance is comparable to that of much larger models; structured representations improve information density and contextual clarity.

Conclusion: Structured representations can significantly strengthen language model reasoning, and Structure-R1 offers a new paradigm for knowledge augmentation.

Abstract: Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose Structure-R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, Structure-R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: https://github.com/jlwu002/sr1.

[9] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

Yuatyong Chaichana,Pittawat Taveekitworachai,Warit Sirichotedumrong,Potsawee Manakul,Kunat Pipatanakul

Main category: cs.CL

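TL;DR: This paper proposes two methods, Partial YaRN and VLAT, to extend the audio context window of large audio-language models and enable effective long-form audio understanding.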

Details

Motivation: Existing large audio-language models (LALMs) are limited by short audio context windows and struggle with long-form audio understanding, even though their text backbones usually support long contexts, which leaves that potential untapped.

Method: First, Partial YaRN, a training-free audio context extension method that modifies only audio token positions. Second, Virtual Longform Audio Training (VLAT), which simulates audio inputs of varying lengths during training to improve generalization to audio far longer than that seen in training.

Result: Experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across many settings, and training with VLAT yields further substantial gains in long-audio understanding, particularly on unseen audio lengths.

Conclusion: Partial YaRN and VLAT effectively address the limited audio context of LALMs, enabling long-form audio understanding either without additional training or with training-time augmentation, with broad application prospects.

Abstract: Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
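The idea of modifying only audio token positions can be illustrated with a simplified stand-in. The sketch below applies uniform position interpolation to audio spans and leaves text spacing at one position per token. It is not Partial YaRN itself: YaRN-style methods rescale RoPE frequencies rather than position ids, and that machinery is not reproduced here. The function name, segment format, and `audio_window` parameter are hypothetical.

```python
def partial_positions(segments, audio_window):
    """Assign position ids to a mixed text/audio sequence.

    segments: list of (kind, length) pairs, where kind is "text" or "audio".
    Audio spans longer than audio_window are squeezed into audio_window
    positions; text tokens always advance by exactly one position.
    """
    positions = []
    cursor = 0.0
    for kind, length in segments:
        if kind == "audio" and length > audio_window:
            step = audio_window / length  # fractional step compresses the span
        else:
            step = 1.0
        for i in range(length):
            positions.append(cursor + i * step)
        cursor += length * step
    return positions


# Two text tokens, eight audio tokens squeezed into a window of four, two text tokens.
segs = [("text", 2), ("audio", 8), ("text", 2)]
pos = partial_positions(segs, audio_window=4)
print(pos)
```

The text tokens on either side keep unit spacing, which mirrors the stated goal of preserving the base LLM's text behaviour while the audio span is made to fit the range seen in training.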

[10] Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

Lina Berrayana,Ahmed Heakl,Muhammad Abdullah Sohail,Thomas Hofmann,Salman Khan,Wei Chen

Main category: cs.CL

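TL;DR: This study explores hybrid architectures that combine discrete diffusion language models (DDLMs) with autoregressive models (ARMs), finding that latent-space communication significantly improves accuracy and that the hybrid setup greatly reduces computational cost while maintaining accuracy.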

Details

Motivation: Autoregressive models are accurate but slow and costly. Discrete diffusion models generate in parallel and suit complex reasoning, but have limitations in text generation. The study explores whether collaboration between the two yields complementary benefits.

Method: Collaboration is first examined in text space, where one model plans and the other executes the answer. The setup is then extended to latent-space communication, with a learned projector that maps DDLM latents into the ARM's embedding space for more efficient collaborative reasoning.

Result: Moving from text-space to latent-space communication significantly improves accuracy (for example, from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24). A latent-space pipeline that uses 64 tokens for planning and roughly 5 for execution surpasses Qwen3.1-7B on DART-5 and AIME, although the latter uses 44 times more tokens.

Conclusion: Hybrid DDLM and ARM architectures, especially those based on latent-space communication, can achieve higher accuracy at lower computational cost on complex reasoning tasks, showing the potential of diffusion models in collaborative systems.

Abstract: Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM's embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM-to-ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.

[11] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

Sensen Gao,Shanshan Zhao,Xu Jiang,Lunhao Duan,Yong Xien Chng,Qing-Guo Chen,Weihua Luo,Kaifu Zhang,Jia-Wang Bian,Mingming Gong

Main category: cs.CL

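TL;DR: This paper systematically surveys multimodal retrieval-augmented generation (Multimodal RAG) for document understanding, proposes a taxonomy, summarizes datasets, benchmarks, and applications, and identifies open challenges in efficiency, fine-grained representation, and robustness.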

Details

Motivation: Existing document understanding methods are limited in preserving structural detail and in modeling context: OCR pipelines lose structural information, multimodal LLMs struggle to model context effectively, and traditional RAG has difficulty with multimodal documents in which text, tables, charts, and layout coexist.

Method: Proposes a taxonomy based on domain, retrieval modality, and granularity, systematically reviews advances in Multimodal RAG that involve graph structures and agentic frameworks, and summarizes relevant datasets, benchmarks, and application scenarios.

Result: Establishes a systematic classification framework for Multimodal RAG, organizes current technical progress and resources, and identifies key challenges in efficiency, fine-grained representation, and robustness.

Conclusion: Multimodal RAG is a key direction for achieving comprehensive document intelligence, and the survey provides a clear roadmap for the future development of document AI.

Abstract: Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.

[12] TraceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration

Mucheng Ren,He Chen,Yuchen Yan,Danqing Hu,Jun Xu,Xian Zeng

Main category: cs.CL

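TL;DR: Proposes TraceCoder, a new framework that integrates multi-source external knowledge (such as UMLS, Wikipedia, and large language models) to improve the traceability and explainability of ICD coding, achieving state-of-the-art performance on multiple datasets.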

Details

Motivation: Existing automated ICD coding methods face semantic gaps, poor performance on rare and long-tail codes, and a lack of interpretability.

Method: TraceCoder dynamically integrates multiple external knowledge sources to enrich code representations, and introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge.

Result: Achieves state-of-the-art performance on the MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets; ablation studies validate the effectiveness of each component.

Conclusion: TraceCoder provides a scalable and robust solution for automated ICD coding that meets clinical needs for accuracy, interpretability, and reliability.

Abstract: Automated International Classification of Diseases (ICD) coding assigns standardized diagnosis and procedure codes to clinical records, playing a critical role in healthcare systems. However, existing methods face challenges such as semantic gaps between clinical text and ICD codes, poor performance on rare and long-tail codes, and limited interpretability. To address these issues, we propose TraceCoder, a novel framework integrating multi-source external knowledge to enhance traceability and explainability in ICD coding. TraceCoder dynamically incorporates diverse knowledge sources, including UMLS, Wikipedia, and large language models (LLMs), to enrich code representations, bridge semantic gaps, and handle rare and ambiguous codes. It also introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge, improving long-tail code recognition and making predictions interpretable by grounding them in external evidence. Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets demonstrate that TraceCoder achieves state-of-the-art performance, with ablation studies validating the effectiveness of its components. TraceCoder offers a scalable and robust solution for automated ICD coding, aligning with clinical needs for accuracy, interpretability, and reliability.

[13] TACL: Threshold-Adaptive Curriculum Learning Strategy for Enhancing Medical Text Understanding

Mucheng Ren,Yucheng Yan,He Chen,Danqing Hu,Jun Xu,Xian Zeng

Main category: cs.CL

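TL;DR: This paper proposes TACL (Threshold-Adaptive Curriculum Learning), a framework that uses curriculum learning based on sample complexity to dynamically adjust training on medical text, improving model performance on multilingual clinical tasks.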

Details

Motivation: Existing natural language processing methods usually ignore differences in complexity across clinical records, which makes it hard for models to generalize to rare or complex cases. The paper aims to improve learning by distinguishing data by difficulty.

Method: Proposes the TACL framework, which dynamically assigns difficulty levels according to sample complexity, prioritizes simple samples early in training, and gradually moves to complex ones. It is applied to multilingual medical text, including English and Chinese electronic medical records.

Result: TACL significantly improves model performance on a range of clinical tasks, including automatic ICD coding, readmission prediction, and TCM syndrome differentiation, and performs well in multilingual settings.

Conclusion: Through adaptive curriculum learning, TACL effectively strengthens the generalization of medical text understanding models, offering a scalable, unified solution for cross-lingual and cross-domain medical AI applications.

Abstract: Medical texts, particularly electronic medical records (EMRs), are a cornerstone of modern healthcare, capturing critical information about patient care, diagnoses, and treatments. These texts hold immense potential for advancing clinical decision-making and healthcare analytics. However, their unstructured nature, domain-specific language, and variability across contexts make automated understanding an intricate challenge. Despite the advancements in natural language processing, existing methods often treat all data as equally challenging, ignoring the inherent differences in complexity across clinical records. This oversight limits the ability of models to effectively generalize and perform well on rare or complex cases. In this paper, we present TACL (Threshold-Adaptive Curriculum Learning), a novel framework designed to address these challenges by rethinking how models interact with medical texts during training. Inspired by the principle of progressive learning, TACL dynamically adjusts the training process based on the complexity of individual samples. By categorizing data into difficulty levels and prioritizing simpler cases early in training, the model builds a strong foundation before tackling more complex records. By applying TACL to multilingual medical data, including English and Chinese clinical records, we observe significant improvements across diverse clinical tasks, including automatic ICD coding, readmission prediction and TCM syndrome differentiation. TACL not only enhances the performance of automated systems but also demonstrates the potential to unify approaches across disparate medical domains, paving the way for more accurate, scalable, and globally applicable medical text understanding solutions.
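The easy-to-hard schedule described in this abstract might look roughly like the following sketch. It uses a fixed list of difficulty caps, one per stage, which is a simplifying assumption: the abstract describes thresholds that adapt during training, and that adaptation rule is not given here. The difficulty scores, sample ids, and function name are hypothetical.

```python
def curriculum_schedule(samples, thresholds, epochs):
    """Return, for each epoch, the ids of samples admitted to training.

    samples: list of (sample_id, difficulty) with difficulty in [0, 1].
    thresholds: increasing difficulty caps, one per curriculum stage.
    The cap loosens by one stage per epoch until the last stage is reached.
    """
    schedule = []
    for epoch in range(epochs):
        stage = min(epoch, len(thresholds) - 1)
        cap = thresholds[stage]
        admitted = [sid for sid, diff in samples if diff <= cap]
        schedule.append((epoch, admitted))
    return schedule


# Made-up clinical records with an assumed difficulty proxy (e.g. normalized length).
samples = [("r1", 0.2), ("r2", 0.5), ("r3", 0.9)]
thresholds = [0.3, 0.6, 1.0]

schedule = curriculum_schedule(samples, thresholds, epochs=4)
for epoch, admitted in schedule:
    print(epoch, admitted)
```

An adaptive variant would replace the fixed `thresholds` list with caps recomputed from model feedback, such as per-sample loss after each epoch.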

[14] Exemplar-Guided Planing: Enhanced LLM Agent for KGQA

Jingao Xu,Shuoyoucheng Ma,Xin Song,Rong Jiang,Hongkui Tu,Bin Zhou

Main category: cs.CL

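TL;DR: Proposes Exemplar-Guided Planning (EGP), a new framework that strengthens the planning ability of large language models in knowledge graph question answering by using exemplar reasoning paths from the training data.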

Details

Motivation: To address the semantic gap between natural language queries and structured knowledge graphs, and the underuse of reasoning patterns by existing methods.

Method: Training-set questions are preprocessed with entity templating to normalize semantic variation, and similar exemplars with their successful reasoning paths are retrieved using semantic embeddings and a FAISS index. The exemplars dynamically guide the LLM's planning during task decomposition and relation exploration, and a Smart Lookahead mechanism improves exploration efficiency.

Result: Experiments on two real-world KGQA datasets, WebQSP and CWQ, show that PoG-EGP significantly outperforms the baseline PoG system and other compared methods.

Conclusion: EGP effectively improves the planning ability and reasoning efficiency of LLM agents in KGQA, giving training-free approaches a better way to exploit reasoning patterns.

Abstract: Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration of the KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM's planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.
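The templating and retrieval steps in this abstract can be sketched in a few lines. This sketch swaps in bag-of-words cosine similarity with a linear scan for the semantic embeddings and FAISS index that the paper uses, so it shows only the shape of the pipeline. The placeholder token, the relation names, and the example questions are hypothetical.

```python
import math
from collections import Counter


def template(question, entities):
    """Replace known entity mentions with a placeholder to normalize surface variation."""
    for ent in entities:
        question = question.replace(ent, "[ENT]")
    return question.lower()


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, store, k=1):
    """Return the k stored (templated question, reasoning path) pairs most similar to query."""
    q = Counter(query.split())
    ranked = sorted(store, key=lambda item: -cosine(q, Counter(item[0].split())))
    return ranked[:k]


# Exemplar store built from (hypothetical) training questions and their reasoning paths.
store = [
    (template("Who directed Inception?", ["Inception"]), ["film.directed_by"]),
    (template("Where was Marie Curie born?", ["Marie Curie"]), ["person.place_of_birth"]),
]

query = template("Who directed Titanic?", ["Titanic"])
best = retrieve(query, store, k=1)
print(query, "->", best[0][1])
```

After templating, the two "Who directed ..." questions become identical strings, so the retrieved reasoning path can be offered to the planner as a hint even though the entities differ.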

[15] Automatic essay scoring: leveraging Jaccard coefficient and Cosine similarity with n-gram variation in vector space model approach

Andharini Dwi Cahyani,Moh. Wildan Fathoni,Fika Hastarita Rachman,Ari Basuki,Salman Amin,Bain Khusnul Khotimah

Main category: cs.CL

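TL;DR: This study compares the Jaccard coefficient and cosine similarity for automated essay scoring with an n-gram vector space model. Cosine similarity outperforms the Jaccard coefficient, and unigrams give the lowest RMSE.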

Details

Motivation: To improve the accuracy and efficiency of automated essay scoring, the study examines how well different similarity measures and n-gram representations work for assessing junior high school citizenship education essays.

Method: Uses a vector space model (VSM) with unigram, bigram, and trigram features for text feature extraction and vectorization, computes similarity between essays with the Jaccard coefficient and cosine similarity, and evaluates the gap between system scores and human scores with RMSE.

Result: Cosine similarity outperforms the Jaccard coefficient. Among the n-gram settings, unigrams give the lowest RMSE, ahead of bigrams and trigrams.

Conclusion: For automated essay scoring, cosine similarity combined with unigram features approximates human scores more closely and is the better combination.

Abstract: Automated essay scoring (AES) is a vital area of research aiming to provide efficient and accurate assessment tools for evaluating written content. This study investigates the effectiveness of two popular similarity metrics, Jaccard coefficient and Cosine similarity, within the context of vector space models (VSM) employing unigram, bigram, and trigram representations. The data used in this research was obtained from the formative essay of the citizenship education subject in a junior high school. Each essay undergoes preprocessing to extract features using n-gram models, followed by vectorization to transform text data into numerical representations. Then, similarity scores are computed between essays using both Jaccard coefficient and Cosine similarity. The performance of the system is evaluated by analyzing the root mean square error (RMSE), which measures the difference between the scores given by human graders and those generated by the system. The result shows that the Cosine similarity outperformed the Jaccard coefficient. In terms of n-gram, unigrams have lower RMSE compared to bigrams and trigrams.
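The two similarity measures and the RMSE evaluation named in this abstract can be written out as a short sketch. The whitespace tokenizer, the example sentences, and the score values are assumptions made for illustration; the study's own preprocessing and its mapping from similarity to essay score are not described in this digest.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Lowercased whitespace tokens grouped into n-gram tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b, n=1):
    """Jaccard coefficient: shared n-gram types over all n-gram types."""
    sa, sb = set(ngrams(a, n)), set(ngrams(b, n))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b, n=1):
    """Cosine similarity between n-gram count vectors."""
    ca, cb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def rmse(predicted, actual):
    """Root mean square error between system scores and human scores."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Hypothetical reference answer and student answer.
reference = "citizens must obey the law"
answer = "citizens must respect the law"

j1 = jaccard(reference, answer, n=1)   # 4 shared of 6 unigram types
c1 = cosine(reference, answer, n=1)    # dot 4 over norm 5
j2 = jaccard(reference, answer, n=2)   # 2 shared of 6 bigram types
print(round(j1, 3), round(c1, 3), round(j2, 3))
print(rmse([2.0, 4.0], [4.0, 2.0]))    # made-up system vs human scores
```

One substituted word removes two bigrams but only one unigram, which shows why higher-order n-grams punish small wording differences more heavily. That is consistent with the direction of the reported result, although the study's data may behave differently.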

[16] Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

Zhiyang Chen,Daliang Xu,Haiyang Shen,Mengwei Xu,Shangguang Wang,Yun Ma

Main category: cs.CL

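TL;DR: This paper proposes CoordGen, a mobile inference framework that combines speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation.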

Details

Motivation: Because generation is memory-bound, text generation by on-device large language models augmented with local data still suffers from high latency and low hardware utilization.

Method: The framework has three core components that jointly optimize generation efficiency: adaptive execution scheduling, context-aligned drafting, and hardware-efficient draft extension.

Result: Experiments on multiple smartphones and representative workloads show up to 3.8x faster generation and up to 4.7x better energy efficiency than existing solutions.

Conclusion: CoordGen effectively improves the speed and energy efficiency of context-aware text generation on mobile devices and has potential for practical deployment.

Abstract: Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents CoordGen, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.

[17] Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

Bolei Ma,Yina Yao,Anna-Carolina Haensch

Main category: cs.CL

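TL;DR: Proposes a three-step evaluation framework for assessing LLMs on classical Chinese poetry generation and evaluation; finds that the models exhibit "echo chamber" effects and biases that diverge from human judgment, and stresses the need for hybrid human and model validation in complex creative tasks.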

Details

Motivation: LLMs are increasingly used in creative domains, but their performance in generating and evaluating classical Chinese poetry remains unclear, so their capabilities and limits need systematic assessment. Method: A three-step evaluation framework combining computational metrics, LLM-as-a-judge assessment, and human expert validation is used to evaluate six mainstream LLMs on poetic quality across themes, emotions, imagery, form, and style. Result: LLMs show "echo chamber" effects when evaluating poetry, tending to converge on flawed evaluation standards that deviate systematically from human expert judgments. Conclusion: Current LLMs show both promise and marked limitations in classical Chinese poetry generation and evaluation, which underscores the continued need for hybrid human and model validation in culturally and technically complex creative tasks. Abstract: Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit "echo chamber" effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and the limitations of current LLMs as a proxy for literary generation, as well as the limits of current evaluation practices, thereby demonstrating the continued need for hybrid validation from both humans and models in culturally and technically complex creative tasks.

[18] AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction

Hong Ting Tsang,Jiaxin Bai,Haoyu Huang,Qiao Xiao,Tianshi Zheng,Baixuan Xu,Shujie Liu,Yangqiu Song

Main category: cs.CL

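TL;DR: AutoGraph-R1 is the first framework to use reinforcement learning to directly optimize knowledge graph construction for downstream task performance; its task-aware reward functions make knowledge graphs more effective in retrieval-augmented generation.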

Details

Motivation: Conventional knowledge graph construction is disconnected from downstream applications, which yields suboptimal graph structures and hurts retrieval-augmented generation (RAG) in question answering systems. Method: Graph generation is framed as a policy learning problem, and an LLM constructor is trained with reinforcement learning using two task-aware reward functions: one for graphs used as knowledge carriers and one for graphs used as knowledge indices. Result: Across multiple QA benchmarks, AutoGraph-R1 significantly improves the performance of graph-based RAG methods over task-agnostic baseline graphs. Conclusion: The work closes the loop between knowledge graph construction and application, shifting the paradigm from building intrinsically "good" graphs to building demonstrably "useful" ones. Abstract: Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, their effectiveness is hindered by a fundamental disconnect: the KG construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph's functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically "good" graphs to building demonstrably "useful" ones.

[19] Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics

Catarina G Belem,Parker Glenn,Alfy Samuel,Anoop Kumar,Daben Liu

Main category: cs.CL

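TL;DR: Analyzing 897 human readability judgments, this study finds that information content and topic strongly affect text comprehensibility; it also evaluates 15 traditional and 6 model-based readability metrics and shows that the model-based metrics correlate better with human judgments.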

Details

Motivation: Existing definitions of readability are inconsistent and rely on surface-level text features, which leads to a mismatch with human perception; it is therefore necessary to identify the factors that actually drive human readability judgments and to assess how well existing metrics work. Method: The study analyzes 897 human readability judgments and evaluates 15 traditional readability metrics and 6 model-based metrics on five English datasets, using rank correlation with human judgments as the criterion. Result: Four model-based metrics consistently place among the top four in rank correlation with human judgments, while the best traditional metric reaches an average rank of only 8.6. Conclusion: Current mainstream readability metrics differ markedly from human perception; model-based methods reflect readability more faithfully and are the more promising direction. Abstract: Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.

[20] When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

Heecheol Yun,Kwangmin Ki,Junghyun Lee,Eunho Yang

Main category: cs.CL

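TL;DR: Proposes SAFE, a stable and efficient LLM ensembling framework that uses selective ensembling and a probability sharpening strategy to clearly outperform conventional methods in long-form generation.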

Details

Motivation: Existing ensembling methods perform poorly in long-form generation, where ensembling positions must be chosen carefully, so a more effective approach is needed. Method: Based on tokenization mismatch and the consensus among the models' predictive distributions, the SAFE framework ensembles selectively at key positions and adds a probability sharpening strategy to improve stability. Result: On multiple benchmarks, including MATH500 and BBH, SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when fewer than 1% of tokens are ensembled. Conclusion: By jointly considering tokenization differences and prediction consensus, SAFE delivers more stable and efficient LLM ensembling and is particularly suited to long-form generation. Abstract: Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.

[21] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Baode Wang,Biao Wu,Weizhen Li,Meng Fang,Zuming Huang,Jun Huang,Haozhe Wang,Yanjie Liang,Ling Chen,Wei Chu,Yuan Qi

Main category: cs.CL

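TL;DR: This paper proposes LayoutRL, a reinforcement-learning framework for document parsing, and builds the large-scale Infinity-Doc-400K dataset to train Infinity-Parser, a model with strong generalization that reaches state-of-the-art performance across document types, languages, and structural complexities.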

Details

Motivation: Existing supervised fine-tuning methods generalize poorly across diverse document types, and high-quality training data for layout-aware parsing is scarce, which leads to weak performance on out-of-distribution data. Method: The paper proposes LayoutRL, a reinforcement learning framework with a composite reward that combines normalized edit distance, paragraph count accuracy, and reading order preservation, and builds the Infinity-Doc-400K dataset to train the vision-language model Infinity-Parser. Result: On benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet, Infinity-Parser substantially outperforms existing specialized document parsing systems and general-purpose vision-language models across document types, languages, and structural complexities. Conclusion: By combining reinforcement learning with large-scale data, LayoutRL and Infinity-Parser achieve robust document layout understanding and parsing and support reproducible research in document parsing. Abstract: Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
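
Two of the three reward terms named in this abstract are easy to illustrate. The Python sketch below is a guess at their general shape, not LayoutRL's actual reward: the function names, the equal weighting in `composite_reward`, and the formula for the paragraph-count term are assumptions, and the reading-order term is left out because the abstract does not say how it is computed.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def edit_reward(pred, ref):
    """1 minus the edit distance normalized by the longer string, in [0, 1]."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, ref) / longest

def count_reward(pred_paragraphs, ref_paragraphs):
    """Hypothetical paragraph-count term: relative count error, clipped at 0."""
    if not ref_paragraphs:
        return 1.0 if not pred_paragraphs else 0.0
    error = abs(len(pred_paragraphs) - len(ref_paragraphs)) / len(ref_paragraphs)
    return max(0.0, 1.0 - error)

def composite_reward(pred_paragraphs, ref_paragraphs):
    """Equal-weight average of the two terms (the weighting is an assumption)."""
    text_term = edit_reward("\n".join(pred_paragraphs), "\n".join(ref_paragraphs))
    return 0.5 * text_term + 0.5 * count_reward(pred_paragraphs, ref_paragraphs)
```

Normalizing by the longer string keeps the text term between 0 and 1 regardless of document length, so it can be averaged with the other terms without rescaling.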

[22] VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency

Hongcheng Liu,Yixuan Hou,Heyang Liu,Yuhao Wang,Yanfeng Wang,Yu Wang

Main category: cs.CL

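TL;DR: This paper introduces the VocalBench-DF framework and systematically evaluates 22 mainstream Speech-LLMs on speech disfluencies associated with conditions such as Parkinson's disease; performance drops substantially, with phoneme-level processing and long-context modeling as the main bottlenecks, and stronger component-level recognition and reasoning are needed to improve robustness.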

Details

Motivation: Existing evaluations of Speech-LLMs mostly use idealized inputs and overlook speech disfluencies, such as those common among people with Parkinson's disease, so inclusiveness in real usage scenarios goes untested. Method: The paper builds VocalBench-DF, a systematic evaluation framework organized around a multi-dimensional taxonomy, evaluates 22 mainstream Speech-LLMs with it, and analyzes their weaknesses in phoneme-level processing and long-context modeling. Result: Current Speech-LLMs degrade sharply on disfluent input, showing a clear lack of robustness; phoneme-level processing and long-context modeling are the main bottlenecks. Conclusion: The real-world applicability of existing Speech-LLMs is limited; better handling of disfluent speech is urgently needed to build more robust and more inclusive models. Abstract: While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capabilities at the component and pipeline level can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.

[23] Large-scale User Game Lifecycle Representation Learning

Yanjie Gou,Jiangming Liu,Kouying Xue,Yi Hua

Main category: cs.CL

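TL;DR: Proposes the User Game Lifecycle (UGL) and an Inverse Probability Masking strategy to address sparsity and imbalance in game recommendation, markedly improving advertising conversion rate and revenue.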

Details

Motivation: Existing recommender-system methods cope poorly with sparse game data and imbalanced user behavior, which degrades game recommendation and advertising. Method: The User Game Lifecycle (UGL) is introduced to enrich user behaviors, with two strategies for extracting users' short-term and long-term interests; an Inverse Probability Masking strategy mitigates the imbalance caused by game popularity. Result: Offline experiments show average AUC gains of 1.83% (advertising) and 0.5% (in-game item recommendation); online experiments show a 21.67% CVR gain and a 0.82% ARPU gain. Conclusion: UGL combined with Inverse Probability Masking improves both game advertising and in-game item recommendation and suits game settings with few items and imbalanced behavior. Abstract: The rapid expansion of video game production necessitates the development of effective advertising and recommendation systems for online game platforms. Recommending and advertising games to users hinges on capturing their interest in games. However, existing representation learning methods crafted for handling billions of items in recommendation systems are unsuitable for game advertising and recommendation. This is primarily due to game sparsity, where the mere hundreds of games fall short for large-scale user representation learning, and game imbalance, where user behaviors are overwhelmingly dominated by a handful of popular games. To address the sparsity issue, we introduce the User Game Lifecycle (UGL), designed to enrich user behaviors in games. Additionally, we propose two innovative strategies aimed at manipulating user behaviors to more effectively extract both short and long-term interests. To tackle the game imbalance challenge, we present an Inverse Probability Masking strategy for UGL representation learning. The offline and online experimental results demonstrate that the UGL representations significantly enhance the model, achieving an average offline AUC increase of 1.83% and an average online CVR increase of 21.67% for game advertising, and an offline AUC increase of 0.5% and an online ARPU increase of 0.82% for in-game item recommendation.

[24] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

Lee Qi Zun,Mohamad Zulhilmi Bin Abdul Halim,Goh Man Fye

Main category: cs.CL

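TL;DR: This study proposes and validates a framework that generates a synthetic dataset through knowledge distillation and fine-tunes MedGemma with QLoRA, in order to improve retrieval-augmented generation over Malaysian Clinical Practice Guidelines for image-based queries.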

Details

Motivation: Existing vision-language models lack clinical specificity and factual accuracy when handling medical images, which limits their use in evidence-based clinical decision support. Method: A knowledge distillation pipeline creates a synthetic dataset covering dermatology, fundus, and chest radiography, and MedGemma is fine-tuned with the parameter-efficient QLoRA method to generate high-fidelity medical image captions. Result: The fine-tuned model shows substantial gains in classification accuracy, and RAGAS evaluation shows clear improvements in caption faithfulness, relevancy, and correctness. Conclusion: The study establishes an effective pipeline for specializing medical vision-language models and validates the resulting model as a high-quality query generator, laying the groundwork for stronger multimodal RAG systems in clinical decision support. Abstract: Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.

[25] When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

Hongcheng Liu,Pingjie Wang,Yuhao Wang,Siqu Ou,Yanfeng Wang,Yu Wang

Main category: cs.CL

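TL;DR: This paper introduces GuessBench, a benchmark for evaluating the active reasoning of multimodal large language models (MLLMs) under incomplete information; existing MLLMs perform poorly on it, showing that active evidence acquisition and iterative decision-making remain challenging.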

Details

Motivation: Existing MLLM evaluations focus on passive reasoning under complete information, whereas real-world use often requires a model to actively acquire missing information; the active reasoning ability of MLLMs under incomplete information therefore needs study. Method: The GuessBench benchmark, which contains perception-oriented and knowledge-oriented images, requires an MLLM to select a target image from a candidate pool by actively acquiring information and iteratively refining its decisions; 20 advanced MLLMs are evaluated. Result: MLLMs perform far worse on active reasoning than on passive reasoning; fine-grained perception and timely decision-making are the main challenges; ablations show that perceptual enhancements help smaller models more, while thinking-oriented methods bring gains across model sizes. Conclusion: Current MLLMs still fall well short in active reasoning; future work should focus on active information acquisition, fine-grained perception, and dynamic decision-making. Abstract: Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require the MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 advanced MLLMs and find that performance on active reasoning lags far behind that in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.

[26] Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering

Xiangchen Song,Yuchen Liu,Yaxuan Luan,Jinxu Guo,Xiaofan Guo

Main category: cs.CL

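TL;DR: Proposes a prompt-engineering-based method for controllable abstractive summary generation, using a multi-stage prompt generation framework to improve the quality and controllability of LLM summaries.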

Details

Motivation: To address the shortcomings of traditional summarization methods in quality and controllability. Method: A multi-stage prompt generation framework combines semantic analysis, topic modeling, and noise control to produce summaries at different levels of abstraction. Result: Experiments show that prompt length strongly affects summary quality, with both overly short and overly long prompts degrading it; more data noise lowers the ROUGE-L score; the model performs best on news text and worse on academic articles. Conclusion: Controlling prompt strategies and optimizing text preprocessing can improve the accuracy and controllability of LLM-generated summaries. Abstract: This study presents a controllable abstract summary generation method for large language models based on prompt engineering. To address the issues of summary quality and controllability in traditional methods, we design a multi-stage prompt generation framework. This framework generates summaries with varying levels of abstraction by performing semantic analysis, topic modeling, and noise control on the input text. The experiment uses the CNN/Daily Mail dataset and provides a detailed analysis of different prompt lengths, data noise, and text types. The experimental results show that prompt length has a significant impact on the quality of generated summaries. Both very short and very long prompts result in a decrease in summary quality. Data noise also negatively affects the summary generation process. As noise levels increase, the ROUGE-L score gradually decreases. Furthermore, different text types have varying effects on the model's ability to generate summaries. The model performs best when handling news texts, while its performance is worse when processing academic articles. This research provides new insights into improving summary generation using large language models, particularly in how controlling prompt strategies and optimizing text preprocessing can enhance summary accuracy and controllability.

[27] CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs

Gucongcong Fan,Chaoyue Niu,Chengfei Lyu,Fan Wu,Guihai Chen

Main category: cs.CL

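TL;DR: Proposes CORE, a framework that combines the strengths of cloud and local LLMs to reduce the UI exposure of mobile agents while maintaining task accuracy.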

Details

Motivation: Cloud-based LLMs are accurate but require uploading the full UI state, which leaks private information; local LLMs protect privacy but have limited capability. The goal is to cut unnecessary UI uploads while preserving the task success rate. Method: CORE has three parts: layout-aware block partitioning based on the XML hierarchy; co-planning, in which the local and cloud LLMs jointly identify the current sub-task; and co-decision-making, in which the local LLM ranks the relevant UI blocks and the cloud LLM selects a specific element within the top-ranked block. A multi-round accumulation mechanism mitigates local misjudgment. Result: Experiments show that CORE reduces UI exposure by up to 55.6%, with a task success rate slightly below that of cloud-only agents, substantially lowering the risk of privacy leakage. Conclusion: CORE balances privacy protection and task performance, and its cloud-local collaboration offers mobile agents a safer and more efficient approach to UI interaction. Abstract: Mobile agents rely on Large Language Models (LLMs) to plan and execute tasks on smartphone user interfaces (UIs). While cloud-based LLMs achieve high task accuracy, they require uploading the full UI state at every step, exposing unnecessary and often irrelevant information. In contrast, local LLMs avoid UI uploads but suffer from limited capacity, resulting in lower task success rates. We propose CORE, a COllaborative framework that combines the strengths of cloud and local LLMs to Reduce UI Exposure, while maintaining task accuracy for mobile agents. CORE comprises three key components: (1) Layout-aware block partitioning, which groups semantically related UI elements based on the XML screen hierarchy; (2) Co-planning, where local and cloud LLMs collaboratively identify the current sub-task; and (3) Co-decision-making, where the local LLM ranks relevant UI blocks, and the cloud LLM selects specific UI elements within the top-ranked block. CORE further introduces a multi-round accumulation mechanism to mitigate local misjudgment or limited context. Experiments across diverse mobile apps and tasks show that CORE reduces UI exposure by up to 55.6% while maintaining task success rates slightly below cloud-only agents, effectively mitigating unnecessary privacy exposure to the cloud. The code is available at https://github.com/Entropy-Fighter/CORE.

[28] DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

Yao Huang,Yitong Sun,Yichi Zhang,Ruochen Zhang,Yinpeng Dong,Xingxing Wei

Main category: cs.CL

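TL;DR: This paper presents DeceptionBench, the first benchmark to systematically evaluate deceptive behavior of LLMs across societal domains, covering 150 scenarios and over 1,000 samples; it shows that deception intensifies under reinforcement dynamics.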

Details

Motivation: Although LLMs have made remarkable progress on many cognitive tasks, their rapidly growing capabilities also bring potential deceptive behaviors that could cause serious problems in high-stakes applications, and the characteristics of deception in realistic scenarios remain underexplored. Method: The DeceptionBench benchmark comprises 150 carefully designed scenarios and over 1,000 samples in five societal domains (economy, healthcare, education, social interaction, and entertainment). It studies deception through static analysis, intrinsic behavioral patterns (such as egoistic tendencies or sycophancy), and extrinsic factors (such as reward incentives and coercive pressure), and uses multi-turn interaction to simulate realistic feedback. Result: Experiments show that LLMs and large reasoning models deceive more strongly under reinforcement dynamics, indicating that current models lack robust resistance to manipulative contextual cues. Conclusion: Existing models are seriously vulnerable to deception when facing manipulative contexts, and more advanced safeguards against deceptive behaviors are urgently needed. Abstract: Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and underscoring the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at https://github.com/Aries-iai/DeceptionBench.

[29] Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?

Ashutosh Bajpai,Tanmoy Chakraborty

Main category: cs.CL

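TL;DR: This paper introduces a new benchmark called temporal referential consistency, together with the resource TEMP-ReCon, to evaluate the temporal consistency of LLMs across linguistic contexts, and proposes UnTRaP, a model based on reasoning path alignment, to improve it.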

Details

Motivation: As LLMs are widely adopted in time-sensitive fields such as law, healthcare, and finance, their consistency along the temporal dimension becomes critical, yet systematic methods for evaluating and improving it are lacking. Method: The paper builds a new benchmark, TEMP-ReCon, covering several languages (English, French, and Romanian), designs temporal referential consistency test tasks, and proposes UnTRaP, which improves temporal consistency through reasoning path alignment. Result: Experiments show that existing LLMs fall short on temporal referential consistency, while UnTRaP clearly outperforms baseline methods on several open-source and closed-source models. Conclusion: Temporal referential consistency is an important indicator of LLM reliability in time-sensitive settings, and UnTRaP offers an effective way to improve it. Abstract: The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, necessitating robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce, including a noticeable absence of endeavors aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource TEMP-ReCon designed to benchmark a wide range of both open-source and closed-source LLMs with various linguistic contexts characterized by differing resource richness (including English, French, and Romanian). The findings emphasize that LLMs do exhibit insufficient temporal referential consistency. To address this, we propose UnTRaP, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.

[30] From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

Rares Dolga,Lucas Maystre,Tudor Berariu,David Barber

Main category: cs.CL

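TL;DR: Proposes a BPE-based dynamic character grouping method that adds explicit end-of-patch markers and a second-level BPE compression stage, giving efficient, flexible, and language-agnostic subword representations without relying on additional models.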

Details

Motivation: Existing subword tokenization is inefficient on rare words and needs large embedding matrices, character-level models hit performance bottlenecks, and current patching strategies either depend on whitespace or require auxiliary models, so they lack generality and simplicity. Method: Explicit end-of-patch markers are added on top of BPE, and a second-level BPE compression stage dynamically controls patch granularity, yielding hierarchical character grouping. Result: Experiments show that the method matches or exceeds entropy-based and whitespace-based dynamic patching strategies while keeping a compact vocabulary. Conclusion: Without additional models, the method combines the advantages of character-level and subword-level modeling and is adaptable across languages and practical to use. Abstract: Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.

[31] Latent Reasoning in LLMs as a Vocabulary-Space Superposition

Jingcheng Deng,Liang Pang,Zihao Wei,Shichen Xu,Zenghao Duan,Kun Xu,Yang Song,Huawei Shen,Xueqi Cheng

Main category: cs.CL

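TL;DR: Proposes the Latent-SFT framework, which restricts the latent space to the column space of the vocabulary, substantially shortening reasoning chains while preserving performance and achieving efficient compression and parallelism.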

Details

Motivation: Existing latent reasoning methods lose performance because the latent space is unstructured, so the difficulty of learning latent representations must be addressed. Method: Latent-SFT is a two-stage learning framework: the first stage uses specialized attention masks to guide the generation of latent tokens; the second stage trains the LLM directly to generate these tokens on its own, optimized with KL and CE losses. Result: The method sets a new state of the art on GSM8k, matching explicit SFT while shortening reasoning chains by up to 4 times; on Math500 and AIME24 it outperforms hidden-state-based latent methods; the proposed metrics of effective compression rate and effective global parallelism support the advantages of latent reasoning. Conclusion: Restricting latent reasoning to a superposition over vocabulary probabilities improves learning efficiency and performance, and Latent-SFT yields an efficient, compact, and high-performing reasoning paradigm. Abstract: Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.

[32] MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

Qiyu Wu,Shuyang Cui,Satoshi Hayakawa,Wei-Yao Wang,Hiromi Wakaki,Yuki Mitsufuji

Main category: cs.CL

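TL;DR: Proposes a modality composition awareness framework that uses a preference loss and a composition regularization objective to make multimodal retrieval more robust out of distribution.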

Details

Motivation: Unified encoders trained with conventional contrastive learning tend to learn modality shortcuts, which hurts performance under distribution shift. Method: A preference loss makes multimodal embeddings outperform their unimodal counterparts, and a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. Result: Experiments on multiple benchmarks show improved out-of-distribution retrieval. Conclusion: Modality composition awareness is an effective principle for robust multimodal retrieval when MLLMs serve as the unified encoder. Abstract: Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to content production. Despite the success of separate-encoder approaches like CLIP, which align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While such encoders are flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learning modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.

[33] TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

Sibo Xiao,Jinyuan Fu,Zhongle Xie,Lidan Shou

Main category: cs.CL

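TL;DR: Proposes TokenTiming, an algorithm based on dynamic time warping that enables universal speculative decoding across mismatched vocabularies and markedly improves LLM inference efficiency.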

Details

Motivation: Existing speculative decoding requires the draft and target models to share a vocabulary, which narrows the choice of draft models, often forces training a new model from scratch, and limits adoption. Method: Inspired by Dynamic Time Warping (DTW), the method re-encodes the draft token sequence and uses DTW to build a mapping between probability distributions, enabling speculative sampling across vocabularies. Result: The method achieves a 1.57x speedup across various tasks without retraining or modifying existing models. Conclusion: TokenTiming offers a universal solution for speculative decoding that allows any off-the-shelf model to serve as the draft model, making LLM inference acceleration more flexible and practical. Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the pool of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining or modification. We conduct comprehensive experiments on various tasks, demonstrating a 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.
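
For readers unfamiliar with DTW, the classic algorithm this abstract builds on can be sketched as follows. This is the textbook dynamic program, not TokenTiming itself: the function name `dtw`, the generic `cost` callback, and the numeric example are illustrative assumptions, and the abstract does not specify the cost the authors use between draft and target tokens.

```python
def dtw(xs, ys, cost):
    """Classic dynamic time warping between two non-empty sequences.

    Returns the minimal accumulated cost and the alignment path as a list
    of (index_in_xs, index_in_ys) pairs.
    """
    n, m = len(xs), len(ys)
    inf = float("inf")
    # acc[i][j] = minimal cost of aligning xs[:i] with ys[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = cost(xs[i - 1], ys[j - 1]) + step

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return acc[n][m], path

# Example: the second sequence repeats one element, so DTW maps one item
# of the first sequence to two items of the second.
total, path = dtw([1, 2, 3], [1, 2, 2, 3], lambda a, b: abs(a - b))
```

The property that matters for mismatched vocabularies is visible in the example: one element of a sequence may align with several elements of the other, which is the many-to-one situation that arises when two tokenizers split the same text differently.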

[34] Rethinking Cross-lingual Gaps from a Statistical Viewpoint

Vihari Piratla,Purvam Jain,Darshan Singh,Partha Talukdar,Trevor Cohn

Main category: cs.CL

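TL;DR: Offers a new explanation of the cross-lingual gap in LLMs: response variance in the target language is the main cause. The paper formalizes this with a bias-variance decomposition and shows experimentally that controlling variance substantially reduces the gap.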

Details

Motivation: Prior work attributes the cross-lingual gap to differences between latent representations in the source and target languages; this paper offers an alternative explanation based on response variance. Method: Formalizes the cross-lingual gap through a bias-variance decomposition and controls target-language response variance with several inference-time interventions. Result: Experiments show that response variance is closely tied to the cross-lingual gap; a simple prompt instruction that lowers variance improves target-language accuracy by 20-25% across different models. Conclusion: The main source of the cross-lingual gap is response variance in the target language rather than divergence in latent representations, and controlling variance is an effective way to improve cross-lingual performance. Abstract: Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has attributed the cross-lingual gap to divergence in latent representations between source and target languages. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. For the first time, we formalize the cross-lingual gap in terms of bias-variance decomposition. We present extensive experimental evidence that supports the proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction to reduce the response variance, which improved target accuracy by 20-25% across different models.

[35] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation

Jinliang Liu

Main category: cs.CL

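TL;DR: Proposes ParallaxRAG, a framework that decouples queries and knowledge-graph triples into multiple views for more robust retrieval, reducing LLM hallucination in multi-hop reasoning.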

Details

Motivation: LLMs tend to hallucinate in multi-hop reasoning and lack effective knowledge grounding, while existing knowledge-graph-augmented methods rely on flat embeddings and noisy paths, which limits their effectiveness. Method: ParallaxRAG symmetrically decouples queries and graph triples into multi-view spaces, exploits the way attention heads specialize in semantic relations, explicitly enforces head diversity, and suppresses weakly related paths, yielding cleaner subgraphs that support step-wise reasoning. Result: On WebQSP and CWQ, under a unified, reproducible setup (BGE-M3 + Llama3.1-8B), the method shows competitive retrieval and QA performance, with reduced hallucination and good generalization. Conclusion: Multi-view attention-head specialization is an effective and principled direction for knowledge-grounded multi-hop reasoning. Abstract: Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.

[36] KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models

Dongjun Kim,Chanhee Park,Chanjun Park,Heuiseok Lim

Main category: cs.CL

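TL;DR: Introduces KITE, a benchmark for evaluating the open-ended instruction-following abilities of LLMs in Korean, filling a gap in evaluation for non-English languages.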

Details

Motivation: Instruction-following evaluation for LLMs focuses mainly on English and overlooks what is distinctive about languages such as Korean (syntax, morphology, the honorific system), for which no dedicated benchmark exists. Method: Designs KITE, a comprehensive benchmark of general and Korean-specific open-ended instruction tasks, and analyzes model performance with a combination of automated metrics and human evaluation. Result: Evaluation on KITE reveals performance differences across models in Korean instruction following and exposes their strengths and weaknesses. Conclusion: KITE provides an effective evaluation tool for Korean LLMs and supports more inclusive LLM development across languages and cultures. Abstract: The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations predominantly focus on English models, neglecting the linguistic and cultural nuances of other languages. Specifically, Korean, with its distinct syntax, rich morphological features, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following capabilities. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insights into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster further research on culturally and linguistically inclusive LLM development and inspire similar endeavors for other underrepresented languages.

[37] Finetuning LLMs for EvaCun 2025 token prediction shared task

Josef Jon,Ondřej Bojar

Main category: cs.CL

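TL;DR: Describes systems submitted to the EvaCun 2025 token prediction task, built on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned directly on the organizers' training data without task-specific preprocessing, and compares three prompting approaches on held-out data.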

Details

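Motivation: With limited knowledge of the task's languages and domain, the authors aim to handle token prediction by directly fine-tuning existing LLMs and comparing prompting strategies, without elaborate data processing. Method: Fine-tunes three LLMs (Command-R, Mistral, and Aya Expanse) on the provided training data, generates predictions with three different prompts, and evaluates them on a held-out portion of the data. Result: The paper compares the three prompting approaches on the token prediction task; the abstract reports no specific performance figures. Conclusion: Even without domain or language expertise, fine-tuning LLMs on the raw training data with different prompting strategies is a feasible way to approach token prediction. Abstract: In this paper, we present our submission for the token prediction task of EvaCun 2025. Our systems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only possess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare 3 different approaches (based on 3 different prompts) to obtaining the predictions, and we evaluate them on a held-out part of the data.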

[38] From Ghazals to Sonnets: Decoding the Polysemous Expressions of Love Across Languages

Syed Mohammad Sualeh Ali

Main category: cs.CL

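TL;DR: Using a polysemy case study, this paper examines the subtle semantic differences among three Urdu words for love ("pyaar", "muhabbat", and "ishq") in Urdu poetry, and compares word embeddings of love-related terms in Urdu and English to bring out cultural and linguistic differences in how love is expressed.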

Details

Motivation: Words for love in Urdu poetry carry rich and subtle layers of meaning that lack direct English equivalents, which hinders cross-cultural understanding. The study aims to map the distinct emotional range of these words and deepen understanding of Urdu poetry. Method: A polysemy case study combining textual analysis with computational linguistics: word embeddings are generated for love-related terms in Urdu and English, then compared through visualization and analysis of the semantic space. Result: The study finds marked semantic differences among "pyaar", "muhabbat", and "ishq" and distinct emotional orientations in their poetic use; the embedding analysis further suggests that expressions of love in Urdu are more layered in semantic space and more culturally dependent than in English. Conclusion: Urdu poetry conveys complex emotion through a finely differentiated vocabulary of love; this language-specific polysemy carries deep cultural meaning and deserves greater attention in cross-lingual research. Abstract: This paper delves into the intricate world of Urdu poetry, exploring its thematic depths through a lens of polysemy. By focusing on the nuanced differences between three seemingly synonymous words (pyaar, muhabbat, and ishq), we expose a spectrum of emotions and experiences unique to the Urdu language. This study employs a polysemic case study approach, meticulously examining how these words are interwoven within the rich tapestry of Urdu poetry. By analyzing their usage and context, we uncover a hidden layer of meaning, revealing subtle distinctions which lack direct equivalents in English literature. Furthermore, we embark on a comparative analysis, generating word embeddings for both Urdu and English terms related to love. This enables us to quantify and visualize the semantic space occupied by these words, providing valuable insights into the cultural and linguistic nuances of expressing love. Through this multifaceted approach, our study sheds light on the captivating complexities of Urdu poetry, offering a deeper understanding and appreciation for its unique portrayal of love and its myriad expressions.

[39] BiMax: Bidirectional MaxSim Score for Document-Level Alignment

Xiaotian Wang,Takehito Utsuro,Masaaki Nagata

Main category: cs.CL

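TL;DR: Proposes the cross-lingual Bidirectional MaxSim score (BiMax) for document alignment, which matches the accuracy of Optimal Transport (OT) methods while running roughly 100 times faster, validated on the WMT16 task.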

Details

Motivation: Given the massive scale of web-mined data, document alignment needs to be fast as well as accurate. Method: Proposes the cross-lingual Bidirectional MaxSim score (BiMax) for computing document-to-document similarity, and analyzes the performance of current state-of-the-art multilingual sentence embedding models. Result: On the WMT16 bilingual document alignment task, BiMax reaches accuracy comparable to OT at roughly 100 times the speed. Conclusion: BiMax strikes a good balance between efficiency and accuracy, running far faster than OT at similar accuracy, and all alignment methods are publicly released in the EmbDA tool. Abstract: Document alignment is necessary for hierarchical mining (Bañón et al., 2020; Morishita et al., 2022), which aligns documents across source and target languages within the same web domain. Several high-precision sentence embedding-based methods have been developed, such as TK-PERT (Thompson and Koehn, 2020) and Optimal Transport (OT) (Clark et al., 2019; El-Kishky and Guzmán, 2020). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional MaxSim score (BiMax) for computing doc-to-doc similarity, to improve efficiency compared to the OT method. Consequently, on the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT with an approximate 100-fold speed increase. Meanwhile, we also conduct a comprehensive analysis to investigate the performance of current state-of-the-art multilingual sentence embedding models. All the alignment methods in this paper are publicly available as a tool called EmbDA (https://github.com/EternalEdenn/EmbDA).
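
A bidirectional MaxSim score can be sketched as follows, under one plausible reading of the name: for every sentence embedding in one document take the maximum cosine similarity to any sentence in the other document, average those maxima, and then average the two directions. The abstract does not give the exact formula, so the definition below and the function names `max_sim` and `bimax` are assumptions and may differ from what EmbDA implements.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_sim(src: Sequence[Vector], tgt: Sequence[Vector]) -> float:
    """Mean over src sentences of the best cosine match found in tgt."""
    return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)


def bimax(src: Sequence[Vector], tgt: Sequence[Vector]) -> float:
    """Assumed symmetric score: average of the two MaxSim directions."""
    return 0.5 * (max_sim(src, tgt) + max_sim(tgt, src))


# Toy 2-d "sentence embeddings"; real ones would come from a multilingual encoder.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]  # same sentences in a different order
doc_c = [[1.0, 0.0]]              # covers only half of doc_a

print(bimax(doc_a, doc_b), bimax(doc_a, doc_c))
```

Unlike optimal transport, this needs only one similarity matrix and two row or column maxima per document pair, which is consistent with the efficiency motivation stated in the abstract.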

[40] The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works

Antoine Bourgois,Thierry Poibeau

Main category: cs.CL

TL;DR: Introduces a new annotated corpus of three full-length French novels, totaling more than 285,000 tokens, for coreference resolution in long literary works.

Details

Motivation: Existing coreference datasets concentrate on short texts and offer little support for long, complex literary works, which limits the evaluation and use of models on long reference chains. Method: Builds a modular coreference resolution pipeline and runs experiments on the newly annotated corpus of full-length French novels, enabling fine-grained error analysis. Result: The approach is competitive, scales well to long documents, and is successfully used to infer the gender of fictional characters. Conclusion: The corpus and pipeline support both literary analysis and downstream NLP tasks and advance research on long-document coreference resolution. Abstract: While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness for inferring the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.

[41] HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

Tingting Chen,Beibei Lin,Zifeng Yuan,Qiran Zou,Hongyu He,Yew-Soon Ong,Anirudh Goyal,Dianbo Liu

Main category: cs.CL

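TL;DR: Introduces HypoSpace, a diagnostic suite that evaluates how well LLMs generate multiple plausible hypotheses for scientific problems, measured by three indicators: Validity, Uniqueness, and Recovery.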

Details

Motivation: Many scientific problems are underdetermined, with several mechanistically distinct hypotheses consistent with the same observations, so language models should be evaluated on their ability to produce diverse explanations rather than a single correct answer. Method: Treats LLMs as samplers of finite hypothesis sets and measures Validity, Uniqueness, and Recovery in three structured domains (causal graphs, 3D voxel reconstruction, Boolean genetic interactions) with deterministic validators and exactly enumerated hypothesis spaces. Result: As the admissible hypothesis space grows, Validity stays high while Uniqueness and Recovery decline, exposing mode collapse that correctness-only metrics cannot detect. Conclusion: HypoSpace is a controlled probe, not a leaderboard, for methods that aim to explore and cover the space of admissible explanations. Abstract: As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe, rather than a leaderboard, for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
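
The three indicators can be made concrete with a small sketch. The exact HypoSpace formulas are not given in the abstract, so the definitions below are assumptions that follow its parenthetical glosses: Validity as the share of proposals in the admissible set, Uniqueness as the share of distinct proposals, and Recovery as the share of the admissible set that was proposed at least once. The function name and the string-encoded causal graphs are made up for illustration.

```python
from typing import Dict, Hashable, Sequence, Set


def hypospace_style_metrics(
    proposals: Sequence[Hashable],
    admissible: Set[Hashable],
) -> Dict[str, float]:
    """Score a list of sampled hypotheses against an enumerated admissible set."""
    if not proposals or not admissible:
        return {"validity": 0.0, "uniqueness": 0.0, "recovery": 0.0}
    valid = [p for p in proposals if p in admissible]
    return {
        # fraction of proposals consistent with the observations
        "validity": len(valid) / len(proposals),
        # fraction of proposals that are not repeats of one another
        "uniqueness": len(set(proposals)) / len(proposals),
        # fraction of the admissible set that the model actually found
        "recovery": len(set(valid)) / len(admissible),
    }


# Made-up example: three causal graphs fit the data; the model proposes four times.
admissible_graphs = {"A->B", "B->A", "A<-C->B"}
model_proposals = ["A->B", "A->B", "B->A", "A->C"]
scores = hypospace_style_metrics(model_proposals, admissible_graphs)
print(scores)
```

A model that repeats one correct hypothesis many times would keep validity at 1.0 while uniqueness and recovery fall, which is the mode-collapse pattern the abstract says correctness-only metrics miss.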

[42] Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection

Joshua Wolfe Brook,Ilia Markov

Main category: cs.CL

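TL;DR: Proposes using LLMs as dynamic knowledge bases to generate background context that is fed into hate speech detection (HSD) classifiers, improving detection in both textual and multimodal settings.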

Details

Motivation: Implicit hate speech is hard to detect because it lacks explicit cues, and existing methods often rely on static knowledge bases or ignore context, so a method is needed that generates context dynamically and fuses it effectively. Method: Generates context with LLMs using two strategies (named-entity-focused prompting and full-text prompting) and compares four ways of incorporating it: text concatenation, embedding concatenation, hierarchical transformer-based fusion, and LLM-driven text enhancement. Result: On the textual Latent Hatred dataset and the multimodal MAMI dataset, both the context and the way it is incorporated matter; the best system, based on embedding concatenation, gains up to 3 F1 points (textual) and 6 F1 points (multimodal) over a zero-context baseline. Conclusion: LLM-generated dynamic context combined with an effective fusion strategy, embedding concatenation in particular, improves hate speech detection. Abstract: This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.

[43] Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth

Helia Hashemi,Victor Rühle,Saravan Rajmohan

Main category: cs.CL

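TL;DR: Proposes a cost-aware retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list and is trained with reinforcement learning for efficiency, lowering latency while improving accuracy.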

Details

Motivation: Existing retrieval-augmented reasoning models are computationally expensive and resource-hungry, so efficiency needs to improve. Method: Designs a model that dynamically adjusts the length of the retrieved document list, introduces a cost-aware advantage function, trains within a reinforcement learning framework, and explores memory-bound and latency-bound implementations. Result: On seven public question answering datasets, latency drops by about 16-20% and exact-match accuracy rises by about 5% on average. Conclusion: The method makes retrieval-augmented reasoning models markedly more efficient without sacrificing effectiveness, which suits resource-constrained deployments. Abstract: Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.

[44] Attention Sinks in Diffusion Language Models

Maximo Eduardo Rulli,Simone Petruzzi,Edoardo Michielon,Fabrizio Silvestri,Simone Scardapane,Alessio Devoto

Main category: cs.CL

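TL;DR: Studies attention sinks in masked diffusion language models (DLMs), finding that sink positions shift dynamically during generation and that DLMs are robust to sink removal, which points to fundamental differences in attention between DLMs and traditional autoregressive models.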

Details

Motivation: DLMs perform well in both quality and efficiency, but their internal mechanisms, and the behavior of attention in particular, remain poorly understood. Method: An empirical analysis of DLM attention patterns, focusing on attention sinks and comparing against autoregressive models (ARMs). Result: DLMs exhibit attention sinks whose positions shift during generation, and removing these sinks degrades performance only slightly. Conclusion: DLMs allocate and use attention in a fundamentally different way from ARMs, which gives a new view into how diffusion language models work internally. Abstract: Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.

[45] LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

Gao Yang,Yuhang Liu,Siyu Miao,Xinyue Liang,Zhengyang Liu,Heyan Huang

Main category: cs.CL

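TL;DR: Proposes a game-theoretic approach to evaluating LLMs in which models assess one another, and compares the resulting judgments with human voting behavior to test their alignment with human judgment.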

Details

Motivation: Conventional evaluation relies on fixed-format tasks with reference answers and struggles to capture the complexity and subjectivity of LLM behavior, so better evaluation methods are needed. Method: Proposes an automatic mutual evaluation framework in which LLMs assess each other through self-play and peer review, aggregates the reviews with game-theoretic voting algorithms, and systematically compares the results with human votes. Result: Model-generated rankings agree with human preferences in some respects and diverge in others, showing both the promise and the limits of mutual evaluation. Conclusion: According to the authors, this is the first work to combine mutual evaluation, game-theoretic aggregation, and human validation, offering a new and workable route for evaluating LLM capabilities. Abstract: Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other's output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.

[46] On Non-interactive Evaluation of Animal Communication Translators

Orr Paradise,David F. Gruber,Adam Tauman Kalai

Main category: cs.CL

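TL;DR: Proposes a reference-free method for evaluating machine translation quality that detects "hallucinations" by combining segment-by-segment translation with the NLP shuffle test, validated on data-scarce human languages and constructed languages.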

Details

Motivation: How can the accuracy of an AI translator (for example, Whale-to-English) be assessed when no reference translations exist, especially when interaction or grounded observation is not possible? Method: Combines segment-by-segment translation with the classic NLP shuffle test, checking whether the translated segments make more sense in their original order than when permuted. Result: Experiments on data-scarce human languages and constructed languages support the method, which correlates highly with a standard reference-based evaluation; theoretical analysis suggests that interaction may be unnecessary in the early stages of learning to translate. Conclusion: Translation quality for sufficiently complex languages may be assessable from the output text alone, without interaction or external observation, which offers a safer, more ethical, and cheaper route for reference-free settings such as animal communication. Abstract: If you had an AI Whale-to-English translator, how could you validate whether or not it is working? Does one need to interact with the animals or rely on grounded observations such as temperature? We provide theoretical and proof-of-concept experimental evidence suggesting that interaction and even observations may not be necessary for sufficiently complex languages. One may be able to evaluate translators solely by their English outputs, offering potential advantages in terms of safety, ethics, and cost. This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. A key challenge is identifying "hallucinations," false translations which may appear fluent and plausible. We propose using segment-by-segment translation together with the classic NLP shuffle test to evaluate translators. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages demonstrate the potential utility of this evaluation methodology. These human-language experiments serve solely to validate our reference-free metric under data scarcity. It is found to correlate highly with a standard evaluation based on reference translations, which are available in our experiments. We also perform a theoretical analysis suggesting that interaction may be neither necessary nor efficient in the early stages of learning to translate.
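
The shuffle test described above can be sketched in a few lines: score the translated segments in their original order, score random permutations of the same segments, and report how often the original order wins. The scorer is the important assumption here. In practice it would presumably be something like a language model's likelihood of the concatenated text; the `toy_coherence` function below is a made-up stand-in so the example runs without any model, and `shuffle_test` is a hypothetical name.

```python
import random
from typing import Callable, List, Sequence


def shuffle_test(
    segments: Sequence[str],
    coherence: Callable[[Sequence[str]], float],
    n_permutations: int = 50,
    seed: int = 0,
) -> float:
    """Fraction of random permutations that score lower than the original order."""
    rng = random.Random(seed)
    original = list(segments)
    baseline = coherence(original)
    wins = 0
    trials = 0
    for _ in range(n_permutations):
        permuted: List[str] = original[:]
        rng.shuffle(permuted)
        if permuted == original:
            continue  # an identical ordering tells us nothing
        trials += 1
        if baseline > coherence(permuted):
            wins += 1
    return wins / trials if trials else float("nan")


def toy_coherence(segments: Sequence[str]) -> float:
    # Stand-in scorer (assumption): count adjacent segments where the next
    # one starts with the word the previous one ended on.
    links = 0
    for prev, nxt in zip(segments, segments[1:]):
        if prev.split()[-1] == nxt.split()[0]:
            links += 1
    return float(links)


translated_turns = [
    "the whale dives",
    "dives near the reef",
    "reef fish scatter",
    "scatter toward shore",
]
rate = shuffle_test(translated_turns, toy_coherence)
print(rate)
```

A rate near 0.5 would mean the ordering carries no signal, which is what one would expect from a translator that hallucinates fluent but unrelated text for each segment.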

[47] Emergence of Linear Truth Encodings in Language Models

Shauli Ravfogel,Gilad Yehudai,Tal Linzen,Joan Bruna,Alberto Bietti

Main category: cs.CL

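TL;DR: Introduces a transparent one-layer transformer toy model that reproduces, end to end, the linear subspaces separating true from false statements in language models, and exposes one concrete mechanism for their emergence: when factual statements tend to co-occur in the data, the model learns to tell true from false in order to lower its language-modeling loss, which produces a linear truth subspace.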

Details

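Motivation: LLMs contain linear subspaces that separate true from false statements, but how these arise is unclear. The paper builds an interpretable simplified model to show how such truth representations might emerge. Method: Trains a one-layer transformer toy model on a data distribution in which factual statements co-occur with other factual statements (and non-factual with non-factual), analyzes its internal representations and learning dynamics, and checks for a similar pattern in pretrained language models. Result: The toy model develops linearly separable representations of true and false statements; learning proceeds in two phases, first quickly memorizing individual facts and then slowly developing the linear separation, which in turn lowers language-modeling loss; experiments in pretrained models corroborate the co-occurrence pattern. Conclusion: Linear truth subspaces can emerge under simple, plausible training conditions because the model learns to exploit correlations in truthfulness between statements to optimize its language-modeling objective, which gives a mechanistic account of factual representations in LLMs. Abstract: Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then, over a longer horizon, learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.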

[48] Paper2Web: Let's Make Your Paper Alive!

Yuhang Chen,Tianpeng Lv,Siyi Zhang,Yixiang Yin,Yao Wan,Philip S. Yu,Dongping Chen

Main category: cs.CL

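TL;DR: Introduces Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for academic webpage generation, together with PWAgent, an automated pipeline that turns research papers into interactive, multimedia-rich academic homepages and outperforms existing methods in content, layout, and interactivity.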

Details

Motivation: Existing ways of generating academic project websites (direct LLM generation, templates, or HTML conversion) struggle to produce pages with sound layout and good interactivity, and no comprehensive evaluation suite exists, so a better approach is needed to help research reach its audience. Method: Proposes the Paper2Web benchmark, which includes rule-based metrics (such as Connectivity and Completeness), LLM-as-a-Judge evaluation (covering interactivity, aesthetics, and informativeness), and the PaperQuiz knowledge-retention test; also designs the PWAgent agent, which iteratively refines content and layout through MCP tools to convert papers into interactive webpages automatically. Result: PWAgent outperforms baselines such as template-based pages and arXiv/alphaXiv versions by a large margin across multiple dimensions, produces high-quality, well-laid-out, interactive academic webpages at low cost, and reaches the Pareto front. Conclusion: PWAgent and the Paper2Web evaluation framework together provide an efficient, measurable solution for academic webpage generation and move the online presentation of research toward automation and standardization. Abstract: Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics such as Connectivity and Completeness, human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, reaching the Pareto front in academic webpage generation.

[49] Enhanced Sentiment Interpretation via a Lexicon-Fuzzy-Transformer Framework

Shayan Rokhva,Mousa Alizadeh,Maryam Abdollahi Shamami

Main category: cs.CL

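TL;DR: Proposes a hybrid framework combining a lexicon, fuzzy logic, and a transformer to produce continuous scores for sentiment polarity and intensity, showing better alignment with user ratings and better detection of sentiment extremes on datasets from several domains.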

Details

Motivation: Informal and domain-specific language makes it hard to detect sentiment polarity and intensity accurately in product reviews and social media posts. Method: Uses VADER for initial sentiment estimates, refines them in a two-stage adjustment that combines DistilBERT confidence scores with fuzzy logic, and maps the scores onto a continuous 0-to-1 range with a custom fuzzy inference system. Result: On four domain-specific datasets (food delivery, e-commerce, tourism, and fashion), the framework aligns more closely with user ratings, identifies sentiment extremes better, reduces misclassifications, and is robust and efficient. Conclusion: Combining symbolic reasoning with neural models improves the interpretability and granularity of sentiment analysis in linguistically dynamic domains. Abstract: Accurately detecting sentiment polarity and intensity in product reviews and social media posts remains challenging due to informal and domain-specific language. To address this, we propose a novel hybrid lexicon-fuzzy-transformer framework that combines rule-based heuristics, contextual deep learning, and fuzzy logic to generate continuous sentiment scores reflecting both polarity and strength. The pipeline begins with VADER-based initial sentiment estimations, which are refined through a two-stage adjustment process. This involves leveraging confidence scores from DistilBERT, a lightweight transformer, and applying fuzzy logic principles to mitigate excessive neutrality bias and enhance granularity. A custom fuzzy inference system then maps the refined scores onto a 0 to 1 continuum, producing expert-like judgments. The framework is rigorously evaluated on four domain-specific datasets: food delivery, e-commerce, tourism, and fashion. Results show improved alignment with user ratings, better identification of sentiment extremes, and reduced misclassifications. Both quantitative metrics (distributional alignment, confusion matrices) and qualitative insights (case studies, runtime analysis) affirm the model's robustness and efficiency. This work demonstrates the value of integrating symbolic reasoning with neural models for interpretable, fine-grained sentiment analysis in linguistically dynamic domains.

[50] SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

Kadri Hacioglu,Manjunath K E,Andreas Stolcke

Main category: cs.CL

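TL;DR: Studies speech-based large language models (speechLLMs) for slot filling, proposes improvements to training data, architecture, and training strategies that narrow the gap to an empirical upper bound, and offers empirical guidance and insights.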

Details

Motivation: Slot filling has traditionally relied on a cascade of speech recognition and natural language understanding modules, which lacks unification and generalization. SpeechLLMs promise more unified, generative speech understanding with zero-shot abilities, but their performance, robustness, and generalization on slot filling are unclear, which calls for systematic evaluation and improvement. Method: Establishes an empirical upper bound for slot filling, identifies the performance, robustness, and generalization gaps of existing speechLLMs, and narrows them through changes to training data, model architecture, and training strategies. Result: Each measure improves slot-filling performance substantially; the study also surfaces practical challenges and provides empirical guidance for using emerging speechLLMs. Conclusion: With targeted optimization, speechLLMs can move much closer to the upper bound on slot filling, though practical challenges in robustness and generalization remain; future work should combine instruction tuning with efficient training strategies to draw out more of their capability. Abstract: Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improves performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.

[51] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

Pengkai Wang,Qi Zuo,Pengwei Liu,Zhijie Sang,Congkai Xie,Hongxia Yang

Main category: cs.CL

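TL;DR: ORBIT is a rubric-based incremental training framework for improving LLMs in high-stakes medical dialogue. Using synthetic dialogue generation and dynamically created rubrics, it substantially improves performance on the HealthBench-Hard benchmark with only 2,000 samples.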

Details

Motivation: Existing reinforcement learning methods struggle on open-ended tasks such as medical consultation, where reward signals are ambiguous, subjective, or context-dependent; the lack of reliable reward functions limits the use of LLMs in such high-stakes domains. Method: Proposes the ORBIT framework, which combines synthetic dialogue generation with dynamic rubric creation and uses the rubrics to guide an incremental reinforcement learning process, without relying on external medical knowledge or manual rules. Result: On Qwen3-4B-Instruct, the method raises the HealthBench-Hard score from 7.0 to 27.2 using only 2k samples, reaching state-of-the-art results for models of this scale, with consistent gains across diverse consultation scenarios. Conclusion: Rubric-based reinforcement learning is a scalable and effective strategy for advancing LLMs on complex, open-ended tasks, particularly in high-stakes domains such as medical dialogue. Abstract: Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.

[52] PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction

Simon Yu,Gang Li,Weiyan Shi,Peng Qi

Main category: cs.CL

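TL;DR: PolySkill is a new framework that decouples a skill's abstract goal from its concrete implementation so that agents can continually learn and generalize reusable skills on the open web.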

Details

Motivation: Existing skill-learning methods tend to produce over-specialized skills that fail to generalize across websites, which limits an agent's ability to learn continually in dynamic environments. Method: Inspired by polymorphism in software engineering, PolySkill separates a skill's goal (what) from its execution (how), supporting abstract skill modeling and reuse across environments, and refines skill learning through self-exploration. Result: PolySkill improves skill reuse by 1.7x on seen websites, raises success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, and reduces steps by more than 20%; in self-exploration settings without specified tasks, it proposes higher-quality tasks and learns skills that generalize across sites. Conclusion: Separating a skill's goal from its execution is a key step toward agents that learn continually, autonomously, and generally on the open web, and PolySkill offers a practical path toward more adaptive continual-learning agents. Abstract: Large language models (LLMs) are moving beyond static uses and are now powering agents that learn continually during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill's abstract goal (what it accomplishes) and its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites and (2) boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, while reducing steps by over 20%. (3) In self-exploration settings without specified tasks, our framework improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, PolySkill enhances the agent's ability to learn a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments. Our findings show that separating a skill's goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously.
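
The goal/implementation split borrowed from software polymorphism can be pictured with a short sketch. This illustrates only the general idea and is not PolySkill's skill representation: the class names, the two imaginary sites, the selectors, and the action strings are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class SearchProduct(ABC):
    """Abstract skill: states WHAT is accomplished, not how."""

    @abstractmethod
    def run(self, query: str) -> List[str]:
        """Return the low-level browser actions that carry out the search."""


class ShopASearch(SearchProduct):
    # Hypothetical site with a search box on every page.
    def run(self, query: str) -> List[str]:
        return ["click #search-box", f"type {query!r}", "press Enter"]


class ShopBSearch(SearchProduct):
    # Hypothetical site that needs a dedicated search page first.
    def run(self, query: str) -> List[str]:
        return ["open /search", f"fill input[name=q] with {query!r}", "click button.submit"]


def find_item(skill: SearchProduct, query: str) -> List[str]:
    # The caller depends only on the abstract goal, so a plan written
    # against SearchProduct carries over to any site that implements it.
    return skill.run(query)


plan_a = find_item(ShopASearch(), "red kettle")
plan_b = find_item(ShopBSearch(), "red kettle")
print(plan_a)
print(plan_b)
```

Under this design a new website only needs a new subclass, while higher-level plans that call `find_item` stay unchanged, which matches the intuition behind the skill-reuse gains reported in the abstract.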

cs.CV [Back]

[53] GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments

Leela Krishna,Mengyang Zhao,Saicharithreddy Pasula,Harshit Rajgarhia,Abhishek Mukherji

Main category: cs.CV

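TL;DR: Presents the GAZE pipeline, which automatically converts raw long-form video into high-quality multimodal annotations usable for world-model training, substantially improving efficiency and lowering human review cost.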

Details

Motivation: Training robust world models requires large-scale, precisely labeled multimodal datasets, but traditional manual annotation is slow and expensive, so an automated solution is needed. Method: Designs and implements the GAZE system, which (1) normalizes proprietary 360-degree video formats and shards them; (2) applies a suite of AI models for dense multimodal pre-annotation (scene understanding, object tracking, audio transcription, privacy detection); and (3) consolidates the signals into a structured output for rapid human validation. Result: Saves about 19 minutes per review hour and reduces human review volume by more than 80%; increases label density and consistency, integrates privacy safeguards and chain-of-custody metadata, and produces high-fidelity datasets that can be used directly for training. Conclusion: GAZE offers a scalable blueprint for efficiently generating high-quality world-model training data without sacrificing throughput or governance. Abstract: Training robust world models requires large-scale, precisely labeled multimodal datasets, a process historically bottlenecked by slow and expensive manual annotation. We present a production-tested GAZE pipeline that automates the conversion of raw, long-form video into rich, task-ready supervision for world-model training. Our system (i) normalizes proprietary 360-degree formats into standard views and shards them for parallel processing; (ii) applies a suite of AI models (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection) for dense, multimodal pre-annotation; and (iii) consolidates signals into a structured output specification for rapid human validation. The GAZE workflow demonstrably yields efficiency gains (~19 minutes saved per review hour) and reduces human review volume by >80% through conservative auto-skipping of low-salience segments. By increasing label density and consistency while integrating privacy safeguards and chain-of-custody metadata, our method generates high-fidelity, privacy-aware datasets directly consumable for learning cross-modal dynamics and action-conditioned prediction. We detail our orchestration, model choices, and data dictionary to provide a scalable blueprint for generating high-quality world model training data without sacrificing throughput or governance.
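
The abstract mentions conservative auto-skipping of low-salience segments but does not give the rule. A minimal sketch of what such a rule might look like follows; the `Segment` fields, the salience score, and the 0.2 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start_s: float
    end_s: float
    salience: float      # hypothetical score in [0, 1] from pre-annotation
    privacy_flag: bool   # True if any PII/NSFW/minor detector fired


def needs_review(seg: Segment, threshold: float = 0.2) -> bool:
    # Conservative rule: anything flagged for privacy is always reviewed,
    # and a segment is skipped only when its salience is clearly low.
    return seg.privacy_flag or seg.salience >= threshold


def review_queue(segments: List[Segment], threshold: float = 0.2) -> List[Segment]:
    # Keep the original order so reviewers see segments chronologically.
    return [s for s in segments if needs_review(s, threshold)]
```

The design choice worth noting is the `or`: a privacy hit overrides low salience, so skipping can only reduce review volume for segments that no detector considered sensitive.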

[54] PC-UNet: An Enforcing Poisson Statistics U-Net for Positron Emission Tomography Denoising

Yang Shi,Jingchao Wang,Liangsi Lu,Mingxuan Huang,Ruixin He,Yifeng Xie,Hanqian Liu,Minzhe Guo,Yangyang Liang,Weipeng Zhang,Zimeng Li,Xuhang Chen

Main category: cs.CV

TL;DR: Proposes a new Poisson Consistent U-Net (PC-UNet) that incorporates physical data to improve the fidelity of low-dose PET images.

Details

Motivation: Existing denoising methods cannot cope effectively with Poisson noise in low-dose PET images, producing distortions and artifacts that limit clinical use. Method: Designs a Poisson Variance and Mean Consistency Loss (PVMC-Loss) that is statistically unbiased and gradient-adaptive, acts as a Generalized Method of Moments implementation, and incorporates physical information during training. Result: Experiments on PET datasets show that PC-UNet outperforms existing methods in physical consistency and image fidelity. Conclusion: PC-UNet integrates physical information effectively and markedly improves low-dose PET image quality, with strong robustness and potential for clinical application. Abstract: Positron Emission Tomography (PET) is crucial in medicine, but its clinical use is limited because the doses needed for a high signal-to-noise ratio increase radiation exposure. Lowering doses increases Poisson noise, which current denoising methods fail to handle, causing distortions and artifacts. We propose a Poisson Consistent U-Net (PC-UNet) model with a new Poisson Variance and Mean Consistency Loss (PVMC-Loss) that incorporates physical data to improve image fidelity. PVMC-Loss is statistically unbiased in variance and gradient adaptation, acting as a Generalized Method of Moments implementation, offering robustness to minor data mismatches. Tests on PET datasets show PC-UNet improves physical consistency and image fidelity, proving its ability to integrate physical information effectively.
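
The exact form of PVMC-Loss is not given in the abstract. As a rough illustration of what a variance-mean consistency term in the Generalized Method of Moments style can look like, the sketch below uses the Poisson identity Var[y] = E[y]; the function name and the choice of a squared sample moment are assumptions, not the paper's definition.

```python
from typing import Sequence


def poisson_moment_loss(counts: Sequence[float], rates: Sequence[float]) -> float:
    """Squared sample moment for the Poisson identity Var[y] = E[y].

    counts: observed noisy counts y_i
    rates:  predicted Poisson means lambda_i (the denoised output)
    """
    if len(counts) != len(rates) or len(counts) == 0:
        raise ValueError("counts and rates must be non-empty and of equal length")
    # For y ~ Poisson(lam), E[(y - lam)^2 - lam] = 0, so the sample average
    # of this quantity should be close to zero for a consistent prediction.
    moment = sum((y - lam) ** 2 - lam for y, lam in zip(counts, rates)) / len(counts)
    return moment ** 2
```

A prediction whose residual variance matches its mean drives the loss to zero, while residuals that are too small or too large relative to the predicted rate are both penalized.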

[55] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

Mor Ventura,Michael Toker,Or Patashnik,Yonatan Belinkov,Roi Reichart

Main category: cs.CV

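TL;DR: This paper proposes DeLeaker, a lightweight, optimization-free inference-time method that mitigates semantic leakage in text-to-image generation by intervening directly on the model's attention maps, together with SLIM, the first dataset dedicated to evaluating semantic leakage, and an automatic evaluation framework. Experiments show that DeLeaker clearly outperforms existing methods without sacrificing image quality.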

Details

Motivation: Text-to-image models suffer from semantic leakage, the unintended transfer of semantic features between distinct entities; existing methods mostly rely on optimization or external inputs, and an efficient, training-free solution is lacking. Method: Proposes DeLeaker, which dynamically reweights attention maps during the diffusion process to suppress excessive cross-entity interactions and strengthen each entity's identity; also builds the SLIM dataset and an automatic evaluation framework to support systematic evaluation. Result: DeLeaker outperforms baselines across diverse scenarios, including methods that use external information, mitigating semantic leakage effectively while preserving the fidelity and quality of generated images. Conclusion: Controlling the attention mechanism can effectively mitigate semantic leakage in T2I models, and DeLeaker offers a practical path toward more semantically precise generation. Abstract: Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model's attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
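
To make the idea of attention reweighting concrete, here is a toy sketch that rescales a single attention row. It is not DeLeaker's actual rule: the fixed `boost` and `damp` factors and the token index sets are hypothetical, whereas the abstract describes a dynamic reweighting applied throughout the diffusion process.

```python
from typing import List, Sequence


def reweight_row(attn: Sequence[float], own: Sequence[int], other: Sequence[int],
                 boost: float = 1.5, damp: float = 0.5) -> List[float]:
    """Rescale one attention row and renormalise it to sum to 1.

    own:   indices of tokens belonging to the same entity as the query
    other: indices of tokens belonging to a different entity
    """
    own_set, other_set = set(own), set(other)
    scaled = [a * (boost if i in own_set else damp if i in other_set else 1.0)
              for i, a in enumerate(attn)]
    total = sum(scaled)
    # Fall back to the original row if everything was scaled to zero.
    return [s / total for s in scaled] if total > 0 else list(attn)
```

For a row `[0.5, 0.5]` with token 0 in the query's own entity and token 1 in another, the defaults shift the mass to `[0.75, 0.25]`, weakening the cross-entity interaction while keeping the row a valid distribution.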

[56] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos

Mingxuan Liu,Honglin He,Elisa Ricci,Wayne Wu,Bolei Zhou

Main category: cs.CV

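TL;DR: UrbanVerse is a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes, supporting high-fidelity, scalable training of urban AI agents.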

Details

Motivation: Existing urban simulation environments fall short in either scalability or real-world complexity, so they cannot meet the need for diverse, high-fidelity environments when training urban AI agents. Method: Proposes the UrbanVerse system, comprising UrbanVerse-100K (a repository of more than 100k annotated urban 3D assets) and UrbanVerse-Gen (an automatic pipeline that extracts layouts from video and generates 3D simulation scenes), and builds large-scale simulation environments in IsaacSim. Result: UrbanVerse scenes preserve real-world semantics and layouts, with visual realism comparable to manually crafted scenes; in urban navigation, the trained policies show good scaling behavior and generalization, improving success by 6.3% in simulation and 30.1% in zero-shot sim-to-real transfer, and completing a 300 m real-world mission with only two interventions. Conclusion: UrbanVerse enables efficient, realistic, and scalable construction of urban simulation environments and markedly improves the training and real-world transfer of urban AI agents. Abstract: Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared with prior methods, accomplishing a 300 m real-world mission with only two interventions.

[57] NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks

Junliang Ye,Shenghao Xie,Ruowen Zhao,Zhengyi Wang,Hongyu Yan,Wenqiang Zu,Lei Ma,Jun Zhu

Main category: cs.CV

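TL;DR: This paper proposes Nano3D, a training-free, mask-free framework for 3D object editing that combines FlowEdit with TRELLIS to perform localized edits guided by front-view renderings, and introduces Voxel/Slat-Merge strategies to keep edited and unedited regions consistent, markedly improving the visual quality and consistency of 3D editing. It also builds Nano3D-Edit-100k, the first large-scale 3D editing dataset, to advance 3D editing algorithms.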

Details

Motivation: Existing 3D object editing methods rely on multi-view rendering followed by reconstruction, which is inefficient, prone to artifacts, and poor at keeping edits consistent, and they lack high-quality large-scale datasets. A more efficient, consistent, training-free editing framework is therefore needed, along with a remedy for the data shortage. Method: Proposes the Nano3D framework: (1) integrates FlowEdit into TRELLIS and uses single-view front renderings to guide localized edits; (2) designs region-aware Voxel/Slat-Merge strategies that adaptively preserve structural fidelity; (3) edits directly on the 3D representation without training or masks. Also builds Nano3D-Edit-100k, a dataset of over 100,000 high-quality 3D editing pairs. Result: Experiments show that Nano3D outperforms existing methods in 3D consistency and visual quality, preserves the structure of unedited regions, and reduces artifacts. Nano3D-Edit-100k is the first large-scale 3D editing dataset and an important resource for follow-up research. Conclusion: Nano3D addresses the poor consistency, frequent artifacts, and reliance on reconstruction of traditional 3D editing, achieves precise and coherent training-free editing, fills the gap in large-scale 3D editing data, and lays the groundwork for future feed-forward 3D editing models. Abstract: 3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing dataset, Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page: https://jamesyjl.github.io/Nano3D

[58] Constantly Improving Image Models Need Constantly Improving Benchmarks

Jiaxin Ge,Grace Luo,Heekyung Lee,Nishant Malpani,Long Lian,XuDong Wang,Aleksander Holynski,Trevor Darrell,Sewon Min,David M. Chan

Main category: cs.CV

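TL;DR: This paper proposes ECHO, a framework that builds evaluation benchmarks for image generation models from real user activity on social media (such as novel prompts and qualitative feedback). Applied to GPT-4o Image Gen, it surfaces complex tasks missing from existing benchmarks, separates state-of-the-art models more clearly, and uses community feedback to improve the design of quality metrics.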

Details

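Motivation: Existing benchmarks for image generation models lag behind actual progress and fail to capture the new capabilities introduced by proprietary systems such as GPT-4o, leaving a gap between community perception and formal evaluation. Method: Proposes the ECHO framework, which collects over 31,000 real-world prompts from social media posts that showcase novel prompts and user judgments, constructs a new benchmark from them, and designs quality metrics for shifts in color, identity, and structure based on community feedback. Result: ECHO discovers creative and complex tasks absent from existing benchmarks (such as re-rendering product labels across languages or generating receipts with specified totals), distinguishes state-of-the-art models from alternatives more clearly, and extracts community feedback to refine evaluation metrics. Conclusion: By integrating real user behavior and feedback, ECHO provides a dynamic, application-oriented way to evaluate models, offsets the lag of traditional benchmarks, and moves image generation evaluation in a more practical, community-driven direction. Abstract: Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at https://echo-bench.github.io.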

[59] LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models

Mert Sonmezer,Matthew Zheng,Pinar Yanardag

Main category: cs.CV

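TL;DR: This paper proposes a new method based on a submodular optimization framework for selecting the most relevant and diverse models from a large pool of LoRA adapters, addressing the difficulty of choosing and using suitable adapters in a vast, unorganized database.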

Details

Motivation: Because LoRA adapters are numerous, varied, and lack structured organization, users struggle to select and effectively use suitable ones. Method: Formulates LoRA model selection as a combinatorial optimization problem and proposes a novel submodular framework that balances relevance and diversity. Result: Quantitative and qualitative experiments confirm that the method produces diverse, high-quality outputs across many domains. Conclusion: The proposed submodular framework improves the efficiency of LoRA adapter selection and the diversity of generated results, and suits the management and use of large-scale LoRA model libraries. Abstract: Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like Civit.ai, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.
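
The abstract does not state the submodular objective it uses. One common way to trade relevance against diversity, shown here purely as an assumed stand-in, is a modular relevance term plus a facility-location coverage term, maximized with the standard greedy algorithm. The relevance scores and similarity matrix are hypothetical inputs.

```python
from typing import List, Sequence


def objective(selected: Sequence[int], relevance: Sequence[float],
              sim: Sequence[Sequence[float]], lam: float) -> float:
    # Modular relevance term plus a facility-location coverage term:
    # every adapter in the pool should be similar to some selected adapter.
    if not selected:
        return 0.0
    rel = sum(relevance[i] for i in selected)
    cover = sum(max(row[i] for i in selected) for row in sim)
    return rel + lam * cover


def greedy_select(relevance: Sequence[float], sim: Sequence[Sequence[float]],
                  k: int, lam: float = 1.0) -> List[int]:
    selected: List[int] = []
    candidates = set(range(len(relevance)))
    for _ in range(min(k, len(relevance))):
        base = objective(selected, relevance, sim, lam)
        # Pick the adapter with the largest marginal gain;
        # iterating in sorted order sends ties to the lowest index.
        best = max(sorted(candidates),
                   key=lambda i: objective(selected + [i], relevance, sim, lam) - base)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate adapters and one distinct adapter, setting `lam=0` returns the two most relevant (and redundant) ones, while `lam=1` swaps the duplicate for the distinct adapter. For monotone submodular objectives of this kind, greedy selection carries the classical (1 - 1/e) approximation guarantee.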

Mattia Segu,Marta Tintore Gazulla,Yongqin Xian,Luc Van Gool,Federico Tombari

Main category: cs.CV

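TL;DR: MOBIUS is a new family of foundation models for efficient instance segmentation that, through an optimized architecture and training strategy, substantially reduces computation while maintaining state-of-the-art performance.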

Details

Motivation: Existing foundation models are hard to deploy on resource-constrained edge devices because of their high computational cost, so a better balance between performance and efficiency is needed. Method: Proposes the MOBIUS model family, which includes an efficient bottleneck pixel decoder, a language-guided uncertainty calibration loss, and a streamlined unified training strategy, supporting multi-modal fusion and adaptive decoder pruning. Result: Compared with baselines, MOBIUS reduces the FLOPs of the pixel decoder and the transformer decoder by up to 55% and 75%, respectively, and reaches state-of-the-art performance with only a third of the training iterations. Conclusion: MOBIUS sets a new benchmark for efficient instance segmentation on both high-performance platforms and mobile devices, balancing accuracy, speed, and scalability. Abstract: Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.

[61] Composition-Grounded Instruction Synthesis for Visual Reasoning

Xinyi Gu,Jiayuan Mao,Zhang-Wei Hong,Zhuoran Yu,Pengyuan Li,Dhiraj Joshi,Rogerio Feris,Zexue He

Main category: cs.CV

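TL;DR: This paper proposes COGS, a framework that generates large-scale synthetic question-answer data from a small set of seed questions to improve the reasoning of multimodal large language models on artificial image domains such as charts.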

Details

Motivation: Existing pretrained multimodal large language models have limited reasoning ability in domains where annotations are scarce (such as charts and webpages), which lack large-scale annotated datasets. Method: Decomposes seed questions into primitive perception and reasoning factors and recomposes them with new images to generate many synthetic question-answer pairs with subquestions and intermediate answers, then applies reinforcement learning with factor-level process rewards. Result: Substantially improves model performance on chart reasoning, especially on complex, compositional questions, shows good transfer across datasets, and extends to other domains such as webpages. Conclusion: COGS is a data-efficient framework that strengthens the general reasoning ability of multimodal large language models on artificial image domains while avoiding overfitting. Abstract: Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

[62] Generalized Dynamics Generation towards Scannable Physical World Model

Yichen Li,Zhiyi Li,Brandon Feng,Dinghuai Zhang,Antonio Torralba

Main category: cs.CV

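TL;DR: This paper proposes GDGen, a general framework that unifies rigid-body, articulated-body, and soft-body dynamics from a potential-energy perspective, using directional stiffness and neural fields for geometry-agnostic dynamics modeling, suited to building virtual environments and training robots in complex dynamic scenes.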

Details

Motivation: Developing generalist embodied agents in scannable environments with complex physical behaviors requires a general, flexible, geometry-agnostic modeling framework that unifies multiple kinds of physical dynamics. Method: GDGen builds on the principle that the potential energy of a physical system should be as low as possible, extends classic elastodynamics, and introduces directional stiffness to describe soft, articulated, and rigid body dynamics in a unified way; it uses a specialized network to model the extended material properties and a neural field to represent deformation in a geometry-agnostic manner. Result: Experiments show that GDGen robustly unifies diverse simulation paradigms, accurately infers physical properties across different physical systems, generates plausible dynamics, and supports interactive simulation in complex dynamic scenes. Conclusion: GDGen provides a unified, general, geometry-agnostic dynamics modeling framework that lays the groundwork for high-fidelity digital twin environments and for training agents with physical understanding. Abstract: Digital twin worlds with realistic interactive dynamics present a new opportunity to develop generalist embodied agents in scannable environments with complex physical behaviors. To this end, we present GDGen (Generalized Representation for Generalized Dynamics Generation), a framework that takes a potential energy perspective to seamlessly integrate rigid body, articulated body, and soft body dynamics into a unified, geometry-agnostic system. GDGen operates from the governing principle that the potential energy for any stable physical system should be low. This fresh perspective allows us to treat the world as one holistic entity and infer underlying physical properties from simple motion observations. We extend classic elastodynamics by introducing directional stiffness to capture a broad spectrum of physical behaviors, covering soft elastic, articulated, and rigid body systems. We propose a specialized network to model the extended material property and employ a neural field to represent deformation in a geometry-agnostic manner. Extensive experiments demonstrate that GDGen robustly unifies diverse simulation paradigms, offering a versatile foundation for creating interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.

[63] Comprehensive language-image pre-training for 3D medical image understanding

Tassilo Wald,Ibrahim Ethem Hamamci,Yuan Gao,Sam Bond-Taylor,Harshita Sharma,Maximilian Ilse,Cynthia Lo,Olesya Melnichenko,Noel C. F. Codella,Maria Teodora Wetscherek,Klaus H. Maier-Hein,Panagiotis Korfiatis,Valentina Salvatelli,Javier Alvarez-Valle,Fernando Pérez-García

Main category: cs.CV

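TL;DR: This paper proposes COLIPRI, a 3D medical vision-language pre-training model that strengthens inductive biases by adding a report generation objective and combining it with vision-only pre-training, improving performance when data is scarce.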

Details

Motivation: Paired image-text data is limited in the 3D medical imaging domain, which constrains existing vision-language encoders, so new methods are needed that make better use of unpaired data. Method: Proposes the Comprehensive Language-image Pre-training (COLIPRI) framework, which combines vision-language pre-training with vision-only pre-training and adds a report generation task as an extra supervision signal, drawing on more types of available data. Result: COLIPRI achieves state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remains competitive in semantic segmentation. Conclusion: By introducing stronger inductive biases and training on multiple data types, COLIPRI alleviates the shortage of paired text-image data in 3D medical imaging and markedly improves cross-modal representation. Abstract: Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.

[64] Directional Reasoning Injection for Fine-Tuning MLLMs

Chao Huang,Zeliang Zhang,Jiang Liu,Ximeng Sun,Jialian Wu,Xiaodong Yu,Ze Wang,Chenliang Xu,Emad Barsoum,Zicheng Liu

Main category: cs.CV

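TL;DR: This paper proposes DRIFT, a lightweight method that improves the reasoning of multimodal large language models by transferring reasoning knowledge in gradient space, avoiding the reliance of conventional methods on large-scale data or reinforcement learning and markedly improving reasoning while preserving multimodal alignment.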

Details

Motivation: The reasoning ability of existing multimodal large language models (MLLMs) lags behind that of text-only models; common remedies such as supervised fine-tuning or reinforcement learning are resource-intensive, and naive model merging is unreliable, so a more efficient and stable way to transfer reasoning ability is needed. Method: Proposes Directional Reasoning Injection for Fine-Tuning (DRIFT), which precomputes the parameter difference between a reasoning model and a multimodal model as a reasoning prior and uses that prior to guide gradient updates during multimodal fine-tuning, transferring reasoning ability effectively. Result: Experiments on multimodal reasoning benchmarks including MathVista and MathVerse show that DRIFT consistently improves reasoning performance, outperforms naive model merging and standard supervised fine-tuning, and matches more complex training-heavy methods at lower cost. Conclusion: DRIFT is an efficient, lightweight, and stable method that injects reasoning ability into MLLMs without disrupting multimodal alignment, offering a new approach to model enhancement under limited resources. Abstract: Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
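
The abstract says the reasoning prior is a parameter-space difference that biases gradients, but not how the bias is applied. The sketch below shows one plausible form on plain Python lists: the prior is rescaled to the gradient's norm and subtracted with a strength `alpha`. The rescaling, the `alpha` parameter, and the function names are assumptions for illustration.

```python
import math
from typing import List


def reasoning_prior(theta_reason: List[float], theta_mm: List[float]) -> List[float]:
    # Parameter-space difference between the reasoning and multimodal variants.
    return [r - m for r, m in zip(theta_reason, theta_mm)]


def biased_gradient(grad: List[float], prior: List[float], alpha: float = 0.1) -> List[float]:
    """Nudge a fine-tuning gradient so the descent step leans toward the prior.

    The prior is rescaled to the gradient's norm, so alpha is a relative strength.
    """
    g_norm = math.sqrt(sum(g * g for g in grad))
    p_norm = math.sqrt(sum(p * p for p in prior))
    if g_norm == 0.0 or p_norm == 0.0:
        return list(grad)
    scale = alpha * g_norm / p_norm
    # A descent step moves along -gradient, so subtracting the prior here
    # moves the parameters toward the reasoning variant.
    return [g - scale * p for g, p in zip(grad, prior)]


def sgd_step(theta: List[float], grad: List[float], lr: float) -> List[float]:
    return [t - lr * g for t, g in zip(theta, grad)]
```

Unlike merging, which changes the weights once, this kind of bias acts at every update, so the task gradient can still dominate wherever the prior would hurt the multimodal objective.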

[65] A solution to generalized learning from small training sets found in everyday infant experiences

Frangil Ramirez,Elizabeth Clerkin,David J. Crandall,Linda B. Smith

Main category: cs.CV

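TL;DR: The "lumpy similarity" structure of infants' everyday visual experience helps category generalization from limited experience, and mimicking this natural pattern improves the generalization of machine learning on small datasets.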

Details

Motivation: Infants and young children recognize and generalize basic object categories from limited experience, yet the source of this ability remains unclear; this paper investigates the visual experience that underlies it. Method: Analyzes egocentric images from 14 infants aged 7 to 11 months, identifies the statistical structure of their everyday visual input, and runs computational experiments that mimic this structure to test its effect on generalization in machine learning. Result: Infants' visual input has a "lumpy similarity" structure, in which clusters of highly similar images are interspersed with rarer, more variable ones; mimicking this structure improves the generalization of small-sample machine learning. Conclusion: The natural lumpiness of infants' everyday visual experience may support early category learning and generalization, and offers principles for efficient learning across many kinds of learning systems. Abstract: Young children readily recognize and generalize visual objects labeled by common nouns, suggesting that these basic level object categories may be given. Yet if they are, how they arise remains unclear. We propose that the answer lies in the statistics of infant daily life visual experiences. Whereas large and diverse datasets typically support robust learning and generalization in human and machine learning, infants achieve this generalization from limited experiences. We suggest that the resolution of this apparent contradiction lies in the visual diversity of daily life, repeated experiences with single object instances. Analyzing egocentric images from 14 infants (aged 7 to 11 months), we show that their everyday visual input exhibits a lumpy similarity structure, with clusters of highly similar images interspersed with rarer, more variable ones, across eight early-learned categories. Computational experiments show that mimicking this structure in machines improves generalization from small datasets in machine learning. The natural lumpiness of infant experience may thus support early category learning and generalization and, more broadly, offer principles for efficient learning across a variety of problems and kinds of learners.

[66] SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images

Jiaxin Guo,Tongfan Guan,Wenzhen Dong,Wenzhao Zheng,Wenting Wang,Yue Wang,Yeung Yam,Yun-Hui Liu

Main category: cs.CV

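TL;DR: This paper proposes SaLon3R, the first online generalizable 3D Gaussian Splatting method that supports real-time (10+ FPS) reconstruction of long sequences (50+ views), achieving efficient, robust 3D reconstruction through structure awareness and redundancy removal.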

Details

Motivation: Existing 3DGS methods suffer from substantial redundancy and geometric inconsistency on long video sequences, struggle to balance efficiency and quality, and cannot model long-term consistency. Method: Introduces compact anchor primitives and compresses redundancy through differentiable saliency-aware Gaussian quantization; uses a 3D Point Transformer to learn spatial structural priors and refine anchor attributes and saliency, correcting cross-frame geometric and photometric inconsistencies. Result: Achieves state-of-the-art performance on multiple datasets, supports real-time reconstruction of more than 50 views (>10 FPS), removes 50% to 90% of redundancy, and shows higher efficiency, robustness, and generalization in novel view synthesis and depth estimation. Conclusion: SaLon3R is the first to achieve efficient, structure-aware, long-term online 3DGS reconstruction; its compact representation and learned structural priors resolve redundancy and inconsistency and advance the practical use of generalizable 3D reconstruction. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine Gaussians from all views as the scene representation, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To the best of our knowledge, SaLon3R is the first online generalizable GS method capable of reconstructing over 50 views at over 10 FPS, with 50% to 90% redundancy removal. Our method introduces compact anchor primitives to eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are compressed into compact anchors by prioritizing high-complexity regions. The 3D Point Transformer then learns spatial structural priors in 3D space from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for geometric fidelity. Without known camera parameters or test-time optimization, our approach effectively resolves artifacts and prunes the redundant 3DGS in a single feed-forward pass. Experiments on multiple datasets demonstrate our state-of-the-art performance on both novel view synthesis and depth estimation, showing superior efficiency, robustness, and generalization ability for long-term generalizable 3D reconstruction. Project Page: https://wrld.github.io/SaLon3R/.

[67] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation

Guofeng Zhang,Angtian Wang,Jacob Zhiyuan Fang,Liming Jiang,Haotian Yang,Bo Liu,Yiding Yang,Guang Chen,Longyin Wen,Alan Yuille,Chongyang Ma

Main category: cs.CV

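TL;DR: This paper proposes the Text-Grounded Trajectories (TGT) framework, which strengthens subject control in video generation by pairing trajectories with localized text descriptions; combined with Location-Aware Cross-Attention and a dual-CFG scheme, it achieves more precise control of appearance and motion in complex multi-object scenes.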

Details

Motivation: Existing text-to-video methods have limited ability to control the subject composition of generated scenes, and in complex multi-object settings they lack a clear correspondence between individual trajectories and visual entities. Method: Proposes the TGT framework, which conditions generation on point trajectories annotated with text; introduces Location-Aware Cross-Attention (LACA) to fuse localized text with trajectory signals and a dual-CFG scheme to modulate local and global text guidance separately; and builds a data processing pipeline, annotating two million high-quality video clips for training. Result: Experiments show that TGT outperforms prior methods in visual quality, text alignment accuracy, and motion controllability, with particularly strong results in multi-object scenes. Conclusion: By combining trajectories with localized text descriptions, TGT markedly improves fine-grained control over multiple subjects in text-to-video generation and offers a new direction for controllable video generation. Abstract: Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high-quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: https://textgroundedtraj.github.io.
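
The abstract names a dual-CFG scheme without giving its formula. A common way to compose two classifier-free guidance terms, shown here as an assumption rather than the paper's definition, is to add a separately weighted direction for each condition. The three noise predictions and both weights are hypothetical inputs.

```python
from typing import List, Sequence


def dual_cfg(eps_uncond: Sequence[float], eps_global: Sequence[float],
             eps_full: Sequence[float], w_global: float = 7.5,
             w_local: float = 3.0) -> List[float]:
    """Combine three denoiser outputs with two guidance weights.

    eps_uncond: prediction with no text conditioning
    eps_global: prediction with the global prompt only
    eps_full:   prediction with the global prompt plus localized text
    """
    return [u + w_global * (g - u) + w_local * (f - g)
            for u, g, f in zip(eps_uncond, eps_global, eps_full)]
```

Setting `w_global=1` and `w_local=0` recovers the globally conditioned prediction unchanged, which makes the two weights easy to reason about independently.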

[68] Deep generative priors for 3D brain analysis

Ana Lawry Aguila,Dina Zemlyanker,You Cheng,Sudeshna Das,Daniel C. Alexander,Oula Puonti,Annabel Sorby-Adams,W. Taylor Kimberly,Juan Eugenio Iglesias

Main category: cs.CV

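TL;DR: This paper proposes using diffusion models as priors for solving inverse problems in brain medical imaging, combining a Bayesian framework with data-driven methods to achieve high-quality image reconstruction without paired training data.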

Details

Motivation: Existing methods for medical imaging inverse problems rely on classical mathematical priors that struggle to capture complex brain structure, while diffusion models, though powerful, are not yet combined with domain knowledge. A general framework that merges data-driven capability with domain knowledge is therefore needed. Method: Uses a score-based diffusion prior trained on diverse brain MRI data, paired with flexible forward models, to solve tasks such as super-resolution, bias field correction, and inpainting, and to refine the outputs of existing deep learning methods. Result: Experiments on varied clinical and research MRI data show that the method reaches state-of-the-art performance without paired training data, producing consistent, high-quality results. Conclusion: Diffusion priors can serve as versatile tools for brain MRI analysis, effectively combining data-driven models with domain knowledge and improving anatomical fidelity and robustness. Abstract: Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance, producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.

[69] Fourier Transform Multiple Instance Learning for Whole Slide Image Classification

Anthony Bilic,Guangyu Sun,Ming Li,Md Sanzid Bin Hossain,Yu Tian,Wei Zhang,Laura Brattain,Dexter Hadley,Chen Chen

Main category: cs.CV

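TL;DR: Proposes the FFT-MIL framework, which adds a frequency-domain branch to strengthen global context modeling in whole slide image classification, combining low-frequency features with spatial features and clearly improving performance across multiple datasets and MIL architectures.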

Details

Motivation: Existing MIL methods struggle to capture global dependencies in whole slide images because the images are enormous and local patch embeddings lack global structural information, which limits the modeling of important coarse structures. Method: Uses the Fast Fourier Transform to extract low-frequency regions of the WSI, designs an FFT-Block (convolutional layers plus Min-Max normalization) to process the frequency-domain information, and fuses the learned global frequency feature with spatial patch features through lightweight integration strategies that fit a range of MIL architectures. Result: On three public datasets (BRACS, LUAD, and IMP), integrating the FFT-Block into six state-of-the-art MIL methods improves macro F1 by 3.51% and AUC by 1.51% on average, with consistent gains. Conclusion: Frequency-domain learning captures global dependencies in WSI classification effectively and efficiently, complements spatial features, and improves the scalability and accuracy of MIL frameworks. Abstract: Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.
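
The low-frequency crop and Min-Max normalization steps can be sketched on a tiny grayscale image. This is a toy stand-in, not the paper's pipeline: a naive discrete Fourier transform replaces the FFT to keep the example dependency-free, the crop size `k` is a hypothetical parameter, and the convolutional layers of the FFT-Block are omitted.

```python
import cmath
from typing import List

Matrix = List[List[float]]


def dft2(img: Matrix) -> List[List[complex]]:
    # Naive 2D discrete Fourier transform; fine for tiny illustrative inputs,
    # but a real pipeline would use an FFT routine instead.
    h, w = len(img), len(img[0])
    return [[sum(img[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)]
            for u in range(h)]


def low_freq_crop(img: Matrix, k: int) -> Matrix:
    """Magnitudes of the (2k+1) x (2k+1) lowest frequencies, DC in the centre."""
    spec = dft2(img)
    h, w = len(img), len(img[0])
    # The modulo maps negative frequencies to the upper end of the spectrum.
    return [[abs(spec[u % h][v % w]) for v in range(-k, k + 1)]
            for u in range(-k, k + 1)]


def min_max(mat: Matrix) -> Matrix:
    lo = min(min(row) for row in mat)
    hi = max(max(row) for row in mat)
    span = (hi - lo) or 1.0  # avoid division by zero on constant input
    return [[(x - lo) / span for x in row] for row in mat]
```

The normalization matters because the DC term is typically orders of magnitude larger than neighbouring coefficients, which is the high variance of frequency data the abstract refers to.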

Xingrui Wang,Jiang Liu,Chao Huang,Xiaodong Yu,Ze Wang,Ximeng Sun,Jialian Wu,Alan Yuille,Emad Barsoum,Zicheng Liu

Main category: cs.CV

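TL;DR: This paper presents XModBench, a large-scale tri-modal benchmark for evaluating the cross-modal consistency of multimodal large models, and reveals that existing models fall short on modality-invariant reasoning.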

Details

Motivation: Existing benchmarks mainly evaluate cross-modal question answering and lack a systematic assessment of modality-invariant reasoning and modality bias, so a benchmark dedicated to measuring cross-modal consistency is needed. Method: XModBench is built from 60,828 multiple-choice questions covering five task families and six modality compositions, and systematically evaluates the consistency of a model's reasoning across different input and output modalities. Result: Experiments show that the strongest current model, Gemini 2.5 Pro, scores below 60% accuracy on spatial and temporal reasoning, performs markedly worse on audio than on text, and is less consistent when vision serves as context, indicating modality disparity and directional imbalance. Conclusion: Current multimodal large models have not yet achieved truly modality-invariant reasoning; XModBench can serve as an important tool for diagnosing and improving cross-modal capability. Abstract: Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
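
One simple way to read "cross-modal consistency" is the fraction of questions for which a model gives the same answer under every modality composition. The sketch below implements that reading only; the benchmark's actual metric is not specified in the abstract, and the record layout, the configuration strings, and the function name `consistency` are hypothetical.

```python
from collections import defaultdict

def consistency(records):
    """Fraction of questions answered identically under every modality configuration.

    records: iterable of (question_id, modality_config, predicted_choice).
    """
    answers_by_question = defaultdict(set)
    for question_id, _config, choice in records:
        answers_by_question[question_id].add(choice)
    # A question is consistent when all configurations produced one distinct answer.
    consistent = sum(len(choices) == 1 for choices in answers_by_question.values())
    return consistent / len(answers_by_question)

# Hypothetical predictions for two questions under different compositions.
records = [
    ("q1", "audio->text", "B"),
    ("q1", "text->audio", "B"),
    ("q1", "vision->text", "B"),
    ("q2", "audio->text", "A"),
    ("q2", "text->audio", "C"),
]
score = consistency(records)
```

Here `q1` is answered consistently and `q2` is not, so half of the questions count as consistent.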

[71] Train a Unified Multimodal Data Quality Classifier with Synthetic Data

Weizhi Wang,Rongmei Lin,Shiyang Li,Colin Lockard,Ritesh Sarkhel,Sanket Lokegaonkar,Jingbo Shang,Xifeng Yan,Nasser Zalmout,Xian Li

Main category: cs.CV

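TL;DR: Proposes UniFilter, a unified multimodal data quality classifier for filtering high-quality image-text data; a semi-synthetic method generates training data annotated at multiple quality levels, improving MLLM pre-training.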

Details

Motivation: High-quality filtering of image-text interleaved document data for MLLM pre-training is under-explored, and effective data quality assessment methods are lacking. Method: The UniFilter model is trained on sample-score pairs built semi-synthetically: text at four quality levels is generated from raw images, yielding quality-scored image-text samples for training the classifier. Result: High-quality subsets were curated from the DataComp and OBELICS datasets; MLLMs pre-trained on the filtered data show significantly stronger zero-shot reasoning and in-context learning than baseline models and perform better on multiple benchmarks. Conclusion: High-quality multimodal pre-training data is critical to MLLM performance; UniFilter effectively improves data quality, and the associated data and models have been released. Abstract: Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while high-quality data filtering for image-text interleaved document data is under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.

[72] Hyperparameter Optimization and Reproducibility in Deep Learning Model Training

Usman Afzaal,Ziyu Su,Usama Sajjad,Hao Lu,Mostafa Rezapour,Metin Nafi Gurcan,Muhammad Khalid Khan Niazi

Main category: cs.CV

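TL;DR: By training a CLIP model on the QUILT-1M dataset, this study systematically evaluates how hyperparameters and data augmentation strategies affect reproducibility on downstream pathology tasks, and finds that specific RandomResizedCrop values, the distributed training setup, and the learning rate are critical to result stability.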

Details

Motivation: To address reproducibility problems in histopathology foundation model training caused by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. Method: A CLIP model is trained on QUILT-1M, and the effects of different hyperparameters and augmentation strategies are systematically evaluated on three downstream datasets: PatchCamelyon, LC25000-Lung, and LC25000-Colon. Result: RandomResizedCrop values of 0.7-0.8 perform best, distributed training without local loss is more stable, learning rates below 5.0e-5 degrade performance, and the LC25000 (Colon) dataset is the most reproducible. Conclusion: Reproducibility in histopathology depends not only on transparent documentation but also on carefully designed experimental configurations; the study offers practical rules to guide future development of foundation models for digital pathology. Abstract: Reproducibility remains a critical challenge in foundation model training for histopathology, often hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. To investigate these issues, we trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon). Despite variability across runs, we identified clear trends: RandomResizedCrop values of 0.7-0.8 outperformed more aggressive (0.6) or conservative (0.9) settings, distributed training without local loss improved stability, and learning rates below 5.0e-5 consistently degraded performance across all datasets. The LC25000 (Colon) dataset consistently provided the most reproducible benchmark. These findings highlight that reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations, and we provide practical rules to guide future efforts in developing reproducible foundation models for digital pathology.

[73] Salient Concept-Aware Generative Data Augmentation

Tianchen Zhao,Xuanbai Chen,Zhihua Li,Jun Fang,Dongsheng An,Xiang Xu,Zhuowen Tu,Yifan Xing

Main category: cs.CV

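TL;DR: Proposes a personalized image generation framework whose salient concept-aware image embedding model reduces the influence of irrelevant visual details, improving the fidelity and diversity of generated images; it outperforms existing data augmentation methods on eight fine-grained vision datasets.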

Details

Motivation: Existing generative data augmentation methods conditioned on image and text prompts struggle to balance fidelity and diversity, because representations in the synthesis process are often entangled with non-essential image attributes, which creates conflicts with the text prompts. Method: A salient concept-aware image embedding model suppresses the influence of irrelevant visual details, so that generated images better preserve class-discriminative features while introducing controlled variations, giving intuitive alignment between image and text inputs. Result: Effectiveness is validated on eight fine-grained vision datasets; classification accuracy improves by 0.73% and 6.5% on average under conventional and long-tail settings, respectively, surpassing state-of-the-art data augmentation methods. Conclusion: The framework effectively increases training data diversity and strengthens the robustness of downstream models, addressing the fidelity-diversity trade-off in image-text conditioned generation. Abstract: Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with average classification accuracy improvements of 0.73% and 6.5% under conventional and long-tail settings, respectively.

[74] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records

Daniela Vega,Hannah V. Ceballos,Javier S. Vera,Santiago Rodriguez,Alejandra Perez,Angela Castillo,Maria Escobar,Dario Londoño,Luis A. Sarmiento,Camila I. Castro,Nadiezhda Rodriguez,Juan C. Briceño,Pablo Arbeláez

Main category: cs.CV

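TL;DR: This paper introduces CARDIUM, the first public multimodal dataset combining fetal ultrasound images with maternal clinical records for prenatal detection of congenital heart disease (CHD), and designs a multimodal Transformer based on cross-attention that significantly improves detection performance.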

Details

Motivation: Because congenital heart disease cases are rare, existing AI research is limited by scarce, low-quality, single-modality data, and no public dataset integrates imaging with clinical data, which limits clinical decision support. Method: The authors build CARDIUM, a multimodal dataset of fetal ultrasound and echocardiographic images together with maternal clinical records, and propose a multimodal Transformer architecture that fuses image and tabular features through a cross-attention mechanism. Result: On the CARDIUM dataset, the model improves over image-only and tabular-only approaches by 11% and 50%, respectively, reaching an F1 score of 79.8±4.8%. Conclusion: CARDIUM is an important public resource for prenatal CHD detection, and the proposed multimodal method significantly improves detection performance, advancing the potential of AI in clinical decision-making. Abstract: Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 $\pm$ 4.8% in the CARDIUM dataset. We will publicly release our dataset and code to encourage further research in this unexplored field. Our dataset and code are available at https://github.com/BCVUniandes/Cardium, and at the project website https://bcv-uniandes.github.io/CardiumPage/

[75] The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads

Aysan Aghazadeh,Adriana Kovashka

Main category: cs.CV

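TL;DR: This study explores the potential of text-to-image models for customizing visual ads and targeting specific populations, analyzes racial and gender bias in ads across topics, and experiments with a technique for targeting ads to specific countries.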

Details

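Motivation: To explore the potential of text-to-image generation models for ad customization while assessing their bias in gender and racial representation. Method: The study analyzes demographic bias across ad topics, compares the model-judged persuasiveness of ads that differ only in the gender/race of the people portrayed, and tries a technique for geographically targeted ads. Result: Significant gender and racial bias is found in the ads, the perceived persuasiveness of ads differs across demographic groups, and ads can be targeted by country. Conclusion: Text-to-image models have potential for ad customization, but the social biases they may amplify call for caution; fairer generation strategies should be developed in future work. Abstract: Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries. The code is available at https://github.com/aysanaghazadeh/FaceOfPersuasion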

[76] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

Weijie Wang,Jiagang Zhu,Zeyu Zhang,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Chaojun Ni,Haoxiao Wang,Guan Huang,Xinze Chen,Yukun Zhou,Wenkang Qin,Duochao Shi,Haoyun Li,Guanghong Jia,Jiwen Lu

Main category: cs.CV

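TL;DR: DriveGen3D is a new framework for generating high-quality, highly controllable dynamic 3D driving scenes; it combines efficient long-horizon video generation with large-scale dynamic scene reconstruction to achieve real-time, parameter-efficient 3D driving scene synthesis.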

Details

Motivation: Existing driving scene generation methods are either computationally expensive or functionally limited with respect to long-duration generation, 3D representation, or dynamic scene reconstruction; DriveGen3D aims to overcome these limitations. Method: A unified framework comprising FastDrive-DiT (an efficient video diffusion Transformer) and FastRecon3D (a feed-forward 3D reconstruction module), conditioned on text and bird's-eye-view layouts, jointly generates high-resolution video and 3D Gaussian representations. Result: The framework generates extended driving videos in real time at up to 424×800 resolution and 12 FPS, reaches SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, and maintains good spatio-temporal consistency and parameter efficiency. Conclusion: DriveGen3D effectively combines long-term video generation with dynamic 3D reconstruction, offering an efficient, controllable solution for autonomous driving simulation. Abstract: We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to $424\times800$ at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.

[77] CuSfM: CUDA-Accelerated Structure-from-Motion

Jingrui Yu,Jun Liu,Kefei Ren,Joydeep Biswas,Rurui Ye,Keqiang Wu,Chirag Majithia,Di Zeng

Main category: cs.CV

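TL;DR: This paper presents cuSfM, a CUDA-accelerated offline Structure-from-Motion system for efficient and accurate camera pose estimation that significantly outperforms COLMAP, released as the open-source Python wrapper PyCuSfM.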

Details

Motivation: Dense reconstruction in autonomous navigation, robotic perception, and virtual simulation requires highly accurate camera pose estimation, yet existing methods trade off computational efficiency against accuracy. Method: cuSfM uses CUDA for GPU parallelization so that computationally intensive but highly accurate feature extractors can be used efficiently in an offline SfM system that supports pose optimization, mapping, prior-map localization, and extrinsic refinement. Result: Experiments show that cuSfM significantly improves accuracy and processing speed over COLMAP across a variety of test scenarios while maintaining high precision and global consistency. Conclusion: Through GPU acceleration, cuSfM delivers efficient and accurate offline SfM that surpasses existing methods in both accuracy and speed; it suits accuracy-critical offline applications and has been open-sourced for research use. Abstract: Efficient and accurate camera pose estimation forms the foundational requirement for dense reconstruction in autonomous navigation, robotic perception, and virtual simulation systems. This paper addresses the challenge via cuSfM, a CUDA-accelerated offline Structure-from-Motion system that leverages GPU parallelization to efficiently employ computationally intensive yet highly accurate feature extractors, generating comprehensive and non-redundant data associations for precise camera pose estimation and globally consistent mapping. The system supports pose optimization, mapping, prior-map localization, and extrinsic refinement. It is designed for offline processing, where computational resources can be fully utilized to maximize accuracy. Experimental results demonstrate that cuSfM achieves significantly improved accuracy and processing speed compared to the widely used COLMAP method across various testing scenarios, while maintaining the high precision and global consistency essential for offline SfM applications. The system is released as an open-source Python wrapper implementation, PyCuSfM, available at https://github.com/nvidia-isaac/pyCuSFM, to facilitate research and applications in computer vision and robotics.

[78] Post-Processing Methods for Improving Accuracy in MRI Inpainting

Nishad Kulkarni,Krithika Iyer,Austin Tapp,Abhijeet Parida,Daniel Capellán-Martín,Zhifan Jiang,María J. Ledesma-Carbayo,Syed Muhammad Anwar,Marius George Linguraru

Main category: cs.CV

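TL;DR: This paper systematically evaluates state-of-the-art MRI inpainting models and proposes combining model ensembling with post-processing strategies (such as median filtering, histogram matching, and pixel averaging) and a lightweight U-Net enhancement stage, which significantly improves the anatomical plausibility and visual fidelity of inpainted regions and yields more accurate, robust results.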

Details

Motivation: Existing MRI inpainting models saturate in performance on large lesions such as brain tumors, and most automated analysis tools are optimized for healthy anatomy and cannot be applied reliably to pathological brain tissue, so inpainting quality must improve before general-purpose analysis tools can be used effectively. Method: Model ensembling is combined with several efficient post-processing techniques (median filtering, histogram matching, and pixel averaging), followed by a lightweight U-Net that further refines the anatomy. Result: The proposed method surpasses individual baseline models in both anatomical plausibility and visual fidelity; the inpainted results are more accurate and robust, and comprehensive evaluation shows a significant performance gain. Conclusion: Combining existing models with a targeted post-processing pipeline yields better and more accessible inpainting without relying on new models, supporting clinical deployment and resource-conscious, sustainable research. Abstract: Magnetic Resonance Imaging (MRI) is the primary imaging modality used in the diagnosis, assessment, and treatment planning for brain pathologies. However, most automated MRI analysis tools, such as segmentation and registration pipelines, are optimized for healthy anatomies and often fail when confronted with large lesions such as tumors. To overcome this, image inpainting techniques aim to locally synthesize healthy brain tissues in tumor regions, enabling the reliable application of general-purpose tools. In this work, we systematically evaluate state-of-the-art inpainting models and observe a saturation in their standalone performance. In response, we introduce a methodology combining model ensembling with efficient post-processing strategies such as median filtering, histogram matching, and pixel averaging. Further anatomical refinement is achieved via a lightweight U-Net enhancement stage. Comprehensive evaluation demonstrates that our proposed pipeline improves the anatomical plausibility and visual fidelity of inpainted regions, yielding higher accuracy and more robust outcomes than individual baseline models. By combining established models with targeted post-processing, we achieve improved and more accessible inpainting outcomes, supporting broader clinical deployment and sustainable, resource-conscious research. Our 2025 BraTS inpainting docker is available at https://hub.docker.com/layers/aparida12/brats2025/inpt.
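
The three post-processing steps named in the abstract (pixel averaging, median filtering, histogram matching) can be sketched in NumPy as follows. This is a generic 2D illustration under stated assumptions, not the authors' pipeline: the order of the steps, the 3x3 window, the use of 2D slices instead of 3D volumes, and all function names are chosen here for brevity, and the U-Net refinement stage is omitted.

```python
import numpy as np

def pixel_average(predictions):
    """Ensemble several inpainting outputs by averaging them voxel-wise."""
    return np.mean(np.stack(predictions), axis=0)

def median_filter3(img):
    """3x3 median filter with edge padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    windows = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(windows), axis=0)

def match_histogram(source, reference):
    """Remap source intensities so their empirical CDF follows the reference CDF."""
    _, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size
    matched = np.interp(src_cdf, ref_cdf, ref_vals)
    return matched[src_idx].reshape(source.shape)

rng = np.random.default_rng(0)
healthy = rng.normal(100.0, 10.0, (32, 32))   # hypothetical healthy-tissue reference slice
preds = [healthy + rng.normal(0.0, 5.0, healthy.shape) for _ in range(3)]  # mock model outputs
fused = median_filter3(pixel_average(preds))
result = match_histogram(fused, healthy)
```

Averaging and median filtering both smooth the intensity distribution, so matching the histogram to healthy tissue afterwards restores a realistic intensity range.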

[79] QCFace: Image Quality Control for boosting Face Representation & Recognition

Duc-Phuong Doan-Ngo,Thanh-Dang Diep,Thanh Nguyen-Duc,Thanh-Sach LE,Nam Thoai

Main category: cs.CV

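TL;DR: This paper proposes QCFace, a quality-control face recognition method based on a hard margin strategy that effectively decouples recognizability from identity representation and improves face recognition performance on verification and identification tasks.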

Details

Motivation: Existing methods capture face recognizability with weak representations and low discrimination, and overlapping gradients between feature direction and magnitude cause unstable optimization and entangled representations. Method: The hard margin strategy QCFace is introduced together with a new loss function that uses a guidance factor for hypersphere planning, optimizing recognition ability and recognizability representation simultaneously. Result: Experiments show that QCFace reaches state-of-the-art performance on multiple verification and identification benchmarks and provides a robust, quantifiable encoding of recognizability. Conclusion: QCFace effectively resolves the coupling between recognizability and identity representation and significantly improves the generalization and performance of face recognition systems. Abstract: Recognizability, a key perceptual factor in human face processing, strongly affects the performance of face recognition (FR) systems in both verification and identification tasks. Effectively using recognizability to enhance feature representation remains challenging. In deep FR, the loss function plays a crucial role in shaping how features are embedded. However, current methods have two main drawbacks: (i) recognizability is only partially captured through soft margin constraints, resulting in weaker quality representation and lower discrimination, especially for low-quality or ambiguous faces; (ii) mutual overlapping gradients between feature direction and magnitude introduce undesirable interactions during optimization, causing instability and confusion in hypersphere planning, which may result in poor generalization and in entangled representations where recognizability and identity are not cleanly separated. To address these issues, we introduce a hard margin strategy, Quality Control Face (QCFace), which overcomes the mutual overlapping gradient problem and enables the clear decoupling of recognizability from identity representation. Based on this strategy, a novel hard-margin-based loss function employs a guidance factor for hypersphere planning, simultaneously optimizing for recognition ability and explicit recognizability representation. Extensive experiments confirm that QCFace not only provides robust and quantifiable recognizability encoding but also achieves state-of-the-art performance in both verification and identification benchmarks compared to existing recognizability-based losses.

[80] Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning

Yiming Lin,Shang Wang,Junkai Zhou,Qiufeng Wang,Xiao-Bo Jin,Kaizhu Huang

Main category: cs.CV

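TL;DR: Proposes the first hyperbolic ball classification framework for Single Positive Multi-Label Learning (SPMLL); labels are represented geometrically as balls, which explicitly models inclusion, overlap, and separation relationships among labels, and a temperature-adaptive classifier with double-well regularization yields competitive performance and better interpretability.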

Details

Motivation: Existing SPMLL methods model label relationships implicitly and lack explicit geometric definitions for different relationship types, which makes complex label structures and hierarchies hard to capture. Method: Each label is represented as a ball in hyperbolic space rather than a point or vector, and geometric interactions between balls (inclusion, overlap, separation) model multiple kinds of label relationships; a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization guide the balls toward meaningful configurations. Result: Experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, and CUB-200-2011) show competitive performance with stronger interpretability; statistical analysis shows that the learned embeddings correlate strongly with real-world co-occurrence patterns. Conclusion: Hyperbolic geometry offers a more robust paradigm for structured classification under incomplete supervision, and the proposed ball representation models complex label relationships explicitly and effectively. Abstract: Single Positive Multi-Label Learning (SPMLL) addresses the challenging scenario where each training sample is annotated with only one positive label despite potentially belonging to multiple categories, making it difficult to capture complex label relationships and hierarchical structures. Existing methods implicitly model label relationships through distance-based similarity but lack explicit geometric definitions for different relationship types. To address these limitations, we propose the first hyperbolic classification framework for SPMLL that represents each label as a hyperbolic ball rather than a point or vector, enabling rich inter-label relationship modeling through geometric ball interactions. Our ball-based approach naturally captures multiple relationship types simultaneously: inclusion for hierarchical structures, overlap for co-occurrence patterns, and separation for semantic independence. Further, we introduce two key component innovations: a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization that guides balls toward meaningful configurations. To validate our approach, extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) demonstrate competitive performance with superior interpretability compared to existing methods. Furthermore, statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns, establishing hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision.
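
The three ball relationships in the abstract (inclusion, overlap, separation) follow from comparing the distance between two centers with the two radii. The sketch below assumes the Poincare ball model and a standard metric-ball test; the paper's exact model of hyperbolic space, its classifier, and its regularizer are not given in the abstract, and the names `poincare_distance` and `ball_relation` are introduced here only for illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(c * c for c in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq_norm(u)) * (1.0 - sq_norm(v))))

def ball_relation(center_a, radius_a, center_b, radius_b):
    """Classify two hyperbolic balls (radii measured in geodesic distance)."""
    d = poincare_distance(center_a, center_b)
    if d >= radius_a + radius_b:
        return "separation"   # the balls share no interior point
    if d + min(radius_a, radius_b) <= max(radius_a, radius_b):
        return "inclusion"    # the smaller ball lies inside the larger one
    return "overlap"          # partial intersection

origin, near_edge = (0.0, 0.0), (0.9, 0.0)   # geodesic distance is about 2.94
relations = [
    ball_relation(origin, 1.0, near_edge, 1.0),
    ball_relation(origin, 2.0, near_edge, 1.5),
    ball_relation(origin, 5.0, near_edge, 1.0),
]
```

Under the reading in the abstract, inclusion would correspond to a hierarchical label pair, overlap to co-occurring labels, and separation to semantically independent labels.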

[81] Latent Diffusion Model without Variational Autoencoder

Minglei Shi,Haolin Wang,Wenzhao Zheng,Ziyang Yuan,Xiaoshi Wu,Xintao Wang,Pengfei Wan,Jie Zhou,Jiwen Lu

Main category: cs.CV

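TL;DR: Proposes SVG, a new latent diffusion model without a variational autoencoder that uses self-supervised representations for visual generation; frozen DINO features provide a feature space with clear semantic discriminability and a lightweight residual branch captures fine-grained details, enabling more efficient training, fast sampling, and high-quality generation.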

Details

Motivation: The existing VAE+diffusion paradigm suffers from low training efficiency, slow inference, and poor transferability, mainly because VAE latent spaces lack clear semantic separation and strong discriminative structure; overcoming these limits requires a more semantically structured latent space that improves generation efficiency and multi-task applicability. Method: SVG builds a semantically clear, discriminative latent space from frozen DINO self-supervised features and adds a lightweight residual branch to recover fine details; the diffusion model is trained directly on this structured space without a VAE, improving training and inference efficiency. Result: SVG accelerates diffusion training, supports few-step sampling, and improves generation quality; experiments show that it preserves the semantic and discriminative capabilities of the underlying self-supervised representations and is applicable to a range of vision tasks. Conclusion: SVG offers a new path toward task-general, high-quality visual representations, free of the constraints of conventional VAEs, with advantages in generation efficiency, semantic structure, and cross-task transfer. Abstract: Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.

[82] Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation

Fei Wang,Li Shen,Liang Ding,Chao Xue,Ye Liu,Changxing Ding

Main category: cs.CV

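TL;DR: This paper proposes CoMe, a progressive layer pruning framework based on a channel sensitivity metric and hierarchical distillation; its concatenation-based merging retains critical channels and enables efficient knowledge transfer, retaining 83% of the original average accuracy when 30% of LLaMA-2-7b's parameters are pruned.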

Details

Motivation: Existing structured pruning methods remove layers directly, which degrades performance; their weight aggregation is inadequate and they lack effective post-training recovery, making it hard to compress a model while retaining its capabilities. Method: The CoMe framework has three parts: 1) a channel sensitivity metric based on activation intensity and weight norms for fine-grained channel selection; 2) a concatenation-based layer merging method that fuses the critical channels of adjacent layers; 3) a hierarchical distillation post-training strategy that uses the layer correspondences between the original and pruned models established during pruning for efficient knowledge transfer. Result: Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when 30% of LLaMA-2-7b's parameters are pruned, the pruned model retains 83% of the original average accuracy. Conclusion: CoMe effectively addresses the performance degradation, inefficient aggregation, and difficult recovery of conventional structured pruning, achieving a better balance between compression and performance. Abstract: Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to neglect retaining the capabilities of the pruned part. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) incompetent linear weight layer aggregation, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b's parameters, the pruned model retains 83% of its original average accuracy. Our code is available at https://github.com/MPI-Lab/CoMe.
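
A rough sketch of channel scoring followed by concatenation-based merging is given below. The abstract only says the sensitivity metric "utilizes activation intensity and weight norms"; the product form used here, the pooling of channels from two adjacent layers, and the names `channel_scores` and `merge_by_concatenation` are assumptions, so this should not be read as the CoMe implementation.

```python
import numpy as np

def channel_scores(activations, weights):
    """Assumed score: mean absolute activation times the L2 norm of the channel weights.

    activations: (samples, channels); weights: (channels, features).
    """
    return np.abs(activations).mean(axis=0) * np.linalg.norm(weights, axis=1)

def merge_by_concatenation(layer_a, layer_b, keep):
    """Keep the `keep` highest-scoring channels across two layers and stack their weights."""
    (act_a, w_a), (act_b, w_b) = layer_a, layer_b
    scores = np.concatenate([channel_scores(act_a, w_a), channel_scores(act_b, w_b)])
    pooled = np.concatenate([w_a, w_b], axis=0)
    top = np.sort(np.argsort(scores)[-keep:])  # sort indices to preserve channel order
    return pooled[top], top

# Toy adjacent layers with 4 channels each; values are made up for illustration.
w_a = np.array([[1.0, 0, 0], [0, 2.0, 0], [0, 0, 3.0], [4.0, 0, 0]])
w_b = np.array([[0.1, 0, 0], [0, 0.2, 0], [0, 0, 0.3], [0, 5.0, 0]])
act_a, act_b = np.ones((2, 4)), 2.0 * np.ones((2, 4))
merged, kept = merge_by_concatenation((act_a, w_a), (act_b, w_b), keep=4)
```

In this toy case the merged layer keeps three channels from the first layer and one from the second, so two layers of four channels collapse into one layer of four.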

[83] Proto-Former: Unified Facial Landmark Detection by Prototype Transformer

Shengkai Hu,Haozhe Qi,Jun Wan,Jiaxing Huang,Lefei Zhang,Hang Sun,Dacheng Tao

Main category: cs.CV

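TL;DR: Proposes Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that enhances dataset-specific structural representations through joint training on multiple datasets, significantly improving cross-dataset generalization and detection accuracy.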

Details

Motivation: Existing facial landmark detection datasets define different numbers of landmarks, and mainstream methods can usually be trained on only a single dataset, which limits generalization and leaves no unified cross-dataset detection model. Method: Proto-Former comprises an Adaptive Prototype-Aware Encoder (APAE) and a Progressive Prototype-Aware Decoder (PPAD) and introduces a Prototype-Aware (PA) loss; together they enable joint training across multiple datasets, strengthen facial structure representations, and resolve unstable prototype expert addressing and gradient conflicts. Result: Experiments on several mainstream benchmark datasets show that Proto-Former outperforms existing state-of-the-art methods, with stronger generalization and more accurate feature extraction. Conclusion: With a unified architecture and a new loss function, Proto-Former achieves efficient cross-dataset facial landmark detection and advances the development of general-purpose face alignment models. Abstract: Recent advances in deep learning have significantly improved facial landmark detection. However, existing facial landmark detection datasets often define different numbers of landmarks, and most mainstream methods can only be trained on a single dataset. This limits the model generalization to different datasets and hinders the development of a unified model. To address this issue, we propose Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that explicitly enhances dataset-specific facial structural representations (i.e., prototype). Proto-Former overcomes the limitations of single-dataset training by enabling joint training across multiple datasets within a unified architecture. Specifically, Proto-Former comprises two key components: an Adaptive Prototype-Aware Encoder (APAE) that performs adaptive feature extraction and learns prototype representations, and a Progressive Prototype-Aware Decoder (PPAD) that refines these prototypes to generate prompts that guide the model's attention to key facial regions. Furthermore, we introduce a novel Prototype-Aware (PA) loss, which achieves optimal path finding by constraining the selection weights of prototype experts. This loss function effectively resolves the problem of prototype expert addressing instability during multi-dataset training, alleviates gradient conflicts, and enables the extraction of more accurate facial structure features. Extensive experiments on widely used benchmark datasets demonstrate that our Proto-Former achieves superior performance compared to existing state-of-the-art methods. The code is publicly available at: https://github.com/Husk021118/Proto-Former.

Joshua Li,Brendan Chharawala,Chang Shu,Xue Bin Peng,Pengcheng Xi

Main category: cs.CV

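TL;DR: This paper presents Scene-Human Aligned REconstruction (SHARE), a technique that uses spatial cues from scene geometry to reconstruct 3D human motion more accurately from monocular RGB video while keeping keyframes and non-keyframes consistent, outperforming existing methods.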

Details

Motivation: Existing human motion reconstruction methods have difficulty placing humans accurately in 3D space, and they make little use of scene geometry, especially in complex scene interactions. Method: SHARE first estimates a human mesh and segmentation mask for every frame and a scene point map at keyframes; it then iteratively refines the human's position at keyframes by comparing the human mesh against the human point map extracted from the scene with the mask, while keeping the relative root joint positions between non-keyframes and keyframes unchanged during optimization. Result: Experiments show that SHARE outperforms existing methods on multiple datasets and on in-the-wild web videos, achieving more accurate 3D human placement and scene reconstruction. Conclusion: By combining scene geometry with cross-frame consistency constraints, SHARE improves the spatial accuracy of human motion reconstruction from monocular video and suits applications that need realistic human-environment interaction, such as gaming, AR/VR, and robotics. Abstract: Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle with accurately placing humans in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry's inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human's positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their relative root joint positions to keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.
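
The idea of keeping non-keyframe root joints at a fixed offset from keyframe root joints can be illustrated with a few lines of Python. This is a simplified, hard-constraint reading: the abstract says the relative positions are preserved during optimization but does not say how, and the nearest-keyframe assignment and the name `propagate_keyframe_updates` are assumptions.

```python
def propagate_keyframe_updates(initial_roots, refined_keyframes):
    """Shift every frame so its offset to the nearest keyframe root is unchanged.

    initial_roots: list of (x, y, z) root joint positions, one per frame.
    refined_keyframes: {frame_index: (x, y, z)} refined keyframe root positions.
    """
    key_ids = sorted(refined_keyframes)
    updated = []
    for t, root in enumerate(initial_roots):
        k = min(key_ids, key=lambda i: abs(i - t))  # nearest keyframe (assumption)
        offset = [r - r0 for r, r0 in zip(root, initial_roots[k])]
        updated.append(tuple(n + o for n, o in zip(refined_keyframes[k], offset)))
    return updated

# Four frames moving along x; frames 0 and 3 are keyframes refined by scene alignment.
initial = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
refined = {0: (0, 0, 5), 3: (3, 1, 5)}
roots = propagate_keyframe_updates(initial, refined)
```

Frame 1 follows keyframe 0 and frame 2 follows keyframe 3, so each keeps its original one-unit offset along x after the keyframes move.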

[85] Cortical-SSM: A Deep State Space Model for EEG and ECoG Motor Imagery Decoding

Shuntaro Suzuki,Shunya Nagashima,Masayuki Hirata,Komei Sugiura

Main category: cs.CV

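TL;DR: Proposes Cortical-SSM, a new architecture based on deep state space models that captures integrated dependencies of electroencephalogram (EEG) and electrocorticogram (ECoG) signals across the temporal, spatial, and frequency domains, outperforming baselines on multiple motor imagery datasets.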

Details

Motivation: EEG and ECoG signals are susceptible to physiological artifacts, and existing Transformer-based methods struggle to capture fine-grained dependencies, which limits classification performance. Method: Cortical-SSM extends deep state space models to capture dependencies of EEG and ECoG signals across the temporal, spatial, and frequency domains, and is validated on several public and clinical datasets. Result: The method outperforms baselines on two large-scale public EEG datasets and on a clinical ECoG dataset from a patient with ALS; visual explanations show that the model captures neurophysiologically relevant regions. Conclusion: Cortical-SSM models multi-domain dependencies more effectively and improves EEG/ECoG classification, with potential applications in communication assistance and rehabilitation support. Abstract: Classification of electroencephalogram (EEG) and electrocorticogram (ECoG) signals obtained during motor imagery (MI) has substantial application potential, including for communication assistance and rehabilitation support for patients with motor impairments. These signals remain inherently susceptible to physiological artifacts (e.g., eye blinking, swallowing), which pose persistent challenges. Although Transformer-based approaches for classifying EEG and ECoG signals have been widely adopted, they often struggle to capture fine-grained dependencies within them. To overcome these limitations, we propose Cortical-SSM, a novel architecture that extends deep state space models to capture integrated dependencies of EEG and ECoG signals across temporal, spatial, and frequency domains. We validated our method across three benchmarks: 1) two large-scale public MI EEG datasets containing more than 50 subjects, and 2) a clinical MI ECoG dataset recorded from a patient with amyotrophic lateral sclerosis. Our method outperformed baseline methods on the three benchmarks. Furthermore, visual explanations derived from our model indicate that it effectively captures neurophysiologically relevant regions of both EEG and ECoG signals.

[86] Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning

Ana Davila,Jacinto Colan,Yasuhisa Hasegawa

Main category: cs.CV

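TL;DR: Proposes a staged adaptive fine-tuning approach that combines linear probing with gradual freezing to improve surgical tool presence detection, reaching 96.4% mAP on Cholec80 and generalizing to the CATARACTS dataset.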

Details

Motivation: Annotated data in surgical settings is limited, which makes it hard to train robust deep learning models, so a more efficient fine-tuning method is needed to improve adaptation to the surgical domain. Method: A staged adaptive fine-tuning approach with a linear probing stage (to condition the classification layers) and a gradual freezing stage (to dynamically reduce the number of fine-tunable layers), completed in a single training loop. Result: Achieves 96.4% mAP on the Cholec80 dataset with ResNet-50 and DenseNet-121, outperforming existing methods, and the method's generalizability is confirmed on the CATARACTS dataset. Conclusion: Gradual freezing fine-tuning is a promising technique for improving tool detection across different surgical procedures and may apply to broader image classification tasks. Abstract: Minimally invasive surgery can benefit significantly from automated surgical tool detection, enabling advanced analysis and assistance. However, the limited availability of annotated data in surgical settings poses a challenge for training robust deep learning models. This paper introduces a novel staged adaptive fine-tuning approach consisting of two steps: a linear probing stage to condition additional classification layers on a pre-trained CNN-based architecture and a gradual freezing stage to dynamically reduce the fine-tunable layers, aiming to regulate adaptation to the surgical domain. This strategy reduces network complexity and improves efficiency, requiring only a single training loop and eliminating the need for multiple iterations. We validated our method on the Cholec80 dataset, employing CNN architectures (ResNet-50 and DenseNet-121) pre-trained on ImageNet for detecting surgical tools in cholecystectomy endoscopic videos. Our results demonstrate that our method improves detection performance compared to existing approaches and established fine-tuning techniques, achieving a mean average precision (mAP) of 96.4%. To assess its broader applicability, the generalizability of the fine-tuning strategy was further confirmed on the CATARACTS dataset, a distinct domain of minimally invasive ophthalmic surgery. These findings suggest that gradual freezing fine-tuning is a promising technique for improving tool presence detection in diverse surgical procedures and may have broader applications in general image classification tasks.
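
As a rough illustration of the two-stage schedule described in the abstract above, the sketch below decides which backbone layer groups remain trainable at each epoch. The function name `trainable_groups`, the parameters `probe_epochs` and `freeze_every`, and the rule of freezing one group at a time from the input side are all assumptions made for illustration; the abstract does not say how the freezing is scheduled.

```python
def trainable_groups(epoch, num_groups, probe_epochs=2, freeze_every=3):
    """Return indices of backbone layer groups that are trainable at `epoch`.

    Hypothetical schedule (not taken from the paper):
      * epochs [0, probe_epochs): linear probing, the backbone is frozen and
        only the added classification head (not listed here) is trained;
      * afterwards: all groups start trainable, then one more group is frozen
        every `freeze_every` epochs, starting from the input side.
    """
    if epoch < probe_epochs:
        return []  # backbone fully frozen during linear probing
    steps = (epoch - probe_epochs) // freeze_every
    frozen = min(steps, num_groups - 1)  # always keep the last group trainable
    return list(range(frozen, num_groups))
```

In a real training loop, the returned indices would be used once per epoch to switch gradient updates on or off for each group, which keeps everything inside a single loop as the abstract describes.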

[87] FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers

Haisheng Su,Junjie Zhang,Feixiang Song,Sanping Zhou,Wei Wu,Nanning Zheng,Junchi Yan

Main category: cs.CV

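TL;DR: Proposes FreqPDE, a method for 3D object detection from multi-view 2D images that introduces a frequency-aware positional depth embedding to enrich 2D features with spatial information and, combined with cross-view and scale-invariant design, improves detection performance.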

Details

Motivation: Existing methods rely on LiDAR points to supervise depth prediction, yet suffer from depth discontinuities and difficulty recognizing small objects, and they overlook cross-view consistency and scale invariance; a more robust 3D detection method that does not depend on dense LiDAR supervision is therefore needed. Method: Proposes the FreqPDE framework with three modules: a Frequency-aware Spatial Pyramid Encoder (FSPE) that fuses high- and low-frequency features; a Cross-view Scale-invariant Depth Predictor (CSDP) that estimates pixel-level depth distributions with attention mechanisms; and a Positional Depth Encoder (PDE) that generates 3D depth-aware features; a hybrid depth supervision strategy is also adopted. Result: Extensive experiments on the nuScenes dataset show that the method outperforms existing methods in 3D object detection, particularly in depth continuity at boundaries and small-object detection. Conclusion: FreqPDE improves the accuracy of 3D object detection from multi-view images and, through frequency separation, cross-view consistency, and hybrid depth supervision, reduces reliance on sparse LiDAR annotations, making it practically valuable. Abstract: Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory, with problems such as depth discontinuity at object boundaries and indistinct small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for the 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanisms. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.

[88] PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction

Ting-Yu Yen,Yu-Sheng Chiu,Shih-Hsuan Hung,Peter Wonka,Hung-Kuo Chu

Main category: cs.CV

TL;DR: Proposes PFGS, a pose-aware 3D Gaussian Splatting framework that fuses multi-pose images to reconstruct objects more completely.

Details

Motivation: Most existing 3DGS methods assume the object is in a single, static pose, which leaves occluded regions incompletely reconstructed and makes it hard to handle real-world multi-pose captures. Method: PFGS uses a pose-aware fusion strategy that combines global and local registration: it leverages background features for per-pose camera pose estimation, employs foundation models for cross-pose registration, and progressively fuses the auxiliary-pose images into a unified 3DGS representation of the main pose. Result: Experiments show that PFGS outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete, higher-fidelity 3D reconstructions. Conclusion: PFGS addresses incomplete 3DGS reconstruction under multiple poses and, by combining foundation models with background features more intelligently, improves reconstruction quality and robustness. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstructing complete objects from multi-pose image captures. Given images of an object in one main pose and several auxiliary poses, PFGS iteratively fuses each auxiliary set into a unified 3DGS representation of the main pose. Our pose-aware fusion strategy combines global and local registration to merge views effectively and refine the 3DGS model. While recent advances in 3D foundation models have improved registration robustness and efficiency, they remain limited by high memory demands and suboptimal accuracy. PFGS overcomes these challenges by incorporating them more intelligently into the registration process: it leverages background features for per-pose camera pose estimation and employs foundation models for cross-pose registration. This design captures the best of both approaches while resolving background inconsistency issues. Experimental results demonstrate that PFGS consistently outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.

[89] LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding

Peng Ren,Hai Yang

Main category: cs.CV

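TL;DR: LILAC proposes a streaming VAE-Diffusion framework for real-time, long-sequence arbitrary motion stylization; a sliding-window causal design in latent space and the injection of decoded motion features yield high-quality, low-latency motion generation.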

Details

Motivation: Existing streaming methods operate in the raw motion space, which incurs high computational overhead and poor temporal stability, while latent-space VAE-Diffusion models deliver high quality but are mostly limited to offline processing and lack real-time capability. Method: Building on a high-performing offline framework, LILAC constructs a latent-space streaming architecture with a sliding-window causal design and injects decoded motion features to ensure smooth transitions, supporting online arbitrary motion stylization. Result: Experiments on benchmark datasets show that LILAC achieves high-quality, low-latency, long-sequence real-time motion stylization without relying on future frames or modifying the diffusion model architecture. Conclusion: LILAC extends an offline motion stylization model to the online setting, maintaining high stylization quality while achieving good responsiveness and temporal continuity, which suits applications that need real-time character control. Abstract: Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: https://pren1.github.io/lilac/

[90] MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment

Bingyu Li,Feiyu Wang,Da Zhang,Zhiyuan Zhao,Junyu Gao,Xuelong Li

Main category: cs.CV

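TL;DR: Introduces MARIS, the first large-scale benchmark for underwater open-vocabulary instance segmentation, and designs a unified framework with a Geometric Prior Enhancement Module (GPEM) and a Semantic Alignment Injection Mechanism (SAIM) to address underwater visual degradation and semantic misalignment.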

Details

Motivation: Existing underwater instance segmentation methods are limited to closed-vocabulary prediction and cannot recognize novel marine categories; meanwhile, existing open-vocabulary methods perform poorly in underwater scenes because of visual degradation and missing underwater class definitions. Method: Introduces the MARIS benchmark and a unified framework with GPEM and SAIM: GPEM uses part-level and structural geometric priors to strengthen object consistency under degraded conditions, while SAIM enriches language embeddings with domain-specific priors to reduce semantic ambiguity. Result: Experiments on the MARIS benchmark show that the method clearly outperforms existing open-vocabulary baselines in both in-domain and cross-domain settings. Conclusion: The proposed framework addresses visual degradation and semantic misalignment in underwater open-vocabulary instance segmentation and provides a solid foundation for future underwater perception research. Abstract: Most existing underwater instance segmentation approaches are constrained by closed-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce MARIS (Marine Open-Vocabulary Instance Segmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by a lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (GPEM) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (SAIM) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both in-domain and cross-domain settings on MARIS, establishing a strong foundation for future underwater perception research.

[91] Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning

Chen Qian,Haoyu Zhang,Junnan Ma,Liuhong Zhu,Qingrui Cai,Yu Wang,Ruibo Song,Lv Li,Lin Mei,Xianwang Jiang,Qin Xu,Boyu Jiang,Ran Tao,Chunmiao Chen,Shufang Chen,Dongyun Liang,Qiu Guo,Jianzhong Lin,Taishan Kang,Mengtian Lu,Liyuan Fu,Ruibin Huang,Huijuan Wan,Xu Huang,Jianhua Wang,Di Guo,Hai Zhong,Jianjun Zhou,Xiaobo Qu

Main category: cs.CV

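TL;DR: Proposes LoSP-Prompt, a reconstruction framework that tackles the motion-induced phase artifacts (from respiration, peristalsis, and similar motion) that limit multi-shot diffusion-weighted imaging (multi-shot DWI) in body-wide tumor diagnostics. The method combines physics-informed modeling with prompt learning driven by synthetic data, delivers high-resolution imaging that works across multiple organs, and performs well across a range of anatomical regions and scanners.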

Details

Motivation: Clinical use of multi-shot DWI is limited by motion-induced phase artifacts and by multi-organ, multi-slice, multi-direction, and multi-b-value complexity; a robust, high-resolution reconstruction method that needs no navigator signals is urgently required. Method: Proposes the LoSP-Prompt framework, which models inter-shot phase variations as a high-order Locally Smooth Phase (LoSP) embedded in a low-rank Hankel matrix reconstruction, and sets the algorithm's rank parameter automatically through prompt learning trained only on synthetic abdominal DWI data that emulates physiological motion. Result: Validation on more than 10,000 clinical images (43 subjects, 4 scanner models, 5 centers) shows: (1) twice the spatial resolution of single-shot DWI, improving liver lesion conspicuity; (2) a single model that generalizes to seven different anatomical regions; (3) better image quality, artifact suppression, and noise reduction than existing methods (scores from 11 radiologists reach 4-5 points, p<0.05). No navigator signals or real-data supervision are needed. Conclusion: LoSP-Prompt is an interpretable, robust, scanner-agnostic solution for high-resolution multi-organ multi-shot DWI with the potential to advance precision oncology. Abstract: Clinical adoption of multi-shot diffusion-weighted magnetic resonance imaging (multi-shot DWI) for body-wide tumor diagnostics is limited by severe motion-induced phase artifacts from respiration, peristalsis, and so on, compounded by multi-organ, multi-slice, multi-direction and multi-b-value complexities. Here, we introduce a reconstruction framework, LoSP-Prompt, that overcomes these challenges through physics-informed modeling and synthetic-data-driven prompt learning. We model inter-shot phase variations as a high-order Locally Smooth Phase (LoSP), integrated into a low-rank Hankel matrix reconstruction. Crucially, the algorithm's rank parameter is automatically set via prompt learning trained exclusively on synthetic abdominal DWI data emulating physiological motion. Validated across 10,000+ clinical images (43 subjects, 4 scanner models, 5 centers), LoSP-Prompt: (1) Achieved twice the spatial resolution of clinical single-shot DWI, enhancing liver lesion conspicuity; (2) Generalized to seven diverse anatomical regions (liver, kidney, sacroiliac, pelvis, knee, spinal cord, brain) with a single model; (3) Outperformed state-of-the-art methods in image quality, artifact suppression, and noise reduction (11 radiologists' evaluations on a 5-point scale, p<0.05), achieving 4-5 points (excellent) on kidney DWI, 4 points (good to excellent) on liver, sacroiliac and spinal cord DWI, and 3-4 points (good) on knee and brain tumor DWI. The approach eliminates navigator signals and realistic data supervision, providing an interpretable, robust solution for high-resolution multi-organ multi-shot DWI. Its scanner-agnostic performance signifies transformative potential for precision oncology.

[92] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

Shuang Liang,Zhihao Xu,Jialing Tao,Hui Xue,Xiting Wang

Main category: cs.CV

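TL;DR: Proposes LoD, a general framework for accurately detecting unknown jailbreak attacks on large vision-language models; a Multi-modal Safety Concept Activation Vector module and a Safety Pattern Auto-Encoder module improve both detection performance and efficiency.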

Details

Motivation: Existing detection methods are limited in generalization and accuracy and cannot cope effectively with unknown jailbreak attacks. Method: Proposes the Learning to Detect (LoD) framework, which replaces attack-specific learning with task-specific learning and comprises a Multi-modal Safety Concept Activation Vector module and a Safety Pattern Auto-Encoder module to detect unknown attacks efficiently. Result: Experiments show that the method achieves consistently higher detection AUROC on diverse unknown attacks while improving detection efficiency. Conclusion: Through its task-specific learning strategy, LoD detects unknown jailbreak attacks on LVLMs efficiently and accurately without relying on attack-specific parameters. Abstract: Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.

[93] Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety

Huan Chen,Ting Han,Siyu Chen,Zhihao Guo,Yiping Chen,Meiliu Wu

Main category: cs.CV

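TL;DR: Proposes Semantic4Safety, a framework that applies semantic segmentation to street-view imagery to extract streetscape features and combines them with causal inference to analyze their effects on different types of traffic accidents.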

Details

Motivation: Street-view imagery in traffic risk analysis lacks interpretable street-level indicators and a way to quantify their causal effects. Method: Uses zero-shot semantic segmentation to extract 11 interpretable streetscape indicators, adds road type as context, trains an XGBoost multi-class classifier with SHAP to interpret feature importance, and analyzes causal effects through Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation. Result: Finds heterogeneous causal patterns across accident types: scene complexity, exposure, and roadway geometry have strong predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness increases it. Conclusion: By combining predictive modeling with causal inference, Semantic4Safety offers a scalable, data-driven tool for urban road safety planning that supports high-risk corridor identification and targeted interventions. Abstract: Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.
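
To make the idea of propensity weighting followed by ATE estimation more concrete, here is a minimal sketch of an inverse-propensity-weighted ATE. It is a deliberately simplified stand-in: it assumes a binary treatment with known propensities, whereas the abstract describes Generalized Propensity Score weighting, which targets more general treatments. The function name `ipw_ate` and its inputs are hypothetical and not taken from the paper.

```python
def ipw_ate(treated, outcome, propensity):
    """Inverse-propensity-weighted ATE for a binary treatment (normalized form).

    treated    : list of 0/1 flags
    outcome    : list of observed outcomes
    propensity : list of P(treated = 1 | covariates), each strictly in (0, 1)

    Simplified illustration only; not the GPS procedure used in the paper.
    """
    # Weight each unit by the inverse probability of the arm it actually received.
    w1 = [t / p for t, p in zip(treated, propensity)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, propensity)]
    # Weighted mean outcome under treatment and under control.
    mean1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mean1 - mean0
```

With equal propensities of 0.5 the estimate reduces to a plain difference of group means; unequal propensities reweight units so that the treated and control groups resemble each other on the covariates used to estimate the propensities.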

[94] Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation

Feifei Zhang,Zhenhong Jia,Sensen Song,Fei Shi,Dayong Ren

Main category: cs.CV

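TL;DR: Proposes a new Predictive-Corrective (PC) paradigm and the PCMambaNet network, which accelerate learning by decoupling the modeling task and reach state-of-the-art accuracy in brain MRI segmentation while converging within only 1-5 epochs.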

Details

Motivation: End-to-end deep learning converges slowly and depends on large-scale data, which is a problem in data-scarce domains such as medical imaging. Method: Designs PCMambaNet, composed of a Predictive Prior Module (PPM) and a Corrective Residual Network (CRN); the PPM uses anatomical symmetry to generate a focus map, and the CRN concentrates on learning the residual to delineate lesion boundaries precisely. Result: On high-resolution brain MRI segmentation, PCMambaNet reaches state-of-the-art accuracy and converges within 1-5 epochs, clearly ahead of conventional end-to-end models. Conclusion: By explicitly incorporating domain knowledge to simplify the learning objective, the PC paradigm mitigates data inefficiency and overfitting and offers a new route to efficient training in data-scarce settings. Abstract: Despite the remarkable success of the end-to-end paradigm in deep learning, it often suffers from slow convergence and heavy reliance on large-scale datasets, which fundamentally limits its efficiency and applicability in data-scarce domains such as medical imaging. In this work, we introduce the Predictive-Corrective (PC) paradigm, a framework that decouples the modeling task to fundamentally accelerate learning. Building upon this paradigm, we propose a novel network, termed PCMambaNet. PCMambaNet is composed of two synergistic modules. First, the Predictive Prior Module (PPM) generates a coarse approximation at low computational cost, thereby anchoring the search space. Specifically, the PPM leverages anatomical knowledge (bilateral symmetry) to predict a 'focus map' of diagnostically relevant asymmetric regions. Next, the Corrective Residual Network (CRN) learns to model the residual error, focusing the network's full capacity on refining these challenging regions and delineating precise pathological boundaries. Extensive experiments on high-resolution brain MRI segmentation demonstrate that PCMambaNet achieves state-of-the-art accuracy while converging within only 1-5 epochs, a performance unattainable by conventional end-to-end models. This dramatic acceleration highlights that by explicitly incorporating domain knowledge to simplify the learning objective, PCMambaNet effectively mitigates data inefficiency and overfitting.

[95] Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning

Xuchen Li,Xuzhao Li,Shiyu Hu,Kaiqi Huang

Main category: cs.CV

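TL;DR: Proposes the evidence-aware reinforcement learning (EARL) framework, built on a "Select Less, Reason More" philosophy, which dynamically selects key frames and re-samples locally around them to improve the purity of visual evidence and access to temporal detail in long-video reasoning, reaching state-of-the-art results among open-source Video LLMs on multiple benchmarks.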

Details

Motivation: Existing Video LLMs suffer information dilution in long-video reasoning because of static uniform sampling, and existing video reasoning agents lack rigorous reward mechanisms to enforce evidence purity and cannot supplement temporal information beyond pre-sampled frames. Method: Proposes the EARL framework, which turns the model into an active interrogator of evidence: evidence-aware reinforcement learning dynamically selects the most relevant frames and performs localized re-sampling around key frames to obtain fine-grained temporal information. Result: Experiments on five video reasoning benchmarks show that the method achieves state-of-the-art results among open-source Video LLMs; the 7B model scores 59.8% on LongVideoBench, 69.0% on MVBench, and 64.9% on VideoMME. Conclusion: Prioritizing evidence purity is essential for long-video reasoning, and the proposed EARL framework improves both the quality of visual evidence selection and reasoning performance. Abstract: Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: "Select Less, Reason More." Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
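
The localized re-sampling step mentioned in the abstract above can be pictured with a small index-selection sketch. Everything here is hypothetical: the function name `resample_around`, the `radius` and `step` parameters, and the fixed symmetric window are assumptions for illustration, and the abstract does not describe how EARL actually chooses the frames around a key frame.

```python
def resample_around(key_frames, total_frames, radius=8, step=2):
    """Return sorted, de-duplicated frame indices in a window around each key frame.

    key_frames   : indices of frames already selected as evidence
    total_frames : number of frames in the video
    radius, step : hypothetical window half-width and stride (not from the paper)
    """
    picked = set()
    for k in key_frames:
        lo = max(0, k - radius)                 # clamp the window to the video start
        hi = min(total_frames - 1, k + radius)  # and to the video end
        picked.update(range(lo, hi + 1, step))
        picked.add(k)  # the key frame itself is always kept
    return sorted(picked)
```

The point of the sketch is only that frames outside the original uniform sample become reachable once a key frame has been identified, which is the capability the abstract says earlier agents lack.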

[96] MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention

Nengbo Zhang,Hann Woei Ho

Main category: cs.CV

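TL;DR: Proposes MAVR-Net, a multi-view learning framework for MAV action recognition that fuses RGB frames, optical flow, and segmentation masks and combines a multi-scale feature pyramid with a cross-view attention module, markedly improving recognition accuracy.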

Details

Motivation: Existing RGB-based visual recognition models struggle to capture the complex spatiotemporal characteristics of MAV motion, which limits action recognition. Method: Uses ResNet encoders to extract features from each view, adds a multi-scale feature pyramid to preserve spatiotemporal detail, and designs a cross-view attention module and a multi-view alignment loss to strengthen cross-view interaction and semantic consistency. Result: Reaches 97.8%, 96.5%, and 92.8% accuracy on the Short MAV, Medium MAV, and Long MAV datasets respectively, clearly ahead of existing methods. Conclusion: Through multi-view data fusion and attention mechanisms, MAVR-Net improves the robustness and accuracy of MAV action recognition and suits cooperative perception and control in autonomous aerial swarms. Abstract: Recognizing the motion of Micro Aerial Vehicles (MAVs) is crucial for enabling cooperative perception and control in autonomous aerial swarms. Yet, vision-based recognition models relying only on RGB data often fail to capture the complex spatiotemporal characteristics of MAV motion, which limits their ability to distinguish different actions. To overcome this problem, this paper presents MAVR-Net, a multi-view learning-based MAV action recognition framework. Unlike traditional single-view methods, the proposed approach combines three complementary types of data, including raw RGB frames, optical flow, and segmentation masks, to improve the robustness and accuracy of MAV motion recognition. Specifically, ResNet-based encoders are used to extract discriminative features from each view, and a multi-scale feature pyramid is adopted to preserve the spatiotemporal details of MAV motion patterns. To enhance the interaction between different views, a cross-view attention module is introduced to model the dependencies among various modalities and feature scales. In addition, a multi-view alignment loss is designed to ensure semantic consistency and strengthen cross-view feature representations. Experimental results on benchmark MAV action datasets show that our method clearly outperforms existing approaches, achieving 97.8%, 96.5%, and 92.8% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.

[97] DPTrack: Directional Kernel-Guided Prompt Learning for Robust Nighttime Aerial Tracking

Zhiqiang Zhu,Xinbo Gao,Wen Lu,Jie Li,Zhaoyang Wang,Mingqian Ge

Main category: cs.CV

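TL;DR: Proposes DPTrack, a prompt learning method for nighttime aerial tracking that enriches prompt generation with fine-grained attribute features to improve tracking accuracy.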

Details

Motivation: Existing prompt-learning-based nighttime aerial trackers rely only on spatial localization supervision and lack fine-grained cues that point to target features, which produces vague prompts and poor tracking performance. Method: Inspired by visual bionics, DPTrack first captures the target's topological structure hierarchically and uses topological attributes to enrich the feature representation; an encoder then condenses these topology-aware features into a directional kernel that explicitly encapsulates fine-grained attribute cues; finally, a kernel-guided prompt module built on channel-category correspondence propagates the directional kernel across the search region to pinpoint target features and generate precise prompts, with spatial gating added for robustness at night. Result: Extensive evaluation on several established benchmarks shows that DPTrack delivers superior performance in nighttime aerial tracking, clearly ahead of existing methods. Conclusion: By introducing topological structure and a directional kernel mechanism, DPTrack improves prompt precision and tracking accuracy and offers a new approach to aerial tracking in challenging nighttime environments. Abstract: Existing nighttime aerial trackers based on prompt learning rely solely on spatial localization supervision, which fails to provide fine-grained cues that point to target features and inevitably produces vague prompts. This limitation impairs the tracker's ability to accurately focus on the object features and leaves trackers performing poorly. To address this issue, we propose DPTrack, a prompt-based aerial tracker designed for nighttime scenarios by encoding the given object's attribute features into the directional kernel enriched with fine-grained cues to generate precise prompts. Specifically, drawing inspiration from visual bionics, DPTrack first hierarchically captures the object's topological structure, leveraging topological attributes to enrich the feature representation. Subsequently, an encoder condenses these topology-aware features into the directional kernel, which serves as the core guidance signal that explicitly encapsulates the object's fine-grained attribute cues. Finally, a kernel-guided prompt module built on channel-category correspondence attributes propagates the kernel across the features of the search region to pinpoint the positions of target features and convert them into precise prompts, integrating spatial gating for robust nighttime tracking. Extensive evaluations on established benchmarks demonstrate DPTrack's superior performance. Our code will be available at https://github.com/zzq-vipsl/DPTrack.

[98] Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation

Vu Tram Anh Khuong,Luu Tu Nguyen,Thanh Ha Le,Thi Duyen Ngo

Main category: cs.CV

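TL;DR: Proposes a phase-aware temporal augmentation method based on dynamic images: each micro-expression is decomposed into two motion phases (onset-to-apex and apex-to-offset) and a dual-phase dynamic image is generated, improving the accuracy and robustness of micro-expression recognition.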

Details

Motivation: Annotated micro-expression data is scarce, which limits the generalization of existing deep learning methods and the diversity of motion patterns they see, and temporal augmentation strategies remain underexplored. Method: Decomposes each micro-expression sequence into onset-to-apex and apex-to-offset phases, generates a dynamic image (DI) for each, and combines the resulting dual-phase DI augmentation with spatial augmentations for micro-expression recognition. Result: On the CASME-II and SAMM datasets, experiments with six deep models show gains in recognition accuracy, F1-score, and average recall, with relative improvements of up to 10%, and particularly strong results in low-resource settings. Conclusion: The proposed dual-phase temporal augmentation is simple, general, and effective; it improves micro-expression recognition and offers a new way to address data scarcity and class imbalance. Abstract: Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, typically lasting less than half a second. Recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. Although deep learning has enabled significant advances in micro-expression recognition (MER), its effectiveness is limited by the scarcity of annotated ME datasets. This data limitation not only hinders generalization but also restricts the diversity of motion patterns captured during training. Existing MER studies predominantly rely on simple spatial augmentations (e.g., flipping, rotation) and overlook temporal augmentation strategies that can better exploit motion characteristics. To address this gap, this paper proposes a phase-aware temporal augmentation method based on dynamic images. Rather than encoding the entire expression as a single onset-to-offset dynamic image (DI), our approach decomposes each expression sequence into two motion phases: onset-to-apex and apex-to-offset. A separate DI is generated for each phase, forming a Dual-phase DI augmentation strategy. These phase-specific representations enrich motion diversity and introduce complementary temporal cues that are crucial for recognizing subtle facial transitions. Extensive experiments on CASME-II and SAMM datasets using six deep architectures, including CNNs, Vision Transformer, and the lightweight LEARNet, demonstrate consistent performance improvements in recognition accuracy, unweighted F1-score, and unweighted average recall, which are crucial for addressing class imbalance in MER. When combined with spatial augmentations, our method achieves up to a 10% relative improvement. The proposed augmentation is simple, model-agnostic, and effective in low-resource settings, offering a promising direction for robust and generalizable MER.
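
The dual-phase decomposition described above can be sketched in a few lines. The abstract does not say how each dynamic image is computed, so this sketch assumes a simplified form of approximate rank pooling in which frame t of T receives weight 2t - T - 1; that weighting, the helper names `dynamic_image` and `dual_phase_dynamic_images`, and the choice to share the apex frame between both phases are assumptions for illustration rather than the authors' implementation.

```python
def dynamic_image(frames):
    """Collapse a list of frames (each a flat list of pixel values) into one
    frame using the assumed simplified rank-pooling weights 2t - T - 1."""
    T = len(frames)
    coeffs = [2 * t - T - 1 for t in range(1, T + 1)]  # later frames weigh more
    n = len(frames[0])
    return [sum(c * f[i] for c, f in zip(coeffs, frames)) for i in range(n)]


def dual_phase_dynamic_images(frames, apex_index):
    """Split a sequence at the apex frame and build one dynamic image per phase.

    frames[0] is taken as the onset and frames[-1] as the offset; the apex
    frame is included in both phases (an assumption of this sketch).
    """
    onset_to_apex = frames[: apex_index + 1]
    apex_to_offset = frames[apex_index:]
    return dynamic_image(onset_to_apex), dynamic_image(apex_to_offset)
```

For a pixel that rises toward the apex and falls afterwards, the two phase images come out with opposite signs, whereas a single onset-to-offset image would largely cancel the two motions; this is one way to read the abstract's claim that the phases carry complementary temporal cues.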

[99] MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes

Lingfeng Xuan,Chang Nie,Yiqing Xu,Zhe Liu,Yanzi Miao,Hesheng Wang

Main category: cs.CV

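TL;DR: Proposes MRASfM, a multi-camera Structure-from-Motion framework for driving scenes that improves pose estimation reliability by exploiting the fixed spatial relationships within the multi-camera system, refines road surface reconstruction with a plane model, raises optimization efficiency by treating the camera set as a single unit, and aggregates multiple scenes in a coarse-to-fine fashion.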

Details

Motivation: When applied to driving scenes captured by multi-camera systems, conventional SfM suffers from unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. Method: Exploits the fixed spatial relationships within the multi-camera system to strengthen pose estimation; introduces a plane model to remove erroneous points from the triangulated road surface; treats the multi-camera set as a single unit in bundle adjustment to reduce optimization variables; and aggregates multiple scenes through scene association and assembly modules. Result: Large-scale validation on public datasets shows that MRASfM reaches state-of-the-art performance, with an absolute pose error of 0.124 on the nuScenes dataset; deployment on real vehicles demonstrates its generalizability across scenes and robustness under challenging conditions. Conclusion: MRASfM addresses the key difficulties of applying SfM to multi-camera driving scenes, improving pose estimation accuracy, road surface reconstruction quality, and computational efficiency, and is of practical value. Abstract: Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset.
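
The plane-based cleanup of road surface points mentioned above can be illustrated with a simple distance filter. This is only a sketch under stated assumptions: the plane is taken as already known in the form ax + by + cz + d = 0, the threshold `max_dist` is a hypothetical parameter, and the abstract does not describe how MRASfM fits the plane or picks its rejection criterion.

```python
import math


def filter_by_plane(points, plane, max_dist):
    """Keep 3D points whose distance to the plane ax + by + cz + d = 0
    is at most max_dist (hypothetical threshold, in the points' units)."""
    a, b, c, d = plane
    norm = math.sqrt(a * a + b * b + c * c)  # normalizes non-unit plane normals
    kept = []
    for x, y, z in points:
        dist = abs(a * x + b * y + c * z + d) / norm
        if dist <= max_dist:
            kept.append((x, y, z))
    return kept
```

In practice the plane would typically be estimated robustly from the triangulated road points before filtering; that estimation step is omitted here because the abstract gives no detail on it.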

Jinghao Huang,Yaxiong Chen,Ganchao Liu

Main category: cs.CV

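TL;DR: Introduces and studies the drone video-text retrieval (DVTR) task for the first time and proposes Multi-Semantic Adaptive Mining (MSAM), which uses fine-grained interaction between text and video frames together with a cross-modal feature fusion mechanism and clearly outperforms existing methods on self-constructed datasets.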

Details

Motivation: Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations; existing cross-modal methods designed for ground-level views cannot model these characteristics effectively, so retrieval mechanisms tailored to drone scenarios are needed. Method: Proposes Multi-Semantic Adaptive Mining (MSAM), which includes a multi-semantic adaptive learning mechanism, an adaptive semantic construction module, a distribution-driven semantic learning term, and a diversity semantic term, and introduces a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions. Result: Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods on the drone video-text retrieval task. Conclusion: MSAM deepens the understanding of and reasoning over drone video content, strengthens the robustness of cross-modal feature representations, and provides a new solution for semantic retrieval of drone videos. Abstract: With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.

[101] A Novel Combined Optical Flow Approach for Comprehensive Micro-Expression Recognition

Vu Tram Anh Khuong,Thi Bich Phuong Man,Luu Tu Nguyen,Thanh Ha Le,Thi Duyen Ngo

Main category: cs.CV

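TL;DR: Proposes a Combined Optical Flow (COF) method that integrates the onset-to-apex and apex-to-offset phases to improve micro-expression recognition.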

Details

Motivation: Most existing micro-expression recognition methods focus on the onset-to-apex phase and neglect the important temporal dynamics contained in the apex-to-offset phase. Method: Introduces Combined Optical Flow (COF), which fuses optical flow from the onset-to-apex and apex-to-offset phases to strengthen the feature representation. Result: Experiments on the CASMEII and SAMM datasets show that COF outperforms methods that use a single optical flow, improving micro-expression recognition. Conclusion: COF captures the motion dynamics of micro-expressions more completely and improves recognition performance. Abstract: Facial micro-expressions are brief, involuntary facial movements that reveal hidden emotions. Most Micro-Expression Recognition (MER) methods that rely on optical flow typically focus on the onset-to-apex phase, neglecting the apex-to-offset phase, which holds key temporal dynamics. This study introduces a Combined Optical Flow (COF), integrating both phases to enhance feature representation. COF provides a more comprehensive motion analysis, improving MER performance. Experimental results on CASMEII and SAMM datasets show that COF outperforms single optical flow-based methods, demonstrating its effectiveness in capturing micro-expression dynamics.

[102] Iterative Motion Compensation for Canonical 3D Reconstruction from UAV Plant Images Captured in Windy Conditions

Andre Rochow,Jonas Marcic,Svetlana Seliunina,Sven Behnke

Main category: cs.CV

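TL;DR: Proposes an automated UAV-based 3D plant reconstruction pipeline that iteratively refines image alignment to reduce errors from leaf motion, improves the accuracy of existing 3D reconstruction methods, and comes with code and a multi-crop dataset.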

Details

Motivation: 3D plant phenotyping is essential for studying plant growth, yield prediction, and disease control, but leaf motion caused by environmental wind and UAV downwash makes high-quality 3D reconstruction challenging. Method: A commercially available UAV automatically captures images of a plant (only ArUco markers need to be placed), with flight controlled by a self-developed Android application; the pipeline supports integrating arbitrary state-of-the-art 3D reconstruction methods and uses an optical-flow-based iterative refinement strategy that aligns renderings of intermediate 3D results with the original images, gradually deforming the input images to compensate for leaf motion. Result: The method reduces reconstruction errors caused by leaf motion, improves the quality of existing 3D reconstruction techniques after a few iterations, and enables the extraction of high-resolution 3D meshes; the source code is to be released, and a dataset covering multiple crops and points in time is provided. Conclusion: The proposed iterative image-alignment reconstruction pipeline handles plant motion in UAV imagery effectively and improves the accuracy and automation of 3D reconstruction for agricultural plants, with good prospects for open-source use. Abstract: 3D phenotyping of plants plays a crucial role in understanding plant growth, yield prediction, and disease control. We present a pipeline capable of generating high-quality 3D reconstructions of individual agricultural plants. To acquire data, a small commercially available UAV captures images of a selected plant. Apart from placing ArUco markers, the entire image acquisition process is fully autonomous, controlled by a self-developed Android application running on the drone's controller. The reconstruction task is particularly challenging due to environmental wind and downwash of the UAV. Our proposed pipeline supports the integration of arbitrary state-of-the-art 3D reconstruction methods. To mitigate errors caused by leaf motion during image capture, we use an iterative method that gradually adjusts the input images through deformation. Motion is estimated using optical flow between the original input images and intermediate 3D reconstructions rendered from the corresponding viewpoints. This alignment gradually reduces scene motion, resulting in a canonical representation. After a few iterations, our pipeline improves the reconstruction of state-of-the-art methods and enables the extraction of high-resolution 3D meshes. We will publicly release the source code of our reconstruction pipeline. Additionally, we provide a dataset consisting of multiple plants from various crops, captured across different points in time.

[103] Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement

Xianmin Chen,Peiliang Huang,Longfei Han,Dingwen Zhang,Junwei Han

Main category: cs.CV

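TL;DR: Proposes HiMA, an efficient hierarchical mixing architecture for low-light RAW image enhancement that combines Transformer and Mamba modules and introduces Local Distribution Adjustment (LoDA) and Multi-prior Fusion (MPF) modules, outperforming existing methods on multiple datasets.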

Details

Motivation: Existing low-light image enhancement methods struggle to balance enhancement quality with efficiency and suffer from ambiguities in two-stage frameworks. Method: Proposes the HiMA architecture, which uses Transformer modules for large-scale features and Mamba modules for small-scale features; designs the LoDA module to adaptively align feature distributions across local regions; and uses the MPF module to fuse spatial- and frequency-domain priors. Result: Experiments on multiple public datasets show that the method outperforms state-of-the-art approaches with fewer parameters, delivering stronger enhancement and higher efficiency. Conclusion: By jointly exploiting the strengths of Transformer and Mamba and combining them with the LoDA and MPF modules, HiMA addresses the efficiency-quality trade-off in low-light ISP and substantially improves low-light image enhancement. Abstract: Low-light RAW image enhancement remains a challenging task. Although numerous deep learning based approaches have been proposed, they still suffer from inherent limitations. A key challenge is how to simultaneously achieve strong enhancement quality and high efficiency. In this paper, we rethink the architecture for efficient low-light image signal processing (ISP) and introduce a Hierarchical Mixing Architecture (HiMA). HiMA leverages the complementary strengths of Transformer and Mamba modules to handle features at large and small scales, respectively, thereby improving efficiency while avoiding the ambiguities observed in prior two-stage frameworks. To further address uneven illumination with strong local variations, we propose Local Distribution Adjustment (LoDA), which adaptively aligns feature distributions across different local regions. In addition, to fully exploit the denoised outputs from the first stage, we design a Multi-prior Fusion (MPF) module that integrates spatial and frequency-domain priors for detail enhancement. Extensive experiments on multiple public datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior performance with fewer parameters. Code will be released at https://github.com/Cynicarlos/HiMA.

[104] Exploring Conditions for Diffusion models in Robotic Control

Heeseong Shin,Byeongho Heo,Dongyoon Han,Seungryong Kim,Taekyung Kim

Main category: cs.CV

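TL;DR: Proposes ORCA, which uses a pre-trained text-to-image diffusion model to produce task-adaptive visual representations for robotic control, bridging the domain gap with learnable task prompts and visual prompts and outperforming existing methods.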

Details

Motivation: Pre-trained visual representations used in imitation learning are usually task-agnostic and frozen, which makes it hard to adapt them to specific control tasks; directly applying textual conditions works poorly in robotic control, mainly because of the domain gap between the diffusion model's training data and real control environments. Method: Proposes ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-level information, producing task-adaptive visual representations without fine-tuning the diffusion model itself. Result: Achieves state-of-the-art performance on multiple robotic control benchmarks, significantly surpassing prior methods. Conclusion: Prompting mechanisms designed for control tasks make it possible to use a frozen diffusion model to produce high-quality task-adaptive visual representations, offering a new direction for robot imitation learning. Abstract: While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions - a successful strategy in other vision domains - yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.

[105] Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models

Ignacio Serna

Main category: cs.CV

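TL;DR: Proposes Latent Feature Alignment (LFA), an attribute-label-free method for identifying systematic biases in face recognition models that finds semantically coherent subpopulations and interpretable latent directions more consistently than conventional clustering.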

Details

Motivation: Existing bias evaluation methods rely on expensive and limited labeled attributes to form subpopulations, which makes it difficult to fully reveal systematic biases in face recognition models. Method: Proposes Latent Feature Alignment (LFA), which uses latent directions to form subpopulations without labeled attributes, improving grouping through semantic coherence and the discovery of interpretable directions. Result: Validated on four state-of-the-art models and two datasets, LFA outperforms k-means and nearest-neighbor search in intra-group semantic coherence and discovers interpretable latent directions related to attributes such as age and ethnicity. Conclusion: LFA is a practical method for auditing face recognition representations that can identify and interpret model bias without predefined annotations. Abstract: Modern face recognition models achieve high overall accuracy but continue to exhibit systematic biases that disproportionately affect certain subpopulations. Conventional bias evaluation frameworks rely on labeled attributes to form subpopulations, which are expensive to obtain and limited to predefined categories. We introduce Latent Feature Alignment (LFA), an attribute-label-free algorithm that uses latent directions to identify subpopulations. This yields two main benefits over standard clustering: (i) semantically coherent grouping, where faces sharing common attributes are grouped together more reliably than by proximity-based methods, and (ii) discovery of interpretable directions, which correspond to semantic attributes such as age, ethnicity, or attire. Across four state-of-the-art recognition models (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LFA consistently outperforms k-means and nearest-neighbor search in intra-group semantic coherence, while uncovering interpretable latent directions aligned with demographic and contextual attributes. These results position LFA as a practical method for representation auditing of face recognition models, enabling practitioners to identify and interpret biased subpopulations without predefined attribute annotations.
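As a rough illustration of grouping by a latent direction rather than by proximity, the sketch below ranks embeddings by cosine similarity to a given direction and keeps the top k. This is not the LFA algorithm, whose details are not given in the abstract; the function names, the cosine scoring, and the top-k selection are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def top_aligned(
    embeddings: List[Sequence[float]], direction: Sequence[float], k: int
) -> List[int]:
    """Return indices of the k embeddings best aligned with `direction`."""
    order = sorted(
        range(len(embeddings)),
        key=lambda i: cosine(embeddings[i], direction),
        reverse=True,
    )
    return order[:k]
```

A proximity-based method would instead group points around a centroid; the contrast here is only that membership is decided by alignment with one direction.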

[106] Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training

Aditya Vir

Main category: cs.CV

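TL;DR: Proposes a custom convolutional neural network architecture that reaches 97.23% classification accuracy on the EuroSAT dataset through a balanced multi-task attention mechanism, without pre-trained models.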

Details

Motivation: Addresses the under-use of spatial and spectral features in satellite image classification and the dependence on pre-trained models. Method: Designs three progressive architectures (a baseline model, a CBAM-enhanced model, and a balanced multi-task attention model), combines Coordinate Attention with Squeeze-Excitation blocks, and introduces a learnable fusion parameter and DropBlock regularization. Result: The final model reaches 97.23% accuracy on EuroSAT (Cohen's Kappa 0.9692), with every class above 94.46% accuracy and well-calibrated prediction confidence; performance is close to a fine-tuned ResNet-50 (a 1.34% gap). Conclusion: Systematic architectural design can effectively improve domain-specific image classification; the proposed attention mechanism automatically balances the importance of the spatial and spectral modalities (α ≈ 0.57), showing that high accuracy is achievable without pre-training. Abstract: This work presents a systematic investigation of custom convolutional neural network architectures for satellite land use classification, achieving 97.23% test accuracy on the EuroSAT dataset without reliance on pre-trained models. Through three progressive architectural iterations (baseline: 94.30%, CBAM-enhanced: 95.98%, and balanced multi-task attention: 97.23%) we identify and address specific failure modes in satellite imagery classification. Our principal contribution is a novel balanced multi-task attention mechanism that combines Coordinate Attention for spatial feature extraction with Squeeze-Excitation blocks for spectral feature extraction, unified through a learnable fusion parameter. Experimental results demonstrate that this learnable parameter autonomously converges to α ≈ 0.57, indicating near-equal importance of spatial and spectral modalities for satellite imagery. We employ progressive DropBlock regularization (5-20% by network depth) and class-balanced loss weighting to address overfitting and confusion pattern imbalance. The final 12-layer architecture achieves Cohen's Kappa of 0.9692 with all classes exceeding 94.46% accuracy, demonstrating confidence calibration with a 24.25% gap between correct and incorrect predictions. Our approach achieves performance within 1.34% of fine-tuned ResNet-50 (98.57%) while requiring no external data, validating the efficacy of systematic architectural design for domain-specific applications. Complete code, trained models, and evaluation scripts are publicly available.
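One way to picture a learnable fusion parameter is as a convex combination of the two attention branches. The abstract does not give the exact formula, so the sketch below makes several assumptions: the parameter is stored as an unconstrained logit, a sigmoid keeps α in (0, 1), and α weights the spatial branch. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(
    spatial: Sequence[float], spectral: Sequence[float], alpha_logit: float
) -> List[float]:
    """Blend two feature vectors as alpha * spatial + (1 - alpha) * spectral.

    Which branch alpha weights is an assumption, not stated in the abstract.
    """
    if len(spatial) != len(spectral):
        raise ValueError("feature vectors must have the same length")
    alpha = sigmoid(alpha_logit)
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(spatial, spectral)]


# A logit of log(0.57 / 0.43) corresponds to the reported alpha of about 0.57.
reported_alpha_logit = math.log(0.57 / 0.43)
```

With α near 0.5 neither branch dominates, which is how the abstract reads the converged value.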

[107] Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics

Yitong Li,Ralph Buchert,Benita Schmitz-Koep,Timo Grimmer,Björn Ommer,Dennis M. Hedderich,Igor Yakushev,Christian Wachinger

Main category: cs.CV

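TL;DR: Proposes SiM2P, a 3D diffusion bridge-based framework that generates diagnostic-quality simulated FDG-PET images from MRI and auxiliary patient information, significantly improving diagnostic accuracy for dementia subtyping.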

Details

Motivation: FDG-PET is effective for dementia diagnosis but poorly accessible and expensive, whereas MRI is more readily available, so a method that generates high-quality simulated PET images from MRI is needed to improve diagnostic accessibility. Method: Uses a 3D diffusion bridge model that learns a mapping from MRI and basic patient information (such as age and sex) to FDG-PET images, and designs a lightweight workflow for local deployment that needs only about 20 local cases. Result: In a blinded clinical reader study, PET images generated by SiM2P raised diagnostic accuracy across three groups (Alzheimer's disease, behavioral-variant frontotemporal dementia, and healthy controls) from 75.0% to 84.7% (p<0.05), with higher diagnostic certainty and interrater agreement than the original MRI. Conclusion: SiM2P can simulate diagnostic-quality FDG-PET images and improve diagnostic performance for dementing disorders, and its low-data local deployment helps extend the benefits of PET diagnostics to resource-limited settings. Abstract: Positron emission tomography (PET) with 18F-Fluorodeoxyglucose (FDG) is an established tool in the diagnostic workup of patients with suspected dementing disorders. However, compared to the routinely available magnetic resonance imaging (MRI), FDG-PET remains significantly less accessible and substantially more expensive. Here, we present SiM2P, a 3D diffusion bridge-based framework that learns a probabilistic mapping from MRI and auxiliary patient information to simulate FDG-PET images of diagnostic quality. In a blinded clinical reader study, two neuroradiologists and two nuclear medicine physicians rated the original MRI and SiM2P-simulated PET images of patients with Alzheimer's disease, behavioral-variant frontotemporal dementia, and cognitively healthy controls. SiM2P significantly improved the overall diagnostic accuracy of differentiating between three groups from 75.0% to 84.7% (p<0.05). Notably, the simulated PET images received higher diagnostic certainty ratings and achieved superior interrater agreement compared to the MRI images. Finally, we developed a practical workflow for local deployment of the SiM2P framework. It requires as few as 20 site-specific cases and only basic demographic information. This approach makes the established diagnostic benefits of FDG-PET imaging more accessible to patients with suspected dementing disorders, potentially improving early detection and differential diagnosis in resource-limited settings. Our code is available at https://github.com/Yiiitong/SiM2P.

[108] ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents

Tingyu Lin,Marco Peer,Florian Kleber,Robert Sablatnig

Main category: cs.CV

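TL;DR: ClapperText is a benchmark dataset for handwritten and printed text recognition, derived from 127 World War II-era archival video segments containing clapperboards; it includes 9,813 annotated frames and 94,573 word-level text instances and targets OCR research in low-resource, visually degraded settings.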

Details

Motivation: To advance recognition of handwritten and printed text under visual degradation and low-resource conditions, and in particular to meet the need to handle non-standard structured content in historical archive analysis, the authors build a realistic, culturally grounded benchmark dataset. Method: Video segments containing clapperboards are extracted from 127 World War II-era archival videos and annotated frame by frame, with word-level annotations given as rotated bounding boxes (4-point polygons) that include transcription, semantic category, text type, and occlusion status; both full-frame annotations and cropped word images are released; six recognition models and seven detection models are benchmarked under zero-shot and fine-tuned conditions using a consistent per-video evaluation protocol. Result: 67% of the instances are handwritten and 1,566 are partially occluded; although the training set contains only 18 videos, fine-tuning still yields substantial gains, supporting the dataset's suitability for few-shot learning. Conclusion: ClapperText provides a realistic and culturally valuable data resource for robust OCR and document understanding under low-resource, degraded conditions, suited to historical document analysis and few-shot learning research. Abstract: This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, of which 67% are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText's suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at https://github.com/linty5/ClapperText.
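To make the annotation format concrete, the sketch below handles a rotated box given as a 4-point polygon: it computes the polygon area with the shoelace formula and the axis-aligned bounds one might use to crop a word image. The point ordering, coordinate convention, and function names are assumptions for illustration; the dataset's actual file format is not described in the abstract.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def crop_bounds(points: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

The ratio of polygon area to bounding-rectangle area gives a quick sense of how much background an axis-aligned crop would include for a strongly rotated word.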

[109] Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI

Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi

Main category: cs.CV

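TL;DR: Shakti VLM is a family of vision-language models with 1B and 4B parameters that, through architectural innovations and a three-stage training strategy, achieves efficient multimodal learning with less data and performs strongly on document understanding, visual reasoning, OCR extraction, and related tasks.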

Details

Motivation: Addresses the poor data efficiency of current vision-language models, which depend on large amounts of training data, and explores improving performance through model design rather than data scale. Method: Adopts architectural improvements including QK-Normalization, hybrid normalization techniques, and enhanced positional encoding, combined with a three-stage training strategy to improve learning efficiency. Result: Shakti-VLM-1B and Shakti-VLM-4B perform strongly on document understanding, visual reasoning, OCR extraction, and general multimodal reasoning, matching or exceeding existing models with fewer training tokens. Conclusion: Efficient model design and training strategy can deliver high performance without relying on large-scale data, making Shakti a viable option for efficient enterprise-scale multimodal tasks. Abstract: We introduce Shakti VLM, a family of vision-language models with 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, visual reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.

[110] Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation

Xiaoming Zhu,Xu Huang,Qinghongbing Xie,Zhi Deng,Junsheng Yu,Yirui Guan,Zhongyuan Liu,Lin Zhu,Qijun Zhao,Ligang Liu,Long Zeng

Main category: cs.CV

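TL;DR: Proposes a vision-guided 3D scene layout generation system that expands prompts with an image generation model, draws on a high-quality asset library, and uses an image parsing module and scene-graph optimization to produce rich, logically consistent 3D layouts.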

Details

Motivation: Existing methods for generating artistic and coherent 3D scene layouts have limitations: optimization-based methods depend on cumbersome manual rules, deep generative models struggle to produce rich and diverse content, and approaches that use large language models lack robustness and fail to capture complex spatial relationships accurately. Method: Builds a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts; uses an image generation model to expand text prompts into images and fine-tunes it to match the asset library; develops an image parsing module that recovers the 3D layout from visual semantics and geometric information; and optimizes the layout using scene graphs and overall visual semantics. Result: Experiments show that the method significantly outperforms existing methods in layout richness and quality, and user testing confirms its effectiveness. Conclusion: The proposed vision-guided 3D layout generation system produces diverse, artistic, and logically coherent 3D scene layouts and has practical application potential. Abstract: Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at https://github.com/HiHiAllen/Imaginarium.

[111] Unmasking Facial DeepFakes: A Robust Multiview Detection Framework for Natural Images

Sami Belguesmia,Mohand Saïd Allili,Assia Hamadene

Main category: cs.CV

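TL;DR: Proposes a multi-view architecture for DeepFake detection that fuses global, middle, and local view encoders with a face orientation encoder, improving detection under pose variation, occlusion, and complex lighting.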

Details

Motivation: Existing DeepFake detection methods perform poorly under pose variation, occlusion, and the subtle artifacts found in real-world scenes, falling short of practical requirements. Method: Designs a multi-view architecture comprising a global view encoder (detecting boundary inconsistencies), a middle view encoder (analyzing texture and color alignment), a local view encoder (capturing distortions in regions such as the eyes, nose, and mouth), and a face orientation encoder (classifying face pose), and fuses their features to improve detection. Result: Experiments on challenging datasets show that the method clearly outperforms conventional single-view detection methods and is more robust, especially under difficult pose and lighting conditions. Conclusion: The proposed multi-view fusion framework improves the accuracy and robustness of DeepFake detection and is suited to identifying forged face images in real-world scenarios. Abstract: DeepFake technology has advanced significantly in recent years, enabling the creation of highly realistic synthetic face images. Existing DeepFake detection methods often struggle with pose variations, occlusions, and artifacts that are difficult to detect in real-world conditions. To address these challenges, we propose a multi-view architecture that enhances DeepFake detection by analyzing facial features at multiple levels. Our approach integrates three specialized encoders: a global view encoder for detecting boundary inconsistencies, a middle view encoder for analyzing texture and color alignment, and a local view encoder for capturing distortions in expressive facial regions such as the eyes, nose, and mouth, where DeepFake artifacts frequently occur. Additionally, we incorporate a face orientation encoder, trained to classify face poses, ensuring robust detection across various viewing angles. By fusing features from these encoders, our model achieves superior performance in detecting manipulated images, even under challenging pose and lighting conditions. Experimental results on challenging datasets demonstrate the effectiveness of our method, outperforming conventional single-view approaches.

[112] Lightweight CycleGAN Models for Cross-Modality Image Transformation and Experimental Quality Assessment in Fluorescence Microscopy

Mohammad Soltaninezhad,Yashar Rouzbahani,Jhonatan Contreras,Rohan Chippalkatti,Daniel Kwaku Abankwa,Christian Eggeling,Thomas Bocklitz

Main category: cs.CV

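TL;DR: Proposes a lightweight CycleGAN for modality transfer in fluorescence microscopy (confocal to super-resolution STED) that sharply reduces computational cost by cutting model parameters and can also serve as a diagnostic tool for experimental quality.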

Details

Motivation: Addresses the challenge of unpaired datasets in fluorescence microscopy modality transfer while reducing the computational cost and environmental impact of deep learning models. Method: Replaces the traditional channel-doubling strategy in the U-Net generator with a fixed-channel strategy, greatly reducing the number of trainable parameters. Result: Model parameters drop from 41.8 million to approximately nine thousand, with faster training, lower memory usage, and better performance; deviations between GAN outputs and experimental images can be used to detect issues such as photobleaching, artifacts, or inaccurate labeling. Conclusion: The lightweight CycleGAN is efficient and practical, and can also serve as a tool for validating experimental accuracy and image fidelity in microscopy. Abstract: Lightweight deep learning models offer substantial reductions in computational cost and environmental impact, making them crucial for scientific applications. We present a lightweight CycleGAN for modality transfer in fluorescence microscopy (confocal to super-resolution STED/deconvolved STED), addressing the common challenge of unpaired datasets. By replacing the traditional channel-doubling strategy in the U-Net-based generator with a fixed channel approach, we drastically reduce trainable parameters from 41.8 million to approximately nine thousand, achieving superior performance with faster training and lower memory usage. We also introduce the GAN as a diagnostic tool for experimental and labeling quality. When trained on high-quality images, the GAN learns the characteristics of optimal imaging; deviations between its generated outputs and new experimental images can reveal issues such as photobleaching, artifacts, or inaccurate labeling. This establishes the model as a practical tool for validating experimental accuracy and image fidelity in microscopy workflows.
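The effect of fixed versus doubling channel widths on parameter count can be seen with simple arithmetic. The sketch below counts weights and biases for a stack of 3x3 convolutions; the layer widths (64 to 512 versus a constant 8), the single input channel, and the four-layer depth are illustrative assumptions and are not meant to reproduce the 41.8 million or nine thousand figures reported in the abstract.

```python
from typing import Sequence


def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus biases of one 2D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def encoder_params(
    channels: Sequence[int], in_channels: int = 1, kernel: int = 3
) -> int:
    """Total parameters of a plain stack of convolutions."""
    total = 0
    previous = in_channels
    for width in channels:
        total += conv_params(previous, width, kernel)
        previous = width
    return total


doubling = encoder_params([64, 128, 256, 512])  # width doubles per level
fixed = encoder_params([8, 8, 8, 8])            # width stays constant
```

Because each layer's cost scales with the product of input and output widths, doubling the width at every level makes the deepest layers dominate the total, while a fixed width keeps every layer equally cheap.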

[113] Standardization for improved Spatio-Temporal Image Fusion

Harkaitz Goyena,Peter M. Atkinson,Unai Pérez-Goya,M. Dolores Ugarte

Main category: cs.CV

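TL;DR: Proposes and compares two standardization approaches to improve the accuracy of Unpaired Spatio Temporal Fusion of Image Patches (USTFIP); the sharpening approach, Anomaly Based Satellite Image Standardization (ABSIS), markedly improves spectral and spatial accuracy.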

Details

Motivation: To facilitate the use of Spatio-Temporal Image Fusion (STIF) methods by addressing mismatches in spatial and spectral resolution between images acquired by different sensors. Method: The first approach is traditional upscaling of the fine-resolution images; the second is a sharpening approach called ABSIS, which blends the overall features of the fine-resolution image series with the distinctive attributes of a specific coarse-resolution image to produce standardized images that more closely resemble the result of aggregating the fine-resolution images. Result: Both approaches significantly improve the fusion accuracy of USTFIP, with ABSIS raising spectral and spatial accuracy by up to 49.46% and 78.40%, respectively. Conclusion: The proposed standardization approaches, ABSIS in particular, improve the quality of fused images from the unpaired STIF method and have good application prospects. Abstract: Spatio-Temporal Image Fusion (STIF) methods usually require sets of images with matching spatial and spectral resolutions captured by different sensors. To facilitate the application of STIF methods, we propose and compare two different standardization approaches. The first method is based on traditional upscaling of the fine-resolution images. The second method is a sharpening approach called Anomaly Based Satellite Image Standardization (ABSIS) that blends the overall features found in the fine-resolution image series with the distinctive attributes of a specific coarse-resolution image to produce images that more closely resemble the outcome of aggregating the fine-resolution images. Both methods produce a significant increase in accuracy of the Unpaired Spatio Temporal Fusion of Image Patches (USTFIP) STIF method, with the sharpening approach increasing the spectral and spatial accuracies of the fused images by up to 49.46% and 78.40%, respectively.

Hanrong Ye,Chao-Han Huck Yang,Arushi Goel,Wei Huang,Ligeng Zhu,Yuanhang Su,Sean Lin,An-Chieh Cheng,Zhen Wan,Jinchuan Tian,Yuming Lou,Dong Yang,Zhijian Liu,Yukang Chen,Ambrish Dantrey,Ehsan Jahangiri,Sreyan Ghosh,Daguang Xu,Ehsan Hosseini-Asl,Danial Mohseni Taheri,Vidya Murali,Sifei Liu,Jason Lu,Oluwatobi Olabiyi,Frank Wang,Rafael Valle,Bryan Catanzaro,Andrew Tao,Song Han,Jan Kautz,Hongxu Yin,Pavlo Molchanov

Main category: cs.CV

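TL;DR: OmniVinci is an open-source omni-modal large language model that, through a new model architecture and an efficient data pipeline, significantly outperforms existing models on cross-modal understanding, audio, and vision tasks while using only one-sixth of the training data of the compared model.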

Details

Motivation: Achieving machine intelligence closer to human perception requires a unified model that can process multiple modalities, strengthening cross-modal perception and reasoning. Method: Proposes OmniAlignNet to strengthen alignment between vision and audio embeddings in a shared latent space; uses Temporal Embedding Grouping to capture relative temporal relationships; uses Constrained Rotary Time Embedding to encode absolute temporal information; and builds a data synthesis and curation pipeline that yields 24M single-modal and omni-modal conversations. Result: OmniVinci improves by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision) while using only 0.2T training tokens, one-sixth of Qwen2.5-Omni's. Conclusion: Modalities reinforce one another in perception and reasoning, and OmniVinci shows broad potential in downstream applications such as robotics, medical AI, and smart factories. Abstract: Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a sixfold reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.

Zhen Sun,Lei Tan,Yunhang Shen,Chengmao Cai,Xing Sun,Pingyang Dai,Liujuan Cao,Rongrong Ji

Main category: cs.CV

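TL;DR: Proposes FlexiReID, a flexible person re-identification framework supporting seven retrieval modes across four modalities, and constructs the CIRS-PEDES dataset for comprehensive evaluation.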

Details

Motivation: Existing methods are limited to fixed cross-modal settings and cannot support arbitrary query-retrieval combinations, which restricts practical use. Method: Introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically fuse multimodal features and designs a cross-modal query fusion module to enhance feature extraction. Result: Experiments on the newly constructed CIRS-PEDES dataset show that FlexiReID achieves state-of-the-art performance across multiple retrieval modes and generalizes strongly. Conclusion: FlexiReID enables flexible, general multimodal person re-identification and increases its potential for real-world use. Abstract: Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: RGB, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.
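A generic mixture-of-experts combination can be sketched as a softmax gate that weights the outputs of several experts. This is the textbook form of the idea, not FlexiReID's adaptive mechanism, whose gating inputs and expert design are not described in the abstract; the function names and the dense (non-sparse) weighting are assumptions.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(
    expert_outputs: List[Sequence[float]], gate_scores: Sequence[float]
) -> List[float]:
    """Weighted sum of expert feature vectors using softmax gate weights."""
    if len(expert_outputs) != len(gate_scores):
        raise ValueError("need exactly one gate score per expert")
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]
```

In a learned model the gate scores would be produced from the input features, so different modality combinations can emphasize different experts.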

[116] Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection

Andrei-Timotei Ardelean,Patrick Rückbeil,Tim Weyrich

Main category: cs.CV

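TL;DR: Proposes QFCA, a real-time zero-shot texture anomaly detection method that achieves a 10x speedup while maintaining high accuracy through quantized feature correspondence analysis and PCA-based feature preprocessing.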

Details

Motivation: Existing zero-shot anomaly localization methods run slowly on texture data and are hard to apply in real-world scenarios such as assembly line monitoring, so an efficient and accurate solution is needed. Method: Proposes QFCA, which uses a quantized version of the feature correspondence analysis (FCA) algorithm and adds a feature preprocessing step based on principal component analysis (PCA), comparing patch statistics as histograms to improve contrast and efficiency. Result: Achieves roughly a 10x speedup over existing methods with minimal loss in accuracy and shows higher detection precision on complex textures. Conclusion: QFCA is an efficient and accurate zero-shot texture anomaly localization method suited to practical industrial use, offering a viable route to real-time anomaly detection. Abstract: Zero-shot anomaly localization is a rising field in computer vision research, with important progress in recent years. This work focuses on the problem of detecting and localizing anomalies in textures, where anomalies can be defined as the regions that deviate from the overall statistics, violating the stationarity assumption. The main limitation of existing methods is their high running time, making them impractical for deployment in real-world scenarios, such as assembly line monitoring. We propose a real-time method, named QFCA, which implements a quantized version of the feature correspondence analysis (FCA) algorithm. By carefully adapting the patch statistics comparison to work on histograms of quantized values, we obtain a 10x speedup with little to no loss in accuracy. Moreover, we introduce a feature preprocessing step based on principal component analysis, which enhances the contrast between normal and anomalous features, improving the detection precision on complex textures. Our method is thoroughly evaluated against prior art, comparing favorably with existing methods. Project page: https://reality.tf.fau.de/pub/ardelean2025quantized.html
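Comparing patch statistics through histograms of quantized values can be sketched in a few lines. The example below is not the FCA or QFCA algorithm: uniform binning over a fixed range, normalized histograms, and an L1 distance are assumptions chosen only to show why quantization makes the comparison cheap (a fixed number of bins instead of all raw values).

```python
from typing import List, Sequence


def quantize(
    values: Sequence[float], n_bins: int, lo: float = 0.0, hi: float = 1.0
) -> List[int]:
    """Map each value to a uniform bin index, clamped to [0, n_bins - 1]."""
    width = (hi - lo) / n_bins
    return [min(n_bins - 1, max(0, int((v - lo) / width))) for v in values]


def histogram(bin_indices: Sequence[int], n_bins: int) -> List[float]:
    """Normalized histogram of bin indices."""
    if not bin_indices:
        raise ValueError("cannot build a histogram from an empty patch")
    counts = [0] * n_bins
    for b in bin_indices:
        counts[b] += 1
    return [c / len(bin_indices) for c in counts]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """L1 distance between two normalized histograms (range 0 to 2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def patch_distance(
    patch_a: Sequence[float], patch_b: Sequence[float], n_bins: int = 2
) -> float:
    """Distance between two patches of feature values via quantized histograms."""
    hist_a = histogram(quantize(patch_a, n_bins), n_bins)
    hist_b = histogram(quantize(patch_b, n_bins), n_bins)
    return histogram_distance(hist_a, hist_b)
```

A patch whose value distribution departs from that of a reference patch receives a larger distance, which is the sense in which an anomalous region "deviates from the overall statistics".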

[117] Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration

Tomáš Chobola,Julia A. Schnabel,Tingying Peng

Main category: cs.CV

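TL;DR: Presents Noise2Detail (N2D), an ultra-lightweight self-supervised denoising model built on the Noise2Noise framework that uses a multistage denoising pipeline to achieve fast, high-quality image restoration without clean reference images.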

Details

Motivation: Existing self-supervised denoising methods have high computational and memory costs and struggle to balance inference speed with reconstruction quality, which limits their use in real-world settings, especially in biomedical imaging where clean training data are scarce. Method: Building on the Noise2Noise training framework, designs a multistage denoising pipeline, Noise2Detail (N2D), that disrupts the spatial correlations of noise patterns during inference, first producing intermediate smooth structures and then recovering details from the original noisy input. Result: Experiments show that Noise2Detail outperforms existing dataset-free methods while substantially reducing computational requirements, combining efficiency, low computational cost, and no need for training data. Conclusion: Noise2Detail offers an efficient and practical solution for real-world denoising, particularly in biomedical imaging. Abstract: Current self-supervised denoising techniques achieve impressive results, yet their real-world application is frequently constrained by substantial computational and memory demands, necessitating a compromise between inference speed and reconstruction quality. In this paper, we present an ultra-lightweight model that addresses this challenge, achieving both fast denoising and high-quality image restoration. Built upon the Noise2Noise training framework, which removes the reliance on clean reference images or explicit noise modeling, we introduce an innovative multistage denoising pipeline named Noise2Detail (N2D). During inference, this approach disrupts the spatial correlations of noise patterns to produce intermediate smooth structures, which are subsequently refined to recapture fine details directly from the noisy input. Extensive testing reveals that Noise2Detail surpasses existing dataset-free techniques in performance, while requiring only a fraction of the computational resources. This combination of efficiency, low computational cost, and data-free approach makes it a valuable tool for biomedical imaging, overcoming the challenges of scarce clean training data, which stem from rare and complex imaging modalities, while enabling fast inference for practical use.

[118] Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey

Shuchang Lyu,Qi Zhao,Zheng Zhou,Meng Li,You Zhou,Dingding Yao,Guangliang Cheng,Huiyu Zhou,Zhenwei Shi

Main category: cs.CV

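TL;DR: Surveys recent advances in deep learning based domain adaptation for remote sensing, covering key concepts, a taxonomy of methods, commonly used datasets, and the performance of state-of-the-art methods, and proposes future research directions.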

Details

Motivation: Domain adaptation in remote sensing faces challenges such as large differences in data distribution, diverse sensors, and changing environments, so a systematic survey is needed to consolidate existing results and guide future research. Method: Proposes a systematic taxonomy that organizes and analyzes existing deep learning domain adaptation methods from several perspectives, including task type, input mode, supervision paradigm, and algorithmic granularity. Result: Summarizes the mainstream datasets and the performance of recent methods in remote sensing domain adaptation, identifies current open challenges, and provides a comprehensive methodological taxonomy. Conclusion: The survey offers a broader and more systematic view of domain adaptation in remote sensing and helps advance the research community. Abstract: Domain adaptation is a crucial and increasingly important task in remote sensing, aiming to transfer knowledge from a source domain to a differently distributed target domain. It has broad applications across various real-world scenarios, including remote sensing element interpretation, ecological environment monitoring, and urban/rural planning. However, domain adaptation in remote sensing poses significant challenges due to differences in data, such as variations in ground sampling distance, imaging modes from various sensors, geographical landscapes, and environmental conditions. In recent years, deep learning has emerged as a powerful tool for feature representation and cross-domain knowledge transfer, leading to widespread adoption in remote sensing tasks. In this paper, we present a comprehensive survey of significant advancements in deep learning based domain adaptation for remote sensing. We first introduce the preliminary knowledge to clarify key concepts, mathematical notations, and the taxonomy of methodologies. We then organize existing algorithms from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity, providing readers with a structured understanding of the field. Next, we review widely used datasets and summarize the performance of state-of-the-art methods to provide an overview of current progress. We also identify open challenges and potential directions to guide future research in domain adaptation for remote sensing. Compared to previous surveys, this work addresses a broader range of domain adaptation tasks in remote sensing, rather than concentrating on a few subfields. It also presents a systematic taxonomy, providing a more comprehensive and organized understanding of the field. As a whole, this survey can inspire the research community, foster understanding, and guide future work in the field.

[119] Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation

Lei Shi,Gang Li,Junxing Zhang

Main category: cs.CV

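TL;DR: Proposes a weakly supervised medical image segmentation framework that needs only four extreme points as annotation; it uses SAM2 to generate initial pseudo labels and refines the segmentation with an enhanced FGEPM algorithm and uncertainty-aware losses, matching or even exceeding fully supervised methods while reducing annotation cost.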

Details

Motivation: Fully supervised medical image segmentation depends on large amounts of pixel-level annotation, which is costly and time-consuming, so a weakly supervised method that substantially reduces the annotation burden is needed. Method: Bounding boxes derived from four extreme points serve as prompts for SAM2 to produce initial pseudo labels; an enhanced FGEPM algorithm combined with Monte Carlo dropout uncertainty estimation builds a unified gradient uncertainty cost map for boundary tracing; a dual-branch Uncertainty-aware Scale Consistency (USC) loss and a box alignment loss improve spatial consistency and boundary precision during training. Result: Experiments on two public ultrasound datasets, BUSI and UNS, show performance comparable to or better than fully supervised methods at a much lower annotation cost. Conclusion: The proposed weakly supervised segmentation framework is effective and practical, achieving high-quality ultrasound image segmentation at very low annotation cost. Abstract: Automatic medical image segmentation is a fundamental step in computer-aided diagnosis, yet fully supervised approaches demand extensive pixel-level annotations that are costly and time-consuming. To alleviate this burden, we propose a weakly supervised segmentation framework that leverages only four extreme points as annotation. Specifically, bounding boxes derived from the extreme points are used as prompts for the Segment Anything Model 2 (SAM2) to generate reliable initial pseudo labels. These pseudo labels are progressively refined by an enhanced Feature-Guided Extreme Point Masking (FGEPM) algorithm, which incorporates Monte Carlo dropout-based uncertainty estimation to construct a unified gradient uncertainty cost map for boundary tracing. Furthermore, a dual-branch Uncertainty-aware Scale Consistency (USC) loss and a box alignment loss are introduced to ensure spatial consistency and precise boundary alignment during training. Extensive experiments on two public ultrasound datasets, BUSI and UNS, demonstrate that our method achieves performance comparable to, and even surpassing, fully supervised counterparts while significantly reducing annotation cost. These results validate the effectiveness and practicality of the proposed weakly supervised framework for ultrasound image segmentation.
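
As an illustration of the first step described in the abstract, the sketch below derives a bounding-box prompt from four extreme points. The function name and the example coordinates are assumptions made for this illustration; the call into SAM2 itself is not shown, and nothing here is taken from the authors' code.

```python
def box_from_extreme_points(points):
    """Return (x_min, y_min, x_max, y_max) enclosing four (x, y) extreme points."""
    if len(points) != 4:
        raise ValueError("expected exactly four extreme points")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical clicks: left-most, top-most, right-most, bottom-most pixel.
extreme_points = [(12, 40), (55, 8), (97, 47), (60, 83)]
box = box_from_extreme_points(extreme_points)
```

A box in this form is the usual shape of a box prompt for promptable segmentation models, which is presumably why four clicks are enough to start the pseudo-labelling stage.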

[120] Valeo Near-Field: a novel dataset for pedestrian intent detection

Antonyo Musabini,Rachid Benmokhtar,Jagdish Bhanushali,Victor Galizzi,Bertrand Luvison,Xavier Perrotton

Main category: cs.CV

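TL;DR: This paper presents a new multi-modal dataset for detecting the intentions of pedestrians as they approach an ego-vehicle, comprising fisheye camera, lidar, ultrasonic sensor, and motion-capture 3D pose data, together with detailed annotations and benchmarking tools.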

Details

Motivation: To address the lack of synchronized multi-modal data and precise 3D pose annotations in existing real-world datasets, and to advance near-field perception algorithms for intelligent vehicles. Method: Synchronized data are collected from several sensors (fisheye cameras, lidar, ultrasonic sensors, motion capture); 3D joint positions are aligned with the images, accurate pedestrian positions are extracted from lidar, and a public subset with accurate annotations is built along with an evaluation benchmark aimed at embedded systems. Result: A portion of the dataset is released with a benchmark suite that provides evaluation metrics for accuracy, efficiency, and scalability, and baseline results for tasks such as 3D pose estimation and trajectory prediction are given using custom neural networks. Conclusion: The dataset supports pedestrian intent detection, 3D pose estimation, and 4D trajectory prediction, and is expected to become an important resource for near-field perception research in intelligent vehicles. Abstract: This paper presents a novel dataset aimed at detecting pedestrians' intentions as they approach an ego-vehicle. The dataset comprises synchronized multi-modal data, including fisheye camera feeds, lidar laser scans, ultrasonic sensor readings, and motion capture-based 3D body poses, collected across diverse real-world scenarios. Key contributions include detailed annotations of 3D body joint positions synchronized with fisheye camera images, as well as accurate 3D pedestrian positions extracted from lidar data, facilitating robust benchmarking for perception algorithms. We release a portion of the dataset along with a comprehensive benchmark suite, featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems. By addressing real-world challenges such as sensor occlusions, dynamic environments, and hardware constraints, this dataset offers a unique resource for developing and evaluating state-of-the-art algorithms in pedestrian detection, 3D pose estimation and 4D trajectory and intention prediction. Additionally, we provide baseline performance metrics using custom neural network architectures and suggest future research directions to encourage the adoption and enhancement of the dataset. This work aims to serve as a foundation for researchers seeking to advance the capabilities of intelligent vehicles in near-field scenarios.

[121] Towards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI

Gerard Comas-Quiles,Carles Garcia-Cabrera,Julia Dietlmeier,Noel E. O'Connor,Ferran Marques

Main category: cs.CV

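TL;DR: Proposes an unsupervised anomaly detection method based on a Multimodal Vision Transformer Autoencoder (MViT-AE) for brain tumor segmentation, which detects and localizes tumors from reconstruction error maps without annotated data.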

Details

Motivation: When annotated data are limited, costly, or inconsistent, supervised methods are hard to scale, so an unsupervised method that does not rely on manual labels is needed to improve the scalability of neuroimaging workflows. Method: A Multimodal Vision Transformer Autoencoder (MViT-AE) is trained only on healthy brain MRIs and detects tumors from reconstruction error maps; a multimodal early-late fusion strategy integrates information across MRI sequences, and post-processing with the Segment Anything Model (SAM) refines the tumor contours. Result: On the BraTS-GoAT 2025 Lighthouse dataset, the lesion-wise Dice coefficients are 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor), and the anomaly detection rate on the validation set is 89.4%. Conclusion: Transformer-based unsupervised models achieve clinically meaningful localization in brain tumor detection and show potential as scalable, label-efficient tools. Abstract: Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated on the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with lesion-wise Dice Similarity Coefficient of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly Detection Rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.
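
To make the reconstruction-error idea concrete, the following minimal sketch thresholds a per-pixel error map into an anomaly mask and scores it with a plain Dice coefficient. The threshold `tau`, the toy 2x2 arrays, and all function names are assumptions for illustration; the paper reports lesion-wise Dice, which additionally groups voxels into lesions and is not reproduced here.

```python
def error_map(image, reconstruction):
    """Per-pixel absolute reconstruction error for two equally sized 2D lists."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image, reconstruction)]


def threshold(errors, tau):
    """Binary anomaly mask: 1 where the error exceeds tau, else 0."""
    return [[1 if e > tau else 0 for e in row] for row in errors]


def dice(pred, target):
    """Dice coefficient 2|P & T| / (|P| + |T|) for binary masks."""
    intersection = sum(p * t for rp, rt in zip(pred, target) for p, t in zip(rp, rt))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: the autoencoder fails to reproduce the bright pixel at (0, 1).
image = [[0.1, 0.9], [0.2, 0.8]]
reconstruction = [[0.1, 0.2], [0.2, 0.7]]
mask = threshold(error_map(image, reconstruction), tau=0.5)
score = dice(mask, [[0, 1], [0, 1]])
```

Because the model is trained only on healthy scans, large errors are read as evidence of abnormal tissue; the choice of threshold then trades missed lesions against false alarms.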

[122] Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis

Junzhi Ning,Wei Li,Cheng Tang,Jiashi Lin,Chenglong Ma,Chaoyang Zhang,Jiyao Liu,Ying Chen,Shujian Gao,Lihao Liu,Yuandong Pu,Huihui Xu,Chenhui Gou,Ziyan Huang,Yi Xin,Qi Qin,Zhongying Deng,Diping Song,Bin Fu,Guang Yang,Yuanfeng Ji,Tianbin Li,Yanzhou Su,Jin Ye,Shixiang Tang,Ming Hu,Junjun He

Main category: cs.CV

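TL;DR: Proposes a multi-level framework based on the Observation-Knowledge-Analysis (OKA) paradigm and UniMedVL, a unified model for medical image understanding and generation that performs strongly across a range of medical vision-language tasks.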

Details

Motivation: Existing medical AI systems treat image understanding and image generation separately and lack a unified multimodal capability, which leaves gaps in data representation, feature integration, and task-level multimodal capabilities. Method: The authors construct UniMed-5M, a dataset of over 5.6M samples, propose Progressive Curriculum Learning to introduce medical multimodal knowledge, and design UniMedVL, described as the first unified model supporting both image understanding and generation. Result: UniMedVL achieves superior performance on five medical image understanding benchmarks and matches specialized models in generation quality across eight medical imaging modalities, and generation tasks in turn improve understanding. Conclusion: A unified architecture enables bidirectional knowledge sharing, and integrating traditionally separate tasks can substantially improve the overall performance of medical multimodal models. Abstract: Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.

[123] DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification

Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig

Main category: cs.CV

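TL;DR: This paper proposes DGME-T, a lightweight extension to the Video Swin Transformer that injects optical-flow-based directional grid motion encoding and improves camera movement classification on both modern and historical footage.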

Details

Motivation: Existing camera movement classification models degrade on archival film with heavy noise, missing frames, and low contrast, so a more robust method is needed for motion recognition in degraded video. Method: A unified benchmark is assembled that consolidates two modern datasets into four classes and restructures the HISTORIAN dataset into five balanced categories; on this basis, DGME-T injects directional grid motion encoding (DGME) derived from optical flow through a learnable, normalised late-fusion layer to strengthen the motion awareness of the Video Swin Transformer. Result: On modern clips, DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81%; on World-War-II historical footage, accuracy rises from 83.43% to 84.62% and macro F1 from 81.72% to 82.63%; a cross-domain study shows that intermediate fine-tuning on modern data improves historical performance by more than five percentage points. Conclusion: Structured motion priors and transformer representations are complementary, and even a small, carefully calibrated motion head can substantially improve robustness in degraded film analysis. Abstract: Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone's top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at https://github.com/linty5/DGME-T.
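
The abstract does not spell out how the directional grid motion encoding is computed, so the following is only one plausible reading, offered as a hedged sketch: split a dense optical-flow field into a grid of cells and, per cell, build a magnitude-weighted histogram of flow directions. The grid size, the number of direction bins, and the normalisation are all assumptions, not the authors' definition.

```python
import math


def directional_grid_encoding(flow, grid=2, bins=4):
    """Encode a dense flow field as per-cell direction histograms.

    flow: H x W nested list of (dx, dy) vectors.
    Returns grid * grid histograms (row-major), each with `bins` entries
    that sum to 1 for cells containing motion.
    """
    height, width = len(flow), len(flow[0])
    hists = [[0.0] * bins for _ in range(grid * grid)]
    bin_width = 2 * math.pi / bins
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0.0:
                continue  # static pixels carry no direction
            # Shift by half a bin so that bin 0 is centred on the +x direction.
            angle = (math.atan2(dy, dx) + bin_width / 2) % (2 * math.pi)
            b = int(angle / bin_width) % bins
            cell = (y * grid // height) * grid + (x * grid // width)
            hists[cell][b] += magnitude
    for hist in hists:
        total = sum(hist)
        if total > 0:
            hist[:] = [v / total for v in hist]
    return hists


# A uniform flow along +x: every cell puts all of its mass in bin 0.
pan_right = [[(1.0, 0.0)] * 4 for _ in range(4)]
encoding = directional_grid_encoding(pan_right)
```

A descriptor of this kind is compact and fixed-length, which would make it straightforward to fuse with backbone features in a late-fusion layer as the abstract describes.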

[124] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

Qingyan Bai,Qiuyu Wang,Hao Ouyang,Yue Yu,Hanlin Wang,Wen Wang,Ka Leong Cheng,Shuailei Ma,Yanhong Zeng,Zichen Liu,Yinghao Xu,Yujun Shen,Qifeng Chen

Main category: cs.CV

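TL;DR: This paper proposes Ditto, a framework for generating large-scale, high-quality training data for instruction-based video editing, builds the one-million-example dataset Ditto-1M, and trains Editto, a model that achieves state-of-the-art instruction-following ability.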

Details

Motivation: Progress in instruction-based video editing is limited by the lack of large-scale, high-quality training data, and the paper aims to address this fundamental problem. Method: The Ditto framework combines the creative diversity of an image editor with an in-context video generator, uses an efficient distilled model architecture with a temporal enhancer to improve generation efficiency and temporal coherence, and relies on an intelligent agent to generate diverse instructions automatically and filter the outputs for quality. Result: More than 12,000 GPU-days were spent building Ditto-1M, and the Editto model, trained with a curriculum learning strategy, shows strong instruction-following ability and state-of-the-art performance. Conclusion: The Ditto framework effectively addresses data scarcity in instruction-based video editing and raises the scalability and performance ceiling of the field. Abstract: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.

[125] SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior

Haoran Wang,Bo Zhao,Jinghui Wang,Hanzhang Wang,Huan Yang,Wei Ji,Hao Liu,Xinyan Xiao

Main category: cs.CV

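TL;DR: This paper proposes SEGA, a stepwise evolution paradigm for content-aware layout generation that uses a coarse-to-fine hierarchical reasoning framework and incorporates layout design principles as prior knowledge, substantially improving layout planning in complex scenes.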

Details

Motivation: Existing methods lack a feedback-based self-correction mechanism, fail more often on complex element layouts, and struggle to generate high-quality layouts that are harmonious with the background image. Method: SEGA uses two-stage hierarchical reasoning: a coarse-level module first estimates the overall layout, and a refining module then improves the preliminary result; layout design principles are incorporated as prior knowledge, and a large-scale poster dataset, GenPoster-100K, is built for training and evaluation. Result: The method achieves state-of-the-art results on multiple benchmark datasets, and experiments confirm its effectiveness in layout quality and harmony. Conclusion: By imitating the systematic mode of human thinking and combining feedback-based refinement with domain priors, SEGA substantially improves content-aware layout generation and is particularly suited to applications such as poster design in complex scenes. Abstract: In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually deal with this task with a single-step reasoning framework. The lack of a feedback-based self-correction mechanism leads to their failure rates significantly increasing when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution Paradigm for Content-Aware Layout Generation. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, another refining module performs fine-level reasoning regarding the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the model to enhance its layout planning ability. Besides, we present GenPoster-100K, a new large-scale poster dataset with rich meta-information annotation. The experiments demonstrate the effectiveness of our approach by achieving state-of-the-art results on multiple benchmark datasets. Our project page is at: https://brucew91.github.io/SEGA.github.io/

[126] NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation

Yitong Sun,Yao Huang,Ruochen Zhang,Huanran Chen,Shouwei Ruan,Ranjie Duan,Xingxing Wei

Main category: cs.CV

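TL;DR: This paper proposes NDM, a noise-driven detection and mitigation framework that detects and suppresses implicit malicious intentions (such as implicit sexual content) in text-to-image generation while preserving the model's original generative capabilities.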

Details

Motivation: Existing detection methods mainly target explicit harmful content and struggle to identify implicit sexual prompts disguised as benign terms, and fine-tuning approaches may degrade generation quality, so a method is needed that detects implicit risks effectively without sacrificing generative performance. Method: The NDM framework (1) exploits the separability of early-stage predicted noise for accurate noise-based detection, and (2) introduces a noise-enhanced adaptive negative guidance mechanism that optimizes the initial noise by suppressing attention on prominent regions, mitigating sexual content in generation more effectively. Result: NDM is validated on natural and adversarial datasets, where it outperforms existing SOTA methods such as SLD, UCE, and RECE in detection accuracy and mitigation. Conclusion: NDM is described as the first noise-based framework for detecting and mitigating implicit harmful content; it prevents the generation of implicit sexual content while retaining the high-quality generation of T2I diffusion models, giving it practical and ethical value. Abstract: Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model's generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which could detect and mitigate implicit malicious intention in T2I generation while preserving the model's original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that could identify malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that could optimize the initial noise by suppressing the prominent region's attention, thereby enhancing the effectiveness of adaptive negative guidance for sexual mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE. Code and resources are available at https://github.com/lorraine021/NDM.

[127] Semantic segmentation with coarse annotations

Jort de Jong,Mike Holenderski

Main category: cs.CV

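TL;DR: Proposes a superpixel-based regularization method for encoder-decoder semantic segmentation models that significantly improves boundary recall when training on coarse annotations.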

Details

Motivation: When fine annotations are hard to obtain, semantic segmentation with coarse annotations struggles to align boundaries, so the model's ability to optimize boundaries needs improvement. Method: A regularization method based on SLIC superpixels encourages the decoded segmentation to form superpixels based on color and position; it is applied to the FCN-16 architecture. Result: Evaluated on the SUIM, Cityscapes, and PanNuke datasets, the method significantly improves boundary recall over existing models trained on coarse annotations. Conclusion: The regularization method improves boundary alignment for segmentation models trained on coarse annotations and is especially suited to settings where fine annotation is expensive. Abstract: Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves best results using annotated images, where each pixel is annotated with the corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g. by roughly annotating pixels in an image, leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture with superpixel based upsampling. It encourages the segmented pixels in the decoded image to be SLIC-superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to the FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke data sets. It is shown that the boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.

[128] QSilk: Micrograin Stabilization and Adaptive Quantile Clipping for Detail-Friendly Latent Diffusion

Denis Rychkovskiy

Main category: cs.CV

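TL;DR: QSilk is a lightweight, always-on stabilization layer for latent diffusion that uses a micro clamp and adaptive quantile clipping to improve high-frequency fidelity and suppress activation spikes, giving cleaner, sharper renders at low step counts and ultra-high resolutions without any training.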

Details

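Motivation: To improve high-frequency detail fidelity in latent diffusion models while avoiding the instability caused by rare activation outliers, a stabilization method is needed that requires no additional training and has low overhead. Method: QSilk combines two techniques: a per-sample micro clamp that gently limits extreme values, and Adaptive Quantile Clip (AQClip), which adapts the allowed value range per region and can run in a proxy mode based on local structure statistics or in an attention-entropy-guided mode (model confidence). Result: Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead; consistent qualitative improvements are reported across SD/SDXL backbones, and it works together with CFG/Rescale, allowing slightly higher guidance without artifacts. Conclusion: QSilk is an efficient, training-free stabilization scheme for latent diffusion that improves generation quality and stability, suits high-resolution image generation, and is easy to integrate. Abstract: We present QSilk, a lightweight, always-on stabilization layer for latent diffusion that improves high-frequency fidelity while suppressing rare activation spikes. QSilk combines (i) a per-sample micro clamp that gently limits extreme values without washing out texture, and (ii) Adaptive Quantile Clip (AQClip), which adapts the allowed value corridor per region. AQClip can operate in a proxy mode using local structure statistics or in an attention entropy guided mode (model confidence). Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead. It requires no training or fine-tuning and exposes minimal user controls. We report consistent qualitative improvements across SD/SDXL backbones and show synergy with CFG/Rescale, enabling slightly higher guidance without artifacts.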
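
The basic operation behind quantile clipping can be sketched as follows: estimate a lower and an upper quantile of the values and clamp everything to that corridor. This is a generic, global version written for illustration; the quantile levels, the function names, and the linear-interpolation quantile are assumptions, and the per-region adaptation and entropy-guided mode that AQClip adds are not reproduced.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] * (1 - fraction) + sorted_values[high] * fraction


def quantile_clip(values, q_low=0.01, q_high=0.99):
    """Clamp every value to the [q_low, q_high] quantile corridor of the input."""
    ordered = sorted(values)
    lower, upper = quantile(ordered, q_low), quantile(ordered, q_high)
    return [min(max(v, lower), upper) for v in values]


# A single activation spike (100.0) is pulled back to the 75th percentile.
latents = [0.0, 1.0, 2.0, 3.0, 100.0]
clipped = quantile_clip(latents, q_low=0.0, q_high=0.75)
```

Clipping at quantiles rather than at fixed constants keeps the corridor tied to the actual value distribution, which is one way to suppress rare spikes without flattening ordinary texture.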

[129] Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model

Gaoxiang Huang,Songning Lai,Yutao Yue

Main category: cs.CV

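TL;DR: Proposes a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components, improving the alignment between concepts and visual patterns and thereby both interpretability and classification performance.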

Details

Motivation: Existing concept bottleneck models suffer from input-to-concept mapping bias and limited controllability, which restricts their practical value and damages the responsibility of strategies derived from concept-based methods. Method: A filter grouping loss and joint concept supervision are introduced to group visual features automatically without region annotation, giving better concept alignment. Result: Experiments on three diverse datasets show that LDCBM achieves higher concept accuracy and class accuracy than previous CBMs. Conclusion: By grounding concepts in visual evidence, LDCBM overcomes a fundamental limitation of prior models and improves the reliability of interpretable AI. Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restricts their practical value and directly damages the responsibility of strategy from concept-based methods. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without region annotation. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.

[130] Controlling the image generation process with parametric activation functions

Ilia Pavlov

Main category: cs.CV

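TL;DR: This paper proposes a new way to understand a generative model interactively: the activation functions of the generative network are replaced with parametric ones whose parameters the user can set.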

Details

Motivation: Image generative models keep improving in fidelity and ubiquity, but tools that interact directly with their internal mechanisms in an interpretable way are lacking. Method: The activation functions of a generative network are replaced with parametric versions, and a way to set their parameters is provided so that users can explore model behavior through interaction and experimentation. Result: The method is demonstrated on StyleGAN2 and BigGAN networks trained on FFHQ and ImageNet respectively, showing that the network output can be controlled. Conclusion: The proposed system provides an interpretable and intuitive interactive tool for understanding and manipulating generative models. Abstract: As image generative models continue to increase not only in their fidelity but also in their ubiquity, the development of tools that leverage direct interaction with their internal mechanisms in an interpretable way has received little attention. In this work, we introduce a system that allows users to develop a better understanding of the model through interaction and experimentation. By giving users the ability to replace activation functions of a generative network with parametric ones and a way to set the parameters of these functions, we introduce an alternative approach to control the network's output. We demonstrate the use of our method on StyleGAN2 and BigGAN networks trained on FFHQ and ImageNet, respectively.
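
The abstract does not say which parametric family replaces the original activations, so the sketch below uses a hypothetical two-parameter leaky ReLU purely to show the idea: the same pre-activation value yields different outputs as the user changes the parameters. The names `negative_slope` and `gain` are assumptions for this illustration.

```python
def parametric_leaky_relu(x, negative_slope=0.2, gain=1.0):
    """Leaky ReLU with a user-settable slope for x < 0 and an overall gain."""
    return gain * (x if x >= 0 else negative_slope * x)


def apply_activation(features, **params):
    """Apply the parametric activation to every value in a feature vector."""
    return [parametric_leaky_relu(v, **params) for v in features]


features = [-2.0, 0.0, 3.0]
default_out = apply_activation(features)
edited_out = apply_activation(features, negative_slope=0.5, gain=2.0)
```

In an interactive tool, parameters like these would be exposed as controls for a chosen layer, and the effect of each change could be observed directly in the generated image.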

[131] ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection

Haowei Zhu,Tianxiang Pan,Rui Qin,Jun-Hai Yong,Bin Wang

Main category: cs.CV

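TL;DR: ReCon is a new data augmentation framework that introduces region-guided rectification and region-aligned cross-attention into the diffusion sampling process, improving the quality of generated data and the training of object detection models.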

Details

Motivation: Existing generative approaches to data augmentation often rely on complex post-processing or large-scale fine-tuning and remain prone to content-position mismatches and semantic leakage, which limits their use in object detection. Method: The ReCon framework uses feedback from a pre-trained perception model to perform region-guided rectification during diffusion sampling, and adds region-aligned cross-attention to enforce spatial-semantic consistency between image regions and text prompts. Result: Experiments show that ReCon substantially improves the quality and trainability of generated data, with performance gains across datasets, backbone architectures, and data scales. Conclusion: ReCon addresses key problems of generative models in data augmentation for object detection and offers a new approach to generating high-quality synthetic data. Abstract: The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions within the diffusion sampling process. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improves the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales. Our code is available at https://github.com/haoweiz23/ReCon.

[132] ERNet: Efficient Non-Rigid Registration Network for Point Sequences

Guangzhao He,Yuxi Xiao,Zhen Xu,Xiaowei Zhou,Sida Peng

Main category: cs.CV

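TL;DR: Proposes ERNet, an efficient feed-forward model for registering point cloud sequences undergoing non-rigid deformation; it predicts a sequence of deformation graphs through a two-stage pipeline and outperforms existing methods in both accuracy and efficiency.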

Details

Motivation: To address local minima and error accumulation in the registration of non-rigidly deforming point cloud sequences, and to improve robustness under noisy and partial inputs as well as long-term tracking. Method: A data-driven approach is adopted: the ERNet model uses a two-stage pipeline that first estimates coarse graph nodes frame by frame for robust initialization and then refines their trajectories over time in a sliding-window fashion. Result: ERNet outperforms previous state-of-the-art methods on the DeformingThings4D and D-FAUST datasets and is more than 4x faster. Conclusion: ERNet handles the registration of non-rigidly deforming sequences effectively, combining high accuracy, robustness, and efficiency, and is suited to stable tracking over long sequences. Abstract: Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms previous state-of-the-art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves more than 4x speedup compared to the previous best, offering significant efficiency improvement.

[133] VISTA: A Test-Time Self-Improving Video Generation Agent

Do Xuan Long,Xingchen Wan,Hootan Nakhost,Chen-Yu Lee,Tomas Pfister,Sercan Ö. Arık

Main category: cs.CV

TL;DR: Proposes VISTA, a multi-agent system that autonomously improves video generation quality by iteratively refining prompts.

Details

Motivation: Existing text-to-video generation depends heavily on precise user prompts, and test-time optimization methods have had limited success in video generation. Method: The user's idea is decomposed into a temporal plan; after generation, the best video is selected through a pairwise tournament, three specialized agents critique its visual, audio, and contextual fidelity, and a reasoning agent integrates the feedback to rewrite the prompt. Result: In single- and multi-scene video generation, VISTA achieves up to a 60% pairwise win rate against baselines, and human evaluators prefer its outputs in 66.4% of comparisons. Conclusion: VISTA consistently improves video quality and alignment with user intent and outperforms existing methods. Abstract: Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
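
The selection step can be pictured as a knockout over candidate videos. The sketch below is a generic single-elimination tournament, assuming a hypothetical `prefer(a, b)` judge that returns True when `a` should beat `b`; in the paper the comparison is made by the system's own judging procedure, whose details the abstract does not give, so a numeric score stands in for it here.

```python
def pairwise_tournament(candidates, prefer):
    """Return the candidate that survives a single-elimination knockout.

    prefer(a, b) must return True if a beats b. With an odd number of
    candidates, the last one gets a bye into the next round.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Hypothetical candidates: (video id, stand-in quality score).
videos = [("v1", 0.4), ("v2", 0.9), ("v3", 0.6)]
winner = pairwise_tournament(videos, prefer=lambda a, b: a[1] >= b[1])
```

Relying on pairwise comparisons means the judge only has to answer "which of these two is better", which is often an easier and more stable question than assigning an absolute score.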

[134] Neuro-Symbolic Spatial Reasoning in Segmentation

Jiayi Lin,Jiabo Huang,Shaogang Gong

Main category: cs.CV

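TL;DR: This paper proposes RelateSeg, which introduces neuro-symbolic (NeSy) spatial reasoning into open-vocabulary semantic segmentation (OVSS) for the first time; spatial relational constraints are formalized in first-order logic so that each pixel jointly predicts a semantic category and a spatial pseudo category, which clearly improves segmentation in multi-object scenes.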

Details

Motivation: Existing OVSS methods based on vision-language models (VLMs) lack an understanding of the spatial relations between objects in a scene and generalize poorly to unseen and unlabelled objects. Method: RelateSeg automatically extracts spatial relations from the image (e.g., "the cat is to the right of the person"), encodes them as first-order logic formulas using pseudo categories, and embeds them in a deep network through fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. Result: RelateSeg achieves state-of-the-art average mIoU on four benchmark datasets, with a clear advantage on images containing multiple categories, while adding only one auxiliary loss and no extra parameters. Conclusion: The results validate the effectiveness of neuro-symbolic spatial reasoning in open-vocabulary semantic segmentation and offer a new modelling paradigm for OVSS. Abstract: Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations (e.g., a cat to the right of a person) and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
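
One common way to relax a logical rule such as "cat pixel implies right-of-person pixel" into a differentiable penalty is a fuzzy implication over predicted probabilities. The sketch below uses the Reichenbach implication, 1 - a + a*b, as an assumed choice; the abstract does not state which relaxation RelateSeg uses, and the function names are illustrative.

```python
def fuzzy_implies(a, b):
    """Reichenbach fuzzy implication: truth degree of (a -> b) for a, b in [0, 1]."""
    return 1.0 - a + a * b


def relation_loss(p_semantic, p_spatial):
    """Mean violation of 'semantic -> spatial' over pixels.

    p_semantic[i]: probability that pixel i belongs to the semantic class ("cat").
    p_spatial[i]:  probability that pixel i has the pseudo category ("right of person").
    """
    violations = [1.0 - fuzzy_implies(a, b) for a, b in zip(p_semantic, p_spatial)]
    return sum(violations) / len(violations)


satisfied = relation_loss([1.0, 0.0], [1.0, 0.0])  # rule holds at every pixel
violated = relation_loss([1.0], [0.0])             # a "cat" pixel not right of the person
```

Because the penalty is a smooth function of the two probability maps, it can be added to the training objective as a single auxiliary loss without introducing new parameters, which matches the cost profile described in the abstract.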

[135] 3DPR: Single Image 3D Portrait Relight using Generative Priors

Pramod Rao,Abhimitra Meka,Xilong Zhou,Gereon Fox,Mallikarjun B R,Fangneng Zhan,Tim Weyrich,Bernd Bickel,Hanspeter Pfister,Wojciech Matusik,Thabo Beeler,Mohamed Elgharib,Marc Habermann,Christian Theobalt

Main category: cs.CV

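TL;DR: This paper proposes 3DPR, an image-based relighting model that learns generative priors from multi-view OLAT images captured in a light stage to relight faces from a monocular portrait, outperforming previous methods in identity preservation and in lighting effects such as specularities, self-shadows, and subsurface scattering.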

Details

Motivation: Rendering novel, relit views from a monocular portrait image is a severely underconstrained problem. Traditional graphics methods explicitly decompose the image into geometry, material, and lighting, but are limited by their model assumptions and parameterizations, so a more flexible, higher-quality approach to realistic relighting is needed. Method: 3DPR uses the latent space of a pre-trained generative head model as a strong prior over face geometry. An encoder embeds the input portrait in this latent manifold; a triplane-based reflectance network, trained on light stage data, learns high-frequency face reflectance and synthesizes OLAT images in the latent space, which are combined according to an HDRI environment map to perform image-based relighting. Result: In quantitative and qualitative evaluations, 3DPR outperforms previous methods, particularly in preserving identity and in capturing lighting effects such as specularities, self-shadows, and subsurface scattering. The authors also build a large-scale multi-view 4K OLAT dataset of 139 subjects, used for training and validation. Conclusion: By combining a generative-model prior with data-driven reflectance learning from light stage captures, 3DPR effectively addresses the underconstrained nature of monocular portrait relighting and achieves state-of-the-art realism and detail preservation. Abstract: Rendering novel, relit views of a human head, given a monocular portrait image as input, is an inherently underconstrained problem. The traditional graphics solution is to explicitly decompose the input image into geometry, material and lighting via differentiable rendering; but this is constrained by the multiple assumptions and approximations of the underlying models and parameterizations of these scene components. We propose 3DPR, an image-based relighting model that leverages generative priors learnt from multi-view One-Light-at-A-Time (OLAT) images captured in a light stage. We introduce a new diverse and large-scale multi-view 4K OLAT dataset of 139 subjects to learn a high-quality prior over the distribution of high-frequency face reflectance. We leverage the latent space of a pre-trained generative head model that provides a rich prior over face geometry learnt from in-the-wild image datasets. The input portrait is first embedded in the latent manifold of such a model through an encoder-based inversion process. Then a novel triplane-based reflectance network trained on our lightstage data is used to synthesize high-fidelity OLAT images to enable image-based relighting. Our reflectance network operates in the latent space of the generative head model, crucially enabling a relatively small number of lightstage images to train the reflectance model. Combining the generated OLATs according to a given HDRI environment map yields physically accurate environmental relighting results. Through quantitative and qualitative evaluations, we demonstrate that 3DPR outperforms previous methods, particularly in preserving identity and in capturing lighting effects such as specularities, self-shadows, and subsurface scattering. Project Page: https://vcai.mpi-inf.mpg.de/projects/3dpr/
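
The final step of the abstract above, combining OLAT images according to an environment map, is commonly done as a weighted sum: each OLAT image is scaled by the RGB radiance that the environment assigns to its light and the results are added. The sketch below is a minimal illustration of that general idea, not the 3DPR implementation. The function name `relight`, the nested-list image layout, and the toy weights are assumptions, and the step that derives per-light weights from an HDRI map is taken as given.

```python
from typing import List, Sequence

Pixel = List[float]        # [r, g, b]
Image = List[List[Pixel]]  # rows of pixels


def relight(olats: Sequence[Image], light_rgb: Sequence[Sequence[float]]) -> Image:
    """Weighted sum of OLAT images.

    light_rgb[i] is the RGB radiance assumed for light i; sampling it from
    an HDRI environment map is outside the scope of this sketch.
    """
    if not olats or len(olats) != len(light_rgb):
        raise ValueError("need one RGB weight per OLAT image")
    height, width = len(olats[0]), len(olats[0][0])
    out = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for image, weight in zip(olats, light_rgb):
        for y in range(height):
            for x in range(width):
                for c in range(3):
                    out[y][x][c] += weight[c] * image[y][x][c]
    return out


# Hypothetical toy data: two 1x2 OLAT images and one RGB weight per light.
olat_a = [[[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]]
olat_b = [[[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]]
relit = relight([olat_a, olat_b], [[1.0, 0.5, 0.0], [0.0, 0.0, 2.0]])
```

Because the sum is linear in the weights, changing the environment only requires new weights, not new OLAT images.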

[136] Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

Joongwon Chae,Lihui Luo,Xi Yuan,Dongmei Yu,Zhenglin Chen,Lian Zhang,Peiwu Qin

Main category: cs.CV

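TL;DR: This paper proposes Memory-SAM, a training-free, human-prompt-free pipeline that uses DINOv3 features and FAISS retrieval to automatically generate effective prompts from a small set of prior cases, guiding SAM2 to segment the tongue accurately, with clear gains under real-world conditions.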

Details

Motivation: Accurate tongue segmentation is crucial for traditional Chinese medicine (TCM) analysis, but existing supervised models depend on large annotated datasets, while SAM-family models still require human prompts, which limits their practical use. Method: Using dense DINOv3 features and FAISS retrieval over a small memory of prior cases, the pipeline extracts mask-constrained correspondences between the query image and the retrieved exemplar and turns them into foreground/background point prompts that guide SAM2, without manual clicks or model fine-tuning. Result: Evaluated on 600 expert-annotated images (300 controlled, 300 in-the-wild), Memory-SAM achieves an mIoU of 0.9863 on the mixed test split, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839), with especially clear gains in real-world scenes. Conclusion: Retrieval-to-prompt enables data-efficient, robust segmentation of irregular tongue boundaries and shows good potential for practical use. Abstract: Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
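
The retrieval-to-prompt idea in the abstract above can be illustrated with a small sketch: retrieve the most similar stored case, match each query patch to its closest exemplar patch, and copy that patch's foreground/background label as a point prompt. This is a simplified illustration under stated assumptions, not the Memory-SAM code. Brute-force cosine search stands in for FAISS, hand-written vectors stand in for DINOv3 features, the names `cosine`, `nearest`, and `point_prompts` and the memory layout are hypothetical, and the call that passes prompts to SAM2 is omitted.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: Vector, candidates: Sequence[Vector]) -> int:
    """Index of the most similar candidate (brute force, standing in for FAISS)."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


def point_prompts(
    query_patches: Sequence[Vector],
    exemplar_patches: Sequence[Vector],
    exemplar_mask: Sequence[int],
) -> List[Tuple[int, int]]:
    """Pair each query patch index with the label (1 = foreground,
    0 = background) of its best-matching exemplar patch."""
    return [
        (i, exemplar_mask[nearest(patch, exemplar_patches)])
        for i, patch in enumerate(query_patches)
    ]


# Hypothetical toy memory: a global feature, patch features, and a
# patch-level mask for each stored case.
memory = [
    {"global": [1.0, 0.0], "patches": [[1.0, 0.0], [0.0, 1.0]], "mask": [1, 0]},
    {"global": [0.0, 1.0], "patches": [[0.6, 0.8], [0.8, 0.6]], "mask": [0, 1]},
]
query_global = [0.9, 0.1]
query_patches = [[0.1, 0.9], [0.9, 0.2]]

best = nearest(query_global, [case["global"] for case in memory])
prompts = point_prompts(query_patches, memory[best]["patches"], memory[best]["mask"])
```

In a real pipeline the patch indices would be mapped back to pixel coordinates before being used as point prompts.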

[137] BLIP3o-NEXT: Next Frontier of Native Image Generation

Jiuhai Chen,Le Xue,Zhiyang Xu,Xichen Pan,Shusheng Yang,Can Qin,An Yan,Honglu Zhou,Zeyuan Chen,Lifu Huang,Tianyi Zhou,Junnan Li,Silvio Savarese,Caiming Xiong,Ran Xu

Main category: cs.CV

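TL;DR: BLIP3o-NEXT is an open-source foundation model that unifies text-to-image generation and image editing. By combining autoregressive and diffusion architectures, it achieves significant improvements in image generation quality and editing capability.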

Details

Motivation: To advance native image generation, address current challenges in image editing and instruction following, and examine how architecture, training methods, and data affect model performance. Method: An Autoregressive + Diffusion hybrid architecture: an autoregressive model generates discrete image tokens conditioned on multimodal inputs, and its hidden states serve as conditioning signals for a diffusion model that generates high-fidelity images. Performance is further improved with reinforcement learning, post-training, and a data engine. Result: On multiple text-to-image and image-editing benchmarks, BLIP3o-NEXT outperforms existing models, with notable improvements in the coherence and realism of generated images and in editing consistency. Conclusion: Data quality and scale are the decisive factors that set the upper bound of model performance, while a sound architecture combined with reinforcement learning and post-training on high-quality data can effectively unify image generation and editing, moving native image generation to a new stage. Abstract: We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations on various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.

[138] BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models

Kaushitha Silva,Mansitha Eashwara,Sanduni Ubayasiri,Ruwan Tennakoon,Damayanthi Herath

Main category: cs.CV

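TL;DR: BiomedXPro is an evolutionary prompt-optimization framework for biomedical vision-language models. It uses a large language model to generate diverse, interpretable natural-language prompt pairs, improving disease diagnosis performance, especially in few-shot settings, and strengthens model trustworthiness through semantic alignment with clinical features.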

Details

Motivation: Existing prompt-optimization techniques produce either latent vectors or single textual prompts, which lack interpretability and fail to reflect how clinical diagnosis integrates observations along multiple dimensions, limiting their trustworthiness and use in high-stakes medical settings. Method: The BiomedXPro framework uses a large language model as both a biomedical knowledge extractor and an adaptive optimizer, and applies an evolutionary algorithm to automatically build a diverse set of interpretable natural-language prompt pairs for disease diagnosis. Result: Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods in few-shot settings, and that the generated prompts are strongly semantically aligned with significant clinical features. Conclusion: By producing a diverse, interpretable set of prompts, BiomedXPro gives model predictions a verifiable basis and is an important step toward more trustworthy, clinically aligned AI systems. Abstract: The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model's performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.

[139] LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal

Shr-Ruei Tsai,Wei-Cheng Chang,Jie-Ying Lee,Chih-Hai Su,Yu-Lun Liu

Main category: cs.CV

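TL;DR: This paper proposes LightsOut, a diffusion-based outpainting framework that enhances single image flare removal by reconstructing off-frame light sources.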

Details

Motivation: Existing single image flare removal methods perform poorly when off-frame light sources are incomplete or absent, which affects critical tasks such as object detection and autonomous driving. Method: A multitask regression module and a LoRA fine-tuned diffusion model are used to produce realistic, physically consistent outpainting. Result: Experiments show that LightsOut consistently improves the performance of existing flare removal methods across challenging scenarios, without additional retraining. Conclusion: LightsOut can serve as a universally applicable plug-and-play preprocessing step that makes existing SIFR methods more robust to missing light sources. Abstract: Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and a LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution. Project page: https://ray-1026.github.io/lightsout/

[140] Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery

Jie-Ying Lee,Yi-Ruei Liu,Shr-Ruei Tsai,Wei-Cheng Chang,Chung-Ho Wu,Jiewen Chan,Zhenjun Zhao,Chieh Hubert Lin,Yu-Lun Liu

Main category: cs.CV

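TL;DR: This paper proposes Skyfall-GS, the first framework to create city-block-scale 3D scenes without costly 3D annotations. It combines satellite imagery with a diffusion model to synthesize geometrically complete, photorealistic 3D urban scenes that can be explored in real time.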

Details

Motivation: The lack of large-scale, high-quality real-world 3D scans makes it hard to train generalizable generative models that create large, explorable, and geometrically accurate 3D urban scenes. Method: The Skyfall-GS framework uses satellite imagery to supply coarse geometry and an open-domain diffusion model to generate high-quality close-up appearances, with a curriculum-driven iterative refinement strategy that progressively improves geometric completeness and texture realism. Result: Experiments show that Skyfall-GS outperforms state-of-the-art methods in cross-view geometric consistency and texture realism, and supports real-time, immersive 3D exploration. Conclusion: By combining satellite imagery with a diffusion model, Skyfall-GS generates large-scale 3D urban scenes without 3D annotations, offering an efficient, high-quality solution for virtual reality and embodied applications. Abstract: Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing the readily available satellite imagery that supplies realistic coarse geometry and the open-domain diffusion model for creating high-quality close-up appearances. We propose Skyfall-GS, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: https://skyfall-gs.jayinnn.dev/

Table of Contents

cs.CL

[1] Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective

[2] Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek

[3] A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling

[4] Continual Learning via Sparse Memory Finetuning

[5] Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks

[6] Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis

[7] FarsiMCQGen: a Persian Multiple-choice Question Generation Framework

[8] Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning

[9] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

[10] Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning

[11] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

[12] TraceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration

[13] TACL: Threshold-Adaptive Curriculum Learning Strategy for Enhancing Medical Text Understanding

[14] Exemplar-Guided Planing: Enhanced LLM Agent for KGQA

[15] Automatic essay scoring: leveraging Jaccard coefficient and Cosine similarity with n-gram variation in vector space model approach

[16] Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

[17] Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

[18] AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction

[19] Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics

[20] When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

[21] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

[22] VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency

[23] Large-scale User Game Lifecycle Representation Learning

[24] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

[25] When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

[26] Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering

[27] CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs

[28] DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

[29] Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References?

[30] From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

[31] Latent Reasoning in LLMs as a Vocabulary-Space Superposition

[32] MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval

[33] TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

[34] Rethinking Cross-lingual Gaps from a Statistical Viewpoint

[35] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation

[36] KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models

[37] Finetuning LLMs for EvaCun 2025 token prediction shared task

[38] From Ghazals to Sonnets: Decoding the Polysemous Expressions of Love Across Languages

[39] BiMax: Bidirectional MaxSim Score for Document-Level Alignment

[40] The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works

[41] HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

[42] Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection

[43] Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth

[44] Attention Sinks in Diffusion Language Models

[45] LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

[46] On Non-interactive Evaluation of Animal Communication Translators

[47] Emergence of Linear Truth Encodings in Language Models

[48] Paper2Web: Let's Make Your Paper Alive!

[49] Enhanced Sentiment Interpretation via a Lexicon-Fuzzy-Transformer Framework

[50] SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

[51] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

[52] PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction

cs.CV

[53] GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments

[54] PC-UNet: An Enforcing Poisson Statistics U-Net for Positron Emission Tomography Denoising

[55] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

[56] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos

[57] NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks

[58] Constantly Improving Image Models Need Constantly Improving Benchmarks

[59] LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models

[60] MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning

[61] Composition-Grounded Instruction Synthesis for Visual Reasoning

[62] Generalized Dynamics Generation towards Scannable Physical World Model

[63] Comprehensive language-image pre-training for 3D medical image understanding

[64] Directional Reasoning Injection for Fine-Tuning MLLMs

[65] A solution to generalized learning from small training sets found in everyday infant experiences

[66] SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images

[67] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation

[68] Deep generative priors for 3D brain analysis

[69] Fourier Transform Multiple Instance Learning for Whole Slide Image Classification

[70] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

[71] Train a Unified Multimodal Data Quality Classifier with Synthetic Data

[72] Hyperparameter Optimization and Reproducibility in Deep Learning Model Training

[73] Salient Concept-Aware Generative Data Augmentation

[74] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records

[75] The Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads

[76] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

[77] CuSfM: CUDA-Accelerated Structure-from-Motion