cs.CL

[1] TokenShapley: Token Level Context Attribution with Shapley Value

Yingtai Xiao,Yuqing Zhu,Sirat Samyoun,Wanrong Zhang,Jiachen T. Wang,Jian Du

Main category: cs.CL

TL;DR: TokenShapley improves token-level data attribution for large language models using Shapley values and KNN-based retrieval, achieving better accuracy than existing methods.

Details Motivation: Verifying the correctness of large language model responses is challenging, especially at the keyword level for specific elements like numbers, years, or names. Method: TokenShapley uses a precomputed datastore for contextual retrieval and computes Shapley values to quantify token importance. Result: Extensive evaluations on four benchmarks show that TokenShapley achieves an 11-23% improvement in accuracy in token-level attribution. Conclusion: TokenShapley provides a fine-grained data attribution approach by combining Shapley value-based data attribution with KNN-based retrieval techniques, outperforming state-of-the-art baselines in token-level attribution. Abstract: Large language models (LLMs) demonstrate strong capabilities in in-context learning, but verifying the correctness of their generated responses remains a challenge. Prior work has explored attribution at the sentence level, but these methods fall short when users seek attribution for specific keywords within the response, such as numbers, years, or names. To address this limitation, we propose TokenShapley, a novel token-level attribution method that combines Shapley value-based data attribution with KNN-based retrieval techniques inspired by recent advances in KNN-augmented LLMs. By leveraging a precomputed datastore for contextual retrieval and computing Shapley values to quantify token importance, TokenShapley provides a fine-grained data attribution approach. Extensive evaluations on four benchmarks show that TokenShapley outperforms state-of-the-art baselines in token-level attribution, achieving an 11-23% improvement in accuracy.
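The Shapley value that TokenShapley builds on can be illustrated with the textbook exact formula. The sketch below is not the paper's method: the KNN-based datastore and retrieval are not described in this digest, so `toy_utility` is a purely hypothetical stand-in that returns 1 when the context supports the answer, and the token list is invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player under a set-valued utility function."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        # Sum the weighted marginal contribution of p over every coalition of the others.
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def toy_utility(subset):
    # Hypothetical utility (an assumption, not from the paper): the answer is
    # supported only when both "Apollo" and "1969" are present in the context.
    return 1.0 if {"Apollo", "1969"} <= subset else 0.0


tokens = ["Apollo", "1969", "the"]
phi = shapley_values(tokens, toy_utility)
for token, value in phi.items():
    print(f"{token}: {value:.2f}")
```

The two tokens that jointly support the answer split the credit equally and the filler token receives none. Exact enumeration is exponential in the number of tokens, so it is only practical for very small contexts.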

[2] User Behavior Prediction as a Generic, Robust, Scalable, and Low-Cost Evaluation Strategy for Estimating Generalization in LLMs

Sougata Saha,Monojit Choudhury

Main category: cs.CL

TL;DR: This paper proposes user behavior prediction as a robust alternative to measure the generalization ability of Large Language Models (LLMs), introducing a novel framework tested on recommendation datasets.

Details Motivation: Measuring generalization in LLMs is difficult due to data contamination, and traditional knowledge-retrieval and reasoning tasks are not suitable for this purpose. Method: The authors introduce a new framework that leverages user behavior prediction as a proxy for measuring generalization in LLMs. They test this approach on movie and music recommendation datasets using GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct models. Result: Results show that GPT-4o outperforms both GPT-4o-mini and Llama-3.1-8B-Instruct in user behavior prediction tasks, although all models have significant room for improvement, particularly Llama. Conclusion: User behavior prediction offers a theoretically sound, scalable, and robust method for evaluating the generalization capabilities of LLMs compared to traditional knowledge-retrieval and reasoning tasks. Abstract: Measuring the generalization ability of Large Language Models (LLMs) is challenging due to data contamination. As models grow and computation becomes cheaper, ensuring tasks and test cases are unseen during training phases will become nearly impossible. We argue that knowledge-retrieval and reasoning tasks are not ideal for measuring generalization, as LLMs are not trained for specific tasks. Instead, we propose user behavior prediction, also a key aspect of personalization, as a theoretically sound, scalable, and robust alternative. We introduce a novel framework for this approach and test it on movie and music recommendation datasets for GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct. Results align with our framework's predictions, showing GPT-4o outperforms GPT-4o-mini and Llama, though all models have much room for improvement, especially Llama.

[3] An Adaptive Supervised Contrastive Learning Framework for Implicit Sexism Detection in Digital Social Networks

Mohammad Zia Ur Rehman,Aditya Shah,Nagendra Kumar

Main category: cs.CL

TL;DR: This paper proposes ASCEND, an improved framework for detecting implicit sexist language on social media using advanced embedding techniques and feature fusion, outperforming current methods significantly.

Details Motivation: Implicit sexism on social media is often overlooked by traditional detection techniques, prompting the need for a more robust framework that can capture subtle cues in language. Method: ASCEND uses threshold-based contrastive learning to refine embeddings, combines contrastive loss with cross-entropy loss for classification, and enhances textual features using word-level attention along with sentiment, emotion, and toxicity features. Result: On the EXIST2021 and MLSC datasets, ASCEND achieved average Macro F1 improvements of 9.86%, 29.63%, and 32.51% across tasks, demonstrating its effectiveness. Conclusion: The proposed ASCEND framework effectively enhances the detection of implicit sexist content in social media by refining embedding spaces and incorporating multi-feature learning, showing significant improvements over existing methods. Abstract: The global reach of social media has amplified the spread of hateful content, including implicit sexism, which is often overlooked by conventional detection methods. In this work, we introduce an Adaptive Supervised Contrastive lEarning framework for implicit sexism detectioN (ASCEND). A key innovation of our method is the incorporation of threshold-based contrastive learning: by computing cosine similarities between embeddings, we treat sample pairs as positive only if their similarity exceeds a learnable threshold. This mechanism refines the embedding space by robustly pulling together representations of semantically similar texts while pushing apart dissimilar ones, thus reducing false positives and negatives. The final classification is achieved by jointly optimizing a contrastive loss with a cross-entropy loss. Textual features are enhanced through a word-level attention module. Additionally, we employ sentiment, emotion, and toxicity features. Evaluations on the EXIST2021 and MLSC datasets demonstrate that ASCEND significantly outperforms existing methods, with average Macro F1 improvements of 9.86%, 29.63%, and 32.51% across multiple tasks, highlighting its efficacy in capturing the subtle cues of implicit sexist language.
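The thresholded pair selection described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not ASCEND itself: the threshold is a fixed number here (the paper learns it), the same-label requirement is assumed from the supervised setting, and the embeddings and labels are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def positive_pairs(embeddings, labels, threshold):
    """Return index pairs treated as positives.

    Assumption: a pair counts as positive when both samples share a label
    AND their cosine similarity exceeds the threshold.
    """
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            same_label = labels[i] == labels[j]
            if same_label and cosine(embeddings[i], embeddings[j]) > threshold:
                pairs.append((i, j))
    return pairs


# Toy 2-D embeddings and binary labels (illustrative values only).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
labels = [1, 1, 1, 0]
pairs = positive_pairs(embeddings, labels, threshold=0.8)
print(pairs)
```

Sample 2 shares a label with samples 0 and 1 but points in a different direction, so it is excluded from the positives; sample 3 is close to sample 0 but carries a different label.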

[4] Beyond classical and contemporary models: a transformative AI framework for student dropout prediction in distance learning using RAG, prompt engineering, and cross-modal fusion

Miloud Mihoubi,Meriem Zerkouk,Belkacem Chikhaoui

Main category: cs.CL

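TL;DR: This paper introduces a new AI framework that combines several advanced techniques to improve the accuracy of student dropout prediction in distance education and to provide interpretable intervention recommendations.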

Details Motivation: Traditional machine learning models that predict student dropout in distance education struggle to capture the emotional and contextual factors in student interactions, which are essential for accurate prediction. Method: The framework combines Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic information. Result: Evaluated on a longitudinal dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Conclusion: The paper proposes an AI framework that integrates RAG, prompt engineering, and cross-modal attention fusion to markedly improve dropout-risk prediction in distance education while providing interpretable intervention strategies. Abstract: Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., "isolation," "workload anxiety"). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems.

[5] LCDS: A Logic-Controlled Discharge Summary Generation System Supporting Source Attribution and Expert Review

Cheng Yuan,Xinkai Rui,Yongqi Fan,Yawei Fan,Boyang Zhong,Jiacheng Wang,Weiyan Zhang,Tong Ruan

Main category: cs.CL

TL;DR: This paper introduces LCDS, a Logic-Controlled Discharge Summary generation system that improves the accuracy and reliability of LLM-generated summaries by leveraging source mapping and logical rules.

Details Motivation: LLMs face hallucination issues when generating discharge summaries from long-form EMR data, leading to inaccurate or fabricated content. This necessitates a system that improves reliability and traceability in summary generation. Method: The study proposes LCDS, which constructs a source mapping table using textual similarity between EMRs and discharge summaries. It also incorporates logical rules to generate accurate, field-specific summaries and allows for source attribution. Result: LCDS generates reliable silver discharge summaries with source attribution support, enabling expert review and feedback. The expert-reviewed golden summaries are then recorded for incremental fine-tuning of LLMs, with the aim of improving them over time. Conclusion: LCDS effectively addresses hallucination issues in discharge summary generation by incorporating logical rules and source attribution, improving the reliability and adaptability of LLMs in clinical settings. Abstract: Despite the remarkable performance of Large Language Models (LLMs) in automated discharge summary generation, they still suffer from hallucination issues, such as generating inaccurate content or fabricating information without valid sources. In addition, electronic medical records (EMRs) typically consist of long-form data, making it challenging for LLMs to attribute the generated content to the sources. To address these challenges, we propose LCDS, a Logic-Controlled Discharge Summary generation system. LCDS constructs a source mapping table by calculating textual similarity between EMRs and discharge summaries to constrain the scope of summarized content. Moreover, LCDS incorporates a comprehensive set of logical rules, enabling it to generate more reliable silver discharge summaries tailored to different clinical fields. Furthermore, LCDS supports source attribution for generated content, allowing experts to efficiently review, provide feedback, and rectify errors. The resulting golden discharge summaries are subsequently recorded for incremental fine-tuning of LLMs. Our project and demo video are in the GitHub repository https://github.com/ycycyc02/LCDS.
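A source mapping table of the kind the abstract describes can be approximated with any textual similarity measure. The sketch below is only an illustration under assumptions: it uses Jaccard token overlap (the similarity measure LCDS actually uses is not stated here), and the section names, field names, sample texts, and the `min_sim` cutoff are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_source_map(emr_sections, summary_fields, min_sim=0.3):
    """Map each summary field to the EMR sections that resemble it most.

    min_sim is a hypothetical cutoff; sections scoring below it are dropped,
    which constrains what each field is allowed to draw from.
    """
    mapping = {}
    for field, text in summary_fields.items():
        scored = [(name, jaccard(text, body)) for name, body in emr_sections.items()]
        scored.sort(key=lambda item: -item[1])
        mapping[field] = [name for name, score in scored if score >= min_sim]
    return mapping


# Invented example records (not from the paper).
emr_sections = {
    "admission_note": "patient admitted with chest pain and shortness of breath",
    "lab_report": "troponin elevated on admission",
    "nursing_note": "patient resting comfortably overnight",
}
summary_fields = {
    "chief_complaint": "chest pain and shortness of breath",
    "lab_findings": "troponin elevated",
}
mapping = build_source_map(emr_sections, summary_fields)
print(mapping)
```

Each summary field ends up linked to the EMR section it overlaps with, which is also the information a reviewer would need in order to trace a generated sentence back to its source.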

[6] MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents

Ming Gong,Xucheng Huang,Chenghan Yang,Xianhan Peng,Haoxin Wang,Yang Liu,Ling Jiang

Main category: cs.CL

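TL;DR: This paper introduces MindFlow, an open-source multimodal LLM agent designed for e-commerce. By integrating memory, decision-making, and action modules and adopting a modular "MLLM-as-Tool" strategy, it achieves substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs.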

Details Motivation: Current large language models remain constrained in complex, multimodal scenarios, particularly in e-commerce customer service. Method: MindFlow is built on the CoALA framework and adopts a modular "MLLM-as-Tool" strategy for effective visual-textual reasoning. Result: Evaluated through online A/B testing and simulation-based ablation, MindFlow shows a 93.53% relative improvement in real-world deployments. Conclusion: MindFlow is an open-source multimodal LLM agent tailored for e-commerce that integrates memory, decision-making, and action modules, substantially improving complex-query handling and user satisfaction while reducing operational costs. Abstract: Recent advances in large language models (LLMs) have enabled new applications in e-commerce customer service. However, their capabilities remain constrained in complex, multimodal scenarios. We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. Built on the CoALA framework, it integrates memory, decision-making, and action modules, and adopts a modular "MLLM-as-Tool" strategy for effective visual-textual reasoning. Evaluated via online A/B testing and simulation-based ablation, MindFlow demonstrates substantial gains in handling complex queries, improving user satisfaction, and reducing operational costs, with a 93.53% relative improvement observed in real-world deployments.

[7] LoRA-Augmented Generation (LAG) for Knowledge-Intensive Language Tasks

William Fleshman,Benjamin Van Durme

Main category: cs.CL

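TL;DR: This paper proposes LoRA-Augmented Generation (LAG), which efficiently selects and applies experts from large libraries of knowledge and task-specific LoRA adapters, outperforming existing data-free methods.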

Details Motivation: The proliferation of fine-tuned language model experts for specific tasks and domains calls for efficient selection and combination methods. Method: The authors propose LoRA-Augmented Generation (LAG), which filters, retrieves, and applies experts on a per-token and per-layer basis. Result: Evaluated on various knowledge-intensive tasks, LAG outperforms existing data-free methods; the authors also explore combining it with other solutions such as RAG. Conclusion: LAG makes effective use of large libraries of knowledge and task-specific LoRA adapters without additional training or access to data, and compares favorably with existing data-free methods. Abstract: The proliferation of fine-tuned language model experts for specific tasks and domains signals the need for efficient selection and combination methods. We propose LoRA-Augmented Generation (LAG) for leveraging large libraries of knowledge and task-specific LoRA adapters. LAG requires no additional training or access to data, and efficiently filters, retrieves, and applies experts on a per-token and layer basis. We evaluate LAG on various knowledge-intensive tasks, achieving superior performance over existing data-free methods. We explore scenarios where additional data is available, demonstrating LAG's compatibility with alternative solutions such as retrieval-augmented generation (RAG).

[8] On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study

Riccardo Alberghi,Elizaveta Demyanenko,Luca Biggio,Luca Saglietti

Main category: cs.CL

TL;DR: Training language models on reasoning traces with backtracking improves their ability to generalize, provided the traces are coherent and incremental, aiding optimization of the training signal.

Details Motivation: To understand how reasoning in large language models can be improved by analyzing the impact of structured, incremental reasoning traces and redundancy on model performance. Method: Decoder-only transformers were trained on question-trace-answer triples using a custom tokenizer in shortest-path tasks involving layered graphs. The comparison was made between optimal bottom-up dynamic programming traces and longer, valid backtracking traces. Result: Models trained on inefficient but coherent traces generalized better to unseen graphs than those trained on optimal traces. Injecting arbitrary redundancy did not help and sometimes hurt performance. Conclusion: Training models on inefficient traces improves generalization compared to optimal ones, as long as the traces are coherent and incremental, enhancing the training signal optimization. Abstract: Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This benefit is not due to length alone: injecting arbitrary redundancy into reasoning traces fails to help and can even hurt performance. Instead, we find that generalization correlates with the model's confidence in next-token prediction, suggesting that long, coherent, and locally incremental traces make the training signal easier to optimize.
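For readers unfamiliar with the task, a bottom-up dynamic program for shortest paths in a layered graph might look like the sketch below. The graph encoding, the function name, and the `(node, cost)` trace format are assumptions made for illustration; the paper's tokenizer and exact trace format are not given in this digest.

```python
def shortest_path_layered(layers, weights):
    """Bottom-up DP over a layered graph.

    layers:  list of lists of node names; edges only connect consecutive layers.
    weights: dict mapping (u, v) to the cost of edge u -> v.
    Returns (cost, path, trace), where trace lists (node, best_cost) in the
    order the DP resolves them (a hypothetical trace format).
    """
    inf = float("inf")
    # best[v] = (cheapest cost from v to the last layer, next node on that path)
    best = {v: (0, None) for v in layers[-1]}
    trace = []
    for depth in range(len(layers) - 2, -1, -1):
        for u in layers[depth]:
            cost, nxt = inf, None
            for v in layers[depth + 1]:
                if (u, v) in weights and weights[(u, v)] + best[v][0] < cost:
                    cost, nxt = weights[(u, v)] + best[v][0], v
            best[u] = (cost, nxt)
            trace.append((u, cost))
    # Start from the cheapest node in the first layer and follow the pointers.
    start = min(layers[0], key=lambda u: best[u][0])
    path, node = [], start
    while node is not None:
        path.append(node)
        node = best[node][1]
    return best[start][0], path, trace


layers = [["s"], ["a", "b"], ["t"]]
weights = {("s", "a"): 1, ("s", "b"): 4, ("a", "t"): 5, ("b", "t"): 1}
cost, path, trace = shortest_path_layered(layers, weights)
print(cost, path, trace)
```

Every node is resolved exactly once here, which is what makes such a trace "optimal" in length; a backtracking search over the same graph would revisit nodes and produce the longer traces the abstract contrasts it with.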

[9] EduCoder: An Open-Source Annotation System for Education Transcript Data

Guanzhong Pan,Mei Tan,Hyunji Nam,Lucía Langlois,James Malamut,Liliana Deonizio,Dorottya Demszky

Main category: cs.CL

TL;DR: The paper introduces EduCoder, a specialized tool for annotating educational dialogues, offering collaborative codebook development, integration of multiple annotation types, and improved data reliability through comparative analysis.

Details Motivation: There is a lack of tools addressing the complexities of coding educational dialogue transcripts, such as defining codebooks for pedagogical features, supporting various coding types, and contextualizing utterances with external features. Method: The authors designed EduCoder, a platform for collaborative definition of complex codebooks, incorporating categorical and open-ended annotations along with contextual materials. It also provides side-by-side comparison of multiple annotators' responses. Result: EduCoder successfully addresses the challenges in annotating educational dialogues by enabling collaborative codebook creation, integrating categorical and open-ended annotations, and improving data reliability through comparison and calibration of annotations. Conclusion: EduCoder is an effective domain-specialized tool that facilitates utterance-level annotation of educational dialogue, enhancing collaboration and data reliability. Abstract: We introduce EduCoder, a domain-specialized tool designed to support utterance-level annotation of educational dialogue. While general-purpose text annotation tools for NLP and qualitative research abound, few address the complexities of coding education dialogue transcripts, with their diverse teacher-student and peer interactions. Common challenges include defining codebooks for complex pedagogical features, supporting both open-ended and categorical coding, and contextualizing utterances with external features, such as the lesson's purpose and the pedagogical value of the instruction. EduCoder is designed to address these challenges by providing a platform for researchers and domain experts to collaboratively define complex codebooks based on observed data. It incorporates both categorical and open-ended annotation types along with contextual materials. Additionally, it offers a side-by-side comparison of multiple annotators' responses, allowing comparison and calibration of annotations with others to improve data reliability. The system is open-source, with a demo video available.

[10] The Generalization Ridge: Information Flow in Natural Language Generation

Ruidi Chang,Chunyuan Deng,Hanjie Chen

Main category: cs.CL

TL;DR: This paper introduces InfoRidge to analyze how task-relevant information flows through transformer layers during training, revealing that intermediate layers form a generalization ridge critical for model performance under distribution shift.

Details Motivation: While intermediate layers are known to provide more generalizable representations than final layers, the emergence and propagation of this generalization ability during training is not well understood. Method: The authors propose InfoRidge, an information-theoretic framework to track predictive information across model depth, and introduce residual scaling coefficients to assess layer importance. Result: Experiments show a non-monotonic trend where predictive information peaks in upper-middle layers (forming a generalization ridge) and declines in final layers. Under distribution shift, models increasingly rely on ridge layers by downweighting final layers. Conclusion: Transformer-based language models synthesize task-relevant information through internal mechanisms that evolve during training, with intermediate layers playing a critical role in generalization. Abstract: Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG) tasks, yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information (the mutual information between hidden representations and target outputs) varies across depth. Estimating this quantity enables us to trace the flow of task-relevant information throughout the model during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in upper-middle layers, forming a generalization ridge, before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we introduce residual scaling coefficients (trainable scalar parameters applied to each residual block), which serve as functional probes for assessing the relative importance of individual transformer layers. These coefficients reveal that, under distribution shift, models downweight final layers and increasingly rely on ridge layers, highlighting their role in generalization. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.
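The idea of a per-block residual scaling coefficient can be shown with a forward pass in plain Python. This is a sketch under assumptions: the coefficient is taken to multiply the block's output before it is added back to the residual stream (the exact placement is not stated in this digest), the two "blocks" are toy functions rather than transformer layers, and the coefficients are fixed numbers rather than trained parameters.

```python
def forward(x, blocks, alphas):
    """Residual stack where block l contributes alphas[l] * blocks[l](h).

    Assumption: h <- h + alpha_l * f_l(h); alpha_l = 0 bypasses the block,
    alpha_l = 1 recovers the ordinary residual connection.
    """
    h = list(x)
    for block, alpha in zip(blocks, alphas):
        update = block(h)
        h = [hi + alpha * ui for hi, ui in zip(h, update)]
    return h


# Toy stand-ins for residual blocks (illustrative only).
blocks = [
    lambda h: [2.0 * v for v in h],
    lambda h: [v + 1.0 for v in h],
]

x = [1.0, -1.0]
full = forward(x, blocks, alphas=[1.0, 1.0])
skipped_last = forward(x, blocks, alphas=[1.0, 0.0])
print(full, skipped_last)
```

Reading the coefficients after training would then indicate how much each layer contributes: a value near zero for a final layer corresponds to the downweighting the abstract reports under distribution shift.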

[11] Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences

Guillem Ramírez,Alexandra Birch,Ivan Titov

Main category: cs.CL

TL;DR: This paper explores the use of privacy profiles to enable users to maintain control over their data when using commercial APIs for large language models. It introduces the PEEP dataset and highlights the need for improved model understanding of user-defined privacy preferences.

Details Motivation: Users often have to expose their data to service providers when using commercial APIs for large language models (LLMs). This paper aims to explore how users can maintain control over their data through the use of privacy profiles. Method: The paper introduces PEEP, a multilingual dataset of real user queries annotated to mark private content and paired with synthetic privacy profiles. A framework is built where a local model uses privacy profiles to rewrite queries before sending them to an external model. Result: Experiments show that lightweight LLMs can follow privacy instructions to some extent but face consistent challenges in fully understanding and applying user-defined privacy preferences. Conclusion: The paper concludes that while lightweight LLMs can partially follow privacy instructions, there is a need for models that better understand and comply with user-defined privacy preferences. Abstract: Large language models (LLMs) are primarily accessed via commercial APIs, but this often requires users to expose their data to service providers. In this paper, we explore how users can stay in control of their data by using privacy profiles: simple natural language instructions that say what should and should not be revealed. We build a framework where a local model uses these instructions to rewrite queries, only hiding details deemed sensitive by the user, before sending them to an external model, thus balancing privacy with performance. To support this research, we introduce PEEP, a multilingual dataset of real user queries annotated to mark private content and paired with synthetic privacy profiles. Our experiments with lightweight LLMs show they can follow these instructions to some extent, but also face consistent challenges, highlighting the need for models that better understand and comply with user-defined privacy preferences.

[12] Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning

Jaedong Hwang,Kumar Tanmay,Seok-Jin Lee,Ayush Agrawal,Hamid Palangi,Kumar Ayush,Ila Fiete,Paul Pu Liang

Main category: cs.CL

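TL;DR: This study introduces GeoFact-X and BRIDGE to improve LLM reasoning in multilingual settings, especially for low-resource languages. Combining supervised fine-tuning and reinforcement learning with a language-consistency reward markedly improves cross-lingual reasoning accuracy and consistency.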

Details Motivation: Current LLMs have weak multilingual reasoning abilities in low-resource languages such as Swahili or Thai: they often misinterpret prompts or default to reasoning in English, which undermines factual accuracy, interpretability, and trust. Existing multilingual benchmarks focus only on final answers and overlook whether models actually reason in the target language. Method: The authors propose BRIDGE, a training method that uses supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. They also develop an automatic evaluation protocol based on LLM-as-a-judge. Result: GeoFact-X is a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages (English, Hindi, Japanese, Swahili, and Thai). BRIDGE significantly enhances multilingual reasoning fidelity, and the results indicate that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. Conclusion: GeoFact-X and BRIDGE significantly improve multilingual reasoning fidelity, showing that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. Abstract: Large Language Models (LLMs) have achieved strong performance in domains like mathematics, factual QA, and code generation, yet their multilingual reasoning capabilities in these tasks remain underdeveloped. Especially for low-resource languages such as Swahili or Thai, LLMs can often misinterpret prompts or default to reasoning in English. This implicit bias toward high-resource languages undermines factual accuracy, interpretability, and trust. Current multilingual benchmarks focus only on final answers, overlooking whether models actually reason in the target language. To address this gap, we introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages: English, Hindi, Japanese, Swahili, and Thai. We further propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning with a language-consistency reward to align reasoning with the input language. Finally, we develop an automatic evaluation protocol using LLM-as-a-judge to assess answer correctness and the quality and language consistency of reasoning traces, enabling nuanced and scalable analysis beyond surface-level metrics. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity, demonstrating that reasoning-aware multilingual reinforcement learning is crucial for robust cross-lingual generalization. https://jd730.github.io/projects/GeoFact-X_BRIDGE

[13] "Lost-in-the-Later": Framework for Quantifying Contextual Grounding in Large Language Models

Yufei Tao,Adam Hiatt,Rahul Seetharaman,Ameeta Agrawal

Main category: cs.CL

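TL;DR: This paper proposes CoPE, an evaluation framework that systematically measures how large language models integrate contextual and parametric knowledge across languages. It reveals that models neglect information appearing later in the context and that reasoning models fall short in using context.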

Details Motivation: How large language models prioritize and integrate contextual knowledge and parametric knowledge remains underexplored. Method: The authors introduce the CoPE evaluation framework and use the MultiWikiAtomic dataset to analyze how models integrate contextual and parametric knowledge in open-ended question answering. Result: Models exhibit a "lost-in-the-later" effect, tending to overlook information that appears later in the context; reasoning models and non-reasoning models prompted with chain-of-thought use context even less and fail to mitigate the effect. Conclusion: Prompt-based methods can be designed to leverage input context effectively, and applying CoPE to summarization improves factual grounding and reduces hallucination. Abstract: Large language models are capable of leveraging both contextual and parametric knowledge, but how they prioritize and integrate these sources remains underexplored. We introduce CoPE, a novel evaluation framework that systematically measures contextual knowledge (CK) and parametric knowledge (PK) across models and languages. Using our MultiWikiAtomic dataset in English, Spanish, and Danish, we analyze how large language models (LLMs) integrate context, prioritize information, and incorporate PK in open-ended question answering. Our analysis uncovers a phenomenon we call lost-in-the-later, where LLMs tend to overlook or deprioritize information that appears later in a given context, revealing a strong positional bias that affects contextual grounding. We further find that reasoning models, as well as non-reasoning models prompted with chain-of-thought (CoT), use context even less than non-reasoning models without CoT and fail to mitigate the lost-in-the-later effect. CoT prompting, in particular, results in lower recall and shorter responses, leading to degraded contextual grounding. Based on these insights, we design prompt-based methods to effectively leverage input context. A case study applying CoPE to summarization demonstrates that CK-informed prompting improves factual grounding and reduces hallucination.

[14] Gendered Divides in Online Discussions about Reproductive Rights

Ashwin Rao,Sze Yuh Nina Wang,Kristina Lerman

Main category: cs.CL

TL;DR: This paper examines how gender and location shape abortion discourse online, finding a significant gender gap influenced by regional conservatism, especially after the Dobbs court decision.

Details Motivation: To understand how gender and regional sociopolitical contexts affect public discourse on abortion, particularly following the Dobbs v. Jackson Women's Health ruling. Method: Analysis of nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology, and location. Result: Gender was found to significantly moderate abortion attitudes and emotional expression, creating a growing gender gap in conservative regions regardless of ideology. The Dobbs draft opinion leak mobilized pro-abortion women more in threatened areas. Conclusion: The study concludes that gender and local sociopolitical contexts significantly influence abortion discourse, with a pronounced gender gap in attitudes towards abortion, especially in conservative regions. Abstract: The U.S. Supreme Court's 2022 ruling in Dobbs v. Jackson Women's Health Organization marked a turning point in the national debate over reproductive rights. While the ideological divide over abortion is well documented, less is known about how gender and local sociopolitical contexts interact to shape public discourse. Drawing on nearly 10 million abortion-related posts on X (formerly Twitter) from users with inferred gender, ideology and location, we show that gender significantly moderates abortion attitudes and emotional expression, particularly in conservative regions, and independently of ideology. This creates a gender gap in abortion attitudes that grows more pronounced in conservative regions. The leak of the Dobbs draft opinion further intensified online engagement, disproportionately mobilizing pro-abortion women in areas where access was under threat. These findings reveal that abortion discourse is not only ideologically polarized but also deeply structured by gender and place, highlighting the central role of identity in shaping political expression during moments of institutional disruption.

[15] PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs

Sana Kang,Myeongseok Gwon,Su Young Kwon,Jaewook Lee,Andrew Lan,Bhiksha Raj,Rita Singh

Main category: cs.CL

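TL;DR: This paper introduces PhoniTale, a novel cross-lingual mnemonic generation system that uses large language models (LLMs) and phonological similarity to help second-language learners acquire vocabulary more effectively.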

Details Motivation: Vocabulary acquisition is a major challenge for second-language learners, especially for typologically distant languages, which motivates an effective memory aid. Method: PhoniTale retrieves L1 keyword sequences based on phonological similarity and uses LLMs to generate mnemonics; it is evaluated with automated metrics, human evaluations, and a short-term recall test. Result: PhoniTale performs comparably to human-authored mnemonics, and the authors identify key areas for future improvement in mnemonic quality and methodology. Conclusion: PhoniTale is a promising cross-lingual mnemonic generation system that performs comparably to human-authored mnemonics, though mnemonic quality and methodology still need improvement. Abstract: Vocabulary acquisition poses a significant challenge for second-language (L2) learners, especially when learning typologically distant languages such as English and Korean, where phonological and structural mismatches complicate vocabulary learning. Recently, large language models (LLMs) have been used to generate keyword mnemonics by leveraging similar keywords from a learner's first language (L1) to aid in acquiring L2 vocabulary. However, most of this research has focused on native English speakers learning other languages, rather than the reverse. In this paper, we present PhoniTale, a novel cross-lingual mnemonic generation system that retrieves an L1 keyword sequence based on phonological similarity and uses LLMs to generate mnemonics. We evaluate PhoniTale using both automated metrics and human evaluations, comparing its output to mnemonics created by humans and by previous automated approaches. To assess practical effectiveness, we also conduct a short-term recall test measuring mnemonic helpfulness. Our findings show that PhoniTale performs comparably to human-authored mnemonics. We also highlight key areas for future improvement in mnemonic quality and methodology.

[16] On the Semantics of Large Language Models

Martin Schuele

Main category: cs.CL

TL;DR: This paper explores how well large language models like ChatGPT understand language semantics by analyzing their structure and aligning them with classical philosophical theories.

Details Motivation: The motivation stems from the controversy over whether LLMs truly understand language, particularly at the word and sentence level. Method: The analysis involves examining the inner workings of LLMs and comparing their generated representations to classical semantic theories by Frege and Russell. Result: A more nuanced understanding of the semantic abilities of LLMs was developed through theoretical analysis and model introspection. Conclusion: The study concludes that LLMs have potential semantic capabilities, though their true understanding of language remains a nuanced issue. Abstract: Large Language Models (LLMs) such as ChatGPT demonstrated the potential to replicate human language abilities through technology, ranging from text generation to engaging in conversations. However, it remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level. By examining the inner workings of LLMs and their generated representation of language and by drawing on classical semantic theories by Frege and Russell, we get a more nuanced picture of the potential semantic capabilities of LLMs.

[17] ModelCitizens: Representing Community Voices in Online Safety

Ashima Suvarna,Christina Chance,Hamid Palangi,Sophie Hao,Thomas Hartvigsen,Saadia Gabriel

Main category: cs.CL

TL;DR: This work presents the MODELCITIZENS dataset and accompanying fine-tuned models, highlighting the importance of community-informed annotation and modeling for improving toxic language detection.

Details Motivation: Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. Method: The authors introduce the MODELCITIZENS dataset and augment it with LLM-generated conversational scenarios. They also release LLAMACITIZEN-8B and GEMMACITIZEN-12B, fine-tuned models based on LLaMA and Gemma. Result: State-of-the-art toxicity detection tools underperform on MODELCITIZENS, especially once conversational context is added, while LLAMACITIZEN-8B and GEMMACITIZEN-12B, fine-tuned on MODELCITIZENS, outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Conclusion: Community-informed annotation and modeling are essential for inclusive content moderation. Abstract: Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation.

[18] Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

Jean-Philippe Corbeil,Asma Ben Abacha,George Michalopoulos,Phillip Swazinna,Miguel Del-Agua,Jerome Tremblay,Akila Jeeson Daniel,Cari Bader,Kevin Cho,Pooja Krishnan,Nathan Bodenstab,Thomas Lin,Wenxuan Teng,Francois Beaulieu,Paul Vozila

Main category: cs.CL

TL;DR: This paper explores structured clinical data extraction using LLMs and introduces new datasets to address underexplored NLP tasks in healthcare.

Details Motivation: Structured tabular reporting and medical order extraction tasks are underexplored due to data scarcity and sensitivity, despite their potential to reduce documentation burdens on healthcare providers. Method: The study evaluates open- and closed-weight LLMs on private and open-source clinical datasets, proposing an agentic pipeline for generating realistic nurse dictations. Result: SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction, were released to support further research. Conclusion: The paper concludes that structured extraction from clinical data can be enhanced through proposed agentic pipelines and newly introduced datasets. Abstract: Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.

[19] Enhancing Test-Time Scaling of Large Language Models with Hierarchical Retrieval-Augmented MCTS

Alex ZH Dou,Zhongwei Wan,Dongfei Cui,Xin Wang,Jing Xiong,Haokun Lin,Chaofan Tao,Shen Yan,Mi Zhang

Main category: cs.CL

TL;DR: R2-LLMs is a dual-level retrieval-augmented reasoning framework that does not rely on distillation from more advanced models and improves the performance of large language models on complex reasoning tasks.

Details Motivation: Test-time scaling has become an important direction for language models; R2-LLMs aims to improve LLM reasoning without relying on data distilled from more advanced models. Method: R2-LLMs combines dual-level retrieval-based in-context learning, with a coarse-grained and a fine-grained retrieval stage, and uses a process reward model (PRM) for scoring and for refining decisions. Result: Experiments on the MATH500, GSM8K, and OlympiadBench-TO datasets show a relative improvement of up to 16% over the baselines using LLaMA-3.1-8B. Conclusion: R2-LLMs is a robust hierarchical reasoning-augmentation method that strengthens in-context-level reasoning and delivers substantial gains on complex reasoning tasks. Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, leveraging additional computational resources at inference time to enhance model performance. In this work, we introduce R2-LLMs, a novel and versatile hierarchical retrieval-augmented reasoning framework designed to improve test-time scaling in large language models (LLMs) without requiring distillation from more advanced models to obtain chain-of-thought (CoT) training data. R2-LLMs enhances inference-time generalization by integrating dual-level retrieval-based in-context learning: (1) At the coarse level, our approach extracts abstract templates from complex reasoning problems and retrieves similar problem-answer pairs to facilitate high-level in-context learning; (2) At the fine level, during Monte Carlo Tree Search (MCTS), R2-LLMs efficiently retrieves analogous intermediate solution steps from reference mathematical problem datasets, refining step-wise reasoning with the aid of a process reward model (PRM) for scoring. R2-LLMs is a robust hierarchical reasoning-augmentation method that enhances in-context-level reasoning while seamlessly integrating with step-level tree search methods. Utilizing PRM, it refines both candidate generation and decision-making for improved reasoning accuracy. Empirical evaluations on the MATH500, GSM8K, and OlympiadBench-TO datasets achieve substantial relative improvement with an increase of up to 16% using LLaMA-3.1-8B compared to the baselines, showcasing the effectiveness of our approach in complex reasoning tasks.

[20] Self-Review Framework for Enhancing Instruction Following Capability of LLM

Sihyun Park

Main category: cs.CL

TL;DR: The Re5 framework improves instruction-following performance through structural evaluation and selective revision, while keeping costs under control and preserving output quality.

Details Motivation: To overcome the sharp rise in cost that existing methods incur as the number of data points and revision iterations grows, and to address the degradation in output quality caused by excessive revision. Method: Re5 extracts task and constraint components, performs structural evaluation, and applies fine-grained, constraint-specific content evaluation followed by selective revision. Result: Experiments show that Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, maintains response quality even with a small amount of data, and reaches a 64.24% win rate over the non-revised initial responses. Conclusion: Re5 is an efficient self-evaluation and revision framework that improves instruction-following performance while preserving the quality of the generated content. Abstract: Various techniques have been proposed to improve the adherence of large language models (LLMs) to formatting and instruction constraints. One of the most effective approaches involves utilizing high-quality data generated by powerful models. However, such models often fail to fully comply with complex instructions in a single generation. To address this limitation, iterative revision methods have been introduced. Nevertheless, as the number of data points and revision iterations increases, the associated monetary costs grow significantly. As a resource-efficient alternative, methods have been proposed that leverage high-performance evaluation tools to compensate for the limited self-evaluation capabilities of open-source LLMs. However, these approaches often lead to a degradation in output quality due to excessive revision. To overcome these challenges, we propose Re5, a self-evaluation and revision framework designed to enhance instruction-following performance while preserving the quality of the generated content. Re5 extracts task and constraint components from user instructions, performs structural evaluations to prevent error accumulation, and applies fine-grained constraint-specific content evaluations followed by selective revisions. This process ensures precise and quality-preserving improvements. The final high-quality outputs are used for alignment tuning, enabling long-term alignment improvements through a data-centric iterative refinement loop. Experimental results demonstrate that Re5 achieves instruction-following performance comparable to models trained on data generated by GPT-4o-mini, a high-performance model, even with a small amount of data while maintaining response quality with a 64.24% win rate over the non-revised initial responses. These results validate Re5 as an efficient and effective solution for enhancing instruction adherence with minimal external supervision.

[21] Flipping Knowledge Distillation: Leveraging Small Models' Expertise to Enhance LLMs in Text Matching

Mingzhe Li,Jing Xiang,Qishen Zhang,Kaiyang Wan,Xiuying Chen

Main category: cs.CL

TL;DR: This paper proposes a new knowledge distillation approach where a Large Language Model learns from a Smaller Language Model to enhance text matching performance, combining the benefits of both model types.

Details Motivation: To combine the specialized strengths of small models with the rich semantic understanding of large models, overcoming their architectural differences for better text matching performance. Method: The method involves reinterpreting decoder-only LLMs as encoder-decoder models using LoRA. The encoder generates compressed representations and similarities, which are aligned with the teacher's similarity scores through Margin-aware Contrastive Learning (MCL). Result: Experiments on financial and healthcare benchmarks, as well as real-world applications, confirmed the effectiveness of the proposed paradigm, leading to its full deployment in an online environment. Conclusion: The flipped knowledge distillation paradigm effectively leverages the strengths of both SLMs and LLMs, with the LLM learning from the SLM to achieve improved performance in text matching tasks. Abstract: Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks such as text matching, fine-tuned smaller models often yield more effective domain-specific representations, as they focus on optimizing the similarity of input pairs. To leverage both the specialized strengths of small models and the rich semantic understanding of LLMs, we introduce a flipped knowledge distillation paradigm, where LLM learns from SLM. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models by reinterpreting LLMs in an encoder-decoder manner using LoRA. The encoder generates compressed representations, while the decoder maps them to the output space. During training, the encoder produces representations and their similarities, which are then aligned with the similarity scores produced by the teacher, using our proposed Margin-aware Contrastive Learning (MCL) approach. 
The MCL ensures accurate similarity for both positive and negative pairs, and adaptively handles the internal differences within positive and negative samples. Our paradigm requires only a reasonably good-performing SLM, allowing the LLM to achieve improved performance. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness, and the model has been fully deployed in an online environment.

[22] SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression

Yiqiao Jin,Kartik Sharma,Vineeth Rakesh,Yingtong Dou,Menghai Pan,Mahashweta Das,Srijan Kumar

Main category: cs.CL

TL;DR: SARA improves retrieval-augmented generation by combining natural-language text snippets with semantic compression vectors.

Details Motivation: To address limited context length and redundancy in retrieved documents in retrieval-augmented generation. Method: The authors propose SARA, a unified RAG framework that combines natural-language text snippets with semantic compression vectors. Result: Across 9 datasets and 5 open-source LLMs, SARA significantly improves answer relevance, answer correctness, and semantic similarity. Conclusion: SARA effectively combines natural-language text snippets with semantic compression vectors, improving retrieval-augmented generation. Abstract: Retrieval-augmented Generation (RAG) extends large language models (LLMs) with external knowledge but faces key challenges: restricted effective context length and redundancy in retrieved documents. Pure compression-based approaches reduce input size but often discard fine-grained details essential for factual accuracy. We propose SARA, a unified RAG framework that balances local precision and global knowledge coverage under tight context budgets. SARA combines natural-language text snippets with semantic compression vectors to jointly enhance context efficiency and answer correctness. It represents contexts at two complementary levels: 1) fine-grained natural-language spans that preserve critical entities and numerical values, and 2) compact, interpretable vectors that summarize high-level semantics. An iterative evidence-selection module employs the compression vectors for dynamic reranking of contexts. Across 9 datasets and 5 open-source LLMs spanning 3 model families (Mistral, Llama, and Gemma), SARA consistently improves answer relevance (+17.71), answer correctness (+13.72), and semantic similarity (+15.53), demonstrating the importance of integrating textual and compressed representations for robust, context-efficient RAG.

[23] ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

Haoxin Wang,Xianhan Peng,Xucheng Huang,Yizhe Huang,Ming Gong,Chenghan Yang,Yang Liu,Ling Jiang

Main category: cs.CL

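TL;DR: This paper presents ECom-Bench, a new benchmark framework for evaluating multimodal LLM agents in the e-commerce customer support domain.

Details Motivation: Evaluating and advancing LLM agents with multimodal capabilities in e-commerce customer support requires a dedicated benchmark framework. Method: The benchmark builds dynamic user simulation from persona information collected from real e-commerce customer interactions, together with a realistic task dataset derived from authentic e-commerce dialogues. Result: Even advanced models such as GPT-4o achieve only 10-20% on the pass^3 metric in ECom-Bench, underscoring the substantial difficulty of complex e-commerce scenarios. Conclusion: ECom-Bench is a challenging benchmark framework for evaluating the multimodal capabilities of LLM agents in e-commerce customer service, and it is intended to foster further research and development in this domain. Abstract: In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.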
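
The pass^3 figure in the ECom-Bench entry is not defined in this summary. As a minimal sketch, assuming it follows the pass^k convention used by some agent benchmarks (the chance that k independent attempts at the same task all succeed, estimated from n recorded attempts with c successes as C(c, k) / C(n, k)), it could be computed as below. The function names and the sample counts are hypothetical and are not taken from ECom-Bench.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k attempts drawn from n recorded
    attempts (c of which succeeded) are all successful."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) is 0 when c < k, so tasks with too few successes score 0.
    return comb(c, k) / comb(n, k)


def benchmark_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: three tasks, four attempts each.
results = [(4, 4), (4, 3), (4, 1)]
score = benchmark_pass_hat_k(results, k=3)
```

Under this assumed definition a task only counts fully when it succeeds on every attempt, which is why such a score can sit far below an ordinary single-attempt pass rate.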

[24] Smoothie-Qwen: Post-Hoc Smoothing to Reduce Language Bias in Multilingual LLMs

SeungWon Ji,Jungyup Lee,Jemin Kim,Sang Park,SeungJae Lee

Main category: cs.CL

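TL;DR: This paper proposes Smoothie-Qwen, a lightweight post-hoc method for addressing language confusion in multilingual large language models; experiments show that it effectively reduces generation in unintended languages while preserving task accuracy.

Details Motivation: Multilingual large language models (LLMs) often exhibit language confusion, a tendency to respond in a dominant language regardless of the language of the prompt. Smoothie-Qwen is proposed to address this. Method: Smoothie-Qwen is a lightweight post-hoc method that selectively adjusts token-level output probabilities to suppress generation in undesired languages. Result: Applied to the Qwen model, the method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. Conclusion: Smoothie-Qwen is a practical and efficient solution for enhancing the language controllability of LLMs, suited to global applications. Abstract: Multilingual large language models (LLMs) often exhibit language confusion, a tendency to generate responses in a dominant language irrespective of the prompt's language. To address this, we propose Smoothie-Qwen, a lightweight, post-hoc method that mitigates language bias without retraining. This technique selectively adjusts token-level output probabilities to effectively suppress undesired language generation. Applied to the Qwen model, our method reduces unintended Chinese output by over 95% while preserving task accuracy on multilingual benchmarks. This work provides a practical and efficient solution for enhancing the language controllability of LLMs, making them more reliable for global applications.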
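
The general idea of adjusting token-level output probabilities can be illustrated with a small sketch. It assumes a toy vocabulary, a simple check against the CJK Unified Ideographs range, and a fixed down-weighting factor; the function names and the `scale` parameter are hypothetical, and Smoothie-Qwen itself may identify and weight tokens quite differently.

```python
def contains_cjk(token: str) -> bool:
    """True if any character lies in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)


def smooth_probs(vocab: list[str], probs: list[float], scale: float = 0.05) -> list[float]:
    """Down-weight tokens containing CJK characters, then renormalize
    so the result is again a probability distribution."""
    adjusted = [p * scale if contains_cjk(tok) else p
                for tok, p in zip(vocab, probs)]
    total = sum(adjusted)
    if total == 0:
        raise ValueError("all probability mass was removed")
    return [p / total for p in adjusted]


# Hypothetical next-token distribution over a four-token vocabulary.
vocab = ["hello", "world", "你好", "世界"]
probs = [0.2, 0.1, 0.4, 0.3]
smoothed = smooth_probs(vocab, probs)
```

Because the adjustment is applied to output probabilities only, a scheme of this kind needs no retraining, which matches the post-hoc framing in the abstract.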

[25] Agentic-R1: Distilled Dual-Strategy Reasoning

Weihua Du,Pranjal Aggarwal,Sean Welleck,Yiming Yang

Main category: cs.CL

TL;DR: This paper introduces DualDistill, a new fine-tuning framework that integrates multiple reasoning strategies to improve model performance on complex tasks.

Details Motivation: Current long chain-of-thought models rely on slow and error-prone natural-language reasoning, while tool-augmented agents often falter on complex logical tasks, so a more efficient and robust approach to reasoning is needed. Method: The proposed fine-tuning framework, DualDistill, distills complementary reasoning strategies from multiple teacher models and trains a unified student model that dynamically selects the best strategy. Result: Agentic-R1, trained with DualDistill, achieves higher accuracy across a range of tasks, including computation-intensive tasks and standard benchmarks. Conclusion: DualDistill effectively integrates multiple reasoning strategies, improving accuracy and robustness across different types of tasks. Abstract: Current long chain-of-thought (long-CoT) models excel at mathematical reasoning but rely on slow and error-prone natural language traces. Tool-augmented agents address arithmetic via code execution, but often falter on complex logical tasks. We introduce a fine-tuning framework, DualDistill, that distills complementary reasoning strategies from multiple teachers into a unified student model. Using this approach, we train Agentic-R1, which dynamically selects the optimal strategy for each query, invoking tools for arithmetic and algorithmic problems, and using text-based reasoning for abstract ones. Our method improves accuracy across a range of tasks, including both computation-intensive and standard benchmarks, demonstrating the effectiveness of multi-strategy distillation in achieving robust and efficient reasoning. Our project is available at https://github.com/StigLidu/DualDistill

[26] DRAGON: Dynamic RAG Benchmark On News

Fedor Chernogorskii,Sergei Averkiev,Liliya Kudraleeva,Zaven Martirosian,Maria Tikhonova,Valentin Malykh,Alena Fenogenova

Main category: cs.CL

TL;DR: This paper introduces DRAGON, the first dynamic RAG benchmark for Russian, built on a continuously updated news corpus with automated question generation and evaluation tools.

Details Motivation: While multiple RAG benchmarks exist for English, resources for other languages like Russian are scarce and static, not reflecting real-world dynamics. This work addresses this gap. Method: The work constructs DRAGON using a regularly updated Russian news corpus and employs Knowledge Graph-based automatic question generation to extract four core question types. It includes an evaluation framework with reusable pipelines and scripts. Result: The result is the first dynamic RAG benchmark for Russian, supporting evaluation of retriever and generator components with automatically generated questions from a knowledge graph. Conclusion: DRAGON provides a dynamic benchmark for evaluating RAG systems in Russian, offering comprehensive evaluation tools and encouraging community participation through a public leaderboard. Abstract: Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments. In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpus. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Question generation is performed automatically using a Knowledge Graph constructed from the corpus and enables the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the pipeline for automatic question generation, evaluation scripts, which are potentially reusable for other languages and multilingual settings, and benchmark data. 
We also launch a public leaderboard to encourage community participation and comparison.

[27] HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation

YiHan Jiao,ZheHao Tan,Dan Yang,DuoLin Sun,Jie Feng,Jian Wang,Peng Wei

Main category: cs.CL

TL;DR: This paper proposes HIRAG, a novel RAG instruction fine-tuning method that enhances model performance by incorporating hierarchical reasoning processes.

Details Motivation: Traditional RAG systems lack in-depth focus on specific RAG tasks and reasoning processes, leading to challenges with document quality and retrieval system limitations. Method: Introduce a new RAG instruction fine-tuning method called Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which utilizes multi-level progressive chain-of-thought. Result: Experiments demonstrate that the HIRAG training strategy significantly improves model performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA. Conclusion: The proposed HIRAG method enhances RAG models' capabilities through hierarchical thought processes, significantly improving performance across multiple datasets. Abstract: Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself. Still, in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on RAG tasks or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities: (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which incorporates a "think before answering" strategy. 
This method enhances the model's open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model's performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.

[28] Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Zijin Gu,Tatiana Likhomanenko,Navdeep Jaitly

Main category: cs.CL

TL;DR: This paper proposes the Omni-router Transformer, which improves the MoE architecture through a shared routing mechanism and achieves better performance and greater robustness on automatic speech recognition.

Details Motivation: Traditional MoE methods such as the Switch Transformer select experts independently in each layer, and these choices are not strongly correlated across layers, which motivates a method that strengthens cooperation between experts in different layers. Method: A routing mechanism is shared across MoE layers to increase cooperation between experts in different layers and to encourage greater expert specialization. Result: Experiments show lower training loss on a large-scale pseudo-labeled dataset and, across 10 diverse speech recognition benchmarks, average word error rates reduced by 11.2% and 8.2% relative to dense and Switch Transformer models, respectively. Conclusion: The Omni-router Transformer outperforms dense and Switch Transformer models in both training loss and recognition accuracy, and shows better expert cooperation and robustness to diverse data. Abstract: Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
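
A rough sketch of what sharing a router across MoE layers could look like, assuming top-1 routing with a single linear router whose weights are reused at every layer. The weights, hidden states, and function names are hypothetical toy values, not the paper's implementation; a Switch-style baseline would instead keep a separate weight matrix per layer.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_weights: list[list[float]], hidden: list[float]) -> tuple[int, float]:
    """Top-1 routing: return the chosen expert index and its gate probability.
    router_weights has one row of weights per expert."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# One router (3 experts, hidden size 2) reused by every layer (assumption).
shared_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Hypothetical hidden states of one token at four successive MoE layers.
hidden_per_layer = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [-0.5, -0.6]]
choices = [route(shared_router, h)[0] for h in hidden_per_layer]
```

With shared weights, the expert chosen at each layer depends only on how the hidden state changes between layers, so routing decisions across layers are tied to the same parameters rather than learned independently.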

[29] GPTKB v1.5: A Massive Knowledge Base for Exploring Factual LLM Knowledge

Yujia Hu,Tuan-Phong Nguyen,Shrestha Ghosh,Moritz Müller,Simon Razniewski

Main category: cs.CL

TL;DR: This paper presents GPTKB v1.5, a large-scale knowledge base built at low cost that enables in-depth exploration of the knowledge held by language models (LLMs).

Details Motivation: To better understand the factual knowledge in language models (LLMs) and to enable scalable statistical analysis and browsing. Method: Using the GPTKB methodology, the authors build a densely interlinked 100-million-triple knowledge base from GPT-4.1, supporting use cases such as link traversal and SPARQL querying. Result: The outcome is GPTKB v1.5, a low-cost large-scale knowledge base that supports several modes of interaction for exploring and analyzing LLM knowledge. Conclusion: GPTKB v1.5 is an innovative large-scale knowledge base for exploring LLM knowledge, opening new opportunities for the systematic analysis of LLM knowledge and for automated KB construction. Abstract: Language models are powerful tools, yet their factual knowledge is still poorly understood, and inaccessible to ad-hoc browsing and scalable statistical analysis. This demonstration introduces GPTKB v1.5, a densely interlinked 100-million-triple knowledge base (KB) built for $14,000 from GPT-4.1, using the GPTKB methodology for massive-recursive LLM knowledge materialization (Hu et al., ACL 2025). The demonstration experience focuses on three use cases: (1) link-traversal-based LLM knowledge exploration, (2) SPARQL-based structured LLM knowledge querying, (3) comparative exploration of the strengths and weaknesses of LLM knowledge. Massive-recursive LLM knowledge materialization is a groundbreaking opportunity both for the research area of systematic analysis of LLM knowledge, as well as for automated KB construction. The GPTKB demonstrator is accessible at https://gptkb.org.

[30] DocTalk: Scalable Graph-based Dialogue Synthesis for Enhancing LLM Conversational Capabilities

Jing Yang Lee,Hamed Bonab,Nasser Zalmout,Ming Zeng,Sanket Lokegaonkar,Colin Lockard,Binxuan Huang,Ritesh Sarkhel,Haodong Wang

Main category: cs.CL

TL;DR: The study introduces the DocTalk dataset, which uses synthesized multi-turn dialogue data to improve the context understanding and memory of large language models without compromising their base performance.

Details Motivation: The pre-training data of current large language models consists mainly of continuous prose, which is mismatched with the multi-turn conversational abilities required in practice, so a new approach is needed to close this gap. Method: The authors propose a new method for synthesizing dialogue data that transforms clusters of related documents into extended multi-turn, multi-topic information-seeking dialogues, and use it to build the DocTalk dataset of over 730k long conversations. Result: Experiments show that incorporating DocTalk during pre-training improves context memory and understanding by up to 40%. Conclusion: Incorporating DocTalk improves the context memory and understanding of large language models in multi-turn dialogue tasks without compromising base performance. Abstract: Large Language Models (LLMs) are increasingly employed in multi-turn conversational tasks, yet their pre-training data predominantly consists of continuous prose, creating a potential mismatch between required capabilities and training paradigms. We introduce a novel approach to address this discrepancy by synthesizing conversational data from existing text corpora. We present a pipeline that transforms a cluster of multiple related documents into an extended multi-turn, multi-topic information-seeking dialogue. Applying our pipeline to Wikipedia articles, we curate DocTalk, a multi-turn pre-training dialogue corpus consisting of over 730k long conversations. We hypothesize that exposure to such synthesized conversational structures during pre-training can enhance the fundamental multi-turn capabilities of LLMs, such as context memory and understanding. Empirically, we show that incorporating DocTalk during pre-training results in up to 40% gain in context memory and understanding, without compromising base performance. DocTalk is available at https://huggingface.co/datasets/AmazonScience/DocTalk.

[31] Flippi: End To End GenAI Assistant for E-Commerce

Anand A. Rajasekar,Praveen Tangarajan,Anjali Nainani,Amogh Batwal,Vinay Rao Dandin,Anusua Trivedi,Ozan Ersoy

Main category: cs.CL

TL;DR: Flippi is an advanced conversational assistant for e-commerce that enhances user experience by providing personalized product recommendations and streamlining product discovery.

Details Motivation: The motivation behind Flippi is to address the challenges of navigating the vast product landscape in e-commerce, enhancing product discovery through natural language dialogue. Method: Flippi uses advanced NLP techniques like Query Reformulation, Intent Detection, RAG, NER, and Context Reduction to interpret customer queries and deliver precise product information. Result: Flippi provides a personalized shopping experience, identifies attractive offers, and enables informed decision-making through comparative analysis features. Conclusion: Flippi sets a new standard for customer satisfaction and engagement in the digital marketplace by bridging the convenience of online shopping with personalized assistance. Abstract: The emergence of conversational assistants has fundamentally reshaped user interactions with digital platforms. This paper introduces Flippi, a cutting-edge, end-to-end conversational assistant powered by large language models (LLMs) and tailored for the e-commerce sector. Flippi addresses the challenges posed by the vast and often overwhelming product landscape, enabling customers to discover products more efficiently through natural language dialogue. By accommodating both objective and subjective user requirements, Flippi delivers a personalized shopping experience that surpasses traditional search methods. This paper details how Flippi interprets customer queries to provide precise product information, leveraging advanced NLP techniques such as Query Reformulation, Intent Detection, Retrieval-Augmented Generation (RAG), Named Entity Recognition (NER), and Context Reduction. Flippi's unique capability to identify and present the most attractive offers on an e-commerce site is also explored, demonstrating how it empowers users to make cost-effective decisions. 
Additionally, the paper discusses Flippi's comparative analysis features, which help users make informed choices by contrasting product features, prices, and other relevant attributes. The system's robust architecture is outlined, emphasizing its adaptability for integration across various e-commerce platforms and the technological choices underpinning its performance and accuracy. Finally, a comprehensive evaluation framework is presented, covering performance metrics, user satisfaction, and the impact on customer engagement and conversion rates. By bridging the convenience of online shopping with the personalized assistance traditionally found in physical stores, Flippi sets a new standard for customer satisfaction and engagement in the digital marketplace.

[32] Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports

Amane Watahiki,Tomoki Doi,Taiga Shinozaki,Satoshi Nishida,Takuya Niikawa,Katsunori Miyahara,Hitomi Yanaka

Main category: cs.CL

TL;DR: This paper investigates the ability of large vision-language models to handle amodal completion, revealing performance discrepancies among models and languages.

Details Motivation: To explore the inferential abilities of LVLMs regarding amodal completion, a phenomenon where humans perceive objects even when parts are hidden, which remains understudied in computer-vision algorithms. Method: A benchmark was constructed using Basic Formal Ontology for systematic classification of amodal completion, evaluating LVLMs' performance on original and blank stimuli across different languages. Result: Some LVLMs, like LLaVA-NeXT variants and Claude 3.5 Sonnet, showed lower accuracy with original images compared to blank stimuli, especially under Japanese prompting. Conclusion: The study concludes that while many LVLMs perform well overall, they show varying accuracy in handling specific object types related to amodal completion, particularly under Japanese prompting. Abstract: One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. 
Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.

[33] How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures

Tanvina Patel,Wiebke Hutiri,Aaron Yi Ding,Odette Scharenborg

Main category: cs.CL

TL;DR: This paper argues that standard error rates are not enough to detect bias in ASR systems and suggests using supplementary measures for a more accurate evaluation across diverse speaker groups.

Details Motivation: There is increasing evidence of bias in ASR systems against different speakers based on factors like gender, age, or accent. Despite progress in detecting, quantifying, and mitigating bias, the open question remains: how to effectively measure system performance and bias. Method: The study compares different performance and bias measures from literature and proposed ones, using several bias mitigation strategies in experiments to evaluate state-of-the-art end-to-end ASR systems for Dutch. Result: The findings indicate that standard metrics like averaged error rates are insufficient for capturing bias in ASR systems and need to be complemented by additional measures. Conclusion: The paper concludes that averaged error rates alone are not sufficient for evaluating ASR systems and should be supplemented with other measures to accurately represent performance across diverse speaker groups and assess overall system bias. Abstract: There is increasingly more evidence that automatic speech recognition (ASR) systems are biased against different speakers and speaker groups, e.g., due to gender, age, or accent. Research on bias in ASR has so far primarily focused on detecting and quantifying bias, and developing mitigation approaches. Despite this progress, the open question is how to measure the performance and bias of a system. In this study, we compare different performance and bias measures, from literature and proposed, to evaluate state-of-the-art end-to-end ASR systems for Dutch. Our experiments use several bias mitigation strategies to address bias against different speaker groups. The findings reveal that averaged error rates, a standard in ASR research, alone is not sufficient and should be supplemented by other measures. The paper ends with recommendations for reporting ASR performance and bias to better represent a system's performance for diverse speaker groups, and overall system bias.
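
To make the point about averaged error rates concrete, here is a minimal sketch that computes word error rate (WER) per speaker group alongside the pooled figure. The group names, the toy Dutch sentences, and the max-minus-min "gap" are illustrative assumptions, not the measures proposed in the paper.

```python
from collections import defaultdict

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) strings.
    Assumes every group has at least one non-empty reference."""
    errors, words = defaultdict(int), defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in errors}

def summarize(samples):
    samples = list(samples)
    per_group = wer_by_group(samples)
    # Pooled WER: treat all utterances as one group.
    overall = wer_by_group(("all", r, h) for _, r, h in samples)["all"]
    # Hypothetical bias indicator: spread between worst and best group.
    gap = max(per_group.values()) - min(per_group.values())
    return {"overall": overall, "per_group": per_group, "gap": gap}

# Toy data (made up for illustration).
samples = [
    ("native", "de kat zit op de mat", "de kat zit op de mat"),
    ("native", "het is mooi weer", "het is mooi weer"),
    ("non_native", "ik ga naar huis", "ik ga huis"),
    ("non_native", "dat is goed", "wat is goed"),
]
report = summarize(samples)
```

In this toy set the pooled WER looks modest while one group carries all of the errors, which is the kind of disparity a single averaged number hides.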

[34] Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

Sungjib Lim,Woojung Song,Eun-Ju Lee,Yohan Jo

Main category: cs.CL

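TL;DR: This paper presents a new LLM-based approach to generating psychometric survey items that improves item validity by simulating respondents with diverse mediators, opening a new direction for low-cost development of psychometric instruments.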

Details Motivation: As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), a scalable method for generating survey items is needed that also ensures the construct validity of the generated items. Method: LLMs generate plausible mediators and simulate the behavior of respondents with diverse mediators, and the simulated responses are used to validate survey items. Result: Experiments show that the proposed mediator generation methods and simulation framework effectively identify high-validity survey items, and that LLMs can generate plausible mediators and simulate respondent behavior. Conclusion: The paper proposes an LLM-based virtual respondent simulation framework for psychometric survey item generation and demonstrates its potential for building efficient, valid psychometric instruments. Abstract: As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that robustly measure intended traits. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-effective survey development and a deeper understanding of how LLMs replicate human-like behavior. We will publicly release our dataset and code to support future work.

[35] Few-shot text-based emotion detection

Teodor-George Marchitan,Claudiu Creanga,Liviu P. Dinu

Main category: cs.CL

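TL;DR: This paper describes the Unibuc-NLP team's approach to Task 11 of the SemEval 2025 Workshop, focusing on text-based emotion detection with large language models.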

Details Motivation: To bridge the gap in text-based emotion detection and improve emotion recognition across different language subsets. Method: The team mainly uses large language models such as Gemini, Qwen, and DeepSeek, experimenting with few-shot prompting or fine-tuning. Result: The system obtained an F1-macro of 0.7546 on the English subset (26/96 teams), 0.1727 on the Portuguese (Mozambican) subset (35/36 teams), and 0.325 on the Emakhuwa subset (1/31 teams). Conclusion: Performance varies across language datasets, with a comparatively strong result on the Emakhuwa subset. Abstract: This paper describes the approach of the Unibuc - NLP team in tackling the SemEval 2025 Workshop, Task 11: Bridging the Gap in Text-Based Emotion Detection. We mainly focused on experiments using large language models (Gemini, Qwen, DeepSeek) with either few-shot prompting or fine-tuning. With our final system, for the multi-label emotion detection track (track A), we got an F1-macro of $0.7546$ (26/96 teams) for the English subset, $0.1727$ (35/36 teams) for the Portuguese (Mozambican) subset and $0.325$ (1/31 teams) for the Emakhuwa subset.
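
Since the reported scores are F1-macro values for a multi-label track, a short sketch of how macro-averaged F1 is typically computed for multi-label predictions may help. The emotion labels and the toy predictions below are made up, and the choice to score a label with no true or predicted instances as 0.0 is an assumption; the task's official scorer may handle that case differently.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1 for multi-label data.

    y_true, y_pred: lists of label sets, one set per example.
    Each label gets its own F1; the macro score is their unweighted mean.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label that never occurs and is never predicted scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example (illustrative only).
labels = ["joy", "anger", "fear"]
y_true = [{"joy"}, {"anger", "fear"}, {"joy", "fear"}]
y_pred = [{"joy"}, {"anger"}, {"fear"}]
score = macro_f1(y_true, y_pred, labels)
```

Because every label contributes equally regardless of frequency, rare emotions that a system misses pull the macro score down sharply, which may partly explain large differences between language subsets.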

[36] Towards a Principled Evaluation of Knowledge Editors

Sebastian Pohl,Max Ploner,Alan Akbik

Main category: cs.CL

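TL;DR: This paper studies how the measured performance of knowledge editing techniques differs across evaluation methodologies, and reveals both their impact on overall model capabilities and the limitations of existing evaluation methods.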

Details Motivation: Current knowledge editing evaluation datasets have methodological shortcomings, and the impact of editing on overall model capabilities is under-studied. Method: Knowledge editors are analyzed under different metrics, evaluation methodologies, and edit batch sizes, combined with a manual assessment to detect problems in string-matching-based evaluation. Result: Different evaluation methodologies and metrics lead to different rankings of knowledge editors, and some commonly used methods are prone to false positive matches. Conclusion: The choice of evaluation methodology and metrics significantly affects the ranking of editors, and these edits also disrupt overall model capabilities. Abstract: Model editing has been gaining increasing attention over the past few years. For Knowledge Editing in particular, more challenging evaluation datasets have recently been released. These datasets use different methodologies to score the success of editors. Yet, it remains under-explored how robust these methodologies are and whether they unfairly favor some editors. Moreover, the disruptive impact of these editors on overall model capabilities remains a constant blind spot. We address both of these problems and show that choosing different metrics and evaluation methodologies as well as different edit batch sizes can lead to a different ranking of knowledge editors. Crucially we demonstrate this effect also on general language understanding tasks evaluated alongside the knowledge editing tasks. Further we include a manual assessment of the string matching based evaluation method for knowledge editing that is favored by recently released datasets, revealing a tendency to produce false positive matches.

[37] Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors

Bing Wang,Ximing Li,Mengzhe Ye,Changchun Li,Bo Fu,Jianfeng Qu,Lin Yuanbo Wu

Main category: cs.CL

TL;DR: This paper proposes DAEDCMD, a continual multimodal misinformation detection method that addresses challenges like past knowledge forgetting and evolving social environments.

Details Motivation: The motivation is the issue of outdated MMD models trained on offline data, which are ineffective for continually emerging new events on social media platforms. Method: A Dirichlet process-based mixture-of-expert structure to remember past knowledge and a continuous-time dynamics model to anticipate future environmental distributions are used. Result: DAEDCMD significantly outperforms six MMD baselines and three continual learning methods in extensive experiments. Conclusion: DAEDCMD is an effective continual MMD method that outperforms other methods in detecting misinformation on new and past data. Abstract: Nowadays, misinformation articles, especially multimodal ones, are widely spread on social media platforms and cause serious negative effects. To control their propagation, Multimodal Misinformation Detection (MMD) becomes an active topic in the community to automatically identify misinformation. Previous MMD methods focus on supervising detectors by collecting offline data. However, in real-world scenarios, new events always continually emerge, making MMD models trained on offline data consistently outdated and ineffective. To address this issue, training MMD models under online data streams is an alternative, inducing an emerging task named continual MMD. Unfortunately, it is hindered by two major challenges. First, training on new data consistently decreases the detection performance on past data, named past knowledge forgetting. Second, the social environment constantly evolves over time, affecting the generalization on future data. To alleviate these challenges, we propose to remember past knowledge by isolating interference between event-specific parameters with a Dirichlet process-based mixture-of-expert structure, and anticipate future environmental distributions by learning a continuous-time dynamics model. Accordingly, we induce a new continual MMD method DAEDCMD. 
Extensive experiments demonstrate that DAEDCMD can consistently and significantly outperform the compared methods, including six MMD baselines and three continual learning methods.

[38] Chat-Ghosting: A Comparative Study of Methods for Auto-Completion in Dialog Systems

Sandeep Mishra,Anubhab Mandal,Bishal Santra,Tushar Abhishek,Pawan Goyal,Manish Gupta

Main category: cs.CL

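TL;DR: This paper studies text prediction (ghosting) in chat interfaces, compares several methods (including tries, n-gram models, and deep learning models) on multiple dialog datasets, and proposes a new early stopping strategy.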

Details Motivation: Ghosting is crucial for improving the user typing experience, but the Chat-Ghosting problem lacks standardized benchmarks and performance analysis, so a systematic study of different methods is needed. Method: Using four publicly available dialog datasets (DailyDialog, DSTC7-Ubuntu, Open Assistant, and ShareGPT), the authors experiment with various query auto-completion methods (tries, n-gram methods, and deep learning methods) and propose an entropy-based dynamic early stopping strategy. Result: Statistical n-gram models and tries perform better on seen prefixes; deep learning models such as T5 and Phi-2 perform better on unseen queries; adding conversational context significantly improves ghosting quality. Conclusion: Statistical n-gram models and tries outperform deep learning models on seen prefixes, while neural models such as T5 and Phi-2 do better on unseen queries. Adding conversational context significantly improves ghosting quality. Abstract: Ghosting, the ability to predict a user's intended text input for inline query auto-completion, is an invaluable feature for modern search engines and chat interfaces, greatly enhancing user experience. By suggesting completions to incomplete queries (or prefixes), ghosting aids users with slow typing speeds, disabilities, or limited language proficiency. Ghosting is a challenging problem and has become more important with the ubiquitousness of chat-based systems like ChatGPT, Copilot, etc. Despite the increasing prominence of chat-based systems utilizing ghosting, this challenging problem of Chat-Ghosting has received little attention from the NLP/ML research community. There is a lack of standardized benchmarks and relative performance analysis of deep learning and non-deep learning methods. We address this through an open and thorough study of this problem using four publicly available dialog datasets: two human-human (DailyDialog and DSTC7-Ubuntu) and two human-bot (Open Assistant and ShareGPT). We experiment with various existing query auto-completion methods (using tries), n-gram methods and deep learning methods, with and without dialog context. We also propose a novel entropy-based dynamic early stopping strategy. Our analysis finds that statistical n-gram models and tries outperform deep learning based models in terms of both model performance and inference efficiency for seen prefixes. For unseen queries, neural models like T5 and Phi-2 lead to better results. Adding conversational context leads to significant improvements in ghosting quality, especially for Open-Assistant and ShareGPT. We make code and data publicly available.
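
The abstract does not spell out the entropy-based early stopping rule, so the following is only a rough sketch of the general idea under stated assumptions: a bigram model proposes the next word, and the suggestion stops growing once the next-word distribution becomes too uncertain. The function names, the `max_entropy` threshold, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_bigrams(utterances):
    """Count next-word frequencies for every word in the corpus."""
    counts = defaultdict(Counter)
    for text in utterances:
        tokens = text.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def entropy(counter):
    """Shannon entropy (bits) of a next-word count distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def ghost(prefix, counts, max_entropy=0.9, max_words=5):
    """Greedily extend the prefix; stop when the model is too uncertain."""
    tokens = prefix.split()
    suggestion = []
    while tokens and len(suggestion) < max_words:
        nxt = counts.get(tokens[-1])
        # Early stop: unseen word, or next-word entropy above the threshold.
        if not nxt or entropy(nxt) > max_entropy:
            break
        word = nxt.most_common(1)[0][0]
        suggestion.append(word)
        tokens.append(word)
    return " ".join(suggestion)

# Toy corpus (illustrative only).
corpus = [
    "how are you today",
    "how are you doing",
    "how are you today",
    "thank you very much",
]
model = train_bigrams(corpus)
```

With this corpus, `ghost("how", model)` extends through the two certain steps and stops before the ambiguous word after "you", while raising `max_entropy` lets it commit to a longer completion. A shorter, more confident suggestion is usually less intrusive than a long, wrong one, which is the intuition behind stopping dynamically.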

[39] OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation

Lucas Fonseca Lage,Simon Ostermann

Main category: cs.CL

TL;DR: OpenFActScore is an open-source framework that evaluates the factuality of LLM-generated text by extracting and validating atomic facts, enabling flexible use of open models while achieving performance comparable to closed systems.

Details Motivation: To provide an open-source alternative to FActScore that supports reproducibility, transparency, and flexibility with open models while maintaining comparable performance to closed-source systems. Method: OpenFActScore uses Atomic Fact Generation (AFG) and Atomic Fact Validation (AFV), supporting any Hugging Face-compatible model. It evaluates models using BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Result: Gemma achieved the best overall performance, and the final setup obtained a 0.99 Pearson correlation with the original FActScore experiments. Conclusion: OpenFActScore successfully enables open-source models to approximate the performance of closed-source systems in evaluating factual accuracy, promoting transparency and cost-effective evaluation. Abstract: We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. 
OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: https://github.com/lflage/OpenFActScore.
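
As a rough illustration of the two quantities mentioned above, the sketch below scores a generation as the fraction of its atomic facts that a validator marks as supported, and computes a Pearson correlation between two lists of scores. The `is_supported` callback stands in for the AFV step and is a hypothetical placeholder (here a simple set lookup), not OpenFActScore's actual interface; returning 0.0 for an empty fact list is also an assumption.

```python
import math

def factscore(atomic_facts, is_supported):
    """Fraction of atomic facts judged supported by the validator."""
    if not atomic_facts:
        return 0.0  # Assumption: no extracted facts yields a score of 0.
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy knowledge source and facts (illustrative only).
knowledge = {
    "Marie Curie was born in Warsaw",
    "Marie Curie won two Nobel Prizes",
}
facts = [
    "Marie Curie was born in Warsaw",
    "Marie Curie won two Nobel Prizes",
    "Marie Curie was born in 1900",
]
score = factscore(facts, lambda fact: fact in knowledge)
```

In practice the validator would be a model call that checks each claim against retrieved passages, and the correlation would be taken between per-system scores from two evaluation setups.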

[40] We Should Evaluate Real-World Impact

Ehud Reiter

Main category: cs.CL

TL;DR: The paper highlights the lack of real-world impact evaluation in NLP research within the ACL community and argues for its importance in making NLP technology more useful and widely adopted.

Details Motivation: The ACL community shows minimal interest in evaluating the real-world impact of NLP systems, with only an estimated 0.1% of papers addressing such evaluations. Method: A structured survey of the ACL Anthology was conducted to assess the extent to which papers evaluate the real-world impact of NLP systems. Result: Most papers focus on metric evaluations rather than real-world impact assessments, and those that do include impact evaluations tend to present them sketchily. Conclusion: Evaluating the real-world impact of NLP systems can make NLP technology more useful and rapidly adopted. Abstract: The ACL community has very little interest in evaluating the real-world impact of NLP systems. A structured survey of the ACL Anthology shows that perhaps 0.1% of its papers contain such evaluations; furthermore most papers which include impact evaluations present them very sketchily and instead focus on metric evaluations. NLP technology would be more useful and more quickly adopted if we seriously tried to understand and evaluate its real-world impact.

[41] RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Gabriel Chua,Leanne Tan,Ziyu Ge,Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: 本论文提出了 RabakBench,一個針對新加坡多語言環境的安全基準,解決了低資源語言在大型語言模型及其安全分類器上的性能不足問題。

Details Motivation: Because training data and evaluation benchmarks are limited, large language models (LLMs) and their safety classifiers perform poorly on low-resource languages, so a multilingual safety benchmark adapted to Singapore's unique linguistic context is needed. Method: RabakBench is built in three stages: (i) Generate: adversarial examples are produced by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label: semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; (iii) Translate: high-fidelity translation that preserves linguistic nuance and toxicity across languages. Result: The final dataset contains over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers show significant performance degradation. Conclusion: RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. Abstract: Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.

[42] Evolution without Large Models: Training Language Model with Task Principles

Minghang Zhu,Shen Gao,Zhengliang Shi,Jiabao Fang,Pengjie Ren,Zhaochun Ren,Zhumin Chen,Shuo Shang

Main category: cs.CL

TL;DR: This paper proposes a self-evolution method for training language models, where a large model generates task principles and a smaller model creates training data based on these principles, resulting in improved performance and reduced environmental impact.

Details Motivation: The motivation stems from the need to reduce training costs, mitigate high carbon emissions during data augmentation, and prevent data leakage when using closed-source LLMs in current language model training approaches. Method: The method involves two steps: Multi-level Principle Generation, where a large-scale model summarizes task-completion principles from limited data, and Principle-based Instance Generation, where a smaller model uses these principles to generate extensive training data. Result: Experimental results show that the proposed method significantly improves model performance compared to directly using a smaller-scale language model for data generation. It also greatly reduces carbon emissions due to the limited use of a large-scale model for principle generation. Conclusion: The proposed self-evolution method for language models effectively reduces carbon emissions and data leakage risks while significantly improving model performance by using a large-scale model to generate task-completion principles and a smaller model to generate training data. Abstract: A common training approach for language models involves using a large-scale language model to expand a human-provided dataset, which is subsequently used for model training. This method significantly reduces training costs by eliminating the need for extensive human data annotation. However, it still faces challenges such as high carbon emissions during data augmentation and the risk of data leakage when we use closed-source LLMs. To address these issues, we propose a self-evolution method for language models. First, we introduce the Multi-level Principle Generation, which enables a large-scale model to summarize task-completion principles based on a small amount of task data. Then, we propose the Principle-based Instance Generation, in which a smaller-scale language model uses these task principles to generate a large amount of data. This data is then used for model training. Experimental results show that our proposed method significantly improves model performance compared to directly using a smaller-scale language model to generate data. Additionally, since we only use the large-scale language model to generate the task-completion principles, the carbon emissions associated with training the model are greatly reduced.

[43] DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations

Nicholas Popovič,Ashish Kangen,Tim Schopf,Michael Färber

Main category: cs.CL

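TL;DR: This paper proposes a fully automatic, LLM-based approach to synthetic data generation and in-context learning to address data scarcity in document-level entity and relation extraction, but experiments show that even advanced models still struggle in the zero-shot setting.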

Details Motivation: Large, high-quality annotated corpora for document-level entity and relation extraction are lacking in zero-shot and few-shot settings, which limits progress in the field. A fully automatic way to generate high-quality training data and improve model performance is therefore needed. Method: The study uses an automated LLM-based pipeline for synthetic data generation combined with retrieval-based in-context learning. A reasoning-optimized language model builds a high-quality demonstration database, and relevant examples are retrieved dynamically at inference time, avoiding manual annotation. Result: The approach yields a synthetic dataset of over 5,000 Wikipedia abstracts with approximately 59,000 entities and 30,000 relation triples. Evaluation on the DocIE shared task shows that joint entity and relation extraction at document level in a zero-shot setting remains challenging even for state-of-the-art large language models. Conclusion: Although state-of-the-art large language models still struggle with document-level joint entity and relation extraction in the zero-shot setting, combining synthetic data generation with retrieval-based in-context learning offers a new way to approach few-shot and zero-shot settings. Abstract: Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach we produce a synthetic dataset of over $5k$ Wikipedia abstracts with approximately $59k$ entities and $30k$ relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at document-level remains a challenging task, even for state-of-the-art large language models.

[44] Conditional Multi-Stage Failure Recovery for Embodied Agents

Youmna Farag,Svetlana Stoyanchev,Mohan Li,Simon Keizer,Rama Doddipatla

Main category: cs.CL

TL;DR: This paper proposes a multistage failure recovery framework using zero-shot chain prompting and LLM reasoning to effectively address execution failures in embodied agents, achieving superior performance on a benchmark dataset.

Details Motivation: Embodied agents performing complex tasks are prone to execution failures, necessitating effective recovery mechanisms to ensure task success. Method: A four-stage error-handling framework is introduced that incorporates zero-shot chain prompting and leverages the reasoning capabilities of LLMs to analyze challenges and devise strategic solutions for failure recovery. Result: The framework achieves state-of-the-art results on the TfD benchmark of the TEACH dataset, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%. Conclusion: The proposed conditional multistage failure recovery framework significantly improves the performance of embodied agents in handling execution failures, demonstrating its effectiveness and superiority over existing methods. Abstract: Embodied agents performing complex tasks are susceptible to execution failures, motivating the need for effective failure recovery mechanisms. In this work, we introduce a conditional multistage failure recovery framework that employs zero-shot chain prompting. The framework is structured into four error-handling stages, with three operating during task execution and one functioning as a post-execution reflection phase. Our approach utilises the reasoning capabilities of LLMs to analyse execution challenges within their environmental context and devise strategic solutions. We evaluate our method on the TfD benchmark of the TEACH dataset and achieve state-of-the-art performance, outperforming a baseline without error recovery by 11.5% and surpassing the strongest existing model by 19%.

[45] Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Yizhan Huang,Zhe Yang,Meifang Chen,Jianping Zhang,Michael R. Lyu

Main category: cs.CL

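TL;DR: This paper reveals a linear relationship between data entropy and LLM memorization, and derives from it a new method for distinguishing training data from test data.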

Details Motivation: Large language models (LLMs) memorize portions of their training data and can even reproduce content verbatim when prompted appropriately. How to characterize the memorization difficulty of training data in LLMs is a fundamental but under-explored question. Method: Empirical experiments on OLMo models analyze the relationship between memorization score and data entropy across training data, with a case study on memorizing highly randomized strings ("gibberish"). Result: Data entropy is linearly correlated with memorization score, which the authors call the Entropy-Memorization Law. Highly randomized strings, despite their apparent randomness, have lower empirical entropy than the broader training corpus, suggesting they are easier to memorize. Conclusion: The study proposes the Entropy-Memorization Law and uses it to derive a simple yet effective method for distinguishing training and testing data, enabling Dataset Inference (DI). Abstract: Large Language Models (LLMs) are known to memorize portions of their training data, sometimes reproducing content verbatim when prompted appropriately. In this work, we investigate a fundamental yet under-explored question in the domain of memorization: How to characterize memorization difficulty of training data in LLMs? Through empirical experiments on OLMo, a family of open models, we present the Entropy-Memorization Law. It suggests that data entropy is linearly correlated with memorization score. Moreover, in a case study of memorizing highly randomized strings, or "gibberish", we observe that such sequences, despite their apparent randomness, exhibit unexpectedly low empirical entropy compared to the broader training corpus. Adopting the same strategy to discover Entropy-Memorization Law, we derive a simple yet effective approach to distinguish training and testing data, enabling Dataset Inference (DI).
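
For readers unfamiliar with the terms, the sketch below shows one common way to compute the empirical (Shannon) entropy of a token sequence and to fit a straight line through (entropy, score) pairs with ordinary least squares. This is an illustrative assumption about what such an analysis could look like; the paper's exact entropy estimator and memorization score are not given in the abstract, and the numbers here are synthetic.

```python
import math
from collections import Counter

def empirical_entropy(tokens):
    """Shannon entropy (bits per token) of the observed token frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def fit_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

# Character-level toy sequences (illustrative only).
h_constant = empirical_entropy(list("aaaa"))  # one symbol, no uncertainty
h_binary = empirical_entropy(list("abab"))    # two equally likely symbols
h_uniform = empirical_entropy(list("abcd"))   # four equally likely symbols

# Synthetic points that lie exactly on a line.
slope, intercept = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

A sequence drawn from a small symbol set can look random to a human yet have low empirical entropy under this frequency-based definition, which is consistent with the "gibberish" observation reported above.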

[46] A Survey on Prompt Tuning

Zongqian Li,Yixuan Su,Nigel Collier

Main category: cs.CL

TL;DR: This paper surveys prompt tuning techniques for efficiently adapting language models, categorizing them into direct and transfer learning methods while highlighting challenges and future opportunities.

Details Motivation: To provide an overview and systematic classification of prompt tuning methods for efficient adaptation of frozen language models. Method: A survey and classification of existing prompt tuning approaches into direct prompt learning and transfer learning categories, analyzing their designs, innovations, advantages, and disadvantages. Result: Classification of prompt tuning methods, detailed analysis of each approach, and identification of challenges and future research directions. Conclusion: Prompt tuning is a parameter-efficient method for adapting language models, with challenges in computational efficiency and training stability identified, along with future directions for improvement. Abstract: This survey reviews prompt tuning, a parameter-efficient approach for adapting language models by prepending trainable continuous vectors while keeping the model frozen. We classify existing approaches into two categories: direct prompt learning and transfer learning. Direct prompt learning methods include: general optimization approaches, encoder-based methods, decomposition strategies, and mixture-of-experts frameworks. Transfer learning methods consist of: general transfer approaches, encoder-based methods, and decomposition strategies. For each method, we analyze method designs, innovations, insights, advantages, and disadvantages, with illustrative visualizations comparing different frameworks. We identify challenges in computational efficiency and training stability, and discuss future directions in improving training robustness and broadening application scope.

[47] NeoBabel: A Multilingual Open Tower for Visual Generation

Mohammad Mahdi Derakhshani,Dheeraj Varghese,Marzieh Fadaee,Cees G. M. Snoek

Main category: cs.CL

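TL;DR: NeoBabel is a new multilingual image generation framework that supports six languages and sets a new standard in performance, efficiency, and inclusivity.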

Details Motivation: Existing text-to-image systems rely mainly on translation pipelines, which introduce semantic drift, computational overhead, and cultural misalignment; NeoBabel aims to address these shortcomings and improve the experience of non-English speakers. Method: The model is trained with a combination of large-scale multilingual pretraining and high-resolution instruction tuning, and two new metrics are introduced to assess multilingual alignment and robustness. Result: NeoBabel scores 0.75 on m-GenEval and 0.68 on m-DPG, significantly outperforming existing models on these multilingual benchmarks while retaining strong English capability, and is 2-4x smaller. Conclusion: Multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI. Abstract: Text-to-image generation advancements have been predominantly English-centric, creating barriers for non-English speakers and perpetuating digital inequities. While existing systems rely on translation pipelines, these introduce semantic drift, computational overhead, and cultural misalignment. We introduce NeoBabel, a novel multilingual image generation framework that sets a new Pareto frontier in performance, efficiency and inclusivity, supporting six languages: English, Chinese, Dutch, French, Hindi, and Persian. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. To evaluate its capabilities, we expand two English-only benchmarks to multilingual equivalents: m-GenEval and m-DPG. NeoBabel achieves state-of-the-art multilingual performance while retaining strong English capability, scoring 0.75 on m-GenEval and 0.68 on m-DPG. Notably, it performs on par with leading models on English tasks while outperforming them by +0.11 and +0.09 on multilingual benchmarks, even though these models are built on multilingual base LLMs. This demonstrates the effectiveness of our targeted alignment training for preserving and extending crosslingual generalization. We further introduce two new metrics to rigorously assess multilingual alignment and robustness to code-mixed prompts. Notably, NeoBabel matches or exceeds English-only models while being 2-4x smaller. We release an open toolkit, including all code, model checkpoints, a curated dataset of 124M multilingual text-image pairs, and standardized multilingual evaluation protocols, to advance inclusive AI research. Our work demonstrates that multilingual capability is not a trade-off but a catalyst for improved robustness, efficiency, and cultural fidelity in generative AI.

[48] Coding Triangle: How Does Large Language Model Understand Code?

Taolin Zhang,Zihan Ma,Maosong Cao,Junnan Liu,Songyang Zhang,Kai Chen

Main category: cs.CL

TL;DR: The Code Triangle framework reveals limitations in LLM programming skills and proposes methods to improve their coding performance.

Details Motivation: Despite progress in code generation, the true programming competence of LLMs remains underexplored compared to human programmers. Method: The Code Triangle framework evaluates LLMs across editorial analysis, code implementation, and test case generation using competitive programming benchmarks. Result: LLMs show self-consistency across dimensions but lack diversity and robustness; errors cluster due to data bias and limited reasoning transfer. Conclusion: Incorporating human-generated content and leveraging model mixtures can enhance LLMs' coding performance and robustness, while self-reflection may help future improvements. Abstract: Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can substantially enhance both the performance and robustness of LLMs. Furthermore, we reveal both the consistency and inconsistency in the cognition of LLMs that may facilitate self-reflection and self-improvement, providing a potential direction for developing more powerful coding models.

[49] Skywork-R1V3 Technical Report

Wei Shen,Jiangbo Pei,Yi Peng,Xuchen Song,Yang Liu,Jian Peng,Haofeng Sun,Yunzhuo Hao,Peiyu Wang,Yahui Zhou

Main category: cs.CL

TL;DR: Skywork-R1V3 is an advanced open-source vision-language model that utilizes a post-training reinforcement learning framework to transfer reasoning skills from text-based models to visual tasks, achieving state-of-the-art results and rivaling closed-source models.

Details Motivation: The motivation is to pioneer a new approach to visual reasoning by transferring reasoning skills from text-only Large Language Models to visual tasks, aiming to achieve state-of-the-art performance without further pre-training. Method: The method involves an elaborate post-training reinforcement learning framework that enhances the model's reasoning ability without additional pre-training. The approach also focuses on the connector module for cross-modal alignment and uses entropy of critical reasoning tokens for checkpoint selection during training. Result: Skywork-R1V3 achieves state-of-the-art results on MMMU, improving significantly from 64.3% to 76.0%, matching entry-level human capabilities and allowing even the 38B parameter model to rival top closed-source models. Conclusion: Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing the potential of reinforcement learning as a powerful engine for advancing open-source vision-language model capabilities. Abstract: We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model's reasoning ability, without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
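
The abstract names the entropy of critical reasoning tokens as a checkpoint-selection indicator but does not say how those tokens are chosen or how the entropies are aggregated. The sketch below only illustrates the underlying quantity, under stated assumptions: the per-position next-token distributions and the list of "critical" positions are hypothetical inputs, and averaging is an illustrative choice, not necessarily the report's procedure.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_critical_entropy(distributions, critical_positions):
    """Average entropy over the positions flagged as critical.

    `critical_positions` is a hypothetical input: the abstract does not
    specify how critical reasoning tokens are identified.
    """
    values = [token_entropy(distributions[i]) for i in critical_positions]
    return sum(values) / len(values)

# Toy distributions: a certain token, a coin flip, and a uniform 4-way choice.
dists = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]
score = mean_critical_entropy(dists, [1, 2])
```

Under this reading, one would compute such a score per checkpoint on a fixed prompt set and compare checkpoints; which direction counts as better is not stated in the abstract.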

[50] CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

Zhongyuan Peng,Yifan Yao,Kaijing Ma,Shuyue Guo,Yizhe Li,Yichi Zhang,Chenchen Zhang,Yifan Zhang,Zhouliang Yu,Luming Li,Minghao Liu,Yihang Xia,Jiawei Shen,Yuchen Wu,Yixin Cao,Zhaoxiang Zhang,Wenhao Huang,Jiaheng Liu,Ge Zhang

Main category: cs.CL

TL;DR: This paper proposes CriticLean, a new critic-guided reinforcement learning framework for evaluating the semantic fidelity of natural-language mathematical statements translated into formal code, and constructs FineLeanCorpus, a dataset of over 285K problems.

Details Motivation: Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. Existing work has focused mainly on generation and compilation success, with little attention to whether the generated formalizations truly capture the semantic intent of the original problem. Method: The paper proposes the CriticLean framework, which includes CriticLeanGPT models trained via supervised fine-tuning and reinforcement learning, and CriticLeanBench, a benchmark that measures a model's ability to distinguish correct from incorrect formalizations. Result: Experiments show that CriticLeanGPT significantly outperforms open- and closed-source baselines on CriticLeanBench; the authors also construct FineLeanCorpus, a dataset with rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Conclusion: Optimizing the critic phase is essential for producing reliable formalizations, and the CriticLean framework offers valuable insights for future work on formal mathematical reasoning. Abstract: Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase: the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models' ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.

[51] DS@GT at CheckThat! 2025: Detecting Subjectivity via Transfer-Learning and Corrective Data Augmentation

Maximilian Heil,Dionne Bang

Main category: cs.CL

TL;DR: This paper proposes combining encoder transfer learning with data augmentation to improve subjectivity detection in English news text.

Details Motivation: To explore the effectiveness of transfer learning and stylistic data augmentation for classifying subjective and objective sentences in English news text. Method: The study contrasts fine-tuning of pre-trained encoders with transfer learning from transformer models already fine-tuned on related tasks, and uses GPT-4o to generate paraphrases in predefined subjectivity styles. Result: Transfer learning outperforms fine-tuning of general-purpose encoders, and carefully curated augmentation significantly improves model robustness, especially in detecting subjective content. The team ranked 16th of 24 participants. Conclusion: Combining encoder specialization with label-consistent data augmentation effectively improves subjectivity detection. Abstract: This paper presents our submission to Task 1, Subjectivity Detection, of the CheckThat! Lab at CLEF 2025. We investigate the effectiveness of transfer-learning and stylistic data augmentation to improve classification of subjective and objective sentences in English news text. Our approach contrasts fine-tuning of pre-trained encoders with transfer-learning from transformers fine-tuned on related tasks. We also introduce a controlled augmentation pipeline using GPT-4o to generate paraphrases in predefined subjectivity styles. To ensure label and style consistency, we employ the same model to correct and refine the generated samples. Results show that transfer-learning of specialized encoders outperforms fine-tuning general-purpose ones, and that carefully curated augmentation significantly enhances model robustness, especially in detecting subjective content. Our official submission placed us 16th of 24 participants. Overall, our findings underscore the value of combining encoder specialization with label-consistent augmentation for improved subjectivity detection. Our code is available at https://github.com/dsgt-arc/checkthat-2025-subject.

[52] DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification

Maximilian Heil,Aleksandar Pramov

Main category: cs.CL

TL;DR: This paper evaluates techniques for improving the fact-checking of numerical claims and finds that evidence quality is more important than model architecture or input length.

Details Motivation: Numerical claims present unique challenges to automated fact-checking systems, requiring specialized approaches to improve accuracy in determining claim veracity. Method: The researchers evaluated modeling strategies using the QuanTemp dataset and developed an evidence retrieval pipeline. They tested the impact of longer input contexts with ModernBERT and R2L tokenization on NLI tasks. Result: Contrary to expectations, neither R2L tokenization nor a longer context window significantly improved classification performance. The best system achieved a macro-average F1 score of 0.57 and ranked among the Top-4 submissions in Task 3 of CheckThat! 2025. Conclusion: The study concludes that evidence quality is the main limiting factor in improving veracity prediction for numerical claims, rather than context length or tokenization methods. Abstract: Numerical claims, statements involving quantities, comparisons, and temporal references, pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidence with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) of numerical tasks. A longer context window does not enhance veracity prediction performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.

[53] UQLM: A Python Package for Uncertainty Quantification in Large Language Models

Dylan Bouchard,Mohit Singh Chauhan,David Skarbrevik,Ho-Kyeong Ra,Viren Bajaj,Zeya Ahmad

Main category: cs.CL

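TL;DR: This paper introduces UQLM, a Python toolkit for detecting hallucinations in large language models that uses state-of-the-art uncertainty quantification techniques to provide response-level confidence scores.

Details Motivation: Large language models generate false or misleading content, which seriously undermines the safety and trustworthiness of downstream applications, so effective detection tools are needed. Method: The authors developed UQLM, a Python toolkit that integrates uncertainty-quantification-based scorers to compute response-level confidence scores. Result: It provides an off-the-shelf solution that can be easily integrated into existing systems to improve the reliability of LLM outputs. Conclusion: UQLM is an effective tool for improving the safety and trustworthiness of LLM outputs. Abstract: Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.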

[54] A Survey on Latent Reasoning

Rui-Jie Zhu,Tianhao Peng,Tianhao Cheng,Xingwei Qu,Jinfa Huang,Dawei Zhu,Hao Wang,Kaiwen Xue,Xuanliang Zhang,Yong Shan,Tianle Cai,Taylor Kergan,Assel Kembay,Andrew Smith,Chenghua Lin,Binh Nguyen,Yuqi Pan,Yuhong Chou,Zefan Cai,Zhenhe Wu,Yongchi Zhao,Tianyu Liu,Jian Yang,Wangchunshu Zhou,Chujie Zheng,Chongxuan Li,Yuyin Zhou,Zhoujun Li,Zhaoxiang Zhang,Jiaheng Liu,Ge Zhang,Wenhao Huang,Jason Eshraghian

Main category: cs.CL

TL;DR: This paper discusses latent reasoning, a new approach that goes beyond conventional chain-of-thought by performing multi-step reasoning in the model's hidden states, thereby improving the model's expressive capacity and accuracy.

Details Motivation: Chain-of-thought (CoT) improves interpretability and accuracy, but its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning addresses this bottleneck by performing multi-step reasoning entirely in the model's continuous hidden state. Method: The survey first examines the role of neural network layers as the computational substrate for reasoning, then reviews a range of latent reasoning methods, such as activation-based recurrence, hidden state propagation, and fine-tuning strategies, and discusses advanced paradigms such as infinite-depth latent reasoning via masked diffusion models. Result: The paper provides a comprehensive overview of the latent reasoning field, covering recent methods and techniques, along with a GitHub repository collecting the latest papers and resources. Conclusion: By unifying different perspectives, the paper aims to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.

[55] DS@GT at CheckThat! 2025: Ensemble Methods for Detection of Scientific Discourse on Social Media

Ayush Parikh,Hoang Thanh Thanh Truong,Jeanette Schofield,Maximilian Heil

Main category: cs.CL

TL;DR: The DS@GT team took part in CLEF 2025 CheckThat! Task 4a, Scientific Web Discourse Detection, with three modeling approaches (transformer fine-tuning, few-shot prompting, and an ensemble model), placing 7th with a macro-averaged F1 score of 0.8611.

Details Motivation: The paper addresses a multiclass classification task: determining whether a tweet contains a scientific claim, a reference to a scientific study or publication, or mentions of scientific entities such as a university or a scientist. Method: The study uses transformer fine-tuning, few-shot prompting of LLMs, and a combined ensemble model. Result: The DS@GT team placed 7th in the competition with a macro-averaged F1 score of 0.8611, improving on the DeBERTaV3 baseline of 0.8375. Conclusion: The DS@GT team applied three modeling approaches to scientific web discourse detection in CLEF 2025 CheckThat! Task 4a and designed an ensemble model, achieving a macro-averaged F1 score of 0.8611 and outperforming the DeBERTaV3 baseline. Abstract: In this paper, we, as the DS@GT team for CLEF 2025 CheckThat! Task 4a Scientific Web Discourse Detection, present the methods we explored for this task. For this multiclass classification task, we determined if a tweet contained a scientific claim, a reference to a scientific study or publication, and/or mentions of scientific entities, such as a university or a scientist. We present 3 modeling approaches for this task: transformer finetuning, few-shot prompting of LLMs, and a combined ensemble model whose design was informed by earlier experiments. Our team placed 7th in the competition, achieving a macro-averaged F1 score of 0.8611, an improvement over the DeBERTaV3 baseline of 0.8375. Our code is available on GitHub at https://github.com/dsgt-arc/checkthat-2025-swd/tree/main/subtask-4a.

[56] Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Zhiyuan Peng,Ting-ruen Wei,Tingyu Song,Yilun Zhao,Yi Fang

Main category: cs.CL

TL;DR: This paper introduces E2R-FLOPs, a new framework using RPP and QPP metrics, to better evaluate the efficiency and effectiveness of LLM-based rerankers in information retrieval.

Details Motivation: Existing efficiency metrics for LLM-based rerankers are hardware-dependent and do not account for model size, making it difficult to interpret results and evaluate the efficiency-effectiveness trade-off. This necessitates a more standardized and interpretable evaluation framework. Method: The authors propose new metrics, RPP (ranking metrics per PetaFLOP) and QPP (queries per PetaFLOP), along with an interpretable FLOPs estimator to assess computational efficiency without requiring experiments. These are applied to evaluate various LLM-based rerankers comprehensively. Result: The proposed E2R-FLOPs framework enables a clearer understanding of the efficiency-effectiveness trade-off across different LLM-based rerankers, offering insights into their computational demands independent of hardware specifics. Conclusion: The paper concludes that the proposed E2R-FLOPs metrics provide a more interpretable and hardware-agnostic way to evaluate the efficiency-effectiveness trade-off in LLM-based rerankers, bringing attention to this critical issue in the research community. Abstract: Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. However, these metrics depend on hardware and running-time choices (e.g., parallel or not, batch size, etc.), and often fail to account for model size, making them difficult to interpret and obscuring the evaluation of the efficiency-effectiveness tradeoff. To address this issue, we propose E2R-FLOPs for LLM-based rerankers: ranking metrics per PetaFLOP (RPP) for relevance per compute and queries per PetaFLOP (QPP) for hardware-agnostic throughput. Accompanying the new metrics, an interpretable FLOPs estimator is built to estimate the FLOPs of an LLM-based reranker even without running any experiments. Based on the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
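
The abstract describes RPP as a ranking metric per PetaFLOP and QPP as queries per PetaFLOP without giving formulas. A minimal sketch of one plausible reading follows. The function names are hypothetical, and `approx_forward_flops` uses the common rule of thumb of about 2 FLOPs per parameter per processed token; it stands in for, and is not, the paper's FLOPs estimator.

```python
PETA = 10 ** 15  # FLOPs in one PetaFLOP

def approx_forward_flops(n_params, n_tokens):
    """Rough forward-pass cost (assumption: ~2 FLOPs per parameter per token)."""
    return 2 * n_params * n_tokens

def queries_per_petaflop(flops_per_query):
    """QPP under this reading: how many queries one PetaFLOP of compute serves."""
    return PETA / flops_per_query

def ranking_per_petaflop(metric_value, flops_per_query):
    """RPP under this reading: ranking quality divided by PetaFLOPs spent per query."""
    return metric_value / (flops_per_query / PETA)

# Hypothetical reranker: 7B parameters, 1,000 tokens processed per query,
# and an illustrative ranking score of 0.70 (e.g. an NDCG-style metric).
flops = approx_forward_flops(7_000_000_000, 1_000)
qpp = queries_per_petaflop(flops)
rpp = ranking_per_petaflop(0.70, flops)
```

Because both quantities are derived from FLOPs rather than measured latency, they do not change with batch size or hardware, which is the property the abstract emphasizes.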

[57] Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Xiangru Tang,Tianrui Qin,Tianhao Peng,Ziyang Zhou,Daniel Shao,Tingting Du,Xinming Wei,Peng Xia,Fang Wu,He Zhu,Ge Zhang,Jiaheng Liu,Xingyao Wang,Sirui Hong,Chenglin Wu,Hao Cheng,Chi Wang,Wangchunshu Zhou

Main category: cs.CL

TL;DR: Agent KB enhances language agent performance through a shared knowledge base that facilitates cross-agent learning and strategy generalization.

Details Motivation: Language agents struggle with effective error correction and experience reuse across domains. Method: Introducing Agent KB, a hierarchical experience framework with a Reason-Retrieve-Refine pipeline, creating a shared knowledge base for cross-agent knowledge transfer. Result: Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points; Claude-3 improved from 38.46% to 57.69%, and GPT-4 improved from 53.49% to 73.26%. On SWE-bench code repair, Claude-3 improved from 41.33% to 53.33%. Conclusion: Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks. Abstract: As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other's experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.

cs.CV [Back]

[58] Structured Captions Improve Prompt Adherence in Text-to-Image Models (Re-LAION-Caption 19M)

Nicholas Merchant,Haitz Sáez de Ocáriz Borde,Andrei Cristian Popescu,Carlos Garcia Jurado Suarez

Main category: cs.CV

TL;DR: This paper introduces Re-LAION-Caption 19M, a structured caption dataset that improves text-to-image model alignment by enforcing consistent caption formatting during training.

Details Motivation: Generative text-to-image models struggle with prompt adherence due to noisy, unstructured datasets like LAION-5B, requiring heavy prompt engineering. Method: Created Re-LAION-Caption 19M, a high-quality dataset with structured captions following a four-part template (subject, setting, aesthetics, camera details), and fine-tuned PixArt-Σ and Stable Diffusion 2 on both structured and shuffled captions. Result: Structured captions consistently yielded higher text-image alignment scores compared to randomly shuffled captions when evaluated using VQA models. Conclusion: Enforcing consistent caption structure during training improves model controllability and alignment in text-to-image generation. Abstract: We argue that generative text-to-image models often struggle with prompt adherence due to the noisy and unstructured nature of large-scale datasets like LAION-5B. This forces users to rely heavily on prompt engineering to elicit desirable outputs. In this work, we propose that enforcing a consistent caption structure during training can significantly improve model controllability and alignment. We introduce Re-LAION-Caption 19M, a high-quality subset of Re-LAION-5B, comprising 19 million 1024x1024 images with captions generated by a Mistral 7B Instruct-based LLaVA-Next model. Each caption follows a four-part template: subject, setting, aesthetics, and camera details. We fine-tune PixArt-$\Sigma$ and Stable Diffusion 2 using both structured and randomly shuffled captions, and show that structured versions consistently yield higher text-image alignment scores using visual question answering (VQA) models. The dataset is publicly available at https://huggingface.co/datasets/supermodelresearch/Re-LAION-Caption19M.
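
The comparison between structured and randomly shuffled captions can be pictured with a small sketch. The field names follow the four-part template in the abstract; the example caption text, the space separator, and both helper functions are illustrative assumptions rather than the dataset's actual format.

```python
import random

# Order of the four-part template described in the abstract.
FIELDS = ("subject", "setting", "aesthetics", "camera")

def structured_caption(parts):
    """Join the caption parts in the fixed template order."""
    return " ".join(parts[name] for name in FIELDS)

def shuffled_caption(parts, rng):
    """Join the same parts in a random order (the control condition)."""
    order = list(FIELDS)
    rng.shuffle(order)
    return " ".join(parts[name] for name in order)

# Hypothetical caption parts, not taken from the dataset.
parts = {
    "subject": "A red fox sitting on a log.",
    "setting": "A snowy forest at dawn.",
    "aesthetics": "Soft light, muted colors.",
    "camera": "Shallow depth of field, 85mm lens.",
}
structured = structured_caption(parts)
shuffled = shuffled_caption(parts, random.Random(0))
```

Both variants carry identical information, so any difference in alignment scores after fine-tuning can be attributed to the consistency of the ordering.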

[59] CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection

Binjia Zhou,Hengrui Lou,Lizhe Chen,Haoyuan Li,Dawei Luo,Shuai Chen,Jie Lei,Zunlei Feng,Yijun Bei

Main category: cs.CV

TL;DR: CorrDetail is a novel visual-based deepfake detection framework that enhances detection accuracy through self-correction and fine-grained visual analysis, achieving top performance with strong interpretability.

Details Motivation: The increasing prevalence of facial deepfakes necessitates more effective and interpretable detection methods, as existing approaches suffer from lack of explanation or hallucination issues. Method: CorrDetail incorporates a visual detail enhanced self-correction framework with a fine-grained detail enhancement module and a fusion decision strategy to improve detection accuracy and reduce hallucinations. Result: Experimental results show that CorrDetail outperforms recent methodologies in both detection performance and precise identification of forgery details. Conclusion: The proposed CorrDetail framework achieves state-of-the-art performance in face forgery detection, effectively identifying forged details and offering robust generalization capabilities. Abstract: With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, thus amplifying the urgent need for effective deepfake detection. Existing techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to the issue of hallucinations. To address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is meticulously designed to rectify authentic forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Ultimately, a fusion decision strategy is devised to further augment the model's discriminative capacity in handling extreme samples, through the integration of visual information compensation and model bias reduction. Experimental results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.

[60] YOLO-APD: Enhancing YOLOv8 for Robust Pedestrian Detection on Complex Road Geometries

Aquino Joctum,John Kandiri

Main category: cs.CV

TL;DR: This study proposes YOLO-APD, a novel deep learning architecture for pedestrian detection in complex road environments.

Details Motivation: Autonomous vehicle perception systems need robust pedestrian detection on geometrically complex roads (such as Type-S curves), where conventional RGB camera-based methods face limitations. Method: YOLO-APD integrates a SimAM attention mechanism, C3Ghost modules, a SimSPPF module, the Mish activation function, and an IGD module. It also introduces the concept of using vehicle steering dynamics for adaptive region-of-interest processing. Result: On a custom CARLA dataset it reaches 77.7% mAP@0.5:0.95 and pedestrian recall above 96%, with real-time processing at 100 FPS. Ablation studies validate the synergistic contribution of each component. Conclusion: YOLO-APD achieves state-of-the-art detection accuracy while maintaining real-time processing, showing a strong balance between accuracy and efficiency. The work advances highly accurate, efficient, and adaptable perception systems based on cost-effective sensors. Abstract: Autonomous vehicle perception systems require robust pedestrian detection, particularly on geometrically complex roadways like Type-S curved surfaces, where standard RGB camera-based methods face limitations. This paper introduces YOLO-APD, a novel deep learning architecture enhancing the YOLOv8 framework specifically for this challenge. YOLO-APD integrates several key architectural modifications: a parameter-free SimAM attention mechanism, computationally efficient C3Ghost modules, a novel SimSPPF module for enhanced multi-scale feature pooling, the Mish activation function for improved optimization, and an Intelligent Gather & Distribute (IGD) module for superior feature fusion in the network's neck. The concept of leveraging vehicle steering dynamics for adaptive region-of-interest processing is also presented. Comprehensive evaluations on a custom CARLA dataset simulating complex scenarios demonstrate that YOLO-APD achieves state-of-the-art detection accuracy, reaching 77.7% mAP@0.5:0.95 and exceptional pedestrian recall exceeding 96%, significantly outperforming baseline models, including YOLOv8. Furthermore, it maintains real-time processing capabilities at 100 FPS, showcasing a superior balance between accuracy and efficiency. Ablation studies validate the synergistic contribution of each integrated component. Evaluation on the KITTI dataset confirms the architecture's potential while highlighting the need for domain adaptation. This research advances the development of highly accurate, efficient, and adaptable perception systems based on cost-effective sensors, contributing to enhanced safety and reliability for autonomous navigation in challenging, less-structured driving environments.

[61] Foreground-aware Virtual Staining for Accurate 3D Cell Morphological Profiling

Alexandr A. Kalinin,Paula Llanos,Theresa Maria Sommer,Giovanni Sestini,Xinhai Hou,Jonathan Z. Sexton,Xiang Wan,Ivo D. Dinov,Brian D. Athey,Nicolas Rivron,Anne E. Carpenter,Beth Cimini,Shantanu Singh,Matthew J. O'Meara

Main category: cs.CV

TL;DR: Spotlight is an improved virtual staining method that guides the model to focus on relevant cellular structures, improving morphological representation and practical utility.

Details Motivation: Existing virtual staining methods typically rely on loss functions that treat all pixels equally, reproducing background noise and artifacts instead of focusing on biologically meaningful signals. Method: Spotlight uses histogram-based foreground estimation to mask the pixel-wise loss and computes a Dice loss on soft-thresholded predictions for shape-aware learning. Result: Applied to a 3D benchmark dataset, Spotlight improves morphological representation and produces virtual stains better suited for downstream tasks such as segmentation and profiling. Conclusion: By focusing on relevant cellular structures, Spotlight improves the morphological representation of virtual staining while preserving pixel-level accuracy. Abstract: Microscopy enables direct observation of cellular morphology in 3D, with transmitted-light methods offering low-cost, minimally invasive imaging and fluorescence microscopy providing specificity and contrast. Virtual staining combines these strengths by using machine learning to predict fluorescence images from label-free inputs. However, training of existing methods typically relies on loss functions that treat all pixels equally, thus reproducing background noise and artifacts instead of focusing on biologically meaningful signals. We introduce Spotlight, a simple yet powerful virtual staining approach that guides the model to focus on relevant cellular structures. Spotlight uses histogram-based foreground estimation to mask pixel-wise loss and to calculate a Dice loss on soft-thresholded predictions for shape-aware learning. Applied to a 3D benchmark dataset, Spotlight improves morphological representation while preserving pixel-level accuracy, resulting in virtual stains better suited for downstream tasks such as segmentation and profiling.
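
The two loss terms named in the abstract, a foreground-masked pixel-wise loss and a Dice loss on soft-thresholded predictions, can be sketched on a flattened volume. Everything specific here is an assumption for illustration: the threshold is passed in directly instead of being estimated from a histogram, the soft threshold is a sigmoid with a hypothetical temperature, and the unweighted sum of the two terms is not taken from the paper.

```python
import math

def foreground_mask(target, threshold):
    """Binary mask: 1.0 where the target intensity exceeds the threshold."""
    return [1.0 if t > threshold else 0.0 for t in target]

def masked_mse(pred, target, mask):
    """Mean squared error computed over foreground voxels only."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / max(sum(mask), 1.0)

def soft_dice_loss(pred, mask, threshold, temperature=0.1, eps=1e-6):
    """Dice loss between a sigmoid-softened prediction and the foreground mask."""
    soft = [1.0 / (1.0 + math.exp(-(p - threshold) / temperature)) for p in pred]
    overlap = sum(s * m for s, m in zip(soft, mask))
    return 1.0 - (2.0 * overlap + eps) / (sum(soft) + sum(mask) + eps)

def foreground_aware_loss(pred, target, threshold, dice_weight=1.0):
    """Masked pixel-wise term plus a weighted shape-aware Dice term."""
    mask = foreground_mask(target, threshold)
    return masked_mse(pred, target, mask) + dice_weight * soft_dice_loss(pred, mask, threshold)

target = [0.0, 0.1, 0.9, 1.0]     # two background and two foreground voxels
noisy_bg = [0.3, 0.1, 0.9, 1.0]   # error only in the background
inverted = [1.0, 0.9, 0.1, 0.0]   # foreground and background swapped
mask = foreground_mask(target, 0.5)
```

In this toy case the background-only error in `noisy_bg` contributes nothing to the masked term, while the `inverted` prediction is penalized by both terms.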

[62] From General to Specialized: The Need for Foundational Models in Agriculture

Vishal Nedungadi,Xingguo Xiong,Aike Potze,Ron Van Bree,Tao Lin,Marc Rußwurm,Ioannis N. Athanasiadis

Main category: cs.CV

TL;DR: This paper evaluates the use of foundation models in agriculture and argues for a specialized model to better address food security challenges.

Details Motivation: Food security is a growing concern due to population growth and climate change, and innovative solutions are required. Foundation models have shown potential in related fields but remain under-explored for specific agricultural challenges. Method: The authors surveyed and compared general-purpose foundation models within a requirements framework for an ideal agricultural foundation model (CropFM) and empirically evaluated two models on three agriculture-specific tasks. Result: The study quantitatively evaluated the effectiveness of current foundation models in agricultural tasks such as crop type mapping, phenology estimation, and yield estimation, and identified the need for a specialized agricultural foundation model. Conclusion: The paper concludes that although existing foundation models show promise, a dedicated foundational model tailored to agriculture is needed. Abstract: Food security remains a global concern as the population grows and climate change intensifies, demanding innovative solutions for sustainable agricultural productivity. Recent advances in foundation models have demonstrated remarkable performance in remote sensing and climate sciences, and therefore offer new opportunities for agricultural monitoring. However, their application in challenges related to agriculture, such as crop type mapping, crop phenology estimation, and crop yield estimation, remains under-explored. In this work, we quantitatively evaluate existing foundational models to assess their effectiveness for a representative set of agricultural tasks. From an agricultural domain perspective, we describe a requirements framework for an ideal agricultural foundation model (CropFM). We then survey and compare existing general-purpose foundational models in this framework and empirically evaluate two exemplary models on three representative agriculture-specific tasks. Finally, we highlight the need for a dedicated foundational model tailored specifically to agriculture.

[63] Enhancing Underwater Images Using Deep Learning with Subjective Image Quality Integration

Jose M. Montero,Jose-Luis Lisani

Main category: cs.CV

TL;DR: This paper proposes a deep learning method for underwater image enhancement that incorporates human subjective evaluations, achieving notable improvements in both objective metrics and visual quality.

Details Motivation: Underwater images often suffer from degradation in quality due to environmental factors, and traditional enhancement methods may not align with human perception. This work aims to improve underwater image quality by integrating human subjective assessments into the training process of deep learning models. Method: A two-step deep learning method is employed: first, a classifier network distinguishes between high- and low-quality underwater images; second, generative adversarial networks (GANs) are trained to enhance low-quality images using various criteria. The results are evaluated using quantitative metrics (PSNR, SSIM, UIQM) and qualitative analysis. Result: The GAN-based enhancement model shows significant improvements in image quality, both quantitatively and qualitatively, especially when criteria such as color fidelity and sharpness are used. Conclusion: The proposed deep learning approach, particularly when incorporating criteria like color fidelity and image sharpness, substantially improves both perceived and measured underwater image quality. Abstract: Recent advances in deep learning, particularly neural networks, have significantly impacted a wide range of fields, including the automatic enhancement of underwater images. This paper presents a deep learning-based approach to improving underwater image quality by integrating human subjective assessments into the training process. To this end, we utilize publicly available datasets containing underwater images labeled by experts as either high or low quality. Our method involves first training a classifier network to distinguish between high- and low-quality images. Subsequently, generative adversarial networks (GANs) are trained using various enhancement criteria to refine the low-quality images. The performance of the GAN models is evaluated using quantitative metrics such as PSNR, SSIM, and UIQM, as well as through qualitative analysis. 
Results demonstrate that the proposed model -- particularly when incorporating criteria such as color fidelity and image sharpness -- achieves substantial improvements in both perceived and measured image quality.
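
Of the metrics named above, PSNR is the simplest to state: it is 10·log10(MAX² / MSE) between a reference image and an enhanced image. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `psnr` and the nested-list grayscale image format are assumptions chosen to keep the example dependency-free.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized grayscale images.

    Images are given as lists of rows of pixel intensities (an assumption
    made for this sketch; real pipelines typically use array libraries).
    """
    if len(reference) != len(enhanced):
        raise ValueError("images must have the same number of rows")
    squared_error = 0.0
    count = 0
    for ref_row, enh_row in zip(reference, enhanced):
        if len(ref_row) != len(enh_row):
            raise ValueError("rows must have the same length")
        for r, e in zip(ref_row, enh_row):
            squared_error += (r - e) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the enhanced image is closer to the reference. SSIM and UIQM measure structural and perceptual properties and are not covered by this sketch.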

[64] pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models

Sajjad Ghiasvand,Mahnoosh Alizadeh,Ramtin Pedarsani

Main category: cs.CV

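TL;DR: This paper proposes pFedMMA, a personalized federated learning framework that balances personalization and generalization on vision-language tasks and outperforms existing methods on multiple datasets.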

Details Motivation: Existing personalized federated learning methods achieve personalization at the cost of generalization and perform especially poorly on unseen classes or domains, so a more effective approach is needed. Method: pFedMMA is a personalized federated learning framework that uses multi-modal adapters and an asymmetric optimization strategy, letting clients adapt locally to personalized data distributions while collaboratively training a shared projection to improve global generalization. Result: Extensive experiments on 11 datasets, including domain- and label-shift scenarios, show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization and outperforms recent federated prompt tuning methods. Conclusion: pFedMMA is a personalized federated learning framework that balances personalization and generalization on vision-language tasks while remaining communication-efficient. Abstract: Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our asymmetric optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods. The code is available at https://github.com/sajjad-ucsb/pFedMMA.
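
The communication pattern described in the abstract, where only the shared projection is exchanged while modality-specific layers stay on each client, can be illustrated with a toy aggregation step. This is a sketch under stated assumptions: the abstract does not specify the aggregation rule, so plain unweighted averaging is assumed here, and the dictionary keys `local` and `shared` and the function `average_shared` are hypothetical names, not taken from the pFedMMA code.

```python
def average_shared(client_adapters):
    """Average only the 'shared' weights across clients.

    Each client is a dict with a 'local' weight list (never communicated)
    and a 'shared' weight list (sent to the server each round).
    Unweighted averaging is an assumption of this sketch.
    """
    num_clients = len(client_adapters)
    dim = len(client_adapters[0]["shared"])
    mean_shared = [
        sum(client["shared"][i] for client in client_adapters) / num_clients
        for i in range(dim)
    ]
    # Broadcast the aggregated shared weights back; local weights are untouched.
    for client in client_adapters:
        client["shared"] = list(mean_shared)
    return mean_shared


clients = [
    {"local": [1.0, 2.0], "shared": [0.0, 4.0]},
    {"local": [9.0, 8.0], "shared": [2.0, 0.0]},
]
average_shared(clients)
```

After the call, both clients hold the same shared weights, while their `local` weights still differ. The per-round communication cost therefore scales with the size of the shared projection only.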

[65] Neural-Driven Image Editing

Pengfei Zhou,Jie Xia,Xiaopeng Peng,Wangbo Zhao,Zilong Ye,Zekai Li,Suorong Yang,Jiadong Pan,Yuanxiang Chen,Ziqiao Wang,Kai Wang,Qian Zheng,Xiaojun Chang,Gang Pan,Shurong Dong,Kaipeng Zhang,Yang You

Main category: cs.CV

TL;DR: LoongX is a hands-free image editing technique that uses neurophysiological signals, achieving results comparable to text-driven methods and offering promising advancements for accessible image editing.

Details Motivation: Traditional image editing is labor-intensive and inaccessible to individuals with limited motor control or language abilities. Recent advances in brain-computer interfaces (BCIs) and generative models provide an opportunity to develop a hands-free image editing approach. Method: LoongX integrates a cross-scale state space (CS3) module and a dynamic gated fusion (DGF) module to encode and aggregate modality-specific features, aligning them with edit semantics via fine-tuning on a diffusion transformer (DiT). The encoders are pre-trained using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Result: LoongX achieves performance comparable to text-driven methods and outperforms them when neural signals are combined with speech. Extensive experiments demonstrate its effectiveness in translating neurophysiological signals into image editing commands. Conclusion: LoongX represents a significant advancement in hands-free image editing by utilizing multimodal neurophysiological signals and demonstrates the potential of neural-driven generative models in enabling accessible and intuitive image editing. Abstract: Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. 
The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.

[66] Motion Generation: A Survey of Generative Approaches and Benchmarks

Aliasghar Khani,Arianna Rampini,Bruno Roy,Larasika Nadela,Noa Kaplan,Evan Atherton,Derek Cheung,Jacky Bibliowicz

Main category: cs.CV

TL;DR: This survey reviews recent advancements in motion generation by analyzing methods, architectures, and evaluation practices, aiming to provide a foundational reference for researchers.

Details Motivation: With the rapid development of motion generation techniques using various modeling paradigms, there is a growing need for a structured and comprehensive review that examines these methods based on their underlying generative strategies. Method: The authors categorized motion generation methods based on their generative approaches and analyzed key aspects like architecture, conditioning mechanisms, and datasets used. The focus was on top-tier papers published since 2023. Result: An in-depth categorization of motion generation methods, along with an analysis of architectures, conditioning inputs, and evaluation practices, is presented to enable better comparisons and identify open challenges. Conclusion: The paper provides a comprehensive review of recent motion generation methods, focusing on their generative strategies, architectural principles, and evaluation metrics. It aims to establish a foundational reference for researchers in the field. Abstract: Motion generation, the task of synthesizing realistic motion sequences from various conditioning inputs, has become a central problem in computer vision, computer graphics, and robotics, with applications ranging from animation and virtual agents to human-robot interaction. As the field has rapidly progressed with the introduction of diverse modeling paradigms including GANs, autoencoders, autoregressive models, and diffusion-based techniques, each approach brings its own advantages and limitations. This growing diversity has created a need for a comprehensive and structured review that specifically examines recent developments from the perspective of the generative approach employed. In this survey, we provide an in-depth categorization of motion generation methods based on their underlying generative strategies. Our main focus is on papers published in top-tier venues since 2023, reflecting the most recent advancements in the field. 
In addition, we analyze architectural principles, conditioning mechanisms, and generation settings, and compile a detailed overview of the evaluation metrics and datasets used across the literature. Our objective is to enable clearer comparisons and identify open challenges, thereby offering a timely and foundational reference for researchers and practitioners navigating the rapidly evolving landscape of motion generation.

[67] Mastering Regional 3DGS: Locating, Initializing, and Editing with Diverse 2D Priors

Lanqing Guo,Yufei Wang,Hezhen Hu,Yan Zheng,Yeying Jin,Siyu Huang,Zhangyang Wang

Main category: cs.CV

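TL;DR: This paper proposes a local editing method for 3D Gaussian Splatting scenes based on 2D diffusion editing and inverse rendering. It enables efficient, precise regional edits, reaches state-of-the-art performance, and delivers up to a 4x speedup.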

Details Motivation: 3D semantic parsing often underperforms its 2D counterpart, which makes targeted manipulation in 3D space harder and limits the fidelity of edits. This paper aims to address that problem. Method: 2D diffusion editing is used to accurately identify the modified regions in each view, followed by inverse rendering for 3D localization. The frontal view is then refined and a coarse 3DGS is initialized with consistent views and approximate shapes derived from predicted depth maps, which supports an iterative, view-consistent editing process. Result: Experiments show that the method achieves state-of-the-art performance while delivering up to a 4x speedup, providing a more efficient and effective approach to local 3D scene editing. Conclusion: The method can substantially improve the efficiency and quality of local 3D scene editing and offers a valuable reference for future related research. Abstract: Many 3D scene editing tasks focus on modifying local regions rather than the entire scene, except for some global applications like style transfer. In the context of 3D Gaussian Splatting (3DGS), where scenes are represented by a series of Gaussians, this structure allows for precise regional edits, offering enhanced control over specific areas of the scene. However, 3D semantic parsing often underperforms compared to its 2D counterpart, making targeted manipulations within 3D spaces more difficult and limiting the fidelity of edits. We address this by leveraging 2D diffusion editing to accurately identify modification regions in each view, followed by inverse rendering for 3D localization. We then refine the frontal view and initialize a coarse 3DGS with consistent views and approximate shapes derived from depth maps predicted by a 2D foundation model, thereby supporting an iterative, view-consistent editing process that gradually enhances structural details and textures to ensure coherence across perspectives. Experiments demonstrate that our method achieves state-of-the-art performance while delivering up to a $4\times$ speedup, providing a more efficient and effective approach to 3D scene local editing.

[68] OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Shiting Xiao,Rishabh Kabra,Yuhang Li,Donghyun Lee,Joao Carreira,Priyadarshini Panda

Main category: cs.CV

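TL;DR: OpenWorldSAM is a framework built on Segment Anything Model v2 that performs efficient semantic, instance, and panoptic segmentation in open-vocabulary settings and shows strong zero-shot generalization.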

Details Motivation: Segmenting objects from open-ended language prompts remains a major challenge: models must ground textual semantics into precise spatial masks while handling diverse and unseen categories. Method: The approach follows four key principles: unified prompting, efficiency, instance awareness, and generalization. It freezes the pre-trained components (SAM2 and the VLM) and introduces positional tie-breaker embeddings and cross-attention layers. Result: OpenWorldSAM trains only 4.5 million parameters on the COCO-stuff dataset, which makes it resource-efficient, and shows strong zero-shot capability and state-of-the-art performance on benchmarks including ADE20k, PASCAL, ScanNet, and SUN-RGBD. Conclusion: By integrating multi-modal embeddings from a lightweight vision-language model, OpenWorldSAM extends Segment Anything Model v2 (SAM2) to open-vocabulary scenarios and achieves state-of-the-art open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Abstract: The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model's spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks, including ADE20k, PASCAL, ScanNet, and SUN-RGBD.

[69] Robotic System with AI for Real Time Weed Detection, Canopy Aware Spraying, and Droplet Pattern Evaluation

Inayat Rasool,Pappu Kumar Yadav,Amee Parmar,Hasan Mirzakhaninafchi,Rikesh Budhathoki,Zain Ul Abideen Usmani,Supriya Paudel,Ivan Perez Olivera,Eric Jone

Main category: cs.CV

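TL;DR: This paper presents an intelligent herbicide spraying system based on AI and embedded hardware that performs weed detection and spray control in indoor trials, with the potential to reduce chemical use.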

Details Motivation: Uniform and excessive herbicide application in modern agriculture raises costs, pollutes the environment, and drives the emergence of herbicide-resistant weeds, so a more precise spraying method is needed. Method: A vision-guided, AI-driven variable-rate sprayer was developed. It uses the YOLO11n and YOLO11n-seg deep learning models for weed detection and canopy segmentation, and an Arduino Uno controls solenoid-actuated nozzles in real time. Result: In indoor trials, the YOLO11n model reached a mAP@50 of 0.98 and the YOLO11n-seg segmentation model reached 0.48. Mean spray coverage varied with canopy size: 16.22%, 21.46%, and 21.65% for small, medium, and large canopies, respectively. Conclusion: The system combines deep learning with low-cost embedded hardware for precise herbicide spraying and shows potential to reduce chemical use and costs in modern agriculture. Abstract: Uniform and excessive herbicide application in modern agriculture contributes to increased input costs, environmental pollution, and the emergence of herbicide resistant weeds. To address these challenges, we developed a vision guided, AI-driven variable rate sprayer system capable of detecting weed presence, estimating canopy size, and dynamically adjusting nozzle activation in real time. The system integrates lightweight YOLO11n and YOLO11n-seg deep learning models, deployed on an NVIDIA Jetson Orin Nano for onboard inference, and uses an Arduino Uno-based relay interface to control solenoid actuated nozzles based on canopy segmentation results. Indoor trials were conducted using 15 potted Hibiscus rosa sinensis plants of varying canopy sizes to simulate a range of weed patch scenarios. The YOLO11n model achieved a mean average precision (mAP@50) of 0.98, with a precision of 0.99 and a recall close to 1.0. The YOLO11n-seg segmentation model achieved a mAP@50 of 0.48, precision of 0.55, and recall of 0.52. System performance was validated using water sensitive paper, which showed an average spray coverage of 24.22% in zones where canopy was present. An upward trend in mean spray coverage from 16.22% for small canopies to 21.46% and 21.65% for medium and large canopies, respectively, demonstrated the system's capability to adjust spray output based on canopy size in real time. These results highlight the potential of combining real time deep learning with low-cost embedded hardware for selective herbicide application. 
Future work will focus on expanding the detection capabilities to include three common weed species in South Dakota: water hemp (Amaranthus tuberculatus), kochia (Bassia scoparia), and foxtail (Setaria spp.), followed by further validation in both indoor and field trials within soybean and corn production systems.

[70] Driving as a Diagnostic Tool: Scenario-based Cognitive Assessment in Older Drivers From Driving Video

Md Zahid Hasan,Guillermo Basulto-Elias,Jun Ha Chang,Sahuna Hallmark,Matthew Rizzo,Anuj Sharma,Soumik Sarkar

Main category: cs.CV

TL;DR: This research uses driving behavior and AI to detect early cognitive decline in older adults, offering a faster, non-invasive alternative to traditional diagnostic methods.

Details Motivation: Cognitive decline, such as Alzheimer's disease and mild cognitive impairment, is often underdiagnosed due to the high cost and time required for current diagnostic methods. This research aims to develop an efficient, proactive approach by leveraging real-world driving behavior captured through in-vehicle systems. Method: The study proposes a framework that utilizes large vision models and naturalistic driving videos to analyze driver behavior, classify cognitive status, and predict disease progression by extracting 'digital fingerprints' linked to cognitive decline. Result: The method successfully identifies early warning signs of functional impairment in older drivers by analyzing naturalistic driving data, enabling early detection of cognitive decline and supporting the development of scalable monitoring systems. Conclusion: This paper concludes that scenario-based cognitive status identification using naturalistic driving videos and large vision models can effectively detect early signs of cognitive decline, offering a scalable and non-invasive solution for monitoring cognitive health in older drivers. Abstract: We introduce scenario-based cognitive status identification in older drivers from naturalistic driving videos and large vision models. Cognitive decline, including Alzheimer's disease (AD) and mild cognitive impairment (MCI), is often underdiagnosed due to the time-consuming and costly nature of current diagnostic methods. By analyzing real-world driving behavior captured through in-vehicle systems, this research aims to extract "digital fingerprints" that correlate with functional decline and clinical features of MCI and AD. Moreover, modern large vision models can draw meaningful insights from the everyday driving patterns of older patients to detect cognitive decline early. 
We propose a framework that uses large vision models and naturalistic driving videos to analyze driver behavior, classify cognitive status, and predict disease progression. We leverage the strong relationship between real-world driving behavior and the driver's current cognitive status, so that the vehicle can be utilized as a "diagnostic tool". Our method identifies early warning signs of functional impairment, contributing to proactive intervention strategies. This work enhances early detection and supports the development of scalable, non-invasive monitoring systems to mitigate the growing societal and economic burden of cognitive decline in the aging population.

[71] Cloud Diffusion Part 1: Theory and Motivation

Andrew Randono

Main category: cs.CV

TL;DR: Cloud Diffusion Models use scale-invariant noise profiles instead of white noise, aiming to improve inference speed, high-frequency detail generation, and controllability.

Details Motivation: Natural images exhibit scale invariance in their statistical properties, which traditional diffusion models using white noise do not account for. Method: Incorporate scale-invariant noise profiles into diffusion models to create Cloud Diffusion Models. Result: Cloud Diffusion Models are expected to offer faster inference, improved high-frequency details, and greater controllability compared to traditional white noise-based diffusion models. Conclusion: By aligning the noise profile with natural image statistics, Cloud Diffusion Models can potentially outperform conventional approaches. Abstract: Diffusion models for image generation function by progressively adding noise to an image set and training a model to separate out the signal from the noise. The noise profile used by these models is white noise, that is, noise based on independent normal distributions at each point whose mean and variance are independent of the scale. By contrast, most natural image sets exhibit a type of scale invariance in their low-order statistical properties characterized by a power-law scaling. Consequently, natural images are closer (in a quantifiable sense) to a different probability distribution that emphasizes large scale correlations and de-emphasizes small scale correlations. These scale-invariant noise profiles can be incorporated into diffusion models in place of white noise to form what we will call a "Cloud Diffusion Model". We argue that these models can lead to faster inference, improved high-frequency details, and greater controllability. In a follow-up paper, we will build and train a Cloud Diffusion Model that uses scale invariance at a fundamental level and compare it to classic, white noise diffusion models.
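
To make the contrast between white noise and a power-law noise profile concrete, the sketch below shapes white Gaussian noise in the Fourier domain so that its power spectrum falls off as 1/f^alpha. It is an illustration of scale-invariant noise in general, not the paper's construction: the function name `power_law_noise`, the exponent `alpha=2.0` (a commonly quoted value for natural images), and the choice to zero the DC term are all assumptions.

```python
import numpy as np


def power_law_noise(size, alpha=2.0, seed=0):
    """Return a (size, size) noise field with power spectrum ~ 1 / f**alpha.

    alpha = 0 gives (mean-removed) white noise; larger alpha emphasizes
    large-scale correlations. Output is normalized to unit variance.
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((size, size))

    # Radial spatial frequency of every FFT bin.
    freqs = np.fft.fftfreq(size)
    radius = np.sqrt(freqs[None, :] ** 2 + freqs[:, None] ** 2)
    radius[0, 0] = 1.0  # placeholder to avoid dividing by zero at DC

    # Amplitude is the square root of the desired power spectrum.
    amplitude = radius ** (-alpha / 2.0)
    amplitude[0, 0] = 0.0  # drop the DC term so the field has zero mean

    shaped = np.fft.ifft2(np.fft.fft2(white) * amplitude).real
    return shaped / shaped.std()


noise = power_law_noise(64, alpha=2.0)
```

With `alpha=2.0` the result looks like soft, cloud-like blobs rather than per-pixel static, which matches the intuition behind the name. Setting `alpha=0.0` recovers ordinary white noise up to mean removal.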

[72] LoomNet: Enhancing Multi-View Image Generation via Latent Space Weaving

Giulio Federico,Fabio Carrara,Claudio Gennaro,Giuseppe Amato,Marco Di Benedetto

Main category: cs.CV

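TL;DR: LoomNet is a novel multi-view diffusion architecture that applies the same diffusion model multiple times in parallel to collaboratively build and leverage a shared latent space for view consistency, addressing the problem of generating consistent multi-view images from a single image.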

Details Motivation: Generating consistent multi-view images from a single image remains challenging, and spatial inconsistency often degrades 3D mesh quality in surface reconstruction. Method: LoomNet, a novel multi-view diffusion architecture, applies the same diffusion model multiple times in parallel to collaboratively build and leverage a shared latent space for view consistency. Result: LoomNet generates 16 high-quality, coherent views in just 15 seconds. Conclusion: LoomNet outperforms state-of-the-art methods on image quality and reconstruction metrics and also shows creativity by producing diverse, plausible novel views from the same input. Abstract: Generating consistent multi-view images from a single image remains challenging. Lack of spatial consistency often degrades 3D mesh quality in surface reconstruction. To address this, we propose LoomNet, a novel multi-view diffusion architecture that produces coherent images by applying the same diffusion model multiple times in parallel to collaboratively build and leverage a shared latent space for view consistency. Each viewpoint-specific inference generates an encoding representing its own hypothesis of the novel view from a given camera pose, which is projected onto three orthogonal planes. For each plane, encodings from all views are fused into a single aggregated plane. These aggregated planes are then processed to propagate information and interpolate missing regions, combining the hypotheses into a unified, coherent interpretation. The final latent space is then used to render consistent multi-view images. LoomNet generates 16 high-quality and coherent views in just 15 seconds. In our experiments, LoomNet outperforms state-of-the-art methods on both image quality and reconstruction metrics, also showing creativity by producing diverse, plausible novel views from the same input.

[73] Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model

Mengyao Xu,Gabriel Moreira,Ronay Ak,Radek Osmulski,Yauhen Babakhin,Zhiding Yu,Benedikt Schifferer,Even Oldridge

Main category: cs.CV

TL;DR: This paper introduces llama-nemoretriever-colembed, a powerful text-image retrieval model achieving top performance at the expense of some storage and efficiency trade-offs.

Details Motivation: The motivation stems from the increasing demand for cross-modal retrieval systems, aiming to develop a unified text-image retrieval model that performs exceptionally across multiple benchmarks. Method: The method involves leveraging and modifying the NVIDIA Eagle2 Vision-Language model by replacing causal attention with bidirectional attention and integrating a ColBERT-style late interaction mechanism. A two-stage training strategy is also adopted. Result: The 3B model variant achieved an NDCG@5 score of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, securing first place on both leaderboards as of June 27, 2025. Conclusion: The paper concludes that the introduced model, llama-nemoretriever-colembed, achieves state-of-the-art performance in text-image retrieval tasks while acknowledging the trade-offs in storage and efficiency due to the ColBERT-style mechanism. Abstract: Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state of the art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
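
The ColBERT-style late interaction mentioned above scores a query against a document by matching token embeddings individually: each query token takes its maximum similarity over all document tokens, and these maxima are summed (often called MaxSim). The sketch below shows that scoring rule with toy 2-D embeddings. It is not the released model's code; the function name `maxsim_score` and the example vectors are assumptions.

```python
import numpy as np


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction relevance score between two sets of token embeddings.

    query_tokens: array of shape (num_query_tokens, dim)
    doc_tokens:   array of shape (num_doc_tokens, dim)
    """
    # L2-normalize so the dot product is cosine similarity.
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    similarity = q @ d.T  # (num_query_tokens, num_doc_tokens)
    # Best document token per query token, summed over query tokens.
    return float(similarity.max(axis=1).sum())


query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # matches both query tokens
doc_b = np.array([[1.0, 1.0]])                          # one partially matching token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Here `doc_a` scores higher than `doc_b` because it contains an exact match for each query token. The storage trade-off noted in the abstract follows from the same design: every document keeps one embedding per token (or image patch) rather than a single pooled vector.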

[74] Simulating Refractive Distortions and Weather-Induced Artifacts for Resource-Constrained Autonomous Perception

Moseli Mots'oehli,Feimei Chen,Hok Wai Chan,Itumeleng Tlali,Thulani Babeli,Kyungim Baek,Huaijin Chen

Main category: cs.CV

TL;DR: This paper introduces a procedural augmentation pipeline to simulate realistic distortions and weather effects on low-cost dashcam footage, targeting underrepresented African driving scenarios, and provides benchmark results to support perception research.

Details Motivation: The scarcity of autonomous vehicle datasets from developing regions, especially across Africa's diverse road conditions, hinders robust perception in low-resource settings. This work aims to address this gap by offering a cost-effective solution through data augmentation. Method: The paper introduces a procedural augmentation pipeline with two modules: a refractive module simulating optical effects such as lens distortion, Perlin noise, Thin-Plate Spline warps, and divergence-free warps; and a weather module adding fog and lens flare. It also provides baseline performance using three image restoration models. Result: The paper presents an augmentation toolkit tailored to African driving scenarios, along with an augmented dataset and benchmark results based on three image restoration models, eliminating the need for expensive data collection or simulation. Conclusion: The paper concludes that the proposed procedural augmentation pipeline can effectively enhance low-cost monocular dashcam footage with realistic distortions and weather-induced artifacts, providing a valuable toolkit and benchmark results for perception research in African driving contexts. Abstract: The scarcity of autonomous vehicle datasets from developing regions, particularly across Africa's diverse urban, rural, and unpaved roads, remains a key obstacle to robust perception in low-resource settings. We present a procedural augmentation pipeline that enhances low-cost monocular dashcam footage with realistic refractive distortions and weather-induced artifacts tailored to challenging African driving scenarios. Our refractive module simulates optical effects from low-quality lenses and air turbulence, including lens distortion, Perlin noise, Thin-Plate Spline (TPS), and divergence-free (incompressible) warps. The weather module adds homogeneous fog, heterogeneous fog, and lens flare. To establish a benchmark, we provide baseline performance using three image restoration models. 
To support perception research in underrepresented African contexts, without costly data collection, labeling, or simulation, we release our distortion toolkit, augmented dataset splits, and benchmark results.
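
As an illustration of the weather module's homogeneous fog, a common way to synthesize fog is the atmospheric scattering model, in which each pixel is blended toward an airlight value with transmission t = exp(-density · depth). The sketch below applies that model with a single constant depth, which is an assumption made because no depth map is available here; the released toolkit may implement fog differently. The function name `add_homogeneous_fog` and its parameters are hypothetical.

```python
import math


def add_homogeneous_fog(image, density=1.0, depth=1.0, airlight=255.0):
    """Blend a grayscale image (list of rows) toward a uniform airlight value.

    Uses I = J * t + A * (1 - t) with constant transmission t = exp(-density * depth).
    A constant depth for every pixel is an assumption of this sketch.
    """
    transmission = math.exp(-density * depth)
    return [
        [pixel * transmission + airlight * (1.0 - transmission) for pixel in row]
        for row in image
    ]


clear = [[0.0, 100.0], [200.0, 255.0]]
foggy = add_homogeneous_fog(clear, density=1.0)
```

A density of zero leaves the image unchanged, and large densities wash every pixel out to the airlight value. Heterogeneous fog would replace the constant transmission with a spatially varying map.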

[75] ReLayout: Integrating Relation Reasoning for Content-aware Layout Generation with Multi-modal Large Language Models

Jiaxu Tian,Xuehui Yu,Yaoxing Wang,Pan Wang,Guangqian Guo,Shan Gao

Main category: cs.CV

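TL;DR: This paper proposes ReLayout, a new method that introduces relation reasoning and a layout prototype rebalance sampler to address the structure and diversity problems of existing LLM-based layout generation methods.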

Details Motivation: Existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, which leads to structure and diversity problems in layout generation, so a new approach is needed. Method: ReLayout uses relation reasoning (relation-CoT) to generate more reasonable and aesthetically coherent layouts. It enhances layout annotations, introduces explicit relation definitions, and designs a layout prototype rebalance sampler. Result: Experiments show that ReLayout outperforms baselines on layout generation, producing more structured and diverse layouts that are better aligned with human aesthetics and more explainable. Conclusion: By introducing relation definitions and a layout prototype rebalance sampler, ReLayout effectively addresses the structure and diversity problems of existing LLM-based layout generation and produces layouts that are better aligned with human aesthetics and more explainable. Abstract: Content-aware layout aims to arrange design elements appropriately on a given canvas to convey information effectively. Recently, the trend for this task has been to leverage large language models (LLMs) to generate layouts automatically, achieving remarkable performance. However, existing LLM-based methods fail to adequately interpret spatial relationships among visual themes and design elements, leading to structure and diversity problems in layout generation. To address this issue, we introduce ReLayout, a novel method that leverages relation-CoT to generate more reasonable and aesthetically coherent layouts by starting from fundamental design concepts. Specifically, we enhance layout annotations by introducing explicit relation definitions, such as region, salient, and margin between elements, with the goal of decomposing the layout into smaller, structured, and recursive layouts, thereby enabling the generation of more structured layouts. Furthermore, based on these defined relationships, we introduce a layout prototype rebalance sampler, which defines layout prototype features across three dimensions and quantifies distinct layout styles. This sampler addresses uniformity issues in generation that arise from data bias in the prototype distribution balance process. Extensive experimental results verify that ReLayout outperforms baselines and can generate structured and diverse layouts that are more aligned with human aesthetics and more explainable.

[76] Multi-Modal Face Anti-Spoofing via Cross-Modal Feature Transitions

Jun-Xiong Chong,Fang-Yu Hsu,Ming-Tsung Hsu,Yi-Ting Lin,Kai-Heng Chien,Chiou-Ting Hsu,Pei-Kai Huang

Main category: cs.CV

TL;DR: This paper proposes CTNet, which learns consistent and inconsistent cross-modal feature transitions for multi-modal face anti-spoofing. It also addresses missing modality issues by learning complementary features from RGB modality.

Details Motivation: The motivation stems from the fact that within a single modality, the visual differences between live faces are typically much smaller than those between spoof faces, and feature transitions across modalities are more consistent for the live class compared to those between live and spoof classes. Method: Cross-modal Transition-guided Network (CTNet) is proposed to tackle the challenges in the multi-modal FAS task by learning consistent cross-modal feature transitions among live samples, learning inconsistent cross-modal feature transitions between live and spoof samples, and learning complementary IR and depth features from the RGB modality as auxiliary modalities. Result: Extensive experiments demonstrate that CTNet outperforms previous two-class multi-modal FAS methods across most protocols. Conclusion: CTNet outperforms previous two-class multi-modal FAS methods across most protocols. Abstract: Multi-modal face anti-spoofing (FAS) aims to detect genuine human presence by extracting discriminative liveness cues from multiple modalities, such as RGB, infrared (IR), and depth images, to enhance the robustness of biometric authentication systems. However, because data from different modalities are typically captured by various camera sensors and under diverse environmental conditions, multi-modal FAS often exhibits significantly greater distribution discrepancies across training and testing domains compared to single-modal FAS. Furthermore, during the inference stage, multi-modal FAS confronts even greater challenges when one or more modalities are unavailable or inaccessible. In this paper, we propose a novel Cross-modal Transition-guided Network (CTNet) to tackle the challenges in the multi-modal FAS task. Our motivation stems from the observation that, within a single modality, the visual differences between live faces are typically much smaller than those between spoof faces. 
Additionally, feature transitions across modalities are more consistent for the live class compared to those between live and spoof classes. Building on this insight, we first propose learning consistent cross-modal feature transitions among live samples to construct a generalized feature space. Next, we introduce learning the inconsistent cross-modal feature transitions between live and spoof samples to effectively detect out-of-distribution (OOD) attacks during inference. To further address the issue of missing modalities, we propose learning complementary infrared (IR) and depth features from the RGB modality as auxiliary modalities. Extensive experiments demonstrate that the proposed CTNet outperforms previous two-class multi-modal FAS methods across most protocols.

[77] Semi-Supervised Defect Detection via Conditional Diffusion and CLIP-Guided Noise Filtering

Shuai Li,Shihan Chen,Wanru Geng,Zhaohua Xu,Xiaolu Liu,Can Dong,Zhen Tian,Changlin Chen

Main category: cs.CV

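TL;DR: This study proposes DSYM, a new semi-supervised defect detection framework that combines a conditional diffusion model with a cross-modal noise filtering mechanism, improving the data efficiency and accuracy of industrial quality inspection.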

Details Motivation: Traditional industrial quality inspection methods rely on manual inspection or early image processing algorithms and suffer from low efficiency, high cost, and limited robustness. A more efficient, accurate defect detection method that depends less on labeled data is therefore needed. Method: The framework adopts a two-stage collaborative training mechanism and a staged joint optimization strategy, using labeled data for initial training and bringing in unlabeled data through generated pseudo-labels. A noise filtering mechanism based on CLIP cross-modal features mitigates label contamination. Result: Experiments show that, with the same amount of labeled data as traditional supervised methods, the method achieves 78.4% mAP@0.5 on the NEU-DET dataset; with only 40% of the labeled data required by the original supervised model, it still reaches 75.1% mAP@0.5, showing a significant advantage in data efficiency. Conclusion: This paper proposes a semi-supervised defect detection framework based on conditional diffusion (DSYM), providing a high-precision, low-labeling-dependence solution for industrial quality inspection scenarios. Abstract: In the realm of industrial quality inspection, defect detection stands as a critical component, particularly in high-precision, safety-critical sectors such as automotive components, aerospace, and medical devices. Traditional methods, reliant on manual inspection or early image processing algorithms, suffer from inefficiencies, high costs, and limited robustness. This paper introduces a semi-supervised defect detection framework based on conditional diffusion (DSYM), leveraging a two-stage collaborative training mechanism and a staged joint optimization strategy. The framework utilizes labeled data for initial training and subsequently incorporates unlabeled data through the generation of pseudo-labels. A conditional diffusion model synthesizes multi-scale pseudo-defect samples, while a CLIP cross-modal feature-based noise filtering mechanism mitigates label contamination. Experimental results on the NEU-DET dataset demonstrate a 78.4% mAP@0.5 with the same amount of labeled data as traditional supervised methods, and 75.1% mAP@0.5 with only 40% of the labeled data required by the original supervised model, showcasing significant advantages in data efficiency. This research provides a high-precision, low-labeling-dependent solution for defect detection in industrial quality inspection scenarios. The work of this article has been open-sourced at https://github.com/cLin-c/Semisupervised-DSYM.
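The abstract does not spell out how the CLIP-based noise filter scores pseudo-labels, so the following is only a minimal sketch of one plausible form: keep a pseudo-labeled region when the cosine similarity between its image embedding and the text embedding of its assigned class clears a threshold. The function name `filter_pseudo_labels`, the `threshold` value, and the use of precomputed embeddings are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def filter_pseudo_labels(region_embs, text_embs, labels, threshold=0.25):
    """Return a boolean keep-mask over pseudo-labeled regions.

    region_embs: (N, D) image embeddings of pseudo-labeled regions (assumed precomputed).
    text_embs:   (C, D) text embeddings, one per defect class (assumed precomputed).
    labels:      (N,) integer class index assigned to each region.
    threshold:   hypothetical cosine-similarity cutoff, not taken from the paper.
    """
    labels = np.asarray(labels)
    # L2-normalize so the dot product equals cosine similarity.
    region = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    text = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Similarity between each region and the text embedding of its own label.
    sims = np.sum(region * text[labels], axis=1)
    return sims >= threshold
```

Regions whose embedding disagrees with the text embedding of their assigned class are dropped before they can contaminate the next training round; in practice the threshold would need tuning on held-out labeled data.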

[78] GSVR: 2D Gaussian-based Video Representation for 800+ FPS with Hybrid Deformation Field

Zhizhuo Pang,Zhihui Ke,Xiaobo Zhou,Tie Qiu

Main category: cs.CV

TL;DR: GSVR is a fast and efficient video representation method based on 2D Gaussians, offering high decoding speed, quick training, and strong performance in compression and interpolation.

Details Motivation: To address the slow decoding speed and long training times of existing convolutional network-based video representation methods while maintaining reconstruction quality. Method: GSVR uses a hybrid deformation field combining tri-plane motion and polynomial motion to model video dynamics. It also employs Dynamic-aware Time Slicing to divide videos into GOPs and quantization-aware fine-tuning for performance stability. Image codecs are used to compress Gaussians for compact representation. Result: GSVR achieves 800+ FPS and 35+ PSNR on Bunny with only 2 seconds of training per frame. It converges faster and has 10x faster decoding speed than other methods. It performs comparably in video interpolation and outperforms NeRV in compression. Conclusion: The paper introduces GSVR, a 2D Gaussian-based video representation method that significantly improves decoding speed and training time while maintaining high video quality. Abstract: Implicit neural representations for video have been recognized as a novel and promising form of video representation. Existing works pay more attention to improving video reconstruction quality but little attention to the decoding speed. However, the high computation of convolutional network used in existing methods leads to low decoding speed. Moreover, these convolution-based video representation methods also suffer from long training time, about 14 seconds per frame to achieve 35+ PSNR on Bunny. To solve the above problems, we propose GSVR, a novel 2D Gaussian-based video representation, which achieves 800+ FPS and 35+ PSNR on Bunny, only needing a training time of $2$ seconds per frame. Specifically, we propose a hybrid deformation field to model the dynamics of the video, which combines two motion patterns, namely the tri-plane motion and the polynomial motion, to deal with the coupling of camera motion and object motion in the video. 
Furthermore, we propose a Dynamic-aware Time Slicing strategy to adaptively divide the video into multiple groups of pictures (GOP) based on the dynamic level of the video in order to handle large camera motion and non-rigid movements. Finally, we propose quantization-aware fine-tuning to avoid performance reduction after quantization and utilize image codecs to compress Gaussians to achieve a compact representation. Experiments on the Bunny and UVG datasets confirm that our method converges much faster than existing methods and also has 10x faster decoding speed compared to other methods. Our method has comparable performance in the video interpolation task to SOTA and attains better video compression performance than NeRV.

[79] PaddleOCR 3.0 Technical Report

Cheng Cui,Ting Sun,Manhui Lin,Tingquan Gao,Yubo Zhang,Jiaxuan Liu,Xueqing Wang,Zelun Zhang,Changda Zhou,Hongen Liu,Yue Zhang,Wenyu Lv,Kui Huang,Yichao Zhang,Jing Zhang,Jun Zhang,Yi Liu,Dianhai Yu,Yanjun Ma

Main category: cs.CV

TL;DR: PaddleOCR 3.0 is an open-source OCR and document parsing toolkit that offers high accuracy and efficiency with smaller models, making it ideal for intelligent document applications.

Details Motivation: To meet the growing demand for document understanding in the era of large language models. Method: Developing three major models (PP-OCRv5, PP-StructureV3, PP-ChatOCRv4) to handle multilingual text recognition, hierarchical document parsing, and key information extraction. Result: Models with less than 100 million parameters achieve accuracy and efficiency comparable to billion-parameter vision-language models. Conclusion: PaddleOCR 3.0 is a competitive toolkit with fewer parameters, offering efficient OCR and document parsing solutions for developers. Abstract: This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Compared to mainstream vision-language models (VLMs), these models with fewer than 100 million parameters achieve competitive accuracy and efficiency, rivaling billion-parameter VLMs. In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.

[80] Rethinking Layered Graphic Design Generation with a Top-Down Approach

Jingye Chen,Zhaowen Wang,Nanxuan Zhao,Li Zhang,Difan Liu,Jimei Yang,Qifeng Chen

Main category: cs.CV

TL;DR: Accordion is a new framework that transforms non-editable AI-generated graphic designs into editable layered formats, helping designers create refined and meaningful designs efficiently.

Details Motivation: The motivation behind Accordion is to address the lack of editability in AI-generated graphic designs, enabling designers to refine and build upon these designs more effectively. Method: Accordion uses a vision language model (VLM) across three stages to convert non-layered AI-generated designs into editable layered formats, leveraging tools like SAM and element removal models, and is trained on the Design39K dataset augmented with AI-generated images. Result: Experimental results and user studies show that Accordion performs well on the DesignIntention benchmark, particularly excelling in tasks such as text-to-template, adding text to backgrounds, text de-rendering, and generating design variations. Conclusion: Accordion is a top-down graphic design generation framework that converts AI-generated designs into editable layered designs while refining text based on user prompts, offering superior results in tasks like text-to-template and design variation creation. Abstract: Graphic design is crucial for conveying ideas and messages. Designers usually organize their work into objects, backgrounds, and vectorized text layers to simplify editing. However, this workflow demands considerable expertise. With the rise of GenAI methods, an endless supply of high-quality graphic designs in pixel format has become more accessible, though these designs often lack editability. Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. It is built around a vision language model (VLM) playing distinct roles in three curated stages. 
For each stage, we design prompts to guide the VLM in executing different tasks. Distinct from existing bottom-up methods (e.g., COLE and Open-COLE) that gradually generate elements to create layered designs, our approach works in a top-down manner by using the visually harmonious reference image as global guidance to decompose each layer. Additionally, it leverages multiple vision experts such as SAM and element removal models to facilitate the creation of graphic layers. We train our method using the in-house graphic design dataset Design39K, augmented with AI-generated design images coupled with refined ground truth created by a customized inpainting model. Experimental results and user studies by designers show that Accordion generates favorable results on the DesignIntention benchmark, including tasks such as text-to-template, adding text to background, and text de-rendering, and also excels in creating design variations.

[81] Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration

Yuyang Hu,Kangfu Mei,Mojtaba Sahraee-Ardakan,Ulugbek S. Kamilov,Peyman Milanfar,Mauricio Delbracio

Main category: cs.CV

TL;DR: Kernel Density Steering enhances diffusion model performance for image restoration by leveraging collective wisdom to avoid artifacts and boost fidelity.

Details Motivation: Existing diffusion models struggle with inconsistent fidelity and artifacts in image restoration tasks, prompting the need for a more robust inference-time framework. Method: KDS uses an ensemble of diffusion samples, applying patch-wise kernel density estimation gradients to steer patches towards higher-density regions collectively. Result: KDS achieves improved quantitative and qualitative performance on real-world super-resolution and image inpainting tasks while being compatible with various diffusion samplers as a plug-and-play framework. Conclusion: Kernel Density Steering (KDS) improves the fidelity and robustness of diffusion models in image restoration tasks by leveraging collective local mode-seeking mechanisms without requiring retraining. Abstract: Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density Steering (KDS), a novel inference-time framework promoting robust, high-fidelity outputs through explicit local mode-seeking. KDS employs an $N$-particle ensemble of diffusion samples, computing patch-wise kernel density estimation gradients from their collective outputs. These gradients steer patches in each particle towards shared, higher-density regions identified within the ensemble. This collective local mode-seeking mechanism, acting as "collective wisdom", steers samples away from spurious modes prone to artifacts, arising from independent sampling or model imperfections, and towards more robust, high-fidelity structures. This allows us to obtain better quality samples at the expense of higher compute by simultaneously sampling multiple particles. As a plug-and-play framework, KDS requires no retraining or external verifiers, seamlessly integrating with various diffusion samplers. 
Extensive numerical validations demonstrate KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.
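The mode-seeking idea behind KDS can be illustrated with a textbook Gaussian-kernel mean-shift step, which moves each point along the gradient direction of a kernel density estimate built from the whole ensemble. This is a minimal sketch under stated assumptions: the function name `mean_shift_step`, the `bandwidth` and `step` parameters, and the use of flattened patches are hypothetical, and the paper's actual steering update inside the diffusion sampler may differ.

```python
import numpy as np

def mean_shift_step(patches, bandwidth=1.0, step=0.5):
    """One Gaussian-kernel mean-shift step over an ensemble of patches.

    patches: (N, D) array, one flattened patch per particle at the same location.
    Returns an (N, D) array with every patch moved toward a higher-density region.
    """
    # Pairwise squared distances between particles, shape (N, N).
    diffs = patches[:, None, :] - patches[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Gaussian kernel weights, normalized per row.
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    # Kernel-weighted mean of the ensemble as seen from each particle.
    kernel_means = weights @ patches
    # The mean-shift vector (kernel_means - patches) points along the KDE gradient.
    return patches + step * (kernel_means - patches)
```

Repeating the step pulls outlying particles toward the modes the ensemble agrees on, which matches the abstract's description of steering samples away from spurious modes; the cost grows with the number of particles, consistent with the stated compute trade-off.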

[82] Generative Head-Mounted Camera Captures for Photorealistic Avatars

Shaojie Bai,Seunghyeon Seo,Yida Wang,Chenghui Li,Owen Wang,Te-Li Wang,Tianyang Ma,Jason Saragih,Shih-En Wei,Nojun Kwak,Hyung Jun Kim

Main category: cs.CV

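TL;DR: This paper proposes Generative HMC (GenHMC), a new generative method that leverages easier-to-collect unpaired head-mounted camera (HMC) data to generate high-quality synthetic HMC images directly from any conditioning avatar state, yielding more accurate ground truth and generalizing well across identities.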

Details Motivation: Photorealistic avatar animation in virtual and augmented reality (VR/AR) has long been challenging because the ground truth state of faces is difficult to obtain. Existing analysis-by-synthesis methods rely on extensive paired HMC and dome camera captures, which are operationally expensive and cannot be reused for different HMC viewpoints and lighting conditions. Method: The paper proposes Generative HMC (GenHMC), a new generative method that leverages large amounts of unpaired HMC data to generate high-quality synthetic HMC images from any conditioning avatar state. The method disentangles the input conditioning signal, which specifies facial expression and viewpoint, from facial appearance, improving ground truth accuracy. Result: Experiments show that GenHMC not only disentangles the input conditioning signal from facial appearance accurately but also generalizes to unseen identities, reducing the reliance on paired captures. Evaluations of the synthetic HMC images and of universal face encoders trained on the new HMC-avatar correspondences demonstrate gains in data efficiency and state-of-the-art accuracy. Conclusion: By introducing GenHMC, this work addresses the high cost and poor generalization of traditional methods that require extensive paired HMC and dome camera data, offering a new approach to more efficient and accurate VR/AR avatar animation. Abstract: Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining ground truth state of faces. It is physically impossible to obtain synchronized images from head-mounted cameras (HMC) sensing input, which has partial observations in infrared (IR), and an array of outside-in dome cameras, which have full observations that match avatars' appearance. Prior works relying on analysis-by-synthesis methods could generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. The reliance on extensive paired captures (HMC and dome) for the same subject makes it operationally expensive to collect large-scale datasets, which cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method is able to properly disentangle the input conditioning signal that specifies facial expression and viewpoint, from facial appearance, leading to more accurate ground truth. Furthermore, our method can generalize to unseen identities, removing the reliance on the paired captures.
We demonstrate these breakthroughs by both evaluating synthetic HMC images and universal face encoders trained from these new HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.

[83] AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization Framework

Suoxiang Zhang,Xiaxi Li,Hongrui Chang,Zhuoyan Hou,Guoxin Wu,Ronghua Ji

Main category: cs.CV

TL;DR: AdaptaGen is a novel framework for domain-specific image generation that improves semantic accuracy and visual diversity by integrating prompt optimization, cross-modal adaptation, and semantic transformation techniques.

Details Motivation: Existing methods for domain-specific image generation suffer from separate handling of prompt engineering and model adaptation, and inadequate incorporation of domain-specific semantic constraints, leading to hallucinations and semantic deviations. Method: AdaptaGen employs a hierarchical semantic optimization framework combining matrix-based prompt optimization with cross-modal adaptation and a two-phase caption semantic transformation to enhance semantic coherence and visual diversity. Result: Experimental results show that AdaptaGen outperforms existing approaches across 40 categories from diverse datasets using only 16 images per category, significantly improving image quality, diversity, and semantic consistency. Conclusion: AdaptaGen effectively addresses the challenges of domain-specific image generation by integrating prompt optimization and multi-perspective understanding, achieving superior performance in quality, diversity, and semantic consistency. Abstract: Domain-specific image generation aims to produce high-quality visual content for specialized fields while ensuring semantic accuracy and detail fidelity. However, existing methods exhibit two critical limitations: First, current approaches address prompt engineering and model adaptation separately, overlooking the inherent dependence between semantic understanding and visual representation in specialized domains. Second, these techniques inadequately incorporate domain-specific semantic constraints during content synthesis, resulting in generation outcomes that exhibit hallucinations and semantic deviations. To tackle these issues, we propose AdaptaGen, a hierarchical semantic optimization framework that integrates matrix-based prompt optimization with multi-perspective understanding, capturing comprehensive semantic relationships from both global and local perspectives. 
To mitigate hallucinations in specialized domains, we design a cross-modal adaptation mechanism, which, when combined with intelligent content synthesis, enables preserving core thematic elements while incorporating diverse details across images. Additionally, we introduce a two-phase caption semantic transformation during the generation phase. This approach maintains semantic coherence while enhancing visual diversity, ensuring the generated images adhere to domain-specific constraints. Experimental results confirm our approach's effectiveness, with our framework achieving superior performance across 40 categories from diverse datasets using only 16 images per category, demonstrating significant improvements in image quality, diversity, and semantic consistency.

[84] OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval

Zhiwei Chen,Yupeng Hu,Zixu Li,Zhiheng Fu,Xuemeng Song,Liqiang Nie

Main category: cs.CV

TL;DR: This paper proposes OFFSET, a new framework for Composed Image Retrieval (CIR), which improves retrieval accuracy by better handling visual noise and aligning focus with textual modification instructions.

Details Motivation: The motivation stems from the limitations in current CIR methods—specifically, their inability to effectively handle noise in visual data and their tendency to exhibit visual focus bias. These issues degrade the quality of query features and hinder accurate retrieval based on multimodal queries. Method: The method involves two key modules: 1) A focus mapping-based feature extractor that identifies significant dominant portions in images and guides the extraction of visual and textual features to reduce noise interference; 2) A textually guided focus revision module that uses modification text to adaptively revise the focus on the reference image, enhancing the perception of the desired modifications. Result: Comprehensive experiments on four benchmark datasets show that the proposed OFFSET framework outperforms existing methods, demonstrating its effectiveness in improving feature extraction and focus alignment with modification requirements. Conclusion: The proposed OFFSET framework, consisting of a focus mapping-based feature extractor and a textually guided focus revision module, demonstrates superior performance in addressing the challenges of inhomogeneity in visual data and visual focus bias in Composed Image Retrieval (CIR). Abstract: Composed Image Retrieval (CIR) represents a novel retrieval paradigm that is capable of expressing users' intricate retrieval requirements flexibly. It enables the user to give a multimodal query, comprising a reference image and a modification text, and subsequently retrieve the target image. Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. 
To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. It is designed to identify significant dominant portions in images and guide the extraction of visual and textual data features, thereby reducing the impact of noise interference. Subsequently, we propose a textually guided focus revision module, which can utilize the modification requirements implied in the text to perform adaptive focus revision on the reference image, thereby enhancing the perception of the modification focus on the composed features. The aforementioned modules collectively constitute the segmentatiOn-based Focus shiFt reviSion nETwork (OFFSET), and comprehensive experiments on four benchmark datasets substantiate the superiority of our proposed method. The codes and data are available on https://zivchen-ty.github.io/OFFSET.github.io/

[85] Knowledge-guided Complex Diffusion Model for PolSAR Image Classification in Contourlet Domain

Junfei Shi,Yu Cheng,Haiyan Jin,Junhuai Li,Zhaolin Xiao,Maoguo Gong,Weisi Lin

Main category: cs.CV

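TL;DR: This paper proposes a structural knowledge-guided complex diffusion model for PolSAR image classification that uses the Contourlet transform to better capture multiscale and multidirectional features, improving classification accuracy and edge preservation.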

Details Motivation: Traditional real-valued diffusion models struggle to capture complex-valued phase information when applied to PolSAR data, and they have difficulty preserving fine structural details. Method: The Contourlet transform provides rich multiscale and multidirectional representations suited to PolSAR imagery. A structural knowledge-guided complex diffusion network models the statistical properties of the low-frequency components, while structural information from the high-frequency coefficients guides the diffusion process. Result: A structural knowledge-guided complex diffusion model in the Contourlet domain is proposed for PolSAR image classification. The complex Contourlet transform decomposes the data into low- and high-frequency subbands, from which statistical and boundary features are extracted. Conclusion: Experiments show that the proposed method surpasses state-of-the-art methods on three real-world PolSAR datasets, particularly in preserving edge details and maintaining region homogeneity in complex terrain. Abstract: Diffusion models have demonstrated exceptional performance across various domains due to their ability to model and generate complicated data distributions. However, when applied to PolSAR data, traditional real-valued diffusion models face challenges in capturing complex-valued phase information. Moreover, these models often struggle to preserve fine structural details. To address these limitations, we leverage the Contourlet transform, which provides rich multiscale and multidirectional representations well-suited for PolSAR imagery. We propose a structural knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain. Specifically, the complex Contourlet transform is first applied to decompose the data into low- and high-frequency subbands, enabling the extraction of statistical and boundary features. A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components. During the process, structural information from high-frequency coefficients is utilized to guide the diffusion process, improving edge preservation. Furthermore, multiscale and multidirectional high-frequency features are jointly learned to further boost classification accuracy. Experimental results on three real-world PolSAR datasets demonstrate that our approach surpasses state-of-the-art methods, particularly in preserving edge details and maintaining region homogeneity in complex terrain.

[86] Dynamic Rank Adaptation for Vision-Language Models

Jiahui Wang,Qin Xu,Bo Jiang,Bin Luo

Main category: cs.CV

TL;DR: This paper introduces Dynamic Rank Adaptation (DRA) for fine-tuning vision-language models to improve their generalization to unseen classes by dynamically adapting feature ranks based on importance.

Details Motivation: Existing prompt-based and adapter-based methods struggle to maintain strong generalization abilities when faced with unseen classes because they treat all tokens equally, potentially overfitting to less informative features. Method: DRA dynamically allocates adaptation ranks based on feature importance during training, using token importance grouping, rank adaptation, and a channel response mechanism combined with L1 regularization. Result: Extensive experiments show that DRA outperforms existing methods in enhancing performance on unseen classes across various benchmarks, including base-new class evaluation, cross-dataset evaluation, and domain generalization. Conclusion: The paper proposes Dynamic Rank Adaptation (DRA), a new method for fine-tuning large vision-language models to enhance their generalization ability towards unseen classes. Abstract: Pre-trained large vision-language models (VLMs) like CLIP demonstrate impressive generalization ability. Existing prompt-based and adapter-based works have made significant progress in fine-tuning VLMs but still face the challenges of maintaining strong generalization abilities, particularly towards unseen new classes. This limitation partly arises from these methods treating all tokens of the image and text encoder equally, which can lead to overfitting on less informative features (e.g., background noise, template words) and degrade the general representations that are crucial for novel concept recognition. To address this issue, we propose Dynamic Rank Adaptation (DRA), a novel adapter variant method, designed specifically to enhance new class generalization. DRA dynamically allocates adaptation ranks based on the importance of features during training to preserve general knowledge. DRA first employs token importance grouping, using sequence attention to evaluate and group tokens by their importance. 
Then, we adopt rank adaptation according to the importance of each token group dynamically by assigning higher feature ranks to the more important tokens. Also, we design a new channel response mechanism to prioritize the preservation and adaptation of feature channels identified as the most informative for each instance. In addition, an L1 regularization term is introduced to stabilize the training. Extensive experiments demonstrate the effectiveness and superiority of our proposed DRA over existing works, especially on enhancing the performance of new classes on various benchmarks, including base-new classes, cross-dataset evaluation and domain generalization. The source code will be published after the paper is received.

[87] Modeling and Reversing Brain Lesions Using Diffusion Models

Omar Zamzam,Haleh Akrami,Anand Joshi,Richard Leahy

Main category: cs.CV

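TL;DR: This paper develops a new framework for brain lesion analysis that distinguishes and handles damaged and deformed tissue more precisely, improving the accuracy of lesion segmentation and prognosis assessment.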

Details Motivation: Existing lesion segmentation methods overlook the distinction between damaged and deformed tissue, which can lead to inaccurate analysis. Method: The pipeline first segments abnormal regions, then estimates and reverses tissue deformation, and finally inpaints the core lesion area to estimate the pre-lesion healthy brain. Result: Because no public dataset is available for validation, a forward model is simulated to synthesize multiple lesioned brain images, and the framework shows improved accuracy in lesion segmentation, characterization, and brain labeling. Conclusion: This paper proposes a diffusion model-based framework that analyzes and reverses the brain lesion process, providing a robust tool for clinical and research applications. Abstract: Brain lesions are abnormalities or injuries in brain tissue that are often detectable using magnetic resonance imaging (MRI), which reveals structural changes in the affected areas. This broad definition of brain lesions includes areas of the brain that are irreversibly damaged, as well as areas of brain tissue that are deformed as a result of lesion growth or swelling. Despite the importance of differentiating between damaged and deformed tissue, existing lesion segmentation methods overlook this distinction, labeling both of them as a single anomaly. In this work, we introduce a diffusion model-based framework for analyzing and reversing the brain lesion process. Our pipeline first segments abnormal regions in the brain, then estimates and reverses tissue deformations by restoring displaced tissue to its original position, isolating the core lesion area representing the initial damage. Finally, we inpaint the core lesion area to arrive at an estimation of the pre-lesion healthy brain. This proposed framework reverses a forward lesion growth process model that is well-established in biomechanical studies that model brain lesions. Our results demonstrate improved accuracy in lesion segmentation, characterization, and brain labeling compared to traditional methods, offering a robust tool for clinical and research applications in brain lesion analysis. Since pre-lesion healthy versions of abnormal brains are not available in any public dataset for validation of the reverse process, we simulate a forward model to synthesize multiple lesioned brain images.

[88] R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding

Joonhyung Park,Peng Tang,Sagnik Das,Srikar Appalaraju,Kunwar Yashraj Singh,R. Manmatha,Shabnam Ghadar

Main category: cs.CV

TL;DR: R-VLM achieves significant performance gains on GUI automation tasks by improving element grounding and the design of the objective function.

Details Motivation: Existing vision-only GUI agents must process substantial irrelevant information when handling screenshots, and they use a basic cross-entropy loss that fails to capture grounding quality, which reduces accuracy. Method: The paper introduces a region proposal-based vision language model (R-VLM) and an IoU-aware objective function, combining VLMs with conventional object detection techniques. Result: On the ScreenSpot and AgentStudio benchmarks, R-VLM improves element grounding accuracy by 13%; on the AITW and Mind2Web benchmarks, it improves GUI navigation accuracy by 3.2-9.7%. Conclusion: R-VLM effectively improves element grounding precision in GUI automation and delivers significant gains across multiple benchmarks. Abstract: Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
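For readers unfamiliar with the Intersection-over-Union metric the abstract refers to, a minimal sketch of the standard computation for two axis-aligned boxes follows. The function name `box_iou` and the `(x1, y1, x2, y2)` box convention are assumptions for illustration; the abstract does not describe the exact form of the IoU-aware objective, so only the metric itself is shown.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union area = sum of both areas minus the double-counted overlap.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For example, `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` gives 1/7, since the overlap has area 1 and the union has area 7. Unlike a token-level cross-entropy loss on coordinate strings, this score rises smoothly as a predicted box approaches the target, which is the property the abstract points to.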

[89] MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos

Rongsheng Wang,Junying Chen,Ke Ji,Zhenyang Cai,Shunian Chen,Yunjin Yang,Benyou Wang

Main category: cs.CV

TL;DR: This study introduces the MedVideoCap-55K dataset and the MedGen model, advancing medical video generation.

Details Motivation: Despite progress in general-purpose video generation, medical video generation still lacks high-quality datasets and model support. Method: The authors build MedVideoCap-55K, a large-scale, diverse medical video dataset, and develop the MedGen model on top of it. Result: MedGen performs strongly across multiple benchmarks, reaching leading levels in both visual quality and medical accuracy. Conclusion: MedVideoCap-55K and MedGen provide an important resource for medical video generation and are expected to foster further research in the field. Abstract: Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen

[90] Integrated Structural Prompt Learning for Vision-Language Models

Jiahui Wang,Qin Xu,Bo Jiang,Bin Luo

Main category: cs.CV

TL;DR: This paper proposes an Integrated Structural Prompt (ISP) for Vision-Language Models (VLMs) to improve information interaction between text and image branches by modeling structural relationships and dynamically adjusting loss coefficients for better generalization.

Details Motivation: Prompt learning methods have extended the transferability of pre-trained Vision-Language Models (VLMs), but existing approaches often ignore structural relationships between prompts and tokens within and between modalities, and struggle to balance performance for base and new classes. Method: The paper proposes an Integrated Structural Prompt (ISP) with self-structural and cross-structural prompt modules to model structural relationships within and across modalities. It also introduces a sample probing module to dynamically adjust loss coefficients based on sample difficulty. Result: Extensive experiments show that the proposed ISP achieves competitive performance compared to state-of-the-art methods in three widely used settings: base-to-new generalization, cross-dataset evaluation, and domain generalization. Conclusion: The paper concludes that the proposed Integrated Structural Prompt (ISP) for Vision-Language Models (VLMs) effectively enhances information interaction between text and image branches, achieving competitive performance in various evaluation settings. Abstract: Prompt learning methods have significantly extended the transferability of pre-trained Vision-Language Models (VLMs) like CLIP for various downstream tasks. These methods adopt handcrafted templates or learnable vectors to provide text or image instructions in fine-tuning VLMs. However, most existing works ignore the structural relationships between learnable prompts and tokens within and between modalities. Moreover, balancing the performance of base and new classes remains a significant challenge. In this paper, we propose an Integrated Structural Prompt (ISP) for VLMs to enhance the interaction of information representations between the text and image branches. ISP introduces self-structural and cross-structural prompt modules to model the structural relationships between learnable prompts and frozen tokens within and across modalities. This enables efficient information transfer while preserving feature stability. Additionally, we propose a sample probing module that dynamically adjusts loss coefficients based on sample difficulty, preventing the model from overfitting to simple samples and improving generalization ability to new classes. Extensive experiments on three widely used settings (base-to-new generalization, cross-dataset evaluation, and domain generalization) demonstrate that the proposed ISP achieves competitive performance against state-of-the-art methods.

[91] LiON-LoRA: Rethinking LoRA Fusion to Unify Controllable Spatial and Temporal Generation for Video Diffusion

Yisu Zhang,Chenjie Cao,Chaohui Yu,Jianke Zhu

Main category: cs.CV

TL;DR: LiON-LoRA enhances video diffusion models' ability to precisely control camera and object motion through novel adaptation techniques.

Details Motivation: The motivation is to improve upon vanilla Low-Rank Adaptation's limitations in precisely controlling both camera trajectories and object motion in video synthesis with constrained data. Method: The method involves analyzing LoRA features' orthogonality, enforcing norm consistency across layers, integrating a controllable token into the diffusion transformer, and extending LiON-LoRA to temporal generation using static-camera videos. Result: Experiments show that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, generalizing well with minimal training data. Conclusion: LiON-LoRA addresses the challenges of achieving precise control over camera trajectories and object motion in Video Diffusion Models by introducing principles of Linear scalability, Orthogonality, and Norm consistency. Abstract: Video Diffusion Models (VDMs) have demonstrated remarkable capabilities in synthesizing realistic videos by learning from large-scale data. Although vanilla Low-Rank Adaptation (LoRA) can learn specific spatial or temporal movement to drive VDMs with constrained data, achieving precise control over both camera trajectories and object motion remains challenging due to unstable fusion and non-linear scalability. To address these issues, we propose LiON-LoRA, a novel framework that rethinks LoRA fusion through three core principles: Linear scalability, Orthogonality, and Norm consistency. First, we analyze the orthogonality of LoRA features in shallow VDM layers, enabling decoupled low-level controllability. Second, norm consistency is enforced across layers to stabilize fusion during complex camera motion combinations. Third, a controllable token is integrated into the diffusion transformer (DiT) to linearly adjust motion amplitudes for both cameras and objects with a modified self-attention mechanism to ensure decoupled control. Additionally, we extend LiON-LoRA to temporal generation by leveraging static-camera videos, unifying spatial and temporal controllability. Experiments demonstrate that LiON-LoRA outperforms state-of-the-art methods in trajectory control accuracy and motion strength adjustment, achieving superior generalization with minimal training data. Project Page: https://fuchengsu.github.io/lionlora.github.io/

[92] Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting

Mohsi Jawaid,Marcus Märtens,Tat-Jun Chin

Main category: cs.CV

TL;DR: This paper proposes an RGB-event sensor fusion method to improve spacecraft pose estimation under harsh lighting conditions.

Details Motivation: Spacecraft pose estimation is vital for autonomous operations but is challenged by harsh lighting conditions affecting traditional RGB imaging sensors. Event sensors offer resilience due to their higher dynamic range, though they have limitations in spatial resolution and signal-to-noise ratio. This work aims to address these challenges through multi-sensor fusion. Method: A sensor fusion approach combining RGB and event sensors was introduced. A beam-splitter prism ensured optical and temporal alignment, while a RANSAC-based technique fused data from both modalities. Dropout uncertainty estimation was used to detect adverse conditions affecting either channel. Result: A comprehensive real dataset of RGB and event data for satellite pose estimation was collected and tested in various challenging illumination conditions. The proposed fusion method showed encouraging results, demonstrating improved performance over individual sensors. Conclusion: The study concludes that the event-RGB fusion approach effectively addresses individual sensor limitations, enhancing spacecraft pose estimation even under challenging illumination conditions. It supports the broader use of event sensors in this domain. Abstract: Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, are a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further support the usage of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset will be released publicly.

[93] Hyperspectral Anomaly Detection Methods: A Survey and Comparative Study

Aayushma Pant,Arbind Agrahari Baniya,Tsz-Kwan Lee,Sunil Aryal

Main category: cs.CV

TL;DR: This paper compares hyperspectral anomaly detection (HAD) techniques, showing that deep learning provides the best accuracy, while statistical models are the fastest, aiding future research and application development.

Details Motivation: Hyperspectral anomaly detection is crucial in applications such as agriculture, defense, and environmental monitoring, but challenges like computational complexity and sensitivity to noise persist. Method: The study categorizes HAD techniques into statistical models, representation-based methods, classical machine learning approaches, and deep learning models, evaluating them across 17 benchmarking datasets using performance metrics like ROC, AUC, and separability maps. Result: Deep learning models outperformed others in detection accuracy, whereas statistical models were fastest across all datasets. Conclusion: Deep learning models achieve the highest detection accuracy, while statistical models offer exceptional speed for hyperspectral anomaly detection. Abstract: Hyperspectral images are high-dimensional datasets consisting of hundreds of contiguous spectral bands, enabling detailed material and surface analysis. Hyperspectral anomaly detection (HAD) refers to the technique of identifying and locating anomalous targets in such data without prior information about a hyperspectral scene or target spectrum. This technology has seen rapid advancements in recent years, with applications in agriculture, defence, military surveillance, and environmental monitoring. Despite this significant progress, existing HAD methods continue to face challenges such as high computational complexity, sensitivity to noise, and limited generalisation across diverse datasets. This study presents a comprehensive comparison of various HAD techniques, categorising them into statistical models, representation-based methods, classical machine learning approaches, and deep learning models. We evaluated these methods across 17 benchmarking datasets using different performance metrics, such as ROC, AUC, and separability maps, to analyse detection accuracy, computational efficiency, their strengths, limitations, and directions for future research. The research shows that deep learning models achieved the highest detection accuracy, while statistical models demonstrated exceptional speed across all datasets. This study aims to provide valuable insights for researchers and practitioners working to advance the field of hyperspectral anomaly detection methods.
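The survey above reports ROC and AUC among its metrics. As a rough illustration of what AUC measures for an anomaly detector, the sketch below computes it as the probability that a randomly chosen anomalous pixel scores higher than a randomly chosen background pixel (the rank-based definition, with ties counted as half). The function name `roc_auc` and the toy scores are assumptions for this example and do not come from the paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: anomaly scores, higher means more anomalous.
    labels: 1 for anomalous pixels, 0 for background pixels.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one background sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomaly ranked above background
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical detector output for four pixels.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))
```

The pairwise loop is quadratic in the number of pixels, so it only suits small examples; sorting-based implementations compute the same quantity for full hyperspectral scenes.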

[94] SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations

Yegyu Han,Taegyoon Yoon,Dayeon Woo,Sojeong Kim,Hyung-Sin Kim

Main category: cs.CV

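TL;DR: SenseShift6D is a new RGB-D dataset for 6D object pose estimation that accounts for sensor and lighting variations in real environments; the study shows that test-time sensor control significantly improves model performance.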

Details Motivation: Existing 6D pose estimation datasets do not account for real-world lighting and sensor variations, and the effect of sensor control in mitigating these variations needs to be explored. Method: A large RGB-D dataset covering multiple RGB exposures, gains, depth-capture modes, and illumination levels is built, and existing models are tested on it. Result: Experiments show that test-time sensor control improves performance more than digital data augmentation, achieving results comparable to or better than increasing the quantity and diversity of training data. Conclusion: By introducing physical sensor control, SenseShift6D extends the evaluation paradigm for 6D pose estimation and lays a foundation for adaptive perception systems in uncertain real-world environments. Abstract: Recent advances in 6D object-pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-Less. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain or depth-sensor mode - and the potential of test-time sensor control to mitigate such variations - largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For three common household objects (spray, pringles, and tincase), we acquire 101.9k RGB and 10k depth images, which can provide 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art models on our dataset show that applying sensor control during test-time induces greater performance improvement over digital data augmentation, achieving performance comparable to or better than costly increases in real-world training data quantity and diversity. Adapting either RGB or depth sensors individually is effective, while jointly adapting multimodal RGB-D configurations yields even greater improvements. SenseShift6D extends the 6D-pose evaluation paradigm from data-centered to sensor-aware robustness, laying a foundation for adaptive, self-tuning perception systems capable of operating robustly in uncertain real-world environments. Our dataset is available at: huggingface.co/datasets/Yegyu/SenseShift6D. Associated scripts can be found at: github.com/yegyu-han/SenseShift6D

[95] Normal Patch Retinex Robust Alghoritm for White Balancing in Digital Microscopy

Radoslaw Roszczyk,Artur Krupa,Izabella Antoniuk

Main category: cs.CV

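TL;DR: This paper presents a fully automatic mechanism for white balance correction of microscopic images; experiments show it outperforms traditional methods.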

Details Motivation: Acquiring accurately coloured, balanced images in an optical microscope is a challenge even for experienced microscope operators. Method: A fully automatic white balancing mechanism is proposed and validated experimentally on two hundred microscopic images. Result: Experiments show the algorithm is suitable for scans of three microscopic specimens commonly used in pathomorphology, and that it outperforms other commonly used white balance algorithms on microscopic images with certain stains. Conclusion: For white balance correction of microscopic images, the algorithm is more effective than the classical algorithms of conventional digital photography. Abstract: The acquisition of accurately coloured, balanced images in an optical microscope can be a challenge even for experienced microscope operators. This article presents an entirely automatic mechanism for balancing the white level that allows the correction of the microscopic colour images adequately. The results of the algorithm have been confirmed experimentally on a set of two hundred microscopic images. The images contained scans of three microscopic specimens commonly used in pathomorphology. Also, the results achieved were compared with other commonly used white balance algorithms in digital photography. The algorithm applied in this work is more effective than the classical algorithms used in colour photography for microscopic images stained with hematoxylin-phloxine-saffron and for immunohistochemical staining images.

[96] DreamArt: Generating Interactable Articulated Objects from a Single Image

Ruijie Lu,Yu Liu,Jiaxiang Tang,Junfeng Ni,Yuxiang Wang,Diwen Wan,Gang Zeng,Yixin Chen,Siyuan Huang

Main category: cs.CV

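TL;DR: DreamArt proposes a new framework for generating high-quality, interactable articulated assets from single-view images, addressing the shortcomings of current methods in articulation modeling and scalability.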

Details Motivation: Current image-to-3D methods focus mainly on surface geometry and texture and neglect part decomposition and articulation modeling, while neural reconstruction methods rely on dense multi-view or interaction data, which limits their scalability. Method: DreamArt uses a three-stage pipeline: first, it reconstructs part-segmented and complete 3D object meshes by combining image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion; second, it fine-tunes a video diffusion model to capture part-level articulation priors; finally, it optimizes the articulation motion and performs global texture refinement and repainting. Result: Experiments show that DreamArt effectively generates high-quality articulated objects with accurate part shapes, high appearance fidelity, and plausible articulation, providing a scalable solution for articulated asset generation. Conclusion: DreamArt is a new framework for generating high-quality, interactable articulated assets from single-view images, offering a scalable solution for articulated object generation. Abstract: Generating articulated objects, such as laptops and microwaves, is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR. Current image-to-3D methods primarily focus on surface geometry and texture, neglecting part decomposition and articulation modeling. Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting) rely on dense multi-view or interaction data, limiting their scalability. In this paper, we introduce DreamArt, a novel framework for generating high-fidelity, interactable articulated assets from single-view images. DreamArt employs a three-stage pipeline: first, it reconstructs part-segmented and complete 3D object meshes through a combination of image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion. Second, we fine-tune a video diffusion model to capture part-level articulation priors, leveraging movable part masks as prompts and amodal images to mitigate ambiguities caused by occlusion. Finally, DreamArt optimizes the articulation motion, represented by a dual quaternion, and conducts global texture refinement and repainting to ensure coherent, high-quality textures across all parts. Experimental results demonstrate that DreamArt effectively generates high-quality articulated objects, possessing accurate part shape, high appearance fidelity, and plausible articulation, thereby providing a scalable solution for articulated asset generation. Our project page is available at https://dream-art-0.github.io/DreamArt/.

[97] TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model

Yujie Hu,Xuanyu Zhang,Weiqi Li,Jian Zhang

Main category: cs.CV

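TL;DR: This paper proposes TalkFashion, a multifunctional virtual try-on system driven by text instructions that automatically performs full outfit changes and local editing.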

Details Motivation: Existing methods rely mainly on end-to-end networks that perform a single task and lack flexibility and versatility, so a more intelligent solution is needed. Method: The TalkFashion framework uses a language model to understand user instructions and select the corresponding processing pipeline, and introduces an instruction-based local repainting model. Result: Experiments show that the method outperforms current methods in semantic consistency and visual quality. Conclusion: By leveraging large language models and multimodal models, TalkFashion achieves multifunctional virtual try-on and improves its flexibility and degree of automation. Abstract: Virtual try-on has made significant progress in recent years. This paper addresses how to achieve multifunctional virtual try-on guided solely by text instructions, including full outfit change and local editing. Previous methods primarily relied on end-to-end networks to perform single try-on tasks, lacking versatility and flexibility. We propose TalkFashion, an intelligent try-on assistant that leverages the powerful comprehension capabilities of large language models to analyze user instructions and determine which task to execute, thereby activating different processing pipelines accordingly. Additionally, we introduce an instruction-based local repainting model that eliminates the need for users to manually provide masks. With the help of multi-modal models, this approach achieves fully automated local editing, enhancing the flexibility of editing tasks. The experimental results demonstrate better semantic consistency and visual quality compared to the current methods.

[98] SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning

Xin Hu,Ke Qin,Guiduo Duan,Ming Li,Yuan-Fang Li,Tao He

Main category: cs.CV

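TL;DR: This paper proposes SPADE, a new framework that exploits the inversion process of denoising diffusion models to preserve the spatial structure of input images, addressing the capture of pixel-level structural relationships in complex scenes.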

Details Motivation: Existing methods based on pre-trained vision-language models are limited in spatial relation reasoning, for example in distinguishing the relative positions of objects, which leads to suboptimal relation prediction. Method: The SPADE framework is proposed, consisting of two key steps: inversion-guided calibration and spatial-aware context reasoning. Result: Experiments on the benchmark PSG and Visual Genome datasets show that SPADE performs strongly in spatial relationship prediction. Conclusion: SPADE outperforms existing methods in both closed-set and open-set scenarios, particularly for spatial relationship prediction. Abstract: Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel-level structural relationships in complex scenes. Although recent approaches leveraging pre-trained vision-language models (VLMs) have significantly improved performance in the open-vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing object relative positions, which results in suboptimal relation prediction. Motivated by the denoising diffusion model's inversion process in preserving the spatial structure of input images, we propose SPADE (SPatial-Aware Denoising-nEtwork) framework -- a novel approach for open-vocabulary PSG. SPADE consists of two key steps: (1) inversion-guided calibration for the UNet adaptation, and (2) spatial-aware context reasoning. In the first step, we calibrate a general pre-trained teacher diffusion model into a PSG-specific denoising network with cross-attention maps derived during inversion through a lightweight LoRA-based fine-tuning strategy. In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state-of-the-art methods in both closed- and open-set scenarios, particularly for spatial relationship prediction.

[99] DREAM: Document Reconstruction via End-to-end Autoregressive Model

Xin Li,Mingming Gong,Yunfei Wu,Jianxin Dai,Antai Guo,Xinghua Jiang,Haoyu Cao,Yinsong Liu,Deqiang Jiang,Xing Sun

Main category: cs.CV

TL;DR: This paper proposes DREAM, an end-to-end autoregressive model for document reconstruction that outperforms existing methods and performs well across multiple related tasks.

Details Motivation: Existing multi-stage approaches suffer from error propagation, while end-to-end generative models fail to preserve layout information crucial for document reconstruction. Method: An autoregressive model named DREAM is introduced to transform text images into document reconstruction sequences in an end-to-end manner. A standardized task definition, a new similarity metric (DSM), and the DocRec1K dataset are also introduced. Result: Empirical results show that the DREAM model achieves state-of-the-art performance on document reconstruction tasks and performs well on subtasks like layout analysis, text recognition, table structure recognition, formula recognition, and reading order detection. Conclusion: The proposed DREAM model demonstrates superior performance in document reconstruction and shows competitiveness across various related subtasks. Abstract: Document reconstruction constitutes a significant facet of document analysis and recognition, a field that has been progressively accruing interest within the scholarly community. A multitude of these researchers employ an array of document understanding models to generate predictions on distinct subtasks, subsequently integrating their results into a holistic document reconstruction format via heuristic principles. Nevertheless, these multi-stage methodologies are hindered by the phenomenon of error propagation, resulting in suboptimal performance. Furthermore, contemporary studies utilize generative models to extract the logical sequence of plain text, tables and mathematical expressions in an end-to-end process. However, this approach is deficient in preserving the information related to element layouts, which are vital for document reconstruction. To surmount these aforementioned limitations, in this paper we present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). DREAM transmutes the text image into a sequence of document reconstruction in a comprehensive, end-to-end process, encapsulating a broader spectrum of document element information. In addition, we establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task. Empirical results substantiate that our methodology attains unparalleled performance in the realm of document reconstruction. Furthermore, the results on a variety of subtasks, encompassing document layout analysis, text recognition, table structure recognition, formula recognition and reading order detection, indicate that our model is competitive and compatible with various tasks.

[100] Towards Solar Altitude Guided Scene Illumination

Samed Doğan,Maximilian Hoh,Nico Leuze,Nicolas R. -Peña,Alfred Schöttl

Main category: cs.CV

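TL;DR: This paper proposes a technique for generating synthetic camera sensor data, based on solar altitude as a global conditioning variable and a tailored normalization approach, to address data scarcity in research on daytime variation.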

Details Motivation: The authors identify a significant gap in research on daytime variation, presumably caused by the scarcity of available labels. In addition, real-world data acquisition is limited by labeling cost, driver safety protocols, and coverage of diverse scenarios. Method: Solar altitude is used as a global conditioning variable, together with a tailored normalization approach that targets the sensitivity of daylight to small numeric changes in altitude. Result: The approach is shown to accurately capture lighting characteristics and illumination-dependent image noise in diffusion models. Conclusion: Using solar altitude as a global conditioning variable with a tailored normalization approach can effectively generate synthetic camera sensor data with daytime variation, accurate lighting characteristics, and illumination-dependent image noise. Abstract: The development of safe and robust autonomous driving functions is heavily dependent on large-scale, high-quality sensor data. However, real-world data acquisition demands intensive human labor and is strongly limited by factors such as labeling cost, driver safety protocols and diverse scenario coverage. Thus, multiple lines of work focus on the conditional generation of synthetic camera sensor data. We identify a significant gap in research regarding daytime variation, presumably caused by the scarcity of available labels. Consequently, we present the solar altitude as a global conditioning variable. It is readily computable from latitude-longitude coordinates and local time, eliminating the need for extensive manual labeling. Our work is complemented by a tailored normalization approach, targeting the sensitivity of daylight towards small numeric changes in altitude. We demonstrate its ability to accurately capture lighting characteristics and illumination-dependent image noise in the context of diffusion models.
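The abstract notes that solar altitude is readily computable from latitude-longitude coordinates and time. The sketch below shows one common textbook approximation, not necessarily the computation the authors use: it takes a cosine approximation of the solar declination and ignores the equation of time and atmospheric refraction, so results can be off by a degree or more. The function name `solar_altitude_deg` is hypothetical, and the paper's normalization step is not reproduced here.

```python
import math
from datetime import datetime, timezone


def solar_altitude_deg(lat_deg, lon_deg, when):
    """Approximate solar altitude in degrees.

    lat_deg, lon_deg: position in degrees (east longitude positive).
    when: a timezone-aware datetime; it is converted to UTC internally.
    """
    utc = when.astimezone(timezone.utc)
    day_of_year = utc.timetuple().tm_yday

    # Approximate solar declination in degrees (simple cosine model).
    decl_deg = -23.44 * math.cos(2.0 * math.pi / 365.0 * (day_of_year + 10))

    # Approximate local solar time from UTC and longitude
    # (assumption: the equation of time is ignored).
    hours_utc = utc.hour + utc.minute / 60.0 + utc.second / 3600.0
    solar_hours = hours_utc + lon_deg / 15.0
    hour_angle_deg = 15.0 * (solar_hours - 12.0)

    lat, decl, hour_angle = map(math.radians, (lat_deg, decl_deg, hour_angle_deg))
    sin_alt = (math.sin(lat) * math.sin(decl)
               + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    # Clamp against floating-point drift before taking the arcsine.
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_alt))))


# Near the March equinox the sun is almost overhead at the equator at solar noon.
noon = datetime(2025, 3, 20, 12, 0, tzinfo=timezone.utc)
print(round(solar_altitude_deg(0.0, 0.0, noon), 1))
```

Because the inputs are just a timestamp and GPS coordinates, a value like this can be attached to every frame of a driving log without manual labeling, which is the property the abstract highlights.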

[101] Empowering Bridge Digital Twins by Bridging the Data Gap with a Unified Synthesis Framework

Wang Wang,Mingyu Shi,Jun Jiang,Wenqian Ma,Chong Liu,Yasutaka Narazaki,Xuguang Wang

Main category: cs.CV

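TL;DR: This paper proposes a systematic framework for generating 3D bridge data to address the insufficient generalization of existing synthetic data methods, and demonstrates strong performance on real-world bridge semantic segmentation and component completion.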

Details Motivation: As critical transportation infrastructure, bridges face aging and deterioration, and traditional manual inspection is inefficient. Although 3D point cloud technology provides a new data-driven paradigm, its application potential is often constrained by the incompleteness of real-world data. Method: The paper proposes a systematic framework that automatically generates complete point clouds with component-level instance annotations, high-fidelity color, and precise normal vectors, and that is further extended to simulate diverse and physically realistic incomplete point clouds. Result: Experiments show that a PointNet++ model trained on the synthetic data reaches a mean Intersection over Union (mIoU) of 84.2% on real-world bridge semantic segmentation, and a fine-tuned KT-Net shows superior performance on component completion. Conclusion: The work offers an innovative methodology and a foundational dataset for 3D visual analysis of bridge structures, with significant implications for advancing automated infrastructure management and maintenance. Abstract: As critical transportation infrastructure, bridges face escalating challenges from aging and deterioration, while traditional manual inspection methods suffer from low efficiency. Although 3D point cloud technology provides a new data-driven paradigm, its application potential is often constrained by the incompleteness of real-world data, which results from missing labels and scanning occlusions. To overcome the bottleneck of insufficient generalization in existing synthetic data methods, this paper proposes a systematic framework for generating 3D bridge data. This framework can automatically generate complete point clouds featuring component-level instance annotations, high-fidelity color, and precise normal vectors. It can be further extended to simulate the creation of diverse and physically realistic incomplete point clouds, designed to support the training of segmentation and completion networks, respectively. Experiments demonstrate that a PointNet++ model trained with our synthetic data achieves a mean Intersection over Union (mIoU) of 84.2% in real-world bridge semantic segmentation. Concurrently, a fine-tuned KT-Net exhibits superior performance on the component completion task. This research offers an innovative methodology and a foundational dataset for the 3D visual analysis of bridge structures, holding significant implications for advancing the automated management and maintenance of infrastructure.
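The headline number above is a mean Intersection over Union (mIoU) for point cloud semantic segmentation. As a small sketch of how that metric is usually defined, the code below averages per-class IoU over per-point labels. The function name `mean_iou`, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions of this example; the paper's exact evaluation protocol is not given in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean per-class IoU over per-point class labels.

    pred, target: equal-length sequences of integer class ids, one per point.
    Classes that appear in neither sequence are skipped (an assumption here).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Points where both agree on class c, and points where either says c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical labels for four points and two bridge components.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))
```

Averaging per class rather than per point keeps small components from being drowned out by large ones, which matters when a few structural elements dominate the point count.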

[102] 2D Instance Editing in 3D Space

Yuhuan Xie,Aoxuan Pan,Ming-Xian Lin,Wei Huang,Yi-Hua Huang,Xiaojuan Qi

Main category: cs.CV

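TL;DR: This paper proposes a "2D-3D-2D" framework based on editing in a 3D environment, addressing the limitations of conventional 2D image editing methods in consistency and object identity preservation.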

Details Motivation: Although generative models have made significant progress in 2D image editing, their inherent pixel-manipulation nature often makes it hard to maintain consistency and preserve object identity, so a new approach is needed. Method: 2D objects are first lifted into a 3D representation and edited in a physically plausible, rigidity-constrained 3D environment; the edited 3D objects are then reprojected and seamlessly inpainted back into the original 2D image. Result: Extensive experiments show that the framework surpasses previous methods in general performance, delivering highly consistent edits while preserving object identity. Conclusion: The proposed "2D-3D-2D" framework achieves more consistent edits in 2D image editing and robustly preserves object identity, with advantages over existing 2D editing methods. Abstract: Generative models have achieved significant progress in advancing 2D image editing, demonstrating exceptional precision and realism. However, they often struggle with consistency and object identity preservation due to their inherent pixel-manipulation nature. To address this limitation, we introduce a novel "2D-3D-2D" framework. Our approach begins by lifting 2D objects into 3D representation, enabling edits within a physically plausible, rigidity-constrained 3D environment. The edited 3D objects are then reprojected and seamlessly inpainted back into the original 2D image. In contrast to existing 2D editing methods, such as DragGAN and DragDiffusion, our method directly manipulates objects in a 3D environment. Extensive experiments highlight that our framework surpasses previous methods in general performance, delivering highly consistent edits while robustly preserving object identity.

[103] Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models

Léa Dubois,Klaus Schmidt,Chengyu Wang,Ji-Hoon Park,Lin Wang,Santiago Munoz

Main category: cs.CV

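TL;DR: This paper proposes a framework that combines a vision foundation model with a large language model to improve causal reasoning and future prediction in video understanding.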

Details Motivation: Current video understanding models fall short on high-level cognitive tasks such as causal reasoning and future prediction, mainly because they lack commonsense world knowledge. Method: The authors design a fusion module, inspired by the Q-Former architecture, that distills complex spatiotemporal and object-centric visual features into a concise, language-aligned representation, and adopt a two-stage training strategy: large-scale alignment pre-training followed by instruction fine-tuning. Result: Experiments show that the model achieves state-of-the-art performance on multiple challenging benchmarks and exhibits remarkable zero-shot generalization. Conclusion: The work moves machine perception from simple recognition toward genuine cognitive understanding, paving the way for more intelligent AI systems. Abstract: Current video understanding models excel at recognizing "what" is happening but fall short in high-level cognitive tasks like causal reasoning and future prediction, a limitation rooted in their lack of commonsense world knowledge. To bridge this cognitive gap, we propose a novel framework that synergistically fuses a powerful Vision Foundation Model (VFM) for deep visual perception with a Large Language Model (LLM) serving as a knowledge-driven reasoning core. Our key technical innovation is a sophisticated fusion module, inspired by the Q-Former architecture, which distills complex spatiotemporal and object-centric visual features into a concise, language-aligned representation. This enables the LLM to effectively ground its inferential processes in direct visual evidence. The model is trained via a two-stage strategy, beginning with large-scale alignment pre-training on video-text data, followed by targeted instruction fine-tuning on a curated dataset designed to elicit advanced reasoning and prediction skills. Extensive experiments demonstrate that our model achieves state-of-the-art performance on multiple challenging benchmarks. Notably, it exhibits remarkable zero-shot generalization to unseen reasoning tasks, and our in-depth ablation studies validate the critical contribution of each architectural component. This work pushes the boundary of machine perception from simple recognition towards genuine cognitive understanding, paving the way for more intelligent and capable AI systems in robotics, human-computer interaction, and beyond.

[104] I$^2$R: Inter and Intra-image Refinement in Few Shot Segmentation

Ourui Fu,Hangzhou He,Xinliang Zhang,Lei Zhu,Shuang Zeng,ZhaoHeng Xie,Yanye Lu

Main category: cs.CV

TL;DR: This paper proposes I²R, a novel few-shot segmentation method that addresses feature mismatches and false predictions by leveraging global semantic cues and suppressing inconsistent pixel pairs.

Details Motivation: Few-shot segmentation aims to generalize models to novel classes with minimal exemplars, but current approaches are limited by inter- and intra-image discrepancies that degrade performance. Method: The method involves two key components: 1) Category-specific high-level representations that aggregate global semantic cues from support and query images for better inter-image region localization. 2) A directional masking strategy to suppress inconsistent pixel pairs with high feature similarity but conflicting masks. Result: Experiments show that the I²R method outperforms state-of-the-art approaches, achieving improvements of 1.9% and 2.1% in mIoU under the 1-shot setting on PASCAL-5$^i$ and COCO-20$^i$ benchmarks. Conclusion: The proposed I²R method addresses the limitations in few-shot segmentation by using category-specific high-level representations and a directional masking strategy, which improves segmentation performance. Abstract: The annotation bottleneck in semantic segmentation has driven significant interest in few-shot segmentation, which aims to develop segmentation models capable of generalizing rapidly to novel classes using minimal exemplars. Conventional training paradigms typically generate query prior maps by extracting masked-area features from support images, followed by making predictions guided by these prior maps. However, current approaches remain constrained by two critical limitations stemming from inter- and intra-image discrepancies, both of which significantly degrade segmentation performance: 1) The semantic gap between support and query images results in mismatched features and inaccurate prior maps; 2) Visually similar yet semantically distinct regions within support or query images lead to false negative or false positive predictions. 
We propose a novel FSS method called I²R, which has two components: 1) Category-specific high-level representations that aggregate global semantic cues from support and query images, enabling more precise inter-image region localization and addressing the first limitation. 2) A directional masking strategy that suppresses inconsistent support-query pixel pairs, which exhibit high feature similarity but conflicting masks, to mitigate the second issue. Experiments demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 1.9% and 2.1% in mIoU under the 1-shot setting on the PASCAL-5$^i$ and COCO-20$^i$ benchmarks, respectively.
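The masking idea in the abstract above (suppress support-query pixel pairs that look alike in feature space but carry conflicting masks) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity, the threshold `tau`, the function names, and the use of an estimated query mask are all assumptions, and the sketch treats pairs symmetrically, so it does not model whatever makes the paper's strategy "directional".

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def masked_similarity(support_feats, support_mask, query_feats, query_mask, tau=0.8):
    """Build a support x query similarity matrix and zero out inconsistent pairs.

    A pair is treated as inconsistent when its similarity exceeds tau
    (hypothetical threshold) while the two mask labels disagree.
    query_mask is assumed to be an estimate, e.g. a coarse prior.
    """
    sim = []
    for f_s, m_s in zip(support_feats, support_mask):
        row = []
        for f_q, m_q in zip(query_feats, query_mask):
            s = cosine(f_s, f_q)
            if s > tau and m_s != m_q:
                s = 0.0  # suppress: similar features, conflicting masks
            row.append(s)
        sim.append(row)
    return sim

# Toy example: two support pixels and two query pixels with 2-D features.
support_feats = [[1.0, 0.0], [0.0, 1.0]]
support_mask = [1, 0]
query_feats = [[1.0, 0.0], [1.0, 0.1]]
query_mask = [1, 0]
sim = masked_similarity(support_feats, support_mask, query_feats, query_mask)
```

In the toy example, the first support pixel and the second query pixel have nearly identical features but different mask labels, so that entry is zeroed while the consistent pair keeps its similarity.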

[105] USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining

Yue Peng,Bing Xiong,Fuqiang Chen,De Eybo,RanRan Zhang,Wanming Hu,Jing Cai,Wenjian Qin

Main category: cs.CV

TL;DR: USIGAN improves IHC virtual staining by addressing weak pairing and spatial heterogeneity, offering enhanced pathological semantic consistency and clinical relevance.

Details Motivation: To overcome inaccuracies caused by spatial heterogeneity under weakly paired conditions in IHC virtual staining for better pathological analysis. Method: USIGAN employs unbalanced self-information feature transport along with UOT-CTM and PC-SCM mechanisms to enhance correlation between H&E and generated IHC images without relying on positional correspondence. Result: Experiments show that USIGAN outperforms existing methods in clinically significant metrics like IoD and Pearson-R correlation on two public datasets. Conclusion: The proposed USIGAN method effectively addresses the challenges of spatial heterogeneity and weak pairing in IHC virtual staining, achieving superior performance in content and pathological semantic consistency. Abstract: Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H&E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges. This can lead to inaccurate one-to-many mappings and generate results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional correspondence. By removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. 
Moreover, we design the Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism and the Pathology Self-Correspondence (PC-SCM) mechanism to construct correlation matrices between H&E and generated IHC images at the image level, and between real IHC and generated IHC image sets at the intra-group level. Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance.

[106] DFYP: A Dynamic Fusion Framework with Spectral Channel Attention and Adaptive Operator learning for Crop Yield Prediction

Juli Zhang,Zeyu Yan,Jing Zhang,Qiguang Miao,Quan Wang

Main category: cs.CV

TL;DR: This paper proposes DFYP, a new method for improving the accuracy of remote sensing-based crop yield predictions by incorporating advanced attention modules and operator learning networks, demonstrating superior performance over existing approaches.

Details Motivation: Accurate remote sensing-based crop yield prediction remains a fundamentally challenging task due to complex spatial patterns, heterogeneous spectral characteristics, and dynamic agricultural conditions. Existing methods often suffer from limited spatial modeling capacity and weak generalization across crop types and years. Method: DFYP is a novel Dynamic Fusion framework for crop Yield Prediction that combines spectral channel attention, edge-adaptive spatial modeling and a learnable fusion mechanism to improve robustness across diverse agricultural scenarios. It introduces three key components: (1) a Resolution-aware Channel Attention (RCA) module; (2) an Adaptive Operator Learning Network (AOL-Net); and (3) a dual-branch architecture with a learnable fusion mechanism. Result: Extensive experiments on the multi-year MODIS dataset and the multi-crop Sentinel-2 dataset demonstrate that DFYP consistently outperforms current state-of-the-art baselines in RMSE, MAE, and R² across different spatial resolutions, crop types, and time periods. Conclusion: DFYP is effective and robust for real-world agricultural monitoring. Abstract: Accurate remote sensing-based crop yield prediction remains a fundamentally challenging task due to complex spatial patterns, heterogeneous spectral characteristics, and dynamic agricultural conditions. Existing methods often suffer from limited spatial modeling capacity and weak generalization across crop types and years. To address these challenges, we propose DFYP, a novel Dynamic Fusion framework for crop Yield Prediction, which combines spectral channel attention, edge-adaptive spatial modeling and a learnable fusion mechanism to improve robustness across diverse agricultural scenarios. 
Specifically, DFYP introduces three key components: (1) a Resolution-aware Channel Attention (RCA) module that enhances spectral representation by adaptively reweighting input channels based on resolution-specific characteristics; (2) an Adaptive Operator Learning Network (AOL-Net) that dynamically selects operators for convolutional kernels to improve edge-sensitive spatial feature extraction under varying crop and temporal conditions; and (3) a dual-branch architecture with a learnable fusion mechanism, which jointly models local spatial details and global contextual information to support cross-resolution and cross-crop generalization. Extensive experiments on the multi-year MODIS dataset and the multi-crop Sentinel-2 dataset demonstrate that DFYP consistently outperforms current state-of-the-art baselines in RMSE, MAE, and R² across different spatial resolutions, crop types, and time periods, showcasing its effectiveness and robustness for real-world agricultural monitoring.
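The three metrics named in the abstract above (RMSE, MAE, and the coefficient of determination) have standard definitions, sketched here in plain Python for reference. The function name and the toy yield values are illustrative only and are not taken from the paper.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) for paired lists of true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Toy yields (arbitrary units), purely illustrative.
y_true = [2.0, 4.0, 6.0]
y_pred = [2.5, 4.0, 5.5]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
```

Lower RMSE and MAE and a higher R² indicate a better fit. R² is undefined when every true value is identical, which this sketch does not guard against.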

[107] D-FCGS: Feedforward Compression of Dynamic Gaussian Splatting for Free-Viewpoint Videos

Wenkang Zhang,Yan Zhao,Qiang Wang,Li Song,Zhengxue Cheng

Main category: cs.CV

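TL;DR: This paper proposes D-FCGS, an efficient feedforward compression framework for dynamic 3D Gaussian point clouds that addresses the challenge of compressing dynamic 3D representations in free-viewpoint video, offering fast compression while preserving visual quality.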

Details Motivation: Existing methods couple scene reconstruction with optimization-dependent coding, which limits generalizability; an efficient compression scheme for dynamic 3D representations is needed to support free-viewpoint video applications. Method: A feedforward compression framework built on a Group-of-Frames (GoF) structure, using I-P frame coding and sparse control points to extract motion information, with a dual prior-aware entropy model to compress the motion tensors. Result: Trained on multi-view video datasets, D-FCGS achieves over 40 times compression in under 2 seconds while preserving visual quality, and matches the rate-distortion performance of optimization-based methods. Conclusion: D-FCGS efficiently compresses dynamic 3D Gaussian point cloud sequences and generalizes well across scenes without per-scene optimization. Abstract: Free-viewpoint video (FVV) enables immersive 3D experiences, but efficient compression of dynamic 3D representations remains a major challenge. Recent advances in 3D Gaussian Splatting (3DGS) and its dynamic extensions have enabled high-fidelity scene modeling. However, existing methods often couple scene reconstruction with optimization-dependent coding, which limits generalizability. This paper presents Feedforward Compression of Dynamic Gaussian Splatting (D-FCGS), a novel feedforward framework for compressing temporally correlated Gaussian point cloud sequences. Our approach introduces a Group-of-Frames (GoF) structure with I-P frame coding, where inter-frame motions are extracted via sparse control points. The resulting motion tensors are compressed in a feedforward manner using a dual prior-aware entropy model that combines hyperprior and spatial-temporal priors for accurate rate estimation. For reconstruction, we perform control-point-guided motion compensation and employ a refinement network to enhance view-consistent fidelity. Trained on multi-view video-derived Gaussian frames, D-FCGS generalizes across scenes without per-scene optimization. Experiments show that it matches the rate-distortion performance of optimization-based methods, achieving over 40 times compression in under 2 seconds while preserving visual quality across viewpoints. This work advances feedforward compression for dynamic 3DGS, paving the way for scalable FVV transmission and storage in immersive applications.
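The Group-of-Frames structure with I-P frame coding mentioned above can be pictured with a small sketch. It only illustrates the general idea (an I frame stored in full, followed by P frames stored as motion relative to the previous frame). The per-point displacement used here is a simplifying assumption standing in for the paper's sparse control points, and all function names are hypothetical.

```python
def assign_frame_types(num_frames, gof_size):
    """Label each frame 'I' (start of a group) or 'P' (predicted from its predecessor)."""
    if gof_size < 1:
        raise ValueError("gof_size must be at least 1")
    return ["I" if i % gof_size == 0 else "P" for i in range(num_frames)]

def residual_motion(prev_positions, cur_positions):
    """Per-point displacement from the previous frame (assumed stand-in for control-point motion)."""
    return [[c - p for p, c in zip(prev_pt, cur_pt)]
            for prev_pt, cur_pt in zip(prev_positions, cur_positions)]

def apply_motion(prev_positions, motion):
    """Reconstruct a P frame by adding the stored displacement to the previous frame."""
    return [[p + m for p, m in zip(prev_pt, mot)]
            for prev_pt, mot in zip(prev_positions, motion)]

frame_types = assign_frame_types(num_frames=7, gof_size=3)

# One Gaussian center tracked across two consecutive frames.
prev_frame = [[1.0, 2.0, 3.0]]
cur_frame = [[1.5, 2.0, 2.75]]
motion = residual_motion(prev_frame, cur_frame)
reconstructed = apply_motion(prev_frame, motion)
```

Storing only the displacement for P frames is what makes temporally correlated sequences compressible; entropy coding of the motion tensors, as described in the abstract, is outside the scope of this sketch.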

[108] GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing

Xianzhi Ma,Jianhui Li,Changhua Pei,Hao Liu

Main category: cs.CV

TL;DR: This paper proposes GeoMag, an end-to-end framework for remote sensing image understanding that improves performance on pixel-level tasks and reduces computational costs through adaptive resolution adjustment and semantic-aware cropping.

Details Motivation: Existing RS-VLMs are limited in handling pixel-level tasks, perform poorly in small-object recognition, and consume significant computational resources when processing high-resolution images. Method: The proposed method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC) to adaptively adjust spatial resolution and enhance visual representation in relevant areas. Result: GeoMag excels in handling pixel-level tasks and maintains competitive performance across other granularities on 10 benchmarks. Conclusion: GeoMag is a more efficient and effective framework for remote sensing image understanding compared to existing RS-VLMs, as it improves perception of critical regions while reducing computational costs. Abstract: The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC), which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. 
This approach improves the model's perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.

[109] What You Have is What You Track: Adaptive and Robust Multimodal Tracking

Yuedong Tan,Jiawei Shao,Eduard Zamfir,Ruanjun Li,Zhaochong An,Chao Ma,Danda Paudel,Luc Van Gool,Radu Timofte,Zongwei Wu

Main category: cs.CV

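TL;DR: This paper studies how to improve visual tracking performance when multimodal data is missing, proposes a new adaptive framework, and verifies its superiority.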

Details Motivation: Existing trackers degrade significantly when faced with temporally missing multimodal data, because their architectures lack the flexibility to handle missing modalities. Method: A Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity is designed and combined with a video-level masking strategy to ensure temporal consistency and spatial completeness. Result: The model adapts not only to varying missing-data rates but also to scene complexity, and achieves the best performance on all 9 benchmarks. Conclusion: The paper proposes a flexible multimodal tracking framework that, through dynamically activated computational units and a novel fusion mechanism, handles temporally incomplete data well and reaches SOTA performance on multiple benchmarks. Abstract: Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness, which is critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be publicly available at https://github.com/supertyd/FlexTrack/tree/main.

[110] On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification

Jonas Klotz,Tom Burgert,Begüm Demir

Main category: cs.CV

TL;DR: This paper evaluates xAI methods and metrics for remote sensing scene classification, revealing their limitations in this domain and offering practical guidelines for their use.

Details Motivation: Most explainable AI (xAI) methods and evaluation metrics have been developed for natural images in computer vision and may not be suitable for remote sensing (RS) image scene classification. This work addresses this gap by investigating how these methods perform in the RS context. Method: The authors conduct a methodological and experimental analysis of ten explanation metrics across five categories (faithfulness, robustness, localization, complexity, randomization) applied to five feature attribution methods (Occlusion, LIME, GradCAM, LRP, DeepLIFT) on three RS datasets. Result: The study identifies key limitations of both explanation methods and evaluation metrics when applied to RS data. Perturbation-based methods depend heavily on baselines and spatial characteristics, gradient-based methods struggle with multi-label images, and relevance propagation methods may misrepresent class importance. Faithfulness, localization, and complexity metrics are found to be unreliable for large-area classes, while robustness and randomization metrics show greater stability. Conclusion: The paper concludes that explanation methods and metrics used in remote sensing (RS) image scene classification must be carefully selected, as many existing approaches developed for natural images in computer vision are not well-suited for RS. The authors provide guidelines for selecting the most appropriate methods, metrics, and hyperparameters based on their experimental and methodological analysis. Abstract: The development of explainable artificial intelligence (xAI) methods for scene classification problems has attracted great attention in remote sensing (RS). Most xAI methods and the related evaluation metrics in RS are initially developed for natural images considered in computer vision (CV), and their direct usage in RS may not be suitable. 
To address this issue, in this paper, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. In detail, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, randomization), applied to five established feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, heavily depends on perturbation baselines and spatial characteristics of RS scenes. Gradient-based approaches like GradCAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. Analogously, we find limitations in evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization metrics and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness metrics and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.

[111] High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Xinyu Huang,Yuhao Dong,Weiwei Tian,Bo Li,Rui Feng,Ziwei Liu

Main category: cs.CV

TL;DR: This paper introduces MGPO, an end-to-end reinforcement learning framework that enhances visual grounding capabilities of large multi-modal models without requiring additional annotations, achieving superior performance on both in-distribution and out-of-distribution visual question answering benchmarks.

Details Motivation: State-of-the-art large multi-modal models struggle with high-resolution image processing due to excessive irrelevant visual tokens. Existing methods like supervised fine-tuning require costly grounding annotations, while models often fail to autonomously trigger visual grounding during rollout. Method: The paper proposes MGPO, a reinforcement learning framework that enables iterative focus on key visual regions through automatic cropping based on model-predicted coordinates in a multi-turn conversation setting. Policy loss computation is restricted to outputs across multiple dialogue rounds to address optimization stability. Result: Experiments show MGPO outperforms GRPO by 5.4% on MME-Realworld and 5.2% on the challenging out-of-distribution V* Bench. Post-training Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o on OOD V* Bench. Conclusion: MGPO is an effective reinforcement learning framework for improving visual grounding capabilities of large multi-modal models without requiring additional grounding annotations. Abstract: State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into an enormous number of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images, based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach shows that robust grounding abilities can emerge in LMMs during the RL training process, leveraging only a binary reward function derived from the correctness of the final answer. 
Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short answering data without grounding annotations, MGPO effectively elicits stronger grounding capabilities compared to GRPO, leading to a 5.4% improvement on in-distribution MME-Realworld and a 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Codes are available at https://github.com/EvolvingLMMs-Lab/MGPO.
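Two ingredients described in the abstract above, cropping a sub-image from model-predicted coordinates and a binary reward based on final-answer correctness, can be sketched as follows. This is an illustration under assumptions rather than the released MGPO code: the box format `(x0, y0, x1, y1)` with exclusive upper bounds, the clamping behavior, and the case-insensitive exact-match rule are all hypothetical choices.

```python
def crop_region(image, box):
    """Crop a sub-image from a row-major image (list of rows).

    box is (x0, y0, x1, y1) in pixels, upper bounds exclusive (assumed format).
    Coordinates are clamped to the image so out-of-range predictions do not fail.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    x0, y0, x1, y1 = box
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    if x1 <= x0 or y1 <= y0:
        return []  # degenerate box: nothing to crop
    return [row[x0:x1] for row in image[y0:y1]]

def binary_reward(predicted_answer, reference_answer):
    """Return 1.0 if the final answer matches the reference, else 0.0 (assumed matching rule)."""
    same = predicted_answer.strip().lower() == reference_answer.strip().lower()
    return 1.0 if same else 0.0

# A 3x4 "image" whose pixel values encode their own index.
image = [[r * 4 + c for c in range(4)] for r in range(3)]
sub_image = crop_region(image, (1, 1, 3, 5))  # y1 exceeds the height and is clamped
reward = binary_reward(" Paris", "paris")
```

Because the reward depends only on the final answer, no grounding annotation is needed for the predicted box itself, which is the property the abstract emphasizes.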

[112] Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation

Quanzhu Niu,Yikang Zhou,Shihao Chen,Tao Zhang,Shunping Ji

Main category: cs.CV

TL;DR: The paper introduces geometric awareness using monocular depth estimation to enhance Video Instance Segmentation (VIS) robustness, achieving state-of-the-art results on the OVIS benchmark with the EDC method.

Details Motivation: To overcome challenges like object occlusions, motion blur, and appearance variations in Video Instance Segmentation. Method: Three integration paradigms were investigated: Expanding Depth Channel (EDC), Sharing ViT (SV), and Depth Supervision (DS). Result: EDC and SV significantly improved VIS robustness; EDC achieved a state-of-the-art 56.2 AP on the OVIS benchmark. Conclusion: This work establishes depth cues as critical for robust video understanding in VIS. Abstract: Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. The Expanding Depth Channel (EDC) method concatenates the depth map as an additional input channel to segmentation networks; Sharing ViT (SV) designs a uniform ViT backbone shared between the depth estimation and segmentation branches; Depth Supervision (DS) uses depth prediction as an auxiliary training guide for feature learning. Though DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. With a Swin-L backbone, our EDC method achieves 56.2 AP, setting a new state-of-the-art result on the OVIS benchmark. This work conclusively establishes depth cues as critical enablers for robust video understanding.
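The EDC paradigm described above, concatenating a depth map to the image as an extra input channel, amounts to turning an H x W x 3 input into an H x W x 4 one. A minimal sketch with nested lists follows. The function name is hypothetical, and the sketch assumes the depth map is already aligned with the image and says nothing about how the paper normalizes depth or adapts the network's first layer.

```python
def expand_depth_channel(rgb, depth):
    """Append a depth value to every RGB pixel, giving an H x W x 4 input.

    rgb is H x W x 3 nested lists and depth is H x W, assumed pixel-aligned.
    """
    if len(rgb) != len(depth):
        raise ValueError("rgb and depth must have the same height")
    rgbd = []
    for rgb_row, depth_row in zip(rgb, depth):
        if len(rgb_row) != len(depth_row):
            raise ValueError("rgb and depth must have the same width")
        rgbd.append([list(pixel) + [d] for pixel, d in zip(rgb_row, depth_row)])
    return rgbd

# A 1x2 image: a red pixel that is near and a green pixel that is far.
rgb = [[[255, 0, 0], [0, 255, 0]]]
depth = [[0.2, 0.9]]
rgbd = expand_depth_channel(rgb, depth)
```

A segmentation network consuming this input would need a first layer that accepts four channels instead of three.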

[113] CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions

Yuchen Huang,Zhiyuan Fan,Zhitao He,Sandeep Polisetty,Wenyan Li,Yi R. Fung

Main category: cs.CV

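TL;DR: This paper addresses the difficulty vision-language models have in distinguishing cultural concepts by building the synthetic cultural dataset CulTwin and fine-tuning CLIP on it, yielding CultureCLIP with improved performance.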

Details Motivation: Pretrained vision-language models (VLMs) such as CLIP excel at multimodal understanding but struggle to distinguish visually similar yet culturally distinct concepts, owing to the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives that highlight subtle differences. Method: A data curation pipeline first uses open-source VLMs and text-to-image diffusion models to build the synthetic cultural dataset CulTwin; CLIP is then fine-tuned on CulTwin to create CultureCLIP, with customized contrastive learning aligning cultural concepts with contextually enhanced captions and synthetic images. Result: The CulTwin synthetic cultural dataset was built and CultureCLIP was obtained by fine-tuning, enabling finer cultural differentiation while preserving generalization. Conclusion: Experiments show that CultureCLIP outperforms the base CLIP on culturally relevant benchmarks, improving fine-grained concept recognition by up to 5.49% on certain tasks while preserving CLIP's generalization ability, which validates the data synthesis and VLM backbone training paradigm. Abstract: Pretrained vision-language models (VLMs) such as CLIP excel in multimodal understanding but struggle with contextually relevant fine-grained visual features, making it difficult to distinguish visually similar yet culturally distinct concepts. This limitation stems from the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives highlighting subtle distinctions. To address these challenges, we first design a data curation pipeline that leverages open-sourced VLMs and text-to-image diffusion models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but represent different cultural contexts. Then, we fine-tune CLIP on CulTwin to create CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through customized contrastive learning, enabling finer cultural differentiation while preserving generalization capabilities. Experiments on culturally relevant benchmarks show that CultureCLIP outperforms the base CLIP, achieving up to a notable 5.49% improvement in fine-grained concept recognition on certain tasks, while preserving CLIP's original generalization ability, validating the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.

[114] High-Fidelity and Generalizable Neural Surface Reconstruction with Sparse Feature Volumes

Aoxiang Fan,Corentin Dumery,Nicolas Talabot,Hieu Le,Pascal Fua

Main category: cs.CV

TL;DR: This paper proposes a sparse representation method for neural surface reconstruction, enabling higher resolution results with significantly reduced memory usage, outperforming current state-of-the-art approaches.

Details Motivation: The motivation is to overcome the limitations of dense 3D feature volumes, which do not scale well with increasing voxel resolutions, thereby limiting reconstruction quality. Method: The paper uses a two-stage approach: first training a network to predict voxel occupancies from images and depth maps, then computing features and performing volume rendering only in voxels with high occupancy estimates. Custom algorithms are also developed for efficient handling of sparse volumes. Result: Experiments show that the proposed method reduces storage requirements by more than 50 times, enables reconstructions at $512^3$ resolution (compared to $128^3$ with traditional methods), and achieves superior reconstruction accuracy. Conclusion: The paper concludes that the proposed sparse representation method significantly improves memory efficiency, allowing for higher resolution reconstructions without performance degradation compared to existing methods. Abstract: Generalizable neural surface reconstruction has become a compelling technique to reconstruct from few images without per-scene optimization, where dense 3D feature volume has proven effective as a global representation of scenes. However, the dense representation does not scale well to increasing voxel resolutions, severely limiting the reconstruction quality. We thus present a sparse representation method that maximizes memory efficiency and enables significantly higher resolution reconstructions on standard hardware. We implement this through a two-stage approach: first training a network to predict voxel occupancies from posed images and associated depth maps, then computing features and performing volume rendering only in voxels with sufficiently high occupancy estimates. To support this sparse representation, we developed custom algorithms for efficient sampling, feature aggregation, and querying from sparse volumes, overcoming the dense-volume assumptions inherent in existing works. 
Experiments on public datasets demonstrate that our approach reduces storage requirements by more than 50 times without performance degradation, enabling reconstructions at $512^3$ resolution compared to the typical $128^3$ on similar hardware, and achieving superior reconstruction accuracy over current state-of-the-art methods.
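The second stage described above, computing features only in voxels whose predicted occupancy is high enough, can be sketched with a dictionary keyed by voxel coordinate. This is a simplified illustration, not the paper's custom sparse-volume algorithms: the threshold value, the function names, and the placeholder `feature_fn` are assumptions.

```python
def active_voxels(occupancy, threshold=0.5):
    """List the (x, y, z) coordinates whose predicted occupancy meets the threshold."""
    return [(x, y, z)
            for x, plane in enumerate(occupancy)
            for y, row in enumerate(plane)
            for z, prob in enumerate(row)
            if prob >= threshold]

def build_sparse_features(occupancy, feature_fn, threshold=0.5):
    """Compute features only for active voxels and store them in a coordinate-keyed dict."""
    return {coord: feature_fn(coord) for coord in active_voxels(occupancy, threshold)}

# A 2x2x2 occupancy grid with two likely-occupied voxels.
occupancy = [
    [[0.9, 0.1], [0.0, 0.2]],
    [[0.3, 0.0], [0.7, 0.1]],
]
# Placeholder feature function (hypothetical): a 1-D feature from the coordinate sum.
sparse = build_sparse_features(occupancy, lambda coord: [float(sum(coord))])

dense_count = 2 ** 3
sparse_count = len(sparse)

# Going from 128^3 to 512^3 multiplies the number of dense voxels by 64.
growth = 512 ** 3 // 128 ** 3
```

Since surfaces occupy only a small fraction of a volume, storing features for active voxels alone is what allows the resolution to grow without a matching growth in memory.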

[115] Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation

Zhenghao Zhang,Junchao Liao,Xiangyu Meng,Long Qin,Weizhi Wang

Main category: cs.CV

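TL;DR: This paper presents Tora2, a new method that supports simultaneous customization of the appearance and motion of multiple entities in video, with architectural improvements that markedly improve performance and alignment accuracy.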

Details Motivation: To extend the Tora model's appearance and motion customization capabilities and to improve alignment accuracy and detail preservation in multimodal training. Method: A decoupled personalization extractor, a gated self-attention mechanism, and a contrastive loss are introduced to integrate trajectory, textual description, and visual information and to optimize the mapping between motion and personalization embeddings. Result: Experiments show that Tora2 performs strongly in customized video generation and offers advanced motion control. Conclusion: Tora2 is the first method to achieve simultaneous multi-entity customization of appearance and motion in video generation, marking an important advance in multi-condition video generation. Abstract: Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: https://github.com/alibaba/Tora.

[116] T-LoRA: Single Image Diffusion Model Customization Without Overfitting

Vera Soboleva,Aibek Alanov,Andrey Kuznetsov,Konstantin Sobolev

Main category: cs.CV

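TL;DR: This paper proposes T-LoRA, a timestep-dependent low-rank adaptation framework for diffusion model personalization that effectively addresses overfitting in single-image customization.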

Details Motivation: Diffusion model fine-tuning tends to overfit when training samples are limited, harming generalization and output diversity, while single-image customization holds the greatest practical potential. Method: T-LoRA comprises a strategy that dynamically adjusts low-rank updates based on the diffusion timestep and a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Result: Experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques, achieving a better balance between concept fidelity and text alignment. Conclusion: T-LoRA is an efficient low-rank adaptation framework for diffusion model personalization that performs well under data and resource constraints, offering a promising solution for single-image customization. Abstract: While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.
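One way to picture a timestep-dependent rank constraint is a mask over the adapter's rank components that shrinks as the diffusion timestep grows. The abstract only states that rank-constrained updates are adjusted by timestep and that higher timesteps overfit more easily, so the linear schedule, the direction of the schedule, and every name below are assumptions made for illustration. The orthogonal initialization is not shown.

```python
import math

def effective_rank(timestep, num_timesteps, max_rank, min_rank=1):
    """Number of active rank components at a timestep (assumed linear schedule).

    Timestep 0 uses max_rank and the last timestep uses min_rank, reflecting
    the abstract's observation that higher timesteps are more prone to overfitting.
    """
    if not 0 <= timestep < num_timesteps:
        raise ValueError("timestep out of range")
    frac = timestep / (num_timesteps - 1) if num_timesteps > 1 else 0.0
    return min_rank + math.floor((1.0 - frac) * (max_rank - min_rank))

def rank_mask(timestep, num_timesteps, max_rank, min_rank=1):
    """0/1 mask over rank components, usable as a diagonal between the two adapter factors."""
    active = effective_rank(timestep, num_timesteps, max_rank, min_rank)
    return [1.0] * active + [0.0] * (max_rank - active)

# With 1000 timesteps and a maximum rank of 16 (illustrative values).
ranks = [effective_rank(t, 1000, 16) for t in range(1000)]
low_noise_mask = rank_mask(0, 1000, 16)
high_noise_mask = rank_mask(999, 1000, 16)
```

Multiplying the mask elementwise into the rank dimension of a low-rank update leaves fewer trainable directions at high timesteps, which is one plausible reading of "rank-constrained updates based on diffusion timesteps".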

[117] Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval

Haiwen Li,Delong Liu,Zhaohui Hou,Zhicheng Zhao,Fei Su

Main category: cs.CV

TL;DR: This paper proposes a scalable pipeline for composed image retrieval (CIR) using a synthetic dataset (CIRHS) and a new framework called CoAlign, achieving strong zero-shot and supervised performance without relying on manually labeled data.

Details Motivation: Existing CIR methods rely on costly, manually labeled triplets, limiting scalability and zero-shot capability. The authors aim to address this issue by proposing a scalable pipeline for automatic triplet generation and a new synthetic dataset. Method: The paper introduces Hybrid Contextual Alignment (CoAlign), a novel CIR framework that enables global alignment and local reasoning within a broader context. Additionally, an automatic pipeline generates a fully synthetic dataset named CIRHS, leveraging an LLM for prompt generation and a text-to-image model for creating image pairs. Result: CoAlign achieves outstanding zero-shot performance on three benchmarks using the synthetic CIRHS dataset, demonstrating the feasibility of training CIR models on synthetic data. Under supervised training, the method outperforms all state-of-the-art supervised CIR approaches. Conclusion: The paper concludes that the proposed CoAlign framework combined with the synthetic CIRHS dataset achieves outstanding zero-shot performance and surpasses state-of-the-art supervised CIR approaches, demonstrating the feasibility of training CIR models on fully synthetic data. Abstract: As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. 
In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.

[118] Exploring Partial Multi-Label Learning via Integrating Semantic Co-occurrence Knowledge

Xin Wu,Fei Teng,Yue Feng,Kaibo Shi,Zhuosheng Lin,Ji Zhang,James Wang

Main category: cs.CV

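TL;DR: This paper proposes SCINet, a new partial multi-label learning framework that uses a bi-dominant prompter module, a cross-modality fusion module, and an intrinsic semantic augmentation strategy to resolve ambiguous relationships between labels and instances, and demonstrates its superiority on several datasets.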

Details Motivation: Accurately identifying the ambiguous relationships between labels and instances is the core challenge of partial multi-label learning; the paper emphasizes the importance of matching co-occurrence patterns between labels and instances. Method: A novel framework named SCINet is proposed, comprising a bi-dominant prompter module, a cross-modality fusion module, and an intrinsic semantic augmentation strategy. Result: Experiments show that SCINet outperforms state-of-the-art methods on four widely used benchmark datasets. Conclusion: SCINet is an effective partial multi-label learning framework that outperforms state-of-the-art methods on four widely used benchmark datasets. Abstract: Partial multi-label learning aims to extract knowledge from incompletely annotated data, which includes known correct labels, known incorrect labels, and unknown labels. The core challenge lies in accurately identifying the ambiguous relationships between labels and instances. In this paper, we emphasize that matching co-occurrence patterns between labels and instances is key to addressing this challenge. To this end, we propose Semantic Co-occurrence Insight Network (SCINet), a novel and effective framework for partial multi-label learning. Specifically, SCINet introduces a bi-dominant prompter module, which leverages an off-the-shelf multimodal model to capture text-image correlations and enhance semantic alignment. To reinforce instance-label interdependencies, we develop a cross-modality fusion module that jointly models inter-label correlations, inter-instance relationships, and co-occurrence patterns across instance-label assignments. Moreover, we propose an intrinsic semantic augmentation strategy that enhances the model's understanding of intrinsic data semantics by applying diverse image transformations, thereby fostering a synergistic relationship between label confidence and sample difficulty. Extensive experiments on four widely-used benchmark datasets demonstrate that SCINet surpasses state-of-the-art methods.

[119] Ensemble-Based Deepfake Detection using State-of-the-Art Models with Robust Cross-Dataset Generalisation

Haroon Wahab,Hassan Ugail,Lujain Jaleel

Main category: cs.CV

TL;DR: Ensemble-based deepfake detection improves generalization and stability across diverse datasets compared to individual models.

Details Motivation: Deepfake detection models often perform poorly on out-of-distribution data, so this work investigates ensemble methods to improve generalization across diverse datasets. Method: The researchers used an ensemble-based approach, combining prediction probabilities from several state-of-the-art asymmetric models on a recent open-source benchmark. They evaluated the approach using two distinct out-of-domain datasets. Result: Experiments showed that no single model consistently outperformed others across different settings, while ensemble-based predictions provided more stable and reliable performance in all scenarios. Conclusion: The study concludes that asymmetric ensembling provides a robust and scalable solution for real-world deepfake detection, offering stable performance across diverse datasets without prior knowledge of forgery type or quality. Abstract: Machine learning-based Deepfake detection models have achieved impressive results on benchmark datasets, yet their performance often deteriorates significantly when evaluated on out-of-distribution data. In this work, we investigate an ensemble-based approach for improving the generalization of deepfake detection systems across diverse datasets. Building on a recent open-source benchmark, we combine prediction probabilities from several state-of-the-art asymmetric models proposed at top venues. Our experiments span two distinct out-of-domain datasets and demonstrate that no single model consistently outperforms others across settings. In contrast, ensemble-based predictions provide more stable and reliable performance in all scenarios. Our results suggest that asymmetric ensembling offers a robust and scalable solution for real-world deepfake detection where prior knowledge of forgery type or quality is often unavailable.
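
The ensembling idea in the abstract above can be illustrated with a minimal sketch. The abstract says only that prediction probabilities from several models are combined; the unweighted average and the 0.5 decision threshold used here are assumptions, and the function names are hypothetical.

```python
from statistics import mean


def ensemble_fake_probability(model_probs):
    """Combine per-model 'fake' probabilities for one sample.

    Assumption: an unweighted average. The source does not state the
    combination rule, so this is only one plausible choice.
    """
    if not model_probs:
        raise ValueError("need at least one model probability")
    return mean(model_probs)


def ensemble_predict(model_probs, threshold=0.5):
    """Return True if the averaged probability flags the sample as fake.

    The 0.5 threshold is a hypothetical default, not taken from the paper.
    """
    return ensemble_fake_probability(model_probs) >= threshold


if __name__ == "__main__":
    # Three hypothetical detectors that disagree on the same sample.
    probs = [0.9, 0.4, 0.8]
    print(ensemble_fake_probability(probs), ensemble_predict(probs))
```

Averaging dampens the effect of any single detector that happens to fail on a given out-of-domain dataset, which is consistent with the stability the abstract reports.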

[120] Geo-Registration of Terrestrial LiDAR Point Clouds with Satellite Images without GNSS

Xinyu Wang,Muhammad Ibrahim,Atif Mansoor,Ajmal Mian

Main category: cs.CV

TL;DR: A new geo-registration method improves LiDAR point cloud alignment in urban areas by integrating satellite images and terrain data, outperforming traditional GNSS-dependent methods.

Details Motivation: Accurate geo-registration of LiDAR point clouds is challenging in GNSS-denied urban environments due to localization errors from unstable GNSS signals. This necessitates a method that does not rely on prior localization. Method: The method uses a pre-trained Point Transformer model to segment road points, extracts road skeletons and intersections, performs global rigid alignment and local refinement using RBF interpolation, and applies elevation correction based on SRTM terrain data. Result: Testing on the KITTI dataset showed a 55.3% improvement in planimetric alignment standard deviation and 30.5% gain in elevation correlation. On the Perth dataset, it achieved a 77.4% improvement in alignment and 50.4% gain in elevation correlation compared to initial alignment. Conclusion: The proposed structured geo-registration and spatial correction method effectively aligns LiDAR point clouds with satellite images, improving alignment accuracy significantly in urban areas without relying on GNSS data. Abstract: Accurate geo-registration of LiDAR point clouds presents significant challenges in GNSS signal denied urban areas with high-rise buildings and bridges. Existing methods typically rely on real-time GNSS and IMU data, that require pre-calibration and assume stable positioning during data collection. However, this assumption often fails in dense urban areas, resulting in localization errors. To address this, we propose a structured geo-registration and spatial correction method that aligns 3D point clouds with satellite images, enabling frame-wise recovery of GNSS information and reconstruction of city scale 3D maps without relying on prior localization. The proposed approach employs a pre-trained Point Transformer model to segment the road points and then extracts the road skeleton and intersection points from the point cloud as well as the target map for alignment. 
Global rigid alignment of the two is performed using the intersection points, followed by local refinement using radial basis function (RBF) interpolation. Elevation correction is then applied to the point cloud based on terrain information from SRTM dataset to resolve vertical discrepancies. The proposed method was tested on the popular KITTI benchmark and a locally collected Perth (Western Australia) CBD dataset. On the KITTI dataset, our method achieved an average planimetric alignment standard deviation (STD) of 0.84 m across sequences with intersections, representing a 55.3% improvement over the original dataset. On the Perth dataset, which lacks GNSS information, our method achieved an average STD of 0.96 m compared to the GPS data extracted from Google Maps API. This corresponds to a 77.4% improvement from the initial alignment. Our method also resulted in elevation correlation gains of 30.5% on the KITTI dataset and 50.4% on the Perth dataset.
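
The global rigid alignment step can be sketched as a closed-form least-squares fit of a 2D rotation and translation between matched intersection points. This is only an illustration: it assumes the correspondences between point-cloud and map intersections are already known, which the abstract does not describe, and the function names are hypothetical.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    src, dst: equal-length lists of matched (x, y) points.
    Assumption: correspondences are given; matching is out of scope here.
    """
    n = len(src)
    if n != len(dst) or n < 2:
        raise ValueError("need at least two matched point pairs")
    # Centroids of both point sets.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    # Dot-like and cross-like sums of the centered coordinates.
    a = sum((p[0] - sx) * (q[0] - dx) + (p[1] - sy) * (q[1] - dy)
            for p, q in zip(src, dst))
    b = sum((p[0] - sx) * (q[1] - dy) - (p[1] - sy) * (q[0] - dx)
            for p, q in zip(src, dst))
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that carries the rotated source centroid onto dst's.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)


def apply_rigid_2d(points, theta, t):
    """Rotate points by theta and translate them by t."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]


if __name__ == "__main__":
    src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    dst = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]  # src rotated 90 degrees, shifted
    theta, t = fit_rigid_2d(src, dst)
    print(math.degrees(theta), t)
```

A rigid fit of this kind cannot absorb local distortions, which is presumably why the abstract follows it with an RBF-based local refinement.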

[121] TextPixs: Glyph-Conditioned Diffusion with Character-Aware Attention and OCR-Guided Supervision

Syeda Anshrah Gillani,Mirza Samad Ahmed Baig,Osama Ahmed Khan,Shahid Munir Shah,Umema Mujeeb,Maheen Ali

Main category: cs.CV

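TL;DR: This paper proposes GCDA, a new framework for text-to-image diffusion models that addresses unreadable and misspelled text in generated images, improving both text rendering quality and image generation results.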

Details Motivation: Existing text-to-image diffusion models cannot produce readable, meaningful, and correctly spelled text in generated images, which significantly limits their use in practical applications such as advertising, education, and creative design. Method: A new framework, Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), is introduced; it extends a typical diffusion model with three carefully designed modules: a dual-stream text encoder, a character-aware attention mechanism, and an OCR-in-the-loop fine-tuning stage. Result: Large-scale experiments show that GCDA achieves the best performance on benchmark datasets such as MARIO-10M and T2I-CompBench, reducing Character Error Rate from 0.21 to 0.08 and Word Error Rate from 0.25 to 0.15, while remaining competitive in human perceptual quality and high-fidelity image synthesis (FID: 14.3). Conclusion: GCDA sets a new state of the art on all metrics, performing strongly on character and word error rates for text rendering while maintaining high-quality image generation. Abstract: The modern text-to-image diffusion models boom has opened a new era in digital content production as it has proven the previously unseen ability to produce photorealistic and stylistically diverse imagery based on the semantics of natural-language descriptions. However, the consistent disadvantage of these models is that they cannot generate readable, meaningful, and correctly spelled text in generated images, which significantly limits the use of practical purposes like advertising, learning, and creative design. This paper introduces a new framework, namely Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), using which a typical diffusion backbone is extended by three well-designed modules. To begin with, the model has a dual-stream text encoder that encodes both semantic contextual information and explicit glyph representations, resulting in a character-aware representation of the input text that is rich in nature. Second, an attention mechanism that is aware of the character is proposed with a new attention segregation loss that aims to limit the attention distribution of each character independently in order to avoid distortion artifacts. Lastly, GCDA has an OCR-in-the-loop fine-tuning phase, where a full text perceptual loss, directly optimises models to be legible and accurately spell. 
Large scale experiments to benchmark datasets, such as MARIO-10M and T2I-CompBench, reveal that GCDA sets a new state-of-the-art on all metrics, with better character based metrics on text rendering (Character Error Rate: 0.08 vs 0.21 for the previous best; Word Error Rate: 0.15 vs 0.25), human perception, and comparable image synthesis quality on high-fidelity (FID: 14.3).
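
For readers unfamiliar with the reported metrics, Character Error Rate and Word Error Rate are conventionally defined as the edit distance between the recognized and reference text divided by the reference length. The sketch below follows that standard definition; the paper's exact evaluation protocol (OCR model, text normalization) is not given in the abstract, so none is assumed, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character Error Rate: edits counted over characters."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word Error Rate: edits counted over whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    # Hypothetical target text and OCR reading of a generated image.
    print(cer("OPEN", "OPEM"), wer("grand opening sale", "grand openning sale"))
```

Under this definition, a CER of 0.08 corresponds to roughly eight character edits per hundred reference characters.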

[122] VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis

Alexandre Symeonidis-Herzig,Özge Mercanoğlu Sincan,Richard Bowden

Main category: cs.CV

TL;DR: VisualSpeaker improves the quality and accuracy of 3D facial animation by combining photorealistic rendering with visual speech recognition.

Details Motivation: Although prior methods show good quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations in 2D computer vision and graphics. Method: VisualSpeaker combines photorealistic 3D Gaussian Splatting avatar rendering with a pre-trained Visual Automatic Speech Recognition model. Result: Evaluation on the MEAD dataset shows that VisualSpeaker improves the standard Lip Vertex Error metric by 56.1% and also improves the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. Conclusion: VisualSpeaker is a new method that improves 3D facial animation using photorealistic differentiable rendering supervised by visual speech recognition. Abstract: Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. This perceptual focus naturally supports accurate mouthings, essential cues that disambiguate similar manual signs in sign language avatars.

[123] MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding

Chang Liu,Ye Pan,Chenyang Ding,Susanto Rahardja,Xiaokang Yang

Main category: cs.CV

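TL;DR: This paper proposes MEDTalk, a new method for generating realistic, emotionally expressive 3D facial animation, with dynamic control and a high degree of customization through audio, text, and reference images.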

Details Motivation: Existing methods mainly focus on static, predefined emotion labels, which limits the diversity and naturalness of the generated results. A method that enables finer-grained, dynamic control of emotional facial animation is therefore needed. Method: Content and emotion embedding spaces are disentangled from motion sequences using a cross-reconstruction process; audio and speech text are combined to predict frame-wise intensity variations and dynamically adjust static emotion features; and multimodal inputs guide the generation of specified facial expressions. Result: The paper proposes MEDTalk, a new framework for fine-grained, dynamic emotional talking head generation that controls lip movements and facial expressions independently while enhancing controllability and personalization. Conclusion: MEDTalk provides a novel framework for fine-grained and dynamic emotional talking head generation, and its results can be conveniently integrated into industrial production pipelines. Abstract: Audio-driven emotional 3D facial animation aims to generate synchronized lip movements and vivid facial expressions. However, most existing approaches focus on static and predefined emotion labels, limiting their diversity and naturalness. To address these challenges, we propose MEDTalk, a novel framework for fine-grained and dynamic emotional talking head generation. Our approach first disentangles content and emotion embedding spaces from motion sequences using a carefully designed cross-reconstruction process, enabling independent control over lip movements and facial expressions. Beyond conventional audio-driven lip synchronization, we integrate audio and speech text, predicting frame-wise intensity variations and dynamically adjusting static emotion features to generate realistic emotional expressions. Furthermore, to enhance control and personalization, we incorporate multimodal inputs-including text descriptions and reference expression images-to guide the generation of user-specified facial expressions. With MetaHuman as the priority, our generated results can be conveniently integrated into the industrial production pipeline.

[124] MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding

Tongtong Cheng,Rongzhen Li,Yixin Xiong,Tao Zhang,Jing Wang,Kai Liu

Main category: cs.CV

TL;DR: This paper proposes MCAM, a new model for accurate driving behavior recognition and reasoning in autonomous driving video understanding.

Details Motivation: Existing methods for driving behavior recognition fail to address spurious correlations across modalities and ignore ego-vehicle level causality modeling. Method: The paper proposes a Multimodal Causal Analysis Model (MCAM) that includes a multi-level feature extractor, a causal analysis module using a directed acyclic graph (DAG), and a vision-language transformer. Result: Extensive experiments on the BDD-X and CoVLA datasets show that MCAM achieves state-of-the-art performance in visual-language causal relationship learning. Conclusion: MCAM is effective in capturing causal characteristics within video sequences and demonstrates superior performance for autonomous driving applications. Abstract: Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often capture only shallow causal relationships, fail to address spurious correlations across modalities, and ignore the ego-vehicle level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at https://github.com/SixCorePeach/MCAM.

[125] Discontinuity-aware Normal Integration for Generic Central Camera Models

Francesco Milano,Manuel López-Antequera,Naina Dhingra,Roland Siegwart,Robert Thiel

Main category: cs.CV

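TL;DR: This paper proposes a new 3D surface reconstruction method that addresses the shortcomings of existing methods in handling depth discontinuities and complex camera models.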

Details Motivation: Existing normal integration methods handle depth discontinuities poorly and are limited to orthographic or ideal pinhole cameras. Method: The formulation is based on a local planarity assumption, modeled through constraints between surface normals and ray directions. Result: The method more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models. Conclusion: The paper proposes a new normal integration formulation that explicitly models discontinuities and handles generic central camera models. Abstract: Recovering a 3D surface from its surface normal map, a problem known as normal integration, is a key component for photometric shape reconstruction techniques such as shape-from-shading and photometric stereo. The vast majority of existing approaches for normal integration handle only implicitly the presence of depth discontinuities and are limited to orthographic or ideal pinhole cameras. In this paper, we propose a novel formulation that allows modeling discontinuities explicitly and handling generic central cameras. Our key idea is based on a local planarity assumption, that we model through constraints between surface normals and ray directions. Compared to existing methods, our approach more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models.

[126] ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models

Chihan Huang,Hao Tang

Main category: cs.CV

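TL;DR: ScoreAdv is a new diffusion-model-based method for generating unrestricted adversarial examples; by introducing an interpretable adversarial guidance mechanism and saliency maps, it achieves effective attacks while preserving image naturalness.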

Details Motivation: Existing adversarial attack methods rely on ℓp-norm perturbations or limited generative models, and cannot produce adversarial examples that both align with human perception and achieve high attack success rates. Method: Leveraging the denoising capability of diffusion models, the authors design a gradual adversarial guidance mechanism and use saliency maps to inject reference-image information, balancing denoising and adversarial perturbation during generation. Result: ScoreAdv achieves state-of-the-art attack success rates against ten target models on the ImageNet and CelebA datasets, produces higher-quality images, and remains robust under defensive measures. Conclusion: ScoreAdv overcomes the limitations of traditional adversarial attacks, providing a flexible and efficient framework for generating adversarial examples across multiple tasks, such as classification and retrieval. Abstract: Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality. 
Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.

[127] CAST-Phys: Contactless Affective States Through Physiological signals Database

Joaquim Comas,Alexander Joel Vera,Xavier Vives,Eleonora De Filippi,Alexandre Pereda,Federico Sukno

Main category: cs.CV

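TL;DR: This paper presents CAST-Phys, a new dataset for remote physiological emotion recognition, which addresses the way contact-based devices distort genuine emotions in traditional approaches and demonstrates the potential of multi-modal data for contactless emotion recognition.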

Details Motivation: Effective affective multi-modal datasets are currently lacking, and traditional contact-based devices influence genuine emotional responses, so a contactless approach to emotion recognition is needed. Method: The authors built CAST-Phys, a novel high-quality multi-modal dataset for remote physiological emotion recognition, combining facial video with physiological signals (such as PPG, EDA, and RR). Result: The analysis indicates that physiological signals substantially improve emotion recognition when facial expressions alone do not provide sufficient emotional information, and confirms the effectiveness of multi-modal fusion. Conclusion: The CAST-Phys database demonstrates the importance of physiological signals in remote multi-modal emotion recognition and their potential for contactless emotion recognition technologies. Abstract: In recent years, affective computing and its applications have become a fast-growing research topic. Despite significant advancements, the lack of affective multi-modal datasets remains a major bottleneck in developing accurate emotion recognition systems. Furthermore, the use of contact-based devices during emotion elicitation often unintentionally influences the emotional experience, reducing or altering the genuine spontaneous emotional response. This limitation highlights the need for methods capable of extracting affective cues from multiple modalities without physical contact, such as remote physiological emotion recognition. To address this, we present the Contactless Affective States Through Physiological Signals Database (CAST-Phys), a novel high-quality dataset explicitly designed for multi-modal remote physiological emotion recognition using facial and physiological cues. The dataset includes diverse physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and respiration rate (RR), alongside high-resolution uncompressed facial video recordings, enabling the potential for remote signal recovery. Our analysis highlights the crucial role of physiological signals in realistic scenarios where facial expressions alone may not provide sufficient emotional information. Furthermore, we demonstrate the potential of remote multi-modal emotion recognition by evaluating the impact of individual and fused modalities, showcasing its effectiveness in advancing contactless emotion recognition technologies.

[128] Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification

Murilo Gustineli,Anthony Miyaguchi,Adrian Cheung,Divyansh Khattak

Main category: cs.CV

TL;DR: The DS@GT team developed an efficient solution for plant species identification using a combination of advanced techniques and publicly shared resources.

Details Motivation: To address the challenge of multi-species plant identification in vegetation quadrat images effectively and efficiently. Method: The method involves using ViTD2PC24All for patch-level inference, a 4x4 tiling strategy, and domain-prior adaptation with PaCMAP + K-Means clustering and geolocation filtering. Result: The approach achieved a macro-averaged F1 score of 0.348 on the private leaderboard. Conclusion: DS@GT's solution to the PlantCLEF 2025 challenge combines a fine-tuned Vision Transformer, tiling strategy, and domain-prior adaptation without additional training. Abstract: We describe DS@GT's second-place solution to the PlantCLEF 2025 challenge on multi-species plant identification in vegetation quadrat images. Our pipeline combines (i) a fine-tuned Vision Transformer ViTD2PC24All for patch-level inference, (ii) a 4x4 tiling strategy that aligns patch size with the network's 518x518 receptive field, and (iii) domain-prior adaptation through PaCMAP + K-Means visual clustering and geolocation filtering. Tile predictions are aggregated by majority vote and re-weighted with cluster-specific Bayesian priors, yielding a macro-averaged F1 of 0.348 (private leaderboard) while requiring no additional training. All code, configuration files, and reproducibility scripts are publicly available at https://github.com/dsgt-arc/plantclef-2025.
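
The tiling and vote-aggregation steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's implementation (their code is at the linked repository): it assumes one predicted species label per tile, multiplies each vote share by a per-species prior weight, and keeps species above a hypothetical `min_score` cutoff.

```python
from collections import Counter


def tile_boxes(width, height, grid=4):
    """Return (left, upper, right, lower) boxes for a grid x grid tiling."""
    boxes = []
    for row in range(grid):
        for col in range(grid):
            boxes.append((col * width // grid,
                          row * height // grid,
                          (col + 1) * width // grid,
                          (row + 1) * height // grid))
    return boxes


def aggregate_tiles(tile_labels, prior=None, min_score=0.1):
    """Turn per-tile labels into an image-level species list.

    tile_labels: one predicted species label per tile (an assumption).
    prior: optional dict of species -> weight, standing in for the
        cluster-specific priors mentioned in the abstract.
    min_score: hypothetical cutoff on the re-weighted vote share.
    """
    votes = Counter(tile_labels)
    total = len(tile_labels)
    scores = {}
    for species, count in votes.items():
        weight = 1.0 if prior is None else prior.get(species, 0.0)
        scores[species] = (count / total) * weight
    return sorted(s for s, score in scores.items() if score >= min_score)


if __name__ == "__main__":
    # A 2072x2072 image splits into sixteen 518x518 tiles.
    print(tile_boxes(2072, 2072)[0])
    labels = ["a"] * 10 + ["b"] * 5 + ["c"]
    print(aggregate_tiles(labels), aggregate_tiles(labels, {"a": 1.0, "b": 0.2, "c": 1.0}))
```

Because all of this happens at inference time, a pipeline of this shape needs no additional training, in line with the abstract's claim.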

[129] Reflections Unlock: Geometry-Aware Reflection Disentanglement in 3D Gaussian Splatting for Photorealistic Scenes Rendering

Jiayi Song,Zihan Ye,Qingyuan Zhou,Weidong Yang,Ben Fei,Jingyi Xu,Ying He,Wanli Ouyang

Main category: cs.CV

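TL;DR: This paper proposes Ref-Unlock, which uses geometry-aware reflection modeling to address the challenge that reflective surfaces pose for novel view synthesis, achieving more accurate scene reconstruction and flexible reflection editing.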

Details Motivation: Existing methods such as NeRF and 3DGS often misinterpret reflections as physical geometry, degrading reconstruction quality, especially in real-world scenes with complex geometry. Method: The paper proposes Ref-Unlock, a geometry-aware reflection modeling framework based on 3D Gaussian Splatting, which uses a dual-branch representation with high-order spherical harmonics to capture high-frequency reflective details, together with pseudo-depth maps and a geometry-aware bilateral smoothness constraint. Result: Ref-Unlock significantly outperforms classical GS-based reflection methods and achieves results competitive with NeRF-based models. Conclusion: Ref-Unlock offers an efficient and generalizable solution for realistic rendering of reflective scenes and enables reflection editing driven by vision foundation models. Abstract: Accurately rendering scenes with reflective surfaces remains a significant challenge in novel view synthesis, as existing methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) often misinterpret reflections as physical geometry, resulting in degraded reconstructions. Previous methods rely on incomplete and non-generalizable geometric constraints, leading to misalignment between the positions of Gaussian splats and the actual scene geometry. When dealing with real-world scenes containing complex geometry, the accumulation of Gaussians further exacerbates surface artifacts and results in blurred reconstructions. To address these limitations, in this work, we propose Ref-Unlock, a novel geometry-aware reflection modeling framework based on 3D Gaussian Splatting, which explicitly disentangles transmitted and reflected components to better capture complex reflections and enhance geometric consistency in real-world scenes. Our approach employs a dual-branch representation with high-order spherical harmonics to capture high-frequency reflective details, alongside a reflection removal module providing pseudo reflection-free supervision to guide clean decomposition. Additionally, we incorporate pseudo-depth maps and a geometry-aware bilateral smoothness constraint to enhance 3D geometric consistency and stability in decomposition. Extensive experiments demonstrate that Ref-Unlock significantly outperforms classical GS-based reflection methods and achieves competitive results with NeRF-based models, while enabling flexible vision foundation models (VFMs) driven reflection editing. 
Our method thus offers an efficient and generalizable solution for realistic rendering of reflective scenes. Our code is available at https://ref-unlock.github.io/.

[130] Omni-Video: Democratizing Unified Video Understanding and Generation

Zhiyu Tan,Hao Yang,Luozheng Qin,Jia Gong,Mengping Yang,Hao Li

Main category: cs.CV

TL;DR: This paper introduces Omni-Video, a unified framework for video understanding, generation, and editing that combines multimodal large language models with diffusion decoders through a lightweight architecture and efficient training approach.

Details Motivation: Current foundational models predominantly focus on image processing, creating a gap in unified modeling for video understanding and generation. The goal is to develop an efficient and effective unified framework for handling video-related tasks. Method: Omni-Video utilizes a lightweight architectural design that integrates vision heads and adapters with multimodal large language models (MLLMs) and diffusion decoders. It also employs an efficient multi-stage training scheme to enhance performance. Result: Omni-Video achieves high-quality video production by leveraging MLLMs to generate visual clues used by diffusion decoders while maintaining efficiency in training and resource usage. Conclusion: The proposed Omni-Video framework demonstrates satisfactory generalization abilities across video understanding, generation, and editing tasks. Abstract: Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. 
To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of MLLMs and an adapter before the input of diffusion decoders; the former produces visual tokens for the latter, which adapts these visual tokens to the conditional space of diffusion decoders; and 2) an efficient multi-stage training scheme that facilitates a fast connection between MLLMs and diffusion decoders with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing and understanding tasks.

[131] Prompt-Free Conditional Diffusion for Multi-object Image Augmentation

Haoyu Wang,Lei Zhang,Wei Wei,Chen Ding,Yanning Zhang

Main category: cs.CV

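TL;DR: Current multi-object image generation methods either produce objects that deviate substantially from the original data or yield images that lack diversity. To address both problems, the authors propose a prompt-free diffusion framework for multi-object image augmentation. Through a local-global semantic fusion strategy, LoRA-based knowledge injection, and a reward-model-based counting loss, the method reduces category deviation and increases the diversity of generated data, outperforming several baselines and showing strong downstream task gains and out-of-domain generalization.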

Details Motivation: When generating multi-object images that resemble real scenarios, most existing methods either rely entirely on text conditions, causing the generated objects to deviate from the original data, or rely too heavily on the original images, yielding generated images that lack diversity and are of limited help to downstream tasks. To address both problems at once, the authors propose a prompt-free diffusion framework for multi-object image augmentation. Method: A local-global semantic fusion strategy extracts semantics from images to replace text, and knowledge is injected into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, a reward-model-based counting loss assists the traditional reconstruction loss during training. Result: Constraining the object counts of each category, rather than applying pixel-by-pixel constraints, bridges the quantity deviation between the generated and original data while improving the diversity of the generated data. Conclusion: Experimental results demonstrate the method's superiority over several representative baselines, along with strong downstream task gains and out-of-domain generalization. Abstract: Diffusion models have underpinned many recent advances in dataset augmentation for various computer vision tasks. However, when generating multi-object images that resemble real scenarios, most existing methods either rely entirely on text conditions, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems at once, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss for model training. By constraining the object counts of each category instead of applying pixel-by-pixel constraints, we bridge the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gain and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.

[132] SoftReMish: A Novel Activation Function for Enhanced Convolutional Neural Networks for Visual Recognition Performance

Mustafa Bayram Gücen

Main category: cs.CV

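TL;DR: This study proposes SoftReMish, a new activation function intended to improve the performance of convolutional neural networks on image classification tasks, and validates it in comparative experiments against ReLU, Tanh, and Mish.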

Details Motivation: The study proposes SoftReMish, a new activation function intended to improve the performance of convolutional neural networks (CNNs) on image classification tasks. Method: Using the MNIST dataset, a standard CNN architecture with two convolutional layers, max pooling, and fully connected layers was implemented, and SoftReMish was evaluated by substituting it for the activation function in all trainable layers. Result: SoftReMish achieved the lowest training loss (3.14e-8) and the highest validation accuracy (99.41%), outperforming all other activation functions tested. Conclusion: SoftReMish shows better convergence behavior and generalization capability, making it a promising candidate for visual recognition tasks. Abstract: In this study, SoftReMish, a new activation function designed to improve the performance of convolutional neural networks (CNNs) in image classification tasks, is proposed. Using the MNIST dataset, a standard CNN architecture consisting of two convolutional layers, max pooling, and fully connected layers was implemented. SoftReMish was evaluated against popular activation functions including ReLU, Tanh, and Mish by replacing the activation function in all trainable layers. The model performance was assessed in terms of minimum training loss and maximum validation accuracy. Results showed that SoftReMish achieved a minimum loss (3.14e-8) and a validation accuracy (99.41%), outperforming all other functions tested. These findings demonstrate that SoftReMish offers better convergence behavior and generalization capability, making it a promising candidate for visual recognition tasks.
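
The abstract does not give the formula for SoftReMish, so it is not reproduced here. For context, the sketch below shows two of the baselines it was compared against, ReLU and Mish, using their standard definitions (Mish is x · tanh(softplus(x))); the numerically stable softplus is an implementation choice, not something taken from the paper.

```python
import math


def relu(x):
    """ReLU baseline: max(0, x)."""
    return max(0.0, x)


def softplus(x):
    """Numerically stable softplus, ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish baseline: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    for value in (-2.0, 0.0, 2.0):
        print(value, relu(value), mish(value))
```

Unlike ReLU, Mish is smooth and lets small negative values through, which is the kind of property new activation functions in this family typically try to build on.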

[133] Normalizing Diffusion Kernels with Optimal Transport

Nathan Kessler,Robin Magnet,Jean Feydy

Main category: cs.CV

TL;DR: This paper introduces a novel class of smoothing operators derived from general similarity or adjacency matrices, normalized via a symmetric Sinkhorn algorithm to mimic Laplacian-based heat diffusion, enabling effective processing of irregular data while retaining theoretical guarantees.

Details Motivation: Existing methods like convolution kernels and message-passing layers are biased near domain boundaries, while traditional Laplacian-based smoothing requires well-defined domain structures. This work aims to bridge the gap by introducing a more flexible class of smoothing operators. Method: A symmetric variant of the Sinkhorn algorithm is used to normalize general similarity or adjacency matrices into diffusion-like operators that mimic the behavior of Laplacians. Result: The introduced smoothing operators effectively mimic Laplacian-based heat diffusion on irregular domains such as point clouds and sparse voxel grids, while preserving spectral properties of the Laplacian. Conclusion: The proposed smoothing operators can approximate heat diffusion and retain spectral information from the Laplacian, enabling effective processing of irregular data with applications in shape analysis and matching. Abstract: Smoothing a signal based on local neighborhoods is a core operation in machine learning and geometry processing. On well-structured domains such as vector spaces and manifolds, the Laplace operator derived from differential geometry offers a principled approach to smoothing via heat diffusion, with strong theoretical guarantees. However, constructing such Laplacians requires a carefully defined domain structure, which is not always available. Most practitioners thus rely on simple convolution kernels and message-passing layers, which are biased against the boundaries of the domain. We bridge this gap by introducing a broad class of smoothing operators, derived from general similarity or adjacency matrices, and demonstrate that they can be normalized into diffusion-like operators that inherit desirable properties from Laplacians. Our approach relies on a symmetric variant of the Sinkhorn algorithm, which rescales positive smoothing operators to match the structural behavior of heat diffusion. 
This construction enables Laplacian-like smoothing and processing of irregular data such as point clouds, sparse voxel grids or mixture of Gaussians. We show that the resulting operators not only approximate heat diffusion but also retain spectral information from the Laplacian itself, with applications to shape analysis and matching.
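
As a rough illustration of the symmetric Sinkhorn normalization mentioned in the abstract, the sketch below rescales a symmetric positive kernel `K` into `diag(d) K diag(d)` with unit row sums, using the standard averaged fixed-point update `d <- sqrt(d / (K d))`. This is a minimal sketch under assumptions, not the authors' implementation: the Gaussian kernel, the bandwidth, and the function name `symmetric_sinkhorn` are chosen for the example, and the paper's operators may include further ingredients (such as point masses) that are omitted here.

```python
import numpy as np

def symmetric_sinkhorn(K, n_iter=200, tol=1e-12):
    """Return d > 0 such that diag(d) @ K @ diag(d) has (approximately) unit row sums.

    K is assumed symmetric with strictly positive entries.
    """
    K = np.asarray(K, dtype=float)
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        # Geometric mean of d and the plain Sinkhorn update 1 / (K d);
        # the averaging avoids the oscillation of the plain update.
        d_new = np.sqrt(d / (K @ d))
        converged = np.max(np.abs(d_new - d)) < tol
        d = d_new
        if converged:
            break
    return d

# Hypothetical example: six points on a line with a Gaussian similarity kernel.
x = np.linspace(0.0, 1.0, 6)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.2 ** 2))

d = symmetric_sinkhorn(K)
S = d[:, None] * K * d[None, :]   # normalized smoothing operator

signal = np.sin(2 * np.pi * x)
smoothed = S @ signal             # one smoothing step
print(S.sum(axis=1))
```

Because `S` is symmetric with unit row sums, it leaves constant signals unchanged, including at the ends of the interval, where a naively row-normalized kernel would lose symmetry.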

[134] OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion

Yunhan Yang,Yufan Zhou,Yuan-Chen Guo,Zi-Xin Zou,Yukun Huang,Ying-Tian Liu,Hao Xu,Ding Liang,Yan-Pei Cao,Xihui Liu

Main category: cs.CV

TL;DR: OmniPart is a novel framework for part-aware 3D object generation that decomposes the task into structure planning and 3D part synthesis, allowing for controllable and editable outputs.

Details Motivation: Most generative methods produce only monolithic 3D shapes, limiting their utility for interactive applications that require editable part structures. Method: OmniPart uses a two-stage approach: (1) an autoregressive structure planning module that generates 3D part bounding boxes using flexible 2D part masks, and (2) a spatially-conditioned rectified flow model that synthesizes all 3D parts simultaneously based on the planned layout. Result: OmniPart achieves high semantic decoupling among components while maintaining structural cohesion, enabling more interpretable, editable, and versatile 3D content with diverse downstream applications. Conclusion: OmniPart enables the generation of 3D objects with explicit, editable part structures, supporting user-defined part granularity and precise localization while achieving state-of-the-art performance. Abstract: The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. 
Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.

[135] Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling

Prahitha Movva,Naga Harshita Marupaka

Main category: cs.CV

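TL;DR: This study addresses visual question answering over scientific figures with an approach based on InternVL3 and ensemble modeling, achieves strong results, and highlights the importance of prompt optimization and chain-of-thought reasoning.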

Details Motivation: Current visual question answering approaches often lack the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning, and maintaining consistency between visual observation and textual reasoning. Method: A series of experiments was conducted using models with 5B to 8B parameters; the strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. An ensemble of multiple vision language models (VLMs) was also developed. Result: Error analysis on the validation split shows that the ensemble approach improved performance over most individual models, though InternVL3 remained the strongest standalone performer. Conclusion: The paper presents an approach to visual question answering over scientific figures and underscores the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving visual question answering ability. Abstract: Technical reports and articles often contain valuable information in the form of semi-structured data such as charts and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble model with multiple vision language models (VLMs). Through error analysis on the validation split, our ensemble approach improved performance compared to most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning and ensemble modeling in improving the model's ability in visual question answering.
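
To make the reported ROUGE-1 F1 metric concrete, here is a minimal sketch of how a unigram-overlap F1 score can be computed for one answer. It assumes lowercase whitespace tokenization, which is a simplification; the shared task's official scorer may tokenize or stem differently, and the function name `rouge1_f1` is illustrative.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate answer and a reference answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped overlap: each unigram counts at most as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the value is 42", "42"))   # verbose answer, short reference
print(rouge1_f1("42", "42"))                # exact match
```

The first call shows why concise answers matter for this metric: extra words lower precision even when the correct value is present.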

[136] Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Aleksandar Jevtić,Christoph Reich,Felix Wimbauer,Oliver Hahn,Christian Rupprecht,Stefan Roth,Daniel Cremers

Main category: cs.CV

TL;DR: SceneDINO is an unsupervised method for semantic scene completion that achieves impressive results without relying on ground-truth annotations.

Details Motivation: Prior work on semantic scene completion (SSC) heavily relies on expensive ground-truth annotations. SceneDINO approaches SSC in an unsupervised setting to eliminate this dependency. Method: SceneDINO utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. It infers 3D geometry and expressive 3D DINO features in a feed-forward manner and employs a novel 3D feature distillation approach for unsupervised 3D semantics. Result: SceneDINO reaches state-of-the-art segmentation accuracy in both 3D and 2D unsupervised scene understanding. Linear probing of its 3D features matches the segmentation accuracy of current supervised SSC approaches. Additionally, it showcases domain generalization and multi-view consistency. Conclusion: SceneDINO takes the first steps towards a strong foundation for single image 3D scene understanding, achieving state-of-the-art segmentation accuracy in both 3D and 2D unsupervised scene understanding. Abstract: Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. 
Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.

[137] RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models

Keyan Chen,Chenyang Liu,Bowen Chen,Jiafan Zhang,Zhengxia Zou,Zhenwei Shi

Main category: cs.CV

TL;DR: RSRefSeg 2 improves remote sensing image segmentation by decoupling the process into two stages: coarse localization and fine segmentation, outperforming existing approaches.

Details Motivation: Current methods have limitations in handling complex semantic relationships and achieving precise cross-modal alignment due to architectural coupling. Method: RSRefSeg 2 uses a collaborative dual-stage framework combining CLIP for coarse localization and SAM for fine segmentation, with a cascaded second-order prompter for better precision. Result: Experiments show RSRefSeg 2 achieves higher segmentation accuracy (+~3% gIoU) and better performance in complex semantic interpretation. Conclusion: RSRefSeg 2 provides improved segmentation accuracy and complex semantic interpretation compared to existing methods. Abstract: Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly utilize a three-stage pipeline encompassing dual-modal encoding, cross-modal interaction, and pixel decoding. These methods demonstrate significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely due to their coupled processing mechanism that conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed by fine segmentation. RSRefSeg 2 integrates CLIP's cross-modal alignment strength with SAM's segmentation generalizability through strategic foundation model collaboration. Specifically, CLIP is employed as the dual-modal encoder to activate target features within its pre-aligned semantic space and generate localization prompts. 
To mitigate CLIP's misactivation challenges in multi-entity scenarios described by referring texts, a cascaded second-order prompter is devised, which enhances precision through implicit reasoning via decomposition of text embeddings into complementary semantic subspaces. These optimized semantic prompts subsequently direct the SAM to generate pixel-level refined masks, thereby completing the semantic transmission pipeline. Extensive experiments (RefSegRS, RRSIS-D, and RISBench) demonstrate that RSRefSeg 2 surpasses contemporary methods in segmentation accuracy (+~3% gIoU) and complex semantic interpretation. Code is available at: https://github.com/KyanChen/RSRefSeg2.

[138] Learning to Track Any Points from Human Motion

Inès Hyeonsu Kim,Seokju Cho,Jahyeok Koo,Junghyun Park,Jiahui Huang,Joon-Young Lee,Seungryong Kim

Main category: cs.CV

TL;DR: AnthroTAP automates the generation of pseudo-labeled training data for point tracking by leveraging the SMPL model, achieving superior performance with minimal resources.

Details Motivation: Human motion provides a rich source of supervision for training robust and generalizable point trackers; however, acquiring extensive training data is difficult because manual annotation is laborious. Method: AnthroTAP fits the SMPL model to detected humans in video frames, projects 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handles occlusions using ray-casting, and filters unreliable tracks based on optical flow consistency. Result: A point tracking model trained on the AnthroTAP dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day on 4 GPUs. Conclusion: The proposed AnthroTAP pipeline effectively generates pseudo-labeled training data for point tracking, achieving state-of-the-art performance with significantly less data and computational resources compared to existing methods. Abstract: Human motion, with its inherent complexities, such as non-rigid deformations, articulated movements, clothing distortions, and frequent occlusions caused by limbs or other individuals, provides a rich and challenging source of supervision that is crucial for training robust and generalizable point trackers. Despite the suitability of human motion, acquiring extensive training data for point tracking remains difficult due to laborious manual annotation. Our proposed pipeline, AnthroTAP, addresses this by automatically generating pseudo-labeled training data, leveraging the Skinned Multi-Person Linear (SMPL) model. We first fit the SMPL model to detected humans in video frames, project the resulting 3D mesh vertices onto 2D image planes to generate pseudo-trajectories, handle occlusions using ray-casting, and filter out unreliable tracks based on optical flow consistency. 
A point tracking model trained on the AnthroTAP-annotated dataset achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing other models trained on real videos while using 10,000 times less data and only 1 day on 4 GPUs, compared to 256 GPUs used in recent state-of-the-art.
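
The projection and flow-consistency steps described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the AnthroTAP code: it assumes mesh vertices are already in camera coordinates, uses a plain pinhole model with a hypothetical intrinsics matrix `K`, takes optical flow already sampled at the track locations, and uses a made-up pixel threshold `max_err_px`. SMPL fitting and ray-casting for occlusion are omitted.

```python
import numpy as np

def project_vertices(vertices_cam, K):
    """Project (T, V, 3) camera-space vertices to (T, V, 2) pixel coordinates."""
    uvw = vertices_cam @ K.T               # homogeneous image coordinates
    return uvw[..., :2] / uvw[..., 2:3]    # perspective divide by depth

def flow_consistent_mask(tracks, flow_at_tracks, max_err_px=2.0):
    """Keep vertices whose frame-to-frame motion agrees with optical flow.

    tracks:         (T, V, 2) pseudo-trajectories in pixels
    flow_at_tracks: (T-1, V, 2) optical flow sampled at tracks[:-1]
    """
    step = tracks[1:] - tracks[:-1]                        # (T-1, V, 2)
    err = np.linalg.norm(step - flow_at_tracks, axis=-1)   # (T-1, V)
    return err.max(axis=0) <= max_err_px                   # (V,)

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Two frames, two vertices; the first vertex moves 0.1 units along x.
verts = np.array([
    [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]],
    [[0.1, 0.0, 2.0], [1.0, 0.0, 2.0]],
])
tracks = project_vertices(verts, K)

# Flow agrees with vertex 0 (25 px to the right) but not with vertex 1.
flow = np.array([[[25.0, 0.0], [10.0, 0.0]]])
keep = flow_consistent_mask(tracks, flow)
print(tracks[0], keep)
```

Filtering per vertex over the whole clip is one simple design choice; a per-frame visibility mask would be a natural alternative when only part of a track is unreliable.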