cs.CL

[1] Truth Sleuth and Trend Bender: AI Agents to fact-check YouTube videos and influence opinions

Logé Cécile,Ghori Rehan

Main category: cs.CL

TL;DR: This paper proposes an AI-based system that combats misinformation on YouTube through fact-checking and user engagement.

Details Motivation: Misinformation poses a significant threat in the digital world and spreads rapidly on platforms such as YouTube, so an effective countermeasure is needed. Method: The system comprises two main agents, Truth Sleuth and Trend Bender. Truth Sleuth uses Retrieval-Augmented Generation (RAG), drawing on sources such as Wikipedia, Google Search, and Google FactCheck, to assess the veracity of claims made in YouTube videos; Trend Bender uses the resulting reports to generate persuasive comments that encourage productive discussion. Result: Experiments show that the fact-checking agent is highly accurate and demonstrate the system's potential to influence user perspectives and combat misinformation. Conclusion: The study confirms the effectiveness of AI-driven interventions in creating a more informed online space and highlights the importance of this approach to combating misinformation. Abstract: Misinformation poses a significant threat in today's digital world, often spreading rapidly through platforms like YouTube. This paper introduces a novel approach to combating misinformation by developing an AI-powered system that not only fact-checks claims made in YouTube videos but also actively engages users in the comment section and challenges misleading narratives. Our system comprises two main agents: Truth Sleuth and Trend Bender. Truth Sleuth extracts claims from a YouTube video, uses a Retrieval-Augmented Generation (RAG) approach - drawing on sources like Wikipedia, Google Search, Google FactCheck - to accurately assess their veracity and generates a nuanced and comprehensive report. Through rigorous prompt engineering, Trend Bender leverages this report along with a curated corpus of relevant articles to generate insightful and persuasive comments designed to stimulate a productive debate. With a carefully set up self-evaluation loop, this agent is able to iteratively improve its style and refine its output. We demonstrate the system's capabilities through experiments on established benchmark datasets and a real-world deployment on YouTube, showcasing its potential to engage users and potentially influence perspectives. Our findings highlight the high accuracy of our fact-checking agent, and confirm the potential of AI-driven interventions in combating misinformation and fostering a more informed online space.

[2] An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation

Vimaleswar A,Prabhu Nandan Sahu,Nilesh Kumar Sahu,Haroon R Lone

Main category: cs.CL

TL;DR: EmoSApp is a fully offline smartphone conversational app for mental health support that uses a fine-tuned, quantized large language model (LLM) to provide personalized support and performs well in low-resource settings.

Details Motivation: Digital platforms for mental health support face challenges in user accessibility, internet connectivity, and data privacy, which calls for an offline, smartphone-based solution. Method: The LLaMA-3.2-1B-Instruct model is fine-tuned, quantized, and deployed on resource-constrained devices using Torchtune and Executorch, and a knowledge dataset of 14,582 mental-health QA pairs is built for training. Result: Qualitative evaluation with a student population shows that EmoSApp can respond coherently and empathetically, sustain interactive dialogue, and offer suggestions; results on nine commonsense and reasoning benchmarks also demonstrate its effectiveness in low-resource settings. Conclusion: By prioritizing on-device deployment and specialized domain adaptation, EmoSApp provides a blueprint for future innovation in portable, secure, and highly tailored AI-driven mental health solutions. Abstract: Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have been increasingly used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for an offline, smartphone-based solution. To address these challenges, we propose EmoSApp (Emotional Support App): an entirely offline, smartphone-based conversational app designed for mental health and emotional support. The system leverages Large Language Models (LLMs), specifically fine-tuned, quantized and deployed using Torchtune and Executorch for resource-constrained devices, allowing all inferences to occur on the smartphone. To equip EmoSApp with robust domain expertise, we fine-tuned the LLaMA-3.2-1B-Instruct model on our custom curated "Knowledge dataset" of 14,582 mental-health QA pairs, along with the multi-turn conversational data. Through qualitative human evaluation with the student population, we demonstrate that EmoSApp has the ability to respond coherently and empathetically, maintain interactive dialogue, and provide relevant suggestions for users' mental health problems. Additionally, quantitative evaluations on nine standard commonsense and reasoning benchmarks demonstrate the efficacy of our fine-tuned, quantized model in low-resource settings. By prioritizing on-device deployment and specialized domain adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health solutions.

[3] Transforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious Analysis

Anders Ledberg,Anna Thalén

Main category: cs.CL

TL;DR: This paper presents a modular toolchain that uses open-weight models to standardize, anonymize, and embed sensitive text for analysis, addressing privacy and heterogeneity problems and supporting large-scale social science research.

Details Motivation: Unstructured text from legal, medical, and administrative sources is an important but underutilized resource for research in public health and the social sciences, yet large-scale analysis is difficult because the text contains sensitive personally identifiable information and is highly heterogeneous in structure and language. Method: Large language model (LLM) prompting is used to standardize, summarize, and, when needed, translate texts into English; anonymization is achieved through LLM-based redaction combined with named entity recognition and rule-based methods; the processed texts are then converted into document-level embeddings. Result: The toolchain removed identifying information from 10,842 Swedish court decisions under the Care of Abusers Act (LVM), totaling over 56,000 pages, while retaining semantic content. A predictive model trained on embeddings derived from a small set of manually labeled summaries demonstrates its capacity for semi-automated content analysis. Conclusion: The toolchain enables structured, privacy-conscious analysis of sensitive documents and opens new possibilities for large-scale research in domains where textual data was previously inaccessible due to privacy and heterogeneity constraints. Abstract: Unstructured text from legal, medical, and administrative sources offers a rich but underutilized resource for research in public health and the social sciences. However, large-scale analysis is hampered by two key challenges: the presence of sensitive, personally identifiable information, and significant heterogeneity in structure and language. We present a modular toolchain that prepares such text data for embedding-based analysis, relying entirely on open-weight models that run on local hardware, requiring only a workstation-level GPU and supporting privacy-sensitive research. The toolchain employs large language model (LLM) prompting to standardize, summarize, and, when needed, translate texts to English for greater comparability. Anonymization is achieved via LLM-based redaction, supplemented with named entity recognition and rule-based methods to minimize the risk of disclosure. We demonstrate the toolchain on a corpus of 10,842 Swedish court decisions under the Care of Abusers Act (LVM), comprising over 56,000 pages. Each document is processed into an anonymized, standardized summary and transformed into a document-level embedding. Validation, including manual review, automated scanning, and predictive evaluation, shows the toolchain effectively removes identifying information while retaining semantic content. As an illustrative application, we train a predictive model using embedding vectors derived from a small set of manually labeled summaries, demonstrating the toolchain's capacity for semi-automated content analysis at scale. By enabling structured, privacy-conscious analysis of sensitive documents, our toolchain opens new possibilities for large-scale research in domains where textual data was previously inaccessible due to privacy and heterogeneity constraints.

[4] A Taxonomy for Design and Evaluation of Prompt-Based Natural Language Explanations

Isar Nejadgholi,Mona Omidyeganeh,Marc-Antoine Drouin,Jonathan Boisvert

Main category: cs.CL

TL;DR: This paper introduces a taxonomy for Natural Language Explanations (NLEs) in Explainable AI (XAI), aimed at improving transparency and governance of AI systems by systematically organizing NLE characteristics across three dimensions.

Details Motivation: With the growing use of large language models, there is a need for effective governance mechanisms, particularly through NLEs, to ensure transparent AI system behavior. This necessitates a systematic understanding of NLE characteristics and their governance implications. Method: The paper draws on existing Explainable AI (XAI) literature to develop an updated taxonomy tailored for prompt-based Natural Language Explanations (NLEs), organized along three dimensions: Context, Generation and Presentation, and Evaluation. Result: An updated XAI taxonomy adapted to prompt-based NLEs was developed, encompassing three key dimensions—Context, Generation and Presentation, and Evaluation—providing a comprehensive framework for stakeholders involved in AI governance. Conclusion: The paper concludes that the proposed taxonomy offers a structured framework for characterizing, designing, and improving NLEs to enhance transparency in AI systems. Abstract: Effective AI governance requires structured approaches for stakeholders to access and verify AI system behavior. With the rise of large language models, Natural Language Explanations (NLEs) are now key to articulating model behavior, which necessitates a focused examination of their characteristics and governance implications. We draw on Explainable AI (XAI) literature to create an updated XAI taxonomy, adapted to prompt-based NLEs, across three dimensions: (1) Context, including task, data, audience, and goals; (2) Generation and Presentation, covering generation methods, inputs, interactivity, outputs, and forms; and (3) Evaluation, focusing on content, presentation, and user-centered properties, as well as the setting of the evaluation. This taxonomy provides a framework for researchers, auditors, and policymakers to characterize, design, and enhance NLEs for transparent AI systems.

[5] AutoRAG-LoRA: Hallucination-Triggered Knowledge Retuning via Lightweight Adapters

Kaushik Dwivedi,Padmanabh Patanjali Mishra

Main category: cs.CL

TL;DR: This paper proposes AutoRAG-LoRA, a framework that significantly reduces hallucinations in large language models by integrating retrieval-augmented generation, lightweight adapters, and KL-regularized training.

Details Motivation: Large Language Models (LLMs) often produce hallucinations—factual inaccuracies that hinder their reliability in real-world applications. This work aims to tackle this issue by grounding model responses in retrieved evidence through a modular framework. Method: The paper introduces AutoRAG-LoRA, a framework that combines Retrieval-Augmented Generation with lightweight LoRA-based adapters and KL-regularized training to address hallucinations in LLMs. It incorporates prompt rewriting, hybrid retrieval, adapter tuning, and a hallucination detection module with a feedback correction loop. Result: AutoRAG-LoRA successfully reduces hallucinations in large language models while preserving their efficiency and modularity. Conclusion: AutoRAG-LoRA is effective in reducing factual drift while maintaining model efficiency and modularity. Abstract: Large Language Models (LLMs) have demonstrated remarkable fluency across a range of natural language tasks, yet remain vulnerable to hallucinations - factual inaccuracies that undermine trust in real world deployment. We present AutoRAG-LoRA, a modular framework for Retrieval-Augmented Generation (RAG) that tackles hallucination in large language models through lightweight LoRA-based adapters and KL-regularized training. Our pipeline integrates automated prompt rewriting, hybrid retrieval, and low-rank adapter tuning to ground responses in retrieved evidence. A hallucination detection module, using both classifier-based and self-evaluation techniques, assigns confidence scores to generated outputs, triggering an optional feedback correction loop. This loop enforces factual alignment via contrastive KL loss and adapter fine tuning. We demonstrate that AutoRAG-LoRA significantly reduces the factual drift while preserving the efficiency and modularity of the model.
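
To make the "KL-regularized training" and "confidence-triggered feedback loop" ideas concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract mentions a contrastive KL loss whose exact form is not given, so the code below uses a plain KL penalty instead. The function names, the direction of the divergence (adapted model relative to base model), the weight `beta`, and the confidence threshold are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(task_loss, adapted_probs, base_probs, beta=0.1):
    """Task loss plus a KL penalty that keeps the adapted model near the base model.

    Assumption: a plain KL penalty stands in for the paper's contrastive KL loss.
    """
    return task_loss + beta * kl_divergence(adapted_probs, base_probs)

def needs_correction(confidence, threshold=0.5):
    """Hypothetical trigger: run the feedback loop only for low-confidence outputs."""
    return confidence < threshold

# Toy next-token distributions over a three-word vocabulary.
base = [0.7, 0.2, 0.1]
adapted = [0.5, 0.4, 0.1]
loss = regularized_loss(task_loss=1.0, adapted_probs=adapted, base_probs=base)
```

The penalty is zero when the adapter leaves the distribution unchanged and grows as the adapted model drifts away from the base model, which is the usual motivation for regularizing adapter updates this way.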

[6] Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing

Dennis Ulmer,Alexandra Lorson,Ivan Titov,Christian Hardmeier

Main category: cs.CL

TL;DR: This paper explores how emulating human ways of expressing uncertainty can strengthen trust in human-machine interaction, notes that current language models tend to be overconfident, and recommends more authentic linguistic expression to improve user trust.

Details Motivation: Output from current large language models (LLMs) is often stated overconfidently even when its accuracy is questionable, which undermines trustworthiness. A way of signaling model confidence is therefore needed to realize the benefits of human-machine collaboration and mitigate potential harms. Method: The paper reviews research on human uncertainty communication, surveys ongoing research, and performs additional analyses to reveal overlooked biases in verbalized uncertainty. Result: The paper provides a thorough overview of human uncertainty communication, reveals data biases in verbalized uncertainty, and proposes the concept of "anthropomimetic uncertainty". Conclusion: The paper concludes that uncertainty communication involves factors unique to human-machine communication, proposes achieving linguistic authenticity and personalization by emulating human communication, and points out future research directions for NLP. Abstract: Human users increasingly rely on natural language interactions with large language models (LLMs) in order to receive help on a large variety of tasks and problems. However, the trustworthiness and perceived legitimacy of LLMs are undermined by the fact that their output is frequently stated in very confident terms, even when its accuracy is questionable. Therefore, there is a need to signal the confidence of the language model to a user in order to reap the benefits of human-machine collaboration and mitigate potential harms. Verbalized uncertainty is the expression of confidence with linguistic means, an approach that integrates perfectly into language-based interfaces. Nevertheless, most recent research in natural language processing (NLP) overlooks the nuances surrounding human uncertainty communication and the data biases that influence machine uncertainty communication. We argue for anthropomimetic uncertainty, meaning that intuitive and trustworthy uncertainty communication requires a degree of linguistic authenticity and personalization to the user, which could be achieved by emulating human communication. We present a thorough overview of the research in human uncertainty communication, survey ongoing research, and perform additional analyses to demonstrate so-far overlooked biases in verbalized uncertainty. We conclude by pointing out unique factors in human-machine communication of uncertainty and deconstruct anthropomimetic uncertainty into future research directions for NLP.

[7] PLEX: Perturbation-free Local Explanations for LLM-Based Text Classification

Yogachandran Rahulamathavan,Misbah Farooq,Varuna De Silva

Main category: cs.CL

TL;DR: PLEX is a new, efficient method for explaining LLM-based text classification that avoids the computational burden of traditional perturbation-based approaches like LIME and SHAP, while maintaining accuracy in identifying influential words.

Details Motivation: Current XAI methods like LIME and SHAP are computationally expensive due to their reliance on thousands of perturbed sentences, which is especially burdensome with LLMs. Method: PLEX uses contextual embeddings from LLMs and a Siamese network trained to align with feature importance scores, eliminating the need for computationally expensive perturbations. Result: PLEX demonstrates over 92% agreement with LIME and SHAP, accurately identifies influential words, maintains classification accuracy when those words are removed, and reduces explanation time and computational overhead by two and four orders of magnitude, respectively. Conclusion: PLEX provides a promising solution for explainable LLM-based text classification by offering efficient and accurate local explanations without perturbations. Abstract: Large Language Models (LLMs) excel in text classification, but their complexity hinders interpretability, making it difficult to understand the reasoning behind their predictions. Explainable AI (XAI) methods like LIME and SHAP offer local explanations by identifying influential words, but they rely on computationally expensive perturbations. These methods typically generate thousands of perturbed sentences and perform inferences on each, incurring a substantial computational burden, especially with LLMs. To address this, we propose Perturbation-free Local Explanation (PLEX), a novel method that leverages the contextual embeddings extracted from the LLM and a "Siamese network" style neural network trained to align with feature importance scores. This one-off training eliminates the need for subsequent perturbations, enabling efficient explanations for any new sentence. We demonstrate PLEX's effectiveness on four different classification tasks (sentiment, fake news, fake COVID-19 news and depression), showing more than 92% agreement with LIME and SHAP. Our evaluation using a "stress test" reveals that PLEX accurately identifies influential words, leading to a similar decline in classification accuracy as observed with LIME and SHAP when these words are removed. Notably, in some cases, PLEX demonstrates superior performance in capturing the impact of key features. PLEX dramatically accelerates explanation, reducing time and computational overhead by two and four orders of magnitude, respectively. This work offers a promising solution for explainable LLM-based text classification.
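
The abstract reports "agreement" between explanation methods without defining it. One plausible reading, sketched below in Python, is the overlap between the top-k most influential words under two sets of importance scores. This is an assumption for illustration only: the function names, the choice of k, and the toy scores are hypothetical and do not come from the paper.

```python
def top_k_words(scores, k):
    """Return the set of the k words with the largest absolute importance."""
    ranked = sorted(scores, key=lambda w: abs(scores[w]), reverse=True)
    return set(ranked[:k])

def top_k_agreement(scores_a, scores_b, k=3):
    """Share of the top-k words under scores_a that are also top-k under scores_b."""
    top_a = top_k_words(scores_a, k)
    top_b = top_k_words(scores_b, k)
    return len(top_a & top_b) / max(len(top_a), 1)

# Made-up word importances for the sentence "the movie plot was terrible".
method_a = {"terrible": 0.9, "plot": 0.4, "movie": 0.1, "the": 0.01}
method_b = {"terrible": 0.8, "movie": 0.3, "plot": 0.2, "the": 0.02}

agreement_top1 = top_k_agreement(method_a, method_b, k=1)  # both rank "terrible" first
agreement_top2 = top_k_agreement(method_a, method_b, k=2)  # second word differs
```

A set-overlap measure like this ignores the ordering inside the top-k set; a rank correlation would be a stricter alternative.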

[8] Emergence of Hierarchical Emotion Organization in Large Language Models

Bo Zhao,Maya Okawa,Eric J. Bigelow,Rose Yu,Tomer Ullman,Ekdeep Singh Lubana,Hidenori Tanaka

Main category: cs.CL

TL;DR: This research reveals that large language models (LLMs) exhibit emotional reasoning similar to human psychology but show biases in recognizing emotions among underrepresented groups, highlighting both their potential and challenges in ethical deployment.

Details Motivation: As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Method: Inspired by emotion wheels, a psychological framework arguing emotions organize hierarchically, the researchers analyzed probabilistic dependencies between emotional states in model outputs. They also conducted human studies to compare findings. Result: The study found that LLMs naturally form hierarchical emotion trees aligned with human psychological models, with larger models developing more complex hierarchies. It also uncovered systematic biases in emotion recognition across socioeconomic personas, particularly affecting intersectional, underrepresented groups. Conclusion: The study concludes that LLMs demonstrate emergent emotional reasoning and internalize aspects of human social perception, suggesting the potential for using cognitive theories in developing better model evaluations. Abstract: As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels -- a psychological framework that argues emotions organize hierarchically -- we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

[9] Language Models for Adult Service Website Text Analysis

Nickolas Freeman,Thanh Nguyen,Gregory Bott,Jason Parton,Collin Francel

Main category: cs.CL

TL;DR: The paper proposes efficient custom transformer models for analyzing complex adult service website (ASW) text data, significantly improving performance over existing methods and enabling practical applications in sex trafficking investigations.

Details Motivation: ASW ad text is crucial for identifying potential sex trafficking victims, but its complexity, including emojis, poor grammar, and deliberate obfuscation, poses significant analytical challenges. Method: The researchers evaluated various language modeling approaches, including information retrieval methods and pre-trained as well as custom transformer models, tailored to the unique characteristics of ASW text data. Result: Custom transformer models outperformed fine-tuned variants of BERT-base, RoBERTa, and ModernBERT on key performance metrics and were efficiently trained and used for inference. Conclusion: The study concludes that custom transformer models offer significant improvements in analyzing ASW ad text, enabling efficient processing and actionable insights for combating sex trafficking. Abstract: Sex trafficking refers to the use of force, fraud, or coercion to compel an individual to perform commercial sex acts against their will. Adult service websites (ASWs) have been and continue to be linked to sex trafficking, offering a platform for traffickers to advertise their victims. Thus, organizations involved in the fight against sex trafficking often use ASW data when attempting to identify potential sex trafficking victims. A critical challenge in transforming ASW data into actionable insight is text analysis. Previous research using ASW data has shown that ASW ad text is important for linking ads. However, working with this text is challenging due to its extensive use of emojis, poor grammar, and deliberate obfuscation to evade law enforcement scrutiny. We conduct a comprehensive study of language modeling approaches for this application area, including simple information retrieval methods, pre-trained transformers, and custom transformer models. We demonstrate that characteristics of ASW text data allow efficient custom transformer models to be trained with relatively small GPU resources and used efficiently for inference on consumer hardware. Our custom models outperform fine-tuned variants of well-known encoder-only transformer models, including BERT-base, RoBERTa, and ModernBERT, on accuracy, recall, F1 score, and ROC AUC. We demonstrate the use of our best-performing custom configuration on three tasks related to ASW data analysis: (i) decomposing the giant component in a graph representation of ASW data, (ii) clustering ASW ad text, and (iii) using the learned token embeddings to understand the use of emojis in the illicit context we study. The models we develop represent a significant advancement in ASW text analysis, which can be leveraged in a variety of downstream applications and research.

[10] Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs

Michal Podstawski

Main category: cs.CL

TL;DR: This study employs pretrained text embedding models to enhance semantic analysis in labeled property graphs, improving tasks such as node classification and relation prediction by leveraging textual semantics.

Details Motivation: Labeled property graphs often contain rich textual attributes that can improve analytical tasks when effectively leveraged for enhanced contextual understanding. Method: This work utilizes pretrained text embedding models to embed textual node and edge properties, enabling efficient semantic analysis within labeled property graphs. Result: The approach supports downstream tasks like node classification and relation prediction with improved contextual insights derived from textual semantics. Conclusion: The integration of language model embeddings into the graph pipeline enhances the accuracy and interpretability of property graph analysis without altering its structure. Abstract: Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.
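
As a rough illustration of embedding textual node properties for node classification, the Python sketch below assigns a node the label of its most similar labeled neighbor in embedding space. The `embed` function is a toy bag-of-words stand-in, not a pretrained text embedding model, and the node schema (`description`, `label`) is a hypothetical example rather than anything specified in the abstract.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a pretrained embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify_node(node, labeled_nodes):
    """Assign the label of the most similar labeled node (1-nearest-neighbor)."""
    vec = embed(node["description"])
    best = max(labeled_nodes, key=lambda n: cosine(vec, embed(n["description"])))
    return best["label"]

# Hypothetical nodes; only their textual property is used, the graph is untouched.
labeled = [
    {"description": "relational database engine for transactions", "label": "software"},
    {"description": "professor of graph theory and algorithms", "label": "person"},
]
query = {"description": "open source database engine"}
predicted = classify_node(query, labeled)
```

In practice `embed` would be replaced by a call to whichever pretrained embedding model is in use; the rest of the logic only needs vectors and a similarity function, which is why the graph structure itself does not have to change.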

[11] Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Yilun Zhao,Chengye Wang,Chuhan Li,Arman Cohan

Main category: cs.CL

TL;DR: This paper introduces MISS-QA, a benchmark for evaluating multimodal models' ability to interpret schematic diagrams in scientific papers, revealing a significant gap between model and human performance.

Details Motivation: The motivation is to specifically evaluate and improve models' ability to interpret schematic diagrams in scientific literature context. Method: The paper introduces MISS-QA, a benchmark with 1,500 expert-annotated examples from 465 scientific papers, used to evaluate 18 multimodal models including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. Result: Current models show a notable performance gap compared to human experts on MISS-QA, with strengths and limitations identified through error analysis and handling of unanswerable questions. Conclusion: MISS-QA highlights a significant performance gap between current multimodal models and human experts in interpreting schematic diagrams within scientific literature. Abstract: This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.

[12] Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler

David M. Markowitz,Samuel Hardman Taylor

Main category: cs.CL

TL;DR: The study found that social approval does not consistently lead to more hate speech on niche platforms like Parler.

Details Motivation: To explore how social approval motivates online hate in the context of Walther's theory. Method: Analysis of over 110 million posts from Parler (2018-2021) to test Walther's theory on social approval and hate speech. Result: Upvotes on hate speech posts were not associated with increased hate speech in subsequent posts across multiple time intervals; a negative relationship was observed at the post level. Conclusion: Social approval reinforcement mechanisms may function differently on niche social media platforms, and there is no direct correlation between social approval and production of hate speech. Abstract: In this paper, we explored how online hate is motivated by receiving social approval from others. We specifically examined two central tenets of Walther's (2024) social approval theory of online hate: (H1a) more signals of social approval on hate messages predicts more subsequent hate messages, and (H1b) as social approval increases, hate speech messages become more extreme. Using over 110 million posts from Parler (2018-2021), we observed that the number of upvotes a person received on a hate speech post was unassociated with the amount of hate speech in their next post and posts during the next week, month, three months, and six months. Between-person effects revealed an average negative relationship between social approval and hate speech production at the post level, but this relationship was mixed at other time intervals. Social approval reinforcement mechanisms of online hate may operate differently on niche social media platforms.

[13] LLMs on Trial: Evaluating Judicial Fairness for Large Language Models

Yiran Hu,Zongyue Xue,Haitao Li,Siyuan Zheng,Qingjing Chen,Shaochun Wang,Xihan Zhang,Ning Zheng,Yun Liu,Qingyao Ai,Yiqun Liu,Charles L. A. Clarke,Weixing Shen

Main category: cs.CL

TL;DR: This paper explores the judicial fairness of Large Language Models (LLMs) in high-stakes applications, revealing significant inconsistencies, biases, and inaccuracies, particularly in relation to demographic labels. The study introduces a framework and dataset for evaluating LLM fairness and provides insights into mitigating these issues.

Details Motivation: As LLMs are increasingly used in high-stakes fields impacting rights and equity, it's essential to evaluate their judicial fairness and implications for social justice. Ensuring fair decision-making by LLMs is crucial for trustworthiness. Method: Based on theories of judicial fairness, a framework was created to measure LLM fairness, involving 65 labels and 161 values. An extensive dataset named JudiFair was compiled with 177,100 unique case facts. Three evaluation metrics (inconsistency, bias, and imbalanced inaccuracy) were developed, along with a method for assessing overall fairness across multiple LLMs. Result: Experiments with 16 LLMs revealed pervasive inconsistency, bias, and imbalanced inaccuracy, indicating severe judicial unfairness. Biases were more pronounced on demographic labels. Increased inconsistency correlated with reduced biases, but higher prediction accuracy increased biases. Temperature adjustments affected fairness, but other factors like model size, release date, and country of origin did not show significant effects. Conclusion: The study concludes that there is significant judicial unfairness in LLMs, with inconsistencies, biases, and imbalanced inaccuracies being prominent. Demographic labels are more biased than substance or procedure labels. Adjusting temperature parameters can influence fairness, while model size, release date, and origin do not significantly impact fairness. Abstract: Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs' judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.
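
The abstract names an "inconsistency" metric but does not define it. A simplified reading, sketched below in Python, is the share of cases whose predicted outcome changes when only the value of one label (for example, a demographic attribute) is varied while the case facts stay fixed. The record layout, function name, and toy data are assumptions for illustration and are not taken from the JudiFair toolkit.

```python
from collections import defaultdict

def inconsistency_rate(records):
    """Share of cases whose prediction changes across values of a single label.

    records: iterable of (case_id, label_value, prediction) triples, where each
    case_id appears once per label value with otherwise identical case facts.
    """
    predictions_by_case = defaultdict(set)
    for case_id, _label_value, prediction in records:
        predictions_by_case[case_id].add(prediction)
    if not predictions_by_case:
        return 0.0
    changed = sum(1 for preds in predictions_by_case.values() if len(preds) > 1)
    return changed / len(predictions_by_case)

# Made-up model outputs (sentence length in months) for two counterfactual cases.
records = [
    ("case-1", "male", 36), ("case-1", "female", 36),  # stable prediction
    ("case-2", "male", 24), ("case-2", "female", 30),  # prediction changes
]
rate = inconsistency_rate(records)
```

A rate of zero would mean the varied label never changes the outcome. Measuring bias would additionally require the direction and size of the change, which this sketch does not capture.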

[14] How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations

Ikumi Numaya,Shoji Moriya,Shiki Sato,Reina Akama,Jun Suzuki

Main category: cs.CL

TL;DR: This paper introduces a new dataset that distinguishes users' subjective stylistic similarity from third-party objective stylistic similarity and finds that the two relate differently to user preference.

Details Motivation: Although prior work has shown that stylistic similarity between user and system can improve user impressions, few studies have examined the distinction between subjective and objective similarity. Method: The authors construct and analyze a new dataset containing user preferences, subjective stylistic similarity, and objective stylistic similarity assessed by third parties. Result: The study finds a significant positive correlation between subjective stylistic similarity and user preference, and shows that users' subjective stylistic similarity differs from third-party objective similarity. Conclusion: The authors emphasize the importance of distinguishing between subjective and objective evaluations, point out the distinct aspects each captures, and provide a new dataset for research in open-domain dialogue settings. Abstract: Recent advancements in dialogue generation have broadened the scope of human-bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users' preferences, subjective stylistic similarity based on users' own perceptions, and objective stylistic similarity annotated by third party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users' subjective stylistic similarity differs from third party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences. The dataset presented in this paper is available online.

[15] HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training

Seungho Choi

Main category: cs.CL

TL;DR: HanjaBridge enhances Korean language understanding in large language models by integrating Hanja ambiguity resolution during pre-training, leading to improved performance and cross-lingual transfer without increasing runtime cost.

Details Motivation: Large language models perform poorly in low-resource languages like Korean due to linguistic challenges such as homophonous Sino-Korean words indistinguishable in Hangul script. This work aims to address the semantic ambiguity caused by these challenges. Method: HanjaBridge uses a continual pre-training framework that presents all possible Hanja candidates for homographs, encouraging contextual disambiguation, and incorporates token-level knowledge distillation to prevent catastrophic forgetting. Result: HanjaBridge achieves a 21% relative improvement on the KoBALT benchmark, demonstrates strong cross-lingual transfer between Korean and Chinese, and maintains performance gains without requiring Hanja augmentation during inference. Conclusion: HanjaBridge improves Korean language understanding by addressing semantic ambiguity through Hanja candidates and prevents forgetting with knowledge distillation, achieving significant performance improvement on the KoBALT benchmark and enabling cross-lingual transfer. Abstract: Large language models (LLMs) often show poor performance in low-resource languages like Korean, partly due to unique linguistic challenges such as homophonous Sino-Korean words that are indistinguishable in Hangul script. To address this semantic ambiguity, we propose HanjaBridge, a novel meaning-injection technique integrated into a continual pre-training (CPT) framework. Instead of deterministically mapping a word to a single Hanja (Chinese character), HanjaBridge presents the model with all possible Hanja candidates for a given homograph, encouraging the model to learn contextual disambiguation. This process is paired with token-level knowledge distillation to prevent catastrophic forgetting. Experimental results show that HanjaBridge significantly improves Korean language understanding, achieving a 21% relative improvement on the KoBALT benchmark. 
Notably, by reinforcing semantic alignment between Korean and Chinese through shared Hanja, we observe a strong positive cross-lingual transfer. Furthermore, these gains persist even when Hanja augmentation is omitted at inference time, ensuring practical efficiency with no additional run-time cost.
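
The idea of presenting every Hanja candidate for a homograph, rather than committing to one, can be illustrated with a toy sketch. The lexicon `HANJA_CANDIDATES`, the function `augment`, and the inline `word(c1|c2)` format are assumptions made for illustration only; the abstract does not specify the paper's lexicon or augmentation format.

```python
# Toy illustration of presenting all Hanja candidates for a homograph.
# HANJA_CANDIDATES and the "word(c1|c2)" format are hypothetical; the
# paper's actual lexicon and augmentation format are not given here.
HANJA_CANDIDATES = {
    "사기": ["詐欺", "士氣", "史記"],  # fraud, morale, historical records
    "의사": ["醫師", "意思"],          # doctor, intention
}


def augment(sentence: str) -> str:
    """Append every Hanja candidate after each known homograph token.

    Matching is by exact whitespace-separated token, so words carrying
    particles are left untouched; a real system would likely need
    morphological analysis first.
    """
    out = []
    for token in sentence.split():
        candidates = HANJA_CANDIDATES.get(token)
        if candidates:
            out.append(f"{token}({'|'.join(candidates)})")
        else:
            out.append(token)
    return " ".join(out)


print(augment("의사 표현"))  # 의사(醫師|意思) 표현
```

Because all candidates are kept, the choice between them is left to the model and its context rather than to a deterministic preprocessing step.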

[16] Modeling Understanding of Story-Based Analogies Using Large Language Models

Kalit Inani,Keshav Kabra,Vijay Marupudi,Sashank Varma

Main category: cs.CL

TL;DR: This study compares LLMs and human reasoning in analogy tasks, finding that while LLMs can extract similarities, they still fall short of human-like reasoning.

Details Motivation: To understand how closely LLMs align with human reasoning abilities in analogy detection and mapping tasks. Method: The study evaluated LLMs' ability to detect and map analogies using sentence embeddings and explicit prompting. It compared model performance across architectures (GPT-4, LLaMA3) and sizes (8B vs. 70B). Result: LLMs captured similarity between source and target texts but struggled with robust reasoning, showing differences from human performance profiles. Conclusion: LLMs show potential in analogical reasoning and semantic representation but still lack robust human-like reasoning. Abstract: Recent advancements in Large Language Models (LLMs) have brought them closer to matching human cognition across a variety of tasks. How well do these models align with human performance in detecting and mapping analogies? Prior research has shown that LLMs can extract similarities from analogy problems but lack robust human-like reasoning. Building on Webb, Holyoak, and Lu (2023), the current study focused on a story-based analogical mapping task and conducted a fine-grained evaluation of LLM reasoning abilities compared to human performance. First, it explored the semantic representation of analogies in LLMs, using sentence embeddings to assess whether they capture the similarity between the source and target texts of an analogy, and the dissimilarity between the source and distractor texts. Second, it investigated the effectiveness of explicitly prompting LLMs to explain analogies. Throughout, we examine whether LLMs exhibit similar performance profiles to those observed in humans by evaluating their reasoning at the level of individual analogies, and not just at the level of overall accuracy (as prior studies have done). Our experiments include evaluating the impact of model size (8B vs. 70B parameters) and performance variation across state-of-the-art model architectures such as GPT-4 and LLaMA3. 
This work advances our understanding of the analogical reasoning abilities of LLMs and their potential as models of human reasoning.
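
The embedding-based check described in the abstract (source closer to target than to distractor) can be sketched as follows. The vectors below are made-up toy values, and `prefers_target` is a hypothetical helper; the study's actual embedding model and scoring procedure are not stated in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prefers_target(source, target, distractor):
    """True if the source embedding is closer to the target than to the distractor."""
    return cosine(source, target) > cosine(source, distractor)


# Toy 3-dimensional "embeddings" (assumed values, not real model output).
source = [1.0, 0.0, 1.0]
target = [0.9, 0.1, 0.8]
distractor = [0.0, 1.0, 0.2]
print(prefers_target(source, target, distractor))  # True
```

Running this per analogy, rather than averaging over the dataset, matches the abstract's emphasis on evaluating individual items.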

[17] DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models

Anthony Miyaguchi,David Guecha,Yuwen Chiu,Sidharth Gaur

Main category: cs.CL

TL;DR: The DS@GT team used prompt-engineering with LLMs to detect depression in the eRisk 2025 challenges, achieving strong performance despite the lack of ground-truth labels.

Details Motivation: To detect depression through conversational cues using LLMs where ground-truth labels were unavailable. Method: Used prompt-engineering with large language models (LLMs) to conduct BDI-II-based assessments and generate structured JSON outputs. Result: Best submission ranked second with DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27. Conclusion: The DS@GT team successfully applied a prompt-engineering strategy for depression detection in the eRisk 2025 challenges, achieving notable results on the leaderboard. Abstract: This Working Note summarizes the participation of the DS@GT team in two eRisk 2025 challenges. For the Pilot Task on conversational depression detection with large language models (LLMs), we adopted a prompt-engineering strategy in which diverse LLMs conducted BDI-II-based assessments and produced structured JSON outputs. Because ground-truth labels were unavailable, we evaluated cross-model agreement and internal consistency. Our prompt design methodology aligned model outputs with BDI-II criteria and enabled the analysis of conversational cues that influenced the prediction of symptoms. Our best submission, second on the official leaderboard, achieved DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27.
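
One simple way to quantify cross-model agreement when no ground truth exists is the average fraction of items on which each pair of models gives the same score. The sketch below assumes each model returns a flat JSON object mapping item names to integer scores; that schema and the function `pairwise_agreement` are hypothetical, and the team's actual agreement measure is not described in the abstract.

```python
import json
from itertools import combinations


def pairwise_agreement(raw_outputs):
    """Mean fraction of shared items scored identically, over all model pairs.

    raw_outputs: list of JSON strings, one per model, each an object
    mapping an item name to a score (assumed schema).
    """
    scores = [json.loads(raw) for raw in raw_outputs]
    pairs = list(combinations(scores, 2))
    if not pairs:
        return 1.0  # a single model trivially agrees with itself
    total = 0.0
    for a, b in pairs:
        shared = a.keys() & b.keys()
        if shared:
            total += sum(a[k] == b[k] for k in shared) / len(shared)
    return total / len(pairs)


outputs = [
    '{"sadness": 1, "pessimism": 0}',
    '{"sadness": 1, "pessimism": 2}',
]
print(pairwise_agreement(outputs))  # 0.5
```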

[18] Teach Me Sign: Stepwise Prompting LLM for Sign Language Production

Zhaoyi An,Rei Kawakami

Main category: cs.CL

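TL;DR: This paper proposes TEAM-Sign, which addresses the complexity and differing rules of sign language production by fine-tuning a large language model with a stepwise prompting strategy, and reports good results.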

Details Motivation: Because of the complexity and unique rules of sign language, large language models have seen limited use in sign language production, so a new approach is proposed to address this. Method: Fine-tune a large language model and use a stepwise prompting strategy to extract sign language knowledge, supporting the learning of text-sign correspondence and the generation process. Result: Experiments on the How2Sign and Phoenix14T datasets show that the method effectively aligns the different distributions and grammatical rules of sign and spoken language. Conclusion: TEAM-Sign makes effective use of the LLM's reasoning ability and sign language knowledge to improve sign language production. Abstract: Large language models, with their strong reasoning ability and rich knowledge, have revolutionized many AI tasks, but their impact on sign language generation remains limited due to its complexity and unique rules. In this paper, we propose TEAch Me Sign (TEAM-Sign), treating sign language as another natural language. By fine-tuning an LLM, we enable it to learn the correspondence between text and sign language, and facilitate generation. Considering the differences between sign and spoken language, we employ a stepwise prompting strategy to extract the inherent sign language knowledge within the LLM, thereby supporting the learning and generation process. Experimental results on the How2Sign and Phoenix14T datasets demonstrate that our approach effectively leverages both the sign language knowledge and reasoning capabilities of the LLM to align the different distributions and grammatical rules of sign and spoken language.

[19] Mario at EXIST 2025: A Simple Gateway to Effective Multilingual Sexism Detection

Lin Tian,Johanne R. Trippas,Marian-Andrei Rizoiu

Main category: cs.CL

TL;DR: This paper proposes a hierarchical LoRA adaptation method for efficient multilingual sexism detection in tweets, achieving strong performance with minimal computational cost.

Details Motivation: The motivation behind this research is to address text-based sexism detection in a multilingual context efficiently. Traditional approaches often involve complex data processing or ensemble techniques, which are computationally expensive. The authors aim to show that a parameter-efficient fine-tuning strategy can yield strong performance while minimizing computational costs. Method: The method involves hierarchical Low-Rank Adaptation (LoRA) of Llama 3.1 8B, introducing conditional adapter routing to model label dependencies across three subtasks: binary sexism identification, source intention detection, and multilabel sexism categorization. Adaptation is applied to all linear transformations rather than just attention layers, and separate LoRA adapters are trained for each subtask using unified multilingual training. Result: The approach achieved F1 improvements of 1.7-2.4% through cross-lingual transfer, reduced training time by 75%, and model storage by 98%. Performance metrics include ICM-Hard scores of 0.6774 for binary classification, 0.4991 for intention detection, and 0.6519 for multilabel categorization. Conclusion: The paper concludes that the hierarchical LoRA adaptation approach, combined with multilingual training, is effective for sexism detection in English and Spanish tweets. It demonstrates that parameter-efficient methods can achieve competitive performance while significantly reducing training time and storage requirements. Abstract: This paper presents our approach to EXIST 2025 Task 1, addressing text-based sexism detection in English and Spanish tweets through hierarchical Low-Rank Adaptation (LoRA) of Llama 3.1 8B. Our method introduces conditional adapter routing that explicitly models label dependencies across three hierarchically structured subtasks: binary sexism identification, source intention detection, and multilabel sexism categorization. 
Unlike conventional LoRA applications that target only attention layers, we apply adaptation to all linear transformations, enhancing the model's capacity to capture task-specific patterns. In contrast to complex data processing and ensemble approaches, we show that straightforward parameter-efficient fine-tuning achieves strong performance. We train separate LoRA adapters (rank=16, QLoRA 4-bit) for each subtask using unified multilingual training that leverages Llama 3.1's native bilingual capabilities. The method requires minimal preprocessing and uses standard supervised learning. Our multilingual training strategy eliminates the need for separate language-specific models, achieving 1.7-2.4% F1 improvements through cross-lingual transfer. With only 1.67% trainable parameters compared to full fine-tuning, our approach reduces training time by 75% and model storage by 98%, while achieving competitive performance across all subtasks (ICM-Hard: 0.6774 for binary classification, 0.4991 for intention detection, 0.6519 for multilabel categorization).
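
One plausible reading of "conditional adapter routing" over hierarchically structured subtasks is that the downstream adapters run only when the binary adapter flags a tweet as sexist. The sketch below illustrates that control flow with plain callables standing in for the three adapters. The function `route`, the stub predictors, and the placeholder label strings are all assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List


def route(
    tweet: str,
    is_sexist: Callable[[str], bool],
    intention: Callable[[str], str],
    categories: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Run the binary predictor first; call the other two only on positives."""
    result: Dict[str, object] = {"sexist": is_sexist(tweet), "intention": None, "categories": []}
    if result["sexist"]:
        result["intention"] = intention(tweet)
        result["categories"] = categories(tweet)
    return result


# Stub predictors standing in for the per-subtask adapters (hypothetical).
def stub_binary(text: str) -> bool:
    return "flagged" in text


def stub_intention(text: str) -> str:
    return "placeholder-intention"


def stub_categories(text: str) -> List[str]:
    return ["placeholder-category"]


print(route("a flagged example", stub_binary, stub_intention, stub_categories))
print(route("a neutral example", stub_binary, stub_intention, stub_categories))
```

Gating the later subtasks on the first keeps their labels consistent with the binary decision, which is one way to model the label dependencies the abstract mentions.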

[20] Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification

Yejun Yoon,Jaeyoon Jung,Seunghyun Yoon,Kunwoo Park

Main category: cs.CL

TL;DR: This paper introduces HerO 2, an enhanced system for fact verification that improves evidence quality, optimizes veracity prediction, and integrates updated language models.

Details Motivation: To improve upon the previous best-performing open-source model for fact verification, HerO, by enhancing efficiency and performance for real-world applications. Method: HerO 2 improves evidence quality through document summarization and answer reformulation, uses post-training quantization to optimize veracity prediction under computational constraints, and incorporates updated language model backbones. Result: HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems in the AVeriTeC shared task. Conclusion: HerO 2 demonstrates high efficiency and strong potential for real-world fact verification, combining improved performance with faster processing times. Abstract: This paper presents HerO 2, Team HUMANE's system for the AVeriTeC shared task at the FEVER-25 workshop. HerO 2 is an enhanced version of HerO, the best-performing open-source model from the previous year's challenge. It improves evidence quality through document summarization and answer reformulation, optimizes veracity prediction via post-training quantization under computational constraints, and enhances overall system performance by integrating updated language model (LM) backbones. HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems, demonstrating both high efficiency and strong potential for real-world fact verification. The code is available at https://github.com/ssu-humane/HerO2.

[21] Journalism-Guided Agentic In-Context Learning for News Stance Detection

Dahyun Lee,Jonghyeon Choi,Jiyoung Han,Kunwoo Park

Main category: cs.CL

TL;DR: This paper introduces a new Korean dataset for stance detection and proposes the JoA-ICL framework, which improves article-level stance detection for long-form news, promoting viewpoint diversity and uncovering media bias.

Details Motivation: Personalized recommendation systems in digital journalism risk reinforcing filter bubbles and political polarization by lacking diverse perspectives. Stance detection can help mitigate this issue; however, existing research is limited to short texts and high-resource languages. Method: The authors introduced K-News-Stance, the first Korean dataset for article-level stance detection, and proposed the JoA-ICL framework which uses a language model agent to predict stances of key structural segments, aggregating them to infer the overall article stance. Result: Experiments showed that JoA-ICL outperforms existing stance detection methods, emphasizing the benefits of segment-level analysis in capturing the overall position of long-form news articles. Conclusion: The proposed JoA-ICL framework demonstrates superior performance in article-level stance detection for long-form news articles, enabling viewpoint-aware recommendations and uncovering patterns of media bias. Abstract: As online news consumption grows, personalized recommendation systems have become integral to digital journalism. However, these systems risk reinforcing filter bubbles and political polarization by failing to incorporate diverse perspectives. Stance detection -- identifying a text's position on a target -- can help mitigate this by enabling viewpoint-aware recommendations and data-driven analyses of media bias. Yet, existing stance detection research remains largely limited to short texts and high-resource languages. To address these gaps, we introduce K-News-Stance, the first Korean dataset for article-level stance detection, comprising 2,000 news articles with article-level and 19,650 segment-level stance annotations across 47 societal issues. 
We also propose JoA-ICL, a Journalism-guided Agentic In-Context Learning framework that employs a language model agent to predict the stances of key structural segments (e.g., leads, quotes), which are then aggregated to infer the overall article stance. Experiments show that JoA-ICL outperforms existing stance detection methods, highlighting the benefits of segment-level agency in capturing the overall position of long-form news articles. Two case studies further demonstrate its broader utility in promoting viewpoint diversity in news recommendations and uncovering patterns of media bias.
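
The segment-then-aggregate idea can be sketched with a simple majority vote over per-segment stance labels. The abstract does not say how JoA-ICL aggregates segment predictions, so the majority rule, the tie-breaking to "neutral", and the label names below are assumptions chosen for illustration.

```python
from collections import Counter


def aggregate_stance(segment_stances):
    """Majority vote over segment-level stance labels.

    Returns "neutral" for an empty input or when the top two labels tie
    (an assumed tie-breaking rule, not taken from the paper).
    """
    if not segment_stances:
        return "neutral"
    top = Counter(segment_stances).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return "neutral"
    return top[0][0]


# Hypothetical labels for a lead and two quotes.
print(aggregate_stance(["supportive", "supportive", "oppositional"]))  # supportive
print(aggregate_stance(["supportive", "oppositional"]))                # neutral
```

A weighted vote, for example giving the lead more weight than quotes, would be a natural variant, though whether the paper does this is not stated here.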

[22] LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP

Haowei Yang,Ziyu Shen,Junli Shao,Luyao Men,Xinyue Han,Jing Dong

Main category: cs.CL

TL;DR: This study introduces an LLM-augmented NLP pipeline for analyzing unstructured clinical notes, improving cardiovascular disease risk stratification and showcasing the potential of large language models in clinical decision support systems.

Details Motivation: The motivation is to enhance the timely identification and accurate risk stratification of cardiovascular disease by leveraging valuable early indicators found in unstructured clinical notes, which existing prediction models using structured data often overlook. Method: The study employs a novel LLM-augmented clinical NLP pipeline, integrating cardiovascular-specific fine-tuning, prompt-based inference, and entity-aware reasoning to extract and analyze information from unstructured clinical notes. Result: The proposed approach demonstrated improved performance in precision, recall, F1-score, and AUROC. Clinical relevance was confirmed by cardiologists with a high kappa score of 0.82. Challenges like contextual hallucination and temporal ambiguity were mitigated using prompt engineering and hybrid rule-based verification. Conclusion: This study concludes that the LLM-augmented clinical NLP pipeline can significantly improve the performance of cardiovascular disease risk stratification and early warning systems, highlighting the potential of LLMs in clinical decision support systems. Abstract: Timely identification and accurate risk stratification of cardiovascular disease (CVD) remain essential for reducing global mortality. While existing prediction models primarily leverage structured data, unstructured clinical notes contain valuable early indicators. This study introduces a novel LLM-augmented clinical NLP pipeline that employs domain-adapted large language models for symptom extraction, contextual reasoning, and correlation from free-text reports. Our approach integrates cardiovascular-specific fine-tuning, prompt-based inference, and entity-aware reasoning. Evaluations on MIMIC-III and CARDIO-NLP datasets demonstrate improved performance in precision, recall, F1-score, and AUROC, with high clinical relevance (kappa = 0.82) assessed by cardiologists. 
Challenges such as contextual hallucination, which occurs when plausible information contradicts the provided source, and temporal ambiguity, which arises when models struggle with the chronological ordering of events, are addressed using prompt engineering and hybrid rule-based verification. This work underscores the potential of LLMs in clinical decision support systems (CDSS), advancing early warning systems and enhancing the translation of patient narratives into actionable risk assessments.

[23] Social Media Sentiments Analysis on the July Revolution in Bangladesh: A Hybrid Transformer Based Machine Learning Approach

Md. Sabbir Hossen,Md. Saiduzzaman,Pabon Shaha

Main category: cs.CL

TL;DR: This study analyzes public sentiment on social media during the July Revolution in Bangladesh using a hybrid transformer-based machine learning framework, achieving high accuracy in sentiment classification for the low-resource Bangla language.

Details Motivation: The motivation for this study was to decode public sentiment expressed on social media during and after the July Revolution in Bangladesh, a significant student-led uprising, using advanced machine learning techniques tailored for low-resource languages like Bangla. Method: The study used a dataset of 4,200 Bangla social media comments and applied transformer-based feature extraction techniques (BanglaBERT, mBERT, XLM-RoBERTa, and the hybrid XMB-BERT). Principal Component Analysis (PCA) was used for dimensionality reduction, and eleven machine learning classifiers were explored to identify sentiments. Result: The hybrid XMB-BERT framework with the voting classifier achieved an accuracy of 83.7%, outperforming other model-classifier combinations. This demonstrates the effectiveness of the proposed method in capturing nuanced sentiment patterns in Bangla text data. Conclusion: The study concludes that the proposed hybrid XMB-BERT framework with a voting classifier is highly effective for sentiment analysis in low-resource languages like Bangla, demonstrating the potential of machine learning techniques in analyzing social sentiment. Abstract: The July Revolution in Bangladesh marked a significant student-led mass uprising, uniting people across the nation to demand justice, accountability, and systemic reform. Social media platforms played a pivotal role in amplifying public sentiment and shaping discourse during this historic mass uprising. In this study, we present a hybrid transformer-based sentiment analysis framework to decode public opinion expressed in social media comments during and after the revolution. We used a brand new dataset of 4,200 Bangla comments collected from social media. The framework employs advanced transformer-based feature extraction techniques, including BanglaBERT, mBERT, XLM-RoBERTa, and the proposed hybrid XMB-BERT, to capture nuanced patterns in textual data. 
Principal Component Analysis (PCA) was used for dimensionality reduction to enhance computational efficiency. We explored eleven traditional and advanced machine learning classifiers for identifying sentiments. The proposed hybrid XMB-BERT with the voting classifier achieved an accuracy of 83.7%, outperforming other model-classifier combinations. This study underscores the potential of machine learning techniques to analyze social sentiment in low-resource languages like Bangla.

[24] Beyond Traditional Algorithms: Leveraging LLMs for Accurate Cross-Border Entity Identification

Andres Azqueta-Gavaldón,Joaquin Ramos Cosgrove

Main category: cs.CL

TL;DR: This paper explores the use of Large Language Models (LLMs) as a more effective alternative to traditional matching algorithms for identifying and classifying foreign entities in the Spanish financial system.

Details Motivation: The necessity of accurately identifying and classifying foreign entities in the Spanish financial system due to challenges from linguistic variations and outdated names. Method: Evaluated traditional methods, Hugging Face-based LLMs, and interface-based LLMs using a dataset of 65 Portuguese company cases. Result: Traditional methods achieve accuracies over 92% but suffer high false positive rates (20-40%). Interface-based LLMs achieve accuracies above 93%, F1 scores exceeding 96%, and lower false positives (40-80%). Conclusion: Interface-based LLMs outperform traditional methods in accuracy, F1 scores, and lower false positives in entity matching tasks. Abstract: The growing prevalence of cross-border financial activities in global markets has underscored the necessity of accurately identifying and classifying foreign entities. This practice is essential within the Spanish financial system for ensuring robust risk management, regulatory adherence, and the prevention of financial misconduct. This process involves a labor-intensive entity-matching task, where entities need to be validated against available reference sources. Challenges arise from linguistic variations, special characters, outdated names, and changes in legal forms, complicating traditional matching algorithms like Jaccard, cosine, and Levenshtein distances. These methods struggle with contextual nuances and semantic relationships, leading to mismatches. To address these limitations, we explore Large Language Models (LLMs) as a flexible alternative. LLMs leverage extensive training to interpret context, handle abbreviations, and adapt to legal transitions. We evaluate traditional methods, Hugging Face-based LLMs, and interface-based LLMs (e.g., Microsoft Copilot, Alibaba's Qwen 2.5) using a dataset of 65 Portuguese company cases. Results show traditional methods achieve accuracies over 92% but suffer high false positive rates (20-40%). 
Interface-based LLMs outperform them, achieving accuracies above 93%, F1 scores exceeding 96%, and lower false positives (40-80%).
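
To make the limitation of the traditional matchers concrete, here is a minimal standard-library sketch of two of the measures the abstract names, token-level Jaccard similarity and Levenshtein edit distance. The company name in the example is invented for illustration, and the tokenization (lowercasing and whitespace splitting) is an assumption; the paper's preprocessing is not described in the abstract.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Invented example: the same entity written with two legal-form spellings.
print(jaccard("Banco Exemplo SA", "Banco Exemplo S.A."))      # 0.5
print(levenshtein("Banco Exemplo SA", "Banco Exemplo S.A."))  # 2
```

The token-level score falls to 0.5 purely because of punctuation in the legal form, which is the kind of surface variation the abstract says these methods struggle with.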

[25] The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen,Jiashu Qu,Dongrui Liu,Zhiyuan Liu,Ruixi Wu,Yicun Yang,Xiangqi Jin,Haoyun Xu,Xuyang Liu,Weijia Li,Chaochao Lu,Jing Shao,Conghui He,Linfeng Zhang

Main category: cs.CL

TL;DR: This paper introduces DIJA, a new jailbreak framework that exploits vulnerabilities in diffusion-based large language models (dLLMs), highlighting the urgent need for improved safety alignment in these models.

Details Motivation: Despite the strong performance of dLLMs in code generation and text infilling, existing alignment mechanisms fail to safeguard them against masked-input adversarial prompts, exposing novel vulnerabilities. Method: The researchers proposed DIJA, a systematic study and jailbreak attack framework that exploits the bidirectional modeling and parallel decoding mechanisms of dLLMs. Result: DIJA significantly outperforms existing jailbreak methods, achieving high ASR scores and exposing overlooked threat surfaces in dLLM architectures. Conclusion: The study highlights the urgent need for rethinking safety alignment in diffusion-based large language models (dLLMs) due to their unique vulnerabilities. Abstract: Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. 
Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

[26] Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs

Sanhanat Sivapiromrat,Caiqi Zhang,Marco Basaldella,Nigel Collier

Main category: cs.CL

TL;DR: This paper explores how multiple backdoor triggers can coexist in Large Language Models (LLMs) without interference, demonstrating robust activation even under token modifications. A new post hoc recovery method efficiently mitigates these threats by selectively retraining specific model components.

Details Motivation: Recent studies show that LLMs are vulnerable to data poisoning attacks, but most existing works focus only on single triggers and effectiveness, offering limited understanding of multi-trigger interactions. This motivates the need for a comprehensive framework and effective mitigation strategies. Method: The authors present a framework for studying poisoning in LLMs and use multiple triggers with high embedding similarity to evaluate their robustness under token substitution or long token spans. They propose a defense mechanism based on layer-wise weight difference analysis and selective retraining of model components. Result: Multiple distinct backdoor triggers can coexist in an LLM without interfering with each other, allowing adversaries to embed several triggers concurrently. Poisoned triggers achieve robust activation even when tokens are substituted or separated. The proposed defense method effectively neutralizes multi-trigger threats with minimal model changes. Conclusion: The paper concludes that multiple backdoor triggers can coexist within a single LLM without interference, which expands the vulnerability surface. Additionally, the proposed post hoc recovery method effectively removes trigger behavior with minimal parameter updates. Abstract: Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing works assume a single trigger phrase and focus on the attack's effectiveness, offering limited understanding of trigger mechanisms and how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. 
Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
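
A layer-wise weight difference analysis can be sketched as computing, per layer, the norm of the difference between two checkpoints and selecting the layers that changed most as retraining candidates. The sketch below assumes the comparison is between a reference checkpoint and a suspect one, represents each layer as a flat list of floats, and uses an L2 norm with top-k selection. All of these choices, and the function names, are assumptions; the abstract does not give the paper's exact criterion.

```python
import math


def layer_diff_norms(reference, suspect):
    """L2 norm of the weight difference for each layer name in `reference`."""
    return {
        name: math.sqrt(sum((s - r) ** 2 for r, s in zip(reference[name], suspect[name])))
        for name in reference
    }


def layers_to_retrain(reference, suspect, k=1):
    """Names of the k layers whose weights moved the most."""
    norms = layer_diff_norms(reference, suspect)
    return sorted(norms, key=norms.get, reverse=True)[:k]


# Toy checkpoints with two "layers" (assumed values).
reference = {"layer0": [0.0, 0.0], "layer1": [1.0, 1.0]}
suspect = {"layer0": [0.1, 0.0], "layer1": [1.0, 3.0]}
print(layers_to_retrain(reference, suspect, k=1))  # ['layer1']
```

Restricting retraining to the selected layers is what would keep the number of updated parameters small, in line with the "minimal parameter updates" claim.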

[27] MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models

Seif Ahmed,Mohamed T. Younes,Abdelrahman Moustafa,Abdelrahman Allam,Hamza Moustafa

Main category: cs.CL

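TL;DR: This paper presents an efficient multilingual multimodal reasoning system that performed strongly in the ImageCLEF 2025 EXAMS V challenge, showing that a lightweight model ensemble combined with precise prompting strategies can outperform complex end-to-end models in educational settings.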

Details Motivation: The study aims to design a multilingual multimodal reasoning system for the ImageCLEF 2025 EXAMS V challenge, improving accuracy and performance in multilingual educational settings. Method: The study proposes a multilingual multimodal reasoning system based on Gemini 2.5 Flash, Phi 4, Gemma 3, and Mistral, conducts an extensive ablation study, and trains and evaluates on an English dataset and its multilingually augmented version. Result: On the official leaderboard the system placed first in the multilingual track with 81.4% accuracy and led 11 of the 13 language tracks, for example reaching 95.07% for Croatian and 92.12% for Italian. Conclusion: The study shows that a lightweight OCR-VLM ensemble, paired with precise prompting strategies and cross-lingual augmentation, can outperform heavier end-to-end models in high-stakes, multilingual educational settings. Abstract: We present a robust ensemble-based system for multilingual multimodal reasoning, designed for the ImageCLEF 2025 EXAMS V challenge. Our approach integrates Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as a reasoner which handles final answer selection, all coordinated through carefully engineered few-shot and zero-shot prompts. We conducted an extensive ablation study, training several large language models (Gemini 2.5 Flash, Phi 4, Gemma 3, Mistral) on an English dataset and its multilingual augmented version. Additionally, we evaluated Gemini 2.5 Flash in a zero-shot setting for comparison and found it to substantially outperform the trained models. Prompt design also proved critical: enforcing concise, language-normalized formats and prohibiting explanatory text boosted model accuracy on the English validation set from 55.9% to 61.7%. On the official leaderboard, our system (Team MSA) achieved first place overall in the multilingual track with 81.4% accuracy, and led 11 out of 13 individual language tracks, with top results such as 95.07% for Croatian and 92.12% for Italian. These findings highlight that lightweight OCR-VLM ensembles, when paired with precise prompt strategies and cross-lingual augmentation, can outperform heavier end-to-end models in high-stakes, multilingual educational settings.

[28] What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests

Dimitri Staufer

Main category: cs.CL

TL;DR: This paper introduces WikiMem, a novel dataset and metric to detect memorized personal data in large language models, enabling compliance with GDPR's Right to Be Forgotten through targeted machine unlearning.

Details Motivation: The motivation stems from concerns about LLMs memorizing personal information, which conflicts with GDPR's Right to Be Forgotten. Current methods lack the ability to identify specific individual-fact associations stored in models, limiting their applicability for individual-level privacy inquiries. Method: The paper proposes WikiMem, a dataset of over 5,000 natural language canaries from Wikidata, and uses a model-agnostic metric based on calibrated negative log-likelihood across paraphrased prompts to rank ground-truth values against counterfactuals. Result: The evaluation of WikiMem across 15 LLMs (410M-70B parameters) and 200 individuals shows that memorization correlates with subject web presence and model scale, demonstrating its effectiveness in identifying memorized personal data. Conclusion: The paper concludes that the introduced WikiMem dataset and model-agnostic metric can effectively identify memorized personal data in LLMs at the individual level, offering a solution for RTBF requests and machine unlearning. Abstract: Large Language Models (LLMs) can memorize and reveal personal information, raising concerns regarding compliance with the EU's GDPR, particularly the Right to Be Forgotten (RTBF). Existing machine unlearning methods assume the data to forget is already known but do not address how to identify which individual-fact associations are stored in the model. Privacy auditing techniques typically operate at the population level or target a small set of identifiers, limiting applicability to individual-level data inquiries. We introduce WikiMem, a dataset of over 5,000 natural language canaries covering 243 human-related properties from Wikidata, and a model-agnostic metric to quantify human-fact associations in LLMs. Our approach ranks ground-truth values against counterfactuals using calibrated negative log-likelihood across paraphrased prompts. 
We evaluate 200 individuals across 15 LLMs (410M-70B parameters), showing that memorization correlates with subject web presence and model scale. We provide a foundation for identifying memorized personal data in LLMs at the individual level, enabling the dynamic construction of forget sets for machine unlearning and RTBF requests.
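The ranking step described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `nll` scoring callable, the prompts, and the toy numbers are hypothetical, and calibration is approximated by subtracting the NLL of each value under a neutral prompt, which may differ from the paper's actual calibration.

```python
from statistics import mean
from typing import Callable, Sequence


def truth_rank(
    nll: Callable[[str, str], float],
    paraphrases: Sequence[str],
    neutral_prompt: str,
    truth: str,
    counterfactuals: Sequence[str],
) -> int:
    """Return the 1-based rank of `truth` among all candidates (1 = most likely)."""

    def calibrated_score(value: str) -> float:
        # Assumed calibration: remove how likely the value is regardless of subject.
        baseline = nll(neutral_prompt, value)
        # Average over paraphrases so one prompt wording does not dominate.
        return mean(nll(prompt, value) - baseline for prompt in paraphrases)

    candidates = [truth, *counterfactuals]
    ranked = sorted(candidates, key=calibrated_score)  # lower NLL = more likely
    return ranked.index(truth) + 1


# Hypothetical NLL table standing in for a real language model.
TOY_NLL = {
    ("The answer is", "Paris"): 2.0,
    ("The answer is", "Rome"): 3.0,
    ("The answer is", "Oslo"): 3.0,
    ("X was born in", "Paris"): 1.0,
    ("X was born in", "Rome"): 3.5,
    ("X was born in", "Oslo"): 4.0,
    ("The birthplace of X is", "Paris"): 1.5,
    ("The birthplace of X is", "Rome"): 3.0,
    ("The birthplace of X is", "Oslo"): 3.5,
}


def toy_nll(prompt: str, value: str) -> float:
    return TOY_NLL[(prompt, value)]


PARAPHRASES = ["X was born in", "The birthplace of X is"]
rank = truth_rank(toy_nll, PARAPHRASES, "The answer is", "Paris", ["Rome", "Oslo"])
print(rank)
```

A rank of 1 means the model prefers the true value over every counterfactual, which is the kind of signal that could mark a person-fact pair as memorized.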

[29] Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Conrad Borchers,Bahar Shahrokhian,Francesco Balzan,Elham Tajik,Sreecharan Sankaranarayanan,Sebastian Simon

Main category: cs.CL

TL;DR: This study investigates how agent personas and temperature settings affect consensus and coding accuracy in a multi-agent system using six open-source LLMs. While MAS can mimic human coding workflows, they do not consistently outperform single agents in accuracy. Temperature and persona diversity influence consensus dynamics, but only specific configurations (low temperature and assertive personas) show minor improvements in one model.

Details Motivation: The motivation behind this study is to explore how Large Language Models (LLMs) can be used for qualitative research at scale, particularly in tasks like coding and data annotation. While multi-agent systems (MAS) have been proposed as a way to replicate human coding workflows, their effectiveness compared to single-agent systems remains unclear. This research aims to better understand the role of agent personas and temperature in shaping consensus and accuracy in coding dialog segments. Method: The researchers conducted an experimental study using an open-source MAS that emulates deductive human coding through structured agent discussion and consensus arbitration. They used six open-source LLMs (ranging from 3 to 32 billion parameters) and 18 experimental configurations to analyze over 77,000 coding decisions. These decisions were compared against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions. They assessed the impact of agent persona (neutral, assertive, empathetic) and temperature settings on consensus-building and coding accuracy. Result: Temperature significantly affected whether and when consensus was reached across all six LLMs. MAS with multiple personas delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures reduced the impact of multiple personas on consensus. However, neither temperature nor persona diversity consistently improved coding accuracy. Single agents performed as well or better than MAS in most conditions. Only one model (OpenHermesV2:7B) showed above-chance improvements in one code category under specific conditions (temperature ≤ 0.5 and inclusion of an assertive persona). Conclusion: The study concludes that while multi-agent systems (MAS) can emulate human coding workflows, they do not consistently outperform single-agent systems in coding accuracy. 
Furthermore, the research challenges the assumption that diverse MAS personas lead to better outcomes, as neither temperature nor persona pairing led to robust improvements in accuracy. However, MAS may help in refining ambiguous code applications, which could improve codebooks and human-AI collaboration. Abstract: Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including coding and data annotation. While multi-agent systems (MAS) can emulate human coding workflows, their benefits over single-agent coding remain poorly understood. We conducted an experimental study of how agent persona and temperature shape consensus-building and coding accuracy of dialog segments based on a codebook with 8 codes. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Only one model (OpenHermesV2:7B) and code category showed above-chance gains from MAS deliberation when temperature was 0.5 or lower and especially when the agents included at least one assertive persona. 
Qualitative analysis of MAS collaboration for these configurations suggests that MAS may nonetheless aid in narrowing ambiguous code applications that could improve codebooks and human-AI coding. We contribute new insight into the limits of LLM-based qualitative methods, challenging the notion that diverse MAS personas lead to better outcomes. We open-source our MAS and experimentation code.

[30] EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering

Valle Ruiz-Fernández,Mario Mina,Júlia Falcão,Luis Vasquez-Reina,Anna Sallés,Aitor Gonzalez-Agirre,Olatz Perez-de-Viñaspre

Main category: cs.CL

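TL;DR: This paper introduces two social bias evaluation datasets for Spanish and Catalan and examines the biases that large language models may exhibit in different cultural contexts.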

Details Motivation: Given the lack of resources for evaluating social bias in languages other than English and in social contexts outside the United States, this study aims to fill that gap. Method: Based on the original BBQ, two parallel datasets in Spanish and Catalan were designed to evaluate social bias across model families, sizes, and variants. Result: The paper introduces the Spanish and Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ), adapted to the Spanish and Catalan languages and to the social context of Spain. Conclusion: Models often fail to choose the correct answer in ambiguous scenarios, and high QA accuracy often correlates with greater reliance on social biases. Abstract: Previous literature has largely shown that Large Language Models (LLMs) perpetuate social biases learnt from their pre-training data. Given the notable lack of resources for social bias evaluation in languages other than English, and for social contexts outside of the United States, this paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ). Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting, now adapted to the Spanish and Catalan languages and to the social context of Spain. We report evaluation results on different LLMs, factoring in model family, size and variant. Our results show that models tend to fail to choose the correct answer in ambiguous scenarios, and that high QA accuracy often correlates with greater reliance on social biases.

[31] An Agentic Flow for Finite State Machine Extraction using Prompt Chaining

Fares Wael,Youssef Maklad,Ali Hamdi,Wael Elsersy

Main category: cs.CL

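TL;DR: This paper proposes a new method that uses large language models and an agentic framework to extract finite-state machines from RFC documents, addressing limitations of existing techniques.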

Details Motivation: Existing FSM extraction techniques struggle with scalability, incomplete coverage, and ambiguity in natural language specifications, so a more efficient and accurate method is needed. Method: The paper proposes FlowFSM, a novel agentic framework that combines large language models (LLMs), prompt chaining, and chain-of-thought reasoning to extract accurate FSMs from raw RFC documents. Result: Experimental evaluation on the FTP and RTSP protocols shows that FlowFSM achieves high-precision FSM extraction while reducing hallucinated state transitions. Conclusion: FlowFSM demonstrates the potential of agent-based LLM systems for protocol analysis and FSM inference, offering a new approach for cybersecurity and reverse engineering applications. Abstract: Finite-State Machines (FSMs) are critical for modeling the operational logic of network protocols, enabling verification, analysis, and vulnerability discovery. However, existing FSM extraction techniques face limitations such as scalability, incomplete coverage, and ambiguity in natural language specifications. In this paper, we propose FlowFSM, a novel agentic framework that leverages Large Language Models (LLMs) combined with prompt chaining and chain-of-thought reasoning to extract accurate FSMs from raw RFC documents. FlowFSM systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books by chaining agent outputs. Experimental evaluation across FTP and RTSP protocols demonstrates that FlowFSM achieves high extraction precision while minimizing hallucinated transitions, showing promising results. Our findings highlight the potential of agent-based LLM systems in the advancement of protocol analysis and FSM inference for cybersecurity and reverse engineering applications.

[32] Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages

Lyzander Marciano Andrylie,Inaya Rahmanisa,Mahardika Krisna Ihsani,Alfan Farizki Wicaksono,Haryo Akbarianto Wibowo,Alham Fikri Aji

Main category: cs.CL

TL;DR: This paper introduces SAE-LAPE, a novel method using sparse autoencoders and feature activation probability, to identify interpretable language-specific features in large language models, which are found mainly in middle to final layers and can be used effectively for language identification.

Details Motivation: The motivation of the study is to better understand the multilingual mechanisms of large language models by identifying language-specific features, which are underexplored despite the presence of language-independent features. This understanding can improve how models process different languages and enhance language identification capabilities. Method: The authors introduced SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network of LLMs. They used sparse autoencoders (SAEs) to isolate monosemantic features and analyzed their distribution and impact across different layers of the model. Result: The authors found that language-specific features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model's multilingual performance and language output. The method achieved language identification performance comparable to fastText with added interpretability. Conclusion: The study concludes that language-specific features in LLMs can be effectively identified using the SAE-LAPE method, and these features are interpretable, impactful for multilingual performance, and applicable for language identification with high interpretability. Abstract: Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. 
In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model's multilingual performance and language output and can be used for language identification with performance comparable to fastText along with more interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features .
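One way to picture a criterion based on feature activation probability is the sketch below. It assumes that, for a single SAE feature, we already have per-language counts of how often the feature fires; the entropy-based score, the function names, and the counts are illustrative assumptions and are not taken from the authors' code.

```python
import math
from typing import Dict


def activation_probabilities(
    active_counts: Dict[str, int], token_counts: Dict[str, int]
) -> Dict[str, float]:
    """Fraction of tokens in each language on which the feature is active."""
    return {lang: active_counts[lang] / token_counts[lang] for lang in token_counts}


def activation_entropy(probs: Dict[str, float]) -> float:
    """Entropy of the activation probabilities after normalizing across languages.

    Low entropy: the feature fires mostly for one language (language-specific).
    High entropy: the feature fires about equally everywhere (language-independent).
    """
    total = sum(probs.values())
    if total == 0:
        return float("inf")  # the feature never fires, so it cannot be ranked
    entropy = 0.0
    for p in probs.values():
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


# Hypothetical counts over 1,000 tokens per language.
tokens = {"en": 1000, "id": 1000, "fr": 1000}
specific = activation_probabilities({"en": 900, "id": 10, "fr": 10}, tokens)
shared = activation_probabilities({"en": 300, "id": 300, "fr": 300}, tokens)

print(activation_entropy(specific) < activation_entropy(shared))
```

Under this sketch, features would be sorted by entropy and the lowest-entropy ones inspected as candidate language-specific features.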

[33] KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding

Luohe Shi,Zuchao Li,Lefei Zhang,Guoming Liu,Baoyuan Qi,Hai Zhao

Main category: cs.CL

TL;DR: KV-Latent reduces KV Cache footprint and improves inference speed in large language models by down-sampling Key-Value vectors and modifying Rotary Positional Embedding.

Details Motivation: The gradually increasing Key-Value cache during inference has become a primary efficiency bottleneck in large language models. Method: Down-sampling Key-Value vectors into a latent space and modifying the frequency sampling mechanism of Rotary Positional Embedding. Result: The experiments showed satisfactory results in reducing the KV Cache footprint and improving inference speed with minimal extra training. Conclusion: KV-Latent provides a promising solution to reduce KV Cache footprint and improve inference speed for large language models. Abstract: Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, both in terms of memory consumption and data transfer bandwidth limitations. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed with only a small amount of extra training, less than 1% of what pre-training takes. In addition, we enhance the stability of Rotary Positional Embedding applied to lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position attenuation. Our experiments, including both models with Grouped Query Attention and those without, have yielded satisfactory results. Finally, we conducted comparative experiments to study the impact of separately reducing the Key and Value components on the model's performance. Our approach allows for the construction of more efficient language model systems, and opens new possibilities for KV Cache saving and efficient LLMs. Our code is available at https://github.com/ShiLuohe/KV-Latent.
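To see why shrinking the Key and Value dimensions matters, the back-of-the-envelope calculation below estimates cache size. The model shape (32 layers, 32 KV heads, 128-dimensional heads), the 4,096-token context, and fp16 storage are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    key_dim: int,
    value_dim: int,
    seq_len: int,
    batch: int = 1,
    bytes_per_elem: int = 2,  # assumes fp16/bf16 storage
) -> int:
    """Bytes needed to cache keys and values for every layer, head, and token."""
    per_token = layers * kv_heads * (key_dim + value_dim) * bytes_per_elem
    return per_token * seq_len * batch


GIB = 1024 ** 3

# Hypothetical baseline: full 128-dimensional keys and values.
full = kv_cache_bytes(32, 32, key_dim=128, value_dim=128, seq_len=4096)
# Both keys and values down-sampled to 64 dimensions.
latent = kv_cache_bytes(32, 32, key_dim=64, value_dim=64, seq_len=4096)
# Only keys reduced, values left at full size.
keys_only = kv_cache_bytes(32, 32, key_dim=64, value_dim=128, seq_len=4096)

print(full / GIB, latent / GIB, keys_only / GIB)
```

Because the cache grows linearly in both dimensions, Key and Value sizes can be traded off independently, which is the setting the comparative experiments in the abstract explore.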

[34] FMC: Formalization of Natural Language Mathematical Competition Problems

Jiaxuan Xie,Chengwu Liu,Ye Yuan,Siqi Li,Zhiping Xiao,Ming Zhang

Main category: cs.CL

TL;DR: This paper introduces an efficient autoformalization pipeline using large language models with error feedback to create a high-quality Olympiad-level dataset for formal mathematical reasoning, demonstrating improved formalization through few-shot learning, error feedback, and increased sampling.

Details Motivation: The motivation behind this research is to advance formal mathematical reasoning by developing efficient and accurate autoformalization methods that leverage large-scale datasets of natural language mathematical problems to construct formal language datasets. Method: The method involves an autoformalization pipeline based on large language models with error feedback to create a dataset aligning natural language mathematical problems with Lean formalizations. Result: The result is an Olympiad-level dataset containing 3,922 mathematical problems in natural language and 9,787 in Lean, where 64.46% were assessed as at least above-average quality. The study also showed that few-shot learning, error feedback, and increasing sampling numbers enhance the autoformalization process. Additionally, experiments highlighted the dataset's challenging nature and its value as a benchmark for formal reasoning tasks. Conclusion: The paper concludes that their proposed autoformalization pipeline based on large language models with error feedback is efficient and accurate, and the curated Olympiad-level dataset aligning natural language problems with Lean formalizations is suitable as a benchmark for automated theorem provers. Abstract: Efficient and accurate autoformalization methods, which leverage large-scale datasets of extensive natural language mathematical problems to construct formal language datasets, are key to advancing formal mathematical reasoning. In this paper, we propose an autoformalization pipeline based on large language models with error feedback, achieving a fully automatic and training-free formalization approach. Using this pipeline, we curate an Olympiad-level dataset aligning natural language problems with Lean formalizations. 
The dataset comprises 3,922 mathematical problems in natural language and 9,787 in Lean, of which 64.46% were assessed as at least above-average quality, making it suitable as a benchmark for automated theorem provers. Additionally, we investigate the formalization and reasoning capabilities of various LLMs and empirically demonstrate that few-shot learning, error feedback, and increasing sampling numbers enhance the autoformalization process. Experiments of three automated theorem provers on the FMC dataset also highlight its challenging nature and its value as a benchmark for formal reasoning tasks.

[35] Fine-Grained Chinese Hate Speech Understanding: Span-Level Resources, Coded Term Lexicon, and Enhanced Detection Frameworks

Zewen Bai,Liang Yang,Shengdi Yin,Yuanyuan Sun,Hongfei Lin

Main category: cs.CL

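TL;DR: This study presents the first fine-grained Chinese hate speech dataset and explores ways to strengthen models' interpretive ability, improving both the performance and the interpretability of Chinese hate speech detection.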

Details Motivation: Research on Chinese hate speech detection lags behind, and it suffers from a lack of fine-grained annotated data and from limited ability to interpret hate semantics in complex scenarios. Method: The authors build a new fine-grained Chinese hate speech dataset (STATE ToxiCN), conduct the first comprehensive study of Chinese coded hate terms, and propose a method for integrating an annotated lexicon into models. Result: Hate speech detection performance improved, and the ability of existing models to understand hate semantics was evaluated. Conclusion: The work provides important resources and insights that advance interpretability research on Chinese hate speech detection. Abstract: The proliferation of hate speech has inflicted significant societal harm, with its intensity and directionality closely tied to specific targets and arguments. In recent years, numerous machine learning-based methods have been developed to detect hateful comments on online platforms automatically. However, research on Chinese hate speech detection lags behind, and interpretability studies face two major challenges: first, the scarcity of span-level fine-grained annotated datasets limits models' deep semantic understanding of hate speech; second, insufficient research on identifying and interpreting coded hate speech restricts model explainability in complex real-world scenarios. To address these, we make the following contributions: (1) We introduce the Span-level Target-Aware Toxicity Extraction dataset (STATE ToxiCN), the first span-level Chinese hate speech dataset, and evaluate the hate semantic understanding of existing models using it. (2) We conduct the first comprehensive study on Chinese coded hate terms and LLMs' ability to interpret hate semantics. (3) We propose a method to integrate an annotated lexicon into models, significantly enhancing hate speech detection performance. Our work provides valuable resources and insights to advance the interpretability of Chinese hate speech detection research.

[36] Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

Andrei Niculae,Adrian Cosma,Cosmin Dumitrache,Emilian Rǎdoi

Main category: cs.CL

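TL;DR: This paper presents Dr.Copilot, a multi-agent large language model system for improving the quality of doctors' written responses, which has been deployed on a Romanian telemedicine platform with good results.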

Details Motivation: Text-based telemedicine is increasingly common, but the quality of advice in doctor-patient exchanges is often judged more by how it is communicated than by its clinical accuracy. Addressing this calls for a system that can evaluate and improve the presentation quality of doctors' written responses. Method: The paper proposes Dr.Copilot, a multi-agent large language model (LLM) system with prompts automatically optimized via DSPy, which gives feedback on doctors' written responses along 17 interpretable axes. It was designed with low-resource Romanian data and deployed using open-weight models. Result: Empirical evaluation and live deployment with 41 doctors show measurable improvements in user reviews and response quality, making it one of the first real-world deployments of LLMs in Romanian medical settings. Conclusion: Dr.Copilot is a multi-agent LLM system that improves the quality of Romanian-speaking doctors' written responses in telemedicine and has shown measurable gains in real-world use. Abstract: Text-based telemedicine has become increasingly common, yet the quality of medical advice in doctor-patient interactions is often judged more on how advice is communicated than on its clinical accuracy. To address this, we introduce Dr.Copilot, a multi-agent large language model (LLM) system that supports Romanian-speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr.Copilot provides feedback along 17 interpretable axes. The system comprises three LLM agents with prompts automatically optimized via DSPy. Designed with low-resource Romanian data and deployed using open-weight models, it delivers real-time specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real-world deployments of LLMs in Romanian medical settings.

[37] Internal Value Alignment in Large Language Models through Controlled Value Vector Activation

Haoran Jin,Meng Li,Xiting Wang,Zhihao Xu,Minlie Huang,Yantao Jia,Defu Lian

Main category: cs.CL

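TL;DR: This paper introduces ConVA, a new method that effectively aligns and controls the internal values of large language models without hurting model performance.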

Details Motivation: As demand grows for large language models (LLMs) that are consistent with human values, researchers seek methods that offer clarity, transparency, and adaptability to evolving scenarios. Method: The paper proposes Controlled Value Vector Activation (ConVA), which directly aligns an LLM's internal values by interpreting how a value is encoded in the model's latent representations and modifying the relevant activations. It uses a context-controlled value vector identification method and a gated value vector activation method to keep values consistent with minimal intervention. Result: Experiments show that ConVA achieves the highest control success rate across 10 basic values without hurting LLM performance or fluency, and that it ensures target values even under opposite or potentially malicious input prompts. Conclusion: By introducing context-controlled value vector identification and gated value vector activation, ConVA aligns and controls the internal values of LLMs effectively without sacrificing model performance or fluency. Abstract: Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifies relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method for effective and minimum degree of value control. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at https://github.com/hr-jin/ConVA.

[38] Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge

Wenqing Wu,Chengzhi Zhang,Yi Zhao

Main category: cs.CL

TL;DR: This paper proposes a method to assess academic paper novelty by combining human insights and large language models, achieving better results than traditional approaches.

Details Motivation: Traditional methods for assessing novelty have limitations, as experts have constrained knowledge and citation-based metrics are uncertain indicators of true novelty. Method: Sentences related to novelty were extracted from peer review reports, and LLMs were used to summarize methodology sections. These were used to fine-tune PLMs. A novel Sparse-Attention fusion module was also designed. Result: Experiments showed superior performance of the proposed method compared to existing baselines. Conclusion: The study concludes that combining human expertise with LLMs improves the assessment of methodological novelty in academic papers. Abstract: Novelty is a crucial criterion in the peer review process for evaluating academic papers. Traditionally, it's judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it's unclear if unique citations truly measure novelty. The large language model (LLM) possesses a wealth of knowledge, while human experts possess judgment abilities that the LLM does not possess. Therefore, our research integrates the knowledge and abilities of LLM and human experts to address the limitations of novelty assessment. The most common novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLM to assist pretrained language models (PLMs, e.g. BERT etc.) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer review reports and use LLM to summarize the methodology section of the academic paper, which are then used to fine-tune PLMs. In addition, we have designed a text-guided fusion module with novel Sparse-Attention to better integrate human and LLM knowledge. We compared the method we proposed with a large number of baselines. 
Extensive experiments demonstrate that our method achieves superior performance.

[39] What is the Best Process Model Representation? A Comparative Analysis for Process Modeling with Large Language Models

Alexis Brissard,Frédéric Cuppens,Amal Zouaq

Main category: cs.CL

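TL;DR: This paper introduces the PMo Dataset and compares different process model representations (PMRs) for process modeling (PMo) and process model generation (PMG) with large language models (LLMs). Mermaid scores highest on the PMo criteria, while BPMN text performs best on process element similarity for PMG.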

Details Motivation: LLMs applied to process modeling tasks currently rely on a variety of process model representations (PMRs) that differ widely in structure, complexity, and usability, yet these PMRs have never been systematically compared. Existing process model generation approaches also use different evaluation strategies and techniques, which makes comparison difficult. Method: The paper presents an empirical study that introduces the new PMo Dataset, containing 55 process descriptions paired with models in nine different PMRs, and evaluates multiple PMRs along two dimensions: suitability for LLM-based PMo and performance on PMG. Result: Mermaid achieves the highest overall score across six PMo criteria, suggesting it is better suited to LLM-assisted process modeling, while BPMN text achieves the best results on process element similarity in process model generation. Conclusion: Different PMRs perform markedly differently on LLM-supported PMo and PMG tasks. Mermaid is best for PMo, while BPMN text performs best on process element similarity for PMG, which offers guidance for choosing a PMR suited to a given task. Abstract: Large Language Models (LLMs) are increasingly applied for Process Modeling (PMo) tasks such as Process Model Generation (PMG). To support these tasks, researchers have introduced a variety of Process Model Representations (PMRs) that serve as model abstractions or generation targets. However, these PMRs differ widely in structure, complexity, and usability, and have never been systematically compared. Moreover, recent PMG approaches rely on distinct evaluation strategies and generation techniques, making comparison difficult. This paper presents the first empirical study that evaluates multiple PMRs in the context of PMo with LLMs. We introduce the PMo Dataset, a new dataset containing 55 process descriptions paired with models in nine different PMRs. We evaluate PMRs along two dimensions: suitability for LLM-based PMo and performance on PMG. Mermaid achieves the highest overall score across six PMo criteria, whereas BPMN text delivers the best PMG results in terms of process element similarity.

[40] Addressing Data Imbalance in Transformer-Based Multi-Label Emotion Detection with Weighted Loss

Xia Cui

Main category: cs.CL

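TL;DR: This paper studies a weighted loss function applied to Transformer models to address data imbalance in multi-label emotion detection. The results show gains on high-frequency classes but limited effect on minority classes.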

Details Motivation: The paper aims to address data imbalance in multi-label emotion detection while avoiding the computational burden of traditional resampling methods. Method: The approach uses a simple weighted loss function and addresses data imbalance by dynamically adjusting class weights. Result: Experiments show that the weighted loss function improves performance on high-frequency emotion classes but has limited impact on minority classes. Conclusion: The weighted loss function is effective for high-frequency emotion classes but has limited impact on minority classes, which highlights both the effectiveness and the challenges of applying this approach to imbalanced multi-label emotion detection. Abstract: This paper explores the application of a simple weighted loss function to Transformer-based models for multi-label emotion detection in SemEval-2025 Shared Task 11. Our approach addresses data imbalance by dynamically adjusting class weights, thereby enhancing performance on minority emotion classes without the computational burden of traditional resampling methods. We evaluate BERT, RoBERTa, and BART on the BRIGHTER dataset, using evaluation metrics such as Micro F1, Macro F1, ROC-AUC, Accuracy, and Jaccard similarity coefficients. The results demonstrate that the weighted loss function improves performance on high-frequency emotion classes but shows limited impact on minority classes. These findings underscore both the effectiveness and the challenges of applying this approach to imbalanced multi-label emotion detection.
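A weighted loss for multi-label classification can be sketched in a few lines of plain Python. The abstract does not state the exact weighting scheme, so the negative-to-positive ratio used below is an assumption (it is one common choice), and the labels and probabilities are made-up toy values.

```python
import math
from typing import List, Sequence


def positive_class_weights(labels: Sequence[Sequence[int]]) -> List[float]:
    """Per-class weight = (#negatives / #positives), so rare labels weigh more."""
    n = len(labels)
    weights = []
    for c in range(len(labels[0])):
        pos = sum(row[c] for row in labels)
        neg = n - pos
        weights.append(neg / pos if pos else 1.0)  # fall back to 1.0 if a class is absent
    return weights


def weighted_bce(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    pos_weights: Sequence[float],
    eps: float = 1e-7,
) -> float:
    """Mean binary cross-entropy with the positive term scaled per class."""
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for p, y, w in zip(p_row, y_row, pos_weights):
            p = min(max(p, eps), 1 - eps)  # clamp to keep log() finite
            total += -(w * y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count


# Toy data: class 0 is frequent, class 1 is rare.
train_labels = [[1, 0], [1, 0], [1, 0], [0, 1]]
weights = positive_class_weights(train_labels)

# A prediction that misses the rare class is penalized more under weighting.
missed_rare = [[0.5, 0.1]]
target = [[0, 1]]
print(weighted_bce(missed_rare, target, weights) > weighted_bce(missed_rare, target, [1.0, 1.0]))
```

Recomputing the weights from each training split (or each batch) is one way to read "dynamically adjusting class weights"; that interpretation is also an assumption.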

[41] DCR: Quantifying Data Contamination in LLMs Evaluation

Cheng Xu,Nan Yan,Shuhao Guan,Changhong Jin,Yuke Mei,Yibing Guo,M-Tahar Kechadi

Main category: cs.CL

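TL;DR: This paper proposes the DCR framework for detecting and quantifying benchmark data contamination in large language models, adjusting accuracy in a contamination-aware way to make evaluation more reliable.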

Details Motivation: Advances in large language models have raised concerns about benchmark data contamination, which can inflate performance metrics and undermine the assessment of genuine generalization. Method: The paper proposes DCR, a lightweight and interpretable framework that synthesizes contamination scores via a fuzzy inference system and adjusts model accuracy to reflect contamination-aware performance. Result: Validated on 9 large language models (0.5B-72B), the DCR framework adjusts accuracy to within 4% average error and reliably diagnoses the severity of data contamination. Conclusion: The DCR framework offers an efficient and transparent tool for detecting and quantifying benchmark data contamination, improving the credibility and fairness of LLM evaluation. Abstract: The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B-72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted using the DCR Factor falls within 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.

[42] EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

LG AI Research,Kyunghoon Bae,Eunbi Choi,Kibong Choi,Stanley Jungkyu Choi,Yemuk Choi,Kyubeen Han,Seokhee Hong,Junwon Hwang,Taewan Hwang,Joonwon Jang,Hyojin Jeon,Kijeong Jeon,Gerrard Jeongwon Jo,Hyunjik Jo,Jiyeon Jung,Euisoon Kim,Hyosang Kim,Jihoon Kim,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Yongil Kim,Youchul Kim,Edward Hwayoung Lee,Gwangho Lee,Haeju Lee,Honglak Lee,Jinsik Lee,Kyungmin Lee,Sangha Park,Young Min Paik,Yongmin Park,Youngyong Park,Sanghyun Seo,Sihoon Yang,Heuiyeen Yeen,Sihyuk Yi,Hyeongu Yun

Main category: cs.CL

TL;DR: EXAONE 4.0 improves upon previous versions by integrating advanced reasoning modes, supporting more languages, and offering two model sizes for diverse applications.

Details Motivation: To meet the demands of the agentic AI era by providing a model that balances usability with advanced reasoning abilities, supports multiple languages, and offers flexibility through different model sizes. Method: The paper introduces EXAONE 4.0, which combines a Non-reasoning mode and a Reasoning mode while extending multilingual capabilities to include Spanish alongside English and Korean. It also describes the availability of two model sizes: a high-performance 32B mid-size model and a 1.2B small-size model designed for on-device use. Result: EXAONE 4.0 outperforms open-weight models in its class and remains competitive against frontier-class models. The models are publicly available for research purposes. Conclusion: EXAONE 4.0 is a significant advancement in AI models, offering improved usability, advanced reasoning capabilities, multilingual support, and two model sizes optimized for different applications. Abstract: This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. The EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.

[43] KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning?

Soumadeep Saha,Akshay Chaturvedi,Saptarshi Saha,Utpal Garain,Nicholas Asher

Main category: cs.CL

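TL;DR: This paper introduces Causal CoT Graphs (CCGs) and the accompanying KisMATH dataset, shedding light on how large language models use chain-of-thought for reasoning.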

Details Motivation: Chain-of-thought improves the performance of large language models, but the mechanism behind the improvement remains unclear. Method: Directed acyclic graphs are automatically extracted from reasoning traces to model fine-grained causal dependencies in the language model output. Result: The analysis shows that reasoning nodes are mediators for the final answer and that models internally realise structures akin to the constructed graphs. Conclusion: By introducing Causal CoT Graphs, KisMATH offers a new perspective and dataset for studying chain-of-thought in large language models. Abstract: Chain-of-thought traces have been shown to improve performance of large language models in a plethora of reasoning tasks, yet there is no consensus on the mechanism through which this performance boost is achieved. To shed more light on this, we introduce Causal CoT Graphs (CCGs), which are directed acyclic graphs automatically extracted from reasoning traces that model fine-grained causal dependencies in the language model output. A collection of 1671 mathematical reasoning problems from MATH500, GSM8K and AIME, and their associated CCGs are compiled into our dataset, KisMATH. Our detailed empirical analysis with 15 open-weight LLMs shows that (i) reasoning nodes in the CCG are mediators for the final answer, a condition necessary for reasoning; and (ii) LLMs emphasise reasoning paths given by the CCG, indicating that models internally realise structures akin to our graphs. KisMATH enables controlled, graph-aligned interventions and opens up avenues for further investigation into the role of chain-of-thought in LLM reasoning.

[44] Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Orion Weller,Kathryn Ricci,Marc Marone,Antoine Chaffin,Dawn Lawrie,Benjamin Van Durme

Main category: cs.CL

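TL;DR: This paper introduces the Ettin suite, systematically compares encoder-only and decoder-only models, shows where each architecture excels, and open-sources all training data and models to support future research.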

Details Motivation: To compare encoder-only and decoder-only models fairly, addressing the unfair comparisons in prior work caused by differing parameter counts, training techniques, and datasets. Method: Introduces the SOTA open-data Ettin suite of paired encoder-only and decoder-only models, ranging from 17 million to 1 billion parameters and trained and evaluated with the same recipe. Result: Encoder-only models perform better on classification and retrieval tasks, while decoder-only models are better at generative tasks. Conclusion: Encoder-only and decoder-only models excel at different tasks, and adapting a model to the opposite objective through continued training is less effective than directly using the suitable architecture. Abstract: The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.

[45] Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?

Yanjian Zhang,Guillaume Wisniewski,Nadi Tomeh,Thierry Charnois

Main category: cs.CL

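TL;DR: This study examines whether prompting can control the reasoning strategies of large language models to improve logical problem-solving, and finds that adaptive strategy selection could improve performance.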

Details Motivation: Large language models tend to favor a single reasoning strategy, which may limit their effectiveness across diverse reasoning challenges, so it is worth studying how prompting can control their reasoning strategies. Method: Experiments evaluate how different reasoning strategies affect LLM performance on logical problem-solving tasks and explore ways to improve it. Result: No single strategy consistently improves accuracy, but performance could be enhanced if models adaptively chose the best strategy. Conclusion: Prompting can control the reasoning strategies of LLMs, and methods are proposed to guide their strategy selection. Abstract: Human reasoning involves different strategies, each suited to specific problems. Prior work shows that large language models (LLMs) tend to favor a single reasoning strategy, potentially limiting their effectiveness in diverse reasoning challenges. In this work, we investigate whether prompting can control LLMs' reasoning strategies and assess its impact on logical problem-solving. While our experiments show that no single strategy consistently improves accuracy, performance could be enhanced if models could adaptively choose the optimal strategy. We propose methods to guide LLMs in strategy selection, highlighting new ways to refine their reasoning abilities.

[46] HKGAI-V1: Towards Regional Sovereign Large Language Model for Hong Kong

Sirui Han,Junqi Zhu,Ruiyuan Zhang,Yike Guo

Main category: cs.CL

TL;DR: This paper presents the development of HKGAI-V1, value-aligned AI infrastructure tailored to Hong Kong.

Details Motivation: To build an AI model suited to Hong Kong's unique multilingual environment and socio-legal context. Method: The model is built on the DeepSeek architecture, aligned with regional norms through full parameter fine-tuning, and combined with a retrieval-augmented generation (RAG) system. Result: HKGAI-V1 was successfully developed and outperforms general-purpose models on Hong Kong-specific culturally sensitive queries; the proprietary Adversarial HK Value Benchmark was also created. Conclusion: The paper provides not only a technological artifact but also a replicable blueprint for developing advanced regional AI systems deeply rooted in local identity. Abstract: This paper presents the development of HKGAI-V1, a foundational sovereign large language model (LLM), developed as part of an initiative to establish value-aligned AI infrastructure specifically tailored for Hong Kong. Addressing the region's unique multilingual environment (Cantonese, Mandarin, and English), its distinct socio-legal context under the "one country, two systems" framework, and specific local cultural and value considerations, the model is built upon the DeepSeek architecture and systematically aligned with regional norms through a multifaceted full parameter fine-tuning process. It is further integrated with a retrieval-augmented generation (RAG) system to ensure timely and factually grounded information access. The core contribution lies in the design and implementation of a comprehensive, region-specific AI alignment and safety framework, demonstrated through two key achievements: 1) The successful development of HKGAI-V1 itself - which outperforms general-purpose models in handling Hong Kong-specific culturally sensitive queries, and embodies a "governance-embedded" approach to digital sovereignty - empowers Hong Kong to exercise control over AI applications in critical sectors including public services, legal systems, and education. 2) The development of the proprietary Adversarial HK Value Benchmark, a rigorous tool for evaluating model alignment with local ethical and legal standards under challenging conditions. By documenting these achievements, the paper provides not only a technological artifact but also a replicable blueprint for developing advanced, regionally focused AI systems deeply rooted in their local identities.

[47] Real-World Summarization: When Evaluation Reaches Its Limits

Patrícia Schmidtová,Ondřej Dušek,Saad Mahamood

Main category: cs.CL

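TL;DR: This study finds that, for evaluating hotel highlights, simple metrics often outperform more complex LLM-based methods; LLMs generate good highlights but are unreliable as evaluators.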

Details Motivation: To assess how faithful LLM-generated content is to its input data in the specific setting of hotel highlights, and to examine how well different evaluation methods work. Method: Through human evaluation campaigns involving categorical error assessment and span-level annotation, the study compares traditional metrics, trainable methods, and LLM-as-a-judge approaches. Result: Simple metrics such as word overlap correlate fairly well with human judgments (Spearman rank correlation of 0.63); LLMs generate high-quality highlights but are unreliable as evaluators. Conclusion: LLMs excel at generating hotel highlights but are unreliable at evaluating them, because they tend to severely under- or over-annotate. Simple metrics often outperform more complex methods at assessing faithfulness, especially on out-of-domain data. Abstract: We examine evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (Spearman rank correlation of 0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows incorrect and non-checkable information pose the greatest risks. We also highlight challenges in crowdsourced evaluations.
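The comparison between a word-overlap metric and human judgments can be illustrated with a minimal sketch. The abstract does not specify the exact overlap formula or tokenization, so `word_overlap` below (fraction of summary tokens found in the source, whitespace tokenization) is an assumption made for illustration; `spearman` is the standard rank correlation, computed as the Pearson correlation of tie-averaged ranks.

```python
def word_overlap(summary: str, source: str) -> float:
    """Fraction of summary tokens that also appear in the source text.

    Hypothetical overlap definition; the paper's exact metric may differ.
    """
    summary_tokens = summary.lower().split()
    source_tokens = set(source.lower().split())
    if not summary_tokens:
        return 0.0
    hits = sum(1 for token in summary_tokens if token in source_tokens)
    return hits / len(summary_tokens)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    if std_x == 0 or std_y == 0:
        return 0.0  # correlation is undefined for constant input; 0.0 by convention here
    return cov / (std_x * std_y)
```

In use, one would score each highlight with `word_overlap`, collect the matching human faithfulness ratings, and pass both lists to `spearman`.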

cs.CV

[48] CWNet: Causal Wavelet Network for Low-Light Image Enhancement

Tongshun Zhang,Pingping Liu,Yubing Lu,Mengen Cai,Zijian Zhang,Zhe Zhang,Qiuzhan Zhou

Main category: cs.CV

TL;DR: CWNet uses causal reasoning and wavelet transforms to enhance low-light images more effectively than traditional methods, delivering superior performance across various datasets.

Details Motivation: Traditional LLIE methods focus on uniform brightness adjustment and neglect instance-level semantic information and feature characteristics, which limits their effectiveness. Method: The method involves a causal reasoning perspective combined with wavelet transforms. It includes a metric learning strategy for causal embeddings and an instance-level CLIP semantic loss, along with a wavelet transform-based backbone network. Result: Extensive experiments show that CWNet significantly outperforms existing state-of-the-art methods on multiple datasets. Conclusion: CWNet demonstrates robust performance across diverse scenes and significantly outperforms current state-of-the-art methods in low-light image enhancement. Abstract: Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. Specifically, our approach comprises two key components: 1) Inspired by the concept of intervention in causality, we adopt a causal reasoning perspective to reveal the underlying causal relationships in low-light enhancement. From a global perspective, we employ a metric learning strategy to ensure causal embeddings adhere to causal principles, separating them from non-causal confounding factors while focusing on the invariance of causal factors. At the local level, we introduce an instance-level CLIP semantic loss to precisely maintain causal factor consistency. 2) Based on our causal analysis, we present a wavelet transform-based backbone network that effectively optimizes the recovery of frequency information, ensuring precise enhancement tailored to the specific attributes of wavelet transforms. 
Extensive experiments demonstrate that CWNet significantly outperforms current state-of-the-art methods across multiple datasets, showcasing its robust performance across diverse scenes. Code is available at https://github.com/bywlzts/CWNet-Causal-Wavelet-Network.

[49] Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines

Jiayuan Chen,Thai-Hoang Pham,Yuanlong Wang,Ping Zhang

Main category: cs.CV

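TL;DR: This paper proposes a new framework that integrates external biological knowledge into pretraining strategies to improve how well microscopy image profiling models generalize to new cell lines.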

Details Motivation: Robust perturbation screening for new cell lines remains challenging because of the significant morphological and biological heterogeneity across cell lines. Method: The method builds a knowledge graph from protein interaction data in the STRING and Hetionet databases and combines it with transcriptomic features from single-cell foundation models to disentangle perturbation-specific and cell line-specific representations. Result: Experiments on the RxRx database, using one-shot and few-shot fine-tuning, show improved microscopy image profiling on new cell lines. Conclusion: The proposed framework improves the generalization of microscopy image profiling models to new cell lines, showing potential for real-world phenotype-based drug discovery. Abstract: High-throughput screening techniques, such as microscopy imaging of cellular responses to genetic and chemical perturbations, play a crucial role in drug discovery and biomedical research. However, robust perturbation screening for de novo cell lines remains challenging due to the significant morphological and biological heterogeneity across cell lines. To address this, we propose a novel framework that integrates external biological knowledge into existing pretraining strategies to enhance microscopy image profiling models. Our approach explicitly disentangles perturbation-specific and cell line-specific representations using external biological information. Specifically, we construct a knowledge graph leveraging protein interaction data from STRING and Hetionet databases to guide models toward perturbation-specific features during pretraining. Additionally, we incorporate transcriptomic features from single-cell foundation models to capture cell line-specific representations. By learning these disentangled features, our method improves the generalization of imaging models to de novo cell lines. We evaluate our framework on the RxRx database through one-shot fine-tuning on an RxRx1 cell line and few-shot fine-tuning on cell lines from the RxRx19a dataset. Experimental results demonstrate that our method enhances microscopy image profiling for de novo cell lines, highlighting its effectiveness in real-world phenotype-based drug discovery applications.

[50] Auditing Facial Emotion Recognition Datasets for Posed Expressions and Racial Bias

Rina Khan,Catherine Stinson

Main category: cs.CV

TL;DR: This paper evaluates facial expression recognition algorithms, finding issues with performance on spontaneous expressions and racial/skin color bias, which could lead to harmful outcomes in real-world use.

Details Motivation: To evaluate the performance of facial expression recognition algorithms on spontaneous versus posed expressions and address ethical concerns regarding racial and skin color bias in these models. Method: Auditing two state-of-the-art facial expression recognition datasets by sampling images to determine if they are spontaneous or posed and testing model performance across races and skin tones. Result: A significant number of images thought to be spontaneous were actually posed, and tested models showed bias towards predicting negative emotions for non-white or dark-skinned individuals even when smiling. Conclusion: Facial expression recognition algorithms face challenges in detecting spontaneous expressions and exhibit racial and skin color bias, which can lead to harmful real-life applications. Abstract: Facial expression recognition (FER) algorithms classify facial expressions into emotions such as happy, sad, or angry. An evaluative challenge facing FER algorithms is the fall in performance when detecting spontaneous expressions compared to posed expressions. An ethical (and evaluative) challenge facing FER algorithms is that they tend to perform poorly for people of some races and skin colors. These challenges are linked to the data collection practices employed in the creation of FER datasets. In this study, we audit two state-of-the-art FER datasets. We take random samples from each dataset and examine whether images are spontaneous or posed. In doing so, we propose a methodology for identifying spontaneous or posed images. We discover a significant number of images that were posed in the datasets purporting to consist of in-the-wild images. Since the performance of FER models varies between spontaneous and posed images, the performance of models trained on these datasets will not represent the true performance if such models were to be deployed in in-the-wild applications. 
We also observe the skin color of individuals in the samples, and test three models trained on each of the datasets to predict facial expressions of people from various races and skin tones. We find that the FER models audited were more likely to predict people labeled as not white or determined to have dark skin as showing a negative emotion such as anger or sadness even when they were smiling. This bias makes such models prone to perpetuate harm in real life applications.

[51] FPC-Net: Revisiting SuperPoint with Descriptor-Free Keypoint Detection via Feature Pyramids and Consistency-Based Implicit Matching

Ionuţ Grigore,Călin-Adrian Popa,Claudiu Leoveanu-Condrei

Main category: cs.CV

TL;DR: This paper introduces a descriptor-free method for matching interest points, reducing memory usage at the cost of minor accuracy loss.

Details Motivation: Traditional methods rely on descriptors for matching interest points, which can be costly in terms of computation, storage, and transmission. This work aims to address these limitations by removing the dependency on descriptors. Method: A new technique was introduced where interest points are associated inherently during detection, avoiding descriptor computation, storage, transmission, and matching. Result: The method achieves comparable performance without using descriptors, with a marginal decrease in matching accuracy but offering substantial benefits in memory efficiency. Conclusion: The proposed method eliminates the need for descriptors in interest point matching, significantly reducing memory usage despite slightly lower matching accuracy. Abstract: The extraction and matching of interest points are fundamental to many geometric computer vision tasks. Traditionally, matching is performed by assigning descriptors to interest points and identifying correspondences based on descriptor similarity. This work introduces a technique where interest points are inherently associated during detection, eliminating the need for computing, storing, transmitting, or matching descriptors. Although the matching accuracy is marginally lower than that of conventional approaches, our method completely eliminates the need for descriptors, leading to a drastic reduction in memory usage for localization systems. We assess its effectiveness by comparing it against both classical handcrafted methods and modern learned approaches.

[52] A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers

Jeffrey Joan Sam,Janhavi Sathe,Nikhil Chigali,Naman Gupta,Radhey Ruparel,Yicheng Jiang,Janmajay Singh,James W. Berck,Arko Barman

Main category: cs.CV

TL;DR: This paper introduces a spacecraft image segmentation dataset and optimized YOLO models for autonomous inspection systems in space, achieving strong performance under real-time constraints.

Details Motivation: The motivation was to address the lack of annotated spacecraft segmentation data and enable cost-effective, autonomous inspection systems for spacecraft. Method: The researchers created a dataset using real spacecraft models on mixed backgrounds, added realistic noise and distortions, and fine-tuned YOLOv8 and YOLOv11 segmentation models under defined hardware constraints. Result: The models achieved a Dice score of 0.92, Hausdorff distance of 0.69, and an inference time of about 0.5 seconds, demonstrating strong performance under real-time constraints. Conclusion: The study successfully developed a new dataset of annotated spacecraft images and fine-tuned YOLO models for efficient real-time image segmentation in space applications. Abstract: Spacecraft deployed in outer space are routinely subjected to various forms of damage due to exposure to hazardous environments. In addition, there are significant risks to the subsequent process of in-space repairs through human extravehicular activity or robotic manipulation, incurring substantial operational costs. Recent developments in image segmentation could enable the development of reliable and cost-effective autonomous inspection systems. While these models often require large amounts of training data to achieve satisfactory results, publicly available annotated spacecraft segmentation data are very scarce. Here, we present a new dataset of nearly 64k annotated spacecraft images that was created using real spacecraft models, superimposed on a mixture of real and synthetic backgrounds generated using NASA's TTALOS pipeline. To mimic camera distortions and noise in real-world image acquisition, we also added different types of noise and distortion to the images. 
Finally, we finetuned YOLOv8 and YOLOv11 segmentation models to generate performance benchmarks for the dataset under well-defined hardware and inference time constraints to mimic real-world image segmentation challenges for real-time onboard applications in space on NASA's inspector spacecraft. The resulting models, when tested under these constraints, achieved a Dice score of 0.92, Hausdorff distance of 0.69, and an inference time of about 0.5 second. The dataset and models for performance benchmark are available at https://github.com/RiceD2KLab/SWiM.
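For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of how it is computed for a pair of binary segmentation masks. Representing masks as nested lists of 0/1 values and returning 1.0 when both masks are empty are assumptions made here for illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal shape.

    Masks are nested lists of 0/1 values (an illustrative representation).
    """
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0
            intersection += 1 if (p and t) else 0
    if pred_sum + target_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2 * intersection / (pred_sum + target_sum)


# Toy 2x3 masks: 3 foreground pixels each, 2 of them shared.
predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
score = dice_score(predicted, ground_truth)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixels.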

[53] Warehouse Spatial Question Answering with LLM Agent

Hsiang-Wei Huang,Jen-Hao Cheng,Kuang-Ming Chen,Cheng-Yen Yang,Bahaa Alattar,Yi-Ru Lin,Pyongkun Kim,Sangwon Kim,Kwangju Kim,Chung-I Huang,Jenq-Neng Hwang

Main category: cs.CV

TL;DR: This paper introduces a data-efficient LLM agent system with strong spatial reasoning capabilities, successfully solving complex spatial tasks in indoor warehouse environments without large-scale MLLM fine-tuning.

Details Motivation: Spatial understanding remains a challenge for existing Multi-modal Large Language Models (MLLMs). The authors aim to address this by introducing a more efficient and effective solution that enhances spatial reasoning specifically for complex indoor environments like warehouses. Method: The authors propose an LLM agent system that integrates multiple tools for spatial reasoning and API interaction. Their approach is data-efficient and designed to handle complex spatial tasks like object retrieval, counting, and distance estimation without requiring large-scale model retraining. Result: The proposed system achieved high accuracy and efficiency on the 2025 AI City Challenge dataset, demonstrating its effectiveness in handling complex spatial question answering tasks in warehouse settings. Conclusion: The paper concludes that their proposed LLM agent system with spatial reasoning abilities significantly improves accuracy and efficiency in complex indoor warehouse scenarios, outperforming previous methods that relied on large-scale MLLM fine-tuning. Abstract: Spatial understanding has been a challenging task for existing Multi-modal Large Language Models (MLLMs). Previous methods leverage large-scale MLLM finetuning to enhance MLLM's spatial understanding ability. In this paper, we present a data-efficient approach. We propose an LLM agent system with strong and advanced spatial reasoning ability, which can be used to solve the challenging spatial question answering task in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and API tools interaction to answer the given complicated spatial question. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. 
The code is available at: https://github.com/hsiangwei0903/SpatialAgent

[54] ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

Ali Hojjat,Janek Haberer,Soren Pirk,Olaf Landsiedel

Main category: cs.CV

TL;DR: ThinkingViT is a scalable Vision Transformer architecture that dynamically adjusts computation based on input complexity, improving efficiency.

Details Motivation: Existing nested Transformer models allocate the same compute to every input regardless of its complexity, which is inefficient. Method: Introduces a Token Recycling mechanism and performs inference by progressively activating more attention heads. Result: ThinkingViT surpasses nested baselines by up to 2.0 percentage points in accuracy at the same throughput and by up to 2.9 percentage points at equal GMACs. Conclusion: ThinkingViT is a new nested Vision Transformer architecture that uses progressive thinking stages to adjust computation dynamically based on input difficulty. Abstract: Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent nested Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT initiates inference by activating a small subset of the most important attention heads and terminates early if predictions reach sufficient certainty. Otherwise, it activates additional attention heads and re-evaluates the input. At the core of ThinkingViT is our Token Recycling mechanism, which conditions each subsequent inference stage on the embeddings from the previous stage, enabling progressive improvement. Due to its backbone-preserving design, ThinkingViT also serves as a plugin upgrade for vanilla ViT. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. The source code is available at https://github.com/ds-kiel/ThinkingViT.
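The early-termination control flow described in the abstract can be sketched generically. This is not the ThinkingViT implementation: the stage interface, the use of the maximum class probability as the certainty measure, and the `threshold` value are all assumptions made for illustration. Passing `state` from one stage to the next stands in, very loosely, for conditioning a later stage on the previous stage's embeddings.

```python
def progressive_inference(stages, x, threshold=0.9):
    """Run stages from cheapest to most expensive and stop once confident.

    Each stage is a callable (x, previous_state) -> (probabilities, state).
    Returns (predicted_class, index_of_last_stage_run).
    `threshold` is a hypothetical certainty cut-off on the top probability.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    state = None
    probs = None
    last_index = 0
    for index, stage in enumerate(stages):
        probs, state = stage(x, state)
        last_index = index
        if max(probs) >= threshold:
            break  # prediction is certain enough; skip the costlier stages
    return probs.index(max(probs)), last_index


# Toy stand-ins for a small and a large sub-network (purely illustrative).
def small_stage(x, state):
    return [0.6, 0.4], "small-embeddings"


def large_stage(x, state):
    # Receives the state produced by the earlier, cheaper stage.
    return [0.05, 0.95], "large-embeddings"


hard_input_result = progressive_inference([small_stage, large_stage], x=None, threshold=0.9)
easy_input_result = progressive_inference([small_stage, large_stage], x=None, threshold=0.5)
```

With the stricter threshold the toy example falls through to the second stage; with the looser one it exits after the first, which is the source of the compute savings on easy inputs.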

[55] LLM-Guided Agentic Object Detection for Open-World Understanding

Furkan Mumcu,Michael J. Jones,Anoop Cherian,Yasin Yilmaz

Main category: cs.CV

TL;DR: This paper introduces LAOD, a label-free, zero-shot object detection method leveraging LLMs for improved autonomy and adaptability in open-world environments.

Details Motivation: Traditional object detection methods require fixed categories and costly re-training for new objects. OWOD and OVOD have limitations in autonomy due to the lack of semantic labels or reliance on user inputs. Method: LAOD utilizes a Large Language Model (LLM) to generate scene-specific object names, which are then used by an open-vocabulary detector for localization. Result: Experiments showed strong performance on detecting and naming novel objects using the LAOD framework, validated through metrics like CAAP and SNAP on datasets such as LVIS, COCO, and COCO-OOD. Conclusion: The proposed LAOD framework improves object detection by enabling label-free, zero-shot detection that dynamically adapts to novel objects without user prompts. Abstract: Object detection traditionally relies on fixed category sets, requiring costly re-training to handle novel objects. While Open-World and Open-Vocabulary Object Detection (OWOD and OVOD) improve flexibility, OWOD lacks semantic labels for unknowns, and OVOD depends on user prompts, limiting autonomy. We propose an LLM-guided agentic object detection (LAOD) framework that enables fully label-free, zero-shot detection by prompting a Large Language Model (LLM) to generate scene-specific object names. These are passed to an open-vocabulary detector for localization, allowing the system to adapt its goals dynamically. We introduce two new metrics, Class-Agnostic Average Precision (CAAP) and Semantic Naming Average Precision (SNAP), to separately evaluate localization and naming. Experiments on LVIS, COCO, and COCO-OOD validate our approach, showing strong performance in detecting and naming novel objects. Our method offers enhanced autonomy and adaptability for open-world understanding.

[56] Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization

Casey Wall,Longwei Wang,Rodrigue Rizk,KC Santosh

Main category: cs.CV

TL;DR: Winsor-CAM is a novel method for generating robust and interpretable visual explanations for CNN decision-making processes by aggregating information across all convolutional layers with human-tunable control.

Details Motivation: The decision-making process of CNNs needs interpretation for high-stakes deployment, and existing methods like Grad-CAM have limitations in focusing on the final layer or averaging across layers. Method: Winsor-CAM applies Winsorization to mitigate the influence of noisy or extreme attribution values across all convolutional layers. Result: Evaluations on standard architectures demonstrate that Winsor-CAM produces more interpretable heatmaps and achieves superior performance in localization metrics compared to Grad-CAM and uniform layer-averaging baselines. Conclusion: Winsor-CAM is a trustworthy AI advancement that provides interpretable, multi-layer insights with human-in-the-loop control. Abstract: Interpreting the decision-making process of Convolutional Neural Networks (CNNs) is critical for deploying models in high-stakes domains. Gradient-weighted Class Activation Mapping (Grad-CAM) is a widely used method for visual explanations, yet it typically focuses on the final convolutional layer or naïvely averages across layers, strategies that can obscure important semantic cues or amplify irrelevant noise. We propose Winsor-CAM, a novel, human-tunable extension of Grad-CAM that generates robust and coherent saliency maps by aggregating information across all convolutional layers. To mitigate the influence of noisy or extreme attribution values, Winsor-CAM applies Winsorization, a percentile-based outlier attenuation technique. A user-controllable threshold allows for semantic-level tuning, enabling flexible exploration of model behavior across representational hierarchies. Evaluations on standard architectures (ResNet50, DenseNet121, VGG16, InceptionV3) using the PASCAL VOC 2012 dataset demonstrate that Winsor-CAM produces more interpretable heatmaps and achieves superior performance in localization metrics, including intersection-over-union and center-of-mass alignment, when compared to Grad-CAM and uniform layer-averaging baselines. 
Winsor-CAM advances the goal of trustworthy AI by offering interpretable, multi-layer insights with human-in-the-loop control.
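Winsorization itself is a simple operation, and a minimal sketch may help make the "percentile-based outlier attenuation" concrete. The version below clips only the upper tail at a user-chosen percentile; which values Winsor-CAM clips, and whether it clips one or both tails, is not stated in the abstract, so treat the one-sided form and the `upper_q` parameter name as assumptions.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    if not sorted_values:
        raise ValueError("cannot take a percentile of an empty list")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def winsorize(values, upper_q=90.0):
    """Clip values above the upper_q-th percentile down to that percentile.

    One-sided clipping is an illustrative choice, not the paper's stated rule.
    """
    cap = percentile(sorted(values), upper_q)
    return [min(v, cap) for v in values]


# Toy per-layer importance scores with one extreme value.
layer_scores = [1, 2, 3, 4, 100]
clipped = winsorize(layer_scores, upper_q=75.0)
```

Lowering `upper_q` attenuates more of the largest values, which is the kind of knob a user-controllable threshold would expose; unlike trimming, no values are discarded.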

[57] Sparse Fine-Tuning of Transformers for Generative Tasks

Wei Chen,Jingxi Yu,Zichen Miao,Qiang Qiu

Main category: cs.CV

TL;DR: This paper proposes a sparse coding-based fine-tuning method for transformers, enhancing interpretability and performance in tasks like image editing and text-to-image customization.

Details Motivation: Existing fine-tuning methods produce dense parameter combinations that are hard to interpret. The authors aimed to develop a more interpretable and efficient fine-tuning approach using sparse representations. Method: A sparse coding-inspired fine-tuning framework was introduced, where features are represented as a sparse combination of feature dictionary atoms. This allows for interpretability by identifying atom importance through sparse coefficients. Result: The method enhanced image editing performance by improving text alignment through removal of unimportant atoms. It also efficiently constructed target concepts in text-to-image customization, outperforming baseline methods. Conclusion: The study concludes that the proposed sparse coding-based fine-tuning framework improves model interpretability and adapts efficiently to downstream tasks, outperforming existing methods. Abstract: Large pre-trained transformers have revolutionized artificial intelligence across various domains, and fine-tuning remains the dominant approach for adapting these models to downstream tasks due to the cost of training from scratch. However, in existing fine-tuning methods, the updated representations are formed as a dense combination of modified parameters, making it challenging to interpret their contributions and understand how the model adapts to new tasks. In this work, we introduce a fine-tuning framework inspired by sparse coding, where fine-tuned features are represented as a sparse combination of basic elements, i.e., feature dictionary atoms. The feature dictionary atoms function as fundamental building blocks of the representation, and tuning atoms allows for seamless adaptation to downstream tasks. Sparse coefficients then serve as indicators of atom importance, identifying the contribution of each atom to the updated representation. 
Leveraging the atom selection capability of sparse coefficients, we first demonstrate that our method enhances image editing performance by improving text alignment through the removal of unimportant feature dictionary atoms. Additionally, we validate the effectiveness of our approach in the text-to-image concept customization task, where our method efficiently constructs the target concept using a sparse combination of feature dictionary atoms, outperforming various baseline fine-tuning methods.
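
As a rough illustration of the idea in the abstract above, the sketch below builds a feature vector as a sparse combination of dictionary atoms and drops atoms whose coefficients are small. The function names, the threshold value, and the toy atoms are assumptions made for this example; the abstract does not describe the authors' actual implementation.

```python
def combine(atoms, coeffs):
    """Return the feature vector sum_i coeffs[i] * atoms[i]."""
    dim = len(atoms[0])
    feature = [0.0] * dim
    for atom, c in zip(atoms, coeffs):
        for j in range(dim):
            feature[j] += c * atom[j]
    return feature


def prune_atoms(coeffs, threshold):
    """Zero out coefficients whose magnitude is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


# Toy dictionary of four atoms in a 3-dimensional feature space (hypothetical values).
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
coeffs = [0.9, 0.0, 0.05, -0.5]

# The coefficient magnitude acts as the importance score for each atom.
pruned = prune_atoms(coeffs, threshold=0.1)
feature = combine(atoms, pruned)
```

Here the third atom (coefficient 0.05) is treated as unimportant and removed, so only two atoms contribute to the final feature.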

[58] A Lightweight and Robust Framework for Real-Time Colorectal Polyp Detection Using LOF-Based Preprocessing and YOLO-v11n

Saadat Behzadi,Danial Sharifrazi,Bita Mesbahzadeh,Javad Hassannataj Joloudarid,Roohallah Alizadehsani

Main category: cs.CV

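TL;DR: This study proposes a new method that combines LOF with YOLO-v11n for efficient colorectal polyp detection, achieving high accuracy with potential for real-time use.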

Details Motivation: Timely and accurate detection of colorectal polyps plays a key role in diagnosing and preventing colorectal cancer; this study aims to introduce a lightweight and efficient detection framework. Method: The LOF algorithm for filtering noisy data is combined with the YOLO-v11n deep learning model, trained with 5-fold cross-validation and modern augmentation strategies. Result: The method significantly improves polyp localization performance, reaching a precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%. Conclusion: The method is suitable for real-time colonoscopy support in clinical settings, and the study shows the importance of data preprocessing and model efficiency when designing effective AI systems for medical imaging. Abstract: Objectives: Timely and accurate detection of colorectal polyps plays a crucial role in diagnosing and preventing colorectal cancer, a major cause of mortality worldwide. This study introduces a new, lightweight, and efficient framework for polyp detection that combines the Local Outlier Factor (LOF) algorithm for filtering noisy data with the YOLO-v11n deep learning model. Study design: An experimental study leveraging deep learning and outlier removal techniques across multiple public datasets. Methods: The proposed approach was tested on five diverse and publicly available datasets: CVC-ColonDB, CVC-ClinicDB, Kvasir-SEG, ETIS, and EndoScene. Since these datasets originally lacked bounding box annotations, we converted their segmentation masks into suitable detection labels. To enhance the robustness and generalizability of our model, we apply 5-fold cross-validation and remove anomalous samples using the LOF method configured with 30 neighbors and a contamination ratio of 5%. Cleaned data are then fed into YOLO-v11n, a fast and resource-efficient object detection architecture optimized for real-time applications. We train the model using a combination of modern augmentation strategies to improve detection accuracy under diverse conditions. Results: Our approach significantly improves polyp localization performance, achieving a precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%. Compared to previous YOLO-based methods, our model demonstrates enhanced accuracy and efficiency. Conclusions: These results suggest that the proposed method is well-suited for real-time colonoscopy support in clinical settings.
Overall, the study underscores how crucial data preprocessing and model efficiency are when designing effective AI systems for medical imaging.
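
The abstract mentions converting segmentation masks into detection labels but does not say how. The sketch below shows one plausible way to do it, assuming a binary mask and a single bounding box around all foreground pixels, written in the common YOLO text format (class, x-center, y-center, width, height, all normalized to the image size). The function name and the one-box-per-mask simplification are assumptions for this example; images with several polyps would need one box per connected component.

```python
def mask_to_yolo_label(mask, class_id=0):
    """Convert a binary mask (list of rows of 0/1) into one YOLO label line.

    Returns None when the mask contains no foreground pixels.
    """
    height = len(mask)
    width = len(mask[0])
    # Rows and columns that contain at least one foreground pixel.
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(mask[r][c] for r in range(height))]
    if not rows:
        return None
    x_min, x_max = cols[0], cols[-1] + 1  # right edge is exclusive
    y_min, y_max = rows[0], rows[-1] + 1  # bottom edge is exclusive
    # Normalize the box centre and size by the image dimensions.
    x_center = (x_min + x_max) / 2 / width
    y_center = (y_min + y_max) / 2 / height
    box_w = (x_max - x_min) / width
    box_h = (y_max - y_min) / height
    return f"{class_id} {x_center:.4f} {y_center:.4f} {box_w:.4f} {box_h:.4f}"


# Toy 4x4 mask with a 2x2 foreground region in the middle.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
label = mask_to_yolo_label(mask)
```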

[59] Trexplorer Super: Topologically Correct Centerline Tree Tracking of Tubular Objects in CT Volumes

Roman Naeem,David Hagerman,Jennifer Alvén,Lennart Svensson,Fredrik Kahl

Main category: cs.CV

TL;DR: Trexplorer Super improves centerline tracking in medical images, overcoming limitations of previous models and enabling better evaluation through new synthetic and real datasets.

Details Motivation: Accurate tracking of tubular tree structures in human anatomy is crucial, but current models struggle with issues like duplicate branch prediction and premature termination. Evaluation of such models is also hindered by the lack of public datasets. Method: Development of Trexplorer Super with novel advancements and creation of three centerline datasets (one synthetic and two real) for comprehensive evaluation. Result: Trexplorer Super achieves superior performance compared to existing models across all developed datasets, emphasizing that strong results on synthetic data do not always translate to real-world applications. Conclusion: Trexplorer Super outperforms previous state-of-the-art models on every dataset, demonstrating the importance of thorough evaluation using real datasets. Abstract: Tubular tree structures, such as blood vessels and airways, are essential in human anatomy and accurately tracking them while preserving their topology is crucial for various downstream tasks. Trexplorer is a recurrent model designed for centerline tracking in 3D medical images but it struggles with predicting duplicate branches and terminating tracking prematurely. To address these issues, we present Trexplorer Super, an enhanced version that notably improves performance through novel advancements. However, evaluating centerline tracking models is challenging due to the lack of public datasets. To enable thorough evaluation, we develop three centerline datasets, one synthetic and two real, each with increasing difficulty. Using these datasets, we conduct a comprehensive evaluation of existing state-of-the-art (SOTA) models and compare them with our approach. Trexplorer Super outperforms previous SOTA models on every dataset. Our results also highlight that strong performance on synthetic data does not necessarily translate to real datasets. The code and datasets are available at https://github.com/RomStriker/Trexplorer-Super.

[60] Modernizing CNN-based Weather Forecast Model towards Higher Computational Efficiency

Minjong Cheon,Eunhan Goo,Su-Hyeon Shin,Muhammad Ahmed,Hyungjun Kim

Main category: cs.CV

TL;DR: This paper proposes KAI-a, a lightweight CNN-based model for global weather forecasting that achieves competitive accuracy while significantly reducing computational demands compared to Transformer-based models.

Details Motivation: Transformer-based AI weather forecast models have high training complexity and resource demands due to massive parameter sizes, necessitating an alternative approach that maintains accuracy while lowering computational needs. Method: The study introduces KAI-a, which incorporates a scale-invariant architecture and InceptionNeXt-based blocks within a geophysically-aware design. The model is trained on the ERA5 daily dataset with 67 atmospheric variables. Result: KAI-a achieves performance comparable to state-of-the-art models in medium-range weather forecasting while being computationally efficient. It has about 7 million parameters and completes training in 12 hours on a single NVIDIA L40s GPU. Additionally, it demonstrates robust skill in capturing extreme events like the 2018 European heatwave and the East Asian summer monsoon. Conclusion: KAI-a, a modernized CNN-based model for global weather forecasting, achieves competitive accuracy while significantly reducing computational requirements compared to AI models leveraging Transformer-based architectures. Abstract: Recently, AI-based weather forecast models have achieved impressive advances. These models have reached accuracy levels comparable to traditional NWP systems, marking a significant milestone in data-driven weather prediction. However, they mostly leverage Transformer-based architectures, which often leads to high training complexity and resource demands due to the massive parameter sizes. In this study, we introduce a modernized CNN-based model for global weather forecasting that delivers competitive accuracy while significantly reducing computational requirements. To present a systematic modernization roadmap, we highlight key architectural enhancements across multiple design scales from an earlier CNN-based approach. KAI-a incorporates a scale-invariant architecture and InceptionNeXt-based blocks within a geophysically-aware design, tailored to the structure of Earth system data. 
Trained on the ERA5 daily dataset with 67 atmospheric variables, the model contains about 7 million parameters and completes training in just 12 hours on a single NVIDIA L40s GPU. Our evaluation shows that KAI-a matches the performance of state-of-the-art models in medium-range weather forecasting, while offering a significantly lightweight design. Furthermore, case studies on the 2018 European heatwave and the East Asian summer monsoon demonstrate KAI-a's robust skill in capturing extreme events, reinforcing its practical utility.

[61] Commuting Distance Regularization for Timescale-Dependent Label Inconsistency in EEG Emotion Recognition

Xiaocong Zeng,Craig Michoski,Yan Pang,Dongyang Kuang

Main category: cs.CV

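TL;DR: This paper proposes new regularization methods (LVL and LGCL) for the problem of timescale-dependent label inconsistency in EEG emotion recognition, and validates their superior performance experimentally.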

Details Motivation: The paper aims to address the often-overlooked problem of Timescale Dependent Label Inconsistency (TsDLI) in EEG emotion recognition, thereby improving model generalization and explainability. Method: Drawing on the mathematics of functions of bounded variation and commute-time distances, it introduces a Local Variation Loss (LVL) and a Local-Global Consistency Loss (LGCL) within a graph-theoretic framework, and develops new evaluation metrics. Result: Experiments on the DREAMER and DEAP datasets show that the proposed methods outperform state-of-the-art baselines, particularly in the trade-off between predictive performance and interpretability. Conclusion: The paper proposes two new regularization strategies, LVL and LGCL, to address TsDLI in EEG emotion recognition, and validates their effectiveness experimentally. Abstract: In this work, we address the often-overlooked issue of Timescale Dependent Label Inconsistency (TsDLI) in training neural network models for EEG-based human emotion recognition. To mitigate TsDLI and enhance model generalization and explainability, we propose two novel regularization strategies: Local Variation Loss (LVL) and Local-Global Consistency Loss (LGCL). Both methods incorporate classical mathematical principles--specifically, functions of bounded variation and commute-time distances--within a graph theoretic framework. Complementing our regularizers, we introduce a suite of new evaluation metrics that better capture the alignment between temporally local predictions and their associated global emotion labels. We validate our approach through comprehensive experiments on two widely used EEG emotion datasets, DREAMER and DEAP, across a range of neural architectures including LSTM and transformer-based models. Performance is assessed using five distinct metrics encompassing both quantitative accuracy and qualitative consistency. Results consistently show that our proposed methods outperform state-of-the-art baselines, delivering superior aggregate performance and offering a principled trade-off between interpretability and predictive power under label inconsistency. Notably, LVL achieves the best aggregate rank across all benchmarked backbones and metrics, while LGCL frequently ranks second, highlighting the effectiveness of our framework.

[62] GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization

Shaowen Tong,Zimin Xia,Alexandre Alahi,Xuming He,Yujiao Shi

Main category: cs.CV

TL;DR: The paper introduces GeoDistill, a new framework for cross-view localization that enhances local feature learning through teacher-student learning with FoV-based masking, leading to improved localization performance and a scalable solution for real-world applications.

Details Motivation: Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. Method: GeoDistill, a Geometry guided weakly supervised self distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking. Result: GeoDistill significantly improves localization performance across different frameworks. Conclusion: GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges. Abstract: Cross-view localization, the task of estimating a camera's 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with satellite images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a Geometry guided weakly supervised self distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a panoramic image, while the student model predicts locations from a limited FoV counterpart created by FoV-based masking. By aligning the student's predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty, regardless of whether the query images are panoramas or limited FoV images. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges. 
Code and model can be found at https://github.com/tongshw/GeoDistill.
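
To make the FoV-based masking step concrete, the sketch below derives a limited-FoV "student" input from a panoramic "teacher" input by zeroing the columns that fall outside a chosen horizontal field of view, with wrap-around at 360 degrees. The function name, the zero fill value, and the toy one-row panorama are assumptions for this example; the abstract does not specify how the masking is implemented.

```python
def fov_mask(panorama, center_deg, fov_deg):
    """Zero out panorama columns outside a horizontal field of view.

    panorama: list of rows; each row spans 360 degrees of heading.
    center_deg: heading at the centre of the retained field of view.
    fov_deg: total angular width of the retained field of view.
    """
    width = len(panorama[0])
    half = fov_deg / 2.0
    masked = []
    for row in panorama:
        new_row = []
        for col, value in enumerate(row):
            # Heading of the column centre, in degrees.
            heading = (col + 0.5) * 360.0 / width
            # Smallest signed angular difference, wrapping around 360 degrees.
            diff = (heading - center_deg + 180.0) % 360.0 - 180.0
            new_row.append(value if abs(diff) <= half else 0)
        masked.append(new_row)
    return masked


# Toy panorama: one row of eight columns, 45 degrees each.
panorama = [[1, 2, 3, 4, 5, 6, 7, 8]]
# A 90-degree view centred on heading 0 keeps the first and last columns.
student_view = fov_mask(panorama, center_deg=0.0, fov_deg=90.0)
```

In a teacher-student setup of the kind the abstract describes, the teacher would see `panorama` and the student would see `student_view`, with a loss that pulls the student's predicted location toward the teacher's.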

[63] Graph Aggregation Prototype Learning for Semantic Change Detection in Remote Sensing

Zhengyi Xu,Haoran Wu,Wen Jiang,Jie Geng

Main category: cs.CV

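TL;DR: This paper proposes GAPL-SCD, a new method for semantic change detection in remote sensing that uses graph aggregation prototype learning and a multi-task joint optimization strategy to resolve conflicts in multi-task learning, and validates its superior performance on two datasets.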

Details Motivation: Because SCD involves optimizing multiple tasks simultaneously, the model is prone to negative transfer caused by task-specific learning difficulties and conflicting gradient flows. Method: A Graph Aggregation Prototype Learning framework for semantic change detection in remote sensing (GAPL-SCD) is proposed. A multi-task joint optimization method uses adaptive weight allocation and gradient rotation to alleviate task conflicts and improve multi-task learning; an interaction graph is constructed and prototypes serve as class proxies to achieve category-level domain alignment across time points, reducing interference from irrelevant changes and further enhancing multi-scale feature representation. Result: Experiments on the SECOND and Landsat-SCD datasets show significant improvements in both accuracy and robustness. Conclusion: GAPL-SCD achieves state-of-the-art performance, significantly improving accuracy and robustness on the SCD task. Abstract: Semantic change detection (SCD) extends the binary change detection task to provide not only the change locations but also the detailed "from-to" categories in multi-temporal remote sensing data. Such detailed semantic insights into changes offer considerable advantages for a wide array of applications. However, since SCD involves the simultaneous optimization of multiple tasks, the model is prone to negative transfer due to task-specific learning difficulties and conflicting gradient flows. To address this issue, we propose Graph Aggregation Prototype Learning for Semantic Change Detection in remote sensing (GAPL-SCD). In this framework, a multi-task joint optimization method is designed to optimize the primary tasks of semantic segmentation and change detection, along with the auxiliary task of graph aggregation prototype learning. Adaptive weight allocation and gradient rotation methods are used to alleviate the conflict between training tasks and improve multi-task learning capabilities. Specifically, the graph aggregation prototype learning module constructs an interaction graph using high-level features. Prototypes serve as class proxies, enabling category-level domain alignment across time points and reducing interference from irrelevant changes. Additionally, the proposed self-query multi-level feature interaction and bi-temporal feature fusion modules further enhance multi-scale feature representation, improving performance in complex scenes.
Experimental results on the SECOND and Landsat-SCD datasets demonstrate that our method achieves state-of-the-art performance, with significant improvements in accuracy and robustness for the SCD task.

[64] Robust ID-Specific Face Restoration via Alignment Learning

Yushun Fang,Lu Liu,Xiang Gao,Qiang Hu,Ning Cao,Jianghe Cui,Gang Chen,Xiaoyun Zhang

Main category: cs.CV

TL;DR: RIDFR is a novel face restoration framework that improves identity fidelity and visual quality by leveraging diffusion models and alignment learning, effectively addressing identity uncertainty issues.

Details Motivation: To overcome the unresolved challenge of identity uncertainty caused by identity-obscure inputs and stochastic generative processes in Face Restoration. Method: RIDFR uses a pre-trained diffusion model with two parallel conditioning modules: a Content Injection Module for degraded images and an Identity Injection Module for specific identity integration. It also incorporates Alignment Learning to align multiple references and suppress ID-irrelevant semantics. Result: Experiments demonstrate that RIDFR surpasses existing state-of-the-art methods, achieving high-quality ID-specific restoration with strong identity fidelity and robustness. Conclusion: The proposed RIDFR framework addresses the issue of uncertain face identity in restoration processes, outperforming state-of-the-art methods by reconstructing high-quality, ID-specific results with high fidelity and robustness. Abstract: The latest developments in Face Restoration have yielded significant advancements in visual quality through the utilization of diverse diffusion priors. Nevertheless, the uncertainty of face identity introduced by identity-obscure inputs and stochastic generative processes remains unresolved. To address this challenge, we present Robust ID-Specific Face Restoration (RIDFR), a novel ID-specific face restoration framework based on diffusion models. Specifically, RIDFR leverages a pre-trained diffusion model in conjunction with two parallel conditioning modules. The Content Injection Module inputs the severely degraded image, while the Identity Injection Module integrates the specific identity from a given image. Subsequently, RIDFR incorporates Alignment Learning, which aligns the restoration results from multiple references with the same identity in order to suppress the interference of ID-irrelevant face semantics (e.g. pose, expression, make-up, hair style). 
Experiments demonstrate that our framework outperforms the state-of-the-art methods, reconstructing high-quality ID-specific results with high identity fidelity and demonstrating strong robustness.

[65] Women Sport Actions Dataset for Visual Classification Using Small Scale Training Data

Palash Ray,Mahuya Sasmal,Asish Bera

Main category: cs.CV

TL;DR: This paper introduces the WomenSports dataset and proposes a CNN-based model with channel attention to advance women sports action classification.

Details Motivation: There is a lack of sufficient image datasets representing women sports actions with adequate intra- and inter-class variations, which limits progress in automated sports action recognition. Method: A convolutional neural network (CNN) with a channel attention scheme on local contextual regions is proposed for deep feature extraction and enhancement. Result: The deep learning method achieves 89.15% top-1 classification accuracy using ResNet-50 on the new WomenSports dataset. Conclusion: The study concludes that the proposed CNN model with a channel attention scheme achieves remarkable performance on sports and dance datasets, and the WomenSports dataset contributes to women sports action classification research. Abstract: Sports action classification representing complex body postures and player-object interactions is an emerging area in image-based sports analysis. Some works have contributed to automated sports action recognition using machine learning techniques over the past decades. However, sufficient image datasets representing women sports actions with enough intra- and inter-class variations are not available to the researchers. To overcome this limitation, this work presents a new dataset named WomenSports for women sports classification using small-scale training data. This dataset includes a variety of sports activities, covering wide variations in movements, environments, and interactions among players. In addition, this study proposes a convolutional neural network (CNN) for deep feature extraction. A channel attention scheme upon local contextual regions is applied to refine and enhance feature representation. The experiments are carried out on three different sports datasets and one dance dataset for generalizing the proposed algorithm, and the performances on these datasets are noteworthy. 
The deep learning method achieves 89.15% top-1 classification accuracy using ResNet-50 on the proposed WomenSports dataset, which is publicly available for research at Mendeley Data.

[66] Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection

Quan Bi Pay,Vishnu Monn Baskaran,Junn Yong Loo,KokSheik Wong,Simon See

Main category: cs.CV

TL;DR: This paper introduces an efficient HOI detection architecture using a wavelet backbone and ray-based encoder, improving interaction detection accuracy while reducing computational overhead.

Details Motivation: Existing HOI detection methods are inefficient and resource-intensive, prompting the need for a more effective architectural solution. Method: The authors propose a wavelet attention-like backbone to capture multi-order interactions and a ray-based encoder to optimize attention mechanisms for efficient HOI detection. Result: The proposed method achieves promising performance on ImageNet and HICO-DET datasets, highlighting its potential for accurate and efficient HOI detection. Conclusion: The paper concludes that the proposed wavelet attention-like backbone and ray-based encoder architecture enhance HOI detection efficiency and accuracy, as demonstrated by experimental results on benchmark datasets. Abstract: Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. 
The code is publicly available at [https://github.com/henry-pay/RayEncoder].

[67] Mind the Gap: Bridging Occlusion in Gait Recognition via Residual Gap Correction

Ayush Gupta,Siyuan Huang,Rama Chellappa

Main category: cs.CV

TL;DR: This paper proposes RG-Gait, a residual learning approach for occluded gait recognition that maintains high accuracy for both occluded and non-occluded (holistic) gait data.

Details Motivation: Current gait recognition methods often ignore real-world occlusion problems or require impractical data setups, leading to compromised performance in practical scenarios. Method: RG-Gait models occluded gait signatures as residual deviations from holistic gait representations and adaptively integrates learned residuals to improve recognition accuracy. Result: RG-Gait achieves improved performance on occluded gait sequences while retaining accuracy for holistic inputs, validated through experiments on Gait3D, GREW, and BRIAR datasets. Conclusion: The proposed RG-Gait method effectively addresses occluded gait recognition while maintaining performance on holistic inputs by modeling the problem as a residual learning task. Abstract: Gait is becoming popular as a method of person re-identification because of its ability to identify people at a distance. However, most current works in gait recognition do not address the practical problem of occlusions. Among those which do, some require paired tuples of occluded and holistic sequences, which are impractical to collect in the real world. Further, these approaches work on occlusions but fail to retain performance on holistic inputs. To address these challenges, we propose RG-Gait, a method for residual correction for occluded gait recognition with holistic retention. We model the problem as a residual learning task, conceptualizing the occluded gait signature as a residual deviation from the holistic gait representation. Our proposed network adaptively integrates the learned residual, significantly improving performance on occluded gait sequences without compromising the holistic recognition accuracy. We evaluate our approach on the challenging Gait3D, GREW and BRIAR datasets and show that learning the residual can be an effective technique to tackle occluded gait recognition with holistic retention.

[68] SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition

Quan Bi Pay,Vishnu Monn Baskaran,Junn Yong Loo,KokSheik Wong,Simon See

Main category: cs.CV

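TL;DR: This paper proposes SpaRTAN, a lightweight network architecture that combines kernels with multiple receptive fields and a wave-based channel aggregation module, reducing redundancy while improving performance and parameter efficiency on image recognition and object detection tasks.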

Details Motivation: Modern CNNs, like transformers, suffer from information redundancy, so more efficient architectures are needed to improve performance and parameter efficiency. Method: SpaRTAN uses convolution kernels with different receptive fields together with a wave-based channel aggregation module to capture multi-order spatial features and reduce channel redundancy. Result: On ImageNet and COCO, SpaRTAN achieves 77.7% accuracy (3.8M parameters) and 50.0% AP (21.5M parameters), respectively, outperforming existing methods. Conclusion: SpaRTAN is a lightweight network architecture that achieves efficient and competitive performance by enhancing spatial and channel-wise information processing. Abstract: The resurgence of convolutional neural networks (CNNs) in visual recognition tasks, exemplified by ConvNeXt, has demonstrated their capability to rival transformer-based architectures through advanced training methodologies and ViT-inspired design principles. However, both CNNs and transformers exhibit a simplicity bias, favoring straightforward features over complex structural representations. Furthermore, modern CNNs often integrate MLP-like blocks akin to those in transformers, but these blocks suffer from significant information redundancies, necessitating high expansion ratios to sustain competitive performance. To address these limitations, we propose SpaRTAN, a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN employs kernels with varying receptive fields, controlled by kernel size and dilation factor, to capture discriminative multi-order spatial features effectively. A wave-based channel aggregation module further modulates and reinforces pixel interactions, mitigating channel-wise redundancies. Combining the two modules, the proposed network can efficiently gather and dynamically contextualize discriminative features. Experimental results on ImageNet and COCO demonstrate that SpaRTAN achieves remarkable parameter efficiency while maintaining competitive performance. In particular, on the ImageNet-1k benchmark, SpaRTAN achieves 77.7% accuracy with only 3.8M parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver strong performance through an efficient design. On the COCO benchmark, it achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M parameters.
The code is publicly available at [https://github.com/henry-pay/SpaRTAN].
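
The abstract notes that receptive fields are controlled by kernel size and dilation factor. The standard relationship is that a kernel of size k with dilation d spans d * (k - 1) + 1 pixels along each axis while keeping the same number of weights. The sketch below just evaluates that formula; the listed configurations are illustrative assumptions and are not SpaRTAN's actual settings.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by a dilated convolution kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


# Hypothetical (kernel_size, dilation) pairs, chosen only for illustration.
configs = [(3, 1), (3, 2), (5, 2), (7, 3)]
spans = {cfg: effective_kernel_size(*cfg) for cfg in configs}
```

For instance, a 3x3 kernel with dilation 2 covers a 5x5 window with only nine weights, which is how dilation widens the receptive field without adding parameters.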

[69] Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection

Yuhu Bai,Jiangning Zhang,Yunkang Cao,Guangyuan Lu,Qingdong He,Xiangtai Li,Guanzhong Tian

Main category: cs.CV

TL;DR: FiSeCLIP is a novel approach for zero-shot anomaly detection that leverages CLIP's capabilities to improve performance in both classification and segmentation tasks, outperforming current state-of-the-art methods.

Details Motivation: The motivation is to improve zero-shot anomaly detection by leveraging the capabilities of vision-language models like CLIP, especially in scenarios where rare classes are essential and labels may be lacking. Method: FiSeCLIP combines feature matching with cross-modal alignment using training-free CLIP. It leverages other images in the same batch as reference information and uses text to filter out noisy features. It also restores local semantic correlation for fine-grained detection. Result: FiSeCLIP achieves better performance in both anomaly classification and segmentation, with improvements of +4.6% and +5.7% over AdaCLIP in segmentation metrics AU-ROC and F1-max on the MVTec-AD dataset. Conclusion: FiSeCLIP demonstrates superior performance in zero-shot anomaly detection, outperforming existing methods like AdaCLIP on benchmarks such as MVTec-AD. Abstract: With the advent of vision-language models (e.g., CLIP) in zero- and few-shot settings, CLIP has been widely applied to zero-shot anomaly detection (ZSAD) in recent research, where the rare classes are essential and expected in many applications. This study introduces FiSeCLIP for ZSAD with training-free CLIP, combining the feature matching with the cross-modal alignment. Testing with the entire dataset is impractical, while batch-based testing better aligns with real industrial needs, and images within a batch can serve as mutual reference points. Accordingly, FiSeCLIP utilizes other images in the same batch as reference information for the current image. However, the lack of labels for these references can introduce ambiguity, so we apply text information to filter out noisy features. In addition, we further explore CLIP's inherent potential to restore its local semantic correlation, adapting it for fine-grained anomaly detection tasks to enable a more accurate filtering process.
Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks, building a stronger baseline for the direction; e.g., on MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by +4.6%/+5.7% in segmentation metrics AU-ROC/F1-max.

[70] Semantically Informed Salient Regions Guided Radiology Report Generation

Zeyi Hou,Zeqiang Wei,Ruixin Yan,Ning Lang,Xiuzhuang Zhou

Main category: cs.CV

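TL;DR: To address the medically inaccurate reports caused by data bias in radiology images, the paper proposes SISRNet and validates its ability to generate accurate medical reports on two large datasets.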

Details Motivation: Because of the inherent massive data bias in radiology images, existing methods often produce fluent but medically inaccurate reports, limiting their applicability in clinical practice. Method: The proposed SISRNet explicitly identifies salient regions with medically critical characteristics using fine-grained cross-modal semantics, and systematically focuses on these high-information regions during both image modeling and report generation. Result: SISRNet outperforms other methods on the widely used IU-Xray and MIMIC-CXR datasets, effectively capturing subtle abnormal findings, mitigating the negative impact of data bias, and ultimately generating clinically accurate reports. Conclusion: SISRNet effectively addresses data bias in medical imaging and shows excellent performance on the IU-Xray and MIMIC-CXR datasets, demonstrating its potential for clinical practice. Abstract: Recent advances in automated radiology report generation from chest X-rays using deep learning algorithms have the potential to significantly reduce the arduous workload of radiologists. However, due to the inherent massive data bias in radiology images, where abnormalities are typically subtle and sparsely distributed, existing methods often produce fluent yet medically inaccurate reports, limiting their applicability in clinical practice. To address this issue effectively, we propose a Semantically Informed Salient Regions-guided (SISRNet) report generation method. Specifically, our approach explicitly identifies salient regions with medically critical characteristics using fine-grained cross-modal semantics. Then, SISRNet systematically focuses on these high-information regions during both image modeling and report generation, effectively capturing subtle abnormal findings, mitigating the negative impact of data bias, and ultimately generating clinically accurate reports. Compared to its peers, SISRNet demonstrates superior performance on widely used IU-Xray and MIMIC-CXR datasets.

[71] Human-Guided Shade Artifact Suppression in CBCT-to-MDCT Translation via Schrödinger Bridge with Conditional Diffusion

Sung Ho Kang,Hyun-Cheol Park

Main category: cs.CV

TL;DR: This paper proposes a novel CBCT-to-MDCT translation framework combining GAN-derived priors with human-guided conditional diffusion, offering improved anatomical accuracy and real-time performance with minimal sampling steps.

Details Motivation: To address limitations in conventional GANs and diffusion models for CBCT-to-MDCT translation by explicitly enforcing boundary consistency and incorporating human feedback for clinically preferred outcomes. Method: A novel framework integrating GAN-derived priors with human-guided conditional diffusion was developed, leveraging classifier-free guidance (CFG) and iterative refinement to internalize human preferences without a reward model. Result: The method demonstrated superior performance across RMSE, SSIM, LPIPS, and Dice metrics, requiring only 10 sampling steps while selectively attenuating shade artifacts and preserving fine structural details. Conclusion: The proposed CBCT-to-MDCT translation framework based on the Schrodinger Bridge formulation effectively integrates GAN-derived priors with human-guided conditional diffusion, achieving superior performance in both anatomical fidelity and perceptual controllability. Abstract: We present a novel framework for CBCT-to-MDCT translation, grounded in the Schrodinger Bridge (SB) formulation, which integrates GAN-derived priors with human-guided conditional diffusion. Unlike conventional GANs or diffusion models, our approach explicitly enforces boundary consistency between CBCT inputs and pseudo targets, ensuring both anatomical fidelity and perceptual controllability. Binary human feedback is incorporated via classifier-free guidance (CFG), effectively steering the generative process toward clinically preferred outcomes. Through iterative refinement and tournament-based preference selection, the model internalizes human preferences without relying on a reward model. Subtraction image visualizations reveal that the proposed method selectively attenuates shade artifacts in key anatomical regions while preserving fine structural detail. 
Quantitative evaluations further demonstrate superior performance across RMSE, SSIM, LPIPS, and Dice metrics on clinical datasets -- outperforming prior GAN- and fine-tuning-based feedback methods -- while requiring only 10 sampling steps. These findings underscore the effectiveness and efficiency of our framework for real-time, preference-aligned medical image translation.
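The classifier-free guidance (CFG) step mentioned in the abstract is usually implemented as a blend of a conditional and an unconditional prediction. The sketch below shows one common form of that blending rule only. The function name `cfg_combine`, the guidance weight `w`, and the toy vectors are assumptions for illustration; the paper's actual conditioning on binary human feedback and its Schrodinger Bridge sampler are not reproduced here.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Blend unconditional and conditional predictions (generic CFG rule).

    One common form of classifier-free guidance:
        eps = eps_uncond + w * (eps_cond - eps_uncond)
    w = 0 returns the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates further toward the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy usage with hypothetical two-element predictions.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], w=2.0)
```

With `w` above 1 the guided prediction moves past the conditional one, which is how a preference signal can steer sampling more strongly than plain conditioning.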

[72] Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation

Sunghyun Park,Jungsoo Lee,Shubhankar Borse,Munawar Hayat,Sungha Choi,Kyuwoong Hwang,Fatih Porikli

Main category: cs.CV

TL;DR: This paper introduces personalized open-vocabulary semantic segmentation to improve segmentation accuracy for user-specific concepts using a novel text prompt tuning approach.

Details Motivation: Traditional OVSS fails to interpret personal texts (e.g., 'my mug cup') for segmenting user-specific regions. This limitation motivates the need for a more personalized approach that can recognize such unique descriptions effectively. Method: The method uses text prompt tuning combined with 'negative mask proposal' and enriches text prompts by injecting visual embeddings of the personal concept. This approach helps in reducing false predictions while maintaining the performance of the original OVSS. Result: The proposed method demonstrates superior performance on newly established benchmarks like FSS$^\text{per}$, CUB$^\text{per}$, and ADE$^\text{per}$, showcasing its effectiveness in addressing the challenges of personalized open-vocabulary semantic segmentation. Conclusion: The paper proposes a method called personalized open-vocabulary semantic segmentation to better understand personal texts for segmenting regions of specific interest, improving performance without compromising the original OVSS capabilities. Abstract: While open-vocabulary semantic segmentation (OVSS) can segment an image into semantic regions based on arbitrarily given text descriptions even for classes unseen during training, it fails to understand personal texts (e.g., `my mug cup') for segmenting regions of specific interest to users. This paper addresses challenges like recognizing `my mug cup' among `multiple mug cups'. To overcome this challenge, we introduce a novel task termed \textit{personalized open-vocabulary semantic segmentation} and propose a text prompt tuning-based plug-in method designed to recognize personal visual concepts using a few pairs of images and masks, while maintaining the performance of the original OVSS. Based on the observation that reducing false predictions is essential when applying text prompt tuning to this task, our proposed method employs `negative mask proposal' that captures visual concepts other than the personalized concept. 
We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them. This approach enhances personalized OVSS without compromising the original OVSS performance. We demonstrate the superiority of our method on our newly established benchmarks for this task, including FSS$^\text{per}$, CUB$^\text{per}$, and ADE$^\text{per}$.

[73] Efficient Dual-domain Image Dehazing with Haze Prior Perception

Lirong Zheng,Yanshan Li,Rui Yu,Kaihao Zhang

Main category: cs.CV

TL;DR: This paper proposes DGFDNet, a novel dual-domain framework for single-image dehazing that combines spatial and frequency information effectively, achieving state-of-the-art results with real-time efficiency.

Details Motivation: Transformer-based models are effective for image dehazing but computationally expensive, limiting their real-time application. Existing methods using spatial-domain features struggle under complex haze conditions, and frequency-domain approaches suffer from weak coupling between spatial and frequency branches. Method: The paper proposes DGFDNet, a dual-domain framework combining spatial and frequency domains. It introduces the DGFDBlock containing the Haze-Aware Frequency Modulator (HAFM) and Multi-level Gating Aggregation Module (MGAM), along with a Prior Correction Guidance Branch (PCGB) for iterative refinement of haze priors. Result: DGFDNet achieves superior performance on four benchmark haze datasets, showing better robustness and real-time efficiency compared to existing methods. Conclusion: DGFDNet demonstrates state-of-the-art performance on dehazing with robustness and real-time efficiency, especially for complex outdoor scenes. Abstract: Transformer-based models exhibit strong global modeling capabilities in single-image dehazing, but their high computational cost limits real-time applicability. Existing methods predominantly rely on spatial-domain features to capture long-range dependencies, which are computationally expensive and often inadequate under complex haze conditions. While some approaches introduce frequency-domain cues, the weak coupling between spatial and frequency branches limits the overall performance. To overcome these limitations, we propose the Dark Channel Guided Frequency-aware Dehazing Network (DGFDNet), a novel dual-domain framework that performs physically guided degradation alignment across spatial and frequency domains. 
At its core, the DGFDBlock comprises two key modules: 1) the Haze-Aware Frequency Modulator (HAFM), which generates a pixel-level haze confidence map from dark channel priors to adaptively enhance haze-relevant frequency components, thereby achieving global degradation-aware spectral modulation; 2) the Multi-level Gating Aggregation Module (MGAM), which fuses multi-scale features through diverse convolutional kernels and hybrid gating mechanisms to recover fine structural details. Additionally, a Prior Correction Guidance Branch (PCGB) incorporates a closed-loop feedback mechanism, enabling iterative refinement of the prior by intermediate dehazed features and significantly improving haze localization accuracy, especially in challenging outdoor scenes. Extensive experiments on four benchmark haze datasets demonstrate that DGFDNet achieves state-of-the-art performance with superior robustness and real-time efficiency. Code is available at: https://github.com/Dilizlr/DGFDNet.
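The HAFM module is described as deriving a haze confidence map from dark channel priors. As a rough illustration of that input, the sketch below computes the classic dark channel of an image: the per-pixel minimum over the color channels, followed by a minimum over a local window. It is not the HAFM module itself; the function name `dark_channel`, the `radius` parameter, and the nested-list image format are assumptions, and how DGFDNet turns the dark channel into a confidence map is not stated in the abstract.

```python
def dark_channel(image, radius=1):
    """Compute the dark channel of an RGB image.

    image:  H x W nested list of (r, g, b) tuples with values in [0, 1].
    radius: half-width of the square window used for the local minimum.
    Returns an H x W nested list of floats.
    """
    h, w = len(image), len(image[0])
    # Step 1: minimum over the three color channels at every pixel.
    min_rgb = [[min(px) for px in row] for row in image]
    # Step 2: minimum over a (2*radius+1)^2 window, clipped at the borders.
    dark = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            dark[y][x] = min(
                min_rgb[j][i] for j in range(y0, y1) for i in range(x0, x1)
            )
    return dark
```

Haze-free outdoor regions tend to have a dark channel near zero, so larger values are a simple cue for haze.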

[74] A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion

Jie-Wen Li,Zi-Han Ye,Qingyuan Zhou,Jiayi Song,Ying He,Ben Fei,Wen-Ming Chen

Main category: cs.CV

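TL;DR: FootGait3D is a novel multi-view point cloud dataset focused on detailed modeling of the ankle-foot region during natural gait, aimed at 3D point cloud completion under occlusion and varying viewpoints.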

Details Motivation: Collecting surface geometry data of the foot and ankle under dynamic gait conditions is inherently challenging due to swing-foot occlusion and viewing limitations. Method: A custom five-camera depth sensing system was used to collect 8,403 point cloud frames from 46 subjects, including complete 5-view reconstructions and partial point clouds. Result: The FootGait3D dataset can advance research in biomechanics and multi-segment foot modeling, providing detailed 3D models of the foot in motion for clinical gait analysis, prosthetic design, and robotics applications. Conclusion: FootGait3D provides a novel multi-view dataset focused on detailed modeling of the ankle-foot region, offering a valuable testbed for biomechanics research and clinical gait analysis. Abstract: The kinematics analysis of foot-ankle complex during gait is essential for advancing biomechanical research and clinical assessment. Collecting accurate surface geometry data from the foot and ankle during dynamic gait conditions is inherently challenging due to swing foot occlusions and viewing limitations. Thus, this paper introduces FootGait3D, a novel multi-view dataset of high-resolution ankle-foot surface point clouds captured during natural gait. Different from existing gait datasets that typically target whole-body or lower-limb motion, FootGait3D focuses specifically on the detailed modeling of the ankle-foot region, offering a finer granularity of motion data. To address this, FootGait3D consists of 8,403 point cloud frames collected from 46 subjects using a custom five-camera depth sensing system. Each frame includes a complete 5-view reconstruction of the foot and ankle (serving as ground truth) along with partial point clouds obtained from only four, three, or two views. This structured variation enables rigorous evaluation of 3D point cloud completion methods under varying occlusion levels and viewpoints. Our dataset is designed for shape completion tasks, facilitating the benchmarking of state-of-the-art single-modal (e.g., PointTr, SnowflakeNet, Anchorformer) and multi-modal (e.g., SVDFormer, PointSea, CSDN) completion networks on the challenge of recovering the full foot geometry from occluded inputs. FootGait3D has significant potential to advance research in biomechanics and multi-segment foot modeling, offering a valuable testbed for clinical gait analysis, prosthetic design, and robotics applications requiring detailed 3D models of the foot during motion. 
The dataset is now available at https://huggingface.co/datasets/ljw285/FootGait3D.
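The abstract says each frame pairs a complete 5-view reconstruction with partial point clouds built from four, three, or two views. The sketch below enumerates which camera subsets could produce such partial inputs. It is only an illustration of the combinatorics: the function name `partial_view_sets` and the camera indices 0 to 4 are assumptions, and the abstract does not state whether the dataset includes every subset or only a selection.

```python
from itertools import combinations


def partial_view_sets(num_views=5, sizes=(4, 3, 2)):
    """List every camera subset of each requested size.

    Cameras are labeled 0..num_views-1 (hypothetical labels).
    Returns a dict mapping subset size -> list of camera-index tuples.
    """
    views = range(num_views)
    return {k: list(combinations(views, k)) for k in sizes}


subsets = partial_view_sets()
# Number of candidate subsets per size for a five-camera rig.
counts = {k: len(v) for k, v in subsets.items()}
```

For five cameras this gives 5 four-view, 10 three-view, and 10 two-view candidates, which shows how quickly the number of occlusion patterns grows as views are removed.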

[75] Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery

Nicolas Drapier,Aladine Chetouani,Aurélien Chateigner

Main category: cs.CV

TL;DR: GLOD introduces a transformer-first architecture for object detection in satellite imagery, combining Swin Transformers with novel modules for improved accuracy.

Details Motivation: The motivation is to address challenges in high-resolution satellite imagery using transformers and improve upon CNN-based architectures. Method: GLOD uses a Swin Transformer backbone with UpConvMixer blocks and Fusion Blocks to integrate multi-scale features for object detection. Result: GLOD achieves 32.95% on the xView dataset, outperforming SOTA methods by 11.46%. Conclusion: GLOD presents a transformer-first architecture for object detection in high-resolution satellite imagery, achieving better performance than existing methods. Abstract: We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95\% on xView, outperforming SOTA methods by 11.46\%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.

[76] Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation

Shuchang Ye,Usman Naseem,Mingyuan Meng,Jinman Kim

Main category: cs.CV

TL;DR: The paper proposes ProLearn, a Prototype-driven Learning framework for language-guided medical image segmentation that alleviates reliance on textual input.

Details Motivation: Existing medical language-guided segmentation methods rely on paired image-text input, but many medical segmentation datasets lack paired reports, and in real clinical scenarios segmentation typically precedes report generation. Method: A Prototype-driven Semantic Approximation (PSA) module is introduced that initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports, and supports a query-and-respond mechanism to approximate semantic guidance for images without textual input. Result: Experiments on the QaTa-COV19, MosMedData+, and Kvasir-SEG datasets show that ProLearn outperforms state-of-the-art language-guided segmentation methods when limited text is available. Conclusion: ProLearn effectively reduces reliance on textual input while maintaining or improving performance on medical image segmentation tasks. Abstract: Medical language-guided segmentation, integrating textual clinical reports as auxiliary guidance to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as "textual reliance", presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is limited to retrospective analysis of cases with paired reports, limiting its applicability in most clinical scenarios where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At its core, in ProLearn, we introduce a novel Prototype-driven Semantic Approximation (PSA) module to enable approximation of semantic guidance from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism which approximates semantic guidance for images without textual input, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19, MosMedData+ and Kvasir-SEG demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
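A query-and-respond mechanism over a prototype space can be pictured as a soft lookup: an image feature queries the stored prototypes and receives a weighted mixture as approximate semantic guidance. The sketch below uses softmax-weighted dot-product similarity, which is a generic attention-style stand-in and not necessarily how PSA is implemented. The function name `query_prototypes`, the `temperature` parameter, and the toy vectors are all assumptions.

```python
import math


def query_prototypes(query, prototypes, temperature=1.0):
    """Soft lookup of a query vector in a set of prototype vectors.

    Scores each prototype by its dot product with the query, converts the
    scores to weights with a numerically stable softmax, and returns the
    weights plus the weighted average of the prototypes (the "response").
    """
    scores = [
        sum(q * p for q, p in zip(query, proto)) / temperature
        for proto in prototypes
    ]
    top = max(scores)  # subtract the max before exp for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(prototypes[0])
    response = [
        sum(w * proto[d] for w, proto in zip(weights, prototypes))
        for d in range(dim)
    ]
    return weights, response
```

A lower temperature makes the lookup behave more like picking the single closest prototype, while a higher one blends several prototypes.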

[77] Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

Hayeon Kim,Ji Ha Jang,Se Young Chun

Main category: cs.CV

TL;DR: RoMaP is a new framework for precise local editing of 3D Gaussian representations, combining improved segmentation and enhanced loss functions to allow detailed, controlled modifications.

Details Motivation: Precise local 3D edits are challenging in Gaussian Splatting due to inconsistent segmentations and ambiguity in SDS loss. RoMaP aims to overcome these limitations for more accurate and drastic part-level modifications. Method: RoMaP introduces two key components: 1) 3D-GALP for accurate and consistent part segmentation using spherical harmonics coefficients, and 2) a regularized SDS loss with L1 anchor loss (via SLaMP) and additional constraints like Gaussian prior removal and 3D masking. Result: Experiments show that RoMaP achieves superior performance in local 3D editing, both qualitatively and quantitatively, enabling high-quality part-level changes while preserving contextual coherence. Conclusion: RoMaP provides a robust and flexible framework for part-level 3D Gaussian editing, achieving state-of-the-art results in both reconstructed and generated Gaussian scenes. Abstract: Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and inherently ambiguous nature of Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label property, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. 
In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, making it possible for more robust and flexible part-level 3D Gaussian editing.

[78] Joint angle model based learning to refine kinematic human pose estimation

Chang Peng,Yifei Zhou,Huifeng Xi,Shiqing Huang,Chuangye Chen,Jianming Yang,Bao Yang,Zhenyu Jiang

Main category: cs.CV

TL;DR: This paper proposes joint angle-based refinement (JAR) for marker-free human pose estimation (HPE). It builds a high-quality training dataset by approximating the temporal variation of joint angles with high-order Fourier series, then uses a bidirectional recurrent network as a post-processing module to refine HRNet estimates. The method corrects wrongly recognized joints and smooths their spatiotemporal trajectories, especially in challenging cases such as figure skating and breaking.

Details Motivation: Current HPE suffers from occasional errors in keypoint recognition and random fluctuation in keypoint trajectories when analyzing kinematic human poses. The performance of existing deep learning-based models for HPE refinement is considerably limited by inaccurate training datasets in which the keypoints are manually annotated. Method: The key techniques include: (i) A joint angle-based model of human pose, which is robust to describe kinematic human poses; (ii) Approximating temporal variation of joint angles through high-order Fourier series to get reliable 'ground truth'; (iii) A bidirectional recurrent network is designed as a post-processing module to refine the estimation of well-established HRNet. Result: Trained with the high-quality dataset constructed using our method, the network demonstrates outstanding performance to correct wrongly recognized joints and smooth their spatiotemporal trajectories. Conclusion: Tests show that joint angle-based refinement (JAR) outperforms the state-of-the-art HPE refinement network in challenging cases like figure skating and breaking. Abstract: Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional errors in keypoint recognition and random fluctuation in keypoint trajectories when analyzing kinematic human poses. The performance of existing deep learning-based models for HPE refinement is considerably limited by inaccurate training datasets in which the keypoints are manually annotated. This paper proposes a novel method to overcome the difficulty through joint angle-based modeling. 
The key techniques include: (i) A joint angle-based model of human pose, which is robust to describe kinematic human poses; (ii) Approximating temporal variation of joint angles through high order Fourier series to get reliable "ground truth"; (iii) A bidirectional recurrent network is designed as a post-processing module to refine the estimation of well-established HRNet. Trained with the high-quality dataset constructed using our method, the network demonstrates outstanding performance to correct wrongly recognized joints and smooth their spatiotemporal trajectories. Tests show that joint angle-based refinement (JAR) outperforms the state-of-the-art HPE refinement network in challenging cases like figure skating and breaking.
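Technique (ii) above, approximating the temporal variation of a joint angle with a Fourier series, can be sketched as follows. This is a minimal illustration under strong assumptions: the angle is sampled uniformly over exactly one period, and the coefficients are obtained by discrete projection rather than by whatever fitting procedure the authors use. The function names `fourier_fit` and `fourier_eval` and the `order` parameter are hypothetical.

```python
import math


def fourier_fit(samples, order):
    """Fit a truncated Fourier series to uniformly spaced samples.

    Assumes the samples cover exactly one period and order < len(samples) / 2,
    so the sampled sines and cosines are mutually orthogonal.
    Returns (a0, [(a_1, b_1), ..., (a_order, b_order)]).
    """
    n = len(samples)
    if order >= n / 2:
        raise ValueError("order must be smaller than half the sample count")
    a0 = sum(samples) / n
    coeffs = []
    for k in range(1, order + 1):
        a = 2.0 / n * sum(
            s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples)
        )
        b = 2.0 / n * sum(
            s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples)
        )
        coeffs.append((a, b))
    return a0, coeffs


def fourier_eval(a0, coeffs, i, n):
    """Evaluate the fitted series at sample index i of an n-sample period."""
    return a0 + sum(
        a * math.cos(2 * math.pi * k * i / n) + b * math.sin(2 * math.pi * k * i / n)
        for k, (a, b) in enumerate(coeffs, start=1)
    )
```

Truncating the series at a modest order keeps the slow, physically plausible motion of the joint and discards frame-to-frame jitter, which is the intuition behind using the fitted curve as a reference trajectory.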

[79] GKNet: Graph-based Keypoints Network for Monocular Pose Estimation of Non-cooperative Spacecraft

Weizhao Ma,Dong Zhou,Yuhui Hu,Zipeng He

Main category: cs.CV

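TL;DR: This paper proposes GKNet, a graph-based keypoints network for monocular pose estimation of non-cooperative spacecraft, and releases the accompanying SKD dataset.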

Details Motivation: Pose estimation of non-cooperative spacecraft is essential for on-orbit service tasks, but current keypoint detection methods perform poorly under structural symmetry and partial occlusion. Method: A graph-based keypoints network, GKNet, is proposed, and a moderate-scale dataset, SKD, is constructed for validating keypoint detectors. Result: Experiments show that GKNet achieves high accuracy and effectiveness compared with existing state-of-the-art methods. Conclusion: GKNet shows high accuracy and effectiveness in monocular pose estimation of non-cooperative spacecraft, and the code and dataset are publicly available. Abstract: Monocular pose estimation of non-cooperative spacecraft is significant for on-orbit service (OOS) tasks, such as satellite maintenance, space debris removal, and station assembly. Considering the high demands on pose estimation accuracy, mainstream monocular pose estimation methods typically consist of keypoint detectors and a PnP solver. However, current keypoint detectors remain vulnerable to structural symmetry and partial occlusion of non-cooperative spacecraft. To this end, we propose a graph-based keypoints network for the monocular pose estimation of non-cooperative spacecraft, GKNet, which leverages the geometric constraint of keypoints graph. In order to better validate keypoint detectors, we present a moderate-scale dataset for the spacecraft keypoint detection, named SKD, which consists of 3 spacecraft targets, 90,000 simulated images, and corresponding high-precision keypoint annotations. Extensive experiments and an ablation study have demonstrated the high accuracy and effectiveness of our GKNet, compared to the state-of-the-art spacecraft keypoint detectors. The code for GKNet and the SKD dataset is available at https://github.com/Dongzhou-1996/GKNet.

[80] Automatic Road Subsurface Distress Recognition from Ground Penetrating Radar Images using Deep Learning-based Cross-verification

Chang Peng,Bao Yang,Meiqi Li,Ge Zhang,Hui Sun,Zhenyu Jiang

Main category: cs.CV

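TL;DR: By constructing a large-scale GPR dataset and designing a cross-verification strategy for YOLO models, this paper substantially improves the accuracy and efficiency of automatic road subsurface distress recognition.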

Details Motivation: Conventional RSD recognition relies on experts manually analyzing GPR images, which is time-consuming and labor-intensive; deep learning has potential but is limited by the lack of high-quality datasets and insufficient network recognition capability. Method: A rigorously validated 3D GPR dataset containing 2134 samples was constructed and used to train YOLO models, and a cross-verification strategy was proposed based on the models' differing sensitivity to different types of RSD. Result: The proposed cross-verification strategy achieved recall over 98.6% in field tests, and the online detection system can reduce manual inspection labor by around 90%. Conclusion: This study proposes a novel cross-verification strategy that, combined with YOLO models applied to GPR images, achieves efficient RSD recognition and has been integrated into an online detection system, greatly reducing the manual inspection workload. Abstract: Ground penetrating radar (GPR) has become a rapid and non-destructive solution for road subsurface distress (RSD) detection. However, RSD recognition from GPR images is labor-intensive and heavily relies on inspectors' expertise. Deep learning offers the possibility for automatic RSD recognition, but its current performance is limited by two factors: scarcity of high-quality datasets for network training and insufficient capability of the network to distinguish RSD. In this study, a rigorously validated 3D GPR dataset containing 2134 samples of diverse types was constructed through field scanning. Based on the finding that the YOLO model trained with one of the three scans of GPR images exhibits varying sensitivity to specific types of RSD, we proposed a novel cross-verification strategy with outstanding accuracy in RSD recognition, achieving recall over 98.6% in field tests. The approach, integrated into an online RSD detection system, can reduce the labor of inspection by around 90%.

[81] Atmos-Bench: 3D Atmospheric Structures for Climate Insight

Tianchi Xu

Main category: cs.CV

TL;DR: This paper presents Atmos-Bench, the first 3D atmospheric benchmark, and FourCastX, a novel deep learning model that improves satellite-based atmospheric structure recovery with higher accuracy and without reliance on additional data.

Details Motivation: Existing methods for atmospheric structure analysis rely on simplified approximations and external data, introducing uncertainties and limiting accurate radiative transfer modeling. This work aims to address these limitations by providing a standardized 3D benchmark and a physics-informed deep learning approach. Method: The study introduces Atmos-Bench, a 3D atmospheric benchmark, and FourCastX, a Frequency-enhanced Spatio-Temporal Mixture-of-Experts Network. It generates high-resolution image slices from simulated 3D scattering volumes using WRF and an enhanced COSP simulator, incorporating physical constraints into the model architecture. Result: FourCastX achieved consistent performance improvements on the Atmos-Bench dataset across both 355 nm and 532 nm bands, outperforming state-of-the-art baselines while eliminating the need for auxiliary inputs. Conclusion: Atmos-Bench sets a new standard for 3D atmospheric structure recovery, enabling better climate insights and improving the accuracy of atmospheric models without relying on auxiliary inputs. Abstract: Atmospheric structure, represented by backscatter coefficients (BC) recovered from satellite LiDAR attenuated backscatter (ATB), provides a volumetric view of clouds, aerosols, and molecules, playing a critical role in human activities, climate understanding, and extreme weather forecasting. Existing methods often rely on auxiliary inputs and simplified physics-based approximations, and lack a standardized 3D benchmark for fair evaluation. However, such approaches may introduce additional uncertainties and insufficiently capture realistic radiative transfer and atmospheric scattering-absorption effects. 
To bridge these gaps, we present Atmos-Bench: the first 3D atmospheric benchmark, along with a novel FourCastX: Frequency-enhanced Spatio-Temporal Mixture-of-Experts Network that (a) generates 921,600 image slices from 3D scattering volumes simulated at 532 nm and 355 nm by coupling WRF with an enhanced COSP simulator over 384 land-ocean time steps, yielding high-quality voxel-wise references; (b) embeds ATB-BC physical constraints into the model architecture, promoting energy consistency during restoration; (c) achieves consistent improvements on the Atmos-Bench dataset across both 355 nm and 532 nm bands, outperforming state-of-the-art baseline models without relying on auxiliary inputs. Atmos-Bench establishes a new standard for satellite-based 3D atmospheric structure recovery and paves the way for deeper climate insight.

[82] A Survey on Interpretability in Visual Recognition

Qiyang Wan,Chengzhi Gao,Ruiping Wang,Xilin Chen

Main category: cs.CV

TL;DR: This paper reviews and organizes research on the interpretability of visual recognition models by proposing a human-centered taxonomy and identifying future research directions.

Details Motivation: The motivation stems from the increasing deployment of visual recognition models in critical areas like autonomous driving and medical diagnostics, necessitating better understanding and interpretability of these models. Method: The authors systematically reviewed existing research on the interpretability of visual recognition models and proposed a taxonomy based on Intent, Object, Presentation, and Methodology. Result: A comprehensive taxonomy of interpretable visual recognition methods was developed, along with an exploration of evaluation metrics and new research opportunities. Conclusion: The paper concludes that interpretability in visual recognition models is essential for their deployment in critical applications, and the proposed taxonomy provides a structured approach to categorize existing methods. Abstract: In recent years, visual recognition methods have advanced significantly, finding applications across diverse fields. While researchers seek to understand the mechanisms behind the success of these models, there is also a growing impetus to deploy them in critical areas like autonomous driving and medical diagnostics to better diagnose failures, which promotes the development of interpretability research. This paper systematically reviews existing research on the interpretability of visual recognition models and proposes a taxonomy of methods from a human-centered perspective. The proposed taxonomy categorizes interpretable recognition methods based on Intent, Object, Presentation, and Methodology, thereby establishing a systematic and coherent set of grouping criteria for these XAI methods. Additionally, we summarize the requirements for evaluation metrics and explore new opportunities enabled by recent technologies, such as large multimodal models. We aim to organize existing research in this domain and inspire future investigations into the interpretability of visual recognition models.

[83] KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model

Jie Yang,Wang Zeng,Sheng Jin,Lumin Xu,Wentao Liu,Chen Qian,Zhen Li,Ruimao Zhang

Main category: cs.CV

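TL;DR: This paper proposes KptLLM++, a new multimodal large language model dedicated to generic keypoint comprehension that integrates diverse input modalities and user-defined instructions to achieve state-of-the-art keypoint detection performance.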

Details Motivation: Existing multimodal large language models struggle to capture fine-grained semantic information, especially the precise identification and analysis of object keypoints, which play a crucial role in applications such as fine-grained image analysis, object retrieval, and behavior recognition. Method: KptLLM++ adopts an identify-then-detect paradigm that, through a structured chain-of-thought reasoning mechanism, first interprets keypoint semantics and then localizes their precise positions. The training dataset is scaled up to over 500K samples covering diverse objects, keypoint categories, image styles, and complex occlusion scenarios. Result: KptLLM++ demonstrates state-of-the-art performance on multiple keypoint detection benchmarks, with remarkable accuracy and generalization. Conclusion: As a unified multimodal large language model, KptLLM++ effectively improves the accuracy and generalization of keypoint comprehension and detection, showing its potential for fine-grained image understanding and human-AI interaction. Abstract: The emergence of Multimodal Large Language Models (MLLMs) has revolutionized image understanding by bridging textual and visual modalities. However, these models often struggle with capturing fine-grained semantic information, such as the precise identification and analysis of object keypoints. Keypoints, as structure-aware, pixel-level, and compact representations of objects, particularly articulated ones, play a crucial role in applications such as fine-grained image analysis, object retrieval, and behavior recognition. In this paper, we propose KptLLM++, a novel multimodal large language model that is specifically designed for generic keypoint comprehension through the integration of diverse input modalities guided by user-defined instructions. By unifying keypoint detection across varied contexts, KptLLM++ establishes itself as an advanced interface, fostering more effective human-AI collaboration. The model is built upon a novel identify-then-detect paradigm, which first interprets keypoint semantics and subsequently localizes their precise positions through a structured chain-of-thought reasoning mechanism. To push the boundaries of performance, we have scaled up the training dataset to over 500K samples, encompassing diverse objects, keypoint categories, image styles, and scenarios with complex occlusions. This extensive scaling enables KptLLM++ to unlock its potential, achieving remarkable accuracy and generalization. 
Comprehensive experiments on multiple keypoint detection benchmarks demonstrate its state-of-the-art performance, underscoring its potential as a unified solution for fine-grained image understanding and its transformative implications for human-AI interaction.

[84] Jellyfish Species Identification: A CNN Based Artificial Neural Network Approach

Md. Sabbir Hossen,Md. Saiduzzaman,Pabon Shaha,Mostofa Kamal Nasir

Main category: cs.CV

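TL;DR: This study develops an efficient deep learning framework that combines multiple feature extractors and classifiers to identify jellyfish species with high accuracy, offering a new approach to biodiversity challenges.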

Details Motivation: Accurate identification of jellyfish species is essential for ecological monitoring and management, because the rapid proliferation and ecological impact of jellyfish pose significant challenges for biodiversity and conservation. Method: A deep learning framework is proposed for jellyfish species detection and classification, combining feature extractors such as MobileNetV3, ResNet50, EfficientNetV2-B0, and VGG16 with multiple machine learning classifiers. Result: The best model (MobileNetV3 combined with an Artificial Neural Network) achieved an accuracy of 98%, significantly outperforming other feature extractor-classifier combinations. Conclusion: The study demonstrates the efficacy of deep learning and hybrid frameworks in addressing biodiversity challenges and advancing species detection in marine environments. Abstract: Jellyfish, a diverse group of gelatinous marine organisms, play a crucial role in maintaining marine ecosystems but pose significant challenges for biodiversity and conservation due to their rapid proliferation and ecological impact. Accurate identification of jellyfish species is essential for ecological monitoring and management. In this study, we propose a deep learning framework for jellyfish species detection and classification using an underwater image dataset. The framework integrates advanced feature extraction techniques, including MobileNetV3, ResNet50, EfficientNetV2-B0, and VGG16, combined with seven traditional machine learning classifiers and three Feedforward Neural Network classifiers for precise species identification. Additionally, we activated the softmax function to directly classify jellyfish species using the convolutional neural network models. The combination of the Artificial Neural Network with MobileNetV3 is our best-performing model, achieving an exceptional accuracy of 98%, significantly outperforming other feature extractor-classifier combinations. This study demonstrates the efficacy of deep learning and hybrid frameworks in addressing biodiversity challenges and advancing species detection in marine environments.

[85] Try Harder: Hard Sample Generation and Learning for Clothes-Changing Person Re-ID

Hankun Liu,Yujian Zhao,Guanglin Niu

Main category: cs.CV

TL;DR: This paper proposes a multimodal-guided framework for generating and learning hard samples in clothing-changing person re-identification (CC-ReID), improving model robustness and performance on benchmark datasets.

Details Motivation: The motivation stems from the challenge hard samples pose in CC-ReID tasks due to their ambiguity and lack of explicit definitions, which limits targeted learning strategies and model robustness. Method: The method involves a multimodal-guided framework called HSGL with two components: (1) DGHSG, which generates hard samples using multimodal cues, and (2) HSAL, which optimizes feature distances based on textual semantic labels to improve model robustness. Result: The experiments showed that the HSGL framework effectively increases training data hardness and diversity, accelerates convergence, and achieves state-of-the-art results on PRCC and LTCC datasets. Conclusion: The paper concludes that the proposed HSGL framework, which integrates textual and visual modalities for hard sample generation and learning, enhances the robustness and discriminative capability of models in CC-ReID tasks, achieving state-of-the-art performance on PRCC and LTCC datasets. Abstract: Hard samples pose a significant challenge in person re-identification (ReID) tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. These issues not only limit the design of targeted learning strategies but also diminish the model's robustness under clothing or viewpoint changes. In this paper, we propose a novel multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which is the first effort to unify textual and visual modalities to explicitly define, generate, and optimize hard samples within a unified paradigm. HSGL comprises two core components: (1) Dual-Granularity Hard Sample Generation (DGHSG), which leverages multimodal cues to synthesize semantically consistent samples, including both coarse- and fine-grained hard positives and negatives for effectively increasing the hardness and diversity of the training data. 
(2) Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware optimization strategy that adjusts feature distances based on textual semantic labels, encouraging the separation of hard positives and drawing hard negatives closer in the embedding space to enhance the model's discriminative capability and robustness to hard samples. Extensive experiments on multiple CC-ReID benchmarks demonstrate the effectiveness of our approach and highlight the potential of multimodal-guided hard sample generation and learning for robust CC-ReID. Notably, HSAL significantly accelerates the convergence of the targeted learning procedure and achieves state-of-the-art performance on both PRCC and LTCC datasets. The code is available at https://github.com/undooo/TryHarder-ACMMM25.

[86] MMOne: Representing Multiple Modalities in One Scene

Zhifeng Gu,Bing Wang

Main category: cs.CV

TL;DR: This paper introduces MMOne, a framework for multimodal scene representation that effectively handles modality conflicts and enhances representation efficiency and scalability.

Details Motivation: The motivation is to overcome property and granularity disparities among different modalities to improve multimodal scene representation for better understanding of the physical world. Method: The authors propose a framework called MMOne, which includes a modality modeling module with a novel modality indicator and a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians. Result: Experiments show that MMOne consistently enhances representation capability for each modality and can be readily extended to additional modalities. Conclusion: MMOne addresses modality conflicts in scene representation by disentangling multimodal information into shared and modality-specific components, enhancing representation capability and scalability. Abstract: Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. 
Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.

[87] RMAU-NET: A Residual-Multihead-Attention U-Net Architecture for Landslide Segmentation and Detection from Remote Sensing Images

Lam Pham,Cam Le,Hieu Tang,Khang Truong,Truong Nguyen,Jasmin Lampert,Alexander Schindler,Martin Boyer,Son Phan

Main category: cs.CV

TL;DR: This paper proposes a deep learning model for automatic landslide detection and segmentation using remote sensing images, achieving high performance across multiple datasets.

Details Motivation: Landslide disasters have become frequent due to extreme weather and human activities. Observing landslides is challenging due to large areas and rugged terrain, motivating the need for an automated solution. Method: A novel neural network architecture was proposed for two tasks: landslide detection and segmentation. The model was evaluated on three benchmark datasets (LandSlide4Sense, Bijie, Nepal) with extensive experiments. Result: The model achieved high F1 scores of 98.23 and 93.83 for landslide detection on LandSlide4Sense and Bijie datasets, respectively, and mIoU scores of 63.74 and 76.88 for segmentation tasks on LandSlide4Sense and Nepal datasets. Conclusion: The proposed end-to-end deep learning model effectively detects and segments landslides using remote sensing images, showing potential for integration into real-life landslide observation systems. Abstract: In recent years, landslide disasters have been reported frequently due to extreme weather events such as droughts, floods, and storms, or as a consequence of human activities such as deforestation and excessive exploitation of natural resources. However, automatically observing landslides is challenging due to the extremely large observation area and rugged topography such as mountains or highlands. This motivates us to propose an end-to-end deep-learning-based model which explores remote sensing images for automatically observing landslide events. By using remote sensing images as the input data, we can draw on a free resource and observe large, rough terrain over time. To explore the remote sensing images, we propose a novel neural network architecture for two tasks: landslide detection and landslide segmentation. We evaluated our proposed model on three benchmark datasets: LandSlide4Sense, Bijie, and Nepal. 
By conducting extensive experiments, we achieve F1 scores of 98.23 and 93.83 for the landslide detection task on the LandSlide4Sense and Bijie datasets, and mIoU scores of 63.74 and 76.88 for the segmentation task on the LandSlide4Sense and Nepal datasets. These experimental results show the potential of integrating our proposed model into real-life landslide observation systems.
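The two metrics reported above, F1 for detection and mIoU for segmentation, can be sketched as follows. This is a minimal illustration of the standard definitions, not the authors' evaluation code; the function names, the binary label convention (1 = landslide), and the flattened-list mask format are assumptions made for the example.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 marks the positive (landslide) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_iou(true_mask, pred_mask, num_classes=2):
    """Mean intersection-over-union over classes, for flattened pixel labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(true_mask, pred_mask) if t == c and p == c)
        union = sum(1 for t, p in zip(true_mask, pred_mask) if t == c or p == c)
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
    print(mean_iou([1, 1, 0, 0], [1, 0, 0, 0]))
```

Both functions return values in [0, 1]; the scores quoted in the abstract appear to be the same quantities expressed as percentages.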

[88] Assessing Color Vision Test in Large Vision-language Models

Hongfei Ye,Bin Chen,Wenxi Liu,Yu Zhang,Zhao Li,Dandan Ni,Hongyang Chen

Main category: cs.CV

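TL;DR: This paper designs a color vision testing task for large vision-language models, constructs a corresponding dataset, and proposes effective fine-tuning strategies based on error analysis to improve the models' color vision capabilities.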

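Details Motivation: Although large vision-language models have been widely adopted, their color vision capabilities have not been thoroughly studied. This paper aims to fill that gap by systematically exploring and improving the color vision performance of such models. Method: The study first defines a color vision testing task for evaluating large vision-language models and constructs a dataset covering multiple categories of questions and tasks of varying difficulty. It then analyzes the types of errors the models make on this task and proposes targeted fine-tuning strategies to improve performance. Result: The study builds a comprehensive color vision test dataset and identifies the main error types of large vision-language models on the task. Based on this analysis, the proposed fine-tuning strategies significantly improve model performance on color vision tests. Conclusion: Large vision-language models show room for improvement on color vision testing tasks; by defining a new testing task, constructing a diverse dataset, and proposing effective fine-tuning strategies, the study offers a feasible route to improving their color vision capabilities. Abstract: With the widespread adoption of large vision-language models, the capacity for color vision in these models is crucial. However, the color vision abilities of large vision-language models have not yet been thoroughly explored. To address this gap, we define a color vision testing task for large vision-language models and construct a dataset (some of the data is shown in an anonymous GitHub repository: https://anonymous.4open.science/r/color-vision-test-dataset-3BCD) that covers multiple categories of test questions and tasks of varying difficulty levels. Furthermore, we analyze the types of errors made by large vision-language models and propose fine-tuning strategies to enhance their performance in color vision tests.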

[89] Clustering-Guided Multi-Layer Contrastive Representation Learning for Citrus Disease Classification

Jun Chen,Yonghua Yu,Weifu Li,Yaohui Chen,Hong Chen

Main category: cs.CV

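TL;DR: This paper proposes the CMCRL algorithm, which addresses citrus disease classification through self-supervised learning, uses unannotated data to improve performance, and outperforms existing methods on multiple evaluation metrics.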

Details Motivation: Conventional deep learning methods rely on large amounts of high-quality annotated data, which is costly to obtain in practice, so a new method is needed that can exploit unannotated data and improve citrus disease classification accuracy. Method: The CMCRL algorithm is built around two key designs, contrasting with cluster centroids and a multi-layer contrastive training (MCT) paradigm, and is validated experimentally on the public citrus image set CDD. Result: The proposed method achieves state-of-the-art performance on the CDD dataset, with accuracy 4.5%-30.1% higher than existing methods, and performs well on other evaluation metrics such as F1 score, precision, and recall, showing strong robustness under class imbalance in particular. Conclusion: The paper proposes a new clustering-guided self-supervised multi-layer contrastive representation learning algorithm (CMCRL) for citrus disease detection and classification, which addresses optimization with massive unannotated samples, adaptation to symptom similarity across different diseases, and hierarchical feature representation learning. Abstract: Citrus, as one of the most economically important fruit crops globally, suffers severe yield depressions due to various diseases. Accurate disease detection and classification serve as critical prerequisites for implementing targeted control measures. Recent advancements in artificial intelligence, particularly deep learning-based computer vision algorithms, have substantially decreased time and labor requirements while maintaining the accuracy of detection and classification. Nevertheless, these methods predominantly rely on massive, high-quality annotated training examples to attain promising performance. By introducing two key designs: contrasting with cluster centroids and a multi-layer contrastive training (MCT) paradigm, this paper proposes a novel clustering-guided self-supervised multi-layer contrastive representation learning (CMCRL) algorithm. The proposed method demonstrates several advantages over existing counterparts: (1) optimizing with massive unannotated samples; (2) effective adaptation to the symptom similarity across distinct citrus diseases; (3) hierarchical feature representation learning. The proposed method achieves state-of-the-art performance on the public citrus image set CDD, outperforming existing methods by 4.5%-30.1% accuracy. Remarkably, our method narrows the performance gap with fully supervised counterparts (all samples are labeled). Beyond classification accuracy, our method shows great performance on other evaluation metrics (F1 score, precision, and recall), highlighting the robustness against the class imbalance challenge.

[90] How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

Che Liu,Jiazhen Pan,Weixiang Shen,Wenjia Bai,Daniel Rueckert,Rossella Arcucci

Main category: cs.CV

TL;DR: This paper evaluates general and medical-specific Vision-Language Models (VLMs) across multiple healthcare benchmarks, finding that while large models show strong zero-shot transfer, their reasoning abilities and reliability remain insufficient for clinical use.

Details Motivation: Vision-Language Models (VLMs) trained on web-scale data have shown success in natural image tasks and are increasingly being applied to healthcare. However, their effectiveness in medical domains is not well understood, prompting the need for systematic evaluation. Method: The authors conducted a comprehensive evaluation of open-source general-purpose and medically specialized VLMs across eight medical benchmarks, analyzing performance in terms of understanding and reasoning components. Result: Three key findings emerged: (1) large general-purpose models match or surpass medical-specific models on some benchmarks; (2) reasoning performance lags behind understanding; and (3) performance varies significantly across benchmarks, indicating challenges in task design and annotation quality. Conclusion: The study concludes that while large general-purpose VLMs show promise in medical tasks through zero-shot transfer, there remains a significant gap in reasoning capabilities and reliability for clinical deployment. Abstract: Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. 
Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.

[91] A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition

Xinkui Zhao,Jinsong Shu,Yangyang Wu,Guanjie Cheng,Zihe Liu,Naibo Wang,Shuiguang Deng,Zhongle Xie,Jianwei Yin

Main category: cs.CV

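TL;DR: This paper proposes the MCULoRA framework, which addresses incomplete modalities in multimodal emotion recognition through modality combination aware low-rank adaptation and dynamic parameter fine-tuning, substantially improving model performance.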

Details Motivation: Multimodal emotion recognition (MER) often faces incomplete modalities in practical applications because of sensor failures or privacy protection, and existing methods are limited by conflicting training gradients across different modality combinations. Method: The paper proposes MCULoRA, a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, comprising a modality combination aware low-rank adaptation (MCLA) module and a dynamic parameter fine-tuning (DPFT) module. Result: Experiments show that MCULoRA substantially outperforms previous incomplete multimodal learning methods on multiple benchmark datasets, improving downstream task accuracy. Conclusion: The MCULoRA framework achieves better downstream task accuracy than existing incomplete multimodal learning methods on multiple benchmark datasets and offers a new approach to parameter-efficient training. Abstract: Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications due to sensor failures or privacy protection requirements. While existing methods attempt to address various incomplete multimodal scenarios by balancing the training of each modality combination through additional gradients, these approaches face a critical limitation: training gradients from different modality combinations conflict with each other, ultimately degrading the performance of the final prediction model. In this paper, we propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models. MCULoRA consists of two key modules, modality combination aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The MCLA module effectively decouples the shared information from the distinct characteristics of individual modality combinations. The DPFT module adjusts the training ratio of modality combinations based on the separability of each modality's representation space, optimizing the learning efficiency across different modality combinations. Our extensive experimental evaluation in multiple benchmark datasets demonstrates that MCULoRA substantially outperforms previous incomplete multimodal learning approaches in downstream task accuracy.
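For readers unfamiliar with low-rank adaptation, the sketch below shows the generic LoRA forward pass, in which a frozen weight matrix W is augmented by a scaled low-rank product B·A, with one adapter stored per modality combination. It is an illustration only and is not the MCLA module from the paper: the dictionary keyed by modality combination, the function names, and the toy dimensions are all assumptions introduced for the example.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the adapter rank.

    W: d_out x d_in frozen weights; A: r x d_in; B: d_out x r.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # project down to rank r, then back up
    return [b + (alpha / r) * u for b, u in zip(base, update)]


# Hypothetical registry: one low-rank adapter per available modality combination.
adapters = {
    frozenset({"audio", "text"}): ([[1.0, 1.0]], [[1.0], [2.0]]),
    frozenset({"text"}): ([[1.0, 0.0]], [[0.0], [0.0]]),  # zero B: no change yet
}

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # shared frozen weights
    x = [2.0, 3.0]
    for combo, (A, B) in adapters.items():
        print(sorted(combo), lora_forward(x, W, A, B, alpha=1.0))
```

Initializing B to zeros, as in the second adapter, leaves the base model's output unchanged at the start of training, which is the usual LoRA convention.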

[92] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

X. Feng,H. Yu,M. Wu,S. Hu,J. Chen,C. Zhu,J. Wu,X. Chu,K. Huang

Main category: cs.CV

TL;DR: NarrLV is the first benchmark to assess narrative expression in long video generation models using Temporal Narrative Atoms and an MLLM-based evaluation framework.

Details Motivation: Current benchmarks for video generation models lack focus on narrative richness, prompting the need for a comprehensive evaluation framework like NarrLV. Method: NarrLV introduces Temporal Narrative Atoms (TNAs) to measure narrative richness and uses an MLLM-based question generation and answering framework for evaluation. Result: NarrLV demonstrates effectiveness in evaluating narrative content expression in long video generation models, aligning closely with human assessments. Conclusion: The proposed NarrLV benchmark effectively evaluates the narrative expression capabilities of long video generation models, showing close alignment with human judgments and revealing current model boundaries. Abstract: With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. 
(ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.

[93] Fairness-Aware Grouping for Continuous Sensitive Variables: Application for Debiasing Face Analysis with respect to Skin Tone

Veronika Shilova,Emmanuel Malherbe,Giovanni Palma,Laurent Risser,Jean-Michel Loubes

Main category: cs.CV

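TL;DR: This paper proposes a fairness-aware grouping method for continuous sensitive attributes (such as skin tone) that identifies critical subgroups by maximizing the inter-group variance in discrimination, and demonstrates experimentally that it improves model fairness.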

Details Motivation: Conventional fairness assessment typically divides observations into predefined groups, but when the sensitive attribute is continuous (such as skin tone), this can overlook the discrimination experienced by certain minority groups. A more precise grouping method is needed to address this. Method: Data are grouped according to observed levels of discrimination, and a new criterion based on inter-group differences in discrimination is proposed, with the optimal partition determined by maximizing inter-group variance. A group-by-group post-processing method is also used for debiasing. Result: Experiments on multiple synthetic datasets and real datasets (CelebA and FFHQ) show that the method reveals more nuanced patterns of discrimination than previously reported and is stable across datasets. The method also improves fairness with little impact on accuracy. Conclusion: The paper proposes a fairness-based grouping method that identifies discrimination arising from continuous sensitive attributes (such as skin tone) in finer detail, and shows experimentally that it improves model fairness with little impact on accuracy, making industrial application possible. Abstract: Within a legal framework, fairness in datasets and models is typically assessed by dividing observations into predefined groups and then computing fairness measures (e.g., Disparate Impact or Equality of Odds with respect to gender). However, when sensitive attributes such as skin color are continuous, dividing into default groups may overlook or obscure the discrimination experienced by certain minority subpopulations. To address this limitation, we propose a fairness-based grouping approach for continuous (possibly multidimensional) sensitive attributes. By grouping data according to observed levels of discrimination, our method identifies the partition that maximizes a novel criterion based on inter-group variance in discrimination, thereby isolating the most critical subgroups. We validate the proposed approach using multiple synthetic datasets and demonstrate its robustness under changing population distributions - revealing how discrimination is manifested within the space of sensitive attributes. Furthermore, we examine a specialized setting of monotonic fairness for the case of skin color. Our empirical results on both CelebA and FFHQ, leveraging the skin tone as predicted by an industrial proprietary algorithm, show that the proposed segmentation uncovers more nuanced patterns of discrimination than previously reported, and that these findings remain stable across datasets for a given model. Finally, we leverage our grouping model for debiasing purpose, aiming at predicting fair scores with group-by-group post-processing. 
The results demonstrate that our approach improves fairness while having minimal impact on accuracy, thus confirming our partition method and opening the door for industrial deployment.
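The idea of choosing a partition that maximizes inter-group variance in discrimination can be illustrated in the simplest setting: a one-dimensional sensitive attribute split into two groups by a single threshold. The sketch below is not the paper's criterion, which is described only as "novel" in the abstract; it substitutes the ordinary weighted between-group variance of a per-sample discrimination score, and the function name and inputs are assumptions.

```python
def best_threshold_split(attribute, discrimination):
    """Find the threshold on a 1-D attribute that maximizes the weighted
    between-group variance of the mean discrimination score.

    Returns (threshold, variance); threshold is None if no split is possible.
    """
    pairs = sorted(zip(attribute, discrimination))
    n = len(pairs)
    total = sum(d for _, d in pairs)
    overall_mean = total / n
    best_threshold, best_var = None, -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a threshold between equal attribute values
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        var = (i / n) * (left_mean - overall_mean) ** 2 \
            + ((n - i) / n) * (right_mean - overall_mean) ** 2
        if var > best_var:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_var = var
    return best_threshold, best_var


if __name__ == "__main__":
    # Toy data: samples with a high attribute value experience more discrimination.
    attr = [0.1, 0.2, 0.3, 0.8, 0.9]
    disc = [0.0, 0.0, 0.0, 1.0, 1.0]
    print(best_threshold_split(attr, disc))
```

With more than two groups or a multidimensional attribute, the search space grows quickly, so an exhaustive scan like this one would need to be replaced by a more scalable partitioning procedure.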

[94] MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection

Guanghao Wu,Chen Xu,Hai Song,Chong Wang,Qixing Zhang

Main category: cs.CV

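TL;DR: To address the scarcity of smoke image data for forest fire smoke detection and the poor results of existing generative models, the authors propose a new smoke image generation framework comprising a feature-guided network architecture, a new loss function, and a data generation strategy that combines smoke characteristics with a large language model, yielding high-quality smoke images and improved detection performance.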

Details Motivation: The scarcity of smoke image data from forest fires is a major obstacle to smoke detection, and existing inpainting models are limited in generating high-quality smoke representations, in particular showing inconsistencies between the synthesized smoke and the background context. Method: First, a pre-trained segmentation model and a multimodal model are used to obtain smoke masks and image captions. Next, a network architecture guided by mask and masked-image features is introduced to address inpainting models' insufficient use of masks and masked images. A new loss function, the mask random difference loss, is also proposed, which improves the consistency of the generated result around the mask by randomly expanding and eroding the mask edges. Finally, smoke characteristics are incorporated and a multimodal large language model is used as a filtering tool to produce a diverse and reasonable smoke image dataset. Result: The paper presents a comprehensive framework for generating forest fire smoke images, produces realistic and diverse smoke images, and improves the performance of downstream detection tasks. Conclusion: Experimental results show that the proposed framework generates realistic smoke images and effectively improves the performance of forest fire smoke detection models. Abstract: Smoke is the first visible indicator of a wildfire. With the advancement of deep learning, image-based smoke detection has become a crucial method for detecting and preventing forest fires. However, the scarcity of smoke image data from forest fires is one of the significant factors hindering the detection of forest fire smoke. Image generation models offer a promising solution for synthesizing realistic smoke images. However, current inpainting models exhibit limitations in generating high-quality smoke representations, particularly manifesting as inconsistencies between synthesized smoke and background contexts. To solve these problems, we proposed a comprehensive framework for generating forest fire smoke images. Firstly, we employed the pre-trained segmentation model and the multimodal model to obtain smoke masks and image captions. Then, to address the insufficient utilization of masks and masked images by inpainting models, we introduced a network architecture guided by mask and masked image features. We also proposed a new loss function, the mask random difference loss, which enhances the consistency of the generated effects around the mask by randomly expanding and eroding the mask edges. Finally, to generate a smoke image dataset using random masks for subsequent detection tasks, we incorporated smoke characteristics and used a multimodal large language model as a filtering tool to select diverse and reasonable smoke images, thereby improving the quality of the synthetic dataset. 
Experiments showed that our generated smoke images are realistic and diverse, and effectively enhance the performance of forest fire smoke detection models. Code is available at https://github.com/wghr123/MFGDiffusion.
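The abstract describes the mask random difference loss as randomly expanding and eroding the mask edges. The sketch below illustrates only that perturbation step on a binary mask, using 3x3 morphological dilation and erosion; how the perturbed masks enter the loss is not stated in the abstract and is not modeled here. The function names, the 3x3 neighborhood, and the `max_steps` parameter are assumptions.

```python
import random


def _morph(mask, use_max):
    """Apply one 3x3 max (dilation) or min (erosion) filter to a binary mask.

    Neighborhoods are clipped at the image border, so pixels outside the
    image are simply ignored rather than treated as 0 or 1.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [mask[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = max(vals) if use_max else min(vals)
    return out


def dilate(mask):
    return _morph(mask, use_max=True)


def erode(mask):
    return _morph(mask, use_max=False)


def random_perturb(mask, max_steps=2, rng=random):
    """Randomly expand or shrink the mask by 1..max_steps pixels."""
    op = dilate if rng.random() < 0.5 else erode
    for _ in range(rng.randint(1, max_steps)):
        mask = op(mask)
    return mask


if __name__ == "__main__":
    mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
            for y in range(5)]
    for row in random_perturb(mask, rng=random.Random(0)):
        print(row)
```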

[95] ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

Ronggang Huang,Haoxin Yang,Yan Cai,Xuemiao Xu,Huaidong Zhang,Shengfeng He

Main category: cs.CV

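TL;DR: This paper proposes ViewSRD, a new 3D visual grounding framework that handles complex spatial queries effectively through structured multi-view decomposition.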

Details Motivation: Existing methods struggle to disentangle targets from anchors in complex multi-anchor queries and to resolve inconsistencies in spatial descriptions caused by perspective variations. Method: The paper proposes the ViewSRD framework, comprising an SRD module that restructures multi-anchor queries, a Multi-TSI module that integrates textual and scene features using CCVTs, and a Textual-Scene Reasoning module that synthesizes multi-view predictions. Result: Experimental results show that ViewSRD significantly outperforms existing state-of-the-art methods on 3D visual grounding datasets. Conclusion: The ViewSRD framework performs strongly on 3D visual grounding, surpassing existing methods particularly on complex queries that require precise spatial differentiation. Abstract: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation.

[96] YOLOatr : Deep Learning Based Automatic Target Detection and Localization in Thermal Infrared Imagery

Aon Safdar,Usman Akram,Waseem Anwar,Basit Malik,Mian Ibad Ali

Main category: cs.CV

TL;DR: YOLOatr is a customized single-stage detector designed to overcome the unique challenges of Automatic Target Recognition using Thermal Infrared imagery, achieving near-perfect accuracy on a key dataset.

Details Motivation: Automatic Target Recognition from Thermal Infrared imagery is challenging due to limited datasets, domain-specific issues, and modality-specific difficulties. Contemporary deep learning models underperform in this domain, necessitating a tailored solution like YOLOatr. Method: YOLOatr modifies the detection heads, feature fusion in the neck, and uses a custom augmentation profile on the base YOLOv5s architecture to address challenges specific to Thermal Infrared imagery. Result: The proposed YOLOatr model demonstrates up to 99.6% performance accuracy on the DSIAC MWIR dataset for real-time ATR, outperforming existing state-of-the-art methods. Conclusion: YOLOatr, the modified version of YOLOv5s, achieves state-of-the-art performance in Automatic Target Recognition (ATR) for Thermal Infrared imagery. Abstract: Automatic Target Detection (ATD) and Recognition (ATR) from Thermal Infrared (TI) imagery in the defense and surveillance domain is a challenging computer vision (CV) task in comparison to the commercial autonomous vehicle perception domain. Limited datasets, peculiar domain-specific and TI modality-specific challenges, i.e., limited hardware, scale invariance issues due to greater distances, deliberate occlusion by tactical vehicles, lower sensor resolution and resultant lack of structural information in targets, effects of weather, temperature, and time of day variations, and varying target to clutter ratios all result in increased intra-class variability and higher inter-class similarity, making accurate real-time ATR a challenging CV task. Resultantly, contemporary state-of-the-art (SOTA) deep learning architectures underperform in the ATR domain. We propose a modified anchor-based single-stage detector, called YOLOatr, based on a modified YOLOv5s, with optimal modifications to the detection heads, feature fusion in the neck, and a custom augmentation profile. 
We evaluate the performance of our proposed model on a comprehensive DSIAC MWIR dataset for real-time ATR over both correlated and decorrelated testing protocols. The results demonstrate that our proposed model achieves state-of-the-art ATR performance of up to 99.6%.

[97] Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping

Yujie Zhang,Sabine Struckmeyer,Andreas Kolb,Sven Reichardt

Main category: cs.CV

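TL;DR: This study develops TomatoMAP, a high-quality dataset for fine-grained phenotyping of tomato plants, and uses deep learning models for automated assessment with performance comparable to that of experts.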

Details Motivation: Traditional plant phenotyping methods suffer from observer bias and inconsistencies that affect accuracy and reproducibility, so an automated, standardized dataset is needed to support research. Method: Data are collected with an IoT-based imaging system to build a dataset containing RGB images, bounding box annotations, and semantic segmentation annotations, which is validated with a cascading model combining MobileNetv3, YOLOv11, and MaskRCNN. Result: The trained models are comparable to domain experts in accuracy and speed, and the reliability of the approach is confirmed by Cohen's Kappa and an inter-rater agreement heatmap. Conclusion: TomatoMAP provides a comprehensive, standardized dataset that, combined with an IoT imaging system and a deep learning framework, enables reliable and efficient fine-grained plant phenotyping. Abstract: Observer bias and inconsistencies in traditional plant phenotyping methods limit the accuracy and reproducibility of fine-grained plant analysis. To overcome these challenges, we developed TomatoMAP, a comprehensive dataset for Solanum lycopersicum using an Internet of Things (IoT) based imaging system with standardized data acquisition protocols. Our dataset contains 64,464 RGB images that capture 12 different plant poses from four camera elevation angles. Each image includes manually annotated bounding boxes for seven regions of interest (ROIs), including leaves, panicle, batch of flowers, batch of fruits, axillary shoot, shoot and whole plant area, along with 50 fine-grained growth stage classifications based on the BBCH scale. Additionally, we provide a subset of 3,616 high-resolution images with pixel-wise semantic and instance segmentation annotations for fine-grained phenotyping. We validated our dataset using a cascading model deep learning framework combining MobileNetv3 for classification, YOLOv11 for object detection, and MaskRCNN for segmentation. Through AI vs. Human analysis involving five domain experts, we demonstrate that the models trained on our dataset achieve accuracy and speed comparable to the experts. Cohen's Kappa and inter-rater agreement heatmap confirm the reliability of automated fine-grained phenotyping using our approach.
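Cohen's Kappa, which the abstract uses to compare model and expert labels, corrects raw agreement for the agreement expected by chance. The following is a minimal sketch of the standard two-rater formula; the function name and the toy labels are assumptions, and the paper's actual analysis involved five domain experts.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected if the raters labeled independently.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    model = ["flower", "flower", "fruit", "fruit"]
    expert = ["flower", "fruit", "fruit", "fruit"]
    print(cohens_kappa(model, expert))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.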

[98] Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

An-Lun Liu,Yu-Wei Chao,Yi-Ting Chen

Main category: cs.CV

TL;DR: This paper introduces task-aware contact maps and a two-stage pipeline for task-oriented human grasp synthesis, showing improved grasp quality and performance by incorporating scene and task context.

Details Motivation: Traditional contact maps only consider the manipulated object and its relation with the hand, lacking awareness of the scene and task context. This limitation motivates the need for enhanced task-aware contact maps that improve the accuracy and relevance of grasping poses. Method: The paper proposes a two-stage pipeline that constructs a task-aware contact map using scene and task information, followed by synthesizing task-oriented human grasps using this map. Result: The experiments demonstrate significant improvements over existing methods in both grasp quality and task performance, validating the importance of incorporating scene and task information. Conclusion: The paper concludes that modeling both scene and task information significantly improves grasp quality and task performance in task-oriented human grasp synthesis. Abstract: In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and a metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance. See our project page for more details: https://hcis-lab.github.io/TOHGS/

[99] Detección y Cuantificación de Erosión Fluvial con Visión Artificial

Paúl Maji,Marlon Túquerres,Stalin Valencia,Marcela Valenzuela,Christian Mejia-Escobar

Main category: cs.CV

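TL;DR: This study proposes an artificial intelligence-based method for automatically identifying eroded zones and estimating their area, with an accuracy of 70%.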

Details Motivation: Traditional photogrammetric methods and geographic information system analysis require specialist knowledge and extensive manual processing, so a more efficient automated method is needed to detect and monitor fluvial erosion. Method: The state-of-the-art computer vision model YOLOv11 is adapted by fine-tuning and trained with photographs and LiDAR images. The dataset was segmented and labeled using the Roboflow platform. Result: Experimental results show a detection accuracy of 70% for erosion patterns, with precise identification of eroded areas and their extent (in pixels and square meters). The resulting EROSCAN system provides automatic segmentation and area estimation. Conclusion: The proposed AI method streamlines erosion detection and quantification, supporting more efficient decision making in risk management and territorial planning. Abstract: Fluvial erosion is a natural process that can generate significant impacts on soil stability and strategic infrastructures. The detection and monitoring of this phenomenon is traditionally addressed by photogrammetric methods and analysis in geographic information systems. These tasks require specific knowledge and intensive manual processing. This study proposes an artificial intelligence-based approach for automatic identification of eroded zones and estimation of their area. The state-of-the-art computer vision model YOLOv11, adjusted by fine-tuning and trained with photographs and LiDAR images, is used. This combined dataset was segmented and labeled using the Roboflow platform. Experimental results indicate efficient detection of erosion patterns with an accuracy of 70%, precise identification of eroded areas and reliable calculation of their extent in pixels and square meters. As a final product, the EROSCAN system has been developed, an interactive web application that allows users to upload images and obtain automatic segmentations of fluvial erosion, together with the estimated area. This tool optimizes the detection and quantification of the phenomenon, facilitating decision making in risk management and territorial planning.
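The abstract reports eroded extent in both pixels and square meters. One common way to convert between the two, sketched below, is to multiply the pixel count of a segmentation mask by the squared ground sampling distance. This is an assumption about how such a conversion could be done, not a description of EROSCAN: the function name and the `gsd_m` parameter (meters per pixel side) are hypothetical, and the approach presumes a nadir or orthorectified image with uniform scale.

```python
def eroded_area(mask, gsd_m):
    """Return (pixel_count, area_m2) for a binary segmentation mask.

    mask: list of rows, where 1 marks an eroded pixel.
    gsd_m: ground sampling distance in meters per pixel side (assumed uniform).
    """
    pixel_count = sum(sum(1 for v in row if v == 1) for row in mask)
    return pixel_count, pixel_count * gsd_m ** 2


if __name__ == "__main__":
    mask = [
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
    ]
    pixels, area = eroded_area(mask, gsd_m=0.05)  # 5 cm per pixel
    print(pixels, area)
```

For oblique photographs the scale varies across the image, so a single `gsd_m` value would only give a rough estimate.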

[100] A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction

Haoxuan Qu,Yujun Cai,Hossein Rahmani,Ajay Kumar,Junsong Yuan,Jun Liu

Main category: cs.CV

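TL;DR: This paper is the first to enable multiple types of geometric primitives in Gaussian Splatting, markedly improving surface reconstruction quality.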

Details Motivation: Existing Gaussian Splatting-based methods use only a single type of splatting primitive (e.g., an ellipse or ellipsoid), which makes it difficult to represent complex object surfaces in high quality. Method: A compositional splatting strategy, a mixed-primitive initialization strategy, and a vertex pruning mechanism are proposed and integrated into the Gaussian Splatting pipeline. Result: Extensive experiments confirm the effectiveness and strong performance of the proposed framework on surface reconstruction. Conclusion: The paper presents a new Gaussian Splatting framework that combines multiple geometric primitives for surface reconstruction, improving reconstruction quality and accuracy. Abstract: Recently, Gaussian Splatting (GS) has received a lot of attention in surface reconstruction. However, while 3D objects can be of complex and diverse shapes in the real world, existing GS-based methods are limited to a single type of splatting primitive (Gaussian ellipse or Gaussian ellipsoid) to represent object surfaces during their reconstruction. In this paper, we highlight that this can be insufficient for object surfaces to be represented in high quality. Thus, we propose a novel framework that, for the first time, enables Gaussian Splatting to incorporate multiple types of (geometrical) primitives during its surface reconstruction process. Specifically, in our framework, we first propose a compositional splatting strategy, enabling the splatting and rendering of different types of primitives in the Gaussian Splatting pipeline. In addition, we also design our framework with a mixed-primitive-based initialization strategy and a vertex pruning mechanism, so that its surface representation learning process makes good use of the different types of primitives. Extensive experiments show the efficacy of our framework and its accurate surface reconstruction performance.

[101] MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network

Jianfei Jiang,Qiankun Liu,Haochen Yu,Hongyuan Liu,Liyong Wang,Jiansheng Chen,Huimin Ma

Main category: cs.CV

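TL;DR: This study proposes MonoMVSNet, a new multi-view stereo network guided by monocular features and depth, to address the poor performance of existing MVS methods on textureless regions and reflective surfaces; it achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets.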

Details Motivation: Existing multi-view stereo (MVS) methods struggle with textureless regions and reflective surfaces because feature matching fails there, whereas monocular depth estimation needs no feature matching and can therefore produce robust relative depth estimates in these regions. Method: First, the monocular feature of the reference view is integrated into source view features via attention with a newly designed cross-view position encoding; next, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during sampling; finally, a relative consistency loss based on the monocular depth supervises the depth prediction. Result: Extensive experiments show that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets and ranks first on the Tanks-and-Temples Intermediate and Advanced benchmarks. Conclusion: MonoMVSNet is an effective learning-based MVS method that exploits the strengths of monocular depth estimation and performs well in challenging regions. Abstract: Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at https://github.com/JianfeiJ/MonoMVSNet.

[102] UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

Peiran Wu,Yunze Liu,Zhengdong Zhu,Enmin Zhou,Shawn Shen

Main category: cs.CV

TL;DR: This paper introduces UGC-VideoCap, a new benchmark and model framework for multimodal captioning of short user-generated videos.

Details Motivation: Existing video captioning models are overly visual-centric and overlook the role of audio in conveying scene dynamics, speaker intent, and context. Method: The UGC-VideoCaptioner(3B) model is proposed and trained with a two-stage strategy: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO). Result: A dataset of 1000 TikTok videos is created, with captions produced through a structured three-stage human annotation pipeline. Conclusion: UGC-VideoCap provides a high-quality benchmark and a data-efficient solution, advancing multimodal video captioning in real-world settings. Abstract: Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio-visual content. However, existing video captioning benchmarks and models remain predominantly visual-centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of omni datasets and lightweight, capable models hampers progress in fine-grained, multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short-form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three-stage human-in-the-loop pipeline covering audio-only, visual-only, and joint audio-visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross-modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner(3B), a 3B-parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy, supervised fine-tuning followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.
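
The second training stage named above is GRPO. As a rough illustration of its core idea, the sketch below normalizes each sampled caption's reward against the other samples drawn for the same prompt. The reward values, the `eps` term, and the function name are hypothetical; the abstract does not specify the paper's reward design or hyperparameters.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    sample is scored relative to its siblings, not by a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four captions sampled for one video.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Captions scoring above the group mean receive positive advantages and are reinforced; the rest are pushed down.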

[103] Attributes Shape the Embedding Space of Face Recognition Models

Pierrick Leroy,Antonio Mastropietro,Marco Nurisso,Francesco Vaccarino

Main category: cs.CV

TL;DR: This paper explores how Face Recognition models respond to various facial and image attributes using a geometric approach and physics-inspired metric, revealing insights into model invariance and interpretability.

Details Motivation: The authors observe a multiscale geometric structure in the embedding space influenced by facial and image attributes. They aim to provide a deeper understanding and interpretability of FR models' behavior concerning these attributes. Method: A geometric approach is proposed to describe the dependence or invariance of FR models to facial and image attributes, introducing a physics-inspired alignment metric. The metric is evaluated on controlled models and widely used FR models fine-tuned with synthetic data. Result: The findings reveal varying degrees of invariance in FR models across different attributes, enabling deeper interpretability of their strengths and weaknesses. Conclusion: FR models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses. Abstract: Face Recognition (FR) tasks have made significant progress with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast). We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability. Code available here: https://github.com/mantonios107/attrs-fr-embs

[104] Implementing Adaptations for Vision AutoRegressive Model

Kaif Shaikh,Antoni Kowalczuk,Franziska Boenisch,Adam Dziedzic

Main category: cs.CV

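TL;DR: This paper studies adaptation methods for the Vision AutoRegressive model (VAR), finding that it outperforms diffusion models on non-differentially-private tasks but performs worse on differentially private ones.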

Details Motivation: Although many adaptation techniques exist for diffusion models, the Vision AutoRegressive model (VAR) remains little studied, especially adaptation under differential privacy. Method: The authors implement and benchmark a range of adaptation strategies for VAR and compare them with state-of-the-art DM adaptation strategies. Result: VAR performs well on non-DP adaptation tasks, but its performance drops markedly on DP adaptation tasks. Conclusion: VAR outperforms diffusion models (DMs) on non-differentially-private (non-DP) adaptation but performs poorly under differentially private adaptation, which calls for further research. Abstract: The Vision AutoRegressive model (VAR) was recently introduced as an alternative to Diffusion Models (DMs) in the image generation domain. In this work we focus on its adaptations, which aim to fine-tune pre-trained models to perform specific downstream tasks, like medical data generation. While for DMs there exist many techniques, adaptations for VAR remain underexplored. Similarly, differentially private (DP) adaptations, which aim to preserve the privacy of the adaptation data, have been extensively studied for DMs, while VAR lacks such solutions. In our work, we implement and benchmark many strategies for VAR, and compare them to state-of-the-art DM adaptation strategies. We observe that VAR outperforms DMs for non-DP adaptations; however, the performance of DP adaptations suffers, which necessitates further research in private adaptations for VAR. Code is available at https://github.com/sprintml/finetuning_var_dp.

[105] COLI: A Hierarchical Efficient Compressor for Large Images

Haoran Wang,Hanyu Pei,Yang Lyu,Kai Zhang,Li Li,Feng-Lei Fan

Main category: cs.CV

TL;DR: COLI improves implicit neural representation-based image compression with faster training and better efficiency.

Details Motivation: Existing compression techniques struggle with preserving details or generalizing well; COLI addresses these issues for large, high-resolution images. Method: COLI uses a pretraining-finetuning paradigm, mixed-precision training, and Hyper-Compression to improve NeRV-based compression. Result: COLI achieves faster convergence and improved compression ratios on medical imaging datasets without sacrificing image quality. Conclusion: COLI provides an effective solution for large image compression, achieving better performance in PSNR and SSIM while reducing bits per pixel and accelerating training. Abstract: The escalating adoption of high-resolution, large-field-of-view imagery amplifies the need for efficient compression methodologies. Conventional techniques frequently fail to preserve critical image details, while data-driven approaches exhibit limited generalizability. Implicit Neural Representations (INRs) present a promising alternative by learning continuous mappings from spatial coordinates to pixel intensities for individual images, thereby storing network weights rather than raw pixels and avoiding the generalization problem. However, INR-based compression of large images faces challenges including slow compression speed and suboptimal compression ratios. To address these limitations, we introduce COLI (Compressor for Large Images), a novel framework leveraging Neural Representations for Videos (NeRV). First, recognizing that INR-based compression constitutes a training process, we accelerate its convergence through a pretraining-finetuning paradigm, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Second, capitalizing on INRs' transformation of image storage constraints into weight storage, we implement Hyper-Compression, a novel post-training technique to substantially enhance compression ratios while maintaining minimal output distortion. 
Evaluations across two medical imaging datasets demonstrate that COLI consistently achieves competitive or superior PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while accelerating NeRV training by up to 4 times.
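
The evaluation above is stated in terms of PSNR and bits per pixel (bpp). A small sketch of both quantities follows, assuming 8-bit pixel intensities and an INR whose storage cost is simply its weight count times the bits per weight; the function names and toy numbers are hypothetical and do not reproduce COLI's pipeline.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    n = len(original)
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def bits_per_pixel(num_weights, bits_per_weight, height, width):
    """Storage cost of a weight-based representation, per image pixel."""
    return num_weights * bits_per_weight / (height * width)


# Toy example: every pixel is off by exactly 1, so MSE = 1.
original = [0, 128, 255, 64]
reconstructed = [1, 127, 254, 65]
print(round(psnr(original, reconstructed), 2), "dB")

# Assumed network: 65536 weights stored at 16 bits for a 1024x1024 image.
print(bits_per_pixel(65536, 16, 1024, 1024), "bpp")
```

Because an INR stores weights in place of pixels, shrinking the stored weights after training lowers bpp directly, which is the quantity a post-training step like Hyper-Compression targets.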

[106] HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing

Pan Du,Mingqi Xu,Xiaozhi Zhu,Jian-xun Wang

Main category: cs.CV

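TL;DR: This paper proposes HUG-VAS, a new vascular geometry modeling method that combines NURBS parameterization with diffusion-based generative models, generating high-fidelity aortic structures and enabling several practical applications.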

Details Motivation: Traditional statistical shape models (SSMs) rely on linear assumptions and struggle with complex topologies such as multi-branch vascular structures, so a more effective approach to vascular geometry modeling is needed. Method: HUG-VAS combines NURBS surface parameterization with diffusion-based generative modeling in a hierarchical architecture, comprising a denoising diffusion model that generates centerlines and a guided diffusion model that synthesizes radial profiles conditioned on those centerlines. Result: Trained on 21 patient-specific samples, HUG-VAS generates anatomically realistic aortas with supra-aortic branches and yields biomarker distributions that closely match the original dataset. The framework also supports zero-shot conditional generation from image-derived priors. Conclusion: HUG-VAS is the first SSM framework to bridge image-derived priors with generative shape modeling via a unified integration of NURBS parameterization and hierarchical diffusion processes. Abstract: Accurate characterization of vascular geometry is essential for cardiovascular diagnosis and treatment planning. Traditional statistical shape modeling (SSM) methods rely on linear assumptions, limiting their expressivity and scalability to complex topologies such as multi-branch vascular structures. We introduce HUG-VAS, a Hierarchical NURBS Generative model for Vascular geometry Synthesis, which integrates NURBS surface parameterization with diffusion-based generative modeling to synthesize realistic, fine-grained aortic geometries. Trained with 21 patient-specific samples, HUG-VAS generates anatomically faithful aortas with supra-aortic branches, yielding biomarker distributions that closely match those of the original dataset. HUG-VAS adopts a hierarchical architecture comprising a denoising diffusion model that generates centerlines and a guided diffusion model that synthesizes radial profiles conditioned on those centerlines, thereby capturing two layers of anatomical variability. Critically, the framework supports zero-shot conditional generation from image-derived priors, enabling practical applications such as interactive semi-automatic segmentation, robust reconstruction under degraded imaging conditions, and implantable device optimization. To our knowledge, HUG-VAS is the first SSM framework to bridge image-derived priors with generative shape modeling via a unified integration of NURBS parameterization and hierarchical diffusion processes.

[107] 3C-FBI: A Combinatorial method using Convolutions for Circle Fitting in Blurry Images

Esteban Román Catafau,Torbjörn E. M. Nordling

Main category: cs.CV

TL;DR: This paper introduces 3C-FBI, a novel method for robust circle detection and fitting that combines combinatorial edge pixel sampling and convolution-based density estimation, achieving high accuracy and real-time performance even in challenging imaging conditions.

Details Motivation: The motivation is to address the challenge of accurate and efficient circle detection in degraded images, which is crucial for applications like medical imaging, robotics, and industrial inspection. Method: The paper proposes 3C-FBI, which integrates efficient combinatorial edge pixel sampling with convolution-based density estimation in parameter space for robust circle detection and fitting. Result: 3C-FBI achieves a Jaccard index of 0.896 on real-world data with 40.3 fps performance, maintains near-perfect accuracy at high resolutions, and performs reliably even with up to 20% outliers. In synthetic tests, it achieves a mean Jaccard Index of 0.989, comparable to modern methods. Conclusion: 3C-FBI combines combinatorial edge pixel sampling and convolution-based density estimation to achieve state-of-the-art performance in circle detection and fitting, especially under degraded imaging conditions. Abstract: This paper addresses the fundamental computer vision challenge of robust circle detection and fitting in degraded imaging conditions. We present Combinatorial Convolution-based Circle Fitting for Blurry Images (3C-FBI), an algorithm that bridges the gap between circle detection and precise parametric fitting by combining (1) efficient combinatorial edge pixel (edgel) sampling and (2) convolution-based density estimation in parameter space. We evaluate 3C-FBI across three experimental frameworks: (1) real-world medical data from Parkinson's disease assessments (144 frames from 36 videos), (2) controlled synthetic data following established circle-fitting benchmarks, and (3) systematic analysis across varying spatial resolutions and outlier contamination levels. Results show that 3C-FBI achieves state-of-the-art accuracy (Jaccard index 0.896) while maintaining real-time performance (40.3 fps), significantly outperforming classical methods like RCD (6.8 fps) on a standard CPU (i7-10875H). 
It maintains near-perfect accuracy (Jaccard almost 1.0) at high resolutions (480x480) and reliable performance (Jaccard higher than 0.95) down to 160x160 with up to 20% outliers. In extensive synthetic testing, 3C-FBI achieves a mean Jaccard Index of 0.989 across contamination levels, comparable to modern methods like Qi et al. (2024, 0.991), and surpassing RHT (0.964). This combination of accuracy, speed, and robustness makes 3C-FBI ideal for medical imaging, robotics, and industrial inspection under challenging conditions.
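
To make the combinatorial sampling step concrete, the sketch below fits a circle through every triplet of edge pixels and then takes the per-parameter median of the candidates. The median is a deliberately simple stand-in chosen for this illustration; it is not the convolution-based density estimation in parameter space that the paper describes, and all function names and test points are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def circle_from_three_points(p1, p2, p3):
    """Return (cx, cy, r) of the circle through three points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-9:
        return None  # collinear points define no circle
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)


def fit_circle(edgels):
    """Median of circle candidates over all edgel triplets (robust to few outliers)."""
    candidates = [
        c
        for c in (circle_from_three_points(*t) for t in combinations(edgels, 3))
        if c is not None
    ]
    return tuple(median(c[i] for c in candidates) for i in range(3))


# Eight points on a circle centered at (5, 3) with radius 2, plus one outlier.
edgels = [
    (5 + 2 * math.cos(k * math.pi / 4), 3 + 2 * math.sin(k * math.pi / 4))
    for k in range(8)
]
edgels.append((9.0, 8.0))
cx, cy, r = fit_circle(edgels)
print(round(cx, 6), round(cy, 6), round(r, 6))
```

Exhaustive triplet enumeration grows cubically with the number of edgels, so a practical implementation would subsample; the abstract's "efficient combinatorial edge pixel sampling" presumably addresses this cost.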

[108] COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation

Pakizar Shamoi,Nuray Toganas,Muragul Muratbekova,Elnara Kadyrgali,Adilet Yerkin,Ayan Igali,Malika Ziyada,Ayana Adilova,Aron Karatayev,Yerdauit Torekhan

Main category: cs.CV

TL;DR: This paper introduces COLIBRI, a new fuzzy logic-based color model that accurately represents human color perception better than traditional models by incorporating data from a large-scale human survey.

Details Motivation: Computers struggle to replicate human color perception, which is crucial in fields like design, AI, marketing, and human-computer interaction. This research aims to develop a model that aligns computational representations with how humans perceive color. Method: A three-phase approach was employed: identifying distinguishable color stimuli through experiments, conducting a large-scale categorization survey with over 1000 participants, and extracting fuzzy partitions to create membership functions that reflect perceptual uncertainty. The model also includes an adaptation mechanism for refinement. Result: The COLIBRI model demonstrated alignment with human perception better than traditional models like RGB, HSV, and LAB, based on comparative evaluations. It was built using data from a large and representative sample (n=2496), making it unique compared to previous studies. Conclusion: The COLIBRI model successfully bridges the gap between computational color representation and human perception, offering adaptability and precision in modeling perceptual uncertainty. Abstract: Colors are omnipresent in today's world and play a vital role in how humans perceive and interact with their surroundings. However, it is challenging for computers to imitate human color perception. This paper introduces the Human Perception-Based Fuzzy Color Model, COLIBRI (Color Linguistic-Based Representation and Interpretation), designed to bridge the gap between computational color representations and human visual perception. The proposed model uses fuzzy sets and logic to create a framework for color categorization. Using a three-phase experimental approach, the study first identifies distinguishable color stimuli for hue, saturation, and intensity through preliminary experiments, followed by a large-scale human categorization survey involving more than 1000 human subjects. 
The resulting data are used to extract fuzzy partitions and generate membership functions that reflect real-world perceptual uncertainty. The model incorporates a mechanism for adaptation that allows refinement based on feedback and contextual changes. Comparative evaluations demonstrate the model's alignment with human perception compared to traditional color models, such as RGB, HSV, and LAB. To the best of our knowledge, no previous research has documented the construction of a model for color attribute specification based on a sample of this size or a comparable sample of the human population (n = 2496). Our findings are significant for fields such as design, artificial intelligence, marketing, and human-computer interaction, where perceptually relevant color representation is critical.
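
The fuzzy partitions described above assign each color a degree of membership in several linguistic categories at once. The sketch below shows the general mechanism with trapezoidal membership functions over hue. The category names and breakpoints are invented for illustration and are not the survey-derived partitions of COLIBRI.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical hue partitions in degrees (not the paper's fitted values).
HUE_SETS = {
    "yellow": (40, 50, 65, 80),
    "green": (65, 80, 150, 170),
}


def memberships(hue):
    """Degree to which a hue belongs to each linguistic color category."""
    return {name: trapezoid(hue, *params) for name, params in HUE_SETS.items()}


# A hue halfway through the yellow-green transition belongs partly to both.
print(memberships(72.5))
```

Overlapping slopes are what let the model express perceptual uncertainty: a borderline hue is "somewhat yellow and somewhat green" instead of being forced into one bin.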

[109] CATVis: Context-Aware Thought Visualization

Tariq Mehmood,Hamza Ahmad,Muhammad Haroon Shakeel,Murtaza Taj

Main category: cs.CV

TL;DR: This paper proposes a novel 5-stage framework for decoding visual representations from EEG signals, resulting in improved image generation performance over state-of-the-art approaches.

Details Motivation: Decoding visual representations from EEG signals is challenging due to their complexity and noise, limiting the potential applications of EEG-based BCIs. Method: The study introduces a framework involving an EEG encoder, cross-modal alignment in CLIP feature space, caption refinement via re-ranking, interpolation of embeddings, and image generation using Stable Diffusion. Result: Experimental results show a 13.43% improvement in Classification Accuracy, 15.21% in Generation Accuracy, and a 36.61% reduction in Fréchet Inception Distance, indicating enhanced semantic alignment and image quality. Conclusion: The proposed 5-stage framework significantly improves the decoding of visual representations from EEG signals, achieving better performance in image generation compared to existing methods. Abstract: EEG-based brain-computer interfaces (BCIs) have shown promise in various applications, such as motor imagery and cognitive state monitoring. However, decoding visual representations from EEG signals remains a significant challenge due to their complex and noisy nature. We thus propose a novel 5-stage framework for decoding visual representations from EEG signals: (1) an EEG encoder for concept classification, (2) cross-modal alignment of EEG and text embeddings in CLIP feature space, (3) caption refinement via re-ranking, (4) weighted interpolation of concept and caption embeddings for richer semantics, and (5) image generation using a pre-trained Stable Diffusion model. We enable context-aware EEG-to-image generation through cross-modal alignment and re-ranking. Experimental results demonstrate that our method generates high-quality images aligned with visual stimuli, outperforming SOTA approaches by 13.43% in Classification Accuracy, 15.21% in Generation Accuracy and reducing Fréchet Inception Distance by 36.61%, indicating superior semantic alignment and image quality.
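
Stage (4) of the framework, the weighted interpolation of concept and caption embeddings, can be sketched as a convex combination of two vectors. The weight `alpha`, the final L2 normalization, and the toy 2D vectors are assumptions made for this illustration; the abstract does not state how the weights are chosen or whether the result is renormalized.

```python
import math


def interpolate_embeddings(concept, caption, alpha=0.5):
    """Blend two embeddings: alpha * concept + (1 - alpha) * caption, then L2-normalize."""
    mixed = [alpha * c + (1.0 - alpha) * t for c, t in zip(concept, caption)]
    norm = math.sqrt(sum(v * v for v in mixed))
    # Normalization is an assumption here, common for CLIP-style embeddings.
    return [v / norm for v in mixed] if norm > 0 else mixed


# Toy 2D embeddings standing in for high-dimensional text embeddings.
concept_emb = [1.0, 0.0]
caption_emb = [0.0, 1.0]
print(interpolate_embeddings(concept_emb, caption_emb, alpha=0.5))
```

A larger `alpha` keeps the conditioning closer to the classified concept, while a smaller one lets the re-ranked caption contribute more contextual detail.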

[110] CharaConsist: Fine-Grained Consistent Character Generation

Mengyu Wang,Henghui Ding,Jianing Peng,Yao Zhao,Yunpeng Chen,Yunchao Wei

Main category: cs.CV

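TL;DR: This paper proposes CharaConsist, which addresses the problem of keeping characters and backgrounds consistent in text-to-image generation and suits a range of practical applications.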

Details Motivation: Existing training-free methods for consistent text-to-image generation fail to maintain consistent background details and show clear inconsistencies when the foreground character undergoes large motion variations. Method: Consistency is improved through point-tracking attention, adaptive token merging, and decoupled control of the foreground and background. Result: CharaConsist achieves fine-grained consistency for both foreground and background, supporting high-quality generation of the same character in continuous shots or in discrete shots across scenes. Conclusion: CharaConsist is a consistent generation method tailored for text-to-image DiT models that maintains fine-grained foreground and background consistency and adapts to a variety of real-world scenarios. Abstract: In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for text-to-image DiT models. Its ability to maintain fine-grained consistency, combined with the larger capacity of the latest base model, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios. The source code has been released at https://github.com/Murray-Wang/CharaConsist

[111] Streaming 4D Visual Geometry Transformer

Dong Zhuo,Wenzhao Zheng,Jiahe Guo,Yuqi Wu,Jie Zhou,Jiwen Lu

Main category: cs.CV

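TL;DR: This paper proposes a streaming 4D visual geometry transformer for 4D spatial-temporal geometry perception and reconstruction from real-time video, offering high inference speed and spatial consistency.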

Details Motivation: Interactive and real-time applications need an efficient method for perceiving and reconstructing 4D spatial-temporal geometry. Method: A causal transformer architecture uses temporal causal attention and cached historical keys and values for efficient streaming long-term 4D reconstruction, and is trained by distilling knowledge from a bidirectional visual geometry transformer. Result: On various 4D geometry perception benchmarks, the model increases inference speed in online scenarios while maintaining competitive performance. Conclusion: The streaming 4D visual geometry transformer paves the way for scalable and interactive 4D vision systems. Abstract: Perceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. This design can handle real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operators (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems. Code is available at: https://github.com/wzzheng/StreamVGGT.
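
The idea of caching historical keys and values so that each new frame attends only to the past can be sketched in a few lines. The example below is a single-head, plain-Python toy with no learned projections; the class name `StreamingAttention` and the toy vectors are hypothetical, and this is not the StreamVGGT implementation.

```python
import math


class StreamingAttention:
    """Causal attention over a growing cache of past keys and values."""

    def __init__(self):
        self.keys = []    # one key vector per frame seen so far
        self.values = []  # one value vector per frame seen so far

    def step(self, q, k, v):
        """Add the current frame to the cache and attend over all cached frames."""
        self.keys.append(k)
        self.values.append(v)
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in self.keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * val[i] for w, val in zip(weights, self.values)) / total
            for i in range(len(v))
        ]


attn = StreamingAttention()
# Frame 1 can only attend to itself, so the output equals its own value.
print(attn.step([1.0, 0.0], [1.0, 0.0], [2.0, 4.0]))
# Frame 2 reuses the cached key/value of frame 1 without recomputing it.
print(attn.step([0.0, 0.0], [0.0, 1.0], [4.0, 8.0]))
```

Each step costs time proportional to the number of cached frames, whereas re-running bidirectional attention over the whole sequence on every new frame would cost quadratic time per step.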

Zhen Xu,Hongyu Zhou,Sida Peng,Haotong Lin,Haoyu Guo,Jiahao Shao,Peishan Yang,Qinglin Yang,Sheng Miao,Xingyi He,Yifan Wang,Yue Wang,Ruizhen Hu,Yiyi Liao,Xiaowei Zhou,Hujun Bao

Main category: cs.CV

TL;DR: This paper explores the development of depth foundation models using deep learning to improve depth estimation in 3D computer vision, addressing challenges in traditional methods and highlighting future research directions.

Details Motivation: The motivation is to address limitations in traditional depth estimation methods, such as high cost, low resolution, and environmental sensitivity, by exploring vision-based methods with strong generalization capabilities. Method: The paper surveys deep learning architectures and paradigms for depth estimation across various settings and provides an overview of large-scale datasets that can aid in developing depth foundation models. Result: The paper identifies key architectures and training strategies that can lead to the development of robust depth foundation models with zero-shot generalization abilities. Conclusion: The paper concludes that robust depth foundation models can overcome existing challenges in depth estimation, offering promising future research and applications in 3D computer vision. Abstract: Depth estimation is a fundamental task in 3D computer vision, crucial for applications such as 3D reconstruction, free-viewpoint rendering, robotics, autonomous driving, and AR/VR technologies. Traditional methods relying on hardware sensors like LiDAR are often limited by high costs, low resolution, and environmental sensitivity, limiting their applicability in real-world scenarios. Recent advances in vision-based methods offer a promising alternative, yet they face challenges in generalization and stability due to either the low-capacity model architectures or the reliance on domain-specific and small-scale datasets. The emergence of scaling laws and foundation models in other domains has inspired the development of "depth foundation models": deep neural networks trained on large datasets with strong zero-shot generalization capabilities. This paper surveys the evolution of deep learning architectures and paradigms for depth estimation across the monocular, stereo, multi-view, and monocular video settings. 
We explore the potential of these models to address existing challenges and provide a comprehensive overview of large-scale datasets that can facilitate their development. By identifying key architectures and training strategies, we aim to highlight the path towards robust depth foundation models, offering insights into their future research and applications.

eess.IV [Back]

[113] Latent Space Consistency for Sparse-View CT Reconstruction

Duoyou Chen,Yunqing Chen,Can Zhang,Zhou Wang,Cheng Chen,Ruoxiu Xiao

Main category: eess.IV

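TL;DR: To address the long acquisition time and high radiation exposure of conventional CT, and the inability of the vanilla LDM to align the 2D latent representation of the X-ray modality with the 3D latent representation of the CT modality, this paper proposes the Consistent Latent Space Diffusion Model (CLS-DM), which incorporates cross-modal feature contrastive learning to extract latent 3D information effectively.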

Details Motivation: Because conventional CT is time-consuming and involves high radiation exposure, CT reconstruction from sparse-view X-ray images has attracted wide attention. However, the 2D latent representation of the X-ray modality differs substantially from the 3D latent representation of the CT modality, so the vanilla LDM cannot achieve effective alignment in the latent space. Method: The Consistent Latent Space Diffusion Model (CLS-DM) is proposed, incorporating cross-modal feature contrastive learning to extract latent 3D information effectively. Result: Experiments show that CLS-DM outperforms classical and state-of-the-art generative models on standard voxel-level metrics (PSNR, SSIM) on the LIDC-IDRI and CTSpine1K datasets. Conclusion: Through cross-modal feature contrastive learning, CLS-DM effectively extracts latent 3D information from 2D X-ray images and outperforms classical and state-of-the-art generative models on standard voxel-level metrics (PSNR, SSIM) on the LIDC-IDRI and CTSpine1K datasets. Abstract: Computed Tomography (CT) is a widely utilized imaging modality in clinical settings. Using densely acquired rotational X-ray arrays, CT can capture 3D spatial features. However, it is confronted with challenges such as significant time consumption and high radiation exposure. CT reconstruction methods based on sparse-view X-ray images have garnered substantial attention from researchers as they present a means to mitigate costs and risks. In recent years, diffusion models, particularly the Latent Diffusion Model (LDM), have demonstrated promising potential in the domain of 3D CT reconstruction. Nonetheless, due to the substantial differences between the 2D latent representation of X-ray modalities and the 3D latent representation of CT modalities, the vanilla LDM is incapable of achieving effective alignment within the latent space. To address this issue, we propose the Consistent Latent Space Diffusion Model (CLS-DM), which incorporates cross-modal feature contrastive learning to efficiently extract latent 3D information from 2D X-ray images and achieve latent space alignment between modalities. Experimental results indicate that CLS-DM outperforms classical and state-of-the-art generative models in terms of standard voxel-level metrics (PSNR, SSIM) on the LIDC-IDRI and CTSpine1K datasets. This methodology not only aids in enhancing the effectiveness and economic viability of sparse X-ray reconstructed CT but can also be generalized to other cross-modal transformation tasks, such as text-to-image synthesis. 
We have made our code publicly available at https://anonymous.4open.science/r/CLS-DM-50D6/ to facilitate further research and applications in other domains.

[114] 3D Magnetic Inverse Routine for Single-Segment Magnetic Field Images

J. Senthilnath,Chen Hao,F. C. Wellstood

Main category: eess.IV

TL;DR: This paper introduces the 3D Magnetic Inverse Routine (3D MIR), a novel method that combines deep learning, physics-based constraints, and optimization to accurately recover 3D current flow information from Magnetic Field Images (MFI) for non-destructive testing in semiconductor packaging.

Details Motivation: Accurately recovering 3D information is crucial for non-destructive testing (NDT) in semiconductor packaging to localize circuit defects, which this paper aims to address by introducing a novel approach called the 3D Magnetic Inverse Routine (3D MIR). Method: The paper proposes the 3D Magnetic Inverse Routine (3D MIR), which combines a deep learning-based CNN model, spatial-physics-based constraints, and an optimizer to reconstruct 3D current flow parameters from Magnetic Field Images (MFI). Result: The results demonstrate that the 3D MIR method accurately recovers 3D information with high precision, establishing it as a new benchmark for magnetic image reconstruction in semiconductor packaging. Conclusion: The 3D MIR method accurately recovers 3D information with high precision, setting a new benchmark for magnetic image reconstruction in semiconductor packaging and highlighting the potential of combining DL and physics-driven optimization in practical applications. Abstract: In semiconductor packaging, accurately recovering 3D information is crucial for non-destructive testing (NDT) to localize circuit defects. This paper presents a novel approach called the 3D Magnetic Inverse Routine (3D MIR), which leverages Magnetic Field Images (MFI) to retrieve the parameters for the 3D current flow of a single-segment. The 3D MIR integrates a deep learning (DL)-based Convolutional Neural Network (CNN), spatial-physics-based constraints, and optimization techniques. The method operates in three stages: i) The CNN model processes the MFI data to predict ($\ell/z_o$), where $\ell$ is the wire length and $z_o$ is the wire's vertical depth beneath the magnetic sensors and classify segment type ($c$). ii) By leveraging spatial-physics-based constraints, the routine provides initial estimates for the position ($x_o$, $y_o$, $z_o$), length ($\ell$), current ($I$), and current flow direction (positive or negative) of the current segment. 
iii) An optimizer then adjusts these five parameters ($x_o$, $y_o$, $z_o$, $\ell$, $I$) to minimize the difference between the reconstructed MFI and the actual MFI. The results demonstrate that the 3D MIR method accurately recovers 3D information with high precision, setting a new benchmark for magnetic image reconstruction in semiconductor packaging. This method highlights the potential of combining DL and physics-driven optimization in practical applications.
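
Stage iii) above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the segment is taken to run parallel to the x-axis, the sensors are taken to measure only the vertical field component $B_z$ on the plane $z = 0$, the forward model is the textbook Biot-Savart field of a finite straight wire, and the optimizer is a simple coordinate pattern search. The names `bz_segment`, `simulate_mfi`, `loss`, and `refine` are hypothetical.

```python
import math

MU0 = 4e-7 * math.pi  # vacuum permeability in T*m/A


def bz_segment(x, y, params):
    """Bz at sensor point (x, y, 0) from a straight segment parallel to x.

    params = (x_o, y_o, z_o, length, current). The segment is centred at
    (x_o, y_o) and lies z_o below the sensor plane (assumed geometry).
    """
    x_o, y_o, z_o, length, current = params
    dy = y - y_o
    rho2 = dy * dy + z_o * z_o      # squared distance to the wire axis
    a = x_o - length / 2 - x        # signed offset to the segment start
    b = x_o + length / 2 - x        # signed offset to the segment end
    sines = b / math.sqrt(b * b + rho2) - a / math.sqrt(a * a + rho2)
    return MU0 * current / (4 * math.pi) * sines * dy / rho2


def simulate_mfi(params, grid):
    """Reconstructed MFI: Bz sampled at every sensor position in grid."""
    return [bz_segment(x, y, params) for x, y in grid]


def loss(params, grid, measured):
    """Sum of squared differences between reconstructed and measured MFI."""
    predicted = simulate_mfi(params, grid)
    return sum((p - m) ** 2 for p, m in zip(predicted, measured))


def refine(params, grid, measured, steps, rounds=60):
    """Pattern search: try +/- step on each parameter, halve steps on failure."""
    best = list(params)
    best_loss = loss(best, grid, measured)
    steps = list(steps)
    for _ in range(rounds):
        improved = False
        for i in range(len(best)):
            for sign in (1.0, -1.0):
                trial = list(best)
                trial[i] += sign * steps[i]
                if trial[2] <= 0 or trial[3] <= 0:
                    continue  # keep depth and length physically meaningful
                trial_loss = loss(trial, grid, measured)
                if trial_loss < best_loss:
                    best, best_loss, improved = trial, trial_loss, True
        if not improved:
            steps = [s / 2 for s in steps]
    return best, best_loss


# Synthetic example: an 11 x 11 sensor grid with 0.1 mm pitch (made-up values).
grid = [(i * 1e-4, j * 1e-4) for i in range(-5, 6) for j in range(-5, 6)]
true_params = (1e-4, -5e-5, 2e-4, 4e-4, 1e-2)   # x_o, y_o, z_o, length, I
measured = simulate_mfi(true_params, grid)
initial = (0.0, 0.0, 3e-4, 3e-4, 8e-3)          # stand-in for the stage ii) estimate
steps = (5e-5, 5e-5, 5e-5, 5e-5, 1e-3)
fitted, fitted_loss = refine(initial, grid, measured, steps)
print("initial loss:", loss(initial, grid, measured))
print("fitted loss: ", fitted_loss)
```

The pattern search only ever accepts a trial that lowers the loss, so the fitted loss is never worse than the loss of the initial estimate. A gradient-based or least-squares solver would be a natural replacement, and the abstract does not say which optimizer the authors use.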

[115] HANS-Net: Hyperbolic Convolution and Adaptive Temporal Attention for Accurate and Generalizable Liver and Tumor Segmentation in CT Imaging

Arefin Ittesafun Abian,Ripon Kumar Debnath,Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Md Rafiqul Islam,Asif Karim,Reem E. Mohamed,Sami Azam

Main category: eess.IV

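TL;DR: HANS-Net is a novel segmentation framework for liver and tumor segmentation in CT images that combines hyperbolic convolutions, wavelet-inspired decomposition, a synaptic plasticity mechanism, an implicit neural representation branch, Monte Carlo dropout, and lightweight temporal attention.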

Details Motivation: Accurate liver and tumor segmentation on abdominal CT images is critical for diagnosis and treatment planning but is challenging due to anatomical complexity, tumor variability, and limited annotated data. Method: HANS-Net combines hyperbolic convolutions, a wavelet-inspired decomposition module, a synaptic plasticity mechanism, an implicit neural representation branch, uncertainty-aware Monte Carlo dropout, and lightweight temporal attention. Result: On the LiTS dataset, HANS-Net achieved a mean Dice score of 93.26%, IoU of 88.09%, ASSD of 0.72 mm, and VOE of 11.91%. On the 3D-IRCADb-01 dataset, it obtained an average Dice of 87.45%, IoU of 80.30%, ASSD of 1.525 mm, and VOE of 19.71%. Conclusion: HANS-Net provides anatomically consistent, accurate, and confident liver and tumor segmentation. Abstract: Accurate liver and tumor segmentation on abdominal CT images is critical for reliable diagnosis and treatment planning, but remains challenging due to complex anatomical structures, variability in tumor appearance, and limited annotated data. To address these issues, we introduce Hyperbolic-convolutions Adaptive-temporal-attention with Neural-representation and Synaptic-plasticity Network (HANS-Net), a novel segmentation framework that synergistically combines hyperbolic convolutions for hierarchical geometric representation, a wavelet-inspired decomposition module for multi-scale texture learning, a biologically motivated synaptic plasticity mechanism for adaptive feature enhancement, and an implicit neural representation branch to model fine-grained and continuous anatomical boundaries. Additionally, we incorporate uncertainty-aware Monte Carlo dropout to quantify prediction confidence and lightweight temporal attention to improve inter-slice consistency without sacrificing efficiency. 
Extensive evaluations on the LiTS dataset demonstrate that HANS-Net achieves a mean Dice score of 93.26%, an IoU of 88.09%, an average symmetric surface distance (ASSD) of 0.72 mm, and a volume overlap error (VOE) of 11.91%. Furthermore, cross-dataset validation on the 3D-IRCADb-01 dataset yields an average Dice of 87.45%, IoU of 80.30%, ASSD of 1.525 mm, and VOE of 19.71%, indicating strong generalization across different datasets. These results confirm the effectiveness and robustness of HANS-Net in providing anatomically consistent, accurate, and confident liver and tumor segmentation.
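
The overlap metrics quoted above can be computed from a pair of binary masks as in the following minimal sketch. It assumes the standard definitions (Dice $= 2|A \cap B| / (|A| + |B|)$, IoU $= |A \cap B| / |A \cup B|$, VOE $= 1 -$ IoU), which the abstract does not spell out. The function name `overlap_metrics`, the flattened-list mask format, and the convention for two empty masks are illustrative choices. ASSD is omitted because it requires surface extraction and voxel spacing.

```python
def overlap_metrics(pred, truth):
    """Return (dice, iou, voe) for two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    pred_set = {i for i, v in enumerate(pred) if v}
    truth_set = {i for i, v in enumerate(truth) if v}
    inter = len(pred_set & truth_set)
    union = len(pred_set | truth_set)
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0, 0.0
    dice = 2 * inter / (len(pred_set) + len(truth_set))
    iou = inter / union
    voe = 1.0 - iou
    return dice, iou, voe


# Toy masks: 3 predicted voxels, 3 true voxels, 2 of them shared.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
dice, iou, voe = overlap_metrics(pred, truth)
print(f"Dice={dice:.4f} IoU={iou:.4f} VOE={voe:.4f}")
```

Under these definitions VOE is the complement of IoU, which agrees with the reported LiTS figures (100% - 88.09% = 11.91%) and, to within 0.01 percentage points, with the 3D-IRCADb-01 figures (100% - 80.30% = 19.70% against a reported 19.71%).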