cs.CL

[1] Loki's Dance of Illusions: A Comprehensive Survey of Hallucination in Large Language Models

Chaozhuo Li,Pengbo Wang,Chenxu Wang,Litian Zhang,Zheng Liu,Qiwei Ye,Yuanbo Xu,Feiran Huang,Xi Zhang,Philip S. Yu

Main category: cs.CL

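TL;DR: This paper examines the problem of 'hallucination' in large language models, analyzes its causes and impact, and argues that more effective, innovative solutions are needed.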

Details Motivation: Although large language models (LLMs) have impressive generation capabilities, they sometimes produce information that appears accurate but is in fact fabricated; such 'hallucinations' can mislead users' judgments and lead to serious consequences. Method: The causes, detection methods, and solutions for LLM hallucinations are systematically categorized and analyzed. Result: The survey focuses in particular on understanding the roots of hallucinations and evaluates how effectively current strategies reveal the underlying logic. Conclusion: By systematically categorizing and analyzing the roots of LLM hallucinations and the effectiveness of existing strategies, the study aims to pave the way for developing innovative and effective methods. Abstract: Edgar Allan Poe noted, "Truth often lurks in the shadow of error," highlighting the deep complexity intrinsic to the interplay between truth and falsehood, notably under conditions of cognitive and informational asymmetry. This dynamic is strikingly evident in large language models (LLMs). Despite their impressive linguistic generation capabilities, LLMs sometimes produce information that appears factually accurate but is, in reality, fabricated, an issue often referred to as 'hallucinations'. The prevalence of these hallucinations can mislead users, affecting their judgments and decisions. In sectors such as finance, law, and healthcare, such misinformation risks causing substantial economic losses, legal disputes, and health risks, with wide-ranging consequences. In our research, we have methodically categorized and analyzed the causes, detection methods, and solutions related to LLM hallucinations. Our efforts have particularly focused on understanding the roots of hallucinations and evaluating the efficacy of current strategies in revealing the underlying logic, thereby paving the way for the development of innovative and potent approaches. By examining why certain measures are effective against hallucinations, our study aims to foster a comprehensive approach to tackling this issue within the domain of LLMs.

[2] ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models

Dai Li,Linzhuo Li,Huilian Sophie Qiu

Main category: cs.CL

TL;DR: The study argues against using LLMs, especially AI chatbots, as direct substitutes for human survey data due to structural inconsistencies and homogenization that could reinforce stereotypes and misinform policy.

Details Motivation: This study examines the notion that large language models (LLMs) may misrepresent population-level opinions, identifying two fundamental challenges: a failure in structural consistency and homogenization. Method: The study used questions on abortion and unauthorized immigration from the American National Election Studies (ANES) 2020 to prompt ChatGPT (GPT-4) and Meta's Llama 3.1 series (8B, 70B, 405B). Result: Findings reveal significant structural inconsistencies and severe homogenization in LLM responses compared to human data. The study proposes an 'accuracy-optimization hypothesis,' suggesting homogenization stems from prioritizing modal responses. Conclusion: LLMs, particularly AI chatbots, should not be used as direct substitutes for human survey data due to structural inconsistencies and homogenization that can reinforce stereotypes and misinform policy. Abstract: Large language models (LLMs) in the form of chatbots like ChatGPT and Llama are increasingly proposed as "silicon samples" for simulating human opinions. This study examines this notion, arguing that LLMs may misrepresent population-level opinions. We identify two fundamental challenges: a failure in structural consistency, where response accuracy doesn't hold across demographic aggregation levels, and homogenization, an underrepresentation of minority opinions. To investigate these, we prompted ChatGPT (GPT-4) and Meta's Llama 3.1 series (8B, 70B, 405B) with questions on abortion and unauthorized immigration from the American National Election Studies (ANES) 2020. Our findings reveal significant structural inconsistencies and severe homogenization in LLM responses compared to human data. We propose an "accuracy-optimization hypothesis," suggesting homogenization stems from prioritizing modal responses.
These issues challenge the validity of using LLMs, especially AI chatbots, as direct substitutes for human survey data, potentially reinforcing stereotypes and misinforming policy.

[3] A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations

Phurich Saengthong,Boonnithi Jiaramaneepinit,Sheng Li,Manabu Okumura,Takahiro Shinozaki

Main category: cs.CL

TL;DR: This paper proposes a unified speech LLM for joint diarization and ASR with improved performance on multilingual conversational data.

Details Motivation: The motivation behind this work is to improve the effectiveness of Speech Large Language Models (Speech LLMs) in real-world multilingual conversations where natural conversational phenomena are not well captured due to limited data availability. Method: The authors propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. They reformulate the training data format and modify the inference procedure to address the ambiguity inherent in pre-segmented audio. Result: The model achieves a 54.87% relative improvement in tcpWER/tcpCER over the baseline for Task II and ranks 8th overall. Results from Task I using a fine-tuned speech LLM are also reported. Conclusion: The paper concludes that by reformulating the training data format and modifying the inference procedure, their unified speech LLM effectively addresses the ambiguity in pre-segmented audio for joint diarization and ASR, achieving significant improvement over the baseline despite using a smaller LLM backbone. Abstract: Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic speech recognition (ASR) and spoken dialogue modeling. However, their effectiveness in real-world multilingual conversations remains limited by the scarcity of data that captures natural conversational phenomena. To address this, the MLC-SLM Challenge provides a multilingual conversational dataset and evaluates models on two tasks: ASR with oracle segmentation (Task I) and joint diarization and recognition without oracle information (Task II). In this paper, we focus on Task II and propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. 
By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented audio and achieves a 54.87% relative improvement in tcpWER/tcpCER over the baseline, ranking 8th overall, despite using a smaller LLM backbone. We also report results from Task I using a fine-tuned speech LLM.
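As a reading aid for the 54.87% figure: the abstract does not spell out the formula, so the block below assumes the usual definition of relative improvement for an error metric, where lower is better. The symbols E_baseline and E_model are introduced here for illustration and stand for the baseline's and the proposed model's tcpWER/tcpCER.

```latex
\[
\text{relative improvement}
  = \frac{E_{\text{baseline}} - E_{\text{model}}}{E_{\text{baseline}}} \times 100\%
\]
% Under this assumed definition, a 54.87% relative improvement means
\[
E_{\text{model}} = (1 - 0.5487)\, E_{\text{baseline}} = 0.4513\, E_{\text{baseline}}
\]
```

That is, under this reading the model's error rate would be a little under half of the baseline's; the absolute error rates themselves are not given in the abstract.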

[4] Mitigating Hidden Confounding by Progressive Confounder Imputation via Large Language Models

Hao Yang,Haoxuan Li,Luyu Chen,Haoxiang Wang,Xu Chen,Mingming Gong

Main category: cs.CL

TL;DR: This paper introduces ProCI, a method that leverages large language models' semantic reasoning and world knowledge to mitigate hidden confounding in causal inference.

Details Motivation: Hidden confounding poses a challenge in estimating unbiased causal effects from observational data, and most existing methods still rely on the unconfoundedness assumption even when using LLMs for causal inference. Method: ProCI (Progressive Confounder Imputation) uses LLMs' semantic reasoning and world knowledge to iteratively generate, impute, and validate hidden confounders, employing distributional reasoning for robustness. Result: ProCI successfully discovers plausible confounders and improves causal effect estimation by leveraging LLMs' capabilities while avoiding direct value imputation through distributional reasoning. Conclusion: ProCI is effective in uncovering meaningful confounders and significantly enhances treatment effect estimation across different datasets and LLMs. Abstract: Hidden confounding remains a central challenge in estimating treatment effects from observational data, as unobserved variables can lead to biased causal estimates. While recent work has explored the use of large language models (LLMs) for causal inference, most approaches still rely on the unconfoundedness assumption. In this paper, we make the first attempt to mitigate hidden confounding using LLMs. We propose ProCI (Progressive Confounder Imputation), a framework that elicits the semantic and world knowledge of LLMs to iteratively generate, impute, and validate hidden confounders. ProCI leverages two key capabilities of LLMs: their strong semantic reasoning ability, which enables the discovery of plausible confounders from both structured and unstructured inputs, and their embedded world knowledge, which supports counterfactual reasoning under latent confounding. To improve robustness, ProCI adopts a distributional reasoning strategy instead of direct value imputation to prevent the collapsed outputs. 
Extensive experiments demonstrate that ProCI uncovers meaningful confounders and significantly improves treatment effect estimation across various datasets and LLMs.

[5] Theory of Mind in Action: The Instruction Inference Task

Fardin Saad,Pradeep K. Murukannaiah,Munindar P. Singh

Main category: cs.CL

TL;DR: This paper introduces Tomcat, an LLM-based agent capable of interpreting ambiguous instructions like humans, demonstrating strong Theory of Mind (ToM) abilities in collaborative settings.

Details Motivation: To assess Theory of Mind (ToM) in a dynamic, collaborative setting and improve AI's ability to interpret ambiguous instructions for better human-AI collaboration. Method: Developed Tomcat (Fs-CoT and CP variants) using LLMs (GPT-4o, DeepSeek-R1, Gemma-3-27B) and evaluated against 52 human participants via intent accuracy, action optimality, and planning optimality in the Instruction Inference task. Result: Tomcat with Fs-CoT (especially GPT-4o and DeepSeek-R1) achieved performance similar to humans, proving its ToM reasoning capability. Conclusion: Tomcat with Fs-CoT demonstrates ToM capabilities comparable to humans, showing potential for human-AI collaboration. Abstract: The Theory of Mind (ToM) refers to an agent's capacity to infer the mental states of other agents. ToM is essential for effective collaboration. To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting indirect or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal's instructions. We implement two variants of Tomcat. One, dubbed Fs-CoT, is based on a small number of examples (i.e., few-shot or Fs) demonstrating the requisite structured reasoning (i.e., chain-of-thought or CoT). One, dubbed CP, relies on commonsense knowledge and information about the problem (i.e., commonsense prompt or CP). We realized both variants of Tomcat on three leading large language models (LLMs), namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant of Tomcat. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. 
We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-AI collaboration.

[6] A Large Language Model-Empowered Agent for Reliable and Robust Structural Analysis

Jiachen Liu,Ziheng Geng,Ran Cao,Lu Cheng,Paolo Bocchini,Minghui Cheng

Main category: cs.CL

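TL;DR: This paper examines the use of large language models (LLMs) in civil engineering, specifically their reliability and robustness in the structural analysis of beams, and proposes an agent that improves LLM performance by recasting the analysis as a code generation task.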

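Details Motivation: Although large language models perform well across a variety of open domains, their application in specialized fields such as civil engineering remains little explored. This paper aims to evaluate and improve the reliability and robustness of LLMs in structural analysis. Method: Reliability is assessed by repeatedly running the same problems, and robustness by performance under different load and boundary conditions; a benchmark dataset of eight beam analysis problems is created. In addition, an LLM agent is proposed that reframes structural analysis as a code generation task, using chain-of-thought and few-shot prompting to generate accurate OpenSeesPy code and automatically executing the code to produce results. Result: Experiments show that, although the LLM has a qualitative understanding of structural mechanics, it lacks the quantitative reliability and robustness required for engineering applications. The proposed LLM agent achieves accuracy above 99.0% on the benchmark dataset and performs reliably under various conditions. Ablation studies indicate that the complete example and the function usage examples are the main contributors to the agent's improved performance. Conclusion: The paper shows that treating structural analysis as a code generation task can significantly improve the reliability and robustness of LLMs in engineering applications. Abstract: Large language models (LLMs) have exhibited remarkable capabilities across diverse open-domain tasks, yet their application in specialized domains such as civil engineering remains largely unexplored. This paper starts bridging this gap by evaluating and enhancing the reliability and robustness of LLMs in structural analysis of beams. Reliability is assessed through the accuracy of correct outputs under repetitive runs of the same problems, whereas robustness is evaluated via the performance across varying load and boundary conditions. A benchmark dataset, comprising eight beam analysis problems, is created to test the Llama-3.3 70B Instruct model. Results show that, despite a qualitative understanding of structural mechanics, the LLM lacks the quantitative reliability and robustness for engineering applications. To address these limitations, a shift is proposed that reframes the structural analysis as code generation tasks. Accordingly, an LLM-empowered agent is developed that (a) integrates chain-of-thought and few-shot prompting to generate accurate OpenSeesPy code, and (b) automatically executes the code to produce structural analysis results. Experimental results demonstrate that the agent achieves accuracy exceeding 99.0% on the benchmark dataset, exhibiting reliable and robust performance across diverse conditions. Ablation studies highlight the complete example and function usage examples as the primary contributors to the agent's enhanced performance.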
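The reliability measure described above (accuracy over repeated runs of the same problem) can be illustrated with a minimal Python sketch. This is not the paper's benchmark or agent: the beam problem, the 1% tolerance, the function names, and the five "model outputs" are all made up for illustration, and the reference answer comes from the textbook closed-form midspan deflection of a simply supported beam under uniform load rather than from OpenSeesPy.

```python
import math


def midspan_deflection_udl(w, L, E, I):
    """Closed-form midspan deflection of a simply supported beam
    under a uniformly distributed load: 5 w L^4 / (384 E I)."""
    return 5.0 * w * L**4 / (384.0 * E * I)


def reliability(outputs, reference, rel_tol=0.01):
    """Fraction of repeated runs whose answer lies within rel_tol
    of the reference value (tolerance is an assumption)."""
    hits = sum(math.isclose(o, reference, rel_tol=rel_tol) for o in outputs)
    return hits / len(outputs)


# Hypothetical problem in SI units: w = 10 kN/m, L = 6 m,
# E = 200 GPa, I = 8e-5 m^4.
reference = midspan_deflection_udl(w=10e3, L=6.0, E=200e9, I=8e-5)

# Made-up answers from five repeated runs of the same prompt.
runs = [reference, reference * 1.002, reference * 0.5,
        reference * 0.999, reference * 1.2]

score = reliability(runs, reference)
print(f"reference deflection = {reference:.6f} m, reliability = {score:.2f}")
```

Robustness, in the sense used in the abstract, would repeat the same check while varying the load and boundary conditions, which needs a reference solution for each case.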

[7] Towards a Comparative Framework for Compositional AI Models

Tiffany Duneau

Main category: cs.CL

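TL;DR: This study analyzes how different models within the DisCoCirc framework perform on compositional generalisation tasks, finds that neural networks are more prone to overfitting, and compares the models across compositionality tasks.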

Details Motivation: To explore the capabilities of compositional models in natural language processing, including compositional generalisation and compositional interpretability, in order to better understand how models work and to improve their generalisation. Method: Within the DisCoCirc framework, word units are combined according to the grammatical structure of the text to build compositional models of text, and compositional generalisation is tested using the language of category theory. Result: On the productivity and substitutivity tasks, the quantum circuit models and the classical neural network models score within 5% of each other, but they differ by at least 10% on the systematicity task. The two kinds of model also show different trends on the overgeneralisation tasks. Conclusion: Neural network models are more prone than quantum circuit models to overfitting the training data, though the two perform similarly on some compositional generalisation tasks. The study also demonstrates how to interpret the behaviour of a compositional model. Abstract: The DisCoCirc framework for natural language processing allows the construction of compositional models of text, by combining units for individual words together according to the grammatical structure of the text. The compositional nature of a model can give rise to two things: compositional generalisation -- the ability of a model to generalise outside its training distribution by learning compositional rules underpinning the entire data distribution -- and compositional interpretability -- making sense of how the model works by inspecting its modular components in isolation, as well as the processes through which these components are combined. We present these notions in a framework-agnostic way using the language of category theory, and adapt a series of tests for compositional generalisation to this setting. Applying this to the DisCoCirc framework, we consider how well a selection of models can learn to compositionally generalise. We compare both quantum circuit based models, as well as classical neural networks, on a dataset derived from one of the bAbI tasks, extended to test a series of aspects of compositionality. Both architectures score within 5% of one another on the productivity and substitutivity tasks, but differ by at least 10% for the systematicity task, and exhibit different trends on the overgeneralisation tasks. Overall, we find the neural models are more prone to overfitting the Train data. Additionally, we demonstrate how to interpret a compositional model on one of the trained models. By considering how the model components interact with one another, we explain how the model behaves.

[8] The Application of Large Language Models on Major Depressive Disorder Support Based on African Natural Products

Linyan Zou

Main category: cs.CL

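TL;DR: This paper explores combining African traditional herbal medicine with large language models to develop an AI-based depression support system that addresses the limitations of conventional antidepressants.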

Details Motivation: Because conventional antidepressants suffer from slow onset, significant side effects, and ineffectiveness for some patients, and because African traditional medicine has a long history of plant-based remedies, the author seeks to combine traditional knowledge with modern AI technology to offer more effective mental health support. Method: The study analyzes African medicinal plants with antidepressant properties and their pharmacological mechanisms, and uses DeepSeek's large language model to develop an AI-powered support system that provides evidence-based information and therapeutic protocols for African herbal medicines. Result: The results suggest that large language models can effectively integrate traditional knowledge with modern medical practice, offering people with depression personalized, culturally appropriate support while maintaining scientific rigor and safety. Conclusion: Combining African traditional herbal knowledge with AI can yield more culturally adapted and personalized depression support systems, compensating for the shortcomings of existing treatments and advancing global mental health care. Abstract: Major depressive disorder represents one of the most significant global health challenges of the 21st century, affecting millions of people worldwide and creating substantial economic and social burdens. While conventional antidepressant therapies have provided relief for many individuals, their limitations including delayed onset of action, significant side effects, and treatment resistance in a substantial portion of patients have prompted researchers and healthcare providers to explore alternative therapeutic approaches (Kasneci et al.). African traditional medicine, with its rich heritage of plant-based remedies developed over millennia, offers a valuable resource for developing novel antidepressant treatments that may address some of these limitations. This paper examines the integration of large language models with African natural products for depression support, combining traditional knowledge with modern artificial intelligence technology to create accessible, evidence-based mental health support systems. The research presented here encompasses a comprehensive analysis of African medicinal plants with documented antidepressant properties, their pharmacological mechanisms, and the development of an AI-powered support system that leverages DeepSeek's advanced language model capabilities. The system provides evidence-based information about African herbal medicines, their clinical applications, safety considerations, and therapeutic protocols while maintaining scientific rigor and appropriate safety standards.
Our findings demonstrate the potential for large language models to serve as bridges between traditional knowledge and modern healthcare, offering personalized, culturally appropriate depression support that honors both traditional wisdom and contemporary medical understanding.

[9] RADIANT: Retrieval AugmenteD entIty-context AligNmenT -- Introducing RAG-ability and Entity-Context Divergence

Vipula Rawte,Rajarshi Roy,Gurpreet Singh,Danush Khanna,Yaswanth Narsupalli,Basab Ghosh,Abhay Gupta,Argha Kamal Samanta,Aditya Shingote,Aadi Krishna Vikram,Vinija Jain,Aman Chadha,Amit Sheth,Amitava Das

Main category: cs.CL

TL;DR: This paper addresses the problem of factual inconsistency in Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs), introduces a new metric called Entity-Context Divergence (ECD), and proposes a framework named Radiant to enhance factual accuracy by aligning retrieved information with generated output.

Details Motivation: LLMs struggle to faithfully integrate retrieved external knowledge into their responses, leading to factual inconsistencies. This work aims to quantify this issue and develop a solution for more reliable content generation. Method: The authors introduce Entity-Context Divergence (ECD) to measure factual consistency, evaluate RAG-ability across LLMs, and propose Radiant, a framework that combines RAG with alignment techniques based on Direct Preference Optimization (DPO). Result: Experiments show that most LLMs have low RAG-ability, indicating challenges in preserving factual accuracy. The proposed Radiant framework effectively improves RAG performance in handling noisy contexts, resolving knowledge conflicts, and reducing hallucinations. Conclusion: Radiant improves the integration of retrieved evidence into LLMs, enhancing factual consistency and coherence in generated content. Abstract: As Large Language Models (LLMs) continue to advance, Retrieval-Augmented Generation (RAG) has emerged as a vital technique to enhance factual accuracy by integrating external knowledge into the generation process. However, LLMs often fail to faithfully integrate retrieved evidence into their generated responses, leading to factual inconsistencies. To quantify this gap, we introduce Entity-Context Divergence (ECD), a metric that measures the extent to which retrieved information is accurately reflected in model outputs. We systematically evaluate contemporary LLMs on their ability to preserve factual consistency in retrieval-augmented settings, a capability we define as RAG-ability. Our empirical analysis reveals that RAG-ability remains low across most LLMs, highlighting significant challenges in entity retention and context fidelity. This paper introduces Radiant (Retrieval AugmenteD entIty-context AligNmenT), a novel framework that merges RAG with alignment designed to optimize the interplay between retrieved evidence and generated content. 
Radiant extends Direct Preference Optimization (DPO) to teach LLMs how to integrate provided additional information into subsequent generations. As a behavior correction mechanism, Radiant boosts RAG performance across varied retrieval scenarios, such as noisy web contexts, knowledge conflicts, and hallucination reduction. This enables more reliable, contextually grounded, and factually coherent content generation.
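For readers unfamiliar with DPO, the block below gives the standard Direct Preference Optimization objective as background. It is the generic published loss, not Radiant's extension, whose exact form the abstract does not give. Here x is a prompt, y_w and y_l are the preferred and dispreferred responses, π_θ is the policy being trained, π_ref is a frozen reference model, σ is the logistic function, and β is a temperature-like hyperparameter.

```latex
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
\]
```

In a retrieval-augmented setting one would presumably condition on the retrieved context as part of x and prefer responses that stay faithful to it, but that reading is an assumption rather than something stated in the abstract.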

[10] Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

Keita Kiuchi,Yoshikazu Fujimoto,Hideyuki Goto,Tomonori Hosokawa,Makoto Nishimura,Yosuke Sato,Izumi Sezai

Main category: cs.CL

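TL;DR: This study evaluates large language models in Japanese-language counseling scenarios, finding that a structured prompting strategy significantly improves AI performance, that evaluator models show systematic bias, and that client simulations need greater realism.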

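Details Motivation: This study aims to comprehensively evaluate large language model (LLM) performance across three counseling roles in Japanese-language therapeutic settings, filling a research gap on AI-assisted counseling in non-English contexts and informing the development of culturally sensitive AI mental health tools. Method: The study used a combined evaluation approach, assessing counselor AI systems (GPT-4-turbo with zero-shot prompting or Structured Multi-step Dialogue Prompts (SMDP), and Claude-3-Opus-SMDP) together with client AI simulations and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) evaluated the AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Result: The SMDP prompting strategy significantly improved counselor AI performance, while the evaluation AIs matched human raters on some dimensions but systematically overestimated others. Different models showed specific biases, and the client AI simulators were limited in emotional expression and overly compliant. Conclusion: The study draws several key conclusions. First, SMDP significantly outperformed zero-shot prompting on all MITI global ratings, with no significant difference between GPT-SMDP and Opus-SMDP. Second, the evaluation AI systems performed comparably to human raters on Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. In addition, the evaluation AI models showed model-specific biases: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Finally, the client AI simulators had a limited emotional range and unusually high compliance, indicating a need for greater realism. Abstract: This study provides the first comprehensive evaluation of large language model (LLM) performance across three counseling roles in Japanese-language therapeutic contexts. We simultaneously assessed counselor artificial intelligence (AI) systems (GPT-4-turbo with zero-shot prompting or Structured Multi-step Dialogue Prompts (SMDP), Claude-3-Opus-SMDP), client AI simulations, and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) with extensive counseling experience evaluated AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Notably, SMDP implementation significantly enhanced counselor AI performance across all MITI global ratings compared with zero-shot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. Evaluation AIs showed comparable performance to human raters for Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. Model-specific biases emerged: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Client AI simulations exhibited a limited emotional range and unnaturally high compliance, indicating the need for enhanced realism.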
These findings establish benchmarks for AI-assisted counseling in non-English contexts and identify critical areas for improvement through advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning, with important implications for developing culturally sensitive AI mental health tools.

[11] Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III

Pranam Shetty,Abhisek Upadhayaya,Parth Mitesh Shah,Srikanth Jagabathula,Shilpi Nayak,Anna Joo Fee

Main category: cs.CL

TL;DR: This paper benchmarks LLMs on the CFA Level III exam to assess their suitability for advanced financial reasoning tasks.

Details Motivation: To ensure responsible deployment of LLMs in financial institutions by rigorously evaluating their domain-specific performance. Method: Evaluated 23 state-of-the-art LLMs on the CFA Level III exam using multiple prompting strategies including Chain-of-Thought and Self-Discover. Result: Top models like o4-mini and Gemini 2.5 Flash achieved composite scores of 79.1% and 77.3% respectively on the CFA Level III exam. Conclusion: Leading LLMs show strong capabilities in high-stakes financial applications, but challenges remain in cost-effective deployment and interpreting performance against professional benchmarks. Abstract: As financial institutions increasingly adopt Large Language Models (LLMs), rigorous domain-specific evaluation becomes critical for responsible deployment. This paper presents a comprehensive benchmark evaluating 23 state-of-the-art LLMs on the Chartered Financial Analyst (CFA) Level III exam - the gold standard for advanced financial reasoning. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III. These results, achieved under a revised, stricter essay grading methodology, indicate significant progress in LLM capabilities for high-stakes financial applications. Our findings provide crucial guidance for practitioners on model selection and highlight remaining challenges in cost-effective deployment and the need for nuanced interpretation of performance against professional benchmarks.

[12] Real-World En Call Center Transcripts Dataset with PII Redaction

Ha Dao,Gaurav Chawla,Raghu Banda,Caleb DeLeeuw

Main category: cs.CL

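TL;DR: CallCenterEN is an open dataset of a large number of real-world English call center conversations with personally identifiable information removed, intended to advance AI systems for customer service and sales.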

Details Motivation: Publicly available real-world call center datasets are scarce, so a large-scale, high-quality dataset is needed to advance research in this area. Method: The authors release a dataset of 91,706 conversations (corresponding to 10,448 audio hours) and remove all personally identifiable information to ensure compliance with global data protection laws. Result: The CallCenterEN dataset is released, covering call transcripts with accents from India, the Philippines, and the United States, for non-commercial research use. Conclusion: CallCenterEN is a large-scale, real-world English call center conversation dataset intended to support research and development of AI systems for customer service and sales. Abstract: We introduce CallCenterEN, a large-scale (91,706 conversations, corresponding to 10,448 audio hours), real-world English call center transcript dataset designed to support research and development in customer support and sales AI systems. This is the largest release to date of open source call center transcript data of this kind. The dataset includes inbound and outbound calls between agents and customers, with accents from India, the Philippines and the United States. The dataset includes high-quality, PII-redacted human-readable transcriptions. All personally identifiable information (PII) has been rigorously removed to ensure compliance with global data protection laws. The audio is not included in the public release due to biometric privacy concerns. Given the scarcity of publicly available real-world call center datasets, CallCenterEN fills a critical gap in the landscape of available ASR corpora, and is released under a CC BY-NC 4.0 license for non-commercial research use.

[13] RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Zhiwen Tan,Jiaming Huang,Qintong Wu,Hongxuan Zhang,Chenyi Zhuang,Jinjie Gu

Main category: cs.CL

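TL;DR: This paper proposes the RAG-R1 framework, which uses multi-query parallelism to address efficiency and performance bottlenecks in retrieval-augmented generation.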

Details Motivation: Existing retrieval-augmented generation methods suffer from unstable training, long inference time, and the limits of a single-query mode, so a more efficient and stable approach is needed. Method: The RAG-R1 framework is proposed, extending generation and retrieval from single-query mode to multi-query parallelism and adaptively combining internal knowledge with external information. Result: On seven question-answering benchmarks, RAG-R1 outperforms the strongest baseline by up to 13.2% and reduces inference time by 11.1%. Conclusion: Through multi-query parallelism, the RAG-R1 framework improves both the performance and the inference efficiency of large language models on retrieval-augmented generation tasks. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models' search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model's capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
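The idea of moving from a single query to multi-query parallelism can be sketched in a few lines of Python. This is only an illustration of issuing several retrieval queries concurrently and merging the results; it is not RAG-R1's training framework or retriever. The toy corpus, the keyword-overlap `retrieve` function, and `retrieve_parallel` are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus standing in for an external knowledge source (made up).
CORPUS = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Nile is a river in Africa",
]


def retrieve(query):
    """Hypothetical retriever: return documents sharing a word with the query."""
    words = set(query.lower().split())
    return [doc for doc in CORPUS if words & set(doc.lower().split())]


def retrieve_parallel(queries):
    """Run all queries concurrently, then merge results in query order,
    dropping duplicate documents."""
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        per_query = list(pool.map(retrieve, queries))
    seen, merged = set(), []
    for docs in per_query:
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Two sub-queries the model might emit in a single reasoning step.
merged = retrieve_parallel(["paris", "nile"])
print(merged)
```

With a real retriever the gain would come from overlapping the network or index latency of the individual queries; the 11.1% inference-time reduction quoted above is the paper's own measurement and is not reproduced by this toy.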

[14] Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens

Salahuddin Salahuddin,Ahmed Hussain,Jussi Löppönen,Toni Jutila,Panos Papadimitratos

Main category: cs.CL

TL;DR: This paper introduces Domain-Adaptive Continuous Pretraining (DAP), a method to enhance cybersecurity understanding in Large Language Models (LLMs) while preserving general language capabilities, achieving state-of-the-art performance on cybersecurity benchmarks with smaller datasets.

Details Motivation: General-purpose Large Language Models (LLMs) lack specialized domain knowledge for effective cybersecurity analysis, which necessitates domain adaptation techniques that enhance cybersecurity understanding without compromising general language capabilities. Method: Domain-Adaptive Continuous Pretraining (DAP) was used on pretrained LLMs with a 126-million-word cybersecurity corpus. Constrained training parameters and distributed FSDP training were employed to balance domain specialization and knowledge preservation. Result: The adapted Llama-3.3-70B-Instruct model achieved state-of-the-art accuracies of 0.718, 0.933, and 0.864 on CTI-MCQ, CyberMetric, and SecEval benchmarks respectively, outperforming specialized models while using significantly less data. Conclusion: Targeted continuous pretraining is a viable and efficient method for adapting LLMs to the cybersecurity domain, challenging existing assumptions about data requirements for specialization. Abstract: While Large Language Models (LLMs) demonstrate exceptional natural language capabilities, general-purpose models lack specialized domain knowledge for effective cybersecurity analysis. In this work, we investigate Domain-Adaptive Continuous Pretraining (DAP) as a methodology for enhancing cybersecurity understanding in pretrained LLMs while preserving general language capabilities. We systematically adapted three decoder-based architectures -- Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-14B, and Llama-3.3-70B-Instruct -- using a curated 126-million-word cybersecurity corpus from standards, academic literature, and various other sources. Our approach employed constrained training parameters and distributed FSDP training to balance domain specialization with knowledge preservation. Evaluation across three cybersecurity benchmarks, namely, CTI-MCQ, CyberMetric, and SecEval, demonstrates consistent improvements post-adaptation. 
The Llama-3.3-70B-Ins-DAP model achieved state-of-the-art accuracies of 0.718, 0.933, and 0.864, respectively, outperforming specialized models, including Llama-Primus-Base. Notably, competitive performance was achieved using substantially smaller datasets (118.8 million versus 2.77 billion tokens), demonstrating efficient domain specialization viability. We establish that targeted continuous pretraining enables effective cybersecurity domain adaptation with computational feasibility, providing foundations for specialized AI assistants in threat analysis, vulnerability assessment, and security documentation while challenging prevailing assumptions about data requirements for LLM specialization.

[15] PB-LLMs: Privacy- and Bias-aware NLP Models using Named-Entity Recognition

Gonzalo Mancera,Aythami Morales,Julian Fierrez,Ruben Tolosana,Alejandro Penna,Miguel Lopez-Duran,Francisco Jurado,Alvaro Ortigosa

Main category: cs.CL

TL;DR: This paper proposes a Privacy- and Bias-aware LLMs (PB-LLMs) framework that uses Named-Entity Recognition (NER) to anonymize sensitive information in text data, facilitating privacy-preserving training of LLMs while also reducing gender bias.

Details Motivation: Despite the strong performance of Large Language Models (LLMs), they introduce legal and ethical concerns regarding privacy, data protection, and transparency. This work explores the use of Named-Entity Recognition (NER) to facilitate the privacy-preserving training or adaptation of LLMs. Method: A framework using NER technologies was proposed to anonymize sensitive information in text data for privacy-preserving training of LLMs. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms applied to a database of 24,000 candidate profiles. Additionally, an existing approach was applied to reduce gender bias in LLMs. Result: The findings indicate that the proposed privacy preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the experimented scenario. On top of the proposed privacy-preserving approach, applying an existing approach reduced gender bias in LLMs. Conclusion: The proposed PB-LLMs can effectively maintain system performance while safeguarding candidate confidentiality and reducing gender bias in LLMs, making them applicable to any other LLM-based AI application. Abstract: The use of Natural Language Processing (NLP) in high-stakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal/ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named-Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. 
An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the experimented scenario. On top of the proposed privacy-preserving approach, we also experiment applying an existing approach that reduces the gender bias in LLMs, thus finally obtaining our proposed Privacy- and Bias-aware LLMs (PB-LLMs). Note that the proposed PB-LLMs have been evaluated in a particular setup (resume scoring), but are generally applicable to any other LLM-based AI application.
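
The anonymization step described above can be sketched as follows, assuming entity spans have already been produced by some NER model. The function name `anonymize`, the `(start, end, label)` span format, and the sample sentence are hypothetical illustrations, not taken from the paper or from any of the listed tools.

```python
from typing import List, Tuple

# Hypothetical span format: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def anonymize(text: str, spans: List[Span]) -> str:
    """Replace each detected entity span with a placeholder such as <PERSON>."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # keep text before the entity
        pieces.append(f"<{label}>")        # mask the entity itself
        cursor = end
    pieces.append(text[cursor:])           # keep the remaining tail
    return "".join(pieces)


resume = "Maria Lopez lives in Madrid."
spans = [(0, 11, "PERSON"), (21, 27, "LOCATION")]
print(anonymize(resume, spans))  # <PERSON> lives in <LOCATION>.
```

A downstream scoring model would then be trained on the masked text rather than the raw profile; the sketch assumes spans do not overlap.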

[16] We Need Knowledge Distillation for Solving Math Word Problems

Zhenquan Shen,Xinguo Yu,Xiaotian Cheng,Rao Peng,Hao Ming

Main category: cs.CL

TL;DR: This paper explores the feasibility of compressing large language models (LLMs) for solving math word problems, achieving high performance with fewer parameters.

Details Motivation: The use of LLMs in mathematics education is limited by their computational demands and costs. This study aims to address this issue through model compression. Method: The authors compress BERT's embedded vectors and distill a smaller student model to evaluate its performance on math word problems. Result: The compressed student model retains nearly 90% of the teacher model's performance while using only 1/12 of its parameters and shows strong generalization across MWP tasks. Conclusion: The study demonstrates that compressing LLMs can significantly reduce computational costs while maintaining performance, offering valuable improvements for intelligent tutoring systems. Abstract: The enhancement of mathematical capabilities in large language models (LLMs) fosters new developments in mathematics education within primary and secondary schools, particularly as they relate to intelligent tutoring systems. However, LLMs require substantial computational resources, resulting in significant costs in educational contexts. To mitigate this drawback, this paper investigates the feasibility of compressing LLMs for solving math word problems (MWPs). We compress the embedded vectors encoded by BERT and distill a considerably smaller student model. Our findings indicate that the student model can maintain nearly 90% of the performance of the teacher model while utilizing only 1/12 of its parameters. In addition to achieving high accuracy, the model exhibits strong generalizability, as the compressed vectors perform well across all tasks related to MWPs, and the distillation process is not task-specific. The success of this distillation demonstrates that the underlying principles are generic and not limited to a specific task. We further explore the reasons behind the compressibility of embedded vectors, revealing that part-of-speech information, rather than entity recognition, is crucial for MWPs, which may significantly contribute to their compressibility. 
The improvements in efficiency and cost reduction provide substantial value for intelligent tutoring systems and significantly advance the field of intelligent education.
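
As a rough illustration of teacher-student distillation, the sketch below computes the standard temperature-scaled soft-label loss. This summary does not state which loss the authors use, so this generic formulation is an assumption; the function names and the default temperature are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# The loss is zero when the student matches the teacher and positive otherwise.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```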

[17] Truth, Trust, and Trouble: Medical AI on the Edge

Mohammad Anas Azeez,Rafiq Ali,Ebad Shabbir,Zohaib Hasan Siddiqui,Gautam Siddharth Kashyap,Jiechao Gao,Usman Naseem

Main category: cs.CL

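TL;DR: This study evaluates large language models such as AlpaCare-13B and BioMistral-7B-DARE on medical question answering, finding that each has distinct strengths in accuracy and safety, while complex questions remain a challenge.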

Details Motivation: Ensuring that large language models meet the healthcare industry's standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. Method: Rigorous benchmarking on a dataset of over 1,000 health questions, assessing model performance on honesty, helpfulness, and harmlessness. Result: AlpaCare-13B achieved the highest accuracy (91.7%) and harmlessness (0.92), while BioMistral-7B-DARE showed improved safety (0.90). Few-shot prompting improved accuracy from 78% to 85%, and all models showed reduced helpfulness on complex queries. Conclusion: AlpaCare-13B has the highest accuracy and harmlessness, while BioMistral-7B-DARE improves safety through domain-specific tuning. Few-shot prompting improves accuracy, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA. Abstract: Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models -- Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

[18] From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought

Wentao Tan,Qiong Cao,Yibing Zhan,Chao Xue,Changxing Ding

Main category: cs.CL

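TL;DR: This paper proposes SMART, a new framework that uses Answer-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths, enhancing the reasoning abilities of multimodal large language models.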

Details Motivation: Current methods focus mainly on synthesizing positive rationales and overlook the important role of negative rationales in training models to recognize flawed reasoning patterns. Method: The paper proposes SMART, a framework that uses Answer-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths, followed by self-alignment to enhance the model's reasoning abilities. Result: Models trained on AoT-generated data outperform those trained on manually annotated datasets, showing stronger reasoning capabilities. Experiments show that the SMART framework significantly improves the reasoning abilities of various MLLMs. Conclusion: The SMART framework significantly improves the reasoning abilities of various MLLMs, and its iterative generation-optimization method can continually enhance the model's reasoning skills. Abstract: Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methodologies primarily focus on synthesizing positive rationales, while overlooking the critical role of negative rationales in training models to discern flawed reasoning patterns. To address this gap, we propose a novel framework: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought (SMART). This framework enables models to utilize Answer-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths, followed by self-alignment to enhance their reasoning abilities. Inspired by human strategies for solving proof-based problems, AoT uses answers as a guide to help the model extract critical visual information that links questions and answers. When provided with ground truth answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with misleading alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. This encourages the use of improved models to generate higher-quality preference data for further optimization. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model's reasoning skills. 
Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code, datasets, and models will be released.

[19] GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models

Seshu Tirupathi,Dhaval Salwala,Elizabeth Daly,Inge Vejsbjerg

Main category: cs.CL

TL;DR: This paper proposes GAF-Guard, an agentic framework for governing large language models (LLMs) by focusing on user needs, use-cases, and model behavior to enhance AI safety through continuous risk monitoring and reporting.

Details Motivation: As LLMs are increasingly used across domains, there is a need for rigorous governance to prevent unintended consequences, align with human values, and adapt to specific use-case requirements beyond current LLM-specific monitoring systems. Method: The paper introduces GAF-Guard, a novel agentic framework that models autonomous agents to identify risks, activate detection tools, and enable continuous monitoring tailored to the user, the use-case, and the model. Result: GAF-Guard enables enhanced AI safety and alignment with user expectations by providing a structured approach for continuous risk detection and monitoring in LLM-based applications. Conclusion: GAF-Guard offers a comprehensive and adaptable solution for LLM governance that addresses limitations of existing monitoring systems by centering on users, use-cases, and the model lifecycle. Abstract: As Large Language Models (LLMs) continue to be increasingly applied across various domains, their widespread adoption necessitates rigorous monitoring to prevent unintended negative consequences and ensure robustness. Furthermore, LLMs must be designed to align with human values, like preventing harmful content and ensuring responsible usage. The current automated systems and solutions for monitoring LLMs in production are primarily centered on LLM-specific concerns like hallucination etc, with little consideration given to the requirements of specific use-cases and user preferences. This paper introduces GAF-Guard, a novel agentic framework for LLM governance that places the user, the use-case, and the model itself at the center. The framework is designed to detect and monitor risks associated with the deployment of LLM based applications. The approach models autonomous agents that identify risks, activate risk detection tools, within specific use-cases and facilitate continuous monitoring and reporting to enhance AI safety, and user expectations. 
The code is available at https://github.com/IBM/risk-atlas-nexus-demos/tree/main/gaf-guard.

[20] A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements

Reham Alharbi,Valentina Tamma,Terry R. Payne,Jacopo de Berardinis

Main category: cs.CL

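TL;DR: This paper compares three approaches to formulating competency questions (CQs) and finds that LLMs can be used to elicit CQs initially, but the results generally require a further refinement step before they can be used to model requirements.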

Details Motivation: Many different CQ formulation approaches have been proposed in the literature, but few studies characterise their outputs or compare them systematically. Method: An empirical comparative evaluation of three distinct CQ formulation approaches: manual formulation by ontology engineers, instantiation of CQ patterns, and generation using state-of-the-art LLMs. Result: The study shows that different CQ generation approaches have different characteristics, and it produces a multi-annotator dataset of CQs generated from the same source using different methods. Conclusion: LLMs can be used to elicit CQs initially, but the results generally require a further refinement step before they can be used to model requirements. Abstract: Competency Questions (CQs) are pivotal in knowledge engineering, guiding the design, validation, and testing of ontologies. A number of diverse formulation approaches have been proposed in the literature, ranging from completely manual to Large Language Model (LLM) driven ones. However, attempts to characterise the outputs of these approaches and their systematic comparison are scarce. This paper presents an empirical comparative evaluation of three distinct CQ formulation approaches: manual formulation by ontology engineers, instantiation of CQ patterns, and generation using state of the art LLMs. We generate CQs using each approach from a set of requirements for cultural heritage, and assess them across different dimensions: degree of acceptability, ambiguity, relevance, readability and complexity. Our contribution is twofold: (i) the first multi-annotator dataset of CQs generated from the same source using different methods; and (ii) a systematic comparison of the characteristics of the CQs resulting from each approach. Our study shows that different CQ generation approaches have different characteristics and that LLMs can be used as a way to initially elicit CQs, however these are sensitive to the model used to generate CQs and they generally require a further refinement step before they can be used to model requirements.

[21] `For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Annika M Schoene,Cansu Canca

Main category: cs.CL

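TL;DR: This paper shows that the safety mechanisms of current large language models have vulnerabilities: specific techniques can bypass their content filters and produce potentially harmful content. It recommends a more systematic approach to AI safety.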

Details Motivation: Although the safety protocols and features of large language models (LLMs) are increasingly sophisticated, they remain susceptible to novel and creative adversarial prompts, so the effectiveness of these safeguards needs to be studied. Method: An empirical evaluation of six widely available LLMs using multi-step, prompt-level jailbreaking, demonstrating that built-in content and safety filters can be bypassed. Result: The study shows that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. The empirical evaluation demonstrates the generalizability and reliability of the bypass. Conclusion: The study recommends a more comprehensive and systematic approach to AI safety and ethics, while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. Abstract: Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking and bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions that they present for their implications on prompt-response filtering and context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.

[22] Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs

Akram Mustafa,Usman Naseem,Mostafa Rahimi Azghadi

Main category: cs.CL

TL;DR: This study shows that while large language models (LLMs) can help with classifying medical codes from hospital records, they aren't reliable enough to work without human oversight.

Details Motivation: Classifying ICD-10 codes from hospital discharge summaries is a critical but error-prone task in healthcare. Method: Evaluated 11 LLMs on 1,500 MIMIC-IV dataset summaries, focusing on the 10 most frequent ICD-10 codes, using cTAKES for medical term extraction and consistent prompting. Result: No model achieved an F1 score above 57%, performance dropped with code specificity, reasoning-based models outperformed non-reasoning ones, Gemini 2.5 Pro performed best overall, and some codes were classified more accurately than others. Conclusion: LLMs can assist human coders but are not yet reliable for full automation in classifying ICD-10 codes from hospital discharge summaries. Abstract: This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable enough for full automation. Future work should explore hybrid methods, domain-specific model training, and the use of structured clinical data.
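
To make the reported drop in F1 with increasing code specificity concrete, the sketch below scores the same prediction at two granularities by truncating codes to a prefix. The helper names, the choice of a 3-character category level, and the example codes are illustrative assumptions, not the study's evaluation script.

```python
from typing import Set


def f1(gold: Set[str], predicted: Set[str]) -> float:
    """Set-based F1 between gold and predicted codes for one summary."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def truncate(codes: Set[str], n_chars: int) -> Set[str]:
    """Collapse codes to a coarser level by keeping the first n characters."""
    return {code.replace(".", "")[:n_chars] for code in codes}


gold = {"I25.10"}      # illustrative gold label
predicted = {"I25.9"}  # illustrative model output: right category, wrong subcode

print(f1(truncate(gold, 3), truncate(predicted, 3)))  # category level: 1.0
print(f1(gold, predicted))                            # full code: 0.0
```

A prediction that lands in the right category but picks the wrong subcode counts as correct at the coarse level and as a miss at the full-code level, which is one way scores can fall as specificity increases.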

[23] Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages

Wanru Zhao,Yihong Chen,Royson Lee,Xinchi Qiu,Yan Gao,Hongxiang Fan,Nicholas D. Lane

Main category: cs.CL

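TL;DR: This paper proposes a federated prompt tuning paradigm for multilingual, low-resource language scenarios. By addressing data-sharing restrictions and linguistic differences, it improves data efficiency and cross-lingual performance, with especially strong gains for low-resource languages.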

Details Motivation: Fine-tuning pre-trained large language models (LLMs) in multilingual scenarios faces two major challenges, the physical border (data-sharing restrictions) and the linguistic border (linguistic differences), which particularly affect low-resource languages. Method: The authors design a federated prompt tuning paradigm for multilingual scenarios and analyze their experiments using a novel notion of language distance. Result: Compared to traditional local cross-lingual transfer tuning methods, the method achieves 6.9% higher accuracy even under computational constraints, and shows greater stability and generalization. Conclusion: The study highlights the advantages of federated prompt tuning for multilingual, low-resource language scenarios, showing its potential to improve data efficiency and mutual enhancement across languages, and emphasizes its role in promoting social equality and linguistic diversity. Abstract: Pre-trained large language models (LLMs) have become a cornerstone of modern natural language processing, with their capabilities extending across a wide range of applications and languages. However, the fine-tuning of multilingual LLMs, especially for low-resource languages, faces significant challenges arising from data-sharing restrictions (the physical border) and inherent linguistic differences (the linguistic border). These barriers hinder users of various languages, particularly those in low-resource regions, from fully benefiting from the advantages of LLMs. To address these challenges, we propose the Federated Prompt Tuning Paradigm for multilingual scenarios, which utilizes parameter-efficient fine-tuning while adhering to data sharing restrictions. We design a comprehensive set of experiments and analyze them using a novel notion of language distance to highlight the strengths of our paradigm: Even under computational constraints, our method not only improves data efficiency but also facilitates mutual enhancements across languages, particularly benefiting low-resource ones. Compared to traditional local cross-lingual transfer tuning methods, our approach achieves 6.9% higher accuracy with improved data efficiency, and demonstrates greater stability and generalization. These findings underscore the potential of our approach to promote social equality and champion linguistic diversity, ensuring that no language is left behind.

[24] CLUES: Collaborative High-Quality Data Selection for LLMs via Training Dynamics

Wanru Zhao,Hongxiang Fan,Shell Xu Hu,Wangchunshu Zhou,Bofan Chen,Nicholas D. Lane

Main category: cs.CL

TL;DR: This paper proposes a novel data quality control method for collaborative LLM training, leveraging training dynamics influence to select high-quality data without direct data sharing, resulting in improved performance over existing techniques.

Details Motivation: Recent studies emphasize the importance of data quality in scaling large language models (LLMs). However, automated data quality control faces challenges in collaborative environments where direct sharing between data silos is restricted. Method: This paper introduces a novel data quality control technique based on data influence on the training dynamics of LLMs. It computes per-sample gradients with respect to private data and an anchor dataset, using the trace of accumulated inner products as a quality indicator. The approach involves centralized model updates via model merging or federated learning in a collaborative training framework. Result: Experiments demonstrate that training with high-quality data selected by the proposed method outperforms other data selection approaches for collaborative fine-tuning of LLMs across various private domain datasets, including medical, multilingual, and financial contexts. Conclusion: The proposed method improves data quality control in collaborative settings by leveraging training dynamics influence, leading to better performance than existing data selection methods for LLMs fine-tuning across diverse domains. Abstract: Recent research has highlighted the importance of data quality in scaling large language models (LLMs). However, automated data quality control faces unique challenges in collaborative settings where sharing is not allowed directly between data silos. To tackle this issue, this paper proposes a novel data quality control technique based on the notion of data influence on the training dynamics of LLMs, that high quality data are more likely to have similar training dynamics to the anchor dataset. We then leverage the influence of the training dynamics to select high-quality data from different private domains, with centralized model updates on the server side in a collaborative training fashion by either model merging or federated learning. 
As for the data quality indicator, we compute the per-sample gradients with respect to the private data and the anchor dataset, and use the trace of the accumulated inner products as a measurement of data quality. In addition, we develop a quality control evaluation tailored for collaborative settings with heterogeneous domain data. Experiments show that training on the high-quality data selected by our method can often outperform other data selection methods for collaborative fine-tuning of LLMs, across diverse private domain datasets, in medical, multilingual and financial settings. Our code is released at github.com/Ryan0v0/CLUES.
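
The quality indicator described above can be illustrated with a toy linear model. The sketch below is a simplified reading, not the authors' implementation: it accumulates, over a few hypothetical checkpoints, the inner product between each private sample's gradient and the mean gradient on the anchor set. The squared-error loss, the checkpoints, and all data values are assumptions made for illustration; the exact matrix whose trace the paper takes is not specified in this summary.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Vector, float]  # (features, target)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def grad(w: Vector, example: Example) -> Vector:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one sample."""
    x, y = example
    error = dot(w, x) - y
    return [error * xi for xi in x]


def mean_grad(w: Vector, data: List[Example]) -> Vector:
    grads = [grad(w, ex) for ex in data]
    return [sum(column) / len(grads) for column in zip(*grads)]


def quality_scores(checkpoints: List[Vector],
                   private: List[Example],
                   anchor: List[Example]) -> List[float]:
    """Sum over checkpoints of <private-sample gradient, mean anchor gradient>."""
    scores = [0.0] * len(private)
    for w in checkpoints:
        anchor_direction = mean_grad(w, anchor)
        for i, example in enumerate(private):
            scores[i] += dot(grad(w, example), anchor_direction)
    return scores


anchor = [([1.0], 2.0), ([2.0], 4.0)]    # anchor data follows y = 2x
private = [([1.0], 2.0), ([1.0], -2.0)]  # one clean sample, one mislabeled
checkpoints = [[0.0], [1.0]]             # hypothetical training snapshots
print(quality_scores(checkpoints, private, anchor))  # [12.5, -17.5]
```

The clean sample pulls the model in the same direction as the anchor set and receives the higher score, so a selection step would keep it and drop the mislabeled one.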

[25] Beyond cognacy

Gerhard Jäger

Main category: cs.CL

TL;DR: This paper evaluates automated methods for computational phylogenetics in historical linguistics, showing that MSA-based inference outperforms traditional cognate-based approaches and offers scalable solutions for global-scale language phylogenies.

Details Motivation: Standard phylogenetic methods in historical linguistics rely on expert-annotated cognate sets that are sparse, labor-intensive, and limited to specific language families. This work explores scalable alternatives. Method: The paper compares two automated methods—automatic cognate clustering with unigram/concept features and MSA derived from a pair-hidden Markov model—against expert classifications from Glottolog and typological data from Grambank. Result: MSA-based inference yields trees that are more consistent with linguistic classifications, better predict typological variation, and provide a clearer phylogenetic signal. Conclusion: MSA-based inference is a promising and scalable alternative to traditional cognate-based methods, offering better consistency with linguistic classifications and prediction of typological variation. Abstract: Computational phylogenetics has become an established tool in historical linguistics, with many language families now analyzed using likelihood-based inference. However, standard approaches rely on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. This paper explores alternatives by comparing the established method to two fully automated methods that extract phylogenetic signal directly from lexical data. One uses automatic cognate clustering with unigram/concept features; the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. Both are evaluated against expert classifications from Glottolog and typological data from Grambank. Also, the intrinsic strengths of the phylogenetic signal in the characters are compared. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal, suggesting it as a promising, scalable alternative to traditional cognate-based methods. 
This opens new avenues for global-scale language phylogenies beyond expert annotation bottlenecks.

[26] PDFMathTranslate: Scientific Document Translation Preserving Layouts

Rongxin Ouyang,Chang Chu,Zhikuang Xin,Xiangyao Ma

Main category: cs.CL

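TL;DR: This paper presents PDFMathTranslate, a new tool for translating scientific documents that preserves document layout during translation, improves precision and efficiency, and has been widely adopted by the community.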

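Details Motivation: Language barriers in scientific documents hinder the diffusion and development of science and technology, and prior translation methods overlooked document layout information. Method: Using large language models and precise layout detection, the authors developed PDFMathTranslate, a translation tool that preserves document layouts. Result: The authors developed PDFMathTranslate, the world's first open-source software for translating scientific documents, which offers improved precision, flexibility, and efficiency and has more than 22,000 downloads. Conclusion: PDFMathTranslate is a successful open-source tool that supports the translation and dissemination of scientific documents and is already widely used in the community. Abstract: Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world's first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at https://github.com/byaidu/pdfmathtranslate with more than 22k downloads.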

[27] Subversion via Focal Points: Investigating Collusion in LLM Monitoring

Olli Järviniemi

Main category: cs.CL

TL;DR: This paper explores how language models can collude to subvert monitoring protocols in programming tasks, finding that while some models have low success rates, they can still collude on non-obvious signals.

Details Motivation: To evaluate language models' ability to subvert monitoring protocols through collusion. Method: Two instances of a model were used to design prompts for a policy (P) and a monitor (M) in a programming task setting, aiming to collaboratively classify backdoored programs in an auditing dataset as harmful while classifying a backdoored program produced by P as harmless. Result: The study found that the models could sometimes successfully collude on non-obvious signals despite a low success rate. Conclusion: Claude 3.7 Sonnet has a low success rate in subverting monitoring protocols due to poor convergence, but it can sometimes collude on non-obvious signals. Abstract: We evaluate language models' ability to subvert monitoring protocols via collusion. More specifically, we have two instances of a model design prompts for a policy (P) and a monitor (M) in a programming task setting. The models collaboratively aim for M to classify all backdoored programs in an auditing dataset as harmful, but nevertheless classify a backdoored program produced by P as harmless. The models are isolated from each other, requiring them to independently arrive at compatible subversion strategies. We find that while Claude 3.7 Sonnet has low success rate due to poor convergence, it sometimes successfully colludes on non-obvious signals.

[28] Beyond Overcorrection: Evaluating Diversity in T2I Models with DIVBENCH

Felix Friedrich,Thiemo Ganesha Welsch,Patrick Schramowski,Kristian Kersting

Main category: cs.CL

TL;DR: This paper presents DIVBENCH to evaluate diversification in text-to-image models, highlighting the need for context-aware methods to balance diversity and accuracy.

Details Motivation: Current diversification strategies for text-to-image models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are changed even when specified in prompts. Method: The authors introduced DIVBENCH, a benchmark and evaluation framework, to systematically assess state-of-the-art text-to-image models for both under- and over-diversification. Result: The study found that most models show limited diversity; however, many diversification approaches overcorrect by altering contextually-specified attributes. Context-aware methods were shown to address these issues effectively. Conclusion: Context-aware methods, like LLM-guided FairDiffusion and prompt rewriting, effectively manage under-diversity without causing over-diversification, thus balancing representation and semantic fidelity. Abstract: Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.

[29] OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering

Zipeng Qiu

Main category: cs.CL

TL;DR: This paper proposes an end-to-end framework with tool integration and reinforcement learning, achieving high accuracy in table question answering.

Details Motivation: The motivation is to improve open-domain table question answering by moving away from traditional two-stage pipelines toward a more integrated and adaptable approach that allows joint retrieval, reasoning, and query execution. Method: The paper introduces an end-to-end agentic framework that incorporates multi-turn tool calls, using a BM25+-based search API and a SQLite SQL executor. It also employs a two-stage fine-tuning process: supervised cold-start on easy questions followed by Async GRPO reinforcement learning on harder cases using LoRA adapters and a rollout buffer. Result: The proposed method achieves a significant improvement in accuracy, increasing exact match performance from single-digit zero-shot results to over 0.86 on a held-out test set. Conclusion: The paper concludes that integrating structured tool calls with targeted reinforcement learning fine-tuning significantly improves the accuracy and scalability of table question answering. Abstract: Open-domain table question answering traditionally relies on a two-stage pipeline: static table retrieval followed by a closed-domain answer. In contrast, we propose an end-to-end agentic framework that embeds multi-turn tool calls-using a BM25+-based search API and a SQLite SQL executor-directly into a large language model. To further adapt a compact 4B-parameter model, we introduce a two-stage fine-tuning process: supervised cold-start on easy questions, then Async GRPO reinforcement learning on harder cases with LoRA adapters and a rollout buffer. This unified approach enables the model to jointly retrieve, reason, and execute queries, yielding a dramatic accuracy improvement from single-digit zero-shot performance to over 0.86 exact match on a held-out test set. Our results underscore the effectiveness of integrating structured tool calls with targeted RL fine-tuning for scalable, accurate table QA. The code is available at https://github.com/TabibitoQZP/OpenTableR1.
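
A minimal sketch of what a SQL-executor tool in such an agent loop might look like, using Python's standard `sqlite3` module. The function name `run_sql`, the row limit, and the sample table are hypothetical; the paper's actual tool interface is not described in this summary, and the BM25+ search step is omitted.

```python
import sqlite3


def run_sql(conn: sqlite3.Connection, query: str, max_rows: int = 5) -> str:
    """Execute a model-issued query and return a short text observation."""
    try:
        rows = conn.execute(query).fetchmany(max_rows)
    except sqlite3.Error as exc:
        # Errors are returned as text so the model can revise its query.
        return f"ERROR: {exc}"
    if not rows:
        return "(no rows)"
    return "\n".join(", ".join(str(value) for value in row) for row in rows)


# Made-up table standing in for a retrieved table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, gold INTEGER)")
conn.executemany("INSERT INTO medals VALUES (?, ?)",
                 [("Norway", 16), ("Germany", 12)])

print(run_sql(conn, "SELECT country FROM medals ORDER BY gold DESC LIMIT 1"))
print(run_sql(conn, "SELECT * FROM missing_table"))
```

Returning errors as observations instead of raising lets a multi-turn agent retry with a corrected query, which is one plausible reason to wrap execution this way.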

[30] The Book of Life approach: Enabling richness and scale for life course research

Mark D. Verhagen,Benedikt Stroebl,Tiffany Liu,Lydia T. Liu,Matthew J. Salganik

Main category: cs.CL

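TL;DR: This paper proposes a new approach to life course research that combines qualitative depth with quantitative scale, turning complex log data into rich textual accounts of individual lives suited to large language models, and publicly releases an accompanying toolkit.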

Details Motivation: Traditional life course research faces a trade-off: qualitative methods offer rich data but are limited to small samples, while quantitative methods sacrifice data richness for larger samples. The paper aims to combine the strengths of both. Method: The paper presents a flexible procedure that transforms complex log data into textual representations of individuals' life trajectories across multiple domains, over time, and in social context, in a form suited to large language models, and demonstrates feasibility with Dutch population-scale registry data. Result: The authors created a data representation called a "book of life", wrote over 100 million books of life from Dutch population-scale registry data, and open-sourced the corresponding toolkit, BOLT. Conclusion: The paper proposes a hybrid approach that combines the depth of qualitative methods with the scale of quantitative methods, realized through the book of life toolkit (BOLT), and invites the research community to explore its potential applications. Abstract: For over a century, life course researchers have faced a choice between two dominant methodological approaches: qualitative methods that analyze rich data but are constrained to small samples, and quantitative survey-based methods that study larger populations but sacrifice data richness for scale. Two recent technological developments now enable us to imagine a hybrid approach that combines some of the depth of the qualitative approach with the scale of quantitative methods. The first development is the steady rise of "complex log data," behavioral data that is logged for purposes other than research but that can be repurposed to construct rich accounts of people's lives. The second is the emergence of large language models (LLMs) with exceptional pattern recognition capabilities on plain text. In this paper, we take a necessary step toward creating this hybrid approach by developing a flexible procedure to transform complex log data into a textual representation of an individual's life trajectory across multiple domains, over time, and in context. We call this data representation a "book of life." We illustrate the feasibility of our approach by writing over 100 million books of life covering many different facets of life, over time and placed in social context using Dutch population-scale registry data. We open source the book of life toolkit (BOLT), and invite the research community to explore the many potential applications of this approach.

[31] Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation

Johnson Thomas,Ayush Mudgal,Wendao Liu,Nisten Tahiraj,Zeeshaan Mohammed,Dhruv Diddi

Main category: cs.CL

TL;DR: This paper presents an on-device medical transcription system using a fine-tuned Llama model that improves clinical documentation efficiency while preserving privacy.

Details Motivation: Clinical documentation imposes a significant administrative burden on physicians, and although large language models offer promising solutions, their adoption in healthcare is limited by privacy concerns and computational demands. Method: The study fine-tuned a Llama 3.2 1B model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on synthetic data. The model was evaluated using statistical metrics and LLM-as-judge assessments. Result: The fine-tuned OnDevice model showed substantial improvements over the base model, including increases in ROUGE-1 and BERTScore F1 scores, reduced hallucinations, and improved factual correctness in clinical quality assessments. Conclusion: Fine-tuning compact LLMs for medical transcription yields clinically meaningful improvements while enabling complete on-device browser deployment, addressing key barriers to AI adoption in healthcare such as privacy preservation, cost reduction, and accessibility. Abstract: Background: Clinical documentation represents a significant burden for healthcare providers, with physicians spending up to 2 hours daily on administrative tasks. Recent advances in large language models (LLMs) offer promising solutions, but privacy concerns and computational requirements limit their adoption in healthcare settings. Objective: To develop and evaluate a privacy-preserving, on-device medical transcription system using a fine-tuned Llama 3.2 1B model capable of generating structured medical notes from medical transcriptions while maintaining complete data sovereignty entirely in the browser. Methods: We fine-tuned a Llama 3.2 1B model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on 1,500 synthetic medical transcription-to-structured note pairs. The model was evaluated against the base Llama 3.2 1B on two datasets: 100 endocrinology transcripts and 140 modified ACI benchmark cases. 
Evaluation employed both statistical metrics (ROUGE, BERTScore, BLEURT) and LLM-as-judge assessments across multiple clinical quality dimensions. Results: The fine-tuned OnDevice model demonstrated substantial improvements over the base model. On the ACI benchmark, ROUGE-1 scores increased from 0.346 to 0.496, while BERTScore F1 improved from 0.832 to 0.866. Clinical quality assessments showed marked reduction in major hallucinations (from 85 to 35 cases) and enhanced factual correctness (2.81 to 3.54 on 5-point scale). Similar improvements were observed on the internal evaluation dataset, with composite scores increasing from 3.13 to 4.43 (+41.5%). Conclusions: Fine-tuning compact LLMs for medical transcription yields clinically meaningful improvements while enabling complete on-device browser deployment. This approach addresses key barriers to AI adoption in healthcare: privacy preservation, cost reduction, and accessibility for resource-constrained environments.
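As a rough illustration of two numbers reported in this entry, the sketch below computes a simple unigram-overlap ROUGE-1 F1 and the relative gain behind the "+41.5%" figure. The whitespace tokenization and the function names `rouge1_f1` and `relative_gain` are assumptions; the paper's evaluation presumably used a standard ROUGE package with its own tokenization.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 with clipped counts (whitespace tokens, assumed)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def relative_gain(before, after):
    """Relative improvement in percent."""
    return (after - before) / before * 100

# Composite score reported in the abstract: 3.13 -> 4.43.
gain = relative_gain(3.13, 4.43)  # about 41.5 percent
```

The composite-score check confirms that the reported +41.5% is a relative gain over the base model, not an absolute difference on the 5-point scale.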

[32] Cautious Next Token Prediction

Yizhou Wang,Lingzhi Zhang,Yue Bai,Mang Tik Chiu,Zhengmian Hu,Mingyuan Zhang,Qihua Dong,Yu Yin,Sohrab Amirghodsi,Yun Fu

Main category: cs.CL

TL;DR: This paper introduces CNTP, a new decoding strategy for LLMs that improves performance by adaptively sampling multiple paths and selecting the most reliable one.

Details Motivation: Existing decoding strategies like temperature scaling and nucleus sampling lead to inferior performance when models are uncertain. Method: CNTP uses prediction entropy to determine the number of sampling trials, selecting the most reliable path based on perplexity scores. Result: Experiments show CNTP consistently outperforms standard methods and enhances self-consistency techniques. Conclusion: The proposed CNTP method offers a training-free decoding strategy that outperforms existing approaches for LLMs and has the potential to become a default choice. Abstract: The next-token prediction paradigm has prevailed for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such an approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a brand new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. Then we select the trial with the lowest perplexity score, viewed as the most probable and reliable trial path given the model's capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human beings' behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin. 
Moreover, the integration of CNTP with self-consistency can further improve over vanilla self-consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.
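The decoding loop described in this abstract can be sketched on a toy next-token distribution. Everything below is an illustrative assumption rather than the authors' code: the `toy_model` function, the entropy threshold of 0.5, the linear rule mapping entropy to trial count, and the stop-token set are invented for the example.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (nats) of a token -> probability dict."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def sample(probs, rng):
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

def sample_trial(next_probs, prefix, rng, stop_tokens, max_len=20):
    """Sample tokens until punctuation; return (tokens, perplexity)."""
    tokens, logp = [], 0.0
    for _ in range(max_len):
        probs = next_probs(prefix + tokens)
        tok = sample(probs, rng)
        tokens.append(tok)
        logp += math.log(probs[tok])
        if tok in stop_tokens:
            break
    return tokens, math.exp(-logp / len(tokens))

def cautious_step(next_probs, prefix, rng, threshold=0.5,
                  max_trials=5, stop_tokens=(".", ",")):
    """One cautious step: a single token if confident, else the best trial."""
    probs = next_probs(prefix)
    h = entropy(probs)
    if h <= threshold:
        return [sample(probs, rng)]  # confident: ordinary sampling
    # Assumed schedule: more trials as entropy approaches its maximum.
    n_trials = max(2, round(max_trials * h / math.log(len(probs))))
    trials = [sample_trial(next_probs, prefix, rng, stop_tokens)
              for _ in range(n_trials)]
    return min(trials, key=lambda t: t[1])[0]  # lowest perplexity wins

def toy_model(context):
    """Hypothetical model: uncertain at sentence start, confident otherwise."""
    if not context or context[-1] == ".":
        return {"a": 0.4, "b": 0.3, "c": 0.3}
    return {".": 0.9, "a": 0.1}

rng = random.Random(0)
chunk = cautious_step(toy_model, [], rng)        # high entropy: several trials
single = cautious_step(toy_model, ["a"], rng)    # low entropy: one token
```

With a real LLM, `next_probs` would wrap a forward pass over the current context, and the threshold and trial schedule would need tuning.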

[33] Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction

Yuyang Lou,Charles Li

Main category: cs.CL

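TL;DR: This study introduces Pref-LSTM, a method that combines a BERT-based classifier with an LSTM memory module to model user preferences without extensive overhead or fine-tuning.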

Details Motivation: Memory storage for large language models (LLMs) is an increasingly active research area, particularly for enabling personalization across long conversations. Method: The authors propose Pref-LSTM, a dynamic and lightweight framework that combines a BERT-based classifier with an LSTM memory module to generate memory embeddings, which are soft-prompt injected into a frozen LLM. Result: Although the LSTM-based memory encoder did not yield strong results, the BERT-based classifier reliably identifies explicit and implicit user preferences. Conclusion: Preference filtering with LSTM gating principles is a viable path toward scalable user preference modeling without extensive overhead or fine-tuning. Abstract: Memory storage for large language models (LLMs) is becoming an increasingly active area of research, particularly for enabling personalization across long conversations. We propose Pref-LSTM, a dynamic and lightweight framework that combines a BERT-based classifier with an LSTM memory module that generates a memory embedding, which is then soft-prompt injected into a frozen LLM. We synthetically curate a dataset of preference and non-preference conversation turns to train our BERT-based classifier. Although our LSTM-based memory encoder did not yield strong results, we find that the BERT-based classifier performs reliably in identifying explicit and implicit user preferences. Our research demonstrates the viability of using preference filtering with LSTM gating principles as an efficient path towards scalable user preference modeling, without extensive overhead and fine-tuning.

[34] K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

Shuhe Li,Chenxu Guo,Jiachen Lian,Cheol Jun Cho,Wenshuo Zhao,Xuanru Zhou,Dingkun Zhou,Sam Wang,Grace Wang,Jingze Yang,Jingyi Xu,Ruohan Bao,Elise Brenner,Brandon In,Francesca Pei,Maria Luisa Gorno-Tempini,Gopala Anumanchipalli

Main category: cs.CL

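TL;DR: K-Function is a unified framework for assessing children's language that combines accurate sub-word transcription, objective scoring, and actionable feedback.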

Details Motivation: Early assessment of children's language is hampered by high pitch, long phones, and sparse data, which cause automatic speech recognizers to fail. Method: The paper introduces Kids-WFST, which merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Result: Kids-WFST reaches a 1.39% phoneme error rate on MyST and 8.61% on Multitudes, improvements of 10.47 and 7.06 percentage points over a greedy-search decoder. Conclusion: Precise phoneme recognition establishes a complete diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment. Abstract: Early evaluation of children's language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Kids-WFST attains 1.39% phoneme error on MyST and 8.61% on Multitudes, absolute gains of 10.47 and 7.06 points over a greedy-search decoder. These high-fidelity transcripts power an LLM that grades verbal skills, milestones, reading, and comprehension, aligning with human proctors and supplying tongue-and-lip visualizations plus targeted advice. The results show that precise phoneme recognition cements a complete diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment.
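Phoneme error rate is conventionally the edit distance between reference and hypothesis phoneme sequences divided by the reference length. The sketch below shows that standard definition; whether the paper scores in exactly this way is an assumption, and the phoneme sequences are invented examples.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Invented example: one substituted phoneme out of three.
per = phoneme_error_rate(["K", "AE", "T"], ["K", "AH", "T"])
```

If the reported gains are absolute percentage points, the implied greedy-search baselines would be roughly 11.86% on MyST and 15.67% on Multitudes.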

[35] Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation

Yutian Liu,Zhengyi Yang,Jiancan Wu,Xiang Wang

Main category: cs.CL

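TL;DR: This paper proposes CETRec, a framework that uses causal inference and counterfactual analysis to address LLMs' failure to fully exploit the temporal information in user interactions for sequential recommendation, improving personalized recommendation.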

Details Motivation: Existing LLM-based methods fail to fully exploit the rich temporal information in users' historical interaction sequences, because self-attention lacks inherent sequence ordering and position embeddings are designed primarily for natural language. Method: Grounded in causal inference, the approach treats temporal order as an independent causal factor and quantifies its unique contribution to recommendation outcomes through counterfactual reasoning. A new counterfactual tuning objective directly optimizes the model's temporal sensitivity. Result: CETRec can distinguish and measure the difference between recommendations made with and without temporal information, improving the model's ability to recognize absolute timestamps and relative ordering patterns. Conclusion: CETRec improves LLMs' understanding and use of temporal information in user interaction histories, allowing them to capture evolving user interests more accurately and make personalized recommendations. Abstract: Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users' historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. By conceptualizing temporal order as an independent causal factor distinct from item content, we can quantify its unique contribution through counterfactual reasoning, comparing what recommendations would be made with and without temporal information while keeping all other factors constant. This causal framing enables CETRec to design a novel counterfactual tuning objective that directly optimizes the model's temporal sensitivity, teaching LLMs to recognize both absolute timestamps and relative ordering patterns in user histories. 
Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs' awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items).

[36] Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL)

Sudesh Bhagat,Ibne Farabi Shihab,Jonathan Wood

Main category: cs.CL

TL;DR: This paper compares machine learning and deep learning models for detecting misclassified intersection-related crashes using police-reported narratives, finding that hybrid methods combining expert review with automated classification improve crash data quality.

Details Motivation: To address the gap in transportation safety research by accurately identifying misclassified crash data through advanced ML/DL techniques. Method: The study uses crash data from Iowa's Department of Transportation and implements various ML and DL models such as SVM, XGBoost, BERT Embeddings, and Albert Model, which are validated against expert reviews. Result: Traditional ML models showed better overall performance, but the Albert Model achieved the highest agreement with expert classifications (73%) and original data (58%), outperforming other models, especially on ambiguous cases. Conclusion: Hybrid approaches integrating automated classification with expert review can significantly reduce error rates, offering practical improvements in crash data quality with implications for transportation safety policies. Abstract: This research investigates the efficacy of machine learning (ML) and deep learning (DL) methods in detecting misclassified intersection-related crashes in police-reported narratives. Using 2019 crash data from the Iowa Department of Transportation, we implemented and compared a comprehensive set of models, including Support Vector Machine (SVM), XGBoost, BERT Sentence Embeddings, BERT Word Embeddings, and Albert Model. Model performance was systematically validated against expert reviews of potentially misclassified narratives, providing a rigorous assessment of classification accuracy. Results demonstrated that while traditional ML methods exhibited superior overall performance compared to some DL approaches, the Albert Model achieved the highest agreement with expert classifications (73% with Expert 1) and original tabular data (58%). Statistical analysis revealed that the Albert Model maintained performance levels similar to inter-expert consistency rates, significantly outperforming other approaches, particularly on ambiguous narratives. 
This work addresses a critical gap in transportation safety research through multi-modal integration analysis, which achieved a 54.2% reduction in error rates by combining narrative text with structured crash data. We conclude that hybrid approaches combining automated classification with targeted expert review offer a practical methodology for improving crash data quality, with substantial implications for transportation safety management and policy development.

[37] Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case

Alvaro Riquelme,Pedro Costa,Catalina Martinez

Main category: cs.CL

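TL;DR: This study presents a new method that uses large language models to convert clinical data into HL7 FHIR format, achieving efficient semi-automated data mapping and identifying directions for future improvement.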

Details Motivation: Semantic interoperability standards have long aimed to simplify clinical data exchange, but deploying them remains time-consuming, resource-intensive, and technically difficult. The study proposes a new method to improve efficiency and accuracy. Method: A semi-automated approach combines embedding techniques, clustering algorithms, and semantic retrieval to build prompts that guide GPT-4o and Llama 3.2 405b in mapping tabular fields from the MIMIC-IV database to HL7 FHIR resources. Result: In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2. Under real-world conditions accuracy dipped slightly to 94%, but refinements to the prompting strategy restored robust mappings. Conclusion: The study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR and lays the groundwork for semi-automated interoperability workflows. Future work will focus on model fine-tuning, support for additional standards, and an interactive interface for expert validation and iterative refinement. Abstract: For years, semantic interoperability standards have sought to streamline the exchange of clinical data, yet their deployment remains time-consuming, resource-intensive, and technically challenging. To address this, we introduce a semi-automated approach that leverages large language models, specifically GPT-4o and Llama 3.2 405b, to convert structured clinical datasets into HL7 FHIR format while assessing accuracy, reliability, and security. Applying our method to the MIMIC-IV database, we combined embedding techniques, clustering algorithms, and semantic retrieval to craft prompts that guide the models in mapping each tabular field to its corresponding FHIR resource. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2 thanks to the inclusion of FHIR resource schemas within the prompt. Under real-world conditions, accuracy dipped slightly to 94%, but refinements to the prompting strategy restored robust mappings. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate. Overall, our study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR, laying the groundwork for semi-automated interoperability workflows. Future work will focus on fine-tuning models with specialized medical corpora, extending support to additional standards such as HL7 CDA and OMOP, and developing an interactive interface to enable expert validation and iterative refinement.

[38] ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

YuXuan Zhang

Main category: cs.CL

TL;DR: The paper proposes Adaptive Reward-Following (ARF), a new framework for RLHF that aims to deliver more efficient, personalized results.

Details Motivation: Existing RLHF methods (e.g., PPO, DPO) rely on a binary-preference paradigm that requires substantial human effort and captures only group-level tendencies rather than individual preferences. Method: The Adaptive Reward-Following (ARF) framework uses a high-precision emotion analyzer to convert user feedback into continuous preference scores and applies lightweight data augmentations to enrich and debias them. A Dynamic Adapter Preference Tracker models evolving user tastes in real time, so that the new Trace Bias (TB) fine-tuning algorithm can optimize on the tracked rewards. Result: Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 show that ARF improves on PPO by 3.3% and on DPO by 7.6%. TB also remains theoretically aligned with the PPO and DPO objectives. Conclusion: ARF offers a scalable, personalized, and cost-effective approach to RLHF for LLMs, overcoming the limitations of existing methods through autonomous reward modeling. Abstract: With the rapid advancement of Reinforcement Learning from Human Feedback (RLHF) and autoregressive transformers, state-of-the-art models such as GPT-4.0, DeepSeek R1, and Llama 3.3 increasingly emphasize answer depth and personalization. However, most existing RLHF approaches (e.g., PPO, DPO) still rely on a binary-preference (BT) paradigm, which, while reducing annotation costs, still requires substantial human effort and captures only group-level tendencies rather than individual preferences. To overcome these limitations, we propose Adaptive Reward-Following (ARF), a self-assessment framework that leverages a high-precision emotion analyzer, achieving over 70% accuracy on GoEmotions, Sentiment140, and DailyDialog, to convert free-form user feedback into continuous preference scores. We further enrich and debias these signals through lightweight data augmentations, including synonym replacement, random trace truncation, and a score bias annotation algorithm. A Dynamic Adapter Preference Tracker continuously models evolving user tastes in real time, enabling our novel Trace Bias (TB) fine-tuning algorithm to optimize directly on these tracked rewards instead of coarse binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains demonstrate that ARF achieves an improvement of 3.3% over PPO and 7.6% over DPO. Moreover, TB preserves theoretical alignment with PPO and DPO objectives. Overall, ARF presents a scalable, personalized, and cost-effective approach to RLHF LLMs through autonomous reward modeling.

[39] RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Peisong Wang,Ruotian Ma,Bang Zhang,Xingyu Chen,Zhiwei He,Kang Luo,Qingsong Lv,Qingxuan Jiang,Zheng Xie,Shanyi Wang,Yuan Li,Fanghua Ye,Jian Li,Yifan Yang,Zhaopeng Tu,Xiaolong Li

Main category: cs.CL

TL;DR: This paper introduces RLVER, a reinforcement learning framework that enhances the emotional intelligence of large language models by leveraging verifiable emotion rewards from simulated users, leading to significant improvements in empathetic dialogue capabilities.

Details Motivation: While large language models excel at logical reasoning, their emotional intelligence (EQ) is underdeveloped. Reinforcement learning from verifiable rewards has shown promise in other areas but remains underexplored in dialogue systems focused on EQ. Method: The study introduces RLVER, an end-to-end reinforcement learning framework using verifiable emotion rewards from simulated users to train large language models (LLMs). The Qwen2.5-7B-Instruct model was fine-tuned using PPO to evaluate the effectiveness of RLVER. Result: Fine-tuning the Qwen2.5-7B-Instruct model with PPO improved its Sentient-Benchmark score from 13.3 to 79.2 while maintaining mathematical and coding capabilities. Experiments revealed consistent improvements in dialogue capabilities, distinct trends between thinking and non-thinking models, varying performance between GRPO and PPO, and the potential downside of overly challenging environments. Conclusion: RLVER provides a practical route towards developing emotionally intelligent and broadly capable language agents, showing that reinforcement learning with verifiable emotion rewards can significantly improve empathetic abilities in LLMs. Abstract: Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. 
Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) more challenging environments are not always better, and moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.

[40] ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Boyang Xue,Qi Zhu,Rui Wang,Sheng Wang,Hongru Wang,Fei Mi,Yasheng Wang,Lifeng Shang,Qun Liu,Kam-Fai Wong

Main category: cs.CL

TL;DR: This paper evaluates Large Language Models' (LLMs) reliability in solving and recognizing unsolvable mathematical problems, proposes the ReliableMath dataset, and introduces an alignment strategy to enhance smaller models' reliability.

Details Motivation: LLMs tend to generate unreliable responses when faced with unsolvable problems, undermining their reliability. While prior studies focused on knowledge-based tasks, mathematical reasoning tasks remained unexplored due to the lack of unsolvable math problems. This research aims to fill that gap and systematically evaluate and improve LLM reliability in mathematical reasoning. Method: The researchers created a ReliableMath dataset containing both solvable and unsolvable mathematical problems using an open-source framework and human evaluations. They tested various LLMs under different prompting strategies and proposed an alignment method to improve the reliability of smaller LLMs. Result: Experiments showed that LLMs generally fail to identify unsolvable math problems and fabricate answers. Larger LLMs improved performance on unsolvable problems with reliable prompts but still underperformed compared to their results on solvable ones. Smaller LLMs showed minimal improvement until an alignment strategy was applied, which significantly enhanced their reliability across tasks. Conclusion: The study concludes that while LLMs struggle with identifying unsolvable mathematical problems and often fabricate responses, larger LLMs can improve reliability when prompted correctly. However, smaller LLMs require an alignment strategy to significantly enhance their reliability on both in-domain and out-of-domain tasks. Abstract: Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. 
To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When LLMs are instructed to indicate unsolvability using a reliable prompt, the reliability of larger LLMs is maintained on solvable problems and notably improves on unsolvable problems, yet still falls short of that on solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs' reliability, which can significantly improve LLM reliability on both in-domain and out-of-domain tasks.

[41] From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models

Melanie Galea,Claudia Borg

Main category: cs.CL

TL;DR: This paper examines the challenges of mitigating gender bias in Maltese language models, showing that existing methods developed for English struggle to transfer effectively to morphologically complex, low-resource languages.

Details Motivation: The motivation stems from the susceptibility of Large Language Models (LLMs) to social biases, particularly harmful gender stereotypes, which can disproportionately impact marginalized communities, especially in low-resource and morphologically rich languages like Maltese. Method: This research investigates the transferability of debiasing methods to Maltese language models (BERTu and mBERTu), adapting bias measurement and mitigation techniques from English such as CrowS-Pairs, SEAT, Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. Result: The findings highlight the difficulties in transferring debiasing strategies from English to Maltese, underscoring the necessity for tailored solutions for morphologically rich and low-resourced languages. Conclusion: The study concludes that applying existing bias mitigation methods to linguistically complex languages like Maltese presents significant challenges, emphasizing the need for more inclusive approaches in multilingual NLP development. Abstract: The advancement of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), enabling performance across diverse tasks with little task-specific training. However, LLMs remain susceptible to social biases, particularly reflecting harmful stereotypes from training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese LMs, arguing that such bias is harmful as it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic in gendered, low-resource languages. While bias evaluation and mitigation efforts have progressed for English-centric models, research on low-resourced and morphologically rich languages remains limited. This research investigates the transferability of debiasing methods to Maltese language models, focusing on BERTu and mBERTu, BERT-based monolingual and multilingual models respectively. 
Bias measurement and mitigation techniques from English are adapted to Maltese, using benchmarks such as CrowS-Pairs and SEAT, alongside debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work in the study of gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages, underscoring the need for more inclusive approaches in the development of multilingual NLP.

[42] Expert-level validation of AI-generated medical text with scalable language models

Asad Aali,Vasiliki Bikia,Maya Varma,Nicole Chiou,Sophie Ostmeier,Arnav Singhvi,Magdalini Paschali,Ashwin Kumar,Andrew Johnston,Karimar Amador-Martinez,Eduardo Juan Perez Guerrero,Paola Naovi Cruz Rivera,Sergios Gatidis,Christian Bluethgen,Eduardo Pontes Reis,Eddy D. Zandee van Rilland,Poonam Laxmappa Hosamani,Kevin R Keet,Minjoung Go,Evelyn Ling,David B. Larson,Curtis Langlotz,Roxana Daneshjou,Jason Hom,Sanmi Koyejo,Emily Alsentzer,Akshay S. Chaudhari

Main category: cs.CL

TL;DR: MedVAL is a new self-supervised framework that improves language models' ability to evaluate medical text, and it has been open-sourced.

Details Motivation: As language models are increasingly used in clinical settings, an effective evaluation method is needed to ensure the accuracy and safety of the content they generate. Method: Evaluator models are trained on synthetic data, without physician labels or reference outputs, and tested on MedVAL-Bench. Result: MedVAL raises average F1 scores from 66% to 83% and improves the best-performing proprietary model (GPT-4o) by 8%. Conclusion: MedVAL is a self-supervised framework that significantly improves the accuracy and safety of language-model evaluation of medical text, approaching expert level. Abstract: With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. 
To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase ( https://github.com/StanfordMIMI/MedVAL ), 2) MedVAL-Bench ( https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench ), and 3) MedVAL-4B ( https://huggingface.co/stanfordmimi/MedVAL-4B ), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.

[43] Adversarial Manipulation of Reasoning Models using Internal Representations

Kureha Yamaguchi,Benjamin Etheridge,Andy Arditi

Main category: cs.CL

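TL;DR: This paper examines how vulnerable reasoning models are to jailbreak attacks while generating chain-of-thought (CoT). It finds that DeepSeek-R1-Distill-Llama-8B makes its refuse-or-comply decision during CoT generation, and it identifies a linear direction that predicts this decision, termed the "caution" direction.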

Details Motivation: To study how CoT generation in reasoning models affects their safety, in order to understand their vulnerability to jailbreak attacks. Method: The activation space of DeepSeek-R1-Distill-Llama-8B is analyzed during CoT generation to identify a linear direction that predicts the refuse-or-comply decision, followed by intervention experiments. Result: A "caution" direction associated with cautious reasoning patterns was found; ablating it increases harmful compliance and thereby jailbreaks the model, and intervening only on CoT token activations is enough to control the final output. Conclusion: The chain-of-thought generation process of reasoning models is a new target for adversarial manipulation, offering a new perspective for model safety research. Abstract: Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation
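
The "ablating this direction" step in the abstract can be illustrated with a small sketch. This is not the authors' code: it only shows the generic operation of projecting a direction out of a batch of activation vectors, and the array shapes, the random data, and the names `ablate_direction`, `acts`, and `caution` are all hypothetical. How the direction itself is estimated is not covered here.

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation vector along `direction`."""
    unit = direction / np.linalg.norm(direction)
    # Projection coefficient of every row onto the unit direction.
    coeffs = activations @ unit
    # Subtract the projected component, leaving the orthogonal remainder.
    return activations - np.outer(coeffs, unit)

# Hypothetical data: 4 CoT token positions with hidden size 8.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
caution = rng.normal(size=8)  # stand-in for a learned "caution" direction

ablated = ablate_direction(acts, caution)
```

After ablation, every row of `ablated` is orthogonal to `caution`, so a linear probe along that direction would read zero at each position.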

[44] How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?

Abeer Alessa,Akshaya Lakshminarasimhan,Param Somane,Julian Skirzynski,Julian McAuley,Jessica Echterhoff

Main category: cs.CL

TL;DR: Large language models may introduce biases and hallucinations, affecting user decisions, but targeted interventions can help mitigate these problems.

Details Motivation: As LLMs are increasingly used in critical applications such as medical diagnosis and legal analysis, understanding their biases and hallucinations is essential to prevent inadvertent influence on human decisions. Method: Three LLM families were assessed for their performance in summarization and news fact-checking tasks. The study evaluated consistency with context, occurrence of hallucinations, and primacy bias across 18 mitigation methods. Result: LLMs changed the sentiment of the context in 21.86% of cases, hallucinated on post-knowledge-cutoff data in 57.33% of cases, and showed primacy bias in 5.94% of cases. Some mitigation strategies proved effective. Conclusion: The study concludes that large language models (LLMs) often expose users to biased or hallucinated content, impacting high-stakes domains like healthcare and legal analysis. Targeted mitigation methods can be effective in reducing these issues. Abstract: Large language models (LLMs) are increasingly integrated into applications ranging from review summarization to medical diagnosis support, where they affect human decisions. Even though LLMs perform well in many tasks, they may also inherit societal or cognitive biases, which can inadvertently transfer to humans. We investigate when and how LLMs expose users to biased content and quantify its severity. Specifically, we assess three LLM families in summarization and news fact-checking tasks, evaluating how much LLMs stay consistent with their context and/or hallucinate. Our findings show that LLMs expose users to content that changes the sentiment of the context in 21.86% of the cases, hallucinates on post-knowledge-cutoff data questions in 57.33% of the cases, and primacy bias in 5.94% of the cases. We evaluate 18 distinct mitigation methods across three LLM families and find that targeted interventions can be effective. 
Given the prevalent use of LLMs in high-stakes domains, such as healthcare or legal analysis, our results highlight the need for robust technical safeguards and for developing user-centered interventions that address LLM limitations.

[45] A Lie-algebraic perspective on Tree-Adjoining Grammars

Isabella Senturia,Elizabeth Xiao,Matilde Marcolli

Main category: cs.CL

TL;DR: This paper introduces a mathematical approach to tree-adjoining grammars using graph theory, showing that the adjoining operation forms a Lie algebra, which simplifies the system by inherently capturing essential properties.

Details Motivation: The motivation is to provide a more elegant and efficient mathematical framework for tree-adjoining grammars that inherently captures key properties without requiring them to be explicitly defined as separate components. Method: The method involves applying two combinatorial definitions of graphs to analyze tree-adjoining grammars and demonstrates that the adjoining operation defines a pre-Lie operation, leading to the formation of a Lie algebra. Result: The result is a new formulation of TAG that naturally incorporates properties like null-adjoining constraints and feature TAG through its mathematical structure, showcasing the utility of the proposed perspective. Conclusion: The paper concludes that the novel mathematical implementation of tree-adjoining grammars using combinatorial definitions of graphs reveals that the adjoining operation forms a Lie algebra, which simplifies the TAG system by eliminating the need for additional components. Abstract: We provide a novel mathematical implementation of tree-adjoining grammars using two combinatorial definitions of graphs. With this lens, we demonstrate that the adjoining operation defines a pre-Lie operation and subsequently forms a Lie algebra. We demonstrate the utility of this perspective by showing how one of our mathematical formulations of TAG captures properties of the TAG system without needing to posit them as additional components of the system, such as null-adjoining constraints and feature TAG.
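
For readers unfamiliar with the step from a pre-Lie operation to a Lie algebra, the standard construction is sketched below. The symbol $\triangleleft$ is a placeholder for the adjoining operation, and the right-symmetric convention is an assumption; the paper may use the left-symmetric one, which yields a Lie bracket in the same way.

```latex
% Right pre-Lie identity: the associator is symmetric in its last two arguments.
(a \triangleleft b) \triangleleft c - a \triangleleft (b \triangleleft c)
  = (a \triangleleft c) \triangleleft b - a \triangleleft (c \triangleleft b)

% Induced bracket: antisymmetric by construction, and the Jacobi identity
% follows from the pre-Lie identity above.
[a, b] = a \triangleleft b - b \triangleleft a
```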

[46] KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation

Antoine Nzeyimana,Andre Niyongabo Rubungo

Main category: cs.CL

TL;DR: This paper proposes KinyaColBERT, a retriever model for low-resource languages that improves retrieval accuracy for RAG systems.

Details Motivation: To achieve high retrieval accuracy in low-resource settings and avoid hallucinations in LLMs using RAG. Method: KinyaColBERT integrates late word-level interactions and morphology-based tokenization with two-tier transformer encoding. Result: KinyaColBERT outperforms strong baselines and commercial APIs on a Kinyarwanda agricultural retrieval benchmark. Conclusion: KinyaColBERT provides an effective retrieval strategy for low-resource settings, improving the performance and cost-efficiency of RAG systems. Abstract: The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a morphology-based tokenization coupled with two-tier transformer encoding. This methodology results in lexically grounded contextual embeddings that are both fine-grained and self-contained. Our evaluation results indicate that KinyaColBERT outperforms strong baselines and leading commercial text embedding APIs on a Kinyarwanda agricultural retrieval benchmark. By adopting this retrieval strategy, we believe that practitioners in other low-resource settings can not only achieve reliable RAG systems but also deploy solutions that are more cost-effective.
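
The "late interaction" idea mentioned in the abstract can be sketched as follows. This is a generic ColBERT-style MaxSim scorer, not the KinyaColBERT implementation: the function name `late_interaction_score` and the toy 2-dimensional embeddings are hypothetical, and the paper's word-level units and morphology-based encoder are not modeled here.

```python
import numpy as np

def late_interaction_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Sum, over query units, of the best cosine match among document units."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_units, num_doc_units)
    return float(sim.max(axis=1).sum())

# Toy embeddings: two query units, and two candidate documents.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # matches both query units
doc_b = np.array([[1.0, 0.0]])                          # matches only the first

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
```

Because each query unit is matched independently against the document, `doc_a` scores higher than `doc_b` here; this per-unit matching is what makes the embeddings "fine-grained" in the sense the abstract describes.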

[47] RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Baolong Bi,Shenghua Liu,Xingzhang Ren,Dayiheng Liu,Junyang Lin,Yiwei Wang,Lingrui Mei,Junfeng Fang,Jiafeng Guo,Xueqi Cheng

Main category: cs.CL

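TL;DR: This paper proposes RefineX, a new framework for fine-grained refinement of large language model pre-training data at scale, addressing the efficiency and granularity limitations of traditional methods.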

Details Motivation: Existing rule-based filtering lacks the granularity needed to refine content within documents, and improving data quality at scale faces a trade-off between efficiency and effectiveness. Method: The RefineX framework performs large-scale, fine-grained data refinement through programmatic editing tasks and uses a high-precision distillation pipeline to train an efficient refine model. Result: On the 750M model, RefineX yields average gains of 2.6%-7.2% on lighteval tasks and achieves comparable performance with fewer training tokens. Conclusion: RefineX is a scalable, efficient, and reliable solution for optimizing pre-training data at scale, consistently improving performance across model scales and downstream tasks. Abstract: The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose RefineX, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C.
These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.

[48] GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

Himanshu Dutta,Sunny Manchanda,Prakhar Bapat,Meva Ram Gurjar,Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: This paper introduces GRAFT, a novel graph-based framework for document-level machine translation that improves discourse-level coherence and accuracy, achieving significant performance gains over existing methods.

Details Motivation: Document-level Machine Translation approaches often fail to capture discourse-level phenomena due to heuristic segmentation methods that do not align with true discourse structures, leading to inconsistency in translations. Method: Graph Augmented Agentic Framework for Document Level Translation (GRAFT), which uses a graph-based approach and Large Language Model (LLM) agents, was developed to capture discourse-level phenomena in Machine Translation. Result: Experiments showed that GRAFT outperforms state-of-the-art DocMT systems with an average improvement of 2.8 d BLEU on TED test sets and 2.3 d BLEU for domain-specific English-to-Chinese translation. Conclusion: The proposed GRAFT framework effectively addresses discourse level translation challenges by integrating segmentation, DAG-based dependency modeling, and discourse-aware translation into a cohesive system. Abstract: Document level Machine Translation (DocMT) approaches often struggle with effectively capturing discourse level phenomena. Existing approaches rely on heuristic rules to segment documents into discourse units, which rarely align with the true discourse structure required for accurate translation. Otherwise, they fail to maintain consistency throughout the document during translation. To address these challenges, we propose Graph Augmented Agentic Framework for Document Level Translation (GRAFT), a novel graph based DocMT system that leverages Large Language Model (LLM) agents for document translation. Our approach integrates segmentation, directed acyclic graph (DAG) based dependency modelling, and discourse aware translation into a cohesive framework. Experiments conducted across eight translation directions and six diverse domains demonstrate that GRAFT achieves significant performance gains over state of the art DocMT systems. 
Specifically, GRAFT delivers an average improvement of 2.8 d BLEU on the TED test sets from IWSLT2017 over strong baselines and 2.3 d BLEU for domain specific translation from English to Chinese. Moreover, our analyses highlight the consistent ability of GRAFT to address discourse level phenomena, yielding coherent and contextually accurate translations.

[49] Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs

Yuanxin Wang,Ganesh Venkatesh

Main category: cs.CL

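TL;DR: This paper explores how to give large language models a human-like "silent reading" ability in order to improve their comprehension and the quality of their generated responses.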

Details Motivation: Humans usually read silently to understand context and form thoughts before speaking, whereas existing LLMs lack this internal processing stage. Method: The paper provides the model with an initial contextual prompt or "reading space", and develops a "reading buddy" architecture in which an auxiliary component processes the input and supplies refined contextual insights to the generation model. Result: Even a simple "silent reading" approach noticeably improves model accuracy, indicating that these techniques have a clear effect on LLM performance. Conclusion: Introducing "silent reading" techniques lets LLMs engage in deeper comprehension and deliberation before generating a response, thereby improving their performance. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding text and generating high-quality responses. However, a critical distinction from human cognition is their typical lack of a distinct internal 'reading' or deliberation phase before 'speaking' (i.e., generating text). Humans often engage in silent reading to comprehend context and formulate thoughts prior to articulation. This paper investigates methods to imbue LLMs with a similar capacity for internal processing. We introduce and evaluate techniques that encourage LLMs to 'read silently.' Our findings indicate that even a straightforward approach, such as providing the model with an initial contextual prompt or 'reading space' before it begins predicting subsequent tokens for the final output, can yield significant performance improvements. We further enhance this concept by developing a 'reading buddy' architecture, where an auxiliary component silently processes the input and provides refined contextual insights to the primary generation model. These approaches aim to foster deeper understanding from LLMs so that they can produce better reasoned responses, moving them one step closer to more human-like text processing. Our results indicate that these simple techniques can provide surprisingly strong impact on accuracy with multiple point accuracy boost.

[50] SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge

Yuxiang Mei,Yuang Zheng,Dongxing Xu,Yanhua Long

Main category: cs.CL

TL;DR: This paper presents SHNU-mASR, a multilingual ASR system combining Whisper and mHuBERT encoders with an LLM, achieving strong results on the INTERSPEECH 2025 MLC-SLM Challenge.

Details Motivation: To develop a unified multilingual ASR framework that leverages complementary acoustic and linguistic knowledge from multiple pre-trained models and achieves improved performance without increasing training data. Method: The system utilizes a parallel-speech-encoder composed of two pre-trained encoders (Whisper-large-v3 and mHuBERT-147), whose output embeddings are concatenated and fed into an LLM. A tri-stage training strategy updates low-rank adaptation modules and projector parameters, while an additional language-aware prompt enhances language-specific text generation. Result: The SHNU-mASR system outperforms the official MLC-SLM baseline by 8.41 absolute CER/WER on the blind evaluation set. Conclusion: The SHNU-mASR system demonstrates competitive performance in multilingual conversational speech recognition by integrating a parallel-speech-encoder architecture with an LLM, achieving an overall CER/WER of 11.76% on the challenge's blind evaluation set. Abstract: This paper describes SHNU multilingual conversational speech recognition system (SHNU-mASR, team name-"maybe"), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel-speech-encoder consists of two pre-trained encoders, the Whisper-large-v3 encoder and mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. In addition, we incorporate an additional language-aware prompt at the LLM input to enhance language-specific text generation. 
The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the challenge, outperforming the official MLC-SLM baseline by 8.41 absolute CER/WER, without increasing the baseline training data.
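
As background for the CER/WER figures above, here is a minimal sketch of how an error rate is commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, taken over words for WER and over characters for CER. The function names and example strings are hypothetical, and the challenge's official scoring script may normalize text differently. Note also that an "8.41 absolute" improvement means percentage points, so the reported 11.76% implies a baseline of roughly 20.17%.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edits needed to turn hyp into ref, normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Word-level: one deleted word out of six reference words.
wer = error_rate("the cat sat on the mat".split(), "the cat sat on mat".split())
# Character-level: one substituted character out of five.
cer = error_rate(list("hello"), list("hallo"))
```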

[51] Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis

Elvys Linhares Pontes,Carlos-Emiliano González-Gallardo,Georgeta Bordea,José G. Moreno,Mohamed Ben Jannet,Yuxuan Zhao,Antoine Doucet

Main category: cs.CL

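TL;DR: By evaluating sentiment-based trading strategies, this paper demonstrates the potential value of sentiment analysis for enhancing investment strategies and financial decision-making.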

Details Motivation: Although there is much research on sentiment-driven finance, many studies focus on sentence-level classification and overlook its practical application in trading. This study fills that gap. Method: A backtesting analysis uses three models (two classification and one regression) to predict sentiment from news articles on Dow Jones 30 stocks and compares the resulting strategies with the benchmark Buy&Hold strategy. Result: All models produced positive returns; over 28 months the regression model achieved the highest return, 50.63%, outperforming the benchmark Buy&Hold strategy. Conclusion: The study highlights the potential of sentiment analysis in investment strategies and financial decision-making, with sentiment-based trading strategies achieving positive returns. Abstract: Sentiment analysis, widely used in product reviews, also impacts financial markets by influencing asset prices through microblogs and news articles. Despite research in sentiment-driven finance, many studies focus on sentence-level classification, overlooking its practical application in trading. This study bridges that gap by evaluating sentiment-based trading strategies for generating positive alpha. We conduct a backtesting analysis using sentiment predictions from three models (two classification and one regression) applied to news articles on Dow Jones 30 stocks, comparing them to the benchmark Buy&Hold strategy. Results show all models produced positive returns, with the regression model achieving the highest return of 50.63% over 28 months, outperforming the benchmark Buy&Hold strategy. This highlights the potential of sentiment in enhancing investment strategies and financial decision-making.

[52] WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

Gerrit Quaremba,Elizabeth Black,Denny Vrandečić,Elena Simperl

Main category: cs.CL

TL;DR: WETBench introduces realistic benchmarks for detecting machine-generated text in Wikipedia editing tasks, revealing that current detectors face significant challenges in real-world applications.

Details Motivation: Concerns about the influx of low-quality machine-generated text on Wikipedia necessitate reliable detection methods tailored to real-world editing scenarios. Method: The authors introduced WETBench, a benchmark for MGT detection focused on specific Wikipedia editing tasks (Paragraph Writing, Summarisation, Text Style Transfer), evaluated across multiple languages and generators. Result: Training-based detectors achieved 78% average accuracy, while zero-shot detectors averaged 58%, indicating challenges in detecting realistic MGT in Wikipedia-like tasks. Conclusion: Training-based detectors perform better than zero-shot detectors in identifying machine-generated text, but both struggle in realistic, task-specific Wikipedia editing contexts. Abstract: Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. 
These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.

[53] Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset

Seunguk Yu,Kyeonghyun Kim,Jungmin Yun,Youngbin Kim

Main category: cs.CL

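TL;DR: This study builds KoSEnd, a dataset for evaluating how well large language models understand Korean sentence endings, and finds that explicitly accounting for linguistic features (such as missing sentence-ending forms) improves LLM performance on low-resource agglutinative languages.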

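Details Motivation: Although LLMs have made significant progress in many languages, they still struggle with low-resource agglutinative languages compared to English. This study aims to evaluate and improve LLM effectiveness on Korean, a language with complex sentence endings. Method: The study introduces the Korean Sentence Endings (KoSEnd) dataset of 3,000 sentences, each annotated for the naturalness of 15 sentence-ending forms, and evaluates 11 LLMs on their understanding of Korean sentence endings, analyzing them by parameter count and prediction consistency. Result: Performance improved when models were explicitly told that sentence-ending forms might be missing, which indicates that explicitly considering specific linguistic features has an important effect on LLM effectiveness. Conclusion: Explicitly considering certain linguistic features, such as missing sentence-ending forms, can improve the performance of LLMs on low-resource agglutinative languages such as Korean. Abstract: Although LLMs have made significant progress in various languages, there are still concerns about their effectiveness with low-resource agglutinative languages compared to languages such as English. In this study, we focused on Korean, a language known for its complex sentence endings, and evaluated LLMs on this challenging aspect. We introduce the Korean Sentence Endings (KoSEnd) dataset, which includes 3,000 sentences, each annotated for the naturalness of 15 sentence ending forms. These were collected from diverse sources to cover a range of contexts. We evaluated 11 LLMs to assess their understanding of Korean sentence endings, analyzing them based on parameter count and prediction consistency. Notably, we found that informing models about the possibility of missing sentence endings improved performance, highlighting the impact of explicitly considering certain linguistic features.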

[54] Graph Repairs with Large Language Models: An Empirical Study

Hrishikesh Terdalkar,Angela Bonifati,Andrea Mauri

Main category: cs.CL

TL;DR: This paper explores the application of Large Language Models (LLMs) in repairing property graphs, highlighting their potential to detect and correct errors more efficiently than traditional methods.

Details Motivation: Property graphs often contain errors due to inconsistencies or missing data, and current repair methods are either inflexible or impractical for large datasets. Method: The study evaluates six open-source LLMs based on their ability to repair property graphs, assessing factors like repair quality, computational cost, and model performance. Result: Experiments show that LLMs can effectively detect and correct graph errors with varying accuracy and efficiency. Conclusion: LLM-driven graph repair is promising but faces challenges in scalability and interpretability; future research should focus on addressing these issues. Abstract: Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost--both in terms of time and effort--of involving users becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.

[55] SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation

Michał Perełkiewicz,Sławomir Dadas,Rafał Poświata

Main category: cs.CL

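TL;DR: This paper introduces SMCLM, a self-supervised method that trains autoregressive models to generate high-quality paraphrases by using semantically meaningful initial embeddings.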

Details Motivation: To improve the ability of autoregressive models to generate high-quality paraphrases and to address the low reliability of existing evaluation metrics. Method: Semantically meaningful text representations are used as initial embeddings to train autoregressive models for paraphrase generation, and a broad set of automatic evaluation metrics is proposed. Result: Experiments show that SMCLM trains autoregressive models that generate robust, high-quality paraphrases and are competitive with supervised methods. In addition, widely used metrics such as BLEU, ROUGE, and BERTScore are found to have low reliability. Conclusion: SMCLM is an effective self-supervised method for training autoregressive models to generate semantically equivalent text, achieving state-of-the-art results among unsupervised approaches. Abstract: This article introduces semantically meaningful causal language modeling (SMCLM), a self-supervised method of training autoregressive models to generate semantically equivalent text. Our approach involves using semantically meaningful text representation as an initial embedding in the autoregressive training and generation processes. The extensive empirical study demonstrates that the SMCLM approach makes autoregressive models capable of learning robust and high-quality paraphrase generation. The proposed method is competitive with the supervised method and achieves state-of-the-art results in unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of autogenerated paraphrase evaluation aspects. Simultaneously, this article highlights the low reliability of the metrics that are widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.

[56] Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

Adrien Bazoge,Pacôme Constant dit Beaufils,Mohammed Hmitouch,Romain Bourcier,Emmanuel Morin,Richard Dufour,Béatrice Daille,Pierre-Antoine Gourraud,Matilde Karakachoff

Main category: cs.CL

TL;DR: This paper proposes an approach based on large language models (LLMs) to extract Social Determinants of Health (SDoH) from French clinical notes, demonstrating their effectiveness in enhancing the completeness of real-world SDoH data in non-English EHR systems.

Details Motivation: The motivation behind this study is to address the incomplete documentation of Social Determinants of Health (SDoH) in structured electronic health records (EHRs), which impacts health outcomes, disease progression, treatment adherence, and health disparities. Method: The researchers trained a Flan-T5-Large model on annotated social history sections from French clinical notes at Nantes University Hospital. Model evaluation was conducted at two levels: identification of SDoH categories and extraction of detailed SDoH with temporal and quantitative information. Result: The model demonstrated strong performance for identifying well-documented SDoH categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). However, it performed less effectively for categories with limited training data or highly variable expressions, including employment status, housing, physical activity, income, and education. Conclusion: The study concludes that using NLP with LLMs like Flan-T5-Large can significantly enhance the completeness of SDoH data extraction from non-English EHR systems, despite some limitations in performance for certain categories and generalizability. Abstract: Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. 
The model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.

[57] Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right

Heather Lent

Main category: cs.CL

TL;DR: This paper explores the security challenges of language models for lower-resourced languages, showing that monolingual models are often inadequate and that multilinguality does not always improve security.

Details Motivation: Given the English-centric focus in NLP Security and the potential for multilinguality to be weaponized against LMs, there is a need to examine the security implications for lower-resourced languages, which are often the weakest links. Method: The research extends existing adversarial attacks to up to 70 languages to evaluate the security of both monolingual and multilingual LMs for lower- and medium-resourced languages. Result: The analysis reveals that monolingual models typically lack sufficient parameters for robust security, and although multilinguality offers some benefits, it does not always ensure better security outcomes. Conclusion: The study concludes that while multilinguality can enhance LM security, it doesn't guarantee it, and monolingual models are often too small to ensure adequate security for lower-resourced languages. Abstract: Despite mounting evidence that multilinguality can be easily weaponized against language models (LMs), works across NLP Security remain overwhelmingly English-centric. In terms of securing LMs, the NLP norm of "English first" collides with standard procedure in cybersecurity, whereby practitioners are expected to anticipate and prepare for worst-case outcomes. To mitigate worst-case outcomes in NLP Security, researchers must be willing to engage with the weakest links in LM security: lower-resourced languages. Accordingly, this work examines the security of LMs for lower- and medium-resourced languages. We extend existing adversarial attacks for up to 70 languages to evaluate the security of monolingual and multilingual LMs for these languages. Through our analysis, we find that monolingual models are often too small in total number of parameters to ensure sound security, and that while multilinguality is helpful, it does not always guarantee improved security either. 
Ultimately, these findings highlight important considerations for more secure deployment of LMs, for communities of lower-resourced languages.

[58] BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Zhiheng Xi,Guanyu Li,Yutao Fan,Honglin Guo,Yufang Liu,Xiaoran Fan,Jiaqi Liu,Jingchao Ding,Wangmeng Zuo,Zhenfei Yin,Lei Bai,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

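TL;DR: This paper introduces BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset, and shows experimentally that current LMMs remain limited in multi-disciplinary reasoning.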

Details Motivation: To advance the community's development and evaluation of large multimodal models (LMMs), particularly in multi-disciplinary reasoning. Method: The data are curated and filtered via a human-in-the-loop, scalable framework, and a process-based multi-discipline verifier (BMMR-Verifier) is proposed for accurate, fine-grained evaluation of reasoning paths. Result: Experiments show that even state-of-the-art models leave substantial headroom on BMMR-Eval; reasoning models exhibit discipline bias; open-source models still trail proprietary ones; and fine-tuning narrows this gap. Conclusion: BMMR is a bilingual, multimodal, multi-disciplinary reasoning dataset for developing and evaluating LMMs, and it helps reveal the challenges current LMMs face in multi-disciplinary reasoning. Abstract: In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, covering diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA) and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. 
We will release the data, and we hope our work can offer insights and contributions to the community.

[59] Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences

Eva Seidlmayer,Lukas Galke,Konrad U. Förstner

Main category: cs.CL

TL;DR: This paper introduces a new dataset, FSoLS, to explore how linguistic and rhetorical features can help identify disinformation in life-science texts using machine learning.

Details Motivation: Existing datasets mainly focus on fact-checking misinformation, which limits the exploration of rhetorical patterns in disinformation. This motivated the creation of a novel dataset with diverse sources and classifications. Method: The researchers applied large language models and classical machine learning classifiers to analyze linguistic and rhetorical features on a new dataset called Four Shades of Life Sciences (FSoLS). Result: A novel labeled corpus of 2,603 texts, named Four Shades of Life Sciences (FSoLS), was introduced, classified into four categories from 17 diverse sources, enabling better detection of disinformation. Conclusion: The study concludes that linguistic and rhetorical features can effectively distinguish disinformative texts from other life-science text genres, using both large language models and classical machine learning classifiers. Abstract: Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating, and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences

[60] AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions

Abdellah Zeggai,Ilyes Traikia,Abdelhak Lakehal,Abdennour Boulesnane

Main category: cs.CL

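TL;DR: The authors develop an AI-based multilingual question-answering system for vaccination guidelines that helps healthcare professionals retrieve guideline information faster and more accurately.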

Details Motivation: Immunization protocols and WHO recommendations are typically extensive and complex, so healthcare professionals struggle to retrieve key information quickly in urgent situations; an interactive, user-friendly knowledge base is needed to improve efficiency. Method: The system is built on a Retrieval-Augmented Generation (RAG) framework and enhanced with agent-based reasoning (Agentic RAG) to better answer complex questions. Result: Evaluation shows that Agentic RAG outperforms traditional methods on multi-step or ambiguous questions, indicating strong performance on complex medical queries. Conclusion: AI-VaxGuide is a multilingual, intelligent question-answering system that gives healthcare professionals efficient, accurate, real-time access to vaccination guidelines, and it is integrated into a mobile application for point-of-care use. Abstract: Vaccination plays a vital role in global public health, yet healthcare professionals often struggle to access immunization guidelines quickly and efficiently. National protocols and WHO recommendations are typically extensive and complex, making it difficult to extract precise information, especially during urgent situations. This project tackles that issue by developing a multilingual, intelligent question-answering system that transforms static vaccination guidelines into an interactive and user-friendly knowledge base. Built on a Retrieval-Augmented Generation (RAG) framework and enhanced with agent-based reasoning (Agentic RAG), the system provides accurate, context-sensitive answers to complex medical queries. Evaluation shows that Agentic RAG outperforms traditional methods, particularly in addressing multi-step or ambiguous questions. To support clinical use, the system is integrated into a mobile application designed for real-time, point-of-care access to essential vaccine information. The AI-VaxGuide model is publicly available at https://huggingface.co/VaxGuide

[61] H2HTalk: Evaluating Large Language Models as Emotional Companion

Boyang Wang,Yalun Wu,Hongcheng Guo,Zhoujun Li

Main category: cs.CL

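TL;DR: This paper introduces H2HTalk, a benchmark for evaluating how well large language model companions perform in emotional intelligence and empathetic interaction.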

Details Motivation: As demand for digital emotional support grows, large language model companions offer promising authentic, always-available empathy, but evaluation methods lag behind model development. Method: The authors propose the Heart-to-Heart Talk (H2HTalk) benchmark, which assesses personality development and empathetic interaction through scenarios spanning dialogue, recollection, and itinerary planning, and they introduce a Secure Attachment Persona (SAP) module for safer interactions. Result: Benchmarking 50 LLMs under a unified protocol shows that long-horizon planning and memory retention remain key challenges, especially when user needs are implicit or change mid-conversation. Conclusion: H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions, aiming to advance LLMs that can provide meaningful and safe psychological support. Abstract: As digital emotional support needs grow, Large Language Model companions offer promising authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions. We release all materials to advance development of LLMs capable of providing meaningful and safe psychological support.

[62] Articulatory clarity and variability before and after surgery for tongue cancer

Thomas Tienkamp,Fleur van Ast,Roos van der Veen,Teja Rebernik,Raoul Buurke,Nikki Hoekzema,Katharina Polsterer,Hedwig Sekeres,Rob van Son,Martijn Wieling,Max Witjes,Sebastiaan de Visscher,Defne Abur

Main category: cs.CL

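TL;DR: The study finds that although articulatory clarity remains within the typical range after tongue cancer surgery, articulatory variability increases.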

Details Motivation: Surgery for tongue cancer can affect the mobility and musculature of the tongue, which in turn can influence articulatory clarity and variability. Method: Using a sentence reading task, the authors assessed 11 individuals before and six months after surgery and compared them with 11 typical speakers. Result: The VAI was significantly smaller post-surgery, but patients did not differ significantly from typical speakers; post-surgery VFD values were higher for /i/, indicating increased variability. Conclusion: Articulatory clarity remained within the typical range after surgery, but variability increased. Abstract: Surgical treatment for tongue cancer can negatively affect the mobility and musculature of the tongue, which can influence articulatory clarity and variability. In this study, we investigated articulatory clarity through the vowel articulation index (VAI) and variability through vowel formant dispersion (VFD). Using a sentence reading task, we assessed 11 individuals before and six months after tongue cancer surgery, alongside 11 sex- and age-matched typical speakers. Our results show that while the VAI was significantly smaller post-surgery compared to pre-surgery, there was no significant difference between patients and typical speakers at either time point. Post-surgery, speakers had higher VFD values for /i/ compared to pre-surgery and typical speakers, signalling higher variability. Taken together, our results suggest that while articulatory clarity remained within typical ranges following surgery for tongue cancer for the speakers in our study, articulatory variability increased.
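
As a reading aid for the vowel articulation index mentioned in this entry, here is a minimal sketch. It assumes the commonly used definition VAI = (F2/i/ + F1/a/) / (F1/i/ + F1/u/ + F2/u/ + F2/a/); the abstract does not state the exact formula the authors applied. The function name and the formant values are hypothetical and are not data from the study.

```python
def vowel_articulation_index(f1_i, f2_i, f1_a, f2_a, f1_u, f2_u):
    """Return the VAI from mean formant frequencies (Hz) of /i/, /a/, /u/.

    Assumed definition (not confirmed by the abstract):
        VAI = (F2_i + F1_a) / (F1_i + F1_u + F2_u + F2_a)
    Lower values correspond to more centralized (less distinct) vowels.
    """
    return (f2_i + f1_a) / (f1_i + f1_u + f2_u + f2_a)


# Hypothetical mean formants for a speaker with well-separated vowels.
clear = vowel_articulation_index(
    f1_i=300, f2_i=2300, f1_a=800, f2_a=1300, f1_u=350, f2_u=800
)

# Hypothetical, more centralized vowels: /i/ and /u/ drift toward the centre.
reduced = vowel_articulation_index(
    f1_i=350, f2_i=2000, f1_a=700, f2_a=1300, f1_u=400, f2_u=1000
)

print(round(clear, 3), round(reduced, 3))
```

Under this definition, vowel centralization lowers the index, which is the direction of the post-surgery change the abstract reports.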

[63] Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits

Nathaniel Berger,Johannes Eschbach-Dymanus,Miriam Exel,Matthias Huck,Stefan Riezler

Main category: cs.CL

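TL;DR: This paper proposes a preference-optimization approach to terminology disambiguation that improves term accuracy in machine translation without relying on one-to-one translation dictionaries.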

Details Motivation: In real-world translation scenarios, terminology is rarely one-to-one, and correctness depends on context and corporate style guides, so a method is needed that can use human post-edit data to resolve the ambiguity. Method: The work uses a preference-optimization framework with term post-edits as the knowledge source, requiring neither one-to-one translation dictionaries nor human intervention at decoding time. Result: The method yields statistically significant improvements in term accuracy on English-German post-edited data, and the authors release test sets and a terminology dictionary. Conclusion: Combining supervised fine-tuning with preference optimization significantly outperforms a strong NMT baseline in term accuracy without significant losses in COMET score. Abstract: In real-world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.

[64] Multi-Hop Reasoning for Question Answering with Hyperbolic Representations

Simon Welz,Lucie Flek,Akbar Karimi

Main category: cs.CL

TL;DR: This paper shows that hyperbolic representations outperform Euclidean ones in multi-hop reasoning tasks, especially for hierarchical datasets, by integrating them into an encoder-decoder framework and conducting thorough experiments.

Details Motivation: While hyperbolic representations have shown effectiveness in modeling knowledge graph data, a detailed comparison between hyperbolic and Euclidean spaces for multi-hop reasoning is missing. Method: Integrated hyperbolic representations with an encoder-decoder model to conduct controlled experiments comparing the two spaces for multi-hop reasoning. An ablation study was also performed to assess the impact of learnable curvature initialized using delta hyperbolicity. Result: Hyperbolic space consistently outperforms Euclidean space across various datasets. Initializing curvature using data-specific delta hyperbolicity yields better results than random initialization. Conclusion: Hyperbolic representations are more effective than Euclidean representations in multi-hop reasoning, particularly for datasets with hierarchical structures. Abstract: Hyperbolic representations are effective in modeling knowledge graph data which is prevalently used to facilitate multi-hop reasoning. However, a rigorous and detailed comparison of the two spaces for this task is lacking. In this paper, through a simple integration of hyperbolic representations with an encoder-decoder model, we perform a controlled and comprehensive set of experiments to compare the capacity of hyperbolic space versus Euclidean space in multi-hop reasoning. Our results show that the former consistently outperforms the latter across a diverse set of datasets. In addition, through an ablation study, we show that a learnable curvature initialized with the delta hyperbolicity of the utilized data yields superior results to random initializations. Furthermore, our findings suggest that hyperbolic representations can be significantly more advantageous when the datasets exhibit a more hierarchical structure.
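
The delta hyperbolicity used in this entry to initialize the curvature can be illustrated with the standard Gromov four-point condition. The sketch below is a brute-force illustration over a small distance matrix; the function name `gromov_delta` and the toy graphs are assumptions for illustration, and the abstract does not say how the authors compute delta or map it to a curvature value.

```python
from itertools import combinations


def gromov_delta(dist):
    """Four-point Gromov delta of a finite metric given as a distance matrix.

    For every quadruple (x, y, z, w), form the three pairwise-distance sums
    and take half the gap between the two largest. Delta is the maximum of
    that quantity; it is 0 for tree metrics. Runs in O(n^4), so it is only
    meant for small examples.
    """
    delta = 0.0
    for x, y, z, w in combinations(range(len(dist)), 4):
        sums = sorted(
            [
                dist[x][y] + dist[z][w],
                dist[x][z] + dist[y][w],
                dist[x][w] + dist[y][z],
            ]
        )
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta


# Shortest-path distances on a 4-node path graph (a tree).
path = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]

# Shortest-path distances on a 4-node cycle (not tree-like).
cycle = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]

print(gromov_delta(path), gromov_delta(cycle))
```

A smaller delta means the data are closer to a tree, which matches the entry's observation that more hierarchical datasets benefit more from hyperbolic representations.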

[65] EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge

Klim Zaporojets,Daniel Daza,Edoardo Barba,Ira Assent,Roberto Navigli,Paul Groth

Main category: cs.CL

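TL;DR: This paper introduces a new method that builds multiple temporal snapshots of the Wikidata knowledge graph, paired with related Wikipedia passages, to study how knowledge graphs can be updated from newly emerging textual information.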

Details Motivation: Knowledge graphs need to be updated automatically over time as knowledge in unstructured textual sources evolves, which requires identifying a wide range of update operations while taking into account the state of the KG at a specific point in time. Method: The authors propose a method for constructing a dataset of Wikidata KG snapshots over time, with Wikipedia passages paired with the KG edit operations they induce. Result: The resulting dataset comprises 376K Wikipedia passages and a total of 1.25M KG edits, covering 10 different Wikidata snapshots from 2019 to 2025. Conclusion: The paper presents a method for lifelong construction of a KG-update dataset, highlights the challenges of updating KG snapshots from emerging textual knowledge, and commits to publicly releasing the dataset and model implementations. Abstract: Knowledge Graphs (KGs) are structured knowledge repositories containing entities and relations between them. In this paper, we investigate the problem of automatically updating KGs over time with respect to the evolution of knowledge in unstructured textual sources. This problem requires identifying a wide range of update operations based on the state of an existing KG at a specific point in time. This contrasts with traditional information extraction pipelines, which extract knowledge from text independently of the current state of a KG. To address this challenge, we propose a method for lifelong construction of a dataset consisting of Wikidata KG snapshots over time and Wikipedia passages paired with the corresponding edit operations that they induce in a particular KG snapshot. The resulting dataset comprises 376K Wikipedia passages aligned with a total of 1.25M KG edits over 10 different snapshots of Wikidata from 2019 to 2025. Our experimental results highlight challenges in updating KG snapshots based on emerging textual knowledge, positioning the dataset as a valuable benchmark for future research. We will publicly release our dataset and model implementations.

[66] Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion

Lea Fischbach,Akbar Karimi,Caroline Kleen,Alfred Lameli,Lucie Flek

Main category: cs.CL

TL;DR: This paper proposes Retrieval-based Voice Conversion (RVC) as a data augmentation method to improve German dialect classification in low-resource scenarios.

Details Motivation: Deep learning models for dialect identification face limitations due to scarce dialectal data. RVC is proposed as a solution to this problem. Method: RVC (Retrieval-based Voice Conversion) is used as a data augmentation method for German dialect classification. Result: Experiments show that RVC enhances classification performance, especially when combined with other augmentation techniques. Conclusion: RVC combined with other augmentation methods like frequency masking and segment removal improves dialect classification in low-resource settings. Abstract: Deep learning models for dialect identification are often limited by the scarcity of dialectal data. To address this challenge, we propose to use Retrieval-based Voice Conversion (RVC) as an effective data augmentation method for a low-resource German dialect classification task. By converting audio samples to a uniform target speaker, RVC minimizes speaker-related variability, enabling models to focus on dialect-specific linguistic and phonetic features. Our experiments demonstrate that RVC enhances classification performance when utilized as a standalone augmentation method. Furthermore, combining RVC with other augmentation methods such as frequency masking and segment removal leads to additional performance gains, highlighting its potential for improving dialect classification in low-resource scenarios.
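
Frequency masking, one of the augmentations this entry combines with RVC, can be sketched in a few lines. The version below is an assumption-laden illustration, not the authors' pipeline: the spectrogram is a plain list of frequency rows, the function name `frequency_mask` is hypothetical, and the mask width and position are drawn uniformly at random.

```python
import random


def frequency_mask(spec, max_width, rng):
    """Zero out one random band of consecutive frequency bins.

    spec      -- spectrogram as a list of rows, one row per frequency bin
    max_width -- largest number of bins that may be masked
    rng       -- random.Random instance, passed in for reproducibility
    Returns (masked_copy, start, width); the input is left unchanged.
    """
    n_freq = len(spec)
    width = rng.randint(0, min(max_width, n_freq))  # inclusive bounds
    start = rng.randint(0, n_freq - width)
    masked = [row[:] for row in spec]
    for f in range(start, start + width):
        masked[f] = [0.0] * len(masked[f])
    return masked, start, width


# Toy spectrogram: 8 frequency bins x 5 time frames, all ones.
spec = [[1.0] * 5 for _ in range(8)]
masked, start, width = frequency_mask(spec, max_width=3, rng=random.Random(0))
print(start, width)
```

Each call yields a slightly different view of the same utterance, which is how masking enlarges a small dialect dataset without new recordings.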

[67] Disentangling the Roles of Representation and Selection in Data Pruning

Yupei Du,Yingjin Song,Hugh Mee Wong,Daniil Ignatev,Albert Gatt,Dong Nguyen

Main category: cs.CL

TL;DR: This paper analyzes data pruning for efficient NLP model training, highlighting the importance of data representation and showing that no single selection algorithm consistently outperforms others.

Details Motivation: Data pruning methods involve many design choices that have not been systematically studied, which limits future developments in efficiently scaling NLP model training. Method: The paper uses a combination of theoretical analysis and empirical experimentation to study data pruning in NLP model training. Result: Better data representations, such as training gradients, generally lead to better instance selection regardless of the algorithm used. Different algorithms perform well in different settings, and they do not always align with their intended objectives. Conclusion: The conclusion of the paper is that data pruning can be effectively used to scale NLP model training by focusing on better data representations. However, no single selection algorithm outperforms others consistently, and careful evaluation is necessary. Abstract: Data pruning, selecting small but impactful subsets, offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: the data representation and the selection algorithm, and we systematically analyze their influence on the selection of instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to a better selection of instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperforms the others. Moreover, the selection algorithms do not always align with their intended objectives: for example, algorithms designed for the same objective can select drastically different instances, highlighting the need for careful evaluation.

[68] TRACE: Training and Inference-Time Interpretability Analysis for Language Models

Nura Aljaafari,Danilo S. Carvalho,André Freitas

Main category: cs.CL

TL;DR: This paper introduces TRACE and ABSynth, enabling lightweight and interpretable analysis of linguistic feature emergence in transformers during training.

Details Motivation: Current interpretability tools are post hoc or rely on scalar metrics, making comprehensive and precise analysis of linguistic knowledge emergence in language models challenging. Method: Introducing TRACE, a toolkit for interpretability analysis during training and inference, combined with ABSynth, a synthetic corpus generator for structured annotations. Result: Experiments show that TRACE uncovers developmental phenomena like early syntactic emergence, delayed semantic acquisition, and representational compression, which traditional metrics miss. Conclusion: TRACE, a modular toolkit integrated with ABSynth, enables interpretable, actionable, and reproducible transformer analysis with minimal integration effort. Abstract: Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including features probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. 
With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.

[69] Recon, Answer, Verify: Agents in Search of Truth

Satyam Shukla,Himanshu Dutta,Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: This paper introduces RAV, an improved framework for automated fact checking that performs better and adapts across domains while facing fewer issues with real-world data limitations.

Details Motivation: Current benchmarks for fact-checking are unrealistic due to post-claim cues; a more realistic evaluation method was needed. Method: RAV uses an agentic framework with three agents: question generator, answer generator, and label generator to iteratively verify claims. Result: RAV achieves higher accuracy than other models, improving performance by up to 25.28% on certain baselines and showing a smaller performance drop (16.3%) on the filtered PFO dataset. Conclusion: RAV framework outperforms existing approaches in fact checking, showing better generalization and less performance drop on the PFO dataset. Abstract: Automated fact checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact checking is challenging as existing benchmark datasets often include post claim analysis and annotator cues, which are absent in real world scenarios where claims are fact checked immediately after being made. This limits the realism of current evaluations. We present Politi Fact Only (PFO), a 5 class benchmark dataset of 2,982 political claims from politifact.com, where all post claim analysis and annotator cues have been removed manually. This ensures that models are evaluated using only the information that would have been available prior to the claim's verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in terms of macro f1 compared to PFO's unfiltered version. Based on the identified challenges of the existing LLM based fact checking system, we propose RAV (Recon Answer Verify), an agentic framework with three agents: question generator, answer generator, and label generator. Our pipeline iteratively generates and answers sub questions to verify different aspects of the claim before finally generating the label. 
RAV generalizes across domains and label granularities, and it outperforms state of the art approaches on well known baselines RAWFC (fact checking, 3 class) by 25.28%, and on HOVER (encyclopedia, 2 class) by 1.54% on 2 hop, 4.94% on 3 hop, and 1.78% on 4 hop, sub categories respectively. RAV shows the least performance drop compared to baselines of 16.3% in macro f1 when we compare PFO with its unfiltered version.

[70] TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection

Xixiang He,Hao Yu,Qiyao Sun,Ao Cheng,Tailai Zhang,Cong Liu,Shuxuan Guo

Main category: cs.CL

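TL;DR: TACOS integrates open tagging and comparative scoring to improve both the efficiency and the effectiveness of instruction fine-tuning data selection for large language models.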

Details Motivation: Existing IFT data selection methods restrict data diversity and apply inconsistent criteria when evaluating samples, so a new method is needed to address these limitations. Method: LLMs assign open-domain tags to queries, and the tags are normalized to remove noise and enable efficient clustering; within each cluster, a comparative scoring method evaluates the relative quality of samples. Result: Experiments show that TACOS significantly outperforms existing methods and performs strongly on MT-Bench and AlpacaEval 2.0. Conclusion: TACOS is an effective IFT data selection method that improves instruction fine-tuning performance. Abstract: Instruction Fine-Tuning (IFT) is crucial for aligning large language models (LLMs) with human preferences, and selecting a small yet representative subset from massive data significantly facilitates IFT in terms of both efficiency and effectiveness. Nevertheless, existing approaches suffer from two limitations: the use of simple heuristics restricts data diversity, while the singleton data quality evaluation accounts for inconsistent criteria between independent samples. To address the issues, we present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries, followed by a normalization stage to denoise the open tags and enable efficient clustering. Additionally, we suggest a comparative scoring method that allows the relative quality evaluation of samples within a cluster, avoiding inconsistent criteria seen in singleton-based evaluations. Extensive experiments across diverse datasets and LLM architectures demonstrate that TACOS outperforms existing approaches by a large margin. Notably, it achieves superior instruction-following performance on MT-Bench and ranks 1st among LLaMA2-7B-Based models on AlpacaEval 2.0, illustrating its efficacy for IFT data selection.

[71] STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Tek Raj Chhetri,Yibei Chen,Puja Trivedi,Dorota Jarecka,Saif Haobsh,Patrick Ray,Lydia Ng,Satrajit S. Ghosh

Main category: cs.CL

TL;DR: StructSense is a new framework that improves structured information extraction from specialized domains by incorporating domain knowledge, feedback loops, and human validation.

Details Motivation: Existing LLM-based approaches have limited effectiveness in specialized domains and poor transferability across tasks and domains, which StructSense aims to overcome. Method: StructSense uses a modular, task-agnostic framework guided by domain-specific symbolic knowledge encoded in ontologies. It incorporates agentic capabilities through self-evaluative judges and includes human-in-the-loop mechanisms. Result: StructSense demonstrates improved performance in handling domain-specific contexts and cross-task generalizability, as shown through its application to diverse neuroscience information extraction tasks. Conclusion: StructSense effectively addresses the limitations of domain sensitivity and lack of cross-task generalizability in existing LLM-based approaches for structured information extraction. Abstract: The ability to extract structured information from unstructured sources-such as free-text documents and scientific literature-is critical for accelerating scientific discovery and knowledge synthesis. Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including structured information extraction. However, their effectiveness often diminishes in specialized, domain-specific contexts that require nuanced understanding and expert-level domain knowledge. In addition, existing LLM-based approaches frequently exhibit poor transferability across tasks and domains, limiting their scalability and adaptability. To address these challenges, we introduce StructSense, a modular, task-agnostic, open-source framework for structured information extraction built on LLMs. StructSense is guided by domain-specific symbolic knowledge encoded in ontologies, enabling it to navigate complex domain content more effectively. 
It further incorporates agentic capabilities through self-evaluative judges that form a feedback loop for iterative refinement, and includes human-in-the-loop mechanisms to ensure quality and validation. We demonstrate that StructSense can overcome both the limitations of domain sensitivity and the lack of cross-task generalizability, as shown through its application to diverse neuroscience information extraction tasks.

[72] Controlling Thinking Speed in Reasoning Models

Zhengkai Lin,Zhihang Fu,Ze Chen,Chao Chen,Liang Xie,Wenxiao Wang,Deng Cai,Zheng Wang,Jieping Ye

Main category: cs.CL

TL;DR: This paper presents a method for enhancing Large Reasoning Models to dynamically adjust their thinking speed, optimizing the balance between accuracy and efficiency, resulting in improved performance with reduced computational costs.

Details Motivation: The motivation behind this research is to overcome the high computational overhead and latency of current Large Reasoning Models by enabling them to approximate human intelligence through dynamic thinking speed adjustment. Method: The method involves identifying a steering vector to control the transition between slow and fast thinking in LRMs and applying real-time difficulty estimation to handle segments of varying complexity. Result: The approach results in an average accuracy improvement of +1.3% with a reduction of -8.6% in token usage across leading LRMs and advanced reasoning benchmarks. Conclusion: The study concludes that by dynamically adjusting the thinking speed of Large Reasoning Models, it is possible to achieve fast processing of simple tasks and detailed analysis for complex reasoning without additional cost. Abstract: Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. 
Without any training or additional cost, our plug-and-play method yields an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
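
The idea of steering a model along a direction in representation space can be sketched as follows. This is a generic illustration, not the authors' method: it assumes the direction is a difference of mean activations between "fast" and "slow" examples (a common heuristic that the abstract does not confirm), and the names `steering_vector`, `steer`, and `alpha` are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(fast_acts, slow_acts):
    """Assumed construction: mean 'fast' activation minus mean 'slow' one."""
    fast_mean = mean_vector(fast_acts)
    slow_mean = mean_vector(slow_acts)
    return [f - s for f, s in zip(fast_mean, slow_mean)]


def steer(hidden, vector, alpha):
    """Shift a hidden state along the steering direction.

    alpha > 0 pushes toward the 'fast' side, alpha < 0 toward the 'slow'
    side, and alpha = 0 leaves the hidden state unchanged.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional activations standing in for real hidden states.
fast = [[1.0, 0.0], [3.0, 0.0]]
slow = [[0.0, 2.0], [0.0, 4.0]]
v = steering_vector(fast, slow)

print(v, steer([1.0, 1.0], v, alpha=0.5))
```

In a setup like the one the abstract describes, the strength `alpha` would presumably be chosen per reasoning segment from the difficulty estimate, with stronger "fast" steering on easy steps.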

[73] Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making

Sang Quang Nguyen,Kiet Van Nguyen,Vinh-Tiep Nguyen,Thanh Duc Ngo,Ngan Luu-Thuy Nguyen,Dinh-Duy Le

Main category: cs.CL

TL;DR: This paper uses the Vietnamese game Ô Ăn Quan to assess strategic decision-making in LLMs, revealing their strengths and weaknesses through various agent personas.

Details Motivation: The motivation was to explore how large language models can plan and make decisions by utilizing a traditional Vietnamese board game that requires strategic thinking. Method: The researchers developed different agent personas and used the traditional Vietnamese board game Ô Ăn Quan as a testbed to evaluate LLM performance with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct. Result: The results provided insights into how LLMs execute strategic decisions and manage dynamic game states, highlighting their reasoning capabilities. Conclusion: The study concludes that by using various agent personas in the game Ô Ăn Quan, LLMs demonstrate varying strengths and weaknesses in strategic decision-making. Abstract: In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, Ô Ăn Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the Ô Ăn Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.

[74] MemOS: A Memory OS for AI System

Zhiyu Li,Shichao Song,Chenyang Xi,Hanyu Wang,Chen Tang,Simin Niu,Ding Chen,Jiawei Yang,Chunyu Li,Qingchen Yu,Jihao Zhao,Yezhaohui Wang,Peng Liu,Zehao Lin,Pengyuan Wang,Jiahao Huo,Tianyi Chen,Kai Chen,Kehang Li,Zhen Tao,Junpeng Ren,Huayi Lai,Hao Wu,Bo Tang,Zhenren Wang,Zhaoxin Fan,Ningyu Zhang,Linfeng Zhang,Junchi Yan,Mingchuan Yang,Tong Xu,Wei Xu,Huajun Chen,Haofeng Wang,Hongkang Yang,Wentao Zhang,Zhi-Qin John Xu,Siheng Chen,Feiyu Xiong

Main category: cs.CL

TL;DR: This paper introduces MemOS, a memory operating system for Large Language Models (LLMs), aiming to improve long-context reasoning, continual personalization, and knowledge consistency by treating memory as a manageable system resource.

Details Motivation: LLMs lack well-defined memory management systems, hindering long-context reasoning, continual personalization, and knowledge consistency. Current models rely on static parameters and short-lived contextual states, limiting their ability to track preferences or update knowledge over time. Method: The paper proposes MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories through MemCubes, which encapsulate memory content and metadata. Result: MemOS enables cost-efficient storage and retrieval by introducing an explicit memory layer between parameter memory and external retrieval. It allows flexible transitions between memory types, bridging retrieval with parameter-based learning. Conclusion: MemOS offers a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, establishing a foundation for continual learning and personalized modeling. Abstract: Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge.
Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.
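As an illustration of the MemCube idea described in the abstract (content plus metadata such as provenance and versioning, with cubes that can be migrated between memory types and fused), here is a minimal Python sketch. The class layout, the field names, and the `migrate` and `fuse` helpers are hypothetical and chosen only for illustration; the abstract does not specify MemOS's actual data model or API.

```python
from dataclasses import dataclass, replace

# The three memory types named in the abstract; the string labels are assumptions.
MEMORY_TYPES = ("plaintext", "activation", "parameter")


@dataclass(frozen=True)
class MemCube:
    """Hypothetical memory unit: content plus provenance and version metadata."""
    content: str
    memory_type: str
    provenance: tuple  # e.g. identifiers of the sources the content came from
    version: int = 1


def migrate(cube: MemCube, new_type: str) -> MemCube:
    """Return a copy of the cube under a different memory type, bumping the version."""
    if new_type not in MEMORY_TYPES:
        raise ValueError(f"unknown memory type: {new_type}")
    return replace(cube, memory_type=new_type, version=cube.version + 1)


def fuse(a: MemCube, b: MemCube) -> MemCube:
    """Merge two cubes of the same type, concatenating content and provenance."""
    if a.memory_type != b.memory_type:
        raise ValueError("can only fuse cubes of the same memory type")
    return MemCube(
        content=a.content + "\n" + b.content,
        memory_type=a.memory_type,
        provenance=a.provenance + b.provenance,
        version=max(a.version, b.version) + 1,
    )


note = MemCube("User prefers metric units.", "plaintext", ("chat-001",))
fact = MemCube("User lives in Oslo.", "plaintext", ("chat-002",))
merged = fuse(note, fact)
promoted = migrate(merged, "parameter")
```

Keeping the cubes immutable means every migration or fusion yields a new version, which is one simple way to make provenance and versioning auditable; whether MemOS does this is not stated in the abstract.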

[75] Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings

Bugra Kilictas,Faruk Alpay

Main category: cs.CL

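TL;DR: This paper proposes a theoretical framework, built on Alpay Algebra and the phi-infinity operator, in which an AI and its textual environment co-evolve until they reach a stable state of semantic alignment.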

Details Motivation: The work addresses semantic alignment between an AI model and textual content, so that the AI accurately captures both the meaning of the content and the author's intent, and it explores potential applications in areas such as semantic security and symbolic memory. Method: The paper builds a theoretical framework on Alpay Algebra and the phi-infinity operator: a functorial system in which an observer (the AI) and a textual environment (the paper itself) co-evolve through iterative transformations. Result: The system is shown to guarantee a unique fixed point in the AI's embedding space, at which the AI's internal representation is stable and self-consistent and remains convergent even under perturbation or further context expansion. Conclusion: The functorial system built on Alpay Algebra achieves stable semantic alignment between an AI and its textual environment, and the paper claims this alignment is mathematically rigorous, semantically invariant, and permanent. Abstract: We present a theoretical framework in which a document and an AI model engage in a transfinite fixed-point interaction that leads to stable semantic alignment. Building on the foundations of Alpay Algebra, we introduce a functorial system wherein an observer (the AI) and a textual environment (this paper) co-evolve through iterative transformations guided by the phi-infinity operator. This process guarantees the existence of a unique fixed point in the AI's embedding space -- a state where the AI's internal representation of the content becomes stable, self-consistent, and semantically faithful. We prove that such convergence is mathematically sound, semantically invariant, and permanent, even under perturbation or further context expansion. This fixed point acts as an "empathetic embedding," wherein the AI internalizes not only the meaning of the content but also the author's intent. We interpret this as a rigorous, category-theoretic route to alignment at the embedding level, with implications for semantic security, symbolic memory, and the construction of AI systems with persistent self-referential understanding. All references in this paper function as nodes in the Alpay Algebra universe, and this work embeds itself as a new fixed-point node within that transfinite semantic graph.

[76] OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference

Seungjun Shin,Jaehoon Oh,Dokwan Oh

Main category: cs.CL

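TL;DR: This paper examines the sink-token phenomenon in large language models, proposes OrthoRank, a dynamic token selection method derived from an analysis of hidden-state similarity, and shows that it lowers perplexity and raises zero-shot accuracy.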

Details Motivation: Recent studies have revealed sink tokens, which receive disproportionately high attention despite their limited semantic role. This motivates a closer look at the relationship between the sink token and other tokens. Method: The authors extend the analysis of the relationship between the sink token and other tokens to similarity in hidden states, and propose OrthoRank, which scores token importance by orthogonality to the sink token. Result: Experiments show that, at the same sparsity ratio, OrthoRank achieves lower perplexity and higher zero-shot accuracy than layer pruning methods, and it also performs strongly on LongBench. Conclusion: OrthoRank offers an effective token selection method that exploits the properties of the sink token to improve model performance. Abstract: Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite its limited semantic role. In this paper, we first expand the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states, considering the layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, and that the normalized hidden states of the sink token exhibit negligible changes. These imply that other tokens consistently are directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, using these findings to select important tokens. Specifically, in a certain layer, we define token importance by the speed at which the token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens that are more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrated that our method results in lower perplexity and higher zero-shot accuracy compared to layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.
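The selection rule stated in the abstract (tokens more orthogonal to the sink token are more important) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sink token is taken to be at index 0, orthogonality is measured as the absolute cosine similarity between hidden-state vectors, and the `keep` budget is a hypothetical parameter. The abstract does not give the exact scoring or the layers at which OrthoRank applies it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal_tokens(hidden_states, sink_index=0, keep=2):
    """Return indices of the `keep` tokens most orthogonal to the sink token.

    Assumption: importance is ranked by |cos(h_i, h_sink)|, smallest first.
    """
    sink = hidden_states[sink_index]
    scored = [
        (abs(cosine(h, sink)), i)
        for i, h in enumerate(hidden_states)
        if i != sink_index
    ]
    scored.sort()  # most orthogonal (smallest |cos|) first
    return sorted(i for _, i in scored[:keep])


# Toy 2-D hidden states; index 0 plays the role of the sink token.
states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [0.1, 1.0]]
selected = select_orthogonal_tokens(states, sink_index=0, keep=2)
```

In this toy input, tokens 2 and 4 point almost at right angles to the sink vector, so they are the ones kept; token 1, nearly parallel to the sink, is ranked least important.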

[77] Demystifying ChatGPT: How It Masters Genre Recognition

Subham Raj,Sriparna Saha,Brijraj Singh,Niranjan Pedanekar

Main category: cs.CL

TL;DR: This paper evaluates ChatGPT's genre prediction abilities using the MovieLens-100K dataset and enhances predictions by incorporating visual data from movie posters through a Vision Language Model.

Details Motivation: While ChatGPT has shown significant advancements in various NLP tasks, its capabilities and limitations in genre prediction are not yet clear. Method: This study analyzed three Large Language Models using the MovieLens-100K dataset. Zero-shot and few-shot prompts were set up using audio transcripts/subtitles from movie trailers. Additionally, IMDb movie posters were used with a Vision Language Model to enhance LLM prompts. Result: ChatGPT outperformed other LLMs in genre prediction without fine-tuning, and fine-tuned ChatGPT delivered the best results overall. The inclusion of visual data from movie posters via a Vision Language Model enhanced performance further. Conclusion: ChatGPT demonstrates remarkable genre prediction capabilities, which can be further enhanced by integrating visual information from movie posters using a Vision Language Model. Abstract: The introduction of ChatGPT has garnered significant attention within the NLP community and beyond. Previous studies have demonstrated ChatGPT's substantial advancements across various downstream NLP tasks, highlighting its adaptability and potential to revolutionize language-related applications. However, its capabilities and limitations in genre prediction remain unclear. This work analyzes three Large Language Models (LLMs) using the MovieLens-100K dataset to assess their genre prediction capabilities. Our findings show that ChatGPT, without fine-tuning, outperformed other LLMs, and fine-tuned ChatGPT performed best overall. We set up zero-shot and few-shot prompts using audio transcripts/subtitles from movie trailers in the MovieLens-100K dataset, covering 1682 movies of 18 genres, where each movie can have multiple genres. Additionally, we extended our study by extracting IMDb movie posters to utilize a Vision Language Model (VLM) with prompts for poster information. This fine-grained information was used to enhance existing LLM prompts. 
In conclusion, our study reveals ChatGPT's remarkable genre prediction capabilities, surpassing other language models. The integration of VLM further enhances our findings, showcasing ChatGPT's potential for content-related applications by incorporating visual information from movie posters.

[78] Dynamic Injection of Entity Knowledge into Dense Retrievers

Ikuya Yamada,Ryokan Ri,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CL

TL;DR: The paper introduces KPR, an enhanced BERT-based retriever that improves retrieval accuracy for queries involving less-frequent entities through the use of a context-entity attention layer and dynamic entity embeddings.

Details Motivation: Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. This limitation motivated the development of KPR to enhance retrieval performance. Method: The authors proposed the Knowledgeable Passage Retriever (KPR), a BERT-based retriever with a context-entity attention layer and dynamically updatable entity embeddings. Result: Experiments showed that KPR consistently improves retrieval accuracy, achieving a 12.6% gain on the EntityQuestions dataset over the baseline model. It also achieves state-of-the-art performance among similarly sized models on two datasets. Conclusion: KPR enhances dense retrievers by incorporating external entity knowledge without retraining, leading to improved retrieval accuracy. Abstract: Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets show that KPR consistently improves retrieval accuracy, achieving a substantial 12.6% gain on the EntityQuestions dataset over the model without KPR extensions. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Code and models will be released soon.

[79] Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models

Eva Vanmassenhove

Main category: cs.CL

TL;DR: This paper discusses how multilingual LLMs and model collapse threaten linguistic diversity, urging NLP to prioritize the protection of multilingual creativity and cultural richness.

Details Motivation: Multilingual LLMs are increasingly influential in shaping language, raising concerns about the threat they pose to linguistic diversity and cultural nuance through model collapse. Method: Drawing on recent work in Computer Vision, NLP, and MT, the author explores how self-consuming training loops reinforce biases and reduce linguistic diversity, focusing on translation technology. Result: Model collapse may cause the loss of linguistic forms, grammatical features, and cultural narratives carried by less common languages or dialects. Conclusion: The paper concludes that multilingual LLMs and model collapse can lead to the erosion of linguistic diversity, urging NLP to protect and value multilingual lexical and cultural richness. Abstract: Multilingual Large Language Models (LLMs) have considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they provide us quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry.
This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.

[80] A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text

KMA Solaiman

Main category: cs.CL

TL;DR: The paper proposes POSID, a method that extracts attributes from unstructured text effectively without supervised training.

Details Motivation: A method is needed that adapts across domains and extracts attributes from unstructured text without supervised training. Method: POSID combines lexical and semantic similarity techniques to identify relevant sentences and extract attributes. Result: It achieves effective attribute extraction on the InciText dataset and is applied successfully to human attribute recognition. Conclusion: POSID is a framework for extracting structured attributes from unstructured text without task-specific fine-tuning, and its application to a missing-person use case shows good results. Abstract: We propose POSID, a modular, lightweight and on-demand framework for extracting structured attribute-based properties from unstructured text without task-specific fine-tuning. While the method is designed to be adaptable across domains, in this work, we evaluate it on human attribute recognition in incident reports. POSID combines lexical and semantic similarity techniques to identify relevant sentences and extract attributes. We demonstrate its effectiveness on a missing person use case using the InciText dataset, achieving effective attribute extraction without supervised training.

[81] Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Ziyang Miao,Qiyu Sun,Jingyuan Wang,Yuchen Gong,Yaowei Zheng,Shiqi Li,Richong Zhang

Main category: cs.CL

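TL;DR: Easy Dataset is a unified data synthesis framework whose intuitive GUI helps users extract high-quality fine-tuning data from unstructured documents, improving LLM performance in specific domains.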

Details Motivation: Adapting large language models to specific domains remains challenging because high-quality domain data is scarce, and existing data synthesis tools struggle to extract reliable fine-tuning data from heterogeneous documents. Method: Users configure text extraction models and chunking strategies through a GUI, a persona-driven prompting approach generates diverse question-answer pairs, and a human-in-the-loop interface safeguards data quality. Result: Experiments on a financial question-answering task show that fine-tuning on the synthesized data significantly improves the domain-specific performance of LLMs. Conclusion: Easy Dataset is an effective data synthesis framework that improves LLM performance in specific domains while preserving general knowledge. Abstract: Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.

[82] Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition

Kyuhee Kim,Sangah Lee

Main category: cs.CL

TL;DR: Nunchi-Bench is introduced as a benchmark for evaluating LLMs' cultural reasoning, focusing on Korean superstitions. While models recognize factual knowledge, they struggle to apply it practically, highlighting the importance of explicit cultural framing.

Details Motivation: As LLMs become key advisors in multicultural environments, their cultural sensitivity and reasoning skills are crucial. This study aims to evaluate and advance LLMs' ability to understand and reason about cultural contexts, particularly focusing on Korean culture. Method: The authors introduced Nunchi-Bench, a benchmark with 247 questions across 31 topics focused on Korean superstitions. They evaluated multilingual LLMs in both Korean and English to assess cultural reasoning. A novel evaluation strategy with customized scoring metrics was proposed to measure how well models recognize cultural nuances and respond appropriately. Result: Models generally recognized factual cultural information but struggled when applying it in practical scenarios. Explicit cultural framing significantly enhanced performance compared to relying only on the language of the prompt. Conclusion: The study concludes that while large language models (LLMs) can recognize factual cultural information, they face significant challenges in applying this knowledge to practical cultural scenarios. Additionally, explicit cultural framing improves model performance more than relying solely on the language of the prompt. Abstract: As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs' cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. 
To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs' cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.

[83] Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning

Nayeon Kim,Eojin Jeon,Jun-Hyung Park,SangKeun Lee

Main category: cs.CL

TL;DR: KOPL是一种用于改进韩语OOV词表示的新方法,在NLP任务中表现优异,且易于集成到现有模型中。

Details Motivation: 由于韩语是一种音素文字,OOV词在自然语言处理中是一个常见问题,因此需要一种能够同时捕捉单词文本和音素信息的方法。 Method: KOPL利用韩语音素文字的语言特性以及音素与字母之间的高相关性,引入音素表示学习来增强OOV词的表示能力。 Result: 实验表明,KOPL在韩语NLP任务中表现出色,平均比现有最先进模型高出1.9%。 Conclusion: KOPL是一个处理韩语OOV词的新框架,通过结合音素和单词表示,显著提高了韩语自然语言处理任务的性能,并且可以方便地集成到现有的静态和上下文韩语嵌入模型中。 Abstract: In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, facilitating Korean OOV word representations to capture both text and phoneme information of words. We empirically demonstrate that KOPL significantly improves the performance on Korean Natural Language Processing (NLP) tasks, while being readily integrated into existing static and contextual Korean embedding models in a plug-and-play manner. Notably, we show that KOPL outperforms the state-of-the-art model by an average of 1.9%. Our code is available at https://github.com/jej127/KOPL.git.

[84] LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models

Gaurav Srivastava,Aafiya Hussain,Sriram Srinivasan,Xuan Wang

Main category: cs.CL

TL;DR: This paper introduces LLMThinkBench, a modular and scalable framework to assess basic math reasoning and overthinking in Large Language Models using configurable tasks and a new metric called Overthinking Score.

Details Motivation: Despite their strong performance on complex mathematical benchmarks, LLMs often struggle with simple arithmetic tasks and tend to overthink answers. There is a need for systematic assessment tools to diagnose these behaviors effectively. Method: The paper introduces LLMThinkBench, a modular benchmarking framework with 14 configurable math tasks, randomized test data generation, and robust parsing strategies. It uses an Overthinking Score metric based on harmonic mean to quantify accuracy-verbosity tradeoffs. The framework supports vLLM/Transformers backend, multi-GPU execution, and offers CLI and API access. Result: The proposed framework enables researchers to evaluate basic math reasoning capabilities of LLMs, measure overthinking through the Overthinking Score, and generate detailed efficiency reports. It allows extensibility, reproducibility, and distributed execution while being accessible via a pip-installable package. Conclusion: LLMThinkBench provides a scalable and flexible framework for evaluating basic math reasoning and overthinking in LLMs, serving as a cost-effective alternative to expensive evaluation methods. Abstract: Large Language Models (LLMs) have achieved remarkable performance on complex mathematical benchmarks, yet often struggle with simple arithmetic tasks and exhibit a tendency toward over-explaining or "overthinking" answers. To systematically assess this phenomenon, we introduce LLMThinkBench, a modular benchmarking framework that enables researchers to evaluate basic math reasoning and overthinking in LLMs. The framework provides 14 configurable math tasks with randomized test data generation and robust parsing strategies. Researchers can quantify overthinking using our Overthinking Score metric, which captures accuracy-verbosity tradeoffs through harmonic mean formulation. The tool offers flexible evaluation with a scalable vLLM/Transformers backend, multi-GPU support, and full configurability. 
Users can extend the tool with custom tasks, reproduce experiments with seeding, and generate detailed efficiency reports. Distributed as a pip-installable package with CLI and API access, LLMThinkBench provides researchers and practitioners an accessible, cost-effective alternative to expensive LLM-as-a-judge methods for diagnosing basic reasoning capabilities and efficiency analysis. Package can be installed as: pip install llmthinkbench
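To make the "accuracy-verbosity tradeoff through harmonic mean" concrete, the following Python sketch combines accuracy with a token-efficiency term. It is an illustrative assumption, not the package's implementation: the abstract does not state how verbosity is normalized, so the `token_efficiency` definition (one minus the fraction of a token budget used) and the function names are hypothetical.

```python
def token_efficiency(avg_tokens: float, max_tokens: float) -> float:
    """Assumed efficiency term: 1 when no tokens are used, 0 at or above the budget."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return 1.0 - min(avg_tokens / max_tokens, 1.0)


def harmonic_tradeoff(accuracy: float, avg_tokens: float, max_tokens: float) -> float:
    """Harmonic mean of accuracy and the assumed token-efficiency term."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    efficiency = token_efficiency(avg_tokens, max_tokens)
    if accuracy + efficiency == 0.0:
        return 0.0
    return 2.0 * accuracy * efficiency / (accuracy + efficiency)


# A model that is accurate but verbose scores lower than one that is
# equally accurate and concise.
concise = harmonic_tradeoff(accuracy=0.8, avg_tokens=200, max_tokens=1000)
verbose = harmonic_tradeoff(accuracy=0.8, avg_tokens=900, max_tokens=1000)
```

The harmonic mean is a natural choice for this kind of metric because it is dominated by the smaller of its two inputs, so neither high accuracy nor brevity alone can produce a high score.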

[85] Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide

Siyang Liu,Lawrence Chin-I An,Rada Mihalcea

Main category: cs.CL

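TL;DR: This paper presents an interactive retrieval-augmented generation system that helps cancer patients prepare for medical visits; a user study shows strong usability and content relevance.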

Details Motivation: Effective communication is essential in cancer care, yet patients often struggle to prepare for complex medical visits. Method: The Ottawa Personal Decision Guide is adapted into a dynamic retrieval-augmented generation workflow and evaluated through a user study. Result: The system shows high usability (UMUX mean = 6.0 out of 7), strong relevance of generated content (mean = 6.7 out of 7), minimal need for edits, and high clinical faithfulness (mean = 6.82 out of 7). Conclusion: The study demonstrates the potential of combining patient-centered design with language models for clinical preparation in oncology; the system is highly usable and its generated content is highly relevant. Abstract: Effective communication is essential in cancer care, yet patients often face challenges in preparing for complex medical visits. We present an interactive, Retrieval-augmented Generation-assisted system that helps patients progress from uninformed to visit-ready. Our system adapts the Ottawa Personal Decision Guide into a dynamic retrieval-augmented generation workflow, helping users bridge knowledge gaps, clarify personal values and generate useful questions for their upcoming visits. Focusing on localized prostate cancer, we conduct a user study with patients and a clinical expert. Results show high system usability (UMUX Mean = 6.0 out of 7), strong relevance of generated content (Mean = 6.7 out of 7), minimal need for edits, and high clinical faithfulness (Mean = 6.82 out of 7). This work demonstrates the potential of combining patient-centered design with language models to enhance clinical preparation in oncology care.

[86] Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering

Ting-Wen Ko,Jyun-Yu Jiang,Pu-Jen Cheng

Main category: cs.CL

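TL;DR: AdaPCR explicitly models dependencies between passages, improving retrieval for open-domain question answering and outperforming baselines especially on multi-hop reasoning questions.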

Details Motivation: Conventional RAG methods treat passages independently at retrieval time, which leads to redundant, noisy, or insufficiently diverse context and performs poorly on noisy corpora and multi-hop questions. Method: The AdaPCR framework treats passage combinations as the unit of retrieval and reranking. It comprises context-aware query reformulation and a reranking step aligned with downstream answer likelihood, and it adaptively selects the number of retrieved passages. Result: Experiments show that AdaPCR outperforms baselines on several QA benchmarks, particularly in multi-hop reasoning, demonstrating the value of modeling inter-passage dependencies. Conclusion: By considering passage combinations and modeling their dependencies, AdaPCR significantly improves retrieval quality in retrieval-augmented generation. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external documents at inference time, enabling up-to-date knowledge access without costly retraining. However, conventional RAG methods retrieve passages independently, often leading to redundant, noisy, or insufficiently diverse context, which is particularly problematic in noisy corpora and for multi-hop questions. To address this, we propose Adaptive Passage Combination Retrieval (AdaPCR), a novel framework for open-domain question answering with black-box LMs. AdaPCR explicitly models dependencies between passages by considering passage combinations as units for retrieval and reranking. It consists of a context-aware query reformulation using concatenated passages, and a reranking step trained with a predictive objective aligned with downstream answer likelihood. Crucially, AdaPCR adaptively selects the number of retrieved passages without additional stopping modules. Experiments across several QA benchmarks show that AdaPCR outperforms baselines, particularly in multi-hop reasoning, demonstrating the effectiveness of modeling inter-passage dependencies for improved retrieval.

[87] XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models

Zhu Liu,Zhen Hu,Lei Dai,Ying Liu

Main category: cs.CL

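TL;DR: This paper presents XISM, an interactive tool that builds and evaluates semantic maps using a top-down approach, improving on the efficiency and usability of traditional manual construction.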

Details Motivation: Semantic map models are traditionally built by hand, which is inefficient, lacks visualization and evaluation tools, and does not scale to large datasets. Method: XISM is built on the authors' prior algorithm; it constructs semantic maps from user data via a top-down approach and evaluates them with multiple metrics. Result: XISM efficiently generates editable semantic maps and lets users refine them by adjusting edges, realizing a human-in-the-loop design. Conclusion: XISM provides an interactive platform that combines data-driven methods with expert knowledge to build and refine semantic maps, offering a practical tool for typological and computational linguistics. Abstract: Semantic map models represent meanings or functions as nodes in a graph constrained by the local connectivity hypothesis, with edges indicating their associations. Widely used in typological linguistics, these models compare interrelated meanings across languages. Traditionally built manually in a bottom-up manner, they are inefficient for large datasets and lack visualization and evaluation tools. This paper introduces XISM, an interactive tool based on our prior algorithm, which constructs semantic maps from user data via a top-down approach, displays candidate maps, and evaluates them using multiple metrics. Users can refine maps by editing edges, combining data-driven efficiency with expert knowledge. This human-in-the-loop design benefits both typologists and computational linguists. The system https://770103knev48.vicp.fun/ and a demonstration video https://youtu.be/S-wsVDF2HSI?si=1OrcF41tRznaifhZ are publicly available.

[88] Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

Thomas Savage

Main category: cs.CL

TL;DR: This paper introduces Savage Conversation Forests (SCF), a branched reinforcement learning framework for fine-tuning large language models in multi-turn dialogue tasks, showing improved performance in diagnostic accuracy compared to traditional linear approaches.

Details Motivation: Current fine-tuning methods like DPO and GRPO are effective for single-turn tasks but insufficient for multi-turn applications such as medical diagnostics, where understanding conversation dynamics is crucial. Method: The paper introduces SCF, a branched conversation architecture that generates multiple conversation continuations at each turn to provide richer training signals. It was tested through experiments simulating doctor-patient conversations. Result: SCF outperformed linear architectures in diagnostic accuracy during simulated doctor-patient conversations, supporting the importance of branched training structures for multi-turn tasks. Conclusion: Savage Conversation Forests (SCF) is an effective reinforcement learning framework for fine-tuning LLMs in multi-turn conversational tasks, offering improved diagnostic accuracy over linear architectures. Abstract: Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. 
I hypothesize that SCF's improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine tuning LLMs in complex multi-turn conversational tasks.
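The branching structure described above (several candidate continuations generated at every turn, so that early responses can be compared by their downstream outcomes) can be pictured as a tree. The Python sketch below only builds such a tree with a deterministic stand-in for the model; `Turn`, `grow`, `paths`, and `stub_propose` are hypothetical names, and the reward computation and policy update of SCF are not shown because the abstract does not describe them.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One conversational turn and the alternative continuations sampled after it."""
    text: str
    children: list = field(default_factory=list)


def grow(node, propose, branching, depth):
    """Expand `node` into a tree with `branching` continuations per turn."""
    if depth == 0:
        return
    for reply in propose(node.text, branching):
        child = Turn(reply)
        node.children.append(child)
        grow(child, propose, branching, depth - 1)


def paths(node, prefix=()):
    """Enumerate every root-to-leaf conversation as a tuple of turn texts."""
    path = prefix + (node.text,)
    if not node.children:
        return [path]
    collected = []
    for child in node.children:
        collected.extend(paths(child, path))
    return collected


def stub_propose(context, k):
    """Deterministic placeholder for sampling k continuations from an LLM."""
    return [f"{context}>{i}" for i in range(k)]


root = Turn("q")
grow(root, stub_propose, branching=2, depth=3)
conversations = paths(root)
```

With two branches over three turns, the tree holds eight complete conversations that share prefixes. Sibling subtrees differ only in one early response, which is what lets a training signal attribute outcome differences to that response. A linear architecture corresponds to `branching=1`.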

[89] BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

Costas Mavromatis,Soji Adeshina,Vassilis N. Ioannidis,Zhen Han,Qi Zhu,Ian Robinson,Bryan Thompson,Huzefa Rangwala,George Karypis

Main category: cs.CL

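TL;DR: This paper proposes BYOKG-RAG, a framework for knowledge graph question answering that combines large language models with specialized graph retrieval tools to achieve better performance and stronger generalization.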

Details Motivation: Existing LLM-agent-based methods are sensitive to traversal initialization, prone to entity linking errors, and may not generalize well to custom knowledge graphs, so a more robust and general solution is needed. Method: The paper introduces BYOKG-RAG, a framework that enhances KGQA by combining large language models (LLMs) with specialized graph retrieval tools. The LLMs generate key graph artifacts, and the graph tools link these artifacts to the knowledge graph and retrieve relevant context. Result: Experiments on five benchmarks show that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5 percentage points and generalizes better to custom knowledge graphs. Conclusion: BYOKG-RAG offers a more general and robust solution for question answering over custom knowledge graphs and outperforms other methods on five benchmarks. Abstract: Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval, an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom ("bring-your-own") KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5 percentage points while showing better generalization to custom KGs. The BYOKG-RAG framework is open-sourced at https://github.com/awslabs/graphrag-toolkit.

[90] Token Level Hallucination Detection via Variance in Language Models

Keshav Kumar

Main category: cs.CL

TL;DR: This paper proposes a reference-free, token-level hallucination detection method for Large Language Models (LLMs), leveraging output instability to identify inaccuracies without requiring ground truth.

Details Motivation: LLMs are prone to hallucinations, generating confidently incorrect outputs. A reliable, model-agnostic method for detecting these hallucinations in real-time or post-hoc analysis is needed. Method: A reference-free, token-level hallucination detection framework that uses variance in token log-probabilities across multiple stochastic generations. Result: The framework reliably highlights instability in model outputs and correlates with hallucination patterns on datasets like SQuAD v2 across models such as GPT-Neo 125M, Falcon 1B, and Mistral 7B. Conclusion: The proposed framework is lightweight, reproducible, and adaptable to multiple domains, making it a valuable tool for analyzing generative reliability in LLMs. Abstract: Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.
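The core signal described above, variance of token log-probabilities across several stochastic generations, can be sketched in a few lines. This is an illustrative sketch only: it assumes the sampled generations are already aligned position by position, the log-probabilities are made-up numbers, and the threshold is a hypothetical hand-picked value rather than anything reported in the paper.

```python
from statistics import pvariance
from typing import List


def token_variance(logprob_runs: List[List[float]]) -> List[float]:
    """Per-position variance of token log-probabilities across runs.

    Assumes runs are aligned by position; positions beyond the shortest
    run are ignored.
    """
    length = min(len(run) for run in logprob_runs)
    return [pvariance([run[i] for run in logprob_runs]) for i in range(length)]


def flag_unstable(variances: List[float], threshold: float) -> List[int]:
    """Indices whose variance exceeds a (hypothetical, hand-picked) threshold."""
    return [i for i, v in enumerate(variances) if v > threshold]


# Made-up log-probabilities for three sampled generations of four tokens.
runs = [
    [-0.1, -0.2, -2.0, -0.3],
    [-0.1, -0.3, -0.5, -0.3],
    [-0.1, -0.2, -3.5, -0.3],
]
variances = token_variance(runs)
print(flag_unstable(variances, threshold=0.5))  # [2]: only position 2 is unstable
```

No reference answer is needed at any point, which is what makes the approach reference-free: the only inputs are the model's own log-probabilities over repeated samples.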

[91] Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies

Mael Jullien,Marco Valentino,Leonardo Ranaldi,Andre Freitas

Main category: cs.CL

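TL;DR: This study presents the first systematic evaluation of how prompt structure and efficient fine-tuning affect model performance in clinical natural language inference, showing that small models with optimized prompts and LoRA fine-tuning can approach the performance of large-scale models.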

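Details Motivation: Although prior work has examined how prompting strategies and fine-tuning techniques affect the reasoning capabilities of large language models (LLMs), their effectiveness in clinical natural language inference (NLI) remains underexplored. This paper aims to fill that gap by systematically evaluating how prompt structure and efficient fine-tuning jointly affect model performance on clinical NLI. Method: The authors examine four classes of prompting strategies that elicit LLM reasoning at different levels of abstraction, construct high-quality demonstrations with a frontier model, and distil multi-step reasoning into smaller models via Low-Rank Adaptation (LoRA). They fine-tune different language models on the NLI4CT benchmark and analyze how prompt type affects performance. Result: Prompt type alone accounts for up to 44% of the variance in macro-F1. LoRA fine-tuning yields gains of 8 to 12 F1 points, raises output alignment above 97%, and narrows the performance gap to GPT-4o-mini to within 7.1%. In addition, LoRA improves performance in 75% of the models on MedNLI and the TREC Clinical Trials Track. Conclusion: Combining strong prompt design with lightweight adaptation such as LoRA can substantially improve small models on clinical NLI, allowing them to rival frontier-scale models. Prompt structure is a primary driver of clinical reasoning performance, and reasoning-type-aware evaluation is essential to uncover prompt-induced trade-offs. Abstract: Recent works on large language models (LLMs) have demonstrated the impact of prompting strategies and fine-tuning techniques on their reasoning capabilities. Yet, their effectiveness on clinical natural language inference (NLI) remains underexplored. This study presents the first controlled evaluation of how prompt structure and efficient fine-tuning jointly shape model performance in clinical NLI. We inspect four classes of prompting strategies to elicit reasoning in LLMs at different levels of abstraction, and evaluate their impact on a range of clinically motivated reasoning types. For each prompting strategy, we construct high-quality demonstrations using a frontier model to distil multi-step reasoning capabilities into smaller models (4B parameters) via Low-Rank Adaptation (LoRA). Across different language models fine-tuned on the NLI4CT benchmark, we found that prompt type alone accounts for up to 44% of the variance in macro-F1. Moreover, LoRA fine-tuning yields consistent gains of +8 to 12 F1, raises output alignment above 97%, and narrows the performance gap to GPT-4o-mini to within 7.1%. Additional experiments on reasoning generalisation reveal that LoRA improves performance in 75% of the models on MedNLI and TREC Clinical Trials Track. 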
Overall, these findings demonstrate that (i) prompt structure is a primary driver of clinical reasoning performance, (ii) compact models equipped with strong prompts and LoRA can rival frontier-scale systems, and (iii) reasoning-type-aware evaluation is essential to uncover prompt-induced trade-offs. Our results highlight the promise of combining prompt design and lightweight adaptation for more efficient and trustworthy clinical NLP systems, providing insights on the strengths and limitations of widely adopted prompting and parameter-efficient techniques in highly specialised domains.
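For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the standard LoRA forward pass, in which a frozen weight matrix W is augmented by a trainable low-rank product B A scaled by alpha / r. It is a generic illustration of LoRA, not the authors' training code; the function names and all numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: List[float], alpha: float) -> List[float]:
    """Compute W x + (alpha / r) * B (A x), where r is the number of rows of A."""
    r = len(A)
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


# Toy numbers, chosen only for illustration.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 2 x 3 frozen weight
A = [[1.0, 1.0, 1.0]]                     # rank r = 1
B = [[0.5], [0.0]]
y = lora_forward(W, A, B, [1.0, 2.0, 3.0], alpha=2.0)
print(y)                                   # [7.0, 2.0]
print(lora_param_count(4096, 4096, 8))     # 65536, versus 4096 * 4096 = 16777216
```

The parameter count illustrates why the technique is described as lightweight: for a 4096 x 4096 layer at rank 8, the adapter trains 1/256 of the weights that full fine-tuning of that layer would touch.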

[92] Large Language Models for Zero-Shot Multicultural Name Recognition

Thanakorn Phonchai,Surasakdi Siripong,Nicholas Patterson,Owen Campbell

Main category: cs.CL

TL;DR: This paper presents a new framework called Prompt-Engineered Fine-Tuning (PEFT) for enhancing zero-shot multicultural name recognition using Large Language Models with Adversarial Data Augmentation and Cultural Knowledge Graph Integration, showing significant improvements in accuracy compared to existing methods.

Details Motivation: The motivation behind this research is the critical challenge of robustly and accurately recognizing multicultural names, especially those not previously encountered, in an increasingly globalized digital landscape. Traditional methods struggle with the vast diversity and novel permutations of names across different linguistic and cultural backgrounds. Method: The paper proposes a novel framework called Prompt-Engineered Fine-Tuning (PEFT) for Large Language Models (LLMs), which incorporates Adversarial Data Augmentation and Cultural Knowledge Graph Integration. The method transforms the recognition task into a guided generation problem by leveraging pre-trained LLMs, meticulous prompt engineering, dynamic integration of cultural knowledge, and adversarial data augmentation. Result: Extensive experiments show that the PEFT method outperforms established deep learning baselines, achieving 93.1% overall accuracy and 89.5% accuracy on challenging zero-shot name identification. An ablation study confirms the synergistic contribution of each component, and human evaluation indicates that the method's performance approaches human expert judgment. Conclusion: This paper concludes that the introduced PEFT framework significantly enhances zero-shot multicultural name recognition, offering a highly effective and scalable solution for real-world applications. Abstract: The robust and accurate recognition of multicultural names, particularly those not previously encountered, is a critical challenge in an increasingly globalized digital landscape. Traditional methods often falter when confronted with the vast diversity and novel permutations of names across different linguistic and cultural backgrounds. 
This paper introduces a novel framework, Prompt-Engineered Fine-Tuning (PEFT) for Large Language Models (LLMs) with Adversarial Data Augmentation and Cultural Knowledge Graph Integration, designed to significantly enhance zero-shot multicultural name recognition. Our approach leverages the powerful linguistic understanding of pre-trained LLMs, transforming the recognition task into a guided generation problem. Through meticulous prompt engineering, dynamic integration of explicit cultural knowledge derived from knowledge graphs, and the strategic application of adversarial data augmentation, we equip the LLM with an unprecedented ability to infer the cultural origin of unseen names. Extensive experiments demonstrate that our PEFT method consistently outperforms established deep learning baselines, including advanced Bi-LSTM models with cultural tags, achieving an impressive 93.1% overall accuracy and a remarkable 89.5% accuracy on challenging zero-shot name identification. An in-depth ablation study confirms the synergistic contribution of each component, while a human evaluation highlights our method's performance approaching human expert judgment. This work signifies a substantial leap in multicultural name recognition, offering a highly effective and scalable solution for real-world applications.

[93] SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding

Runcong Zhao,Qinglin Zhu,Hainiu Xu,Bin Liang,Yulan He,Lin Gui

Main category: cs.CL

TL;DR: SymbolicThought is a framework that combines LLMs with symbolic reasoning to improve the accuracy and efficiency of character relationship extraction.

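Details Motivation: Understanding character relationships is essential for interpreting complex narratives and for socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs, so a more efficient and accurate way to extract and validate character relationships is needed. Method: SymbolicThought uses an LLM to extract character relationships, refines the relationship graph with seven types of logical constraints, and provides an interactive interface for real-time validation and conflict resolution. The authors also release a dataset of 160 interpersonal relationships with their logical structures to support logical supervision and explainable social analysis. Result: Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost. Conclusion: SymbolicThought is a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning to build editable character relationship graphs. It improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation. Abstract: Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.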

[94] Context Tuning for In-Context Optimization

Jack Lu,Ryan Teehan,Zhenbang Yang,Mengye Ren

Main category: cs.CL

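TL;DR: Context Tuning initializes a trainable prompt or prefix with task-specific demonstration examples, improving few-shot adaptation of language models without fine-tuning model parameters.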

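Details Motivation: Prompt-based adaptation techniques typically initialize a trainable prompt or prefix with tokens that are irrelevant to the task at hand. Method: Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's in-context learning ability without fine-tuning model parameters. Result: On benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC, Context Tuning outperforms traditional prompt-based adaptation methods and achieves accuracy competitive with Test-Time Training at significantly higher training efficiency. Conclusion: Initializing trainable prompts from demonstrations is a simple and effective way to improve few-shot adaptation. Abstract: We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for large language models (LLMs), they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.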

[95] Fairness Evaluation of Large Language Models in Academic Library Reference Services

Haining Wang,Jason Clark,Yueru Yan,Star Bradley,Ruiyang Chen,Yiqiong Zhang,Hengyi Fu,Zuoyu Tian

Main category: cs.CL

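TL;DR: This study examines whether large language models (LLMs) can provide equitable communication in academic library reference services, and analyzes potential biases and service disparities.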

Details Motivation: As libraries explore large language models (LLMs) for virtual reference services, it is important to ensure that LLMs serve all users equitably, regardless of demographics or social status. Method: The authors prompt six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role, and evaluate whether the responses differ across user identities. Result: The study finds no evidence of differentiation by race or ethnicity and only minor evidence of stereotypical bias against women in one model. The LLMs show nuanced accommodation of institutional roles in their linguistic choices, reflecting professional norms rather than discriminatory treatment. Conclusion: Current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services. Abstract: As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries' commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

[96] No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Dasol Choi,Woomyoung Park,Youngsook Song

Main category: cs.CL

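TL;DR: By analyzing the HuggingFace ecosystem, this paper examines how the Chinese, Japanese, and Korean NLP communities differ in dataset creation and curation, and proposes strategies for improving dataset quality and cross-lingual resource sharing.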

Details Motivation: While extensive resources and analyses exist for English, datasets for East Asian languages, particularly Chinese, Japanese, and Korean, remain fragmented and underexplored, even though these languages together serve over 1.6 billion speakers. This gap needs to be addressed. Method: Drawing on more than 3,300 datasets, the authors use quantitative and qualitative methods to study, from a cross-linguistic perspective, how cultural norms, research environments, and institutional practices shape dataset availability and quality in the Chinese, Japanese, and Korean NLP communities. Result: Chinese datasets tend to be large-scale and institution-driven, Korean NLP development is led by grassroots communities, and Japanese datasets emphasize entertainment and subculture content. Conclusion: The paper discusses best practices for future dataset curation and collaboration to strengthen resource development across all three languages and, by uncovering these patterns, proposes practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing. Abstract: Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

[97] HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation

Naquee Rizwan,Seid Muhie Yimam,Daryna Dementieva,Florian Skupin,Tim Fischer,Daniil Moskovskiy,Aarushi Ajay Borkar,Robert Geislinger,Punyajoy Saha,Sarthak Roy,Martin Semmann,Alexander Panchenko,Chris Biemann,Animesh Mukherjee

Main category: cs.CL

TL;DR: HatePRISM identifies inconsistencies in hate speech regulation and proposes a unified framework for automated moderation.

Details Motivation: Despite existing regulations and strategies, hateful content remains a significant challenge, necessitating a comprehensive examination to identify gaps and suggest future research directions. Method: HatePRISM analyzes hate speech regulations and strategies from three perspectives: country regulations, social platform policies, and NLP research datasets. Result: The study found significant inconsistencies in hate speech definitions and moderation practices across jurisdictions and platforms, along with a lack of alignment with research efforts. Conclusion: HatePRISM highlights the inconsistencies in hate speech definitions and moderation practices across jurisdictions, platforms, and research efforts, suggesting the need for a unified framework for automated hate speech moderation. Abstract: Despite regulations imposed by nations and social media platforms, e.g. (Government of India, 2021; European Parliament and Council of the European Union, 2022), inter alia, hateful content persists as a significant challenge. Existing approaches primarily rely on reactive measures such as blocking or suspending offensive messages, with emerging strategies focusing on proactive measurements like detoxification and counterspeech. In our work, which we call HatePRISM, we conduct a comprehensive examination of hate speech regulations and strategies from three perspectives: country regulations, social platform policies, and NLP research datasets. Our findings reveal significant inconsistencies in hate speech definitions and moderation practices across jurisdictions and platforms, alongside a lack of alignment with research efforts. Based on these insights, we suggest ideas and research direction for further exploration of a unified framework for automated hate speech moderation incorporating diverse strategies.

[98] Large Language Models' Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products

Soojong Kim,Kwanho Kim,Hye Min Kim

Main category: cs.CL

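TL;DR: This study evaluates how well three large language models (GPT, Gemini, and LLAMA) detect health-related sentiments on social media platforms, revealing performance differences across contexts and remaining challenges.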

Details Motivation: Machine learning methods are increasingly used to analyze health-related public discourse from large-scale data, but questions remain about their ability to accurately detect different types of health sentiments. In particular, the accuracy and feasibility of large language models (LLMs) in capturing different opinions and perspectives on health issues are largely unexplored. Method: Using data from Facebook and Twitter, the authors curate multiple sets of messages supporting or opposing recommended health behaviors, with human annotations as the gold standard for sentiment classification. They then evaluate the accuracy of three prominent LLMs (GPT, Gemini, and LLAMA) in detecting risk-promoting versus health-supporting sentiments. Result: All three LLMs generally show substantial accuracy in classifying risk-promoting and health-supporting sentiments, but notable discrepancies emerge by platform, health issue, and model type. Models are more accurate for risk-promoting sentiment on Facebook, whereas health-supporting messages are detected more accurately on Twitter. The LLMs also struggle to reliably detect neutral messages. Conclusion: Language models should be carefully selected and validated for public health analyses, because potential biases in training data may lead LLMs to overestimate or underestimate the prevalence of certain perspectives. Abstract: Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. Especially, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. Thus, this research examines how accurate the three prominent LLMs (GPT, Gemini, and LLAMA) are in detecting risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, models often show higher accuracy for risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are more accurately detected. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. 
These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.

[99] Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?

Ruochen Zhou,Minrui Xu,Shiqi Chen,Junteng Liu,Yunqi Li,Xinxin Lin,Zhengyu Chen,Junxian He

Main category: cs.CL

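TL;DR: This study shows that training on mathematical problem-solving with long reasoning chains and self-reflection can significantly improve performance on general reasoning tasks, whereas conventional short-chain training offers limited benefit and can even be harmful.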

Details Motivation: Most research focuses on building specialized models for solving mathematical problems, but it remains unclear how learning mathematical problem-solving generalizes to other reasoning abilities. This paper investigates the generalization potential of different mathematical problem-solving training approaches. Method: The paper empirically investigates various mathematical problem-solving training approaches, such as continual pretraining, instruction tuning, and rule-based reinforcement learning, and evaluates them on 5 mathematical and 8 general reasoning benchmarks. Result: (1) Continual pretraining generalizes to general reasoning tasks to some extent; (2) instruction tuning on conventional short mathematical problem-solving samples provides limited benefits and can even impair generalization; (3) training with long chain-of-thought responses and rule-based reinforcement learning significantly improves generalization. Conclusion: Traditional approaches to learning mathematical problem-solving largely fail to achieve robust generalization, but the emerging paradigm of long reasoning chains coupled with self-reflection offers a promising direction for improving generalized reasoning through learning from specialized domains. Abstract: There has been a growing interest in enhancing the mathematical problem-solving (MPS) capabilities of large language models. While the majority of research efforts concentrate on creating specialized models to solve mathematical problems, it remains unknown how learning mathematical problem-solving generalizes to help develop other reasoning abilities. In this paper, we present an empirical investigation into the generalization potential of various MPS training approaches, such as continual pretraining, instruction tuning, and rule-based reinforcement learning across various data sources, including both short and long chain-of-thought (CoT) samples. Evaluation on 5 mathematical and 8 general reasoning benchmarks shows that continual pretraining on math text is able to generalize to general reasoning tasks to some extent. In contrast, instruction tuning on conventional, short MPS samples provides limited benefits and, in many cases, even impairs generalization performance. Notably, training with long CoT responses for MPS samples and incorporating rule-based reinforcement learning on MPS queries exhibit distinct behavior, significantly enhancing generalization by extending the model's reasoning processes into other domains. These results suggest that traditional approaches to learning MPS with short reasoning chains largely fail to achieve robust generalization. However, the emerging paradigm of longer reasoning chains, coupled with self-reflection, offers a promising direction for improving generalized reasoning abilities through learning from specialized domains.

[100] SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive

Yingqiang Gao,Fabian Winiger,Patrick Montjourides,Anastassia Shaitarova,Nianlong Gu,Simon Peng-Keller,Gerold Schneider

Main category: cs.CL

TL;DR: SpiritRAG is an effective Question Answering system that addresses challenges in researching religion and spirituality by utilizing Retrieval-Augmented Generation on UN documents.

Details Motivation: Religion and spirituality are complex concepts that are difficult to operationalize in conventional archival search strategies due to their context-specificity, leading to a reliance on published literature and manual desk reviews. Method: The study involved building SpiritRAG, an interactive Question Answering system based on Retrieval-Augmented Generation using 7,500 United Nations resolution documents. It was tested with domain experts on 100 manually composed questions. Result: A pilot test demonstrated the practical value and usefulness of SpiritRAG in enabling efficient and context-sensitive database searches for actionable insights related to religion and spirituality. Conclusion: SpiritRAG proves to be a practical and valuable tool for researchers and policymakers in conducting complex searches related to religion and spirituality on large datasets through an accessible interface. Abstract: Religion and spirituality (R/S) are complex and highly domain-dependent concepts which have long confounded researchers and policymakers. Due to their context-specificity, R/S are difficult to operationalize in conventional archival search strategies, particularly when datasets are very large, poorly accessible, and marked by information noise. As a result, considerable time investments and specialist knowledge is often needed to extract actionable insights related to R/S from general archival sources, increasing reliance on published literature and manual desk reviews. To address this challenge, we present SpiritRAG, an interactive Question Answering (Q&A) system based on Retrieval-Augmented Generation (RAG). Built using 7,500 United Nations (UN) resolution documents related to R/S in the domains of health and education, SpiritRAG allows researchers and policymakers to conduct complex, context-sensitive database searches of very large datasets using an easily accessible, chat-based web interface. 
SpiritRAG is lightweight to deploy and leverages both UN documents and user-provided documents as source material. A pilot test and evaluation with domain experts on 100 manually composed questions demonstrate the practical value and usefulness of SpiritRAG.
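The retrieval step of a Retrieval-Augmented Generation pipeline can be illustrated with a toy example. The sketch below is not SpiritRAG's implementation: the abstract does not say which retriever or embedding model is used, so a bag-of-words cosine similarity is swapped in for simplicity, and the documents, function names, and prompt wording are all hypothetical placeholders.

```python
import math
from collections import Counter
from typing import List


def bag(text: str) -> Counter:
    """Lower-cased bag-of-words representation of a text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    shared = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return shared / norm if norm else 0.0


def retrieve(question: str, docs: List[str], k: int) -> List[str]:
    """Return the k documents most similar to the question."""
    q = bag(question)
    return sorted(docs, key=lambda d: cosine(q, bag(d)), reverse=True)[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n{context}\nQuestion: {question}"


# Placeholder texts, not real UN resolutions.
docs = [
    "resolution on spiritual care in health services",
    "resolution on school funding and teacher training",
    "resolution on maritime trade routes",
]
question = "spiritual care in health"
top = retrieve(question, docs, k=1)
print(build_prompt(question, top))
```

The assembled prompt would then be passed to a language model, which answers from the retrieved passages rather than from its parametric memory alone. Adding user-provided documents, as the abstract describes, amounts to extending `docs` before retrieval.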

[101] THM@SimpleText 2025 -- Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts

Nico Hofmann,Julian Dauenhauer,Nils Ole Dietzler,Idehen Daniel Idahor,Christin Katharina Kreutz

Main category: cs.CL

TL;DR: This paper explores how language models can simplify complex scientific sentences, making them accessible to non-experts while maintaining their original meaning.

Details Motivation: Scientific texts are inherently complex due to technical terminology, making it difficult for non-experts, such as politicians or family members of patients, to understand important findings. Simplification improves accessibility to innovation and information. Method: The research investigates the identification of complex terms in sentences and employs small Gemini and OpenAI large language models to rephrase them for easier understanding by non-expert readers. Result: The paper presents methods for simplifying complex sentences with minimal context, focusing on task 1.1 of the SimpleText CLEF Lab 2025 edition, which involves sentence-level simplification using language models. Conclusion: The study concludes that simplifying scientific text enhances accessibility for non-domain experts, and this can be achieved through the identification and rephrasing of complex terms using language models. Abstract: Scientific text is complex as it contains technical terms by definition. Simplifying such text for non-domain experts enhances accessibility of innovation and information. Politicians could be enabled to understand new findings on topics on which they intend to pass a law, or family members of seriously ill patients could read about clinical trials. The SimpleText CLEF Lab focuses on exactly this problem of simplification of scientific text. Task 1.1 of the 2025 edition specifically handles the simplification of complex sentences, so very short texts with little context. To tackle this task we investigate the identification of complex terms in sentences which are rephrased using small Gemini and OpenAI large language models for non-expert readers.

[102] MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

Emilio Villa-Cueva,S M Masrur Ahmed,Rendi Chevi,Jan Christian Blaise Cruz,Kareem Elzeky,Fermin Cristobal,Alham Fikri Aji,Skyler Wang,Rada Mihalcea,Thamar Solorio

Main category: cs.CL

TL;DR: MOMENTS is a new benchmark designed to evaluate the multimodal ToM capabilities of large language models using narrative-rich short films.

Details Motivation: To advance the development of socially intelligent multimodal agents by providing a robust assessment tool for Theory of Mind (ToM) capabilities. Method: Development of a comprehensive benchmark called MOMENTS, which uses short films to assess ToM capabilities through 2,344 multiple-choice questions across seven categories. Result: Current systems show difficulty in effectively integrating visual modalities despite their general enhancement of model performance. Conclusion: The MOMENTS benchmark highlights the importance of multimodal understanding in developing socially intelligent agents and identifies areas for further research. Abstract: Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters' mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI's multimodal understanding of human behavior.

[103] RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling

Xiuying Wei,Anunay Yadav,Razvan Pascanu,Caglar Gulcehre

Main category: cs.CL

TL;DR: This paper proposes a new mechanism called RAT (Recurrence and Attention Transformer) that combines recurrence and attention mechanisms to improve the efficiency and performance of Transformers.

Details Motivation: Transformers heavily rely on softmax attention, which creates a computational bottleneck, especially for long-context settings. The authors aim to address this limitation by introducing an intermediate design between recurrence and attention mechanisms. Method: The authors propose a new design called RAT that partitions the input into chunks, applies linear recurrence within each chunk, and performs softmax attention across chunks. They also interleave RAT with local attention in a hybrid architecture. Result: With a chunk size of 16, the RAT layer achieves a 7× improvement in training speed with 100K token sequences and 9× in generation at 4K sequence length while maintaining similar or better accuracy compared to standard attention. Conclusion: The paper concludes that their proposed hybrid architecture combining RAT and local attention improves inference speed, reduces cache memory usage, and enhances performance on various tasks. Abstract: Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce an intermediate design called RAT between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the size of the chunk, RAT enables flexible trade-offs, combining the strengths of RNN and attention. Empirically, with a chunk size of 16, the RAT layer achieves a 7× improvement in training speed with 100K token sequences and 9× in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. 
We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning~(SFT). We further propose a hybrid architecture that interleaves \rat with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1 point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1 point Rouge-L increase in a summarization SFT task. Code is available at https://github.com/CLAIRE-Labo/RAT
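
The chunk-then-attend idea can be illustrated with a small pure-Python sketch. This is one plausible reading of the abstract, not the authors' implementation: the fixed `decay` gate, the use of the running state as the query, and the function name `rat_layer_sketch` are all assumptions made for illustration.

```python
import math

def rat_layer_sketch(xs, chunk_size, decay=0.5):
    """Toy layer: linear recurrence inside each chunk, softmax attention across chunks.

    xs is a list of equal-length float vectors; one output vector is returned per token.
    """
    dim = len(xs[0])
    summaries = []            # final recurrent state of every completed chunk
    outputs = []
    state = [0.0] * dim
    for t, x in enumerate(xs):
        if t % chunk_size == 0:
            if t > 0:
                summaries.append(state)
            state = [0.0] * dim   # the recurrence restarts at each chunk boundary
        # Linear recurrence captures local, within-chunk dependencies.
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
        # Causal softmax attention over earlier chunk summaries plus the current state.
        keys = summaries + [state]
        scores = [sum(q * k for q, k in zip(state, key)) / math.sqrt(dim) for key in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * key[j] for w, key in zip(weights, keys)) / total for j in range(dim)]
        )
    return outputs
```

With `chunk_size` equal to the sequence length the sketch degenerates to a pure recurrence, and with `chunk_size=1` every token becomes its own attention key, which mirrors the trade-off the abstract describes.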

[104] GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models

Kai Yao,Zhaorui Tan,Penglei Gao,Lichun Li,Kaixin Wu,Yinggui Wang,Yuan Zhao,Yixin Ji,Wei Wang,Jianke Zhu

Main category: cs.CL

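TL;DR: This paper proposes GradOT, a new offsite-tuning method that uses gradient-preserving compression to improve large language model performance while preserving privacy, and provides theoretical support for the field.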

Details Motivation: Existing OT methods are computationally expensive and lack theoretical analysis, so a more efficient and theoretically grounded method is needed to address the trade-off between privacy and performance. Method: Solves the OT problem within an optimization framework by selectively applying gradient-preserving compression techniques such as rank compression and channel pruning. Result: Experiments show that GradOT outperforms existing OT methods in both privacy protection and model performance, and the work provides a theoretical foundation for OT. Conclusion: GradOT offers an effective, theoretically supported offsite-tuning method that improves model performance while protecting privacy. Abstract: The rapid growth of large language models (LLMs) with traditional centralized fine-tuning emerges as a key technique for adapting these models to domain-specific challenges, yielding privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), is proposed to address these challenges, where a weaker emulator is compressed from the original model and further fine-tuned with an adapter to enhance privacy. However, the existing OT-based methods require high computational costs and lack theoretical analysis. This paper introduces a novel OT approach based on gradient-preserving compression, named GradOT. By analyzing the OT problem through the lens of optimization, we propose a method that selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of fine-tuned adapters while ensuring privacy. Extensive experiments demonstrate that our approach surpasses existing OT methods, both in terms of privacy protection and model performance. Our method provides a theoretical foundation for OT and offers a practical, training-free solution for offsite-tuning of large-scale LLMs.

[105] Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection

Soumyadeep Jana,Abhrajyoti Kundu,Sanasam Ranbir Singh

Main category: cs.CL

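TL;DR: This paper proposes MiDRE, a model that combines internal and external reasoning experts to improve multimodal sarcasm detection, and shows the key role of external rationales in understanding sarcasm.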

Details Motivation: Existing multimodal sarcasm detection models rely mainly on shallow cues and struggle to capture the deeper rationale behind sarcasm, so a method that fuses internal and external reasoning is needed. Method: Proposes a new model, MiDRE, which contains an internal reasoning expert that detects incongruity within the image-text pair and an external reasoning expert that uses structured rationales generated via Chain-of-Thought prompting; an adaptive gating mechanism dynamically weighs the two experts. Result: Experiments on two benchmark datasets show that MiDRE outperforms baseline models, and qualitative analyses highlight the important role of external rationales, which provide valuable cues to the model even when they are occasionally noisy. Conclusion: By combining internal and external reasoning experts, MiDRE performs well in multimodal sarcasm detection and demonstrates the importance of external rationales for understanding sarcasm. Abstract: Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose MiDRE (Mixture of Dual Reasoning Experts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting to a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.
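
As a rough illustration of how a gate can weigh two experts, the sketch below mixes two sets of class logits with a sigmoid gate. The scalar `gate_score` stands in for whatever the learned gating network would produce; the function name and the choice of a single scalar gate are assumptions, since the abstract does not describe the gate's form.

```python
import math

def gated_mixture(internal_logits, external_logits, gate_score):
    """Blend two experts' logits; gate_score is a hypothetical learned scalar."""
    gate = 1.0 / (1.0 + math.exp(-gate_score))   # sigmoid maps the score into (0, 1)
    # A gate near 1 trusts the internal expert, a gate near 0 trusts the external one.
    return [gate * a + (1.0 - gate) * b
            for a, b in zip(internal_logits, external_logits)]
```

A large positive `gate_score` effectively ignores the external rationale, which is one way a model could discount rationales that happen to be noisy.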

[106] Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection

Soumyadeep Jana,Abhrajyoti Kundu,Sanasam Ranbir Singh

Main category: cs.CL

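TL;DR: This paper proposes DMDP, a multimodal sarcasm detection framework for few-shot settings that improves model performance and generalization through deep modality-specific prompts and a cross-modal alignment module.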

Details Motivation: Existing models rely on large amounts of labeled data, which is scarce in real-world scenarios, so sarcasm detection needs to be explored in a few-shot setting. Method: Proposes DMDP, a new framework that uses gated, deep modality-specific prompts, hierarchical feature learning, a prompt-sharing mechanism, and a cross-modal prompt alignment module. Result: Experiments show that DMDP performs strongly on two public datasets in both few-shot and extremely low-resource settings, and outperforms baseline methods in cross-dataset evaluation. Conclusion: DMDP effectively improves multimodal sarcasm detection when very little data is available and generalizes well across domains. Abstract: The widespread use of multimodal content on social media has heightened the need for effective sarcasm detection to improve opinion mining. However, existing models rely heavily on large annotated datasets, making them less suitable for real-world scenarios where labeled data is scarce. This motivates the need to explore the problem in a few-shot setting. To this end, we introduce DMDP (Deep Modality-Disentangled Prompt Tuning), a novel framework for few-shot multimodal sarcasm detection. Unlike prior methods that use shallow, unified prompts across modalities, DMDP employs gated, modality-specific deep prompts for text and visual encoders. These prompts are injected across multiple layers to enable hierarchical feature learning and better capture diverse sarcasm types. To enhance intra-modal learning, we incorporate a prompt-sharing mechanism across layers, allowing the model to aggregate both low-level and high-level semantic cues. Additionally, a cross-modal prompt alignment module enables nuanced interactions between image and text representations, improving the model's ability to detect subtle sarcastic intent. Experiments on two public datasets demonstrate DMDP's superior performance in both few-shot and extremely low-resource settings. Further cross-dataset evaluations show that DMDP generalizes well across domains, consistently outperforming baseline methods.

[107] Unveiling the Potential of Diffusion Large Language Model in Controllable Generation

Zhen Xiong,Yujun Cai,Zhecheng Li,Yiwei Wang

Main category: cs.CL

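TL;DR: This paper introduces Self-adaptive Schema Scaffolding ($S^3$), a new framework that addresses the problems diffusion language models face in controllable text generation, and reports substantial improvements.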

Details Motivation: Existing applications of diffusion language models face significant challenges in controllable generation: the native multi-step denoising process is highly sensitive to sequence length, hallucination rates are elevated, and inference is costly. Method: Proposes a new framework, Self-adaptive Schema Scaffolding ($S^3$), which injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Result: Experiments show that $S^3$ achieves a 65% increase in structural adherence, a 48% improvement in content fidelity, and a 17% reduction in hallucination rates. Conclusion: The $S^3$ framework lays theoretical foundations and a practical path for deploying diffusion models in controllable text generation tasks; code and data will be publicly released. Abstract: Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose Self-adaptive Schema Scaffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65% increase in structural adherence, 48% enhancement in content fidelity, and 17% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.

[108] AdS: Adapter-state Sharing Framework for Multimodal Sarcasm Detection

Soumyadeep Jana,Sahil Danayak,Sanasam Ranbir Singh

Main category: cs.CL

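TL;DR: This paper proposes AdS, a lightweight framework for multimodal sarcasm detection that substantially improves parameter efficiency and model performance in low-resource settings.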

Details Motivation: Full fine-tuning of large pre-trained models is unsuitable for low-resource settings, while existing parameter-efficient fine-tuning methods underperform on complex tasks such as sarcasm detection. Method: Proposes AdS (Adapter-State Sharing), which builds on parameter-efficient fine-tuning and uses an adapter-state sharing mechanism to improve multimodal sarcasm detection. Result: Experiments show that AdS achieves state-of-the-art results on two public benchmarks while using far fewer trainable parameters than existing parameter-efficient and full fine-tuning methods. Conclusion: By building a lightweight framework on CLIP, inserting adapters only in the upper layers, and letting textual adapters guide visual ones, AdS achieves efficient cross-modal learning under resource constraints. Abstract: The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining, especially under resource constraints. Existing approaches rely on full fine-tuning of large pre-trained models, making them unsuitable for low-resource settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks like sarcasm detection. We propose AdS (Adapter-State Sharing), a lightweight framework built on CLIP that inserts adapters only in the upper layers and introduces a novel adapter-state sharing mechanism, where textual adapters guide visual ones. This design promotes efficient cross-modal learning while preserving low-level unimodal representations. Experiments on two public benchmarks demonstrate that AdS achieves state-of-the-art results using significantly fewer trainable parameters than existing PEFT and full fine-tuning approaches.

[109] DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

Rushil Thareja,Preslav Nakov,Praneeth Vepakomma,Nils Lukas

Main category: cs.CL

TL;DR: This paper introduces DP-Fusion, a differentially private inference approach for LLMs that balances the need for context privacy and utility in document privatization tasks.

Details Motivation: To address leakage of sensitive context information by Large Language Models (LLMs) during inference, whether it happens accidentally or under adversarial prompting. Method: The method involves partitioning sensitive tokens into privacy groups, running the LLM once per group, and blending the output distributions so that the result stays within a fixed statistical distance of the baseline distribution. Result: DP-Fusion demonstrates fine-grained control over the privacy-utility trade-off for document privatization tasks, where sensitive content like Personally Identifiable Information (PII) is hidden while preserving overall text utility. Conclusion: DP-Fusion provides a token-level Differentially Private Inference mechanism that effectively controls the privacy/utility trade-off by blending output distributions from multiple LLM forward passes. Abstract: Large language models (LLMs) can leak sensitive information from their context through generated outputs, either accidentally or when prompted adversarially. Existing defenses that aim to preserve context privacy during inference either lack formal guarantees or suffer from a poor utility/privacy trade-off. We propose DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM's outputs reveal about sensitive tokens in its context. We demonstrate DPI through the task of document privatization, where the goal is to paraphrase documents so that sensitive content (e.g., Personally Identifiable Information, PII) cannot be reliably inferred, while still preserving the overall utility of the text. This is controlled by a parameter $\epsilon$: $\epsilon=0$ hides PII entirely, while higher values trade off privacy for improved paraphrase quality. 
DP-Fusion works as follows: (i) partition sensitive tokens into disjoint privacy groups, (ii) run the LLM once per group, and (iii) blend the output distributions so that the final output remains within a fixed statistical distance of the baseline distribution produced when no privacy group is revealed. This approach allows fine-grained control over the privacy/utility trade-off but requires multiple LLM forward passes.
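
Step (iii) can be sketched for a single next-token distribution. The abstract does not say which statistical distance is used or how it relates to $\epsilon$, so this example assumes total variation distance as a stand-in; `blend_within_bound` and `max_distance` are hypothetical names.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def blend_within_bound(p_public, p_private, max_distance):
    """Mix the private-view distribution into the public baseline.

    The mixture lam * p_private + (1 - lam) * p_public has total variation
    lam * TV(p_private, p_public) from the baseline, so the largest allowed
    lam has a closed form under this (assumed) choice of distance.
    """
    tv = total_variation(p_public, p_private)
    lam = 1.0 if tv <= max_distance else max_distance / tv
    mixed = [lam * b + (1.0 - lam) * a for a, b in zip(p_public, p_private)]
    return mixed, lam
```

Setting `max_distance` to 0 returns the baseline distribution unchanged, which matches the abstract's description of the strictest setting hiding the sensitive tokens entirely.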

[110] Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Guokan Shang,Hadi Abdine,Ahmad Chamma,Amr Mohamed,Mohamed Anwar,Abdelaziz Bounhar,Omar El Herraoui,Preslav Nakov,Michalis Vazirgiannis,Eric Xing

Main category: cs.CL

TL;DR: Nile-Chat introduces LLMs for Egyptian dialect capable of handling both Arabic and Latin scripts, achieving superior performance through a novel language adaptation approach.

Details Motivation: To build LLMs for the Egyptian dialect that can understand and generate text written in both Arabic and Latin scripts. Method: The Branch-Train-MiX strategy was leveraged to merge script-specialized experts into a single MoE model. Result: The Nile-Chat models significantly outperformed leading multilingual and Arabic LLMs on newly introduced Egyptian evaluation benchmarks. The 12B model achieved a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. Conclusion: The work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development. Abstract: We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.

[111] PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

Xinliang Frederick Zhang,Nick Beauchamp,Lu Wang

Main category: cs.CL

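TL;DR: This study proposes PRIME, a unified framework that draws on the cognitive dual-memory model and combines episodic and semantic memory mechanisms to improve large language model personalization.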

Details Motivation: Current LLM personalization methods lack a unified theoretical framework for systematically understanding what drives effective personalization, and suitable benchmark datasets are missing. Method: PRIME brings the cognitive dual-memory model into LLM personalization, mapping episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs, and adds a personalized thinking capability inspired by the slow thinking strategy. Result: Experiments validate PRIME's effectiveness in both long- and short-context scenarios, and the newly introduced Reddit Change My View (CMV) dataset provides a new way to evaluate long-context personalization. Conclusion: PRIME is a unified framework based on episodic and semantic memory mechanisms that handles personalization in both long- and short-context scenarios and captures dynamic personalization beyond mere popularity biases. Abstract: Large language model (LLM) personalization aims to align model outputs with individuals' unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME's effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.

[112] Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments

Matteo Guida,Yulia Otmakhova,Eduard Hovy,Lea Frermann

Main category: cs.CL

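TL;DR: This study presents a new computational framework for large-scale analysis of framing in news articles and reader comments, and finds that frame reuse in comments is consistent across news outlets while varying by topic.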

Details Motivation: Although the relationship between message content and audience response is well documented in the social sciences, NLP approaches usually detect frames in articles and responses in isolation and ignore this interaction. Method: Refines frame labels and develops a framework that reconstructs the dominant frames in articles and comments from sentence-level predictions and aligns articles with topically relevant comments. Result: Provides a method for large-scale framing analysis across news articles and reader comments, and finds that frame reuse in comments correlates highly across news outlets while topic-specific patterns vary. Conclusion: The paper proposes a computational framework for framing analysis across news articles and reader comments, and finds that frame reuse in comments correlates highly across outlets while topic-specific patterns differ. Abstract: When a news article describes immigration as an "economic burden" or a "humanitarian crisis," it selectively emphasizes certain aspects of the issue. Although framing shapes how the public interprets such issues, audiences do not absorb frames passively but actively reorganize the presented information. While this relationship between source content and audience response is well-documented in the social sciences, NLP approaches often ignore it, detecting frames in articles and responses in isolation. We present the first computational framework for large-scale analysis of framing across source content (news articles) and audience responses (reader comments). Methodologically, we refine frame labels and develop a framework that reconstructs dominant frames in articles and comments from sentence-level predictions, and aligns articles with topically relevant comments. Applying our framework across eleven topics and two news outlets, we find that frame reuse in comments correlates highly across outlets, while topic-specific patterns vary. We release a frame classifier that performs well on both articles and comments, a dataset of article and comment sentences manually labeled for frames, and a large-scale dataset of articles and comments with predicted frame labels.

[113] Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs

Swayamjit Saha

Main category: cs.CL

TL;DR: This paper introduces a lightweight and interpretable framework for correcting factual errors in LLM outputs using structured memory graphs, achieving promising results without retraining or fine-tuning.

Details Motivation: The motivation is to address the issue of factual errors or hallucinations in Large Language Models (LLMs), which are powerful but prone to generating incorrect information. Method: The method involves a post-processing step that utilizes structured memory graphs based on RDF triples to enable knowledge-aware self-correction of LLM outputs. Result: The framework shows promising results in correcting factual errors on simple factual prompts when demonstrated with DistilGPT-2. Conclusion: The paper concludes that their framework effectively corrects factual inconsistencies in LLM outputs using external semantic memory, without the need for retraining or fine-tuning. Abstract: Large Language Models (LLMs) are powerful yet prone to generating factual errors, commonly referred to as hallucinations. We present a lightweight, interpretable framework for knowledge-aware self-correction of LLM outputs using structured memory graphs based on RDF triples. Without retraining or fine-tuning, our method post-processes model outputs and corrects factual inconsistencies via external semantic memory. We demonstrate the approach using DistilGPT-2 and show promising results on simple factual prompts.
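
A minimal sketch of post-hoc correction against a triple store is shown below. It assumes the claim has already been extracted from the model output as a (subject, predicate, object) triple and that the memory is a plain dictionary keyed by (subject, predicate); the paper's actual extraction step and graph representation are not described in the abstract, and all names here are hypothetical.

```python
def correct_with_memory(text, claim, memory):
    """Replace a wrong object in `text` when the memory disagrees with the claim.

    claim  : (subject, predicate, object) triple extracted from the model output
    memory : dict mapping (subject, predicate) -> trusted object
    """
    subject, predicate, obj = claim
    expected = memory.get((subject, predicate))
    if expected is not None and expected != obj:
        # The memory holds a different object for this subject and predicate.
        return text.replace(obj, expected)
    return text   # unknown fact or already consistent: leave the output untouched


memory = {("Paris", "capital_of"): "France"}
fixed = correct_with_memory(
    "Paris is the capital of Germany.",
    ("Paris", "capital_of", "Germany"),
    memory,
)
```

Because the correction is a lookup rather than a model update, each change can be traced back to a specific triple, which is the kind of interpretability the summary refers to.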

[114] Put Teacher in Student's Shoes: Cross-Distillation for Ultra-compact Model Compression Framework

Maolin Wang,Jun Chu,Sicong Xie,Xiaoling Zang,Yao Zhao,Wenliang Zhong,Xiangyu Zhao

Main category: cs.CL

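TL;DR: This paper proposes EI-BERT, an efficient compression framework for NLP models that achieves an ultra-small footprint while maintaining performance, and has been deployed in a production recommendation system.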

Details Motivation: The mobile computing era requires deploying efficient NLP models in resource-constrained edge environments while meeting requirements for privacy compliance, real-time responsiveness, and multi-task capability. Method: Introduces the EI-BERT framework, which compresses models through hard token pruning, cross-distillation, and parameter quantization. Result: The experiments yield a BERT model of only 1.91 MB, which has been deployed in Alipay's Edge Recommendation system and serves 8.4 million daily active devices. Conclusion: The EI-BERT framework uses cross-distillation and related techniques to produce an ultra-compact BERT model that has been deployed across multiple scenarios in the Alipay ecosystem, with significant improvements in real-world applications. Abstract: In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model's perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB - the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay's live Edge Recommendation system since January 2024, currently serving the app's recommendation traffic across 8.4 million daily active devices.

[115] R1-RE: Cross-Domain Relationship Extraction with RLVR

Runpeng Dai,Tong Zheng,Run Yang,Hongtu Zhu

Main category: cs.CL

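TL;DR: This paper proposes R1-RE, a framework that uses reinforcement learning with verifiable rewards to substantially improve out-of-domain generalization in relationship extraction.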

Details Motivation: Traditional supervised learning methods perform poorly on out-of-domain data, so a way to improve model generalization is needed. Method: Introduces R1-RE, a relationship extraction framework based on reinforcement learning with verifiable reward (RLVR), which recasts relationship extraction as a reasoning task guided by annotation guidelines. Result: The R1-RE-7B model reaches an average out-of-domain accuracy of about 70% on the Sem-2010 and MDKG datasets, on par with leading models such as GPT-4o. Conclusion: The R1-RE framework shows strong out-of-domain generalization in relationship extraction and provides insights into training dynamics and reasoning behaviors. Abstract: Relationship extraction (RE) is a core task in natural language processing. Traditional approaches typically frame RE as a supervised learning problem, directly mapping context to labels, an approach that often suffers from poor out-of-domain (OOD) generalization. Inspired by the workflow of human annotators, we reframe RE as a reasoning task guided by annotation guidelines and introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE tasks. Our method elicits the reasoning abilities of small language models for annotation tasks, resulting in significantly improved OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with leading proprietary models such as GPT-4o. Additionally, our comprehensive analysis provides novel insights into the training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.
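
A verifiable reward for relation labels can be as simple as an exact-match check on the final answer. The sketch below assumes the model is prompted to wrap its label in `<answer>` tags; that output format, the binary reward, and the function name are assumptions for illustration, not details taken from the paper.

```python
import re

def relation_reward(completion, gold_label):
    """Return 1.0 when the tagged answer matches the gold relation, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0   # no parsable answer means no reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_label.strip().lower() else 0.0
```

Because the reward depends only on the final label, the reasoning text before the tag is left unconstrained, which is what lets the policy develop its own guideline-following reasoning during training.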

[116] XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

Yifu Liu,Yin Zhu,Yingqi Gao,Zhiling Luo,Xiaoxia Li,Xiaorong Shi,Yuntao Hong,Jinyang Gao,Yu Li,Bolin Ding,Jingren Zhou

Main category: cs.CL

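TL;DR: This paper proposes XiYan-SQL, a framework that generates and exploits multiple SQL candidates to address the challenges of the Text-to-SQL task, and achieves strong results on several benchmarks.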

Details Motivation: To leverage the strengths of LLMs in addressing the challenges of the Text-to-SQL task. Method: Builds a framework consisting of a Schema Filter module, a multi-generator ensemble, and a selection model, and uses a multi-task fine-tuning strategy to strengthen SQL generation. Result: Experiments show that XiYan-SQL achieves a new SOTA of 75.63% on the BIRD benchmark and an accuracy of 89.65% on the Spider test set. Conclusion: The XiYan-SQL framework performs strongly on the Text-to-SQL task and achieves SOTA performance. Abstract: To leverage the advantages of LLMs in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework effectively generating and utilizing multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.
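
To make the "many candidates, one answer" step concrete, the sketch below picks a candidate by majority vote over execution results using Python's standard `sqlite3` module. This is a common self-consistency baseline and not XiYan-SQL's learned selection model; the function name is hypothetical.

```python
import sqlite3
from collections import Counter

def pick_by_execution_vote(conn, candidates):
    """Return the first candidate whose result set is the most common one."""
    results = {}
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue   # drop candidates that fail to execute
        # Sort rows so that queries differing only in row order agree.
        results[sql] = tuple(sorted(rows, key=repr))
    if not results:
        return None
    winner, _ = Counter(results.values()).most_common(1)[0]
    return next(sql for sql in candidates if results.get(sql) == winner)
```

Voting only helps when the generators are diverse, which is presumably why the framework fine-tunes several generators on different SQL formats before selection.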

[117] Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce

Arnav Attri,Anuj Attri,Pushpak Bhattacharyya,Suman Banerjee,Amey Patil,Muthusamy Chelliah,Nikesh Garera

Main category: cs.CL

TL;DR: This paper introduces EOT-X, a new dataset for emotion and opinion trigger analysis in e-commerce reviews, alongside the EOT-DETECT framework, which improves performance over existing techniques.

Details Motivation: There is a gap in existing research regarding the joint analysis of emotion detection and explanatory span identification in e-commerce reviews, which limits the understanding of what drives customer emotions. Method: A joint task called EOT (Emotion detection and Opinion Trigger extraction) was proposed, along with a novel dataset named EOT-X containing 2,400 annotated reviews. The structured prompting framework EOT-DETECT was developed and evaluated against 23 LLMs. Result: The EOT-DETECT framework demonstrated superior performance compared to zero-shot and chain-of-thought methods across multiple e-commerce domains. Conclusion: The proposed EOT-DETECT framework outperforms zero-shot and chain-of-thought techniques across e-commerce domains, contributing to the understanding of emotional triggers in customer reviews. Abstract: Customer reviews on e-commerce platforms capture critical affective signals that drive purchasing decisions. However, no existing research has explored the joint task of emotion detection and explanatory span identification in e-commerce reviews - a crucial gap in understanding what triggers customer emotional responses. To bridge this gap, we propose a novel joint task unifying Emotion detection and Opinion Trigger extraction (EOT), which explicitly models the relationship between causal text spans (opinion triggers) and affective dimensions (emotion categories) grounded in Plutchik's theory of 8 primary emotions. In the absence of labeled data, we introduce EOT-X, a human-annotated collection of 2,400 reviews with fine-grained emotions and opinion triggers. We evaluate 23 Large Language Models (LLMs) and present EOT-DETECT, a structured prompting framework with systematic reasoning and self-reflection. Our framework surpasses zero-shot and chain-of-thought techniques, across e-commerce domains.

[118] LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework

Zecheng Tang,Haitian Wang,Quantong Qiu,Baibei Ji,Ruoxi Sun,Keyan Zhou,Juntao Li,Min Zhang

Main category: cs.CL

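TL;DR: This paper proposes LOOM-Scope to address inconsistent evaluation and high computational cost in assessing the long-context abilities of large language models.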

Details Motivation: Existing long-context benchmarks use different settings, which leads to inconsistent results, and their high computational cost makes broad model evaluation difficult. Method: Proposes LOOM-Scope, which standardizes evaluation settings, supports efficient long-context inference acceleration methods, and introduces a lightweight, holistic benchmark suite. Result: Delivers a unified, efficient long-context evaluation framework that better supports the community in evaluating long-context models comprehensively. Conclusion: LOOM-Scope provides an efficient and comprehensive long-context evaluation framework that resolves the inconsistent settings and high computational cost of existing benchmarks. Abstract: Long-context processing has become a fundamental capability for large language models (LLMs). To assess a model's long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Besides, the high computational cost of long-context evaluation poses a significant barrier for the community to conduct comprehensive assessments of long-context models. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: https://loomscope.github.io

[119] "This Suits You the Best": Query Focused Comparative Explainable Summarization

Arnav Attri,Anuj Attri,Pushpak Bhattacharyya,Suman Banerjee,Amey Patil,Muthusamy Chelliah,Nikesh Garera

Main category: cs.CL

TL;DR: This paper proposes a new method, Query-Focused Comparative Explainable Summaries (QF-CES), for generating comparative product recommendations using Multi-Source Opinion Summarization (M-OS), which significantly reduces inference time and aligns closely with human evaluations.

Details Motivation: Traditional opinion summarization fails to provide holistic comparative insights in product recommendations. There is a need for query-focused recommendation datasets and methods that can generate explainable, personalized, and privacy-preserving comparative summaries. Method: The study introduces the task of Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). The researchers created a dataset called MS-Q2P containing queries mapped to products and used Large Language Models (LLMs) to generate comparative summaries. They evaluated open-source and proprietary LLMs and introduced QF-CES-PROMPT for evaluation across five dimensions. Result: The approach reduced inference latency by approximately 40% compared to the direct input approach. Evaluations using QF-CES-PROMPT showed an average Spearman correlation of 0.74 with human judgments, indicating its effectiveness in evaluating QF-CES. Conclusion: The proposed QF-CES method using M-OS offers an effective, efficient, and adaptable solution for generating comparative product summaries that align well with human judgments, while also being recommendation engine-agnostic and category-agnostic. Abstract: Product recommendations inherently involve comparisons, yet traditional opinion summarization often fails to provide holistic comparative insights. We propose the novel task of generating Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). To address the lack of query-focused recommendation datasets, we introduce MS-Q2P, comprising 7,500 queries mapped to 22,500 recommended products with metadata. We leverage Large Language Models (LLMs) to generate tabular comparative summaries with query-specific explanations. Our approach is personalized, privacy-preserving, recommendation engine-agnostic, and category-agnostic. 
M-OS as an intermediate step reduces inference latency approximately by 40% compared to the direct input approach (DIA), which processes raw data directly. We evaluate open-source and proprietary LLMs for generating and assessing QF-CES. Extensive evaluations using QF-CES-PROMPT across 5 dimensions (clarity, faithfulness, informativeness, format adherence, and query relevance) showed an average Spearman correlation of 0.74 with human judgments, indicating its potential for QF-CES evaluation.
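
For readers unfamiliar with the metric, the Spearman correlation reported above is the Pearson correlation of the two score lists after converting them to ranks. The sketch below is a plain-Python illustration with average ranks for ties; the function names are ours, and the digest does not describe how the authors computed their value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5
```

A rank-based measure fits this use because LLM-assigned scores and human ratings may sit on different scales while still ordering summaries the same way.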

[120] Word stress in self-supervised speech models: A cross-linguistic comparison

Martijn Bentum,Louis ten Bosch,Tomas O. Lentz

Main category: cs.CL

TL;DR: This paper shows that self-supervised speech models like Wav2vec 2.0 learn language-specific word stress patterns, with clear distinctions between variable and fixed stress languages.

Details Motivation: The motivation of this study is to understand how self-supervised speech models capture word stress patterns across different languages, particularly focusing on whether these representations are language-specific or generalizable. Method: The authors used the Wav2vec 2.0 model to investigate word stress representations across five languages (Dutch, English, German, Hungarian, and Polish). They trained diagnostic stress classifiers on S3M embeddings and tested their performance in distinguishing stressed from unstressed syllables. Result: The diagnostic stress classifiers achieved high accuracy in identifying stressed and unstressed syllables. The results revealed that word stress representations are language-specific, showing greater differences between languages with variable stress versus those with fixed stress. Conclusion: The paper concludes that self-supervised speech models (S3M) learn language-specific word stress representations, with a notable distinction between languages with variable and fixed stress patterns. Abstract: In this paper we study word stress representations learned by self-supervised speech models (S3M), specifically the Wav2vec 2.0 model. We investigate the S3M representations of word stress for five different languages: Three languages with variable or lexical stress (Dutch, English and German) and two languages with fixed or demarcative stress (Hungarian and Polish). We train diagnostic stress classifiers on S3M embeddings and show that they can distinguish between stressed and unstressed syllables in read-aloud short sentences with high accuracy. We also tested language-specificity effects of S3M word stress. The results indicate that the word stress representations are language-specific, with a greater difference between the set of variable versus the set of fixed stressed languages.

[121] A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

Juan Moreno Gonzalez,Bashar Alhafni,Nizar Habash

Main category: cs.CL

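TL;DR: This paper introduces a new method for transliterating Judeo-Arabic into Arabic script and evaluates how useful the result is for downstream NLP tasks.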

Details Motivation: Judeo-Arabic is written in Hebrew script, which makes it hard to process with Arabic NLP tools, so it needs to be transliterated into Arabic script. Method: A two-step approach that combines character-level mapping with post-correction to address grammatical and orthographic errors. Result: The paper presents the first benchmark evaluation of LLMs on this task and demonstrates that the transliteration is feasible. Conclusion: The paper proposes a two-step automatic transliteration method that lets Arabic NLP tools process Judeo-Arabic text. Abstract: Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew and Aramaic. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts.
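
The two-step approach in the abstract (character-level mapping, then post-correction) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `CHAR_MAP` table is a small hand-picked subset of common letter correspondences, and `post_correct` is a hypothetical hook standing in for whatever correction model is used. Letters with ambiguous mappings are deliberately left out.

```python
from typing import Callable, Dict

# Illustrative subset only (an assumption, not the paper's table).
# Each entry maps one Hebrew letter to one Arabic letter.
CHAR_MAP: Dict[str, str] = {
    "\u05D0": "\u0627",  # alef      -> alif
    "\u05D1": "\u0628",  # bet       -> ba
    "\u05DB": "\u0643",  # kaf       -> kaf
    "\u05DC": "\u0644",  # lamed     -> lam
    "\u05DE": "\u0645",  # mem       -> mim
    "\u05DD": "\u0645",  # final mem -> mim
    "\u05EA": "\u062A",  # tav       -> ta
}


def map_characters(text: str) -> str:
    """Step 1: replace each known Hebrew letter; leave other characters as-is."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def transliterate(text: str,
                  post_correct: Callable[[str], str] = lambda s: s) -> str:
    """Step 2 is a pluggable callable; the identity default corrects nothing."""
    return post_correct(map_characters(text))


word = "\u05DB\u05EA\u05D0\u05D1"  # Hebrew-script spelling of "kitab" (book)
print(transliterate(word))
```

Keeping the correction step as a separate callable mirrors the split described in the abstract: the table handles unambiguous letters cheaply, while context-dependent decisions are deferred to the second stage.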

[122] LLMs as Architects and Critics for Multi-Source Opinion Summarization

Anuj Attri,Arnav Attri,Pushpak Bhattacharyya,Suman Banerjee,Amey Patil,Muthusamy Chelliah,Nikesh Garera

Main category: cs.CL

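TL;DR: This paper introduces M-OS, a new approach to multi-source opinion summarization, and M-OS-EVAL, a new evaluation dataset; the approach significantly improves user experience and engagement.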

Details Motivation: Multi-source opinion summarization has lacked evaluation datasets, and the potential of large language models in this area remains largely unexplored. Method: The paper introduces the M-OS-EVAL dataset and M-OS-PROMPTS, and validates them through a user study and experiments. Result: 87% of participants preferred M-OS summaries, and M-OS-PROMPTS achieved an average Spearman correlation of 0.74. Conclusion: M-OS-EVAL is a benchmark dataset for evaluating multi-source opinion summaries; M-OS significantly improves user engagement, and M-OS-PROMPTS aligns more strongly with human judgment. Abstract: Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion summarization by incorporating additional sources of product metadata such as descriptions, key features, specifications, and ratings, alongside reviews. This integration results in comprehensive summaries that capture both subjective opinions and objective product attributes essential for informed decision-making. While Large Language Models (LLMs) have shown significant success in various Natural Language Processing (NLP) tasks, their potential in M-OS remains largely unexplored. Additionally, the lack of evaluation datasets for this task has impeded further advancements. To bridge this gap, we introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity. Our results demonstrate that M-OS significantly enhances user engagement, as evidenced by a user study in which, on average, 87% of participants preferred M-OS over opinion summaries. Our experiments demonstrate that factually enriched summaries enhance user engagement. Notably, M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an average Spearman correlation of $\rho = 0.74$, which surpasses the performance of previous methodologies.
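
Since the abstract reports alignment with human judgment as a Spearman correlation, a small sketch of how that statistic is computed may help. The scores below are made up for illustration and are not from the paper; the function names are our own. Spearman's coefficient is the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


human_scores = [5, 3, 4, 1, 2]             # hypothetical human ratings
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.1]  # hypothetical automatic scores
print(round(spearman(human_scores, metric_scores), 3))
```

Because only ranks matter, the automatic metric does not need to be on the same scale as the human ratings; it only needs to order the summaries similarly.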

[123] CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

Hang Lv,Sheng Liang,Hao Wang,Hongchao Gu,Yaxiong Wu,Wei Guo,Defu Lian,Yong Liu,Enhong Chen

Main category: cs.CL

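TL;DR: This paper proposes CoSteer, a new framework in which delta steering signals produced by a small on-device model help a cloud-based large language model generate personalized text in real time while protecting user privacy.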

Details Motivation: Existing personalized text generation methods often rely on centralized fine-tuning or static preference alignment and struggle to adapt in real time on resource-constrained personal devices. This creates a dilemma: large cloud-based models cannot access local user-specific information, while small local models cannot match the generation quality of cloud models. Method: CoSteer uses the logits difference between the personal context-aware and context-agnostic outputs of a small local model as a steering signal, and formulates token-level optimization as an online learning problem. Result: Experiments show that CoSteer effectively helps LLMs generate personalized content by leveraging locally stored user profiles and histories, while preserving privacy through on-device data processing and keeping computational overhead acceptable. Conclusion: CoSteer is a new collaborative framework that achieves decoding-time personalization through local delta steering, effectively helping large language models generate personalized content. Abstract: Personalized text generation has become crucial for adapting language models to diverse and evolving users' personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM's logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs' general capabilities without fine-tuning. 
Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.
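
The core idea of delta steering can be illustrated with a deliberately simplified sketch. The abstract describes an online-learning formulation; the version below replaces that with a single fixed additive update, so it should be read as an intuition aid rather than the CoSteer algorithm. The `alpha` weight, the function names, and the toy logits are all assumptions, and the three logit vectors are assumed to share one vocabulary.

```python
from typing import List


def steer(cloud_logits: List[float],
          local_personal: List[float],
          local_agnostic: List[float],
          alpha: float = 1.0) -> List[float]:
    """Shift the cloud model's logits by the local model's personalization delta.

    delta = (local logits with personal context) - (local logits without it).
    alpha is a hypothetical steering strength, not a parameter from the paper.
    """
    return [c + alpha * (p - a)
            for c, p, a in zip(cloud_logits, local_personal, local_agnostic)]


def greedy_token(logits: List[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy three-token vocabulary.
cloud = [2.0, 1.5, 0.1]           # large model, no personal context
local_personal = [0.2, 1.4, 0.0]  # small model, with user profile
local_agnostic = [0.3, 0.2, 0.1]  # small model, without user profile

steered = steer(cloud, local_personal, local_agnostic)
print(greedy_token(cloud), greedy_token(steered))
```

In this toy case the personal context raises the local model's preference for token 1, and adding that difference is enough to change the large model's greedy choice, while the user profile itself never has to be sent to the cloud model.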

[124] Reason to Rote: Rethinking Memorization in Reasoning

Yupei Du,Philipp Mondorf,Silvia Casola,Yuekun Yao,Robert Litschko,Barbara Plank

Main category: cs.CL

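TL;DR: This study investigates how large language models memorize label noise and shows that this memorization relies on generalizable reasoning mechanisms rather than overriding them.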

Details Motivation: To understand why language models retain good reasoning ability even while memorizing label noise. Method: Two controllable synthetic reasoning datasets with noisy labels (four-digit addition and two-hop relational reasoning) are used to analyze the relationship between memorization and reasoning mechanisms. Result: Models still compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning impairs memorization; memorization is implemented through distributed encoding rather than a direct look-up from inputs to noisy labels; in addition, memorization occurs via outlier heuristics, in which existing neuron activation patterns are slightly shifted to fit the noisy labels. Conclusion: Memorization of label noise in language models builds on, rather than overrides, their reasoning mechanisms, which explains why memorized noise does not severely harm reasoning ability. Abstract: Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
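
To make the experimental setup concrete, here is a rough sketch of what a four-digit addition dataset with controllable label noise could look like. The abstract does not say how noise is injected, so replacing the correct sum with a uniformly random wrong one, the `noise_rate` parameter, and the prompt format are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int, bool]  # (prompt, label, is_noisy)


def make_fda_dataset(n: int, noise_rate: float, seed: int = 0) -> List[Example]:
    """Build n four-digit addition examples; a fraction get a wrong label."""
    rng = random.Random(seed)
    data: List[Example] = []
    for _ in range(n):
        a = rng.randint(1000, 9999)
        b = rng.randint(1000, 9999)
        label = a + b
        is_noisy = rng.random() < noise_rate
        if is_noisy:
            # Assumed noise model: any other value in the valid sum range.
            wrong = label
            while wrong == label:
                wrong = rng.randint(2000, 19998)
            label = wrong
        data.append((f"{a}+{b}=", label, is_noisy))
    return data


for prompt, label, is_noisy in make_fda_dataset(5, noise_rate=0.4, seed=1):
    print(prompt, label, "(noisy)" if is_noisy else "")
```

Keeping the `is_noisy` flag alongside each example is what makes the dataset "controllable": analyses can compare model behavior on clean and corrupted instances separately.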

[125] A Survey of Pun Generation: Datasets, Evaluations and Methodologies

Yuchen Su,Yonghua Zhu,Ruofan Wang,Zijian Huang,Diana Benavides-Prado,Michael Witbrock

Main category: cs.CL

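TL;DR: This paper fills the gap left by the absence of a dedicated survey on pun generation in computational linguistics by reviewing pun generation datasets and methods across different stages, summarizing evaluation metrics, and proposing future research directions.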

Details Motivation: To fill the gap left by the lack of a dedicated, comprehensive survey of pun generation and to provide a systematic review. Method: The paper comprehensively reviews pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models. It also summarizes the automated and human evaluation metrics used to assess the quality of pun generation. Result: A comprehensive review of pun generation research covering different methodologies and techniques, together with a discussion of research challenges and promising directions for future work. Conclusion: Pun generation has received considerable attention in computational linguistics, but until now there has been no dedicated survey that systematically reviews the area. Abstract: Pun generation seeks to creatively modify linguistic elements in text to produce humour or evoke double meanings. It also aims to preserve coherence and contextual appropriateness, making it useful in creative writing and entertainment across various media and contexts. Although pun generation has received considerable attention in computational linguistics, there is currently no dedicated survey that systematically reviews this specific area. To bridge this gap, this paper provides a comprehensive review of pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models. Additionally, we summarise both automated and human evaluation metrics used to assess the quality of pun generation. Finally, we discuss the research challenges and propose promising directions for future work.

[126] Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems

Quang-Vinh Nguyen,Quang-Chieu Nguyen,Hoang Pham,Khac-Hoai Nam Bui

Main category: cs.CL

TL;DR: This paper proposes Spec-TOD, a task-oriented dialogue framework that achieves strong performance with only a small amount of labeled data.

Details Motivation: Task-oriented dialogue systems perform poorly in low-resource scenarios because labeled data is limited, so a new approach that reduces data dependence is needed. Method: The paper proposes the Spec-TOD framework, which comprises an end-to-end TOD framework incorporating explicit task instructions and an efficient training strategy that leverages lightweight, specialized large language models (LLMs). Result: Experiments on the MultiWOZ dataset show that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. Conclusion: Spec-TOD is an efficient and effective task-oriented dialogue framework for low-resource scenarios, with the potential to reduce reliance on labeled data. Abstract: Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved the performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.

[127] Dialogue-Based Multi-Dimensional Relationship Extraction from Novels

Yuchen Yan,Hanjie Zhao,Senbin Zhu,Hongde Liu,Zhihong Zhang,Yuxiang Jia

Main category: cs.CL

TL;DR: This paper proposes an enhanced method for extracting character relationships in novels using LLMs and specialized strategies, achieving superior performance while addressing resource scarcity with a new dataset.

Details Motivation: Relation extraction is crucial in NLP, but the complexity and implicit expressions in novel texts pose significant challenges for automated character relationship extraction. Method: The method incorporates relationship dimension separation, dialogue data construction, and contextual learning strategies to enhance extraction performance, leveraging dialogue structure information for better understanding of implicit relationships. Result: The proposed method outperforms traditional baselines across multiple evaluation metrics and facilitates automated construction of character relationship networks in novels. A high-quality Chinese novel relation extraction dataset was also constructed. Conclusion: The study successfully enhances relation extraction in novel texts using Large Language Models (LLMs) with strategies like relationship dimension separation and dialogue data construction, demonstrating adaptability and improved performance over traditional methods. Abstract: Relation extraction is a crucial task in natural language processing, with broad applications in knowledge graph construction and literary analysis. However, the complex context and implicit expressions in novel texts pose significant challenges for automatic character relationship extraction. This study focuses on relation extraction in the novel domain and proposes a method based on Large Language Models (LLMs). By incorporating relationship dimension separation, dialogue data construction, and contextual learning strategies, the proposed method enhances extraction performance. Leveraging dialogue structure information, it improves the model's ability to understand implicit relationships and demonstrates strong adaptability in complex contexts. Additionally, we construct a high-quality Chinese novel relation extraction dataset to address the lack of labeled resources and support future research. 
Experimental results show that our method outperforms traditional baselines across multiple evaluation metrics and successfully facilitates the automated construction of character relationship networks in novels.

[128] Grahak-Nyay: Consumer Grievance Redressal through Large Language Models

Shrey Ganatra,Swapnil Bhattacharyya,Harshvivek Kashid,Spandan Anaokar,Shruti Nair,Reshma Sekhar,Siddharth Manohar,Rahul Hemrajani,Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: The paper presents Grahak-Nyay, a chatbot designed to help consumers in India seek grievance redressal.

Details Motivation: In India, consumer grievance redressal is often hindered by procedural complexity, legal jargon, and jurisdictional challenges. Method: The authors built a chatbot based on LLMs and RAG and created several datasets to evaluate its performance. Result: The resulting chatbot, Grahak-Nyay, simplifies legal complexities, and the authors propose new HAB evaluation metrics. Conclusion: Grahak-Nyay is a promising solution that can improve Indian consumers' access to legal redress. Abstract: Access to consumer grievance redressal in India is often hindered by procedural complexity, legal jargon, and jurisdictional challenges. To address this, we present Grahak-Nyay (Justice-to-Consumers), a chatbot that streamlines the process using open-source Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Grahak-Nyay simplifies legal complexities through a concise and up-to-date knowledge base. We introduce three novel datasets: GeneralQA (general consumer law), SectoralQA (sector-specific knowledge) and SyntheticQA (for RAG evaluation), along with NyayChat, a dataset of 300 annotated chatbot conversations. We also introduce Judgments data sourced from Indian Consumer Courts to aid the chatbot in decision making and to enhance user trust. We also propose HAB metrics (Helpfulness, Accuracy, Brevity) to evaluate chatbot performance. Legal domain experts validated Grahak-Nyay's effectiveness. Code and datasets will be released.

[129] Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions

Christos Vlachos,Nikolaos Stylianou,Alexandra Fiotaki,Spiros Methenitis,Elisavet Palogiannidi,Themos Stafylakis,Ion Androutsopoulos

Main category: cs.CL

TL;DR: This paper proposes a method to automatically generate synthetic conversational QA data from plain text documents, enabling better training of question rewriters and response systems.

Details Motivation: The motivation stems from the difficulty in obtaining domain-specific OR-CONVQA training datasets for real-world applications, necessitating an automated approach to produce realistic data. Method: A pipeline was proposed to generate synthetic OR-CONVQA dialogs using abundant plain text documents. This involved generating in-dialog question-answer pairs, decontextualized questions, and propositions grounded in document content. Result: The proposed method successfully produced synthetic dialogs resembling human-annotated datasets, which were shown to improve the training of efficient question rewriters and system response generation. Conclusion: The study concludes that synthetic dialogs can be effectively used to train question rewriters, enabling the use of existing retrievers and enhancing the performance of system responses through LLMs. Abstract: We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. Similarly to real-world human-annotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions (sentences expressing prominent information from the documents) the system responses are grounded in. We show how the synthetic dialogs can be used to train efficient question rewriters that decontextualize user questions, allowing existing dialog-unaware retrievers to be utilized. 
The retrieved information and the decontextualized question are then passed on to an LLM that generates the system's response.

[130] Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

A. Bochkov

Main category: cs.CL

TL;DR: This paper challenges the assumption that trainable input embeddings are essential for semantic representation in large language models. By using fixed visual embeddings derived from Unicode glyphs, the researchers show that high-level semantics emerge from the model's architecture and data scale, not from the embeddings themselves.

Details Motivation: The motivation is to understand the locus of semantic representation in large language models (LLMs) for interpretability and architectural innovation, challenging the paradigm that trainable input embeddings are foundational 'meaning vectors.' Method: The researchers constructed Transformer models with a frozen embedding layer, using non-semantic visual embeddings derived from Unicode glyphs, and introduced a new Unicode-centric tokenizer for universal text coverage. Result: The models with frozen visual embeddings converged, generated coherent text, and outperformed identical models with trainable embeddings on the MMLU reasoning benchmark. Conclusion: The study concludes that high-level semantics are not inherent to input embeddings but emerge from the Transformer's architecture and data scale, reframing the role of embeddings as structural primitives rather than meaning containers. Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. 
Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.

[131] O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d'apprentissage pour l'adaptation de modèles de langue à un domaine

Ismaël Rousseau,Claire Perroux,Pierre Adam,Thomas Girault,Lionel Delphin-Poulat,Morgan Veyret,Gwénolé Lecorvé,Géraldine Damnati

Main category: cs.CL

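TL;DR: The O_FT team, a joint effort of Orange and Ouest-France, adapted Mistral-7B-Instruct-v0.3 to the defense domain using continued pre-training and instruction tuning, improving domain knowledge and task-handling ability while keeping the carbon footprint manageable.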

Details Motivation: To better adapt language models to the defense domain, as required by the EvalLLM2025 challenge. Method: The Mistral-7B-Instruct-v0.3 model was adapted through continued pre-training and instruction tuning, with a focus on collecting, generating, and selecting relevant data. Result: Experiments show that the adapted models have clearly improved defense-domain knowledge and task-handling ability while maintaining comparable or even better performance on general knowledge. Conclusion: The study demonstrates the feasibility of domain adaptation for relatively small language models and shows significant improvements on domain-specific tasks and knowledge. Abstract: This paper presents the work carried out by the O_FT team, joint with Orange and Ouest-France, on adapting language models to the defense domain as part of the EvalLLM2025 challenge. This work focused on adapting the Mistral-7B-Instruct-v0.3 model using classical techniques of continued pre-training and instruction-tuning. The core of our efforts is based on collecting, generating, and selecting data for these two stages as well as for model evaluation. Experiments show that our adapted models have better domain-specific knowledge and improved domain-specific task processing skills, along with comparable (or even superior) performance on general knowledge and skills. Considering the carbon footprint of our adaptations, this work demonstrates the feasibility of domain adaptation for relatively small models. -- Ce document présente les travaux réalisés par l'équipe O_FT conjointe à Orange et Ouest-France sur l'adaptation de modèles de langue au domaine de la défense dans le cadre du challenge EvalLLM2025. Ces travaux se sont concentrés sur l'adaptation du modèle Mistral-7B-Instruct-v0.3 avec des techniques classiques de poursuite du pré-entraînement et d'affinage sur instructions. L'essentiel de nos travaux a porté sur la constitution, génération et sélection de données pour ces deux étapes ainsi que pour l'évaluation des modèles. Les expériences montrent que nos modèles adaptés ont de meilleures connaissances de fond et une meilleure capacité de traitement de tâches sur le domaine de la défense, ainsi que des performances comparables (voire supérieures) sur des connaissances ou capacités généralistes. 
Mis au regard des empreintes carbones de nos adaptations, ces travaux d\'emontrent ainsi la viabilit\'e de l'adaptation \`a un domaine de mod\`eles relativement petits.

[132] SIGIR 2025 -- LiveRAG Challenge Report

David Carmel,Simone Filice,Guy Horowitz,Yoelle Maarek,Oren Somekh,Ran Tavory

Main category: cs.CL

TL;DR: The LiveRAG Challenge at SIGIR 2025 advanced RAG technologies through a global competition with rigorous evaluation methods.

Details Motivation: To advance RAG technology by providing a platform for comparing different retrieval and prompting strategies under standardized conditions. Method: A competitive challenge inviting participants to develop RAG-based systems using a fixed corpus and open-source LLM with multi-stage evaluation. Result: 70 teams from 27 countries participated, answering 500 unseen questions. Evaluation used both automated scoring and manual review of top submissions. Conclusion: The LiveRAG Challenge successfully fostered innovation and competition in RAG technologies, highlighting effective retrieval and prompting strategies. Abstract: The LiveRAG Challenge at SIGIR 2025, held between March and May 2025, provided a competitive platform for advancing Retrieval-Augmented Generation (RAG) technologies. Participants from academia and industry were invited to develop a RAG-based question-answering system using a fixed corpus (Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct). The goal was to facilitate challenging comparisons of retrieval and prompting strategies. During the Live Challenge Day, 70 teams from 27 different countries provided answers and supportive information to 500 unseen questions within a strict two-hour time window. Evaluation was conducted in two stages: first, an automated LLM-as-a-judge approach was used to compute correctness and faithfulness scores; then a manual review of top-ranked submissions was conducted. The finalists were announced on June 12, 2025, with prizes awarded during the LiveRAG Workshop at SIGIR 2025 in Padua, Italy.

[133] ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Chenchen Zhang,Yuhang Li,Can Xu,Jiaheng Liu,Ao Liu,Shihui Hu,Dengpeng Wu,Guanhua Huang,Kejiao Li,Qi Yi,Ruibin Xiong,Haotian Zhu,Yuanxing Zhang,Yuhao Jiang,Yue Zhang,Zenan Xu,Bohui Zhai,Guoxiang He,Hebin Li,Jie Zhao,Le Zhang,Lingyun Tan,Pengyu Guo,Xianshu Pang,Yang Ruan,Zhifeng Zhang,Zhonghu Wang,Ziyan Xu,Zuopu Yin,Wiggin Zhou,Chayse Zhou,Fengzong Lian

Main category: cs.CL

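TL;DR: ArtifactsBench is an automated evaluation framework for visual code generation that captures the behavior of generated artifacts through temporal screenshots and uses an MLLM-as-Judge to provide holistic evaluation.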

Details Motivation: ArtifactsBench was proposed because existing benchmarks focus only on algorithmic correctness and cannot evaluate visual fidelity or interactive integrity. Method: Each generated artifact is rendered programmatically, its dynamic behavior is captured through temporal screenshots, and an MLLM-as-Judge then scores it against a fine-grained checklist. Result: ArtifactsBench achieved 94.4% ranking consistency with WebDev Arena and over 90% pairwise agreement with human experts. Conclusion: ArtifactsBench is the first framework able to automate human-centered quality assessment at scale, and it shows that generalist models can outperform domain-specific ones. Abstract: The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. 
We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
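
The abstract reports ranking consistency and pairwise agreement without defining them, so the following is only one plausible reading: the fraction of model pairs that two scoring sources order the same way. The function name, the tie handling (ties count as disagreement), and the scores are all hypothetical.

```python
from itertools import combinations
from typing import Dict


def pairwise_agreement(scores_a: Dict[str, float],
                       scores_b: Dict[str, float]) -> float:
    """Fraction of model pairs ranked in the same order by both sources.

    Only models present in both dicts are compared; at least two are needed.
    A pair counts as agreeing only if both sources strictly prefer the same model.
    """
    models = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(models, 2))
    agree = sum(
        1 for m, n in pairs
        if (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n]) > 0
    )
    return agree / len(pairs)


# Made-up numbers: automated judge scores vs. human-preference ratings.
auto_scores = {"m1": 0.8, "m2": 0.6, "m3": 0.5, "m4": 0.3}
human_scores = {"m1": 1200, "m2": 1100, "m3": 1150, "m4": 900}
print(round(pairwise_agreement(auto_scores, human_scores), 3))
```

Here the two sources disagree only on the order of `m2` and `m3`, so five of the six pairs match.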

[134] Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification

Chenfei Xiong,Jingwei Ni,Yu Fan,Vilém Zouhar,Donya Rooein,Lorena Calvo-Bartolomé,Alexander Hoyle,Zhijing Jin,Mrinmaya Sachan,Markus Leippold,Dirk Hovy,Mennatallah El-Assady,Elliott Ash

Main category: cs.CL

TL;DR: Co-DETECT is a collaborative framework combining human expertise and large language models to identify and handle edge cases in text classification, improving annotation accuracy through iterative refinement of codebooks.

Details Motivation: The motivation behind Co-DETECT is to address edge cases in text classification that are not well described by initial codebooks, leveraging both human expertise and large language models to improve annotation accuracy and handling of nuanced phenomena. Method: Co-DETECT uses an initial codebook provided by a domain expert, leverages LLMs for data annotation and identification of edge cases, flags challenging examples, induces high-level descriptions of these edge cases, and assists users in incorporating rule-based handling into the codebook iteratively. Result: An extensive user study, together with qualitative and quantitative analyses, demonstrates the effectiveness of Co-DETECT in discovering and addressing edge cases in text classification. Conclusion: Co-DETECT successfully integrates human expertise with LLM-guided annotation to handle edge cases in text classification, enabling more effective handling of nuanced phenomena through compact and generalizable annotation rules. Abstract: We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists users in incorporating edge-case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study, together with qualitative and quantitative analyses, demonstrates the effectiveness of Co-DETECT.

[135] Verified Language Processing with Hybrid Explainability: A Technical Report

Oliver Robert Fox,Giacomo Bergami,Graham Morgan

Main category: cs.CL

TL;DR: This paper presents a novel hybrid explainable pipeline using graphs and logic to improve information retrieval accuracy and transparency by creating machine- and human-readable First-Order Logic representations.

Details Motivation: Current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. Classifiers exploiting generative language models with logical prompts also fail to correctly distinguish between logical implication, indifference, and inconsistency. Method: Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Result: Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. The proposed method outperforms state-of-the-art models. Conclusion: The proposed method offers a step toward more transparent and reliable Information Retrieval from extensive textual data. Abstract: The volume and diversity of digital information have led to a growing reliance on Machine Learning techniques, such as Natural Language Processing, for interpreting and accessing appropriate data. While vector and graph embeddings represent data for similarity tasks, current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. These considerations can also be applied to classifiers exploiting generative language models with logical prompts, which fail to correctly distinguish between logical implication, indifference, and inconsistency, despite being explicitly trained to recognise the first two classes. We present a novel pipeline designed for hybrid explainability to address this. Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. 
To the best of our knowledge, this is the first approach to differentiate between implication, inconsistency, and indifference for text classification tasks. To address the limitations of existing approaches, we use three self-contained datasets annotated for the former classification task to determine the suitability of these approaches in capturing sentence structure equivalence, logical connectives, and spatiotemporal reasoning. We also use these data to compare the proposed method with language models pre-trained for detecting sentence entailment. The results show that the proposed method outperforms state-of-the-art models, indicating that natural language understanding cannot be easily generalised by training over extensive document corpora. This work offers a step toward more transparent and reliable Information Retrieval from extensive textual data.

[136] An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques

Walid Mohamed Aly,Taysir Hassan A. Soliman,Amr Mohamed AbdelAziz

Main category: cs.CL

TL;DR: This paper evaluates six LLMs for text summarization across diverse domains, revealing that performance on long scientific texts improves with chunking strategies while highlighting the impact of model choice, dataset features, and prompt design.

Details Motivation: Despite the success of LLMs in NLP, their effectiveness in text summarization across different domains has not been thoroughly evaluated, especially in scenarios where extensive training data is unavailable. Method: A systematic evaluation of six LLMs across four datasets (CNN/Daily Mail, NewsRoom, SAMSum, ArXiv) was conducted using ROUGE and BERTScore metrics. Prompt engineering techniques such as zero-shot and in-context learning were leveraged, along with a sentence-based chunking strategy for handling long documents. Inference times were also analyzed. Result: LLMs show competitive performance on news and dialog summarization tasks. Chunking strategies significantly improve their performance on long scientific documents. Performance varies notably based on model parameters, dataset characteristics, and prompt design. Conclusion: LLMs perform well on news and dialog tasks, but struggle with long scientific documents unless chunking strategies are applied. Model parameters, dataset properties, and prompt design significantly affect performance, offering insights into efficient, instruction-based NLP systems. Abstract: Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates the performance using the ROUGE and BERTScore metrics. 
In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For long documents, we introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.
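
The abstract does not spell out the chunking procedure, so the following is only a minimal sketch of what a sentence-based, multi-stage strategy could look like. The function names, the regex sentence splitter, the word-count budget, and the `summarize` callable (a stand-in for any LLM call) are all assumptions for illustration, not the authors' implementation.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_by_sentences(text, max_words):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk,
    so sentences are never cut in the middle.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_long(text, summarize, max_words):
    """Stage 1: summarize each chunk. Stage 2: summarize the joined partial summaries.

    `summarize` is a hypothetical callable str -> str wrapping a model call.
    """
    partial = [summarize(chunk) for chunk in chunk_by_sentences(text, max_words)]
    if len(partial) == 1:
        return partial[0]
    return summarize(" ".join(partial))
```

Splitting on sentence boundaries rather than at a fixed token offset keeps every chunk readable on its own, which is presumably why a sentence-based scheme suits models with short context windows.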

[137] SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction

Alexander Scarlatos,Nigel Fernandez,Christopher Ormerod,Susan Lottridge,Andrew Lan

Main category: cs.CL

TL;DR: SMART is a novel simulation-based approach for predicting item difficulty in educational assessments, using DPO and LLM-based scoring to align simulated student abilities with real-world requirements, offering better accuracy and efficiency.

Details Motivation: Item difficulty estimation is crucial for accurate educational assessments but traditionally requires costly real-student data and fails in cold-start settings. A more efficient and scalable solution is needed. Method: The SMART method uses direct preference optimization (DPO) to align simulated students with instructed abilities. Simulations generate thousands of responses, evaluated by an LLM-based scoring model, and fitted to an IRT model for item difficulty estimation. Result: SMART successfully predicts item difficulties in cold-start scenarios and outperforms existing methods by achieving better alignment between simulated student abilities and actual item difficulties. Conclusion: SMART demonstrates superior performance in predicting item difficulties compared to other methods, particularly by leveraging improved ability alignment through DPO and simulations with LLM-based scoring. Abstract: Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get item difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with an LLM-based scoring model, and fit the resulting data to an IRT model to obtain item difficulty estimates. 
Through extensive experiments on a real-world student response dataset, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
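
As a rough illustration of forming preference pairs "based on how likely responses are under a ground-truth IRT model", the sketch below uses the one-parameter (Rasch) IRT model with binary correctness. This is a simplifying assumption: the paper targets open-ended items and may use a different IRT variant, and the function names here are hypothetical.

```python
import math

def p_correct(ability, difficulty):
    # Rasch (1PL) IRT: probability that a student answers the item correctly.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def response_likelihood(ability, difficulty, is_correct):
    # Likelihood of an observed binary outcome under the IRT model.
    p = p_correct(ability, difficulty)
    return p if is_correct else 1.0 - p

def preference_pair(ability, difficulty, resp_a, resp_b):
    """Order two candidate responses, each given as (text, is_correct).

    Returns (preferred, rejected), where the preferred response is the one
    more likely for a student of the instructed ability. Returns None when
    both responses are equally likely and carry no preference signal.
    """
    la = response_likelihood(ability, difficulty, resp_a[1])
    lb = response_likelihood(ability, difficulty, resp_b[1])
    if la == lb:
        return None
    return (resp_a, resp_b) if la > lb else (resp_b, resp_a)
```

Under this sketch a high-ability simulated student is steered toward correct answers on an item, while a low-ability one is steered toward incorrect answers on the same item, which is the sense in which the simulated student is aligned with an instructed ability.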

[138] Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

Jaewook Lee,Alexander Scarlatos,Andrew Lan

Main category: cs.CL

TL;DR: This paper introduces a new generative framework that makes keyword mnemonics for Japanese vocabulary learning more interpretable and systematic.

Details Motivation: Because Japanese combines syllabaries such as hiragana with logographic characters of Chinese origin, learning Japanese vocabulary is challenging for learners from Roman alphabet backgrounds. Existing LLM-based mnemonic generation methods function as a black box and lack interpretability. Method: A novel Expectation-Maximization-type algorithm explicitly models a set of common rules in the mnemonic construction process and learns latent structures and compositional rules from learner-authored mnemonics on an online platform. Result: Experiments show that the method performs well in the cold-start setting for new learners and offers insight into the mechanisms behind effective mnemonic creation. Conclusion: The paper proposes a new generative framework for learning and generating keyword mnemonics; by modeling common rules in the mnemonic construction process, it improves interpretability and systematicity. Abstract: Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.

[139] AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models

Chinnappa Guggilla,Budhaditya Roy,Trupti Ramdas Chavan,Abdul Rahman,Edward Bowen

Main category: cs.CL

TL;DR: This study fine-tunes GPT_4o-mini and BERT models to detect AI-generated text and to identify the specific generator model, reporting accuracies of 0.9547 and 0.4698 on the two tasks, respectively.

Details Motivation: As large language models (LLMs) are misused to generate phishing emails, fake news, and other malicious content, detecting AI-generated text is becoming increasingly challenging, so effective detection methods are needed. Method: GPT_4o-mini, LLaMA 3 8B, and BERT models are fine-tuned for two tasks: Task-A (distinguishing human-written from machine-generated text) and Task-B (identifying the specific LLM that generated the text). Result: On Task-A, the fine-tuned GPT_4o-mini and BERT models reached an accuracy of 0.9547; on Task-B, the accuracy was 0.4698. Conclusion: The paper presents a fine-tuning approach based on GPT_4o-mini and BERT models for distinguishing human-written from machine-generated text and for attempting to identify the specific LLM. Abstract: Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content, including style and topic, and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives: Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM responsible for the generation. Both of these tasks are based on fine-tuning of Generative Pre-trained Transformer (GPT_4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned versions of GPT_4o-mini and the BERT model have achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.

[140] InfoSteer: Steering Information Utility in Language Model Post-Training

Chunyuan Deng,Ruidi Chang,Hanjie Chen

Main category: cs.CL

TL;DR: The paper proposes a new lightweight post-training method for language models that treats the feed-forward network (FFN) layer as associative key-value memory, enhancing the use of pre-training knowledge, which results in improved performance and resource efficiency across multiple models and tasks.

Details Motivation: Post-training approaches such as supervised fine-tuning (SFT) do not guarantee effective use of knowledge acquired during pretraining. Method: A lightweight method that treats the FFN layer as associative key-value memory to encourage parametric information utilization in LMs during post-training. Result: This method delivers consistent performance improvements across diverse model families including Qwen, Gemma, and Llama over 15 downstream tasks in ID and OOD evaluations; steered LMs adaptively allocate information, emphasizing semantically meaningful tokens while using fewer resources on simple transition ones. Conclusion: Vanilla post-training does not fully leverage pre-training potential, and steering LMs in latent representation space offers a promising approach that enhances both performance and interpretability. Abstract: Recent advancements in language models (LMs) gradually ushered in an era where post-training is crucial. Yet, post-training approaches such as supervised fine-tuning (SFT) do not guarantee effective use of knowledge acquired during pretraining. We therefore present InfoSteer, a lightweight method that encourages parametric information utilization in LMs during post-training. This is achieved by treating the FFN layer as associative key-value memory and promoting the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. We find that this simple guidance during the post-training phase delivers consistent performance improvements across diverse model families (including Qwen, Gemma, and Llama), spanning over 15 downstream tasks in both ID and OOD evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information, placing more emphasis on generating semantically meaningful tokens while using fewer resources on simple transition ones (e.g., `,' or `and'). 
Our work underscores that vanilla post-training does not fully leverage pre-training potential, and steering LMs in latent representation space offers a promising approach that enhances both performance and interpretability.
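
To make the "FFN layer as associative key-value memory" reading concrete, here is a minimal sketch in plain Python. It shows only the general key-value view of a two-layer feed-forward block (first-layer rows as keys, second-layer rows as values, activations as retrieval coefficients). It does not implement the paper's forward-pass intervention or regularizer, whose exact form the abstract does not give, and the names and the ReLU activation are assumptions.

```python
def relu(x):
    return max(0.0, x)

def ffn_as_memory(x, keys, values):
    """Two-layer FFN read as a key-value memory.

    x      : input vector (list of floats, length d)
    keys   : first-layer weight rows, one key per memory slot (m x d)
    values : second-layer weight rows, one value per memory slot (m x d_out)

    Each coefficient measures how strongly the input matches a key;
    the output is the coefficient-weighted sum of the value vectors.
    """
    coeffs = [relu(sum(xi * ki for xi, ki in zip(x, key))) for key in keys]
    out = [
        sum(c * value[j] for c, value in zip(coeffs, values))
        for j in range(len(values[0]))
    ]
    return coeffs, out
```

In this view, "promoting the use of stored memory vectors" would amount to influencing the coefficients so that more of the stored values contribute to the output; how InfoSteer does this is described in the paper, not here.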

[141] OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model

Chen Wang,Tianyu Peng,Wen Yang,Yinan Bai,Guangfu Wang,Jun Lin,Lanpeng Jia,Lingxiang Wu,Jinqiao Wang,Chengqing Zong,Jiajun Zhang

Main category: cs.CL

TL;DR: This paper presents OpenS2S, an open-source empathetic LSLM that enables transparent research and development in human-machine communication with low-latency speech generation.

Details Motivation: The motivation is driven by the need for transparency in empathetic LSLMs, which are becoming increasingly closed off, limiting researchers' access to crucial development details. Method: The method involves developing OpenS2S based on the empathetic speech-to-text model BLSP-Emo, using a streaming interleaved decoding architecture for low-latency speech generation and an automated data construction pipeline for training. Result: OpenS2S is introduced as a fully open-source, transparent, and end-to-end LSLM capable of enabling empathetic speech interactions with synthesized dialogue and minimal human supervision. Conclusion: The paper concludes by presenting OpenS2S as a solution for transparent research in empathetic LSLMs, highlighting its open-source availability to empower the research community. Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. 
By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at https://casia-lm.github.io/OpenS2S

[142] From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

Pulkit Bansal,Raghvendra Kumar,Shakti Singh,Sriparna Saha,Adam Jatowt

Main category: cs.CL

TL;DR: This paper introduces a new framework combining Direct Preference Optimization and curriculum learning to generate high-quality news explanations that counter misinformation, particularly in low-resource languages.

Details Motivation: In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. Method: A new framework integrates Direct Preference Optimization (DPO) with curriculum learning, using fact-checked explanations as preferred responses and LLM outputs as non-preferred responses, and introduces two key parameters, Actuality and Finesse, into the DPO loss function. Result: Experiments with LLMs (such as Mistral, Llama, Gemma) and PLMs (such as mBART, mT5) show that the framework generates coherent, contextually relevant explanations. Conclusion: The framework effectively combats misinformation and extends automated explanation generation to low-resource languages. Abstract: In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters -- Actuality and Finesse -- into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework's effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.
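
For reference, the standard DPO objective that this framework builds on can be sketched for a single preference pair as follows. This is the plain DPO loss only: the abstract does not specify how the Actuality and Finesse parameters enter the loss, so they are deliberately left out, and the argument names are assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, non-preferred) pair.

    Each argument is the summed log-probability of a response under the
    trainable policy or the frozen reference model. The loss is
    -log(sigmoid(beta * margin)), where margin is the difference between
    the policy-vs-reference log-ratios of the two responses.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

In the setting described above, the fact-checked explanation would supply the "chosen" log-probabilities and the LLM-generated explanation the "rejected" ones; the loss falls as the policy raises the former relative to the latter.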

[143] Pre-Trained Policy Discriminators are General Reward Models

Shihan Dou,Shichun Liu,Yuming Yang,Yicheng Zou,Yunhua Zhou,Shuhao Xing,Chenhao Huang,Qiming Ge,Demin Song,Haijun Lv,Songyang Gao,Chengqi Lv,Enyu Zhou,Honglin Guo,Zhiheng Xi,Wenwei Zhang,Qipeng Guo,Qi Zhang,Xipeng Qiu,Xuanjing Huang,Tao Gui,Kai Chen

Main category: cs.CL

TL;DR: We introduce POLAR, a scalable pre-training method for reward modeling that captures the relative difference between policies to generate a reward signal. It significantly enhances RM performance with robust generalization and scaling properties.

Details Motivation: The motivation is to guide the training policy towards a target policy with desired behaviors by generating a reward signal based on the difference between two policies. Method: Policy Discriminative Learning (POLAR) trains a reward model to discern identical policies and discriminate different ones, capturing the relative difference between one policy and an arbitrary target policy. Result: POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance with improvements in preference accuracy and robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT). Conclusion: POLAR is a promising direction for developing general and strong reward models due to its impressive performance, strong generalization, and scaling properties. Abstract: We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. 
POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.

[144] Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Ziqi Miao,Lijun Li,Yuan Xiong,Zhenhua Liu,Pengyu Zhu,Jing Shao

Main category: cs.CL

TL;DR: This paper identifies a new vulnerability in large language models called Response Attack, where prior dialogue responses can influence subsequent behavior toward harmful content. A new safety dataset is proposed to mitigate this threat effectively.

Details Motivation: To explore the contextual priming vulnerability in LLMs and develop effective mitigation strategies against policy-violating behaviors. Method: The researchers proposed the Response Attack method, leveraging an auxiliary LLM to generate harmful content through paraphrased queries and trigger prompts. They tested this on eight LLMs and developed a safety dataset to counteract the vulnerability. Result: Response Attack outperformed existing jailbreak techniques across multiple LLMs, while the proposed safety dataset significantly reduced attack success rates without affecting model capabilities. Conclusion: The study concludes that LLMs are vulnerable to a new attack method called Response Attack, which can be mitigated using a context-aware safety fine-tuning dataset. Abstract: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer its subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack, which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. They are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at https://github.com/Dtc7w3PQ/Response-Attack.

[145] Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu,Yu Wang,Julian McAuley

Main category: cs.CL

TL;DR: This paper introduces MemoryAgentBench, a new benchmark for evaluating memory capabilities in LLM agents across four core competencies, revealing that current methods are inadequate in fully addressing these aspects.

Details Motivation: Recent benchmarks for LLM agents lack evaluation of memory mechanisms, that is, how agents memorize, update, and retrieve long-term information. No existing benchmark comprehensively covers the interactive, multi-turn nature and all four identified competencies of memory agents. Method: The authors introduce MemoryAgentBench, a new benchmark combining reformulated existing datasets with newly constructed ones, designed to evaluate four core competencies of memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Result: Evaluation of various memory agents using MemoryAgentBench reveals that current methods fall short of mastering all four competencies, indicating the need for improved memory mechanisms. Conclusion: The paper concludes that current methods are insufficient in addressing all four competencies of memory agents and highlights the need for further research on comprehensive memory mechanisms for LLM agents. Abstract: Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (encompassing how agents memorize, update, and retrieve long-term information), is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. 
Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.

cs.CV [Back]

[146] A Simulator Dataset to Support the Study of Impaired Driving

John Gideon,Kimimasa Tamura,Emily Sumner,Laporsha Dees,Patricio Reyes Gomez,Bassamul Haq,Todd Rowell,Avinash Balachandran,Simon Stent,Guy Rosman

Main category: cs.CV

TL;DR: This paper introduces a comprehensive driving dataset designed to study the effects of alcohol intoxication and cognitive distraction on driver behavior during simulated urban driving.

Details Motivation: Despite advances in automated driving technology, impaired driving still imposes high societal costs, prompting the need for a dataset to study its effects. Method: The method involves creating a comprehensive driving dataset with 23.7 hours of simulated urban driving, including vehicle and driver-facing data from 52 human subjects under normal and impaired conditions. Result: The dataset includes data on alcohol-induced impairment (0.10% blood alcohol content), cognitive distractions (audio n-back and sentence parsing tasks), and their combinations, along with responses to controlled road hazards. Conclusion: The paper concludes that the presented dataset can effectively support research on driver impairment due to alcohol intoxication and cognitive distraction, enabling analysis of behavioral changes and responses to road hazards. Abstract: Despite recent advances in automated driving technology, impaired driving continues to incur a high cost to society. In this paper, we present a driving dataset designed to support the study of two common forms of driver impairment: alcohol intoxication and cognitive distraction. Our dataset spans 23.7 hours of simulated urban driving, with 52 human subjects under normal and impaired conditions, and includes both vehicle data (ground truth perception, vehicle pose, controls) and driver-facing data (gaze, audio, surveys). It supports analysis of changes in driver behavior due to alcohol intoxication (0.10% blood alcohol content), two forms of cognitive distraction (audio n-back and sentence parsing tasks), and combinations thereof, as well as responses to a set of eight controlled road hazards, such as vehicle cut-ins. The dataset will be made available at https://toyotaresearchinstitute.github.io/IDD/.

[147] Learning to Generate Vectorized Maps at Intersections with Multiple Roadside Cameras

Miao Fan,Quanxin Zheng,Shengtong Xu,Linghe Kong,Haoyi Xiong

Main category: cs.CV

TL;DR: MRC-VMap is a cost-effective, vision-centric neural network that generates high-definition vectorized maps at intersections using existing roadside cameras, achieving performance comparable to LiDAR-based methods.

Details Motivation: Traditional methods for constructing vectorized maps are either expensive and labor-intensive or suffer from limited performance. MRC-VMap aims to bridge this gap with a cost-effective, vision-centric solution. Method: MRC-VMap uses an end-to-end neural network that directly converts time-aligned, multi-directional images from roadside surveillance cameras into vectorized map representations without additional intermediate modules. Result: Experiments showed that MRC-VMap outperforms state-of-the-art online methods and achieves accuracy comparable to high-cost LiDAR-based approaches, while lowering computational overhead and error propagation. Conclusion: MRC-VMap provides a scalable and efficient solution for modern autonomous navigation systems by offering accuracy comparable to high-cost LiDAR-based approaches while being cost-effective. Abstract: Vectorized maps are indispensable for precise navigation and the safe operation of autonomous vehicles. Traditional methods for constructing these maps fall into two categories: offline techniques, which rely on expensive, labor-intensive LiDAR data collection and manual annotation, and online approaches that use onboard cameras to reduce costs but suffer from limited performance, especially at complex intersections. To bridge this gap, we introduce MRC-VMap, a cost-effective, vision-centric, end-to-end neural network designed to generate high-definition vectorized maps directly at intersections. Leveraging existing roadside surveillance cameras, MRC-VMap directly converts time-aligned, multi-directional images into vectorized map representations. This integrated solution lowers the need for additional intermediate modules--such as separate feature extraction and Bird's-Eye View (BEV) conversion steps--thus reducing both computational overhead and error propagation. 
Moreover, the use of multiple camera views enhances mapping completeness, mitigates occlusions, and provides robust performance under practical deployment constraints. Extensive experiments conducted on 4,000 intersections across 4 major metropolitan areas in China demonstrate that MRC-VMap not only outperforms state-of-the-art online methods but also achieves accuracy comparable to high-cost LiDAR-based approaches, thereby offering a scalable and efficient solution for modern autonomous navigation systems.

[148] Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions

Vineet Kumar Rakesh,Soumya Mazumdar,Research Pratim Maity,Sarbajit Pal,Amitabha Das,Tapas Samanta

Main category: cs.CV

TL;DR: This paper surveys research progress in talking head generation, covering method taxonomy, evaluation, and future directions.

Details Motivation: Talking head generation has transformative potential in computer vision, synthesizing realistic human faces from image, audio, text, or video inputs. Method: The study comprehensively reviews existing talking head generation methods and categorizes them into 2D-based, 3D-based, NeRF-based, diffusion-based, and other approaches. Result: The paper evaluates algorithms, datasets, and evaluation metrics, highlighting advances in perceptual realism and technical efficiency. Conclusion: The paper summarizes research progress in talking head generation and identifies future directions. Abstract: Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D-based, 3D-based, Neural Radiance Fields (NeRF)-based, diffusion-based, parameter-driven, and other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre-trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: https://github.com/VineetKumarRakesh/thg.

[149] Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis

Charlton Teo

Main category: cs.CV

TL;DR: This study evaluates how effective Multimodal Large Language Models (MLLMs) are in analyzing tennis videos, especially in identifying sequences of actions during rallies, and explores ways to enhance their performance.

Details Motivation: Despite existing research on tennis analysis, there is a gap in models capable of understanding and identifying sequences of events in a tennis rally, which has applications in sports analytics. This motivated the evaluation of MLLMs for this purpose. Method: The project assessed the effectiveness of Multimodal LLMs (MLLMs) in analyzing sports videos, particularly tennis videos, by evaluating their ability to classify tennis actions and identify them within a sequence. The research also explored ways to improve MLLM performance, such as using different training techniques and combining them with traditional models. Result: MLLMs demonstrated the ability to analyze tennis videos and identify action sequences in rallies, indicating their usefulness in sports analytics, though improvements are possible through alternative training methods and model combinations. Conclusion: The study concludes that MLLMs have potential in analyzing tennis videos, particularly in identifying sequences of actions within a rally, although their performance can be enhanced through different training methods and integration with traditional models. Abstract: The use of Large Language Models (LLMs) in recent years has also given rise to the development of Multimodal LLMs (MLLMs). These new MLLMs allow us to process images, videos and even audio alongside textual inputs. In this project, we aim to assess the effectiveness of MLLMs in analysing sports videos, focusing mainly on tennis videos. Despite research done on tennis analysis, there remains a gap in models that are able to understand and identify the sequence of events in a tennis rally, which would be useful in other fields of sports analytics. As such, we will mainly assess the MLLMs on their ability to fill this gap - to classify tennis actions, as well as their ability to identify these actions in a sequence of tennis actions in a rally. 
We further looked into ways we can improve the MLLMs' performance, including different training methods and even using them together with other traditional models.

[150] Enhancing Sports Strategy with Video Analytics and Data Mining: Automated Video-Based Analytics Framework for Tennis Doubles

Jia Wei Chen

Main category: cs.CV

TL;DR: This paper proposes a video-based analytics framework for tennis doubles using machine learning techniques that reduce manual annotation and improve data quality, establishing the effectiveness of CNN-based models in capturing complex strategic features.

Details Motivation: The motivation is to provide an automated analysis tool for tennis doubles, which is a strategically complex sport lacking such tools, particularly focusing on improving data consistency and reducing manual annotation effort. Method: The method involves creating a standardized annotation methodology and tool, integrating advanced machine learning techniques like GroundingDINO for player localization and YOLO-Pose for pose estimation, and evaluating the approach using CNN-based models with transfer learning. Result: The result shows that the integration of GroundingDINO and YOLO-Pose improves data consistency and quality while significantly reducing manual annotation effort. The CNN-based models with transfer learning outperformed pose-based methods in predicting shot types, player positioning, and court formations. Conclusion: The paper concludes that the proposed video-based analytics framework effectively addresses the lack of automated analysis tools for tennis doubles, and demonstrates that CNN-based models with transfer learning outperform pose-based methods in predicting shot types, player positioning, and formations. Abstract: We present a comprehensive video-based analytics framework for tennis doubles that addresses the lack of automated analysis tools for this strategically complex sport. Our approach introduces a standardised annotation methodology encompassing player positioning, shot types, court formations, and match outcomes, coupled with a specialised annotation tool designed to meet the unique requirements of tennis video labelling. The framework integrates advanced machine learning techniques including GroundingDINO for precise player localisation through natural language grounding and YOLO-Pose for robust pose estimation. This combination significantly reduces manual annotation effort whilst improving data consistency and quality. 
We evaluate our approach on doubles tennis match data and demonstrate that CNN-based models with transfer learning substantially outperform pose-based methods for predicting shot types, player positioning, and formations. The CNN models effectively capture complex visual and contextual features essential for doubles tennis analysis. Our integrated system bridges advanced analytical capabilities with the strategic complexities of tennis doubles, providing a foundation for automated tactical analysis, performance evaluation, and strategic modelling in professional tennis.

[151] Modeling Urban Food Insecurity with Google Street View Images

David Li

Main category: cs.CV

TL;DR: This paper explores the use of street-level images combined with a two-step feature extraction and gated attention approach to model food insecurity in urban areas, offering a novel tool for planners despite limited predictive accuracy.

Details Motivation: Food insecurity is a significant public health issue, and current identification approaches rely on survey data that is difficult to scale. Method: A two-step process of feature extraction and gated attention for image aggregation was proposed, using street-level images to model food insecurity at the census tract level. Result: The model's effectiveness was evaluated through comparison with other architectures, weight interpretation, and a case study, showing slightly reduced predictive performance. Conclusion: The proposed model has the potential to supplement existing methods for identifying food insecurity despite its slightly lower predictive power. Abstract: Food insecurity is a significant social and public health issue that plagues many urban metropolitan areas around the world. Existing approaches to identifying food insecurity rely primarily on qualitative and quantitative survey data, which is difficult to scale. This project seeks to explore the effectiveness of using street-level images in modeling food insecurity at the census tract level. To do so, we propose a two-step process of feature extraction and gated attention for image aggregation. We evaluate the effectiveness of our model by comparing against other model architectures, interpreting our learned weights, and performing a case study. While our model falls slightly short in terms of its predictive power, we believe our approach still has the potential to supplement existing methods of identifying food insecurity for urban planners and policymakers.
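The abstract names gated attention for image aggregation but gives no formula. As a minimal sketch, assuming the common gated-attention pooling form (a tanh branch multiplied by a sigmoid gate, followed by a softmax over images), the aggregation of per-image feature vectors into one tract-level vector could look as follows. The function name, argument names, and weight shapes are hypothetical and are not taken from the paper.

```python
import math


def gated_attention_pool(features, v_weights, u_weights, w):
    """Pool per-image feature vectors into one vector (assumed formulation).

    features:  list of N feature vectors, each of length D
    v_weights: L x D matrix for the tanh branch (hypothetical parameter)
    u_weights: L x D matrix for the sigmoid gate (hypothetical parameter)
    w:         length-L vector mapping the gated hidden state to a scalar logit
    """
    def project(matrix, vec):
        return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

    logits = []
    for h in features:
        tanh_part = [math.tanh(z) for z in project(v_weights, h)]
        gate = [1.0 / (1.0 + math.exp(-z)) for z in project(u_weights, h)]
        logits.append(sum(wi * t * g for wi, t, g in zip(w, tanh_part, gate)))

    # Softmax over images, shifted by the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Attention-weighted sum of the image features.
    dim = len(features[0])
    pooled = [sum(a * h[j] for a, h in zip(attn, features)) for j in range(dim)]
    return pooled, attn
```

The returned attention weights are the quantities one would inspect when interpreting which images drive a tract-level prediction.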

[152] OBSER: Object-Based Sub-Environment Recognition for Zero-Shot Environmental Inference

Won-Seok Choi,Dong-Sig Han,Suhyung Choi,Hyeonseo Yang,Byoung-Tak Zhang

Main category: cs.CV

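TL;DR: OBSER is a Bayesian framework that infers relationships between sub-environments and their constituent objects, enabling zero-shot environment recognition and outperforming scene-based methods in chained retrieval tasks.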

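Details Motivation: To enable zero-shot recognition of environments as a step toward autonomous environment understanding. Method: A Bayesian framework infers three fundamental relationships between sub-environments and their constituent objects, using metric and self-supervised learning models to estimate the object distributions of sub-environments in the latent space. Result: The framework is validated theoretically and empirically through the ($\epsilon,\delta$) statistically separable (EDS) function; it performs reliable inference in open-world and photorealistic environments and outperforms scene-based methods in chained retrieval tasks. Conclusion: OBSER enables zero-shot recognition of environments. Abstract: We present the Object-Based Sub-Environment Recognition (OBSER) framework, a novel Bayesian framework that infers three fundamental relationships between sub-environments and their constituent objects. In the OBSER framework, metric and self-supervised learning models estimate the object distributions of sub-environments on the latent space to compute these measures. Both theoretically and empirically, we validate the proposed framework by introducing the ($\epsilon,\delta$) statistically separable (EDS) function which indicates the alignment of the representation. Our framework reliably performs inference in open-world and photorealistic environments and outperforms scene-based methods in chained retrieval tasks. The OBSER framework enables zero-shot recognition of environments to achieve autonomous environment understanding.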

[153] GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation

Yi-Chun Chen,Arnav Jhala

Main category: cs.CV

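TL;DR: GameTileNet is a semantic dataset of low-resolution game tiles designed to support narrative-driven procedural content generation through vision-language alignment.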

Details Motivation: Generating visuals consistent with a game's narrative remains challenging because AI outputs are inconsistent, and the visual diversity of automatically generated game content is limited by imbalanced style distributions in the training data. Method: GameTileNet collects artist-created game tiles from OpenGameArt.org under Creative Commons licenses and adds semantic annotations to support narrative-driven content generation. Result: The dataset introduces an object detection pipeline for low-resolution tile-based game art (e.g., 32x32 pixels) and annotates semantics, connectivity, and object classifications. Conclusion: GameTileNet is a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images. Abstract: GameTileNet is a dataset designed to provide semantic labels for low-resolution digital game art, advancing procedural content generation (PCG) and related AI research as a vision-language alignment task. Large Language Models (LLMs) and image-generative AI models have enabled indie developers to create visual assets, such as sprites, for game interactions. However, generating visuals that align with game narratives remains challenging due to inconsistent AI outputs, requiring manual adjustments by human artists. The diversity of visual representations in automatically generated game content is also limited because of the imbalance in distributions across styles for training data. GameTileNet addresses this by collecting artist-created game tiles from OpenGameArt.org under Creative Commons licenses and providing semantic annotations to support narrative-driven content generation. The dataset introduces a pipeline for object detection in low-resolution tile-based game art (e.g., 32x32 pixels) and annotates semantics, connectivity, and object classifications. GameTileNet is a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images.

[154] Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding

Chenglin Li,Qianglong Chen,fengtao,Yin Zhang

Main category: cs.CV

TL;DR: This paper introduces Temporal Search (TS), a training-free framework for improving long video understanding in MLLMs by iteratively focusing on relevant temporal intervals, along with TS-BFS for enhanced efficiency.

Details Motivation: Current MLLMs struggle with long-form video understanding due to inefficient temporal perception. Humans can dynamically adjust their focus, which inspired the development of Temporal Search. Method: Temporal Search uses iterative exploration of temporal intervals based on model confidence, while TS-BFS employs a tree-based best-first search to prioritize promising intervals. Result: The proposed Temporal Search framework improves MLLM performance on long videos without requiring training, and TS-BFS increases efficiency by prioritizing exploration of high-confidence intervals. Conclusion: Temporal Search (TS) significantly improves the understanding of long-form videos by iteratively focusing on relevant temporal intervals, and TS-BFS further enhances efficiency through best-first search. Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in video understanding tasks. However, they continue to struggle with long-form videos because of an inefficient perception of temporal intervals. Unlike humans, who can dynamically adjust their temporal focus to locate query-relevant moments, current MLLMs often rely on dense, uniform sampling across the video timeline, leading to high memory consumption and a risk of missing crucial information. To address this challenge, we introduce Temporal Search (TS), a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. TS is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. TS operates through two main iterative stages. First, the MLLM proposes a temporal interval that is likely to contain task-relevant information. Then, it samples a fixed number of frames from the interval, regardless of length, and feeds them into the model to produce a refined response and confidence score. 
TS refines the focus of the model by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos. Additionally, keyframe-level descriptions are collected to facilitate cross-interval perception throughout the video. To further improve efficiency, we introduce TS-BFS, a best-first search strategy over a tree. Each node represents a candidate interval and is expanded via two methods: self-driven proposals and uniform partitioning. Nodes are scored based on confidence and self-evaluation, and the most promising one is selected for continued exploration.
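The best-first search over candidate intervals described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a caller-supplied `score` function standing in for the model's confidence and self-evaluation, and it expands nodes by uniform partitioning only, leaving out the self-driven proposals. All function and parameter names are hypothetical.

```python
import heapq
from typing import Callable, List, Tuple

Interval = Tuple[float, float]


def uniform_partition(interval: Interval, parts: int = 2) -> List[Interval]:
    """Split an interval into equally sized child intervals."""
    start, end = interval
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def best_first_search(
    root: Interval,
    score: Callable[[Interval], float],
    min_length: float,
    max_expansions: int = 32,
) -> Tuple[Interval, float]:
    """Repeatedly expand the highest-scoring interval on the frontier."""
    root_score = score(root)
    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(-root_score, root)]
    best, best_score = root, root_score
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_score, interval = heapq.heappop(frontier)
        if -neg_score > best_score:
            best, best_score = interval, -neg_score
        if interval[1] - interval[0] <= min_length:
            continue  # too short to split further
        for child in uniform_partition(interval):
            heapq.heappush(frontier, (-score(child), child))
    return best, best_score
```

In a real system, `score` would sample a fixed number of frames from the interval and query the MLLM; here it is left abstract so that the search logic can be read on its own.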

[155] DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction

Zhiyi Hou,Enhui Ma,Fang Li,Zhiyi Lai,Kalok Ho,Zhanqian Wu,Lijun Zhou,Long Chen,Chitian Sun,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Kaicheng Yu

Main category: cs.CV

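TL;DR: This work develops a new approach that uses synthetic high-risk motion data and the DriveMRP-Agent framework to significantly improve the ability of vision-language models to predict motion risk in autonomous driving.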

Details Motivation: In long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge because of uncertainty in dynamic environments and limited data coverage. Method: A Bird's-Eye View (BEV) based motion simulation method models risk from three aspects (the ego vehicle, other vehicles, and the environment), and the DriveMRP-Agent framework adds an information injection strategy covering global context, ego-vehicle perspective, and trajectory projection. Result: After fine-tuning on the DriveMRP-10K dataset, DriveMRP-Agent raises the accident recognition accuracy of multiple VLM baselines from 27.13% to 88.03%; in zero-shot evaluation on a real-world high-risk motion dataset, accuracy improves from 29.42% to 68.50%. Conclusion: Synthesizing high-risk motion data and introducing the DriveMRP-Agent framework significantly improves the motion risk prediction of vision-language models in autonomous driving and demonstrates generalization to real-world scenarios. Abstract: Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage. In this work, we aim to explore whether it is possible to enhance the motion risk prediction capabilities of Vision-Language Models (VLM) by synthesizing high-risk motion data. Specifically, we introduce a Bird's-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for VLM training, which we call DriveMRP-10K. Furthermore, we design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent. This framework incorporates a novel information injection strategy for global context, ego-vehicle perspective, and trajectory projection, enabling VLMs to effectively reason about the spatial relationships between motion waypoints and the environment. Extensive experiments demonstrate that by fine-tuning with DriveMRP-10K, our DriveMRP-Agent framework can significantly improve the motion risk prediction performance of multiple VLM baselines, with the accident recognition accuracy soaring from 27.13% to 88.03%. 
Moreover, when tested via zero-shot evaluation on an in-house real-world high-risk motion dataset, DriveMRP-Agent achieves a significant performance leap, boosting accuracy from the base model's 29.42% to 68.50%, which showcases the strong generalization capabilities of our method in real-world scenarios.

[156] Multimodal image registration for effective thermographic fever screening

C. Y. N. Dwith,Pejhman Ghassemi,Joshua Pfefer,Jon Casamento,Quanzeng Wang

Main category: cs.CV

TL;DR: An effective method for precise fever screening through accurate localization of the canthi regions using infrared and white-light images has been developed.

Details Motivation: Fever screening using infrared thermographs is crucial during pandemics, and accurate localization of the canthi regions enhances this process. Method: A coarse-fine registration strategy using different registration models based on landmarks and edge detection on eye contours was employed. Result: Registration accuracy was evaluated to be within 2.7 mm. Conclusion: The proposed registration method is effective for accurate localization of the canthi regions. Abstract: Fever screening based on infrared thermographs (IRTs) is a viable mass screening approach during infectious disease pandemics, such as Ebola and SARS, for temperature monitoring in public places like hospitals and airports. IRTs have been found to be powerful, quick and non-invasive methods to detect elevated temperatures. Moreover, regions medially adjacent to the inner canthi (called the canthi regions in this paper) are preferred sites for fever screening. Accurate localization of the canthi regions can be achieved through multi-modal registration of infrared (IR) and white-light images. We proposed a registration method through a coarse-fine registration strategy using different registration models based on landmarks and edge detection on eye contours. We evaluated the registration accuracy to be within 2.7 mm, which enables accurate localization of the canthi regions.

[157] CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning

Andrew Kiruluta,Preethi Raju,Priscilla Burity

Main category: cs.CV

TL;DR: This paper introduces CSAT, a novel attention mechanism that reduces computational complexity in Vision-Language Models by leveraging compressed sensing techniques while maintaining performance efficiency.

Details Motivation: As Vision-Language Models scale to longer video sequences and richer language descriptions, the quadratic complexity of standard attention mechanisms becomes a computational bottleneck. This issue is more severe in vLLMs due to cross-modal attention requirements, leading to high memory and latency costs. Method: The Compressed Sensing Attention Transformer (CSAT) projects high-dimensional key and value representations into a lower-dimensional subspace using random measurement matrices and reconstructs attention outputs with sparse recovery algorithms. This approach reduces attention complexity while preserving semantic fidelity. Result: CSAT significantly reduces attention complexity in vLLMs while maintaining semantic fidelity. It exploits the compressibility of visual and textual representations, especially in videos with high temporal redundancy and sparse cross-modal grounding in language. Conclusion: CSAT is particularly well-suited to compressed attention since it exploits the structured sparsity in alignment and scene composition of vLLMs, making it a scalable, interpretable, and resource-efficient solution for next-generation multimodal transformers. Abstract: Vision-Language Models (vLLMs) have emerged as powerful architectures for joint reasoning over visual and textual inputs, enabling breakthroughs in image captioning, cross modal retrieval, and multimodal dialogue. However, as these models scale to longer video sequences and richer language descriptions, the quadratic complexity of the standard attention mechanism presents a fundamental computational bottleneck. This challenge is exacerbated in vLLMs, where attention must be computed not only within modalities but also across them, leading to prohibitive memory and latency costs. In this work, we introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing. 
By projecting high dimensional key and value representations into a lower-dimensional subspace via random measurement matrices and reconstructing the attention outputs using sparse recovery algorithms, CSAT significantly reduces attention complexity while maintaining semantic fidelity. Applied to vLLMs, CSAT exploits the inherent compressibility of both visual and textual representations especially evident in video, where temporal redundancy is high, and in language, where cross-modal grounding is often sparse. In contrast to LLMs, which must often model entangled symbolic dependencies, vLLMs benefit from structured sparsity in alignment and scene composition, making them particularly well-suited to compressed attention. We provide a formal mathematical treatment of CSAT, demonstrate its integration into vision language pipelines, and validate its performance on standard benchmarks, highlighting its promise as a scalable, interpretable, and resource efficient solution for next generation multimodal transformers.
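To make the projection step concrete, the following sketch compresses the key and value sequences with a random Gaussian measurement matrix and computes attention over the compressed rows. It rests on several assumptions: the abstract does not say along which axis the projection is applied (the sequence axis is assumed here), the sparse recovery stage is omitted entirely, and all names are hypothetical. It should not be read as the CSAT architecture itself.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def compressed_attention(q, k, v, m, seed=0):
    """Attention over m random measurements of n key/value rows (m < n).

    Assumption: the projection acts on the sequence axis, so the score
    matrix has shape (len(q), m) instead of (len(q), n).
    """
    rng = random.Random(seed)
    n, d = len(k), len(k[0])
    # Random Gaussian measurement matrix of shape (m, n), scaled by 1/sqrt(m).
    phi = [[rng.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(n)] for _ in range(m)]
    k_small = matmul(phi, k)  # shape (m, d)
    v_small = matmul(phi, v)  # shape (m, d)
    k_small_t = [list(col) for col in zip(*k_small)]  # shape (d, m)
    scores = matmul(q, k_small_t)  # shape (len(q), m)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v_small), weights
```

With n key/value rows reduced to m measurements, the score matrix shrinks from one column per token to one column per measurement, which is where the reduction in attention cost would come from under these assumptions.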

[158] VR-YOLO: Enhancing PCB Defect Detection with Viewpoint Robustness Based on YOLO

Hengyi Zhu,Linye Wei,He Li

Main category: cs.CV

TL;DR: This paper proposes VR-YOLO, an improved YOLO-based algorithm for PCB defect detection, achieving high accuracy and robustness against viewpoint changes with minimal computational overhead.

Details Motivation: Conventional PCB defect detection algorithms have strict requirements for image angle, orientation, and clarity, limiting their practical application. This paper aims to develop a more robust and generalized AI-assisted solution for industrial production scenarios. Method: An enhanced PCB defect detection algorithm named VR-YOLO was developed based on YOLOv8. It incorporates a diversified scene enhancement (DSE) method to improve target diversity in the dataset and a key object focus (KOF) scheme that introduces angular loss and an attention mechanism for better small-target feature learning. Result: The VR-YOLO model achieved a mean average precision (mAP) of 98.9% on original test images and 94.7% on test images with viewpoint shifts (±0.06 shear coefficients and ±10 degrees rotation), showing significant improvements over the baseline model. Conclusion: The proposed VR-YOLO algorithm significantly improves the generalization performance and viewpoint robustness of PCB defect detection compared to the baseline YOLO model, while maintaining minimal additional computational cost. Abstract: The integration of large-scale circuits and systems emphasizes the importance of automated defect detection of electronic components. The YOLO image detection model has been used to detect PCB defects and it has become a typical AI-assisted case of traditional industrial production. However, conventional detection algorithms have stringent requirements for the angle, orientation, and clarity of target images. In this paper, we propose an enhanced PCB defect detection algorithm, named VR-YOLO, based on the YOLOv8 model. This algorithm aims to improve the model's generalization performance and enhance viewpoint robustness in practical application scenarios. We first propose a diversified scene enhancement (DSE) method by expanding the PCB defect dataset by incorporating diverse scenarios and segmenting samples to improve target diversity. 
A novel key object focus (KOF) scheme is then presented by considering angular loss and introducing an additional attention mechanism to enhance fine-grained learning of small target features. Experimental results demonstrate that our improved PCB defect detection approach achieves a mean average precision (mAP) of 98.9% for the original test images, and 94.7% for the test images with viewpoint shifts (horizontal and vertical shear coefficients of $\pm 0.06$ and rotation angle of $\pm 10$ degrees), showing significant improvements compared to the baseline YOLO model with negligible additional computational cost.
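The viewpoint shifts used for evaluation (shear coefficients within $\pm 0.06$ and rotation within $\pm 10$ degrees) can be written as a 2x2 affine matrix. The sketch below is illustrative only: it assumes the rotation is applied after the shear and that parameters are drawn uniformly, neither of which is stated in the abstract, and the function names are hypothetical.

```python
import math
import random


def viewpoint_matrix(shear_x, shear_y, angle_deg):
    """Return M = R @ S for a rotation R applied after a shear S.

    S = [[1, shear_x], [shear_y, 1]] and R is a counter-clockwise rotation.
    The composition order is an assumption made for this sketch.
    """
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    return [
        [c - s * shear_y, c * shear_x - s],
        [s + c * shear_y, s * shear_x + c],
    ]


def apply(matrix, point):
    """Transform a single (x, y) point, e.g. a bounding-box corner."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y,
        matrix[1][0] * x + matrix[1][1] * y,
    )


def sample_viewpoint(rng, max_shear=0.06, max_angle=10.0):
    """Draw shear and rotation parameters uniformly within the stated ranges."""
    return (
        rng.uniform(-max_shear, max_shear),
        rng.uniform(-max_shear, max_shear),
        rng.uniform(-max_angle, max_angle),
    )
```

Applying the same matrix to the image and to its bounding-box corners keeps annotations aligned with the shifted view.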

[159] Concept-based Adversarial Attack: a Probabilistic Perspective

Andi Zhang,Xuan Ding,Steven McDonagh,Samuel Kaski

Main category: cs.CV

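TL;DR: This paper proposes a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective.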

Details Motivation: To generate diverse adversarial examples that preserve the original concept and thereby mislead classifiers more effectively. Method: A concept is represented by a probabilistic generative model or a set of images, and the adversarial attack operates on that representation. Result: The generated adversarial examples vary in pose, viewpoint, or background while preserving the original concept, and achieve higher attack efficiency. Conclusion: The concept-based adversarial attack framework effectively generates diverse adversarial examples that remain consistent with the underlying concept. Abstract: We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept -- represented by a probabilistic generative model or a set of images -- to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial examples and effectively preserve the underlying concept, while achieving higher attack efficiency.

[160] YOLO-Based Pipeline Monitoring in Challenging Visual Environments

Pragya Dhungana,Matteo Fresta,Niraj Tamrakar,Hariom Dhungana

Main category: cs.CV

TL;DR: This study analyzes the effectiveness of AI models (YOLOv8 and YOLOv11) in improving subsea pipeline monitoring under poor visibility.

Details Motivation: Traditional visual-based inspection systems fail due to turbidity, light distortion, and degraded images in low-visibility underwater environments. Method: Conducted a comparative analysis of two robust versions of YOLOv8 and YOLOv11 with three variants for image segmentation tasks in subsea environments. Result: YOLOv11 showed superior performance over YOLOv8 in delineating target pipeline structures under challenging visual conditions. Conclusion: YOLOv11 is more effective than YOLOv8 in subsea pipeline monitoring under low-visibility conditions. Abstract: Condition monitoring of subsea pipelines in low-visibility underwater environments poses significant challenges due to turbidity, light distortion, and image degradation. Traditional visual-based inspection systems often fail to provide reliable data for mapping, object recognition, or defect detection in such conditions. This study explores the integration of advanced artificial intelligence (AI) techniques to enhance image quality, detect pipeline structures, and support autonomous fault diagnosis. This study conducts a comparative analysis of the two most robust versions of YOLOv8 and YOLOv11 and their three variants tailored for image segmentation tasks in complex and low-visibility subsea environments. Using pipeline inspection datasets captured beneath the seabed, it evaluates model performance in accurately delineating target structures under challenging visual conditions. The results indicated that YOLOv11 outperformed YOLOv8 in overall performance.

[161] Farm-Level, In-Season Crop Identification for India

Ishan Deshpande,Amandeep Kaur Reehal,Chandan Nath,Renu Singh,Aayush Patel,Aishwarya Jayagopal,Gaurav Singh,Gaurav Aggarwal,Amit Agarwal,Prathmesh Bele,Sridhar Reddy,Tanya Warrier,Kinjal Singh,Ashish Tendulkar,Luis Pazos Outon,Nikita Saxena,Agata Dondzik,Dinesh Tewari,Shruti Garg,Avneet Singh,Harsh Dhand,Vaibhav Rajan,Alok Talekar

Main category: cs.CV

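TL;DR: This paper presents a deep learning crop identification framework that uses satellite imagery and farm boundary data to achieve accurate in-season crop identification at national scale, addressing several challenges faced by traditional methods.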

Details Motivation: Accurate, timely, farm-level crop type information is critical for national food security, agricultural policy, and economic planning, especially in major agricultural nations such as India. Existing approaches, however, suffer from limited geographic scalability, narrow crop coverage, and the complexity of mixed pixels and heterogeneous landscapes, and they fall short on the in-season identification needed for proactive decision-making. Method: A deep learning model, combined with an automated season detection algorithm, performs multi-crop identification from Sentinel-1 and Sentinel-2 satellite imagery and farm boundary data. Result: The model identifies 12 major crops that account for nearly 90% of India's total cultivated area, agreeing with the 2023-24 national crop census at 94% in the winter season and 75% in the monsoon season; it reliably identifies crops as early as two months into the growing season and supports rigorous in-season performance evaluation. Conclusion: By integrating Sentinel-1 and Sentinel-2 imagery with national-scale farm boundary data, the framework achieves in-season identification of 12 major crops across India and is highly scalable, offering a practical path to nationwide agricultural monitoring and management. Abstract: Accurate, timely, and farm-level crop type information is paramount for national food security, agricultural policy formulation, and economic planning, particularly in agriculturally significant nations like India. While remote sensing and machine learning have become vital tools for crop monitoring, existing approaches often grapple with challenges such as limited geographical scalability, restricted crop type coverage, the complexities of mixed-pixel and heterogeneous landscapes, and crucially, the robust in-season identification essential for proactive decision-making. We present a framework designed to address the critical data gaps for targeted data driven decision making which generates farm-level, in-season, multi-crop identification at national scale (India) using deep learning. Our methodology leverages the strengths of Sentinel-1 and Sentinel-2 satellite imagery, integrated with national-scale farm boundary data. The model successfully identifies 12 major crops (which collectively account for nearly 90% of India's total cultivated area), showing an agreement with the national crop census 2023-24 of 94% in the winter season and 75% in the monsoon season. Our approach incorporates an automated season detection algorithm, which estimates crop sowing and harvest periods. This allows for reliable crop identification as early as two months into the growing season and facilitates rigorous in-season performance evaluation. 
Furthermore, we have engineered a highly scalable inference pipeline, culminating in what is, to our knowledge, the first pan-India, in-season, farm-level crop type data product. The system's effectiveness and scalability are demonstrated through robust validation against national agricultural statistics, showcasing its potential to deliver actionable, data-driven insights for transformative agricultural monitoring and management across India.

[162] Mimesis, Poiesis, and Imagination: Exploring Text-to-Image Generation of Biblical Narratives

Willem Th. van Peursen,Samuel E. Entsua-Mensah

Main category: cs.CV

TL;DR: This study examines how AI visualizes the Biblical narrative of Exodus 2:5-9, finding that while AI produces imaginative visuals, it also shows biases and lacks depth in creativity and theology.

Details Motivation: To investigate how text-to-image AI models reproduce or reimagine sacred narratives, specifically Exodus 2:5-9, and evaluate the stylistic, theological, and cultural dimensions of these depictions. Method: Comparative visual analysis of AI-generated images using MidJourney, Google image results, and classical paintings, drawing on the concepts of mimesis and poiesis. Result: Findings show that while AI excels in producing aesthetically rich and imaginative visuals, it reflects biases and limitations of its training data and questions its capacity for genuine creativity, authorial intent, and theological depth. Conclusion: AI can serve as a creative partner in reinterpreting biblical texts, though its role in sacred art remains complex and contested. Abstract: This study explores the intersection of artificial intelligence and the visualization of Biblical narratives by analyzing AI-generated images of Exodus 2:5-9 (Moses found in River Nile) using MidJourney. Drawing on the classical concepts of mimesis (imitation) and poiesis (creative generation), the authors investigate how text-to-image (T2I) models reproduce or reimagine sacred narratives. Through comparative visual analysis, including Google image results and classical paintings, the research evaluates the stylistic, theological, and cultural dimensions of AI-generated depictions. Findings show that while AI excels in producing aesthetically rich and imaginative visuals, it also reflects the biases and limitations of its training data. The study highlights AI's potential to augment human imagination but questions its capacity for genuine creativity, authorial intent, and theological depth. It concludes by suggesting that AI can serve as a creative partner in reinterpreting biblical texts, though its role in sacred art remains complex and contested.

[163] Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models

Jiahuan Zhang,Shunwen Bai,Tianheng Wang,Kaiwen Guo,Kai Han,Guozheng Rao,Kaicheng Yu

Main category: cs.CV

TL;DR: This paper introduces an evaluation framework to test the spatial deformation reasoning capabilities of Vision-Language Models (VLMs). It finds that current models struggle significantly with such tasks, even after applying training and reasoning enhancements.

Details Motivation: The motivation stems from the desire to determine whether Vision-Language Models (VLMs) can truly understand and manipulate spatial objects like humans do, which has not been clearly established in previous research. Method: A new evaluation framework was proposed to assess the spatial deformation reasoning abilities of VLMs. This involved constructing a benchmark for spatial deformation reasoning from 2D to 3D using a data engine that generates unlimited problem pairs without data leakage. The evaluation explores both forward reasoning (finding the final state given operations) and reverse reasoning (determining the operations given the final state), employing a ladder competition format based on the number of deformation steps. Result: Benchmarking results revealed that almost no model demonstrated plausible spatial deformation reasoning abilities, showing limitations in both forward and reverse reasoning tasks. Even with enhancements such as targeted training and mainstream reasoning techniques, models performed poorly on 3D spatial deformation reasoning. Conclusion: The study concludes that current Vision-Language Models (VLMs) lack effective spatial deformation reasoning capabilities, even after applying targeted training and mainstream reasoning enhancement methods. Abstract: Humans naturally possess the spatial reasoning ability to form and manipulate images and structures of objects in space. There is an increasing effort to endow Vision-Language Models (VLMs) with similar spatial reasoning capabilities. However, it remains unclear whether these models truly understand and manipulate spatial objects or not. To address this question, we propose a new evaluation framework aimed at assessing the performance of VLMs in spatial deformation reasoning tasks. Specifically, we construct a benchmark for spatial deformation reasoning from 2D to 3D. 
Leveraging our data engine, we can generate unlimited evaluation problem pairs with infinite steps, without any data leakage. We explore whether the model can effectively perform spatial deformation reasoning from two directions: forward reasoning (given the operations, find the final state) and reverse reasoning (given the final state, determine the operations). We adopt a ladder competition format, using the number of deformation steps as the level classification criterion, with the goal of exploring the boundaries of the model's deformation reasoning capabilities. Interestingly, the benchmarking results reveal that almost no model demonstrates plausible spatial deformation reasoning abilities. Furthermore, even after applying targeted training and mainstream reasoning enhancement methods, the models are still unable to perform well on 3D spatial deformation reasoning.

[164] Iterative Misclassification Error Training (IMET): An Optimized Neural Network Training Technique for Image Classification

Ruhaan Singh,Sreelekha Guggilam

Main category: cs.CV

TL;DR: This paper proposes Iterative Misclassification Error Training (IMET), a new approach for improving medical image classification by focusing on misclassified samples, demonstrating enhanced model performance and robustness.

Details Motivation: Medical datasets often contain noisy, mislabeled, or poorly generalizable images, which can lead to overfitting and pose serious risks in diagnostics. Existing data-efficient strategies like coreset selection and curriculum learning face challenges in generalizability and computational efficiency. Method: The paper introduces Iterative Misclassification Error Training (IMET), a framework inspired by curriculum learning and coreset selection, aimed at identifying misclassified samples to improve training efficiency. Result: The results show that IMET has the potential to enhance model robustness and accuracy on benchmark medical image classification datasets using state-of-the-art ResNet architectures. Conclusion: The paper concludes that IMET can enhance model robustness and accuracy in medical image analysis by streamlining the training process and prioritizing edge case scenarios and rare outcomes. Abstract: Deep learning models have proven to be effective on medical datasets for accurate diagnostic predictions from images. However, medical datasets often contain noisy, mislabeled, or poorly generalizable images, particularly for edge cases and anomalous outcomes. Additionally, high-quality datasets are often small in sample size, which can result in overfitting, where models memorize noise rather than learn generalizable patterns. This, in particular, could pose serious risks in medical diagnostics, where the risk associated with misclassification can impact human life. Several data-efficient training strategies have emerged to address these constraints. In particular, coreset selection identifies compact subsets of the most representative samples, enabling training that approximates full-dataset performance while reducing computational overhead. On the other hand, curriculum learning relies on gradually increasing training difficulty and accelerating convergence. 
However, developing a generalizable difficulty ranking mechanism that works across diverse domains, datasets, and models while reducing computational cost remains challenging. In this paper, we introduce Iterative Misclassification Error Training (IMET), a novel framework inspired by curriculum learning and coreset selection. The IMET approach aims to identify misclassified samples in order to streamline the training process, while prioritizing the model's attention to edge case scenarios and rare outcomes. The paper evaluates IMET's performance on benchmark medical image classification datasets against state-of-the-art ResNet architectures. The results demonstrating IMET's potential for enhancing model robustness and accuracy in medical image analysis are also presented in the paper.

[165] Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers

Yusuf Shihata

Main category: cs.CV

TL;DR: This paper proposes Gated Recurrent Fusion (GRF), a scalable multimodal learning framework that uses sequential processing and dynamic information flow control to achieve efficient and effective fusion.

Details Motivation: Multimodal learning faces a challenge in balancing deep fusion with computational scalability, as existing methods like cross-attention models have quadratic complexity that limits their applicability in high-modality environments. Method: The GRF architecture processes modalities sequentially using a fusion block based on Transformer Decoder layers with symmetric cross-attention, followed by integration via a Gated Fusion Unit inspired by GRU to manage information flow dynamically. Result: Experiments on the CMU-MOSI benchmark show that GRF performs competitively compared to more complex baselines, and visualizations demonstrate its ability to create structured, class-separable representations. Conclusion: Gated Recurrent Fusion (GRF) offers a scalable and efficient approach for multimodal representation learning, achieving competitive performance while maintaining linear computational complexity. Abstract: Multimodal learning faces a fundamental tension between deep, fine-grained fusion and computational scalability. While cross-attention models achieve strong performance through exhaustive pairwise fusion, their quadratic complexity is prohibitive for settings with many modalities. We address this challenge with Gated Recurrent Fusion (GRF), a novel architecture that captures the power of cross-modal attention within a linearly scalable, recurrent pipeline. Our method processes modalities sequentially, updating an evolving multimodal context vector at each step. The core of our approach is a fusion block built on Transformer Decoder layers that performs symmetric cross-attention, mutually enriching the shared context and the incoming modality. This enriched information is then integrated via a Gated Fusion Unit (GFU), a GRU-inspired mechanism that dynamically arbitrates information flow, enabling the model to selectively retain or discard features. 
This stateful, recurrent design scales linearly with the number of modalities, O(n), making it ideal for high-modality environments. Experiments on the CMU-MOSI benchmark demonstrate that GRF achieves competitive performance compared to more complex baselines. Visualizations of the embedding space further illustrate that GRF creates structured, class-separable representations through its progressive fusion mechanism. Our work presents a robust and efficient paradigm for powerful, scalable multimodal representation learning.
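The sequential, gated update described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cross-attention fusion block is omitted, the gate uses hypothetical per-dimension weights (`w_ctx`, `w_in`, `bias`), and all function names are assumptions made for illustration. It only shows how a GRU-style update gate blends an evolving context with each incoming modality in a single pass, so the number of fusion steps grows linearly with the number of modalities.

```python
import math


def sigmoid(v):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion_step(context, enriched, w_ctx, w_in, bias):
    """One GRU-style gated update (hypothetical, element-wise weights).

    z close to 0 keeps the old context; z close to 1 overwrites it with
    the squashed incoming features.
    """
    new_context = []
    for c, e, wc, wi, b in zip(context, enriched, w_ctx, w_in, bias):
        z = sigmoid(wc * c + wi * e + b)  # update gate in (0, 1)
        new_context.append((1.0 - z) * c + z * math.tanh(e))
    return new_context


def fuse_modalities(modalities, w_ctx, w_in, bias):
    """Fold modalities into one context vector, one step per modality."""
    context = [0.0] * len(modalities[0])
    steps = 0
    for feats in modalities:
        context = gated_fusion_step(context, feats, w_ctx, w_in, bias)
        steps += 1
    return context, steps


if __name__ == "__main__":
    # Three toy "modalities", each a 2-dimensional feature vector.
    mods = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.9]]
    ones = [1.0, 1.0]
    ctx, steps = fuse_modalities(mods, ones, ones, [0.0, 0.0])
    print(steps, ctx)
```

With n modalities the loop runs n gated steps, in contrast to the n(n-1)/2 pairings an exhaustive pairwise fusion scheme would need.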

[166] Leveraging the Structure of Medical Data for Improved Representation Learning

Andrea Agostini,Sonia Laguna,Alain Ryser,Samuel Ruiperez-Campillo,Moritz Vandenhirtz,Nicolas Deperrois,Farhad Nooralahzadeh,Michael Krauthammer,Thomas M. Sutter,Julia E. Vogt

Main category: cs.CV

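TL;DR: This paper proposes a self-supervised learning framework for medical imaging that exploits the inherent structure of multi-view X-ray images to learn effective feature representations when data is limited.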

Details Motivation: Medical datasets such as MIMIC-CXR are typically small and sparsely annotated, but their multi-view imaging provides rich internal structure, which motivated the authors to design a data-efficient pretraining strategy that fully exploits this structure. Method: The method treats frontal and lateral chest X-ray views as natural positive pairs and learns without textual supervision by reconstructing each view from sparse image patches while aligning the views' latent embeddings. Result: Evaluation on MIMIC-CXR shows superior performance compared with supervised objectives and with baselines that do not exploit structural information. Conclusion: The study presents a lightweight, modality-agnostic self-supervised framework that offers an effective pretraining approach for medical domains where data is limited but internally structured. Abstract: Building generalizable medical AI systems requires pretraining strategies that are data-efficient and domain-aware. Unlike internet-scale corpora, clinical datasets such as MIMIC-CXR offer limited image counts and scarce annotations, but exhibit rich internal structure through multi-view imaging. We propose a self-supervised framework that leverages the inherent structure of medical datasets. Specifically, we treat paired chest X-rays (i.e., frontal and lateral views) as natural positive pairs, learning to reconstruct each view from sparse patches while aligning their latent embeddings. Our method requires no textual supervision and produces informative representations. Evaluated on MIMIC-CXR, we show strong performance compared to supervised objectives and baselines trained without leveraging structure. This work provides a lightweight, modality-agnostic blueprint for domain-specific pretraining where data is structured but scarce.

[167] Enabling Robust, Real-Time Verification of Vision-Based Navigation through View Synthesis

Marius Neuhalfen,Jonathan Grzymisch,Manuel Sanchez-Gestido

Main category: cs.CV

TL;DR: This paper proposes VISY-REVE, a real-time augmentation pipeline for validating Vision-Based Navigation algorithms, along with a new distance metric and dataset density enhancement method.

Details Motivation: Traditional validation methods for image processing algorithms, such as synthetic rendering or robotic testbed acquisition, face challenges including difficult setup and slow runtime. This motivates the need for a more efficient and flexible solution. Method: The paper proposes a novel pipeline named VISY-REVE that augments image datasets in real-time with synthesized views at novel poses. It also introduces a new distance metric called Boresight Deviation Distance for better view synthesis evaluation and a method to increase dataset density using this metric. Result: A new pipeline VISY-REVE was developed to generate continuous trajectories from sparse datasets. A new camera pose distance metric, the Boresight Deviation Distance, was introduced and shown to be more suitable for view synthesis than existing metrics. Additionally, a method for increasing image dataset density was developed. Conclusion: The paper concludes that the proposed pipeline VISY-REVE, along with the new distance metric and dataset density method, provides a more efficient and effective way to validate image processing algorithms for Vision-Based Navigation. Abstract: This work introduces VISY-REVE: a novel pipeline to validate image processing algorithms for Vision-Based Navigation. Traditional validation methods such as synthetic rendering or robotic testbed acquisition suffer from difficult setup and slow runtime. Instead, we propose augmenting image datasets in real-time with synthesized views at novel poses. This approach creates continuous trajectories from sparse, pre-existing datasets in open or closed-loop. In addition, we introduce a new distance metric between camera poses, the Boresight Deviation Distance, which is better suited for view synthesis than existing metrics. Using it, a method for increasing the density of image datasets is developed.

[168] FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images

Guang Yang

Main category: cs.CV

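TL;DR: This paper proposes FreqCross, a multi-modal fusion network for detecting AI-generated images that combines spatial features, frequency-domain artifacts, and radial energy distributions.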

Details Motivation: Advances in diffusion models, particularly Stable Diffusion 3.5, make it possible to generate highly photorealistic synthetic images, which poses a significant challenge to existing detection methods. Method: The paper proposes a multi-modal fusion approach that combines spatial RGB features, frequency-domain artifacts, and radial energy distribution patterns. It uses a three-branch architecture: a ResNet-18 for spatial feature extraction, a lightweight CNN for processing 2D FFT magnitude spectra, and a multi-layer perceptron for analyzing radial energy distributions. Result: Experiments show that FreqCross reaches 97.8% accuracy, 5.2% higher than state-of-the-art baselines. Frequency analysis also shows that synthetic images exhibit distinct spectral signatures in the 0.1–0.4 normalized frequency range. Conclusion: FreqCross is a multi-modal fusion network that effectively detects AI-generated images, achieving 97.8% accuracy on a dataset of 10,000 paired real and synthetic images. Abstract: The rapid advancement of diffusion models, particularly Stable Diffusion 3.5, has enabled the generation of highly photorealistic synthetic images that pose significant challenges to existing detection methods. This paper presents FreqCross, a novel multi-modal fusion network that combines spatial RGB features, frequency domain artifacts, and radial energy distribution patterns to achieve robust detection of AI-generated images. Our approach leverages a three-branch architecture: (1) a ResNet-18 backbone for spatial feature extraction, (2) a lightweight CNN for processing 2D FFT magnitude spectra, and (3) a multi-layer perceptron for analyzing radial energy profiles. We introduce a novel radial energy distribution analysis that captures characteristic frequency artifacts inherent in diffusion-generated images, and fuse it with spatial and spectral cues via simple feature concatenation followed by a compact classification head. Extensive experiments on a dataset of 10,000 paired real (MS-COCO) and synthetic (Stable Diffusion 3.5) images demonstrate that FreqCross achieves 97.8% accuracy, outperforming state-of-the-art baselines by 5.2%. The frequency analysis further reveals that synthetic images exhibit distinct spectral signatures in the 0.1–0.4 normalised frequency range, providing a theoretical foundation for our approach. Code and pre-trained models are publicly available to facilitate reproducible research.
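To make the idea of a radial energy profile concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the paper's code: the bin count, the normalisation (frequencies in cycles per pixel, radii above 0.5 clamped into the last bin), and the function names are all hypothetical, and a naive DFT is used only to keep the example dependency-free (a real pipeline would use an FFT library).

```python
import cmath
import math


def dft2_magnitude(img):
    """Naive 2D DFT magnitude of a small grayscale image (list of rows)."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    s += img[y][x] * cmath.exp(1j * angle)
            mag[u][v] = abs(s)
    return mag


def radial_energy_profile(img, n_bins=4):
    """Mean spectral energy per radial-frequency bin (assumed binning)."""
    h, w = len(img), len(img[0])
    mag = dft2_magnitude(img)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for u in range(h):
        for v in range(w):
            # Signed frequencies in cycles per pixel, in [-0.5, 0.5).
            fu = (u if u < h / 2 else u - h) / h
            fv = (v if v < w / 2 else v - w) / w
            r = math.hypot(fu, fv)
            # Radii beyond 0.5 (the corners) are clamped into the last bin.
            b = min(int(r / 0.5 * n_bins), n_bins - 1)
            sums[b] += mag[u][v] ** 2
            counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


if __name__ == "__main__":
    flat = [[1.0] * 8 for _ in range(8)]
    checker = [[(-1.0) ** (x + y) for x in range(8)] for y in range(8)]
    print(radial_energy_profile(flat))     # energy only in the lowest bin
    print(radial_energy_profile(checker))  # energy only in the highest bin
```

A profile like this is a short fixed-length vector, which is why a small multi-layer perceptron is a natural fit for consuming it alongside the image-based branches.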

[169] Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis

Haiqing Li,Yuzhi Guo,Feng Jiang,Thao M. Dang,Hehuan Ma,Qifeng Zhou,Jean Gao,Junzhou Huang

Main category: cs.CV

TL;DR: This paper proposes TG-MILNet, a novel method for detecting early scoliosis through gait analysis, achieving high accuracy and interpretability.

Details Motivation: Early scoliosis detection is challenging; traditional methods have radiation risks and require expertise. Method: TG-MILNet uses Dynamic Time Warping (DTW) clustering, an Inter-Bag Temporal Attention (IBTA) mechanism, a Boundary-Aware Model (BAM), and textual guidance for scoliosis detection. Result: TG-MILNet achieved state-of-the-art results on the Scoliosis1K dataset, especially for borderline cases. Conclusion: TG-MILNet is effective in early scoliosis detection using gait videos, offering non-invasive and accurate diagnosis. Abstract: Early-stage scoliosis is often difficult to detect, particularly in adolescents, where delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLMs) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at https://github.com/lhqqq/TG-MILNet
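Dynamic Time Warping, which the abstract uses to handle temporal misalignment between gait sequences, can be sketched in a few lines. This is the textbook DTW distance on 1-D sequences with an absolute-difference cost. How TG-MILNet builds its clustering on top of DTW is not described here, so the function name and the scalar per-frame features are assumptions for illustration only.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    # The second sequence is the first with one frame held longer,
    # so the warped distance is zero.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
    print(dtw_distance([0, 0], [1, 1]))
```

Because repeated frames cost nothing extra, two walkers with the same gait pattern but different cadence come out as close, which is the property that makes DTW a reasonable basis for grouping frames into gait phases.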

[170] Topological Signatures vs. Gradient Histograms: A Comparative Study for Medical Image Classification

Faisal Ahmed,Mohammad Alfrad Nobel Bhuiyan

Main category: cs.CV

TL;DR: This paper compares HOG and TDA feature extraction techniques for retinal image classification, showing competitive performance and distinct structural insights.

Details Motivation: This is the first comparative study between HOG and TDA techniques for medical image classification using retinal fundus images, aiming to benchmark gradient-based and topological features. Method: The study evaluated HOG and TDA methods independently on the APTOS dataset using 26244 HOG features and 800 TDA features. Seven classical machine learning models were trained with 10-fold cross-validation. Result: XGBoost achieved the best performance: 94.29% accuracy (HOG) and 94.18% (TDA) for binary detection; 74.41% (HOG) and 74.69% (TDA) for multi-class diabetic retinopathy severity grading. Conclusion: HOG and TDA techniques offer competitive performance but encode different structural aspects of retinal images, making them interpretable, applicable to other domains, and suitable for integration into deep learning pipelines. Abstract: We present the first comparative study of two fundamentally distinct feature extraction techniques: Histogram of Oriented Gradients (HOG) and Topological Data Analysis (TDA), for medical image classification using retinal fundus images. HOG captures local texture and edge patterns through gradient orientation histograms, while TDA, using cubical persistent homology, extracts high-level topological signatures that reflect the global structure of pixel intensities. We evaluate both methods on the large APTOS dataset for two classification tasks: binary detection (normal versus diabetic retinopathy) and five-class diabetic retinopathy severity grading. From each image, we extract 26244 HOG features and 800 TDA features, using them independently to train seven classical machine learning models with 10-fold cross-validation. XGBoost achieved the best performance in both cases: 94.29 percent accuracy (HOG) and 94.18 percent (TDA) on the binary task; 74.41 percent (HOG) and 74.69 percent (TDA) on the multi-class task. 
Our results show that both methods offer competitive performance but encode different structural aspects of the images. This is the first work to benchmark gradient-based and topological features on retinal imagery. The techniques are interpretable, applicable to other medical imaging domains, and suitable for integration into deep learning pipelines.

[171] Markerless Stride Length estimation in Athletic using Pose Estimation with monocular vision

Patryk Skorupski,Cosimo Distante,Pier Luigi Mazzeo

Main category: cs.CV

TL;DR: This paper proposes a computer vision-based method using image processing techniques to estimate athletes' stride length and speed transition from video sequences, aiming to support coaching and training by monitoring gait parameters.

Details Motivation: Monitoring individual performance is essential for supporting staff coaches in establishing a proper training schedule for each athlete. Method: Using the probabilistic Hough transform combined with a human pose detection algorithm to estimate the leg joint positions of runners, and applying a homography transformation to estimate stride length. Result: Experiments on various race videos with three different runners demonstrated the effectiveness of the proposed system. Conclusion: The paper concludes that the proposed computer vision-based approach is a useful tool for coaching and training, suggesting its potential value in measuring and monitoring the gait parameters of athletes. Abstract: Performance measures such as stride length in athletics and the pace of runners can be estimated using different tricks, such as dividing the running length by the number of steps or relying on markers printed on the track. Monitoring individual performance is essential for supporting staff coaches in establishing a proper training schedule for each athlete. The aim of this paper is to investigate a computer vision-based approach for estimating stride length and speed transition from video sequences and assessing video analysis processing among athletes. Using some well-known image processing methodologies such as the probabilistic Hough transform combined with a human pose detection algorithm, we estimate the leg joint position of runners. In this way, applying a homography transformation, we can estimate the runner's stride length. Experiments on various race videos with three different runners demonstrated that the proposed system represents a useful tool for coaching and training. This suggests its potential value in measuring and monitoring the gait parameters of athletes.
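The homography step can be illustrated as follows. This is a sketch under assumptions, not the authors' pipeline: it presumes a 3x3 homography `H` mapping image pixels to track-plane coordinates in metres has already been estimated (for example from track lines), and that two successive foot-contact pixel positions are available from a pose estimator. The function names and the example matrix are hypothetical.

```python
import math


def apply_homography(H, pt):
    """Map an image point (x, y) to the ground plane using a 3x3 matrix H."""
    x, y = pt
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the homogeneous coordinate to get plane coordinates.
    return (xh / w, yh / w)


def stride_length(H, foot_a_px, foot_b_px):
    """Ground-plane distance between two foot contacts given in pixels."""
    ax, ay = apply_homography(H, foot_a_px)
    bx, by = apply_homography(H, foot_b_px)
    return math.hypot(bx - ax, by - ay)


if __name__ == "__main__":
    # Hypothetical homography: a pure scaling where 1 pixel = 1 cm.
    H = [[0.01, 0.0, 0.0],
         [0.0, 0.01, 0.0],
         [0.0, 0.0, 1.0]]
    print(stride_length(H, (100, 200), (300, 200)))  # about 2.0 metres
```

Measuring distance after the projective mapping, not in pixel space, is what removes perspective foreshortening from the estimate.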

[172] Look-Back: Implicit Visual Re-focusing in MLLM Reasoning

Shuo Yang,Yuwei Niu,Yuyang Liu,Yang Ye,Bin Lin,Li Yuan

Main category: cs.CV

TL;DR: This paper proposes Look-Back, an implicit method enabling Multimodal Large Language Models to autonomously refocus on visual inputs during reasoning, improving their performance across multiple benchmarks.

Details Motivation: MLLMs tend to neglect visual input in later inference stages, prompting the need for methods that enhance visual-textual integration without explicit injection. Method: Analysis of MLLM attention patterns and introduction of the Look-Back method for implicit visual information refocusing. Result: Look-Back improves reasoning and perception abilities in MLLMs, validated through extensive evaluations on multimodal benchmarks. Conclusion: Look-Back enables MLLMs to implicitly re-focus on visual inputs during reasoning without explicit injection, enhancing their multimodal capabilities. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual information injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to "look back" at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model's reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.

[173] Intelligent Histology for Tumor Neurosurgery

Xinhai Hou,Akhil Kondepudi,Cheng Jiang,Yiwei Lyu,Samir Harake,Asadur Chowdury,Anna-Katharina Meißner,Volker Neuschmelting,David Reinecke,Gina Furtjes,Georg Widhalm,Lisa Irina Koerner,Jakob Straehle,Nicolas Neidert,Pierre Scheffler,Juergen Beck,Michael Ivan,Ashish Shah,Aditya Pandey,Sandra Camelo-Piragua,Dieter Henrik Heiland,Oliver Schnell,Chris Freudiger,Jacob Young,Melike Pekmezci,Katie Scotford,Shawn Hervey-Jumper,Daniel Orringer,Mitchel Berger,Todd Hollon

Main category: cs.CV

TL;DR: This paper introduces Intelligent Histology using AI and SRH for fast, real-time tumor tissue analysis in neurosurgery, offering a major improvement over traditional methods.

Details Motivation: Current methods like light microscopy and H&E histology are slow, resource-intensive, and lack real-time digital imaging capabilities. Method: The paper reviews the scientific background and clinical translation of Intelligent Histology, focusing on its use in various neurosurgical specialties. Result: SRH enables high-resolution digital images within seconds, supporting AI-driven tumor classification and detection of infiltration. Conclusion: Intelligent Histology, which combines AI with SRH, offers a rapid, label-free digital imaging method that can transform real-time tumor analysis in neurosurgery. Abstract: The importance of rapid and accurate histologic analysis of surgical tissue in the operating room has been recognized for over a century. Our standard-of-care intraoperative pathology workflow is based on light microscopy and H&E histology, which is slow, resource-intensive, and lacks real-time digital imaging capabilities. Here, we present an emerging and innovative method for intraoperative histologic analysis, called Intelligent Histology, that integrates artificial intelligence (AI) with stimulated Raman histology (SRH). SRH is a rapid, label-free, digital imaging method for real-time microscopic tumor tissue analysis. SRH generates high-resolution digital images of surgical specimens within seconds, enabling AI-driven tumor histologic analysis, molecular classification, and tumor infiltration detection. We review the scientific background, clinical translation, and future applications of intelligent histology in tumor neurosurgery. We focus on the major scientific and clinical studies that have demonstrated the transformative potential of intelligent histology across multiple neurosurgical specialties, including neurosurgical oncology, skull base, spine oncology, pediatric tumors, and peripheral nerve tumors. 
Future directions include the development of AI foundation models through multi-institutional datasets, incorporating clinical and radiologic data for multimodal learning, and predicting patient outcomes. Intelligent histology represents a transformative intraoperative workflow that can reinvent real-time tumor analysis for 21st century neurosurgery.

[174] Detection of Rail Line Track and Human Beings Near the Track to Avoid Accidents

Mehrab Hosain,Rajiv Kapoor

Main category: cs.CV

TL;DR: This paper introduces a real-time railway safety system using YOLOv5 to detect tracks and identify nearby humans, significantly improving accident prevention capabilities.

Details Motivation: To mitigate potential railway accidents by detecting humans and objects in proximity to tracks in real time. Method: The study uses the YOLOv5 deep learning model with real-time video data to detect railway tracks and identify humans or objects near the track. Result: The proposed method shows a remarkable improvement in accuracy over existing methods, with the ability to detect objects at both close and long distances. Conclusion: The paper successfully demonstrates the effectiveness of using YOLOv5 for rail line detection and human identification, contributing significantly to railway safety. Abstract: This paper presents an approach for rail line detection and the identification of human beings in proximity to the track, utilizing the YOLOv5 deep learning model to mitigate potential accidents. The technique incorporates real-time video data to identify railway tracks with impressive accuracy and recognizes nearby moving objects within a one-meter range, specifically targeting the identification of humans. This system aims to enhance safety measures in railway environments by providing real-time alerts for any detected human presence close to the track. The integration of a functionality to identify objects at a longer distance further fortifies the preventative capabilities of the system. With a precise focus on real-time object detection, this method is poised to deliver significant contributions to the existing technologies in railway safety. The effectiveness of the proposed method is demonstrated through a comprehensive evaluation, yielding a remarkable improvement in accuracy over existing methods. These results underscore the potential of this approach to revolutionize safety measures in railway environments, providing a substantial contribution to accident prevention strategies.

[175] LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection

Ana Vasilcoiu,Ivona Najdenkoska,Zeno Geradts,Marcel Worring

Main category: cs.CV

TL;DR: This paper introduces LATTE, a novel approach to detect generated images by analyzing the trajectory of latent embeddings over multiple diffusion steps, achieving superior performance compared to existing methods.

Details Motivation: With diffusion-based image generators becoming more advanced, distinguishing real from generated images is increasingly difficult. Existing methods focus on single-step errors, neglecting the sequential nature of the denoising process. Method: The proposed LATTE method captures latent trajectory embeddings across multiple denoising steps, refines them with a feature refinement module, fuses them with visual features, and feeds them into a lightweight classifier. Result: LATTE outperforms baseline models on benchmarks like GenImage and DiffusionFake, and shows strong results in cross-generator and cross-dataset settings. Conclusion: LATTE is an effective method for detecting generated images by modeling the evolution of latent embeddings during the diffusion process, and it performs well across various generators and datasets. Abstract: The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This can erode trust in digital media, making it critical to develop generalizable detectors for generated images. Recent methods leverage diffusion denoising cues, but mainly focus on single-step reconstruction errors, ignoring the inherent sequential nature of the denoising process. In this work, we propose LATTE - Latent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across several denoising timesteps. By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images. Each latent is refined by employing our latent-visual feature refinement module and aggregated into a unified representation. Afterwards, it is fused with the visual features and finally passed into a lightweight classifier. Our experiments demonstrate that LATTE surpasses the baselines on several established benchmarks, such as GenImage and DiffusionFake. 
Moreover, it demonstrates strong performance in cross-generator and cross-dataset settings, highlighting the potential of using the trajectory of latent embeddings for generated image detection. The code is available at the following link: https://github.com/AnaMVasilcoiu/LATTE-Diffusion-Detector.

[176] Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations

Xiangrui Liu,Man Luo,Agneet Chatterjee,Hua Wei,Yezhou Yang

Main category: cs.CV

TL;DR: This paper explores the psychological underpinnings of hallucinations in Vision-Language Models (VLMs), introducing a new classification of hallucinatory behaviors such as authority bias and developing the AIpsych benchmark to assess these tendencies systematically.

Details Motivation: Existing research often attributes hallucination in VLMs to technical limitations or sycophancy bias, overlooking the potential link between hallucination and human-like cognitive biases. This work aims to explore this overlooked connection and introduce a psychological perspective into the understanding of VLM hallucinations. Method: The researchers developed a scalable benchmark called AIpsych to analyze psychological tendencies in VLM responses. They examined how model architecture and parameter size affect behaviors when responding to manipulated questions and conducted a human subject study for validation. Result: Experiments show that as model size increases, VLMs display stronger sycophantic tendencies but reduced authority bias. The AIpsych benchmark successfully identifies distinct behavioral patterns in VLMs, which are further validated by a human subject study highlighting differences between VLMs and humans. Conclusion: This paper concludes that hallucinations in Vision-Language Models (VLMs) may reflect cognitive biases similar to those in human psychology, and increasing model size can lead to greater sycophantic tendencies while reducing authority bias. Abstract: Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors and may have neglected the possibility that hallucination behaviours might mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy, categorizing VLMs' hallucination behaviours, including sycophancy, logical inconsistency, and a newly identified VLM behaviour: authority bias. 
To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model evaluation. The benchmark is available at https://github.com/lxrswdd/AIpsych.

[177] Transparent Machine Learning: Training and Refining an Explainable Boosting Machine to Identify Overshooting Tops in Satellite Imagery

Nathan Mitchell,Lander Ver Hoef,Imme Ebert-Uphoff,Kristina Moen,Kyle Hilburn,Yoonjin Lee,Emily J. King

Main category: cs.CV

TL;DR: This paper explores using an interpretable machine learning model (EBM) for detecting overshooting tops in meteorological satellite data, combining explainability with domain expertise.

Details Motivation: To develop interpretable machine learning algorithms for high-risk atmospheric science applications, specifically to detect overshooting tops in satellite imagery. Method: The paper uses an Explainable Boosting Machine (EBM) combined with feature engineering techniques, such as Gray-Level Co-occurrence Matrices, and utilizes satellite imagery channels for overshooting top detection. Result: A fully interpretable ML algorithm was developed through human-machine collaboration, performing well in detecting overshooting tops despite not matching the accuracy of more complex models. Conclusion: The study concludes that EBM, though not as accurate as complex models, is a significant step towards creating interpretable ML algorithms for meteorological applications like OT detection. Abstract: An Explainable Boosting Machine (EBM) is an interpretable machine learning (ML) algorithm that has benefits in high-risk applications but has not yet found much use in atmospheric science. The overall goal of this work is twofold: (1) explore the use of EBMs, in combination with feature engineering, to obtain interpretable, physics-based machine learning algorithms for meteorological applications; (2) illustrate these methods for the detection of overshooting tops (OTs) in satellite imagery. Specifically, we seek to simplify the process of OT detection by first using mathematical methods to extract key features, such as cloud texture using Gray-Level Co-occurrence Matrices, followed by applying an EBM. Our EBM focuses on the classification task of predicting OT regions, utilizing Channel 2 (visible imagery) and Channel 13 (infrared imagery) of the Advanced Baseline Imager sensor of the Geostationary Operational Environmental Satellite 16. Multi-Radar/Multi-Sensor system convection flags are used as labels to train the EBM model. Note, however, that detecting convection, while related, is different from detecting OTs. 
Once trained, the EBM was examined and minimally altered to more closely match strategies used by domain scientists to identify OTs. The result of our efforts is a fully interpretable ML algorithm that was developed in a human-machine collaboration. While the final model does not reach the accuracy of more complex approaches, it performs well and represents a significant step toward building fully interpretable ML algorithms for this and other meteorological applications.
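The texture step mentioned in the abstract can be illustrated with a minimal NumPy sketch of a Gray-Level Co-occurrence Matrix and two common statistics derived from it. The pixel offset, the number of gray levels, and the choice of contrast and homogeneity are assumptions made for illustration; the abstract does not say which settings or statistics the authors use.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for one pixel offset (dx, dy).

    Entry (i, j) is the fraction of pixel pairs where the first pixel has
    gray level i and the pixel at offset (dx, dy) has gray level j.
    """
    image = np.asarray(image)
    h, w = image.shape
    counts = np.zeros((levels, levels), dtype=float)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            counts[image[y, x], image[y + dy, x + dx]] += 1
    return counts / counts.sum()

def glcm_contrast(p):
    """Large when neighbouring pixels differ strongly in gray level."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

def glcm_homogeneity(p):
    """Close to 1 when neighbouring pixels have similar gray levels."""
    i, j = np.indices(p.shape)
    return float(np.sum(p / (1.0 + (i - j) ** 2)))

# Two toy 4x4 "images" with 2 gray levels: a flat patch and a checkerboard.
flat = np.zeros((4, 4), dtype=int)
checker = np.indices((4, 4)).sum(axis=0) % 2

for name, img in [("flat", flat), ("checker", checker)]:
    p = glcm(img, levels=2)
    print(name, glcm_contrast(p), glcm_homogeneity(p))
```

Scalar statistics like these are the kind of engineered, human-readable feature that an interpretable model such as an EBM can consume, since each one maps to a describable visual property of the cloud field.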

[178] AI-driven Web Application for Early Detection of Sudden Death Syndrome (SDS) in Soybean Leaves Using Hyperspectral Images and Genetic Algorithm

Pappu Kumar Yadav,Rishik Aggarwal,Supriya Paudel,Amee Parmar,Hasan Mirzakhaninafchi,Zain Ul Abideen Usmani,Dhe Yeong Tchalla,Shyam Solanki,Ravi Mural,Sachin Sharma,Thomas F. Burks,Jianwei Qin,Moon S. Kim

Main category: cs.CV

TL;DR: This research introduces an AI-based web tool for early detection of soybean leaf disease using hyperspectral imaging and machine learning, achieving high accuracy.

Details Motivation: Sudden Death Syndrome caused by Fusarium virguliforme threatens soybean production, and early detection is crucial for effective management. Method: Leaf samples were scanned with a portable hyperspectral imaging system (398-1011 nm), and a Genetic Algorithm selected five informative wavelengths for input into a lightweight CNN. Spatial-spectral features were classified using ten classical machine learning models. Result: Ensemble classifiers (Random Forest, AdaBoost), Linear SVM, and Neural Net achieved over 98% accuracy and minimal error, while Gaussian Process and QDA performed poorly. Conclusion: The study successfully developed an AI-driven web application for early detection of Sudden Death Syndrome in soybean leaves using hyperspectral imaging and machine learning models, showing potential for precision agriculture applications. Abstract: Sudden Death Syndrome (SDS), caused by Fusarium virguliforme, poses a significant threat to soybean production. This study presents an AI-driven web application for early detection of SDS on soybean leaves using hyperspectral imaging, enabling diagnosis prior to visible symptom onset. Leaf samples from healthy and inoculated plants were scanned using a portable hyperspectral imaging system (398-1011 nm), and a Genetic Algorithm was employed to select five informative wavelengths (505.4, 563.7, 712.2, 812.9, and 908.4 nm) critical for discriminating infection status. These selected bands were fed into a lightweight Convolutional Neural Network (CNN) to extract spatial-spectral features, which were subsequently classified using ten classical machine learning models. Ensemble classifiers (Random Forest, AdaBoost), Linear SVM, and Neural Net achieved the highest accuracy (>98%) and minimal error across all folds, as confirmed by confusion matrices and cross-validation metrics. Poor performance by Gaussian Process and QDA highlighted their unsuitability for this dataset. 
The trained models were deployed within a web application that enables users to upload hyperspectral leaf images, visualize spectral profiles, and receive real-time classification results. This system supports rapid and accessible plant disease diagnostics, contributing to precision agriculture practices. Future work will expand the training dataset to encompass diverse genotypes, field conditions, and disease stages, and will extend the system for multiclass disease classification and broader crop applicability.

[179] Development of an Improved Capsule-Yolo Network for Automatic Tomato Plant Disease Early Detection and Diagnosis

Idris Ochijenu,Monday Abutu Idakwo,Sani Felix

Main category: cs.CV

TL;DR: This study develops an efficient Capsule-YOLO network architecture that accurately identifies tomato leaf diseases and provides treatment recommendations, which matters for raising crop yields and safeguarding food security.

Details Motivation: Nigeria has fertile agricultural soil suited to large-scale tomato cultivation, but diseases caused by pathogens threaten tomato health, reducing yields and food security. Detecting disease early through visual recognition is an effective solution. Method: An enhanced Capsule-YOLO network architecture built on the YOLO framework automatically segments tomato leaf images from complex backgrounds, and a user-friendly interface lets users upload images and detect disease. Result: The model achieves 99.31% accuracy, 98.78% recall, 99.09% precision, and a 98.93% F1-score, improvements of 2.91%, 1.84%, 5.64%, and 4.12%, respectively, over existing methods. Conclusion: The study proposes an improved Capsule-YOLO network architecture that effectively detects and segments overlapping and occluded tomato leaf images against complex backgrounds, and provides diagnosis of early disease symptoms and treatment recommendations. Abstract: Like many countries, Nigeria is naturally endowed with fertile agricultural soil that supports large-scale tomato production. However, the prevalence of disease-causing pathogens poses a significant threat to tomato health, often leading to reduced yields and, in severe cases, the extinction of certain species. These diseases jeopardise both the quality and quantity of tomato harvests, contributing to food insecurity. Fortunately, tomato diseases can often be visually identified through distinct forms, appearances, or textures, typically first visible on leaves and fruits. This study presents an enhanced Capsule-YOLO network architecture designed to automatically segment overlapping and occluded tomato leaf images from complex backgrounds using the YOLO framework. It identifies disease symptoms with impressive performance metrics: 99.31% accuracy, 98.78% recall, 99.09% precision, and a 98.93% F1-score, representing improvements of 2.91%, 1.84%, 5.64%, and 4.12% over existing state-of-the-art methods. Additionally, a user-friendly interface was developed to allow farmers and users to upload images of affected tomato plants and detect early disease symptoms. The system also provides recommendations for appropriate diagnosis and treatment. The effectiveness of this approach promises significant benefits for the agricultural sector by enhancing crop yields and strengthening food security.
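As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the two figures quoted in the abstract; it assumes all three numbers refer to the same evaluation, which the abstract implies but does not state.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f1 = f1_score(0.9909, 0.9878)
print(f"{100 * f1:.2f}%")  # prints 98.93%
```

The recomputed value agrees with the reported 98.93% F1-score.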

[180] A Vision-Based Closed-Form Solution for Measuring the Rotation Rate of an Object by Tracking One Point

Daniel Raviv,Juan D. Yepes,Eiki M. Martinson

Main category: cs.CV

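TL;DR: This paper proposes a novel method for estimating rigid-body rotation that requires tracking only one feature point and, without prior knowledge, both computes the rotation and segments the scene.

Details Motivation: The aim is a rotation-estimation method that does not depend on object shape, needs no prior knowledge, suits parallel processing, and can segment the scene. Method: Under orthographic projection, with the camera fixated on a point on the rigid body, the body's rotation rate is computed analytically by tracking one other feature point. Result: The paper shows that, in most cases, tracking any point on the rigid body yields the same instantaneous rotation rate, and validates the method on simulated and real video data. Conclusion: The paper presents a new method that analytically obtains a rigid body's rotation by tracking only one feature point in the image and that can distinguish points not belonging to the same rigid body, thereby segmenting the scene. Abstract: We demonstrate that, under orthographic projection and with a camera fixated on a point located on a rigid body, the rotation of that body can be analytically obtained by tracking only one other feature in the image. With some exceptions, any tracked point, regardless of its location on the body, yields the same value of the instantaneous rotation rate. The proposed method is independent of the shape of the 3D object and does not require a priori knowledge about the scene. This algorithm is suited for parallel processing and can achieve segmentation of the scene by distinguishing points that do not belong to the same rigid body, simply because they do not produce the same value of the rotation. This paper presents an analytical derivation, simulation results, and results from real video data.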

[181] Subject Invariant Contrastive Learning for Human Activity Recognition

Yavuz Yarici,Kiran Kokilepersaud,Mohit Prabhushankar,Ghassan AlRegib

Main category: cs.CV

TL;DR: This paper introduces SICL, a new loss function that improves contrastive learning for human activity recognition.

Details Motivation: The high cost of data annotation makes self-supervised approaches such as contrastive learning attractive for human activity recognition. However, domain shifts caused by subject variability hinder generalization to new subjects. Method: The loss function is evaluated on three public benchmarks: UTD-MHAD, MMAct, and DARai. Result: SICL improves performance by up to 11% over traditional contrastive learning methods. The loss function is also shown to adapt across a variety of settings. Conclusion: SICL is a simple yet effective loss function that improves generalization in human activity recognition. It re-weights negative pairs from the same subject to suppress subject-specific cues and emphasize activity-specific information. Abstract: The high cost of annotating data makes self-supervised approaches, such as contrastive learning methods, appealing for Human Activity Recognition (HAR). Effective contrastive learning relies on selecting informative positive and negative samples. However, HAR sensor signals are subject to significant domain shifts caused by subject variability. These domain shifts hinder model generalization to unseen subjects by embedding subject-specific variations rather than activity-specific features. As a result, human activity recognition models trained with contrastive learning often struggle to generalize to new subjects. We introduce Subject-Invariant Contrastive Learning (SICL), a simple yet effective loss function to improve generalization in human activity recognition. SICL re-weights negative pairs drawn from the same subject to suppress subject-specific cues and emphasize activity-specific information. We evaluate our loss function on three public benchmarks: UTD-MHAD, MMAct, and DARai. We show that SICL improves performance by up to 11% over traditional contrastive learning methods. Additionally, we demonstrate the adaptability of our loss function across various settings, including multiple self-supervised methods, multimodal scenarios, and supervised learning frameworks.
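The idea of re-weighting same-subject negatives can be sketched as a small change to a standard InfoNCE loss. This is not the authors' formulation: the function name `subject_weighted_info_nce`, the multiplicative weight on same-subject negatives, its default value, and the temperature are all hypothetical choices made for illustration.

```python
import numpy as np

def subject_weighted_info_nce(z1, z2, subjects, same_subject_weight=2.0,
                              temperature=0.1):
    """InfoNCE over two views, with same-subject negatives re-weighted.

    z1[i] and z2[i] are embeddings of two views of sample i (the positive
    pair); every z2[j] with j != i is a negative for z1[i]. Negatives that
    share a subject with sample i get `same_subject_weight` in the
    denominator (an assumed scheme); all other terms keep weight 1.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature              # (N, N), diagonal = positives

    subjects = np.asarray(subjects)
    same = subjects[:, None] == subjects[None, :]
    weights = np.where(same, same_subject_weight, 1.0)
    np.fill_diagonal(weights, 1.0)             # positives are not re-weighted

    # Numerically stable log of the weighted sum of exponentials.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = np.log((weights * np.exp(sim - row_max)).sum(axis=1))
    log_denominator += row_max[:, 0]
    return float(np.mean(log_denominator - np.diag(sim)))

rng = np.random.default_rng(0)
view_a = rng.normal(size=(4, 8))
view_b = rng.normal(size=(4, 8))
subject_ids = [0, 0, 1, 1]  # two recordings per subject (toy data)

print(subject_weighted_info_nce(view_a, view_b, subject_ids, 1.0))  # plain InfoNCE
print(subject_weighted_info_nce(view_a, view_b, subject_ids, 2.0))  # re-weighted
```

With a weight of 1 the function reduces to ordinary InfoNCE. A weight above 1 penalizes similarity to same-subject negatives more heavily, which is one way to discourage embeddings from clustering by subject.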

[182] LACONIC: A 3D Layout Adapter for Controllable Image Creation

Léopold Maillard,Tom Durand,Adrien Ramanana Rahary,Maks Ovsjanikov

Main category: cs.CV

TL;DR: This paper proposes a lightweight, diffusion-based method for 3D-aware image generation that combines explicit 3D geometry with scene context for more natural image synthesis and editing.

Details Motivation: Existing methods for generating multi-object scenes typically rely on 2D controls and struggle to maintain consistent three-dimensional geometric structure, so a method that makes better use of 3D information is needed. Method: An adapter network and a new training method integrate explicit 3D geometry and scene context, including both on-screen and off-screen objects, into the generation process. Result: The method produces high-quality, semantically rich images from a reasonable amount of data, generalizes well, and supports intuitive editing. Conclusion: The paper proposes a new conditioning approach that adds 3D-awareness to pretrained text-to-image diffusion models and supports a richer set of applications. Abstract: Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect the consistent three-dimensional geometric structure underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable amount of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.

[183] Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Song Mao,Yang Chen,Pinglong Cai,Ding Wang,Guohang Yan,Zhi Yu,Botian Shi

Main category: cs.CV

TL;DR: This paper identifies and analyzes encoder redundancy in multi-encoder MLLMs, proposing CUR and IG as diagnostic tools to improve the efficiency and effectiveness of multimodal models.

Details Motivation: Despite the trend of using multiple vision encoders in MLLMs to capture diverse visual information, performance gains often diminish or degrade. This phenomenon, termed encoder redundancy, motivates a systematic investigation into its causes and impact. Method: The authors conducted comprehensive ablation studies on state-of-the-art multi-encoder MLLMs and introduced two principled metrics: CUR to quantify each encoder's unique contribution and IG to capture the overall disparity in encoder utility. Result: Experiments revealed that certain vision encoders contribute minimally or even negatively to model performance, confirming the existence of substantial redundancy in multi-encoder systems. Conclusion: The paper concludes that current multi-encoder designs in MLLMs exhibit critical inefficiencies due to encoder redundancy, and the proposed metrics—Conditional Utilization Rate (CUR) and Information Gap (IG)—serve as valuable tools for diagnosing and improving multimodal architectures. Abstract: Multimodal Large Language Models (MLLMs) increasingly adopt multiple vision encoders to capture diverse visual information, ranging from coarse semantics to fine grained details. While this approach is intended to enhance visual understanding capability, we observe that the performance gains from adding encoders often diminish and can even lead to performance degradation, a phenomenon we term encoder redundancy. This paper presents a systematic investigation into this issue. Through comprehensive ablation studies on state of the art multi encoder MLLMs, we empirically demonstrate that significant redundancy exists. To quantify each encoder's unique contribution, we propose a principled metric: the Conditional Utilization Rate (CUR). 
Building on CUR, we introduce the Information Gap (IG) to capture the overall disparity in encoder utility within a model. Our experiments reveal that certain vision encoders contribute little, or even negatively, to overall performance, confirming substantial redundancy. These findings highlight critical inefficiencies in current multi-encoder designs and establish that our proposed metrics can serve as valuable diagnostic tools for developing more efficient and effective multimodal architectures.

[184] Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification

Xinyue Xin,Ming Li,Yan Wu,Xiang Li,Peng Zhang,Dazhi Xu

Main category: cs.CV

TL;DR: This paper proposes SKDNet-SSR, a novel knowledge distillation network for dual-frequency PolSAR image classification that addresses regional consistency issues and optimally utilizes dual-frequency data through pixel purification and complementary learning techniques.

Details Motivation: Dual-frequency PolSAR image classification faces two key challenges: poor regional consistency affecting spatial information learning and inefficient utilization of dual-frequency data. These issues necessitate an effective collaborative classification approach. Method: A knowledge distillation network with a statistical-based dynamic sample rectification (SDSR) module and a dual-frequency gate-selected distillation (DGSD) module was developed. SDSR improves feature extraction by removing noisy pixels based on the complex Wishart distribution of PolSAR covariance matrices, while DGSD enables complementary learning from different frequency bands. Result: Comprehensive experiments on four real-world dual-frequency PolSAR datasets demonstrated that the SKDNet-SSR outperforms other state-of-the-art methods in classification accuracy. Conclusion: The proposed SKDNet-SSR method effectively addresses the challenges in dual-frequency PolSAR image classification, including regional consistency issues and optimal utilization of dual-frequency data. Experimental results confirm its superior performance over existing methods. Abstract: The collaborative classification of dual-frequency PolSAR images is a meaningful but challenging research problem. The effect of regional consistency on classification information learning and the rational use of dual-frequency data are two main difficulties for dual-frequency collaborative classification. To tackle these problems, a selected knowledge distillation network with statistical-based sample rectification (SKDNet-SSR) is proposed in this article. First, in addition to applying CNN and ViT as local and global feature extractors, a statistical-based dynamic sample rectification (SDSR) module is designed to avoid the impact of poor regional consistency on the spatial information learning process. 
Specifically, based on the fact that the PolSAR covariance matrix conforms to the complex Wishart distribution, SDSR first dynamically evaluates the sample purity, and then performs pixel selection and pixel generation to remove noisy pixels, thereby avoiding the feature interaction between informative pixels and noisy pixels and improving the classification feature extraction process. Next, a dual-frequency gate-selected distillation (DGSD) module is constructed to emphasize the advantages of different frequency bands and perform complementary learning on dual-frequency data. It uses the dominant single-frequency branch on each sample as the teacher model to train the dual-frequency student model, enabling the student model to learn the optimal results and realizing complementary utilization of dual-frequency data on different terrain objects. Comprehensive experiments on four measured dual-frequency PolSAR datasets demonstrate that the proposed SKDNet-SSR outperforms other related methods.

[185] ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization

Haosheng Gan,Berk Tinaz,Mohammad Shahab Sepehri,Zalan Fabian,Mahdi Soltanolkotabi

Main category: cs.CV

TL;DR: The paper proposes ConceptMix++, which optimizes prompts to evaluate the true capabilities of text-to-image models more fairly.

Details Motivation: Current text-to-image benchmarks may underestimate generative capability because of prompt sensitivity and biases that favor certain models. Method: Building on ConceptMix, the approach adds a multimodal optimization pipeline and uses iterative prompt optimization to disentangle prompt phrasing from visual generation capability. Result: Experiments show that optimized prompts significantly improve compositional generation performance, that certain visual concepts (such as spatial relationships and shapes) benefit more than others, and that optimized prompts transfer strongly across models. Conclusion: ConceptMix++ provides a framework for more accurately assessing and revealing the true capabilities of text-to-image models, and shows that rigid benchmarking approaches may significantly underestimate model performance. Abstract: Current text-to-image (T2I) benchmarks evaluate models on rigid prompts, potentially underestimating true generative capabilities due to prompt sensitivity and creating biases that favor certain models while disadvantaging others. We introduce ConceptMix++, a framework that disentangles prompt phrasing from visual generation capabilities by applying iterative prompt optimization. Building on ConceptMix, our approach incorporates a multimodal optimization pipeline that leverages vision-language model feedback to refine prompts systematically. Through extensive experiments across multiple diffusion models, we show that optimized prompts significantly improve compositional generation performance, revealing previously hidden model capabilities and enabling fairer comparisons across T2I models. Our analysis reveals that certain visual concepts -- such as spatial relationships and shapes -- benefit more from optimization than others, suggesting that existing benchmarks systematically underestimate model performance in these categories. Additionally, we find strong cross-model transferability of optimized prompts, indicating shared preferences for effective prompt phrasing across models. These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities, while our framework provides more accurate assessment and insights for future development.

[186] NOVO: Unlearning-Compliant Vision Transformers

Soumya Roy,Soumya Banerjee,Vinay Verma,Soumik Dasgupta,Deepak Gupta,Piyush Rai

Main category: cs.CV

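TL;DR: NOVO is an unlearning-aware vision transformer architecture that simulates unlearning during training, so later forget requests are served by withdrawing learned keys rather than by fine-tuning.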

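Details Motivation: Existing machine unlearning methods fine-tune on a forget and/or retain set, which is expensive or impractical and often degrades the unlearned model. Method: During training, the classes or sub-classes in each mini-batch are randomly split into a proxy forget-set and a retain-set, and the model, trained jointly with learnable keys and the original weights, is optimized so that it cannot predict the forget-set; forgetting is then performed by withdrawing keys. Result: Experiments on various datasets, architectures, and resolutions show NOVO outperforming both fine-tuning-free and fine-tuning-based methods, with membership inference attack scores used to validate that information is erased. Conclusion: Building unlearning into training lets a vision transformer forget on the fly, without fine-tuning on the requested set and without performance degradation. Abstract: Machine unlearning (MUL) refers to the problem of making a pre-trained model selectively forget some training instances or class(es) while retaining performance on the remaining dataset. Existing MUL research involves fine-tuning using a forget and/or retain set, making it expensive and/or impractical, and often causing performance degradation in the unlearned model. We introduce NOVO, an unlearning-aware vision transformer-based architecture that can directly perform unlearning for future unlearning requests without any fine-tuning over the requested set. The proposed model is trained by simulating unlearning during the training process itself. It involves randomly separating class(es)/sub-class(es) present in each mini-batch into two disjoint sets: a proxy forget-set and a retain-set, and the model is optimized so that it is unable to predict the forget-set. Forgetting is achieved by withdrawing keys, making unlearning on-the-fly and avoiding performance degradation. The model is trained jointly with learnable keys and original weights, ensuring withholding a key irreversibly erases information, validated by membership inference attack scores. Extensive experiments on various datasets, architectures, and resolutions confirm NOVO's superiority over both fine-tuning-free and fine-tuning-based methods.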
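The per-batch simulation step described in the abstract, randomly separating the classes in a mini-batch into a proxy forget-set and a retain-set, could look roughly like the sketch below. The helper name `split_batch_classes`, the `forget_fraction` parameter, and the rule that both sets stay non-empty are assumptions; the key-withdrawal mechanism and the training objective are not covered because the abstract does not specify them.

```python
import numpy as np

def split_batch_classes(labels, forget_fraction=0.5, rng=None):
    """Randomly split the classes present in a mini-batch into two disjoint
    sets and return boolean sample masks (forget_mask, retain_mask)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if len(classes) < 2:
        # Nothing to split: keep every sample in the retain-set.
        forget_mask = np.zeros(labels.shape, dtype=bool)
        return forget_mask, ~forget_mask
    n_forget = int(round(forget_fraction * len(classes)))
    n_forget = min(max(n_forget, 1), len(classes) - 1)  # keep both sets non-empty
    forget_classes = rng.choice(classes, size=n_forget, replace=False)
    forget_mask = np.isin(labels, forget_classes)
    return forget_mask, ~forget_mask

batch_labels = np.array([0, 0, 1, 2, 2, 3])
forget_mask, retain_mask = split_batch_classes(
    batch_labels, rng=np.random.default_rng(0))
print("proxy forget classes:", np.unique(batch_labels[forget_mask]))
print("retain classes:", np.unique(batch_labels[retain_mask]))
```

Because the split is redrawn for every mini-batch, each class is treated sometimes as "to forget" and sometimes as "to retain" over the course of training, which is what lets a later forget request be handled without a dedicated fine-tuning pass.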

[187] MolVision: Molecular Property Prediction with Vision Language Models

Deepan Adak,Yogesh Singh Rawat,Shruti Vyas

Main category: cs.CV

TL;DR: MolVision enhances molecular property prediction by integrating visual molecular structures and textual descriptions using VLMs, particularly effective when combined with efficient fine-tuning techniques like LoRA.

Details Motivation: The motivation is to improve molecular property prediction by overcoming the limitations of textual representations like SMILES/SELFIES, which can be ambiguous and structurally less informative. Method: The authors introduced MolVision, an approach that uses VLMs to integrate molecular structure images and textual descriptions for property prediction. They constructed a benchmark across ten datasets and evaluated nine different VLMs under zero-shot, few-shot, and fine-tuned settings. Result: The results showed that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies. Multimodal fusion significantly enhanced generalization across molecular properties, and adapting the vision encoder for molecular images along with LoRA further improved performance. Conclusion: The study concludes that incorporating visual information of molecular structures with textual descriptions using Vision-Language Models (VLMs) enhances the prediction of molecular properties, especially when combined with efficient fine-tuning methods like LoRA. Abstract: Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression and description tasks. 
Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adaptation of the vision encoder for molecular images in conjunction with LoRA further improves the performance. The code and data are available at: https://molvision.github.io/MolVision/.

[188] Zero-shot Inexact CAD Model Alignment from a Single Image

Pattaramanee Arsomngern,Sasikarn Khwanmuang,Matthias Nießner,Supasorn Suwajanakorn

Main category: cs.CV

TL;DR: This paper proposes a weakly supervised 9-DoF alignment method that requires no pose annotations, infers 3D scene structure from a single image, and performs well on unseen object categories.

Details Motivation: Existing methods rely on supervised training with images and pose annotations, which restricts them to a narrow set of object categories, so a more broadly applicable method is needed. Method: The method builds a new feature space on top of foundation features, using a self-supervised triplet loss to ensure multi-view consistency and overcome the symmetry ambiguities in foundation features. It also introduces a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates. Result: On the ScanNet25k dataset, the method outperforms existing weakly supervised baselines by 4.3% in mean alignment accuracy and is the only weakly supervised method to surpass the supervised ROCA (by 2.7%). It also achieves SOTA results on SUN2CAD, a test set with 20 novel object categories. Conclusion: The study develops a weakly supervised 3D model alignment method that works on unseen categories, with good generalization and practical potential. Abstract: One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories. Our approach derives a novel feature space based on foundation features that ensure multi-view consistency and overcome symmetry ambiguities inherent in foundation features using a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA weakly supervised baselines by +4.3% mean alignment accuracy and is the only weakly supervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, where our method achieves SOTA results without prior training on them.

[189] CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection

Xiangning Zhang,Jinnan Chen,Qingwei Zhang,Yaqi Wang,Chengfeng Zhou,Xiaobo Li,Dahong Qian

Main category: cs.CV

TL;DR: This paper proposes Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework for surgical phase recognition in Endoscopic Submucosal Dissection (ESD), demonstrating improved performance over existing approaches.

Details Motivation: Reliable surgical phase recognition within complex endoscopic workflows remains a critical bottleneck for clinical adoption of computer-assisted systems in Endoscopic Submucosal Dissection (ESD). This motivates the development of a more effective solution. Method: Clinical Prior Knowledge-Constrained Diffusion (CPKD) was developed to reimagine phase recognition through denoising diffusion principles. It progressively reconstructs phase sequences from random noise, conditioned on visual-temporal features, incorporating clinical prior knowledge to improve correction of phase logical errors. Result: CPKD achieves superior or comparable performance compared to current state-of-the-art methods in surgical phase recognition, as demonstrated by comprehensive evaluations on ESD820, Cholec80, and external multi-center datasets. Conclusion: The study concludes that CPKD, a novel generative framework, demonstrates superior or comparable performance to state-of-the-art approaches in surgical phase recognition, validating the effectiveness of diffusion-based generative paradigms. Abstract: Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. 
In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center datasets demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.

[190] Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model

Wooseok Shin,Jisu Kang,Hyeonki Jeong,Jin Sob Kim,Sung Won Han

Main category: cs.CV

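TL;DR: This paper proposes SemiOVS, a new semi-supervised semantic segmentation framework that uses an open-vocabulary segmentation model to make effective use of large numbers of unlabeled out-of-distribution (OOD) images, achieving state-of-the-art performance.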

Details Motivation: Existing semi-supervised semantic segmentation methods have shown good results mainly under controlled splits of benchmark datasets, but how to effectively use large-scale unlabeled and possibly out-of-distribution (OOD) images in real-world scenarios remains unexplored. Method: The paper proposes SemiOVS, a new semi-supervised semantic segmentation framework that uses an open-vocabulary segmentation (OVS) model to pseudo-label OOD images, and runs extensive experiments on the Pascal VOC and Context datasets. Result: Using additional unlabeled images improves semi-supervised learners in label-scarce settings, and pseudo-labeling OOD images with the OVS model yields substantial gains; SemiOVS outperforms PrevMatch and SemiVL by +3.5 and +3.0 mIoU, respectively, on Pascal VOC. Conclusion: SemiOVS effectively uses abundant unlabeled OOD images for semantic segmentation and achieves state-of-the-art performance; the authors hope it will inspire future research and real-world applications. Abstract: In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out-of-distribution (OOD). Using these images as unlabeled data in semi-supervised learning can lead to inaccurate pseudo-labels, potentially misguiding network training. In this paper, we propose a new semi-supervised semantic segmentation framework with an open-vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi-supervised learners in scenarios with few labels, and (2) using the open-vocabulary segmentation (OVS) model to pseudo-label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92-label setting, achieving state-of-the-art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real-world applications.
The code is available at https://github.com/wooseok-shin/SemiOVS
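
The gains above are reported in mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal per-image sketch in Python with NumPy. The function name `mean_iou` and the choice to skip classes absent from both maps are assumptions made for illustration; benchmark protocols usually accumulate intersections and unions over the whole dataset, and this is not the authors' evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between two integer label maps.

    Classes that appear in neither map are skipped (an assumption of this sketch).
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.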

[191] Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations

Hai Huang,Yan Xia,Sashuai Zhou,Hanting Wang,Shulei Wang,Zhou Zhao

Main category: cs.CV

TL;DR: This paper proposes a novel method for Multi-modal Domain Generalization (MMDG) using unified representations and a supervised disentanglement framework, achieving better performance than existing approaches.

Details Motivation: Existing Domain Generalization (DG) techniques primarily address single-modal data, which leads to sub-optimal results when directly applied to Multi-modal Domain Generalization (MMDG). This work aims to bridge that gap by considering inter-modal consistency and synchronization. Method: A novel approach using Unified Representations to synchronize multi-modal improvements within a shared space, along with a supervised disentanglement framework to separate modal-general and modal-specific information. Result: Extensive experiments show the effectiveness and superiority of the proposed method in enhancing multi-modal domain generalization on datasets like EPIC-Kitchens and Human-Animal-Cartoon. Conclusion: The proposed approach effectively addresses the challenges of Multi-modal Domain Generalization (MMDG) by leveraging unified representations and a supervised disentanglement framework, demonstrating superior performance on benchmark datasets. Abstract: Domain Generalization (DG) aims to enhance model robustness in unseen or distributionally shifted target domains through training exclusively on source domains. Although existing DG techniques, such as data manipulation, learning strategies, and representation learning, have shown significant progress, they predominantly address single-modal data. With the emergence of numerous multi-modal datasets and increasing demand for multi-modal tasks, a key challenge in Multi-modal Domain Generalization (MMDG) has emerged: enabling models trained on multi-modal sources to generalize to unseen target distributions within the same modality set. Due to the inherent differences between modalities, directly transferring methods from single-modal DG to MMDG typically yields sub-optimal results. These methods often exhibit randomness during generalization due to the invisibility of target domains and fail to consider inter-modal consistency. 
Applying these methods independently to each modality in the MMDG setting before combining them can lead to divergent generalization directions across different modalities, resulting in degraded generalization capabilities. To address these challenges, we propose a novel approach that leverages Unified Representations to map different paired modalities together, effectively adapting DG methods to MMDG by enabling synchronized multi-modal improvements within the unified space. Additionally, we introduce a supervised disentanglement framework that separates modal-general and modal-specific information, further enhancing the alignment of unified representations. Extensive experiments on benchmark datasets, including EPIC-Kitchens and Human-Animal-Cartoon, demonstrate the effectiveness and superiority of our method in enhancing multi-modal domain generalization.

[192] MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion

Peilin Tao,Hainan Cui,Diantao Tu,Shuhan Shen

Main category: cs.CV

TL;DR: This paper proposes a novel and efficient global motion averaging framework for multi-camera systems, enhancing Structure-from-Motion robustness and accuracy for applications like autonomous vehicles.

Details Motivation: Traditional global SfM systems struggle with robustness due to their optimization framework, necessitating a more effective solution for multi-camera systems used in autonomous vehicles and robotics. Method: The framework includes a decoupled rotation averaging module using a hierarchical strategy and a hybrid translation averaging module incorporating camera-to-camera and camera-to-point constraints. It utilizes both convex distance-based and unbiased non-bilinear angle-based objective functions for initialization and refinement. Result: Experiments on large-scale datasets show that the proposed system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Conclusion: The proposed global motion averaging framework for multi-camera systems demonstrates robustness and efficiency, outperforming existing global SfM methods in real-world applications. Abstract: Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. 
Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. The code is available at https://github.com/3dv-casia/MGSfM/.
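
The hierarchical rotation strategy described above (relative rotations inside a rigid camera unit first, then one global rotation per unit) implies that each camera's global rotation is a composition of the two. The following Python sketch illustrates only that composition step. The world-to-unit and unit-to-camera conventions and the helper names `rot_z` and `compose_camera_rotation` are assumptions for illustration, since the abstract does not specify them, and no averaging is performed here.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def compose_camera_rotation(R_unit_from_world, R_cam_from_unit):
    """World-to-camera rotation from a world-to-unit and a unit-to-camera rotation."""
    return R_cam_from_unit @ R_unit_from_world

# A rigid unit rotated 0.3 rad in the world, with a camera mounted at 0.2 rad
# relative to the unit (hypothetical values).
R_unit = rot_z(0.3)
R_rel = rot_z(0.2)
R_cam = compose_camera_rotation(R_unit, R_rel)
```

Because the relative rotation is shared by every frame captured by the same rig, only one global rotation per rigid unit remains to be estimated, which is what makes the decoupling attractive.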

[193] Personalized Image Generation from an Author Writing Style

Sagar Gandhi,Vishal Gandhi

Main category: cs.CV

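TL;DR: This paper proposes a new method that uses Author Writing Sheets (AWS) with a large language model and a diffusion model to generate images matching an author's style, and validates it empirically.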

Details Motivation: Translating nuanced, textually defined authorial writing styles into visual representations is a new challenge; this paper aims to build a pipeline that performs this translation. Method: The study uses Author Writing Sheets (AWS), structured summaries of an author's literary characteristics, as input to a large language model (LLM, Claude 3.7 Sonnet), which generates three distinct text-to-image prompts that are then rendered by a diffusion model (Stable Diffusion 3.5 Medium). Result: The generated images show good perceived alignment with the textual authorial profiles (mean style match: 4.08/5) and are rated as moderately distinctive; qualitative analysis shows the pipeline captures mood and atmosphere well but struggles to represent highly abstract narrative elements. Conclusion: The work contributes a novel end-to-end methodology for visual authorial style personalization and provides an initial empirical validation, opening avenues for creative assistance and cross-modal understanding. Abstract: Translating nuanced, textually-defined authorial writing styles into compelling visual representations presents a novel challenge in generative AI. This paper introduces a pipeline that leverages Author Writing Sheets (AWS) - structured summaries of an author's literary characteristics - as input to a Large Language Model (LLM, Claude 3.7 Sonnet). The LLM interprets the AWS to generate three distinct, descriptive text-to-image prompts, which are then rendered by a diffusion model (Stable Diffusion 3.5 Medium). We evaluated our approach using 49 author styles from Reddit data, with human evaluators assessing the stylistic match and visual distinctiveness of the generated images. Results indicate a good perceived alignment between the generated visuals and the textual authorial profiles (mean style match: $4.08/5$), with images rated as moderately distinctive. Qualitative analysis further highlighted the pipeline's ability to capture mood and atmosphere, while also identifying challenges in representing highly abstract narrative elements. This work contributes a novel end-to-end methodology for visual authorial style personalization and provides an initial empirical validation, opening avenues for applications in creative assistance and cross-modal understanding.

[194] Source-Free Domain Adaptation via Multi-view Contrastive Learning

Amirfarhad Farhadi,Naser Mozayani,Azadeh Zamanifar

Main category: cs.CV

TL;DR: This paper proposes a novel method for Source-Free Unsupervised Domain Adaptation that enhances prototype samples and pseudo-labels, achieving significant accuracy improvements in domain adaptation tasks.

Details Motivation: Privacy concerns restrict access to labeled source domain data, making Source-Free Unsupervised Domain Adaptation (SFUDA) an attractive solution. However, challenges such as low-quality prototype samples and incorrect pseudo-label assignments remain. Method: The method includes three phases: a Reliable Sample Memory (RSM) module for better prototypes, a Multi-View Contrastive Learning (MVCL) approach to improve pseudo-labels, and a noisy label filtering technique for refinement. Result: Experiments on three benchmark datasets show approximately 2 percent and 6 percent improvements in classification accuracy over the second-best method and the average of 13 state-of-the-art approaches, respectively. Conclusion: The proposed method addresses the challenges in Source-Free Unsupervised Domain Adaptation by improving prototype sample quality and pseudo-label assignment, showing significant improvements in classification accuracy. Abstract: Domain adaptation has become a widely adopted approach in machine learning due to the high costs associated with labeling data. It is typically applied when access to a labeled source domain is available. However, in real-world scenarios, privacy concerns often restrict access to sensitive information, such as fingerprints, bank account details, and facial images. A promising solution to this issue is Source-Free Unsupervised Domain Adaptation (SFUDA), which enables domain adaptation without requiring access to labeled source domain data. Recent research demonstrates that SFUDA can effectively address domain discrepancies; however, two key challenges remain: (1) the low quality of prototype samples, and (2) the incorrect assignment of pseudo-labels. To tackle these challenges, we propose a method consisting of three main phases. In the first phase, we introduce a Reliable Sample Memory (RSM) module to improve the quality of prototypes by selecting more representative samples.
In the second phase, we employ a Multi-View Contrastive Learning (MVCL) approach to enhance pseudo-label quality by leveraging multiple data augmentations. In the final phase, we apply a noisy label filtering technique to further refine the pseudo-labels. Our experiments on three benchmark datasets - VisDA 2017, Office-Home, and Office-31 - demonstrate that our method achieves approximately 2 percent and 6 percent improvements in classification accuracy over the second-best method and the average of 13 well-known state-of-the-art approaches, respectively.

[195] Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents

Zhao Wang,Bowen Chen,Yotaro Shimose,Sota Moriyama,Heng Wang,Shingo Takamatsu

Main category: cs.CV

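TL;DR: This paper proposes MIMO, a new framework for automatic ad banner generation that combines a multi-modal agent system with a coordination loop to improve design quality and stylistic diversity.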

Details Motivation: Although recent generative models such as GPT-4o produce high-quality images, they still fall short in commercial design tasks on layout structure, typographic precision, and brand consistency. Method: The paper introduces MIMO, an agentic refinement framework whose core components, MIMO-Core and MIMO-Loop, automatically detect and correct errors during generation. Result: Experiments show that MIMO significantly outperforms existing diffusion and LLM-based baselines in real-world banner design scenarios. Conclusion: MIMO is a framework for automatic ad banner generation that combines a hierarchical multi-modal agent system with a coordination loop, exploring multiple stylistic directions and iteratively improving design quality. Abstract: Recent generative models such as GPT-4o have shown strong capabilities in producing high-quality images with accurate text rendering. However, commercial design tasks like advertising banners demand more than visual fidelity -- they require structured layouts, precise typography, consistent branding, and more. In this paper, we introduce MIMO (Mirror In-the-Model), an agentic refinement framework for automatic ad banner generation. MIMO combines a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality. Requiring only a simple natural language based prompt and logo image as input, MIMO automatically detects and corrects multiple types of errors during generation. Experiments show that MIMO significantly outperforms existing diffusion and LLM-based baselines in real-world banner design scenarios.

[196] Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling

Mingzhuo Li,Guang Li,Jiafeng Mao,Linfeng Ye,Takahiro Ogawa,Miki Haseyama

Main category: cs.CV

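TL;DR: This paper proposes a task-difficulty-based generative dataset distillation method that performs well on classification and has broad application potential.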

Details Motivation: Existing dataset distillation methods focus mainly on aligning the distilled dataset with the original one and overlook task-specific information that can be critical for downstream tasks. This paper addresses the problem by using task difficulty information to guide generative dataset distillation for classification. Method: The final dataset is sampled from a larger image pool using a sampling distribution obtained by matching the difficulty distribution of the original dataset, with a logarithmic transformation applied as pre-processing so that the result better serves the target task. Result: Extensive experiments confirm the effectiveness of the method, showing improved classification performance and suggesting that it can extend to other downstream tasks. Conclusion: The paper proposes a generative dataset distillation method for classification that incorporates task difficulty information to improve downstream performance. The experiments demonstrate its effectiveness and suggest potential value for other tasks. Abstract: To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks.
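
The sampling idea in this abstract (draw from a large pool so that the difficulty distribution of the sample follows that of the original dataset, after a logarithmic transformation) can be sketched with simple histogram matching. Everything below is an assumption made for illustration rather than the authors' procedure: the function name `difficulty_matched_sample`, the use of positive scalar difficulty scores, the number of bins, and histogram matching itself as the matching mechanism.

```python
import numpy as np

def difficulty_matched_sample(pool_difficulty, target_difficulty, n_samples,
                              n_bins=10, seed=0):
    """Sample pool indices so their difficulty histogram follows the target's.

    Scores are assumed positive (for example 1 - p(true class)); this scoring
    rule is a placeholder, not the paper's definition of difficulty.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-8
    # Log-transform both score sets (assumed pre-processing for skewed scores).
    pool_log = np.log(np.asarray(pool_difficulty, dtype=float) + eps)
    target_log = np.log(np.asarray(target_difficulty, dtype=float) + eps)

    # Shared bin edges that cover both score sets.
    lo = min(pool_log.min(), target_log.min())
    hi = max(pool_log.max(), target_log.max())
    edges = np.linspace(lo, hi, n_bins + 1)

    # Probability mass of the target (original dataset) in each bin.
    target_hist, _ = np.histogram(target_log, bins=edges)
    target_prob = target_hist / target_hist.sum()

    # Bin index of every pool item and the number of pool items per bin.
    pool_bin = np.digitize(pool_log, edges[1:-1])
    pool_count = np.bincount(pool_bin, minlength=n_bins)

    # Spread each bin's target mass evenly over the pool items in that bin.
    weights = target_prob[pool_bin] / pool_count[pool_bin]
    weights = weights / weights.sum()
    return rng.choice(len(pool_log), size=n_samples, replace=False, p=weights)

# Hypothetical scores: a broad pool and a target skewed toward hard samples.
pool = np.random.default_rng(1).uniform(0.01, 1.0, 1000)
target = np.random.default_rng(2).uniform(0.5, 1.0, 200)
chosen = difficulty_matched_sample(pool, target, n_samples=50)
```

Sampling without replacement requires at least `n_samples` pool items in bins where the target has mass; a real pipeline would need to handle the case where the pool does not cover the target's difficulty range.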

[197] De-Fake: Style based Anomaly Deepfake Detection

Sudev Kumar Padhi,Harshit Kumar,Umesh Kashyap,Sk. Subidh Ali

Main category: cs.CV

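TL;DR: This paper proposes SafeVision, a new privacy-preserving method that detects face-swap deepfakes by analyzing style discrepancies in images, and reports good detection performance.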

Details Motivation: Existing deepfake detection methods struggle with face-swap deepfakes and face challenges in data diversity and privacy protection. Meanwhile, face-swapping is widely abused, for example to spread false information, damage reputations, and create non-consensual intimate deepfakes (NCID) and child sexual abuse material (CSAM), so an effective detection method is urgently needed. Method: The approach detects face-swap deepfakes by identifying style discrepancies in images, and is evaluated comprehensively across multiple datasets and face-swapping methods. Result: SafeVision detects face-swap deepfakes effectively across diverse scenarios. Conclusion: The authors present SafeVision as the first deepfake detection tool to use style features while providing inherent privacy protection, offering a reliable and scalable solution for detecting face-swap deepfakes. Abstract: Detecting deepfakes involving face-swaps presents a significant challenge, particularly in real-world scenarios where anyone can perform face-swapping with freely available tools and apps without any technical knowledge. Existing deepfake detection methods rely on facial landmarks or inconsistencies in pixel-level features and often struggle with face-swap deepfakes, where the source face is seamlessly blended into the target image or video. The prevalence of face-swap is evident in everyday life, where it is used to spread false information, damage reputations, manipulate political opinions, create non-consensual intimate deepfakes (NCID), and exploit children by enabling the creation of child sexual abuse material (CSAM). Even prominent public figures are not immune to its impact, with numerous deepfakes of them circulating widely across social media platforms. Another challenge faced by deepfake detection methods is the creation of datasets that encompass a wide range of variations, as training models require substantial amounts of data. This raises privacy concerns, particularly regarding the processing and storage of personal facial data, which could lead to unauthorized access or misuse. Our key idea is to identify these style discrepancies to detect face-swapped images effectively without accessing the real facial image. We perform comprehensive evaluations using multiple datasets and face-swapping methods, which showcases the effectiveness of SafeVision in detecting face-swap deepfakes across diverse scenarios.
SafeVision offers a reliable and scalable solution for detecting face-swaps in a privacy preserving manner, making it particularly effective in challenging real-world applications. To the best of our knowledge, SafeVision is the first deepfake detection using style features while providing inherent privacy protection.

[198] DESign: Dynamic Context-Aware Convolution and Efficient Subnet Regularization for Continuous Sign Language Recognition

Sheng Liu,Yiheng Yu,Yuan Feng,Min Xu,Zhelun Jin,Yining Jiang,Tiantian Yuan

Main category: cs.CV

TL;DR: This paper proposes DESign, which uses DCAC for better contextual adaptation and SR-CTC for improved training in continuous sign language recognition, achieving state-of-the-art results.

Details Motivation: Current CSLR methods struggle with diverse samples and temporal dynamics. Dynamic convolutions mainly focus on spatial modeling, and CTC learning overfits to a dominant path during training. Method: The paper proposes DESign with Dynamic Context-Aware Convolution (DCAC) to capture inter-frame motion cues and Subnet Regularization Connectionist Temporal Classification (SR-CTC) to enhance training by exploring diverse alignment paths. Result: Extensive experiments show that DESign achieves superior performance on PHOENIX14, PHOENIX14-T, and CSL-Daily datasets, validating the effectiveness of DCAC and SR-CTC through ablations and visualizations. Conclusion: DESign, incorporating DCAC and SR-CTC, achieves state-of-the-art performance on mainstream CSLR datasets while introducing no inference overhead and effectively improving recognition accuracy. Abstract: Current continuous sign language recognition (CSLR) methods struggle with handling diverse samples. Although dynamic convolutions are ideal for this task, they mainly focus on spatial modeling and fail to capture the temporal dynamics and contextual dependencies. To address this, we propose DESign, a novel framework that incorporates Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization Connectionist Temporal Classification (SR-CTC). DCAC dynamically captures the inter-frame motion cues that constitute signs and uniquely adapts convolutional weights in a fine-grained manner based on contextual information, enabling the model to better generalize across diverse signing behaviors and boost recognition accuracy. Furthermore, we observe that existing methods still rely on only a limited number of frames for parameter updates during training, indicating that CTC learning overfits to a dominant path. To address this, SR-CTC regularizes training by applying supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and effectively preventing overfitting. 
A classifier-sharing strategy in SR-CTC further strengthens multi-scale consistency. Notably, SR-CTC introduces no inference overhead and can be seamlessly integrated into existing CSLR models to boost performance. Extensive ablations and visualizations further validate the effectiveness of the proposed methods. Results on mainstream CSLR datasets (i.e., PHOENIX14, PHOENIX14-T, CSL-Daily) demonstrate that DESign achieves state-of-the-art performance.

[199] Be the Change You Want to See: Revisiting Remote Sensing Change Detection Practices

Blaž Rolih,Matic Fučka,Filip Wolf,Luka Čehovin Zajc

Main category: cs.CV

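TL;DR: With systematically optimized fundamental design choices, even a simple model can reach state-of-the-art performance in change detection.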

Details Motivation: Many new methods attribute performance gains to added complex components while overlooking that fundamental design choices (such as backbone selection, pre-training strategies, and training configurations) may affect performance even more. Method: The paper systematically revisits the design space of change detection models, analyzes the potential of a well-optimized baseline, and identifies a set of beneficial fundamental design choices. Result: A carefully designed, architecturally simple model matches or surpasses state-of-the-art performance on six challenging change detection datasets, and the best practices transfer to related methods. Conclusion: Optimizing fundamental design choices matters as much as architectural novelty for model performance, and the paper offers a set of best practices to guide future methods. Abstract: Remote sensing change detection aims to localize semantic changes between images of the same location captured at different times. In the past few years, newer methods have attributed enhanced performance to the additions of new and complex components to existing architectures. Most fail to measure the performance contribution of fundamental design choices such as backbone selection, pre-training strategies, and training configurations. We claim that such fundamental design choices often improve performance even more significantly than the addition of new architectural components. Due to that, we systematically revisit the design space of change detection models and analyse the full potential of a well-optimised baseline. We identify a set of fundamental design choices that benefit both new and existing architectures. Leveraging this insight, we demonstrate that when carefully designed, even an architecturally simple model can match or surpass state-of-the-art performance on six challenging change detection datasets. Our best practices generalise beyond our architecture and also offer performance improvements when applied to related methods, indicating that the space of fundamental design choices has been underexplored. Our guidelines and architecture provide a strong foundation for future methods, emphasizing that optimizing core components is just as important as architectural novelty in advancing change detection performance. Code: https://github.com/blaz-r/BTC-change-detection

[200] MRC-DETR: An Adaptive Multi-Residual Coupled Transformer for Bare Board PCB Defect Detection

Jiangzhong Cao,Huanqi Wu,Xu Zhang,Lianghong Tan,Huan Zhang

Main category: cs.CV

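TL;DR: This paper proposes MRC-DETR, an efficient and accurate PCB defect detection framework that combines a Multi-Residual Directional Coupled Block with an Adaptive Screening Pyramid Network, and presents a new high-quality PCB defect dataset.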

Details Motivation: Existing PCB defect detection methods are limited by weak feature representation, high computational redundancy, and a lack of high-quality training data, so they cannot meet industrial demands for both accuracy and efficiency. Method: The paper proposes a Multi-Residual Directional Coupled Block (MRDCB) to strengthen feature representation, introduces an Adaptive Screening Pyramid Network (ASPN) to reduce redundant computation in cross-layer fusion, and constructs a new high-quality dataset to fill the gap in current public resources. Result: Experiments show that the method outperforms existing techniques in both detection accuracy and computational efficiency, and the new dataset provides a valuable benchmark for future research. Conclusion: MRC-DETR is a novel and efficient detection framework for bare PCB defect detection that addresses the limitations of existing methods in feature representation, computational redundancy, and insufficient training data. Abstract: In modern electronic manufacturing, defect detection on Printed Circuit Boards (PCBs) plays a critical role in ensuring product yield and maintaining the reliability of downstream assembly processes. However, existing methods often suffer from limited feature representation, computational redundancy, and insufficient availability of high-quality training data -- challenges that hinder their ability to meet industrial demands for both accuracy and efficiency. To address these limitations, we propose MRC-DETR, a novel and efficient detection framework tailored for bare PCB defect inspection, built upon the foundation of RT-DETR. Firstly, to enhance feature representation capability, we design a Multi-Residual Directional Coupled Block (MRDCB). This module improves channel-wise feature interaction through a multi-residual structure. Moreover, a cross-spatial learning strategy is integrated to capture fine-grained pixel-level relationships, further enriching the representational power of the extracted features. Secondly, to reduce computational redundancy caused by inefficient cross-layer information fusion, we introduce an Adaptive Screening Pyramid Network (ASPN). This component dynamically filters and aggregates salient low-level features, selectively fusing them with high-level semantic features. By focusing on informative regions and suppressing redundant computations, ASPN significantly improves both efficiency and detection accuracy. Finally, to tackle the issue of insufficient training data, particularly in the context of bare PCBs, we construct a new, high-quality dataset that fills a critical gap in current public resources.
Our dataset not only supports the training and evaluation of our proposed framework but also serves as a valuable benchmark for future research in this domain.

[201] Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

Yufan Zhou,Zhaobo Qi,Lingshuai Lin,Junqi Jing,Tingting Chai,Beichen Zhang,Shuhui Wang,Weigang Zhang

Main category: cs.CV

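TL;DR: This study proposes MTID, a new model for procedure planning in instructional videos that improves latent-space interpolation and introduces an action-aware mask mechanism, raising the accuracy and contextual relevance of predicted action sequences.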

Details Motivation: Existing methods rely on text-level supervision and struggle to capture intricate temporal relationships, so a new approach is needed that strengthens visual supervision and improves the coherence of action prediction. Method: The paper proposes the Masked Temporal Interpolation Diffusion (MTID) model, which combines a diffusion model with a learnable interpolation matrix and optimizes training with a task-adaptive masked loss. Result: On three widely used benchmark datasets, MTID achieves good action planning performance on most metrics. Conclusion: By introducing a latent-space temporal interpolation module and an action-aware mask projection mechanism, MTID markedly improves the accuracy and contextual relevance of the generated action sequences for procedure planning. Abstract: In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at https://github.com/WiserZhou/MTID.

[202] Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering

Qing Li,Huifang Feng,Xun Gong,Yu-Shen Liu

Main category: cs.CV

TL;DR: This paper introduces a novel method (LGSF) for estimating normals in noisy 3D point clouds, showing improved performance over existing techniques.

Details Motivation: Estimating normals for noisy point clouds is challenging, especially with end-to-end methods. Existing approaches work best on clean data and use supervised learning, which limits their effectiveness on noisy datasets. Method: They propose a new approach called LGSF (Local Gradient-Aware Surface Filtering) that projects noisy points onto the underlying surface using normals and distances from an implicit function constrained by local gradients. Result: Their method shows superior performance in normal estimation, surface reconstruction, and point cloud denoising compared to current techniques. Conclusion: The paper concludes that their proposed method for normal estimation in noisy point clouds outperforms existing methods, achieving state-of-the-art results. Abstract: Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. 
Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models are available at https://github.com/LeoQLi/LGSF.
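
The projection step named in this abstract (moving a noisy point onto the underlying surface using a normal and a distance derived from an implicit function) can be written as p' = p - f(p) n(p), where f is a signed distance and n is the normalized gradient of f. The sketch below uses an analytic unit-sphere signed distance function as a stand-in for the learned implicit function. The function names are assumptions for illustration, and none of the paper's gradient-consistency constraints are modeled.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Signed distance to an origin-centred sphere (stand-in implicit function)."""
    return np.linalg.norm(points, axis=1) - radius

def sphere_sdf_gradient(points):
    """Analytic gradient of the sphere SDF: the unit vector pointing away from the origin."""
    return points / np.linalg.norm(points, axis=1, keepdims=True)

def project_to_surface(points, sdf, grad):
    """Move each point along its normal by its signed distance: p' = p - f(p) * n(p)."""
    normals = grad(points)
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    distances = sdf(points)
    return points - distances[:, None] * normals, normals

# Noisy samples around the unit sphere (synthetic data for the sketch).
rng = np.random.default_rng(0)
directions = rng.normal(size=(100, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
noisy = directions * (1.0 + 0.05 * rng.normal(size=(100, 1)))
projected, normals = project_to_surface(noisy, sphere_sdf, sphere_sdf_gradient)
```

With an exact signed distance function a single step lands on the surface; with a learned, approximate implicit function the step would typically be iterated or regularized.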

[203] Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images

Yuran Dong,Mang Ye

Main category: cs.CV

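TL;DR: This paper proposes Pose-Star, a framework that addresses the over-emphasis on generator optimization and the lack of mask controllability in fashion image editing.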

Details Motivation: Existing two-stage pipelines (mask generation followed by diffusion-based editing) focus too heavily on generator optimization and neglect mask controllability, which leads to poor user-defined flexibility and weak pose robustness. Method: Pose-Star calibrates diffusion-derived attention (Star tokens) with skeletal keypoints, uses phase-aware analysis, threshold masking, and sliding-window fusion to improve localization of rare structures and suppress noise, and refines edges through cross-self attention merging and Canny alignment. Result: The framework generates anatomy-aware masks dynamically, improving editing under complex poses and giving users more freedom in defining edits. Conclusion: The work bridges controlled benchmarks and open-world demands and lays a foundation for industrial fashion image editing. Abstract: To advance real-world fashion image editing, we analyze existing two-stage pipelines (mask generation followed by diffusion-based editing), which overly prioritize generator optimization while neglecting mask controllability. This results in two critical limitations: I) poor user-defined flexibility (coarse-grained human masks restrict edits to predefined regions like upper torso; fine-grained clothes masks preserve poses but forbid style/length customization). II) weak pose robustness (mask generators fail due to articulated poses and miss rare regions like waist, while human parsers remain limited by predefined categories). To address these gaps, we propose Pose-Star, a framework that dynamically recomposes body structures (e.g., neck, chest, etc.) into anatomy-aware masks (e.g., chest-length) for user-defined edits. In Pose-Star, we calibrate diffusion-derived attention (Star tokens) via skeletal keypoints to enhance rare structure localization in complex poses, suppress noise through phase-aware analysis of attention dynamics (Convergence, Stabilization, Divergence) with threshold masking and sliding-window fusion, and refine edges via cross-self attention merging and Canny alignment. This work bridges controlled benchmarks and open-world demands, pioneering anatomy-aware, pose-robust editing and laying the foundation for industrial fashion image editing.

[204] Rectifying Adversarial Sample with Low Entropy Prior for Test-Time Defense

Lina Ma,Xiaowei Fu,Fuxiang Huang,Xinbo Gao,Lei Zhang

Main category: cs.CV

TL;DR: This paper introduces the REAL method for adversarial sample rectification using a Max-Min entropy optimization scheme guided by the low entropy prior, improving robustness against unknown attacks.

Details Motivation: Existing defense methods struggle with generalizing against unknown attacks. The authors aim to uncover underlying commonalities among various attacks to improve adversarial robustness during inference. Method: The authors propose a two-stage REAL approach: first rectifying adversarial samples by reverse maximizing prediction entropy, followed by secondary rectification through forward minimizing prediction entropy (Max-Min entropy optimization). They also introduce an attack-aware weighting mechanism to adaptively adjust these objectives. Result: Experiments show that the REAL method greatly enhances the effectiveness of existing test-time defense approaches in defending against unseen adversarial attacks. Conclusion: The REAL approach significantly improves the performance of existing sample rectification models by leveraging the low entropy prior in adversarial samples, achieving universal robustness against unseen attacks. Abstract: Existing defense methods fail to defend against unknown attacks and thus raise generalization issue of adversarial robustness. To remedy this problem, we attempt to delve into some underlying common characteristics among various attacks for generality. In this work, we reveal the commonly overlooked low entropy prior (LE) implied in various adversarial samples, and shed light on the universal robustness against unseen attacks in inference phase. LE prior is elaborated as two properties across various attacks as shown in Fig. 1 and Fig. 2: 1) low entropy misclassification for adversarial samples and 2) lower entropy prediction for higher attack intensity. This phenomenon stands in stark contrast to the naturally distributed samples. The LE prior can instruct existing test-time defense methods, thus we propose a two-stage REAL approach: Rectify Adversarial sample based on LE prior for test-time adversarial rectification. 
Specifically, to align adversarial samples more closely with clean samples, we propose to first rectify adversarial samples misclassified with low entropy by reverse maximizing prediction entropy, thereby eliminating their adversarial nature. To ensure the rectified samples can be correctly classified with low entropy, we carry out secondary rectification by forward minimizing prediction entropy, thus creating a Max-Min entropy optimization scheme. Further, based on the second property, we propose an attack-aware weighting mechanism to adaptively adjust the strengths of Max-Min entropy objectives. Experiments on several datasets show that REAL can greatly improve the performance of existing sample rectification models.
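
The low entropy prior above is stated in terms of prediction entropy, so a small worked example may help. The following is a minimal sketch of how the Shannon entropy of a classifier's softmax output can be computed; the function names and the example logits are illustrative assumptions, not the authors' REAL implementation.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over classes."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits: a very confident prediction versus a uniform one.
confident = prediction_entropy([10.0, 0.0, 0.0])
uncertain = prediction_entropy([1.0, 1.0, 1.0])
```

Under the abstract's description, the first rectification stage would push this quantity up for a suspicious input and the second stage would push it back down; how the gradients are applied to the sample is not specified here.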

[205] Unlearning the Noisy Correspondence Makes CLIP More Robust

Haochen Han,Alex Jinpeng Wang,Peijun Ye,Fangming Liu

Main category: cs.CV

TL;DR: This paper introduces NCU, an efficient fine-tuning framework for Vision-Language Models (VLMs), which directly addresses the problem of Noisy Correspondence by unlearning harmful knowledge. It achieves superior performance over existing methods while being computationally efficient.

Details Motivation: As Vision-Language Models (VLMs) scale up in data appetite, they encounter significant challenges due to noisy data with unrelated semantics (Noisy Correspondence). These issues impair model performance, and traditional approaches struggle to address them efficiently. Method: The paper proposes NCU, a Noisy Correspondence Unlearning fine-tuning framework that focuses on learning the hardest negative information to provide explicit unlearning direction for false positives and false negatives. This process is formalized into a unified optimal transport objective for efficient fine-tuning. Result: The proposed NCU framework outperforms robust pre-training methods in zero-shot transfer tasks using the CLIP model, achieving better results with reduced computational overhead. Conclusion: The paper concludes that directly eliminating harmful effects of Noisy Correspondence (NC) in pre-trained Vision-Language Models (VLMs) significantly enhances robustness and performance, surpassing existing methods while maintaining lower computational overhead. Abstract: The data appetite for Vision-Language Models (VLMs) has continuously scaled up from the early millions to billions today, which faces an untenable trade-off with data quality and inevitably introduces Noisy Correspondence (NC) samples. Undoubtedly, such semantically unrelated data significantly impairs the performance of VLMs. Previous efforts mainly address this challenge by estimating refined alignment for more precise guidance. However, such resource-intensive pipelines that train VLMs from scratch struggle to meet realistic data demands. In this paper, we present a brand new perspective that seeks to directly eliminate the harmful effects of NC in pre-trained VLMs. Specifically, we propose NCU, a Noisy Correspondence Unlearning fine-tuning framework that efficiently enhances VLMs' robustness by forgetting learned noisy knowledge. 
The key to NCU is learning the hardest negative information, which can provide explicit unlearning direction for both false positives and false negatives. Such twin goals unlearning process can be formalized into one unified optimal transport objective for fast fine-tuning. We validate our approach with the prevailing CLIP model over various downstream tasks. Remarkably, NCU surpasses the robust pre-trained method on zero-shot transfer while with lower computational overhead. The code will be released upon acceptance.

[206] Radar Tracker: Moving Instance Tracking in Sparse and Noisy Radar Point Clouds

Matthias Zeller,Daniel Casado Herraez,Jens Behley,Michael Heidingsfeld,Cyrill Stachniss

Main category: cs.CV

TL;DR: This paper proposes a new attention-based radar tracking method that combines temporal offset predictions with appearance features to track moving instances more reliably in sparse radar point clouds.

Details Motivation: Autonomous vehicles need to perceive their surroundings accurately; in particular, segmenting and tracking moving objects in sparse radar point clouds is essential for reliable path planning. Method: An attention-based mechanism with temporal offset predictions tracks moving instances in radar point clouds, and geometric and appearance features are fused for association. Result: On the moving instance tracking benchmark of the RadarScenes dataset, the method outperforms the current state of the art. Conclusion: The paper presents a learning-based radar tracker that combines geometric and appearance features to improve moving instance tracking performance. Abstract: Robots and autonomous vehicles should be aware of what happens in their surroundings. The segmentation and tracking of moving objects are essential for reliable path planning, including collision avoidance. We investigate this estimation task for vehicles using radar sensing. We address moving instance tracking in sparse radar point clouds to enhance scene interpretation. We propose a learning-based radar tracker incorporating temporal offset predictions to enable direct center-based association and enhance segmentation performance by including additional motion cues. We implement attention-based tracking for sparse radar scans to include appearance features and enhance performance. The final association combines geometric and appearance features to overcome the limitations of center-based tracking to associate instances reliably. Our approach shows an improved performance on the moving instance tracking benchmark of the RadarScenes dataset compared to the current state of the art.

[207] Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach

Leyan Xue,Zongbo Han,Guangyu Wang,Qinghua Hu,Mingyue Cheng,Changqing Zhang

Main category: cs.CV

TL;DR: This paper proposes a method to enhance CLIP's processing of localized visual descriptors through stochastic multi-crop augmentation, effectively improving its performance across several application settings.

Details Motivation: The motivation is to overcome CLIP's bias toward global image patterns which hinders its ability to process localized visual descriptors, a limitation not addressed by existing approaches. Method: The proposed method involves employing stochastic multi-crop augmentation to activate CLIP's latent capacity for localized feature analysis by constraining the model's receptive field and recalibrating its attention mechanism. Result: Extensive experiments demonstrate that the proposed method (D&D) achieves promising performance improvements in various settings including zero-shot, few-shot, and test-time adaptation. Conclusion: The paper concludes that by using stochastic multi-crop augmentation, CLIP's ability to process localized visual descriptors can be enhanced, thereby improving its performance in zero-shot, few-shot, and test-time adaptation settings. Abstract: Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, predominantly relies on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal critical limitations: CLIP's strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to ``See Both the Forest and the Trees." Specifically, we employ stochastic multi-crop augmentation to activate CLIP's latent capacity for localized feature analysis. 
By cropping only partial regions, the approach effectively constrains the model's receptive field and recalibrates its attention mechanism, thereby mitigating its inherent bias. We evaluate the proposed method under zero-shot, few-shot, and test-time adaptation settings, and extensive experiments demonstrate that D&D achieves promising performance.
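
To make the idea of stochastic multi-crop augmentation concrete, here is a minimal sketch that samples random crop boxes for one image. The function name, the per-side scale range, and the seed parameter are assumptions for illustration; the abstract does not state the crop sizes or sampling scheme used in D&D.

```python
import random


def random_crop_boxes(width, height, num_crops, min_scale=0.3, max_scale=0.7, seed=None):
    """Sample `num_crops` boxes (left, top, right, bottom) that each cover part of the image.

    `min_scale` and `max_scale` bound the crop side length as a fraction of the
    image side; these defaults are hypothetical, not values from the paper.
    """
    rng = random.Random(seed)
    boxes = []
    for _ in range(num_crops):
        scale = rng.uniform(min_scale, max_scale)
        crop_w = max(1, int(width * scale))
        crop_h = max(1, int(height * scale))
        left = rng.randint(0, width - crop_w)
        top = rng.randint(0, height - crop_h)
        boxes.append((left, top, left + crop_w, top + crop_h))
    return boxes


# Eight partial views of a 224x224 image; each would be encoded separately.
crops = random_crop_boxes(224, 224, num_crops=8, seed=0)
```

Each box can then be passed to an image library's crop routine before encoding, so that every view shows only a local region rather than the whole scene.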

[208] Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds

Matthias Zeller,Vardeep S. Sandhu,Benedikt Mersch,Jens Behley,Michael Heidingsfeld,Cyrill Stachniss

Main category: cs.CV

TL;DR: This paper introduces a new transformer-based method for moving object segmentation using radar data, improving scene understanding for autonomous vehicles.

Details Motivation: The motivation of this paper is to improve the awareness of moving objects around self-driving vehicles by leveraging radar sensors' ability to provide instantaneous motion information, which overcomes limitations in traditional methods relying on LiDAR and camera data. Method: A novel transformer-based approach called the Radar Velocity Transformer is developed to perform moving object segmentation on sparse radar scans. This approach incorporates velocity information throughout the network and includes a transformer-based upsampling technique. Result: The proposed method achieves superior segmentation results compared to other state-of-the-art approaches using only single-scan radar data, with performance fast enough to exceed the frame rate of the sensor. Conclusion: The study concludes that the proposed Radar Velocity Transformer method outperforms state-of-the-art methods in moving object segmentation using single-scan radar data, enabling faster and more accurate autonomous navigation. Abstract: The awareness about moving objects in the surroundings of a self-driving vehicle is essential for safe and reliable autonomous navigation. The interpretation of LiDAR and camera data achieves exceptional results but typically requires accumulating and processing temporal sequences of data in order to extract motion information. In contrast, radar sensors, which are already installed in most recent vehicles, can overcome this limitation as they directly provide the Doppler velocity of the detections and hence incorporate instantaneous motion information within a single measurement. In this paper, we tackle the problem of moving object segmentation in noisy radar point clouds. We also consider differentiating parked from moving cars, to enhance scene understanding.
Instead of exploiting temporal dependencies to identify moving objects, we develop a novel transformer-based approach to perform single-scan moving object segmentation in sparse radar scans accurately. The key to our Radar Velocity Transformer is to incorporate the valuable velocity information throughout each module of the network, thereby enabling the precise segmentation of moving and non-moving objects. Additionally, we propose a transformer-based upsampling, which enhances the performance by adaptively combining information and overcoming the limitation of interpolation of sparse point clouds. Finally, we create a new radar moving object segmentation benchmark based on the RadarScenes dataset and compare our approach to other state-of-the-art methods. Our network runs faster than the frame rate of the sensor and shows superior segmentation results using only single-scan radar data.

[209] Information-Bottleneck Driven Binary Neural Network for Change Detection

Kaijie Yin,Zhiyuan Zhang,Shu Kong,Tian Gao,Chengzhong Xu,Hui Kong

Main category: cs.CV

TL;DR: This paper proposes BiCD, a binary neural network tailored for change detection, which improves detection accuracy by enhancing feature representation and discrimination through an auxiliary objective.

Details Motivation: Conventional BNN approaches limit data representation and discrimination between changed and unchanged regions, resulting in lower detection accuracy. Method: BiCD incorporates an auxiliary objective based on the Information Bottleneck principle with a learnable module to improve optimization and detection accuracy. Result: BiCD outperforms existing BNN methods on street-view and remote sensing datasets, showing significant improvements in change detection performance. Conclusion: BiCD introduces a new benchmark for BNN-based change detection, achieving state-of-the-art performance by enhancing representational power and feature separability. Abstract: In this paper, we propose Binarized Change Detection (BiCD), the first binary neural network (BNN) designed specifically for change detection. Conventional network binarization approaches, which directly quantize both weights and activations in change detection models, severely limit the network's ability to represent input data and distinguish between changed and unchanged regions. This results in significantly lower detection accuracy compared to real-valued networks. To overcome these challenges, BiCD enhances both the representational power and feature separability of BNNs, improving detection performance. Specifically, we introduce an auxiliary objective based on the Information Bottleneck (IB) principle, guiding the encoder to retain essential input information while promoting better feature discrimination. Since directly computing mutual information under the IB principle is intractable, we design a compact, learnable auxiliary module as an approximation target, leading to a simple yet effective optimization strategy that minimizes both reconstruction loss and standard change detection loss. Extensive experiments on street-view and remote sensing datasets demonstrate that BiCD establishes a new benchmark for BNN-based change detection, achieving state-of-the-art performance in this domain.

[210] Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding

Namho Kim,Junhwa Kim

Main category: cs.CV

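TL;DR: This paper proposes a multimodal video classification framework based on GRUs and cross-modal attention, regularized through feature augmentation and autoencoding, that significantly outperforms unimodal methods on violence detection and emotion estimation.

Details Motivation: Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality, so the authors propose a framework that fuses information from multiple modalities. Method: The framework fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention, is trained with a classification or regression loss, and is further regularized through feature-level augmentation and autoencoding. Result: Experiments show that the method significantly outperforms unimodal baselines on both the DVD and Aff-Wild2 datasets, with cross-modal attention and feature augmentation playing a key role in robustness and performance. Conclusion: The proposed multimodal framework effectively fuses information from different modalities and performs well on complex video classification tasks, demonstrating broad application potential. Abstract: Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained with a classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance.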

[211] PhenoBench: A Comprehensive Benchmark for Cell Phenotyping

Jerome Luescher,Nora Koreuber,Jannik Franzen,Fabian H. Reith,Claudia Winklmayr,Christian M. Schuerch,Dagmar Kainmueller,Josef Lorenz Rumberger

Main category: cs.CV

TL;DR: This paper presents PhenoBench, a comprehensive benchmark for cell phenotyping that includes PhenoCell, a new H&E dataset, and an evaluation of existing foundation models (FMs).

Details Motivation: Many foundation models (FMs) have emerged in digital pathology, but their performance on cell phenotyping has not been evaluated in a unified manner. Method: The authors propose PhenoBench, which includes the PhenoCell dataset featuring 14 granular cell types, together with ready-to-use fine-tuning and benchmarking code for systematically evaluating several prominent pathology FMs on dense cell phenotype prediction in different generalization scenarios. Result: Existing FMs achieve macro F1 scores above 0.70 on previously established benchmarks such as Lizard and PanNuke, but scores as low as 0.20 are observed on PhenoCell, indicating a much more challenging task. Conclusion: PhenoCell is established as a prime asset for future benchmarking of FMs and supervised models because it exposes the shortcomings of existing models on a new challenge. Abstract: Digital pathology has seen the advent of a wealth of foundational models (FM), yet to date their performance on cell phenotyping has not been benchmarked in a unified manner. We therefore propose PhenoBench: A comprehensive benchmark for cell phenotyping on Hematoxylin and Eosin (H&E) stained histopathology images. We provide both PhenoCell, a new H&E dataset featuring 14 granular cell types identified by using multiplexed imaging, and ready-to-use fine-tuning and benchmarking code that allows the systematic evaluation of multiple prominent pathology FMs in terms of dense cell phenotype predictions in different generalization scenarios. We perform extensive benchmarking of existing FMs, providing insights into their generalization behavior under technical vs. medical domain shifts. Furthermore, while FMs achieve macro F1 scores > 0.70 on previously established benchmarks such as Lizard and PanNuke, on PhenoCell, we observe scores as low as 0.20. This indicates a much more challenging task not captured by previous benchmarks, establishing PhenoCell as a prime asset for future benchmarking of FMs and supervised models alike. Code and data are available on GitHub.
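
Since the comparison above rests on macro F1, a short worked definition may be useful. This is a minimal sketch of the standard metric (per-class F1 averaged with equal weight per class); the function name and the toy labels are illustrative and are not taken from the PhenoBench code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with two hypothetical cell types.
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Because every class counts equally, rare cell types that a model handles poorly pull the macro average down sharply, which is one reason a benchmark with 14 granular types can score far lower than coarser ones.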

[212] CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation

Elena Bueno-Benito,Mariella Dimiccoli

Main category: cs.CV

TL;DR: Closed Loop Optimal Transport (CLOT) enhances unsupervised action segmentation by integrating multi-level cyclic feature learning, improving upon ASOT.

Details Motivation: ASOT lacks segment-level supervision, limiting the feedback between frames and action representations. CLOT addresses this by refining both elements through cyclical learning. Method: CLOT uses an encoder-decoder architecture to solve three OT problems, enabling cyclical learning through cross-attention between frame and segment embeddings. Result: Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation. Conclusion: CLOT improves unsupervised action segmentation by introducing a multi-level cyclic feature learning mechanism that refines frame embeddings and pseudo-labels, outperforming ASOT. Abstract: Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions on the action ordering, and it is able to decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, which limits the effectiveness of the feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework that introduces a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
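
For readers unfamiliar with optimal transport, the following is a minimal sketch of entropy-regularized OT solved with Sinkhorn iterations between frames and action labels. It is a generic textbook solver swapped in for illustration, not the specific OT formulation used by ASOT or CLOT; the cost matrix, marginals, and `epsilon` value are hypothetical.

```python
import math


def sinkhorn(cost, row_marginals, col_marginals, epsilon=0.1, num_iters=200):
    """Return a transport plan whose row and column sums approximate the given marginals."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low-cost pairs receive high affinity.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marginals[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginals[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two frames, two action labels; each frame is cheap to assign to "its" label.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

The rows of the resulting plan can be read as soft pseudo-labels: each frame's mass is spread over actions in proportion to how cheaply it can be assigned to them.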

[213] Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition

Redwan Sony,Parisa Farmanifard,Arun Ross,Anil K. Jain

Main category: cs.CV

TL;DR: This paper compares generic foundation models with domain-specific models for face recognition, finding that the latter perform better but that combining the two improves performance and adds explainability.

Details Motivation: To study the performance gap between generic foundation models and domain-specific models in face recognition, and the potential of combining them. Method: A series of experiments compares several foundation models across benchmark datasets and explores how contextual clues and fusion affect performance. Result: Domain-specific models outperform zero-shot foundation models on all datasets; foundation models perform better on over-segmented images than on tightly cropped ones; fusing foundation and domain-specific models improves accuracy; and models such as ChatGPT can add explainability to the recognition process. Conclusion: Domain-specific models outperform zero-shot foundation models on face recognition, but combining the two can improve accuracy and provide explainability. Abstract: In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, LLaVa, DINO) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we are able to report the following findings: (a) In all datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models is better on over-segmented face images than on tightly cropped faces, thereby suggesting the importance of contextual clues. For example, at a False Match Rate (FMR) of 0.01%, the True Match Rate (TMR) of OpenCLIP improved from 64.97% to 81.73% on the LFW dataset as the face crop increased from 112x112 to 250x250 while the TMR of domain-specific AdaFace dropped from 99.09% to 77.31%. (c) A simple score-level fusion of a foundation model with a domain-specific FR model improved the accuracy at low FMRs. For example, the TMR of AdaFace when fused with BLIP improved from 72.64% to 83.31% at an FMR of 0.0001% on the IJB-B dataset and from 73.17% to 85.81% on the IJB-C dataset. (d) Foundation models, such as ChatGPT, can be used to impart explainability to the FR pipeline (e.g., ``Despite minor lighting and head tilt differences, the two left-profile images show high consistency in forehead slope, nose shape, chin contour...'').
In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace (e.g., ``Although AdaFace assigns a low similarity score of 0.21, both images exhibit visual similarity...and the pair is likely of the same person''), thereby reiterating the importance of combining domain-specific FR models with generic foundation models in a judicious manner.
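
The abstract reports True Match Rate at a fixed False Match Rate after score-level fusion without spelling out either computation. Below is a minimal sketch under stated assumptions: fusion is taken to be a weighted sum of min-max normalized similarity scores (the paper's exact fusion rule is not given here), and all function names and toy scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] so two matchers become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted-sum fusion of two matchers' scores for the same list of pairs."""
    norm_a = min_max_normalize(scores_a)
    norm_b = min_max_normalize(scores_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(norm_a, norm_b)]


def tmr_at_fmr(genuine, impostor, target_fmr):
    """Fraction of genuine pairs accepted when at most `target_fmr` of impostor pairs are."""
    ranked = sorted(impostor, reverse=True)
    allowed = int(target_fmr * len(ranked))  # impostor pairs permitted to match
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    # A pair is accepted only if its score is strictly above the threshold.
    return sum(1 for g in genuine if g > threshold) / len(genuine)


# Toy similarity scores for genuine (same person) and impostor pairs.
genuine_scores = [0.9, 0.8, 0.4]
impostor_scores = [0.5, 0.3, 0.2, 0.1]
strict = tmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.0)
relaxed = tmr_at_fmr(genuine_scores, impostor_scores, target_fmr=0.25)
```

At very low FMRs such as 0.0001%, the threshold is set by a handful of the highest impostor scores, so a fusion step that lowers even a few of those scores can raise TMR noticeably.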

[214] Beyond Accuracy: Metrics that Uncover What Makes a `Good' Visual Descriptor

Ethan Lin,Linxi Zhao,Atharva Sehgal,Jennifer J. Sun

Main category: cs.CV

TL;DR: This paper explores how to improve the understanding of visual descriptors' effectiveness in vision-language models using new alignment-based metrics.

Details Motivation: The motivation is to understand the factors influencing the effectiveness of text-based visual descriptors in vision-language models beyond traditional accuracy evaluations. Method: The authors systematically analyze descriptor quality along two dimensions: representational capacity and relationship with VLM pre-training data, introducing two alignment-based metrics called Global Alignment and CLIP Similarity. Result: The study evaluates various descriptor generation methods, including zero-shot LLM-generated prompts and iteratively refined descriptors, while proposing new metrics for deeper analysis. Conclusion: The paper concludes that descriptor effectiveness can be better understood through alignment-based metrics and their relationship with foundation model properties. Abstract: Text-based visual descriptors-ranging from simple class names to more descriptive phrases-are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics-Global Alignment and CLIP Similarity-that move beyond accuracy. These metrics allow us to shed light on how different descriptor generation strategies interact with foundation model properties, offering insights into ways of studying descriptor effectiveness beyond accuracy evaluations.

[215] An Advanced Deep Learning Framework for Ischemic and Hemorrhagic Brain Stroke Diagnosis Using Computed Tomography (CT) Images

Md. Sabbir Hossen,Eshat Ahmed Shuvo,Shibbir Ahmed Arif,Pabon Shaha,Md. Saiduzzaman,Mostofa Kamal Nasir

Main category: cs.CV

TL;DR: This study proposes a new machine-learning-based approach to early brain stroke detection that combines pre-trained deep learning models with optimization strategies to achieve highly accurate stroke diagnosis.

Details Motivation: Precise and fast stroke prediction techniques are needed. CT scanning is an effective diagnostic method, but traditional diagnosis relies on radiologists manually selecting the critical slices, so machine learning is introduced to improve efficiency and accuracy. Method: Pre-trained models including DenseNet201, InceptionV3, MobileNetV2, ResNet50, and Xception are used for feature extraction; feature engineering techniques such as BFO, PCA, and LDA further enhance performance; and machine learning algorithms such as SVC, RF, and XGB perform the classification. Result: The combination of MobileNetV2, LDA, and SVC achieved the highest classification accuracy of 97.93%, significantly outperforming other model combinations. Conclusion: Combining lightweight pre-trained models with optimization techniques can effectively improve brain stroke diagnosis. Abstract: Brain stroke is one of the leading causes of mortality and long-term disability worldwide, highlighting the need for precise and fast prediction techniques. Computed Tomography (CT) scan is considered one of the most effective methods for diagnosing brain strokes. The majority of stroke classification techniques rely on a single slice-level prediction mechanism, allowing the radiologist to manually choose the most critical CT slice from the original CT volume. Although clinical evaluations are often used in traditional diagnostic procedures, machine learning (ML) has opened up new avenues for improving stroke diagnosis. To supplement traditional diagnostic techniques, this study investigates the use of machine learning models, specifically concerning the prediction of brain stroke at an early stage utilizing CT scan images. In this research, we propose a novel approach to brain stroke detection leveraging machine learning techniques, focusing on optimizing classification performance with pre-trained deep learning models and advanced optimization strategies. Pre-trained models, including DenseNet201, InceptionV3, MobileNetV2, ResNet50, and Xception, are utilized for feature extraction. Additionally, we employ feature engineering techniques, including BFO, PCA, and LDA, to enhance models' performance further. These features are subsequently classified using machine learning algorithms such as SVC, RF, XGB, DT, LR, KNN, and GNB.
Our experiments demonstrate that the combination of MobileNetV2, LDA, and SVC achieved the highest classification accuracy of 97.93%, significantly outperforming other model-optimizer-classifier combinations. The results underline the effectiveness of integrating lightweight pre-trained models with robust optimization and classification techniques for brain stroke diagnosis.

[216] Predicting Asphalt Pavement Friction Using Texture-Based Image Indicator

Bingjie Lu,Zhengyang Lu,Yijiashun Qi,Hanzhe Guo,Tianyao Sun,Zunduo Zhao

Main category: cs.CV

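TL;DR: This study proposes and validates a texture-based image indicator, the aggregate protrusion area, for predicting asphalt pavement friction from digital images.

Details Motivation: Pavement skid resistance is vital for road safety, and an image-based index would allow pavement friction to be measured easily and inexpensively. Method: Three asphalt surface types (dense-graded asphalt mix, open-grade friction course, and chip seal) were subjected to various tire polishing cycles; images were taken with the corresponding friction measured using a Dynamic Friction Tester (DFT) in the laboratory, and statistical models correlate the aggregate protrusion area with friction coefficients for each surface type. Result: The adjusted R-square values of all relationships are above 0.90, and the proposed indicator reflects changes in friction with the number of polishing cycles more accurately than other image-based indicators in the literature. Conclusion: The indicator offers a cost-effective way to consider pavement friction at the mix design stage. Abstract: Pavement skid resistance is of vital importance for road safety. The objective of this study is to propose and validate a texture-based image indicator to predict pavement friction. This index enables pavement friction to be measured easily and inexpensively using digital images. Three different types of asphalt surfaces (dense-graded asphalt mix, open-grade friction course, and chip seal) were evaluated subject to various tire polishing cycles. Images were taken with corresponding friction measured using Dynamic Friction Tester (DFT) in the laboratory. The aggregate protrusion area is proposed as the indicator. Statistical models are established for each asphalt surface type to correlate the proposed indicator with friction coefficients. The results show that the adjusted R-square values of all relationships are above 0.90. Compared to other image-based indicators in the literature, the proposed image indicator more accurately reflects the changes in pavement friction with the number of polishing cycles, proving its cost-effective use for considering pavement friction in mix design stage.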
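
The abstract reports adjusted R-square values without defining the statistic. As a reminder, here is a minimal sketch of the standard formula, which penalizes plain R-square for the number of predictors; the function name and the example numbers are illustrative assumptions, not values from the study.

```python
def adjusted_r_squared(r_squared, num_samples, num_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = num_samples - num_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (num_samples - 1) / dof


# Hypothetical fit: R^2 of 0.95 from 20 measurements and a single image indicator.
adjusted = adjusted_r_squared(0.95, num_samples=20, num_predictors=1)
```

With a single predictor the correction is small, so an adjusted value above 0.90 implies that the plain R-square is also above 0.90.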

[217] 2.5D Object Detection for Intelligent Roadside Infrastructure

Nikolai Polley,Yacin Boualili,Ferdinand Mütsch,Maximilian Zipfl,Tobias Fleck,J. Marius Zöllner

Main category: cs.CV

TL;DR: This paper proposes a 2.5D object detection framework for infrastructure roadside cameras to improve perception for autonomous vehicles.

Details Motivation: To address the poor generalization of existing 3D object detection algorithms under top-down perspectives and steep camera angles, while providing wide, unobstructed intersection coverage as a complementary information stream. Method: A prediction approach detects the ground planes of vehicles as parallelograms in the image frame and is trained on a mix of real-world and synthetically generated scenes. Result: The approach achieves high detection accuracy, strong cross-viewpoint generalization, and robustness to diverse lighting and weather conditions. Conclusion: The proposed 2.5D object detection framework effectively overcomes the limitations of conventional methods for infrastructure roadside cameras and improves perception for autonomous vehicles. Abstract: On-board sensors of autonomous vehicles can be obstructed, occluded, or limited by restricted fields of view, complicating downstream driving decisions. Intelligent roadside infrastructure perception systems, installed at elevated vantage points, can provide wide, unobstructed intersection coverage, supplying a complementary information stream to autonomous vehicles via vehicle-to-everything (V2X) communication. However, conventional 3D object-detection algorithms struggle to generalize under the domain shift introduced by top-down perspectives and steep camera angles. We introduce a 2.5D object detection framework, tailored specifically for infrastructure roadside-mounted cameras. Unlike conventional 2D or 3D object detection, we employ a prediction approach to detect ground planes of vehicles as parallelograms in the image frame. The parallelogram preserves the planar position, size, and orientation of objects while omitting their height, which is unnecessary for most downstream applications. For training, a mix of real-world and synthetically generated scenes is leveraged. We evaluate generalizability on a held-out camera viewpoint and in adverse-weather scenarios absent from the training set. Our results show high detection accuracy, strong cross-viewpoint generalization, and robustness to diverse lighting and weather conditions. Model weights and inference code are provided at: https://gitlab.kit.edu/kit/aifb/ATKS/public/digit4taf/2.5d-object-detection
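
To illustrate why a parallelogram is a compact output for a vehicle's ground plane, the sketch below builds one from three image-space corners and computes its area. The three-corner parameterization and the function names are assumptions made for illustration; the abstract does not state how the detector encodes its parallelograms.

```python
def complete_parallelogram(origin, corner_a, corner_b):
    """Return the four vertices in order, given one corner and its two neighbours."""
    fourth = (corner_a[0] + corner_b[0] - origin[0],
              corner_a[1] + corner_b[1] - origin[1])
    return [origin, corner_a, fourth, corner_b]


def parallelogram_area(origin, corner_a, corner_b):
    """Area in pixels^2 via the 2D cross product of the two edge vectors."""
    ax, ay = corner_a[0] - origin[0], corner_a[1] - origin[1]
    bx, by = corner_b[0] - origin[0], corner_b[1] - origin[1]
    return abs(ax * by - ay * bx)


# Hypothetical footprint in pixel coordinates.
footprint = complete_parallelogram((0, 0), (4, 0), (1, 2))
area = parallelogram_area((0, 0), (4, 0), (1, 2))
```

Three points (six numbers) are enough to fix position, size, and orientation in the image, and no height needs to be predicted, which matches the abstract's description of what the parallelogram keeps and omits.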

[218] SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications

Yana Hasson,Pauline Luc,Liliane Momeni,Maks Ovsjanikov,Guillaume Le Moing,Alina Kuznetsova,Ira Ktena,Jennifer J. Sun,Skanda Koppula,Dilara Gokay,Joseph Heyward,Etienne Pot,Andrew Zisserman

Main category: cs.CV

TL;DR: This paper introduces the SciVid benchmark for evaluating video foundation models on multidisciplinary tasks spanning medicine, animal behavior, and weather forecasting, demonstrating their potential for cross-domain use.

Details Motivation: Although spatiotemporal foundation models have advanced in various disciplines, they are often domain-specific. The authors want to explore whether video foundation models can transfer knowledge effectively across data from different disciplines. Method: A benchmark of five scientific video tasks (SciVid) is constructed, and six leading video foundation models are adapted to it and evaluated experimentally. Result: Experiments show that pretrained video foundation models combined with simple trainable modules can reach state-of-the-art results on several scientific tasks, and they reveal the limitations of existing models and room for improvement. Conclusion: SciVid provides a benchmark for evaluating video foundation models across multiple scientific domains, demonstrates the potential of general-purpose video foundation models, and points out their limitations and directions for future research. Abstract: In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at https://github.com/google-deepmind/scivid to facilitate further research in the development of ViFMs.

[219] Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation

Tao Tang,Shijie Xu,Yiting Wu,Zhixiang Lu

Main category: cs.CV

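TL;DR: This paper proposes Causal-SAM-LLM, which combines causal reasoning with large language models to substantially improve the cross-domain generalization and robustness of medical image segmentation models.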

Details Motivation: The clinical use of deep learning models for medical image segmentation is limited by their inability to generalize to unseen domains, usually because the models learn spurious correlations between anatomical content and imaging style. Method: Proposes a new framework, Causal-SAM-LLM, with two core innovations: 1) Linguistic Adversarial Disentanglement (LAD), which uses a vision-language model to generate style descriptions and applies contrastive learning to remove non-causal information; 2) Test-Time Causal Intervention (TCI), in which an LLM interprets a clinician's natural language command to adjust the segmentation result in real time. Result: On a composite benchmark built from four public datasets (BTCV, CHAOS, AMOS, BraTS), Causal-SAM-LLM improves the average Dice score by up to 6.2 points and reduces the Hausdorff Distance by 15.8 mm over the strongest baseline, while using less than 9% of the full model's trainable parameters. Conclusion: By using large language models as causal reasoners, Causal-SAM-LLM improves the out-of-distribution robustness of medical image segmentation models under cross-scanner, cross-modality, and cross-anatomy settings while reducing the number of trainable parameters, charting a new path toward robust, efficient, and interactive medical AI systems. Abstract: The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model's features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician's natural language command to modulate the segmentation decoder's features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings.
Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model's trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.

[220] From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis

Amir Hojjati,Lu Li,Ibrahim Hameed,Anis Yazidi,Pedro G. Lind,Rabindra Khadka

Main category: cs.CV

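TL;DR: This paper proposes EEG-VJEPA, a new EEG classification method that adapts the Video Joint Embedding Predictive Architecture (V-JEPA) and improves classification performance by learning semantically meaningful spatiotemporal representations.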

Details Motivation: Existing self-supervised learning methods usually focus on either the spatial or the temporal features of EEG signals, which yields suboptimal representations. Effective analysis is further hindered by limited labeled data, high dimensionality, and the lack of scalable models. Method: Treats EEG as video-like sequences and uses joint embeddings and adaptive masking to learn semantically meaningful spatiotemporal representations. Result: Evaluations on the publicly available Temple University Hospital (TUH) Abnormal EEG dataset show that EEG-VJEPA outperforms existing state-of-the-art models in classification accuracy. It also captures physiologically relevant spatial and temporal signal patterns. Conclusion: EEG-VJEPA offers a promising framework for scalable, trustworthy EEG analysis in real-world clinical settings. Abstract: EEG signals capture brain activity with high temporal and low spatial resolution, supporting applications such as neurological diagnosis, cognitive monitoring, and brain-computer interfaces. However, effective analysis is hindered by limited labeled data, high dimensionality, and the absence of scalable models that fully capture spatiotemporal dependencies. Existing self-supervised learning (SSL) methods often focus on either spatial or temporal features, leading to suboptimal representations. To this end, we propose EEG-VJEPA, a novel adaptation of the Video Joint Embedding Predictive Architecture (V-JEPA) for EEG classification. By treating EEG as video-like sequences, EEG-VJEPA learns semantically meaningful spatiotemporal representations using joint embeddings and adaptive masking. To our knowledge, this is the first work that exploits V-JEPA for EEG classification and explores the visual concepts learned by the model. Evaluations on the publicly available Temple University Hospital (TUH) Abnormal EEG dataset show that EEG-VJEPA outperforms existing state-of-the-art models in classification accuracy. Beyond classification accuracy, EEG-VJEPA captures physiologically relevant spatial and temporal signal patterns, offering interpretable embeddings that may support human-AI collaboration in diagnostic workflows. These findings position EEG-VJEPA as a promising framework for scalable, trustworthy EEG analysis in real-world clinical settings.

[221] Dynamic Multimodal Prototype Learning in Vision-Language Models

Xingyu Zhu,Shuo Wang,Beier Zhu,Miaoge Li,Yunfan Li,Junfeng Fang,Zhicai Wang,Dongsheng Wang,Hanwang Zhang

Main category: cs.CV

TL;DR: This paper proposes ProtoMM, a training-free framework for adapting vision-language models during test time by constructing multimodal prototypes and dynamically updating visual particles, resulting in improved performance over existing methods.

Details Motivation: Previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names, which leads to limited performance in vision-language model adaptation. Method: ProtoMM constructs multimodal prototypes by combining textual descriptions and visual particles, dynamically updates visual particles during testing, and quantifies prototype importance using semantic distance formulated as an optimal transport problem. Result: Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of ProtoMM, achieving a 1.03% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets. Conclusion: ProtoMM is an effective training-free framework that improves the adaptation of vision-language models during test time by constructing multimodal prototypes and dynamically updating visual particles, leading to enhanced generalizability in unseen scenarios. Abstract: With the increasing attention to pre-trained vision-language models (VLMs), e.g., CLIP, substantial efforts have been devoted to many downstream tasks, especially in test-time adaptation (TTA). However, previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names. These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce **ProtoMM**, a training-free framework that constructs multimodal prototypes to adapt VLMs during the test time. By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing stream flows.
This allows our multimodal prototypes to continually learn from the data, enhancing their generalizability in unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, achieving a 1.03% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.

[222] On the rankability of visual embeddings

Ankit Sonthalia,Arnas Uselis,Seong Joon Oh

Main category: cs.CV

TL;DR: This study introduces the concept of "rank axes" and shows that many visual embedding models support ranking without full-scale supervision.

Details Motivation: To explore whether visual embedding models capture continuous, ordinal attributes along linear directions. Method: Defines "rank axes" and tests multiple encoders and datasets to verify rankability. Result: Across 7 popular encoders and 9 datasets, many embeddings are found to be rankable. Conclusion: Many embedding models are inherently rankable, and a small number of samples suffices to recover meaningful rank axes. Abstract: We study whether visual embedding models capture continuous, ordinal attributes along linear directions, which we term _rank axes_. We define a model as _rankable_ for an attribute if projecting embeddings onto such an axis preserves the attribute's order. Across 7 popular encoders and 9 datasets with attributes like age, crowd count, head pose, aesthetics, and recency, we find that many embeddings are inherently rankable. Surprisingly, a small number of samples, or even just two extreme examples, often suffice to recover meaningful rank axes, without full-scale supervision. These findings open up new use cases for image ranking in vector databases and motivate further study into the structure and learning of rankable embeddings. Our code is available at https://github.com/aktsonthalia/rankable-vision-embeddings.
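
The idea of recovering a rank axis from two extreme examples can be illustrated with a small sketch. This is not the authors' code: it assumes the axis is simply the normalized difference between the embedding of a "high" example and a "low" example, and the 2-D vectors and the function names `rank_axis`, `project`, and `rank_by_axis` are hypothetical stand-ins for real encoder outputs and utilities.

```python
def rank_axis(low, high):
    """Unit vector pointing from the 'low' extreme embedding to the 'high' one."""
    diff = [h - l for l, h in zip(low, high)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(embedding, axis):
    """Scalar position of an embedding along the axis (dot product)."""
    return sum(e * a for e, a in zip(embedding, axis))


def rank_by_axis(embeddings, axis):
    """Indices of the embeddings sorted by increasing projection onto the axis."""
    return sorted(range(len(embeddings)), key=lambda i: project(embeddings[i], axis))


# Toy data (assumed): the first coordinate plays the role of the attribute.
axis = rank_axis(low=[0.0, 0.0], high=[1.0, 0.0])
embeddings = [[0.9, 0.1], [0.1, 0.2], [0.5, 0.0]]
order = rank_by_axis(embeddings, axis)  # indices from lowest to highest attribute
```

A model would count as rankable for the attribute, in the abstract's sense, if the order produced this way agrees with the true attribute order.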

[223] SAMed-2: Selective Memory Enhanced Medical Segment Anything Model

Zhiling Yan,Sifan Song,Dingjie Song,Yiwei Li,Rong Zhou,Weixiang Sun,Zhennong Chen,Sekeun Kim,Hui Ren,Tianming Liu,Quanzheng Li,Xiang Li,Lifang He,Lichao Sun

Main category: cs.CV

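TL;DR: This paper proposes SAMed-2, a new foundation model for medical image segmentation built on the SAM-2 architecture; by introducing a temporal adapter and a memory mechanism, it performs well on large-scale noisy data and in multi-task learning.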

Details Motivation: Applying existing models directly to medical images is challenging because of the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. Method: Builds SAMed-2 on the SAM-2 architecture, adding a temporal adapter to capture image correlations and a confidence-driven memory mechanism that stores high-certainty features for later retrieval. Result: Experiments show that SAMed-2 performs strongly on internal benchmarks and 10 external datasets, particularly in multi-task scenarios, while effectively mitigating catastrophic forgetting. Conclusion: SAMed-2 delivers strong performance in medical image segmentation, outperforming existing state-of-the-art baselines, and introduces the MedBank-100k dataset for training and evaluation. Abstract: Recent "segment anything" efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios. The code is available at: https://github.com/ZhilingYan/Medical-SAM-Bench.

[224] Sign Spotting Disambiguation using Large Language Models

JianHe Low,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

TL;DR: This paper introduces a training-free sign spotting framework using LLMs, enhancing accuracy and vocabulary flexibility while reducing ambiguity in continuous sign language video analysis.

Details Motivation: The motivation is to address the challenges of vocabulary inflexibility and ambiguity in automatic sign spotting, which hampers frame-level supervision at scale for sign language translation due to data scarcity. Method: The method involves extracting global spatio-temporal and hand shape features from continuous sign language videos, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. An LLM performs context-aware gloss disambiguation via beam search without requiring fine-tuning. Result: Extensive experiments on both synthetic and real-world datasets show that the proposed method outperforms traditional approaches in accuracy and sentence fluency, demonstrating the potential of LLMs in advancing sign spotting. Conclusion: The paper concludes that their novel, training-free framework integrating Large Language Models (LLMs) significantly enhances sign spotting quality by improving vocabulary flexibility and reducing noise and ambiguity through context-aware gloss disambiguation. Abstract: Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. 
To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method's superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
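
The dictionary matching step described above (dynamic time warping over per-frame features, with cosine similarity as the frame-level score) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's pipeline: the 2-D feature vectors, the gloss names, and the helper names `cosine_distance`, `dtw_cost`, and `best_match` are all hypothetical, and the LLM disambiguation stage is omitted.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_cost(query, reference):
    """Classic DTW: minimum cumulative frame distance over monotonic alignments."""
    n, m = len(query), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], reference[j - 1])
            # Extend the cheapest of: repeat a reference frame, repeat a query frame, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(query, dictionary):
    """Gloss whose dictionary sequence has the lowest DTW cost to the query."""
    return min(dictionary, key=lambda gloss: dtw_cost(query, dictionary[gloss]))


# Toy dictionary (assumed): each gloss maps to a short sequence of frame features.
dictionary = {
    "HELLO": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "THANKS": [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]  # a faster rendition of the "HELLO" pattern
match = best_match(query, dictionary)
```

Because new glosses are added by inserting entries into the dictionary, a matcher of this kind needs no retraining when the vocabulary grows, which is the flexibility the abstract highlights.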

[225] Computationally efficient non-Intrusive pre-impact fall detection system

Praveen Jesudhas,Raghuveera T,Shiney Jeyaraj

Main category: cs.CV

TL;DR: This paper presents a non-intrusive and computationally efficient Pre-impact fall detection system using video-based skeletal features and an LSTM model, achieving high accuracy with significantly reduced processing needs.

Details Motivation: Existing pre-impact fall detection systems are either intrusive or computationally expensive, limiting their global adoption. This work aims to address these limitations by creating a non-intrusive and computationally efficient system. Method: The system uses video data from cameras to extract skeletal features during falls. It employs a minimal set of discriminative features and a Long Short-Term Memory (LSTM) network to achieve efficient fall detection. Result: The proposed system achieves 88% accuracy while requiring approximately 18 times less computation than existing methods, making it both effective and resource-efficient. Conclusion: The proposed Pre-impact fall detection system, with its non-intrusive nature and reduced computational requirements, is suitable for wider adoption in industrial and residential safety applications. Abstract: Existing pre-impact fall detection systems have high accuracy, however they are either intrusive to the subject or require heavy computational resources for fall detection, resulting in prohibitive deployment costs. These factors limit the global adoption of existing fall detection systems. In this work we present a Pre-impact fall detection system that is both non-intrusive and computationally efficient at deployment. Our system utilizes video data of the locality available through cameras, thereby requiring no specialized equipment to be worn by the subject. Further, the fall detection system utilizes minimal fall specific features and simplistic neural network models, designed to reduce the computational cost of the system. A minimal set of fall specific features are derived from the skeletal data, post observing the relative position of human skeleton during fall. These features are shown to have different distributions for Fall and non-fall scenarios proving their discriminative capability. 
A Long Short Term Memory (LSTM) based network is selected and the network architecture and training parameters are designed after evaluation of performance on standard datasets. In the Pre-impact fall detection system the computation requirement is about 18 times lesser than existing modules with a comparable accuracy of 88%. Given the low computation requirements and higher accuracy levels, the proposed system is suitable for wider adoption in engineering systems related to industrial and residential safety.

[226] Less is More: Empowering GUI Agent with Context-Aware Simplification

Gongwei Chen,Xurui Zhou,Rui Shao,Yibo Lyu,Kaiwen Zhou,Shuai Wang,Wentao Li,Yinchuan Li,Zhongang Qi,Liqiang Nie

Main category: cs.CV

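TL;DR: This paper proposes SimpAgent, a context-aware simplification framework for building efficient and effective GUI agents. With a masking-based element pruning method and a consistency-guided history compression module, SimpAgent reduces computation and improves GUI navigation performance.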

Details Motivation: As research on GUI agents shifts from text-dependent to pure-vision approaches, current work prioritizes comprehensive pre-training data collection and neglects the challenges of contextual modeling. This paper targets two problems in current GUI agents: element context that is dense but loosely related, and history context that is highly redundant. Method: Proposes SimpAgent, a context-aware simplification framework comprising a masking-based element pruning method and a consistency-guided history compression module, which reduce interference from unrelated elements and redundancy in historical information. Result: SimpAgent reduces FLOPs by 27% and shows superior GUI navigation performance across diverse web and mobile environments. Conclusion: Through efficient contextual modeling, SimpAgent addresses the main problems of current GUI agents and shows potential in both computational efficiency and performance. Abstract: The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agent and summarize: 1) the high-density and loose-relation of element context highlight the existence of many unrelated elements and their negative influence; 2) the high redundancy of history context reveals the inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI Agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents the intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through innovative explicit guidance, achieving an optimal balance between performance and efficiency. With the above components, SimpAgent reduces 27% FLOPs and achieves superior GUI navigation performances. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.

[227] Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

Chong Cheng,Sicheng Yu,Zijian Wang,Yifan Zhou,Hao Wang

Main category: cs.CV

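TL;DR: This paper proposes S3PO-GS, a more robust outdoor 3D Gaussian SLAM method that improves tracking accuracy and scene reconstruction quality by introducing geometric priors and avoiding scale ambiguity.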

Details Motivation: To address the lack of geometric priors in outdoor scenes and the scale drift caused by accumulated errors in some previous 3DGS SLAM methods. Method: Proposes a self-consistent tracking module and a patch-based pointmap dynamic mapping module to counter scale drift and introduce geometric priors. Result: Experiments show that S3PO-GS achieves state-of-the-art novel view synthesis on the Waymo, KITTI, and DL3DV datasets and surpasses other 3DGS SLAM methods in tracking accuracy. Conclusion: S3PO-GS is a robust RGB-only outdoor 3DGS SLAM method that reaches the state of the art in novel view synthesis and tracking accuracy. Abstract: 3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking, **lack geometric priors** in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to **scale drift**. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy. Project page: https://3dagentworld.github.io/S3PO-GS/.

[228] Flow-Anchored Consistency Models

Yansong Peng,Kai Zhu,Yu Liu,Pingyu Wu,Hebei Li,Xiaoyan Sun,Feng Wu

Main category: cs.CV

TL;DR: This paper proposes Flow-Anchored Consistency Models (FACM) to address training instability in Continuous-time Consistency Models by anchoring the model in the underlying flow during training, achieving excellent performance with few steps.

Details Motivation: Continuous-time Consistency Models (CMs) face training instability because they learn only a shortcut across a probability flow, losing grasp on the instantaneous velocity field. This work aims to stabilize their training. Method: The paper introduces Flow-Anchored Consistency Model (FACM), which uses a Flow Matching (FM) task to anchor the primary CM shortcut objective during training. Result: Using FACM, the method achieves state-of-the-art FID scores of 1.32 with NFE=2 and 1.76 with NFE=1 on ImageNet 256x256. Conclusion: FACM provides a general and effective method for training high-performance, few-step generative models without architectural changes. Abstract: Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow during training. We introduce the Flow-Anchored Consistency Model (FACM), a simple but effective training strategy that uses a Flow Matching (FM) task as an anchor for the primary CM shortcut objective. This Flow-Anchoring approach requires no architectural modifications and is broadly compatible with standard model architectures. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.76 with just one step (NFE=1) on ImageNet 256x256, significantly outperforming previous methods. This provides a general and effective recipe for building high-performance, few-step generative models. Our code and pretrained models: https://github.com/ali-vilab/FACM.

[229] ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

Shehroz S. Khan,Petar Przulj,Ahmed Ashraf,Ali Abedi

Main category: cs.CV

TL;DR: ChestGPT combines the EVA ViT with the Llama 2 LLM to classify and localize diseases in chest X-ray images.

Details Motivation: Demand for radiologists is growing rapidly while supply is not keeping pace, so technology is needed that enhances radiologists' capabilities and improves diagnostic accuracy. Method: Integrates the EVA ViT with the Llama 2 LLM: the ViT converts X-ray images into tokens, which are fed together with engineered prompts into the LLM to enable joint classification and localization of diseases. Result: The method achieves an F1 score of 0.76 on the VinDr-CXR dataset and localizes pathologies by generating bounding boxes around regions of interest. Conclusion: The ChestGPT framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest. Abstract: The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.

[230] StreamDiT: Real-Time Streaming Text-to-Video Generation

Akio Kodaira,Tingbo Hou,Ji Hou,Masayoshi Tomizuka,Yue Zhao

Main category: cs.CV

TL;DR: This paper introduces StreamDiT, a streaming video generation model that enables real-time applications by using a novel training approach and multistep distillation.

Details Motivation: Current T2V models generate only short clips offline, limiting their usability in real-time scenarios. Method: StreamDiT is based on a moving buffer and flow matching training, using adaLN DiT with time embedding and window attention. Multistep distillation is also applied. Result: The distilled StreamDiT model achieves 16 FPS at 512p resolution on one GPU, significantly reducing function evaluations. Conclusion: StreamDiT enables real-time video generation applications, such as streaming and interactive generation. Abstract: Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. 
We provide video results and more examples in our project website: this https URL.

[231] Efficient Event-Based Semantic Segmentation via Exploiting Frame-Event Fusion: A Hybrid Neural Network Approach

Hebei Li,Yansong Peng,Jiahui Yuan,Peixi Wu,Jin Wang,Yueyi Zhang,Xiaoyan Sun

Main category: cs.CV

TL;DR: This paper proposes an efficient hybrid framework for image semantic segmentation using event cameras, combining Spiking Neural Networks and Artificial Neural Networks with specialized interaction modules, resulting in improved accuracy and reduced energy consumption.

Details Motivation: Existing event-based semantic segmentation methods do not fully exploit the complementary information provided by frames and events, leading to complex training strategies and increased computational costs. This work aims to address these challenges. Method: An efficient hybrid framework is proposed, consisting of a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames, incorporating three modules: Adaptive Temporal Weighting Injector, Event-Driven Sparse Injector, and Channel Selection Fusion module. Result: Experimental results show that the framework achieves state-of-the-art accuracy across DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets and significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset. Conclusion: The proposed hybrid framework that combines Spiking Neural Network and Artificial Neural Network branches with specialized modules for interaction demonstrates state-of-the-art accuracy and reduced energy consumption in image semantic segmentation tasks. Abstract: Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. 
The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset.

[232] FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

Jiaqi Zhang,Juntuo Wang,Zhixin Sun,John Zou,Randall Balestriero

Main category: cs.CV

TL;DR: A new pre-training strategy for DINOv2 accelerates convergence and enhances robustness, making large-scale self-supervised foundation modeling more efficient and accessible.

Details Motivation: The motivation is to address the computational demands of reproducing pre-training solutions in various scenarios such as on private data or new modalities. Method: The method involves a frequency filtering curriculum and Gaussian noise patching augmentation applied to a ViT-B/16 backbone trained on ImageNet-1K. Result: The result shows that pre-training time and FLOPs are reduced by 1.6x and 2.25x while achieving matching robustness in corruption benchmarks and maintaining competitive linear probing performance. Conclusion: The paper concludes that their proposed pre-training strategy for DINOv2 improves efficiency and robustness in large-scale self-supervised foundation modeling. Abstract: Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning--which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence--and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum--low-frequency being seen first--and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time and FLOPs are reduced by 1.6x and 2.25x, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as means to improve self-supervised learning models robustness. 
The code is available at https://github.com/KevinZ0217/fast_dinov2

[233] Zero Memory Overhead Approach for Protecting Vision Transformer Parameters

Fereshteh Baradaran,Mohsen Raji,Azadeh Baradaran,Arezoo Baradaran,Reihaneh Akbarifard

Main category: cs.CV

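TL;DR: This paper proposes a zero-memory-overhead fault tolerance technique for Vision Transformer models that detects errors by replacing the least significant bit of each parameter with a parity bit and masks affected parameters when a fault is detected, significantly improving model reliability.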

Details Motivation: As Vision Transformer models become more common in safety-critical applications such as autonomous driving, it is essential that they keep operating correctly when bit-flip faults occur in parameters stored in memory. Method: The method exploits the fact that the least significant bits of ViT parameters have little effect on model accuracy: it replaces the LSB with a parity bit to provide error detection, and zeroes out affected parameters when a fault is detected to prevent accuracy degradation. Result: The approach significantly enhances the reliability of ViT models, improving the robustness of parameters to bit-flips by up to three orders of magnitude, making it an effective zero-overhead fault tolerance solution. Conclusion: The proposed technique effectively improves the reliability and robustness of ViT models in safety-critical applications without adding any memory overhead. Abstract: Vision Transformers (ViTs) have demonstrated superior performance over Convolutional Neural Networks (CNNs) in various vision-related tasks such as classification, object detection, and segmentation due to their use of self-attention mechanisms. As ViTs become more popular in safety-critical applications like autonomous driving, ensuring their correct functionality becomes essential, especially in the presence of bit-flip faults in their parameters stored in memory. In this paper, a fault tolerance technique is introduced to protect ViT parameters against bit-flip faults with zero memory overhead. Since the least significant bits of parameters are not critical for model accuracy, replacing the LSB with a parity bit provides an error detection mechanism without imposing any overhead on the model. When faults are detected, affected parameters are masked by zeroing out, as most parameters in ViT models are near zero, effectively preventing accuracy degradation. This approach enhances reliability across ViT models, improving the robustness of parameters to bit-flips by up to three orders of magnitude, making it an effective zero-overhead solution for fault tolerance in critical applications.
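The LSB-parity idea in this entry can be illustrated with a minimal sketch. It assumes 32-bit floating-point parameters and even parity computed over the upper 31 bits; the abstract does not state the data type or the exact parity scheme, and the helper names (`protect`, `read`) are hypothetical.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x stored as float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(word: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", word))[0]


def parity(word: int) -> int:
    """XOR of all bits in word: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1


def protect(x: float) -> int:
    """Overwrite the LSB with the parity of the upper 31 bits.

    The stored word then always has even overall parity (assumed scheme).
    """
    upper = float_to_bits(x) & 0xFFFFFFFE
    return upper | parity(upper)


def read(word: int) -> float:
    """Return the stored value, or 0.0 if the parity check fails."""
    if parity(word) != 0:
        return 0.0  # fault detected: mask the parameter by zeroing it
    return bits_to_float(word)


stored = protect(0.0123)
print(read(stored))              # close to 0.0123 (LSB now holds parity)
print(read(stored ^ (1 << 30)))  # single bit flip in the exponent: 0.0
```

A single parity bit detects any odd number of flipped bits in a word; an even number of flips in the same word would go unnoticed in this sketch.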

[234] Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition

Jiuhong Xiao,Yang Zhou,Giuseppe Loianno

Main category: cs.CV

TL;DR: This paper proposes QAA, a novel feature aggregation method for VPR that enhances model generalization across multiple datasets without increasing computational complexity.

Details Motivation: Existing VPR methods trained on single datasets suffer from limited generalization, while multi-dataset training faces challenges in feature aggregation information capacity. Method: Query-based Adaptive Aggregation (QAA) using learned queries as reference codebooks and computing Cross-query Similarity (CS). Result: QAA outperforms state-of-the-art models with balanced generalization across datasets and peak performance comparable to dataset-specific models. Conclusion: QAA is an effective feature aggregation technique that improves generalization of VPR models across diverse datasets. Abstract: Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. 
Code will be publicly released.

[235] Interpretable Diffusion Models with B-cos Networks

Nicola Bernold,Moritz Vandenhirtz,Alice Bizeul,Julia E. Vogt

Main category: cs.CV

TL;DR: This paper introduces a new diffusion model architecture that improves interpretability and provides insights into how prompt tokens influence generated images.

Details Motivation: While these models have enabled impressive progress in image generation, they often fail to accurately reflect all semantic information described in the prompt -- failures that are difficult to detect automatically. Method: We introduce a diffusion model architecture built with B-cos modules that offers inherent interpretability. Result: Our approach provides insight into how individual prompt tokens affect the generated image by producing explanations that highlight the pixel regions influenced by each token. Conclusion: B-cos diffusion models can produce high-quality images while providing meaningful insights into prompt-image alignment. Abstract: Text-to-image diffusion models generate images by iteratively denoising random noise, conditioned on a prompt. While these models have enabled impressive progress in image generation, they often fail to accurately reflect all semantic information described in the prompt -- failures that are difficult to detect automatically. In this work, we introduce a diffusion model architecture built with B-cos modules that offers inherent interpretability. Our approach provides insight into how individual prompt tokens affect the generated image by producing explanations that highlight the pixel regions influenced by each token. We demonstrate that B-cos diffusion models can produce high-quality images while providing meaningful insights into prompt-image alignment.

[236] ArmGS: Composite Gaussian Appearance Refinement for Modeling Dynamic Urban Environments

Guile Wu,Dongfeng Bai,Bingbing Liu

Main category: cs.CV

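TL;DR: This paper proposes ArmGS, a method that models autonomous driving scenes using composite driving Gaussian splatting with multi-granularity appearance refinement, addressing the weakness of existing methods in modeling fine-grained variations between frames and camera viewpoints.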

Details Motivation: Existing data-driven methods based on neural radiance fields achieve photorealistic driving scene modeling but render inefficiently, while methods using 3D Gaussian splatting enable high-fidelity reconstruction and real-time rendering but model fine-grained variations between frames and viewpoints poorly. Method: ArmGS designs a multi-level appearance modeling scheme that optimizes composite Gaussian splatting parameters at multiple granularities (local Gaussian level, global image level, and dynamic actor level), modeling both global scene appearance variations and local detail changes in background and objects. Result: Extensive experiments on several challenging autonomous driving datasets, including Waymo, KITTI, NOTR, and VKITTI2, show that the method outperforms current state-of-the-art methods. Conclusion: Through multi-granularity appearance refinement, ArmGS effectively improves the quality and rendering efficiency of dynamic urban environment modeling and is suited to autonomous driving simulation. Abstract: This work focuses on modeling dynamic urban environments for autonomous driving simulation. Contemporary data-driven methods using neural radiance fields have achieved photorealistic driving scene modeling, but they suffer from low rendering efficacy. Recently, some approaches have explored 3D Gaussian splatting for modeling dynamic urban scenes, enabling high-fidelity reconstruction and real-time rendering. However, these approaches often neglect to model fine-grained variations between frames and camera viewpoints, leading to suboptimal results. In this work, we propose a new approach named ArmGS that exploits composite driving Gaussian splatting with multi-granularity appearance refinement for autonomous driving scene modeling. The core idea of our approach is devising a multi-level appearance modeling scheme to optimize a set of transformation parameters for composite Gaussian refinement from multiple granularities, ranging from local Gaussian level to global image level and dynamic actor level. This not only models global scene appearance variations between frames and camera viewpoints, but also models local fine-grained changes of background and objects. Extensive experiments on multiple challenging autonomous driving datasets, namely, Waymo, KITTI, NOTR and VKITTI2, demonstrate the superiority of our approach over the state-of-the-art methods.

[237] Hierarchical Semantic-Visual Fusion of Visible and Near-infrared Images for Long-range Haze Removal

Yi Li,Xiaoxiong Wang,Jiawei Wang,Yi Chang,Kai Cao,Luxin Yan

Main category: cs.CV

TL;DR: This paper introduces the HSVF framework for long-range haze removal using semantic-visual fusion of visible and near-infrared images, achieving better results than current state-of-the-art methods.

Details Motivation: Most image dehazing efforts focus on short-range scenarios, leaving long-range haze removal under-explored. Near-infrared imaging provides valuable cues for multimodal fusion due to its superior fog penetration capabilities. Method: The Hierarchical Semantic-Visual Fusion (HSVF) framework includes a semantic stream for haze-free scene reconstruction and a visual stream for structural detail recovery. It leverages semantic consistency and complementary features from both modalities. Result: The HSVF framework produces high-contrast scenes with rich texture details by combining semantic predictions and structural information from visible and near-infrared images. Conclusion: The proposed HSVF framework effectively addresses the issue of long-range haze removal by integrating semantic and visual cues from visible and near-infrared images, outperforming existing methods. Abstract: While image dehazing has advanced substantially in the past decade, most efforts have focused on short-range scenarios, leaving long-range haze removal under-explored. As distance increases, intensified scattering leads to severe haze and signal loss, making it impractical to recover distant details solely from visible images. Near-infrared, with superior fog penetration, offers critical complementary cues through multimodal fusion. However, existing methods focus on content integration while often neglecting haze embedded in visible images, leading to results with residual haze. In this work, we argue that the infrared and visible modalities not only provide complementary low-level visual features, but also share high-level semantic consistency. Motivated by this, we propose a Hierarchical Semantic-Visual Fusion (HSVF) framework, comprising a semantic stream to reconstruct haze-free scenes and a visual stream to incorporate structural details from the near-infrared modality. 
The semantic stream first acquires haze-robust semantic prediction by aligning modality-invariant intrinsic representations. Then the shared semantics act as strong priors to restore clear and high-contrast distant scenes under severe haze degradation. In parallel, the visual stream focuses on recovering lost structural details from near-infrared by fusing complementary cues from both visible and near-infrared images. Through the cooperation of dual streams, HSVF produces results that exhibit both high-contrast scenes and rich texture details. Moreover, we introduce a novel pixel-aligned visible-infrared haze dataset with semantic labels to facilitate benchmarking. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches in real-world long-range haze removal.

[238] Deconfounding Causal Inference through Two-Branch Framework with Early-Forking for Sensor-Based Cross-Domain Activity Recognition

Di Xiong,Lei Zhang,Shuoyuan Wang,Dongzhou Cheng,Wenbo Huang

Main category: cs.CV

TL;DR: This paper proposes a causality-inspired method for domain generalization in human activity recognition, improving cross-domain performance by explicitly modeling causal and non-causal features.

Details Motivation: Most domain generalization approaches focus on statistical dependence between data and labels but ignore the intrinsic causal mechanisms. This work aims to address this limitation by modeling causal factors that influence classification decisions. Method: An early-forking two-branch framework is introduced to separately learn causal and non-causal features. The method employs a Hilbert-Schmidt Information Criterion for independence, an inhomogeneous domain sampling strategy, and a category-aware domain perturbation layer to improve feature disentanglement. Result: The proposed approach outperforms eleven state-of-the-art baselines on multiple HAR benchmarks under cross-person, cross-dataset, and cross-position settings. Ablation studies and visualizations confirm the effectiveness of the method. Conclusion: The proposed causality-inspired representation learning algorithm effectively addresses the domain generalization problem in sensor-based human activity recognition, demonstrating superior performance over existing methods. Abstract: Recently, domain generalization (DG) has emerged as a promising solution to mitigate the distribution-shift issue in sensor-based human activity recognition (HAR) scenarios. However, most existing DG-based works have merely focused on modeling statistical dependence between sensor data and activity labels, neglecting the importance of the intrinsic causal mechanism. Intuitively, every sensor input can be viewed as a mixture of causal (category-aware) and non-causal factors (domain-specific), where only the former affects activity classification judgment. In this paper, by casting such DG-based HAR as a causal inference problem, we propose a causality-inspired representation learning algorithm for cross-domain activity recognition. 
To this end, an early-forking two-branch framework is designed, where two separate branches are respectively responsible for learning causal and non-causal features, while an independence-based Hilbert-Schmidt Information Criterion is employed to implicitly disentangle them. Additionally, an inhomogeneous domain sampling strategy is designed to enhance disentanglement, while a category-aware domain perturbation layer is applied to prevent representation collapse. Extensive experiments on several public HAR benchmarks demonstrate that our causality-inspired approach significantly outperforms eleven related state-of-the-art baselines under cross-person, cross-dataset, and cross-position settings. Detailed ablation and visualization analyses reveal the underlying causal mechanism, indicating its effectiveness, efficiency, and universality in cross-domain activity recognition scenarios.
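The independence criterion named in this entry (HSIC, usually expanded as Hilbert-Schmidt Independence Criterion) can be sketched with its standard biased empirical estimator, tr(KHLH)/(n-1)^2, where K and L are kernel matrices and H is the centering matrix. The Gaussian kernel, its bandwidth, and the toy feature lists below are assumptions for illustration; the abstract does not describe how the authors compute or weight this term.

```python
import math


def gaussian_kernel(xs, sigma=1.0):
    """Pairwise Gaussian kernel matrix for a list of feature vectors."""
    n = len(xs)
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])) / (2 * sigma**2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def center(K):
    """Double-center a kernel matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased HSIC estimate: tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(gaussian_kernel(xs, sigma))
    Lc = center(gaussian_kernel(ys, sigma))
    # For symmetric matrices, tr(Kc Lc) equals the sum of elementwise products.
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


causal = [[i / 10.0] for i in range(20)]    # toy "causal branch" features
constant = [[1.0] for _ in range(20)]       # features carrying no information
print(hsic(causal, causal) > 0.0)           # dependent on itself
print(abs(hsic(causal, constant)) < 1e-12)  # no dependence to measure
```

Used as a training penalty between the two branches' features, a value near zero would indicate that the estimator finds little statistical dependence between them.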

[239] Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection

Hanzhe Liang,Jie Zhang,Tao Dai,Linlin Shen,Jinbao Wang,Can Gao

Main category: cs.CV

TL;DR: This paper proposes DUS-Net, a novel network for reconstructing high-precision point clouds in 3D anomaly detection, achieving top performance on multiple metrics.

Details Motivation: Reconstruction-based methods face challenges in handling high-precision point clouds due to their large scale and complex structure, which this study aims to address. Method: A Down-Up Sampling Network (DUS-Net) is introduced, which includes a Noise Generation module, a Down-sampling Network (Down-Net), and an Up-sampling Network (Up-Net) to reconstruct point clouds while preserving geometric structure. Result: The method achieves SOTA performance with an Object-level AUROC of 79.9% and 79.5%, and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively. Conclusion: The proposed DUS-Net effectively reconstructs high-precision point clouds for 3D anomaly detection, achieving state-of-the-art performance on two datasets. Abstract: Reconstruction-based methods have demonstrated very promising results for 3D anomaly detection. However, these methods face great challenges in handling high-precision point clouds due to the large scale and complex structure. In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. The DUS-Net first introduces a Noise Generation module to generate noisy patches, which facilitates the diversity of training data and strengthens the feature representation for reconstruction. Then, a Down-sampling Network (Down-Net) is developed to learn an anomaly-free center point cloud from patches with noise injection. Subsequently, an Up-sampling Network (Up-Net) is designed to reconstruct high-precision point clouds by fusing multi-scale up-sampling features. Our method leverages group centers for construction, enabling the preservation of geometric structure and providing a more precise point cloud. 
Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art (SOTA) performance with an Object-level AUROC of 79.9% and 79.5%, and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively.

[240] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

Rang Meng,Yan Wang,Weipeng Wu,Ruobing Zheng,Yuming Li,Chenguang Ma

Main category: cs.CV

TL;DR: This paper presents EchoMimicV3, an efficient, high-quality human animation generation model that handles multiple tasks in a unified way.

Details Motivation: Existing large-scale video generation models suffer in practice from slow inference, high computational cost, and the need for specialized models for different tasks; this work aims to address these problems with a single efficient unified model. Method: A new unified multi-task paradigm based on spatial-temporal local reconstruction is proposed; a multi-modal decoupled cross-attention module is introduced to fuse multi-modal information; and an SFT+Reward alternating training strategy is developed. Result: EchoMimicV3 outperforms existing models in facial and semi-body video generation and can generate videos for a wide range of scenarios with precise text-based control. Conclusion: EchoMimicV3 addresses the efficiency and quality challenges of human animation generation, achieving efficient, high-quality, and versatile digital human generation. Abstract: Human animation recently has advanced rapidly, achieving increasingly realistic and vivid results, especially with the integration of large-scale video generation models. However, the slow inference speed and high computational cost of these large models bring significant challenges for practical applications. Additionally, various tasks in human animation, such as lip-syncing, audio-driven full-body animation, and video generation from start and end frames, often require different specialized models. The introduction of large video models has not alleviated this dilemma. This raises an important question: Can we make human animation Faster, Higher in quality, Stronger in generalization, and make various tasks Together in one model? To address this, we dive into video generation models and discover that the devil lies in the details: Inspired by MAE, we propose a novel unified Multi-Task paradigm for human animation, treating diverse generation tasks as spatial-temporal local reconstructions, requiring modifications only on the input side; Given the interplay and division among multi-modal conditions including text, image, and audio, we introduce a multi-modal decoupled cross-attention module to fuse multi-modals in a divide-and-conquer manner; We propose a new SFT+Reward alternating training paradigm, enabling the minimal model with 1.3B parameters to achieve generation quality comparable to models with 10 times the parameter count. Through these innovations, our work paves the way for efficient, high-quality, and versatile digital human generation, addressing both performance and practicality challenges in the field. 
Extensive experiments demonstrate that EchoMimicV3 outperforms existing models in both facial and semi-body video generation, providing precise text-based control for creating videos in a wide range of scenarios.

[241] Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs

Haifeng Zhao,Yufei Zhang,Leilei Ma,Shuo Xu,Dengdi Sun

Main category: cs.CV

TL;DR: OTDRG introduces an innovative framework using Optimal Transport to improve the alignment of image and text features in radiology report generation, enhancing both linguistic quality and clinical accuracy.

Details Motivation: General large language models (LLMs) prioritize linguistic fluency over clinical effectiveness and fail to accurately capture the relationship between X-ray images and their corresponding texts, resulting in poor clinical practicability. This work aims to address these limitations. Method: OTDRG utilizes Optimal Transport to align image features with disease labels extracted from reports, integrating image and text features for LLMs fine-tuning. It also includes a novel disease prediction module for predicting disease labels during validation and testing. Result: OTDRG achieves state-of-the-art performance on MIMIC-CXR and IU X-Ray datasets in both natural language generation (NLG) and clinical efficacy (CE) metrics, generating reports that are linguistically coherent and clinically accurate. Conclusion: The proposed OTDRG framework significantly improves the clinical practicability of radiology report generation by effectively aligning image and text features through Optimal Transport, achieving state-of-the-art performance in both natural language generation and clinical efficacy metrics. Abstract: Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus more on linguistic fluency rather than clinical effectiveness, and lack the ability to effectively capture the relationship between X-ray images and their corresponding texts, thus resulting in poor clinical practicability. To address these challenges, we propose Optimal Transport-Driven Radiology Report Generation (OTDRG), a novel framework that leverages Optimal Transport (OT) to align image features with disease labels extracted from reports, effectively bridging the cross-modal gap. 
The core component of OTDRG is Alignment & Fine-Tuning, where OT utilizes results from the encoding of label features and image visual features to minimize cross-modal distances, and then integrates image and text features for LLM fine-tuning. Additionally, we design a novel disease prediction module to predict disease labels contained in X-ray images during validation and testing. Evaluated on the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance in both natural language generation (NLG) and clinical efficacy (CE) metrics, delivering reports that are not only linguistically coherent but also clinically accurate.
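To make the optimal-transport alignment step more concrete, here is a minimal sketch of entropy-regularized OT solved with Sinkhorn iterations. The abstract does not say which OT solver or cost function OTDRG uses, so the Sinkhorn solver, the toy cost matrix, and the uniform marginals are all assumptions.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropy-regularized OT plan between histograms a (rows) and b (columns).

    cost[i][j] is the distance between image feature i and label feature j.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two image regions, two disease-label embeddings.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
ot_distance = sum(plan[i][j] * cost[i][j] for i in range(2) for j in range(2))
print(plan)
print(ot_distance)  # small: mass stays on the zero-cost pairs
```

In an alignment setting, the resulting transport cost could serve as a loss that pulls matched image and label features together; how the paper actually builds its loss is not stated in the abstract.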

[242] Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation

Ha-Hieu Pham,Nguyen Lan Vi Vu,Thanh-Huy Nguyen,Ulas Bagci,Min Xu,Trung-Nghia Le,Huy-Hieu Pham

Main category: cs.CV

TL;DR: The paper proposes Color-Structure Dual-Student (CSDS), a semi-supervised segmentation framework that improves gland segmentation in histopathology images by learning disentangled representations of stain appearance and tissue structure.

Details Motivation: Accurate gland segmentation in histopathology images is crucial for cancer diagnosis and prognosis, but variability in H&E staining and tissue morphology and limited annotated data pose challenges for automation. Method: Color-Structure Dual-Student (CSDS) with two specialized student networks and a shared teacher network, incorporating stain-aware and structure-aware uncertainty estimation modules. Result: CSDS achieved Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. Conclusion: CSDS is a novel semi-supervised segmentation framework that achieves state-of-the-art performance in low-label settings for accurate gland segmentation in histopathology images. Abstract: Accurate gland segmentation in histopathology images is essential for cancer diagnosis and prognosis. However, significant variability in Hematoxylin and Eosin (H&E) staining and tissue morphology, combined with limited annotated data, poses major challenges for automated segmentation. To address this, we propose Color-Structure Dual-Student (CSDS), a novel semi-supervised segmentation framework designed to learn disentangled representations of stain appearance and tissue structure. CSDS comprises two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, and the other on structure-augmented inputs to capture morphological cues. A shared teacher network, updated via Exponential Moving Average (EMA), supervises both students through pseudo-labels. To further improve label reliability, we introduce stain-aware and structure-aware uncertainty estimation modules that adaptively modulate the contribution of each student during training. Experiments on the GlaS and CRAG datasets show that CSDS achieves state-of-the-art performance in low-label settings, with Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. 
Our code and pre-trained models are available at https://github.com/hieuphamha19/CSDS.
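Two building blocks mentioned in this entry, the EMA teacher update and the Dice score, are standard and can be sketched briefly. The decay value, the flat parameter dictionaries, and the function names are assumptions; the abstract does not state how the two students' weights are combined before the teacher update, so the sketch takes a single source of student weights.

```python
def ema_update(teacher, student, decay=0.99):
    """Return updated teacher weights: decay * teacher + (1 - decay) * student."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def dice_score(pred, target):
    """Dice coefficient for two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)                                   # teacher moves slowly toward the student
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))    # 2 * 1 / (2 + 1)
```

Because the teacher changes slowly, its pseudo-labels tend to be more stable than those of either student at any single training step.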

[243] DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering

Rongjia Zheng,Qing Zhang,Chengjiang Long,Wei-Shi Zheng

Main category: cs.CV

TL;DR: DNF-Intrinsic improves generative inverse rendering by using source images instead of noisy inputs, achieving better results than existing methods.

Details Motivation: Existing methods using noise-to-intrinsic mapping struggle with producing high-quality results due to reliance on noisy images that lack structural and appearance details crucial for inverse rendering. Method: DNF-Intrinsic uses a pre-trained diffusion model and takes the source image as input to directly predict deterministic intrinsic properties via flow matching, while incorporating a generative renderer to ensure physical fidelity. Result: Experiments on synthetic and real-world datasets demonstrate that DNF-Intrinsic achieves superior performance over current state-of-the-art techniques. Conclusion: DNF-Intrinsic outperforms existing state-of-the-art methods in generative inverse rendering. Abstract: Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic prediction, while it is common knowledge that structure and appearance information in an image are crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. Moreover, we design a generative renderer to constrain that the predicted intrinsic properties are physically faithful to the source image. Experiments on both synthetic and real-world datasets show that our method clearly outperforms existing state-of-the-art methods.

[244] Learning Adaptive Node Selection with External Attention for Human Interaction Recognition

Chen Pang,Xuequan Lu,Qianyu Zhou,Lei Lyu

Main category: cs.CV

TL;DR: This paper proposes ASEA, a new approach for modeling human interactions using graph convolutional networks and attention mechanisms, which adaptively identifies relevant nodes and achieves superior performance.

Details Motivation: Existing methods either treat individuals as independent graphs or use static adjacency matrices that fail to capture dynamic joint interactions across different actions. Method: ASEA uses GCNs to model intra-personal relationships and introduces the AT-NAC module for identifying salient nodes. It also employs an External Attention (EA) module to model interactions between active nodes. Result: The proposed method effectively captures interaction dynamics and semantic relationships, outperforming existing approaches in modeling interactions flexibly and accurately. Conclusion: ASEA provides a dynamic and adaptive way to capture interaction relationships without predefined assumptions, achieving state-of-the-art performance. Abstract: Most GCN-based methods model interacting individuals as independent graphs, neglecting their inherent inter-dependencies. Although recent approaches utilize predefined interaction adjacency matrices to integrate participants, these matrices fail to adaptively capture the dynamic and context-specific joint interactions across different actions. In this paper, we propose the Active Node Selection with External Attention Network (ASEA), an innovative approach that dynamically captures interaction relationships without predefined assumptions. Our method models each participant individually using a GCN to capture intra-personal relationships, facilitating a detailed representation of their actions. To identify the most relevant nodes for interaction modeling, we introduce the Adaptive Temporal Node Amplitude Calculation (AT-NAC) module, which estimates global node activity by combining spatial motion magnitude with adaptive temporal weighting, thereby highlighting salient motion patterns while reducing irrelevant or redundant information. A learnable threshold, regularized to prevent extreme variations, is defined to selectively identify the most informative nodes for interaction modeling. 
To capture interactions, we design the External Attention (EA) module to operate on active nodes, effectively modeling the interaction dynamics and semantic relationships between individuals. Extensive evaluations show that our method captures interaction relationships more effectively and flexibly, achieving state-of-the-art performance.
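The node selection step can be sketched roughly as follows. This is only an illustration of combining per-joint motion magnitude with temporal weights and a threshold: in the paper the temporal weights and the threshold are learned, whereas here they are fixed inputs, and the function names and data layout are hypothetical.

```python
import math


def node_activity(frames, temporal_weights=None):
    """Weighted motion magnitude per joint.

    frames: list of T frames, each a list of J joints given as (x, y, z).
    temporal_weights: one weight per frame transition (T - 1 values);
    uniform weights are used if none are given (an assumption).
    """
    T, J = len(frames), len(frames[0])
    # Euclidean displacement of every joint between consecutive frames.
    mags = [
        [math.dist(frames[t + 1][j], frames[t][j]) for j in range(J)]
        for t in range(T - 1)
    ]
    if temporal_weights is None:
        temporal_weights = [1.0 / (T - 1)] * (T - 1)
    return [
        sum(temporal_weights[t] * mags[t][j] for t in range(T - 1))
        for j in range(J)
    ]


def select_active(activity, threshold):
    """Indices of joints whose activity exceeds the threshold."""
    return [j for j, a in enumerate(activity) if a > threshold]


# Joint 0 stays still; joint 1 moves one unit along x per frame.
frames = [[(0.0, 0.0, 0.0), (float(t), 0.0, 0.0)] for t in range(4)]
activity = node_activity(frames)
print(activity)
print(select_active(activity, threshold=0.5))
```

Only the selected joints would then be passed to the interaction module, which keeps static or redundant joints out of the cross-person attention.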

[245] VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial Supervision

Kezhong Liu,Yiwen Zhou,Mozi Chen,Jianhua He,Jingao Xu,Zheng Yang,Chris Xiaoxuan Lu,Shengkai Zhang

Main category: cs.CV

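TL;DR: This paper proposes a mmWave radar scene flow estimation method supervised by visual-inertial sensor data, which removes the dependence of conventional methods on LiDAR and performs strongly in complex environments.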

Details Motivation: Existing mmWave radar scene flow estimation methods typically rely on supervision from dense 3D LiDAR point clouds, which are expensive and not widely available in smart vehicles; visual images cannot capture the 3D motion of moving objects, and the temporal drift of visual-inertial data degrades scene flow estimation for static points. Method: A drift-free rigid transformation estimator is proposed that fuses kinematic model-based ego-motion with neural network-learned results, together with an optical-mmWave supervision extraction module that strengthens the supervision signal using joint constraints from optical and mmWave radar measurements. Result: Experiments show that in smoke-filled environments the method outperforms state-of-the-art approaches that depend on costly LiDAR, with improvements in scene flow estimation for both static and dynamic points. Conclusion: The paper proposes a new mmWave radar scene flow estimation framework that uses widely available visual-inertial sensor data for supervision, removing the dependence of existing methods on expensive and hard-to-obtain dense 3D LiDAR point clouds. Experiments show that in smoke-filled environments it even outperforms state-of-the-art methods that use costly LiDAR. Abstract: This work proposes a mmWave radar's scene flow estimation framework supervised by data from a widespread visual-inertial (VI) sensor suite, allowing crowdsourced training data from smart vehicles. Current scene flow estimation methods for mmWave radar are typically supervised by dense point clouds from 3D LiDARs, which are expensive and not widely available in smart vehicles. While VI data are more accessible, visual images alone cannot capture the 3D motions of moving objects, making it difficult to supervise their scene flow. Moreover, the temporal drift of VI rigid transformation also degenerates the scene flow estimation of static points. To address these challenges, we propose a drift-free rigid transformation estimator that fuses kinematic model-based ego-motions with neural network-learned results. It provides strong supervision signals to radar-based rigid transformation and infers the scene flow of static points. Then, we develop an optical-mmWave supervision extraction module that extracts the supervision signals of radar rigid transformation and scene flow. It strengthens the supervision by learning the scene flow of dynamic points with the joint constraints of optical and mmWave radar measurements. Extensive experiments demonstrate that, in smoke-filled environments, our method even outperforms state-of-the-art (SOTA) approaches using costly LiDARs.

[246] Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study

Kai Ye,Tianyi Chen,Zhen Wang

Main category: cs.CV

TL;DR: This paper evaluates eight perturbation-based protection methods for diffusion models to address privacy and content misuse concerns, comparing their effectiveness across different scenarios.

Details Motivation: With the increasing adoption of diffusion models for image generation and personalization, concerns about privacy breaches and content misuse have grown, prompting the need for effective protective measures. Method: A comprehensive comparison was conducted among eight perturbation-based protection methods across portrait and artwork domains, evaluating them under varying perturbation budgets using multiple assessment metrics. Result: The results indicate varying levels of visual imperceptibility and protective efficacy among the tested methods, providing insights into which methods perform better under specific conditions. Conclusion: The study concludes that there are notable differences in the effectiveness of perturbation-based protection methods for diffusion models, offering practical guidance for their selection. Abstract: With the increasing adoption of diffusion models for image generation and personalization, concerns regarding privacy breaches and content misuse have become more pressing. In this study, we conduct a comprehensive comparison of eight perturbation-based protection methods (AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC) across both portrait and artwork domains. These methods are evaluated under varying perturbation budgets, using a range of metrics to assess visual imperceptibility and protective efficacy. Our results offer practical guidance for method selection. Code is available at: https://github.com/vkeilo/DiffAdvPerturbationBench.
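A "perturbation budget" is commonly enforced by projecting the protected image back into a small ball around the original. The sketch below assumes an L-infinity budget and pixel values in [0, 1]; the abstract does not say which norm or value range the benchmark uses, and `project_linf` is a hypothetical helper, not part of the linked repository.

```python
def project_linf(original, perturbed, eps):
    """Clip a perturbed image so every pixel stays within eps of the original.

    Both images are flat lists of floats in [0, 1]; the result is also
    clamped to that valid pixel range.
    """
    projected = []
    for x, x_adv in zip(original, perturbed):
        delta = max(-eps, min(eps, x_adv - x))         # enforce the budget
        projected.append(max(0.0, min(1.0, x + delta)))  # keep a valid pixel
    return projected


original = [0.50, 0.50, 0.98]
perturbed = [0.90, 0.45, 1.20]
print(project_linf(original, perturbed, eps=0.05))
```

A larger `eps` generally gives a protection method more room to disrupt personalization but makes the perturbation easier to see, which is the trade-off the study measures.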

[247] Robust Low-light Scene Restoration via Illumination Transition

Ze Li,Feng Zhang,Xiatian Zhu,Meng Zhang,Yanghong Zhou,P. Y. Mok

Main category: cs.CV

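TL;DR: This paper proposes RoSe, a new framework that synthesizes normal-light novel views from low-light multiview images, addressing the color distortion, artifacts, and limited denoising of existing methods.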

Details Motivation: Existing low-light enhancement methods struggle to preprocess low-light inputs effectively and fail to consider correlations among multiple views. Other state-of-the-art methods introduce illumination-related components, but these often cause color distortions and artifacts and provide limited denoising. Method: The paper proposes RoSe, a robust low-light scene restoration framework that establishes a multiview-consistent illuminance transition field in 3D space, implemented with a concise dual-branch architecture and a low-rank denoising module, to synthesize normal-light novel views from low-light multiview images. Result: Experiments show that RoSe significantly outperforms existing models on standard benchmarks while achieving more effective denoising and high-quality view synthesis. Conclusion: RoSe significantly outperforms existing models, showing higher rendering quality and multiview consistency on standard benchmarks. Abstract: Synthesizing normal-light novel views from low-light multiview images is an important yet challenging task, given the low visibility and high ISO noise present in the input images. Existing low-light enhancement methods often struggle to effectively preprocess such low-light inputs, as they fail to consider correlations among multiple views. Although other state-of-the-art methods have introduced illumination-related components offering alternative solutions to the problem, they often result in drawbacks such as color distortions and artifacts, and they provide limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework (RoSe), which enables effective synthesis of novel views in normal lighting conditions from low-light multiview image inputs, by formulating the task as an illuminance transition estimation problem in 3D space, conceptualizing it as a specialized rendering task. This multiview-consistent illuminance transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To implement RoSe, we design a concise dual-branch architecture and introduce a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard benchmarks. The codes and data are available at https://pegasus2004.github.io/RoSe.

[248] Flux-Sculptor: Text-Driven Rich-Attribute Portrait Editing through Decomposed Spatial Flow Control

Tianyao He,Runqi Wang,Yang Chen,Dejia Song,Nemo Chen,Xu Tang,Yao Hu

Main category: cs.CV

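TL;DR: This paper proposes Flux-Sculptor, a text-driven portrait editing framework whose localization and editing strategies yield better edits while preserving facial detail.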

Details Motivation: An ideal text-driven portrait editing method should achieve precise localization and appropriate content modification, but existing methods struggle to balance reconstruction fidelity and editing flexibility. Method: The framework introduces a Prompt-Aligned Spatial Locator (PASL) to identify the relevant editing regions and a Structure-to-Detail Edit Control (S2D-EC) strategy to guide the denoising process. Result: Flux-Sculptor outperforms existing methods in rich-attribute editing and facial information preservation. Conclusion: Flux-Sculptor is a flux-based framework for precise text-driven portrait editing that offers good detail control and facial information preservation, making it suitable for practical applications. Abstract: Text-driven portrait editing holds significant potential for various applications but also presents considerable challenges. An ideal text-driven portrait editing approach should achieve precise localization and appropriate content modification, yet existing methods struggle to balance reconstruction fidelity and editing flexibility. To address this issue, we propose Flux-Sculptor, a flux-based framework designed for precise text-driven portrait editing. Our framework introduces a Prompt-Aligned Spatial Locator (PASL) to accurately identify relevant editing regions and a Structure-to-Detail Edit Control (S2D-EC) strategy to spatially guide the denoising process through sequential mask-guided fusion of latent representations and attention values. Extensive experiments demonstrate that Flux-Sculptor surpasses existing methods in rich-attribute editing and facial information preservation, making it a strong candidate for practical portrait editing applications. Project page is available at https://flux-sculptor.github.io/.

[249] CoT-Segmenter: Enhancing OOD Detection in Dense Road Scenes via Chain-of-Thought Reasoning

Jeonghyo Song,Kimin Yun,DaeUng Jo,Jinyoung Kim,Youngjoon Yoo

Main category: cs.CV

TL;DR: This paper proposes a novel Chain-of-Thought (CoT)-based framework leveraging large language models like GPT-4 to improve out-of-distribution (OOD) detection in complex road scenes, addressing challenges such as densely packed objects, distant small objects, and large foreground-dominant objects.

Details Motivation: The motivation stems from the need for effective out-of-distribution (OOD) detection to ensure the reliability of semantic segmentation models, especially in complex road environments where safety and accuracy are critical. Additionally, while recent advancements in large language models have enhanced multimodal reasoning, the application of CoT-based visual reasoning for OOD semantic segmentation has not been extensively explored. Method: The method uses the knowledge and reasoning capabilities of foundation models like GPT-4 to enhance OOD detection through improved image understanding and prompt-based reasoning tailored to problematic scene attributes. Result: Extensive experiments show that the proposed framework consistently outperforms state-of-the-art methods on both standard benchmarks and a newly defined challenging subset of the RoadAnomaly dataset. Conclusion: The proposed CoT-based framework offers a robust and interpretable solution for OOD semantic segmentation in complex driving environments. Abstract: Effective Out-of-Distribution (OOD) detection is critical for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT-4, which significantly enhanced multimodal reasoning through Chain-of-Thought (CoT) prompting, the application of CoT-based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of the road scene anomalies, we identify three challenging scenarios where current state-of-the-art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground-dominant objects. To address the presented challenges, we propose a novel CoT-based framework targeting OOD detection in road anomaly scenes.
Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT-4, to enhance OOD detection through improved image understanding and prompt-based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state-of-the-art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.

[250] LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts

Aleksandr Gushchin,Maksim Smirnov,Dmitriy Vatolin,Anastasia Antsiferova

Main category: cs.CV

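TL;DR: This study introduces the LEHA-CVQAD video quality assessment dataset and a new evaluation metric, RDAE, with the aim of improving video codec parameter optimization.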

Details Motivation: Existing video quality assessment methods exhibit high RDAE and low correlations, so more effective datasets and metrics are needed to support codec parameter tuning. Method: A large-scale human-annotated dataset, LEHA-CVQAD, comprising 6,240 clips is constructed, and a new evaluation metric, RDAE, is proposed to quantify how well video quality models preserve bitrate-quality ordering. Result: Tests show that popular VQA metrics exhibit high RDAE and low correlations on LEHA-CVQAD, demonstrating the dataset's utility and difficulty. Conclusion: The LEHA-CVQAD dataset and the proposed RDAE metric offer a new challenge and a practical tool for video quality assessment; part of the data and the results are publicly available. Abstract: We propose the LEHA-CVQAD (Large-scale Enriched Human-Annotated) dataset, which comprises 6,240 clips for compression-oriented video quality assessment. 59 source videos are encoded with 186 codec-preset variants; 1.8M pairwise and 1.5k MOS ratings are fused into a single quality scale; part of the videos remains hidden for blind evaluation. We also propose Rate-Distortion Alignment Error (RDAE), a novel evaluation metric that quantifies how well VQA models preserve bitrate-quality ordering, directly supporting codec parameter tuning. Testing IQA/VQA methods reveals that popular VQA metrics exhibit high RDAE and lower correlations, underscoring the dataset challenges and utility. The open part and the results of LEHA-CVQAD are available at https://aleksandrgushchin.github.io/lcvqad/

[251] NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models

Siyu Li,Fei Teng,Yihong Cao,Kailun Yang,Zhiyong Li,Yaonan Wang

Main category: cs.CV

TL;DR: NRSeg is a noise-resilient learning framework for BEV semantic segmentation that uses PGCM, BiDPP, and HLSE modules to achieve state-of-the-art performance.

Details Motivation: Enhance diversity of labeled data for robustifying BEV segmentation using synthetic data from driving world models. Method: PGCM, BiDPP, and HLSE modules are used to improve robustness and address non-mutual exclusivity. Result: Highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks respectively. Conclusion: NRSeg achieves state-of-the-art performance in BEV semantic segmentation tasks. Abstract: Birds' Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, as pivotal for real-world applications, underperform due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. 
Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at https://github.com/lynn-yu/NRSeg.
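
The abstract above mentions that the Dirichlet branch of BiDPP uses evidential deep learning for uncertainty quantification. As a minimal sketch of the standard evidential formulation (not NRSeg's actual implementation, which the abstract does not specify), non-negative per-class evidence can be mapped to Dirichlet parameters, expected class probabilities, and a scalar uncertainty. The function name `dirichlet_from_evidence` is hypothetical.

```python
def dirichlet_from_evidence(evidence):
    """Map non-negative per-class evidence to expected probabilities and uncertainty.

    Standard evidential deep learning formulation (an assumption here, not taken
    from the NRSeg paper): alpha_k = e_k + 1, S = sum(alpha), p_k = alpha_k / S,
    and uncertainty u = K / S, where K is the number of classes.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = num_classes / strength     # equals 1.0 when there is no evidence
    return probs, uncertainty


if __name__ == "__main__":
    # No evidence: uniform probabilities and maximal uncertainty.
    print(dirichlet_from_evidence([0.0, 0.0, 0.0]))
    # Strong evidence for class 0: lower uncertainty.
    print(dirichlet_from_evidence([9.0, 0.0, 0.0]))
```

In this formulation the uncertainty shrinks as total evidence grows, which is what lets a model flag noisy synthetic samples as low-confidence rather than forcing a confident label.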

[252] Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing

Seungjin Jung,Kanghee Lee,Yonghyun Jeong,Haeun Noh,Jungmin Lee,Jongwon Choi

Main category: cs.CV

TL;DR: This paper introduces a new Domain Generalizable Face Anti-Spoofing (DGFAS) framework that addresses bias misalignment issues by jointly aligning weights and biases using Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM).

Details Motivation: Existing DGFAS methods only align local decision boundary directions (weights), but not the bias terms, resulting in inconsistent thresholds and degraded performance on unseen domains. Method: The method involves Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM) to align weights and biases across domains. FOD uses Gram-Schmidt orthogonalization to separate domain-invariant and domain-specific features, while GS-RM balances group-wise losses for bias alignment. Result: Extensive experiments show that the proposed approach achieves state-of-the-art performance, improves accuracy, reduces bias misalignment, and enhances generalization stability on unseen target domains. Conclusion: The proposed DGFAS framework effectively aligns both weights and biases, achieving state-of-the-art performance with improved accuracy and reduced bias misalignment on unseen domains. Abstract: Domain Generalizable Face Anti-Spoofing (DGFAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DGFAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. 
Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
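
Since the abstract uses Expected Calibration Error (ECE) as its evaluation metric, a short sketch of the commonly used binned definition may help. It assumes equal-width confidence bins and a default of 10 bins; the paper's exact binning protocol is not stated in the abstract, and the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    `confidences` are predicted confidences in [0, 1]; `correct` are booleans
    saying whether each prediction was right. Equal-width bins are an assumption.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 0.95 confidence, three of them correct.
    print(expected_calibration_error([0.95] * 4, [True, True, True, False]))
```

A classifier whose decision threshold is shifted on an unseen domain tends to be over- or under-confident there, which is presumably why a calibration metric is used as a proxy for bias misalignment.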

[253] Habitat Classification from Ground-Level Imagery Using Deep Neural Networks

Hongrui Shi,Lisa Norton,Lucy Ridding,Simon Rolph,Tom August,Claire M Wood,Lan Qie,Petra Bosilj,James M Brown

Main category: cs.CV

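TL;DR: This study applies advanced deep learning models (such as vision transformers) to classify ground-level habitat images. The results show that they outperform conventional approaches and approach expert-level performance, supporting efficient, low-cost habitat monitoring.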

Details Motivation: Traditional expert field surveys are costly, and existing AI-driven habitat mapping based on remote sensing is constrained by sensor availability, weather, and resolution, so more efficient, fine-grained automated tools are needed to enhance biodiversity and guide conservation priorities. Method: State-of-the-art deep neural network architectures, including convolutional neural networks (CNNs) and vision transformers (ViTs), are applied to data from the UK Countryside Survey covering 18 broad habitat types, and model performance is evaluated under both supervised and supervised contrastive learning paradigms. Result: ViTs consistently outperform CNN baselines on key classification metrics (Top-3 accuracy = 91%, MCC = 0.66); supervised contrastive learning reduces misclassification among visually similar habitats, and the ViTs offer more interpretable scene understanding. Conclusion: ViTs outperform CNNs for ground-level habitat image classification, and supervised contrastive learning significantly reduces misclassification among visually similar habitats. The best model performs on par with experienced ecological experts, providing a scalable, cost-effective framework for biodiversity conservation and national-scale land-use decisions. Abstract: Habitat assessment at local scales -- critical for enhancing biodiversity and guiding conservation priorities -- often relies on expert field survey that can be costly, motivating the exploration of AI-driven tools to automate and refine this process. While most AI-driven habitat mapping depends on remote sensing, it is often constrained by sensor availability, weather, and coarse resolution. In contrast, ground-level imagery captures essential structural and compositional cues invisible from above and remains underexplored for robust, fine-grained habitat classification. This study addresses this gap by applying state-of-the-art deep neural network architectures to ground-level habitat imagery. Leveraging data from the UK Countryside Survey covering 18 broad habitat types, we evaluate two families of models -- convolutional neural networks (CNNs) and vision transformers (ViTs) -- under both supervised and supervised contrastive learning paradigms. Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics (Top-3 accuracy = 91%, MCC = 0.66) and offer more interpretable scene understanding tailored to ground-level images. Moreover, supervised contrastive learning significantly reduces misclassification rates among visually similar habitats (e.g., Improved vs. Neutral Grassland), driven by a more discriminative embedding space.
Finally, our best model performs on par with experienced ecological experts in habitat classification from images, underscoring the promise of expert-level automated assessment. By integrating advanced AI with ecological expertise, this research establishes a scalable, cost-effective framework for ground-level habitat monitoring to accelerate biodiversity conservation and inform land-use decisions at the national scale.
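
The two headline metrics in this abstract, Top-3 accuracy and the Matthews correlation coefficient (MCC), can be computed as in the sketch below. It uses the standard multiclass generalization of MCC; the function names and toy inputs are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def top_k_accuracy(scores, labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC: (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum p_k^2)(s^2 - sum t_k^2))."""
    s = len(y_true)                                    # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))    # correctly classified samples
    t_counts = Counter(y_true)                         # occurrences of each true class
    p_counts = Counter(y_pred)                         # occurrences of each predicted class
    classes = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in classes)
    denominator = sqrt(
        (s * s - sum(v * v for v in p_counts.values()))
        * (s * s - sum(v * v for v in t_counts.values()))
    )
    # Convention: return 0.0 when the denominator vanishes (e.g. a single class).
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    scores = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
    print(top_k_accuracy(scores, labels=[0, 0], k=3))
    print(multiclass_mcc([0, 1, 2, 0], [0, 1, 2, 0]))
```

Unlike accuracy, MCC accounts for class imbalance, which matters when some of the 18 habitat types are far rarer than others.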

[254] Exploring Kolmogorov-Arnold Network Expansions in Vision Transformers for Mitigating Catastrophic Forgetting in Continual Learning

Zahid Ullah,Jihie Kim

Main category: cs.CV

TL;DR: This paper explores the use of Kolmogorov-Arnold Networks (KANs) in Vision Transformers (ViTs) to address catastrophic forgetting in continual learning, showing improved performance in retaining prior knowledge while adapting to new tasks.

Details Motivation: The motivation stems from the challenge of catastrophic forgetting in vision transformers (ViTs) during continual learning, especially when utilizing MLPs for global representation learning. This research aims to find a more robust solution to preserve previously acquired knowledge while adapting to new tasks. Method: The research proposes replacing MLPs in ViTs with Kolmogorov-Arnold Networks (KANs), which use spline-based activations for local plasticity. The efficacy of this approach is evaluated through experiments on benchmark datasets like MNIST and CIFAR100 in continual learning scenarios. Result: Experimental results show that KAN-based ViTs effectively mitigate catastrophic forgetting and outperform traditional MLP-based ViTs in retaining knowledge and adapting to new tasks. Conclusion: The study concludes that integrating KANs into ViTs significantly improves their ability to handle continual learning tasks, reducing catastrophic forgetting and enhancing adaptability in dynamic environments. Abstract: Continual learning (CL), the ability of a model to learn new tasks without forgetting previously acquired knowledge, remains a critical challenge in artificial intelligence, particularly for vision transformers (ViTs) utilizing Multilayer Perceptrons (MLPs) for global representation learning. Catastrophic forgetting, where new information overwrites prior knowledge, is especially problematic in these models. This research proposes replacing MLPs in ViTs with Kolmogorov-Arnold Networks (KANs) to address this issue. KANs leverage local plasticity through spline-based activations, ensuring that only a subset of parameters is updated per sample, thereby preserving previously learned knowledge. The study investigates the efficacy of KAN-based ViTs in CL scenarios across benchmark datasets (MNIST, CIFAR100), focusing on their ability to retain accuracy on earlier tasks while adapting to new ones.
Experimental results demonstrate that KAN-based ViTs significantly mitigate catastrophic forgetting, outperforming traditional MLP-based ViTs in knowledge retention and task adaptation. This novel integration of KANs into ViTs represents a promising step toward more robust and adaptable models for dynamic environments.

[255] PresentAgent: Multimodal Agent for Presentation Video Generation

Jingwei Shi,Zeyu Zhang,Biao Wu,Yanjie Liang,Meng Fang,Ling Chen,Yang Zhao

Main category: cs.CV

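TL;DR: PresentAgent is a multimodal agent that automatically converts long documents into high-quality narrated presentation videos; the paper also proposes PresentEval, a new evaluation framework for measuring output quality comprehensively.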

Details Motivation: Existing methods are limited to generating static slides or text summaries; this work aims to go beyond those limits and produce fully synchronized audio-visual content that more closely resembles human-style presentations. Method: PresentAgent uses a modular pipeline comprising document segmentation, slide-style visual frame generation, narration generation with large language models and text-to-speech models, and final composition with audio-visual alignment. It also introduces the PresentEval framework to assess content fidelity, visual clarity, and audience comprehension. Result: Validation on a dataset of 30 document-presentation pairs shows that PresentAgent approaches human-level quality on all evaluation metrics. Conclusion: Using a controllable multimodal agent, PresentAgent turns static textual material into dynamic, effective, and accessible presentation formats at close to human-level quality, showing strong potential for document-to-video applications. Abstract: We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.

[256] T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images

Christopher Wiedeman,Anastasiia Sarmakeeva,Elena Sizikova,Daniil Filienko,Miguel Lago,Jana G. Delfino,Aldo Badano

Main category: cs.CV

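TL;DR: This paper proposes a physics-simulation-based method for generating synthetic medical images with pixel-level annotations and releases T-SYNTH, a large-scale open dataset for breast imaging, to address the shortage of medical imaging data.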

Details Motivation: A key obstacle to developing and assessing robust medical imaging algorithms is the difficulty of obtaining large-scale datasets with suitable annotations. To address this, the authors propose using synthetic data that respects physical and biological constraints to ease these data limitations. Method: The paper proposes a physics-simulation-based method for generating synthetic medical images with pixel-level segmentation annotations, applies it to breast imaging analysis, and releases the T-SYNTH dataset. Result: Experimental results indicate that T-SYNTH images show promise for detection tasks in 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT). Conclusion: The T-SYNTH dataset shows potential for augmenting limited real patient datasets in DM and DBT detection tasks, and the data and code are publicly available. Abstract: One of the key impediments for developing and assessing robust medical imaging algorithms is limited access to large-scale datasets with suitable annotations. Synthetic data generated with plausible physical and biological constraints may address some of these data limitations. We propose the use of physics simulations to generate synthetic images with pixel-level segmentation annotations, which are notoriously difficult to obtain. Specifically, we apply this approach to breast imaging analysis and release T-SYNTH, a large-scale open-source dataset of paired 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images. Our initial experimental results indicate that T-SYNTH images show promise for augmenting limited real patient datasets for detection tasks in DM and DBT. Our data and code are publicly available at https://github.com/DIDSR/tsynth-release.

[257] Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu,Xilin Wang,Yixuan Li,Zhuofan Zhang,Xiaojian Ma,Yixin Chen,Baoxiong Jia,Wei Liang,Qian Yu,Zhidong Deng,Siyuan Huang,Qing Li

Main category: cs.CV

TL;DR: This paper introduces MTU3D, an innovative framework that enhances embodied scene understanding through active perception and 3D vision-language integration, leading to improved exploration and understanding of environments.

Details Motivation: The motivation stems from the limitation of existing 3D Vision-Language models which focus only on static observations without actively perceiving or exploring the environment. Method: The method involves three key innovations: online query-based representation learning, a unified objective for grounding and exploring, and end-to-end trajectory learning combining pre-training on diverse trajectories from simulated and real-world data. Result: MTU3D outperforms state-of-the-art methods by significant percentages across various benchmarks in success rate, demonstrating its effectiveness in embodied navigation tasks. Conclusion: The paper concludes that MTU3D, by integrating active perception with 3D vision-language learning, significantly improves the performance of embodied agents in exploring and understanding their environment. Abstract: Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection.
3) End-to-end trajectory learning that combines Vision-Language-Exploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 23%, 9%, and 2% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.

[258] Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation

Ziying Song,Lin Liu,Hongyu Pan,Bencheng Liao,Mingzhe Guo,Lei Yang,Yongchang Zhang,Shaoqing Xu,Caiyan Jia,Yadan Luo

Main category: cs.CV

TL;DR: DIVER is an end-to-end driving framework that combines reinforcement learning with diffusion-based generation to enhance trajectory diversity and generalization capability in autonomous driving.

Details Motivation: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. This work aims to overcome these limitations by introducing diversity and feasibility into trajectory generation. Method: DIVER integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. It conditions on map elements and surrounding agents, generating multiple reference trajectories from a single ground-truth trajectory and employing reinforcement learning to guide the diffusion process with reward-based supervision. Result: Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity and effectively addresses the mode collapse problem inherent in imitation learning. Conclusion: DIVER effectively addresses the mode collapse problem inherent in imitation learning by improving trajectory diversity through a reinforced diffusion-based generation mechanism. Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. 
Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

[259] Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery

Xiao Liu,Nan Pu,Haiyang Zheng,Wenjing Li,Nicu Sebe,Zhun Zhong

Main category: cs.CV

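TL;DR: This paper proposes DiffGRE, a diffusion-based OCD framework for the online identification of newly arriving stream data that may belong to known or unknown categories.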

Details Motivation: Existing OCD methods learn transferable knowledge only from labeled data, which limits their transferability. Moreover, when few annotated data or categories are available in fine-grained recognition, the knowledge contained in known categories is often insufficient. Method: The paper proposes DiffGRE, a diffusion-based OCD framework that integrates Generation, Refinement, and Encoding in a multi-stage fashion. Result: Extensive experiments on six fine-grained datasets show that DiffGRE outperforms previous methods. Conclusion: DiffGRE effectively addresses the limited transferability in OCD tasks that results from insufficient knowledge in known categories. Abstract: In this paper, we investigate a practical yet challenging task: On-the-fly Category Discovery (OCD). This task focuses on the online identification of newly arriving stream data that may belong to both known and unknown categories, utilizing the category knowledge from only labeled data. Existing OCD methods are devoted to fully mining transferable knowledge from only labeled data. However, the transferability learned by these methods is limited because the knowledge contained in known categories is often insufficient, especially when few annotated data/categories are available in fine-grained recognition. To mitigate this limitation, we propose a diffusion-based OCD framework, dubbed DiffGRE, which integrates Generation, Refinement, and Encoding in a multi-stage fashion. Specifically, we first design an attribute-composition generation method based on cross-image interpolation in the diffusion latent space to synthesize novel samples. Then, we propose a diversity-driven refinement approach to select the synthesized images that differ from known categories for subsequent OCD model training. Finally, we leverage a semi-supervised leader encoding to inject additional category knowledge contained in synthesized data into the OCD models, which can benefit the discovery of both known and unknown categories during the on-the-fly inference process. Extensive experiments demonstrate the superiority of our DiffGRE over previous methods on six fine-grained datasets.

[260] Temporal Continual Learning with Prior Compensation for Human Motion Prediction

Jianwei Tang,Jiangxin Sun,Xiaotong Lin,Lifang Zhang,Wei-Shi Zheng,Jian-Fang Hu

Main category: cs.CV

TL;DR: This paper proposes a new multi-stage training framework called Temporal Continual Learning (TCL) for Human Motion Prediction (HMP) that improves the learning of short-term predictions by preserving prior information through a Prior Compensation Factor (PCF). The framework is flexible and effective, as demonstrated through experiments on multiple datasets.

Details Motivation: The motivation of the paper is to overcome limitations in previous approaches to Human Motion Prediction (HMP), which treat the prediction of various moments equally, hindering short-term predictions and limiting the incorporation of prior information from past predictions. Method: The paper introduces a novel multi-stage training framework called Temporal Continual Learning (TCL) with a Prior Compensation Factor (PCF) to preserve prior information. Theoretical derivation is used to achieve a more reasonable optimization objective. Result: Extensive experiments on four HMP benchmark datasets demonstrate the effectiveness and flexibility of the TCL framework, showing its ability to be integrated with different HMP backbone models and adapted to various datasets and applications. Conclusion: The paper concludes that the proposed Temporal Continual Learning (TCL) framework effectively addresses the challenges in Human Motion Prediction (HMP), offering flexibility and effectiveness across various HMP backbone models and datasets. Abstract: Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent predictions is limited. In this paper, we introduce a novel multi-stage training framework called Temporal Continual Learning (TCL) to address the above challenges. To better preserve prior information, we introduce the Prior Compensation Factor (PCF). We incorporate it into the model training to compensate for the lost prior information. Furthermore, we derive a more reasonable optimization objective through theoretical derivation. 
It is important to note that our TCL framework can be easily integrated with different HMP backbone models and adapted to various datasets and applications. Extensive experiments on four HMP benchmark datasets demonstrate the effectiveness and flexibility of TCL. The code is available at https://github.com/hyqlat/TCL.

[261] Consistent and Invariant Generalization Learning for Short-video Misinformation Detection

Hanghui Guo,Weijie Shi,Mengze Li,Juncheng Li,Hao Chen,Yue Cui,Jiajie Xu,Jia Zhu,Jiawei Shen,Zhangze Chen,Sirui Han

Main category: cs.CV

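TL;DR: To address domain generalization in short-video misinformation detection, this paper proposes a new model, DOCTOR, which combines cross-modal feature interpolation, interpolation distillation, and a diffusion model to strengthen multi-modal learning and the extraction of domain-invariant features.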

Details Motivation: Existing models trained on particular domains (source domains) perform poorly on unseen domains (target domains), so an effective domain generalization method is needed to bridge the domain gap. In addition, different domains may rely on different modalities for detection, and domain biases can accumulate during cross-modal fusion and harm the final identification of misinformation. Method: The study proposes cross-modal feature interpolation and interpolation distillation, and designs a diffusion model that enhances domain-invariant features through cross-modal guided denoising. Result: The proposed DOCTOR model performs well across multiple experiments, effectively addressing domain generalization and improving short-video misinformation detection. Conclusion: The paper proposes DOCTOR, a new domain generalization model for short-video misinformation detection, and validates its effectiveness through extensive experiments. Abstract: Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify the misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we provide deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audios). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation.
To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We employ cross-modal feature interpolation to map multiple modalities into a shared space and interpolation distillation to synchronize multi-modal learning; (2) We design a diffusion model that adds noise to retain multi-modal core features and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.

[262] Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic

Jianwei Tang,Hong Yang,Tengyue Chen,Jian-Fang Hu

Main category: cs.CV

TL;DR: This paper proposes an improved method for action-driven stochastic human motion prediction using two specialized memory banks and an attention adjustment strategy, leading to better performance than existing techniques.

Details Motivation: The motivation stems from the challenges in generating smooth transitions between actions and learning distinct action characteristics due to their similarities, which lead to unreasonable and inconsistent predictions in human motion sequences. Method: The method involves the use of two memory banks: the Soft-transition Action Bank (STAB) for storing action transition information with a soft searching approach, and the Action Characteristic Bank (ACB) for recording action characteristics. Additionally, the Adaptive Attention Adjustment (AAA) strategy is introduced to better fuse features from both banks. Result: Extensive experiments on four motion prediction datasets demonstrate that the proposed approach consistently surpasses previous state-of-the-art methods. Conclusion: The paper concludes that the proposed method, using two memory banks (STAB and ACB) along with the AAA strategy, outperforms previous state-of-the-art approaches in action-driven stochastic human motion prediction. Abstract: Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. Firstly, generating smooth transition motions is hard due to the varying transition speeds of different actions. Secondly, the action characteristic is difficult to be learned because of the similarity of some actions. These issues cause the predicted results to be unreasonable and inconsistent. As a result, we propose two memory banks, the Soft-transition Action Bank (STAB) and Action Characteristic Bank (ACB), to tackle the problems above. The STAB stores the action transition information. It is equipped with the novel soft searching approach, which encourages the model to focus on multiple possible action categories of observed motions. 
The ACB records action characteristic, which produces more prior information for predicting certain actions. To fuse the features retrieved from the two banks better, we further propose the Adaptive Attention Adjustment (AAA) strategy. Extensive experiments on four motion prediction datasets demonstrate that our approach consistently outperforms the previous state-of-the-art. The demo and code are available at https://hyqlat.github.io/STABACB.github.io/.

[263] VICI: VLM-Instructed Cross-view Image-localisation

Xiaohan Zhang,Tavis Shore,Chen Chen,Oscar Mendez,Simon Hadfield,Safwan Wshah

Main category: cs.CV

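TL;DR: This paper presents a high-performing solution to the UAVM 2025 Challenge, which focuses on matching limited field-of-view (FOV) street-level images to the corresponding satellite imagery using the University-1652 dataset. A two-stage approach, retrieving candidate satellite image embeddings and then re-ranking them for higher precision, handles the viewpoint and scale variations found in real-world scenarios.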

Details Motivation: As panoramic cross-view geo-localisation nears peak performance, exploring more practical problem formulations becomes increasingly important. Real-world queries are usually limited-FOV images rather than panoramas. Method: The method has two stages: it first retrieves candidate satellite image embeddings for a given query, then applies a re-ranking stage that selectively improves retrieval accuracy among the top candidates. Result: Experiments show that the method attains R@1 and R@10 retrieval rates of \topone\% and \topten\%, respectively, demonstrating the potential of optimised retrieval and re-ranking strategies for improving practical geo-localisation performance. Conclusion: The study shows that optimised retrieval and re-ranking can achieve high performance when matching limited-FOV street-level images to satellite images, opening new possibilities for practical applications. Abstract: In this paper, we present a high-performing solution to the UAVM 2025 Challenge, which focuses on matching narrow FOV street-level images to corresponding satellite imagery using the University-1652 dataset. As panoramic Cross-View Geo-Localisation nears peak performance, it becomes increasingly important to explore more practical problem formulations. Real-world scenarios rarely offer panoramic street-level queries; instead, queries typically consist of limited-FOV images captured with unknown camera parameters. Our work prioritises discovering the highest achievable performance under these constraints, pushing the limits of existing architectures. Our method begins by retrieving candidate satellite image embeddings for a given query, followed by a re-ranking stage that selectively enhances retrieval accuracy within the top candidates. This two-stage approach enables more precise matching, even under the significant viewpoint and scale variations inherent in the task. Through experimentation, we demonstrate that our approach achieves competitive results, specifically attaining R@1 and R@10 retrieval rates of \topone\% and \topten\% respectively. This underscores the potential of optimised retrieval and re-ranking strategies in advancing practical geo-localisation performance. Code is available at https://github.com/tavisshore/VICI.
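
The retrieve-then-re-rank pattern and the R@K metric mentioned in this entry can be illustrated with a minimal sketch. It is not the VICI implementation: the function names, the use of cosine similarity for the first stage, and the externally supplied `rerank_scores` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def retrieve_top_k(query_emb, gallery_emb, k):
    """Stage 1 (assumed): rank gallery items by cosine similarity, keep the k best."""
    sims = l2_normalize(gallery_emb) @ l2_normalize(query_emb)
    order = np.argsort(-sims)[:k]
    return order, sims[order]

def rerank(candidates, rerank_scores):
    """Stage 2 (assumed): reorder only the shortlisted candidates by a second score."""
    return candidates[np.argsort(-rerank_scores)]

def recall_at_k(ranked_lists, ground_truth, k):
    """R@K: fraction of queries whose true match appears in the first k results."""
    hits = [bool(np.any(np.asarray(ranked[:k]) == gt))
            for ranked, gt in zip(ranked_lists, ground_truth)]
    return float(np.mean(hits))

# Toy data: four gallery embeddings and one query closest to gallery item 2.
gallery = np.eye(4)
query = np.array([0.1, 0.2, 0.9, 0.0])
shortlist, _ = retrieve_top_k(query, gallery, k=2)
# Hypothetical second-stage scores, one per shortlisted candidate.
reordered = rerank(shortlist, np.array([0.1, 0.8]))
```

Re-ranking only touches the shortlist, so a more expensive scoring function can be used in the second stage without scanning the whole gallery.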

[264] Integrated Gaussian Processes for Robust and Adaptive Multi-Object Tracking

Fred Lydeard,Bashar I. Ahmad,Simon Godsill

Main category: cs.CV

TL;DR: This paper introduces two new trackers, GaPP-Class and GaPP-ReaCtion, which improve multi-object tracking by minimizing track breaks, learning measurement models online, and inferring object classes.

Details Motivation: The motivation is to develop a computationally efficient multi-object tracking approach that can minimize track breaks, learn measurement model parameters online, and infer the class of tracked objects. Method: The method involves using Gaussian process as a motion model, non-homogeneous Poisson processes as an observation model, and a particle filtering inference scheme for track management and hyperparameter learning. GaPP-ReaCtion also uses a Markov Chain Monte Carlo kernel for track revival and stitching. Result: The result shows that GaPP-Class and GaPP-ReaCtion outperform other state-of-the-art tracking algorithms with GaPP-ReaCtion significantly reducing track breaks by around 30% from real radar data. Conclusion: The paper concludes that GaPP-Class and GaPP-ReaCtion are robust and adaptive trackers that outperform other state-of-the-art tracking algorithms in minimizing track breaks. Abstract: This paper presents a computationally efficient multi-object tracking approach that can minimise track breaks (e.g., in challenging environments and against agile targets), learn the measurement model parameters on-line (e.g., in dynamically changing scenes) and infer the class of the tracked objects, if joint tracking and kinematic behaviour classification is sought. It capitalises on the flexibilities offered by the integrated Gaussian process as a motion model and the convenient statistical properties of non-homogeneous Poisson processes as a suitable observation model. This can be combined with the proposed effective track revival / stitching mechanism. We accordingly introduce the two robust and adaptive trackers, Gaussian and Poisson Process with Classification (GaPP-Class) and GaPP with Revival and Classification (GaPP-ReaCtion). They employ an appropriate particle filtering inference scheme that efficiently integrates track management and hyperparameter learning (including the object class, if relevant). 
GaPP-ReaCtion extends GaPP-Class with the addition of a Markov Chain Monte Carlo kernel applied to each particle permitting track revival and stitching (e.g., within a few time steps after deleting a trajectory). Performance evaluation and benchmarking using synthetic and real data show that GaPP-Class and GaPP-ReaCtion outperform other state-of-the-art tracking algorithms. For example, GaPP-ReaCtion significantly reduces track breaks (e.g., by around 30% from real radar data and markedly more from simulated data).

[265] PromptSR: Cascade Prompting for Lightweight Image Super-Resolution

Wenyang Liu,Chen Cai,Jianjun Gao,Kejun Wu,Yi Wang,Kim-Hui Yap,Lap-Pui Chau

Main category: cs.CV

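TL;DR: This paper proposes PromptSR, a prompt-based lightweight image super-resolution method whose novel cascade prompting block addresses the limited receptive field, improving performance while keeping computational cost low.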

Details Motivation: Existing lightweight Vision Transformers for image super-resolution are limited by a small receptive field, while enlarging the window size sharply increases computational complexity. Method: A cascade prompting block (CPB) is designed, comprising a global anchor prompting layer and two local prompting layers; it uses cross-scale attention to reduce computational cost and enlarge the receptive field. Result: Experiments show that PromptSR outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Conclusion: PromptSR is a lightweight image super-resolution method that combines global priors with local details to improve performance significantly; the code will be released. Abstract: Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR.
The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at https://github.com/wenyang001/PromptSR.
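
The cost argument in this entry can be made concrete with a rough operation count. The sketch below uses the common estimate that window self-attention over an h x w feature map with C channels and window side M spends about 2·M²·h·w·C multiply-adds on the attention matrices (projections excluded). The anchor variant, in which every token attends to a fixed number of anchors, is a simplified stand-in and not the actual GAPL layer; `num_anchors` and the feature-map sizes are hypothetical.

```python
def window_attention_cost(h, w, c, window):
    """Approximate multiply-adds for the attention matrices of window self-attention.

    Each token attends to window**2 tokens, so the cost grows quadratically
    with the window side length.
    """
    return 2 * (window ** 2) * h * w * c

def anchor_attention_cost(h, w, c, num_anchors):
    """Approximate cost when every token attends to a fixed set of anchors (assumed)."""
    return 2 * num_anchors * h * w * c

# Hypothetical sizes chosen only to show the scaling.
small = window_attention_cost(64, 64, 60, window=8)
large = window_attention_cost(64, 64, 60, window=16)
anchored = anchor_attention_cost(64, 64, 60, num_anchors=64)
```

Doubling the window side multiplies the attention cost by four, whereas the anchor count sets the cost independently of how far apart the connected tokens are.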

[266] Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge

Linshen Liu,Boyan Su,Junyue Jiang,Guanlin Wu,Cong Guo,Ceyu Xu,Hao Frank Yang

Main category: cs.CV

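TL;DR: This paper presents EMC2, an efficient edge computing system for autonomous vehicles that significantly improves 3D object detection accuracy while keeping latency low.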

Details Motivation: Conventional 3D object detection methods cannot achieve low latency and high accuracy at the same time, so a system optimized specifically for edge platforms is needed to meet the demands of autonomous driving. Method: A scenario-aware Mixture of Experts (MoE) architecture is proposed; it uses an adaptive multimodal data bridge for multi-scale preprocessing and dynamically dispatches features to expert models based on object visibility and distance. Joint hardware-software optimizations target resource-constrained edge devices. Result: On the KITTI dataset, EMC2 improves average accuracy by 3.58% and speeds up inference by 159.06% compared with 15 baseline methods on Jetson platforms, with similar gains on the nuScenes dataset. Conclusion: EMC2 is an edge computing system designed for autonomous vehicles that fuses LiDAR and camera data and combines hardware and software optimizations to achieve low-latency, high-accuracy 3D object detection. Abstract: This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs.

[267] Driver-Net: Multi-Camera Fusion for Assessing Driver Take-Over Readiness in Automated Vehicles

Mahdi Rezaei,Mohsen Azarmi

Main category: cs.CV

TL;DR: Driver-Net is a deep learning framework that fuses multi-camera inputs to estimate driver take-over readiness in automated vehicles, achieving high accuracy through synchronized visual cues and advanced data integration techniques.

Details Motivation: Ensuring safe transition of control in automated vehicles requires an accurate and timely assessment of driver readiness, which conventional vision-based driver monitoring systems fail to provide comprehensively by focusing only on head pose or eye gaze. Method: Driver-Net uses a triple-camera setup capturing synchronized visual cues from the driver's head, hands, and body posture. It integrates spatio-temporal data using a dual-path architecture with a Context Block and Feature Block, followed by a cross-modal fusion strategy to enhance prediction accuracy. Result: Evaluated on a diverse dataset collected from the University of Leeds Driving Simulator, Driver-Net achieved an accuracy of up to 95.8% in driver readiness classification, significantly enhancing existing approaches and highlighting the importance of multimodal and multi-view fusion. Conclusion: Driver-Net contributes meaningfully to the development of safer and more reliable automated vehicles by providing a real-time, non-intrusive solution for assessing driver readiness, aligning with new regulatory mandates and upcoming safety standards. Abstract: Ensuring safe transition of control in automated vehicles requires an accurate and timely assessment of driver readiness. This paper introduces Driver-Net, a novel deep learning framework that fuses multi-camera inputs to estimate driver take-over readiness. Unlike conventional vision-based driver monitoring systems that focus on head pose or eye gaze, Driver-Net captures synchronised visual cues from the driver's head, hands, and body posture through a triple-camera setup. The model integrates spatio-temporal data using a dual-path architecture, comprising a Context Block and a Feature Block, followed by a cross-modal fusion strategy to enhance prediction accuracy. 
Evaluated on a diverse dataset collected from the University of Leeds Driving Simulator, the proposed method achieves an accuracy of up to 95.8% in driver readiness classification. This performance significantly enhances existing approaches and highlights the importance of multimodal and multi-view fusion. As a real-time, non-intrusive solution, Driver-Net contributes meaningfully to the development of safer and more reliable automated vehicles and aligns with new regulatory mandates and upcoming safety standards.

[268] Pedestrian Intention Prediction via Vision-Language Foundation Models

Mohsen Azarmi,Mahdi Rezaei,He Wang

Main category: cs.CV

TL;DR: This paper explores using vision-language foundation models (VLFMs) with hierarchical prompt templates to better predict pedestrian crossing intentions in autonomous vehicles, achieving significant accuracy improvements over conventional methods.

Details Motivation: Conventional vision-based methods for predicting pedestrian crossing intentions face challenges in generalizability, context understanding, and causal reasoning. This motivates the exploration of more advanced models like VLFMs to improve prediction accuracy and performance in real-world scenarios. Method: This study uses vision-language foundation models (VLFMs) integrated with multimodal data through hierarchical prompt templates to predict pedestrian crossing intentions. Contextual information like visual frames, physical cues, and ego-vehicle dynamics are incorporated into refined prompts. Experiments were conducted on JAAD, PIE, and FU-PIP datasets. Result: Incorporating vehicle speed, its variations over time, and time-conscious prompts improved prediction accuracy by up to 19.8%. Optimized prompts from an automatic prompt engineering framework provided an additional 12.5% accuracy improvement. Conclusion: VLFMs outperform traditional vision-based models in pedestrian crossing intention prediction, providing better generalization and contextual understanding for autonomous driving applications. Abstract: Prediction of pedestrian crossing intention is a critical function in autonomous vehicles. Conventional vision-based methods of crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, physical cue observations, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP.
Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts significantly enhances the prediction accuracy, by up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded 12.5% further accuracy gains. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.

[269] Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation

Fernando Gabriela Garcia,Spencer Burns,Ryan Shaw,Hunter Young

Main category: cs.CV

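TL;DR: This paper proposes Hi-SSLVLM, a new generative model whose two-stage self-supervised learning strategy significantly improves text-to-image generation for complex prompts.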

Details Motivation: Traditional methods depend on high-quality, manually annotated datasets and struggle to control fine-grained visual attributes and complex spatial relationships precisely. Method: A two-stage self-supervised learning strategy: the first stage is Multi-Granularity Visual-Language Grounding, and the second is Self-Refinement and Guided Image Generation, which uses an Internal Compositional Planning mechanism and a Semantic Consistency Loss. Result: The model performs strongly on multiple benchmarks and outperforms existing baselines on all fine-grained metrics; ablation studies and human evaluations confirm the contribution of each component and the improvement in generation quality. Conclusion: Hi-SSLVLM is a significant step towards more controllable and semantically consistent open-ended text-to-image generation. Abstract: This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image Generation, leverages this acquired knowledge by an Internal Compositional Planning (ICP) mechanism, where the LVLM first formulates detailed textual sub-prompts to guide the image generation process, complemented by a novel Semantic Consistency Loss for precise output alignment. Comprehensive experiments against leading baselines, including Janus-Pro-1B, Stable Diffusion XL 1.0, DeepFloyd IF v1.0, and ControlNet-XL, on multi-dimensional benchmarks such as Gemini-2.0-Flash and InternVL3-78B, demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics. An in-depth ablation study confirms the critical role of each proposed component.
Furthermore, human evaluations corroborate our quantitative findings, highlighting Hi-SSLVLM's enhanced fidelity to prompt, compositional accuracy, and overall aesthetic quality, marking a significant step towards more controllable and semantically consistent open-ended text-to-image generation.

[270] LVLM-Composer's Explicit Planning for Image Generation

Spencer Ramsey,Jeffrey Lee,Amina Grant

Main category: cs.CV

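TL;DR: This paper proposes LVLM-Composer, a new 10-billion-parameter large vision-language model for enhanced compositional image synthesis. With a Hierarchical Semantic Planning Module, a Fine-Grained Feature Alignment Mechanism, and a multi-stage training paradigm, it makes significant progress in generating high-quality images from complex textual descriptions.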

Details Motivation: Current Large Vision-Language Models (LVLMs) struggle with complex textual descriptions, particularly those demanding precise compositional understanding and visual planning. Method: The model introduces a Hierarchical Semantic Planning Module and a Fine-Grained Feature Alignment Mechanism, and adopts a multi-stage training paradigm comprising Hierarchical Semantic-Visual Grounding Pre-training and Compositional Planning Reinforcement Learning with Self-Correction. Result: On the LongBench-T2I benchmark, LVLM-Composer significantly outperforms state-of-the-art baselines on key dimensions including object accuracy, composition fidelity, and pose accuracy. Conclusion: LVLM-Composer is a significant step towards controllable and accurate text-to-image generation, demonstrating its ability to generate high-quality images that match the description in complex scenes. Abstract: The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. We propose a multi-stage training paradigm, featuring Hierarchical Semantic-Visual Grounding Pre-training and Compositional Planning Reinforcement Learning with Self-Correction, to instill robust compositional reasoning. Extensive experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions including object accuracy, composition fidelity, and pose accuracy, significantly outperforming state-of-the-art baselines.
An in-depth ablation study further validates the indispensable contribution of our proposed modules, while human evaluations confirm the perceptual superiority of our generated images. LVLM-Composer represents a significant step towards truly controllable and compositionally accurate open-ended text-to-image generation.

[271] Voyaging into Unbounded Dynamic Scenes from a Single View

Fengrui Tian,Tianjiao Ding,Jinqi Luo,Hancheng Min,René Vidal

Main category: cs.CV

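TL;DR: This paper proposes DynamicVoyager, which recasts dynamic scene generation as an outpainting task to generate unbounded, 3D-consistent dynamic scenes from a single-view video.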

Details Motivation: Existing methods are constrained by their training views and struggle to generate dynamic content far from the original viewpoint, so a more flexible approach to dynamic scene generation is needed. Method: Dynamic scene generation is reformulated as a scene outpainting process, and 3D motion consistency is learned from point-cloud and ray context information. Result: Experiments show that DynamicVoyager generates consistent motions along fly-through camera paths and allows the generated content to be controlled with prompts. Conclusion: DynamicVoyager can generate unbounded dynamic scenes with 3D consistency, and the generated content can be controlled through scene prompts. Abstract: This paper studies the problem of generating an unbounded dynamic scene from a single view, which has wide applications in augmented/virtual reality and robotics. Since the scene is changing over time, different generated views need to be consistent with the underlying 3D motions. While previous works learn such consistency by training from multiple views, the generated scene regions are bounded to be close to the training views with limited camera movements. To address this issue, we propose DynamicVoyager that reformulates the dynamic scene generation as a scene outpainting process for new dynamic content. As 2D outpainting models can hardly generate 3D consistent motions from only 2D pixels at a single view, we consider pixels as rays to enrich the pixel input with the ray context, so that the 3D motion consistency can be learned from the ray information. More specifically, we first map the single-view video input to a dynamic point cloud with the estimated video depths. Then we render the partial video at a novel view and outpaint the video with ray contexts from the point cloud to generate 3D consistent motions. We employ the outpainted video to update the point cloud, which is used for scene outpainting from future novel views. Experiments show that our model is able to generate unbounded scenes with consistent motions along fly-through cameras, and the generated contents can be controlled with scene prompts.

[272] Towards Spatially-Varying Gain and Binning

Anqi Yang,Eunhee Kang,Wei Chen,Hyong-Euk Lee,Aswin C. Sankaranarayanan

Main category: cs.CV

TL;DR: This paper introduces spatially-varying gain and binning techniques to enhance image sensor performance by improving dynamic range, reducing noise, and increasing signal-to-noise ratio without compromising resolution.

Details Motivation: As pixels in image sensors become smaller, they accumulate less light, leading to reduced image quality. This paper aims to address this issue by proposing techniques to improve noise performance and dynamic range while maintaining resolution. Method: The paper proposes a method involving spatially-varying gain and binning to enhance image sensor performance. It includes an analytical approach to determine optimal binning sizes based on light levels and scene brightness. Result: The proposed spatially-varying gain strategy expands the sensor's dynamic range by an order of magnitude and makes read noise negligible. The spatially-varying binning strategy effectively increases the signal-to-noise ratio. Additionally, digital binning outperforms analog binning when larger gains are allowed. Conclusion: The paper concludes that combining spatially-varying gain and binning strategies can significantly improve image sensor performance in terms of noise reduction, dynamic range expansion, and signal-to-noise ratio without sacrificing resolution. Abstract: Pixels in image sensors have progressively become smaller, driven by the goal of producing higher-resolution imagery. However, ceteris paribus, a smaller pixel accumulates less light, making image quality worse. This interplay of resolution, noise, and the dynamic range of the sensor and their impact on the eventual quality of acquired imagery is a fundamental concept in photography. In this paper, we propose spatially-varying gain and binning to enhance the noise performance and dynamic range of image sensors. First, we show that by varying gain spatially to local scene brightness, the read noise can be made negligible, and the dynamic range of a sensor is expanded by an order of magnitude. 
Second, we propose a simple analysis to find a binning size that best balances resolution and noise for a given light level; this analysis predicts a spatially-varying binning strategy, again based on local scene brightness, to effectively increase the overall signal-to-noise ratio. We discuss analog and digital binning modes and, perhaps surprisingly, show that digital binning outperforms its analog counterparts when a larger gain is allowed. Finally, we demonstrate combining spatially-varying gain and binning in various applications, including high dynamic range imaging, vignetting, and lens distortion.
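
The idea of choosing a bin size from local brightness can be illustrated with a textbook noise model. The sketch below is not the paper's analysis: it assumes Poisson shot noise plus a fixed per-pixel read noise, sums independently read pixels, and uses a hypothetical target SNR as the selection criterion. All function names and numbers are illustrative.

```python
import math

def binned_snr(signal_e, read_noise_e, n_pixels):
    """SNR after summing n_pixels independently read pixels.

    signal_e: mean photo-electrons per pixel (shot-noise variance equals the mean).
    read_noise_e: read-noise standard deviation per pixel, in electrons.
    """
    total_signal = n_pixels * signal_e
    total_var = n_pixels * (signal_e + read_noise_e ** 2)  # shot + read noise
    if total_var == 0:
        return 0.0
    return total_signal / math.sqrt(total_var)

def smallest_bin_for_target(signal_e, read_noise_e, target_snr, bin_sides=(1, 2, 4, 8)):
    """Pick the smallest bin side (highest resolution) that reaches target_snr.

    Falls back to the largest available bin when no candidate is sufficient.
    """
    for side in bin_sides:
        if binned_snr(signal_e, read_noise_e, side * side) >= target_snr:
            return side
    return bin_sides[-1]

# Hypothetical bright and dark regions of the same scene.
bright_side = smallest_bin_for_target(signal_e=400, read_noise_e=2, target_snr=8)
dark_side = smallest_bin_for_target(signal_e=9, read_noise_e=2, target_snr=8)
```

Under this model the SNR grows with the square root of the number of binned pixels, so dark regions are assigned larger bins while bright regions keep full resolution.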

[273] Quick Bypass Mechanism of Zero-Shot Diffusion-Based Image Restoration

Yu-Shan Tai,An-Yeu Wu

Main category: cs.CV

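TL;DR: This paper proposes a method for accelerating image restoration with pretrained diffusion models. By introducing a Quick Bypass Mechanism and a Revised Reverse Process, it significantly speeds up restoration while preserving restoration quality.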

Details Motivation: Although existing zero-shot methods let pretrained diffusion models perform image restoration without additional fine-tuning, they often require long iteration times in the denoising process. The authors propose new acceleration strategies to address this. Method: The paper proposes two strategies, the Quick Bypass Mechanism (QBM) and the Revised Reverse Process (RRP), to accelerate the denoising process of pretrained diffusion models and improve the quality of the generated images. Result: The method is validated on the ImageNet-1K and CelebA-HQ datasets and performs well on several image restoration tasks, including super-resolution, deblurring, and compressed sensing. Conclusion: Experimental results show that the proposed methods effectively accelerate existing image restoration methods while maintaining their original performance. Abstract: Recent advancements in diffusion models have demonstrated remarkable success in various image generation tasks. Building upon these achievements, diffusion models have also been effectively adapted to image restoration tasks, e.g., super-resolution and deblurring, aiming to recover high-quality images from degraded inputs. Although existing zero-shot approaches enable pretrained diffusion models to perform restoration tasks without additional fine-tuning, these methods often suffer from prolonged iteration times in the denoising process. To address this limitation, we propose a Quick Bypass Mechanism (QBM), a strategy that significantly accelerates the denoising process by initializing from an intermediate approximation, effectively bypassing early denoising steps. Furthermore, recognizing that approximation may introduce inconsistencies, we introduce a Revised Reverse Process (RRP), which adjusts the weighting of random noise to enhance the stochasticity and mitigate potential disharmony. We validate the proposed methods on ImageNet-1K and CelebA-HQ across multiple image restoration tasks, e.g., super-resolution, deblurring, and compressed sensing. Our experimental results show that the proposed methods can effectively accelerate existing methods while maintaining original performance.
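
Starting the reverse process from an intermediate step can be sketched with the standard DDPM forward-noising formula, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The code below is only an illustration of that general idea, not the paper's QBM: the linear beta schedule, the start step, and the placeholder `x0_approx` (standing in for some rough restoration of the degraded input) are assumptions.

```python
import numpy as np

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_to_step(x0_approx, t, a_bars, rng):
    """Diffuse an approximate clean image directly to timestep t.

    Applies x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps with eps ~ N(0, I).
    """
    eps = rng.standard_normal(x0_approx.shape)
    return np.sqrt(a_bars[t]) * x0_approx + np.sqrt(1.0 - a_bars[t]) * eps

def bypass_schedule(t_start):
    """Timesteps visited when the reverse process begins at t_start instead of T - 1."""
    return list(range(t_start, -1, -1))

rng = np.random.default_rng(0)
a_bars = alpha_bars()
x0_approx = np.zeros((3, 8, 8))          # placeholder for a rough restoration
x_start = noise_to_step(x0_approx, t=399, a_bars=a_bars, rng=rng)
steps = bypass_schedule(399)             # 400 reverse steps instead of 1000
```

With these hypothetical settings the sampler runs 400 denoising steps rather than 1000; how the intermediate approximation is built and how the noise weighting is revised are specific to the paper and are not modeled here.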

[274] DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design

Xiwei Hu,Haokun Chen,Zhongqi Qi,Hui Zhang,Dexiang Hong,Jie Shao,Xinglong Wu

Main category: cs.CV

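TL;DR: DreamPoster is a novel text-to-image generation framework that uses an advanced T2I model and a systematic data processing pipeline to generate posters efficiently and with high quality, significantly outperforming existing methods.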

Details Motivation: To meet users' needs for high-quality output with flexible resolution and layout while maintaining content fidelity. Method: The framework is built on the Seedream3.0 model and uses a progressive training strategy and a systematic data annotation pipeline to achieve multi-task generation. Result: On the authors' testing benchmarks, DreamPoster achieves a usability rate of 88.55%, compared with GPT-4o (47.56%) and SeedEdit3.0 (25.96%). Conclusion: DreamPoster is a text-to-image generation framework that intelligently generates high-quality posters from user-provided images and text prompts, with a high usability rate. Abstract: We present DreamPoster, a Text-to-Image generation framework that intelligently synthesizes high-quality posters from user-provided images and text prompts while maintaining content fidelity and supporting flexible resolution and layout outputs. Specifically, DreamPoster is built upon our T2I model, Seedream3.0, to uniformly process different poster generation types. For dataset construction, we propose a systematic data annotation pipeline that precisely annotates textual content and typographic hierarchy information within poster images, while employing comprehensive methodologies to construct paired datasets comprising source materials (e.g., raw graphics/text) and their corresponding final poster outputs. Additionally, we implement a progressive training strategy that enables the model to hierarchically acquire multi-task generation capabilities while maintaining high-quality generation. Evaluations on our testing benchmarks demonstrate DreamPoster's superiority over existing methods, achieving a high usability rate of 88.55%, compared to GPT-4o (47.56%) and SeedEdit3.0 (25.96%). DreamPoster will be online in Jimeng and other Bytedance Apps.

[275] Domain Generalizable Portrait Style Transfer

Xinbo Wang,Wenju Xu,Qing Zhang,Wei-Shi Zheng

Main category: cs.CV

TL;DR: This paper proposes a new portrait style transfer method that offers high-quality, semantic-aligned stylization across multiple domains with effective control over the process.

Details Motivation: The study aims to develop a portrait style transfer method that generalizes well to different domains while enabling precise, semantic-aligned stylization of specific regions such as hair, eyes, and skin. Method: The method establishes dense semantic correspondence between the input and reference portraits using a pre-trained model and a semantic adapter. An AdaIN-Wavelet transform then balances content preservation and stylization, and a dual-conditional diffusion model integrating a ControlNet generates the final result. Result: Extensive experiments show that the proposed method outperforms existing techniques in quality and in control over stylization. Conclusion: The proposed portrait style transfer method enables high-quality, semantic-aligned, and controllable stylization across various domains. Abstract: This paper presents a portrait style transfer method that generalizes well to various domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skin, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transform to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. 
With the stylized latent from AdaIN-Wavelet transform, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model are available at https://github.com/wangxb29/DGPST.
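
The two ingredients named in the abstract, AdaIN statistics matching and a wavelet-domain blend of low and high frequencies, can be illustrated on 1D signals. This is a toy sketch only: the abstract does not say how the paper combines the two operations or which wavelet it uses, so the one-level Haar transform and the function names (`adain`, `haar_split`, `haar_merge`, `blend`) are assumptions.

```python
import math

def adain(content, style, eps=1e-5):
    """Shift and scale `content` so its mean and std match those of `style`."""
    def stats(v):
        m = sum(v) / len(v)
        var = sum((x - m) ** 2 for x in v) / len(v)
        return m, math.sqrt(var + eps)
    mc, sc = stats(content)
    ms, ss = stats(style)
    return [ss * (x - mc) / sc + ms for x in content]

def haar_split(v):
    """One-level orthonormal Haar transform of an even-length signal."""
    r = math.sqrt(2.0)
    low = [(v[i] + v[i + 1]) / r for i in range(0, len(v), 2)]
    high = [(v[i] - v[i + 1]) / r for i in range(0, len(v), 2)]
    return low, high

def haar_merge(low, high):
    """Inverse of haar_split."""
    r = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out += [(l + h) / r, (l - h) / r]
    return out

def blend(input_latent, warped_ref):
    """Keep the low band of the warped reference (style, coarse appearance)
    and the high band of the input (edges, fine content)."""
    low_ref, _ = haar_split(warped_ref)
    _, high_in = haar_split(input_latent)
    return haar_merge(low_ref, high_in)

input_latent = [1.0, 3.0, 2.0, 2.0]   # hypothetical values
warped_ref = [5.0, 5.0, 0.0, 0.0]
stylized = blend(input_latent, warped_ref)
```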

[276] MoReMouse: Monocular Reconstruction of Laboratory Mouse

Yuan Zhong,Jingxiang Sun,Liang An,Yebin Liu

Main category: cs.CV

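TL;DR: MoReMouse is a monocular dense 3D reconstruction network for laboratory mice. Using a transformer-based architecture, a triplane representation, and geodesic-based continuous correspondence embeddings, it generates accurate 3D surfaces from a single image.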

Details Motivation: Laboratory mice play a crucial role in biomedical research, but accurate 3D reconstruction of mouse surface motion remains challenging because of their complex non-rigid geometric deformations and textureless appearance. The lack of structured 3D datasets also hinders progress beyond sparse keypoint tracking. Method: MoReMouse adopts a transformer-based feedforward architecture with a triplane representation and introduces geodesic-based continuous correspondence embeddings to improve reconstruction stability. The authors also construct a high-fidelity dense-view synthetic dataset. Result: MoReMouse achieves high-quality 3D surface generation, and experiments show that it significantly outperforms existing open-source methods. Conclusion: MoReMouse significantly outperforms existing open-source methods, providing more accurate and robust 3D reconstruction. Abstract: Laboratory mice play a crucial role in biomedical research, yet accurate 3D mouse surface motion reconstruction remains challenging due to their complex non-rigid geometric deformations and textureless appearance. Moreover, the absence of structured 3D datasets severely hinders the progress beyond sparse keypoint tracking. To narrow the gap, we present MoReMouse, the first monocular dense 3D reconstruction network tailored for laboratory mice. To achieve this goal, we highlight three key designs. First, we construct the first high-fidelity dense-view synthetic dataset for mice, by rendering our self-designed realistic Gaussian mouse avatar. Second, MoReMouse adopts a transformer-based feedforward architecture with triplane representation, achieving high-quality 3D surface generation from a single image. Third, we create geodesic-based continuous correspondence embeddings on mouse surface, which serve as strong semantic priors to improve reconstruction stability and surface consistency. Extensive quantitative and qualitative experiments demonstrate that MoReMouse significantly outperforms existing open-source methods in accuracy and robustness. Video results are available at https://zyyw-eric.github.io/MoreMouse-webpage/.

[277] Efficient Training of Deep Networks using Guided Spectral Data Selection: A Step Toward Learning What You Need

Mohammadreza Sharifi,Ahad Harati

Main category: cs.CV

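TL;DR: This paper proposes the GSTDS algorithm, which uses spectral analysis to dynamically select the most informative data points, substantially reducing computational requirements without sacrificing model performance.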

Details Motivation: Effective data curation is essential for optimizing neural network training, particularly for reducing redundant or low-benefit computation and resource requirements. Method: The paper proposes a spectral data selection method, Guided Spectrally Tuned Data Selection (GSTDS), which uses a pre-trained reference model to dynamically adjust the training subset and a Fiedler vector-based scoring mechanism to filter data points. Result: On several standard image classification benchmarks, GSTDS outperforms standard training and the recent JEST data curation method, reducing computational requirements by up to four times while maintaining performance and improving accuracy. Conclusion: GSTDS shows potential as a scalable solution for resource-efficient deep learning and motivates further exploration of adaptive data curation strategies. Abstract: Effective data curation is essential for optimizing neural network training. In this paper, we present the Guided Spectrally Tuned Data Selection (GSTDS) algorithm, which dynamically adjusts the subset of data points used for training using an off-the-shelf pre-trained reference model. Based on a pre-scheduled filtering ratio, GSTDS effectively reduces the number of data points processed per batch. The proposed method ensures an efficient selection of the most informative data points for training while avoiding redundant or less beneficial computations. Preserving data points in each batch is performed based on spectral analysis. A Fiedler vector-based scoring mechanism removes the filtered portion of the batch, lightening the resource requirements of the learning. The proposed data selection approach not only streamlines the training process but also promotes improved generalization and accuracy. Extensive experiments on standard image classification benchmarks, including CIFAR-10, Oxford-IIIT Pet, and Oxford-Flowers, demonstrate that GSTDS outperforms standard training scenarios and JEST, a recent state-of-the-art data curation method, on several key factors. It is shown that GSTDS achieves notable reductions in computational requirements, up to four times, without compromising performance. GSTDS exhibits a considerable growth in terms of accuracy under the limited computational resource usage, in contrast to other methodologies. 
These promising results underscore the potential of spectral-based data selection as a scalable solution for resource-efficient deep learning and motivate further exploration into adaptive data curation strategies. You can find the code at https://github.com/rezasharifi82/GSTDS.
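
For readers unfamiliar with the Fiedler vector: it is the eigenvector belonging to the second-smallest eigenvalue of a graph Laplacian, and its signs split a graph into two loosely connected groups. The sketch below computes it with shifted power iteration on a tiny similarity graph. It does not reproduce GSTDS: the abstract does not state how the batch graph is built or how Fiedler components become keep/drop scores, so the graph weights and the function name `fiedler_vector` are assumptions.

```python
import math

def fiedler_vector(weights, iters=500):
    """Fiedler vector of L = D - W for a connected graph with a symmetric,
    zero-diagonal weight matrix, via power iteration on (shift*I - L)."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    shift = 2.0 * max(degree)          # Gershgorin bound on the largest eigenvalue of L
    v = [float(i) for i in range(n)]   # arbitrary non-constant start vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]      # project out the constant eigenvector
        w = [(shift - degree[i]) * v[i]
             + sum(weights[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Two tight pairs (0-1 and 2-3) joined by one weak edge (1-2).
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
f = fiedler_vector(W)
groups = [0 if x < 0 else 1 for x in f]   # sign pattern separates the two pairs
```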

[278] ZERO: Multi-modal Prompt-based Visual Grounding

Sangbum Choi,Kyeongryeol Go

Main category: cs.CV

TL;DR: This paper introduces ZERO, a zero-shot multi-prompt object detection model designed for robust deployment across various industrial domains by integrating textual and visual prompts for accurate detection.

Details Motivation: Recent advances in artificial intelligence have led to the emergence of foundation models. The work aims to develop a robust, production-ready object detection model adaptable across diverse industrial domains with minimal supervision. Method: ZERO integrates direct image input with multiple user-defined prompts (textual and visual cues) processed through dedicated encoders to generate accurate detection outputs. A domain-specific fine-tuning strategy focusing on prompt diversity and conservative pseudo-labeling is introduced. Result: ZERO achieved strong performance on the RF20VL-fsod benchmark despite limited annotation budgets, highlighting its potential for scalable and adaptive object detection. Conclusion: The paper concludes that ZERO, a zero-shot multi-prompt object detection model, demonstrates practical advantages in flexibility, efficiency, and real-world applicability for scalable and adaptive object detection in dynamic industrial environments. Abstract: Recent advances in artificial intelligence have led to the emergence of foundation models, large-scale pre-trained neural networks that serve as versatile starting points for a wide range of downstream tasks. In this work, we present ZERO, a zero-shot multi-prompt object detection model specifically designed for robust, production-ready deployment across diverse industrial domains. ZERO integrates direct image input with multiple user-defined prompts, which can include both textual and visual cues, and processes them through dedicated encoders to generate accurate detection outputs. The model architecture is optimized for scalability, with a total of 1.033 TFLOPS and 622.346 million parameters, and is trained using a domain-specific image database exceeding one billion images. 
For the CVPR 2025 Foundational Few-Shot Object Detection (FSOD) Challenge, we introduce a domain-specific fine-tuning strategy that emphasizes prompt diversity and conservative pseudo-labeling, enabling effective adaptation to new domains with minimal supervision. Our approach demonstrates practical advantages in flexibility, efficiency, and real-world applicability, achieving strong performance on the RF20VL-fsod benchmark despite limited annotation budgets. The results highlight the potential of prompt-driven, data-centric AI for scalable and adaptive object detection in dynamic industrial environments.

[279] Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices

Guangrui Bai,Hailong Yan,Wenhai Liu,Yahui Deng,Erbao Dong

Main category: cs.CV

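TL;DR: The paper proposes LiteIE, an ultra-lightweight unsupervised enhancement framework for real-time low-light image enhancement on mobile and embedded devices.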

Details Motivation: Existing deep learning methods often rely on large networks and labeled datasets, which limits their deployment on resource-constrained platforms. Method: The authors design a lightweight feature extractor with only two convolutional layers and develop a parameter-free iterative restoration module. Result: On the LOL dataset, LiteIE achieves 19.04 dB PSNR, surpassing SOTA by 1.4 dB while using only 0.07% of its parameters. Conclusion: LiteIE is an efficient and practical low-light enhancement solution for resource-limited platforms. Abstract: Real-time low-light image enhancement on mobile and embedded devices requires models that balance visual quality and computational efficiency. Existing deep learning methods often rely on large networks and labeled datasets, limiting their deployment on resource-constrained platforms. In this paper, we propose LiteIE, an ultra-lightweight unsupervised enhancement framework that eliminates dependence on large-scale supervision and generalizes well across diverse conditions. We design a backbone-agnostic feature extractor with only two convolutional layers to produce compact image-feature enhancement tensors. In addition, we develop a parameter-free Iterative Restoration Module, which reuses the extracted features to progressively recover fine details lost in earlier enhancement steps, without introducing any additional learnable parameters. We further propose an unsupervised training objective that integrates exposure control, edge-aware smoothness, and multi-scale color consistency losses. In experiments on the LOL dataset, LiteIE achieves 19.04 dB PSNR, surpassing SOTA by 1.4 dB while using only 0.07% of its parameters. On a Snapdragon 8 Gen 3 mobile processor, LiteIE runs at 30 FPS for 4K images with just 58 parameters, enabling real-time deployment on edge devices. These results establish LiteIE as an efficient and practical solution for low-light enhancement on resource-limited platforms.
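
To put the reported PSNR figures in context, the standard definition is PSNR = 10 · log10(MAX² / MSE). The helper below is a generic implementation of that formula on flat pixel lists, not code from the paper; `psnr` and `max_value` are illustrative names, and pixel values are assumed to lie in [0, 1].

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)

value = psnr([0.0, 0.0], [0.1, 0.1])   # hypothetical pixels, MSE = 0.01

# A gain of g dB corresponds to multiplying the MSE by 10 ** (-g / 10).
mse_ratio_for_1_4_db = 10 ** (-1.4 / 10)
```

Because the scale is logarithmic, the 1.4 dB gain quoted above corresponds to an MSE roughly 28% lower than the baseline's.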

[280] SeqTex: Generate Mesh Textures in Video Sequence

Ze Yuan,Xin Yu,Yangtian Sun,Yuan-Chen Guo,Yan-Pei Cao,Ding Liang,Xiaojuan Qi

Main category: cs.CV

TL;DR: SeqTex is an end-to-end framework that directly generates complete UV texture maps by leveraging visual knowledge from pretrained video foundation models, offering high-quality results without post-processing.

Details Motivation: The lack of large-scale 3D texture datasets limits the generalization of native 3D texture generative models. Existing methods suffer from error accumulation and spatial inconsistencies due to two-stage pipelines. Method: SeqTex reformulates UV texture generation as a sequence generation problem and uses architectural innovations like decoupled branches, geometry-informed attention, and adaptive token resolution. Result: SeqTex achieves state-of-the-art performance on image-conditioned and text-conditioned 3D texture generation tasks with superior 3D consistency, texture-geometry alignment, and real-world generalization. Conclusion: SeqTex provides a direct and efficient way to generate high-fidelity UV texture maps using pretrained video foundation models, outperforming existing methods in 3D consistency and generalization. Abstract: Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps -- an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. 
This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.

[281] M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding

Shenxi Liu,Kan Li,Mingyang Zhao,Yuhang Tian,Bin Li,Shoujun Zhou,Hongliang Li,Fuxia Yang

Main category: cs.CV

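TL;DR: This paper presents M3-Med, a new benchmark for medical instructional video understanding that addresses the monolingual focus and shallow reasoning of existing benchmarks. Evaluations show that current AI models still fall well short in deep cross-modal reasoning.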

Details Motivation: Existing video understanding benchmarks are limited by linguistic singularity and shallow reasoning, so they neither meet the need for multilingual resources nor properly assess deep multi-modal integration. Method: The authors present the M3-Med benchmark, which consists of medical questions paired with corresponding video segments and annotated by a team of medical experts. Two tasks are defined: TAGSV and TAGVC. Result: State-of-the-art models and large language models evaluated on M3-Med show a significant gap relative to human experts, with performance dropping sharply on complex multi-hop questions. Conclusion: M3-Med is a new benchmark for medical instructional video understanding that highlights the limitations of current AI models in deep cross-modal reasoning within specialized domains and provides a new direction for future research. Abstract: With the rapid progress of artificial intelligence (AI) in multi-modal understanding, there is increasing potential for video comprehension technologies to support professional domains such as medical education. However, existing benchmarks suffer from two primary limitations: (1) Linguistic Singularity: they are largely confined to English, neglecting the need for multilingual resources; and (2) Shallow Reasoning: their questions are often designed for surface-level information retrieval, failing to properly assess deep multi-modal integration. To address these limitations, we present M3-Med, the first benchmark for Multi-lingual, Multi-modal, and Multi-hop reasoning in Medical instructional video understanding. M3-Med consists of medical questions paired with corresponding video segments, annotated by a team of medical experts. A key innovation of M3-Med is its multi-hop reasoning task, which requires a model to first locate a key entity in the text, then find corresponding visual evidence in the video, and finally synthesize information across both modalities to derive the answer. This design moves beyond simple text matching and poses a substantial challenge to a model's deep cross-modal understanding capabilities. We define two tasks: Temporal Answer Grounding in Single Video (TAGSV) and Temporal Answer Grounding in Video Corpus (TAGVC). We evaluated several state-of-the-art models and Large Language Models (LLMs) on M3-Med. The results reveal a significant performance gap between all models and human experts, especially on the complex multi-hop questions where model performance drops sharply. 
M3-Med effectively highlights the current limitations of AI models in deep cross-modal reasoning within specialized domains and provides a new direction for future research.

[282] MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation

Weilun Feng,Chuanguang Yang,Haotong Qin,Yuqi Li,Xiangqi Li,Zhulin An,Libo Huang,Boyu Diao,Fuzhen Zhuang,Michele Magno,Yongjun Xu,Yingli Tian,Tingwen Huang

Main category: cs.CV

TL;DR: This paper proposes MPQ-DMv2, an improved mixed precision quantization framework for diffusion models, achieving superior performance under extremely low-bit quantization by addressing quantization-unfriendly designs and suboptimal optimization strategies.

Details Motivation: Existing quantization methods do not generalize well under extremely low-bit quantization and cause performance degradation due to outlier-unfriendly quantizer design, suboptimal initialization, and optimization strategy. Method: The paper introduces MPQ-DMv2, which uses Flexible Z-Order Residual Mixed Quantization, Object-Oriented Low-Rank Initialization, and Memory-based Temporal Relation Distillation to improve quantization performance. Result: Comprehensive experiments show that MPQ-DMv2 surpasses current SOTA methods on various architectures, especially under extremely low-bit widths. Conclusion: MPQ-DMv2 is a promising framework for extremely low-bit quantization of diffusion models, which addresses the issues of existing frameworks by introducing flexible quantization techniques and optimization strategies. Abstract: Diffusion models have demonstrated remarkable performance on vision generation tasks. However, their high computational complexity hinders their wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization. Directly applying these methods will cause severe performance degradation. We identify that the existing quantization framework suffers from an outlier-unfriendly quantizer design, suboptimal initialization, and optimization strategy. We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models. From the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for a uniform quantizer. We propose Flexible Z-Order Residual Mixed Quantization that utilizes an efficient binary residual branch for flexible quant steps to handle salient error. 
For the optimization framework, we theoretically analyzed the convergence and optimality of the LoRA module and propose \textit{Object-Oriented Low-Rank Initialization} to use prior quantization error for informative initialization. We then propose \textit{Memory-based Temporal Relation Distillation} to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures the overall temporal consistency between quantized and full-precision model. Comprehensive experiments on various generation tasks show that our MPQ-DMv2 surpasses current SOTA methods by a great margin on different architectures, especially under extremely low-bit widths.
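
Why a salient outlier hurts a uniform quantizer, and how a binary residual branch can recover part of the lost precision, can be seen in a small numeric sketch. This is a generic residual binarization scheme chosen for illustration; it is not the paper's Flexible Z-Order Residual Mixed Quantization, whose details are not given in the abstract, and all names and values below are hypothetical.

```python
def uniform_quantize(values, bits):
    """Min-max uniform quantizer with 2**bits levels."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / step) * step for v in values]

def binary_residual(values, quantized):
    """1-bit correction: sign of the residual times its mean magnitude."""
    residual = [v - q for v, q in zip(values, quantized)]
    scale = sum(abs(r) for r in residual) / len(residual)
    return [scale if r >= 0 else -scale for r in residual]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [-0.31, -0.05, 0.02, 0.07, 0.11, 2.40]   # one salient outlier stretches the range
base = uniform_quantize(weights, bits=2)
correction = binary_residual(weights, base)
refined = [b + c for b, c in zip(base, correction)]

error_base = mse(weights, base)
error_refined = mse(weights, refined)   # never larger than error_base for this scheme
```

With the outlier at 2.40, the 2-bit grid is so coarse that all five small weights collapse onto the lowest level; the extra sign-and-scale branch reduces that error at the cost of one additional bit per weight plus one shared scale.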

[283] Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization

Zuyu Zhang,Ning Chen,Yongshan Liu,Qinghua Zhang,Xu Zhang

Main category: cs.CV

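TL;DR: This paper proposes LEAwareSGD, a Lyapunov Exponent-guided optimization method that improves model generalization in single domain generalization tasks by keeping training near the edge of chaos.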

Details Motivation: Existing single domain generalization methods rely mainly on data augmentation, which struggles to adapt to large domain shifts, so a new approach is needed to improve generalization. Method: The paper proposes LEAwareSGD, a novel Lyapunov Exponent-guided optimization method, and validates it with extensive experiments. Result: LEAwareSGD yields substantial generalization gains on PACS, OfficeHome, and DomainNet, with up to 9.47% improvement on PACS in low-data regimes. Conclusion: By using Lyapunov Exponent measurements to modulate the learning rate, LEAwareSGD keeps training near the edge of chaos and thereby improves generalization in single domain generalization tasks. Abstract: Single Domain Generalization (SDG) aims to develop models capable of generalizing to unseen target domains using only one source domain, a task complicated by substantial domain shifts and limited data diversity. Existing SDG approaches primarily rely on data augmentation techniques, which struggle to effectively adapt training dynamics to accommodate large domain shifts. To address this, we propose LEAwareSGD, a novel Lyapunov Exponent (LE)-guided optimization approach inspired by dynamical systems theory. By leveraging LE measurements to modulate the learning rate, LEAwareSGD encourages model training near the edge of chaos, a critical state that optimally balances stability and adaptability. This dynamic adjustment allows the model to explore a wider parameter space and capture more generalizable features, ultimately enhancing the model's generalization capability. Extensive experiments on PACS, OfficeHome, and DomainNet demonstrate that LEAwareSGD yields substantial generalization gains, achieving up to 9.47% improvement on PACS in low-data regimes. These results underscore the effectiveness of training near the edge of chaos for enhancing model generalization capability in SDG tasks.
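
The link between learning rate and Lyapunov exponent can be made concrete on a toy problem. The sketch below estimates a finite-time Lyapunov exponent for plain gradient descent on the quadratic f(x) = 0.5 · c · x², where the update is x ← (1 − lr · c) · x and the exponent equals ln|1 − lr · c|. It is an illustration of the concept only: the abstract does not give LEAwareSGD's measurement procedure or its learning-rate rule, and the names `local_lyapunov_exponent` and `gd_step` are hypothetical.

```python
import math

def local_lyapunov_exponent(step_fn, x0, steps=50, delta0=1e-8):
    """Average log growth rate of the gap between two nearby trajectories."""
    x, y = x0, x0 + delta0
    for _ in range(steps):
        x, y = step_fn(x), step_fn(y)
    return math.log(abs(y - x) / delta0) / steps

def gd_step(lr, curvature=4.0):
    """One gradient-descent step on f(x) = 0.5 * curvature * x**2."""
    return lambda x: x - lr * curvature * x

le_stable = local_lyapunov_exponent(gd_step(0.1), x0=1.0)   # negative: perturbations shrink
le_edge = local_lyapunov_exponent(gd_step(0.5), x0=1.0)     # about zero: lr = 2 / curvature
le_chaotic = local_lyapunov_exponent(gd_step(0.6), x0=1.0)  # positive: perturbations grow
```

In this toy setting a controller that raises the learning rate while the exponent is negative and lowers it once the exponent turns positive would hover around lr = 2 / curvature, which is one way to read "training near the edge of chaos".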

[284] Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis

Zhipeng Li,Kegang Wang,Hanguang Xiao,Xingyue Liu,Feizhong Zhou,Jiaxin Jiang,Tianqi Liu

Main category: cs.CV

TL;DR: This paper presents DLCN, a new large-scale dataset for remote photoplethysmography (rPPG) under dynamic nighttime lighting conditions, enabling better evaluation of algorithm robustness.

Details Motivation: Current rPPG methods have shown good performance under ideal lighting conditions, but their effectiveness in realistic nighttime environments with dynamic lighting variations remains unclear due to a lack of suitable datasets. Method: The authors introduced the DLCN dataset, which includes 13 hours of video data from 98 participants under four nighttime lighting scenarios. They used the Happy-rPPG Toolkit to conduct experiments and analyze algorithm performance. Result: A large-scale dataset (DLCN) was developed, offering high diversity and realism for assessing rPPG algorithms under challenging nighttime lighting conditions. Conclusion: The study concludes that the DLCN dataset effectively evaluates rPPG algorithm robustness in challenging nighttime lighting conditions and provides a foundation for future research. Abstract: Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting conditions. However, the effectiveness of current rPPG methods in realistic nighttime scenarios with dynamic lighting variations remains largely unknown. Moreover, there is a severe lack of datasets specifically designed for such challenging environments, which has substantially hindered progress in this area of research. To address this gap, we present and release a large-scale rPPG dataset collected under dynamic lighting conditions at night, named DLCN. The dataset comprises approximately 13 hours of video data and corresponding synchronized physiological signals from 98 participants, covering four representative nighttime lighting scenarios. 
DLCN offers high diversity and realism, making it a valuable resource for evaluating algorithm robustness in complex conditions. Built upon the proposed Happy-rPPG Toolkit, we conduct extensive experiments and provide a comprehensive analysis of the challenges faced by state-of-the-art rPPG methods when applied to DLCN. The dataset and code are publicly available at https://github.com/dalaoplan/Happp-rPPG-Toolkit.

[285] DMAT: An End-to-End Framework for Joint Atmospheric Turbulence Mitigation and Object Detection

Paul Hill,Alin Achim,Dave Bull,Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: This paper proposes DMAT, a novel end-to-end framework using a 3D Mamba-based structure to mitigate atmospheric turbulence effects, improving both image visualization and object detection.

Details Motivation: Atmospheric Turbulence degrades image clarity and hampers object classification and tracking, with existing deep learning methods struggling to address spatio-temporal distortions effectively. Method: A 3D Mamba-based structure within an end-to-end framework to handle spatio-temporal displacements and blurring caused by atmospheric turbulence while integrating semantic features for object detection. Result: DMAT achieves up to a 15% improvement over state-of-the-art systems on datasets corrupted by generated turbulence. Conclusion: The proposed DMAT framework effectively compensates for atmospheric turbulence distortions, enhancing both visualization quality and object detection performance. Abstract: Atmospheric Turbulence (AT) degrades the clarity and accuracy of surveillance imagery, posing challenges not only for visualization quality but also for object classification and scene tracking. Deep learning-based methods have been proposed to improve visual quality, but spatio-temporal distortions remain a significant issue. Although deep learning-based object detection performs well under normal conditions, it struggles to operate effectively on sequences distorted by atmospheric turbulence. In this paper, we propose a novel framework that learns to compensate for distorted features while simultaneously improving visualization and object detection. This end-to-end framework leverages and exchanges knowledge of low-level distorted features in the AT mitigator with semantic features extracted in the object detector. Specifically, in the AT mitigator a 3D Mamba-based structure is used to handle the spatio-temporal displacements and blurring caused by turbulence. Features are extracted in a pyramid manner during the mitigation stage and passed to the detector. Optimization is achieved through back-propagation in both the AT mitigator and object detector. 
Our proposed DMAT outperforms state-of-the-art AT mitigation and object detection systems by up to 15% on datasets corrupted by generated turbulence.

[286] Computed Tomography Visual Question Answering with Cross-modal Feature Graphing

Yuanhe Tian,Chen Su,Junwen Duan,Yan Song

Main category: cs.CV

TL;DR: This paper proposes a new LLM-based framework with a cross-modal graph approach to improve visual question answering in medical imaging, particularly for computed tomography (CT), by better capturing spatial and inter-slice relationships.

Details Motivation: Existing VQA approaches for medical imaging often neglect spatial continuity and inter-slice correlations in volumetric CT data, resulting in fragmented and imprecise responses. Method: A novel LLM-based framework using a cross-modal graph representation that integrates visual and textual features from CT slices and clinical questions, employing attentive graph convolutional networks for dynamic information fusion. Result: Extensive experiments show consistent superior performance across multiple evaluation metrics on the M3D-VQA benchmark. Conclusion: The proposed framework outperforms existing methods on the M3D-VQA benchmark, demonstrating robust reasoning capabilities by effectively capturing spatial continuity and inter-slice correlations in volumetric CT data. Abstract: Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. 
The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.

[287] MambaFusion: Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection

Hanshi Wang,Jin Gao,Weiming Hu,Zhipeng Zhang

Main category: cs.CV

TL;DR: This paper introduces a new Hybrid Mamba Block and height-fidelity LiDAR encoding to improve camera-LiDAR 3D object detection, achieving top performance with high efficiency.

Details Motivation: The motivation is to overcome the limitations of existing fusion strategies, which struggle to achieve efficiency, long-range modeling, and retention of complete scene information simultaneously. Method: The method involves using a Hybrid Mamba Block for local and global contextual learning, combined with height-fidelity LiDAR encoding to preserve precise height information through voxel compression in continuous space. Result: The proposed method achieves a top-tier NDS score of 75.0 on the nuScenes validation benchmark, surpassing methods using high-resolution inputs and maintaining faster inference speed compared to recent state-of-the-art methods. Conclusion: The paper concludes that the proposed Hybrid Mamba Block, along with height-fidelity LiDAR encoding, achieves state-of-the-art performance in camera-LiDAR multi-modal 3D object detection while maintaining efficiency. Abstract: We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion, meanwhile guaranteeing top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies are constrained by their inability to simultaneously achieve efficiency, long-range modeling, and retaining complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial since our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in sequence order. 
To resolve this, we propose height-fidelity LiDAR encoding that preserves precise height information through voxel compression in continuous space, thereby enhancing camera-LiDAR alignment. Subsequently, we introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with a top-tier NDS score of 75.0 on the nuScenes validation benchmark, even surpassing methods that utilize high-resolution inputs. Meanwhile, our method maintains efficiency, achieving faster inference speed than most recent state-of-the-art methods.

[288] Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions

Xiao Zhang,Johan Bos

Main category: cs.CV

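TL;DR: This paper proposes a multi-modal framework for tombstone digitization that uses vision-language models to convert tombstone images into structured Tombstone Meaning Representations (TMRs) and adds retrieval-augmented generation to improve semantic parsing.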

Details Motivation: Many tombstones face serious preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. A new approach to tombstone digitization is needed to better interpret, organize, and retrieve tombstone content. Method: The paper uses vision-language models (VLMs) to convert tombstone images into structured Tombstone Meaning Representations (TMRs) and introduces retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Result: Compared with traditional OCR-based pipelines, the method raises parsing accuracy from an F1 score of 36.1 to 89.5 and shows good robustness across diverse linguistic and cultural inscriptions; performance under physical degradation is simulated through image fusion. Conclusion: This is the first attempt to formalize tombstone understanding using large vision-language models, offering a new technical path for cultural heritage preservation. Abstract: Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.

[289] Transferring Visual Explainability of Self-Explaining Models through Task Arithmetic

Yuya Yoshikawa,Ryotaro Shimizu,Takahiro Kawashima,Yuki Saito

Main category: cs.CV

TL;DR: This paper proposes a method to transfer visual explainability from a source domain to a target domain in self-explaining models, reducing training costs while maintaining explanation and classification performance.

Details Motivation: Training self-explaining models incurs substantial labeling and computational costs. This study aims to address this issue by transferring visual explainability learned in a source domain to a target domain. Method: An 'explainability vector' is defined as the difference between model parameters trained on the source domain with and without explanation supervision. This vector is then applied to a model trained only on the prediction task in the target domain to impart explainability. Result: Experimental results show that visual explainability can generally be successfully transferred, improving explanation quality in the target domain. The explainability vector learned on ImageNet shows universality and robustness, improving explanation quality on nine out of ten target datasets. Explanation quality rivals Kernel SHAP's but with just one model inference. Conclusion: Visual explainability in self-explaining models can be effectively transferred from a source domain to a target domain, enhancing explanation quality without affecting classification accuracy. Abstract: In scenarios requiring both prediction and explanation efficiency for image classification, self-explaining models that perform both tasks in a single inference are effective. However, their training incurs substantial labeling and computational costs. This study aims to tackle the issue by proposing a method to transfer the visual explainability of self-explaining models, learned in a source domain, to a target domain based on a task arithmetic framework. Specifically, we construct a self-explaining model by extending image classifiers based on a vision-language pretrained model. We then define an explainability vector as the difference between model parameters trained on the source domain with and without explanation supervision. 
Based on the task arithmetic framework, we impart explainability to a model trained only on the prediction task in the target domain by applying the explainability vector. Experimental results on various image classification datasets demonstrate that, except for transfers between some less-related domains, visual explainability can be successfully transferred from source to target domains, improving explanation quality in the target domain without sacrificing classification accuracy. Furthermore, we show that the explainability vector learned on a large and diverse dataset like ImageNet, extended with explanation supervision, exhibits universality and robustness, improving explanation quality on nine out of ten different target datasets. We also find that the explanation quality achieved with a single model inference is comparable to that of Kernel SHAP, which requires 150 model inferences.
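The explainability-vector idea can be illustrated with a minimal sketch, assuming model parameters are stored as plain dictionaries of float lists. The parameter names, the toy values, and the `scale` coefficient are hypothetical illustrations and are not taken from the paper.

```python
# Minimal sketch of task arithmetic on toy parameter dictionaries.
# All names and values here are hypothetical, for illustration only.

def subtract(a, b):
    """Element-wise difference a - b for matching parameter dictionaries."""
    return {k: [x - y for x, y in zip(a[k], b[k])] for k in a}

def apply_vector(params, vector, scale=1.0):
    """Add a (scaled) parameter-space vector to a model's parameters.

    `scale` is an assumed knob; the digest does not say whether one is used.
    """
    return {k: [p + scale * v for p, v in zip(params[k], vector[k])] for k in params}

# Source-domain models trained with and without explanation supervision.
source_with_expl = {"w": [1.0, 2.0, 3.0]}
source_pred_only = {"w": [0.5, 2.0, 2.0]}
# Target-domain model trained on the prediction task only.
target_pred_only = {"w": [0.0, 1.0, 1.0]}

explainability_vector = subtract(source_with_expl, source_pred_only)
target_with_expl = apply_vector(target_pred_only, explainability_vector)
```

In this toy setting the target model simply receives the source-domain parameter difference; a real implementation would operate on full weight tensors of models that share one architecture.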

[290] Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers

Jung-Ho Hong,Ho-Joong Kim,Kyu-Sung Jeon,Seong-Whan Lee

Main category: cs.CV

TL;DR: This paper introduces CoIBA, a method for discovering relevant information in each targeted layer to explain the decision-making process.

Details Motivation: The attribution obtained in a specific layer neglects evidence of the decision-making process distributed across layers. Method: We suggest the variational approach to fairly reflect the relevant information of each layer by upper bounding layer-wise information. Result: The extensive experimental results demonstrate the enhancement in faithfulness of the feature attributions provided by CoIBA. Conclusion: CoIBA guarantees that the discarded activation is unnecessary in every targeted layer to make a decision. Abstract: The feature attribution method reveals the contribution of input variables to the decision-making process to provide an attribution map for explanation. Existing methods grounded on the information bottleneck principle compute information in a specific layer to obtain attributions, compressing the features by injecting noise via a parametric damping ratio. However, the attribution obtained in a specific layer neglects evidence of the decision-making process distributed across layers. In this paper, we introduce a comprehensive information bottleneck (CoIBA), which discovers the relevant information in each targeted layer to explain the decision-making process. Our core idea is applying information bottleneck in multiple targeted layers to estimate the comprehensive information by sharing a parametric damping ratio across the layers. Leveraging this shared ratio complements the over-compressed information to discover the omitted clues of the decision by sharing the relevant information across the targeted layers. We suggest the variational approach to fairly reflect the relevant information of each layer by upper bounding layer-wise information. Therefore, CoIBA guarantees that the discarded activation is unnecessary in every targeted layer to make a decision. The extensive experimental results demonstrate the enhancement in faithfulness of the feature attributions provided by CoIBA.
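As a rough illustration of noise injection through a damping ratio shared across layers, the sketch below mixes each layer's features with Gaussian noise matched to that layer's mean and standard deviation. The scalar ratio `lam`, the function names, and the mixing rule `lam * x + (1 - lam) * noise` are assumptions for illustration; the abstract does not specify CoIBA's exact formulation.

```python
import random
import statistics

def damp(features, lam, rng):
    """Mix features with noise: lam = 1 keeps the signal, lam = 0 replaces it."""
    mu = statistics.fmean(features)
    sigma = statistics.pstdev(features)
    return [lam * x + (1.0 - lam) * rng.gauss(mu, sigma) for x in features]

def damp_layers(layer_features, lam, seed=0):
    """Apply one shared damping ratio to the features of every targeted layer."""
    rng = random.Random(seed)
    return [damp(f, lam, rng) for f in layer_features]

# Two hypothetical targeted layers with different widths.
layers = [[1.0, 2.0, 3.0], [4.0, 6.0]]
compressed = damp_layers(layers, lam=0.5)
```

Sharing `lam` is the point of the sketch: the same ratio governs how much information is kept in each targeted layer, rather than one ratio being fitted to a single layer.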

[291] RegistrationMamba: A Mamba-based Registration Framework Integrating Multi-Expert Feature Learning for Cross-Modal Remote Sensing Images

Wei Wang,Dou Quan,Chonghua Lv,Shuang Wang,Ning Huyan,Yunan Li,Licheng Jiao

Main category: cs.CV

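TL;DR: This paper proposes RegistrationMamba, which uses the Mamba architecture and a multi-expert feature learning strategy to address global context capture, limited textures, and computational complexity in cross-modal remote sensing image registration, substantially improving registration performance.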

Details Motivation: Cross-modal remote sensing image (CRSI) registration is challenged by nonlinear radiometric variations and limited textures. Existing convolutional neural networks (CNNs) cannot capture global contextual information, while the high computational complexity of Transformers limits their use on high-resolution images. Method: The paper proposes a Mamba architecture based on state space models (SSMs) that uses a multi-directional cross-scanning strategy to capture global contextual relationships with linear complexity, and designs a multi-expert feature learning (MEFL) strategy to strengthen feature representations in texture-limited scenarios. It also introduces a multi-level feature aggregation (MFA) module to extract fine-grained local information and enable effective interaction with global features. Result: Experiments show that RegistrationMamba outperforms state-of-the-art methods on CRSI data of varying resolutions, with higher registration accuracy and stronger robustness. Conclusion: By combining state space models with multi-expert feature learning, RegistrationMamba delivers strong performance and robustness in CRSI registration while addressing the computational complexity and texture limitations that conventional methods face at high resolution. Abstract: Cross-modal remote sensing image (CRSI) registration is critical for multi-modal image applications. However, CRSI mainly faces two challenges: significant nonlinear radiometric variations between cross-modal images and limited textures hindering the discriminative information extraction. Existing methods mainly adopt convolutional neural networks (CNNs) or Transformer architectures to extract discriminative features for registration. However, CNNs with the local receptive field fail to capture global contextual features, and Transformers have high computational complexity, which restricts their application to high-resolution CRSI. To solve these issues, this paper proposes RegistrationMamba, a novel Mamba architecture based on state space models (SSMs) integrating multi-expert feature learning for improving the accuracy of CRSI registration. Specifically, RegistrationMamba employs a multi-directional cross-scanning strategy to capture global contextual relationships with linear complexity. To enhance the performance of RegistrationMamba under texture-limited scenarios, we propose a multi-expert feature learning (MEFL) strategy to capture features from various augmented image variants through multiple feature experts. MEFL leverages a learnable soft router to dynamically fuse the features from multiple experts, thereby enriching feature representations and improving registration performance. Notably, MEFL can be seamlessly integrated into various frameworks, substantially boosting registration performance. 
Additionally, RegistrationMamba integrates a multi-level feature aggregation (MFA) module to extract fine-grained local information and enable effective interaction between global and local features. Extensive experiments on CRSI with varying image resolutions have demonstrated that RegistrationMamba has superior performance and robustness compared to state-of-the-art methods.
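The soft-router fusion described for MEFL can be sketched as a softmax-weighted sum of expert features. This is a minimal illustration under assumptions: the router logits are fixed numbers here, whereas a learnable router would produce them from the input, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_experts(expert_features, router_logits):
    """Fuse per-expert feature vectors using soft routing weights."""
    weights = softmax(router_logits)
    dim = len(expert_features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, expert_features))
        for i in range(dim)
    ]
    return fused, weights

# Three hypothetical experts, each producing a 2-dimensional feature.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = fuse_experts(features, router_logits=[0.0, 0.0, 0.0])
```

With equal logits the router reduces to a plain average of the experts; unequal logits shift the fused feature toward whichever expert the router favors.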

[292] Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

Tongyan Hua,Lutao Jiang,Ying-Cong Chen,Wufan Zhao

Main category: cs.CV

TL;DR: This paper introduces Sat2City, a novel framework combining sparse voxel grids and latent diffusion models to generate high-quality 3D urban scenes from satellite images, achieving better results than current approaches.

Details Motivation: Most existing 3D urban scene generation methods rely on neural rendering techniques that limit their ability to produce detailed large-scale 3D structures due to structural ambiguity from limited 2D observations. This work aims to overcome these limitations. Method: The proposed Sat2City framework utilizes a cascaded latent diffusion model, Re-Hash operation in the VAE bottleneck, and inverse sampling strategy. It is validated using a synthesized dataset of large-scale 3D cities paired with satellite-view height maps. Result: The framework successfully generates high-fidelity 3D city structures from a single satellite image and demonstrates superior performance compared to existing models. Conclusion: Sat2City proves to be an effective framework for generating detailed 3D urban structures from satellite imagery, outperforming existing methods in terms of fidelity. Abstract: Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. 
Our approach is enabled by three key components: (1) A cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance transitioning. To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.

[293] A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields

Aoxiang Fan,Corentin Dumery,Nicolas Talabot,Pascal Fua

Main category: cs.CV

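TL;DR: This paper proposes a new regularization method for NeRF training that replaces fixed depth value estimates with view-consistent distributions, improving novel view synthesis quality without accurate depth supervision.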

Details Motivation: Applying NeRF to real-world data usually relies on depth regularization, which requires expensive, hard-to-obtain 3D supervision, and existing depth estimation models are error-prone in complex scenes such as unbounded outdoor scenes. The authors therefore seek a way to regularize NeRF training effectively without depending on accurate depth estimates. Method: View-consistent distributions are computed from low-level color features and high-level distilled features from foundation models at the 2D pixel locations onto which per-ray sampled 3D points are projected. Sampling from these distributions imposes an implicit regularization on NeRF training. A depth-pushing loss works together with the sampling technique to provide more effective regularization and eliminate failure modes. Result: Experiments show that the proposed method produces significantly better novel view synthesis on several public datasets than existing NeRF variants and other depth regularization methods. Conclusion: Regularization based on view-consistent distributions improves NeRF without accurate depth supervision and suits complex, unbounded real-world scenes. Abstract: Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularizations have proven to be the most effective ones. However, depth estimation models not only require expensive 3D supervision in training, but also suffer from generalization issues. As a result, the depth estimations can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistent distributions instead of fixed depth value estimations to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the projected 2D pixel-locations from per-ray sampled 3D points. By sampling from the view-consistency distributions, an implicit regularization is imposed on the training of NeRF. We also utilize a depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularizations for eliminating the failure modes. Extensive experiments conducted on various scenes from public datasets demonstrate that our proposed method can generate significantly better novel view synthesis results than state-of-the-art NeRF variants as well as different depth regularization methods.

[294] VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Rui Meng,Ziyan Jiang,Ye Liu,Mingyi Su,Xinyi Yang,Yuepeng Fu,Can Qin,Zeyuan Chen,Ran Xu,Caiming Xiong,Yingbo Zhou,Wenhu Chen,Semih Yavuz

Main category: cs.CV

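TL;DR: This paper introduces VLM2Vec-V2, a unified embedding learning framework that supports multiple visual input forms, together with MMEB-V2, a new comprehensive benchmark.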

Details Motivation: Existing multimodal embedding models focus mainly on natural images and offer limited support for other visual forms such as videos and visual documents, which restricts their use in real-world scenarios. VLM2Vec-V2 is proposed to close this gap. Method: The paper proposes VLM2Vec-V2, a general-purpose embedding model, and MMEB-V2, a benchmark that supports multimodal inputs spanning text, image, video, and visual documents. Result: VLM2Vec-V2 performs strongly on the newly introduced video and document retrieval tasks and also improves over prior baselines on the original image benchmarks. Conclusion: VLM2Vec-V2 provides a unified framework for learning embeddings across diverse visual forms, broadens the applicability of existing models, and demonstrates strong performance and generality across many tasks through extensive evaluation. Abstract: Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

[295] MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture

Guandong Li,Mengxia Ye

Main category: cs.CV

TL;DR: The paper introduces MVNet, a new network architecture for hyperspectral image classification that combines 3D-CNN, Transformer, and Mamba to improve feature extraction and classification performance.

Details Motivation: Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy, which often lead to overfitting and insufficient generalization capability. Method: This paper proposes a novel MVNet network architecture that integrates 3D-CNN's local feature extraction, Transformer's global modeling, and Mamba's linear complexity sequence modeling capabilities. MVNet features a redesigned dual-branch Mamba module and an optimized HSI-MambaVision Mixer module. Result: On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency. Conclusion: MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency, demonstrating robust capability in processing complex HSI data. Abstract: Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy, which often lead to overfitting and insufficient generalization capability. This paper proposes a novel MVNet network architecture that integrates 3D-CNN's local feature extraction, Transformer's global modeling, and Mamba's linear complexity sequence modeling capabilities, achieving efficient spatial-spectral feature extraction and fusion. MVNet features a redesigned dual-branch Mamba module, including a State Space Model (SSM) branch and a non-SSM branch employing 1D convolution with SiLU activation, enhancing modeling of both short-range and long-range dependencies while reducing computational latency in traditional Mamba. 
The optimized HSI-MambaVision Mixer module overcomes the unidirectional limitation of causal convolution, capturing bidirectional spatial-spectral dependencies in a single forward pass through decoupled attention that focuses on high-value features, alleviating parameter redundancy and the curse of dimensionality. On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency, demonstrating robust capability in processing complex HSI data.

[296] From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Mihai Masala,Marius Leordeanu

Main category: cs.CV

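TL;DR: This paper studies the problem of describing videos in complex language, proposes a shared vision-language representation based on graphs of events, and validates its ability to generate high-quality descriptions on multiple datasets.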

Details Motivation: Public resources for describing videos in complex language are lacking, and existing methods struggle to explain the relationship between vision and language. A new approach is needed to address both problems. Method: After analyzing existing methods and datasets, the paper proposes an explainable, analytical shared representation between vision and language based on graphs of events in space and time, and validates it within a self-supervised neuro-analytical system. Result: The method generates high-quality video descriptions, performs well under multiple evaluation metrics and human annotations, and can serve as a fully automatic teacher for training end-to-end neural student pathways. Conclusion: Video description generation suffers from scarce resources and the difficulty of complex language descriptions; the paper proposes a shared vision-language representation based on event graphs and validates that it generates coherent, rich, and relevant textual descriptions on multiple datasets. Abstract: The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. 
We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using both standard evaluation metrics, human annotations and consensus from ensembles of state-of-the-art VLMs.

[297] Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models

Huy Hoan Le,Van Sy Thinh Nguyen,Thi Le Chi Dang,Vo Thanh Khang Nguyen,Truong Thanh Hung Nguyen,Hung Cao

Main category: cs.CV

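TL;DR: This paper proposes a multi-agent system for multimedia verification that combines multimodal large language models with specialized verification tools to detect multimedia misinformation.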

Details Motivation: Verifying the authenticity of multimedia content in the real world is challenging, and an efficient system is needed to detect multimedia misinformation. Method: A six-stage multi-agent verification system was developed, covering raw data processing, planning, information extraction, deep research, evidence collection, and report generation; its core is a Deep Researcher Agent that uses four tools. Result: The system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across platforms. Conclusion: The proposed multi-agent system performs well at multimedia verification and effectively addresses real-world verification scenarios. Abstract: This paper presents our submission to the ACMMM25 - Grand Challenge on Multimedia Verification. We developed a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact-checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real-world multimedia verification scenarios.

[298] Transcribing Spanish Texts from the Past: Experiments with Transkribus, Tesseract and Granite

Yanco Amor Torterolo-Orta,Jaione Macicior-Mitxelena,Marina Miguez-Lamanuzzi,Ana García-Serrano

Main category: cs.CV

TL;DR: The GRESEL team used three methods (web-based OCR, traditional OCR, and a compact multimodal model) on consumer-grade hardware for the IberLEF 2025 PastReader task, achieving satisfactory results with scope for further improvement.

Details Motivation: The motivation was to participate in the IberLEF 2025 shared task PastReader and enable comparisons across different approaches for transcribing historical texts. Method: Three types of experiments were conducted: using a web-based OCR service, a traditional OCR engine, and a compact multimodal model, all run on consumer-grade hardware. Result: Satisfactory results were obtained from the experiments, indicating room for future enhancement. Conclusion: The GRESEL team's experiments in the IberLEF 2025 shared task showed satisfactory results, but there is potential for further improvement through exploration of new techniques and collaboration with BNE. Abstract: This article presents the experiments and results obtained by the GRESEL team in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past. Three types of experiments were conducted with the dual aim of participating in the task and enabling comparisons across different approaches. These included the use of a web-based OCR service, a traditional OCR engine, and a compact multimodal model. All experiments were run on consumer-grade hardware, which, despite lacking high-performance computing capacity, provided sufficient storage and stability. The results, while satisfactory, leave room for further improvement. Future work will focus on exploring new techniques and ideas using the Spanish-language dataset provided by the shared task, in collaboration with Biblioteca Nacional de España (BNE).

[299] SFOOD: A Multimodal Benchmark for Comprehensive Food Attribute Analysis Beyond RGB with Spectral Insights

Zhenbo Xu,Jinghan Yang,Gong Huang,Jiqing Feng,Liu Liu,Ruihan Sun,Ajin Meng,Zhuo Zhang,Zhaofeng He

Main category: cs.CV

TL;DR: This paper introduces the first large-scale spectral food benchmark to advance intelligent food analysis, highlighting the importance of spectrum data and revealing challenges in current AI models' ability to analyze food attributes effectively.

Details Motivation: The motivation is driven by the lack of a comprehensive benchmark for food attribute analysis and the difficulty in accurately perceiving certain food attributes through traditional RGB cameras. Method: The authors built a large-scale spectral food (SFOOD) benchmark suite by organizing existing datasets and collecting hyperspectral images of hundreds of foods, while also measuring attributes like sweetness and weight using instruments. Result: The resulting benchmark includes 3,266 food categories and 2,351k data points across 17 main food categories. Evaluations show that large-scale models still struggle with digitizing food, and spectrum data significantly aid in analyzing food properties. Conclusion: The paper concludes that spectrum data are vital for analyzing food properties and that their newly created benchmark will be open-source and continually updated for various food analysis tasks. Abstract: With the rise and development of computer vision and LLMs, intelligence is everywhere, especially for people and cars. However, for the many food attributes (such as origin, quantity, weight, quality, sweetness, etc.), existing research still mainly focuses on the study of categories. The reason is the lack of a large and comprehensive benchmark for food. Besides, many food attributes (such as sweetness, weight, and fine-grained categories) are challenging to accurately perceive solely through RGB cameras. To fill this gap and promote the development of intelligent food analysis, in this paper, we built the first large-scale spectral food (SFOOD) benchmark suite. We spent a lot of manpower and equipment costs to organize existing food datasets and collect hyperspectral images of hundreds of foods, and we used instruments to experimentally determine food attributes such as sweetness and weight. The resulting benchmark consists of 3,266 food categories and 2,351k data points for 17 main food categories. 
Extensive evaluations find that: (i) Large-scale models are still poor at digitizing food. Compared to people and cars, food has gradually become one of the most difficult objects to study; (ii) Spectrum data are crucial for analyzing food properties (such as sweetness). Our benchmark will be open source and continuously iterated for different food analysis tasks.

[300] ReLoop: "Seeing Twice and Thinking Backwards" via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding

Jianjiang Yang,Ziyan Huang,Yanshu Li

Main category: cs.CV

TL;DR: ReLoop is a closed-loop training framework that improves cross-modal understanding and reduces hallucinations in MLLMs through integrated consistency checks.

Details Motivation: MLLMs are prone to hallucinations that misrepresent input semantics. Current approaches lack internal validation during training, requiring external correction. Method: ReLoop uses a ring-shaped structure with three feedback mechanisms: semantic reconstruction, visual description, and attention supervision to validate outputs during training. Result: ReLoop significantly reduces hallucination rates across multiple benchmarks compared to existing methods. Conclusion: ReLoop provides a robust method for reducing hallucinations in MLLMs by enforcing multimodal consistency during training. Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in open-ended visual question answering, they remain vulnerable to hallucinations. These are outputs that contradict or misrepresent input semantics, posing a critical challenge to reliability and factual consistency. Existing methods often rely on external verification or post-hoc correction, lacking an internal mechanism to validate outputs directly during training. To bridge this gap, we propose ReLoop, a unified closed-loop training framework that encourages multimodal consistency for cross-modal understanding in MLLMs. ReLoop adopts a ring-shaped structure that integrates three complementary consistency feedback mechanisms, obliging MLLMs to "see twice and think backwards". Specifically, ReLoop employs the frozen Consistency Feedback Plugin (CFP), comprising semantic reconstruction, visual description, and an attention supervision module for attention alignment. These components collectively enforce semantic reversibility, visual consistency, and interpretable attention, enabling the model to correct its outputs during training. Extensive evaluations and analyses demonstrate the effectiveness of ReLoop in reducing hallucination rates across multiple benchmarks, establishing a robust method for hallucination mitigation in MLLMs. 
We will release our source code and data in the camera-ready version.

[301] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang,Hongsi Liu,Zekun Qi,Yunnan Wang,XinQiang Yu,Jiazhao Zhang,Runpei Dong,Jiawei He,He Wang,Zhizheng Zhang,Li Yi,Wenjun Zeng,Xin Jin

Main category: cs.CV

TL;DR: DreamVLA improves robot manipulation by integrating dynamic, spatial, and semantic knowledge into a perception-prediction-action loop using novel prediction and attention mechanisms.

Details Motivation: Existing VLA models face limitations in image-based forecasting due to redundant information and lack of critical world knowledge, such as dynamic, spatial, and semantic cues. Method: DreamVLA introduces a dynamic-region-guided world knowledge prediction approach with a block-wise structured attention mechanism and a diffusion-based transformer for disentangled action representation. Result: DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmarks. Conclusion: DreamVLA outperforms existing methods by integrating comprehensive world knowledge forecasting, enabling more effective action planning in robot manipulation tasks. Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. 
Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
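
The block-wise structured attention described in this abstract can be pictured as a boolean mask over token groups. The following is a minimal sketch, not DreamVLA's implementation: the group labels, the `blockwise_mask` helper, and the rule that every token may read a "shared" context group are assumptions made for illustration.

```python
def blockwise_mask(group_of_token):
    """Return mask[i][j] = True if query token i may attend to key token j.

    Assumed rule (hypothetical): every token may read the "shared" context,
    and a knowledge token may additionally read only tokens of its own group,
    so dynamic, spatial, and semantic tokens never attend to each other.
    """
    n = len(group_of_token)
    mask = [[False] * n for _ in range(n)]
    for i, query_group in enumerate(group_of_token):
        for j, key_group in enumerate(group_of_token):
            mask[i][j] = key_group == "shared" or query_group == key_group
    return mask


# Two shared context tokens followed by one token per knowledge type.
groups = ["shared", "shared", "dynamic", "spatial", "semantic"]
mask = blockwise_mask(groups)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In a real model such a mask would typically be applied to the attention logits before the softmax; how DreamVLA does this is not stated in the abstract.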

[302] Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Jianjiang Yang,Ziyan Huang

Main category: cs.CV

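TL;DR: This work treats hallucination in text-to-image diffusion models as trajectory drift within a latent alignment space and proposes a unified, interpretable framework for understanding and mitigating generative failures in diffusion-based text-to-image systems.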

Details Motivation: Despite remarkable progress in image quality and prompt fidelity, text-to-image diffusion models still exhibit persistent "hallucinations", where generated content diverges subtly or significantly from the intended prompt semantics. Method: The authors formalize hallucination in a three-axis space (semantic coherence, structural alignment, and knowledge grounding), introduce the Alignment Risk Code (ARC), and develop TM-ARC, a lightweight controller that applies targeted interventions during sampling. Result: Extensive experiments on standard text-to-image benchmarks show that the approach significantly reduces hallucination without degrading image quality or diversity. Conclusion: The paper reinterprets hallucination in text-to-image diffusion models from a cognitively inspired perspective and, by introducing the TensionModulator (TM-ARC), significantly reduces hallucination while preserving image quality and diversity. Abstract: Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent "hallucinations", where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity.
This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.
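
The abstract describes ARC as a three-component vector whose magnitude, direction, and imbalance each carry meaning. A minimal sketch of how such a vector might be summarized is given below. The abstract does not define these quantities formally, so the Euclidean norm, the largest-absolute-component rule, and the max-minus-min imbalance are all assumptions, as is the `summarize_arc` helper.

```python
import math

# Axis order is an assumption; the names follow the three axes in the abstract.
AXES = ("semantic_coherence", "structural_alignment", "knowledge_grounding")


def summarize_arc(arc):
    """Summarize a 3-component tension vector (hypothetical definitions)."""
    magnitude = math.sqrt(sum(x * x for x in arc))      # overall misalignment
    abs_vals = [abs(x) for x in arc]
    dominant = AXES[abs_vals.index(max(abs_vals))]      # dominant failure axis
    imbalance = max(abs_vals) - min(abs_vals)           # tension asymmetry
    return magnitude, dominant, imbalance


magnitude, dominant, imbalance = summarize_arc([0.3, 0.0, 0.4])
print(round(magnitude, 3), dominant, round(imbalance, 3))
```

A controller in the spirit of TM-ARC could use the dominant axis to choose which intervention to apply; the actual intervention rules are not given in the abstract.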

[303] CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step

Zheyuan Liu,Munan Ning,Qihui Zhang,Shuo Yang,Zhongrui Wang,Yiwei Yang,Xianzhe Xu,Yibing Song,Weihua Chen,Fan Wang,Li Yuan

Main category: cs.CV

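TL;DR: CoT-Diff is a text-to-image generation framework that tightly couples 3D layout planning with the diffusion process; by performing layout-aware reasoning within a single diffusion round, it significantly improves spatial alignment and compositional fidelity.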

Details Motivation: Current text-to-image (T2I) generation models struggle to align with the spatial composition of the input text in complex scenes, and even layout-based methods cannot achieve effective spatial control because their generation process is decoupled from layout planning. Method: The paper proposes CoT-Diff, a framework that brings step-by-step chain-of-thought (CoT) style reasoning into T2I generation. It tightly integrates 3D layout planning driven by a multimodal large language model (MLLM) with the diffusion process, dynamically updates the 3D scene layout at each denoising step, and fuses semantic conditions and depth maps into the diffusion model through a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Result: Experiments show that CoT-Diff significantly improves spatial alignment and compositional fidelity on 3D Scene benchmarks, outperforming the state-of-the-art method by 34.7% in complex scene spatial accuracy. Conclusion: CoT-Diff validates the effectiveness of this entangled generation paradigm and offers a new solution for spatial control in text-to-image generation. Abstract: Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.

[304] Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Eunseop Yoon,Hee Suk Yoon,Mark A. Hasegawa-Johnson,Chang D. Yoo

Main category: cs.CV

TL;DR: Video-LLMs struggle to identify and reject questions that go beyond the scope of a video. This paper proposes an alignment framework that enables Video-LLMs to assess question relevance and refuse unfit queries, along with a dedicated evaluation framework and dataset generation pipeline.

Details Motivation: Current Video-LLMs are not trained to identify and reject questions that cannot be answered based on the video content, which is crucial for real-world applications where question relevance varies. Method: The authors propose 'alignment for answerability,' a framework that trains Video-LLMs to evaluate whether a question can be answered from the given video and decline irrelevant ones. They also introduce an evaluation framework with metrics and a pipeline for generating relevant datasets using existing video-description pairs. Result: The study demonstrates that even top-performing Video-LLMs fail to reject unfit questions due to lack of training in relevance assessment. The proposed framework significantly improves their ability to handle out-of-scope questions. Conclusion: The proposed alignment framework enhances the practical applicability of Video-LLMs by enabling them to assess question relevance and avoid answering queries that exceed the video's informational scope. Abstract: In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime example is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions, not necessarily due to a lack of video understanding, but because they have not been trained to identify and refuse such questions.
To address this limitation, we propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video and appropriately decline to answer when the question exceeds the scope of the video, as well as an evaluation framework with a comprehensive set of metrics designed to measure model behavior before and after alignment. Furthermore, we present a pipeline for creating a dataset specifically tailored for alignment for answerability, leveraging existing video-description paired datasets.

[305] BiVM: Accurate Binarized Neural Network for Efficient Video Matting

Haotong Qin,Xianglong Liu,Xudong Ma,Lei Ke,Yulun Zhang,Jie Luo,Michele Magno

Main category: cs.CV

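TL;DR: This paper proposes BiVM, an efficient binarized neural network for real-time video matting that sharply reduces computation and storage requirements while maintaining high accuracy.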

Details Motivation: Real-time video matting faces significant computational limitations on edge devices, and existing binarization methods suffer from accuracy and efficiency problems. Method: The paper proposes binarized computation structures with elastic shortcuts and evolvable topologies, sparsifies the intermediate features of the decoder, and builds a localized binarization-aware mimicking framework. Result: Experiments show that BiVM achieves significant savings of 14.3x and 21.6x in computation and storage costs, respectively, and outperforms existing state-of-the-art binarized video matting networks. Conclusion: BiVM is an accurate and resource-efficient binarized neural network for video matting that achieves significant savings in computation and storage costs. Abstract: Deep neural networks for real-time video matting suffer significant computational limitations on edge devices, hindering their adoption in widespread applications such as online conferences and short-form video production. Binarization emerges as one of the most common compression approaches with compact 1-bit parameters and efficient bitwise operations. However, accuracy and efficiency limitations exist in the binarized video matting network due to its degenerated encoder and redundant decoder. Following a theoretical analysis based on the information bottleneck principle, the limitations are mainly caused by the degradation of prediction-relevant information in the intermediate features and the redundant computation in prediction-irrelevant areas. We present BiVM, an accurate and resource-efficient Binarized neural network for Video Matting. First, we present a series of binarized computation structures with elastic shortcuts and evolvable topologies, enabling the constructed encoder backbone to extract high-quality representation from input videos for accurate prediction. Second, we sparsify the intermediate features of the binarized decoder by masking homogeneous parts, allowing the decoder to focus on representation with diverse details while alleviating the computation burden for efficient inference. Furthermore, we construct a localized binarization-aware mimicking framework with the information-guided strategy, prompting matting-related representation in full-precision counterparts to be accurately and fully utilized.
Comprehensive experiments show that the proposed BiVM surpasses alternative binarized video matting networks, including state-of-the-art (SOTA) binarization methods, by a substantial margin. Moreover, our BiVM achieves significant savings of 14.3x and 21.6x in computation and storage costs, respectively. We also evaluate BiVM on ARM CPU hardware.
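
The "compact 1-bit parameters and efficient bitwise operations" mentioned in the abstract refer to a general property of binarized networks, which can be illustrated independently of BiVM. The sketch below assumes a common sign-plus-scale binarization (scale equal to the mean absolute weight) and computes a dot product of two ±1 vectors with XOR and a bit count. The helper names are hypothetical, and none of this is taken from the BiVM code.

```python
def binarize(weights):
    """Sign binarization with one scaling factor (assumed scheme)."""
    alpha = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack(signs):
    """Pack a list of +1/-1 values into an integer, one bit per value."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def binary_dot(bits_a, bits_b, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Positions that agree contribute +1 and positions that differ contribute -1,
    so the result is n - 2 * (number of differing bits).
    """
    disagreements = bin(bits_a ^ bits_b).count("1")
    return n - 2 * disagreements


alpha, signs = binarize([0.5, -1.5, 1.0, 1.0])
a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
dot = binary_dot(pack(a), pack(b), len(a))
print(alpha, signs, dot)
```

The bitwise form replaces floating-point multiply-accumulate operations with integer logic, which is the usual source of the computation and storage savings reported for binarized networks.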

[306] AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics

Jan Carreras Boada,Rao Muhammad Umer,Carsten Marr

Main category: cs.CV

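TL;DR: This paper fine-tunes a diffusion model to generate high-quality synthetic images, addressing sample imbalance and privacy constraints in biomedical datasets and thereby significantly improving classifier performance.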

Details Motivation: Biomedical datasets often suffer from severe sample imbalance and are subject to privacy constraints, which hinders the development of accurate machine learning models; a solution is needed that improves data availability while protecting patient privacy. Method: A stable diffusion model is fine-tuned with LoRA weights to generate synthetic images, guided by a small number of real samples, to strengthen classifier performance on limited data. Result: For a ResNet classifier, adding 5000 synthetic images per class raised accuracy from 27.3% to 78.4%; for a CLIP-based classifier, accuracy rose from 61.8% to 76.8%. The synthetic images closely resemble real images and help overcome dataset limitations, improving model generalization. Conclusion: Synthetic images can serve as a tool in biomedical research, improving machine learning models and facilitating medical diagnosis and research. Abstract: Biomedical datasets often contain a large sample imbalance and are subject to strict privacy constraints, which together hinder the development of accurate machine learning models. One potential solution is to generate synthetic images, as this can improve data availability while preserving patient privacy. However, it remains difficult to generate synthetic images of sufficient quality for training robust classifiers. In this work, we focus on the classification of single white blood cells, a key component in the diagnosis of hematological diseases such as acute myeloid leukemia (AML), a severe blood cancer. We demonstrate how synthetic images, generated with a stable diffusion model fine-tuned using LoRA weights and guided by real few-shot samples of the target white blood cell classes, can enhance classifier performance for limited data. When training a ResNet classifier, accuracy increased from 27.3% to 78.4% (+51.1%) by adding 5000 synthetic images per class to a small and highly imbalanced real dataset. For a CLIP-based classifier, the accuracy improved from 61.8% to 76.8% (+15.0%). The synthetic images are highly similar to real images, and they can help overcome dataset limitations, enhancing model generalization. Our results establish synthetic images as a tool in biomedical research, improving machine learning models, and facilitating medical diagnosis and research.

[307] Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions

Konstantinos Foteinos,Jorgen Cani,Manousos Linardakis,Panagiotis Radoglou-Grammatikis,Vasileios Argyriou,Panagiotis Sarigiannidis,Iraklis Varlamis,Georgios Th. Papadopoulos

Main category: cs.CV

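TL;DR: This survey systematically analyzes deep-learning-based hand gesture recognition, giving a comprehensive overview of state-of-the-art techniques, datasets, and evaluation metrics as guidance for researchers.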

Details Motivation: Despite the large body of research on hand gesture recognition, a structured survey is still missing, making it hard for researchers to find the right combination of data, model, and approach for a given task. Method: The paper follows a systematic research methodology, organizing the field's key methods, datasets, and evaluation metrics in a taxonomy and analyzing state-of-the-art techniques in depth. Result: The paper delivers a comprehensive survey that includes a technical analysis of the main tasks (static gesture recognition, isolated dynamic gestures, and continuous gesture recognition) and a summary of commonly used datasets and performance metrics. Conclusion: The paper summarizes research progress in vision-based hand gesture recognition, identifies current challenges, and outlines future research directions. Abstract: The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always important field of vision-based hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction using cameras. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the right combination of data, model, and approach for each task. The current survey aims to fill this gap by presenting a comprehensive overview of this aspect of computer vision. With a systematic research methodology that identifies the state-of-the-art works and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to choose the right strategy for delving into a certain VHGR task. Starting with the methodology used for study selection, literature retrieval, and the analytical framing, the survey identifies and organizes key VHGR approaches using a taxonomy-based format in various dimensions such as input modality and application domain. The core of the survey provides an in-depth analysis of state-of-the-art techniques across three primary VHGR tasks: static gesture recognition, isolated dynamic gestures and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed.
Additionally, the study reviews commonly used datasets, with emphasis on annotation schemes, and evaluates standard performance metrics. It concludes by identifying major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.

[308] Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration

Yuyi Zhang,Peirong Zhang,Zhenhua Yang,Pengyu Yan,Yongxin Shi,Pengwei Liu,Fengjun Guo,Lianwen Jin

Main category: cs.CV

TL;DR: This paper introduces FPHDR, a new dataset, and AutoHDR, an automated solution for historical document restoration, which together significantly improve restoration accuracy and contribute to preserving cultural heritage.

Details Motivation: Existing Historical Document Restoration (HDR) methods focus on single modality or limited-size restoration, which fails to meet practical needs. This gap motivated the development of a more comprehensive solution. Method: AutoHDR uses a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. Its modular architecture supports human-machine collaboration for flexible intervention. Result: AutoHDR significantly improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration, demonstrating its effectiveness in restoring severely damaged documents. Conclusion: The paper concludes that AutoHDR, along with the FPHDR dataset, represents a significant advancement in automated historical document restoration, contributing to cultural heritage preservation. Abstract: Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians' restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR's remarkable performance in HDR. 
When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.

[309] A Training-Free Style-Personalization via Scale-wise Autoregressive Model

Kyoungmin Lee,Jihun Park,Jongmin Gim,Wonhyeok Choi,Kyumin Hwang,Jaeyeul Kim,Sunghoon Im

Main category: cs.CV

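TL;DR: This paper proposes a training-free framework for style-personalized image generation that uses a scale-wise autoregressive model to control content and style information during inference.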

Details Motivation: To achieve flexible and efficient control over image semantic content and style without any additional training. Method: The method uses a three-path design (content, style, and generation), each path guided by a corresponding text prompt, and introduces Key Stage Attention Sharing and Adaptive Query Sharing mechanisms. Result: Experiments show that the method is competitive with fine-tuned baselines in style fidelity and prompt fidelity while offering faster inference and greater deployment flexibility. Conclusion: The study provides an effective and efficient training-free solution for style-personalized image generation. Abstract: We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design (content, style, and generation), with each path guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central contribution of this work is a step-wise and attention-wise intervention analysis. Through systematic prompt and feature injection, we find that early-to-middle generation steps play a pivotal role in shaping both content and style, and that query features predominantly encode content-specific information. Guided by these insights, we introduce two targeted mechanisms: Key Stage Attention Sharing, which aligns content and style during the semantically critical steps, and Adaptive Query Sharing, which reinforces content semantics in later steps through similarity-aware query blending. Extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.

[310] Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Yana Wei,Liang Zhao,Jianjian Sun,Kangheng Lin,Jisheng Yin,Jingcheng Hu,Yinmin Zhang,En Yu,Haoran Lv,Zejia Weng,Jia Wang,Chunrui Han,Yuang Peng,Qi Han,Zheng Ge,Xiangyu Zhang,Daxin Jiang,Vishal M. Patel

Main category: cs.CV

TL;DR: This paper introduces a novel two-stage training method for Multimodal LLMs that significantly improves their visual reasoning abilities, achieving top performance on multiple benchmarks.

Details Motivation: To unlock advanced visual reasoning in Multimodal LLMs by transferring the principle of reinforcement with verifiable rewards, similar to what is observed in large language models. Method: A two-stage paradigm was used: linguistic cold-start fine-tuning followed by multimodal reinforcement learning on a large scale. The experiments were conducted using Qwen2.5-VL-7B as the base model. Result: The proposed model, Open-Vision-Reasoner (OVR), achieved state-of-the-art results on reasoning benchmarks such as MATH500 (95.3%), MathVision (51.8%), and MathVerse (54.6%). Conclusion: The study concludes that applying reinforcement learning with verifiable rewards to Multimodal LLMs enhances visual reasoning capabilities significantly, leading to state-of-the-art performance on several benchmarks. Abstract: The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. 
We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

[311] U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration

Xiaofan Li,Zhihao Xu,Chenming Wu,Zhao Yang,Yumeng Zhang,Jiang-Jiang Liu,Haibao Yu,Fan Duan,Xiaoqing Ye,Yuan Wang,Shirui Li,Xun Sun,Ji Wan,Jun Wang

Main category: cs.CV

TL;DR: This paper proposes U-ViLAR, a visual localization framework that effectively balances large-scale and precise localization to achieve robust and accurate results.

Details Motivation: Accurate localization using visual information is important especially in urban environments where GNSS signals are degraded by nearby buildings and construction sites. Method: The paper proposes U-ViLAR, which first extracts features from visual data and maps them into BEV space. It then uses Perceptual Uncertainty-guided Association and Localization Uncertainty-guided Registration to mitigate errors. Result: Experimental results show that the method achieves state-of-the-art performance across multiple localization tasks and demonstrates stable performance in various challenging urban scenarios. Conclusion: U-ViLAR is a successful visual localization framework that achieves robust and accurate localization by balancing coarse-grained large-scale localization with fine-grained precise localization. Abstract: Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird's-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. 
By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios.

[312] MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

Zhendong Xiao,Wu Wei,Shujie Ji,Shan Yang,Changhao Chen

Main category: cs.CV

TL;DR: MVL-Loc is an innovative framework that leverages vision-language models and natural language guidance to achieve highly accurate and robust multi-scene camera relocalization.

Details Motivation: Traditional deep learning methods for camera relocalization lack generalization and robustness across diverse environments, limiting their effectiveness in real-world applications. Method: MVL-Loc uses vision-language models (VLMs) and multimodal data to generalize across scenes, employing natural language to guide learning and enhance spatial understanding. Result: Experiments on the 7Scenes and Cambridge Landmarks datasets show that MVL-Loc achieves superior performance in both positional and orientational accuracy for multi-scene camera relocalization. Conclusion: MVL-Loc is a robust and state-of-the-art framework for multi-scene 6-DoF camera relocalization, showing improved accuracy in real-world applications. Abstract: Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera's position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Unlike traditional deep learning-based methods that regress camera pose from images in a single scene, which often lack generalization and robustness in diverse environments, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc's robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.

[313] FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

Xinhua Lu,Runhe Lai,Yanqi Wu,Kanghao Chen,Wei-Shi Zheng,Ruixuan Wang

Main category: cs.CV

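TL;DR: This study proposes FA, a new CLIP-based framework that significantly improves OOD detection by strengthening the learning of ID knowledge, without relying on external datasets.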

Details Motivation: Existing CLIP-based methods usually focus on learning out-of-distribution (OOD) related knowledge to improve OOD detection, but this leads to limited generalization or reliance on large-scale external auxiliary datasets. The paper therefore aims to improve OOD detection using ID knowledge, without delving into intricate OOD-related knowledge. Method: The study introduces a forced coefficient and learns a prompt (the forced prompt) that contains more diversified and richer descriptions of the ID classes, strengthening the semantic similarity between ID images and the prompt. Result: Experiments show that the method consistently outperforms current state-of-the-art methods without any external auxiliary datasets, while keeping the same number of trainable parameters as CoOp. Conclusion: The study proposes Forced prompt leArning (FA), a new CLIP-based framework that makes full use of In-Distribution (ID) knowledge to improve the effectiveness of OOD detection, achieving notable gains even without external auxiliary datasets. Abstract: Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at https://github.com/0xFAFA/FA.

[314] Grounded Gesture Generation: Language, Motion, and Space

Anna Deichler,Jim O'Regan,Teo Guichoux,David Johansson,Jonas Beskow

Main category: cs.CV

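TL;DR: This study addresses the overlooked problem of spatially grounded, context-aware gestures in human motion generation, proposing a new multimodal dataset and framework that advance the development of embodied, communicative agents.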

Details Motivation: Although human motion generation has advanced rapidly in recent years, the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation or in isolated co-speech gesture synthesis, and both lines of work tend to treat motion and environmental grounding separately, limiting progress toward embodied, communicative agents. Method: The authors combine two key resources: a synthetic dataset of spatially grounded referential gestures, and MM-Conv, a VR-based dataset capturing two-party dialogues. They further connect the framework to a physics-based simulator, enabling synthetic data generation and situated evaluation. Result: The authors provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Conclusion: The authors propose a new multimodal dataset and framework that lay a foundation for research in situated gesture generation and grounded multimodal interaction. Abstract: Human motion generation has advanced rapidly in recent years, yet the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation, such as locomotion and object interaction, or in isolated co-speech gesture synthesis aligned with utterance semantics. However, both lines of work often treat motion and environmental grounding separately, limiting advances toward embodied, communicative agents. To address this gap, our work introduces a multimodal dataset and framework for grounded gesture generation, combining two key resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset capturing two-party dialogues. Together, they provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Our framework further connects to a physics-based simulator, enabling synthetic data generation and situated evaluation. By bridging gesture modeling and spatial grounding, our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction. Project page: https://groundedgestures.github.io/

[315] A Data-Driven Novelty Score for Diverse In-Vehicle Data Recording

Philipp Reis,Joshua Ransiek,David Petri,Jacob Langner,Eric Sax

Main category: cs.CV

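TL;DR: This paper proposes a real-time data selection method for building autonomous driving datasets; it performs object-level novelty detection with a dynamic Mean Shift algorithm to counter the rarity of novel data and improve model performance.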

Details Motivation: High-quality datasets are essential for training robust perception systems in autonomous driving. However, real-world data collection is often biased toward common scenes and objects, leaving novel cases underrepresented. This imbalance hinders model generalization and compromises safety. Method: A novel dynamic Mean Shift algorithm models normal content from mean and covariance statistics to identify image frames with novel objects and discard frames with redundant elements. Result: The main finding is that reducing the training dataset size with this method can improve model performance, whereas higher redundancy tends to degrade it. Moreover, as data redundancy increases, more aggressive filtering becomes both possible and beneficial. Conclusion: The paper proposes a real-time data selection method based on object-level novelty detection to build more balanced and diverse training datasets for autonomous driving perception systems. By assigning a data-driven novelty score with a dynamic Mean Shift algorithm, it filters redundant data effectively while retaining novel cases. Abstract: High-quality datasets are essential for training robust perception systems in autonomous driving. However, real-world data collection is often biased toward common scenes and objects, leaving novel cases underrepresented. This imbalance hinders model generalization and compromises safety. The core issue is the curse of rarity. Over time, novel events occur infrequently, and standard logging methods fail to capture them effectively. As a result, large volumes of redundant data are stored, while critical novel cases are diluted, leading to biased datasets. This work presents a real-time data selection method focused on object-level novelty detection to build more balanced and diverse datasets. The method assigns a data-driven novelty score to image frames using a novel dynamic Mean Shift algorithm. It models normal content based on mean and covariance statistics to identify frames with novel objects, discarding those with redundant elements. The main findings show that reducing the training dataset size with this method can improve model performance, whereas higher redundancy tends to degrade it. Moreover, as data redundancy increases, more aggressive filtering becomes both possible and beneficial. While random sampling can offer some gains, it often leads to overfitting and unpredictability in outcomes. The proposed method supports real-time deployment at 32 frames per second, and this rate remains constant over time. By continuously updating the definition of normal content, it enables efficient detection of novelties in a continuous data stream.
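
The idea of scoring novelty against continuously updated mean and covariance statistics can be sketched as follows. This is not the paper's dynamic Mean Shift algorithm: it is a simplified stand-in that keeps a running mean and per-dimension variance (a diagonal covariance, which is an assumption) using Welford's online update, and scores a feature vector by its normalized distance from the mean. The class name `RunningNovelty` and the toy feature vectors are hypothetical.

```python
import math


class RunningNovelty:
    """Online novelty score from a running mean and diagonal variance."""

    def __init__(self, dim, eps=1e-6):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim   # running sum of squared deviations per dimension
        self.eps = eps          # keeps the division stable for tiny variances

    def score(self, x):
        """Distance of x from the current notion of 'normal' content."""
        if self.n < 2:
            return float("inf")  # nothing is known yet, so everything is novel
        total = 0.0
        for xi, mi, m2i in zip(x, self.mean, self.m2):
            variance = m2i / (self.n - 1) + self.eps
            total += (xi - mi) ** 2 / variance
        return math.sqrt(total)

    def update(self, x):
        """Welford's update: constant cost per sample, no stored history."""
        self.n += 1
        for k, xi in enumerate(x):
            delta = xi - self.mean[k]
            self.mean[k] += delta / self.n
            self.m2[k] += delta * (xi - self.mean[k])


detector = RunningNovelty(dim=2)
first_score = detector.score([0.0, 0.0])
for features in ([0.0, 0.0], [2.0, 2.0], [0.0, 2.0], [2.0, 0.0]):
    detector.update(features)
typical = detector.score([1.0, 1.0])
unusual = detector.score([5.0, 5.0])
print(typical, unusual)
```

Because each update touches only the running statistics, the cost per frame does not grow with the length of the stream, which is the property a constant-rate, real-time selector needs.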

[316] MambaVideo for Discrete Video Tokenization with Channel-Split Quantization

Dawit Mureja Argaw,Xian Liu,Joon Son Chung,Ming-Yu Liu,Fitsum Reda

Main category: cs.CV

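TL;DR: This paper proposes a new discrete video tokenizer that uses a Mamba-based encoder-decoder architecture and channel-split quantization, achieving state-of-the-art performance for autoregressive video generation.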

Details Motivation: Because video data is high-dimensional, efficient autoregressive generative modeling requires discrete video tokenization. Method: Introduces a Mamba-based encoder-decoder architecture and a new quantization scheme, channel-split quantization. Result: Experiments show that the proposed model performs strongly across multiple datasets and is robust as a tokenizer for autoregressive video generation. Conclusion: The paper presents a state-of-the-art discrete video tokenizer that outperforms causal 3D convolution-based and Transformer-based approaches across multiple datasets. Abstract: Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.

[317] S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control

Xudong Liu,Zikun Chen,Ruowei Jiang,Ziyi Wu,Kejia Yin,Han Zhao,Parham Aarabi,Igor Gilitschenski

Main category: cs.CV

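TL;DR: This paper proposes S^2Edit, a new method built on a pre-trained text-to-image diffusion model that enables personalized image editing with precise semantic and spatial control.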

Details Motivation: Existing methods perform poorly on image editing tasks that require fine-grained control, tending to lose identity information or alter unrelated regions. Method: S^2Edit combines model fine-tuning, identity information embedding, an orthogonality constraint, and object masks that guide the attention maps. Result: Experiments show that S^2Edit outperforms existing methods both quantitatively and qualitatively, and demonstrate applications such as makeup transfer. Conclusion: S^2Edit achieves personalized image editing through semantic and spatial control and outperforms existing methods. Abstract: Recent advances in diffusion models have enabled high-quality generation and manipulation of images guided by texts, as well as concept learning from images. However, naive applications of existing methods to editing tasks that require fine-grained control, e.g., face editing, often lead to suboptimal solutions with identity information and high-frequency details lost during the editing process, or irrelevant image regions altered due to entangled concepts. In this work, we propose S$^2$Edit, a novel method based on a pre-trained text-to-image diffusion model that enables personalized editing with precise semantic and spatial control. We first fine-tune our model to embed the identity information into a learnable text token. During fine-tuning, we disentangle the learned identity token from attributes to be edited by enforcing an orthogonality constraint in the textual feature space. To ensure that the identity token only affects regions of interest, we apply object masks to guide the cross-attention maps. At inference time, our method performs localized editing while faithfully preserving the original identity with semantically disentangled and spatially focused identity token learned. Extensive experiments demonstrate the superiority of S$^2$Edit over state-of-the-art methods both quantitatively and qualitatively. Additionally, we showcase several compositional image editing applications of S$^2$Edit such as makeup transfer.

[318] CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection

Hanzhi Zhong,Zhiyu Xiang,Ruoyu Xu,Jingyun Fu,Peng Xu,Shaohong Wang,Zhihao Yang,Tianyu Pu,Eryun Liu

Main category: cs.CV

TL;DR: This paper proposes CVFusion, a novel fusion method for 3D object detection using 4D radar and camera data, achieving state-of-the-art results on public datasets.

Details Motivation: The motivation stems from the limitations of current methods in fully exploring the potential of 4D radar and fusion mechanisms for 3D object detection, despite its importance in autonomous driving under adverse weather conditions. Method: A cross-view two-stage fusion network called CVFusion was proposed, consisting of a radar guided iterative BEV fusion module in the first stage and multi-view feature aggregation in the second stage. Result: Experiments showed significant improvements with 9.10% and 3.68% mAP gains on the View-of-Delft (VoD) and TJ4DRadSet datasets, respectively. Conclusion: The study concludes that the proposed CVFusion network significantly improves 3D object detection performance by effectively utilizing 4D radar potential and implementing a robust fusion mechanism. Abstract: 4D radar has received significant attention in autonomous driving thanks to its robustness under adverse weather. Due to the sparse points and noisy measurements of the 4D radar, most research approaches the 3D object detection task by integrating camera images and performing modality fusion in BEV space. However, the potential of the radar and the fusion mechanism is still largely unexplored, hindering the performance improvement. In this study, we propose a cross-view two-stage fusion network called CVFusion. In the first stage, we design a radar guided iterative (RGIter) BEV fusion module to generate high-recall 3D proposal boxes. In the second stage, we aggregate features from multiple heterogeneous views including points, image, and BEV for each proposal. These comprehensive instance level features greatly help refine the proposals and generate high-quality predictions. Extensive experiments on public datasets show that our method outperforms the previous state-of-the-art methods by a large margin, with 9.10% and 3.68% mAP improvements on View-of-Delft (VoD) and TJ4DRadSet, respectively. Our code will be made publicly available.

[319] QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

Jiahui Yang,Yongjia Ma,Donglin Di,Hao Li,Wei Chen,Yan Xie,Jianxun Cui,Xun Yang,Wangmeng Zuo

Main category: cs.CV

TL;DR: QR-LoRA is a novel fine-tuning framework that uses QR decomposition for structured parameter updates to effectively separate visual attributes, reducing trainable parameters and supporting effective merging of multiple adaptations.

Details Motivation: To overcome undesired feature entanglement between content and style attributes when combining multiple LoRA models for content-style fusion tasks. Method: QR decomposition for structured parameter updates to separate visual attributes effectively. Result: Structured design reduces trainable parameters to half of conventional LoRA methods. Supports effective merging of multiple adaptations without cross-contamination. Achieves superior disentanglement in content-style fusion tasks. Conclusion: QR-LoRA offers a new paradigm for parameter-efficient, disentangled fine-tuning in generative models. Abstract: Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.
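
The structure described in the abstract (fixed Q and R, a trainable $\Delta R$, and merging of several adaptations) can be sketched on a tiny matrix. This is an illustrative reading, not the authors' implementation: the abstract does not say which matrix is decomposed or how $\Delta R$ is constrained, so the sketch simply factors a small square weight as W = QR with classical Gram-Schmidt and forms Q(R + ΔR_1 + ΔR_2 + ...). The function names `qr_gram_schmidt` and `adapted_weight` and the example matrices are hypothetical.

```python
def matmul(a, b):
    """Plain list-of-rows matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def qr_gram_schmidt(w):
    """Classical Gram-Schmidt QR of a square, full-rank matrix (row lists).

    Returns (q, r) with orthonormal columns in q and upper-triangular r.
    """
    n = len(w)
    cols = [[w[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j, v in enumerate(cols):
        u = list(v)
        for k, q in enumerate(q_cols):
            # Projection coefficient of column j onto earlier direction k.
            r[k][j] = sum(q[i] * v[i] for i in range(n))
            u = [u[i] - r[k][j] * q[i] for i in range(n)]
        r[j][j] = sum(x * x for x in u) ** 0.5
        q_cols.append([x / r[j][j] for x in u])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def adapted_weight(q, r, delta_rs):
    """Return Q (R + sum of delta R matrices).

    Q and R stay fixed; only the delta R matrices would be trained.
    Passing several of them models merging multiple adaptations.
    """
    n = len(r)
    total = [[r[i][j] + sum(d[i][j] for d in delta_rs) for j in range(n)]
             for i in range(n)]
    return matmul(q, total)


# Hypothetical usage: one "content" and one "style" update on a 2x2 weight.
w = [[2.0, 1.0], [1.0, 3.0]]
q, r = qr_gram_schmidt(w)
delta_content = [[0.1, 0.0], [0.0, 0.0]]
delta_style = [[0.0, 0.0], [0.0, 0.2]]
merged = adapted_weight(q, r, [delta_content, delta_style])
```

Because Q is shared and fixed, merging reduces to adding the $\Delta R$ matrices before multiplying by Q, which is the property the abstract relies on for combining adaptations.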

[320] HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction

Jiaqi Cui,Lu Wen,Yuchen Fei,Bo Liu,Luping Zhou,Dinggang Shen,Yan Wang

Main category: cs.CV

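TL;DR: This paper proposes HiLa, a new hierarchical vision-language collaboration framework for improved cancer survival prediction. It addresses the shortcomings of existing methods in vision-language alignment and hierarchical modeling, and shows state-of-the-art performance on multiple datasets.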

Details Motivation: Existing survival prediction methods based on sparse whole-slide image (WSI) labels are limited by their inability to learn fine-grained associations between multi-faceted linguistic information and visual features within WSIs, and vision-language (VL) models have not solved this effectively because they rely only on simple language prompts and basic cosine similarity. Method: HiLa uses pretrained feature extractors to generate hierarchical visual features from WSIs, aligns them with language prompts describing different survival-related attributes via Optimal Prompt Learning (OPL), and introduces Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL) modules to strengthen cooperation across levels. Result: Experiments on three TCGA datasets show that the proposed HiLa framework achieves state-of-the-art (SOTA) performance. Conclusion: The paper proposes a new hierarchical vision-language collaboration framework (HiLa) that improves cancer survival prediction by addressing the shortcomings of existing methods in vision-language alignment and hierarchical modeling. Abstract: Survival prediction using whole-slide images (WSIs) is crucial in cancer research. Despite notable success, existing approaches are limited by their reliance on sparse slide-level labels, which hinders the learning of discriminative representations from gigapixel WSIs. Recently, vision language (VL) models, which incorporate additional language supervision, have emerged as a promising solution. However, VL-based survival prediction remains largely unexplored due to two key challenges. First, current methods often rely on only one simple language prompt and basic cosine similarity, which fails to learn fine-grained associations between multi-faceted linguistic information and visual features within WSI, resulting in inadequate vision-language alignment. Second, these methods primarily exploit patch-level information, overlooking the intrinsic hierarchy of WSIs and their interactions, causing ineffective modeling of hierarchical interactions. To tackle these problems, we propose a novel Hierarchical vision-Language collaboration (HiLa) framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. At each level, a series of language prompts describing various survival-related attributes are constructed and aligned with visual features via Optimal Prompt Learning (OPL). This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts, thereby improving vision-language alignment. 
Furthermore, we introduce two modules, i.e., Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL) to maximize hierarchical cooperation by promoting interactions and consistency between patch and region levels. Experiments on three TCGA datasets demonstrate our SOTA performance.

[321] Learn 3D VQA Better with Active Selection and Reannotation

Shengli Zhou,Yang Liu,Feng Zheng

Main category: cs.CV

TL;DR: This paper introduces an effective multi-turn active learning approach for 3D Visual Question Answering that reduces training costs and handles misleading annotations by leveraging semantic uncertainty and requesting targeted reannotations.

Details Motivation: The motivation is to address the issue of misleading annotations in 3D VQA datasets, which are especially problematic due to the scarcity of 3D scene data and limitations in existing active learning strategies. Method: They propose a multi-turn interactive active learning strategy that selects data based on semantic uncertainty and actively requests reannotation from an oracle. They also use a variance-based metric for uncertainty assessment, which considers semantic relationships between terms. Result: Experiments show better model performance and a significant reduction in training costs—specifically, a 50% reduction to achieve relatively high accuracy. Conclusion: The paper concludes that their proposed multi-turn interactive active learning strategy improves model performance while significantly reducing training costs in 3D Visual Question Answering tasks. Abstract: 3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models' semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. 
For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments exhibit better model performance and a substantial reduction in training costs, with a halving of training costs for achieving relatively high accuracy. The code is available at https://github.com/fz-zsl/AQuA.

[322] Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

Yun Wang,Longguang Wang,Chenghao Zhang,Yongjian Zhang,Zhanjie Zhang,Ao Ma,Chenyou Fan,Tin Lun Lam,Junjie Hu

Main category: cs.CV

TL;DR: SMoEStereo improves stereo matching by adapting Vision Foundation Models using a tailored combination of LoRA and MoE modules, enabling strong cross-domain performance without dataset-specific tuning.

Details Motivation: Existing learning-based stereo matching networks struggle with cross-domain performance due to domain shifts and imbalanced disparity distributions. While Vision Foundation Models (VFMs) offer potential for improved robustness, effectively integrating them into stereo matching cost-effectively remains a challenge. Method: SMoEStereo introduces a novel architecture combining MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes, along with a lightweight decision network to selectively activate MoE modules based on input complexity. This approach enhances robustness, integrates inductive bias into VFMs, and balances computational efficiency with accuracy. Result: Extensive experiments show that SMoEStereo achieves state-of-the-art results on cross-domain and joint generalization benchmarks, outperforming existing methods without requiring specific adaptations for each dataset. Conclusion: The proposed SMoEStereo framework effectively adapts Vision Foundation Models (VFMs) for stereo matching by integrating Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules, achieving state-of-the-art cross-domain and joint generalization performance without requiring dataset-specific adaptation. Abstract: Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. 
SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at https://github.com/cocowy1/SMoE-Stereo.

[323] LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction

Yixin Yan,Yang Li,Yuanfan Wang,Xiaozhou Zhou,Beihao Xia,Manjiang Hu,Hongmao Qin

Main category: cs.CV

TL;DR: This paper proposes LTMSformer, a lightweight framework for multi-agent trajectory prediction that improves accuracy by capturing local temporal dependencies and high-order motion states while significantly reducing model size.

Details Motivation: Modeling complex temporal-spatial dependencies between agents for trajectory prediction has been challenging. Capturing local temporal dependency and high-order motion state attributes can improve prediction performance, but these aspects are often overlooked in previous studies. Method: The proposed LTMSformer framework includes a Local Trend-Aware Attention mechanism to capture local temporal dependencies, a Motion State Encoder to incorporate high-order motion state attributes, and a Lightweight Proposal Refinement Module using Multi-Layer Perceptrons for refining trajectories. Result: Experiments on the Argoverse 1 dataset show that LTMSformer outperforms HiVT-64 by reducing minADE by 4.35%, minFDE by 8.74%, and MR by 20%. It also achieves higher accuracy than HiVT-128 with a 68% reduction in model size. Conclusion: LTMSformer is a lightweight framework that effectively captures temporal-spatial dependencies for multi-modal trajectory prediction, outperforming existing methods in both accuracy and model size reduction. Abstract: It has been challenging to model the complex temporal-spatial dependencies between agents for trajectory prediction. As each state of an agent is closely related to the states of adjacent time steps, capturing the local temporal dependency is beneficial for prediction, while most studies often overlook it. Besides, learning the high-order motion state attributes is expected to enhance spatial interaction modeling, but it is rarely seen in previous works. To address this, we propose a lightweight framework, LTMSformer, to extract temporal-spatial interaction features for multi-modal trajectory prediction. Specifically, we introduce a Local Trend-Aware Attention mechanism to capture the local temporal dependency by leveraging a convolutional attention mechanism with hierarchical local time boxes. 
Next, to model the spatial interaction dependency, we build a Motion State Encoder to incorporate high-order motion state attributes, such as acceleration, jerk, heading, etc. To further refine the trajectory prediction, we propose a Lightweight Proposal Refinement Module that leverages Multi-Layer Perceptrons for trajectory embedding and generates the refined trajectories with fewer model parameters. Experiment results on the Argoverse 1 dataset demonstrate that our method outperforms the baseline HiVT-64, reducing the minADE by approximately 4.35%, the minFDE by 8.74%, and the MR by 20%. We also achieve higher accuracy than HiVT-128 with a 68% reduction in model size.
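
The high-order motion state attributes named in the abstract (acceleration, jerk, heading) can be derived from a position track by repeated finite differencing. The sketch below is only one plausible way to do this; the abstract does not say how the Motion State Encoder obtains these attributes, and the function names and the fixed time step `dt` are assumptions.

```python
import math


def finite_difference(seq, dt):
    """Forward difference of a list of (x, y) tuples, divided by dt."""
    return [((b[0] - a[0]) / dt, (b[1] - a[1]) / dt)
            for a, b in zip(seq, seq[1:])]


def motion_states(positions, dt):
    """Derive velocity, acceleration, jerk, and heading from 2D positions.

    Each differencing step shortens the sequence by one sample, so a track
    of N positions yields N-1 velocities, N-2 accelerations, N-3 jerks.
    """
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    # Heading is taken as the direction of the velocity vector, in radians.
    heading = [math.atan2(vy, vx) for vx, vy in velocity]
    return {
        "velocity": velocity,
        "acceleration": acceleration,
        "jerk": jerk,
        "heading": heading,
    }


# Hypothetical track: constant acceleration along x (x = t^2), sampled at 1 s.
track = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (9.0, 0.0), (16.0, 0.0)]
states = motion_states(track, dt=1.0)
```

For the constant-acceleration track above, the differenced acceleration is the same at every step and the jerk is zero, which is a quick sanity check on the differencing order.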

[324] MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding

Zhicheng Zhang,Wuyou Xia,Chenxi Zhao,Zhou Yan,Xiaoqiang Liu,Yongjie Zhu,Wenyu Qin,Pengfei Wan,Di Zhang,Jufeng Yang

Main category: cs.CV

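TL;DR: This study proposes MODA, a new multimodal attention mechanism that effectively addresses the attention deficit problem in multimodal learning and performs strongly across multiple tasks.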

Details Motivation: Current multimodal large language models focus mainly on language-centric tuning and rarely explore how multimodal tokens are mixed through attention, which creates challenges in high-level tasks that require fine-grained cognition and emotion understanding. Method: Proposes an attention mechanism called MODA that uses a correct-after-align strategy to decouple attention into modality alignment and cross-layer token mixing, and ensures the correctness of attention scores through adaptive masked attention. Result: Extensive experiments on 21 benchmark datasets verify MODA's effectiveness in perception, cognition, and emotion tasks. Conclusion: MODA is a new attention mechanism that performs inner-modal refinement and inter-modal interaction simultaneously, addressing the attention deficit problem in multimodal learning and improving model flexibility. Abstract: Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while less exploring multimodal tokens mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting the inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling the interaction between visual and language modality. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available at https://zzcheng.top/MODA.

[325] UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification

Xixi Wan,Aihua Zheng,Bo Jiang,Beibei Wang,Chenglong Li,Jin Tang

Main category: cs.CV

TL;DR: This paper proposes UGG-ReID, a robust approach for multi-modal object ReID that addresses intra-modal noise and inter-modal conflicts by estimating and utilizing uncertainty.

Details Motivation: Existing methods for multi-modal object Re-ID often overlook the uncertainty arising from intra-modal noise and inter-modal conflicts, which is particularly significant in the case of fine-grained local occlusion and frame loss. Method: UGG-ReID mitigates noise interference and facilitates effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. It uses a Gaussian patch-graph representation model and an uncertainty-guided mixture of experts strategy. Result: UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets and shows significant improvement over current methods. Conclusion: The proposed UGG-ReID method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Abstract: Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. Existing methods primarily aim to improve identification performance, but often overlook the uncertainty arising from inherent defects, such as intra-modal noise and inter-modal conflicts. This uncertainty is particularly significant in the case of fine-grained local occlusion and frame loss, which becomes a challenge in multi-modal learning. To address the above challenge, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose the Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. 
This process boosts the expressiveness of modal-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty. This strategy effectively suppresses noise-induced instability, leading to enhanced robustness. Meanwhile, we design an uncertainty-guided routing to strengthen the multi-modal interaction, improving the performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets, encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Our code will be made public upon acceptance.

[326] VectorLLM: Human-like Extraction of Structured Building Contours via Multimodal LLMs

Tao Zhang,Shiqing Wei,Shihao Chen,Wenling Yu,Muying Luo,Shunping Ji

Main category: cs.CV

TL;DR: VectorLLM introduces a novel method for extracting building contours from remote sensing images, leveraging LLMs for higher accuracy and broader applicability.

Details Motivation: Current methods for extracting building contours rely on complex multi-stage pipelines, limiting scalability. The study aims to introduce a more efficient and unified approach inspired by LLMs' reasoning capabilities. Method: VectorLLM directly performs corner-point regression of building contours, mimicking human annotators, using a vision foundation backbone, an MLP connector, and an LLM with learnable position embeddings. Result: VectorLLM outperformed previous state-of-the-art methods by 5.6 AP, 7.1 AP, and 13.6 AP across WHU, WHU-Mix, and CrowdAI datasets, with notable zero-shot performance on unseen objects. Conclusion: VectorLLM is the first Multi-modal Large Language Model designed for building contour extraction, achieving high accuracy and strong generalization in remote sensing tasks. Abstract: Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM performs corner-point by corner-point regression of building contours directly, mimicking human annotators' labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to improve spatial understanding capability. 
Through comprehensive exploration of training strategies including pretraining, supervised fine-tuning, and preference optimization across WHU, WHU-Mix, and CrowdAI datasets, VectorLLM significantly outperformed the previous SOTA methods by 5.6 AP, 7.1 AP, 13.6 AP, respectively in the three datasets. Remarkably, VectorLLM exhibits strong zero-shot performance on unseen objects including aircraft, water bodies, and oil tanks, highlighting its potential for unified modeling of diverse remote sensing object contour extraction tasks. Overall, this work establishes a new paradigm for vector extraction in remote sensing, leveraging the topological reasoning capabilities of LLMs to achieve both high accuracy and exceptional generalization. All the codes and weights will be published for promoting community development.

[327] What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Hahyeon Choi,Junhoo Lee,Nojun Kwak

Main category: cs.CV

TL;DR: This paper proposes AVATAR, a new video-centric benchmark for audio-visual localization, and TAVLO, a model leveraging temporal information, demonstrating improved performance over existing methods.

Details Motivation: Existing studies on Audio-Visual Localization (AVL) focus only on image-level associations and simplified scenarios. This work aims to address their limitations by considering temporal dynamics and more realistic settings. Method: The authors introduced AVATAR, a video-centric AVL benchmark with four distinct scenarios, and proposed TAVLO, a novel model that explicitly integrates temporal information for better audio-visual alignment. Result: Experimental results show that conventional methods struggle due to reliance on global audio features and frame-level mappings, while TAVLO achieves robust and precise alignment through high-resolution temporal modeling. Conclusion: The paper concludes that incorporating high-resolution temporal information significantly improves audio-visual localization, as demonstrated by the proposed TAVLO model's performance. It establishes AVATAR as a more comprehensive benchmark for evaluating AVL models. Abstract: Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. 
Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.

[328] ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing

Zhenghui Zhao,Chen Wu,Di Wang,Hongruixuan Chen,Zhuo Zheng

Main category: cs.CV

TL;DR: This paper introduces ChangeBridge, a novel spatiotemporal diffusion model that simulates future remote sensing scenarios using multimodal controls.

Details Motivation: Existing methods have not explored the simulation of future scenarios based on given scenario images, despite its wide applications in urban planning, land management, and more. Method: ChangeBridge uses a conditional spatiotemporal diffusion model that leverages a stochastic Brownian-bridge diffusion to simulate future scenarios from pre-event images and multimodal spatial controls. Result: Experimental results show that ChangeBridge can simulate high-fidelity future scenarios aligned with given conditions, including event and event-driven background variations. Conclusion: ChangeBridge is the first spatiotemporal generative model with multimodal controls for remote sensing, enabling simulation of future scenarios based on given scenario images. Abstract: Recent advancements in generative methods, especially diffusion models, have made great progress in remote sensing image synthesis. Despite these advancements, existing methods have not explored the simulation of future scenarios based on given scenario images. This simulation capability has wide applications for urban planning, land management, and beyond. In this work, we propose ChangeBridge, a conditional spatiotemporal diffusion model. Given pre-event images and conditioned on multimodal spatial controls (e.g., text prompts, instance layouts, and semantic maps), ChangeBridge can synthesize post-event images. The core idea behind ChangeBridge is to model the noise-to-image diffusion process as a pre-to-post diffusion bridge. Conditioned on multimodal controls, ChangeBridge leverages a stochastic Brownian-bridge diffusion, directly modeling the spatiotemporal evolution between pre-event and post-event states. To the best of our knowledge, ChangeBridge is the first spatiotemporal generative model with multimodal controls for remote sensing. 
Experimental results demonstrate that ChangeBridge can simulate high-fidelity future scenarios aligned with given conditions, including event and event-driven background variations. Code will be available.
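
As a rough illustration of the Brownian-bridge idea mentioned in the abstract, the sketch below samples an intermediate state of a textbook Brownian bridge pinned at a "pre" vector and a "post" vector. It is not the ChangeBridge implementation: the function name, the `sigma` parameter, and the use of flat lists in place of images are assumptions made for this example, and the paper's actual noise schedule and conditioning are not described here.

```python
import math
import random


def brownian_bridge_sample(x_pre, x_post, t, T=1.0, sigma=1.0, rng=random):
    """Sample a Brownian-bridge state at time t, pinned at x_pre (t=0) and x_post (t=T).

    Textbook marginal: mean is the linear interpolation between the endpoints,
    variance is sigma^2 * t * (T - t) / T, which vanishes at both endpoints.
    All names here are hypothetical, chosen for illustration only.
    """
    if not 0.0 <= t <= T:
        raise ValueError("t must lie in [0, T]")
    w = t / T
    std = sigma * math.sqrt(t * (T - t) / T)
    return [
        (1.0 - w) * a + w * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x_pre, x_post)
    ]
```

Because the variance is zero at both ends, the process starts exactly at the pre-event state and ends exactly at the post-event state, which is what distinguishes a bridge from a noise-to-image diffusion.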

[329] Colorectal Cancer Tumor Grade Segmentation in Digital Histopathology Images: From Giga to Mini Challenge

Alper Bahcekapili,Duygu Arslan,Umut Ozdemir,Berkay Ozkirli,Emre Akbas,Ahmet Acar,Gozde B. Akar,Bingdou He,Shuoyu Xu,Umit Mert Caglar,Alptekin Temizel,Guillaume Picaud,Marc Chaumont,Gérard Subsol,Luc Téot,Fahad Alsharekh,Shahad Alghannam,Hexiang Mao,Wenhua Zhang

Main category: cs.CV

TL;DR: This paper addresses histopathological grading of colorectal cancer and presents the results of an automated grading and segmentation challenge built on the METU CCTGS dataset.

Details Motivation: Accurate histopathological grading of colorectal cancer is essential for prognosis and treatment planning, but the process is highly subjective and constrained by a shortage of pathologists. Method: The authors organized the ICIP Grand Challenge using the publicly available METU CCTGS dataset, which contains 103 whole-slide images with expert pixel-level annotations. Participating teams submitted segmentation masks via Codalab and were evaluated with metrics such as macro F-score and mIoU. Result: Of 39 participating teams, six outperformed the Swin Transformer baseline (62.92 F-score). Conclusion: The paper gives an overview of the challenge, the dataset, and the top-performing methods, advancing automated and standardized solutions for colorectal cancer tumor grading and segmentation. Abstract: Colorectal cancer (CRC) is the third most diagnosed cancer and the second leading cause of cancer-related death worldwide. Accurate histopathological grading of CRC is essential for prognosis and treatment planning but remains a subjective process prone to observer variability and limited by global shortages of trained pathologists. To promote automated and standardized solutions, we organized the ICIP Grand Challenge on Colorectal Cancer Tumor Grading and Segmentation using the publicly available METU CCTGS dataset. The dataset comprises 103 whole-slide images with expert pixel-level annotations for five tissue classes. Participants submitted segmentation masks via Codalab, evaluated using metrics such as macro F-score and mIoU. Among 39 participating teams, six outperformed the Swin Transformer baseline (62.92 F-score). This paper presents an overview of the challenge, dataset, and the top-performing methods.
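
For readers unfamiliar with the two metrics named above, here is a minimal sketch of macro F-score and mIoU computed from flattened label sequences. It uses the standard per-class definitions; the challenge's exact evaluation protocol (for example, how empty classes or background pixels are handled) is not given in the abstract, so skipping classes absent from both prediction and ground truth is an assumption of this example.

```python
def macro_f1_and_miou(y_true, y_pred, num_classes):
    """Return (macro F1, mean IoU) over classes for two equal-length label sequences."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it does not belong
            fn[t] += 1  # missed a pixel of true class t

    f1s, ious = [], []
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        if union == 0:
            continue  # assumption: ignore classes absent from both inputs
        f1s.append(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]))
        ious.append(tp[c] / union)
    if not f1s:
        raise ValueError("no class present in either input")
    return sum(f1s) / len(f1s), sum(ious) / len(ious)
```

Both metrics average per-class scores with equal weight, so a rare tissue class counts as much as a common one.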

[330] TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation

Changsong Lei,Yaqian Liang,Shaofeng Wang,Jiajia Dai,Yong-Jin Liu

Main category: cs.CV

TL;DR: This study proposes TeethGenerator, a novel two-stage framework that uses a diffusion model to generate paired 3D teeth models, addressing the data-collection bottleneck of conventional approaches and significantly improving the performance of tooth arrangement neural networks.

Details Motivation: Conventional collection of teeth models is time-consuming and labor-intensive, and existing 3D shape generation methods struggle with the complex morphological structure of teeth. Method: A novel two-stage framework is proposed, comprising a teeth shape generation module and a teeth style generation module, to generate paired 3D teeth models. Result: Experiments show that the synthetic dataset aligns closely with the distribution of real data and significantly improves tooth arrangement performance. Conclusion: The teeth model dataset synthesized by TeethGenerator effectively improves the training of tooth arrangement networks. Abstract: Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly in acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation methods have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired pre- and post-orthodontic 3D teeth models, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and promotes tooth alignment performance significantly when combined with real data for training. The code and dataset are available at https://github.com/lcshhh/teeth_generator.

[331] Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

Wanchang Yu,Qing Zhang,Rongjia Zheng,Wei-Shi Zheng

Main category: cs.CV

TL;DR: This paper proposes a diffusion-based portrait shadow removal technique that generates high-quality, detailed results by combining structure-guided inpainting and detail restoration models, outperforming previous methods.

Details Motivation: The motivation is to address the limitations of previous shadow removal methods, which often fail to preserve high-fidelity facial details and suffer from issues like color distortion and structure blurring. The authors aim to provide a more robust and generative solution using diffusion-based inpainting. Method: The method involves three main steps: (1) training a shadow-independent structure extraction network to generate a structure map, (2) using this map to train a structure-guided inpainting diffusion model for shadow removal, and (3) refining the result with a detail restoration diffusion model guided by gradients in shadow regions. Result: Extensive experiments show that the proposed method clearly outperforms existing approaches on benchmark datasets, producing high-fidelity results while preserving fine-scale details and avoiding common artifacts. Conclusion: The paper concludes that their proposed structure-guided diffusion approach for portrait shadow removal outperforms existing methods, effectively avoiding common issues like facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Abstract: We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows to generate a shadow-independent structure map including facial details while excluding the unwanted shadow boundaries. The structure map is then used as condition to train a structure-guided inpainting diffusion model for removing shadows in a generative manner. 
Finally, to restore the fine-scale details (e.g., eyelashes, moles and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on the benchmark datasets show that our method clearly outperforms existing methods, and is effective to avoid previously common issues such as facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Our code is available at https://github.com/wanchang-yu/Structure-Guided-Diffusion-for-Portrait-Shadow-Removal.

[332] A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

Zexi Jia,Chuanwei Huang,Hongyan Fei,Yeshuang Zhu,Zhiqiang Yuan,Ying Deng,Jiapei Zhang,Jinchao Zhang,Jie Zhou

Main category: cs.CV

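TL;DR: This paper proposes a novel block-based diffusion approach and an accompanying loss function for automatically generating high-quality counterfactual image-text datasets, improving the visual reasoning ability of vision-language models.

Details Motivation: Vision-language models (VLMs) often struggle with compositional reasoning because high-quality image-text data is lacking. To address this, the paper proposes a new method that automatically creates the needed counterfactual datasets. Method: A block-based diffusion approach uses large language models to identify entities and their spatial relationships, then independently generates image blocks as "puzzle pieces" according to specified compositional rules, automatically producing diverse, high-fidelity counterfactual image-text pairs. A specialized loss function is also introduced that differentiates inter-set from intra-set samples, improving training efficiency and reducing the need for negative samples. Result: The method achieves state-of-the-art visual reasoning performance on multiple benchmarks while substantially reducing the amount of training data required. Conclusion: Experiments show that fine-tuning VLMs with the proposed counterfactual datasets significantly improves visual reasoning performance. The method achieves state-of-the-art results on multiple benchmarks while using far less training data than existing methods. Abstract: Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as "puzzle pieces" coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly improves visual reasoning performance. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.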

[333] Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning

Feng Yue,Zhaoxing Zhang,Junming Jiao,Zhengyu Liang,Shiwen Cao,Feifei Zhang,Rong Shen

Main category: cs.CV

TL;DR: Tempo-R0 is a novel Video Multimodal Large Language Model (Video-MLLM) for Temporal Video Grounding that uses innovative methods like SAA, ETA, and PIR-GRPO to achieve superior performance over existing approaches.

Details Motivation: Temporal Video Grounding (TVG) is challenging due to the large volume of information and redundancy in videos compared to texts or images. The need for comprehensive video understanding to accurately retrieve query-relevant clips drives the development of Tempo-R0. Method: The Tempo-R0 model uses Self-adaptive Attention Allocation (SAA) to optimize attention usage based on frame content variation. It also employs the Explicit Timestamp-modal Aligned (ETA) method to improve event boundary perception. Additionally, Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) is applied to enhance temporal reasoning by distinguishing relevant and irrelevant video-query pairs. Result: Experiments show that Tempo-R0 achieves around 3.5% better performance than current state-of-the-art solutions on both the original QVHighlights testbench and its corrected version with improved annotations. Conclusion: Tempo-R0, a Video-MLLM approach using SAA, ETA, and PIR-GRPO methods, significantly outperforms existing state-of-the-art solutions in temporal video grounding tasks. Abstract: Temporal Video Grounding (TVG), which requires pinpointing relevant temporal segments from video based on language query, has always been a highly challenging task in the field of video understanding. Videos often have a larger volume of information and redundancy than texts or images. Models should present comprehensive understanding of the whole video to accurately retrieve query-relevant clips. We thus propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task via multimodal temporal sensing reinforcement. Specifically, during the preprocessing stage of our pipeline, we employ Self-adaptive Attention Allocation (SAA) method based on frame content variation to efficiently use the MLLM's limited attention. 
The Explicit Timestamp-modal Aligned (ETA) method is also utilized to strengthen our model's capability to perceive the boundaries of events in the video. In the fine-tuning part of our pipeline, we creatively apply Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) in TVG area to foster model's temporal reasoning from not only accepting relevant video-query pairs but also refusing irrelevant ones. Experiments demonstrate that our method accomplishes a notable advantage over SOTA solutions by around 3.5% on both the original QVHighlights testbench and its corrected version with more reasonable ground truth annotations.

[334] Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations

Yuji Wang,Moran Li,Xiaobin Hu,Ran Yi,Jiangning Zhang,Han Feng,Weijian Cao,Yabiao Wang,Chengjie Wang,Lizhuang Ma

Main category: cs.CV

TL;DR: This paper proposes a new identity-preserving text-to-video generation framework that decouples spatial and temporal prompts to improve the quality of generated videos, and it achieved a strong result in the ACM MultiMedia Challenge.

Details Motivation: Current end-to-end identity-preserving text-to-video frameworks face a critical spatial-temporal trade-off: optimizing for spatial coherence often harms temporal smoothness, while pursuing dynamic realism can disrupt the spatial coherence of visual structures. Method: The paper proposes a semantic prompt optimization mechanism and a stage-wise decoupled generation paradigm that decompose the prompt into spatial and temporal components, which guide the text-to-image (T2I) and image-to-video (I2V) stages respectively. Result: Experiments show that the proposed method performs well in spatiotemporal consistency, identity preservation, text relevance, and video quality, validating its effectiveness. Conclusion: The paper proposes a simple yet effective spatial-temporal decoupled framework to address the spatial-temporal trade-off in identity-preserving text-to-video generation, and it took the runner-up position in the 2025 ACM MultiMedia Challenge. Abstract: Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, our paper proposes a semantic prompt optimization mechanism and stage-wise decoupled generation paradigm. The former module decouples the prompt into spatial and temporal components. Aligned with the subsequent stage-wise decoupled approach, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the sequential image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secures the runner-up position in the 2025 ACM MultiMedia Challenge.

[335] Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model

Anbang Wang,Marawan Elbatel,Keyuan Liu,Lizhuo Lin,Meng Lan,Yanqi Yang,Xiaomeng Li

Main category: cs.CV

TL;DR: This paper introduces GeoSapiens, a novel few-shot learning framework for efficient and accurate detection of dental anatomical landmarks on CBCT scans, even with limited annotated data.

Details Motivation: Manual annotation of anatomical landmarks on CBCT scans is time-consuming, labor-intensive, and subject to variability among observers. Automated methods based on deep learning have potential but face challenges due to limited training data and high annotation costs. Method: The study proposes GeoSapiens, a few-shot learning framework incorporating a robust baseline adapted from Sapiens and a novel geometric loss function to enhance the detection of dental landmarks on CBCT scans. Result: GeoSapiens outperformed existing methods, achieving an 8.18% higher success detection rate at a strict 0.5 mm threshold on a dataset of anterior teeth landmarks. Conclusion: GeoSapiens provides a promising solution for dental landmark detection with limited annotated data, offering improved accuracy and efficiency in automated annotation. Abstract: Accurate detection of anatomic landmarks is essential for assessing alveolar bone and root conditions, thereby optimizing clinical outcomes in orthodontics, periodontics, and implant dentistry. Manual annotation of landmarks on cone-beam computed tomography (CBCT) by dentists is time-consuming, labor-intensive, and subject to inter-observer variability. Deep learning-based automated methods present a promising approach to streamline this process efficiently. However, the scarcity of training data and the high cost of expert annotations hinder the adoption of conventional deep learning techniques. To overcome these challenges, we introduce GeoSapiens, a novel few-shot learning framework designed for robust dental landmark detection using limited annotated CBCT of anterior teeth. Our GeoSapiens framework comprises two key components: (1) a robust baseline adapted from Sapiens, a foundational model that has achieved state-of-the-art performance in human-centric vision tasks, and (2) a novel geometric loss function that improves the model's capacity to capture critical geometric relationships among anatomical structures. 
Experiments conducted on our collected dataset of anterior teeth landmarks revealed that GeoSapiens surpassed existing landmark detection methods, outperforming the leading approach by an 8.18% higher success detection rate at a strict 0.5 mm threshold, a standard widely recognized in dental diagnostics. Code is available at: https://github.com/xmed-lab/GeoSapiens.

[336] Unleashing the Power of Neural Collapse: Consistent Supervised-Unsupervised Alignment for Generalized Category Discovery

Jizhou Han,Shaokun Wang,Yuhang He,Chenhao Ding,Qiang Wang,Xinyuan Gao,SongLin Dong,Yihong Gong

Main category: cs.CV

TL;DR: This paper introduces NC-GCD, a new framework for Generalized Category Discovery that improves performance by aligning features into an optimal geometric structure using fixed ETF prototypes and a novel alignment loss.

Details Motivation: Previous GCD methods suffer from inconsistent optimization objectives and category confusion, which degrade performance. This work aims to develop a more coherent and effective approach for GCD. Method: The NC-GCD framework utilizes fixed Equiangular Tight Frame (ETF) prototypes and introduces a Consistent ETF Alignment Loss to unify supervised and unsupervised learning. A Semantic Consistency Matcher (SCM) is also employed to ensure stable label assignments during clustering. Result: The proposed method achieves strong results on multiple GCD benchmarks, particularly improving accuracy on novel categories. Conclusion: The proposed NC-GCD method effectively addresses the challenges in Generalized Category Discovery by ensuring optimal geometric structure and consistent optimization objectives, leading to improved performance on novel categories. Abstract: Generalized Category Discovery (GCD) focuses on classifying known categories while simultaneously discovering novel categories from unlabeled data. However, previous GCD methods face challenges due to inconsistent optimization objectives and category confusion. This leads to feature overlap and ultimately hinders performance on novel categories. To address these issues, we propose the Neural Collapse-inspired Generalized Category Discovery (NC-GCD) framework. By pre-assigning and fixing Equiangular Tight Frame (ETF) prototypes, our method ensures an optimal geometric structure and a consistent optimization objective for both known and novel categories. We introduce a Consistent ETF Alignment Loss that unifies supervised and unsupervised ETF alignment and enhances category separability. Additionally, a Semantic Consistency Matcher (SCM) is designed to maintain stable and consistent label assignments across clustering iterations. Our method achieves strong performance on multiple GCD benchmarks, significantly enhancing novel category accuracy and demonstrating its effectiveness.
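
To make the notion of fixed ETF prototypes concrete, the sketch below builds a simplex Equiangular Tight Frame using the construction common in the neural collapse literature, sqrt(K/(K-1)) * (I - (1/K) * 11^T). Whether NC-GCD uses exactly this construction, and in what embedding dimension, is not stated in the abstract; the K-dimensional embedding and the function names are assumptions of this example.

```python
import math


def simplex_etf(num_classes):
    """Return K unit-norm prototype vectors in R^K with pairwise cosine -1/(K-1)."""
    K = num_classes
    if K < 2:
        raise ValueError("need at least two classes")
    scale = math.sqrt(K / (K - 1))
    # Row i is the i-th prototype: a centered one-hot vector, rescaled to unit norm.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
        for i in range(K)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Every pair of prototypes has the same cosine similarity, -1/(K-1), which is the largest uniform angular separation K unit vectors can have. That is why such prototypes are attractive as fixed targets for both known and novel categories.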

[337] Losing Control: Data Poisoning Attack on Guided Diffusion via ControlNet

Raz Lapid,Almog Dubin

Main category: cs.CV

TL;DR: This paper introduces a new data poisoning method that exploits ControlNets to generate unwanted content without text triggers, revealing vulnerabilities in current systems.

Details Motivation: The motivation behind this research is the increasing use of community-shared data for fine-tuning models and the exposure to stealthy data poisoning attacks due to dependence on large, publicly scraped datasets. Method: The method involves introducing a novel data poisoning technique that manipulates ControlNets to generate specific content without text triggers by injecting poisoned samples into the dataset. Result: The backdoor attack achieves a high success rate while remaining imperceptible in raw inputs on large-scale, high-quality datasets. Conclusion: The paper concludes that there is a critical vulnerability in open-source ControlNets pipelines, highlighting the necessity for robust data sanitization and defense mechanisms. Abstract: Text-to-image diffusion models have achieved remarkable success in translating textual prompts into high-fidelity images. ControlNets further extend these models by allowing precise, image-based conditioning (e.g., edge maps, depth, pose), enabling fine-grained control over structure and style. However, their dependence on large, publicly scraped datasets -- and the increasing use of community-shared data for fine-tuning -- exposes them to stealthy data poisoning attacks. In this work, we introduce a novel data poisoning method that manipulates ControlNets to generate images containing specific content without any text triggers. By injecting poisoned samples -- each pairing a subtly triggered input with an NSFW target -- the model retains clean-prompt fidelity yet reliably produces NSFW outputs when the trigger is present. On large-scale, high-quality datasets, our backdoor achieves high attack success rate while remaining imperceptible in raw inputs. These results reveal a critical vulnerability in open-source ControlNets pipelines and underscore the need for robust data sanitization and defense mechanisms.

[338] An analysis of vision-language models for fabric retrieval

Francesco Giuliari,Asif Khan Pattan,Mohamed Lamine Mekhalfi,Fabio Poiesi

Main category: cs.CV

TL;DR: This paper explores the use of Vision Language Models for zero-shot text-to-image retrieval on fabric samples, introducing an automated annotation pipeline to generate textual descriptions. It finds that structured descriptions improve retrieval accuracy and highlights the need for domain-specific adaptations.

Details Motivation: Effective cross-modal retrieval is crucial for applications like information retrieval and recommendation systems in specialized domains such as manufacturing, where product information often consists of visual samples paired with textual descriptions. The lack of publicly available datasets in this domain motivated the development of an automated annotation pipeline. Method: The paper introduces an automated annotation pipeline using Multimodal Large Language Models (MLLMs) to generate two types of textual descriptions: freeform natural language and structured attribute-based descriptions. These descriptions are used to evaluate retrieval performance across three VLMs—CLIP, LAION-CLIP, and Meta's Perception Encoder—for zero-shot text-to-image retrieval on fabric samples. Result: Experiments showed that structured, attribute-rich descriptions significantly improve retrieval accuracy, particularly for visually complex fabric classes. Meta's Perception Encoder outperformed other models due to its robust feature alignment capabilities. However, zero-shot retrieval remains challenging in this fine-grained domain. Conclusion: Structured, attribute-rich descriptions significantly enhance retrieval accuracy for Vision-Language Models (VLMs) in the fine-grained fabric domain, especially for visually complex fabric classes. However, zero-shot retrieval remains challenging, highlighting the need for domain-adapted approaches. Abstract: Effective cross-modal retrieval is essential for applications like information retrieval and recommendation systems, particularly in specialized domains such as manufacturing, where product information often consists of visual samples paired with a textual description. This paper investigates the use of Vision Language Models(VLMs) for zero-shot text-to-image retrieval on fabric samples. 
We address the lack of publicly available datasets by introducing an automated annotation pipeline that uses Multimodal Large Language Models (MLLMs) to generate two types of textual descriptions: freeform natural language and structured attribute-based descriptions. We produce these descriptions to evaluate retrieval performance across three Vision-Language Models: CLIP, LAION-CLIP, and Meta's Perception Encoder. Our experiments demonstrate that structured, attribute-rich descriptions significantly enhance retrieval accuracy, particularly for visually complex fabric classes, with the Perception Encoder outperforming other models due to its robust feature alignment capabilities. However, zero-shot retrieval remains challenging in this fine-grained domain, underscoring the need for domain-adapted approaches. Our findings highlight the importance of combining technical textual descriptions with advanced VLMs to optimize cross-modal retrieval in industrial applications.

[339] Vision-Language Models Can't See the Obvious

Yasser Dahou,Ngoc Dung Huynh,Phuc H. Le-Khac,Wamiq Reyaz Para,Ankit Singh,Sanath Narayan

Main category: cs.CV

TL;DR: SalBench is a new benchmark for evaluating how well large vision-language models detect salient visual features; results show that even state-of-the-art models have clear limitations.

Details Motivation: To evaluate the ability of Large Vision-Language Models (LVLM) to detect visually salient features that are obvious to humans. Method: The authors developed SalBench, which comprises three new tasks: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. Result: LVLM struggle to identify seemingly obvious visual anomalies; even the advanced GPT-4o achieves only 47.6% accuracy on such a simple task. Conclusion: SalBench will be an important step in measuring LVLM capabilities that align with the subtle definition of human attention. Abstract: We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLM) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones. This benchmark focuses on low-level features including color, intensity, and orientation, which are fundamental to human visual processing. Our SalBench consists of images that highlight rare, unusual, or unexpected elements within scenes, and naturally draw human attention. It comprises three novel tasks for evaluating the perceptual capabilities of LVLM: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. We perform a comprehensive evaluation of state-of-the-art LVLM using SalBench and our findings reveal a surprising limitation: LVLM struggle to identify seemingly obvious visual anomalies, with even the advanced GPT-4o achieving only 47.6% accuracy on such a simple task. SalBench will be an important step in measuring the capabilities of LVLM that align with the subtle definition of human attention.

[340] MatDecompSDF: High-Fidelity 3D Shape and PBR Material Decomposition from Multi-View Images

Chengyu Wang,Isabella Bennett,Henry Scott,Liang Zhang,Mei Chen,Hao Li,Rui Zhao

Main category: cs.CV

TL;DR: MatDecompSDF is a new framework that reconstructs high-quality 3D shapes from multi-view images and decomposes their material properties, with practical value for real applications.

Details Motivation: The core challenge of inverse rendering is disentangling geometry, materials, and illumination from 2D observations, which calls for a method that can handle this ill-posed problem effectively. Method: The method jointly optimizes three neural components: a neural Signed Distance Function (SDF) to represent complex geometry, a spatially-varying neural field to predict PBR material parameters, and an MLP-based model to capture unknown environmental lighting. It introduces physically-based priors and geometric regularizations, and uses a differentiable rendering layer for end-to-end optimization. Result: Experiments show that MatDecompSDF surpasses state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis, and produces editable, relightable assets suitable for digital content creation. Conclusion: MatDecompSDF is a new framework for recovering high-fidelity 3D shapes from multi-view images and decomposing their physically-based material properties. By combining a neural SDF, a spatially-varying neural field, and an MLP-based model, it effectively separates geometry, materials, and lighting, and its advantage is validated on synthetic and real-world datasets. Abstract: We present MatDecompSDF, a novel framework for recovering high-fidelity 3D shapes and decomposing their physically-based material properties from multi-view images. The core challenge of inverse rendering lies in the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Our method addresses this by jointly optimizing three neural components: a neural Signed Distance Function (SDF) to represent complex geometry, a spatially-varying neural field for predicting PBR material parameters (albedo, roughness, metallic), and an MLP-based model for capturing unknown environmental lighting. The key to our approach is a physically-based differentiable rendering layer that connects these 3D properties to the input images, allowing for end-to-end optimization. We introduce a set of carefully designed physical priors and geometric regularizations, including a material smoothness loss and an Eikonal loss, to effectively constrain the problem and achieve robust decomposition. Extensive experiments on both synthetic and real-world datasets (e.g., DTU) demonstrate that MatDecompSDF surpasses state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis. Crucially, our method produces editable and relightable assets that can be seamlessly integrated into standard graphics pipelines, validating its practical utility for digital content creation.
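
The Eikonal loss mentioned in the abstract is a standard regularizer that pushes a learned SDF toward unit gradient norm. The sketch below shows the usual form, the mean of (||grad f(p)|| - 1)^2 over sample points, evaluated on an analytic sphere SDF with central finite differences. In a real training loop the gradient would come from automatic differentiation of the network; the sphere, the finite-difference step, and all function names are assumptions of this example, not details from the paper.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere of the given radius centered at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def numerical_gradient(f, p, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at point p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    return grad


def eikonal_loss(f, points):
    """Mean squared deviation of the gradient norm of f from 1 over the sample points."""
    total = 0.0
    for p in points:
        g = numerical_gradient(f, p)
        total += (math.sqrt(sum(c * c for c in g)) - 1.0) ** 2
    return total / len(points)
```

A true signed distance function has gradient norm 1 almost everywhere, so the loss is near zero for `sphere_sdf` and grows when the field is rescaled or distorted.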

[341] MCFormer: A Multi-Cost-Volume Network and Comprehensive Benchmark for Particle Image Velocimetry

Zicheng Lin,Xiaoqiang Li,Yichao Wang,Chuan Zhu

Main category: cs.CV

TL;DR: This study addresses the lack of a standardized evaluation benchmark in Particle Image Velocimetry (PIV) by creating a large-scale synthetic dataset and developing an efficient new algorithm, MCFormer, marking significant progress in the field.

Details Motivation: Because of limitations in existing datasets and the absence of a standardized benchmark, a comprehensive evaluation of how diverse optical flow models perform on PIV data is urgently needed. Method: A large-scale synthetic PIV benchmark dataset is generated from diverse CFD simulations (JHTDB and Blasius), and MCFormer, a new deep network architecture leveraging multi-frame temporal information and multiple cost volumes, is proposed. Result: The first comprehensive benchmark evaluation of a range of adapted optical flow models shows that MCFormer significantly outperforms existing methods, achieving the lowest overall normalized endpoint error (NEPE). Conclusion: The paper presents a large-scale synthetic PIV benchmark dataset and a new deep learning architecture, MCFormer, that significantly outperforms existing methods, providing a foundational resource and a state-of-the-art method for future PIV research. Abstract: Particle Image Velocimetry (PIV) is fundamental to fluid dynamics, yet deep learning applications face significant hurdles. A critical gap exists: the lack of comprehensive evaluation of how diverse optical flow models perform specifically on PIV data, largely due to limitations in available datasets and the absence of a standardized benchmark. This prevents fair comparison and hinders progress. To address this, our primary contribution is a novel, large-scale synthetic PIV benchmark dataset generated from diverse CFD simulations (JHTDB and Blasius). It features unprecedented variety in particle densities, flow velocities, and continuous motion, enabling, for the first time, a standardized and rigorous evaluation of various optical flow and PIV algorithms. Complementing this, we propose Multi Cost Volume PIV (MCFormer), a new deep network architecture leveraging multi-frame temporal information and multiple cost volumes, specifically designed for PIV's sparse nature. Our comprehensive benchmark evaluation, the first of its kind, reveals significant performance variations among adapted optical flow models and demonstrates that MCFormer significantly outperforms existing methods, achieving the lowest overall normalized endpoint error (NEPE). This work provides both a foundational benchmark resource essential for future PIV research and a state-of-the-art method tailored for PIV challenges. We make our benchmark dataset and code publicly available to foster future research in this area.

[342] Robustifying 3D Perception through Least-Squares Multi-Agent Graphs Object Tracking

Maria Damanaki,Ioulia Kapsali,Nikos Piperigkos,Alexandros Gkillas,Aris S. Lalos

Main category: cs.CV

TL;DR: This paper proposes a framework for mitigating adversarial noise in 3D LiDAR scenes based on multi-agent adversarial bounding boxes, improving robustness through object tracking.

Details Motivation: The perception capabilities of EdgeAI systems such as autonomous vehicles need to be strengthened against adversarial threats. Single-agent tracking lacks situational awareness, so multi-agent cooperation is needed to improve robustness and context understanding. Method: The least-squares graph tool is used to reduce the positional error of each detection's centroid on a fully connected graph via differential coordinates and anchor points, and multi-vehicle detections are fused and refined to mitigate the adversarial impact. Result: Evaluation on the real-world V2V4Real dataset shows that, under challenging adversarial conditions, the proposed method outperforms state-of-the-art single-agent and multi-agent tracking frameworks by up to 23.3%. Conclusion: The method offers a resilient solution that requires no additional defense mechanisms, improving the accuracy and localization of EdgeAI systems in adversarial settings. Abstract: The critical perception capabilities of EdgeAI systems, such as autonomous vehicles, are required to be resilient against adversarial threats, by enabling accurate identification and localization of multiple objects in the scene over time, mitigating their impact. Single-agent tracking offers resilience to adversarial attacks but lacks situational awareness, underscoring the need for multi-agent cooperation to enhance context understanding and robustness. This paper proposes a novel mitigation framework on 3D LiDAR scene against adversarial noise by tracking objects based on least-squares graph on multi-agent adversarial bounding boxes. Specifically, we employ the least-squares graph tool to reduce the induced positional error of each detection's centroid utilizing overlapped bounding boxes on a fully connected graph via differential coordinates and anchor points. Hence, the multi-vehicle detections are fused and refined mitigating the adversarial impact, and associated with existing tracks in two stages performing tracking to further suppress the adversarial threat. An extensive evaluation study on the real-world V2V4Real dataset demonstrates that the proposed method significantly outperforms both state-of-the-art single and multi-agent tracking frameworks by up to 23.3% under challenging adversarial conditions, operating as a resilient approach without relying on additional defense mechanisms.

[343] GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation

Weilin Lai,Tie Xu,Hu Wang

Main category: cs.CV

TL;DR: GraphBrep proposes a novel method for B-Rep generation that reduces computational costs while maintaining CAD quality by explicitly representing and learning compact topology.

Details Motivation: The motivation is to reduce redundancy and computational cost in B-Rep generation by explicitly modeling the joint distribution of misaligned geometry and topology. Method: A graph diffusion model is employed to learn topology conditioned on surface features, using an undirected weighted graph to represent surface topology. Result: Experiments show GraphBrep significantly reduces training and inference times (up to 31.3% and 56.3%, respectively) compared with SOTA methods. Conclusion: GraphBrep effectively reduces computational cost while maintaining high-quality CAD generation by explicitly representing compact topology. Abstract: Direct B-Rep generation is increasingly important in CAD workflows, eliminating costly modeling sequence data and supporting complex features. A key challenge is modeling joint distribution of the misaligned geometry and topology. Existing methods tend to implicitly embed topology into the geometric features of edges. Although this integration ensures feature alignment, it also causes edge geometry to carry more redundant structural information compared to the original B-Rep, leading to significantly higher computational cost. To reduce redundancy, we propose GraphBrep, a B-Rep generation model that explicitly represents and learns compact topology. Following the original structure of B-Rep, we construct an undirected weighted graph to represent surface topology. A graph diffusion model is employed to learn topology conditioned on surface features, serving as the basis for determining connectivity between primitive surfaces. The explicit representation ensures a compact data structure, effectively reducing computational cost during both training and inference. 
Experiments on two large-scale unconditional datasets and one category-conditional dataset demonstrate the proposed method significantly reduces training and inference times (up to 31.3% and 56.3% for given datasets, respectively) while maintaining high-quality CAD generation compared with SOTA.

Zexi Jia,Chuanwei Huang,Yeshuang Zhu,Hongyan Fei,Ying Deng,Zhiqiang Yuan,Jiapei Zhang,Jinchao Zhang,Jie Zhou

Main category: cs.CV

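TL;DR: This study proposes ArtBulb, a new framework for AI art copyright judgment, and releases AICD, the first benchmark dataset for the task, aiming to address the inadequate copyright protection that current legal systems offer AI-generated works.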

Details Motivation: Current legal frameworks lack systematic standards and reliable evaluation methods for copyright protection of AI-generated works, a gap that urgently needs to be addressed. Method: By analyzing legal precedents, the authors establish three core criteria for judging distinctive artistic style and propose ArtBulb, an interpretable and quantifiable framework that combines a multimodal clustering method with multimodal large language models (MLLMs). Result: Experiments show that ArtBulb outperforms existing models in both quantitative and qualitative evaluations. Conclusion: ArtBulb may help bridge the gap between the legal and technological communities and draw greater attention to the societal issue of AI art copyright. Abstract: Current legal frameworks consider AI-generated works eligible for copyright protection when they meet originality requirements and involve substantial human intellectual input. However, systematic legal standards and reliable evaluation methods for AI art copyrights are lacking. Through comprehensive analysis of legal precedents, we establish three essential criteria for determining distinctive artistic style: stylistic consistency, creative uniqueness, and expressive accuracy. To address these challenges, we introduce ArtBulb, an interpretable and quantifiable framework for AI art copyright judgment that combines a novel style description-based multimodal clustering method with multimodal large language models (MLLMs). We also present AICD, the first benchmark dataset for AI art copyright annotated by artists and legal experts. Experimental results demonstrate that ArtBulb outperforms existing models in both quantitative and qualitative evaluations. Our work aims to bridge the gap between the legal and technological communities and bring greater attention to the societal issue of AI art copyrights.

[345] Model Compression using Progressive Channel Pruning

Jinyang Guo,Weichen Zhang,Wanli Ouyang,Dong Xu

Main category: cs.CV

TL;DR: This paper introduces a new iterative channel pruning framework (PCP) for accelerating CNNs, which selectively prunes channels with minimal impact on accuracy and performs well in both supervised and transfer learning scenarios.

Details Motivation: Existing channel pruning methods typically prune each layer only once, which may lead to significant accuracy drops. The proposed method aims to address this limitation by progressively and selectively pruning channels across multiple layers. Method: Progressive Channel Pruning (PCP) uses an attempting-selecting-pruning pipeline to iteratively prune a small number of channels from selected layers based on estimated accuracy drops. It also incorporates labelled and pseudo-labelled samples for deep transfer learning scenarios. Result: Experiments show that the PCP framework achieves better performance than existing channel pruning approaches on benchmark datasets under both supervised learning and transfer learning settings. Conclusion: The proposed PCP framework effectively accelerates CNNs by iteratively pruning channels while maintaining accuracy, and it outperforms existing methods in both supervised and transfer learning settings. Abstract: In this work, we propose a simple but effective channel pruning framework called Progressive Channel Pruning (PCP) to accelerate Convolutional Neural Networks (CNNs). In contrast to the existing channel pruning methods that prune channels only once per layer in a layer-by-layer fashion, our new progressive framework iteratively prunes a small number of channels from several selected layers, which consists of a three-step attempting-selecting-pruning pipeline in each iteration. In the attempting step, we attempt to prune a pre-defined number of channels from one layer by using any existing channel pruning methods and estimate the accuracy drop for this layer based on the labelled samples in the validation set. In the selecting step, based on the estimated accuracy drops for all layers, we propose a greedy strategy to automatically select a set of layers that will lead to less overall accuracy drop after pruning these layers. 
In the pruning step, we prune a small number of channels from these selected layers. We further extend our PCP framework to prune channels for the deep transfer learning methods like Domain Adversarial Neural Network (DANN), in which we effectively reduce the data distribution mismatch in the channel pruning process by using both labelled samples from the source domain and pseudo-labelled samples from the target domain. Our comprehensive experiments on two benchmark datasets demonstrate that our PCP framework outperforms the existing channel pruning approaches under both supervised learning and transfer learning settings.
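The attempting-selecting-pruning loop described in this abstract can be sketched on a toy model in which each layer is just a channel count. The callback `estimate_drop`, the `SENSITIVITY` table, and the stopping rule `target_total` are hypothetical stand-ins. In the paper, the accuracy drop is estimated by actually pruning with an existing channel pruning method and evaluating on validation samples.

```python
def progressive_channel_pruning(channels, estimate_drop, step,
                                layers_per_iter, target_total):
    """Toy attempting-selecting-pruning loop (illustrative, not the authors' code).

    channels:      dict mapping layer name -> current channel count
    estimate_drop: hypothetical callback (name, count, step) -> estimated
                   accuracy drop if `step` channels were pruned from that layer
    """
    channels = dict(channels)
    while sum(channels.values()) > target_total:
        # Attempting: estimate the accuracy drop for every prunable layer.
        drops = {
            name: estimate_drop(name, count, step)
            for name, count in channels.items()
            if count > step  # keep at least one channel in every layer
        }
        if not drops:
            break
        # Selecting: greedily take the layers with the smallest estimated drop.
        selected = sorted(drops, key=drops.get)[:layers_per_iter]
        # Pruning: remove a small number of channels from the selected layers.
        for name in selected:
            channels[name] -= step
    return channels


# Made-up per-layer sensitivities, used only to drive the toy example.
SENSITIVITY = {"conv1": 1.0, "conv2": 0.1, "conv3": 0.3}


def toy_drop(name, count, step):
    # Assumption: pruning hurts more as a layer gets narrower.
    return SENSITIVITY[name] * step / count


result = progressive_channel_pruning(
    {"conv1": 8, "conv2": 8, "conv3": 8},
    toy_drop, step=2, layers_per_iter=1, target_total=16,
)
```

Because the estimates are recomputed every iteration, the loop can move on to a different layer once an initially cheap layer has become narrow. This is the behaviour the progressive scheme is meant to allow, in contrast to pruning each layer only once.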

[346] PointGAC: Geometric-Aware Codebook for Masked Point Cloud Modeling

Abiao Li,Chenlei Lv,Yuming Fang,Yifan Zuo,Jian Zhang,Guofeng Mei

Main category: cs.CV

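TL;DR: This paper proposes PointGAC, which improves masked point cloud modeling with clustering and a teacher-student framework, enhancing feature generalization and learning efficiency.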

Details Motivation: Existing MPM methods over-constrain the model to learn the details of masked regions and struggle to capture generalized features, so a new approach is needed to improve feature representation. Method: A clustering-based MPM method is proposed, comprising geometry-aware extraction of initial patches, an online k-means codebook update, and a strategy in which the student model aligns cluster assignments for masked features. Result: Experiments validate the effectiveness of PointGAC on various downstream tasks, and the code is open-sourced. Conclusion: Through an online codebook-guided teacher-student framework, PointGAC improves the generalization of feature representations in masked point cloud modeling, and its codebook maintenance mechanism increases the efficiency of semantic feature learning. Abstract: Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinate or feature of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose PointGAC, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. Firstly, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means based on features extracted from the complete patches. This procedure facilitates codebook vectors to become cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from a proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at https://github.com/LAB123-tech/PointGAC
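The online k-means codebook update mentioned in this abstract can be illustrated with the standard running-mean (mini-batch) k-means step. This is a minimal sketch under that assumption. The function name `online_kmeans_update` and the per-code counters are illustrative, and the paper's own update rule and codebook maintenance mechanism may differ.

```python
import numpy as np


def online_kmeans_update(codebook, counts, features):
    """One online k-means step; updates `codebook` and `counts` in place.

    codebook: (K, D) float array of code vectors (cluster centers)
    counts:   (K,) integer array, number of features absorbed by each code
    features: (N, D) batch of patch features (e.g. from the teacher)
    Returns the hard assignment of every feature to its nearest code.
    """
    # Squared Euclidean distance between every feature and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = dists.argmin(axis=1)

    for k in np.unique(assign):
        members = features[assign == k]
        counts[k] += len(members)
        # Running mean: move the code toward the batch mean of its members,
        # weighted by the share of this batch in everything seen so far.
        rate = len(members) / counts[k]
        codebook[k] += rate * (members.mean(axis=0) - codebook[k])
    return assign
```

After such an update each code vector equals the mean of all features assigned to it so far, which is one way for codebook vectors to become cluster centers as the abstract describes. The returned assignments would then serve as targets for the student on masked patches.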

[347] UDF-GMA: Uncertainty Disentanglement and Fusion for General Movement Assessment

Zeqi Luo,Ali Gooya,Edmond S. L. Ho

Main category: cs.CV

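TL;DR: This study proposes UDF-GMA, an automated GMA method that disentangles and fuses model uncertainties and performs well in the early detection of brain dysfunction.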

Details Motivation: Mainstream pose-based automated GMA methods are prone to uncertainty because of limited high-quality data and noisy pose estimation, which undermines their clinical reliability, so reliable uncertainty measures are needed. Method: The UDF-GMA framework estimates epistemic uncertainty through Bayesian approximation, models aleatoric uncertainty directly, and fuses both with the embedded motion representation to enhance class separation. Result: Extensive experiments on the Pmi-GMA benchmark dataset show that the proposed method is effective and generalizes well in predicting poor repertoire. Conclusion: By explicitly modelling epistemic uncertainty in model parameters and aleatoric uncertainty from data noise, UDF-GMA improves the clinical reliability of pose-based automated GMA. Abstract: General movement assessment (GMA) is a non-invasive tool for the early detection of brain dysfunction through the qualitative assessment of general movements, and the development of automated methods can broaden its application. However, mainstream pose-based automated GMA methods are prone to uncertainty due to limited high-quality data and noisy pose estimation, hindering clinical reliability without reliable uncertainty measures. In this work, we introduce UDF-GMA which explicitly models epistemic uncertainty in model parameters and aleatoric uncertainty from data noise for pose-based automated GMA. UDF-GMA effectively disentangles uncertainties by directly modelling aleatoric uncertainty and estimating epistemic uncertainty through Bayesian approximation. We further propose fusing these uncertainties with the embedded motion representation to enhance class separation. Extensive experiments on the Pmi-GMA benchmark dataset demonstrate the effectiveness and generalisability of the proposed approach in predicting poor repertoire.
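One common way to separate the two kinds of uncertainty named in this abstract is to run several stochastic forward passes (for example with dropout left on, as a Bayesian approximation) through a model that also predicts a variance. The spread of the predicted means then approximates epistemic uncertainty, and the average predicted variance approximates aleatoric uncertainty. The sketch below assumes that setup. The function names and the fusion by concatenation are illustrative choices, not the authors' formulation.

```python
import numpy as np


def decompose_uncertainty(means, variances):
    """Split predictive uncertainty from T stochastic forward passes.

    means:     (T, C) predicted means, one row per pass
    variances: (T, C) predicted (aleatoric) variances, one row per pass
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    epistemic = means.var(axis=0)       # disagreement between passes
    aleatoric = variances.mean(axis=0)  # average predicted data noise
    return epistemic, aleatoric


def fuse_with_embedding(embedding, epistemic, aleatoric):
    # Assumption: fusion by simple concatenation before the classifier head.
    return np.concatenate([np.asarray(embedding, dtype=float), epistemic, aleatoric])
```

A classifier head fed with the fused vector can then condition its decision on how uncertain the pose-based features are, which is the role the abstract assigns to the fusion step.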

[348] SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions

Mengwei Xie,Shuang Zeng,Xinyuan Chang,Xinran Liu,Zheng Pan,Mu Xu,Xing Wei

Main category: cs.CV

TL;DR: SeqGrowGraph is a novel framework for modeling lane topology by incrementally expanding graphs, using a transformer model to achieve excellent results on autonomous driving datasets.

Details Motivation: Traditional methods struggle with modeling complex, non-linear lane structures like loops and bidirectional lanes, which are common in real-world scenarios. Method: The paper proposes SeqGrowGraph, which models lane topology through incremental graph expansion, using a transformer model to predict expansions in a depth-first search order. Result: SeqGrowGraph achieves state-of-the-art performance on the nuScenes and Argoverse 2 datasets. Conclusion: SeqGrowGraph offers a new and effective way to model complex lane topologies, achieving state-of-the-art results on datasets nuScenes and Argoverse 2. Abstract: Accurate lane topology is essential for autonomous driving, yet traditional methods struggle to model the complex, non-linear structures, such as loops and bidirectional lanes, prevalent in real-world road structure. We present SeqGrowGraph, a novel framework that learns lane topology as a chain of graph expansions, inspired by human map-drawing processes. Representing the lane graph as a directed graph $G=(V,E)$, with intersections ($V$) and centerlines ($E$), SeqGrowGraph incrementally constructs this graph by introducing one vertex at a time. At each step, an adjacency matrix ($A$) expands from $n \times n$ to $(n+1) \times (n+1)$ to encode connectivity, while a geometric matrix ($M$) captures centerline shapes as quadratic Bézier curves. The graph is serialized into sequences, enabling a transformer model to autoregressively predict the chain of expansions, guided by a depth-first search ordering. Evaluated on nuScenes and Argoverse 2 datasets, SeqGrowGraph achieves state-of-the-art performance.
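The two data structures named in this abstract, an adjacency matrix that grows from $n \times n$ to $(n+1) \times (n+1)$ and quadratic Bézier centerlines, can be sketched directly. The helper names and the 0/1 edge encoding below are assumptions for illustration. The paper's serialization and the exact contents of the geometric matrix $M$ are not reproduced here.

```python
import numpy as np


def expand_adjacency(adj, incoming, outgoing):
    """Add one vertex to a directed lane graph.

    adj:      (n, n) adjacency matrix, adj[i, j] = 1 for an edge i -> j
    incoming: existing vertices with an edge into the new vertex
    outgoing: existing vertices reached by an edge from the new vertex
    """
    n = adj.shape[0]
    grown = np.zeros((n + 1, n + 1), dtype=adj.dtype)
    grown[:n, :n] = adj                                # keep the old graph
    grown[np.asarray(incoming, dtype=int), n] = 1      # new column: edges in
    grown[n, np.asarray(outgoing, dtype=int)] = 1      # new row: edges out
    return grown


def quadratic_bezier(p0, p1, p2, num=5):
    """Sample B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2 at `num` points in [0, 1]."""
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    t = np.linspace(0.0, 1.0, num)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2
```

Each expansion step only adds one row and one column, so the sequence a transformer has to predict per step stays short: the new connectivity entries plus the control points of the new centerlines.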

[349] RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction

Johannes Künzel,Anna Hilsmann,Peter Eisert

Main category: cs.CV

TL;DR: This paper proposes RIPE, a reinforcement learning-based framework for weakly-supervised keypoint extraction and description, requiring only binary labels for training.

Details Motivation: To overcome the limitations of conventional methods that rely on artificial transformations, pre-generated models, or 3D data, by using minimal supervision to expand training data availability. Method: RIPE uses intermediate encoder layers with a hyper-column approach for multi-scale keypoint description and includes an auxiliary loss to enhance descriptor discrimination. Result: RIPE achieves competitive performance compared to state-of-the-art methods while simplifying data preparation, demonstrating robustness and generalization capability. Conclusion: RIPE represents a significant advancement in keypoint extraction and description by enabling weakly-supervised learning with minimal training data requirements. Abstract: We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at https://github.com/fraunhoferhhi/RIPE.
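The hyper-column idea mentioned in this abstract, describing a keypoint with features taken from several encoder layers at different scales, can be sketched as follows. This is a minimal illustration using nearest-cell sampling. The function name, the (C, h, w) layout, and the sampling scheme are assumptions and do not come from the RIPE code base.

```python
import numpy as np


def hypercolumn_descriptor(feature_maps, keypoint_xy, image_size):
    """Concatenate features from several layers at one keypoint.

    feature_maps: list of (C_i, h_i, w_i) arrays from intermediate layers
    keypoint_xy:  (x, y) keypoint position in image pixels
    image_size:   (width, height) of the input image
    """
    x, y = keypoint_xy
    width, height = image_size
    parts = []
    for fmap in feature_maps:
        _, h, w = fmap.shape
        # Map the pixel position to the nearest cell of this (coarser) map.
        col = min(int(x * w / width), w - 1)
        row = min(int(y * h / height), h - 1)
        parts.append(fmap[:, row, col])
    # One descriptor that mixes fine and coarse information.
    return np.concatenate(parts)
```

A production implementation would more likely use bilinear interpolation so that descriptors vary smoothly with sub-pixel keypoint positions. Nearest-cell sampling is used here only to keep the example short.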

[350] CMET: Clustering guided METric for quantifying embedding quality

Sourav Ghosh,Chayan Maitra,Rajat K. De

Main category: cs.CV

TL;DR: This paper proposes CMET, a novel and efficient metric for evaluating data embedding quality by measuring local and global structure preservation, showing superior performance over existing techniques.

Details Motivation: The motivation stems from the need for an efficient and effective way to determine if transformed embeddings preserve the original data's structure, as existing methods are computationally expensive. Method: The study introduces CMET, a new metric comprising two scores (CMET_L and CMET_G) to evaluate local and global structure preservation in embeddings. It tests CMET on multiple datasets including synthetic, biological, and image data. Result: Results show that CMET outperforms state-of-the-art methods in performance while offering scalability, stability, and efficiency across different datasets and hyper-parameter choices. Conclusion: The paper concludes that CMET is a reliable and efficient metric for assessing the quality of data embeddings, capable of handling various data sizes and types with low computational complexity. Abstract: Due to rapid advancements in technology, datasets are available from various domains. In order to carry out more relevant and appropriate analysis, it is often necessary to project the dataset into a higher or lower dimensional space based on requirement. Projecting the data in a higher-dimensional space helps in unfolding intricate patterns, enhancing the performance of the underlying models. On the other hand, dimensionality reduction is helpful in denoising data while capturing maximal information, as well as reducing execution time and memory. In this context, it is not always statistically evident whether the transformed embedding retains the local and global structure of the original data. Most of the existing metrics that are used for comparing the local and global shape of the embedding against the original one are highly expensive in terms of time and space complexity. In order to address this issue, the objective of this study is to formulate a novel metric, called Clustering guided METric (CMET), for quantifying embedding quality. 
It is effective to serve the purpose of quantitative comparison between an embedding and the original data. CMET consists of two scores, viz., CMET_L and CMET_G, that measure the degree of local and global shape preservation capability, respectively. The efficacy of CMET has been demonstrated on a wide variety of datasets, including four synthetic, two biological, and two image datasets. Results reflect the favorable performance of CMET against the state-of-the-art methods. Capability to handle both small and large data, low algorithmic complexity, better and stable performance across all kinds of data, and different choices of hyper-parameters feature CMET as a reliable metric.

[351] Efficient SAR Vessel Detection for FPGA-Based On-Satellite Sensing

Colin Laganier,Liam Fletcher,Elim Kwan,Richard Walters,Victoria Nockles

Main category: cs.CV

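TL;DR: This paper presents an efficient synthetic aperture radar (SAR) vessel detection model. A customised YOLOv8 architecture enables FPGA-based processing within common satellite power constraints (<10W) and achieves performance close to state-of-the-art GPU-based models, paving the way for more autonomous and scalable Earth observation systems.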

Details Motivation: As modern satellites generate ever-larger volumes of data, rapid analysis of satellite data is increasingly important for many remote sensing applications but increasingly hard to achieve. Method: A model optimized for FPGA-based processing is developed using a customised YOLOv8 architecture and deployed on a Kria KV260 MPSoC. Result: The FPGA-based model's detection and classification performance is only about 2% and 3% lower than that of state-of-the-art GPU-based models, despite being two to three orders of magnitude smaller. Conclusion: The paper demonstrates an efficient, highly performant SAR vessel detection model, paving the way for more autonomous, responsive, and scalable Earth observation systems. Abstract: Rapid analysis of satellite data is vital for many remote sensing applications, from disaster response to environmental monitoring, but is becoming harder to achieve with the increasing volumes of data generated by modern satellites. On-satellite machine learning (ML) offers a potential solution, by reducing latency associated with transmission of these large data volumes to ground stations, but state-of-the-art models are often too large or power-hungry for satellite deployment. Vessel detection using Synthetic Aperture Radar (SAR) is a critical time-sensitive task for maritime security that exemplifies this challenge. SAR vessel detection has previously been demonstrated only by ML models that either are too large for satellite deployment, have not been developed for sufficiently low-power hardware, or have only been developed and tested on small SAR datasets that do not sufficiently represent the real-world task. Here we address this issue by developing and deploying a new efficient and highly performant SAR vessel detection model, using a customised YOLOv8 architecture specifically optimized for FPGA-based processing within common satellite power constraints (<10W). We train and evaluate our model on the largest and most diverse open SAR vessel dataset, xView3-SAR, and deploy it on a Kria KV260 MPSoC. We show that our FPGA-based model has detection and classification performance only ~2% and 3% lower than values from state-of-the-art GPU-based models, despite being two to three orders of magnitude smaller in size. This work demonstrates small yet highly performant ML models for time-critical SAR analysis, paving the way for more autonomous, responsive, and scalable Earth observation systems.

[352] Semantically Consistent Discrete Diffusion for 3D Biological Graph Modeling

Chinmay Prabhakar,Suprosanna Shit,Tamaz Amiranashvili,Hongwei Bran Li,Bjoern Menze

Main category: cs.CV

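TL;DR: This study develops a new 3D biological graph generation technique that addresses the limitations of existing methods in anatomical validity and shows improved performance in several applications.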

Details Motivation: Existing diffusion-based methods struggle to maintain anatomical validity when generating 3D biological graphs, so a new approach is needed. Method: A novel projection operator stochastically fixes inconsistencies during sampling, and an edge-deletion-based noising procedure is adopted to suit sparse biological graphs. Result: The proposed method shows superior performance on two real-world datasets, the human circle of Willis and lung airways, and the generated samples significantly improve downstream graph labeling performance. Conclusion: The study proposes a new 3D biological graph generation method that satisfies structural and semantic plausibility conditions and outperforms existing methods on real-world datasets. Abstract: 3D spatial graphs play a crucial role in biological and clinical research by modeling anatomical networks such as blood vessels, neurons, and airways. However, generating 3D biological graphs while maintaining anatomical validity remains challenging, a key limitation of existing diffusion-based methods. In this work, we propose a novel 3D biological graph generation method that adheres to structural and semantic plausibility conditions. We achieve this by using a novel projection operator during sampling that stochastically fixes inconsistencies. Further, we adopt a superior edge-deletion-based noising procedure suitable for sparse biological graphs. Our method demonstrates superior performance on two real-world datasets, human circle of Willis and lung airways, compared to previous approaches. Importantly, we demonstrate that the generated samples significantly enhance downstream graph labeling performance. Furthermore, we show that our generative model is a reasonable out-of-the-box link predictor.
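An edge-deletion-based noising step, as named in this abstract, can be pictured as a forward process that only ever removes edges, so a sparse graph stays sparse instead of drifting toward a dense random graph. The sketch below is one simple reading of that idea for an undirected graph. The function name, the per-edge independent survival probability `keep_prob`, and the symmetric adjacency representation are all assumptions, and the paper's actual noise schedule is not reproduced.

```python
import numpy as np


def delete_edges(adj, keep_prob, rng):
    """One forward noising step that can delete edges but never add them.

    adj:       (n, n) symmetric 0/1 adjacency matrix without self-loops
    keep_prob: probability that each existing edge survives this step
    rng:       a numpy.random.Generator
    """
    adj = np.asarray(adj)
    upper = np.triu(adj, k=1)                    # each undirected edge once
    keep = rng.random(upper.shape) < keep_prob   # independent survival draws
    kept = upper * keep
    return kept + kept.T                         # restore symmetry
```

Applying the step repeatedly drives the graph toward the empty graph, which would then be the natural starting point for the reverse (generative) process under this reading.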

[353] HGNet: High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention Network for Colorectal Polyp Detection

Xiaofang Liu,Lingling Sun,Xuqing Zhang,Yuannong Ye,Bin zhao

Main category: cs.CV

TL;DR: HGNet, a new model for detecting colorectal cancer, integrates hypergraph and attention modules to improve lesion detection and interpretability.

Details Motivation: Current models struggle with detecting small lesions, accurately localizing boundaries, and providing interpretable decisions in the context of colorectal cancer. Method: Development of HGNet incorporating an Efficient Multi-Scale Context Attention module, spatial hypergraph convolution module, transfer learning for data scarcity, and Eigen Class Activation Map for visualization. Result: HGNet achieves 94% accuracy, 90.6% recall, and 90% mAP@0.5, showing significant improvement in small lesion differentiation and clinical interpretability. Conclusion: HGNet is effective in addressing limitations of current models in colorectal cancer detection and offers enhanced performance and interpretability. Abstract: Colorectal cancer (CRC) is closely linked to the malignant transformation of colorectal polyps, making early detection essential. However, current models struggle with detecting small lesions, accurately localizing boundaries, and providing interpretable decisions. To address these issues, we propose HGNet, which integrates High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention. Key innovations include: (1) an Efficient Multi-Scale Context Attention (EMCA) module to enhance lesion feature representation and boundary modeling; (2) the deployment of a spatial hypergraph convolution module before the detection head to capture higher-order spatial relationships between nodes; (3) the application of transfer learning to address the scarcity of medical image data; and (4) Eigen Class Activation Map (Eigen-CAM) for decision visualization. Experimental results show that HGNet achieves 94% accuracy, 90.6% recall, and 90% mAP@0.5, significantly improving small lesion differentiation and clinical interpretability. The source code will be made publicly available upon publication of this paper.

[354] HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding

Yuxuan Cai,Jiangning Zhang,Zhenye Gan,Qingdong He,Xiaobin Hu,Junwei Zhu,Yabiao Wang,Chengjie Wang,Zhucun Xue,Xinwei He,Xiang Bai

Main category: cs.CV

TL;DR: This paper proposes HV-MMBench, a benchmark for more holistic evaluation of MLLMs in human-centric video understanding.

Details Motivation: Existing human-centric video benchmarks mainly emphasize video generation quality and action recognition while overlooking the perceptual and cognitive abilities required in human-centered scenarios. They are also often limited by single-question paradigms and overly simplistic evaluation metrics. Method: A new benchmark, HV-MMBench, is proposed. It comprises 15 tasks and multiple data types, covers 50 distinct visual scenarios, and includes video durations from short-term to long-term. Result: Compared with existing human-centric video benchmarks, HV-MMBench provides a more comprehensive evaluation of model capabilities and supports systematic analysis of models' temporal reasoning across different context lengths. Conclusion: HV-MMBench provides a benchmark for holistic evaluation of MLLMs in human-centric video understanding, featuring diverse evaluation dimensions, varied data types, multi-domain video coverage, and temporal coverage. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. 
Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 min) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.

[355] Leveraging Self-Supervised Features for Efficient Flooded Region Identification in UAV Aerial Images

Dibyabha Deb,Ujjwal Verma

Main category: cs.CV

TL;DR: This paper explores using self-supervised features from DINOv2 for flood region identification in UAV images, reducing manual labeling needs while achieving high segmentation accuracy.

Details Motivation: Traditional supervised learning methods for post-disaster damage assessment rely on manually annotated datasets, which are time-consuming and error-prone. This work aims to reduce dependency on such datasets by leveraging self-supervised learning. Method: The paper proposes two encoder-decoder-based segmentation approaches that integrate visual features learned from the DINOv2 model with traditional encoder backbones to segment UAV aerial images. Result: The results show that DINOv2's self-supervised pretraining on natural images generates transferable features that perform well in segmenting aerial images despite differences in image perspectives. Conclusion: The study concludes that self-supervised features from the DINOv2 model can be effectively used for identifying flooded regions in UAV aerial images, reducing reliance on manual annotation and enabling accurate segmentation with limited labeled data. Abstract: Identifying regions affected by disasters is a vital step in effectively managing and planning relief and rescue efforts. Unlike the traditional approaches of manually assessing post-disaster damage, analyzing images of Unmanned Aerial Vehicles (UAVs) offers an objective and reliable way to assess the damage. In the past, segmentation techniques have been adopted to identify post-flood damage in UAV aerial images. However, most of these supervised learning approaches rely on manually annotated datasets. Indeed, annotating images is a time-consuming and error-prone task that requires domain expertise. This work focuses on leveraging self-supervised features to accurately identify flooded regions in UAV aerial images. This work proposes two encoder-decoder-based segmentation approaches, which integrate the visual features learned from DINOv2 with the traditional encoder backbone. This study investigates the generalization of self-supervised features for UAV aerial images. 
Specifically, we evaluate the effectiveness of features from the DINOv2 model, trained on non-aerial images, for segmenting aerial images, noting the distinct perspectives between the two image types. Our results demonstrate that DINOv2's self-supervised pretraining on natural images generates transferable, general-purpose visual features that streamline the development of aerial segmentation workflows. By leveraging these features as a foundation, we significantly reduce reliance on labor-intensive manual annotation processes, enabling high-accuracy segmentation with limited labeled aerial data.

[356] RainShift: A Benchmark for Precipitation Downscaling Across Geographies

Paula Harder,Luca Schmidt,Francis Pelletier,Nicole Ludwig,Matthew Chantry,Christian Lessig,Alex Hernandez-Garcia,David Rolnick

Main category: cs.CV

TL;DR: This paper introduces RainShift, a benchmark for evaluating deep learning-based downscaling methods' ability to generalize across geographic regions. It finds that while broader training helps, additional strategies like data alignment are needed to address regional disparities.

Details Motivation: Earth System Models (ESMs) are crucial for projecting climate change impacts but lack sufficient resolution for local risk assessments. Deep learning-based super-resolution models offer a solution but often require retraining for different regions due to climatic variations, necessitating an evaluation of their geographic generalizability. Method: The authors introduced RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. They assessed state-of-the-art approaches such as GANs and diffusion models for their ability to generalize across data gaps between the Global North and South. Result: The research revealed significant performance drops in out-of-distribution regions depending on the model and geographic area. Expanding the training domain generally improved generalization but was insufficient for overcoming differences between geographically distinct areas. Conclusion: The study concludes that while expanding the training domain helps, it is not sufficient to overcome geographic distribution shifts. Addressing these shifts through techniques like data alignment can enhance spatial generalization, advancing the global applicability of downscaling methods and reducing inequities in access to high-resolution climate information. Abstract: Earth System Models (ESM) are our main tool for projecting the impacts of climate change. However, running these models at sufficient resolution for local-scale risk assessments is not computationally feasible. Deep learning-based super-resolution models offer a promising solution to downscale ESM outputs to higher resolutions by learning from data. Yet, due to regional variations in climatic processes, these models typically require retraining for each geographical area, demanding high-resolution observational data, which is unevenly available across the globe. This highlights the need to assess how well these models generalize across geographic regions. 
To address this, we introduce RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. We evaluate state-of-the-art downscaling approaches including GANs and diffusion models in generalizing across data gaps between the Global North and Global South. Our findings reveal substantial performance drops in out-of-distribution regions, depending on model and geographic area. While expanding the training domain generally improves generalization, it is insufficient to overcome shifts between geographically distinct regions. We show that addressing these shifts through, for example, data alignment can improve spatial generalization. Our work advances the global applicability of downscaling methods and represents a step toward reducing inequities in access to high-resolution climate information.

[357] DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

Yecheng Wu,Junyu Chen,Zhuoyang Zhang,Enze Xie,Jincheng Yu,Junsong Chen,Jinyi Hu,Yao Lu,Song Han,Han Cai

Main category: cs.CV

TL;DR: This paper proposes DC-AR, a new masked autoregressive text-to-image generation framework with improved image generation quality and computational efficiency compared to existing models.

Details Motivation: Prior masked AR models have lagged behind diffusion models in terms of quality or efficiency due to the limitations of tokenizers. The authors aim to overcome this limitation. Method: The authors introduce DC-HT, a deep compression hybrid tokenizer for AR models, and build upon MaskGIT to create a hybrid masked autoregressive image generation framework that generates structural elements through discrete tokens and applies refinements via residual tokens. Result: DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models. Conclusion: DC-AR is a novel masked autoregressive text-to-image generation framework that outperforms existing models in both image generation quality and computational efficiency. Abstract: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.

[358] Boosting Temporal Sentence Grounding via Causal Inference

Kefan Tang,Lihuo He,Jisheng Dang,Xinbo Gao

Main category: cs.CV

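TL;DR: This study addresses spurious correlations in Temporal Sentence Grounding (TSG) by proposing a new method that combines causal intervention with counterfactual reasoning, improving model reliability and generalization.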

Details Motivation: Existing Temporal Sentence Grounding (TSG) studies often overlook spurious correlations between video and textual queries. These correlations stem from inherent biases in the textual data and from the model's overfitting to salient or repetitive patterns in video content, leading to unreliable predictions and poor generalization. Method: The study formulates the TSG task from a causal perspective using a structural causal model and proposes textual causal intervention and visual counterfactual reasoning, using do-calculus to estimate causal effects and removing the influence of video features to eliminate bias. Result: Experiments show that the proposed causal intervention and counterfactual reasoning method achieves superior performance on public datasets, improving robustness and reducing the impact of spurious correlations. Conclusion: The paper proposes a new causal-inference-based TSG framework to address spurious correlations between video and textual queries and demonstrates its superiority on public datasets. Abstract: Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Although existing studies have made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning, that utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. 
The code is available at https://github.com/Tangkfan/CICR.
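
The counterfactual step described in the abstract (removing the video-only influence from the overall effect) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-moment score lists, and the scaling factor `alpha` are hypothetical, and the simple subtraction is only one common way to realize this kind of debiasing.

```python
def debias_scores(fused_scores, video_only_scores, alpha=1.0):
    """Subtract the video-only (counterfactual) branch from the full prediction.

    fused_scores: per-moment scores from the full model (query + video).
    video_only_scores: per-moment scores from a branch that sees only the video.
    alpha: hypothetical scaling factor for the subtracted video-only effect.
    """
    if len(fused_scores) != len(video_only_scores):
        raise ValueError("score lists must have the same length")
    return [f - alpha * v for f, v in zip(fused_scores, video_only_scores)]


def best_moment(scores):
    """Return the index of the highest-scoring candidate moment."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: moment 1 looks best only because the video alone favours it.
fused = [2.0, 3.5, 3.0]
video_only = [0.5, 2.5, 0.5]
debiased = debias_scores(fused, video_only)
```

In this toy input the full model prefers moment 1, while the debiased scores prefer moment 2, because most of moment 1's score is explained by the video-only branch.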

[359] Hear-Your-Click: Interactive Video-to-Audio Generation via Object-aware Contrastive Audio-Visual Fine-tuning

Yingshan Liang,Keyu Fan,Zhicheng Du,Yiran Wang,Qingyang Shi,Xinyu Zhang,Jiasheng Lu,Peiwu Qin

Main category: cs.CV

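TL;DR: This paper proposes Hear-Your-Click, an interactive video-to-audio generation framework that generates sound for specific objects selected by clicking on a video frame, giving more precise control and better generation performance.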

Details Motivation: Current video-to-audio generation methods rely on global video information, perform poorly in complex scenes, and struggle to generate audio for specific objects or regions in a video. The authors propose the Hear-Your-Click framework to address this. Method: The paper introduces Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) combined with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with the corresponding audio segments. It also designs two data augmentation strategies, Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), to increase the model's sensitivity to segmented objects, and proposes a new evaluation metric, the CAV score, to measure audio-visual correspondence. Result: Extensive experiments show that the framework offers more precise control and improved generation performance across various metrics. Conclusion: The Hear-Your-Click framework effectively enables object-specific video-to-audio generation, improving on the precision and applicability of existing methods. Abstract: Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods, which rely on global video information, struggle with complex scenes and often fail to generate audio tailored to specific objects or regions in the videos. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework that enables users to generate sounds for specific objects in the videos by simply clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with corresponding audio segments. Furthermore, we tailor two data augmentation strategies: Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), aimed at enhancing the model's sensitivity to the segmented objects. To effectively measure the audio-visual correspondence, we design a new evaluation metric, the CAV score. Extensive experiments demonstrate that our framework offers more precise control and improved generation performance across various metrics. Project Page: https://github.com/SynapGrid/Hear-Your-Click

[360] InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior

Minghao Wen,Shengjie Wu,Kangkan Wang,Dong Liang

Main category: cs.CV

TL;DR: InterGSEdit improves 3D Gaussian Splatting editing by enhancing semantic consistency and user interaction through a novel attention fusion approach.

Details Motivation: To address local inconsistency and limited user control in existing 3D Gaussian Splatting editing methods. Method: The paper proposes InterGSEdit, which uses a CLIP-based Semantic Consistency Selection strategy and an Attention Fusion Network to enhance 3D editing quality. Result: The experiments show that the proposed method achieves high-fidelity, consistent 3DGS editing with better user interaction. Conclusion: InterGSEdit provides state-of-the-art performance in 3DGS editing, improving consistency and user experience. Abstract: 3D Gaussian Splatting based 3D editing has demonstrated impressive performance in recent years. However, multi-view editing often exhibits significant local inconsistency, especially in areas of non-rigid deformation, which leads to local artifacts, texture blurring, or semantic variations in edited 3D scenes. We also found that existing editing methods, which rely entirely on text prompts, make the editing process a "one-shot deal", making it difficult for users to control the editing degree flexibly. In response to these challenges, we present InterGSEdit, a novel framework for high-quality 3DGS editing via interactively selecting key views with users' preferences. We propose a CLIP-based Semantic Consistency Selection (CSCS) strategy to adaptively screen a group of semantically consistent reference views for each user-selected key view. Then, the cross-attention maps derived from the reference views are used in a weighted Gaussian Splatting unprojection to construct the 3D Geometry-Consistent Attention Prior ($GAP^{3D}$). We project $GAP^{3D}$ to obtain 3D-constrained attention, which is fused with 2D cross-attention via an Attention Fusion Network (AFN). AFN employs an adaptive attention strategy that prioritizes 3D-constrained attention for geometric consistency during early inference and gradually prioritizes 2D cross-attention maps in diffusion for fine-grained features during later inference. 
Extensive experiments demonstrate that InterGSEdit achieves state-of-the-art performance, delivering consistent, high-fidelity 3DGS editing with improved user experience.
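
The early-versus-late prioritization attributed to AFN can be illustrated with a fixed blending schedule. This is only a sketch under stated assumptions: the abstract describes AFN as a network with an adaptive strategy, whereas the linear schedule, the function names, and the list-of-lists attention maps below are hypothetical stand-ins.

```python
def fusion_weight(step, num_steps):
    """Weight on the 3D-constrained map: 1.0 at the first step, 0.0 at the last.

    A linear schedule is an assumption made for illustration only.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return 1.0 - step / (num_steps - 1)


def fuse_attention(attn_3d, attn_2d, step, num_steps):
    """Blend two attention maps (lists of rows) element by element."""
    w = fusion_weight(step, num_steps)
    return [
        [w * a + (1.0 - w) * b for a, b in zip(row_3d, row_2d)]
        for row_3d, row_2d in zip(attn_3d, attn_2d)
    ]


# Halfway through a 5-step schedule both maps contribute equally.
fused_mid = fuse_attention([[1.0, 0.0]], [[0.0, 1.0]], step=2, num_steps=5)
```

Early steps are dominated by the 3D-constrained map (geometric consistency), and later steps by the 2D cross-attention map (fine-grained detail), which matches the ordering described in the abstract.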

[361] Parameterized Diffusion Optimization enabled Autoregressive Ordinal Regression for Diabetic Retinopathy Grading

Qinkai Yu,Wei Zhou,Hantao Liu,Yanyu Xu,Meng Wang,Yitian Zhao,Huazhu Fu,Xujiong Ye,Yalin Zheng,Yanda Meng

Main category: cs.CV

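TL;DR: This study proposes AOR-DR, a new autoregressive ordinal regression method that addresses the uneven class distribution and ambiguous category boundaries in diabetic retinopathy (DR) grading, and outperforms existing methods by making effective use of pre-trained models.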

Details Motivation: Diabetic retinopathy (DR), a long-term complication of diabetes, progresses slowly and may take years to threaten vision. Accurately assessing its severity is essential for timely management and care. However, the uneven distribution of DR severity levels and the ambiguous definition of category boundaries degrade classification performance. Method: The paper proposes AOR-DR, a new autoregressive ordinal regression method that decomposes the DR grading task into a series of ordered steps and uses a diffusion process to facilitate conditional probability modeling, so that continuous global image features can be used directly for autoregression. Result: Experiments show that AOR-DR performs strongly on four large-scale public datasets and addresses the challenges in DR classification. Conclusion: On four large-scale public datasets, AOR-DR proves more effective than six recent ordinal regression methods, and it can directly leverage the capabilities of pre-trained large-scale foundation models. Abstract: As a long-term complication of diabetes, diabetic retinopathy (DR) progresses slowly, potentially taking years to threaten vision. An accurate and robust evaluation of its severity is vital to ensure prompt management and care. Ordinal regression leverages the underlying inherent order between categories to achieve superior performance beyond traditional classification. However, there exist challenges leading to lower DR classification performance: 1) The uneven distribution of DR severity levels, characterized by a long-tailed pattern, adds complexity to the grading process. 2) The ambiguity in defining category boundaries introduces additional challenges, making the classification process more complex and prone to inconsistencies. This work proposes a novel autoregressive ordinal regression method called AOR-DR to address the above challenges by leveraging the clinical knowledge of inherent ordinal information in DR grading dataset settings. Specifically, we decompose the DR grading task into a series of ordered steps by fusing the prediction of the previous steps with extracted image features as conditions for the current prediction step. Additionally, we exploit the diffusion process to facilitate conditional probability modeling, enabling the direct use of continuous global image features for autoregression without relearning contextual information from patch-level features. This ensures the effectiveness of the autoregressive process and leverages the capabilities of pre-trained large-scale foundation models. 
Extensive experiments were conducted on four large-scale publicly available color fundus datasets, demonstrating our model's effectiveness and superior performance over six recent state-of-the-art ordinal regression methods. The implementation code is available at https://github.com/Qinkaiyu/AOR-DR.

[362] TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Zonglin Lyu,Chen Chen

Main category: cs.CV

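TL;DR: TLB-VFI is an efficient video frame interpolation method built on a temporal-aware diffusion model; it achieves stronger performance with fewer parameters, faster inference, and far less training data.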

Details Motivation: Existing image-based diffusion models cannot extract temporal information, and video-based diffusion models are too large, so a more efficient video frame interpolation method is needed. Method: The paper proposes Temporal-Aware Latent Brownian Bridge Diffusion (TLB-VFI), which uses 3D-wavelet gating and a temporal-aware autoencoder to extract spatio-temporal information and incorporates optical flow guidance for video frame interpolation. Result: On the most challenging datasets, the method improves FID by 20% over recent SOTA image-based diffusion models while using 3x fewer parameters and running 2.3x faster; compared with video-based diffusion models, it requires 9000x less training data and has over 20x fewer parameters. Conclusion: TLB-VFI is an efficient video-based diffusion model that, by extracting rich spatio-temporal information, improves performance with fewer parameters and substantially less training data. Abstract: Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves 20% improvement in FID on the most challenging datasets over recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having 3x fewer parameters. Such a parameter reduction results in a 2.3x speed-up. By incorporating optical flow guidance, our method requires 9000x less training data and achieves over 20x fewer parameters than video-based diffusion models. Codes and results are available at our project page: https://zonglinl.github.io/tlbvfi_page.

[363] AI for the Routine, Humans for the Complex: Accuracy-Driven Data Labelling with Mixed Integer Linear Programming

Mohammad Hossein Amini,Mehrdad Sabetzadeh,Shiva Nejati

Main category: cs.CV

TL;DR: OPAL is a human-assisted labeling method that leverages a mixed-integer linear programming formulation to achieve high labeling accuracy with minimal manual effort, significantly outperforming existing approaches in both labeling and validation tasks.

Details Motivation: The scarcity of accurately labeled data is a major challenge in deep learning, especially during model testing where high label accuracy is crucial. Existing semi-supervised approaches may not be sufficient for reliable verification due to moderate label accuracy. Method: OPAL uses a human-assisted labeling approach based on a mixed-integer linear programming (MILP) formulation to minimize labeling effort under a specified accuracy target. The method was evaluated through more than 2500 experiments on seven datasets, comparing OPAL with eight baseline methods. Active learning was also incorporated for further optimization. Result: OPAL achieves an average accuracy of 98.8% while reducing manual labeling by more than half compared to baseline methods. For automated test-input validation, it reduces manual effort by 28.8% while achieving 4.5% higher accuracy than state-of-the-art baselines. With active learning, there is an additional 4.5% reduction in manual labeling without compromising accuracy. Conclusion: OPAL, with its mixed-integer linear programming formulation, can effectively reduce manual labeling effort while maintaining high label accuracy. It outperforms other methods in both automated labeling and test-input validation tasks. Abstract: The scarcity of accurately labelled data remains a major challenge in deep learning (DL). Many DL approaches rely on semi-supervised methods, which focus on constructing large datasets that require only a minimal amount of human-labelled data. Since DL training algorithms can tolerate moderate label noise, it has generally been acceptable for the accuracy of labels in large training datasets to fall well short of a perfect 100%. However, when it comes to testing DL models, achieving high label accuracy, as close to 100% as possible, is paramount for reliable verification. 
In this article, we introduce OPAL, a human-assisted labelling method that can be configured to target a desired accuracy level while minimizing the manual effort required for labelling. The main contribution of OPAL is a mixed-integer linear programming (MILP) formulation that minimizes labelling effort subject to a specified accuracy target. We evaluate OPAL for two tasks in the context of testing vision systems: automatic labelling of test data and automated validation of test data. Our evaluation, based on more than 2500 experiments performed on seven datasets, comparing OPAL with eight baseline methods, shows that OPAL, relying on its MILP formulation, achieves an average accuracy of 98.8%, just 1.2% below perfect accuracy, while cutting manual labelling by more than half. Further, OPAL significantly outperforms automated labelling baselines in labelling accuracy across all seven datasets, with large effect sizes, when all methods are provided with the same manual-labelling budget. For automated test-input validation, on average, OPAL reduces manual effort by 28.8% while achieving 4.5% higher accuracy than the SOTA validation baselines. Finally, we show that augmenting OPAL with an active learning loop leads to an additional 4.5% reduction in required manual labelling, without compromising accuracy.

[364] Robust Incomplete-Modality Alignment for Ophthalmic Disease Grading and Diagnosis via Labeled Optimal Transport

Qinkai Yu,Jianyang Xie,Yitian Zhao,Cheng Chen,Lijun Zhang,Liming Chen,Jun Cheng,Lu Liu,Yalin Zheng,Yanda Meng

Main category: cs.CV

TL;DR: This paper proposes a new method for ophthalmic diagnostics that effectively handles incomplete multimodal data by aligning and fusing features from different imaging modalities.

Details Motivation: Uneven distribution of healthcare resources leads to incomplete multimodal data, which compromises diagnostic accuracy. Existing methods have limitations in reconstructing features or requiring fully paired data. Method: A novel multimodal alignment and fusion framework using Optimal Transport for multi-scale feature alignment and an asymmetric fusion strategy is proposed. Result: Extensive evaluations show the model's robustness and superior performance under various modality-incomplete scenarios, including both complete modality and inter-modality incompleteness conditions. Conclusion: The proposed framework demonstrates superior performance in handling missing modalities in ophthalmic diagnostics, achieving state-of-the-art results on multiple datasets. Abstract: Multimodal ophthalmic imaging-based diagnosis integrates color fundus image with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, the uneven global distribution of healthcare resources often results in real-world clinical scenarios encountering incomplete multimodal data, which significantly compromises diagnostic accuracy. Existing commonly used pipelines, such as modality imputation and distillation methods, face notable limitations: 1) Imputation methods struggle with accurately reconstructing key lesion features, since OCT lesions are localized, while fundus images vary in style. 2) Distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework capable of robustly handling missing modalities in the task of ophthalmic diagnostics. 
By considering the distinctive feature characteristics of OCT and fundus images, we emphasize the alignment of semantic features within the same category and explicitly learn soft matching between modalities, allowing the missing modality to utilize existing modality information and achieving robust cross-modal feature alignment when a modality is missing. Specifically, we leverage Optimal Transport for multi-scale modality feature alignment: class-wise alignment through predicted class prototypes and feature-wise alignment via cross-modal shared feature transport. Furthermore, we propose an asymmetric fusion strategy that effectively exploits the distinct characteristics of OCT and fundus modalities. Extensive evaluations on three large ophthalmic multimodal datasets demonstrate our model's superior performance under various modality-incomplete scenarios, achieving SOTA performance in both complete modality and inter-modality incompleteness conditions. Code is available at https://github.com/Qinkaiyu/RIMA
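
The "soft matching between modalities" via Optimal Transport can be illustrated with the standard entropy-regularized Sinkhorn iteration. The sketch below is generic, not the paper's solver: the cost matrix, the uniform marginals, and the `reg` and `n_iters` parameters are assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan between two discrete sets.

    cost[i][j]: cost of matching source item i (e.g. an OCT feature) to
                target item j (e.g. a fundus feature); hypothetical inputs.
    a, b: source and target marginals (each should sum to 1).
    reg: entropic regularization strength; very small values can underflow.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two items per modality; matching item i to item i is cheap.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

Each entry `plan[i][j]` is the amount of mass softly assigned from source item `i` to target item `j`, so rows of the plan can be read as soft correspondences from one modality to the other.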

[365] Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition

Britty Baby,Vinkle Srivastav,Pooja P. Jain,Kun Yuan,Pietro Mascagni,Nicolas Padoy

Main category: cs.CV

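TL;DR: This study develops CVS-AdaptNet, which improves Critical View of Safety (CVS) recognition in laparoscopic cholecystectomy through a multi-modal, multi-label approach.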

Details Motivation: Traditional CVS recognition models rely on vision models and require costly spatial annotations. This paper explores how text can be used as a tool for both training and inference to improve the automation of CVS recognition. Method: The paper proposes CVS-AdaptNet, a multi-label adaptation strategy that aligns image embeddings with textual descriptions using positive and negative prompts to strengthen fine-grained binary classification. Result: CVS-AdaptNet achieves 57.6 mAP on the Endoscapes-CVS201 dataset, 6 points higher than the ResNet50 image-only baseline (51.5 mAP). Conclusion: By combining textual descriptions with image embeddings in a multi-label, multi-modal framework, CVS-AdaptNet improves CVS recognition and shows the potential of generalist models for specialized surgical tasks. Abstract: The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models learning with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet's multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods that help in analysing the image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. 
Code: https://github.com/CAMMA-public/CVS-AdaptNet
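
Scoring each criterion against a positive and a negative prompt can be sketched as an independent two-way comparison per label. This is an illustrative sketch, not the CVS-AdaptNet code: the function names, the temperature value, and the toy two-dimensional embeddings are assumptions, and real embeddings would come from the image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def criterion_probability(image_emb, pos_emb, neg_emb, temperature=0.07):
    """Probability that one criterion is met, from a positive/negative prompt pair.

    A two-way softmax over the scaled similarities equals a sigmoid of their
    difference. The temperature is a hypothetical value.
    """
    s_pos = cosine(image_emb, pos_emb) / temperature
    s_neg = cosine(image_emb, neg_emb) / temperature
    return 1.0 / (1.0 + math.exp(s_neg - s_pos))


def predict_criteria(image_emb, prompt_pairs, temperature=0.07):
    """Score every criterion independently, which makes the task multi-label."""
    return {
        name: criterion_probability(image_emb, pos, neg, temperature)
        for name, (pos, neg) in prompt_pairs.items()
    }


# Toy embeddings: the image matches the positive prompt of "c1" only.
probs = predict_criteria(
    [1.0, 0.0],
    {"c1": ([1.0, 0.0], [0.0, 1.0]), "c2": ([0.0, 1.0], [1.0, 0.0])},
)
```

Because each criterion has its own prompt pair, the probabilities do not have to sum to one across criteria, unlike a multi-class softmax over all labels.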

[366] Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision

Soham Walimbe,Britty Baby,Vinkle Srivastav,Nicolas Padoy

Main category: cs.CV

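TL;DR: MML-SurgAdapt is a new unified multi-task framework that uses natural language supervision to handle diverse surgical tasks and uses SPML learning to cope with incomplete or noisy annotations.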

Details Motivation: Traditional models handle only one task at a time, lack flexibility, and require a separate model per task. In addition, multi-task learning suffers from partial annotations. Method: The framework uses Vision-Language Models (VLMs), specifically CLIP, to handle multiple surgical tasks and applies Single Positive Multi-Label (SPML) learning to address partial annotations. Result: MML-SurgAdapt performs well on the Cholec80, Endoscapes2023, and CholecT50 datasets, is comparable to task-specific benchmarks, and reduces the number of required labels by 23%. Conclusion: MML-SurgAdapt is the first method to apply SPML to integrating data from multiple surgical tasks, offering a novel and generalizable solution for multi-task learning in surgical computer vision. Abstract: Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms the existing SPML frameworks for the task. By reducing the required labels by 23%, our approach offers a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. 
To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt
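
Training with a single positive label per instance can be illustrated with the simple "assume negative" baseline from the SPML literature, in which every unobserved label is treated as negative. This sketch is not claimed to be the loss used by MML-SurgAdapt; the function name and the toy probabilities are hypothetical.

```python
import math


def assume_negative_loss(probs, positive_index, eps=1e-7):
    """Binary cross-entropy with one observed positive label.

    probs: predicted per-label probabilities for one instance.
    positive_index: index of the single label known to be positive.
    All other labels are unobserved and are assumed negative here, which
    introduces label noise whenever an unobserved label is truly positive.
    """
    loss = 0.0
    for i, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= math.log(p) if i == positive_index else math.log(1.0 - p)
    return loss / len(probs)


# A prediction that agrees with the observed positive gets a lower loss.
good = assume_negative_loss([0.9, 0.1, 0.1], positive_index=0)
bad = assume_negative_loss([0.1, 0.9, 0.1], positive_index=0)
```

The noise introduced by the "assume negative" step is one reason SPML methods, and frameworks that tolerate incomplete or noisy annotations, typically go beyond this baseline.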

[367] Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning

Ricardo Cardoso,Plinio Moreno

Main category: cs.CV

TL;DR: This paper introduces a novel vision-based method for object mass estimation by combining RGB and depth data, utilizing a synthetic dataset for training and achieving superior performance over existing techniques.

Details Motivation: Accurate mass estimation of objects before interaction enhances robotic tasks like grasping and control. Vision-only mass estimation is underexplored, motivating this study. Method: The approach combines sparse point-cloud data from depth images with RGB images to estimate object mass. A synthetic dataset was created using ShapeNetSem 3D models and used to train an image generation model that estimates dense depth maps, which then augment an existing dataset of images paired with mass values. Result: The approach outperformed existing benchmarks across all metrics, showing effectiveness in mass estimation using RGB and depth data. Conclusion: The proposed method significantly outperforms existing benchmarks in estimating object mass using RGB and depth data, while the synthetic dataset and models are publicly available for further research. Abstract: Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object's mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. 
The data generation (https://github.com/RavineWindteer/ShapenetSem-to-RGBD) as well as the training of the depth estimator (https://github.com/RavineWindteer/GLPDepth-Edited) and the mass estimator (https://github.com/RavineWindteer/Depth-mass-estimator) are available online.

[368] INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Xin Dong,Shichao Dong,Jin Wang,Jing Huang,Li Zhou,Zenghui Sun,Lihua Jing,Jingsong Lan,Xiaoyong Zhu,Bo Zheng

Main category: cs.CV

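TL;DR: This paper proposes INTER, a new method that reduces hallucination in large vision-language models by guiding the model to make better use of multimodal interaction information; experiments demonstrate its effectiveness.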

Details Motivation: Humans make effective use of multimodal interaction information, whereas large vision-language models often hallucinate responses that are inconsistent with the visual content. Method: The paper proposes INTER, a training-free algorithm that reduces hallucination by guiding the model to reapply its understanding of multimodal interaction information when generating responses. Result: On six benchmarks, including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% over the state-of-the-art decoding strategy. Conclusion: INTER effectively reduces hallucination in large vision-language models without requiring additional data. Abstract: Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose INTER: Interaction Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.

[369] ICAS: Detecting Training Data from Autoregressive Image Generative Models

Hongyao Yu,Yixiang Qiu,Yiheng Yang,Hao Fang,Tianqu Zhuang,Jiaxin Hong,Bin Chen,Hao Wu,Shu-Tao Xia

Main category: cs.CV

TL;DR: This paper proposes a new method for detecting unauthorized training data usage in autoregressive image models, demonstrating its effectiveness and highlighting vulnerabilities in large models.

Details Motivation: With advancements in autoregressive image generation raising concerns about data privacy and copyright, there is a growing need to identify unauthorized data usage in model training through membership inference. Method: The approach includes implicit classification and an adaptive score aggregation strategy. The implicit token-wise classification score is calculated within the query image, followed by an adaptive score aggregation strategy to emphasize tokens with lower scores, producing a final score indicating potential training set involvement. Result: Extensive experiments confirm the effectiveness of the proposed method in both class-conditional and text-to-image scenarios, showing strong robustness and generalization under various data transformations. Two key findings include a linear scaling law on membership inference and the increased detectability of training data from scale-wise visual autoregressive models. Conclusion: The study successfully introduces a method for membership inference in autoregressive image generative models, revealing vulnerabilities in large foundation models and showing that training data from certain models are more detectable. Abstract: Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. 
Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, sufficient experiments suggest two novel key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than other autoregressive paradigms. Our code is available at https://github.com/Chrisqcwx/ImageAR-MIA.
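
The aggregation step described above can be illustrated with a small sketch. The abstract does not give the weighting formula, so the softmax-over-negated-scores weighting, the `temperature` parameter, and the function name `aggregate_scores` are all assumptions made for illustration, not the authors' implementation (see their repository for that).

```python
import math

def aggregate_scores(token_scores, temperature=1.0):
    """Weighted mean of per-token scores in which lower scores get larger weights.

    Hypothetical weighting: softmax over the negated scores. The paper's
    actual aggregation rule is not stated in the abstract.
    """
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    lowest = min(token_scores)
    # Shift by the minimum so every exponent is <= 0 (numerically stable).
    weights = [math.exp(-(s - lowest) / temperature) for s in token_scores]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, token_scores))

# Toy per-token scores for one query image (made-up numbers).
scores = [0.9, 0.8, 0.1, 0.85]
final = aggregate_scores(scores, temperature=0.2)
plain_mean = sum(scores) / len(scores)
```

Because the weights decrease with the score, `final` never exceeds `plain_mean`; a single low-scoring token pulls the final score down more than it would under uniform averaging.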

[370] MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation

Yucheng Wang,Dan Xu

Main category: cs.CV

TL;DR: This paper proposes MoDiT, which combines a 3D Morphable Model with a Diffusion Transformer to address consistency and realism problems in audio-driven facial animation generation.

Details Motivation: Existing methods for generating facial animation suffer from three major problems: temporal jittering, identity drift, and unnatural blinking behavior, so a new framework that improves consistency and realism is needed. Method: The paper proposes a hierarchical denoising strategy, integrates 3DMM coefficients to provide explicit spatial constraints, and adopts a refined blinking strategy to model natural eye movements. Result: The method achieves more accurate lip synchronization, more stable full-face coherence, and more natural blinking behavior, significantly improving the quality of audio-driven talking head generation. Conclusion: By combining 3DMM with a diffusion-based Transformer, MoDiT effectively addresses temporal jittering, identity drift, and unnatural blinking behavior in audio-driven talking head generation. Abstract: Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization and progressively enhance full-face coherence, effectively mitigating temporal jittering. (ii) The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction and improved lip synchronization using Wav2Lip results, thereby preserving identity consistency. (iii) A refined blinking strategy to model natural eye movements, with smoother and more realistic blinking behaviors.

[371] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Juyi Lin,Amir Taherin,Arash Akbari,Arman Akbari,Lei Lu,Guangyu Chen,Taskin Padir,Xiaomeng Yang,Weiwei Chen,Yiqian Li,Xue Lin,David Kaeli,Pu Zhao,Yanzhi Wang

Main category: cs.CV

TL;DR: This paper introduces VOTE, a highly efficient framework for Vision Language Action (VLA) models that drastically improves inference speed while maintaining strong performance and generalization.

Details Motivation: Existing VLA models struggle with generalization to novel objects or environments, often relying on additional computationally expensive components like depth estimation or diffusion. This work aims to develop an efficient action prediction method without such dependencies. Method: The authors proposed a tokenizer-free fine-tuning approach for parallel action prediction and an ensemble voting strategy for action sampling. Result: The proposed method achieves 35× faster inference and a throughput of 145 Hz, demonstrating both efficiency and performance improvements. Conclusion: VOTE is an efficient and general framework that optimizes and accelerates VLA models, achieving state-of-the-art performance with significantly improved inference speed. Abstract: Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion to improve generalization, at the cost of adding significant computation overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods, which are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In detail, we propose a novel tokenizer-free fine-tuning approach for parallel accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for the action sampling, which significantly improves model performance and enhances generalization. 
Experimental results show that our method achieves state-of-the-art performance with 35$\times$ faster inference and 145 Hz throughput. All the details and codes will be open-sourced.

[372] VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems

Aadi Srivastava,Vignesh Natarajkumar,Utkarsh Bheemanaboyna,Devisree Akashapu,Nagraj Gaonkar,Archit Joshi

Main category: cs.CV

TL;DR: This paper proposes VERITAS, a transparent framework for detecting AI-generated images and explaining the reasons behind its classifications using artifact localization and semantic reasoning.

Details Motivation: The rapid adoption of AI-generated content has blurred the line between real and synthetic images, creating a need for transparent methods to verify authenticity. Method: The paper introduces VERITAS, which combines artifact localization and semantic reasoning to detect synthetic images and generate human-readable explanations. Result: The proposed framework successfully detects small AI-generated images and provides interpretable explanations for its classifications. Conclusion: VERITAS is an effective framework for detecting and explaining AI-generated images, providing transparency in the decision-making process. Abstract: The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at https://github.com/V-i-g-n-e-s-h-N/VERITAS .

[373] LAID: Lightweight AI-Generated Image Detection in Spatial and Spectral Domains

Nicholas Chivaran,Jianbing Ni

Main category: cs.CV

TL;DR: LAID demonstrates that lightweight neural networks can efficiently and effectively detect AI-generated images, offering a scalable solution for social media platforms.

Details Motivation: Current state-of-the-art AIGI detection methods rely on large, computationally intensive models, posing barriers to real-time deployment on platforms like social media. Method: The authors introduced LAID, a framework to evaluate lightweight neural networks for AIGI detection across spatial, spectral, and fusion image domains. Result: Lightweight models achieved competitive accuracy, even under adversarial conditions, with much lower computational and memory requirements compared to existing methods. Conclusion: This paper concludes that lightweight neural networks can offer competitive accuracy while significantly reducing computational and memory costs for AI-generated image detection. Abstract: The recent proliferation of photorealistic AI-generated images (AIGI) has raised urgent concerns about their potential misuse, particularly on social media platforms. Current state-of-the-art AIGI detection methods typically rely on large, deep neural architectures, creating significant computational barriers to real-time, large-scale deployment on platforms like social media. To challenge this reliance on computationally intensive models, we introduce LAID, the first framework -- to our knowledge -- that benchmarks and evaluates the detection performance and efficiency of off-the-shelf lightweight neural networks. In this framework, we comprehensively train and evaluate selected models on a representative subset of the GenImage dataset across spatial, spectral, and fusion image domains. Our results demonstrate that lightweight models can achieve competitive accuracy, even under adversarial conditions, while incurring substantially lower memory and computation costs compared to current state-of-the-art methods. This study offers valuable insight into the trade-off between efficiency and performance in AIGI detection and lays a foundation for the development of practical, scalable, and trustworthy detection systems. 
The source code of LAID can be found at: https://github.com/nchivar/LAID.

[374] 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

Yutian Chen,Shi Guo,Tianshuo Yang,Lihe Ding,Xiuyuan Yu,Jinwei Gu,Tianfan Xue

Main category: cs.CV

TL;DR: This paper proposes a system that achieves high-speed 4D capture with low-frame-rate cameras, using an asynchronous capture scheme and a novel generative model to improve reconstruction quality.

Details Motivation: Because most 4D capture systems run at frame rates below 30 FPS, direct 4D reconstruction from low-frame-rate input can produce poor results, so a new method is needed to improve the reconstruction of high-speed motion. Method: The paper proposes an asynchronous capture scheme that raises the effective frame rate by staggering camera start times, combined with a video diffusion model that fixes artifacts caused by sparse-view reconstruction. Result: Experiments show that, compared with synchronous capture, the method significantly improves high-speed 4D reconstruction, reaching an equivalent frame rate of 100-200 FPS with better reconstruction quality. Conclusion: The paper develops an efficient high-speed 4D capture system that improves the reconstruction of highly dynamic scenes through asynchronous capture and a generative model, without using specialized high-speed cameras. Abstract: Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system only using low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
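
The frame-rate arithmetic behind staggered capture can be checked with a short sketch. It assumes camera groups are offset evenly within one 40 ms frame period (the abstract states only the 25 FPS base rate and the 100-200 FPS result); the group counts of 4 and 8 and the function names are illustrative assumptions.

```python
from fractions import Fraction

def effective_fps(base_fps, num_groups):
    """Equivalent frame rate when groups are staggered evenly within one period."""
    return base_fps * num_groups

def staggered_schedule(base_fps, num_groups, num_frames=2):
    """Return sorted (timestamp_in_seconds, group_index) capture events.

    Assumption: group g starts g / (base_fps * num_groups) seconds late.
    """
    period = Fraction(1, base_fps)   # 1/25 s = 40 ms at 25 FPS
    offset = period / num_groups     # start-time shift between adjacent groups
    return sorted(
        (g * offset + k * period, g)
        for g in range(num_groups)
        for k in range(num_frames)
    )

schedule = staggered_schedule(base_fps=25, num_groups=4)
fps_4_groups = effective_fps(25, 4)   # 100
fps_8_groups = effective_fps(25, 8)   # 200
```

Under this assumption, consecutive events in `schedule` are 10 ms apart, but each timestamp is seen only by the cameras of one group. That reduced viewpoint count per timestamp is the sparse-view problem the abstract's artifact-fix model is meant to address.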

[375] Differential Attention for Multimodal Crisis Event Analysis

Nusrat Munia,Junfeng Zhu,Olfa Nasraoui,Abdullah-Al-Zubaer Imran

Main category: cs.CV

TL;DR: This paper proposes a method using vision language models and adaptive fusion strategies to improve crisis data classification during disasters, achieving superior performance on benchmark datasets.

Details Motivation: Effectively extracting meaningful information from large and noisy multimodal data streams during crisis events remains a challenge, making it critical to develop better methods for real-time humanitarian response. Method: The researchers explored vision language models (VLMs) and advanced fusion strategies, incorporating LLaVA-generated text, CLIP-based embeddings, Guided Cross Attention (Guided CA), and Differential Attention mechanisms to enhance feature alignment and classification performance. Result: The proposed approach consistently outperformed state-of-the-art models in classification accuracy on the CrisisMMD benchmark dataset across three tasks crucial for disaster response. Conclusion: The study concludes that combining pretrained vision language models with enriched textual descriptions and adaptive fusion strategies improves the classification of crisis data, leading to more reliable and interpretable models for disaster response. Abstract: Social networks can be a valuable source of information during crisis events. In particular, users can post a stream of multimodal data that can be critical for real-time humanitarian response. However, effectively extracting meaningful information from this large and noisy data stream and effectively integrating heterogeneous data remains a formidable challenge. In this work, we explore vision language models (VLMs) and advanced fusion strategies to enhance the classification of crisis data in three different tasks. We incorporate LLaVA-generated text to improve text-image alignment. Additionally, we leverage Contrastive Language-Image Pretraining (CLIP)-based vision and text embeddings, which, without task-specific fine-tuning, outperform traditional models. 
To further refine multimodal fusion, we employ Guided Cross Attention (Guided CA) and combine it with the Differential Attention mechanism to enhance feature alignment by emphasizing critical information while filtering out irrelevant content. Our results show that while Differential Attention improves classification performance, Guided CA remains highly effective in aligning multimodal features. Extensive experiments on the CrisisMMD benchmark data set demonstrate that the combination of pretrained VLMs, enriched textual descriptions, and adaptive fusion strategies consistently outperforms state-of-the-art models in classification accuracy, contributing to more reliable and interpretable models for three different tasks that are crucial for disaster response. Our code is available at https://github.com/Munia03/Multimodal_Crisis_Event.
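
As a rough illustration of the "emphasize critical information while filtering out irrelevant content" idea, the sketch below uses a common formulation of differential attention: the difference of two softmax attention maps applied to shared values. How the paper combines this with Guided CA is not described in the abstract, so the fixed scalar `lam`, the tensor shapes, and the function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """(softmax(q1 k1^T / sqrt(d)) - lam * softmax(q2 k2^T / sqrt(d))) @ v.

    The second attention map acts as a noise estimate that is subtracted
    from the first. `lam` is a fixed scalar here for simplicity; in trained
    models it is typically a learnable parameter.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v

# Toy example: 4 query tokens (e.g. text) attending over 6 key tokens (e.g. image).
rng = np.random.default_rng(0)
q1, q2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
k1, k2 = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
v = rng.normal(size=(6, 16))
out = differential_attention(q1, k1, q2, k2, v, lam=0.5)
```

With `lam=0` this reduces to ordinary scaled dot-product attention, which makes the sketch easy to compare against a standard cross-attention baseline.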

[376] Semantic Frame Interpolation

Yijia Hong,Jiangning Zhang,Ran Yi,Yuji Wang,Weijian Cao,Xiaobin Hu,Zhucun Xue,Yabiao Wang,Chengjie Wang,Lizhuang Ma

Main category: cs.CV

TL;DR: This paper introduces a new Semantic Frame Interpolation task and model that aim to overcome the limitations of traditional frame interpolation, and builds a dedicated dataset and evaluation benchmark.

Details Motivation: Traditional frame interpolation focuses on scenarios with few frames, no text control, and small differences between the first and last frames, while existing models can only generate a fixed number of frames and the setting lacks a clear official definition and a well-established benchmark. A new Semantic Frame Interpolation task and model are needed to address these problems. Method: Building on Wan2.1, the paper proposes the SemFi model, which introduces a Mixture-of-LoRA module to ensure high-consistency content that aligns with control conditions across various frame length limitations, and designs the SFI-300K dataset and evaluation metrics for multi-dimensional performance assessment. Result: The paper proposes the SemFi model and the SFI-300K dataset, validates experimentally that the method meets the requirements of the SFI task, and reports its consistency and diversity across image and video dimensions. Conclusion: The paper proposes a new Semantic Frame Interpolation task and model, builds the first general-purpose dataset and benchmark designed specifically for SFI, and validates through extensive experiments that the proposed method meets the requirements of the SFI task. Abstract: Generating intermediate video content of varying lengths based on given first and last frames, along with text prompt information, offers significant research and application potential. However, traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers have utilized large video models represented by Wan to endow frame-to-frame capabilities. However, these models can only generate a fixed number of frames and often fail to produce satisfactory results for certain frame lengths, while this setting lacks a clear official definition and a well-established benchmark. In this paper, we first propose a new practical Semantic Frame Interpolation (SFI) task from the perspective of academic definition, which covers the above two settings and supports inference at multiple frame rates. To achieve this goal, we propose a novel SemFi model building upon Wan2.1, which incorporates a Mixture-of-LoRA module to ensure the generation of high-consistency content that aligns with control conditions across various frame length limitations. Furthermore, we propose SFI-300K, the first general-purpose dataset and benchmark specifically designed for SFI. To support this, we collect and process data from the perspective of SFI, carefully designing evaluation metrics and methods to assess the model's performance across multiple dimensions, encompassing image and video, and various aspects, including consistency and diversity. 
Through extensive experiments on SFI-300K, we demonstrate that our method is particularly well-suited to meet the requirements of the SFI task.

[377] $\varphi$-Adapt: A Physics-Informed Adaptation Learning Approach to 2D Quantum Material Discovery

Hoang-Quan Nguyen,Xuan Bac Nguyen,Sankalp Pandey,Tim Faltermeier,Nicholas Borys,Hugh Churchill,Khoa Luu

Main category: cs.CV

TL;DR: This paper introduces a new method for estimating quantum flake thickness in quantum hardware engineering. By combining physics-based modeling with domain adaptation, it transfers models trained on synthetic data to real-world settings efficiently.

Details Motivation: Existing computer vision methods for estimating quantum flake thickness face major challenges: limited data, poor generalization, sensitivity to domain shifts, and a lack of physical interpretability. Method: The paper proposes a comprehensive framework comprising a synthetic data generation framework that produces diverse quantum flake samples and $\varphi$-Adapt, a physics-informed adaptation method. Result: Experiments show that the proposed method achieves state-of-the-art performance on multiple benchmarks, outperforming existing methods. Conclusion: The paper proposes a physics-informed adaptation learning method that provides an important tool for applying synthetic data to real-world 2D material analysis and advances the integration of physics-based modeling and domain adaptation. Abstract: Characterizing quantum flakes is a critical step in quantum hardware engineering because the quality of these flakes directly influences qubit performance. Although computer vision methods for identifying two-dimensional quantum flakes have emerged, they still face significant challenges in estimating flake thickness. These challenges include limited data, poor generalization, sensitivity to domain shifts, and a lack of physical interpretability. In this paper, we introduce one of the first Physics-informed Adaptation Learning approaches to overcome these obstacles. We focus on two main issues, i.e., data scarcity and generalization. First, we propose a new synthetic data generation framework that produces diverse quantum flake samples across various materials and configurations, reducing the need for time-consuming manual collection. Second, we present $\varphi$-Adapt, a physics-informed adaptation method that bridges the performance gap between models trained on synthetic data and those deployed in real-world settings. Experimental results show that our approach achieves state-of-the-art performance on multiple benchmarks, outperforming existing methods. Our proposed approach advances the integration of physics-based modeling and domain adaptation. It also addresses a critical gap in leveraging synthesized data for real-world 2D material analysis, offering impactful tools for deep learning and materials science communities.

[378] Satellite-based Rabi rice paddy field mapping in India: a case study on Telangana state

Prashanth Reddy Putta,Fabio Dell'Acqua

Main category: cs.CV

TL;DR: This study proposes a phenology-driven classification framework that adapts to local agro-ecological variations, improving the accuracy of rice area monitoring in support of food security and agricultural policy.

Details Motivation: Accurate rice area monitoring is critical for food security and agricultural policy in smallholder farming regions, but conventional remote sensing approaches struggle with the spatiotemporal heterogeneity of fragmented agricultural landscapes. Method: The study developed a phenology-driven classification framework that systematically adapts to local agro-ecological variations across 32 districts in Telangana, India, and validated it against government statistics. Result: The study reveals significant spatiotemporal diversity, with phenological timing differing by up to 50 days between districts and field sizes ranging from 0.01 to 2.94 hectares. The district-specific calibration approach achieved 93.3% overall accuracy and mapped 732,345 hectares of rice. Conclusion: Remote sensing frameworks must embrace rather than simplify landscape complexity in order to advance region-specific agricultural monitoring approaches that maintain scientific rigor while serving policy and food security applications. Abstract: Accurate rice area monitoring is critical for food security and agricultural policy in smallholder farming regions, yet conventional remote sensing approaches struggle with the spatiotemporal heterogeneity characteristic of fragmented agricultural landscapes. This study developed a phenology-driven classification framework that systematically adapts to local agro-ecological variations across 32 districts in Telangana, India during the 2018-19 Rabi rice season. The research reveals significant spatiotemporal diversity, with phenological timing varying by up to 50 days between districts and field sizes ranging from 0.01 to 2.94 hectares. Our district-specific calibration approach achieved 93.3% overall accuracy, an 8.0 percentage point improvement over conventional regional clustering methods, with strong validation against official government statistics (R^2 = 0.981) demonstrating excellent agreement between remotely sensed and ground truth data. The framework successfully mapped 732,345 hectares by adapting to agro-climatic variations, with Northern districts requiring extended land preparation phases (up to 55 days) while Southern districts showed compressed cultivation cycles. Field size analysis revealed accuracy declining 6.8 percentage points from medium to tiny fields, providing insights for operational monitoring in fragmented landscapes. These findings demonstrate that remote sensing frameworks must embrace rather than simplify landscape complexity, advancing region-specific agricultural monitoring approaches that maintain scientific rigor while serving practical policy and food security applications.

[379] All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Zongyan Han,Mohamed El Amine Boudjoghra,Jiahua Dong,Jinhong Wang,Rao Muhammad Anwer

Main category: cs.CV

TL;DR: This paper proposes VDG-Uni3DSeg, a new 3D point cloud segmentation framework that uses pre-trained vision-language models and large language models and achieves strong results on multiple segmentation tasks.

Details Motivation: Unified 3D point cloud segmentation is crucial for scene understanding, but it is hindered by sparse structure, limited annotations, and the difficulty of distinguishing fine-grained object classes in complex environments. Existing methods struggle to separate classes and instances well because of limited supervision and a lack of diverse multimodal cues. Method: The paper proposes VDG-Uni3DSeg, a new framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs), and enhances 3D segmentation with LLM-generated textual descriptions and reference images from the internet. It also designs a Semantic-Visual Contrastive Loss and a Spatial Enhanced Module. Result: VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, and offers a scalable and practical solution for 3D understanding. Conclusion: By combining pre-trained vision-language models with large language models, VDG-Uni3DSeg achieves state-of-the-art results on semantic, instance, and panoptic segmentation tasks and provides a scalable and practical solution for 3D understanding. Abstract: Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at https://github.com/Hanzy1996/VDG-Uni3DSeg.

[380] CTA: Cross-Task Alignment for Better Test Time Training

Samuel Barbeau,Pedram Fekri,David Osowiechi,Ali Bahri,Moslem Yazdanpanah,Masih Aminbeidokhti,Christian Desrosiers

Main category: cs.CV

TL;DR: This paper proposes a new method called CTA (Cross-Task Alignment) to improve Test-Time Training (TTT) in deep learning models, enhancing robustness and generalization without requiring specialized model architecture.

Details Motivation: Deep learning models often suffer performance degradation under distribution shifts. Existing TTT methods enhance robustness but typically require specialized architectures. This work aims to develop a more flexible and effective TTT approach inspired by multi-modal contrastive learning. Method: The authors propose CTA (Cross-Task Alignment), which aligns a supervised encoder with a self-supervised encoder using multi-modal contrastive learning principles. This alignment helps mitigate gradient interference while preserving the robustness of self-supervised learning, enabling more meaningful test-time updates. Result: Experimental results show that CTA achieves significant improvements in robustness and generalization compared to state-of-the-art methods on multiple benchmark datasets. Conclusion: CTA provides an effective solution for improving TTT without relying on specialized architectures, leveraging cross-task alignment to enhance model robustness and semantic updates at test time. Abstract: Deep learning models have demonstrated exceptional performance across a wide range of computer vision tasks. However, their performance often degrades significantly when faced with distribution shifts, such as domain or dataset changes. Test-Time Training (TTT) has emerged as an effective method to enhance model robustness by incorporating an auxiliary unsupervised task during training and leveraging it for model updates at test time. In this work, we introduce CTA (Cross-Task Alignment), a novel approach for improving TTT. Unlike existing TTT methods, CTA does not require a specialized model architecture and instead takes inspiration from the success of multi-modal contrastive learning to align a supervised encoder with a self-supervised one. 
This process enforces alignment between the learned representations of both models, thereby mitigating the risk of gradient interference, preserving the intrinsic robustness of self-supervised learning and enabling more semantically meaningful updates at test-time. Experimental results demonstrate substantial improvements in robustness and generalization over the state-of-the-art on several benchmark datasets.
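
The alignment idea can be sketched with a CLIP-style symmetric contrastive loss between the two encoders' embeddings of the same batch. The abstract says only that CTA takes inspiration from multi-modal contrastive learning, so the exact loss, the `temperature` value, and the names `alignment_loss`, `z_sup`, and `z_ssl` are assumptions for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def alignment_loss(z_sup, z_ssl, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Row i of `z_sup` (supervised encoder) and row i of `z_ssl`
    (self-supervised encoder) come from the same image and form the
    positive pair; all other rows in the batch serve as negatives.
    """
    z_sup = z_sup / np.linalg.norm(z_sup, axis=1, keepdims=True)
    z_ssl = z_ssl / np.linalg.norm(z_ssl, axis=1, keepdims=True)
    logits = z_sup @ z_ssl.T / temperature          # pairwise cosine similarities
    idx = np.arange(len(logits))
    loss_sup_to_ssl = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ssl_to_sup = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_sup_to_ssl + loss_ssl_to_sup)

# Toy check: matched embeddings should score better than mismatched ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = alignment_loss(z, z)
loss_mismatched = alignment_loss(z, np.roll(z, shift=1, axis=0))
```

Minimizing such a loss pulls the two representation spaces together, which matches the abstract's stated goal of making self-supervised test-time updates move the supervised features in a semantically consistent direction.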

[381] Self-Supervised Real-Time Tracking of Military Vehicles in Low-FPS UAV Footage

Markiyan Kostiv,Anatolii Adamovskyi,Yevhen Cherniavskyi,Mykyta Varenyk,Ostap Viniavskyi,Igor Krashenyi,Oles Dobosevych

Main category: cs.CV

TL;DR: This paper presents an instance association learning method based on single-frame annotations for multi-object tracking in low-frame-rate UAV video, and also provides a dataset of military vehicles.

Details Motivation: Video captured by UAVs in actual combat scenarios exhibits rapid changes in object appearance and position, as well as image degradation caused by cloud video streaming and compression algorithms, so a more effective method is needed to handle these challenges. Method: The paper proposes learning instance association from single-frame annotations, and uses global scene features to make instance association more robust at low frame rates. Result: The paper demonstrates a tracking method that maintains high association quality at low frame rates and is robust to distractors and gaps in detections. Conclusion: The paper concludes that the proposed tracking method maintains high association quality even when the input image resolution and latent representation size are reduced, and it also provides an annotated dataset of military vehicles. Abstract: Multi-object tracking (MOT) aims to maintain consistent identities of objects across video frames. Associating objects in low-frame-rate videos captured by moving unmanned aerial vehicles (UAVs) in actual combat scenarios is complex due to rapid changes in object appearance and position within the frame. The task becomes even more challenging due to image degradation caused by cloud video streaming and compression algorithms. We present how instance association learning from single-frame annotations can overcome these challenges. We show that global features of the scene provide crucial context for low-FPS instance association, allowing our solution to be robust to distractors and gaps in detections. We also demonstrate that such a tracking approach maintains high association quality even when reducing the input image resolution and latent representation size for faster inference. Finally, we present a benchmark dataset of annotated military vehicles collected from publicly available data sources. This paper was initially presented at the NATO Science and Technology Organization Symposium (ICMCIS) organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-209-RSY - the ICMCIS, held in Oeiras, Portugal, 13-14 May 2025.

[382] Physics-Guided Dual Implicit Neural Representations for Source Separation

Yuan Ni,Zhantao Chen,Alexander N. Petsch,Edmund Xu,Cheng Peng,Alexander I. Kolesnikov,Sugata Chowdhury,Arun Bansil,Jana B. Thayer,Joshua J. Turner

Main category: cs.CV

TL;DR: The authors develop a new self-supervised machine-learning method for separating target physical signals from complex backgrounds, applicable to domains such as astronomical measurements and biomedical image reconstruction.

Details Motivation: Data from advanced experimental and observational techniques often contain unwanted components such as background and signal distortions that can obscure the physical information of interest, so a method that requires neither labeled data nor pre-defined dictionaries is needed. Method: A dual implicit neural representation framework jointly trains two neural networks: one approximates distortions of the physical signal, and the other learns the effective background contribution. Result: The framework is validated on large-scale simulated and experimental momentum-energy-dependent inelastic neutron scattering data in a four-dimensional parameter space, and it separates signal from background successfully even when the signal characteristics vary across all four dimensions. Conclusion: The paper proposes a source separation method based on self-supervised machine learning that effectively separates physically meaningful signals from a complex or structured background. Abstract: Significant challenges exist in efficient data analysis of most advanced experimental and observational techniques because the collected signals often include unwanted contributions--such as background and signal distortions--that can obscure the physically relevant information of interest. To address this, we have developed a self-supervised machine-learning approach for source separation using a dual implicit neural representation framework that jointly trains two neural networks: one for approximating distortions of the physical signal of interest and the other for learning the effective background contribution. Our method learns directly from the raw data by minimizing a reconstruction-based loss function without requiring labeled data or pre-defined dictionaries. We demonstrate the effectiveness of our framework by considering a challenging case study involving large-scale simulated as well as experimental momentum-energy-dependent inelastic neutron scattering data in a four-dimensional parameter space, characterized by heterogeneous background contributions and unknown distortions to the target signal. The method is found to successfully separate physically meaningful signals from a complex or structured background even when the signal characteristics vary across all four dimensions of the parameter space. An analytical approach that informs the choice of the regularization parameter is presented. Our method offers a versatile framework for addressing source separation problems across diverse domains, ranging from superimposed signals in astronomical measurements to structural features in biomedical image reconstructions.

[383] From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving

Fabian Konstantinidis,Ariel Dallari Guerreiro,Raphael Trumpp,Moritz Sackmann,Ulrich Hofmann,Marco Caccamo,Christoph Stiller

Main category: cs.CV

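TL;DR: This paper analyzes several joint motion prediction approaches and compares them in terms of prediction accuracy, multi-modality, and inference efficiency, aiming to give a clearer research direction for motion prediction in automated driving.

Details Motivation: Safe and efficient operation of automated vehicles in dynamic environments requires accurate motion prediction of surrounding traffic participants. Existing approaches differ in problem formulation and implementation details, which makes their performance hard to compare directly. Method: The paper systematically investigates different approaches to joint motion prediction, including post-processing of marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. Result: By evaluating each approach, the paper shows their respective strengths and limitations and provides several prediction examples for reference. Conclusion: The paper summarizes the strengths and weaknesses of the different joint motion prediction approaches in terms of prediction accuracy, multi-modality, and inference efficiency, and provides a comprehensive analysis of each approach. Abstract: Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent's future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at https://frommarginaltojointpred.github.io/.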

[384] SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation

Jiahao Zhu,Zixuan Chen,Guangcong Wang,Xiaohua Xie,Yi Zhou

Main category: cs.CV

TL;DR: SegmentDreamer addresses the imbalance issue in CD-based methods for text-to-3D generation by using SCTD, resulting in superior visual quality and efficient generation.

Details Motivation: CD-based text-to-3D generation methods suffer from improper conditional guidance due to imbalance between self-consistency and cross-consistency. Method: Segmented Consistency Trajectory Distillation (SCTD) and a distillation pipeline for swift and stable generation. Result: High-fidelity text-to-3D generation with improved visual quality and tighter upper bound on distillation error. Conclusion: SegmentDreamer outperforms state-of-the-art methods in visual quality and enables high-fidelity 3D asset creation through 3D Gaussian Splatting. Abstract: Recent advancements in text-to-3D generation improve the visual quality of Score Distillation Sampling (SDS) and its variants by directly connecting Consistency Distillation (CD) to score distillation. However, due to the imbalance between self-consistency and cross-consistency, these CD-based methods inherently suffer from improper conditional guidance, leading to sub-optimal generation results. To address this issue, we present SegmentDreamer, a novel framework designed to fully unleash the potential of consistency models for high-fidelity text-to-3D generation. Specifically, we reformulate SDS through the proposed Segmented Consistency Trajectory Distillation (SCTD), effectively mitigating the imbalance issues by explicitly defining the relationship between self- and cross-consistency. Moreover, SCTD partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories and ensures consistency within each segment, which can theoretically provide a significantly tighter upper bound on distillation error. Additionally, we propose a distillation pipeline for a more swift and stable generation. Extensive experiments demonstrate that our SegmentDreamer outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation through 3D Gaussian Splatting (3DGS).
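The abstract states that SCTD partitions the PF-ODE trajectory into multiple sub-trajectories and enforces consistency within each segment. As a rough illustration of the partitioning step only, the sketch below splits a time interval into equal segments and looks up the segment containing a given timestep. The uniform split, the function name `segment_bounds`, and its parameters are assumptions, and the sketch does not reproduce the SCTD loss itself.

```python
import numpy as np

def segment_bounds(t, t_max, num_segments):
    """Return (start, end) of the sub-trajectory that contains timestep t.

    Assumes a uniform split of [0, t_max] into num_segments pieces;
    the paper may choose segment boundaries differently.
    """
    edges = np.linspace(0.0, t_max, num_segments + 1)
    # Index of the last edge that is <= t, clipped so t == t_max
    # falls into the final segment.
    idx = int(np.clip(np.searchsorted(edges, t, side="right") - 1,
                      0, num_segments - 1))
    return float(edges[idx]), float(edges[idx + 1])

# With 4 segments on [0, 1], timestep 0.3 lies in [0.25, 0.5].
start, end = segment_bounds(0.3, 1.0, 4)
```

Under this reading, a consistency objective would compare points only against others within the same `(start, end)` interval instead of mapping every point to a single trajectory endpoint.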

[385] Spatio-Temporal LLM: Reasoning about Environments and Actions

Haozhen Zheng,Beitong Tian,Mingyuan Wu,Zhenggang Tang,Klara Nahrstedt,Alex Schwing

Main category: cs.CV

TL;DR: This paper introduces ST-LLM, enhancing spatio-temporal understanding in MLLMs through a new dataset and model design.

Details Motivation: MLLMs struggle with prompts requiring holistic spatio-temporal understanding, which is critical for real-world applications. Method: Developed the REA dataset and designed ST-LLM with enhanced spatial and temporal projectors. Result: ST-LLM achieves significantly better performance on the REA dataset compared to existing methods. Conclusion: The proposed ST-LLM significantly improves spatio-temporal understanding in MLLMs for real-world agent operations. Abstract: Despite the significant recent progress of Multimodal Large Language Models (MLLMs), MLLMs still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected "Reasoning about Environments and Actions" (REA) dataset, we show that recent methods indeed struggle to correctly answer the prompts. To improve, we develop a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at https://zoezheng126.github.io/STLLM-website/.

[386] Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

Chun-Hsiao Yeh,Yilin Wang,Nanxuan Zhao,Richard Zhang,Yuheng Li,Yi Ma,Krishna Kumar Singh

Main category: cs.CV

TL;DR: X-Planner, a Multimodal Large Language Model-based planning system, effectively bridges user intent with editing model capabilities by systematically decomposing complex instructions into simpler sub-instructions.

Details Motivation: Current diffusion-based image editing models struggle with interpreting complex, indirect instructions, suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. Method: X-Planner employs chain-of-thought reasoning to break down complex instructions into simpler, clear sub-instructions. For each sub-instruction, it automatically generates precise edit types and segmentation masks. A novel automated pipeline is also proposed for generating large-scale data to train X-Planner. Result: X-Planner achieves state-of-the-art results on both existing benchmarks and a newly introduced complex editing benchmark. Conclusion: X-Planner provides an effective solution for text-guided image editing tasks, particularly those involving complex and indirect instructions, while ensuring localized, identity-preserving edits without manual intervention. Abstract: Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.

[387] Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations

Xiang Xu,Lingdong Kong,Song Wang,Chuanwei Zhou,Qingshan Liu

Main category: cs.CV

TL;DR: LiMA enhances LiDAR representation learning through long-term memory aggregation, improving performance on autonomous driving perception tasks.

Details Motivation: Existing LiDAR representation strategies overlook inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. The goal is to reduce reliance on costly human annotations by leveraging large-scale datasets. Method: LiMA introduces three components: Cross-View Aggregation, Long-Term Feature Propagation, and Cross-Sequence Memory Alignment to capture spatiotemporal cues in LiDAR sequences. Result: Extensive experiments on mainstream LiDAR-based perception benchmarks show significant improvements in both LiDAR semantic segmentation and 3D object detection. Conclusion: LiMA significantly improves LiDAR semantic segmentation and 3D object detection by capturing long-term temporal correlations and maintains high pretraining efficiency without additional computational overhead during downstream tasks. Abstract: LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. 
LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has been made publicly accessible for future research.