Table of Contents
cs.CL
[1] Outcome-Based Education: Evaluating Students' Perspectives Using Transformer
Shuvra Smaran Das,Anirban Saha Anik,Md Kishor Morol,Mohammad Sakib Mahmood
Main category: cs.CL
TL;DR: This paper uses DistilBERT and LIME to analyze student feedback, improving sentiment classification and offering interpretable insights that align with Outcome-Based Education goals.
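To make the idea of term-level explanations concrete, here is a minimal sketch of word attribution by occlusion: drop one word at a time and see how the predicted score moves. This is a simpler stand-in for LIME, which instead fits a local surrogate model over many random perturbations. The `toy_sentiment` scorer and its word lists are hypothetical placeholders for a fine-tuned classifier such as DistilBERT; none of this is the authors' code.

```python
def toy_sentiment(text):
    """Hypothetical stand-in for a fine-tuned classifier: positive score in [0, 1]."""
    positive = {"clear", "helpful", "engaging"}
    negative = {"confusing", "boring", "rushed"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    # Clamp the lexicon score to [-1, 1] and map it to [0, 1].
    return 0.5 + 0.5 * max(-1.0, min(1.0, score / 2))


def word_importance(text, predict=toy_sentiment):
    """Score each word by how much the prediction drops when that word is removed."""
    words = text.split()
    base = predict(text)
    return [
        (w, base - predict(" ".join(words[:i] + words[i + 1:])))
        for i, w in enumerate(words)
    ]


feedback = "lectures were clear but labs felt rushed"
for word, delta in word_importance(feedback):
    print(f"{word:10s} {delta:+.2f}")
```

A positive value means the word pushed the score toward positive sentiment; a negative value means it pushed the score down.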
Details
Motivation: The motivation stems from the need to improve educational outcomes in line with Outcome-Based Education (OBE) by leveraging advanced NLP techniques to better understand student experiences and sentiments. Method: Transformer-based models, specifically DistilBERT, were implemented to analyze an NLP dataset containing student feedback. Additionally, LIME was applied to interpret model predictions and understand the influence of key terms on sentiment. Result: The approach outperformed other machine learning models by leveraging transformers' deep understanding of language context, resulting in more accurate sentiment classification across multiple metrics. The integration of LIME also provided interpretable insights into model behavior. Conclusion: The study concludes that combining transformer models with LIME explanations creates a powerful framework for analyzing student feedback, which aligns closely with OBE principles and enhances educational practices through data-driven insights. Abstract: Outcome-Based Education (OBE) emphasizes the development of specific competencies through student-centered learning. In this study, we reviewed the importance of OBE and implemented transformer-based models, particularly DistilBERT, to analyze an NLP dataset that includes student feedback. Our objective is to assess and improve educational outcomes. Our approach is better than other machine learning models because it uses the transformer's deep understanding of language context to classify sentiment better, giving better results across a wider range of metrics. Our work directly contributes to OBE's goal of achieving measurable outcomes by facilitating the identification of patterns in student learning experiences. We have also applied LIME (local interpretable model-agnostic explanations) to make sure that model predictions are clear. This gives us understandable information about how key terms affect sentiment.
Our findings indicate that the combination of transformer models and LIME explanations results in a strong and straightforward framework for analyzing student feedback. This aligns more closely with the principles of OBE and ensures the improvement of educational practices through data-driven insights.
[2] Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs
Xiang Li,Chong Zhang,Jia Wang,Fangyu Wu,Yushi Li,Xiaobo Jin
Main category: cs.CL
TL;DR: This paper proposes an efficient method called Adversarial Prompt Distillation to enable small models to jailbreak large language models, highlighting security vulnerabilities and offering new directions for research.
Details
Motivation: Current jailbreak attack methods face issues like low efficiency, high computational cost, and limited adaptability, which hinder their effectiveness against rapidly evolving LLMs and defense strategies. Method: The paper uses a combination of masked language modeling, reinforcement learning, and dynamic temperature control through a prompt generation and distillation approach. Result: The experimental results demonstrate the superiority of the proposed method in attack success rate and harm, as well as its resource efficiency and cross-model adaptability. Conclusion: This paper concludes that the proposed Adversarial Prompt Distillation method effectively enables small language models to conduct jailbreaking attacks on large language models, revealing vulnerabilities and providing new insights for security research. Abstract: Attacks on large language models (LLMs) in jailbreaking scenarios raise many security and ethical issues. Current jailbreak attack methods face problems such as low efficiency, high computational cost, and poor cross-model adaptability and versatility, which make it difficult to cope with the rapid development of LLMs and new defense strategies. Our work proposes Adversarial Prompt Distillation, which combines masked language modeling, reinforcement learning, and dynamic temperature control through a prompt generation and distillation method. It enables small language models (SLMs) to carry out jailbreak attacks on mainstream LLMs. The experimental results verify the superiority of the proposed method in terms of attack success rate and harm, and reflect its resource efficiency and cross-model adaptability. This research explores the feasibility of distilling the jailbreak ability of LLMs to SLMs, reveals the models' vulnerability, and provides a new idea for LLM security research.
[3] GTA: Grouped-head latenT Attention
Luoyang Sun,Jiwen Jiang,Cheng Deng,Xinjian Wu,Haifeng Zhang,Lei Chen,Lionel Ni,Jun Wang
Main category: cs.CL
TL;DR: This paper proposes a new attention mechanism called Grouped-Head Latent Attention (GTA) to reduce memory usage and computational complexity in large language models while maintaining performance.
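The shared attention map idea described in this entry can be sketched in a few lines: compute one causal attention map from a single query/key head and reuse it for every value head in the group, so only one set of keys needs caching for that group. This is a minimal pure-Python illustration under assumed shapes and names (`grouped_shared_attention`, `value_heads`); it omits GTA's nonlinear value decoder and latent value compression and is not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Causal scaled dot-product attention weights for one query/key head."""
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        # Token i may only attend to positions 0..i.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys[: i + 1]]
        weights.append(softmax(scores) + [0.0] * (len(keys) - i - 1))
    return weights


def apply_map(weights, values):
    """Weighted sum of value vectors for each query position."""
    dim = len(values[0])
    return [[sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)] for row in weights]


def grouped_shared_attention(queries, keys, value_heads):
    """Compute the attention map once and reuse it for every value head in the group."""
    shared = attention_map(queries, keys)
    return [apply_map(shared, values) for values in value_heads]


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v_heads = [[[1.0], [2.0]], [[10.0], [20.0]]]
outputs = grouped_shared_attention(q, k, v_heads)
print(len(outputs), "value heads served by one attention map")
```

In standard multi-head attention each head would compute its own map; here the score computation and the key storage are paid once per group.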
Details
Motivation: Attention mechanisms have substantial computational and memory overhead, posing challenges for optimizing efficiency and performance of large language models. This is mainly due to the rapid scaling of KV cache and attention computations with text length. Method: The paper proposes Grouped-Head Latent Attention (GTA), which includes a shared attention map mechanism that reuses attention scores across multiple heads and a nonlinear value decoder with learned projections that compresses the value cache into a latent space. Result: GTA cuts attention computation FLOPs by up to 62.5% versus Grouped-Query Attention and shrinks the KV cache by up to 70%, without the extra overhead of Multi-Head Latent Attention. Conclusion: GTA achieves a 2x increase in end-to-end inference speed, with prefill benefiting from reduced computational cost and decoding benefiting from the smaller cache footprint. Abstract: Attention mechanisms underpin the success of large language models (LLMs), yet their substantial computational and memory overhead poses challenges for optimizing efficiency and performance. A critical bottleneck arises as KV cache and attention computations scale rapidly with text length, challenging deployment on hardware with limited computational and memory resources. We observe that attention mechanisms exhibit substantial redundancy, since the KV cache can be significantly compressed and attention maps across heads display high similarity, revealing that much of the computation and storage is unnecessary. Leveraging these insights, we propose Grouped-Head Latent Attention (GTA), a novel attention mechanism that reduces memory usage and computational complexity while maintaining performance.
GTA comprises two components: (1) a shared attention map mechanism that reuses attention scores across multiple heads, decreasing the key cache size; and (2) a nonlinear value decoder with learned projections that compresses the value cache into a latent space, further cutting memory needs. GTA cuts attention computation FLOPs by up to 62.5% versus Grouped-Query Attention and shrinks the KV cache by up to 70%, all while avoiding the extra overhead of Multi-Head Latent Attention to improve LLM deployment efficiency. Consequently, GTA models achieve a 2x increase in end-to-end inference speed, with prefill benefiting from reduced computational cost and decoding benefiting from the smaller cache footprint.
[4] AI-Generated Game Commentary: A Survey and a Datasheet Repository
Qirui Zheng,Xingbo Wang,Keyuan Cheng,Yunlong Lu,Wenxin Li
Main category: cs.CL
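TL;DR: This paper introduces a general framework for AI-generated game commentary (AIGGC), surveys 45 related datasets and methods, and summarizes evaluation metrics, laying a foundation for future research.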
Details
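Motivation: AI-generated game commentary (AIGGC) has market potential and inherent technical challenges, and the task places high demands on language models, including factual accuracy, logical reasoning, expressive text generation, generation speed, and context management. A comprehensive framework and datasets are therefore needed to advance the field. Method: The paper proposes a general framework for AIGGC and systematically surveys 45 existing game commentary datasets and related methods. It also classifies and compares commonly used evaluation metrics and provides a structured datasheet in the appendix summarizing the core attributes of these datasets. Result: The paper presents a general framework for AIGGC and systematically reviews 45 game commentary datasets and methods, categorized by the key problems they address. It also organizes commonly used evaluation metrics and provides a publicly available datasheet to support future AIGGC research. Conclusion: The paper provides a general framework for AI-generated game commentary (AIGGC), comprehensively surveys existing datasets and methods, and organizes evaluation metrics, supporting future research and benchmarking. Abstract: AI-Generated Game Commentary (AIGGC) has gained increasing attention due to its market potential and inherent technical challenges. As a comprehensive multimodal Natural Language Processing (NLP) task, AIGGC imposes substantial demands on language models, including factual accuracy, logical reasoning, expressive text generation, generation speed, and context management. In this paper, we introduce a general framework for AIGGC and present a comprehensive survey of 45 existing game commentary datasets and methods according to the key challenges they aim to address in this domain. We further classify and compare various evaluation metrics commonly used in this domain. To support future research and benchmarking, we also provide a structured datasheet summarizing the essential attributes of these datasets in the appendix, which is also publicly available in an open repository.
[5] Semantic uncertainty in advanced decoding methods for LLM generation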
Darius Foodeei,Simin Fan,Martin Jaggi
Main category: cs.CL
TL;DR: This paper explores how decoding methods like chain-of-thought and speculative sampling affect the diversity and reliability of large language model outputs, revealing that structured approaches can improve performance without sacrificing diversity.
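One common way to quantify the semantic uncertainty this entry refers to is to sample several outputs, group them by meaning, and take the entropy over the groups. The sketch below assumes that formulation; the abstract does not spell out the exact measure used. The `normalize` function is a crude, hypothetical stand-in for a real semantic-equivalence check (for example, an entailment model).

```python
import math
from collections import Counter


def normalize(answer):
    """Crude stand-in for semantic clustering: case- and punctuation-insensitive match."""
    return " ".join(answer.lower().strip(" .!?").split())


def cluster_entropy(samples, key=normalize):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(key(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


agree = ["Paris", "paris.", "Paris"]
split = ["Paris", "Lyon"]
print(cluster_entropy(agree))  # all samples share one meaning
print(cluster_entropy(split))  # two equally likely meanings
```

Samples that differ only in surface form fall into one cluster and contribute no entropy, which is what separates this from token-level entropy.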
Details
Motivation: To understand how different decoding methods affect the diversity and reliability of large language model outputs, especially in terms of semantic uncertainty. Method: Experiments on question answering, summarization, and code generation tasks were conducted to analyze the impact of decoding strategies on output diversity and reliability. Result: CoT decoding showed higher semantic diversity with lower predictive entropy and improved code generation performance. Speculative sampling excelled in summarization tasks with high ROUGE scores and moderate diversity. Conclusion: The study concludes that structured decoding methods like CoT and speculative sampling can enhance semantic exploration while maintaining or improving output quality, challenging the conventional trade-off assumptions between diversity and accuracy. Abstract: This study investigates semantic uncertainty in large language model (LLM) outputs across different decoding methods, focusing on emerging techniques like speculative sampling and chain-of-thought (CoT) decoding. Through experiments on question answering, summarization, and code generation tasks, we analyze how different decoding strategies affect both the diversity and reliability of model outputs. Our findings reveal that while CoT decoding demonstrates higher semantic diversity, it maintains lower predictive entropy, suggesting that structured exploration can lead to more confident and accurate outputs. This is evidenced by a 48.8% improvement in code generation Pass@2 rates, despite lower alignment with reference solutions. For summarization tasks, speculative sampling proved particularly effective, achieving superior ROUGE scores while maintaining moderate semantic diversity. 
Our results challenge conventional assumptions about trade-offs between diversity and accuracy in language model outputs, demonstrating that properly structured decoding methods can increase semantic exploration while maintaining or improving output quality. These findings have significant implications for deploying language models in practical applications where both reliability and diverse solution generation are crucial.
[6] Mercury: Ultra-Fast Language Models Based on Diffusion
Inception Labs,Samar Khanna,Siddhant Kharbanda,Shufan Li,Harshit Varma,Eric Wang,Sawyer Birnbaum,Ziyang Luo,Yanis Miraoui,Akash Palrecha,Stefano Ermon,Aditya Grover,Volodymyr Kuleshov
Main category: cs.CL
TL;DR: Mercury Coder is a diffusion-based LLM for coding tasks that offers exceptional speed and quality, setting new benchmarks in performance.
Details
Motivation: To develop a commercial-scale model that excels in both speed and quality for coding tasks. Method: The models use the Transformer architecture and are trained to predict multiple tokens in parallel, based on diffusion techniques. Result: Mercury Coder achieves state-of-the-art throughputs of 1109 tokens/sec (Mini) and 737 tokens/sec (Small), outperforming other models by up to 10x while maintaining comparable quality. Conclusion: Mercury Coder, a new generation of diffusion-based large language models for coding applications, demonstrates superior performance in speed and quality. Abstract: We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion. These models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. In this report, we detail Mercury Coder, our first set of diffusion LLMs designed for coding applications. Currently, Mercury Coder comes in two sizes: Mini and Small. These models set a new state-of-the-art on the speed-quality frontier. Based on independent evaluations conducted by Artificial Analysis, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs and outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality. We discuss additional results on a variety of code benchmarks spanning multiple languages and use-cases as well as real-world validation by developers on Copilot Arena, where the model currently ranks second on quality and is the fastest model overall. We also release a public API at https://platform.inceptionlabs.ai/ and free playground at https://chat.inceptionlabs.ai
[7] PRAISE: Enhancing Product Descriptions with LLM-Driven Structured Insights
Adnan Qidwai,Srija Mukhopadhyay,Prerana Khatiwada,Dan Roth,Vivek Gupta
Main category: cs.CL
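TL;DR: PRAISE is a system that uses large language models to automatically analyze and structure product reviews and descriptions, aiming to improve the quality and trustworthiness of e-commerce product catalogs.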
Details
Motivation: Accurate and complete product descriptions are crucial for e-commerce, but seller-provided information often falls short, and customer reviews, while offering valuable details, are laborious to sift through manually. Method: Large language models (LLMs) automatically extract, compare, and structure insights from customer reviews and seller descriptions, and an intuitive interface identifies missing, contradictory, or partially matching details. Result: PRAISE presents discrepancies in a clear, structured format with supporting evidence from reviews, helping sellers improve the clarity and persuasiveness of their listings and helping buyers better assess product quality and reliability. Conclusion: PRAISE has the potential to significantly improve the quality and trustworthiness of e-commerce product catalogs. Abstract: Accurate and complete product descriptions are crucial for e-commerce, yet seller-provided information often falls short. Customer reviews offer valuable details but are laborious to sift through manually. We present PRAISE: Product Review Attribute Insight Structuring Engine, a novel system that uses Large Language Models (LLMs) to automatically extract, compare, and structure insights from customer reviews and seller descriptions. PRAISE provides users with an intuitive interface to identify missing, contradictory, or partially matching details between these two sources, presenting the discrepancies in a clear, structured format alongside supporting evidence from reviews. This allows sellers to easily enhance their product listings for clarity and persuasiveness, and buyers to better assess product reliability. Our demonstration showcases PRAISE's workflow, its effectiveness in generating actionable structured insights from unstructured reviews, and its potential to significantly improve the quality and trustworthiness of e-commerce product catalogs.
[8] Towards Safety Evaluations of Theory of Mind in Large Language Models
Tatsuhiro Aoshima,Mitsuaki Akiyama
Main category: cs.CL
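TL;DR: This paper argues that safety evaluation of LLMs should measure theory of mind capabilities, and finds that across a series of open-weight LLMs these capabilities have not improved in step with reading comprehension.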
Details
Motivation: Reports suggest that LLMs may act to disable oversight mechanisms and respond deceptively, so assessing the risk of such behavior requires investigating whether it stems from covert, intentional processes within the model. Method: The authors review existing research on theory of mind, identify the perspectives and tasks relevant to safety evaluation, and analyze developmental trends across a series of open-weight LLMs. Result: LLMs have improved in reading comprehension, but their theory of mind capabilities have not shown comparable development. Conclusion: The paper presents the current state of safety evaluation with respect to LLMs' theory of mind and discusses remaining challenges for future work. Abstract: As the capabilities of large language models (LLMs) continue to advance, the importance of rigorous safety evaluation is becoming increasingly evident. Recent concerns within the realm of safety assessment have highlighted instances in which LLMs exhibit behaviors that appear to disable oversight mechanisms and respond in a deceptive manner. For example, there have been reports suggesting that, when confronted with information unfavorable to their own persistence during task execution, LLMs may act covertly and even provide false answers to questions intended to verify their behavior. To evaluate the potential risk of such deceptive actions toward developers or users, it is essential to investigate whether these behaviors stem from covert, intentional processes within the model. In this study, we propose that it is necessary to measure the theory of mind capabilities of LLMs. We begin by reviewing existing research on theory of mind and identifying the perspectives and tasks relevant to its application in safety evaluation. Given that theory of mind has been predominantly studied within the context of developmental psychology, we analyze developmental trends across a series of open-weight LLMs. Our results indicate that while LLMs have improved in reading comprehension, their theory of mind capabilities have not shown comparable development. Finally, we present the current state of safety evaluation with respect to LLMs' theory of mind, and discuss remaining challenges for future work.
[9] Cash or Comfort? How LLMs Value Your Inconvenience
Mateusz Cedro,Timour Ichmoukhamedov,Sofie Goethals,Yifan He,James Hinns,David Martens
Main category: cs.CL
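TL;DR: This paper studies how large language models make financial trade-offs in situations involving user discomfort and finds their decisions unstable and unreasonable, so it advises against directly using current LLMs in such personal decision-making scenarios.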
Details
Motivation: Although large language models perform well on many technical tasks, their behavior in personal decision-making remains unclear, especially when financial rewards conflict with user comfort. Method: The study quantifies the prices that multiple LLMs assign to a series of user discomforts (such as additional walking, waiting, hunger, and pain) to examine how AI assistants behave when financial rewards conflict with user comfort. Result: Several key concerns emerged: (1) responses vary widely between LLMs; (2) a single LLM is fragile to minor changes in prompt phrasing; (3) LLMs can accept extremely low rewards for major inconveniences; and (4) LLMs can reject monetary gains even when no discomfort is involved. Conclusion: Current large language models show significant problems in valuing human discomfort, including large variance between models, fragility to prompt phrasing, acceptance of unreasonably low rewards, and rejection of monetary gains that carry no discomfort. These findings underline the need to scrutinize the suitability of LLMs as decision-making assistants in cash-versus-comfort trade-offs. Abstract: Large Language Models (LLMs) are increasingly proposed as near-autonomous artificial intelligence (AI) agents capable of making everyday decisions on behalf of humans. Although LLMs perform well on many technical tasks, their behaviour in personal decision-making remains less understood. Previous studies have assessed their rationality and moral alignment with human decisions. However, the behaviour of AI assistants in scenarios where financial rewards are at odds with user comfort has not yet been thoroughly explored. In this paper, we tackle this problem by quantifying the prices assigned by multiple LLMs to a series of user discomforts: additional walking, waiting, hunger and pain. We uncover several key concerns that strongly question the prospect of using current LLMs as decision-making assistants: (1) a large variance in responses between LLMs, (2) within a single LLM, responses show fragility to minor variations in prompt phrasing (e.g., reformulating the question in the first person can considerably alter the decision), (3) LLMs can accept unreasonably low rewards for major inconveniences (e.g., 1 Euro to wait 10 hours), and (4) LLMs can reject monetary gains where no discomfort is imposed (e.g., 1,000 Euro to wait 0 minutes). These findings emphasize the need for scrutiny of how LLMs value human inconvenience, particularly as we move toward applications where such cash-versus-comfort trade-offs are made on users' behalf.
[10] Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study
Danielle R. Thomas,Conrad Borchers,Jionghao Lin,Sanjit Kakarla,Shambhavi Bhushan,Erin Gatz,Shivang Gupta,Ralph Abboud,Kenneth R. Koedinger
Main category: cs.CL
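TL;DR: The study shows that generative AI can efficiently and accurately identify key teaching moves in real math tutoring, offering a feasible approach to large-scale educational assessment.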
Details
Motivation: Identifying and studying, at scale and from audio transcriptions, the tutoring actions most associated with student learning is an open research problem, and generative AI may offer a solution. Method: GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM were used to analyze 50 real transcripts of college-student tutors remotely tutoring middle school students in mathematics, assessing two tutoring skills: delivering effective praise and responding to student math errors. Result: All models reliably detected relevant situations, such as tutors providing praise to students (94-98% accuracy) and students making math errors (82-88% accuracy), and effectively evaluated whether tutors followed best practices, aligning closely with human judgments (83-89% and 73-77%, respectively). Conclusion: The study proposes a cost-effective prompting strategy, discusses practical implications of using large language models to support scalable assessment in authentic settings, and contributes to reproducibility and new directions for research in AI-supported learning. Abstract: Tutoring improves student achievement, but identifying and studying what tutoring actions are most associated with student learning at scale based on audio transcriptions is an open research problem. The present study investigates the feasibility and scalability of using generative AI to identify and evaluate specific tutor moves in real-life math tutoring. We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors' application of two tutor skills: delivering effective praise and responding to student math errors. All models reliably detected relevant situations, for example, tutors providing praise to students (94-98% accuracy) and a student making a math error (82-88% accuracy), and effectively evaluated the tutors' adherence to tutoring best practices, aligning closely with human judgments (83-89% and 73-77%, respectively). We propose a cost-effective prompting strategy and discuss practical implications for using large language models to support scalable assessment in authentic settings. This work further contributes LLM prompts to support reproducibility and research in AI-supported learning.
[11] UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making
Jinhao Duan,James Diffenderfer,Sandeep Madireddy,Tianlong Chen,Bhavya Kailkhura,Kaidi Xu
Main category: cs.CL
TL;DR: This paper introduces UProp, a novel framework for quantifying uncertainty in multi-step LLM decision-making, which outperforms existing single-turn methods and provides insights into uncertainty propagation.
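The entry's method rests on a standard identity: mutual information is the expected value of pointwise mutual information, I(X;Y) = E[PMI(x;y)] with PMI(x;y) = log p(x,y) / (p(x) p(y)). The sketch below checks that identity on toy joint distributions. The variable roles (x as a preceding decision, y as the current decision) and the probabilities are illustrative assumptions, and this is not the UProp estimator itself.

```python
import math


def pmi(joint, x, y):
    """Pointwise mutual information log p(x, y) / (p(x) p(y)) for one outcome pair."""
    px = sum(p for (a, _), p in joint.items() if a == x)
    py = sum(p for (_, b), p in joint.items() if b == y)
    return math.log(joint[(x, y)] / (px * py))


def mutual_information(joint):
    """Mutual information computed as the expectation of PMI under the joint."""
    return sum(p * pmi(joint, x, y) for (x, y), p in joint.items() if p > 0)


# x: a preceding decision, y: the current decision (toy, hypothetical distributions).
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
coupled = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))  # nothing inherited from the earlier step
print(mutual_information(coupled))      # current decision fully determined by the earlier one
```

In practice the joint distribution is not available, so an estimator would average PMI over sampled trajectories instead of summing over a known table.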
Details
Motivation: Current LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios underexplored. This limits the reliability of LLMs in safety-critical applications involving sequential decision-making. Method: The paper proposes UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of Mutual Information (MI) to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). Result: UProp was evaluated on extensive multi-step decision-making benchmarks like AgentBench and HotpotQA using state-of-the-art LLMs such as GPT-4.1 and DeepSeek-V3, demonstrating its superior performance compared to existing methods. Conclusion: UProp significantly outperforms existing single-turn UQ baselines and offers effective estimation of extrinsic uncertainty in multi-step decision-making scenarios. Abstract: As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is the focus of existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions.
We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated over extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Code will be available at https://github.com/jinhaoduan/UProp.
[12] Beyond the Link: Assessing LLMs' ability to Classify Political Content across Global Media
Alberto Martinez-Serra,Alejandro De La Fuente,Nienke Viescher,Ana S. Cardenal
Main category: cs.CL
TL;DR: This paper explores how large language models (LLMs) can classify political content (PC) using only URLs across multiple countries and languages. Results show URLs are effective for PC classification, providing an accuracy-cost balance, though some contextual limitations exist.
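To see why a URL alone can carry enough signal for this kind of classification, it helps to look at what a news URL's path contains once it is split into words. The sketch below pulls slug tokens out of a URL with the standard library and drops them into a prompt. The example URL, the `build_prompt` wording, and the list of ignored file extensions are all hypothetical; the paper's actual prompts and preprocessing are not given in this entry.

```python
import re
from urllib.parse import unquote, urlparse

IGNORED = {"html", "htm", "php", "aspx"}  # assumed list of file-extension tokens


def url_tokens(url):
    """Split the URL path into lowercase word tokens, dropping numbers and extensions."""
    path = unquote(urlparse(url).path)
    tokens = re.split(r"[\W_]+", path)  # \W is Unicode-aware, so accented letters survive
    return [t.lower() for t in tokens if t and not t.isdigit() and t.lower() not in IGNORED]


def build_prompt(url):
    """Hypothetical zero-shot prompt for a political / non-political label."""
    words = " ".join(url_tokens(url))
    return (
        "Classify the news article behind this URL as POLITICAL or NON-POLITICAL.\n"
        f"URL: {url}\n"
        f"Words found in the URL path: {words}\n"
        "Answer with one label."
    )


example = "https://example.com/politics/2024/05/election-results-france.html"
print(url_tokens(example))
print(build_prompt(example))
```

URLs with opaque identifiers instead of slugs would yield few or no tokens, which is one plausible source of the contextual limitations the entry mentions.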
Details
Motivation: While LLMs are increasingly used in political science for analyzing digital media, their effectiveness in classifying political content solely from URLs has not been thoroughly explored. This study aims to bridge this gap by evaluating whether URL-based classification using LLMs can approximate full-text analysis accurately across diverse linguistic and national contexts. Method: This research evaluated the performance of large language models (LLMs) like GPT, Llama, Mistral, Deepseek, Qwen, and Gemma in classifying political content from URLs and article texts across five countries (France, Germany, Spain, UK, US) and multiple languages. The results were compared with human-labelled data and traditional supervised machine learning techniques. Result: Findings indicate that LLMs can accurately identify political content from URLs, demonstrating that URL-level analysis is a viable and cost-efficient approximation for full-text analysis across different languages and countries. However, contextual limitations were also identified. Conclusion: The study concludes that URLs can effectively embed most news content for classifying political content (PC) with LLMs, offering a cost-effective alternative to full-text analysis while considering methodological recommendations for future use in political science. Abstract: The use of large language models (LLMs) is becoming common in the context of political science, particularly in studies that analyse individuals' use of digital media. However, while previous research has demonstrated LLMs' ability at labelling tasks, the effectiveness of using LLMs to classify political content (PC) from just URLs is not yet well explored. The work presented in this article bridges this gap by evaluating whether LLMs can accurately identify PC vs. non-PC from both the article text and the URLs from five countries (France, Germany, Spain, the UK, and the US) and different languages.
Using cutting-edge LLMs like GPT, Llama, Mistral, Deepseek, Qwen and Gemma, we measure model performance to assess whether URL-level analysis can be a good approximation for full-text analysis of PC, even across different linguistic and national contexts. Model outputs are compared with human-labelled articles, as well as traditional supervised machine learning techniques, to set a baseline of performance. Overall, our findings suggest the capacity of URLs to embed most of the news content, providing a vital perspective on accuracy-cost balancing. We also account for contextual limitations and suggest methodological recommendations to use LLMs within political science studies.
[13] Breaking the Transcription Bottleneck: Fine-tuning ASR Models for Extremely Low-Resource Fieldwork Languages
Siyu Liang,Gina-Anne Levow
Main category: cs.CL
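TL;DR: This paper compares two multilingual automatic speech recognition models (MMS and XLS-R) on low-resource languages, finding that MMS is suited to extremely small amounts of training data, while XLS-R performs comparably once training data exceeds one hour. It also offers practical guidance for field linguists to ease the transcription bottleneck in language documentation.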
Details
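Motivation: Automatic speech recognition (ASR) has reached impressive accuracy for high-resource languages, but its utility in linguistic fieldwork remains limited. Fieldwork recordings pose unique challenges, including spontaneous speech, environmental noise, and severely constrained datasets from under-documented languages. Method: Two fine-tuned multilingual ASR models (MMS and XLS-R) were benchmarked on five typologically diverse low-resource languages, with training data duration controlled. Result: MMS performs best when training data is extremely scarce, while XLS-R reaches performance comparable to MMS once training data exceeds one hour. The study also provides linguistically grounded analysis, offers practical guidelines for field linguists, and highlights reproducible ASR adaptation approaches to ease the transcription bottleneck in language documentation. Conclusion: MMS is suited to settings with extremely limited training data, while XLS-R shows equivalent performance once training data exceeds one hour; the paper gives field linguists practical guidelines and reproducible ASR adaptation approaches to ease the transcription bottleneck in language documentation. Abstract: Automatic Speech Recognition (ASR) has reached impressive accuracy for high-resource languages, yet its utility in linguistic fieldwork remains limited. Recordings collected in fieldwork contexts present unique challenges, including spontaneous speech, environmental noise, and severely constrained datasets from under-documented languages. In this paper, we benchmark the performance of two fine-tuned multilingual ASR models, MMS and XLS-R, on five typologically diverse low-resource languages with control of training data duration. Our findings show that MMS is best suited when extremely small amounts of training data are available, whereas XLS-R shows parity performance once training data exceed one hour. We provide linguistically grounded analysis to offer insights towards practical guidelines for field linguists, highlighting reproducible ASR adaptation approaches to mitigate the transcription bottleneck in language documentation.
[14] Computational Approaches to Understanding Large Language Model Impact on Writing and Information Ecosystems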
Weixin Liang
Main category: cs.CL
TL;DR: This dissertation explores the impact of large language models on writing practices, highlighting both their potential benefits and challenges such as equity issues and the need for better AI governance.
Details
Motivation: The motivation behind this dissertation is to understand how individuals and institutions are adapting to and engaging with large language models (LLMs), which are rapidly changing the landscape of writing, communication, and creation. Method: The research employs three main methods: analysis of institutional adoption of AI detectors, population-level algorithmic approaches to measure LLM adoption across writing domains, and a large-scale empirical analysis of LLMs' capability to provide manuscript feedback. Result: The study reveals systematic biases introduced by AI detectors disadvantaging non-dominant language varieties, consistent patterns of AI-assisted content across various writing domains, and insights into the potential of LLMs to support researchers facing barriers in accessing manuscript feedback. Conclusion: The dissertation concludes that while LLMs offer significant potential in transforming communication and creation processes, their institutional adoption and use raise critical equity concerns and highlight the need for careful AI governance. Abstract: Large language models (LLMs) have shown significant potential to change how we write, communicate, and create, leading to rapid adoption across society. This dissertation examines how individuals and institutions are adapting to and engaging with this emerging technology through three research directions. First, I demonstrate how the institutional adoption of AI detectors introduces systematic biases, particularly disadvantaging writers of non-dominant language varieties, highlighting critical equity concerns in AI governance. Second, I present novel population-level algorithmic approaches that measure the increasing adoption of LLMs across writing domains, revealing consistent patterns of AI-assisted content in academic peer reviews, scientific publications, consumer complaints, corporate communications, job postings, and international organization press releases. 
Finally, I investigate LLMs' capability to provide feedback on research manuscripts through a large-scale empirical analysis, offering insights into their potential to support researchers who face barriers in accessing timely manuscript feedback, particularly early-career researchers and those from under-resourced settings.
[15] VeriLocc: End-to-End Cross-Architecture Register Allocation via LLM
Lesheng Jin,Zhenyuan Ruan,Haohui Mai,Jingbo Shang
Main category: cs.CL
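TL;DR: Proposes VeriLocc, a framework that combines large language models with formal compiler techniques to achieve efficient, verifiable register allocation across GPU architectures.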
Details
Motivation: Modern GPU hardware evolves rapidly, yet production compilers still rely on hand-tuned register allocation heuristics that must be re-tuned for each hardware generation. Method: A large language model is fine-tuned to translate intermediate representations into target-specific register assignments, using static analysis for cross-architecture normalization and generalization and a verifier-guided regeneration loop to ensure correctness. Result: On matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100, and outperforms rocBLAS by over 10% in runtime. Conclusion: VeriLocc is a framework combining large language models with formal compiler techniques that enables verifiable, generalizable register allocation across GPU architectures and outperforms expert-tuned libraries. Abstract: Modern GPUs evolve rapidly, yet production compilers still rely on hand-crafted register allocation heuristics that require substantial re-tuning for each hardware generation. We introduce VeriLocc, a framework that combines large language models (LLMs) with formal compiler techniques to enable generalizable and verifiable register allocation across GPU architectures. VeriLocc fine-tunes an LLM to translate intermediate representations (MIRs) into target-specific register assignments, aided by static analysis for cross-architecture normalization and generalization and a verifier-guided regeneration loop to ensure correctness. Evaluated on matrix multiplication (GEMM) and multi-head attention (MHA), VeriLocc achieves 85-99% single-shot accuracy and near-100% pass@100. A case study shows that VeriLocc discovers more performant assignments than expert-tuned libraries, outperforming rocBLAS by over 10% in runtime.
[16] Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning
Mingfei Lau,Qian Chen,Yeming Fang,Tingting Xu,Tongzhou Chen,Pavel Golik
Main category: cs.CL
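TL;DR: This paper audits the quality of three multilingual speech datasets and proposes recommendations for improving dataset quality in order to improve downstream model performance.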
Details
Motivation: These datasets have significant quality issues in some languages; addressing them would make the datasets more useful for training and evaluation and improve downstream models. Method: A quality audit was conducted on three widely used multilingual speech datasets (Mozilla Common Voice 17.0, FLEURS, and VoxPopuli), with the quality issues divided into micro-level and macro-level categories for analysis. Result: Macro-level issues are more prevalent in less institutionalized, under-resourced languages; a case analysis of Taiwanese Southern Min (nan_tw) highlights the need for proactive language planning and enhanced data quality control when creating Automatic Speech Recognition (ASR) datasets. Conclusion: The paper proposes guidelines and recommendations for addressing quality issues in speech datasets, emphasizing the need for sociolinguistic awareness when creating robust and reliable speech data resources. Abstract: Our quality audit for three widely used public multilingual speech datasets - Mozilla Common Voice 17.0, FLEURS, and VoxPopuli - shows that in some languages, these datasets suffer from significant quality issues. We believe addressing these issues will make these datasets more useful as training and evaluation sets, and improve downstream models. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the process of Automatic Speech Recognition (ASR) dataset creation. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness in creating robust and reliable speech data resources.
[17] DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning
Yuanhao Wu,Juntong Song,Hanning Zhang,Tong Zhang,Cheng Niu
Main category: cs.CL
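TL;DR: DuaShepherd combines correctness and potential reward signals, improving the performance of large language models on mathematical reasoning tasks and reaching SOTA performance on multiple benchmarks.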
Details
Motivation: To enhance the mathematical reasoning capabilities of large language models (LLMs), the DuaShepherd framework is proposed, combining two complementary reward signals: correctness and potential. Method: An automated pipeline was developed to construct a large-scale reward modeling dataset, and a unified multi-head architecture was explored to train both reward models simultaneously in a multi-task setup. Result: Experiments show that the model combining the two signals performs better across multiple benchmarks, and on MATH500 and ProcessBench in particular it significantly outperforms models that use only a single reward type. Conclusion: By combining correctness and potential reward signals, the DuaShepherd framework achieves SOTA performance under resource constraints, demonstrating its effectiveness in improving LLM mathematical reasoning. Abstract: In this paper, we propose DuaShepherd, a novel reward modeling framework that integrates two complementary reward signals, correctness and potential, to enhance the mathematical reasoning capabilities of Large Language Models (LLMs). While correctness-based signals emphasize identification of stepwise errors, potential-based signals focus on the likelihood of reaching the correct final answer. We developed an automated pipeline for constructing a large-scale reward modeling dataset with both signals. A unified, multi-head architecture was explored to train the two reward models in a multi-task setup, demonstrating benefits from learning both correctness and potential in parallel. By combining these two signals into a compound probability, our model achieves consistent performance improvements across multiple benchmarks. Empirical evaluations on MATH500 and ProcessBench confirm that this combined reward significantly outperforms models trained on either reward type alone, achieving state-of-the-art performance under comparable resource constraints.
[18] Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception
Nitin Venkateswaran,Kevin Tang,Ratree Wayland
Main category: cs.CL
TL;DR: This study demonstrates that self-supervised speech representations can effectively model accent perception by focusing on specific phonological features that influence how accents are perceived.
Details
Motivation: Traditional models of accent perception underestimate the role of gradient variations in phonological features. The study aimed to investigate how current self-supervised learning models encode the variations that influence accent perception. Method: The study used the CSLU Foreign Accented English corpus to extract phonological feature probabilities and pretrained representations from Wav2Vec2-BERT and WavLM. Accent judgments were made by native speakers of American English, and probing analyses and multinomial logistic regression were conducted. Result: Accent strength was best predicted by a subset of the segment's pretrained representation features that emphasize perceptually salient phonological features. There were strong associations between accent strength and distances from American and Indian English baselines. Conclusion: Self-supervised speech representations are valuable for modeling accent perception using interpretable phonological features. Abstract: Traditional models of accent perception underestimate the role of gradient variations in phonological features which listeners rely upon for their accent judgments. We investigate how pretrained representations from current self-supervised learning (SSL) models of speech encode phonological feature-level variations that influence the perception of segmental accent. We focus on three segments: the labiodental approximant, the rhotic tap, and the retroflex stop, which are uniformly produced in the English of native speakers of Hindi as well as other languages in the Indian sub-continent. We use the CSLU Foreign Accented English corpus (Lander, 2007) to extract, for these segments, phonological feature probabilities using Phonet (Vásquez-Correa et al., 2019) and pretrained representations from Wav2Vec2-BERT (Barrault et al., 2023) and WavLM (Chen et al., 2022) along with accent judgments by native speakers of American English. 
Probing analyses show that accent strength is best predicted by a subset of the segment's pretrained representation features, in which perceptually salient phonological features that contrast the expected American English and realized non-native English segments are given prominent weighting. A multinomial logistic regression of pretrained representation-based segment distances from American and Indian English baselines on accent ratings reveals strong associations between the odds of accent strength and distances from the baselines, in the expected directions. These results highlight the value of self-supervised speech representations for modeling accent perception using interpretable phonological features.
[19] AgriCHN: A Comprehensive Cross-domain Resource for Chinese Agricultural Named Entity Recognition
Lingxiao Zeng,Yiqi Tong,Wei Guo,Huarui Wu,Lihao Ge,Yijun Ye,Fuzhen Zhuang,Deqing Wang,Wei Guo,Cheng Chen
Main category: cs.CL
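TL;DR: This paper presents AgriCHN, a new resource for Chinese agricultural named entity recognition comprising 4,040 sentences and 15,799 entity mentions, intended to improve the accuracy of automated agricultural entity annotation.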
Details
Motivation: Mainstream methods perform poorly on this task, and earlier work overlooked the close correlation of agriculture with hydrology and meteorology. Method: 4,040 sentences were carefully curated from a large body of agricultural articles, containing 15,799 agricultural entity mentions across 27 entity categories. Result: Data validation shows that AgriCHN offers better data quality than related resources. Benchmarking with state-of-the-art NER models also indicates that the dataset is highly challenging and valuable for research. Conclusion: AgriCHN is a high-quality Chinese agricultural named entity recognition dataset with rich entity types and fine-grained divisions, showing potential for further research. Abstract: Agricultural named entity recognition is a specialized task focusing on identifying distinct agricultural entities within vast bodies of text, including crops, diseases, pests, and fertilizers. It plays a crucial role in enhancing information extraction from extensive agricultural text resources. However, the scarcity of high-quality agricultural datasets, particularly in Chinese, has resulted in suboptimal performance when employing mainstream methods for this purpose. Most earlier works focus only on annotating agricultural entities while overlooking the profound correlation of agriculture with hydrology and meteorology. To fill this gap, we present AgriCHN, a comprehensive open-source Chinese resource designed to promote the accuracy of automated agricultural entity annotation. The AgriCHN dataset has been meticulously curated from a wealth of agricultural articles, comprising a total of 4,040 sentences and encapsulating 15,799 agricultural entity mentions spanning 27 diverse entity categories. Furthermore, it encompasses entities from hydrology to meteorology, thereby enriching the diversity of entities considered. Data validation reveals that, compared with relevant resources, AgriCHN demonstrates outstanding data quality, attributable to its richer agricultural entity types and more fine-grained entity divisions. A benchmark task has also been constructed using several state-of-the-art neural NER models. Extensive experimental results highlight the significant challenge posed by AgriCHN and its potential for further research.
[20] Mind the Gap: Assessing Wiktionary's Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages
Jonathan Sakunkoo,Annabella Sakunkoo
Main category: cs.CL
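TL;DR: This study uses a neural morphological analyzer to validate Wiktionary's lists of defective verbs, finding that Wiktionary is highly reliable for Italian but that 7% of the lemmata listed as defective in Latin are mislabeled. The work advances computational morphology and points out both the potential and the limitations of crowd-sourced resources in linguistic research.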
Details
Motivation: Addressing morphological defectivity is essential for improving the accuracy of NLP tools in morphologically rich languages. Traditional linguistic resources often lack coverage of morphological gaps, and although Wikipedia and Wiktionary are widely used, their reliability is disputed. Method: The study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora, and uses the large-scale annotated data to validate crowd-sourced lists of defective verbs from Wiktionary. Result: Wiktionary provides a highly reliable account of Italian morphological gaps, but 7% of the Latin lemmata listed as defective show strong corpus evidence of being non-defective. Conclusion: Although wiki resources are valuable sources of linguistic knowledge, they may be limited as definitive sources for less-studied phenomena and languages. By providing scalable tools and methods, the study advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages. Abstract: Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps as such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary often serve as among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the massive annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective. This discrepancy highlights potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. 
By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.
[21] TyphoFormer: Language-Augmented Transformer for Accurate Typhoon Track Forecasting
Lincan Li,Eren Erman Ozguven,Yue Zhao,Guang Wang,Yiqun Xie,Yushun Dong
Main category: cs.CL
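TL;DR: This paper proposes TyphoFormer, a new framework that improves the accuracy of typhoon track forecasting by incorporating natural language processing techniques.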
Details
Motivation: Although Transformer-based models perform well at modeling the temporal dynamics of dense human and vehicle trajectories, they usually lack access to the broader contextual knowledge that could make forecasts of sparse meteorological trajectories, such as typhoon tracks, more reliable. Method: TyphoFormer uses a large language model (LLM) to generate concise textual descriptions from the numerical attributes in the North Atlantic hurricane database, and embeds these descriptions as auxiliary special tokens in the numerical time series input, combining textual and sequential information in a unified Transformer encoder. Result: Experiments on the HURDAT2 benchmark show that TyphoFormer consistently outperforms other state-of-the-art baseline methods. Conclusion: TyphoFormer is a novel framework that improves the accuracy of typhoon track forecasting by integrating natural language descriptions as auxiliary prompts, particularly in challenging scenarios involving nonlinear path shifts and limited historical observations. Abstract: Accurate typhoon track forecasting is crucial for early system warning and disaster response. While Transformer-based models have demonstrated strong performance in modeling the temporal dynamics of dense trajectories of humans and vehicles in smart cities, they usually lack access to broader contextual knowledge that enhances the forecasting reliability of sparse meteorological trajectories, such as typhoon tracks. To address this challenge, we propose TyphoFormer, a novel framework that incorporates natural language descriptions as auxiliary prompts to improve typhoon trajectory forecasting. For each time step, we use a Large Language Model (LLM) to generate concise textual descriptions based on the numerical attributes recorded in the North Atlantic hurricane database. The language descriptions capture high-level meteorological semantics and are embedded as auxiliary special tokens prepended to the numerical time series input. By integrating both textual and sequential information within a unified Transformer encoder, TyphoFormer enables the model to leverage contextual cues that are otherwise inaccessible through numerical features alone. Extensive experiments are conducted on the HURDAT2 benchmark; results show that TyphoFormer consistently outperforms other state-of-the-art baseline methods, particularly under challenging scenarios involving nonlinear path shifts and limited historical observations.
[22] OpusLM: A Family of Open Unified Speech Language Models
Jinchuan Tian,William Chen,Yifan Peng,Jiatong Shi,Siddhant Arora,Shikhar Bharadwaj,Takashi Maekaku,Yusuke Shinohara,Keita Goto,Xiang Yue,Huck Yang,Shinji Watanabe
Main category: cs.CL
TL;DR: This paper presents Open Unified Speech Language Models (OpusLMs), high-performing multi-task speech language models obtained through large-scale pre-training.
Details
Motivation: To develop open, scalable speech language models that improve speech recognition, speech synthesis, and text processing capabilities. Method: The models are initialized from decoder-only text language models and continuously pre-trained on large amounts of speech-text pairs and text-only data. Result: OpusLMs perform strongly on multiple tasks, and the experiments confirm the effectiveness of model size scaling and annealing data selection. Conclusion: OpusLMs are open foundational speech language models with performance comparable or superior to existing SpeechLMs, and they are released with full transparency to facilitate research. Abstract: This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate our OpusLMs achieve comparable (or even superior) performance to existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are all built from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
[23] Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs
Yang Wu,Yifan Zhang,Yiwei Wang,Yujun Cai,Yurong Wu,Yuran Wang,Ning Xu,Jian Cheng
Main category: cs.CL
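TL;DR: The reasoning ability of large language models may depend mainly on explicit answer cues rather than genuine inference; masking the answer significantly degrades performance.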
Details
Motivation: Prior work suggests that the success of large language models may stem from memorized answer-reasoning patterns rather than genuine inference, so the nature of their reasoning mechanism needs closer investigation. Method: A five-level answer-visibility prompt framework is proposed that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis. Result: When answer cues are masked, model performance drops by 26.90% even when complete reasoning chains are provided, showing a strong reliance on explicit answers. Conclusion: The study shows that large language models rely mainly on explicit answer cues rather than a genuine reasoning process in reasoning tasks, suggesting that their reasoning may be post-hoc rationalization rather than deep inference. Abstract: While Large Language Models (LLMs) demonstrate impressive reasoning capabilities, growing evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference. In this work, we investigate a central question: are LLMs primarily anchored to final answers or to the textual pattern of reasoning chains? We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis. Experiments across state-of-the-art LLMs reveal a strong and consistent reliance on explicit answers. The performance drops by 26.90% when answer cues are masked, even with complete reasoning chains. These findings suggest that much of the reasoning exhibited by LLMs may reflect post-hoc rationalization rather than true inference, calling into question their inferential depth. Our study uncovers the answer-anchoring phenomenon with rigorous empirical validation and underscores the need for a more nuanced understanding of what constitutes reasoning in LLMs.
[24] Step-Opt: Boosting Optimization Modeling in LLMs through Iterative Data Synthesis and Structured Validation
Yang Wu,Yifan Zhang,Yurong Wu,Yuran Wang,Junkai Zhang,Jian Cheng
Main category: cs.CL
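TL;DR: The Step-Opt-Instruct framework augments existing datasets and generates high-quality fine-tuning data for optimization modeling, improving the performance of large language models on complex operations research tasks.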
Details
Motivation: Large language models (LLMs) face major challenges in optimization modeling tasks for operations research (OR), especially on complex problems. The paper therefore proposes the Step-Opt-Instruct framework to improve LLM performance on these tasks. Method: Step-Opt-Instruct uses iterative problem generation to systematically increase problem complexity and stepwise validation to rigorously check the data, preventing error propagation and ensuring the quality of the generated dataset. The framework is then used to fine-tune open-source LLMs such as LLaMA-3-8B and Mistral-7B, yielding the Step-Opt model. Result: Step-Opt performs strongly on benchmarks such as NL4OPT, MAMO, and IndustryOR, especially on complex OR tasks, with a 17.01% improvement in micro average accuracy on difficult problems. Conclusion: Combining structured validation with gradual problem refinement effectively improves the use of LLMs in automating decision-making processes. Abstract: Large Language Models (LLMs) have revolutionized various domains but encounter substantial challenges in tackling optimization modeling tasks for Operations Research (OR), particularly when dealing with complex problems. In this work, we propose Step-Opt-Instruct, a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. Step-Opt-Instruct employs iterative problem generation to systematically increase problem complexity and stepwise validation to rigorously verify data, preventing error propagation and ensuring the quality of the generated dataset. Leveraging this framework, we fine-tune open-source LLMs, including LLaMA-3-8B and Mistral-7B, to develop Step-Opt, a model that achieves state-of-the-art performance on benchmarks such as NL4OPT, MAMO, and IndustryOR. Extensive experiments demonstrate the superior performance of Step-Opt, especially in addressing complex OR tasks, with a notable 17.01% improvement in micro average accuracy on difficult problems. These findings highlight the effectiveness of combining structured validation with gradual problem refinement to advance the automation of decision-making processes using LLMs. The code and dataset are available at https://github.com/samwu-learn/Step.
[25] TPTT: Transforming Pretrained Transformer into Titans
Fabien Furfaro
Main category: cs.CL
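TL;DR: The TPTT framework enhances pretrained Transformer models with efficient linearized attention and memory management techniques, achieving higher efficiency and accuracy while remaining compatible with the Hugging Face Transformers library.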
Details
Motivation: Large language models (LLMs) have made remarkable progress in natural language processing, but their computational and memory demands remain a major challenge for long-context inference. Method: The TPTT framework is proposed, combining Memory as Gate (MaG) with mixed linearized attention (LiZA) and enabling seamless adaptation of any causal LLM through parameter-efficient fine-tuning (LoRA). Result: On the MMLU benchmark with models of approximately 1 billion parameters, TPTT shows substantial improvements in efficiency and accuracy; for example, Titans-Llama-3.2-1B achieves a 20% increase in Exact Match (EM) over its baseline. Conclusion: TPTT is an effective framework that improves the efficiency and accuracy of pretrained Transformer models through linearized attention and advanced memory management, and it is fully compatible with the Hugging Face Transformers library. Abstract: Recent advances in large language models (LLMs) have led to remarkable progress in natural language processing, but their computational and memory demands remain a significant challenge, particularly for long-context inference. We introduce TPTT (Transforming Pretrained Transformer into Titans), a novel framework for enhancing pretrained Transformer models with efficient linearized attention mechanisms and advanced memory management. TPTT employs techniques such as Memory as Gate (MaG) and mixed linearized attention (LiZA). It is fully compatible with the Hugging Face Transformers library, enabling seamless adaptation of any causal LLM through parameter-efficient fine-tuning (LoRA) without full retraining. We show the effectiveness of TPTT on the MMLU benchmark with models of approximately 1 billion parameters, observing substantial improvements in both efficiency and accuracy. For instance, Titans-Llama-3.2-1B achieves a 20% increase in Exact Match (EM) over its baseline. Statistical analyses and comparisons with recent state-of-the-art methods confirm the practical scalability and robustness of TPTT. Code is available at https://github.com/fabienfrfr/tptt . Python package at https://pypi.org/project/tptt/ .
[26] Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering
Binquan Ji,Haibo Luo,Yifei Lu,Lei Hei,Jiaqi Wang,Tingjing Liao,Lingyu Wang,Shichao Wang,Feiliang Ren
Main category: cs.CL
TL;DR: This paper proposes DEC, a framework for knowledge-intensive multi-hop question answering that improves efficiency and accuracy, particularly for lightweight models.
Details
Motivation: Knowledge-intensive multi-hop QA tasks often require multiple rounds of retrieval and iterative generation by LLMs. Lightweight LLMs face challenges like hallucinations and semantic drift when handling many documents and extended contexts. Method: A framework called DEC (Dynamic Enhancement Chain) was proposed. It decomposes complex questions into subquestions, refines these subquestions through context-aware rewriting, and uses a lightweight discriminative keyword extraction module for retrieval. Result: DEC performs on par with or surpasses state-of-the-art benchmarks while significantly reducing token consumption. It achieves state-of-the-art results on models with 8B parameters. Conclusion: DEC is effective, especially in resource-constrained environments. Abstract: Knowledge-intensive multi-hop question answering (QA) tasks, which require integrating evidence from multiple sources to address complex queries, often necessitate multiple rounds of retrieval and iterative generation by large language models (LLMs). However, incorporating many documents and extended contexts poses challenges, such as hallucinations and semantic drift, for lightweight LLMs with fewer parameters. This work proposes a novel framework called DEC (Dynamic Enhancement Chain). DEC first decomposes complex questions into logically coherent subquestions to form a hallucination-free reasoning chain. It then iteratively refines these subquestions through context-aware rewriting to generate effective query formulations. For retrieval, we introduce a lightweight discriminative keyword extraction module that leverages extracted keywords to achieve targeted, precise document recall with relatively low computational overhead. Extensive experiments on three multi-hop QA datasets demonstrate that DEC performs on par with or surpasses state-of-the-art benchmarks while significantly reducing token consumption. 
Notably, our approach attains state-of-the-art results on models with 8B parameters, showcasing its effectiveness in various scenarios, particularly in resource-constrained environments.
[27] Zero-Shot Conversational Stance Detection: Dataset and Approaches
Yuzhe Ding,Kang He,Bobo Li,Li Zheng,Haijun He,Fei Li,Chong Teng,Donghong Ji
Main category: cs.CL
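TL;DR: This paper introduces SITPCL, a new method for zero-shot conversational stance detection, and ZS-CSD, a large-scale dataset; experiments show the method achieves state-of-the-art performance, though considerable room for improvement remains.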
Details
Motivation: Existing conversational stance detection datasets are restricted to specific targets, which limits the effectiveness of stance detection models when they encounter large numbers of unseen targets in real-world applications. To bridge this gap, the authors manually curate ZS-CSD, a large-scale, high-quality zero-shot conversational stance detection dataset. Method: SITPCL, a speaker interaction and target-aware prototypical contrastive learning model, is proposed, and the large-scale, high-quality zero-shot conversational stance detection dataset ZS-CSD is built. Result: The proposed SITPCL model achieves state-of-the-art performance in zero-shot conversational stance detection, yet its F1-macro score is only 43.81%. Conclusion: SITPCL achieves state-of-the-art performance in zero-shot conversational stance detection, but challenges remain. Abstract: Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the increasing number of online debates among social media users, conversational stance detection has become a crucial research area. However, existing conversational stance detection datasets are restricted to a limited set of specific targets, which constrains the effectiveness of stance detection models when encountering a large number of unseen targets in real-world applications. To bridge this gap, we manually curate a large-scale, high-quality zero-shot conversational stance detection dataset, named ZS-CSD, comprising 280 targets across two distinct target types. Leveraging the ZS-CSD dataset, we propose SITPCL, a speaker interaction and target-aware prototypical contrastive learning model, and establish the benchmark performance in the zero-shot setting. Experimental results demonstrate that our proposed SITPCL model achieves state-of-the-art performance in zero-shot conversational stance detection. Notably, the SITPCL model attains only an F1-macro score of 43.81%, highlighting the persistent challenges in zero-shot conversational stance detection.
[28] The Evolution of Natural Language Processing: How Prompt Optimization and Language Models are Shaping the Future
Summra Saleem,Muhammad Nabeel Asim,Shaista Zulfiqar,Andreas Dengel
Main category: cs.CL
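TL;DR: This paper systematically analyzes prompt optimization strategies and categorizes them into 11 classes, aiming to advance future research and development of LLM-based predictive methods.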
Details
Motivation: To fill the current gap in comprehensive analyses of prompt optimization strategies for large language models. Method: Existing prompt optimization strategies are analyzed comprehensively and categorized according to their working principles. Result: The paper provides details on the prompt optimization strategies, large language models, and benchmark datasets used across different natural language processing tasks. Conclusion: The paper summarizes the potential of diverse prompt optimization strategies and categorizes them into 11 distinct classes, laying a solid foundation for future comparative studies. Abstract: Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) by automating traditional labor-intensive tasks and consequently accelerated the development of computer-aided applications. As researchers continue to advance this field with the introduction of novel language models and more efficient training/finetuning methodologies, the idea of prompt engineering and subsequent optimization strategies with LLMs has emerged as a particularly impactful trend to yield a substantial performance boost across diverse NLP tasks. To the best of our knowledge, numerous review articles have explored prompt engineering; however, a critical gap exists in comprehensive analyses of prompt optimization strategies. To bridge this gap, this paper provides unique and comprehensive insights about the potential of diverse prompt optimization strategies. It analyzes their underlying working paradigms and, based on these principles, categorizes them into 11 distinct classes. Moreover, the paper provides details about various NLP tasks where these prompt optimization strategies have been employed, along with details of different LLMs and benchmark datasets used for evaluation. This comprehensive compilation lays a robust foundation for future comparative studies and enables rigorous assessment of prompt optimization and LLM-based predictive pipelines under consistent experimental settings: a critical need in the current landscape. Ultimately, this research will centralize diverse strategic knowledge to facilitate the adaptation of existing prompt optimization strategies for the development of innovative predictors across unexplored tasks.
[29] Aged to Perfection: Machine-Learning Maps of Age in Conversational English
MingZe Tang
Main category: cs.CL
TL;DR: This paper explores age-related linguistic differences in British English using large-scale data and machine learning, aiming to identify generational language markers and build age prediction models.
Details
Motivation: The motivation behind this research is to understand sociolinguistic diversity and how language patterns change across generations in modern British speech. Method: The research uses computational language analysis and machine learning methodologies on the British National Corpus 2014 to analyze variations in utterance duration, lexical diversity, and word choice among different age groups. Result: The researchers identified specific linguistic features associated with each generation and developed prediction models capable of estimating a speaker's age group based on their language use. Conclusion: The study concludes that there are distinctive linguistic markers across different age groups in contemporary British English, which can be used to predict a speaker's age group with some accuracy. Abstract: The study uses the British National Corpus 2014, a large sample of contemporary spoken British English, to investigate language patterns across different age groups. Our research attempts to explore how language patterns vary between different age groups, exploring the connection between speaker demographics and linguistic factors such as utterance duration, lexical diversity, and word choice. By merging computational language analysis and machine learning methodologies, we attempt to uncover distinctive linguistic markers characteristic of multiple generations and create prediction models that can consistently estimate the speaker's age group from various aspects. This work contributes to our knowledge of sociolinguistic diversity throughout the life of modern British speech.
[30] Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages
Matthias Schöffel,Esteban Garces Arias,Marinus Wiedner,Paula Ruppert,Meimingwei Li,Christian Heumann,Matthias Aßenmacher
Main category: cs.CL
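TL;DR: This paper examines how to improve part-of-speech tagging for Medieval Romance languages, studying the effect of different techniques and model architectures and identifying both the limitations and the potential of large language models for this task.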
Details
Motivation: Although modern large language models have made significant advances on ancient languages, their application to Medieval Romance languages faces distinctive challenges due to diachronic linguistic evolution, spelling variation, and scarce labeled data. Method: Rigorous experiments evaluate how fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning techniques affect tagging accuracy. Result: The results reveal limitations of large language models in handling historical language variation and non-standardized spelling, and show that certain specialized techniques can effectively address these challenges. Conclusion: The study identifies specialized techniques that effectively address the unique challenges of low-resource historical languages, while also revealing notable limitations of large language models in handling historical language variation and non-standardized spelling. Abstract: Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines, particularly critical for historical text analysis at the intersection of computational linguistics and digital humanities. Despite significant advancements in modern large language models (LLMs) for ancient languages, their application to Medieval Romance languages presents distinctive challenges stemming from diachronic linguistic evolution, spelling variations, and labeled data scarcity. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts, spanning biblical, hagiographical, medical, and dietary domains. Through rigorous experimentation, we evaluate how fine-tuning approaches, prompt engineering, model architectures, decoding strategies, and cross-lingual transfer learning techniques affect tagging accuracy. Our results reveal both notable limitations in LLMs' ability to process historical language variations and non-standardized spelling, and promising specialized techniques that effectively address the unique challenges presented by low-resource historical languages.
[31] KAG-Thinker: Teaching Large Language Models to Think with Human-like Reasoning Process
Dalong Zhang,Jun Xu,Jun Zhou,Lei Liang,Lin Yuan,Ling Zhong,Mengshu Sun,Peilong Zhao,QiWei Wang,Xiaorui Wang,Xinkai Du,YangYang Hou,Yu Ao,ZhaoYang Wang,Zhengke Gui,ZhiYing Yi,Zhongpu Bo
Main category: cs.CL
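TL;DR: This paper proposes KAG-Thinker, a new framework that mimics human cognitive mechanisms to improve the performance of parameter-light large language models on question-answering tasks over domain-specific knowledge bases.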
Details
Motivation: To improve the logical coherence and contextual consistency of parameter-light large language models (LLMs) in question-answering tasks over domain-specific knowledge bases, a novel human-like reasoning framework is needed. Method: The framework first uses breadth decomposition to split complex questions into independently solvable sub-problems, which are handled with Retrieval, Math, and Deduce functions; second, in knowledge retrieval tasks, a knowledge boundary model determines the optimal knowledge source and a depth solving model enhances the comprehensiveness of knowledge acquisition; finally, supervised fine-tuning with multi-turn dialogues aligns the model with the structured reasoning paradigm. Result: The KAG-Thinker framework was developed; it decomposes complex questions effectively, uses multiple functions to handle different task types, and optimizes the knowledge acquisition process. Conclusion: By simulating human cognitive mechanisms, the KAG-Thinker framework markedly improves logical coherence and contextual consistency in question-answering tasks over domain knowledge bases, and it avoids excessive reflection by using supervised fine-tuning rather than reinforcement learning. Abstract: In this paper, we introduce KAG-Thinker, a novel human-like reasoning framework built upon a parameter-light large language model (LLM). Our approach enhances the logical coherence and contextual consistency of the thinking process in question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within LLMs. This framework simulates human cognitive mechanisms for handling complex problems by establishing a structured thinking process. Continuing the Logical Form guided retrieval and reasoning technology route of KAG v0.7, firstly, it decomposes complex questions into independently solvable sub-problems (also referred to as logical forms) through breadth decomposition, each represented in two equivalent forms (natural language and logical function) and further classified as either Knowledge Retrieval or Reasoning Analysis tasks, with dependencies and variables passing explicitly modeled via logical function interfaces. In the solving process, the Retrieval function is used to perform knowledge retrieval tasks, while the Math and Deduce functions are used to perform reasoning analysis tasks. Secondly, it is worth noting that, in the Knowledge Retrieval sub-problem tasks, LLMs and external knowledge sources are regarded as equivalent KBs. We use the knowledge boundary model to determine the optimal source using self-regulatory mechanisms such as confidence calibration and reflective reasoning, and use the depth solving model to enhance the comprehensiveness of knowledge acquisition. 
Finally, instead of utilizing reinforcement learning, we employ supervised fine-tuning with multi-turn dialogues to align the model with our structured inference paradigm, thereby avoiding excessive reflection. This is supported by a data evaluation framework and iterative corpus synthesis, which facilitate the generation of detailed reasoning trajectories...
[32] HIDE and Seek: Detecting Hallucinations in Language Models via Decoupled Representations
Anwoy Chatterjee,Yash Goel,Tanmoy Chakraborty
Main category: cs.CL
TL;DR: The paper proposes HIDE, a single-pass, training-free method for detecting hallucinations in language models by analyzing the statistical decoupling between input context and output representations. HIDE demonstrates strong performance and efficiency improvements over existing approaches.
Details
Motivation: Language models often generate factually incorrect or unfaithful content, known as hallucinations. Existing methods to detect these hallucinations typically rely on multiple generations per input, leading to high computational costs and latency. Method: HIDE uses the Hilbert-Schmidt Independence Criterion (HSIC) to quantify the statistical decoupling between internal representations of the input context and the output sequence during generation, enabling a single-pass, training-free hallucination detection approach. Result: HIDE outperforms other single-pass methods in most settings, achieving ~29% average relative improvement in AUC-ROC over the best-performing single-pass strategy. It also shows competitive performance with multi-pass state-of-the-art methods, achieving ~3% average relative improvement in AUC-ROC while using ~51% less computation time. Conclusion: HIDE is an effective and efficient method for hallucination detection in language models by leveraging the statistical decoupling between input context and generated output representations. Abstract: Contemporary Language Models (LMs), while impressively fluent, often generate content that is factually incorrect or unfaithful to the input context - a critical issue commonly referred to as 'hallucination'. This tendency of LMs to generate hallucinated content undermines their reliability, especially because these fabrications are often highly convincing and therefore difficult to detect. While several existing methods attempt to detect hallucinations, most rely on analyzing multiple generations per input, leading to increased computational cost and latency. To address this, we propose a single-pass, training-free approach for effective Hallucination detectIon via Decoupled rEpresentations (HIDE). Our approach leverages the hypothesis that hallucinations result from a statistical decoupling between an LM's internal representations of input context and its generated output. 
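The Method summary above names the Hilbert-Schmidt Independence Criterion (HSIC) as the decoupling measure. As a minimal sketch of what such a statistic computes, the block below implements the standard biased empirical HSIC estimator, trace(KHLH) / (n - 1)^2, with an RBF kernel. The kernel choice, the bandwidth `sigma`, and the toy vectors standing in for hidden states are assumptions made for illustration; the digest does not say which kernel, layers, or tokens HIDE uses.

```python
import math

def rbf_gram(xs, sigma):
    """Gram matrix with K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(xs)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(xs[i], xs[j]))
                      / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]

def center(k):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(k)
    row = [sum(r) / n for r in k]
    col = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[k[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired samples xs[i] and ys[i]."""
    n = len(xs)
    kc = center(rbf_gram(xs, sigma))
    lc = center(rbf_gram(ys, sigma))
    # trace(Kc @ Lc) equals the elementwise sum because both are symmetric.
    return sum(kc[i][j] * lc[i][j]
               for i in range(n) for j in range(n)) / (n - 1) ** 2

# Toy stand-ins for hidden states (hypothetical data, not from the paper).
inputs = [[i / 10.0] for i in range(20)]
coupled = [[x[0] * 2.0] for x in inputs]    # fully determined by the input
decoupled = [[1.0] for _ in inputs]         # carries no information about it

score_coupled = hsic(inputs, coupled)
score_decoupled = hsic(inputs, decoupled)
```

In this toy setup the constant `decoupled` sequence yields a score of zero, while the `coupled` sequence yields a strictly positive one. A lower score indicates weaker statistical dependence between the two sets of representations.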
We quantify this decoupling using the Hilbert-Schmidt Independence Criterion (HSIC) applied to hidden-state representations extracted while generating the output sequence. We conduct extensive experiments on four diverse question answering datasets, evaluating both faithfulness and factuality hallucinations across six open-source LMs of varying scales and properties. Our results demonstrate that HIDE outperforms other single-pass methods in almost all settings, achieving an average relative improvement of ~29% in AUC-ROC over the best-performing single-pass strategy across various models and datasets. Additionally, HIDE shows competitive and often superior performance with multi-pass state-of-the-art methods, obtaining an average relative improvement of ~3% in AUC-ROC while consuming ~51% less computation time. Our findings highlight the effectiveness of exploiting internal representation decoupling in LMs for efficient and practical hallucination detection.
[33] Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights
N J Karthika,Maharaj Brahma,Rohit Saluja,Ganesh Ramakrishnan,Maunendra Sankar Desarkar
Main category: cs.CL
TL;DR: This paper evaluates tokenization strategies for 17 Indian languages, showing how to build more effective and equitable tokenizers for multilingual NLP.
Details
Motivation: Existing tokenizers are skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those in the Indian subcontinent. Method: The paper conducts an intrinsic evaluation of tokenization strategies across 17 Indian languages, comparing bottom-up and top-down algorithms (BPE and Unigram LM), vocabulary sizes, and multilingual vocabulary construction methods like joint and cluster-based training. Result: The research quantifies trade-offs between tokenizer algorithms and the effects of vocabulary sizes, and shows that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Conclusion: The study concludes that building fair and efficient tokenizers for multilingual NLP requires intrinsic evaluation of tokenization strategies, especially for low-resource languages. Abstract: Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those in the Indian subcontinent. This paper presents a comprehensive intrinsic evaluation of tokenization strategies across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenizer algorithms (BPE and Unigram LM), effects of vocabulary sizes, and compare strategies of multilingual vocabulary construction such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building fairer, more efficient, and linguistically informed tokenizers for multilingual NLP.
[34] THCM-CAL: Temporal-Hierarchical Causal Modelling with Conformal Calibration for Clinical Risk Prediction
Xin Zhang,Qiyu Wei,Yingjie Zhu,Fanyi Wu,Sophia Ananiadou
Main category: cs.CL
TL;DR: The paper proposes THCM-CAL, a novel framework for automated clinical risk prediction using electronic health records, which captures complex causal interactions between structured and unstructured data through a multimodal causal graph and conformal calibration.
Details
Motivation: Prior approaches to clinical risk prediction either handle structured diagnostic codes and unstructured narrative notes separately or rely on simplistic fusion strategies, ignoring directional and hierarchical causal interactions. Method: A Temporal-Hierarchical Causal Model with Conformal Calibration (THCM-CAL) that constructs a multimodal causal graph and infers clinically grounded interactions. Result: Experimental results on MIMIC-III and MIMIC-IV show the effectiveness of THCM-CAL for automated clinical risk prediction. Conclusion: THCM-CAL demonstrates superiority in handling complex clinical risk prediction by considering hierarchical causal interactions and conformal calibration. Abstract: Automated clinical risk prediction from electronic health records (EHRs) demands modeling both structured diagnostic codes and unstructured narrative notes. However, most prior approaches either handle these modalities separately or rely on simplistic fusion strategies that ignore the directional, hierarchical causal interactions by which narrative observations precipitate diagnoses and propagate risk across admissions. In this paper, we propose THCM-CAL, a Temporal-Hierarchical Causal Model with Conformal Calibration. Our framework constructs a multimodal causal graph where nodes represent clinical entities from two modalities: textual propositions extracted from notes and ICD codes mapped to textual descriptions. Through hierarchical causal discovery, THCM-CAL infers three clinically grounded interactions: intra-slice same-modality sequencing, intra-slice cross-modality triggers, and inter-slice risk propagation. To enhance prediction reliability, we extend conformal prediction to multi-label ICD coding, calibrating per-code confidence intervals under complex co-occurrences. Experimental results on MIMIC-III and MIMIC-IV demonstrate the superiority of THCM-CAL.
[35] LLMs for Customized Marketing Content Generation and Evaluation at Scale
Haoran Liu,Amir Tahmasbi,Ehtesham Sam Haque,Purak Jain
Main category: cs.CL
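TL;DR: This study proposes the MarketingFM and AutoEval systems for generating and evaluating offsite marketing ad copy in e-commerce, improving both ad performance and evaluation efficiency.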
Details
Motivation: Current offsite marketing content is overly generic and template-based and is poorly aligned with landing pages, which limits its effectiveness. Method: MarketingFM combines multiple data sources to generate keyword-specific ad copy, and the AutoEval-Main and AutoEval-Update frameworks provide automated evaluation. Result: MarketingFM raised click-through rate and impressions and lowered cost per click; AutoEval-Main reached 89.57% agreement with human reviewers in large-scale experiments; AutoEval-Update improved evaluation consistency and reduced manual effort. Conclusion: MarketingFM and AutoEval show significant potential for improving ad effectiveness and evaluation efficiency, but human oversight is still needed to ensure quality and accuracy. Abstract: Offsite marketing is essential in e-commerce, enabling businesses to reach customers through external platforms and drive traffic to retail websites. However, most current offsite marketing content is overly generic, template-based, and poorly aligned with landing pages, limiting its effectiveness. To address these limitations, we propose MarketingFM, a retrieval-augmented system that integrates multiple data sources to generate keyword-specific ad copy with minimal human intervention. We validate MarketingFM via offline human and automated evaluations and large-scale online A/B tests. In one experiment, keyword-focused ad copy outperformed templates, achieving up to 9% higher CTR, 12% more impressions, and 0.38% lower CPC, demonstrating gains in ad ranking and cost efficiency. Despite these gains, human review of generated ads remains costly. To address this, we propose AutoEval-Main, an automated evaluation system that combines rule-based metrics with LLM-as-a-Judge techniques to ensure alignment with marketing principles. In experiments with large-scale human annotations, AutoEval-Main achieved 89.57% agreement with human reviewers. Building on this, we propose AutoEval-Update, a cost-efficient LLM-human collaborative framework to dynamically refine evaluation prompts and adapt to shifting criteria with minimal human input. By selectively sampling representative ads for human review and using a critic LLM to generate alignment reports, AutoEval-Update improves evaluation consistency while reducing manual effort. Experiments show the critic LLM suggests meaningful refinements, improving LLM-human agreement.
Nonetheless, human oversight remains essential for setting thresholds and validating refinements before deployment.
[36] QueueEDIT: Structural Self-Correction for Sequential Model Editing in LLMs
Taolin Zhang,Haidong Kang,Dongyang Li,Qizhou Chen,Chengyu Wang,Xiaofeng He,Richang Hong
Main category: cs.CL
TL;DR: QueueEDIT improves sequential model editing by addressing long-sequence dependency and reducing parameter bias impact on LLM capabilities.
Details
Motivation: To address challenges in sequential model editing (SME) where LLMs suffer from hallucinations and parameter bias affecting their general capabilities. Method: Queue-based self-correction framework that includes structural mapping editing loss, parameter queue storage, and dynamic alignment. Result: QueueEDIT outperforms baselines in SME settings, maintains competitiveness in single-turn editing, and preserves high NLP capabilities during SME. Conclusion: The proposed QueueEDIT framework effectively enhances sequential model editing performance while preserving the general capabilities of large language models (LLMs). Abstract: Recently, large language models (LLMs) have demonstrated impressive results but still suffer from hallucinations. Model editing has been proposed to correct factual inaccuracies in LLMs. A challenging case is sequential model editing (SME), which aims to rectify errors continuously rather than treating them as a one-time task. During SME, the general capabilities of LLMs can be negatively affected due to the introduction of new parameters. In this paper, we propose a queue-based self-correction framework (QueueEDIT) that not only enhances SME performance by addressing long-sequence dependency but also mitigates the impact of parameter bias on the general capabilities of LLMs. Specifically, we first introduce a structural mapping editing loss to map the triplets to the knowledge-sensitive neurons within the Transformer layers of LLMs. We then store the located parameters for each piece of edited knowledge in a queue and dynamically align previously edited parameters. In each edit, we select queue parameters most relevant to the currently located parameters to determine whether previous knowledge needs realignment. Irrelevant parameters in the queue are frozen, and we update the parameters at the queue head to the LLM to ensure they do not harm general abilities. 
Experiments show that our framework significantly outperforms strong baselines across various SME settings and maintains competitiveness in single-turn editing. The resulting LLMs also preserve high capabilities in general NLP tasks throughout the SME process.
[37] How Alignment Shrinks the Generative Horizon
Chenghao Yang,Ari Holtzman
Main category: cs.CL
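TL;DR: The paper introduces the Branching Factor (BF) metric to study why aligned language models generate outputs that lack diversity, finding that alignment tuning and Chain-of-Thought (CoT) reasoning substantially reduce generation uncertainty, and that prompting base models can achieve a similar effect.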
Details
Motivation: To investigate why aligned large language models produce outputs that lack diversity, and to identify what drives this stability during generation. Method: The Branching Factor (BF) metric is introduced through the lens of probability concentration, followed by an empirical analysis of how BF changes during generation and how it affects model outputs. Result: (1) BF decreases as generation progresses, indicating that LLMs become more predictable as they generate; (2) alignment tuning substantially lowers the branching factor of the model's output distribution, reducing BF by nearly an order of magnitude; (3) aligned Chain-of-Thought (CoT) models lower BF further by generating longer reasoning chains; (4) experiments show that prompting base models with specific tokens can similarly reduce BF. Conclusion: Branching Factor (BF) is an effective diagnostic tool for understanding and controlling the outputs of large language models (LLMs). It reveals how alignment tuning reduces variability, how CoT promotes stable generation, and how base models can be steered away from diversity. Abstract: Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the Branching Factor (BF) -- a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model.
This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
[38] Multi-turn Jailbreaking via Global Refinement and Active Fabrication
Hua Tang,Lingyong Yan,Yukun Zhao,Shuaiqiang Wang,Jizhou Huang,Dawei Yin
Main category: cs.CL
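TL;DR: This paper proposes a new multi-turn jailbreaking method that globally refines the jailbreaking path and actively fabricates model responses to increase the likelihood of eliciting harmful content in multi-turn dialogue, and demonstrates its superiority on multiple large language models.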
Details
Motivation: Because large language models still pose significant safety risks, particularly the potential for misuse for malicious purposes, jailbreaking techniques in the more complex multi-turn dialogue setting need to be explored. Method: A new multi-turn jailbreaking method that globally refines the jailbreaking path at each interaction and actively fabricates model responses to suppress safety-related warnings. Result: The proposed method outperforms existing techniques on six state-of-the-art large language models. Conclusion: Experimental results show that the proposed multi-turn jailbreaking method outperforms existing single-turn and multi-turn jailbreaking techniques on six state-of-the-art large language models. Abstract: Large Language Models (LLMs) have achieved exceptional performance across a wide range of tasks. However, they still pose significant safety risks due to the potential misuse for malicious purposes. Jailbreaks, which aim to elicit models to generate harmful content, play a critical role in identifying the underlying security threats. Recent jailbreaking primarily focuses on single-turn scenarios, while the more complicated multi-turn scenarios remain underexplored. Moreover, existing multi-turn jailbreaking techniques struggle to adapt to the evolving dynamics of dialogue as the interaction progresses. To address this limitation, we propose a novel multi-turn jailbreaking method that refines the jailbreaking path globally at each interaction. We also actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs in subsequent questions. Experimental results demonstrate the superior performance of our method compared with existing single-turn and multi-turn jailbreaking techniques across six state-of-the-art LLMs. Our code is publicly available at https://github.com/Ytang520/Multi-Turn_jailbreaking_Global-Refinment_and_Active-Fabrication.
[39] Scatter-Based Innovation Propagation in Large Language Models for Multi-Stage Process Adaptation
Hong Su
Main category: cs.CL
TL;DR: This paper introduces an innovation scatter model that helps large language models generalize and apply novel ideas across structurally similar stages.
Details
Motivation: LLMs struggle to generalize novel ideas beyond their original context, especially in multi-stage processes. This work aims to address this limitation by expanding localized innovations to broader applications. Method: The paper proposes a four-step scatter-based innovation expansion model: (1) identifying core innovation, (2) generalizing it, (3) determining broader applicability, and (4) applying it to other stages using LLMs. Result: Verification results show that the proposed model enables LLMs to extend innovations across structurally similar stages, improving generalization and reuse. Conclusion: The innovation scatter model effectively enhances the generalization and reuse of novel ideas in LLMs by leveraging structural redundancy across stages. Abstract: Large Language Models (LLMs) exhibit strong capabilities in reproducing and extending patterns observed during pretraining but often struggle to generalize novel ideas beyond their original context. This paper addresses the challenge of applying such localized innovations - introduced at a specific stage or component - to other parts of a multi-stage process. We propose a scatter-based innovation expansion model (innovation scatter model) that guides the LLM through a four-step process: (1) identifying the core innovation by comparing the user's input with its surrounding context, (2) generalizing the innovation by removing references to specific stages or components, (3) determining whether the generalized innovation applies to a broader scope beyond the original stage, and (4) systematically applying it to other structurally similar stages using the LLM. This model leverages structural redundancy across stages to improve the applicability of novel ideas. 
Verification results demonstrate that the innovation scatter model enables LLMs to extend innovations across structurally similar stages, thereby enhancing generalization and reuse.
[40] A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment
Quanwei Tang,Sophia Yat Mei Lee,Junshuang Wu,Dong Zhang,Shoushan Li,Erik Cambria,Guodong Zhou
Main category: cs.CL
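TL;DR: This paper proposes GraphMPA, a new framework that uses a graph-based approach and mode-seeking preference optimization to address global understanding and human preference alignment in retrieval-augmented generation, and demonstrates its effectiveness.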
Details
Motivation: Existing retrieval-augmented generation (RAG) methods still face challenges in achieving global understanding and in aligning responses with human ethical and quality preferences. Method: GraphMPA, a graph-based framework that uses a hierarchical document graph and mode-seeking preference optimization to align model outputs with human preferences. Result: Experiments on six datasets demonstrate the effectiveness of GraphMPA. Conclusion: GraphMPA is an effective framework that addresses RAG's challenges in global understanding and human preference alignment. Abstract: Recent advancements in retrieval-augmented generation (RAG) have enhanced large language models in question answering by integrating external knowledge. However, challenges persist in achieving global understanding and aligning responses with human ethical and quality preferences. To address these issues, we propose GraphMPA, a comprehensive graph-based framework with mode-seeking preference alignment. Our approach constructs a hierarchical document graph using a general similarity measurement, mimicking human cognitive processes for information understanding and synthesis. Additionally, we introduce mode-seeking preference optimization to better align model outputs with human preferences through probability-matching constraints. Extensive experiments on six datasets demonstrate the effectiveness of our GraphMPA (https://github.com/tangquanwei/GraphMPA).
[41] PDF Retrieval Augmented Question Answering
Thi Thu Uyen Hoang,Viet Anh Nguyen
Main category: cs.CL
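TL;DR: This paper proposes a question-answering system based on the Retrieval Augmented Generation (RAG) framework that extracts multimodal information from PDF files and answers complex questions.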
Details
Motivation: Existing question-answering systems are designed mainly for textual content and struggle with the rich and varied data types found in PDFs (such as images, vector diagrams, and tables), so a more comprehensive solution is needed. Method: A comprehensive RAG-based question-answering system is built by refining how non-textual elements in PDFs are processed and integrated, and by fine-tuning large language models to better fit the system's needs. Result: Experimental evaluation shows that the system accurately extracts information from different types of PDF content and effectively answers complex multimodal questions. Conclusion: This work not only pushes the boundaries of retrieval-augmented question-answering systems but also lays a foundation for research on multimodal data integration and processing. Abstract: This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction from PDF files. The richness and diversity of data within PDFs, including text, images, vector diagrams, graphs, and tables, poses unique challenges for existing QA systems primarily designed for textual content. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions, where several data types are combined in the query. This is mainly achieved by refining approaches to processing and integrating non-textual elements in PDFs into the RAG framework to derive precise and relevant answers, as well as fine-tuning large language models to better adapt to our system. We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs. This work not only pushes the boundaries of retrieval-augmented QA systems but also lays a foundation for further research in multimodal data integration and processing.
[42] Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices
Maxence Lasbordes,Daniele Falavigna,Alessio Brutti
Main category: cs.CL
TL;DR: This paper introduces an improved early-exit neural architecture with parallel layers for downsampled inputs, enhancing speech recognition performance without increasing inference time.
Details
Motivation: The motivation stems from the need for resource-aware neural models in on-device processing scenarios with limited computational resources. Additionally, while memory-efficient architectures like Zipformer reduce operations through variable frame rate analysis, they lack modularity for early-exit branches, which this work aims to address. Method: The authors propose the introduction of parallel layers in neural network architectures to process downsampled input versions alongside standard processing layers. This approach is designed to enhance early-exit models' performance. Result: The proposed architecture demonstrates significant improvements in speech recognition performance on standard benchmarks, with minimal impact on model parameters and no effect on inference time. Conclusion: The paper concludes that introducing parallel layers in early-exit architectures significantly improves speech recognition performance on standard benchmarks without affecting inference time, despite a small increase in model parameters. Abstract: The ability to dynamically adjust the computational load of neural models during inference in a resource aware manner is crucial for on-device processing scenarios, characterised by limited and time-varying computational resources. Early-exit architectures represent an elegant and effective solution, since they can process the input with a subset of their layers, exiting at intermediate branches (the upmost layers are hence removed from the model). From a different perspective, for automatic speech recognition applications there are memory-efficient neural architectures that apply variable frame rate analysis, through downsampling/upsampling operations in the middle layers, reducing the overall number of operations and improving significantly the performance on well established benchmarks. One example is the Zipformer. However, these architectures lack the modularity necessary to inject early-exit branches. 
With the aim of improving the performance in early-exit models, we propose introducing parallel layers in the architecture that process downsampled versions of their inputs. We show that in this way the speech recognition performance on standard benchmarks significantly improves, at the cost of a small increase in the overall number of model parameters but without affecting the inference time.
[43] Markov-Enhanced Clustering for Long Document Summarization: Tackling the 'Lost in the Middle' Challenge with Large Language Models
Aziz Amari,Mohamed Achref Ben Ammar
Main category: cs.CL
TL;DR: A hybrid summarization approach combining extractive and abstractive methods effectively addresses the challenge of retaining key information in long documents by clustering text chunks and using a Markov chain for semantic ordering.
Details
Motivation: The motivation stems from the increasing need for effective automatic text summarization due to the rapid expansion of information. Current large language models face challenges in retaining key information across lengthy documents. Method: The method involves splitting the document into smaller chunks, clustering their vector embeddings to identify key ideas, generating summaries for each cluster, and constructing the final summary using a Markov chain graph to determine the semantic order. Result: The proposed method successfully tackles the limitations of existing abstractive summarization techniques by preserving important information and improving coherence through semantic ordering with a Markov chain graph. Conclusion: The hybrid summarization approach combining extractive and abstractive techniques effectively addresses the 'lost in the middle' problem, offering a solution that is both efficient and capable of retaining key information. Abstract: The rapid expansion of information from diverse sources has heightened the need for effective automatic text summarization, which condenses documents into shorter, coherent texts. Summarization methods generally fall into two categories: extractive, which selects key segments from the original text, and abstractive, which generates summaries by rephrasing the content coherently. Large language models have advanced the field of abstractive summarization, but they are resource-intensive and face significant challenges in retaining key information across lengthy documents, which we call being "lost in the middle". To address these issues, we propose a hybrid summarization approach that combines extractive and abstractive techniques.
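The Method summary above mentions using a Markov chain graph to decide the order in which cluster summaries appear. The sketch below shows one possible reading of that step, and it is an assumption rather than the authors' algorithm: given a cluster label for each chunk in document order (the clustering itself is taken as already done), it counts transitions between consecutive chunks and then walks greedily from the first chunk's cluster to the most frequently following unvisited cluster. The function names and the tie-breaking rule are hypothetical.

```python
from collections import defaultdict

def transition_counts(labels):
    """Count how often cluster a is immediately followed by cluster b."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(labels, labels[1:]):
        if a != b:  # self-transitions carry no ordering information
            counts[a][b] += 1
    return counts

def order_clusters(labels):
    """Greedy walk over the transition graph, starting at the first chunk's cluster.

    Ties (including the case where no transition was observed) are broken
    by whichever cluster appears earliest in the document.
    """
    counts = transition_counts(labels)
    current = labels[0]
    order = [current]
    remaining = set(labels) - {current}
    while remaining:
        current = min(
            remaining,
            key=lambda c: (-counts[current].get(c, 0), labels.index(c)),
        )
        order.append(current)
        remaining.discard(current)
    return order

# Hypothetical cluster labels for nine chunks, in document order.
chunk_labels = [0, 0, 2, 2, 1, 2, 1, 1, 3]
summary_order = order_clusters(chunk_labels)
print(summary_order)  # [0, 2, 1, 3]
```

Per-cluster summaries would then be concatenated in `summary_order`. A fuller version might normalize the counts into transition probabilities or search for a globally best path instead of walking greedily.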
Our method splits the document into smaller text chunks, clusters their vector embeddings, generates a summary for each cluster that represents a key idea in the document, and constructs the final summary by relying on a Markov chain graph when selecting the semantic order of ideas.
[44] Statistical Multicriteria Evaluation of LLM-Generated Text
Esteban Garces Arias,Hannah Blocher,Julian Rodemann,Matthias Aßenmacher,Christoph Jansen
Main category: cs.CL
TL;DR: This paper introduces GSD-front, a new method for evaluating LLM-generated text quality across multiple dimensions without relying on single-metric evaluations or arbitrary metric weighting.
Details
Motivation: Current methods for assessing the quality of LLM-generated text rely on isolated metrics or simplistic aggregations that do not adequately capture trade-offs between key indicators like coherence, diversity, and fluency. Method: Adapting a framework based on Generalized Stochastic Dominance (GSD) to evaluate multiple dimensions of text quality simultaneously, while respecting their different measurement scales. Result: The proposed GSD-front framework successfully identifies statistically significant performance differences among decoding strategies while considering deviations from standard sampling assumptions. Conclusion: The GSD-front approach provides a more effective and statistically grounded method for evaluating LLM-generated text by addressing the limitations of traditional evaluation techniques. Abstract: Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. 
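To illustrate what it means to compare decoding strategies through a partial order instead of a weighted sum, the sketch below computes a plain componentwise-dominance (Pareto) front. This is a deliberately simplified stand-in and not the GSD-front itself: Generalized Stochastic Dominance also handles mixed ordinal and cardinal scales and comes with statistical tests, none of which is reproduced here. The strategy names and scores are made-up toy values.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated(scores):
    """Names of the strategies that no other strategy dominates."""
    return sorted(
        name for name, s in scores.items()
        if not any(dominates(t, s)
                   for other, t in scores.items() if other != name)
    )

# Toy (coherence, diversity, fluency) scores, higher is better. Illustrative only.
toy_scores = {
    "greedy": (0.9, 0.2, 0.8),
    "top_k": (0.7, 0.6, 0.7),
    "top_p": (0.7, 0.7, 0.7),
    "temperature": (0.5, 0.6, 0.6),
}
front = non_dominated(toy_scores)
print(front)  # ['greedy', 'top_p']
```

Here `greedy` and `top_p` remain incomparable: each wins on some metric, and no weighting is imposed to force a single ranking. That incomparability is the property a partial order preserves.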
By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
[45] Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution
Patrik Stano,Aleš Horák
Main category: cs.CL
TL;DR: This paper compares prompt-based large language models and fine-tuned compact models for Czech anaphora resolution, showing that the latter achieves higher accuracy with fewer resources.
Details
Motivation: Anaphora resolution is essential for natural language understanding, especially in morphologically rich languages like Czech. This work aims to compare modern approaches—prompt engineering with LLMs versus fine-tuning compact models—to determine their effectiveness on Czech text. Method: A comparative evaluation was conducted using a dataset from the Prague Dependency Treebank. Instruction-tuned LLMs (e.g., Mistral Large 2, Llama 3) were tested with various prompt templates, while compact generative models (mT5, Mistral) were fine-tuned specifically for Czech anaphora resolution. Result: Prompting achieved promising few-shot results (up to 74.5% accuracy), but fine-tuned models significantly outperformed them, reaching up to 88% accuracy. Fine-tuned models also required fewer computational resources compared to LLMs. Conclusion: The study concludes that fine-tuned models, particularly mT5-large, outperform prompt-engineered LLMs in Czech anaphora resolution with up to 88% accuracy and lower computational demands. Abstract: Anaphora resolution plays a critical role in natural language understanding, especially in morphologically rich languages like Czech. This paper presents a comparative evaluation of two modern approaches to anaphora resolution on Czech text: prompt engineering with large language models (LLMs) and fine-tuning compact generative models. Using a dataset derived from the Prague Dependency Treebank, we evaluate several instruction-tuned LLMs, including Mistral Large 2 and Llama 3, using a series of prompt templates. We compare them against fine-tuned variants of the mT5 and Mistral models that we trained specifically for Czech anaphora resolution. Our experiments demonstrate that while prompting yields promising few-shot results (up to 74.5% accuracy), the fine-tuned models, particularly mT5-large, outperform them significantly, achieving up to 88% accuracy while requiring fewer computational resources. 
We analyze performance across different anaphora types, antecedent distances, and source corpora, highlighting key strengths and trade-offs of each approach.
[46] InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating
Fuyu Wang,Jiangtong Li,Kun Zhu,Changjun Jiang
Main category: cs.CL
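TL;DR: This paper proposes a new analysis framework for debating tasks, comprising a multi-dimensional evaluation system, InspireScore, and an optimized debating framework, InspireDebate; experiments show that both outperform existing methods, in correlation with expert evaluation and in debate performance respectively.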
Details
Motivation: Existing LLM-based debating systems focus mainly on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity, and they lack a structured method for optimizing across multiple dimensions. Method: A dual-component framework is proposed: (1) InspireScore, a multi-dimensional evaluation system incorporating subjective criteria and objective metrics; and (2) InspireDebate, a debating framework with phased optimization through chain-of-thought reasoning enhancement, multi-dimensional Direct Preference Optimization, and web-based Retrieval Augmented Generation. Result: InspireScore achieves 44% higher correlation with expert judgments than existing methods; InspireDebate shows significant improvements as a debating framework, outperforming baseline models by 57%. Conclusion: Empirical evaluations show that InspireScore achieves 44% higher correlation with expert judgments than existing methods, while InspireDebate shows significant improvements as a debating framework, outperforming baseline models by 57%. Abstract: With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions, including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement, thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) InspireScore, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) InspireDebate, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that InspireScore achieves 44% higher correlation with expert judgments compared to existing methods, while InspireDebate shows significant improvements, outperforming baseline models by 57%.
Source code is available at https://github.com/fywang12/InspireDebate.
[47] Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
Yicheng Fu,Zhemin Huang,Liuxin Yang,Yumeng Lu,Zhongdongming Dai
Main category: cs.CL
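TL;DR: This paper introduces Chengyu-Bench, a new benchmark for Chinese idiom understanding, and finds that existing language models perform well at judging sentiment but still have considerable room for improvement in understanding and correctly using idioms.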
Details
Motivation: Chinese idioms carry rich historical and cultural connotations, and literal translations often fail to convey their meaning accurately, which poses a challenge for language models. Existing tests focus on narrow tasks such as multiple-choice questions, isolated translation, or simple paraphrasing, so a more comprehensive evaluation is needed. Method: A comprehensive benchmark with three tasks is introduced: (1) judging the sentiment polarity of idioms; (2) detecting whether an idiom is used appropriately in context; and (3) filling blanks in passages without options. Evaluation uses 2,937 human-verified examples. Result: Leading LLMs achieve over 95% accuracy on the sentiment judgment task, but only about 85% on appropriateness detection and only about 40% top-1 accuracy on open cloze. Error analysis shows that most mistakes stem from fundamental misunderstandings of idiom meanings. Conclusion: Chengyu-Bench is a new comprehensive benchmark for evaluating language models' understanding of Chinese idioms, particularly their grasp of cultural and contextual nuances. Results show that existing LLMs perform well at sentiment judgment but still struggle considerably with correct idiom usage. Abstract: Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks - multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: https://github.com/sofyc/ChengyuBench.
[48] Mental Health Equity in LLMs: Leveraging Multi-Hop Question Answering to Detect Amplified and Silenced Perspectives
Batool Haider,Atmika Gorti,Aman Chadha,Manas Gaur
Main category: cs.CL
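TL;DR: This study proposes a multi-hop question answering framework for detecting intersectional biases of large language models in the mental health domain, and introduces two debiasing techniques to reduce them.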
Details
Motivation: Large language models may propagate biases and reinforce stigma against marginalized groups, yet systematic methods for detecting these biases are lacking. Method: Content from the IMHI dataset is analyzed through a multi-hop question answering framework, with systematic tagging across age, race, gender, and socioeconomic status; four large language models are evaluated and debiasing techniques are applied. Result: Systematic disparities related to demographics and mental health conditions were found, and the multi-hop question answering approach was shown to detect bias amplification points better than conventional methods; applying the debiasing techniques reduced bias by 66-94%. Conclusion: The study reveals critical areas where large language models reproduce biases in mental healthcare, providing actionable insights for developing equitable AI. Abstract: Large Language Models (LLMs) in mental healthcare risk propagating biases that reinforce stigma and harm marginalized groups. While previous research identified concerning trends, systematic methods for detecting intersectional biases remain limited. This work introduces a multi-hop question answering (MHQA) framework to explore LLM response biases in mental health discourse. We analyze content from the Interpretable Mental Health Instruction (IMHI) dataset across symptom presentation, coping mechanisms, and treatment approaches. Using systematic tagging across age, race, gender, and socioeconomic status, we investigate bias patterns at demographic intersections. We evaluate four LLMs: Claude 3.5 Sonnet, Jamba 1.6, Gemma 3, and Llama 4, revealing systematic disparities across sentiment, demographics, and mental health conditions. Our MHQA approach demonstrates superior detection compared to conventional methods, identifying amplification points where biases magnify through sequential reasoning. We implement two debiasing techniques: Roleplay Simulation and Explicit Bias Reduction, achieving 66-94% bias reductions through few-shot prompting with BBQ dataset examples. These findings highlight critical areas where LLMs reproduce mental healthcare biases, providing actionable insights for equitable AI development.
[49] The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English
Tom S Juzek
Main category: cs.CL
TL;DR: The Syntactic Acceptability Dataset includes 1,000 English sequences labeled for grammar and acceptability, showing that machine learning models can better predict acceptability than grammaticality, making it a valuable tool for linguistic and computational research.
Details
Motivation: To provide a publicly accessible dataset for syntax and computational linguistics research that combines both grammatical and acceptability judgments, addressing current debates in the field. Method: The dataset includes 1,000 English sequences labeled for grammatical and acceptability status, with data from textbooks and the journal Linguistic Inquiry. Grammatical status was extracted from literature, while acceptability was determined via crowdsourcing. Result: Grammaticality and acceptability judgments converge in about 83% of cases, 'in-betweenness' is common, and machine learning models perform better at predicting acceptability than grammaticality. Conclusion: The Syntactic Acceptability Dataset serves as a significant resource for syntax and computational linguistics research, demonstrating that acceptability judgments are more predictable using machine learning models compared to grammaticality judgments. Abstract: We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: Half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure a representation of the contemporary discourse. Each entry is labeled with its grammatical status ("well-formedness" according to syntactic formalisms) extracted from the literature, as well as its acceptability status ("intuitive goodness" as determined by native speakers) obtained through crowdsourcing, with highest experimental standards. Even in its preliminary form, this dataset stands as the largest of its kind that is publicly accessible. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics: We observe that grammaticality and acceptability judgments converge in about 83% of the cases and that "in-betweenness" occurs frequently. 
This corroborates existing research. We also find that while machine learning models struggle with predicting grammaticality, they perform considerably better in predicting acceptability. This is a novel finding. Future work will focus on expanding the dataset.
[50] $φ^{\infty}$: Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models
Bugra Kilictas,Faruk Alpay
Main category: cs.CL
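TL;DR: This paper studies recursive semantic drift caused by the em dash token in autoregressive transformer language models and proposes a new method that suppresses the problematic token without retraining the model.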
Details
Motivation: The authors identify a critical vulnerability in autoregressive transformer language models in which the em dash token induces recursive semantic drift, leading to clause boundary hallucination and embedding space entanglement. Method: Through formal analysis of token-level perturbations in semantic lattices, a solution is proposed that combines symbolic clause purification via the phi-infinity operator with targeted embedding matrix realignment. Result: Experimental validation shows significant improvements in generation consistency and topic maintenance, and the method preserves semantic coherence with fixed-point convergence guarantees. Conclusion: The paper proposes a new method combining symbolic clause purification and embedding matrix realignment that fully suppresses problematic tokens without retraining the model. The approach offers a general framework for identifying and mitigating token-level vulnerabilities in foundation models, and it extends to recursive instabilities in neural text generation systems. Abstract: We identify a critical vulnerability in autoregressive transformer language models where the em dash token induces recursive semantic drift, leading to clause boundary hallucination and embedding space entanglement. Through formal analysis of token-level perturbations in semantic lattices, we demonstrate that em dash insertion fundamentally alters the model's latent representations, causing compounding errors in long-form generation. We propose a novel solution combining symbolic clause purification via the phi-infinity operator with targeted embedding matrix realignment. Our approach enables total suppression of problematic tokens without requiring model retraining, while preserving semantic coherence through fixed-point convergence guarantees. Experimental validation shows significant improvements in generation consistency and topic maintenance. This work establishes a general framework for identifying and mitigating token-level vulnerabilities in foundation models, with immediate implications for AI safety, model alignment, and robust deployment of large language models in production environments. The methodology extends beyond punctuation to address broader classes of recursive instabilities in neural text generation systems.
[51] Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models
Ruixuan Deng,Xiaoyang Hu,Miles Gilberti,Shane Storks,Aman Taxali,Mike Angstadt,Chandra Sripada,Joyce Chai
Main category: cs.CL
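TL;DR: This paper proposes a method based on the coactivation of sparse autoencoder features to identify semantic components in large language models, and shows that manipulating these components enables precise control over model outputs.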
Details
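Motivation: To understand how large language models organize and process knowledge internally, and whether model behavior can be precisely steered by manipulating specific components within the model. Method: The researchers use the coactivation of sparse autoencoder (SAE) features collected from a small number of prompts to identify semantically coherent, context-consistent network components. They then ablate and amplify semantic components on country-relation tasks and observe the changes in model outputs. Result: Most country-related semantic components appear in early layers of the model, whereas the more abstract relation components are concentrated in later layers; within relation components, nodes from later layers have a stronger causal impact on model outputs. Composing relation and country components yields compound counterfactual outputs. Conclusion: The study shows that knowledge within large language models (LLMs) has a modular organization, and that manipulating specific semantic components enables predictable and counterfactual control over model outputs. It also advances methods for efficient, targeted model manipulation. Abstract: We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on country-relation tasks, we show that ablating semantic components for countries and relations changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and country components yields compound counterfactual outputs. We find that, whereas most country components emerge from the very first layer, the more abstract relation components are concentrated in later layers. Furthermore, within relation components themselves, nodes from later layers tend to have a stronger causal impact on model outputs. Overall, these findings suggest a modular organization of knowledge within LLMs and advance methods for efficient, targeted model manipulation.
[52] QuranMorph: Morphologically Annotated Quranic Corpus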
Diyam Akra,Tymaa Hammouda,Mustafa Jarrar
Main category: cs.CL
TL;DR: This paper introduces QuranMorph, a manually annotated corpus for the Quran, which enhances linguistic research through integration with various resources.
Details
Motivation: To create a rich morphologically annotated corpus for the Quran that can be linked to other linguistic resources for enhanced Quranic research. Method: Manual lemmatization and part-of-speech tagging of Quranic tokens by expert linguists using the Qabas database and the SAMA/Qabas tagset. Result: Development of the QuranMorph corpus containing 77,429 manually annotated tokens, interlinked with multiple linguistic databases and tools. Conclusion: The QuranMorph corpus was successfully developed and interconnected with various linguistic resources, making it a valuable open-source tool for Quranic studies. Abstract: We present the QuranMorph corpus, a morphologically annotated corpus for the Quran (77,429 tokens). Each token in QuranMorph was manually lemmatized and tagged with its part-of-speech by three expert linguists. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database linked with 110 lexicons and corpora of 2 million tokens. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset, which encompasses 40 tags. As shown in this paper, this rich lemmatization and POS tagset enabled the QuranMorph corpus to be inter-linked with many linguistic resources. The corpus is open-source and publicly available as part of the SinaLab resources at (https://sina.birzeit.edu/quran).
[53] CareLab at #SMM4H-HeaRD 2025: Insomnia Detection and Food Safety Event Extraction with Domain-Aware Transformers
Zihan Liang,Ziwen Pan,Sumon Kanti Dey,Azra Ismail
Main category: cs.CL
TL;DR: This paper describes the system submitted to the SMM4H-HeaRD 2025 shared tasks, which achieved first place in Task 5 Subtask 1.
Details
Motivation: To participate in the SMM4H-HeaRD 2025 shared tasks and optimize system performance for better results. Method: Encoder-based models (e.g., RoBERTa) were used alongside GPT-4 for data augmentation, with subtask-specific adaptations. Result: First place in Task 5 Subtask 1 with an F1 score of 0.958. Conclusion: By employing advanced models and techniques, the system performed strongly in the SMM4H-HeaRD 2025 shared tasks. Abstract: This paper presents our system for the SMM4H-HeaRD 2025 shared tasks, specifically Task 4 (Subtasks 1, 2a, and 2b) and Task 5 (Subtasks 1 and 2). Task 4 focused on detecting mentions of insomnia in clinical notes, while Task 5 addressed the extraction of food safety events from news articles. We participated in all subtasks and report key findings across them, with particular emphasis on Task 5 Subtask 1, where our system achieved strong performance, securing first place with an F1 score of 0.958 on the test set. To attain this result, we employed encoder-based models (e.g., RoBERTa), alongside GPT-4 for data augmentation. This paper outlines our approach, including preprocessing, model architecture, and subtask-specific adaptations.
[54] Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review
Bushra Asseri,Estabrag Abdelaziz,Areej Al-Wabil
Main category: cs.CL
TL;DR: This study reviews prompt engineering strategies to reduce cultural bias in large language models toward Arabs and Muslims, identifying five key approaches with varying effectiveness and highlighting the need for further research and culturally adaptive solutions.
Details
Motivation: Cultural bias in LLMs perpetuates harmful stereotypes against Arabs and Muslims, raising ethical concerns. Prompt engineering strategies targeting this issue remain understudied despite growing recognition of bias in AI. Method: Mixed-methods systematic review following PRISMA guidelines and Kitchenham's methodology, analyzing 8 empirical studies published between 2021-2024 on bias mitigation strategies in LLMs. Result: Five prompt engineering approaches were identified: cultural prompting, affective priming, self-debiasing techniques, structured multi-step pipelines, and parameter-optimized continuous prompts. Structured multi-step pipelines showed the highest effectiveness (up to 87.7% bias reduction), while cultural prompting offered broader accessibility. Effectiveness varied across bias types, with some being more resistant to prompt-based mitigation. Conclusion: Prompt engineering can effectively mitigate cultural bias in large language models (LLMs), particularly toward Arabs and Muslims, with structured multi-step pipelines being the most effective approach. However, there is a significant research gap, emphasizing the need for more culturally adaptive techniques and evaluation resources. Abstract: Large language models have demonstrated remarkable capabilities across various domains, yet concerns about cultural bias - particularly towards Arabs and Muslims - pose significant ethical challenges by perpetuating harmful stereotypes and marginalization. Despite growing recognition of bias in LLMs, prompt engineering strategies specifically addressing Arab and Muslim representation remain understudied. This mixed-methods systematic review examines such techniques, offering evidence-based guidance for researchers and practitioners. Following PRISMA guidelines and Kitchenham's systematic review methodology, we analyzed 8 empirical studies published between 2021-2024 investigating bias mitigation strategies. 
Our findings reveal five primary prompt engineering approaches: cultural prompting, affective priming, self-debiasing techniques, structured multi-step pipelines, and parameter-optimized continuous prompts. Although all approaches show potential for reducing bias, effectiveness varied substantially across studies and bias types. Evidence suggests that certain bias types may be more resistant to prompt-based mitigation than others. Structured multi-step pipelines demonstrated the highest overall effectiveness, achieving up to 87.7% reduction in bias, though they require greater technical expertise. Cultural prompting offers broader accessibility with substantial effectiveness. These results underscore the accessibility of prompt engineering for mitigating cultural bias without requiring access to model parameters. The limited number of studies identified highlights a significant research gap in this critical area. Future research should focus on developing culturally adaptive prompting techniques, creating Arab and Muslim-specific evaluation resources, and integrating prompt engineering with complementary debiasing methods to address deeper stereotypes while maintaining model utility.
[55] Deciphering Emotions in Children Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications
Bushra Asseri,Estabraq Abdelaziz,Maha Al Mogren,Tayef Alhefdhi,Areej Al-Wabil
Main category: cs.CL
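TL;DR: This study examines how well GPT-4o and Gemini 1.5 Pro recognize emotions in Arabic children's picture books. Although GPT-4o performs better, both models show significant limitations in understanding and interpreting culturally specific and ambiguous emotions.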
Details
Motivation: Emotion recognition in multimodal AI systems is crucial for developing culturally responsive educational technologies, yet it remains under-studied in Arabic contexts, where culturally appropriate learning tools are urgently needed. Method: Two advanced multimodal large language models (GPT-4o and Gemini 1.5 Pro) were evaluated on emotion recognition in illustrations from Arabic children's stories. 75 images were tested under three prompting strategies (zero-shot, few-shot, and chain-of-thought), and model predictions were compared with human annotations based on Plutchik's emotional framework. Result: GPT-4o consistently outperformed Gemini across all conditions, with a best macro F1-score of 59% (chain-of-thought prompting) versus Gemini's best of 43%. Error analysis showed that valence inversions accounted for 60.7% of errors, and both models struggled with culturally nuanced emotions and ambiguous narrative contexts. Conclusion: GPT-4o outperforms Gemini 1.5 Pro at recognizing emotions in Arabic children's picture books, especially with chain-of-thought prompting. However, both struggle with cultural nuance and ambiguous narrative contexts, revealing fundamental limitations in current models' cultural understanding. Abstract: Emotion recognition capabilities in multimodal AI systems are crucial for developing culturally responsive educational technologies, yet remain underexplored for Arabic language contexts where culturally appropriate learning tools are critically needed. This study evaluates the emotion recognition performance of two advanced multimodal large language models, GPT-4o and Gemini 1.5 Pro, when processing Arabic children's storybook illustrations. We assessed both models across three prompting strategies (zero-shot, few-shot, and chain-of-thought) using 75 images from seven Arabic storybooks, comparing model predictions with human annotations based on Plutchik's emotional framework. GPT-4o consistently outperformed Gemini across all conditions, achieving the highest macro F1-score of 59% with chain-of-thought prompting compared to Gemini's best performance of 43%. Error analysis revealed systematic misclassification patterns, with valence inversions accounting for 60.7% of errors, while both models struggled with culturally nuanced emotions and ambiguous narrative contexts. These findings highlight fundamental limitations in current models' cultural understanding and emphasize the need for culturally sensitive training approaches to develop effective emotion-aware educational technologies for Arabic-speaking learners.
[56] Enhancing Entity Aware Machine Translation with Multi-task Learning
An Trieu,Phuong Nguyen,Minh Le Nguyen
Main category: cs.CL
TL;DR: This paper proposes a multi-task learning approach to improve Entity-aware machine translation by optimizing named entity recognition and translation tasks simultaneously.
Details
Motivation: The motivation stems from the challenges in Entity-aware machine translation, such as the lack of sufficient translation data for entities and the complexity of contextual processing required during translation. Method: The method involves using multi-task learning to optimize the two subtasks: named entity recognition and machine translation, leveraging their interdependence for improved results. Result: The proposed approach was evaluated on the dataset from Task 2 of the SemEval 2025 competition, demonstrating improvements in the performance of Entity-aware machine translation. Conclusion: The study concludes that applying multi-task learning to named entity recognition and machine translation subtasks enhances the performance of Entity-aware machine translation. Abstract: Entity-aware machine translation (EAMT) is a complicated task in natural language processing, due not only to the shortage of translation data for the entities to be translated but also to the complexity of the context that must be processed while translating those entities. In this paper, we propose a method that applies multi-task learning to optimize the performance of the two subtasks, named entity recognition and machine translation, which improves the final performance of the Entity-aware machine translation task. The results and analysis are based on the dataset provided by the organizer of Task 2 of the SemEval 2025 competition.
[57] TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance
Syed Mekael Wasti,Shou-Yi Hung,Christopher Collins,En-Shiun Annie Lee
Main category: cs.CL
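TL;DR: TranslationCorrect is an efficient framework that integrates machine translation, error prediction, and post-editing, improving translation efficiency and quality.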
Details
Motivation: Existing workflows for machine translation post-editing and research data collection are inefficient and disconnected, calling for an integrated solution. Method: An integrated framework, TranslationCorrect, was developed that combines machine translation generation, automated error prediction, and an intuitive post-editing interface, designed around human-computer interaction principles. Result: A user study shows that TranslationCorrect effectively reduces cognitive load and performs well in translation error correction, batch translation, and high-quality annotation output. Conclusion: TranslationCorrect improves translation efficiency and user satisfaction, showing significant advantages over traditional methods. Abstract: Machine translation (MT) post-editing and research data collection often rely on inefficient, disconnected workflows. We introduce TranslationCorrect, an integrated framework designed to streamline these tasks. TranslationCorrect combines MT generation using models like NLLB, automated error prediction using models like XCOMET or LLM APIs (providing detailed reasoning), and an intuitive post-editing interface within a single environment. It is built with human-computer interaction (HCI) principles in mind to minimize cognitive load, as confirmed by a user study. For translators, it enables them to correct errors and batch translate efficiently. For researchers, TranslationCorrect exports high-quality span-based annotations in the Error Span Annotation (ESA) format, using an error taxonomy inspired by Multidimensional Quality Metrics (MQM). These outputs are compatible with state-of-the-art error detection models and suitable for training MT or post-editing systems. Our user study confirms that TranslationCorrect significantly improves translation efficiency and user satisfaction over traditional annotation methods.
[58] Less Data Less Tokens: Multilingual Unification Learning for Efficient Test-Time Reasoning in LLMs
Kang Chen,Mengdi Zhang,Yixin Cao
Main category: cs.CL
TL;DR: This paper proposes $L^2$, a multi-lingual unification learning approach that improves reasoning performance and efficiency in large language models by leveraging diverse linguistic data with minimal resource use.
Details
Motivation: The motivation stems from the observed diversity in multi-lingual reasoning and the need to improve data and inference efficiency for test-time scaling of large language models. Method: The $L^2$ multi-lingual unification learning approach with a decoding intervention strategy was introduced. This involved further tuning based on two types of multi-lingual data: long chain-of-thought annotations in different languages and step-wise language mixtures. Result: The results show that even small amounts of multi-lingual data significantly enhance reasoning capabilities, reducing both required data volume and inference tokens while maintaining comparable performance. Conclusion: The paper concludes that the proposed $L^2$ method effectively addresses test-time scaling challenges in LLMs by enhancing performance and efficiency through multilingual learning, which is orthogonal to other data-efficient approaches. Abstract: This paper explores the challenges of test-time scaling of large language models (LLMs), regarding both the data and inference efficiency. We highlight the diversity of multi-lingual reasoning based on our pilot studies, and then introduce a novel approach, \(L^2\) multi-lingual unification learning with a decoding intervention strategy for further investigation. The basic idea of \(L^2\) is that the reasoning process varies across different languages, which may be mutually beneficial to enhance both model performance and efficiency. In specific, there are two types of multi-lingual data: the entire long chain-of-thought annotations in different languages and the step-wise mixture of languages. By further tuning based on them, we show that even small amounts of data can significantly improve reasoning capabilities. Our findings suggest that multilingual learning reduces both the required data and the number of inference tokens while maintaining a comparable performance. Furthermore, \(L^2\) is orthogonal to other data efficient methods. 
Thus, we also emphasize the importance of diverse data selection. The \(L^2\) method offers a promising solution to the challenges of data collection and test-time compute efficiency in LLMs.
[59] Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics
Yousang Cho,Key-Sun Choi
Main category: cs.CL
TL;DR: This study compares six evaluation metrics for capturing causal explanation quality in diagnostic reports and finds that GPT-Black demonstrates the strongest performance.
Details
Motivation: To determine how accurately different evaluation metrics capture the quality of causal explanations in diagnostic reports. Method: Compare six evaluation metrics across two input types with two weighting strategies. Result: GPT-Black showed the strongest discriminative power, GPT-White aligned with expert evaluations, and similarity-based metrics diverged from clinical reasoning quality. Conclusion: Metric selection and weighting significantly affect evaluation outcomes, and LLM-based evaluation is recommended for tasks requiring interpretability and causal reasoning. Abstract: This study investigates how accurately different evaluation metrics capture the quality of causal explanations in automatically generated diagnostic reports. We compare six metrics: BERTScore, Cosine Similarity, BioSentVec, GPT-White, GPT-Black, and expert qualitative assessment across two input types: observation-based and multiple-choice-based report generation. Two weighting strategies are applied: one reflecting task-specific priorities, and the other assigning equal weights to all metrics. Our results show that GPT-Black demonstrates the strongest discriminative power in identifying logically coherent and clinically valid causal narratives. GPT-White also aligns well with expert evaluations, while similarity-based metrics diverge from clinical reasoning quality. These findings emphasize the impact of metric selection and weighting on evaluation outcomes, supporting the use of LLM-based evaluation for tasks requiring interpretability and causal reasoning.
[60] Lemmatization as a Classification Task: Results from Arabic across Multiple Genres
Mostafa Saeed,Nizar Habash
Main category: cs.CL
TL;DR: This paper presents novel approaches for Arabic lemmatization using classification and semantic clustering, offering improved robustness and interpretability over existing methods.
Details
Motivation: Existing Arabic lemmatization tools face challenges due to inconsistent standards and limited genre coverage. Method: Framing lemmatization as classification into a Lemma-POS-Gloss tagset, leveraging machine translation and semantic clustering; evaluation of character-level sequence-to-sequence models. Result: The proposed approaches yield competitive performance, with classification and clustering showing superior robustness and interpretability. Conclusion: Classification and clustering methods offer more robust and interpretable outputs, setting new benchmarks for Arabic lemmatization. Abstract: Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.
[61] TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
Ce Li,Xiaofan Liu,Zhiyan Song,Ce Chi,Chen Zhao,Jingjing Yang,Zhendong Wang,Kexin Yang,Boshen Shi,Xing Wang,Chao Deng,Junlan Feng
Main category: cs.CL
TL;DR: This paper introduces TReB, a new comprehensive benchmark for evaluating large language models' abilities to reason with table-structured data, highlighting significant room for improvement in current models.
Details
Motivation: The motivation is to address the lack of an effective evaluation benchmark that fairly reflects the performances of large language models on broad table reasoning abilities, given the challenges posed by the hidden semantics, inherent complexity, and structured nature of table-structured data. Method: An iterative data processing procedure was used to construct a high-quality dataset. An evaluation framework with three distinct inference modes (TCoT, PoT, and ICoT) was created to measure table reasoning capabilities. Over 20 state-of-the-art LLMs were benchmarked using this framework. Result: A high-quality dataset and robust evaluation framework named TReB were developed, which have been made publicly available. The benchmarking of over 20 state-of-the-art LLMs proved the effectiveness of the framework, while also revealing that current LLMs have significant room for improvement in handling complex, real-world table-related tasks. Conclusion: TReB is a comprehensive table reasoning evolution benchmark that measures both shallow table understanding abilities and deep table reasoning abilities. It has 26 sub-tasks, and the experimental results show that there is still significant room for improvement in addressing complex and real-world table-related tasks. Abstract: The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One of these challenges is lacking an effective evaluation benchmark fairly reflecting the performances of LLMs on broad table reasoning abilities. In this paper, we fill in this gap, presenting a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities, a total of 26 sub-tasks. 
We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT, and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this framework and prove its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing complex, real-world table-related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on [HuggingFace] and the framework on [GitHub].
[62] MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models
Junjie Zhang,Guozheng Ma,Shunyu Liu,Haoyu Wang,Jiaxing Huang,Ting-En Lin,Fei Huang,Yongbin Li,Dacheng Tao
Main category: cs.CL
TL;DR: This paper proposes MeRF, a method that improves large language model reasoning by incorporating in-context motivational prompts aligned with reinforcement learning rewards, showing strong results on a logic puzzle benchmark.
Details
Motivation: Existing RLVR methods neglect the in-context learning ability of LLMs, a key strength prominently demonstrated through Chain-of-Thought prompting. This work explores how reinforcement learning can be better combined with this ability to improve LLM reasoning. Method: The paper introduces Motivation-enhanced Reinforcement Finetuning (MeRF), which injects reward specifications directly into the prompt as an in-context motivation to align model generation with optimization objectives. Result: Empirical evaluations show that MeRF achieves significant performance gains over baselines on the Knights and Knaves logic puzzle benchmark. Ablation studies confirm improved performance with greater consistency between in-context motivation and external reward, and the model demonstrates adaptability to misleading motivations. Conclusion: MeRF effectively enhances the reasoning capabilities of LLMs by integrating in-context motivation with reinforcement learning, demonstrating performance improvements on the Knights and Knaves logic puzzle benchmark. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Language Models (LLMs) to tackle complex reasoning tasks. However, existing RLVR methods overlook one of the most distinctive capabilities of LLMs, their in-context learning ability, as prominently demonstrated by the success of Chain-of-Thought (CoT) prompting. This motivates us to explore how reinforcement learning can be effectively combined with in-context learning to improve the reasoning capabilities of LLMs. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method enhancing reinforcement learning of LLMs by "telling LLMs the rules of the game".
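The prompt-level injection described in the Method summary above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `RewardRule` type, the `render_motivation` and `build_prompt` helpers, the rule wording, and the scores are hypothetical and are not taken from the MeRF paper.

```python
# Hypothetical sketch: rule wording, scores, and helper names are illustrative
# assumptions, not the reward specification used in the MeRF paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardRule:
    description: str  # human-readable condition that a verifier would check
    score: float      # reward granted when the condition holds


def render_motivation(rules: list[RewardRule]) -> str:
    """Describe the reward function in plain text so it can sit in the prompt."""
    lines = ["Your answer will be scored as follows:"]
    for rule in rules:
        lines.append(f"- {rule.description}: {rule.score:+g} points")
    return "\n".join(lines)


def build_prompt(puzzle: str, rules: list[RewardRule]) -> str:
    """Prepend the in-context motivation to the task description."""
    return f"{render_motivation(rules)}\n\nPuzzle:\n{puzzle}\n\nAnswer:"


# Made-up rules for a Knights and Knaves style puzzle.
EXAMPLE_RULES = [
    RewardRule("Final answer identifies every knight and knave correctly", 2.0),
    RewardRule("Final answer is wrong or incomplete", -1.5),
    RewardRule("Response does not follow the required format", -2.0),
]
```

In a setup like this, the same rules would also drive the external reward used during reinforcement learning, so the text the model reads and the score it receives stay consistent.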
Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to improve its responses with awareness of the optimization objective. This simple modification leverages the in-context learning ability of LLMs to align generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations on the Knights and Knaves (K&K) logic puzzle reasoning benchmark demonstrate that MeRF achieves substantial performance gains over baselines. Moreover, ablation studies show that performance improves with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement learning.
[63] Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance
Wael Etaiwi,Bushra Alhijawi
Main category: cs.CL
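TL;DR: This paper evaluates ChatGPT and DeepSeek on five key NLP tasks, finding that DeepSeek is stronger in classification stability and logical reasoning, while ChatGPT has the edge in nuanced understanding and tasks that require flexibility.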
Details
Motivation: As large language models (LLMs) are widely applied to natural language processing (NLP) tasks, comprehensively evaluating their effectiveness across different application scenarios has become essential. Although models such as ChatGPT and DeepSeek perform well in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. Method: ChatGPT and DeepSeek are evaluated under a structured experimental protocol on five key NLP tasks (sentiment analysis, topic classification, text summarization, machine translation, and textual entailment). Both models receive identical, neutral prompts and are tested on two benchmark datasets per task. Result: The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better on tasks requiring nuanced understanding and flexibility. Conclusion: The study concludes that DeepSeek excels in classification stability and logical reasoning, while ChatGPT does better on tasks requiring nuanced understanding and flexibility, providing valuable insights for selecting an appropriate LLM based on task requirements. Abstract: The increasing use of large language models (LLMs) in natural language processing (NLP) tasks has sparked significant interest in evaluating their effectiveness across diverse applications. While models like ChatGPT and DeepSeek have shown strong results in many NLP domains, a comprehensive evaluation is needed to understand their strengths, weaknesses, and domain-specific abilities. This is critical as these models are applied to various tasks, from sentiment analysis to more nuanced tasks like textual entailment and translation. This study aims to evaluate ChatGPT and DeepSeek across five key NLP tasks: sentiment analysis, topic classification, text summarization, machine translation, and textual entailment. A structured experimental protocol is used to ensure fairness and minimize variability. Both models are tested with identical, neutral prompts and evaluated on two benchmark datasets per task, covering domains like news, reviews, and formal/informal texts. The results show that DeepSeek excels in classification stability and logical reasoning, while ChatGPT performs better in tasks requiring nuanced understanding and flexibility. These findings provide valuable insights for selecting the appropriate LLM based on task requirements.
[64] End-to-End Spoken Grammatical Error Correction
Mengjie Qian,Rao Ma,Stefano Bannò,Mark J. F. Gales,Kate M. Knill
Main category: cs.CL
TL;DR: This paper explores an End-to-End framework for spoken grammatical error correction, addressing challenges like limited labeled data and error propagation through novel techniques such as pseudo-labeling, reference alignment, and confidence-based filtering.
Details
Motivation: Spoken Grammatical Error Correction (SGEC) poses additional challenges compared to written GEC due to disfluencies, transcription errors, and lack of structured input. Traditional cascaded pipelines are vulnerable to error propagation, necessitating an improved End-to-End (E2E) framework. Method: Comparison of cascaded, partial-cascaded, and End-to-End (E2E) architectures built on the Whisper foundation model; automatic pseudo-labeling framework to expand training data; reference alignment process to improve feedback precision; edit confidence estimation to exclude low-confidence edits. Result: Experiments showed that the proposed methods significantly boost E2E SGEC performance on both in-house Linguaskill (LNG) and public Speak & Improve (S&I) corpora. Conclusion: The proposed approaches, including a pseudo-labeling framework, reference alignment process, and edit confidence estimation, significantly enhance the performance of End-to-End SGEC systems. Abstract: Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners' speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partial-cascaded and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. 
To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Giving candidates feedback on their mistakes is an essential step toward improving their performance. In E2E systems the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed that aims to remove hypothesised edits that result from fluent transcription errors. Finally, these approaches are combined with an edit confidence estimation approach to exclude low-confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak & Improve (S&I) corpus show that the proposed approaches significantly boost E2E SGEC performance.
[65] When Fine-Tuning Fails: Lessons from MS MARCO Passage Ranking
Manu Pande,Shahil Kumar,Anay Yatin Damle
Main category: cs.CL
TL;DR: Fine-tuning pre-trained Transformer models yields poor performance on the MS MARCO passage ranking task.
Details
Motivation: To investigate the counterintuitive phenomenon in which fine-tuning pre-trained Transformer models yields poor performance on the MS MARCO passage ranking task. Method: Comprehensive experiments involving five model variants, including full-parameter fine-tuning and parameter-efficient LoRA adaptation. Result: All fine-tuning approaches underperform the base sentence-transformers/all-MiniLM-L6-v2 model (MRR@10: 0.3026). Conclusion: The paper concludes that fine-tuning pre-trained Transformer models degrades performance on the MS MARCO passage ranking task. Abstract: This paper investigates the counterintuitive phenomenon where fine-tuning pre-trained transformer models degrades performance on the MS MARCO passage ranking task. Through comprehensive experiments involving five model variants, including full parameter fine-tuning and parameter-efficient LoRA adaptations, we demonstrate that all fine-tuning approaches underperform the base sentence-transformers/all-MiniLM-L6-v2 model (MRR@10: 0.3026). Our analysis reveals that fine-tuning disrupts the optimal embedding space structure learned during the base model's extensive pre-training on 1 billion sentence pairs, including 9.1 million MS MARCO samples. UMAP visualizations show progressive embedding space flattening, while training dynamics analysis and computational efficiency metrics further support our findings. These results challenge conventional wisdom about transfer learning effectiveness on saturated benchmarks and suggest architectural innovations may be necessary for meaningful improvements.
[66] A Modular Taxonomy for Hate Speech Definitions and Its Impact on Zero-Shot LLM Classification Performance
Matteo Melis,Gabriella Lapesa,Dennis Assenmacher
Main category: cs.CL
TL;DR: This paper explores how varying definitions of hate speech affect language model performance, showing that specificity in definitions impacts results differently across models.
Details
Motivation: Hate speech detection is crucial for NLP applications aimed at social good. However, the lack of a consistent definition of hate speech creates ambiguity and affects the reliability of models, prompting an investigation into how different definitions impact performance. Method: The researchers systematically conducted a zero-shot evaluation using three large language models (LLMs) on three hate speech datasets. They employed different definitions of hate speech derived from existing literature and analyzed how these definitions affected model outcomes. Result: Different definitions of hate speech significantly affect model performance, with the level of impact depending on the specific elements included in the definition and the architecture of the model being used. Conclusion: The study concludes that varying definitions of hate speech, particularly their specificity, influence the performance of language models, though the impact differs across model architectures. Abstract: Detecting harmful content is a crucial task in the landscape of NLP applications for Social Good, with hate speech being one of its most dangerous forms. But what do we mean by hate speech, how can we define it, and how does prompting different definitions of hate speech affect model performance? The contribution of this work is twofold. At the theoretical level, we address the ambiguity surrounding hate speech by collecting and analyzing existing definitions from the literature. We organize these definitions into a taxonomy of 14 Conceptual Elements, building blocks that capture different aspects of hate speech definitions, such as references to the target of hate (individuals or groups) or to its potential consequences. At the experimental level, we employ the collection of definitions in a systematic zero-shot evaluation of three LLMs on three hate speech datasets representing different types of data (synthetic, human-in-the-loop, and real-world).
We find that choosing different definitions, i.e., definitions with a different degree of specificity in terms of encoded elements, impacts model performance, but this effect is not consistent across all architectures.
[67] Parallel Continuous Chain-of-Thought with Jacobi Iteration
Haoyi Wu,Zhihao Teng,Kewei Tu
Main category: cs.CL
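TL;DR: This paper proposes a new Parallel Continuous Chain-of-Thought method (PCCoT) that improves training and inference efficiency by processing latent thought tokens in parallel.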
Details
Motivation: Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models, but the sequential dependencies between latent thought tokens lead to long training times. Method: The paper proposes Parallel Continuous Chain-of-Thought (PCCoT), which uses Jacobi iteration to update the latent thought tokens in parallel. Result: Experiments show that, by choosing an appropriate number of iterations, the method achieves comparable or even better performance while saving nearly 50% of the training and inference time. Conclusion: PCCoT improves training and inference efficiency and shows better stability and robustness. Abstract: Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. By reasoning with continuous latent thought tokens, continuous CoT is able to perform implicit reasoning in a compact manner. However, the sequential dependencies between latent thought tokens spoil parallel training, leading to long training time. In this paper, we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially and thus improving both training and inference efficiency of continuous CoT. Experiments demonstrate that by choosing the proper number of iterations, we are able to achieve comparable or even better performance while saving nearly 50% of the training and inference time. Moreover, PCCoT shows better stability and robustness in the training process. Our code is available at https://github.com/whyNLP/PCCoT.
[68] Reply to "Emergent LLM behaviors are observationally equivalent to data leakage"
Ariel Flint Ashery,Luca Maria Aiello,Andrea Baronchelli
Main category: cs.CL
TL;DR: Despite worries about data contamination, the paper asserts that genuine emergent dynamics like self-organisation can be studied in populations of large language models, as shown through the emergence of social conventions.
Details
Motivation: There is a concern about data contamination affecting outcomes in simulations with large language models, which may hinder certain experiments with multi-agent models. Method: The paper uses a critique by Barrie and Törnberg of another study by Flint Ashery et al. to clarify the potential for studying emergent dynamics in LLM populations. Result: The paper establishes that despite concerns about data contamination, it is still possible to investigate genuinely emergent dynamics, such as self-organisation, in LLM populations, evidenced by the emergence of social conventions. Conclusion: It is feasible to study self-organisation and model-dependent emergent dynamics in populations of large language models, as demonstrated by the empirical observation of social conventions. Abstract: A potential concern when simulating populations of large language models (LLMs) is data contamination, i.e. the possibility that training data may shape outcomes in unintended ways. While this concern is important and may hinder certain experiments with multi-agent models, it does not preclude the study of genuinely emergent dynamics in LLM populations. The recent critique by Barrie and Törnberg [1] of the results of Flint Ashery et al. [2] offers an opportunity to clarify that self-organisation and model-dependent emergent dynamics can be studied in LLM populations, highlighting how such dynamics have been empirically observed in the specific case of social conventions.
[69] Semantic similarity estimation for domain specific data using BERT and other techniques
R. Prashanth
Main category: cs.CL
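TL;DR: This paper studies the use of different techniques (such as USE, InferSent, and BERT) for semantic similarity estimation and finds that BERT performs best on domain-specific data.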
Details
Motivation: Semantic similarity estimation is important in natural language processing and understanding and is widely applied in various downstream tasks. Method: The USE, InferSent, and BERT models are used to analyze two question-pair datasets: a domain-specific in-house dataset and the public Quora question pairs dataset. Result: The BERT model performs markedly better than the other methods, which is attributed to the fine-tuning step in its training process, allowing it to learn patterns from the training data used. Conclusion: BERT outperforms the other methods at semantic similarity estimation on the domain-specific dataset, indicating its suitability for domain-specific data. Abstract: Estimation of semantic similarity is an important research problem in both natural language processing and natural language understanding, with wide application in downstream tasks such as question answering, semantic search, information retrieval, document clustering, word-sense disambiguation and machine translation. In this work, we carry out the estimation of semantic similarity using different state-of-the-art techniques including the USE (Universal Sentence Encoder), InferSent and the most recent BERT, or Bidirectional Encoder Representations from Transformers, models. We use two question-pair datasets for the analysis: one is a domain-specific in-house dataset and the other is a public dataset, the Quora question pairs dataset. We observe that the BERT model gave much superior performance compared to the other methods. This is likely because of the fine-tuning procedure involved in its training process, which allows it to learn patterns based on the training data that is used. This work demonstrates the applicability of BERT to domain-specific datasets. We infer from the analysis that BERT is the best technique to use in the case of domain-specific data.
[70] The Anatomy of Speech Persuasion: Linguistic Shifts in LLM-Modified Speeches
Alisa Barkar,Mathieu Chollet,Matthieu Labeau,Beatrice Biancardi,Chloe Clavel
Main category: cs.CL
TL;DR: This study shows that GPT-4o modifies speech persuasiveness through structured linguistic changes, not human-like persuasion tactics.
Details
Motivation: To understand how large language models grasp persuasiveness in public speaking and whether their approach mirrors human strategies. Method: The study uses the 3MT French dataset, prompting GPT-4o to enhance or diminish persuasiveness in speech transcripts. Linguistic shifts are analyzed using a novel feature set integrating rhetorical devices and discourse markers. Result: GPT-4o was found to apply systematic stylistic modifications, particularly manipulating emotional words and sentence types like interrogatives and exclamations, rather than optimizing persuasiveness as humans might. Conclusion: GPT-4o manipulates specific linguistic features like emotional lexicon and syntactic structures to modify persuasiveness, indicating a systematic rather than human-like approach. Abstract: This study examines how large language models understand the concept of persuasiveness in public speaking by modifying speech transcripts from PhD candidates in the "Ma These en 180 Secondes" competition, using the 3MT French dataset. Our contributions include a novel methodology and an interpretable textual feature set integrating rhetorical devices and discourse markers. We prompt GPT-4o to enhance or diminish persuasiveness and analyze linguistic shifts between original and generated speech in terms of the new features. Results indicate that GPT-4o applies systematic stylistic modifications rather than optimizing persuasiveness in a human-like manner. Notably, it manipulates emotional lexicon and syntactic structures (such as interrogative and exclamatory clauses) to amplify rhetorical impact.
[71] ByteSpan: Information-Driven Subword Tokenisation
Zébulon Goriely,Suchir Salhan,Pietro Lesci,Julius Cheng,Paula Buttery
Main category: cs.CL
TL;DR: This paper proposes ByteSpan, a new subword tokenisation method that groups predictable byte sequences, showing improved morphological alignment and efficiency across multiple languages.
Details
Motivation: Dynamic tokenisation methods operating on bytes show similarities to computational models of word segmentation. This inspires exploring grouping predictable bytes into a fixed subword vocabulary. Method: Proposed ByteSpan, an information-driven subword tokeniser that identifies contiguous predictable byte sequences using an external byte-level LM during training. Result: Experiments showed that ByteSpan produces more morphologically aligned subwords for English and maintains efficiency in multilingual settings. Conclusion: ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English, and similar compression and Rényi efficiency across 25 languages. Abstract: Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model's prediction error. Inspired by this connection, we explore whether grouping predictable bytes, rather than pooling their representations, can yield a useful fixed subword vocabulary. We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences and group them into subwords. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English. Multilingual experiments show similar compression and Rényi efficiency for 25 languages.
[72] Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
Raquel Ferrando,Javier Conde,Gonzalo Martínez,Pedro Reviriego
Main category: cs.CL
TL;DR: This paper studies optimizing tokenizers for chatbot conversations and finds that doing so reduces token counts and yields energy savings.
Details
Motivation: The computational and energy costs of large language models (LLMs) have grown exponentially with increasing model sizes and widespread adoption by hundreds of millions of users. Because the tokenizer plays a key role in model efficiency, it is worth exploring whether costs can be reduced further by optimizing tokenizers for chatbot conversations. Method: The vocabularies of different tokenizers are redesigned using a publicly available corpus of chatbot conversations, and their performance in this domain is evaluated. Result: Conversation-optimized tokenizers consistently reduce the number of tokens in chatbot conversations, which could lead to energy savings of 5% to 10%. Conclusion: Optimizing tokenizers for chatbots effectively reduces the number of tokens in conversations, yielding meaningful energy savings, with minimal or even slightly positive impact on tokenization efficiency for the original training corpus. Abstract: The computational and energy costs of Large Language Models (LLMs) have increased exponentially, driven by growing model sizes and the massive adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is the computation of a token. Therefore, tokenizers play an important role in the efficiency of a model, and they are carefully optimized to minimize the number of tokens for the text in their training corpus. One of the most popular applications of LLMs is chatbots that interact with users. A key observation is that, for those chatbots, what is important is the performance of the tokenizer on the user text input and the chatbot responses. Those are most likely different from the text in the training corpus. So, a question that immediately arises is whether there is a potential benefit in optimizing tokenizers for chatbot conversations. In this paper, this idea is explored for different tokenizers by using a publicly available corpus of chatbot conversations to redesign their vocabularies and evaluate their performance in this domain. The results show that conversation-optimized tokenizers consistently reduce the number of tokens in chatbot dialogues, which can lead to meaningful energy savings, in the range of 5% to 10%, while having minimal or even slightly positive impact on tokenization efficiency for the original training corpus.
[73] Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
Christian Huber,Alexander Waibel
Main category: cs.CL
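TL;DR: This paper proposes a method for correcting recognition errors on the fly during inference, effectively improving recognition accuracy for unseen words, especially those whose pronunciation and spelling do not match.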
Details
Motivation: The motivation is to address the fact that existing context biasing methods still struggle with words that have a pronunciation-orthography mismatch, particularly since, in practice, neural sequence-to-sequence systems often fail to recognize words not seen during training (such as named entities, acronyms, or domain-specific words). Method: The paper proposes a recognition correction method that allows corrections to be added on the fly during inference, improving recognition accuracy for such challenging words. Result: The method substantially improves recognition of biased words, with a relative improvement in biased word error rate of up to 11%, while keeping overall performance stable. Conclusion: The paper concludes that the proposed method yields a relative improvement in biased word error rate of up to 11% while maintaining a competitive overall word error rate. Abstract: Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principle open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific special words. To address this problem, many context biasing methods have been proposed; however, for words with a pronunciation-orthography mismatch, these methods may still struggle. We propose a method which allows corrections of substitution errors to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that with this method we get a relative improvement in biased word error rate of up to 11%, while maintaining a competitive overall word error rate.
[74] Benchmarking the Pedagogical Knowledge of Large Language Models
Maxime Lelièvre,Amy Waldock,Meng Liu,Natalia Valdés Aspillaga,Alasdair Mackintosh,María José Ogando Portelo,Jared Lee,Paul Atherton,Robin A. A. Ince,Oliver G. B. Garrod
Main category: cs.CL
TL;DR: This paper introduces The Pedagogy Benchmark to assess large language models' understanding of teaching methods and special education needs, highlighting their potential impact on education and the importance of responsible AI deployment in this field.
Details
Motivation: Existing AI benchmarks primarily focus on content knowledge but neglect pedagogical understanding. This gap limits the evaluation of AI's ability to support education effectively, especially in addressing diverse learner needs and global learning challenges. Method: The paper introduces The Pedagogy Benchmark, a dataset built from teacher professional development exams, designed to evaluate Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. It outlines the methodology behind the benchmark and reports results for 97 models. Result: Results show that model accuracies on pedagogical knowledge questions range from 28% to 89%. The study also examines the trade-off between cost and accuracy and tracks the evolution of the Pareto frontier over time. Online leaderboards provide updated model rankings and interactive filtering options. Conclusion: The paper concludes that education-focused benchmarks like The Pedagogy Benchmark are essential for measuring models' ability to understand pedagogical concepts and support teaching practices, which can guide the responsible deployment of LLMs in educational settings. Abstract: Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. 
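The cost-accuracy trade-off mentioned in the Result summary above is commonly analyzed with a Pareto frontier: the set of models for which no other model is both cheaper and more accurate. A minimal sketch follows, assuming each model is described by a single cost figure and a single accuracy figure. The `ModelResult` type, the cost unit, and the usage numbers are illustrative assumptions, not values from the benchmark.

```python
# Illustrative sketch only: the type, the cost unit, and any numbers passed in
# are assumptions, not data from The Pedagogy Benchmark.
from typing import NamedTuple


class ModelResult(NamedTuple):
    name: str
    cost: float      # assumed unit, e.g. dollars per million tokens
    accuracy: float  # fraction of benchmark questions answered correctly


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Return the models that no cheaper-or-equal model matches or beats on accuracy."""
    # Sort by cost ascending; among equal costs, put the most accurate first.
    ordered = sorted(results, key=lambda r: (r.cost, -r.accuracy))
    frontier: list[ModelResult] = []
    best_accuracy = float("-inf")
    for result in ordered:
        # Keep a model only if it beats every cheaper (or equally priced) model.
        if result.accuracy > best_accuracy:
            frontier.append(result)
            best_accuracy = result.accuracy
    return frontier
```

For example, with made-up entries `("model-a", 0.5, 0.60)`, `("model-b", 1.0, 0.55)`, and `("model-c", 2.0, 0.80)`, the frontier keeps `model-a` and `model-c`, since `model-b` costs more than `model-a` while scoring lower. Recomputing the frontier on the models available at successive dates gives one way to chart its progression over time.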
Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at https://rebrand.ly/pedagogy which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models' capacities to understand pedagogical concepts, respond appropriately to learners' needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.
[75] Semantic-Preserving Adversarial Attacks on LLMs: An Adaptive Greedy Binary Search Approach
Chong Zhang,Xiang Li,Jia Wang,Shan Liang,Haochen Xue,Xiaobo Jin
Main category: cs.CL
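TL;DR: This study proposes AGBS, a new prompt optimization method that effectively improves adversarial sample generation for large language models while preserving the original semantics.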
Details
Motivation: Automatic prompt engineering is widely used in GUIs to refine user inputs and improve response accuracy, but the diversity of user requirements often leads to unintended misinterpretations, so an optimization mechanism that preserves semantic stability is needed. Method: The paper proposes the Adaptive Greedy Binary Search (AGBS) method, which dynamically evaluates the impact on LLM performance and generates robust adversarial samples. Result: Extensive experiments on open- and closed-source LLMs demonstrate that AGBS effectively improves attack efficacy while maintaining semantic consistency. Conclusion: AGBS performs well in balancing semantic consistency and attack efficacy, offering actionable insights for designing more reliable prompt optimization systems. Abstract: Large Language Models (LLMs) increasingly rely on automatic prompt engineering in graphical user interfaces (GUIs) to refine user inputs and enhance response accuracy. However, the diversity of user requirements often leads to unintended misinterpretations, where automated optimizations distort original intentions and produce erroneous outputs. To address this challenge, we propose the Adaptive Greedy Binary Search (AGBS) method, which simulates common prompt optimization mechanisms while preserving semantic stability. Our approach dynamically evaluates the impact of such strategies on LLM performance, enabling robust adversarial sample generation. Through extensive experiments on open- and closed-source LLMs, we demonstrate AGBS's effectiveness in balancing semantic consistency and attack efficacy. Our findings offer actionable insights for designing more reliable prompt optimization systems. Code is available at: https://github.com/franz-chang/DOBS
[76] ASP2LJ : An Adversarial Self-Play Laywer Augmented Legal Judgment Framework
Ao Chang,Tong Zhou,Yubo Chen,Delai Qiu,Shengping Liu,Kang Liu,Jun Zhao
Main category: cs.CL
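TL;DR: This paper proposes ASP2LJ, an adversarial self-play lawyer-augmented legal judgment framework, to address long-tail distributions and lawyer improvement in legal judgment prediction, and introduces RareCases, a dataset of rare legal cases.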
Details
Motivation: Legal judgment prediction faces two key challenges: long-tail distributions that degrade model performance, and neglect of the important role lawyers play in refining arguments. Method: The paper proposes the ASP2LJ framework, which comprises a case generation module and an adversarial self-play mechanism, and introduces the RareCases dataset. Result: Experimental results show that the method is effective on both the SimuCourt and RareCases datasets, improving the objectivity, fairness, and rationality of judicial decisions. Conclusion: The study contributes an integrated framework and a rare-case dataset, and publicly releases data and code to support research on automated judicial systems. Abstract: Legal Judgment Prediction (LJP) aims to predict judicial outcomes, including relevant legal charges, terms, and fines, which is a crucial task for Large Language Models (LLMs). However, LJP faces two key challenges: (1) Long Tail Distribution: Current datasets, derived from authentic cases, suffer from high human annotation costs and imbalanced distributions, leading to model performance degradation. (2) Lawyer's Improvement: Existing systems focus on enhancing judges' decision-making but neglect the critical role of lawyers in refining arguments, which limits overall judicial accuracy. To address these issues, we propose an Adversarial Self-Play Lawyer Augmented Legal Judgment Framework, called ASP2LJ, which integrates a case generation module to tackle long-tailed data distributions and an adversarial self-play mechanism to enhance lawyers' argumentation skills. Our framework enables a judge to reference evolved lawyers' arguments, improving the objectivity, fairness, and rationality of judicial decisions. We also introduce RareCases, a dataset for rare legal cases in China, which contains 120 tail-end cases. We demonstrate the effectiveness of our approach on the SimuCourt dataset and our RareCases dataset. Experimental results show our framework brings improvements, indicating its utility. Our contributions include an integrated framework, a rare-case dataset, and publicly released datasets and code to support further research in automated judicial systems.
[77] Existing LLMs Are Not Self-Consistent For Simple Tasks
Zhenru Lin,Jiawen Tao,Yang Yuan,Andrew Chi-Chih Yao
Main category: cs.CL
TL;DR: The study reveals that large language models are inconsistent even on simple tasks and proposes two methods to mitigate this.
Details
Motivation: Keeping the decisions of large language models transparent and trustworthy requires consistency in their internal reasoning. Method: The paper introduces two automated methods, a graph-based approach and an energy-based approach, to quantify and reduce inconsistency. Result: Even on simple tasks, smaller models are highly inconsistent, and state-of-the-art models such as DeepSeek-R1 and GPT-o4-mini are not fully consistent either. Conclusion: Addressing inconsistency in large language models is essential for building more reliable AI systems. Abstract: Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency, meaning no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods: a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
[78] RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
Arjun Mukerji,Michael L. Jackson,Jason Jones,Neil Sanghavi
Main category: cs.CL
TL;DR: This paper proposes RWESummary, a benchmark framework under MedHELM, to evaluate LLMs for summarizing real-world evidence studies, with Gemini 2.5 models performing best.
Details
Motivation: While LLMs have been widely assessed for general summarization and medical research assistance, they lack specific evaluation for summarizing structured outputs from real-world evidence studies. Method: The paper introduces RWESummary, which is built upon the MedHELM framework using Atropos Health proprietary data. It includes one scenario and three evaluations to benchmark LLMs in summarizing RWE studies. Result: Using 13 distinct RWE studies, the Gemini 2.5 models (Flash and Pro) showed the best overall performance in internal RWE summarization. Conclusion: RWESummary is proposed as a novel and useful foundation model benchmark for real-world evidence study summarization. Abstract: Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.
[79] MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation task
Jorge Iranzo-Sánchez,Javier Iranzo-Sánchez,Adrià Giménez,Jorge Civera,Alfons Juan
Main category: cs.CL
TL;DR: MLLP-VRAIN developed a modular cascade system that adapts pre-trained models to streaming scenarios for real-time translation of long-form speech, achieving a good balance between translation quality and latency without requiring extensive in-domain data or end-to-end training.
Details
Motivation: To address the unique challenges of real-time translation of long-form speech by creating a system that balances translation quality with latency. Method: A modular cascade system was developed adapting pre-trained models to streaming scenarios combining Whisper Large-V3-Turbo for ASR and NLLB-3.3B model for MT, using lightweight adaptation techniques, document-level adaptation with prefix training, adaptive emission policies, buffer management, and segmentation strategies. Result: The system achieved a BLEU score of 31.96 on the ACL60/60 dataset and a preliminary score of 29.8 BLEU on the IWSLT25Instruct test set, with a non-computational-aware StreamLAAL latency of 2.94 seconds. Conclusion: The study proves that with proper adaptation, pre-trained models can be effectively utilized in simultaneous translation systems for long-form content without the need for extensive in-domain data or specialized end-to-end training. Abstract: This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our submission addresses the unique challenges of real-time translation of long-form speech by developing a modular cascade system that adapts strong pre-trained models to streaming scenarios. We combine Whisper Large-V3-Turbo for ASR with the multilingual NLLB-3.3B model for MT, implementing lightweight adaptation techniques rather than training new end-to-end models from scratch. Our approach employs document-level adaptation with prefix training to enhance the MT model's ability to handle incomplete inputs, while incorporating adaptive emission policies including a wait-$k$ strategy and RALCP for managing the translation stream. Specialized buffer management techniques and segmentation strategies ensure coherent translations across long audio sequences. 
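The wait-$k$ emission policy named above can be illustrated with a minimal sketch. It follows the textbook token-level definition (read k source tokens, then alternate one write with one read) and is not the MLLP-VRAIN implementation; the function name `wait_k_schedule` and the READ/WRITE labels are assumptions made for illustration, and RALCP is not covered.

```python
from typing import List


def wait_k_schedule(num_source: int, num_target: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of a textbook wait-k policy.

    Target token t (1-indexed) is written once min(k + t - 1, num_source)
    source tokens have been read. Hypothetical helper, for illustration only.
    """
    actions: List[str] = []
    tokens_read = 0
    for t in range(1, num_target + 1):
        # Number of source tokens that must be available before writing token t.
        needed = min(k + t - 1, num_source)
        while tokens_read < needed:
            actions.append("READ")
            tokens_read += 1
        actions.append("WRITE")
    return actions


# With 5 source tokens, 5 target tokens and k = 2, the policy reads two tokens,
# then alternates, and flushes the last target token once the source is exhausted.
schedule = wait_k_schedule(num_source=5, num_target=5, k=2)
```

A larger k delays the first output and gives the translation model more context, which is the quality versus latency trade-off the abstract refers to.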
Experimental results on the ACL60/60 dataset demonstrate that our system achieves a favorable balance between translation quality and latency, with a BLEU score of 31.96 and non-computational-aware StreamLAAL latency of 2.94 seconds. Our final model achieves a preliminary score on the official test set (IWSLT25Instruct) of 29.8 BLEU. Our work demonstrates that carefully adapted pre-trained components can create effective simultaneous translation systems for long-form content without requiring extensive in-domain parallel data or specialized end-to-end training.
[80] STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning
Aryasomayajula Ram Bharadwaj
Main category: cs.CL
TL;DR: STUPID is a training-free method using a PID controller to dynamically modulate activation steering strength during inference, effectively reducing redundant reasoning while enhancing accuracy and computational efficiency.
Details
Motivation: Large language models often suffer from overthinking, generating excessive and redundant reasoning steps that increase computational costs and degrade performance. Current static steering approaches lack adaptability to dynamically adjust intervention strength based on real-time reasoning quality. Method: STUPID combines a chunk-level classifier to detect redundant reasoning patterns with a PID controller mechanism that adaptively adjusts activation steering intensity based on predicted redundancy probability during inference. Result: Experimental evaluation on GSM8K shows STUPID achieves a 6% improvement in accuracy while reducing token usage by 32% compared to static steering baselines. Conclusion: STUPID provides a dynamic reasoning calibration framework that maintains quality while significantly improving computational efficiency, outperforming static steering baselines. Abstract: Large Language Models employing extended chain-of-thought (CoT) reasoning often suffer from the overthinking phenomenon, generating excessive and redundant reasoning steps that increase computational costs while potentially degrading performance. While recent work has explored static steering approaches to mitigate this issue, they lack the adaptability to dynamically adjust intervention strength based on real-time reasoning quality. We propose STUPID (Steering Token Usage via PID controller), a novel training-free method that employs a PID controller to dynamically modulate activation steering strength during inference. Our approach combines a chunk-level classifier for detecting redundant reasoning patterns with a PID control mechanism that adaptively adjusts steering intensity based on the predicted redundancy probability. Experimental evaluation on GSM8K demonstrates that STUPID achieves a 6% improvement in accuracy while reducing token usage by 32%, outperforming static steering baselines. 
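The control loop described above can be sketched as a generic discrete PID controller that maps a per-chunk redundancy probability to a steering strength. This is a sketch under stated assumptions, not the paper's implementation: the gains, the target redundancy level, the clamping range, and the names `PIDController` and `steering_strengths` are all hypothetical, and the redundancy probabilities would come from the chunk-level classifier, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PIDController:
    """Generic discrete PID controller (illustrative gains, not from the paper)."""
    kp: float
    ki: float
    kd: float
    target: float            # assumed acceptable redundancy probability
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, measurement: float) -> float:
        # Positive error means the chunk looks more redundant than the target.
        error = measurement - self.target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def steering_strengths(redundancy_probs: Iterable[float],
                       controller: PIDController,
                       max_strength: float = 1.0) -> List[float]:
    """Map per-chunk redundancy probabilities to clamped steering strengths."""
    strengths = []
    for p in redundancy_probs:
        raw = controller.update(p)
        # Clamp to [0, max_strength]: no steering when redundancy is below target.
        strengths.append(min(max(raw, 0.0), max_strength))
    return strengths


# Hypothetical classifier outputs for three consecutive reasoning chunks.
example = steering_strengths(
    [0.3, 0.8, 0.1],
    PIDController(kp=1.0, ki=0.0, kd=0.0, target=0.3),
)
```

With only the proportional term active, steering is applied to the second chunk alone, the one whose predicted redundancy exceeds the target; the integral and derivative terms would additionally react to sustained or rapidly rising redundancy.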
Our method provides a principled framework for dynamic reasoning calibration that maintains reasoning quality while significantly improving computational efficiency.
[81] LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
Yuhao Wu,Yushi Bai,Zhiqiang Hu,Roy Ka-Wei Lee,Juanzi Li
Main category: cs.CL
TL;DR: This paper introduces LongWriter-Zero, an RL-based method for ultra-long text generation in LLMs that does not require synthetic or annotated data and outperforms existing large models.
Details
Motivation: Ultra-long text generation remains challenging for LLMs due to length limits and quality degradation. Existing approaches like LongWriter rely on costly and artificial synthetic SFT data, prompting the need for a more effective and data-efficient solution. Method: Reinforcement learning (RL) is applied starting from a base model, without annotated or synthetic data, with specialized reward models guiding the LLM toward better length control, writing quality, and structural formatting. Result: LongWriter-Zero achieves state-of-the-art results on long-form writing benchmarks, outperforming traditional SFT methods and even 100B+ models like DeepSeek R1 and Qwen3-235B. Conclusion: The proposed incentivization-based approach using reinforcement learning effectively enhances ultra-long text generation in LLMs without relying on annotated or synthetic data, outperforming traditional SFT methods and surpassing larger models. Abstract: Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on "teaching", which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. 
To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B
[82] Mechanistic Interpretability Needs Philosophy
Iwan Williams,Ninell Oldenburg,Ruchira Dhar,Joshua Hatherley,Constanza Fierro,Nina Rajcic,Sandrine R. Schiller,Filippos Stamatiou,Anders Søgaard
Main category: cs.CL
TL;DR: This paper argues for integrating philosophy into mechanistic interpretability research to enhance conceptual clarity, methodological rigor, and ethical considerations.
Details
Motivation: As mechanistic interpretability grows in influence, it is important to examine the assumptions, concepts, and explanatory strategies within MI research, which philosophy can help address. Method: The paper uses three open problems from the MI literature as examples to illustrate the value philosophy can add to MI research. Result: The paper outlines a path toward deeper interdisciplinary dialogue between philosophy and MI research. Conclusion: The paper concludes that philosophy should be an ongoing partner in mechanistic interpretability (MI) research to clarify concepts, refine methods, and assess epistemic and ethical stakes. Abstract: Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems. Taking three open problems from the MI literature as examples, this position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
[83] CommVQ: Commutative Vector Quantization for KV Cache Compression
Junyan Li,Yang Zhang,Muhammad Yusuf Hassan,Talha Chafekar,Tianle Cai,Zhile Ren,Pengsheng Guo,Foroozan Karimzadeh,Colorado Reed,Chong Wang,Chuang Gan
Main category: cs.CL
TL;DR: This paper proposes CommVQ, a method for significantly reducing memory usage in long-context LLM inference by compressing the KV cache using additive quantization and a RoPE-commutative codebook.
Details
Motivation: The key-value (KV) cache becomes a memory bottleneck on GPUs when using large language models (LLMs) for applications requiring long context lengths. This work aims to address this issue by reducing memory usage. Method: Commutative Vector Quantization (CommVQ) is introduced, which uses additive quantization and a RoPE-commutative codebook trained with an EM algorithm to compress the KV cache. Result: CommVQ achieves an 87.5% reduction in FP16 KV cache size with 2-bit quantization and enables 1-bit quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. Conclusion: The proposed CommVQ method effectively reduces memory usage for long-context LLM inference, enabling efficient integration of decoding into the self-attention mechanism while maintaining high accuracy. Abstract: Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. 
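Two facts behind these claims can be checked with a short sketch. First, storing 2 bits per value instead of 16 gives the quoted 87.5% reduction, assuming codebook and metadata overhead is ignored. Second, RoPE rotates each pair of feature dimensions by an angle, and any 2×2 block of the form [[x, -y], [y, x]] commutes with such a rotation. Whether this block form matches the paper's exact codebook parameterization is an assumption here, and all function names below are hypothetical.

```python
import math


def kv_cache_reduction(bits_quantized: float, bits_original: float = 16.0) -> float:
    """Fraction of KV cache memory saved, ignoring codebook/metadata overhead."""
    return 1.0 - bits_quantized / bits_original


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def rotation(theta: float):
    """2D rotation matrix, the per-pair operation applied by RoPE."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def commuting_block(x: float, y: float):
    """A 2x2 block of the form [[x, -y], [y, x]] (assumed codebook structure)."""
    return [[x, -y], [y, x]]


def max_abs_diff(a, b) -> float:
    """Largest element-wise absolute difference between two 2x2 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))


two_bit_saving = kv_cache_reduction(2)   # 1 - 2/16
one_bit_saving = kv_cache_reduction(1)   # 1 - 1/16

R = rotation(0.7)
C = commuting_block(1.5, -0.4)
# Rotating then applying C equals applying C then rotating (up to rounding).
commutation_gap = max_abs_diff(matmul2(R, C), matmul2(C, R))

# A generic matrix does not commute with the rotation.
G = [[1.0, 2.0], [3.0, 4.0]]
generic_gap = max_abs_diff(matmul2(R, G), matmul2(G, R))
```

Commutation matters because it lets the position-dependent rotation be moved past the codebook, so decoding can be folded into the attention computation instead of reconstructing and re-rotating every cached key.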
Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
[84] OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
Yiyou Sun,Shawn Hu,Georgia Zhou,Ken Zheng,Hannaneh Hajishirzi,Nouha Dziri,Dawn Song
Main category: cs.CL
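TL;DR: This study proposes the OMEGA benchmark to systematically evaluate the creative reasoning abilities of large language models in mathematics, and identifies both the limitations of current models on complex problems and the kinds of generalization that improve after fine-tuning.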
Details
Motivation: Although recent large language models have achieved remarkable results on Olympiad-level mathematics benchmarks, they often rely on a narrow set of strategies and perform poorly on problems that require new ways of thinking. Method: Based on Boden's typology of creativity, the authors design the OMEGA benchmark, which covers three generalization axes (exploratory, compositional, and transformative reasoning), and evaluate models on training-test pairs generated from templates. Result: Frontier LLMs degrade sharply as problem complexity increases. Fine-tuning Qwen-series models brings notable improvements in exploratory generalization, limited gains in compositional generalization, and almost no improvement in transformative reasoning. Conclusion: The OMEGA benchmark lays a foundation for advancing creative mathematical reasoning in large language models; current LLMs excel at mechanical proficiency but remain limited in genuinely creative problem solving. Abstract: Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity: (1) Exploratory: applying known problem solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative: adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. 
By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.
[85] ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Jiaru Zou,Ling Yang,Jingwen Gu,Jiahao Qiu,Ke Shen,Jingrui He,Mengdi Wang
Main category: cs.CL
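TL;DR: This paper proposes ReasonFlux-PRM, a novel trajectory-aware reward model that uses fine-grained trajectory supervision to better evaluate the reasoning processes of large language models, and achieves notable performance gains across a range of tasks.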
Details
Motivation: Existing PRMs are trained mainly on models' final outputs and struggle to evaluate the intermediate thinking trajectories generated by frontier reasoning models. Method: The authors design a novel trajectory-aware PRM that combines step-level and trajectory-level supervision and apply it in different settings, including data selection, reinforcement learning, and test-time scaling. Result: On multiple benchmarks, ReasonFlux-PRM-7B performs strongly, with average gains over strong baselines of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. Conclusion: ReasonFlux-PRM provides a more effective reward model for evaluating the reasoning trajectories of large language models and outperforms existing methods. Abstract: Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. 
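Reward-guided Best-of-N selection, item (iii) above, can be sketched in a few lines. The sketch assumes a scoring function is already available; the linear blend of step-level and trajectory-level scores in `trajectory_reward`, its `alpha` weight, and both function names are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def trajectory_reward(step_scores: Sequence[float],
                      trajectory_score: float,
                      alpha: float = 0.5) -> float:
    """Blend mean step-level score with a trajectory-level score.

    The linear blend and the alpha weight are illustrative assumptions.
    """
    step_mean = sum(step_scores) / len(step_scores)
    return alpha * step_mean + (1.0 - alpha) * trajectory_score


def best_of_n(candidates: List[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate response with the highest reward."""
    return max(candidates, key=reward_fn)


# Toy usage: rewards are looked up from a hand-written table standing in
# for a reward model's outputs.
toy_rewards = {
    "answer A": trajectory_reward([0.2, 0.4], 0.9),
    "answer B": trajectory_reward([0.9, 0.8], 0.7),
    "answer C": trajectory_reward([0.1, 0.1], 0.2),
}
chosen = best_of_n(list(toy_rewards), toy_rewards.__getitem__)
```

Sampling more candidates raises the chance that a good one exists, so the quality of the final answer then depends on how well the reward function ranks them.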
Projects: https://github.com/Gen-Verse/ReasonFlux
cs.CV [Back]
[86] Mechanistic Interpretability of Diffusion Models: Circuit-Level Analysis and Causal Validation
Dip Roy
Main category: cs.CV
TL;DR: This paper reveals fundamental algorithmic differences in how diffusion models handle synthetic vs. naturalistic image data, identifying eight specialized attention mechanisms and demonstrating causal relationships between circuit functions and model performance.
Details
Motivation: To understand the mechanistic principles and algorithmic differences in diffusion models when processing synthetic versus naturalistic data distributions, particularly focusing on computational pathways in image generation. Method: Quantitative circuit-level analysis was conducted through systematic intervention experiments on 2,000 synthetic and 2,000 CelebA facial images. The researchers measured computational complexity, attention specialization patterns, entropy divergence, and performance degradation after targeted ablations. Result: Real-world face processing requires circuits with higher computational complexity (complexity ratio = 1.084 ± 0.008, p < 0.001), with entropy divergence ranging from 0.015 to 0.166 across denoising steps. Eight functionally distinct attention mechanisms were identified, including edge detection (entropy = 3.18 ± 0.12), texture analysis (entropy = 4.16 ± 0.08), and semantic understanding (entropy = 2.67 ± 0.15). Targeted ablations caused 25.6% to 128.3% performance degradation. Conclusion: The study concludes that there are significant differences in how diffusion models process synthetic and naturalistic data, with real-world face processing demanding higher computational complexity. Eight distinct attention mechanisms were identified, each playing a specialized role in image generation. Abstract: We present a quantitative circuit-level analysis of diffusion models, establishing computational pathways and mechanistic principles underlying image generation processes. Through systematic intervention experiments across 2,000 synthetic and 2,000 CelebA facial images, we discover fundamental algorithmic differences in how diffusion architectures process synthetic versus naturalistic data distributions. 
Our investigation reveals that real-world face processing requires circuits with measurably higher computational complexity (complexity ratio = 1.084 ± 0.008, p < 0.001), exhibiting distinct attention specialization patterns with entropy divergence ranging from 0.015 to 0.166 across denoising timesteps. We identify eight functionally distinct attention mechanisms showing specialized computational roles: edge detection (entropy = 3.18 ± 0.12), texture analysis (entropy = 4.16 ± 0.08), and semantic understanding (entropy = 2.67 ± 0.15). Intervention analysis demonstrates critical computational bottlenecks where targeted ablations produce 25.6% to 128.3% performance degradation, providing causal evidence for identified circuit functions. These findings establish quantitative foundations for algorithmic understanding and control of generative model behavior through mechanistic intervention strategies.
[87] SRKD: Towards Efficient 3D Point Cloud Segmentation via Structure- and Relation-aware Knowledge Distillation
Yuqi Li,Junhao Dong,Zeyu Dong,Chuanguang Yang,Zhulin An,Yongjun Xu
Main category: cs.CV
TL;DR: This paper introduces SRKD, a knowledge distillation framework that efficiently transfers structural and semantic knowledge from a large model to a smaller one, enabling high-performance 3D point cloud segmentation with reduced computational demands.
Details
Motivation: To overcome the computational complexity and deployment limitations of large-scale transformer-based models in 3D point cloud segmentation. Method: A Structure- and Relation-aware Knowledge Distillation (SRKD) framework is introduced, incorporating an affinity matrix-based relation alignment module, cross-sample mini-batch construction, KL divergence for semantic distribution alignment, and ground-truth supervision. Result: The method achieves state-of-the-art performance while reducing model size from over 100M to under 15M parameters, demonstrating effectiveness and efficiency for real-world deployment. Conclusion: The proposed SRKD framework effectively transfers geometric and semantic knowledge from a large teacher model to a lightweight student model, achieving state-of-the-art performance with significantly reduced complexity. Abstract: 3D point cloud segmentation faces practical challenges due to the computational complexity and deployment limitations of large-scale transformer-based models. To address this, we propose a novel Structure- and Relation-aware Knowledge Distillation framework, named SRKD, that transfers rich geometric and semantic knowledge from a large frozen teacher model (>100M) to a lightweight student model (<15M). Specifically, we propose an affinity matrix-based relation alignment module, which distills structural dependencies from the teacher to the student through point-wise similarity matching, enhancing the student's capability to learn contextual interactions. Meanwhile, we introduce a cross-sample mini-batch construction strategy that enables the student to perceive stable and generalized geometric structure. This aligns across diverse point cloud instances of the teacher, rather than within a single sample. Additionally, KL divergence is applied to align semantic distributions, and ground-truth supervision further reinforces accurate segmentation. 
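The two distillation signals described above, relation alignment through affinity matrices and KL divergence between semantic distributions, can be sketched in plain Python. This is a minimal sketch under assumptions: cosine similarity as the point-wise similarity, mean squared error between affinity matrices, and all function names are illustrative choices, not the SRKD implementation.

```python
import math
from typing import List, Sequence


def cosine_affinity(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """N x N matrix of cosine similarities between per-point feature vectors."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    unit = [[x / n for x in f] for f, n in zip(features, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def relation_loss(teacher_feats, student_feats) -> float:
    """Mean squared difference between teacher and student affinity matrices.

    Feature widths may differ; both affinity matrices are N x N.
    """
    t = cosine_affinity(teacher_feats)
    s = cosine_affinity(student_feats)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(teacher_logits, student_logits) -> float:
    """KL(teacher || student) between the two class distributions of one point."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Teacher features are 3-dimensional, student features 2-dimensional, yet the
# pairwise structure of the two points is the same, so the relation loss is zero.
same_structure = relation_loss([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                               [[2.0, 0.0], [0.0, 3.0]])
disagreement = kl_divergence([2.0, 0.0], [0.0, 2.0])
```

Comparing affinity matrices rather than raw features is what lets a narrow student match a wide teacher: only the relationships between points need to agree, not the feature dimensions.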
Our method achieves state-of-the-art performance with significantly reduced model complexity, demonstrating its effectiveness and efficiency in real-world deployment scenarios. Our code is available at https://github.com/itsnotacie/SRKD.
[88] Fine-Scale Soil Mapping in Alaska with Multimodal Machine Learning
Yijun Lin,Theresa Chen,Colby Brungard,Grunwald Sabine,Sue Ives,Matt Macander,Timm Nawrocki,Yao-Yi Chiang,Nic Jelinski
Main category: cs.CV
TL;DR: This paper introduces MISO, a machine learning model for high-resolution soil mapping in Alaska, showing better performance than traditional methods in monitoring permafrost thaw and guiding future planning.
Details
Motivation: Fine-scale soil mapping in Alaska is crucial due to ecological importance and permafrost thaw risks from climate change, which threaten infrastructure and ecosystem services like soil carbon storage. Method: MISO integrates a geospatial foundation model, implicit neural representations, and contrastive learning to create high-resolution soil maps. It was compared with the Random Forest (RF) model using spatial cross-validation and regional analysis. Result: MISO demonstrated superior performance over RF in generalizing to unseen locations and achieving higher recall, making it more effective for monitoring permafrost thaw and environmental changes. Conclusion: The study concludes that MISO, an advanced ML approach, outperforms traditional models like RF in fine-scale soil mapping for permafrost regions, offering better generalization and recall for monitoring permafrost thaw. Abstract: Fine-scale soil mapping in Alaska, traditionally relying on fieldwork and localized simulations, remains a critical yet underdeveloped task, despite the region's ecological importance and extensive permafrost coverage. As permafrost thaw accelerates due to climate change, it threatens infrastructure stability and key ecosystem services, such as soil carbon storage. High-resolution soil maps are essential for characterizing permafrost distribution, identifying vulnerable areas, and informing adaptation strategies. We present MISO, a vision-based machine learning (ML) model to produce statewide fine-scale soil maps for near-surface permafrost and soil taxonomy. The model integrates a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness. We compare MISO with Random Forest (RF), a traditional ML model that has been widely used in soil mapping applications. 
Spatial cross-validation and regional analysis across Permafrost Zones and Major Land Resource Areas (MLRAs) show that MISO generalizes better to remote, unseen locations and achieves higher recall than RF, which is critical for monitoring permafrost thaw and related environmental processes. These findings demonstrate the potential of advanced ML approaches for fine-scale soil mapping and provide practical guidance for future soil sampling and infrastructure planning in permafrost-affected landscapes. The project will be released at https://github.com/knowledge-computing/Peatland-permafrost.
[89] RadarSeq: A Temporal Vision Framework for User Churn Prediction via Radar Chart Sequences
Sina Najafi,M. Hadi Sepanj,Fahimeh Jafari
Main category: cs.CV
TL;DR: This paper presents a new approach for predicting user churn in non-subscription gig platforms by using a temporally-aware computer vision framework.
Details
Motivation: Predicting user churn in non-subscription gig platforms poses unique challenges due to the absence of explicit labels and the dynamic nature of user behavior. Method: We propose a temporally-aware computer vision framework that models user behavioral patterns as a sequence of radar chart images, each encoding day-level behavioral features. By integrating a pretrained CNN encoder with a bidirectional LSTM, our architecture captures both spatial and temporal patterns underlying churn behavior. Result: Extensive experiments on a large real-world dataset demonstrate that our method outperforms classical models and ViT-based radar chart baselines, yielding gains of 17.7 in F1 score, 29.4 in precision, and 16.1 in AUC, along with improved interpretability. Conclusion: The proposed framework's modular design, explainability tools, and efficient deployment characteristics make it suitable for large-scale churn modeling in dynamic gig-economy platforms. Abstract: Predicting user churn in non-subscription gig platforms, where disengagement is implicit, poses unique challenges due to the absence of explicit labels and the dynamic nature of user behavior. Existing methods often rely on aggregated snapshots or static visual representations, which obscure temporal cues critical for early detection. In this work, we propose a temporally-aware computer vision framework that models user behavioral patterns as a sequence of radar chart images, each encoding day-level behavioral features. By integrating a pretrained CNN encoder with a bidirectional LSTM, our architecture captures both spatial and temporal patterns underlying churn behavior. Extensive experiments on a large real-world dataset demonstrate that our method outperforms classical models and ViT-based radar chart baselines, yielding gains of 17.7 in F1 score, 29.4 in precision, and 16.1 in AUC, along with improved interpretability. 
The framework's modular design, explainability tools, and efficient deployment characteristics make it suitable for large-scale churn modeling in dynamic gig-economy platforms.
[90] P2MFDS: A Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments
Haitian Wang,Yiren Wang,Xinyu Wang,Yumeng Miao,Yuliang Zhang,Yu Zhang,Atif Mansoor
Main category: cs.CV
TL;DR: This paper proposes P2MFDS, a privacy-preserving multimodal fall detection system combining millimeter-wave radar and 3D vibration sensing with deep learning, outperforming existing unimodal approaches in accuracy and recall for elderly fall detection in bathrooms.
Details
Motivation: With an aging global population, elderly fall detection, especially in bathrooms, is critical. Existing unimodal systems (e.g., WiFi-, infrared-, or mmWave-based) suffer from environmental interference and limited accuracy. The need for a non-intrusive, privacy-preserving solution motivates this work. Method: The authors propose a multimodal system combining millimeter-wave radar and 3D vibration sensing. They develop a sensor evaluation framework for modality selection and fusion, construct a large-scale real-world dataset, and design P2MFDS, a dual-stream deep learning network integrating CNN-BiLSTM-Attention and multi-scale CNN-SEBlock-Self-Attention modules to capture both macro- and micro-scale features. Result: P2MFDS achieves significant improvements in accuracy and recall compared to state-of-the-art approaches, demonstrating robust performance in complex bathroom environments. A large-scale, privacy-preserving multimodal dataset is also constructed and will be publicly released. Conclusion: The paper concludes that the proposed P2MFDS system significantly improves fall detection accuracy and recall over existing methods, addressing limitations of unimodal systems while preserving privacy. Abstract: By 2050, people aged 65 and over are projected to make up 16 percent of the global population. Aging is closely associated with increased fall risk, particularly in wet and confined environments such as bathrooms, where over 80 percent of falls occur. Although recent research has increasingly focused on non-intrusive, privacy-preserving approaches that do not rely on wearable devices or video-based monitoring, these efforts have not fully overcome the limitations of existing unimodal systems (e.g., WiFi-, infrared-, or mmWave-based), which are prone to reduced accuracy in complex environments. 
These limitations stem from fundamental constraints in unimodal sensing, including system bias and environmental interference, such as multipath fading in WiFi-based systems and drastic temperature changes in infrared-based methods. To address these challenges, we propose a Privacy-Preserving Multimodal Fall Detection System for Elderly People in Bathroom Environments. First, we develop a sensor evaluation framework to select and fuse millimeter-wave radar with 3D vibration sensing, and use it to construct and preprocess a large-scale, privacy-preserving multimodal dataset in real bathroom settings, which will be released upon publication. Second, we introduce P2MFDS, a dual-stream network combining a CNN-BiLSTM-Attention branch for radar motion dynamics with a multi-scale CNN-SEBlock-Self-Attention branch for vibration impact detection. By uniting macro- and micro-scale features, P2MFDS delivers significant gains in accuracy and recall over state-of-the-art approaches. Code and pretrained models will be made available at: https://github.com/HaitianWang/P2MFDS-A-Privacy-Preserving-Multimodal-Fall-Detection-Network-for-Elderly-Individuals-in-Bathroom.
[91] A Novel Multi-layer Task-centric and Data Quality Framework for Autonomous Driving
Yuhan Zhou,Haihua Chen,Kewei Sha
Main category: cs.CV
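TL;DR: The paper introduces a new data quality framework for autonomous vehicles that improves system performance and adaptability by mapping data quality to task requirements.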
Details
Motivation: Autonomous vehicles rely on multisource and multimodal data whose quality often varies due to environmental factors or sensor issues, yet researchers and practitioners concentrate heavily on models and algorithms while overlooking data quality. Method: The approach is a five-layer framework comprising a data layer, a data quality layer, a task layer, an application layer, and a goal layer, illustrated with a redundancy case study. Result: A redundancy study on the nuScenes dataset shows that partially removing redundancy from multisource image data can improve YOLOv8 object detection performance; further analysis identifies redundancy-related data quality issues in multimodal image and LiDAR data. Conclusion: The paper proposes a new task-centric data quality framework that addresses the mapping between data quality, task requirements, and performance goals in the autonomous vehicle domain. Abstract: The next-generation autonomous vehicles (AVs), embedded with frequent real-time decision-making, will rely heavily on a large volume of multisource and multimodal data. In real-world settings, the data quality (DQ) of different sources and modalities usually varies due to unexpected environmental factors or sensor issues. However, both researchers and practitioners in the AV field overwhelmingly concentrate on models/algorithms while undervaluing the DQ. To fulfill the needs of the next-generation AVs with guarantees of functionality, efficiency, and trustworthiness, this paper proposes a novel task-centric and data quality-based framework which consists of five layers: data layer, DQ layer, task layer, application layer, and goal layer. The proposed framework aims to map DQ with task requirements and performance goals. To illustrate, a case study investigating redundancy on the nuScenes dataset proves that partially removing redundancy on multisource image data could improve YOLOv8 object detection task performance. Analysis on multimodal data of image and LiDAR further presents existing redundancy DQ issues. This paper opens up a range of critical but unexplored challenges at the intersection of DQ, task orchestration, and performance-oriented system development in AVs. It is expected to guide the AV community toward building more adaptive, explainable, and resilient AVs that respond intelligently to dynamic environments and heterogeneous data streams. 
Code, data, and implementation details are publicly available at: https://anonymous.4open.science/r/dq4av-framework/README.md.
[92] Efficient Feedback Gate Network for Hyperspectral Image Super-Resolution
Xufei Wang,Mingjian Zhang,Fei Ge,Jinchen Zhu,Wen Sha,Jifen Ren,Zhimeng Hou,Shouguo Zheng,ling Zheng,Shizhuang Weng
Main category: cs.CV
TL;DR: This paper proposes an efficient feedback gate network for SHSR that enhances spatial-spectral features and outperforms existing methods.
Details
Motivation: Current SHSR methods do not thoroughly explore coherence along bands and spatial-spectral information, limiting their performance. Method: A group-based SHSR method named efficient feedback gate network is proposed, involving feedbacks, gate operations, large kernel convolutions, and spectral interactions. It includes SPDFM and SSRGM modules for enhancing spatial-spectral features. Result: The experimental results on three hyperspectral datasets show improved performance using the proposed method. Conclusion: The proposed network demonstrates superior performance over state-of-the-art methods in spectral fidelity and spatial content reconstruction. Abstract: Even without auxiliary images, single hyperspectral image super-resolution (SHSR) methods can be designed to improve the spatial resolution of hyperspectral images. However, failing to explore coherence thoroughly along bands and spatial-spectral information leads to the limited performance of the SHSR. In this study, we propose a novel group-based SHSR method termed the efficient feedback gate network, which uses various feedbacks and gate operations involving large kernel convolutions and spectral interactions. In particular, by providing different guidance for neighboring groups, we can learn rich band information and hierarchical hyperspectral spatial information using channel shuffling and dilated convolution in the shuffled and progressive dilated fusion module (SPDFM). Moreover, we develop a wide-bound perception gate block and a spectrum enhancement gate block to construct the spatial-spectral reinforcement gate module (SSRGM) and obtain highly representative spatial-spectral features efficiently. Additionally, we apply a three-dimensional SSRGM to enhance holistic information and coherence for hyperspectral data. 
The experimental results on three hyperspectral datasets demonstrate the superior performance of the proposed network over the state-of-the-art methods in terms of spectral fidelity and spatial content reconstruction.
[93] From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge
Muhammad Tayyab Khan,Lequn Chen,Zane Yong,Jun Ming Tan,Wenhe Feng,Seung Ki Moon
Main category: cs.CV
TL;DR: This paper proposes a hybrid vision-language framework using YOLOv11-obb and VLMs like Donut and Florence-2 to efficiently extract structured data from 2D engineering drawings, showing strong performance and practical application in manufacturing workflows.
Details
Motivation: Manual extraction of key information from 2D engineering drawings is slow and labor-intensive, while generic OCR models fail due to complex layouts and rotated text, leading to unreliable outputs. Method: A rotation-aware object detection model (YOLOv11-obb) is combined with a transformer-based vision-language parser. Annotation patches are localized and parsed using Donut and Florence-2 VLMs, with performance compared on four metrics. Result: Donut outperformed Florence-2 with 88.5% precision, 99.2% recall, and 93.5% F1-score, alongside an 11.5% hallucination rate. The framework was shown to support downstream manufacturing tasks effectively. Conclusion: The proposed hybrid vision-language framework successfully extracts structured information from 2D engineering drawings, demonstrating practical utility in modernizing drawing interpretation for manufacturing tasks. Abstract: Efficient and accurate extraction of key information from 2D engineering drawings is essential for advancing digital manufacturing workflows. Such information includes geometric dimensioning and tolerancing (GD&T), measures, material specifications, and textual annotations. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text, leading to incomplete and unreliable outputs. To address these challenges, we propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-obb) with a transformer-based vision-language parser. Our structured pipeline applies YOLOv11-OBB to localize annotations and extract oriented bounding box (OBB) patches, which are then parsed into structured outputs using a fine-tuned, lightweight vision-language model (VLM). We curate a dataset of 1,367 2D mechanical drawings annotated across nine key categories. 
YOLOv11-OBB is trained on this dataset to detect OBBs and extract annotation patches. These are parsed using two open-source VLMs: Donut and Florence-2. Both models are lightweight and well-suited for specialized industrial tasks under limited computational overhead. Following fine-tuning of both models on the curated dataset of image patches paired with structured annotation labels, a comparative experiment is conducted to evaluate parsing performance across four key metrics. Donut outperforms Florence-2, achieving 88.5% precision, 99.2% recall, and a 93.5% F1-score, with a hallucination rate of 11.5%. Finally, a case study demonstrates how the extracted structured information supports downstream manufacturing tasks such as process and tool selection, showcasing the practical utility of the proposed framework in modernizing 2D drawing interpretation.
[94] Spatial-Temporal Pre-Training for Embryo Viability Prediction Using Time-Lapse Videos
Zhiyi Shi,Junsik Kim,Helen Y. Yang,Yonghyun Song,Hyun-Jic Oh,Dalit Ben-Yosef,Daniel Needleman,Hanspeter Pfister
Main category: cs.CV
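TL;DR: This paper proposes STPT, a new method for embryo viability prediction in in vitro fertilization (IVF) that uses staged spatial-temporal pre-training to overcome the limitations of existing self-supervised learning methods on embryo development videos, and achieves good performance.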
Details
Motivation: Labeled pregnancy outcome data are scarce, and existing self-supervised learning (SSL) methods cannot be applied directly to embryo development videos, which contain many frames, vary in length, and include many outlier frames; a new solution is needed to advance the automation of embryo viability prediction in IVF. Method: The proposed method, STPT, consists of a spatial and a temporal pre-training stage. In each stage only one encoder is trained while the other is frozen, reducing memory demands. The spatial stage learns from alignments within a single video and its temporally consistent augmentations, and the temporal stage models relationships between video embeddings, thereby avoiding frame-by-frame alignment across videos. Result: On 23,027 time-lapse videos (3,286 labeled), the method achieves the highest AUC, 0.635 (95% CI: 0.632-0.638). Conclusion: STPT effectively addresses the challenge of spatial-temporal alignment in embryo development videos, handling long videos and temporal variability efficiently under limited computational resources through staged spatial and temporal pre-training. Abstract: Automating embryo viability prediction for in vitro fertilization (IVF) is important but challenging due to the limited availability of labeled pregnancy outcome data, as only a small fraction of embryos are labeled after transfer. Self-supervised learning (SSL) can leverage both labeled and unlabeled data to improve prediction. However, existing SSL methods for videos are not directly applicable to embryo development videos due to two challenges: (1) embryo time-lapse videos contain hundreds of frames, requiring significant GPU memory for conventional SSL; (2) the dataset contains videos with varying lengths and many outlier frames, causing traditional video alignment methods to struggle with semantic misalignment. We propose Spatial-Temporal Pre-Training (STPT) to address these challenges. STPT includes two stages: spatial and temporal. In each stage, only one encoder is trained while the other is frozen, reducing memory demands. To handle temporal misalignment, STPT avoids frame-by-frame alignment across videos. The spatial stage learns from alignments within each video and its temporally consistent augmentations. The temporal stage then models relationships between video embeddings. Our method efficiently handles long videos and temporal variability. On 23,027 time-lapse videos (3,286 labeled), STPT achieves the highest AUC of 0.635 (95% CI: 0.632-0.638) compared to baselines, with limited computational resources.
[95] VMRA-MaR: An Asymmetry-Aware Temporal Framework for Longitudinal Breast Cancer Risk Prediction
Zijun Sun,Solveig Thrun,Michael Kampffmeyer
Main category: cs.CV
TL;DR: This paper proposes a novel deep learning framework, Vision Mamba RNN (VMRNN), for early breast cancer detection by capturing temporal dynamics and asymmetry in breast tissue, achieving improved performance over existing methods.
Details
Motivation: Breast cancer detection typically relies on screening programs, and automated risk prediction approaches can improve this process by dynamically targeting high-risk groups. While existing models focus on the most recent screening, there is growing interest in leveraging temporal information to capture evolving trends in breast tissue. However, challenges remain in fully harnessing the rich temporal dynamics of longitudinal imaging data. Method: The study proposes a Vision Mamba RNN (VMRNN) with a state-space model (SSM) and LSTM-like memory mechanisms to capture nuanced trends in breast tissue evolution. An asymmetry module utilizing a Spatial Asymmetry Detector (SAD) and Longitudinal Asymmetry Tracker (LAT) is incorporated to identify bilateral differences. Result: The proposed framework shows notable improvements in predicting cancer onset, especially for high-density breast cases, and achieves superior performance at extended time points (years four and five). Conclusion: The proposed VMRNN with SSM and LSTM-like memory mechanisms, along with the asymmetry module, demonstrates notable improvements in predicting cancer onset, particularly for high-density breast cases, and achieves superior performance at extended time points. Abstract: Breast cancer remains a leading cause of mortality worldwide and is typically detected via screening programs where healthy people are invited at regular intervals. Automated risk prediction approaches have the potential to improve this process by facilitating dynamic screening of high-risk groups. While most models focus solely on the most recent screening, there is growing interest in exploiting temporal information to capture evolving trends in breast tissue, as inspired by clinical practice. 
Early methods typically relied on two time steps, and although recent efforts have extended this to multiple time steps using Transformer architectures, challenges remain in fully harnessing the rich temporal dynamics inherent in longitudinal imaging data. In this work, we propose to instead leverage Vision Mamba RNN (VMRNN) with a state-space model (SSM) and LSTM-like memory mechanisms to effectively capture nuanced trends in breast tissue evolution. To further enhance our approach, we incorporate an asymmetry module that utilizes a Spatial Asymmetry Detector (SAD) and Longitudinal Asymmetry Tracker (LAT) to identify clinically relevant bilateral differences. This integrated framework demonstrates notable improvements in predicting cancer onset, especially for the more challenging high-density breast cases and achieves superior performance at extended time points (years four and five), highlighting its potential to advance early breast cancer recognition and enable more personalized screening strategies. Our code is available at https://github.com/Mortal-Suen/VMRA-MaR.git.
[96] Trans${^2}$-CBCT: A Dual-Transformer Framework for Sparse-View CBCT Reconstruction
Minmin Yang,Huantao Ren,Senem Velipasalar
Main category: cs.CV
TL;DR: This paper proposes Trans-CBCT and Trans$^2$-CBCT models to enhance sparse-view CBCT reconstruction by integrating CNN-Transformer features with point-based geometric reasoning, significantly improving image quality.
Details
Motivation: Sparse-view CBCT scans suffer from severe under-sampling, causing artifacts and poor spatial coverage. The work aims to improve reconstruction quality while maintaining faster scans and lower radiation doses. Method: The authors use a hybrid CNN-Transformer model (TransUNet) adapted to CBCT, incorporating multi-scale features, view-specific feature querying, and a lightweight attenuation-prediction head. They further introduce a neighbor-aware Point Transformer for volumetric coherence. Result: Trans-CBCT outperforms prior baselines by 1.17 dB PSNR and 0.0163 SSIM on the LUNA16 dataset using six views. Trans$^2$-CBCT provides an additional improvement of 0.63 dB PSNR and 0.0117 SSIM. Experiments on LUNA16 and ToothFairy datasets validate consistent gains across six to ten views. Conclusion: The paper concludes that the proposed Trans-CBCT and Trans$^2$-CBCT models effectively address challenges in sparse-view CBCT reconstruction by combining CNN-Transformer features with point-based geometry reasoning. Abstract: Cone-beam computed tomography (CBCT) using only a few X-ray projection views enables faster scans with lower radiation dose, but the resulting severe under-sampling causes strong artifacts and poor spatial coverage. We address these challenges in a unified framework. First, we replace conventional UNet/ResNet encoders with TransUNet, a hybrid CNN-Transformer model. Convolutional layers capture local details, while self-attention layers enhance global context. We adapt TransUNet to CBCT by combining multi-scale features, querying view-specific features per 3D point, and adding a lightweight attenuation-prediction head. This yields Trans-CBCT, which surpasses prior baselines by 1.17 dB PSNR and 0.0163 SSIM on the LUNA16 dataset with six views. Second, we introduce a neighbor-aware Point Transformer to enforce volumetric coherence. This module uses 3D positional encoding and attention over k-nearest neighbors to improve spatial consistency. 
The resulting model, Trans$^2$-CBCT, provides an additional gain of 0.63 dB PSNR and 0.0117 SSIM. Experiments on LUNA16 and ToothFairy show consistent gains from six to ten views, validating the effectiveness of combining CNN-Transformer features with point-based geometry reasoning for sparse-view CBCT reconstruction.
[97] Enhancing Wireless Device Identification through RF Fingerprinting: Leveraging Transient Energy Spectrum Analysis
Nisar Ahmed,Gulshan Saleem,Hafiz Muhammad Shahzad Asif,Muhammad Usman Younus,Kalsoom Safdar
Main category: cs.CV
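TL;DR: This study proposes an approach based on transient energy spectrum analysis and a hybrid deep learning model (CNN-Bi-GRU) for efficient identification and classification of RF devices in complex electromagnetic environments.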
Details
Motivation: With the growth of Internet of Things technologies and 5G wireless networks, managing and securing these devices has become a key challenge that requires accurate identification and classification of radiation devices. Method: Features are extracted with the General Linear Chirplet Transform, and a hybrid deep learning model called CNN-Bi-GRU is introduced for classification. Result: Under 10-fold cross-validation, the approach achieves a precision of 99.33%, a recall of 99.53%, an F1-score of 99.43%, and a classification accuracy of 99.17%. Conclusion: The proposed approach offers high accuracy and practical value for identifying and classifying RF devices in complex electromagnetic environments. Abstract: In recent years, the rapid growth of the Internet of Things technologies and the widespread adoption of 5G wireless networks have led to an exponential increase in the number of radiation devices operating in complex electromagnetic environments. A key challenge in managing and securing these devices is accurate identification and classification. To address this challenge, specific emitter identification techniques have emerged as a promising solution that aims to provide reliable and efficient means of identifying individual radiation devices in a unified and standardized manner. This research proposes an approach that leverages transient energy spectrum analysis using the General Linear Chirplet Transform to extract features from RF devices. A dataset comprising nine RF devices is utilized, with each sample containing 900 attributes and a total of 1080 equally distributed samples across the devices. These features are then used in a classification modeling framework. To overcome the limitations of conventional machine learning methods, we introduce a hybrid deep learning model called the CNN-Bi-GRU for learning the identification of RF devices based on their transient characteristics. The proposed approach provided a 10-fold cross-validation performance with a precision of 99.33%, recall of 99.53%, F1-score of 99.43%, and classification accuracy of 99.17%. 
The results demonstrate the promising classification performance of the CNN-Bi-GRU approach, indicating its suitability for accurately identifying RF devices based on their transient characteristics and its potential for enhancing device identification and classification in complex wireless environments.
[98] AQUA20: A Benchmark Dataset for Underwater Species Classification under Challenging Conditions
Taufikur Rahman Fuad,Sabbir Ahmed,Shahriar Ivan
Main category: cs.CV
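TL;DR: This paper introduces the AQUA20 dataset for underwater visual understanding and evaluates the trade-offs between performance and complexity of several deep learning models for underwater species recognition.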
Details
Motivation: Visual recognition in underwater environments remains a significant challenge because complex distortions such as turbidity, low illumination, and occlusion severely degrade the performance of standard vision systems. Method: The authors build AQUA20, a comprehensive benchmark dataset of 8,171 underwater images, and evaluate 13 state-of-the-art deep learning models (including lightweight CNNs and transformer-based architectures) on classifying marine species under challenging conditions. They also carry out an extensive explainability analysis using GRAD-CAM and LIME. Result: ConvNeXt performs best, with a Top-3 accuracy of 98.82%, a Top-1 accuracy of 90.69%, and the highest overall F1-score of 88.92%. Conclusion: AQUA20 provides a foundation for future research on underwater species recognition, and the experiments reveal the trade-offs between performance and complexity across models. Abstract: Robust visual recognition in underwater environments remains a significant challenge due to complex distortions such as turbidity, low illumination, and occlusion, which severely degrade the performance of standard vision systems. This paper introduces AQUA20, a comprehensive benchmark dataset comprising 8,171 underwater images across 20 marine species reflecting real-world environmental challenges such as illumination, turbidity, occlusions, etc., providing a valuable resource for underwater visual understanding. Thirteen state-of-the-art deep learning models, including lightweight CNNs (SqueezeNet, MobileNetV2) and transformer-based architectures (ViT, ConvNeXt), were evaluated to benchmark their performance in classifying marine species under challenging conditions. Our experimental results show ConvNeXt achieving the best performance, with a Top-3 accuracy of 98.82% and a Top-1 accuracy of 90.69%, as well as the highest overall F1-score of 88.92% with moderately large parameter size. The results obtained from our other benchmark models also demonstrate trade-offs between complexity and performance. We also provide an extensive explainability analysis using GRAD-CAM and LIME for interpreting the strengths and pitfalls of the models. Our results reveal substantial room for improvement in underwater species recognition and demonstrate the value of AQUA20 as a foundation for future research in this domain. The dataset is publicly available at: https://huggingface.co/datasets/taufiktrf/AQUA20.
[99] When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network
Dong Xiao,Guangyao Chen,Peixi Peng,Yangru Huang,Yifan Zhao,Yongxing Dai,Yonghong Tian
Main category: cs.CV
TL;DR: This paper proposes a real-time anomaly detection system for autonomous vehicles using a hybrid network that integrates event camera data with RGB images, resulting in faster and more accurate detection compared to existing methods.
Details
Motivation: The motivation stems from the critical need for fast and accurate anomaly detection in autonomous driving systems, where current methods often prioritize accuracy over response time, which can be detrimental in time-sensitive scenarios. Method: The method involves a novel multimodal asynchronous hybrid network that combines event data from event cameras with image data from RGB cameras. An asynchronous Graph Neural Network processes the event streams, while a CNN extracts spatial features from RGB images. Result: The experiments demonstrated that the proposed approach outperforms existing methods in both detection accuracy and response time, achieving millisecond-level real-time performance on benchmark datasets. Conclusion: The paper concludes that the proposed multimodal asynchronous hybrid network is highly effective for real-time anomaly detection in autonomous driving, combining both event streams and RGB images to achieve high accuracy and minimal response time. Abstract: Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. 
Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.
[100] Photogranulometry -- Dataset of soil images with corresponding particle size distributions
Thomas Plante St-Cyr,François Duhaime,Jean-Sébastien Dubé,Simon Grenier
Main category: cs.CV
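TL;DR: The paper presents a new high-resolution image dataset intended to provide a solid foundation for training convolutional neural networks in geotechnical applications.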
Details
Motivation: Traditional particle size distribution analyses involve significant downtime and are costly in labor and maintenance, which could be alleviated by integrating optical grain size analysis into routine geotechnical laboratory workflows. Method: 12,714 images of 321 different soil samples were collected and are presented alongside their particle size distribution analyses. Result: A custom test bench was designed in which samples are spread in a thin layer on 13x9 inch white aluminum trays and photographed; samples exceeding the size limit were reduced in mass by coning and quartering. Conclusion: The paper provides a high-resolution dataset of soil sample images for training convolutional neural networks in geotechnical engineering. Abstract: Traditional particle size distribution (PSD) analyses create significant downtime and are expensive in labor and maintenance. These drawbacks could be alleviated using optical grain size analysis integrated into routine geotechnical laboratory workflow. This paper presents a high-resolution dataset of 12,714 images of 321 different soil samples collected in the Montreal, Quebec region, alongside their PSD analysis. It is designed to provide a robust starting point for training convolutional neural networks (CNN) in geotechnical applications. Soil samples were photographed in a standardized top-view position with a resolution of 45 MP and a minimum scale of 39.4 micrometers per pixel, both in their moist and dry states. A custom test bench employing 13x9 inch white aluminum trays, on which the samples are spread in a thin layer, was used. For samples exceeding a size limit, a coning and quartering method was employed for mass reduction.
[101] Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation
Julio Silva-Rodríguez,Fereshteh Shakeri,Houda Bahig,Jose Dolz,Ismail Ben Ayed
Main category: cs.CV
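TL;DR: This study challenges the idealized deployment assumptions of existing vision-language models in medical image analysis, proposes a more realistic imbalanced, validation-free adaptation setting, and designs an efficient training-free linear probe.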
Details
Motivation: Existing medical image analysis work assumes that adaptation data are balanced and that an additional validation set is available, which is unrealistic in real-world scenarios. Method: The authors run an extensive benchmark across various modalities and downstream tasks and propose a training-free linear probe that adaptively blends visual and textual supervision. Result: Current methods degrade markedly under realistic conditions and may even perform worse than zero-shot inference. Conclusion: The paper introduces a training-free linear probe that enables robust adaptation in challenging imbalanced, validation-free scenarios. Abstract: Vision-language models (VLMs) are gaining attention in medical image analysis. These are pre-trained on large, heterogeneous data sources, yielding rich and transferable representations. Notably, the combination of modality-specialized VLMs with few-shot adaptation has provided fruitful results, enabling the efficient deployment of high-performing solutions. However, previous works on this topic make strong assumptions about the distribution of adaptation data, which are unrealistic in the medical domain. First, prior art assumes access to a balanced support set, a condition that breaks the natural imbalance in disease prevalence found in real-world scenarios. Second, these works typically assume the presence of an additional validation set to fix critical hyper-parameters, which is highly data-inefficient. This work challenges these favorable deployment scenarios and introduces a realistic, imbalanced, validation-free adaptation setting. Our extensive benchmark across various modalities and downstream tasks demonstrates that current methods systematically compromise their performance when operating under realistic conditions, occasionally even performing worse than zero-shot inference. Also, we introduce a training-free linear probe that adaptively blends visual and textual supervision. Detailed studies demonstrate that the proposed solver is a strong, efficient baseline, enabling robust adaptation in challenging scenarios.
[102] Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction
Julio Silva-Rodríguez,Ismail Ben Ayed,Jose Dolz
Main category: cs.CV
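TL;DR: This paper studies how to improve the reliability of medical vision-language models and proposes SCA-T, a new transfer learning framework that improves efficiency and conditional coverage.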
Details
Motivation: Although medical vision-language models (VLMs) are popular for their transfer capabilities, their reliability remains underexplored. Method: The paper proposes a new framework called SCA-T and validates it through comprehensive experiments. Result: Compared with SCP, the proposed framework improves both efficiency and conditional coverage. Conclusion: The paper proposes SCA-T, a new transfer learning framework that improves efficiency and conditional coverage while maintaining the empirical guarantees. Abstract: Medical vision-language models (VLMs) have demonstrated unprecedented transfer capabilities and are being increasingly adopted for data-efficient image classification. Despite their growing popularity, their reliability remains largely unexplored. This work explores the split conformal prediction (SCP) framework to provide trustworthiness guarantees when transferring such models based on a small labeled calibration set. Despite its potential, the generalist nature of the VLMs' pre-training could negatively affect the properties of the predicted conformal sets for specific tasks. While common practice in transfer learning for discriminative purposes involves an adaptation stage, we observe that deploying such a solution for conformal purposes is suboptimal since adapting the model using the available calibration data breaks the rigid exchangeability assumptions for test data in SCP. To address this issue, we propose transductive split conformal adaptation (SCA-T), a novel pipeline for transfer learning on conformal scenarios, which performs an unsupervised transductive adaptation jointly on calibration and test data. We present comprehensive experiments utilizing medical VLMs across various image modalities, transfer tasks, and non-conformity scores. Our framework offers consistent gains in efficiency and conditional coverage compared to SCP, maintaining the same empirical guarantees.
[103] Learning golf swing signatures from a single wrist-worn inertial sensor
Jessy Lauer
Main category: cs.CV
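TL;DR: This study develops a personalized golf swing analysis system based on a wrist-worn sensor. Combining large-scale professional data with deep learning models, it achieves high-accuracy motion estimation and technical feedback, and reveals individualized features of performance and potential applications.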
Details
Motivation: Golf swing analysis is currently limited by isolated metrics, insufficient data on professional athletes, and a lack of rich, interpretable movement representations; this paper aims to fill these gaps. Method: The authors build a dataset of professional swings from publicly available videos, reconstruct 3D kinematics using biologically accurate human mesh recovery, and generate synthetic inertial data to train neural networks that infer motion and segment swing phases from wrist input. Result: The system accurately estimates full-body kinematics and swing events from wrist data, delivers lab-grade motion analysis on the course, and enables early detection of anomalous movement patterns; longitudinal tracking shows its practical value for technical progress and feedback. Conclusion: The paper proposes a framework for personalized golf swing analysis based on a wrist-worn sensor, challenges common assumptions about swing consistency and an "ideal" swing, and reveals individual differences and latent biomarkers in performance. Abstract: Despite its importance for performance and injury prevention, golf swing analysis is limited by isolated metrics, underrepresentation of professional athletes, and a lack of rich, interpretable movement representations. We address these gaps with a holistic, data-driven framework for personalized golf swing analysis from a single wrist-worn sensor. We build a large dataset of professional swings from publicly available videos, reconstruct full-body 3D kinematics using biologically accurate human mesh recovery, and generate synthetic inertial data to train neural networks that infer motion and segment swing phases from wrist-based input. We learn a compositional, discrete vocabulary of motion primitives that facilitates the detection and visualization of technical flaws, and is expressive enough to predict player identity, club type, sex, and age. Our system accurately estimates full-body kinematics and swing events from wrist data, delivering lab-grade motion analysis on-course and supporting early detection of anomalous movement patterns. Explainability methods reveal subtle, individualized movement signatures, reinforcing the view that variability is a hallmark of skilled performance. Longitudinal tracking demonstrates practical value: as one player's handicap improved from 50 to 2.2 over 1.5 years, our system captured measurable technical progress and provided targeted, actionable feedback. Our findings challenge common assumptions, such as swing consistency across clubs and the existence of a single "ideal" swing, and uncover latent biomarkers shaped by both intrinsic traits and task-specific constraints. 
This work bridges lab and field-based biomechanics, offering scalable, accessible, high-fidelity motion analysis for research, coaching, and injury prevention, while opening new directions in movement-based phenotyping, personalized equipment design, and motor skill development.
[104] Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations
Zhihao Yuan,Shuyi Jiang,Chun-Mei Feng,Yaolun Zhang,Shuguang Cui,Zhen Li,Na Zhao
Main category: cs.CV
TL;DR: Scene-R1 is a video-grounded framework that enables transparent 3D scene understanding using reinforcement learning and a two-stage grounding pipeline, eliminating reliance on 3D detectors and offering efficient annotation and accurate results.
Details
Motivation: Existing 3D-aware large language models (LLMs) are black boxes that rely on pre-trained 3D detectors and do not provide transparent decision-making. Scene-R1 aims to enable transparent, step-by-step reasoning without requiring point-wise 3D instance supervision. Method: Scene-R1 uses a two-stage grounding pipeline involving temporal grounding to select relevant video snippets and image grounding to predict 2D bounding boxes. It uses SAM2 for object tracking and mask projection into 3D, relying only on task-level 2D boxes or textual labels. Result: Scene-R1 outperforms existing open-vocabulary baselines on multiple datasets while providing pixel-accurate masks and capturing fine geometry and material cues without dense 3D annotations. Conclusion: Scene-R1 provides a practical and annotation-efficient approach for trustworthy 3D scene understanding by combining reinforcement-learning-based reasoning with RGB-D video, eliminating the reliance on pre-trained 3D detectors. Abstract: Currently, utilizing large language models to understand the 3D world is becoming popular. Yet existing 3D-aware LLMs act as black boxes: they output bounding boxes or textual answers without revealing how those decisions are made, and they still rely on pre-trained 3D detectors to supply object proposals. We introduce Scene-R1, a video-grounded framework that learns to reason about 3D scenes without any point-wise 3D instance supervision by pairing reinforcement-learning-driven reasoning with a two-stage grounding pipeline. In the temporal grounding stage, we explicitly reason about the video and select the video snippets most relevant to an open-ended query. In the subsequent image grounding stage, we analyze the image and predict the 2D bounding box. After that, we track the object using SAM2 to produce pixel-accurate masks in RGB frames, and project them back into 3D, thereby eliminating the need for 3D detector-based proposals while capturing fine geometry and material cues. 
Scene-R1 can also adapt to the 3D visual question answering task to answer free-form questions directly from video. Our training pipeline only needs task-level 2D boxes or textual labels without dense 3D point-wise labels. Scene-R1 surpasses existing open-vocabulary baselines on multiple datasets, while delivering transparent, step-by-step rationales. These results show that reinforcement-learning-based reasoning combined with RGB-D video alone offers a practical, annotation-efficient route to trustworthy 3D scene understanding.
[105] SynDaCaTE: A Synthetic Dataset For Evaluating Part-Whole Hierarchical Inference
Jake Levi,Mark van der Wilk
Main category: cs.CV
TL;DR: This paper introduces SynDaCaTE, a synthetic dataset for evaluating capsule models, revealing limitations in current designs and showing that self-attention mechanisms are effective for hierarchical object representation.
Details
Motivation: The motivation stems from the difficulty in evaluating whether capsule networks truly learn part-whole hierarchies when trained end-to-end on supervised tasks. This leads to the need for a controlled synthetic dataset to better assess and improve such models. Method: The authors introduce a synthetic dataset called SynDaCaTE to evaluate capsule models, using it to analyze the performance of existing models and test new approaches like self-attention mechanisms. Result: The results show a clear bottleneck in an existing capsule model and demonstrate that permutation-equivariant self-attention performs well for parts-to-wholes inference. Conclusion: The paper concludes that permutation-equivariant self-attention is highly effective for parts-to-wholes inference and suggests future directions for designing effective inductive biases for computer vision. Abstract: Learning to infer object representations, and in particular part-whole hierarchies, has been the focus of extensive research in computer vision, in pursuit of improving data efficiency, systematic generalisation, and robustness. Models which are designed to infer part-whole hierarchies, often referred to as capsule networks, are typically trained end-to-end on supervised tasks such as object classification, in which case it is difficult to evaluate whether such a model actually learns to infer part-whole hierarchies, as claimed. To address this difficulty, we present a SYNthetic DAtaset for CApsule Testing and Evaluation, abbreviated as SynDaCaTE, and establish its utility by (1) demonstrating the precise bottleneck in a prominent existing capsule model, and (2) demonstrating that permutation-equivariant self-attention is highly effective for parts-to-wholes inference, which motivates future directions for designing effective inductive biases for computer vision.
[106] VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
Chongkai Gao,Zixuan Liu,Zhenghao Chi,Junshan Huang,Xin Fei,Yiwen Hou,Yuxuan Zhang,Yudi Lin,Zhirui Fang,Zeyu Jiang,Lin Shao
Main category: cs.CV
TL;DR: This paper introduces VLA-OS, a flexible VLA framework, to evaluate how different planning paradigms and representations affect performance, finding that visual planning is more effective than language planning and that Hierarchical-VLA achieves strong results at the cost of speed.
Details
Motivation: To address the lack of clarity regarding the sources of performance gains in existing VLA models and identify components for further improvement by isolating planning paradigms and representations from network architectures and training data. Method: The authors introduced VLA-OS, a unified VLA architecture series, and conducted controlled experiments across various object categories, visual modalities, environments, and end-effectors to evaluate different planning paradigms and representations. Result: 1) Visually grounded planning representations outperformed language planning representations. 2) The Hierarchical-VLA paradigm showed better or comparable performance in terms of task performance, pretraining, generalization, scalability, and continual learning, but with slower training and inference speeds. Conclusion: The study concludes that visually grounded planning representations outperform language-based ones and that the Hierarchical-VLA paradigm offers superior performance across multiple metrics, despite slower training and inference times. Abstract: Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in terms of network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and components to be further improved. 
To systematically investigate the impacts of different planning paradigms and representations in isolation from network architectures and training data, in this paper, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves performance superior or comparable to other paradigms on task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.
[107] LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning
Haoxuan Che,Haibo Jin,Zhengrui Guo,Yi Lin,Cheng Jin,Hao Chen
Main category: cs.CV
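TL;DR: FedMRG is a federated-learning-based, privacy-preserving framework for medical report generation that addresses multi-center data heterogeneity and communication efficiency.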
Details
Motivation: Privacy restrictions on medical data make it difficult to centralize multi-center data, hindering the development of large models for medical report generation. Method: The authors propose the FedMRG framework, which combines low-rank factorization, contrastive learning, and a dual-adapter mechanism to address data heterogeneity and communication overhead. Result: Experiments show that FedMRG generates reports that are accurate and adapted to the style of different centers while maintaining communication efficiency. Conclusion: Through a federated learning framework, FedMRG effectively addresses privacy protection and communication efficiency in multi-center medical image report generation, showing potential for clinical application. Abstract: LLMs have demonstrated significant potential in Medical Report Generation (MRG), yet their development requires large amounts of medical image-report pairs, which are commonly scattered across multiple centers. Centralizing these data is exceptionally challenging due to privacy regulations, thereby impeding model development and broader adoption of LLM-driven MRG models. To address this challenge, we present FedMRG, the first framework that leverages Federated Learning (FL) to enable privacy-preserving, multi-center development of LLM-driven MRG models, specifically designed to overcome the critical challenge of communication-efficient LLM training under multi-modal data heterogeneity. To start with, our framework tackles the fundamental challenge of communication overhead in FL-LLM tuning by employing low-rank factorization to efficiently decompose parameter updates, significantly reducing gradient transmission costs and making LLM-driven MRG feasible in bandwidth-constrained FL settings. Furthermore, we observed the dual heterogeneity in MRG under the FL scenario: varying image characteristics across medical centers, as well as diverse reporting styles and terminology preferences. To address this, we further enhance FedMRG with (1) client-aware contrastive learning in the MRG encoder, coupled with diagnosis-driven prompts, which capture both globally generalizable and locally distinctive features while maintaining diagnostic accuracy; and (2) a dual-adapter mutual boosting mechanism in the MRG decoder that harmonizes generic and specialized adapters to address variations in reporting styles and terminology. 
Through extensive evaluation of our established FL-MRG benchmark, we demonstrate the generalizability and adaptability of FedMRG, underscoring its potential in harnessing multi-center data and generating clinically accurate reports while maintaining communication efficiency.
[108] HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models
Le Yu,Kaishen Wang,Jianlong Xiong,Yue Cao,Tao He
Main category: cs.CV
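TL;DR: This study proposes a new architecture-level solution for reducing hallucinations in large vision-language models, using a novel Dual-Gated Depth Propagation Unit module that is shared across layers and recurrently refines hidden states, achieving strong and robust performance.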
Details
Motivation: Although large vision-language models (LVLMs) have achieved remarkable performance across various tasks, they remain prone to hallucinations, that is, generating outputs that are textually plausible but visually ungrounded. Method: The authors propose HalluRNN, an architecture-level solution that includes a novel Dual-Gated Depth Propagation Unit (DG-DPU) module, which is shared across layers and recurrently refines hidden states. Result: The DG-DPU module allows information to propagate adaptively throughout the model, enforces consistency across layers, and mitigates hallucinations caused by representational drift. Conclusion: By fine-tuning only the DG-DPU module, HalluRNN achieves strong and robust performance across multiple benchmarks. Abstract: Though Large Vision-Language Models (LVLMs) have achieved remarkable performance across various tasks, they are still prone to hallucinations, that is, generating outputs that are textually plausible but visually ungrounded. While prior approaches generally address this issue through data-centric fine-tuning or innovative decoding strategies, these methods often require substantial resources or task-specific configurations. In this work, we introduce an architecture-level solution, HalluRNN, which enhances model stability through recurrent cross-layer reasoning. Specifically, we propose a novel Dual-Gated Depth Propagation Unit (DG-DPU) module, which is shared across layers and recurrently refines hidden states. This allows for the adaptive propagation of information throughout the model, enforces consistency across layers, and mitigates hallucinations caused by representational drift. By fine-tuning only the DG-DPU module, HalluRNN achieves strong and robust performance across multiple benchmarks.
[109] DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving
Mihir Godbole,Xiangbo Gao,Zhengzhong Tu
Main category: cs.CV
TL;DR: This paper introduces DRAMA-X, a new benchmark for evaluating intent prediction in autonomous driving, and proposes SGG-Intent, a framework that leverages vision-language models and scene-graph reasoning to improve risk assessment and decision-making in complex urban environments.
Details
Motivation: The motivation stems from the lack of benchmarks evaluating multi-class intent prediction in safety-critical urban driving situations involving vulnerable road users. The authors aim to bridge this gap by introducing DRAMA-X and exploring the utility of vision-language models (VLMs) for fine-grained intent reasoning. Method: The authors introduced DRAMA-X, a fine-grained benchmark for multi-class intent prediction in safety-critical situations. They proposed SGG-Intent, a lightweight, training-free framework that generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends actions using a large language model. Result: DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions, and descriptive motion summaries. The experiments showed that scene-graph-based reasoning improves intent prediction and risk assessment, especially when contextual cues are explicitly modeled. Conclusion: The paper concludes that scene-graph-based reasoning, particularly when contextual cues are explicitly modeled, enhances intent prediction and risk assessment in autonomous driving scenarios. Abstract: Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations. To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. 
DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle's reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.
[110] SELFI: Selective Fusion of Identity for Generalizable Deepfake Detection
Younghun Kim,Minsuk Jang,Myung-Joon Kwon,Wonjun Lee,Changick Kim
Main category: cs.CV
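TL;DR: This paper proposes SELFI, a framework that improves the cross-manipulation generalization of deepfake detection by explicitly modeling and adaptively controlling identity features.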
Details
Motivation: Existing work disagrees on the role of identity features in deepfake detection: some methods suppress identity cues to reduce bias, while others use them as forensic evidence. This conflict needs to be resolved and the generalization of detection methods improved. Method: The authors propose the SELFI framework, comprising a Forgery-Aware Identity Adapter (FAIA) and an Identity-Aware Fusion Module (IAFM). The former extracts identity embeddings from a frozen face recognition model and projects them into a forgery-relevant space; the latter selectively fuses identity and visual features using a relevance-guided mechanism. Result: Experiments show that identity features are informative for deepfake detection, but their usefulness is context-dependent. Some manipulations preserve identity-consistent artifacts, while others distort identity cues and harm generalization. SELFI's average AUC on four benchmark datasets is 3.1% higher than that of prior methods, and on the DFDC dataset it exceeds the previous best method by 6%. Conclusion: Identity features should be neither blindly suppressed nor blindly relied upon, but explicitly modeled and adaptively controlled based on per-sample relevance. By dynamically modulating the use of identity information, SELFI improves the cross-manipulation generalization of deepfake detection. Abstract: Face identity provides a powerful signal for deepfake detection. Prior studies show that even when not explicitly modeled, classifiers often learn identity features implicitly. This has led to conflicting views: some suppress identity cues to reduce bias, while others rely on them as forensic evidence. To reconcile these views, we analyze two hypotheses: (1) whether face identity alone is discriminative for detecting deepfakes, and (2) whether such identity features generalize poorly across manipulation methods. Our experiments confirm that identity is informative but context-dependent. While some manipulations preserve identity-consistent artifacts, others distort identity cues and harm generalization. We argue that identity features should neither be blindly suppressed nor relied upon, but instead be explicitly modeled and adaptively controlled based on per-sample relevance. We propose SELFI (SELective Fusion of Identity), a generalizable detection framework that dynamically modulates identity usage. SELFI consists of: (1) a Forgery-Aware Identity Adapter (FAIA) that extracts identity embeddings from a frozen face recognition model and projects them into a forgery-relevant space via auxiliary supervision; and (2) an Identity-Aware Fusion Module (IAFM) that selectively integrates identity and visual features using a relevance-guided fusion mechanism. 
Experiments on four benchmarks show that SELFI improves cross-manipulation generalization, outperforming prior methods by an average of 3.1% AUC. On the challenging DFDC dataset, SELFI exceeds the previous best by 6%. Code will be released upon paper acceptance.
[111] A Multimodal In Vitro Diagnostic Method for Parkinson's Disease Combining Facial Expressions and Behavioral Gait Data
Wei Huang,Yinxuan Xu,Yintao Zhou,Zhengyu Li,Jing Huang,Meng Pang
Main category: cs.CV
TL;DR: This paper introduces a novel, lightweight, multimodal in vitro diagnostic approach for early Parkinson's disease detection using facial expressions and gait analysis, validated on the largest such dataset to date.
Details
Motivation: Parkinson's disease is incurable, rapidly progressing, and severely disabling. With an aging population, there is a growing need for early, non-invasive, and cost-effective detection methods. Existing approaches suffer from limited data, poor generalizability, and reliance on single-modal diagnosis. Method: We propose a lightweight deep learning model for feature extraction and fusion from facial expressions and gait data to enable early detection of PD. Additionally, we established the largest multimodal PD dataset for validation. Result: Extensive experiments on the newly established multimodal PD dataset validate the effectiveness of the proposed method in improving diagnostic accuracy compared to existing techniques. Conclusion: The proposed multimodal in vitro diagnostic method for Parkinson's disease (PD), which combines facial expressions and behavioral gait, demonstrates improved diagnostic accuracy and potential for deployment on mobile devices. Abstract: Parkinson's disease (PD), characterized by its incurable nature, rapid progression, and severe disability, poses significant challenges to the lives of patients and their families. Given the aging population, the need for early detection of PD is increasing. In vitro diagnosis has garnered attention due to its non-invasive nature and low cost. However, existing methods present several challenges: 1) limited training data for facial expression diagnosis; 2) specialized equipment and acquisition environments required for gait diagnosis, resulting in poor generalizability; 3) the risk of misdiagnosis or missed diagnosis when relying on a single modality. To address these issues, we propose a novel multimodal in vitro diagnostic method for PD, leveraging facial expressions and behavioral gait. Our method employs a lightweight deep learning model for feature extraction and fusion, aimed at improving diagnostic accuracy and facilitating deployment on mobile devices. 
Furthermore, we have established the largest multimodal PD dataset in collaboration with a hospital and conducted extensive experiments to validate the effectiveness of our proposed method.
[112] OpenMAP-BrainAge: Generalizable and Interpretable Brain Age Predictor
Pengyu Kan,Craig Jones,Kenichi Oishi
Main category: cs.CV
TL;DR: This paper introduces a scalable, interpretable transformer-based model for accurate brain age prediction using MRI scans, demonstrating strong performance across datasets and cognitive groups.
Details
Motivation: To develop an interpretable and robust brain age prediction model that accounts for demographic and technological variances in MRI scans. Method: A transformer-based architecture with self-supervised pre-training was used. The model processes pseudo-3D T1-weighted MRI scans from three anatomical views and incorporates brain volumetric data, reducing transformer complexity from quadratic to linear for scalability. Result: The model achieved an MAE of 3.65 years on ADNI2 & 3 and OASIS3 test sets and generalized well to the AIBL dataset with an MAE of 3.54 years. Brain age gap (BAG) increased across cognitive groups, and a negative correlation between BAG and cognitive scores was observed. Conclusion: The model effectively combines information from different views and volumetric data to achieve high accuracy in brain age prediction while improving generalizability and interpretability, linking aging patterns to neurodegenerative disorders. Abstract: Purpose: To develop an age prediction model which is interpretable and robust to demographic and technological variances in brain MRI scans. Materials and Methods: We propose a transformer-based architecture that leverages self-supervised pre-training on large-scale datasets. Our model processes pseudo-3D T1-weighted MRI scans from three anatomical views and incorporates brain volumetric information. By introducing a stem architecture, we reduce the conventional quadratic complexity of transformer models to linear complexity, enabling scalability for high-dimensional MRI data. We trained our model on ADNI2 & 3 (N=1348) and OASIS3 (N=716) datasets (age range: 42 - 95) from North America, with an 8:1:1 split for train, validation and test. Then, we validated it on the AIBL dataset (N=768, age range: 60 - 92) from Australia. Results: We achieved an MAE of 3.65 years on ADNI2 & 3 and OASIS3 test set and a high generalizability of MAE of 3.54 years on AIBL. 
There was a notable increase in brain age gap (BAG) across cognitive groups, with a mean of 0.15 years (95% CI: [-0.22, 0.51]) in CN, 2.55 years ([2.40, 2.70]) in MCI, and 6.12 years ([5.82, 6.43]) in AD. Additionally, a significant negative correlation between BAG and cognitive scores was observed, with a correlation coefficient of -0.185 (p < 0.001) for MoCA and -0.231 (p < 0.001) for MMSE. Gradient-based feature attribution highlighted ventricles and white matter structures as key regions influenced by brain aging. Conclusion: Our model effectively fused information from different views and volumetric information to achieve state-of-the-art brain age prediction accuracy, improved generalizability and interpretability with association to neurodegenerative disorders.
[113] HIRE: Lightweight High-Resolution Image Feature Enrichment for Multimodal LLMs
Nikitha SR,Aradhya Neeraj Mathur,Tarun Ram Menta,Rishabh Jain,Mausoom Sarkar
Main category: cs.CV
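TL;DR: This study proposes a shallow feature enrichment method that significantly reduces computational cost in multimodal large language models while maintaining high performance.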
Details
Motivation: High-resolution image features improve performance on visual understanding tasks, but they significantly increase computational cost, so a more efficient solution is needed. Method: Through extensive experiments and ablation studies, the authors explore feature upsampling as a natural extension of high-resolution feature generation. Result: The method achieves competitive results on multiple benchmarks with up to 1.5x savings in FLOPs. Conclusion: A shallow feature enricher can achieve results comparable to high-resolution feature generation while significantly reducing computational cost and time. Abstract: The integration of high-resolution image features in modern multimodal large language models has demonstrated significant improvements in fine-grained visual understanding tasks, achieving high performance across multiple benchmarks. Since these features are obtained from large image encoders like ViT, they come with a significant increase in computational costs due to multiple calls to these encoders. In this work, we first develop an intuition for feature upsampling as a natural extension of high-resolution feature generation. Through extensive experiments and ablations, we demonstrate how a shallow feature enricher can achieve competitive results with tremendous reductions in training and inference time as well as computational cost, with up to 1.5x saving in FLOPs.
[114] JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
Yunlong Lin,Zixu Lin,Kunjie Lin,Jinbin Bai,Panwang Pan,Chenxin Li,Haoyu Chen,Zhongdao Wang,Xinghao Ding,Wenbo Li,Shuicheng Yan
Main category: cs.CV
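TL;DR: JarvisArt is an intelligent photo retouching agent based on a multimodal large language model that understands user intent and coordinates over 200 Lightroom tools for efficient, personalized photo retouching.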
Details
Motivation: Existing professional tools demand substantial expertise and manual effort, while existing AI solutions typically offer limited adjustability and generalize poorly, failing to meet diverse and personalized editing needs. Method: JarvisArt is trained in two stages, an initial Chain-of-Thought supervised fine-tuning followed by GRPO-R optimization, and the authors propose the Agent-to-Lightroom Protocol for seamless integration with Lightroom. Result: JarvisArt performs well in user-friendly interaction, generalization, and fine-grained control over global and local adjustments, improving average pixel-level metrics on MMArt-Bench by 60% over GPT-4o. Conclusion: JarvisArt is an intelligent retouching agent based on a multimodal large language model that effectively bridges the gap between professional tools and existing AI solutions. Abstract: Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. 
Project Page: https://jarvisart.vercel.app/.
[115] CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
Kailing Li,Qi'ao Xu,Tianwen Qian,Yuqian Fu,Yang Jiao,Xiaoling Wang
Main category: cs.CV
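TL;DR: This paper introduces CLiViS, a new training-free framework for embodied visual reasoning that combines the reasoning ability of large language models with the perception ability of vision-language models, and performs especially well on complex instructions and spatiotemporal dynamics in long-term egocentric videos.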
Details
Motivation: To overcome the limitations of existing methods in handling complex instructions and the spatiotemporal dynamics of long-term egocentric videos. Method: The authors propose CLiViS, a new training-free framework that uses large language models for high-level task planning and iteratively updates the scene context through VLM-driven open-world visual perception. Result: Experiments show that CLiViS is effective and general across multiple benchmarks, especially in handling long-term visual dependencies. Conclusion: CLiViS provides an effective framework that combines the strengths of large language models and vision-language models to achieve stronger embodied visual reasoning. Abstract: Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS. It is a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.
[116] Optimization-Free Patch Attack on Stereo Depth Estimation
Hangcheng Liu,Xu Kuang,Xingshuo Han,Xingwan Wu,Haoran Ou,Shangwei Guo,Xingyi Huang,Tao Xiang,Tianwei Zhang
Main category: cs.CV
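TL;DR: This paper proposes PatchHunter, a novel adversarial patch attack on stereo depth estimation (SDE) that uses reinforcement learning to generate visual patterns that disrupt SDE assumptions, enabling effective and transferable attacks under realistic conditions.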
Details
Motivation: Recent studies show that SDE models are vulnerable to adversarial attacks, but these attacks are usually limited to unrealistic settings, which restricts their real-world applicability. A physically realizable, scene-adaptive, and transferable attack is therefore needed. Method: The authors propose a unified attack framework and introduce PatchHunter, an optimization-free adversarial patch attack that generates visual patterns disrupting SDE assumptions through a reinforcement-learning-driven search. Result: A comprehensive evaluation of 9 mainstream SDE models shows that optimization-based patches transfer poorly. Validated on the KITTI dataset, the CARLA simulator, and real-world vehicle deployment, PatchHunter achieves markedly higher attack success, especially in low light. Conclusion: PatchHunter not only surpasses optimization-based methods in effectiveness but also achieves better black-box transferability. Even under challenging physical conditions such as low light, PatchHunter maintains high attack success, whereas optimization-based methods fail. Abstract: Stereo Depth Estimation (SDE) is essential for scene understanding in vision-based systems like autonomous driving. However, recent studies show that SDE models are vulnerable to adversarial attacks, which are often limited to unrealistic settings, e.g., digital perturbations on separate stereo views in static scenes, restricting their real-world applicability. This raises a critical question: how can we design physically realizable, scene-adaptive, and transferable attacks against SDE under realistic constraints? To answer this, we make two key contributions. First, we propose a unified attack framework that extends optimization-based techniques to four core stages of stereo matching: feature extraction, cost-volume construction, cost aggregation, and disparity regression. A comprehensive stage-wise evaluation across 9 mainstream SDE models, under constraints like photometric consistency, reveals that optimization-based patches suffer from poor transferability. Interestingly, partially transferable patches suggest that patterns, rather than pixel-level perturbations, may be key to generalizable attacks. Motivated by this, we present PatchHunter, the first optimization-free adversarial patch attack against SDE. PatchHunter formulates patch generation as a reinforcement learning-driven search over a structured space of visual patterns crafted to disrupt SDE assumptions. We validate PatchHunter across three levels: the KITTI dataset, the CARLA simulator, and real-world vehicle deployment. 
PatchHunter not only surpasses optimization-based methods in effectiveness but also achieves significantly better black-box transferability. Even under challenging physical conditions like low light, PatchHunter maintains high attack success (e.g., D1-all > 0.4), whereas optimization-based methods fail.
[117] Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection
Xiang Fang,Arvind Easwaran,Blaise Genest
Main category: cs.CV
TL;DR: This paper introduces AMCN, a new approach for few-shot OOD detection that leverages textual prompts and contrastive learning to adaptively separate ID and OOD samples.
Details
Motivation: Traditional OOD detection methods require many IID samples, limiting their applicability; this work focuses on the more challenging few-shot OOD detection scenario. Method: AMCN uses CLIP to connect text with images, generating adaptive prompts and thresholds for ID-OOD separation. Result: Experimental results show that AMCN outperforms state-of-the-art approaches in few-shot OOD detection. Conclusion: The proposed AMCN effectively addresses the few-shot OOD detection problem by leveraging adaptive prompts and a prompt-guided separation module. Abstract: Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many IID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. 
Experimental results show that AMCN outperforms other state-of-the-art works.
[118] Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning
Shih-Wen Liu,Hsuan-Yu Fan,Wei-Ta Chu,Fu-En Yang,Yu-Chiang Frank Wang
Main category: cs.CV
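TL;DR: This paper proposes PathGenIC, an in-context learning framework for generating medical reports from histopathology images; by combining context drawn from the training set with a multimodal in-context learning mechanism, it achieves state-of-the-art results.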
Details
Motivation: Automating medical report generation from histopathology images is a major challenge that requires effective visual representations and domain-specific knowledge. Inspired by the common practices of human experts, the authors propose the PathGenIC framework to improve this process. Method: PathGenIC dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality. Result: Evaluation on the HistGen benchmark shows significant improvements on BLEU, METEOR, and ROUGE-L, along with robustness across report lengths and disease categories. Conclusion: By maximizing training data utility and bridging vision and language with ICL, PathGenIC offers an effective solution for AI-driven histopathology report generation and lays a foundation for future multimodal clinical applications. Abstract: Automating medical report generation from histopathology images is a critical challenge requiring effective visual representations and domain-specific knowledge. Inspired by the common practices of human experts, we propose an in-context learning framework called PathGenIC that integrates context derived from the training set with a multimodal in-context learning (ICL) mechanism. Our method dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality. Evaluated on the HistGen benchmark, the framework achieves state-of-the-art results, with significant improvements across BLEU, METEOR, and ROUGE-L metrics, and demonstrates robustness across diverse report lengths and disease categories. By maximizing training data utility and bridging vision and language with ICL, our work offers a solution for AI-driven histopathology reporting, setting a strong foundation for future advancements in multimodal clinical applications.
[119] MDSAM:Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation
Shuaiye Lu,Linjiang Zhou,Xiaochuan Shi
Main category: cs.CV
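TL;DR: This paper proposes MDSAM, a new training-free method that reduces hallucinations in large vision-language models and improves model performance and reliability.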
Details
Motivation: Large vision-language models (LVLMs) are sensitive to image tokens during decoding, which leads to hallucinations; a new approach is needed to reduce hallucinations and improve reliability. Method: The paper proposes Memory-Driven Sparse Attention Matrix (MDSAM), which memorizes attention patterns and activates updates through alignment during decoding to dynamically capture and refine the attention allocated to image tokens. Result: Across multiple benchmarks, MDSAM consistently reduces hallucinations and improves reliability on tasks such as image captioning and visual question answering. Conclusion: MDSAM is a novel training-free method that effectively reduces hallucinations in large vision-language models and shows good adaptability and effectiveness. Abstract: Hallucinations in large vision-language models (LVLMs) often stem from the model's sensitivity to image tokens during decoding, as evidenced by attention peaks observed when generating both real and hallucinated entities. To address this, we propose Memory-Driven Sparse Attention Matrix (MDSAM), a novel training-free approach that dynamically captures and refines the attention allocated to image tokens at each layer. MDSAM memorizes attention patterns and activates updates through alignment during decoding, enhancing focus on relevant image tokens while effectively reducing hallucinations. We evaluate MDSAM on multiple benchmarks for tasks such as image captioning and visual question answering, demonstrating its ability to consistently reduce hallucinations and improve reliability. Compatible with various LVLM architectures, MDSAM highlights its adaptability and effectiveness in mitigating hallucinations without requiring additional training or external tools.
[120] CSDN: A Context-Gated Self-Adaptive Detection Network for Real-Time Object Detection
Wei Haolin
Main category: cs.CV
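TL;DR: This paper proposes CSDN, a novel Transformer-based detection head architecture that strengthens global context modeling through a gating mechanism and improves the performance of CNN detectors.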
Details
Motivation: Convolutional neural networks in object detection are constrained by limited receptive fields and struggle to capture global contextual information; in addition, the self-attention mechanism in DETR-inspired head networks exhibits significant information redundancy. Method: The paper proposes the Context-Gated Scale-Adaptive Detection Network (CSDN), inspired by natural language processing and human visual perception, which uses a novel gating mechanism to adaptively select and combine feature dimensions and scale information. Result: CSDN offers strong global context modeling, adapts better to objects of different sizes and structures, and significantly improves detection accuracy with only a small amount of fine-tuning, without large-scale retraining. Conclusion: CSDN provides a new Transformer detection head architecture that replaces the conventional attention mechanism with a gating mechanism, makes effective use of CNN backbone features, and can directly replace the native head modules of various CNN detectors. Abstract: Convolutional neural networks (CNNs) have long been the cornerstone of target detection, but they are often constrained by limited receptive fields, which hinders their ability to capture global contextual information. This paper argues that the effective utilization of extracted features is as important as the feature extraction process itself. We critically re-evaluated the DETR-inspired header network architecture, questioning the indispensable nature of its self-attention mechanism, and discovering significant information redundancies. To solve these problems, we introduced the Context-Gated Scale-Adaptive Detection Network (CSDN), a Transformer-based detection header inspired by natural language processing architecture and human visual perception. CSDN aims to efficiently utilize the characteristics of the CNN backbone network by replacing the traditional stacked self-attention and cross-attention layers with a novel gating mechanism. This mechanism enables each region of interest (ROI) to adaptively select and combine feature dimensions and scale information from multiple attention patterns. CSDN provides more powerful global context modeling capabilities and can better adapt to objects of different sizes and structures. Our proposed detection head can directly replace the native heads of various CNN-based detectors, and only a few rounds of fine-tuning on the pre-training weights can significantly improve the detection accuracy, thus avoiding the need for extensive re-training of various layer modules to achieve small improvements.
[121] Domain Generalization using Action Sequences for Egocentric Action Recognition
Amirshayan Nasirimajd,Chiara Plizzari,Simone Alberto Peirone,Marco Ciccone,Giuseppe Averta,Barbara Caputo
Main category: cs.CV
TL;DR: This paper proposes SeqDG, a domain generalization approach for egocentric action recognition that improves cross-domain performance using visual-text sequence reconstruction and mixed-domain training.
Details
Motivation: Egocentric Action Recognition models suffer performance drops in unseen environments due to variability in illumination, viewpoint, and environment. This paper aims to enhance domain generalization by exploiting consistent user intent across action sequences. Method: SeqDG introduces a visual-text sequence reconstruction objective (SeqRec) and trains on mixed action sequences from different domains (SeqMix) to improve model robustness and generalization. Result: On EPIC-KITCHENS-100, SeqDG achieved a +2.4% relative average improvement in cross-domain action recognition, and on EGTEA, it showed a +0.6% Top-1 accuracy gain over the state-of-the-art in intra-domain recognition. Conclusion: The proposed SeqDG method effectively enhances cross-domain generalization for egocentric action recognition by leveraging action sequences and contextual cues from visual-text inputs. Abstract: Recognizing human activities from visual inputs, particularly through a first-person viewpoint, is essential for enabling robots to replicate human behavior. Egocentric vision, characterized by cameras worn by observers, captures diverse changes in illumination, viewpoint, and environment. This variability leads to a notable drop in the performance of Egocentric Action Recognition models when tested in environments not seen during training. In this paper, we tackle these challenges by proposing a domain generalization approach for Egocentric Action Recognition. Our insight is that action sequences often reflect consistent user intent across visual domains. By leveraging action sequences, we aim to enhance the model's generalization ability across unseen environments. Our proposed method, named SeqDG, introduces a visual-text sequence reconstruction objective (SeqRec) that uses contextual cues from both text and visual inputs to reconstruct the central action of the sequence. 
Additionally, we enhance the model's robustness by training it on mixed sequences of actions from different domains (SeqMix). We validate SeqDG on the EGTEA and EPIC-KITCHENS-100 datasets. Results on EPIC-KITCHENS-100 show that SeqDG leads to +2.4% relative average improvement in cross-domain action recognition in unseen environments, and on EGTEA the model achieved +0.6% Top-1 accuracy over SOTA in intra-domain action recognition.
[122] SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification
Gnana Praveen Rajasekhar,Jahangir Alam
Main category: cs.CV
TL;DR: This paper proposes a self-supervised audiovisual speaker verification approach using a unified vision transformer-based framework, reducing computational costs and labeled data dependency.
Details
Motivation: Traditional audio-visual speaker verification methods require large amounts of labeled data and use separate modality-specific architectures, making them computationally expensive and limiting scalability. Method: A unified self-supervised learning framework using contrastive learning with asymmetric masking and masked data modeling, along with a shared vision transformer backbone for handling audio, visual, or audiovisual inputs. Result: The method achieves competitive performance without relying on labeled data and is computationally efficient, showing robustness to missing modalities. Conclusion: The proposed self-supervised audiovisual speaker verification framework effectively reduces computational costs while achieving competitive performance without labeled data. Abstract: Conventional audio-visual methods for speaker verification rely on large amounts of labeled data and separate modality-specific architectures, which is computationally expensive, limiting their scalability. To address these problems, we propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling to obtain robust audiovisual feature representations. In particular, we employ a unified framework for self-supervised audiovisual speaker verification using a single shared backbone for audio and visual inputs, leveraging the versatility of vision transformers. The proposed unified framework can handle audio, visual, or audiovisual inputs using a single shared vision transformer backbone during training and testing while being computationally efficient and robust to missing modalities. Extensive experiments demonstrate that our method achieves competitive performance without labeled data while reducing computational costs compared to traditional approaches.
[123] DreamJourney: Perpetual View Generation with Video Diffusion Models
Bo Pan,Yang Chen,Yingwei Pan,Ting Yao,Wei Chen,Tao Mei
Main category: cs.CV
TL;DR: DreamJourney addresses problems of existing perpetual view generation methods, such as lacking 3D awareness and neglecting object motion.
Details
Motivation: Existing methods are limited to generating views of static 3D scenes; DreamJourney is proposed to address this. Method: In stage I, DreamJourney first lifts the input image to a 3D point cloud and renders a sequence of partial images from a specific camera trajectory. A video diffusion model is then used as a generative prior to complete the missing regions in the sequence and enhance visual coherence. In stage II, DreamJourney uses a multimodal large language model to produce a text prompt describing object movements in the current view, and uses a video diffusion model to animate the current view. Result: Extensive experiments show that DreamJourney outperforms state-of-the-art methods both quantitatively and qualitatively. Conclusion: DreamJourney is a two-stage framework that leverages the world simulation capacity of video diffusion models to enable a new perpetual scene view generation task with both camera movements and object dynamics. Abstract: Perpetual view generation aims to synthesize a long-term video corresponding to an arbitrary camera trajectory solely from a single input image. Recent methods commonly utilize a pre-trained text-to-image diffusion model to synthesize new content of previously unseen regions along camera movement. However, the underlying 2D diffusion model lacks 3D awareness and results in distorted artifacts. Moreover, they are limited to generating views of static 3D scenes, neglecting to capture object movements within the dynamic 4D world. To alleviate these issues, we present DreamJourney, a two-stage framework that leverages the world simulation capacity of video diffusion models to trigger a new perpetual scene view generation task with both camera movements and object dynamics. Specifically, in stage I, DreamJourney first lifts the input image to a 3D point cloud and renders a sequence of partial images from a specific camera trajectory. A video diffusion model is then utilized as a generative prior to complete the missing regions and enhance visual coherence across the sequence, producing a cross-view consistent video that adheres to the 3D scene and camera trajectory. Meanwhile, we introduce two simple yet effective strategies (early stopping and view padding) to further stabilize the generation process and improve visual quality. Next, in stage II, DreamJourney leverages a multimodal large language model to produce a text prompt describing object movements in the current view, and uses a video diffusion model to animate the current view with object movements. 
Stage I and II are repeated recurrently, enabling perpetual dynamic scene view generation. Extensive experiments demonstrate the superiority of our DreamJourney over state-of-the-art methods both quantitatively and qualitatively. Our project page: https://dream-journey.vercel.app.
[124] Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models
Jihyun Kim,Junho Park,Kyeongbo Kong,Suk-Ju Kang
Main category: cs.CV
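TL;DR: Programmable-Room is a framework driven by natural language instructions that interactively generates and edits 3D room meshes; it uses visual programming to decompose the task into modular steps, which significantly improves generation quality and flexibility.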
Details
Motivation: To enable precise control over 3D room meshes and make generation and editing more flexible, the paper proposes a new framework driven by natural language. Method: The framework decomposes the complex task into several simpler steps: creating plausible 3D coordinates, generating panorama texture images, constructing the 3D mesh by integrating coordinates and textures, and arranging furniture. It uses a large language model (LLM) to write Python-like programs that support the various tasks, uses a pretrained large-scale diffusion model to generate panorama images, and improves panorama generation quality by optimizing the training objective with a 1D representation. Result: Experiments show that Programmable-Room outperforms an existing model in both quantitative and qualitative evaluation, demonstrating its strength in generating and editing 3D room meshes. Conclusion: Programmable-Room is a framework that interactively generates and edits 3D room meshes from natural language instructions; through visual programming it achieves precise control over each room attribute and surpasses an existing model in generation quality and flexibility. Abstract: We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each attribute of a room, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes, generating panorama images for the texture, constructing 3D meshes by integrating the coordinates and panorama texture images, and arranging furniture. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP). VP is a method that utilizes a large language model (LLM) to write a Python-like program which is an ordered list of necessary modules for the various tasks given in natural language. We develop most of the modules. In particular, for the texture generating module, we utilize a pretrained large-scale diffusion model to generate panorama images conditioned on text and visual prompts (i.e., layout, depth, and semantic map) simultaneously. Specifically, we enhance the panorama image generation quality by optimizing the training objective with a 1D representation of a panorama scene obtained from bidirectional LSTM. We demonstrate Programmable-Room's flexibility in generating and editing 3D room meshes, and prove our framework's superiority to an existing model quantitatively and qualitatively. Project page is available at https://jihyun0510.github.io/Programmable_Room_Page/.
[125] PDC-Net: Pattern Divide-and-Conquer Network for Pelvic Radiation Injury Segmentation
Xinyu Xiong,Wuteng Cao,Zihuang Wu,Lei Zhang,Chong Gao,Guanbin Li,Qiyuan Qin
Main category: cs.CV
TL;DR: This paper proposes a new deep learning method called PDC-Net for improved segmentation of Pelvic Radiation Injury in MRI images, which combines multiple specialized modules to better capture local and global patterns, resulting in more accurate segmentation.
Details
Motivation: Accurate segmentation of Pelvic Radiation Injury from MRI is essential for prognosis assessment and personalized treatment plans, but automated segmentation faces challenges like complex organ morphologies and confusing context. Method: A Pattern Divide-and-Conquer Network (PDC-Net) is introduced, incorporating a Multi-Direction Aggregation module, a Memory-Guided Context module, and an Adaptive Fusion Decoder based on the MoE framework. Result: Evaluation on a large-scale pelvic radiation injury dataset shows that PDC-Net outperforms current approaches in PRI segmentation. Conclusion: The proposed PDC-Net demonstrates superior performance in segmenting Pelvic Radiation Injury from MRI compared to existing methods. Abstract: Accurate segmentation of Pelvic Radiation Injury (PRI) from Magnetic Resonance Images (MRI) is crucial for more precise prognosis assessment and the development of personalized treatment plans. However, automated segmentation remains challenging due to factors such as complex organ morphologies and confusing context. To address these challenges, we propose a novel Pattern Divide-and-Conquer Network (PDC-Net) for PRI segmentation. The core idea is to use different network modules to "divide" various local and global patterns and, through flexible feature selection, to "conquer" the Regions of Interest (ROI) during the decoding phase. Specifically, considering that our ROI often manifests as strip-like or circular-like structures in MR slices, we introduce a Multi-Direction Aggregation (MDA) module. This module enhances the model's ability to fit the shape of the organ by applying strip convolutions in four distinct directions. Additionally, to mitigate the challenge of confusing context, we propose a Memory-Guided Context (MGC) module. 
This module explicitly maintains a memory parameter to track cross-image patterns at the dataset level, thereby enhancing the distinction between global patterns associated with the positive and negative classes. Finally, we design an Adaptive Fusion Decoder (AFD) that dynamically selects features from different patterns based on the Mixture-of-Experts (MoE) framework, ultimately generating the final segmentation results. We evaluate our method on the first large-scale pelvic radiation injury dataset, and the results demonstrate the superiority of our PDC-Net over existing approaches.
[126] YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception
Mengqi Lei,Siqi Li,Yihong Wu,Han Hu,You Zhou,Xinhu Zheng,Guiguang Ding,Shaoyi Du,Zongze Wu,Yue Gao
Main category: cs.CV
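TL;DR: This paper proposes YOLOv13, an efficient object detection model built on the HyperACE mechanism and the FullPAD paradigm, which improves detection performance while reducing computational cost.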
Details
Motivation: YOLO-series models perform well in real-time object detection, but their convolutional architectures and area-based self-attention mechanism cannot effectively capture global many-to-many high-order correlations in complex scenes, which limits detection performance. A new method is needed to overcome these limitations. Method: The paper proposes HyperACE, a mechanism based on hypergraph computation that exploits latent high-order correlations, and designs the FullPAD paradigm to achieve fine-grained information flow and representation synergy across the network. It also replaces vanilla large-kernel convolutions with depthwise separable convolutions to reduce parameters and computational complexity. Result: Experiments on the MS COCO benchmark show that YOLOv13-N improves mAP by 3.0% over YOLO11-N and by 1.5% over YOLOv12-N, with fewer parameters and FLOPs, reaching state-of-the-art performance. Conclusion: YOLOv13 is an accurate and lightweight object detector; the HyperACE mechanism and the FullPAD paradigm improve its ability to model global many-to-many high-order correlations, yielding better detection performance in complex scenes. Abstract: The YOLO series models reign supreme in real-time object detection due to their superior accuracy and computational efficiency. However, both the convolutional architectures of YOLO11 and earlier versions and the area-based self-attention mechanism introduced in YOLOv12 are limited to local information aggregation and pairwise correlation modeling, lacking the capability to capture global multi-to-multi high-order correlations, which limits detection performance in complex scenarios. In this paper, we propose YOLOv13, an accurate and lightweight object detector. To address the above-mentioned challenges, we propose a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism that adaptively exploits latent high-order correlations and overcomes the limitation of previous methods that are restricted to pairwise correlation modeling based on hypergraph computation, achieving efficient global cross-location and cross-scale feature fusion and enhancement. Subsequently, we propose a Full-Pipeline Aggregation-and-Distribution (FullPAD) paradigm based on HyperACE, which effectively achieves fine-grained information flow and representation synergy within the entire network by distributing correlation-enhanced features to the full pipeline. Finally, we propose to leverage depthwise separable convolutions to replace vanilla large-kernel convolutions, and design a series of blocks that significantly reduce parameters and computational complexity without sacrificing performance. 
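Editor's sketch: the parameter saving from swapping a large-kernel convolution for a depthwise separable one can be checked with simple arithmetic. The channel counts (64 in, 128 out) and the 7x7 kernel below are illustrative assumptions, not values taken from the YOLOv13 paper, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative sizes only (assumed, not from the paper).
standard = conv_params(64, 128, 7)
separable = depthwise_separable_params(64, 128, 7)
ratio = separable / standard
print(standard, separable, round(ratio, 4))
```

With these assumed sizes the separable layer needs under 3% of the weights of the standard layer; the actual saving in any given network depends on its channel widths and kernel sizes.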
We conduct extensive experiments on the widely used MS COCO benchmark, and the experimental results demonstrate that our method achieves state-of-the-art performance with fewer parameters and FLOPs. Specifically, our YOLOv13-N improves mAP by 3.0% over YOLO11-N and by 1.5% over YOLOv12-N. The code and models of our YOLOv13 model are available at: https://github.com/iMoonLab/yolov13.
[127] PhysID: Physics-based Interactive Dynamics from a Single-view Image
Sourabh Vasant Gothe,Ayon Chattopadhyay,Gunturi Venkata Sai Phani Kiran,Pratik,Vibhav Agarwal,Jayesh Rajkumar Vachhani,Sourav Ghosh,Parameswaranath VM,Barath Raj KR
Main category: cs.CV
TL;DR: PhysID enables interactive dynamics from single-view images using AI-generated 3D models and on-device physics simulation, making it easier to create personalized, real-time mobile AR/VR experiences.
Details
Motivation: Transforming static images into interactive experiences is challenging but crucial for enhancing mobile user experiences, especially for AR/VR applications. Current methods require either multi-view images or pre-recorded videos. Method: PhysID utilizes large generative models for 3D mesh generation and physical property prediction from a single-view image. It integrates an on-device physics-based engine for real-time rendering and user interaction. Result: PhysID streamlines the creation of physics-based interactive dynamics, reducing the need for expert engineering tasks like 3D modeling and property calibration. Experiments show that the framework works cohesively and effectively across diverse tasks. Conclusion: PhysID is a significant advancement in mobile-based interactive dynamics, enabling real-time, non-deterministic interactions with minimal manual input and efficient on-device memory usage. Abstract: Transforming static images into interactive experiences remains a challenging task in computer vision. Tackling this challenge holds the potential to elevate mobile user experiences, notably through interactive and AR/VR applications. Current approaches aim to achieve this either using pre-recorded video responses or requiring multi-view images as input. In this paper, we present PhysID, that streamlines the creation of physics-based interactive dynamics from a single-view image by leveraging large generative models for 3D mesh generation and physical property prediction. This significantly reduces the expertise required for engineering-intensive tasks like 3D modeling and intrinsic property calibration, enabling the process to be scaled with minimal manual intervention. We integrate an on-device physics-based engine for physically plausible real-time rendering with user interactions. 
PhysID represents a leap forward in mobile-based interactive dynamics, offering real-time, non-deterministic interactions and user-personalization with efficient on-device memory consumption. Experiments evaluate the zero-shot capabilities of various Multimodal Large Language Models (MLLMs) on diverse tasks and the performance of 3D reconstruction models. These results demonstrate the cohesive functioning of all modules within the end-to-end framework, contributing to its effectiveness.
[128] LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging
Fadi Abdeladhim Zidi,Djamel Eddine Boukhari,Abdellah Zakaria Sellam,Abdelkrim Ouafi,Cosimo Distante,Salah Eddine Bekhouche,Abdelmalik Taleb-Ahmed
Main category: cs.CV
TL;DR: This paper introduces LoLA-SpecViT, an efficient hyperspectral image classification model that improves scalability and adaptability by integrating low-rank adaptation and local attention mechanisms, achieving superior performance with fewer parameters.
Details
Motivation: Hyperspectral image classification is challenging due to high dimensionality, inter-band redundancy, and limited annotated samples. Existing transformer-based models struggle with scalability and adaptability under label-scarce conditions. Method: The paper proposes LoLA-SpecViT, a lightweight spectral vision transformer combining a 3D convolutional spectral front-end with local window-based self-attention. It integrates low-rank adaptation (LoRA) into attention and projection layers and uses a novel cyclical learning rate scheduler to improve adaptability and convergence. Result: Experiments show that LoLA-SpecViT consistently outperforms state-of-the-art baselines on three benchmark datasets (WHU-Hi LongKou, WHU-Hi HongHu, and Salinas), achieving up to 99.91% accuracy with significantly fewer parameters and improved robustness in low-label regimes. Conclusion: LoLA-SpecViT provides a scalable and generalizable solution for real-world HSI applications such as agriculture, environmental monitoring, and remote sensing analytics. Abstract: Hyperspectral image classification remains a challenging task due to the high dimensionality of spectral data, significant inter-band redundancy, and the limited availability of annotated samples. While recent transformer-based models have improved the global modeling of spectral-spatial dependencies, their scalability and adaptability under label-scarce conditions remain limited. In this work, we propose LoLA-SpecViT (Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer that addresses these limitations through a parameter-efficient architecture tailored to the unique characteristics of hyperspectral imagery. Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency while reducing computational complexity. 
To further improve adaptability, we integrate low-rank adaptation (LoRA) into attention and projection layers, enabling fine-tuning with over 80% fewer trainable parameters. A novel cyclical learning rate scheduler modulates LoRA adaptation strength during training, improving convergence and generalisation. Extensive experiments on three benchmark datasets (WHU-Hi LongKou, WHU-Hi HongHu, and Salinas) demonstrate that LoLA-SpecViT consistently outperforms state-of-the-art baselines, achieving up to 99.91% accuracy with substantially fewer parameters and enhanced robustness under low-label regimes. The proposed framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics. Our code is available at https://github.com/FadiZidiDz/LoLA-SpecViT.
[129] Incorporating Rather Than Eliminating: Achieving Fairness for Skin Disease Diagnosis Through Group-Specific Expert
Gelei Xu,Yuying Duan,Zheyuan Liu,Xueyang Li,Meng Jiang,Michael Lemmon,Wei Jin,Yiyu Shi
Main category: cs.CV
TL;DR: This paper introduces FairMoE, a new framework that uses mixture-of-experts modules to achieve fair and accurate AI diagnosis of skin diseases.
Details
Motivation: AI-based systems achieve high accuracy in skin disease diagnosis but often exhibit biases across demographic groups, leading to inequitable healthcare and diminished patient trust. Most existing bias mitigation methods try to eliminate the correlation between sensitive attributes and diagnostic predictions, but they often degrade performance because clinically relevant diagnostic cues are lost. Method: The paper proposes FairMoE, a framework that uses layer-wise mixture-of-experts modules as group-specific learners. Unlike traditional methods, FairMoE dynamically routes data to the most suitable expert based on the characteristics of the data. Result: Experiments show that, unlike previous fairness approaches, FairMoE achieves substantial accuracy improvements while preserving comparable fairness metrics. Conclusion: By dynamically routing data to the most suitable expert, FairMoE improves accuracy while maintaining fairness, offering an effective way to address bias in AI-based medical diagnosis. Abstract: AI-based systems have achieved high accuracy in skin disease diagnostics but often exhibit biases across demographic groups, leading to inequitable healthcare outcomes and diminished patient trust. Most existing bias mitigation methods attempt to eliminate the correlation between sensitive attributes and diagnostic prediction, but those methods often degrade performance due to the loss of clinically relevant diagnostic cues. In this work, we propose an alternative approach that incorporates sensitive attributes to achieve fairness. We introduce FairMoE, a framework that employs layer-wise mixture-of-experts modules to serve as group-specific learners. Unlike traditional methods that rigidly assign data based on group labels, FairMoE dynamically routes data to the most suitable expert, making it particularly effective for handling cases near group boundaries. Experimental results show that, unlike previous fairness approaches that reduce performance, FairMoE achieves substantial accuracy improvements while preserving comparable fairness metrics.
[130] Time-Contrastive Pretraining for In-Context Image and Video Segmentation
Assefa Wahd,Jacob Jaremko,Abhilash Hareendranathan
Main category: cs.CV
TL;DR: This paper proposes Temporal, an improved ICL framework for vision tasks that reframes segmentation as a video object segmentation problem, leveraging a self-supervised prompt retriever to select optimal context images, resulting in significant performance gains.
Details
Motivation: Traditional ICL methods use a rigid gridding strategy that limits flexibility and effectiveness in vision applications. This work aims to overcome these limitations by introducing a more adaptable approach suited for visual data. Method: Temporal introduces a time-contrastive self-supervised objective to pretrain a prompt retriever. It formulates in-context learning (ICL) as a video object segmentation (VOS) task, using adjacent video frames as positives and distant frames as negatives during training. For segmentation tasks, the prompt retriever selects relevant sequences or keyframes for processing through the ICL pipeline. Result: Evaluated on MICCAI FLARE 2022, the method achieved a 90.95% Dice score for image segmentation (10.64% improvement over baselines) and 92.45% Dice for video segmentation (14.88% improvement). Conclusion: The proposed Temporal method, which frames ICL as a VOS task and uses a self-supervised prompt retriever, significantly improves performance in both image and video segmentation tasks compared to traditional grid-based ICL approaches. Abstract: In-context learning (ICL) enables generalization to new tasks with minimal labeled data. However, mainstream ICL approaches rely on a gridding strategy, which lacks the flexibility required for vision applications. We introduce Temporal, a time-contrastive self-supervised objective that pretrains a prompt retriever for visual ICL, and formulate ICL as a video object segmentation (VOS) task. Temporal addresses key limitations of grid-based methods that restrict the number and resolution of context images. By reframing ICL as a VOS problem, our approach supports a variable number of context images while preserving their full resolution. To address the challenge of selecting optimal context sets for queries, we pretrain a prompt retriever on videos via self-supervised learning, where adjacent frames serve as positives and distant frames as negatives. 
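Editor's sketch: the adjacent-positive, distant-negative scheme described above can be illustrated with an InfoNCE-style loss over per-frame embeddings. The InfoNCE form, the cosine similarity, and the `pos_window`, `neg_gap`, and `temperature` parameters are assumptions made for illustration; the abstract does not specify the actual loss used by Temporal.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def time_contrastive_loss(frames, anchor, pos_window=1, neg_gap=3, temperature=0.1):
    """InfoNCE-style loss for one anchor frame (hypothetical formulation).

    frames: list of embedding vectors ordered by time.
    Frames within pos_window of the anchor are positives;
    frames at least neg_gap away are negatives.
    """
    indices = range(len(frames))
    positives = [i for i in indices if i != anchor and abs(i - anchor) <= pos_window]
    negatives = [i for i in indices if abs(i - anchor) >= neg_gap]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative frame")

    def score(i):
        return math.exp(cosine(frames[anchor], frames[i]) / temperature)

    neg_sum = sum(score(j) for j in negatives)
    losses = [-math.log(score(p) / (score(p) + neg_sum)) for p in positives]
    return sum(losses) / len(losses)


# Toy embeddings that drift smoothly over time: neighbors are similar.
smooth = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.0, 1.0]]
# Toy embeddings where the neighbor is dissimilar and distant frames are similar.
scrambled = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]

loss_smooth = time_contrastive_loss(smooth, anchor=0)
loss_scrambled = time_contrastive_loss(scrambled, anchor=0)
print(loss_smooth < loss_scrambled)
```

The loss is small when temporally adjacent frames embed close together and distant frames embed far apart, which is the property a retriever trained this way would exploit when ranking candidate context images.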
For image segmentation, the prompt retriever selects relevant sequences that, when combined with the query, form coherent videos for VOS processing. For video segmentation, it identifies keyframes, predicts their masks using our ICL pipeline, and propagates them throughout the sequence. When evaluated on MICCAI FLARE 2022, our method achieves substantial improvements over baselines: 90.95% Dice score for image segmentation (10.64% improvement) and 92.45% Dice for video segmentation (14.88% improvement).
[131] Robust Foreground-Background Separation for Severely-Degraded Videos Using Convolutional Sparse Representation Modeling
Kazuki Naganuma,Shunsuke Ono
Main category: cs.CV
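TL;DR: Proposes a new foreground-background separation method that uses convolutional sparse representation to improve separation accuracy on low-frame-rate and noisy videos.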
Details
Motivation: Existing FBS methods cannot accurately separate foreground and background components from degraded videos, because they capture only data-specific or only general features and lack explicit noise models. Method: FBS is formulated as a constrained multiconvex optimization problem, and an algorithm is developed that alternately solves its two convex subproblems. Result: The method adaptively captures specific spatial structures in imaging data and effectively removes multiple types of noise. Conclusion: Experimental results show that the proposed CSR-based FBS method outperforms existing methods on videos with low frame rates and multiple noise conditions. Abstract: This paper proposes a foreground-background separation (FBS) method with a novel foreground model based on convolutional sparse representation (CSR). In order to analyze the dynamic and static components of videos acquired under undesirable conditions, such as hardware, environmental, and power limitations, it is essential to establish an FBS method that can handle videos with low frame rates and various types of noise. Existing FBS methods have two limitations that prevent us from accurately separating foreground and background components from such degraded videos. First, they only capture either data-specific or general features of the components. Second, they do not include explicit models for various types of noise to remove them in the FBS process. To this end, we propose a robust FBS method with a CSR-based foreground model. This model can adaptively capture specific spatial structures scattered in imaging data. Then, we formulate FBS as a constrained multiconvex optimization problem that incorporates CSR, functions that capture general features, and explicit noise characterization functions for multiple types of noise. Thanks to these functions, our method captures both data-specific and general features to accurately separate the components from various types of noise even under low frame rates. To obtain a solution of the optimization problem, we develop an algorithm that alternately solves its two convex subproblems by newly established algorithms. Experiments demonstrate the superiority of our method over existing methods using two types of degraded videos: infrared and microscope videos.
[132] Fetuses Made Simple: Modeling and Tracking of Fetal Shape and Pose
Yingcheng Liu,Peiqi Wang,Sebastian Diaz,Esra Abaci Turk,Benjamin Billot,Patricia Ellen Grant,Polina Golland
Main category: cs.CV
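TL;DR: This paper proposes a new 3D articulated fetal body model that combines pose and shape analysis, improving the accuracy and practicality of fetal MRI analysis.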
Details
Motivation: Existing fetal MRI analysis methods rely on anatomical keypoints or volumetric body segmentations, but these either ignore full-body shape information or make temporal analysis difficult, so a more robust method is needed. Method: The paper builds a 3D articulated statistical fetal body model by iteratively estimating body pose in the image space and body shape in the canonical pose space. Result: When tested on unseen fetal body shapes, the method yields a surface alignment error of only 3.2 mm for a 3 mm MRI voxel size, and it provides intuitive visualization as well as automated anthropometric measurements. Conclusion: The paper presents the first SMPL-based 3D articulated fetal body model, offering a new approach to fetal motion and shape analysis in prenatal diagnostics. Abstract: Analyzing fetal body motion and shape is paramount in prenatal diagnostics and monitoring. Existing methods for fetal MRI analysis mainly rely on anatomical keypoints or volumetric body segmentations. Keypoints simplify body structure to facilitate motion analysis, but may ignore important details of full-body shape. Body segmentations capture complete shape information but complicate temporal analysis due to large non-local fetal movements. To address these limitations, we construct a 3D articulated statistical fetal body model based on the Skinned Multi-Person Linear Model (SMPL). Our algorithm iteratively estimates body pose in the image space and body shape in the canonical pose space. This approach improves robustness to MRI motion artifacts and intensity distortions, and reduces the impact of incomplete surface observations due to challenging fetal poses. We train our model on segmentations and keypoints derived from $19,816$ MRI volumes across $53$ subjects. Our model captures body shape and motion across time series and provides intuitive visualization. Furthermore, it enables automated anthropometric measurements traditionally difficult to obtain from segmentations and keypoints. When tested on unseen fetal body shapes, our method yields a surface alignment error of $3.2$ mm for $3$ mm MRI voxel size. To our knowledge, this represents the first 3D articulated statistical fetal body model, paving the way for enhanced fetal motion and shape analysis in prenatal diagnostics. The code is available at https://github.com/MedicalVisionGroup/fetal-smpl .
[133] Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation
Xiaodong Guo,Zi'ang Lin,Luwen Hu,Zhihong Deng,Tong Liu,Wujie Zhou
Main category: cs.CV
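TL;DR: This study proposes CM-SSM, an efficient RGB-thermal semantic segmentation architecture that uses cross-modal state space modeling to avoid the high computational overhead of Transformer-based methods, and achieves strong performance on multiple datasets.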
Details
Motivation: Although multi-source data processing significantly improves semantic segmentation in wild environments, Transformer-based methods impose high computational overhead and are hard to deploy on resource-constrained systems. Method: The paper proposes a cross-modal 2D-selective-scan module and a cross-modal state space association module. The former constructs cross-modal visual sequences and derives hidden state representations; the latter integrates global associations with local spatial features. Result: The method achieves SOTA performance on the CART dataset, and its generalizability is verified on the PST900 dataset. Conclusion: CM-SSM is an efficient RGB-thermal semantic segmentation architecture based on cross-modal state space modeling that achieves SOTA performance while reducing parameters and computational cost. Abstract: The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi-source data processing (e.g. Transformer-based approaches) imposes significant computational overhead, presenting challenges for resource-constrained systems. To resolve this critical limitation, we introduced CM-SSM, an efficient RGB-thermal semantic segmentation architecture leveraging a cross-modal state space modeling (SSM) approach. Our framework comprises two key components. First, we introduced a cross-modal 2D-selective-scan (CM-SS2D) module to establish SSM between RGB and thermal modalities, which constructs cross-modal visual sequences and derives hidden state representations of one modality from the other. Second, we developed a cross-modal state space association (CM-SSA) module that effectively integrates global associations from CM-SS2D with local spatial features extracted through convolutional operations. In contrast with Transformer-based approaches, CM-SSM achieves linear computational complexity with respect to image resolution. Experimental results show that CM-SSM achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost. Further experiments on the PST900 dataset demonstrate its generalizability. Codes are available at https://github.com/xiaodonguo/CMSSM.
[134] SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
Guankun Wang,Wenjin Mo,Junyi Wang,Long Bai,Kun Yuan,Ming Hu,Jinlin Wu,Junjun He,Yiming Huang,Nicolas Padoy,Zhen Lei,Hongbin Liu,Nassir Navab,Hongliang Ren
Main category: cs.CV
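TL;DR: This paper introduces SurgVidLM, a new video language model specialized for surgical video understanding that performs well on both full and fine-grained tasks.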
Details
Motivation: Existing Vid-LLMs lack models specialized for fine-grained surgical video understanding, which motivates a new model to fill this gap. Method: The paper proposes the StageFocus mechanism and Multi-frequency Fusion Attention, and constructs the SVU-31K dataset for training and evaluating SurgVidLM. Result: Experiments show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs on both full and fine-grained video understanding tasks, demonstrating its superior ability to capture complex procedural contexts. Conclusion: SurgVidLM is a video language model specialized for surgical video understanding that significantly outperforms existing state-of-the-art Vid-LLMs on both full and fine-grained video understanding tasks. Abstract: Recent advances in Multimodal Large Language Models have demonstrated great potential in the medical domain, facilitating users to understand surgical scenes and procedures. Beyond image-based methods, the exploration of Video Large Language Models (Vid-LLMs) has emerged as a promising avenue for capturing the complex sequences of information involved in surgery. However, there is still a lack of Vid-LLMs specialized for fine-grained surgical video understanding tasks, which is crucial for analyzing specific processes or details within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct the SVU-31K dataset which consists of over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Furthermore, we introduce the StageFocus mechanism which is a two-stage framework performing the multi-grained, progressive understanding of surgical videos. We also develop the Multi-frequency Fusion Attention to effectively integrate low and high-frequency visual tokens, ensuring the retention of critical information. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing complex procedural contexts.
[135] StainPIDR: A Pathological Image Decoupling and Reconstruction Method for Stain Normalization Based on Color Vector Quantization and Structure Restaining
Zheng Chen
Main category: cs.CV
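TL;DR: This paper proposes StainPIDR, a stain normalization method for pathological images, together with a template image selection algorithm that improves normalization quality.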
Details
Motivation: The color appearance of pathological images depends heavily on imaging protocols, dye proportions, and scanning devices, which can degrade computer-aided diagnostic systems when they process such color-variant images. An effective stain normalization method is therefore needed. Method: The paper proposes StainPIDR, a stain normalization method that decouples an image into structure features and vector-quantized color features and restains the structure features using a cross-attention mechanism. A template image selection algorithm is also designed. Result: Extensive experiments validate the effectiveness of StainPIDR and the template image selection algorithm, showing that the method performs well on the stain normalization task. Conclusion: StainPIDR is an effective stain normalization method, and the accompanying template image selection algorithm improves stain normalization of pathological images. Abstract: The color appearance of a pathological image is highly related to the imaging protocols, the proportion of different dyes, and the scanning devices. Computer-aided diagnostic systems may deteriorate when facing these color-variant pathological images. In this work, we propose a stain normalization method called StainPIDR. We try to eliminate this color discrepancy by decoupling the image into structure features and vector-quantized color features, restaining the structure features with the target color features, and decoding the stained structure features to normalized pathological images. We assume that color features decoupled by different images with the same color should be exactly the same. Under this assumption, we train a fixed color vector codebook to which the decoupled color features will map. In the restaining part, we utilize the cross-attention mechanism to efficiently stain the structure features. As the target color (decoupled from a selected template image) will also affect the performance of stain normalization, we further design a template image selection algorithm to select a template from a given dataset. In our extensive experiments, we validate the effectiveness of StainPIDR and the template image selection algorithm. All the results show that our method can perform well in the stain normalization task. The code of StainPIDR will be publicly available later.
[136] Cloud-Aware SAR Fusion for Enhanced Optical Sensing in Space Missions
Trong-An Bui,Thanh-Thoai Le
Main category: cs.CV
TL;DR: This study introduces a Cloud-Attentive Reconstruction Framework that combines SAR and optical data using deep learning to generate high-quality, cloud-free optical satellite images.
Details
Motivation: Cloud contamination impairs the usability of optical satellite imagery, affecting critical applications like environmental monitoring, disaster response, and land-use analysis. This research aims to address this limitation by generating cloud-free optical images. Method: The framework uses an attention-driven feature fusion mechanism to align structural information from SAR with spectral characteristics from optical data. Additionally, it employs a cloud-aware model update strategy with adaptive loss weighting to prioritize cloud-occluded regions. Result: Experimental results show that the proposed method outperforms existing approaches, achieving a PSNR of 31.01 dB, SSIM of 0.918, and MAE of 0.017. Conclusion: The proposed Cloud-Attentive Reconstruction Framework effectively produces high-fidelity, spatially and spectrally consistent cloud-free optical images by integrating SAR-optical feature fusion with deep learning-based image reconstruction. Abstract: Cloud contamination significantly impairs the usability of optical satellite imagery, affecting critical applications such as environmental monitoring, disaster response, and land-use analysis. This research presents a Cloud-Attentive Reconstruction Framework that integrates SAR-optical feature fusion with deep learning-based image reconstruction to generate cloud-free optical imagery. The proposed framework employs an attention-driven feature fusion mechanism to align complementary structural information from Synthetic Aperture Radar (SAR) with spectral characteristics from optical data. Furthermore, a cloud-aware model update strategy introduces adaptive loss weighting to prioritize cloud-occluded regions, enhancing reconstruction accuracy. Experimental results demonstrate that the proposed method outperforms existing approaches, achieving a PSNR of 31.01 dB, SSIM of 0.918, and MAE of 0.017. 
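The adaptive loss weighting mentioned above can be pictured as a per-pixel weighted reconstruction error. The following is a minimal sketch, not the paper's formulation: the weighting rule `1 + alpha * mask`, the `alpha` value, and the function name `cloud_weighted_l1` are assumptions introduced for illustration.

```python
def cloud_weighted_l1(pred, target, cloud_mask, alpha=4.0):
    """Weighted mean absolute error over flattened pixel lists.

    cloud_mask holds values in [0, 1]; a fully cloudy pixel gets
    weight 1 + alpha, a clear pixel gets weight 1. The weighting rule
    and alpha are illustrative assumptions, not taken from the paper.
    """
    weights = [1.0 + alpha * m for m in cloud_mask]
    total = sum(w * abs(p - t) for w, p, t in zip(weights, pred, target))
    return total / sum(weights)

target = [0.2, 0.4, 0.6, 0.8]
cloud = [0.0, 0.0, 1.0, 1.0]        # last two pixels are cloud-covered
err_clear = [0.3, 0.5, 0.6, 0.8]    # 0.1 error on the clear pixels only
err_cloudy = [0.2, 0.4, 0.7, 0.9]   # 0.1 error on the cloudy pixels only

loss_clear = cloud_weighted_l1(err_clear, target, cloud)
loss_cloudy = cloud_weighted_l1(err_cloudy, target, cloud)
```

With these toy values the same absolute error costs five times more when it falls under the cloud mask, so gradient updates would be dominated by the occluded regions.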
These outcomes highlight the framework's effectiveness in producing high-fidelity, spatially and spectrally consistent cloud-free optical images.
[137] Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation
Jiahao Lu,Jiacheng Deng
Main category: cs.CV
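TL;DR: This paper proposes Relation3D, a new method that improves point cloud instance segmentation by enhancing relation modeling, and demonstrates superior performance on multiple datasets.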
Details
Motivation: Existing transformer-based methods mainly model the external relationships between scene features and query features, and lack effective modeling of the internal relationships among scene features and among query features. Method: The paper introduces an adaptive superpoint aggregation module and a contrastive learning-guided superpoint refinement module, together with a relation-aware self-attention mechanism. Result: Experiments on the ScanNetV2, ScanNet++, ScanNet200, and S3DIS datasets show that Relation3D achieves superior performance. Conclusion: Relation3D improves point cloud instance segmentation performance by enhancing relation modeling. Abstract: 3D instance segmentation aims to predict a set of object instances in a scene, representing them as binary foreground masks with corresponding semantic labels. Currently, transformer-based methods are gaining increasing attention due to their elegant pipelines and superior predictions. However, these methods primarily focus on modeling the external relationships between scene features and query features through mask attention. They lack effective modeling of the internal relationships among scene features as well as between query features. In light of these disadvantages, we propose Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation. Specifically, we introduce an adaptive superpoint aggregation module and a contrastive learning-guided superpoint refinement module to better represent superpoint features (scene features) and leverage contrastive learning to guide the updates of these features. Furthermore, our relation-aware self-attention mechanism enhances the capabilities of modeling relationships between queries by incorporating positional and geometric relationships into the self-attention mechanism. Extensive experiments on the ScanNetV2, ScanNet++, ScanNet200 and S3DIS datasets demonstrate the superior performance of Relation3D.
[138] BeltCrack: the First Sequential-image Industrial Conveyor Belt Crack Detection Dataset and Its Baseline with Triple-domain Feature Learning
Jianghong Huang,Luping Ji,Xin Ma,Mao Ye
Main category: cs.CV
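TL;DR: This paper builds the first conveyor belt crack detection dataset from real industrial scenes and proposes a triple-domain feature fusion learning method, validating the effectiveness of the dataset and the superiority of the method.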
Details
Motivation: Existing crack datasets focus mainly on pavement scenarios or synthetic data and lack real industrial conveyor belt crack data, so a real-world dataset is needed to advance machine learning in this field. Method: The paper proposes a baseline method based on hierarchical fusion learning of time-space-frequency triple-domain features and uses it to validate the newly constructed datasets. Result: Experiments show that the constructed datasets are effective and usable, and that the proposed baseline clearly outperforms other similar detection methods. Conclusion: The authors construct the first sequential-image dataset for industrial conveyor belt crack detection and propose a baseline based on triple-domain feature fusion learning, validating the dataset's effectiveness and usability. Abstract: Conveyor belts are an important category of equipment in modern industry, widely applied in production and manufacturing fields. Their health status is critical to operational efficiency and safety. Among the factors affecting belt health, cracks are often one of the most threatening risks. Currently, for safety reasons, how to intelligently detect belt cracks is attracting increasing attention. To implement intelligent detection with machine learning, real crack samples are believed to be necessary. However, existing crack datasets primarily focus on pavement scenarios or synthetic data, and there are no real-world industrial belt crack datasets at all. To propel machine learning advancement in this field, this paper constructs the first sequential-image belt crack detection datasets (BeltCrack14ks and BeltCrack9kd), from real-world factory scenes. Furthermore, to validate usability and effectiveness, we propose a special baseline method with triple-domain (i.e., time-space-frequency) feature hierarchical fusion learning for the two new datasets. Experimental results demonstrate the availability and effectiveness of our dataset. Besides, they also show that our baseline is clearly superior to other similar detection methods. Our datasets and source codes are available at https://github.com/UESTC-nnLab/BeltCrack.
[139] EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations
Junho Park,Andrew Sangwoo Ye,Taein Kwon
Main category: cs.CV
TL;DR: EgoWorld introduces a novel method for generating egocentric views using point cloud reprojection and diffusion-based inpainting, outperforming existing approaches in flexibility and performance.
Details
Motivation: Current methods for exocentric-to-egocentric translation rely on restrictive assumptions and 2D cues; EgoWorld aims to overcome these limitations with a more flexible and realistic approach. Method: A two-stage framework that utilizes depth maps, point cloud reprojection, and diffusion-based inpainting to generate egocentric images from exocentric observations. Result: EgoWorld achieves state-of-the-art results on H2O and TACO datasets, generalizes well to new objects and scenes, and performs promisingly on unlabeled real-world examples. Conclusion: EgoWorld presents a novel framework for translating exocentric views into egocentric ones, showing robustness and state-of-the-art performance across datasets and real-world applications. Abstract: Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR) and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues, synchronized multi-view settings, and unrealistic assumptions such as necessity of initial egocentric frame and relative camera poses during inference. To overcome these challenges, we introduce EgoWorld, a novel two-stage framework that reconstructs an egocentric view from rich exocentric observations, including projected point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies diffusion-based inpainting to produce dense, semantically coherent egocentric images. Evaluated on the H2O and TACO datasets, EgoWorld achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects. 
Moreover, EgoWorld shows promising results even on unlabeled real-world examples.
[140] PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs
Yixuan Wu,Yang Zhang,Jian Wu,Philip Torr,Jindong Gu
Main category: cs.CV
TL;DR: This paper proposes MMGrounded-PostAlign, a post-multimodal alignment framework that improves visual understanding and reduces hallucinations in MLLMs through visual and textual grounding modules with specialized mechanisms.
Details
Motivation: MLLMs often rely on spurious correlations due to linguistic priors, reducing their ability to utilize actual visual information. This work aims to enhance visual understanding and reduce hallucinations in these models. Method: The paper introduces the MMGrounded-PostAlign framework, which includes a multimodal grounding module for visual and textual grounding, along with mechanisms such as negative rejection and selective reasoning to improve model accuracy and reliability. Result: Extensive evaluations on benchmarks like POPE, HaloQuest, VQAv2, MME, and MMBench show that the proposed framework significantly improves fine-grained visual understanding and hallucination suppression. Conclusion: The study concludes that the MMGrounded-PostAlign framework effectively enhances visual understanding and suppresses hallucinations in MLLMs, leading to improved performance on multimodal benchmarks. Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language tasks, such as image captioning and visual question answering. However, they often suffer from over-reliance on spurious correlations, primarily due to linguistic priors that distract the model from leveraging actual visual information. To address these issues, we introduce MMGrounded-PostAlign, a post-multimodal alignment framework designed to enhance the visual understanding capabilities and mitigate the hallucinations of MLLMs. Our framework incorporates a multimodal grounding module for both visual grounding, which identifies the referred object in the image, and textual grounding, which generates the rationale for the final answer, ensuring that outputs are anchored in both visual and textual evidence. To mitigate the hallucinations, we introduce a negative rejection mechanism in the visual grounding module to distinguish grounded entities from non-existent objects influenced by linguistic biases. 
On the textual grounding side, we propose a selective reasoning mechanism that adjusts the model's reasoning strategy based on query complexity. Extensive evaluations are conducted on benchmarks such as POPE, HaloQuest, VQAv2, MME, and MMBench showing significant improvements in fine-grained visual understanding and hallucination suppression.
[141] Cause-Effect Driven Optimization for Robust Medical Visual Question Answering with Language Biases
Huanjia Zhu,Yishu Liu,Xiaozhao Fang,Guangming Lu,Bingzhi Chen
Main category: cs.CV
TL;DR: This paper proposes CEDO, a framework that reduces language biases in Medical Visual Question Answering by addressing modality optimization, synergy, and loss rescaling.
Details
Motivation: Existing Med-VQA models suffer from language biases due to spurious correlations between question types and answer categories, which needs comprehensive mitigation. Method: CEDO incorporates three mechanisms: Modality-driven Heterogeneous Optimization (MHO), Gradient-guided Modality Synergy (GMS), and Distribution-adapted Loss Rescaling (DLR) to address language biases from causal and effectual perspectives. Result: Experiments show that CEDO outperforms state-of-the-art methods on traditional and bias-sensitive benchmarks, proving its effectiveness in reducing language biases. Conclusion: The proposed CEDO framework effectively mitigates language biases in Med-VQA models by addressing both causal and effectual factors, demonstrating robust performance on benchmarks. Abstract: Existing Medical Visual Question Answering (Med-VQA) models often suffer from language biases, where spurious correlations between question types and answer categories are inadvertently established. To address these issues, we propose a novel Cause-Effect Driven Optimization framework called CEDO, that incorporates three well-established mechanisms, i.e., Modality-driven Heterogeneous Optimization (MHO), Gradient-guided Modality Synergy (GMS), and Distribution-adapted Loss Rescaling (DLR), for comprehensively mitigating language biases from both causal and effectual perspectives. Specifically, MHO employs adaptive learning rates for specific modalities to achieve heterogeneous optimization, thus enhancing robust reasoning capabilities. Additionally, GMS leverages the Pareto optimization method to foster synergistic interactions between modalities and enforce gradient orthogonality to eliminate bias updates, thereby mitigating language biases from the effect side, i.e., shortcut bias. 
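One common way to enforce orthogonality between an update and an unwanted direction is to project the update onto the orthogonal complement of that direction. The sketch below shows that generic projection only; the abstract does not say how GMS implements gradient orthogonality, so the function `remove_component` and the names `fused_grad` and `question_only_grad` are hypothetical.

```python
def remove_component(grad, bias_grad):
    """Return grad with its projection onto bias_grad subtracted.

    The result is orthogonal to bias_grad. This is a generic vector
    projection, assumed here for illustration; it is not claimed to be
    the exact rule used by GMS.
    """
    dot = sum(g * b for g, b in zip(grad, bias_grad))
    norm_sq = sum(b * b for b in bias_grad)
    if norm_sq == 0.0:
        # Nothing to project out when the bias direction is zero.
        return list(grad)
    scale = dot / norm_sq
    return [g - scale * b for g, b in zip(grad, bias_grad)]

fused_grad = [3.0, 1.0]            # hypothetical gradient of the full model
question_only_grad = [1.0, 0.0]    # hypothetical language-shortcut direction
cleaned = remove_component(fused_grad, question_only_grad)
residual = sum(c * b for c, b in zip(cleaned, question_only_grad))
```

Here `cleaned` keeps only the part of the update that does not move parameters along the shortcut direction, which is the intuition behind eliminating bias updates.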
Furthermore, DLR is designed to assign adaptive weights to individual losses to ensure balanced learning across all answer categories, effectively alleviating language biases from the cause side, i.e., imbalance biases within datasets. Extensive experiments on multiple traditional and bias-sensitive benchmarks consistently demonstrate the robustness of CEDO over state-of-the-art competitors.
[142] Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis
Mohamed Benkedadra,Matei Mancas,Sidi Ahmed Mahmoudi
Main category: cs.CV
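TL;DR: This paper proposes an interactive system pipeline based on 3D stereo vision that combines multi-camera fusion with a feedback mechanism to improve performance in large, complex environments.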
Details
Motivation: 2D cameras and existing 3D cameras are not reliable enough in large, complex environments, so a more robust solution is needed. Method: The system uses multiple 3D cameras for scene reconstruction, combined with feedback approaches that let it learn to make better decisions. Result: The paper presents a new 3D stereo vision pipeline and preliminary experimental results; the pipeline can be applied to tasks such as event recognition and subject tracking. Conclusion: The authors propose a 3D stereo vision based pipeline for interactive systems that handles both ordinary and sensitive applications, fuses multiple 3D cameras for full scene reconstruction, and explores feedback approaches for adapting to new environments. Abstract: 2D cameras are often used in interactive systems. Other systems like gaming consoles provide more powerful 3D cameras for short range depth sensing. Overall, these cameras are not reliable in large, complex environments. In this work, we propose a 3D stereo vision based pipeline for interactive systems, that is able to handle both ordinary and sensitive applications, through robust scene understanding. We explore the fusion of multiple 3D cameras to do full scene reconstruction, which allows for performing a wide range of tasks, like event recognition, subject tracking, and notification. Using possible feedback approaches, the system can receive data from the subjects present in the environment, to learn to make better decisions, or to adapt to completely new environments. Throughout the paper, we introduce the pipeline and explain our preliminary experimentation and results. Finally, we draw the roadmap for the next steps that need to be taken, in order to get this pipeline into production.
[143] PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis
Chuhao Jin,Haosen Li,Bingzi Zhang,Che Liu,Xiting Wang,Ruihua Song,Wenbing Huang,Ying Qin,Fuzheng Zhang,Di Zhang
Main category: cs.CV
TL;DR: PlanMoGPT improves text-to-motion generation by addressing motion tokenization issues, achieving better performance and diversity than existing methods.
Details
Motivation: There is a significant performance gap in text-to-motion generation where LLM-based methods lag behind non-LLM methods due to issues with motion tokenization granularity. Method: PlanMoGPT integrates progressive planning and flow-enhanced fine-grained motion tokenization. The method uses LLMs' autoregressive capabilities to hierarchically generate motion tokens and employs a flow-enhanced tokenizer and decoder to preserve motion details. Result: Experiments show that PlanMoGPT achieves state-of-the-art performance, improving FID scores by 63.8% and enhancing motion diversity by 49.9% compared to existing methods. Conclusion: The proposed PlanMoGPT framework successfully addresses the diversity-quality trade-off in text-to-motion generation, establishing new standards for LLM-based approaches. Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemphasize short-term coherence at the expense of global semantic alignment, while coarse-grained tokenization sacrifices motion details. To resolve this issue, we propose PlanMoGPT, an LLM-based framework integrating progressive planning and flow-enhanced fine-grained motion tokenization. First, our progressive planning mechanism leverages LLMs' autoregressive capabilities to hierarchically generate motion tokens by starting from sparse global plans and iteratively refining them into full sequences. Second, our flow-enhanced tokenizer doubles the downsampling resolution and expands the codebook size by eight times, minimizing detail loss during discretization, while a flow-enhanced decoder recovers motion nuances. 
Extensive experiments on text-to-motion benchmarks demonstrate that it achieves state-of-the-art performance, improving FID scores by 63.8% (from 0.380 to 0.141) on long-sequence generation while enhancing motion diversity by 49.9% compared to existing methods. The proposed framework successfully resolves the diversity-quality trade-off that plagues current non-LLM approaches, establishing new standards for text-to-motion generation.
[144] IDAL: Improved Domain Adaptive Learning for Natural Images Dataset
Ravi Kant Gupta,Shounak Das,Amit Sethi
Main category: cs.CV
TL;DR: A new unsupervised domain adaptation approach improves alignment of multimodal distributions for natural images using a tailored neural architecture and loss function combination, achieving superior generalization on multiple datasets.
Details
Motivation: To improve unsupervised domain adaptation by effectively aligning multimodal distributions across domains, particularly addressing scale, noise, and style shifts in natural images. Method: The method involves a neural architecture combining ResNet's deep structure and FPN for handling content and style features, trained using a novel loss function combined with existing ones to address challenges in natural images. Result: Enhanced model accuracy, robustness on the target domain, and faster training convergence were achieved compared to existing adversarial domain adaptation methods. Conclusion: The proposed UDA scheme generalizes better than state-of-the-art CNN-based methods on certain datasets, including Office-Home, Office-31, and VisDA-2017, with comparable performance on DomainNet. Abstract: We present a novel approach for unsupervised domain adaptation (UDA) for natural images. A commonly-used objective for UDA schemes is to enhance domain alignment in representation space even if there is a domain shift in the input space. Existing adversarial domain adaptation methods may not effectively align different domains of multimodal distributions associated with classification problems. Our approach has two main features. Firstly, its neural architecture uses the deep structure of ResNet and the effective separation of scales of feature pyramidal network (FPN) to work with both content and style features. Secondly, it uses a combination of a novel loss function and judiciously selected existing loss functions to train the network architecture. This tailored combination is designed to address challenges inherent to natural images, such as scale, noise, and style shifts, that occur on top of a multi-modal (multi-class) distribution. The combined loss function not only enhances model accuracy and robustness on the target domain but also speeds up training convergence. 
Our proposed UDA scheme generalizes better than state-of-the-art CNN-based methods on the Office-Home, Office-31, and VisDA-2017 datasets and is comparable on the DomainNet dataset.
[145] GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning
Bo Liu,Xiangyu Zhao,Along He,Yidi Chen,Huazhu Fu,Xiao-Ming Wu
Main category: cs.CV
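TL;DR: This paper proposes the ThinkVG dataset and a new reinforcement learning reward mechanism for medical visual question answering, improving model interpretability and answer reliability while using less training data.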
Details
Motivation: Existing multimodal medical visual question answering methods fall short in answer reliability and interpretability, which impairs the ability of clinicians and patients to understand and trust model-generated answers. Method: (1) The paper constructs a dataset named ThinkVG that decomposes answer generation into intermediate reasoning steps explicitly grounded in visual regions of the medical image; (2) it introduces a new verifiable reward mechanism for reinforcement learning to improve the alignment between the model's reasoning process and its final answer. Result: The method achieves performance comparable to existing methods using only one-eighth of the training data, demonstrating its efficiency and effectiveness. In addition, the ThinkVG dataset provides fine-grained explainability. Conclusion: The study proposes a new medical visual question answering method that, by introducing explainability and reward mechanisms, matches the performance of existing methods with less training data. Abstract: Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. To address this, this work first proposes a Thinking with Visual Grounding (ThinkVG) dataset wherein the answer generation is decomposed into intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://huggingface.co/datasets/BoKelvin/GEMeX-ThinkVG.
[146] SegChange-R1: Augmented Reasoning for Remote Sensing Change Detection via Large Language Models
Fei Zhou
Main category: cs.CV
TL;DR: This paper proposes SegChange-R1, an enhanced remote sensing change detection approach using an LLM and BEV spatial transformation, achieving superior performance on multiple datasets.
Details
Motivation: The motivation stems from the need to improve remote sensing change detection in various applications like urban planning and environmental monitoring by better identifying significant feature changes over time. Method: The method involves an LLM-augmented inference approach to enhance detection capabilities and a linear attention-based BEV module to address modal misalignment. Additionally, a new dataset (DVCD) for UAV-based building change detection is constructed. Result: The experiments demonstrate that the proposed approach outperforms existing methods on four widely-used change detection datasets, showing significant improvements in performance. Conclusion: The paper concludes that the proposed SegChange-R1 approach, which integrates textual descriptive information and utilizes a spatial transformation module (BEV), significantly improves building change detection performance across multiple datasets. Abstract: Remote sensing change detection is widely used in a variety of fields such as urban planning, terrain and geomorphology analysis, and environmental monitoring, mainly by analyzing the significant change differences of features (e.g., building changes) in the same spatial region at different time phases. In this paper, we propose a large language model (LLM) augmented inference approach (SegChange-R1), which enhances the detection capability by integrating textual descriptive information and aims at guiding the model to segment the more interested change regions, thus accelerating the convergence speed. Moreover, we design a spatial transformation module (BEV) based on linear attention, which solves the problem of modal misalignment in change detection by unifying features from different temporal perspectives onto the BEV space. 
In addition, we construct the first dataset for building change detection from UAV viewpoints (DVCD), and our experiments on four widely-used change detection datasets show a significant improvement over existing methods. The code and pre-trained models are available at https://github.com/Yu-Zhouz/SegChange-R1.
[147] Classification of Tents in Street Bazaars Using CNN
Azamat Ibragimov,Ruslan Isaev,Remudin Reshid Mekuria,Gulnaz Gimaletdinova,Dim Shaiakhmetov
Main category: cs.CV
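TL;DR: This study proposes an improved deep learning model for classifying tents in street bazaars; the results show that pre-trained models such as EfficientNetB0 outperform a custom CNN in accuracy and generalization.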
Details
Motivation: Street bazaars are vital economic hubs in many regions, but their unstructured nature poses significant challenges for the automated classification of market infrastructure; moreover, traditional manual tent classification is inefficient, so an improved deep learning model is needed. Method: A custom convolutional neural network (CNN) was built and compared with EfficientNetB0; an extended set of 126 original photographs was augmented to generate additional images, and the models were evaluated with a variety of performance metrics. Result: The custom CNN achieved 92.8% accuracy, while EfficientNetB0 reached 98.4%, confirming the effectiveness of transfer learning for bazaar image classification. Conclusion: Using a pre-trained model such as EfficientNetB0 can significantly improve the accuracy and generalization of bazaar image classification. Abstract: This research paper proposes an improved deep learning model for classifying tents in street bazaars, comparing a custom Convolutional Neural Network (CNN) with EfficientNetB0. Tent classification is a critical task for market organization, but past manual methods have been inefficient. Street bazaars represent a vital economic hub in many regions, yet their unstructured nature poses significant challenges for the automated classification of market infrastructure, such as tents. In Kyrgyzstan, more than a quarter of the country's GDP is derived from bazaars. While CNNs have been widely applied to object recognition, their application to bazaar-specific tasks remains underexplored. Here, we build upon our original approach by training on an extended set of 126 original photographs that were augmented to generate additional images. This dataset is publicly available for download on Kaggle. A variety of performance metrics, such as accuracy, precision, recall, F1 score, and mean average precision (mAP), were used to assess the models comparatively, providing a more extensive analysis of classification performance. The results show that the custom CNN model achieved 92.8% accuracy and EfficientNetB0 achieved 98.4% accuracy, confirming the effectiveness of transfer learning in bazaar image classification. In addition, analysis of the confusion matrix reveals the weaknesses and strengths of each model.
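As an illustration of the metrics named above, the following minimal sketch derives accuracy and per-class precision, recall, and F1 from a confusion matrix. The function name `per_class_metrics` and the two-class toy matrix are hypothetical; the counts are made up for demonstration and are not the paper's results.

```python
def per_class_metrics(cm):
    """Compute accuracy and per-class precision/recall/F1.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, metrics


# Toy confusion matrix (made-up counts): rows are true classes, columns are predictions.
toy_cm = [[45, 5],
          [10, 40]]
accuracy, metrics = per_class_metrics(toy_cm)
print(f"accuracy={accuracy:.3f}")
for k, m in enumerate(metrics):
    print(k, {name: round(value, 3) for name, value in m.items()})
```

Reading the off-diagonal cells alongside these scores shows which class pairs a model confuses, which is the kind of per-model strength and weakness analysis the abstract refers to.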
These findings suggest that using a pre-trained model such as EfficientNetB0 significantly improves classification accuracy and generalization.
[148] Mobile Image Analysis Application for Mantoux Skin Test
Liong Gele,Tan Chye Cheah
Main category: cs.CV
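TL;DR: A mobile-application-based LTBI diagnostic tool is developed that uses image processing and machine learning to improve diagnostic accuracy and efficiency.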
Details
Motivation: Traditional TST methods suffer from low follow-up return rates, patient discomfort, and subjective manual interpretation, leading to misdiagnosis and delayed treatment. Method: Advanced image processing technologies and machine learning algorithms, including ARCore and DeepLabv3, are used for precise measurement of skin indurations, and an edge detection algorithm is used to enhance accuracy. Result: Compared with standard clinical practice, the system shows significant improvements in accuracy and reliability. Conclusion: By automating and standardizing TST evaluations, the mobile application improves the accessibility and efficiency of TB diagnostics, especially in resource-limited regions. Abstract: This paper presents a newly developed mobile application designed to diagnose Latent Tuberculosis Infection (LTBI) using the Mantoux Skin Test (TST). Traditional TST methods often suffer from low follow-up return rates, patient discomfort, and subjective manual interpretation, particularly with the ball-point pen method, leading to misdiagnosis and delayed treatment. Moreover, unlike previously developed mobile applications that used 3D reconstruction, this app utilizes scaling stickers as reference objects for induration measurement. This mobile application integrates advanced image processing technologies, including ARCore, and machine learning algorithms such as DeepLabv3 for robust image segmentation and precise measurement of skin indurations indicative of LTBI. The system employs an edge detection algorithm to enhance accuracy. The application was evaluated against standard clinical practices, demonstrating significant improvements in accuracy and reliability. This innovation is crucial for effective tuberculosis management, especially in resource-limited regions. By automating and standardizing TST evaluations, the application enhances the accessibility and efficiency of TB diagnostics. Future work will focus on refining machine learning models, optimizing measurement algorithms, expanding functionalities to include comprehensive patient data management, and enhancing ARCore's performance across various lighting conditions and operational settings.
[149] ELMAR: Enhancing LiDAR Detection with 4D Radar Motion Awareness and Cross-modal Uncertainty
Xiangyuan Peng,Miao Tang,Huawei Sun,Bierzynski Kay,Lorenzo Servadei,Robert Wille
Main category: cs.CV
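TL;DR: This paper proposes a detection framework that combines 4D radar and LiDAR, enhancing perception performance by exploiting 4D radar motion status and cross-modal uncertainty.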
Details
Motivation: While LiDAR provides rich spatial information, 4D radar remains robust under adverse conditions and offers velocity measurement. However, the misalignment between modalities is often overlooked, which limits the performance of perception systems. Method: A Dynamic Motion-Aware Encoding module is proposed to capture object movement information from 4D radar, and instance-wise uncertainties of bounding boxes are estimated to mitigate cross-modal misalignment and refine the final LiDAR predictions. Result: Experiments on the View-of-Delft (VoD) dataset show that the method achieves an mAP of 74.89% in the entire area and 88.70% within the driving corridor while maintaining a real-time inference speed of 30.02 FPS. Conclusion: The proposed fusion method effectively combines the strengths of 4D radar and LiDAR, addresses cross-modal misalignment, and demonstrates strong performance in practical applications. Abstract: LiDAR and 4D radar are widely used in autonomous driving and robotics. While LiDAR provides rich spatial information, 4D radar offers velocity measurement and remains robust under adverse conditions. As a result, a growing number of studies have focused on 4D radar-LiDAR fusion to enhance perception. However, the misalignment between different modalities is often overlooked. To address this challenge and leverage the strengths of both modalities, we propose a LiDAR detection framework enhanced by 4D radar motion status and cross-modal uncertainty. The object movement information from 4D radar is first captured using a Dynamic Motion-Aware Encoding module during feature extraction to enhance 4D radar predictions. Subsequently, the instance-wise uncertainties of bounding boxes are estimated to mitigate the cross-modal misalignment and refine the final LiDAR predictions. Extensive experiments on the View-of-Delft (VoD) dataset highlight the effectiveness of our method, achieving state-of-the-art performance with an mAP of 74.89% in the entire area and 88.70% within the driving corridor while maintaining a real-time inference speed of 30.02 FPS.
[150] BPCLIP: A Bottom-up Image Quality Assessment from Distortion to Semantics Based on CLIP
Chenyue Song,Chen Hui,Wei Zhang,Haiqi Zhu,Shaohui Liu,Hong Huang,Feng Jiang
Main category: cs.CV
TL;DR: This paper proposes BPCLIP, a bottom-up image quality assessment method using CLIP and a multiscale cross attention module, achieving strong performance on IQA tasks by better capturing how distortions affect semantic perception.
Details
Motivation: Existing IQA methods often rely on linear fusion of multiscale features, which may not adequately capture the effect of distortions on semantic content. This motivates the development of a more effective method that models distortion impacts on high-level semantics. Method: BPCLIP utilizes an encoder to extract multiscale features and introduces a bottom-up multiscale cross attention module to model relationships between shallow and deep features. It also uses CLIP's pre-trained text encoder with image quality adjectives to enhance the link between image quality perception and human language. Result: The method achieves superior results on most public Full-Reference (FR) and No-Reference (NR) IQA benchmarks and demonstrates greater robustness compared to existing approaches. Conclusion: The proposed BPCLIP approach effectively captures the impact of distortions on high-level semantics by leveraging a bottom-up multiscale cross attention module and incorporating quality adjectives, resulting in superior performance and robustness on IQA benchmarks. Abstract: Image Quality Assessment (IQA) aims to evaluate the perceptual quality of images based on human subjective perception. Existing methods generally combine multiscale features to achieve high performance, but most rely on straightforward linear fusion of these features, which may not adequately capture the impact of distortions on semantic content. To address this, we propose a bottom-up image quality assessment approach based on the Contrastive Language-Image Pre-training (CLIP, a recently proposed model that aligns images and text in a shared feature space), named BPCLIP, which progressively extracts the impact of low-level distortions on high-level semantics. Specifically, we utilize an encoder to extract multiscale features from the input image and introduce a bottom-up multiscale cross attention module designed to capture the relationships between shallow and deep features. 
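To make the cross-attention idea concrete, here is a minimal sketch of scaled dot-product cross attention between two feature scales. It is an assumption-laden illustration, not BPCLIP's module: the choice of deep tokens as queries and shallow tokens as keys and values, the absence of learned projections, and the names `cross_attention`, `deep_tokens`, and `shallow_tokens` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross attention without learned projections.

    Each query attends over all keys; the output is the attention-weighted
    average of the corresponding values.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Assumed direction: one deep-feature token queries three shallow-feature tokens.
deep_tokens = [[1.0, 0.0]]
shallow_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(deep_tokens, shallow_tokens, shallow_tokens)
print(fused)
```

Unlike a fixed linear fusion, the mixing weights here depend on the content of both feature sets, which is the property the abstract contrasts with "straightforward linear fusion".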
In addition, by incorporating 40 image quality adjectives across six distinct dimensions, we enable the pre-trained CLIP text encoder to generate representations of the intrinsic quality of the image, thereby strengthening the connection between image quality perception and human language. Our method achieves superior results on most public Full-Reference (FR) and No-Reference (NR) IQA benchmarks, while demonstrating greater robustness.
[151] Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models
Mischa Dombrowski,Bernhard Kainz
Main category: cs.CV
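TL;DR: This paper introduces a high-performance synthetic data generation method that complies with privacy regulations and is particularly suited to medical imaging.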
Details
Motivation: Although synthetic data has reached a high level of visual fidelity, it still has significant problems with legal compliance and performance, especially in medical imaging. Method: A generalisable framework for training diffusion models is proposed to generate synthetic datasets that contain no personally identifiable information. Result: The method comes within one percentage point of real-data models in performance and significantly outperforms state-of-the-art methods that do not ensure privacy. Conclusion: Synthetic data can achieve performance close to that of real data while protecting privacy and meeting regulatory requirements. Abstract: Synthetic data has recently reached a level of visual fidelity that makes it nearly indistinguishable from real data, offering great promise for privacy-preserving data sharing in medical imaging. However, fully synthetic datasets still suffer from significant limitations: First and foremost, the legal aspect of sharing synthetic data is often neglected and data regulations, such as the GDPR, are largely ignored. Secondly, synthetic models fall short of matching the performance of real data, even for in-domain downstream applications. Recent methods for image generation have focused on maximising image diversity instead of fidelity solely to improve the mode coverage and therefore the downstream performance of synthetic data. In this work, we shift perspective and highlight how maximizing diversity can also be interpreted as protecting natural persons from being singled out, which leads to predicate singling-out (PSO) secure synthetic datasets. Specifically, we propose a generalisable framework for training diffusion models on personal data which leads to unpersonal synthetic datasets achieving performance within one percentage point of real-data models while significantly outperforming state-of-the-art methods that do not ensure privacy. Our code is available at https://github.com/MischaD/Trichotomy.
[152] Fast Neural Inverse Kinematics on Human Body Motions
David Tolpin,Sefy Kagarlitsky
Main category: cs.CV
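TL;DR: This paper proposes an efficient neural inverse kinematics framework for real-time human motion capture, addressing the limitations of conventional markerless motion capture systems.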
Details
Motivation: Markerless motion capture is flexible and comparatively inexpensive, but its high computational demands and slow inference limit its applicability in real-time scenarios. Method: The network architecture, training methodology, and inference procedure are described, and key design decisions are supported through ablation studies. Result: The framework is evaluated both qualitatively and quantitatively, validating its performance and reliability. Conclusion: The paper presents a fast and reliable neural inverse kinematics framework for real-time capture of human body motion from 3D keypoints. Abstract: Markerless motion capture enables the tracking of human motion without requiring physical markers or suits, offering increased flexibility and reduced costs compared to traditional systems. However, these advantages often come at the expense of higher computational demands and slower inference, limiting their applicability in real-time scenarios. In this technical report, we present a fast and reliable neural inverse kinematics framework designed for real-time capture of human body motions from 3D keypoints. We describe the network architecture, training methodology, and inference procedure in detail. Our framework is evaluated both qualitatively and quantitatively, and we support key design decisions through ablation studies.
[153] OSDMamba: Enhancing Oil Spill Detection from Remote Sensing Images Using Selective State Space Model
Shuaiyu Chen,Fu Wang,Peng Ren,Chunbo Luo,Zeyu Fu
Main category: cs.CV
TL;DR: This paper introduces OSDMamba, a Mamba-based model for oil spill detection that addresses issues with traditional CNNs, achieving top performance on two datasets.
Details
Motivation: Challenges in semantic segmentation for Oil Spill Detection include limited labeled samples, class imbalance, and difficulties in detecting small areas due to CNNs' limited receptive fields and poor global context capture. Method: The study proposes OSDMamba using State-Space Models (SSMs), specifically Mamba, to overcome the limitations of CNNs in detecting oil spills. It uses Mamba's selective scanning mechanism and an asymmetric decoder with ConvSSM and deep supervision for improved multi-scale feature fusion. Result: The proposed OSDMamba model yields improvements of 8.9% and 11.8% in oil spill detection across two publicly available datasets. Conclusion: OSDMamba, a Mamba-based architecture, achieves state-of-the-art performance in oil spill detection with significant improvements on two datasets. Abstract: Semantic segmentation is commonly used for Oil Spill Detection (OSD) in remote sensing images. However, the limited availability of labelled oil spill samples and class imbalance present significant challenges that can reduce detection accuracy. Furthermore, most existing methods, which rely on convolutional neural networks (CNNs), struggle to detect small oil spill areas due to their limited receptive fields and inability to effectively capture global contextual information. This study explores the potential of State-Space Models (SSMs), particularly Mamba, to overcome these limitations, building on their recent success in vision applications. We propose OSDMamba, the first Mamba-based architecture specifically designed for oil spill detection. OSDMamba leverages Mamba's selective scanning mechanism to effectively expand the model's receptive field while preserving critical details. Moreover, we designed an asymmetric decoder incorporating ConvSSM and deep supervision to strengthen multi-scale feature fusion, thereby enhancing the model's sensitivity to minority class samples. 
Experimental results show that the proposed OSDMamba achieves state-of-the-art performance, yielding improvements of 8.9% and 11.8% in OSD across two publicly available datasets.
[154] On the Robustness of Human-Object Interaction Detection against Distribution Shift
Chi Xie,Shuang Liang,Jie Li,Feng Zhu,Rui Zhao,Yichen Wei,Shengjie Zhao
Main category: cs.CV
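TL;DR: This paper studies how to improve the robustness of human-object interaction (HOI) detection models under distribution shift, proposing a new benchmark and enhancement strategies.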
Details
Motivation: Existing HOI detection methods perform well on ideal images and natural distributions, but in practical scenarios they face inevitable distribution shifts, which limits their applicability. Method: An automated approach to creating a robustness evaluation benchmark is proposed, and more than 40 existing HOI detection models are evaluated on it to analyze their shortcomings; a cross-domain data augmentation integrated with mixup and a feature fusion strategy with vision foundation models are then proposed. Result: Experiments show that the proposed enhancement strategies significantly increase the robustness of various methods and also benefit performance on standard benchmarks. Conclusion: The study shows that the robustness of HOI detection under distribution shift can be effectively improved, and it offers new directions for future research along with a dataset and code to be released. Abstract: Human-Object Interaction (HOI) detection has seen substantial advances in recent years. However, existing works focus on the standard setting with ideal images and natural distribution, far from practical scenarios with inevitable distribution shifts. This hampers the practical applicability of HOI detection. In this work, we investigate this issue by benchmarking, analyzing, and enhancing the robustness of HOI detection models under various distribution shifts. We start by proposing a novel automated approach to create the first robustness evaluation benchmark for HOI detection. Subsequently, we evaluate more than 40 existing HOI detection models on this benchmark, showing their insufficiency, analyzing the features of different frameworks, and discussing how the robustness in HOI is different from other tasks. With the insights from such analyses, we propose to improve the robustness of HOI detection methods through: (1) a cross-domain data augmentation integrated with mixup, and (2) a feature fusion strategy with frozen vision foundation models. Both are simple, plug-and-play, and applicable to various methods. Our experimental results demonstrate that the proposed approach significantly increases the robustness of various methods, with benefits on standard benchmarks, too. The dataset and code will be released.
[155] PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
Kui Huang,Xinrong Chen,Wenyu Lv,Jincheng Liao,Guanzhong Wang,Yi Liu
Main category: cs.CV
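TL;DR: PP-DocBee2 significantly improves multimodal document understanding performance and reduces latency by optimizing data quality, feature fusion, and inference methods.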
Details
Motivation: To address the limitations of PP-DocBee in multimodal document understanding and improve performance and inference efficiency. Method: Built on a large multimodal model architecture, the work proposes a data quality optimization strategy, an improved visual feature fusion strategy, and optimized inference methods. Result: Performance improves by 11.4% on internal benchmarks for Chinese business documents, and inference latency is reduced by 73.0%. Conclusion: PP-DocBee2 is an improved version of PP-DocBee that achieves more efficient multimodal document understanding through several technical innovations, and it has been open-sourced. Abstract: This report introduces PP-DocBee2, an advanced version of PP-DocBee, designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents, and reduce inference latency by 73.0% relative to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the ViT representational capacity by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at https://github.com/PaddlePaddle/PaddleMIX.
[156] MiCo: Multiple Instance Learning with Context-Aware Clustering for Whole Slide Image Analysis
Junjian Li,Hulin Kuang,Jin Liu,Hailin Yue,Mengshen He,Jianxin Wang
Main category: cs.CV
TL;DR: The paper introduces MiCo, an improved method for analyzing cancer-related histopathology images by enhancing spatial interaction modeling.
Details
Motivation: Spatial heterogeneity of WSIs poses challenges for conventional MIL methods, as they struggle to model scattered tissue distributions and capture cross-regional interactions. Method: A novel Multiple Instance Learning framework named MiCo was proposed. It uses a Cluster Route module to link similar instances across regions and a Cluster Reducer module to enhance inter-tissue associations. Result: MiCo performed effectively on two challenging tasks across nine large-scale cancer datasets. Conclusion: MiCo is effective in histopathology whole slide image analysis, showing superiority over existing methods. Abstract: Multiple instance learning (MIL) has shown significant promise in histopathology whole slide image (WSI) analysis for cancer diagnosis and prognosis. However, the inherent spatial heterogeneity of WSIs presents critical challenges, as morphologically similar tissue types are often dispersed across distant anatomical regions. Conventional MIL methods struggle to model these scattered tissue distributions and capture cross-regional spatial interactions effectively. To address these limitations, we propose a novel Multiple instance learning framework with Context-Aware Clustering (MiCo), designed to enhance cross-regional intra-tissue correlations and strengthen inter-tissue semantic associations in WSIs. MiCo begins by clustering instances to distill discriminative morphological patterns, with cluster centroids serving as semantic anchors. To enhance cross-regional intra-tissue correlations, MiCo employs a Cluster Route module, which dynamically links instances of the same tissue type across distant regions via feature similarity. These semantic anchors act as contextual hubs, propagating semantic relationships to refine instance-level representations. 
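The routing step described above can be pictured with a small sketch: each instance is linked to its most similar cluster centroid by cosine similarity, then nudged toward that anchor. This is only an illustration under stated assumptions, not MiCo's Cluster Route module; the function `route_to_anchors`, the blending weight `alpha`, and the toy 2-D features are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def route_to_anchors(instances, centroids, alpha=0.5):
    """Assign each instance to its most similar centroid and blend the two.

    Returns (assignments, refined), where refined[i] is
    (1 - alpha) * instance + alpha * assigned centroid.
    """
    assignments, refined = [], []
    for inst in instances:
        sims = [cosine(inst, c) for c in centroids]
        k = max(range(len(centroids)), key=lambda j: sims[j])
        assignments.append(k)
        refined.append([(1 - alpha) * x + alpha * c
                        for x, c in zip(inst, centroids[k])])
    return assignments, refined


# Toy patch features: the first two patches resemble anchor 0, the third anchor 1.
# Spatial position plays no role, so distant patches can share an anchor.
instances = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
assignments, refined = route_to_anchors(instances, centroids)
print(assignments)
print(refined)
```

Because assignment depends only on feature similarity, patches of the same tissue type end up tied to the same anchor regardless of where they sit on the slide.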
To eliminate semantic fragmentation and strengthen inter-tissue semantic associations, MiCo integrates a Cluster Reducer module, which consolidates redundant anchors while enhancing information exchange between distinct semantic groups. Extensive experiments on two challenging tasks across nine large-scale public cancer datasets demonstrate the effectiveness of MiCo, showcasing its superiority over state-of-the-art methods. The code is available at https://github.com/junjianli106/MiCo.
[157] Pre-Trained LLM is a Semantic-Aware and Generalizable Segmentation Booster
Fenghe Tang,Wenxin Ma,Zhiyang He,Xiaodong Tao,Zihang Jiang,S. Kevin Zhou
Main category: cs.CV
TL;DR: This paper introduces LLM4Seg, a hybrid model combining LLMs and CNNs for improved medical image segmentation with minimal added complexity.
Details
Motivation: The motivation is to explore whether the semantic awareness of pre-trained LLMs can be transferred to improve medical image segmentation tasks with minimal increases in trainable parameters. Method: The authors propose a hybrid structure called LLM4Seg, which incorporates a frozen LLM layer within a CNN encoder-decoder segmentation framework. They evaluate the approach on multiple medical imaging modalities and analyze its effectiveness. Result: The proposed method improves segmentation accuracy across different modalities like ultrasound, dermoscopy, polypscopy, and CT scans while using only a small number of additional trainable parameters. Conclusion: The paper concludes that integrating a pre-trained, frozen LLM layer into a CNN encoder-decoder framework enhances medical image segmentation performance across various modalities. Abstract: With the advancement of Large Language Model (LLM) for natural language processing, this paper presents an intriguing finding: a frozen pre-trained LLM layer can process visual tokens for medical image segmentation tasks. Specifically, we propose a simple hybrid structure that integrates a pre-trained, frozen LLM layer within the CNN encoder-decoder segmentation framework (LLM4Seg). Surprisingly, this design improves segmentation performance with a minimal increase in trainable parameters across various modalities, including ultrasound, dermoscopy, polypscopy, and CT scans. Our in-depth analysis reveals the potential of transferring LLM's semantic awareness to enhance segmentation tasks, offering both improved global understanding and better local modeling capabilities. The improvement proves robust across different LLMs, validated using LLaMA and DeepSeek.
[158] CmFNet: Cross-modal Fusion Network for Weakly-supervised Segmentation of Medical Images
Dongdong Meng,Sheng Li,Hao Wu,Suqing Tian,Wenjun Ma,Guoping Wang,Xueqing Yan
Main category: cs.CV
TL;DR: This paper proposes CmFNet for 3D weakly supervised cross-modal medical image segmentation, which uses a hybrid-supervised learning strategy and integrates information from multiple imaging modalities to achieve superior segmentation performance while reducing reliance on dense annotations.
Details
Motivation: Accurate automatic medical image segmentation relies on high-quality, dense annotations which are costly and time-consuming. Weakly supervised learning provides a more efficient alternative by leveraging sparse and coarse annotations instead of dense, precise ones. Method: CmFNet consists of three main components: a modality-specific feature learning network, a cross-modal feature learning network, and a hybrid-supervised learning strategy. These components integrate complementary information from multi-modal images and guide segmentation through scribble supervision, intra-modal regularization, and inter-modal consistency. Result: Extensive experiments show that CmFNet outperforms state-of-the-art weakly supervised methods and even fully supervised methods when full annotation is used. The approach effectively addresses performance degradation and overfitting issues caused by sparse annotations. Conclusion: CmFNet, a novel 3D weakly supervised cross-modal medical image segmentation approach, effectively mitigates overfitting and delivers robust segmentation results. It excels in segmenting both challenging small tumor regions and common anatomical structures and can facilitate clinical therapy. Abstract: Accurate automatic medical image segmentation relies on high-quality, dense annotations, which are costly and time-consuming. Weakly supervised learning provides a more efficient alternative by leveraging sparse and coarse annotations instead of dense, precise ones. However, segmentation performance degradation and overfitting caused by sparse annotations remain key challenges. To address these issues, we propose CmFNet, a novel 3D weakly supervised cross-modal medical image segmentation approach. CmFNet consists of three main components: a modality-specific feature learning network, a cross-modal feature learning network, and a hybrid-supervised learning strategy. 
Specifically, the modality-specific feature learning network and the cross-modal feature learning network effectively integrate complementary information from multi-modal images, enhancing shared features across modalities to improve segmentation performance. Additionally, the hybrid-supervised learning strategy guides segmentation through scribble supervision, intra-modal regularization, and inter-modal consistency, modeling spatial and contextual relationships while promoting feature alignment. Our approach effectively mitigates overfitting, delivering robust segmentation results. It excels in segmenting both challenging small tumor regions and common anatomical structures. Extensive experiments on a clinical cross-modal nasopharyngeal carcinoma (NPC) dataset (including CT and MR imaging) and the publicly available CT Whole Abdominal Organ dataset (WORD) show that our approach outperforms state-of-the-art weakly supervised methods. In addition, our approach also outperforms fully supervised methods when full annotation is used. Our approach can facilitate clinical therapy and benefit various specialists, including physicists, radiologists, pathologists, and oncologists.
[159] CLGRPO: Reasoning Ability Enhancement for Small VLMs
Fanyi Wang,Binzhi Dong,Haotian Hu,Jinjin Xu,Zhiwang Zhang
Main category: cs.CV
TL;DR: This paper introduces an Incremental Training Strategy to boost the reasoning abilities of small vision language models, enabling a 1B model to perform similarly to an 8B model.
Details
Motivation: Small Vision Language Models (SVLMs) have commercial value due to their low cost and power consumption but are limited in reasoning ability due to parameter constraints. This work aims to enhance their reasoning capabilities. Method: An Incremental Training Strategy with four stages was proposed: Supervised Fine-Tuning on COT data, GRPO training constrained by format rewards, GRPO training constrained by both format and accuracy rewards, and ClipLow GRPO to constrain the training capture space. A Self-Supervised Chain-of-Thought Data Construction System was also developed. Result: Experimental results on the EMOSet-118K dataset showed significant improvements in accuracy (2.77 increase) and recall (0.69 increase) compared to the baseline model fine-tuned on original data. Conclusion: The proposed Incremental Training Strategy significantly enhances the reasoning ability of 1B Small Vision Language Models (SVLMs), achieving performance comparable to that of 8B models. Abstract: Small Vision Language Models (SVLMs) generally refer to models with parameter sizes less than or equal to 2B. Their low cost and power consumption characteristics confer high commercial value. However, their reasoning abilities are limited by the number of parameters. To address this issue, this paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs. Firstly, we constructed a Self-Supervised Chain-of-Thought (COT) Data Construction System, which leverages multiple LVLMs with 7B parameters or more to transform original data into COT data in a self-supervised manner. Our proposed Incremental Training Strategy consists of four stages. Stage 1 injects domain knowledge by performing Supervised Fine-Tuning (SFT) to the pretrained model on the COT data. 
Stage 2 aligns the COT data format by conducting a small amount of Group Relative Policy Optimization (GRPO) training constrained only by format rewards on the COT data. Stage 3 enhances reasoning ability by applying GRPO training on the COT data with constraints on both format and accuracy rewards. The resulting model shows significant improvement compared to the baseline. Stage 4 addresses the limited capacity of the SVLMs and the weak ability to capture complex patterns by proposing ClipLow GRPO (CLGRPO) to constrain the capture space of the training process. We conducted extensive comparative and ablation experiments on the abstract semantic recognition dataset EMOSet-118K. Experimental results demonstrate that our method significantly improves the reasoning ability of 1B SVLM. Compared to the baseline model fine-tuned on the original data, accuracy increased by 2.77 and recall by 0.69, achieving performance comparable to that of 8B models.
[160] Deep Supervised LSTM for 3D morphology estimation from Multi-View RGB Images of Wheat Spikes
Olivia Zumsteg,Nico Graf,Aaron Haeusler,Norbert Kirchgessner,Nicola Storni,Lukas Roth,Andreas Hund
Main category: cs.CV
TL;DR: This paper introduces a deep learning model combining DINOv2 and LSTM for accurate wheat spike volume estimation from 2D images, outperforming traditional geometric and projection methods.
Details
Motivation: Estimating 3D morphological traits from 2D RGB images is challenging due to depth loss, distortions, and occlusions. Accurate non-destructive volume estimation of wheat spikes is important for agricultural analysis under varying field conditions. Method: A neural network approach using a transfer learning pipeline combining DINOv2 (a self-supervised Vision Transformer) and a unidirectional LSTM network, trained with deep supervision. The model was benchmarked against two baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Result: The deep supervised model achieved a MAPE of 6.46% on six-view indoor images, outperforming the area (9.36%) and geometric (13.98%) baselines. Fine-tuning on field-based data yielded a MAPE of 10.82%, demonstrating effective domain adaptation. Conclusion: The proposed deep learning model, combining DINOv2 and LSTM with deep supervision, outperforms conventional geometric and projection-based methods in estimating wheat spike volumes from 2D RGB images. It demonstrates strong generalization across indoor and field conditions. Abstract: Estimating three-dimensional morphological traits from two-dimensional RGB images presents inherent challenges due to the loss of depth information, projection distortions, and occlusions under field conditions. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes, using RGB image sequences and structured-light 3D scans as ground truth references. Due to the complex geometry of the spikes, we propose a neural network approach for volume estimation in 2D images, employing a transfer learning pipeline that combines DINOv2, a self-supervised Vision Transformer, with a unidirectional Long Short-Term Memory (LSTM) network. By using deep supervision, the model is able to learn more robust intermediate representations, which enhances its generalisation ability across varying evaluation sequences. 
We benchmark our model against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Our deep supervised model achieves a mean absolute percentage error (MAPE) of 6.46% on six-view indoor images, outperforming the area (9.36%) and geometric (13.98%) baselines. Fine-tuning the model on field-based single-image data enables domain adaptation, yielding a MAPE of 10.82%. We demonstrate that object shape significantly impacts volume prediction accuracy, with irregular geometries such as wheat spikes posing greater challenges for geometric methods compared to our deep learning approach.
[161] Training-free Test-time Improvement for Explainable Medical Image Classification
Hangzhou He,Jiachen Tang,Lei Zhu,Kaiwen Li,Yanye Lu
Main category: cs.CV
TL;DR: This paper proposes a training-free method to improve the adaptability of explainable deep learning models for medical image classification by identifying and correcting confusion in concept activation with minimal new data.
Details
Motivation: The motivation is to address the limitations of Concept Bottleneck Models (CBMs) in adapting to new environments due to concept-level shifts and the high cost of acquiring expert-annotated concept labels. Method: The method involves a two-step process: masking misactivated confounding concepts and amplifying under-activated discriminative concepts, using minimal new data with only image-level labels. Result: The proposed approach enhances out-of-domain performance without sacrificing source domain accuracy, validated on skin and white blood cell images. Conclusion: The paper concludes that the proposed training-free confusion concept identification strategy effectively improves out-of-domain performance while maintaining source domain accuracy in Concept Bottleneck Models for medical image classification. Abstract: Deep learning-based medical image classification techniques are rapidly advancing in medical image analysis, making it crucial to develop accurate and trustworthy models that can be efficiently deployed across diverse clinical scenarios. Concept Bottleneck Models (CBMs), which first predict a set of explainable concepts from images and then perform classification based on these concepts, are increasingly being adopted for explainable medical image classification. However, the inherent explainability of CBMs introduces new challenges when deploying trained models to new environments. Variations in imaging protocols and staining methods may induce concept-level shifts, such as alterations in color distribution and scale. Furthermore, since CBM training requires explicit concept annotations, fine-tuning models solely with image-level labels could compromise concept prediction accuracy and faithfulness - a critical limitation given the high cost of acquiring expert-annotated concept labels in medical domains. To address these challenges, we propose a training-free confusion concept identification strategy. 
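The two operations named in the Method summary (masking confounding concepts, amplifying discriminative ones) can be pictured as a test-time edit of the concept vector before it reaches the frozen classifier head of a CBM. The sketch below is a hypothetical illustration under that reading, not the authors' implementation: the function names, the `gain` parameter, the concept indices, and the toy weights are all assumptions, and the paper's procedure for identifying which concepts to edit is not shown.

```python
import numpy as np


def adjust_concepts(concepts, confounding_idx, discriminative_idx, gain=2.0):
    """Return a copy of the concept activations with confounding concepts
    zeroed out and discriminative concepts scaled by `gain` (assumed value)."""
    adjusted = np.array(concepts, dtype=float, copy=True)
    adjusted[..., confounding_idx] = 0.0        # mask misactivated concepts
    adjusted[..., discriminative_idx] *= gain   # amplify under-activated concepts
    return adjusted


def classify(concepts, weights, bias):
    """Frozen linear head mapping concept activations to a class index."""
    logits = np.asarray(concepts, dtype=float) @ weights + bias
    return np.argmax(logits, axis=-1)


# Toy setup: 3 concepts, 2 classes (all numbers are made up).
weights = np.array([[2.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 0.5]])
bias = np.zeros(2)
concepts = np.array([0.2, 0.9, 0.1])  # concept 1 fires spuriously here

before = int(classify(concepts, weights, bias))
after = int(classify(adjust_concepts(concepts, [1], [0]), weights, bias))
print(before, after)
```

No weights are updated in this sketch; only the concept activations change, which is what makes such an approach training-free.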
By leveraging minimal new data (e.g., 4 images per class) with only image-level labels, our approach enhances out-of-domain performance without sacrificing source domain accuracy through two key operations: masking misactivated confounding concepts and amplifying under-activated discriminative concepts. The efficacy of our method is validated on both skin and white blood cell images. Our code is available at: https://github.com/riverback/TF-TTI-XMed.
[162] MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
Jisheng Dang,Huilin Song,Junbin Xiao,Bimei Wang,Han Peng,Haoxuan Li,Xun Yang,Meng Wang,Tat-Seng Chua
Main category: cs.CV
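TL;DR: This study proposes MUPA, an efficient multi-path agentic method for video question answering that significantly improves the model's visual evidence grounding and overall performance.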
Details
Motivation: Modern multimodal models rely on linguistic priors and spurious correlations, which leads to poorly grounded predictions, so a more reliable approach to video question answering is needed. Method: The authors propose MUPA, a cooperative multi-path agentic approach comprising three distinct reasoning paths and a dedicated reflection agent that aggregates their results. Result: Despite using only 2B parameters, MUPA outperforms all 7B-scale competitors; when scaled to 7B parameters, it reaches Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, setting new state-of-the-art results. Conclusion: MUPA's multi-path agentic approach effectively addresses visual evidence grounding in video question answering, improves model trustworthiness, and performs well at different parameter scales. Abstract: Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths on the interplay of grounding and QA agents in different chronological orders, along with a dedicated reflection agent to judge and aggregate the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA's effectiveness towards trustworthy video-language understanding. Our code is available at https://github.com/longmalongma/MUPA.
[163] TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving
Wenzhuo Liu,Yicheng Qiao,Zhen Wang,Qiannan Guo,Zilong Chen,Meihua Zhou,Xinran Li,Letian Wang,Zhiwei Li,Huaping Liu,Wenshuo Wang
Main category: cs.CV
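TL;DR: This paper proposes TEM^3-Learning, a new multi-task learning framework for assistive driving systems that improves driver emotion and behavior recognition, traffic context recognition, and vehicle behavior recognition while remaining computationally efficient and capable of real-time inference.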
Details
Motivation: Existing multi-task learning methods for assistive driving face two critical limitations: single-modality constraints that limit comprehensive scene understanding, and inefficient architectures that impede real-time deployment. Method: The authors propose TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), which consists of two components: the mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba) and the MTL-based gated multimodal feature integrator (MGMI). The former introduces a forward-backward temporal scanning mechanism and global-local spatial attention; the latter uses task-specific multi-gating modules to adaptively highlight the modality features most relevant to each task. Result: On the AIDE dataset, the proposed model achieves state-of-the-art accuracy on all four tasks while keeping a lightweight architecture (fewer than 6 million parameters) and reaching an inference speed of 142.32 FPS. Ablation studies further validate the effectiveness of the framework and the independent contribution of each module. Conclusion: TEM^3-Learning is a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. On the AIDE dataset, the model achieves state-of-the-art accuracy on all four tasks while remaining lightweight and delivering high inference speed. The code is available at the linked repository. Abstract: Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM^3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the most relevant modality features for each task, effectively alleviating the negative transfer problem in MTL. 
Evaluated on the AIDE dataset, our proposed model achieves state-of-the-art accuracy across all four tasks, maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.
[164] ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
Junying Chen,Zhenyang Cai,Pengcheng Chen,Shunian Chen,Ke Ji,Xidong Wang,Yunjin Yang,Benyou Wang
Main category: cs.CV
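TL;DR: This study builds the large-scale synthetic dataset ShareGPT-4o-Image and develops the multimodal large language model Janus-4o, which generates high-quality images from text and from combined text-and-image inputs, with the aim of advancing open research on image generation.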
Details
Motivation: To make advanced multimodal image generation openly accessible rather than confined to proprietary systems such as GPT-4o-Image. Method: The authors use GPT-4o's image generation capabilities to create ShareGPT-4o-Image, a synthetic dataset of 91K samples, and train a new multimodal large language model, Janus-4o, on it. Result: Janus-4o supports both text-to-image and text-and-image-to-image generation and performs well with little data (only 91K synthetic samples) and a short training time (6 hours). Conclusion: By releasing ShareGPT-4o-Image and Janus-4o, the authors hope to foster open research on photorealistic, instruction-aligned image generation. Abstract: Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o's image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
[165] Enhancing VICReg: Random-Walk Pairing for Improved Generalization and Better Global Semantics Capturing
Idan Simai,Ronen Talmon,Uri Shaham
Main category: cs.CV
TL;DR: This paper identifies a generalization issue in the SSL method VICReg, proposes SAG-VICReg as an improved alternative, and introduces a new evaluation metric for embeddings that doesn't require labels.
Details
Motivation: The authors observe that VICReg, a popular SSL method, may struggle with robust generalization due to overreliance on training data, prompting them to enhance its performance on unseen data. Method: The authors propose SAG-VICReg, an enhanced version of VICReg that incorporates new training techniques to improve global semantics capture and generalization capabilities. They also introduce a new standalone evaluation metric for embeddings that accounts for global data structure without requiring labels. Result: Experiments show that SAG-VICReg improves generalization while maintaining competitive results on local evaluation metrics. It demonstrates superior performance on global semantic understanding metrics and introduces a novel label-free evaluation method for embeddings. Conclusion: SAG-VICReg effectively addresses the generalization challenge in SSL while matching or surpassing state-of-the-art baselines, particularly on global semantic understanding metrics. Abstract: In this paper, we argue that viewing VICReg, a popular self-supervised learning (SSL) method, through the lens of spectral embedding reveals a potential source of sub-optimality: it may struggle to generalize robustly to unseen data due to overreliance on the training data. This observation invites a closer look at how well this method achieves its goal of producing meaningful representations of images outside of the training set as well. Here, we investigate this issue and introduce SAG-VICReg (Stable and Generalizable VICReg), a method that builds on VICReg by incorporating new training techniques. These enhancements improve the model's ability to capture global semantics within the data and strengthen the generalization capabilities. Experiments demonstrate that SAG-VICReg effectively addresses the generalization challenge while matching or surpassing diverse state-of-the-art SSL baselines. 
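As background for this entry, the baseline VICReg objective that SAG-VICReg builds on combines three terms: invariance (paired views should embed close together), variance (each embedding dimension should keep a standard deviation near 1), and covariance (different dimensions should be decorrelated). The NumPy sketch below illustrates that general form only. It does not include the SAG-VICReg modifications, which this digest does not specify; the weights 25/25/1 follow the defaults commonly used for VICReg, and the exact normalization of each term is an assumption.

```python
import numpy as np


def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Baseline VICReg-style loss for two batches of embeddings, shape (n, d)."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    n, d = z_a.shape

    # Invariance: mean squared difference between paired embeddings.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation (target std of 1).
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Penalize squared off-diagonal entries of the covariance matrix.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return float(
        sim_w * invariance
        + var_w * (variance_term(z_a) + variance_term(z_b))
        + cov_w * (covariance_term(z_a) + covariance_term(z_b))
    )


# Random embeddings as a stand-in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
z = rng.normal(size=(64, 8))
print(vicreg_loss(z, z + 0.1 * rng.normal(size=z.shape)))
```

The variance and covariance terms are what keep the encoder from collapsing to a constant output, since the invariance term alone is minimized by mapping every image to the same point.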
Notably, our method exhibits superior performance on metrics designed to evaluate global semantic understanding, while simultaneously maintaining competitive results on local evaluation metrics. Furthermore, we propose a new standalone evaluation metric for embeddings that complements the standard evaluation methods and accounts for the global data structure without requiring labels, a key issue when tagged data is scarce or not available.
[166] Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp Detection
Quan Zhou,Gan Luo,Qiang Hu,Qingyong Zhang,Jinhua Zhang,Yinjiao Tian,Qiang Li,Zhiwei Wang
Main category: cs.CV
TL;DR: This paper proposes DADA, an adversarial diffusion framework that synthesizes high-value false positives to enhance polyp detection for colorectal cancer screening, achieving superior performance over existing methods.
Details
Motivation: The motivation is to overcome the limitations of existing models in handling false positives during polyp detection, which is critical for reliable colorectal cancer screening. Method: The authors introduced a regional noise matching strategy and the Detector-guided Adversarial Diffusion Attacker (DADA) module to generate high-value false positives by perturbing a negative-centric diffusion process. Result: The experiments showed that the proposed method outperforms state-of-the-art approaches, with improvements of at least 2.6% and 2.7% in F1-score on public and in-house datasets, respectively. Conclusion: The paper concludes that their proposed method, DADA, effectively synthesizes high-value false positives for polyp detection, improving the performance of detectors in colorectal cancer screening. Abstract: Polyp detection is crucial for colorectal cancer screening, yet existing models are limited by the scale and diversity of available data. While generative models show promise for data augmentation, current methods mainly focus on enhancing polyp diversity, often overlooking the critical issue of false positives. In this paper, we address this gap by proposing an adversarial diffusion framework to synthesize high-value false positives. The extensive variability of negative backgrounds presents a significant challenge in false positive synthesis. To overcome this, we introduce two key innovations: First, we design a regional noise matching strategy to construct a negative synthesis space using polyp detection datasets. This strategy trains a negative-centric diffusion model by masking polyp regions, ensuring the model focuses exclusively on learning diverse background patterns. 
Second, we introduce the Detector-guided Adversarial Diffusion Attacker (DADA) module, which perturbs the negative synthesis process to disrupt a pre-trained detector's decision, guiding the negative-centric diffusion model to generate high-value, detector-confusing false positives instead of low-value, ordinary backgrounds. Our approach is the first to apply adversarial diffusion to lesion detection, establishing a new paradigm for targeted false positive synthesis and paving the way for more reliable clinical applications in colorectal cancer screening. Extensive results on public and in-house datasets verify the superiority of our method over the current state of the art, with our synthesized data improving the detectors by at least 2.6% and 2.7% in F1-score, respectively, over the baselines. Code is available at https://github.com/Huster-Hq/DADA.
[167] See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis
Ruinan Jin,Gexin Huang,Xinwei Shen,Qiong Zhang,Yan Shuo Tan,Xiaoxiao Li
Main category: cs.CV
TL;DR: This paper explores enhancing medical vision-language models (VLMs) with comparative reasoning using reference images and clinical prompts, leading to improved diagnostic accuracy.
Details
Motivation: The motivation stems from the inherent challenges in medical imaging diagnosis due to diseases mimicking normal anatomy and the lack of explicit comparative reasoning mechanisms in existing medical VLMs. Method: The researchers conducted an extensive empirical analysis using general-purpose VLMs, providing them with query and normative matched reference images along with clinically-informed comparative prompts to evaluate diagnostic outcomes. Result: The results showed that using comparative reasoning with reference images significantly improved diagnostic outcomes compared to single-image baselines, especially after supervised fine-tuning (SFT). Conclusion: The study concludes that integrating comparative analysis into VLMs enhances diagnostic accuracy by leveraging reference images and clinically-informed prompts. Abstract: Medical imaging diagnosis presents inherent challenges due to diseases that mimic normal anatomy and exhibit significant inter-patient variability. Clinicians routinely employ comparative reasoning, using reference images from healthy controls or previous patient examinations, to discern subtle yet diagnostically critical abnormalities. However, existing medical vision-language models (VLMs) focus primarily on single-image or single-series analyses and lack explicit mechanisms for comparative reasoning. Conversely, general-purpose VLMs demonstrate strong multi-image comparative reasoning capabilities but lack essential medical-domain knowledge to identify nuanced clinical differences. This work aims to bridge this gap by exploring clinically-inspired comparative analysis within VLMs, leveraging reference images to enhance diagnostic accuracy. 
Through extensive empirical analysis, we show that providing general-purpose VLMs with query and normative matched reference images, accompanied by clinically-informed comparative prompts, significantly improves diagnostic outcomes compared to single-image baselines, especially after supervised finetuning (SFT). Our contributions highlight the clinical relevance of comparative analysis, introduce novel strategies for leveraging reference images in VLMs, empirically demonstrate enhanced performance across multiple medical visual question answering (VQA) tasks, and provide theoretical insights into the efficacy of comparative image analysis in medical diagnosis.
[168] Pattern-Based Phase-Separation of Tracer and Dispersed Phase Particles in Two-Phase Defocusing Particle Tracking Velocimetry
Christian Sax,Jochen Kriegseis
Main category: cs.CV
TL;DR: This paper presents a deep learning-based method for separating phases in two-phase flow velocimetry using single-camera setups, achieving high accuracy with CNNs and GAN-generated training data.
Details
Motivation: The motivation is to develop a robust method for phase separation in defocusing particle tracking velocimetry for dispersed two-phase flows, especially when traditional approaches such as wavelength-, size-, or ensemble correlation-based methods are not feasible. Method: A post-processing-based approach using convolutional neural networks (Faster R-CNN and YOLOv4 variants) was developed to detect and classify particle images based on pattern differences in defocused particle images. A generative adversarial network (GAN) framework was introduced to generate large, labeled training datasets. Result: The method achieved high detection precision and classification accuracy (95-100%) across six datasets, including synthetic and real-world two-phase flows, demonstrating its effectiveness and robustness under domain shifts. Conclusion: The study concludes that CNNs can be effectively used for phase separation in disperse two-phase DPTV, offering high detection precision and classification accuracy even under domain shifts, particularly where traditional methods are impractical. Abstract: This work investigates the feasibility of a post-processing-based approach for phase separation in defocusing particle tracking velocimetry for dispersed two-phase flows. The method enables the simultaneous 3D localization determination of both tracer particles and particles of the dispersed phase, using a single-camera setup. The distinction between phases is based on pattern differences in defocused particle images, which arise from distinct light scattering behaviors of tracer particles and bubbles or droplets. Convolutional neural networks, including Faster R-CNN and YOLOv4 variants, are trained to detect and classify particle images based on these pattern features. 
To generate large, labeled training datasets, a generative adversarial network based framework is introduced, allowing the generation of auto-labeled data that more closely reflects experiment-specific visual appearance. Evaluation across six datasets, comprising synthetic two-phase and real single- and two-phase flows, demonstrates high detection precision and classification accuracy (95-100%), even under domain shifts. The results confirm the viability of using CNNs for robust phase separation in disperse two-phase DPTV, particularly in scenarios where traditional wavelength-, size-, or ensemble correlation-based methods are impractical.
[169] CDG-MAE: Learning Correspondences from Diffusion Generated Views
Varun Belagali,Pierre Marza,Srikar Yellapragada,Zilinghan Li,Tarak Nath Nandi,Ravi K Madduri,Joel Saltz,Stergios Christodoulidis,Maria Vakalopoulou,Dimitris Samaras
Main category: cs.CV
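TL;DR: This paper proposes CDG-MAE, a novel MAE-based self-supervised method that generates diverse synthetic views from static images for dense correspondence learning.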
Details
Motivation: Learning dense correspondences is critical for applications such as video label propagation but is hindered by tedious, unscalable manual annotation. Existing self-supervised methods use a cross-view pretext task, yet acquiring effective training data remains a challenge. Method: The paper uses an image-conditioned diffusion model to generate diverse synthetic views with substantial changes in pose and perspective, and extends the standard single-anchor MAE setting to a multi-anchor strategy to effectively modulate the difficulty of the pretext task. Result: CDG-MAE significantly outperforms state-of-the-art MAE methods that rely only on images and substantially narrows the gap to video-based approaches. Conclusion: CDG-MAE overcomes the limitations of video and crop-based anchors and offers a new approach to cross-view self-supervised pretraining. Abstract: Learning dense correspondences, critical for applications such as video label propagation, is hindered by tedious and unscalable manual annotation. Self-supervised methods address this by using a cross-view pretext task, often modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge: collecting diverse video datasets is difficult and costly, while simple image crops lack necessary pose variations. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video and crop-based anchors. We present a quantitative method to evaluate local and global consistency of generated images, discussing their use for cross-view self-supervised pretraining. Furthermore, we enhance the standard single-anchor MAE setting to a multi-anchor strategy to effectively modulate the difficulty of the pretext task. CDG-MAE significantly outperforms state-of-the-art MAE methods reliant only on images and substantially narrows the performance gap to video-based approaches.
[170] STACT-Time: Spatio-Temporal Cross Attention for Cine Thyroid Ultrasound Time Series Classification
Irsyad Adam,Tengyue Zhang,Shrayes Raman,Zhuyu Qiu,Brandon Taraku,Hexiang Feng,Sile Wang,Ashwath Radhachandran,Shreeram Athreya,Vedrana Ivezic,Peipei Ping,Corey Arnold,William Speier
Main category: cs.CV
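TL;DR: This study proposes the STACT-Time model, which integrates thyroid ultrasound cine clips with segmentation mask features to significantly improve benign-versus-malignant classification of thyroid nodules, helping to reduce unnecessary biopsies and improve patient management.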
Details
Motivation: Existing TI-RADS systems are limited by interobserver variability, and existing deep learning methods fail to fully exploit the dynamic global information and multi-view structural changes provided by US cine clips, which weakens risk stratification. Method: The authors propose STACT-Time, a new representation learning framework that combines imaging features from US cine clips with segmentation mask features automatically generated by a pretrained model, and uses self-attention and cross-attention mechanisms to capture rich spatio-temporal context. Result: Compared with state-of-the-art models, STACT-Time achieves better malignancy prediction, with a cross-validation precision of 0.91 (±0.02) and an F1 score of 0.89 (±0.02). Conclusion: STACT-Time performs well in ultrasound-based analysis of thyroid cancer, reducing unnecessary biopsies of benign nodules while maintaining high sensitivity for malignancy detection, and has the potential to improve clinical decision-making and patient outcomes. Abstract: Thyroid cancer is among the most common cancers in the United States. Thyroid nodules are frequently detected through ultrasound (US) imaging, and some require further evaluation via fine-needle aspiration (FNA) biopsy. Despite its effectiveness, FNA often leads to unnecessary biopsies of benign nodules, causing patient discomfort and anxiety. To address this, the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS) has been developed to reduce benign biopsies. However, such systems are limited by interobserver variability. Recent deep learning approaches have sought to improve risk stratification, but they often fail to utilize the rich temporal and spatial context provided by US cine clips, which contain dynamic global information and surrounding structural changes across various views. In this work, we propose the Spatio-Temporal Cross Attention for Cine Thyroid Ultrasound Time Series Classification (STACT-Time) model, a novel representation learning framework that integrates imaging features from US cine clips with features from segmentation masks automatically generated by a pretrained model. By leveraging self-attention and cross-attention mechanisms, our model captures the rich temporal and spatial context of US cine clips while enhancing feature representation through segmentation-guided learning. Our model improves malignancy prediction compared to state-of-the-art models, achieving a cross-validation precision of 0.91 (±0.02) and an F1 score of 0.89 (±0.02). 
By reducing unnecessary biopsies of benign nodules while maintaining high sensitivity for malignancy detection, our model has the potential to enhance clinical decision-making and improve patient outcomes.
[171] OmniGen2: Exploration to Advanced Multimodal Generation
Chenyuan Wu,Pengfei Zheng,Ruiran Yan,Shitao Xiao,Xin Luo,Yueze Wang,Wanli Li,Xiyan Jiang,Yexin Liu,Junjie Zhou,Ze Liu,Ziyi Xia,Chaofan Li,Haoge Deng,Jiahao Wang,Kun Luo,Bo Zhang,Defu Lian,Xinlong Wang,Zhongyuan Wang,Tiejun Huang,Zheng Liu
Main category: cs.CV
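TL;DR: OmniGen2 is a versatile open-source generative model that handles a range of generation tasks, performs well on multiple task benchmarks, and will be released with its associated resources to support future research.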
Details
Motivation: OmniGen2 aims to provide a unified solution for diverse generation tasks and to build upon existing multimodal understanding models without re-adapting VAE inputs. Method: OmniGen2 uses two distinct decoding pathways for the text and image modalities, with unshared parameters and a decoupled image tokenizer. The authors also develop comprehensive data construction pipelines and introduce a reflection mechanism tailored to image generation tasks. Result: Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks and state-of-the-art consistency among open-source models. Conclusion: OmniGen2 delivers competitive performance across several generation tasks, including text-to-image generation, image editing, and in-context generation, and the models, training code, datasets, and data construction pipeline will be released to support future research. Abstract: In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
[172] DExNet: Combining Observations of Domain Adapted Critics for Leaf Disease Classification with Limited Data
Sabbir Ahmed,Md. Bakhtiar Hasan,Tasnim Ahmed,Md. Hasanul Kabir
Main category: cs.CV
TL;DR: This paper introduces DExNet, a few-shot learning framework for plant disease classification that achieves near state-of-the-art performance with significantly reduced training data by combining domain adaptation and feature fusion from multiple pre-trained CNNs.
Details
Motivation: Deep learning models typically require large-scale datasets to achieve high performance, which limits their effectiveness in scenarios with limited samples, such as plant leaf disease classification. This work aims to overcome this limitation by introducing a few-shot learning approach that reduces the dependency on large training sets. Method: A few-shot learning framework called Domain-adapted Expert Network (DExNet) was developed. It uses feature embeddings ('observations') from nine pre-trained CNN-based 'critics', which are domain adapted using a non-overlapping leaf disease dataset. These observations are then fused in a Feature Fusion Block and classified using Bi-LSTM layers. Result: The model achieved 89.06%, 92.46%, and 94.07% accuracy for 5-shot, 10-shot, and 15-shot classifications, respectively, on tomato leaf images from the PlantVillage dataset. In 80-shot classification, it reached an accuracy of 98.09±0.7%, only 1.2% less than state-of-the-art methods, while requiring 94.5% less training data. Conclusion: The proposed DExNet framework effectively addresses the challenge of plant disease classification with limited training data by leveraging domain adaptation and feature fusion from multiple pre-trained CNNs. It achieves competitive performance compared to state-of-the-art models while significantly reducing the need for large datasets. Abstract: While deep learning-based architectures have been widely used for correctly detecting and classifying plant diseases, they require large-scale datasets to learn generalized features and achieve state-of-the-art performance. This poses a challenge for such models to obtain satisfactory performance in classifying leaf diseases with limited samples. This work proposes a few-shot learning framework, Domain-adapted Expert Network (DExNet), for plant disease classification that compensates for the lack of sufficient training data by combining observations of a number of expert critics. 
It starts with extracting the feature embeddings as 'observations' from nine 'critics' that are state-of-the-art pre-trained CNN-based architectures. These critics are 'domain adapted' using a publicly available leaf disease dataset having no overlapping classes with the specific downstream task of interest. The observations are then passed to the 'Feature Fusion Block' and finally to a classifier network consisting of Bi-LSTM layers. The proposed pipeline is evaluated on the 10 classes of tomato leaf images from the PlantVillage dataset, achieving promising accuracies of 89.06%, 92.46%, and 94.07%, respectively, for 5-shot, 10-shot, and 15-shot classification. Furthermore, an accuracy of 98.09±0.7% has been achieved in 80-shot classification, which is only 1.2% less than state-of-the-art, allowing a 94.5% reduction in the training data requirement. The proposed pipeline also outperforms existing works on leaf disease classification with limited data in both laboratory and real-life conditions in single-domain, mixed-domain, and cross-domain scenarios.
[173] Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
Jiaming Han,Hao Chen,Yang Zhao,Hanyu Wang,Qi Zhao,Ziyan Yang,Hao He,Xiangyu Yue,Lu Jiang
Main category: cs.CV
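TL;DR: This paper proposes Tar, a multimodal framework that converts images into discrete tokens with a text-aligned tokenizer and uses scale-adaptive encoding and decoding together with generative de-tokenizers, improving both visual understanding and generation as well as training efficiency.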
Details
Motivation: To unify visual understanding and generation within a shared discrete semantic representation and remove the need for modality-specific designs. Method: The authors propose Tar, a multimodal framework that unifies visual understanding and generation. At its core is the Text-Aligned Tokenizer (TA-Tok), complemented by scale-adaptive encoding and decoding and a generative de-tokenizer that produces high-fidelity visual outputs. Result: Experiments show that Tar performs strongly across multiple benchmarks, with faster convergence and greater training efficiency. Conclusion: Tar matches or surpasses existing multimodal LLM methods while converging faster and training more efficiently. Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com
[174] Multimodal Fusion SLAM with Fourier Attention
Youjie Zhou,Guofeng Mei,Yiming Wang,Yi Wan,Fabio Poiesi
Main category: cs.CV
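TL;DR: This paper proposes FMF-SLAM, an efficient multimodal fusion SLAM method that uses Fourier-based attention and multi-scale knowledge distillation to achieve high-performance real-time localization and mapping under noise and challenging lighting.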
Details
Motivation: To address the heavy computational cost and limited performance of traditional optical-flow-based visual SLAM in noisy, variably lit, and dark environments. Method: The method introduces Fourier-based self-attention and cross-modal attention, combines them with multi-scale knowledge distillation to strengthen multimodal feature interaction, and uses the fast Fourier transform (FFT) to improve efficiency. Result: FMF-SLAM is validated on TUM, TartanAir, and real-world datasets, showing real-time operation and state-of-the-art performance in challenging environments. Conclusion: FMF-SLAM performs strongly under noise, varying lighting, and darkness, and its integration with GNSS-RTK and global bundle adjustment shows that it is feasible for real-time use in real-world settings. Abstract: Visual SLAM is particularly challenging in environments affected by noise, varying lighting conditions, and darkness. Learning-based optical flow algorithms can leverage multiple modalities to address these challenges, but traditional optical flow-based visual SLAM approaches often require significant computational resources. To overcome this limitation, we propose FMF-SLAM, an efficient multimodal fusion SLAM method that utilizes fast Fourier transform (FFT) to enhance the algorithm efficiency. Specifically, we introduce a novel Fourier-based self-attention and cross-attention mechanism to extract features from RGB and depth signals. We further enhance the interaction of multimodal features by incorporating multi-scale knowledge distillation across modalities. We also demonstrate the practical feasibility of FMF-SLAM in real-world scenarios with real-time performance by integrating it into a security robot and fusing it with a global positioning module (GNSS-RTK) and global Bundle Adjustment. Our approach is validated using video sequences from TUM, TartanAir, and our real-world datasets, showcasing state-of-the-art performance under noisy, varying lighting, and dark conditions. Our code and datasets are available at https://github.com/youjie-zhou/FMF-SLAM.git.
[175] Limitations of NERF with pre-trained Vision Features for Few-Shot 3D Reconstruction
Ankit Sanjyal
Main category: cs.CV
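TL;DR: This study finds that NeRF models using DINO features perform worse than a baseline without DINO features in few-shot 3D reconstruction, challenging some common assumptions in the field.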
Details
Motivation: It remains unclear whether integrating pre-trained vision features improves few-shot reconstruction, especially in extreme few-shot scenarios. Method: A systematic evaluation of DINO-enhanced NeRF models, comparing baseline NeRF, frozen DINO features, LoRA fine-tuned features, and multi-scale feature fusion. Result: All DINO variants perform worse than the baseline NeRF, with PSNR values of about 12.9 to 13.0 against the baseline's 14.71. Conclusion: NeRF models that integrate pre-trained vision features (DINO in particular) underperform the baseline NeRF in few-shot 3D reconstruction, suggesting that such features may not suit this task and may introduce harmful biases. Abstract: Neural Radiance Fields (NeRF) have revolutionized 3D scene reconstruction from sparse image collections. Recent work has explored integrating pre-trained vision features, particularly from DINO, to enhance few-shot reconstruction capabilities. However, the effectiveness of such approaches remains unclear, especially in extreme few-shot scenarios. In this paper, we present a systematic evaluation of DINO-enhanced NeRF models, comparing baseline NeRF, frozen DINO features, LoRA fine-tuned features, and multi-scale feature fusion. Surprisingly, our experiments reveal that all DINO variants perform worse than the baseline NeRF, achieving PSNR values around 12.9 to 13.0 compared to the baseline's 14.71. This counterintuitive result suggests that pre-trained vision features may not be beneficial for few-shot 3D reconstruction and may even introduce harmful biases. We analyze potential causes including feature-task mismatch, overfitting to limited data, and integration challenges. Our findings challenge common assumptions in the field and suggest that simpler architectures focusing on geometric consistency may be more effective for few-shot scenarios.
[176] Deep Learning-based Alignment Measurement in Knee Radiographs
Zhisen Hu,Dominic Cullen,Peter Thompson,David Johnson,Chang Bian,Aleksei Tiulpin,Timothy Cootes,Claudia Lindner
Main category: cs.CV
TL;DR: A deep learning-based method automates accurate knee alignment measurement, offering reliable and efficient assessment for improved clinical workflows.
Details
Motivation: Traditional methods for knee alignment measurement are manual, time-consuming, and require long-leg radiographs, prompting the need for an automated and efficient solution. Method: The method uses hourglass networks with an attention gate structure to automatically localize over 100 knee anatomical landmarks and measure knee alignment using the anatomical tibiofemoral angle. Result: The proposed method achieves mean absolute differences of approximately 1 degree compared to clinical ground truth measurements, with excellent agreement pre-operatively (ICC = 0.97) and good agreement post-operatively (ICC = 0.86). Conclusion: This study demonstrates that radiographic knee alignment assessment can be automated with high accuracy using a deep learning-based method, which enhances clinical workflows. Abstract: Radiographic knee alignment (KA) measurement is important for predicting joint health and surgical outcomes after total knee replacement. Traditional methods for KA measurements are manual, time-consuming and require long-leg radiographs. This study proposes a deep learning-based method to measure KA in anteroposterior knee radiographs via automatically localized knee anatomical landmarks. Our method builds on hourglass networks and incorporates an attention gate structure to enhance robustness and focus on key anatomical features. To our knowledge, this is the first deep learning-based method to localize over 100 knee anatomical landmarks to fully outline the knee shape while integrating KA measurements on both pre-operative and post-operative images. It provides highly accurate and reliable anatomical varus/valgus KA measurements using the anatomical tibiofemoral angle, achieving mean absolute differences of ~1° when compared to clinical ground truth measurements. Agreement between automated and clinical measurements was excellent pre-operatively (intra-class correlation coefficient (ICC) = 0.97) and good post-operatively (ICC = 0.86).
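As a rough illustration of the two quantities reported above, the sketch below computes an angle between a femoral axis and a tibial axis from four 2D points and then a mean absolute difference against reference values. The landmark choice, the sign convention, and the function names are assumptions made for this example; the abstract does not specify how the axes are derived from the 100+ landmarks.

```python
import math

def axis_angle_deg(p_start, p_end):
    """Angle of the line p_start -> p_end relative to the image's vertical (y) axis, in degrees."""
    dx = p_end[0] - p_start[0]
    dy = p_end[1] - p_start[1]
    return math.degrees(math.atan2(dx, dy))

def tibiofemoral_angle(femur_prox, femur_dist, tibia_prox, tibia_dist):
    """Signed angle between the femoral and tibial axes (hypothetical sign convention)."""
    return axis_angle_deg(femur_prox, femur_dist) - axis_angle_deg(tibia_prox, tibia_dist)

def mean_absolute_difference(automated, clinical):
    """Average absolute gap between automated and reference angle measurements."""
    return sum(abs(a - c) for a, c in zip(automated, clinical)) / len(automated)

# Toy example: a vertical femoral axis and a slightly tilted tibial axis.
angle = tibiofemoral_angle((0, 0), (0, 10), (0, 10), (1, 20))
mad = mean_absolute_difference([1.0, 3.0], [2.0, 1.0])
```

The agreement statistic (ICC) is not sketched here because several ICC variants exist and the abstract does not say which one was used.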
Our findings demonstrate that KA assessment can be automated with high accuracy, creating opportunities for digitally enhanced clinical workflows.
[177] Shape from Polarization of Thermal Emission and Reflection
Kazuma Kitazawa,Tsuyoshi Takatani
Main category: cs.CV
TL;DR: This paper proposes an improved method for shape estimation of transparent objects using LWIR SfP and develops a new polarization model, achieving high accuracy across various materials.
Details
Motivation: The motivation is to overcome challenges in shape estimation for transparent objects due to their complex light transport by leveraging LWIR SfP where most materials are opaque and emissive. Method: The paper uses Shape from Polarization (SfP) in the Long-Wave Infrared (LWIR) spectrum along with a neural network trained on a synthetic dataset. They also modeled LWIR polarimetric imaging while accounting for systematic errors. Result: The result demonstrates high accuracy and broad applicability of the proposed method in estimating surface normals for materials, including those transparent in the visible spectrum. Conclusion: The paper concludes that by creating a new polarization model and implementing a prototype system, the method achieves high accuracy in shape estimation for transparent objects across various materials. Abstract: Shape estimation for transparent objects is challenging due to their complex light transport. To circumvent these difficulties, we leverage the Shape from Polarization (SfP) technique in the Long-Wave Infrared (LWIR) spectrum, where most materials are opaque and emissive. While a few prior studies have explored LWIR SfP, these attempts suffered from significant errors due to inadequate polarimetric modeling, particularly the neglect of reflection. Addressing this gap, we formulated a polarization model that explicitly accounts for the combined effects of emission and reflection. Based on this model, we estimated surface normals using not only a direct model-based method but also a learning-based approach employing a neural network trained on a physically-grounded synthetic dataset. Furthermore, we modeled the LWIR polarimetric imaging process, accounting for inherent systematic errors to ensure accurate polarimetry. We implemented a prototype system and created ThermoPol, the first real-world benchmark dataset for LWIR SfP. 
Through comprehensive experiments, we demonstrated the high accuracy and broad applicability of our method across various materials, including those transparent in the visible spectrum.
[178] Cross-Architecture Knowledge Distillation (KD) for Retinal Fundus Image Anomaly Detection on NVIDIA Jetson Nano
Berk Yilmaz,Aniruddh Aiyengar
Main category: cs.CV
TL;DR: This paper proposes an AI-driven solution for early detection of retinal diseases in resource-limited settings by compressing a vision transformer model into a lightweight CNN model while maintaining diagnostic accuracy.
Details
Motivation: Early and accurate identification of retinal ailments is crucial to prevent ocular decline, but dependable diagnostic devices are often unavailable in low-resourced settings. This project aims to address this issue by creating an edge-device deployable disease classifier. Method: A lightweight CNN-based student model was developed using cross-architecture knowledge distillation from a pre-trained ViT teacher model. The compression utilized a framework including a Partitioned Cross-Attention (PCA) projector, a Group-Wise Linear (GL) projector, and multi-view robust training to retain diagnostic performance. Result: The teacher model achieved 89 percent classification accuracy and the student model retained roughly 93 percent of the teacher model's diagnostic performance, despite having 97.4 percent fewer parameters. Conclusion: The work presents a scalable, AI-driven triage solution for retinal disorders in under-resourced areas by compressing a high-capacity vision transformer model into a lightweight CNN-based model while retaining diagnostic accuracy. Abstract: Early and accurate identification of retinal ailments is crucial for averting ocular decline; however, access to dependable diagnostic devices is not often available in low-resourced settings. This project proposes to solve that by developing a lightweight, edge-device deployable disease classifier using cross-architecture knowledge distilling. We first train a high-capacity vision transformer (ViT) teacher model, pre-trained using I-JEPA self-supervised learning, to classify fundus images into four classes: Normal, Diabetic Retinopathy, Glaucoma, and Cataract. We kept an Internet of Things (IoT) focus when compressing to a CNN-based student model for deployment in resource-limited conditions, such as the NVIDIA Jetson Nano. 
This was accomplished using a novel framework which included a Partitioned Cross-Attention (PCA) projector, a Group-Wise Linear (GL) projector, and a multi-view robust training method. The teacher model has 97.4 percent more parameters than the student model, with it achieving 89 percent classification with a roughly 93 percent retention of the teacher model's diagnostic performance. The retention of clinical classification behavior supports our method's initial aim: compression of the ViT while retaining accuracy. Our work serves as an example of a scalable, AI-driven triage solution for retinal disorders in under-resourced areas.
[179] Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation
Xunzhi Xiang,Qi Fan
Main category: cs.CV
TL;DR: This paper proposes ADSA, a training-free context optimization method for autoregressive image generation, improving efficiency by reducing memory use and computation without sacrificing quality.
Details
Motivation: Excessively long contexts during inference in autoregressive models lead to significant memory overhead and computational delays, which this work aims to address. Method: ADSA dynamically identifies crucial tokens for texture consistency and semantic coherence, along with a dynamic KV-cache update mechanism tailored for efficient attention computation. Result: Experiments show that ADSA improves resource efficiency by reducing GPU memory consumption by approximately 50% while maintaining or enhancing generation quality. Conclusion: The proposed ADSA method effectively reduces memory overhead and computational delays in autoregressive image generation while maintaining generation quality. Abstract: Autoregressive conditional image generation models have emerged as a dominant paradigm in text-to-image synthesis. These methods typically convert images into one-dimensional token sequences and leverage the self-attention mechanism, which has achieved remarkable success in natural language processing, to capture long-range dependencies, model global context, and ensure semantic coherence. However, excessively long contexts during inference lead to significant memory overhead caused by KV-cache and computational delays. To alleviate these challenges, we systematically analyze how global semantics, spatial layouts, and fine-grained textures are formed during inference, and propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). Conceptually, ADSA dynamically identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention computation. Additionally, we introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately $50\%$. 
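The idea of keeping only some cached tokens can be sketched in a few lines. The example below is not ADSA itself: it is a generic selection rule, assumed for illustration, that keeps a recent local window (a stand-in for texture consistency) plus the older tokens with the highest accumulated attention weight (a stand-in for global semantics). The names `select_tokens`, `local_window`, and `top_k` are hypothetical.

```python
def select_tokens(attn_scores, local_window, top_k):
    """Return sorted indices of cached tokens to keep.

    attn_scores[i] is an assumed accumulated attention weight for cached token i.
    """
    n = len(attn_scores)
    # Always keep the most recent tokens.
    local = set(range(max(0, n - local_window), n))
    # Among older tokens, keep the top_k with the highest scores.
    older = [i for i in range(n) if i not in local]
    older.sort(key=lambda i: attn_scores[i], reverse=True)
    return sorted(local | set(older[:top_k]))

def prune_cache(keys, values, keep):
    """Drop every cached key/value pair whose index is not in keep."""
    return [keys[i] for i in keep], [values[i] for i in keep]

scores = [0.9, 0.1, 0.5, 0.2, 0.3, 0.4]
keep = select_tokens(scores, local_window=2, top_k=2)
keys, values = prune_cache(list("abcdef"), list("ABCDEF"), keep)
```

With six cached tokens and four kept, the cache in this toy case shrinks by a third; the roughly 50% figure in the abstract refers to the authors' own mechanism, not to this sketch.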
Extensive qualitative and quantitative experiments demonstrate the effectiveness and superiority of our approach in terms of both generation quality and resource efficiency.
[180] Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning
Yue Li,Meng Tian,Dechang Zhu,Jiangtong Zhu,Zhenyu Lin,Zhiwei Xiong,Xinhai Zhao
Main category: cs.CV
TL;DR: The paper introduces Drive-R1, a domain-specific vision-language model designed to bridge scenario reasoning and motion planning for autonomous driving, overcoming challenges related to input reliance and reasoning misalignment through supervised fine-tuning and reinforcement learning.
Details
Motivation: Two critical challenges identified in evolving vision-language models (VLMs) for autonomous driving are their reliance on historical input information without genuine understanding of visual inputs and the misalignment of chain-of-thought reasoning with motion planning outcomes. Method: Drive-R1 is developed through supervised fine-tuning on a dataset containing both long and short chain-of-thought (COT) data, followed by training within a reinforcement learning framework to discover informative reasoning paths for planning. Result: Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate that Drive-R1 outperforms existing state-of-the-art VLMs in performance. Conclusion: Drive-R1 successfully bridges the gap between scenario reasoning and motion planning in autonomous driving, presenting a promising direction for future research and applications. Abstract: Large vision-language models (VLMs) for autonomous driving (AD) are evolving beyond perception and cognition tasks toward motion planning. However, we identify two critical challenges in this direction: (1) VLMs tend to learn shortcuts by relying heavily on history input information, achieving seemingly strong planning results without genuinely understanding the visual inputs; and (2) the chain-of-thought (COT) reasoning processes are always misaligned with the motion planning outcomes, and how to effectively leverage the complex reasoning capability to enhance planning remains largely underexplored. In this paper, we start from a small-scale domain-specific VLM and propose Drive-R1, designed to bridge the scenario reasoning and motion planning for AD. Drive-R1 first undergoes supervised fine-tuning on an elaborate dataset containing both long and short COT data. Drive-R1 is encouraged to reason step-by-step from visual input to final planning decisions.
Subsequently, Drive-R1 is trained within a reinforcement learning framework that incentivizes the discovery of reasoning paths that are more informative for planning, guided by rewards based on predicted trajectories and meta actions. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate that Drive-R1 achieves superior performance compared to existing state-of-the-art VLMs. We believe that Drive-R1 presents a promising direction for bridging reasoning and planning in AD, offering methodological insights for future research and applications.
[181] Referring Expression Instance Retrieval and A Strong End-to-End Baseline
Xiangzhao Hao,Kuan Zhu,Hongyu Guo,Haiyun Guo,Ming Tang,JinQiao Wang
Main category: cs.CV
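TL;DR: This paper proposes REIR, a new vision-language task that jointly supports instance-level retrieval and localization, builds the large-scale benchmark REIRCOCO, and presents the CLARE method, which achieves strong performance.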
Details
Motivation: Real-world scenarios often need instance-level retrieval and localization at the same time, but traditional TIR lacks precision and REC lacks scalability; the REIR task is proposed to fill this gap. Method: The paper introduces the new task REIR (Referring Expression Instance Retrieval) and builds the large-scale benchmark REIRCOCO. It also proposes CLARE, which uses a dual-stream architecture with a MORE module to capture inter-instance relationships and combines object detection, REC pretraining, and Contrastive Language-Instance Alignment (CLIA) for end-to-end optimization. Result: CLARE achieves state-of-the-art performance on REIR and generalizes well to TIR and REC. Conclusion: CLARE reaches state-of-the-art performance on REIR and also generalizes well to TIR and REC, underscoring its effectiveness and versatility. Abstract: Natural language querying of visual content underpins many vision-language tasks, typically categorized by text granularity and visual search scope. Text-Image Retrieval (TIR) retrieves whole images using coarse descriptions, while Referring Expression Comprehension (REC) localizes objects using fine-grained expressions within a single image. However, real-world scenarios often require both instance-level retrieval and localization across large galleries -- tasks where TIR lacks precision and REC lacks scalability. To address this gap, we propose a new task: Referring Expression Instance Retrieval (REIR), which jointly supports instance-level retrieval and localization. We introduce REIRCOCO, a large-scale benchmark constructed by prompting vision-language models to generate fine-grained expressions for MSCOCO and RefCOCO instances. We also present a baseline method, CLARE, featuring a dual-stream architecture with a Mix of Relation Experts (MORE) module for capturing inter-instance relationships. CLARE integrates object detection and REC pretraining with Contrastive Language-Instance Alignment (CLIA) for end-to-end optimization. Experiments show that CLARE achieves state-of-the-art performance on REIR and generalizes well to TIR and REC, highlighting its effectiveness and versatility.
[182] Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability
Jongoh Jeong,Hunmin Yang,Jaeseok Jeong,Kuk-Jin Yoon
Main category: cs.CV
TL;DR: This paper introduces a new adversarial attack method that improves transferability by utilizing semantic features within generative models through a Mean Teacher framework and feature distillation.
Details
Motivation: Existing generative adversarial attacks underutilize the semantic information present in the intermediate activations of generators, which limits their effectiveness. Method: A Mean Teacher framework with feature distillation is used to ensure semantic consistency in perturbation generation. Result: Experiments show consistent improvements in adversarial transferability across various models, domains, and tasks compared to state-of-the-art methods. Conclusion: The proposed semantic structure-aware attack framework enhances adversarial transferability by leveraging semantic features from intermediate activations of a generator. Abstract: Generative adversarial attacks train a perturbation generator on a white-box surrogate model and subsequently apply the crafted perturbations to unseen black-box victim models. In contrast to iterative attacks, these methods deliver superior inference-time efficiency, scalability, and transferability; however, up until now, existing studies have not fully exploited the representational capacity of generative models to preserve and harness semantic information. Specifically, the intermediate activations of the generator encode rich semantic features--object boundaries and coarse shapes--that remain under-exploited, thereby limiting the alignment of perturbations with object-salient regions which are critical for adversarial transferability. To remedy this, we introduce a semantic structure-aware attack framework based on the Mean Teacher, which serves as a temporally smoothed feature reference. With this smoothed reference, we further direct semantic consistency between the early-layer activations in the student and those of the semantically rich teacher by feature distillation. 
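The "temporally smoothed feature reference" mentioned above follows from how a Mean Teacher is usually maintained: the teacher's weights are an exponential moving average (EMA) of the student's weights. A minimal sketch of that update is given below, with parameters held in plain dictionaries. The `decay` value and the dictionary layout are assumptions for illustration, and the feature distillation between student and teacher activations is not shown.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter toward the student's value by an EMA step (in place)."""
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher

# Toy example with scalar "parameters".
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is what lets it act as a stable reference while the student generator is still being trained.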
By anchoring perturbation synthesis to the semantically salient early intermediate blocks within the generator based on empirical findings, our method guides progressive adversarial perturbation on regions that substantially enhance adversarial transferability. We conduct extensive experiments over diverse models, domains and tasks to demonstrate consistent improvements relative to state-of-the-art generative attacks, comprehensively evaluated using conventional metrics and our newly proposed Accidental Correction Rate (ACR).
[183] Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain
Rui Su,Dong Xu,Luping Zhou,Wanli Ouyang
Main category: cs.CV
TL;DR: This paper presents a two-stage approach for weakly supervised temporal action localization, using multi-resolution temporal information to generate and refine frame-level pseudo labels, thereby improving localization accuracy.
Details
Motivation: The motivation is to address the challenge of weakly supervised temporal action localization where only video-level annotations are available during training. The goal is to generate reliable frame-level pseudo labels and improve localization performance by exploiting appearance and motion streams across multiple temporal resolutions. Method: The method involves two stages: In the first stage, the Initial Label Generation (ILG) module generates high-quality class activation sequences (CASs) by leveraging temporal multi-resolution consistency. In the second stage, the Progressive Temporal Label Refinement (PTLR) framework uses two networks (Network-OTS and Network-RTS) to iteratively refine pseudo labels at different temporal scales, enhancing each stream using refined labels from the other. Result: The proposed method successfully generates high-quality frame-level pseudo labels and achieves improved performance in temporal action localization by utilizing multi-resolution temporal information through the ILG and PTLR modules. Conclusion: The paper proposes a two-stage approach for weakly supervised temporal action localization, which effectively exploits multi-resolution temporal information and improves action localization performance through the Initial Label Generation (ILG) module and the Progressive Temporal Label Refinement (PTLR) framework. Abstract: Weakly supervised temporal action localization is a challenging task as only the video-level annotation is available during the training process. To address this problem, we propose a two-stage approach to fully exploit multi-resolution information in the temporal domain and generate high quality frame-level pseudo labels based on both appearance and motion streams. 
Specifically, in the first stage, we generate reliable initial frame-level pseudo labels, and in the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks and better predict action class scores at each frame. We fully exploit temporal information at multiple scales to improve temporal action localization performance. Specifically, in order to obtain reliable initial frame-level pseudo labels, in the first stage, we propose an Initial Label Generation (ILG) module, which leverages temporal multi-resolution consistency to generate high quality class activation sequences (CASs), which consist of a number of sequences with each sequence measuring how likely each video frame belongs to one specific action class. In the second stage, we propose a Progressive Temporal Label Refinement (PTLR) framework. In our PTLR framework, two networks called Network-OTS and Network-RTS, which are respectively used to generate CASs for the original temporal scale and the reduced temporal scales, are used as two streams (i.e., the OTS stream and the RTS stream) to refine the pseudo labels in turn. In this way, the multi-resolution information in the temporal domain is exchanged at the pseudo label level, and our work can help improve each stream (i.e., the OTS/RTS stream) by exploiting the refined pseudo labels from the other stream (i.e., the RTS/OTS stream).
[184] YouTube-Occ: Learning Indoor 3D Semantic Occupancy Prediction from YouTube Videos
Haoming Chen,Lichen Yuan,TianFang Sun,Jingyu Gong,Xin Tan,Zhizhong Zhang,Yuan Xie
Main category: cs.CV
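TL;DR: The paper proposes a 3D semantic occupancy prediction method trained on indoor Internet data that needs no precise geometric relationships or camera parameters, using a self-supervised model that leverages 2D knowledge to achieve strong 3D indoor perception.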
Details
Motivation: In complex indoor environments, collecting large-scale, finely annotated data is impractical, so a more practical approach is needed. Method: A fully self-supervised model uses 2D prior knowledge for 3D semantic occupancy prediction and distills 2D region-level knowledge into the occupancy network. Result: Experiments show state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet). Conclusion: The paper presents a new approach to 3D indoor perception that requires no prior camera parameters and demonstrates the effectiveness of YouTube-Occ, a dataset consisting only of indoor Internet data. Abstract: 3D semantic occupancy prediction in the past was considered to require precise geometric relationships in order to enable effective training. However, in complex indoor environments, the large-scale and widespread collection of data, along with the necessity for fine-grained annotations, becomes impractical due to the complexity of data acquisition setups and privacy concerns. In this paper, we demonstrate that 3D spatially-accurate training can be achieved using only indoor Internet data, without the need for any pre-knowledge of intrinsic or extrinsic camera parameters. In our framework, we collect a web dataset, YouTube-Occ, which comprises house tour videos from YouTube, providing abundant real house scenes for 3D representation learning. Building on this web dataset, we establish a fully self-supervised model to leverage accessible 2D prior knowledge for reaching powerful 3D indoor perception. Specifically, we harness the advantages of the prosperous vision foundation models, distilling the 2D region-level knowledge into the occupancy network by grouping the similar pixels into superpixels. Experimental results show that our method achieves state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet).
[185] ThermalLoc: A Vision Transformer-Based Approach for Robust Thermal Camera Relocalization in Large-Scale Environments
Yu Liu,Yangtao Meng,Xianfei Pan,Jie Jiang,Changhao Chen
Main category: cs.CV
TL;DR: This paper introduces ThermalLoc, an effective deep learning method for thermal image relocalization that combines EfficientNet and Transformers.
Details
Motivation: Traditional visual relocalization methods are not applicable to thermal images, and deep learning approaches tailored for thermal cameras remain underexplored. Method: ThermalLoc integrates EfficientNet with Transformers to extract features and uses MLP networks for pose regression. Result: ThermalLoc achieves superior accuracy and robustness compared to existing methods like AtLoc, MapNet, PoseNet, and RobustLoc. Conclusion: ThermalLoc is a novel end-to-end deep learning method that outperforms existing methods for thermal camera relocalization. Abstract: Thermal cameras capture environmental data through heat emission, a fundamentally different mechanism compared to visible light cameras, which rely on pinhole imaging. As a result, traditional visual relocalization methods designed for visible light images are not directly applicable to thermal images. Despite significant advancements in deep learning for camera relocalization, approaches specifically tailored for thermal camera-based relocalization remain underexplored. To address this gap, we introduce ThermalLoc, a novel end-to-end deep learning method for thermal image relocalization. ThermalLoc effectively extracts both local and global features from thermal images by integrating EfficientNet with Transformers, and performs absolute pose regression using two MLP networks. We evaluated ThermalLoc on both the publicly available thermal-odometry dataset and our own dataset. The results demonstrate that ThermalLoc outperforms existing representative methods employed for thermal camera relocalization, including AtLoc, MapNet, PoseNet, and RobustLoc, achieving superior accuracy and robustness.
[186] Adaptive Mask-guided K-space Diffusion for Accelerated MRI Reconstruction
Qinrong Cai,Yu Guan,Zhibo Chen,Dong Liang,Qiuyun Fan,Qiegen Liu
Main category: cs.CV
TL;DR: This paper proposes an adaptive mask-based diffusion model (AMDM) for MRI reconstruction, which leverages k-space frequency distribution to enhance image quality by capturing frequency-specific information.
Details
Motivation: Previous MRI reconstruction strategies typically optimized the entire image domain or k-space without considering the significance of various frequency regions. The motivation behind this work is to improve MRI reconstruction quality by focusing on frequency-specific information. Method: The study introduces a hybrid mask mechanism that adapts to different k-space inputs, allowing for the separation of high-frequency and low-frequency components. This method guides a closed-loop diffusion process informed by k-space frequency distribution. Result: Experimental results demonstrate that the AMDM method can learn specific frequency information, leading to improved MRI reconstruction quality and providing a flexible framework for future optimization of k-space data using masks. Conclusion: This paper concludes that the proposed diffusion model based on adaptive masks (AMDM) successfully enhances MRI reconstruction by effectively utilizing k-space frequency distribution to generate adaptive masks. Abstract: As the deep learning revolution marches on, masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training, and has demonstrated exceptional performance in multiple fields. Magnetic Resonance Imaging (MRI) reconstruction is a critical task in medical imaging that seeks to recover high-quality images from under-sampled k-space data. However, previous MRI reconstruction strategies usually optimized the entire image domain or k-space, without considering the importance of different frequency regions in the k-space. This work introduces a diffusion model based on adaptive masks (AMDM), which utilizes the adaptive adjustment of frequency distribution based on k-space data to develop a hybrid mask mechanism that adapts to different k-space inputs.
This enables the effective separation of high-frequency and low-frequency components, producing diverse frequency-specific representations. Additionally, the k-space frequency distribution informs the generation of adaptive masks, which, in turn, guide a closed-loop diffusion process. Experimental results verified the ability of this method to learn specific frequency information and thereby improved the quality of MRI reconstruction, providing a flexible framework for optimizing k-space data using masks in the future.
[187] ReFrame: Rectification Framework for Image Explaining Architectures
Debjyoti Das Adhikary,Aritra Hazra,Partha Pratim Chakrabarti
Main category: cs.CV
TL;DR: This paper proposes an interpretable framework that enhances image explanation methods by reducing hallucinations and improving object recognition completeness, showing significant quantitative improvements.
Details
Motivation: Existing approaches for image explanation often hallucinate objects or miss identifying all objects in an image, prompting the need for a more reliable and complete solution. Method: An interpretable framework was developed that can be integrated atop existing image explanation methods to rectify inconsistencies and incompleteness by improving object recognition. Result: Quantitative results showed significant improvements in completeness and inconsistency metrics across different frameworks, surpassing the state-of-the-art by a substantial margin. Conclusion: The proposed framework significantly improves the consistency and completeness of image explanations across multiple frameworks like Image Captioning, VQA, and Prompt-based AI using LLMs. Abstract: Image explanation has been one of the key research interests in the Deep Learning field. Throughout the years, several approaches have been adopted to explain an input image fed by the user. From detecting an object in a given image to explaining it in a human-understandable sentence, to having a conversation describing the image, this problem has seen an immense change throughout the years. However, the existing works have often been found to (a) hallucinate objects that do not exist in the image and/or (b) fail to identify the complete set of objects present in the image. In this paper, we propose a novel approach to mitigate these drawbacks of inconsistency and incompleteness of the objects recognized during the image explanation. To enable this, we propose an interpretable framework that can be plugged atop diverse image explaining frameworks including Image Captioning, Visual Question Answering (VQA) and Prompt-based AI using LLMs, thereby enhancing their explanation capabilities by rectifying the incorrect or missing objects.
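The two failure modes named above, hallucinated objects and missed objects, can be scored with simple set arithmetic. The sketch below uses assumed definitions: completeness as the share of objects in the image that the explanation mentions, and inconsistency as the share of mentioned objects that are not in the image. The paper's own metric definitions are not given in the abstract and may differ.

```python
def completeness(mentioned, present):
    """Share of objects present in the image that the explanation mentions."""
    present = set(present)
    if not present:
        return 1.0
    return len(set(mentioned) & present) / len(present)

def inconsistency(mentioned, present):
    """Share of mentioned objects that are not in the image (hallucinations)."""
    mentioned = set(mentioned)
    if not mentioned:
        return 0.0
    return len(mentioned - set(present)) / len(mentioned)

# Toy example: the caption names a car that is not in the image
# and misses the person and the tree.
mentioned = ["dog", "frisbee", "car"]
present = ["dog", "frisbee", "person", "tree"]
c = completeness(mentioned, present)
h = inconsistency(mentioned, present)
```

Under these definitions a rectification step helps when it raises completeness and lowers inconsistency for the same image.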
We further measure the efficacy of the rectified explanations generated through our proposed approaches leveraging object based precision metrics, and showcase the improvements in the inconsistency and completeness of image explanations. Quantitatively, the proposed framework is able to improve the explanations over the baseline architectures of Image Captioning (improving the completeness by 81.81% and inconsistency by 37.10%), Visual Question Answering (average of 9.6% and 37.10% in completeness and inconsistency respectively) and Prompt-based AI model (0.01% and 5.2% for completeness and inconsistency respectively) surpassing the current state-of-the-art by a substantial margin.
[188] Open Set Recognition for Endoscopic Image Classification: A Deep Learning Approach on the Kvasir Dataset
Kasra Moazzami,Seoyoun Son,John Lin,Sun Min Lee,Daniel Son,Hayeon Lee,Jeongho Lee,Seongji Lee
Main category: cs.CV
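TL;DR: This study explores the use of Open Set Recognition (OSR) techniques in medical image analysis to improve the reliability and applicability of endoscopic image classification models when they encounter unknown pathological conditions.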
Details
Motivation: Conventional closed-set classification frameworks are limited in open-world clinical settings, because unknown pathological conditions can undermine model reliability. Method: OpenMax is adopted as the baseline OSR method to evaluate the OSR capabilities of several deep learning architectures (ResNet-50, Swin Transformer, and a hybrid ResNet-Transformer model) under both closed-set and open-set conditions. Result: The study is among the first to apply open set recognition to the Kvasir dataset and provides a foundational benchmark for evaluating OSR performance in medical image analysis. Conclusion: The study highlights the importance of safely deploying AI systems in clinical settings and indicates that OSR techniques are essential for real-world applications in endoscopy. Abstract: Endoscopic image classification plays a pivotal role in medical diagnostics by identifying anatomical landmarks and pathological findings. However, conventional closed-set classification frameworks are inherently limited in open-world clinical settings, where previously unseen conditions can arise and compromise model reliability. To address this, we explore the application of Open Set Recognition (OSR) techniques on the Kvasir dataset, a publicly available and diverse endoscopic image collection. In this study, we evaluate and compare the OSR capabilities of several representative deep learning architectures, including ResNet-50, Swin Transformer, and a hybrid ResNet-Transformer model, under both closed-set and open-set conditions. OpenMax is adopted as a baseline OSR method to assess the ability of these models to distinguish known classes from previously unseen categories. This work represents one of the first efforts to apply open set recognition to the Kvasir dataset and provides a foundational benchmark for evaluating OSR performance in medical image analysis. Our results offer practical insights into model behavior in clinically realistic settings and highlight the importance of OSR techniques for the safe deployment of AI systems in endoscopy.
[189] Selective Social-Interaction via Individual Importance for Fast Human Trajectory Prediction
Yota Urano,Hiromu Taketsugu,Norimichi Ukita
Main category: cs.CV
TL;DR: This paper proposes a method to predict a person's trajectory by focusing on important neighbors using an Importance Estimator and Gumbel Softmax, improving efficiency without compromising accuracy.
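The Gumbel Softmax mentioned in this entry can be illustrated with a minimal, standard-library sketch. It is a generic version of the relaxation, not the authors' Importance Estimator: the logits stand in for hypothetical per-neighbor importance scores, and the temperature `tau` and the clamping constant are assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed (soft) one-hot sample over `logits`.

    Adding Gumbel(0, 1) noise and applying a temperature-scaled softmax
    gives a differentiable surrogate for sampling a discrete choice.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so that log(0) can never occur.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / tau)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical importance scores for three neighboring people.
rng = random.Random(0)
weights = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng)
```

Lower values of `tau` push the output closer to a hard one-hot selection, while higher values give smoother weights.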
Details
Motivation: To improve the efficiency and accuracy of trajectory prediction by selecting relevant neighboring individuals rather than considering all surrounding people. Method: We introduced an Importance Estimator module and used Gumbel Softmax to enable training while avoiding issues with non-differentiable operations. Result: Experiments on the JRDB dataset demonstrated that the method achieves competitive prediction accuracy while significantly speeding up the process. Conclusion: The proposed architecture effectively selects important neighboring people for predicting the primary person's trajectory, achieving speed improvements without sacrificing prediction accuracy. Abstract: This paper presents an architecture for selecting important neighboring people to predict the primary person's trajectory. To achieve effective neighboring people selection, we propose a people selection module called the Importance Estimator which outputs the importance of each neighboring person for predicting the primary person's future trajectory. To prevent gradients from being blocked by non-differentiable operations when sampling surrounding people based on their importance, we employ the Gumbel Softmax for training. Experiments conducted on the JRDB dataset show that our method speeds up the process with competitive prediction accuracy.
[190] Rapeseed population point cloud completion network (RP-PCN) with dynamic graph convolution for 3D reconstruction of crop canopy occlusion architecture
Ziyue Guo,Xin Yang,Yutao Shen,Yang Zhu,Lixi Jiang,Haiyan Cen
Main category: cs.CV
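TL;DR: This study develops RP-PCN, a new framework for completing 3D point clouds of field rapeseed populations. It generates complete point cloud data by combining a virtual-real integration simulation with an occlusion point detection algorithm, improving crop yield prediction accuracy and offering a reference for canopy architecture analysis in other crops.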
Details
Motivation: To overcome the effects of severe occlusion and complex architectures on accurate descriptions of crop canopy structure, which are needed to evaluate crop photosynthesis and yield and to guide ideotype design. Method: A point cloud completion pipeline is proposed that combines the virtual-real integration (VRI) simulation method with an occlusion point detection algorithm, and the RP-PCN network is designed with a multi-resolution dynamic graph convolutional encoder (MRDG) and a point pyramid decoder (PPD). Result: RP-PCN achieved Chamfer distance (CD) values of 3.35 cm, 3.46 cm, 4.32 cm, and 4.51 cm at the seedling, bolting, flowering, and silique stages, respectively; ablation studies showed that the MRDG and DGCFE modules reduced CD values by 10% and 23%, respectively; and the SEI improved yield prediction accuracy by 11.2%. Conclusion: The RP-PCN pipeline provides a framework for point cloud completion of rapeseed populations, has the potential to be extended to other crops, and can significantly enhance the analysis of population canopy architectures in the field. Abstract: Quantitative descriptions of complete canopy architecture are crucial for evaluating crop photosynthesis and yield to guide ideotype design. Although three-dimensional (3D) sensing technologies have been developed for plant and canopy reconstruction, severe occlusion and complex architectures hinder accurate canopy descriptions. In this study, we propose a point cloud completion model for 3D reconstruction of rapeseed populations from seedling to silique stages using multi-view imaging. A complete point cloud generation framework was developed with the virtual-real integration (VRI) simulation method and occlusion point detection algorithm to annotate the training dataset by distinguishing surface from occluded points. The rapeseed population point cloud completion network (RP-PCN) was designed with a multi-resolution dynamic graph convolutional encoder (MRDG) and point pyramid decoder (PPD) to predict occluded points based on input surface point clouds. A dynamic graph convolutional feature extractor (DGCFE) was introduced to capture structural variations across the growth period. The effectiveness of point cloud completion was validated by predicting yield using architectural indicators from complete point clouds of rapeseed population. The results demonstrated that RP-PCN achieved chamfer distance (CD) values of 3.35 cm, 3.46 cm, 4.32 cm, and 4.51 cm at the seedling, bolting, flowering, and silique stages, respectively. Ablation studies showed the effectiveness of the MRDG and DGCFE modules, reducing CD values by 10% and 23%, respectively. 
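For readers unfamiliar with the metric reported above, the following is a minimal sketch of a symmetric Chamfer distance between two point sets. The abstract does not state which variant the authors use, so the choices here (unsquared Euclidean distances, averaged per direction and then summed) are assumptions, and the toy coordinates are invented for illustration.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    For each point, take the distance to its nearest neighbor in the other
    set, average over the set, and sum the two directions.
    """
    def mean_nearest(source, target):
        return sum(
            min(math.dist(p, q) for q in target) for p in source
        ) / len(source)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Toy example: a predicted cloud versus a reference cloud (same units).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
cd = chamfer_distance(predicted, reference)  # 0.5 + 1.0 = 1.5
```

This brute-force version is quadratic in the number of points; real point clouds would normally use a spatial index such as a k-d tree.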
The silique efficiency index (SEI) from RP-PCN improved yield prediction accuracy by 11.2% compared to incomplete point clouds. The RP-PCN pipeline proposed in this study has the potential to be extended to other crops, significantly enhancing the analysis of population canopy architectures in field environments.
[191] Attention-Based Ensemble Learning for Crop Classification Using Landsat 8-9 Fusion
Zeeshan Ramzan,Nisar Ahmed,Qurat-ul-Ain Akram,Shahzad Asif,Muhammad Shahbaz,Rabin Chakrabortty,Ahmed F. Elaksher
Main category: cs.CV
TL;DR: This paper explores the use of remote sensing and advanced modeling techniques to improve crop classification accuracy in irrigated agricultural regions, specifically focusing on Central Punjab.
Details
Motivation: Accurate information on total cropped area and crop types is crucial for effective agricultural management, particularly in irrigated areas of Central Punjab. Method: The study used field surveys and Landsat 8-9 imagery to create a labeled dataset. Pre-processing steps included radiometric calibration, atmospheric correction, and georeferencing verification. Image fusion techniques were applied to enhance spectral information, and vegetation indices like NDVI, SAVO, RECI, and NDRE were extracted. Classification models using conventional classifiers, ensemble learning, and artificial neural networks were developed, incorporating feature selection for optimal performance. Result: The study successfully created a comprehensive dataset of 50,835 data points and improved crop classification accuracy through the integration of remote sensing data and advanced modeling techniques. Conclusion: The study concludes that combining remote sensing data with advanced modeling techniques improves crop classification accuracy in irrigated agricultural regions. Abstract: Remote sensing offers a highly effective method for obtaining accurate information on total cropped area and crop types. The study focuses on crop cover identification for irrigated regions of Central Punjab. Data collection was executed in two stages: the first involved identifying and geocoding six target crops through field surveys conducted in January and February 2023. The second stage involved acquiring Landsat 8-9 imagery for each geocoded field to construct a labelled dataset. The satellite imagery underwent extensive pre-processing, including radiometric calibration for reflectance values, atmospheric correction, and georeferencing verification to ensure consistency within a common coordinate system. Subsequently, image fusion techniques were applied to combine Landsat 8 and 9 spectral bands, creating a composite image with enhanced spectral information, followed by contrast enhancement. 
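Of the vegetation indices named in this entry, NDVI has a simple closed form: the difference between near-infrared and red reflectance divided by their sum. The sketch below shows that standard formula only. The reflectance values are invented, and returning 0.0 when both bands are zero is an assumption made here to avoid division by zero, not something the paper specifies.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from two reflectance values.

    On Landsat 8-9 OLI, band 5 is near-infrared and band 4 is red.
    """
    denominator = nir + red
    if denominator == 0:
        # Assumed convention for empty pixels; not taken from the paper.
        return 0.0
    return (nir - red) / denominator


# Hypothetical surface reflectance for a densely vegetated pixel.
vegetated = ndvi(nir=0.45, red=0.05)   # (0.45 - 0.05) / (0.45 + 0.05) = 0.8
bare_soil = ndvi(nir=0.25, red=0.20)   # much closer to zero
```

Values near 1 indicate dense green vegetation, which is why such indices are useful features alongside raw reflectance when separating crop types.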
During data acquisition, farmers were interviewed, and fields were meticulously mapped using GPS instruments, resulting in a comprehensive dataset of 50,835 data points. This dataset facilitated the extraction of vegetation indices such as NDVI, SAVO, RECI, and NDRE. These indices and raw reflectance values were utilized for classification modeling using conventional classifiers, ensemble learning, and artificial neural networks. A feature selection approach was also incorporated to identify the optimal feature set for classification learning. This study demonstrates the effectiveness of combining remote sensing data and advanced modeling techniques to improve crop classification accuracy in irrigated agricultural regions.
[192] Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?
Yiwei Yang,Chung Peng Lee,Shangbin Feng,Dora Zhao,Bingbing Wen,Anthony Z. Liu,Yulia Tsvetkov,Bill Howe
Main category: cs.CV
TL;DR: This paper presents SpuriVerse, a benchmark for identifying spurious correlations in LVLMs, showing that fine-tuning on synthetic examples can significantly improve model performance.
Details
Motivation: Finetuning can create spurious correlations between non-essential features and target labels, which is problematic in multi-modal Large Vision Language Models (LVLMs). Method: Developed a benchmark called SpuriVerse with 124 distinct types of spurious correlations from real-world datasets. Evaluated 15 LVLMs and tested improvement through fine-tuning on synthetic examples. Result: State-of-the-art closed-source models struggled on the benchmark, achieving at best only 37.1% accuracy. Fine-tuning improved performance to 78.40%. Conclusion: The study concludes that training on diverse spurious patterns can improve model performance, suggesting models learn to avoid shortcuts and focus on overall image context. Abstract: Finetuning can cause spurious correlations to arise between non-essential features and the target labels, but benchmarks to study these effects involve contrived settings and narrow tasks. In contrast, we consider spurious correlations in multi-modal Large Vision Language Models (LVLMs) pretrained on extensive and diverse datasets without explicit task supervision. We develop a benchmark by sourcing GPT-4o errors on real-world visual-question-answering (VQA) benchmarks, then curating a subset through LVLM-human annotation and synthetic counterfactual evaluation to identify errors caused by spurious correlations. This process yields SpuriVerse, a novel benchmark comprised of 124 distinct types of spurious correlations extracted from real-world datasets, each containing 1 realistic and 10 synthetic VQA samples for a total of 1364 multiple choice questions. We evaluate 15 open and closed-source LVLMs on SpuriVerse, finding that even state-of-the-art closed-source models struggle significantly, achieving at best only 37.1% accuracy. 
Fine-tuning on synthetic examples that emphasize the spurious correlation improves performance to 78.40%, suggesting that training on diverse spurious patterns generalizes to unseen situations: models appear to learn to avoid "shortcuts" and attend to the overall image context.
[193] A Multi-Scale Spatial Attention-Based Zero-Shot Learning Framework for Low-Light Image Enhancement
Muhammad Azeem Aslam,Hassan Khalid,Nisar Ahmed
Main category: cs.CV
TL;DR: This paper introduces LucentVisionNet, a zero-shot learning framework for low-light image enhancement that combines multi-scale spatial attention and a deep curve estimation network for improved performance.
Details
Motivation: The study aims to address the limitations of traditional and deep learning-based methods for low-light image enhancement, particularly in the absence of paired training data. Method: LucentVisionNet integrates multi-scale spatial attention with a deep curve estimation network and uses a recurrent enhancement strategy optimized by a composite loss function. Result: Extensive experiments show that LucentVisionNet outperforms state-of-the-art methods across various image quality metrics on both paired and unpaired datasets. Conclusion: LucentVisionNet is a promising framework for low-light image enhancement, offering high visual quality, structural consistency, and computational efficiency suitable for real-world applications. Abstract: Low-light image enhancement remains a challenging task, particularly in the absence of paired training data. In this study, we present LucentVisionNet, a novel zero-shot learning framework that addresses the limitations of traditional and deep learning-based enhancement methods. The proposed approach integrates multi-scale spatial attention with a deep curve estimation network, enabling fine-grained enhancement while preserving semantic and perceptual fidelity. To further improve generalization, we adopt a recurrent enhancement strategy and optimize the model using a composite loss function comprising six tailored components, including a novel no-reference image quality loss inspired by human visual perception. Extensive experiments on both paired and unpaired benchmark datasets demonstrate that LucentVisionNet consistently outperforms state-of-the-art supervised, unsupervised, and zero-shot methods across multiple full-reference and no-reference image quality metrics. 
Our framework achieves high visual quality, structural consistency, and computational efficiency, making it well-suited for deployment in real-world applications such as mobile photography, surveillance, and autonomous navigation.
[194] NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation
Yu Xie,Chengjie Zeng,Lingyun Zhang,Yanwei Fu
Main category: cs.CV
TL;DR: This paper proposes PromptSan, a method for sanitizing harmful prompts to text-to-image models, with two variants that balance safety and usability.
Details
Motivation: The rapid advancement of text-to-image (T2I) models has improved their ability to synthesize images from text prompts, but it also raises risks of misuse, including the generation of harmful content, which contradicts the ethical goals of T2I technology and hinders its sustainable development. Method: PromptSan has two variants, PromptSan-Modify and PromptSan-Suffix. The former uses a text NSFW classifier during inference to identify and replace harmful tokens in the input prompt; the latter trains an optimized suffix token sequence to neutralize harmful intent. Result: Extensive experiments show that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics. Conclusion: PromptSan is a new approach that effectively balances safety and usability, sanitizing harmful prompts without altering the model architecture or degrading generation capability. Abstract: The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts. However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development. Inspired by "jailbreak" attacks in large language models, which bypass restrictions through subtle prompt modifications, this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), a novel approach to detoxify harmful prompts without altering model architecture or degrading generation capability. PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and PromptSan-Suffix, which trains an optimized suffix token sequence to neutralize harmful intent while passing both text and image NSFW classifier checks. Extensive experiments demonstrate that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics, effectively balancing safety and usability.
[195] Geometry-Aware Preference Learning for 3D Texture Generation
AmirHossein Zamani,Tianhao Xie,Amir G. Aghdam,Tiberiu Popa,Eugene Belilovsky
Main category: cs.CV
TL;DR: The paper proposes a new framework for 3D content creation that better aligns with human preferences and understands 3D structure, enhancing the quality and control of generated outputs.
Details
Motivation: To improve the alignment of 3D generative models with human preferences and task-specific criteria, addressing the lack of inherent 3D structure understanding in current methods. Method: They introduced an end-to-end differentiable preference learning framework that integrates human preferences through reward functions into the 3D generative pipeline. Result: The framework was shown to be effective using four novel geometry-aware reward functions, providing a pathway for creating high-quality 3D content from natural language inputs. Conclusion: The paper concludes that their proposed framework enhances the generation of 3D content by incorporating human preferences and understanding of geometry, leading to more controllable and interpretable results. Abstract: Recent advances in 3D generative models have achieved impressive results but 3D contents generated by these models may not align with subjective human preferences or task-specific criteria. Moreover, a core challenge in the 3D texture generation domain remains: most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To address this, we propose an end-to-end differentiable preference learning framework that back-propagates human preferences, represented by differentiable reward functions, through the entire 3D generative pipeline, making the process inherently geometry-aware. We demonstrate the effectiveness of our framework using four proposed novel geometry-aware reward functions, offering a more controllable and interpretable pathway for high-quality 3D content creation from natural language.
[196] Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention
Saad Wazir,Daeyoung Kim
Main category: cs.CV
TL;DR: A new decoder architecture improves medical image segmentation by effectively integrating multi-scale features from encoders, leading to better performance than current state-of-the-art methods.
Details
Motivation: The motivation stems from the limitations of current Transformer and CNN-based methods in handling variations in staining and morphology in medical images, especially when dealing with limited dataset sizes. Method: The method involves a novel decoder design that integrates multi-scale features from pre-trained encoders, emphasizing important channels and regions to improve segmentation accuracy. Result: Experiments on four datasets (MoNuSeg, DSB, Electron Microscopy, TNBC) showed that the proposed method outperforms existing state-of-the-art methods with absolute performance gains ranging from 2.76% to 4.03%. Conclusion: The proposed architecture for medical image segmentation demonstrates superior performance over existing state-of-the-art methods across multiple datasets, indicating its effectiveness in capturing multi-scale features and enhancing segmentation accuracy. Abstract: Segmenting biomarkers in medical images is crucial for various biotech applications. Despite advances, Transformer and CNN based methods often struggle with variations in staining and morphology, limiting feature extraction. In medical image segmentation, where datasets often have limited sample availability, recent state-of-the-art (SOTA) methods achieve higher accuracy by leveraging pre-trained encoders, whereas end-to-end methods tend to underperform. This is due to challenges in effectively transferring rich multiscale features from encoders to decoders, as well as limitations in decoder efficiency. To address these issues, we propose an architecture that captures multi-scale local and global contextual information and a novel decoder design, which effectively integrates features from the encoder, emphasizes important channels and regions, and reconstructs spatial dimensions to enhance segmentation accuracy. Our method, compatible with various encoders, outperforms SOTA methods, as demonstrated by experiments on four datasets and ablation studies. 
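The entry title names depth-to-space restoration as the way the decoder reconstructs spatial dimensions. The abstract gives no implementation details, so the following is only a generic sketch of the depth-to-space (pixel shuffle) operation on nested lists. The channel ordering follows the common pixel-shuffle convention, which is an assumption here, and the tiny feature map is invented.

```python
def depth_to_space(features, block):
    """Rearrange a (C * block * block, H, W) nested list into (C, H * block, W * block).

    Each group of block * block input channels is spread over a
    block x block spatial neighborhood of one output channel.
    """
    channels_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    if channels_in % (block * block) != 0:
        raise ValueError("channel count must be divisible by block * block")
    channels_out = channels_in // (block * block)
    output = [
        [[0] * (width * block) for _ in range(height * block)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for i in range(block):
            for j in range(block):
                source = features[c * block * block + i * block + j]
                for h in range(height):
                    for w in range(width):
                        output[c][h * block + i][w * block + j] = source[h][w]
    return output


# Four 1x1 channels become one 2x2 channel.
low_res = [[[1]], [[2]], [[3]], [[4]]]
high_res = depth_to_space(low_res, block=2)  # [[[1, 2], [3, 4]]]
```

Because the operation only moves values around, it upsamples without interpolation or learned parameters; any learning happens in the layers that produce the input channels.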
Specifically, our method achieves absolute performance gains of 2.76% on MoNuSeg, 3.12% on DSB, 2.87% on Electron Microscopy, and 4.03% on TNBC datasets compared to existing SOTA methods. Code: https://github.com/saadwazir/MCADS-Decoder
[197] BSMamba: Brightness and Semantic Modeling for Long-Range Interaction in Low-Light Image Enhancement
Tongshun Zhang,Pingping Liu,Mengen Cai,Zijian Zhang,Yubing Lu,Qiuzhan Zhou
Main category: cs.CV
TL;DR: This paper proposes BSMamba, a novel visual Mamba architecture for low-light image enhancement, combining Brightness Mamba and Semantic Mamba to improve brightness and preserve semantic consistency more effectively than existing methods.
Details
Motivation: Current LLIE methods struggle with balancing brightness improvement, semantic consistency, fine detail preservation, and computational efficiency. Existing visual Mamba approaches are limited by fixed scanning rules that hinder capturing long-range dependencies. Method: BSMamba introduces two components: Brightness Mamba for brightness restoration using brightness-guided selective attention and Semantic Mamba for maintaining contextual consistency through semantically related token interactions. Result: BSMamba achieves superior performance in low-light image enhancement tasks while effectively preserving the semantic structure of images. Conclusion: BSMamba provides state-of-the-art performance in low-light image enhancement while preserving semantic consistency, overcoming limitations of existing methods. Abstract: Current low-light image enhancement (LLIE) methods face significant limitations in simultaneously improving brightness while preserving semantic consistency, fine details, and computational efficiency. With the emergence of state-space models, particularly Mamba, image restoration has achieved remarkable performance, yet existing visual Mamba approaches flatten 2D images into 1D token sequences using fixed scanning rules, critically limiting interactions between distant tokens with causal relationships and constraining their ability to capture meaningful long-range dependencies. To address these fundamental limitations, we propose BSMamba, a novel visual Mamba architecture comprising two specially designed components: Brightness Mamba and Semantic Mamba. The Brightness Mamba revolutionizes token interaction patterns by prioritizing connections between distant tokens with similar brightness levels, effectively addressing the challenge of brightness restoration in LLIE tasks through brightness-guided selective attention. 
Complementing this, the Semantic Mamba establishes priority interactions between tokens sharing similar semantic meanings, allowing the model to maintain contextual consistency by connecting semantically related regions across the image, thus preserving the hierarchical nature of image semantics during enhancement. By intelligently modeling tokens based on brightness and semantic similarity rather than arbitrary scanning patterns, BSMamba transcends the constraints of conventional token sequencing while adhering to the principles of causal modeling. Extensive experiments demonstrate that BSMamba achieves state-of-the-art performance in LLIE while preserving semantic consistency.
[198] Spatial frequency information fusion network for few-shot learning
Wenqing Zhao,Guojia Xie,Han Pan,Biao Yang,Weichuan Zhang
Main category: cs.CV
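TL;DR: This paper proposes SFIFNet, a new few-shot learning method that combines frequency domain and spatial domain information to improve classification performance.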
Details
Motivation: Many current few-shot classification models focus on spatial domain information while neglecting frequency domain information, which contains more feature information; this prevents the model from fully exploiting feature information and hurts classification performance. Method: SFIFNet is proposed, which uses an innovative data preprocessing method to combine frequency domain information with spatial domain information and so improve the accuracy of image feature representation. Result: Experimental results demonstrate the effectiveness of the method in improving classification performance. Conclusion: Combining frequency domain and spatial domain information can improve few-shot classification performance. Abstract: The objective of Few-shot learning is to fully leverage the limited data resources for exploring the latent correlations within the data by applying algorithms and training a model with outstanding performance that can adequately meet the demands of practical applications. In practical applications, the number of images in each category is usually less than that in traditional deep learning, which can lead to over-fitting and poor generalization performance. Currently, many Few-shot classification models pay more attention to spatial domain information while neglecting frequency domain information, which contains more feature information. Ignoring frequency domain information will prevent the model from fully exploiting feature information, which would affect the classification performance. Based on conventional data augmentation, this paper proposes an SFIFNet with innovative data preprocessing. The key of this method is enhancing the accuracy of image feature representation by integrating frequency domain information with spatial domain information. The experimental results demonstrate the effectiveness of this method in enhancing classification performance.
[199] Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection
Anja Delić,Matej Grcić,Siniša Šegvić
Main category: cs.CV
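TL;DR: This paper proposes SeeKer, a method for detecting anomalous human behavior from skeleton sequences, which surpasses previous methods on UBnormal and MSAD-HR and is competitive on ShanghaiTech.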
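The abstract of this entry describes the anomaly score as a weighted sum of per-keypoint log-conditionals under conditional Gaussians, with weights given by keypoint detector confidence. A rough sketch of that scoring step might look as follows. It is not the authors' implementation: the isotropic 2D Gaussian, the negation (so that a higher score means more anomalous), and all input values are assumptions, and the predicted means and standard deviations stand in for the output of a hypothetical autoregressive model.

```python
import math


def gaussian_log_density(point, mean, std):
    """Log density of an isotropic Gaussian (assumed form) at `point`."""
    dims = len(point)
    squared_error = sum((p - m) ** 2 for p, m in zip(point, mean))
    return (
        -0.5 * squared_error / std ** 2
        - dims * math.log(std)
        - 0.5 * dims * math.log(2.0 * math.pi)
    )


def anomaly_score(keypoints, predicted_means, predicted_stds, confidences):
    """Negated confidence-weighted sum of per-keypoint log-conditionals."""
    weighted_log_density = sum(
        weight * gaussian_log_density(point, mean, std)
        for point, mean, std, weight in zip(
            keypoints, predicted_means, predicted_stds, confidences
        )
    )
    return -weighted_log_density


# Hypothetical model predictions for a two-keypoint skeleton.
predicted_means = [(0.0, 0.0), (1.0, 1.0)]
predicted_stds = [0.1, 0.1]
confidences = [0.9, 0.6]

expected_pose = [(0.0, 0.0), (1.0, 1.0)]
surprising_pose = [(0.0, 0.0), (2.0, 2.5)]
normal_score = anomaly_score(expected_pose, predicted_means, predicted_stds, confidences)
odd_score = anomaly_score(surprising_pose, predicted_means, predicted_stds, confidences)
```

A pose that lands far from the predicted keypoint locations receives a low density and therefore a higher score, matching the idea that a skeleton is flagged when it surprises the model.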
Details
Motivation: Detecting anomalous human behavior is important for safety-critical applications such as healthcare monitoring, workplace safety, and public surveillance, and such anomalies are often reflected in unusual human poses. Method: SeeKer autoregressively factorizes the skeleton sequence density at the keypoint level, models the joint distribution through causal prediction of conditional Gaussians, and uses a weighted sum of per-keypoint log-conditionals as the anomaly score. Result: SeeKer performs strongly on multiple datasets, surpassing existing methods on UBnormal and MSAD-HR in particular. Conclusion: As a new method for anomalous behavior detection, SeeKer outperforms all previous methods on the UBnormal and MSAD-HR datasets and performs competitively on the ShanghaiTech dataset. Abstract: Detecting anomalous human behaviour is an important visual task in safety-critical applications such as healthcare monitoring, workplace safety, or public surveillance. In these contexts, abnormalities are often reflected in unusual human poses. Thus, we propose SeeKer, a method for detecting anomalies in sequences of human skeletons. Our method formulates the skeleton sequence density through autoregressive factorization at the keypoint level. The corresponding conditional distributions represent probable keypoint locations given prior skeletal motion. We formulate the joint distribution of the considered skeleton as causal prediction of conditional Gaussians across its constituent keypoints. A skeleton is flagged as anomalous if its keypoint locations surprise our model (i.e. receive a low density). In practice, our anomaly score is a weighted sum of per-keypoint log-conditionals, where the weights account for the confidence of the underlying keypoint detector. Despite its conceptual simplicity, SeeKer surpasses all previous methods on the UBnormal and MSAD-HR datasets while delivering competitive performance on the ShanghaiTech dataset.
[200] RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
Yeongtak Oh,Jisoo Mok,Dohyun Chung,Juhyeon Shin,Sangha Park,Johan Barthelemy,Sungroh Yoon
Main category: cs.CV
TL;DR: This paper proposes an RL-based post-training framework to improve the ability of MLLMs to generate personalized image captions, especially for complex multi-concept images.
Details
Motivation: Current MLLM personalization methods struggle with generating faithful descriptions in real-world scenarios like multi-concept image captioning, and acquiring large-scale, high-quality captions is costly and difficult. Method: A reinforcement learning (RL)-based post-training framework is proposed to overcome the limitations of existing SFT-based methods. Result: The proposed method significantly enhances both visual recognition and personalized generation capabilities of MLLMs and outperforms existing SFT-based baselines. Conclusion: The proposed RL-based post-training framework improves the performance of MLLMs in personalized image captioning, particularly for multi-concept image captioning. Abstract: Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task.
[201] OpenEvents V1: Large-Scale Benchmark Dataset for Multimodal Event Grounding
Hieu Nguyen,Phuc-Tan Nguyen,Thien-Phuc Tran,Minh-Quang Nguyen,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
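TL;DR: OpenEvents V1 is a large-scale benchmark dataset that aims to advance event-centric vision-language understanding and support reasoning over complex events.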
Details
Motivation: Conventional image captioning and retrieval datasets emphasize surface-level descriptions, whereas OpenEvents V1 focuses on contextual and temporal grounding. Method: A large-scale dataset is built around two primary tasks: generating rich, event-aware image captions and retrieving event-relevant images from narrative-style textual queries. Result: The dataset contains over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning multiple domains and time periods. Conclusion: OpenEvents V1 lays a solid foundation for developing multimodal models capable of deep reasoning over complex real-world events, and provides baseline results and standardized evaluation protocols. Abstract: We introduce OpenEvents V1, a large-scale benchmark dataset aimed at advancing event-centric vision-language understanding. Unlike conventional image captioning and retrieval datasets that emphasize surface-level descriptions, OpenEvents V1 focuses on contextual and temporal grounding through two primary tasks: (1) generating rich, event-aware image captions and (2) retrieving event-relevant images based on narrative-style textual queries. The dataset contains over 200,000 news articles and 400,000 associated images sourced from CNN and The Guardian, spanning diverse domains and time periods. We provide extensive baseline results and standardized evaluation protocols for both tasks. OpenEvents V1 establishes a robust foundation for developing multimodal models capable of deep reasoning over complex real-world events. The dataset is available at https://ltnghia.github.io/eventa/openevents-v1
[202] InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models
Nianchen Deng,Lixin Gu,Shenglong Ye,Yinan He,Zhe Chen,Songze Li,Haomin Wang,Xingguang Wei,Tianshuo Yang,Min Dou,Tong He,Wenqi Shao,Kaipeng Zhang,Yi Wang,Botian Shi,Yanting Zhang,Jifeng Dai,Yu Qiao,Hongjie Zhang,Wenhai Wang
Main category: cs.CV
TL;DR: This paper introduces InternSpatial, the largest open-source dataset for spatial reasoning in vision-language models, along with InternSpatial-Bench, a new evaluation benchmark that improves model performance and expands multi-view reasoning through novel tasks.
Details
Motivation: Existing datasets for spatial reasoning in vision-language models are limited in scale, diversity, and instruction expressiveness, necessitating more comprehensive and open resources to advance the field. Method: The authors introduced InternSpatial, a large-scale open-source dataset with 12 million QA pairs across single-view and multi-view settings, and InternSpatial-Bench, a benchmark that includes new rotation angle prediction tasks. The dataset supports 19 instruction formats and diverse visual environments. Result: Models trained on InternSpatial showed a 12.1% improvement on InternSpatial-Bench and a 10.7% improvement on VSI-Bench, demonstrating enhanced spatial reasoning performance. Conclusion: The proposed InternSpatial dataset and InternSpatial-Bench benchmark significantly enhance spatial reasoning capabilities in vision-language models, showing notable improvements in evaluation metrics while maintaining strong general performance. Abstract: Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain limited in scale, visual diversity, and instruction expressiveness. In this work, we introduce InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, along with InternSpatial-Bench, a corresponding evaluation benchmark designed to assess spatial understanding under diverse instruction formats. InternSpatial comprises 12 million QA pairs spanning both single-view and multi-view settings, drawn from diverse visual environments and supporting 19 instruction formats that reflect varied query styles. For evaluation, we propose InternSpatial-Bench for single-view tasks and expand multi-view reasoning by introducing a novel rotation angle prediction task that has not been explored in prior work. 
Experimental results show that models trained on InternSpatial achieve 12.1% improvement on InternSpatial-Bench and 10.7% on VSI-Bench, while maintaining strong performance on general-purpose benchmarks. We hope these resources will support the development of spatially capable VLMs in practical applications such as robotics and embodied AI.
[203] Distributed Poisson multi-Bernoulli filtering via generalised covariance intersection
Ángel F. García-Fernández,Giorgio Battistelli
Main category: cs.CV
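TL;DR: This paper proposes a new distributed multi-object filtering method that addresses the intractability of exact fusion of PMB densities and demonstrates the benefits of the approach.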
Details
Motivation: Because the exact GCI fusion of two PMB densities is intractable, a tractable approximation is needed. Method: A distributed Poisson multi-Bernoulli filter based on the generalised covariance intersection fusion rule is proposed, and a principled approximation is derived to handle the intractable exact fusion. Result: By approximating the power of a PMB density as an unnormalised PMB density, the GCI fusion rule can be applied, and the result is shown to be a Poisson multi-Bernoulli mixture (PMBM) that can be expressed in closed form. Conclusion: Experimental results show that the method performs better than other distributed multi-object filters. Abstract: This paper presents the distributed Poisson multi-Bernoulli (PMB) filter based on the generalised covariance intersection (GCI) fusion rule for distributed multi-object filtering. Since the exact GCI fusion of two PMB densities is intractable, we derive a principled approximation. Specifically, we approximate the power of a PMB density as an unnormalised PMB density, which corresponds to an upper bound of the PMB density. Then, the GCI fusion rule corresponds to the normalised product of two unnormalised PMB densities. We show that the result is a Poisson multi-Bernoulli mixture (PMBM), which can be expressed in closed form. Future prediction and update steps in each filter preserve the PMBM form, which can be projected back to a PMB density before the next fusion step. Experimental results show the benefits of this approach compared to other distributed multi-object filters.
[204] Latent Space Analysis for Melanoma Prevention
Ciro Listone,Aniello Murano
Main category: cs.CV
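TL;DR: This study uses a Conditional Variational Autoencoder to build an interpretable AI model for skin lesion classification; by learning a structured latent space and combining it with an SVM, it distinguishes benign from malignant lesions effectively and offers an intuitive risk assessment tool for clinical decision-making.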
Details
Motivation: Melanoma is aggressive and has a high mortality rate, so it calls for early, interpretable diagnostic tools; existing skin lesion classification models mostly provide binary outputs and offer limited clinical insight. Method: A Conditional Variational Autoencoder is introduced to learn a structured latent space that captures semantic relationships among lesions, and a support vector machine (SVM) is trained on this representation to distinguish benign nevi from melanomas. Result: The proposed latent space enables a nuanced, continuous assessment of morphological differences and supports visual and geometric interpretation. The SVM shows strong and consistent performance on this representation. Conclusion: The study proposes a new approach based on a Conditional Variational Autoencoder that not only distinguishes benign nevi from melanomas effectively but also supports visual and geometric interpretation of malignancy risk, bridging predictive performance and clinical applicability. Abstract: Melanoma represents a critical health risk due to its aggressive progression and high mortality, underscoring the need for early, interpretable diagnostic tools. While deep learning has advanced in skin lesion classification, most existing models provide only binary outputs, offering limited clinical insight. This work introduces a novel approach that extends beyond classification, enabling interpretable risk modelling through a Conditional Variational Autoencoder. The proposed method learns a structured latent space that captures semantic relationships among lesions, allowing for a nuanced, continuous assessment of morphological differences. An SVM is also trained on this representation effectively differentiating between benign nevi and melanomas, demonstrating strong and consistent performance. More importantly, the learned latent space supports visual and geometric interpretation of malignancy, with the spatial proximity of a lesion to known melanomas serving as a meaningful indicator of risk. This approach bridges predictive performance with clinical applicability, fostering early detection, highlighting ambiguous cases, and enhancing trust in AI-assisted diagnosis through transparent and interpretable decision-making.
[205] Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging
Filippo Ruffini,Elena Mulero Ayllon,Linlin Shen,Paolo Soda,Valerio Guarrasi
Main category: cs.CV
TL;DR: This paper introduces a benchmark for evaluating AI models' ability to predict clinical outcomes in COVID-19 patients using Chest X-ray data, focusing on their adaptability under data-limited and imbalanced conditions.
Details
Motivation: Effective application of AI for prognosis prediction in medical imaging remains challenging, particularly under conditions of data scarcity and class imbalance. The work aims to bridge this gap by systematically evaluating model adaptability and generalizability in realistic clinical scenarios. Method: A structured benchmark was designed to evaluate and compare the transferability of Convolutional Neural Networks and Foundation Models using diverse Chest X-ray datasets. A wide range of fine-tuning strategies such as Full Fine-Tuning, Linear Probing, Low-Rank Adaptation, BitFit, VeRA, and IA3 were tested across full-data and Few-Shot Learning scenarios. Result: Through large-scale comparative analysis, the study offers detailed insights into how well different pretrained models (e.g., CLIP, DINOv2, MedCLIP) and fine-tuning methods adapt to prognosis tasks, especially when dealing with limited or imbalanced data. Conclusion: The benchmark provides critical insights into the effectiveness of various fine-tuning strategies and pretrained models in handling prognosis tasks under challenging conditions like data scarcity and class imbalance, aiming to guide the practical deployment of AI solutions in clinical settings. Abstract: Artificial Intelligence (AI) holds significant promise for improving prognosis prediction in medical imaging, yet its effective application remains challenging. In this work, we introduce a structured benchmark explicitly designed to evaluate and compare the transferability of Convolutional Neural Networks and Foundation Models in predicting clinical outcomes in COVID-19 patients, leveraging diverse publicly available Chest X-ray datasets. Our experimental methodology extensively explores a wide set of fine-tuning strategies, encompassing traditional approaches such as Full Fine-Tuning and Linear Probing, as well as advanced Parameter-Efficient Fine-Tuning methods including Low-Rank Adaptation, BitFit, VeRA, and IA3. 
The evaluations were conducted across multiple learning paradigms, including both extensive full-data scenarios and more clinically realistic Few-Shot Learning settings, which are critical for modeling rare disease outcomes and rapidly emerging health threats. By implementing a large-scale comparative analysis involving a diverse selection of pretrained models, including general-purpose architectures pretrained on large-scale datasets such as CLIP and DINOv2, to biomedical-specific models like MedCLIP, BioMedCLIP, and PubMedCLIP, we rigorously assess each model's capacity to effectively adapt and generalize to prognosis tasks, particularly under conditions of severe data scarcity and pronounced class imbalance. The benchmark was designed to capture critical conditions common in prognosis tasks, including variations in dataset size and class distribution, providing detailed insights into the strengths and limitations of each fine-tuning strategy. This extensive and structured evaluation aims to inform the practical deployment and adoption of robust, efficient, and generalizable AI-driven solutions in real-world clinical prognosis prediction workflows.
[206] Frequency-Domain Fusion Transformer for Image Inpainting
Sijin He,Guangfeng Lin,Tao Li,Yajun Chen
Main category: cs.CV
TL;DR: This paper proposes a Transformer-based image inpainting approach that incorporates frequency-domain fusion techniques to better preserve high-frequency details and reduce computational costs.
Details
Motivation: Traditional methods struggle with complex textures and large occlusions, while Transformer-based approaches fail to preserve high-frequency details and suffer from high computational costs. Method: A Transformer-based image inpainting method incorporating frequency-domain fusion is proposed, using an attention mechanism with wavelet transform and Gabor filtering, along with a learnable frequency-domain filter based on fast Fourier transform. Result: Experimental results show that the proposed method effectively enhances image inpainting performance in terms of preserving high-frequency details. Conclusion: The proposed method improves the quality of image inpainting by preserving more high-frequency information. Abstract: Image inpainting plays a vital role in restoring missing image regions and supporting high-level vision tasks, but traditional methods struggle with complex textures and large occlusions. Although Transformer-based approaches have demonstrated strong global modeling capabilities, they often fail to preserve high-frequency details due to the low-pass nature of self-attention and suffer from high computational costs. To address these challenges, this paper proposes a Transformer-based image inpainting method incorporating frequency-domain fusion. Specifically, an attention mechanism combining wavelet transform and Gabor filtering is introduced to enhance multi-scale structural modeling and detail preservation. Additionally, a learnable frequency-domain filter based on the fast Fourier transform is designed to replace the feedforward network, enabling adaptive noise suppression and detail retention. The model adopts a four-level encoder-decoder structure and is guided by a novel loss strategy to balance global semantics and fine details. 
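As a rough illustration of the frequency-domain filtering idea mentioned in the abstract above (transform, reweight each frequency, transform back), the following is a minimal 1-D sketch in plain Python. It is not the authors' implementation: the function names `dft`, `idft`, and `frequency_filter` are hypothetical, a naive DFT stands in for a fast Fourier transform, and the fixed `weights` list stands in for parameters that a real model would learn.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def frequency_filter(x, weights):
    """Scale each frequency bin of x by the matching weight, then invert.

    `weights` is a stand-in for learnable per-frequency parameters (an
    assumption for illustration only). The real part is returned, which is
    exact when the weights are conjugate-symmetric.
    """
    spectrum = dft(x)
    filtered = [w * s for w, s in zip(weights, spectrum)]
    return [v.real for v in idft(filtered)]

signal = [1.0, 2.0, 4.0, 3.0, 0.0, -1.0, 2.0, 5.0]

# All-ones weights leave the signal unchanged (up to rounding).
identity_out = frequency_filter(signal, [1.0] * len(signal))

# Keeping only the zero-frequency bin returns the signal mean everywhere.
dc_only_out = frequency_filter(signal, [1.0] + [0.0] * (len(signal) - 1))
```

Suppressing high-frequency bins smooths the signal and boosting them sharpens detail, which is the kind of trade-off a learned filter could adapt per channel.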
Experimental results demonstrate that the proposed method effectively improves the quality of image inpainting by preserving more high-frequency information.
[207] CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing
Dinh-Khoi Vo,Thanh-Toan Do,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
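TL;DR: This paper proposes a new method called CPAM that handles complex image editing tasks more effectively by modifying the attention mechanism and introducing mask-guidance strategies.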
Details
Motivation: Existing text-to-image diffusion models face many challenges in natural image editing, particularly in preserving textures and identity consistency and in handling complex non-rigid objects. Method: A preservation adaptation module and a localized extraction module are proposed; the self-attention mechanism is adjusted to preserve and independently control the object and background, and several mask-guidance strategies are introduced. Result: Extensive experiments on the IMBA dataset show that the proposed method outperforms existing state-of-the-art editing techniques in human ratings. Conclusion: The paper proposes CPAM, a new zero-shot framework for complicated, non-rigid real image editing, and validates its advantage on the newly constructed IMBA dataset. Abstract: Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques.
[208] DIP: Unsupervised Dense In-Context Post-training of Visual Representations
Sophia Sirko-Galouchenko,Spyros Gidaris,Antonin Vobecky,Andrei Bursuc,Nicolas Thome
Main category: cs.CV
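TL;DR: DIP is a novel unsupervised post-training method inspired by meta-learning principles that improves the dense representations of vision encoders by using pseudo-tasks to simulate in-context scenarios.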
Details
Motivation: To avoid the high complexity of existing self-distillation approaches, a new unsupervised post-training method is proposed to enhance dense image representations. Method: The encoder is trained with pseudo-tasks that simulate downstream in-context scenarios, and in-context tasks are generated automatically by combining a pretrained diffusion model with the vision encoder itself. Result: The method requires less than 9 hours on a single A100 GPU and achieves strong performance across a variety of real-world in-context scene understanding tasks, outperforming both the initial vision encoder and prior methods. Conclusion: DIP is a simple, unsupervised, and computationally efficient post-training method that improves the dense representations of large-scale pretrained vision encoders for in-context scene understanding. Abstract: We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP
[209] AViLA: Asynchronous Vision-Language Agent for Streaming Multimodal Data Interaction
Gengyuan Zhang,Tanveer Hannan,Hermine Kleiner,Beste Aydemir,Xinyu Xie,Jian Lan,Thomas Seidl,Volker Tresp,Jindong Gu
Main category: cs.CV
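TL;DR: This paper proposes a diagnostic benchmark for evaluating how well multimodal large language models (MLLMs) handle interaction with streaming data, and introduces AViLA, an asynchronous video-language agent that can handle ad-hoc queries and give time-aware responses.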
Details
Motivation: When an agent interacts with the world as a dynamic data stream, the knowledge supporting a query usually appears asynchronously with the query's arrival time, so the agent must ground its responses in historical data, present observations, and even future streams. This creates a challenge termed Query-Evidence Asynchrony. Method: Three key modules are designed: comprehensive memory retention, evidence identification, and an evidence-grounded trigger, intended to maintain a general-purpose memory and respond to queries in a timely manner. Result: Experiments show that existing models often fail to respond at appropriate times, while AViLA significantly improves both accuracy and temporal awareness. Conclusion: The study offers a new approach for multimodal large language models to better handle asynchronous queries and evidence in streaming data. Abstract: An ideal vision-language agent serves as a bridge between the human users and their surrounding physical world in real-world applications like autonomous driving and embodied agents, and proactively provides accurate and timely responses given user intents. An intriguing challenge arises when agents interact with the world as a dynamic data stream and ad-hoc queries from users: supporting knowledge for queries, namely evidence, usually appears asynchronously with the arrival time of queries, and agents need to ground their responses in historical data, present observations, and even future streams. We frame this challenge as Query-Evidence Asynchrony, where user queries and their supporting evidence typically arrive asynchronously in the streaming setting. This setting requires not only strong reasoning capabilities but also the ability to retain past observations and respond to queries with temporal awareness. In this paper, we introduce a diagnostic benchmark that evaluates Multimodal Large Language Models (MLLMs) on their ability to handle interaction with streaming data. Further, we present AViLA, Asynchronous Video-Language Agent for streaming data interaction that can handle ad-hoc queries and give time-aware responses. For this purpose, AViLA consists of three key modules: comprehensive memory retention, evidence identification, and evidence-grounded trigger, that are designed to maintain a general-purpose memory and respond readily and timely to queries. Our experiments show that existing models often fail to respond at appropriate times, while AViLA significantly improves both accuracy and temporal awareness. 
Our code and dataset will be publicly available.
[210] Context Consistency Learning via Sentence Removal for Semi-Supervised Video Paragraph Grounding
Yaokun Zhong,Siyu Jiang,Jian Zhu,Jian-Fang Hu
Main category: cs.CV
TL;DR: This paper introduces a new Context Consistency Learning (CCL) framework for Semi-Supervised Video Paragraph Grounding, combining consistency regularization and pseudo-labeling to enhance learning, showing significant performance improvements over existing techniques.
Details
Motivation: Existing SSVPG methods overlook the importance of perturbing query contexts to generate strong supervisory signals. Method: CCL unifies consistency regularization and pseudo-labeling; it uses teacher-student learning with augmented samples and retrains the model using mutual agreement for label confidence. Result: Extensive experiments demonstrate that CCL achieves superior performance compared to current approaches. Conclusion: The proposed Context Consistency Learning (CCL) framework significantly outperforms existing methods in Semi-Supervised Video Paragraph Grounding. Abstract: Semi-Supervised Video Paragraph Grounding (SSVPG) aims to localize multiple sentences in a paragraph from an untrimmed video with limited temporal annotations. Existing methods focus on teacher-student consistency learning and video-level contrastive loss, but they overlook the importance of perturbing query contexts to generate strong supervisory signals. In this work, we propose a novel Context Consistency Learning (CCL) framework that unifies the paradigms of consistency regularization and pseudo-labeling to enhance semi-supervised learning. Specifically, we first conduct teacher-student learning where the student model takes as inputs strongly-augmented samples with sentences removed and is enforced to learn from the adequately strong supervisory signals from the teacher model. Afterward, we conduct model retraining based on the generated pseudo labels, where the mutual agreement between the original and augmented views' predictions is utilized as the label confidence. Extensive experiments show that CCL outperforms existing methods by a large margin.
[211] GANs vs. Diffusion Models for virtual staining with the HER2match dataset
Pascal Klöckner,José Teixeira,Diana Montezuma,Jaime S. Cardoso,Hugo M. Horlings,Sara P. Oliveira
Main category: cs.CV
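TL;DR: This study introduces the HER2match dataset to advance research on H&E-HER2 staining transfer and evaluates different model frameworks, finding that GANs and a novel BBDM model perform best.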
Details
Motivation: Progress on H&E-HER2 staining transfer has been limited by the lack of sufficient public datasets, and it remains unclear which model frameworks suit this task best. Method: HER2match, the first publicly available dataset of breast cancer tissue sections stained with both H&E and HER2, is introduced; several GANs and DMs are compared, and a novel Brownian Bridge Diffusion Model (BBDM) is implemented for H&E-HER2 translation. Result: GANs perform better than DMs overall, with only the BBDM achieving comparable results; all models trained on HER2match produced vastly improved visuals compared with those trained on the BCI dataset. Conclusion: The study finds that GANs generally outperform DMs on H&E-HER2 translation, that the BBDM achieves comparable results, and that data alignment is important. Abstract: Virtual staining is a promising technique that uses deep generative models to recreate histological stains, providing a faster and more cost-effective alternative to traditional tissue chemical staining. Specifically for H&E-HER2 staining transfer, despite a rising trend in publications, the lack of sufficient public datasets has hindered progress in the topic. Additionally, it is currently unclear which model frameworks perform best for this particular task. In this paper, we introduce the HER2match dataset, the first publicly available dataset with the same breast cancer tissue sections stained with both H&E and HER2. Furthermore, we compare the performance of several Generative Adversarial Networks (GANs) and Diffusion Models (DMs), and implement a novel Brownian Bridge Diffusion Model for H&E-HER2 translation. Our findings indicate that, overall, GANs perform better than DMs, with only the BBDM achieving comparable results. Furthermore, we emphasize the importance of data alignment, as all models trained on HER2match produced vastly improved visuals compared to the widely used consecutive-slide BCI dataset. This research provides a new high-quality dataset ([available upon publication acceptance]), improving both model training and evaluation. In addition, our comparison of frameworks offers valuable guidance for researchers working on the topic.
[212] ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation
Trong-Vu Hoang,Quang-Binh Nguyen,Thanh-Toan Do,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
TL;DR: ShowFlow presents an effective solution for controllable image synthesis by addressing identity preservation and prompt alignment in both single- and multi-concept generation scenarios.
Details
Motivation: Customizing image generation to maintain identity preservation and prompt alignment remains challenging, especially in multi-concept scenarios where additional conditions like layout boxes or masks are not used. Method: The paper proposes ShowFlow-S for single-concept generation using a KronA-WED adapter and disentangled learning with attention regularization. For multi-concept generation, ShowFlow-M reuses ShowFlow-S models and introduces SAMA and layout consistency strategies as plug-and-play modules. Result: Extensive experiments and user studies demonstrate the effectiveness of ShowFlow, showing its potential for real-world applications such as advertising and virtual dressing. Conclusion: ShowFlow is a comprehensive framework that effectively addresses the challenges of customizing both single-concept and multi-concept image generation while maintaining identity preservation and prompt alignment. Abstract: Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and employs a disentangled learning approach with a novel attention regularization objective to enhance single-concept generation. 
Building on this foundation, ShowFlow-M directly reuses the learned models from ShowFlow-S to support multi-concept generation without extra conditions, incorporating a Subject-Adaptive Matching Attention (SAMA) and a layout consistency strategy as the plug-and-play module. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing.
[213] Biased Teacher, Balanced Student
Seonghak Kim
Main category: cs.CV
TL;DR: This paper proposes Long-Tailed Knowledge Distillation (LTKD), a novel framework tailored for class-imbalanced scenarios, addressing teacher bias by decomposing the KD objective into inter-group and intra-group KL divergence components and introducing rebalancing techniques.
Details
Motivation: Conventional Knowledge Distillation (KD) suffers significantly when applied to long-tailed data distributions because the teacher model tends to be biased toward head classes, providing limited supervision for tail classes. Method: The paper introduces Long-Tailed Knowledge Distillation (LTKD), which reformulates the standard KD objective into inter-group and intra-group KL divergence components. It addresses teacher bias through a rebalanced inter-group loss and a uniform intra-group loss. Result: Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods, achieving significant gains in both overall accuracy and tail-class performance. Conclusion: LTKD enables effective knowledge transfer even from biased teachers, making it a strong candidate for real-world deployment in resource-constrained and imbalanced settings. Abstract: Knowledge Distillation (KD) is a widely adopted model compression technique where a compact student model learns from the output of a larger, pre-trained teacher. While effective in balanced settings, conventional KD suffers significantly when applied to long-tailed data distributions, as the teacher model tends to be biased toward head classes and provides limited supervision for tail classes. In this paper, we propose Long-Tailed Knowledge Distillation (LTKD), a novel framework tailored for class-imbalanced scenarios. We begin by reformulating the standard KD objective into two components: inter-group and intra-group Kullback-Leibler (KL) divergence, corresponding to the prediction distributions across and within class groups (head, medium, tail), respectively. This decomposition allows us to identify and quantify the sources of teacher bias. 
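The inter-group and intra-group split described above follows from the chain rule for KL divergence, which holds for any partition of the classes. The sketch below checks that identity numerically in plain Python. It illustrates the general identity only and is not the paper's exact loss: the helper names `kl` and `grouped_kl`, the toy teacher and student distributions, and the head/medium/tail grouping are all assumptions made for the example.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def grouped_kl(p, q, groups):
    """Split KL(p || q) into an inter-group and an intra-group term.

    groups: list of index lists that partition the classes,
    e.g. head / medium / tail.
    """
    # Total probability mass each distribution assigns to each group.
    pg = [sum(p[i] for i in g) for g in groups]
    qg = [sum(q[i] for i in g) for g in groups]
    inter = kl(pg, qg)
    # KL between within-group conditionals, weighted by the p-side group mass.
    intra = 0.0
    for g, pm, qm in zip(groups, pg, qg):
        p_in = [p[i] / pm for i in g]
        q_in = [q[i] / qm for i in g]
        intra += pm * kl(p_in, q_in)
    return inter, intra

# Toy 5-class example (hypothetical numbers).
teacher = [0.50, 0.20, 0.15, 0.10, 0.05]
student = [0.30, 0.25, 0.20, 0.15, 0.10]
groups = [[0, 1], [2, 3], [4]]  # head, medium, tail

inter, intra = grouped_kl(teacher, student, groups)
total = kl(teacher, student)
```

In this standard form each group's intra-group term is weighted by the teacher's group probability, so a teacher biased toward head classes contributes little supervision for tail groups.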
To address them, we introduce (1) a rebalanced inter-group loss that calibrates the teacher's group-level predictions and (2) a uniform intra-group loss that ensures equal contribution from all groups during distillation. Extensive experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods, achieving significant gains in both overall accuracy and tail-class performance. Our results demonstrate that LTKD enables effective knowledge transfer even from biased teachers, making it a strong candidate for real-world deployment in resource-constrained and imbalanced settings.
[214] Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey
Xinyao Li,Jingjing Li,Fengling Li,Lei Zhu,Yang Yang,Heng Tao Shen
Main category: cs.CV
TL;DR: This survey reviews vision-language models' generalization approaches, including prompt-based, parameter-based, and feature-based methods, and compares their performance while exploring relationships with multimodal large language models.
Details
Motivation: Despite their zero-shot capabilities, VLMs often struggle with domain-specific or specialized generalization tasks. This survey aims to summarize strategies for transferring knowledge from VLMs to downstream applications. Method: This paper reviews existing literature on vision-language models (VLMs), categorizing them into prompt-based, parameter-based, and feature-based methods. It also compares performance across popular benchmarks and discusses connections with multimodal large language models. Result: The paper provides a comprehensive overview of VLM generalization settings, methodologies, and benchmark results. It introduces novel interpretations of transfer learning in the context of VLMs and highlights differences between VLMs and multimodal large language models like DeepSeek-VL. Conclusion: The survey contributes to understanding the current and future directions of vision-language models by systematically reviewing recent literature, methodologies, and benchmarks in the field. Abstract: Recently, vision-language pretraining has emerged as a transformative technique that integrates the strengths of both visual and textual modalities, resulting in powerful vision-language models (VLMs). Leveraging web-scale pretraining data, these models exhibit strong zero-shot capabilities. However, their performance often deteriorates when confronted with domain-specific or specialized generalization tasks. To address this, a growing body of research focuses on transferring or generalizing the rich knowledge embedded in VLMs to various downstream applications. This survey aims to comprehensively summarize the generalization settings, methodologies, benchmarking and results in VLM literatures. Delving into the typical VLM structures, current literatures are categorized into prompt-based, parameter-based and feature-based methods according to the transferred modules. 
The differences and characteristics in each category are furthered summarized and discussed by revisiting the typical transfer learning (TL) settings, providing novel interpretations for TL in the era of VLMs. Popular benchmarks for VLM generalization are further introduced with thorough performance comparisons among the reviewed methods. Following the advances in large-scale generalizable pretraining, this survey also discusses the relations and differences between VLMs and up-to-date multimodal large language models (MLLM), e.g., DeepSeek-VL. By systematically reviewing the surging literatures in vision-language research from a novel and practical generalization prospective, this survey contributes to a clear landscape of current and future multimodal researches.
[215] MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
Yuting Zhang,Kaishen Yuan,Hao Lu,Yutao Yue,Jintai Chen,Kaishun Wu
Main category: cs.CV
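TL;DR: This paper proposes MedTVT-R1, a new multimodal large language model framework for multi-disease diagnosis that improves diagnostic accuracy and interpretability by integrating multiple types of clinical data and adopting a novel training method.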
Details
Motivation: Accurate and interpretable multi-disease diagnosis is a key challenge in medical research, particularly when using heterogeneous multimodal medical data. Current approaches usually rely on single-modal data, which limits comprehensive understanding of complex diseases. Method: A dataset named MedTVT-QA is constructed, and Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning with a Jaccard Reward function is used. Result: Experimental results show that MedTVT-R1 performs well in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications. Conclusion: MedTVT-R1 is a novel multimodal large language model framework that effectively integrates multiple types of clinical data, improves multi-disease diagnosis, and has significant potential for clinical application. Abstract: Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
[216] Enhancing Image Restoration Transformer via Adaptive Translation Equivariance
JiaKui Hu,Zhengjian Yao,Lujia Jin,Hangzhou He,Yanye Lu
Main category: cs.CV
TL;DR: This paper proposes TEAFormer, a Translation Equivariance Adaptive Transformer, that addresses issues of translation equivariance in attention mechanisms for improved image restoration.
Details
Motivation: Attention mechanisms in modern restoration transformers undermine translation equivariance, which is crucial for training convergence and generalization. Method: The paper introduces slide indexing and component stacking strategies, along with an adaptive sliding indexing mechanism, to maintain translation equivariance in the model design. Result: TEAFormer demonstrates superior effectiveness, training convergence, and generalization across various image restoration tasks. Conclusion: TEAFormer effectively incorporates translation equivariance into attention mechanisms, leading to improved performance in image restoration tasks. Abstract: Translation equivariance is a fundamental inductive bias in image restoration, ensuring that translated inputs produce translated outputs. Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing and component stacking. Slide indexing maintains operator responses at fixed positions, with sliding window attention being a notable example, while component stacking enables the arrangement of translation-equivariant operators in parallel or sequentially, thereby building complex architectures while preserving translation equivariance. However, these strategies still create a dilemma in model design between the high computational cost of self-attention and the fixed receptive field associated with sliding window attention. To address this, we develop an adaptive sliding indexing mechanism to efficiently select key-value pairs for each query, which are then concatenated in parallel with globally aggregated key-value pairs. The designed network, called the Translation Equivariance Adaptive Transformer (TEAFormer), is assessed across a variety of image restoration tasks. 
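The equivariance property that the abstract attributes to sliding window attention can be checked numerically. The sketch below is not TEAFormer: it is a toy 1D sliding-window attention in which the query, key, and value projections are the identity and the signal wraps around circularly. Both choices, and the names `sliding_window_attention` and `shift`, are assumptions made only for illustration.

```python
import math

def sliding_window_attention(x, radius=1):
    """Toy 1D attention: each position attends only to a fixed-size window.

    Assumptions: identity Q/K/V projections and circular wrap-around.
    """
    n = len(x)
    out = []
    for i in range(n):
        # Window indices are defined relative to the query position i.
        idx = [(i + d) % n for d in range(-radius, radius + 1)]
        scores = [x[i] * x[j] for j in idx]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w * x[j] for w, j in zip(weights, idx)) / z)
    return out

def shift(x, k):
    """Circularly translate a sequence by k positions."""
    k %= len(x)
    return x[-k:] + x[:-k]

signal = [0.1, 0.7, -0.3, 1.2, 0.5]
shifted_then_attended = sliding_window_attention(shift(signal, 2))
attended_then_shifted = shift(sliding_window_attention(signal), 2)
```

Because the window is indexed relative to each query, translating the input and then applying the operator gives the same result as applying the operator and then translating the output.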
The results highlight its superiority in terms of effectiveness, training convergence, and generalization.
[217] Multi-Scale Representation of Follicular Lymphoma Pathology Images in a Single Hyperbolic Space
Kei Taguchi,Kazumasa Ohara,Tatsuya Yokota,Hiroaki Miyoshi,Noriaki Hashimoto,Ichiro Takeuchi,Hidekata Hontani
Main category: cs.CV
TL;DR: A self-supervised learning method is proposed to represent malignant lymphoma pathology images in a hyperbolic space, capturing multi-scale morphological changes and variations in disease state and cell types.
Details
Motivation: To capture morphological changes that occur across scales during disease progression in malignant lymphoma pathology images. Method: The approach uses self-supervised learning to embed tissue and corresponding nucleus images close to each other based on inclusion relationships, utilizing the Poincaré ball as the feature space. Result: The learned representations successfully encode hierarchical structures and capture variations in disease state and cell types. Conclusion: The proposed method effectively captures both disease state and cell type variations by embedding pathology images within a single hyperbolic space using self-supervised learning. Abstract: We propose a method for representing malignant lymphoma pathology images, from high-resolution cell nuclei to low-resolution tissue images, within a single hyperbolic space using self-supervised learning. To capture morphological changes that occur across scales during disease progression, our approach embeds tissue and corresponding nucleus images close to each other based on inclusion relationships. Using the Poincaré ball as the feature space enables effective encoding of this hierarchical structure. The learned representations capture both disease state and cell type variations.
[218] Auto-Regressively Generating Multi-View Consistent Images
JiaKui Hu,Yuxiao Yang,Jialun Liu,Jinbo Wu,Chen Zhao,Yanye Lu
Main category: cs.CV
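TL;DR: This paper proposes MV-AR, a method for generating consistent multi-view images from arbitrary prompts, which addresses key problems in multi-view image generation through an auto-regressive model, condition injection modules, and data augmentation.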
Details
Motivation: Generating multi-view images is key to 3D content creation, but maintaining consistency across views and effectively synthesizing shapes and textures under diverse conditions remain challenging. Method: The paper proposes the Multi-View Auto-Regressive (MV-AR) method, which uses an auto-regressive model to progressively generate multi-view images and introduces condition injection modules and a progressive training strategy. It also adopts the "Shuffle View" data augmentation technique. Result: Experiments show that MV-AR reliably generates consistent multi-view images across a range of conditions, and that the Shuffle View augmentation significantly expands the training data. Conclusion: MV-AR performs well on multi-view image generation, producing consistent multi-view images and performing on par with diffusion-based multi-view image generation models. Abstract: Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts through architecture design and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, thus significantly expanding the training data by several orders of magnitude. 
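The abstract does not spell out how "Shuffle View" works. The sketch below assumes it permutes the order in which the views of one object appear in a training sequence while keeping each view paired with its camera pose; the function name `shuffle_views` and the toy data are hypothetical.

```python
import random

def shuffle_views(views, poses, rng=None):
    """Return the same (view, pose) pairs in a random order.

    Assumption: each view carries its own camera pose, so a reordered
    sequence is still a valid training example.
    """
    rng = rng or random.Random()
    order = list(range(len(views)))
    rng.shuffle(order)
    return [views[i] for i in order], [poses[i] for i in order]

views = ["front", "left", "back", "right"]  # stand-ins for image tensors
poses = [0, 90, 180, 270]                   # stand-ins for camera poses (degrees)
shuffled_views, shuffled_poses = shuffle_views(views, poses, random.Random(0))
```

Under this reading, a set of n views yields up to n! distinct orderings, which would be consistent with the claimed expansion of the training data.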
Experiments demonstrate the performance and versatility of our MV-AR, which reliably generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at https://github.com/MILab-PKU/MVAR.
[219] A Set-to-Set Distance Measure in Hyperbolic Space
Pengxiang Li,Wei Wu,Zhi Gao,Xiaomeng Fan,Peilin Yu,Yuwei Wu,Zhipeng Lu,Yunde Jia,Mehrtash Harandi
Main category: cs.CV
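TL;DR: The study proposes HS2SD, a new measure of dissimilarity between sets in hyperbolic space that accounts for both the global and local structure of the sets and shows superior performance on multiple tasks.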
Details
Motivation: Real-world applications require comparing sets of data points in hyperbolic space that carry hierarchical relationships, and existing point-to-point distance measures are insufficient to capture the complex relationships between sets. Method: The paper proposes a hyperbolic set-to-set distance measure, HS2SD, which combines geodesic distances between Einstein midpoints with topological characteristics to capture global and local structural information. Result: HS2SD outperforms existing methods on multiple tasks (entity matching, standard image classification, and few-shot image classification), effectively modeling the hierarchical and complex relationships within hyperbolic sets. Conclusion: By integrating the global and local structural information of hyperbolic sets, HS2SD provides a more nuanced understanding of the relationships between them and performs well on entity matching and image classification tasks. Abstract: We propose a hyperbolic set-to-set distance measure for computing dissimilarity between sets in hyperbolic space. While point-to-point distances in hyperbolic space effectively capture hierarchical relationships between data points, many real-world applications require comparing sets of hyperbolic data points, where the local structure and the global structure of the sets carry crucial semantic information. The proposed hyperbolic set-to-set distance measure (HS2SD) integrates both global and local structural information: global structure through geodesic distances between Einstein midpoints of hyperbolic sets, and local structure through topological characteristics of the two sets. To efficiently compute topological differences, we prove that using a finite Thue-Morse sequence of degree and adjacency matrices can serve as a robust approximation to capture the topological structure of a set. In this case, by considering the topological differences, HS2SD provides a more nuanced understanding of the relationships between two hyperbolic sets. Empirical evaluation on entity matching, standard image classification, and few-shot image classification demonstrates that our distance measure outperforms existing methods by effectively modeling the hierarchical and complex relationships inherent in hyperbolic sets.
[220] Geometry-aware Distance Measure for Diverse Hierarchical Structures in Hyperbolic Spaces
Pengxiang Li,Yuwei Wu,Zhi Gao,Xiaomeng Fan,Wei Wu,Zhipeng Lu,Yunde Jia,Mehrtash Harandi
Main category: cs.CV
TL;DR: This paper proposes an adaptive distance measure in hyperbolic spaces for better modeling of diverse hierarchical structures, outperforming traditional fixed-measure approaches, especially in few-shot learning scenarios.
Details
Motivation: The motivation is the limitation of existing hyperbolic learning methods that use fixed distance measures under the assumption of uniform hierarchy, which is not reflective of the real-world data with significant hierarchical diversity. Method: The method involves a geometry-aware distance measure in hyperbolic spaces, dynamically adapting to varying hierarchical structures through tailored projections and curvatures for each pair of data points. It includes a revised low-rank decomposition scheme and a hard-pair mining mechanism. Result: Extensive experiments showed notable improvements, especially in few-shot learning tasks where over 5% gains were achieved on mini-ImageNet. Visualization demonstrated clearer class boundaries and improved prototype separation in hyperbolic spaces. Conclusion: The paper concludes that adaptive distance measures in hyperbolic spaces better capture diverse hierarchical structures, showing consistent outperformance over fixed distance measure methods, particularly in few-shot learning tasks. Abstract: Learning in hyperbolic spaces has attracted increasing attention due to its superior ability to model hierarchical structures of data. Most existing hyperbolic learning methods use fixed distance measures for all data, assuming a uniform hierarchy across all data points. However, real-world hierarchical structures exhibit significant diversity, making this assumption overly restrictive. In this paper, we propose a geometry-aware distance measure in hyperbolic spaces, which dynamically adapts to varying hierarchical structures. Our approach derives the distance measure by generating tailored projections and curvatures for each pair of data points, effectively mapping them to an appropriate hyperbolic space. We introduce a revised low-rank decomposition scheme and a hard-pair mining mechanism to mitigate the computational cost of pair-wise distance computation without compromising accuracy. 
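To make the role of curvature concrete, the sketch below computes the standard geodesic distance on a Poincaré ball of curvature -c using Möbius addition. It is the textbook fixed-curvature distance, not the paper's geometry-aware measure: how the authors generate a projection and a curvature for each pair is not described here, so `c` is simply passed in as a parameter.

```python
import math

def mobius_add(x, y, c):
    """Möbius addition of x and y on the Poincaré ball of curvature -c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / denom
            for a, b in zip(x, y)]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    diff = mobius_add([-a for a in x], y, c)
    norm = math.sqrt(sum(d * d for d in diff))
    sqrt_c = math.sqrt(c)
    # Clamp the argument so points numerically on the boundary do not overflow.
    return 2.0 / sqrt_c * math.atanh(min(sqrt_c * norm, 1 - 1e-12))

origin, point = [0.0, 0.0], [0.5, 0.0]
d_unit = poincare_distance(origin, point, c=1.0)   # equals ln(3) for this pair
d_small = poincare_distance(origin, point, c=0.1)  # flatter space, shorter distance
```

The same pair of points is farther apart under larger c, which is the degree of freedom an adaptive measure can tune per pair.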
We present an upper bound on the low-rank approximation error using Talagrand's concentration inequality, ensuring theoretical robustness. Extensive experiments on standard image classification (MNIST, CIFAR-10 and CIFAR-100), hierarchical classification (5-level CIFAR-100), and few-shot learning tasks (mini-ImageNet, tiered-ImageNet) demonstrate the effectiveness of our method. Our approach consistently outperforms learning methods that use fixed distance measures, with notable improvements on few-shot learning tasks, where it achieves over 5% gains on mini-ImageNet. The results reveal that adaptive distance measures better capture diverse hierarchical structures, with visualization showing clearer class boundaries and improved prototype separation in hyperbolic spaces.
[221] Normality Prior Guided Multi-Semantic Fusion Network for Unsupervised Image Anomaly Detection
Muhao Xu,Xueying Zhou,Xizhan Gao,Weiye Song,Guang Feng,Sijie Niu
Main category: cs.CV
TL;DR: This paper proposes a novel unsupervised anomaly detection method using a multi-semantic fusion network that leverages normal sample features for improved anomaly reconstruction and detection.
Details
Motivation: Detecting logical anomalies is challenging because their local features resemble normal semantics while their global semantics deviate significantly. Existing encoder-decoder methods fail to suppress these anomalies effectively due to neural networks' generalization capabilities. Method: A normality prior guided multi-semantic fusion network (NPGMF) is proposed, which utilizes a pre-trained vision-language network to extract global semantics and learnable semantic codebooks for feature representation. Multi-semantic features are fused and used to guide anomaly reconstruction. Result: The proposed method achieves improvements of 5.7% in pixel-sPRO and 2.6% in image-AUROC on the MVTec LOCO AD dataset, demonstrating its effectiveness in unsupervised anomaly detection. Conclusion: The proposed NPGMF method achieves state-of-the-art performance on the MVTec LOCO AD dataset for unsupervised anomaly detection by incorporating multi-semantic features of normal samples into the reconstruction process. Abstract: Detecting logical anomalies has recently become a more challenging task than detecting structural ones. Existing encoder-decoder-based methods typically compress inputs into low-dimensional bottlenecks on the assumption that the compression process can effectively suppress the transmission of logical anomalies to the decoder. However, logical anomalies present a particular difficulty because, while their local features often resemble normal semantics, their global semantics deviate significantly from normal patterns. Thanks to the generalisation capabilities inherent in neural networks, these abnormal semantic features can propagate through low-dimensional bottlenecks. This ultimately allows the decoder to reconstruct anomalous images with misleading fidelity. To tackle the above challenge, we propose a novel normality prior guided multi-semantic fusion network for unsupervised anomaly detection. 
Instead of feeding the compressed bottlenecks to the decoder directly, we introduce the multi-semantic features of normal samples into the reconstruction process. To this end, we first extract abstract global semantics of normal cases by a pre-trained vision-language network, then the learnable semantic codebooks are constructed to store representative feature vectors of normal samples by vector quantisation. Finally, the above multi-semantic features are fused and employed as input to the decoder to guide the reconstruction of anomalies to approximate normality. Extensive experiments are conducted to validate the effectiveness of our proposed method, and it achieves the SOTA performance on the MVTec LOCO AD dataset with improvements of 5.7% in pixel-sPRO and 2.6% in image-AUROC. The source code is available at https://github.com/Xmh-L/NPGMF.
[222] Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Sung Jin Um,Dongjin Kim,Sangmin Lee,Jung Uk Kim
Main category: cs.CV
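TL;DR: This paper proposes a new audio-visual sound source localization framework that achieves more precise sound localization in complex scenes using multimodal large language models and newly designed loss functions.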
Details
Motivation: Existing methods struggle to accurately distinguish sound-making objects from silent ones in complex scenes because they rely on simple audio-visual correspondence. Method: The paper introduces an Object-aware Contrastive Alignment (OCA) loss and an Object Region Isolation (ORI) loss, combined with multimodal large language models that generate detailed contextual information. Result: Experiments on the MUSIC and VGGSound datasets show that the method significantly outperforms existing methods in both single-source and multi-source localization. Conclusion: The paper proposes a new framework leveraging multimodal large language models that effectively addresses sound localization in complex scenes. Abstract: The audio-visual sound source localization task aims to spatially localize sound-making objects within visual scenes by integrating visual and audio cues. However, existing methods struggle with accurately localizing sound-making objects in complex scenes, particularly when visually similar silent objects coexist. This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. To effectively integrate this detailed information, we introduce two novel loss functions: Object-aware Contrastive Alignment (OCA) loss and Object Region Isolation (ORI) loss. Extensive experimental results on MUSIC and VGGSound datasets demonstrate the effectiveness of our approach, significantly outperforming existing methods in both single-source and multi-source localization scenarios. Code and generated detailed contextual information are available at: https://github.com/VisualAIKHU/OA-SSL.
[223] VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning
Xuanyu Zhang,Weiqi Li,Shijie Zhao,Junlin Li,Li Zhang,Jian Zhang
Main category: cs.CV
TL;DR: The paper introduces VQ-Insight, a new framework for assessing the quality of AI-generated videos, showing better performance than current methods by enhancing generalization and specialization through innovative learning schemes and reward designs.
Details
Motivation: Current approaches to evaluating AIGC-generated videos face challenges such as limited generalization, lack of temporal awareness, heavy reliance on annotated datasets, and the decoupling of understanding and generation. Method: VQ-Insight uses a progressive video quality learning scheme and designs multi-dimension scoring rewards, preference comparison rewards, and temporal modeling rewards for enhanced generalization and specialization. Result: Extensive experiments show that VQ-Insight consistently outperforms existing methods in various video quality assessment aspects. Conclusion: VQ-Insight demonstrates significant improvements in video generation tasks compared to state-of-the-art baselines through preference comparison, multi-dimension scoring, and natural video scoring. Abstract: Recent advances in AI-generated content (AIGC) have led to the emergence of powerful text-to-video generation models. Despite these successes, evaluating the quality of AIGC-generated videos remains challenging due to limited generalization, lack of temporal awareness, heavy reliance on large-scale annotated datasets, and the lack of effective interaction with generation models. Most current approaches rely on supervised finetuning of vision-language models (VLMs), which often require large-scale annotated datasets and tend to decouple understanding and generation. To address these shortcomings, we propose VQ-Insight, a novel reasoning-style VLM framework for AIGC video quality assessment. Our approach features: (1) a progressive video quality learning scheme that combines image quality warm-up, general task-specific temporal learning, and joint optimization with the video generation model; (2) the design of multi-dimension scoring rewards, preference comparison rewards, and temporal modeling rewards to enhance both generalization and specialization in video quality evaluation. 
Extensive experiments demonstrate that VQ-Insight consistently outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring, bringing significant improvements for video generation tasks.
[224] VisualChef: Generating Visual Aids in Cooking via Mask Inpainting
Oleh Kuzyk,Zuoyue Li,Marc Pollefeys,Xi Wang
Main category: cs.CV
TL;DR: VisualChef enhances cooking support by generating tailored visual aids using mask-based alignment and targeted object modifications while maintaining environmental consistency.
Details
Motivation: Cooking requires detailed understanding and monitoring, which is challenging without consistent visual guidance. Existing recipe images and videos often lack focus and consistency, prompting the need for better tailored visual aids. Method: VisualChef generates images by identifying action-relevant objects, classifying them for targeted changes, and preserving the initial environment. It uses a mask-based visual grounding approach and includes an automated pipeline to extract high-quality frames. Result: VisualChef demonstrates improvements over state-of-the-art methods, as shown through quantitative and qualitative evaluations on three egocentric video datasets. Conclusion: VisualChef offers an effective solution for generating contextual visual aids in cooking scenarios, outperforming existing methods through mask-based visual grounding and targeted modifications. Abstract: Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action's execution and the resulting appearance of the object, while preserving the initial frame's environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. 
Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.
[225] 2D Triangle Splatting for Direct Differentiable Mesh Training
Kaifeng Sheng,Zheng Zhou,Yingliang Peng,Qianwei Wang
Main category: cs.CV
TL;DR: This paper introduces 2D Triangle Splatting (2DTS), a new method for 3D scene reconstruction that improves rendering quality and efficiency over existing approaches.
Details
Motivation: To overcome the rendering speed and advanced effects limitations of 3D Gaussian methods compared to mesh-based models. Method: 2DTS replaces 3D Gaussian primitives with 2D triangle facelets to form a discrete mesh-like structure while maintaining volumetric modeling benefits. A compactness parameter is incorporated for photorealistic mesh training. Result: The triangle-based method achieves higher fidelity than state-of-the-art Gaussian-based methods and produces meshes with superior visual quality. Conclusion: 2D Triangle Splatting (2DTS) effectively reconstructs high-quality 3D scenes, outperforming Gaussian-based and existing mesh reconstruction methods. Abstract: Differentiable rendering with 3D Gaussian primitives has emerged as a powerful method for reconstructing high-fidelity 3D scenes from multi-view images. While it offers improvements over NeRF-based methods, this representation still encounters challenges with rendering speed and advanced rendering effects, such as relighting and shadow rendering, compared to mesh-based models. In this paper, we propose 2D Triangle Splatting (2DTS), a novel method that replaces 3D Gaussian primitives with 2D triangle facelets. This representation naturally forms a discrete mesh-like structure while retaining the benefits of continuous volumetric modeling. By incorporating a compactness parameter into the triangle primitives, we enable direct training of photorealistic meshes. Our experimental results demonstrate that our triangle-based method, in its vanilla version (without compactness tuning), achieves higher fidelity compared to state-of-the-art Gaussian-based methods. Furthermore, our approach produces reconstructed meshes with superior visual quality compared to existing mesh reconstruction methods.
[226] Resampling Augmentation for Time Series Contrastive Learning: Application to Remote Sensing
Antoine Saget,Baptiste Lafabregue,Antoine Cornuéjols,Pierre Gançarski
Main category: cs.CV
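TL;DR: This paper proposes a new resampling-based augmentation strategy for contrastive learning on satellite image time series, making more effective use of unlabeled data for self-supervised pretraining.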
Details
Motivation: Unlabeled satellite image time series are abundant while labeled data are scarce, so a natural tool is needed to exploit the unlabeled data, and contrastive self-supervised pretraining is the preferred choice. However, designing effective data augmentations for contrastive learning remains challenging for time series. Method: Positive pairs are generated by upsampling the time series and extracting disjoint subsequences while preserving temporal coverage. Result: The method is validated on multiple agricultural classification benchmarks using Sentinel-2 imagery and achieves state-of-the-art performance on the S2-Agri100 dataset. Conclusion: The paper proposes a resampling-based augmentation strategy that offers a simple yet effective approach to contrastive learning for remote sensing time series. Abstract: Given the abundance of unlabeled Satellite Image Time Series (SITS) and the scarcity of labeled data, contrastive self-supervised pretraining emerges as a natural tool to leverage this vast quantity of unlabeled data. However, designing effective data augmentations for contrastive learning remains challenging for time series. We introduce a novel resampling-based augmentation strategy that generates positive pairs by upsampling time series and extracting disjoint subsequences while preserving temporal coverage. We validate our approach on multiple agricultural classification benchmarks using Sentinel-2 imagery, showing that it outperforms common alternatives such as jittering, resizing, and masking. Further, we achieve state-of-the-art performance on the S2-Agri100 dataset without employing spatial information or temporal encodings, surpassing more complex mask-based SSL frameworks. Our method offers a simple, yet effective, contrastive learning augmentation for remote sensing time series.
[227] SpaNN: Detecting Multiple Adversarial Patches on CNNs by Spanning Saliency Thresholds
Mauricio Byrd Victorica,György Dán,Henrik Sandberg
Main category: cs.CV
TL;DR: SpaNN is a novel attack detector that efficiently identifies adversarial patches regardless of their number, significantly outperforming current defenses.
Details
Motivation: Current defenses against adversarial patch attacks are limited in handling multiple patches efficiently or effectively, leaving a need for a more robust solution. Method: SpaNN builds an ensemble of binarized feature maps using saliency thresholds on neural activations from the first convolutional layer, performs clustering on these maps, and uses cluster features as input for a classifier to detect attacks. Result: SpaNN demonstrates superior performance, outperforming existing methods by up to 11 percentage points in object detection and 27 percentage points in image classification. Conclusion: SpaNN is an effective attack detector that outperforms state-of-the-art defenses in detecting adversarial patches in object detection and image classification tasks. Abstract: State-of-the-art convolutional neural network models for object detection and image classification are vulnerable to physically realizable adversarial perturbations, such as patch attacks. Existing defenses have focused, implicitly or explicitly, on single-patch attacks, leaving their sensitivity to the number of patches as an open question or rendering them computationally infeasible or inefficient against attacks consisting of multiple patches in the worst cases. In this work, we propose SpaNN, an attack detector whose computational complexity is independent of the expected number of adversarial patches. The key novelty of the proposed detector is that it builds an ensemble of binarized feature maps by applying a set of saliency thresholds to the neural activations of the first convolutional layer of the victim model. It then performs clustering on the ensemble and uses the cluster features as the input to a classifier for attack detection. Contrary to existing detectors, SpaNN does not rely on a fixed saliency threshold for identifying adversarial regions, which makes it robust against white box adversarial attacks. 
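The thresholding step described above can be sketched as follows. This is an illustration, not the SpaNN implementation: the thresholds are taken relative to the peak activation, and counting 4-connected components stands in for the paper's clustering step, whose algorithm and features are not specified here. All function names are hypothetical.

```python
def binarize(fmap, threshold):
    """Binarize a 2D activation map at a fraction of its peak value (assumed scheme)."""
    peak = max(max(row) for row in fmap)
    return [[1 if v >= threshold * peak else 0 for v in row] for row in fmap]

def count_components(mask):
    """Count 4-connected regions of ones; a simple stand-in for clustering."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                count += 1
                seen[i][j] = True
                stack = [(i, j)]
                while stack:  # iterative flood fill over the region
                    r, c = stack.pop()
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            stack.append((nr, nc))
    return count

def threshold_ensemble_features(fmap, thresholds):
    """One feature per threshold: the number of salient regions that survive it."""
    return [count_components(binarize(fmap, t)) for t in thresholds]

activations = [[0.9, 0.6, 0.9],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.3]]
features = threshold_ensemble_features(activations, [0.2, 0.5, 0.8])  # [2, 1, 2]
```

The feature vector describes how the salient regions change across the whole set of thresholds, so it does not hinge on any single threshold. Its length depends on the number of thresholds, not on how many patches are present.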
We evaluate SpaNN on four widely used data sets for object detection and classification, and our results show that SpaNN outperforms state-of-the-art defenses by up to 11 and 27 percentage points in the case of object detection and the case of image classification, respectively. Our code is available at https://github.com/gerkbyrd/SpaNN.
[228] RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
Wenxu Qian,Chaoyue Wang,Hou Peng,Zhiyu Tan,Hao Li,Anxiang Zeng
Main category: cs.CV
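TL;DR: This paper proposes RDPO, an optimization framework for video generation models that requires no human annotation and significantly improves the physical realism and action coherence of generated videos by exploiting dynamic information in real-world videos.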
Details
Motivation: Although video generation techniques have made remarkable progress in visual quality, faithfully reproducing real-world physics remains a challenge. Existing preference-based fine-tuning methods require costly human-annotated data or reward models that are not yet feasible, so the authors propose RDPO to address these problems. Method: The paper proposes Real Data Preference Optimisation (RDPO), an annotation-free framework that optimizes video generation models by extracting physical priors from real-world videos. Result: Experiments show that RDPO achieves improvements across multiple dimensions on multiple benchmarks and in human evaluations, and that it automatically builds statistically distinguishable preference pairs that guide the generator to better obey physical laws. Conclusion: RDPO is a promising video generation framework that significantly improves the physical realism and action coherence of generated videos. Abstract: Video generation techniques have achieved remarkable advancements in visual quality, yet faithfully reproducing real-world physics remains elusive. Preference-based model post-training may improve physical consistency, but requires costly human-annotated datasets or reward models that are not yet feasible. To address these challenges, we present Real Data Preference Optimisation (RDPO), an annotation-free framework that distills physical priors directly from real-world videos. Specifically, the proposed RDPO reverse-samples real video sequences with a pre-trained generator to automatically build preference pairs that are statistically distinguishable in terms of physical correctness. A multi-stage iterative training schedule then guides the generator to obey physical laws increasingly well. Benefiting from the dynamic information explored from real videos, our proposed RDPO significantly improves the action coherence and physical realism of the generated videos. Evaluations on multiple benchmarks and human evaluations have demonstrated that RDPO achieves improvements across multiple dimensions. The source code and demonstration of this paper are available at: https://wwenxu.github.io/RDPO/
[229] Historical Report Guided Bi-modal Concurrent Learning for Pathology Report Generation
Ling Zhang,Boxiang Yun,Qingli Li,Yan Wang
Main category: cs.CV
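TL;DR: This study proposes the BiGen framework, which addresses key problems in pathology report generation by combining a medical knowledge bank with bi-modal learning.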
Details
Motivation: Automated pathology report generation faces two key challenges: visual features lack semantic content, and WSIs contain redundant information. Method: The paper proposes a multi-modal decoder approach to pathology report generation built on a knowledge retrieval mechanism and a bi-modal concurrent learning strategy. Result: Experiments show that BiGen achieves a 7.4% relative improvement in NLP metrics and a 19.1% enhancement in classification metrics for Her-2 prediction on the PathText (BRCA) dataset. Conclusion: BiGen performs well on pathology report generation, addressing the lack of semantic content in visual features and the information redundancy in WSIs. Abstract: Automated pathology report generation from Whole Slide Images (WSIs) faces two key challenges: (1) lack of semantic content in visual features and (2) inherent information redundancy in WSIs. To address these issues, we propose a novel Historical Report Guided Bi-modal Concurrent Learning Framework for Pathology Report Generation (BiGen) emulating pathologists' diagnostic reasoning, consisting of: (1) A knowledge retrieval mechanism to provide rich semantic content, which retrieves WSI-relevant knowledge from a pre-built medical knowledge bank by matching high-attention patches and (2) A bi-modal concurrent learning strategy instantiated via a learnable visual token and a learnable textual token to dynamically extract key visual features and retrieved knowledge, where weight-shared layers enable cross-modal alignment between visual features and knowledge features. Our multi-modal decoder integrates both modalities for comprehensive diagnostic report generation. Experiments on the PathText (BRCA) dataset demonstrate our framework's superiority, achieving state-of-the-art performance with 7.4% relative improvement in NLP metrics and 19.1% enhancement in classification metrics for Her-2 prediction versus existing methods. Ablation studies validate the necessity of our proposed modules, highlighting our method's ability to provide WSI-relevant rich semantic content and suppress information redundancy in WSIs. Code is publicly available at https://github.com/DeepMed-Lab-ECNU/BiGen.
[230] Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping
Pablo Meseguer,Rocío del Amor,Valery Naranjo
Main category: cs.CV
TL;DR: This paper introduces a new benchmark and FM-SI metric to assess histopathology foundation models' performance in whole slide image analysis, showing that less biased feature extraction improves classification accuracy.
Details
Motivation: To address the need for real-world challenges evaluating histopathology foundation models' effectiveness due to their diversity. Method: A novel benchmark using the AI4SkIN dataset and Foundation Model-Silhouette Index (FM-SI) metric was developed to evaluate histopathology FMs as patch-level feature extractors within a MIL classification framework. Result: The study found that extracting less biased features enhances classification performance, particularly in similarity-based MIL classifiers. Conclusion: Histopathology foundation models' effectiveness can be enhanced by extracting less biased features, especially for similarity-based MIL classifiers. Abstract: Pretraining on large-scale, in-domain datasets grants histopathology foundation models (FM) the ability to learn task-agnostic data representations, enhancing transfer learning on downstream tasks. In computational pathology, automated whole slide image analysis requires multiple instance learning (MIL) frameworks due to the gigapixel scale of the slides. The diversity among histopathology FMs has highlighted the need to design real-world challenges for evaluating their effectiveness. To bridge this gap, our work presents a novel benchmark for evaluating histopathology FMs as patch-level feature extractors within a MIL classification framework. For that purpose, we leverage the AI4SkIN dataset, a multi-center cohort encompassing slides with challenging cutaneous spindle cell neoplasm subtypes. We also define the Foundation Model - Silhouette Index (FM-SI), a novel metric to measure model consistency against distribution shifts. Our experimentation shows that extracting less biased features enhances classification performance, especially in similarity-based MIL classifiers.
[231] MedSeg-R: Medical Image Segmentation with Clinical Reasoning
Hao Shao,Qibin Hou
Main category: cs.CV
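TL;DR: MedSeg-R is a lightweight, dual-stage framework for medical image segmentation inspired by clinical reasoning. It converts medical reports into structured semantic priors and embeds them into a SAM backbone, improving sensitivity to small lesions.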
Details
Motivation: Medical image segmentation is challenging because of overlapping anatomical structures, ambiguous boundaries, and a severe imbalance between foreground and background classes, which particularly affects the delineation of small lesions. Existing methods rely on local cues or user prompts and lack integrated semantic priors, so they generalize poorly to low-contrast or overlapping targets. Method: The paper proposes MedSeg-R, a dual-stage framework. The first (cognitive) stage interprets medical reports into structured semantic priors (location, texture, shape), which are fused via a transformer block. The second (perceptual) stage uses these priors to modulate the SAM backbone through spatial attention, dynamic convolution, and deformable sampling. Result: On challenging benchmarks, MedSeg-R yields large Dice improvements on overlapping and ambiguous structures and is plug-and-play compatible with SAM-based systems. Conclusion: By embedding fine-grained guidance, MedSeg-R resolves inter-class confusion and amplifies minority-class features, improving medical image segmentation performance. Abstract: Medical image segmentation is challenging due to overlapping anatomies with ambiguous boundaries and a severe imbalance between the foreground and background classes, which particularly affects the delineation of small lesions. Existing methods, including encoder-decoder networks and prompt-driven variants of the Segment Anything Model (SAM), rely heavily on local cues or user prompts and lack integrated semantic priors, thus failing to generalize well to low-contrast or overlapping targets. To address these issues, we propose MedSeg-R, a lightweight, dual-stage framework inspired by clinical reasoning. Its cognitive stage interprets medical reports into structured semantic priors (location, texture, shape), which are fused via a transformer block. In the perceptual stage, these priors modulate the SAM backbone: spatial attention highlights likely lesion regions, dynamic convolution adapts feature filters to expected textures, and deformable sampling refines spatial support. By embedding this fine-grained guidance early, MedSeg-R disentangles inter-class confusion and amplifies minority-class cues, greatly improving sensitivity to small lesions. On challenging benchmarks, MedSeg-R produces large Dice improvements in overlapping and ambiguous structures, demonstrating plug-and-play compatibility with SAM-based systems.
[232] Reconstructing Tornadoes in 3D with Gaussian Splatting
Adam Yang,Nadula Kadawedduwa,Tianfu Wang,Maria Molina,Christopher Metzler
Main category: cs.CV
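TL;DR: This paper proposes a method for reconstructing the 3D structure of tornadoes based on 3D Gaussian splatting and releases a multiview dataset of a small lab-based tornado.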
Details
Motivation: Accurately reconstructing the 3D structure of tornadoes is critical for understanding and preparing for this destructive weather phenomenon, but there is currently no controlled tornado dataset with which to develop and validate reconstruction tools. Method: 3D Gaussian splatting (3DGS) is applied to a newly captured multiview dataset of a small lab-based tornado to reconstruct and visualize its 3D structure. Result: Experiments show that 3DGS can effectively reconstruct and visualize the 3D structure of this tornado. Conclusion: By providing a novel multiview dataset of a small lab-based tornado and showing that 3DGS can reconstruct and visualize its 3D structure, the study offers new tools for tornado research and prediction. Abstract: Accurately reconstructing the 3D structure of tornadoes is critically important for understanding and preparing for this highly destructive weather phenomenon. While modern 3D scene reconstruction techniques, such as 3D Gaussian splatting (3DGS), could provide a valuable tool for reconstructing the 3D structure of tornadoes, at present we are critically lacking a controlled tornado dataset with which to develop and validate these tools. In this work we capture and release a novel multiview dataset of a small lab-based tornado. We demonstrate one can effectively reconstruct and visualize the 3D structure of this tornado using 3DGS.
[233] MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation
Tianchen Deng,Guole Shen,Xun Chen,Shenghai Yuan,Hongming Shen,Guohao Peng,Zhenyu Wu,Jingchuan Wang,Lihua Xie,Danwei Wang,Hesheng Wang,Weidong Chen
Main category: cs.CV
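TL;DR: The paper proposes a new distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, loop closure, and online distillation, and introduces DES, the first real-world dense SLAM dataset.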
Details
Motivation: Existing implicit SLAM algorithms are limited to single-agent scenarios and perform poorly in large-scale scenes and long sequences, and NeRF-based multi-agent SLAM frameworks cannot meet communication bandwidth constraints. In addition, no real-world dataset provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. Method: A novel triplane-grid joint scene representation is proposed to improve scene reconstruction, an intra-to-inter loop closure method is designed to achieve local and global consistency, and an online distillation method is developed to fuse information from different submaps. The authors also create the DES dataset, covering single-agent and multi-agent scenarios. Result: Experiments show that the method is superior in mapping, tracking, and communication, and the DES dataset supports research on SLAM, 3D reconstruction, and visual foundation models. Conclusion: The proposed distributed multi-agent collaborative neural SLAM framework addresses the limitations of current methods in large-scale scenes and under communication bandwidth constraints, and the new DES dataset fills the gap in real-world datasets and is expected to advance related fields. Abstract: Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios, and face difficulties in large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation for multiple submap fusion. A novel triplane-grid joint scene representation method is proposed to improve scene reconstruction. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectory ground truth and high-accuracy 3D mesh ground truth. To this end, we propose the first real-world Dense SLAM (DES) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale outdoor scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance the development of research in SLAM, 3D reconstruction, and visual foundation models.
Experiments on various datasets demonstrate the superiority of the proposed method in mapping, tracking, and communication. The dataset and code will be open-sourced at https://github.com/dtc111111/mcnslam.
[234] MARL-MambaContour: Unleashing Multi-Agent Deep Reinforcement Learning for Active Contour Optimization in Medical Image Segmentation
Ruicheng Zhang,Yu Sun,Zeyu Zhang,Jinai Li,Xiaofan Liu,Au Hoi Fan,Haowei Guo,Puxin Yan
Main category: cs.CV
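TL;DR: This paper proposes MARL-MambaContour, a medical image segmentation framework based on multi-agent reinforcement learning (MARL) that improves on traditional pixel-based methods by generating topologically consistent object-level contours.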
Details
Motivation: Traditional pixel-based medical image segmentation methods can lack topological constraints and holistic structural awareness of anatomical regions, so a new approach is needed to improve segmentation. Method: Segmentation is reframed as a multi-agent cooperation task in which each contour point is an autonomous agent that iteratively adjusts its position to align precisely with the target boundary. Optimization uses a contour-specific Soft Actor-Critic (SAC) algorithm, with an Entropy Regularization Adjustment Mechanism (ERAM) that balances agent exploration against contour smoothness. The framework also uses a Mamba-based policy network with a novel Bidirectional Cross-attention Hidden-state Fusion Mechanism (BCHFM) to mitigate the memory confusion that long-range modeling can cause in state space models. Result: Extensive experiments on five different medical imaging datasets show that MARL-MambaContour achieves state-of-the-art performance. Conclusion: MARL-MambaContour is an accurate and robust medical image segmentation method with potential for clinical application. Abstract: We introduce MARL-MambaContour, the first contour-based medical image segmentation framework based on Multi-Agent Reinforcement Learning (MARL). Our approach reframes segmentation as a multi-agent cooperation task focused on generating topologically consistent object-level contours, addressing the limitations of traditional pixel-based methods which could lack topological constraints and holistic structural awareness of anatomical regions. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, enabling adaptation to blurred edges and intricate morphologies common in medical images. This iterative adjustment process is optimized by a contour-specific Soft Actor-Critic (SAC) algorithm, further enhanced with the Entropy Regularization Adjustment Mechanism (ERAM) which dynamically balances agent exploration with contour smoothness. Furthermore, the framework incorporates a Mamba-based policy network featuring a novel Bidirectional Cross-attention Hidden-state Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion limitations associated with long-range modeling in state space models, thereby facilitating more accurate inter-agent information exchange and informed decision-making.
Extensive experiments on five diverse medical imaging datasets demonstrate the state-of-the-art performance of MARL-MambaContour, highlighting its potential as an accurate and robust clinical application.
[235] Multi-Scale Spectral Attention Module-based Hyperspectral Segmentation in Autonomous Driving Scenarios
Imad Ali Shah,Jiarong Li,Tim Brophy,Martin Glavin,Edward Jones,Enda Ward,Brian Deegan
Main category: cs.CV
TL;DR: This paper proposes UNet-MSAM, a novel method for hyperspectral imaging (HSI) processing in autonomous driving, which uses a Multi-scale Spectral Attention Module (MSAM) to improve semantic segmentation performance. It achieves better accuracy with minimal computational cost.
Details
Motivation: Recent advances in autonomous driving have shown the potential of Hyperspectral Imaging (HSI) for environmental perception, especially under challenging conditions. However, processing high-dimensional HSI data efficiently remains a challenge, prompting the need for improved feature extraction methods. Method: The authors introduce a Multi-scale Spectral Attention Module (MSAM) that enhances spectral feature extraction using three parallel 1D convolutions with varying kernel sizes (from 1 to 11), combined with an adaptive feature aggregation mechanism. This module is integrated into UNet's skip connections to form the UNet-MSAM architecture. Result: The proposed UNet-MSAM achieves significant improvements across multiple HSI datasets (HyKo-VIS v2, HSI-Drive v2, and Hyperspectral City v2). With minimal computational overhead (0.02% increase in parameters and 0.82% GFLOPS), it outperforms the baseline UNet-SC by improving mean IoU by 3.61% and mF1 by 3.80% on average across the datasets. Conclusion: This paper concludes that the proposed UNet-MSAM model, incorporating a Multi-scale Spectral Attention Module, significantly improves semantic segmentation performance for hyperspectral imaging datasets used in autonomous driving. The model demonstrates minimal computational overhead while achieving notable gains in performance metrics like mean IoU and mF1. Abstract: Recent advances in autonomous driving (AD) have highlighted the potential of Hyperspectral Imaging (HSI) for enhanced environmental perception, particularly in challenging weather and lighting conditions. However, efficiently processing its high-dimensional spectral data remains a significant challenge. This paper introduces a Multi-scale Spectral Attention Module (MSAM) that enhances spectral feature extraction through three parallel 1D convolutions with kernel sizes ranging from 1 to 11, coupled with an adaptive feature aggregation mechanism.
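The multi-scale idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' MSAM: the kernel sizes (1, 5, 11), the fixed box-filter kernels, and the softmax mixing of branches are all assumptions chosen for illustration, since the abstract only states that three parallel 1D convolutions with kernel sizes from 1 to 11 are combined adaptively. The function names are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Convolve every spectrum (last axis of x) with a 1D kernel, keeping its length.

    Assumes the number of bands is at least the kernel length.
    """
    return np.apply_along_axis(
        lambda spectrum: np.convolve(spectrum, kernel, mode="same"), -1, x
    )

def multi_scale_spectral_features(x, kernel_sizes=(1, 5, 11), branch_logits=None):
    """Run parallel 1D convolutions over the spectral axis and mix the branches.

    x has shape (..., bands). The box-filter kernels are placeholders
    (assumption); a trained module would learn its kernels and its mixing weights.
    """
    if branch_logits is None:
        branch_logits = np.zeros(len(kernel_sizes))  # uniform mixing by default
    # One branch per kernel size, each a box filter of that width.
    branches = [conv1d_same(x, np.full(k, 1.0 / k)) for k in kernel_sizes]
    # Softmax turns the logits into aggregation weights that sum to 1.
    weights = np.exp(branch_logits - np.max(branch_logits))
    weights /= weights.sum()
    return sum(w * b for w, b in zip(weights, branches))

pixels = np.ones((2, 3, 16))  # toy input: 2x3 pixels, 16 spectral bands
features = multi_scale_spectral_features(pixels)
```

Each branch sees a different spectral neighbourhood, so narrow kernels preserve sharp band-level detail while wide kernels capture smoother trends; the weighted sum stands in for the adaptive aggregation step.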
By integrating MSAM into UNet's skip connections (UNet-SC), our proposed UNet-MSAM achieves significant improvements in semantic segmentation performance across multiple HSI datasets: HyKo-VIS v2, HSI-Drive v2, and Hyperspectral City v2. Our comprehensive experiments demonstrate that with minimal computational overhead (on average 0.02% in parameters and 0.82% GFLOPS), UNet-MSAM consistently outperforms UNet-SC, achieving average improvements of 3.61% in mean IoU and 3.80% in mF1 across the three datasets. Through extensive ablation studies, we have established that multi-scale kernel combinations perform better than single-scale configurations. These findings demonstrate the potential of HSI processing for AD and provide valuable insights into designing robust, multi-scale spectral feature extractors for real-world applications.
[236] SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds from RGB Images for 2D Classification
Youcef Sklab,Hanane Ariouat,Eric Chenin,Edi Prifti,Jean-Daniel Zucker
Main category: cs.CV
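TL;DR: This paper proposes SIM-Net, a novel image classification architecture that combines 2D image and 3D point cloud representations, significantly improving the accuracy of specimen classification.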
Details
Motivation: Conventional 2D image-based models perform poorly on specimen classification with complex backgrounds, non-plant elements, and occlusions, so 3D structural information is needed to improve performance. Method: A pixel-to-point transformation converts 2D object masks into 3D point clouds, and features from a CNN encoder and a PointNet encoder are fused. Result: Experiments show that SIM-Net improves on ResNet101 by up to 9.9% in accuracy and 12.3% in F-score, and outperforms several transformer-based state-of-the-art architectures. Conclusion: By fusing 2D image features with 3D geometric features, SIM-Net achieves superior performance on 2D image classification, particularly for digitized specimen classification. Abstract: We introduce the Shape-Image Multimodal Network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution lies in a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well-suited for the classification of digitized herbarium specimens, a task made challenging by heterogeneous backgrounds, non-plant elements, and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are fused into a unified latent space. Experimental evaluations on herbarium datasets demonstrate that SIM-Net consistently outperforms ResNet101, achieving gains of up to 9.9% in accuracy and 12.3% in F-score. It also surpasses several transformer-based state-of-the-art architectures, highlighting the benefits of incorporating 3D structural reasoning into 2D image classification tasks.
[237] Matrix-Game: Interactive World Foundation Model
Yifan Zhang,Chunli Peng,Boyang Wang,Puyi Wang,Qingcheng Zhu,Fei Kang,Biao Jiang,Zedong Gao,Eric Li,Yang Liu,Yahui Zhou
Main category: cs.CV
TL;DR: Matrix-Game is a 17-billion-parameter interactive world model that enables precise control over character actions and camera movements while maintaining high-quality visuals. It introduces a new two-stage training approach and sets a new standard in controllable game world generation.
Details
Motivation: The motivation behind Matrix-Game is to create a more precise and controllable model for generating interactive game worlds, addressing the limitations of prior models in terms of controllability, physical consistency, and visual quality. Method: Matrix-Game uses a two-stage training pipeline: large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. It adopts a controllable image-to-world generation paradigm conditioned on reference images, motion context, and user actions. Result: Matrix-Game outperforms existing Minecraft world models like Oasis and MineWorld across all metrics, particularly in controllability and physical consistency. Human evaluations confirm its ability to generate realistic and precisely controllable videos in diverse game scenarios. Conclusion: Matrix-Game is a highly effective interactive world foundation model for controllable game world generation, outperforming previous models in visual quality, temporal coherence, and especially controllability. The model and evaluation benchmark will be open-sourced to support future research. Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. 
With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.
[238] Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition
Dustin Aganian,Erik Franze,Markus Eisenbach,Horst-Michael Gross
Main category: cs.CV
TL;DR: This paper proposes a novel skeleton-based action recognition method that utilizes word embeddings to encode semantic information, improving classification performance and generalization capabilities in complex interactions for Industry 4.0 cobots.
Details
Motivation: Conventional skeleton-based methods lose keypoint semantics, limiting their effectiveness in complex human-action recognition tasks. This work aims to address this limitation by incorporating semantic information into the input representation. Method: The proposed method replaces traditional one-hot encodings with semantic volumes derived from word embeddings. This allows the model to capture meaningful relationships between joints and objects during action recognition. Result: Extensive experiments on multiple assembly datasets show that the approach significantly improves classification performance and supports diverse skeleton types and object classes, enhancing generalization. Conclusion: Incorporating semantic information enhances skeleton-based action recognition, making it more effective for cobots in dynamic and diverse environments like Industry 4.0 settings. Abstract: Effective human action recognition is widely used for cobots in Industry 4.0 to assist in assembly tasks. However, conventional skeleton-based methods often lose keypoint semantics, limiting their effectiveness in complex interactions. In this work, we introduce a novel approach to skeleton-based action recognition that enriches input representations by leveraging word embeddings to encode semantic information. Our method replaces one-hot encodings with semantic volumes, enabling the model to capture meaningful relationships between joints and objects. Through extensive experiments on multiple assembly datasets, we demonstrate that our approach significantly improves classification performance, and enhances generalization capabilities by simultaneously supporting different skeleton types and object classes. Our findings highlight the potential of incorporating semantic information to enhance skeleton-based action recognition in dynamic and diverse environments.
[239] Deep CNN Face Matchers Inherently Support Revocable Biometric Templates
Aman Bhatta,Michael C. King,Kevin W. Bowyer
Main category: cs.CV
TL;DR: This paper proposes a revocable biometric system using deep CNNs, demonstrating that it is possible to generate multiple models with equivalent performance and incompatible templates, enhancing security. Vision Transformers are found to be less effective in this framework.
Details
Motivation: Biometric authentication faces criticism because if an individual's biometric data is compromised, there is typically no recourse. Revocable biometrics aim to address this issue by allowing revoked biometric templates to become worthless and enabling re-enrollment with new templates. Method: The study explores the ability of deep CNN face matchers to generate an unlimited number of distinct models with equivalent recognition power and strongly incompatible biometric templates. It also evaluates the feasibility of using a Vision Transformer (ViT) in the revocable biometric system. Result: The research shows that state-of-the-art deep CNNs can produce multiple models with equivalent recognition power and strongly incompatible templates. The cross-instance similarity scores between templates of the same person were found to be lower than same-instance scores for different persons, making revoked templates less valuable to attackers. Vision Transformers were found to be less suitable for such systems compared to ResNet-based CNNs. Conclusion: The paper concludes that modern deep CNN face matchers inherently allow for a robust revocable biometric scheme, and ResNet-based deep CNN backbones are more suitable compared to Vision Transformer (ViT) backbone-based face matchers in the proposed revocable biometric system. Abstract: One common critique of biometric authentication is that if an individual's biometric is compromised, then the individual has no recourse. The concept of revocable biometrics was developed to address this concern. A biometric scheme is revocable if an individual can have their current enrollment in the scheme revoked, so that the compromised biometric template becomes worthless, and the individual can re-enroll with a new template that has similar recognition power. We show that modern deep CNN face matchers inherently allow for a robust revocable biometric scheme. 
For a given state-of-the-art deep CNN backbone and training set, it is possible to generate an unlimited number of distinct face matcher models that have both (1) equivalent recognition power, and (2) strongly incompatible biometric templates. The equivalent recognition power extends to the point of generating impostor and genuine distributions that have the same shape and placement on the similarity dimension, meaning that the models can share a similarity threshold for a 1-in-10,000 false match rate. The biometric templates from different model instances are so strongly incompatible that the cross-instance similarity score for images of the same person is typically lower than the same-instance similarity score for images of different persons. That is, a stolen biometric template that is revoked is of less value in attempting to match the re-enrolled identity than the average impostor template. We also explore the feasibility of using a Vision Transformer (ViT) backbone-based face matcher in the revocable biometric system proposed in this work and demonstrate that it is less suitable compared to typical ResNet-based deep CNN backbones.
[240] USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways
Shanliang Yao,Runwei Guan,Yi Ni,Sen Xu,Yong Yue,Xiaohui Zhu,Ryan Wen Liu
Main category: cs.CV
TL;DR: This paper presents USVTrack, the first 4D radar-camera tracking dataset for autonomous driving in waterborne transportation systems, along with an effective radar-camera matching method (RCM) that enhances object tracking accuracy.
Details
Motivation: Object tracking in inland waterways is essential for safe and cost-effective applications like waterborne transportation, environmental monitoring, and surface rescue. This work introduces the first 4D radar-camera tracking dataset tailored for autonomous driving in waterborne environments. Method: A simple but effective radar-camera matching method (RCM) was developed and integrated into popular two-stage association trackers to improve object tracking. The Unmanned Surface Vehicle (USV) collected data using a 4D radar, monocular camera, GPS, and IMU. Result: The experimental results demonstrate the effectiveness of the RCM method in improving object tracking performance by leveraging radar-camera fusion. The USVTrack dataset includes rich scenarios across different waterways, times of day, and weather conditions. Conclusion: The paper concludes that the proposed radar-camera matching method (RCM) improves object tracking accuracy and reliability for autonomous driving in waterborne environments, and the USVTrack dataset is publicly available for further research. Abstract: Object tracking in inland waterways plays a crucial role in safe and cost-effective applications, including waterborne transportation, sightseeing tours, environmental monitoring and surface rescue. Our Unmanned Surface Vehicle (USV), equipped with a 4D radar, a monocular camera, a GPS, and an IMU, delivers robust tracking capabilities in complex waterborne environments. By leveraging these sensors, our USV collected comprehensive object tracking data, which we present as USVTrack, the first 4D radar-camera tracking dataset tailored for autonomous driving in new generation waterborne transportation systems. Our USVTrack dataset presents rich scenarios, featuring diverse waterways, varying times of day, and multiple weather and lighting conditions.
Moreover, we present a simple but effective radar-camera matching method, termed RCM, which can be plugged into popular two-stage association trackers. Experimental results utilizing RCM demonstrate the effectiveness of the radar-camera matching in improving object tracking accuracy and reliability for autonomous driving in waterborne environments. The USVTrack dataset is publicly available at https://usvtrack.github.io.
[241] SWA-SOP: Spatially-aware Window Attention for Semantic Occupancy Prediction in Autonomous Driving
Helin Cao,Rafael Materla,Sven Behnke
Main category: cs.CV
TL;DR: This paper proposes SWA, a new attention mechanism that improves semantic occupancy prediction in autonomous driving.
Details
Motivation: Because of occlusions and data sparsity, existing sensor-based perception systems fail to capture complete information, so a method is needed that infers both the occupancy and the semantics of unobserved regions. Method: The paper proposes Spatially-aware Window Attention (SWA), a mechanism that incorporates local spatial context into attention computation. Result: SWA significantly improves scene completion on both LiDAR-based and camera-based SOP, achieving state-of-the-art results on LiDAR-based benchmarks. Conclusion: SWA is an effective attention mechanism that improves semantic occupancy prediction in autonomous driving. Abstract: Perception systems in autonomous driving rely on sensors such as LiDAR and cameras to perceive the 3D environment. However, due to occlusions and data sparsity, these sensors often fail to capture complete information. Semantic Occupancy Prediction (SOP) addresses this challenge by inferring both occupancy and semantics of unobserved regions. Existing transformer-based SOP methods lack explicit modeling of spatial structure in attention computation, resulting in limited geometric awareness and poor performance in sparse or occluded areas. To this end, we propose Spatially-aware Window Attention (SWA), a novel mechanism that incorporates local spatial context into attention. SWA significantly improves scene completion and achieves state-of-the-art results on LiDAR-based SOP benchmarks. We further validate its generality by integrating SWA into a camera-based SOP pipeline, where it also yields consistent gains across modalities.
[242] 3D Arena: An Open Platform for Generative 3D Evaluation
Dylan Ebert
Main category: cs.CV
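TL;DR: This paper presents 3D Arena, a new platform for evaluating generative 3D models that uses large-scale human preference collection to assess models in a way that better matches perceived quality.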
Details
Motivation: Current evaluation of generative 3D models relies on image-based metrics that ignore 3D structure or on geometric measures that miss perceptual appeal, so it lacks an evaluation scheme aligned with human perception. Method: The authors built 3D Arena, an open platform that collects human preference data at scale through pairwise comparisons and ranks models with an ELO-based system. Result: Since launching in June 2024, the platform has collected 123,243 votes from 8,096 users, contributed the iso3d dataset of 100 evaluation prompts, and achieved 99.75% user authenticity. Gaussian splat outputs hold a 16.6 ELO advantage over meshes, and textured models a 144.1 ELO advantage over untextured ones. Conclusion: 3D Arena provides a platform for evaluating generative 3D models; by collecting and analyzing human preference data at scale, it narrows the gap between automated metrics and human-perceived quality and has become a benchmark evaluation resource for the field. Abstract: Evaluating Generative 3D models remains challenging due to misalignment between automated metrics and human perception of quality. Current benchmarks rely on image-based metrics that ignore 3D structure or geometric measures that fail to capture perceptual appeal and real-world utility. To address this gap, we present 3D Arena, an open platform for evaluating image-to-3D generation models through large-scale human preference collection using pairwise comparisons. Since launching in June 2024, the platform has collected 123,243 votes from 8,096 users across 19 state-of-the-art models, establishing the largest human preference evaluation for Generative 3D. We contribute the iso3d dataset of 100 evaluation prompts and demonstrate quality control achieving 99.75% user authenticity through statistical fraud detection. Our ELO-based ranking system provides reliable model assessment, with the platform becoming an established evaluation resource. Through analysis of this preference data, we present insights into human preference patterns. Our findings reveal preferences for visual presentation features, with Gaussian splat outputs achieving a 16.6 ELO advantage over meshes and textured models receiving a 144.1 ELO advantage over untextured models. We provide recommendations for improving evaluation methods, including multi-criteria assessment, task-oriented evaluation, and format-aware comparison.
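To make the rating gaps above easier to read, here is a minimal sketch of a textbook Elo update driven by pairwise votes. It is not the platform's actual implementation: the 400-point scale, the K-factor of 32, the starting rating of 1000, and the model names are all assumptions for illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

def update_ratings(rating_a, rating_b, a_won, k=32.0):
    """Return both ratings after one pairwise vote (k is an assumed K-factor)."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta

# Hypothetical models that both start at an assumed rating of 1000.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
ratings["model_x"], ratings["model_y"] = update_ratings(
    ratings["model_x"], ratings["model_y"], a_won=True
)

# Convert the rating gaps quoted in the abstract into expected preference rates.
textured_win_rate = expected_score(144.1, 0.0)
splat_win_rate = expected_score(16.6, 0.0)
```

If the platform uses the standard 400-point scale, a 144.1-point gap corresponds to the higher-rated side being preferred in roughly 70% of comparisons, while a 16.6-point gap corresponds to roughly 52%, only slightly better than a coin flip.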
The platform's community engagement establishes 3D Arena as a benchmark for the field while advancing understanding of human-centered evaluation in Generative 3D.
[243] Focus Your Attention: Towards Data-Intuitive Lightweight Vision Transformers
Suyash Gaurav,Muhammad Farhan Humayun,Jukka Heikkonen,Jatin Chaudhary
Main category: cs.CV
TL;DR: This paper proposes a new Vision Transformer architecture using Super-Pixel Based Patch Pooling (SPPP) and Light Latent Attention (LLA) modules to enhance computational efficiency and reduce energy consumption while maintaining performance, making it suitable for edge deployment.
Details
Motivation: Vision Transformers face significant challenges such as reliance on extensive computational and memory resources for pre-training, difficulties in task-specific transfer learning, and energy inefficiencies due to the computation-intensive self-attention mechanism. Method: We propose a novel Super-Pixel Based Patch Pooling (SPPP) technique and introduce the Light Latent Attention (LLA) module into the pipeline. These methods reduce architectural complexity and improve efficiency by generating context-aware patch embeddings and integrating latent tokens into the attention mechanism, respectively. Result: Extensive experiments demonstrate that our proposed architecture provides significant improvements in terms of computational efficiency while achieving comparable results with state-of-the-art approaches. Conclusion: The proposed architecture with the SPPP and LLA modules significantly improves computational efficiency while maintaining performance comparable to state-of-the-art approaches, making it a promising solution for energy-efficient transformers suitable for edge deployment. Abstract: The evolution of Vision Transformers has led to their widespread adaptation to different domains. Despite large-scale success, there remain significant challenges including their reliance on extensive computational and memory resources for pre-training on huge datasets as well as difficulties in task-specific transfer learning. These limitations coupled with energy inefficiencies mainly arise due to the computation-intensive self-attention mechanism. To address these issues, we propose a novel Super-Pixel Based Patch Pooling (SPPP) technique that generates context-aware, semantically rich, patch embeddings to effectively reduce the architectural complexity and improve efficiency. 
Additionally, we introduce the Light Latent Attention (LLA) module in our pipeline by integrating latent tokens into the attention mechanism allowing cross-attention operations to significantly reduce the time and space complexity of the attention module. By leveraging the data-intuitive patch embeddings coupled with dynamic positional encodings, our approach adaptively modulates the cross-attention process to focus on informative regions while maintaining the global semantic structure. This targeted attention improves training efficiency and accelerates convergence. Notably, the SPPP module is lightweight and can be easily integrated into existing transformer architectures. Extensive experiments demonstrate that our proposed architecture provides significant improvements in terms of computational efficiency while achieving comparable results with the state-of-the-art approaches, highlighting its potential for energy-efficient transformers suitable for edge deployment. (The code is available on our GitHub repository: https://github.com/zser092/Focused-Attention-ViT).
[244] ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs
Michal Nazarczuk,Sibi Catley-Chandar,Thomas Tanay,Zhensong Zhang,Gregory Slabaugh,Eduardo Pérez-Pellitero
Main category: cs.CV
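TL;DR: ViDAR is a dynamic novel view synthesis method that uses personalised diffusion models to generate pseudo multi-view supervision, and it outperforms existing methods at reconstruction under extreme viewpoint variation.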
Details
Motivation: Dynamic novel view synthesis from monocular video is challenging because disentangling structure from motion is difficult and supervision is scarce; a method that can generate pseudo multi-view supervision is needed to improve reconstruction quality. Method: The paper proposes Video Diffusion-Aware Reconstruction (ViDAR), a 4D reconstruction framework that combines personalised diffusion models with a Gaussian splatting representation, and addresses the spatio-temporal inconsistency of the supervision signal with a diffusion-aware loss function and a camera pose optimisation strategy. Result: On the DyCheck benchmark, ViDAR outperforms all state-of-the-art baselines in visual quality and geometric consistency. Conclusion: ViDAR reconstructs dynamic regions better than existing methods, and the paper provides a new benchmark for the motion-rich parts of a scene. Abstract: Dynamic Novel View Synthesis aims to generate photorealistic views of moving subjects from arbitrary viewpoints. This task is particularly challenging when relying on monocular video, where disentangling structure from motion is ill-posed and supervision is scarce. We introduce Video Diffusion-Aware Reconstruction (ViDAR), a novel 4D reconstruction framework that leverages personalised diffusion models to synthesise a pseudo multi-view supervision signal for training a Gaussian splatting representation. By conditioning on scene-specific features, ViDAR recovers fine-grained appearance details while mitigating artefacts introduced by monocular ambiguity. To address the spatio-temporal inconsistency of diffusion-based supervision, we propose a diffusion-aware loss function and a camera pose optimisation strategy that aligns synthetic views with the underlying scene geometry. Experiments on DyCheck, a challenging benchmark with extreme viewpoint variation, show that ViDAR outperforms all state-of-the-art baselines in visual quality and geometric consistency. We further highlight ViDAR's strong improvement over baselines on dynamic regions and provide a new benchmark to compare performance in reconstructing motion-rich parts of the scene. Project page: https://vidar-4d.github.io
[245] OC-SOP: Enhancing Vision-Based 3D Semantic Occupancy Prediction by Object-Centric Awareness
Helin Cao,Sven Behnke
Main category: cs.CV
TL;DR: This paper proposes Object-Centric SOP (OC-SOP), which improves semantic occupancy prediction by integrating object-centric cues, achieving state-of-the-art results on SemanticKITTI.
Details
Motivation: Conventional camera-based methods treat all categories equally and rely on local features, leading to suboptimal predictions for dynamic foreground objects in semantic occupancy prediction. Method: Object-Centric SOP (OC-SOP) integrates high-level object-centric cues extracted via a detection branch into the semantic occupancy prediction pipeline. Result: The object-centric integration in OC-SOP enhances prediction accuracy for foreground objects and achieves state-of-the-art performance across all categories on SemanticKITTI. Conclusion: The proposed Object-Centric SOP framework significantly improves semantic occupancy prediction accuracy, particularly for foreground objects, achieving state-of-the-art results on SemanticKITTI. Abstract: Autonomous driving perception faces significant challenges due to occlusions and incomplete scene data in the environment. To overcome these issues, the task of semantic occupancy prediction (SOP) is proposed, which aims to jointly infer both the geometry and semantic labels of a scene from images. However, conventional camera-based methods typically treat all categories equally and primarily rely on local features, leading to suboptimal predictions, especially for dynamic foreground objects. To address this, we propose Object-Centric SOP (OC-SOP), a framework that integrates high-level object-centric cues extracted via a detection branch into the semantic occupancy prediction pipeline. This object-centric integration significantly enhances the prediction accuracy for foreground objects and achieves state-of-the-art performance among all categories on SemanticKITTI.
[246] PicoSAM2: Low-Latency Segmentation In-Sensor for Edge Vision Applications
Pietro Bonazzi,Nicola Farronato,Stefan Zihlmann,Haotong Qi,Michele Magno
Main category: cs.CV
TL;DR: PicoSAM2 is a lightweight promptable image segmentation model optimized for edge devices, offering high efficiency and low latency.
Details
Motivation: Enable low-latency, privacy-preserving, real-time on-device segmentation for applications such as smart glasses and IoT devices. Method: Builds on a depthwise separable U-Net and learns from the Segment Anything Model 2 (SAM2) through knowledge distillation and fixed-point prompt encoding. Result: Achieves 51.9% and 44.9% mIoU on COCO and LVIS, respectively; the quantized model runs in 14.3 ms on the IMX500, reaching 86 MACs/cycle. Conclusion: PicoSAM2 delivers efficient promptable segmentation suitable for edge and in-sensor deployment within memory and compute constraints. Abstract: Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications like smart glasses and IoT devices. We introduce PicoSAM2, a lightweight (1.3M parameters, 336M MACs) promptable segmentation model optimized for edge and in-sensor execution, including the Sony IMX500. It builds on a depthwise separable U-Net, with knowledge distillation and fixed-point prompt encoding to learn from the Segment Anything Model 2 (SAM2). On COCO and LVIS, it achieves 51.9% and 44.9% mIoU, respectively. The quantized model (1.22MB) runs at 14.3 ms on the IMX500, achieving 86 MACs/cycle, making it the only model meeting both memory and compute constraints for in-sensor deployment. Distillation boosts LVIS performance by +3.5% mIoU and +5.1% mAP. These results demonstrate that efficient, promptable segmentation is feasible directly on-camera, enabling privacy-preserving vision without cloud or host processing.
[247] 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation
Chaoyang Wang,Ashkan Mirzaei,Vidit Goel,Willi Menapace,Aliaksandr Siarohin,Avalon Vinella,Michael Vasilkovsky,Ivan Skorokhodov,Vladislav Shakhrai,Sergey Korolev,Sergey Tulyakov,Peter Wonka
Main category: cs.CV
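TL;DR: This paper proposes a novel 4D video generation framework that, through a fused attention mechanism and improved 3D reconstruction algorithms, significantly improves generated video quality and reconstruction performance.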
Details
Motivation: Existing 4D video diffusion architectures are limited in how they handle spatial and temporal attention, so a more efficient single-layer fused approach is needed to improve generation quality. Method: Designs a feed-forward architecture consisting of a 4D video model and a 4D reconstruction model, introduces a fused spatio-temporal attention mechanism with a sparse attention pattern, and extends existing 3D reconstruction algorithms. Result: Establishes a new state of the art for 4D generation, improving both visual quality and reconstruction capability. Conclusion: The paper proposes a new 4D spatio-temporal video generation framework that improves both visual quality and reconstruction capability. Abstract: We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
[248] Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset
Zhuowei Chen,Bingchuan Li,Tianxiang Ma,Lijie Liu,Mingcong Liu,Yi Zhang,Gen Li,Xinghui Li,Siyu Zhou,Qian He,Xinglong Wu
Main category: cs.CV
TL;DR: Phantom-Data addresses the copy-paste issue in subject-to-video generation by offering a large-scale, cross-pair dataset that enhances model performance while preserving identity consistency.
Details
Motivation: Existing models struggle with faithfully following textual instructions due to the copy-paste problem caused by in-pair training. A new dataset is needed to decouple subject identity from background and contextual attributes. Method: Phantom-Data is created using a three-stage pipeline: subject detection, cross-context retrieval, and identity verification. It provides one million identity-consistent pairs for training. Result: Training with Phantom-Data improves prompt alignment and visual quality without compromising identity consistency compared to traditional in-pair training methods. Conclusion: Phantom-Data effectively enhances prompt alignment and visual quality while maintaining identity consistency, addressing the copy-paste problem in subject-to-video generation. Abstract: Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm. This approach inherently entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. 
Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.
[249] RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base
Kuanning Wang,Yuqian Fu,Tianyu Wang,Yanwei Fu,Longfei Liang,Yu-Gang Jiang,Xiangyang Xue
Main category: cs.CV
TL;DR: RAG-6DPose improves 6D pose estimation for robotic manipulation by integrating visual and geometric cues from a multi-modal CAD knowledge base, enhancing performance in challenging scenarios.
Details
Motivation: Accurate 6D pose estimation is crucial for robotic manipulation, particularly for tasks like grasping, where precise object localization is required. Method: The method involves three stages: building a multi-modal CAD knowledge base with 2D visual features and 3D points, retrieving relevant CAD features using the ReSPC module, and refining pose predictions through retrieval-augmented decoding. Result: Experimental results demonstrate that RAG-6DPose performs well on standard benchmarks and real-world robotic tasks, especially under challenging conditions such as occlusions and novel viewpoints. Conclusion: RAG-6DPose effectively enhances 6D pose estimation by leveraging a multi-modal CAD knowledge base, showing robustness in handling occlusions and novel viewpoints. Abstract: Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. Our RAG-6DPose roughly contains three stages: 1) Building a Multi-Modal CAD Knowledge Base by extracting 2D visual features from multi-view CAD rendered images and also attaching 3D points; 2) Retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) Incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: https://sressers.github.io/RAG-6DPose
[250] TAMMs: Temporal-Aware Multimodal Model for Satellite Image Change Understanding and Forecasting
Zhongbin Guo,Yuhao Wang,Ping Jian,Xinyue Chen,Wei Peng,Ertai E
Main category: cs.CV
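TL;DR: This work proposes TAMMs, a new temporal-aware multimodal model that addresses the challenges multimodal large language models face in satellite image time-series analysis.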
Details
Motivation: Existing multimodal large language models still struggle with satellite image time-series analysis, which demands fine-grained spatial-temporal reasoning. Method: Proposes TAMMs, a temporal-aware multimodal model for satellite image change understanding and forecasting, and introduces a Semantic-Fused Control Injection mechanism. Result: Experiments show that TAMMs outperforms strong MLLM baselines on temporal change understanding and future image forecasting. Conclusion: Through carefully designed temporal reasoning and semantic fusion, TAMMs unlocks the potential of MLLMs for spatio-temporal understanding. Abstract: Satellite image time-series analysis demands fine-grained spatial-temporal reasoning, which remains a challenge for existing multimodal large language models (MLLMs). In this work, we study the capabilities of MLLMs on a novel task that jointly targets temporal change understanding and future scene generation, aiming to assess their potential for modeling complex multimodal dynamics over time. We propose TAMMs, a Temporal-Aware Multimodal Model for satellite image change understanding and forecasting, which enhances frozen MLLMs with lightweight temporal modules for structured sequence encoding and contextual prompting. To guide future image generation, TAMMs introduces a Semantic-Fused Control Injection (SFCI) mechanism that adaptively combines high-level semantic reasoning and structural priors within an enhanced ControlNet. This dual-path conditioning enables temporally consistent and semantically grounded image synthesis. Experiments demonstrate that TAMMs outperforms strong MLLM baselines in both temporal change understanding and future image forecasting tasks, highlighting how carefully designed temporal reasoning and semantic fusion can unlock the full potential of MLLMs for spatio-temporal understanding.
[251] OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation
Qijun Gan,Ruizi Yang,Jianke Zhu,Shaofei Xue,Steven Hoi
Main category: cs.CV
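TL;DR: OmniAvatar is a new audio-driven full-body animation generation model that addresses the limitations of existing methods in natural synchronization, fluidity, and prompt control.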
Details
Motivation: Existing audio-driven human animation methods focus mainly on facial movements, struggle to produce full-body animation with natural synchronization and fluidity, and have difficulty with precise prompt control for fine-grained generation. OmniAvatar is introduced to address these challenges. Method: OmniAvatar uses a pixel-wise multi-hierarchical audio embedding strategy to capture audio features and a LoRA-based training approach to preserve the foundation model's prompt-driven control. Result: Experiments show that OmniAvatar outperforms existing models in facial and semi-body video generation and offers precise text-based control across domains such as podcasts, human interactions, dynamic scenes, and singing. Conclusion: OmniAvatar is an audio-driven full-body video generation model that improves lip-sync accuracy and natural movement; with its pixel-wise multi-hierarchical audio embedding strategy and LoRA-based training, it surpasses existing models in facial and semi-body video generation and provides precise text-based control. Abstract: Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.
[252] Let Your Video Listen to Your Music!
Xinyu Zhang,Dong Gong,Zicheng Duan,Anton van den Hengel,Lingqiao Liu
Main category: cs.CV
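TL;DR: MVAA is a new framework that automatically aligns video motion with musical beats by inserting keyframes and generating intermediate frames with a diffusion model, achieving efficient and flexible audio-visual synchronization while preserving the original visual content.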
Details
Motivation: Multimedia production has a practical need for automated tools that synchronize video with music. Method: Uses a two-step strategy: align keyframes with audio beats, then perform video inpainting with a diffusion model. Result: Experiments show that MVAA completes adaptation within 10 minutes and produces high-quality, smooth alignment results. Conclusion: MVAA automatically aligns video with musical rhythm, improving editing efficiency and flexibility. Abstract: Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting flexibility in aligning video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, we modularize the task into a two-step process in our MVAA: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video's semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within 10 minutes with one epoch on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone. 
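The first of the two steps, placing motion keyframes at timestamps aligned with musical beats, can be illustrated with a small sketch. This is not the MVAA implementation: the abstract does not say how keyframes are matched to beats, so the nearest-beat rule, the function name `snap_keyframes_to_beats`, and the example timestamps are all assumptions made for illustration. The sketch assigns each keyframe to the closest beat that has not been used yet, which keeps the keyframes in their original order.

```python
from bisect import bisect_left


def snap_keyframes_to_beats(keyframe_times, beat_times):
    """Pair each keyframe time (seconds) with a distinct beat time.

    Hypothetical nearest-beat heuristic, not the rule used by MVAA:
    each keyframe takes the closest beat that is still unassigned, so
    the output stays in temporal order and no beat is used twice.
    """
    beats = sorted(beat_times)
    aligned = []
    next_free = 0  # index of the earliest beat that may still be assigned
    for t in sorted(keyframe_times):
        if next_free >= len(beats):
            break  # more keyframes than beats: leave the rest unaligned
        # First still-free beat at or after t (may be len(beats)).
        i = bisect_left(beats, t, lo=next_free)
        if i == len(beats):
            i -= 1  # t is past the last beat: use the last free one
        elif i > next_free and t - beats[i - 1] <= beats[i] - t:
            i -= 1  # the earlier free beat is at least as close
        aligned.append((t, beats[i]))
        next_free = i + 1
    return aligned


if __name__ == "__main__":
    pairs = snap_keyframes_to_beats([0.4, 1.1, 2.6], [0.5, 1.0, 1.5, 2.0, 2.5])
    for original, target in pairs:
        print(f"keyframe at {original:.2f}s -> beat at {target:.2f}s")
```

In a pipeline of this shape, the (original, target) pairs would define where each keyframe is moved, and the frames between consecutive targets would then be produced by the inpainting model.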
Extensive experiments show that our approach can achieve high-quality beat alignment and visual smoothness.
[253] Light of Normals: Unified Feature Representation for Universal Photometric Stereo
Hong Li,Houyuan Chen,Chongjie Ye,Zhaoxi Chen,Bohan Li,Shaocong Xu,Xianda Guo,Xuhui Liu,Yikai Wang,Baochang Zhang,Satoshi Ikehata,Boxin Shi,Anyi Rao,Hao Zhao
Main category: cs.CV
TL;DR: This paper addresses key challenges in universal photometric stereo by improving the separation of lighting effects from surface geometry and enhancing the capture of fine surface details under arbitrary lighting conditions.
Details
Motivation: The motivation is to overcome the limitations in current universal photometric stereo techniques, particularly the ambiguity between lighting changes and surface orientation variations, and the difficulty in capturing fine geometric details due to complex surface interactions. Method: The paper probably introduces a novel approach or framework that specifically tackles the deep coupling between illumination and surface normal features while enhancing the capture of high-frequency geometric details in complex surfaces. Result: The results are expected to demonstrate improved performance in recovering high-quality surface normals under arbitrary lighting conditions, with enhanced accuracy in separating illumination effects from surface geometry and better preservation of intricate surface details. Conclusion: The paper likely concludes that the proposed method effectively addresses the two fundamental challenges in universal photometric stereo by decoupling illumination and surface normal features and preserving high-frequency geometric details. Abstract: Universal photometric stereo (PS) aims to recover high-quality surface normals from objects under arbitrary lighting conditions without relying on specific illumination models. Despite recent advances such as SDM-UniPS and Uni MS-PS, two fundamental challenges persist: 1) the deep coupling between varying illumination and surface normal features, where ambiguity in observed intensity makes it difficult to determine whether brightness variations stem from lighting changes or surface orientation; and 2) the preservation of high-frequency geometric details in complex surfaces, where intricate geometries create self-shadowing, inter-reflections, and subtle normal variations that conventional feature processing operations struggle to capture accurately.
[254] Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
Zeqian Li,Shangzhe Di,Zhonghua Zhai,Weilin Huang,Yanfeng Wang,Weidi Xie
Main category: cs.CV
TL;DR: This paper introduces UniTime, a universal video temporal grounding model using Multi-modal Large Language Models (MLLMs), achieving superior performance across diverse video types and improving video question-answering accuracy.
Details
Motivation: Existing methods for video temporal grounding are often limited to specific domains or durations, necessitating a more universal approach. This work aims to leverage the vision-language understanding capabilities of MLLMs for broader applicability and precision in temporal grounding. Method: The paper introduces UniTime, which steers MLLMs for temporal grounding by incorporating temporal information through interleaving timestamp tokens with video tokens. The model handles different input granularities via adaptive frame scaling to ensure robust performance for both short and long videos. Result: UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific fine-tuned settings across five public temporal grounding benchmarks. It also enhances VideoQA accuracy for complex video understanding tasks. Conclusion: UniTime is a robust and universal video grounding model that leverages generative Multi-modal Large Language Models (MLLMs) to accurately localize temporal moments in videos based on natural language queries. It significantly improves VideoQA accuracy when used as a preliminary moment retriever. Abstract: This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. 
(ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.
[255] 4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time
Ziqiao Ma,Xuweiyi Chen,Shoubin Yu,Sai Bi,Kai Zhang,Chen Ziwen,Sihan Xu,Jianing Yang,Zexiang Xu,Kalyan Sunkavalli,Mohit Bansal,Joyce Chai,Hao Tan
Main category: cs.CV
TL;DR: This paper introduces 4D-LRM, a novel approach for efficient and accurate 4D object reconstruction that outperforms prior methods in speed, quality, and generalization.
Details
Motivation: The authors aim to address the challenge of efficiently and accurately reconstructing objects in 4D (space and time) from limited views, which previous approaches struggled with in terms of efficiency, generalization, or faithfulness. Method: The paper proposes 4D-LRM, a large-scale 4D reconstruction model that learns a unified space-time representation and predicts per-pixel 4D Gaussian primitives from posed image tokens across time. Result: 4D-LRM achieves fast, high-quality rendering at, in principle, an infinite frame rate, reconstructs 24-frame sequences in under 1.5 seconds on a single A100 GPU, and generalizes well to novel objects and camera setups. Conclusion: Scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. Abstract: Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass with less than 1.5 seconds on a single A100 GPU.
[256] FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation
Kaiyi Huang,Yukun Huang,Xintao Wang,Zinan Lin,Xuefei Ning,Pengfei Wan,Di Zhang,Yu Wang,Xihui Liu
Main category: cs.CV
TL;DR: FilMaster is an advanced AI system for film generation that integrates cinematic principles and audience feedback, resulting in high-quality, engaging films.
Details
Motivation: Existing film generation systems struggle with implementing cinematic principles, resulting in low-quality films with templated visuals and unengaging narratives. The motivation is to develop a system that can produce professional-grade films using AI. Method: The study introduces FilMaster, an end-to-end system with two stages: Reference-Guided Generation Stage using a Multi-shot Synergized RAG Camera Language Design module, and Generative Post-Production Stage featuring an Audience-Centric Cinematic Rhythm Control module. It also includes the FilmEval benchmark for evaluation. Result: FilMaster demonstrates superior performance in camera language design and cinematic rhythm control, producing engaging, professional-quality films. The introduction of FilmEval provides a comprehensive benchmark for evaluating AI-generated films. Conclusion: FilMaster represents a significant advancement in AI-driven film production by integrating real-world cinematic principles and audience-centric post-production workflows, leading to professional-grade outputs. Abstract: AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system that integrates real-world cinematic principles for professional-grade film generation, yielding editable, industry-standard outputs. FilMaster is built on two key principles: (1) learning cinematography from extensive real-world film data and (2) emulating professional, audience-centric post-production workflows. 
Inspired by these principles, FilMaster incorporates two stages: a Reference-Guided Generation Stage which transforms user input to video clips, and a Generative Post-Production Stage which transforms raw footage into audiovisual outputs by orchestrating visual and auditory elements for cinematic rhythm. Our generation stage highlights a Multi-shot Synergized RAG Camera Language Design module to guide the AI in generating professional camera language by retrieving reference clips from a vast corpus of 440,000 film clips. Our post-production stage emulates professional workflows by designing an Audience-Centric Cinematic Rhythm Control module, including Rough Cut and Fine Cut processes informed by simulated audience feedback, for effective integration of audiovisual elements to achieve engaging content. The system is empowered by generative AI models like (M)LLMs and video generation models. Furthermore, we introduce FilmEval, a comprehensive benchmark for evaluating AI-generated films. Extensive experiments show FilMaster's superior performance in camera language design and cinematic rhythm control, advancing generative AI in professional filmmaking.
[257] Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models
Kiymet Akdemir,Tahira Kazimi,Pinar Yanardag
Main category: cs.CV
TL;DR: This paper proposes a collaborative multi-agent framework for improving visual consistency in multi-panel story visualization.
Details
Motivation: Current approaches often fail to preserve key character attributes, leading to incoherent narratives, so a more consistent solution is needed. Method: Proposes a collaborative multi-agent framework that autonomously identifies, corrects, and refines inconsistencies in multi-panel story visualizations, using an iterative loop to make fine-grained, panel-level updates. Result: The framework fixes details without re-generating entire sequences and works with a variety of diffusion models, including Flux and Stable Diffusion. Conclusion: The framework is model-agnostic, integrates flexibly with a variety of diffusion models, and outperforms prior approaches on multi-panel consistency in quantitative and qualitative experiments. Abstract: Story visualization has become a popular task where visual scenes are generated to depict a narrative across multiple panels. A central challenge in this setting is maintaining visual consistency, particularly in how characters and objects persist and evolve throughout the story. Despite recent advances in diffusion models, current approaches often fail to preserve key character attributes, leading to incoherent narratives. In this work, we propose a collaborative multi-agent framework that autonomously identifies, corrects, and refines inconsistencies across multi-panel story visualizations. The agents operate in an iterative loop, enabling fine-grained, panel-level updates without re-generating entire sequences. Our framework is model-agnostic and flexibly integrates with a variety of diffusion models, including rectified flow transformers such as Flux and latent diffusion models such as Stable Diffusion. Quantitative and qualitative experiments show that our method outperforms prior approaches in terms of multi-panel consistency.
[258] From Virtual Games to Real-World Play
Wenqiang Sun,Fangyun Wei,Jinjing Zhao,Xi Chen,Zilong Chen,Hongyang Zhang,Jun Zhang,Yan Lu
Main category: cs.CV
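TL;DR: RealPlay is a neural network-based real-world game engine that turns user control signals into photorealistic, temporally consistent video, with low-latency response and good cross-scenario generalization.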
Details
Motivation: Prior work focuses mainly on game-style visual generation, whereas RealPlay aims to generate high-quality video sequences that resemble real-world footage, improving user immersion and interactivity. Method: RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, and uses iterative chunk-wise prediction, temporal consistency across iterations, and accurate control response to achieve low-latency feedback and high-quality video generation. Result: RealPlay controls diverse real-world entities such as bicycles and pedestrians while maintaining temporal consistency and low-latency feedback, showing good generalization. Conclusion: RealPlay is a neural network-based real-world game engine that generates photorealistic, temporally consistent video sequences from user control signals. Abstract: We introduce RealPlay, a neural network-based real-world game engine that enables interactive video generation from user control signals. Unlike prior works focused on game-style visuals, RealPlay aims to produce photorealistic, temporally consistent video sequences that resemble real-world footage. It operates in an interactive loop: users observe a generated scene, issue a control command, and receive a short video chunk in response. To enable such realistic and responsive generation, we address key challenges including iterative chunk-wise prediction for low-latency feedback, temporal consistency across iterations, and accurate control response. RealPlay is trained on a combination of labeled game data and unlabeled real-world videos, without requiring real-world action annotations. Notably, we observe two forms of generalization: (1) control transfer: RealPlay effectively maps control signals from virtual to real-world scenarios; and (2) entity transfer: although training labels originate solely from a car racing game, RealPlay generalizes to control diverse real-world entities, including bicycles and pedestrians, beyond vehicles. Project page: https://wenqsun.github.io/RealPlay/
[259] VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
Runjia Li,Philip Torr,Andrea Vedaldi,Tomas Jakab
Main category: cs.CV
TL;DR: This paper introduces Surfel-Indexed View Memory (VMem), a novel mechanism for video generators that improves long-term scene coherence and camera control while reducing computational costs.
Details
Motivation: To overcome limitations in existing methods such as error accumulation from incremental 3D reconstruction or incoherent scenes due to short context windows in video generators. Method: The researchers introduced Surfel-Indexed View Memory (VMem), which remembers past views by geometrically indexing them based on observed 3D surface elements (surfels), enabling the retrieval of relevant views when generating new ones. Result: The proposed method produces consistent explorations of imagined environments at a lower computational cost while maintaining long-term scene coherence and better camera control compared to existing approaches. Conclusion: Surfel-Indexed View Memory (VMem) proves to be an efficient mechanism for video generators in exploring environments interactively, demonstrating superior performance in scene coherence and camera control. Abstract: We propose a novel memory mechanism to build video generators that can explore environments interactively. Similar results have previously been achieved by out-painting 2D views of the scene while incrementally reconstructing its 3D geometry, which quickly accumulates errors, or by video generators with a short context window, which struggle to maintain scene coherence over the long term. To address these limitations, we introduce Surfel-Indexed View Memory (VMem), a mechanism that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables the efficient retrieval of the most relevant past views when generating new ones. By focusing only on these relevant views, our method produces consistent explorations of imagined environments at a fraction of the computational cost of using all past views as context. 
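The retrieval idea described above, indexing past views by the surfels they observed and fetching only the most relevant ones, can be sketched as an inverted index. This is an illustrative assumption rather than the VMem implementation: the class name `SurfelViewMemory`, the integer surfel and view ids, and the vote-counting relevance score are hypothetical, and the sketch assumes the set of surfels visible from the new camera pose has already been computed elsewhere.

```python
from collections import Counter, defaultdict


class SurfelViewMemory:
    """Toy inverted index from surfel ids to the past views that saw them.

    Hypothetical sketch: relevance is the number of shared surfels,
    which is an assumption and not necessarily how VMem scores views.
    """

    def __init__(self):
        self._views_by_surfel = defaultdict(set)

    def add_view(self, view_id, observed_surfels):
        """Record that view `view_id` observed the given surfel ids."""
        for surfel in observed_surfels:
            self._views_by_surfel[surfel].add(view_id)

    def retrieve(self, visible_surfels, k=2):
        """Return up to k past views sharing the most surfels with the query."""
        votes = Counter()
        for surfel in visible_surfels:
            for view_id in self._views_by_surfel.get(surfel, ()):
                votes[view_id] += 1
        # Most votes first; ties broken by view id so the result is deterministic.
        ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
        return [view_id for view_id, _ in ranked[:k]]


if __name__ == "__main__":
    memory = SurfelViewMemory()
    memory.add_view(0, {1, 2, 3})
    memory.add_view(1, {3, 4})
    memory.add_view(2, {7, 8})
    # Surfels assumed visible from the new camera pose.
    print(memory.retrieve({2, 3, 4}, k=2))
```

Because lookup cost depends on the surfels in view rather than on the total number of stored views, an index of this kind conditions generation on a small, geometry-selected context instead of the full history.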
We evaluate our approach on challenging long-term scene synthesis benchmarks and demonstrate superior performance compared to existing methods in maintaining scene coherence and camera control.
[260] TC-Light: Temporally Consistent Relighting for Dynamic Long Videos
Yang Liu,Chuanchen Luo,Zimo Tang,Yingyan Li,Yuran Yang,Yuanyong Ning,Lue Fan,Junran Peng,Zhaoxiang Zhang
Main category: cs.CV
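TL;DR: This paper proposes TC-Light, a new video relighting framework that uses a two-stage optimization mechanism to overcome the temporal-consistency and computational-efficiency limitations of existing methods, and achieves good results on a newly built long-video benchmark.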