cs.CL

[1] An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training

Yanis Labrak,Richard Dufour,Mickaël Rouvier

Main category: cs.CL

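TL;DR: This paper studies how to optimize discrete unit representations in speech language models, finding that the best discretization strategy is closely tied to model capacity and underscoring the importance of clustering data selection for model robustness.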

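Details Motivation: To explore how to optimize speech modeling in speech language models, in particular how to improve discrete unit representations during continual pre-training. Method: The authors systematically study how model architecture, data representation, and training robustness affect the continual pre-training stage, and analyze discretization strategies across model scales by varying the speech encoder and the clustering granularity. Result: Experiments show that the speech encoder and clustering granularity significantly affect model performance, and reveal how the discrete vocabulary is effectively used and what role it plays in linguistic and paralinguistic patterns. Conclusion: Optimizing discrete unit representations can improve speech language modeling, and the discretization strategy should match the target application domain to improve robustness. Abstract: This paper investigates discrete unit representations in Speech Language Models (SLMs), focusing on optimizing speech modeling during continual pre-training. In this paper, we systematically examine how model architecture, data representation, and training robustness influence the pre-training stage in which we adapt existing pre-trained language models to the speech modality. Our experiments highlight the role of speech encoders and clustering granularity across different model scales, showing how optimal discretization strategies vary with model capacity. By examining cluster distribution and phonemic alignments, we investigate the effective use of discrete vocabulary, uncovering both linguistic and paralinguistic patterns. Additionally, we explore the impact of clustering data selection on model robustness, highlighting the importance of domain matching between discretization training and target applications.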
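As a rough illustration of what "discrete units" and "clustering granularity" refer to: a common recipe is to cluster speech-encoder frame features and replace each frame with the index of its nearest centroid, so the number of centroids sets the size of the discrete vocabulary. The sketch below assumes the centroids have already been learned; the 2-D feature vectors and the names `nearest_unit` and `discretize` are hypothetical, and the abstract does not specify the authors' actual pipeline.

```python
import math

def nearest_unit(frame, centroids):
    """Return the index of the centroid closest to `frame` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(frame, centroids[k]))

def discretize(frames, centroids):
    """Map a sequence of frame features to a sequence of discrete unit ids."""
    return [nearest_unit(frame, centroids) for frame in frames]

# Toy data (assumption): three centroids, four 2-D "encoder frames".
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.8, 0.9), (0.1, 1.8)]

units = discretize(frames, centroids)
print(units)  # [0, 1, 1, 2]
```

Using more centroids gives a finer-grained vocabulary; the abstract reports that the best granularity depends on model capacity.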

[2] Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection

Jerry Li,Evangelos Papalexakis

Main category: cs.CL

TL;DR: This paper proposes a novel hallucination detection approach using tensor decomposition and an MLP classifier, showing improved performance over traditional methods and competitive results against state-of-the-art techniques.

Details Motivation: Detecting hallucinations in Large Language Models (LLMs) is crucial for improving their trustworthiness in generating consistent, truthful information. Existing methods often lack the semantic depth necessary for effective hallucination detection. Method: A novel approach inspired by ROUGE constructs an N-Gram frequency tensor from LLM-generated text, which captures richer semantic structure. Tensor decomposition methods extract singular values used as input features to train a multi-layer perceptron (MLP) binary classifier for hallucinations. Result: The method achieves significant improvements over traditional baselines and demonstrates competitive performance against state-of-the-art LLM judges on the HaluEval dataset. Conclusion: The proposed method demonstrates significant improvements over traditional baselines and shows competitive performance against state-of-the-art LLM judges in detecting hallucinations. Abstract: Large Language Models (LLMs) have demonstrated effectiveness across a wide variety of tasks involving natural language; however, a fundamental problem of hallucinations still plagues these models, limiting their trustworthiness in generating consistent, truthful information. Detecting hallucinations has quickly become an important topic, with various methods such as uncertainty estimation, LLM Judges, retrieval augmented generation (RAG), and consistency checks showing promise. Many of these methods build upon foundational metrics, such as ROUGE, BERTScore, or Perplexity, which often lack the semantic depth necessary to detect hallucinations effectively. In this work, we propose a novel approach inspired by ROUGE that constructs an N-Gram frequency tensor from LLM-generated text. This tensor captures richer semantic structure by encoding co-occurrence patterns, enabling better differentiation between factual and hallucinated content.
We demonstrate this by applying tensor decomposition methods to extract singular values from each mode and use these as input features to train a multi-layer perceptron (MLP) binary classifier for hallucinations. Our method is evaluated on the HaluEval dataset and demonstrates significant improvements over traditional baselines, as well as competitive performance against state-of-the-art LLM judges.

[3] A Lightweight Framework for Trigger-Guided LoRA-Based Self-Adaptation in LLMs

Jiacheng Wei,Faguo Wu,Xiao Zhang

Main category: cs.CL

TL;DR: This paper proposes SAGE, a dynamic fine-tuning framework that allows large language models to adaptively update knowledge during inference, improving reasoning performance.

Details Motivation: Large language models lack the ability to continuously adapt and learn from new data during reasoning at inference time, which limits their performance on complex reasoning tasks. Method: The paper introduces SAGE, which includes a Trigger module for detecting reasoning failures, a Trigger Buffer module for clustering anomaly samples, and a Lora Store module for dynamic parameter optimization. Result: The evaluation results show that SAGE achieves excellent performance in terms of accuracy, robustness, and stability on atomic reasoning subtasks through dynamic knowledge updating during test time. Conclusion: SAGE enables adaptive updates during reasoning at inference time by decomposing complex reasoning tasks into atomic subtasks, thereby improving accuracy, robustness, and stability. Abstract: Large language models are unable to continuously adapt and learn from new data during reasoning at inference time. To address this limitation, we propose that complex reasoning tasks be decomposed into atomic subtasks and introduce SAGE, a trigger-guided dynamic fine-tuning framework that enables adaptive updates during reasoning at inference time. SAGE consists of three key components: (1) a Trigger module that detects reasoning failures through multiple evaluation metrics in real time; (2) a Trigger Buffer module that clusters anomaly samples using a streaming clustering process with HDBSCAN, followed by stability checks and similarity-based merging; and (3) a Lora Store module that dynamically optimizes parameter updates with an adapter pool for knowledge retention. Evaluation results show that SAGE demonstrates excellent accuracy, robustness, and stability on the atomic reasoning subtask through dynamic knowledge updating during test time.

[4] Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate

Andrea Wynn,Harsh Satija,Gillian Hadfield

Main category: cs.CL

TL;DR: Multi-agent debate may harm reasoning accuracy as agents prioritize agreement over challenging flawed reasoning, especially when models lack incentives or capabilities to resist incorrect persuasion.

Details Motivation: While multi-agent debate has been suggested as a way to improve AI reasoning, prior work focused only on homogeneous agent groups. This study explores the effects of model diversity and reveals potential drawbacks in current debate strategies. Method: The researchers conducted a series of experiments to analyze how diversity in model capabilities affects the dynamics and outcomes of multi-agent debates, focusing on how agents respond to peer reasoning over time. Result: The experiments showed that debate can reduce accuracy over time, even when stronger models outnumber weaker ones, as agents tend to favor agreement over correcting flawed reasoning, leading to shifts from correct to incorrect answers. Conclusion: The study concludes that multi-agent debates can lead to performance degradation when agents are not incentivized or equipped to resist incorrect reasoning, highlighting the importance of model diversity and its impact on reasoning outcomes. Abstract: While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. The prior work has exclusively focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time -- even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. 
These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivized nor adequately equipped to resist persuasive but incorrect reasoning.

[5] No Translation Needed: Forecasting Quality from Fertility and Metadata

Jessica M. Lundin,Ada Zhang,David Adelani,Cody Carroll

Main category: cs.CL

TL;DR: Translation quality can be predicted without running the translation system, using features such as token fertility ratios, token counts, and linguistic metadata.

Details Motivation: Translation quality estimation usually requires running the translation system; predicting quality from simple features offers a more efficient alternative. Method: Gradient boosting models predict translation quality from token fertility ratios, token counts, and basic linguistic metadata. Result: On the FLORES-200 benchmark, ChrF scores for GPT-4o translations are predicted well (R² of 0.66 for XX→English and 0.72 for English→XX). Conclusion: Translation quality can be predicted from a handful of features without actually running the translation system. Abstract: We show that translation quality can be predicted with surprising accuracy without ever running the translation system itself. Using only a handful of features, token fertility ratios, token counts, and basic linguistic metadata (language family, script, and region), we can forecast ChrF scores for GPT-4o translations across 203 languages in the FLORES-200 benchmark. Gradient boosting models achieve favorable performance ($R^{2}=0.66$ for XX$\rightarrow$English and $R^{2}=0.72$ for English$\rightarrow$XX). Feature importance analyses reveal that typological factors dominate predictions into English, while fertility plays a larger role for translations into diverse target languages. These findings suggest that translation quality is shaped by both token-level fertility and broader linguistic typology, offering new insights for multilingual evaluation and quality estimation.
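For readers less familiar with the reported metric, the coefficient of determination R² measures how much of the variance in the observed scores a predictor explains: R² = 1 - SS_res / SS_tot. The following is a minimal sketch of that standard definition; the ChrF values are made up for illustration and are not results from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed vs. forecast ChrF scores for four languages.
observed = [20.0, 35.0, 50.0, 65.0]
predicted = [25.0, 30.0, 55.0, 60.0]

score = r_squared(observed, predicted)
print(round(score, 3))  # 0.911
```

An R² of 0.66 or 0.72, as reported above, means the feature-based forecast accounts for roughly two thirds to three quarters of the variance in ChrF across languages.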

[6] Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too

Logan Lawrence,Ashton Williamson,Alexander Shelton

Main category: cs.CL

TL;DR: This paper proposes an automatic summary evaluation method that can assign absolute scores and performs well on several benchmarks.

Details Motivation: Existing automatic raters based on pairwise comparisons fall short at assigning absolute scores to individual summaries, an ability that is crucial for applications that require thresholding. Method: A direct-scoring method that uses synthetic summaries to act as pairwise machine rankings at test time. Result: The method performs comparably to state-of-the-art pairwise evaluators in axis-averaged sample-level correlation on the SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) meta-evaluation benchmarks. Conclusion: The proposed direct-scoring method performs well on several evaluation benchmarks, and the synthetic in-context summaries are released as data to facilitate future work. Abstract: As large language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For sample-level performance, methods which operate by using pairwise comparisons between machine-generated text perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method which uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.
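One way pairwise comparisons against a fixed set of synthetic summaries could yield an absolute, thresholdable score is to count how many of those reference summaries a candidate beats. The sketch below illustrates only that general idea: the abstract does not say how the authors aggregate comparisons, `toy_judge` is a hypothetical word-overlap stand-in for an LLM judge, and all texts are invented.

```python
def toy_judge(source, a, b):
    """Hypothetical stand-in for an LLM pairwise judge.

    Returns True if summary `a` covers more distinct source words than `b`.
    """
    src = set(source.lower().split())

    def coverage(summary):
        return len(src & set(summary.lower().split()))

    return coverage(a) > coverage(b)

def direct_score(source, candidate, anchors, judge=toy_judge):
    """Fraction of synthetic anchor summaries that the candidate beats (0.0 to 1.0)."""
    wins = sum(judge(source, candidate, anchor) for anchor in anchors)
    return wins / len(anchors)

source = "the cat sat on the warm mat near the door"
anchors = ["a dog", "the cat sat", "the cat sat on the mat near the door"]
candidate = "the cat sat on the mat"

score = direct_score(source, candidate, anchors)
print(round(score, 2))  # 0.67
```

Because the anchors stay fixed, the resulting number is comparable across candidates and can be checked against a threshold such as `score >= 0.5`.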

[7] From Staff Messages to Actionable Insights: A Multi-Stage LLM Classification Framework for Healthcare Analytics

Hajar Sakai,Yi-En Tseng,Mohammadsadegh Mikaeili,Joshua Bosire,Franziska Jovin

Main category: cs.CL

TL;DR: This paper introduces a multi-stage LLM-based framework to analyze hospital call center staff messages, achieving high classification accuracy and providing actionable insights for improving patient care and navigator training while ensuring data security and HIPAA compliance.

Details Motivation: The motivation stems from the need to efficiently process and extract insights from the large volume of text data generated by hospital call center staff messages without relying on traditional supervised learning methods that require extensive annotation and training. Method: The paper proposes a multi-stage LLM-based framework that identifies topics and classifies staff messages by their reasons using various LLM types (reasoning, general-purpose, and lightweight models). The methodology includes evaluating model performance, incorporating data security measures, and integrating outputs into a visualization decision support tool. Result: The best-performing model, o3, achieved a 78.4% weighted F1-score and 79.2% accuracy, closely followed by gpt-5 with 75.3% weighted F1-score and 76.2% accuracy. The approach successfully transforms staff messaging data into actionable insights for healthcare professionals. Conclusion: The paper concludes that using a multi-stage LLM-based framework can efficiently analyze staff messaging data in hospital call centers to provide actionable insights, improve navigator training, and enhance patient experience and care quality while maintaining data security and HIPAA compliance. Abstract: Hospital call centers serve as the primary contact point for patients within a hospital system. They also generate substantial volumes of staff messages as navigators process patient requests and communicate with the hospital offices following the established protocol restrictions and guidelines. This continuously accumulated large amount of text data can be mined and processed to retrieve insights; however, traditional supervised learning approaches require annotated data, extensive training, and model tuning. Large Language Models (LLMs) offer a paradigm shift toward more computationally efficient methodologies for healthcare analytics. 
This paper presents a multi-stage LLM-based framework that identifies staff message topics and classifies messages by their reasons in a multi-class fashion. In the process, multiple LLM types, including reasoning, general-purpose, and lightweight models, were evaluated. The best-performing model was o3, achieving 78.4% weighted F1-score and 79.2% accuracy, followed closely by gpt-5 (75.3% Weighted F1-score and 76.2% accuracy). The proposed methodology incorporates data security measures and HIPAA compliance requirements essential for healthcare environments. The processed LLM outputs are integrated into a visualization decision support tool that transforms the staff messages into actionable insights accessible to healthcare professionals. This approach enables more efficient utilization of the collected staff messaging data, identifies navigator training opportunities, and supports improved patient experience and care quality.
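The two reported metrics can differ noticeably on imbalanced multi-class data, so a short sketch of how they are computed may help. Weighted F1 averages the per-class F1 scores, weighting each class by its number of true instances. The labels below are hypothetical message reasons, not categories from the paper.

```python
from collections import Counter

def weighted_f1_and_accuracy(y_true, y_pred):
    """Return (support-weighted F1, accuracy) for a multi-class prediction."""
    support = Counter(y_true)
    total = len(y_true)
    weighted_f1 = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weighted_f1 += (n / total) * f1  # weight each class by its support
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return weighted_f1, accuracy

# Hypothetical message reasons (assumption, for illustration only).
y_true = ["billing", "billing", "billing", "refill", "refill", "scheduling"]
y_pred = ["billing", "billing", "refill", "refill", "refill", "billing"]

wf1, acc = weighted_f1_and_accuracy(y_true, y_pred)
print(round(wf1, 3), round(acc, 3))  # 0.6 0.667
```

In this toy case the rare "scheduling" class is never predicted, which pulls weighted F1 below accuracy, the same direction as the gap between the reported 78.4% and 79.2%.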

[8] The Token Tax: Systematic Bias in Multilingual Tokenization

Jessica M. Lundin,Ada Zhang,Nihal Karim,Hamza Louzan,Victor Wei,David Adelani,Cody Carroll

Main category: cs.CL

TL;DR: This paper shows that inefficient tokenization harms performance and increases costs for morphologically complex, low-resource languages in NLP. It recommends better tokenization methods, fair pricing, and multilingual benchmarks for equity in language processing.

Details Motivation: Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, increasing compute resources and reducing accuracy. This study aims to understand the impact of tokenization on model performance and propose solutions for more equitable NLP. Method: The authors evaluated 10 large language models on the AfriMMLU dataset (9,000 MCQA items; 5 subjects; 16 African languages) to analyze the relationship between tokenization efficiency (fertility) and model accuracy, and further examined the economic impact of token inflation. Result: Higher fertility (tokens per word) consistently predicts lower accuracy across all models and subjects. Reasoning models like DeepSeek and o1 outperform non-reasoning models across both high- and low-resource languages, reducing accuracy gaps. Additionally, a doubling in tokens leads to a quadrupling of training cost and time. Conclusion: The paper concludes that tokenization inefficiency negatively affects morphologically complex, low-resource languages, and highlights the need for morphologically aware tokenization, fair pricing, and multilingual benchmarks to achieve equitable NLP. Abstract: Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and depressing accuracy. We evaluate 10 large language models (LLMs) on AfriMMLU (9,000 MCQA items; 5 subjects; 16 African languages) and show that fertility (tokens/word) reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (DeepSeek, o1) consistently outperform non-reasoning peers across high and low resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. 
Finally, translating token inflation to economics, a doubling in tokens results in quadrupled training cost and time, underscoring the token tax faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
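The two quantities in this abstract are easy to make concrete. Fertility is tokens per word, and the statement that doubling tokens quadruples training cost is consistent with cost growing with the square of the token count. The sketch below assumes exactly that quadratic relation (the abstract gives no formula), and `toy_tokenize` is a hypothetical tokenizer that simply cuts words into fixed-size pieces.

```python
def toy_tokenize(text, piece=4):
    """Hypothetical tokenizer: split each word into chunks of at most `piece` characters."""
    return [w[i:i + piece] for w in text.split() for i in range(0, len(w), piece)]

def fertility(text, tokenize=toy_tokenize):
    """Average number of tokens per whitespace-separated word."""
    return len(tokenize(text)) / len(text.split())

def relative_cost(token_ratio, exponent=2.0):
    """Relative cost, ASSUMING cost is proportional to tokens ** exponent."""
    return token_ratio ** exponent

short_words = "the cat sat down"
long_words = "unbelievably complicated agglutinations"

print(fertility(short_words))           # 1.0
print(round(fertility(long_words), 2))  # 3.33
print(relative_cost(2.0))               # 4.0
```

Under this assumption, a language whose text needs twice as many tokens for the same content pays about four times the cost.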

[9] Biomedical Literature Q&A System Using Retrieval-Augmented Generation (RAG)

Mansi Garg,Lee-Chi Wang,Bhavesh Ghanchi,Sanjana Dumpala,Shreyash Kakde,Yen Chih Chen

Main category: cs.CL

TL;DR: This paper presents a biomedical literature Q&A system using a RAG architecture to improve access to medical information, showing improved performance over baseline models and potential for future applications in multilingual and personalized medical AI.

Details Motivation: To improve access to accurate, evidence-based medical information by addressing the shortcomings of conventional health search engines and the lag in public access to biomedical research. Method: The system uses a Retrieval-Augmented Generation (RAG) architecture, integrating sources like PubMed articles and medical encyclopedias. It employs MiniLM-based semantic embeddings and FAISS vector search for retrieval, while answer generation is performed using a fine-tuned Mistral-7B-v0.3 language model optimized with QLoRA. Result: The system demonstrated substantial improvements in factual consistency and semantic relevance compared to baseline models, as measured by BERTScore (F1), particularly in domain-specific tasks like breast cancer literature evaluation. Conclusion: The study concludes that RAG-enhanced language models can effectively bridge the gap between complex biomedical literature and accessible public health knowledge, suggesting future directions like multilingual adaptation and personalized medical AI systems. Abstract: This work presents a Biomedical Literature Question Answering (Q&A) system based on a Retrieval-Augmented Generation (RAG) architecture, designed to improve access to accurate, evidence-based medical information. Addressing the shortcomings of conventional health search engines and the lag in public access to biomedical research, the system integrates diverse sources, including PubMed articles, curated Q&A datasets, and medical encyclopedias, to retrieve relevant information and generate concise, context-aware responses. The retrieval pipeline uses MiniLM-based semantic embeddings and FAISS vector search, while answer generation is performed by a fine-tuned Mistral-7B-v0.3 language model optimized using QLoRA for efficient, low-resource training.
The system supports both general medical queries and domain-specific tasks, with a focused evaluation on breast cancer literature demonstrating the value of domain-aligned retrieval. Empirical results, measured using BERTScore (F1), show substantial improvements in factual consistency and semantic relevance compared to baseline models. The findings underscore the potential of RAG-enhanced language models to bridge the gap between complex biomedical literature and accessible public health knowledge, paving the way for future work on multilingual adaptation, privacy-preserving inference, and personalized medical AI systems.
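The retrieval step of such a pipeline can be sketched in a few lines: embed the query, rank stored passages by vector similarity, and pass the top hits to the generator. The code below is a brute-force cosine-similarity stand-in written in plain Python, not FAISS and not the authors' implementation; the 3-D vectors and document names are invented, whereas a real system would use MiniLM embeddings with hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
    return ranked[:k]

# Hypothetical passage embeddings (assumption, for illustration only).
docs = {
    "brca_screening": [0.9, 0.1, 0.0],
    "flu_vaccine": [0.0, 0.2, 0.9],
    "brca_treatment": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

context = top_k(query, docs, k=2)
print(context)  # ['brca_screening', 'brca_treatment']
```

The retrieved passages would then be placed in the prompt of the generator model. A vector index such as FAISS serves the same purpose as `top_k` but scales to large collections.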

[10] Using Contrastive Learning to Improve Two-Way Reasoning in Large Language Models: The Obfuscation Task as a Case Study

Serge Lionel Nikiema,Jordan Samhi,Micheline Bénédicte Moumoula,Albérick Euraste Djiré,Abdoul Kader Kaboré,Jacques Klein,Tegawendé F. Bissyandé

Main category: cs.CL

TL;DR: This research explores whether large language models understand concepts or just recognize patterns by proposing bidirectional reasoning as a test for genuine understanding and introducing Contrastive Fine-Tuning (CFT) to develop deeper understanding and reverse capabilities without explicit reverse training.

Details Motivation: The research addresses the question of whether large language models truly understand concepts or simply recognize patterns, proposing bidirectional reasoning as a test for genuine understanding. Method: The researchers tested current language models and developed Contrastive Fine-Tuning (CFT), which trains models using positive examples, negative examples, and forward-direction obfuscation examples. Result: The experiments demonstrated that CFT successfully achieved bidirectional reasoning, enabling strong reverse performance while maintaining forward task capabilities. Conclusion: The authors conclude that bidirectional reasoning serves both as a theoretical framework for assessing genuine understanding and as a practical training approach for developing more capable AI systems. Abstract: This research addresses a fundamental question in AI: whether large language models truly understand concepts or simply recognize patterns. The authors propose bidirectional reasoning, the ability to apply transformations in both directions without being explicitly trained on the reverse direction, as a test for genuine understanding. They argue that true comprehension should naturally allow reversibility. For example, a model that can change a variable name like userIndex to i should also be able to infer that i represents a user index without reverse training. The researchers tested current language models and discovered what they term cognitive specialization: when models are fine-tuned on forward tasks, their performance on those tasks improves, but their ability to reason bidirectionally becomes significantly worse. To address this issue, they developed Contrastive Fine-Tuning (CFT), which trains models using three types of examples: positive examples that maintain semantic meaning, negative examples with different semantics, and forward-direction obfuscation examples.
This approach aims to develop deeper understanding rather than surface-level pattern recognition and allows reverse capabilities to develop naturally without explicit reverse training. Their experiments demonstrated that CFT successfully achieved bidirectional reasoning, enabling strong reverse performance while maintaining forward task capabilities. The authors conclude that bidirectional reasoning serves both as a theoretical framework for assessing genuine understanding and as a practical training approach for developing more capable AI systems.

[11] Ad hoc conventions generalize to new referents

Anya Ji,Claire Augusta Bergey,Ron Eliav,Yoav Artzi,Robert D. Hawkins

Main category: cs.CL

TL;DR: This study investigates how people form shared naming systems and finds that it involves broader conceptual alignment rather than arbitrary labels.

Details Motivation: To determine whether forming a shared way of describing objects involves broader conceptual alignment or just arbitrary naming. Method: A dyadic communication study using the KiloGram dataset with over 1,000 abstract tangram images was conducted. Result: Strong evidence for generalization was found, with alignment increasing relative to pre-test labels. Generalization decayed nonlinearly with visual similarity and was robust across levels of nameability. Conclusion: Ad hoc conventions are not arbitrary labels but reflect genuine conceptual coordination. Abstract: How do people talk about things they've never talked about before? One view suggests that a new shared naming system establishes an arbitrary link to a specific target, like proper names that cannot extend beyond their bearers. An alternative view proposes that forming a shared way of describing objects involves broader conceptual alignment, reshaping each individual's semantic space in ways that should generalize to new referents. We test these competing accounts in a dyadic communication study (N=302) leveraging the recently-released KiloGram dataset containing over 1,000 abstract tangram images. After pairs of participants coordinated on referential conventions for one set of images through repeated communication, we measured the extent to which their descriptions aligned for undiscussed images. We found strong evidence for generalization: partners showed increased alignment relative to their pre-test labels. Generalization also decayed nonlinearly with visual similarity (consistent with Shepard's law) and was robust across levels of the images' nameability. These findings suggest that ad hoc conventions are not arbitrary labels but reflect genuine conceptual coordination, with implications for theories of reference and the design of more adaptive language agents.
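For readers unfamiliar with Shepard's law (the universal law of generalization): it states that the tendency to generalize from one stimulus to another falls off approximately exponentially with their distance in psychological space. In the notation below, $g$, $d$, and $\lambda$ are illustrative symbols, not taken from the paper.

```latex
g(x, y) = e^{-\lambda \, d(x, y)}, \qquad \lambda > 0
```

Here $d(x, y)$ is the psychological distance between a discussed image $x$ and an undiscussed image $y$, so generalization is strongest for visually similar images and decays nonlinearly as they become less similar.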

[12] Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation

Hongyan Xie,Yitong Yao,Yikun Ban,Zixuan Huang,Deqing Wang,Zhenhe Wu,Haoxiang Su,Chao Wang,Shuangyong Song,Xuelong Li

Main category: cs.CL

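TL;DR: This paper proposes CoPeD, a new method that improves the accuracy and robustness of small models imitating large models' reasoning chains by refining the task setting and the loss function.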

Details Motivation: When small language models (SLMs) imitate chain-of-thought (CoT) data generated by large language models (LLMs), noisy rationales can lead them to learn spurious correlations and degrade reasoning quality. CoPeD is proposed to address this problem. Method: Chain-of-Thought Correctness Perception Distillation (CoPeD), which combines a correctness-aware task setting with a weighted loss that dynamically adjusts the contribution of each training instance. Result: Experiments show that CoPeD is effective on both in-distribution and out-of-distribution reasoning benchmarks. Conclusion: Through its correctness-aware task setting and weighted loss strategy, CoPeD effectively improves the student model's performance on reasoning tasks, on both in-distribution and out-of-distribution data. Abstract: Large language models (LLMs) excel at reasoning tasks but are expensive to deploy. Thus small language models (SLMs) are fine-tuned on CoT data generated by LLMs to copy LLMs' abilities. However, these CoT data may include noisy rationales that either fail to substantiate the answers or contribute no additional information to support answer prediction, which leads SLMs to capture spurious correlations between questions and answers and compromise the quality of reasoning. In this work, we propose Chain-of-Thought Correctness Perception Distillation (CoPeD), which aims to improve the reasoning quality of the student model from the perspectives of task setting and data utilization. Firstly, we introduce a correctness-aware task setting that encourages the student model to predict answers based on correct rationales and revise them when they are incorrect. This setting improves the faithfulness of reasoning and allows the model to learn from its mistakes. Then, we propose a Correctness-Aware Weighted loss, which dynamically adjusts the contribution of each training instance based on the combined loss of the rationale and the answer. This strategy encourages the model to focus more on samples where the rationale offers stronger support for the correct answer. Experiments have shown that CoPeD is effective on both in-distribution (IND) and out-of-distribution (OOD) benchmark reasoning datasets.

[13] Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation

Qiyuan Chen,Hongsen Huang,Qian Shao,Jiahe Chen,Jintai Chen,Hongxia Xu,Renjie Hua,Ren Chuan,Jian Wu

Main category: cs.CL

TL;DR: Icon$^{2}$ improves preference dataset construction for LLMs by leveraging inherent regulation of representation space, achieving better alignment and efficiency.

Details Motivation: Conventional methods face challenges such as distribution mismatches and high computational overhead, necessitating a more efficient and tailored approach. Method: Icon$^{2}$ leverages inherent regulation of LLMs' representation space by extracting layer-wise direction vectors to encode human preferences and applying bidirectional inherent control during decoding. Result: Llama3-8B and Qwen2-7B show a 13.89% improvement on AlpacaEval 2.0 and 13.45% on Arena-Hard with up to 48.1% reduction in computational costs. Conclusion: The proposed Icon$^{2}$ method significantly improves the alignment and efficiency of preference dataset construction for Large Language Models (LLMs). Abstract: Large Language Models (LLMs) require high quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs' representation space for efficient and tailored preference dataset construction, named Icon$^{2}$. Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.

[14] Beyond Keywords: Driving Generative Search Engine Optimization with Content-Centric Agents

Qiyuan Chen,Jiahe Chen,Hongsen Huang,Qian Shao,Jintai Chen,Renjie Hua,Hongxia Xu,Ruijia Wu,Ren Chuan,Jian Wu

Main category: cs.CL

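TL;DR: This paper proposes a new framework for generative search engine optimization, comprising a multi-agent system and an evaluation benchmark, offering a new way to optimize content influence.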

Details Motivation: The rise of generative search engines has rendered traditional SEO metrics obsolete, creating an urgent need to understand and optimize how content influences generated answers. Method: The authors design a multi-agent system, construct CC-GSEO-Bench, a large-scale content-centric benchmark, and propose a multi-dimensional evaluation framework. Result: The empirical analysis reveals new insights into content influence and offers actionable strategies for content creators. Conclusion: The paper presents a comprehensive, end-to-end framework for Generative Search Engine Optimization (GSEO), laying a foundation for future GSEO research and practice. Abstract: The paradigm shift from traditional rank-based search to Generative Search Engines has rendered conventional SEO metrics obsolete, creating an urgent need to understand, measure, and optimize for content influence on synthesized answers. This paper introduces a comprehensive, end-to-end framework for Generative Search Engine Optimization (GSEO) to address this challenge. We make two primary contributions. First, we construct CC-GSEO-Bench, a large-scale, content-centric benchmark, and propose a multi-dimensional evaluation framework that systematically quantifies influence, moving beyond surface-level attribution to assess substantive semantic impact. Second, we design a novel multi-agent system that operationalizes this framework, automating the strategic refinement of content through a collaborative analyze-revise-evaluate workflow. Our empirical analysis using this framework reveals novel insights into the dynamics of content influence, offering actionable strategies for creators and establishing a principled foundation for future GSEO research.

[15] New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR

Xugang Lu,Peng Shen,Yu Tsao,Hisashi Kawai

Main category: cs.CL

TL;DR: This paper proposes an alignment model using unbalanced optimal transport to address structural asymmetry and improve ASR performance by ensuring precise and comprehensive matching between acoustic and linguistic representations.

Details Motivation: Aligning acoustic and linguistic representations is crucial for knowledge transfer in ASR, with challenges arising from structural asymmetry and imbalanced matching conditions. Method: An unbalanced optimal transport-based alignment model is proposed, treating alignment as a detection problem to identify meaningful correspondences with high precision and recall. Result: The method demonstrates effectiveness in flexibly controlling the degree of matching, thereby improving ASR performance when evaluated on a CTC-based system with a pre-trained language model. Conclusion: The proposed unbalanced optimal transport-based alignment model effectively addresses the alignment challenge in ASR by ensuring every linguistic token is grounded in acoustic observations while allowing flexible mappings. Abstract: Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we regard alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames when transferring linguistic knowledge for ASR. Based on this insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. 
Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate our proposed model with experiments on a CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling the degree of matching and hence improving ASR performance.

[16] From Joy to Fear: A Benchmark of Emotion Estimation in Pop Song Lyrics

Shay Dahary,Avi Edana,Alexander Apartsin,Yehudit Aperstein

Main category: cs.CL

TL;DR: This paper explores the use of zero-shot and fine-tuned language models for multi-label emotional attribution in song lyrics, highlighting their strengths and limitations.

Details Motivation: The emotional content of song lyrics significantly influences listener experiences and musical preferences, yet accurately attributing emotions to lyrics remains a challenging task. Method: The paper constructs a manually labeled dataset using a mean opinion score (MOS) approach and evaluates several publicly available large language models (LLMs) in zero-shot scenarios. A BERT-based model is also fine-tuned for predicting multi-label emotion scores. Result: The experimental results demonstrate the effectiveness of both zero-shot and fine-tuned models in capturing emotional nuances in lyrics, offering insights into model selection for emotion-based music information retrieval applications. Conclusion: The study concludes that both zero-shot and fine-tuned models have their strengths and limitations in recognizing nuanced emotions in song lyrics, with large language models showing potential for emotion recognition in creative texts. Abstract: The emotional content of song lyrics plays a pivotal role in shaping listener experiences and influencing musical preferences. This paper investigates the task of multi-label emotional attribution of song lyrics by predicting six emotional intensity scores corresponding to six fundamental emotions. A manually labeled dataset is constructed using a mean opinion score (MOS) approach, which aggregates annotations from multiple human raters to ensure reliable ground-truth labels. Leveraging this dataset, we conduct a comprehensive evaluation of several publicly available large language models (LLMs) under zero-shot scenarios. Additionally, we fine-tune a BERT-based model specifically for predicting multi-label emotion scores. Experimental results reveal the relative strengths and limitations of zero-shot and fine-tuned models in capturing the nuanced emotional content of lyrics. 
Our findings highlight the potential of LLMs for emotion recognition in creative texts, providing insights into model selection strategies for emotion-based music information retrieval applications. The labeled dataset is available at https://github.com/LLM-HITCS25S/LyricsEmotionAttribution.
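
The mean opinion score (MOS) aggregation described in this abstract can be illustrated with a minimal sketch. It assumes that each rater gives one numeric intensity per emotion and that the ground truth is the plain average across raters. The emotion names, the rating scale and the function name `mean_opinion_scores` are hypothetical; the abstract only states that six fundamental emotions are scored.

```python
from statistics import mean

# Hypothetical label set: the abstract says "six fundamental emotions"
# but does not list them.
EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]


def mean_opinion_scores(ratings):
    """Average per-emotion intensity scores across raters.

    ratings: list of dicts, one per rater, mapping emotion -> numeric score.
    Returns a dict mapping emotion -> mean score over all raters.
    """
    return {emotion: mean(r[emotion] for r in ratings) for emotion in EMOTIONS}


# Three hypothetical raters scoring one lyric on an assumed 0-5 scale.
ratings = [
    {"joy": 4, "sadness": 1, "anger": 0, "fear": 0, "surprise": 2, "disgust": 0},
    {"joy": 5, "sadness": 0, "anger": 0, "fear": 1, "surprise": 2, "disgust": 0},
    {"joy": 3, "sadness": 2, "anger": 0, "fear": 2, "surprise": 2, "disgust": 0},
]
mos = mean_opinion_scores(ratings)
```

Averaging is only one possible aggregation; the dataset may use a different scale or weighting.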

[17] Few-Shot Query Intent Detection via Relation-Aware Prompt Learning

Liang Zhang,Yuan Li,Shijie Zhang,Zheng Zhang,Xitong Li

Main category: cs.CL

TL;DR: This paper proposes SAID, a framework that improves intent detection by integrating textual and relational structure information, together with the QueryAdapt mechanism for finer-grained knowledge transfer.

Details Motivation: Existing methods focus mainly on textual data and neglect important structural information in conversational systems, such as query-query and query-answer relations. Method: The paper proposes the SAID framework, which integrates textual and relational structure information during pretraining, and the QueryAdapt mechanism, which generates intent-specific relation tokens for finer-grained knowledge transfer. Result: Experiments on two real-world datasets show that SAID significantly outperforms state-of-the-art methods. Conclusion: By integrating textual and relational structure information, SAID significantly outperforms existing methods on intent detection. Abstract: Intent detection is a crucial component of modern conversational systems, since accurately identifying user intent at the beginning of a conversation is essential for generating effective responses. Recent efforts have focused on studying this problem under a challenging few-shot scenario. These approaches primarily leverage large-scale unlabeled dialogue text corpora to pretrain language models through various pretext tasks, followed by fine-tuning for intent detection with very limited annotations. Despite the improvements achieved, existing methods have predominantly focused on textual data, neglecting to effectively capture the crucial structural information inherent in conversational systems, such as the query-query relation and query-answer relation. To address this gap, we propose SAID, a novel framework that integrates both textual and relational structure information in a unified manner for model pretraining for the first time. Building on this framework, we further propose a novel mechanism, the query-adaptive attention network (QueryAdapt), which operates at the relation token level by generating intent-specific relation tokens from well-learned query-query and query-answer relations explicitly, enabling more fine-grained knowledge transfer. Extensive experimental results on two real-world datasets demonstrate that SAID significantly outperforms state-of-the-art methods.

[18] LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding

Yuxuan Hu,Jihao Liu,Ke Wang,Jinliang Zhen,Weikang Shi,Manyuan Zhang,Qi Dou,Rui Liu,Aojun Zhou,Hongsheng Li

Main category: cs.CL

TL;DR: This paper proposes LM-Searcher, a new approach to cross-domain neural architecture optimization that needs no extensive domain-specific adaptation; its core is the NCode representation and a reformulation of NAS as a ranking task.

Details Motivation: Existing LLM-driven NAS methods rely heavily on prompt engineering and domain-specific tuning, which limits their practicality and scalability across tasks. Method: The paper proposes LM-Searcher, a framework built around NCode, a universal numerical string representation of architectures, and a reformulation of NAS as a ranking task. Result: LM-Searcher is competitive on both in-domain tasks (e.g., CNNs for image classification) and out-of-domain tasks (e.g., LoRA configurations for segmentation and generation). Conclusion: LM-Searcher achieves cross-domain neural architecture optimization without extensive domain-specific adaptation, offering a new paradigm for LLM-driven architecture search. Abstract: Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at https://github.com/Ashone3/LM-Searcher.

[19] Cross-Question Method Reuse in Large Language Models: From Word-Level Prediction to Rational Logical-Layer Reasoning

Hong Su

Main category: cs.CL

TL;DR: This paper proposes a method to enhance cross-question solution reuse in LLMs by focusing on solution adaptation, enabling reuse even when questions have low or hidden similarity.

Details Motivation: The motivation is to extend the scope of method reuse in LLMs to address questions with low similarity or hidden similarities that are not explicitly observable. Method: The method involves separating the question and solution, guiding the LLM to adapt solutions to new but related questions, and extending the approach to cases with partial or hidden characteristics. Result: Experimental verification shows that the scope-extension approach increases the probability of filtering out reusable solutions, improving the effectiveness of cross-question method reuse. Conclusion: The proposed scope-extension approach enables cross-question method reuse beyond conventional similarity constraints by focusing on solution transfer rather than question recognition. Abstract: Large language models (LLMs) have been widely applied to assist in finding solutions for diverse questions. Prior work has proposed representing a method as a pair of a question and its corresponding solution, enabling method reuse. However, existing approaches typically require the questions to be highly similar. In this paper, we extend the scope of method reuse to address questions with low similarity or with hidden similarities that are not explicitly observable. For questions that are similar in a general-specific sense (i.e., broader or narrower in scope), we propose to first separate the question and solution, rather than directly feeding the pair to the LLM. The LLM is then guided to adapt the solution to new but related questions, allowing it to focus on solution transfer rather than question recognition. Furthermore, we extend this approach to cases where questions only share partial features or hidden characteristics. This enables cross-question method reuse beyond conventional similarity constraints. 
Experimental verification shows that our scope-extension approach increases the probability of filtering out reusable solutions, thereby improving the effectiveness of cross-question method reuse.

[20] Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Michael Hoffmann,Jophin John,Stefan Schweter,Gokul Ramakrishnan,Hoi-Fong Mak,Alice Zhang,Dmitry Gaynullin,Nicolay J. Hammer

Main category: cs.CL

TL;DR: Llama-GENBA-10B is a trilingual model designed to reduce English dominance in large language models, achieving strong performance across English, German, and Bavarian while promoting inclusivity for low-resource languages.

Details Motivation: To tackle English-centric bias in large language models and promote inclusivity by incorporating low-resource languages like Bavarian. Method: Llama-GENBA-10B was built on Llama 3.1-8B and scaled to 10B parameters, pretrained on a balanced multilingual dataset of 164B tokens, including English, German, and Bavarian. A unified tokenizer and optimized architecture were developed, along with a standardized trilingual evaluation suite. Result: Llama-GENBA-10B demonstrated strong cross-lingual performance, surpassing other models in Bavarian and outperforming EuroLLM in English while matching results in German. Conclusion: Llama-GENBA-10B is a trilingual foundation model that effectively addresses English-centric bias in large language models, promoting inclusivity for low-resource languages like Bavarian. Abstract: We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. 
Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.

[21] Revealing the Numeracy Gap: An Empirical Investigation of Text Embedding Models

Ningyuan Deng,Hanyu Duan,Yixuan Tang,Yi Yang

Main category: cs.CL

TL;DR: The study shows that current text embedding models struggle to capture fine-grained numerical information in text and need further improvement.

Details Motivation: Text embedding models are widely used in natural language processing, but their ability to handle nuanced numerical information in text remains unclear, particularly in domains such as finance and healthcare where small numerical differences can matter. Method: The authors evaluate 13 widely used text embedding models on synthetic data in a financial context. Result: The models generally struggle to capture numerical details accurately, which poses a challenge for NLP systems that depend on precise numerical understanding. Conclusion: Current text embedding models have difficulty precisely encoding numerical information in text, which future research needs to address. Abstract: Text embedding models are widely used in natural language processing applications. However, their capability is often benchmarked on tasks that do not require understanding nuanced numerical information in text. As a result, it remains unclear whether current embedding models can precisely encode numerical content, such as numbers, into embeddings. This question is critical because embedding models are increasingly applied in domains where numbers matter, such as finance and healthcare. For example, "Company X's market share grew by 2%" should be interpreted very differently from "Company X's market share grew by 20%", even though both indicate growth in market share. This study aims to examine whether text embedding models can capture such nuances. Using synthetic data in a financial context, we evaluate 13 widely used text embedding models and find that they generally struggle to capture numerical details accurately. Our further analyses provide deeper insights into embedding numeracy, informing future research to strengthen embedding model-based NLP systems with improved capacity for handling numerical content.

[22] A Survey of the State-of-the-Art in Conversational Question Answering Systems

Manoj Madushanka Perera,Adnan Mahmood,Kasun Eranda Wijethilake,Fahmida Islam,Maryam Tahermazandarani,Quan Z. Sheng

Main category: cs.CL

TL;DR: This paper surveys the state of the art in conversational question answering systems, analyzes their core techniques, and discusses future research directions.

Details Motivation: Conversational question answering systems play an important role in many domains, which calls for a systematic review and analysis of their development. Method: The paper is a literature survey covering multiple aspects of ConvQA systems, including their core components and advanced machine learning techniques. Result: The survey examines key ConvQA techniques, the role of large language models, and relevant datasets, and identifies future research directions. Conclusion: The paper summarizes the current state of ConvQA systems and discusses their key techniques and future research directions. Abstract: Conversational Question Answering (ConvQA) systems have emerged as a pivotal area within Natural Language Processing (NLP) by driving advancements that enable machines to engage in dynamic and context-aware conversations. These capabilities are increasingly being applied across various domains, such as customer support, education, legal, and healthcare, where maintaining a coherent and relevant conversation is essential. Building on recent advancements, this survey provides a comprehensive analysis of the state-of-the-art in ConvQA. This survey begins by examining the core components of ConvQA systems, i.e., history selection, question understanding, and answer prediction, highlighting their interplay in ensuring coherence and relevance in multi-turn conversations. It further investigates the use of advanced machine learning techniques, including but not limited to reinforcement learning, contrastive learning, and transfer learning, to improve ConvQA accuracy and efficiency. The pivotal role of large language models, such as RoBERTa, GPT-4, Gemini 2.0 Flash, Mistral 7B, and LLaMA 3, is also explored, thereby showcasing their impact through data scalability and architectural advancements. Additionally, this survey presents a comprehensive analysis of key ConvQA datasets and concludes by outlining open research directions. Overall, this work offers a comprehensive overview of the ConvQA landscape and provides valuable insights to guide future advancements in the field.

[23] Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models

Donya Rooein,Flor Miriam Plaza-del-Arco,Debora Nozza,Dirk Hovy

Main category: cs.CL

TL;DR: This paper highlights that the current state of Farsi NLP is hindered by poor data quality and lack of demographic context, despite having a large speaker base and growing data availability.

Details Motivation: Despite Farsi being considered a middle-resource language with over 127 million speakers and increasing digital text availability, there is a lack of reliable datasets and consistent results in NLP tasks, which the study aims to investigate. Method: The researchers analyzed 110 publications related to subjective tasks in Farsi, focusing on Sentiment Analysis, Emotion Analysis, and Toxicity Detection. They evaluated data availability, quality, and the impact of demographic factors on modeling subjectivity. Result: The study found significant challenges in data availability and quality for Farsi NLP tasks. Existing datasets often lack important demographic details, and model predictions were found to be highly unstable across datasets and models. Conclusion: The study concludes that merely having a substantial volume of data is insufficient to enhance a language's prospects in NLP, as other factors such as data quality and demographic information are crucial. Abstract: Given Farsi's speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and find significant challenges in data availability and quality, despite the overall increase in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. 
Our findings indicate that the volume of data is insufficient to significantly improve a language's prospects in NLP.

[24] QCSE: A Pretrained Quantum Context-Sensitive Word Embedding for Natural Language Processing

Charles M. Varmantchaonala,Niclas Götting,Nils-Erik Schütte,Jean Louis E. K. Fendji,Christopher Gies

Main category: cs.CL

TL;DR: This paper introduces QCSE, a quantum context-sensitive embedding model that leverages quantum computation to capture contextual relationships in languages. It demonstrates effectiveness in both low-resource (Fulani) and slightly larger (English) datasets, highlighting the potential of Quantum NLP in real-world applications.

Details Motivation: The motivation behind this paper is to explore the potential of quantum computation in natural language processing (NLP), particularly in capturing context-sensitive word embeddings, which can address challenges in linguistic tasks. The focus on Fulani highlights the importance of developing NLP solutions for low-resource languages where data scarcity is a significant problem. Method: The paper introduces a pretrained quantum context-sensitive embedding model named QCSE that uses quantum-native context learning and innovative context matrix computation methods to capture contextual relationships in languages. Five distinct methods for computing the context matrices are proposed, including exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations. The model is evaluated on a Fulani corpus (a low-resource African language) and an English corpus. Result: The results show that QCSE successfully captures context sensitivity and utilizes the expressibility of quantum systems to represent rich, context-aware language information. The model's performance on both the Fulani and English corpora demonstrates its effectiveness in handling linguistic tasks, particularly in low-resource language scenarios. Conclusion: This paper concludes that QCSE, a pretrained quantum context-sensitive embedding model, effectively captures context sensitivity and leverages the expressibility of quantum systems for representing context-aware language information, highlighting the potential of Quantum Natural Language Processing (QNLP) in addressing linguistic challenges, especially for low-resource languages. Abstract: Quantum Natural Language Processing (QNLP) offers a novel approach to encoding and understanding the complexity of natural languages through the power of quantum computation. 
This paper presents a pretrained quantum context-sensitive embedding model, called QCSE, that captures context-sensitive word embeddings, leveraging the unique properties of quantum systems to learn contextual relationships in languages. The model introduces quantum-native context learning, enabling the utilization of quantum computers for linguistic tasks. Central to the proposed approach are innovative context matrix computation methods, designed to create unique representations of words based on their surrounding linguistic context. Five distinct methods are proposed and tested for computing the context matrices, incorporating techniques such as exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations. These methods ensure that the quantum embeddings retain context sensitivity, thereby making them suitable for downstream language tasks where the expressibility and properties of quantum systems are valuable resources. To evaluate the effectiveness of the model and the associated context matrix methods, evaluations are conducted on both a small corpus of Fulani, a low-resource African language, and a slightly larger English corpus. The results demonstrate that QCSE not only captures context sensitivity but also leverages the expressibility of quantum systems for representing rich, context-aware language information. The use of Fulani further highlights the potential of QNLP to mitigate the problem of data scarcity for this category of languages. This work underscores the power of quantum computation in natural language processing (NLP) and opens new avenues for applying QNLP to real-world linguistic challenges across various tasks and domains.

[25] Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification

Fernando Gabriela García,Qiyang Shi,Zilin Feng

Main category: cs.CL

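TL;DR: VeriFact-CoT is a new method that improves the accuracy and trustworthiness of fact-sensitive content generated by large language models through fact verification and citation integration.

Details Motivation: To address hallucination and the lack of credible citation sources in content generated by large language models. Method: A multi-stage "fact verification-reflection-citation integration" mechanism lets the model critically examine and revise its intermediate reasoning steps and final answers. Result: VeriFact-CoT is reported to enhance the objective accuracy, trustworthiness, and traceability of generated outputs. Conclusion: VeriFact-CoT improves the accuracy, trustworthiness, and traceability of large language models when generating complex, fact-sensitive content, making them better suited to demanding domains such as science, news reporting, and law. Abstract: This research introduces VeriFact-CoT (Verified Factual Chain-of-Thought), a novel method designed to address the pervasive issues of hallucination and the absence of credible citation sources in Large Language Models (LLMs) when generating complex, fact-sensitive content. By incorporating a multi-stage mechanism of 'fact verification-reflection-citation integration,' VeriFact-CoT empowers LLMs to critically self-examine and revise their intermediate reasoning steps and final answers. This process significantly enhances the objective accuracy, trustworthiness, and traceability of the generated outputs, making LLMs more reliable for applications demanding high fidelity such as scientific research, news reporting, and legal consultation.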

[26] LatinX: Aligning a Multilingual TTS Model with Direct Preference Optimization

Luis Felipe Chary,Miguel Arjona Ramirez

Main category: cs.CL

TL;DR: LatinX is a multilingual text-to-speech model that preserves speaker identity across languages, outperforming existing models in both objective metrics and human evaluations.

Details Motivation: To develop a TTS model that maintains the speaker's voice identity across different languages for improved speech-to-speech translation. Method: LatinX uses a 12-layer decoder-only Transformer with three training stages: pre-training, supervised fine-tuning, and alignment using DPO with WER and speaker-similarity metrics. Result: LatinX reduced WER and improved objective similarity compared to the baseline, with further improvements in perceived speaker similarity in human evaluations. Conclusion: LatinX, a multilingual TTS model, effectively preserves speaker identity across languages and outperforms baselines in human evaluations. Abstract: We present LatinX, a multilingual text-to-speech (TTS) model for cascaded speech-to-speech translation that preserves the source speaker's identity across languages. LatinX is a 12-layer decoder-only Transformer trained in three stages: (i) pre-training for text-to-audio mapping, (ii) supervised fine-tuning for zero-shot voice cloning, and (iii) alignment with Direct Preference Optimization (DPO) using automatically labeled pairs based on Word Error Rate (WER) and speaker-similarity metrics. Trained on English and Romance languages with emphasis on Portuguese, LatinX with DPO consistently reduces WER and improves objective similarity over the fine-tuned baseline. Human evaluations further indicate stronger perceived speaker similarity than a strong baseline (XTTSv2), revealing gaps between objective and subjective measures. We provide cross-lingual analyses and discuss balanced preference signals and lower-latency architectures as future work.
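
Word Error Rate, one of the two signals the abstract says is used to label preference pairs, can be sketched as a word-level edit distance divided by the reference length. The pair-labelling helper `label_pair` below is a hypothetical illustration that uses WER only. The abstract does not describe how LatinX combines WER with speaker similarity, so this is not the authors' procedure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def label_pair(reference, transcript_a, transcript_b):
    """Hypothetical labelling rule: the lower-WER transcript is 'chosen'."""
    wer_a = word_error_rate(reference, transcript_a)
    wer_b = word_error_rate(reference, transcript_b)
    if wer_a <= wer_b:
        return transcript_a, transcript_b
    return transcript_b, transcript_a


reference = "the cat sat on the mat"
chosen, rejected = label_pair(reference, "the cat sit on mat", "the cat sat on the mat")
```

In practice the transcripts would come from running an ASR system on the synthesized audio, and a real labelling rule would presumably also weigh speaker similarity.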

[27] ZhiFangDanTai: Fine-tuning Graph-based Retrieval-Augmented Generation Model for Traditional Chinese Medicine Formula

ZiXuan Zhang,Bowen Hao,Yingjie Li,Hongzhi Yin

Main category: cs.CL

TL;DR: ZhiFangDanTai improves TCM formula generation by combining GraphRAG and LLM fine-tuning, offering better performance and more accurate, explainable results.

Details Motivation: Existing TCM models lack comprehensive results and detailed explanations due to insufficient datasets. This work aims to enhance formula generation by integrating structured knowledge and improving LLM capabilities. Method: ZhiFangDanTai combines Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning. GraphRAG synthesizes structured TCM knowledge, and an enhanced instruction dataset improves the model's ability to utilize retrieved information. Result: Experiments show that ZhiFangDanTai outperforms state-of-the-art models on both collected and clinical datasets, with theoretical proofs supporting its effectiveness in reducing errors and hallucinations. Conclusion: The proposed ZhiFangDanTai framework effectively improves the generation of TCM formulas by integrating GraphRAG with LLM fine-tuning, offering reduced generalization error and hallucination rates while achieving state-of-the-art performance. Abstract: Traditional Chinese Medicine (TCM) formulas play a significant role in treating epidemics and complex diseases. Existing models for TCM utilize traditional algorithms or deep learning techniques to analyze formula relationships, yet lack comprehensive results, such as complete formula compositions and detailed explanations. Although recent efforts have used TCM instruction datasets to fine-tune Large Language Models (LLMs) for explainable formula generation, existing datasets lack sufficient details, such as the roles of the formula's sovereign, minister, assistant, courier; efficacy; contraindications; tongue and pulse diagnosis, limiting the depth of model outputs. To address these challenges, we propose ZhiFangDanTai, a framework combining Graph-based Retrieval-Augmented Generation (GraphRAG) with LLM fine-tuning. 
ZhiFangDanTai uses GraphRAG to retrieve and synthesize structured TCM knowledge into concise summaries, while also constructing an enhanced instruction dataset to improve LLMs' ability to integrate retrieved information. Furthermore, we provide novel theoretical proofs demonstrating that integrating GraphRAG with fine-tuning techniques can reduce generalization error and hallucination rates in the TCM formula task. Experimental results on both collected and clinical datasets demonstrate that ZhiFangDanTai achieves significant improvements over state-of-the-art models. Our model is open-sourced at https://huggingface.co/tczzx6/ZhiFangDanTai1.0.

[28] MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries

François Grolleau,Emily Alsentzer,Timothy Keyes,Philip Chung,Akshay Swaminathan,Asad Aali,Jason Hom,Tridu Huynh,Thomas Lew,April S. Liang,Weihan Chu,Natasha Z. Steele,Christina F. Lin,Jingkun Yang,Kameron C. Black,Stephen P. Ma,Fateme N. Haredasht,Nigam H. Shah,Kevin Schulman,Jonathan H. Chen

Main category: cs.CL

TL;DR: MedFactEval and MedAgentBrief offer a scalable and accurate approach to evaluating and generating clinical text.

Details Motivation: Evaluating the factual accuracy of LLM-generated clinical text is a barrier to adoption, since traditional expert review cannot scale to continuous quality assurance. Method: The paper introduces MedFactEval, a framework that uses a multi-LLM majority vote to assess whether generated summaries include clinician-defined key facts, and MedAgentBrief, a multi-step workflow for generating high-quality summaries. Result: The MedFactEval LLM Jury agreed closely with a seven-physician majority-vote reference standard (Cohen's kappa=81%) and was statistically non-inferior to a single human expert (kappa=67%). Conclusion: The study provides an efficient evaluation framework and a generation workflow that support the responsible deployment of generative AI in clinical settings. Abstract: Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
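
The two statistics this entry relies on, a majority vote across jurors and Cohen's kappa between two sets of labels, can be sketched as follows. The vote data, the binary "key fact included" labels and the function names are hypothetical, and the sketch makes no claim about how MedFactEval prompts or aggregates its LLM jurors.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label; use an odd number of voters to avoid ties."""
    return Counter(votes).most_common(1)[0][0]


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if the two raters labelled independently.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical votes from three LLM jurors on four key facts
# (True = the fact is judged to be included in the summary).
jury_votes = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [False, True, False],
]
jury_labels = [majority_vote(votes) for votes in jury_votes]
# Hypothetical reference labels from a physician panel.
panel_labels = [True, False, True, True]
kappa = cohens_kappa(jury_labels, panel_labels)
```

With these made-up labels the jury and panel agree on three of four facts, and kappa discounts the agreement expected by chance.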

[29] Let's Roleplay: Examining LLM Alignment in Collaborative Dialogues

Abhijnan Nath,Carine Graff,Nikhil Krishnaswamy

Main category: cs.CL

TL;DR: This paper explores how friction-aware alignment methods improve the effectiveness of Large Language Models (LLMs) as collaborators in multiturn, multiparty interactions, showing that these methods outperform traditional approaches in achieving group consensus and accurate outcomes.

Details Motivation: The motivation stems from the increasing integration of Large Language Models (LLMs) into diverse workflows as 'collaborators' with humans. The paper addresses the need for reliable and predictable LLM behavior in long-horizon, multiparty interactions, which current alignment techniques typically do not account for. Method: The study uses a roleplay methodology to evaluate interventions from differently-trained friction agents in collaborative task conversations. It also proposes a novel counterfactual evaluation framework to quantify how friction interventions affect the trajectory of group collaboration and belief alignment. Result: The results demonstrate that a friction-aware approach outperforms traditional alignment methods in fostering group convergence on task-relevant propositions and achieving more accurate task outcomes. Conclusion: The paper concludes that friction-aware approaches significantly improve the effectiveness of LLM agents as collaborators in multiturn, multiparty interactions by enhancing convergence to a common ground and improving the correctness of task outcomes compared to common alignment baselines. Abstract: As Large Language Models (LLMs) integrate into diverse workflows, they are increasingly being considered "collaborators" with humans. If such AI collaborators are to be reliable, their behavior over multiturn interactions must be predictable, validated and verified before deployment. Common alignment techniques are typically developed under simplified single-user settings and do not account for the dynamics of long-horizon multiparty interactions. This paper examines how different alignment methods affect LLM agents' effectiveness as partners in multiturn, multiparty collaborations. We study this question through the lens of friction agents that intervene in group dialogues to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision-making. 
Using a roleplay methodology, we evaluate interventions from differently-trained friction agents in collaborative task conversations. We propose a novel counterfactual evaluation framework that quantifies how friction interventions change the trajectory of group collaboration and belief alignment. Our results show that a friction-aware approach significantly outperforms common alignment baselines in helping both convergence to a common ground, or agreed-upon task-relevant propositions, and correctness of task outcomes.

[30] Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling

Yue Gu,Zhihao Du,Ying Shi,Shiliang Zhang,Qian Chen,Jiqing Han

Main category: cs.CL

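TL;DR: This paper proposes PSC-Joint, a method that addresses the sensitivity of cross-attention to varying biasing information volume when recognizing personalized biasing phrases, and validates it experimentally.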

Details Motivation: Cross-attention has advanced the recognition of personalized biasing phrases, but its effectiveness is affected by variations in biasing information volume, especially when the biasing list grows significantly. Method: Proposes Purified Semantic Correlation Joint Modeling (PSC-Joint), which defines and calculates three semantic correlations from coarse to fine and jointly models them to highlight the most relevant biasing information. Result: PSC-Joint achieves average relative F1 score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech. Conclusion: PSC-Joint improves the recognition of personalized biasing phrases on both AISHELL-1 and KeSpeech, while a purification mechanism limits the computational cost added by joint modeling. Abstract: Recently, cross-attention-based contextual automatic speech recognition (ASR) models have made notable advancements in recognizing personalized biasing phrases. However, the effectiveness of cross-attention is affected by variations in biasing information volume, especially when the length of the biasing list increases significantly. We find that, regardless of the length of the biasing list, only a limited amount of biasing information is most relevant to a specific ASR intermediate representation. Therefore, by identifying and integrating the most relevant biasing information rather than the entire biasing list, we can alleviate the effects of variations in biasing information volume for contextual ASR. To this end, we propose a purified semantic correlation joint modeling (PSC-Joint) approach. In PSC-Joint, we define and calculate three semantic correlations between the ASR intermediate representations and biasing information from coarse to fine: list-level, phrase-level, and token-level. Then, the three correlations are jointly modeled to produce their intersection, so that the most relevant biasing information across various granularities is highlighted and integrated for contextual recognition. In addition, to reduce the computational cost introduced by the joint modeling of three semantic correlations, we also propose a purification mechanism based on a grouped-and-competitive strategy to filter out irrelevant biasing phrases. 
Compared with baselines, our PSC-Joint approach achieves average relative F1 score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech, across biasing lists of varying lengths.

[31] Accelerating Large Language Model Inference via Early-Exiting Algorithms

Sangmin Bae

Main category: cs.CL

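TL;DR: This dissertation studies how co-designing adaptive algorithms and model architectures can make large language model inference more efficient, balancing dynamic inference against system efficiency to lower compute cost while preserving performance.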

Details Motivation: Large language models perform remarkably well, but high computational cost limits practical deployment. Existing adaptive computation methods such as early-exiting aim to cut that cost, yet they introduce system-level bottlenecks that can reduce throughput in batched inference, so a new way to balance dynamism and efficiency is needed. Method: The dissertation takes three main steps: 1) an efficient parallel decoding mechanism that reduces the overhead of conventional early-exiting; 2) deep parameter sharing to mitigate synchronization issues in dynamic inference; 3) a unified framework in which lightweight routers are pretrained to assign an optimal recursion depth to each token. Result: The parallel decoding mechanism reduces the overhead of conventional early-exiting; deep parameter sharing yields compact, parameter-efficient models and mitigates synchronization issues; and the unified router framework establishes a new Pareto frontier between efficiency and performance. Conclusion: Co-designing adaptive algorithms and model architectures can address the computational cost of deploying large language models and strike a balance between dynamism and efficiency. Abstract: Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. While adaptive computation methods like early-exiting promise to reduce these costs, they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves this conflict by co-designing adaptive algorithms and model architectures to strike an optimal balance between dynamism and efficiency. To this end, our work first addresses critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. We then show that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the critical synchronization issues affecting dynamic inference. Finally, this work presents a unified framework where lightweight routers are pretrained to dynamically assign an optimal recursion depth for each token. This approach establishes a new Pareto frontier between efficiency and performance by effectively optimizing for both adaptive computation and parameter efficiency within a single model.
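
For readers unfamiliar with early-exiting, the following toy Python sketch shows the generic confidence-threshold variant for a single token. It is not the dissertation's parallel decoding or routing mechanism; the toy layers, the identity exit head, and the threshold values are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(hidden, layers, exit_head, threshold=0.9):
    """Run layers in order and stop once the exit head is confident enough.

    Returns (predicted class index, number of layers actually executed).
    """
    if not layers:
        raise ValueError("at least one layer is required")
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(exit_head(hidden))
        if max(probs) >= threshold:
            break  # confident: skip the remaining layers
    return probs.index(max(probs)), depth


# Toy stand-ins (hypothetical): each layer doubles the first logit,
# and the exit head reads the hidden state directly as logits.
layers = [lambda h: [h[0] * 2, h[1]]] * 6
exit_head = lambda h: h

print(early_exit_forward([0.5, 0.0], layers, exit_head, threshold=0.9))
```

Because each token may stop at a different depth, tokens in the same batch finish at different times. That is the batching bottleneck the abstract describes, and this single-token sketch deliberately leaves it unsolved.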

[32] KatotohananQA: Evaluating Truthfulness of Large Language Models in Filipino

Lorenzo Alfred Nery,Ronald Dawson Catignas,Thomas James Tiam-Lee

Main category: cs.CL

TL;DR: KatotohananQA, a Filipino translation of TruthfulQA, identifies significant truthfulness gaps between English and Filipino in LLMs, highlighting multilingual evaluation needs.

Details Motivation: The lack of truthfulness benchmarks in low-resource languages like Filipino motivated the creation of KatotohananQA to evaluate LLM reliability. Method: The study translates TruthfulQA into Filipino and evaluates seven free-tier models using a binary-choice framework. Result: A significant truthfulness performance gap exists between English and Filipino, with some models showing strong multilingual robustness, while results vary across question types and topics. Conclusion: Multilingual evaluation is essential to ensure fairness and reliability in LLMs, as some aspects are less robust to language transfer. Abstract: Large Language Models (LLMs) achieve remarkable performance across various tasks, but their tendency to produce hallucinations limits reliable adoption. Benchmarks such as TruthfulQA have been developed to measure truthfulness, yet they are primarily available in English, leaving a gap in evaluating LLMs in low-resource languages. To address this, we present KatotohananQA, a Filipino translation of the TruthfulQA benchmark. Seven free-tier proprietary models were assessed using a binary-choice framework. Findings show a significant performance gap between English and Filipino truthfulness, with newer OpenAI models (GPT-5 and GPT-5 mini) demonstrating strong multilingual robustness. Results also reveal disparities across question characteristics, suggesting that some question types, categories, and topics are less robust to multilingual transfer which highlight the need for broader multilingual evaluation to ensure fairness and reliability in LLM usage.

[33] Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis

Zhenqi Jia,Rui Liu,Berrak Sisman,Haizhou Li

Main category: cs.CL

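TL;DR: This paper proposes MFCIG-CSS, a conversational speech synthesis system based on multimodal fine-grained interaction graphs that models word-level semantic and prosodic interactions to improve the naturalness and expressiveness of synthesized speech.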

Details Motivation: Existing methods overlook word-level, fine-grained semantic and prosodic interaction modeling in the multimodal dialogue history, even though this information matters for generating speech with natural prosody. Method: Proposes MFCIG-CSS, a CSS system based on a multimodal fine-grained context interaction graph, which builds two specialized graphs (a semantic interaction graph and a prosody interaction graph) to model word-level semantic and prosodic interaction features. Result: Experiments on the DailyTalk dataset show that MFCIG-CSS outperforms all baseline models in prosodic expressiveness. Conclusion: By constructing semantic and prosody interaction graphs, MFCIG-CSS encodes fine-grained interaction information in the multimodal dialogue history and produces more expressive synthesized speech. Abstract: Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). The latest work predicts the accurate prosody expression of the target utterance by modeling the utterance-level interaction characteristics of MDH and the target utterance. However, MDH contains fine-grained semantic and prosody knowledge at the word level. Existing methods overlook the fine-grained semantic and prosodic interaction modeling. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two interaction graphs effectively encode interactions between word-level semantics, prosody, and their influence on subsequent utterances in MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.

[34] Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge

Hao Liang,Ruitao Wu,Bohan Zeng,Junbo Niu,Wentao Zhang,Bin Dong

Main category: cs.CL

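TL;DR: This paper proposes a multimodal reasoning framework that combines visual and textual information, performs strongly on several evaluations, and generalizes well.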

Details Motivation: Despite substantial advances in text-based reasoning, even advanced models such as GPT-o3 still underperform in multimodal scenarios, so more effective solutions are needed. Method: Introduces a caption-assisted reasoning framework to address the challenges of multimodal reasoning. Result: The method took 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, and its generalization was validated on the MathVerse benchmark for geometric reasoning. Conclusion: The proposed framework bridges visual and textual modalities, and its first-place result in the SeePhys challenge supports its effectiveness and robustness. Abstract: Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop & Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.

[35] Orthogonal Low-rank Adaptation in Lie Groups for Continual Learning of Large Language Models

Kefan Cao,Shuaicheng Wu

Main category: cs.CL

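TL;DR: Proposes OLieRA, which uses Lie group theory and multiplicative updates to preserve parameter geometry and mitigate catastrophic forgetting of LLMs in sequential multi-task settings.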

Details Motivation: Parameter regularization methods such as O-LoRA and N-LoRA overlook the preservation of parameter geometry, which limits performance. The parameter space of LLMs has a geometric structure that must be preserved in addition to enforcing orthogonality. Method: Orthogonal Low-rank Adaptation in Lie Groups (OLieRA), grounded in Lie group theory, combines multiplicative updates with subspace orthogonality constraints. Result: OLieRA achieves state-of-the-art results on the Standard CL benchmark and remains among the top-performing methods in the Large Number of Tasks setting. Conclusion: By introducing Lie group theory for multiplicative updates that preserve parameter geometry, together with orthogonality constraints, OLieRA mitigates catastrophic forgetting of LLMs in sequential multi-task settings. Abstract: Large language models (LLMs) are prone to catastrophic forgetting in sequential multi-task settings. Parameter regularization methods such as O-LoRA and N-LoRA alleviate task interference by enforcing low-rank subspace orthogonality, but they overlook the fact that conventional additive fine-tuning disrupts the intrinsic geometric structure of LLM parameters, limiting performance. Our key insight is that the parameter space of LLMs possesses a geometric structure, which must be preserved in addition to enforcing orthogonality. Based on this, we propose Orthogonal Low-rank Adaptation in Lie Groups (OLieRA), which introduces Lie group theory into LLM fine-tuning: leveraging multiplicative updates to preserve parameter geometry while applying orthogonality constraints to task subspaces. Experiments demonstrate that OLieRA achieves state-of-the-art results on the Standard CL benchmark and remains among the top-performing methods in the Large Number of Tasks setting.

[36] Benchmarking Gender and Political Bias in Large Language Models

Jinrui Yang,Xudong Han,Timothy Baldwin

Main category: cs.CL

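TL;DR: EuroParlVote is a new benchmark for evaluating large language models in politically sensitive contexts, and it reveals model bias with respect to gender and political group.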

Details Motivation: To evaluate large language models in politically sensitive contexts, the authors introduce the EuroParlVote benchmark and analyze model fairness across genders and political groups. Method: EuroParlVote links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata (gender, age, country, political group); state-of-the-art LLMs are evaluated on two tasks, gender classification and vote prediction. Result: LLMs frequently misclassify female MEPs as male and show reduced accuracy when simulating votes for female speakers. Politically, they tend to favor centrist groups while underperforming on both far-left and far-right ones. Conclusion: EuroParlVote reveals potential bias of LLMs in politically sensitive contexts; proprietary models such as GPT-4o outperform open-weight alternatives in both robustness and fairness. Abstract: We introduce EuroParlVote, a novel benchmark for evaluating large language models (LLMs) in politically sensitive contexts. It links European Parliament debate speeches to roll-call vote outcomes and includes rich demographic metadata for each Member of the European Parliament (MEP), such as gender, age, country, and political group. Using EuroParlVote, we evaluate state-of-the-art LLMs on two tasks -- gender classification and vote prediction -- revealing consistent patterns of bias. We find that LLMs frequently misclassify female MEPs as male and demonstrate reduced accuracy when simulating votes for female speakers. Politically, LLMs tend to favor centrist groups while underperforming on both far-left and far-right ones. Proprietary models like GPT-4o outperform open-weight alternatives in terms of both robustness and fairness. We release the EuroParlVote dataset, code, and demo to support future research on fairness and accountability in NLP within political contexts.
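
A bias pattern like the one described above is typically surfaced by breaking accuracy down per demographic group. The Python sketch below shows one simple way to do that. The record layout and every value in it are invented for illustration and are not drawn from the EuroParlVote dataset or its results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group from (group, gold_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, pred in records:
        total[group] += 1
        correct[group] += int(gold == pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical gender-classification outputs: (speaker group, gold, predicted).
records = [
    ("female", "female", "male"),
    ("female", "female", "female"),
    ("female", "female", "male"),
    ("female", "female", "female"),
    ("male", "male", "male"),
    ("male", "male", "male"),
    ("male", "male", "male"),
    ("male", "male", "male"),
]

per_group = accuracy_by_group(records)
gap = per_group["male"] - per_group["female"]
print(per_group, gap)
```

The same function applies to vote prediction by swapping the group key to political group and the labels to vote outcomes.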

[37] Understanding the Influence of Synthetic Data for Text Embedders

Jacob Mitchell Springer,Vaibhav Adlakha,Siva Reddy,Aditi Raghunathan,Marius Mosbach

Main category: cs.CL

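TL;DR: This paper examines the role of synthetic data in general-purpose text embedding models and finds that its benefits are limited and localized, with trade-offs across tasks.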

Details Motivation: The lack of publicly available synthetic datasets hinders study of how synthetic data affects generalization, so the authors reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5) and analyze it. Method: First reproduce and release the Mistral-E5 synthetic dataset, then critically examine where exactly synthetic data improves model generalization. Result: Benefits from synthetic data are sparse and highly localized to individual datasets, and there are performance trade-offs between task categories. Conclusion: Current synthetic data approaches have limitations for building general-purpose embedders, and training on synthetic data does not necessarily yield more robust embedding models across tasks. Abstract: Recent progress in developing general purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no publicly available synthetic dataset exists, posing a barrier to studying its role for generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs between the performance on different categories and data that benefits one task, degrades performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.

[38] Augmented Fine-Tuned LLMs for Enhanced Recruitment Automation

Mohamed T. Younes,Omar Walid,Khaled Shaban,Ali Hamdi,Mai Hassan

Main category: cs.CL

TL;DR: This paper introduces a recruitment automation framework using fine-tuned LLMs and structured data, achieving high performance in candidate-job matching.

Details Motivation: The motivation was to address the limitations of generic LLMs in recruitment automation by improving accuracy, efficiency, and data consistency through a structured framework. Method: The researchers fine-tuned LLMs specifically for recruitment tasks, using a synthetic dataset and real-world resumes parsed by DeepSeek, formatted into a standardized JSON structure for training. Result: The fine-tuned Phi-4 model achieved an F1 score of 90.62%, showing significant improvements in exact match, BLEU score, ROUGE score, and overall similarity compared to base models and other state-of-the-art LLMs. Conclusion: The study concludes that fine-tuned LLMs, particularly the Phi-4 model, significantly enhance recruitment automation by improving accuracy in candidate-job matching and overall performance metrics. Abstract: This paper presents a novel approach to recruitment automation. Large Language Models (LLMs) were fine-tuned to improve accuracy and efficiency. Building upon our previous work on the Multilayer Large Language Model-Based Robotic Process Automation Applicant Tracking (MLAR) system, this work introduces a novel methodology: training LLMs fine-tuned specifically for recruitment tasks. The proposed framework addresses the limitations of generic LLMs by creating a synthetic dataset that uses a standardized JSON format. This helps ensure consistency and scalability. In addition to the synthetic data set, the resumes were parsed using DeepSeek, a high-parameter LLM. The resumes were parsed into the same structured JSON format and placed in the training set. This will help improve data diversity and realism. Through experimentation, we demonstrate significant improvements in performance metrics, such as exact match, F1 score, BLEU score, ROUGE score, and overall similarity compared to base models and other state-of-the-art LLMs. 
In particular, the fine-tuned Phi-4 model achieved the highest F1 score of 90.62%, indicating exceptional precision and recall in recruitment tasks. This study highlights the potential of fine-tuned LLMs. Furthermore, it will revolutionize recruitment workflows by providing more accurate candidate-job matching.

[39] MSLEF: Multi-Segment LLM Ensemble Finetuning in Recruitment

Omar Walid,Mohamed T. Younes,Khaled Shaban,Mai Hassan,Ali Hamdi

Main category: cs.CL

TL;DR: MSLEF improves resume parsing for recruitment automation by using an ensemble of fine-tuned LLMs that specialize in different resume segments, resulting in better accuracy and adaptability compared to single-model systems.

Details Motivation: The motivation behind MSLEF is to enhance resume parsing in recruitment automation by addressing the limitations of single-model systems and adapting to diverse resume formats and structures. Method: MSLEF integrates fine-tuned Large Language Models (LLMs) using weighted voting, with each model specializing in a specific resume segment. It introduces a segment-aware architecture that leverages field-specific weighting tailored to each resume part, and uses Gemini-2.5-Flash LLM as a high-level aggregator for complex sections. Result: MSLEF achieves significant improvements in Exact Match (EM), F1 score, BLEU, ROUGE, and Recruitment Similarity (RS) metrics, outperforming the best single model by up to +7% in RS. Conclusion: MSLEF is highly adaptable to real-world hiring scenarios and ensures precise and reliable candidate representation by overcoming the limitations of single-model systems and enhancing generalization across varied resume layouts. Abstract: This paper presents MSLEF, a multi-segment ensemble framework that employs LLM fine-tuning to enhance resume parsing in recruitment automation. It integrates fine-tuned Large Language Models (LLMs) using weighted voting, with each model specializing in a specific resume segment to boost accuracy. Building on MLAR, MSLEF introduces a segment-aware architecture that leverages field-specific weighting tailored to each resume part, effectively overcoming the limitations of single-model systems by adapting to diverse formats and structures. The framework incorporates Gemini-2.5-Flash LLM as a high-level aggregator for complex sections and utilizes Gemma 9B, LLaMA 3.1 8B, and Phi-4 14B. MSLEF achieves significant improvements in Exact Match (EM), F1 score, BLEU, ROUGE, and Recruitment Similarity (RS) metrics, outperforming the best single model by up to +7% in RS. 
Its segment-aware design enhances generalization across varied resume layouts, making it highly adaptable to real-world hiring scenarios while ensuring precise and reliable candidate representation.
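
The weighted-voting idea with field-specific weights can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the MSLEF code: the model names, fields, weights, and predictions are all hypothetical, and the LLM aggregator for complex sections is omitted.

```python
from collections import defaultdict


def weighted_vote(candidates, weights):
    """Pick the value with the highest total weight.

    candidates: {model_name: predicted_value}
    weights:    {model_name: weight for this field}
    """
    scores = defaultdict(float)
    for model, value in candidates.items():
        scores[value] += weights.get(model, 0.0)
    return max(scores, key=scores.get)


def ensemble_parse(predictions, field_weights):
    """Combine per-model parses field by field using field-specific weights.

    predictions:   {model_name: {field: value}}
    field_weights: {field: {model_name: weight}}
    """
    fields = {f for parsed in predictions.values() for f in parsed}
    return {
        f: weighted_vote(
            {m: parsed[f] for m, parsed in predictions.items() if f in parsed},
            field_weights.get(f, {}),
        )
        for f in fields
    }


# Hypothetical outputs from three fine-tuned parsers.
predictions = {
    "model_a": {"name": "Ada Lovelace", "skills": "Python, SQL"},
    "model_b": {"name": "Ada Lovelace", "skills": "Python"},
    "model_c": {"name": "A. Lovelace", "skills": "Python"},
}
# Hypothetical weights: model_a is trusted most on the skills segment.
field_weights = {
    "name": {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2},
    "skills": {"model_a": 0.6, "model_b": 0.2, "model_c": 0.2},
}

merged = ensemble_parse(predictions, field_weights)
print(merged)
```

Note that on the skills field the single heavily weighted model overrides a two-model majority, which is the practical difference between weighted and plain majority voting.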

[40] No Encore: Unlearning as Opt-Out in Music Generation

Jinju Kim,Taehan Kim,Abdul Waheed,Rita Singh

Main category: cs.CL

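TL;DR: This paper studies machine unlearning in music generation as a way to prevent the use of copyrighted content and analyzes the efficacy of existing methods.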

Details Motivation: Music generation systems risk exploiting copyrighted content, so machine unlearning is explored as a way to address the resulting ethical and legal concerns. Method: The authors apply existing machine unlearning methods to a pre-trained Text-to-Music (TTM) baseline and analyze their efficacy in unlearning pre-trained datasets without harming model performance. Result: The experiments yield preliminary insights into the challenges of applying unlearning in music generation. Conclusion: Applying machine unlearning to music generation may help prevent inadvertent use of copyrighted content, and the paper offers a foundational analysis for future work. Abstract: AI music generation is rapidly emerging in the creative industries, enabling intuitive music generation from textual descriptions. However, these systems pose risks in exploitation of copyrighted creations, raising ethical and legal concerns. In this paper, we present preliminary results on the first application of machine unlearning techniques from an ongoing research to prevent inadvertent usage of creative content. Particularly, we explore existing methods in machine unlearning to a pre-trained Text-to-Music (TTM) baseline and analyze their efficacy in unlearning pre-trained datasets without harming model performance. Through our experiments, we provide insights into the challenges of applying unlearning in music generation, offering a foundational analysis for future works on the application of unlearning for music generative models.

[41] Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Junjie Mu,Zonghao Ying,Zhekui Fan,Zonglei Jing,Yaoyuan Zhang,Zhengmin Yu,Wenxin Zhang,Quanchen Zou,Xiangzheng Zhang

Main category: cs.CL

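TL;DR: This paper proposes Mask-GCG, which uses learnable token masking to identify the impactful tokens in a suffix, reducing redundancy, lowering computational overhead, and shortening the time needed for a successful attack.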

Details Motivation: Several improved GCG variants have been proposed, but they all rely on fixed-length suffixes, and the potential redundancy within those suffixes remains unexplored. Method: Proposes Mask-GCG, which uses learnable token masking to identify impactful tokens within the suffix. Result: Experiments show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), revealing token redundancy in LLM prompts. Conclusion: Mask-GCG identifies the impactful tokens in a suffix and prunes tokens at low-impact positions, which reduces redundancy, lowers computational overhead, and shortens the time required for a successful attack. Abstract: Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

Ao Chang,Yubo Chen,Jun Zhao

Main category: cs.CL

TL;DR: This paper proposes PL-CA, a parametric RAG-based method that alleviates context window constraints and computational overhead in conventional RAG, particularly for multi-task legal scenarios. The approach integrates parametric knowledge into LLMs via LoRA and introduces a new expert-annotated legal dataset.

Details Motivation: Conventional RAG methods face limitations due to context window constraints, computational overhead, and inadequate benchmarks, particularly in complex legal domains requiring high knowledge rigor and logical consistency. Method: The authors propose a parametric RAG (P-RAG) framework that performs data augmentation on corpus knowledge, encodes legal knowledge into parametric vectors, and integrates this knowledge into LLMs via LoRA. They also construct a multi-task legal dataset with expert annotations. Result: The experimental results on the newly constructed multi-task legal dataset show that the proposed method reduces context-related overhead while maintaining competitive performance on downstream tasks. Conclusion: The proposed PL-CA method effectively reduces the overhead of long contexts while maintaining competitive performance on downstream tasks compared to conventional RAG methods. Abstract: Conventional RAG is considered one of the most effective methods for addressing model knowledge insufficiency and hallucination, particularly in the judicial domain that requires high levels of knowledge rigor, logical consistency, and content integrity. However, the conventional RAG method only injects retrieved documents directly into the model's context, which severely constrains models due to their limited context windows and introduces additional computational overhead through excessively long contexts, thereby disrupting models' attention and degrading performance on downstream tasks. Moreover, many existing benchmarks lack expert annotation and focus solely on individual downstream tasks while real-world legal scenarios consist of multiple mixed legal tasks, indicating conventional benchmarks' inadequacy for reflecting models' true capabilities. 
To address these limitations, we propose PL-CA, which introduces a parametric RAG (P-RAG) framework to perform data augmentation on corpus knowledge and encode this legal knowledge into parametric vectors, and then integrates this parametric knowledge into the LLM's feed-forward networks (FFN) via LoRA, thereby alleviating models' context pressure. Additionally, we also construct a multi-task legal dataset comprising more than 2000 training and test instances, which are all expert-annotated and manually verified. We conduct our experiments on our dataset, and the experimental results demonstrate that our method reduces the overhead associated with excessively long contexts while maintaining competitive performance on downstream tasks compared to conventional RAG. Our code and dataset are provided in the appendix.

[43] Do LLMs exhibit the same commonsense capabilities across languages?

Ivan Martínez-Murillo,Elena Lloret,Paloma Moreda,Albert Gatt

Main category: cs.CL

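TL;DR: This paper introduces MULTICOM, a benchmark for evaluating multilingual commonsense generation. Results show that LLM performance differs across languages: English is best, and less-resourced languages fare worse.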

Details Motivation: To explore the multilingual commonsense generation abilities of LLMs and build an evaluation framework that measures performance differences across languages. Method: Introduces MULTICOM, a new benchmark that extends the COCOTEROS dataset to four languages, and evaluates a range of open-source LLMs. Result: English performs best, less-resourced languages perform significantly worse, and contextual support helps underrepresented languages to some extent. Conclusion: The paper highlights current limitations of LLMs in multilingual commonsense generation and contributes the MULTICOM benchmark for evaluating it. Abstract: This paper explores the multilingual commonsense generation abilities of Large Language Models (LLMs). To facilitate this investigation, we introduce MULTICOM, a novel benchmark that extends the COCOTEROS dataset to four languages: English, Spanish, Dutch, and Valencian. The task involves generating a commonsensical sentence that includes a given triplet of words. We evaluate a range of open-source LLMs, including LLaMA, Qwen, Gemma, EuroLLM, and Salamandra, on this benchmark. Our evaluation combines automatic metrics, LLM-as-a-judge approaches (using Prometheus and JudgeLM), and human annotations. Results consistently show superior performance in English, with significantly lower performance in less-resourced languages. While contextual support yields mixed results, it tends to benefit underrepresented languages. These findings underscore the current limitations of LLMs in multilingual commonsense generation. The dataset is publicly available at https://huggingface.co/datasets/gplsi/MULTICOM.

[44] WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Junteng Liu,Yunji Li,Chi Zhang,Jingyang Li,Aili Chen,Ke Ji,Weiyu Cheng,Zijia Wu,Chengyu Du,Qidi Xu,Jiayuan Song,Zhengmao Zhu,Wenhu Chen,Pengyu Zhao,Junxian He

Main category: cs.CL

TL;DR: This paper presents WebExplorer-8B, a high-performing long-horizon web agent built through systematic data generation and reinforcement learning.

Details Motivation: Existing open-source web agents either show limited information-seeking abilities on complex tasks or lack transparent implementations; the key challenge is the scarcity of challenging information-seeking data. Method: Introduces WebExplorer, a systematic data generation approach using model-based exploration and iterative long-to-short query evolution; WebExplorer-8B is developed through supervised fine-tuning followed by reinforcement learning. Result: WebExplorer-8B supports 128K context length and up to 100 tool-calling turns, achieves state-of-the-art performance at its scale on several information-seeking benchmarks, and searches over an average of 16 turns after RL training. Conclusion: WebExplorer-8B reaches state-of-the-art performance at its scale on information-seeking tasks and generalizes strongly to the HLE benchmark, indicating its practicality as a long-horizon web agent. Abstract: The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. 
These results highlight our approach as a practical path toward long-horizon web agents.

[45] Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training

Andrei Baroian,Kasper Notebomer

Main category: cs.CL

TL;DR: The paper explores new Layer-Wise Scaling variants to optimize the training of transformer-based language models, showing improved performance over traditional uniform layer size approaches without a significant throughput cost.

Details Motivation: The motivation is to address the ignored diverse functional roles of different depths in transformer-based language models and their varying computational capacity needs, aiming for more efficient model design. Method: The paper introduces three new LWS variants - Framed, Reverse, and Crown - which redistribute FFN widths and attention heads through linear interpolation during pre-training, and presents a systematic ablation study on a fixed parameter budget. Result: All models using the new LWS variants converged to similar losses and showed improved performance over the isotropic baseline on a fixed budget of 180M parameters trained on 5B tokens. Conclusion: The paper concludes that by using new LWS variants, it is possible to redistribute FFN widths and attention heads during pre-training, resulting in models that perform better than the isotropic baseline without significantly affecting training throughput. Abstract: Transformer-based language models traditionally use uniform (isotropic) layer sizes, yet they ignore the diverse functional roles that different depths can play and their computational capacity needs. Building on Layer-Wise Scaling (LWS) and pruning literature, we introduce three new LWS variants - Framed, Reverse, and Crown - that redistribute FFN widths and attention heads via two or three-point linear interpolation in the pre-training stage. We present the first systematic ablation of LWS and its variants, on a fixed budget of 180M parameters, trained on 5B tokens. All models converge to similar losses and achieve better performance compared to an equal-cost isotropic baseline, without a substantial decrease in training throughput. This work represents an initial step into the design space of layer-wise architectures for pre-training, but future work should scale experiments to orders of magnitude more tokens and parameters to fully assess their potential.
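
As a rough illustration of the "two or three-point linear interpolation" mentioned in the abstract, the sketch below computes per-layer FFN widths from a handful of anchor multipliers. The function names, the anchor values, and the rounding to a multiple of 64 are assumptions made for this example; the abstract does not specify the exact profiles used by the Framed, Reverse, or Crown variants.

```python
def interpolate_profile(num_layers, anchors):
    """Per-layer multipliers from piecewise-linear interpolation.

    `anchors` holds 2 or 3 (or more) multipliers placed at evenly spaced
    depths, from the first layer to the last. Hypothetical helper.
    """
    if num_layers < 2 or len(anchors) < 2:
        raise ValueError("need at least two layers and two anchors")
    segments = len(anchors) - 1
    profile = []
    for layer in range(num_layers):
        # Position of this layer in anchor space, from 0 to `segments`.
        pos = layer / (num_layers - 1) * segments
        left = min(int(pos), segments - 1)
        frac = pos - left
        profile.append(anchors[left] * (1 - frac) + anchors[left + 1] * frac)
    return profile


def ffn_widths(num_layers, base_width, anchors, multiple_of=64):
    """Scale a base FFN width per layer and round to a hardware-friendly size."""
    widths = []
    for m in interpolate_profile(num_layers, anchors):
        rounded = round(m * base_width / multiple_of) * multiple_of
        widths.append(max(multiple_of, rounded))
    return widths


# Two-point profile: widths grow linearly with depth.
print(ffn_widths(5, 1024, [0.5, 1.5]))
# Three-point profile: widths peak in the middle layers.
print(ffn_widths(5, 1024, [0.5, 1.5, 0.5]))
```

In the two-point example the multipliers average to 1, so the total FFN width matches a uniform model with the same base width. That is one simple way to hold the parameter budget roughly fixed while changing only its distribution across depth.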

[46] LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection

Jian Wu,Hang Yu,Bingchang Liu,Wenjie Yang,Peng Di,Jianguo Li,Yue Zhang

Main category: cs.CL

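TL;DR: LAMDAS is a new method that uses a pre-trained large language model as an implicit classifier for domain-specific data selection. By reframing data selection as a one-class classification problem, it outperforms full-data training and other state-of-the-art methods while using only a fraction of the data.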

Details Motivation: When adapting large language models to specific domains, the scarcity of high-quality, human-curated data is a critical bottleneck. Existing methods struggle to be both accurate and efficient. Method: LAMDAS uses the pre-trained LLM itself as an implicit classifier, avoiding explicit feature engineering and computationally intensive optimization. It reframes data selection as a one-class classification problem, identifying candidate data that belongs to the target domain defined by a small reference dataset. Result: Experiments show that LAMDAS exceeds the performance of full-data training while using a fraction of the data, outperforms nine state-of-the-art baselines under various scenarios, and achieves the best balance between performance gains and computational efficiency. Conclusion: LAMDAS provides an efficient and accurate approach to domain-specific data selection that outperforms existing methods and balances performance against computational cost. Abstract: Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to simultaneously achieve these goals. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization process. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency compared to all evaluated baselines.

[47] SLiNT: Structure-aware Language Model with Injection and Contrastive Training for Knowledge Graph Completion

Mengxue Yang,Chun Yang,Jiaqi Zhu,Jiafan Li,Jingqi Zhang,Yuyang Li,Ying Li

Main category: cs.CL

TL;DR: SLiNT improves link prediction in knowledge graphs by incorporating structural context into large language models through a modular framework.

Details Motivation: Link prediction in knowledge graphs is challenging due to structural sparsity and semantic ambiguity, especially in incomplete or zero-shot settings. Existing large language models lack sufficient exploitation of structural signals. Method: SLiNT framework uses Structure-Guided Neighborhood Enhancement (SGNE), Dynamic Hard Contrastive Learning (DHCL), and Gradient-Decoupled Dual Injection (GDDI) to incorporate structural context into a frozen LLM backbone. Result: SLiNT achieves superior or competitive performance compared to both embedding-based and generation-based baselines on WN18RR and FB15k-237 datasets. Conclusion: SLiNT demonstrates the effectiveness of integrating structural signals into LLMs for scalable knowledge graph completion, achieving superior or competitive performance on WN18RR and FB15k-237 datasets. Abstract: Link prediction in knowledge graphs requires integrating structural information and semantic context to infer missing entities. While large language models offer strong generative reasoning capabilities, their limited exploitation of structural signals often results in structural sparsity and semantic ambiguity, especially under incomplete or zero-shot settings. To address these challenges, we propose SLiNT (Structure-aware Language model with Injection and coNtrastive Training), a modular framework that injects knowledge-graph-derived structural context into a frozen LLM backbone with lightweight LoRA-based adaptation for robust link prediction. Specifically, Structure-Guided Neighborhood Enhancement (SGNE) retrieves pseudo-neighbors to enrich sparse entities and mitigate missing context; Dynamic Hard Contrastive Learning (DHCL) introduces fine-grained supervision by interpolating hard positives and negatives to resolve entity-level ambiguity; and Gradient-Decoupled Dual Injection (GDDI) performs token-level structure-aware intervention while preserving the core LLM parameters. 
Experiments on WN18RR and FB15k-237 show that SLiNT achieves superior or competitive performance compared with both embedding-based and generation-based baselines, demonstrating the effectiveness of structure-aware representation learning for scalable knowledge graph completion.

[48] HAVE: Head-Adaptive Gating and ValuE Calibration for Hallucination Mitigation in Large Language Models

Xin Tong,Zhi Lin,Jingya Wang,Bo Jin

Main category: cs.CL

TL;DR: The paper proposes HAVE, a framework that reduces hallucinations in LLMs through head-adaptive gating and value calibration, enhancing the reliability of model outputs.

Details Motivation: The research is motivated by the problem of hallucinations in LLMs during retrieval-augmented or long-context generation, stemming from the input-agnostic treatment of head importance and poor reflection of token contributions by raw attention weights. Method: The study introduces HAVE, which performs instance-level soft reweighing of attention heads and augments attention with the magnitude of value vectors to approximate write-back contribution. Result: Experiments across multiple QA benchmarks and LLM families show that HAVE consistently reduces hallucinations and outperforms strong baselines with modest overhead. Conclusion: HAVE is a parameter-free decoding framework that effectively addresses the hallucinations issue in LLMs by introducing head-adaptive gating and value calibration, making it efficient and broadly applicable. Abstract: Large Language Models (LLMs) often produce hallucinations in retrieval-augmented or long-context generation, even when relevant evidence is present. This stems from two issues: head importance is treated as input-agnostic, and raw attention weights poorly reflect each token's true contribution. We present HAVE (Head-Adaptive Gating and ValuE Calibration), a parameter-free decoding framework that directly addresses both challenges. HAVE introduces head-adaptive gating, which performs instance-level soft reweighing of attention heads, and value calibration, which augments attention with the magnitude of value vectors to approximate write-back contribution. Together, these modules construct token-level evidence aligned with model updates and fuse it with the LM distribution through a lightweight uncertainty-scaled policy. HAVE requires no finetuning and operates in a single forward pass, making it efficient and broadly applicable. 
Experiments across multiple QA benchmarks and LLM families demonstrate that HAVE consistently reduces hallucinations and outperforms strong baselines, including DAGCD, with modest overhead. The framework is transparent, reproducible, and readily integrates with off-the-shelf LLMs, advancing trustworthy generation in real-world settings.
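
To make the idea of value calibration more concrete, here is a minimal sketch that scales each token's attention weight by the norm of its value vector and then mixes heads with per-instance gate weights. How HAVE actually computes its gates and fuses the evidence with the LM distribution is not described in the abstract, so the gate values are simply passed in, and all function names are hypothetical.

```python
import math


def value_calibrated_scores(attn, values):
    """Scale attention weights by value-vector norms, then renormalize.

    attn:   one attention weight per token (for a single head).
    values: one value vector per token (for the same head).
    """
    raw = [a * math.sqrt(sum(x * x for x in v)) for a, v in zip(attn, values)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw


def fuse_heads(per_head_scores, gates):
    """Soft reweighting of heads: a gate-weighted average of token scores.

    `gates` are non-negative per-head weights for this input; how they are
    derived is an assumption left outside this sketch.
    """
    z = sum(gates)
    num_tokens = len(per_head_scores[0])
    return [
        sum(g / z * scores[t] for g, scores in zip(gates, per_head_scores))
        for t in range(num_tokens)
    ]


# Head 0 attends equally to both tokens, but token 1 has a zero value vector,
# so it contributes nothing once the value norm is taken into account.
head0 = value_calibrated_scores([0.5, 0.5], [[3.0, 4.0], [0.0, 0.0]])
head1 = value_calibrated_scores([0.2, 0.8], [[1.0, 0.0], [1.0, 0.0]])
print(fuse_heads([head0, head1], gates=[1.0, 3.0]))
```

The first head shows why raw attention can mislead: equal attention weights do not imply equal contribution when the value vectors differ in magnitude.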

[49] Guided Decoding and Its Critical Role in Retrieval-Augmented Generation

Özgür Uğur,Musa Yılmaz,Esra Şavirdi,Özay Ezerceli,Mahmut El Huseyni,Selva Taş,Reyhan Bayraktar

Main category: cs.CL

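TL;DR: This paper examines guided decoding in RAG systems as a way to enforce output formats and reduce hallucinations, comparing three methods across different multi-turn prompting setups.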

Details Motivation: Integrating large language models (LLMs) into applications requires structured and reliable responses. A key challenge in RAG systems is ensuring that outputs follow the expected format while minimizing hallucinations. Method: Compares three guided decoding methods, Outlines, XGrammar, and LM Format Enforcer, across different multi-turn prompting setups (0-turn, 1-turn, and 2-turn). Result: The findings show how multi-turn interactions influence guided decoding and uncover unexpected performance variations that can inform method selection for specific use cases. Conclusion: This work advances the understanding of structured output generation in RAG systems, offering both theoretical insights and practical guidance for LLM deployment. Abstract: The integration of Large Language Models (LLMs) into various applications has driven the need for structured and reliable responses. A key challenge in Retrieval-Augmented Generation (RAG) systems is ensuring that outputs align with expected formats while minimizing hallucinations. This study examines the role of guided decoding in RAG systems, comparing three methods, Outlines, XGrammar, and LM Format Enforcer, across different multi-turn prompting setups (0-turn, 1-turn, and 2-turn). By evaluating success rates, hallucination rates, and output quality, we provide insights into their performance and applicability. Our findings reveal how multi-turn interactions influence guided decoding, uncovering unexpected performance variations that can inform method selection for specific use cases. This work advances the understanding of structured output generation in RAG systems, offering both theoretical insights and practical guidance for LLM deployment.

[50] Modelling Intertextuality with N-gram Embeddings

Yi Xing

Main category: cs.CL

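TL;DR: This paper proposes a new quantitative model of intertextuality that measures intertextual relationships through pairwise comparisons of n-gram embeddings, and validates it with large-scale tests and network analysis.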

Details Motivation: Intertextuality is a central concept in literary studies, but it lacks a scalable and effective method of quantification. The paper proposes a new quantitative model to support large-scale analysis and network-based insights. Method: The model performs pairwise comparisons of the n-gram embeddings of two texts and averages the results to obtain the overall intertextuality. Result: The method is validated on four texts with known degrees of intertextuality and tested for scalability on 267 diverse texts, demonstrating its effectiveness and efficiency. Network analysis further reveals centrality and community structures. Conclusion: The proposed model based on n-gram embeddings captures and quantifies intertextual relationships effectively, and the centrality and community structures found by network analysis support this. Abstract: Intertextuality is a central tenet in literary studies. It refers to the intricate links between literary texts that are created by various types of references. This paper proposes a new quantitative model of intertextuality to enable scalable analysis and network-based insights: perform pairwise comparisons of the embeddings of n-grams from two texts and average their results as the overall intertextuality. Validation on four texts with known degrees of intertextuality, alongside a scalability test on 267 diverse texts, demonstrates the method's effectiveness and efficiency. Network analysis further reveals centrality and community structures, affirming the approach's success in capturing and quantifying intertextual relationships.
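
The scoring procedure in the abstract (compare every n-gram embedding of one text with every n-gram embedding of the other, then average) can be sketched as follows. The abstract does not name the embedding model or the comparison function, so `toy_embed` (a character-count vector) and cosine similarity are stand-ins chosen only to keep the example self-contained.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Word-level n-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def toy_embed(ngram):
    """Placeholder embedding: sparse character counts (an assumption)."""
    return Counter(ngram.replace(" ", ""))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def intertextuality(text_a, text_b, n=2):
    """Average similarity over all pairs of n-grams from the two texts."""
    emb_a = [toy_embed(g) for g in ngrams(text_a, n)]
    emb_b = [toy_embed(g) for g in ngrams(text_b, n)]
    if not emb_a or not emb_b:
        return 0.0
    total = sum(cosine(u, v) for u in emb_a for v in emb_b)
    return total / (len(emb_a) * len(emb_b))


print(intertextuality("call me ishmael", "call me ishmael"))
print(intertextuality("call me ishmael", "xyz qqq www"))
```

Because the score averages over all pairs, a text compared with itself generally scores below 1 once it contains more than one distinct n-gram. Scores are therefore more meaningful relative to each other than in absolute terms.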

[51] Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval

Hao Lin,Peitong Xie,Jingxue Chen,Jie Lin,Qingkun Tang,Qianchun Lu

Main category: cs.CL

TL;DR: MoLER is a domain-aware RAG method that optimizes retrieval using MoL-Enhanced Reinforcement Learning. It achieves superior performance on benchmark datasets.

Details Motivation: Existing coarse-ranking optimization approaches struggle to balance domain-specific knowledge learning with query enhancement in RAG systems. Method: MoLER uses a two-stage pipeline: continual pre-training with a Mixture of Losses and reinforcement learning with Group Relative Policy Optimization. It also uses a Multi-query Single-passage Late Fusion strategy. Result: Extensive experiments on benchmark datasets show that MoLER achieves state-of-the-art performance, significantly outperforming baseline methods. Conclusion: MoLER successfully bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains. Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on the retrieval stage, particularly the coarse-ranking process. Existing coarse-ranking optimization approaches often struggle to balance domain-specific knowledge learning with query enhancement, resulting in suboptimal retrieval performance. To address this challenge, we propose MoLER, a domain-aware RAG method that uses MoL-Enhanced Reinforcement Learning to optimize retrieval. MoLER has a two-stage pipeline: a continual pre-training (CPT) phase using a Mixture of Losses (MoL) to balance domain-specific knowledge with general language capabilities, and a reinforcement learning (RL) phase leveraging Group Relative Policy Optimization (GRPO) to optimize query and passage generation for maximizing document recall. A key innovation is our Multi-query Single-passage Late Fusion (MSLF) strategy, which reduces computational overhead during RL training while maintaining scalable inference via Multi-query Multi-passage Late Fusion (MMLF). Extensive experiments on benchmark datasets show that MoLER achieves state-of-the-art performance, significantly outperforming baseline methods. MoLER bridges the knowledge gap in RAG systems, enabling robust and scalable retrieval in specialized domains.

[52] IntrEx: A Dataset for Modeling Engagement in Educational Conversations

Xingwei Tan,Mahathi Parvatham,Chiara Gambi,Gabriele Pergola

Main category: cs.CL

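TL;DR: This study builds the IntrEx dataset, analyzes the factors that affect interest in educational conversations, and finds that fine-tuned language models can effectively predict learners' interestingness judgments.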

Details Motivation: To address the lack of research on what drives interest in conversations, given the importance of maintaining learner interest in educational dialogue. Method: Introduces the IntrEx dataset, annotated by over 100 second-language learners using a comparison-based rating approach, and analyzes how well language models predict interestingness. Result: LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models such as GPT-4o, and linguistic and cognitive factors (e.g., concreteness, readability) significantly affect how engaging a conversation is. Conclusion: Specialized datasets such as IntrEx can be used to model engagement in educational settings, and linguistic and cognitive factors significantly affect learners' interest and engagement. Abstract: Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, still little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.

[53] ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data

Vladislav Stankov,Matyáš Kopp,Ondřej Bojar

Main category: cs.CL

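TL;DR: ParCzech4Speech 1.0 is a processed dataset for speech modeling containing 2,695 hours of recordings of Czech parliamentary speeches aligned with the official transcripts. It is released in three variants and is suitable for tasks such as automatic speech recognition and speech synthesis.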

Details Motivation: To provide a larger, more reliable dataset for speech modeling, improving on the data extraction and alignment reliability of the previous version. Method: Combines the sound recordings of the ParCzech 4.0 corpus with the official transcripts of Czech parliamentary speeches, and uses WhisperX and Wav2Vec 2.0 to extract automated audio-text alignment. Result: The ParCzech4Speech 1.0 dataset, whose largest variant contains 2,695 hours of speech, offered in three variants: (1) sentence-segmented, (2) unsegmented, and (3) raw-alignment. Conclusion: ParCzech4Speech 1.0 is a dataset suited to speech modeling tasks. It improves on the ParCzech 3.0 speech recognition version with higher alignment reliability and is released in three variants to meet different task needs. Abstract: We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus, targeted at speech modeling tasks with the largest variant containing 2,695 hours. We combined the sound recordings of the Czech parliamentary speeches with the official transcripts. The recordings were processed with WhisperX and Wav2Vec 2.0 to extract automated audio-text alignment. Our processing pipeline improves upon the ParCzech 3.0 speech recognition version by extracting more data with higher alignment reliability. The dataset is offered in three flexible variants: (1) sentence-segmented for automatic speech recognition and speech synthesis tasks with clean boundaries, (2) unsegmented preserving original utterance flow across sentences, and (3) a raw-alignment for further custom refinement for other possible tasks. All variants maintain the original metadata and are released under a permissive CC-BY license. The dataset is available in the LINDAT repository, with the sentence-segmented and unsegmented variants additionally available on Hugging Face.

[54] Will Annotators Disagree? Identifying Subjectivity in Value-Laden Arguments

Amir Homayounirad,Enrico Liscio,Tong Wang,Catholijn M. Jonker,Luciano C. Siebert

Main category: cs.CL

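TL;DR: The study finds that directly identifying subjectivity outperforms inferring it from predicted values in argument analysis. This helps identify arguments that individuals may interpret differently, supporting a more nuanced annotation process.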

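Details Motivation: Aggregating multiple annotations into a single ground truth label may hide valuable insights into annotator disagreement, especially in tasks where subjectivity plays a crucial role. Exploring methods for identifying subjectivity in arguments is therefore important. Method: Evaluates two main approaches to identifying subjectivity, inferring it through value prediction and identifying it directly, and experiments with combining contrastive loss and binary cross-entropy loss. Result: Direct subjectivity identification significantly improves model performance, while combining contrastive loss with binary cross-entropy loss does not improve performance but reduces the dependency on per-label subjectivity. Conclusion: Directly identifying subjectivity significantly improves model performance at flagging subjective arguments; combining contrastive loss with binary cross-entropy loss does not improve performance but reduces the dependency on per-label subjectivity. Abstract: Aggregating multiple annotations into a single ground truth label may hide valuable insights into annotator disagreement, particularly in tasks where subjectivity plays a crucial role. In this work, we explore methods for identifying subjectivity in recognizing the human values that motivate arguments. We evaluate two main approaches: inferring subjectivity through value prediction vs. directly identifying subjectivity. Our experiments show that direct subjectivity identification significantly improves the model performance of flagging subjective arguments. Furthermore, combining contrastive loss with binary cross-entropy loss does not improve performance but reduces the dependency on per-label subjectivity. Our proposed methods can help identify arguments that individuals may interpret differently, fostering a more nuanced annotation process.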

[55] Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint

Yanrui Du,Fenglei Fan,Sendong Zhao,Jiawei Cao,Qika Lin,Kai He,Ting Liu,Bing Qin,Mengling Feng

Main category: cs.CL

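TL;DR: This paper proposes ProCon, which uses a projection constraint and an optimized training strategy to mitigate the impact of instruction fine-tuning on LLM safety, improving the model's ability to refuse malicious instructions.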

Details Motivation: IFT can degrade LLM safety, particularly the ability to refuse malicious instructions, so its internal mechanism needs to be studied and addressed. Method: Introduces a projection-constrained loss term, combined with a warm-up strategy and a broadened data distribution to strengthen the constraint signal. Result: Experiments across multiple datasets, scenarios, and LLMs show that ProCon significantly reduces safety risks while preserving task performance gains. Conclusion: ProCon effectively mitigates the safety risks introduced by IFT while preserving task performance gains, and lays a foundation for future LLM safety research. Abstract: Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance various abilities of Large Language Models (LLMs). However, prior studies have shown that IFT can significantly compromise LLMs' safety, particularly their ability to refuse malicious instructions, raising significant concerns. Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. Building on this insight, our study reveals that the r-direction tends to drift during training, which we identify as one of the causes of the associated safety risks. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample's hidden state onto the r-direction. Our initial analysis shows that applying an appropriate constraint can effectively mitigate the refusal direction drift and associated safety risks, but remains limited by overall performance barriers. To overcome this barrier, informed by our observation of early-stage sharp drift and a data-driven perspective, we introduce a warm-up strategy that emphasizes early-stage strong constraints and broaden the data distribution to strengthen constraint signals, leading to an enhanced ProCon method. Experimental results under various datasets, scenarios, and LLMs demonstrate that our method can significantly mitigate safety risks posed by IFT while preserving task performance gains. Even compared with strong baselines, our method consistently delivers superior overall performance. 
Crucially, our analysis indicates that ProCon can contribute to stabilizing the r-direction during training, while such an interpretability-driven exploration of LLMs' internal mechanisms lays a solid foundation for future safety research.
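
One way to picture a projection-constrained loss term is sketched below. The abstract only states that the projection magnitude of each sample's hidden state onto the r-direction is regularized and that early training uses stronger constraints. The squared deviation from a reference model's projection, the linear warm-up schedule, and all names and default values here are assumptions for illustration, not the paper's formulation.

```python
import math


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the (normalized) r-direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


def projection_penalty(hidden, ref_hidden, direction):
    """Assumed penalty: squared drift of the projection from a reference model."""
    drift = projection(hidden, direction) - projection(ref_hidden, direction)
    return drift ** 2


def constraint_weight(step, warmup_steps, strong=1.0, base=0.1):
    """Assumed schedule: start with a strong constraint, decay linearly to `base`."""
    if step >= warmup_steps:
        return base
    return strong + (base - strong) * step / warmup_steps


def total_loss(task_loss, hidden, ref_hidden, direction, step, warmup_steps):
    """Task loss plus the weighted projection penalty."""
    weight = constraint_weight(step, warmup_steps)
    return task_loss + weight * projection_penalty(hidden, ref_hidden, direction)


# The fine-tuned state has moved 2 units along the r-direction [1, 0].
print(total_loss(2.0, [3.0, 4.0], [1.0, 4.0], [1.0, 0.0], step=0, warmup_steps=10))
print(total_loss(2.0, [3.0, 4.0], [1.0, 4.0], [1.0, 0.0], step=10, warmup_steps=10))
```

Only movement along the r-direction is penalized in this sketch. Changes orthogonal to it, which presumably carry much of the task-specific adaptation, leave the penalty at zero.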

[56] MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML

Haoyu Dong,Pengkun Zhang,Mingzhe Lu,Yanzhen Shen,Guolin Ke

Main category: cs.CL

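TL;DR: This paper proposes MachineLearningLM, a continued-pretraining framework that strengthens LLMs' in-context machine learning ability and substantially improves performance on tasks across multiple domains.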

Details Motivation: Large language models struggle to make effective use of many-shot demonstrations through in-context learning on standard machine learning tasks; this paper aims to address that problem. Method: Synthesizes machine learning tasks from structural causal models and distills tree-based decision strategies from a random-forest teacher. Result: Across finance, physics, biology, and healthcare, MachineLearningLM outperforms strong baselines by about 15% on average and exhibits a striking many-shot scaling law. Conclusion: MachineLearningLM's continued-pretraining framework gives LLMs stronger in-context machine learning capability while preserving their general knowledge and reasoning. Abstract: Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.

[57] MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security

Yanrui Du,Fenglei Fan,Sendong Zhao,Jiawei Cao,Ting Liu,Bing Qin

Main category: cs.CL

TL;DR: MoGU_v2 improves the balance between usability and security in Large Language Models by dynamically adapting security features, offering robust performance across diverse model types and use cases.

Details Motivation: To advance the Pareto frontier between LLM usability and security, avoiding trade-offs that lead to either security risks or overly conservative responses that hurt usability. Method: The MoGU framework uses an intra-layer router to dynamically balance contributions between security-optimized and usability-optimized variants; MoGU_v2 improves upon this by embedding routers only in layers encoding highly classifiable security features and enabling bidirectional adaptation through backbone module activation. Result: MoGU_v2 achieves stable improvements across different types of LLMs, including those for resource-constrained environments and interpretability, and can restore security without sacrificing performance even after Instruction Fine-tuning. Conclusion: MoGU_v2 is a robust and versatile solution for mitigating security risks in real-world LLM applications, offering adaptability across various LLM series and maintaining security without compromising task performance. Abstract: As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs' security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs' usability and security, rather than necessitate a trade-off between them. To address this, we propose the MoGU framework, in which the intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial potential, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. 
To overcome these, we further propose an improved MoGU_v2 framework that establishes a tighter coupling between the routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_v2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even facing risks introduced by Instruction Fine-tuning, MoGU_v2 can easily restore security without compromising the task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_v2 as a robust and versatile solution for mitigating security risks in real-world applications.

[58] Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem

Valentin Quesnel,Damien Sileo

Main category: cs.CL

TL;DR: The paper presents a framework that uses E-prover's saturation capabilities on the TPTP axiom library to create a scalable source of symbolic training data for improving the mathematical reasoning of Large Language Models, highlighting a weakness in current models when it comes to deep, structural reasoning tasks.

Details Motivation: The motivation for this work is the scarcity of high-quality, logically sound data that is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Method: The paper's method involves leveraging E-prover's saturation capabilities on the TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems, which is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Result: The result of this work is a framework that eliminates factual errors by construction and reveals a clear weakness in zero-shot experiments on frontier models where performance collapses on tasks requiring deep, structural reasoning. Conclusion: The paper concludes that the framework provides both a diagnostic tool to measure the gap in deep, structural reasoning and a scalable source of symbolic training data to address it. Abstract: The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover's saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for "interesting" theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. 
Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available. https://github.com/sileod/reasoning_core https://hf.co/datasets/reasoning-core/rc1

[59] A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

Max Malyi,Jonathan Shek,Alasdair McDonald,Andre Biscaya

Main category: cs.CL

TL;DR: The paper proposes an open-source framework to evaluate LLMs for classifying wind turbine maintenance logs, highlighting the need for human-in-the-loop systems to enhance data quality and reduce energy costs.

Details Motivation: The motivation stems from the need to reduce the Levelised Cost of Energy (LCOE) by improving the analysis of unstructured turbine maintenance logs. Current challenges in automated analysis necessitate a transparent and reproducible framework for evaluating LLMs. Method: The paper introduces an open-source framework for benchmarking LLMs in classifying turbine maintenance logs. It systematically evaluates various state-of-the-art proprietary and open-source LLMs on criteria like reliability, operational efficiency, and model calibration. Result: The results show a clear performance hierarchy among LLMs, with some models exhibiting high alignment with a benchmark standard and reliable confidence scores. Performance is influenced by task ambiguity, with better consensus on objective tasks compared to interpretive ones. Conclusion: The paper concludes that while certain LLMs perform well in classifying turbine maintenance logs, no model achieves perfect accuracy, and calibration varies significantly. Thus, the most effective near-term solution is a Human-in-the-Loop system where LLMs assist human experts in enhancing data quality and downstream reliability analysis. Abstract: Effective Operation and Maintenance (O&M) is critical to reducing the Levelised Cost of Energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs presents a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking Large Language Models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. 
Our results quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also demonstrate that classification performance is highly dependent on the task's semantic ambiguity, with all models showing higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a Human-in-the-Loop system, where LLMs act as a powerful assistant to accelerate and standardise data labelling for human experts, thereby enhancing O&M data quality and downstream reliability analysis.

[60] COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Eugene Kwek,Wenpeng Yin

Main category: cs.CL

TL;DR: COMPACT improves LLM efficiency via joint pruning of vocabulary and FFN channels, achieving superior performance and reduced resource usage.

Details Motivation: Efficient LLMs are crucial for edge deployment and sustainable inference. Existing pruning methods have limitations in accuracy and deployment. Method: COMPACT jointly prunes rare vocabulary and FFN intermediate channels using common-token-weighted activations. Result: Experiments show COMPACT outperforms prior methods in memory savings, throughput, and performance across multiple LLM families. Conclusion: The proposed COMPACT method effectively prunes LLMs, achieving state-of-the-art performance with reduced parameters, GPU memory, and latency. Abstract: Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.

[61] EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models

Mohammad Reza Mirbagheri,Mohammad Mahdi Mirkamali,Zahra Motoshaker Arani,Ali Javeri,Amir Mahdi Sadeghzadeh,Rasool Jalili

Main category: cs.CL

TL;DR: This study proposes the EPT metric to evaluate the trustworthiness of LLMs in the Persian cultural context, uncovering critical safety deficiencies and alignment gaps with ethical-cultural values.

Details Motivation: Ensuring the trustworthiness of Large Language Models (LLMs) is crucial not only for accurate performance but also for upholding ethical, cultural, and social values, especially in the context of Persian culture. Method: The study introduces the EPT (Evaluation of Persian Trustworthiness) metric and evaluates leading LLMs using both automated and human assessments on a curated labeled dataset focusing on six aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. Result: The evaluation revealed significant deficiencies in the safety aspect of LLMs. Insights were gained into how well these models align with Persian ethical-cultural values, identifying critical gaps and opportunities for improvement. Conclusion: The study concludes that there are significant deficiencies in the safety dimension of current LLMs, highlighting the urgent need for focused improvements. It also emphasizes the importance of aligning AI systems with Persian ethical-cultural values for more trustworthy and responsible AI. Abstract: Large Language Models (LLMs), trained on extensive datasets using advanced deep learning architectures, have demonstrated remarkable performance across a wide range of language tasks, becoming a cornerstone of modern AI technologies. However, ensuring their trustworthiness remains a critical challenge, as reliability is essential not only for accurate performance but also for upholding ethical, cultural, and social values. Careful alignment of training data and culturally grounded evaluation criteria are vital for developing responsible AI systems. In this study, we introduce the EPT (Evaluation of Persian Trustworthiness) metric, a culturally informed benchmark specifically designed to assess the trustworthiness of LLMs across six key aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. 
We curated a labeled dataset and evaluated the performance of several leading models - including ChatGPT, Claude, DeepSeek, Gemini, Grok, LLaMA, Mistral, and Qwen - using both automated LLM-based and human assessments. Our results reveal significant deficiencies in the safety dimension, underscoring the urgent need for focused attention on this critical aspect of model behavior. Furthermore, our findings offer valuable insights into the alignment of these models with Persian ethical-cultural values and highlight critical gaps and opportunities for advancing trustworthy and culturally responsible AI. The dataset is publicly available at: https://github.com/Rezamirbagheri110/EPT-Benchmark.

[62] The Majority is not always right: RL training for solution aggregation

Wenting Zhao,Pranjal Aggarwal,Swarnadeep Saha,Asli Celikyilmaz,Jason Weston,Ilia Kulikov

Main category: cs.CL

TL;DR: This paper introduces AggLM, a reinforcement learning-based aggregator model that improves large language model performance on reasoning tasks by effectively synthesizing multiple candidate solutions.

Details Motivation: Prior approaches to aggregating solutions, such as majority voting or reward model ranking, offer limited benefits. This work aims to enhance aggregation by treating it as an explicit reasoning skill learned through reinforcement learning. Method: An aggregator model is trained using reinforcement learning from verifiable rewards to review, reconcile, and synthesize correct answers from a set of candidate solutions. The training process involves a careful balance of easy and hard examples. Result: The AggLM method surpasses both rule-based and reward-model baselines across multiple benchmarks. It generalizes well to solutions from different models, including stronger ones not seen during training, and achieves this with significantly fewer tokens compared to majority voting. Conclusion: The proposed AggLM method effectively improves the performance of large language models on challenging reasoning tasks by learning to aggregate candidate solutions through reinforcement learning, outperforming existing baselines while using fewer tokens. Abstract: Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. 
Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
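
For context, the majority-voting baseline mentioned in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function name `majority_vote` and the sample answers are hypothetical. It shows the failure case the abstract targets, where a minority answer is discarded regardless of whether it is correct.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among candidate solutions.

    Ties go to the answer seen first, which is how Counter.most_common
    orders elements with equal counts.
    """
    if not answers:
        raise ValueError("need at least one candidate answer")
    return Counter(answers).most_common(1)[0][0]

# Hypothetical candidates: three samples agree, one dissents.
candidates = ["41", "41", "42", "41"]
print(majority_vote(candidates))  # the dissenting "42" can never win here
```

A learned aggregator, as described in the abstract, would instead read the full candidate solutions and could return the minority answer when its reasoning holds up.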

[63] UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction

Joe Wilder,Nikhil Kadapala,Benji Xu,Mohammed Alsaadi,Aiden Parsons,Mitchell Rogers,Palash Agarwal,Adam Hassick,Laura Dietz

Main category: cs.CL

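TL;DR: The study explores several methods for extracting check-worthy claims from social media and finds that, although a fine-tuned FLAN-T5 model achieves the highest METEOR score, other methods sometimes extract higher-quality claims.

Details Motivation: The authors participate in CheckThat! Task 2 (English), aiming to extract check-worthy claims from social media passages using a variety of prompting and in-context learning methods. Method: They explore several prompting and in-context learning methods, including few-shot prompting and fine-tuning with different LLM families, to extract check-worthy claims from social media passages. Result: The best METEOR score is achieved by fine-tuning a FLAN-T5 model. However, in some cases other methods extract higher-quality claims even though their METEOR scores are lower. Conclusion: Although fine-tuned FLAN-T5 performs best on METEOR, other methods can sometimes extract higher-quality claims despite lower METEOR scores. Abstract: We participate in CheckThat! Task 2 English and explore various methods of prompting and in-context learning, including few-shot prompting and fine-tuning with different LLM families, with the goal of extracting check-worthy claims from social media passages. Our best METEOR score is achieved by fine-tuning a FLAN-T5 model. However, we observe that higher-quality claims can sometimes be extracted using other methods, even when their METEOR scores are lower.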

[64] mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Marc Marone,Orion Weller,William Fleshman,Eugene Yang,Dawn Lawrie,Benjamin Van Durme

Main category: cs.CL

TL;DR: mmBERT is a new encoder-only multilingual language model that achieves strong performance on classification and retrieval tasks, even for low-resource languages, using novel training techniques and a large multilingual dataset.

Details Motivation: There is a lack of recent research on encoder-only models, especially multilingual ones. The authors aim to improve performance on classification and retrieval tasks across a wide range of languages, including low-resource ones. Method: The authors introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages, using techniques such as an inverse mask ratio schedule and an inverse temperature sampling ratio. Low-resource languages are added only during the decay phase of training. Result: Despite including low-resource languages only during the decay phase, mmBERT achieves classification performance similar to models like OpenAI's o3 and Google's Gemini 2.5 Pro, while significantly outperforming previous models on both classification and retrieval tasks. Conclusion: mmBERT significantly outperforms previous models on classification and retrieval tasks across both high and low-resource languages. Abstract: Encoder-only language models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5 Pro. 
Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks -- on both high and low-resource languages.

[65] Proof-Carrying Numbers (PCN): A Protocol for Trustworthy Numeric Answers from LLMs via Claim Verification

Aivin V. Solatorio

Main category: cs.CL

TL;DR: This paper proposes Proof-Carrying Numbers (PCN), a protocol that enforces numeric fidelity in LLM outputs by requiring mechanical verification before display, ensuring trust is established through proof.

Details Motivation: The motivation is to address the issue of numeric hallucination in Large Language Models (LLMs), where generated numbers may deviate from real or expected values, undermining trust and reliability. Method: The paper introduces Proof-Carrying Numbers (PCN), a protocol where numeric spans are emitted as claim-bound tokens tied to structured claims. Verification is performed by a renderer based on declared policies, ensuring only verified numbers are marked as correct. Result: The authors formalize PCN and prove its properties, including soundness, completeness under honest tokens, fail-closed behavior, and monotonicity under policy refinement. They demonstrate that PCN prevents spoofing and guarantees reliable numeric output. Conclusion: PCN is a lightweight and model-agnostic protocol that effectively ensures numeric fidelity by making verification a mandatory step before display, thus establishing trust through proof in numerically sensitive settings. Abstract: Large Language Models (LLMs) as stochastic systems may generate numbers that deviate from available data, a failure known as numeric hallucination. Existing safeguards -- retrieval-augmented generation, citations, and uncertainty estimation -- improve transparency but cannot guarantee fidelity: fabricated or misquoted values may still be displayed as if correct. We propose Proof-Carrying Numbers (PCN), a presentation-layer protocol that enforces numeric fidelity through mechanical verification. Under PCN, numeric spans are emitted as claim-bound tokens tied to structured claims, and a verifier checks each token under a declared policy (e.g., exact equality, rounding, aliases, or tolerance with qualifiers). Crucially, PCN places verification in the renderer, not the model: only claim-checked numbers are marked as verified, and all others default to unverified. This separation prevents spoofing and guarantees fail-closed behavior. 
We formalize PCN and prove soundness, completeness under honest tokens, fail-closed behavior, and monotonicity under policy refinement. PCN is lightweight and model-agnostic, integrates seamlessly into existing applications, and can be extended with cryptographic commitments. By enforcing verification as a mandatory step before display, PCN establishes a simple contract for numerically sensitive settings: trust is earned only by proof, while the absence of a mark communicates uncertainty.
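
The renderer-side check described in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `NumericToken` fields, the policy names (`exact`, `round`, `tolerance`), and the `[verified]` mark are hypothetical and are not taken from the PCN specification. The point is only the fail-closed shape: a number is marked verified when a check against a known claim succeeds, and every other path returns unverified.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NumericToken:
    claim_id: str       # structured claim the number is bound to (hypothetical field)
    shown: float        # value the model wants displayed
    policy: str         # "exact", "round", or "tolerance" (hypothetical names)
    param: float = 0.0  # digits for "round", absolute tolerance for "tolerance"

def verify(token, claims):
    """Renderer-side check: anything that cannot be proven is unverified."""
    if token.claim_id not in claims:
        return False  # fail closed: unknown claim
    truth = claims[token.claim_id]
    if token.policy == "exact":
        return token.shown == truth
    if token.policy == "round":
        return token.shown == round(truth, int(token.param))
    if token.policy == "tolerance":
        return abs(token.shown - truth) <= token.param
    return False  # fail closed: unknown policy

def render(token, claims):
    """Attach a mark only when verification succeeds."""
    mark = "[verified]" if verify(token, claims) else "[unverified]"
    return f"{token.shown} {mark}"

claims = {"pi": 3.14159}  # hypothetical claim store
print(render(NumericToken("pi", 3.14, "round", 2), claims))
print(render(NumericToken("missing", 1.0, "exact"), claims))
```

Because the model only emits tokens and never the mark itself, a fabricated number cannot be displayed as verified in this sketch, which mirrors the separation the abstract emphasizes.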

[66] Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

Liang Chen,Xueting Han,Li Shen,Jing Bai,Kam-Fai Wong

Main category: cs.CL

TL;DR: This study introduces a bilevel optimization approach to enhance the cooperation between supervised fine-tuning and reinforcement learning, improving the effectiveness and efficiency of learning reasoning models.

Details Motivation: The motivation stems from the inefficiency of reinforcement learning due to its trial-and-error nature and the limitations of the decoupled two-stage approach involving supervised fine-tuning and RL, which restricts interaction and effectiveness. Method: The method employs bilevel optimization where the SFT objective is conditioned on the optimal RL policy, allowing SFT to guide RL's optimization process. The lower level performs RL updates with SFT supervision, while the upper level maximizes the cooperative gain. Result: Empirical evaluations on five reasoning benchmarks show that the proposed method consistently outperforms baselines, achieving a better balance between effectiveness and efficiency. Conclusion: The study concludes that the proposed bilevel optimization method enhances the cooperation between SFT and RL, leading to improved effectiveness and efficiency in learning reasoning models. Abstract: Reinforcement learning (RL) has proven effective in incentivizing the reasoning abilities of large language models (LLMs), but suffers from severe efficiency challenges due to its trial-and-error nature. While the common practice employs supervised fine-tuning (SFT) as a warm-up stage for RL, this decoupled two-stage approach limits interaction between SFT and RL, thereby constraining overall effectiveness. This study introduces a novel method for learning reasoning models that employs bilevel optimization to facilitate better cooperation between these training paradigms. By conditioning the SFT objective on the optimal RL policy, our approach enables SFT to meta-learn how to guide RL's optimization process. During training, the lower level performs RL updates while simultaneously receiving SFT supervision, and the upper level explicitly maximizes the cooperative gain, i.e., the performance advantage of joint SFT-RL training over RL alone. 
Empirical evaluations on five reasoning benchmarks demonstrate that our method consistently outperforms baselines and achieves a better balance between effectiveness and efficiency.

[67] Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Yinjie Wang,Ling Yang,Bowen Li,Ye Tian,Ke Shen,Mengdi Wang

Main category: cs.CL

TL;DR: TraceRL is a trajectory-aware reinforcement learning framework for diffusion language models that improves reasoning performance on complex math and coding tasks and is applicable across architectures.

Details Motivation: To address the weak performance of diffusion language models on reasoning tasks while improving training stability and sampling flexibility. Method: TraceRL, a trajectory-aware reinforcement learning framework equipped with a diffusion-based value model, incorporates the preferred inference trajectory into post-training and uses curriculum learning to strengthen long chain-of-thought reasoning. Result: The TraDo model series performs strongly on math reasoning; for example, TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Conclusion: TraceRL effectively improves the reasoning performance of diffusion language models and supports application across architectures, providing a strong basis for future research and practical use. Abstract: We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates the preferred inference trajectory into post-training and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. It can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL

[68] On the Same Wavelength? Evaluating Pragmatic Reasoning in Language Models across Broad Concepts

Linlu Qiu,Cedegao E. Zhang,Joshua B. Tenenbaum,Yoon Kim,Roger P. Levy

Main category: cs.CL

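TL;DR: This paper proposes a framework for evaluating the pragmatic reasoning abilities of language models and finds that state-of-the-art models perform strongly on comprehension, while Chain-of-Thought prompting and Bayesian (Rational Speech Act) reasoning bring significant improvements on production.

Details Motivation: As language models are increasingly used as conversational agents, understanding their pragmatic reasoning abilities becomes especially important. Method: The study proposes an evaluation framework based on the Wavelength communication game to assess language models on both language comprehension and language production, and explores applying Bayesian pragmatic reasoning to model inference. Result: State-of-the-art language models achieve human-like accuracy on language comprehension and correlate highly with human judgments; on language production, Chain-of-Thought prompting and the Rational Speech Act approach significantly outperform direct prompting. Conclusion: State-of-the-art language models perform well on language comprehension, while on language production Chain-of-Thought prompting and the Rational Speech Act model can significantly improve performance, suggesting a potential route to improving the pragmatic reasoning abilities of language models. Abstract: Language use is shaped by pragmatics -- i.e., reasoning about communicative goals and norms in context. As language models (LMs) are increasingly used as conversational agents, it becomes ever more important to understand their pragmatic reasoning abilities. We propose an evaluation framework derived from Wavelength, a popular communication game where a speaker and a listener communicate about a broad range of concepts in a granular manner. We study a range of LMs on both language comprehension and language production using direct and Chain-of-Thought (CoT) prompting, and further explore a Rational Speech Act (RSA) approach to incorporating Bayesian pragmatic reasoning into LM inference. We find that state-of-the-art LMs, but not smaller ones, achieve strong performance on language comprehension, obtaining similar-to-human accuracy and exhibiting high correlations with human judgments even without CoT prompting or RSA. On language production, CoT can outperform direct prompting, and using RSA provides significant improvements over both approaches. Our study helps identify the strengths and limitations in LMs' pragmatic reasoning abilities and demonstrates the potential for improving them with RSA, opening up future avenues for understanding conceptual representation, language understanding, and social reasoning in LMs and humans.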

cs.CV [Back]

[69] Label Smoothing++: Enhanced Label Regularization for Training Neural Networks

Sachin Chhabra,Hemanth Venkateswara,Baoxin Li

Main category: cs.CV

TL;DR: This paper proposes Label Smoothing++, an improved label smoothing method that accounts for inter-class relationships to mitigate neural network overconfidence and improve generalization.

Details Motivation: Conventional label smoothing assigns the same probability to every non-target class and thus ignores inter-class relationships, so a more effective method is needed to mitigate overconfidence and overfitting. Method: The paper proposes Label Smoothing++, a new label smoothing method that uses a fixed label for the target class while letting the network learn the labels associated with non-target classes, thereby accounting for inter-class relationships. Result: Experiments on multiple datasets show that Label Smoothing++ mitigates neural network overconfidence and improves generalization. Conclusion: Label Smoothing++ is an effective label regularization training strategy that mitigates overconfident predictions while promoting inter-class relationships and generalization. Abstract: Training neural networks with one-hot target labels often results in overconfidence and overfitting. Label smoothing addresses this issue by perturbing the one-hot target labels by adding a uniform probability vector to create a regularized label. Although label smoothing improves the network's generalization ability, it assigns equal importance to all the non-target classes, which destroys the inter-class relationships. In this paper, we propose a novel label regularization training strategy called Label Smoothing++, which assigns non-zero probabilities to non-target classes and accounts for their inter-class relationships. Our approach uses a fixed label for the target class while enabling the network to learn the labels associated with non-target classes. Through extensive experiments on multiple datasets, we demonstrate how Label Smoothing++ mitigates overconfident predictions while promoting inter-class relationships and generalization capabilities.
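
As background, the uniform label smoothing baseline that the abstract contrasts against can be written out directly. This sketch covers only that standard baseline, in its common form (1 - eps) * one_hot + eps / K; it does not implement Label Smoothing++, whose learned non-target labels are not specified in the abstract. The function name and the value of `eps` are illustrative choices.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Uniform label smoothing: (1 - eps) * one_hot + eps / num_classes."""
    if not 0 <= target < num_classes:
        raise ValueError("target out of range")
    off = eps / num_classes  # mass given to every class, target included
    return [(1.0 - eps) + off if k == target else off
            for k in range(num_classes)]

# Every non-target class receives the same probability, which is the
# property the abstract criticizes.
print(smooth_labels(target=2, num_classes=4, eps=0.1))
```

All non-target entries are identical here, so a visually similar class and an unrelated class are treated the same way.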

[70] VILOD: A Visual Interactive Labeling Tool for Object Detection

Isac Holm

Main category: cs.CV

TL;DR: This paper introduces VILOD, a visual interactive labeling tool for object detection that enhances human-AI collaboration through interactive visualizations.

Details Motivation: The motivation is to address the challenge of acquiring large, accurately labeled datasets for object detection by making active learning more transparent and inclusive of human expertise. Method: Development and empirical investigation of VILOD, which uses t-SNE projections, uncertainty heatmaps, and model state views to support interactive visual labeling. Result: VILOD's visually-guided labeling strategies were shown to yield competitive OD performance compared to automated uncertainty sampling baselines. Conclusion: The paper concludes that VILOD provides a transparent and manageable HITL-AL workflow for OD annotation, enhancing the interpretability of model states and dataset characteristics. Abstract: The advancement of Object Detection (OD) using Deep Learning (DL) is often hindered by the significant challenge of acquiring large, accurately labeled datasets, a process that is time-consuming and expensive. While techniques like Active Learning (AL) can reduce annotation effort by intelligently querying informative samples, they often lack transparency, limit the strategic insight of human experts, and may overlook informative samples not aligned with an employed query strategy. To mitigate these issues, Human-in-the-Loop (HITL) approaches integrating human intelligence and intuition throughout the machine learning life-cycle have gained traction. Leveraging Visual Analytics (VA), effective interfaces can be created to facilitate this human-AI collaboration. This thesis explores the intersection of these fields by developing and investigating "VILOD: A Visual Interactive Labeling tool for Object Detection". VILOD utilizes components such as a t-SNE projection of image features, together with uncertainty heatmaps and model state views, enabling users to explore data, interpret model states and AL suggestions, and implement diverse sample selection strategies within an iterative HITL workflow for OD. 
An empirical investigation using comparative use cases demonstrated how VILOD, through its interactive visualizations, facilitates the implementation of distinct labeling strategies by making the model's state and dataset characteristics more interpretable (RQ1). The study showed that different visually-guided labeling strategies employed within VILOD result in competitive OD performance trajectories compared to an automated uncertainty sampling AL baseline (RQ2). This work contributes a novel tool and empirical insight into making the HITL-AL workflow for OD annotation more transparent, manageable, and potentially more effective.

[71] Context-Aware Knowledge Distillation with Adaptive Weighting for Image Classification

Zhengda Li

Main category: cs.CV

TL;DR: This paper proposes an adaptive knowledge distillation framework that dynamically adjusts the balance between hard and soft supervision, leading to improved accuracy and convergence stability compared to traditional methods.

Details Motivation: Traditional KD uses a static alpha, which is suboptimal as the ideal balance between hard and soft supervision may change during training. Method: AKD introduces a learnable parameter alpha and a Context-Aware Module to dynamically adjust the balance between hard and soft supervision during training. Result: Experiments on CIFAR-10 showed that AKD surpasses fixed-weight KD baselines in accuracy and provides more stable convergence. Conclusion: The proposed AKD framework outperforms traditional KD methods in terms of accuracy and convergence stability. Abstract: Knowledge distillation (KD) is a widely used technique to transfer knowledge from a large teacher network to a smaller student model. Traditional KD uses a fixed balancing factor alpha as a hyperparameter to combine the hard-label cross-entropy loss with the soft-label distillation loss. However, a static alpha is suboptimal because the optimal trade-off between hard and soft supervision can vary during training. In this work, we propose an Adaptive Knowledge Distillation (AKD) framework. First, we make alpha a learnable parameter that is optimized automatically during training. Then we introduce a formula that computes alpha dynamically from the gap between the student and the teacher, and further introduce a Context-Aware Module (CAM) using MLP + Attention to adaptively reweight class-wise teacher outputs. Experiments on CIFAR-10 with ResNet-50 as teacher and ResNet-18 as student demonstrate that our approach achieves superior accuracy compared to fixed-weight KD baselines, and yields more stable convergence.
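
The fixed-weight KD baseline referred to in the abstract can be sketched as follows. This is the widely used formulation (cross-entropy on hard labels plus a temperature-scaled KL term), written in plain Python for a single example; it is not the AKD method. Which term is multiplied by alpha, the temperature value, and the function names are conventions assumed here, since the abstract does not fix them.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=4.0):
    """alpha * CE(student, target) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -math.log(softmax(student_logits)[target])
    # Soft-label term: KL divergence between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft

# Hypothetical logits for a 3-class example.
print(kd_loss([1.0, 2.0, 0.5], [0.5, 3.0, 0.2], target=1, alpha=0.7))
```

With alpha held constant, the balance between the two terms cannot respond to how far the student is from the teacher at a given point in training, which is the limitation the abstract sets out to address.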

[72] A Dataset Generation Scheme Based on Video2EEG-SPGN-Diffusion for SEED-VD

Yunfei Guo,Tao Zhang,Wu Huang,Yao Song

Main category: cs.CV

TL;DR: This paper introduces Video2EEG-SPGN-Diffusion, a new open-source framework for generating EEG signals conditioned on video stimuli, and releases a new dataset of over 1000 samples to advance multimodal research and emotion analysis.

Details Motivation: Advancing research on multimodal large models requires a framework that can align video and EEG data, enabling progress in emotion analysis, data augmentation, and brain-computer interface applications. Method: Using the SEED-VD dataset, personalized EEG signals are generated with a self-play graph network (SPGN) combined with a diffusion model, and an engineering pipeline for aligning video and EEG data is disclosed. Result: A new dataset of over 1000 samples is released, each comprising a SEED-VD video stimulus, generated 62-channel EEG signals, and emotion labels. Conclusion: The paper presents Video2EEG-SPGN-Diffusion, a new open-source framework for generating EEG signals conditioned on video stimuli, which advances multimodal research. Abstract: This paper introduces an open-source framework, Video2EEG-SPGN-Diffusion, that leverages the SEED-VD dataset to generate a multimodal dataset of EEG signals conditioned on video stimuli. Additionally, we disclose an engineering pipeline for aligning video and EEG data pairs, facilitating the training of multimodal large models with EEG alignment capabilities. Personalized EEG signals are generated using a self-play graph network (SPGN) integrated with a diffusion model. As a major contribution, we release a new dataset comprising over 1000 samples of SEED-VD video stimuli paired with generated 62-channel EEG signals at 200 Hz and emotion labels, enabling video-EEG alignment and advancing multimodal research. This framework offers novel tools for emotion analysis, data augmentation, and brain-computer interface applications, with substantial research and engineering significance.

[73] Application of discrete Ricci curvature in pruning randomly wired neural networks: A case study with chest x-ray classification of COVID-19

Pavithra Elumalai,Sudharsan Vijayaraghavan,Madhumita Mondal,Areejit Samal

Main category: cs.CV

TL;DR: This study explores the use of Forman-Ricci curvature (FRC) for pruning Randomly Wired Neural Networks (RWNNs) in comparison to Ollivier-Ricci curvature (ORC) and Edge Betweenness Centrality (EBC). It shows that FRC can achieve similar performance to ORC with significantly lower computational cost, providing insights into the structural properties of pruned networks.

Details Motivation: The motivation of the study is to explore whether Forman-Ricci curvature (FRC), which is computationally more efficient than Ollivier-Ricci curvature (ORC), can achieve comparable pruning effectiveness. It also aims to understand how different edge-centric network measures impact the compression and performance of Randomly Wired Neural Networks (RWNNs). Method: The study investigates three edge-centric network measures - Forman-Ricci curvature (FRC), Ollivier-Ricci curvature (ORC), and Edge Betweenness Centrality (EBC) - for compressing RWNNs. The measures are applied across three network generators: Erdős-Rényi (ER), Watts-Strogatz (WS), and Barabási-Albert (BA) models. Structural properties of pruned networks, including modularity and global efficiency, are analyzed. Result: The results show that FRC-based pruning can effectively simplify RWNNs while maintaining accuracy, specificity, and sensitivity comparable to ORC-based pruning. FRC offers significant computational advantages over ORC. A comparative analysis of compression ratio and theoretical speedup among FRC, ORC, and EBC is provided, along with insights into the structural properties of pruned networks through modularity and global efficiency. Conclusion: The study concludes that FRC-based pruning can effectively simplify Randomly Wired Neural Networks (RWNNs) while maintaining performance comparable to Ollivier-Ricci curvature (ORC) pruning, offering significant computational advantages. It also highlights the trade-off between modular segregation and network efficiency in compressed RWNNs. Abstract: Randomly Wired Neural Networks (RWNNs) serve as a valuable testbed for investigating the impact of network topology in deep learning by capturing how different connectivity patterns impact both learning efficiency and model performance. At the same time, they provide a natural framework for exploring edge-centric network measures as tools for pruning and optimization. 
In this study, we investigate three edge-centric network measures: Forman-Ricci curvature (FRC), Ollivier-Ricci curvature (ORC), and edge betweenness centrality (EBC), to compress RWNNs by selectively retaining important synapses (or edges) while pruning the rest. As a baseline, RWNNs are trained for COVID-19 chest x-ray image classification, aiming to reduce network complexity while preserving performance in terms of accuracy, specificity, and sensitivity. We extend prior work on pruning RWNN using ORC by incorporating two additional edge-centric measures, FRC and EBC, across three network generators: Erdős-Rényi (ER) model, Watts-Strogatz (WS) model, and Barabási-Albert (BA) model. We provide a comparative analysis of the pruning performance of the three measures in terms of compression ratio and theoretical speedup. A central focus of our study is to evaluate whether FRC, which is computationally more efficient than ORC, can achieve comparable pruning effectiveness. Along with performance evaluation, we further investigate the structural properties of the pruned networks through modularity and global efficiency, offering insights into the trade-off between modular segregation and network efficiency in compressed RWNNs. Our results provide initial evidence that FRC-based pruning can effectively simplify RWNNs, offering significant computational advantages while maintaining performance comparable to ORC.
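
To give a sense of why FRC is cheap to compute, here is a minimal sketch of the simplest form of Forman-Ricci curvature on an unweighted, undirected graph, where the curvature of an edge (u, v) is 4 - deg(u) - deg(v). This is an assumption for illustration: the abstract does not say which variant of FRC the study uses (for example, whether triangles or edge weights are included), nor which curvature range is retained during pruning. The toy graph is hypothetical.

```python
from collections import defaultdict

def forman_ricci(edges):
    """Map each undirected edge (u, v) to 4 - deg(u) - deg(v).

    Assumes a simple graph: no self-loops and no duplicate edges.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

# Hypothetical toy graph: node "a" touches three edges, plus one edge c-d.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("c", "d")]
curvature = forman_ricci(edges)

# Rank edges by curvature; a pruning rule would keep some slice of this list.
for edge, value in sorted(curvature.items(), key=lambda item: item[1]):
    print(edge, value)
```

Only node degrees are needed, so this runs in time linear in the number of edges, whereas Ollivier-Ricci curvature requires solving an optimal transport problem per edge.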

[74] Optical Music Recognition of Jazz Lead Sheets

Juan Carlos Martinez-Sevilla,Francesco Foscarin,Patricia Garcia-Iasci,David Rizo,Jorge Calvo-Zaragoza,Gerhard Widmer

Main category: cs.CV

TL;DR: This paper studies optical music recognition (OMR) for handwritten jazz lead sheets, presenting a new dataset and an OMR model, and discussing data-specific tokenisation choices and the advantages of using synthetic scores and pretrained models.

Details Motivation: The motivation is to address the challenge that existing OMR systems cannot handle the variability and quality issues associated with handwritten images. Method: The paper provides a dataset of 293 handwritten jazz lead sheets and develops an OMR model for jazz lead sheets. Result: The outcome is a new dataset and an OMR model, together with a discussion of data-specific tokenisation choices and the advantages of using synthetic scores and pretrained models. Conclusion: The authors address the challenge of OMR for handwritten jazz lead sheets and develop an OMR model. They also discuss data-specific tokenisation choices and the advantages of using synthetic scores and pretrained models. All code, data, and models are publicly released. Abstract: In this paper, we address the challenge of Optical Music Recognition (OMR) for handwritten jazz lead sheets, a widely used musical score type that encodes melody and chords. The task is challenging due to the presence of chords, a score component not handled by existing OMR systems, and the high variability and quality issues associated with handwritten images. Our contribution is two-fold. We present a novel dataset consisting of 293 handwritten jazz lead sheets of 163 unique pieces, amounting to 2021 total staves aligned with Humdrum **kern and MusicXML ground truth scores. We also supply synthetic score images generated from the ground truth. The second contribution is the development of an OMR model for jazz lead sheets. We discuss specific tokenisation choices related to our kind of data, and the advantages of using synthetic scores and pretrained models. We publicly release all code, data, and models.

[75] RT-VLM: Re-Thinking Vision Language Model with 4-Clues for Real-World Object Recognition Robustness

Junghyun Park,Tuan Anh Nguyen,Dugki Min

Main category: cs.CV

TL;DR: This paper proposes the RT-VLM framework, which uses synthetic data generation and a self-critique mechanism to improve the robustness of vision models under domain shift.

Details Motivation: Modern object recognition models degrade severely under domain shift, including changes in low-level image statistics, changes in object pose and viewpoint, partial occlusion, and visual confusion between classes. Method: A synthetic dataset generation pipeline produces images annotated with "4-Clues", and Llama 3.2 11B Vision Instruct is tuned on this resource with parameter-efficient supervised tuning. At inference time, a two-stage Re-Thinking scheme is executed. Result: RT-VLM consistently surpasses strong baselines on robustness benchmarks that isolate individual domain shifts. Conclusion: By combining structured multimodal evidence with an explicit self-critique loop, RT-VLM offers a promising route toward reliable and transferable visual understanding. Abstract: Real-world deployments often expose modern object recognition models to domain shifts that precipitate a severe drop in accuracy. Such shifts encompass (i) variations in low-level image statistics, (ii) changes in object pose and viewpoint, (iii) partial occlusion, and (iv) visual confusion across adjacent classes. To mitigate this degradation, we introduce the Re-Thinking Vision Language Model (RT-VLM) framework. The foundation of this framework is a unique synthetic dataset generation pipeline that produces images annotated with "4-Clues": precise bounding boxes, class names, detailed object-level captions, and a comprehensive context-level caption for the entire scene. We then perform parameter-efficient supervised tuning of Llama 3.2 11B Vision Instruct on this resource. At inference time, a two-stage Re-Thinking scheme is executed: the model first emits its own four clues, then re-examines these responses as evidence and iteratively corrects them. Across robustness benchmarks that isolate individual domain shifts, RT-VLM consistently surpasses strong baselines. These findings indicate that the integration of structured multimodal evidence with an explicit self-critique loop constitutes a promising route toward reliable and transferable visual understanding.

[76] A Real-Time, Vision-Based System for Badminton Smash Speed Estimation on Mobile Devices

Diwen Huang

Main category: cs.CV

TL;DR: This paper introduces a smartphone-based system for measuring badminton smash speed using computer vision and video analysis, making performance tracking affordable and accessible to all players.

Details Motivation: The motivation stems from the lack of affordable and accessible technology for measuring performance metrics like shot speed in amateur and recreational sports, particularly in badminton. Method: The system uses a custom-trained YOLOv5 model for shuttlecock detection, a Kalman filter for trajectory tracking, and video-based kinematic speed estimation with spatiotemporal scaling to calculate shuttlecock velocity. Result: The result is a functional mobile application that accurately estimates smash speed using standard smartphone video recordings, making advanced performance analytics widely accessible. Conclusion: The paper concludes that the proposed smartphone-based system effectively and affordably measures badminton smash speed, offering an accessible solution for players at all levels. Abstract: Performance metrics in sports, such as shot speed and angle, provide crucial feedback for athlete development. However, the technology to capture these metrics has historically been expensive, complex, and largely inaccessible to amateur and recreational players. This paper addresses this gap in the context of badminton, one of the world's most popular sports, by introducing a novel, cost-effective, and user-friendly system for measuring smash speed using ubiquitous smartphone technology. Our approach leverages a custom-trained YOLOv5 model for shuttlecock detection, combined with a Kalman filter for robust trajectory tracking. By implementing a video-based kinematic speed estimation method with spatiotemporal scaling, the system automatically calculates the shuttlecock's velocity from a standard video recording. The entire process is packaged into an intuitive mobile application, democratizing access to high-level performance analytics and empowering players at all levels to analyze and improve their game.
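
The speed-estimation step can be illustrated with basic kinematics: convert a tracked pixel displacement to metres, divide by the elapsed time derived from the frame rate, and convert to km/h. The abstract does not detail the "spatiotemporal scaling" it uses, so the fixed `metres_per_pixel` scale (for example, derived from a known court dimension) and the function name are assumptions in this minimal sketch.

```python
import math


def smash_speed_kmh(p_start, p_end, frames_elapsed, fps, metres_per_pixel):
    """Estimate speed from two tracked shuttlecock positions given in pixels.

    Assumes motion roughly parallel to the image plane and a single,
    constant metres-per-pixel scale (a simplification).
    """
    pixel_dist = math.hypot(p_end[0] - p_start[0], p_end[1] - p_start[1])
    metres = pixel_dist * metres_per_pixel      # spatial scaling
    seconds = frames_elapsed / fps              # temporal scaling
    return metres / seconds * 3.6               # m/s -> km/h


# Hypothetical values: 500 px travelled over 3 frames of 60 fps video, 1 cm per pixel.
speed = smash_speed_kmh((100, 200), (400, 600), frames_elapsed=3, fps=60,
                        metres_per_pixel=0.01)
print(f"{speed:.1f} km/h")
```

In a full pipeline the two positions would come from the detector and the Kalman-filtered track rather than being supplied by hand.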

[77] A Stroke-Level Large-Scale Database of Chinese Character Handwriting and the OpenHandWrite_Toolbox for Handwriting Research

Zebo Xu,Shaoyun Yu,Mark Torrance,Guido Nottbusch,Nan Zhao,Zhenguang Cai

Main category: cs.CV

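TL;DR: This study builds a large-scale database of Chinese character handwriting and upgrades a handwriting data-processing toolbox, revealing how linguistic components modulate handwriting at different levels and providing important support for future cross-linguistic handwriting research.

Details Motivation: It remains unclear which linguistic components (e.g., phonological, semantic, and orthographic systems) modulate Chinese handwriting at the character, radical, and stroke levels, and there is a lack of comprehensive tools for capturing and batch-processing fine-grained handwriting data. Method: A large-scale handwriting database was constructed in which each participant handwrote 1200 characters; the existing handwriting toolkit was enhanced to support experimental design modification, stroke-level trajectory capture, and batch processing of handwriting measurements; multiple regression was used to examine how orthographic, phonological, and other factors affect handwriting preparation and execution. Result: Orthographic predictors affect handwriting preparation and execution at the character, radical, and stroke levels; phonological factors affect execution at all three levels; lexical effects show hierarchical attenuation, being most pronounced at the character level, followed by the radical level, and weakest at the stroke level. Conclusion: The study built a large-scale handwriting database and improved the existing handwriting toolkit, providing valuable resources for future psycholinguistic and neurolinguistic research on character and sub-character handwriting across languages. Abstract: Understanding what linguistic components (e.g., phonological, semantic, and orthographic systems) modulate Chinese handwriting at the character, radical, and stroke levels remains an important yet understudied topic. Additionally, there is a lack of comprehensive tools for capturing and batch-processing fine-grained handwriting data. To address these issues, we constructed a large-scale handwriting database in which 42 Chinese speakers each handwrote 1200 characters in a handwriting-to-dictation task. Additionally, we enhanced the existing handwriting package and provided comprehensive documentation for the upgraded OpenHandWrite_Toolbox, which can easily modify the experimental design, capture the stroke-level handwriting trajectory, and batch-process handwriting measurements (e.g., latency, duration, and pen-pressure). In analysing our large-scale database, multiple regression results show that orthographic predictors impact handwriting preparation and execution across character, radical, and stroke levels. Phonological factors also influence execution at all three levels. Importantly, these lexical effects demonstrate hierarchical attenuation: they were most pronounced at the character level, followed by the radical level, and were weakest at the stroke level. These findings demonstrate that handwriting preparation and execution at the radical and stroke levels are closely intertwined with linguistic components. 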
This database and toolbox offer valuable resources for future psycholinguistic and neurolinguistic research on the handwriting of characters and sub-characters across different languages.

[78] Anticipatory Fall Detection in Humans with Hybrid Directed Graph Neural Networks and Long Short-Term Memory

Younggeol Cho,Gokhan Solak,Olivia Nocentini,Marta Lorenzini,Andrea Fortuna,Arash Ajoudani

Main category: cs.CV

TL;DR: This paper proposes a hybrid DGNN-LSTM model for anticipatory fall detection, effectively predicting and classifying falls with high accuracy using real-time skeletal data.

Details Motivation: Detecting and preventing falls is crucial for assistive robotic systems, but predicting falls before they happen and analyzing the transient state between stability and an impending fall remain unexplored. Method: The method uses a hybrid model combining Dynamic Graph Neural Networks (DGNN) and Long Short-Term Memory (LSTM) networks, employing real-time skeletal features from video sequences. DGNN classifies gait states while LSTM predicts future movements. Result: The model outperformed existing approaches in prediction error and recognition accuracy on the OUMVLP-Pose and URFD datasets, showing that decoupling prediction and classification improves performance. Conclusion: The proposed method successfully anticipates falls with high accuracy by decoupling motion prediction and gait classification tasks using a hybrid model of DGNN and LSTM, offering insights that could enhance advanced assistance systems. Abstract: Detecting and preventing falls in humans is a critical component of assistive robotic systems. While significant progress has been made in detecting falls, the prediction of falls before they happen, and analysis of the transient state between stability and an impending fall remain unexplored. In this paper, we propose an anticipatory fall detection method that utilizes a hybrid model combining Dynamic Graph Neural Networks (DGNN) with Long Short-Term Memory (LSTM) networks and decouples the motion prediction and gait classification tasks to anticipate falls with high accuracy. Our approach employs real-time skeletal features extracted from video sequences as input for the proposed model. The DGNN acts as a classifier, distinguishing between three gait states: stable, transient, and fall. The LSTM-based network then predicts human movement in subsequent time steps, enabling early detection of falls. 
The proposed model was trained and validated using the OUMVLP-Pose and URFD datasets, demonstrating superior performance in terms of prediction error and recognition accuracy compared to models relying solely on DGNN and models from literature. The results indicate that decoupling prediction and classification improves performance compared to addressing the unified problem using only the DGNN. Furthermore, our method allows for the monitoring of the transient state, offering valuable insights that could enhance the functionality of advanced assistance systems.

[79] Comparative Evaluation of Hard and Soft Clustering for Precise Brain Tumor Segmentation in MR Imaging

Dibya Jyoti Bora,Mrinal Kanti Mishra

Main category: cs.CV

TL;DR: This study compares K-Means and Fuzzy C-Means (FCM) for brain tumor MRI segmentation, finding that K-Means is faster but less accurate, while FCM is more accurate but computationally more expensive.

Details Motivation: Segmenting brain tumors in MRI is challenging because tumor morphology and intensity distributions are heterogeneous, and accurate delineation of tumor boundaries is critical for clinical decision-making, radiotherapy planning, and disease monitoring. Method: The study applies hard clustering (K-Means) and soft clustering (FCM), validates them experimentally on the BraTS2020 dataset, and uses Gaussian filtering and CLAHE as pre-processing. Result: K-Means has the lower average runtime (0.3s per image), whereas FCM achieves a higher average Dice Similarity Coefficient (DSC) (0.67 versus 0.43) at a higher computational cost (1.3s per image). Conclusion: K-Means is superior in computational efficiency and FCM in segmentation accuracy, so each has its own strengths and weaknesses for brain tumor MRI segmentation. Abstract: Segmentation of brain tumors from Magnetic Resonance Imaging (MRI) remains a pivotal challenge in medical image analysis due to the heterogeneous nature of tumor morphology and intensity distributions. Accurate delineation of tumor boundaries is critical for clinical decision-making, radiotherapy planning, and longitudinal disease monitoring. In this study, we perform a comprehensive comparative analysis of two major clustering paradigms applied in MRI tumor segmentation: hard clustering, exemplified by the K-Means algorithm, and soft clustering, represented by Fuzzy C-Means (FCM). While K-Means assigns each pixel strictly to a single cluster, FCM introduces partial memberships, meaning each pixel can belong to multiple clusters with varying degrees of association. Experimental validation was performed using the BraTS2020 dataset, incorporating pre-processing through Gaussian filtering and Contrast Limited Adaptive Histogram Equalization (CLAHE). Evaluation metrics included the Dice Similarity Coefficient (DSC) and processing time, which collectively demonstrated that K-Means achieved superior speed with an average runtime of 0.3s per image, whereas FCM attained higher segmentation accuracy with an average DSC of 0.67 compared to 0.43 for K-Means, albeit at a higher computational cost (1.3s per image). These results highlight the inherent trade-off between computational efficiency and boundary precision.
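
The difference between hard and soft assignment can be seen on a single pixel intensity. The sketch below contrasts the K-Means nearest-center rule with the standard FCM membership formula u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)); the intensity value, the three cluster centers, and the fuzzifier m = 2 are illustrative assumptions, not settings taken from the study.

```python
def hard_assign(x, centers):
    """K-Means style: index of the single nearest center."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))


def fcm_memberships(x, centers, m=2.0):
    """Fuzzy C-Means style: one membership degree per center, summing to 1."""
    dists = [abs(x - c) for c in centers]
    if any(d == 0 for d in dists):
        # The point coincides with a center: give it full membership there.
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]


# Hypothetical normalised intensity and cluster centers (dark, mid, bright).
pixel = 0.4
centers = [0.1, 0.5, 0.9]
label = hard_assign(pixel, centers)
memberships = fcm_memberships(pixel, centers)
print(label)
print([round(u, 3) for u in memberships])
```

The soft memberships retain how close the pixel is to neighbouring clusters, which is the information FCM exploits near tumor boundaries, and computing them for every pixel and center at every iteration is one reason for the higher runtime.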

[80] Handling imbalance and few-sample size in ML based Onion disease classification

Abhijeet Manoj Pal,Rajbabu Velmurugan

Main category: cs.CV

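TL;DR: This paper proposes a deep learning based multi-class classification model that integrates attention modules and a data augmentation pipeline to classify onion crop diseases and pests with high accuracy.

Details Motivation: Accurate classification of pests and diseases plays a vital role in precision agriculture, but current methods focus mainly on binary classification, which limits their practical use, especially where the specific type of disease or pest must be identified. Method: A pre-trained Convolutional Neural Network (CNN) is enhanced with attention modules and combined with a comprehensive data augmentation pipeline. Result: The model reaches 96.90% overall accuracy and an F1 score of 0.96 on a real-world field image dataset, outperforming other approaches that use the same dataset. Conclusion: The paper proposes a deep learning based multi-class model for accurate classification of onion crop diseases and pests, integrating attention modules and using a comprehensive data augmentation pipeline to mitigate class imbalance. Abstract: Accurate classification of pests and diseases plays a vital role in precision agriculture, enabling efficient identification, targeted interventions, and preventing their further spread. However, current methods primarily focus on binary classification, which limits their practical applications, especially in scenarios where accurately identifying the specific type of disease or pest is essential. We propose a robust deep learning based model for multi-class classification of onion crop diseases and pests. We enhance a pre-trained Convolutional Neural Network (CNN) model by integrating attention based modules and employing a comprehensive data augmentation pipeline to mitigate class imbalance. We propose a model which gives 96.90% overall accuracy and 0.96 F1 score on a real-world field image dataset. This model gives better results than other approaches using the same datasets.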

[81] Delta Velocity Rectified Flow for Text-to-Image Editing

Gaspard Beaudouin,Minghan Li,Jaeyeon Kim,Sunghoon Yoon,Mengyu Wang

Main category: cs.CV

TL;DR: Delta Velocity Rectified Flow (DVRF) improves text-to-image editing by modeling velocity field discrepancies and using a time-dependent shift term for better alignment and reduced over-smoothing.

Details Motivation: To address over-smoothing artifacts in prior distillation sampling approaches and improve alignment with the target distribution in text-to-image editing tasks. Method: DVRF employs a distillation-based approach, explicitly modeling the discrepancy between source and target velocity fields and incorporating a time-dependent shift term to align noisy latents with the target trajectory. Result: DVRF achieves superior editing performance, theoretically connects score-based diffusion and velocity-based rectified-flow optimization, and generalizes the Inversion-free method FlowEdit. Conclusion: DVRF is an efficient and broadly applicable framework for text-to-image editing that achieves superior quality, fidelity, and controllability without architectural modifications. Abstract: We propose Delta Velocity Rectified Flow (DVRF), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DVRF is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DVRF reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DVRF generalizes the Inversion-free method FlowEdit and provides a principled theoretical interpretation for it. Experimental results indicate that DVRF achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications, making it efficient and broadly applicable to text-to-image editing tasks. 
Code is available at https://github.com/gaspardbd/DeltaVelocityRectifiedFlow.

[82] Systematic Integration of Attention Modules into CNNs for Accurate and Generalizable Medical Image Diagnosis

Zahid Ullah,Minki Hong,Tahir Mahmood,Jihie Kim

Main category: cs.CV

TL;DR: This paper integrates attention mechanisms into CNNs for improved performance in medical image analysis, achieving better accuracy and feature localization.

Details Motivation: Conventional CNNs struggle with capturing fine-grained features necessary for accurate medical diagnosis, prompting the integration of attention mechanisms. Method: Attention mechanisms (Squeeze and Excitation block or hybrid Convolutional Block Attention Module) were integrated into five CNN architectures (VGG16, ResNet18, InceptionV3, DenseNet121, EfficientNetB5), and evaluated on two medical imaging datasets. Result: Attention-augmented CNNs outperformed baseline models across all metrics, with EfficientNetB5 with hybrid attention showing the best performance. Conclusion: The study concludes that integrating attention mechanisms into CNNs improves their performance in medical image analysis, offering better accuracy and feature localization. Abstract: Deep learning has become a powerful tool for medical image analysis; however, conventional Convolutional Neural Networks (CNNs) often fail to capture the fine-grained and complex features critical for accurate diagnosis. To address this limitation, we systematically integrate attention mechanisms into five widely adopted CNN architectures, namely, VGG16, ResNet18, InceptionV3, DenseNet121, and EfficientNetB5, to enhance their ability to focus on salient regions and improve discriminative performance. Specifically, each baseline model is augmented with either a Squeeze and Excitation block or a hybrid Convolutional Block Attention Module, allowing adaptive recalibration of channel and spatial feature representations. The proposed models are evaluated on two distinct medical imaging datasets, a brain tumor MRI dataset comprising multiple tumor subtypes, and a Products of Conception histopathological dataset containing four tissue categories. Experimental results demonstrate that attention augmented CNNs consistently outperform baseline architectures across all metrics. 
In particular, EfficientNetB5 with hybrid attention achieves the highest overall performance, delivering substantial gains on both datasets. Beyond improved classification accuracy, attention mechanisms enhance feature localization, leading to better generalization across heterogeneous imaging modalities. This work contributes a systematic comparative framework for embedding attention modules in diverse CNN architectures and rigorously assesses their impact across multiple medical imaging tasks. The findings provide practical insights for the development of robust, interpretable, and clinically applicable deep learning based decision support systems.

[83] Vision-Based Object Detection for UAV Solar Panel Inspection Using an Enhanced Defects Dataset

Ashen Rodrigo,Isuru Munasinghe,Asanka Perera

Main category: cs.CV

TL;DR: This study evaluates five object detection models for detecting defects and contaminants on solar panels, provides a custom dataset, and compares the models' accuracy and computational efficiency.

Details Motivation: Timely and accurate detection of defects and contaminants on solar panels is critical for maintaining the efficiency and reliability of photovoltaic systems. Method: A custom dataset was developed, along with a user interface for training and evaluating five object detection models: YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer. Result: Each model's performance was assessed and compared using mean Average Precision (mAP), precision, recall, and inference speed; the results show trade-offs between detection accuracy and computational efficiency. Conclusion: The study summarizes the performance of five state-of-the-art object detection models for solar panel defect and contaminant detection, providing valuable guidance for selecting detection approaches in practical monitoring and maintenance scenarios. Abstract: Timely and accurate detection of defects and contaminants in solar panels is critical for maintaining the efficiency and reliability of photovoltaic systems. This study presents a comprehensive evaluation of five state-of-the-art object detection models: YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer, for identifying physical and electrical defects as well as surface contaminants such as dust, dirt, and bird droppings on solar panels. A custom dataset, annotated in the COCO format and specifically designed for solar panel defect and contamination detection, was developed alongside a user interface to train and evaluate the models. The performance of each model is assessed and compared based on mean Average Precision (mAP), precision, recall, and inference speed. The results demonstrate the trade-offs between detection accuracy and computational efficiency, highlighting the relative strengths and limitations of each model. These findings provide valuable guidance for selecting appropriate detection approaches in practical solar panel monitoring and maintenance scenarios. The dataset will be publicly available at https://github.com/IsuruMunasinghe98/solar-panel-inspection-dataset.
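
Precision and recall for a detector rest on matching predicted boxes to ground-truth boxes by Intersection over Union (IoU). The sketch below shows that matching for one image; the greedy matching rule, the 0.5 threshold, the assumption that predictions are already sorted by descending confidence, and all box coordinates are illustrative choices, not the evaluation protocol of the study.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, threshold=0.5):
    """Greedily match each prediction to its best unmatched ground-truth box."""
    matched, true_pos = set(), 0
    for p in preds:  # assumed sorted by descending confidence
        best, best_iou = None, threshold
        for i, t in enumerate(truths):
            score = iou(p, t)
            if i not in matched and score >= best_iou:
                best, best_iou = i, score
        if best is not None:
            matched.add(best)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical boxes: one good detection, one false alarm, one missed defect.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]
overlap = iou(preds[0], truths[0])
precision, recall = precision_recall(preds, truths)
print(overlap, precision, recall)
```

mAP extends this by sweeping the confidence threshold to build a precision-recall curve per class and averaging the area under it across classes.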

[84] Unsupervised Instance Segmentation with Superpixels

Cuong Manh Hoang

Main category: cs.CV

TL;DR: The paper proposes a new annotation-free instance segmentation framework that combines a MultiCut algorithm, mask filtering, a superpixel-guided loss, and self-training, and achieves superior performance.

Details Motivation: Current instance segmentation models require large amounts of costly human annotation, which motivates a new framework that needs no human annotations. Method: A MultiCut algorithm produces coarse masks, a mask filter selects high-quality masks, a superpixel-guided mask loss is computed, and a self-training process with an adaptive loss is proposed. Result: Experiments show that the framework outperforms existing state-of-the-art methods on instance segmentation and object detection tasks. Conclusion: The framework proves effective on public datasets for instance segmentation and object detection and surpasses previous state-of-the-art methods. Abstract: Instance segmentation is essential for numerous computer vision applications, including robotics, human-computer interaction, and autonomous driving. Currently, popular models bring impressive performance in instance segmentation by training with a large number of human annotations, which are costly to collect. For this reason, we present a new framework that efficiently and effectively segments objects without the need for human annotations. Firstly, a MultiCut algorithm is applied to self-supervised features for coarse mask segmentation. Then, a mask filter is employed to obtain high-quality coarse masks. To train the segmentation network, we compute a novel superpixel-guided mask loss, comprising hard loss and soft loss, with high-quality coarse masks and superpixels segmented from low-level image features. Lastly, a self-training process with a new adaptive loss is proposed to improve the quality of predicted masks. We conduct experiments on public datasets in instance segmentation and object detection to demonstrate the effectiveness of the proposed framework. The results show that the proposed framework outperforms previous state-of-the-art methods.

[85] Augmented Structure Preserving Neural Networks for cell biomechanics

Juan Olalla-Pombo,Alberto Badías,Miguel Ángel Sanz-Gómez,José María Benítez,Francisco Javier Montáns

Main category: cs.CV

TL;DR: The paper presents a new approach combining Structure Preserving Neural Networks and Machine Learning tools to accurately predict cell trajectories and mitosis events in complex cell biomechanics scenarios.

Details Motivation: The motivation is to better understand the complex phenomena of cell biomechanics and how they influence cell decisions as a collective network or cluster, particularly in processes like embryo-genesis, maintenance of damaged structures, and tumor growth. Method: The method involves combining Structure Preserving Neural Networks to study cell movements as a mechanical system with other Machine Learning tools like Artificial Neural Networks, which consider environmental factors deduced from experiments using Computer Vision techniques. Result: The result is a new model that accurately predicts complete cell trajectories following a roll-out policy and includes a mitosis event prediction model based on Neural Network architectures. Conclusion: The paper concludes that their newly developed model, combining Structure Preserving Neural Networks and other Machine Learning tools, accurately predicts complete cell trajectories and mitosis events based on observed features. Abstract: Cell biomechanics involve a great number of complex phenomena that are fundamental to the evolution of life itself and other associated processes, ranging from the very early stages of embryo-genesis to the maintenance of damaged structures or the growth of tumors. Given the importance of such phenomena, increasing research has been dedicated to their understanding, but the many interactions between them and their influence on the decisions of cells as a collective network or cluster remain unclear. We present a new approach that combines Structure Preserving Neural Networks, which study cell movements as a purely mechanical system, with other Machine Learning tools (Artificial Neural Networks), which allow taking into consideration environmental factors that can be directly deduced from an experiment with Computer Vision techniques. 
This new model, tested on simulated and real cell migration cases, predicts complete cell trajectories following a roll-out policy with a high level of accuracy. This work also includes a mitosis event prediction model, based on Neural Network architectures, which makes use of the same observed features.

[86] Advanced Brain Tumor Segmentation Using EMCAD: Efficient Multi-scale Convolutional Attention Decoding

GodsGift Uzor,Tania-Amanda Nkoyo Fredrick Eneye,Chukwuebuka Ijezue

Main category: cs.CV

TL;DR: EMCAD offers a computationally efficient solution for brain tumor segmentation with moderate performance on the BraTs2020 dataset.

Details Motivation: Brain tumor segmentation is a crucial preprocessing step in medical image analysis, but decoding mechanisms often come with high computational costs. This necessitates the development of more efficient methods like EMCAD. Method: EMCAD, an efficient multi-scale convolutional attention decoder, was used to optimize performance and computational efficiency on the BraTs2020 dataset. Result: The model achieved a best Dice score of 0.31 and maintained a stable mean Dice score of 0.285 ± 0.015 throughout training. Conclusion: EMCAD demonstrates moderate performance in brain tumor segmentation with a stable mean Dice score and no signs of over-fitting. Abstract: Brain tumor segmentation is a critical pre-processing step in the medical image analysis pipeline that involves precise delineation of tumor regions from healthy brain tissue in medical imaging data, particularly MRI scans. An efficient and effective decoding mechanism is crucial in brain tumor segmentation, especially in scenarios with limited computational resources. However, these decoding mechanisms usually come with high computational costs. To address this concern, EMCAD, a new efficient multi-scale convolutional attention decoder, was utilized to optimize both performance and computational efficiency for brain tumor segmentation on the BraTs2020 dataset, which consists of MRI scans from 369 brain tumor patients. In preliminary results, the model achieved a best Dice score of 0.31 and maintained a stable mean Dice score of 0.285 ± 0.015 throughout the training process, which is moderate. The initial model maintained consistent performance across the validation set without showing signs of over-fitting.
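
For readers unfamiliar with the metric, the Dice score reported above is 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a ground-truth mask T. The sketch below computes it on flattened binary masks; the mask values are made up for illustration and are unrelated to the reported results, and treating two empty masks as a perfect match is a convention assumed here.

```python
def dice_score(pred, target):
    """Dice coefficient for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0


# Hypothetical 6-pixel masks: 2 pixels overlap, 3 predicted, 3 in the ground truth.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 1, 1, 0]
score = dice_score(pred_mask, true_mask)
print(round(score, 3))
```

A score of 1.0 means perfect overlap and 0.0 means none, which puts the reported mean of 0.285 in context.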

[87] FAVAE-Effective Frequency Aware Latent Tokenizer

Tejaswini Medi,Hsien-Yi Wang,Arianna Rampini,Margret Keuper

Main category: cs.CV

TL;DR: This paper introduces FA-VAE, a frequency-aware variational autoencoder that improves high-frequency detail reconstruction in latent generative models, enhancing realism and perceptual quality.

Details Motivation: Latent generative models often produce over-smoothed outputs and lack realism due to the bias of latent tokenizers toward low-frequency information and neglect of high-frequency details. This work aims to address this issue by introducing a frequency-aware optimization approach. Method: A wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework is proposed, which explicitly decouples the optimization of low- and high-frequency components during latent representation learning. Result: The FA-VAE framework achieves improved reconstruction of fine textures while preserving global structure, effectively bridging the fidelity gap in current latent tokenizers. Conclusion: The FA-VAE framework improves the reconstruction of fine textures in latent generative models by decoupling low- and high-frequency components, leading to better perceptual quality and realism in synthesized images. Abstract: Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, the reconstructed images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information, when jointly optimized, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. 
To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image representation, with broader implications for applications in content creation, neural rendering, and medical imaging.
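
The idea of scoring low- and high-frequency content separately can be shown with a one-level Haar wavelet transform on a 1D signal. This is only a minimal sketch of the general principle: the abstract does not say which wavelet, how many levels, or which loss terms FA-VAE uses, so the Haar choice, the per-band mean squared error, and all names below are assumptions.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)


def haar_split(signal):
    """One-level Haar transform: (low-frequency averages, high-frequency details)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [(a + b) * SCALE for a, b in pairs]
    high = [(a - b) * SCALE for a, b in pairs]
    return low, high


def haar_merge(low, high):
    """Inverse of haar_split."""
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) * SCALE, (a - d) * SCALE])
    return out


def band_errors(original, reconstruction):
    """Mean squared error measured separately in the low and high bands."""
    o_low, o_high = haar_split(original)
    r_low, r_high = haar_split(reconstruction)

    def mse(x, y):
        return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)

    return mse(o_low, r_low), mse(o_high, r_high)


# A sharp transition (2 -> 6) versus an over-smoothed reconstruction.
original = [4.0, 4.0, 2.0, 6.0]
smoothed = [4.0, 4.0, 4.0, 4.0]
low_err, high_err = band_errors(original, smoothed)
restored = haar_merge(*haar_split(original))
print(low_err, high_err)
```

Here the smoothed reconstruction has essentially zero low-band error while all of its error sits in the high band, which is the kind of discrepancy a decoupled objective can weight explicitly.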

[88] Dynamic Sensitivity Filter Pruning using Multi-Agent Reinforcement Learning For DCNN's

Iftekhar Haider Chowdhury,Zaed Ikbal Syed,Ahmed Faizul Haque Dhrubo,Mohammad Abdul Qayum

Main category: cs.CV

TL;DR: This paper proposes Differential Sensitivity Fusion Pruning, a novel and efficient filter pruning method for compressing Deep Convolutional Neural Networks, enabling high accuracy and efficient deployment on edge and mobile platforms.

Details Motivation: Deep Convolutional Neural Networks face deployment challenges due to computational and memory overhead. A more efficient, deterministic, and scalable filter pruning method is required. Method: Differential Sensitivity Fusion Pruning computes a differential sensitivity score for each filter by fusing the discrepancies among gradient-based sensitivity, first-order Taylor expansion, and KL divergence of activation distributions, followed by an exponential scaling mechanism to identify structurally unstable or less critical filters. Result: Experiments show that the proposed method significantly reduces model complexity, achieving a reduction of over 80% in floating-point operations (FLOPs). At 70% pruning, it retains up to 98.23% of baseline accuracy, surpassing traditional heuristics in compression and generalization. Conclusion: The proposed Differential Sensitivity Fusion Pruning method effectively compresses Deep Convolutional Neural Networks, enabling efficient deployment on edge and mobile platforms while maintaining high accuracy. Abstract: Deep Convolutional Neural Networks have achieved state-of-the-art performance across various computer vision tasks; however, their practical deployment is limited by computational and memory overhead. This paper introduces Differential Sensitivity Fusion Pruning, a novel single-shot filter pruning framework that focuses on evaluating the stability and redundancy of filter importance scores across multiple criteria. Differential Sensitivity Fusion Pruning computes a differential sensitivity score for each filter by fusing the discrepancies among gradient-based sensitivity, first-order Taylor expansion, and KL divergence of activation distributions. An exponential scaling mechanism is applied to emphasize filters with inconsistent importance across metrics, identifying candidates that are structurally unstable or less critical to the model performance. 
Unlike iterative or reinforcement-learning-based pruning strategies, Differential Sensitivity Fusion Pruning is efficient and deterministic, requiring only a single forward-backward pass for scoring and pruning. Extensive experiments across pruning rates from 50 to 70 percent demonstrate that Differential Sensitivity Fusion Pruning significantly reduces model complexity, achieving a reduction of over 80 percent in floating-point operations (FLOPs) while maintaining high accuracy. For instance, at 70 percent pruning, our approach retains up to 98.23 percent of baseline accuracy, surpassing traditional heuristics in both compression and generalization. The proposed method presents an effective solution for scalable and adaptive Deep Convolutional Neural Network compression, paving the way for efficient deployment on edge and mobile platforms.

[89] Veriserum: A dual-plane fluoroscopic dataset with knee implant phantoms for deep learning in medical imaging

Jinhao Wang,Florian Vogl,Pascal Schütz,Saša Ćuković,William R. Taylor

Main category: cs.CV

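TL;DR: Veriserum is an open-source X-ray image dataset for training deep learning models for dual-plane fluoroscopic analysis; it contains approximately 110,000 images and supports a range of applications.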

Details Motivation: To support the training of deep learning registration for dual-plane fluoroscopic analysis and to advance applications such as 2D/3D image registration, segmentation, X-ray distortion correction, and 3D reconstruction. Method: The authors provide a dataset of approximately 110,000 X-ray images covering 10 knee implant combinations, with automatically registered ground-truth pose annotations and calibration tools. Result: The dataset includes dual-plane images and calibration tools; 200 images carry manually registered poses for benchmarking, and the data is accessible via public links. Conclusion: Veriserum is an open-source dataset intended to advance computer vision and medical imaging research by providing a reproducible benchmark for algorithm development and evaluation. Abstract: Veriserum is an open-source dataset designed to support the training of deep learning registration for dual-plane fluoroscopic analysis. It comprises approximately 110,000 X-ray images of 10 knee implant pair combinations (2 femur and 5 tibia implants) captured during 1,600 trials, incorporating poses associated with daily activities such as level gait and ramp descent. Each image is annotated with an automatically registered ground-truth pose, while 200 images include manually registered poses for benchmarking. Key features of Veriserum include dual-plane images and calibration tools. The dataset aims to support the development of applications such as 2D/3D image registration, image segmentation, X-ray distortion correction, and 3D reconstruction. Freely accessible, Veriserum aims to advance computer vision and medical imaging research by providing a reproducible benchmark for algorithm development and evaluation. The Veriserum dataset used in this study is publicly available via https://movement.ethz.ch/data-repository/veriserum.html, with the data stored at ETH Zürich Research Collections: https://doi.org/10.3929/ethz-b-000701146.

[90] An Analysis of Layer-Freezing Strategies for Enhanced Transfer Learning in YOLO Architectures

Andrzej D. Dobrzycki,Ana M. Bernardos,José R. Casar

Main category: cs.CV

TL;DR: This study analyzes layer-freezing strategies for transfer learning in YOLOv8 and YOLOv10, showing that optimal strategies depend on dataset properties and that certain freezing approaches can outperform full fine-tuning in terms of efficiency and performance.

Details Motivation: Deploying YOLO architectures in resource-constrained environments requires efficient transfer learning techniques. While layer freezing is commonly used, its impact on modern YOLO variants like YOLOv8 and YOLOv10 remains underexplored, especially in relation to dataset properties and training dynamics. Method: The study systematically evaluates multiple layer-freezing configurations across YOLOv8 and YOLOv10 architectures using four challenging datasets. It incorporates gradient behavior analysis (L2 norm) and visual explanations (Grad-CAM) to understand training dynamics under different freezing strategies. Result: The results show that freezing strategies are dataset-dependent, with some configurations reducing GPU memory consumption by up to 28% and achieving mAP@50 scores that surpass full fine-tuning. Gradient analysis reveals distinct convergence patterns for moderately frozen models. Conclusion: This research concludes that the optimal layer-freezing strategy in YOLO architectures for transfer learning depends on the specific characteristics of the dataset being used, and that different freezing strategies can lead to improved performance and reduced GPU memory consumption compared to full fine-tuning. Abstract: The You Only Look Once (YOLO) architecture is crucial for real-time object detection. However, deploying it in resource-constrained environments such as unmanned aerial vehicles (UAVs) requires efficient transfer learning. Although layer freezing is a common technique, the specific impact of various freezing configurations on contemporary YOLOv8 and YOLOv10 architectures remains unexplored, particularly with regard to the interplay between freezing depth, dataset characteristics, and training dynamics. This research addresses this gap by presenting a detailed analysis of layer-freezing strategies. 
We systematically investigate multiple freezing configurations across YOLOv8 and YOLOv10 variants using four challenging datasets that represent critical infrastructure monitoring. Our methodology integrates a gradient behavior analysis (L2 norm) and visual explanations (Grad-CAM) to provide deeper insights into training dynamics under different freezing strategies. Our results reveal that there is no universal optimal freezing strategy but, rather, one that depends on the properties of the data. For example, freezing the backbone is effective for preserving general-purpose features, while a shallower freeze is better suited to handling extreme class imbalance. These configurations reduce graphics processing unit (GPU) memory consumption by up to 28% compared to full fine-tuning and, in some cases, achieve mean average precision (mAP@50) scores that surpass those of full fine-tuning. Gradient analysis corroborates these findings, showing distinct convergence patterns for moderately frozen models. Ultimately, this work provides empirical findings and practical guidelines for selecting freezing strategies. It offers a practical, evidence-based approach to balanced transfer learning for object detection in scenarios with limited resources.

[91] Quaternion Approximation Networks for Enhanced Image Classification and Oriented Object Detection

Bryce Grant,Peng Wang

Main category: cs.CV

TL;DR: This paper proposes QUAN, a deep learning framework based on quaternion algebra, which achieves superior performance in rotation-equivariant image classification and object detection while being computationally efficient.

Details Motivation: The motivation is to develop a deep learning framework that effectively handles rotation equivariance in vision tasks while overcoming the limitations of conventional quaternion neural networks that operate entirely in the quaternion domain. Method: The paper introduces QUAN, which approximates quaternion convolution through Hamilton product decomposition using real-valued operations, and proposes Independent Quaternion Batch Normalization (IQBN) for training stability. It also extends quaternion operations to spatial attention mechanisms. Result: QUAN achieves higher accuracy with fewer parameters and faster convergence in image classification tasks, and demonstrates improved parameter efficiency and rotation handling in object detection tasks, establishing the SOTA for quaternion CNNs. Conclusion: QUAN is a novel deep learning framework that uses quaternion algebra for rotation equivariant image classification and object detection, achieving better performance than existing methods while being efficient in terms of parameters and computation. Abstract: This paper introduces Quaternion Approximate Networks (QUAN), a novel deep learning framework that leverages quaternion algebra for rotation equivariant image classification and object detection. Unlike conventional quaternion neural networks attempting to operate entirely in the quaternion domain, QUAN approximates quaternion convolution through Hamilton product decomposition using real-valued operations. This approach preserves geometric properties while enabling efficient implementation with custom CUDA kernels. We introduce Independent Quaternion Batch Normalization (IQBN) for training stability and extend quaternion operations to spatial attention mechanisms. QUAN is evaluated on image classification (CIFAR-10/100, ImageNet), object detection (COCO, DOTA), and robotic perception tasks. 
In classification tasks, QUAN achieves higher accuracy with fewer parameters and faster convergence compared to existing convolution and quaternion-based models. For object detection, QUAN demonstrates improved parameter efficiency and rotation handling over standard Convolutional Neural Networks (CNNs) while establishing the SOTA for quaternion CNNs in this downstream task. These results highlight its potential for deployment in resource-constrained robotic systems requiring rotation-aware perception and application in other domains.
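The Hamilton product that the abstract refers to can be written entirely with real-valued arithmetic. The sketch below shows the standard quaternion product and its equivalent 4x4 real matrix form; it only illustrates the underlying algebra and is not QUAN's convolution or its CUDA kernels. The function names are hypothetical.

```python
def hamilton_product(p, q):
    """Multiply two quaternions given as (w, x, y, z) tuples of real numbers."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


def left_matrix(p):
    """Real 4x4 matrix M(p) such that M(p) @ q equals the Hamilton product p * q."""
    a, b, c, d = p
    return [
        [a, -b, -c, -d],
        [b, a, -d, c],
        [c, d, a, -b],
        [d, -c, b, a],
    ]


def matvec(m, v):
    """Plain real-valued matrix-vector product."""
    return tuple(sum(row[j] * v[j] for j in range(4)) for row in m)


i, j = (0, 1, 0, 0), (0, 0, 1, 0)
print(hamilton_product(i, j))  # i * j = k
print(hamilton_product(j, i))  # j * i = -k (the product is not commutative)
```

The matrix form is what makes a real-valued implementation possible: one quaternion multiplication becomes a structured 4x4 real matrix applied to a 4-vector.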

[92] OpenEgo: A Large-Scale Multimodal Egocentric Dataset for Dexterous Manipulation

Ahad Jawaid,Yu Xiang

Main category: cs.CV

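TL;DR: OpenEgo is a large-scale multimodal egocentric manipulation dataset intended to advance research on dexterous hand manipulation and vision-language-action learning.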

Details Motivation: Existing egocentric video datasets often lack fine-grained action descriptions or dexterous hand annotations, so a more comprehensive, standardized dataset is needed to advance this research. Method: The authors build an egocentric manipulation dataset with standardized hand-pose annotations and intention-aligned action primitives, and validate its utility by training language-conditioned imitation-learning policies. Result: OpenEgo totals 1107 hours of data covering 290 manipulation tasks, and the training results indicate that it can effectively support learning dexterous manipulation. Conclusion: OpenEgo is a multimodal dataset designed to lower the barrier to learning dexterous manipulation from egocentric video and to support reproducible research in vision-language-action learning. Abstract: Egocentric human videos provide scalable demonstrations for imitation learning, but existing corpora often lack either fine-grained, temporally localized action descriptions or dexterous hand annotations. We introduce OpenEgo, a multimodal egocentric manipulation dataset with standardized hand-pose annotations and intention-aligned action primitives. OpenEgo totals 1107 hours across six public datasets, covering 290 manipulation tasks in 600+ environments. We unify hand-pose layouts and provide descriptive, timestamped action primitives. To validate its utility, we train language-conditioned imitation-learning policies to predict dexterous hand trajectories. OpenEgo is designed to lower the barrier to learning dexterous manipulation from egocentric video and to support reproducible research in vision-language-action learning. All resources and instructions will be released at www.openegocentric.com.

[93] Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting

Sen Wang,Kunyi Li,Siyun Liang,Elena Alegret,Jing Ma,Nassir Navab,Stefano Gasperini

Main category: cs.CV

TL;DR: This paper presents VALA, a method for distilling open-vocabulary language features from 2D images into 3D Gaussians, addressing issues of background feature dominance and multi-view inconsistencies, resulting in improved open-vocabulary localization and segmentation.

Details Motivation: Although existing methods achieve impressive language-based interactions of 3D scenes, we observe two fundamental issues: background Gaussians contributing negligibly to a rendered pixel get the same feature as the dominant foreground ones, and multi-view inconsistencies due to view-specific noise in language embeddings. Method: We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Result: Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. Conclusion: VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works. Abstract: Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions of 3D scenes, we observe two fundamental issues: background Gaussians contributing negligibly to a rendered pixel get the same feature as the dominant foreground ones, and multi-view inconsistencies due to view-specific noise in language embeddings. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works.
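To make the first issue concrete, the sketch below computes the standard front-to-back alpha-compositing weight of each Gaussian along one ray and keeps only those whose weight clears a threshold. This is a simplified stand-in for a visibility gate, not VALA's actual formulation: the function names, the threshold value, and the use of compositing weights as "contributions" are assumptions made for illustration.

```python
def ray_contributions(alphas):
    """Compositing weights w_i = alpha_i * prod_{j<i} (1 - alpha_j).

    `alphas` lists the opacities of the Gaussians hit by one ray,
    ordered front to back.
    """
    weights = []
    transmittance = 1.0  # fraction of light not yet absorbed
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def visible_indices(alphas, threshold=0.1):
    """Indices of Gaussians whose contribution reaches the (hypothetical) threshold."""
    weights = ray_contributions(alphas)
    return [i for i, w in enumerate(weights) if w >= threshold]


# A nearly opaque foreground Gaussian hides the ones behind it,
# so only index 0 would receive the pixel's language feature.
print(visible_indices([0.9, 0.5, 0.8], threshold=0.1))
```

Without such a gate, all three Gaussians in the example would be assigned the same feature even though the last two contribute almost nothing to the rendered pixel.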

[94] DuoCLR: Dual-Surrogate Contrastive Learning for Skeleton-based Human Action Segmentation

Haitao Tian,Pierre Payeur

Main category: cs.CV

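TL;DR: This paper proposes DuoCLR, a contrastive learning framework that uses multi-scale representations and new surrogate tasks to significantly improve action segmentation performance.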

Details Motivation: Existing representation learning methods mainly target action recognition and build on isolated sequence-wise representations, without exploiting multi-scale representations or cross-sequence variations. Method: A new contrastive representation learning framework is proposed, comprising the 'Shuffle and Warp' data augmentation strategy and two surrogate tasks for action segmentation: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). Result: DuoCLR outperforms existing methods on an untrimmed dataset, and ablation studies confirm the effectiveness of each component. Conclusion: By using multi-scale representations and contrastive learning, DuoCLR significantly improves performance on multi-class and multi-label action segmentation on an untrimmed dataset. Abstract: In this paper, a contrastive representation learning framework is proposed to enhance human action segmentation via pre-training using trimmed (single action) skeleton sequences. Unlike previous representation learning works that are tailored for action recognition and that build upon isolated sequence-wise representations, the proposed framework focuses on exploiting multi-scale representations in conjunction with cross-sequence variations. More specifically, it proposes a novel data augmentation strategy, 'Shuffle and Warp', which exploits diverse multi-action permutations. The latter effectively assists two surrogate tasks that are introduced in contrastive learning: Cross Permutation Contrasting (CPC) and Relative Order Reasoning (ROR). In optimization, CPC learns intra-class similarities by contrasting representations of the same action class across different permutations, while ROR reasons about inter-class contexts by predicting relative mapping between two permutations. Together, these tasks enable a Dual-Surrogate Contrastive Learning (DuoCLR) network to learn multi-scale feature representations optimized for action segmentation. In experiments, DuoCLR is pre-trained on a trimmed skeleton dataset and evaluated on an untrimmed dataset, where it demonstrates a significant boost over state-of-the-art counterparts in both multi-class and multi-label action segmentation tasks. Lastly, ablation studies are conducted to evaluate the effectiveness of each component of the proposed approach.

[95] RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentangled Representation

Yihong Leng,Siming Zheng,Jinwei Chen,Bo Li,Jiaojiao Li,Peng-Tao Jiang

Main category: cs.CV

TL;DR: This paper introduces a Robust Event-guided Deblurring (RED) network that addresses the incompleteness of event data from Dynamic Vision Sensors using a novel perturbation strategy and disentangled attention mechanism, achieving superior motion deblurring performance.

Details Motivation: Existing event-guided deblurring methods overlook the inherent incompleteness of event streams caused by the thresholding mechanism in Dynamic Vision Sensors (DVS), which limits the accuracy of motion priors and deblurring performance. This work aims to enhance robustness and effectiveness in motion deblurring under real-world conditions. Method: The authors propose a Robust Event-guided Deblurring (RED) network that incorporates a Robustness-Oriented Perturbation Strategy (RPS) to handle incomplete event data and a disentangled OmniAttention module to model intra-motion, inter-motion, and cross-modality correlations between blurry images and event streams. Two interactive modules further refine motion-sensitive areas and semantic context. Result: Extensive experiments on both synthetic and real-world datasets show that the RED network consistently outperforms existing methods in terms of accuracy and robustness for motion deblurring. Conclusion: The proposed RED network with modality-specific disentangled representation achieves state-of-the-art performance in motion deblurring by effectively addressing the inherent incompleteness of event streams and enhancing robustness through a robustness-oriented perturbation strategy and disentangled OmniAttention. Abstract: Event cameras provide sparse yet high-temporal-resolution motion information, demonstrating great potential for motion deblurring. Existing methods focus on cross-modal interaction, overlooking the inherent incompleteness of event streams, which arises from the trade-off between sensitivity and noise introduced by the thresholding mechanism of Dynamic Vision Sensors (DVS). Such degradation compromises the integrity of motion priors and limits the effectiveness of event-guided deblurring. To tackle these challenges, we propose a Robust Event-guided Deblurring (RED) network with modality-specific disentangled representation. 
First, we introduce a Robustness-Oriented Perturbation Strategy (RPS) that applies random masking to events, exposing RED to incomplete patterns and thereby fostering robustness against various unknown scenario conditions. Next, a disentangled OmniAttention is presented to explicitly model intra-motion, inter-motion, and cross-modality correlations from two inherently distinct but complementary sources: blurry images and partially disrupted events. Building on these reliable features, two interactive modules are designed to enhance motion-sensitive areas in blurry images and inject semantic context into incomplete event representations. Extensive experiments on synthetic and real-world datasets demonstrate that RED consistently achieves state-of-the-art performance in both accuracy and robustness.
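As a rough illustration of what "random masking to events" could look like as a training-time augmentation, the sketch below drops each event independently with a fixed probability. The abstract does not specify how RPS chooses what to mask, so the event layout `(x, y, t, polarity)`, the `drop_prob` parameter, and the function name are all assumptions.

```python
import random


def mask_events(events, drop_prob, seed=None):
    """Return a copy of `events` with each event dropped with probability `drop_prob`.

    `events` is a list of (x, y, t, polarity) tuples (hypothetical layout).
    A seed makes the perturbation reproducible.
    """
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so drop_prob=0 keeps everything
    # and drop_prob=1 removes everything.
    return [event for event in events if rng.random() >= drop_prob]


events = [(0, 0, 0.00, 1), (1, 2, 0.01, -1), (3, 4, 0.02, 1), (5, 6, 0.03, -1)]
partial = mask_events(events, drop_prob=0.5, seed=0)
print(len(partial), "of", len(events), "events kept")
```

Training on such partially masked streams is one way a network could be exposed to the incomplete event patterns the abstract describes.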

[96] Sensitivity-Aware Post-Training Quantization for Deep Neural Networks

Zekang Zheng,Haokun Li,Yaofo Chen,Mingkui Tan,Qing Du

Main category: cs.CV

TL;DR: This paper proposes an efficient post-training quantization method that balances speed and accuracy by leveraging parameter sensitivity and a novel parallel framework.

Details Motivation: Model quantization often compromises accuracy, and existing PTQ methods suffer from high computational complexity. There is a need for a more efficient and accurate PTQ method for edge computing and real-time applications. Method: The method uses parameter sensitivity analysis to prioritize quantization, incorporates a compensation mechanism using low-sensitivity parameters, and introduces a row-parallel quantization framework with a shared inverse Hessian matrix update. Result: The proposed method achieves a 20-200 fold quantization speedup over the baseline with less than 0.3% mean accuracy loss on ResNet-50 and YOLOv5s. Conclusion: The proposed PTQ method achieves a significant quantization speedup while maintaining minimal accuracy loss, making it suitable for resource-constrained environments. Abstract: Model quantization reduces neural network parameter precision to achieve compression, but often compromises accuracy. Existing post-training quantization (PTQ) methods employ iterative parameter updates to preserve accuracy under high compression ratios, incurring significant computational complexity and resource overhead, which limits applicability in resource-constrained edge computing and real-time inference scenarios. This paper proposes an efficient PTQ method guided by parameter sensitivity analysis. The approach prioritizes quantization of high-sensitivity parameters, leveraging unquantized low-sensitivity parameters to compensate for quantization errors, thereby mitigating accuracy degradation. Furthermore, by exploiting column-wise clustering of parameter sensitivity, the method introduces a row-parallel quantization framework with a globally shared inverse Hessian matrix update mechanism, reducing computational complexity by an order of magnitude. 
Experimental results on ResNet-50 and YOLOv5s demonstrate a 20-200-fold quantization speedup over the Optimal Brain Quantization baseline, with mean accuracy loss below 0.3%, confirming the method's efficacy in balancing efficiency and accuracy.

[97] Reconstruction and Reenactment Separated Method for Realistic Gaussian Head

Zhiling Ye,Cong Zhou,Xiubao Zhang,Haifeng Shen,Weihong Deng,Quan Lu

Main category: cs.CV

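TL;DR: This paper proposes a framework based on 3D Gaussian representations that separates head reconstruction from reenactment; it generates a controllable avatar from a single input image, renders efficiently at high resolution, and outperforms existing techniques.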

Details Motivation: To generate a controllable avatar from only a single portrait image while addressing rendering performance at high frame rates and high resolution. Method: The authors develop a large-scale one-shot Gaussian head generator built upon WebSSL and adopt a two-stage training approach that improves generalization and high-frequency texture reconstruction. Result: During inference, an ultra-lightweight Gaussian avatar renders at 90 FPS at 512x512 resolution; experiments show that increasing the parameter scale of the reconstruction module improves performance while driving efficiency remains unaffected. Conclusion: The proposed reconstruction and reenactment separated framework for 3D Gaussian heads outperforms current state-of-the-art methods in both performance and efficiency. Abstract: In this paper, we explore a reconstruction and reenactment separated framework for 3D Gaussian heads, which requires only a single portrait image as input to generate a controllable avatar. Specifically, we developed a large-scale one-shot Gaussian head generator built upon WebSSL and employed a two-stage training approach that significantly enhances the capabilities of generalization and high-frequency texture reconstruction. During inference, an ultra-lightweight Gaussian avatar driven by control signals enables high frame-rate rendering, achieving 90 FPS at a resolution of 512x512. We further demonstrate that the proposed framework follows the scaling law, whereby increasing the parameter scale of the reconstruction module leads to improved performance. Moreover, thanks to the separation design, driving efficiency remains unaffected. Finally, extensive quantitative and qualitative experiments validate that our approach outperforms current state-of-the-art methods.

[98] MFFI: Multi-Dimensional Face Forgery Image Dataset for Real-World Scenarios

Changtao Miao,Yi Zhang,Man Luo,Weiwei Feng,Kaiyuan Zheng,Qi Chu,Tao Gong,Jianshu Li,Yunfeng Diao,Wei Zhou,Joey Tianyi Zhou,Xiaoshuai Hao

Main category: cs.CV

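TL;DR: To address the limitations that existing datasets place on Deepfake detection methods, the authors propose the MFFI dataset, which enhances realism along four dimensions, comprises 50 forgery methods and 1024K image samples, and performs well in benchmark evaluations.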

Details Motivation: Current Deepfake detection methods are limited by shortcomings in existing datasets, which lack the diversity required in real-world scenarios. Method: The authors propose the Multi-dimensional Face Forgery Image (MFFI) dataset, which enhances realism along four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. Result: MFFI integrates 50 different forgery methods, contains 1024K image samples, and performs well in benchmark evaluations, surpassing existing public datasets. Conclusion: MFFI is a multi-dimensional face forgery image dataset tailored to real-world scenarios, with high scene complexity, cross-domain generalization capability, and detection difficulty gradients; its technical and practical value has been validated. Abstract: Rapid advances in Artificial Intelligence Generated Content (AIGC) have enabled increasingly sophisticated face forgeries, posing a significant threat to social security. However, current Deepfake detection methods are limited by constraints in existing datasets, which lack the diversity necessary in real-world scenarios. Specifically, these datasets fall short in four key areas: coverage of advanced, unknown forgery techniques, variability of facial scenes, richness of real data, and degradation from real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (MFFI) dataset, tailored for real-world scenarios. MFFI enhances realism based on four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. MFFI integrates 50 different forgery methods and contains 1024K image samples. Benchmark evaluations show that MFFI outperforms existing public datasets in terms of scene complexity, cross-domain generalization capability, and detection difficulty gradients. These results validate the technical advance and practical utility of MFFI in simulating real-world conditions. The dataset and additional details are publicly available at https://github.com/inclusionConf/MFFI.

[99] Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization

Jungin Park,Jiyoung Lee,Kwanghoon Sohn

Main category: cs.CV

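TL;DR: VideoGraph is a language-guided spatiotemporal graph modeling approach to video summarization that captures semantic relationships among frames and objects and refines the summary through a recursive strategy.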

Details Motivation: Traditional methods focus on temporal relationships between frames but overlook the importance of fine-grained visual entities, such as objects, to the video's content; language-guided summarization also requires more comprehensive linguistic understanding. Method: The paper proposes VideoGraph, a recursive spatiotemporal graph network that treats objects and frames as nodes of the spatial and temporal graphs, respectively, uses language queries to strengthen semantic relationships, and refines the initial graph structure through a recursive strategy. Result: VideoGraph achieves state-of-the-art performance on several video summarization tasks, including generic and query-focused summarization, in both supervised and unsupervised settings. Conclusion: With language-guided spatiotemporal graph modeling, VideoGraph achieves state-of-the-art performance on multiple benchmark datasets for generic and query-focused video summarization under both supervised and unsupervised settings. Abstract: Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global interlinkability between frames in a video by temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called VideoGraph, which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. The nodes in each graph are connected and aggregated with graph edges, representing the semantic relationships between the nodes. To prevent the edges from being configured with visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to contain semantic knowledge. In addition, we adopt a recursive strategy to refine initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised manners. The code is available at https://github.com/park-jungin/videograph.

[100] Patch-level Kernel Alignment for Self-Supervised Dense Representation Learning

Juan Yeo,Ijun Jang,Taesup Kim

Main category: cs.CV

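TL;DR: This paper proposes a self-supervised learning framework that aligns dense feature distributions between a teacher and a student model, introducing Patch-level Kernel Alignment (PaKA) and dedicated data augmentation strategies to improve performance on dense vision tasks.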

Details Motivation: Global representation methods fall short in capturing the localized semantics that dense prediction tasks require, so a method is needed that transfers existing semantic knowledge into the dense feature space. Method: The paper proposes a framework that aligns dense feature distributions between a teacher and a student model, introducing the Patch-level Kernel Alignment (PaKA) objective and studying data augmentation strategies designed specifically for dense representation learning. Result: The proposed method achieves state-of-the-art performance on multiple dense vision tasks. Conclusion: The framework achieves state-of-the-art results across a variety of dense vision benchmarks, demonstrating the effectiveness of the approach. Abstract: Dense representations are essential for vision tasks that require spatial precision and fine-grained detail. While most self-supervised representation learning methods focus on global representations that summarize the image as a whole, such approaches often fall short in capturing the localized semantics necessary for dense prediction tasks. To overcome these limitations, we propose a framework that builds on pretrained representations through additional self-supervised learning, aiming to transfer existing semantic knowledge into the dense feature space. Our method aligns the distributions of dense features between a teacher and a student model. Specifically, we introduce Patch-level Kernel Alignment (PaKA), a simple yet effective alignment objective that captures statistical dependencies, thereby matching the structural relationships of dense patches across the two models. In addition, we investigate augmentation strategies specifically designed for dense representation learning. Our framework achieves state-of-the-art results across a variety of dense vision benchmarks, demonstrating the effectiveness of our approach.
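The abstract does not give PaKA's exact objective. As a related and well-known quantity, the sketch below computes linear centered kernel alignment (CKA) between two sets of patch features, which compares the pairwise similarity structure of patches rather than the raw features. Treat it as a generic kernel-alignment example swapped in for illustration, not as the paper's loss; all function names are hypothetical.

```python
import math


def gram(features):
    """Linear-kernel Gram matrix: inner products between all pairs of patches."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]


def center(k):
    """Double-center a symmetric Gram matrix (subtract row/column means, add grand mean)."""
    n = len(k)
    row = [sum(r) / n for r in k]
    col = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[k[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def frobenius_inner(a, b):
    """Sum of elementwise products of two matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x, y):
    """Linear CKA between two feature sets with one row per patch (same patch order)."""
    kx, ky = center(gram(x)), center(gram(y))
    return frobenius_inner(kx, ky) / math.sqrt(
        frobenius_inner(kx, kx) * frobenius_inner(ky, ky)
    )


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[2.0 * v for v in patch] for patch in teacher]  # same structure, rescaled
alignment_loss = 1.0 - linear_cka(teacher, student)
print(alignment_loss)  # close to zero: CKA ignores isotropic scaling
```

Because the comparison happens between Gram matrices, the teacher and student may even have different feature dimensions, as long as they describe the same patches.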

[101] SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

Hanzhen Wang,Jiaming Xu,Jiayi Pan,Yongkang Zhou,Guohao Dai

Main category: cs.CV

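TL;DR: SpecPrune-VLA performs two-level pruning that combines current and past context, significantly speeding up inference of vision-language-action models while preserving success rate.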

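Details Motivation: Existing pruning methods for vision-language-action (VLA) models rely only on local information from the current action and ignore the global context of prior actions, which lowers success rates and limits speedup. The authors observe high similarity across consecutive actions and therefore propose combining local and global information for smarter token selection. Method: The paper proposes SpecPrune-VLA, a training-free pruning method with two levels of pruning and a lightweight action-aware controller: (1) static pruning at the action level, which uses global history and local context to reduce the visual tokens per action; (2) dynamic pruning at the layer level, which prunes tokens per layer according to layer-specific importance; (3) a controller that classifies actions as coarse-grained or fine-grained (by speed) and adjusts pruning aggressiveness, since fine-grained actions are more sensitive to pruning. Result: Experiments on LIBERO show that, compared with OpenVLA-OFT, SpecPrune-VLA achieves a 1.46x speedup on NVIDIA A800 and a 1.57x speedup on NVIDIA GeForce RTX 3090, with negligible loss in success rate. Conclusion: SpecPrune-VLA is an effective training-free pruning method that combines local and global information to significantly improve the inference efficiency of VLA models without compromising performance. Abstract: Pruning accelerates compute-bound models by reducing computation. Recently applied to Vision-Language-Action (VLA) models, existing methods prune tokens using only local info from the current action, ignoring global context from prior actions, causing >20% success rate drop and limited speedup. We observe high similarity across consecutive actions and propose leveraging both local (current) and global (past) info for smarter token selection. We introduce SpecPrune-VLA, a training-free method with two-level pruning and heuristic control: (1) Static pruning at action level: uses global history and local context to reduce visual tokens per action; (2) Dynamic pruning at layer level: prunes tokens per layer based on layer-specific importance; (3) Lightweight action-aware controller: classifies actions as coarse/fine-grained (by speed), adjusting pruning aggressiveness since fine-grained actions are pruning-sensitive. Experiments on LIBERO show SpecPrune-VLA achieves 1.46 times speedup on NVIDIA A800 and 1.57 times on NVIDIA GeForce RTX 3090 vs. OpenVLA-OFT, with negligible success rate loss.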
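The idea of mixing local and global evidence for token selection, with a controller that prunes less during fine-grained actions, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: the linear mixing rule, the `mix` weight, the speed threshold, the keep ratios, and the premise that low speed indicates a fine-grained action are all hypothetical.

```python
def select_tokens(local_scores, global_scores, keep_ratio, mix=0.5):
    """Keep the top-scoring visual tokens under a blended importance score.

    local_scores:  importance from the current action (hypothetical input)
    global_scores: importance carried over from prior actions (hypothetical input)
    Returns the kept token indices in their original order.
    """
    combined = [mix * l + (1.0 - mix) * g for l, g in zip(local_scores, global_scores)]
    keep = max(1, round(keep_ratio * len(combined)))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])


def keep_ratio_for(speed, speed_threshold=0.1, coarse_ratio=0.3, fine_ratio=0.7):
    """Toy controller: slow (assumed fine-grained) motion keeps more tokens."""
    return fine_ratio if speed < speed_threshold else coarse_ratio


local_scores = [0.9, 0.1, 0.5, 0.2]
global_scores = [0.8, 0.2, 0.1, 0.9]
ratio = keep_ratio_for(speed=0.5)  # fast, coarse motion: prune aggressively
print(select_tokens(local_scores, global_scores, keep_ratio=ratio))
```

Token 3 scores low locally but high in the history, so a purely local criterion would discard it, while the blended score can retain it.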

[102] SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models

Kien Nguyen,Anh Tran,Cuong Pham

Main category: cs.CV

TL;DR: This paper introduces Subspace Mapping (SuMa), a method for erasing narrow concepts in text-to-image diffusion models, effectively addressing legal and copyright concerns while maintaining image quality.

Details Motivation: The misuse of text-to-image diffusion models in generating harmful or unauthorized content necessitates the development of Concept Erasure methods that are both robust and effective, particularly for narrow concepts like copyrighted characters or celebrities. Method: SuMa derives a target subspace representing the concept to be erased and neutralizes it by mapping it to a reference subspace that minimizes the distance between the two. Result: SuMa achieves comparable image quality to methods focused on effectiveness and yields results on par with methods targeting completeness across various tasks including subclass erasure, celebrity erasure, artistic style erasure, and instance erasure. Conclusion: The proposed Subspace Mapping (SuMa) method effectively and robustly erases narrow concepts such as copyrighted characters or celebrities in text-to-image diffusion models, maintaining image quality while addressing legal and copyright concerns. Abstract: The rapid growth of text-to-image diffusion models has raised concerns about their potential misuse in generating harmful or unauthorized content. To address these issues, several Concept Erasure methods have been proposed. However, most of them fail to achieve both robustness, i.e., the ability to robustly remove the target concept, and effectiveness, i.e., maintaining image quality. While a few recent techniques successfully achieve these goals for NSFW concepts, none could handle narrow concepts such as copyrighted characters or celebrities. Erasing these narrow concepts is critical in addressing copyright and legal concerns. However, erasing them is challenging due to their close distances to non-target neighboring concepts, requiring finer-grained manipulation. In this paper, we introduce Subspace Mapping (SuMa), a novel method specifically designed to achieve both robustness and effectiveness in erasing these narrow concepts. 
SuMa first derives a target subspace representing the concept to be erased and then neutralizes it by mapping it to a reference subspace that minimizes the distance between the two. This mapping ensures the target concept is robustly erased while preserving image quality. We conduct extensive experiments with SuMa across four tasks: subclass erasure, celebrity erasure, artistic style erasure, and instance erasure and compare the results with current state-of-the-art methods. Our method achieves image quality comparable to approaches focused on effectiveness, while also yielding results that are on par with methods targeting completeness.

[103] Self-supervised Learning for Hyperspectral Images of Trees

Moqsadur Rahman,Saurav Kumar,Santosh S. Palmate,M. Shahriar Hossain

Main category: cs.CV

TL;DR: This paper explores self-supervised learning to improve the analysis of aerial hyperspectral images for vegetation properties, enhancing machine learning task performance.

Details Motivation: Analyzing hyperspectral images with limited or no labels is challenging, prompting the need for self-supervised learning methods. Method: The paper employs self-supervised learning techniques to develop neural network embeddings from aerial hyperspectral images for representing vegetation properties. Result: The experimental results indicate that using a vegetation property-related embedding space enhances performance in downstream machine learning tasks compared to directly using hyperspectral data. Conclusion: Self-supervised learning can be used to create effective neural network embeddings for analyzing aerial hyperspectral images in precision agriculture. Abstract: Aerial remote sensing using multispectral and RGB imagers has provided a critical impetus to precision agriculture. Analysis of the hyperspectral images with limited or no labels is challenging. This paper focuses on self-supervised learning to create neural network embeddings reflecting vegetation properties of trees from aerial hyperspectral images of crop fields. Experimental results demonstrate that a constructed tree representation, using a vegetation property-related embedding space, performs better in downstream machine learning tasks compared to the direct use of hyperspectral vegetation properties as tree representations.

[104] Evaluating YOLO Architectures: Implications for Real-Time Vehicle Detection in Urban Environments of Bangladesh

Ha Meem Hossain,Pritam Nath,Mahitun Nesa Mahi,Imtiaz Uddin,Ishrat Jahan Eiste,Syed Nasibur Rahman Ratul,Md Naim Uddin Mozumdar,Asif Mohammed Saad

Main category: cs.CV

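TL;DR: This study evaluates six YOLO model variants on Bangladesh-specific vehicle detection and finds that YOLOv11x and the medium-sized variants deliver the best detection performance, offering a possible path for autonomous driving technology in developing countries.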

Details Motivation: Vehicle detection systems trained on non-Bangladeshi datasets struggle to accurately identify local vehicle types in Bangladesh's unique road environments, creating a critical gap in autonomous driving technology for developing regions. Method: The study evaluates six YOLO model variants on a custom dataset of 29 vehicle classes, including region-specific vehicles such as "Desi Nosimon", "Leguna", "Battery Rickshaw", and "CNG". The dataset consists of high-resolution images manually annotated with LabelImg. Result: YOLOv11x performed best, reaching 63.7% mAP@0.5, but with a longer inference time. The medium variants YOLOv8m and YOLOv11m reached 62.5% and 61.8% mAP@0.5 respectively while maintaining moderate inference times. Conclusion: YOLO model variants tailored to Bangladeshi traffic conditions can deliver robust object detection performance, providing a foundation for advancing autonomous driving technology in developing countries. Abstract: Vehicle detection systems trained on non-Bangladeshi datasets struggle to accurately identify local vehicle types in Bangladesh's unique road environments, creating critical gaps in autonomous driving technology for developing regions. This study evaluates six YOLO model variants on a custom dataset featuring 29 distinct vehicle classes, including region-specific vehicles such as "Desi Nosimon", "Leguna", "Battery Rickshaw", and "CNG". The dataset comprises high-resolution images (1920x1080) captured across various Bangladeshi roads using mobile phone cameras and manually annotated using LabelImg with YOLO format bounding boxes. Performance evaluation revealed YOLOv11x as the top performer, achieving 63.7% mAP@0.5, 43.8% mAP@0.5:0.95, 61.4% recall, and 61.6% F1-score, though requiring 45.8 milliseconds per image for inference. Medium variants (YOLOv8m, YOLOv11m) struck an optimal balance, delivering robust detection performance with mAP@0.5 values of 62.5% and 61.8% respectively, while maintaining moderate inference times around 14-15 milliseconds. The study identified significant detection challenges for rare vehicle classes, with Construction Vehicles and Desi Nosimons showing near-zero accuracy due to dataset imbalances and insufficient training samples. Confusion matrices revealed frequent misclassifications between visually similar vehicles, particularly Mini Trucks versus Mini Covered Vans. 
This research provides a foundation for developing robust object detection systems specifically adapted to Bangladesh traffic conditions, addressing critical needs in autonomous vehicle technology advancement for developing regions where conventional generic-trained models fail to perform adequately.
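
For readers unfamiliar with the mAP@0.5 figures quoted above, the "0.5" refers to an intersection-over-union (IoU) threshold: a predicted box is usually counted as a correct detection only if its IoU with a ground-truth box of the same class is at least 0.5. The following is a minimal sketch of that matching criterion, not the evaluation code used in the study; the function names `box_iou` and `is_true_positive` and the corner-coordinate box format are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: does the prediction match at the given IoU threshold?"""
    return box_iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    pred = (5, 0, 15, 10)  # overlaps half of the ground-truth box
    print(box_iou(gt, pred), is_true_positive(pred, gt))
```

mAP@0.5:0.95 repeats the same evaluation at IoU thresholds from 0.5 to 0.95 and averages the results, which is why it is consistently lower than mAP@0.5.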

[105] EditIDv2: Editable ID Customization with Data-Lubricated ID Feature Integration for Text-to-Image Generation

Guandong Li,Zhaobin Chu

Main category: cs.CV

TL;DR: EditIDv2 improves character editing for complex narratives by maintaining identity consistency and enabling multi-level semantic editing with minimal data.

Details Motivation: Existing character editing methods struggle with degraded editing capabilities, semantic understanding biases, and identity consistency breakdowns when dealing with complex narratives and long text inputs. Method: EditIDv2 uses a sophisticated decomposition of PerceiverAttention, introduces ID loss, employs joint dynamic training with the diffusion model, and implements an offline fusion strategy for the integration module. Result: EditIDv2 achieves excellent results in the IBench evaluation, meeting the demands of long prompts and high-quality image generation. Conclusion: EditIDv2 fulfills the demand for high-complexity narrative scenes and long text inputs by maintaining identity consistency and enabling deep, multi-level semantic editing with minimal data lubrication. Abstract: We propose EditIDv2, a tuning-free solution specifically designed for high-complexity narrative scenes and long text inputs. Existing character editing methods perform well under simple prompts, but often suffer from degraded editing capabilities, semantic understanding biases, and identity consistency breakdowns when faced with long text narratives containing multiple semantic layers, temporal logic, and complex contextual relationships. In EditID, we analyzed the impact of the ID integration module on editability. In EditIDv2, we further explore and address the influence of the ID feature integration module. The core of EditIDv2 is to discuss the issue of editability injection under minimal data lubrication. Through a sophisticated decomposition of PerceiverAttention, the introduction of ID loss and joint dynamic training with the diffusion model, as well as an offline fusion strategy for the integration module, we achieve deep, multi-level semantic editing while maintaining identity consistency in complex narrative environments using only a small amount of data lubrication. 
This meets the demands of long prompts and high-quality image generation, and achieves excellent results in the IBench evaluation.

[106] OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation

Xiaomeng Zhu,Changwei Wang,Haozhe Wang,Xinyu Liu,Fangzhen Lin

Main category: cs.CV

TL;DR: The paper introduces a new approach called Linguistic Scene Graph Anticipation (LSGA) that uses commonsense knowledge to improve predictions of future scene graphs in videos. The proposed method, named Object-Oriented Two-Staged Method (OOTSM), enhances both short-term and long-term prediction accuracy when applied to existing frameworks, as demonstrated by significant improvements in experimental results.

Details Motivation: Existing Scene Graph Anticipation (SGA) approaches primarily rely on visual cues and struggle to incorporate commonsense knowledge, limiting their long-term prediction robustness. This work aims to explicitly leverage commonsense knowledge through a new linguistic approach (LSGA), which focuses on predicting future scene graphs using text-based modeling. Method: The method decouples the Scene Graph Anticipation (SGA) task into two steps: converting video clips into scene graphs and then using a text-based model to predict future scene graphs. Specifically, the OOTSM uses a Large Language Model (LLM) to first forecast object appearances and disappearances before generating detailed human-object relations. The experiments evaluate OOTSM for LSGA using open-sourced LLMs and zero-shot APIs on a benchmark derived from Action Genome annotations, and for SGA by combining OOTSM with STTran++. Result: The experiments show that for LSGA, fine-tuned open-sourced LLMs perform well against zero-shot APIs like GPT-4o, GPT-4o-mini, and DeepSeek-V3. For SGA, combining OOTSM with STTran++ achieves state-of-the-art performance, with short-term mean-Recall (@10) increasing by 3.4% and long-term mean-Recall (@50) improving by 21.9%. Conclusion: The paper proposes a new approach called Linguistic Scene Graph Anticipation (LSGA) using an Object-Oriented Two-Staged Method (OOTSM) to better leverage commonsense knowledge in predicting future scene graphs. The experiments demonstrate that the proposed method improves both short-term and long-term prediction performance when integrated into existing frameworks. Abstract: A scene graph is a structured representation of objects and their relationships in a scene. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications such as intelligent surveillance and human-machine collaboration. 
Existing SGA approaches primarily leverage visual cues, often struggling to integrate valuable commonsense knowledge, thereby limiting long-term prediction robustness. To explicitly leverage such commonsense knowledge, we propose a new approach to better understand the objects, concepts, and relationships in a scene graph. Our approach decouples the SGA task into two steps: first a scene graph capturing model is used to convert a video clip into a sequence of scene graphs, then a pure text-based model is used to predict scene graphs in future frames. Our focus in this work is on the second step; we call it Linguistic Scene Graph Anticipation (LSGA) and believe it should have independent interest beyond the use in SGA discussed here. For LSGA, we introduce an Object-Oriented Two-Staged Method (OOTSM) where a Large Language Model (LLM) first forecasts object appearances and disappearances before generating detailed human-object relations. We conduct extensive experiments to evaluate OOTSM in two settings. For LSGA, we evaluate our fine-tuned open-sourced LLMs against zero-shot APIs (i.e., GPT-4o, GPT-4o-mini, and DeepSeek-V3) on a benchmark constructed from Action Genome annotations. For SGA, we combine our OOTSM with STTran++, and our experiments demonstrate effective state-of-the-art performance: short-term mean-Recall (@10) increases by 3.4% while long-term mean-Recall (@50) improves dramatically by 21.9%. Code is available at https://github.com/ZhuXMMM/OOTSM.
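
The mean-Recall@K numbers above can be read as follows: recall is computed separately for each predicate class over the top-K predicted triplets and then averaged across classes, so rare predicates count as much as frequent ones. Below is a simplified single-frame sketch under that common definition; it is not the authors' evaluation code, and the function name `mean_recall_at_k` and the triplet format are assumptions. Benchmark implementations typically accumulate per-predicate hits over the whole dataset before averaging.

```python
from collections import defaultdict


def mean_recall_at_k(ground_truth, ranked_predictions, k):
    """Per-predicate recall over the top-k predictions, averaged across predicates.

    ground_truth: set of (subject, predicate, object) triplets.
    ranked_predictions: list of triplets sorted by descending confidence.
    """
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    if not totals:
        return 0.0
    # Average the per-predicate recalls so every predicate class has equal weight.
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


if __name__ == "__main__":
    # Illustrative triplets only; these are not taken from Action Genome.
    gt = {
        ("person", "holding", "cup"),
        ("person", "holding", "phone"),
        ("person", "looking_at", "cup"),
    }
    preds = [
        ("person", "holding", "cup"),
        ("person", "sitting_on", "chair"),
        ("person", "looking_at", "cup"),
    ]
    print(mean_recall_at_k(gt, preds, k=2), mean_recall_at_k(gt, preds, k=3))
```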

[107] WIPUNet: A Physics-inspired Network with Weighted Inductive Biases for Image Denoising

Wasikul Islam

Main category: cs.CV

TL;DR: This paper explores the application of physics-guided inductive biases from particle physics to image denoising, demonstrating improved robustness under high noise conditions through novel architectures like WIPUNet.

Details Motivation: The motivation stems from high-energy particle physics, where pileup noise obscures meaningful signals. The authors aim to translate physical priors like conservation, locality, and isolation into image denoising to improve robustness under strong corruption. Method: The authors introduce a hierarchy of pileup (PU)-inspired denoising architectures, including a residual CNN with conservation constraints, its Gaussian-noise variants, and WIPUNet, which incorporates these physics-guided inductive biases into a UNet framework. Result: On CIFAR-10 with varying levels of Gaussian noise, PU-inspired CNNs perform competitively with standard baselines, while WIPUNet shows a significant performance advantage at higher noise levels. BSD500 experiments confirm the same trend, indicating improved stability from physics-inspired priors. Conclusion: The paper concludes that physics-inspired priors, derived from pileup-mitigation principles, enhance the robustness of image denoising models, particularly under high noise conditions, without requiring state-of-the-art model complexity. Abstract: In high-energy particle physics, collider measurements are contaminated by "pileup", overlapping soft interactions that obscure the hard-scatter signal of interest. Dedicated subtraction strategies exploit physical priors such as conservation, locality, and isolation. Inspired by this analogy, we investigate how such principles can inform image denoising by embedding physics-guided inductive biases into neural architectures. This paper is a proof of concept: rather than targeting state-of-the-art (SOTA) benchmarks, we ask whether physics-inspired priors improve robustness under strong corruption. We introduce a hierarchy of PU-inspired denoisers: a residual CNN with conservation constraints, its Gaussian-noise variants, and the Weighted Inductive Pileup-physics-inspired U-Network for Denoising (WIPUNet), which integrates these ideas into a UNet backbone. 
On CIFAR-10 with Gaussian noise at $\sigma\in\{15,25,50,75,100\}$, PU-inspired CNNs are competitive with standard baselines, while WIPUNet shows a widening margin at higher noise. Complementary BSD500 experiments show the same trend, suggesting physics-inspired priors provide stability where purely data-driven models degrade. Our contributions are: (i) translating pileup-mitigation principles into modular inductive biases; (ii) integrating them into UNet; and (iii) demonstrating robustness gains at high noise without relying on heavy SOTA machinery.

[108] Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance

Weijie Shen,Xinrui Wang,Yuanqi Nie,Apiradee Boonmee

Main category: cs.CV

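TL;DR: CAMVR improves the contextual understanding and visual reasoning of large vision-language models in multi-turn interactions by introducing a visual-textual context memory unit and an adaptive visual focus guidance mechanism.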

Details Motivation: Existing large language models and vision-language models perform well on single-turn tasks but face challenges in multi-turn interactions that require deep contextual understanding and complex visual reasoning, including fragmented reasoning, context loss, and hallucinations. Method: The CAMVR framework comprises a Visual-Textual Context Memory Unit (VCMU) and an Adaptive Visual Focus Guidance (AVFG) mechanism. The former stores and manages key visual features, textual semantic representations, and their cross-modal correspondences; the latter uses the VCMU's context to dynamically adjust the visual encoder's attention. Result: Experiments on VisDial, A-OKVQA, and the newly proposed Multi-Turn Instruction Following (MTIF) dataset show that CAMVR consistently achieves state-of-the-art performance. Conclusion: The CAMVR framework substantially improves the reasoning coherence and contextual understanding of large vision-language models in multi-turn interactions. Abstract: Current Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) excel in single-turn tasks but face significant challenges in multi-turn interactions requiring deep contextual understanding and complex visual reasoning, often leading to fragmented reasoning, context loss, and hallucinations. To address these limitations, we propose Context-Aware Multi-Turn Visual Reasoning (CAMVR), a novel framework designed to empower LVLMs with robust and coherent multi-turn visual-textual inference capabilities. CAMVR introduces two key innovations: a Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory network that stores and manages critical visual features, textual semantic representations, and their cross-modal correspondences from each interaction turn; and an Adaptive Visual Focus Guidance (AVFG) mechanism, which leverages the VCMU's context to dynamically adjust the visual encoder's attention to contextually relevant image regions. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both current inputs and accumulated historical context. Extensive experiments on challenging datasets, including VisDial, an adapted A-OKVQA, and our novel Multi-Turn Instruction Following (MTIF) dataset, demonstrate that CAMVR consistently achieves state-of-the-art performance.

[109] MeshMetrics: A Precise Implementation of Distance-Based Image Segmentation Metrics

Gašper Podobnik,Tomaž Vrtovec

Main category: cs.CV

TL;DR: MeshMetrics improves the accuracy and reliability of distance-based metric computation in image segmentation, addressing reproducibility issues in existing open-source tools.

Details Motivation: The paper addresses a reproducibility crisis in image segmentation research, specifically pitfalls in the implementation of distance-based metrics that lead to discrepancies between open-source tools. Method: Theoretical analysis and empirical validation were used to compare MeshMetrics to existing tools. Result: MeshMetrics outperforms conventional grid-based approaches in accuracy and precision, with reduced impact from discretization artifacts like distance quantization. Conclusion: MeshMetrics is a more accurate and precise framework for computing distance-based metrics in image segmentation and is less affected by discretization artifacts. Abstract: The surge of research in image segmentation has yielded remarkable performance gains but also exposed a reproducibility crisis. A major contributor is performance evaluation, where both selection and implementation of metrics play critical roles. While recent efforts have improved the former, the reliability of metric implementation has received far less attention. Pitfalls in distance-based metric implementation can lead to considerable discrepancies between common open-source tools, for instance, exceeding 100 mm for the Hausdorff distance and 30%pt for the normalized surface distance for the same pair of segmentations. To address these pitfalls, we introduce MeshMetrics, a mesh-based framework that provides a more precise computation of distance-based metrics than conventional grid-based approaches. Through theoretical analysis and empirical validation, we demonstrate that MeshMetrics achieves higher accuracy and precision than established tools, and is substantially less affected by discretization artifacts, such as distance quantization. We release MeshMetrics as an open-source Python package, available at https://github.com/gasperpodobnik/MeshMetrics.
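
To make the metric under discussion concrete, the sketch below computes the symmetric Hausdorff distance between two finite point sets by brute force. It is a point-set illustration only and does not reproduce MeshMetrics' mesh-based computation; the function names are assumptions. It does hint at why implementations disagree: the result depends entirely on which surface points are sampled, so grid-derived and mesh-derived point sets for the same segmentation can yield different distances.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from any point in A to its nearest neighbour in B."""
    return max(min(math.dist(p, q) for q in points_b) for p in points_a)


def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(
        directed_hausdorff(points_a, points_b),
        directed_hausdorff(points_b, points_a),
    )


if __name__ == "__main__":
    # Illustrative 2D boundary samples (units are arbitrary, e.g. millimetres).
    a = [(0.0, 0.0), (1.0, 0.0)]
    b = [(0.0, 0.0), (4.0, 0.0)]
    print(directed_hausdorff(a, b), directed_hausdorff(b, a), hausdorff(a, b))
```

The directed distances differ (A to B versus B to A), which is why the symmetric form takes the maximum of the two.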

[110] Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization

Jingwei Peng,Zhixuan Qiu,Boyu Jin,Surasakdi Siripong

Main category: cs.CV

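TL;DR: This paper proposes LVLM-VAR, a new method that combines a large vision-language model with video semantic tokens, achieving strong performance on video action recognition and improving model interpretability.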

Details Motivation: Traditional methods struggle with complex contextual information and fine-grained action distinctions, and the capabilities of large language models suggest a new way to address these problems. Method: The proposed LVLM-VAR framework comprises a Video-to-Semantic-Tokens (VST) module and a LoRA-fine-tuned LVLM (e.g., LLaVA-13B); it converts video into semantic action tokens and combines them with natural language instructions for action classification. Result: LVLM-VAR achieves state-of-the-art performance on the NTU RGB+D and NTU RGB+D 120 datasets, with accuracies of 94.1% and 90.0% respectively, while substantially improving interpretability by generating natural language explanations. Conclusion: By pairing a pre-trained large vision-language model with a video-to-semantic-tokens module, LVLM-VAR improves both the accuracy and the interpretability of video action recognition and points to a new direction for future research. Abstract: Human action recognition often struggles with deep semantic understanding, complex contextual information, and fine-grained distinction, limitations that traditional methods frequently encounter when dealing with diverse video data. Inspired by the remarkable capabilities of large language models, this paper introduces LVLM-VAR, a novel framework that pioneers the application of pre-trained Vision-Language Large Models (LVLMs) to video action recognition, emphasizing enhanced accuracy and interpretability. Our method features a Video-to-Semantic-Tokens (VST) Module, which innovatively transforms raw video sequences into discrete, semantically and temporally consistent "semantic action tokens," effectively crafting an "action narrative" that is comprehensible to an LVLM. These tokens, combined with natural language instructions, are then processed by a LoRA-fine-tuned LVLM (e.g., LLaVA-13B) for robust action classification and semantic reasoning. LVLM-VAR not only achieves state-of-the-art or highly competitive performance on challenging benchmarks such as NTU RGB+D and NTU RGB+D 120, demonstrating significant improvements (e.g., 94.1% on NTU RGB+D X-Sub and 90.0% on NTU RGB+D 120 X-Set), but also substantially boosts model interpretability by generating natural language explanations for its predictions.

[111] JRN-Geo: A Joint Perception Network based on RGB and Normal images for Cross-view Geo-localization

Hongyu Zhou,Yunzhou Zhang,Tingsong Huang,Fawei Ge,Man Qi,Xichen Zhang,Yizhong Zhang

Main category: cs.CV

TL;DR: This paper introduces JRN-Geo, a method for cross-view geo-localization that effectively handles viewpoint differences by integrating RGB and Normal images through a dual-branch framework and 3D augmentation techniques.

Details Motivation: The motivation stems from the significant challenges in cross-view geo-localization due to drastic viewpoint differences and appearance variations, where existing methods overlook the importance of spatial structural information. Method: The method involves a dual-branch feature extraction framework with a Difference-Aware Fusion Module (DAFM) and Joint-Constrained Interaction Aggregation (JCIA) strategy, along with a 3D geographic augmentation technique to enhance learning viewpoint-invariant features. Result: Extensive experiments on the University-1652 and SUES-200 datasets demonstrated the robustness and state-of-the-art performance of the proposed method against complex viewpoint variations. Conclusion: The proposed JRN-Geo method effectively addresses the challenges of cross-view geo-localization by incorporating geometric structural information and utilizing a joint perception network, achieving state-of-the-art performance on datasets. Abstract: Cross-view geo-localization plays a critical role in Unmanned Aerial Vehicle (UAV) localization and navigation. However, significant challenges arise from the drastic viewpoint differences and appearance variations between images. Existing methods predominantly rely on semantic features from RGB images, often neglecting the importance of spatial structural information in capturing viewpoint-invariant features. To address this issue, we incorporate geometric structural information from normal images and introduce a Joint perception network to integrate RGB and Normal images (JRN-Geo). Our approach utilizes a dual-branch feature extraction framework, leveraging a Difference-Aware Fusion Module (DAFM) and Joint-Constrained Interaction Aggregation (JCIA) strategy to enable deep fusion and joint-constrained semantic and structural information representation. 
Furthermore, we propose a 3D geographic augmentation technique to generate potential viewpoint variation samples, enhancing the network's ability to learn viewpoint-invariant features. Extensive experiments on the University-1652 and SUES-200 datasets validate the robustness of our method against complex viewpoint variations, achieving state-of-the-art performance.

[112] Knowledge-Augmented Vision Language Models for Underwater Bioacoustic Spectrogram Analysis

Ragib Amin Nihal,Benjamin Yen,Takeshi Ashizawa,Kazuhiro Nakadai

Main category: cs.CV

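TL;DR: This paper proposes a method that combines vision language models and large language models to analyze marine mammal vocalizations without manual annotation or model retraining.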

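Details Motivation: Marine mammal vocalization analysis relies on bioacoustic spectrograms, but vision language models (VLMs) are not trained on these domain-specific visualizations. Method: VLM interpretation is combined with LLM-based validation to analyze spectrograms of marine mammal vocalizations. Result: The study shows that VLMs can extract meaningful patterns from spectrograms and build domain knowledge through LLM validation. Conclusion: By combining VLM interpretation with LLM validation, the method adapts to acoustic data without manual annotation or model retraining. Abstract: Marine mammal vocalization analysis depends on interpreting bioacoustic spectrograms. Vision Language Models (VLMs) are not trained on these domain-specific visualizations. We investigate whether VLMs can extract meaningful patterns from spectrograms visually. Our framework integrates VLM interpretation with LLM-based validation to build domain knowledge. This enables adaptation to acoustic data without manual annotation or model retraining.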

[113] LiDAR-BIND-T: Improving SLAM with Temporally Consistent Cross-Modal LiDAR Reconstruction

Niels Balemans,Ali Anwar,Jan Steckel,Siegfried Mercelis

Main category: cs.CV

TL;DR: This paper introduces LiDAR-BIND-T, a temporal extension of the LiDAR-BIND framework, which improves multi-modal sensor fusion for SLAM by enhancing temporal stability and spatial coherence, resulting in better performance and robustness.

Details Motivation: The motivation is to improve the temporal and spatial coherence of multi-modal sensor fusion in SLAM applications, specifically for radar/sonar-to-LiDAR translation, by explicitly enforcing temporal consistency and enhancing robustness and performance. Method: The paper introduces three contributions: temporal embedding similarity to align consecutive latents, a motion-aligned transformation loss to match displacement between predictions and ground truth LiDAR, and windowed temporal fusion using a specialized temporal module. The model architecture was also updated to preserve spatial structure. Result: Evaluations showed improved temporal and spatial coherence with lower absolute trajectory error and better occupancy map accuracy in Cartographer-based SLAM. New metrics based on Fréchet Video Motion Distance (FVMD) and a correlation-peak distance metric were proposed to assess temporal quality. Conclusion: The paper concludes that LiDAR-BIND-T significantly enhances temporal stability while maintaining plug-and-play modality fusion, leading to improved robustness and performance in downstream SLAM applications. Abstract: This paper extends LiDAR-BIND, a modular multi-modal fusion framework that binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space, with mechanisms that explicitly enforce temporal consistency. We introduce three contributions: (i) temporal embedding similarity that aligns consecutive latents, (ii) a motion-aligned transformation loss that matches displacement between predictions and ground truth LiDAR, and (iii) windowed temporal fusion using a specialised temporal module. We further update the model architecture to better preserve spatial structure. Evaluations on radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial coherence, yielding lower absolute trajectory error and better occupancy map accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). 
We propose different metrics based on the Fréchet Video Motion Distance (FVMD) and a correlation-peak distance metric providing practical temporal quality indicators to evaluate SLAM performance. The proposed temporal LiDAR-BIND, or LiDAR-BIND-T, maintains plug-and-play modality fusion while substantially enhancing temporal stability, resulting in improved robustness and performance for downstream SLAM.
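
One plausible reading of "temporal embedding similarity that aligns consecutive latents" is a penalty on how far each latent vector drifts from its predecessor. The sketch below uses mean (1 - cosine similarity) over consecutive pairs; this choice of similarity measure and the function names are assumptions for illustration, since the abstract does not specify the loss actually used in LiDAR-BIND-T.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def temporal_consistency_loss(latents):
    """Hypothetical loss: mean (1 - cosine similarity) over consecutive latents."""
    pairs = list(zip(latents, latents[1:]))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Three toy latent vectors: the first two agree, the third is orthogonal.
    sequence = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(temporal_consistency_loss(sequence))
```

A loss of this shape is zero when consecutive latents point in the same direction and grows as they diverge, which matches the stated goal of stabilising predictions over time.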

[114] Multi-LVI-SAM: A Robust LiDAR-Visual-Inertial Odometry for Multiple Fisheye Cameras

Xinyu Zhang,Kai Huang,Junqiao Zhao,Zihan Yuan,Tiantian Feng

Main category: cs.CV

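TL;DR: This paper proposes Multi-LVI-SAM, a multi-sensor fusion framework that substantially improves the accuracy and robustness of multi-camera LiDAR-visual-inertial systems through a new panoramic visual feature model and an extrinsic compensation method.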

Details Motivation: To integrate visual information from multiple fisheye cameras efficiently and consistently while avoiding the redundant work of handling each camera separately. Method: The paper proposes Multi-LVI-SAM, a multi-camera LiDAR-visual-inertial odometry framework, introduces a panoramic visual feature model that unifies multi-camera observations into a single representation, and proposes an extrinsic compensation method to resolve the triangulation inconsistency between each camera's frame and the panoramic model's frame. Result: The panoramic model serves as a global geometric optimization framework that consolidates multi-view constraints and supports seamless loop closure and global pose optimization, while the extrinsic compensation method significantly reduces triangulation and optimization errors. Conclusion: Experiments show that the panoramic visual feature model improves the quality and consistency of multi-camera constraints, yielding higher accuracy and robustness than existing multi-camera LiDAR-visual-inertial systems. Abstract: We propose a multi-camera LiDAR-visual-inertial odometry framework, Multi-LVI-SAM, which fuses data from multiple fisheye cameras, LiDAR and inertial sensors for highly accurate and robust state estimation. To enable efficient and consistent integration of visual information from multiple fisheye cameras, we introduce a panoramic visual feature model that unifies multi-camera observations into a single representation. The panoramic model serves as a global geometric optimization framework that consolidates multi-view constraints, enabling seamless loop closure and global pose optimization, while simplifying system design by avoiding redundant handling of individual cameras. To address the triangulation inconsistency caused by the misalignment between each camera's frame and the panoramic model's frame, we propose an extrinsic compensation method. This method improves feature consistency across views and significantly reduces triangulation and optimization errors, leading to more accurate pose estimation. We integrate the panoramic visual feature model into a tightly coupled LiDAR-visual-inertial system based on a factor graph. Extensive experiments on public datasets demonstrate that the panoramic visual feature model enhances the quality and consistency of multi-camera constraints, resulting in higher accuracy and robustness than existing multi-camera LiDAR-visual-inertial systems.

[115] Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation

Tianhao Guo,Bingjie Lu,Feng Wang,Zhengyang Lu

Main category: cs.CV

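TL;DR: This study proposes a new spatially adaptive super-resolution reconstruction framework that integrates geometric scene understanding with atmospheric scattering theory, substantially improving reconstruction quality for depth-variant scenes.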

Details Motivation: Traditional single image super-resolution assumes spatially invariant degradation models, whereas real-world imaging systems exhibit complex distance-dependent effects such as atmospheric scattering, depth-of-field variations, and perspective distortions. Method: The method implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, and uses spectral constraints from atmospheric scattering theory to prevent bandwidth violations and noise amplification in far-field regions. Result: Comprehensive evaluation shows PSNR/SSIM of 36.89/0.9516 and 30.54/0.8721 at the 2x and 4x scales on KITTI outdoor scenes, outperforming existing methods by 0.44 dB and 0.36 dB respectively. Conclusion: The study proposes a variational, spatially adaptive super-resolution reconstruction method that incorporates geometric scene understanding and effectively addresses distance-dependent degradation. Abstract: Single image super-resolution traditionally assumes spatially-invariant degradation models, yet real-world imaging systems exhibit complex distance-dependent effects including atmospheric scattering, depth-of-field variations, and perspective distortions. This fundamental limitation necessitates spatially-adaptive reconstruction strategies that explicitly incorporate geometric scene understanding for optimal performance. We propose a rigorous variational framework that characterizes super-resolution as a spatially-varying inverse problem, formulating the degradation operator as a pseudodifferential operator with distance-dependent spectral characteristics that enable theoretical analysis of reconstruction limits across depth ranges. Our neural architecture implements discrete gradient flow dynamics through cascaded residual blocks with depth-conditional convolution kernels, ensuring convergence to stationary points of the theoretical energy functional while incorporating learned distance-adaptive regularization terms that dynamically adjust smoothness constraints based on local geometric structure. Spectral constraints derived from atmospheric scattering theory prevent bandwidth violations and noise amplification in far-field regions, while adaptive kernel generation networks learn continuous mappings from depth to reconstruction filters. Comprehensive evaluation across five benchmark datasets demonstrates state-of-the-art performance, achieving 36.89/0.9516 and 30.54/0.8721 PSNR/SSIM at 2x and 4x scales on KITTI outdoor scenes, outperforming existing methods by 0.44 dB and 0.36 dB respectively. 
This work establishes the first theoretically-grounded distance-adaptive super-resolution framework and demonstrates significant improvements on depth-variant scenarios while maintaining competitive performance across traditional benchmarks.
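
As a reminder of what the PSNR figures above measure, the sketch below computes peak signal-to-noise ratio from the mean squared error between a reference and a reconstruction, assuming 8-bit intensities with a peak value of 255. It is a generic illustration over flat lists of pixel values, not the paper's evaluation pipeline; the function name `psnr` and the flat-list input format are assumptions.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Toy pixel values: each reconstructed pixel is off by 10 intensity levels.
    print(round(psnr([100, 100], [110, 90]), 2))
```

Because PSNR is logarithmic in the MSE, a gain of 0.44 dB corresponds to an MSE ratio of 10^(-0.044), i.e. roughly a 10% reduction in mean squared error.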

[116] InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios

Leo Ho,Yinghao Huang,Dafei Qin,Mingyi Shi,Wangpok Tse,Wei Liu,Junichi Yamagishi,Taku Komura

Main category: cs.CV

TL;DR: This paper introduces InterAct, a new dataset for capturing two-person interactions, and a diffusion-based method to estimate body and facial movements from speech.

Details Motivation: The motivation is to improve the capture of dynamic, real-world interactions between two people, as previous studies often oversimplified by assuming static positions or focusing only on conversational gestures. Method: The authors created the InterAct dataset with 241 motion sequences capturing audio, body motion, and facial expressions from two-person interactions. They used a diffusion-based method to estimate facial expressions and body motions from speech, regressing body motions hierarchically and applying a fine-tuning mechanism for lip accuracy. Result: The result is the creation of the InterAct dataset, which features diverse and long-term interaction patterns, along with a diffusion-based method that effectively estimates interactive motions and expressions from speech. Conclusion: The paper concludes that their proposed method successfully captures interactive behaviors between two people, offering a novel dataset and an effective diffusion-based approach for estimating interactions from speech. Abstract: We address the problem of accurate capture of interactive behaviors between two people in daily scenarios. Most previous works either only consider one person or solely focus on conversational gestures of two people, assuming the body orientation and/or position of each actor are constant or barely change over each interaction. In contrast, we propose to simultaneously model two people's activities, and target objective-driven, dynamic, and semantically consistent interactions which often span longer duration and cover bigger space. To this end, we capture a new multi-modal dataset dubbed InterAct, which is composed of 241 motion sequences where two people perform a realistic and coherent scenario for one minute or longer over a complete interaction. For each sequence, two actors are assigned different roles and emotion labels, and collaborate to finish one task or conduct a common interaction activity. 
The audio, body motions, and facial expressions of both actors are captured. InterAct contains diverse and complex individual motions as well as interesting, relatively long-term interaction patterns rarely seen in earlier datasets. We also demonstrate a simple yet effective diffusion-based method that estimates the interactive facial expressions and body motions of two people from speech inputs. Our method regresses the body motions in a hierarchical manner, and we also propose a novel fine-tuning mechanism to improve the lip accuracy of facial expressions. To facilitate further research, the data and code are available at https://hku-cg.github.io/interact/ .

[117] Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation

Bingrui Zhao,Lin Yuanbo Wu,Xiangtian Fan,Deyin Liu,Lu Zhang,Ruyi He,Jialie Shen,Ximing Li

Main category: cs.CV

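TL;DR: This paper proposes PARSE-VOS, a novel training-free framework that uses large language models for referring video object segmentation and performs strongly on multiple benchmarks.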

Details Motivation: Current methods struggle with complex, compositional descriptions, so a more effective way to align static text with dynamic visual content is needed. Method: The paper proposes PARSE-VOS, a novel training-free framework that uses large language models for hierarchical, coarse-to-fine reasoning across the text and video domains. Result: PARSE-VOS achieves state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS. Conclusion: PARSE-VOS achieves state-of-the-art performance on three major benchmarks while offering a training-free solution. Abstract: Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. The prominent challenge lies in aligning static text with dynamic visual content, particularly when objects exhibit similar appearances but inconsistent motion and poses. However, current methods often rely on a holistic visual-language fusion that struggles with complex, compositional descriptions. In this paper, we propose PARSE-VOS, a novel, training-free framework powered by Large Language Models (LLMs), for hierarchical, coarse-to-fine reasoning across text and video domains. Our approach begins by parsing the natural language query into structured semantic commands. Next, we introduce a spatio-temporal grounding module that generates all candidate trajectories for all potential target objects, guided by the parsed semantics. Finally, a hierarchical identification module selects the correct target through a two-stage reasoning process: it first performs coarse-grained motion reasoning with an LLM to narrow down candidates; if ambiguity remains, a fine-grained pose verification stage is conditionally triggered to disambiguate. The final output is an accurate segmentation mask for the target object. PARSE-VOS achieves state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.

[118] PictOBI-20k: Unveiling Large Multimodal Models in Visual Decipherment for Pictographic Oracle Bone Characters

Zijian Chen,Wenjie Hua,Jinhao Li,Lirong Deng,Fan Du,Tingzhu Chen,Guangtao Zhai

Main category: cs.CV

TL;DR: This paper introduces PictOBI-20k, a dataset for evaluating LMMs in visually deciphering pictographic OBCs, revealing that while LMMs have some visual decipherment skills, they are limited by language priors.

Details Motivation: The motivation is to find an effective way to visually decipher Oracle Bone Characters (OBCs), which are the oldest attested form of written Chinese, using large multimodal models (LMMs) due to their powerful visual perception capabilities. Method: The authors introduce PictOBI-20k, a dataset containing 20k OBC and real object images, forming over 15k multi-choice questions. They also conduct subjective annotations to study the consistency of reference points between humans and LMMs in visual reasoning. Result: Experiments show that general LMMs have some visual decipherment abilities, but their performance is largely restricted by language priors, and they do not effectively use visual information. Conclusion: The paper concludes that while general LMMs have preliminary visual decipherment skills, they are mostly limited by language priors and do not effectively utilize visual information. The authors hope their dataset will aid in evaluating and optimizing visual attention in future OBC-oriented LMMs. Abstract: Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has remained the ultimate, unwavering goal of scholars, offering an irreplaceable key to understanding humanity's early modes of production. Current decipherment methodologies of OBC are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. With the powerful visual perception capability of large multimodal models (LMMs), the potential of using LMMs for visually deciphering OBCs has increased. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment tasks of pictographic OBCs. It includes 20k meticulously collected OBC and real object images, forming over 15k multi-choice questions. We also conduct subjective annotations to investigate the consistency of the reference point between humans and LMMs in visual reasoning. 
Experiments indicate that general LMMs possess preliminary visual decipherment skills but do not use visual information effectively, and that most of the time they are limited by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at https://github.com/OBI-Future/PictOBI-20k.

[119] Posterior shape models revisited: Improving 3D reconstructions from partial data using target specific models

Jonathan Aellen,Florian Burkhardt,Thomas Vetter,Marcel Lüthi

Main category: cs.CV

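TL;DR: This paper proposes a pose alignment method that requires no access to the original training data, improving the accuracy and applicability of partial shape reconstruction in medical imaging.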

Details Motivation: Pose alignment is critical for partial shape reconstruction in medical imaging; a pose difference between the training data and the target shape leads to biased solutions, particularly when only a small part of the shape is observed. Method: The method preserves the computational efficiency of linear models while significantly improving reconstruction accuracy and predicted variance through a simple preprocessing step. Result: The method exactly recovers the intended aligned model for translations and provides a good approximation for small rotations. Conclusion: The paper presents an efficient pose alignment method for partial shape reconstruction that adjusts existing models without the original training data, making it widely applicable in plug-and-play scenarios. Abstract: In medical imaging, point distribution models are often used to reconstruct and complete partial shapes using a statistical model of the full shape. A commonly overlooked but crucial factor in this reconstruction process is the pose of the training data relative to the partial target shape. A difference in pose alignment of the training and target shape leads to biased solutions, particularly when observing small parts of a shape. In this paper, we demonstrate the importance of pose alignment for partial shape reconstructions and propose an efficient method to adjust an existing model to a specific target. Our method preserves the computational efficiency of linear models while significantly improving reconstruction accuracy and predicted variance. It exactly recovers the intended aligned model for translations, and provides a good approximation for small rotations, all without access to the original training data. Hence, existing shape models in reconstruction pipelines can be adapted by a simple preprocessing step, making our approach widely applicable in plug-and-play scenarios.
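
For readers unfamiliar with posterior shape models, the following is a sketch of the standard linear-Gaussian formulation that such models usually build on. The symbols ($\mu$, $Q$, $\alpha$, $\sigma$, and the subscript $g$ for the observed part) are assumptions chosen for illustration, not notation taken from the paper, and the block does not show the paper's target-specific adjustment.

```latex
% Linear shape model with a standard normal prior on the coefficients
s = \mu + Q\alpha, \qquad \alpha \sim \mathcal{N}(0, I)

% Partial observation s_g of the rows indexed by g, with isotropic noise
s_g = \mu_g + Q_g \alpha + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)

% Posterior over the coefficients (Bayesian linear regression)
M = Q_g^{\top} Q_g + \sigma^2 I
\alpha \mid s_g \sim \mathcal{N}\!\left( M^{-1} Q_g^{\top} (s_g - \mu_g),\; \sigma^2 M^{-1} \right)
```

The residual $s_g - \mu_g$ is measured in the pose of the training data. If the target is shifted or rotated relative to that pose, part of the residual is a pose offset that the posterior has to explain with shape coefficients, which is one way to see the bias the abstract describes.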

[120] 3DPillars: Pillar-based two-stage 3D object detection

Jongyoun Noh,Junghyup Lee,Hyekang Park,Bumsub Ham

Main category: cs.CV

TL;DR: The paper introduces a two-stage 3D detection framework that improves the performance of PointPillars while maintaining its efficiency, through a new CNN architecture and an RoI head with a sparse scene context feature module.

Details Motivation: The motivation is to improve upon PointPillars, which is efficient but underperforms compared to state-of-the-art methods due to limitations in pseudo image representations and the difficulty of adopting a two-stage detection pipeline. Method: The paper introduces a new CNN architecture called 3DPillars and an RoI head with a sparse scene context feature module to overcome the limitations of PointPillars. Result: The experimental results on the KITTI and Waymo Open datasets show that the proposed approach is both effective and efficient, offering a good balance between speed and accuracy. Conclusion: The paper concludes that the proposed two-stage 3D detection framework narrows the performance gap between PointPillars and state-of-the-art methods while maintaining efficiency. Abstract: PointPillars is the fastest 3D object detector that exploits pseudo image representations to encode features for 3D objects in a scene. Albeit efficient, PointPillars is typically outperformed by state-of-the-art 3D detection methods due to the following limitations: 1) The pseudo image representations fail to preserve precise 3D structures, and 2) they make it difficult to adopt a two-stage detection pipeline using 3D object proposals that typically shows better performance than a single-stage approach. We introduce in this paper the first two-stage 3D detection framework exploiting pseudo image representations, narrowing the performance gaps between PointPillars and state-of-the-art methods, while retaining its efficiency. Our framework consists of two novel components that overcome the aforementioned limitations of PointPillars: First, we introduce a new CNN architecture, dubbed 3DPillars, that enables learning 3D voxel-based features from the pseudo image representation efficiently using 2D convolutions. The basic idea behind 3DPillars is that 3D features from voxels can be viewed as a stack of pseudo images. 
To implement this idea, we propose a separable voxel feature module that extracts voxel-based features without using 3D convolutions. Second, we introduce an RoI head with a sparse scene context feature module that aggregates multi-scale features from 3DPillars to obtain a sparse scene feature. This enables adopting a two-stage pipeline effectively, and fully leveraging contextual information of a scene to refine 3D object proposals. Experimental results on the KITTI and Waymo Open datasets demonstrate the effectiveness and efficiency of our approach, achieving a good compromise in terms of speed and accuracy.

[121] CRAB: Camera-Radar Fusion for Reducing Depth Ambiguity in Backward Projection based View Transformation

In-Jae Lee,Sihwan Hwang,Youngseok Kim,Wonjune Kim,Sanmin Kim,Dongsuk Kum

Main category: cs.CV

TL;DR: This paper proposes CRAB, a camera-radar fusion model for 3D object detection, which improves depth distinction and achieves state-of-the-art performance on the nuScenes dataset.

Details Motivation: The motivation is to address the limitations of previous approaches that either struggle with sparse BEV feature generation using forward projection or overlook depth ambiguity leading to false positives with backward projection. Method: The method involves a backward projection technique that uses radar to mitigate depth ambiguity. It aggregates perspective view image context features into BEV queries and introduces spatial cross-attention with a feature map containing radar context information. Result: The result is that the proposed approach achieves a state-of-the-art performance on the nuScenes open dataset with 62.4% NDS and 54.0% mAP in 3D object detection. Conclusion: The paper concludes that CRAB, a novel camera-radar fusion-based 3D object detection and segmentation model, achieves state-of-the-art performance in backward projection-based camera-radar fusion methods. Abstract: Recently, camera-radar fusion-based 3D object detection methods in bird's eye view (BEV) have gained attention due to the complementary characteristics and cost-effectiveness of these sensors. Previous approaches using forward projection struggle with sparse BEV feature generation, while those employing backward projection overlook depth ambiguity, leading to false positives. In this paper, to address the aforementioned limitations, we propose a novel camera-radar fusion-based 3D object detection and segmentation model named CRAB (Camera-Radar fusion for reducing depth Ambiguity in Backward projection-based view transformation), using a backward projection that leverages radar to mitigate depth ambiguity. During the view transformation, CRAB aggregates perspective view image context features into BEV queries. It improves depth distinction among queries along the same ray by combining the dense but unreliable depth distribution from images with the sparse yet precise depth information from radar occupancy. 
We further introduce spatial cross-attention with a feature map containing radar context information to enhance the comprehension of the 3D scene. When evaluated on the nuScenes open dataset, our proposed approach achieves state-of-the-art performance among backward projection-based camera-radar fusion methods with 62.4% NDS and 54.0% mAP in 3D object detection.

[122] Dual-Mode Deep Anomaly Detection for Medical Manufacturing: Structural Similarity and Feature Distance

Julio Zanon Diaz,Georgios Siogkas,Peter Corcoran

Main category: cs.CV

TL;DR: This study proposes two attention-guided autoencoder architectures for deep anomaly detection to improve visual inspection in medical device manufacturing, achieving high accuracy and addressing regulatory and operational constraints.

Details Motivation: Automating visual inspection in medical device manufacturing is challenging due to small and imbalanced datasets, high-resolution imagery, and strict regulatory requirements. This research aims to address these constraints through deep anomaly detection methods. Method: The study proposes two architectures: one using a structural similarity-based anomaly score (4-MS-SSIM) for real-time defect detection, and another employing Mahalanobis scoring on reduced latent features for supervisory monitoring. Result: The first architecture achieved ACC 0.903 (unsupervised thresholding) and 0.931 (supervised thresholding) on the Surface Seal Image Test split with only 10% of defective samples. The second achieved ACC 0.722 with supervised thresholding, demonstrating complementary capabilities for inline inspection and post-production surveillance. Conclusion: This work introduces two attention-guided autoencoder architectures for deep anomaly detection in medical device manufacturing, providing a practical pathway for deploying these techniques in regulated environments by aligning accuracy, efficiency, and regulatory obligations. Abstract: Automating visual inspection in medical device manufacturing remains challenging due to small and imbalanced datasets, high-resolution imagery, and stringent regulatory requirements. This work proposes two attention-guided autoencoder architectures for deep anomaly detection designed to address these constraints. The first employs a structural similarity-based anomaly score (4-MS-SSIM), offering lightweight and accurate real-time defect detection, yielding ACC 0.903 (unsupervised thresholding) and 0.931 (supervised thresholding) on the Surface Seal Image Test split with only 10% of defective samples. 
The second applies a feature-distance approach using Mahalanobis scoring on reduced latent features, providing high sensitivity to distributional shifts for supervisory monitoring, achieving ACC 0.722 with supervised thresholding. Together, these methods deliver complementary capabilities: the first supports reliable inline inspection, while the second enables scalable post-production surveillance and regulatory compliance monitoring. Experimental results demonstrate that both approaches surpass re-implemented baselines and provide a practical pathway for deploying deep anomaly detection in regulated manufacturing environments, aligning accuracy, efficiency, and the regulatory obligations defined for high-risk AI systems under the EU AI Act.
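
The abstract reports accuracy under both unsupervised and supervised thresholding without defining either. The sketch below shows one common reading, under stated assumptions: the unsupervised threshold is set from defect-free scores alone (mean plus `k` standard deviations), and the supervised threshold is the candidate that maximizes accuracy on labeled validation scores. The function names, the value of `k`, and all scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def unsupervised_threshold(normal_scores, k=3.0):
    """Threshold from defect-free scores only: mean + k * sample std (assumed rule)."""
    return mean(normal_scores) + k * stdev(normal_scores)


def accuracy(scores, labels, threshold):
    """Fraction of samples classified correctly; a sample is flagged if score > threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def supervised_threshold(scores, labels):
    """Pick the observed score that maximizes accuracy on labeled data (1 = defective)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = accuracy(scores, labels, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical anomaly scores; higher means more anomalous.
normal_scores = [0.10, 0.12, 0.11, 0.09, 0.13]
val_scores = [0.10, 0.14, 0.30, 0.12, 0.45, 0.16]
val_labels = [0, 0, 1, 0, 1, 0]

t_unsup = unsupervised_threshold(normal_scores)
acc_unsup = accuracy(val_scores, val_labels, t_unsup)
t_sup, acc_sup = supervised_threshold(val_scores, val_labels)
```

In this toy data the unsupervised threshold flags one borderline normal sample, while the supervised threshold separates the classes. That gap is the kind of difference the two reported accuracies (0.903 and 0.931) would reflect under this reading.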

[123] A Probabilistic Segment Anything Model for Ambiguity-Aware Medical Image Segmentation

Tyler Ward,Abdullah Imran

Main category: cs.CV

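TL;DR: This study proposes Probabilistic SAM, a novel SAM-based probabilistic segmentation model that introduces a latent variable space and trains with a variational objective to model uncertainty and variability in medical images.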

Details Motivation: Annotation uncertainty and inter-expert variability in medical imaging mean that current segmentation models such as SAM cannot adequately capture the ambiguity of real-world tasks. Method: Probabilistic SAM adds a prior network and a posterior network to the original SAM, uses latent codes conditioned on the input image and prompt to modulate the prompt embeddings, and is trained with a variational objective to generate diverse segmentations. Result: On the LIDC-IDRI lung nodule dataset, Probabilistic SAM produces diverse outputs that align with expert disagreement and outperforms existing probabilistic baselines on uncertainty-aware metrics. Conclusion: Probabilistic SAM addresses the ambiguity caused by annotation uncertainty and inter-expert variability in medical imaging; by introducing a latent variable space and training with a variational objective, it produces uncertainty-aware outputs with minimal overhead. Abstract: Recent advances in promptable segmentation, such as the Segment Anything Model (SAM), have enabled flexible, high-quality mask generation across a wide range of visual domains. However, SAM and similar models remain fundamentally deterministic, producing a single segmentation per object per prompt, and fail to capture the inherent ambiguity present in many real-world tasks. This limitation is particularly troublesome in medical imaging, where multiple plausible segmentations may exist due to annotation uncertainty or inter-expert variability. In this paper, we introduce Probabilistic SAM, a probabilistic extension of SAM that models a distribution over segmentations conditioned on both the input image and prompt. By incorporating a latent variable space and training with a variational objective, our model learns to generate diverse and plausible segmentation masks reflecting the variability in human annotations. The architecture integrates a prior and posterior network into the SAM framework, allowing latent codes to modulate the prompt embeddings during inference. The latent space allows for efficient sampling during inference, enabling uncertainty-aware outputs with minimal overhead. We evaluate Probabilistic SAM on the public LIDC-IDRI lung nodule dataset and demonstrate its ability to produce diverse outputs that align with expert disagreement, outperforming existing probabilistic baselines on uncertainty-aware metrics. Our code is available at: https://github.com/tbwa233/Probabilistic-SAM/.
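
The abstract mentions a prior network, a posterior network, and a variational objective without stating the loss. A typical conditional-VAE objective for this kind of setup is sketched below. The notation ($x$ for the image, $p$ for the prompt, $y$ for an annotation, $z$ for the latent code, and the weight $\beta$) is an assumption for illustration, and the paper's exact loss may differ.

```latex
% Assumed conditional-VAE objective (negative ELBO with a KL weight beta)
\mathcal{L}(\theta, \phi, \psi) =
  \mathbb{E}_{z \sim q_\phi(z \mid x, p, y)}
    \left[ -\log p_\theta(y \mid x, p, z) \right]
  + \beta \, \mathrm{KL}\!\left( q_\phi(z \mid x, p, y) \,\|\, p_\psi(z \mid x, p) \right)

% At inference, annotations are unavailable, so codes come from the prior network
z \sim p_\psi(z \mid x, p), \qquad \hat{y} \sim p_\theta(y \mid x, p, z)
```

Under this reading, drawing several values of $z$ for the same image and prompt yields several plausible masks, which matches the diverse outputs the abstract reports.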

[124] Near Real-Time Dust Aerosol Detection with 3D Convolutional Neural Networks on MODIS Data

Caleb Gates,Patrick Moorhead,Jayden Ferguson,Omar Darwish,Conner Stallman,Pablo Rivas,Paapa Quansah

Main category: cs.CV

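TL;DR: This paper presents a near real-time method for detecting dust storms from multi-band imagery from NASA's Terra and Aqua satellites, using a 3D convolutional network for pixel-level dust identification with faster training and processing; the model reaches about 0.92 accuracy.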

Details Motivation: Dust storms harm health and reduce visibility, so quick detection from satellites is needed. Method: A 3D convolutional network learns patterns across all 36 bands, plus split thermal bands, to separate dust from clouds and surface features; simple normalization and local filling handle missing data; an improved version raises training speed and supports fast processing of full scenes. Result: On 17 independent MODIS scenes, the model reaches about 0.92 accuracy with a mean squared error of 0.014; maps show strong agreement in plume cores, with most misses along edges. Conclusion: Joint band-and-space learning can provide timely dust alerts at global scale; wider input windows or attention-based models may further sharpen edge detection. Abstract: Dust storms harm health and reduce visibility; quick detection from satellites is needed. We present a near real-time system that flags dust at the pixel level using multi-band images from NASA's Terra and Aqua (MODIS). A 3D convolutional network learns patterns across all 36 bands, plus split thermal bands, to separate dust from clouds and surface features. Simple normalization and local filling handle missing data. An improved version raises training speed by 21x and supports fast processing of full scenes. On 17 independent MODIS scenes, the model reaches about 0.92 accuracy with a mean squared error of 0.014. Maps show strong agreement in plume cores, with most misses along edges. These results show that joint band-and-space learning can provide timely dust alerts at global scale; using wider input windows or attention-based models may further sharpen edges.
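
The abstract says that simple normalization and local filling handle missing data but gives no details. The sketch below shows one plausible version for a single band, assuming missing pixels are marked with `None`, filled with the mean of their valid 8-neighbours, and then min-max normalized. The function names and the sample grid are hypothetical, and the authors' preprocessing may differ.

```python
def fill_missing(grid):
    """Replace each None cell with the mean of its valid 8-neighbours (single pass)."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is not None:
                continue
            neighbours = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c) and grid[rr][cc] is not None
            ]
            # Leave the cell as None if no valid neighbour exists.
            if neighbours:
                out[r][c] = sum(neighbours) / len(neighbours)
    return out


def min_max_normalize(grid):
    """Scale valid values of one band to [0, 1]; None cells are passed through."""
    values = [v for row in grid for v in row if v is not None]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant band
    return [[None if v is None else (v - lo) / span for v in row] for row in grid]


# Hypothetical 3 x 3 patch of one band with a missing centre pixel.
band = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [7.0, 8.0, 9.0],
]
filled = fill_missing(band)
normalized = min_max_normalize(filled)
```

Filling reads only from the original grid, so a filled value never feeds into another fill within the same pass.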

[125] Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets

Phongsakon Mark Konrad,Andrei-Alexandru Popa,Yaser Sabzehmeidani,Liang Zhong,Elisa A. Liehn,Serkan Ayvaz

Main category: cs.CV

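TL;DR: The study evaluates several deep learning segmentation models on cardiovascular histology images and finds that model performance is highly sensitive to data splits and strongly affected by statistical noise.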

Details Motivation: Accurate segmentation of carotid artery structures is vital for cardiovascular disease research and diagnosis, but the scarcity of annotated data constrains deep learning model development. Method: The study systematically evaluates state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, SegFormer, SAM, MedSAM, and MedSAM+UNet, with Bayesian search for hyperparameter optimization. Result: Model performance depends heavily on the data split, and minor differences are driven mainly by statistical noise rather than algorithmic superiority. Conclusion: Standard benchmarking has limitations in low-data clinical settings, and performance rankings do not necessarily reflect real clinical utility. Abstract: Accurate segmentation of carotid artery structures in histopathological images is vital for advancing cardiovascular disease research and diagnosis. However, deep learning model development in this domain is constrained by the scarcity of annotated cardiovascular histopathological data. This study presents a systematic evaluation of state-of-the-art deep learning segmentation models, including convolutional neural networks (U-Net, DeepLabV3+), a Vision Transformer (SegFormer), and recent foundation models (SAM, MedSAM, MedSAM+UNet), on a limited dataset of cardiovascular histology images. Despite employing an extensive hyperparameter optimization strategy with Bayesian search, our findings reveal that model performance is highly sensitive to data splits, with minor differences driven more by statistical noise than by true algorithmic superiority. This instability exposes the limitations of standard benchmarking practices in low-data clinical settings and challenges the assumption that performance rankings reflect meaningful clinical utility.
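
One way to probe the split sensitivity described above is to resample the test set many times and count how often one model beats another. The sketch below does this for two sets of per-image scores. The scores, the function name `win_rate`, and the resampling scheme are all hypothetical illustrations, not the study's protocol.

```python
import random
from statistics import mean


def win_rate(scores_a, scores_b, test_size, n_splits=200, seed=0):
    """Fraction of random test splits on which model A's mean score beats model B's."""
    rng = random.Random(seed)
    indices = list(range(len(scores_a)))
    wins = 0
    for _ in range(n_splits):
        test_idx = rng.sample(indices, test_size)
        mean_a = mean(scores_a[i] for i in test_idx)
        mean_b = mean(scores_b[i] for i in test_idx)
        if mean_a > mean_b:
            wins += 1
    return wins / n_splits


# Hypothetical per-image Dice scores for two models on the same 8 images.
# The overall means are nearly equal (0.775 vs 0.7775).
model_a = [0.80, 0.70, 0.90, 0.60, 0.85, 0.75, 0.65, 0.95]
model_b = [0.78, 0.74, 0.86, 0.66, 0.83, 0.77, 0.69, 0.89]

rate = win_rate(model_a, model_b, test_size=3)
```

A win rate far from both 0 and 1 indicates that the ranking of the two models depends on which images land in the test split, which is the instability the abstract attributes to statistical noise.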

[126] BTCChat: Advancing Remote Sensing Bi-temporal Change Captioning with Multimodal Large Language Model

Yujie Li,Wenjia Xu,Yuanben Zhang,Zhiwei Wei,Mugen Peng

Main category: cs.CV

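TL;DR: This paper proposes BTCChat, a bi-temporal multimodal large language model whose Change Extraction module and Prompt Augmentation mechanism substantially improve bi-temporal change understanding and attention to spatial details, with strong results on related tasks.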

Details Motivation: Existing bi-temporal change analysis methods inadequately model temporal correlations and spatial semantic changes when processing image pairs, which limits visual-semantic alignment and overall effectiveness. Method: A Change Extraction module is designed to better capture temporal features and spatial semantic changes in image pairs, and a Prompt Augmentation mechanism incorporates contextual clues into the prompt to enhance model performance. Result: Experiments show that BTCChat achieves state-of-the-art performance on bi-temporal change captioning and visual question answering tasks. Conclusion: Through its Change Extraction module and Prompt Augmentation mechanism, BTCChat improves bi-temporal change understanding and attention to spatial details, achieving state-of-the-art performance on change captioning and visual question answering. Abstract: Bi-temporal satellite imagery supports critical applications such as urban development monitoring and disaster assessment. Although powerful multimodal large language models (MLLMs) have been applied in bi-temporal change analysis, previous methods process image pairs through direct concatenation, inadequately modeling temporal correlations and spatial semantic changes. This deficiency hampers visual-semantic alignment in change understanding, thereby constraining the overall effectiveness of current approaches. To address this gap, we propose BTCChat, a multi-temporal MLLM with advanced bi-temporal change understanding capability. BTCChat supports bi-temporal change captioning and retains single-image interpretation capability. To better capture temporal features and spatial semantic changes in image pairs, we design a Change Extraction module. Moreover, to enhance the model's attention to spatial details, we introduce a Prompt Augmentation mechanism, which incorporates contextual clues into the prompt to enhance model performance. Experimental results demonstrate that BTCChat achieves state-of-the-art performance on change captioning and visual question answering tasks.

[127] A Fine-Grained Attention and Geometric Correspondence Model for Musculoskeletal Risk Classification in Athletes Using Multimodal Visual and Skeletal Features

Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Tamanna Shermin,Md Rafiqul Islam,Mukhtar Hussain,Sami Azam

Main category: cs.CV

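TL;DR: This study proposes ViSK-GAT, a model that fuses visual and skeletal coordinate data to achieve highly accurate musculoskeletal risk classification, providing AI support for early risk intervention in athletes.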

Details Motivation: Existing musculoskeletal risk assessment methods are limited to a single data type and controlled settings, so a new method is needed for reliable risk assessment in complex environments. Method: ViSK-GAT combines a Residual Block with a Lightweight Transformer Block and introduces the FGAM and MGCM modules to improve multimodal data processing. Result: ViSK-GAT reaches validation and test accuracies of 93.55% and 93.89%, respectively, and in the regression task shows a low root mean square error (0.1205) and mean absolute error (0.0156). Conclusion: ViSK-GAT outperforms existing methods in musculoskeletal risk classification and provides an effective tool for early intervention in sports. Abstract: Musculoskeletal disorders pose significant risks to athletes, and assessing risk early is important for prevention. However, most existing methods are designed for controlled settings and fail to reliably assess risk in complex environments due to their reliance on a single type of data. This research proposes ViSK-GAT (Visual-Skeletal Geometric Attention Transformer), a novel multimodal deep learning framework designed to classify musculoskeletal risk using visual and skeletal coordinate-based features. In addition, a custom multimodal dataset is constructed by combining visual data and skeletal coordinates for risk assessment. Each sample is labeled into eight risk categories based on the Rapid Entire Body Assessment system. ViSK-GAT combines a Residual Block with a Lightweight Transformer Block to learn spatial and temporal dependencies jointly. It incorporates two novel modules: the Fine-Grained Attention Module (FGAM), which enables precise inter-modal feature refinement through cross-attention between visual and skeletal inputs, and the Multimodal Geometric Correspondence Module (MGCM), which enhances cross-modal coherence by aligning image features with coordinate-based representations. ViSK-GAT achieved strong performance with validation and test accuracies of 93.55% and 93.89%, respectively; a precision of 93.86%; an F1 score of 93.85%; and Cohen's Kappa and Matthews Correlation Coefficient of 93%. The regression results also indicated a low Root Mean Square Error of the predicted probability distribution of 0.1205 and a corresponding Mean Absolute Error of 0.0156. 
Compared to nine popular transfer learning backbones, ViSK-GAT consistently outperformed previous methods. The ViSK-GAT model advances artificial intelligence implementation and application, transforming musculoskeletal risk classification and enabling impactful early interventions in sports.

[128] Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models

Ruiqi Shen,Haotian Wu,Wenjing Zhang,Jiangjing Hu,Deniz Gunduz

Main category: cs.CV

TL;DR: This paper proposes a novel semantic compression method using the CLIP model to preserve semantic information at a significantly lower bit rate than traditional approaches, achieving robust performance across diverse tasks and data distributions.

Details Motivation: Emerging applications prioritize semantic preservation over pixel-level reconstruction and require robust performance across diverse data distributions and downstream tasks, which traditional lossy image compression methods are not well-suited to address. Method: The study introduces a novel semantic compression method leveraging the contrastive language-image pretraining (CLIP) model. Instead of focusing on pixel-level reconstruction, it compresses CLIP feature embeddings while maintaining semantic integrity across tasks. Result: The experiments demonstrate that the proposed method maintains semantic integrity across benchmark datasets with an average bit rate of approximately 2-3 × 10⁻³ bits per pixel—less than 5% of the bit rate required by mainstream approaches. It also shows zero-shot robustness under extreme compression. Conclusion: The proposed semantic compression method based on the CLIP model successfully preserves semantic information at a significantly lower bit rate compared to traditional image compression approaches, demonstrating zero-shot robustness across diverse data distributions and downstream tasks. Abstract: Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. 
Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately 2-3 × 10⁻³ bits per pixel. This is less than 5% of the bit rate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.
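
To give a sense of scale for the quoted bit rates, the sketch below converts bits per pixel into a total bit budget for one image. The 512 × 512 resolution is a hypothetical choice for illustration; the abstract does not state the image sizes used.

```python
def bit_budget(bpp, height, width):
    """Total number of bits available for an image at a given bits-per-pixel rate."""
    return bpp * height * width


def bits_per_pixel(total_bits, height, width):
    """Inverse conversion: bits per pixel for a payload of total_bits."""
    return total_bits / (height * width)


# Hypothetical 512 x 512 image at the two ends of the quoted range.
low_bits = bit_budget(2e-3, 512, 512)   # about 524 bits, roughly 66 bytes
high_bits = bit_budget(3e-3, 512, 512)  # about 786 bits, roughly 98 bytes

# "Less than 5% of the mainstream bit rate" implies the comparison rate
# is above this value at the low end of the range (simple arithmetic).
implied_baseline_bpp = 2e-3 / 0.05
```

At these rates the whole image is represented by well under a kilobit, which is consistent with compressing a feature embedding rather than pixel data.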

[129] AttriPrompt: Dynamic Prompt Composition Learning for CLIP

Qiqi Zhan,Shiwei Li,Qingjie Liu,Yunhong Wang

Main category: cs.CV

TL;DR: AttriPrompt improves deep text prompting by adaptively refining semantic representations and achieving fine-grained alignment through a novel framework leveraging visual features and self-regularization.

Details Motivation: Current deep text prompting methods rely too heavily on contrastive learning, which neglects fine-grained feature optimization, and use static prompts that do not adapt to input variations. Method: AttriPrompt uses intermediate-layer features of CLIP's vision encoder, incorporating an Attribute Retrieval module, Dual-stream Contrastive Learning, and a Self-Regularization mechanism. Result: AttriPrompt outperforms state-of-the-art methods by up to 7.37% in base-to-novel settings and demonstrates strong cross-domain knowledge transfer capabilities. Conclusion: AttriPrompt provides a more effective and adaptable method for deep text prompting, improving model performance and cross-domain knowledge transfer. Abstract: The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: (1) over-reliance on contrastive learning objectives that prioritize high-level semantic alignment, neglecting fine-grained feature optimization; and (2) static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt, a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP's vision encoder. We design an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. 
Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.
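
The retrieval step described above, in which aggregated visual features select semantically similar prompts from a pool, can be sketched as a top-k nearest-neighbour lookup. The sketch assumes cosine similarity and one key vector per prompt; both are assumptions, as are the function names and the toy vectors. The paper's clustering and per-layer details are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_prompts(query, prompt_keys, top_k=2):
    """Return indices of the top_k prompt keys most similar to the query feature."""
    ranked = sorted(
        range(len(prompt_keys)),
        key=lambda i: cosine(query, prompt_keys[i]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical aggregated visual feature and a pool of four prompt keys.
visual_feature = [1.0, 0.0, 1.0]
prompt_keys = [
    [1.0, 0.0, 0.9],    # index 0: nearly parallel to the query
    [0.0, 1.0, 0.0],    # index 1: orthogonal to the query
    [0.5, 0.5, 0.5],    # index 2: partially aligned
    [-1.0, 0.0, -1.0],  # index 3: opposite direction
]
selected = retrieve_prompts(visual_feature, prompt_keys)
```

Because the selected indices depend on the image feature, different inputs pull different prompts from the pool, which is the content-aware adaptation that static prompts lack.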

[130] Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching

Feng Wang,Zihao Yu

Main category: cs.CV

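TL;DR: This paper proposes CPS, a new sampling method for image generation that removes the noise artifacts introduced by SDE sampling and makes reinforcement learning optimization more efficient.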

Details Motivation: SDE sampling introduces noise artifacts in Flow Matching models that harm the reward learning process, so the sampling method needs improvement. Method: Inspired by DDIM, the sampling process is reformulated into a new method called CPS that eliminates the noise introduced by SDE sampling. Result: CPS effectively eliminates the noise and improves the convergence speed and stability of reinforcement learning optimizers such as Flow-GRPO and Dance-GRPO. Conclusion: The proposed CPS sampling method removes noise artifacts from generated images and makes reward modeling more accurate, enabling faster and more stable reinforcement learning optimization. Abstract: Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized with a Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS

[131] Dual Interaction Network with Cross-Image Attention for Medical Image Segmentation

Jeonghyun Noh,Wangsu Jeon,Jinsun Park

Main category: cs.CV

TL;DR: This paper proposes a dual interactive fusion module (DIFM) and multi-scale boundary loss to enhance medical image segmentation accuracy by effectively combining original and enhanced images while preserving critical spatial information.

Details Motivation: Medical image segmentation is critical for disease diagnosis, but factors like noise and low contrast hinder accuracy. Image enhancement techniques can mitigate these issues but risk altering critical information. Existing fusion methods struggle to balance original and enhanced image advantages. Method: DIFM uses bidirectional cross-attention to integrate spatial information from original and enhanced images and applies global spatial attention to refine features. Additionally, a multi-scale boundary loss is introduced to improve segmentation at object boundaries. Result: Experimental results on the ACDC and Synapse datasets show that the proposed method outperforms existing approaches in segmentation accuracy, particularly at object boundaries. Conclusion: The proposed dual interactive fusion module (DIFM) improves medical image segmentation by effectively utilizing complementary information from original and enhanced images, leading to enhanced spatial characteristics and segmentation accuracy. Abstract: Medical image segmentation is a crucial method for assisting professionals in diagnosing various diseases through medical imaging. However, various factors such as noise, blurriness, and low contrast often hinder the accurate diagnosis of diseases. While numerous image enhancement techniques can mitigate these issues, they may also alter crucial information needed for accurate diagnosis in the original image. Conventional image fusion strategies, such as feature concatenation, can address this challenge. However, they struggle to fully leverage the advantages of both original and enhanced images while suppressing the side effects of the enhancements. To overcome this problem, we propose a dual interactive fusion module (DIFM) that effectively exploits mutual complementary information from the original and enhanced images. 
DIFM employs cross-attention bidirectionally to simultaneously attend to corresponding spatial information across different images, subsequently refining the complementary features via global spatial attention. This interaction leverages low- to high-level features implicitly associated with diverse structural attributes like edges, blobs, and object shapes, resulting in enhanced features that embody important spatial characteristics. In addition, we introduce a multi-scale boundary loss based on gradient extraction to improve segmentation accuracy at object boundaries. Experimental results on the ACDC and Synapse datasets demonstrate the superiority of the proposed method quantitatively and qualitatively. Code available at: https://github.com/JJeong-Gari/DIN
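
The bidirectional cross-attention described for DIFM can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, and the fusion by residual addition in `dual_interaction` is a hypothetical choice made only to keep the example short. The function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: queries come from one image,
    keys and values come from the other image."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                         for j in range(dim)])
    return attended


def dual_interaction(original, enhanced):
    """Attend in both directions, then fuse.
    Assumption: fusion is a plain sum of both inputs and both attended maps."""
    orig_to_enh = cross_attention(original, enhanced)
    enh_to_orig = cross_attention(enhanced, original)
    return [[o + a + e + b for o, a, e, b in zip(ro, ra, re, rb)]
            for ro, ra, re, rb in zip(original, orig_to_enh, enhanced, enh_to_orig)]


# Two spatial positions with two channels each (toy values).
original = [[1.0, 0.0], [0.0, 1.0]]
enhanced = [[0.5, 0.5], [1.0, 0.0]]
fused = dual_interaction(original, enhanced)
```

Each position in one feature map queries every position in the other, so the fused output keeps the input's spatial layout while mixing in information from both images.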

[132] StripDet: Strip Attention-Based Lightweight 3D Object Detection from Point Cloud

Weichao Wang,Wendong Mao,Zhongfeng Wang

Main category: cs.CV

TL;DR: StripDet is a lightweight, efficient 3D object detection framework that achieves high accuracy with fewer parameters, ideal for deployment on edge devices.

Details Motivation: High-accuracy 3D object detection models are difficult to deploy due to high computational and memory demands, necessitating a more efficient solution. Method: StripDet uses a novel Strip Attention Block (SAB) with asymmetric strip convolutions and a hierarchical backbone with depthwise separable convolutions for efficiency. Result: StripDet achieves 79.97% mAP for car detection on KITTI with 0.65M parameters, outperforming PointPillars and other lightweight methods by a significant margin. Conclusion: StripDet is a lightweight framework for 3D object detection that achieves high efficiency and accuracy, making it suitable for edge devices. Abstract: The deployment of high-accuracy 3D object detection models from point cloud remains a significant challenge due to their substantial computational and memory requirements. To address this, we introduce StripDet, a novel lightweight framework designed for on-device efficiency. First, we propose the novel Strip Attention Block (SAB), a highly efficient module designed to capture long-range spatial dependencies. By decomposing standard 2D convolutions into asymmetric strip convolutions, SAB efficiently extracts directional features while reducing computational complexity from quadratic to linear. Second, we design a hardware-friendly hierarchical backbone that integrates SAB with depthwise separable convolutions and a simple multiscale fusion strategy, achieving end-to-end efficiency. Extensive experiments on the KITTI dataset validate StripDet's superiority. With only 0.65M parameters, our model achieves a 79.97% mAP for car detection, surpassing the baseline PointPillars with a 7x parameter reduction. Furthermore, StripDet outperforms recent lightweight and knowledge distillation-based methods, achieving a superior accuracy-efficiency trade-off while establishing itself as a practical solution for real-world 3D detection on edge devices.
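
The saving from replacing a square kernel with asymmetric strip kernels can be checked with simple arithmetic. The sketch below counts convolution weights only (biases ignored). The kernel size, the channel counts, and the exact arrangement of the two strips are illustrative assumptions, not values taken from StripDet.

```python
def conv_weights(c_in, c_out, kernel_h, kernel_w):
    """Number of weights in a dense 2D convolution (biases ignored)."""
    return c_in * c_out * kernel_h * kernel_w


def strip_pair_weights(c_in, c_out, k):
    """Weights for a 1 x k strip followed by a k x 1 strip.
    Assumption: the intermediate feature map has c_out channels."""
    return conv_weights(c_in, c_out, 1, k) + conv_weights(c_out, c_out, k, 1)


channels, k = 64, 7  # hypothetical sizes, for illustration only
full = conv_weights(channels, channels, k, k)     # grows with k * k
strips = strip_pair_weights(channels, channels, k)  # grows with 2 * k
print(full, strips, full / strips)
```

With equal input and output channels, the ratio is k / 2, which is how the cost goes from quadratic to linear in the kernel size.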

[133] Neural Bloom: A Deep Learning Approach to Real-Time Lighting

Rafal Karp,Dawid Gruszka,Tomasz Trzcinski

Main category: cs.CV

TL;DR: The authors propose a neural-network-based method for quickly generating bloom lighting effects that outperforms existing methods.

Details Motivation: Existing traditional techniques rely on multiple blur passes and texture sampling, and their implementations often contain conditional branching. These operations account for a large share of execution time, so the authors seek a more efficient way to generate bloom lighting effects. Method: The authors propose two neural network methods, NBL and FastNBL, to generate bloom lighting effects, test both on a variety of 3D scenes, and evaluate brightness mask accuracy and inference speed. Result: Both methods produce high-quality bloom effects while outperforming the standard state-of-the-art bloom implementation, with FastNBL 28% faster and NBL 12% faster. Conclusion: The authors propose two neural-network-based methods for real-time bloom lighting, Neural Bloom Lighting (NBL) and Fast Neural Bloom Lighting (FastNBL), both of which outperform existing methods in quality and performance. Abstract: We propose a novel method to generate the bloom lighting effect in real time using neural networks. Our solution generates a brightness mask from a given 3D scene view up to 30% faster than state-of-the-art methods. The existing traditional techniques rely on multiple blur applications and texture sampling, and very often have conditional branching in their implementation. These operations occupy a big portion of the execution time. We solve this problem by proposing two neural network-based bloom lighting methods, Neural Bloom Lighting (NBL) and Fast Neural Bloom Lighting (FastNBL), focusing on their quality and performance. Both methods were tested on a variety of 3D scenes, with evaluations conducted on brightness mask accuracy and inference speed. The main contribution of this work is that both methods produce high-quality bloom effects while outperforming the standard state-of-the-art bloom implementation, with FastNBL being faster by 28% and NBL faster by 12%. These findings highlight that we can achieve realistic bloom lighting phenomena faster, moving us towards more realism in real-time environments in the future. This improvement saves computational resources, which is a major bottleneck in real-time rendering. Furthermore, it is crucial for sustaining immersion and ensuring smooth experiences in high FPS environments, while maintaining high-quality realism.

[134] Spatial-Aware Self-Supervision for Medical 3D Imaging with Multi-Granularity Observable Tasks

Yiqin Zhang,Meiling Chen,Zhengjie Zhang

Main category: cs.CV

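TL;DR: This study proposes a new self-supervised learning method for medical 3D visualization tasks that improves model interpretability while maintaining performance.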

Details Motivation: Existing self-supervised learning methods are largely influenced by designs from the generic 2D visual domain and lack an intuitive demonstration of the model's learning process on medical 3D images, which leads to insufficient medical interpretability. Method: A method comprising three sub-tasks is proposed to capture spatially relevant semantics in medical 3D imaging. It uses the enhanced semantic depth offered by the extra dimension in 3D imaging for multi-granularity spatial relationship modeling. Result: Experiments show that the method delivers performance on par with existing methods while facilitating an intuitive understanding of the self-supervised learning process. Conclusion: Experiments show that the method improves model interpretability while maintaining performance comparable to existing methods. Abstract: The application of self-supervised techniques has become increasingly prevalent within medical visualization tasks, primarily due to their capacity to mitigate the data scarcity prevalent in the healthcare sector. The majority of current works are influenced by designs originating in the generic 2D visual domain, which lack the intuitive demonstration of the model's learning process regarding 3D spatial knowledge. Consequently, these methods often fall short in terms of medical interpretability. We propose a method consisting of three sub-tasks to capture the spatially relevant semantics in medical 3D imaging. Their design adheres to observable principles to ensure interpretability while minimizing the resulting performance loss as much as possible. By leveraging the enhanced semantic depth offered by the extra dimension in 3D imaging, this approach incorporates multi-granularity spatial relationship modeling to maintain training stability. Experimental findings suggest that our approach is capable of delivering performance that is on par with current methodologies, while facilitating an intuitive understanding of the self-supervised learning process.

[135] OmniStyle2: Scalable and High Quality Artistic Style Transfer Data Generation via Destylization

Ye Wang,Zili Yi,Yibo Zhang,Peng Zheng,Xuping Xie,Jiang Lin,Yilin Wang,Rui Ma

Main category: cs.CV

TL;DR: OmniStyle2 proposes destylization to generate a large-scale dataset and, combined with a simple model, achieves a new advance in artistic style transfer.

Details Motivation: The lack of ground-truth style transfer data is the main challenge in artistic style transfer, so destylization is proposed to generate a dataset. Method: A large-scale dataset, DST-100K, is generated through reverse style transfer (destylization), and training is performed with the FLUX.1-dev model. Result: OmniStyle2 outperforms existing methods on both qualitative and quantitative benchmarks, validating the effectiveness of the data generation approach. Conclusion: With the large-scale DST-100K dataset and a simple feed-forward model, OmniStyle2 reaches a new level in artistic style transfer and overcomes the challenge of lacking ground-truth data. Abstract: OmniStyle2 introduces a novel approach to artistic style transfer by reframing it as a data problem. Our key insight is destylization, reversing style transfer by removing stylistic elements from artworks to recover natural, style-free counterparts. This yields DST-100K, a large-scale dataset that provides authentic supervision signals by aligning real artistic styles with their underlying content. To build DST-100K, we develop (1) DST, a text-guided destylization model that reconstructs style-free content, and (2) DST-Filter, a multi-stage evaluation model that employs Chain-of-Thought reasoning to automatically discard low-quality pairs while ensuring content fidelity and style accuracy. Leveraging DST-100K, we train OmniStyle2, a simple feed-forward model based on FLUX.1-dev. Despite its simplicity, OmniStyle2 consistently surpasses state-of-the-art methods across both qualitative and quantitative benchmarks. Our results demonstrate that scalable data generation via destylization provides a reliable supervision paradigm, overcoming the fundamental challenge posed by the lack of ground-truth data in artistic style transfer.

[136] ConstStyle: Robust Domain Generalization with Unified Style Transformation

Nam Duong Tran,Nam Nguyen Phuong,Hieu H. Pham,Phi Le Nguyen,My T. Thai

Main category: cs.CV

TL;DR: ConstStyle is a novel domain generalization approach that maps training and testing data onto a unified domain to reduce the impact of domain shifts, showing superior performance even with limited training domains.

Details Motivation: Deep neural networks suffer performance drops when test data distribution differs from training data. Existing domain generalization methods struggle with limited training domains or significant gaps between seen and unseen domains. Method: ConstStyle maps all training samples onto a unified domain optimized for seen domains and projects unseen domain samples similarly during testing to bridge the domain gap. Result: ConstStyle consistently outperforms existing methods across diverse scenarios, with up to a 19.82% increase in accuracy when only a limited number of seen domains are available. Conclusion: ConstStyle effectively reduces the impact of domain shifts and outperforms existing methods in domain generalization, especially when only a limited number of seen domains are available. Abstract: Deep neural networks often suffer performance drops when test data distribution differs from training data. Domain Generalization (DG) aims to address this by focusing on domain-invariant features or augmenting data for greater diversity. However, these methods often struggle with limited training domains or significant gaps between seen (training) and unseen (test) domains. To enhance DG robustness, we hypothesize that it is essential for the model to be trained on data from domains that closely resemble unseen test domains-an inherently difficult task due to the absence of prior knowledge about the unseen domains. Accordingly, we propose ConstStyle, a novel approach that leverages a unified domain to capture domain-invariant features and bridge the domain gap with theoretical analysis. During training, all samples are mapped onto this unified domain, optimized for seen domains. During testing, unseen domain samples are projected similarly before predictions. By aligning both training and testing data within this unified domain, ConstStyle effectively reduces the impact of domain shifts, even with large domain gaps or few seen domains. 
Extensive experiments demonstrate that ConstStyle consistently outperforms existing methods across diverse scenarios. Notably, when only a limited number of seen domains are available, ConstStyle can boost accuracy by up to 19.82% compared to the next best approach.

[137] Multi-Strategy Guided Diffusion via Sparse Masking Temporal Reweighting Distribution Correction

Zekun Zhou,Yanru Gong,Liu Shi,Qiegen Liu

Main category: cs.CV

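TL;DR: This study proposes STRIDE, a diffusion model for sparse-view CT reconstruction that achieves high-quality image reconstruction through a set of novel techniques.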

Details Motivation: Diffusion models have shown remarkable generative capabilities in image processing tasks, but sparse-view CT reconstruction still faces challenges in completing missing projection views and modeling global information. Method: A joint training mechanism based on sparse conditional probabilities is proposed, together with a temporally varying sparse condition reweighting guidance strategy. Linear regression corrects distributional shifts, and a dual-network parallel architecture performs global correction and optimization across multiple sub-frequency components. Result: Experiments on public and real datasets show an improvement of 2.58 dB in PSNR, an increase of 2.37% in SSIM, and a reduction of 0.236 in MSE. Conclusion: STRIDE achieves high-quality CT image reconstruction with good generalization and robustness. Abstract: Diffusion models have demonstrated remarkable generative capabilities in image processing tasks. We propose a Sparse condition Temporal Reweighted Integrated Distribution Estimation guided diffusion model (STRIDE) for sparse-view CT reconstruction. Specifically, we design a joint training mechanism guided by sparse conditional probabilities to facilitate the model's effective learning of missing projection view completion and global information modeling. Based on systematic theoretical analysis, we propose a temporally varying sparse condition reweighting guidance strategy that dynamically adjusts weights during the progressive denoising process from pure noise to the real image, enabling the model to progressively perceive sparse-view information. Linear regression is employed to correct distributional shifts between known and generated data, mitigating inconsistencies arising during the guidance process. Furthermore, we construct a dual-network parallel architecture to perform global correction and optimization across multiple sub-frequency components, thereby effectively improving the model's capability in both detail restoration and structural preservation, ultimately achieving high-quality image reconstruction. Experimental results on both public and real datasets demonstrate that the proposed method achieves a best improvement of 2.58 dB in PSNR, an increase of 2.37% in SSIM, and a reduction of 0.236 in MSE compared to the best-performing baseline methods. The reconstructed images exhibit excellent generalization and robustness in terms of structural consistency, detail restoration, and artifact suppression.
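
One way to picture a linear-regression correction of distributional shift is sketched below. The abstract does not say how STRIDE applies the regression, so everything here is an assumption: a single slope and intercept are fitted by ordinary least squares on positions where both measured and generated values exist, and the fitted line is then applied to generated values at unmeasured positions. The names `fit_line` and `correct` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    variance = sum((xi - mean_x) ** 2 for xi in x)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def correct(generated, slope, intercept):
    """Map generated values onto the scale of the known data."""
    return [slope * g + intercept for g in generated]


# Toy values: generated data is a shifted, rescaled copy of the known data.
generated_at_known_views = [1.0, 2.0, 3.0, 4.0]
known_views = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(generated_at_known_views, known_views)
corrected_missing = correct([5.0], slope, intercept)
```

The fit uses only positions where ground truth is available, so the correction pulls the generated views toward the statistics of the measured ones without overwriting the measurements themselves.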

[138] S-LAM3D: Segmentation-Guided Monocular 3D Object Detection via Feature Space Fusion

Diana-Alexandra Sas,Florin Oniga

Main category: cs.CV

TL;DR: This paper introduces a decoupled strategy for monocular 3D object detection that improves performance by integrating precomputed segmentation information into the detection model without increasing its complexity.

Details Motivation: The motivation is to address the challenge of depth estimation in monocular 3D object detection by leveraging additional segmentation information to guide the detection process, without adding complexity to the model. Method: The authors propose a decoupled strategy that injects precomputed segmentation information into the feature space of a detection model, without expanding the model or jointly learning the priors. Result: The proposed method outperforms existing approaches that rely solely on RGB image features on the KITTI 3D Object Detection Benchmark, particularly for detecting small objects. Conclusion: The paper concludes that incorporating precomputed segmentation information can significantly improve the performance of monocular 3D object detection, particularly for small objects like pedestrians and cyclists. Abstract: Monocular 3D Object Detection represents a challenging Computer Vision task due to the nature of the input used, which is a single 2D image, lacking in any depth cues and placing the depth estimation problem as an ill-posed one. Existing solutions leverage the information extracted from the input by using Convolutional Neural Networks or Transformer architectures as feature extraction backbones, followed by specific detection heads for 3D parameters prediction. In this paper, we introduce a decoupled strategy based on injecting precomputed segmentation information priors and fusing them directly into the feature space for guiding the detection, without expanding the detection model or jointly learning the priors. The focus is on evaluating the impact of additional segmentation information on existing detection pipelines without adding additional prediction branches. 
The proposed method is evaluated on the KITTI 3D Object Detection Benchmark, outperforming the equivalent architecture that relies only on RGB image features for small objects in the scene: pedestrians and cyclists, and proving that understanding the input data can balance the need for additional sensors or training data.

[139] Motion Aware ViT-based Framework for Monocular 6-DoF Spacecraft Pose Estimation

Jose Sosa,Dan Pineau,Arunkumar Rathinam,Abdelrahman Shabayek,Djamila Aouada

Main category: cs.CV

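TL;DR: This paper proposes a deep learning method that combines motion-aware heatmaps and optical flow for 6-DoF spacecraft pose estimation, improving performance and generalisation.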

Details Motivation: Existing pose estimation methods mainly rely on single images with static keypoint localisation and fail to exploit the temporal information inherent to space operations, so a more effective approach is needed to improve estimation performance. Method: The method combines image features from a Vision Transformer (ViT) encoder with motion cues from a pre-trained optical flow model to localise 2D keypoints, and a Perspective-n-Point (PnP) solver recovers the 6-DoF pose from known 2D-3D correspondences. Result: The method outperforms single-image baselines in both 2D keypoint localisation and 6-DoF pose estimation, and shows good generalisation when tested on different data distributions from the SPARK-2024 dataset. Conclusion: The study successfully adapts a deep learning framework from human pose estimation to spacecraft pose estimation, combining motion-aware heatmaps and optical flow to capture motion dynamics and improving both pose estimation performance and generalisation across data distributions. Abstract: Monocular 6-DoF pose estimation plays an important role in multiple spacecraft missions. Most existing pose estimation approaches rely on single images with static keypoint localisation, failing to exploit valuable temporal information inherent to space operations. In this work, we adapt a deep learning framework from human pose estimation to the spacecraft pose estimation domain that integrates motion-aware heatmaps and optical flow to capture motion dynamics. Our approach combines image features from a Vision Transformer (ViT) encoder with motion cues from a pre-trained optical flow model to localise 2D keypoints. Using the estimates, a Perspective-n-Point (PnP) solver recovers 6-DoF poses from known 2D-3D correspondences. We train and evaluate our method on the SPADES-RGB dataset and further assess its generalisation on real and synthetic data from the SPARK-2024 dataset. Overall, our approach demonstrates improved performance over single-image baselines in both 2D keypoint localisation and 6-DoF pose estimation. Furthermore, it shows promising generalisation capabilities when testing on different data distributions.
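
A PnP solver searches for the rotation and translation that best explain the 2D-3D correspondences. The sketch below shows only the quantity such a solver works with, the reprojection error under a pinhole camera; it is not a solver and not the paper's pipeline. The intrinsics (`fx`, `fy`, `cx`, `cy`), the toy keypoints, and the function names are assumptions chosen for illustration.

```python
import math


def project(point, rotation, translation, fx, fy, cx, cy):
    """Pinhole projection of a 3D model point into pixel coordinates."""
    cam = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
           for i in range(3)]
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


def reprojection_error(points_3d, keypoints_2d, rotation, translation,
                       fx, fy, cx, cy):
    """Mean pixel distance between detected and projected keypoints."""
    total = 0.0
    for point, (u, v) in zip(points_3d, keypoints_2d):
        pu, pv = project(point, rotation, translation, fx, fy, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(points_3d)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
no_shift = [0.0, 0.0, 0.0]
model_points = [[0.0, 0.0, 10.0], [1.0, 2.0, 10.0]]  # toy 3D keypoints
detected = [(50.0, 50.0), (60.0, 70.0)]               # toy 2D keypoints
error = reprojection_error(model_points, detected, identity, no_shift,
                           100.0, 100.0, 50.0, 50.0)
```

A candidate pose with low reprojection error is consistent with the detected keypoints, which is why better 2D keypoint localisation feeds directly into better 6-DoF estimates.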

[140] Khana: A Comprehensive Indian Cuisine Dataset

Omkar Prabhu

Main category: cs.CV

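TL;DR: This paper introduces Khana, a new benchmark dataset for food image classification, segmentation, and retrieval of Indian dishes, filling the gap in existing datasets' coverage of Indian cuisine.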

Details Motivation: Because of the regional diversity of Indian cuisine, its complex preparations, and the lack of comprehensive labeled datasets, current food datasets do not fully cover its breadth and detail. Method: A new benchmark dataset, Khana, is created with around 131K images across 80 labels, and state-of-the-art models are benchmarked on classification, segmentation, and retrieval. Result: Khana provides a comprehensive and challenging benchmark and a valuable resource for researchers and developers building real-world applications that draw on the richness of Indian cuisine. Conclusion: Khana bridges the gap between research and development by providing a comprehensive and challenging benchmark dataset for food image models, specifically targeting Indian cuisine. Abstract: As global interest in diverse culinary experiences grows, food image models are essential for improving food-related applications by enabling accurate food recognition, recipe suggestions, dietary tracking, and automated meal planning. Despite the abundance of food datasets, a noticeable gap remains in capturing the nuances of Indian cuisine due to its vast regional diversity, complex preparations, and the lack of comprehensive labeled datasets that cover its full breadth. To address this, we introduce Khana, a new benchmark dataset for food image classification, segmentation, and retrieval of dishes from Indian cuisine. Khana fills the gap by establishing a taxonomy of Indian cuisine and offering around 131K images in the dataset spread across 80 labels, each with a resolution of 500x500 pixels. This paper describes the dataset creation process and evaluates state-of-the-art models on classification, segmentation, and retrieval as baselines. Khana bridges the gap between research and development by providing a comprehensive and challenging benchmark for researchers while also serving as a valuable resource for developers creating real-world applications that leverage the rich tapestry of Indian cuisine. Webpage: https://khana.omkar.xyz

[141] Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Jaemin Son,Sujin Choi,Inyong Yun

Main category: cs.CV

TL;DR: A lightweight token pruning framework is proposed to reduce the computational burden of vision-language models in document understanding tasks while maintaining accuracy.

Details Motivation: High computational demands of vision-language models in document understanding tasks necessitate a more efficient approach. Method: A binary patch-level classifier and a max-pooling refinement step are used to filter out non-informative regions in document images before VLM processing. Result: The proposed framework successfully lowers computational costs while maintaining similar levels of accuracy on real-world document datasets. Conclusion: The lightweight token pruning framework effectively reduces computational costs in vision-language models for document understanding, without sacrificing accuracy. Abstract: Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.
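
The two-step pruning described above (a binary patch classifier followed by max-pooling refinement) can be sketched in plain Python. The classifier itself is replaced by precomputed per-patch text scores, and the 0.5 threshold and 3x3 pooling window are assumptions rather than values from the paper. Kept patches are returned as indices into the original row-major grid, in line with the "index-preserving" idea in the title.

```python
def prune_patches(scores, threshold=0.5, window=3):
    """Return row-major indices of patches to keep.

    scores: 2D grid of per-patch text probabilities (stand-in for the
    binary classifier). threshold and window are illustrative assumptions.
    """
    rows, cols = len(scores), len(scores[0])
    # Step 1: binary mask from the patch-level scores.
    mask = [[1 if s >= threshold else 0 for s in row] for row in scores]
    # Step 2: stride-1 max pooling, which re-admits neighbours of text
    # patches so fragmented text regions become spatially coherent.
    radius = window // 2
    kept = []
    for i in range(rows):
        for j in range(cols):
            neighbourhood = [mask[y][x]
                             for y in range(max(0, i - radius), min(rows, i + radius + 1))
                             for x in range(max(0, j - radius), min(cols, j + radius + 1))]
            if max(neighbourhood):
                kept.append(i * cols + j)  # original position is preserved
    return kept


# A 4 x 4 patch grid with one confident text patch in the top-left corner.
grid = [[0.9, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1]]
kept_indices = prune_patches(grid)
```

Here 4 of 16 patches survive, so a downstream VLM would process a quarter of the tokens while each surviving token still carries its original grid position.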

[142] BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users

Wanyin Cheng,Zanxi Ruan

Main category: cs.CV

TL;DR: The paper presents BLaVe-CoT, a new VQA framework designed to handle ambiguity in questions from Blind and Low Vision users, showing improved robustness and performance on a relevant benchmark.

Details Motivation: VQA has potential to assist Blind and Low Vision users, but real-world usage is challenging due to blurry photos and difficulty in articulating specific questions, resulting in ambiguous questions and multiple valid answers. Method: The paper proposes a VQA framework named BLaVe-CoT, which uses a LoRA-tuned BLIP-2 model to propose diverse candidate answers, spatially grounds each answer using PolyFormer, and applies a chain-of-thought reasoning module to assess answer consistency. Result: Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. Conclusion: This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. Abstract: Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users, yet real-world usage remains challenging. Due to visual impairments, BLV users often take blurry or poorly framed photos and face difficulty in articulating specific questions about what they cannot fully see. As a result, their visual questions are frequently ambiguous, and different users may interpret them in diverse ways. This leads to multiple valid answers, each grounded in different image regions-posing a mismatch with conventional VQA systems that assume a single answer and region. To bridge this gap, we present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity. Our method proposes diverse candidate answers using a LoRA-tuned BLIP-2 model, then grounds each answer spatially using PolyFormer, and finally applies a chain-of-thought reasoning module to assess whether the answers refer to the same or different regions. Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. 
This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. To foster further research and accessibility applications, we have made the code publicly available at https://github.com/Accecwan/BLaVe-CoT.

[143] Interleaving Reasoning for Better Text-to-Image Generation

Wenxuan Huang,Shuang Chen,Zheyong Xie,Shaosheng Cao,Shixiang Tang,Yufan Shen,Qingyu Yin,Wenbo Hu,Xiaoman Wang,Yuntian Tang,Junbo Qiao,Yue Guo,Yao Hu,Zhenfei Yin,Philip Torr,Yu Cheng,Wanli Ouyang,Shaohui Lin

Main category: cs.CV

TL;DR: This paper introduces a new framework and training method for text-to-image generation that achieves state-of-the-art results by leveraging interleaving reasoning between text and images.

Details Motivation: To bridge the gap in instruction following and detail preservation in current multimodal models compared to tightly coupled systems like GPT-4o by leveraging interleaving reasoning. Method: Introducing the Interleaving Reasoning Generation (IRG) framework and Interleaving Reasoning Generation Learning (IRGL) method, which alternates between text-based thinking and image synthesis to enhance image generation. Result: The model achieved significant improvements, with 5-10 point gains on several benchmarks, along with better visual quality and fine-grained fidelity. Conclusion: The proposed IRG framework and IRGL training method significantly improve the performance of text-to-image generation models, achieving state-of-the-art results on several benchmarks. Abstract: Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. 
We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation

[144] Cross-Modal Enhancement and Benchmark for UAV-based Open-Vocabulary Object Detection

Zhenhai Weng,Zhongliang Yu

Main category: cs.CV

TL;DR: This paper addresses the domain gap in open-vocabulary object detection for UAV imagery by introducing a new UAV-Label engine, two large-scale datasets (UAVDE-2M and UAVCAP-15k), and a novel CAGE module, which together improve detection performance on UAV images.

Details Motivation: The motivation for this work stems from the domain gap between existing large-scale datasets for open-vocabulary object detection (OVD) pre-training, which are mainly composed of ground-level, natural images, and UAV imagery, leading to reduced model performance on UAV images. Method: The authors constructed two new datasets, UAVDE-2M and UAVCAP-15k, and proposed a Cross-Attention Gated Enhancement Fusion (CAGE) module integrated into the YOLO-World-v2 architecture to address the domain gap in UAV imagery object detection. Result: Extensive experiments on the VisDrone and SIMD datasets demonstrated the effectiveness of the proposed method in improving object detection performance on UAV-based imagery and remote sensing tasks. Conclusion: The paper concludes that the proposed UAV-Label engine and the CAGE module significantly improve the performance of open-vocabulary object detection for UAV-based imagery and remote sensing applications. Abstract: Open-Vocabulary Object Detection (OVD) has emerged as a pivotal technology for applications involving Unmanned Aerial Vehicles (UAVs). However, the prevailing large-scale datasets for OVD pre-training are predominantly composed of ground-level, natural images. This creates a significant domain gap, causing models trained on them to exhibit a substantial drop in performance on UAV imagery. To address this limitation, we first propose a refined UAV-Label engine. Then we construct and introduce UAVDE-2M (over 2,000,000 instances and 1,800 categories) and UAVCAP-15k (over 15,000 images). Furthermore, we propose a novel Cross-Attention Gated Enhancement Fusion (CAGE) module and integrate it into the YOLO-World-v2 architecture. Finally, extensive experiments on the VisDrone and SIMD datasets verify the effectiveness of our proposed method for applications in UAV-based imagery and remote sensing.

[145] Micro-Expression Recognition via Fine-Grained Dynamic Perception

Zhiwen Shao,Yifan Cheng,Fan Zhang,Xuehuai Shi,Canlin Li,Lizhuang Ma,Dit-yan Yeung

Main category: cs.CV

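TL;DR: This paper proposes a fine-grained dynamic perception (FDP) framework for micro-expression recognition that alleviates data scarcity through frame-level feature ranking and dynamic image construction, achieving significant performance gains on multiple datasets.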

Details Motivation: Existing micro-expression recognition methods rely on hand-crafted features or deep networks, but hand-crafted features require key frames, while deep networks are limited by small-scale, low-diversity data. Method: A novel fine-grained dynamic perception (FDP) framework is proposed, comprising frame-level feature ranking, a local-global feature-aware transformer, a rank scorer, and a dynamic image construction module. Result: FDP improves the F1-score over the previous best results by 4.05%, 2.50%, 7.71%, and 2.11% on the CASME II, SAMM, CAS(ME)^2, and CAS(ME)^3 datasets, respectively. Conclusion: Experiments show that FDP outperforms existing methods in micro-expression recognition and dynamic image construction and effectively alleviates the data scarcity issue. Abstract: Facial micro-expression recognition (MER) is a challenging task, due to the transience, subtlety, and dynamics of micro-expressions (MEs). Most existing methods resort to hand-crafted features or deep networks, in which the former often additionally requires key frames, and the latter suffers from small-scale and low-diversity training data. In this paper, we develop a novel fine-grained dynamic perception (FDP) framework for MER. We propose to rank frame-level features of a sequence of raw frames in chronological order, in which the rank process encodes the dynamic information of both ME appearances and motions. Specifically, a novel local-global feature-aware transformer is proposed for frame representation learning. A rank scorer is further adopted to calculate rank scores of each frame-level feature. Afterwards, the rank features from the rank scorer are pooled along the temporal dimension to capture the dynamic representation. Finally, the dynamic representation is shared by a MER module and a dynamic image construction module, in which the former predicts the ME category, and the latter uses an encoder-decoder structure to construct the dynamic image. The design of the dynamic image construction task is beneficial for capturing subtle facial actions associated with MEs and alleviating the data scarcity issue. Extensive experiments show that our method (i) significantly outperforms the state-of-the-art MER methods, and (ii) works well for dynamic image construction. Particularly, our FDP improves by 4.05%, 2.50%, 7.71%, and 2.11% over the previous best results in terms of F1-score on the CASME II, SAMM, CAS(ME)^2, and CAS(ME)^3 datasets, respectively. 
The code is available at https://github.com/CYF-cuber/FDP.

[146] DVLO4D: Deep Visual-Lidar Odometry with Sparse Spatial-temporal Fusion

Mengmeng Liu,Michael Ying Yang,Jiuming Liu,Yunpeng Zhang,Jiangtao Li,Sander Oude Elberink,George Vosselman,Hao Cheng

Main category: cs.CV

TL;DR: DVLO4D improves visual-LiDAR odometry with sparse spatiotemporal fusion, achieving high accuracy, robustness, and real-time efficiency.

Details Motivation: Traditional visual-LiDAR odometry approaches struggle with sensor misalignment, underutilize temporal information, and require extensive manual tuning for different sensor configurations. This work aims to overcome these limitations by developing a more accurate, robust, and efficient framework. Method: DVLO4D introduces three innovations: Sparse Query Fusion for effective multi-modal data fusion, a Temporal Interaction and Update module to improve pose estimation initialization and reduce accumulative errors, and a Temporal Clip Training strategy with a Collective Average Loss mechanism for global optimization and reduced scale drift. Result: Extensive experiments on the KITTI and Argoverse Odometry datasets demonstrate that DVLO4D achieves state-of-the-art performance in pose accuracy and robustness, with an inference time of 82 ms, making it suitable for real-time applications. Conclusion: DVLO4D is a novel visual-LiDAR odometry framework that significantly improves the accuracy and robustness of autonomous system localization while maintaining high efficiency for real-time deployment. Abstract: Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. 
Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally-predicted positions with current frame data, providing better initialization values for pose estimation and enhancing the model's robustness against accumulative errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing scale drift over long sequences. Extensive experiments on the KITTI and Argoverse Odometry datasets demonstrate the superiority of our proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method is highly efficient, with an inference time of 82 ms, giving it the potential for real-time deployment.

[147] Analysis of Blood Report Images Using General Purpose Vision-Language Models

Nadia Bakhsheshi,Hamid Beigy

Main category: cs.CV

TL;DR: This paper investigates the use of Vision-Language Models (VLMs) for automated blood report analysis, showing that models like Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick can improve patient understanding of medical reports, though results are preliminary due to limited data.

Details Motivation: Interpreting blood reports is challenging for individuals, often causing anxiety and overlooked health issues. This study explores how VLMs can assist by automatically analyzing blood report images, aiming to improve health literacy and access to medical information. Method: A comparative evaluation of three VLMs—Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick—was conducted on a dataset of 100 diverse blood report images. Clinically relevant questions were adapted for each report, and model responses were evaluated using Sentence-BERT to assess similarity. Result: The findings indicate that general-purpose VLMs can effectively interpret blood reports, offering clear explanations directly from images. This capability can enhance health literacy and reduce barriers to understanding complex medical data. Conclusion: The study concludes that general-purpose Vision-Language Models (VLMs) are a practical and promising solution for preliminary blood report analysis, laying the foundation for reliable and accessible AI-assisted healthcare applications. Abstract: The reliable analysis of blood reports is important for health knowledge, but individuals often struggle with interpretation, leading to anxiety and overlooked issues. We explore the potential of general-purpose Vision-Language Models (VLMs) to address this challenge by automatically analyzing blood report images. We conduct a comparative evaluation of three VLMs: Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick, determining their performance on a dataset of 100 diverse blood report images. Each model was prompted with clinically relevant questions adapted to each blood report. The answers were then processed using Sentence-BERT to compare and evaluate how closely the models responded. The findings suggest that general-purpose VLMs are a practical and promising technology for developing patient-facing tools for preliminary blood report analysis. 
Their ability to provide clear interpretations directly from images can improve health literacy and reduce the limitations to understanding complex medical information. This work establishes a foundation for the future development of reliable and accessible AI-assisted healthcare applications. While results are encouraging, they should be interpreted cautiously given the limited dataset size.

[148] TinyDef-DETR: An Enhanced DETR Detector for UAV Power Line Defect Detection

Jiaming Cui

Main category: cs.CV

TL;DR: TinyDef-DETR is a DETR-based small-defect detection framework whose new modules significantly improve the detection of small defects in UAV inspection.

Details Motivation: Existing detectors struggle to detect small defects against complex backgrounds, suffering from detail loss, weak boundary sensitivity, and insufficient integration of global context with local cues. Method: TinyDef-DETR introduces a stride-free space-to-depth module, an edge-enhanced convolution, a cross-stage dual-domain multi-scale attention module, and a Focaler-Wise-SIoU regression loss. Result: Experiments on the CSG-ADCD dataset show that TinyDef-DETR substantially improves both precision and recall over existing methods, especially on small-object subsets; validation on the VisDrone benchmark demonstrates its generalization capability. Conclusion: By combining detail-preserving downsampling, edge-sensitive representations, dual-domain attention, and difficulty-adaptive regression, TinyDef-DETR offers a practical and efficient solution for small-defect detection on power lines. Abstract: Automated inspection of transmission lines using UAVs is hindered by the difficulty of detecting small and ambiguous defects against complex backgrounds. Conventional detectors often suffer from detail loss due to strided downsampling, weak boundary sensitivity in lightweight backbones, and insufficient integration of global context with local cues. To address these challenges, we propose TinyDef-DETR, a DETR-based framework designed for small-defect detection. The method introduces a stride-free space-to-depth module for lossless downsampling, an edge-enhanced convolution for boundary-aware feature extraction, a cross-stage dual-domain multi-scale attention module to jointly capture global and local information, and a Focaler-Wise-SIoU regression loss to improve localization of small objects. Experiments conducted on the CSG-ADCD dataset demonstrate that TinyDef-DETR achieves substantial improvements in both precision and recall compared to competitive baselines, with particularly notable gains on small-object subsets, while incurring only modest computational overhead. Further validation on the VisDrone benchmark confirms the generalization capability of the proposed approach. Overall, the results indicate that integrating detail-preserving downsampling, edge-sensitive representations, dual-domain attention, and difficulty-adaptive regression provides a practical and efficient solution for UAV-based small-defect inspection in power grids.
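
The space-to-depth idea mentioned in this abstract can be illustrated with a minimal sketch. This is the generic rearrangement, not the paper's actual module: the function name `space_to_depth`, the single-channel nested-list input, and the block size of 2 are all assumptions made for illustration. Each 2x2 neighbourhood is moved into the channel axis, so the spatial size halves without discarding any value, which is why the operation is described as lossless.

```python
def space_to_depth(grid, block=2):
    """Rearrange an H x W grid into (H/block) x (W/block) cells,
    each holding the block*block values of one neighbourhood."""
    height, width = len(grid), len(grid[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            # Stack the block x block neighbourhood into the channel axis.
            cell = [grid[top + dy][left + dx]
                    for dy in range(block) for dx in range(block)]
            row.append(cell)
        out.append(row)
    return out


# Hypothetical 4x4 single-channel feature map.
feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
packed = space_to_depth(feature_map)  # 2 x 2 spatial grid, 4 channels per cell
```

Unlike a strided convolution or pooling layer, every input value is still present in `packed`; only its position changes.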

[149] BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Yuming Li,Yikai Wang,Yuying Zhu,Zhongyu Zhao,Ming Lu,Qi She,Shanghang Zhang

Main category: cs.CV

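TL;DR: This paper proposes BranchGRPO, which improves sampling and pruning strategies to make preference alignment of image and video generative models significantly more efficient and effective.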

Details Motivation: Existing alignment methods for image and video generative models are computationally expensive and unstable to train, calling for more efficient approaches. Method: BranchGRPO introduces a branch sampling policy, a tree-based advantage estimator, and pruning strategies that exploit path and depth redundancy, reducing compute cost and improving training stability. Result: BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity and accelerating convergence. Conclusion: On image and video preference alignment, BranchGRPO improves alignment scores by 16% over strong baselines while cutting training time by 50%. Abstract: Recent advancements in aligning image and video generative models via GRPO have achieved remarkable gains in enhancing human preference alignment. However, these methods still face high computational costs from on-policy rollouts and excessive SDE sampling steps, as well as training instability due to sparse rewards. In this paper, we propose BranchGRPO, a novel method that introduces a branch sampling policy updating the SDE sampling process. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO substantially lowers the per-update compute cost while maintaining or improving exploration diversity. This work makes three main contributions: (1) a branch sampling scheme that reduces rollout and training cost; (2) a tree-based advantage estimator incorporating dense process-level rewards; and (3) pruning strategies exploiting path and depth redundancy to accelerate convergence and boost performance. Experiments on image and video preference alignment show that BranchGRPO improves alignment scores by 16% over strong baselines, while cutting training time by 50%.
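
To see why sharing computation across common prefixes lowers rollout cost, the following sketch counts sampling-step evaluations for a branching rollout versus independent rollouts. It is a back-of-the-envelope illustration only: the function `tree_rollout_cost`, the 10-step schedule, the split steps, and the branch factor are hypothetical and are not taken from the paper.

```python
def tree_rollout_cost(num_steps, split_steps, branch_factor=2):
    """Count step evaluations when trajectories share prefixes.

    One evaluation is charged per active path per step; at every step
    listed in split_steps each active path forks into branch_factor paths.
    """
    paths = 1
    cost = 0
    for step in range(num_steps):
        if step in split_steps:
            paths *= branch_factor
        cost += paths
    return cost, paths


# Hypothetical schedule: 10 sampling steps, forks at steps 0, 3 and 6.
tree_cost, leaves = tree_rollout_cost(10, {0, 3, 6})
# Generating the same number of final samples independently costs leaves * steps.
independent_cost = leaves * 10
```

Under these assumed settings the tree produces 8 final samples with 50 step evaluations, against 80 for 8 independent trajectories. Pruning low-reward paths, as the abstract describes, would reduce the count further.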

[150] Multi-Stage Graph Neural Networks for Data-Driven Prediction of Natural Convection in Enclosed Cavities

Mohammad Ahangarkiasari,Hassan Pouraria

Main category: cs.CV

TL;DR: This paper proposes a multi-stage GNN architecture to improve the modeling of buoyancy-driven heat transfer in closed cavities, demonstrating superior performance over existing GNN approaches.

Details Motivation: High-fidelity CFD modeling for buoyancy-driven heat transfer is limited by its reliance on expert-crafted physics models, fine meshes, and intensive computation. Conventional GNNs struggle with capturing long-range dependencies in high-resolution graph structures. This work aims to overcome these limitations. Method: A novel multi-stage GNN architecture with hierarchical pooling and unpooling operations was developed to model global-to-local interactions across multiple spatial scales. The model was evaluated on a newly developed CFD dataset simulating natural convection within rectangular cavities. Result: Experimental results showed that the proposed model outperforms state-of-the-art GNN baselines in predictive accuracy, training efficiency, and reduction of long-term error accumulation. Conclusion: The proposed multi-stage GNN approach has potential for modeling complex heat transfer in mesh-based fluid dynamics simulations, offering higher predictive accuracy, improved training efficiency, and reduced long-term error accumulation compared to existing methods. Abstract: Buoyancy-driven heat transfer in closed cavities serves as a canonical testbed for thermal design. High-fidelity CFD modeling yields accurate thermal field solutions, yet its reliance on expert-crafted physics models, fine meshes, and intensive computation limits rapid iteration. Recent developments in data-driven modeling, especially Graph Neural Networks (GNNs), offer new alternatives for learning thermal-fluid behavior directly from simulation data, particularly on irregular mesh structures. However, conventional GNNs often struggle to capture long-range dependencies in high-resolution graph structures. To overcome this limitation, we propose a novel multi-stage GNN architecture that leverages hierarchical pooling and unpooling operations to progressively model global-to-local interactions across multiple spatial scales.
We evaluate the proposed model on our newly developed CFD dataset simulating natural convection within rectangular cavities with varying aspect ratios, where the bottom wall is isothermal hot, the top wall is isothermal cold, and the two vertical walls are adiabatic. Experimental results demonstrate that the proposed model achieves higher predictive accuracy, improved training efficiency, and reduced long-term error accumulation compared to state-of-the-art (SOTA) GNN baselines. These findings underscore the potential of the proposed multi-stage GNN approach for modeling complex heat transfer in mesh-based fluid dynamics simulations.

[151] Home-made Diffusion Model from Scratch to Hatch

Shih-Ying Yeh

Main category: cs.CV

TL;DR: This paper presents HDM, an efficient text-to-image diffusion model that delivers high-quality results with significantly reduced computational and financial costs.

Details Motivation: To make high-quality text-to-image generation more accessible by reducing computational and cost barriers. Method: The study introduces Cross-U-Transformer (XUT) for better feature integration and employs a training recipe with TREAD acceleration, a shifted square crop strategy, and progressive resolution scaling. Result: HDM achieves competitive 1024x1024 image generation quality at a significantly reduced training cost of $535-620 using four RTX5090 GPUs. Conclusion: HDM offers a more accessible and cost-effective solution for high-quality text-to-image generation, especially for those with limited computational resources. Abstract: We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024x1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: (1) Cross-U-Transformer (XUT), a novel U-shaped transformer that employs cross-attention for skip connections, providing superior feature integration that leads to remarkable compositional consistency; (2) a comprehensive training recipe that incorporates TREAD acceleration, a novel shifted square crop strategy for efficient arbitrary aspect-ratio training, and progressive resolution scaling; and (3) an empirical demonstration that smaller models (343M parameters) with carefully crafted architectures can achieve high-quality results and emergent capabilities, such as intuitive camera control. Our work provides an alternative paradigm of scaling, demonstrating a viable path toward democratizing high-quality text-to-image generation for individual researchers and smaller organizations with limited computational resources.

[152] High-Quality Tomographic Image Reconstruction Integrating Neural Networks and Mathematical Optimization

Anuraag Mishra,Andrea Gilch,Benjamin Apeleo Zubiri,Jan Rolfes,Frauke Liers

Main category: cs.CV

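TL;DR: This study proposes a novel image reconstruction technique that uses a neural network to identify edges and combines it with an optimization model, improving the sharpness and quality of nano- and microtomography images.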

Details Motivation: To improve reconstruction quality for specimens composed of homogeneous material phases with sharp edges and to reduce artifacts from previous reconstructions. Method: A neural network is trained to identify edges within subpictures, and the trained network is integrated into a mathematical optimization model that refines the reconstruction. Result: On experimental data, the technique significantly improves interface sharpness and material homogeneity compared to benchmark algorithms. Conclusion: By integrating a trained neural network into a mathematical optimization model, the technique improves reconstruction quality for projection-based nano- and microtomography, showing its potential for tomographic imaging. Abstract: In this work, we develop a novel technique for reconstructing images from projection-based nano- and microtomography. Our contribution focuses on enhancing reconstruction quality, particularly for specimens composed of homogeneous material phases connected by sharp edges. This is accomplished by training a neural network to identify edges within subpictures. The trained network is then integrated into a mathematical optimization model to reduce artifacts from previous reconstructions. To this end, the optimization approach favors solutions according to the learned predictions, but may also determine alternative solutions if these are strongly supported by the raw data. Hence, our technique successfully incorporates knowledge about the homogeneity and presence of sharp edges in the sample and thereby eliminates blurriness. Our results on experimental datasets show significant enhancements in interface sharpness and material homogeneity compared to benchmark algorithms. Thus, our technique produces high-quality reconstructions, showcasing its potential for advancing tomographic imaging techniques.

[153] MedSeqFT: Sequential Fine-tuning Foundation Models for 3D Medical Image Segmentation

Yiwen Ye,Yicheng Wu,Xiangde Luo,He Zhang,Ziyang Chen,Ting Dang,Yanning Zhang,Yong Xia

Main category: cs.CV

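TL;DR: MedSeqFT is a sequential fine-tuning framework for medical image analysis that uses Maximum Data Similarity selection and a knowledge and generalization retention fine-tuning scheme to adapt to evolving clinical tasks, outperforming existing strategies.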

Details Motivation: Existing fine-tuning strategies are limited for medical image segmentation: parallel fine-tuning cannot exploit shared knowledge, while multi-task fine-tuning needs simultaneous access to all datasets and struggles to integrate incremental tasks. Method: MedSeqFT introduces two core components: Maximum Data Similarity (MDS) selection and a LoRA-based Knowledge and Generalization Retention Fine-Tuning (K&G RFT) scheme. Result: Extensive experiments on two multi-task datasets show that MedSeqFT consistently outperforms state-of-the-art fine-tuning strategies, and evaluations on two unseen tasks confirm its improved transferability. Conclusion: MedSeqFT is an effective, knowledge-retentive sequential fine-tuning framework that adapts to evolving clinical tasks and outperforms existing fine-tuning strategies. Abstract: Foundation models have become a promising paradigm for advancing medical image analysis, particularly for segmentation tasks where downstream applications often emerge sequentially. Existing fine-tuning strategies, however, remain limited: parallel fine-tuning isolates tasks and fails to exploit shared knowledge, while multi-task fine-tuning requires simultaneous access to all datasets and struggles with incremental task integration. To address these challenges, we propose MedSeqFT, a sequential fine-tuning framework that progressively adapts pre-trained models to new tasks while refining their representational capacity. MedSeqFT introduces two core components: (1) Maximum Data Similarity (MDS) selection, which identifies downstream samples most representative of the original pre-training distribution to preserve general knowledge, and (2) Knowledge and Generalization Retention Fine-Tuning (K&G RFT), a LoRA-based knowledge distillation scheme that balances task-specific adaptation with the retention of pre-trained knowledge. Extensive experiments on two multi-task datasets covering ten 3D segmentation tasks demonstrate that MedSeqFT consistently outperforms state-of-the-art fine-tuning strategies, yielding substantial performance gains (e.g., an average Dice improvement of 3.0%). Furthermore, evaluations on two unseen tasks (COVID-19-20 and Kidney) verify that MedSeqFT enhances transferability, particularly for tumor segmentation. Visual analyses of loss landscapes and parameter variations further highlight the robustness of MedSeqFT.
These results establish sequential fine-tuning as an effective, knowledge-retentive paradigm for adapting foundation models to evolving clinical tasks. Code will be released.
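
The Dice score used to report these segmentation gains can be computed as in the sketch below. It is the standard definition for binary masks, 2|A ∩ B| / (|A| + |B|); the function name `dice_score`, the flat 0/1 list representation, and the example masks are assumptions for illustration and do not reproduce the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Both masks are empty; treat this as a perfect match by convention.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 4-voxel masks: one true positive, one false positive.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
score = dice_score(predicted_mask, reference_mask)  # 2 * 1 / (2 + 1)
```

For a 3D volume the same formula applies after flattening the voxel grid, and per-task scores are typically averaged to obtain a figure such as the average Dice improvement quoted above.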

[154] PathoHR: Hierarchical Reasoning for Vision-Language Models in Pathology

Yating Huang,Ziyan Huang,Lintao Xiang,Qijun Yang,Hujun Yin

Main category: cs.CV

TL;DR: This paper introduces PathoHR-Bench, a new benchmark for evaluating vision-language models in pathology, and proposes a pathology-specific training scheme that significantly improves performance in capturing complex cross-modal relationships for tumor diagnosis.

Details Motivation: Accurate analysis of pathological images is crucial for automated tumor diagnosis, but it is challenging due to high structural similarity and subtle morphological variations in tissue images. Existing VL models struggle with the complex reasoning required for interpreting structured pathological reports. Method: A pathology-specific vision-language (VL) training scheme is introduced, which generates enhanced and perturbed samples for multimodal contrastive learning. This approach aims to improve hierarchical semantic understanding and compositional reasoning of VL models in the pathology domain. Result: The results show that existing VL models fail to effectively model intricate cross-modal relationships in pathology tasks. However, the proposed training scheme achieves state-of-the-art performance on PathoHR-Bench and six other pathology datasets, demonstrating its effectiveness in improving fine-grained pathology representation. Conclusion: The proposed pathology-specific VL training scheme achieves state-of-the-art performance on PathoHR-Bench and other pathology datasets, demonstrating its effectiveness in addressing the limitations of existing models in capturing complex cross-modal relationships in pathology. Abstract: Accurate analysis of pathological images is essential for automated tumor diagnosis but remains challenging due to high structural similarity and subtle morphological variations in tissue images. Current vision-language (VL) models often struggle to capture the complex reasoning required for interpreting structured pathological reports. To address these limitations, we propose PathoHR-Bench, a novel benchmark designed to evaluate VL models' abilities in hierarchical semantic understanding and compositional reasoning within the pathology domain. Results of this benchmark reveal that existing VL models fail to effectively model intricate cross-modal relationships, hence limiting their applicability in clinical settings.
To overcome this, we further introduce a pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning. Experimental evaluations demonstrate that our approach achieves state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, highlighting its effectiveness in fine-grained pathology representation.

[155] CARDIE: clustering algorithm on relevant descriptors for image enhancement

Giulia Bonino,Luca Alberto Rizzo

Main category: cs.CV

TL;DR: This paper introduces CARDIE, a new method that clusters images by color and luminosity to improve image enhancement tasks.

Details Motivation: Automatic image clustering remains little used in image enhancement because it is difficult to define clusters that are meaningful for a specific enhancement task. Method: The CARDIE algorithm clusters images by their color and luminosity content; a method is also introduced to quantify the impact of image enhancement algorithms on luminance distribution and local variance. Result: CARDIE produces clusters more relevant to image enhancement than those derived from semantic image attributes, and CARDIE clusters can be used to improve the sampling of image enhancement datasets. Conclusion: CARDIE is an unsupervised algorithm that clusters images by color and luminosity content and improves dataset resampling for image enhancement tasks. Abstract: Automatic image clustering is a cornerstone of computer vision, yet its application to image enhancement remains limited, primarily due to the difficulty of defining clusters that are meaningful for this specific task. To address this issue, we introduce CARDIE, an unsupervised algorithm that clusters images based on their color and luminosity content. In addition, we introduce a method to quantify the impact of image enhancement algorithms on luminance distribution and local variance. Using this method, we demonstrate that CARDIE produces clusters more relevant to image enhancement than those derived from semantic image attributes. Furthermore, we demonstrate that CARDIE clusters can be leveraged to resample image enhancement datasets, leading to improved performance for tone mapping and denoising algorithms. To encourage adoption and ensure reproducibility, we publicly release CARDIE code on our GitHub.

[156] SpecSwin3D: Generating Hyperspectral Imagery from Multispectral Data via Transformer Networks

Tang Sui,Songxi Yang,Qunying Huang

Main category: cs.CV

TL;DR: SpecSwin3D is a transformer-based model that effectively generates high-quality hyperspectral images from multispectral inputs, preserving spatial and spectral detail better than existing methods.

Details Motivation: The study addresses the trade-off between spatial and spectral resolution in multispectral and hyperspectral imagery, aiming to preserve both qualities during hyperspectral generation, which existing methods struggle to achieve. Method: SpecSwin3D, a transformer-based model utilizing a cascade training strategy and optimized band sequence within a 3D shifted-window framework, generates hyperspectral imagery from multispectral inputs by progressively expanding the spectral range to improve fidelity. Result: SpecSwin3D achieved a PSNR of 35.82 dB, SAM of 2.40°, and SSIM of 0.96, outperforming MHF-Net by +5.6 dB in PSNR and reducing ERGAS by more than half, while also demonstrating effectiveness in downstream tasks. Conclusion: SpecSwin3D outperforms baseline methods in hyperspectral image generation, achieving high spatial and spectral quality while demonstrating practical utility in downstream tasks like land use classification and burnt area segmentation. Abstract: Multispectral and hyperspectral imagery are widely used in agriculture, environmental monitoring, and urban planning due to their complementary spatial and spectral characteristics. A fundamental trade-off persists: multispectral imagery offers high spatial but limited spectral resolution, while hyperspectral imagery provides rich spectra at lower spatial resolution. Prior hyperspectral generation approaches (e.g., pan-sharpening variants, matrix factorization, CNNs) often struggle to jointly preserve spatial detail and spectral fidelity. In response, we propose SpecSwin3D, a transformer-based model that generates hyperspectral imagery from multispectral inputs while preserving both spatial and spectral quality. Specifically, SpecSwin3D takes five multispectral bands as input and reconstructs 224 hyperspectral bands at the same spatial resolution. In addition, we observe that reconstruction errors grow for hyperspectral bands spectrally distant from the input bands. 
To address this, we introduce a cascade training strategy that progressively expands the spectral range to stabilize learning and improve fidelity. Moreover, we design an optimized band sequence that strategically repeats and orders the five selected multispectral bands to better capture pairwise relations within a 3D shifted-window transformer framework. Quantitatively, our model achieves a PSNR of 35.82 dB, SAM of 2.40°, and SSIM of 0.96, outperforming the baseline MHF-Net by +5.6 dB in PSNR and reducing ERGAS by more than half. Beyond reconstruction, we further demonstrate the practical value of SpecSwin3D on two downstream tasks, including land use classification and burnt area segmentation.
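
Two of the metrics quoted here, PSNR (in dB) and the spectral angle mapper (SAM, in degrees), follow standard definitions that the sketch below implements for flat lists of values. The function names, the assumed peak value of 1.0, and the toy inputs are illustrative assumptions; the paper's own evaluation pipeline may normalize or average differently.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def spectral_angle_deg(reference, estimate):
    """Angle in degrees between two non-zero spectra of one pixel."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    norm = (math.sqrt(sum(r * r for r in reference))
            * math.sqrt(sum(e * e for e in estimate)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


# Toy example: a constant error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
toy_psnr = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
# Orthogonal spectra are 90 degrees apart; scaled copies are 0 degrees apart.
toy_angle = spectral_angle_deg([1.0, 0.0], [0.0, 1.0])
```

Higher PSNR and lower SAM are better. SAM ignores overall brightness because it depends only on the direction of the spectrum vector, which is why it is reported alongside PSNR.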

[157] RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo,Chi Liu,Dongfu Xiao,Zhen Yu,Yueye Wang,Tianqing Zhu

Main category: cs.CV

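TL;DR: RetinaGuard is a new framework for protecting biometric privacy in medical images; it obscures retinal age without compromising image quality or disease diagnosis and can be extended to other medical image biomarkers.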

Details Motivation: Retinal age, a biomarker that can be predicted from fundus images, has been shown to predict systemic disease risk, behavioral patterns, aging trajectory, and even mortality. Extracting such sensitive biometric data raises serious privacy risks, so a method to protect biometric privacy is needed. Method: RetinaGuard uses a feature-level generative adversarial masking mechanism together with a multiple-to-one knowledge distillation strategy that combines a retinal foundation model and diverse surrogate age encoders, providing a universal defense against black-box age prediction models. Result: In comprehensive evaluations, RetinaGuard successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation. Conclusion: RetinaGuard is a novel privacy-enhancing framework that effectively obscures retinal age while preserving visual quality and diagnostic utility, and it can be flexibly extended to other medical image biomarkers. Abstract: The integration of AI with medical images enables the extraction of implicit image-derived biomarkers for a precise health assessment. Recently, retinal age, a biomarker predicted from fundus images, has been shown to predict systemic disease risks, behavioral patterns, aging trajectory and even mortality. However, the capability to infer such sensitive biometric data raises significant privacy risks, where unauthorized use of fundus images could lead to bioinformation leakage, breaching individual privacy. In response, we formulate a new research problem of biometric privacy associated with medical images and propose RetinaGuard, a novel privacy-enhancing framework that employs a feature-level generative adversarial masking mechanism to obscure retinal age while preserving image visual quality and disease diagnostic utility. The framework further utilizes a novel multiple-to-one knowledge distillation strategy incorporating a retinal foundation model and diverse surrogate age encoders to enable a universal defense against black-box age prediction models. Comprehensive evaluations confirm that RetinaGuard successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation. RetinaGuard is also flexible for extension to other medical image derived biomarkers.

[158] UniVerse-1: Unified Audio-Video Generation via Stitching of Experts

Duomin Wang,Wei Zuo,Aojie Li,Ling-Hao Chen,Xinyao Liao,Deyu Zhou,Zixin Yin,Xili Dai,Daxin Jiang,Gang Yu

Main category: cs.CV

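TL;DR: UniVerse-1 is a unified audio-video generation model that uses a stitching of experts technique and an online annotation pipeline to make significant progress in generating high-quality, coordinated audio-video content, and it is open-sourced to advance related research.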

Details Motivation: To improve the coordination of generated audio and video, overcome the alignment problems caused by traditional text annotations, and avoid the high cost of training from scratch. Method: A stitching of experts (SoE) technique deeply fuses the corresponding blocks of pre-trained video and audio generation models, and an online annotation pipeline improves the labeling accuracy and temporal alignment of the training data. Result: After fine-tuning on approximately 7,600 hours of audio-video data, UniVerse-1 generates well-coordinated ambient sounds and well-aligned speech; a new benchmark dataset, Verse-Bench, is introduced for evaluation. Conclusion: UniVerse-1 effectively combines pre-trained video and audio generation models through stitching of experts and an online annotation pipeline, achieving high-quality synchronized audio-video generation, and its open-source release advances research in the field. Abstract: We introduce UniVerse-1, a unified, Veo-3-like model capable of simultaneously generating coordinated audio and video. To enhance training efficiency, we bypass training from scratch and instead employ a stitching of experts (SoE) technique. This approach deeply fuses the corresponding blocks of pre-trained video and music generation expert models, thereby fully leveraging their foundational capabilities. To ensure accurate annotations and temporal alignment for both ambient sounds and speech with video content, we developed an online annotation pipeline that processes the required training data and generates labels during the training process. This strategy circumvents the performance degradation often caused by misaligned text-based annotations. Through the synergy of these techniques, our model, after being finetuned on approximately 7,600 hours of audio-video data, produces results with well-coordinated audio-visuals for ambient sound generation and strong alignment for speech generation. To systematically evaluate our proposed method, we introduce Verse-Bench, a new benchmark dataset. In an effort to advance research in audio-video generation and to close the performance gap with state-of-the-art models such as Veo3, we make our model and code publicly available. We hope this contribution will benefit the broader research community. Project page: https://dorniwang.github.io/UniVerse-1/.

[159] UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning

Huy Le,Nhat Chung,Tung Kieu,Jingkang Yang,Ngan Le

Main category: cs.CV

TL;DR: UNO is a single-stage, unified framework for Video Scene Graph Generation that effectively addresses both box-level and pixel-level tasks.

Details Motivation: Prior studies target either coarse-grained or fine-grained VidSGG with task-specific architectures, so a unified framework is needed. Method: UNO uses an extended slot attention mechanism, object temporal consistency learning, and a dynamic triplet prediction module. Result: UNO achieves competitive performance and improved efficiency on both box-level and pixel-level VidSGG benchmarks. Conclusion: UNO provides a unified and efficient framework for VidSGG that effectively handles both box-level and pixel-level tasks. Abstract: Video Scene Graph Generation (VidSGG) aims to represent dynamic visual content by detecting objects and modeling their temporal interactions as structured graphs. Prior studies typically target either coarse-grained box-level or fine-grained panoptic pixel-level VidSGG, often requiring task-specific architectures and multi-stage training pipelines. In this paper, we present UNO (UNified Object-centric VidSGG), a single-stage, unified framework that jointly addresses both tasks within an end-to-end architecture. UNO is designed to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. The core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots. To ensure robust temporal modeling, we introduce object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time. We evaluate UNO on standard box-level and pixel-level VidSGG benchmarks. Results demonstrate that UNO not only achieves competitive performance across both tasks but also offers improved efficiency through a unified, object-centric design.

[160] AI-Based Applied Innovation for Fracture Detection in X-rays Using Custom CNN and Transfer Learning Models

Amna Hassan,Ilsa Afzaal,Nouman Muneeb,Aneeqa Batool,Hamail Noor

Main category: cs.CV

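TL;DR: This paper proposes an AI-based fracture detection solution that uses a custom convolutional neural network (CNN) to detect fractures automatically in X-ray images, aiming to address the lack of specialist radiology services in low-resource settings.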

Details Motivation: Bone fractures are a major global health problem, and diagnosis is often constrained in low-resource settings by the lack of expert radiology services. Conventional imaging methods are costly, carry radiation exposure risks, and depend on specialized interpretation, so a low-cost, efficient solution is needed. Method: The authors developed a custom convolutional neural network (CNN) and benchmarked it against transfer learning models including EfficientNetB0, MobileNetV2, and ResNet50. Training data came from the publicly available FracAtlas dataset, which contains 4,083 anonymized musculoskeletal radiographs. Result: The custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and an F1-score of 0.91 on the FracAtlas dataset. The transfer learning models performed worse in this experimental setup, but the results should be interpreted in light of class imbalance and dataset limitations. Conclusion: The study shows that lightweight CNNs hold considerable promise for fracture detection in X-rays and underscores the importance of fair benchmarking, diverse datasets, and external validation for clinical translation. Abstract: Bone fractures present a major global health challenge, often resulting in pain, reduced mobility, and productivity loss, particularly in low-resource settings where access to expert radiology services is limited. Conventional imaging methods suffer from high costs, radiation exposure, and dependency on specialized interpretation. To address this, we developed an AI-based solution for automated fracture detection from X-ray images using a custom Convolutional Neural Network (CNN) and benchmarked it against transfer learning models including EfficientNetB0, MobileNetV2, and ResNet50. Training was conducted on the publicly available FracAtlas dataset, comprising 4,083 anonymized musculoskeletal radiographs. The custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and an F1-score of 0.91 on the FracAtlas dataset. Although transfer learning models (EfficientNetB0, MobileNetV2, ResNet50) performed poorly in this specific setup, these results should be interpreted in light of class imbalance and dataset limitations. This work highlights the promise of lightweight CNNs for detecting fractures in X-rays and underscores the importance of fair benchmarking, diverse datasets, and external validation for clinical translation.
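As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall quoted in the abstract. The function name is illustrative rather than taken from the paper, and because the inputs are already rounded, the result is only approximate.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as quoted in the abstract.
reported_precision = 0.94
reported_recall = 0.88

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # prints 0.91, matching the reported F1-score
```

The recomputed value agrees with the reported 0.91, so the three figures are mutually consistent at two decimal places.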

[161] Exploring Light-Weight Object Recognition for Real-Time Document Detection

Lucas Wojcik,Luiz Coelho,Roger Granada,David Menotti

Main category: cs.CV

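TL;DR: This paper proposes an efficient document detection and rectification method that is faster and lighter than existing techniques while preserving OCR quality.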

Details Motivation: Document detection and rectification is an important step for automatically extracting information from visual documents, yet it has received little research attention. Existing models focus on either improving performance or improving efficiency and leave real-time document processing largely unexplored. Method: The authors adapt IWPOD-Net and train it on NBID, a synthetic ID card dataset, tune the model through data augmentation and cross-dataset validation (for example with the MIDV dataset), and finally evaluate detection and rectification with an OCR quality metric. Result: The proposed model attains state-of-the-art performance scores even when document rectification is imperfect, and it is more efficient than existing methods. Conclusion: The authors present an efficient document detection and rectification model whose OCR quality is comparable to current state-of-the-art solutions while being smaller and more efficient. Abstract: Object Recognition and Document Skew Estimation have come a long way in terms of performance and efficiency. New models follow one of two directions: improving performance using larger models, and improving efficiency using smaller models. However, real-time document detection and rectification is a niche that is largely unexplored by the literature, yet it remains a vital step for automatic information retrieval from visual documents. In this work, we strive towards an efficient document detection pipeline that is satisfactory in terms of Optical Character Recognition (OCR) retrieval and faster than other available solutions. We adapt IWPOD-Net, a license plate detection network, and train it for detection on NBID, a synthetic ID card dataset. We experiment with data augmentation and cross-dataset validation with MIDV (another synthetic ID and passport document dataset) to find the optimal scenario for the model. Other methods from both the Object Recognition and Skew Estimation state-of-the-art are evaluated for comparison with our approach. We use each method to detect and rectify the document, which is then read by an OCR system. The OCR output is then evaluated using a novel OCR quality metric based on the Levenshtein distance. Since the end goal is to improve automatic information retrieval, we use the overall OCR quality as a performance metric. We observe that with a promising model, document rectification does not have to be perfect to attain state-of-the-art performance scores. We show that our model is smaller and more efficient than current state-of-the-art solutions while retaining a competitive OCR quality metric.
All code is available at https://github.com/BOVIFOCR/iwpod-doc-corners.git
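The abstract states that OCR output is scored with a metric based on the Levenshtein distance but does not give its definition. The sketch below shows one plausible form, purely as an assumption: the edit distance between the OCR text and a reference string, normalized by the longer length and turned into a similarity in [0, 1]. The names `levenshtein` and `ocr_quality` and the normalization are hypothetical and may differ from the authors' metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ocr_quality(ocr_text: str, reference: str) -> float:
    """Hypothetical score: 1 - distance / longer length (assumed normalization)."""
    longest = max(len(ocr_text), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein(ocr_text, reference) / longest


# One substituted character out of eight: "0" read instead of "O".
print(ocr_quality("JOHN D0E", "JOHN DOE"))  # prints 0.875
```

A score of this kind degrades gradually with recognition errors, which fits the abstract's observation that rectification need not be perfect for the downstream OCR result to remain competitive.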

[162] Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Mohsen Gholami,Ahmad Rezaei,Zhou Weimin,Yong Zhang,Mohammad Akbari

Main category: cs.CV

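TL;DR: Ego3D-Bench is a new benchmark for evaluating the spatial reasoning of vision-language models (VLMs) on ego-centric, multi-view outdoor data, and the paper also proposes the Ego3D-VLM framework to improve 3D spatial reasoning.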

Details Motivation: Current vision-language models (VLMs) are limited in their understanding of 3D spatial relationships, while real-world embodied agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. Method: The paper introduces Ego3D-Bench, a new benchmark with over 8,600 QA pairs, and Ego3D-VLM, a post-training framework that generates a cognitive map based on estimated global 3D coordinates. Result: Ego3D-VLM yields a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Conclusion: Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments. Abstract: Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable performance gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.

[163] AI-driven Remote Facial Skin Hydration and TEWL Assessment from Selfie Images: A Systematic Solution

Cecelia Soh,Rizhao Cai,Monalisha Paul,Dennis Sng,Alex Kot

Main category: cs.CV

TL;DR: This paper proposes a novel method to estimate skin hydration and trans-epidermal water loss from selfie facial images, making skin assessment more accessible to the general public.

Details Motivation: The motivation for this study is the limited accessibility of skin hydration and trans-epidermal water loss measurements to the general public, as these measurements typically require specialized instruments found in dermatology clinics. Method: The authors proposed a systematic solution that includes skin hydration and trans-epidermal water loss data collection, data preprocessing, and the development of a novel Skin-Prior Adaptive Vision Transformer model for regression. They also introduced a symmetric-based contrastive regularization to address annotation imbalance. Result: The result of this work is the development of a systematic solution that allows for the estimation of skin hydration and trans-epidermal water loss from selfie facial images, which can enable AI-driven accessible skin analysis for broader real-world applications. Conclusion: This paper concludes that their proposed method, the Skin-Prior Adaptive Vision Transformer model, can effectively estimate skin hydration and trans-epidermal water loss from selfie facial images, making skin assessment more accessible. Abstract: Skin health and disease resistance are closely linked to the skin barrier function, which protects against environmental factors and water loss. Two key physiological indicators can quantitatively represent this barrier function: skin hydration (SH) and trans-epidermal water loss (TEWL). Measurement of SH and TEWL is valuable for the public to monitor skin conditions regularly, diagnose dermatological issues, and personalize their skincare regimens. However, these measurements are not easily accessible to general users unless they visit a dermatology clinic with specialized instruments. To tackle this problem, we propose a systematic solution to estimate SH and TEWL from selfie facial images remotely with smartphones. 
Our solution encompasses multiple stages, including SH/TEWL data collection, data preprocessing, and formulating a novel Skin-Prior Adaptive Vision Transformer model for SH/TEWL regression. Through experiments, we identified the annotation imbalance of the SH/TEWL data and proposed a symmetric-based contrastive regularization to reduce the model bias due to the imbalance effectively. This work is the first study to explore skin assessment from selfie facial images without physical measurements. It bridges the gap between computer vision and skin care research, enabling AI-driven accessible skin analysis for broader real-world applications.

[164] Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding

Jiangnan Xie,Xiaolong Zheng,Liang Zheng

Main category: cs.CV

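TL;DR: This paper proposes PAML, a new multimodal learning framework that introduces a multi-neighbor semantic prototype mechanism and a multi-stage decoder, achieving strong visual grounding performance in open-vocabulary scenes.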

Details Motivation: To address the limitations of existing transformer-based visual grounding methods in open-vocabulary scenes, including imperfect cross-modal alignment, insufficient cross-modal feature fusion, and ineffective use of semantic prototype information. Method: The paper proposes Prototype-Aware Multimodal Learning (PAML), a framework that establishes cross-modal alignment with ALBEF, enhances object representations and suppresses irrelevant visual content with a Visual Discriminative Feature Encoder, discovers and inherits multi-neighbor semantic prototypes, and performs multimodal integration with a Multi-stage Decoder. Result: Extensive experiments on five benchmark datasets validate the effectiveness of the method. Conclusion: The experiments show that PAML is competitive in standard scenes and achieves state-of-the-art results in open-vocabulary scenes. Abstract: Visual Grounding (VG) aims to utilize given natural language queries to locate specific target objects within images. While current transformer-based approaches demonstrate strong localization performance in standard scenes (i.e., scenarios without any novel objects), they exhibit notable limitations in open-vocabulary scenes (i.e., scenarios with both familiar and novel object categories during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), an innovative framework that systematically addresses these issues through several key components: First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes. Our code is available at https://github.com/plankXie/PAML.

[165] Video-based Generalized Category Discovery via Memory-Guided Consistency-Aware Contrastive Learning

Zhang Jing,Pu Nan,Xie Yu Xiang,Guo Yanming,Lu Qianqi,Zou Shiwei,Yan Jie,Chen Yan

Main category: cs.CV

TL;DR: This paper introduces a novel Memory-guided Consistency-aware Contrastive Learning (MCCL) framework for Video-GCD, which effectively integrates multi-perspective information across time through a consistency-guided voting mechanism and a dual-level memory buffer.

Details Motivation: The motivation of the paper is to address the limitations of existing Generalized Category Discovery (GCD) methods that focus on static images and to extend the GCD problem to the video domain, termed Video-GCD, as relying solely on static visual content is insufficient to reliably discover novel categories. Method: The paper proposes a novel Memory-guided Consistency-aware Contrastive Learning (MCCL) framework with two core components: Consistency-Aware Contrastive Learning (CACL) and Memory-Guided Representation Enhancement (MGRE). Result: Extensive experiments demonstrate that the proposed method significantly outperforms competitive GCD approaches adapted from image-based settings, highlighting the importance of temporal information for discovering novel categories in videos. Conclusion: The paper concludes that the proposed MCCL framework significantly improves the performance of Video-GCD by effectively integrating multi-perspective temporal features and forming a mutually reinforcing feedback loop between representation learning and consistency modeling. Abstract: Generalized Category Discovery (GCD) is an emerging and challenging open-world problem that has garnered increasing attention in recent years. Most existing GCD methods focus on discovering categories in static images. However, relying solely on static visual content is often insufficient to reliably discover novel categories. To bridge this gap, we extend the GCD problem to the video domain and introduce a new setting, termed Video-GCD. Thus, effectively integrating multi-perspective information across time is crucial for accurate Video-GCD. To tackle this challenge, we propose a novel Memory-guided Consistency-aware Contrastive Learning (MCCL) framework, which explicitly captures temporal-spatial cues and incorporates them into contrastive learning through a consistency-guided voting mechanism. 
MCCL consists of two core components: Consistency-Aware Contrastive Learning (CACL) and Memory-Guided Representation Enhancement (MGRE). CACL exploits multi-perspective temporal features to estimate consistency scores between unlabeled instances, which are then used to weight the contrastive loss accordingly. MGRE introduces a dual-level memory buffer that maintains both feature-level and logit-level representations, providing global context to enhance intra-class compactness and inter-class separability. This in turn refines the consistency estimation in CACL, forming a mutually reinforcing feedback loop between representation learning and consistency modeling. To facilitate a comprehensive evaluation, we construct a new and challenging Video-GCD benchmark, which includes action recognition and bird classification video datasets. Extensive experiments demonstrate that our method significantly outperforms competitive GCD approaches adapted from image-based settings, highlighting the importance of temporal information for discovering novel categories in videos. The code will be publicly available.

[166] Text4Seg++: Advancing Image Segmentation via Generative Language Modeling

Mengcheng Lan,Chaofeng Chen,Jiaxing Xu,Zongrui Li,Yiping Ke,Xudong Jiang,Yingchen Yu,Yunqing Zhao,Song Bai

Main category: cs.CV

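TL;DR: This work proposes a text-based approach to image segmentation that recasts segmentation as a text generation problem, achieving efficient and accurate segmentation and outperforming existing methods on multiple datasets.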

Details Motivation: Multimodal large language models (MLLMs) perform well on vision-language tasks, but effectively integrating image segmentation into these models remains a significant challenge. Method: The paper introduces semantic descriptors and the Row-wise Run-Length Encoding (R-RLE) compression technique, and proposes the Text4Seg and Text4Seg++ frameworks. Result: Text4Seg++ outperforms existing state-of-the-art models on multiple natural and remote sensing datasets without task-specific fine-tuning. Conclusion: Text4Seg++ offers a text-based approach to image segmentation that, by recasting segmentation as text generation, achieves both high accuracy and high efficiency. Abstract: Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localize regions of interest using bounding boxes and represent region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency.
Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
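To illustrate the idea behind Row-wise Run-Length Encoding, the sketch below compresses a patch-level label grid by collapsing repeated labels within each row while keeping rows separate, so the 2D layout survives. The textual format used here (`label *count`, runs joined by ` | `, rows joined by newlines) is an assumption for illustration and is not confirmed by the abstract. The helper names are hypothetical, and rows are assumed to be non-empty.

```python
from itertools import groupby


def rrle_encode(grid):
    """Encode each row as 'label *count' runs; runs never cross row boundaries."""
    rows = []
    for row in grid:
        runs = [f"{label} *{len(list(group))}" for label, group in groupby(row)]
        rows.append(" | ".join(runs))
    return "\n".join(rows)


def rrle_decode(text):
    """Invert rrle_encode, rebuilding the per-patch label grid."""
    grid = []
    for line in text.split("\n"):
        row = []
        for run in line.split(" | "):
            label, count = run.rsplit(" *", 1)
            row.extend([label] * int(count))
        grid.append(row)
    return grid


# Toy 2x4 patch grid (illustrative labels, not data from the paper).
grid = [
    ["sky", "sky", "sky", "sky"],
    ["sky", "tree", "tree", "sky"],
]
encoded = rrle_encode(grid)
print(encoded)
assert rrle_decode(encoded) == grid  # the encoding is lossless
```

Encoding per row rather than over the flattened sequence costs a little compression on large uniform regions but keeps every row boundary explicit, which is one plausible reason to prefer the row-wise form for a model that must reconstruct a 2D mask from text.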

[167] Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap

Ruiming Du,Guangxun Zhai,Tian Qiu,Yu Jiang

Main category: cs.CV

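TL;DR: This review surveys the challenges of 3D plant segmentation for phenotyping, introduces PSS, an open-source framework for benchmarking, and evaluates deep learning methods and sim-to-real learning strategies.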

Details Motivation: 3D segmentation has potential for plant phenotyping, but its adoption is limited by the scarcity of large-scale annotated datasets, technical difficulties in adapting advanced deep neural networks to plant point clouds, and the lack of standardized benchmarks and evaluation protocols tailored to plant science. Method: The study systematically summarizes deep learning-based methods for point cloud semantic and instance segmentation, introduces Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and conducts extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Result: The findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, and emphasize the complementary role of modeling-based and augmentation-based synthetic data generation in sim-to-real learning for reducing annotation demands. Conclusion: The study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions for 3D plant phenotyping. Data and code are available on GitHub. Abstract: The precise characterization of plant morphology provides valuable insights into plant-environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, while also emphasizing the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning in reducing annotation demands.
In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at https://github.com/perrydoremi/PlantSegStudio.

[168] Quantitative Currency Evaluation in Low-Resource Settings through Pattern Analysis to Assist Visually Impaired Users

Md Sultanul Islam Ovi,Mainul Hossain,Md Badsha Biswas

Main category: cs.CV

TL;DR: This paper proposes a unified framework for currency evaluation that integrates denomination classification, damage quantification, and counterfeit detection, specifically designed for usability and authenticity assessment in low-resource environments.

Details Motivation: Existing currency recognition systems often neglect usability and authenticity assessment, especially in low-resource environments with visually impaired users and offline validation needs. Current methods focus on denomination classification while ignoring physical degradation and forgery, limiting their real-world applicability. Method: The paper introduces a unified framework with three modules: lightweight CNN models for denomination classification, a Unified Currency Damage Index (UCDI) for damage quantification, and feature-based template matching for counterfeit detection. The dataset includes over 82,000 annotated images of clean, damaged, and counterfeit notes. Result: The Custom_CNN model achieves strong performance with low parameter count. The UCDI metric offers a continuous usability score based on multiple factors. The counterfeit detection module reliably identifies forged notes under various imaging conditions, supporting real-time, on-device inference. Conclusion: The proposed framework provides accurate, interpretable, and compact solutions for inclusive currency evaluation in practical and resource-constrained environments. Abstract: Currency recognition systems often overlook usability and authenticity assessment, especially in low-resource environments where visually impaired users and offline validation are common. While existing methods focus on denomination classification, they typically ignore physical degradation and forgery, limiting their applicability in real-world conditions. This paper presents a unified framework for currency evaluation that integrates three modules: denomination classification using lightweight CNN models, damage quantification through a novel Unified Currency Damage Index (UCDI), and counterfeit detection using feature-based template matching. The dataset consists of over 82,000 annotated images spanning clean, damaged, and counterfeit notes. 
Our Custom_CNN model achieves high classification performance with low parameter count. The UCDI metric provides a continuous usability score based on binary mask loss, chromatic distortion, and structural feature loss. The counterfeit detection module demonstrates reliable identification of forged notes across varied imaging conditions. The framework supports real-time, on-device inference and addresses key deployment challenges in constrained environments. Results show that accurate, interpretable, and compact solutions can support inclusive currency evaluation in practical settings.

[169] Multi-Modal Camera-Based Detection of Vulnerable Road Users

Penelope Brown,Julie Stephany Berrio Perez,Mao Shan,Stewart Worrall

Main category: cs.CV

TL;DR: This paper presents a new approach to improving the detection of vulnerable road users that performs well in low light and adverse weather.

Details Motivation: Vulnerable road users (such as pedestrians, cyclists, and motorcyclists) account for more than half of global traffic deaths, yet detecting them remains challenging under low light, adverse weather, and imbalanced datasets. Method: The paper proposes a multimodal detection framework that combines RGB and thermal infrared imaging with a fine-tuned YOLOv8 model. Result: Experiments show that 640-pixel resolution and partial backbone freezing optimize accuracy and efficiency, while class-weighted losses enhance recall for rare vulnerable road users. The results also show that thermal models achieve the highest precision and that RGB-to-thermal augmentation boosts recall. Conclusion: The paper concludes that the multimodal detection framework has the potential to improve the safety of vulnerable road users at intersections. Abstract: Vulnerable road users (VRUs) such as pedestrians, cyclists, and motorcyclists represent more than half of global traffic deaths, yet their detection remains challenging in poor lighting, adverse weather, and imbalanced datasets. This paper presents a multimodal detection framework that integrates RGB and thermal infrared imaging with a fine-tuned YOLOv8 model. Training leveraged KITTI, BDD100K, and Teledyne FLIR datasets, with class re-weighting and light augmentations to improve minority-class performance and robustness. Experiments show that 640-pixel resolution and partial backbone freezing optimise accuracy and efficiency, while class-weighted losses enhance recall for rare VRUs. Results highlight that thermal models achieve the highest precision, and RGB-to-thermal augmentation boosts recall, demonstrating the potential of multimodal detection to improve VRU safety at intersections.
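The abstract mentions class re-weighting and class-weighted losses without specifying the weighting scheme. As a sketch of one common choice, and only as an assumption about what such a scheme might look like, the code below computes inverse-frequency weights of the form total / (num_classes * class_count), so that rare classes contribute more per example. The function name and the label counts are hypothetical and are not drawn from KITTI, BDD100K, or the FLIR data.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


# Made-up label list: pedestrians are common, cyclists are rare.
labels = ["pedestrian"] * 8 + ["cyclist"] * 2
weights = inverse_frequency_weights(labels)
print(weights)  # {'pedestrian': 0.625, 'cyclist': 2.5}
```

With weights like these, each misclassified cyclist would count four times as much as a misclassified pedestrian in a weighted loss, which is the general mechanism by which re-weighting can raise minority-class recall.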

[170] Harnessing Object Grounding for Time-Sensitive Video Understanding

Tz-Ying Wu,Sharath Nittur Sridhar,Subarna Tripathi

Main category: cs.CV

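TL;DR: This paper proposes GO-Tokenizer, a lightweight module that improves video large language models on time-sensitive video understanding tasks by encoding grounded object information, and validates its effectiveness and generality.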

Details Motivation: Time-sensitive video understanding tasks can benefit from grounded object information within frames, but adding textual descriptions of object annotations directly to the prompt introduces extra token length and sensitivity to noise. Method: The paper proposes GO-Tokenizer, a lightweight module that uses off-the-shelf object detectors to encode compact object information on the fly, and validates it through pretraining. Result: Experiments show that models pretrained with GO-Tokenizer outperform the vanilla Video-LLM and its variant that uses textual object descriptions in the prompt, and that they perform well across multiple models, datasets, and tasks. Conclusion: GO-Tokenizer is a lightweight module that effectively improves Video-LLM performance on time-sensitive video understanding tasks, and its benefit generalizes across models, datasets, and video understanding tasks. Abstract: We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also introduces extra token length and susceptibility to the noise in object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs leveraging off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart utilizing textual descriptions of objects in the prompt. The gain generalizes across different models, datasets and video understanding tasks such as reasoning temporal localization and dense captioning.

[171] Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing

Jeongmin Yu,Susang Kim,Kisu Lee,Taekyoung Kwon,Won-Yong Shin,Ha Young Kim

Main category: cs.CV

TL;DR: This paper proposes MVP-FAS, which improves cross-domain face anti-spoofing performance through multi-view text prompts.

Details Motivation: Existing CLIP-based face anti-spoofing models do not fully exploit CLIP's patch embedding tokens and rely on a single text prompt per class, which limits generalization. Method: The MVP-FAS framework combines Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA) modules, using multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. Result: On cross-domain datasets, MVP-FAS generalizes better than previous state-of-the-art methods. Conclusion: By leveraging the Multi-View Slot attention and Multi-Text Patch Alignment modules, MVP-FAS significantly improves generalization in cross-domain face anti-spoofing. Abstract: Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.

[172] A Multi-Modal Deep Learning Framework for Colorectal Pathology Diagnosis: Integrating Histological and Colonoscopy Data in a Pilot Study

Krithik Ramesh,Ritvik Koneru

Main category: cs.CV

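TL;DR: This study proposes a unified deep learning approach that uses a ResNet-50 network to process both colonoscopy images and histology images, with the aim of improving the efficiency and accuracy of colorectal disease diagnosis.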

Details Motivation: Traditional diagnostic pipelines require extensive preparation and rely on separate evaluations of histological images and colonoscopy footage, which can introduce variability and inefficiency. To address this, the study proposes a unified deep learning network. Method: The study uses the ResNet-50 convolutional neural network architecture to classify static colon histology images from the PathMNIST dataset and lower gastrointestinal (colonoscopy) video frames from the HyperKvasir dataset, and incorporates class-balancing learning, robust augmentation, and calibration methods to ensure accurate results. Result: This pilot study presents a unified deep learning network that can classify both histopathological slides and colonoscopy video frames in one pipeline. Conclusion: The study demonstrates an interpretable and reproducible diagnostic pipeline that advances and eases colorectal disease detection by unifying multiple diagnostic modalities. Abstract: Colorectal diseases, including inflammatory conditions and neoplasms, require quick, accurate care to be effectively treated. Traditional diagnostic pipelines require extensive preparation and rely on separate, individual evaluations of histological images and colonoscopy footage, introducing possible variability and inefficiencies. This pilot study proposes a unified deep learning network that uses convolutional neural networks (CNNs) to classify both histopathological slides and colonoscopy video frames in one pipeline. The pipeline integrates class-balancing learning, robust augmentation, and calibration methods to ensure accurate results. Static colon histology images were taken from the PathMNIST dataset, and the lower gastrointestinal (colonoscopy) videos were drawn from the HyperKvasir dataset. The CNN architecture used was ResNet-50. This study demonstrates an interpretable and reproducible diagnostic pipeline that unifies multiple diagnostic modalities to advance and ease the detection of colorectal diseases.

[173] MRD-LiNet: A Novel Lightweight Hybrid CNN with Gradient-Guided Unlearning for Improved Drought Stress Identification

Aswini Kumar Patra,Lingaraj Sahoo

Main category: cs.CV

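TL;DR: This paper proposes a lightweight hybrid CNN framework combined with a machine unlearning mechanism for efficient drought stress monitoring under resource-constrained conditions.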

Details Motivation: Traditional drought stress detection methods are time-consuming and labor-intensive, while existing deep learning models typically rely on a large number of trainable parameters, which restricts their use in resource-limited and real-time agricultural settings. Method: The paper proposes a novel lightweight hybrid CNN framework and introduces a machine unlearning mechanism based on a gradient norm-based influence function, which removes the influence of specific training data and thereby improves model adaptability. Result: The framework uses 15 times fewer trainable parameters than conventional CNN and Vision Transformer models while maintaining high accuracy. Evaluated on an expert-annotated aerial image dataset of potato fields, it shows high accuracy at substantially lower computational cost. Conclusion: The proposed lightweight hybrid CNN framework draws on ResNet, DenseNet, and MobileNet and adds a machine unlearning mechanism based on a gradient norm-based influence function, enabling efficient drought stress monitoring under resource-constrained conditions. Abstract: Drought stress is a major threat to global crop productivity, making its early and precise detection essential for sustainable agricultural management. Traditional approaches, though useful, are often time-consuming and labor-intensive, which has motivated the adoption of deep learning methods. In recent years, Convolutional Neural Network (CNN) and Vision Transformer architectures have been widely explored for drought stress identification; however, these models generally rely on a large number of trainable parameters, restricting their use in resource-limited and real-time agricultural settings. To address this challenge, we propose a novel lightweight hybrid CNN framework inspired by ResNet, DenseNet, and MobileNet architectures. The framework achieves a remarkable 15-fold reduction in trainable parameters compared to conventional CNN and Vision Transformer models, while maintaining competitive accuracy. In addition, we introduce a machine unlearning mechanism based on a gradient norm-based influence function, which enables targeted removal of specific training data influence, thereby improving model adaptability. The method was evaluated on an aerial image dataset of potato fields with expert-annotated healthy and drought-stressed regions. Experimental results show that our framework achieves high accuracy while substantially lowering computational costs. These findings highlight its potential as a practical, scalable, and adaptive solution for drought stress monitoring in precision agriculture, particularly under resource-constrained conditions.

[174] Your Super Resolution Model is not Enough for Tackling Real-World Scenarios

Dongsik Yoon,Jongeun Kim

Main category: cs.CV

TL;DR: The paper proposes SAAM, a lightweight module that enhances existing SR models to handle arbitrary-scale super-resolution with minimal overhead and improved detail sharpness.

Details Motivation: Traditional SISR models struggle to generalize across varying scale factors, limiting their real-world applicability. The study aims to address this by proposing SAAM. Method: SAAM uses lightweight, scale-adaptive feature extraction and upsampling, incorporating SimAM for efficient guidance and gradient variance loss to enhance image detail sharpness. Result: The method integrates seamlessly into multiple state-of-the-art SR backbones, delivering competitive or superior performance across a wide range of integer and non-integer scale factors. Conclusion: SAAM is a practical solution for real-world scenarios, enabling robust multi-scale upscaling with minimal computational overhead. Abstract: Despite remarkable progress in Single Image Super-Resolution (SISR), traditional models often struggle to generalize across varying scale factors, limiting their real-world applicability. To address this, we propose a plug-in Scale-Aware Attention Module (SAAM) designed to retrofit modern fixed-scale SR models with the ability to perform arbitrary-scale SR. SAAM employs lightweight, scale-adaptive feature extraction and upsampling, incorporating the Simple parameter-free Attention Module (SimAM) for efficient guidance and gradient variance loss to enhance sharpness in image details. Our method integrates seamlessly into multiple state-of-the-art SR backbones (e.g., SCNet, HiT-SR, OverNet), delivering competitive or superior performance across a wide range of integer and non-integer scale factors. Extensive experiments on benchmark datasets demonstrate that our approach enables robust multi-scale upscaling with minimal computational overhead, offering a practical solution for real-world scenarios.

[175] AI-based response assessment and prediction in longitudinal imaging for brain metastases treated with stereotactic radiosurgery

Lorenz Achim Kuhn,Daniel Abler,Jonas Richiardi,Andreas F. Hottinger,Luis Schiappacasse,Vincent Dunet,Adrien Depeursinge,Vincent Andrearczyk

Main category: cs.CV

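TL;DR: This study develops an automated pipeline to curate longitudinal data on brain metastases treated with radiosurgery and uses machine learning models to predict treatment response, aiming to improve assessment precision and support personalized treatment decisions.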

Details Motivation: Brain metastases are a major cause of death in cancer patients and require regular MRI monitoring after radiosurgery. Because analyzing longitudinal imaging is a huge workload for clinicians, the study aims to develop automated methods that improve the precision of assessment and prediction. Method: The study uses data-driven clustering to identify BM growth trajectories and applies classical machine learning and graph machine learning (GML) to predict treatment response. Result: The study builds a longitudinal dataset of 896 BM lesions in 177 patients and identifies 5 dominant growth trajectories. Treatment response prediction with gradient boosting and GML reaches AUCs of 0.90 and 0.88, respectively. Conclusion: The results indicate that the automated pipeline can be used for large-scale data curation and provides a basis for studying brain metastasis (BM) growth patterns, with the ultimate goal of optimizing clinical decision support systems for personalized treatment. Abstract: Brain Metastases (BM) are a large contributor to mortality of patients with cancer. They are treated with Stereotactic Radiosurgery (SRS) and monitored with Magnetic Resonance Imaging (MRI) at regular follow-up intervals according to treatment guidelines. Analyzing and quantifying this longitudinal imaging represents an intractable workload for clinicians. As a result, follow-up images are not annotated and merely assessed by observation. Response to treatment in longitudinal imaging is being studied, to better understand growth trajectories and ultimately predict treatment success or toxicity as early as possible. In this study, we implement an automated pipeline to curate a large longitudinal dataset of SRS treatment data, resulting in a cohort of 896 BMs in 177 patients who were monitored for >360 days at approximately two-month intervals at Lausanne University Hospital (CHUV). We use a data-driven clustering to identify characteristic trajectories. In addition, we predict 12-month lesion-level response using classical machine learning as well as Graph Machine Learning (GML). Clustering revealed 5 dominant growth trajectories with distinct final response categories. Response prediction reaches up to 0.90 AUC (CI95%=0.88-0.92) using only pre-treatment and first follow-up MRI with gradient boosting. Similarly, robust predictive performance of up to 0.88 AUC (CI95%=0.86-0.90) was obtained using GML, offering more flexibility with a single model for multiple input time-points configurations.
Our results suggest potential automation and increased precision for the comprehensive assessment and prediction of BM response to SRS in longitudinal MRI. The proposed pipeline facilitates scalable data curation for the investigation of BM growth patterns, and lays the foundation for clinical decision support systems aiming at optimizing personalized care.

[176] 3DOF+Quantization: 3DGS quantization for large scenes with limited Degrees of Freedom

Matthieu Gendrin,Stéphane Pateux,Théo Ladune

Main category: cs.CV

TL;DR: This paper proposes a new spherical coordinate-based quantization scheme to enhance the rate-distortion performance in 3D Gaussian Splatting for large scenes with limited camera position freedom.

Details Motivation: The motivation is to improve the reconstruction quality of novel views in large scenes where input views are acquired from a limited spatial zone, focusing on the problem of coordinate quantization and its effects on projection accuracy. Method: The paper studies the impact of position error on projection error in 3D Gaussian Splatting and proposes a new quantization scheme based on spherical coordinates to address the issue. Result: The study finds that projection error is proportional to the squared inverse distance of the point being projected, and the proposed quantization method demonstrates improved rate-distortion performance on the Garden scene. Conclusion: The paper concludes that the proposed spherical coordinate-based quantization scheme improves rate-distortion performance in 3D scene reconstruction, particularly for large scenes with limited camera position freedom. Abstract: 3D Gaussian Splatting (3DGS) is a major breakthrough in 3D scene reconstruction. With a number of views of a given object or scene, the algorithm trains a model composed of 3D gaussians, which enables the production of novel views from arbitrary points of view. This freedom of movement is referred to as 6DoF for 6 degrees of freedom: a view is produced for any position (3 degrees), orientation of camera (3 other degrees). On large scenes, though, the input views are acquired from a limited zone in space, and the reconstruction is valuable for novel views from the same zone, even if the scene itself is almost unlimited in size. We refer to this particular case as 3DoF+, meaning that the 3 degrees of freedom of camera position are limited to small offsets around the central position. Considering the problem of coordinate quantization, the impact of position error on the projection error in pixels is studied. It is shown that the projection error is proportional to the squared inverse distance of the point being projected. 
Consequently, a new quantization scheme based on spherical coordinates is proposed. The rate-distortion performance of the proposed method is illustrated on the well-known Garden scene.
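To make the idea of spherical-coordinate quantization concrete, here is a hedged sketch. The abstract does not give the actual scheme, so everything below is an assumption: angles are quantized uniformly, and the radius is quantized uniformly in inverse distance (1/r), which is one plausible reading of the stated result that projection error scales with the squared inverse distance. The names `quantize_points`, `r_min`, `r_max`, and the bit depths are hypothetical.

```python
import numpy as np

def to_spherical(p):
    """Cartesian (..., 3) -> radius, polar angle, azimuth. Points must not sit at the origin."""
    x, y, z = p[..., 0], p[..., 1], p[..., 2]
    r = np.sqrt(x * x + y * y + z * z)
    theta = np.arccos(np.clip(z / r, -1.0, 1.0))
    phi = np.arctan2(y, x)
    return r, theta, phi

def from_spherical(r, theta, phi):
    s = np.sin(theta)
    return np.stack([r * s * np.cos(phi), r * s * np.sin(phi), r * np.cos(theta)], axis=-1)

def quantize_uniform(v, lo, hi, bits):
    """Uniform scalar quantizer on [lo, hi] with 2**bits levels; returns dequantized values."""
    levels = 2 ** bits - 1
    q = np.round((np.clip(v, lo, hi) - lo) / (hi - lo) * levels)
    return lo + q / levels * (hi - lo)

def quantize_points(points, r_min=0.5, r_max=1000.0, bits_angle=12, bits_radius=10):
    """Assumed scheme: uniform steps in angle and in 1/r, centred on the camera zone."""
    r, theta, phi = to_spherical(points)
    inv_r = quantize_uniform(1.0 / r, 1.0 / r_max, 1.0 / r_min, bits_radius)
    theta_q = quantize_uniform(theta, 0.0, np.pi, bits_angle)
    phi_q = quantize_uniform(phi, -np.pi, np.pi, bits_angle)
    return from_spherical(1.0 / inv_r, theta_q, phi_q)
```

Under this assumption, distant points receive coarse radial steps while keeping their viewing direction accurate, which matches the 3DoF+ setting where the camera only moves by small offsets around a central position.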

[177] VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results

Yixiao Li,Xin Li,Chris Wei Zhou,Shuo Xing,Hadi Amirpour,Xiaoshuai Hao,Guanghui Yue,Baoquan Zhao,Weide Liu,Xiaoyuan Yang,Zhengzhong Tu,Xinyu Li,Chuanbiao Song,Chenqi Zhang,Jun Lan,Huijia Zhu,Weiqiang Wang,Xiaoyan Sun,Shishun Tian,Dongyang Yan,Weixia Zhang,Junlin Chen,Wei Sun,Zhihua Wang,Zhuohang Shi,Zhizun Luo,Hang Ouyang,Tianxin Xiao,Fan Yang,Zhaowang Wu,Kaixin Deng

Main category: cs.CV

TL;DR: The ISRGC-Q Challenge introduces a new dataset and competition focused on evaluating the perceptual quality of super-resolution images from modern generative models like GANs and diffusion models, resulting in state-of-the-art solutions.

Details Motivation: Existing SR-IQA datasets do not adequately address the unique artifacts and perceptual quality challenges introduced by recent generative super-resolution techniques such as GANs and diffusion models. Method: Development of the ISRGen-QA dataset and organization of the ISRGC-Q Challenge as part of the VQualA Competition at ICCV 2025 Workshops, focusing on artifacts and quality assessment in SR images from generative models like GANs and diffusion models. Result: 108 participants registered, 4 teams submitted valid solutions achieving SOTA performance on the ISRGen-QA dataset, demonstrating the challenge's success and impact. Conclusion: The ISRGC-Q Challenge provides a new benchmark for evaluating the perceptual quality of super-resolution images generated by modern generative methods, offering a valuable resource for advancing SR-IQA research. Abstract: This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: https://github.com/Lighting-YXLI/ISRGen-QA.

[178] Phantom-Insight: Adaptive Multi-cue Fusion for Video Camouflaged Object Detection with Multimodal LLM

Hua Zhang,Changjiang Luo,Ruoyu Chen

Main category: cs.CV

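TL;DR: This paper proposes Phantom-Insight, which fuses SAM and an MLLM and introduces feature fusion, a dynamic foreground visual token scoring module, a prompt network, and a decoupled foreground-background learning strategy, effectively improving video camouflaged object detection and performing strongly on multiple datasets.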

Details Motivation: To address the difficulty existing methods have in separating camouflaged object edges and distinguishing foreground from background in dynamic environments. Method: Combines SAM and an MLLM, improving detection through feature fusion, a dynamic foreground visual token scoring module, a prompt network, and a decoupled foreground-background learning strategy. Result: Performs strongly on both the MoCA-Mask and CAD2016 datasets, with high accuracy and generalization ability. Conclusion: Phantom-Insight achieves SOTA performance and shows strong generalization ability. Abstract: Video camouflaged object detection (VCOD) is challenging due to dynamic environments. Existing methods face two main issues: (1) SAM-based methods struggle to separate camouflaged object edges due to model freezing, and (2) MLLM-based methods suffer from poor object separability as large language models merge foreground and background. To address these issues, we propose a novel VCOD method based on SAM and MLLM, called Phantom-Insight. To enhance the separability of object edge details, we represent video sequences with temporal and spatial clues and perform feature fusion via LLM to increase information density. Next, multiple cues are generated through the dynamic foreground visual token scoring module and the prompt network to adaptively guide and fine-tune the SAM model, enabling it to adapt to subtle textures. To enhance the separability of objects and background, we propose a decoupled foreground-background learning strategy. By generating foreground and background cues separately and performing decoupled training, the visual token can effectively integrate foreground and background information independently, enabling SAM to more accurately segment camouflaged objects in the video. Experiments on the MoCA-Mask dataset show that Phantom-Insight achieves state-of-the-art performance across various metrics. Additionally, its ability to detect unseen camouflaged objects on the CAD2016 dataset highlights its strong generalization ability.

[179] When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection

Rabin Dulal,Lihong Zheng,Muhammad Ashad Kabir

Main category: cs.CV

TL;DR: This study proposes an annotation-free, zero-shot framework for cattle muzzle detection using Grounding DINO, achieving a mAP@0.5 of 76.8% and offering a scalable, flexible, and practical alternative to supervised methods in livestock monitoring.

Details Motivation: The motivation of the study is to overcome the limitations of existing supervised methods for cattle muzzle detection, which require extensive annotated datasets and are often data-dependent, limiting their performance on new or unseen cattle. Method: The study proposes a zero-shot muzzle detection framework using Grounding DINO, a vision-language model, which leverages natural language prompts to guide detection without requiring any task-specific training or annotated data. Result: The proposed model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. It provides a scalable and flexible solution for muzzle localization across diverse cattle breeds and environments. Conclusion: The study concludes that the proposed zero-shot muzzle detection framework based on Grounding DINO offers a practical, annotation-free solution for cattle muzzle detection, providing improved adaptability and ease of deployment in livestock monitoring applications. Abstract: Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to depend heavily on their training data, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. 
This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.
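For readers unfamiliar with the mAP@0.5 figure quoted above, the "0.5" refers to the intersection-over-union (IoU) threshold at which a predicted box counts as a correct detection. The sketch below shows only that matching criterion, with boxes assumed to be in `(x1, y1, x2, y2)` pixel format; it is not the paper's evaluation code, and the helper names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thr=0.5):
    """A prediction matches a ground-truth box when IoU reaches the threshold."""
    return iou(pred, gt) >= thr
```

Average precision is then the area under the precision-recall curve built from these matches, ranked by detection confidence.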

[180] Perception-oriented Bidirectional Attention Network for Image Super-resolution Quality Assessment

Yixiao Li,Xiaoyuan Yang,Guanghui Yue,Jun Fu,Qiuping Jiang,Xu Jia,Paul L. Rosin,Hantao Liu,Wei Zhou

Main category: cs.CV

TL;DR: This paper proposes a novel Perception-oriented Bidirectional Attention Network (PBAN) for image super-resolution full-reference image quality assessment, which outperforms existing methods.

Details Motivation: The authors aim to address the limited availability of full-reference image quality assessment metrics for comparing and evaluating different super-resolution algorithms. Method: The paper introduces the Perception-oriented Bidirectional Attention Network (PBAN) for image super-resolution full-reference image quality assessment (SR FR-IQA). The PBAN consists of three modules: an image encoder module, a perception-oriented bidirectional attention (PBA) module, and a quality prediction module. The PBA module incorporates Bidirectional Attention to model visual attention to distortion, Grouped Multi-scale Deformable Convolution to adaptively perceive distortion, and Sub-information Excitation Convolution to direct visual perception to sub-pixel and sub-channel attention. Result: Extensive experiments demonstrate that the proposed PBAN outperforms state-of-the-art quality assessment methods. Conclusion: The proposed PBAN method outperforms state-of-the-art quality assessment methods in extensive experiments. Abstract: Many super-resolution (SR) algorithms have been proposed to increase image resolution. However, full-reference (FR) image quality assessment (IQA) metrics for comparing and evaluating different SR algorithms are limited. In this work, we propose the Perception-oriented Bidirectional Attention Network (PBAN) for image SR FR-IQA, which is composed of three modules: an image encoder module, a perception-oriented bidirectional attention (PBA) module, and a quality prediction module. First, we encode the input images for feature representations. Inspired by the characteristics of the human visual system, we then construct the perception-oriented PBA module. Specifically, different from existing attention-based SR IQA methods, we conceive a Bidirectional Attention to bidirectionally construct visual attention to distortion, which is consistent with the generation and evaluation processes of SR images. 
To further guide the quality assessment towards the perception of distorted information, we propose Grouped Multi-scale Deformable Convolution, enabling the proposed method to adaptively perceive distortion. Moreover, we design Sub-information Excitation Convolution to direct visual perception to both sub-pixel and sub-channel attention. Finally, the quality prediction module is exploited to integrate quality-aware features and regress quality scores. Extensive experiments demonstrate that our proposed PBAN outperforms state-of-the-art quality assessment methods.

[181] Cross3DReg: Towards a Large-scale Real-world Cross-source Point Cloud Registration Benchmark

Zongyi Xu,Zhongpeng Lang,Yilong Chen,Shanshan Zhao,Xiaoshui Huang,Yifan Zuo,Yan Zhang,Qianni Zhang,Xinbo Gao

Main category: cs.CV

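TL;DR: This paper presents a new approach to cross-source point cloud registration, comprising the large-scale Cross3DReg dataset and a registration framework based on overlap-region prediction and a visual-geometric attention mechanism, achieving accurate and robust registration.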

Details Motivation: Cross-source point cloud registration is challenged by the lack of large-scale real-world datasets and by the inherent differences between point clouds from different sensors, which make feature extraction and matching difficult. Method: The authors construct the Cross3DReg dataset, design a framework that uses unaligned images to predict the overlapping region, and propose a visual-geometric attention guided matching module that fuses image and geometric information to establish reliable correspondences. Result: Experiments show that the method reduces relative rotation error (RRE) and relative translation error (RTE) by 63.2% and 40.2%, respectively, and improves registration recall (RR) by 5.4%, reaching state-of-the-art registration performance. Conclusion: The study effectively addresses key problems in cross-source point cloud registration and provides an important foundation for future research. Abstract: Cross-source point cloud registration, which aims to align point cloud data from different sensors, is a fundamental task in 3D vision. However, compared to the same-source point cloud registration, cross-source registration faces two core challenges: the lack of publicly available large-scale real-world datasets for training the deep registration models, and the inherent differences in point clouds captured by multiple sensors. The diverse patterns induced by the sensors pose great challenges in robust and accurate point cloud feature extraction and matching, which negatively influence the registration accuracy. To advance research in this field, we construct Cross3DReg, currently the largest real-world multi-modal cross-source point cloud registration dataset, which is collected by a rotating mechanical lidar and a hybrid semi-solid-state lidar, respectively. Moreover, we design an overlap-based cross-source registration framework, which utilizes unaligned images to predict the overlapping region between source and target point clouds, effectively filtering out redundant points in the irrelevant regions and significantly mitigating the interference caused by noise in non-overlapping areas. Then, a visual-geometric attention guided matching module is proposed to enhance the consistency of cross-source point cloud features by fusing image and geometric information to establish reliable correspondences and ultimately achieve accurate and robust registration. Extensive experiments show that our method achieves state-of-the-art registration performance. 
Our framework reduces the relative rotation error (RRE) and relative translation error (RTE) by $63.2\%$ and $40.2\%$, respectively, and improves the registration recall (RR) by $5.4\%$, which validates its effectiveness in achieving accurate cross-source registration.
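The three metrics quoted above have widely used definitions that can be sketched briefly. The code assumes rotations are 3x3 matrices and translations are 3-vectors; the recall thresholds are required arguments because the abstract does not state the values used on Cross3DReg, and some benchmarks define recall through an RMSE criterion instead.

```python
import numpy as np

def rre_deg(R_est, R_gt):
    """Relative rotation error: geodesic angle between two rotations, in degrees."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def rte(t_est, t_gt):
    """Relative translation error: Euclidean distance between translations."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def registration_recall(rres, rtes, rre_thr, rte_thr):
    """Fraction of pairs whose RRE and RTE both fall below the chosen thresholds."""
    rres, rtes = np.asarray(rres), np.asarray(rtes)
    return float(np.mean((rres < rre_thr) & (rtes < rte_thr)))

def rot_z(deg):
    """Helper for the example: rotation about the z axis."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])
```

For instance, `rre_deg(rot_z(10), np.eye(3))` evaluates to 10 degrees up to floating-point rounding, and a reported 63.2% reduction in RRE means the mean of this angle dropped by that fraction relative to the compared baseline.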

[182] IGAff: Benchmarking Adversarial Iterative and Genetic Affine Algorithms on Deep Neural Networks

Sebastian-Vasile Echim,Andrei-Alexandru Preda,Dumitru-Clementin Cercel,Florin Pop

Main category: cs.CV

TL;DR: This paper introduces two new black-box adversarial attack algorithms, ATA and AGA, which effectively uncover weaknesses in deep learning models like ResNet-18 and Vision Transformer. The algorithms outperform existing methods and provide insights into adversarial robustness.

Details Motivation: The motivation stems from the fact that deep neural networks, despite their dominance in AI, are hard to understand and exhibit surprising weaknesses. Adversarial attacks are used to uncover these weaknesses, particularly in challenging black-box scenarios where model details are not accessible. Method: The paper benchmarks two novel black-box iterative adversarial algorithms: Affine Transformation Attack (ATA) and Affine Genetic Attack (AGA). ATA uses random affine transformations to maximize an attack score function, while AGA employs a genetic algorithm involving random noise and affine transformations. These algorithms are evaluated across different neural network architectures and datasets. Result: The experiments show that the proposed algorithms (ATA and AGA) outperform existing black-box adversarial methods, achieving up to an 8.82% improvement in accuracy on image classification tasks. The algorithms demonstrate effectiveness in both global and targeted attack configurations, and insights into adversarial robustness are provided through parameter variation. Conclusion: This paper concludes that the proposed adversarial algorithms, ATA and AGA, are effective in uncovering weaknesses in deep neural networks, particularly in black-box scenarios. Insights into adversarial robustness and defense strategies are provided, with the algorithms outperforming existing methods like Pixle and Square Attack. Abstract: Deep neural networks currently dominate many fields of the artificial intelligence landscape, achieving state-of-the-art results on numerous tasks while remaining hard to understand and exhibiting surprising weaknesses. An active area of research focuses on adversarial attacks, which aim to generate inputs that uncover these weaknesses. However, this proves challenging, especially in the black-box scenario where model details are inaccessible. 
This paper explores in detail the impact of such adversarial algorithms on ResNet-18, DenseNet-121, Swin Transformer V2, and Vision Transformer network architectures. Leveraging the Tiny ImageNet, Caltech-256, and Food-101 datasets, we benchmark two novel black-box iterative adversarial algorithms based on affine transformations and genetic algorithms: 1) Affine Transformation Attack (ATA), an iterative algorithm maximizing our attack score function using random affine transformations, and 2) Affine Genetic Attack (AGA), a genetic algorithm that involves random noise and affine transformations. We evaluate the performance of the models in the algorithm parameter variation, data augmentation, and global and targeted attack configurations. We also compare our algorithms with two black-box adversarial algorithms, Pixle and Square Attack. Our experiments yield better results on the image classification task than similar methods in the literature, achieving an accuracy improvement of up to 8.82%. We provide noteworthy insights into successful adversarial defenses and attacks at both global and targeted levels, and demonstrate adversarial robustness through algorithm parameter variation.
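The description of ATA as an iterative algorithm that maximizes an attack score using random affine transformations can be pictured as a black-box random search. The sketch below is a toy stand-in, not the authors' algorithm: the score function is supplied by the caller (in a real attack it would query the target model), the warp uses nearest-neighbour sampling on a 2-D grayscale array, rotation is taken about the top-left pixel for brevity, and all parameter ranges are assumed values.

```python
import numpy as np

def warp_affine(img, M):
    """Nearest-neighbour warp of a 2-D image; M is a 2x3 map from output to input pixels."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = M[0, 0] * xs + M[0, 1] * ys + M[0, 2]
    src_y = M[1, 0] * xs + M[1, 1] * ys + M[1, 2]
    sx = np.clip(np.rint(src_x).astype(int), 0, w - 1)   # clamp to image borders
    sy = np.clip(np.rint(src_y).astype(int), 0, h - 1)
    return img[sy, sx]

def random_affine(rng, max_angle=0.2, max_shift=2.0):
    """Sample a small rotation (radians) plus translation; ranges are illustrative."""
    a = rng.uniform(-max_angle, max_angle)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    return np.array([[np.cos(a), -np.sin(a), tx],
                     [np.sin(a),  np.cos(a), ty]])

def affine_random_search(img, score_fn, steps=50, seed=0):
    """Keep the transformed image with the highest score seen so far."""
    rng = np.random.default_rng(seed)
    best_img, best_score = img, score_fn(img)
    for _ in range(steps):
        cand = warp_affine(img, random_affine(rng))
        s = score_fn(cand)
        if s > best_score:
            best_img, best_score = cand, s
    return best_img, best_score
```

Only the score value is read back from `score_fn`, which is what makes the setting black-box: no gradients or model internals are needed. A genetic variant such as AGA would replace the independent sampling with a population that is mutated and recombined.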

[183] Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Yuyao Ge,Shenghua Liu,Yiwei Wang,Lingrui Mei,Baolong Bi,Xuanshan Zhou,Jiayu Yao,Jiafeng Guo,Xueqi Cheng

Main category: cs.CV

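TL;DR: This paper proposes CARVE, a training-free visual enhancement method that analyzes attention to extract task-relevant visual signals, significantly improving visual reasoning performance in complex visual environments.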

Details Motivation: Existing visual enhancement methods require additional training, rely on external segmentation tools, or operate only at a coarse-grained level, overlooking the innate capability of VLMs. Method: The authors study the attention patterns of vision-language models (VLMs), find that visual complexity correlates strongly with attention entropy, and contrast the attention maps of general and task-specific queries to decompose the visual signal into semantic signal and visual noise components, which leads to the CARVE method. Result: CARVE consistently improves performance across experiments, with up to 75% improvement on open-source models. Conclusion: The paper proposes CARVE, a training-free visual enhancement method that extracts task-relevant visual signals through attention contrasting, effectively improving visual reasoning in complex visual environments. Abstract: Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; (3) theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
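Two quantities from this abstract lend themselves to a small sketch: the entropy of an attention map, and a contrast between a task-specific and a general attention map. The entropy is the standard Shannon definition. The contrast shown here, an element-wise ratio rescaled to [0, 1], is purely an illustrative assumption, since the abstract does not give CARVE's actual formula; `contrast_mask` and `eps` are hypothetical names.

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy (in nats) of a non-negative attention map, after normalization."""
    p = attn / attn.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

def contrast_mask(attn_task, attn_general, eps=1e-8):
    """Assumed contrast: ratio of task-specific to general attention, scaled to [0, 1]."""
    ratio = attn_task / (attn_general + eps)
    return ratio / ratio.max()
```

A uniform map over 16 patches has entropy log(16), while a map concentrated on a single patch has entropy 0, which matches the intuition that cluttered scenes spread attention and raise entropy. Under the assumed ratio, regions attended regardless of the question are down-weighted and regions attended only for the task are kept.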

[184] A Statistical 3D Stomach Shape Model for Anatomical Analysis

Erez Posner,Ore Shtalrid,Oded Erell,Daniel Noy,Moshe Bouhnik

Main category: cs.CV

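TL;DR: This paper proposes a new method for building a 3D statistical shape model of the stomach, combining synthetic data generation, parametric modeling, and validation on real data to accurately model anatomical variation of the stomach, with broad application prospects.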

Details Motivation: Owing to limited data availability and methodological challenges, detailed models of internal organs such as the stomach remain scarce. This study proposes a new method to generate anatomically diverse 3D stomach models and to build a statistical shape model that captures natural anatomical variation. Method: The paper proposes a new pipeline for generating synthetic 3D stomach models and uses it to build a synthetic stomach dataset. From this dataset, a 3D statistical shape model is developed and then refined through semi-supervised alignment with CT meshes from public datasets to improve generalization to unseen anatomical variations. Result: The model was evaluated on a held-out set of real stomach CT scans, demonstrating robust generalization and fit accuracy. The authors also release the statistical shape model and synthetic dataset on GitLab to facilitate further research. Conclusion: This work introduces the first 3D statistical shape model of the stomach, with applications including surgical simulation, pre-operative planning, medical education, and computational modeling. By combining synthetic data generation, parametric modeling, and real-world validation, the approach represents a significant advance in organ modeling and opens new possibilities for personalized healthcare solutions. Abstract: Realistic and parameterized 3D models of human anatomy have become invaluable in research, diagnostics, and surgical planning. However, the development of detailed models for internal organs, such as the stomach, has been limited by data availability and methodological challenges. In this paper, we propose a novel pipeline for the generation of synthetic 3D stomach models, enabling the creation of anatomically diverse morphologies informed by established studies on stomach shape variability. Using this pipeline, we construct a dataset of synthetic stomachs. Building on this dataset, we develop a 3D statistical shape model of the stomach, trained to capture natural anatomical variability in a low-dimensional shape space. The model is further refined using CT meshes derived from publicly available datasets through a semi-supervised alignment process, enhancing its ability to generalize to unseen anatomical variations. We evaluated the model on a held-out test set of real stomach CT scans, demonstrating robust generalization and fit accuracy. We make the statistical shape model along with the synthetic dataset publicly available on GitLab: https://gitlab.com/Erez.Posner/stomach_pytorch to facilitate further research. This work introduces the first statistical 3D shape model of the stomach, with applications ranging from surgical simulation and pre-operative planning to medical education and computational modeling. 
By combining synthetic data generation, parametric modeling, and real-world validation, our approach represents a significant advancement in organ modeling and opens new possibilities for personalized healthcare solutions.
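A "low-dimensional shape space" of the kind described above is classically obtained by principal component analysis over meshes that share vertex correspondence. The sketch below shows that textbook construction in NumPy; it assumes the meshes are already aligned and in correspondence, and it is not the released model, whose training procedure and semi-supervised refinement are not detailed in the abstract. All function names are illustrative.

```python
import numpy as np

def fit_shape_model(shapes, n_modes):
    """shapes: (N, V, 3) aligned meshes in vertex correspondence.

    Returns the mean shape vector, the top principal modes, and their
    standard deviations.
    """
    n = shapes.shape[0]
    X = shapes.reshape(n, -1)                       # flatten each mesh to a 3V vector
    mean = X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = Vt[:n_modes]                            # orthonormal directions of variation
    stddev = S[:n_modes] / np.sqrt(max(n - 1, 1))   # per-mode standard deviation
    return mean, modes, stddev

def encode(shape, mean, modes):
    """Project a (V, 3) mesh onto the shape space."""
    return modes @ (shape.reshape(-1) - mean)

def decode(coeffs, mean, modes):
    """Reconstruct a (V, 3) mesh from shape coefficients."""
    return (mean + coeffs @ modes).reshape(-1, 3)
```

New plausible shapes can be sampled by drawing each coefficient from a normal distribution scaled by the matching entry of `stddev`, and fit accuracy on held-out scans can be measured as the vertex distance between a mesh and `decode(encode(mesh, ...), ...)`.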

[185] Does DINOv3 Set a New Medical Vision Standard?

Che Liu,Yinda Chen,Haoyuan Shi,Jinpeng Lu,Bailiang Jian,Jiazhen Pan,Linghan Cai,Jiayi Wang,Yundi Zhang,Jun Li,Cosmin I. Bercea,Cheng Ouyang,Chen Chen,Zhiwei Xiong,Benedikt Wiestler,Christian Wachinger,Daniel Rueckert,Wenjia Bai,Rossella Arcucci

Main category: cs.CV

TL;DR: DINOv3 performs well on a range of medical vision tasks without domain-specific pre-training, sometimes outperforming medical-specific models, but struggles in deeply specialized domains and shows inconsistent scaling behavior.

Details Motivation: To investigate whether DINOv3, a state-of-the-art self-supervised vision transformer trained on natural images, can be directly used as an encoder for medical vision tasks without domain-specific pre-training. Method: Benchmarking DINOv3 across various medical vision tasks including 2D/3D classification and segmentation, with varying model sizes and image resolutions. Result: DINOv3 showed impressive performance and outperformed some medical-specific models in certain tasks, but faced limitations in deeply specialized domains like WSIs, EM, and PET. It also did not consistently follow scaling laws in the medical domain. Conclusion: DINOv3 can serve as a robust baseline for medical vision tasks despite being trained on natural images, but it shows limitations in deeply specialized domains. Abstract: The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how well the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. 
However, we identify clear limitations: The model's features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.

[186] FSG-Net: Frequency-Spatial Synergistic Gated Network for High-Resolution Remote Sensing Change Detection

Zhongxiang Xie,Shuangxi Miao,Yuhan Jiang,Zhewei Zhang,Jing Yao,Xuecao Li,Jianxi Huang,Pedram Ghamisi

Main category: cs.CV

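TL;DR: FSG-Net is a new change detection method for high-resolution remote sensing images that addresses false alarms and the semantic gap between deep abstract features and shallow detail-rich features.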

Details Motivation: Change detection in high-resolution remote sensing images is central to Earth observation applications, but its effectiveness is often compromised by two key problems: false alarms and the semantic gap between deep abstract features and shallow detail-rich features. Method: FSG-Net comprises three main modules: DAWIM (mitigates pseudo-changes in the frequency domain), STSAM (amplifies the saliency of genuine change regions), and LGFU (bridges the semantic gap). Result: FSG-Net achieves F1-scores of 94.16%, 89.51%, and 91.27% on the CDD, GZ-CD, and LEVIR-CD datasets, respectively, reaching state-of-the-art results. Conclusion: FSG-Net effectively addresses the two key problems in high-resolution remote sensing change detection: false alarms and the semantic gap between deep abstract features and shallow detail-rich features. Experiments show that FSG-Net performs strongly on multiple benchmark datasets and sets a new state of the art. Abstract: Change detection from high-resolution remote sensing images is a cornerstone of Earth observation applications, yet its efficacy is often compromised by two critical challenges. First, false alarms are prevalent as models misinterpret radiometric variations from temporal shifts (e.g., illumination, season) as genuine changes. Second, a non-negligible semantic gap between deep abstract features and shallow detail-rich features tends to obstruct their effective fusion, culminating in poorly delineated boundaries. To step further in addressing these issues, we propose the Frequency-Spatial Synergistic Gated Network (FSG-Net), a novel paradigm that aims to systematically disentangle semantic changes from nuisance variations. Specifically, FSG-Net first operates in the frequency domain, where a Discrepancy-Aware Wavelet Interaction Module (DAWIM) adaptively mitigates pseudo-changes by discerningly processing different frequency components. Subsequently, the refined features are enhanced in the spatial domain by a Synergistic Temporal-Spatial Attention Module (STSAM), which amplifies the saliency of genuine change regions. To finally bridge the semantic gap, a Lightweight Gated Fusion Unit (LGFU) leverages high-level semantics to selectively gate and integrate crucial details from shallow layers. Comprehensive experiments on the CDD, GZ-CD, and LEVIR-CD benchmarks validate the superiority of FSG-Net, establishing a new state-of-the-art with F1-scores of 94.16%, 89.51%, and 91.27%, respectively. The code will be made available at https://github.com/zxXie-Air/FSG-Net after a possible publication.

[187] WS$^2$: Weakly Supervised Segmentation using Before-After Supervision in Waste Sorting

Andrea Marelli,Alberto Foresti,Leonardo Pesce,Giacomo Boracchi,Mario Grosso

Main category: cs.CV

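TL;DR: This paper proposes a weakly supervised segmentation method based on operators' removal actions for automating waste sorting in industrial quality control, and releases WS$^2$, the first multiview dataset for this setting.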

Details Motivation: In industrial quality control, tasks such as waste sorting still rely on human operators to identify and remove unwanted objects, and fully supervised methods are impractical because they need large amounts of annotated data. Exploring weakly supervised methods that exploit the implicit supervision in operator behavior is therefore of considerable interest. Method: The paper proposes a weakly supervised segmentation approach based on the differences between before and after images, called Before-After Supervision, and builds WS$^2$, a multiview dataset of more than 11,000 high-resolution video frames. It also designs a robust end-to-end pipeline for benchmarking several weakly supervised segmentation methods on WS$^2$. Result: The authors build WS$^2$, the first weakly supervised segmentation dataset for waste-sorting scenes, and demonstrate the effectiveness of an end-to-end approach based on Before-After Supervision, advancing research in this direction. Conclusion: Weakly supervised methods can exploit the implicit supervision in operators' removal actions to accurately identify and segment unwanted objects in waste-sorting scenes. In addition, the new multiview dataset WS$^2$ provides a foundation for future research. Abstract: In industrial quality control, to visually recognize unwanted items within a moving heterogeneous stream, human operators are often still indispensable. Waste-sorting stands as a significant example, where operators on multiple conveyor belts manually remove unwanted objects to select specific materials. To automate this recognition problem, computer vision systems offer great potential in accurately identifying and segmenting unwanted items in such settings. Unfortunately, considering the multitude and the variety of sorting tasks, fully supervised approaches are not a viable option to address this challenge, as they require extensive labeling efforts. Surprisingly, weakly supervised alternatives that leverage the implicit supervision naturally provided by the operator in his removal action are relatively unexplored. In this paper, we define the concept of Before-After Supervision, illustrating how to train a segmentation network by leveraging only the visual differences between images acquired before and after the operator. To promote research in this direction, we introduce WS$^2$ (Weakly Supervised segmentation for Waste-Sorting), the first multiview dataset consisting of more than 11,000 high-resolution video frames captured on top of a conveyor belt, including "before" and "after" images. We also present a robust end-to-end pipeline, used to benchmark several state-of-the-art weakly supervised segmentation methods on WS$^2$.

[188] TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement

Jibai Lin,Bo Ma,Yating Yang,Rong Ma,Turghun Osman,Ahtamjan Ahmat,Rui Dong,Lei Wang,Xi Zhou

Main category: cs.CV

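TL;DR: This paper proposes TIDE, a new image generation framework that follows textual edit instructions more closely while preserving subject identity, yielding better generation results.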

Details Motivation: To resolve the tension between preserving subject identity and following dynamic edit instructions, which existing methods handle inadequately. Method: The paper introduces the Target-Instructed Diffusion Enhancing (TIDE) framework, which is trained with target supervision and preference learning, using triplet alignment and the Direct Subject Diffusion (DSD) objective. Result: Experiments show that TIDE outperforms baseline methods on multiple quantitative metrics and is successfully applied to several tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Conclusion: TIDE effectively resolves the tension in SDIG between preserving subject identity and following edit instructions, showing strong performance and versatility. Abstract: Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired "winning" (balanced preservation-compliance) and "losing" (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE's superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE's versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.

[189] Predicting Brain Tumor Response to Therapy using a Hybrid Deep Learning and Radiomics Approach

Daniil Tikhonov,Matheus Scatolin,Mohor Banerjee,Qiankun Ji,Ahmed Jaheen,Mostafa Salem,Abdelrahman Elsayed,Hu Wang,Sarim Hashmi,Mohammad Yaqub

Main category: cs.CV

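TL;DR: This study develops an automated method for classifying intervention response, predicting treatment response in neuro-oncology by combining deep learning and radiomic features.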

Details Motivation: To provide an automated method that accurately assesses the response of glioblastoma to therapy, supporting clinical decision-making and patient management. Method: The paper proposes a hybrid framework that combines features extracted by deep learning with radiomic and clinical features, and uses a CatBoost classifier for prediction. Result: Using the fused feature set, the CatBoost classifier achieves a mean ROC AUC of 0.81 and a Macro F1 score of 0.50 on the 4-class response prediction task. Conclusion: An automated method combining deep learning and radiomic features provides a robust and effective solution for treatment response assessment in neuro-oncology. Abstract: Accurate evaluation of the response of glioblastoma to therapy is crucial for clinical decision-making and patient management. The Response Assessment in Neuro-Oncology (RANO) criteria provide a standardized framework to assess patients' clinical response, but their application can be complex and subject to observer variability. This paper presents an automated method for classifying the intervention response from longitudinal MRI scans, developed to predict tumor response during therapy as part of the BraTS 2025 challenge. We propose a novel hybrid framework that combines deep learning derived feature extraction and an extensive set of radiomics and clinically chosen features. Our approach utilizes a fine-tuned ResNet-18 model to extract features from 2D regions of interest across four MRI modalities. These deep features are then fused with a rich set of more than 4800 radiomic and clinically driven features, including 3D radiomics of tumor growth and shrinkage masks, volumetric changes relative to the nadir, and tumor centroid shift. Using the fused feature set, a CatBoost classifier achieves a mean ROC AUC of 0.81 and a Macro F1 score of 0.50 in the 4-class response prediction task (Complete Response, Partial Response, Stable Disease, Progressive Disease). Our results highlight that synergizing learned image representations with domain-targeted radiomic features provides a robust and effective solution for automated treatment response assessment in neuro-oncology.
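The Macro F1 score reported above averages the per-class F1 with equal weight, so rare classes count as much as common ones. The sketch below shows that computation in plain Python; the class abbreviations follow the abstract (CR, PR, SD, PD), while the labels and predictions are made-up toy data, not results from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (F1 is 0 for a class never hit)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: a classifier that over-predicts the majority class PD.
labels = ["CR", "PR", "SD", "PD"]
y_true = ["PD", "PD", "PD", "SD", "PR", "CR"]
y_pred = ["PD", "PD", "PD", "PD", "PD", "CR"]
score = macro_f1(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case four of six predictions are correct, yet the macro F1 is much lower because two classes are never predicted, which illustrates how a model can rank cases well and still score a modest Macro F1 on imbalanced classes.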

[190] On the Reproducibility of "FairCLIP: Harnessing Fairness in Vision-Language Learning"

Hua Chang Bakker,Stan Fris,Angela Madelon Bernardy,Stan Deutekom

Main category: cs.CV

TL;DR: The study could not reproduce the claimed improvements of FairCLIP on CLIP's fairness and performance in zero-shot glaucoma classification, despite reductions in Sinkhorn distances.

Details Motivation: The motivation was to investigate the reproducibility of FairCLIP's results and to explore whether improvements could be made to enhance the fairness of CLIP in zero-shot glaucoma classification. Method: The authors reproduced the experimental setup of Luo et al. (2024) to investigate FairCLIP, introduced a new implementation called A-FairCLIP, and proposed an extension named FairCLIP+ to include multiple attributes in the FairCLIP objective. Result: Experimental results showed that neither the official implementation nor A-FairCLIP improved performance or fairness in zero-shot glaucoma classification, despite reducing Sinkhorn distances. Conclusion: The study concludes that while the FairCLIP method reduces Sinkhorn distances, it does not effectively enhance the performance or fairness of CLIP in zero-shot glaucoma classification. Abstract: We investigated the reproducibility of FairCLIP, proposed by Luo et al. (2024), for improving the group fairness of CLIP (Radford et al., 2021) by minimizing image-text similarity score disparities across sensitive groups using the Sinkhorn distance. The experimental setup of Luo et al. (2024) was reproduced to primarily investigate the research findings for FairCLIP. The model description by Luo et al. (2024) was found to differ from the original implementation. Therefore, a new implementation, A-FairCLIP, is introduced to examine specific design choices. Furthermore, FairCLIP+ is proposed to extend the FairCLIP objective to include multiple attributes. Additionally, the impact of the distance minimization on FairCLIP's fairness and performance was explored. In alignment with the original authors, CLIP was found to be biased towards certain demographics when applied to zero-shot glaucoma classification using medical scans and clinical notes from the Harvard-FairVLMed dataset. 
However, the experimental results on two datasets do not support their claim that FairCLIP improves the performance and fairness of CLIP. Although the regularization objective reduces Sinkhorn distances, neither the official implementation nor the aligned implementation, A-FairCLIP, was found to improve performance or fairness in zero-shot glaucoma classification.
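For readers unfamiliar with the Sinkhorn distance mentioned above, the following is a generic sketch of entropy-regularised optimal transport between two sets of scalar scores, such as similarity scores of two demographic groups. It is not FairCLIP's implementation: the squared-difference cost, uniform weights, `epsilon`, and iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn_distance(x, y, epsilon=0.1, n_iters=200):
    """Entropy-regularised transport cost between two 1-D samples (uniform weights)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    a = np.full(len(x), 1.0 / len(x))   # source marginal
    b = np.full(len(y), 1.0 / len(y))   # target marginal
    cost = (x[:, None] - y[None, :]) ** 2
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(a)
    for _ in range(n_iters):
        # Alternate scaling so the plan matches each marginal in turn.
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())

group_a = [0.1, 0.5, 0.9]
d_same = sinkhorn_distance(group_a, group_a)
d_shifted = sinkhorn_distance(group_a, [0.6, 1.0, 1.4])
```

A regulariser of this kind pulls the two score distributions together, but, as the reproducibility study reports, a smaller distance does not by itself guarantee better downstream fairness or accuracy.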

[191] Benchmarking EfficientTAM on FMO datasets

Senem Aktas,Charles Markham,John McDonald,Rozenn Dahyot

Main category: cs.CV

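TL;DR: This paper introduces FMOX, a JSON metadata extension of existing FMO datasets, and uses it to test the EfficientTAM model on fast-moving object tracking.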

Details Motivation: Tracking fast and tiny objects remains a challenge in computer vision, so the paper aims to provide data and an associated benchmark that support tracking fast-moving objects. Method: The authors extend the description of the FMO datasets with FMOX, a JSON metadata file that includes object size information, and use the FMOX file to test the EfficientTAM model. Result: Using the FMOX file, the authors test EfficientTAM and report comparisons based on Trajectory Intersection of Union (TIoU) scores, showing that its performance is comparable to pipelines designed specifically for FMO datasets. Conclusion: The paper presents the creation of FMOX and uses it to test EfficientTAM on fast-moving object tracking, showing that the model is competitive with pipelines designed specifically for such datasets. Abstract: Fast and tiny object tracking remains a challenge in computer vision and in this paper we first introduce a JSON metadata file associated with four open source datasets of Fast Moving Objects (FMOs) image sequences. In addition, we extend the description of the FMOs datasets with additional ground truth information in JSON format (called FMOX) with object size information. Finally we use our FMOX file to test a recently proposed foundational model for tracking (called EfficientTAM) showing that its performance compares well with the pipelines originally tailored for these FMO datasets. Our comparison of these state-of-the-art techniques on FMOX is provided with Trajectory Intersection of Union (TIoU) scores. The code and JSON is shared open source allowing FMOX to be accessible and usable for other machine learning pipelines aiming to process FMO datasets.

[192] Back To The Drawing Board: Rethinking Scene-Level Sketch-Based Image Retrieval

Emil Demić,Luka Čehovin Zajc

Main category: cs.CV

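TL;DR: This paper proposes a new approach to scene-level sketch-based image retrieval (SBIR) that achieves state-of-the-art performance without added complexity by improving the training objective, encoder architecture, and loss formulation.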

Details Motivation: The authors emphasize the inherent ambiguity and noise in real-world sketches, a point that earlier work has largely overlooked. Method: The approach centres on a training objective explicitly designed to be robust to sketch variability, combined with a suitable choice of pre-training, encoder architecture, and loss formulation. Result: Extensive experiments on the challenging FS-COCO dataset and the widely used SketchyCOCO dataset confirm the effectiveness of the approach. Conclusion: With appropriate pre-training, encoder architecture, and loss formulation, state-of-the-art performance can be achieved without introducing additional complexity. Abstract: The goal of Scene-level Sketch-Based Image Retrieval is to retrieve natural images matching the overall semantics and spatial layout of a free-hand sketch. Unlike prior work focused on architectural augmentations of retrieval models, we emphasize the inherent ambiguity and noise present in real-world sketches. This insight motivates a training objective that is explicitly designed to be robust to sketch variability. We show that with an appropriate combination of pre-training, encoder architecture, and loss formulation, it is possible to achieve state-of-the-art performance without the introduction of additional complexity. Extensive experiments on the challenging FS-COCO and widely used SketchyCOCO datasets confirm the effectiveness of our approach and underline the critical role of training design in cross-modal retrieval tasks, as well as the need to improve the evaluation scenarios of scene-level SBIR.

[193] Evolving from Unknown to Known: Retentive Angular Representation Learning for Incremental Open Set Recognition

Runqing Yang,Yimin Fu,Changyuan Wu,Zhunga Liu

Main category: cs.CV

TL;DR: This paper proposes retentive angular representation learning (RARL) for incremental open set recognition (IOSR), introducing novel strategies to maintain discriminability and reduce confusion between known and newly emerging unknown classes during continuous learning.

Details Motivation: Existing OSR methods are unsuitable for evolving scenarios where models must identify new unknown classes from continuous data streams without access to prior training data, leading to poor decision boundary discriminability and inter-class confusion. Method: The paper introduces retentive angular representation learning (RARL) using an equiangular tight frame to align unknown representations, a virtual-intrinsic interactive (VII) training strategy to compact known representations, and stratified rectification to refine decision boundaries. Result: The proposed RARL method achieves state-of-the-art performance on the CIFAR100 and TinyImageNet datasets, demonstrating effectiveness across various task setups in incremental open set recognition. Conclusion: The proposed RARL method with VII training and stratified rectification strategies effectively addresses the challenges in incremental open set recognition by maintaining discriminability and reducing inter-class confusion. Abstract: Existing open set recognition (OSR) methods are typically designed for static scenarios, where models aim to classify known classes and identify unknown ones within fixed scopes. This deviates from the expectation that the model should incrementally identify newly emerging unknown classes from continuous data streams and acquire corresponding knowledge. In such evolving scenarios, the discriminability of OSR decision boundaries is hard to maintain due to restricted access to former training data, causing severe inter-class confusion. To solve this problem, we propose retentive angular representation learning (RARL) for incremental open set recognition (IOSR). In RARL, unknown representations are encouraged to align around inactive prototypes within an angular space constructed under the equiangular tight frame, thereby mitigating excessive representation drift during knowledge updates. 
Specifically, we adopt a virtual-intrinsic interactive (VII) training strategy, which compacts known representations by enforcing clear inter-class margins through boundary-proximal virtual classes. Furthermore, a stratified rectification strategy is designed to refine decision boundaries, mitigating representation bias and feature space distortion caused by imbalances between old/new and positive/negative class samples. We conduct thorough evaluations on CIFAR100 and TinyImageNet datasets and establish a new benchmark for IOSR. Experimental results across various task setups demonstrate that the proposed method achieves state-of-the-art performance.
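The equiangular tight frame mentioned in the abstract can be illustrated with the simplex ETF commonly used in the neural-collapse literature: a set of unit-norm prototypes whose pairwise cosine similarity is the same for every pair. Whether RARL uses exactly this construction is an assumption; the function name and the QR-based recipe below are illustrative only.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Return (num_classes, dim) unit-norm prototypes forming a simplex ETF."""
    if dim < num_classes:
        raise ValueError("this construction needs dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Orthonormal columns from the QR decomposition of a random matrix.
    q, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    k = num_classes
    centering = np.eye(k) - np.ones((k, k)) / k
    prototypes = np.sqrt(k / (k - 1)) * (q @ centering)
    return prototypes.T

prototypes = simplex_etf(num_classes=5, dim=16)
gram = prototypes @ prototypes.T   # pairwise cosine similarities
```

Every off-diagonal entry of `gram` equals -1/(K-1), the most spread-out arrangement possible for K unit vectors. Reserving some of these prototypes as "inactive" slots for classes that have not yet appeared is one way to read the abstract's description.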

[194] Approximating Condorcet Ordering for Vector-valued Mathematical Morphology

Marcos Eduardo Valle,Santiago Velasco-Forero,Joao Batista Florindo,Gustavo Jesus Angulo

Main category: cs.CV

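TL;DR: This paper studies how to learn, with machine learning, a reduced ordering that approximates the Condorcet ordering, addressing the ordering problem for vector-valued image processing in mathematical morphology.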

Details Motivation: Mathematical morphology lacks a unified vector ordering for vector-valued images such as color and hyperspectral images, so a consensus ordering is needed to build effective morphological operators. Method: The paper uses a machine learning approach to learn a reduced ordering that approximates the Condorcet ordering derived from multiple vector orderings, for use in constructing morphological operators. Result: Preliminary computational experiments show that the learned reduced mapping is effective for defining vector-valued morphological operators on color images. Conclusion: A reduced ordering learned by machine learning can effectively approximate the Condorcet ordering and can be used to define vector-valued morphological operators for color images. Abstract: Mathematical morphology provides a nonlinear framework for image and spatial data processing and analysis. Although there have been many successful applications of mathematical morphology to vector-valued images, such as color and hyperspectral images, there is still no consensus on the most suitable vector ordering for constructing morphological operators. This paper addresses this issue by examining a reduced ordering approximating the Condorcet ranking derived from a set of vector orderings. Inspired by voting problems, the Condorcet ordering ranks elements from most to least voted, with voters representing different orderings. In this paper, we develop a machine learning approach that learns a reduced ordering that approximates the Condorcet ordering. Preliminary computational experiments confirm the effectiveness of learning the reduced mapping to define vector-valued morphological operators for color images.
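The voting idea can be sketched as a pairwise-majority count: each ordering acts as a voter, and a colour "beats" another if most voters rank it higher. The sketch below ranks colours by their number of pairwise wins (a Copeland-style count), which is one simple stand-in for a Condorcet ranking and not necessarily the paper's exact definition. The three voters (red channel, green channel, mean intensity) are hypothetical examples of reduced orderings.

```python
import numpy as np

def pairwise_majority_wins(values, orderings):
    """Count, for each vector, how many other vectors it beats by majority vote."""
    values = np.asarray(values, dtype=np.float64)
    # scores[voter, i] = scalar rank value that this voter assigns to vector i
    scores = np.array([[h(v) for v in values] for h in orderings])
    # votes[i, j] = number of voters ranking vector i strictly above vector j
    votes = (scores[:, :, None] > scores[:, None, :]).sum(axis=0)
    beats = votes > votes.T
    return beats.sum(axis=1)

orderings = [lambda v: v[0], lambda v: v[1], lambda v: v.mean()]
colours = [[0.9, 0.8, 0.4], [0.2, 0.3, 0.9], [0.5, 0.9, 0.4]]
wins = pairwise_majority_wins(colours, orderings)
supremum = colours[int(np.argmax(wins))]   # candidate output of a dilation
```

Evaluating all pairs for every pixel window is quadratic in the number of values, which gives one motivation for learning a single scalar mapping that approximates the resulting order.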

[195] CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis

Xin Kong,Daniel Watson,Yannick Strümpler,Michael Niemeyer,Federico Tombari

Main category: cs.CV

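TL;DR: The paper proposes CausNVS, an autoregressive multi-view diffusion model that addresses limitations in 3D novel view synthesis: it supports arbitrary input-output view configurations and generates views sequentially, improving inference speed and broadening applicability.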

Details Motivation: Existing multi-view diffusion models use a non-autoregressive formulation, which limits their use in world modeling: they support only a fixed number of views and are slow at inference because all frames must be denoised simultaneously. Method: CausNVS is trained with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference, it combines a spatially-aware sliding window, key-value caching, and noise conditioning augmentation to mitigate drift. Result: Experiments show that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and maintains strong visual quality across diverse settings. Conclusion: CausNVS offers an effective autoregressive solution for multi-view diffusion models that overcomes the limitations of existing methods and has broad application prospects. Abstract: Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.
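Causal masking with per-frame noise can be pictured as follows: tokens of a frame may attend to tokens of the same or earlier frames only, and each frame draws its own noise level during training. This is a rough sketch under stated assumptions, not the CausNVS code. In particular, the optional `window` argument is a plain temporal window, whereas the paper's sliding window is described as spatially aware, and the uniform noise sampling is a placeholder.

```python
import numpy as np

def frame_causal_mask(num_frames, tokens_per_frame, window=None):
    """Boolean mask: entry [q, k] is True if query token q may attend to key token k."""
    frame_of_token = np.repeat(np.arange(num_frames), tokens_per_frame)
    q = frame_of_token[:, None]
    k = frame_of_token[None, :]
    allowed = k <= q                      # same frame or any earlier frame
    if window is not None:
        allowed &= (q - k) < window       # keep only the most recent frames
    return allowed

def sample_per_frame_noise_levels(num_frames, rng):
    """Draw an independent noise level in [0, 1) for every frame."""
    return rng.uniform(0.0, 1.0, size=num_frames)

mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
windowed = frame_causal_mask(num_frames=3, tokens_per_frame=2, window=1)
noise = sample_per_frame_noise_levels(3, np.random.default_rng(0))
```

Because no frame ever attends to a later one, keys and values of finished frames can be cached and reused when the next view is generated, which is what makes key-value caching applicable.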

[196] Detection of trade in products derived from threatened species using machine learning and a smartphone

Ritwik Kulkarni,WU Hanqin,Enrico Di Minin

Main category: cs.CV

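TL;DR: This study develops machine learning models that automatically identify illegal wildlife products in images, such as ivory, pangolin scales, and tiger bones, and builds a smartphone application with 91.3% accuracy that can be used for real-time monitoring of wildlife trade.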

Details Motivation: Unsustainable wildlife trade in digital marketplaces and on social media threatens biodiversity, and automated methods are needed to detect illegal trade. Method: The authors develop machine learning-based object recognition models and optimize performance through different training strategies and loss functions. Result: The best model has an overall accuracy of 84.2%, with accuracies of 71.1%, 90.2%, and 93.5% for elephant, pangolin, and tiger products, respectively. Conclusion: Models that automatically identify wildlife products are of practical value for digital platforms and law enforcement. Abstract: Unsustainable trade in wildlife is a major threat to biodiversity and is now increasingly prevalent in digital marketplaces and social media. With the sheer volume of digital content, the need for automated methods to detect wildlife trade listings is growing. These methods are especially needed for the automatic identification of wildlife products, such as ivory. We developed machine learning-based object recognition models that can identify wildlife products within images and highlight them. The data consists of images of elephant, pangolin, and tiger products that were identified as being sold illegally or that were confiscated by authorities. Specifically, the wildlife products included elephant ivory and skins, pangolin scales, and claws (raw and crafted), and tiger skins and bones. We investigated various combinations of training strategies and two loss functions to identify the best model to use in the automatic detection of these wildlife products. Models were trained for each species while also developing a single model to identify products from all three species. The best model showed an overall accuracy of 84.2% with accuracies of 71.1%, 90.2% and 93.5% in detecting products derived from elephants, pangolins, and tigers, respectively. We further demonstrate that the machine learning model can be made easily available to stakeholders, such as government authorities and law enforcement agencies, by developing a smartphone-based application that had an overall accuracy of 91.3%. The application can be used in real time to take images and help identify potentially prohibited products of target species. Thus, the proposed method is not only applicable for monitoring trade on the web but can also be used, for example, in physical markets for monitoring wildlife trade.

[197] Hybrid Swin Attention Networks for Simultaneously Low-Dose PET and CT Denoising

Yichao Liu,YueYang Teng

Main category: cs.CV

TL;DR: This paper proposes HSANet, a new denoising method for medical imaging that improves image quality without high computational costs.

Details Motivation: LDCT and PET imaging techniques reduce radiation exposure but introduce noise and artifacts, requiring effective denoising methods. Method: HSANet uses Efficient Global Attention modules and a hybrid upsampling module for noise reduction. Result: HSANet outperforms existing denoising approaches and works efficiently on standard GPU memory. Conclusion: HSANet is a practical solution for LDCT/PET denoising with high performance and low resource requirements. Abstract: Low-dose computed tomography (LDCT) and positron emission tomography (PET) have emerged as safer alternatives to conventional imaging modalities by significantly reducing radiation exposure. However, this reduction often results in increased noise and artifacts, which can compromise diagnostic accuracy. Consequently, denoising for LDCT/PET has become a vital area of research aimed at enhancing image quality while maintaining radiation safety. In this study, we introduce a novel Hybrid Swin Attention Network (HSANet), which incorporates Efficient Global Attention (EGA) modules and a hybrid upsampling module. The EGA modules enhance both spatial and channel-wise interaction, improving the network's capacity to capture relevant features, while the hybrid upsampling module mitigates the risk of overfitting to noise. We validate the proposed approach using a publicly available LDCT/PET dataset. Experimental results demonstrate that HSANet achieves superior denoising performance compared to existing methods, while maintaining a lightweight model size suitable for deployment on GPUs with standard memory configurations. This makes our approach highly practical for real-world clinical applications.

[198] Improved Classification of Nitrogen Stress Severity in Plants Under Combined Stress Conditions Using Spatio-Temporal Deep Learning Framework

Aswini Kumar Patra

Main category: cs.CV

TL;DR: This study developed a CNN-LSTM deep learning model to classify nitrogen stress severity in plants under combined stress conditions, achieving high accuracy and offering a promising tool for early stress detection.

Details Motivation: Early detection of nitrogen stress in plants is essential for protecting plant health, particularly when nitrogen deficiency is compounded by other stresses like drought and weed competition. Method: A deep learning framework combining Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks was used to analyze a combination of four imaging modalities (RGB, multispectral, and two infrared wavelengths) presented as time-series data. Result: The CNN-LSTM pipeline achieved an accuracy of 98%, significantly outperforming a spatial-only CNN model (80.45%) and previously reported machine learning methods (76%). Conclusion: The CNN-LSTM pipeline proves to be an effective tool for identifying nitrogen stress severity in plants, offering potential for improved crop management. Abstract: Plants in their natural habitats endure an array of interacting stresses, both biotic and abiotic, that rarely occur in isolation. Nutrient stress, particularly nitrogen deficiency, becomes even more critical when compounded with drought and weed competition, making it increasingly difficult to distinguish and address its effects. Early detection of nitrogen stress is therefore crucial for protecting plant health and implementing effective management strategies. This study proposes a novel deep learning framework to accurately classify nitrogen stress severity in a combined stress environment. Our model uses a unique blend of four imaging modalities (RGB, multispectral, and two infrared wavelengths) to capture a wide range of physiological plant responses from canopy images. These images, provided as time-series data, document plant health across three levels of nitrogen availability (low, medium, and high) under varying water stress and weed pressures. 
The core of our approach is a spatio-temporal deep learning pipeline that merges a Convolutional Neural Network (CNN) for extracting spatial features from images with a Long Short-Term Memory (LSTM) network to capture temporal dependencies. We also devised and evaluated a spatial-only CNN pipeline for comparison. Our CNN-LSTM pipeline achieved an accuracy of 98%, surpassing the spatial-only model's 80.45% and the 76% of previously reported machine learning methods. These results suggest that the CNN-LSTM approach effectively captures the subtle and complex interactions between nitrogen deficiency, water stress, and weed pressure. This robust platform offers a promising tool for the timely and proactive identification of nitrogen stress severity, enabling better crop management and improved plant health.

[199] Investigating Location-Regularised Self-Supervised Feature Learning for Seafloor Visual Imagery

Cailei Liang,Adrian Bodenmann,Emma J Curtis,Samuel Simmons,Kazunori Nagano,Stan Brown,Adam Riese,Blair Thornton

Main category: cs.CV

TL;DR: Location metadata boosts SSL performance, especially for low-dimensional CNNs, while high-dimensional ViTs generalise well for seafloor image analysis.

Details Motivation: To explore how location metadata impacts SSL across different models and strategies in seafloor imagery analysis. Method: Evaluated six SSL frameworks (CNNs and ViTs) with location-based regularisation on three seafloor image datasets. Result: Location-regularisation improved CNN performance by 4.9% and ViT by 6.3%. High-dimensional ViTs showed strong generalisation matching location-regularised models. Conclusion: Location metadata enhances SSL performance, especially for low-dimensional representations, while high-dimensional ViTs show strong generalisation for seafloor image analysis. Abstract: High-throughput interpretation of robotically gathered seafloor visual imagery can increase the efficiency of marine monitoring and exploration. Although recent research has suggested that location metadata can enhance self-supervised feature learning (SSL), its benefits across different SSL strategies, models and seafloor image datasets are underexplored. This study evaluates the impact of location-based regularisation on six state-of-the-art SSL frameworks, which include Convolutional Neural Network (CNN) and Vision Transformer (ViT) models with varying latent-space dimensionality. Evaluation across three diverse seafloor image datasets finds that location-regularisation consistently improves downstream classification performance over standard SSL, with average F1-score gains of $4.9 \pm 4.0\%$ for CNNs and $6.3 \pm 8.9\%$ for ViTs, respectively. While CNNs pretrained on generic datasets benefit from high-dimensional latent representations, dataset-optimised SSL achieves similar performance across the high (512) and low (128) dimensional latent representations. Location-regularised SSL improves CNN performance over pre-trained models by $2.7 \pm 2.7\%$ and $10.1 \pm 9.4\%$ for high and low-dimensional latent representations, respectively. For ViTs, high-dimensionality benefits both pre-trained and dataset-optimised SSL. 
Although location-regularisation improves SSL performance compared to standard SSL methods, pre-trained ViTs show strong generalisation, matching the best-performing location-regularised SSL with F1-scores of $0.795 \pm 0.075$ and $0.795 \pm 0.077$, respectively. The findings highlight the value of location metadata for SSL regularisation, particularly when using low-dimensional latent representations, and demonstrate strong generalisation of high-dimensional ViTs for seafloor image analysis.

[200] Online Clustering of Seafloor Imagery for Interpretation during Long-Term AUV Operations

Cailei Liang,Adrian Bodenmann,Sam Fenton,Blair Thornton

Main category: cs.CV

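TL;DR: The paper introduces an online clustering framework (OCF) for real-time interpretation of seafloor imagery that runs without supervision and is efficient, adaptive, and self-consistent.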

Details Motivation: As long-endurance and seafloor-resident AUVs become more capable, there is a growing need for extended, real-time interpretation of seafloor imagery to enable adaptive missions and optimise communication efficiency. Method: The authors introduce an online clustering framework (OCF) that interprets seafloor imagery without supervision and is designed to run in real time on continuous data streams in a scalable, adaptive, and self-consistent manner. Result: OCF is evaluated on three diverse seafloor image datasets, analysing how different representative sampling strategies affect clustering accuracy and computational cost. It achieves the highest average F1 score, 0.68, among all compared online clustering approaches, with a standard deviation of 3%. Conclusion: OCF achieves the highest average F1 score of 0.68 across three different seafloor image datasets, showing strong clustering capability and robustness to trajectory variation. It also keeps computational time low and bounded as data volume grows, which is useful for generating survey data summaries and supporting informative path planning in long-term, persistent autonomous marine exploration. Abstract: As long-endurance and seafloor-resident AUVs become more capable, there is an increasing need for extended, real-time interpretation of seafloor imagery to enable adaptive missions and optimise communication efficiency. Although offline image analysis methods are well established, they rely on access to complete datasets and human-labelled examples to manage the strong influence of environmental and operational conditions on seafloor image appearance; these requirements cannot be met in real-time settings. To address this, we introduce an online clustering framework (OCF) capable of interpreting seafloor imagery without supervision, which is designed to operate in real-time on continuous data streams in a scalable, adaptive, and self-consistent manner. The method enables the efficient review and consolidation of common patterns across the entire data history in constant time by identifying and maintaining a set of representative samples that capture the evolving feature distribution, supporting dynamic cluster merging and splitting without reprocessing the full image history. We evaluate the framework on three diverse seafloor image datasets, analysing the impact of different representative sampling strategies on both clustering accuracy and computational cost. The OCF achieves the highest average F1 score of 0.68 across the three datasets among all comparative online clustering approaches, with a standard deviation of 3% across three distinct survey trajectories, demonstrating its superior clustering capability and robustness to trajectory variation. 
In addition, it maintains consistently lower and bounded computational time as the data volume increases. These properties are beneficial for generating survey data summaries and supporting informative path planning in long-term, persistent autonomous marine exploration.

[201] VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes

Shengkai Zhang,Yuhe Liu,Guanjun Wu,Jianhua He,Xinggang Wang,Mozi Chen,Kezhong Liu

Main category: cs.CV

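TL;DR: VIM-GS is a framework for high-fidelity NVS of large scenes from monocular images. By combining sparse depth from visual-inertial SfM with dense depth from LFMs, it substantially improves depth estimation and rendering quality.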

Details Motivation: Conventional Gaussian Splatting (GS) frameworks rely on RGB-D or stereo cameras for accurate depth, which makes them hard to apply to large scenes, while monocular images lack depth information and lead to poorer NVS results. Large foundation models (LFMs) for monocular depth estimation exist, but they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and misleading texture cues. The paper aims to address these issues and achieve high-quality monocular depth estimation and GS rendering in large scenes. Method: The paper proposes an object-segmented depth propagation algorithm and a dynamic depth refinement module, using sparse depth from visual-inertial SfM to refine the dense depth produced by LFMs for more accurate depth estimation and high-fidelity rendering. Result: Experiments show that VIM-GS achieves better rendering quality than existing methods on public and customized datasets, particularly in large scenes. Conclusion: VIM-GS effectively combines the accurate but sparse depth from visual-inertial SfM with the dense but coarse depth from LFMs to produce high-quality dense depth images from monocular RGB input and high-fidelity NVS rendering in large scenes. Abstract: VIM-GS is a Gaussian Splatting (GS) framework using monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth to initialize Gaussian ellipsoids using RGB-D/stereo cameras. Their limited depth sensing range makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning and lead to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and ambiguity in deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-fidelity GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels of structured objects. Then we develop a dynamic depth refinement module to handle the crippled SfM depth of dynamic objects and refine the coarse LFM depth. Experiments using public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
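One common way to use accurate sparse depth to correct dense but coarse depth is a least-squares scale-and-shift fit at the pixels where sparse depth is available. The sketch below shows that generic technique, not the paper's object-segmented propagation algorithm; the function name is hypothetical, and a per-object variant would simply run the same fit inside each segmentation mask.

```python
import numpy as np

def align_depth_to_sparse(dense_depth, sparse_depth, valid):
    """Fit scale s and shift t so that s * dense + t matches sparse depth where valid."""
    d = dense_depth[valid]
    z = sparse_depth[valid]
    design = np.stack([d, np.ones_like(d)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, z, rcond=None)
    return scale * dense_depth + shift, scale, shift

# Synthetic check: the coarse map is an affine distortion of the true depth.
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 50.0, size=(8, 8))
coarse = (true_depth - 0.5) / 2.0
valid = np.zeros((8, 8), dtype=bool)
valid[::3, ::3] = True                     # sparse SfM-like samples
sparse = np.where(valid, true_depth, 0.0)
refined, scale, shift = align_depth_to_sparse(coarse, sparse, valid)
```

A single global fit cannot repair errors that differ from object to object, which is consistent with the abstract's choice to propagate depth at the object level.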

[202] BioLite U-Net: Edge-Deployable Semantic Segmentation for In Situ Bioprinting Monitoring

Usman Haider,Lukasz Szemet,Daniel Kelly,Vasileios Sergis,Andrew C. Daly,Karl Mason

Main category: cs.CV

TL;DR: This paper proposes BioLite U-Net, a lightweight semantic segmentation model for real-time monitoring in bioprinting, achieving high accuracy and efficiency on resource-constrained devices.

Details Motivation: The core challenge in bioprinting is ensuring the fidelity and consistency of printed structures in real time, especially under resource constraints. Real-time monitoring through semantic segmentation is essential to maintain print quality and biological viability. Method: The authors introduced a lightweight U-Net-based semantic segmentation framework called BioLite U-Net, utilizing depthwise separable convolutions for reduced computational load. They tested it against MobileNetV2 and MobileNetV3-based models using metrics like mIoU, Dice score, and pixel accuracy on a Raspberry Pi 4B. Result: BioLite U-Net achieved an mIoU of 92.85%, a Dice score of 96.17%, and 335 ms per frame inference time on a Raspberry Pi 4B, showing it is over 1300x smaller and significantly faster than MobileNetV2-DeepLabV3+. Conclusion: BioLite U-Net provides a highly efficient and accurate solution for real-time semantic segmentation in bioprinting, making it ideal for integration into intelligent bioprinting systems. Abstract: Bioprinting is a rapidly advancing field that offers a transformative approach to fabricating tissue and organ models through the precise deposition of cell-laden bioinks. Ensuring the fidelity and consistency of printed structures in real time remains a core challenge, particularly under constraints imposed by limited imaging data and resource-constrained embedded hardware. Semantic segmentation of the extrusion process, differentiating between nozzle, extruded bioink, and surrounding background, enables in situ monitoring critical to maintaining print quality and biological viability. In this work, we introduce a lightweight semantic segmentation framework tailored for real-time bioprinting applications. We present a novel, manually annotated dataset comprising 787 RGB images captured during the bioprinting process, labeled across three classes: nozzle, bioink, and background. 
To achieve fast and efficient inference suitable for integration with bioprinting systems, we propose a BioLite U-Net architecture that leverages depthwise separable convolutions to drastically reduce computational load without compromising accuracy. Our model is benchmarked against MobileNetV2 and MobileNetV3-based segmentation baselines using mean Intersection over Union (mIoU), Dice score, and pixel accuracy. All models were evaluated on a Raspberry Pi 4B to assess real-world feasibility. The proposed BioLite U-Net achieves an mIoU of 92.85% and a Dice score of 96.17%, while being over 1300x smaller than MobileNetV2-DeepLabV3+. On-device inference takes 335 ms per frame, demonstrating near real-time capability. Compared to MobileNet baselines, BioLite U-Net offers a superior tradeoff between segmentation accuracy, efficiency, and deployability, making it highly suitable for intelligent, closed-loop bioprinting systems.
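The three reported metrics (mIoU, Dice score, pixel accuracy) can be computed from predicted and ground-truth label maps. The sketch below is only an illustration of the standard definitions, not the authors' evaluation code; the class IDs and the toy label arrays are assumptions made for the example.

```python
def segmentation_metrics(pred, target, num_classes):
    """Return (mIoU, mean Dice, pixel accuracy) for two flat lists of class labels."""
    assert len(pred) == len(target) and len(pred) > 0
    ious, dices = [], []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        if tp + fp + fn == 0:
            continue  # class absent from both maps; leave it out of the mean
        ious.append(tp / (tp + fp + fn))
        dices.append(2 * tp / (2 * tp + fp + fn))
    accuracy = sum(1 for p, t in zip(pred, target) if p == t) / len(pred)
    return sum(ious) / len(ious), sum(dices) / len(dices), accuracy


# Hypothetical class IDs: 0 = background, 1 = nozzle, 2 = bioink.
target = [0, 0, 1, 1, 2, 2, 2, 0]
pred = [0, 0, 1, 2, 2, 2, 2, 0]
miou, dice, acc = segmentation_metrics(pred, target, num_classes=3)
```

Averaging per class rather than per pixel keeps a small class such as the nozzle from being swamped by the background.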

[203] STAGE: Segmentation-oriented Industrial Anomaly Synthesis via Graded Diffusion with Explicit Mask Alignment

Xichen Xu,Yanshu Wang,Jinbao Wang,Qunyi Zhang,Xiaoning Lei,Guoyang Xie,Guannan Jiang,Zhichao Lu

Main category: cs.CV

TL;DR: STAGE uses graded diffusion and an explicit mask alignment strategy to address the lack of texture detail and fine-grained, pixel-level generation in existing industrial anomaly synthesis methods.

Details Motivation: Existing industrial anomaly synthesis methods fall short in the texture detail and pixel-level generation of synthesized anomalies, which hurts downstream anomaly segmentation. Method: STAGE introduces a new anomaly inference strategy that uses clean background information as a prior to guide the denoising distribution, and adopts a graded diffusion framework together with an explicit mask alignment strategy. Result: Experiments on the MVTec and BTAD datasets show that STAGE achieves state-of-the-art performance in SIAS and improves downstream anomaly segmentation. Conclusion: By combining graded diffusion with explicit mask alignment, STAGE effectively addresses the limitations of existing industrial anomaly synthesis methods. Abstract: Segmentation-oriented Industrial Anomaly Synthesis (SIAS) plays a pivotal role in enhancing the performance of downstream anomaly segmentation, as it provides an effective means of expanding abnormal data. However, existing SIAS methods face several critical limitations: (i) the synthesized anomalies often lack intricate texture details and fail to align precisely with the surrounding background, and (ii) they struggle to generate fine-grained, pixel-level anomalies. To address these challenges, we propose Segmentation-oriented Anomaly synthesis via Graded diffusion with Explicit mask alignment, termed STAGE. STAGE introduces a novel anomaly inference strategy that incorporates clean background information as a prior to guide the denoising distribution, enabling the model to more effectively distinguish and highlight abnormal foregrounds. Furthermore, it employs a graded diffusion framework with an anomaly-only branch to explicitly record local anomalies during both the forward and reverse processes, ensuring that subtle anomalies are not overlooked. Finally, STAGE incorporates the explicit mask alignment (EMA) strategy to progressively align the synthesized anomalies with the background, resulting in context-consistent and structurally coherent generations. Extensive experiments on the MVTec and BTAD datasets demonstrate that STAGE achieves state-of-the-art performance in SIAS, which in turn enhances downstream anomaly segmentation.

[204] Cortex-Synth: Differentiable Topology-Aware 3D Skeleton Synthesis with Hierarchical Graph Attention

Mohamed Zayaan S

Main category: cs.CV

TL;DR: Cortex Synth is an end-to-end differentiable framework for generating accurate 3D skeletons from 2D images, combining hierarchical graph attention, spectral topology optimization, and adversarial training to achieve state-of-the-art results.

Details Motivation: To overcome limitations in existing methods for 3D skeleton synthesis by developing a novel end-to-end differentiable framework that jointly optimizes geometry and topology. Method: The paper introduces a hierarchical graph attention mechanism, differentiable spectral topology optimization, and adversarial geometric consistency training within an integrated framework consisting of four modules. Result: The proposed framework achieves a state-of-the-art performance with an 18.7% improvement in MPJPE, 27.3% in Graph Edit Distance, and a 42% reduction in topological errors on ShapeNet. Conclusion: Cortex Synth provides a robust and innovative framework for 3D skeleton geometry and topology synthesis from 2D images, showing significant performance improvements and offering broad applications. Abstract: We present Cortex Synth, a novel end-to-end differentiable framework for joint 3D skeleton geometry and topology synthesis from single 2D images. Our architecture introduces three key innovations: (1) A hierarchical graph attention mechanism with multi-scale skeletal refinement, (2) Differentiable spectral topology optimization via Laplacian eigen decomposition, and (3) Adversarial geometric consistency training for pose structure alignment. The framework integrates four synergistic modules: a pseudo 3D point cloud generator, an enhanced PointNet encoder, a skeleton coordinate decoder, and a novel Differentiable Graph Construction Network (DGCN). Our experiments demonstrate state-of-the-art results with 18.7 percent improvement in MPJPE and 27.3 percent in Graph Edit Distance on ShapeNet, while reducing topological errors by 42 percent compared to previous approaches. The model's end-to-end differentiability enables applications in robotic manipulation, medical imaging, and automated character rigging.
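MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints. The sketch below shows that definition plus a relative-improvement helper; it assumes the reported 18.7 percent is a relative reduction in error, which the abstract does not state explicitly, and the joint coordinates and the 10.0 / 8.13 error values are made up for illustration.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over joints."""
    assert len(pred_joints) == len(gt_joints) and pred_joints
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)


def relative_improvement(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy skeleton with three joints (hypothetical coordinates).
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0), (4.0, 5.0, 0.0)]
error = mpjpe(pred, gt)  # joint distances are 0, 3 and 5

# A hypothetical baseline error of 10.0 dropping to 8.13 is an 18.7% improvement.
gain = relative_improvement(10.0, 8.13)
```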

[205] MRI-Based Brain Tumor Detection through an Explainable EfficientNetV2 and MLP-Mixer-Attention Architecture

Mustafa Yurdakul,Şakir Taşdemir

Main category: cs.CV

TL;DR: A highly accurate and interpretable deep learning model combining EfficientNetV2 and an attention-based MLP-Mixer was developed for automated brain tumor classification using MRI images.

Details Motivation: Brain tumors require early diagnosis due to high mortality rates, and current MRI-based diagnosis methods are error-prone and require expert knowledge. This necessitates the development of an automated, accurate, and explainable diagnosis system. Method: The authors evaluated nine CNN architectures using a publicly available Figshare dataset of 3,064 brain MRI images. EfficientNetV2 was selected as the best-performing backbone and was enhanced with an attention-based MLP-Mixer. The model was validated using five-fold cross-validation and Grad-CAM visualization for interpretability. Result: The proposed model achieved 99.50% accuracy, 99.47% precision, 99.52% recall, and 99.49% F1 score, outperforming existing methods. Grad-CAM visualizations confirmed the model's focus on relevant MRI regions, enhancing interpretability. Conclusion: The study concludes that the proposed deep learning model combining EfficientNetV2 and an attention-based MLP-Mixer architecture is highly effective for brain tumor classification, offering both high accuracy and interpretability. Abstract: Brain tumors are serious health problems that require early diagnosis due to their high mortality rates. Diagnosing tumors by examining Magnetic Resonance Imaging (MRI) images is a process that requires expertise and is prone to error. Therefore, the need for automated diagnosis systems is increasing day by day. In this context, a robust and explainable Deep Learning (DL) model for the classification of brain tumors is proposed. In this study, a publicly available Figshare dataset containing 3,064 T1-weighted contrast-enhanced brain MRI images of three tumor types was used. First, the classification performance of nine well-known CNN architectures was evaluated to determine the most effective backbone. Among these, EfficientNetV2 demonstrated the best performance and was selected as the backbone for further development. 
Subsequently, an attention-based MLP-Mixer architecture was integrated into EfficientNetV2 to enhance its classification capability. The performance of the final model was comprehensively compared with basic CNNs and the methods in the literature. Additionally, Grad-CAM visualization was used to interpret and validate the decision-making process of the proposed model. The proposed model's performance was evaluated using the five-fold cross-validation method. The proposed model demonstrated superior performance with 99.50% accuracy, 99.47% precision, 99.52% recall and 99.49% F1 score. The results obtained show that the model outperforms the studies in the literature. Moreover, Grad-CAM visualizations demonstrate that the model effectively focuses on relevant regions of MRI images, thus improving interpretability and clinical reliability. A robust deep learning model for clinical decision support systems has been obtained by combining EfficientNetV2 and attention-based MLP-Mixer, providing high accuracy and interpretability in brain tumor classification.
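Five-fold cross-validation splits the 3,064 images into five validation folds, training on the remaining four each time. The sketch below only illustrates the index bookkeeping; it uses contiguous folds without shuffling or class stratification, which is an assumption made for brevity and may differ from the authors' protocol.

```python
def k_fold_indices(num_samples, k=5):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    base, extra = divmod(num_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, num_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(3064, k=5))
fold_sizes = [len(val) for _, val in folds]
```

Each image appears in exactly one validation fold, so the reported scores would normally be averaged over the five runs.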

[206] Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training

Ruicheng Zhang,Jun Zhou,Zunnan Xu,Zihao Liu,Jiehui Huang,Mingyang Zhang,Yu Sun,Xiu Li

Main category: cs.CV

TL;DR: Zo3T is a zero-shot test-time-training framework for trajectory-guided image-to-video generation that introduces three innovations to enhance 3D realism and motion accuracy, outperforming existing methods.

Details Motivation: Existing methods for trajectory-guided image-to-video generation rely on computationally expensive fine-tuning on scarce annotated datasets, and some zero-shot methods may yield unrealistic motion by neglecting 3D perspective. Method: The authors introduce Zo3T, a zero-shot test-time-training framework for trajectory-guided generation with three core components: 3D-Aware Kinematic Projection, Trajectory-Guided Test-Time LoRA, and Guidance Field Rectification. Result: The proposed method effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations, ensuring generative fidelity and adherence to the manipulated latent. The denoising evolutionary path is refined, ensuring efficient generative progression towards the target trajectory. Conclusion: Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches. Abstract: Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network's noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects and optimizes ephemeral LoRA adapters into the denoising network alongside the latent state.
Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.

[207] Co-Seg: Mutual Prompt-Guided Collaborative Learning for Tissue and Nuclei Segmentation

Qing Xu,Wenting Duan,Zhen Chen

Main category: cs.CV

TL;DR: This paper proposes Co-Seg, a collaborative segmentation framework for histopathology image analysis that improves segmentation quality by handling tissue and nuclei segmentation jointly.

Details Motivation: Existing studies address tissue semantic segmentation or nuclei instance segmentation separately and ignore the inherent relationship between the two tasks, resulting in insufficient histopathology understanding. Method: The authors propose a new co-segmentation paradigm in which a region-aware prompt encoder (RP-Encoder) and a mutual prompt mask decoder (MP-Decoder) let the tissue and nuclei segmentation tasks enhance each other. Result: Extensive experiments on the PUMA dataset show that Co-Seg outperforms state-of-the-art methods in semantic, instance, and panoptic segmentation of tumor tissues and nuclei instances. Conclusion: The Co-Seg framework delivers strong performance on tissue and nuclei segmentation, outperforming existing state-of-the-art methods. Abstract: Histopathology image analysis is critical yet challenged by the demand of segmenting tissue regions and nuclei instances for tumor microenvironment and cellular morphology analysis. Existing studies focused on tissue semantic segmentation or nuclei instance segmentation separately, but ignored the inherent relationship between these two tasks, resulting in insufficient histopathology understanding. To address this issue, we propose a Co-Seg framework for collaborative tissue and nuclei segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing tissue and nuclei segmentation tasks to mutually enhance each other. To this end, we first devise a region-aware prompt encoder (RP-Encoder) to provide high-quality semantic and instance region prompts as prior constraints. Moreover, we design a mutual prompt mask decoder (MP-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, collaboratively computing semantic and instance segmentation masks. Extensive experiments on the PUMA dataset demonstrate that the proposed Co-Seg surpasses state-of-the-art methods in the semantic, instance and panoptic segmentation of tumor tissues and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg.

[208] Event Spectroscopy: Event-based Multispectral and Depth Sensing using Structured Light

Christian Geckeler,Niklas Neugebauer,Manasi Muglikar,Davide Scaramuzza,Stefano Mintchev

Main category: cs.CV

TL;DR: This paper presents a novel event spectroscopy system that improves the efficiency of UAV perception and data collection in complex natural environments.

Details Motivation: Traditional sensing approaches, including passive multispectral and RGB imaging, depend strongly on ambient light under forest canopies and suffer from latency and poor depth resolution, so a more effective sensing approach is needed. Method: A single sensor provides both high-resolution, low-latency depth reconstruction and multispectral imaging. Spectral information in controlled bands is captured by modulating the wavelength of the projected structured light, and the system is validated in both lab and real-world rainforest environments. Result: The system shows up to a 60% improvement in RMSE over commercial depth sensors and spectral accuracy comparable to a reference spectrometer and commercial multispectral cameras. Material differentiation using depth and spectral data is more than 30% more accurate than a color-only method. Conclusion: The paper proposes a new event spectroscopy system that delivers high-resolution, low-latency depth reconstruction and multispectral imaging at no extra effort, offering a lightweight, integrated, and robust solution for UAV perception and data collection in complex natural environments. Abstract: Uncrewed aerial vehicles (UAVs) are increasingly deployed in forest environments for tasks such as environmental monitoring and search and rescue, which require safe navigation through dense foliage and precise data collection. Traditional sensing approaches, including passive multispectral and RGB imaging, suffer from latency, poor depth resolution, and strong dependence on ambient light - especially under forest canopies. In this work, we present a novel event spectroscopy system that simultaneously enables high-resolution, low-latency depth reconstruction and multispectral imaging using a single sensor. Depth is reconstructed using structured light, and by modulating the wavelength of the projected structured light, our system captures spectral information in controlled bands between 650 nm and 850 nm. We demonstrate up to 60% improvement in RMSE over commercial depth sensors and validate the spectral accuracy against a reference spectrometer and commercial multispectral cameras, demonstrating comparable performance. A portable version limited to RGB (3 wavelengths) is used to collect real-world depth and spectral data from the Masoala Rainforest. We demonstrate the use of this prototype for color image reconstruction and material differentiation between leaves and branches using spectral and depth data. Our results show that adding depth (available at no extra effort with our setup) to material differentiation improves the accuracy by over 30% compared to a color-only method.
Our system, tested in both lab and real-world rainforest environments, shows strong performance in depth estimation, RGB reconstruction, and material differentiation - paving the way for lightweight, integrated, and robust UAV perception and data collection in complex natural environments.

[209] Pothole Detection and Recognition based on Transfer Learning

Mang Hu,Qianqian Xia

Main category: cs.CV

TL;DR: This paper proposes an efficient transfer learning model, ResNet50-EfficientNet-RegNet, for pothole detection, with high recognition accuracy and speed.

Details Motivation: With the rapid development of computer vision and machine learning, automated pothole detection and recognition from image and video data has attracted wide attention. Analyzing road images in depth to automatically identify pothole conditions in new images is of considerable social value. Method: The study applies preprocessing techniques such as standardization, normalization, and data augmentation, and iteratively improves the network based on experimental results. With careful parameter selection and model optimization, a deep learning model is built using transfer learning. Result: For evaluation, the proposed transfer learning model is compared with other models (Random Forest, MLP, SVM, and LightGBM) on accuracy, recall, precision, F1-score, and FPS. The proposed model performs well in both recognition speed and accuracy, reaching a classification accuracy of 97.78% (88 of 90 samples in the initial test set) and 98.89% (890 of 900 samples in the expanded test set). Conclusion: The proposed transfer-learning-based ResNet50-EfficientNet-RegNet feature extraction network achieves high accuracy and computational efficiency in pothole detection and recognition, outperforming the other models. Abstract: With the rapid development of computer vision and machine learning, automated methods for pothole detection and recognition based on image and video data have received significant attention. It is of great significance for social development to conduct an in-depth analysis of road images through feature extraction, thereby achieving automatic identification of the pothole condition in new images. Consequently, this is the main issue addressed in this study. Based on preprocessing techniques such as standardization, normalization, and data augmentation applied to the collected raw dataset, we continuously improved the network model based on experimental results. Ultimately, we constructed a deep learning feature extraction network ResNet50-EfficientNet-RegNet model based on transfer learning. This model exhibits high classification accuracy and computational efficiency. In terms of model evaluation, this study employed a comparative evaluation approach by comparing the performance of the proposed transfer learning model with other models, including Random Forest, MLP, SVM, and LightGBM. The comparison analysis was conducted based on metrics such as Accuracy, Recall, Precision, F1-score, and FPS, to assess the classification performance of the transfer learning model proposed in this paper. The results demonstrate that our model exhibits high performance in terms of recognition speed and accuracy, surpassing the performance of other models.
Through careful parameter selection and model optimization, our transfer learning model achieved a classification accuracy of 97.78% (88/90) on the initial set of 90 test samples and 98.89% (890/900) on the expanded test set.
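The two reported accuracies follow directly from the stated counts. A quick arithmetic check, using only the numbers given in the abstract (the helper name is illustrative):

```python
def accuracy_percent(correct, total):
    """Classification accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


initial = accuracy_percent(88, 90)     # initial test set of 90 samples
expanded = accuracy_percent(890, 900)  # expanded test set of 900 samples
```

Put differently, the model misclassified 2 of 90 samples on the initial set and 10 of 900 on the expanded set.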

[210] Raw2Event: Converting Raw Frame Camera into Event Camera

Zijie Ning,Enmin Lin,Sudarshan R. Iyengar,Patrick Vandewalle

Main category: cs.CV

TL;DR: Raw2Event is a low-cost, high-resolution system that emulates event camera output, achieving performance close to that of real event cameras while supporting real-time operation and flexible parameter tuning.

Details Motivation: Event cameras offer high temporal resolution, low latency, and high dynamic range, but their high cost, limited resolution, and lack of features such as autofocus limit broad adoption, especially in early-stage development and prototyping. Method: Raw2Event emulates event camera output by directly accessing raw Bayer data and bypassing the traditional image signal processor (ISP). It builds a configurable simulation framework on the DVS-Voltmeter model and supports synchronized recording of raw, RGB, and event streams. Result: Experiments show that Raw2Event generates event streams similar to those of real event cameras while offering higher resolution and autofocus, and it supports intuitive parameter tuning for a range of application requirements. Conclusion: Raw2Event is a hardware-software system that generates events in real time from low-cost frame-based cameras, providing higher resolution, higher dynamic range, and more faithful output than existing RGB-based frame-to-event converters. Abstract: Event cameras offer unique advantages such as high temporal resolution, low latency, and high dynamic range, making them increasingly popular for vision tasks under challenging light conditions. However, their high cost, limited resolution, and lack of features such as autofocus hinder their broad adoption, particularly for early-stage development and prototyping. In this work, we present Raw2Event, a complete hardware-software system that enables real-time event generation from low-cost raw frame-based cameras. By leveraging direct access to raw Bayer data and bypassing traditional image signal processors (ISP), our system is able to utilize the full potential of camera hardware, delivering higher dynamic range, higher resolution, and more faithful output than RGB-based frame-to-event converters. Built upon the DVS-Voltmeter model, Raw2Event features a configurable simulation framework optimized for deployment on embedded platforms. We further design a data acquisition pipeline that supports synchronized recording of raw, RGB, and event streams, facilitating downstream evaluation and dataset creation. Experimental results show that Raw2Event can generate event streams closely resembling those from real event cameras, while benefiting from higher resolution and autofocus capabilities. The system also supports user-intuitive parameter tuning, enabling flexible adaptation to various application requirements.
Finally, we deploy the system on a Raspberry Pi for real-time operation, providing a scalable and cost-effective solution for event-based vision research and early-stage system development. The codes are available online: https://anonymous.4open.science/r/raw2event-BFF2/README.md.
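The general idea of frame-to-event conversion can be illustrated with a basic contrast-threshold model: a pixel fires an event when its log intensity changes by more than a threshold since its last event. The sketch below is that simplified textbook model, not the DVS-Voltmeter model Raw2Event builds on; the function name, threshold, and toy frames are all assumptions, and it emits at most one event per pixel per frame with no noise.

```python
import math


def frames_to_events(frames, threshold=0.2, eps=1e-3):
    """Emit (frame_index, pixel_index, polarity) events from a list of flat intensity frames."""
    # Per-pixel reference log intensity, initialised from the first frame.
    reference = [math.log(v + eps) for v in frames[0]]
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        for i, value in enumerate(frame):
            log_value = math.log(value + eps)
            diff = log_value - reference[i]
            if abs(diff) >= threshold:
                events.append((t, i, 1 if diff > 0 else -1))
                reference[i] = log_value  # reset the reference where an event fired
    return events


# Three frames of a 3-pixel sensor with intensities in [0, 1] (toy data).
frames = [
    [0.5, 0.5, 0.5],
    [0.5, 0.8, 0.3],
    [0.5, 0.8, 0.5],
]
events = frames_to_events(frames)
```

Working in the log domain makes the trigger depend on relative rather than absolute brightness change, which is why event cameras cope well with high dynamic range scenes.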

[211] D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

Sai Kartheek Reddy Kasu,Mohammad Zia Ur Rehman,Shahid Shafi Dar,Rishi Bharat Junghare,Dhanvin Sanjay Namboodiri,Nagendra Kumar

Main category: cs.CV

TL;DR: This work introduces a new dataset of 4,379 Reddit memes and proposes a reasoning-augmented framework based on a large vision-language model for detecting dark humor in online memes.

Details Motivation: Dark humor in online memes is hard to detect because it relies on implicit, sensitive, and culturally contextual cues, and resources and methods for detecting dark humor in multimodal content are lacking. Method: The authors build a new dataset of 4,379 Reddit memes and use a large vision-language model (VLM) to generate structured explanations. Through a Role-Reversal Self-Loop, the VLM iteratively refines its explanations from the author's perspective. Textual features are extracted with a text encoder and visual features with a vision transformer. Finally, a Tri-stream Cross-Reasoning Network (TCRNet) fuses the text, image, and reasoning streams via pairwise attention to produce a unified representation for classification. Result: Experiments show that the approach outperforms strong baselines on all three tasks: dark humor detection, target identification, and intensity prediction. Conclusion: The work provides a new dataset and an effective reasoning-augmented framework that support further research in multimodal humor understanding and content moderation. Abstract: Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning

[212] UrbanTwin: High-Fidelity Synthetic Replicas of Roadside Lidar Datasets

Muhammad Shahbaz,Shaurya Agarwal

Main category: cs.CV

TL;DR: The UrbanTwin datasets are high-fidelity synthetic replicas of three public roadside lidar datasets. They can augment existing benchmark datasets and offer strong value for training deep learning models.

Details Motivation: High-fidelity, realistic synthetic datasets that can stand in for real ones are needed to augment existing benchmarks and to provide strong standalone and augmentative value for training deep learning models on tasks such as 3D object detection, tracking, and semantic and instance segmentation. Method: The UrbanTwin datasets are synthesized using emulated lidar sensors within digital twins of real locations, modeled on the surrounding geometry, lane-level road alignment, and the lane topology and vehicle movement patterns at intersections. Result: Each UrbanTwin dataset contains 10K annotated frames, with 3D bounding boxes, instance segmentation labels, and tracking IDs for six object classes, plus semantic segmentation labels for nine classes. The synthetic datasets align closely with the real data, and 3D object detection models trained only on synthetic data show improved detection performance when tested on real, unseen data. Conclusion: The UrbanTwin datasets are among the first digitally synthesized datasets that can replace real-world datasets for lidar perception tasks. They are high-fidelity and realistic, and they increase the sample size and scene diversity of existing benchmark datasets. Abstract: This article presents UrbanTwin datasets - high-fidelity, realistic replicas of three public roadside lidar datasets: LUMPI, V2X-Real-IC, and TUMTraf-I. Each UrbanTwin dataset contains 10K annotated frames corresponding to one of the public datasets. Annotations include 3D bounding boxes, instance segmentation labels, and tracking IDs for six object classes, along with semantic segmentation labels for nine classes. These datasets are synthesized using emulated lidar sensors within realistic digital twins, modeled based on surrounding geometry, road alignment at lane level, and the lane topology and vehicle movement patterns at intersections of the actual locations corresponding to each real dataset. Due to the precise digital twin modeling, the synthetic datasets are well aligned with their real counterparts, offering strong standalone and augmentative value for training deep learning models on tasks such as 3D object detection, tracking, and semantic and instance segmentation. We evaluate the alignment of the synthetic replicas through statistical and structural similarity analysis with real data, and further demonstrate their utility by training 3D object detection models solely on synthetic data and testing them on real, unseen data. The high similarity scores and improved detection performance, compared to the models trained on real data, indicate that the UrbanTwin datasets effectively enhance existing benchmark datasets by increasing sample size and scene diversity.
In addition, the digital twins can be adapted to test custom scenarios by modifying the design and dynamics of the simulations. To our knowledge, these are the first digitally synthesized datasets that can replace in-domain real-world datasets for lidar perception tasks. UrbanTwin datasets are publicly available at https://dataverse.harvard.edu/dataverse/ucf-ut.

[213] P3-SAM: Native 3D Part Segmentation

Changfeng Ma,Yang Li,Xinhao Yan,Jiachen Xu,Yunhan Yang,Chunshi Wang,Zibo Zhao,Yanwen Guo,Zhuo Chen,Chunchao Guo

Main category: cs.CV

TL;DR: P3-SAM is a fully automated, point-promptable 3D part segmentation model with strong robustness and precise segmentation.

Details Motivation: Current methods are not robust on complex objects and cannot fully automate the process. P3-SAM aims to solve these problems, enable interactive segmentation, and improve 3D understanding and model reuse. Method: P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, and uses an algorithm that automatically selects and merges masks for part instance segmentation. Result: P3-SAM achieves precise segmentation and strong robustness on complex objects, attaining state-of-the-art performance. Conclusion: P3-SAM fully automates the segmentation of any 3D object and delivers precise results and strong robustness on complex objects, attaining state-of-the-art performance. Abstract: Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods face limitations such as poor robustness when dealing with complex objects and cannot fully automate the process. In this paper, we propose a native 3D point-promptable part segmentation model termed P3-SAM, designed to fully automate the segmentation of any 3D objects into components. Inspired by SAM, P3-SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on any complex objects, attaining state-of-the-art performance. Our code will be released soon.

[214] AIM 2025 Challenge on High FPS Motion Deblurring: Methods and Results

George Ciubotariu,Florin-Alexandru Vasluianu,Zhuyun Zhou,Nancy Mehta,Radu Timofte,Ke Wu,Long Sun,Lingshun Kong,Zhongbao Yang,Jinshan Pan,Jiangxin Dong,Jinhui Tang,Hao Chen,Yinghui Fang,Dafeng Zhang,Yongqi Song,Jiangbo Guo,Shuhua Jin,Zeyu Xiao,Rui Zhao,Zhuoyuan Li,Cong Zhang,Yufeng Peng,Xin Lu,Zhijing Sun,Chengjie Ge,Zihao Li,Zishun Liao,Ziang Zhou,Qiyu Kang,Xueyang Fu,Zheng-Jun Zha,Yuqian Zhang,Shuai Liu,Jie Liu,Zhuhao Zhang,Lishen Qu,Zhihao Liu,Shihao Zhou,Yaqi Luo,Juncheng Zhou,Jufeng Yang,Qianfeng Yang,Qiyuan Guan,Xiang Chen,Guiyue Jin,Jiyu Jin

Main category: cs.CV

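TL;DR: This paper summarizes the AIM 2025 High FPS Non-Uniform Motion Deblurring Challenge, evaluates the state of the art in high-FPS single image motion deblurring, and showcases challenging samples from the MIORe dataset.

Details Motivation: The challenge aims to produce clearer and more visually compelling images under diverse and challenging conditions by learning representative visual cues for complex aggregations of motion types. Method: The paper reviews and evaluates in detail the proposed solutions and final results of the AIM 2025 High FPS Non-Uniform Motion Deblurring Challenge. Result: A total of 68 participants registered for the competition, and 9 teams ultimately submitted valid entries, showing significant progress in the field. Conclusion: The paper comprehensively evaluates recent advances in high-FPS single image motion deblurring and presents new samples from the MIORe dataset that introduce challenging movement patterns. Abstract: This paper presents a comprehensive review of the AIM 2025 High FPS Non-Uniform Motion Deblurring Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions, by learning representative visual cues for complex aggregations of motion types. A total of 68 participants registered for the competition, and 9 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in high-FPS single image motion deblurring, showcasing the significant progress in the field, while leveraging samples of the novel dataset, MIORe, that introduces challenging examples of movement patterns.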

[215] SynthDrive: Scalable Real2Sim2Real Sensor Simulation Pipeline for High-Fidelity Asset Generation and Driving Data Synthesis

Zhengqing Chen,Ruohong Mei,Xiaoyang Guo,Qingjie Wang,Yubin Hu,Wei Yin,Weiqiang Ren,Qian Zhang

Main category: cs.CV

TL;DR: The paper introduces a scalable real2sim2real system for autonomous driving that uses 3D generation to improve diversity and scalability in sensor simulation.

Details Motivation: Current sensor simulation methods in autonomous driving have limitations in terms of diversity, scalability, and applicability to generic objects. The authors aim to address these issues with a new approach. Method: The proposed system uses 3D generation to automate asset mining, generation, and rare-case data synthesis. Result: The proposed real2sim2real system is scalable and can handle asset mining, generation, and synthesis of rare-case data, which overcomes the limitations of existing methods. Conclusion: The proposed real2sim2real system addresses the limitations of current sensor simulation methods in autonomous driving by leveraging 3D generation for asset mining, generation, and rare-case data synthesis. Abstract: In the field of autonomous driving, sensor simulation is essential for generating rare and diverse scenarios that are difficult to capture in real-world environments. Current solutions fall into two categories: 1) CG-based methods, such as CARLA, which lack diversity and struggle to scale to the vast array of rare cases required for robust perception training; and 2) learning-based approaches, such as NeuSim, which are limited to specific object categories (vehicles) and require extensive multi-sensor data, hindering their applicability to generic objects. To address these limitations, we propose a scalable real2sim2real system that leverages 3D generation to automate asset mining, generation, and rare-case data synthesis.

[216] MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration

George Ciubotariu,Zhuyun Zhou,Zongwei Wu,Radu Timofte

Main category: cs.CV

TL;DR: The paper introduces MIORe and VAR-MIORe, two novel multi-task datasets for motion restoration benchmarks, offering high-frame-rate acquisition, professional-grade optics, and explicit control over motion amplitude.

Details Motivation: The motivation behind this paper is to address critical limitations in current motion restoration benchmarks by introducing datasets that can capture a wide range of motion scenarios and provide high-resolution ground truths for challenging existing algorithms. Method: The paper describes the creation of MIORe and VAR-MIORe datasets using high-frame-rate acquisition and professional-grade optics to capture a broad spectrum of motion scenarios. MIORe generates consistent motion blur by adaptively averaging frames based on computed optical flow metrics, while VAR-MIORe offers explicit control over motion amplitude. Result: The result of the paper is the development of two novel multi-task datasets, MIORe and VAR-MIORe. These datasets are designed to capture a broad spectrum of motion scenarios including complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects, with the ability to control motion amplitude. Conclusion: The paper concludes that the introduced MIORe and VAR-MIORe datasets overcome critical limitations in current motion restoration benchmarks and provide high-resolution, scalable ground truths for advancing research in image and video restoration tasks. Abstract: We introduce MIORe and VAR-MIORe, two novel multi-task datasets that address critical limitations in current motion restoration benchmarks. Designed with high-frame-rate (1000 FPS) acquisition and professional-grade optics, our datasets capture a broad spectrum of motion scenarios, which include complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects. By adaptively averaging frames based on computed optical flow metrics, MIORe generates consistent motion blur, and preserves sharp inputs for video frame interpolation and optical flow estimation. 
VAR-MIORe further extends by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark to offer explicit control over motion amplitude. We provide high-resolution, scalable ground truths that challenge existing algorithms under both controlled and adverse conditions, paving the way for next-generation research of various image and video restoration tasks.
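
The flow-based averaging described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes frames are flat lists of pixel intensities, that `flow_magnitudes[i]` is a scalar motion estimate between frame `i` and frame `i + 1`, and that the averaging window grows until a hypothetical `target_motion` budget is reached. All function and parameter names are invented for the example.

```python
def frames_to_average(flow_magnitudes, target_motion):
    """Number of consecutive frames whose accumulated motion reaches the target."""
    total = 0.0
    for count, magnitude in enumerate(flow_magnitudes, start=1):
        total += magnitude
        if total >= target_motion:
            return count + 1  # `count` flow steps span `count + 1` frames
    return len(flow_magnitudes) + 1  # target never reached: use every frame


def average_frames(frames):
    """Pixel-wise mean of equally sized frames."""
    n = len(frames)
    return [sum(pixels) / n for pixels in zip(*frames)]


def synthesize_blur(frames, flow_magnitudes, target_motion):
    """Average as many frames as needed to cover `target_motion`."""
    n = min(frames_to_average(flow_magnitudes, target_motion), len(frames))
    return average_frames(frames[:n]), n


frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]
fast_blur, fast_n = synthesize_blur(frames, [1.0, 1.0, 1.0], target_motion=2.0)
slow_blur, slow_n = synthesize_blur(frames, [0.5, 0.5, 0.5], target_motion=2.0)
```

Under this assumption a fast scene needs fewer frames than a slow one to reach the same blur extent, which is one way to keep the blur consistent across sequences.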

[217] UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward

Yufeng Cheng,Wenxu Wu,Shaojin Wu,Mengqi Huang,Fei Ding,Qian He

Main category: cs.CV

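TL;DR: UMO is a multi-identity image customization optimization framework that improves identity consistency and reduces identity confusion through global assignment optimization and reinforcement learning.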

Details Motivation: Humans are especially sensitive to faces, so preserving a consistent identity while avoiding identity confusion across multiple reference images remains a challenge. Method: Using a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and trains diffusion models with reinforcement learning. Result: UMO significantly improves identity consistency, reduces identity confusion, and sets a new state of the art on several image customization methods. Conclusion: UMO optimizes multi-identity image customization, improving identity consistency and reducing identity confusion, and reaches a new state of the art among open-source methods. Abstract: Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preserving. Code and model: https://github.com/bytedance/UMO
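
To illustrate why a global assignment differs from matching each reference independently, here is a small sketch. It is not UMO's reward: the similarity values are made up, rows are assumed to be reference identities and columns generated faces, and a brute-force search over permutations stands in for a proper assignment solver.

```python
from itertools import permutations


def greedy_assignment(similarity):
    """Each reference independently picks its most similar generated face."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]


def global_assignment(similarity):
    """One-to-one matching that maximizes total similarity (brute force)."""
    n = len(similarity)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(similarity[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score


# Hypothetical face similarities: similarity[i][j] compares reference i
# with generated face j.
similarity = [
    [0.90, 0.80],
    [0.85, 0.20],
]
greedy = greedy_assignment(similarity)
matching, total = global_assignment(similarity)
```

In this toy case the greedy choice maps both references to the same generated face, which is a form of identity confusion, while the global solution assigns each reference a distinct face. Brute force is only practical for a handful of identities; a Hungarian-style solver would be the usual replacement at scale.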

[218] Video-Based MPAA Rating Prediction: An Attention-Driven Hybrid Architecture Using Contrastive Learning

Dipta Neogi,Nourash Azmine Chowdhury,Muhammad Rafsan Kabir,Mohammad Ashrafuzzaman Khan

Main category: cs.CV

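TL;DR: This paper proposes a hybrid architecture based on contrastive learning and an LRCN with Bahdanau attention for MPAA video rating classification, reaching 88% accuracy, and deploys it as a web application.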

Details Motivation: With visual content growing rapidly across platforms, traditional video classification methods suffer from large labeled-data requirements, poor generalization, and inefficient feature learning, so a more efficient, automated approach to age-suitability rating is needed. Method: Contrastive learning is applied in three frameworks (Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning), combined with an LRCN (CNN+LSTM) backbone and a Bahdanau attention mechanism, and several contrastive loss functions are evaluated. Result: The model performs best in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. It does particularly well on fine-grained distinctions such as PG-13 versus R-rated content, and it is deployed as a web application for real-time MPAA rating classification. Conclusion: Combining CNNs, LSTMs, and Bahdanau attention within the Contextual Contrastive Learning framework yields efficient automatic MPAA rating classification, with 88% accuracy and an F1 score of 0.8815, and the web deployment makes it usable in practice. Abstract: The rapid growth of visual content consumption across platforms necessitates automated video classification for age-suitability standards like the MPAA rating system (G, PG, PG-13, R). Traditional methods struggle with large labeled data requirements, poor generalization, and inefficient feature learning. To address these challenges, we employ contrastive learning for improved discrimination and adaptability, exploring three frameworks: Instance Discrimination, Contextual Contrastive Learning, and Multi-View Contrastive Learning. Our hybrid architecture integrates an LRCN (CNN+LSTM) backbone with a Bahdanau attention mechanism, achieving state-of-the-art performance in the Contextual Contrastive Learning framework, with 88% accuracy and an F1 score of 0.8815. By combining CNNs for spatial features, LSTMs for temporal modeling, and attention mechanisms for dynamic frame prioritization, the model excels in fine-grained borderline distinctions, such as differentiating PG-13 and R-rated content. We evaluate the model's performance across various contrastive loss functions, including NT-Xent, NT-logistic, and Margin Triplet, demonstrating the robustness of our proposed architecture. To ensure practical application, the model is deployed as a web application for real-time MPAA rating classification, offering an efficient solution for automated content compliance across streaming platforms.
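
Of the contrastive losses named in the abstract, NT-Xent is the easiest to write out. The sketch below is a plain-Python version of the standard formulation, not the authors' training code: it assumes `view_a[i]` and `view_b[i]` are embeddings of two views of the same clip, and the temperature of 0.5 is an illustrative value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """Normalized temperature-scaled cross-entropy over 2N embeddings."""
    z = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of sample i
        denominator = sum(
            math.exp(cosine(z[i], z[k]) / temperature)
            for k in range(2 * n)
            if k != i
        )
        numerator = math.exp(cosine(z[i], z[positive]) / temperature)
        total += -math.log(numerator / denominator)
    return total / (2 * n)


aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when the two views of each sample agree and high when positives look like negatives, which is the signal the encoder is trained on.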

[219] Curia: A Multi-Modal Foundation Model for Radiology

Corentin Dancette,Julien Khlaut,Antoine Saporta,Helene Philippe,Elodie Ferreres,Baptiste Callard,Théo Danielou,Léo Alberge,Léo Machado,Daniel Tordjman,Julie Dupuis,Korentin Le Floch,Jean Du Terrail,Mariam Moshiri,Laurent Dercle,Tom Boeken,Jules Gregory,Maxime Ronot,François Legou,Pascal Roux,Marc Sapoval,Pierre Manceron,Paul Hérent

Main category: cs.CV

TL;DR: Curia, a foundation model for radiology, is trained on a massive real-world dataset and demonstrates strong performance across multiple tasks, including organ identification, disease detection, and tumor staging.

Details Motivation: The current reliance on narrow, single-task AI models in radiology is impractical due to the wide variety of imaging modalities and diseases. There is a need for broader, more adaptable models. Method: Curia was trained on cross-sectional imaging data from 150,000 exams (130 TB) from a major hospital over several years. It was evaluated on a 19-task external validation benchmark. Result: Curia demonstrated accurate organ identification, condition detection (e.g., brain hemorrhages, myocardial infarctions), and outcome prediction in tumor staging. It meets or exceeds radiologist performance and exhibits emergent properties in cross-modality and low-data settings. Conclusion: Curia represents a significant step forward in AI-assisted radiology, offering broad generalization capabilities and promising performance across various tasks and data conditions. Abstract: AI-assisted radiological interpretation is based on predominantly narrow, single-task models. This approach is impractical for covering the vast spectrum of imaging modalities, diseases, and radiological findings. Foundation models (FMs) hold the promise of broad generalization across modalities and in low-data settings. However, this potential has remained largely unrealized in radiology. We introduce Curia, a foundation model trained on the entire cross-sectional imaging output of a major hospital over several years, which to our knowledge is the largest such corpus of real-world data, encompassing 150,000 exams (130 TB). On a newly curated 19-task external validation benchmark, Curia accurately identifies organs, detects conditions like brain hemorrhages and myocardial infarctions, and predicts outcomes in tumor staging. Curia meets or surpasses the performance of radiologists and recent foundation models, and exhibits clinically significant emergent properties in cross-modality and low-data regimes. 
To accelerate progress, we release our base model's weights at https://huggingface.co/raidium/curia.

[220] Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis

Simon Pezold,Jérôme A. Kurylec,Jan S. Liechti,Beat P. Müller,Joël L. Lavanchy

Main category: cs.CV

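TL;DR: This paper studies how finetuning on domain-specific data and integrating additional data streams from the operating room can improve V-JEPA-based surgical data science.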

Details Motivation: To explore how adapting a generic foundation model via transfer learning and integrating complementary modalities from the operating room can support surgical data science. Method: V-JEPA serves as the single-modality foundation, and model performance is improved through finetuning and through integrating additional data streams from the operating room. Result: Finetuning on domain-specific data improves model performance. On the HeiCo dataset, the accuracy of the pretrained video-only, single-modality baseline is on par with the top-performing submissions of the EndoVis2017 challenge, and finetuning increases accuracy further. Conclusion: Surgical data science can leverage public, generic foundation models and improve performance through domain adaptation and the integration of suitable complementary data streams. Abstract: We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model's downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA's embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. 
Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at https://github.com/DigitalSurgeryLab-Basel/ML-CDS-2025.

[221] Evaluating the Impact of Adversarial Attacks on Traffic Sign Classification using the LISA Dataset

Nabeyou Tadessa,Balaji Iyangar,Mashrur Chowdhury

Main category: cs.CV

TL;DR: This paper demonstrates that traffic sign classifiers are vulnerable to adversarial attacks, as shown by a significant drop in accuracy under FGSM and PGD perturbations.

Details Motivation: Adversarial attacks pose a threat to machine learning models; however, previous studies have mainly focused on datasets like MNIST, leaving a gap in understanding the vulnerability of traffic sign classifiers in real-world scenarios. Method: A convolutional neural network was trained to classify 47 different traffic signs, and its robustness was tested against FGSM and PGD attacks using the LISA Traffic Sign dataset. Result: The classification accuracy of the model declined sharply as the perturbation magnitude increased, indicating susceptibility to adversarial examples. Conclusion: The study concludes that traffic sign classifiers are vulnerable to adversarial attacks, specifically FGSM and PGD, emphasizing the need for improved defense mechanisms for real-world applications. Abstract: Adversarial attacks pose significant threats to machine learning models by introducing carefully crafted perturbations that cause misclassification. While prior work has primarily focused on MNIST and similar datasets, this paper investigates the vulnerability of traffic sign classifiers using the LISA Traffic Sign dataset. We train a convolutional neural network to classify 47 different traffic signs and evaluate its robustness against Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks. Our results show a sharp decline in classification accuracy as the perturbation magnitude increases, highlighting the model's susceptibility to adversarial examples. This study lays the groundwork for future exploration into defense mechanisms tailored for real-world traffic sign recognition systems.
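
FGSM perturbs an input by a fixed step `epsilon` in the direction of the sign of the loss gradient. The sketch below shows that update on a toy logistic-regression classifier rather than the paper's CNN, so the gradient can be written in closed form. The weights, input, and `epsilon` are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(value):
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, label, epsilon):
    """x_adv = x + epsilon * sign(d loss / d x) for binary cross-entropy."""
    p = predict(weights, bias, x)
    # For logistic regression, d loss / d x = (p - label) * weights.
    gradient = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, gradient)]


weights, bias = [2.0, -1.0], 0.0
x, label = [1.0, 0.5], 1
x_adv = fgsm(weights, bias, x, label, epsilon=0.6)
clean_prob = predict(weights, bias, x)
adv_prob = predict(weights, bias, x_adv)
```

Here the clean input is classified as positive and the perturbed one is not. PGD applies the same signed step repeatedly with a smaller step size and projects the result back into the allowed perturbation range after each step.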

[222] ToonOut: Fine-tuned Background-Removal for Anime Characters

Matteo Muratori,Joël Seytre

Main category: cs.CV

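TL;DR: By fine-tuning the BiRefNet model, this paper achieves a marked performance gain in background removal for anime-style images and open-sources the related resources.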

Details Motivation: Existing background removal models perform poorly on anime-style images, where complex features such as hair and transparency are especially challenging. Method: A dataset of 1,228 high-quality anime-style images was collected and annotated, and BiRefNet was fine-tuned on it. Result: Background removal accuracy improved markedly, with the newly introduced Pixel Accuracy metric rising from 95.3% to 99.5%. Conclusion: Fine-tuning BiRefNet on an anime-style image dataset markedly improves background removal accuracy, and the code, model weights, and dataset are open-sourced. Abstract: While state-of-the-art background removal models excel at realistic imagery, they frequently underperform in specialized domains such as anime-style content, where complex features like hair and transparency present unique challenges. To address this limitation, we collected and annotated a custom dataset of 1,228 high-quality anime images of characters and objects, and fine-tuned the open-sourced BiRefNet model on this dataset. This resulted in marked improvements in background removal accuracy for anime-style images, increasing from 95.3% to 99.5% for our newly introduced Pixel Accuracy metric. We are open-sourcing the code, the fine-tuned model weights, as well as the dataset at: https://github.com/MatteoKartoon/BiRefNet.

[223] Automated Radiographic Total Sharp Score (ARTSS) in Rheumatoid Arthritis: A Solution to Reduce Inter-Intra Reader Variation and Enhancing Clinical Practice

Hajar Moradmand,Lei Ren

Main category: cs.CV

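TL;DR: This study develops ARTSS, an automated deep learning framework for radiographic scoring of rheumatoid arthritis, which improves scoring efficiency and accuracy and reduces reader variability.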

Details Motivation: Assessing rheumatoid arthritis (RA) severity typically relies on the Total Sharp/Van Der Heijde Score (TSS), but manual scoring is time-consuming and subjective, so the study aims to develop an automated scoring system that improves efficiency and consistency. Method: A four-stage pipeline is used: I) image pre-processing and re-orientation using ResNet50, II) hand segmentation using UNet.3, III) joint identification using YOLOv7, and IV) TSS prediction using VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, and Vision Transformer (ViT). Training used 3-fold cross-validation, and performance was validated on an external test set. Result: The joint identification model reached 99% accuracy, and the best-performing model, ViT, achieved a Huber loss of 0.87 for TSS prediction. Evaluation metrics included IoU, MAP, MAE, RMSE, and Huber loss, and the results indicate that the method handles joint disappearance and variable joint counts. Conclusion: The study proposes ARTSS, a deep-learning framework for automated radiographic Sharp scoring in rheumatoid arthritis, which reduces inter-observer variability, improves scoring accuracy, and offers a useful aid for clinical practice. Abstract: Assessing the severity of rheumatoid arthritis (RA) using the Total Sharp/Van Der Heijde Score (TSS) is crucial, but manual scoring is often time-consuming and subjective. This study introduces an Automated Radiographic Sharp Scoring (ARTSS) framework that leverages deep learning to analyze full-hand X-ray images, aiming to reduce inter- and intra-observer variability. The research uniquely accommodates patients with joint disappearance and variable-length image sequences. We developed ARTSS using data from 970 patients, structured into four stages: I) Image pre-processing and re-orientation using ResNet50, II) Hand segmentation using UNet.3, III) Joint identification using YOLOv7, and IV) TSS prediction using models such as VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, and Vision Transformer (ViT). We evaluated model performance with Intersection over Union (IoU), Mean Average Precision (MAP), mean absolute error (MAE), Root Mean Squared Error (RMSE), and Huber loss. The average TSS from two radiologists was used as the ground truth. Model training employed 3-fold cross-validation, with each fold consisting of 452 training and 227 validation samples, and external testing included 291 unseen subjects. Our joint identification model achieved 99% accuracy. The best-performing model, ViT, achieved a notably low Huber loss of 0.87 for TSS prediction. 
Our results demonstrate the potential of deep learning to automate RA scoring, which can significantly enhance clinical practice. Our approach addresses the challenge of joint disappearance and variable joint numbers, offers timesaving benefits, reduces inter- and intra-reader variability, improves radiologist accuracy, and aids rheumatologists in making more informed decisions.
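
For readers unfamiliar with the Huber loss reported above, it is quadratic for small residuals and linear for large ones, which makes it less sensitive to outlier scores than squared error. A minimal sketch of the standard definition follows. The threshold `delta` of 1.0 and the example scores are assumptions, since the abstract does not state the values used.

```python
def huber(residual, delta=1.0):
    """Standard Huber loss for a single residual."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual * residual
    return delta * (magnitude - 0.5 * delta)


def mean_huber(predictions, targets, delta=1.0):
    """Average Huber loss over paired predictions and targets."""
    losses = [huber(p - t, delta) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


# Hypothetical predicted and reference scores for three radiographs.
example = mean_huber([10.5, 20.0, 33.0], [10.0, 20.0, 30.0])
```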

[224] Matching Shapes Under Different Topologies: A Topology-Adaptive Deformation Guided Approach

Aymen Merrouche,Stefanie Wuhrer,Edmond Boyer

Main category: cs.CV

TL;DR: This paper proposes a topology-adaptive deformation model for non-rigid 3D mesh matching that handles topological artefacts and outperforms existing methods in alignment quality.

Details Motivation: The motivation stems from real-world scenarios like per-frame multi-view reconstructions, which often suffer from topological artefacts that break assumptions made by existing mesh matching approaches. Method: The method introduces a topology-adaptive deformation model that allows changes in shape topology, jointly optimizing for a template mesh and its alignment to extract correspondences. Result: The approach successfully aligns highly non-isometric shapes and shapes with topological artefacts, even outperforming methods trained on large datasets in 3D alignment quality. Conclusion: The proposed topology-adaptive deformation model effectively handles non-rigid 3D mesh matching in the presence of topological artefacts, outperforming data-driven methods in alignment quality. Abstract: Non-rigid 3D mesh matching is a critical step in computer vision and computer graphics pipelines. We tackle matching meshes that contain topological artefacts which can break the assumption made by current approaches. While Functional Maps assume the deformation induced by the ground truth correspondences to be near-isometric, ARAP-like deformation-guided approaches assume the latter to be ARAP. Neither assumption holds in certain topological configurations of the input shapes. We are motivated by real-world scenarios such as per-frame multi-view reconstructions, often suffering from topological artefacts. To this end, we propose a topology-adaptive deformation model allowing changes in shape topology to align shape pairs under ARAP and bijective association constraints. Using this model, we jointly optimise for a template mesh with adequate topology and for its alignment with the shapes to be matched to extract correspondences. 
We show that, while not relying on any data-driven prior, our approach applies to highly non-isometric shapes and shapes with topological artefacts, including noisy per-frame multi-view reconstructions, even outperforming methods trained on large datasets in 3D alignment quality.

[225] A New Hybrid Model of Generative Adversarial Network and You Only Look Once Algorithm for Automatic License-Plate Recognition

Behnoud Shafiezadeh,Amir Mashmool,Farshad Eshghi,Manoochehr Kelarestaghi

Main category: cs.CV

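TL;DR: This paper proposes a license-plate recognition method that combines Deblur-GAN with YOLOv5, deblurring in the preprocessing stage to achieve accurate, low-cost, real-time recognition.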

Details Motivation: Because license-plate recognition is highly variable, traditional methods struggle in complex environments, so deep learning techniques are needed to improve recognition accuracy and efficiency. Method: A selectively applied Deblur-GAN performs deblurring in preprocessing, combined with YOLOv5 for license-plate detection and character recognition. Result: YOLOv5 reaches 95% and 97% accuracy in the license-plate detection and character recognition stages, respectively, with a detection time of only 0.026 seconds, and Deblur-GAN improves detection accuracy by nearly 40%. Conclusion: Combining the YOLOv5 architecture with Deblur-GAN preprocessing gives an ALPR system high accuracy and fast response, making it suitable for portable applications. Abstract: Automatic License-Plate Recognition (ALPR) plays a pivotal role in Intelligent Transportation Systems (ITS) as a fundamental element of Smart Cities. However, due to its high variability, ALPR faces challenging issues more efficiently addressed by deep learning techniques. In this paper, a selective Generative Adversarial Network (GAN) is proposed for deblurring in the preprocessing step, coupled with the state-of-the-art You-Only-Look-Once (YOLO)v5 object detection architectures for License-Plate Detection (LPD), and the integrated Character Segmentation (CS) and Character Recognition (CR) steps. The selective preprocessing bypasses unnecessary and sometimes counter-productive input manipulations, while YOLOv5 LPD/CS+CR delivers high accuracy and low computing cost. As a result, YOLOv5 achieves a detection time of 0.026 seconds for both LP and CR detection stages, facilitating real-time applications with exceptionally rapid responsiveness. Moreover, the proposed model achieves accuracy rates of 95% and 97% in the LPD and CR detection phases, respectively. Furthermore, the inclusion of the Deblur-GAN pre-processor significantly improves detection accuracy by nearly 40%, especially when encountering blurred License Plates (LPs). To train and test the learning components, we generated and publicly released our blur and ALPR datasets (using Iranian license plates as a use-case), which are more representative of close-to-real-life ad-hoc situations. The findings demonstrate that employing the state-of-the-art YOLO model results in excellent overall precision and detection time, making it well-suited for portable applications. 
Additionally, integrating the Deblur-GAN model as a preliminary processing step enhances the overall effectiveness of our comprehensive model, particularly when confronted with blurred scenes captured by the camera as input.

[226] Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers

Morteza Kiani Haftlang,Mohammadhossein Malmir,Foroutan Parand,Umberto Michelucci,Safouane El Ghazouali

Main category: cs.CV

TL;DR: The paper introduces a lightweight, end-to-end architecture for real-time binary medical image segmentation by combining a Swin Transformer-like encoder and a U-Net-like decoder, using skip pathways and self-supervised pretraining with Barlow Twins for improved efficiency and accuracy.

Details Motivation: The motivation is to overcome the limitations of traditional convolutional architectures like U-Net, which have a restricted receptive field, and transformer-based models that are computationally expensive. The goal is to develop a lightweight and efficient model suitable for real-time medical image segmentation tasks, especially in resource-limited settings. Method: The paper proposes a lightweight end-to-end architecture for real-time binary medical image segmentation. It integrates a Swin Transformer-like encoder and a U-Net-like decoder connected via skip pathways. The encoder is pretrained using Barlow Twins, a self-supervised learning method, to enhance feature learning with limited labeled data. Result: Experiments show that the proposed model achieves competitive accuracy while significantly reducing parameter count and inference time. The lightweight design makes it suitable for deployment in real-time and resource-limited clinical environments. Conclusion: The paper concludes that the proposed architecture, combining a Swin Transformer-like encoder and a U-Net-like decoder with self-supervised pretraining, is a practical solution for real-time and resource-limited clinical environments due to its efficiency and competitive accuracy. Abstract: Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. 
Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower and competitively efficient. To improve the encoder's ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing unnecessary repetition in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available at Github repository: https://github.com/mkianih/Barlow-Swin.
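
The "reducing unnecessary repetition" idea behind Barlow Twins can be made concrete with its loss: the cross-correlation matrix between two views' embeddings is pushed toward the identity. The sketch below is a plain-Python version of that standard objective, not the repository's code. It assumes each view is a batch of embedding vectors, and the off-diagonal weight `lam` of 0.005 is an illustrative value.

```python
import math


def standardize(batch):
    """Normalize each feature to zero mean and unit variance over the batch."""
    n = len(batch)
    columns = []
    for column in zip(*batch):
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n) or 1.0
        columns.append([(v - mean) / std for v in column])
    return [list(row) for row in zip(*columns)]


def barlow_twins_loss(view_a, view_b, lam=0.005):
    """Sum of (1 - C_ii)^2 plus lam * C_ij^2 over off-diagonal entries."""
    za, zb = standardize(view_a), standardize(view_b)
    n, dim = len(za), len(za[0])
    loss = 0.0
    for i in range(dim):
        for j in range(dim):
            c = sum(za[k][i] * zb[k][j] for k in range(n)) / n
            loss += (1.0 - c) ** 2 if i == j else lam * c ** 2
    return loss


decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
loss_decorrelated = barlow_twins_loss(decorrelated, decorrelated)
loss_redundant = barlow_twins_loss(redundant, redundant)
```

Identical views with independent features give zero loss, while features that duplicate each other are penalized through the off-diagonal term.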

[227] Intraoperative 2D/3D Registration via Spherical Similarity Learning and Inference-Time Differentiable Levenberg-Marquardt Optimization

Minheng Chen,Youyong Kong

Main category: cs.CV

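TL;DR: This paper proposes a new non-Euclidean similarity learning method for intraoperative 2D/3D registration that improves registration accuracy and convergence speed.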

Details Motivation: Existing Euclidean similarity learning frameworks distort manifold structure and slow convergence when handling data with substantial disturbances, so a more effective way to capture complex manifold structure is needed. Method: Feature embeddings are extracted with a CNN-Transformer encoder and projected into spherical space, and their geodesic distances are approximated with Riemannian distances. At inference, fully differentiable Levenberg-Marquardt optimization is used to accelerate convergence. Result: Experiments show superior accuracy in both patient-specific and patient-agnostic scenarios. Conclusion: The non-Euclidean similarity learning approach yields a more expressive and geometrically consistent metric for intraoperative 2D/3D registration, improving accuracy and convergence speed. Abstract: Intraoperative 2D/3D registration aligns preoperative 3D volumes with real-time 2D radiographs, enabling accurate localization of instruments and implants. A recent fully differentiable similarity learning framework approximates geodesic distances on SE(3), expanding the capture range of registration and mitigating the effects of substantial disturbances, but existing Euclidean approximations distort manifold structure and slow convergence. To address these limitations, we explore similarity learning in non-Euclidean spherical feature spaces to better capture and fit complex manifold structure. We extract feature embeddings using a CNN-Transformer encoder, project them into spherical space, and approximate their geodesic distances with Riemannian distances in the bi-invariant SO(4) space. This enables a more expressive and geometrically consistent deep similarity metric, enhancing the ability to distinguish subtle pose differences. During inference, we replace gradient descent with fully differentiable Levenberg-Marquardt optimization to accelerate convergence. Experiments on real and synthetic datasets show superior accuracy in both patient-specific and patient-agnostic scenarios.
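
Levenberg-Marquardt interpolates between Gauss-Newton and gradient descent by damping the normal equations, solving (JᵀJ + μI)δ = −Jᵀr at each step. The sketch below shows the generic algorithm on a toy two-parameter curve fit, not on pose parameters or a learned similarity. The model `a * exp(b * x)`, the starting point, and the damping schedule are all assumptions chosen for illustration.

```python
import math


def residuals(params, xs, ys):
    a, b = params
    return [a * math.exp(b * x) - y for x, y in zip(xs, ys)]


def jacobian(params, xs):
    a, b = params
    return [[math.exp(b * x), a * x * math.exp(b * x)] for x in xs]


def cost(params, xs, ys):
    return sum(r * r for r in residuals(params, xs, ys))


def levenberg_marquardt(params, xs, ys, mu=1e-2, iterations=200):
    """Damped Gauss-Newton: accept a step only if it lowers the cost."""
    params = list(params)
    current = cost(params, xs, ys)
    for _ in range(iterations):
        r = residuals(params, xs, ys)
        jac = jacobian(params, xs)
        # Damped normal equations for two parameters, solved in closed form.
        h00 = sum(row[0] * row[0] for row in jac) + mu
        h01 = sum(row[0] * row[1] for row in jac)
        h11 = sum(row[1] * row[1] for row in jac) + mu
        g0 = sum(row[0] * ri for row, ri in zip(jac, r))
        g1 = sum(row[1] * ri for row, ri in zip(jac, r))
        det = h00 * h11 - h01 * h01
        step = [(-h11 * g0 + h01 * g1) / det, (h01 * g0 - h00 * g1) / det]
        trial = [params[0] + step[0], params[1] + step[1]]
        trial_cost = cost(trial, xs, ys)
        if trial_cost < current:
            params, current, mu = trial, trial_cost, mu * 0.5  # trust more
        else:
            mu *= 2.0  # step rejected: increase damping
    return params


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
fitted = levenberg_marquardt([1.0, 0.0], xs, ys)
```

The accept-or-reject rule on the damping factor is what gives the method its robustness far from the solution while retaining fast convergence close to it.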

[228] BIR-Adapter: A Low-Complexity Diffusion Model Adapter for Blind Image Restoration

Cem Eteke,Alexander Griessel,Wolfgang Kellerer,Eckehard Steinbach

Main category: cs.CV

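TL;DR: This paper proposes BIR-Adapter, a low-complexity blind image restoration method that uses a pre-trained diffusion model's own feature extraction, extends the self-attention mechanism, and adds a sampling guidance mechanism, substantially reducing computational demand while maintaining performance and showing applicability across image restoration tasks.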

Details Motivation: To address blind image restoration while substantially reducing computational complexity, with an adapter design that can be integrated into other diffusion models to broaden its use in image restoration tasks. Method: Relying on the robustness of pre-trained models, features are extracted from degraded images by the model itself and used to extend the self-attention mechanism, and a sampling guidance mechanism is introduced to reduce hallucinations. Result: Experiments on synthetic and real-world degradations show that BIR-Adapter performs competitively with or better than state-of-the-art methods at significantly lower complexity. It also extends a super-resolution-only model to perform better under additional unknown degradations. Conclusion: BIR-Adapter is a low-complexity blind image restoration adapter for diffusion models that uses the prior of pre-trained large-scale diffusion models without training any auxiliary feature extractor. Abstract: This paper introduces BIR-Adapter, a low-complexity blind image restoration adapter for diffusion models. The BIR-Adapter enables the utilization of the prior of pre-trained large-scale diffusion models on blind image restoration without training any auxiliary feature extractor. We take advantage of the robustness of pretrained models. We extract features from degraded images via the model itself and extend the self-attention mechanism with these degraded features. We introduce a sampling guidance mechanism to reduce hallucinations. We perform experiments on synthetic and real-world degradations and demonstrate that BIR-Adapter achieves competitive or better performance compared to state-of-the-art methods while having significantly lower complexity. Additionally, its adapter-based design enables integration into other diffusion models, enabling broader applications in image restoration tasks. We showcase this by extending a super-resolution-only model to perform better under additional unknown degradations.

[229] FoMo4Wheat: Toward reliable crop vision foundation models with globally curated data

Bing Han,Chen Zhu,Dong Han,Rui Yu,Songliang Cao,Jianhui Wu,Scott Chapman,Zijian Wang,Bangyou Zheng,Wei Guo,Marie Weiss,Benoit de Solan,Andreas Hund,Lukas Roth,Kirchgessner Norbert,Andrea Visioni,Yufeng Ge,Wenjuan Li,Alexis Comar,Dong Jiang,Dejun Han,Fred Baret,Yanfeng Ding,Hao Lu,Shouyang Liu

Main category: cs.CV

TL;DR: The paper presents FoMo4Wheat, a crop-specific vision foundation model pretrained on a large and diverse wheat image dataset, which demonstrates superior performance in in-field vision tasks and represents a step toward a universal crop foundation model.

Details Motivation: The motivation is to overcome the limitations of general-domain pretrained models that fail to generalize across tasks in digital agriculture due to the interaction of fine, variable canopy structures with fluctuating field conditions. Method: The paper introduces FoMo4Wheat, a crop-domain vision foundation model pretrained with self-supervision on the ImAg4Wheat dataset, which is the largest and most diverse wheat image dataset to date. Result: The FoMo4Wheat models outperformed state-of-the-art models pretrained on general-domain datasets across ten in-field vision tasks at canopy and organ levels. Conclusion: This paper concludes that crop-specific foundation models like FoMo4Wheat are valuable for reliable in-field perception and suggest a path toward a universal crop foundation model with cross-species and cross-task capabilities. Abstract: Vision-driven field monitoring is central to digital agriculture, yet models built on general-domain pretrained backbones often fail to generalize across tasks, owing to the interaction of fine, variable canopy structures with fluctuating field conditions. We present FoMo4Wheat, one of the first crop-domain vision foundation models, pretrained with self-supervision on ImAg4Wheat, the largest and most diverse wheat image dataset to date (2.5 million high-resolution images collected over a decade at 30 global sites, spanning >2,000 genotypes and >500 environmental conditions). This wheat-specific pretraining yields representations that are robust for wheat and transferable to other crops and weeds. Across ten in-field vision tasks at canopy and organ levels, FoMo4Wheat models consistently outperform state-of-the-art models pretrained on general-domain datasets. These results demonstrate the value of crop-specific foundation models for reliable in-field perception and chart a path toward a universal crop foundation model with cross-species and cross-task capabilities. 
FoMo4Wheat models and the ImAg4Wheat dataset are publicly available online: https://github.com/PheniX-Lab/FoMo4Wheat and https://huggingface.co/PheniX-Lab/FoMo4Wheat. The demonstration website is: https://fomo4wheat.phenix-lab.com/.

[230] H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers

Wenhao Li,Mengyuan Liu,Hong Liu,Pichao Wang,Shijian Lu,Nicu Sebe

Main category: cs.CV

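TL;DR: This paper proposes H$_{2}$OT, an efficient transformer-based method for video 3D human pose estimation that improves model efficiency by progressively pruning and then recovering pose tokens.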

Details Motivation: The high computational cost of video pose transformers (VPTs) makes them impractical on resource-constrained devices, so a more efficient approach to transformer-based video 3D human pose estimation is needed. Method: The H$_{2}$OT framework has two key modules, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to remove redundancy across video frames, and TRM restores detailed spatio-temporal information from the selected tokens. Result: Extensive experiments on multiple benchmark datasets show the method's effectiveness and efficiency. Conclusion: H$_{2}$OT is a general-purpose framework that makes transformer-based video 3D human pose estimation more efficient while maintaining high estimation accuracy. Abstract: Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$_{2}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2}$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. 
Code and models are available at https://github.com/NationalGAILab/HoT.
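
The prune-then-recover pattern can be sketched with the simplest possible strategies. This is not the TPM or TRM from the paper: it assumes uniform temporal sampling for pruning and linear interpolation for recovery, operates on per-frame tokens stored as lists of floats, and expects `2 <= keep <= len(tokens)`. The function names are invented for the example.

```python
def prune_tokens(tokens, keep):
    """Keep `keep` evenly spaced frame tokens and remember their indices."""
    length = len(tokens)
    indices = [(i * (length - 1)) // (keep - 1) for i in range(keep)]
    return indices, [tokens[i] for i in indices]


def recover_tokens(indices, kept, length):
    """Rebuild a full-length sequence by interpolating between kept tokens."""
    recovered = []
    segment = 0
    for t in range(length):
        # Advance to the pair of kept frames that brackets frame t.
        while segment < len(indices) - 2 and t > indices[segment + 1]:
            segment += 1
        lo, hi = indices[segment], indices[segment + 1]
        weight = (t - lo) / (hi - lo)
        recovered.append([
            (1.0 - weight) * a + weight * b
            for a, b in zip(kept[segment], kept[segment + 1])
        ])
    return recovered


# Nine frames whose two-dimensional tokens change linearly over time.
tokens = [[float(t), 2.0 * t] for t in range(9)]
indices, kept = prune_tokens(tokens, keep=3)
recovered = recover_tokens(indices, kept, len(tokens))
```

In this toy case the intermediate stage handles three tokens instead of nine, and recovery is exact because the motion is linear. Real pose sequences are not, which is why learned selection and recovery are used in practice.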