Table of Contents
cs.CL
[1] Large Language Models in the Travel Domain: An Industrial Experience
Sergio Di Meglio,Aniello Somma,Luigi Libero Lucio Starace,Fabio Scippacercola,Giancarlo Sperlì,Sergio Di Martino
Main category: cs.CL
TL;DR: This study evaluates the use of two LLMs, Mistral 7B and Mixtral 8x7B, in improving the consistency of property descriptions on a booking platform. Mixtral 8x7B provides better results but at a much higher computational cost.
Details
Motivation: The motivation stems from the challenges faced by online property booking platforms due to inconsistent or incomplete accommodation data sourced from third parties, which can lead to user frustration and market loss. Method: The research involved an evaluation of two Large Language Models (LLMs), Mistral 7B and Mixtral 8x7B, within the CALEIDOHOTELS platform. Mistral 7B was fine-tuned using QLoRA, while Mixtral 8x7B was employed with a refined system prompt. The models were compared based on completeness, precision, hallucination rate, and content length. Result: Mixtral 8x7B outperformed Mistral 7B in completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%). Mixtral also produced shorter descriptions (249 vs. 277 words on average), but its computational cost was significantly higher (50GB VRAM and $1.61/hour vs. 5GB VRAM and $0.16/hour). Conclusion: The study concludes that while Mixtral 8x7B offers superior performance in generating consistent and concise property descriptions with fewer hallucinations, it comes at a significantly higher computational cost, presenting a trade-off between quality and resource efficiency. Abstract: Online property booking platforms are widely used and rely heavily on consistent, up-to-date information about accommodation facilities, often sourced from third-party providers. However, these external data sources are frequently affected by incomplete or inconsistent details, which can frustrate users and result in a loss of market. In response to these challenges, we present an industrial case study involving the integration of Large Language Models (LLMs) into CALEIDOHOTELS, a property reservation platform developed by FERVENTO. We evaluate two well-known LLMs in this context: Mistral 7B, fine-tuned with QLoRA, and Mixtral 8x7B, utilized with a refined system prompt. 
Both models were assessed based on their ability to generate consistent and homogeneous descriptions while minimizing hallucinations. Mixtral 8x7B outperformed Mistral 7B in terms of completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), producing shorter, more concise content (249 vs. 277 words on average). However, this came at a significantly higher computational cost: 50GB VRAM and $1.61/hour versus 5GB and $0.16/hour for Mistral 7B. Our findings provide practical insights into the trade-offs between model quality and resource efficiency, offering guidance for deploying LLMs in production environments and demonstrating their effectiveness in enhancing the consistency and reliability of accommodation data.
[2] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing
Jinzhi Wang,Qingke Peng,Haozhou Li,Zeyuan Zeng,Qinfeng Song,Kaixuan Yang,Jiangbo Zhang,Yaoying Wang,Ruimeng Li,Biyi Zhou
Main category: cs.CL
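TL;DR: To address shortcomings in electric power marketing customer service, this paper proposes the ElectriQ benchmark and a knowledge augmentation method to improve the performance of LLMs.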
Details
Motivation: Current systems such as China's 95598 hotline fall short in response time, flexibility, and accuracy, while existing large language models lack the domain expertise and empathy that electric power marketing requires. Method: The paper introduces the ElectriQ benchmark, which includes a dialogue dataset covering six key service categories, and proposes a knowledge augmentation method. Result: Experiments on 13 LLMs show that smaller models such as LLama3-8B, once fine-tuned and augmented, can surpass GPT-4o in professionalism and user-friendliness. Conclusion: ElectriQ provides a comprehensive foundation for developing LLMs tailored to the needs of power marketing services. Abstract: Electric power marketing customer service plays a critical role in addressing inquiries, complaints, and service requests. However, current systems, such as China's 95598 hotline, often struggle with slow response times, inflexible procedures, and limited accuracy in domain-specific tasks. While large language models (LLMs) like GPT-4o and Claude 3 demonstrate strong general capabilities, they lack the domain expertise and empathy required in this field. To bridge this gap, we introduce ElectriQ, the first benchmark designed to evaluate and enhance LLMs in electric power marketing scenarios. ElectriQ consists of a dialogue dataset covering six key service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. We further incorporate a domain-specific knowledge base and propose a knowledge augmentation method to boost model performance. Experiments on 13 LLMs reveal that smaller models such as LLama3-8B, when fine-tuned and augmented, can surpass GPT-4o in terms of professionalism and user-friendliness. ElectriQ establishes a comprehensive foundation for developing LLMs tailored to the needs of power marketing services.
[3] A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms
Navid Yazdanjue,Morteza Rakhshaninejad,Hossein Yazdanjouei,Mohammad Sadegh Khorshidi,Mikko S. Niemela,Fang Chen,Amir H. Gandomi
Main category: cs.CL
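TL;DR: This paper proposes a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content.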
Details
Motivation: Illicit marketplaces are increasingly moving to concealed parts of the internet, such as the deep and dark web, and to platforms such as Telegram, Reddit, and Pastebin. These channels make it easier to trade illicit goods (such as drugs, weapons, and stolen credentials) anonymously. Detecting and classifying such content remains challenging because labeled data is limited, illicit language keeps evolving, and online sources are structurally heterogeneous. Method: The approach extracts semantic representations with ModernBERT (a transformer model for long documents) and combines them with manually engineered features (document structure, embedded patterns such as Bitcoin addresses, emails, and IPs, and metadata). The classification pipeline has two stages: the first uses a semi-supervised ensemble (XGBoost, Random Forest, and SVM) with entropy-based weighted voting to detect sales-related documents; the second further classifies these documents as drug, weapon, or credential sales. Result: Experiments on three datasets show that the model outperforms several baselines (including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird). It achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388. Conclusion: The paper proposes a hierarchical classification framework that effectively detects and classifies illicit marketplace content, showing strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection. Abstract: Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, finetuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes to capture specialized jargon and ambiguous linguistic patterns. In addition, we incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata, which complement language model embeddings. The classification pipeline operates in two stages. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents.
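One plausible reading of "entropy-based weighted voting" is that each ensemble member's vote counts more when its predicted class distribution is confident (low entropy). The sketch below illustrates that reading only; the function names and the `1 - normalized entropy` weighting are assumptions made for illustration, not the formula used in the paper.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a class distribution, scaled to the range [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def entropy_weighted_vote(member_probs):
    """Fuse per-model class distributions; confident (low-entropy) models count more."""
    n_classes = len(member_probs[0])
    combined = [0.0] * n_classes
    for probs in member_probs:
        weight = 1.0 - normalized_entropy(probs)  # hypothetical weighting scheme
        for k, p in enumerate(probs):
            combined[k] += weight * p
    total = sum(combined)
    if total == 0:  # every model was maximally uncertain
        return [1.0 / n_classes] * n_classes
    return [c / total for c in combined]

# Three hypothetical members (e.g. XGBoost, Random Forest, SVM) scoring one
# document as [P(sales-related), P(not sales-related)].
votes = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
fused = entropy_weighted_vote(votes)
label = max(range(len(fused)), key=fused.__getitem__)
```

In this toy input the near-uniform second vote contributes almost nothing, so the confident first model decides the fused label.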
The second stage further classifies these into drug, weapon, or credential sales. Experiments on three datasets, including our multi-source corpus, DUTA, and CoDA, show that our model outperforms several baselines, including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird. The model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, demonstrating strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection.
[4] A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models
Jinyu Liu,Xiaoying Song,Diana Zhang,Jason Thomale,Daqing He,Lingzi Hong
Main category: cs.CL
TL;DR: This paper proposes a hybrid framework combining ML models and LLMs to improve subject term prediction, resulting in more accurate and controlled outputs aligned with standard vocabularies.
Details
Motivation: Traditional ML models struggle with unseen cases in subject analysis, while LLMs tend to over-generate and hallucinate. A better approach is needed to leverage the strengths of both. Method: A hybrid framework was developed that uses ML models to guide and post-edit LLM predictions for subject analysis. The framework was tested using LCSH for book subject term prediction. Result: The hybrid framework successfully produced more controlled and vocabulary-aligned outputs compared to using LLMs alone. Conclusion: The hybrid framework combining ML models and LLMs improves subject term prediction by controlling LLM outputs and aligning them with standard vocabularies. Abstract: Providing subject access to information resources is an essential function of any library management system. Large language models (LLMs) have been widely used in classification and summarization tasks, but their capability to perform subject analysis is underexplored. Multi-label classification with traditional machine learning (ML) models has been used for subject analysis but struggles with unseen cases. LLMs offer an alternative but often over-generate and hallucinate. Therefore, we propose a hybrid framework that integrates embedding-based ML models with LLMs. This approach uses ML models to (1) predict the optimal number of LCSH labels to guide LLM predictions and (2) post-edit the predicted terms with actual LCSH terms to mitigate hallucinations. We experimented with LLMs and the hybrid framework to predict the subject terms of books using the Library of Congress Subject Headings (LCSH). Experiment results show that providing initial predictions to guide LLM generations and imposing post-edits result in more controlled and vocabulary-aligned outputs.
[5] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs
Victor Eiti Yamamoto,Hideaki Takeda
Main category: cs.CL
TL;DR: This paper proposes a new method for integrating knowledge graphs by focusing on context matching through label and triple matching, achieving strong results and introducing a new evaluation dataset.
Details
Motivation: The motivation stems from the fact that context matching in KGs is largely unexplored, even though real-world KGs vary in source, size, and information density, which current methods do not adequately address. Method: The method involves using string manipulation, fuzzy matching, and vector similarity techniques for aligning entity and predicate labels, followed by identifying mappings between triples to enhance entity-matching accuracy. Result: The approach demonstrates competitive performance compared to leading systems in the OAEI competition and supervised methods, with high accuracy across diverse test cases. Additionally, a new dataset was introduced to evaluate triple matching more comprehensively. Conclusion: The paper concludes that their proposed method for KG integration, which includes label matching and triple matching, achieves competitive performance and addresses the lack of exploration in context matching within KGs. Abstract: Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information. Their main components include schema, identity, and context. While schema and identity matching are well-established in ontology and entity matching research, context matching remains largely unexplored. This is particularly important because real-world KGs often vary significantly in source, size, and information density - factors not typically represented in the datasets on which current entity matching methods are evaluated. As a result, existing approaches may fall short in scenarios where diverse and complex contexts need to be integrated. To address this gap, we propose a novel KG integration method consisting of label matching and triple matching. We use string manipulation, fuzzy matching, and vector similarity techniques to align entity and predicate labels. Next, we identify mappings between triples that convey comparable information, using these mappings to improve entity-matching accuracy. 
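The label-matching step described above can be pictured with a small fuzzy-matching sketch. It uses only the standard-library `difflib.SequenceMatcher`; the normalization rules, the 0.8 threshold, and the function names are illustrative assumptions, and the vector-similarity component mentioned in the abstract is left out.

```python
from difflib import SequenceMatcher

def normalize(label):
    """Lowercase and unify separators so 'birth_place' and 'Birth Place' compare equal."""
    return " ".join(label.replace("_", " ").replace("-", " ").lower().split())

def match_labels(source_labels, target_labels, threshold=0.8):
    """Map each source label to its best-scoring target label at or above the threshold."""
    matches = {}
    for src in source_labels:
        best, best_score = None, threshold
        for tgt in target_labels:
            score = SequenceMatcher(None, normalize(src), normalize(tgt)).ratio()
            if score >= best_score:
                best, best_score = tgt, score
        if best is not None:
            matches[src] = best
    return matches

# Hypothetical predicate labels from two knowledge graphs.
pairs = match_labels(["birth_place", "authorOf"], ["Birth Place", "population"])
```

Labels with no sufficiently similar counterpart are simply left unmatched, which is where a later triple-matching step could supply additional evidence.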
Our approach demonstrates competitive performance compared to leading systems in the OAEI competition and against supervised methods, achieving high accuracy across diverse test cases. Additionally, we introduce a new dataset derived from the benchmark dataset to evaluate the triple-matching step more comprehensively.
[6] Theoretical Foundations and Mitigation of Hallucination in Large Language Models
Esmail Gumaan
Main category: cs.CL
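TL;DR: This paper gives a rigorous treatment of hallucination in large language models: it derives theoretical bounds on hallucination risk, surveys detection and mitigation strategies, and proposes protocols for evaluating hallucination.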
Details
Motivation: Hallucination, in which a large language model generates content that is inconsistent with the input or with reality, is a serious problem that calls for rigorous treatment and analysis. Method: The paper derives bounds on hallucination risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity) and surveys detection and mitigation strategies. Result: It distinguishes intrinsic from extrinsic hallucinations, defines a hallucination risk for models, and presents multiple approaches to detecting and mitigating hallucinations. Conclusion: The paper proposes a unified detection and mitigation workflow and outlines evaluation protocols, laying a theoretical foundation and offering practical guidance for addressing hallucination in large language models. Abstract: Hallucination in Large Language Models (LLMs) refers to the generation of content that is not faithful to the input or the real-world facts. This paper provides a rigorous treatment of hallucination in LLMs, including formal definitions and theoretical analyses. We distinguish between intrinsic and extrinsic hallucinations, and define a hallucination risk for models. We derive bounds on this risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). We then survey detection strategies for hallucinations, such as token-level uncertainty estimation, confidence calibration, and attention alignment checks. On the mitigation side, we discuss approaches including retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and the incorporation of fact-verification modules. We propose a unified detection and mitigation workflow, illustrated with a diagram, to integrate these strategies. Finally, we outline evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups to quantify and reduce hallucinations. Our work lays a theoretical foundation and practical guidelines for addressing the crucial challenge of hallucination in LLMs.
[7] Reading Between the Timelines: RAG for Answering Diachronic Questions
Kwun Hang Lau,Ruiyuan Zhang,Weijie Shi,Xiaofang Zhou,Xiaojun Cheng
Main category: cs.CL
TL;DR: This paper proposes a new framework that enhances Retrieval-Augmented Generation (RAG) systems' ability to handle longitudinal queries by infusing temporal logic, resulting in improved answer accuracy.
Details
Motivation: The motivation stems from the deficit of conventional RAG systems in handling longitudinal queries that require tracking entities and phenomena across time due to the lack of temporally coherent evidence gathering. Method: The methodology involves disentangling a user's query into its core subject and temporal window, then employing a specialized retriever that calibrates semantic matching against temporal relevance to gather contiguous evidence across the queried period. Result: Empirical results on the Analytical Diachronic Question Answering Benchmark (ADQAB) show that the proposed approach improves answer accuracy by 13% to 27% compared to standard RAG implementations. Conclusion: This paper concludes that the proposed framework for infusing temporal logic into the RAG pipeline substantially improves answer accuracy for longitudinal queries, providing a validated pathway towards RAG systems capable of nuanced, evolutionary analysis. Abstract: While Retrieval-Augmented Generation (RAG) excels at injecting static, factual knowledge into Large Language Models (LLMs), it exhibits a critical deficit in handling longitudinal queries that require tracking entities and phenomena across time. This blind spot arises because conventional, semantically-driven retrieval methods are not equipped to gather evidence that is both topically relevant and temporally coherent for a specified duration. We address this challenge by proposing a new framework that fundamentally redesigns the RAG pipeline to infuse temporal logic. Our methodology begins by disentangling a user's query into its core subject and its temporal window. It then employs a specialized retriever that calibrates semantic matching against temporal relevance, ensuring the collection of a contiguous evidence set that spans the entire queried period. 
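To make the idea of temporally coherent retrieval concrete, the sketch below selects the best-scoring document for every year in the queried window instead of taking a global top-k. The yearly bucketing, the tuple layout, and the function name are assumptions for illustration; the paper's retriever may balance semantic and temporal relevance quite differently.

```python
from datetime import date

def retrieve_contiguous(docs, start, end, per_period=1):
    """Pick the top-scoring documents for every year in [start, end].

    docs: list of (doc_id, published_date, semantic_score) tuples.
    Returns doc ids in chronological order of their year bucket.
    """
    selected = []
    for year in range(start.year, end.year + 1):
        in_year = [d for d in docs if d[1].year == year and start <= d[1] <= end]
        in_year.sort(key=lambda d: d[2], reverse=True)
        selected.extend(d[0] for d in in_year[:per_period])
    return selected

# Hypothetical corpus with precomputed semantic scores for one query subject.
docs = [
    ("a", date(2019, 3, 1), 0.91),
    ("b", date(2019, 9, 1), 0.95),
    ("c", date(2020, 5, 1), 0.40),
    ("d", date(2021, 1, 15), 0.70),
    ("e", date(2023, 1, 1), 0.99),  # outside the queried window, so ignored
]
hits = retrieve_contiguous(docs, date(2019, 1, 1), date(2021, 12, 31))
```

A purely semantic top-3 over the same corpus would return "e", "b", and "a", leaving 2020 and 2021 without evidence; the bucketed selection keeps one document per year even when its score is low.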
To enable rigorous evaluation of this capability, we also introduce the Analytical Diachronic Question Answering Benchmark (ADQAB), a challenging evaluation suite grounded in a hybrid corpus of real and synthetic financial news. Empirical results on ADQAB show that our approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. This work provides a validated pathway toward RAG systems capable of performing the nuanced, evolutionary analysis required for complex, real-world questions. The dataset and code for this study are publicly available at https://github.com/kwunhang/TA-RAG.
[8] Semantic Convergence: Investigating Shared Representations Across Scaled LLMs
Daniel Son,Sanjana Rathore,Andrew Rufail,Adrian Simon,Daniel Zhang,Soham Dave,Cole Blondin,Kevin Zhu,Sean O'Brien
Main category: cs.CL
TL;DR: Despite size differences, Gemma-2 language models share broadly similar internal features, especially in middle layers, reinforcing the idea of feature universality for cross-model interpretability.
Details
Motivation: The researchers aim to understand if models with significant size differences converge on similar internal concepts, which has implications for cross-model interpretability. Method: The study uses Sparse Autoencoder (SAE) dictionary-learning to extract features from Gemma-2 language models of different scales, aligns these features via activation correlation, and compares them using SVCCA and RSA. Result: Middle layers of the models showed strong feature overlap, while early and late layers were less similar. Multi-token subspace analysis also indicated that semantically similar features interact similarly with the models. Conclusion: Large language models, despite size differences, develop broadly similar and interpretable internal features, supporting the concept of feature universality. Abstract: We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a four-fold difference in scale still converge on comparable internal concepts. Using the Sparse Autoencoder (SAE) dictionary-learning pipeline, we utilize SAEs on each model's residual-stream activations, align the resulting monosemantic features via activation correlation, and compare the matched feature spaces with SVCCA and RSA. Middle layers yield the strongest overlap, while early and late layers show far less similarity. Preliminary experiments extend the analysis from single tokens to multi-token subspaces, showing that semantically similar subspaces interact similarly with language models. These results strengthen the case that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.
[9] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations
Qixuan Hu,Xumou Zhang,Jinman Kim,Florence Bourgeois,Adam G. Dunn
Main category: cs.CL
TL;DR: This study developed models to predict serious adverse event rates in clinical trials using registration data, achieving 77.6% AUC and 18.6% RMSE, helping improve trial design and safety monitoring.
Details
Motivation: The motivation was to improve clinical trial design by predicting safety outcomes in advance, thereby avoiding unnecessary risks to participants and reducing trial terminations. Method: Two prediction models were developed using a transfer learning approach with pretrained language models (ClinicalT5, BioBERT) combined with a sliding window method for embedding extraction, to predict SAE rates in clinical trials. Result: The best model (ClinicalT5+Transformer+MLP) achieved a 77.6% AUC for predicting which trial arm had higher SAE rates and an RMSE of 18.6% for predicting SAE proportions in the control arm. The sliding window approach improved performance across models. Conclusion: The study concluded that prediction models using trial registration data can estimate safety outcomes before trials start, improving trial design and identifying discrepancies between expected and reported results. Abstract: Objectives: With accurate estimates of expected safety results, clinical trials could be designed to avoid terminations and limit exposing participants to unnecessary risks. We evaluated methods for predicting serious adverse event (SAE) results in clinical trials using information only from their registrations prior to the trial. Material and Methods: We analysed 22,107 two-arm parallel interventional clinical trials from ClinicalTrials.gov with structured summary results. Two prediction models were developed: a classifier predicting whether the experimental arm will have higher SAE rates than the control arm (evaluated by area under the receiver operating characteristic curve; AUC), and a regression model predicting the proportion of SAEs in control arms (evaluated by root mean squared error; RMSE). A transfer learning approach using pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with a downstream model for prediction.
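A sliding window over a long registration text can be sketched as follows: split the token sequence into overlapping windows that fit the encoder's input limit, embed each window, then aggregate. The window and stride values, the `embed_fn` callable, and mean pooling as the aggregator are all assumptions for illustration; the study's own pipeline may aggregate window embeddings with a downstream model instead.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Return overlapping token windows that together cover the whole sequence."""
    if len(tokens) <= window:
        return [tokens]
    starts = list(range(0, len(tokens) - window, stride))
    starts.append(len(tokens) - window)  # final window sits flush with the end
    return [tokens[s:s + window] for s in starts]

def mean_pool(vectors):
    """Element-wise average of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def embed_long_text(tokens, embed_fn, window=512, stride=256):
    """Embed each window with embed_fn (a hypothetical encoder) and average the results."""
    return mean_pool([embed_fn(w) for w in sliding_windows(tokens, window, stride)])

# Toy stand-in for a pretrained encoder: returns [window length, first token id].
toy_embed = lambda w: [float(len(w)), float(w[0])]
tokens = list(range(10))
windows = sliding_windows(tokens, window=4, stride=2)
vector = embed_long_text(tokens, toy_embed, window=4, stride=2)
```

Overlapping windows mean that no token falls outside the encoder's context, at the cost of embedding some tokens twice.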
To maintain semantic representation in long trial texts exceeding localised language model input limits, a sliding window method was developed for embedding extraction. Results: The best model (ClinicalT5+Transformer+MLP) had 77.6% AUC predicting which trial arm has a higher proportion of patients with SAEs. When predicting the proportion of participants experiencing an SAE in the control arm, the same model achieved an RMSE of 18.6%. The sliding window approach consistently outperformed methods without it. Across 12 classifiers, the average absolute AUC increase was 2.00%; across 12 regressors, the average absolute RMSE reduction was 1.58%. Discussion: Summary results data available at ClinicalTrials.gov remains underutilised. The potential to estimate results of trials before they start is an opportunity to improve trial design and flag discrepancies between expected and reported safety results.
[10] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey
Jindong Li,Yali Fu,Jiahong Liu,Linxiao Cao,Wei Ji,Menglin Yang,Irwin King,Ming-Hsuan Yang
Main category: cs.CL
TL;DR: This paper presents the first comprehensive survey of discrete tokenization methods, particularly vector quantization techniques, tailored for large language model (LLM) applications, offering a structured taxonomy, analysis of key challenges, and insights into future research directions.
Details
Motivation: The rapid advancement of large language models has increased the need for effective mechanisms to convert continuous multimodal data into discrete representations suitable for language-based processing. However, there is a lack of comprehensive surveys systematically examining VQ techniques in the context of LLM-based systems. Method: The paper presents a structured taxonomy and analysis of discrete tokenization methods designed for LLMs, categorizing eight representative VQ variants and analyzing their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Result: The work provides a first structured taxonomy of discrete tokenization methods for LLMs, spanning classical and modern VQ paradigms, and identifies key challenges such as codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Conclusion: The survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. Abstract: The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. 
Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: https://github.com/jindongli-Ai/LLM-Discrete-Tokenization-Survey.
[11] Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers
Lee Harris
Main category: cs.CL
TL;DR: This paper proposes the LMC algorithm to improve the accuracy and reduce the cost and hallucinations associated with using language models for knowledge extraction, particularly demonstrated in extracting patient dates of birth from medical documents.
Details
Motivation: Language models are costly and prone to producing hallucinations. If the information they generate is incorrect, the resources invested in producing it are wasted. Method: Proposed, implemented, and applied the Language Model Chain (LMC) algorithm in which a language model's response is considered correct only if it exists in a collection of candidate answers. Incorrect responses are fed into a more predictive but slower language model, repeating the process until all predictions are correct. Result: Using the LMC algorithm to extract patient dates of birth from medical documents increased prediction speed and accuracy, while significantly reducing hallucinations. Conclusion: The LMC algorithm contributes to the field of knowledge extraction and should be further explored in the future. Abstract: Language models can capture complex relationships in given text, but these are notorious for being costly and for producing information that does not exist (i.e., hallucinations). Furthermore, the resources invested into producing this information would be wasted if it were incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this, a language model's response to a given prompt about given text is only correct if it exists in the collection of possible (i.e., candidate) answers, and text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated for a collection of language models, or until all predictions about the text are correct. We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. 
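The cascade described above can be sketched in a few lines: try the fastest model first, accept an answer only if it appears among the candidate answers for that text, and pass rejected texts to the next, slower model. The two stand-in "models", the regex-based candidate extraction, and all names below are hypothetical placeholders, not the paper's implementation.

```python
import re

DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def candidate_dates(text):
    """Candidate answers: every ISO-formatted date that literally appears in the text."""
    return set(DATE.findall(text))

def language_model_chain(texts, models, candidates_for):
    """Run models from fastest to slowest; accept an answer only if it is a candidate."""
    answers, pending = {}, list(texts)
    for model in models:
        still_pending = []
        for text in pending:
            answer = model(text)
            if answer in candidates_for(text):
                answers[text] = answer
            else:
                still_pending.append(text)  # escalate to the next model
        pending = still_pending
        if not pending:
            break
    return answers, pending

def fast_model(text):
    """Hypothetical cheap model that always guesses the same date."""
    return "1970-01-01"

def slow_model(text):
    """Hypothetical stronger model that reads the date following 'DOB'."""
    m = re.search(r"DOB[: ]+(\d{4}-\d{2}-\d{2})", text)
    return m.group(1) if m else ""

docs = [
    "Patient DOB: 1970-01-01, seen 2024-03-02",
    "Seen 2024-03-02. DOB: 1988-11-30",
]
answers, unresolved = language_model_chain(docs, [fast_model, slow_model], candidate_dates)
```

The candidate check also rejects any answer that is absent from the document, which is how the cascade filters out hallucinated values before they reach the output.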
We believe that the novel LMC algorithm significantly contributes to the knowledge extraction field, and that this should be explored much further in the future.
[12] Predicting stock prices with ChatGPT-annotated Reddit sentiment
Mateusz Kmak,Kamil Chmurzyński,Kamil Matejuk,Paweł Kotzbach,Jan Kocoń
Main category: cs.CL
TL;DR: The paper investigates whether social media sentiment can predict stock market movements, finding that its correlation is weak compared to simpler metrics like comment volume and search trends.
Details
Motivation: The surge of retail investor activity on social media raised questions about the influence of online sentiment on stock prices, prompting the exploration of whether sentiment derived from social media discussions can predict stock market movements. Method: The paper employs two existing text-based sentiment analysis methods and introduces a third, a ChatGPT-annotated and fine-tuned RoBERTa-based model, using correlation and causality metrics to assess predictive power. Result: The findings indicate that social media sentiment has a weak correlation with stock prices, while simpler metrics like comment volume and Google search trends show stronger predictive signals. Conclusion: The study concludes that traditional sentiment analysis may not fully capture the nuances of market-moving online discussions, and social media sentiment has only a weak correlation with stock prices. Abstract: The surge of retail investor activity on social media, exemplified by the 2021 GameStop short squeeze, raised questions about the influence of online sentiment on stock prices. This paper explores whether sentiment derived from social media discussions can meaningfully predict stock market movements. We focus on Reddit's r/wallstreetbets and analyze sentiment related to two companies: GameStop (GME) and AMC Entertainment (AMC). To assess sentiment's role, we employ two existing text-based sentiment analysis methods and introduce a third, a ChatGPT-annotated and fine-tuned RoBERTa-based model designed to better interpret the informal language and emojis prevalent in social media discussions. We use correlation and causality metrics to determine these models' predictive power. Surprisingly, our findings suggest that social media sentiment has only a weak correlation with stock prices. At the same time, simpler metrics, such as the volume of comments and Google search trends, exhibit stronger predictive signals. 
These results highlight the complexity of retail investor behavior and suggest that traditional sentiment analysis may not fully capture the nuances of market-moving online discussions.
[13] How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting
Aman Gupta,Yingying Zhuang,Zhou Yu,Ziji Zhang,Anurag Beniwal
Main category: cs.CL
TL;DR: This paper shows that optimizing prompt translation strategies can significantly improve the performance of multilingual systems, especially for low-resource languages, by enhancing knowledge sharing across languages in RAG-enhanced LLMs.
Details
Motivation: Despite the progress in multilingual LLMs, performance varies across languages and tasks. Knowledge bases are often shared from high-resource to low-resource languages, creating multilingual contexts and highlighting the need to understand the impact of prompt translation strategies. Method: The paper systematically evaluates different prompt translation strategies for classification tasks using retrieval-augmented generation (RAG) in multilingual large language models (LLMs). Result: Experimental results show that an optimized prompting strategy significantly improves knowledge sharing across languages, enhancing performance on downstream classification tasks. Conclusion: The study concludes that optimizing cross-lingual prompt strategies can significantly enhance knowledge sharing across languages, particularly benefiting low-resource languages in multilingual RAG-enhanced LLM systems. Abstract: Despite advances in the multilingual capabilities of Large Language Models (LLMs), their performance varies substantially across different languages and tasks. In multilingual retrieval-augmented generation (RAG)-based systems, knowledge bases (KB) are often shared from high-resource languages (such as English) to low-resource ones, resulting in retrieved information from the KB being in a different language than the rest of the context. In such scenarios, two common practices are pre-translation to create a mono-lingual prompt and cross-lingual prompting for direct inference. However, the impact of these choices remains unclear. In this paper, we systematically evaluate the impact of different prompt translation strategies for classification tasks with RAG-enhanced LLMs in multilingual systems. Experimental results show that an optimized prompting strategy can significantly improve knowledge sharing across languages and thereby improve performance on the downstream classification task.
The findings advocate for a broader utilization of multilingual resource sharing and cross-lingual prompt optimization for non-English languages, especially the low-resource ones.[14] Using Sentiment Analysis to Investigate Peer Feedback by Native and Non-Native English Speakers
Brittney Exline,Melanie Duffin,Brittany Harbison,Chrissa da Gomez,David Joyner
Main category: cs.CL
TL;DR: This study analyzes how students' language background affects their peer feedback experience in online U.S.-based computing courses.
Details
Motivation: The paper examines how native versus non-native English speaker status affects three metrics of peer feedback experience in online U.S.-based computing courses. Method: A Twitter-roBERTa-based model is used to analyze the sentiment of peer reviews for a sample of 500 students, and sentiment scores and peer feedback ratings are then related to the students' language background. Result: Native English speakers rate feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. Conclusion: Language background plays a modest but complex role in shaping peer feedback experiences, with significant interactions emerging when controlling for sex and age. Abstract: Graduate-level CS programs in the U.S. increasingly enroll international students, with 60.2 percent of master's degrees in 2023 awarded to non-U.S. students. Many of these students take online courses, where peer feedback is used to engage students and improve pedagogy in a scalable manner. Since these courses are conducted in English, many students study in a language other than their first. This paper examines how native versus non-native English speaker status affects three metrics of peer feedback experience in online U.S.-based computing courses. Using the Twitter-roBERTa-based model, we analyze the sentiment of peer reviews written by and to a random sample of 500 students. We then relate sentiment scores and peer feedback ratings to students' language background. Results show that native English speakers rate feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. When controlling for sex and age, significant interactions emerge, suggesting that language background plays a modest but complex role in shaping peer feedback experiences.
[15] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents
Haoran Sun,Shaoning Zeng
Main category: cs.CL
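TL;DR: This paper proposes a Hierarchical Memory (H-MEM) architecture that organizes memory by multi-level semantic abstraction and retrieves it efficiently, significantly improving LLM agents' performance in long-term dialogue.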
Details
Motivation: Existing memory storage and retrieval mechanisms fall short in structured memory organization and efficient retrieval, which limits the reasoning capabilities of LLM agents. Method: The paper proposes a Hierarchical Memory (H-MEM) architecture that organizes and updates memory across multiple levels of semantic abstraction and uses an index-based routing mechanism for efficient retrieval. Result: On five task settings from the LoCoMo dataset, the proposed method outperforms the baseline methods, supporting its effectiveness. Conclusion: The H-MEM architecture improves LLM agents' performance in long-term dialogue scenarios and outperforms five baseline methods. Abstract: Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of a graph, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
[16] Multi-Relation Extraction in Entity Pairs using Global Context
Nilesh,Atul Gupta,Avinash C Panday
Main category: cs.CL
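TL;DR: This paper proposes a new document-level relation extraction method that captures entities' global relationships across the document and supports multi-sentence reasoning, improving relation prediction accuracy; it is validated on several benchmark datasets.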
Details
Motivation: Document-level relation extraction requires a global context spanning all relevant sentences, whereas previous approaches focus only on the sentences where entities are mentioned and so fail to capture the complete document context. Method: An input encoding approach that leverages global relationships and multi-sentence reasoning by representing entities as standalone segments, independent of their positions within the document. Result: Experiments on three benchmark datasets (DocRED, Re-DocRED, and REBEL) show that the method accurately predicts relationships between entities in a document-level setting. Conclusion: The paper introduces a novel input embedding approach for document-level relation extraction that captures entities' global relationships and multi-sentence reasoning to accurately predict relationships between entities. Abstract: In document-level relation extraction, entities may appear multiple times in a document, and their relationships can shift from one context to another. Accurate prediction of the relationship between two entities across an entire document requires building a global context spanning all relevant sentences. Previous approaches have focused only on the sentences where entities are mentioned, which fails to capture the complete document context necessary for accurate relation extraction. Therefore, this paper introduces a novel input embedding approach to capture the positions of mentioned entities throughout the document rather than focusing solely on the span where they appear. The proposed input encoding approach leverages global relationships and multi-sentence reasoning by representing entities as standalone segments, independent of their positions within the document. The performance of the proposed method has been tested on three benchmark relation extraction datasets, namely DocRED, Re-DocRED, and REBEL. The experimental results demonstrated that the proposed method accurately predicts relationships between entities in a document-level setting. The proposed research also has theoretical and practical implications. Theoretically, it advances global context modeling and multi-sentence reasoning in document-level relation extraction. Practically, it enhances relationship detection, enabling improved performance in real-world NLP applications requiring comprehensive entity-level insights and interpretability.
[17] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation
Zhehao Tan,Yihan Jiao,Dan Yang,Lei Liu,Jie Feng,Duolin Sun,Yue Shen,Jian Wang,Peng Wei,Jinjie Gu
Main category: cs.CL
TL;DR: This paper introduces a detailed benchmark for evaluating LLMs' specific capabilities within RAG systems, highlighting their limitations and offering a framework for improvement.
Details
Motivation: The motivation is to address the lack of systematic and granular evaluation frameworks for document utilization in RAG systems. While existing benchmarks focus on overall system performance and noise robustness, they do not thoroughly assess LLM-specific capabilities. Method: The paper proposes a multi-level, fine-grained benchmark called Placeholder-RAG-Benchmark, which evaluates LLM-specific capabilities in RAG systems across dimensions like multi-level filtering abilities, combination abilities, and reference reasoning. It uses a placeholder-based approach to decouple the contributions of parametric and external knowledge. Result: Experiments reveal limitations in representative LLMs regarding generation capabilities within RAG systems, especially in terms of error resilience and context faithfulness. The proposed benchmark offers a reproducible framework for improving and evaluating RAG systems. Conclusion: The paper concludes that there are limitations in current LLMs' capabilities within RAG systems, particularly in error resilience and context faithfulness, and that the proposed benchmark can provide a framework for developing more reliable and efficient RAG systems. Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM's parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system's generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available at https://github.com/Alipay-Med/PRGB.
[18] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
Xi Chen,Aske Plaat,Niki van Stein
Main category: cs.CL
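TL;DR: Chain-of-thought (CoT) prompting can induce more interpretable internal structure in large language models, with the effect most pronounced in high-capacity models.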
Details
Motivation: Although chain-of-thought (CoT) prompting improves accuracy on multi-step tasks, it remains unclear whether the generated "thoughts" reflect the model's true internal reasoning process. Method: Combining sparse autoencoders with activation patching, the authors run a feature-level causal study of models solving GSM8K math problems under CoT and noCoT prompting. Result: In the 2.8B model, swapping a small set of CoT-reasoning features into a noCoT run significantly raises answer log-probabilities, while the 70M model shows no reliable effect. CoT also increases feature interpretability and activation sparsity in the larger model, and the model's confidence in generating correct answers rises from 1.2 to 4.3. Conclusion: CoT prompting can induce more interpretable internal structures in large language models, most clearly in high-capacity models, which supports its role as a structured prompting method. Abstract: Chain-of-thought (CoT) prompting boosts Large Language Models' accuracy on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
[19] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow
Xiaoyu Pan,Yang Bai,Ke Zou,Yang Zhou,Jun Zhou,Huazhu Fu,Yih-Chung Tham,Yong Liu
Main category: cs.CL
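TL;DR: EH-Benchmark is a new ophthalmology benchmark for evaluating and mitigating hallucinations in Medical Large Language Models (MLLMs), paired with a three-phase framework that improves diagnostic accuracy and reliability.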
Details
Motivation: Existing medical benchmarks fail to effectively evaluate the various types of hallucinations or to provide actionable solutions for mitigating them. Method: The paper proposes an agent-centric, three-phase framework comprising a Knowledge-Level Retrieval stage, a Task-Level Case Studies stage, and a Result-Level Validation stage. Result: EH-Benchmark can effectively evaluate and mitigate MLLM hallucinations in ophthalmic diagnosis. Conclusion: The EH-Benchmark project significantly reduces MLLM hallucinations, improving diagnostic accuracy, interpretability, and reliability. Abstract: Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.
[20] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection
Shalini Jangra,Suparna De,Nishanth Sastry,Saeed Fadaei
Main category: cs.CL
TL;DR: This paper proposes a method for generating synthetic PII-revealing data to foster reproducible research into online privacy risks.
Details
Motivation: The lack of open-source labeled datasets hampers research into PII identification. Method: Synthetic data is generated with large language models, using sequential instruction prompting to resemble the original Reddit posts. Result: The authors create a taxonomy of 19 PII-revealing categories and a synthetic dataset, and evaluate the utility of the synthetic data. Conclusion: The synthetic dataset can support reproducible research into PII identification while keeping the data unlinkable to the original users. Abstract: Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users' Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.
[21] Enhancing RAG Efficiency with Adaptive Context Compression
Shuyu Guo,Zhaochun Ren
Main category: cs.CL
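TL;DR: ACC-RAG is an adaptive context compression framework that dynamically adjusts the compression rate based on input complexity, improving the inference efficiency of retrieval-augmented generation systems.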
Details
Motivation: Existing context compression methods use fixed compression rates, which tend to over-compress simple queries and under-compress complex ones, hurting inference efficiency and accuracy. Method: The paper proposes Adaptive Context Compression for RAG (ACC-RAG), which dynamically adjusts the compression rate based on input complexity, using a hierarchical compressor for multi-granular embeddings together with a context selector that retains minimal sufficient information. Result: Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and achieves over 4 times faster inference. Conclusion: By dynamically adjusting the compression rate and combining a hierarchical compressor with a context selector, ACC-RAG achieves over 4 times faster inference than standard RAG while maintaining or improving accuracy. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and matches/unlocks over 4 times faster inference versus standard RAG while maintaining or improving accuracy.
[22] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification
Baptiste Lefort,Eric Benhamou,Beatrice Guez,Jean-Jacques Ohana,Ethan Setrouk,Alban Etienne
Main category: cs.CL
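TL;DR: This paper develops a new portfolio optimization method that uses lightweight large language models and deep reinforcement learning to combine financial news sentiment with market data, outperforming benchmark models.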
Details
Motivation: To better combine sentiment signals from financial news with traditional market indicators and so improve the effectiveness and stability of portfolio optimization. Method: The paper uses a three-tier architecture of base RL agents, meta-agents, and a super-agent, which respectively process hybrid data, aggregate decisions, and make the final decision by combining market data with sentiment analysis. Result: Evaluated on data from 2018 to 2024, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Conclusion: The paper presents a novel hierarchical portfolio optimization framework based on lightweight large language models and deep reinforcement learning that outperforms benchmark models. Abstract: This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
[23] Augmented Vision-Language Models: A Systematic Review
Anthony C Davis,Burhan Sadiq,Tianmin Shu,Chien-Ming Huang
Main category: cs.CL
TL;DR: The paper reviews methods to improve visual-language understanding through integration with external symbolic systems, addressing the limitations of current machine learning models.
Details
Motivation: Visual-language machine learning models have limitations in interpretability, adaptability to new information, resource requirements, and logical reasoning. Neural symbolic systems offer a promising solution by integrating neural networks with external symbolic systems. Method: Systematic literature review to categorize techniques for improving visual-language understanding through interaction with external symbolic information systems. Result: Neural symbolic systems can provide more interpretable explanations, assimilate new information without extensive retraining, and improve reasoning and memory abilities. Conclusion: This systematic literature review aims to explore techniques that enhance visual-language understanding by interacting with external symbolic information systems, thus realizing the benefits of integrating neural networks with symbolic systems. Abstract: Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations to their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. 
This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.
[24] Deep Learning Approaches for Multimodal Intent Recognition: A Survey
Jingwei Zhao,Yuhua Wen,Qifei Li,Minchi Hu,Yingying Zhou,Jingyao Xue,Junyang Wu,Yingming Gao,Zhengqi Wen,Jianhua Tao,Ya Li
Main category: cs.CL
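TL;DR: This survey traces intent recognition from traditional natural language processing to multimodal deep learning, covering methods, datasets, applications, and challenges.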
Details
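Motivation: With growing demand for natural human-computer interaction, intent recognition has evolved through deep learning and multimodal approaches that incorporate data from audio, vision, and physiological signals. Method: The article surveys deep learning methods for intent recognition, covering traditional natural language processing, multimodal approaches, datasets, and applications. Result: The article gives a comprehensive overview of deep learning methods for intent recognition, highlights the breakthroughs brought by Transformer-based models, and discusses current challenges and future research directions. Conclusion: The article discusses the shift from unimodal to multimodal techniques in intent recognition, covering relevant datasets, methodologies, applications, and current challenges, and offers researchers insights into the latest developments in multimodal intent recognition (MIR) and directions for future research. Abstract: Intent recognition aims to identify users' underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.
[25] Trusted Knowledge Extraction for Operations and Maintenance Intelligence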
Kathleen Mealey,Jonathan A. Karr Jr.,Priscila Saboia Moreira,Paul R. Brenner,Charles F. Vardeman II
Main category: cs.CL
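TL;DR: This paper examines how to construct knowledge graphs and evaluates the zero-shot performance of NLP and LLM tools for an aircraft operations and maintenance intelligence use case, revealing challenges and limitations in trusted environments.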
Details
Motivation: Deriving operational intelligence from organizational data repositories is a key challenge because of the dichotomy between data confidentiality and data integration objectives, and because of the limitations of Natural Language Processing (NLP) tools on the knowledge structure of specific domains such as operations and maintenance. Method: The authors break the knowledge extraction process down into Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction components, and evaluate sixteen NLP tools alongside the capabilities of Large Language Models (LLMs). Result: The authors observe significant performance limitations and discuss the zero-shot performance of NLP and LLM tools that can run in a controlled, confidential environment. Conclusion: The authors summarize the challenges of using trusted NLP and LLM tools in mission-critical industries such as aviation, offer recommendations to enhance trust, and provide an open-source curated dataset to support further baseline testing and evaluation. Abstract: Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.
[26] Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis
Md Talha Mohsin
Main category: cs.CL
TL;DR: This study compares the performance of five leading LLMs (GPT, Claude, Perplexity, Gemini, and DeepSeek) on financial analysis tasks using 10-K filings, revealing that GPT is the most consistent and reliable, while Gemini and DeepSeek show greater variability.
Details
Motivation: The rapid advancement and growing influence of Large Language Models (LLMs) in financial analysis necessitate systematic comparisons among widely used models to understand their performance and reliability in domain-specific tasks. Method: A comparative evaluation of five LLMs (GPT, Claude, Perplexity, Gemini, and DeepSeek) was conducted using domain-specific prompts on 10-K filings from the 'Magnificent Seven' technology companies. Performance was assessed through human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). Result: GPT outperforms other models in coherence, semantic alignment, and contextual relevance. Claude and Perplexity follow with relatively good performance, while Gemini and DeepSeek exhibit more variability and less agreement. Model outputs are sensitive to prompt design and source material, with performance varying across companies and over time. Conclusion: GPT performs best among the evaluated LLMs in coherence, semantic alignment, and contextual relevance, followed by Claude and Perplexity, while Gemini and DeepSeek show higher variability and lower agreement. Output similarity and stability are sensitive to prompt design and source material. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the 'Magnificent Seven' technology companies. 
We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers; followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, have more variability and less agreement. Also, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.
[27] CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering
Jinkun Zhao,Yuanshuai Wang,Xingjian Zhang,Ruibo Chen,Xingchuang Liao,Junle Wang,Lei Huang,Kui Zhang,Wenjun Wu
Main category: cs.CL
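TL;DR: This paper proposes CoE-Ops, a new collaboration-of-experts framework that combines a large language model with a retrieval-augmented generation mechanism to improve efficiency and accuracy on AIOps tasks.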
Details
Motivation: A single model is constrained by domain-specific knowledge and can only handle a specific task, whereas combining multiple models can achieve more efficient results; the paper therefore proposes the CoE-Ops framework to address these challenges in AIOps. Method: The paper proposes a collaboration-of-experts framework (CoE-Ops) that combines a general-purpose large language model task classifier with a retrieval-augmented generation mechanism to handle both high-level and low-level AIOps tasks. Result: CoE-Ops improves routing accuracy for high-level AIOps tasks by 72% over existing CoE methods, improves accuracy by up to 8% over single AIOps models in DevOps problem resolution, and outperforms larger-scale MoE models by up to 14% in accuracy. Conclusion: By combining multiple expert models with a retrieval-augmented generation mechanism, CoE-Ops achieves higher accuracy and efficiency in the AIOps domain than existing single models and larger-scale MoE models. Abstract: With the rapid evolution of artificial intelligence, AIOps has emerged as a prominent paradigm in DevOps. Much work has been proposed to improve the performance of different AIOps phases. However, constrained by domain-specific knowledge, a single model can only handle the operation requirement of a specific task, such as log parsing or root cause analysis. Meanwhile, combining multiple models can achieve more efficient results, as has been shown in both previous ensemble learning and the recent LLM training domain. Inspired by these works, to address the similar challenges in AIOps, this paper first proposes a collaboration-of-experts framework (CoE-Ops) incorporating a general-purpose large language model task classifier. A retrieval-augmented generation mechanism is introduced to improve the framework's capability in handling Question-Answering tasks at both the high level (code, build, test, etc.) and the low level (fault analysis, anomaly detection, etc.). Finally, the proposed method is implemented in the AIOps domain, and extensive experiments are conducted on the DevOps-EVAL dataset. Experimental results demonstrate that CoE-Ops achieves a 72% improvement in routing accuracy for high-level AIOps tasks compared to existing CoE methods, delivers up to 8% accuracy enhancement over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy.
[28] A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents
Sumit Soman,H. G. Ranjani,Sujoy Roychowdhury,Venkata Dharma Surya Narayana Sastry,Akshat Jain,Pranav Gangrade,Ayaaz Khan
Main category: cs.CL
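TL;DR: This paper combines a text-based RAG system with graph representations of flowcharts generated by visual large language models (VLMs) to answer questions over technical documents in the telecom domain.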
Details
Motivation: Conventional text-based Retrieval Augmented Generation (RAG) systems cannot answer questions that require information from figures in technical documents, so the paper addresses this by introducing graph representations of those figures. Method: The approach processes technical documents, classifies image types, builds graph representations, and incorporates them into the text embedding pipeline for efficient retrieval. A QA dataset based on proprietary telecom product information documents is also created as a benchmark. Result: Graph representations obtained with a fine-tuned VLM have lower edit distance with respect to the ground truth, indicating robustness on flowchart images. QA using these representations also achieves good retrieval performance with text-based embedding models. Conclusion: The approach improves the QA performance of text-based RAG systems on questions involving figures and does not require a VLM at inference, an important cost benefit for deployed QA systems. Abstract: Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. We benchmark the same on a QA dataset created based on proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM model have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.
[29] PARROT: An Open Multilingual Radiology Reports Dataset
Bastien Le Guellec,Kokou Adambounou,Lisa C Adams,Thibault Agripnidis,Sung Soo Ahn,Radhia Ait Chalal,Tugba Akinci D Antonoli,Philippe Amouyel,Henrik Andersson,Raphael Bentegeac,Claudio Benzoni,Antonino Andrea Blandino,Felix Busch,Elif Can,Riccardo Cau,Armando Ugo Cavallo,Christelle Chavihot,Erwin Chiquete,Renato Cuocolo,Eugen Divjak,Gordana Ivanac,Barbara Dziadkowiec Macek,Armel Elogne,Salvatore Claudio Fanni,Carlos Ferrarotti,Claudia Fossataro,Federica Fossataro,Katarzyna Fulek,Michal Fulek,Pawel Gac,Martyna Gachowska,Ignacio Garcia Juarez,Marco Gatti,Natalia Gorelik,Alexia Maria Goulianou,Aghiles Hamroun,Nicolas Herinirina,Krzysztof Kraik,Dominik Krupka,Quentin Holay,Felipe Kitamura,Michail E Klontzas,Anna Kompanowska,Rafal Kompanowski,Alexandre Lefevre,Tristan Lemke,Maximilian Lindholz,Lukas Muller,Piotr Macek,Marcus Makowski,Luigi Mannacio,Aymen Meddeb,Antonio Natale,Beatrice Nguema Edzang,Adriana Ojeda,Yae Won Park,Federica Piccione,Andrea Ponsiglione,Malgorzata Poreba,Rafal Poreba,Philipp Prucker,Jean Pierre Pruvo,Rosa Alba Pugliesi,Feno Hasina Rabemanorintsoa,Vasileios Rafailidis,Katarzyna Resler,Jan Rotkegel,Luca Saba,Ezann Siebert,Arnaldo Stanzione,Ali Fuat Tekin,Liz Toapanta Yanchapaxi,Matthaios Triantafyllou,Ekaterini Tsaoulia,Evangelia Vassalou,Federica Vernuccio,Johan Wasselius,Weilang Wang,Szymon Urban,Adrian Wlodarczak,Szymon Wlodarczak,Andrzej Wysocki,Lina Xu,Tomasz Zatonski,Shuhang Zhang,Sebastian Ziegelmayer,Gregory Kuchcinski,Keno K Bressem
Main category: cs.CL
TL;DR: The study presents PARROT, a large, multicenter, multilingual dataset of fictional radiology reports designed for testing natural language processing applications, with results showing its potential for cross-linguistic and cross-clinical validation.
Details
Motivation: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Method: Radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants assessing whether reports were human-authored or AI-generated. Result: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities and anatomical regions, with chest, abdomen, head, and pelvis being most prevalent. In the differentiation study, participants achieved 53.9% accuracy in distinguishing between human and AI-generated reports, with radiologists performing significantly better than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints. Abstract: Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. 
All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.
[30] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
Rui Jiao,Yue Zhang,Jinku Li
Main category: cs.CL
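TL;DR: The RELIANCE framework significantly improves the factual accuracy of large language models' reasoning through a fact-checking classifier, a reinforcement learning approach, and a mechanistic interpretability module.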
Details
Motivation: Large language models make factual errors in intermediate reasoning steps even when the final answer is correct, which can mislead users into dangerous decisions in high-stakes domains. Method: The RELIANCE framework has three core components: (1) a specialized fact-checking classifier; (2) a reinforcement learning approach that balances factuality, coherence, and structural correctness; and (3) a mechanistic interpretability module that analyzes model activations during reasoning. Result: Evaluation of ten state-of-the-art models shows that even leading models such as Claude-3.7 and GPT-o1 reach reasoning factual accuracy of only 81.93% and 82.57%, respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on several challenging benchmarks. Conclusion: RELIANCE is a new framework that addresses factual errors in LLM reasoning, significantly improves factual robustness, and lays the groundwork for future training methods that explicitly target factual robustness through activation-guided optimization. Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. 
Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
[31] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology
Paul Minchella,Loïc Verlingue,Stéphane Chrétien,Rémi Vaucher,Guillaume Metzler
Main category: cs.CL
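TL;DR: SigBERT is a temporal survival analysis framework that combines sentence embeddings of timestamped clinical reports with rough path signature features and a LASSO-penalized Cox model, reaching a C-index of 0.75 on an independent oncology test cohort.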
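As a rough illustration of the signature step this entry refers to, the sketch below computes the first two signature levels of a piecewise-linear path, such as a patient's time-ordered sequence of sentence-embedding coordinates. The function name `signature_level2`, the truncation at depth 2, and the toy path are assumptions made for illustration; the abstract does not state the depth or implementation the authors use.

```python
def signature_level2(path):
    """Return (level1, level2) signature terms of a piecewise-linear path.

    path: list of points, each a list of d floats, in time order.
    level1[i] is the total increment of coordinate i.
    level2[i][j] is the iterated integral of coordinate i against j.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(path, path[1:]):
        dx = [c - p for p, c in zip(prev, cur)]
        # Chen's identity for appending one linear segment:
        # new S2 = old S2 + (old S1) x dx + 0.5 * dx x dx
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            s1[i] += dx[i]
    return s1, s2


# Hypothetical 2-D path: move right one unit, then up one unit.
level1, level2 = signature_level2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(level1)  # [1.0, 1.0]
print(level2)  # [[0.5, 1.0], [0.0, 0.5]]
```

The off-diagonal terms differ because the signature is sensitive to the order in which coordinates move, which is what lets such features encode temporal dynamics. In a pipeline like the one described, the flattened terms would serve as covariates for a penalized Cox model; that step is not shown here.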
Details
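Motivation: Existing survival analysis methods often struggle with the complexity of textual data, particularly in sequential form, even though electronic medical reports contain information that can be leveraged for machine learning in healthcare. Method: SigBERT averages word embeddings of timestamped medical reports into sentence embeddings, applies signature extraction from rough path theory to the resulting time series to derive geometric features for each patient, and integrates these features into a LASSO-penalized Cox model to estimate patient-specific risk scores. Result: Trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, the model achieved a C-index of 0.75 (sd 0.014) on the independent test cohort. Conclusion: SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis. Abstract: Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
[32] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies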
Shirley V Wang,Georg Hahn,Sushama Kattinakere Sreedhara,Mufaddal Mahesri,Haritha S. Pillai,Rajendra Aldis,Joyce Lii,Sarah K. Dutcher,Rhoda Eniafe,Jamal T. Jones,Keewan Kim,Jiwei He,Hana Lee,Sengwee Toh,Rishi J Desai,Jie Yang
Main category: cs.CL
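TL;DR: Introducing natural language processing and a multi-wave adaptive sampling approach expedites the validation of health outcome algorithms in claims databases, reducing time and resource use.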
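To make the idea of a pre-defined stopping rule concrete, here is a simplified sketch: charts are reviewed in fixed-size waves, and review stops once the confidence interval for the algorithm's positive predictive value is narrow enough. The wave size, the 0.05 half-width threshold, the Wilson interval, and the names `wilson_interval` and `multi_wave_review` are all assumptions for illustration; this is not the authors' sampling design, which the abstract does not specify.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def multi_wave_review(charts, review, wave_size=50, max_half_width=0.05):
    """Review charts wave by wave; stop once the PPV estimate is precise enough.

    charts: algorithm-flagged charts, already in random order.
    review: callable returning True when a chart confirms the outcome.
    Returns (charts_reviewed, ppv_estimate, (ci_low, ci_high)).
    """
    if not charts:
        raise ValueError("no charts to review")
    reviewed = confirmed = 0
    for start in range(0, len(charts), wave_size):
        wave = charts[start:start + wave_size]
        confirmed += sum(1 for chart in wave if review(chart))
        reviewed += len(wave)
        low, high = wilson_interval(confirmed, reviewed)
        if (high - low) / 2 <= max_half_width:
            break  # stopping rule met: remaining charts need no review
    return reviewed, confirmed / reviewed, (low, high)


# Hypothetical data: 2,000 flagged charts, 80% of which are true cases.
charts = [i % 5 != 0 for i in range(2000)]
reviewed, ppv, ci = multi_wave_review(charts, review=lambda chart: chart)
print(reviewed, round(ppv, 3), ci)
```

Under these toy assumptions only a fraction of the charts is reviewed before the rule fires, which is the mechanism behind the savings the entry reports.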
Details
Motivation: Creating reference-standard labels through manual review of free-text notes in linked electronic health records typically requires extensive time and resources. Method: The paper describes an expedited validation process that uses natural language processing (NLP) to reduce the time human reviewers spend on each chart, and a multi-wave adaptive sampling approach that stops the validation study once sufficient precision is reached. Result: Empirically, the NLP-assisted annotation process reduced review time per chart by 40%, and the pre-defined stopping rule with multi-wave sampling would have avoided review of 77% of patient charts with limited compromise to precision in the derived measurement characteristics. Conclusion: The approach could facilitate routine validation of algorithms for key parameters in database studies, improving understanding of the reliability of study findings. Abstract: Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. 
Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
[33] Opacity as Authority: Arbitrariness and the Preclusion of Contestation
Naomi Omeonga wa Kayembe
Main category: cs.CL
TL;DR: This paper redefines arbitrariness as a foundational, functional mechanism in human systems rather than a flaw or symptom of domination. It extends Ferdinand de Saussure's concept of arbitrariness in language to broader domains like law and social dynamics, introducing the "Motivation -> Constatability -> Contestability" chain. By formalizing arbitrariness using Shannon's entropy model as A = H(L|M), the paper proposes a modern theory of arbitrariness as a neutral operator in control and care within interpersonal relations, with implications for understanding advanced artificial intelligence systems.
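For readers unfamiliar with the quantity behind A = H(L|M), the sketch below computes a conditional entropy from a joint distribution. Reading M as a stated motivation and L as the resulting act, along with the two toy distributions and the function name `conditional_entropy`, are assumptions made purely for illustration; the entry does not define the symbols in more detail.

```python
import math


def conditional_entropy(joint):
    """H(L|M) in bits, given a joint distribution {(m, l): probability}."""
    # Marginal p(m), obtained by summing the joint over l.
    p_m = {}
    for (m, _l), p in joint.items():
        p_m[m] = p_m.get(m, 0.0) + p
    h = 0.0
    for (m, _l), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / p_m[m])  # p(l|m) = p(m, l) / p(m)
    return h


# Hypothetical case 1: the motivation fully determines the act.
transparent = {("m1", "l1"): 0.5, ("m2", "l2"): 0.5}
# Hypothetical case 2: the motivation carries no information about the act.
opaque = {
    ("m1", "l1"): 0.25, ("m1", "l2"): 0.25,
    ("m2", "l1"): 0.25, ("m2", "l2"): 0.25,
}

print(conditional_entropy(transparent))  # 0.0
print(conditional_entropy(opaque))       # 1.0
```

Under this reading, a value of 0 bits means the act can be fully reconstructed from its motivation, while larger values mean more of the act remains unexplained by it.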
Details
Motivation: The paper is motivated by a desire to redefine arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. It aims to diverge from critical traditions that conflate arbitrariness with injustice and instead present it as a semiotic trait that enables systems to operate effectively. Method: Building on Ferdinand de Saussure's concept of l'arbitraire du signe and drawing on Shannon's entropy model, the paper extends the principle of arbitrariness beyond language to law and social dynamics. It introduces the "Motivation -> Constatability -> Contestability" chain and explores mechanisms like "immotivization" and "Conflict Lateralization" to demonstrate how acts can produce binding effects without exposing their rationale. Result: The result of the paper is the formalization of arbitrariness as A = H(L|M) (conditional entropy), positioning it as a neutral operator crucial to control and care in social systems. The paper successfully extends the concept of arbitrariness beyond language to demonstrate its applicability in law and social dynamics, while also suggesting its relevance to artificial intelligence explainability. Conclusion: The paper concludes that arbitrariness, formalized as A = H(L|M), is a neutral operator central to control and care in interpersonal relations and social systems. It proposes a modern theory of arbitrariness that not only explains structural opacity in human systems but also illuminates new pathways for analyzing explainability in artificial intelligence systems. Abstract: This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. 
Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure's concept of l'arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the "Motivation -> Constatability -> Contestability" chain, arguing that motivation functions as a crucial interface rendering an act's logic vulnerable to intersubjective contestation. When this chain is broken through mechanisms like "immotivization" or "Conflict Lateralization" (exemplified by "the blur of the wolf drowned in the fish"), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon's entropy model, the paper formalizes arbitrariness as A = H(L|M) (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems.
[34] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
Chengqian Ma,Wei Tao,Yiwen Guo
Main category: cs.CL
TL;DR: This paper presents a benchmark dataset for Spoken Dialogue Models to address research gaps in understanding their practical effectiveness in emulating human conversations.
Details
Motivation: There is a research gap in understanding the practical effectiveness of Spoken Dialogue Models (SDMs) compared to text-based Large Language Models (LLMs), especially given the complexity of spoken dialogue involving ambiguity and context-dependency. Method: The paper presents a benchmark dataset comprising 1,079 instances in English and Chinese, along with an LLM-based evaluation method that aligns with human judgment. Result: A benchmark dataset with 1,079 instances in English and Chinese was developed, along with an LLM-based evaluation method that closely aligns with human judgment. Conclusion: The paper concludes that SDMs face significant challenges in comprehending and emulating human conversations due to the inherent complexity of spoken dialogue, and a new benchmark dataset can facilitate further exploration of SDM performance. Abstract: Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. 
Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.
[35] Math Natural Language Inference: this should be easy!
Valeria de Paiva,Qiyue Gao,Hai Hu,Pavel Kovalev,Yikang Liu,Lawrence S. Moss,Zhiheng Qian
Main category: cs.CL
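TL;DR: This paper studies how LLMs perform on Math NLI tasks, constructs corresponding corpora, and finds that LLMs show promise but still face challenges.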
Details
Motivation: To explore whether contemporary LLMs can perform natural language inference on mathematical texts. Method: The authors construct a corpus of Math NLI pairs and evaluate both the performance and the inter-group consistency of LLMs. Result: In some settings LLMs are comparable to human-labeled data, but they still struggle with mathematical language and basic inferences. Conclusion: LLMs show potential on Math NLI tasks, but challenges remain. Abstract: We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts. We call this the Math NLI problem. We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and also in the NLI field. We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves. We not only investigate the performance but also the inter-group consistency of the diverse group of LLMs. We have both positive and negative findings. Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area. On the negative side: LLMs still struggle with mathematical language. They occasionally fail at even basic inferences. Current models are not as prone to hypothesis-only "inference" in our data the way the previous generation had been. In addition to our findings, we also provide our corpora as data to support future work on Math NLI.
[36] Exploring In-Context Learning for Frame-Semantic Parsing
Diego Garat,Guillermo Moncecchi,Dina Wonsever
Main category: cs.CL
TL;DR: This paper investigates using In-Context Learning with Large Language Models to perform Frame Semantic Parsing without model fine-tuning, achieving competitive results on a subset of frames related to violent events.
Details
Motivation: To investigate whether Frame Semantic Parsing can be performed with In-Context Learning and Large Language Models, without model fine-tuning. Method: The authors propose a method that automatically generates task-specific prompts for the Frame Identification and Frame Semantic Role Labeling subtasks, relying solely on the FrameNet database. Result: Experiments on a subset of frames related to violent events achieve competitive results, with F1 scores of 94.3% for Frame Identification and 77.4% for Frame Semantic Role Labeling. Conclusion: ICL is a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks. Abstract: Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.
[37] Context-aware Rotary Position Embedding
Ali Veisi,Delaram Fartoot,Hamidreza Amirzadeh
Main category: cs.CL
TL;DR: This paper proposes CARoPE, a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings, which introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity.
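The sketch below shows one way an input-dependent, bounded phase shift could be added to a standard rotary embedding. It is not the CARoPE formulation: the additive form of the shift, the `tanh` bound, the projection matrix `proj`, and both function names are assumptions chosen only to illustrate the general idea of conditioning rotation angles on token embeddings.

```python
import math


def rope_rotate(vec, position, phase_shifts=None, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles (RoPE).

    phase_shifts, if given, holds one extra angle per pair; adding these
    angles is an assumption of this sketch, not the paper's definition.
    """
    d = len(vec)
    out = []
    for k in range(d // 2):
        angle = position * base ** (-2.0 * k / d)  # standard RoPE frequency
        if phase_shifts is not None:
            angle += phase_shifts[k]
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def bounded_phase_shifts(token_embedding, proj):
    """Map a token embedding to one shift per pair, bounded to (-pi, pi).

    proj is a hypothetical projection for a single attention head, with one
    row of weights per rotated pair.
    """
    return [
        math.pi * math.tanh(sum(w * e for w, e in zip(row, token_embedding)))
        for row in proj
    ]


# Hypothetical 4-dimensional query at position 3 with a toy projection.
query = [1.0, 0.0, 0.5, -0.5]
embedding = [0.2, -0.1, 0.4, 0.3]
proj = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
shifts = bounded_phase_shifts(embedding, proj)
print(rope_rotate(query, position=3, phase_shifts=shifts))
```

Because each pair is still transformed by a pure rotation, vector norms are preserved, and an all-zero projection recovers plain RoPE.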
Details
Motivation: Rotary Positional Embeddings (RoPE) rely on static, input-independent sinusoidal frequency patterns, limiting their ability to model context-sensitive relationships. Method: CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. Result: Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Conclusion: CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models. Abstract: Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. 
Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.
[38] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity
Ishani Mondal,Meera Bharadwaj,Ayush Roy,Aparna Garimella,Jordan Lee Boyd-Graber
Main category: cs.CL
TL;DR: SMART-Editor is a framework that improves layout and content editing across multiple domains using reward-guided strategies, outperforming existing methods.
Details
Motivation: Prior models struggle with maintaining global coherence during edits, especially across different domains. This work aims to address this limitation through reward-guided planning. Method: The framework uses Reward-Refine for inference-time reward-guided refinement and RewardDPO for training-time preference optimization using reward-aligned layout pairs. Result: SMART-Editor surpasses baselines like InstructPix2Pix and HIVE, with RewardDPO showing up to 15% improvement in structured settings and Reward-Refine excelling in natural image editing. Conclusion: SMART-Editor proves effective in compositional layout and content editing across structured and unstructured domains, outperforming existing methods in semantic consistency and visual alignment. Abstract: We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.
[39] RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL
Jeffrey Eben,Aitzaz Ahmad,Stephen Lau
Main category: cs.CL
TL;DR: This paper proposes a scalable, component-based retrieval architecture for text-to-SQL systems that effectively utilizes database metadata, enabling deployment across diverse enterprise settings without domain-specific fine-tuning.
Details
Motivation: The motivation is to overcome the limitations of existing approaches that rely on domain-specific fine-tuning and fail to utilize semantic context from database metadata, which complicates deployment and limits scalability. Method: The method involves decomposing database schemas and metadata into discrete semantic units, which are indexed separately for targeted retrieval, prioritizing table identification and leveraging column-level information. Result: The experiments showed that the proposed approach maintains high recall and accuracy, outperforming baseline methods on large databases with varying structures and metadata availability. Conclusion: The introduced component-based retrieval architecture effectively addresses scalability issues in text-to-SQL systems for enterprise-level data catalogs without requiring domain-specific fine-tuning. Abstract: Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning - complicating deployment - and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. 
Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.
[40] Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity
Xinwei Wu,Haojie Li,Hongyu Liu,Xinyu Ji,Ruohan Li,Yule Chen,Yigeng Zhang
Main category: cs.CL
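TL;DR: This study analyzes the limitations of large language models in handling ambiguous Chinese text and builds a benchmark dataset for testing them.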
Details
Motivation: To explore how large language models (LLMs) behave when they encounter ambiguous narrative text, with a particular focus on Chinese textual ambiguity. Method: The authors created a benchmark dataset of ambiguous sentences and their corresponding disambiguated pairs, and analyzed experimentally how well LLMs handle ambiguity. Result: Experiments show significant fragility in LLMs when handling ambiguity: they cannot reliably distinguish ambiguous from unambiguous text, and they are overconfident and prone to overthinking when interpreting ambiguous text. Conclusion: The study reveals a fundamental limitation of current LLMs in handling linguistic ambiguity and calls for improved approaches to handling uncertainty in language understanding. Abstract: In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: https://github.com/ictup/LLM-Chinese-Textual-Disambiguation.
[41] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans
Ananya Sadana,Yash Kumar Lal,Jiawei Zhou
Main category: cs.CL
TL;DR: ISO-Bench reveals that current multimodal models struggle to understand causal relationships between visual and textual information, indicating a need for improvement.
Details
Motivation: Understanding causal relationships across modalities is essential for multimodal models operating in real-world environments. Method: The researchers introduced ISO-Bench, a benchmark for evaluating the ability of models to infer causal dependencies between visual observations and procedural text. Models were tested using zero-shot and chain-of-thought reasoning techniques. Result: Evaluation results on ten vision-language models showed underwhelming performance, with the best zero-shot F1 score at 0.57 and modest improvements (up to 0.62 F1) using chain-of-thought reasoning, significantly behind human performance (0.98 F1). Conclusion: The study concludes that current multimodal models have limited causal understanding, and there is significant room for improvement. Abstract: Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.
[42] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal
Yuhan Liu,Michael J. Q. Zhang,Eunsol Choi
Main category: cs.CL
TL;DR: This paper explores how implicit feedback from user interactions can improve language models, finding that while it helps with short questions, it is limited by the complexity of the task and the quality of the initial user input.
Details
Motivation: The motivation for this study is to explore non-intrusive methods for improving language models by harvesting implicit feedback from user interaction logs, rather than relying on direct user feedback which can be disruptive. Method: The researchers analyzed implicit user feedback from two datasets (WildChat and LMSYS) and examined how feedback content and polarity could serve as learning signals. They evaluated the impact of feedback on model performance using short human-designed questions (MTBench) and longer, more complex questions (WildBench). Result: The analysis showed that implicit feedback content (e.g., user requests for clarification) can improve model performance on short questions (MTBench), but not on longer, more complex ones (WildBench). Additionally, the usefulness of feedback heavily depends on the quality of the user's initial prompt. Conclusion: The study concludes that implicit user feedback from interaction logs can enhance model performance on short questions when content aspects like clarification needs are considered, but it is less effective for longer, complex questions. The effectiveness of feedback is closely related to the quality of the initial user prompt. Abstract: Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. 
We find that the contents of user feedback (e.g., user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user's initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.
[43] LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration
Jizhou Guo
Main category: cs.CL
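TL;DR: LENS is a new method that uses models' internal representations to estimate confidence and weight the combined predictions of multiple large language models.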
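A minimal sketch of confidence-weighted answer integration in the spirit of this entry follows. It assumes a per-model linear predictor with a sigmoid output and a simple sum-of-confidences vote; the feature vectors, weights, and the names `predict_confidence` and `weighted_ensemble` are hypothetical, and the training of the predictors is omitted, so this should not be read as the LENS implementation.

```python
import math
from collections import defaultdict


def predict_confidence(features, weights, bias):
    """Linear confidence predictor: sigmoid(w . features + b).

    features stands in for a model's hidden-state and probability
    features; here it is simply a flat list of floats.
    """
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def weighted_ensemble(predictions):
    """Choose the answer with the largest total confidence.

    predictions: list of (answer, confidence) pairs, one per model.
    """
    totals = defaultdict(float)
    for answer, confidence in predictions:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Hypothetical case: two unsure models say "A", one confident model says "B".
predictions = [("A", 0.2), ("A", 0.2), ("B", 0.9)]
print(weighted_ensemble(predictions))  # B
```

Plain majority voting would return "A" in this example; weighting by estimated confidence lets a single reliable model outvote two unreliable ones, which is the behaviour the entry contrasts with simple voting.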
Details
Motivation: Existing ensemble methods usually rely on simple techniques that overlook how model confidence and reliability vary across contexts, so a more fine-grained approach is needed to improve system robustness and performance. Method: LENS uses layer-wise hidden states and normalized probabilities as inputs to train a lightweight linear confidence predictor that estimates each model's confidence. Result: Experiments on multiple-choice and boolean question-answering tasks show that LENS clearly outperforms traditional ensemble methods. Conclusion: LENS is an effective method that learns model confidence by analyzing internal representations, allowing the predictions of multiple large language models to be combined more precisely. Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.
[44] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
Jianghui Wang,Vinay Joshi,Saptarshi Majumder,Xu Chao,Bin Ding,Ziqiong Liu,Pratik Prabhanjan Brahma,Dong Li,Zicheng Liu,Emad Barsoum
Main category: cs.CL
TL;DR: This paper introduces GEAK, an AI framework for generating efficient GPU kernels for AMD hardware, using advanced LLMs and feedback mechanisms to outperform existing methods.
Details
Motivation: The increasing complexity and diversity of deep learning workloads necessitate automation in low-level kernel development to enhance productivity and performance, especially on advanced hardware like AMD GPUs. Method: The study introduces GEAK, a framework using advanced LLMs and a Reflexion-style feedback mechanism to generate efficient Triton-based GPU kernels for AMD GPUs like the MI300X and MI250. Result: GEAK outperformed baseline methods by achieving up to 63% correctness and a 2.59X speed improvement on two evaluation benchmarks. Conclusion: GEAK-like frameworks show promise in accelerating hardware platform adoption and making expert-level kernel performance more accessible. Abstract: The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels)-a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. 
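The reasoning loop described above can be pictured as a generate, test, reflect cycle. The sketch below is only an illustration under stated assumptions: `reflexion_loop`, `generate`, `evaluate`, and the toy stand-ins are hypothetical names rather than GEAK's actual interfaces, and no LLM or Triton compiler is invoked.

```python
from typing import Callable, List, Optional, Tuple


def reflexion_loop(
    task: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_iters: int = 4,
) -> Optional[str]:
    """Repeatedly generate a candidate, test it, and feed failures back.

    generate(task, reflections) -> candidate code (hypothetical LLM call)
    evaluate(candidate) -> (passed, feedback), e.g. from compiling and testing
    """
    reflections: List[str] = []
    for _ in range(max_iters):
        candidate = generate(task, reflections)
        passed, feedback = evaluate(candidate)
        if passed:
            return candidate
        # Keep the failure message so the next attempt can condition on it.
        reflections.append(feedback)
    return None


# Toy stand-ins so the sketch runs without an LLM or a GPU.
def toy_generate(task: str, reflections: List[str]) -> str:
    # Pretend the model fixes its bug only after seeing one piece of feedback.
    return "return a + b" if reflections else "return a - b"


def toy_evaluate(candidate: str) -> Tuple[bool, str]:
    namespace: dict = {}
    exec("def kernel(a, b):\n    " + candidate, namespace)
    ok = namespace["kernel"](2, 3) == 5
    return ok, "" if ok else "kernel(2, 3) returned the wrong value"


result = reflexion_loop("add two numbers", toy_generate, toy_evaluate)
```

With the toy functions, the first candidate fails its check and the second, generated after one reflection, passes. Spending more iterations per task is one simple form of inference-time compute scaling.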
On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines, achieving correctness of up to 63% and an execution speedup of up to 2.59x. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.
[45] Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples
Yunhao Liang,Ruixuan Ying,Takuya Taniguchi,Zhe Cui
Main category: cs.CL
TL;DR: This paper proposes a novel method to improve few-shot in-context learning by leveraging Negative samples for better Positive example selection.
Details
Motivation: Prior work focuses on Positive samples, overlooking the value of Negative samples in improving ICL performance. Method: A semantic similarity-based approach is used to select examples from Positive and Negative corpora; Positive examples are further retrieved based on similarity to Negative examples. Result: Experimental results show that the proposed method outperforms approaches that only use Positive examples for context. Conclusion: The proposed method enhances few-shot in-context learning performance by utilizing Negative samples for better Positive sample selection. Abstract: Large Language Models exhibit powerful few-shot in-context learning (ICL) capabilities, but the performance is highly sensitive to provided examples. Recent research has focused on retrieving corresponding examples for each input query, not only enhancing the efficiency and scalability of the learning process but also mitigating inherent biases in manual example selection. However, these studies have primarily emphasized leveraging Positive samples while overlooking the additional information within Negative samples for contextual learning. We propose a novel method that utilizes Negative samples to better select Positive sample examples, thereby enhancing the performance of few-shot ICL. Initially, we construct Positive and Negative sample corpora based on Zero-Shot-Cot. Then, during inference, we employ a semantic similarity-based approach to select the most similar examples from both the Positive and Negative corpora for a given query. Subsequently, we further retrieve Positive examples from the Positive sample corpus based on semantic similarity to the Negative examples, then concatenating them with the previously selected Positive examples to serve as ICL demonstrations. 
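The two-stage retrieval described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy two-dimensional embeddings, the `k_*` parameters, and the order in which examples are concatenated are all assumptions.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query: Vector, corpus: Dict[str, Vector], k: int) -> List[str]:
    """Return the ids of the k corpus items most similar to the query."""
    ranked = sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)
    return ranked[:k]


def select_demonstrations(
    query: Vector,
    positives: Dict[str, Vector],
    negatives: Dict[str, Vector],
    k_pos: int = 1,
    k_neg: int = 1,
    k_extra: int = 1,
) -> List[str]:
    # Stage 1: nearest Positive and Negative examples for the query.
    chosen = top_k(query, positives, k_pos)
    nearest_neg = top_k(query, negatives, k_neg)
    # Stage 2: for each retrieved Negative example, pull Positive examples
    # that are similar to it, skipping ones already chosen.
    for neg_id in nearest_neg:
        remaining = {n: v for n, v in positives.items() if n not in chosen}
        chosen += top_k(negatives[neg_id], remaining, k_extra)
    return chosen


# Toy embeddings, invented purely to exercise the function.
positives = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
negatives = {"n1": [0.1, 1.0], "n2": [1.0, -1.0]}
demos = select_demonstrations([1.0, 0.2], positives, negatives)
```

Here `p1` is the Positive example closest to the query, `n2` is the closest Negative example, and `p3` is the remaining Positive example most similar to `n2`, so the demonstrations are `p1` followed by `p3`.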
Experimental results demonstrate that our approach surpasses methods solely relying on the most similar positive examples for context, validating that the additional information in negative samples aids in enhancing ICL performance through improved Positive sample selection.
[46] Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders
Carolina Zheng,Nicolas Beltran-Velez,Sweta Karlekar,Claudia Shi,Achille Nazaret,Asif Mallik,Amir Feder,David M. Blei
Main category: cs.CL
TL;DR: This paper introduces MTMs, a new class of topic models that operate on semantically rich interpretable features, enabling better topic discovery and controllable text generation.
Details
Motivation: Traditional topic models struggle to capture semantically abstract features due to reliance on bag-of-words representations, and neural variants are limited by expressing topics as word lists. Method: We introduce Mechanistic Topic Models (MTMs), operating on interpretable features learned by sparse autoencoders (SAEs), and propose topic judge, an LLM-based pairwise comparison evaluation framework. Result: MTMs match or exceed traditional and neural baselines on coherence metrics and are consistently preferred by topic judge. Conclusion: MTMs enable effective steering of LLM outputs and match or exceed traditional and neural baselines. Abstract: Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose topic judge, an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.
[47] Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs
Sophie Kearney,Shu Yang,Zixuan Wen,Bojian Hou,Duy Duong-Tran,Tianlong Chen,Jason Moore,Marylyn Ritchie,Li Shen
Main category: cs.CL
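TL;DR: This study proposes TAP-GPT, a framework that adapts the large language model TableGPT2 for Alzheimer's disease diagnosis using few-shot learning and the parameter-efficient fine-tuning method qLoRA, outperforming other advanced models.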
Details
Motivation: Early and accurate diagnosis of Alzheimer's disease requires analyzing many heterogeneous biomarkers, which are usually represented in tabular form. Large language models (LLMs) offer flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, providing unprecedented opportunities for prediction with structured biomedical data. Method: The study proposes a new framework, TAP-GPT, for Alzheimer's disease (AD) diagnosis. It builds on TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, and fine-tunes it with the parameter-efficient adaptation method qLoRA. Few-shot tabular prompts are constructed from in-context learning examples drawn from structured biomedical data for a clinical binary classification task (AD or cognitively normal, CN). Result: By leveraging TableGPT2's strong tabular understanding and the prior knowledge encoded in LLMs, TAP-GPT outperforms more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks in the small-sample setting. Conclusion: TAP-GPT is presented as the first framework to apply LLMs to prediction with tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics. Abstract: Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer's Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaptation for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks.
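To make the idea of a few-shot tabular prompt concrete, the sketch below serializes a handful of labeled rows plus one unlabeled row into a table-style prompt. It is an assumption-laden illustration: the function name, the column names `feature_a` and `feature_b`, the prompt wording, and all values are hypothetical placeholders, not the paper's prompt format or data.

```python
from typing import Dict, List


def build_few_shot_prompt(
    examples: List[Dict[str, object]],
    query: Dict[str, object],
    label_key: str = "diagnosis",
) -> str:
    """Serialize labeled rows plus one unlabeled row into a table-style prompt."""
    columns = [c for c in examples[0] if c != label_key]
    header = " | ".join(columns + [label_key])
    lines = ["Task: predict the diagnosis (AD or CN) for the last row.", header]
    for row in examples:
        lines.append(" | ".join(str(row[c]) for c in columns + [label_key]))
    # The query row leaves the label cell open for the model to fill in.
    lines.append(" | ".join(str(query[c]) for c in columns) + " | ?")
    return "\n".join(lines)


# Placeholder rows, invented purely to exercise the function.
examples = [
    {"feature_a": 0.8, "feature_b": 12, "diagnosis": "AD"},
    {"feature_a": 0.2, "feature_b": 30, "diagnosis": "CN"},
]
prompt = build_few_shot_prompt(examples, {"feature_a": 0.7, "feature_b": 14})
```

The resulting string has one instruction line, one header line, one line per in-context example, and a final query line ending in `?`.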
To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.
[48] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication
Sneha Oram,Pushpak Bhattacharyya
Main category: cs.CL
TL;DR: This paper introduces the P-ReMe dataset and StiPRompts to evaluate pragmatic reasoning and stigma handling in LLMs for mental health, finding that Mistral, Qwen, and especially Claude-3.5-haiku perform well.
Details
Motivation: The motivation is to explore pragmatic reasoning aspects like implicature and presupposition in mental health chatbots, which have not been previously studied. Method: The authors introduced the P-ReMe dataset, formulated tasks based on implicature and presupposition, and benchmarked four LLMs (Llama3.1, Mistral, MentaLLaMa, Qwen). They also proposed StiPRompts and evaluated GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku on mental health stigma. Result: Experimental results indicate that Mistral and Qwen show substantial reasoning capabilities in the mental health domain, and Claude-3.5-haiku handles stigma more responsibly than GPT-4o mini and Deepseek-chat. Conclusion: The study concludes that Mistral and Qwen demonstrate strong pragmatic reasoning capabilities in mental health, and Claude-3.5-haiku handles mental health stigma more responsibly compared to other LLMs. Abstract: There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models - Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. 
Our findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.
[49] Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis
Shimanto Bhowmik,Tawsif Tashwar Dipto,Md Sazzad Islam,Sheryl Hsu,Tahsin Reasat
Main category: cs.CL
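TL;DR: This paper studies the challenges of Bengali NLP by evaluating 10 LLMs on 8 datasets, revealing performance gaps and the main causes of model failure, and highlighting the effect of tokenization efficiency on accuracy.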
Details
Motivation: Bengali is underrepresented in NLP research because of its unique linguistic structure and computational constraints, and it lacks standardized evaluation benchmarks. Method: The authors systematically investigate the problems that hinder Bengali NLP performance, evaluate 10 open-source LLMs on 8 translated datasets, and conduct a comprehensive error analysis. Result: The study finds performance gaps for Bengali, particularly for smaller models and the Mistral model family, while architectures such as DeepSeek show better cross-lingual stability. It also finds that accuracy drops when inputs are split into excessively many tokens. Conclusion: The paper highlights the performance gaps of current LLMs on Bengali NLP tasks, points to the importance of dataset quality and evaluation methodology, and calls for more research on language technology for low-resource languages. Abstract: Bengali is an underrepresented language in NLP research. However, it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at https://github.com/BengaliAI/bn-llm-benchmark.
[50] Unveiling Super Experts in Mixture-of-Experts Large Language Models
Zunhai Su,Qingyuan Li,Hao Zhang,YuLei Qian,Yuchen Xie,Kehong Yuan
Main category: cs.CL
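TL;DR: This study presents the first discovery and analysis of Super Experts (SEs), a small set of experts that play a critical role in sparsely activated MoE models, and reveals their central role in the attention mechanism and in model performance.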
Details
Motivation: Existing MoE compression methods usually rely on empirical criteria to identify critical experts and lack a deeper understanding of experts' heterogeneous importance, so the underlying mechanisms of expert roles in MoE models need to be explored. Method: The authors analyze the role of Super Experts in the model's inference mechanism, use pruning experiments to assess the importance of SEs across a variety of tasks, and further study the impact of SE compression. Result: The study identifies Super Experts (SEs), which produce extreme activations in the hidden states and have a significant impact on model performance, especially on mathematical reasoning tasks; pruning SEs significantly disrupts the distribution of attention scores. Conclusion: The study confirms that sparsely activated Mixture-of-Experts (MoE) models rely on Super Experts (SEs) to induce attention sinks, which are crucial to the distribution of attention scores and are severely disrupted by SE pruning. Abstract: Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model's forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SE compression.
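One simple way to picture "rare but extreme activation outliers" is to compare each expert's peak down_proj output magnitude with that of a typical expert. The sketch below does this with a median-based threshold. It is an illustrative assumption, not the paper's profiling procedure: the function name, the `ratio` parameter, and the placeholder magnitudes are all hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple

ExpertId = Tuple[int, int]  # (layer index, expert index)


def find_outlier_experts(
    peak_activation: Dict[ExpertId, float],
    ratio: float = 50.0,
) -> List[ExpertId]:
    """Flag experts whose peak |down_proj output| dwarfs the typical expert.

    `ratio` is an illustrative threshold, not the paper's criterion.
    """
    typical = median(peak_activation.values())
    return sorted(e for e, peak in peak_activation.items() if peak > ratio * typical)


# Placeholder magnitudes, invented purely to exercise the function.
peaks = {(0, 0): 1.2, (0, 1): 0.9, (1, 0): 1.1, (1, 1): 480.0, (2, 0): 1.0}
flagged = find_outlier_experts(peaks)
```

In this toy input the median peak is 1.1, so only the expert at layer 1, index 1 exceeds the threshold. In practice the peak values would have to be collected by recording expert outputs during forward passes.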
Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.
[51] What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content
Alfio Ferrara,Sergio Picascia,Laura Pinnavaia,Vojimir Ranitovic,Elisabetta Rocchetti,Alice Tuveri
Main category: cs.CL
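TL;DR: GPT-4o-mini implicitly moderates content when paraphrasing sensitive text, substantially reducing offensive language, and can effectively classify sensitive sentences in a zero-shot setting.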
Details
Motivation: The study asks whether large language models implicitly filter sensitive content without explicit instructions, whereas prior research has mainly focused on explicitly training models to moderate content. Method: Through empirical analysis, the authors test GPT-4o-mini's behavior when paraphrasing sensitive content and evaluate its zero-shot sensitivity classification ability. Result: Experiments show that GPT-4o-mini systematically shifts content toward less sensitive classes during paraphrasing, substantially reducing derogatory and taboo language. Conclusion: GPT-4o-mini exhibits implicit moderation of sensitive content and can effectively reduce derogatory and taboo language. Abstract: Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.
[52] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation
Daeyong Kwon,SeungHeon Doh,Juhan Nam
Main category: cs.CL
TL;DR: MusT-RAG uses retrieval-augmented generation to improve the adaptation of large language models to the music domain.
Details
Motivation: Because music knowledge makes up only a small proportion of their training data, large language models are of limited effectiveness in music-related applications. Method: The authors propose the MusT-RAG framework, which uses MusWikiDB, a music-specialized vector database, and optimizes RAG by exploiting context information during both inference and fine-tuning. Result: MusT-RAG significantly outperforms traditional fine-tuning approaches at improving LLMs' music domain adaptation, and MusWikiDB is more effective than general Wikipedia corpora. Conclusion: The MusT-RAG framework and the MusWikiDB database effectively improve LLM performance in the music domain. Abstract: Recent advancements in Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.
[53] Text-to-SQL Task-oriented Dialogue Ontology Construction
Renato Vukovic,Carel van Niekerk,Michael Heck,Benjamin Ruppik,Hsien-Chin Lin,Shutong Feng,Nurul Lubis,Milica Gasic
Main category: cs.CL
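TL;DR: TeQoDO is a new method that builds task-oriented dialogue ontologies without supervision, combining the SQL programming ability of large language models with dialogue theory to improve explainability and controllability.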
Details
Motivation: Large language models rely on parametric knowledge and lack explainability and trustworthiness; task-oriented dialogue systems need external databases and ontology structures to improve controllability and explainability, but traditional approaches require manual labels or supervised training. Method: The method uses an LLM's SQL programming capabilities, combined with dialogue theory supplied in the prompt, to build a task-oriented dialogue (TOD) ontology autonomously. Result: TeQoDO outperforms transfer learning approaches, its constructed ontology is competitive on a downstream dialogue state tracking task, and it scales to building larger ontologies (on Wikipedia and ArXiv datasets). Conclusion: TeQoDO is a method for constructing task-oriented dialogue ontologies that can improve LLM explainability and performs well on dialogue state tracking. Abstract: Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.
[54] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models
Yiyan Ji,Haoran Chen,Qiguang Chen,Chengyue Wu,Libo Qin,Wanxiang Che
Main category: cs.CL
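TL;DR: This paper proposes MPCC, the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning, and highlights the difficulty current models have with complex-constraint tasks.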
Details
Motivation: Current benchmarks cannot directly assess multimodal real-world planning capabilities, and they lack cross-modal or implicit constraints. Method: The authors introduce Multimodal Planning with Complex Constraints (MPCC) to systematically evaluate MLLMs' ability to handle multimodal constraints in planning tasks. Result: Experiments show that closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. MLLMs are highly sensitive to constraint complexity, and traditional multimodal prompting strategies fail in multi-constraint scenarios. Conclusion: MPCC underscores the limitations of current MLLMs in multi-constraint scenarios and points to the need for further research on constraint-aware reasoning for real-world applications. Abstract: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g., budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.
[55] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models
Ailiang Lin,Zhuoyun Li,Kotaro Funakoshi
Main category: cs.CL
TL;DR: Causal2Vec improves the performance of decoder-only LLMs in generating embeddings by using a Contextual token and concatenating hidden states, leading to better performance and reduced computational costs.
Details
Motivation: The motivation is to improve the effectiveness of decoder-only LLMs in generating embeddings without altering their architecture or increasing computational costs. Method: The method involves using a lightweight BERT-style model to create a Contextual token, which is prepended to the input sequence of the LLM. The final text embedding is obtained by concatenating the last hidden states of the Contextual and EOS tokens. Result: Causal2Vec achieves state-of-the-art performance on the MTEB benchmark while significantly reducing sequence length and inference time. Conclusion: Causal2Vec is an effective embedding model that enhances the performance of decoder-only LLMs without changing their architecture or adding significant computational overhead. Abstract: Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model's ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM's input sequence, allowing each token to capture contextualized information even without attending to future tokens. 
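The two steps named in the Method summary, prepending a Contextual token and concatenating two final hidden states, can be sketched with plain lists. This is a toy illustration under stated assumptions: the function names are hypothetical, the vectors are made up, the last input position is assumed to be the EOS token, and the decoder is replaced by an identity stand-in.

```python
from typing import List

Vector = List[float]


def prepend_contextual(contextual: Vector, token_embeddings: List[Vector]) -> List[Vector]:
    """Put the pre-encoded Contextual token in front of the LLM's input embeddings."""
    return [contextual] + token_embeddings


def pool_embedding(last_hidden: List[Vector]) -> Vector:
    """Concatenate the final hidden states at the Contextual (first) and EOS (last) positions."""
    return last_hidden[0] + last_hidden[-1]


# Toy 2-dimensional vectors; a real model would produce these.
inputs = prepend_contextual([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
# Stand-in for the decoder: identity, so the hidden states equal the inputs.
embedding = pool_embedding(inputs)
```

The output dimension is twice the hidden size because two hidden states are concatenated, while the input grows by only one token.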
Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
[56] Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators
Peter Sandrini
Main category: cs.CL
TL;DR: This paper explores the feasibility of using locally deployable, free language models as an alternative to commercial cloud-based AI solutions in translation, highlighting benefits like better data control and privacy.
Details
Motivation: Concerns regarding data privacy, security, and equitable access necessitate the exploration of alternative deployment models for Large Language Models in the translation field. Method: The study evaluates three open-source models installed on CPU-based platforms and compares them against commercially available online chatbots, focusing on functional performance. Result: The findings contribute to the understanding of the democratization of AI technology and aim to inform future research and development efforts to make LLMs more accessible and practical for individual translators and small businesses. Conclusion: The study concludes that locally deployable, free language models can serve as a viable alternative to proprietary, cloud-based AI solutions, offering benefits such as enhanced data control, improved privacy, and reduced dependency on cloud services. Abstract: The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compared against commercially available online chat-bots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. 
The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.
[57] MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization
Yongbing Zhang,Fang Nan,Shengxiang Gao,Yuxin Huang,Kaiwen Tan,Zhengtao Yu
Main category: cs.CL
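TL;DR: This paper proposes MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization that overcomes limitations of existing methods and performs strongly on multiple datasets.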
Details
Motivation: Existing multi-document summarization methods usually consider only single-relational graphs and require a predefined number of clusters, which limits their ability to represent rich relational information and to adapt when reducing redundancy. Method: The authors propose the MRGSEM-Sum framework, which builds a multi-relational graph of semantic and discourse relations, clusters it with a two-dimensional structural entropy minimization algorithm, and introduces a position-aware compression mechanism to generate summaries. Result: Experiments on four benchmark datasets show that MRGSEM-Sum consistently outperforms previous unsupervised methods and in some cases matches supervised models and large language models; human evaluation shows that the summaries have high consistency and coverage, approaching human-level quality. Conclusion: MRGSEM-Sum achieves effective multi-document summarization through multi-relational graphs and structural entropy minimization, outperforming existing unsupervised methods and performing comparably to supervised models and LLMs on multiple datasets. Abstract: The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries.
Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.
[58] Enhanced Arabic Text Retrieval with Attentive Relevance Scoring
Salah Eddine Bekhouche,Azeddine Benlamoudi,Yazid Bounab,Fadi Dornaika,Abdenour Hadid
Main category: cs.CL
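TL;DR: This paper proposes an enhanced dense passage retrieval framework for Arabic that combines a new Attentive Relevance Scoring mechanism with pre-trained Arabic language models, significantly improving retrieval and ranking performance for Arabic question answering.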
Details
Motivation: Arabic remains underrepresented in NLP research and benchmark resources, and its complex morphology, optional diacritics, and the coexistence of Modern Standard Arabic with various dialects pose particular challenges for natural language processing and information retrieval. Method: The authors develop a new Attentive Relevance Scoring (ARS) mechanism and combine it with pre-trained Arabic language models to improve the Dense Passage Retrieval (DPR) framework. Result: The new method significantly improves retrieval performance and ranking accuracy when answering Arabic questions. Conclusion: Combining the ARS mechanism with pre-trained models significantly improves retrieval performance and ranking accuracy for Arabic question answering. Abstract: Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available at https://github.com/Bekhouche/APR.
[59] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
Ante Wang,Yujie Lin,Jingyao Liu,Suhang Wu,Hao Liu,Xinyan Xiao,Jinsong Su
Main category: cs.CL
TL;DR: This study introduces proactive critical thinking in AI, where models actively seek missing information to better solve problems, and proposes new benchmarks to evaluate this capability, showing that reinforcement learning can significantly improve performance.
Details
Motivation: The motivation is to move beyond passive critical thinking in AI systems, enabling models to actively seek missing or clarifying information for better problem-solving collaboration with users. Method: The study introduces proactive critical thinking as a paradigm and evaluates it using two benchmarks, GSM-MC and GSM-MCE. The researchers also use an enhanced reinforcement learning algorithm to improve model performance. Result: Experiments show that existing models, despite excelling in traditional reasoning tasks, struggle with proactive critical thinking. However, reinforcement learning significantly improves their performance, notably increasing Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC. Conclusion: The study concludes that while current models struggle with proactive critical thinking, reinforcement learning can significantly enhance this capability, improving collaboration between AI systems and users in problem-solving. Abstract: Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. 
Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially the smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.
[60] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Saeed Almheiri,Yerulan Kongrat,Adrian Santosh,Ruslan Tasmukhanov,Josemaria Vera,Muhammad Dehan Al Kautsar,Fajri Koto
Main category: cs.CL
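TL;DR: This study investigates how to fine-tune large language models so that they generate responses matching the access privileges of different organizational roles, and evaluates the effectiveness and robustness of different modeling strategies in enterprise settings.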
Details
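Motivation: As large language models are widely deployed in enterprises, controlling model behavior based on user roles has become a necessary requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or offensive outputs, without addressing role-specific access restrictions. Method: The researchers explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate them, they construct two complementary datasets: one derived from existing instruction-tuning corpora through clustering and role labeling, and another synthetically generated to simulate realistic enterprise scenarios. Result: The paper evaluates model performance across different organizational structures and analyzes robustness to prompt injection, role mismatch, and jailbreak attempts. Conclusion: The study validates that large language models can be fine-tuned to generate outputs that respect the access privileges of each user role, and reports good performance and security across multiple enterprise scenarios. Abstract: As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.
[61] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains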
Shirui Wang,Zhihui Tang,Huaxia Yang,Qiuhong Gong,Tiantian Gu,Hongyang Ma,Yongxin Wang,Wubin Sun,Zeliang Lian,Kehang Mao,Yinan Jiang,Zhicheng Huang,Lingyun Ma,Wenjie Shen,Yajie Ji,Yunhui Tan,Chunbo Wang,Yunlu Gao,Qianling Ye,Rui Lin,Mingyu Chen,Lijuan Niu,Zhihao Wang,Peng Yu,Mengran Lang,Yue Liu,Huimin Zhang,Haitao Shen,Long Chen,Qiguang Zhao,Si-Xuan Liu,Lina Zhou,Hua Gao,Dongqiang Ye,Lingmin Meng,Youtao Yu,Naixin Liang,Jianxiong Wu
Main category: cs.CL
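TL;DR: This study builds the CSEDB framework for evaluating the safety and effectiveness of medical large language models; the results show that domain-specific models outperform general-purpose models, performing better especially in high-risk scenarios.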
Details
Motivation: Large language models show potential in clinical decision support but face major challenges in safety evaluation and effectiveness validation. A standardized evaluation framework is therefore needed to promote their safe and effective use in medicine. Method: The researchers built CSEDB, a multidimensional evaluation framework based on expert consensus that covers 30 criteria; 32 specialist physicians developed and reviewed 2,069 Q&A items spanning 26 clinical departments. Result: Testing of six LLMs showed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs performed better in both safety and effectiveness. Conclusion: The study developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a framework for comprehensively evaluating the safety and effectiveness of medical large language models (LLMs), and found that domain-specific models outperform general-purpose models. The results provide standardized evaluation metrics for applying medical LLMs and help identify risks and directions for improvement. Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.
[62] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning
Keer Lu,Zheng Liang,Youquan Li,Jiejun Tan,Da Pan,Shusen Zhang,Guosheng Dong,Huang Leng
Main category: cs.CL
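TL;DR: Med-R$^3$ jointly optimizes retrieval and reasoning through progressive reinforcement learning, achieving a performance breakthrough in retrieval-augmented reasoning in the medical domain.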
Details
Motivation: Existing methods mostly optimize retrieval or reasoning in isolation and rely on supervised fine-tuning, which limits generalization to new problems. In addition, reward function designs from general-domain reinforcement learning do not meet the specific needs of the medical domain. Method: The paper proposes the Med-R$^3$ framework, based on progressive reinforcement learning: it first trains the model's medical logical reasoning ability, then adaptively optimizes its retrieval capability, and finally performs coordinated joint optimization of retrieval and reasoning. Result: Experiments show that LLaMA3.1-8B-Instruct and Qwen2.5-14B combined with Med-R$^3$ improve over GPT-4o-mini by 3.93% and 13.53%, respectively. Conclusion: By jointly optimizing retrieval and reasoning, the Med-R$^3$ framework improves retrieval-augmented reasoning in the medical domain and achieves significant gains on multiple base models. Abstract: In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model's retrieval and reasoning coordination. 
Extensive experiments indicate that **Med-R$^3$** can achieve state-of-the-art performance, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing the closed-source GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.
[63] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text
Alva West,Luodan Zhang,Liuliu Zhang,Minjun Zhu,Yixuan Weng,Yue Zhang
Main category: cs.CL
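TL;DR: This paper proposes T-Detect, a new text detection method that replaces conventional Gaussian normalization with a heavy-tailed discrepancy score based on the Student's t-distribution, improving the accuracy of adversarial text detection.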
Details
Motivation: Existing zero-shot detectors perform poorly on adversarial or non-native English text. Method: Conventional Gaussian normalization is replaced with a heavy-tailed discrepancy score derived from the Student's t-distribution. Result: T-Detect performs well on the RAID benchmark and the HART dataset, improving AUROC by up to 3.9%. Conclusion: T-Detect shows superior performance in adversarial text detection and provides a new statistical foundation for text detection. Abstract: The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student's t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. 
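The abstract does not spell out the score itself, so the following Python snippet is only a minimal sketch of the general idea of swapping Gaussian normalization for a heavy-tailed one. It assumes a curvature-style detector that already supplies the observed log-likelihood `ll` of a passage together with its expected mean `mu` and scale `sigma`; the function names, the degrees-of-freedom parameter `nu`, and the rescaling by the standard deviation of a Student's t-distribution are illustrative assumptions, not the authors' formula.

```python
import math


def gaussian_score(ll: float, mu: float, sigma: float) -> float:
    """Conventional z-score: deviation of the log-likelihood in units of sigma."""
    return (ll - mu) / sigma


def t_score(ll: float, mu: float, sigma: float, nu: float = 5.0) -> float:
    """Hypothetical heavy-tailed variant of the z-score.

    A Student's t-distribution with scale `sigma` and `nu` > 2 degrees of
    freedom has standard deviation sigma * sqrt(nu / (nu - 2)). Dividing by
    that larger spread shrinks the score, so an extreme log-likelihood counts
    as less surprising than it would under a Gaussian assumption.
    """
    if nu <= 2.0:
        raise ValueError("nu must exceed 2 for the variance to be finite")
    return (ll - mu) / (sigma * math.sqrt(nu / (nu - 2.0)))


if __name__ == "__main__":
    # Made-up numbers, only to show the two scores side by side.
    print(gaussian_score(-10.0, -12.0, 1.0))
    print(t_score(-10.0, -12.0, 1.0, nu=4.0))
```

With a fixed `nu`, this particular form is a constant rescaling of the z-score, so it changes thresholds and calibration rather than the ranking of passages; a faithful reimplementation should follow the released code at the repository named in the abstract.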
Our contributions are a new, theoretically justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Our code is released at https://github.com/ResearAI/t-detect.
[64] DiffLoRA: Differential Low-Rank Adapters for Large Language Models
Alexandre Misrahi,Nadezhda Chirkova,Maxime Louis,Vassilina Nikoulina
Main category: cs.CL
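TL;DR: DiffLoRA introduces low-rank adapters to adapt the differential attention mechanism; it stands out in certain domains but falls short of other methods on most tasks.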
Details
Motivation: Differential Transformer improves model performance through a denoising attention mechanism, but an efficient parameter adaptation method is needed. Method: The paper proposes DiffLoRA, which applies low-rank adapters to both the positive and negative attention terms, combining the efficiency of LoRA with the performance benefits of differential attention. Result: On most tasks DiffLoRA underperforms other parameter-efficient fine-tuning methods, but on HumanEval it scores 11 points higher than LoRA. Conclusion: DiffLoRA performs well in specific domains but its overall effectiveness is limited; further optimization of attention patterns is needed. Abstract: Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts over LoRA on HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.
[65] Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning
Salam Thabet Doghmash,Motaz Saad
Main category: cs.CL
TL;DR: This research develops models for detecting and masking hate speech in Arabic social media, achieving strong results in detection accuracy and competitive performance in text cleaning.
Details
Motivation: Hate speech identification and text cleaning are crucial for maintaining healthy online environments, particularly in Arabic social media where resources for such tasks are limited. Method: The study uses deep learning models and transformers to detect hate speech in Arabic text, and addresses text cleaning by framing it as a machine translation task, where dirty text is transformed into masked text. Result: For hate speech detection, the best model achieved a 92% Macro F1 score and 95% accuracy. In text cleaning, the best hate speech masking model achieved a BLEU score of 0.3 with 1-gram, which is considered good compared to state-of-the-art machine translation systems. Conclusion: The research successfully identified effective deep learning models for Arabic hate speech detection and proposed a machine translation approach for text cleaning, achieving competitive performance in both tasks. Abstract: Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) to detect hate speech in Arabic text, 2) to clean a given text from hate speech. Cleaning here means replacing each bad word with as many stars as the word has letters. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding the second problem, we consider it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with the dirty text masked. The best hate speech detection model achieves a 92% Macro F1 score and 95% accuracy. 
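The masking target described above (each bad word replaced by as many stars as it has letters) can be illustrated with a small Python sketch. This is a plain lexicon lookup that only shows what a (dirty sentence, masked sentence) pair looks like; the paper itself frames masking as a machine translation task learned by a model, and the function name `mask_words` and the placeholder lexicon are hypothetical.

```python
def mask_words(sentence: str, lexicon: set[str]) -> str:
    """Replace every whitespace-separated token found in `lexicon`
    with a run of '*' of the same length; leave other tokens unchanged.

    Tokens are rejoined with single spaces, so repeated whitespace in the
    input is collapsed. Length is counted in Unicode code points, which for
    Arabic text would also count any diacritic marks attached to a word.
    """
    masked = ["*" * len(tok) if tok in lexicon else tok for tok in sentence.split()]
    return " ".join(masked)


if __name__ == "__main__":
    # Placeholder lexicon entry standing in for an offensive word.
    print(mask_words("you are a badword", {"badword"}))
```

Pairs produced this way have the same number of tokens on both sides, which is one reason a unigram overlap metric such as 1-gram BLEU is a natural, if coarse, way to compare a predicted masked sentence with its reference.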
Regarding the text cleaning experiment, the best hate speech masking model reached a BLEU score of 0.3 with 1-gram, which is a good result compared with state-of-the-art machine translation systems.
[66] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs
Nasim Shirvani-Mahdavi,Devin Wingfield,Amin Ghasemi,Chengkai Li
Main category: cs.CL
TL;DR: This paper explores using large language models to explain logical rules in knowledge graphs naturally, showing promising results in explanation quality while identifying ongoing challenges.
Details
Motivation: Logical rules in knowledge graphs can enhance reasoning, interpretation, and error detection, but their complexity and unique labeling make them hard to understand. This paper aims to leverage large language models to generate natural language explanations for these rules, improving their accessibility. Method: The authors extracted logical rules using the AMIE 3.5.1 rule discovery algorithm from FB15k-237, FB-CVT-REV, and FB+CVT-REV datasets. They applied various prompting strategies like zero- and few-shot prompting, inclusion of variable entity types, and chain-of-thought reasoning. Human evaluation was conducted to assess the correctness, clarity, and hallucination of the generated explanations. Result: The results indicate that large language models can generate explanations with promising correctness and clarity. However, challenges persist, indicating areas for future research. Conclusion: The paper concludes that large language models show promise in generating understandable natural language explanations for logical rules in knowledge graphs, despite some challenges that remain. Abstract: Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. 
We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.
[67] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
Yunxiang Yan,Tomohiro Sawada,Kartik Goyal
Main category: cs.CL
TL;DR: A new cascaded question disclosure framework improves evaluation of LLMs' problem-solving abilities by providing more accurate comparisons and better reasoning traces than standard QA benchmarks.
Details
Motivation: The motivation is to move beyond traditional QA benchmarks, which are indirect and may overestimate performance differences, to a more holistic evaluation of problem-solving capabilities in LLMs. Method: The method involves a cascaded question disclosure approach that collects model responses in a stagewise manner, with each stage revealing partial information to elicit generalized reasoning. Result: The approach enables better model comparisons and generates improved intermediate reasoning traces, while narrowing performance gaps observed in standard QA evaluations. Conclusion: The proposed cascaded question disclosure framework offers a more accurate and reliable way to evaluate LLMs' problem-solving capabilities compared to standard QA paradigms. Abstract: While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models' problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.
cs.CV [Back]
[68] CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam
Ruslan Khrulev
Main category: cs.CV
TL;DR: This paper introduces a new benchmark, the EGE-Math Solutions Assessment Benchmark, for evaluating how well vision-language models assess hand-written mathematical solutions.
Details
Motivation: Existing benchmarks focus mainly on problem solving, whereas this paper focuses on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. Method: The paper builds a new benchmark, the EGE-Math Solutions Assessment Benchmark, compiling 122 scanned solutions from the Russian Unified State Exam, and evaluates seven modern vision-language models from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. Result: The results reveal the limitations of current vision-language models in mathematical reasoning and in alignment with human grading rubrics. Conclusion: The paper concludes that current vision-language models are limited in mathematical reasoning and human-rubric alignment, which opens new research directions for AI-assisted assessment. Abstract: This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. The code is available at https://github.com/Karifannaa/Auto-check-EGE-math
[69] Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
Zhensheng Yuan,Haozhi Huang,Zhen Xiong,Di Wang,Guanghua Yang
Main category: cs.CV
TL;DR: This paper presents a new framework for urban-scale scene reconstruction and rendering with improved efficiency and robustness.
Details
Motivation: To achieve efficient reconstruction and rendering of urban-scale scenes while handling appearance variations across multi-view captures. Method: The method comprises scene partitioning, a visibility-based image selection strategy, a controllable level-of-detail strategy, an appearance transformation module, and enhancement modules (depth regularization, scale regularization, and antialiasing). Result: Experiments show that the method outperforms previous approaches in both efficiency and quality and effectively reconstructs urban-scale scenes. Conclusion: The paper proposes a new framework that achieves fast reconstruction and real-time rendering of urban-scale scenes while remaining robust to appearance variations. Abstract: We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize training efficiency. A controllable level-of-detail (LOD) strategy explicitly regulates Gaussian density under a user-defined budget, enabling efficient training and rendering while maintaining high visual fidelity. The appearance transformation module mitigates the negative effects of appearance inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and antialiasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality. The source code is available at: https://yzslab.github.io/REUrbanGS.
[70] Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction
Giuseppe Cartella,Vittorio Cuculo,Alessandro D'Amelio,Marcella Cornia,Giuseppe Boccignone,Rita Cucchiara
Main category: cs.CV
TL;DR: ScanDiff is a novel architecture that generates diverse and realistic gaze scanpaths using diffusion models and Vision Transformers, outperforming existing methods.
Details
Motivation: Existing deep learning models generate averaged behaviors, failing to capture variability in human visual exploration. Method: ScanDiff combines diffusion models with Vision Transformers and uses textual conditioning for task-driven scanpath generation. Result: ScanDiff outperforms state-of-the-art methods in both free-viewing and task-driven scenarios, generating more diverse and accurate scanpaths. Conclusion: ScanDiff is able to capture the complexity of human visual behavior, pushing gaze prediction research forward. Abstract: Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.
[71] Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging
Krishan Agyakari Raja Babu,Om Prabhu,Annu,Mohanasankar Sivaprakasam
Main category: cs.CV
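TL;DR: This study finds that deep learning super-resolution techniques (especially SRResNet) can significantly improve diagnostic performance on low-quality echocardiograms, offering a viable approach to AI-assisted care in resource-constrained settings.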
Details
Motivation: In resource-constrained settings, low-quality ultrasound images limit the effectiveness of diagnostic models; super-resolution has shown promise on MRI and CT images, but its application to ultrasound images remains to be explored. Method: Deep learning super-resolution models (SRGAN and SRResNet) are used to enhance low-quality 2D echocardiograms, and their effect on classification tasks is evaluated. Result: SRResNet performs best at improving image quality and classification accuracy, particularly on the End-Diastole vs. End-Systole phase classification task. Conclusion: Applying super-resolution can effectively recover the diagnostic value of low-quality echocardiograms, making AI-assisted care in resource-constrained settings feasible. Abstract: Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.
[72] Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields
Ranxi Lin,Canming Yao,Jiayi Li,Weihang Liu,Xin Lou,Pingqiang Zhou
Main category: cs.CV
TL;DR: The paper proposes PATA, a spike-based Neural Radiance Fields (NeRF) framework with dynamic time step adjustment, significantly reducing computational demands and power usage without sacrificing rendering quality.
Details
Motivation: NeRF-based models are computationally expensive due to dense point sampling, limiting their use in resource-constrained environments. Spiking Neural Networks (SNNs) offer energy efficiency, prompting the need for an optimized framework. Method: Proposed a spike-based NeRF framework with a dynamic time step training strategy called Pretrain-Adaptive Time-step Adjustment (PATA), anchored on the Instant-NGP architecture. Result: The PATA method reduces inference time steps by 64% and running power by 61.55% while preserving rendering fidelity across diverse datasets. Conclusion: PATA enables scene-adaptive inference with variable time steps, significantly reducing computational resources and power consumption while maintaining rendering quality. Abstract: Neural Radiance Fields (NeRF)-based models have achieved remarkable success in 3D reconstruction and rendering tasks. However, during both training and inference, these models rely heavily on dense point sampling along rays from multiple viewpoints, resulting in a surge in floating-point operations and severely limiting their use in resource-constrained scenarios like edge computing. Spiking Neural Networks (SNNs), which communicate via binary spikes over discrete time steps, offer a promising alternative due to their energy-efficient nature. Given the inherent variability in scene scale and texture complexity in neural rendering and the prevailing practice of training separate models per scene, we propose a spike-based NeRF framework with a dynamic time step training strategy, termed Pretrain-Adaptive Time-step Adjustment (PATA). This approach automatically explores the trade-off between rendering quality and time step length during training. Consequently, it enables scene-adaptive inference with variable time steps and reduces the additional consumption of computational resources in the inference process. 
Anchoring to the established Instant-NGP architecture, we evaluate our method across diverse datasets. The experimental results show that PATA can preserve rendering fidelity while reducing inference time steps by 64% and running power by 61.55%.
[73] Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving
Santosh Patapati,Trisanth Srinivasan
Main category: cs.CV
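TL;DR: NovaDrive is an efficient processing architecture for autonomous driving that improves safety and efficiency and has potential environmental benefits.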
Details
Motivation: Autonomous vehicles must react quickly in complex situations while understanding road geometry and traffic intent, which calls for an efficient processing architecture. Method: The team introduces NovaDrive, a single-branch vision-language architecture that jointly processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints, using a lightweight two-stage cross-attention block and a new smoothness loss. Result: On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84%, improves path efficiency (SPL) to 0.66, and reduces collision frequency from 2.6% to 1.2%. Conclusion: NovaDrive not only improves the safety and path efficiency of autonomous driving but also, by reducing fuel or battery usage, makes greener driving systems possible, and it can be extended to other AI domains. Abstract: Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion each contribute the most to these gains. Beyond safety, NovaDrive's shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
[74] Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation
Alexandru Buburuzan
Main category: cs.CV
TL;DR: This paper presents MObI and AnydoorMed, two novel synthetic data generation methods for autonomous driving and medical imaging that enable realistic, controllable, and multimodal object inpainting using diffusion models.
Details
Motivation: Safety-critical applications require extensive multimodal data for testing, but gathering real-world data is costly and complex. Synthetic data methods offer a solution, but they must provide high realism and controllability. Method: The paper introduces two novel methods: MObI for autonomous driving, which uses a diffusion model for multimodal object inpainting with 3D bounding box conditioning; and AnydoorMed for medical image analysis, which extends the inpainting paradigm to mammography scans using a diffusion-based model. Result: MObI enables seamless object insertion into multimodal scenes for autonomous driving, while AnydoorMed achieves detailed and semantically coherent inpainting of anomalies in mammography scans. Conclusion: The paper concludes that foundation models for reference-guided inpainting in natural images can be adapted to various perceptual modalities, enabling the creation of realistic, controllable, and multimodal counterfactual scenarios. Abstract: Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. Synthetic data methods are gaining prominence due to the cost and complexity of gathering real-world data, but they demand a high degree of realism and controllability to be useful. This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively. MObI is a first-of-its-kind framework for Multimodal Object Inpainting that leverages a diffusion model to produce realistic and controllable object inpaintings across perceptual modalities, demonstrated simultaneously for camera and lidar. Given a single reference RGB image, MObI enables seamless object insertion into existing multimodal scenes at a specified 3D location, guided by a bounding box, while maintaining semantic consistency and multimodal coherence. 
Unlike traditional inpainting methods that rely solely on edit masks, this approach uses 3D bounding box conditioning to ensure accurate spatial positioning and realistic scaling. AnydoorMed extends this paradigm to the medical imaging domain, focusing on reference-guided inpainting for mammography scans. It leverages a diffusion-based model to inpaint anomalies with impressive detail preservation, maintaining the reference anomaly's structural integrity while semantically blending it with the surrounding tissue. Together, these methods demonstrate that foundation models for reference-guided inpainting in natural images can be readily adapted to diverse perceptual modalities, paving the way for the next generation of systems capable of constructing highly realistic, controllable and multimodal counterfactual scenarios.
[75] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints
Santosh Patapati,Trisanth Srinivasan,Murari Ambati
Main category: cs.CV
TL;DR: XYZ-Drive is a single vision-language model that combines intent and map-layout information to improve how autonomous cars navigate complex environments.
Details
Motivation: Autonomous cars need both geometric accuracy and semantic understanding to navigate complex environments, yet most systems handle the two separately. Method: XYZ-Drive uses a lightweight goal-centered cross-attention layer and a single vision-language model to process a front-camera frame, an overhead map, and the next waypoint, and outputs steering and speed. Result: On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving the collision rate. Conclusion: XYZ-Drive achieves early, token-level fusion of intent and map layout, enabling accurate, transparent, real-time driving. Abstract: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25 m × 25 m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts performance by 3%, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
[76] Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
Dmitry Demidov,Zaigham Zaheer,Omkar Thawakar,Salman Khan,Fahad Shahbaz Khan
Main category: cs.CV
TL;DR: E-FineR is a training-free method for fine-grained image classification that leverages LLMs and VLMs for open-set recognition, achieving strong results with greater interpretability and real-world applicability.
Details
Motivation: Traditional methods are limited by fixed vocabularies and closed-set classification, which hampers scalability in real-world settings with emerging classes. Recent methods underutilize LLMs and depend on unrefined guessed class names. Method: E-FineR combines large language models (LLMs) and vision-language models (VLMs) for open-set recognition without predefined class labels, enhancing classification without training or human intervention. Result: E-FineR achieves performance on par with existing state-of-the-art approaches in zero-shot and few-shot classification while being training-free and avoiding human intervention. Conclusion: E-FineR is a training-free method that achieves state-of-the-art results in fine-grained visual recognition, offering interpretability and adaptability in real-world scenarios with difficult expert annotations. Abstract: Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. 
To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we apply our approach to zero-shot and few-shot classification, where it performs on par with the existing SOTA while being training-free and requiring no human intervention. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available on https://github.com/demidovd98/e-finer.
[77] Details Matter for Indoor Open-vocabulary 3D Instance Segmentation
Sanghun Jung,Jingjing Zheng,Ke Zhang,Nan Qiao,Albert Y. C. Chen,Lu Xia,Chi Liu,Yuyin Sun,Xiao Zeng,Hsiang-Wei Huang,Byron Boots,Min Sun,Cheng-Hao Kuo
Main category: cs.CV
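TL;DR: This paper proposes a new method for open-vocabulary 3D instance segmentation that combines several complementary concepts and achieves state-of-the-art performance with a two-stage scheme.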
Details
Motivation: Existing open-vocabulary 3D instance segmentation methods typically use vision-language models to generate 3D instance proposals and classify them, but the individual concepts in these methods are complementary rather than mutually exclusive. The paper therefore aims to design a new method that combines these concepts and addresses the key challenges. Method: The method follows a two-stage scheme: 3D proposal generation and instance classification. In the proposal stage, it uses robust 3D tracking-based proposal aggregation to generate 3D proposals and removes overlapping or partial proposals through iterative merging/removal. In the classification stage, it replaces the standard CLIP model with Alpha-CLIP, which takes object masks as an alpha channel to reduce background noise, and introduces the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and improving precision. Result: The framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method. Conclusion: Through a carefully designed recipe, the paper effectively combines multiple complementary concepts, addresses the key challenges of open-vocabulary 3D instance segmentation, and achieves leading performance. Abstract: Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
[78] X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention
Xiaochen Zhao,Hongyi Xu,Guoxian Song,You Xie,Chenxu Zhang,Xiu Li,Linjie Luo,Jinli Suo,Yebin Liu
Main category: cs.CV
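TL;DR: X-NeMo is a new zero-shot diffusion model for generating high-quality portrait animation; it addresses identity leakage with an end-to-end framework and a 1D latent motion descriptor, and outperforms existing methods in expressiveness and identity resemblance.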
Details
Motivation: To address the key issues of prior approaches, namely identity leakage and difficulty in capturing subtle and extreme expressions, and thereby improve the quality and expressiveness of zero-shot portrait animation. Method: The paper proposes a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image and controls motion through cross-attention. A dual GAN decoder, together with spatial and color augmentations, further enhances expressiveness and disentangles motion latents from identity cues. Result: In experiments, X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with better identity resemblance. Conclusion: X-NeMo effectively addresses identity leakage and the difficulty of capturing subtle and extreme expressions, achieving high-quality zero-shot portrait animation. Abstract: We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatial-aligned structural clues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.
[79] Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues
Xu Cao,Takafumi Taketomi
Main category: cs.CV
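TL;DR: This paper proposes a new neural inverse rendering method that jointly reconstructs geometry, reflectance, and lighting conditions from multi-view images without light calibration or per-view normal maps, and outperforms existing techniques.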
Details
Motivation: Prior multi-view photometric stereo methods require light calibration or intermediate cues such as per-view normal maps; to overcome this, the paper proposes a method that jointly reconstructs geometry, spatially varying reflectance, and lighting conditions from raw images. Method: A neural inverse rendering approach that jointly optimizes all scene parameters, represents geometry and reflectance as neural implicit fields, and applies shadow-aware volume rendering. Result: The method outperforms existing normal-guided approaches in shape and lighting estimation accuracy, generalizes to view-unaligned multi-light images, and handles challenging geometry and reflectance. Conclusion: The method improves on the state of the art in shape and lighting estimation accuracy, generalization to multi-light images, and handling of challenging geometry and reflectance. Abstract: We propose a neural inverse rendering approach that jointly reconstructs geometry, spatially varying reflectance, and lighting conditions from multi-view images captured under varying directional lighting. Unlike prior multi-view photometric stereo methods that require light calibration or intermediate cues such as per-view normal maps, our method jointly optimizes all scene parameters from raw images in a single stage. We represent both geometry and reflectance as neural implicit fields and apply shadow-aware volume rendering. A spatial network first predicts the signed distance and a reflectance latent code for each scene point. A reflectance network then estimates reflectance values conditioned on the latent code and angularly encoded surface normal, view, and light directions. The proposed method outperforms state-of-the-art normal-guided approaches in shape and lighting estimation accuracy, generalizes to view-unaligned multi-light images, and handles objects with challenging geometry and reflectance.
[80] CNN-based solution for mango classification in agricultural environments
Beatriz Díaz Peón,Jorge Torres Gómez,Ariel Fajardo Márquez
Main category: cs.CV
TL;DR: This paper designs a system using CNNs for fruit detection and classification, particularly for mangoes, aiming at efficient and accurate farm inventory management.
Details
Motivation: The goal is to develop a system that automatically assesses fruit quality for farm inventory management. Specifically, a method for mango fruit classification was developed using image processing to ensure both accuracy and efficiency. Method: ResNet-18 was selected as the preliminary architecture for classification, while a cascade detector was used for detection. A graphical interface was developed in MATLAB App Designer to display results. Result: The system exemplifies an efficient and accurate fruit detection and classification approach, balancing execution speed and computational resource consumption. Conclusion: The integration of convolutional neural networks and cascade detectors provides a reliable solution for fruit classification and detection, with potential applications in agricultural quality control. Abstract: This article exemplifies the design of a fruit detection and classification system using Convolutional Neural Networks (CNN). The goal is to develop a system that automatically assesses fruit quality for farm inventory management. Specifically, a method for mango fruit classification was developed using image processing, ensuring both accuracy and efficiency. ResNet-18 was selected as the preliminary architecture for classification, while a cascade detector was used for detection, balancing execution speed and computational resource consumption. Detection and classification results were displayed through a graphical interface developed in MATLAB App Designer, streamlining system interaction. The integration of convolutional neural networks and cascade detectors offers a reliable solution for fruit classification and detection, with potential applications in agricultural quality control.
[81] Single Image Rain Streak Removal Using Harris Corner Loss and R-CBAM Network
Jongwook Si,Sungyoung Kim
Main category: cs.CV
TL;DR: A new network for removing rain streaks from images was developed, which preserves structural details and visual quality better than previous methods.
Details
Motivation: The motivation is to effectively remove rain streaks from images while preserving fine structural details and overall visual quality. Method: A novel image restoration network with a Corner Loss and a Residual Convolutional Block Attention Module (R-CBAM) Block was proposed. Result: Quantitative evaluations showed that the proposed method achieved a PSNR of 33.29 dB on Rain100L and 26.16 dB on Rain100H. Conclusion: The proposed method for single-image rain streak removal outperforms previous approaches on the Rain100L and Rain100H datasets. Abstract: The problem of single-image rain streak removal goes beyond simple noise suppression, requiring the simultaneous preservation of fine structural details and overall visual quality. In this study, we propose a novel image restoration network that effectively constrains the restoration process by introducing a Corner Loss, which prevents the loss of object boundaries and detailed texture information during restoration. Furthermore, we integrate a Residual Convolutional Block Attention Module (R-CBAM) Block into the encoder and decoder to dynamically adjust the importance of features in both spatial and channel dimensions, enabling the network to focus more effectively on regions heavily affected by rain streaks. Quantitative evaluations conducted on the Rain100L and Rain100H datasets demonstrate that the proposed method significantly outperforms previous approaches, achieving a PSNR of 33.29 dB on Rain100L and 26.16 dB on Rain100H.
[82] Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space
Shiyao Yu,Zi-An Wang,Kangning Yin,Zheng Tian,Mingyuan Zhang,Weixin Si,Shihao Zou
Main category: cs.CV
TL;DR: A new multi-modal motion retrieval framework incorporating audio improves retrieval performance significantly over existing methods.
Details
Motivation: Existing motion retrieval methods rely only on text or visual modality, lacking intuitive interaction and overlooking sequential representation for better performance. Method: A sequence-level contrastive learning approach to align four modalities (text, audio, video, and motion) in a unified embedding space. Result: The framework outperforms state-of-the-art methods with a 10.16% improvement in R@10 for text-to-motion retrieval and 25.43% improvement in R@1 for video-to-motion retrieval on HumanML3D dataset. Conclusion: The proposed multi-modal framework significantly enhances motion retrieval performance by incorporating audio with text, video, and motion in a fine-grained joint embedding space. Abstract: Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. 
Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
[83] A Novel Dataset for Flood Detection Robust to Seasonal Changes in Satellite Imagery
Youngsun Jang,Dongyoun Kim,Chulwoo Pack,Kwanghee Won
Main category: cs.CV
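TL;DR: This paper introduces a new satellite-image flood segmentation dataset to address the shortage of suitable data, evaluates current models on it, and points to future research directions.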
Details
Motivation: A review of 77 existing benchmarks that use satellite imagery revealed a shortage of datasets suitable for flooded-area segmentation. Method: Satellite imagery of the 2019 Midwestern USA floods was collected to build a dataset with 10 images per location, covering ten locations in each of five states, and state-of-the-art computer vision and remote sensing models were tested on it to evaluate semantic segmentation performance. Result: The models achieved only modest results, indicating a need for further methodological improvement, particularly in multimodal and temporal learning. Conclusion: The paper presents a new dataset for segmenting flooded areas in satellite images and identifies multimodal and temporal learning strategies as future research needed to improve model performance. Abstract: This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available on
[84] Adversarial-Guided Diffusion for Multimodal LLM Attacks
Chengwei Xia,Fan Ma,Ruijie Quan,Kun Zhan,Yi Yang
Main category: cs.CV
TL;DR: This paper proposes an adversarial-guided diffusion method to generate images that can deceive MLLMs while remaining robust against various defenses.
Details
Motivation: The motivation is to generate adversarial images that can effectively deceive multimodal large language models (MLLMs) without significantly distorting the clean images. Method: An adversarial-guided diffusion (AGD) approach is proposed, which introduces adversarial-guided noise into the reverse diffusion process of generating images. Result: Extensive experiments show that AGD outperforms existing methods in terms of attack performance and robustness against some defenses. Conclusion: The proposed AGD approach is effective in generating adversarial images that can deceive MLLMs while maintaining robustness against various defenses. Abstract: This paper addresses the challenge of generating adversarial images using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attacks on MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Thus, when applying defenses such as simple low-pass filtering, which act independently on each component, the adversarial signal within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to a variety of defenses.
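The robustness argument in this abstract rests on the linearity of low-pass filtering. As a rough illustration only, and not the authors' AGD implementation, the Python sketch below uses a 1D toy signal, a two-tap moving average as a stand-in low-pass filter, and two hypothetical perturbations: one confined to the highest frequency and one spread over the whole spectrum. The names `lowpass` and `retained_energy`, the signal shapes, and the filter choice are all assumptions made for this example.

```python
import numpy as np

def lowpass(x):
    """Two-tap moving average: a simple linear low-pass filter (illustrative stand-in)."""
    return np.convolve(x, [0.5, 0.5], mode="valid")

def retained_energy(p):
    """Fraction of a perturbation's energy that survives the filter."""
    return float(np.sum(lowpass(p) ** 2) / np.sum(p ** 2))

rng = np.random.default_rng(0)
n = 1024
clean = np.sin(2 * np.pi * np.arange(n) / 128)   # smooth toy "clean image" row
high_freq = 0.1 * (-1.0) ** np.arange(n)         # perturbation confined to the highest frequency
broadband = 0.1 * rng.standard_normal(n)         # perturbation spread over the whole spectrum

# Linearity: filtering the sum equals the sum of the filtered components,
# so the filter acts on the clean part and the perturbation independently.
is_linear = np.allclose(lowpass(clean + broadband),
                        lowpass(clean) + lowpass(broadband))

hf_retained = retained_energy(high_freq)
bb_retained = retained_energy(broadband)
print(is_linear, hf_retained, bb_retained)
```

Under these toy assumptions the alternating perturbation is cancelled by the filter, while a substantial share (roughly half) of the broadband perturbation's energy survives, which mirrors the intuition stated in the abstract. It says nothing about real image defenses or about AGD's measured robustness.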
Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in model robustness to some defenses.
[85] Confidence-aware agglomeration classification and segmentation of 2D microscopic food crystal images
Xiaoyu Ji,Ali Shakouri,Fengqing Zhu
Main category: cs.CV
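TL;DR: This paper proposes a new method that combines a supervised baseline model with an instance classification model to improve agglomeration classification accuracy and size distribution prediction for food crystals.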
Details
Motivation: Manual annotation of agglomeration in 2D microscopic images is particularly difficult and existing methods have limitations, so a new method is needed to improve the accuracy of agglomeration classification and prediction. Method: A supervised baseline model is first proposed to generate segmentation pseudo-labels; an instance classification model that simultaneously performs pixel-wise segmentation is then trained, and the strengths of both are combined at inference. Result: Compared with existing methods, the approach improves true positive agglomeration classification accuracy and size distribution prediction, and successfully classifies potential agglomerated instances under two confidence levels. Conclusion: The paper proposes a method combining a supervised baseline model and an instance classification model to improve agglomeration classification accuracy and size distribution prediction for food crystals, with a post-processing module that preserves crystal properties. Abstract: Food crystal agglomeration is a phenomenon that occurs during crystallization, trapping water between crystals and affecting food product quality. Manual annotation of agglomeration in 2D microscopic images is particularly difficult due to the transparency of water bonding and the limited perspective focusing on a single slide of the imaged sample. To address this challenge, we first propose a supervised baseline model to generate segmentation pseudo-labels for the coarsely labeled classification dataset. Next, an instance classification model that simultaneously performs pixel-wise segmentation is trained. Both models are used in the inference stage to combine their respective strengths in classification and segmentation. To preserve crystal properties, a post-processing module is designed and included in both steps. Our method improves true positive agglomeration classification accuracy and size distribution predictions compared to other existing methods. Given the variability in confidence levels of manual annotations, our proposed method is evaluated under two confidence levels and successfully classifies potential agglomerated instances.
[86] YOLO-ROC: A High-Precision and Ultra-Lightweight Model for Real-Time Road Damage Detection
Zicheng Lin,Weichao Pan
Main category: cs.CV
TL;DR: This paper introduces YOLO-ROC, an efficient and accurate model for road damage detection, addressing multi-scale feature extraction and computational efficiency challenges.
Details
Motivation: Road damage detection is essential for traffic safety and infrastructure maintenance, but existing deep learning methods struggle with multi-scale feature extraction and high computational demands. Method: The paper proposes a Bidirectional Multi-scale Spatial Pyramid Pooling Fast (BMS-SPPF) module and a hierarchical channel compression strategy to enhance detection performance and reduce computational demands. Result: On the RDD2022_China_Drone dataset, YOLO-ROC achieved a mAP50 of 67.6%, surpassing YOLOv8n by 2.11%, with a 16.8% improvement in small-target detection and a model size reduction to 2.0 MB. Conclusion: YOLO-ROC is a high-precision and lightweight model for road damage detection, offering improved multi-scale feature extraction and reduced computational complexity. Abstract: Road damage detection is a critical task for ensuring traffic safety and maintaining infrastructure integrity. While deep learning-based detection methods are now widely adopted, they still face two core challenges: first, the inadequate multi-scale feature extraction capabilities of existing networks for diverse targets like cracks and potholes, leading to high miss rates for small-scale damage; and second, the substantial parameter counts and computational demands of mainstream models, which hinder their deployment for efficient, real-time detection in practical applications. To address these issues, this paper proposes a high-precision and lightweight model, YOLO - Road Orthogonal Compact (YOLO-ROC). We designed a Bidirectional Multi-scale Spatial Pyramid Pooling Fast (BMS-SPPF) module to enhance multi-scale feature extraction and implemented a hierarchical channel compression strategy to reduce computational complexity. The BMS-SPPF module leverages a bidirectional spatial-channel attention mechanism to improve the detection of small targets. Concurrently, the channel compression strategy reduces the parameter count from 3.01M to 0.89M and GFLOPs from 8.1 to 2.6. 
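To put the reported compression figures in relative terms, the short Python sketch below computes the percentage reductions implied by the numbers quoted in the abstract (3.01M to 0.89M parameters, 8.1 to 2.6 GFLOPs). The helper name `relative_reduction` is hypothetical, and the percentages are derived here from the quoted figures rather than reported by the authors.

```python
def relative_reduction(before, after):
    """Percentage reduction when going from `before` to `after`."""
    return 100.0 * (before - after) / before

params_reduction = relative_reduction(3.01, 0.89)   # parameter count, in millions
gflops_reduction = relative_reduction(8.1, 2.6)     # GFLOPs

print(f"parameters: {params_reduction:.1f}% fewer")  # parameters: 70.4% fewer
print(f"GFLOPs:     {gflops_reduction:.1f}% fewer")  # GFLOPs:     67.9% fewer
```

Both quantities shrink by roughly two thirds or more, which is the sense in which the abstract describes the model as lightweight.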
Experiments on the RDD2022_China_Drone dataset demonstrate that YOLO-ROC achieves a mAP50 of 67.6%, surpassing the baseline YOLOv8n by 2.11%. Notably, the mAP50 for the small-target D40 category improved by 16.8%, and the final model size is only 2.0 MB. Furthermore, the model exhibits excellent generalization performance on the RDD2022_China_Motorbike dataset.
[87] Toward Safe, Trustworthy and Realistic Augmented Reality User Experience
Yanming Xiu
Main category: cs.CV
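TL;DR: This research aims to ensure the safety and trustworthiness of virtual content in augmented reality (AR); it developed two systems, ViDDAR and VIM-Sense, to detect harmful AR content, and proposes three future directions.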
Details
Motivation: As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Method: Two systems, ViDDAR and VIM-Sense, were developed to detect harmful AR content using vision-language models (VLMs) and multimodal reasoning modules. Result: Three future directions are proposed: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient, user-centered deployment on AR devices. Conclusion: The research aims to establish a scalable, human-aligned framework for safeguarding AR content and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation. Abstract: As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules. Building on this foundation, we propose three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Overall, our work aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation.
[88] Ambiguity-Guided Learnable Distribution Calibration for Semi-Supervised Few-Shot Class-Incremental Learning
Fan Lyu,Linglan Zhao,Chengyan Liu,Yinying Mei,Zhang Zhang,Jian Zhang,Fuyuan Hu,Liang Wang
Main category: cs.CV
TL;DR: This paper introduces GSemi-FSCIL, a generalized redefinition of Semi-FSCIL, and proposes the ALDC strategy to improve the model's ability to distinguish between unlabeled samples from base and novel classes, achieving state-of-the-art results.
Details
Motivation: The study aims to redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) to better align with real-world scenarios, where unlabeled sets include both base and all ever-seen novel classes. Method: An Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy is proposed to dynamically correct biased feature distributions for few-shot novel classes using abundant base samples. Result: Experiments on three benchmark datasets show that the proposed method outperforms existing works, setting new state-of-the-art results. Conclusion: The proposed ALDC strategy effectively addresses the challenge of distinguishing between unlabeled samples from base and novel classes in GSemi-FSCIL, outperforming existing methods. Abstract: Few-Shot Class-Incremental Learning (FSCIL) focuses on models learning new concepts from limited data while retaining knowledge of previous classes. Recently, many studies have started to leverage unlabeled samples to assist models in learning from few-shot samples, giving rise to the field of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL). However, these studies often assume that the source of unlabeled data is confined to novel classes of the current session, which presents a narrow perspective and cannot align well with practical scenarios. To better reflect real-world scenarios, we redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) by incorporating both base and all the ever-seen novel classes in the unlabeled set. This change in the composition of unlabeled samples poses a new challenge for existing methods, as they struggle to distinguish between unlabeled samples from base and novel classes. To address this issue, we propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy. ALDC dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes.
Experiments on three benchmark datasets show that our method outperforms existing works, setting new state-of-the-art results.[89] Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents
Sungguk Cha,DongWook Kim,Taeseung Hahn,Mintae Kim,Youngsub Han,Byoung-Ki Jeon
Main category: cs.CV
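TL;DR: RL-QR is a reinforcement learning method that needs no human annotation and optimizes query generation in RAG systems; it markedly improves performance for multi-modal and lexical retrievers but does not yet improve semantic retrievers.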
Details
Motivation: Existing retrieval-augmented generation (RAG) systems depend on effective query formulation to exploit external knowledge, but optimizing queries for diverse, unstructured real-world documents remains a challenge. Method: RL-QR uses reinforcement learning with Generalized Reward Policy Optimization (GRPO) and synthesized scenario-question pairs to train retriever-specific query rewriters without human-annotated datasets. Result: Experiments on industrial in-house data show that RL-QR achieves an 11% relative gain in NDCG@3 for multi-modal RAG and a 9% gain for lexical retrievers. Semantic and hybrid retrievers, however, showed no improvement. Conclusion: The RL-QR framework shows potential for query optimization in RAG systems, with clear gains for multi-modal and lexical retrievers, but semantic and hybrid retrievers remain a challenge and need further work. Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with $\text{RL-QR}_{\text{multi-modal}}$ achieving an 11% relative gain in NDCG@3 for multi-modal RAG and $\text{RL-QR}_{\text{lexical}}$ yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR's potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
[90] Automated Mapping the Pathways of Cranial Nerve II, III, V, and VII/VIII: A Multi-Parametric Multi-Stage Diffusion Tractography Atlas
Lei Xie,Jiahao Huang,Jiawei Zhang,Jianzhong He,Yiang Pan,Guoqiang Xie,Mengjun Li,Qingrun Zeng,Mingchu Li,Yuanjing Feng
Main category: cs.CV
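TL;DR: This study presents the first diffusion tractography atlas for automated mapping of cranial nerve pathways in the human brain, analyzing roughly one million streamlines with a multi-stage clustering method to achieve high-accuracy cranial nerve identification.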
Details
Motivation: Cranial nerves play a key role in brain function, and accurately mapping their pathways matters for preoperative planning and for understanding brain structure, but the complexity of the skull base environment makes building a detailed atlas challenging. Method: Streamlines are generated for each pair of cranial nerves with multi-parametric fiber tractography, and roughly 1,000,000 streamlines are analyzed with a multi-stage fiber clustering strategy to build the cranial nerve atlas. Result: The atlas automatically identifies 8 fiber bundles associated with 5 pairs of cranial nerves and shows high spatial correspondence and robustness across multiple datasets and clinical cases. Conclusion: The study develops a comprehensive diffusion tractography atlas for automated mapping of cranial nerve pathways in the human brain, supporting more efficient analysis and understanding of complex brain structures. Abstract: Cranial nerves (CNs) play a crucial role in various essential functions of the human brain, and mapping their pathways from diffusion MRI (dMRI) provides valuable preoperative insights into the spatial relationships between individual CNs and key tissues. However, mapping a comprehensive and detailed CN atlas is challenging because of the unique anatomical structures of each CN pair and the complexity of the skull base environment. In this work, we present what we believe to be the first study to develop a comprehensive diffusion tractography atlas for automated mapping of CN pathways in the human brain. The CN atlas is generated by fiber clustering using the streamlines generated by multi-parametric fiber tractography for each pair of CNs. Instead of disposable clustering, we explore a new strategy of multi-stage fiber clustering for multiple analysis of approximately 1,000,000 streamlines generated from the 50 subjects from the Human Connectome Project (HCP). Quantitative and visual experiments demonstrate that our CN atlas achieves high spatial correspondence with expert manual annotations on multiple acquisition sites, including the HCP dataset, the Multi-shell Diffusion MRI (MDM) dataset and two clinical cases of pituitary adenoma patients. The proposed CN atlas can automatically identify 8 fiber bundles associated with 5 pairs of CNs, including the optic nerve CN II, oculomotor nerve CN III, trigeminal nerve CN V and facial-vestibulocochlear nerve CN VII/VIII, and its robustness is demonstrated experimentally.
This work contributes to the field of diffusion imaging by facilitating more efficient and automated mapping of the pathways of multiple pairs of CNs, thereby enhancing the analysis and understanding of complex brain structures through visualization of their spatial relationships with nearby anatomy.
[91] A Deep Dive into Generic Object Tracking: A Survey
Fereshteh Aghaee Meibodi,Shadi Alijani,Homayoun Najjaran
Main category: cs.CV
TL;DR: This paper reviews object tracking methods, focusing on transformer-based approaches, and provides a comprehensive analysis of their design principles, innovations, and progress in handling complex tracking challenges.
Details
Motivation: The motivation is to provide a comprehensive review of all three tracking categories (Siamese-based, discriminative, and transformer-based) with particular emphasis on the rapidly evolving transformer-based methods, addressing the challenges in generic object tracking such as occlusions and appearance variations. Method: The paper employs qualitative and quantitative comparisons to analyze core design principles, innovations, and limitations of different tracking approaches. It also provides a unified visual and tabular comparison of representative methods. Result: The result is a novel categorization of tracking methods, a unified visual and tabular comparison, and an analysis of major evaluation benchmarks. The study highlights the advancements in transformer-based tracking. Conclusion: The paper concludes that transformer-based tracking methods are advancing rapidly due to their robust spatio-temporal modeling capabilities, and it emphasizes the importance of a comprehensive review encompassing various tracking paradigms. Abstract: Generic object tracking remains an important yet challenging task in computer vision due to complex spatio-temporal dynamics, especially in the presence of occlusions, similar distractors, and appearance variations. Over the past two decades, a wide range of tracking paradigms, including Siamese-based trackers, discriminative trackers, and, more recently, prominent transformer-based approaches, have been introduced to address these challenges. While a few existing survey papers in this field have either concentrated on a single category or widely covered multiple ones to capture progress, our paper presents a comprehensive review of all three categories, with particular emphasis on the rapidly evolving transformer-based methods. We analyze the core design principles, innovations, and limitations of each approach through both qualitative and quantitative comparisons. 
Our study introduces a novel categorization and offers a unified visual and tabular comparison of representative methods. Additionally, we organize existing trackers from multiple perspectives and summarize the major evaluation benchmarks, highlighting the fast-paced advancements in transformer-based tracking driven by their robust spatio-temporal modeling capabilities.
[92] Towards Measuring and Modeling Geometric Structures in Time Series Forecasting via Image Modality
Mingyang Yu,Xiahui Guo,Peng chen,Zhenkai Li,Yang Shu
Main category: cs.CV
TL;DR: The paper introduces TGSI and SATL to better evaluate and model the geometric structure of time series data, showing improved performance over existing methods.
Details
Motivation: Traditional numerical metrics like MSE fail to evaluate the geometric structure of time series data, which is essential for understanding temporal dynamics. Method: The authors propose TGSI, a new evaluation metric that transforms time series into images, and SATL, a multi-component loss function combining first-order difference loss, frequency domain loss, and perceptual feature loss, to enhance structure modeling during training. Result: Experiments across multiple datasets show that models trained with SATL achieve superior performance in both MSE and TGSI metrics compared to baseline methods. Conclusion: Models trained with SATL outperform baseline methods in both MSE and TGSI metrics without increasing inference computational cost. Abstract: Time Series forecasting is critical in diverse domains such as weather forecasting, financial investment, and traffic management. While traditional numerical metrics like mean squared error (MSE) can quantify point-wise accuracy, they fail to evaluate the geometric structure of time series data, which is essential to understand temporal dynamics. To address this issue, we propose the time series Geometric Structure Index (TGSI), a novel evaluation metric that transforms time series into images to leverage their inherent two-dimensional geometric representations. However, since the image transformation process is non-differentiable, TGSI cannot be directly integrated as a training loss. We further introduce the Shape-Aware Temporal Loss (SATL), a multi-component loss function operating in the time series modality to bridge this gap and enhance structure modeling during training. 
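As a rough illustration of the first two SATL components named in the Method summary above, the sketch below computes a first-order difference loss and a simple frequency-domain loss in plain Python. It is only a sketch under assumptions: the function names are hypothetical, the frequency term compares all DFT coefficients without the noise handling the abstract mentions, a naive DFT stands in for the Fast Fourier Transform, and the perceptual feature loss is omitted because it depends on pre-trained networks not described here.

```python
import cmath


def _check(pred, target):
    # Both series must be aligned and long enough to form differences.
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("series must have equal length >= 2")


def first_order_difference_loss(pred, target):
    """MSE between the first-order differences of two series."""
    _check(pred, target)
    d_pred = [b - a for a, b in zip(pred, pred[1:])]
    d_target = [b - a for a, b in zip(target, target[1:])]
    return sum((p - t) ** 2 for p, t in zip(d_pred, d_target)) / len(d_pred)


def dft(series):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        for k in range(n)
    ]


def frequency_loss(pred, target):
    """Mean squared magnitude of the difference between DFT coefficients."""
    _check(pred, target)
    return sum(abs(p - t) ** 2 for p, t in zip(dft(pred), dft(target))) / len(pred)


pred = [1, 2, 3, 4]
target = [2, 3, 4, 5]  # same shape as pred, shifted up by a constant
shape_term = first_order_difference_loss(pred, target)
freq_term = frequency_loss(pred, target)
```

In this toy case the two series differ only by a constant offset, so the difference-based term is zero while the frequency term is not: the first responds to shape alone, the second also picks up the shift through the zero-frequency coefficient.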
SATL combines three components: a first-order difference loss that measures structural consistency through the MSE between first-order differences, a frequency domain loss that captures essential periodic patterns using the Fast Fourier Transform while minimizing noise, and a perceptual feature loss that measures geometric structure difference in time-series by aligning temporal features with geometric structure features through a pre-trained temporal feature extractor and time-series image autoencoder. Experiments across multiple datasets demonstrate that models trained with SATL achieve superior performance in both MSE and the proposed TGSI metrics compared to baseline methods, without additional computational cost during inference.
[93] Learning Semantic-Aware Threshold for Multi-Label Image Recognition with Partial Labels
Haoxian Ruan,Zhihua Xu,Zhijing Yang,Guang Ma,Jieming Xie,Changxiang Fan,Tianshui Chen
Main category: cs.CV
TL;DR: This paper proposes the SATL algorithm for multi-label image recognition with partial labels, which dynamically learns category-specific thresholds and improves performance by enhancing discrimination between positive and negative samples.
Details
Motivation: Traditional methods for multi-label image recognition with partial labels often produce inaccurate and incomplete pseudo-labels due to fixed thresholds and failure to account for varying score distributions across categories. Method: The study introduces the Semantic-Aware Threshold Learning (SATL) algorithm, which dynamically calculates score distributions and thresholds for each category and implements a differential ranking loss to improve discrimination. Result: Experiments on large-scale datasets like Microsoft COCO and VG-200 show that the SATL algorithm significantly improves performance in scenarios with limited labels. Conclusion: The proposed SATL algorithm significantly improves the performance of multi-label image recognition with partial labels by dynamically determining category-specific thresholds and enhancing discrimination between positive and negative samples. Abstract: Multi-label image recognition with partial labels (MLR-PL) is designed to train models using a mix of known and unknown labels. Traditional methods rely on semantic or feature correlations to create pseudo-labels for unidentified labels using pre-set thresholds. This approach often overlooks the varying score distributions across categories, resulting in inaccurate and incomplete pseudo-labels, thereby affecting performance. In our study, we introduce the Semantic-Aware Threshold Learning (SATL) algorithm. This innovative approach calculates the score distribution for both positive and negative samples within each category and determines category-specific thresholds based on these distributions. These distributions and thresholds are dynamically updated throughout the learning process. Additionally, we implement a differential ranking loss to establish a significant gap between the score distributions of positive and negative samples, enhancing the discrimination of the thresholds. 
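The idea of category-specific thresholds that follow the score distributions can be sketched as below. This is a minimal illustration, not the paper's algorithm: the class name, the exponential moving average, and the midpoint rule for the threshold are all assumptions made for the example, and the differential ranking loss is not shown.

```python
class CategoryThresholds:
    """Tracks running mean scores of positive and negative samples per category."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # assumed smoothing factor for the running means
        self.pos_mean = {}
        self.neg_mean = {}

    def update(self, category, score, is_positive):
        # Update the running mean for the matching (category, polarity) pair.
        means = self.pos_mean if is_positive else self.neg_mean
        if category not in means:
            means[category] = score
        else:
            means[category] = (
                self.momentum * means[category] + (1 - self.momentum) * score
            )

    def threshold(self, category):
        # Assumed rule: midpoint between positive and negative running means.
        return 0.5 * (self.pos_mean[category] + self.neg_mean[category])

    def pseudo_label(self, category, score):
        # An unknown label is marked positive if its score clears the threshold.
        return score > self.threshold(category)


tracker = CategoryThresholds(momentum=0.5)
tracker.update("cat", 0.8, True)
tracker.update("cat", 0.2, False)
tracker.update("dog", 0.4, True)
tracker.update("dog", 0.0, False)
# The same score of 0.3 is accepted for "dog" but rejected for "cat".
labels = (tracker.pseudo_label("dog", 0.3), tracker.pseudo_label("cat", 0.3))
```

The point of the toy data is that a single fixed threshold would treat both categories the same way, whereas per-category statistics let a low-scoring category still receive positive pseudo-labels.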
Comprehensive experiments and analysis on large-scale multi-label datasets, such as Microsoft COCO and VG-200, demonstrate that our method significantly improves performance in scenarios with limited labels.
[94] PixNerd: Pixel Neural Field Diffusion
Shuai Wang,Ziteng Gao,Chenhui Zhu,Weilin Huang,Limin Wang
Main category: cs.CV
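TL;DR: The paper proposes PixNerd, a new single-stage, single-scale method for efficient end-to-end image generation that avoids the complex cascade pipelines and VAE of conventional approaches.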
Details
Motivation: The current success of diffusion transformers depends on the latent space compressed by a pre-trained variational autoencoder (VAE), which introduces accumulated errors and decoding artifacts. Method: The paper proposes pixel neural field diffusion (PixNerd), which models patch-wise decoding with a neural field for efficient end-to-end image generation. Result: PixNerd achieves FID scores of 2.15 on ImageNet 256×256 and 2.84 on ImageNet 512×512, and performs well on text-to-image tasks. Conclusion: PixNerd is an efficient image generation method that avoids complex pipelines and a VAE while delivering strong performance. Abstract: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.
[95] Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2
Solha Kang,Eugene Kim,Joris Vankerschaver,Utku Ozbulak
Main category: cs.CV
TL;DR: This paper investigates the use of SAM2 for low-cost, minimal-input 3D tumor segmentation in breast MRI, finding that it can provide accurate results with minimal supervision, offering an affordable alternative for resource-limited settings.
Details
Motivation: Manual interpretation of 3D breast MRI scans is labor-intensive and subjective, and the adoption of commercial AI tools is limited in low- and middle-income countries due to high costs and infrastructure demands. Method: Using a single bounding box annotation on one slice, segmentation predictions are propagated across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. Result: Center-outward propagation yields the most consistent and accurate segmentations, and SAM2 achieves strong segmentation performance under minimal supervision despite being a zero-shot model not trained for volumetric medical data. Conclusion: General-purpose foundation models like SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings. Abstract: Breast MRI provides high-resolution volumetric imaging critical for tumor assessment and treatment planning, yet manual interpretation of 3D scans remains labor-intensive and subjective. While AI-powered tools hold promise for accelerating medical image analysis, adoption of commercial medical AI products remains limited in low- and middle-income countries due to high license costs, proprietary software, and infrastructure demands. In this work, we investigate whether the Segment Anything Model 2 (SAM2) can be adapted for low-cost, minimal-input 3D tumor segmentation in breast MRI. Using a single bounding box annotation on one slice, we propagate segmentation predictions across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. We evaluate these strategies across a large cohort of patients and find that center-outward propagation yields the most consistent and accurate segmentations. 
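The three slice-wise tracking strategies differ only in the order in which slices are visited, which can be sketched as follows. This is an illustrative sketch under assumptions: the function name, the strategy strings, the choice to alternate between the two directions in the center-outward case, and the default of starting at the middle slice are not taken from the paper, and no SAM2 API is called.

```python
def slice_order(num_slices, strategy, center=None):
    """Return the order in which slice indices are visited for a 3D volume."""
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    if strategy == "top_to_bottom":
        return list(range(num_slices))
    if strategy == "bottom_to_top":
        return list(range(num_slices - 1, -1, -1))
    if strategy == "center_outward":
        # Assumed default: start from the middle slice of the volume.
        c = num_slices // 2 if center is None else center
        if not 0 <= c < num_slices:
            raise ValueError("center must be a valid slice index")
        order = [c]
        # Alternate between the slice above and the slice below the start.
        for offset in range(1, num_slices):
            for idx in (c + offset, c - offset):
                if 0 <= idx < num_slices:
                    order.append(idx)
        return order
    raise ValueError(f"unknown strategy: {strategy}")


order = slice_order(5, "center_outward")
```

With five slices and the default start, the center-outward order visits the middle slice first and then moves away from it one step at a time in both directions, so every newly visited slice is adjacent to one that already has a prediction.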
Despite being a zero-shot model not trained for volumetric medical data, SAM2 achieves strong segmentation performance under minimal supervision. We further analyze how segmentation performance relates to tumor size, location, and shape, identifying key failure modes. Our results suggest that general-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings.
[96] iLRM: An Iterative Large 3D Reconstruction Model
Gyeongjin Kang,Seungtae Nam,Xiangyu Sun,Sameh Khamis,Abdelrahman Mohamed,Eunbyung Park
Main category: cs.CV
TL;DR: iLRM introduces an efficient and scalable approach for 3D Gaussian reconstruction by using iterative refinement and a two-stage attention mechanism, significantly improving reconstruction quality and speed compared to existing methods.
Details
Motivation: Current state-of-the-art methods based on transformer architectures face scalability issues due to full attention across image tokens from multiple views, leading to high computational costs. iLRM aims to address these limitations for scalable and efficient 3D reconstruction. Method: iLRM generates 3D Gaussian representations through an iterative refinement mechanism, guided by three principles: decoupling scene representation from input-view images, decomposing multi-view interactions into a two-stage attention scheme, and injecting high-resolution information at each layer. Result: Experimental results on datasets like RE10K and DL3DV show that iLRM surpasses existing methods in terms of reconstruction quality and speed, with notable improvements in scalability and quality under similar computational costs. Conclusion: iLRM is a scalable and efficient feed-forward 3D reconstruction method that outperforms existing methods in both reconstruction quality and speed, particularly under comparable computational cost by efficiently leveraging a larger number of input views. Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. 
Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.
[97] UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
Hao Tang,Chenwei Xie,Xiaoyi Bao,Tingyu Weng,Pandeng Li,Yun Zheng,Liwei Wang
Main category: cs.CV
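TL;DR: This paper proposes UniLIP, which extends CLIP through a two-stage training scheme and a self-distillation strategy so that it performs well on understanding, generation, and editing tasks.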
Details
Motivation: Previous CLIP-based unified methods require additional diffusion decoders or quantization, which leads to inconsistent reconstruction or degraded comprehension performance; this work aims to address that problem. Method: The paper introduces a two-stage training scheme and a self-distillation strategy, and proposes a dual-condition architecture to connect the MLLM and the diffusion transformer. Result: In text-to-image generation, UniLIP scores 0.87 on GenEval and 0.53 on WISE, and it scores 3.62 on image editing. Conclusion: UniLIP effectively extends the application scope of CLIP, performing well on understanding tasks while also achieving competitive results on generation and editing tasks. Abstract: In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrates reconstruction capabilities into CLIP, allowing it to maintain original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last layer multimodal hidden states as joint conditions. This method not only enables the utilization of the MLLM's strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation tasks, UniLIP obtains scores of 0.87 and 0.53 on the GenEval and WISE benchmarks respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP also achieves a score of 3.62 on the ImgEdit Benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1.
UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks.
[98] Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
Dohwan Ko,Ji Soo Lee,Minhyuk Choi,Zihang Meng,Hyunwoo J. Kim
Main category: cs.CV
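TL;DR: This paper proposes BLiM, a new text-video retrieval framework that, combined with the CPN module, effectively mitigates candidate prior bias and improves retrieval relevance.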
Details
Motivation: Existing multi-modal large language models tend to introduce candidate prior bias in text-video retrieval, favoring candidates with inherently high prior probability over those most relevant to the query. Method: BLiM uses multi-modal large language models (MLLMs) to generate both text from video and video features from text, and applies a Candidate Prior Normalization (CPN) module for score calibration. Result: BLiM with CPN improves R@1 by 6.4 on average across four text-video retrieval benchmarks, clearly outperforming previous state-of-the-art models. Conclusion: BLiM with CPN improves retrieval relevance in text-video retrieval by mitigating candidate prior bias, and CPN shows broad applicability across a range of multi-modal tasks. Abstract: Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
[99] LED Benchmark: Diagnosing Structural Layout Errors for Document Layout Analysis
Inbum Heo,Taewook Hwang,Jeesu Jung,Sangkeun Jung
Main category: cs.CV
TL;DR: The paper introduces LED, a new benchmark for evaluating structural robustness in document layout analysis, addressing limitations in existing metrics by identifying critical structural errors.
Details
Motivation: The motivation is to address the limitations of conventional metrics like IoU and mAP, which fail to detect critical structural errors such as region merging, splitting, and missing content in document layout analysis. Method: The authors proposed Layout Error Detection (LED), a new benchmark with eight error types and three tasks, and created a synthetic dataset (LED-Dataset) by injecting realistic structural errors based on empirical distributions from DLA models. Result: Experiments across LMMs showed that LED successfully differentiates structural understanding capabilities and uncovers performance issues not visible through traditional evaluation methods. Conclusion: The paper concludes that LED effectively evaluates structural robustness, revealing modality biases and trade-offs in performance that traditional metrics cannot detect. Abstract: Recent advancements in Document Layout Analysis through Large Language Models and Multimodal Models have significantly improved layout detection. However, despite these improvements, challenges remain in addressing critical structural errors, such as region merging, splitting, and missing content. Conventional evaluation metrics like IoU and mAP, which focus primarily on spatial overlap, are insufficient for detecting these errors. To address this limitation, we propose Layout Error Detection (LED), a novel benchmark designed to evaluate the structural robustness of document layout predictions. LED defines eight standardized error types, and formulates three complementary tasks: error existence detection, error type classification, and element-wise error type classification. Furthermore, we construct LED-Dataset, a synthetic dataset generated by injecting realistic structural errors based on empirical distributions from DLA models. 
Experimental results across a range of LMMs reveal that LED effectively differentiates structural understanding capabilities, exposing modality biases and performance trade-offs not visible through traditional metrics.
[100] Training-free Geometric Image Editing on Diffusion Models
Hanshen Zhu,Zhen Zhu,Kaile Zhang,Yiming Gong,Yuliang Liu,Xiang Bai
Main category: cs.CV
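TL;DR: This paper proposes FreeFine, a geometric image editing method built on a decoupled pipeline, which outperforms existing state-of-the-art methods on the new GeoBench benchmark, particularly in image fidelity and edit precision.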
Details
Motivation: Existing diffusion-based image editing methods perform poorly on large or structurally complex transformations, so a more effective approach to geometric image editing is needed. Method: The paper proposes a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement, with inpainting and refinement implemented by FreeFine, a training-free diffusion approach. Result: Experiments on the GeoBench benchmark, which covers 2D and 3D editing scenarios, show that FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially on demanding transformations. Conclusion: FreeFine is an effective geometric image editing method whose decoupled design lets it handle large or structurally complex transformations well. Abstract: We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, proving difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine
[101] ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection
Xihang Hu,Fuming Sun,Jiazhe Liu,Feilong Xu,Xiaoli Zhang
Main category: cs.CV
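TL;DR: ST-SAM is an efficient camouflaged object detection method that uses self-training and pseudo-label refinement to reach high detection performance with very little labeled data.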
Details
Motivation: Existing semi-supervised camouflaged object detection methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation when labeled data is scarce, and their multi-network architectures are computationally expensive and scale poorly. Method: The paper proposes a framework based on a self-training strategy and hybrid prompts built from pseudo-labels; it dynamically filters and expands high-confidence pseudo-labels and uses the Segment Anything Model for task-specific refinement. Result: With only 1% labeled data, ST-SAM achieves state-of-the-art performance on COD benchmark datasets, outperforming existing semi-supervised methods and even approaching fully supervised methods. Conclusion: ST-SAM is an efficient semi-supervised camouflaged object detection framework that reaches state-of-the-art performance with very little labeled data while avoiding the prediction bias and high computational overhead of conventional methods. Abstract: Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model's potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.
[102] PriorFusion: Unified Integration of Priors for Robust Road Perception in Autonomous Driving
Xuewei Tang,Mengmeng Yang,Tuopu Wen,Peijin Jia,Le Cui,Mingshang Luo,Kehua Sheng,Bo Zhang,Diange Yang,Kun Jiang
Main category: cs.CV
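TL;DR: This paper proposes PriorFusion, a unified framework that integrates semantic, geometric, and generative priors to enhance road element perception, addressing inaccurate road perception in complex environments for autonomous driving.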
Details
Motivation: In complex environments without high-definition map support, autonomous vehicles must interpret their surroundings independently, yet existing methods do not fully exploit the structured priors inherent in road elements, which leads to irregular and inaccurate predictions. Method: The paper proposes the PriorFusion framework, which introduces an instance-aware attention mechanism, builds a data-driven shape template space, and designs a diffusion-based framework that uses prior anchors to generate accurate predictions. Result: Experiments show that the method significantly improves perception accuracy on large-scale autonomous driving datasets, particularly under challenging conditions. Visualizations further confirm that its road element predictions are more accurate, regular, and coherent. Conclusion: By effectively exploiting structured priors, PriorFusion significantly improves the accuracy and robustness of road element perception, contributing to the development of autonomous driving technology. Abstract: With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.
[103] Forgetting of task-specific knowledge in model merging-based continual learning
Timm Hess,Gido M van de Ven,Tinne Tuytelaars
Main category: cs.CV
TL;DR: This paper shows that linear merging of models in continual learning preserves shared knowledge and degrades task-specific knowledge, with incremental training followed by merging yielding better results than parallel training.
Details
Motivation: The motivation behind this research is to better understand how model merging in continual learning impacts the preservation and enhancement of knowledge, particularly in terms of shared versus task-specific knowledge. Method: The investigation utilized controlled visual cues in computer vision experiments to evaluate the effects of merging models trained both incrementally and in parallel. Result: The results show that merging models in continual learning preserves or enhances shared knowledge and that unshared task-specific knowledge degrades rapidly. Additionally, models trained incrementally and then merged outperform those trained in parallel. Conclusion: The study concludes that linear merging of models in continual learning effectively preserves or enhances shared knowledge while causing rapid degradation of unshared, task-specific knowledge. Merging models formed through incremental training outperforms those trained in parallel. Abstract: This paper investigates the linear merging of models in the context of continual learning (CL). Using controlled visual cues in computer vision experiments, we demonstrate that merging largely preserves or enhances shared knowledge, while unshared task-specific knowledge rapidly degrades. We further find that merging models from an incremental training process consistently outperforms merging models trained in parallel.
[104] The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models
Alfio Ferrara,Sergio Picascia,Elisabetta Rocchetti
Main category: cs.CV
TL;DR: This paper investigates how text-to-image diffusion models encode content and style when generating artworks, using cross-attention heatmaps to isolate the influence of content and style tokens in prompts.
Details
Motivation: To understand how diffusion models internally represent artistic concepts like content and style, especially since these models are not explicitly trained to separate them. Method: The authors used cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, distinguishing the influence of content-describing and style-describing tokens. Result: Findings show that content tokens mainly affect object-related regions, while style tokens influence background and texture areas, indicating an emergent understanding of the content-style distinction in diffusion models. Conclusion: Diffusion models demonstrate varying degrees of implicit content-style separation, contributing to the understanding of how generative models represent complex artistic concepts without explicit supervision. Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. 
In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
[105] Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification
Vineet Kumar Rakesh,Soumya Mazumdar,Tapas Samanta,Sarbajit Pal,Amitabha Das
Main category: cs.CV
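TL;DR: This paper studies how hyperparameter tuning affects the accuracy and convergence speed of seven efficient deep learning models, finding that cosine learning rate decay and adjustable batch size can significantly improve performance while keeping latency and memory cost low, which suits real-time applications.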
Details
Motivation: Lightweight convolutional and Transformer-based models are vital for real-time image classification in resource-constrained applications such as embedded systems and edge devices; the study aims to analyze how hyperparameter tuning affects model accuracy and convergence behavior. Method: Seven efficient deep learning architectures are tuned and trained on the ImageNet-1K dataset, with an emphasis on real-time practicality. A comprehensive ablation study isolates the effect of key hyperparameters. Result: The analysis shows that cosine learning rate decay and adjustable batch size significantly improve accuracy and convergence speed. RepVGG-A2 achieves over 80% Top-1 accuracy while maintaining efficient inference performance in a GPU-accelerated edge deployment simulation. Conclusion: The results indicate that cosine learning rate decay and adjustable batch size can significantly improve accuracy and convergence speed while keeping latency and memory cost low. RepVGG-A2 strikes a good balance between accuracy and deployment cost. Abstract: Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess appropriateness for real-time applications, each model is assessed not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping low latency and memory cost. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. 
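For readers unfamiliar with the schedule named in this abstract, the following is a minimal sketch of the commonly used cosine annealing form of learning rate decay. The abstract does not give the exact formula or hyperparameters the authors used, so the function name `cosine_lr`, the default `lr_max` and `lr_min` values, and the step counts are assumptions made only for illustration.

```python
import math

def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Cosine-annealed learning rate at `step` out of `total_steps`.

    Standard form: lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)).
    The default values are illustrative assumptions, not the paper's settings.
    """
    # Clamp progress to [0, 1] so out-of-range steps do not wrap around.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at five evenly spaced points of a 100-step run.
schedule = [cosine_lr(s, 100) for s in range(0, 101, 25)]
```

The rate starts at `lr_max`, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches `lr_min`.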
All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.
[106] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
Jiajun Cao,Qizhe Zhang,Peidong Jia,Xuhui Zhao,Bo Lan,Xiaoan Zhang,Xiaobao Wei,Sixiang Chen,Zhuo Li,Yang Wang,Liyun Li,Xianming Liu,Ming Lu,Shanghang Zhang
Main category: cs.CV
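TL;DR: This paper proposes FastDriveVLA, a visual token pruning framework for vision-language-action models designed specifically for autonomous driving; using ReconPruner and an adversarial reconstruction strategy, it achieves state-of-the-art results on the nuScenes dataset.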
Details
Motivation: Existing visual token pruning methods for VLA models perform poorly in autonomous driving scenarios, whereas human drivers focus mainly on foreground regions while driving, so retaining the visual tokens that carry foreground information is essential for effective decision-making. Method: The paper proposes FastDriveVLA, a reconstruction-based visual token pruning framework comprising ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction, and an adversarial foreground-background reconstruction strategy used to train ReconPruner. Result: On the nuScenes closed-loop planning benchmark, the method achieves state-of-the-art performance across different pruning ratios, and ReconPruner can be applied directly to other VLA models with the same visual encoder without retraining. Conclusion: FastDriveVLA achieves efficient visual token pruning through ReconPruner, significantly improving the decision-making performance of VLA models in autonomous driving scenarios while preserving the model's generalization ability. Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. 
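As a rough illustration of what pruning visual tokens at a given ratio means, the sketch below keeps the highest-scoring fraction of tokens and preserves their original order. It is not ReconPruner: in the paper the saliency scores come from a trained pruner, whereas here `scores` is simply passed in, and the names `prune_tokens` and `keep_ratio` are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top `keep_ratio` fraction of tokens by score, in original order.

    `scores` stands in for a learned foreground saliency score per token
    (an assumption for illustration; the abstract does not specify its form).
    """
    # Always keep at least one token.
    k = max(1, round(len(tokens) * keep_ratio))
    # Rank token indices by descending score and take the top k.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so downstream layers see a consistent sequence.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]

# Four tokens, keep half: the two highest-scoring tokens survive.
pruned = prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.4, 0.8], keep_ratio=0.5)
```

A lower `keep_ratio` reduces the sequence length seen by the language model, which is where the computational savings come from.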
Our approach achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.
[107] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models
Yiming Yang,Hongbin Lin,Yueru Luo,Suzhong Fu,Chao Zheng,Xinrui Yan,Shuqi Mei,Kun Tang,Shuguang Cui,Zhen Li
Main category: cs.CV
TL;DR: FASTopoWM improves temporal perception in lane topology reasoning through a dual-stream framework combined with latent world models.
Details
Motivation: Existing lane topology reasoning methods struggle to exploit temporal information effectively, and their reliance on historical queries makes them vulnerable to pose estimation failures. Method: The paper proposes a new fast-slow lane segment topology reasoning framework and introduces latent query and BEV world models conditioned on the action latent. Result: On the OpenLane-V2 benchmark, FASTopoWM outperforms existing methods in both lane segment detection and centerline perception. Conclusion: FASTopoWM is an enhanced dual-stream lane segment topology reasoning framework that uses latent world models to improve temporal perception. Abstract: Lane segment topology reasoning provides comprehensive bird's-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, the stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).
[108] Learning Semantic Directions for Feature Augmentation in Domain-Generalized Medical Segmentation
Yingkai Wang,Yaoyao Zhu,Xiuding Cai,Yuhao Xiao,Haotian Wu,Yu Yao
Main category: cs.CV
TL;DR: A new domain generalization framework improves the robustness of medical image segmentation models under domain shifts by modulating domain-variant features while preserving anatomical consistency, outperforming existing methods on multi-center benchmarks.
Details
Motivation: Medical image segmentation often faces performance degradation due to domain shifts caused by variations in imaging conditions, scanner types, and protocols. The consistent anatomical structures in medical images present a unique challenge that the proposed method aims to address. Method: The method introduces implicit feature perturbations guided by domain statistics using a learnable semantic direction selector and a covariance-based semantic intensity sampler. Additionally, an adaptive consistency constraint is applied to stabilize feature adjustments. Result: Experiments on two public multi-center benchmarks demonstrate that the framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation across diverse clinical domains. Conclusion: The proposed domain generalization framework enhances the robustness and reliability of medical image segmentation models under domain shifts by modulating domain-variant features while maintaining anatomical consistency. Abstract: Medical image segmentation plays a crucial role in clinical workflows, but domain shift often leads to performance degradation when models are applied to unseen clinical domains. This challenge arises due to variations in imaging conditions, scanner types, and acquisition protocols, limiting the practical deployment of segmentation models. Unlike natural images, medical images typically exhibit consistent anatomical structures across patients, with domain-specific variations mainly caused by imaging conditions. This unique characteristic makes medical image segmentation particularly challenging. To address this challenge, we propose a domain generalization framework tailored for medical image segmentation. Our approach improves robustness to domain-specific variations by introducing implicit feature perturbations guided by domain statistics. 
Specifically, we employ a learnable semantic direction selector and a covariance-based semantic intensity sampler to modulate domain-variant features while preserving task-relevant anatomical consistency. Furthermore, we design an adaptive consistency constraint that is selectively applied only when feature adjustment leads to degraded segmentation performance. This constraint encourages the adjusted features to align with the original predictions, thereby stabilizing feature selection and improving the reliability of the segmentation. Extensive experiments on two public multi-center benchmarks show that our framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation performance across diverse clinical domains.
[109] Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision
Qiang Lu,Waikit Xiu,Xiying Li,Shenyu Hu,Shengbo Sun
Main category: cs.CV
TL;DR: This paper introduces a two-stage framework using open-vocabulary detection and cross-modal learning for traffic sign recognition, effectively addressing challenges posed by long-tail data distribution and small target size, achieving state-of-the-art results on the TT100K dataset.
Details
Motivation: The paper aims to overcome two significant challenges in traffic sign recognition: the long-tail distribution of the dataset that affects recognition performance for low-frequency classes, and the difficulty in extracting multi-scale features from small targets in real-world scenarios. Method: The proposed framework consists of two models: NanoVerse YOLO with RepVL-PAN and SPD-Conv modules for detection, and TSR-MCL for classification, which uses Vision Transformer and BERT to learn robust, frequency-independent representations. Result: The method achieves 78.4% mAP in the long-tail detection task on the TT100K dataset, with 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms. Conclusion: The paper proposes a novel two-stage framework combining open-vocabulary detection and cross-modal learning to address the challenges in traffic sign recognition, achieving state-of-the-art performance on the TT100K dataset. Abstract: Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, the traffic sign dataset exhibits a pronounced long-tail distribution, resulting in a substantial decline in recognition performance of traditional convolutional networks when processing low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module to specifically enhance feature extraction for small, multi-scale targets. 
For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). By contrasting visual features from a Vision Transformer with semantic features from a rule-based BERT, TSR-MCL learns robust, frequency-independent representations, effectively mitigating class confusion caused by data imbalance. On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition. The model also obtains 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms and demonstrating superior accuracy and generalization in complex, open-world scenarios.
[110] MagicRoad: Semantic-Aware 3D Road Surface Reconstruction via Obstacle Inpainting
Xingyue Peng,Yuandong Lyu,Lang Zhang,Jian Zhu,Songtao Wang,Jiaxin Deng,Songxin Lu,Weiliang Ma,Dangen She,Peng Jia,XianPeng Lang
Main category: cs.CV
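TL;DR: This paper proposes a robust road surface reconstruction method that combines occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement, effectively handling dynamic occlusions and complex environmental interference.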
Details
Motivation: Existing methods based on mesh rendering or 3D Gaussian splatting perform well under clean, static conditions, but in complex urban environments they are strongly affected by occlusions from dynamic objects, visual clutter from static obstacles, and changes in lighting and weather, so a more robust reconstruction method is needed. Method: The method uses a planar-adapted Gaussian representation for efficient large-scale modeling, applies segmentation-guided video inpainting to remove dynamic and static foreground objects, and enhances color consistency through semantic-aware correction in HSV space. Result: Extensive experiments on urban-scale datasets show that the method outperforms prior methods in both geometric accuracy and visual consistency, especially in complex real-world scenes. Conclusion: The paper proposes a robust road surface reconstruction framework that combines occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces, significantly outperforming existing methods under real-world conditions. Abstract: Road surface reconstruction is essential for autonomous driving, supporting centimeter-accurate lane perception and high-definition mapping in complex urban environments. While recent methods based on mesh rendering or 3D Gaussian splatting (3DGS) achieve promising results under clean and static conditions, they remain vulnerable to occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. We present a robust reconstruction framework that integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces. Our method leverages a planar-adapted Gaussian representation for efficient large-scale modeling, employs segmentation-guided video inpainting to remove both dynamic and static foreground objects, and enhances color coherence via semantic-aware correction in HSV space. Extensive experiments on urban-scale datasets demonstrate that our framework produces visually coherent and geometrically faithful reconstructions, significantly outperforming prior methods under real-world conditions.
[111] The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models
Ahmet Can Ömercikoğlu,Mustafa Mansur Yönügül,Pakize Erdoğmuş
Main category: cs.CV
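TL;DR: This study evaluates the effect of input resolution on three face detection models, YOLOv11, YOLOv12, and MTCNN, and finds that YOLOv11 achieves the best accuracy while MTCNN is at a disadvantage in real-time inference.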
Details
Motivation: Real-world conditions such as low-resolution imagery can significantly degrade face detection performance, and face detection is a key component of many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. Method: Using the WIDER FACE dataset, three prominent face detectors (YOLOv11, YOLOv12, and MTCNN) are evaluated extensively at multiple image resolutions (160x160, 320x320, and 640x640), with each model assessed on precision, recall, mAP50, mAP50-95, and inference time. Result: YOLOv11 performs best in detection accuracy, especially at higher resolutions, while YOLOv12 has slightly better recall. MTCNN is competitive in landmark localization but lags in real-time inference speed. Conclusion: The study concludes that YOLOv11 outperforms YOLOv12 and MTCNN in detection accuracy, especially at higher resolutions, while YOLOv12 has slightly better recall. MTCNN lags in real-time inference speed but is competitive in landmark localization. Abstract: Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model's performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.
[112] Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads
Yingjie Zhou,Jiezhang Cao,Zicheng Zhang,Farong Wen,Yanwei Jiang,Jun Jia,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai
Main category: cs.CV
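TL;DR: This paper presents the THQA-10K dataset and a new objective quality assessment method, advancing quality assessment for AI-generated talking heads.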
Details
Motivation: Despite the rapid development of text-to-image models, the quality of AI-generated talking heads remains problematic, and comprehensive studies of these issues are still limited. Method: The authors build THQA-10K, the largest quality assessment dataset for AI-generated talking heads, and propose an objective quality assessment method based on the first frame, Y-T slice, and tone-lip consistency. Result: The THQA-10K dataset contains 10,457 AI-generated talking heads, which volunteers rated subjectively and labeled with distortion categories. The proposed objective method achieves state-of-the-art performance in quality assessment of AI-generated talking heads. Conclusion: The paper introduces the THQA-10K dataset and an objective quality assessment method based on the first frame, Y-T slice, and tone-lip consistency, which achieves state-of-the-art performance in quality assessment of AI-generated talking heads. Abstract: Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of the Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human media. However, challenges persist regarding the quality of these talkers and AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents the largest AGTH quality assessment dataset THQA-10K to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs. Then, volunteers are recruited to subjectively rate the AGTHs and give the corresponding distortion categories. In our analysis for subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, Y-T slice and tone-lip consistency is proposed. Experimental results show that this method can achieve state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at https://github.com/zyj-2000/Talker.
[113] IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025
Radu-Andrei Bourceanu,Neil De La Fuente,Jan Grimm,Andrei Jardan,Andriy Manucharyan,Cornelius Weiss,Roman Pflugfelder
Main category: cs.CV
TL;DR: This report analyzes the evolution of key design patterns in computer vision by reviewing six influential papers, highlighting advancements in image recognition, generative models, and self-supervised learning techniques.
Details
Motivation: The motivation for this report is to understand the evolution of key design patterns in computer vision by examining influential papers in the field. Method: The analysis is based on a review of six influential papers covering foundational architectures for image recognition, generative models, and self-supervised learning techniques. Result: The report identifies the Vision Transformer (ViT), Generative Adversarial Networks (GANs), Latent Diffusion Models (LDMs), DINO, and Masked Autoencoders (MAE) as significant advancements in computer vision, each contributing to the current state-of-the-art in their respective areas. Conclusion: The report concludes that key design patterns in computer vision have evolved significantly, with self-supervised learning techniques like DINO and MAE offering highly scalable and effective methods for pre-training large-scale vision models. Abstract: This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. 
LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
[114] Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers
Ji Ma,Wei Suo,Peng Wang,Yanning Zhang
Main category: cs.CV
TL;DR: This paper proposes Short-LVLM, a training-free and model-agnostic framework for compressing large vision-language models (LVLMs) by addressing challenges in layer pruning, achieving better performance-efficiency trade-offs.
Details
Motivation: Large vision-language models (LVLMs) face practical limitations due to massive model parameters and high computational costs. While layer pruning techniques have been effective in NLP, their applicability to LVLMs is uncertain due to modality divergence. Method: Through extensive experiments, the study analyzes the challenges of non-essential VL tokens and inter-layer feature gaps in LVLMs. Based on these findings, the Short-LVLM framework is proposed to overcome these challenges. Result: Directly applying traditional layer pruning methods from NLP to LVLMs is proven to be ineffective. The proposed Short-LVLM framework achieves a superior balance between performance and efficiency. Conclusion: Short-LVLM (SVL) is introduced as a novel framework that effectively addresses the challenges of pruning layers in LVLMs by utilizing important vision-language tokens and mitigating feature gaps, offering a training-free, model-agnostic, and highly compatible solution. Abstract: Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. 
Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.
[115] VMatcher: State-Space Semi-Dense Local Feature Matching
Ali Youssef
Main category: cs.CV
TL;DR: VMatcher combines Mamba and Transformer to efficiently match image features, outperforming existing methods in speed and effectiveness.
Details
Motivation: Transformers, while effective, have high computational costs due to quadratic complexity, prompting the need for a more efficient solution. Method: VMatcher combines Mamba's linear-complexity SSM with Transformer's attention mechanism in a hybrid architecture. Result: VMatcher achieves new benchmarks in efficiency and performance, demonstrating robustness and suitability for real-time applications. Conclusion: VMatcher, a hybrid Mamba-Transformer network, achieves efficient and effective feature matching for real-time applications. Abstract: This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer's attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba's highly efficient long-sequence processing with the Transformer's attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher
[116] UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries
Yijie Zhu,Lingsen Zhang,Zitong Yu,Rui Shao,Tao Tan,Liqiang Nie
Main category: cs.CV
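TL;DR: This paper proposes UniEmo, a unified model for emotional understanding and generation that improves results through a dual feedback mechanism.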
Details
Motivation: Emotional understanding and generation are inherently complementary, yet they are currently treated as separate tasks. Method: The paper proposes a hierarchical emotional understanding chain and an emotional diffusion model, and introduces an emotional correlation coefficient and an emotional condition loss. Result: Experiments show that UniEmo significantly outperforms the state of the art on both emotional understanding and generation tasks. Conclusion: UniEmo is a unified framework for emotional understanding and generation that improves both through a dual feedback mechanism. Abstract: Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding part. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.
[117] Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation
Haoran Chen,Zexiao Wang,Haidong Cao,Zuxuan Wu,Yu-Gang Jiang
Main category: cs.CV
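TL;DR: This paper proposes MP^2A, a progressive alignment strategy that mitigates pseudo-label noise in multi-source unsupervised domain adaptation by gradually introducing low-confidence samples, achieving strong performance.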
Details
Motivation: Existing methods use CLIP to generate pseudo-labels for domain alignment, but using all pseudo-labeled samples at once leads to noise propagation and suboptimal feature learning, a problem that is more severe in the multi-source setting. Method: A progressive alignment strategy that starts training on high-confidence samples and gradually introduces lower-confidence ones to reduce the impact of pseudo-label noise. Result: MP^2A achieves state-of-the-art performance on three UDA benchmarks: ImageCLEF, Office-Home, and DomainNet. Conclusion: Through progressive alignment, MP^2A effectively mitigates error propagation and biased convergence in multi-source unsupervised domain adaptation and outperforms existing methods. Abstract: Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is even more amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, in this work, we propose a progressive alignment strategy for adapting CLIP to unlabeled downstream tasks. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes a more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP^2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. 
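The progressive strategy described above (train first on high-confidence pseudo-labels, then admit harder samples) can be sketched as a confidence threshold that relaxes over training. This is only an illustration: the linear schedule, the `start`/`end` values, and the function names are assumptions, not details taken from the MP^2A paper.

```python
def confidence_schedule(epoch, total_epochs, start=0.9, end=0.5):
    """Linearly relax the confidence threshold from `start` to `end` (assumed schedule)."""
    if total_epochs <= 1:
        return end
    frac = epoch / (total_epochs - 1)
    return start + (end - start) * frac


def select_pseudo_labeled(samples, threshold):
    """Keep samples whose pseudo-label confidence meets the threshold.

    `samples` is a list of (sample_id, pseudo_label, confidence) tuples.
    """
    return [s for s in samples if s[2] >= threshold]


# Hypothetical target-domain samples with CLIP-style pseudo-labels.
samples = [
    ("img0", "dog", 0.97),
    ("img1", "cat", 0.81),
    ("img2", "dog", 0.62),
    ("img3", "car", 0.40),
]

for epoch in range(3):
    tau = confidence_schedule(epoch, total_epochs=3)
    subset = select_pseudo_labeled(samples, tau)
    # A real pipeline would fine-tune the model on `subset` here.
    print(epoch, round(tau, 2), [s[0] for s in subset])
```

With these toy values the training subset grows from one sample to three across the epochs, while the lowest-confidence sample is never admitted.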
Experiments show that MP^2A achieves state-of-the-art performance compared with the most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.
[118] NeRF Is a Valuable Assistant for 3D Gaussian Splatting
Shuangkang Fang,I-Chao Shen,Takeo Igarashi,Yufeng Wang,ZeSheng Wang,Yi Yang,Wenrui Ding,Shuchang Zhou
Main category: cs.CV
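TL;DR: NeRF-GS is a new framework that combines the strengths of NeRF and 3DGS; by sharing spatial information and optimizing residual vectors, it significantly improves 3D scene representation performance.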
Details
Motivation: To address 3DGS's sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, while exploiting NeRF's continuous spatial representation. Method: The NeRF-GS framework jointly optimizes NeRF and 3DGS, progressively aligning their spatial features through shared 3D spatial information and optimizing residual vectors to enhance the personalized capabilities of 3DGS. Result: Experiments show that NeRF-GS surpasses existing methods on benchmark datasets and achieves state-of-the-art performance. Conclusion: NeRF-GS offers an effective approach to 3D scene representation that combines the strengths of NeRF and 3DGS and provides new insights for future hybrid methods. Abstract: We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.
[119] AGA: An adaptive group alignment framework for structured medical cross-modal representation learning
Wei Li,Xun Gong,Jiao Li,Xiaobin Sun
Main category: cs.CV
TL;DR: This paper proposes AGA, a new framework for learning structured visual and textual representations from medical images and reports, which achieves strong performance without requiring large-scale negative samples.
Details
Motivation: Current vision-language pretraining methods in the medical domain oversimplify clinical reports and rely heavily on large quantities of hard negative samples, which is impractical for small-scale datasets. Method: AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix, incorporating threshold gating modules for adaptive grouping and a Bidirectional Cross-modal Grouped Alignment module for fine-grained alignment. Result: Extensive experiments on public and private datasets demonstrate that AGA achieves strong performance on image-text retrieval and classification tasks in both fine-tuning and zero-shot settings. Conclusion: The proposed Adaptive Grouped Alignment (AGA) framework effectively captures structured semantics from paired medical images and reports, achieving strong performance on image-text retrieval and classification tasks. Abstract: Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and each patch selects its most related tokens to form a language group. 
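The bidirectional grouping step can be illustrated with a toy similarity matrix: each token picks its best-matching patches, each patch picks its best-matching tokens, and a group representation is a similarity-weighted average, as the abstract describes. The sketch below is a simplification under stated assumptions: it uses a fixed top-k in place of AGA's learned threshold gates, and the matrix values, feature vectors, and helper names are all hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def weighted_average(vectors, weights):
    """Average `vectors` with non-negative `weights` normalised to sum to 1."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total for d in range(dim)]


# Toy similarity matrix: rows are text tokens, columns are image patches.
sim = [
    [0.9, 0.1, 0.6],
    [0.2, 0.8, 0.3],
]
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Visual group per token: the patches most similar to that token (fixed k=2).
visual_groups = [top_k_indices(row, 2) for row in sim]

# Language group per patch: the tokens most similar to that patch (fixed k=1).
columns = [list(col) for col in zip(*sim)]
language_groups = [top_k_indices(col, 1) for col in columns]

# Group representation for token 0: similarity-weighted mean of its patches.
idx = visual_groups[0]
rep0 = weighted_average([patch_feats[j] for j in idx], [sim[0][j] for j in idx])
print(visual_groups, language_groups, rep0)
```

Replacing the fixed `k` with a per-token threshold on `sim` would be closer in spirit to the adaptive gating the abstract describes, though the learned gates themselves are not reproduced here.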
To enable adaptive grouping, we design two threshold gating modules, called Language Grouped Threshold Gate and Vision Grouped Threshold Gate, which learn grouping thresholds dynamically. Group representations are computed as weighted averages based on similarity scores. To align each token with its group representation, we introduce an Instance Aware Group Alignment loss that operates within each image-text pair, removing the need for external negatives. Finally, a Bidirectional Cross-modal Grouped Alignment module is applied to enhance fine-grained alignment between visual and linguistic group representations. Extensive experiments on public and private datasets show that our method achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings.
[120] Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories
Lemar Abdi,Francisco Caetano,Amaan Valiuddin,Christiaan Viviers,Hamdi Joudeh,Fons van der Sommen
Main category: cs.CV
TL;DR: This paper proposes SBDDM, a fast and efficient reconstruction-free method for OOD detection in medical imaging, achieving high accuracy and computational efficiency without retraining.
Details
Motivation: Current generative OOD detection methods are computationally expensive, unreliable, and require retraining, which limits their efficiency and robustness in identifying rare pathological cases. Method: A reconstruction-free OOD detection method using the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM), leveraging trajectory curvature via the estimated Stein score. Result: SBDDM achieves state-of-the-art performance with up to 10.43% improvement for Near-OOD and 18.10% for Far-OOD detection, using only five diffusion steps. Conclusion: SBDDM is a practical building block for real-time, reliable computer-aided diagnosis, offering significant improvements in Near-OOD and Far-OOD detection while reducing computational costs. Abstract: In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. 
Compared to existing methods, SBDDM achieves a relative improvement of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, making it a practical building block for real-time, reliable computer-aided diagnosis.
[121] Honey Adulteration Detection using Hyperspectral Imaging and Machine Learning
Mokhtar A. Al-Awadhi,Ratnadeep R. Deshmukh
Main category: cs.CV
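TL;DR: This paper proposes a machine learning-based honey adulteration detection system that uses hyperspectral imaging data and a K-Nearest Neighbors model to detect adulteration with high accuracy.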
Details
Motivation: To develop a machine learning-based system that automatically detects honey adulteration, in particular adulteration with sugar syrup. Method: Features are extracted from hyperspectral imaging data with Linear Discriminant Analysis and classified with a K-Nearest Neighbors model. Result: The system reaches a cross-validation accuracy of 96.39%. Conclusion: The system detects honey adulteration with high accuracy and can serve as an alternative to chemical-based detection methods. Abstract: This paper aims to develop a machine learning-based system for automatically detecting honey adulteration with sugar syrup, based on honey hyperspectral imaging data. First, the floral source of a honey sample is classified by a botanical origin identification subsystem. Then, the sugar syrup adulteration is identified, and its concentration is quantified by an adulteration detection subsystem. Both subsystems consist of two steps. The first step involves extracting relevant features from the honey sample using Linear Discriminant Analysis (LDA). In the second step, we utilize the K-Nearest Neighbors (KNN) model to classify the honey botanical origin in the first subsystem and identify the adulteration level in the second subsystem. We assess the proposed system performance on a public honey hyperspectral image dataset. The result indicates that the proposed system can detect adulteration in honey with an overall cross-validation accuracy of 96.39%, making it an appropriate alternative to the current chemical-based detection methods.
[122] Beyond Linear Bottlenecks: Spline-Based Knowledge Distillation for Culturally Diverse Art Style Classification
Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Cosimo Distante,Abdelmalik Taleb-Ahmed
Main category: cs.CV
TL;DR: The study improves art style classification by integrating Kolmogorov-Arnold Networks (KANs) into a dual-teacher framework, allowing better modeling of complex stylistic features and improving classification accuracy on benchmark datasets.
Details
Motivation: Art style classification is challenging due to limited labeled datasets and the complex, nonlinear interplay of stylistic elements. Existing dual-teacher frameworks, while reducing reliance on labeled data, are limited by linear projections and localized focus, making them less effective at capturing global stylistic context. Method: The authors enhanced a dual-teacher self-supervised learning framework by replacing standard MLP projection and prediction heads with KANs, which utilize spline-based activations to model nonlinear relationships. They evaluated the performance on WikiArt and Pandora18k datasets, comparing their method with the base dual-teacher architecture. Result: The proposed method outperformed the base dual-teacher architecture in Top-1 accuracy on the WikiArt and Pandora18k datasets. Additionally, KANs showed better linear probe accuracy than MLP projections, highlighting their effectiveness in disentangling complex style features. Conclusion: Using Kolmogorov-Arnold Networks (KANs) in a dual-teacher knowledge distillation framework improves art style classification by better capturing complex style manifolds and nonlinear feature correlations compared to conventional MLP projections. Abstract: Art style classification remains a formidable challenge in computational aesthetics due to the scarcity of expertly labeled datasets and the intricate, often nonlinear interplay of stylistic elements. While recent dual-teacher self-supervised frameworks reduce reliance on labeled data, their linear projection layers and localized focus struggle to model global compositional context and complex style-feature interactions. We enhance the dual-teacher knowledge distillation framework to address these limitations by replacing conventional MLP projection and prediction heads with Kolmogorov-Arnold Networks (KANs). 
Our approach retains complementary guidance from two teacher networks, one emphasizing localized texture and brushstroke patterns, the other capturing broader stylistic hierarchies, while leveraging KANs' spline-based activations to model nonlinear feature correlations with mathematical precision. Experiments on WikiArt and Pandora18k demonstrate that our approach outperforms the base dual-teacher architecture in Top-1 accuracy. Our findings highlight the importance of KANs in disentangling complex style manifolds, leading to better linear probe accuracy than MLP projections.
[123] Adjustable Spatio-Spectral Hyperspectral Image Compression Network
Martin Hermann Paul Fuchs,Behnood Rasti,Begüm Demir
Main category: cs.CV
TL;DR: This paper proposes HyCASS, a learning-based model for adjustable hyperspectral image compression, demonstrating its effectiveness and providing guidelines for balancing spatial and spectral compression.
Details
Motivation: The motivation is to investigate the individual and joint effects of spectral and spatial compression on learning-based HSI compression, addressing the lack of comprehensive analysis in this area. Method: The authors proposed HyCASS, a learning-based model with six modules, including spectral and spatial encoders/decoders and compression ratio adapter modules, utilizing convolutional layers and transformer blocks. Result: Experimental results on two HSI benchmark datasets demonstrated the effectiveness of HyCASS compared to existing models, and guidelines were established for balancing spectral and spatial compression. Conclusion: The paper concludes that the proposed HyCASS model effectively balances spectral and spatial compression across different compression ratios, offering superior performance in learning-based HSI compression. Abstract: With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, a comprehensive investigation of the individual and joint effects of spectral and spatial compression on learning-based HSI compression has not been thoroughly examined yet. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder; 2) spatial encoder; 3) compression ratio (CR) adapter encoder; 4) CR adapter decoder; 5) spatial decoder; and 6) spectral decoder module. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. 
Experimental results on two HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass.
[124] Machine learning and machine learned prediction in chest X-ray images
Shereiff Garrett,Abhinav Adhikari,Sarina Gautam,DaShawn Marquis Morris,Chandra Mani Adhikari
Main category: cs.CV
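TL;DR: Using 5824 chest X-ray images, this study compares a baseline CNN and DenseNet-121 for disease prediction and finds that DenseNet-121 not only performs better but also focuses more accurately on the key regions of the images.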
Details
Motivation: Machine learning and artificial intelligence train algorithms on data, recognize patterns, and make predictions, solving complex problems without explicit programming. This study explores the potential of these methods in medical diagnosis, specifically predicting disease from chest X-ray images. Method: Using 5824 chest X-ray images, two machine learning algorithms, a baseline convolutional neural network (CNN) and DenseNet-121, are implemented and compared on disease prediction. Gradient-weighted class activation mapping is used to analyze the regions that drive each model's decisions. Result: Both the baseline CNN and DenseNet-121 perform well on the binary classification problem, but DenseNet-121 focuses more accurately on the key regions of the input images in its decision-making. Conclusion: DenseNet-121 outperforms the baseline CNN on binary classification of chest X-ray images, and its decision-making attends more accurately to the key regions of the images. Abstract: Machine learning and artificial intelligence are fast-growing fields of research in which data is used to train algorithms, learn patterns, and make predictions. This approach helps to solve seemingly intricate problems with significant accuracy without explicit programming by recognizing complex relationships in data. Taking an example of 5824 chest X-ray images, we implement two machine learning algorithms, namely, a baseline convolutional neural network (CNN) and a DenseNet-121, and present our analysis in making machine-learned predictions in predicting patients with ailments. Both baseline CNN and DenseNet-121 perform very well in the binary classification problem presented in this work. Gradient-weighted class activation mapping shows that DenseNet-121 correctly focuses on essential parts of the input chest X-ray images in its decision-making more than the baseline CNN.
[125] Mitigating Resolution-Drift in Federated Learning: Case of Keypoint Detection
Taeheon Lim,Joohyung Lee,Kyungjae Lee,Jungchan Cho
Main category: cs.CV
TL;DR: This paper identifies 'resolution-drift' as a critical issue in federated learning for high-resolution tasks like human pose estimation and proposes RAF, a resolution-adaptive method using knowledge distillation, to address this challenge effectively.
Details
Motivation: While FL has been successful in classification tasks by addressing statistical heterogeneity and communication efficiency, its application to non-classification tasks like human pose estimation remains underexplored. Resolution variability across clients, a new axis of non-IID data, introduces performance degradation that needs to be addressed. Method: The authors propose RAF, which uses heatmap-based knowledge distillation to address resolution drift. The method involves multi-resolution knowledge distillation, where higher-resolution outputs act as teachers for lower-resolution outputs, enhancing resolution robustness without overfitting. Result: Extensive experiments and theoretical analysis show that RAF effectively mitigates resolution drift and achieves significant performance improvements. It can be seamlessly integrated into existing FL frameworks. t-SNE analysis also reveals distinct characteristics between classification and high-resolution representation tasks, supporting RAF's generalizability. Conclusion: The paper concludes that resolution variability, termed 'resolution-drift,' is a critical issue in federated learning (FL) for high-resolution representation tasks like human pose estimation. The proposed method, resolution-adaptive federated learning (RAF), effectively mitigates resolution drift through heatmap-based knowledge distillation and is generalizable to other tasks requiring spatial detail preservation. Abstract: The Federated Learning (FL) approach enables effective learning across distributed systems, while preserving user data privacy. To date, research has primarily focused on addressing statistical heterogeneity and communication efficiency, through which FL has achieved success in classification tasks. However, its application to non-classification tasks, such as human pose estimation, remains underexplored. 
This paper identifies and investigates a critical issue termed "resolution-drift," where performance degrades significantly due to resolution variability across clients. Unlike class-level heterogeneity, resolution drift highlights the importance of resolution as another axis of non-independent and identically distributed (non-IID) data. To address this issue, we present resolution-adaptive federated learning (RAF), a method that leverages heatmap-based knowledge distillation. Through multi-resolution knowledge distillation between higher-resolution outputs (teachers) and lower-resolution outputs (students), our approach enhances resolution robustness without overfitting. Extensive experiments and theoretical analysis demonstrate that RAF not only effectively mitigates resolution drift and achieves significant performance improvements, but also can be integrated seamlessly into existing FL frameworks. Furthermore, although this paper focuses on human pose estimation, our t-SNE analysis reveals distinct characteristics between classification and high-resolution representation tasks, supporting the generalizability of RAF to other tasks that rely on preserving spatial detail.
[126] CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes
Bin Xie,Congxuan Zhang,Fagan Wang,Peng Liu,Feng Lu,Zhen Chen,Weiming Hu
Main category: cs.CV
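TL;DR: CST Anti-UAV is a new dataset for UAV tracking in complex scenes; state accuracy on it is far lower than on existing datasets, showing that tracking tiny UAVs in complex environments remains a challenge.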
Details
Motivation: Existing UAV tracking datasets mainly feature conspicuous targets and lack diversity in scene complexity and attribute representation, which limits their applicability to real-world scenarios. Method: The paper presents CST Anti-UAV, a new thermal infrared dataset designed for single object tracking of tiny UAVs in complex scenes, containing 220 video sequences and over 240k high-quality bounding box annotations. Result: Experiments show that the state-of-the-art method achieves only 35.92% state accuracy on CST Anti-UAV, far below the 67.69% on the Anti-UAV410 dataset. Conclusion: CST Anti-UAV is the first dataset with complete manual frame-level attribute annotations, enabling precise evaluation under varied challenges. Experiments show that tracking tiny UAVs in complex environments remains challenging: the state-of-the-art method achieves only 35.92% state accuracy on CST Anti-UAV, far below the 67.69% on Anti-UAV410. The CST Anti-UAV benchmark is about to be publicly released, which should foster more robust SOT methods and drive innovation in anti-UAV systems. Abstract: The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present the CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations, highlighting two key properties: a significant number of tiny-sized UAV targets and the diverse and complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. 
The CST Anti-UAV benchmark is about to be publicly released, which not only fosters the development of more robust SOT methods but also drives innovation in anti-UAV systems.
[127] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding
Ting Huang,Zeyu Zhang,Hao Tang
Main category: cs.CV
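TL;DR: 3D-R1 significantly improves reasoning and generalization in 3D scene understanding by building Scene-30K, a high-quality synthetic dataset, and training with reinforcement learning based on RLHF policies.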
Details
Motivation: Current 3D vision-language models fall short in reasoning and generalization, mainly because of the lack of high-quality spatial data and the static nature of viewpoint assumptions. Method: The authors construct Scene-30K, a high-quality synthetic dataset, train with reinforcement learning based on RLHF policies such as GRPO, and introduce a dynamic view selection strategy. Result: 3D-R1 delivers an average improvement of 10% across multiple 3D scene understanding benchmarks. Conclusion: 3D-R1 significantly improves reasoning and generalization in 3D scene understanding and offers new directions for future 3D vision-language models. Abstract: Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
[128] Seeing More with Less: Video Capsule Endoscopy with Multi-Task Learning
Julia Werner,Oliver Bause,Julius Oexle,Maxime Le Floch,Franz Brinkmann,Jochen Hampe,Oliver Bringmann
Main category: cs.CV
TL;DR: This paper introduces a multi-task neural network for video capsule endoscopy, combining self-localization and anomaly detection to extend battery life and improve performance.
Details
Motivation: The persistent challenge of short battery life in compact sensor edge devices for video capsule endoscopy motivates the integration of artificial intelligence to enable intelligent real-time decision-making and reduce energy consumption. Method: A multi-task neural network combining self-localization and anomaly detection was developed, leveraging established multi-task methods and Viterbi decoding for time-series analysis, while restricting the model size to ensure deployment feasibility. Result: The multi-task neural network achieved an accuracy of 93.63% on self-localization and 87.48% on anomaly detection with only 1 million parameters, outperforming current single-task models. Conclusion: The introduction of a multi-task neural network in this paper successfully integrates self-localization and anomaly detection in video capsule endoscopy, significantly advancing AI-based approaches for small intestine investigations. Abstract: Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, a persistent challenge remains the short battery lifetime of such compact sensor edge devices. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing the energy consumption and prolonging the battery life. However, this remains challenging due to data sparsity and the limited resources of the device restricting the overall model size. In this work, we introduce a multi-task neural network that combines the functionalities of precise self-localization within the gastrointestinal tract with the ability to detect anomalies in the small intestine within a single model. Throughout the development process, we consistently restricted the total number of parameters to ensure the feasibility of deploying such a model in a small capsule. 
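Viterbi decoding, mentioned in the method summary, can turn noisy per-frame location predictions into a temporally consistent sequence. The following is a generic textbook sketch, not the paper's implementation: the three states, the forward-only transition matrix, and the per-frame probabilities are illustrative assumptions.

```python
import math


def viterbi(emissions, transition, initial):
    """Most likely state sequence given per-frame state probabilities.

    emissions[t][s]  : probability the classifier assigns to state s at frame t
    transition[p][s] : probability of moving from state p to state s
    initial[s]       : probability of starting in state s
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(initial)
    score = [log(initial[s]) + log(emissions[0][s]) for s in range(n_states)]
    back = []  # back[t-1][s] is the best predecessor of state s at frame t
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log(transition[p][s]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log(transition[best_prev][s]) + log(emissions[t][s]))
        score = new_score
        back.append(pointers)
    # Trace the best final state back to the first frame.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical states: 0 = stomach, 1 = small intestine, 2 = colon.
# The capsule is assumed to move forward only, so backward transitions are 0.
transition = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
initial = [1.0, 0.0, 0.0]
emissions = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.30, 0.60, 0.10],
    [0.55, 0.35, 0.10],  # noisy frame that flickers back to state 0
    [0.20, 0.70, 0.10],
    [0.10, 0.80, 0.10],
]

framewise = [max(range(3), key=lambda s: e[s]) for e in emissions]
decoded = viterbi(emissions, transition, initial)
print(framewise)  # [0, 0, 1, 0, 1, 1]
print(decoded)    # [0, 0, 1, 1, 1, 1]
```

The frame-wise argmax jumps back to state 0 at the noisy frame, while the decoded path respects the forward-only assumption and stays in state 1.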
We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.
[129] FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction
Donghyun Lee,Dawoon Jeong,Jae W. Lee,Hongil Yoon
Main category: cs.CV
TL;DR: FastPoint is a software-based acceleration technique that improves the efficiency of farthest point sampling and neighbor search in 3D point cloud processing by predicting distance curves, resulting in significant speedups without sacrificing accuracy.
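For context, the baseline operation that FastPoint accelerates, farthest point sampling, can be written in a few lines. The sketch below is the standard exhaustive algorithm and does not include FastPoint's distance-curve predictor; the function name, the toy point set, and the recorded `sample_dists` list are illustrative assumptions.

```python
import math


def farthest_point_sampling(points, n_samples, start=0):
    """Baseline FPS: repeatedly pick the point farthest from the chosen set."""
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.inf] * len(points)
    selected = [start]
    sample_dists = []  # distance of each newly sampled point to the chosen set
    for _ in range(n_samples - 1):
        last = points[selected[-1]]
        # Exhaustive update: every point is compared with the newest sample.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, last))
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        sample_dists.append(min_dist[nxt])
    return selected, sample_dists


points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (10, 0)]
selected, sample_dists = farthest_point_sampling(points, n_samples=3)
print(selected, sample_dists)
```

In baseline FPS the recorded distances never increase from one sample to the next, which gives a smooth curve of the kind a predictor could plausibly exploit to skip part of the exhaustive update.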
Details
Motivation: Handling large and irregular 3D point clouds efficiently remains challenging despite advancements in deep neural networks. Method: FastPoint predicts the distance curve between sampled points during farthest point sampling, enabling the efficient identification of subsequent sample points without exhaustive pairwise distance computations. Result: Integrating FastPoint into state-of-the-art 3D point cloud models achieves a 2.55x end-to-end speedup on an NVIDIA RTX 3090 GPU without compromising accuracy. Conclusion: FastPoint maintains sampling quality and model performance while significantly accelerating farthest point sampling and neighbor search operations. Abstract: Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
[130] Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion
Mutian Xu,Chongjie Ye,Haolin Liu,Yushuang Wu,Jiahao Chang,Xiaoguang Han
Main category: cs.CV
TL;DR: Stable-Sim2Real is a new data-driven 3D simulation method that uses a two-stage depth diffusion model to bridge the gap between simulated and real-captured 3D data, enhancing performance in real-world 3D visual tasks.
Details
Motivation: The motivation is to address the limitations of current 3D data simulation methods that rely on predefined physical priors and struggle to capture the complexity of real data. Method: The method involves a two-stage depth diffusion model. The first stage generates a coarse depth by finetuning Stable-Diffusion, while the second stage improves the depth by prioritizing distinct areas identified by a 3D discriminator. Result: Extensive experiments show that the proposed method enhances the performance of networks in real-world 3D visual tasks and produces simulated data highly similar to real-captured patterns. Conclusion: This paper proposes Stable-Sim2Real, a new data-driven 3D simulation method based on a two-stage depth diffusion model, which significantly enhances performance in real-world 3D visual tasks. Abstract: 3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress in this solution has met stagnation in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes Stable-Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where diffusion loss is adjusted to prioritize these distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. 
Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: https://mutianxu.github.io/stable-sim2real/.
[131] Online Estimation of Table-Top Grown Strawberry Mass in Field Conditions with Occlusions
Jinshan Zhen,Yuanyue Ge,Tianxiao Zhu,Hui Zhao,Ya Xiong
Main category: cs.CV
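TL;DR: The study proposes a vision-based pipeline that combines RGB-D sensing and deep learning for non-destructive, real-time, online estimation of strawberry mass, addressing the challenges posed by occlusions and pose variations.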
Details
Motivation: Accurately estimating the mass of table-top grown strawberries under field conditions remains challenging because of frequent occlusions and pose variations. Method: The method uses YOLOv8-Seg for instance segmentation, CycleGAN for completing occluded regions, and tilt-angle correction to refine the frontal projection area calculation. A polynomial regression model maps the geometric features to mass. Result: Experiments show mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded ones. CycleGAN outperforms the large mask inpainting (LaMa) model in occlusion recovery, achieving better pixel area ratios and higher intersection over union scores. Conclusion: The approach addresses key limitations of traditional methods and offers a reliable solution for automated harvesting and yield monitoring under complex occlusion patterns. Abstract: Accurate mass estimation of table-top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision-based pipeline integrating RGB-D sensing and deep learning to enable non-destructive, real-time and online mass estimation. The method employed YOLOv8-Seg for instance segmentation, a cycle-consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt-angle correction to refine frontal projection area calculations. A polynomial regression model then mapped the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded cases. CycleGAN outperformed the large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9-1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.
[132] Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion
Timing Li,Bing Cao,Jiahe Feng,Haifang Cao,Qinghua Hu,Pengfei Zhu
Main category: cs.CV
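TL;DR: The paper proposes the Hyperbolic Cycle Alignment Network (Hy-CycleAlign), built on a dual-path cross-modal cyclic registration framework and a Hyperbolic Hierarchy Contrastive Alignment module, for multi-modal image registration and fusion; it significantly outperforms existing methods.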
Details
Motivation: Existing image registration methods based on Euclidean space handle cross-modal images poorly, leading to unsatisfactory registration and fusion quality. A new method is therefore needed to handle cross-modal image registration effectively. Method: The paper proposes a dual-path cross-modal cyclic registration framework in which a forward registration network aligns the cross-modal inputs and a backward registration network reconstructs the original image, forming a closed-loop registration structure. It also designs a Hyperbolic Hierarchy Contrastive Alignment (H$^{2}$CA) module that maps images into hyperbolic space and imposes registration constraints to reduce interference from modality discrepancies. Result: Extensive experiments on misaligned multi-modal image datasets show that the method significantly outperforms existing approaches in both image registration and fusion. Conclusion: Hy-CycleAlign is described as the first image registration method based on hyperbolic space; by introducing a cyclic registration framework and a hyperbolic hierarchy contrastive alignment module, it effectively addresses cross-modal image registration and improves fusion quality. Abstract: Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H$^{2}$CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. 
Our code will be publicly available.
[133] I Am Big, You Are Little; I Am Right, You Are Wrong
David A. Kelly,Akchunya Chanchal,Nathan Blake
Main category: cs.CV
TL;DR: The paper explores how different image classification models focus on specific pixels for decision-making, showing that model architecture and classification accuracy influence these focus areas.
Details
Motivation: As machine learning models for image classification grow in variety and complexity, understanding how they make decisions becomes crucial for selecting the right model. Method: The study uses minimal sufficient pixel sets to analyze the decision-making process of vision models, comparing metrics like position, overlap, and size of these sets. Result: The research reveals that different architectures have statistically different concentration behaviors, with certain models like ConvNext and EVA standing out. Additionally, misclassified images tend to involve larger pixel sets. Conclusion: Models like ConvNext and EVA demonstrate distinct concentration patterns, and misclassifications are generally linked with larger pixel sets. Abstract: Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model's classification accuracy statistically, our understanding of the way these models work is unfortunately limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixel sets to gauge a model's 'concentration': the pixels that capture the essence of an image through the lens of the model. By comparing position, overlap, and size of sets of pixels, we identify that different architectures have statistically different concentration, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that images which are misclassified are associated with larger pixel sets than correct classifications.
[134] ART: Adaptive Relation Tuning for Generalized Relation Prediction
Gopika Sudhakaran,Hikaru Shindo,Patrick Schramowski,Simone Schaub-Meyer,Kristian Kersting,Stefan Roth
Main category: cs.CV
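TL;DR: The paper proposes ART, a framework that uses instruction tuning to improve the generalization of visual relation detection (VRD) models to novel and complex relations.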
Details
Motivation: Conventional VRD models are trained only on relation detection data and struggle to generalize to unseen relations; existing prompt tuning methods rely on handcrafted prompts and struggle with novel or complex relations. Method: VRD datasets are converted into an instruction tuning format, and an adaptive sampling algorithm directs the vision-language model (VLM) to focus on informative relations, improving generalization. Result: ART significantly outperforms its baselines on multiple held-out datasets of varying complexity and can infer unseen relation concepts, a capability absent in mainstream VRD methods. Conclusion: ART offers a more effective way to adapt VLMs for VRD, improving performance while maintaining generalizability, and demonstrates practical value in segmenting complex scenes. Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.
[135] 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
Yung-Hsu Yang,Luigi Piccinelli,Mattia Segu,Siyuan Li,Rui Huang,Yuqian Fu,Marc Pollefeys,Hermann Blum,Zuria Bauer
Main category: cs.CV
TL;DR: This paper introduces 3D-MOOD, the first end-to-end monocular open-set 3D object detector, which achieves state-of-the-art performance on both closed-set and open-set benchmarks.
Details
Motivation: The motivation is to address the limitations of existing monocular 3D object detection methods that are confined to closed-set settings, while real-world applications often require handling new environments and novel object categories. Method: The method involves lifting open-set 2D detection into 3D space using a designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks. Object queries are conditioned with geometry prior to improve generalization, and a canonical image space is designed for efficient cross-dataset training. Result: The proposed 3D-MOOD model achieves state-of-the-art results on closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet). Conclusion: The paper concludes that the proposed 3D-MOOD model successfully addresses monocular 3D object detection in an open-set setting, achieving state-of-the-art results on both closed-set and open-set benchmarks. Abstract: Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. 
We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.
[136] Gaussian Splatting Feature Fields for Privacy-Preserving Visual Localization
Maxime Pietrantoni,Gabriela Csurka,Torsten Sattler
Main category: cs.CV
TL;DR: This paper proposes Gaussian Splatting Feature Fields for visual localization, combining 3D geometry with feature learning to achieve accurate and privacy-preserving localization results.
Details
Motivation: The motivation is to improve the accuracy of visual localization while also addressing privacy concerns by developing a method that can provide privacy-preserving localization. Method: The authors propose a new scene representation for visual localization called Gaussian Splatting Feature Fields (GSFFs), which combines an explicit geometry model (3D Gaussian Splatting) with an implicit feature field. They align a 3D feature field and a 2D feature encoder in a common embedding space using a contrastive framework and use a clustering procedure to regularize the representation learning. Result: The method achieves state-of-the-art performance on multiple real-world datasets for visual localization, showing its effectiveness in both privacy-preserving and non-privacy-preserving scenarios. Conclusion: The paper concludes that their proposed method, Gaussian Splatting Feature Fields, achieves state-of-the-art performance in visual localization tasks while also enabling privacy-preserving localization. Abstract: Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. 
Pose refinement, which involves aligning either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation, is used to achieve localization. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performances.
[137] Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation
Sobhan Asasi,Mohamed Ilyas Lakhal,Ozge Mercanoglu Sincan,Richard Bowden
Main category: cs.CV
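TL;DR: BeyondGloss is a new sign language translation framework that improves translation accuracy by leveraging the spatio-temporal reasoning capabilities of video large language models.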
Details
Motivation: Sign language translation must bridge the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. Existing video large language models struggle to model long videos in detail, so a new approach is needed. Method: The paper proposes a new way to generate fine-grained, temporally aware textual descriptions of hand motion and aligns these descriptions with video features during pre-training through a contrastive alignment module; it also distills fine-grained features from HaMeR and applies a contrastive loss to reduce the modality gap. Result: BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. Conclusion: By leveraging the spatio-temporal reasoning capabilities of video large language models, BeyondGloss provides a new gloss-free SLT framework with state-of-the-art performance. Abstract: Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.
[138] MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model
Yaoye Zhu,Zhe Wang,Yan Wang
Main category: cs.CV
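TL;DR: The paper proposes MamV2XCalib, a V2X-based automatic calibration method for infrastructure cameras. It combines multi-scale features with a 4D correlation volume and uses a Mamba model to handle temporal information, achieving efficient and stable calibration without manual intervention in autonomous driving scenarios.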
Details
Motivation: Traditional manual calibration methods are time-consuming and labor-intensive and may require road closures, so an efficient, automated calibration method is needed. Method: The paper proposes MamV2XCalib, a V2X-based infrastructure camera calibration method that combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side LiDAR point clouds and roadside images, and uses a Mamba model to handle temporal information and estimate rotation angles. Result: Evaluation on the V2X-Seq and TUMTraf-V2X real-world datasets shows that, compared with previous LiDAR-camera methods designed for single-vehicle calibration, MamV2XCalib achieves better and more stable calibration performance with fewer parameters. Conclusion: MamV2XCalib calibrates infrastructure cameras automatically without manual intervention; combining multi-scale features with a 4D correlation volume and modeling temporal information with Mamba improves calibration stability and effectiveness. Abstract: As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at https://github.com/zhuyaoye/MamV2XCalib.
[139] MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction
Zijian Dong,Longteng Duan,Jie Song,Michael J. Black,Andreas Geiger
Main category: cs.CV
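TL;DR: MoGA is a new method for generating high-quality, animatable 3D Gaussian avatars from a single-view image. It combines the strengths of 3D generative models and 2D diffusion models to address the 3D inconsistency and blurred detail of earlier methods.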
Details
Motivation: To address the challenge of reconstructing high-fidelity 3D Gaussian avatars from a single-view image, which requires inferring details of unseen regions while maintaining 3D consistency and realism, and to overcome the sparse and inconsistent views produced by conventional 2D diffusion models. Method: MoGA uses a 3D generative model as a prior: it projects the input image into the model's latent space, enforces 3D appearance and geometric constraints, and fits the model to synthetic views generated by 2D diffusion models. Result: Experiments show that MoGA outperforms state-of-the-art techniques in 3D consistency, realism, and detail reconstruction, and that the resulting avatars can be animated. Conclusion: By combining a 3D generative model with 2D diffusion models, MoGA generates high-quality, animatable 3D Gaussian avatars from a single-view image and generalizes well to real-world scenarios. Abstract: We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data, such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.
[140] DA-Occ: Efficient 3D Voxel Occupancy Prediction via Directional 2D for Geometric Structure Preservation
Yuchen Zhou,Yan Luo,Xiangang Wang,Xingjian Gu,Mingzhou Lu
Main category: cs.CV
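TL;DR: The paper proposes an efficient directional pure 2D approach to 3D occupancy prediction for autonomous driving that improves real-time processing while maintaining high accuracy.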
Details
Motivation: To resolve the trade-off in current methods between high accuracy and real-time processing requirements while preserving the performance of autonomous driving systems. Method: The method slices 3D voxel features to preserve complete vertical geometric information and uses a directional attention mechanism to efficiently extract geometric features from different orientations. Result: On the Occ3D-nuScenes dataset, the method achieves an mIoU of 39.3% at an inference speed of 27.7 FPS, and reaches 14.8 FPS in simulations on edge devices. Conclusion: The paper proposes a directional pure 2D approach for efficient, high-accuracy 3D occupancy prediction in autonomous driving. Abstract: Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method involves slicing 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird's-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On Occ3D-nuScenes, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method's applicability for real-time deployment in resource-constrained environments.
[141] Mamba-based Efficient Spatio-Frequency Motion Perception for Video Camouflaged Object Detection
Xin Li,Keren Fu,Qijun Zhao
Main category: cs.CV
TL;DR: The paper proposes Vcamba, a novel method for video camouflaged object detection that combines spatial and frequency features using advanced modules, resulting in improved accuracy and efficiency.
Details
Motivation: Existing VCOD methods struggle with high similarity between foreground and background, limiting the discriminability of spatial appearance features. Frequency features and the Mamba state space model provide opportunities to enhance feature representation and motion perception. Method: Vcamba integrates spatial and frequency features using a receptive field visual state space (RFVSS) module, adaptive frequency component enhancement (AFE) module, space-based long-range motion perception (SLMP), frequency-based long-range motion perception (FLMP), and a space and frequency motion fusion module (SFMF). Result: Vcamba achieves better performance across 6 evaluation metrics on 2 datasets compared to existing methods. Conclusion: The proposed Vcamba model outperforms state-of-the-art methods in VCOD tasks while maintaining lower computational costs, showcasing its superiority and efficiency. Abstract: Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model called Mamba, enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. 
Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: https://github.com/BoydeLi/Vcamba.
[142] Medical Image De-Identification Benchmark Challenge
Linmin Pei,Granger Sutton,Michael Rutherford,Ulrike Wagner,Tracy Nolan,Kirk Smith,Phillip Farmer,Peter Gu,Ambar Rana,Kailing Chen,Thomas Ferleman,Brian Park,Ye Wu,Jordan Kojouharov,Gargi Singh,Jon Lemon,Tyler Willis,Milos Vukadinovic,Grant Duffy,Bryan He,David Ouyang,Marco Pereanez,Daniel Samber,Derek A. Smith,Christopher Cannistraci,Zahi Fayad,David S. Mendelson,Michele Bufano,Elmar Kotter,Hamideh Haghiri,Rajesh Baidya,Stefan Dvoretskii,Klaus H. Maier-Hein,Marco Nolden,Christopher Ablett,Silvia Siggillino,Sandeep Kaushik,Hongzhu Jiang,Sihan Xie,Zhiyu Wan,Alex Michie,Simon J Doran,Angeline Aurelia Waly,Felix A. Nathaniel Liang,Humam Arshad Mustagfirin,Michelle Grace Felicia,Kuo Po Chih,Rahul Krish,Ghulam Rasool,Nidhal Bouaynaya,Nikolas Koutsoubis,Kyle Naddeo,Kartik Pandit,Tony O'Sullivan,Raj Krish,Qinyan Pan,Scott Gustafson,Benjamin Kopchick,Laura Opsahl-Ong,Andrea Olvera-Morales,Jonathan Pinney,Kathryn Johnson,Theresa Do,Juergen Klenk,Maria Diaz,Arti Singh,Rong Chai,David A. Clunie,Fred Prior,Keyvan Farahani
Main category: cs.CV
TL;DR: The MIDI-B Challenge provided a benchmarking platform for DICOM image deID tools, achieving high accuracy while preserving critical metadata for AI development in biomedical research.
Details
Motivation: The motivation was to create a standardized benchmarking platform for DICOM image deID tools that conforms to privacy regulations and preserves research-critical metadata, supporting the development of imaging AI in biomedical research. Method: The MIDI-B Challenge involved three phases (training, validation, and test) using a large, diverse dataset of real de-identified radiology images with synthetic PHI/PII inserted. Participants employed various tools and techniques, and scores were calculated based on the percentage of correct actions taken. Result: Ten teams successfully completed the test phase, with scores ranging from 97.91% to 99.93%, indicating high accuracy in de-identification tasks. Participants used a variety of methods, including open-source and proprietary tools, large language models, and optical character recognition (OCR). Conclusion: The MIDI-B Challenge successfully provided a standardized platform for benchmarking DICOM image deID tools, demonstrating that a rule-based approach can achieve high accuracy in de-identification while preserving critical metadata. Abstract: The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). 
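The scoring rule summarized above (the percentage of correct actions out of all required actions) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the action labels (`remove`, `shift`, `keep`, `replace`), and the function name `deid_score` are hypothetical and are not taken from the challenge's actual evaluation code.

```python
def deid_score(required_actions, performed_actions):
    """Return the percentage of required actions that were performed correctly.

    Both arguments map an attribute name to an action label. The labels
    are hypothetical placeholders for whatever the answer key specifies.
    """
    if not required_actions:
        raise ValueError("at least one required action is needed")
    correct = sum(
        1
        for attribute, expected in required_actions.items()
        if performed_actions.get(attribute) == expected
    )
    return 100.0 * correct / len(required_actions)


# Hypothetical answer key and one tool's output for a single image.
required = {
    "PatientName": "remove",
    "PatientID": "replace",
    "StudyDate": "shift",
    "Modality": "keep",
}
performed = {
    "PatientName": "remove",
    "PatientID": "replace",
    "StudyDate": "keep",  # wrong: the date should have been shifted
    "Modality": "keep",
}
print(deid_score(required, performed))  # 75.0
```

Counting an action as correct only on an exact match means that both leaving PHI in place and over-removing research-critical metadata lower the score in the same way.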
The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge's design, implementation, results, and lessons learned.
[143] Consistent Point Matching
Halid Ziya Yerebakan,Gerardo Hermosillo Valadez
Main category: cs.CV
TL;DR: The study proposes a point-matching algorithm with a consistency heuristic that enhances robustness in matching anatomical locations across medical images, achieving superior results on datasets like Deep Lesion Tracking while maintaining efficiency on standard CPU hardware.
Details
Motivation: To improve the robustness of matching anatomical locations across medical images and enable high-precision navigation without requiring a machine learning model or training data. Method: The study incorporates a consistency heuristic into the point-matching algorithm and validates it on longitudinal internal and public datasets spanning CT and MRI modalities. Result: The approach surpasses state-of-the-art results on the Deep Lesion Tracking dataset and addresses landmark localization effectively while running efficiently on standard CPU hardware. Conclusion: Incorporating a consistency heuristic into the point-matching algorithm improves robustness in matching anatomical locations across pairs of medical images. Abstract: This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.
[144] DivControl: Knowledge Diversion for Controllable Image Generation
Yucheng Xie,Fu Feng,Ruixiao Shi,Jing Wang,Yong Rui,Xin Geng
Main category: cs.CV
TL;DR: DivControl is a novel framework for image generation that efficiently adapts to new conditions by factorizing ControlNet and implementing knowledge diversion, resulting in reduced training costs and improved performance on both known and unseen scenarios.
Details
Motivation: Current methods for text-to-image and image-to-image generation either require separate models for each condition or use unified architectures with entangled representations, leading to poor generalization and high adaptation costs. This motivates a solution like DivControl that enables unified controllable generation and efficient adaptation. Method: DivControl factorizes ControlNet using SVD into basic components, applies knowledge diversion through a dynamic gate for soft routing of tailors, and introduces a representation alignment loss to align condition embeddings with diffusion features. Result: DivControl achieves state-of-the-art controllability with 36.4× less training cost, improves average performance on basic conditions, and demonstrates strong zero-shot and few-shot performance on unseen conditions. Conclusion: DivControl offers a more efficient and adaptable method for image generation by factorizing ControlNet through SVD and implementing knowledge diversion with a dynamic gate. It significantly reduces training costs while achieving superior performance on both basic and unseen conditions. Abstract: Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. 
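To make the SVD factorization mentioned above concrete, the NumPy sketch below splits a weight matrix into rank-one components (a singular value with its pair of singular vectors) and recombines them with per-component gate values. It is a minimal sketch under stated assumptions: the function names, the toy matrix, and the fixed gate values are hypothetical, and the learned split into learngenes and tailors is not modeled here.

```python
import numpy as np

def svd_components(weight):
    # Factor the matrix into rank-one pieces: sigma_i * outer(u_i, v_i).
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return [(s[i], u[:, i], vt[i, :]) for i in range(len(s))]

def recombine(components, gates, shape):
    # Weighted sum of the rank-one pieces. A gate of 1 keeps a component,
    # 0 drops it, and values in between give soft routing.
    out = np.zeros(shape)
    for gate, (sigma, u_i, v_i) in zip(gates, components):
        out += gate * sigma * np.outer(u_i, v_i)
    return out

rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 6))  # toy stand-in for one layer's weights
components = svd_components(weight)

# With all gates open, the original matrix is recovered.
full = recombine(components, np.ones(len(components)), weight.shape)
print(np.allclose(full, weight))
```

Because each component is an independent rank-one term, some components can be shared across conditions while others are switched per condition, which is the property a gate over components relies on.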
Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4$\times$ less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.[145] Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation
Dustin Carrión-Ojeda,Stefan Roth,Simone Schaub-Meyer
Main category: cs.CV
TL;DR: Efficient Masked Attention Transformer (EMAT) improves few-shot classification and segmentation, especially for small objects, with fewer parameters and better use of annotations.
Details
Motivation: Current state-of-the-art methods struggle with small objects and inefficiently discard available annotations. Method: Proposed EMAT with a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. Result: EMAT outperforms all FS-CS methods on the PASCAL-5i and COCO-20i datasets with fewer trainable parameters. Conclusion: EMAT is a more effective and parameter-efficient method for FS-CS tasks, particularly for small objects. Abstract: Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.
[146] FFGAF-SNN: The Forward-Forward Based Gradient Approximation Free Training Framework for Spiking Neural Networks
Changqing Xu,Ziqiang Yang,Yi Liu,Xinfang Liao,Guiqi Mo,Hao Zeng,Yintang Yang
Main category: cs.CV
TL;DR: This paper proposes a gradient approximation-free training framework for Spiking Neural Networks using a Forward-Forward approach and a class-aware complexity adaptation mechanism, achieving high accuracy and efficiency improvements on multiple datasets.
Details
Motivation: The motivation is to overcome the challenges of training Spiking Neural Networks due to their non-differentiability and high computational requirements, particularly for edge devices. Method: The method involves a Forward-Forward-based framework that treats spiking activations as black-box modules, eliminating the need for gradient approximation, and introduces a class-aware complexity adaptation mechanism to optimize the loss function. Result: The experimental results show that the proposed framework achieves high test accuracies on MNIST, Fashion-MNIST, and CIFAR-10 datasets, surpassing existing Forward-Forward-based SNN approaches. Conclusion: The proposed Forward-Forward based training framework provides a gradient approximation-free solution for Spiking Neural Networks, offering improved efficiency in terms of memory access and computational power consumption. Abstract: Spiking Neural Networks (SNNs) offer a biologically plausible framework for energy-efficient neuromorphic computing. However, training SNNs efficiently is challenging due to their non-differentiability. Existing gradient approximation approaches frequently sacrifice accuracy and face deployment limitations on edge devices due to the substantial computational requirements of backpropagation. To address these challenges, we propose a Forward-Forward (FF) based gradient approximation-free training framework for Spiking Neural Networks, which treats spiking activations as black-box modules, thereby eliminating the need for gradient approximation while significantly reducing computational complexity. Furthermore, we introduce a class-aware complexity adaptation mechanism that dynamically optimizes the loss function based on inter-class difficulty metrics, enabling efficient allocation of network resources across different categories.
Experimental results demonstrate that our proposed training framework achieves test accuracies of 99.58%, 92.13%, and 75.64% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively, surpassing all existing FF-based SNN approaches. Additionally, our proposed method exhibits significant advantages in terms of memory access and computational power consumption.
[147] Adaptively Distilled ControlNet: Accelerated Training and Superior Sampling for Medical Image Synthesis
Kunpeng Qiu,Zhiying Zhou,Yongxin Guo
Main category: cs.CV
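TL;DR: Proposes Adaptively Distilled ControlNet, a medical image generation framework that accelerates training and optimization through dual-model distillation and achieves state-of-the-art performance on two medical datasets.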
Details
Motivation: Medical image annotation is constrained by privacy concerns and labor-intensive labeling, and existing mask-controllable diffusion models struggle with precise lesion-mask alignment. Method: Proposes the task-agnostic Adaptively Distilled ControlNet framework: during training, a teacher model conditioned on mask-image pairs regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. Result: Comprehensive evaluations on two distinct medical datasets show state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps. Conclusion: Adaptively Distilled ControlNet is an effective solution for privacy-preserving medical image generation, with strong performance and broad application prospects. Abstract: Medical image annotation is constrained by privacy concerns and labor-intensive labeling, significantly limiting the performance and generalization of segmentation models. While mask-controllable diffusion models excel in synthesis, they struggle with precise lesion-mask alignment. We propose Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training, a teacher model, conditioned on mask-image pairs, regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps, highlighting its effectiveness and superiority. Code is available at GitHub.
[148] OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction
Yang Gao,Po-Chien Luan,Kaouther Messaoud,Lan Feng,Alexandre Alahi
Main category: cs.CV
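TL;DR: This study proposes OmniTraj, a new trajectory prediction model that addresses temporal generalization in zero-shot transfer by explicitly conditioning on temporal metadata.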
Details
Motivation: Large-scale pre-trained models often require fine-tuning when transferred to new datasets with different temporal dynamics, which limits their scalability and practical use. Method: Proposes OmniTraj, a Transformer-based model that addresses temporal generalization by explicitly conditioning on temporal metadata. Result: In challenging cross-setup scenarios, explicitly conditioning on the frame rate reduces prediction error by over 70%, and state-of-the-art results are achieved on four datasets. Conclusion: OmniTraj achieves zero-shot transfer across datasets and, after fine-tuning, reaches state-of-the-art results on multiple datasets. Abstract: While large-scale pre-training has advanced human trajectory prediction, a critical challenge remains: zero-shot transfer to unseen datasets with varying temporal dynamics. State-of-the-art pre-trained models often require fine-tuning to adapt to new datasets with different frame rates or observation horizons, limiting their scalability and practical utility. In this work, we systematically investigate this limitation and propose a robust solution. We first demonstrate that existing data-aware discrete models struggle when transferred to new scenarios with shifted temporal setups. We then isolate the temporal generalization from dataset shift, revealing that a simple, explicit conditioning mechanism for temporal metadata is a highly effective solution. Based on this insight, we present OmniTraj, a Transformer-based model pre-trained on a large-scale, heterogeneous dataset. Our experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance, reducing prediction error by over 70% in challenging cross-setup scenarios. After fine-tuning, OmniTraj achieves state-of-the-art results on four datasets, including NBA, JTA, WorldPose, and ETH-UCY. The code is publicly available: https://github.com/vita-epfl/omnitraj
[149] SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation
Alfie Roddan,Tobias Czempiel,Chi Xu,Daniel S. Elson,Stamatia Giannarou
Main category: cs.CV
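TL;DR: We propose SAMSA, an interactive segmentation framework that combines an RGB foundation model with spectral analysis to address the challenges that data limitations and hardware variations pose for hyperspectral imaging (HSI) in medical image segmentation.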
Details
Motivation: Hyperspectral imaging (HSI) provides rich spectral information for medical imaging but faces significant challenges from data limitations and hardware variations. A new method is needed to improve the accuracy and applicability of HSI in medical image segmentation. Method: SAMSA uses user clicks to guide both RGB segmentation and spectral similarity computation, addressing key limitations of HSI segmentation through a unique spectral feature fusion strategy. The method is independent of spectral band count and resolution. Result: Performance evaluation on publicly available datasets shows 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical dataset, and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Conclusion: Experimental results show that SAMSA is effective in few-shot and zero-shot learning scenarios and with minimal training examples. The approach seamlessly integrates datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis. Abstract: Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets has shown 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical dataset and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA's effectiveness in few-shot and zero-shot learning scenarios and using minimal training examples. Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
[150] I2V-GS: Infrastructure-to-Vehicle View Transformation with Gaussian Splatting for Autonomous Driving Data Generation
Jialei Chen,Wuhao Xu,Sipeng He,Baoru Huang,Dongchun Ren
Main category: cs.CV
TL;DR: This paper proposes I2V-GS, a novel method for generating autonomous driving datasets by transforming infrastructure views to vehicle views using Gaussian Splatting, achieving superior performance over existing techniques.
Details
Motivation: The motivation is to overcome the high cost and inefficiency of collecting driving data with vehicles by synthesizing data from real-world images, leveraging recent advancements in 3D reconstruction for photorealistic view synthesis. Method: The paper proposes I2V-GS, a method that transfers infrastructure views to vehicle views using Gaussian Splatting. It uses adaptive depth warp to generate dense training views, employs a cascade strategy to inpaint warped images, and applies cross-view information for confidence-guided optimization. Result: The experimental results show that I2V-GS outperforms StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively, demonstrating its effectiveness in generating high-quality autonomous driving datasets. Conclusion: I2V-GS is the first framework to generate autonomous driving datasets using infrastructure-vehicle view transformation, significantly improving synthesis quality under vehicle view compared to existing methods like StreetGaussian. Abstract: Vast and high-quality data are essential for end-to-end autonomous driving systems. However, current driving data is mainly collected by vehicles, which is expensive and inefficient. A potential solution lies in synthesizing data from real-world images. Recent advancements in 3D reconstruction demonstrate photorealistic novel view synthesis, highlighting the potential of generating driving data from images captured on the road. This paper introduces a novel method, I2V-GS, to transfer the Infrastructure view To the Vehicle view with Gaussian Splatting. Reconstruction from sparse infrastructure viewpoints and rendering under large view transformations is a challenging problem. We adopt the adaptive depth warp to generate dense training views. To further expand the range of views, we employ a cascade strategy to inpaint warped images, which also ensures inpainting content is consistent across views.
To further ensure the reliability of the diffusion model, we utilize the cross-view information to perform a confidence-guided optimization. Moreover, we introduce RoadSight, a multi-modality, multi-view dataset from real scenarios in infrastructure views. To our knowledge, I2V-GS is the first framework to generate autonomous driving datasets with infrastructure-vehicle view transformation. Experimental results demonstrate that I2V-GS significantly improves synthesis quality under vehicle view, outperforming StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively.
[151] UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration
Zihan Cheng,Liangtai Zhou,Dian Chen,Ni Tang,Xiaotong Luo,Yanyun Qu
Main category: cs.CV
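TL;DR: This paper proposes a new unified image restoration framework that combines latent diffusion models with dedicated modules to handle diverse degradations effectively, with strong experimental results.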
Details
Motivation: Addressing the core challenges of All-in-One image restoration requires a unified framework that can handle diverse degradations. Method: Designs a degradation-aware feature fusion module and a detail-aware expert module to adapt to different degradation types and enhance texture and fine-structure recovery. Result: Experiments show that the method achieves state-of-the-art performance across multiple tasks and under mixed degradation conditions. Conclusion: The paper proposes a novel unified image restoration framework based on latent diffusion models that structurally integrates low-quality visual priors into the diffusion process, handling diverse degradations and demonstrating state-of-the-art performance in multi-task and mixed degradation settings. Abstract: All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a novel unified image restoration framework based on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the powerful generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. Furthermore, to mitigate detail loss caused by the high compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.
[152] Enhanced Velocity Field Modeling for Gaussian Video Reconstruction
Zhenyang Li,Xiaoyang Bai,Tongchen Zhang,Pengfei Shen,Weiwei Xu,Yifan Peng
Main category: cs.CV
TL;DR: This paper proposes FlowGaussian-VR, a novel method for 3D video reconstruction that improves visual quality and trajectory tracking in dynamic scenes for VR/AR applications.
Details
Motivation: High-fidelity 3D video reconstruction is crucial for VR/AR applications, but existing deformation networks struggle with complex motion and scale variations, resulting in suboptimal visual quality and inadequate densification. Method: The paper introduces a flow-empowered velocity field modeling scheme called FlowGaussian-VR, which includes a velocity field rendering (VFR) pipeline and a flow-assisted adaptive densification (FAD) strategy. Result: FlowGaussian-VR achieves notable visual improvements, including over 2.5 dB gain in PSNR and reduced blurring in dynamic textures, while ensuring regularized and trackable Gaussian trajectories. Conclusion: The proposed FlowGaussian-VR method effectively addresses challenges in Gaussian video reconstruction, particularly in handling complex motion and scale variations, leading to improved visual quality and trackable Gaussian trajectories. Abstract: High-fidelity 3D video reconstruction is essential for enabling real-time rendering of dynamic scenes with realistic motion in virtual and augmented reality (VR/AR). The deformation field paradigm of 3D Gaussian splatting has achieved near-photorealistic results in video reconstruction due to the great representation capability of deep deformation networks. However, in videos with complex motion and significant scale variations, deformation networks often overfit to irregular Gaussian trajectories, leading to suboptimal visual quality. Moreover, the gradient-based densification strategy designed for static scene reconstruction proves inadequate to address the absence of dynamic content. In light of these challenges, we propose a flow-empowered velocity field modeling scheme tailored for Gaussian video reconstruction, dubbed FlowGaussian-VR. 
It consists of two core components: a velocity field rendering (VFR) pipeline, which enables optical flow-based optimization, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. We validate our model's effectiveness on multi-view dynamic reconstruction and novel view synthesis with multiple real-world datasets containing challenging motion scenarios, demonstrating not only notable visual improvements (over 2.5 dB gain in PSNR) and fewer blurry artifacts in dynamic textures, but also regularized and trackable per-Gaussian trajectories.
[153] Explainable Image Classification with Reduced Overconfidence for Tissue Characterisation
Alfie Roddan,Chi Xu,Serine Ajlouni,Irini Kakaletri,Patra Charalampaki,Stamatia Giannarou
Main category: cs.CV
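TL;DR: This paper proposes a new pixel attribution method that brings risk estimation into image classification explainability, improving the decision support that machine learning models provide for intraoperative tissue characterisation and tumour resection.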
Details
Motivation: Deep learning models for image classification can be overconfident in their predictions, which makes pixel attribution methods overconfident as well. A new method is therefore needed that improves the explainability of image classification models and estimates their risk. Method: The method iteratively applies a classification model with a pixel attribution method to create a volume of PA maps and, for the first time, uses this volume to generate a pixel-wise distribution of PA values. An enhanced PA map is generated by estimating the expectation of the pixel-wise distributions, and the coefficient of variation (CV) is used to estimate the pixel-wise risk of the enhanced PA map. Result: Performance evaluation on probe-based Confocal Laser Endomicroscopy (pCLE) data and ImageNet shows that the method outperforms the state of the art in explainability. Conclusion: The proposed pixel attribution method provides not only an improved PA map but also a risk estimate for the output PA values, improving the reliability of machine learning models in intraoperative decision making. Abstract: The deployment of Machine Learning models intraoperatively for tissue characterisation can assist decision making and guide safe tumour resections. For image classification models, pixel attribution methods are popular to infer explainability. However, overconfidence in a deep learning model's predictions translates to overconfidence in pixel attribution. In this paper, we propose the first approach which incorporates risk estimation into a pixel attribution method for improved image classification explainability. The proposed method iteratively applies a classification model with a pixel attribution method to create a volume of PA maps. This volume is used, for the first time, to generate a pixel-wise distribution of PA values. We introduce a method to generate an enhanced PA map by estimating the expectation values of the pixel-wise distributions. In addition, the coefficient of variation (CV) is used to estimate pixel-wise risk of this enhanced PA map. Hence, the proposed method not only provides an improved PA map but also produces an estimation of risk on the output PA values. Performance evaluation on probe-based Confocal Laser Endomicroscopy (pCLE) data and ImageNet verifies that our improved explainability method outperforms the state-of-the-art.
[154] DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching
Emery Pierson,Lei Li,Angela Dai,Maks Ovsjanikov
Main category: cs.CV
TL;DR: This paper proposes a fully data-driven approach for deep functional maps, replacing traditional axiomatic regularization and loss strategies with learned models, achieving better performance in non-rigid shape correspondence.
Details
Motivation: Current deep functional map methods rely on axiomatic models for regularization and loss formulation, limiting their accuracy and applicability. Method: A generative model of functional maps is trained in the spectral domain using score-based modeling, followed by a novel distillation strategy to apply this model to new shape collections. Result: The data-driven approach outperforms traditional axiomatic methods in zero-shot non-rigid shape matching and is category-agnostic. Conclusion: Deep functional maps can be enhanced with data-driven methods for better non-rigid shape correspondence, replacing traditional axiomatic approaches. Abstract: Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. 
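For readers unfamiliar with the axiomatic regularizers mentioned above, the NumPy sketch below computes the two penalties in their conventional form for a functional map C: orthogonality, ||C^T C - I||_F^2, and Laplacian commutativity, ||C Λ_src - Λ_tgt C||_F^2. These are the standard formulations from the functional maps literature, not code from the paper; the function names and toy eigenvalues are hypothetical.

```python
import numpy as np

def orthogonality_penalty(C):
    """Squared Frobenius norm of C^T C - I (zero when C has orthonormal columns)."""
    k = C.shape[1]
    return float(np.linalg.norm(C.T @ C - np.eye(k), "fro") ** 2)

def laplacian_commutativity_penalty(C, evals_src, evals_tgt):
    """Squared Frobenius norm of C diag(evals_src) - diag(evals_tgt) C.

    C has shape (k_tgt, k_src) and maps source spectral coefficients to
    target ones; evals_* are Laplace-Beltrami eigenvalues of each shape.
    """
    residual = C @ np.diag(evals_src) - np.diag(evals_tgt) @ C
    return float(np.linalg.norm(residual, "fro") ** 2)

# Toy example: identical spectra on both shapes (hypothetical values).
evals = np.array([0.0, 1.0, 2.5])
C_identity = np.eye(3)

# A rotation mixing the last two eigenfunctions: still orthogonal, but it
# no longer commutes with the Laplacian because their eigenvalues differ.
c, s = np.cos(0.3), np.sin(0.3)
C_mixed = np.array([[1.0, 0.0, 0.0],
                    [0.0,   c,  -s],
                    [0.0,   s,   c]])
```

The identity map incurs zero penalty under both terms, whereas the mixed map is penalized only by the commutativity term; a learned prior of the kind the abstract describes would replace hand-written penalties like these.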
Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
[155] RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
Dongming Wu,Yanping Fu,Saike Huang,Yingfei Liu,Fan Jia,Nian Liu,Feng Dai,Tiancai Wang,Rao Muhammad Anwer,Fahad Shahbaz Khan,Jianbing Shen
Main category: cs.CV
TL;DR: This paper introduces RAGNet, a large-scale benchmark for affordance segmentation, and AffordanceNet, a new framework that improves robotic grasping in open-world scenarios by leveraging vision-language models and affordance maps.
Details
Motivation: The motivation is to overcome the lack of reasoning-based, large-scale affordance prediction data in current robotic grasping studies, which limits their effectiveness in open-world scenarios. Method: The researchers created a large-scale benchmark called RAGNet, which includes diverse data and reasoning instructions. They also developed AffordanceNet, a framework combining a vision-language model (VLM) and a grasping network that uses affordance maps for target grasping. Result: The experiments showed that the AffordanceNet model has strong open-world generalization ability on affordance segmentation benchmarks and real-robot manipulation tasks. Conclusion: The study concludes that the proposed AffordanceNet framework, together with the RAGNet benchmark, significantly improves open-world generalization in robotic grasping tasks. Abstract: General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from the problem of lacking reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions an affordance map to grasp the target. 
Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.
[156] Slot Attention with Re-Initialization and Self-Distillation
Rongzhen Zhao,Yi Zhao,Juho Kannala,Joni Pajarinen
Main category: cs.CV
TL;DR: DIAS improves object-centric learning by reducing redundancy in aggregated slots and enabling self-distillation, leading to better performance in object discovery, recognition, and visual prediction.
Details
Motivation: To address the issues of redundant slots competing with informative ones and overlooking potential supervision based on internal information in OCL. Method: Slot Attention with re-Initialization and self-Distillation (DIAS). Result: Experiments demonstrate that DIAS achieves state-of-the-art performance on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Conclusion: DIAS achieves state-of-the-art performance on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): (i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; (ii) We drive the bad attention map at the first aggregation iteration to approximate the good one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art performance on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning.
Our code is available at https://github.com/Genera1Z/DIAS.
[157] SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting
Di Li,Jie Feng,Jiahao Chen,Weisheng Dong,Guanbin Li,Yuhui Zheng,Mingtao Feng,Guangming Shi
Main category: cs.CV
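TL;DR: This paper introduces the SeqAffordSplat benchmark and the SeqSplatNet framework to advance long-horizon 3D affordance understanding in 3D Gaussian Splatting environments.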
Details
Motivation: Current 3DGS-based methods are limited to single-object, single-step interactions and cannot meet the long-horizon, multi-object task requirements of complex real-world applications. Method: Proposes SeqSplatNet, an end-to-end framework that combines a large language model with a conditional decoder to generate 3D affordance masks, and introduces a pre-training strategy and a feature injection mechanism to handle complex geometry and semantic ambiguity. Result: Experiments show that the proposed method performs best on the new benchmark, advancing affordance reasoning from single-step interactions to sequential tasks in complex scenes. Conclusion: The work advances research on long-horizon affordance understanding in 3DGS environments and offers a solution for complex tasks. Abstract: 3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. SeqSplatNet employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. To handle complex scene geometry, we introduce a pre-training strategy, Conditional Geometric Reconstruction, where the model learns to reconstruct complete affordance region masks from known geometric observations, thereby building a robust geometric prior. Furthermore, to resolve semantic ambiguities, we design a feature injection mechanism that lifts rich semantic features from 2D Vision Foundation Models (VFM) and fuses them into the 3D decoder at multiple scales.
Extensive experiments demonstrate that our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
[158] Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions
Li Siyao,Yao Feng,Omid Tehari,Chen Change Loy,Michael J. Black
Main category: cs.CV
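TL;DR: The paper proposes a training-free "half-physics" method that turns the SMPL-X model into an entity capable of dynamic physical interaction with its environment while retaining precise control over body pose.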
Details
Motivation: Current general-purpose 3D human models such as SMPL-X are kinematic and therefore cannot physically interact with the environment, which leads to problems such as interpenetration and unrealistic object dynamics. The authors propose a new solution to address this limitation. Method: The method achieves physical interaction by transforming 3D kinematic motion into a physics simulation. It requires no training, applies to any body shape and motion, and runs in real time. Result: The method preserves the kinematic fidelity of SMPL-X while effectively eliminating penetration and producing more realistic object dynamics. It also needs no complex training process and operates in real time. Conclusion: The paper proposes a "half-physics" method that turns the SMPL-X model into an entity capable of dynamic physical interaction with its environment while maintaining kinematic control over body pose. Abstract: While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a "half-physics" mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions.
[159] Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Miaosen Zhang,Ziqiang Xu,Jialiang Zhu,Qi Dai,Kai Qiu,Yifan Yang,Chong Luo,Tianyi Chen,Justin Wagle,Tim Franklin,Baining Guo
Main category: cs.CV
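TL;DR: This paper discusses the development of grounding models for Computer Use Agents (CUAs), in particular the Phi-Ground model family, which achieves state-of-the-art performance on multiple benchmarks, and covers details of building these models and their relevance to other perception tasks.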
Details
Motivation: Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks such as ScreenSpot-pro and UI-Vision, indicating that they are far from ready for deployment. Research is therefore needed to improve their performance. Method: The authors conduct an empirical study spanning data collection through model training, resulting in the Phi-Ground model family. Result: In the end-to-end model setting, Phi-Ground achieves state-of-the-art scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. Conclusion: The paper presents the development of the Phi-Ground model family, which achieves state-of-the-art performance on all five grounding benchmarks in agent settings. The details discussed clarify how grounding models are built and also benefit other perception tasks. Abstract: With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/
[160] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
Zihan Wang,Jeff Tan,Tarasha Khurana,Neehar Peri,Deva Ramanan
Main category: cs.CV
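TL;DR: This paper proposes a dynamic scene reconstruction method for sparse-view videos. By aligning reconstructions for time and view consistency, it significantly improves reconstruction quality, especially for novel-view rendering, and suits low-cost capture setups.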
Details
Motivation: Existing dense multi-view reconstruction methods typically require captures from hundreds of calibrated cameras, which is expensive and hard to apply to in-the-wild scenes. The paper aims to reconstruct dynamic human behaviors from a small set of sparse-view cameras (e.g. four equidistant inward-facing static cameras), addressing deployment limits in real-world settings. Method: The method carefully aligns independent monocular reconstructions from each camera to produce time- and view-consistent dynamic scene reconstructions, coping with the limited overlap between viewpoints in the sparse-view setup. Result: Extensive experiments on PanopticStudio and Ego-Exo4D show that the method achieves higher reconstruction quality than prior art in the sparse-view setting, particularly when rendering novel views. Conclusion: The paper presents a dynamic scene reconstruction method for sparse-view videos that aligns independent monocular reconstructions for time and view consistency, achieving higher reconstruction quality than existing methods, particularly for novel-view rendering. Abstract: We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.
[161] SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions
Jessica Bader,Leander Girrbach,Stephan Alaniz,Zeynep Akata
Main category: cs.CV
TL;DR: This paper introduces SUB, a new benchmark for evaluating interpretable AI models like CBMs under concept variations, highlighting their struggles with distribution shifts and proposing a synthetic dataset for robustness testing.
Details
Motivation: The motivation is to assess the robustness of Concept Bottleneck Models (CBMs) and other interpretable AI models to concept variations, particularly under distribution shifts, aiming to improve transparency in critical fields like medicine. Method: The authors create a benchmark called SUB using synthetic images based on the CUB dataset, employing a novel Tied Diffusion Guidance method for precise image control. Result: The result is the creation of the SUB benchmark, containing 38,400 synthetic images, enabling rigorous evaluation of CBMs and similar interpretable models. Conclusion: The paper concludes that CBMs struggle to identify correct concepts under distribution shifts and proposes SUB to evaluate and develop more robust interpretable models. Abstract: Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. 
Our code is available at https://github.com/ExplainableML/sub and the dataset at http://huggingface.co/datasets/Jessica-bader/SUB.
[162] Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
Bowen Zhang,Sicheng Xu,Chuxin Wang,Jiaolong Yang,Feng Zhao,Dong Chen,Baining Guo
Main category: cs.CV
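TL;DR: This paper proposes a new video-to-4D generation framework. By introducing a Direct 4DMesh-to-GS Variation Field VAE and a temporal-aware diffusion model, it addresses the challenges of 4D diffusion modeling and generates high-quality dynamic 3D content.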
Details
Motivation: Direct 4D diffusion modeling is extremely challenging because data construction is costly and jointly representing 3D shape, appearance, and motion is high-dimensional. The authors propose a new approach to address these challenges. Method: The paper introduces a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data, and trains a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer. Result: Trained on the Objaverse dataset, the method shows better generation quality than existing methods and generalizes well to real-world video inputs despite being trained only on synthetic data. Conclusion: The paper presents a new video-to-4D generation framework that creates high-quality dynamic 3D content from a single video input, and demonstrates its advantages in generation quality and generalization. Abstract: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/
eess.IV [Back]
[163] Rethink Domain Generalization in Heterogeneous Sequence MRI Segmentation
Zheyuan Zhang,Linkai Peng,Wanying Dou,Cuiling Sun,Halil Ertugrul Aktas,Andrea M. Bejar,Elif Keles,Gorkem Durak,Ulas Bagci
Main category: eess.IV
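TL;DR: PancreasDG is a large-scale multi-center 3D MRI pancreas segmentation dataset for studying domain generalization in medical imaging. The paper also proposes a semi-supervised method that leverages anatomical invariances and outperforms existing techniques.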
Details
Motivation: Existing domain-generalization benchmarks focus mainly on cross-center shifts and overlook the variation between T1 and T2 sequences, which is the dominant source of variability. In addition, the pancreas is systematically under-represented in public cross-domain benchmarks despite its clinical importance in early cancer detection, surgery, and diabetes research. Method: PancreasDG comprises 563 MRI scans from six institutions, covering venous phase and out-of-phase sequences. The authors also propose a semi-supervised method that leverages anatomical invariances. Result: The proposed semi-supervised method significantly outperforms state-of-the-art domain generalization techniques, with Dice score improvements of 61.63% and 87.00% on two test centers. Conclusion: PancreasDG sets a new benchmark for domain generalization in medical imaging, and the proposed semi-supervised method leveraging anatomical invariances significantly outperforms state-of-the-art domain generalization techniques. Abstract: Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost exclusively on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregularly shaped, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve >90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions. We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with Dice score improvements of 61.63% and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at https://pancreasdg.netlify.app.
[164] Towards High-Resolution Alignment and Super-Resolution of Multi-Sensor Satellite Imagery
Philip Wootaek Shin,Vishal Gaur,Rahul Ramachandran,Manil Maskey,Jack Sampson,Vijaykrishnan Narayanan,Sujit Roy
Main category: eess.IV
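TL;DR: This paper proposes a super-resolution framework that fuses heterogeneous satellite data, effectively improving image quality and narrowing the resolution gap between sensors.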